SnappyData: Apache Spark Meets Embedded In-Memory Database
Masaki Yamakawa, UL Systems, Inc.
About me
Masaki Yamakawa, UL Systems, Inc., Managing Consultant
{
  "Sector": "Financial",
  "Skills": ["Distributed system", "In-memory computing"],
  "Hobbies": "Marathon running"
}
Agenda
1. Current Issues of Real-Time Analytics Solutions
2. SnappyData Features
3. Our SnappyData Case Study
PART 1: Current Issues of Real-Time Analytics Solutions
Are you satisfied with real-time analytics solutions?
– Complex
– Slow
– Bad performance
– Loading data into memory required
– Difficulty with updates
What are common demands for a data processing platform? Transactions, analytics, and streaming. Each has traditionally been served by a different class of product:
– Transactions and analytics: RDBMS and DWH (traditional data processing)
– Big data processing: NoSQL and SQL on Hadoop
– Streaming: stream data processing engines
A system tends to become complex when it integrates multiple products:
– Enterprise systems feed an RDBMS/DWH through ETL processing; the RDBMS/DWH stores enterprise data for BI/analytics tools that handle data visualization and analysis
– Web/B2C services and similar applications use separate products to store and process big data
– IoT devices and sensors feed stream data processing, which stores streaming data, processes real-time data, and drives real-time applications with notifications and alerts
The same multi-product architecture brings problems: increased TCO, inefficiency, long lead times before data can be analyzed, and difficulty maintaining data consistency across the ETL pipelines between the enterprise RDBMS/DWH, the big data stores, and the stream processing layer.
Although the architecture became quite a bit simpler after Spark was released: enterprise systems, Web/B2C services, and IoT/sensor streams can all flow through Spark (or Spark plus a separate store) to BI/analytics tools, real-time applications, and notifications/alerts. But is it simple enough…?
SnappyData can build simpler real-time analytics solutions! A single SnappyData cluster stores enterprise data, stores and processes big data, and processes real-time data, directly serving BI/analytics tools (data visualization and analysis), Web/B2C services, and real-time applications (notifications and alerts).
PART 2: SnappyData Features
SnappyData is the Spark Database for Spark users
Apache Spark + Distributed In-Memory DB + Own Features
– Apache Spark: a distributed computing framework for batch processing, analytics, and stream processing
– Distributed in-memory database: row database, columnar database, transactions
– SnappyData's own features, such as the Synopsis Data Engine
What is SnappyData's core component?
• Seamless integration of Spark and in-memory database components:
– From Spark: Spark Core, Spark SQL, Catalyst, and micro-batch streaming
– From GemFire XD: the in-memory database with transactions, an OLTP/OLAP query engine, row tables, column tables, and indexes; P2P cluster management with replication/partitioning; persistence to a distributed file system (HDFS)
– SnappyData's additional features: continuous queries, the Synopsis Data Engine, stream tables, and sample/TopK tables
Keys to Spark program acceleration:
1. In-memory database
2. In-memory data format
3. Unified cluster
4. Optimized SparkSQL
Key#1: Data exists in an in-memory database
– In case of Spark: the Spark program runs in memory, but the data lives on HDFS, on disk
– In case of SnappyData: the Spark program and the data both live in memory, the data in a distributed in-memory database
Key#1: Data access code example

In case of Spark:

    // load data from HDFS
    val df = spark.sqlContext.read.
      format("com.databricks.spark.csv").
      option("header", "true").
      load("hdfs://...")
    df.createOrReplaceTempView("SparkTable")

    // create new DataFrame using SparkSQL
    val filteredDf = spark.sql("SELECT * FROM SparkTable WHERE ...")
    val newDf = filteredDf. ....

    // save processing results
    newDf.write.
      format("com.databricks.spark.csv").
      option("header", "false").
      save("hdfs://...")

In case of SnappyData (no need to load data):

    // create SnappySession from SparkContext
    val snappy = new org.apache.spark.sql.SnappySession(spark.sparkContext)

    // create new DataFrame using SparkSQL
    val filteredDf = snappy.sql("SELECT * FROM SnappyTable WHERE ...")
    val newDf = filteredDf. ....

    // save processing results
    newDf.write.insertInto("NewSnappyTable")
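Note the design choice visible in the code: SnappySession is created from the existing SparkContext, so one program can mix plain Spark DataFrames with SnappyData tables, and the familiar sql/write APIs carry over unchanged; only the load step disappears.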
Key#2: SnappyData uses the same data format as Spark's
– In case of Spark: reading and writing data in HDFS or other storage (e.g., CSV files) requires serialization/deserialization and O/R mapping between the storage format and Spark DataFrames
– In case of SnappyData: DataFrames read from and write to the GemFire XD in-memory database directly, with no serialization/deserialization and no O/R mapping
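A minimal sketch of what this buys you in code (the table name is hypothetical; assumes the SnappySession from the previous slide):

    // a SnappyData table comes back as an ordinary DataFrame:
    // no load/parse step, no O/R mapping, no format conversion
    val df = snappy.table("SnappyTable")
    df.printSchema()  // schema comes straight from the table definition
    df.count()        // operates on the in-memory data directly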
Key#3: Spark and GemFire XD clusters can be integrated

Unified cluster mode: the SnappyData Leader runs the Spark Driver (Spark Context), and each SnappyData DataServer runs a Spark Executor (with its DataFrames) and the in-memory database together in a single JVM. A SnappyData Locator handles membership, so Spark and GemFire XD operate as one cluster.
Key#3: Another cluster mode (for your reference)

Split cluster mode: the Spark cluster (the SnappyData Leader acting as Spark Driver, plus Spark Executors each in its own JVM) runs separately from the SnappyData cluster (SnappyData DataServers hosting the GemFire XD in-memory database, plus a Locator, each in its own JVM).
Key#4: SparkSQL acceleration

    SELECT A.CardNumber, SUM(A.TxAmount)
    FROM CreditCardTx1 A, CreditCardComm B
    WHERE A.CardNumber = B.CardNumber
      AND A.TxAmount + B.Comm < 1000
    GROUP BY A.CardNumber
    ORDER BY A.CardNumber

– In case of Spark: the physical plan uses SortMergeJoin and HashAggregate, followed by a Sort
– In case of SnappyData: a unique DAG is generated with SnappyHashJoin and SnappyHashAggregate, producing less shuffle and faster execution; SnappyData accelerates processing by taking over part of SparkSQL's workload
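A sketch of how you could observe this yourself, assuming the two tables above exist as SnappyData tables and the SnappySession from Part 2:

    // run the slide's query through the SnappySession
    val result = snappy.sql("""
      SELECT A.CardNumber, SUM(A.TxAmount)
      FROM CreditCardTx1 A, CreditCardComm B
      WHERE A.CardNumber = B.CardNumber AND A.TxAmount + B.Comm < 1000
      GROUP BY A.CardNumber
      ORDER BY A.CardNumber""")

    // print the physical plan: on SnappyData it shows SnappyHashJoin /
    // SnappyHashAggregate where plain Spark shows SortMergeJoin / HashAggregate
    result.explain()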
PART 3: Our SnappyData Case Study: How to Use SnappyData
Example of use: production plan simulation system
– Messaging middleware delivers three streams: production results, machine sensor data, and BOM (bill of materials)
– Each stream is inserted into its own table in the in-memory database: a production results table, a machine sensor table, and a BOM table
– An application writes simulation parameters into a simulation parameters table
– A BI tool queries the in-memory database, and real-time notifications are pushed out
Architecture with SnappyData
• Use SnappyData to realize all data processing: A) stream processing, B) transactions, and C) analytics
• The key is that SnappyData includes an in-memory database and all three workloads can be expressed in SQL: the messaging middleware feeds A) stream processing, the application runs B) transactions, and C) analytics runs over the same in-memory database
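To make B) and C) concrete, here is a minimal sketch with hypothetical table and column names modeled on the diagram: row tables serve frequently updated transactional data, column tables serve scan-heavy analytical data, and both are driven through the same SnappySession with SQL.

    // assuming the SnappySession `snappy` from Part 2

    // B) Transaction: a partitioned row table for simulation parameters
    snappy.sql("""CREATE TABLE SimulationParameters (
        ParamId long, ParamValue double)
      USING row OPTIONS (PARTITION_BY 'ParamId')""")
    snappy.sql("UPDATE SimulationParameters SET ParamValue = 0.85 WHERE ParamId = 1")

    // C) Analytics: a column table for production results
    snappy.sql("""CREATE TABLE ProductionResults (
        MachineNo int, Quantity long, ProducedAt timestamp)
      USING column""")
    val summary = snappy.sql(
      "SELECT MachineNo, SUM(Quantity) FROM ProductionResults GROUP BY MachineNo")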
A) Stream data processing
• The stream data is inserted into a table
• Stream data processing can be executed with SQL (the difference from plain Spark)
The messaging middleware feeds SnappyData's stream processing, which writes into the in-memory database used by the application.
SnappyData implements stream data processing using SQL

Stream table (MachineSensorStream):

    SensorId | VIN  | MachineNo | Point | Value    | Timestamp
    ---------+------+-----------+-------+----------+--------------------
    1        | 11AA | 111       | 1     |  28.0760 | 2017/11/05 10:10:01
    2        | 22BB | 222       | 37    |  60.069  | 2017/11/05 10:10:20
    3        | 11AA | 111       | 2     |  37.528  | 2017/11/05 10:10:21
    4        | 33CC | 333       | 25    |   1.740  | 2017/11/05 10:11:05
    5        | 11AA | 111       | 3     |  88.654  | 2017/11/05 10:11:15
    6        | 11AA | 111       | 4     | 394.390  | 2017/11/05 10:11:16

Process (continuous query):

    SELECT *
    FROM MachineSensorStream
    WINDOW (DURATION 10 SECONDS, SLIDE 2 SECONDS)
    WHERE Point=1;
Only the stream data source info needs to be specified in the table definition:

    CREATE STREAM TABLE MachineSensorStream (
      SensorId long,
      VIN string,
      MachineNo int,
      Point long,
      Value double,
      Timestamp timestamp)
    USING KAFKA_STREAM                                     -- streaming data source
    OPTIONS (
      storagelevel 'MEMORY_AND_DISK_SER_2',                -- storage level (Spark setting)
      rowConverter 'uls.snappy.KafkaToRowsConverter',      -- stream data row converter class
      kafkaParams 'zookeeper.connect->localhost:2181;xx',  -- settings for each streaming data source
      topics 'MachineSensorStream');

Streaming data sources other than Kafka:
– TWITTER_STREAM
– DIRECTKAFKA_STREAM
– RABBITMQ_STREAM
– SOCKET_STREAM
– FILE_STREAM
Implement a StreamToRowsConverter to convert messages into the table format:

    class KafkaToRowsConverter extends StreamToRowsConverter with Serializable {
      override def toRows(message: Any): Seq[Row] = {
        val sensor: MachineSensorStream = message.asInstanceOf[MachineSensorStream]
        // data for one row of the stream table
        Seq(Row.fromSeq(Seq(sensor.getSensorId,
                            sensor.getVin,
                            sensor.getMachineNo,
                            sensor.getPoint,
                            sensor.getValue,
                            sensor.getTimestamp)))
      }
    }

This is the class referenced by the rowConverter option in the previous slide's DDL.
Stream data processing using SQL

    SELECT *
    FROM MachineSensorStream
    WINDOW (DURATION 10 SECONDS, SLIDE 2 SECONDS)
    WHERE Point=1;

This continuous query acquires the rows with Point = 1 from a 10-second window (DURATION 10 SECONDS) that slides forward every 2 seconds (SLIDE 2 SECONDS).
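A sketch of how such a continuous query can be registered from a Spark program, assuming SnappyData's SnappyStreamingContext/registerCQ API; the sink table name is hypothetical:

    import org.apache.spark.streaming.{Seconds, SnappyStreamingContext}

    // one streaming context per application; 2-second batch interval
    val snsc = new SnappyStreamingContext(spark.sparkContext, Seconds(2))

    // register the continuous query from the slide; results arrive as a stream of DataFrames
    val resultStream = snsc.registerCQ("""
      SELECT * FROM MachineSensorStream
      WINDOW (DURATION 10 SECONDS, SLIDE 2 SECONDS)
      WHERE Point = 1""")

    // e.g., persist each window's result into a (hypothetical) history table
    resultStream.foreachDataFrame(df => df.write.insertInto("MachineSensorHistory"))

    snsc.start()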