SnappyData: Apache Spark Meets Embedded In-Memory Database
Masaki Yamakawa, UL Systems, Inc.
About me
Masaki Yamakawa, UL Systems, Inc., Managing Consultant
{
  "Sector": "Financial",
  "Skills": ["Distributed system", "In-memory computing"],
  "Hobbies": "Marathon running"
}
Agenda
1. Current Issues of Real-Time Analytics Solutions
2. SnappyData Features
3. Our SnappyData Case Study
PART 1: Current Issues of Real-Time Analytics Solutions
Are you satisfied with real-time analytics solutions?
– Complex
– Slow
– Bad performance
– Loading data into memory required
– Difficulty with updates
What are common demands for a data processing platform? Transactions, analytics, and streaming. Each has traditionally been served by a different class of product:
– Transactions and analytics: RDBMS and DWH (traditional data processing)
– Big data processing: NoSQL and SQL on Hadoop
– Streaming: stream data processing engines
A system tends to become complex when it integrates multiple products:
– Enterprise systems feed an RDBMS/DWH through ETL processing; the RDBMS/DWH stores enterprise data for BI/analytics tools that handle data visualization and analysis
– Web/B2C services and similar applications use separate products to store and process big data
– IoT devices and sensors feed stream data processing, which stores streaming data, processes real-time data, and drives real-time applications with notifications and alerts
The same multi-product architecture brings problems: increased TCO, inefficiency, long lead times before data can be analyzed, and difficulty maintaining data consistency across the ETL pipelines between the enterprise RDBMS/DWH, the big data stores, and the stream processing layer.
Although the architecture became quite a bit simpler after Spark was released: enterprise systems, Web/B2C services, and IoT/sensor streams can all flow through Spark (or Spark plus a separate store) to BI/analytics tools, real-time applications, and notifications/alerts. But is it simple enough…?
SnappyData can build simpler real-time analytics solutions! A single SnappyData cluster stores enterprise data, stores and processes big data, and processes real-time data, directly serving BI/analytics tools (data visualization and analysis), Web/B2C services, and real-time applications (notifications and alerts).
PART 2: SnappyData Features
SnappyData is the Spark Database for Spark users
Apache Spark + Distributed In-Memory DB + Own Features
– Apache Spark: a distributed computing framework for batch processing, analytics, and stream processing
– Distributed in-memory database: row database, columnar database, transactions
– SnappyData's own features, such as the Synopsis Data Engine
What is SnappyData's core component?
• Seamless integration of Spark and in-memory database components:
– From Spark: Spark Core, Spark SQL, Catalyst, and micro-batch streaming
– From GemFire XD: the in-memory database with transactions, an OLTP/OLAP query engine, row tables, column tables, and indexes; P2P cluster management with replication/partitioning; persistence to a distributed file system (HDFS)
– SnappyData's additional features: continuous queries, the Synopsis Data Engine, stream tables, and sample/TopK tables
Keys to Spark program acceleration:
1. In-memory database
2. In-memory data format
3. Unified cluster
4. Optimized SparkSQL
Key#1: Data exists in an in-memory database
– In case of Spark: the Spark program runs in memory, but the data lives on HDFS, on disk
– In case of SnappyData: the Spark program and the data both live in memory, the data in a distributed in-memory database
Key#1: Data access code example

In case of Spark:

    // load data from HDFS
    val df = spark.sqlContext.read.
      format("com.databricks.spark.csv").
      option("header", "true").
      load("hdfs://...")
    df.createOrReplaceTempView("SparkTable")

    // create new DataFrame using SparkSQL
    val filteredDf = spark.sql("SELECT * FROM SparkTable WHERE ...")
    val newDf = filteredDf. ....

    // save processing results
    newDf.write.
      format("com.databricks.spark.csv").
      option("header", "false").
      save("hdfs://...")

In case of SnappyData (no need to load data):

    // create SnappySession from SparkContext
    val snappy = new org.apache.spark.sql.SnappySession(spark.sparkContext)

    // create new DataFrame using SparkSQL
    val filteredDf = snappy.sql("SELECT * FROM SnappyTable WHERE ...")
    val newDf = filteredDf. ....

    // save processing results
    newDf.write.insertInto("NewSnappyTable")
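Note the design choice visible in the code: SnappySession is created from the existing SparkContext, so one program can mix plain Spark DataFrames with SnappyData tables, and the familiar sql/write APIs carry over unchanged; only the load step disappears.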
Key#2: SnappyData uses the same data format as Spark's
– In case of Spark: reading and writing data in HDFS or other storage (e.g., CSV files) requires serialization/deserialization and O/R mapping between the storage format and Spark DataFrames
– In case of SnappyData: DataFrames read from and write to the GemFire XD in-memory database directly, with no serialization/deserialization and no O/R mapping
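A minimal sketch of what this buys you in code (the table name is hypothetical; assumes the SnappySession from the previous slide):

    // a SnappyData table comes back as an ordinary DataFrame:
    // no load/parse step, no O/R mapping, no format conversion
    val df = snappy.table("SnappyTable")
    df.printSchema()  // schema comes straight from the table definition
    df.count()        // operates on the in-memory data directly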
Key#3: Spark and GemFire XD clusters can be integrated

Unified cluster mode: the SnappyData Leader runs the Spark Driver (Spark Context), and each SnappyData DataServer runs a Spark Executor (with its DataFrames) and the in-memory database together in a single JVM. A SnappyData Locator handles membership, so Spark and GemFire XD operate as one cluster.
Key#3: Another cluster mode (for your reference)

Split cluster mode: the Spark cluster (the SnappyData Leader acting as Spark Driver, plus Spark Executors each in its own JVM) runs separately from the SnappyData cluster (SnappyData DataServers hosting the GemFire XD in-memory database, plus a Locator, each in its own JVM).
Key#4: SparkSQL acceleration

    SELECT A.CardNumber, SUM(A.TxAmount)
    FROM CreditCardTx1 A, CreditCardComm B
    WHERE A.CardNumber = B.CardNumber
      AND A.TxAmount + B.Comm < 1000
    GROUP BY A.CardNumber
    ORDER BY A.CardNumber

– In case of Spark: the physical plan uses SortMergeJoin and HashAggregate, followed by a Sort
– In case of SnappyData: a unique DAG is generated with SnappyHashJoin and SnappyHashAggregate, producing less shuffle and faster execution; SnappyData accelerates processing by taking over part of SparkSQL's workload
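A sketch of how you could observe this yourself, assuming the two tables above exist as SnappyData tables and the SnappySession from Part 2:

    // run the slide's query through the SnappySession
    val result = snappy.sql("""
      SELECT A.CardNumber, SUM(A.TxAmount)
      FROM CreditCardTx1 A, CreditCardComm B
      WHERE A.CardNumber = B.CardNumber AND A.TxAmount + B.Comm < 1000
      GROUP BY A.CardNumber
      ORDER BY A.CardNumber""")

    // print the physical plan: on SnappyData it shows SnappyHashJoin /
    // SnappyHashAggregate where plain Spark shows SortMergeJoin / HashAggregate
    result.explain()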
PART 3: Our SnappyData Case Study: How to Use SnappyData
Example of use: production plan simulation system
– Messaging middleware delivers three streams: production results, machine sensor data, and BOM (bill of materials)
– Each stream is inserted into its own table in the in-memory database: a production results table, a machine sensor table, and a BOM table
– An application writes simulation parameters into a simulation parameters table
– A BI tool queries the in-memory database, and real-time notifications are pushed out
Architecture with SnappyData
• Use SnappyData to realize all data processing: A) stream processing, B) transactions, and C) analytics
• The key is that SnappyData includes an in-memory database and all three workloads can be expressed in SQL: the messaging middleware feeds A) stream processing, the application runs B) transactions, and C) analytics runs over the same in-memory database
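To make B) and C) concrete, here is a minimal sketch with hypothetical table and column names modeled on the diagram: row tables serve frequently updated transactional data, column tables serve scan-heavy analytical data, and both are driven through the same SnappySession with SQL.

    // assuming the SnappySession `snappy` from Part 2

    // B) Transaction: a partitioned row table for simulation parameters
    snappy.sql("""CREATE TABLE SimulationParameters (
        ParamId long, ParamValue double)
      USING row OPTIONS (PARTITION_BY 'ParamId')""")
    snappy.sql("UPDATE SimulationParameters SET ParamValue = 0.85 WHERE ParamId = 1")

    // C) Analytics: a column table for production results
    snappy.sql("""CREATE TABLE ProductionResults (
        MachineNo int, Quantity long, ProducedAt timestamp)
      USING column""")
    val summary = snappy.sql(
      "SELECT MachineNo, SUM(Quantity) FROM ProductionResults GROUP BY MachineNo")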
A) Stream data processing
• The stream data is inserted into a table
• Stream data processing can be executed with SQL (the difference from plain Spark)
The messaging middleware feeds SnappyData's stream processing, which writes into the in-memory database used by the application.
SnappyData implements stream data processing using SQL

Stream table (MachineSensorStream):

    SensorId | VIN  | MachineNo | Point | Value    | Timestamp
    ---------+------+-----------+-------+----------+--------------------
    1        | 11AA | 111       | 1     |  28.0760 | 2017/11/05 10:10:01
    2        | 22BB | 222       | 37    |  60.069  | 2017/11/05 10:10:20
    3        | 11AA | 111       | 2     |  37.528  | 2017/11/05 10:10:21
    4        | 33CC | 333       | 25    |   1.740  | 2017/11/05 10:11:05
    5        | 11AA | 111       | 3     |  88.654  | 2017/11/05 10:11:15
    6        | 11AA | 111       | 4     | 394.390  | 2017/11/05 10:11:16

Process (continuous query):

    SELECT *
    FROM MachineSensorStream
    WINDOW (DURATION 10 SECONDS, SLIDE 2 SECONDS)
    WHERE Point=1;
Only the stream data source info needs to be specified in the table definition:

    CREATE STREAM TABLE MachineSensorStream (
      SensorId long,
      VIN string,
      MachineNo int,
      Point long,
      Value double,
      Timestamp timestamp)
    USING KAFKA_STREAM                                     -- streaming data source
    OPTIONS (
      storagelevel 'MEMORY_AND_DISK_SER_2',                -- storage level (Spark setting)
      rowConverter 'uls.snappy.KafkaToRowsConverter',      -- stream data row converter class
      kafkaParams 'zookeeper.connect->localhost:2181;xx',  -- settings for each streaming data source
      topics 'MachineSensorStream');

Streaming data sources other than Kafka:
– TWITTER_STREAM
– DIRECTKAFKA_STREAM
– RABBITMQ_STREAM
– SOCKET_STREAM
– FILE_STREAM
Implement a StreamToRowsConverter to convert messages into the table format:

    class KafkaToRowsConverter extends StreamToRowsConverter with Serializable {
      override def toRows(message: Any): Seq[Row] = {
        val sensor: MachineSensorStream = message.asInstanceOf[MachineSensorStream]
        // data for one row of the stream table
        Seq(Row.fromSeq(Seq(sensor.getSensorId,
                            sensor.getVin,
                            sensor.getMachineNo,
                            sensor.getPoint,
                            sensor.getValue,
                            sensor.getTimestamp)))
      }
    }

This is the class referenced by the rowConverter option in the previous slide's DDL.
Stream data processing using SQL

    SELECT *
    FROM MachineSensorStream
    WINDOW (DURATION 10 SECONDS, SLIDE 2 SECONDS)
    WHERE Point=1;

This continuous query acquires the rows with Point = 1 from a 10-second window (DURATION 10 SECONDS) that slides forward every 2 seconds (SLIDE 2 SECONDS).
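A sketch of how such a continuous query can be registered from a Spark program, assuming SnappyData's SnappyStreamingContext/registerCQ API; the sink table name is hypothetical:

    import org.apache.spark.streaming.{Seconds, SnappyStreamingContext}

    // one streaming context per application; 2-second batch interval
    val snsc = new SnappyStreamingContext(spark.sparkContext, Seconds(2))

    // register the continuous query from the slide; results arrive as a stream of DataFrames
    val resultStream = snsc.registerCQ("""
      SELECT * FROM MachineSensorStream
      WINDOW (DURATION 10 SECONDS, SLIDE 2 SECONDS)
      WHERE Point = 1""")

    // e.g., persist each window's result into a (hypothetical) history table
    resultStream.foreachDataFrame(df => df.write.insertInto("MachineSensorHistory"))

    snsc.start()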