Data at the Speed of your Users: Apache Cassandra and Spark for simple, distributed, near real-time stream processing. GOTO Copenhagen 2014

Rustam Aliyev, Solution Architect, @rstml

Big Data? (Photo: Flickr / Watches En Masse)

Volume, Variety, Velocity

Velocity = Near Real Time

Near Real Time?

Near Real Time: 0.5 sec ≤ latency ≤ 60 sec

Use Cases (Photo: Flickr / Swiss Army / Jim Pennucci)

Web Analytics, Dynamic Pricing, Recommendation, Fraud Detection

Architecture (Photo: Ilkin Kangarli / Baku Haydar Aliyev Center)

Architecture Goals: Low Latency, High Availability, Horizontal Scalability, Simplicity

Stream Processing: Collection → Processing → Storing → Delivery

Stream Processing: Collection → Spark → Cassandra → Delivery

Cassandra: Distributed Database (Photo: Flickr / Hypostyle Hall / Jorge Láscar)

Data Model

Partition: Partition Key → Cell 1, Cell 2, Cell 3, …

Partition "Nexus 5": os: Android, storage: 32GB, version: 4.4, weight: 130g (cells are kept in sort order on disk)

Table
Nexus 5: os: Android, storage: 32GB, version: 4.4, weight: 130g
iPhone 6: os: iOS, storage: 64GB, version: 8.0, weight: 129g
Other: os, memory, version, weight (values not shown)
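The partition and table slides above can be read as a map of maps. The following is only a minimal sketch of that idea in plain Scala, not how Cassandra stores data internally; the object and method names (DataModelSketch, cellNames) are hypothetical and chosen for illustration.

```scala
import scala.collection.immutable.SortedMap

object DataModelSketch {
  // A partition: cell name -> value, kept sorted by cell name
  // (mirrors the "sort order on disk" note on the slide).
  type Partition = SortedMap[String, String]

  // A table: partition key -> partition. Values are taken from the slides.
  val table: Map[String, Partition] = Map(
    "Nexus 5" -> SortedMap(
      "weight" -> "130g", "os" -> "Android", "version" -> "4.4", "storage" -> "32GB"),
    "iPhone 6" -> SortedMap(
      "os" -> "iOS", "storage" -> "64GB", "version" -> "8.0", "weight" -> "129g")
  )

  // Cell names of one partition, in their sorted order; empty if the key is unknown.
  def cellNames(key: String): List[String] =
    table.get(key).map(_.keys.toList).getOrElse(Nil)
}
```

Whatever order the cells are written in, reading them back through cellNames yields them sorted by name, which is the property the slide highlights.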

Distribution

Token ring nodes: 0000, 2000, 4000, 6000, 8000, A000, C000, E000. Partition key "Nexus 5" hashes to token 3D97.

Partition key "iPhone 6" hashes to token 9C4F on the same ring ("Nexus 5" is already placed at 3D97).

Replication

1 replica: tokens 3D97 and 9C4F are each stored on one node of the ring (0000, 2000, 4000, 6000, 8000, A000, C000, E000).

2 replicas: tokens 3D97 and 9C4F are each stored on two nodes of the ring (0000, 2000, 4000, 6000, 8000, A000, C000, E000).
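The distribution and replication slides can be sketched as a lookup on a sorted ring. The slides do not spell out the placement rule, so this sketch assumes the rule used by Cassandra's SimpleStrategy: a key belongs to the first node whose token is greater than or equal to the key's token, and further replicas go to the next nodes clockwise. The names TokenRingSketch and replicasFor are hypothetical.

```scala
object TokenRingSketch {
  // Node tokens from the slides, written as 16-bit hex values.
  val nodes: Vector[Int] =
    Vector(0x0000, 0x2000, 0x4000, 0x6000, 0x8000, 0xA000, 0xC000, 0xE000)

  // Index of the first node whose token is >= the key token;
  // tokens past the last node wrap around to the first node.
  private def ownerIndex(token: Int): Int = {
    val i = nodes.indexWhere(_ >= token)
    if (i == -1) 0 else i
  }

  // The owning node followed by the next (replicas - 1) nodes clockwise.
  def replicasFor(token: Int, replicas: Int): Vector[Int] = {
    val start = ownerIndex(token)
    Vector.tabulate(replicas)(k => nodes((start + k) % nodes.size))
  }
}
```

Under this assumption, "Nexus 5" (token 3D97) lands on node 4000, and with two replicas it is also copied to node 6000.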

Spark: Distributed Data Processing Engine (Photo: Flickr / Sparklers / Alexandra Compo / CreativeCommons)

Fast, In-memory

[Chart: Logistic Regression, running time (s) against number of iterations (1 to 30), Spark vs. Hadoop]

map, reduce

map, reduce, sample, filter, count, take, groupBy, fold, first, sort, reduceByKey, partitionBy, union, groupByKey, mapWith, join, cogroup, pipe, leftOuterJoin, cross, save, rightOuterJoin, zip, ...

RDD: Resilient Distributed Datasets, partitioned across Node 1, Node 2 and Node 3

Operator DAG: groupBy, join, map, filter (legend: Disk RDD, Memory RDD)

Spark Streaming: Micro-batching

Data Stream → DStream (a sequence of RDDs, one per batch interval)
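Micro-batching can be illustrated without Spark: incoming events are cut into consecutive fixed-length intervals, and each interval is then processed as one small batch. The following is a rough sketch of that cutting step only; the Event type and the batches function are hypothetical and are not part of the Spark Streaming API.

```scala
object MicroBatchSketch {
  // An event with an arrival time in milliseconds (hypothetical type).
  final case class Event(timeMs: Long, text: String)

  // Group events into consecutive batches of batchMs milliseconds.
  // Each result pair is (batch start time, event texts in arrival order),
  // and batches are returned ordered by start time.
  def batches(events: Seq[Event], batchMs: Long): Seq[(Long, Seq[String])] =
    events
      .groupBy(e => e.timeMs / batchMs * batchMs)
      .toSeq
      .sortBy(_._1)
      .map { case (start, es) => (start, es.sortBy(_.timeMs).map(_.text)) }
}
```

With a 2000 ms interval, events arriving at 100 ms and 1900 ms fall into the same batch, while an event at 2000 ms starts the next one.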

Spark + Cassandra: DataStax Spark Cassandra Connector

https://github.com/datastax/spark-cassandra-connector

Deployment: every node runs Cassandra and a Spark Worker; one node runs the Spark Master and a Worker (M)

Demo: Twitter Analytics

Cassandra Data Model

Partition "#hashtag": ALL: 7139, 2014-09-21: 220, 2014-09-20: 309, 2014-09-19: 129 (cells shown in sort order)

CREATE TABLE hashtags (
    hashtag text,
    interval text,
    mentions counter,
    PRIMARY KEY((hashtag), interval)
) WITH CLUSTERING ORDER BY (interval DESC);

Processing Data Stream

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.twitter.TwitterUtils
import com.datastax.spark.connector.streaming._

// Spark configuration (sc is a SparkConf), including the Cassandra contact point
val sc = new SparkConf()
  .setMaster("spark://127.0.0.1:7077")
  .setAppName("Twitter-Demo")
  .setJars(Seq("demo-assembly-1.0.jar"))
  .set("spark.cassandra.connection.host", "127.0.0.1")

// Micro-batches of 2 seconds
val ssc = new StreamingContext(sc, Seconds(2))

val stream = TwitterUtils.
  createStream(ssc, None, Nil, storageLevel = StorageLevel.MEMORY_ONLY_SER_2)

// Hashtags to track
val tags = Set("#iphone", "#android")

// Split each tweet into lower-cased words and keep only the tracked hashtags
val hashTags = stream.flatMap(tweet =>
  tweet.getText.toLowerCase.split(" ").
  filter(tags.contains))

// Mentions per hashtag within each batch
val tagCounts = hashTags.map((_, 1)).reduceByKey(_ + _)

// All-time totals go to the "ALL" interval
val tagCountsAll = tagCounts.map {
  case (tag, mentions) => (tag, mentions, "ALL")
}

tagCountsAll.saveToCassandra(
  "demo_ks", "hashtags", Seq("hashtag", "mentions", "interval"))

ssc.start()
ssc.awaitTermination()

// DateTime is assumed to come from Joda-Time
import org.joda.time.DateTime

val ssc = new StreamingContext(sc, Seconds(2))

val stream = TwitterUtils.
  createStream(ssc, None, Nil, storageLevel = StorageLevel.MEMORY_ONLY_SER_2)

// Hashtags to track
val tags = Set("#iphone", "#android")

val hashTags = stream.flatMap(tweet =>
  tweet.getText.toLowerCase.split(" ").
  filter(tags.contains))

val tagCounts = hashTags.map((_, 1)).reduceByKey(_ + _)

// Per-day totals: the interval is the current date instead of "ALL"
val tagCountsByDay = tagCounts.map {
  case (tag, mentions) => (tag, mentions, DateTime.now.toString("yyyyMMdd"))
}

tagCountsByDay.saveToCassandra(
  "demo_ks", "hashtags", Seq("hashtag", "mentions", "interval"))

ssc.start()
ssc.awaitTermination()
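The logic of the demo can be tried without a cluster. The sketch below replays it on plain Scala collections: each micro-batch is counted on its own, and the per-batch counts are then added to running totals keyed by (hashtag, interval). It assumes that writing per-batch counts into the counter column behaves as an increment rather than an overwrite; the names HashtagCountSketch, countBatch and increment are hypothetical, and nothing here calls Spark or the connector.

```scala
object HashtagCountSketch {
  // Hashtags to track, as in the demo.
  val tags: Set[String] = Set("#iphone", "#android")

  // Count tracked hashtags in one micro-batch of tweet texts.
  def countBatch(tweets: Seq[String]): Map[String, Long] =
    tweets
      .flatMap(_.toLowerCase.split(" ").toSeq)
      .filter(tags.contains)
      .groupBy(identity)
      .map { case (tag, hits) => tag -> hits.size.toLong }

  // Apply one batch as counter increments keyed by (hashtag, interval).
  def increment(
      counters: Map[(String, String), Long],
      batch: Map[String, Long],
      interval: String
  ): Map[(String, String), Long] =
    batch.foldLeft(counters) { case (acc, (tag, n)) =>
      val key = (tag, interval)
      acc.updated(key, acc.getOrElse(key, 0L) + n)
    }
}
```

Feeding two batches through increment with the interval "ALL" yields the all-time totals; using a date string as the interval yields the per-day rows of the same partition.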
