An Intro to Modern Data Stream Analytics EIT Summer School 2016 Paris Carbone PhD Candidate @ KTH <parisc@kth.se> Committer @ Apache Flink <senorcarbone@apache.org> 1
Motivation • Time-critical problems / Actionable Insights • Stock market predictions • Fraud detection • Network security • Fresh customer recommendations more like First-World Problems.. 2
How about Tsunamis 3
Deploy Sensors (earth & wave activity) • Collect Data • Analyse Data Regularly • Q = evacuation window 4
Motivation • [Figure: the query Q issued repeatedly] 5
Motivation • Standing Query: Q = evacuation window 6
The Data Stream Paradigm • Standing queries are evaluated continuously • Input data is unbounded • Queries operate on the full data stream or on the most recent views of the stream ~ windows 7
Data Stream Basics • Events/Tuples: elements of computation that respect a schema • Data Streams: unbounded sequences of events • Stream Operators/Tasks: consume and produce data streams • Events are consumed once - no backtracking! • [Figure: operator f consumes streams S1, S2 and produces S’1, S’2, So] • Where are computations stored? 8
Synopsis-Task State • We cannot store all events seen indefinitely • Synopsis: a summary of an infinite stream • It is in principle any streaming operator state • Examples: samples, histograms, sketches, state machines… • s: a summary of everything seen so far • For each event t, task f will: 1. process t, s 2. update s 3. produce t’ 9
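The three steps above can be sketched as a small task that keeps only a synopsis s rather than the events themselves. This is a minimal plain-Java illustration, not the API of any stream engine; the names `SynopsisTask`, `process` and the demo class are hypothetical, and the synopsis chosen here is simply a count of events seen so far.

```java
import java.util.function.BiFunction;

// Hypothetical streaming task: holds a synopsis s and, for each event t,
// processes t together with s, updates s, and produces an output t'.
final class SynopsisTask<T, S, R> {
    private S synopsis;                       // summary of everything seen so far
    private final BiFunction<S, T, S> update; // how s absorbs a new event
    private final BiFunction<S, T, R> emit;   // how t' is derived from (s, t)

    SynopsisTask(S initial, BiFunction<S, T, S> update, BiFunction<S, T, R> emit) {
        this.synopsis = initial;
        this.update = update;
        this.emit = emit;
    }

    R process(T event) {
        synopsis = update.apply(synopsis, event); // steps 1 and 2: process t with s, update s
        return emit.apply(synopsis, event);       // step 3: produce t'
    }
}

final class SynopsisTaskDemo {
    public static void main(String[] args) {
        // Synopsis = number of events seen; output = the event tagged with that count.
        SynopsisTask<String, Long, String> counter =
            new SynopsisTask<>(0L, (s, t) -> s + 1, (s, t) -> t + "#" + s);
        System.out.println(counter.process("a"));
        System.out.println(counter.process("b"));
    }
}
```

The event itself is discarded after `process` returns; only the constant-size synopsis survives between events.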
Synopses-Aggregations • Discussion - Rolling Aggregations • Propose a synopsis, s=? when • f= max • f= ArithmeticMean • f= stDev 10
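One possible answer to the discussion, as a hedged sketch: for max the synopsis can be the running maximum, for the arithmetic mean a (count, mean) pair, and for the standard deviation the same pair plus a running sum of squared deviations (Welford's update). The class name `RollingStats` is hypothetical, and the population standard deviation is assumed since the slide does not say which variant is meant.

```java
// Constant-size synopsis for rolling max, arithmetic mean and standard deviation.
final class RollingStats {
    private double max = Double.NEGATIVE_INFINITY; // synopsis for f = max
    private long count = 0;                        // (count, mean): synopsis for f = ArithmeticMean
    private double mean = 0.0;
    private double m2 = 0.0;                       // sum of squared deviations: extra state for f = stDev

    void add(double x) {
        max = Math.max(max, x);
        count++;
        double delta = x - mean;
        mean += delta / count;     // incremental mean
        m2 += delta * (x - mean);  // Welford's update, uses the mean before and after
    }

    double max() { return max; }
    double mean() { return mean; }

    // Population standard deviation (an assumption; divide by count - 1 for the sample variant).
    double stDev() { return count == 0 ? 0.0 : Math.sqrt(m2 / count); }
}

final class RollingStatsDemo {
    public static void main(String[] args) {
        RollingStats stats = new RollingStats();
        for (double x : new double[] {2, 4, 4, 4, 5, 5, 7, 9}) {
            stats.add(x);
        }
        System.out.println(stats.max() + " " + stats.mean() + " " + stats.stDev());
    }
}
```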
Synopses-Approximations • Discussion - Approximate Results • Propose a synopsis, s=? when • f= uniform random sample of k records over the whole stream • f= filter distinct records over windows of 1000 records with a 5% error 11
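For the first question, one well-known candidate synopsis is a reservoir of k records (reservoir sampling, "Algorithm R"). The sketch below is an illustration under that assumption, with a hypothetical class name `ReservoirSample`; it does not address the second question, for which a probabilistic set structure such as a Bloom filter would be a natural starting point.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Keeps a uniform random sample of k records over everything seen so far.
final class ReservoirSample<T> {
    private final int k;
    private final List<T> reservoir;
    private long seen = 0; // number of records observed so far

    ReservoirSample(int k) {
        this.k = k;
        this.reservoir = new ArrayList<>(k);
    }

    void add(T item) {
        seen++;
        if (reservoir.size() < k) {
            reservoir.add(item); // the first k records always enter the sample
        } else {
            // Pick a slot uniformly in [0, seen); the record is kept with probability k / seen.
            long j = ThreadLocalRandom.current().nextLong(seen);
            if (j < k) {
                reservoir.set((int) j, item);
            }
        }
    }

    List<T> sample() {
        return new ArrayList<>(reservoir);
    }
}

final class ReservoirSampleDemo {
    public static void main(String[] args) {
        ReservoirSample<Integer> sample = new ReservoirSample<>(10);
        for (int i = 0; i < 1000; i++) {
            sample.add(i);
        }
        System.out.println(sample.sample());
    }
}
```

The state is bounded by k regardless of how long the stream runs.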
Synopses-ML and Graphs • Examples of cool synopses to check out • Sparsifiers/Spanners - approximating graph properties such as shortest paths • Change detectors - detecting concept drift • Incremental decision trees - continuous stream training and classification 12
Data Stream Basics • Any other problems? • Does this scale? • [Figure: operator f consumes streams S1, S2 and produces S’1, S’2, So] 13
Task Parallelism • We need task parallelism: • Data might be too large to process • State can get too large to fit in memory (e.g. graphs) • Data Streams might already be partitioned! (e.g. by key / Kafka partitions) • How do streams get partitioned? • [Figure: operator f consumes streams S1, S2 and produces S’1, S’2, So] 14
Task Partitioning • Partitioning defines how we allocate events to each parallel task instance. Typical partitioners are: • Broadcast • Shuffle • Key-based (e.g. by color) • [Figure: a partitioner P routing events to parallel instances of f, each with its own state s]
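The three partitioners can be sketched as functions from an event to the index (or indices) of the receiving task instance. This is a simplified plain-Java illustration with hypothetical names (`Partitioners`, `broadcast`, `shuffle`, `byKey`); real engines may use different hashing and routing schemes.

```java
import java.util.Arrays;
import java.util.concurrent.ThreadLocalRandom;
import java.util.stream.IntStream;

// Each method returns the index (or indices) of the parallel task instance(s) that receive an event.
final class Partitioners {
    // Broadcast: every parallel instance receives the event.
    static int[] broadcast(int parallelism) {
        return IntStream.range(0, parallelism).toArray();
    }

    // Shuffle: one instance chosen uniformly at random.
    static int shuffle(int parallelism) {
        return ThreadLocalRandom.current().nextInt(parallelism);
    }

    // Key-based: the same key always lands on the same instance.
    static int byKey(Object key, int parallelism) {
        return Math.floorMod(key.hashCode(), parallelism);
    }
}

final class PartitionersDemo {
    public static void main(String[] args) {
        System.out.println(Arrays.toString(Partitioners.broadcast(3)));
        System.out.println(Partitioners.shuffle(3));
        System.out.println(Partitioners.byKey("red", 3) == Partitioners.byKey("red", 3));
    }
}
```

Key-based partitioning is what makes per-key state possible: all events for a key are seen by one instance, so that instance can own the key's synopsis.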
Dataflow Pipelines • [Figure: sources (stream1, stream2) feed a dataflow pipeline for query Q whose sinks emit approximations, predictions, alerts……] 16
Dataflow Programming with Apache Storm • Step 1: Implement input (Spouts) and intermediate operators (Bolts) • Step 2: Construct a Topology by combining operators • Spouts are the topology sources. They listen to data feeds • Bolts represent all intermediate computation vertices of the topology. They do arbitrary data manipulation • Each operator can emit/subscribe to Streams (computation results) • [Figure: Spout → Bolt → Bolt] 17
Example: Topology Definition • [Figure: topology with components labelled numbers, new_numbers and toFile] 18
Stream Analytics Systems • Proprietary: Google DataFlow, IBM Infosphere, Microsoft Azure • Open Source: Flink, Samza, Spark, Storm, Beam 19
Programming Models • Compositional: Physical Representations; offer basic building blocks (Operators/Data Exchange); Custom Optimisation/Tuning • Declarative: Logical Representations; operators are transformations on abstract data types; advanced behaviour such as windowing is supported; Self-Optimisation 20
Programming Abstraction Levels • Declarative (DStream, DataStream, PCollection…): transformations abstract operator details; suitable for engineers and data analysts • Compositional: direct access to the execution graph / topology; suitable for engineers 21
Introducing Apache Flink • A Top-level project • Community-driven open source software development • Publicly open to new contributors • [Figure: #unique contributor ids by git commits, y-axis 0 to 120, x-axis Jul 2009 to May 2016]
Native Workload Support in Apache Flink • Scalable Batch Pipelines • Machine Learning • Stream Pipelines • Graph Analytics
The Apache Flink Stack • APIs: DataSet (Bounded Data Sources, Staged/Pipelined Execution) and DataStream (Unbounded Data Sources, Pipelined Execution) • Execution: Distributed Dataflow • Deployment 24
The Big Picture • On DataSet: Hadoop M/R, Table, SQL, ML, Graph-Gelly • On DataStream: CEP, Table, SQL, ML, Graph-Gelly • Both sit on the Distributed Dataflow layer, above Deployment
Basic API Concept • Source → DataSet → Operator → DataSet → Sink • Source → DataStream → Operator → DataStream → Sink • Writing a Flink Program: 1. Bootstrap Sources 2. Apply Operators 3. Output to Sinks 26
Data Streams as Abstract Data Types (DataStream) • Transformations: map, flatMap, filter, union… • Aggregations: reduce, fold, sum • Partitioning: forward, broadcast, shuffle, keyBy • Sources/Sinks: custom or Kafka, Twitter, Collections… • Tasks are distributed and run in a pipelined fashion. • State is kept within tasks. • Transformations are applied per-record or window. 27
Example • Input: “live and let live” • textStream.flatMap {_.split("\\W+")}.map {(_, 1)}.keyBy(0).sum(1).print() • After flatMap: “live” “and” “let” “live” • After map: (live,1) (and,1) (let,1) (live,1) • Printed after keyBy(0).sum(1): (live,1) (and,1) (let,1) (live,2) 28
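To make the per-key rolling sum concrete, the same word count can be simulated without any stream engine. The following is a plain-Java sketch, not Flink code; `RollingWordCount` and `run` are hypothetical names, and a `HashMap` stands in for the keyed state that the task would hold.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simulates flatMap -> map -> keyBy -> sum on a single line of text.
final class RollingWordCount {
    static List<String> run(String line) {
        Map<String, Integer> counts = new HashMap<>(); // stands in for per-key state
        List<String> out = new ArrayList<>();
        for (String word : line.split("\\W+")) {        // flatMap: line -> words
            if (word.isEmpty()) {
                continue;
            }
            // map to (word, 1), then add it to the running sum for that key
            int updated = counts.merge(word, 1, Integer::sum);
            out.add("(" + word + "," + updated + ")");  // one output per input record
        }
        return out;
    }
}

final class RollingWordCountDemo {
    public static void main(String[] args) {
        System.out.println(RollingWordCount.run("live and let live"));
    }
}
```

Each input record produces an updated count immediately, which is why the second “live” yields (live,2) rather than a single final total.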
Working with Windows • Why windows? We are often interested in fresh data! • 1) Sliding windows: myKeyedStream.timeWindow(Time.seconds(60), Time.seconds(20)); • 2) Tumbling windows: myKeyedStream.timeWindow(Time.seconds(60)); • [Figure: events 15, 38, 65, 88, 110, 120 on a time axis in seconds, grouped into window buckets/panes; the sliding case shows SUM #1 over 15, 38, SUM #2 over 38, 65 and SUM #3 over 65, 88; the tumbling case shows SUM #1 over 15, 38 and SUM #2 over 65, 88] • Highlight: Flink can form and trigger windows consistently under different notions of time and deal with late events! 29
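How an event is assigned to windows can be sketched with a little arithmetic on its timestamp. The code below is a plain-Java illustration with hypothetical names (`WindowAssigner`, `tumblingStart`, `slidingStarts`); it assumes windows aligned to time 0 and half-open intervals [start, start + size), and the timestamps in the demo are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Computes which windows a timestamp falls into (all values in seconds).
final class WindowAssigner {
    // Tumbling windows: each timestamp belongs to exactly one window.
    static long tumblingStart(long timestamp, long size) {
        return timestamp - Math.floorMod(timestamp, size);
    }

    // Sliding windows: a timestamp belongs to every window that overlaps it.
    static List<Long> slidingStarts(long timestamp, long size, long slide) {
        List<Long> starts = new ArrayList<>();
        long lastStart = timestamp - Math.floorMod(timestamp, slide);
        for (long start = lastStart; start > timestamp - size; start -= slide) {
            starts.add(start); // window [start, start + size) contains the timestamp
        }
        return starts;
    }
}

final class WindowAssignerDemo {
    public static void main(String[] args) {
        // Tumbling 60 s window containing second 65.
        System.out.println(WindowAssigner.tumblingStart(65, 60));
        // Sliding 60 s windows, sliding every 20 s, containing second 65.
        System.out.println(WindowAssigner.slidingStarts(65, 60, 20));
    }
}
```

With a 60 second size and a 20 second slide, every event belongs to three windows, so each event contributes to three separate sums.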
Example: counting words over windows • Input: “live and” at 10:48, “let live” at 11:01 • textStream.flatMap {_.split("\\W+")}.map {(_, 1)}.keyBy(0).timeWindow(Time.minutes(5)).sum(1).print() • Output: (live,1) (and,1) for Window (10:45-10:50); (let,1) (live,1) for Window (11:00-11:05) 30
Example • textStream.flatMap {_.split("\\W+")}.map {(_, 1)}.keyBy(0).timeWindow(Time.minutes(5)).sum(1).print() • [Figure: operator chain flatMap → map → window sum → print, annotated with where counts are kept in state] 31
Example • textStream.flatMap {_.split("\\W+")}.map {(_, 1)}.keyBy(0).timeWindow(Time.minutes(5)).sum(1).setParallelism(4).print() • [Figure: operator chain flatMap → map → window sum → print] 32
Making State Explicit • Explicitly defined state is durable to failures • Flink supports two types of explicit states • Operator State - full state • Key-Value State - partitioned state per key • State Backends: In-memory, RocksDB, HDFS 33
Fault Tolerance • State is not affected by failures • When failures occur we revert computation and state back to a snapshot • [Figure: an event stream with snapshots snap - t1 and snap - t2 taken at times t1 and t2] 34
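The idea of reverting both computation and state to a snapshot can be illustrated on a single task. The sketch below is a toy model in plain Java, not Flink's snapshotting mechanism; it assumes a replayable input (here a list) and uses hypothetical names (`SnapshottingSum`, `snapshot`, `recover`).

```java
import java.util.Arrays;
import java.util.List;

// A summing task whose state and input position can be rolled back to the last snapshot.
final class SnapshottingSum {
    private long sum = 0;        // task state
    private int offset = 0;      // how far into the input the task has progressed
    private long snapSum = 0;    // state captured by the last snapshot
    private int snapOffset = 0;  // input position captured by the last snapshot

    // Consume events from the replayable input up to (not including) position upTo.
    void process(List<Integer> input, int upTo) {
        while (offset < upTo) {
            sum += input.get(offset++);
        }
    }

    void snapshot() { snapSum = sum; snapOffset = offset; }

    // On failure: revert both the state and the input position to the snapshot.
    void recover() { sum = snapSum; offset = snapOffset; }

    long sum() { return sum; }
}

final class SnapshottingSumDemo {
    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(1, 2, 3, 4, 5);
        SnapshottingSum task = new SnapshottingSum();
        task.process(input, 2);
        task.snapshot();          // snapshot taken after two events
        task.process(input, 4);
        task.recover();           // simulated failure after four events
        task.process(input, 5);   // events after the snapshot are replayed
        System.out.println(task.sum());
    }
}
```

Because the input position is rolled back together with the state, replayed events are counted exactly once and the final sum matches a failure-free run.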
Performance • Twitter Hack Week - Flink as an in-memory data store • Jamie Grier - http://data-artisans.com/extending-the-yahoo-streaming-benchmark/ 35
So how is Flink different from Spark? Two major differences: 1) Stream Execution 2) Mutable State 36
Flink vs Spark • Flink: dedicated resources, mutable state • Spark Streaming: leased resources, immutable state; dstream.updateStateByKey(…) puts new states in the output RDD 37
What about DataSets? • Sophisticated SQL-inspired optimiser • Efficient Join Strategies • Managed Memory bypasses Garbage Collection • Fast, in-memory Iterative Bulk Computations 38
Some Interesting Libraries 39
Detecting Patterns • CEP Library Example (Java): PatternStream<Event> tsunamiPattern = CEP.pattern(sensorStream, Pattern.<Event>begin("seismic").where(evt -> evt.motion.equals("ClassB")).next("tidal").where(evt -> evt.elevation > 500)); DataStream<Alert> result = tsunamiPattern.select(pattern -> { return getEvacuationAlert(pattern); }); 40
Mining Graphs with Gelly • Iterative Graph Processing • Scatter-Gather • Gather-Sum-Apply • Graph Transformations/Properties • Library Methods: Community Detection, Label Propagation, Connected Components, PageRank, Shortest Paths, Triangle Count etc… • Coming up next: Dynamic graph processing support 41
Machine Learning Pipelines • Scikit-learn inspired pipelining • Supervised : SVM, Linear Regression • Preprocessing : Polynomial Features, Scalers • Recommendation : ALS 42