Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters
Matei Zaharia, Tathagata Das, Haoyuan Li, Scott Shenker, Ion Stoica
UC Berkeley
Motivation
• Many important applications need to process large data streams arriving in real time
  – User activity statistics (e.g. Facebook’s Puma)
  – Spam detection
  – Traffic estimation
  – Network intrusion detection
• Our target: large-scale apps that must run on tens to hundreds of nodes with O(1 sec) latency
Challenge
• To run at large scale, the system has to be both:
  – Fault-tolerant: recover quickly from failures and stragglers
  – Cost-efficient: do not require significant hardware beyond that needed for basic processing
• Existing streaming systems don’t have both properties
Traditional Streaming Systems
• “Record-at-a-time” processing model
  – Each node has mutable state
  – For each record, update state & send new records
[Diagram: input records pushed through nodes 1 and 2 to node 3; each node holds mutable state]
Traditional Streaming Systems
• Fault tolerance via replication or upstream backup:
[Diagram: left, replication — nodes 1, 2, 3 mirrored by synchronized replicas 1’, 2’, 3’; right, upstream backup — a single standby takes over a failed node after replaying buffered input]
  – Replication: fast recovery, but 2x hardware cost
  – Upstream backup: only need 1 standby, but slow to recover
  – Neither approach tolerates stragglers
Observation
• Batch processing models for clusters (e.g. MapReduce) provide fault tolerance efficiently
  – Divide job into deterministic tasks
  – Rerun failed/slow tasks in parallel on other nodes
• Idea: run a streaming computation as a series of very small, deterministic batches
  – Same recovery schemes at much smaller timescale
  – Work to make batch size as small as possible
Discretized Stream Processing
[Diagram: at each interval (t = 1, t = 2, …), a batch operation pulls the interval’s input as an immutable dataset (stored reliably) and produces an immutable dataset (output or state), stored in memory without replication; the per-interval results form stream 1 and stream 2]
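To make the discretization concrete, here is a minimal self-contained Scala sketch of the micro-batch idea (the Record class, interval size, and processBatch function are illustrative assumptions, not Spark Streaming’s API): records are grouped by arrival interval, and each batch is handled by a pure, deterministic function whose output can always be recomputed.

    object MicroBatchSketch {
      case class Record(arrivalMs: Long, url: String)

      // Deterministic batch step: the same input batch always yields the
      // same output, so a failed or slow task can simply be re-run elsewhere.
      def processBatch(batch: Seq[Record]): Map[String, Int] =
        batch.groupBy(_.url).map { case (url, rs) => (url, rs.size) }

      def main(args: Array[String]): Unit = {
        val records = Seq(Record(100, "/a"), Record(900, "/b"), Record(1200, "/a"))
        val intervalMs = 1000L
        // Discretize the stream: one immutable batch per arrival interval.
        val batches = records.groupBy(_.arrivalMs / intervalMs)
        batches.toSeq.sortBy(_._1).foreach { case (t, batch) =>
          println(s"t=$t: ${processBatch(batch)}")
        }
      }
    }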
Parallel Recovery
• Checkpoint state datasets periodically
• If a node fails/straggles, recompute its dataset partitions in parallel on other nodes
[Diagram: a map from an input dataset to an output dataset, with the failed node’s partitions recomputed in parallel]
• Faster recovery than upstream backup, without the cost of replication
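A rough sketch of why this recovery is fast, under simplified assumptions (a hand-rolled lineage of a single map step, not Spark’s scheduler): each output partition depends only on its input partition, so all partitions lost with a failed node can be recomputed concurrently on the survivors rather than replayed serially by one standby.

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration._

    object ParallelRecoverySketch {
      // Lineage: output partition i = mapFn applied to input partition i.
      val inputPartitions: Vector[Seq[Int]] = Vector(Seq(1, 2), Seq(3, 4), Seq(5, 6))
      def mapFn(p: Seq[Int]): Seq[Int] = p.map(_ * 10)

      def main(args: Array[String]): Unit = {
        val lost = Seq(0, 2) // partitions held by the failed node
        // Recompute every lost partition in parallel on other nodes.
        val recovered = Future.traverse(lost) { i =>
          Future(i -> mapFn(inputPartitions(i)))
        }
        println(Await.result(recovered, 5.seconds).toMap)
      }
    }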
How Fast Can It Go?
• Prototype built on the Spark in-memory computing engine can process 2 GB/s (20M records/s) of data on 50 nodes at sub-second latency
[Figure: Grep and WordCount — cluster throughput (GB/s) vs. # of nodes in cluster; max throughput within a given latency bound (1 or 2s)]
How Fast Can It Go?
• Recovers from failures within 1 second
[Figure: interval processing time (s) vs. time (s), spiking briefly when the failure happens; sliding WordCount on 10 nodes with 30s checkpoint interval]
Programming Model
• A discretized stream (D-stream) is a sequence of immutable, partitioned datasets
  – Specifically, resilient distributed datasets (RDDs), the storage abstraction in Spark
• Deterministic transformation operators produce new streams
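One way to picture the model is a toy DStream class (purely illustrative, not Spark’s real classes): a stream is an ordered sequence of immutable per-interval batches, and every operator is a deterministic function from one such sequence to another.

    // Toy model of a D-stream: a Vector of immutable per-interval batches.
    final case class DStream[A](batches: Vector[Vector[A]]) {
      // Per-interval transformation: maps each batch to a new immutable batch.
      def map[B](f: A => B): DStream[B] = DStream(batches.map(_.map(f)))

      // Running reduce: the value after interval t is derived deterministically
      // from the value after t-1 plus batch t, so it can always be recomputed.
      def runningReduce(f: (A, A) => A): Vector[A] =
        batches
          .flatMap(_.reduceOption(f))
          .scanLeft(Option.empty[A])((acc, x) => Some(acc.fold(x)(f(_, x))))
          .flatten
    }

    object DStreamDemo {
      def main(args: Array[String]): Unit = {
        val pageViews = DStream(Vector(Vector("/a", "/b"), Vector("/a")))
        val ones = pageViews.map(_ => 1)
        println(ones.runningReduce(_ + _)) // Vector(2, 3): cumulative count per interval
      }
    }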
API
• LINQ-like language-integrated API in Scala
• New “stateful” operators for windowing

    pageViews = readStream("...", "1s")
    ones = pageViews.map(ev => (ev.url, 1))    // Scala function literal
    counts = ones.runningReduce(_ + _)
    sliding = ones.reduceByWindow("5s", _ + _, _ - _)
    // incremental version with “add” and “subtract” functions

[Diagram: the pageViews, ones, and counts D-streams at t = 1, t = 2, …; a map step then a reduce step per interval; each box is an RDD, each slice a partition]
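The add/subtract form of reduceByWindow can be illustrated with plain Scala over per-second counts (the semantics shown are assumed from the slide, not taken from the real operator’s implementation): with an invertible reduce function, each new window value is obtained from the previous one by adding the newest interval and subtracting the interval that slid out, instead of recomputing the whole window.

    object WindowSketch {
      def main(args: Array[String]): Unit = {
        val perSecondCounts = Vector(3, 1, 4, 1, 5, 9, 2, 6) // one value per 1s interval
        val w = 5                                            // 5s window = 5 intervals
        var window = perSecondCounts.take(w).sum             // compute the first window once
        println(s"t=${w - 1}: $window")
        for (t <- w until perSecondCounts.length) {
          // "add" the newest interval, "subtract" the one that fell out
          window = window + perSecondCounts(t) - perSecondCounts(t - w)
          println(s"t=$t: $window")
        }
      }
    }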
Other Benefits of Discretized Streams
• Consistency: each record is processed atomically
• Unification with batch processing:
  – Combining streams with historical data:
        pageViews.join(historicCounts).map(...)
  – Interactive ad-hoc queries on stream state:
        pageViews.slice("21:00", "21:05").topK(10)
Conclusion
• D-Streams forgo traditional streaming wisdom by batching data in small timesteps
• Enable an efficient, new parallel recovery scheme
• Let users seamlessly intermix streaming, batch, and interactive queries
Related Work
• Bulk incremental processing (CBP, Comet)
  – Periodic (~5 min) batch jobs on Hadoop/Dryad
  – On-disk, replicated FS for storage instead of RDDs
• Hadoop Online
  – Does not recover stateful ops or allow multi-stage jobs
• Streaming databases
  – Record-at-a-time processing, generally replication for FT
• Parallel recovery (MapReduce, GFS, RAMCloud, etc.)
  – Hwang et al. [ICDE’07] have a parallel recovery protocol for streams, but only allow 1 failure & do not handle stragglers
Timing Considerations
• D-streams group input into intervals based on when records arrive at the system
• For apps that need to group by an “external” time and tolerate network delays, support:
  – Slack time: delay starting a batch for a short fixed time to give records a chance to arrive
  – Application-level correction: e.g. give a result for time t at time t+1, then use later records to update it incrementally at time t+5 (sketched below)
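A tiny sketch of application-level correction under assumed record fields (eventSec for the external event time, arrivalSec for when the record reaches the system — both illustrative): emit a preliminary answer for bucket t shortly after its interval closes, then a corrected answer once late records have arrived.

    object LateDataSketch {
      case class Record(eventSec: Long, arrivalSec: Long)

      def main(args: Array[String]): Unit = {
        // The third record belongs to external time 0 but arrives late.
        val records = Seq(Record(0, 0), Record(0, 1), Record(0, 4))
        // Preliminary result for t=0, reported at t+1.
        val preliminary = records.count(r => r.eventSec == 0 && r.arrivalSec <= 1)
        println(s"t=0 (reported at t+1): $preliminary")
        // Corrected result for t=0, reported at t+5 after late arrivals.
        val corrected = records.count(r => r.eventSec == 0 && r.arrivalSec <= 5)
        println(s"t=0 (corrected at t+5): $corrected")
      }
    }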
D-Streams vs. Traditional Streaming

Concern                  Discretized Streams                             Record-at-a-time Systems
Latency                  0.5–2 s                                         1–100 ms
Consistency              Yes, batch-level                                Not in msg. passing systems; some DBs use waiting
Failures                 Parallel recovery                               Replication or upstream backup
Stragglers               Speculation                                     Typically not handled
Unification with batch   Ad-hoc queries from Spark shell, join w. RDDs   Not in msg. passing systems; in some DBs