
Naiad: a timely dataflow model

What’s it hoping to achieve? 1. high throughput 2. low latency 3. incremental computation

Why? → So much data! Problems with other, contemporary dataflow systems: 1. Too specific (e.g. Map-Reduce, Hadoop) 2. Batch-based systems 3. Graph-based systems 4. Stream processing systems

An Example: Streaming via Twitter [Figure: dataflow with inputs Twitter Tweets and User Queries; stages #values, @values and Connected Components (CC); output MAX tweet for a given CC]

A new computational model: timely dataflow → structured loops → stateful dataflow vertices → notifications for vertices [Figure: dataflow graph from IN to OUT]

Notifications for Vertices
Vertex methods:
  v.OnRecv(e: Edge, m: Message, t: Timestamp)
  v.OnNotify(t: Timestamp)
System-provided methods:
  this.SendBy(e: Edge, m: Message, t: Timestamp)
  this.NotifyAt(t: Timestamp)

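An Example Program
Stateful vertex (sums the messages received at each time, emits the sum on notification):
  Dictionary<Time, int> dict = ...
  void OnRecv(Edge e, int m, Time t):
    dict[t] = dict[t] + m
    this.NotifyAt(t)
  void OnNotify(Time t):
    this.SendBy(out, dict[t], t)
Stateless vertex (forwards primes as soon as they arrive):
  void OnRecv(Edge e, int m, Time t):
    if (isPrime(m)) this.SendBy(out, m, t)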
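The two vertices above can be exercised with a small harness. This is only a sketch in Python, not Naiad's API: `ToyRuntime`, `run_epoch`, the snake_case method names and the use of plain integers as times are assumptions made for illustration, and the harness simply delivers every notification once all input has been fed in.

```python
from collections import defaultdict


class ToyRuntime:
    """Hypothetical stand-in for the system-provided methods."""

    def __init__(self):
        self.sent = []        # (edge, message, time) triples, in send order
        self.pending = set()  # times for which a notification was requested

    def send_by(self, edge, message, time):
        self.sent.append((edge, message, time))

    def notify_at(self, time):
        self.pending.add(time)


class SumPerTime:
    """Stateful vertex: sums messages per time, emits the sum on notification."""

    def __init__(self, runtime):
        self.rt = runtime
        self.totals = defaultdict(int)

    def on_recv(self, edge, m, t):
        self.totals[t] += m
        self.rt.notify_at(t)

    def on_notify(self, t):
        # No more messages at time t can arrive, so the sum is final.
        self.rt.send_by("out", self.totals.pop(t), t)


def is_prime(n):
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


class PrimeFilter:
    """Stateless vertex: forwards primes immediately, never asks for notification."""

    def __init__(self, runtime):
        self.rt = runtime

    def on_recv(self, edge, m, t):
        if is_prime(m):
            self.rt.send_by("out", m, t)


def run_epoch(vertex, runtime, messages):
    """Deliver (message, time) pairs, then fire the requested notifications."""
    for m, t in messages:
        vertex.on_recv("in", m, t)
    # All input has been delivered, so every requested time is complete.
    for t in sorted(runtime.pending):
        vertex.on_notify(t)
    runtime.pending.clear()
    return runtime.sent
```

The contrast is the point of the example: the filter can emit output before a time is complete, while the summing vertex has to wait for the notification before its result is safe to send.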

Structured Loops & Stateful Vertices [Figure: a loop context between IN and OUT, with ingress (I), egress (E) and feedback (F) vertices]


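Timestamps: $(e \in \mathbb{N},\ \langle c_1, \ldots, c_k \rangle \in \mathbb{N}^k)$
[Figure: a loop context between IN and OUT, with ingress (I), egress (E) and feedback (F) vertices]
Ingress (I): $(e, \langle c_1, \ldots, c_k \rangle) \to (e, \langle c_1, \ldots, c_k, 0 \rangle)$
Egress (E): $(e, \langle c_1, \ldots, c_k, c_{k+1} \rangle) \to (e, \langle c_1, \ldots, c_k \rangle)$
Feedback (F): $(e, \langle c_1, \ldots, c_k \rangle) \to (e, \langle c_1, \ldots, c_k + 1 \rangle)$
Ordering: $t_1 = (x_1, \vec{c}_1) \le t_2 = (x_2, \vec{c}_2) \iff x_1 \le x_2 \text{ and } \vec{c}_1 \le \vec{c}_2$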
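The three timestamp adjustments and the ordering can be written out directly. The following is a minimal sketch; the names `Timestamp`, `ingress`, `egress`, `feedback` and `less_equal` are chosen here for illustration, and it assumes loop-counter vectors are compared lexicographically, which is what Python's tuple comparison does.

```python
from typing import NamedTuple, Tuple


class Timestamp(NamedTuple):
    epoch: int
    counters: Tuple[int, ...] = ()  # one counter per enclosing loop context


def ingress(t: Timestamp) -> Timestamp:
    """Entering a loop context: append a fresh loop counter set to 0."""
    return Timestamp(t.epoch, t.counters + (0,))


def egress(t: Timestamp) -> Timestamp:
    """Leaving a loop context: drop the innermost loop counter."""
    if not t.counters:
        raise ValueError("timestamp is not inside a loop context")
    return Timestamp(t.epoch, t.counters[:-1])


def feedback(t: Timestamp) -> Timestamp:
    """Going around the loop: increment the innermost loop counter."""
    if not t.counters:
        raise ValueError("timestamp is not inside a loop context")
    return Timestamp(t.epoch, t.counters[:-1] + (t.counters[-1] + 1,))


def less_equal(t1: Timestamp, t2: Timestamp) -> bool:
    """t1 <= t2 iff the epochs and the counter vectors are both ordered."""
    return t1.epoch <= t2.epoch and t1.counters <= t2.counters
```

Because both components must agree, the order is partial: a timestamp with a smaller epoch but a larger loop counter is neither before nor after one with a larger epoch and a smaller counter.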

A Single-Threaded scheduler
Pointstamp: $(t \in \text{Timestamp},\ l \in \text{Edge} \cup \text{Vertex})$
could-result-in: $(t_1, l_1) \le (t_2, l_2) \iff \Phi[l_1, l_2](t_1) \le t_2$
The scheduler maintains:
1. a set of active pointstamps
2. an occurrence count for each active pointstamp
3. a precursor count for each active pointstamp


A Single-Threaded scheduler: in action
1. A pointstamp P becomes active
   a. initialize its precursor count to the number of existing active pointstamps that could-result-in P
   b. increment the precursor count of any pointstamp that P could-result-in
2. A pointstamp P leaves the active set (occurrence count = 0)
   a. decrement the precursor count of any pointstamp that P could-result-in
3. A pointstamp P reaches the frontier of active pointstamps (precursor count = 0)
   a. the scheduler can deliver any notification originating from P
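The bookkeeping in these steps can be sketched as a small class. This is an illustrative sketch, not the scheduler's actual code: `ProgressTracker` and its method names are made up here, and the could-result-in relation is passed in as a function. The example relation at the bottom assumes a two-stage pipeline `a -> b` with integer times, purely for demonstration.

```python
class ProgressTracker:
    """Tracks occurrence and precursor counts for active pointstamps."""

    def __init__(self, could_result_in):
        self.could_result_in = could_result_in
        self.occurrence = {}  # active pointstamp -> occurrence count
        self.precursor = {}   # active pointstamp -> precursor count

    def update(self, p, delta):
        if p not in self.occurrence:
            # Step 1a: count the active pointstamps that could-result-in p.
            self.precursor[p] = sum(
                1 for q in self.occurrence if self.could_result_in(q, p)
            )
            # Step 1b: p is a new precursor of everything it could-result-in.
            for q in self.occurrence:
                if self.could_result_in(p, q):
                    self.precursor[q] += 1
            self.occurrence[p] = 0
        self.occurrence[p] += delta
        if self.occurrence[p] == 0:
            # Step 2: p leaves the active set and stops being a precursor.
            del self.occurrence[p]
            del self.precursor[p]
            for q in self.occurrence:
                if self.could_result_in(p, q):
                    self.precursor[q] -= 1

    def frontier(self):
        # Step 3: pointstamps with no active precursors.
        return {p for p, n in self.precursor.items() if n == 0}


# Hypothetical relation for a pipeline a -> b with integer times.
STAGE = {"a": 0, "b": 1}


def could_result_in(p, q):
    (t1, l1), (t2, l2) = p, q
    return t1 <= t2 and STAGE[l1] <= STAGE[l2]
```

With pointstamps `(0, "a")`, `(0, "b")` and `(1, "a")` active, only `(0, "a")` is in the frontier; once its occurrence count drops to zero, the other two enter the frontier together.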

Distributed Implementation [Figure: processes, each hosting workers, connected by a TCP/IP network and running the progress tracking protocol]

Data parallelism: how do we achieve it? [Figure: a logical graph and the corresponding physical graph, with vertices assigned to workers]
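The slide does not say how records reach the right worker. One common approach, assumed here, is to hash a per-record key and route the record to the worker that owns that hash. In the sketch below, `worker_for` and `exchange` are hypothetical names, and CRC32 is used only because it is deterministic across runs.

```python
import zlib


def worker_for(key: str, num_workers: int) -> int:
    """Map a key to a worker index with a deterministic hash."""
    return zlib.crc32(key.encode("utf-8")) % num_workers


def exchange(records, key_fn, num_workers):
    """Split records into one inbox per worker, grouped by key."""
    inboxes = [[] for _ in range(num_workers)]
    for record in records:
        inboxes[worker_for(key_fn(record), num_workers)].append(record)
    return inboxes
```

All records that share a key land in the same inbox, so a stateful vertex instance on that worker sees every record for the keys it owns.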


Distributed Progress Tracking
For each active pointstamp, a worker maintains its version of the global state:
- a local occurrence count
- a local precursor count
- a local frontier
Optimisations:
1. projected pointstamps
2. use a local buffer
3. use UDP packets for updates before sending via TCP
4. threads can be woken by either a broadcast or a unicast notification
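The local buffer optimisation can be illustrated with a short sketch. It assumes that progress updates are `(pointstamp, delta)` pairs and that updates to the same pointstamp may be summed before they are broadcast, so increments and decrements that cancel never leave the worker. `UpdateBuffer` and its methods are hypothetical names.

```python
class UpdateBuffer:
    """Accumulates occurrence-count deltas until they are flushed."""

    def __init__(self):
        self.deltas = {}  # pointstamp -> net delta not yet broadcast

    def add(self, pointstamp, delta):
        total = self.deltas.get(pointstamp, 0) + delta
        if total == 0:
            # The updates cancelled out, so there is nothing to send.
            self.deltas.pop(pointstamp, None)
        else:
            self.deltas[pointstamp] = total

    def flush(self):
        """Return the net updates to broadcast and empty the buffer."""
        updates = sorted(self.deltas.items())
        self.deltas.clear()
        return updates
```

A message that is produced and consumed on the same worker between two flushes contributes `+1` and then `-1` to one pointstamp, and so generates no network traffic at all.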

Results: Throughput Benchmark: construct a cyclic dataflow network which repeatedly performs an all-to-all data exchange 1. throughput scales linearly 2. but it falls short of the ideal

Results: Latency Benchmark: construct a simple cyclic graph in which vertices request/receive completeness notifications - median time: 753 μs Caveat: Micro-stragglers 1. Networking: TCP over Ethernet 2. Data structure contention 3. Garbage Collection

Results: PageRank using Twitter

Results: Incremental computation Benchmark : in a continually arriving stream of tweets, extract hashtags and mentions of other users to determine the most popular hashtag for a given user. Setup : 1. two inputs for the stream of tweets and requests a. fed into an incremental computation 2. introduce 32,000 tweets per second 3. add a new query every 100 ms
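The per-tweet work in this benchmark can be sketched as follows. This is a simplified illustration under stated assumptions, not the benchmark's code: it keeps a running hashtag count per user so that a query is answered from existing state rather than by rescanning the stream, and it only returns the extracted mentions instead of using them. `HashtagIndex` and the two regular expressions are choices made here.

```python
import re
from collections import Counter, defaultdict

HASHTAG = re.compile(r"#(\w+)")
MENTION = re.compile(r"@(\w+)")


class HashtagIndex:
    """Incrementally maintained hashtag counts, one Counter per user."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def add_tweet(self, user, text):
        """Fold one tweet into the state and return the users it mentions."""
        self.counts[user].update(HASHTAG.findall(text))
        return MENTION.findall(text)

    def top_hashtag(self, user):
        """Answer a query from the current state, or None if there is no data."""
        counter = self.counts.get(user)
        if not counter:
            return None
        return counter.most_common(1)[0][0]
```

Each new tweet touches only the counters of its author, which is the sense in which the computation is incremental: the cost of an update does not depend on how many tweets have already been processed.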

Strengths 1. Generality 2. Simplicity 3. Incremental computation for iterations 4. Fine-grained control over partitioning

Weaknesses (in my opinion) 1. Latency and throughput are not tested together 2. Although Naiad can achieve substantial improvements, these depend on the implementation 3. Lines of code are used to measure simplicity 4. Stragglers

Limitations 1. Naiad is specifically designed for problems in which the working set fits in the total RAM of the cluster 2. Fault tolerance

Takeaway & Impact The timely dataflow computational model is powerful because of: 1. Incremental and iterative computation 2. A general, lightweight framework for data-parallel applications that covers a wide domain (e.g. not just loops) while offering low latency and high throughput
