It’s About Time: An Introduction to Timely Dataflow (Data Council, October ’19)
clockworks: Malte Sandstede (malte@clockworks.io / @MalteSandstede), Nikolas Göbel (niko@clockworks.io / @NikolasGoebel), David Bach (david@clockworks.io), Moritz Moxter (moritz@clockworks.io)
In collaboration with: Frank McSherry and Vasia Kalavri (ETH, Systems Group)
Stream Processing’s Trifecta: Timeliness, Consistency, Expressivity
Stream Processing’s Trifecta: Naive Stateless Processing • Low latency • Issue: Late arrivals • Issue: Complex computations
Stream Processing’s Trifecta: MapReduce • No late arrivals (by definition) • Easy to scale • Issue: Complex computations • Issue: High latency
Stream Processing’s Trifecta: Database • No late arrivals • High expressivity • ACID • Issue: Not realtime!
Use Case: Kafka Superpowers (Partitions complect physical representation & use case) [Diagram: topic T1 with partitions P1, P2]
Use Case: Kafka Superpowers [Diagram: physical representation (topic T1, partitions P1, P2) virtualized into virtual partitions V1–V4; capability labels: reactivity, queries, repartitioning, joins, business logic, time order]
Stream Processing as Dataflow [Diagram labels: sources, operators, sinks, data exchange]
Dataflow Parallelism
Dataflow Distribution [Diagram: the dataflow spread across w1 and w2]
Correctness Troubles [Animation frames: records (3, t0), (4, t1), (1, t0) flow from DATA into SUM; in the later frames SUM has emitted (5, t1)]
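The trouble on these slides can be reproduced with a small standalone sketch. It assumes that SUM eagerly emits a running total for every arriving record and that the records arrive in the order (1, t0), (4, t1), (3, t0); both the arrival order and the names `Record` and `eager_sums` are assumptions made for illustration, not part of Timely.

```rust
/// A value tagged with a logical timestamp (hypothetical type for this sketch).
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Record {
    pub value: i64,
    pub time: u64,
}

/// Naive operator: on every arrival, immediately emit
/// (sum of all values seen so far with timestamp <= t, t).
/// Nothing tells the operator whether earlier timestamps are complete.
pub fn eager_sums(arrivals: &[Record]) -> Vec<(i64, u64)> {
    let mut seen: Vec<Record> = Vec::new();
    let mut out = Vec::new();
    for &r in arrivals {
        seen.push(r);
        let sum: i64 = seen
            .iter()
            .filter(|s| s.time <= r.time)
            .map(|s| s.value)
            .sum();
        out.push((sum, r.time));
    }
    out
}
```

With that arrival order the operator reports (5, t1) before the late (3, t0) shows up, so the result for t1 is already wrong when it is emitted.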
Timely Dataflow: A low-latency runtime for distributed cyclic dataflows (github.com/TimelyDataflow)
Correctness with Progress Tracking [Animation frames: records (3, t0), (4, t1), (1, t0) flow from DATA into SUM alongside PROGRESS statements; SUM holds its inputs while PROGRESS is at t0 and, once PROGRESS reaches t2, emits (4, t0) from (1, t0) and (3, t0), then (8, t1) after adding (4, t1)]
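A minimal sketch of the idea on these slides, written against the standard library only: the operator buffers values per timestamp and releases a result only once a progress statement promises that the timestamp is complete. `FrontierSum`, `push`, and `advance` are hypothetical names for this illustration and are not Timely's API; the running-total semantics are inferred from the outputs (4, t0) and (8, t1) shown on the slides.

```rust
use std::collections::BTreeMap;

/// Running sum that only emits results for completed timestamps.
pub struct FrontierSum {
    /// Per-timestamp partial sums that have not been released yet.
    pending: BTreeMap<u64, i64>,
    /// Total over all timestamps released so far.
    total: i64,
}

impl FrontierSum {
    pub fn new() -> Self {
        FrontierSum { pending: BTreeMap::new(), total: 0 }
    }

    /// Buffer a value; nothing is emitted yet.
    pub fn push(&mut self, value: i64, time: u64) {
        *self.pending.entry(time).or_insert(0) += value;
    }

    /// `frontier` is a promise that no record with time < frontier will
    /// arrive any more. Emits (running total, time) for every such time.
    pub fn advance(&mut self, frontier: u64) -> Vec<(i64, u64)> {
        // split_off keeps keys < frontier in `pending` and returns the rest.
        let later = self.pending.split_off(&frontier);
        let ready = std::mem::replace(&mut self.pending, later);
        let mut out = Vec::new();
        for (time, sum) in ready {
            self.total += sum;
            out.push((self.total, time));
        }
        out
    }
}
```

Pushing (1, t0) and (4, t1), then the late (3, t0), and only then advancing the frontier to t2 yields (4, t0) followed by (8, t1), regardless of arrival order.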
Progress Tracking… without Progress? (data sources with different event frequencies) [Diagram: CLICKSTREAM TOPIC delivers (1, t0), (4, t1), (3, t2), (2, t3) and CLICKSTREAM PROGRESS has advanced through t1, t2, t3, t4; METADATA PROGRESS is still at t0; JOIN follows the minimum (MIN) of both, t0, and is waiting on METADATA]
Multidimensional Progress Tracking (track sources along independent timelines) [Animation frames: the same JOIN of CLICKSTREAM TOPIC and METADATA TOPIC; CLICKSTREAM PROGRESS (t1 … t4) and METADATA PROGRESS (t0, then t1, then t2) are tracked separately, and the clickstream records (1, t0), (4, t1), (3, t2), (2, t3) move through JOIN as the frames advance]
Creating Dataflows with Timely
Running Dataflows with Timely
Kafka Superpowers [Diagram: physical representation (topic T1, partitions P1, P2) virtualized into virtual partitions V1–V4; capability labels: reactivity, queries, repartitioning, joins, business logic, time order; with Timely, two capabilities are checked (✔) and one is still open (?)]
The Trifecta? Timeliness, Consistency, Expressivity
The Trifecta? Timeliness, Consistency, Expressivity. Still open: (recursive) queries
Recursive Graph Traversal [Animation frames: traversal over a graph with nodes A, B, C, D, E, F]
Recursive Dataflows [Diagram: EDGE CHANGES → BFS → REACHABLE NODES, with TRANSITIVE EDGES looping back into BFS]

    /// Breadth-First Search
    // Start every root at distance 0.
    let nodes = roots.map(|x| (x, 0));
    nodes.iterate(|inner| {
        // Bring the outer collections into the iteration scope.
        let edges = edges.enter(&inner.scope());
        let nodes = nodes.enter(&inner.scope());
        // Extend each known (node, distance) along its outgoing edges,
        inner.join_map(&edges, |_k, l, d| (*d, l + 1))
             // add the roots back in,
             .concat(&nodes)
             // and keep only the smallest distance per node.
             .reduce(|_, s, t| t.push((*s[0].0, 1)))
    })
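To make the loop body concrete, here is a plain standard-library sketch of the same three steps (join along edges, concat the roots, reduce to the minimum distance), repeated until nothing changes. It is not Differential Dataflow: the function name `bfs`, the `u32` node ids, and the explicit `loop` are assumptions made for illustration, and the whole result is recomputed each round rather than updated incrementally.

```rust
use std::collections::BTreeMap;

/// Shortest hop distance from any root to every reachable node.
pub fn bfs(edges: &[(u32, u32)], roots: &[u32]) -> BTreeMap<u32, u32> {
    // nodes = roots.map(|x| (x, 0))
    let nodes: BTreeMap<u32, u32> = roots.iter().map(|&r| (r, 0)).collect();
    let mut inner = nodes.clone();
    loop {
        // join_map: every known (src, dist) proposes (dst, dist + 1).
        let mut proposals: Vec<(u32, u32)> = Vec::new();
        for &(src, dst) in edges {
            if let Some(&dist) = inner.get(&src) {
                proposals.push((dst, dist + 1));
            }
        }
        // concat(&nodes): the roots always keep distance 0.
        proposals.extend(nodes.iter().map(|(&n, &d)| (n, d)));
        // reduce: keep the smallest proposed distance per node.
        let mut next = BTreeMap::new();
        for (node, dist) in proposals {
            let entry = next.entry(node).or_insert(dist);
            if dist < *entry {
                *entry = dist;
            }
        }
        // Fixed point reached: another round would change nothing.
        if next == inner {
            return inner;
        }
        inner = next;
    }
}
```

For example, `bfs(&[(0, 1), (1, 2), (2, 0), (2, 3), (4, 5)], &[0])` terminates despite the cycle 0 → 1 → 2 → 0 and never reaches nodes 4 or 5.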
Progress Tracking… with Loops? (have to finish iterating before we can handle next input) Have to wait while transitive graph is being discovered. [Diagram: inputs t2, t1, t0 queued at EDGE CHANGES → BFS → REACHABLE NODES, with TRANSITIVE EDGES looping back into BFS, still at t0]
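Multidimensional Progress Tracking (track iteration depth separately; Product Partial Order) [Diagram: EDGE CHANGES → BFS → REACHABLE NODES, with TRANSITIVE EDGES looping back into BFS; timestamps pair an input time with an iteration depth, shown as (t0, 1), (t1, 0), and t2]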
Lexicographical Order (Join): visibility for (t2, t2)
      t0  t1  t2  t3
t0    ✔   ✔   ✔   ✔
t1    ✔   ✔   ✔   ✔
t2    ✔   ✔   ✔
t3
Product Partial Order (Iteration): visibility for (t2, 2)
      0   1   2   3
t0    ✔   ✔   ✔
t1    ✔   ✔   ✔
t2    ✔   ✔   ✔
t3
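The two visibility tables follow directly from how two-coordinate timestamps are compared. The sketch below encodes a timestamp as a `(u64, u64)` pair, row coordinate first; the function names `lex_le`, `product_le`, and `visible` are hypothetical and chosen only for this illustration.

```rust
/// Lexicographic order: the first coordinate decides, the second breaks ties.
pub fn lex_le(a: (u64, u64), b: (u64, u64)) -> bool {
    a.0 < b.0 || (a.0 == b.0 && a.1 <= b.1)
}

/// Product partial order: both coordinates must be less than or equal.
pub fn product_le(a: (u64, u64), b: (u64, u64)) -> bool {
    a.0 <= b.0 && a.1 <= b.1
}

/// All timestamps in an n x n grid that are visible at `at`,
/// i.e. less than or equal to `at` under the given order.
pub fn visible(
    le: fn((u64, u64), (u64, u64)) -> bool,
    at: (u64, u64),
    n: u64,
) -> Vec<(u64, u64)> {
    let mut out = Vec::new();
    for row in 0..n {
        for col in 0..n {
            if le((row, col), at) {
                out.push((row, col));
            }
        }
    }
    out
}
```

On a 4 × 4 grid, `visible(lex_le, (2, 2), 4)` contains the 11 checked cells of the lexicographic table and `visible(product_le, (2, 2), 4)` the 9 checked cells of the product table. A timestamp such as (0, 3) is visible under the lexicographic order but not under the product order.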
Incremental Execution? Have to start from scratch for every transaction? [Diagram: EDGE CHANGES → BFS → REACHABLE NODES, with TRANSITIVE EDGES looping back into BFS]
Differential Dataflow: Iterative, incrementalized operators for Timely (github.com/TimelyDataflow)
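As a rough illustration of what "incrementalized" can mean, the sketch below maintains out-degrees under edge changes expressed as signed diffs (+1 for an insertion, -1 for a deletion) and reports only the output changes, instead of recounting from scratch. This is a standard-library toy under stated assumptions, not Differential Dataflow's implementation or API; `DegreeCount` and `update` are hypothetical names.

```rust
use std::collections::BTreeMap;

/// Out-degree per source node, maintained incrementally.
pub struct DegreeCount {
    counts: BTreeMap<u32, i64>,
}

impl DegreeCount {
    pub fn new() -> Self {
        DegreeCount { counts: BTreeMap::new() }
    }

    /// Applies a batch of edge changes `((src, dst), diff)` and returns
    /// the resulting output changes `((src, degree), diff)`.
    pub fn update(&mut self, changes: &[((u32, u32), i64)]) -> Vec<((u32, i64), i64)> {
        // Consolidate the input diffs per source node.
        let mut delta: BTreeMap<u32, i64> = BTreeMap::new();
        for &((src, _dst), diff) in changes {
            *delta.entry(src).or_insert(0) += diff;
        }
        let mut out = Vec::new();
        for (src, d) in delta {
            if d == 0 {
                continue; // changes cancelled out: no output change
            }
            let old = self.counts.get(&src).copied().unwrap_or(0);
            let new = old + d;
            if old != 0 {
                out.push(((src, old), -1)); // retract the old result
            }
            if new != 0 {
                out.push(((src, new), 1)); // assert the new result
                self.counts.insert(src, new);
            } else {
                self.counts.remove(&src);
            }
        }
        out
    }
}
```

Only nodes touched by a batch do any work, and a batch that swaps one edge for another at the same source produces no output change at all.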
Performance
Streaming & Relational Queries: Declarative Differential Dataflows (3DF) (github.com/comnik/declarative-dataflow)

Differential Dataflow:

    /// BFS
    let nodes = roots.map(|x| (x, 0));
    nodes.iterate(|inner| {
        let edges = edges.enter(&inner.scope());
        let nodes = nodes.enter(&inner.scope());
        inner.join_map(&edges, |_k, l, d| (*d, l + 1))
             .concat(&nodes)
             .reduce(|_, s, t| t.push((*s[0].0, 1)))
    })

3DF:

    [[(bfs ?from ?to)
      [?from :edge ?to]]
     [(bfs ?from ?to)
      [?from :edge ?hop]
      (bfs ?hop ?to)]]
The Trifecta! Timeliness, Consistency, Expressivity
Kafka Superpowers [Diagram: physical representation (topic T1, partitions P1, P2) virtualized into virtual partitions V1–V4; capability labels: reactivity, queries, repartitioning, joins, business logic, time order; with Timely plus DD+3DF, all three capabilities are checked (✔)]
Kafka Superpowers [Diagram: topic T1, partitions P1, P2, virtual partitions V1–V4] clockworks.io/kplex
Timely as a Programming Model
• 3DF (Streaming Relational Queries): github.com/comnik/declarative-dataflow
• Differential Dataflow (Iterative Incrementalized Operators): github.com/TimelyDataflow
• Timely Dataflow (Dataflows w/ Multidimensional Progress Tracking): github.com/TimelyDataflow
Sources
clockworks: www.clockworks.io, {david, malte, moritz, niko}@clockworks.io
Repositories
• Timely: github.com/TimelyDataflow
• ST2: github.com/li1/snailtrail
• 3DF: github.com/comnik/declarative-dataflow
• Differential FAQ: github.com/eoxxs/differential-aggregate-query
Papers
• Naiad (Timely Dataflow): http://dl.acm.org/citation.cfm?doid=2517349.2522738
• Differential Dataflow: http://michaelisard.com/pubs/differentialdataflow.pdf, arxiv.org/abs/1812.02639
• SnailTrail: hdl.handle.net/20.500.11850/228581
Talks
• Reactive Datalog for Datomic (clojure/conj 2018): clockworks.io/2018/12/01/conj-talk.html
• Across Time and Space (BobKonf 2019): clockworks.io/2019/03/22/across-time-space.html
Blog Posts
• frankmcsherry.org
• Incremental Functional Aggregate Queries: clockworks.io/2019/07/06/Incremental-Functional-Aggregate-Queries.html
• Dataflows you can’t refuse: clockworks.io/2019/02/10/dataflows-you-cant-refuse.html
• Reactive Datalog with Vega: clockworks.io/2018/11/25/reactive-datalog-with-vega.html
• Incremental Datalog with Differential Dataflows: clockworks.io/2018/09/13/incremental-datalaog.html