
It’s About Time: An Introduction to Timely Dataflow (Data Council)

It’s About Time: An Introduction to Timely Dataflow. Data Council, October ’19. clockworks: Malte Sandstede (malte@clockworks.io / @MalteSandstede), Nikolas Göbel (niko@clockworks.io / @NikolasGoebel). In collaboration with: Frank McSherry


  1. It’s About Time: An Introduction to Timely Dataflow. Data Council, October ’19

  2. clockworks: Malte Sandstede (malte@clockworks.io / @MalteSandstede), Nikolas Göbel (niko@clockworks.io / @NikolasGoebel), David Bach (david@clockworks.io), Moritz Moxter (moritz@clockworks.io). In collaboration with: Frank McSherry, Vasia Kalavri (ETH Systems Group)

  3. Stream Processing’s Trifecta: Timeliness, Consistency, Expressivity

  4. Stream Processing’s Trifecta (Timeliness, Consistency, Expressivity): Naive Stateless Processing
     • Low latency
     • Issue: Late arrivals
     • Issue: Complex computations

  5. Stream Processing’s Trifecta (Timeliness, Consistency, Expressivity): MapReduce
     • No late arrivals (by definition)
     • Easy to scale
     • Issue: Complex computations
     • Issue: High latency

  6. Stream Processing’s Trifecta (Timeliness, Consistency, Expressivity): Database
     • No late arrivals
     • High expressivity
     • ACID
     • Issue: Not realtime!

  7. Stream Processing’s Trifecta: Timeliness, Consistency, Expressivity

  8. Use Case: Kafka Superpowers (Partitions complect physical representation & use case) [diagram labels: P1, T1, P2]

  9. Use Case: Kafka Superpowers [diagram labels: P1, P2, T1, V1, V2, V3, V4; Physical Representation, Virtualization, Virtual Partitions, Business Logic; Repartitioning, Joins, time order, Reactivity, queries]

  10. Stream Processing as Dataflow [diagram labels: sources, operators, data exchange, sinks]

  11. Dataflow Parallelism

  12. Dataflow Distribution [diagram labels: w1, w2]

  13. Correctness Troubles [diagram: DATA, SUM; records (3, t0), (4, t1), (1, t0)]

  14. Correctness Troubles [diagram: DATA, SUM; records (3, t0), (4, t1), (1, t0)]

  15. Correctness Troubles [diagram: DATA, SUM; records (3, t0), (4, t1), (5, t1), (1, t0)]

  16. Correctness Troubles [diagram: DATA, SUM; records (3, t0), (4, t1), (5, t1), (1, t0)]

  17. Timely Dataflow: A low-latency runtime for distributed cyclic dataflows. github.com/TimelyDataflow

  18. Correctness with Progress Tracking [diagram: DATA t0 (3, t0) (4, t1) (1, t0) SUM t2 t0 t0 PROGRESS]

  19. Correctness with Progress Tracking [diagram: DATA t0 t0 (3, t0) (4, t1) SUM t2 t0 PROGRESS (1, t0)]

  20. Correctness with Progress Tracking [diagram: DATA t0 t0 t0 (3, t0) SUM t2 PROGRESS (4, t1) (1, t0)]

  21. Correctness with Progress Tracking [diagram: DATA t0 t2 t2 SUM PROGRESS (3, t0) (4, t1) (1, t0)]

  22. Correctness with Progress Tracking [diagram: DATA t0 t2 t2 (1, t0) (3, t0) (4, t0) SUM t2 PROGRESS (4, t1)]

  23. Correctness with Progress Tracking [diagram: DATA t0 t2 (4, t1) (8, t1) (1, t0) (3, t0) (4, t0) SUM t2 t2 PROGRESS]
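
The idea on slides 18 to 23 can be sketched in plain Rust without the Timely crate: buffer records by timestamp and emit a sum only once the progress frontier guarantees that no earlier record can still arrive. `GatedSum`, `push`, and `advance_to` are hypothetical names for illustration, not Timely's API, and the running total is an assumption read off the outputs (4, t0) and (8, t1) on the slides.

```rust
use std::collections::BTreeMap;

/// Hypothetical operator state (not Timely's API): per-timestamp partial
/// sums that are still waiting for the frontier, plus the running total.
#[derive(Default)]
struct GatedSum {
    pending: BTreeMap<u64, i64>,
    total: i64,
}

impl GatedSum {
    /// Buffer a record `(value, time)`. Nothing is emitted yet, because
    /// more records for the same timestamp may still arrive.
    fn push(&mut self, value: i64, time: u64) {
        *self.pending.entry(time).or_insert(0) += value;
    }

    /// The frontier advanced: no record with a timestamp below `frontier`
    /// can arrive any more, so those timestamps are complete. Emit the
    /// running total as of each completed timestamp, in time order.
    fn advance_to(&mut self, frontier: u64) -> Vec<(i64, u64)> {
        // `split_off` keeps keys < frontier in `pending` and returns the rest.
        let still_open = self.pending.split_off(&frontier);
        let closed = std::mem::replace(&mut self.pending, still_open);
        let mut out = Vec::new();
        for (time, delta) in closed {
            self.total += delta;
            out.push((self.total, time));
        }
        out
    }
}
```

Feeding it (3, t0), (4, t1), then the late (1, t0) produces nothing until the frontier moves past t0, at which point (4, t0) is released; moving past t1 then releases (8, t1).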

  24. Progress Tracking… without Progress? (data sources with different event frequencies) [diagram: CLICKSTREAM TOPIC with records (2, t3), (3, t2), (4, t1), (1, t0); CLICKSTREAM PROGRESS t4 t3 t2 t1; (MIN) t0; JOIN waiting on METADATA; METADATA TOPIC …; METADATA PROGRESS t0]

  25. Multidimensional Progress Tracking (track sources along independent timelines) [diagram: CLICKSTREAM TOPIC t0 t0 t0 with records (2, t3), (3, t2), (4, t1), (1, t0); CLICKSTREAM PROGRESS t4 t3 t2 t1; JOIN; METADATA TOPIC … …; METADATA PROGRESS t0]

  26. Multidimensional Progress Tracking (track sources along independent timelines) [diagram: CLICKSTREAM TOPIC t1 t0 t0 with records (2, t3), (3, t2), (4, t1), (1, t0); CLICKSTREAM PROGRESS t4 t3 t2 …; JOIN; METADATA TOPIC; METADATA PROGRESS t1 t0 … …]

  27. Multidimensional Progress Tracking (track sources along independent timelines) [diagram: CLICKSTREAM TOPIC t2 t0 with records (2, t3), (3, t2), (4, t1), (1, t0); CLICKSTREAM PROGRESS t4 t3 … …; JOIN; METADATA TOPIC; METADATA PROGRESS t2 t1 t0 t0 …]

  28. Creating Dataflows with Timely

  29. Creating Dataflows with Timely

  30. Creating Dataflows with Timely

  31. Creating Dataflows with Timely

  32. Running Dataflows with Timely

  33. Kafka Superpowers [diagram labels: P1, P2, T1, V1, V2, V3, V4; Physical Representation, Virtualization, Virtual Partitions, Business Logic; Repartitioning, Joins, time order, Reactivity, queries; marks: ?, ✔, ✔]

  34. Kafka Superpowers [diagram labels: P1, P2, T1, V1, V2, V3, V4; Timely; Physical Representation, Virtualization, Virtual Partitions, Business Logic; Repartitioning, Joins, time order, Reactivity, queries; marks: ?, ✔, ✔]

  35. The Trifecta? Timeliness, Consistency, Expressivity

  36. The Trifecta? Timeliness, Consistency, Expressivity [callout: (recursive) queries]

  37. Recursive Graph Traversal [graph with nodes A, B, C, D, E, F]

  38. Recursive Graph Traversal [graph with nodes A, B, C, D, E, F]

  39. Recursive Graph Traversal [graph with nodes A, B, C, D, E, F]

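  40. Recursive Dataflows [diagram labels: EDGE CHANGES, BFS, REACHABLE NODES, TRANSITIVE EDGES]

      /// Breadth-First Search
      let nodes = roots.map(|x| (x, 0));

      nodes.iterate(|inner| {
          // Bring the outer collections into the iteration scope.
          let edges = edges.enter(&inner.scope());
          let nodes = nodes.enter(&inner.scope());

          inner
              // Follow one edge from every node reached so far.
              .join_map(&edges, |_k, l, d| (*d, l + 1))
              // Keep the roots at distance 0.
              .concat(&nodes)
              // Keep only the smallest distance per node.
              .reduce(|_, s, t| t.push((*s[0].0, 1)))
      })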

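  41. Recursive Dataflows [diagram labels: EDGE CHANGES, BFS, REACHABLE NODES, TRANSITIVE EDGES]

      /// Breadth-First Search
      let nodes = roots.map(|x| (x, 0));

      nodes.iterate(|inner| {
          // Bring the outer collections into the iteration scope.
          let edges = edges.enter(&inner.scope());
          let nodes = nodes.enter(&inner.scope());

          inner
              // Follow one edge from every node reached so far.
              .join_map(&edges, |_k, l, d| (*d, l + 1))
              // Keep the roots at distance 0.
              .concat(&nodes)
              // Keep only the smallest distance per node.
              .reduce(|_, s, t| t.push((*s[0].0, 1)))
      })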
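
To see what the iterate / join / concat / reduce loop computes, here is a minimal std-only sketch of the same fixed point, without Differential Dataflow and without incremental updates. The function name `bfs_distances` is hypothetical, and so are the edges in the usage note (the slides name only the nodes A to F).

```rust
use std::collections::BTreeMap;

/// Plain-Rust sketch of the fixed point on the slide (not Differential
/// Dataflow): repeat "join with edges, concat roots, reduce to the minimum"
/// until the set of (node, distance) pairs stops changing.
fn bfs_distances(roots: &[char], edges: &[(char, char)]) -> BTreeMap<char, u32> {
    // nodes = roots.map(|x| (x, 0))
    let nodes: BTreeMap<char, u32> = roots.iter().map(|&r| (r, 0)).collect();
    let mut inner = nodes.clone();
    loop {
        // concat(&nodes): every round starts from the roots at distance 0.
        let mut next = nodes.clone();
        // join with edges: each edge leaving a known node proposes
        // (destination, distance + 1).
        for &(src, dst) in edges {
            if let Some(&dist) = inner.get(&src) {
                // reduce: keep only the smallest distance per node.
                let best = next.entry(dst).or_insert(dist + 1);
                *best = (*best).min(dist + 1);
            }
        }
        // Fixed point reached: another round would change nothing.
        if next == inner {
            return inner;
        }
        inner = next;
    }
}
```

With roots `['A']` and the hypothetical edges A→B, A→C, B→D, C→D, D→E, E→A, F→A, the loop settles on distances A: 0, B: 1, C: 1, D: 2, E: 3, and F never appears because nothing reaches it. The cycle through E→A does not prevent termination, since the reduce step keeps A at distance 0.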

  42. Progress Tracking… with Loops? (have to finish iterating before we can handle next input) Have to wait while transitive graph is being discovered. [diagram: inputs t2, t1, t0; EDGE CHANGES, BFS, REACHABLE NODES, TRANSITIVE EDGES; t0 in the loop]

  43. Multidimensional Progress Tracking (track iteration depth separately) (Product Partial Order) [diagram: timestamps (t0, 1), (t1, 0), t2; EDGE CHANGES, BFS, REACHABLE NODES (t1, 0), TRANSITIVE EDGES (t0, 1)]

  44. Lexicographical Order (Join) (visibility for (t2, t2))

             t0   t1   t2   t3
        t0   ✔    ✔    ✔    ✔
        t1   ✔    ✔    ✔    ✔
        t2   ✔    ✔    ✔
        t3

  45. Product Partial Order (Iteration) (visibility for (t2, 2))

             0    1    2    3
        t0   ✔    ✔    ✔
        t1   ✔    ✔    ✔
        t2   ✔    ✔    ✔
        t3
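
The two visibility grids on slides 44 and 45 follow directly from the two comparison rules. The sketch below is plain Rust, with `lex_le`, `product_le`, and `visible` as hypothetical helper names and timestamps modelled as pairs of integers (an assumption; the slides use t0 to t3 and iteration counts 0 to 3).

```rust
/// A two-coordinate timestamp, e.g. (input time, iteration count).
type Time = (u64, u64);

/// Lexicographic order: compare the first coordinate and break ties on
/// the second. Every pair of times is comparable (a total order).
fn lex_le(a: Time, b: Time) -> bool {
    a.0 < b.0 || (a.0 == b.0 && a.1 <= b.1)
}

/// Product partial order: both coordinates must be less than or equal.
/// Some pairs are incomparable, e.g. (1, 3) and (2, 2).
fn product_le(a: Time, b: Time) -> bool {
    a.0 <= b.0 && a.1 <= b.1
}

/// All times in a `size` x `size` grid that are visible at `at`,
/// i.e. less than or equal to `at` under the given order.
fn visible(size: u64, at: Time, le: fn(Time, Time) -> bool) -> Vec<Time> {
    let mut out = Vec::new();
    for row in 0..size {
        for col in 0..size {
            if le((row, col), at) {
                out.push((row, col));
            }
        }
    }
    out
}
```

On a 4 x 4 grid with `at = (2, 2)`, the lexicographic rule marks 11 cells (two full rows plus three cells of the third) and the product rule marks the 3 x 3 block of 9 cells. The time (1, 3) is visible lexicographically but not under the product order.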

  46. Multidimensional Progress Tracking (track iteration depth separately) (Product Partial Order) [diagram: timestamps (t0, 1), (t1, 0), t2; EDGE CHANGES, BFS, REACHABLE NODES (t1, 0), TRANSITIVE EDGES (t0, 1)]

  47. Incremental Execution? Have to start from scratch for every transaction? [diagram labels: EDGE CHANGES, BFS, REACHABLE NODES, TRANSITIVE EDGES]

  48. Differential Dataflow: Iterative, incrementalized operators for Timely. github.com/TimelyDataflow

  49. Performance

  50. Streaming & Relational Queries: Declarative Differential Dataflows (3DF). github.com/comnik/declarative-dataflow

      Differential Dataflow:

      /// BFS
      let nodes = roots.map(|x| (x, 0));

      nodes.iterate(|inner| {
          let edges = edges.enter(&inner.scope());
          let nodes = nodes.enter(&inner.scope());

          inner
              .join_map(&edges, |_k, l, d| (*d, l + 1))
              .concat(&nodes)
              .reduce(|_, s, t| t.push((*s[0].0, 1)))
      })

      3DF:

      [[(bfs ?from ?to)
        [?from :edge ?to]]
       [(bfs ?from ?to)
        [?from :edge ?hop]
        (bfs ?hop ?to)]]

  51. The Trifecta! Timeliness, Consistency, Expressivity

  52. Kafka Superpowers [diagram labels: P1, P2, T1, V1, V2, V3, V4; Timely; Physical Representation, Virtualization, Virtual Partitions, Business Logic; Repartitioning, Joins, time order, Reactivity, queries; marks: ✔, ✔, ✔]

  53. Kafka Superpowers [diagram labels: P1, P2, T1, V1, V2, V3, V4; Timely, DD+3DF; Physical Representation, Virtualization, Virtual Partitions, Business Logic; Repartitioning, Joins, time order, Reactivity, queries; marks: ✔, ✔, ✔]

  54. Kafka Superpowers [diagram labels: P1, P2, T1, V1, V2, V3, V4] clockworks.io/kplex

  55. Timely as a Programming Model
      • 3DF (Streaming Relational Queries): github.com/comnik/declarative-dataflow
      • Differential Dataflow (Iterative Incrementalized Operators): github.com/TimelyDataflow
      • Timely Dataflow (Dataflows w/ Multidimensional Progress Tracking): github.com/TimelyDataflow

  56. Sources

      clockworks: www.clockworks.io, {david, malte, moritz, niko}@clockworks.io

      Repositories
      • Timely: github.com/TimelyDataflow
      • ST2: github.com/li1/snailtrail
      • 3DF: github.com/comnik/declarative-dataflow
      • Differential FAQ: github.com/eoxxs/differential-aggregate-query

      Papers
      • Naiad (Timely Dataflow): http://dl.acm.org/citation.cfm?doid=2517349.2522738
      • Differential Dataflow: http://michaelisard.com/pubs/differentialdataflow.pdf, arxiv.org/abs/1812.02639
      • SnailTrail: hdl.handle.net/20.500.11850/228581

      Talks
      • Reactive Datalog for Datomic (clojure/conj 2018): clockworks.io/2018/12/01/conj-talk.html
      • Across Time and Space (BobKonf 2019): clockworks.io/2019/03/22/across-time-space.html

      Blog Posts
      • frankmcsherry.org
      • Incremental Functional Aggregate Queries: clockworks.io/2019/07/06/Incremental-Functional-Aggregate-Queries.html
      • Dataflows you can’t refuse: clockworks.io/2019/02/10/dataflows-you-cant-refuse.html
      • Reactive Datalog with Vega: clockworks.io/2018/11/25/reactive-datalog-with-vega.html
      • Incremental Datalog with Differential Dataflows: clockworks.io/2018/09/13/incremental-datalaog.html
