
UNDERSTANDING DISTRIBUTED DATAFLOW, John Liagouris



  1. UNDERSTANDING DISTRIBUTED DATAFLOW SYSTEMS: OUTPUT EXPLANATION AND PERFORMANCE ANALYSIS
     John Liagouris, liagos@inf.ethz.ch
     Google, 3 May 2017

  2. PART I: Why is this record in the output of my distributed dataflow?
     ▸ Concise explanations of individual outputs
     ▸ On-demand output reproduction
     PART II: Why is my distributed dataflow slow?
     ▸ Bottleneck detection
     ▸ Critical path analysis

  3. COLLABORATORS
     Vasiliki Kalavri, Ralf Sager, Andrea Lattuada, Desislava Dimitrova, Zaheer Chothia, Sebastian Wicki, Frank McSherry, Moritz Hoffmann, Timothy Roscoe

  4. THE BIG PICTURE: UNDERSTANDING THE DATACENTER
     [Diagram labels: Strymon, Enterprise Datacenter, event logs]
     ‣ The volume of datacenter logs is huge
     ‣ Keeping archives is not a viable solution
     ‣ We can process logs online

  5. THE BIG PICTURE: UNDERSTANDING THE DATACENTER
     [Diagram labels: Strymon, Enterprise Datacenter, event logs]
     Strymon is a novel system able to:
     ‣ Perform deep analytics on thousands of distributed streams of event logs in parallel
     ‣ Explain its outputs interactively

  6. IDEAS IN STRYMON CAN BE GENERALIZED
     for dataflow systems (iterative analytics, streaming analytics)
     and different execution models (synchronous vs asynchronous, shared-nothing vs shared-memory)
     [Diagram labels: input stream, output stream, worker 1, worker 2]

  7. TIMELY DATAFLOW
     D. Murray, F. McSherry, M. Isard, R. Isaacs, P. Barham, M. Abadi. Naiad: A Timely Dataflow System. In SOSP, 2013.
     ▸ A streaming framework for data-parallel computations
     ▸ Cyclic dataflows
     ▸ Logical timestamps (epochs)
     ▸ Asynchronous execution
     ▸ Low latency
     DIFFERENTIAL DATAFLOW
     F. McSherry, D. Murray, R. Isaacs, M. Isard. Differential Dataflow. In CIDR, 2013.
     ▸ A high-level API on top of Timely Dataflow
     ▸ Incremental computation

  8. PART I: Why is this record in the output of my distributed dataflow?

  9. EXPLANATIONS IN DATABASES
     [Diagram labels: COMPUTATION, 1, 2, 3, PROVENANCE]

  10. THE PROBLEM: OUTPUT EXPLANATION
      [Diagram labels: INPUT, OUTPUT]

  11. THE PROBLEM: OUTPUT EXPLANATION
      THIS RECORD LOOKS WRONG!
      OUTPUT: {App 115 344} {VM 233 -22} {App 100 55} {VM 333 -124} …
      INPUT: {A 115 344} {F 233 122} {W 100 -95} {V 30 23} …


  13. THE PROBLEM: OUTPUT EXPLANATION
      THIS RECORD LOOKS WRONG!
      OUTPUT: {App 115 344} {VM 233 -22} {App 100 55} {VM 333 -124} …
      INPUT: {A 115 344} {F 233 122} {W 100 -95} {V 30 23} …
      Output explanation: A subset of the input that is sufficient to reproduce the selected subset of the output

  14. ANNOTATION-BASED TECHNIQUES
      [Diagram labels: metadata propagation, 1, 2, 3]
      ▸ Fast
      ▸ Explode in size
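
The trade-off on this slide can be illustrated with a minimal sketch of metadata propagation. Nothing here comes from the talk: the function names `annotate`, `double` and `sum_by_key` and the sample records are hypothetical, and the lineage is simply a set of input positions carried along with every record.

```python
from collections import defaultdict

def annotate(values):
    # Tag each (key, value) input with a lineage set holding its own position.
    return [(key, value, frozenset([i])) for i, (key, value) in enumerate(values)]

def double(records):
    # A per-record operator passes the lineage through unchanged.
    return [(key, 2 * value, lineage) for key, value, lineage in records]

def sum_by_key(records):
    # An aggregation must union the lineage of every record it consumes.
    totals, lineages = defaultdict(int), defaultdict(frozenset)
    for key, value, lineage in records:
        totals[key] += value
        lineages[key] |= lineage
    return {key: (totals[key], lineages[key]) for key in totals}

result = sum_by_key(double(annotate([("a", 1), ("a", 2), ("b", 3)])))
```

Reading an explanation off `result` is immediate, which is the "fast" part. The lineage attached to an aggregate grows with the number of inputs that feed it, which is where the size problem comes from.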

  15. INVERSION-BASED TECHNIQUES
      [Diagram labels: 1’, 2’, 3’]
      ▸ Small memory footprint
      ▸ Not generally applicable

  16. BACKWARD TRACING
      [Diagram labels: dependencies, 1, 2, 3]
      ▸ Small memory footprint
      ▸ Generally applicable
      ▸ Fast

  17. PROBLEM 1: TOO MUCH INFORMATION
      Use Case: Graph Reachability
      [Figure: graph with nodes 1, 2, 3, 4, 5]

  18. PROBLEM 1: TOO MUCH INFORMATION
      Use Case: Graph Reachability
      WHY IS (1,3) IN THE OUTPUT?
      ▸ Record (1,3) appears in the result
      [Figure: graph with nodes 1, 2, 3, 4, 5]

  19. PROBLEM 1: TOO MUCH INFORMATION
      Use Case: Graph Reachability
      WHY IS (1,3) IN THE OUTPUT?
      ▸ Record (1,3) appears in the result
      ▸ Naive backward tracing returns as an explanation all edges of the graph
      [Figure: graph with nodes 1, 2, 3, 4, 5]

  20. PROBLEM 1: TOO MUCH INFORMATION
      Use Case: Graph Reachability
      WHY IS (1,3) IN THE OUTPUT?
      ▸ Record (1,3) appears in the result
      ▸ Naive backward tracing returns as an explanation all edges of the graph
      ▸ A shortest path suffices
      [Figure: graph with nodes 1, 2, 3, 4, 5]
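
The gap between the two explanations can be sketched in a few lines of Python. The edge list is hypothetical (the figure shows nodes 1 to 5, but its edges are not recoverable from this transcript), and breadth-first search is used here only as a convenient way to obtain one shortest path, not as the mechanism the system itself uses.

```python
from collections import deque

# Hypothetical directed edges over nodes 1..5; not the graph from the slide.
EDGES = [(1, 2), (2, 3), (1, 4), (4, 5), (5, 3), (2, 5)]

def reachable(adj, start):
    # Plain breadth-first traversal returning every node reachable from start.
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def naive_explanation(edges, src, dst):
    # Every edge lying on some path from src to dst contributes to (src, dst).
    fwd, bwd = {}, {}
    for u, v in edges:
        fwd.setdefault(u, []).append(v)
        bwd.setdefault(v, []).append(u)
    from_src, to_dst = reachable(fwd, src), reachable(bwd, dst)
    return [(u, v) for u, v in edges if u in from_src and v in to_dst]

def shortest_path_explanation(edges, src, dst):
    # One shortest path is already enough to reproduce (src, dst).
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
    parent, queue = {src: None}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            break
        for nxt in adj.get(node, []):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    if dst not in parent:
        return []
    path, node = [], dst
    while parent[node] is not None:
        path.append((parent[node], node))
        node = parent[node]
    return path[::-1]

naive = naive_explanation(EDGES, 1, 3)
concise = shortest_path_explanation(EDGES, 1, 3)
```

On this toy graph `naive` contains all six edges, while `concise` contains only the two edges of the path 1, 2, 3.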

  21. PROBLEM 2: NOT ENOUGH INFORMATION
      Use Case: Word Set Difference
      A: THE QUICK BROWN FOX …
      B: THE LAZY DOG …

  22. PROBLEM 2: NOT ENOUGH INFORMATION
      Use Case: Word Set Difference
      WHY ARE ONLY 3 WORDS UNIQUE TO DOCUMENT A?
      ▸ Record (doc A, 3 unique words) appears in the result
      A: THE QUICK BROWN FOX … → (doc A, 3 unique words)
      B: THE LAZY DOG … → (doc B, 2 unique words)

  23. PROBLEM 2: NOT ENOUGH INFORMATION
      Use Case: Word Set Difference
      WHY ARE ONLY 3 WORDS UNIQUE TO DOCUMENT A?
      ▸ Record (doc A, 3 unique words) appears in the result
      ▸ Naive backward tracing returns as an explanation only the words of doc A
      A: THE QUICK BROWN FOX … → (doc A, 3 unique words)
      B: THE LAZY DOG … → (doc B, 2 unique words)

  24. PROBLEM 2: NOT ENOUGH INFORMATION
      Use Case: Word Set Difference
      WHY ARE ONLY 3 WORDS UNIQUE TO DOCUMENT A?
      ▸ Record (doc A, 3 unique words) appears in the result
      ▸ Naive backward tracing returns as an explanation only the words of doc A
      ▸ We also need the words of doc B to reproduce the record (doc A, 3 unique words)
      A: THE QUICK BROWN FOX … → (doc A, 3 unique words)
      B: THE LAZY DOG … → (doc B, 2 unique words)
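
A small replay makes the problem concrete. The sketch below assumes the documents contain only the words visible on the slide (the elided "…" parts are left out), and `unique_word_counts` is a hypothetical helper written for this illustration.

```python
def unique_word_counts(docs):
    # docs maps a document name to its text; the result maps each name to
    # the number of its words that appear in no other document.
    words = {name: set(text.split()) for name, text in docs.items()}
    counts = {}
    for name, own in words.items():
        others = set().union(*(w for n, w in words.items() if n != name))
        counts[name] = len(own - others)
    return counts

full = unique_word_counts({"A": "THE QUICK BROWN FOX", "B": "THE LAZY DOG"})
only_a = unique_word_counts({"A": "THE QUICK BROWN FOX"})
```

With both documents the count for A is 3. Replaying on the words of doc A alone yields 4, because THE is no longer cancelled by doc B, so that subset of the input does not reproduce the record being explained.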

  25. CAN WE SOLVE BOTH PROBLEMS?
      Yes! Given that the system is able to:
      ▸ Keep track of the exact point in the computation at which a data record was produced
      ▸ Detect divergent records when replaying the computation on a subset of the input
      We exploit the main features of Differential Dataflow

  26. EXPLANATIONS WITH DIFFERENTIAL DATAFLOW
      Original dataflow: INPUT, Op A, Op B, Op C, OUTPUT

  27. EXPLANATIONS WITH DIFFERENTIAL DATAFLOW
      Original dataflow: INPUT, Op A, Op B, Op C, OUTPUT
      Explanation dataflow: INPUT, Join, Join, Join, OUTPUT
      Augment the original dataflow with a shadow dataflow

  28. ITERATIVE BACKWARD TRACING
      Original dataflow: INPUT, Op A, Op B, Op C, OUTPUT
      Explanation dataflow: EXPL QUERY, Join, Join, Join

  29. ITERATIVE BACKWARD TRACING
      Original dataflow: INPUT, Op A, Op B, Op C, OUTPUT
      Explanation dataflow: EXPL QUERY, Join, Join, Join
      Step: Trace backwards

  30. ITERATIVE BACKWARD TRACING
      Original dataflow: INPUT, Op A, Op B, Op C, OUTPUT
      Explanation dataflow: EXPL QUERY, Join, Join, Join
      Step: Replay, then compare

  31. ITERATIVE BACKWARD TRACING
      Original dataflow: INPUT, Op A, Op B, Op C, OUTPUT
      Explanation dataflow: EXPL QUERY, Join, Join, Join
      Compared records: (k1, v), (k2, v’), … versus (k1, v), (k2, v’’), …
      Step: Trace divergent records backwards

  32. ITERATIVE BACKWARD TRACING
      Original dataflow: INPUT, Op A, Op B, Op C, OUTPUT
      Explanation dataflow: EXPL QUERY, Join, Join, Join
      Step: Replay again (for the new records), then compare
      Repeat until a fix-point
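
The loop on these slides (trace backwards, replay, compare, trace the divergent records, repeat until a fix-point) can be sketched in plain Python on the word set difference use case. This is only an illustration under stated assumptions: `run`, `trace` and `explain` and the stage tags "group", "unique" and "count" are hypothetical names, the input is limited to the words visible on the slides, and the shadow dataflow of joins in Differential Dataflow is not modelled.

```python
from collections import defaultdict

def run(inputs):
    # Word set difference over (word, doc) pairs; returns the records of all stages.
    docs_of = defaultdict(set)
    for word, doc in inputs:
        docs_of[word].add(doc)
    grouped = {("group", w, frozenset(d)) for w, d in docs_of.items()}
    unique = {("unique", w, next(iter(d))) for _, w, d in grouped if len(d) == 1}
    per_doc = defaultdict(int)
    for _, _, doc in unique:
        per_doc[doc] += 1
    counts = {("count", doc, n) for doc, n in per_doc.items()}
    return grouped | unique | counts

def trace(record, inputs):
    # Naive backward tracing: a count depends on the words of its document,
    # any other record depends on the inputs carrying the same word.
    stage, key, _ = record
    if stage == "count":
        return {(w, d) for w, d in inputs if d == key}
    return {(w, d) for w, d in inputs if w == key}

def explain(inputs, query):
    original = run(inputs)
    explanation = trace(query, inputs)           # trace backwards
    while True:
        divergent = run(explanation) - original  # replay and compare
        extra = set()
        for record in divergent:                 # trace divergent records backwards
            extra |= trace(record, inputs)
        extra -= explanation
        if not extra:                            # fix-point: nothing new to add
            return explanation
        explanation |= extra

INPUTS = ({(w, "A") for w in "THE QUICK BROWN FOX".split()}
          | {(w, "B") for w in "THE LAZY DOG".split()})
explanation = explain(INPUTS, ("count", "A", 3))
```

The first trace returns the four words of doc A. Replaying on them produces the divergent records for THE and a count of 4, tracing those backwards adds (THE, B), and the next replay matches the original, so the loop stops with five input records.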

  33. EXAMPLE: EXPLAINING OUTPUTS OF WORD SET DIFFERENCE
      A: THE QUICK BROWN FOX …
      B: THE LAZY DOG …

  34. EXAMPLE: EXPLAINING OUTPUTS OF WORD SET DIFFERENCE
      A: THE QUICK BROWN FOX … → MAP → (THE, A) (BROWN, A) (FOX, A)
      B: THE LAZY DOG … → MAP → (THE, B) (LAZY, B) (DOG, B)

  35. EXAMPLE: EXPLAINING OUTPUTS OF WORD SET DIFFERENCE
      A: THE QUICK BROWN FOX … → MAP → (THE, A) (BROWN, A) (FOX, A)
      B: THE LAZY DOG … → MAP → (THE, B) (LAZY, B) (DOG, B)
      GROUP → (THE, [A,B]) (BROWN, A) (FOX, A) (LAZY, B) (DOG, B)

  36. EXAMPLE: EXPLAINING OUTPUTS OF WORD SET DIFFERENCE
      A: THE QUICK BROWN FOX … → MAP → (THE, A) (BROWN, A) (FOX, A)
      B: THE LAZY DOG … → MAP → (THE, B) (LAZY, B) (DOG, B)
      GROUP → (THE, [A,B]) (BROWN, A) (FOX, A) (LAZY, B) (DOG, B)
      FILTER → (BROWN, A) (FOX, A) (LAZY, B) (DOG, B)

  37. EXAMPLE: EXPLAINING OUTPUTS OF WORD SET DIFFERENCE
      A: THE QUICK BROWN FOX … → MAP → (THE, A) (BROWN, A) (FOX, A)
      B: THE LAZY DOG … → MAP → (THE, B) (LAZY, B) (DOG, B)
      GROUP → (THE, [A,B]) (BROWN, A) (FOX, A) (LAZY, B) (DOG, B)
      FILTER → (BROWN, A) (FOX, A) (LAZY, B) (DOG, B)
      GROUP → (A, 3) (B, 2)


  39. EXAMPLE: EXPLAINING OUTPUTS OF WORD SET DIFFERENCE
      A: THE QUICK BROWN FOX … → MAP → (THE, A) (BROWN, A) (FOX, A)
      B: THE LAZY DOG … → MAP → (THE, B) (LAZY, B) (DOG, B)
      GROUP → (THE, [A,B]) (BROWN, A) (FOX, A) (LAZY, B) (DOG, B)
      FILTER → (BROWN, A) (FOX, A) (LAZY, B) (DOG, B)
      GROUP → (A, 3) (B, 2)
      Additional record shown: (THE, A)


  41. RESULTS: EXPLAINING CONNECTED COMPONENTS
      ▸ Dataset: A subset of the Twitter graph with 1B edges
      ▸ Algorithm: Label propagation
      ▸ Output: Records of the form (A,B) denoting that nodes A and B belong to the same connected component
      ▸ System used: Differential Dataflow
      ▸ Machine used: Intel Xeon E5-4640 at 2.4GHz with 32 cores and 500 GB RAM
      More results: Z. Chothia, J. Liagouris, F. McSherry, T. Roscoe. Explaining Outputs in Modern Data Analytics. PVLDB 9(12):1137-1148, 2016.
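
For readers unfamiliar with the algorithm named above, label propagation can be sketched as follows. This is a sequential toy version on a hypothetical edge list, not the Differential Dataflow program used in the experiment. Every node starts with its own identifier as its label and repeatedly adopts the smallest label seen across an edge until nothing changes.

```python
def connected_components(edges):
    # Edges are treated as undirected; each node starts labelled with itself.
    label = {node: node for edge in edges for node in edge}
    changed = True
    while changed:
        changed = False
        for u, v in edges:
            low = min(label[u], label[v])
            for node in (u, v):
                if label[node] > low:
                    label[node] = low
                    changed = True
    # Each (A, B) record states that A and B are in the same component.
    return sorted(label.items())

components = connected_components([(1, 2), (2, 3), (4, 5)])
```

Here the output records pair each node with the smallest node identifier in its component, which matches the (A,B) form described on the slide.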

  42. EXPLAINING CONNECTED COMPONENTS
      [Figure label: Twitter]

  43. PART II: Why is my distributed dataflow slow?

  44. DISTRIBUTED DATAFLOWS
      [Diagram labels: client, scheduler, Apache Flink, W1, Naiad, W1]
