
The Dataflow Model: How can we process unbounded data?



  1. The Dataflow Model

  2. Problem • How can we process unbounded data? • Example: track user activity on a website

  3. Key ideas • Windowing: fixed windows, sliding windows, sessions • Time domains: event time, processing time • Triggers

  4. Contribution • Dataflow API: easily build pipelines with your choice of windowing, time domain, and trigger • Independent of the execution engine: choose batch, micro-batch, or streaming depending on tradeoffs

  5. Windowing

  6. Types of windows • Fixed windows • Sliding windows • Sessions

  7. Fixed windows
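A fixed window can be sketched as a pure function of the event timestamp. The snippet below is a minimal illustration, not the Dataflow API: the class name `FixedWindowSketch`, the method `windowStart`, and the choice to align windows to the Unix epoch are all assumptions made for this example.

```java
import java.time.Duration;
import java.time.Instant;

class FixedWindowSketch {
    /** Returns the start of the fixed window that contains eventTime (hypothetical helper). */
    static Instant windowStart(Instant eventTime, Duration size) {
        long sizeMs = size.toMillis();
        long t = eventTime.toEpochMilli();
        // Assumption: windows are aligned to the epoch. floorMod keeps the
        // result correct for timestamps before 1970 as well.
        return Instant.ofEpochMilli(t - Math.floorMod(t, sizeMs));
    }

    public static void main(String[] args) {
        Duration size = Duration.ofMinutes(2);
        Instant event = Instant.parse("2015-01-01T12:03:30Z");
        Instant start = windowStart(event, size);
        // The window is the half-open interval [start, start + size).
        System.out.println(start + " .. " + start.plus(size));
    }
}
```

Every event falls into exactly one window, which is what distinguishes fixed windows from the sliding and session windows on the next slides.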

  8. Sliding windows • Example: compute a running average over the past 5 minutes of data
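With sliding windows, one event belongs to several overlapping windows. The sketch below assumes a window size of 5 minutes that advances every minute; the slide gives only the 5-minute size, so the 1-minute period, the epoch alignment, and the names `SlidingWindowSketch` and `windowStarts` are illustrative assumptions.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

class SlidingWindowSketch {
    /** Returns the starts of all windows [start, start + size) that contain eventTime. */
    static List<Instant> windowStarts(Instant eventTime, Duration size, Duration period) {
        long t = eventTime.toEpochMilli();
        long sizeMs = size.toMillis();
        long periodMs = period.toMillis();
        // Latest window start at or before the event (windows aligned to the epoch).
        long lastStart = t - Math.floorMod(t, periodMs);
        List<Instant> starts = new ArrayList<>();
        // Walk backwards while the window still covers the event.
        for (long start = lastStart; start > t - sizeMs; start -= periodMs) {
            starts.add(Instant.ofEpochMilli(start));
        }
        return starts;
    }

    public static void main(String[] args) {
        Instant event = Instant.parse("2015-01-01T12:03:30Z");
        // Prints one window start per line, newest first.
        windowStarts(event, Duration.ofMinutes(5), Duration.ofMinutes(1))
            .forEach(System.out::println);
    }
}
```

When the period equals the size, each event lands in a single window and the scheme reduces to fixed windows.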

  9. Session windows • Example: YouTube viewing sessions
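Session windows have no fixed boundaries: they are delimited by gaps of inactivity. A minimal sketch, assuming timestamps are plain seconds for a single user and that a silence of at least `gapSeconds` ends a session (the class and method names are hypothetical):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SessionWindowSketch {
    /** Groups event times into sessions; each result is {firstEvent, lastEvent}. */
    static List<long[]> sessions(long[] eventTimes, long gapSeconds) {
        long[] sorted = eventTimes.clone();
        Arrays.sort(sorted); // events may arrive out of order
        List<long[]> result = new ArrayList<>();
        for (long t : sorted) {
            long[] last = result.isEmpty() ? null : result.get(result.size() - 1);
            if (last != null && t - last[1] < gapSeconds) {
                last[1] = t; // close enough to the previous event: extend the session
            } else {
                result.add(new long[] {t, t}); // gap reached: start a new session
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (long[] s : sessions(new long[] {200, 0, 50, 30, 230}, 60)) {
            System.out.println(s[0] + " .. " + s[1]);
        }
    }
}
```

This batch-style version sorts all events up front. A streaming system cannot do that, which is why sessions there are built by merging windows as data arrives.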

  10. Time domains • For many applications, windows should be based on “event time” (when the events actually occur) • Example: billing YouTube advertisers • Lag, partitions, etc. might cause an event to be processed later than its event time • “Processing time” is when the system observes the event

  11. Challenge: time skew

  12. Goal: Event-time windows • Fixed windows • Session windows

  13. Challenge: completion • With event times, how does the system know if it has received all of the data in a window? • Example: phones might watch YouTube videos (and ads) offline

  14. Watermarks • Heuristics that tell the system when it is likely to have received most of the data in a window • Based on global progress metrics • Watermarks alone are insufficient: late data might arrive behind the watermark, and a single late datum can hold the watermark back and increase latency for the whole system
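The two failure modes above can be seen in a toy watermark. The sketch below uses one simple heuristic chosen for illustration (the largest event time seen so far minus a fixed allowed skew). It is not the heuristic of any particular system, and `WatermarkSketch`, `observe`, and `allowedSkewSeconds` are hypothetical names.

```java
class WatermarkSketch {
    final long allowedSkewSeconds;
    long maxEventTime = Long.MIN_VALUE; // no events seen yet

    WatermarkSketch(long allowedSkewSeconds) {
        this.allowedSkewSeconds = allowedSkewSeconds;
    }

    /** Current estimate: event times below this value are assumed to be complete. */
    long watermark() {
        return maxEventTime == Long.MIN_VALUE
            ? Long.MIN_VALUE
            : maxEventTime - allowedSkewSeconds;
    }

    /** Records an event and reports whether it arrived behind the watermark (late). */
    boolean observe(long eventTime) {
        boolean late = eventTime < watermark();
        maxEventTime = Math.max(maxEventTime, eventTime);
        return late;
    }

    public static void main(String[] args) {
        WatermarkSketch w = new WatermarkSketch(60);
        for (long t : new long[] {100, 300, 200, 250}) {
            System.out.println("event " + t + " late=" + w.observe(t));
        }
    }
}
```

A small skew allowance marks more data as late, and a large one delays every window. That tension is the motivation for triggers.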

  15. Incremental processing • Difficult to get the single best result from a window • Instead, let windows produce multiple results (improving incrementally over time)

  16. Triggers • Triggers specify when to output window results: at the watermark, at a percentile watermark, every minute, etc. • Triggers specify how to output results: discard the previous window contents, accumulate, or accumulate and retract • Triggers are composable
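The three output modes differ in what a downstream consumer sees when a window fires more than once. The following sketch is not the Dataflow API; `RefinementModes`, `emit`, and `downstreamSum` are hypothetical names, and the pane sums 3, 4, 5 are made-up values. It shows why a consumer that sums its input needs either discarding or retractions.

```java
import java.util.ArrayList;
import java.util.List;

class RefinementModes {
    enum Mode { DISCARDING, ACCUMULATING, ACCUMULATING_AND_RETRACTING }

    /** Values a single window emits when its trigger fires once per pane. */
    static List<Integer> emit(Mode mode, int[] paneSums) {
        List<Integer> out = new ArrayList<>();
        int total = 0;
        boolean fired = false;
        for (int pane : paneSums) {
            int previous = total;
            total += pane;
            switch (mode) {
                case DISCARDING:
                    out.add(pane); // only the data since the last firing
                    break;
                case ACCUMULATING:
                    out.add(total); // everything seen so far
                    break;
                case ACCUMULATING_AND_RETRACTING:
                    if (fired) {
                        out.add(-previous); // cancel the previously emitted result
                    }
                    out.add(total);
                    break;
            }
            fired = true;
        }
        return out;
    }

    /** A downstream stage that naively adds up whatever it receives. */
    static int downstreamSum(List<Integer> emitted) {
        int sum = 0;
        for (int v : emitted) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] panes = {3, 4, 5};
        for (Mode mode : Mode.values()) {
            List<Integer> out = emit(mode, panes);
            System.out.println(mode + " " + out + " downstream=" + downstreamSum(out));
        }
    }
}
```

With plain accumulation the downstream sum double-counts earlier panes, whereas discarding and accumulate-and-retract both leave it equal to the true total.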

  17. Examples

  18. Figure 5: Example Inputs (plot of the input values by event time, 12:01 to 12:08, against processing time, 12:06 to 12:09, with the actual and ideal watermarks)

  19. PCollection<KV<String, Integer>> output = input
        .apply(Window.trigger(Repeat(AtPeriod(1, MINUTE)))
                     .accumulating())
        .apply(Sum.integersPerKey());

      Figure 7: GlobalWindows, AtPeriod, Accumulating (event time against processing time; the successive one-minute firings output 12, 22, 33, and 51)

  20. PCollection<KV<String, Integer>> output = input
        .apply(Window.trigger(Repeat(AtPeriod(1, MINUTE)))
                     .discarding())
        .apply(Sum.integersPerKey());

      Figure 8: GlobalWindows, AtPeriod, Discarding (event time against processing time; the successive one-minute firings output 12, 10, 11, and 18)
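The outputs in Figures 7 and 8 can be reproduced with a few lines of plain Java. This is a simulation rather than a Dataflow pipeline: `PeriodicTriggerSketch` and `fire` are hypothetical names, and the grouping of the ten input values into processing-time minutes is inferred from the per-minute sums shown in the figures.

```java
import java.util.Arrays;

class PeriodicTriggerSketch {
    // Input values from Figure 5, grouped by the processing-time minute in
    // which they arrive (grouping inferred from the outputs in Figures 7 and 8).
    static final int[][] ARRIVALS = {
        {5, 7},    // 12:05 to 12:06
        {3, 4, 3}, // 12:06 to 12:07
        {8, 3},    // 12:07 to 12:08
        {9, 8, 1}, // 12:08 to 12:09
    };

    /** Simulates a trigger that fires at the end of every processing-time minute. */
    static int[] fire(int[][] arrivalsPerMinute, boolean accumulating) {
        int[] outputs = new int[arrivalsPerMinute.length];
        int state = 0; // running sum held by the single global window
        for (int minute = 0; minute < arrivalsPerMinute.length; minute++) {
            for (int value : arrivalsPerMinute[minute]) {
                state += value;
            }
            outputs[minute] = state;
            if (!accumulating) {
                state = 0; // discarding mode forgets what it has already emitted
            }
        }
        return outputs;
    }

    public static void main(String[] args) {
        System.out.println("accumulating: " + Arrays.toString(fire(ARRIVALS, true)));
        System.out.println("discarding:   " + Arrays.toString(fire(ARRIVALS, false)));
    }
}
```

Because the window is global and the trigger is driven by processing time, event times play no part here, and the results depend entirely on when the data happened to arrive.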

  21. PCollection<KV<String, Integer>> output = input
        .apply(Window.into(FixedWindows.of(2, MINUTES))
                     .trigger(Repeat(AtWatermark()))
                     .accumulating())
        .apply(Sum.integersPerKey());

      Let’s run this pipeline under the three execution engines: batch, micro-batch, and streaming.

  22. Figure 10: FixedWindows, Batch (event time against processing time, with the actual and ideal watermarks)

  23. Figure 11: FixedWindows, Micro-Batch (event time against processing time, with the actual and ideal watermarks)

  24. Figure 12: FixedWindows, Streaming (event time against processing time, with the actual and ideal watermarks)

  25. Figure 13: FixedWindows, Streaming, Partial (event time against processing time, with the actual and ideal watermarks)

  26. PCollection<KV<String, Integer>> output = input
        .apply(Window.into(Sessions.withGapDuration(1, MINUTE))
                     .trigger(SequenceOf(
                         RepeatUntil(
                             AtPeriod(1, MINUTE),
                             AtWatermark()),
                         Repeat(AtWatermark())))
                     .accumulatingAndRetracting())
        .apply(Sum.integersPerKey());

      Figure 14: Sessions, Retracting (event time against processing time, with the actual and ideal watermarks; retractions appear as negative values)
