
Have Your Cake & Eat It Too: Further Dispelling the Myths of the Lambda Architecture


  1. Have Your Cake & Eat It Too: Further Dispelling the Myths of the Lambda Architecture. Tyler Akidau, Staff Software Engineer. Google Docs version of slides (with animations) available at: http://goo.gl/eX5kxa

  2. MillWheel - Stream Processing System; Streaming Flume - High-level API; Cloud Dataflow - Data Processing Service

  3. Google Cloud Dataflow [Figure: diagram labeled Optimize, Schedule, and GCS (twice)]

  4. MillWheel - Slava Chernyak, Josh Haberman, Reuven Lax, Daniel Mills, Paul Nordstrom, Sam McVeety, Sam Whittle, and more... Streaming Flume - Robert Bradshaw, Daniel Mills, and more... Cloud Dataflow - Robert Bradshaw, Craig Chambers, Reuven Lax, Daniel Mills, Frances Perry, and more...

  5. Cloud Dataflow is unreleased. Things may change.

  6. Agenda: (1) Lambda vs Streaming; (2) Strong Consistency; (3) Reasoning About Time

  7. Part 1: Lambda vs Streaming

  8. http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html

  9. The Lambda Architecture

  10. http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html

  11. The Evolution of Streaming

  12. What does it take? Strong consistency; tools for reasoning about time

  13. Part 2: Strong Consistency

  14. Consistent Storage [Figure: storage diagram]

  15. Why consistency is important • Mostly correct is not good enough • Required for exactly-once processing • Required for repeatable results • Cannot replace batch without it

  16. How? • Sequencers (e.g. BigTable) • Leases (e.g. Spanner) • Federation of storage silos (e.g. Samza, Dataflow) • RDDs (e.g. Spark)

  17. http://research.google.com/pubs/pub41378.html

  18. Part 3: Reasoning About Time

  19. Event Time vs Stream Time; Batch vs Streaming; Approaches; Dataflow API

  20. Event Time - when events happened. Stream Time - when events are processed.

  21. Batch vs Streaming

  22. Batch [Figure: input processed by MapReduce]

  23. Batch: Fixed Windows [Figure: input divided into hourly windows, from [10:00 - 11:00) through [23:00 - 0:00), processed by MapReduce]

  24. Batch: User Sessions [Figure: activity of users Joan, Larry, Ingo, Amanda, Cheryl, and Arthur across the windows [10:00 - 11:00) and [11:00 - 12:00), processed by MapReduce]

  25. Streaming [Figure: unbounded input along a time axis, 10:00 to 16:00]

  26. Confounding characteristics of data streams: unordered; unbounded; of varying event time skew

  27. Event Time Skew [Figure: axes Stream Time and Event Time; the gap between them is labeled Skew]

  28. Approaches

  29. Approaches to reasoning about time: (1) Time-Agnostic Processing; (2) Approximation; (3) Stream Time Windowing; (4) Event Time Windowing

  30. Approach 1: Time-Agnostic Processing - Filters. [Figure: stream time axis, 10:00 to 16:00] Example Input: Web server traffic logs. Example Output: All traffic from specific domains. Pros: Straightforward; efficient. Cons: Limited utility.

  31. Approach 1: Time-Agnostic Processing - Hash Join. [Figure: stream time axis, 10:00 to 16:00] Example Input: Query & Click traffic. Example Output: Joined stream of Query + Click pairs. Pros: Straightforward; efficient. Cons: Limited utility.

  32. Approach 2: Approximation via Online Algorithms. [Figure: stream time axis, 10:00 to 16:00] Example Input: Twitter hashtags. Example Output: Approximate top N hashtags per prefix. Pros: Efficient. Cons: Inexact; complicated algorithms.
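As a rough illustration of the kind of online algorithm this slide refers to, the sketch below uses the Misra-Gries frequent-items algorithm. The slides do not say which algorithm was used, so this choice, the class name `FrequentItemsSketch`, and the sample hashtags are all assumptions, and the per-prefix grouping from the example is left out.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Hypothetical sketch: Misra-Gries frequent-items summary in bounded memory.
final class FrequentItemsSketch {
    private final int capacity;                       // maximum number of counters kept
    private final Map<String, Long> counters = new HashMap<>();

    FrequentItemsSketch(int capacity) {
        this.capacity = capacity;
    }

    /** Processes one item from the stream; never holds more than `capacity` counters. */
    void offer(String item) {
        if (counters.containsKey(item)) {
            counters.merge(item, 1L, Long::sum);
        } else if (counters.size() < capacity) {
            counters.put(item, 1L);
        } else {
            // No free counter: decrement every counter and drop those that reach zero.
            Iterator<Map.Entry<String, Long>> it = counters.entrySet().iterator();
            while (it.hasNext()) {
                Map.Entry<String, Long> e = it.next();
                if (e.getValue() == 1L) {
                    it.remove();
                } else {
                    e.setValue(e.getValue() - 1);
                }
            }
        }
    }

    /** Candidate heavy hitters; the counts are lower bounds, not exact frequencies. */
    Map<String, Long> candidates() {
        return new HashMap<>(counters);
    }

    public static void main(String[] args) {
        FrequentItemsSketch sketch = new FrequentItemsSketch(2);
        for (String tag : new String[] {"#a", "#b", "#a", "#c", "#a", "#d", "#a"}) {
            sketch.offer(tag);
        }
        System.out.println(sketch.candidates());
    }
}
```

With `capacity` counters, any item occurring more than n / (capacity + 1) times in a stream of n items is guaranteed to remain among the candidates, which is where both the efficiency and the inexactness named on the slide come from.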

  33. Approach 3: Windowing by Stream Time. [Figure: stream time axis, 10:00 to 16:00] Example Input: Web server request traffic. Example Output: Per-minute rate of received requests. Pros: Straightforward; results reflect contents of stream. Cons: Results don't reflect events as they happened; if approximating event time, usefulness varies.

  34. Approach 4: Windowing by Event Time - Fixed Windows. [Figure: stream time axis and event time axis, each 10:00 to 16:00] Example Input: Twitter hashtags. Example Output: Top N hashtags by prefix per hour. Pros: Reflects events as they occurred. Cons: More complicated buffering; completeness issues.
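Fixed event-time windowing can be sketched as assigning each record to a window based on its own timestamp rather than its arrival time. This is a minimal illustration, not the Dataflow implementation; `FixedWindowSketch`, `windowStart`, `sumPerWindow`, and the sample timestamps are hypothetical, and windows are assumed to be aligned to the Unix epoch.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: assign records to fixed windows by event time.
final class FixedWindowSketch {
    /** Start of the window [start, start + size) that contains eventTime. */
    static Instant windowStart(Instant eventTime, Duration size) {
        long sizeMs = size.toMillis();
        long t = eventTime.toEpochMilli();
        return Instant.ofEpochMilli(t - Math.floorMod(t, sizeMs));
    }

    /** Sums values per event-time window, independent of arrival order. */
    static Map<Instant, Long> sumPerWindow(List<Map.Entry<Instant, Long>> events, Duration size) {
        Map<Instant, Long> sums = new TreeMap<>();
        for (Map.Entry<Instant, Long> e : events) {
            sums.merge(windowStart(e.getKey(), size), e.getValue(), Long::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        // Listed in arrival order; the second record arrives out of event-time order.
        List<Map.Entry<Instant, Long>> events = List.of(
            Map.entry(Instant.parse("2015-01-01T10:03:10Z"), 5L),
            Map.entry(Instant.parse("2015-01-01T10:00:30Z"), 2L),
            Map.entry(Instant.parse("2015-01-01T10:01:59Z"), 7L));
        System.out.println(sumPerWindow(events, Duration.ofMinutes(2)));
    }
}
```

The sketch sees a bounded list, so it sidesteps the two costs listed on the slide: a streaming system has to buffer per-window state and decide when a window is complete.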

  35. Approach 4: Windowing by Event Time - Sessions. [Figure: stream time axis and event time axis, each 10:00 to 16:00] Example Input: User activity stream. Example Output: Per-session group of activities. Pros: Reflects events as they occurred. Cons: More complicated buffering; completeness issues.
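Session windowing groups a user's activity into bursts separated by a gap of inactivity. The sketch below shows the idea over a bounded buffer of one user's event timestamps; `SessionSketch`, the millisecond representation, and the rule that a gap of at least `gapMs` starts a new session are assumptions for illustration. A streaming system would instead merge sessions incrementally as out-of-order records arrive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: split one user's event timestamps into sessions by inactivity gap.
final class SessionSketch {
    /** Returns sessions in event-time order; a gap of at least gapMs starts a new session. */
    static List<List<Long>> sessions(long[] eventTimesMs, long gapMs) {
        long[] sorted = eventTimesMs.clone();
        Arrays.sort(sorted);  // arrival order is irrelevant; only event time matters
        List<List<Long>> result = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        for (long t : sorted) {
            if (!current.isEmpty() && t - current.get(current.size() - 1) >= gapMs) {
                result.add(current);
                current = new ArrayList<>();
            }
            current.add(t);
        }
        if (!current.isEmpty()) {
            result.add(current);
        }
        return result;
    }

    public static void main(String[] args) {
        long[] arrivals = {0L, 30_000L, 200_000L, 50_000L};  // 50_000 arrives out of order
        System.out.println(sessions(arrivals, 60_000L));
    }
}
```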

  36. Dataflow API

  37. What are you computing? Where in event time? When in stream time?

  38. What = Aggregation API; Where = Windowing API; When = Watermarks + Triggers API

  39. Aggregation API

          PCollection<KV<String, Double>> sums = Pipeline
              .begin()
              .read("userRequests")
              .apply(new Sum());

  40. Aggregation API [Figure: input values passed through Sum]

  41. Streaming Mode [Figure: the same input values plotted by stream time and by event time, 10:00 to 10:06]

  42. Windowing API

          PCollection<KV<String, Long>> sums = Pipeline
              .begin()
              .read("userRequests")
              .apply(Window.into(new FixedWindows(2, MINUTE)))
              .apply(new Sum());

  43. Windowing API [Figure: input values plotted by stream time and by event time, 10:00 to 10:06; FixedWindows followed by Sum yields per-window sums in event time]

  44. Watermarks: f(S) -> E, where S = a point in stream time (i.e. now) and E = the point in event time up to which input data is complete as of S

  45. Event Time Skew [Figure: axes Stream Time and Event Time]

  46. Watermarks [Figure: input values plotted by stream time and by event time; FixedWindows followed by Sum yields per-window sums, with the watermark shown against event time]

  47. Watermark Caveats: too slow = more latency; too fast = late data
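The watermark function f(S) -> E and its two caveats can be illustrated with a simple heuristic: take the largest event time seen so far and subtract an assumed bound on skew. The slides do not describe how Dataflow or MillWheel compute watermarks, so `WatermarkSketch`, `maxSkewMs`, and this heuristic are assumptions made only to show the trade-off.

```java
// Hypothetical sketch: heuristic watermark = max observed event time - assumed skew bound.
final class WatermarkSketch {
    private final long maxSkewMs;                    // assumed upper bound on event time skew
    private long maxEventTimeMs = Long.MIN_VALUE;    // nothing observed yet

    WatermarkSketch(long maxSkewMs) {
        this.maxSkewMs = maxSkewMs;
    }

    /** Records one event; returns true if it is late relative to the current watermark. */
    boolean observe(long eventTimeMs) {
        boolean late = eventTimeMs < watermark();
        maxEventTimeMs = Math.max(maxEventTimeMs, eventTimeMs);
        return late;
    }

    /** E: the event time up to which input is estimated to be complete as of now. */
    long watermark() {
        return maxEventTimeMs == Long.MIN_VALUE ? Long.MIN_VALUE : maxEventTimeMs - maxSkewMs;
    }

    /** A window ending at windowEndMs is considered complete once the watermark passes it. */
    boolean isWindowComplete(long windowEndMs) {
        return watermark() >= windowEndMs;
    }

    public static void main(String[] args) {
        WatermarkSketch w = new WatermarkSketch(60_000L);
        w.observe(600_000L);
        System.out.println("watermark = " + w.watermark());
        System.out.println("late? " + w.observe(500_000L));
    }
}
```

A larger `maxSkewMs` makes the watermark slower, so windows close later (more latency); a smaller value makes it faster, so more records are classified as late.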

  48. Triggers: when in stream time to emit?

  49. Triggers API

          PCollection<KV<String, Long>> sums = Pipeline
              .begin()
              .read("userRequests")
              .apply(Window.into(new FixedWindows(2, MINUTE))
                  .trigger(new AtWatermark()))
              .apply(new Sum());

  50. [Figure: per-window sums emitted at the watermark, plotted by stream time (10:00 to 10:06, plus a late datum arriving at 13:00) against event time (10:00 to 10:06)]

  51. A Better Strategy: (1) once per stream time minute; (2) at watermark; (3) once per record for two weeks
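The three-phase strategy above can be sketched as a small state machine for a single window: periodic early results, one on-time result when the watermark passes the end of the window, then one result per late record until two weeks have passed. This is an illustration only and not the Dataflow trigger implementation; `TriggerSketch`, its callback names, and the string-labeled outputs are hypothetical.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: early / on-time / late emission for one window's running sum.
final class TriggerSketch {
    static final long ALLOWED_LATENESS_MS = Duration.ofDays(14).toMillis();

    private final long windowEndMs;
    private final List<String> emitted = new ArrayList<>();
    private long sum = 0;
    private boolean onTimeFired = false;
    private boolean expired = false;

    TriggerSketch(long windowEndMs) {
        this.windowEndMs = windowEndMs;
    }

    /** A record assigned to this window arrived. */
    void onElement(long value) {
        if (expired) {
            return;  // more than two weeks late: dropped
        }
        sum += value;
        if (onTimeFired) {
            emit("late");  // phase 3: once per record after the watermark
        }
    }

    /** Stream-time timer, assumed to be called once per minute by the caller. */
    void onPeriodicTimer() {
        if (!onTimeFired) {
            emit("early");  // phase 1: speculative result before the watermark
        }
    }

    /** The watermark advanced to watermarkMs (event time). */
    void onWatermark(long watermarkMs) {
        if (!onTimeFired && watermarkMs >= windowEndMs) {
            onTimeFired = true;
            emit("on-time");  // phase 2: input believed complete
        }
        if (watermarkMs >= windowEndMs + ALLOWED_LATENESS_MS) {
            expired = true;
        }
    }

    private void emit(String kind) {
        emitted.add(kind + "=" + sum);
    }

    List<String> emitted() {
        return emitted;
    }

    public static void main(String[] args) {
        TriggerSketch t = new TriggerSketch(120_000L);
        t.onElement(5);
        t.onPeriodicTimer();
        t.onElement(7);
        t.onWatermark(120_000L);
        t.onElement(8);
        System.out.println(t.emitted());
    }
}
```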

  52. [Figure: per-window sums emitted once per stream time minute, at the watermark, and once per late record, plotted by stream time (10:00 to 10:06, plus a late datum arriving at 13:00) against event time (10:00 to 10:06)]

  53. Triggers API

          PCollection<KV<String, Long>> sums = Pipeline
              .begin()
              .read("userRequests")
              .apply(Window.into(new FixedWindows(2, MINUTE))
                  .trigger(new SequenceOf(
                      new RepeatUntil(
                          new AtPeriod(1, MINUTE),
                          new AtWatermark()),
                      new AtWatermark(),
                      new RepeatUntil(
                          new AfterCount(1),
                          new AfterDelay(
                              14, DAYS, TimeDomain.EVENT_TIME)))))
              .apply(new Sum());

  54. Lambda vs Streaming: low-latency, approximate results; complete, correct results as soon as possible; ability to deal with changes upstream

  55. One Last Thing... What if I want sessions?

  56. Triggers API

          PCollection<KV<String, Long>> sums = Pipeline
              .begin()
              .read("userRequests")
              .apply(Window.into(new Sessions(1, MINUTE))
                  .trigger(new SequenceOf(
                      new RepeatUntil(
                          new AtPeriod(1, MINUTE),
                          new AtWatermark()),
                      new AtWatermark(),
                      new RepeatUntil(
                          new AfterCount(1),
                          new AfterDelay(
                              14, DAYS, TimeDomain.EVENT_TIME)))))
              .apply(new Sum());

  57. [Figure: per-session sums with 1 minute session gaps, plotted by stream time (10:00 to 10:06, plus a late datum arriving at 13:00) against event time (10:00 to 10:06)]

  58. Summary: Lambda is great. Streaming by itself is better :-) Strong Consistency = Correctness. Streaming = Aggregation + Windowing + Triggers. Tools For Reasoning About Time = Power + Flexibility.

  59. Thank you! Questions? Questions about this talk: takidau@google.com (Tyler Akidau) Questions about Cloud Dataflow: cloude@google.com (Eric Schmidt)

