Fundamentals of Stream Processing with Apache Beam (incubating)


  1. Fundamentals of Stream Processing with Apache Beam (incubating). Frances Perry & Tyler Akidau (@francesjperry, @takidau), Apache Beam Committers & Google Engineers. QCon San Francisco, November 2016. Google Docs version of slides (including animations): https://goo.gl/yzvLXe

  2. Agenda
      1. Infinite, Out-of-Order Data Sets
      2. What, Where, When, How
      3. Reasons This is Awesome
      4. Apache Beam (incubating)

  3. Section 1: Infinite, Out-of-Order Data Sets

  4. Data...

  5. ...can be big...

  6. ...really, really big... [Figure labels: Tuesday, Wednesday, Thursday]

  7. ...maybe infinitely big... [Figure: time axis, 8:00 to 14:00]

  8. ...with unknown delays. [Figure: time axis, 8:00 to 14:00, with repeated 8:00 labels]

  9. Element-wise transformations [Figure: processing-time axis, 8:00 to 14:00]

  10. Aggregating via Processing-Time Windows [Figure: processing-time axis, 8:00 to 14:00]

  11. Aggregating via Event-Time Windows [Figure: input on a processing-time axis and output on an event-time axis, each 10:00 to 15:00]

  12. Formalizing Event-Time Skew [Figure: processing time plotted against event time, with an "Ideal" line, a "Reality" curve, and the skew between them]

  13. Formalizing Event-Time Skew. Watermarks describe event time progress: "No timestamp earlier than the watermark will be seen." Often heuristic-based. Too slow? Results are delayed. Too fast? Some data is late. [Figure: processing time plotted against event time, with the "Ideal" line, the "~Watermark" curve, and the skew between them]
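To make the "too slow / too fast" trade-off concrete, here is a minimal plain-Java sketch of one possible heuristic watermark. It is not the Beam SDK and not how any particular runner computes watermarks. The class `HeuristicWatermark` and the rule "largest event time seen so far minus a fixed allowed skew" are assumptions chosen for illustration.

```java
/**
 * Hypothetical heuristic watermark (an illustration, not a Beam API):
 * the estimate is the largest event time seen so far minus a fixed allowed skew.
 */
final class HeuristicWatermark {
    private final long allowedSkewMillis;
    private long maxEventTimeMillis = Long.MIN_VALUE;

    HeuristicWatermark(long allowedSkewMillis) {
        this.allowedSkewMillis = allowedSkewMillis;
    }

    /** Current estimate: no timestamp earlier than this is expected to arrive. */
    long current() {
        return maxEventTimeMillis == Long.MIN_VALUE
            ? Long.MIN_VALUE
            : maxEventTimeMillis - allowedSkewMillis;
    }

    /** Observes one element and returns true if it arrived behind the watermark (late). */
    boolean observe(long eventTimeMillis) {
        boolean late = eventTimeMillis < current();
        maxEventTimeMillis = Math.max(maxEventTimeMillis, eventTimeMillis);
        return late;
    }

    public static void main(String[] args) {
        // Tolerate one minute of skew between the newest and the oldest in-flight element.
        HeuristicWatermark watermark = new HeuristicWatermark(60_000L);
        long[] eventTimes = {0L, 30_000L, 120_000L, 90_000L, 10_000L};
        for (long t : eventTimes) {
            boolean late = watermark.observe(t);
            System.out.println("eventTime=" + t + " late=" + late
                + " watermark=" + watermark.current());
        }
    }
}
```

A larger allowed skew makes the estimate slower (results wait longer), while a smaller one makes it faster (more elements are classified as late).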

  14. Section 2: What, Where, When, How

  15. What are you computing? Where in event time? When in processing time? How do refinements relate?

  16. What are you computing? Element-Wise, Aggregating, Composite. (What / Where / When / How)

  17. What: Computing Integer Sums

      // Collection of raw log lines
      PCollection<String> raw = IO.read(...);

      // Element-wise transformation into team/score pairs
      PCollection<KV<String, Integer>> input =
          raw.apply(ParDo.of(new ParseFn()));

      // Composite transformation containing an aggregation
      PCollection<KV<String, Integer>> scores =
          input.apply(Sum.integersPerKey());

  18. What: Computing Integer Sums What Where When How

  19. What: Computing Integer Sums What Where When How

  20. Where in event time? Windowing divides data into event-time-based finite chunks. Often required when doing aggregations over unbounded data. [Figure: Fixed, Sliding, and Sessions windows for Key 1, Key 2, and Key 3 along a time axis]

  21. Where: Fixed 2-minute Windows

      PCollection<KV<String, Integer>> scores = input
          .apply(Window.into(FixedWindows.of(Minutes(2))))
          .apply(Sum.integersPerKey());
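What fixed windowing followed by a per-key sum computes can be sketched in plain Java, without the Beam SDK. The class `FixedWindowSums`, its method names, and the `team@windowStart` key format are hypothetical names introduced only for this illustration.

```java
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java illustration of summing scores per key within fixed event-time windows. */
final class FixedWindowSums {

    /** Start of the fixed window that contains the given event timestamp. */
    static long windowStart(long eventTimeMillis, long sizeMillis) {
        return eventTimeMillis - Math.floorMod(eventTimeMillis, sizeMillis);
    }

    /** Sums scores per "team@windowStart"; the three arrays are parallel. */
    static Map<String, Integer> sumPerKeyAndWindow(
            String[] teams, long[] eventTimesMillis, int[] scores, long sizeMillis) {
        Map<String, Integer> sums = new TreeMap<>();
        for (int i = 0; i < teams.length; i++) {
            String key = teams[i] + "@" + windowStart(eventTimesMillis[i], sizeMillis);
            sums.merge(key, scores[i], Integer::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        String[] teams = {"red", "red", "blue", "red"};
        long[] eventTimes = {10_000L, 70_000L, 30_000L, 130_000L};
        int[] scores = {3, 4, 5, 2};
        // Two-minute windows, as on the slide.
        System.out.println(sumPerKeyAndWindow(teams, eventTimes, scores, 120_000L));
    }
}
```

The window is derived from each element's event timestamp alone, so the grouping does not depend on when the element is processed.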

  22. Where: Fixed 2-minute Windows What Where When How

  23. When in processing time?
      • Triggers control when results are emitted.
      • Triggers are often relative to the watermark.
      [Figure: processing time plotted against event time, with the "Ideal" line, the "~Watermark" curve, and the skew between them]

  24. When: Triggering at the Watermark

      PCollection<KV<String, Integer>> scores = input
          .apply(Window.into(FixedWindows.of(Minutes(2)))
              .triggering(AtWatermark()))
          .apply(Sum.integersPerKey());

  25. When: Triggering at the Watermark What Where When How

  26. When: Early and Late Firings

      PCollection<KV<String, Integer>> scores = input
          .apply(Window.into(FixedWindows.of(Minutes(2)))
              .triggering(AtWatermark()
                  .withEarlyFirings(AtPeriod(Minutes(1)))
                  .withLateFirings(AtCount(1))))
          .apply(Sum.integersPerKey());

  27. When: Early and Late Firings What Where When How

  28. How do refinements relate?
      • How should multiple outputs per window accumulate?
      • Appropriate choice depends on consumer.

      Firing          Elements  Discarding  Accumulating  Acc. & Retracting
      Speculative     [3]       3           3             3
      Watermark       [5, 1]    6           9             9, -3
      Late            [2]       2           11            11, -9
      Last Observed             2           11            11
      Total Observed            11          23            11

      (Accumulating & Retracting not yet implemented.)
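The three accumulation modes on this slide can be reproduced with a small plain-Java sketch. It does not use the Beam SDK. The class `AccumulationModes` and its method names are hypothetical, and each inner array stands for the elements that arrived before one firing (speculative, watermark, late).

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java illustration of what each accumulation mode emits per firing. */
final class AccumulationModes {

    private static int sum(int[] pane) {
        int total = 0;
        for (int value : pane) {
            total += value;
        }
        return total;
    }

    /** Discarding: each firing emits only the elements that arrived since the last firing. */
    static List<Integer> discarding(int[][] panes) {
        List<Integer> out = new ArrayList<>();
        for (int[] pane : panes) {
            out.add(sum(pane));
        }
        return out;
    }

    /** Accumulating: each firing emits the running total for the window. */
    static List<Integer> accumulating(int[][] panes) {
        List<Integer> out = new ArrayList<>();
        int running = 0;
        for (int[] pane : panes) {
            running += sum(pane);
            out.add(running);
        }
        return out;
    }

    /** Accumulating and retracting: emit the new total, then the negated previous total. */
    static List<Integer> accumulatingAndRetracting(int[][] panes) {
        List<Integer> out = new ArrayList<>();
        int previous = 0;
        boolean firedBefore = false;
        for (int[] pane : panes) {
            int running = previous + sum(pane);
            out.add(running);
            if (firedBefore) {
                out.add(-previous); // retraction of the previously emitted total
            }
            previous = running;
            firedBefore = true;
        }
        return out;
    }

    public static void main(String[] args) {
        int[][] panes = {{3}, {5, 1}, {2}}; // speculative, watermark, late
        System.out.println("discarding:   " + discarding(panes));
        System.out.println("accumulating: " + accumulating(panes));
        System.out.println("retracting:   " + accumulatingAndRetracting(panes));
    }
}
```

Summing everything a consumer observes gives 11 for discarding, 23 for accumulating, and 11 for accumulating and retracting, which is why the right mode depends on whether the consumer adds up panes itself or simply overwrites the previous value.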

  29. How: Add Newest, Remove Previous

      PCollection<KV<String, Integer>> scores = input
          .apply(Window.into(FixedWindows.of(Minutes(2)))
              .triggering(AtWatermark()
                  .withEarlyFirings(AtPeriod(Minutes(1)))
                  .withLateFirings(AtCount(1)))
              .accumulatingAndRetractingFiredPanes())
          .apply(Sum.integersPerKey());

  30. How: Add Newest, Remove Previous What Where When How

  31. Section 3: Reasons This is Awesome

  32. What / Where / When / How Correctness Power Composability Flexibility Modularity

  33. What / Where / When / How Correctness Power Composability Flexibility Modularity

  34. Distributed Systems are Distributed

  35. Processing Time Results Differ

  36. Event Time Results are Stable
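The contrast between the two preceding slides can be illustrated with a plain-Java sketch (no Beam SDK). The class `EventVsProcessingTime` and the sample arrival times are invented for illustration. The same three elements are windowed once by event time and once by two different sets of arrival times, as if the pipeline had run twice with different network delays.

```java
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java illustration: the same data windowed by event time vs. by arrival time. */
final class EventVsProcessingTime {

    /** Sums values into fixed windows keyed by window start, using the given timestamps. */
    static Map<Long, Integer> windowedSums(long[] timestampsMillis, int[] values, long sizeMillis) {
        Map<Long, Integer> sums = new TreeMap<>();
        for (int i = 0; i < values.length; i++) {
            long start = timestampsMillis[i] - Math.floorMod(timestampsMillis[i], sizeMillis);
            sums.merge(start, values[i], Integer::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        int[] values = {5, 7, 3};
        long[] eventTimes = {10_000L, 50_000L, 130_000L};
        // Hypothetical arrival (processing) times for two runs over the same data.
        long[] arrivalsRunA = {20_000L, 60_000L, 140_000L};
        long[] arrivalsRunB = {20_000L, 150_000L, 160_000L}; // second element delayed
        long twoMinutes = 120_000L;

        System.out.println("processing time, run A: " + windowedSums(arrivalsRunA, values, twoMinutes));
        System.out.println("processing time, run B: " + windowedSums(arrivalsRunB, values, twoMinutes));
        // Event timestamps are a property of the data, so both runs agree here.
        System.out.println("event time, either run: " + windowedSums(eventTimes, values, twoMinutes));
    }
}
```

Only the arrival times change between the two runs, yet the processing-time sums differ while the event-time sums stay the same.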

  37. What / Where / When / How Correctness Power Composability Flexibility Modularity

  38. Identifying Bursts of User Activity

  39. Sessions

      PCollection<KV<String, Integer>> scores = input
          .apply(Window.into(Sessions.withGapDuration(Minutes(1)))
              .triggering(AtWatermark()
                  .withEarlyFirings(AtPeriod(Minutes(1)))
                  .withLateFirings(AtCount(1)))
              .accumulatingAndRetractingFiredPanes())
          .apply(Sum.integersPerKey());
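How session windows group bursts of activity can be sketched in plain Java for a single key, without the Beam SDK. The class `SessionWindows` is hypothetical. The sketch assumes each element opens a window of one gap duration starting at its event time, and that overlapping windows are merged into one session.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java illustration of merging one key's events into gap-based sessions. */
final class SessionWindows {

    /** Returns sessions as {startMillis, endMillis} pairs, with the end exclusive. */
    static List<long[]> sessions(long[] eventTimesMillis, long gapMillis) {
        long[] sorted = eventTimesMillis.clone();
        Arrays.sort(sorted); // arrival order does not matter, only event time
        List<long[]> result = new ArrayList<>();
        for (long t : sorted) {
            long[] last = result.isEmpty() ? null : result.get(result.size() - 1);
            if (last != null && t < last[1]) {
                last[1] = t + gapMillis; // element falls inside the open session: extend it
            } else {
                result.add(new long[] {t, t + gapMillis}); // gap exceeded: start a new session
            }
        }
        return result;
    }

    public static void main(String[] args) {
        long[] eventTimes = {0L, 30_000L, 200_000L, 50_000L}; // out-of-order input
        for (long[] session : sessions(eventTimes, 60_000L)) {
            System.out.println("[" + session[0] + ", " + session[1] + ")");
        }
    }
}
```

Because the merge depends only on event times, a late element that lands between two existing sessions can join them into one, which is one reason the slide pairs sessions with late firings and retractions.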

  40. Identifying Bursts of User Activity

  41. What / Where / When / How Correctness Power Composability Flexibility Modularity

  42. Calculating Session Lengths

      input
          .apply(Window.into(Sessions.withGapDuration(Minutes(1)))
              .triggering(AtWatermark())
              .discardingFiredPanes())
          .apply(CalculateWindowLength());

  43. Calculating the Average Session Length

      input
          .apply(Window.into(Sessions.withGapDuration(Minutes(1)))
              .triggering(AtWatermark())
              .discardingFiredPanes())
          .apply(CalculateWindowLength())
          .apply(Window.into(FixedWindows.of(Minutes(2)))
              .triggering(AtWatermark()
                  .withEarlyFirings(AtPeriod(Minutes(1))))
              .accumulatingFiredPanes())
          .apply(Mean.globally());

  44. What / Where / When / How Correctness Power Composability Flexibility Modularity

  45. 1. Classic Batch
      2. Batch with Fixed Windows
      3. Streaming
      4. Streaming with Speculative + Late Data
      5. Streaming With Retractions
      6. Sessions

  46. What / Where / When / How Correctness Power Composability Flexibility Modularity

  47. 1. Classic Batch

          PCollection<KV<String, Integer>> scores = input
              .apply(Sum.integersPerKey());

      2. Batch with Fixed Windows

          PCollection<KV<String, Integer>> scores = input
              .apply(Window.into(FixedWindows.of(Minutes(2))))
              .apply(Sum.integersPerKey());

      3. Streaming

          PCollection<KV<String, Integer>> scores = input
              .apply(Window.into(FixedWindows.of(Minutes(2)))
                  .triggering(AtWatermark()))
              .apply(Sum.integersPerKey());

      4. Streaming with Speculative + Late Data

          PCollection<KV<String, Integer>> scores = input
              .apply(Window.into(FixedWindows.of(Minutes(2)))
                  .triggering(AtWatermark()
                      .withEarlyFirings(AtPeriod(Minutes(1)))
                      .withLateFirings(AtCount(1))))
              .apply(Sum.integersPerKey());

      5. Streaming With Retractions

          PCollection<KV<String, Integer>> scores = input
              .apply(Window.into(FixedWindows.of(Minutes(2)))
                  .triggering(AtWatermark()
                      .withEarlyFirings(AtPeriod(Minutes(1)))
                      .withLateFirings(AtCount(1)))
                  .accumulatingAndRetractingFiredPanes())
              .apply(Sum.integersPerKey());

      6. Sessions

          PCollection<KV<String, Integer>> scores = input
              .apply(Window.into(Sessions.withGapDuration(Minutes(2)))
                  .triggering(AtWatermark()
                      .withEarlyFirings(AtPeriod(Minutes(1)))
                      .withLateFirings(AtCount(1)))
                  .accumulatingAndRetractingFiredPanes())
              .apply(Sum.integersPerKey());

  48. What / Where / When / How Correctness Power Composability Flexibility Modularity

  49. Section 4: Apache Beam (incubating)

  50. The Evolution of Beam [Diagram labels: Colossus, BigTable, PubSub, Dremel, Spanner, Megastore, Millwheel, Flume, MapReduce, Google Cloud Dataflow, Apache Beam]

  51. What is Part of Apache Beam?
      1. The Beam Model: What / Where / When / How
      2. SDKs for writing Beam pipelines: Java and Python
      3. Runners for Existing Distributed Processing Backends
         • Apache Flink
         • Apache Spark
         • Google Cloud Dataflow
         • Direct runner for local development and testing
         • In development: Apache Gearpump and Apache Apex

  52. Apache Beam Technical Vision
      1. End users: who want to write pipelines or transform libraries in a language that’s familiar.
      2. SDK writers: who want to make Beam concepts available in new languages.
      3. Runner writers: who have a distributed processing environment and want to support Beam pipelines.
      [Diagram labels: Beam Java, Beam Python, Other Languages; Beam Model: Pipeline Construction; Cloud Dataflow, Runner A, Runner B; Beam Model: Fn Runners; Execution (one per runner)]

  53. Visions are a Journey
      • 2016-02-01: Enter Apache Incubator
      • 2016-02-25: 1st commit to ASF repository
      • 2016-06-08: 0.1.0-incubating release
      • 2016-07-28: 0.2.0-incubating release
      • 2016-10-21: Three new committers
      • 2016-10-31: 0.3.0-incubating release
      Early 2016: Internal API redesign and chaos. Mid 2016: API Stabilization. Late 2016: Multiple runners execute Beam pipelines.

  54. Categorizing Runner Capabilities: http://beam.incubator.apache.org/documentation/runners/capability-matrix/
