A Whirlwind Overview of Apache Beam, Eugene Kirpichov


  1. A Whirlwind Overview of Apache Beam. Eugene Kirpichov <kirpichov@google.com>, Staff Software Engineer

  2. (2004) MapReduce: SELECT + GROUPBY; (2008) FlumeJava: high-level API; (2013) Millwheel: deterministic streaming; (2014) Dataflow: unified batch/streaming, portable; (2016) Apache Beam: open ecosystem, community-driven, vendor-independent. Google Cloud Platform

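  3. Word count pipeline:

         import java.util.Arrays;
         import org.apache.beam.sdk.Pipeline;
         import org.apache.beam.sdk.io.TextIO;
         import org.apache.beam.sdk.transforms.Count;
         import org.apache.beam.sdk.transforms.FlatMapElements;
         import org.apache.beam.sdk.transforms.MapElements;
         import org.apache.beam.sdk.values.KV;
         import org.apache.beam.sdk.values.PCollection;
         import org.apache.beam.sdk.values.TypeDescriptors;

         Pipeline p = Pipeline.create(options);

         // Read text files
         PCollection<String> lines = p.apply(TextIO.read().from("gs://.../*"));

         PCollection<KV<String, Long>> wordCounts = lines
             // Split into words
             .apply(FlatMapElements.into(TypeDescriptors.strings())
                 .via((String line) -> Arrays.asList(line.split("\\W+"))))
             // Count
             .apply(Count.perElement());

         wordCounts
             // Format
             .apply(MapElements.into(TypeDescriptors.strings())
                 .via((KV<String, Long> count) -> count.getKey() + ": " + count.getValue()))
             // Write text files
             .apply(TextIO.write().to("gs://.../..."));

         p.run();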

  4. Beam PTransforms: ParDo ("map", applies a DoFn), GroupByKey ("reduce"), Composite
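The "map" and "reduce" roles on slide 4 can be illustrated with a toy, in-memory model in plain Java. This is only a sketch of the idea, not Beam's API: the class `PrimitivesSketch` and the methods `parDo`, `groupByKey`, and `countWords` are hypothetical names introduced here, and real Beam transforms operate on distributed `PCollection`s rather than lists.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Toy, in-memory stand-ins for the two primitives; NOT Beam's API. */
public class PrimitivesSketch {

    /** "ParDo": apply fn to each element independently; fn may emit zero or more outputs. */
    public static <A, B> List<B> parDo(List<A> input, Function<A, List<B>> fn) {
        List<B> out = new ArrayList<>();
        for (A element : input) {
            out.addAll(fn.apply(element));
        }
        return out;
    }

    /** "GroupByKey": collect all values that share a key, (K, V) -> (K, V[]). */
    public static <K, V> Map<K, List<V>> groupByKey(List<Map.Entry<K, V>> input) {
        Map<K, List<V>> grouped = new LinkedHashMap<>();
        for (Map.Entry<K, V> kv : input) {
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        return grouped;
    }

    /** A "composite": word count assembled from the two primitives above. */
    public static Map<String, Long> countWords(List<String> lines) {
        // ParDo step: emit (word, 1) for every word in every line.
        List<Map.Entry<String, Long>> ones = parDo(lines, line -> {
            List<Map.Entry<String, Long>> pairs = new ArrayList<>();
            for (String word : line.split("\\W+")) {
                if (!word.isEmpty()) {
                    pairs.add(Map.entry(word, 1L));
                }
            }
            return pairs;
        });
        // GroupByKey step, followed by counting the values in each group.
        Map<String, Long> counts = new LinkedHashMap<>();
        groupByKey(ones).forEach((word, vs) -> counts.put(word, (long) vs.size()));
        return counts;
    }

    public static void main(String[] args) {
        // prints {to=2, be=3, or=1, not=1, quick=1}
        System.out.println(countWords(List.of("to be or not to be", "be quick")));
    }
}
```

`parDo` needs no state across elements, while `groupByKey` must see every value for a key before its group is complete; that difference is what the later slides on new data are about.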

  5. Pillars of Beam: Unified model, Portability, Ecosystem

  6. Unified Model: Batch doesn't exist

  7. ETL: E grows, T evolves, L computes updates (always expect new data). Growing data is temporal ⇒ all data has timestamps (event time: t_happened)

  8. Dealing with new data: ParDo ⇒ apply to new data; GroupByKey ⇒ ?

  9. Continuous aggregation. Idea: per-key buffering. GroupByKey: (K, V) → (K, V[]), with one Group per key: (K_i, V) → (K_i, V[])
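The per-key buffering idea on slide 9 can be sketched as a small stateful class in plain Java. This is a toy model under stated assumptions, not how any Beam runner is implemented: `PerKeyBuffer`, `add`, and `flush` are hypothetical names, and the buffers live in a single in-memory map.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of a continuous GroupByKey via per-key buffering; NOT Beam's implementation. */
public class PerKeyBuffer<K, V> {
    private final Map<K, List<V>> buffers = new HashMap<>();

    /** Accepts one new (K, V) and returns the current (K, V[]) group for that key. */
    public List<V> add(K key, V value) {
        List<V> buffer = buffers.computeIfAbsent(key, k -> new ArrayList<>());
        buffer.add(value);
        return List.copyOf(buffer); // immutable snapshot of the group so far
    }

    /** Emits the buffered group for a key and clears it; empty if nothing is buffered. */
    public List<V> flush(K key) {
        List<V> buffer = buffers.remove(key);
        return buffer == null ? List.of() : List.copyOf(buffer);
    }

    public static void main(String[] args) {
        PerKeyBuffer<String, Integer> group = new PerKeyBuffer<>();
        group.add("a", 1);
        group.add("b", 5);
        System.out.println(group.add("a", 2)); // prints [1, 2]
        System.out.println(group.flush("a"));  // prints [1, 2]
    }
}
```

The sketch leaves open when `flush` should be called. With data that keeps arriving, a buffer never becomes complete on its own, so something has to decide when a group is emitted.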

  10. [Diagram: for key K_i along t (event time), input t_in: V and output t_out: V[]] See: Streams and Tables, https://www.infoq.com/presentations/beam-model-stream-table=theory

  11. Continuous aggregation. Idea: temporal windowing. An element, e.g. 14:03: (k, v), counts toward 1 or more windows in event time for its key K_i. T_watermark ⇒ closes old windows. Apply (user-specified) trigger: drop / add to buffer / emit buffer
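The windowing idea on slide 11 can be sketched for a single key in plain Java. This is a toy model, not Beam's implementation, and it makes several simplifying assumptions that are not on the slide: windows are fixed-size and non-overlapping (so each element counts toward exactly one window), the only trigger is "emit once when the watermark passes the end of the window", and elements for already closed windows are dropped. The names `WindowedBuffer`, `add`, and `advanceWatermark` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of event-time windowing for one key; NOT Beam's implementation. */
public class WindowedBuffer {
    private final long windowSize;                                    // window width in event-time units
    private final TreeMap<Long, List<String>> open = new TreeMap<>(); // window start -> buffered values
    private long watermark = Long.MIN_VALUE;

    public WindowedBuffer(long windowSize) {
        this.windowSize = windowSize;
    }

    /** Adds a value to its window's buffer; returns false if it is dropped because the window is closed. */
    public boolean add(long eventTime, String value) {
        long start = Math.floorDiv(eventTime, windowSize) * windowSize;
        if (start + windowSize <= watermark) {
            return false; // late element: its window was already emitted and closed
        }
        open.computeIfAbsent(start, s -> new ArrayList<>()).add(value);
        return true;
    }

    /** Advances the watermark; emits and closes every window that ends at or before it. */
    public Map<Long, List<String>> advanceWatermark(long newWatermark) {
        watermark = Math.max(watermark, newWatermark); // the watermark never moves backwards
        Map<Long, List<String>> emitted = new TreeMap<>();
        while (!open.isEmpty() && open.firstKey() + windowSize <= watermark) {
            Map.Entry<Long, List<String>> closed = open.pollFirstEntry();
            emitted.put(closed.getKey(), closed.getValue());
        }
        return emitted;
    }

    public static void main(String[] args) {
        // Event times are minutes since midnight, so 14:03 is 843; windows are 5 minutes wide.
        WindowedBuffer buffer = new WindowedBuffer(5);
        buffer.add(843, "v1");
        buffer.add(841, "v2");
        buffer.add(846, "v3");
        System.out.println(buffer.advanceWatermark(845)); // prints {840=[v1, v2]}
        System.out.println(buffer.add(842, "late"));      // prints false
    }
}
```

The three outcomes named on the slide map onto the sketch as follows: `add` returning `false` is "drop", `add` returning `true` is "add to buffer", and `advanceWatermark` is "emit buffer".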

  12. There is no batch / streaming. Only different ways to control aggregation.

  13. Portability (vision for 2018)

  14. Code in any supported language (or a mix) → portable pipeline representation → run on any supported runner

  15. No vendor lock-in: run any language on any runner. No language lock-in: users can use all transforms from all languages; library authors' work will be usable by all languages. Accelerated ecosystem growth: new runner / new SDK ⇒ access to all Beam libraries

  16. Ecosystem

  17. [Diagram: ecosystem stack] Community; user code (powered by Beam); third-party IO, SQL, other libs; SDKs (language SDKs); portable unified model; runners

  18. 250 contributors; 31 committers (11 orgs); ~5000 PRs; ~12,500 commits; 25+ IO connectors; 5 stable releases; 9 runners

  19. Thank you!
