Dataflow/Apache Beam: A Unified Model for Batch and Streaming Data Processing
Eugene Kirpichov, Google
STREAM 2016

Agenda
1. Google’s Data Processing Story
2. Philosophy of the Beam programming model
3. Apache Beam project

1. Google’s Data Processing Story

Data Processing @ Google
[Timeline figure, 2002 to 2016. Systems shown: Dataflow, MapReduce, FlumeJava, Dremel, Spanner, GFS, Bigtable, Pregel, Colossus, MillWheel]

MapReduce: SELECT + GROUP BY
(distributed input dataset) → Map (SELECT) → Shuffle (GROUP BY) → Reduce (SELECT) → (distributed output dataset)
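The Map / Shuffle / Reduce stages can be sketched on a single machine in plain Java. This is only an illustration of the SELECT + GROUP BY analogy, not the MapReduce API: the class name `MapReduceSketch`, the `"location,temperature"` input format, and the choice of a per-key mean as the reduction are all assumptions made for the example.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-process sketch; a real MapReduce distributes each stage.
final class MapReduceSketch {
    // Map (SELECT): turn one raw line "location,temperature" into a key/value pair.
    static Map.Entry<String, Double> map(String line) {
        String[] parts = line.split(",");
        return new AbstractMap.SimpleEntry<>(parts[0].trim(), Double.parseDouble(parts[1].trim()));
    }

    // Shuffle (GROUP BY): collect all values that share a key.
    static Map<String, List<Double>> shuffle(List<Map.Entry<String, Double>> pairs) {
        Map<String, List<Double>> groups = new TreeMap<>();
        for (Map.Entry<String, Double> pair : pairs) {
            groups.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return groups;
    }

    // Reduce (SELECT): collapse each group to a single value, here the mean.
    static double reduce(List<Double> values) {
        double sum = 0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.size();
    }

    static Map<String, Double> run(List<String> input) {
        List<Map.Entry<String, Double>> mapped = new ArrayList<>();
        for (String line : input) {
            mapped.add(map(line));
        }
        Map<String, Double> output = new TreeMap<>();
        for (Map.Entry<String, List<Double>> group : shuffle(mapped).entrySet()) {
            output.put(group.getKey(), reduce(group.getValue()));
        }
        return output;
    }

    public static void main(String[] args) {
        // Prints {nyc=4.0, sfo=15.0}
        System.out.println(run(Arrays.asList("sfo,10.0", "nyc,4.0", "sfo,20.0")));
    }
}
```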

FlumeJava Pipelines
• A Pipeline represents a graph of data processing transformations
• PCollections flow through the pipeline
• Optimized and executed as a unit for efficiency

Example: Computing mean temperature
// Collection of raw events
PCollection<SensorEvent> raw = ...;
// Element-wise extract location/temperature pairs
PCollection<KV<String, Double>> input =
    raw.apply(ParDo.of(new ParseFn()));
// Composite transformation containing an aggregation
PCollection<KV<String, Double>> output =
    input.apply(Mean.<Double>perKey());
// Write output
output.apply(BigtableIO.Write.to(...));

So, people used FJ to process data...

...big data...

...really, really big...
[Figure: daily batches labeled Tuesday, Wednesday, Thursday]

Batch failure mode #1: Latency

Batch failure mode #2: Sessions
[Figure: user sessions (Jose, Lisa, Ingo, Asha, Cheryl, Ari) shown against the Tuesday and Wednesday MapReduce batches]

Continuous & Unbounded
[Figure: unbounded data on an hourly time axis, 1:00 to 14:00]

State of the art until recently: Lambda Architecture
• Historical events → Periodic batch processing → Exact historical model
• Continuous updates → Stream processing system → Approximate real-time model

MillWheel: Deterministic, low-latency streaming
● Framework for building low-latency data-processing applications
● User provides a DAG of computations to be performed
● System manages state and persistent flow of elements

Streaming or Batch?
● Correctness: 1 + 1 = 2
● Latency
Why not both?

What are you computing?
Where in event time?
When in processing time?
How do refinements relate?

Where in event time?
● Windowing divides data into event-time-based finite chunks.
● Required when doing aggregations over unbounded data.
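A minimal sketch of what "event-time-based finite chunks" can mean, using fixed-size windows in plain Java. It is not the Beam windowing API: the class `FixedWindowSketch`, the millisecond timestamps, and the `{timestamp, value}` event layout are assumptions for illustration. Note that the third event arrives after the second but still lands in the first window, because assignment depends only on its event time.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of fixed (tumbling) event-time windows.
final class FixedWindowSketch {
    // Start of the window [start, start + size) that contains the event time.
    // floorMod keeps the result correct for negative timestamps as well.
    static long windowStart(long eventTimeMillis, long windowSizeMillis) {
        return eventTimeMillis - Math.floorMod(eventTimeMillis, windowSizeMillis);
    }

    // Sum values per window; each event is {eventTimeMillis, value}.
    // Arrival order does not matter, only the event timestamp does.
    static Map<Long, Long> sumPerWindow(long[][] events, long windowSizeMillis) {
        Map<Long, Long> sums = new TreeMap<>();
        for (long[] event : events) {
            sums.merge(windowStart(event[0], windowSizeMillis), event[1], Long::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        long[][] events = { {5_000, 3}, {65_000, 4}, {59_999, 2} };
        // One-minute windows. Prints {0=5, 60000=4}
        System.out.println(sumPerWindow(events, 60_000));
    }
}
```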

When in processing time?
• Triggers control when results are emitted.
• Triggers are often relative to the watermark.
[Figure: the watermark plotted on event time vs. processing time axes]
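One way to picture a watermark-relative trigger is to label each firing of a window by where the watermark stands at that moment. The sketch below is an assumption-laden illustration in plain Java, not Beam's trigger implementation: the class `WatermarkTriggerSketch`, the `Timing` labels, and the `onTimeAlreadyFired` flag are hypothetical names chosen for the example.

```java
// Hypothetical illustration of classifying window firings against the watermark.
final class WatermarkTriggerSketch {
    enum Timing { EARLY, ON_TIME, LATE }

    // windowEndMillis: exclusive end of the window, in event time.
    // watermarkMillis: the system's current estimate of event-time completeness.
    // onTimeAlreadyFired: whether the window already emitted its on-time result.
    static Timing classify(long windowEndMillis, long watermarkMillis, boolean onTimeAlreadyFired) {
        if (watermarkMillis < windowEndMillis) {
            // The watermark has not passed the window: any result is speculative.
            return Timing.EARLY;
        }
        // The watermark has passed the window: the first firing is the on-time
        // result; anything after it is a refinement caused by late data.
        return onTimeAlreadyFired ? Timing.LATE : Timing.ON_TIME;
    }

    public static void main(String[] args) {
        System.out.println(classify(60_000, 30_000, false)); // EARLY
        System.out.println(classify(60_000, 60_000, false)); // ON_TIME
        System.out.println(classify(60_000, 90_000, true));  // LATE
    }
}
```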

How do refinements relate?
PCollection<KV<String, Integer>> output = input
    .apply(Window.into(Sessions.withGapDuration(Minutes(1)))
        .trigger(AtWatermark()
            .withEarlyFirings(AtPeriod(Minutes(1)))
            .withLateFirings(AtCount(1)))
        .accumulatingAndRetracting()
    )
    .apply(new Sum());
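The difference between refinement modes is easiest to see on one window that fires several times. The plain-Java sketch below compares three modes for a running sum: discarding (each firing reports only new data), accumulating (each firing reports the total so far), and accumulating and retracting (each firing also withdraws the previous total). The class `AccumulationSketch`, the `"emit"` / `"retract"` strings, and the input of per-firing partial sums are assumptions for illustration, not SDK output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of how successive results for one window relate.
final class AccumulationSketch {
    // paneSums[i] is the sum of the new elements seen by the i-th firing.

    // Discarding: each firing reports only what arrived since the last one.
    static List<String> discarding(int[] paneSums) {
        List<String> out = new ArrayList<>();
        for (int sum : paneSums) {
            out.add("emit " + sum);
        }
        return out;
    }

    // Accumulating: each firing reports the running total for the window.
    static List<String> accumulating(int[] paneSums) {
        List<String> out = new ArrayList<>();
        int total = 0;
        for (int sum : paneSums) {
            total += sum;
            out.add("emit " + total);
        }
        return out;
    }

    // Accumulating and retracting: before reporting a new total,
    // withdraw the previously reported one so consumers can correct themselves.
    static List<String> accumulatingAndRetracting(int[] paneSums) {
        List<String> out = new ArrayList<>();
        int total = 0;
        boolean first = true;
        for (int sum : paneSums) {
            if (!first) {
                out.add("retract " + total);
            }
            total += sum;
            out.add("emit " + total);
            first = false;
        }
        return out;
    }

    public static void main(String[] args) {
        int[] panes = {3, 4, 2};
        System.out.println(discarding(panes));                // [emit 3, emit 4, emit 2]
        System.out.println(accumulating(panes));              // [emit 3, emit 7, emit 9]
        System.out.println(accumulatingAndRetracting(panes)); // [emit 3, retract 3, emit 7, retract 7, emit 9]
    }
}
```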

Google Cloud Dataflow
A fully managed cloud service and programming model for batch and streaming big data processing.

Dataflow SDK
● Portable API to construct and run a pipeline.
● Available in Java and Python (alpha).
● Pipelines can run…
  ○ On your development machine
  ○ On the Dataflow Service on Google Cloud Platform
  ○ On third-party environments like Spark or Flink

3. Dataflow ⇒ Apache Beam

Pipeline p = Pipeline.create(options);
// Read lines, split them into words, drop empties, count, format, write.
p.apply(TextIO.Read.from("gs://dataflow-samples/shakespeare/*"))
 .apply(FlatMapElements.via(
     word -> Arrays.asList(word.split("[^a-zA-Z']+"))))
 .apply(Filter.byPredicate(word -> !word.isEmpty()))
 .apply(Count.perElement())
 .apply(MapElements.via(
     count -> count.getKey() + ": " + count.getValue()))
 .apply(TextIO.Write.to("gs://.../..."));
p.run();

Apache Beam ecosystem
Layers, from top to bottom:
• End-user's pipeline
• Libraries: transforms, sources/sinks etc.
• Language-specific SDK (Java, Python, ...)
• Beam model (ParDo, GBK, Windowing…)
• Runner
• Execution environment

Apache Beam Roadmap
• 02/01/2016: Enter Apache Incubator
• 02/25/2016: 1st commit to ASF repository
• Early 2016: Design for use cases, begin refactoring
• Mid 2016: Slight chaos
• Late 2016: Multiple runners execute Beam pipelines

Runner capability matrix

Technical Vision: Still more modular
• Multiple SDKs with shared pipeline representation
• Language-agnostic runners implementing the model
• Fn Runners run language-specific code
[Diagram: Beam Java, Beam Python, Other Languages → Beam Model: Pipeline Construction → Runner A, Runner B, Cloud Dataflow → Beam Model: Fn Runners → Execution]

Recap: Timeline of ideas
• 2004 MapReduce (SELECT / GROUP BY): Library > DSL; abstract away fault tolerance & distribution
• 2010 FlumeJava: High-level API (typed DAG)
• 2013 MillWheel: Deterministic stream processing
• 2015 Dataflow: Unified batch/streaming model; Windowing, Triggers, Retractions
• 2016 Beam: Portable programming model; language-agnostic runners

Learn More!
• Programming model: The World Beyond Batch: Streaming 101, Streaming 102; The Dataflow Model paper
• Cloud Dataflow: http://cloud.google.com/dataflow/
• Apache Beam: https://wiki.apache.org/incubator/BeamProposal, http://beam.incubator.apache.org/
• Dataflow/Beam vs. Spark

Thank you Google confidential │ Do not distribute
