  1. Google Cloud Dataflow. Cosmin Arad, Senior Software Engineer, carad@google.com. August 7, 2015

  2. Agenda: 1. Dataflow Overview, 2. Dataflow SDK Concepts (Programming Model), 3. Cloud Dataflow Service, 4. Demo: Counting Words!, 5. Questions and Discussion

  3. History of Big Data at Google, 2002-2013: GFS, MapReduce, BigTable, Pregel, Dremel, Flume, Colossus, Spanner, MillWheel, and now Cloud Dataflow

  4. Big Data on Google Cloud Platform
     Capture: Pub/Sub, Logs, App Engine, BigQuery streaming
     Store: Cloud Storage (objects), Cloud SQL (mySQL), Datastore (NoSQL), BigTable, BigQuery storage (structured)
     Process: Cloud Dataflow (stream and batch), Hadoop (on GCE), Spark (on GCE)
     Analyze: BigQuery, Hadoop (on GCE), Spark (on GCE), larger Hadoop ecosystem

  5. What is Cloud Dataflow? Cloud Dataflow is a managed service for executing parallelized data processing pipelines, and a collection of SDKs for building those parallelized pipelines.

  6. Where might you use Cloud Dataflow? ETL Orchestration Analysis

  7. Where might you use Cloud Dataflow?
     ETL: Movement, Filtering, Enrichment, Shaping
     Orchestration: Composition, External orchestration, Simulation
     Analysis: Reduction, Batch computation, Continuous computation

  8. Dataflow SDK Concepts (Programming Model)

  9. Dataflow SDK(s)
     ● Easily construct parallelized data processing pipelines using an intuitive set of programming abstractions
       ○ Do what the user expects.
       ○ No knobs whenever possible.
       ○ Build for extensibility.
       ○ Unified batch & streaming semantics.
     ● Google supported and open sourced
       ○ Java 7 (public) @ github.com/GoogleCloudPlatform/DataflowJavaSDK
       ○ Python 2 (in progress)
     ● Community sourced
       ○ Scala @ github.com/darkjh/scalaflow
       ○ Scala @ github.com/jhlch/scala-dataflow-dsl

  10. Dataflow Java SDK Release Process (diagram of the weekly and monthly release cadence)

  11. Pipeline
     • A directed graph of data processing transformations
     • Optimized and executed as a unit
     • May include multiple inputs and multiple outputs
     • May encompass many logical MapReduce or MillWheel operations
     • PCollections conceptually flow through the pipeline

  12. Runners specify how a pipeline should run (see the sketch below)
     ● Direct Runner
       ○ For local, in-memory execution. Great for development and unit tests.
     ● Cloud Dataflow Service
       ○ batch mode: GCE instances poll for work items to execute
       ○ streaming mode: GCE instances are set up in a semi-permanent topology
     ● Community sourced
       ○ Spark from Cloudera @ github.com/cloudera/spark-dataflow
       ○ Flink from dataArtisans @ github.com/dataArtisans/flink-dataflow
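     A minimal sketch of selecting a runner through pipeline options, assuming the Dataflow Java SDK 1.x class names (DataflowPipelineOptions, DirectPipelineRunner, BlockingDataflowPipelineRunner); the flag values are illustrative:

     import com.google.cloud.dataflow.sdk.Pipeline;
     import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
     import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
     import com.google.cloud.dataflow.sdk.runners.BlockingDataflowPipelineRunner;

     public class RunnerSelection {
       public static void main(String[] args) {
         // Parses flags such as --project, --stagingLocation, --runner from the command line.
         DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
             .withValidation()
             .as(DataflowPipelineOptions.class);

         // Run on the managed Cloud Dataflow service and wait for the job to finish;
         // leave the runner unset (or pass --runner=DirectPipelineRunner) for local,
         // in-memory execution during development and unit tests.
         options.setRunner(BlockingDataflowPipelineRunner.class);

         Pipeline p = Pipeline.create(options);
         // ... apply transforms here ...
         p.run();
       }
     }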

  13. Example: #HashTag Autocompletion

  14. #HashTag autocompletion, step by step:
     Tweets: {Go Hawks #Seahawks!, #Seattle works museum pass. Free!, Go #PatriotsNation! Having fun at #seaside, …}
     ExtractTags: {seahawks, seattle, patriotsnation, lovemypats, ...}
     Count: {seahawks->5M, seattle->2M, patriots->9M, ...}
     ExpandPrefixes: {d->(deflategate, 10M), d->(denver, 2M), …, sea->(seahawks, 5M), sea->(seaside, 2M), ...}
     Top(3): {d->[deflategate, desafiodatransa, djokovic], ..., de->[deflategate, desafiodatransa, dead50], ...}
     Write: Predictions

  15. The same pipeline in code (steps: Read Tweets, ExtractTags, Count, ExpandPrefixes, Top(3), Write Predictions):
     Pipeline p = Pipeline.create();
     p.begin()
       .apply(TextIO.Read.from("gs://…"))
       .apply(ParDo.of(new ExtractTags()))
       .apply(Count.perElement())
       .apply(ParDo.of(new ExpandPrefixes()))
       .apply(Top.largestPerKey(3))
       .apply(TextIO.Write.to("gs://…"));
     p.run();

  16. Dataflow Basics
     Pipeline: a directed graph of steps operating on data
     Pipeline p = Pipeline.create();
     p.run();

  17. Dataflow Basics
     Pipeline: a directed graph of steps operating on data
     PCollection: an immutable collection of same-typed elements that can be encoded; also PCollectionTuple and PCollectionList
     Pipeline p = Pipeline.create();
     p.begin()
       .apply(TextIO.Read.from("gs://…"))
       .apply(TextIO.Write.to("gs://…"));
     p.run();

  18. Dataflow Basics
     Pipeline: a directed graph of steps operating on data
     PCollection: an immutable collection of same-typed elements that can be encoded; also PCollectionTuple and PCollectionList
     Transformation: a step that operates on data
       • Core transforms: ParDo, GroupByKey, Combine, Flatten
       • Composite and custom transforms
     Pipeline p = Pipeline.create();
     p.begin()
       .apply(TextIO.Read.from("gs://…"))
       .apply(ParDo.of(new ExtractTags()))
       .apply(Count.perElement())
       .apply(ParDo.of(new ExpandPrefixes()))
       .apply(Top.largestPerKey(3))
       .apply(TextIO.Write.to("gs://…"));
     p.run();

  19. Dataflow Basics: the custom DoFn behind the ExpandPrefixes step of the pipeline above
     class ExpandPrefixes … {
       ...
       public void processElement(ProcessContext c) {
         String word = c.element().getKey();
         for (int i = 1; i <= word.length(); i++) {
           String prefix = word.substring(0, i);
           c.output(KV.of(prefix, c.element()));
         }
       }
     }
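     The slide elides the class declaration; a hedged completion might look like the sketch below, where the type parameters are an assumption inferred from Count.perElement() upstream (which produces KV<String, Long>) and the KV output used downstream:

     import com.google.cloud.dataflow.sdk.transforms.DoFn;
     import com.google.cloud.dataflow.sdk.values.KV;

     // Input: (word, count) pairs from Count.perElement().
     // Output: one (prefix, (word, count)) pair per prefix of the word.
     public class ExpandPrefixes
         extends DoFn<KV<String, Long>, KV<String, KV<String, Long>>> {
       @Override
       public void processElement(ProcessContext c) {
         String word = c.element().getKey();
         for (int i = 1; i <= word.length(); i++) {
           String prefix = word.substring(0, i);
           c.output(KV.of(prefix, c.element()));
         }
       }
     }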

  20. PCollections
     • A collection of data of type T in a pipeline, e.g. {Seahawks, NFC, Champions, Seattle, ...} or {..., "NFC Champions #GreenBay", "Green Bay #superbowl!", ..., "#GoHawks", ...}
     • May be either bounded or unbounded in size
     • Created by using a PTransform to:
       • Build from a java.util.Collection
       • Read from a backing data store
       • Transform an existing PCollection
     • Often contain key-value pairs using KV<K, V>
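     A brief sketch of those three creation paths, assuming the Dataflow Java SDK 1.x API (the bucket path and variable names are illustrative):

     import java.util.Arrays;
     import com.google.cloud.dataflow.sdk.Pipeline;
     import com.google.cloud.dataflow.sdk.io.TextIO;
     import com.google.cloud.dataflow.sdk.transforms.Count;
     import com.google.cloud.dataflow.sdk.transforms.Create;
     import com.google.cloud.dataflow.sdk.values.KV;
     import com.google.cloud.dataflow.sdk.values.PCollection;

     public class CreatingPCollections {
       public static void main(String[] args) {
         Pipeline p = Pipeline.create();

         // 1) Build a bounded PCollection from an in-memory java.util.Collection.
         PCollection<String> teams =
             p.apply(Create.of(Arrays.asList("Seahawks", "NFC", "Champions", "Seattle")));

         // 2) Read from a backing data store (here, text files on GCS).
         PCollection<String> tweets = p.apply(TextIO.Read.from("gs://my-bucket/tweets-*"));

         // 3) Transform an existing PCollection; key-value pairs use KV<K, V>.
         PCollection<KV<String, Long>> counts = teams.apply(Count.perElement());

         p.run();
       }
     }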

  21. Inputs & Outputs
     • Read from standard Google Cloud Platform data sources: GCS, Pub/Sub, BigQuery, Datastore, ...
     • Write your own custom source by teaching Dataflow how to read it in parallel ("Your Source/Sink Here")
     • Write to standard Google Cloud Platform data sinks: GCS, BigQuery, Pub/Sub, Datastore, …
     • Can use a combination of text, JSON, XML, Avro formatted data
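     As a hedged illustration of mixing sources and sinks, the sketch below reads a public BigQuery table and writes text files to GCS; the table, the "word" field, and the bucket are illustrative, and the Dataflow Java SDK 1.x API is assumed:

     import com.google.api.services.bigquery.model.TableRow;
     import com.google.cloud.dataflow.sdk.Pipeline;
     import com.google.cloud.dataflow.sdk.io.BigQueryIO;
     import com.google.cloud.dataflow.sdk.io.TextIO;
     import com.google.cloud.dataflow.sdk.transforms.DoFn;
     import com.google.cloud.dataflow.sdk.transforms.ParDo;

     public class SourcesAndSinks {
       public static void main(String[] args) {
         Pipeline p = Pipeline.create();

         p.apply(BigQueryIO.Read.from("publicdata:samples.shakespeare"))  // read a BigQuery table
          .apply(ParDo.of(new DoFn<TableRow, String>() {                  // shape each row
            @Override
            public void processElement(ProcessContext c) {
              c.output((String) c.element().get("word"));
            }
          }))
          .apply(TextIO.Write.to("gs://my-bucket/words"));                // write text files to GCS

         p.run();
       }
     }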

  22. Coders
     • A Coder<T> explains how an element of type T can be written to disk or communicated between machines
     • Every PCollection<T> needs a valid coder in case the service decides to communicate those values between machines
     • Encoded values are used to compare keys, so the encoding needs to be deterministic
     • Coder inference (e.g. Avro-based) can infer a coder for many basic Java objects
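     For example, one way to give a custom type a coder is Avro-based inference via an annotation; the Tweet class here is hypothetical, and the SDK 1.x coder classes are assumed:

     import com.google.cloud.dataflow.sdk.coders.AvroCoder;
     import com.google.cloud.dataflow.sdk.coders.DefaultCoder;

     // Annotating a POJO lets coder inference pick AvroCoder for any PCollection<Tweet>,
     // so the service can serialize elements whenever it moves them between machines.
     @DefaultCoder(AvroCoder.class)
     public class Tweet {
       public String user;
       public String text;
       public long timestampMillis;

       public Tweet() {}  // Avro requires a no-argument constructor
     }

     Alternatively, a coder can be set explicitly on a PCollection, e.g. collection.setCoder(AvroCoder.of(Tweet.class)).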

  23. ParDo ("Parallel Do")
     • Processes each element of a PCollection independently using a user-provided DoFn
     • Example: {Seahawks, NFC, Champions, Seattle, ...} -> LowerCase -> {seahawks, nfc, champions, seattle, ...}

  24. ParDo ("Parallel Do")
     • Processes each element of a PCollection independently using a user-provided DoFn
     • Example: {Seahawks, NFC, Champions, Seattle, ...} -> LowerCase -> {seahawks, nfc, champions, seattle, ...}
     PCollection<String> tweets = …;
     tweets.apply(ParDo.of(
         new DoFn<String, String>() {
           @Override
           public void processElement(ProcessContext c) {
             c.output(c.element().toLowerCase());
           }
         }));

  25. ParDo ("Parallel Do")
     • Processes each element of a PCollection independently using a user-provided DoFn
     • Example: {Seahawks, NFC, Champions, Seattle, ...} -> FilterOutSWords -> {NFC, Champions, ...}
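     The deck does not show FilterOutSWords itself; a minimal sketch of one plausible version, assuming the SDK 1.x DoFn API, is:

     import com.google.cloud.dataflow.sdk.transforms.DoFn;

     // Emits only elements that do not start with "s"/"S"; outputting zero
     // elements for an input is how a ParDo implements filtering.
     public class FilterOutSWords extends DoFn<String, String> {
       @Override
       public void processElement(ProcessContext c) {
         String word = c.element();
         if (!word.toLowerCase().startsWith("s")) {
           c.output(word);
         }
       }
     }

     It would be applied as words.apply(ParDo.of(new FilterOutSWords())).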

  26. ParDo ("Parallel Do")
     • Processes each element of a PCollection independently using a user-provided DoFn
     • Example: {Seahawks, NFC, Champions, Seattle, ...} -> ExpandPrefixes -> {s, se, sea, seah, seaha, seahaw, seahawk, seahawks, n, nf, nfc, c, ch, cha, cham, champ, champi, champio, champion, champions, s, se, sea, seat, seatt, seattl, seattle, ...}

  27. ParDo ("Parallel Do")
     • Processes each element of a PCollection independently using a user-provided DoFn
     • Example: {Seahawks, NFC, Champions, Seattle, ...} -> KeyByFirstLetter -> {KV<S, Seahawks>, KV<C, Champions>, KV<S, Seattle>, KV<N, NFC>, ...}
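     KeyByFirstLetter is likewise not spelled out in the deck; a minimal sketch under the same SDK 1.x assumptions:

     import com.google.cloud.dataflow.sdk.transforms.DoFn;
     import com.google.cloud.dataflow.sdk.values.KV;

     // Pairs each word with its first letter (upper-cased) as the key,
     // producing the KV<letter, word> elements shown above.
     public class KeyByFirstLetter extends DoFn<String, KV<String, String>> {
       @Override
       public void processElement(ProcessContext c) {
         String word = c.element();
         if (!word.isEmpty()) {
           c.output(KV.of(word.substring(0, 1).toUpperCase(), word));
         }
       }
     }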

  28. ParDo ("Parallel Do")
     • Processes each element of a PCollection independently using a user-provided DoFn
       (e.g. KeyByFirstLetter: {Seahawks, NFC, Champions, Seattle, ...} -> {KV<S, Seahawks>, KV<C, Champions>, KV<S, Seattle>, KV<N, NFC>, ...})
     • Elements are processed in arbitrary 'bundles', e.g. "shards": startBundle(), processElement()*, finishBundle()
     • Supports arbitrary amounts of parallelization
     • Corresponds to both the Map and Reduce phases in Hadoop, i.e. ParDo->GBK->ParDo
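     A sketch of that bundle lifecycle, assuming the SDK 1.x DoFn methods startBundle(Context), processElement(ProcessContext), and finishBundle(Context); the buffering logic is purely illustrative:

     import com.google.cloud.dataflow.sdk.transforms.DoFn;

     // The runner calls startBundle() once per bundle, processElement() once per
     // element in the bundle, then finishBundle() once, so per-bundle state can be
     // set up and flushed here (e.g. batching writes to an external service).
     public class BatchingDoFn extends DoFn<String, String> {
       private transient StringBuilder batch;  // hypothetical per-bundle buffer

       @Override
       public void startBundle(Context c) {
         batch = new StringBuilder();
       }

       @Override
       public void processElement(ProcessContext c) {
         batch.append(c.element()).append('\n');
         c.output(c.element());
       }

       @Override
       public void finishBundle(Context c) {
         // flush `batch` to an external system here
       }
     }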

  29. GroupByKey
     • Takes a PCollection of key-value pairs and gathers up all values with the same key
       {KV<S, Seahawks>, KV<C, Champions>, KV<S, Seattle>, KV<N, NFC>, ...} -> GroupByKey -> {KV<S, {Seahawks, Seattle, …}>, KV<N, {NFC, …}>, KV<C, {Champions, …}>}
     • Corresponds to the shuffle phase in Hadoop
     • How do you do a GroupByKey on an unbounded PCollection?
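     In code, grouping the keyed words from the previous slides might look like this hedged helper (SDK 1.x API assumed):

     import com.google.cloud.dataflow.sdk.transforms.GroupByKey;
     import com.google.cloud.dataflow.sdk.values.KV;
     import com.google.cloud.dataflow.sdk.values.PCollection;

     // keyed: e.g. the output of ParDo.of(new KeyByFirstLetter()).
     static PCollection<KV<String, Iterable<String>>> groupByLetter(
         PCollection<KV<String, String>> keyed) {
       // All values sharing a key are gathered into a single Iterable per key.
       return keyed.apply(GroupByKey.<String, String>create());
     }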

  30. Windows
     • Logically divide up or group the elements of a PCollection into finite windows
       • Fixed windows: hourly, daily, …
       • Sliding windows
       • Sessions
     • Required for GroupByKey-based transforms on an unbounded PCollection, but can also be used for bounded PCollections
     • Window.into() can be called at any point in the pipeline and will be applied when needed
     • Can be tied to arrival time or custom event time
     (Slide figure: elements bucketed into Nighttime / Mid-Day / Nighttime windows)
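     A minimal windowing sketch, assuming the SDK 1.x windowing classes (Window, FixedWindows, SlidingWindows, Sessions); the durations are illustrative:

     import org.joda.time.Duration;
     import com.google.cloud.dataflow.sdk.transforms.windowing.FixedWindows;
     import com.google.cloud.dataflow.sdk.transforms.windowing.Sessions;
     import com.google.cloud.dataflow.sdk.transforms.windowing.SlidingWindows;
     import com.google.cloud.dataflow.sdk.transforms.windowing.Window;
     import com.google.cloud.dataflow.sdk.values.KV;
     import com.google.cloud.dataflow.sdk.values.PCollection;

     // Assigns each element to a fixed one-hour window; a later GroupByKey then
     // groups per key *and* per window, which is what makes grouping an
     // unbounded PCollection well defined.
     static PCollection<KV<String, String>> windowHourly(PCollection<KV<String, String>> tags) {
       return tags.apply(Window.<KV<String, String>>into(FixedWindows.of(Duration.standardHours(1))));
     }

     // Other window functions from the slide:
     //   SlidingWindows.of(Duration.standardMinutes(60)).every(Duration.standardMinutes(5))
     //   Sessions.withGapDuration(Duration.standardMinutes(10))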

  31. Event Time Skew (slide figure: event time vs. processing time, with the watermark tracking the skew)
