From Zero to Portability?



  1. From Zero to Portability?
     Apache Beam's Journey to Cross-Language Data Processing
     Maximilian Michels, mxm@apache.org, @stadtlegende, maximilianmichels.com
     FOSDEM 2019

  2. What is Beam? What does portability mean? How do we achieve portability? Are we there yet?

  3. What is Beam?
     • Apache open-source project
     • Parallel/distributed data processing
     • Unified programming model for batch/stream processing
     • Execution engine of your choice ("Uber API")
     • Programming language of your choice

  4. Beam Vision
     [Diagram] Write Pipeline (SDKs), Translate (Runners), Execution Engines: Direct, Apache Samza, Apache Flink, Google Cloud Dataflow, Apache Spark, Apache Apex, Apache Gearpump, Apache Nemo (incubating)

  5. The Beam API
     [Diagram] A Pipeline leads from IN to OUT through alternating Transforms and PCollections
     1. Pipeline p = Pipeline.create(options)
     2. PCollection pCol1 = p.apply(transform).apply(…).…
     3. PCollection pCol2 = pCol1.apply(transform)
     4. p.run()

  6. Transforms
     [Diagram] PCollection, Transform, PCollection
     • Transforms can be primitive or composite
     • Composite transforms expand to primitive
     • Only small set of primitive transforms
     • Runners can support specialized translation of composite transforms, but don't have to
     Primitive transforms: ParDo, GroupByKey, AssignWindows, Flatten
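The expansion rule above can be sketched in plain Python. This is only an illustration: the `Transform` class, the `translate` helper, and the decomposition of `CountPerElement` into ParDo, GroupByKey, ParDo are hypothetical and are not taken from the Beam SDK.

```python
# Pure-Python sketch; these names are hypothetical and are not the Beam SDK.
PRIMITIVES = {"ParDo", "GroupByKey", "AssignWindows", "Flatten"}


class Transform:
    def __init__(self, name, parts=()):
        self.name = name
        self.parts = list(parts)  # empty for primitive transforms

    def expand(self):
        """Return the flat list of primitive names this transform consists of."""
        if not self.parts:
            if self.name not in PRIMITIVES:
                raise ValueError(f"unknown primitive: {self.name}")
            return [self.name]
        expanded = []
        for part in self.parts:
            expanded.extend(part.expand())
        return expanded


def translate(transform, specialized=frozenset()):
    """Use a specialized translation if the runner has one for this
    composite; otherwise fall back to the primitive expansion."""
    if transform.name in specialized:
        return [transform.name]
    return transform.expand()


# Assumed decomposition of a word-count style composite (illustration only).
count_per_element = Transform(
    "CountPerElement",
    [Transform("ParDo"), Transform("GroupByKey"), Transform("ParDo")],
)

print(translate(count_per_element))
print(translate(count_per_element, specialized={"CountPerElement"}))
```

A runner without a specialized translation only ever sees the primitives, which is why the set of primitives can stay small.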

  7. Core Primitive Transforms
     ParDo ("Map/Reduce Phase"): input -> output
         "to"  -> KV<"to", 1>
         "be"  -> KV<"be", 1>
         "or"  -> KV<"or", 1>
         "not" -> KV<"not", 1>
         "to"  -> KV<"to", 1>
         "be"  -> KV<"be", 1>
     GroupByKey ("Shuffle Phase"): KV<k,v>… -> KV<k, [v…]>
         KV<"to", [1,1]>
         KV<"be", [1,1]>
         KV<"or", [1]>
         KV<"not", [1]>
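A minimal pure-Python sketch of these two primitives, reproducing the example above. The functions `par_do` and `group_by_key` are illustrative stand-ins, not the Beam API, and they ignore distribution and windowing entirely.

```python
from collections import defaultdict


def par_do(elements, fn):
    """Apply fn to each element independently; fn returns zero or more outputs."""
    for element in elements:
        yield from fn(element)


def group_by_key(pairs):
    """Collect all values that share a key, as in the shuffle phase."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


words = ["to", "be", "or", "not", "to", "be"]
pairs = par_do(words, lambda word: [(word, 1)])
grouped = group_by_key(pairs)
print(grouped)
```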

  8. WordCount: Raw
     pipeline
       .apply(Create.of("hello", "hello", "fosdem"))
       .apply(ParDo.of(
         new DoFn<String, KV<String, Integer>>() {
           @ProcessElement
           public void processElement(ProcessContext ctx) {
             KV<String, Integer> outputElement = KV.of(ctx.element(), 1);
             ctx.output(outputElement);
           }
         }))
       .apply(GroupByKey.create())
       .apply(ParDo.of(
         new DoFn<KV<String, Iterable<Integer>>, KV<String, Long>>() {
           @ProcessElement
           public void processElement(ProcessContext ctx) {
             long count = 0;
             for (Integer wordCount : ctx.element().getValue()) {
               count += wordCount;
             }
             KV<String, Long> outputElement = KV.of(ctx.element().getKey(), count);
             ctx.output(outputElement);
           }
         }));

  9. Excuse me, that was ugly as hell

  10. WordCount: Composite Transforms
      pipeline
        .apply(Create.of("hello", "fellow", "fellow"))
        .apply(MapElements.via(
          new SimpleFunction<String, KV<String, Integer>>() {
            @Override
            public KV<String, Integer> apply(String input) {
              return KV.of(input, 1);
            }
          }))
        .apply(Sum.integersPerKey());
      MapElements and Sum.integersPerKey are composite transforms

  11. WordCount: More Composite Transforms
      pipeline
        .apply(Create.of("hello", "fellow", "fellow"))
        .apply(Count.perElement());
      Count.perElement is a composite transform

  12. Python to the Rescue
      (p
       | beam.Create(['hello', 'hello', 'fosdem'])
       | beam.Map(lambda word: (word, 1))
       | beam.GroupByKey()
       | beam.Map(lambda kv: (kv[0], sum(kv[1])))
      )

  13. Python to the Rescue
      (p
       | beam.Create(['hello', 'hello', 'fosdem'])
       | beam.Map(lambda word: (word, 1))
       | beam.CombinePerKey(sum)
      )
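What the last step computes can be sketched without Beam: group the values by key, then reduce each group with the given function. The helper `combine_per_key` below is a hypothetical stand-in written for this note, not Beam's implementation.

```python
from collections import defaultdict


def combine_per_key(pairs, combine_fn):
    """Group values by key, then reduce each group with combine_fn."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return [(key, combine_fn(values)) for key, values in groups.items()]


words = ["hello", "hello", "fosdem"]
counts = combine_per_key(((word, 1) for word in words), sum)
print(counts)
```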

  14. There Is Much More to Beam
      • Watermarks
      • Flatten/Combine/Partition/CoGroupByKey (Join)
      • Side Inputs
      • Define your own transforms!
      • Multiple Outputs
      • IOs / Splittable DoFn
      • State
      • Windowing
      • Timers
      • Event Time / Processing Time
      • ...

  15. What is Beam? What does portability mean? How do we achieve portability? Are we there yet?

  16. Portability
      Engine Portability
      • Runners can translate a Beam pipeline for any of these execution engines
      Language Portability
      • Beam pipeline can be generated from any of these languages

  17. Beam Vision
      [Diagram] Write Pipeline (SDKs), Translate (Runners), Execution Engines

  18. Cross-Engine Portability
      1. Set the Runner
         • options.setRunner(FlinkRunner.class)
         • --runner=FlinkRunner
      2. Run!
         • p.run()
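The idea that only the runner setting changes while the pipeline stays the same can be sketched with the standard library. Everything here is an assumption made for illustration: the `RUNNERS` registry, the two stand-in runner functions, and the `run` helper are hypothetical, and only the `--runner=FlinkRunner` flag comes from the slide.

```python
import argparse


# Hypothetical stand-ins for runners; each would translate the same pipeline.
def direct_runner(pipeline):
    return f"DirectRunner executed {pipeline}"


def flink_runner(pipeline):
    return f"FlinkRunner executed {pipeline}"


RUNNERS = {"DirectRunner": direct_runner, "FlinkRunner": flink_runner}


def run(pipeline, argv):
    """Pick the runner from the command line and hand it the unchanged pipeline."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--runner", choices=sorted(RUNNERS), default="DirectRunner")
    options = parser.parse_args(argv)
    return RUNNERS[options.runner](pipeline)


pipeline = ["Create", "Map", "CombinePerKey"]  # the pipeline itself never changes
print(run(pipeline, ["--runner=FlinkRunner"]))
print(run(pipeline, []))
```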

  19. Portability
      Engine Portability
      • Runners can translate a Beam pipeline for any of these execution engines
      Language Portability
      • Beam pipeline can be generated from any of these languages

  20. Why We Want to Use Other Languages
      • Syntax / Expressiveness
      • Communities (Yes!)
      • Libraries (!)

  21. Beam Without Language-Portability
      [Diagram] Write Pipeline (SDKs), Translate (Runners), Execution Engines
      Wait, what?!

  22. Beam With Language-Portability
      [Diagram] Write Pipeline (SDKs), Translate (Runners & language-portability framework), Execution Engines

  23. What is Beam? What does portability mean? How do we achieve portability? Are we there yet?

  24. Language-Portability
      [Diagram] Beam Java, Beam Python, and Beam Go, each with separate Execution layers for Apache Flink, Cloud Dataflow, and Apache Spark

  25. Language-Portability
      [Diagram] Beam Java, Beam Python, and Beam Go share a common Pipeline (Runner API) layer; Execution layers for Apache Flink, Cloud Dataflow, and Apache Spark remain separate

  26. Language-Portability
      [Diagram] Beam Java, Beam Python, and Beam Go share a common Pipeline (Runner API) layer; Apache Flink, Cloud Dataflow, and Apache Spark share a common Execution (Fn API) layer

  27. Without Portability
      [Diagram] SDK and Runner (language-specific) on a Backend (e.g. Flink) running Task 1, Task 2, Task 3, Task N
      All components are tied to a single language

  28. With Portability
      [Diagram] SDK (language-specific); Job Server and Runner (language-agnostic), reached through the Job API and Runner API; Translate step to the Backend (e.g. Flink); Executable Stages among the backend tasks (…, Task 2, Task N), each connected via the Fn API to an SDK Harness

  29. Pipeline Fusion
      • SDK Harness environment comes at a cost
      • Serialization step before and after processing with SDK harness
      • User defined functions should be chained and share the same environment
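The cost argument above can be illustrated with a small simulation. It assumes, purely for illustration, that `pickle` stands in for the serialization between runner and SDK harness and that a stage is just a list of functions; `run_stage` and `run_pipeline` are hypothetical helpers, not part of Beam.

```python
import pickle


def run_stage(functions, elements, stats):
    """Run one stage in a simulated SDK harness: serialize the input,
    apply every function of the stage in order, serialize the output."""
    payload = pickle.dumps(elements)  # runner -> harness
    stats["serializations"] += 1
    data = pickle.loads(payload)
    for fn in functions:
        data = [fn(x) for x in data]
    result = pickle.dumps(data)  # harness -> runner
    stats["serializations"] += 1
    return pickle.loads(result)


def run_pipeline(stages, elements):
    """Run the stages one after another and count serialization steps."""
    stats = {"serializations": 0}
    for functions in stages:
        elements = run_stage(functions, elements, stats)
    return elements, stats["serializations"]


data = [" Hello ", " FOSDEM "]
unfused = [[str.strip], [str.lower], [len]]  # one stage per function
fused = [[str.strip, str.lower, len]]        # all functions chained in one stage

unfused_result, unfused_cost = run_pipeline(unfused, data)
fused_result, fused_cost = run_pipeline(fused, data)
print(unfused_result, unfused_cost)
print(fused_result, fused_cost)
```

Both variants produce the same result, but the fused variant pays for serialization only once on the way in and once on the way out.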

  30. Flink Executable Stage
      [Diagram] Executable Stage: Job Bundle Factory, Stage Bundle Factory, Environment Factory, Remote Bundle. Services between the stage and the SDK Harness: Provisioning, Artifact Retrieval, Progress Report, Input Receivers, State Request, Logging
      • SDK Harness runs
        • in a Docker container (repository can be specified)
        • in a dedicated process (process-based execution)
        • directly in the process (only works if SDK and Runner share the same language)

  31. Cross-Language Pipelines
      • Java SDK has a rich set of IO connectors, e.g. FileIO, KafkaIO, PubSubIO, JDBC, Cassandra, Redis, ElasticsearchIO, …
      • Python SDK has replicated parts of it, i.e. FileIO
      • Are we going to replicate all the others?
      • Solution: Use cross-language pipelines!
      IO connectors shown on the slide:
      File-based: Apache HDFS, Amazon S3, Google Cloud Storage, local filesystems; AvroIO, TextIO, TFRecordIO, XmlIO, TikaIO, ParquetIO
      Messaging: Amazon Kinesis, AMQP, Apache Kafka, Google Cloud Pub/Sub, JMS, MQTT
      Databases: Apache Cassandra, Apache Hadoop InputFormat, Apache HBase, Apache Hive (HCatalog), Apache Kudu, Apache Solr, Elasticsearch (v2.x, v5.x, v6.x), Google BigQuery, Google Cloud Bigtable, Google Cloud Datastore, Google Cloud Spanner, JDBC, MongoDB, Redis

  32. Cross-Language Pipelines
      p = Pipeline()
      (p
       | IoExpansion(io='KafkaIO',
                     configuration={
                       'topic' : 'fosdem',
                       'offset' : 'latest'
                     })
       | …
      )

  33. Cross-Language via Mixed Environments
      [Diagram] SDK with an Expansion Service (Expand); Job Server and Runner, reached through the Job API and Runner API; Translate step to the Execution Engine (e.g. Flink); stages Source, Map, GroupByKey, Count, …; connected via the Fn API to a Java SDK Harness and a Python SDK Harness

  34. What is Beam? What does portability mean? How do we achieve portability? Are we there yet?

  35. Portability
      Language Portability
      Engine Portability
      pretty darn close
