Radically modular data ingestion APIs in Apache Beam


  1. Radically modular data ingestion APIs in Apache Beam Eugene Kirpichov <kirpichov@google.com> Staff Software Engineer

  2. Plan
     01 Intro to Beam: Unified, portable data processing
     02 IO — APIs for data ingestion: What's the big deal
     03 Composable IO: IO as data processing
     04 Splittable DoFn: Missing piece for composable sources
     05 Recap: If you remember two things

  3. 01 Intro to Beam Unified, portable data processing

  4. Lineage:
     (2004) MapReduce: SELECT + GROUP BY
     (2008) FlumeJava: High-level API
     (2013) Millwheel: Deterministic streaming
     (2014) Dataflow: Batch/streaming agnostic, portable across languages & runners
     (2016) Apache Beam: Open, community-driven, vendor-independent

  5. Beam: batch vs. streaming is moot (batch is nearly always part of higher-level streaming)

  6. (image-only slide)

  7. Beam PTransforms: ParDo with a DoFn (good old FlatMap), GroupByKey, and Composite transforms built from them
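
     To make the primitives concrete, here is a minimal sketch assuming the usual Beam Java SDK classes (TextIO, ParDo, MapElements, GroupByKey); the step names and filepattern are placeholders:

       Pipeline p = Pipeline.create(options);

       // Read text files (as in the WordCount example on slide 9).
       PCollection<String> lines = p.apply(TextIO.read().from("gs://.../*"));

       // ParDo with a DoFn: the "good old FlatMap", 0..N outputs per input element.
       PCollection<String> words = lines.apply("SplitWords",
           ParDo.of(new DoFn<String, String>() {
             @ProcessElement
             public void processElement(ProcessContext c) {
               for (String word : c.element().split("\\W+")) {
                 c.output(word);
               }
             }
           }));

       // GroupByKey: the primitive shuffle. Composites like Count.perElement()
       // are built out of ParDo + GroupByKey.
       PCollection<KV<String, Iterable<Long>>> grouped = words
           .apply("PairWithOne", MapElements
               .into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.longs()))
               .via(w -> KV.of(w, 1L)))
           .apply(GroupByKey.<String, Long>create());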

  8. Layers: User code → Libraries of PTransforms, IO → SDK (per language) → Runner

  9. WordCount in Beam:

       Pipeline p = Pipeline.create(options);

       // Read text files
       PCollection<String> lines =
           p.apply(TextIO.read().from("gs://.../*"));

       // Split into words, count
       PCollection<KV<String, Long>> wordCounts = lines
           .apply(FlatMapElements.into(TypeDescriptors.strings())
               .via((String line) -> Arrays.asList(line.split("\\W+"))))
           .apply(Count.perElement());

       // Format, write text files
       wordCounts
           .apply(MapElements.into(TypeDescriptors.strings())
               .via(count -> count.getKey() + ": " + count.getValue()))
           .apply(TextIO.write().to("gs://.../..."));

       p.run();

  10. 02 IO - APIs for data ingestion What's the big deal

  11. Beam IO: Files (Text/Avro/XML/…, on HDFS, S3, GCS), Hadoop, Hive, MQTT, Solr, JDBC, Elasticsearch, Kafka, MongoDb, BigQuery, Kinesis, Redis, BigTable, AMQP, Cassandra, Datastore, Pubsub, HBase, Spanner, JMS

  12. IO is essential: most pipelines move data from X to Y. ETL: Extract, Transform, Load

  13. IO is messy. In E / T / L, the T in the middle is the cozy, pure programming model.

  14. IO is messy. The E and L at the edges are anything but cozy.

  15. IO is messy:
      Read via CSV dump
      Dead-letter failed records
      Read multiple tables in a transaction
      Clean up temp files
      Read tons of small files
      Stream new files
      Preserve filenames
      Skip headers
      Quotas & size limits
      Route to different tables
      Write A, then write B
      Rate limiting / throttling
      Decompress ZIP
      Write to A, then read B
      …

  16. IO is unified (batch/streaming agnostic)
      Classic batch: read files; write files
      Classic streaming: read Kafka; stream to Kafka
      Reality: read files + watch new files; stream files; read Kafka from start + tail
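
      A sketch of the "reality" column, assuming TextIO's watchForNewFiles and KafkaIO from the Beam Java SDK; the filepattern, broker, and topic are placeholders, and whether Kafka starts from the beginning is governed by the consumer's offset-reset setting (not shown):

        // "Read files + watch new files": the classic batch read made streaming by
        // polling the filepattern; stops if no new files appear for an hour.
        PCollection<String> lines = p.apply(
            TextIO.read()
                .from("gs://.../*")
                .watchForNewFiles(
                    Duration.standardSeconds(30),
                    Watch.Growth.afterTimeSinceNewOutput(Duration.standardHours(1))));

        // "Read Kafka from start + tail": one unbounded read of the topic.
        PCollection<KafkaRecord<String, String>> records = p.apply(
            KafkaIO.<String, String>read()
                .withBootstrapServers("broker:9092")
                .withTopic("events")
                .withKeyDeserializer(StringDeserializer.class)
                .withValueDeserializer(StringDeserializer.class));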

  17. IO is unified (batch/streaming agnostic): as the input changes and evolves over time, the output evolves with it, keeping output = f(input). See https://www.infoq.com/presentations/beam-model-stream-table-theory

  18. IO is unforgiving
      Correctness: any bug = data corruption; fault tolerance; exactly-once reads/writes; error handling
      Performance: unexpected scale; throughput, latency, memory, parallelism

  19. IO is a chance to do better
      Nobody writes a paper about their IO API (MapReduce paper: 3 paragraphs; Spark, Flink, Beam: 0).
      "I made a bigdata programming model." "Cool, how does data get in and out?" "Brb."
      Requirements are too diverse to support everything out of the box; APIs are too rigid to let users do it themselves.

  20. IO is essential, but messy and unforgiving. It begs for good abstractions.

  21. 03 Composable IO IO as data processing

  22. Traditionally: ad-hoc API at the pipeline boundary
      "Source" (InputFormat / Receiver / SourceFunction / ...) → "Transform" → "Sink" (OutputFormat / Sink / SinkFunction / ...)
      Source configuration: filepattern, query string, topic name, …
      Sink configuration: directory, table name, topic name, …

  23. Traditionally: ad-hoc API at the pipeline boundary. Narrow APIs are not hackable:
      "My filenames come on a Kafka topic."
      "I have a table per client, plus a table of clients."
      "I want to know which records failed to write."
      "I want to kick off another transform after writing."
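
      When reads are ordinary transforms, the first wish is just a pipeline. A sketch using Beam's KafkaIO and FileIO/TextIO transforms; the broker and topic names are placeholders:

        // Filenames (or globs) arrive as Kafka message values; read the files they point to.
        PCollection<String> filenames = p
            .apply(KafkaIO.<String, String>read()
                .withBootstrapServers("broker:9092")
                .withTopic("filenames")
                .withKeyDeserializer(StringDeserializer.class)
                .withValueDeserializer(StringDeserializer.class)
                .withoutMetadata())
            .apply(Values.create());

        PCollection<String> lines = filenames
            .apply(FileIO.matchAll())     // each filename/glob -> matching files
            .apply(FileIO.readMatches())  // each match -> a readable file handle
            .apply(TextIO.readFiles());   // each file -> its lines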

  24. IO is just another data processing task:
      Globs → Parse files → Records
      Parameters → Execute queries → Rows
      Rows → Import to database → Import statistics (plus invalid rows)

  25. IO is just another data processing task

  26. Composability (aka hackability)
      Unified batch/streaming
      Transparent fault tolerance
      Scalability (read 1M files = process 1M elements)
      Monitoring, debugging
      Orchestration (do X, then read/write, then do Y)
      Future features
      The rest of the programming model has been getting this for free all along. Join the party.

  27. IO in Beam: just transforms

  28. BigQueryIO.write(): write to files, call the import API
      Dynamic routing
      Sharding to fit under API limits
      Cleanup
      …
      Pretty complex, but arbitrarily powerful
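
      Because it is just a composite transform, its overall shape can be written down as an ordinary PTransform. The sketch below is a hypothetical, heavily simplified composite (not the real BigQueryIO internals); WriteShardToTempFile, InvokeImportJob, and CleanupTempFiles stand in for DoFns that would do the actual work:

        // Hypothetical "write to files, then call the import API" composite, to show the shape.
        class WriteViaImport extends PTransform<PCollection<TableRow>, PCollection<String>> {
          @Override
          public PCollection<String> expand(PCollection<TableRow> rows) {
            return rows
                .apply("WriteTempFiles", ParDo.of(new WriteShardToTempFile())) // rows -> temp file names
                .apply("InvokeImportApi", ParDo.of(new InvokeImportJob()))     // file names -> job ids
                .apply("Cleanup", ParDo.of(new CleanupTempFiles()));           // delete temp files, pass job ids through
          }
        }

        // Usable like any other transform; each stage is separately observable and testable.
        rows.apply("WriteToWarehouse", new WriteViaImport());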

  29. Composability ⇒ Modularity. What can be composed, can be decomposed. (Image credit: Wikimedia)

  30. Read text file globs: Glob → Expand globs (+ watch for new results, like tail -f) → Filename → Read as text file → String
      Read Kafka topic: Topic → List partitions (+ watch for new results) → (Topic, partition) → Read partition → Message
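
      This decomposition is visible in the Beam Java SDK itself: FileIO.match() expands globs (and can keep watching them), FileIO.readMatches() opens the files, and TextIO.readFiles() parses them. A sketch with a placeholder filepattern and poll interval:

        PCollection<String> lines = p
            .apply(FileIO.match()
                .filepattern("gs://.../*")
                // the "watch new results" box: keep polling the glob, like tail -f
                .continuously(Duration.standardSeconds(30), Watch.Growth.never()))
            .apply(FileIO.readMatches())   // matched metadata -> readable files
            .apply(TextIO.readFiles());    // readable files -> lines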

  31. Read DB via CSV: Table → Invoke dump → Glob → Read text file globs → Row → Parse CSV → Row
      Write DB via CSV: Row → Format CSV → Write to text files → Filename → Invoke import → Done signal

  32. Consistent import into 2 databases: Row → Import to DB#1 → Done signal; Row → Wait (on DB#1's done signal) → Import to DB#2 → Done signal
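
      Beam's Wait.on() transform expresses exactly this orchestration. A sketch in which ParseCsv and ImportToDb are hypothetical (a DoFn and a composite whose output is used purely as a done signal); Wait.on() is the real SDK transform:

        // Rows to import, e.g. parsed from the CSV read on the previous slide.
        PCollection<Row> rows = csvLines.apply("ParseCsv", ParDo.of(new ParseCsv()));

        PCollection<Void> db1Done = rows.apply("ImportToDb1", new ImportToDb("db1"));

        // Wait.on() holds the second import back until the first one's signal completes
        // (per window, when streaming), giving "write A, then write B".
        rows.apply(Wait.on(db1Done))
            .apply("ImportToDb2", new ImportToDb("db2"));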

  33. What can be composed, can be decomposed.

  34. What this means for you
      Library authors: ignore native IO APIs if possible; unify batch & streaming; decompose ruthlessly
      Users: ignore native IO APIs if possible; assemble what you need from powerful primitives

  35. 04 Splittable DoFn Missing piece for composable sources

  36. Typical IO transforms: Read = Split + Read each; Write = ParDo { REST call }

  37. Read Kafka topic: Topic → List partitions (+ watch new results) → (Topic, partition) → Read partition → Message (infinite output per input)
      Read text file globs: Glob → Expand globs (+ watch new results) → Filename → Read as text file → String (no parallelism within a file*)
      *No Shard Left Behind: Straggler-free data processing in Cloud Dataflow

  38. What ParDo can't do: in a DoFn, per-element work is an indivisible black box ⇒ it can't be infinite and can't be parallelized further

  39. Splittable DoFn (SDF): partial work via restrictions. Element: what work. Restriction: what part of the work. A DoFn processes an Element; an SDF processes a dynamically splittable (Element, Restriction) pair. Design: s.apache.org/splittable-do-fn

  40. Example restrictions
      Reading splittable files: element = filename; restriction = (start offset, end offset)
      Reading Bigtable: element = (table, filter, columns); restriction = (start key, end key)
      Reading Kafka: element = (topic, partition); restriction = (start offset, end offset)

  41. Splitting a restriction: one (element, restriction) pair being processed by an SDF is split into several (element, restriction) pairs, each processed by its own SDF invocation

  42. (image-only slide)

  43. Unbounded work per element, processed as a sequence of finite restrictions

  44. Anatomy of an SDF
      How to process 1 element? Read a text file: (String filename) → records
      How to do it in parts? Reading byte sub-ranges
      How to describe 1 part (a restriction)? {long start, long end}
      How to do this part of this element?
        f = open(element); f.seek(start);
        while (f.tell() < end) { yield f.readLine(); }
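
      In the Beam Java SDK this anatomy maps onto DoFn method annotations. A simplified sketch assuming the usual SDK imports; SeekableLineReader and its methods are hypothetical stand-ins for a reader that can seek to a byte offset, and a real file SDF also has to deal with records straddling range boundaries:

        class ReadTextFn extends DoFn<String, String> {   // filename -> lines

          @GetInitialRestriction
          public OffsetRange initialRestriction(@Element String file) throws IOException {
            // The whole file; OffsetRange has a default tracker, so no @NewTracker is needed.
            return new OffsetRange(0, FileSystems.matchSingleFileSpec(file).sizeBytes());
          }

          @ProcessElement
          public void processElement(
              @Element String file,
              RestrictionTracker<OffsetRange, Long> tracker,
              OutputReceiver<String> out) throws IOException {
            try (SeekableLineReader r = SeekableLineReader.open(file)) {   // hypothetical reader
              r.seek(tracker.currentRestriction().getFrom());
              // Claim each record's offset before emitting it; the claim fails once the
              // runner has split the rest of the range off as a residual.
              while (tracker.tryClaim(r.currentOffset())) {
                out.output(r.readLine());
              }
            }
          }
        }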

  45. Dynamic splitting of restrictions (basically work stealing): while process(e, r) is running, the runner asks it to split; the restriction divides into a primary (which keeps running in the current call) and a residual (which can start in parallel as a new process(e, r) call)

  46. Reading Avro with an SDF:

        class ReadAvroFn extends DoFn<Filename, AvroRecord> {
          void processElement(ProcessContext c, OffsetRange range) {
            try (AvroReader r = Avro.open(c.element())) {
              for (r.seek(range.start());
                   r.currentBlockOffset() < range.end();
                   r.readNextBlock()) {
                for (AvroRecord record : r.currentBlock()) {
                  c.output(record);
                }
              }
            }
          }
        }

  47. Reading Avro with an SDF (the range can change concurrently):

        class ReadAvroFn extends DoFn<Filename, AvroRecord> {
          void processElement(ProcessContext c, OffsetRange range) {
            try (AvroReader r = Avro.open(c.element())) {
              for (r.seek(range.start());
                   r.currentBlockOffset() < range.end();  // range can change concurrently
                   r.readNextBlock()) {
                for (AvroRecord record : r.currentBlock()) {
                  c.output(record);
                }
              }
            }
          }
        }
