Apache Beam: A unified programming model for Big Data
Who am I? Jean-Baptiste Onofre <jbonofre@apache.org> <jbonofre@talend.com> | @jbonofre | http://blog.nanthrax.net
Member of the Apache Software Foundation
Fellow/Software Architect at Talend
PMC member on ~20 Apache projects, from system integration & containers (Karaf, Camel, ActiveMQ, Archiva, Aries, ServiceMix, …) to big data (Beam, CarbonData, Falcon, Gearpump, Lens, …)
Apache Beam origin: a lineage of Google-internal systems (MapReduce, Colossus, BigTable, PubSub, Dremel, Spanner, Megastore, Millwheel, Flume) led to Google Cloud Dataflow, which in turn became Apache Beam
Beam model: asking the right questions
1. What results are calculated?
2. Where in event time are results calculated?
3. When in processing time are results materialized?
4. How do refinements of results relate?
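These four questions map directly onto the Java SDK. A minimal sketch, with hypothetical names and window sizes, assuming a keyed input PCollection<KV<String, Long>> called input (imports from org.apache.beam.sdk.transforms.windowing and org.joda.time elided):

  // What:  the transforms applied (here, a Sum aggregation)
  // Where: event-time windowing
  // When:  triggers relative to the watermark
  // How:   the accumulation mode used when results are refined
  PCollection<KV<String, Long>> results = input
      .apply(Window.<KV<String, Long>>into(
              FixedWindows.of(Duration.standardMinutes(2)))        // Where
          .triggering(AfterWatermark.pastEndOfWindow()             // When
              .withLateFirings(AfterPane.elementCountAtLeast(1)))
          .withAllowedLateness(Duration.standardMinutes(30))
          .accumulatingFiredPanes())                               // How
      .apply(Sum.longsPerKey());                                   // What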
Customizing What / Where / When / How: 1. Classic Batch → 2. Windowed Batch → 3. Streaming → 4. Streaming + Accumulation
What is Apache Beam?
1. Unified model (Batch + strEAM): What / Where / When / How
2. SDKs (Java, Python, …) & DSLs (Scala, …)
3. Runners for existing distributed processing backends (Google Dataflow, Spark, Flink, …)
4. IOs: data store Sources / Sinks
Apache Beam vision
1. End users: who want to write pipelines in a language that's familiar (Beam Java, Python, other languages)
2. SDK/DSL writers: who want to make Beam concepts available in new languages (Beam Model: Pipeline Construction)
3. Runner writers: who have a distributed processing environment and want to support Beam pipelines (Beam Model: Fn Runners; execution on Apache Flink, Cloud Dataflow, Apache Spark)
Apache Beam - SDKs & DSLs
SDKs (APIs based on the Beam Model):
1. Current: Java, Python
2. Future (possible): Go, Ruby, etc.
DSLs (Domain-Specific Languages based on the Beam Model):
1. Current: Scio (Scala API)
2. Future (ideas): Streaming SQL (Calcite), Machine Learning, Complex Event Processing
Apache Beam SDK concepts
1. Pipeline - a data processing job as a directed graph of transformations
2. PCollection - the data inside a pipeline
3. PTransform - a transformation step in the pipeline
   a. IO transforms - read from a Source or write to a Sink
   b. Core transforms - common transformations provided (ParDo, GroupByKey, …)
   c. Composite transforms - combine multiple transforms
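The three concepts together, as a minimal runnable sketch (the element values are illustrative):

  import org.apache.beam.sdk.Pipeline;
  import org.apache.beam.sdk.options.PipelineOptionsFactory;
  import org.apache.beam.sdk.transforms.Create;
  import org.apache.beam.sdk.values.PCollection;

  // 1. Pipeline: the directed graph of transformations
  Pipeline p = Pipeline.create(PipelineOptionsFactory.create());
  // 2. PCollection: the data flowing through the pipeline
  PCollection<String> words = p.apply(Create.of("hello", "beam"));
  // 3. PTransform: Create above is one; more are chained with apply()
  p.run();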
Apache Beam - Pipeline Data processing pipeline (executed via a Beam runner) Read Write PTransform PTransform PTransform PTransform (source) (sink)
Apache Beam - PCollection
1. A PCollection is immutable, does not support random access to elements, and belongs to a Pipeline
2. Each element in a PCollection has a Timestamp (commonly set by the IO Source)
3. A Coder supports different data serializations
4. Bounded (batch) or Unbounded (streaming), depending on the IO Source
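A short sketch of points 2 and 3 (element values are illustrative): an explicit Coder on a Create transform, and timestamps assigned with WithTimestamps:

  // Explicit Coder for the PCollection's elements
  PCollection<String> events = p
      .apply(Create.of("a", "b").withCoder(StringUtf8Coder.of()))
      // Assign a Timestamp to each element (an IO Source usually does this)
      .apply(WithTimestamps.of((String s) -> new org.joda.time.Instant(0)));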
Apache Beam - PTransform
1. PTransforms are operations that transform data
2. They receive one or multiple PCollections and produce one or multiple PCollections
3. They must be Serializable
4. They should be thread-compatible (if you create your own threads, you must synchronize them)
5. Idempotency is not required but is recommended (see the DoFn sketch below)
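A minimal DoFn sketch illustrating these rules (name and parsing logic are hypothetical):

  // Serializable by design (DoFn implements Serializable);
  // stateless, so it is trivially thread-compatible and idempotent
  static class ExtractLocationFn extends DoFn<String, String> {
    @ProcessElement
    public void processElement(ProcessContext c) {
      String[] fields = c.element().split("\\t");
      if (fields.length > 1) {
        c.output(fields[1]);   // same input always yields the same output
      }
    }
  }
  // used as: .apply(ParDo.of(new ExtractLocationFn()))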
Apache Beam - IO Transforms
1. IOs read/write data as PCollections (Source/Sink)
2. They support Bounded and/or Unbounded PCollections
3. Extensible API to create custom sources & sinks
4. They deal with timestamps, watermarks, deduplication, and read/write parallelism (see the sketch below)
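For example, KafkaIO is naturally unbounded, but can be turned into a bounded source for batch-style runs. A sketch with illustrative broker/topic names (deserializer setup as in recent Beam releases; details vary by version):

  PCollection<KV<String, String>> records = p.apply(
      KafkaIO.<String, String>read()
          .withBootstrapServers("broker:9092")            // illustrative
          .withTopics(Arrays.asList("events"))            // illustrative
          .withKeyDeserializer(StringDeserializer.class)
          .withValueDeserializer(StringDeserializer.class)
          .withMaxNumRecords(1000)   // bounds the otherwise unbounded source
          .withoutMetadata());       // drop Kafka metadata, keep KV pairs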
Agenda 1. Evolution of the Big Data programming models 2. The Beam approach 3. Apache Beam
Apache Beam - Current IOs
Ready: MQTT, JDBC, Mongo / GridFS, JMS, Kafka, Kinesis, Elasticsearch, HBase, File, Avro, Google Cloud Storage, BigQuery, BigTable, DataStore, HDFS
WIP: Hive, Cassandra, Redis, RabbitMQ, …
Apache Beam - Pipeline with IO Example

  public static void main(String[] args) {
    // Create a pipeline parameterized by command line flags, e.g. --runner
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    // (servers, topics, esServer, index, type are defined elsewhere)
    p.apply(KafkaIO.read().withBootstrapServers(servers)
                          .withTopics(topics))                // Read input
     .apply(new YourFancyFn())                                // Do some processing
     .apply(ElasticsearchIO.write().withAddress(esServer)
                           .withIndex(index).withType(type)); // Write output
    // Run the pipeline.
    p.run();
  }
What are you computing? Element-Wise, Aggregating, Composite (the What in What / Where / When / How)
Apache Beam - Programming model in the SDK
Element-wise: ParDo, MapElements, FlatMapElements, Filter, WithKeys, Keys, Values
Grouping: GroupByKey, Combine -> Reduce (Sum, Count, Min, Max, Mean, …)
Windowing/Triggers: FixedWindows, GlobalWindows, SlidingWindows, Sessions, AfterWatermark, AfterProcessingTime, AfterPane
(see the sketch below)
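A sketch chaining element-wise and grouping primitives (the input name lines is illustrative; MapElements.into(...) as in Beam 2.x):

  PCollection<KV<String, Long>> counts = lines
      .apply(Filter.by((String line) -> !line.isEmpty()))      // element-wise
      .apply(MapElements.into(TypeDescriptors.strings())
          .via((String line) -> line.toLowerCase()))           // element-wise
      .apply(Count.perElement());                              // grouping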
Apache Beam - Example - GDELT Events by location

  Pipeline pipeline = Pipeline.create(options);
  // Read events from a text file and parse them.
  pipeline
    .apply("GDELTFile", TextIO.Read.from(options.getInput()))
    // Extract location from the fields
    .apply("ExtractLocation", ParDo.of(...))
    // Count events per location
    .apply("CountPerLocation", Count.<String>perElement())
    // Reformat KV as a String
    .apply("StringFormat", MapElements.via(...))
    // Write to result files
    .apply("Results", TextIO.Write.to(options.getOutput()));
  // Run the batch pipeline.
  pipeline.run();
Apache Beam - Runners / Execution Engines
Runners "translate" the code to a target runtime (the runner itself doesn't provide the runtime).
Many runners are tied to other top-level Apache projects, such as Apache Flink and Apache Spark.
Because of this, runners can run on-premise (e.g. on your local Flink cluster) or in a public cloud (e.g. using Google Cloud Dataproc or Amazon EMR).
Apache Beam treats runners as a top-level use case (with APIs, support, etc.) so they can be developed with minimal friction, for maximum pipeline portability.
Runners
Ready: Apache Flink, Apache Spark, Google Cloud Dataflow (Managed / NoOps), Direct Runner (Local)
WIP: Apache MapReduce, Apache Karaf, Apache Apex, Apache Gearpump
Same code, different runners & runtimes (see the flag sketch below)
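As a sketch, the runner is picked with pipeline options at launch time; the --runner names below are as in current Beam releases, and the extra flags/placeholders are illustrative:

  --runner=DirectRunner                                  (local testing)
  --runner=FlinkRunner --flinkMaster=<host:port>         (Flink cluster)
  --runner=SparkRunner --sparkMaster=<master-url>
  --runner=DataflowRunner --project=<gcp-project> --tempLocation=gs://<bucket>/tmp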
Apache Beam - Use cases
Apache Beam is a great choice for both batch and stream processing, and can handle bounded and unbounded datasets.
Batch can focus on ETL/ELT, catch-up processing, daily aggregations, and so on.
Stream can focus on real-time processing on a record-by-record basis.
Real use cases:
- Data processing, both batch and stream processing
- Real-time event processing from IoT devices
- Fraud detection, …
Why Apache Beam?
1. Portable - the same code runs with different runners (agnostic) and backends: on-premise, in the cloud, or locally
2. Unified - the same unified model for batch and stream processing
3. Advanced features - event windowing, triggering, watermarking, lateness, etc.
4. Extensible model and SDK - extensible API; you can define custom sources to read and write in parallel
Growing the Beam Community
Collaborate - Beam is becoming a community-driven effort with participation from many organizations and contributors.
Grow - We want to grow the Beam ecosystem and community with active, open involvement so Beam is a part of the larger OSS ecosystem.
Learn More!
Apache Beam: http://beam.apache.org
Join the Beam mailing lists: user-subscribe@beam.apache.org, dev-subscribe@beam.apache.org
Follow @ApacheBeam on Twitter
Thank You!