Stream Processing with Apache Flink
QCon London, March 7, 2016
Robert Metzger, @rmetzger_, rmetzger@apache.org

Talk overview
- My take on the stream processing space, and how it changes the way we think about data
- Discussion of unique building blocks of Flink
- Benchmarking Flink, by extending a benchmark from Yahoo!

Apache Flink
- Apache Flink is an open source stream processing framework
  • Low latency
  • High throughput
  • Stateful
  • Distributed
- Developed at the Apache Software Foundation, 1.0.0 release available soon, used in production

Entering the streaming era

Streaming is the biggest change in data infrastructure since Hadoop

1. Radically simplified infrastructure
2. Do more with your data, faster
3. Can completely subsume batch

Traditional data processing
- Log analysis example using a batch processor
[Diagram: web servers write logs -> periodic (custom) or continuous ingestion (Flume) into HDFS / S3 -> periodic batch job(s) for log analysis, triggered by a job scheduler (Oozie) -> serving layer]

Traditional data processing
- Latency from log event to serving layer is usually in the range of hours
[Diagram: web servers write logs -> periodic (custom) or continuous ingestion (Flume) into HDFS / S3 -> periodic batch job(s) for log analysis, triggered by a job scheduler (Oozie) every 2 hrs -> serving layer]

Data processing without stream processor
- This architecture is a hand-crafted micro-batch model
[Diagram: web servers write logs -> HDFS / S3 -> batch job(s) for log analysis; batch interval: ~2 hours]
[Diagram: latency scale from hours through minutes and seconds to milliseconds; approaches ordered from highest to lowest latency: manually triggered periodic batch job, batch processor with micro-batches, stream processor]

Downsides of stream processing with a batch engine
- Very high latency (hours)
- Complex architecture required:
  • Periodic job scheduler (e.g. Oozie)
  • Data loading into HDFS (e.g. Flume)
  • Batch processor
  • (When using the “lambda architecture”: a stream processor)
- All these components need to be implemented and maintained
- Backpressure: How does the pipeline handle load spikes?

Log event analysis using a stream processor
- Stream processors make it possible to analyze events with sub-second latency.
[Diagram: web servers forward events immediately to the pub/sub bus -> high throughput publish/subscribe bus -> stream processor processes events in real time & updates the serving layer -> serving layer]
- Options for the publish/subscribe bus:
  • Apache Kafka
  • Amazon Kinesis
  • MapR Streams
  • Google Cloud Pub/Sub
- Options for the stream processor:
  • Apache Flink
  • Apache Beam
  • Apache Samza

Real-world data is produced in a continuous fashion. New systems like Flink and Kafka embrace the streaming nature of data.
[Diagram: web server -> Kafka topic -> stream processor]

What do we need for replacing the “batch stack”?
[Diagram: web servers forward events immediately to the pub/sub bus -> high throughput publish/subscribe bus -> stream processor processes events in real time & updates the serving layer -> serving layer]
- Requirements for the stream processor:
  • Windowing / Out of order events
  • State handling
  • Fault tolerance and correctness
  • Low latency, High throughput
- Options for the publish/subscribe bus:
  • Apache Kafka
  • Amazon Kinesis
  • MapR Streams
  • Google Cloud Pub/Sub
- Options for the stream processor:
  • Apache Flink
  • Google Cloud Dataflow

Apache Flink stack
[Diagram: layered stack, top to bottom]
- Libraries and compatibility layers: Apache Beam, Hadoop M/R, Cascading, Storm API, Zeppelin, SAMOA, Table, Gelly, CEP, ML
- APIs: DataStream (Java / Scala), DataSet (Java / Scala)
- Streaming dataflow runtime
- Deployment: Local, Cluster, YARN

Needed for the use case
[Diagram: the Apache Flink stack: libraries and compatibility layers (Apache Beam, Hadoop M/R, Cascading, Storm API, Zeppelin, SAMOA, Table, Gelly, CEP, ML) on top of DataStream (Java / Scala) and DataSet (Java / Scala), the streaming dataflow runtime, and Local / Cluster / YARN deployment]

Windowing / Out of order events
[Diagram: the four requirements: Windowing / Out of order events; State handling; Fault tolerance and correctness; Low latency, High throughput]

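Building windows from a stream
[Diagram: web server -> Kafka topic -> stream processor]
- “Number of visitors in the last 5 minutes per country”

    import org.apache.flink.api.java.tuple.Tuple;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.datastream.KeyedStream;
    import org.apache.flink.streaming.api.windowing.time.Time;

    // create stream from Kafka source (KafkaConsumer is slide shorthand for a Kafka source)
    DataStream<LogEvent> stream = env.addSource(new KafkaConsumer());
    // group by country; keyBy returns a KeyedStream, which is what offers timeWindow
    KeyedStream<LogEvent, Tuple> keyedStream = stream.keyBy("country");
    // window of size 5 minutes
    keyedStream.timeWindow(Time.minutes(5))
        // do operations per window
        .apply(new CountPerWindowFunction());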
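
To make the windowing idea concrete without a Flink cluster, here is a minimal plain-Java sketch of what "count visitors per country in 5-minute windows" computes. It is not Flink code: the class `TumblingWindowCountSketch`, the `LogEvent` type and its fields are hypothetical names introduced for illustration, and the events are processed as a finite list rather than an unbounded stream.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of tumbling-window counting; not Flink code.
final class TumblingWindowCountSketch {

    // Hypothetical event type: a country code plus an event timestamp in milliseconds.
    static final class LogEvent {
        final String country;
        final long timestampMillis;

        LogEvent(String country, long timestampMillis) {
            this.country = country;
            this.timestampMillis = timestampMillis;
        }
    }

    // Start of the tumbling window that contains the given timestamp.
    static long windowStart(long timestampMillis, long windowSizeMillis) {
        return timestampMillis - Math.floorMod(timestampMillis, windowSizeMillis);
    }

    // Counts events per window start and, within each window, per country.
    static Map<Long, Map<String, Long>> countPerWindow(List<LogEvent> events, long windowSizeMillis) {
        Map<Long, Map<String, Long>> counts = new TreeMap<>();
        for (LogEvent event : events) {
            long start = windowStart(event.timestampMillis, windowSizeMillis);
            counts.computeIfAbsent(start, k -> new TreeMap<>())
                  .merge(event.country, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        long fiveMinutes = 5 * 60 * 1000L;
        List<LogEvent> events = Arrays.asList(
            new LogEvent("UK", 10_000L),
            new LogEvent("UK", 20_000L),
            new LogEvent("DE", 30_000L),
            new LogEvent("UK", 310_000L)); // falls into the second window
        System.out.println(countPerWindow(events, fiveMinutes));
    }
}
```

A stream processor does the same grouping incrementally and in parallel, and has to decide when a window is complete, which is where time handling comes in.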

Building windows: Execution

    // window of size 5 minutes
    keyedStream.timeWindow(Time.minutes(5));

[Diagram: job plan: Kafka Source -> group by country -> Time Window Operator; parallel execution on the cluster with several source (S) and window (W) instances]

Window types in Flink
- Tumbling windows
- Sliding windows
- Custom windows with window assigners, triggers and evictors
Further reading: http://flink.apache.org/news/2015/12/04/Introducing-windows.html
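
The difference between tumbling and sliding windows can be sketched as a window assigner: a function from an element's timestamp to the windows it belongs to. The plain-Java sketch below is an illustration of the general idea, not Flink's implementation; the class and method names are hypothetical. A window is identified by its start, covers `[start, start + size)`, and a tumbling window is the special case `slide == size`.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration of sliding-window assignment; not Flink code.
final class SlidingWindowAssignerSketch {

    // Returns the start of every window [start, start + size) that contains the timestamp.
    // Window starts are multiples of the slide; newest window first.
    static List<Long> windowStarts(long timestamp, long size, long slide) {
        if (size <= 0 || slide <= 0) {
            throw new IllegalArgumentException("size and slide must be positive");
        }
        List<Long> starts = new ArrayList<>();
        long lastStart = timestamp - Math.floorMod(timestamp, slide);
        for (long start = lastStart; start > timestamp - size; start -= slide) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        // Sliding: size 10, slide 5, so each element lands in two overlapping windows.
        System.out.println(windowStarts(7, 10, 5));
        // Tumbling: slide equals size, so each element lands in exactly one window.
        System.out.println(windowStarts(7, 10, 10));
    }
}
```

With a slide smaller than the size, every element is counted in `size / slide` windows, which is why sliding windows hold more state than tumbling windows of the same size.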

Time-based windows
[Diagram: a stream of event data; the accessTime field carries the time of the event]

    {
      "accessTime": "1457002134",
      "userId": "1337",
      "userLocation": "UK"
    }

- Windows are created based on the real world time when the event occurred

A look at the reality of time
[Diagram: Kafka -> stream processor, with network delays and out of sync clocks; a stream carrying event times 33, 11, 21, 15, 9 and a marker 15 that guarantees that no event with time <= 15 will arrive afterwards; window between 0 and 15]
- Events arrive out of order in the system
- Use-case specific low watermarks for time tracking
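
The role of the low watermark can be sketched in plain Java. The example below is an illustration under stated assumptions, not Flink's implementation: `WatermarkWindowSketch` is a hypothetical class holding a single window that covers event times 0 to 15, and the sample values are borrowed from the slide with an assumed arrival order. Events are buffered as they arrive, in any order, and the window is emitted only once a watermark promises that no event with a time inside the window can still arrive.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Optional;

// Plain-Java illustration of a watermark closing one event-time window; not Flink code.
final class WatermarkWindowSketch {
    private final long windowEnd;                  // inclusive upper bound of the window
    private final List<Long> buffered = new ArrayList<>();
    private boolean fired = false;

    WatermarkWindowSketch(long windowEndInclusive) {
        this.windowEnd = windowEndInclusive;
    }

    // Buffers the event if it falls into [0, windowEnd]; returns false if it is
    // outside the window or arrives after the window has already fired.
    boolean onEvent(long eventTime) {
        if (fired || eventTime < 0 || eventTime > windowEnd) {
            return false;
        }
        buffered.add(eventTime);
        return true;
    }

    // A watermark w promises that no event with time <= w will arrive afterwards.
    // The window content is returned exactly once, when w reaches the window end.
    Optional<List<Long>> onWatermark(long watermark) {
        if (fired || watermark < windowEnd) {
            return Optional.empty();
        }
        fired = true;
        List<Long> result = new ArrayList<>(buffered);
        Collections.sort(result);
        return Optional.of(result);
    }

    public static void main(String[] args) {
        WatermarkWindowSketch window = new WatermarkWindowSketch(15L);
        for (long eventTime : new long[] {33L, 11L, 21L, 15L, 9L}) {
            window.onEvent(eventTime);             // out-of-order arrival
        }
        System.out.println(window.onWatermark(15L));
    }
}
```

How far the watermark lags behind the newest event is the use-case specific part: a larger lag tolerates more disorder at the cost of higher latency.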

Time characteristics in Apache Flink
- Event Time
  • Users have to specify an event-time extractor + watermark emitter
  • Results are deterministic, but with latency
- Processing Time
  • System time is used when evaluating windows
  • Low latency
- Ingestion Time
  • Flink assigns current system time at the sources
- Pluggable, without window code changes

State handling
[Diagram: the four requirements: Windowing / Out of order events; State handling; Fault tolerance and correctness; Low latency, High throughput]

State in streaming
- Where do we store the elements from our windows?
[Diagram: parallel source (S) and window (W) operators over time; elements in windows are state]
- In stateless systems, an external state store (e.g. Redis) is needed.

Managed state in Flink
- Flink automatically backs up and restores state
- State can be larger than the available memory
- State backends: (embedded) RocksDB, Heap memory
[Diagram: web server -> Kafka -> stream processor (Flink): operator with windows (large state) and a local state backend; periodic backup to / recovery from a distributed file system]

Managing the state
[Diagram: web server -> Kafka topic -> stream processor]
- How can we operate such a pipeline 24x7?
- Losing state (by stopping the system) would require a replay of past events
- We need a way to store the state somewhere!

Savepoints: Versioning state
- Savepoint: Create an addressable copy of a job’s current state.
- Restart a job from any savepoint.

    > flink savepoint <JobID>
      hdfs:///flink-savepoints/2
    > flink run -s hdfs:///flink-savepoints/2 <jar>

[Diagram: savepoints are written to and read from HDFS]
Further reading: http://data-artisans.com/how-apache-flink-enables-new-streaming-applications/

Fault tolerance and correctness
[Diagram: the four requirements: Windowing / Out of order events; State handling; Fault tolerance and correctness; Low latency, High throughput]

Fault tolerance in streaming
[Diagram: web server -> Kafka topic -> stream processor]
- How do we ensure the results (number of visitors) are always correct?
- Failures should not lead to data loss or incorrect results

Fault tolerance in streaming
- At least once: ensure all operators see all events
  • Storm: Replay stream in failure case (acking of individual records)
- Exactly once: ensure that operators do not perform duplicate updates to their state
  • Flink: Distributed Snapshots
  • Spark: Micro-batches on batch runtime

Flink’s Distributed Snapshots
- Lightweight approach of storing the state of all operators without pausing the execution
- Implemented using barriers flowing through the topology
[Diagram: a barrier in the data stream; records before the barrier are part of the snapshot, records after the barrier are not in the snapshot]
Further reading: http://blog.acolyer.org/2015/08/19/asynchronous-distributed-snapshots-for-distributed-dataflows/
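
The barrier idea can be illustrated with a heavily simplified, single-operator sketch in plain Java. It is not Flink's implementation: the real mechanism is distributed, aligns barriers across multiple inputs and writes snapshots to durable storage, all of which is left out here. `BarrierSnapshotSketch` and its methods are hypothetical names. The operator keeps a count per key; when a barrier arrives it copies its state, so that exactly the records before the barrier are reflected in the snapshot.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java illustration of barrier-triggered state snapshots; not Flink code.
final class BarrierSnapshotSketch {
    private final Map<String, Long> counts = new HashMap<>();
    private final Map<Long, Map<String, Long>> snapshots = new HashMap<>();

    // Regular record: update the operator state.
    void onRecord(String key) {
        counts.merge(key, 1L, Long::sum);
    }

    // Barrier: copy the state as it is now. Records seen before the barrier are
    // part of the snapshot, records seen afterwards are not.
    void onBarrier(long checkpointId) {
        snapshots.put(checkpointId, new HashMap<>(counts));
    }

    // Failure recovery: roll the state back to a completed snapshot. The source is
    // assumed to replay all records that followed the corresponding barrier.
    void restore(long checkpointId) {
        Map<String, Long> snapshot = snapshots.get(checkpointId);
        if (snapshot == null) {
            throw new IllegalArgumentException("unknown checkpoint " + checkpointId);
        }
        counts.clear();
        counts.putAll(snapshot);
    }

    Map<String, Long> currentCounts() {
        return new HashMap<>(counts);
    }

    public static void main(String[] args) {
        BarrierSnapshotSketch operator = new BarrierSnapshotSketch();
        operator.onRecord("UK");
        operator.onRecord("UK");
        operator.onBarrier(1L);   // snapshot 1 contains UK=2
        operator.onRecord("DE");  // not in snapshot 1
        operator.restore(1L);     // simulated failure and recovery
        operator.onRecord("DE");  // replayed by the source, counted once
        System.out.println(operator.currentCounts());
    }
}
```

Because the state is rolled back before the replay, the replayed record updates the count once rather than twice, which is the exactly-once state guarantee described on the previous slide.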

Wrap-up: Log processing example
[Diagram: web server -> Kafka topic -> stream processor]
- How to do something with the data? Windowing
- How does the system handle large windows? Managed state
- How do we operate such a system 24x7? Savepoints
- How to ensure correct results across failures? Checkpoints, Master HA

Performance: Low Latency & High Throughput
[Diagram: the four requirements: Windowing / Out of order events; State handling; Fault tolerance and correctness; Low latency, High throughput]

Performance: Introduction
- Performance always depends on your own use cases, so test it yourself!
- We based our experiments on a recent benchmark published by Yahoo!
- They benchmarked Storm, Spark Streaming and Flink with a production use case (counting ad impressions)
Full Yahoo! article: https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
