Apache Gearpump next-gen streaming engine
Karol Brejna, Intel (karolbrejna@apache.org)
Huafeng Wang, Intel (huafengw@apache.org)
Apache: Big Data Europe 2016, Sevilla, Spain, 14 November 2016

Agenda
● What is Gearpump?
● Why Apache Gearpump?
● Apache Gearpump features/internals
● What’s next for Apache Gearpump

What is Gearpump?
● A very simple pump that consists of only two gears but is very powerful at streaming water
● An Akka [2] based real-time streaming engine
● An Apache Incubator [1] project since March 8th, 2016

Why Gearpump?

Stream processing is hard
● Fault tolerance
● Infinite, out-of-order data
● Low latency assurance (e.g. real-time recommendation)
● Correctness requirement (e.g. charging advertisers for ads)
● Cheap to update applications (e.g. tuning machine learning parameters)

Gearpump makes stream processing easier
● fault tolerant stream processing at latency of milliseconds
● handling out-of-order data
● event-time based window aggregation
● Akka-stream DSL and Apache Beam API support
● runtime DAG modification
● responsive UI with abundant metrics information

Gearpump on TAP
● Gearpump on Trusted Analytics Platform (TAP)
● Stream processing - performance experiments and results

Trusted Analytics Platform (TAP)
▪ Open Source project
▪ Collaborative, cloud-ready platform to build applications powered by Big Data Analytics
▪ Includes everything needed by data scientists, application developers and system operators
▪ Optimized for performance and security

Analytics Solutions – Big Data Scale Out
● Applications: Analytics-powered vertical and horizontal solutions
● Performance and Security: Silicon and software enhancements to protect and accelerate data and analytics
● Machine Learning: Multi-layered, fully-optimized algorithms
● Analytics: Open source platform for collaborative data science and analytics application development
● Data: Open source, Hadoop-centric platform for distributed and scalable storage and processing
● Infrastructure: Software-defined storage, network and cloud infrastructure optimized for Intel Architecture

The Anatomy of Trusted Analytics Platform (TAP)
● Applications: TAP-powered Big Data Analytics applications and solutions (Java, Go)
● Services: Polyglot services and APIs for application developers (REST)
● Analytics: Data Scientists workbench including models, algorithms, pipelines, engines and frameworks (ATK, Spark*, Impala, H2O, Hue*, iPython)
● Marketplace: Extensible Marketplace of built-in tools, packages and recipes
● Ingestion: Message brokers and queues for batch and stream data ingestion (Kafka*, GearPump, RabbitMQ, MQTT, WS, REST)
● Management (TAP Core): User, tenant, security, provisioning and monitoring for system operators
● Data Platform: Distributed processing and scalable data storage (Cloudera CDH (Hadoop/HDFS, HBase)*, PostgreSQL, MySQL, Redis, MongoDB, InfluxDB, Cassandra)
● Infrastructure: Public or private clouds (AWS, Rackspace, OVH, OpenStack, On/Off-prem)
* Leverages Cloudera Distribution of Apache Hadoop

The problem
● correlate messages using a key in a one-second sliding window and produce a stream of latency messages
● consume latency messages and compute average latency per firm in one-minute buckets
● send the aggregate message to HBase
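The second step of the problem (average latency per firm in one-minute buckets) can be sketched in plain Scala collections. This is only an illustration of the aggregation, not the benchmark code: the `Latency` record and its field names are assumptions, since the slides do not show the message schema, and event times are assumed to be non-negative milliseconds.

```scala
// Hypothetical record type: the slides do not define the message schema.
final case class Latency(firm: String, eventTimeMs: Long, latencyMs: Double)

object MinuteBuckets {
  private val BucketMs = 60000L

  // Start of the one-minute bucket that contains the given event time.
  def bucketStart(eventTimeMs: Long): Long = eventTimeMs - (eventTimeMs % BucketMs)

  // Average latency per (firm, bucket start).
  def averages(msgs: Seq[Latency]): Map[(String, Long), Double] =
    msgs
      .groupBy(m => (m.firm, bucketStart(m.eventTimeMs)))
      .map { case (key, group) => key -> group.map(_.latencyMs).sum / group.size }
}
```

In a streaming job the same grouping key, (firm, bucket start), would index the operator's state, and each bucket would be sent to HBase once it is known to be complete.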

The expectations
● Handle a load of 0.5M msg per second all the time
● Handle a load of 7M msg per second for peaks of 1 hour
● Message size 250-500 bytes
● Be able to scale for even more

The hardware
● CPU: Intel(R) Xeon(R) CPU E5-2695 v3 @ 2.30GHz
● Memory: 256 GB DDR4
● Storage: 8 SATA SSDs

The results (1) - let’s start small: ~700k msg/sec
Initial attempt

The results (2) - 8 executors: ~1.6M msg/sec
Initial attempt
Findings:
● We need to improve the Kafka source: queue size, fetch frequency
● Improve Kafka partition design for concurrency
● Network throughput may be a bottleneck (1.6M msg/sec * 0.5 kB * 8 bit ≈ 6.4 Gbit/s); consider compression
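The network estimate in the findings can be checked with a few lines of arithmetic. The sketch below assumes that "0.5 k" means 500 bytes per message (the upper end of the stated 250-500 byte range) and ignores protocol overhead; the function name is illustrative only.

```scala
object NetworkEstimate {
  // Required bandwidth in Gbit/s for a given message rate and message size.
  def gbitPerSec(msgPerSec: Double, bytesPerMsg: Double): Double =
    msgPerSec * bytesPerMsg * 8 / 1e9
}
```

With these assumptions, `NetworkEstimate.gbitPerSec(1.6e6, 500)` gives about 6.4 Gbit/s, and the 7M msg/sec peak target would need about 28 Gbit/s, which is why compression is worth considering.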

The results (3) - 16 executors: ~2.7M msg/sec
Initial attempt
Findings:
● JVM defaults are designed for moderate workloads - we need to pump them up
● Message marshalling starts to play a significant role in performance - look for better alternatives

The results (4) - 32 executors: ~5M msg/sec
Findings:
● Backpressure introduced by JVM ⇔ JVM communication - use task fusing

The results (5) - 48 executors: 7.4M msg/sec
Mission accomplished!!!

The results (6) - 64 executors
We can go even further...

The results - summary
● Great performance numbers on decent hardware
● Predictable scalability

Executors    Msg/sec
8            1.6 M
16           2.7 M
32           5 M
48           7.4 M
64           10 M
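One way to read "predictable scalability" is to divide each measured throughput by its executor count. The sketch below does only that, using the five figures from the summary slide; the object and method names are illustrative.

```scala
object Scalability {
  // (executors, throughput in msg/sec) as listed in the summary slide.
  val measured: Seq[(Int, Double)] =
    Seq(8 -> 1.6e6, 16 -> 2.7e6, 32 -> 5.0e6, 48 -> 7.4e6, 64 -> 10.0e6)

  // Throughput contributed by each executor, in msg/sec.
  def perExecutor: Seq[(Int, Double)] =
    measured.map { case (executors, total) => executors -> total / executors }
}
```

Per-executor throughput starts at 200k msg/sec with 8 executors and settles at roughly 154k to 156k msg/sec from 32 executors onward, so adding executors beyond that point scales close to linearly in these measurements.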

Gearpump features

Gearpump Architecture
● Actor concurrency
● Message passing communication
● Error handling and isolation with supervision hierarchy
● Master HA with Akka Cluster

Use case - Windowed word count
KafkaSource -> (1. Words) -> WindowCounter -> (2. Window Counts) -> KafkaSink

How Gearpump solves the hard parts
● User interface
● Flow control
● Out-of-order processing
● Exactly once
● Dynamic DAG

User interface - DSL
    val app = StreamApp("dsl", context)
    app.source[String](kafkaSource).
      flatMap(line => line.split("[\\s]+")).map((_, 1)).
      window(FixedWindow.apply(Duration.ofMillis(5L))
        .triggering(EventTimeTrigger)).
      // (word, count1), (word, count2) => (word, count1 + count2)
      groupBy(_._1).sum.sink(kafkaSink)
(Diagram labels: Window.groupByKey, sink)
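To make the semantics of the DSL pipeline concrete, here is a batch analogue written with plain Scala collections. It is not Gearpump code and uses no Gearpump API: it only mirrors the split, window, group and sum steps, and assumes each input line carries a non-negative event time in milliseconds.

```scala
object WindowedWordCount {
  // Each input is (event time in ms, line of text). Returns counts keyed by
  // (window start, word) for fixed windows of `windowMs` milliseconds.
  def count(lines: Seq[(Long, String)], windowMs: Long): Map[(Long, String), Int] =
    lines
      .flatMap { case (ts, line) =>
        line.split("\\s+").toSeq.filter(_.nonEmpty).map(word => (ts - ts % windowMs, word))
      }
      .groupBy(identity)
      .map { case (key, occurrences) => key -> occurrences.size }
}
```

For example, with 5 ms windows the lines at times 1 and 3 fall into the window starting at 0, while a line at time 6 falls into the window starting at 5.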

Without Flow Control - OOM
KafkaSource (fast) -> WindowCounter (fast) -> KafkaSink (very slow)

With Flow Control - Backpressure
KafkaSource (slow down, pull slower) -> WindowCounter (slow down) -> KafkaSink (very slow)
Backpressure propagates upstream: KafkaSink -> WindowCounter -> KafkaSource
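The difference between the two slides can be shown with a toy simulation. This is a deliberately simplified model, not Gearpump's actual flow-control protocol: a fast source and a slow sink share one buffer, and the only question is whether the source is allowed to emit more than the free space.

```scala
object FlowControlSketch {
  // Simulates `ticks` rounds and returns the largest buffer size observed.
  // capacity = None models no flow control; Some(n) models a bounded buffer
  // where the source only emits into free space.
  def maxBuffered(ticks: Int, sourceRate: Int, sinkRate: Int, capacity: Option[Int]): Int = {
    var buffered = 0
    var maxSeen = 0
    for (_ <- 1 to ticks) {
      val free = capacity.map(_ - buffered).getOrElse(sourceRate)
      buffered += math.min(sourceRate, free)   // source slows down when space runs out
      maxSeen = math.max(maxSeen, buffered)
      buffered -= math.min(sinkRate, buffered) // slow sink drains what it can
    }
    maxSeen
  }
}
```

Without a bound the buffer grows with every tick, which is the out-of-memory case; with a bound the buffer never exceeds its capacity because the source is forced down to the sink's pace.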

Out-of-order data
● Event time - when data is generated
● Processing time - when data is processed
(Figure: processing time plotted against event time)

On Watermark [4]
● No timestamp earlier than the watermark will be seen
(Figure: watermark drawn on the plot of processing time against event time)

When can window counts be emitted?
WindowCounter in-memory table:
window    messages
[0, 2)    (“gearpump”, 1)
[2, 4)    (“gearpump”, 3), (“gearpump”, 2)
[4, 6)    (“gearpump”, 5), (“gearpump”, 4)
● No window can be emitted, since a message as early as time 1 may not have arrived yet

Out-of-order processing with watermark
WindowCounter in-memory table:
window    messages
[0, 2)    (“gearpump”, 1)
[2, 4)    (“gearpump”, 3), (“gearpump”, 2)
[4, 6)    (“gearpump”, 5), (“gearpump”, 4)
watermark = 0: no window can be emitted

Out-of-order processing with watermark
WindowCounter in-memory table:
window    messages
[0, 2)    (“gearpump”, 1)
[2, 4)    (“gearpump”, 3), (“gearpump”, 2)
[4, 6)    (“gearpump”, 5), (“gearpump”, 4)
watermark = 2: window [0, 2) can be emitted
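The rule illustrated on these slides, that a window is emitted once the watermark reaches its end, can be sketched as follows. This is a standalone illustration rather than Gearpump's WindowCounter: it buffers the message timestamps from the slides (1, 3, 2, 5, 4) into windows of size 2 and splits the table by watermark. All names are illustrative.

```scala
object WatermarkWindows {
  private val WindowSize = 2L

  // Start of the fixed window [start, start + WindowSize) containing `ts`.
  def windowStart(ts: Long): Long = ts - ts % WindowSize

  // Buffers timestamps per window, like the in-memory table on the slides.
  def buffer(timestamps: Seq[Long]): Map[Long, Seq[Long]] =
    timestamps.groupBy(windowStart)

  // Splits the table into (emittable, still open): a window can be emitted
  // once the watermark has reached its end.
  def emit(table: Map[Long, Seq[Long]], watermark: Long)
      : (Map[Long, Seq[Long]], Map[Long, Seq[Long]]) =
    table.partition { case (start, _) => start + WindowSize <= watermark }
}
```

With watermark 0 nothing is emittable; with watermark 2 only the window [0, 2) is, while [2, 4) and [4, 6) stay buffered.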

How to get watermark?
KafkaSource -> (1. Words) -> WindowCounter -> (2. Window Counts) -> Sink

From upstream
KafkaSource -> WindowCounter -> Sink
Messages: 50, 40, 30
Watermark of the operator: taken from the upstream watermark W(50)

From upstream
KafkaSource -> WindowCounter -> Sink
Messages: 60, 50, 40
Watermark of the operator: taken from the upstream watermark W(50)
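The slides show a single upstream. When an operator has several upstreams, a common rule in streaming engines is to take the minimum of the latest watermarks received; the sketch below assumes that rule, which the slides do not state explicitly, and uses hypothetical upstream ids.

```scala
object OperatorWatermark {
  // Latest watermark received from each upstream, keyed by a hypothetical upstream id.
  // The operator's own watermark is the minimum, since any upstream may still
  // send data as old as its own watermark.
  def current(upstream: Map[String, Long]): Option[Long] =
    if (upstream.isEmpty) None else Some(upstream.values.min)

  // Records a new watermark from one upstream; watermarks never move backwards.
  def update(upstream: Map[String, Long], id: String, watermark: Long): Map[String, Long] =
    upstream.updated(id, math.max(watermark, upstream.getOrElse(id, Long.MinValue)))
}
```

With one upstream this reduces to what the slides show: the operator's watermark is simply the one it received.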

More on Watermark
● Source watermark defined by user
● Usually heuristic based
● Users decide whether to drop data arriving after watermark
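As an example of a heuristic source watermark, the sketch below uses a bounded-delay rule: the watermark trails the newest event time seen by a fixed tolerance. This particular heuristic, the class name and the `maxDelayMs` parameter are assumptions chosen for illustration; the slides only say that the source watermark is user-defined.

```scala
// Hypothetical helper: tracks a heuristic source watermark.
final class BoundedLatenessWatermark(maxDelayMs: Long) {
  private var maxEventTime = Long.MinValue

  // Observes one event time and returns true if the event is late,
  // i.e. older than the watermark in effect before this event arrived.
  def observe(eventTimeMs: Long): Boolean = {
    val late = eventTimeMs < watermark
    maxEventTime = math.max(maxEventTime, eventTimeMs)
    late
  }

  // Watermark = newest event time seen so far minus the tolerated delay.
  def watermark: Long =
    if (maxEventTime == Long.MinValue) Long.MinValue else maxEventTime - maxDelayMs
}
```

The boolean returned by `observe` is where the user's choice comes in: a late event can be dropped, or processed anyway at the cost of revising results that were already emitted.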

Exactly Once with asynchronous checkpointing
KafkaSource (watermark = 2) -> WindowCounter (watermark = 0) -> KafkaSink (watermark = 0)
Checkpointed: (2, kafka_offset)

Exactly Once with asynchronous checkpointing
KafkaSource (watermark = 2) -> WindowCounter (watermark = 2) -> KafkaSink (watermark = 0)
Checkpointed: (2, kafka_offset), (2, window_counts)

Exactly Once with asynchronous checkpointing
Checkpoint succeeds
KafkaSource (watermark = 2) -> WindowCounter (watermark = 2) -> KafkaSink (watermark = 2)
Checkpointed: (2, kafka_offset), (2, window_counts)

Crash
KafkaSource (watermark = 3) -> WindowCounter (watermark = 2) -> Sink (watermark = 2)
Checkpointed: (2, kafka_offset), (2, window_counts)

Recover to latest checkpoint at 2
KafkaSource -> WindowCounter -> KafkaSink
● Get state at 2 from the checkpoints (2, kafka_offset) and (2, window_counts): KafkaSource restores kafka_offset, WindowCounter restores window_counts
● Replay from Kafka
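The recovery rule on these slides, roll back to the latest checkpoint that every stateful operator completed, can be sketched with a small in-memory store. This is an illustration only: the `CheckpointStore` class, its methods and the string-encoded state are hypothetical, and a real store would be durable.

```scala
// Hypothetical in-memory stand-in for a durable checkpoint store.
final class CheckpointStore {
  // watermark -> (operator name -> serialized state)
  private var saved = Map.empty[Long, Map[String, String]]

  // Records one operator's state for the checkpoint taken at `watermark`.
  def save(watermark: Long, operator: String, state: String): Unit = {
    val states = saved.getOrElse(watermark, Map.empty[String, String])
    saved = saved.updated(watermark, states.updated(operator, state))
  }

  // Latest checkpoint for which every listed operator has reported its state.
  def latestComplete(operators: Set[String]): Option[(Long, Map[String, String])] =
    saved
      .filter { case (_, states) => operators.subsetOf(states.keySet) }
      .toSeq
      .sortBy(_._1)
      .lastOption
}
```

In the crash scenario above, the source has already moved to watermark 3 but only the checkpoint at 2 is complete, so recovery restores the offset and window counts saved at 2 and replays from Kafka starting at that offset.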
