Apache Gearpump next-gen streaming engine (Karol Brejna, Intel)

  1. Apache Gearpump: next-gen streaming engine. Karol Brejna, Intel (karolbrejna@apache.org); Huafeng Wang, Intel (huafengw@apache.org). Apache: Big Data Europe 2016, Sevilla, Spain, 14 November 2016

  2. Agenda ● What is Gearpump? ● Why Apache Gearpump? ● Apache Gearpump features/internals ● What’s next for Apache Gearpump

  3. What is Gearpump? ● A very simple pump that consists of only two gears, yet is very powerful at streaming water ● An Akka-based [2] real-time streaming engine ● An Apache Incubator [1] project since March 8, 2016

  4. Why Gearpump?

  5. Stream processing is hard ● Fault tolerance ● Infinite, out-of-order data ● Low latency assurance (e.g. real-time recommendation) ● Correctness requirement (e.g. charging advertisers for ads) ● Cheap to update applications (e.g. tuning machine learning parameters)

  6. Gearpump makes stream processing easier ● Fault-tolerant stream processing at millisecond latency ● Handling out-of-order data ● Event-time based window aggregation ● Akka-stream DSL and Apache Beam API support ● Runtime DAG modification ● Responsive UI with abundant metrics information

  7. Gearpump on TAP ● Gearpump on Trusted Analytics Platform (TAP) ● Stream processing - performance experiments and results

  8. Gearpump on TAP ● Gearpump on Trusted Analytics Platform (TAP) ● Stream processing - performance experiments and results

  9. Trusted Analytics Platform (TAP) ▪ Open Source project ▪ Collaborative, cloud-ready platform to build applications powered by Big Data Analytics ▪ Includes everything needed by data scientists, application developers and system operators ▪ Optimized for performance and security

  10. Analytics Solutions – Big Data Scale Out ● Applications: Analytics-powered vertical and horizontal solutions ● Performance and Security: Silicon and software enhancements to protect and accelerate data and analytics ● Machine Learning: Multi-layered, fully-optimized algorithms ● Analytics: Open source platform for collaborative data science and analytics application development ● Data: Open source, Hadoop-centric platform for distributed and scalable storage and processing ● Infrastructure: Software-defined storage, network and cloud infrastructure optimized for Intel Architecture

  11. The Anatomy of Trusted Analytics Platform (TAP) ● Applications (Java, Go): TAP-powered Big Data Analytics applications and solutions ● Services (REST): Polyglot services and APIs for application developers ● Analytics (ATK, Spark*, Impala, H2O, Hue*, iPython): Data Scientists workbench including models, algorithms, pipelines, engines and frameworks ● Marketplace: Extensible Marketplace of built-in tools, packages and recipes ● Ingestion (Kafka*, GearPump, RabbitMQ, MQTT, WS, REST): Message brokers and queues for batch and stream data ingestion ● Management: User, tenant, security, provisioning and monitoring for system operators ● Data Platform (Cloudera CDH (Hadoop/HDFS, HBase)*, PostgreSQL, MySQL, Redis, MongoDB, InfluxDB, Cassandra): Distributed processing and scalable data storage ● Infrastructure (AWS, Rackspace, OVH, OpenStack, On/Off-prem): Public or private clouds ● The diagram also carries the label "TAP Core" ● * Leverages Cloudera Distribution of Apache Hadoop

  14. Gearpump on TAP ● Gearpump on Trusted Analytics Platform (TAP) ● Stream processing - performance experiments and results

  15. The problem ● Correlate messages by key within a one-second sliding window and produce a stream of latency messages ● Consume the latency messages and compute the average latency per firm in one-minute buckets ● Send the aggregate messages to HBase
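The second step of the problem above (average latency per firm in one-minute buckets) can be sketched on in-memory data. This is only an illustration with plain Scala collections, not the benchmark application: the `LatencyMsg` shape, its field names and the `LatencyBuckets` object are assumptions, since the slides do not define a message format. Timestamps are assumed to be non-negative milliseconds.

```scala
// Hypothetical message shape; the slides do not define one.
final case class LatencyMsg(firm: String, eventTimeMs: Long, latencyMs: Double)

object LatencyBuckets {
  val BucketMs: Long = 60000L // one-minute buckets, as stated on the slide

  // Start of the one-minute bucket that contains the given timestamp.
  def bucketStart(eventTimeMs: Long): Long = eventTimeMs - (eventTimeMs % BucketMs)

  // Average latency keyed by (firm, bucket start).
  def averagePerFirm(msgs: Seq[LatencyMsg]): Map[(String, Long), Double] =
    msgs
      .groupBy(m => (m.firm, bucketStart(m.eventTimeMs)))
      .map { case (key, group) => key -> group.map(_.latencyMs).sum / group.size }

  def main(args: Array[String]): Unit = {
    val msgs = Seq(
      LatencyMsg("acme", 1000L, 10.0),
      LatencyMsg("acme", 59000L, 30.0),
      LatencyMsg("acme", 61000L, 50.0),
      LatencyMsg("globex", 2000L, 8.0)
    )
    averagePerFirm(msgs).foreach(println)
  }
}
```

In a streaming engine the same grouping would be applied incrementally per window rather than over a finished `Seq`.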

  16. The expectations ● Handle a sustained load of 0.5M msg/sec ● Handle a load of 7M msg/sec for peaks of 1 hour ● Message size 250-500 bytes ● Be able to scale further

  17. The hardware ● CPU: Intel(R) Xeon(R) CPU E5-2695 v3 @ 2.30GHz ● Memory: 256 GB DDR4 ● Storage: 8 SATA SSDs

  18. The results (1) - let’s start small: ~700k msg/sec (initial attempt)

  19. The results (2) - 8 executors: ~1.6M msg/sec (initial attempt) Findings: ● We need to improve the Kafka source - queue size, fetch frequency ● Improve the Kafka partition design for concurrency ● Network throughput may be a bottleneck (1.6M msg/sec * 0.5 k * 8 bit) - consider compression
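The network estimate in the last finding can be worked out explicitly. The sketch below assumes that "0.5 k" means 500 bytes per message (the upper end of the 250-500 byte range); with that assumption the product comes to about 6.4 Gbit/s. The object and function names are illustrative only.

```scala
object NetworkEstimate {
  // Required bandwidth in Gbit/s for a given message rate and message size.
  def gbitPerSec(msgsPerSec: Double, bytesPerMsg: Double): Double =
    msgsPerSec * bytesPerMsg * 8 / 1e9

  def main(args: Array[String]): Unit = {
    // 1.6M msg/sec at an assumed 500 bytes per message.
    println(gbitPerSec(1.6e6, 500))
  }
}
```

The slides do not state the link speed of the test cluster, so how close this figure is to saturation cannot be read off the estimate alone.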

  20. The results (3) - 16 executors: ~2.7M msg/sec (initial attempt) Findings: ● JVM defaults are designed for moderate workloads - we need to pump them up ● Message marshalling starts to play a significant role in performance - look for better alternatives

  21. The results (4) - 32 executors: ~5M msg/sec Findings: ● Backpressure introduced by JVM ⇔ JVM communication - use task fusing

  22. The results (5) - 48 executors: 7.4M msg/sec ● Mission accomplished!!!

  23. The results (6) - 64 executors ● We can go even further...

  24. The results - summary ● Great performance numbers on decent hardware ● Predictable scalability

      | Executors | Throughput (msg/sec) |
      |-----------|----------------------|
      | 8         | 1.6 M                |
      | 16        | 2.7 M                |
      | 32        | 5 M                  |
      | 48        | 7.4 M                |
      | 64        | 10 M                 |
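One way to read the "predictable scalability" claim is to divide each throughput figure by its executor count. The sketch below does that with the numbers from the summary; the names are illustrative. The per-executor rate works out to roughly 200k msg/sec at 8 executors and stays in the 154k to 169k range from 16 to 64 executors.

```scala
object ScalingSummary {
  // (executors, throughput in millions of msg/sec), copied from the summary slide.
  val results: Seq[(Int, Double)] =
    Seq(8 -> 1.6, 16 -> 2.7, 32 -> 5.0, 48 -> 7.4, 64 -> 10.0)

  // Throughput per executor, in thousands of msg/sec.
  def perExecutorK(executors: Int, millionsPerSec: Double): Double =
    millionsPerSec * 1000 / executors

  def main(args: Array[String]): Unit =
    results.foreach { case (n, m) =>
      println(f"$n%2d executors: ${perExecutorK(n, m)}%.1f k msg/sec per executor")
    }
}
```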

  25. Gearpump features

  26. Gearpump Architecture ● Actor concurrency ● Message passing communication ● Error handling and isolation with supervision hierarchy ● Master HA with Akka Cluster

  27. Use case - Windowed word count: KafkaSource → (1. Words) → WindowCounter → (2. Window Counts) → KafkaSink

  28. How Gearpump solves the hard parts ● User interface ● Flow control ● Out-of-order processing ● Exactly once ● Dynamic DAG

  29. User interface - DSL

      val app = StreamApp("dsl", context)
      app.source[String](kafkaSource).
        flatMap(line => line.split("[\\s]+")).map((_, 1)).
        window(FixedWindow.apply(Duration.ofMillis(5L))
          .triggering(EventTimeTrigger)).
        // (word, count1), (word, count2) => (word, count1 + count2)
        groupBy(_._1).sum.sink(kafkaSink)

      Diagram labels: Window.groupByKey, sink
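To make clear what the DSL pipeline above computes, here is a minimal sketch of the same windowed word count over an in-memory sequence, using only the Scala standard library. It does not use the Gearpump API; `WindowedWordCount` and its signature are assumptions for illustration, and event times are assumed to be non-negative milliseconds.

```scala
object WindowedWordCount {
  // Count words per fixed (tumbling) event-time window.
  // Input: (eventTimeMs, line) pairs; output: ((windowStartMs, word), count).
  def count(lines: Seq[(Long, String)], windowMs: Long): Map[(Long, String), Int] =
    lines
      .flatMap { case (ts, line) =>
        line.split("[\\s]+").toSeq
          .filter(_.nonEmpty)
          .map(word => (ts - ts % windowMs, word)) // assign the word to its window
      }
      .groupBy(identity)
      .map { case (key, occurrences) => key -> occurrences.size }

  def main(args: Array[String]): Unit = {
    // 5 ms windows, matching Duration.ofMillis(5L) in the DSL example.
    val input = Seq((0L, "a b a"), (4L, "a"), (5L, "b"))
    count(input, 5L).foreach(println)
  }
}
```

Unlike this batch version, the streaming pipeline has to decide when a window is complete, which is where event-time triggers and watermarks come in.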

  30. How Gearpump solves the hard parts ● User interface ● Flow control ● Out-of-order processing ● Exactly once ● Dynamic DAG

  31. Without Flow Control - OOM: KafkaSource (fast) → WindowCounter (fast) → KafkaSink (very slow)

  32. With Flow Control - Backpressure: KafkaSource (pulls slower, slows down) ← backpressure ← WindowCounter (slows down) ← backpressure ← KafkaSink (very slow)
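The contrast between the two flow-control slides can be illustrated with a bounded buffer: once the buffer between a fast producer and a slow consumer is full, further pushes are refused and the producer has to slow down instead of growing memory without limit. This is a generic single-threaded sketch built on `java.util.concurrent.ArrayBlockingQueue`; it is not Gearpump's actual flow-control mechanism, which the slides do not detail, and `refusedPushes` is a made-up helper.

```scala
import java.util.concurrent.ArrayBlockingQueue

object BackpressureSketch {
  // Push `n` items into a buffer of the given capacity while a slow consumer
  // removes one item after every `drainEvery` pushes. Returns how many pushes
  // were refused, i.e. how often the producer was told to slow down.
  def refusedPushes(n: Int, capacity: Int, drainEvery: Int): Int = {
    val buffer = new ArrayBlockingQueue[Int](capacity)
    var refused = 0
    for (i <- 1 to n) {
      if (!buffer.offer(i)) refused += 1      // buffer full: backpressure signal
      if (i % drainEvery == 0) buffer.poll()  // slow consumer takes one item
    }
    refused
  }

  def main(args: Array[String]): Unit =
    println(refusedPushes(n = 10, capacity = 2, drainEvery = 5))
}
```

A real producer would react to a refused push by waiting or pulling less from its own source, which is how the slowdown propagates upstream.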

  33. How Gearpump solves the hard parts ● User Interface ● Flow control ● Out-of-order processing ● Exactly Once ● Dynamic DAG

  34. Out-of-order data ● Event time - when data is generated ● Processing time - when data is processed [Figure: plot with event time (1-3) on the horizontal axis and processing time (1-6) on the vertical axis]

  35. On Watermark [4] ● No timestamp earlier than the watermark will be seen [Figure: the same event time vs. processing time plot, with the watermark marked]

  36. When can window counts be emitted? ● No window can be emitted, since a message as early as time 1 has not arrived yet [Figure: WindowCounter with an in-memory table (columns: window, messages) for windows [0, 2), [2, 4), [4, 6); tuples shown: (“gearpump”, 1) twice, (“gearpump”, 2), (“gearpump”, 3), (“gearpump”, 4), (“gearpump”, 5)]

  37. Out-of-order processing with watermark ● watermark = 0: no window can be emitted [Figure: the same WindowCounter table and tuples as on the previous slide]

  38. Out-of-order processing with watermark ● watermark = 2: window [0, 2) can be emitted [Figure: the same WindowCounter table; tuples shown: (“gearpump”, 1), (“gearpump”, 2), (“gearpump”, 3), (“gearpump”, 4), (“gearpump”, 5)]
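The rule on these slides (a window can be emitted once the watermark reaches its end) can be sketched as follows. The sketch assumes that the second element of each tuple on the slides is an event-time timestamp and that timestamps are non-negative; the `WindowCounter` class below is an illustration, not Gearpump's implementation.

```scala
import scala.collection.mutable

final class WindowCounter(windowSize: Long) {
  // In-memory table: window start -> number of messages seen in that window.
  private val table = mutable.TreeMap.empty[Long, Int]

  // Record one message by its event-time timestamp.
  def add(timestamp: Long): Unit = {
    val start = timestamp - timestamp % windowSize
    table(start) = table.getOrElse(start, 0) + 1
  }

  // Emit and forget every window [start, start + windowSize) whose end is at
  // or before the watermark; later windows stay in the table.
  def advanceWatermark(watermark: Long): Seq[((Long, Long), Int)] = {
    val ready = table.toList.takeWhile { case (start, _) => start + windowSize <= watermark }
    ready.foreach { case (start, _) => table.remove(start) }
    ready.map { case (start, count) => ((start, start + windowSize), count) }
  }
}
```

With window size 2 and messages at timestamps 1, 3, 2, 5, 4, a watermark of 0 emits nothing and a watermark of 2 emits only window [0, 2), matching the two slides above.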

  39. How to get the watermark? KafkaSource → (1. Words) → WindowCounter → (2. Window Counts) → Sink

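  40. From upstream ● Watermark of the operator [Figure: KafkaSource → WindowCounter → Sink; watermark W(50) travels in the stream alongside the values 50, 40, 30]

  41. From upstream ● Watermark of the operator [Figure: KafkaSource → WindowCounter → Sink; watermark W(50) travels in the stream alongside the values 60, 50, 40]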

  42. More on Watermark ● The source watermark is defined by the user ● It is usually heuristic-based ● Users decide whether to drop data arriving after the watermark
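As one example of a heuristic source watermark, the sketch below assumes that data is never more than `maxDelay` behind the largest timestamp seen so far, and drops anything that arrives behind the resulting watermark. Both the heuristic and the drop policy are assumptions chosen for illustration; the slides leave both to the user, and none of these names come from Gearpump.

```scala
object SourceWatermark {
  // Heuristic: the watermark trails the largest timestamp seen by `maxDelay`.
  final class BoundedDelayWatermark(maxDelay: Long) {
    private var maxSeen = Long.MinValue

    def observe(timestamp: Long): Unit = {
      maxSeen = math.max(maxSeen, timestamp)
    }

    // Long.MinValue until the first timestamp has been observed.
    def current: Long =
      if (maxSeen == Long.MinValue) Long.MinValue else maxSeen - maxDelay
  }

  // Split timestamps, in arrival order, into (accepted, late) where "late"
  // means the timestamp is earlier than the watermark at arrival time.
  def splitLate(timestamps: Seq[Long], maxDelay: Long): (Seq[Long], Seq[Long]) = {
    val wm = new BoundedDelayWatermark(maxDelay)
    val accepted = Seq.newBuilder[Long]
    val late = Seq.newBuilder[Long]
    timestamps.foreach { t =>
      if (t < wm.current) late += t
      else { accepted += t; wm.observe(t) }
    }
    (accepted.result(), late.result())
  }
}
```

A larger `maxDelay` drops less data but delays window emission; a smaller one does the opposite.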

  43. How Gearpump solves the hard parts ● User Interface ● Flow control ● Out-of-order processing ● Exactly once ● Dynamic DAG

  44. Exactly Once with asynchronous checkpointing: KafkaSource (Watermark = 2) → WindowCounter (Watermark = 0) → KafkaSink (Watermark = 0); checkpoint entry shown: (2, kafka_offset)

  45. Exactly Once with asynchronous checkpointing: KafkaSource (Watermark = 2) → WindowCounter (Watermark = 2) → KafkaSink (Watermark = 0); checkpoint entries shown: (2, kafka_offset), (2, window_counts)

  46. Exactly Once with asynchronous checkpointing ● Checkpoint succeeds: KafkaSource (Watermark = 2) → WindowCounter (Watermark = 2) → KafkaSink (Watermark = 2); checkpoint entries shown: (2, kafka_offset), (2, window_counts)

  47. Crash: KafkaSource (Watermark = 3) → WindowCounter (Watermark = 2) → Sink (Watermark = 2); checkpoint entries shown: (2, kafka_offset), (2, window_counts)

  48. Recover to latest checkpoint at 2 ● Get state at 2: kafka_offset for KafkaSource, window_counts for WindowCounter ● Replay from Kafka; checkpoint entries shown: (2, kafka_offset), (2, window_counts)
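The recovery sequence on the slides above (restore the state stored at the checkpoint, then replay the source from the stored offset) can be sketched with an in-memory log standing in for Kafka. All names here (`Checkpoint`, `process`, `recover`) are assumptions for illustration; the sketch only covers operator state and does not model what the sink needs to avoid duplicate output.

```scala
object CheckpointSketch {
  // Snapshot taken at a watermark: the source offset and the operator state.
  final case class Checkpoint(watermark: Long, offset: Int, counts: Map[String, Int])

  // Apply log entries [from, until) to the given word counts.
  def process(log: Vector[String], from: Int, until: Int,
              counts: Map[String, Int]): Map[String, Int] =
    log.slice(from, until).foldLeft(counts) { (acc, word) =>
      acc.updated(word, acc.getOrElse(word, 0) + 1)
    }

  // Restore the snapshot and replay the log from the stored offset.
  def recover(log: Vector[String], cp: Checkpoint): Map[String, Int] =
    process(log, cp.offset, log.length, cp.counts)

  def main(args: Array[String]): Unit = {
    val log = Vector("a", "b", "a", "c", "a")
    // Checkpoint after two entries; anything processed later is lost in a crash.
    val cp = Checkpoint(watermark = 2L, offset = 2, counts = process(log, 0, 2, Map.empty))
    println(recover(log, cp))
  }
}
```

Because the offset and the state are stored for the same point in the stream, replaying from the offset counts each entry exactly once in the recovered state.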

  49. How Gearpump solves the hard parts ● User Interface ● Flow control ● Out-of-order processing ● Exactly Once ● Dynamic DAG
