
Stream Processing with Apache Apex - Thomas Weise, Apache Apex PMC Chair



  1. Stream Processing with Apache Apex
     Thomas Weise, Apache Apex PMC Chair
     thw@apache.org, @thweise, @atrato_io
     October 30, 2017, Dagstuhl Seminar

  2. Stream Processing with Apache Apex (overview diagram)
     ● Data Sources: Mobile Devices, Logs, Sensor Data, Social, Databases, CDC (roadmap)
     ● Transform / Analytics: a pipeline of operators (Oper1, Oper2, Oper3)
     ● Data Delivery & Storage: real-time visualization, storage, etc.
     ● APIs and libraries: Declarative API, SQL, SAMOA, Beam, DAG API, Operator Library

  3. Why Apex
     ● State Management & Fault tolerance
        ○ End-to-end Exactly-once, Checkpointing and Windowing
        ○ Fine grained recovery, low-latency SLA support
        ○ Queryable state
        ○ Accuracy, Repeatable/Replay
     ● Scalable, high throughput and low latency
        ○ Native Streaming, pipelined processing (data in motion)
        ○ Dynamic scaling and resource allocation, elasticity
     ● Comprehensive library of connectors and transformations
        ○ Accelerate development
        ○ Event time windowing
        ○ High-level and low-level Java API, SQL, Beam Runner
     ● Used by GE (Predix), Capital One, Royal Bank of Canada, Pubmatic, Silver Spring Networks (more at https://apex.apache.org/powered-by-apex.html)

  4. Application Model
     ● A Stream is a sequence of data tuples
     ● An Operator takes one or more input streams, performs computations & emits one or more output streams
        ○ Custom business logic or built-in operator from Apex library
        ○ Operator has many instances that run in parallel and each instance is single-threaded
     ● Directed Acyclic Graph (DAG) of operators and streams
     (Diagram: operators connected by streams of tuples, with a Filtered Stream, an Enriched Stream and an Output Stream.)
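The application model on this slide can be sketched in a few lines. This is a plain-Python illustration, not the Apex Java API: the `Operator` class, `add_stream` and `run` are hypothetical names, and the filter and enrich logic is made up for the example.

```python
from collections import defaultdict


class Operator:
    """Hypothetical stand-in for an operator with one input port."""

    def __init__(self, name, fn):
        self.name = name
        self.fn = fn          # maps one input tuple to a list of output tuples
        self.downstream = []  # operators reached through outgoing streams

    def process(self, tup, sink):
        for out in self.fn(tup):
            if not self.downstream:
                sink[self.name].append(out)  # terminal operator: collect output
            for op in self.downstream:
                op.process(out, sink)


def add_stream(source, target):
    """Connect two operators with a stream (an edge of the DAG)."""
    source.downstream.append(target)


def run(first_operator, tuples):
    """Push every tuple through the DAG and return what terminal operators saw."""
    sink = defaultdict(list)
    for tup in tuples:
        first_operator.process(tup, sink)
    return dict(sink)


# Filter -> Enrich -> Output, mirroring the stream names on the slide.
filter_op = Operator("filter", lambda t: [t] if t["value"] >= 10 else [])
enrich_op = Operator("enrich", lambda t: [{**t, "label": "large"}])
output_op = Operator("output", lambda t: [t])
add_stream(filter_op, enrich_op)
add_stream(enrich_op, output_op)

result = run(filter_op, [{"value": 3}, {"value": 12}, {"value": 40}])
```

In a real deployment each operator would run as one or more single-threaded instances in separate containers; here everything runs in one call stack to keep the sketch short.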

  5. DAG Translation

  6. Execution Layer
     ● AM requests worker containers from YARN to run physical operators
     ● Worker Containers send data using a pub-sub mechanism
     ● Workers heartbeat to master
     (Diagram: Apex CLI, YARN RM, Apex AM, Worker containers hosting operators 1 to 6, and Checkpoints written to DFS or other distributed storage.)

  7. Operator API (callbacks and the interfaces that declare them)
     ● setup (Component)
     ● activate (ActivationListener)
     ● beginWindow (Operator)
     ● process (InputPort) or emitTuples (InputOperator)
     ● endWindow (Operator)
     ● beforeCheckpoint, checkpointed, committed (CheckpointListener)
     ● deactivate (ActivationListener)
     ● teardown (Component)
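The order in which an engine might invoke these callbacks can be illustrated with a small simulation. This is a plain-Python sketch, not Apex code: `CountingOperator` and `drive` are hypothetical, the method names are snake_case versions of the callbacks on the slide, and the checkpoint and activation callbacks are left out for brevity.

```python
class CountingOperator:
    """Hypothetical operator that counts tuples per streaming window."""

    def __init__(self):
        self.calls = []    # records the order in which callbacks run
        self.count = 0
        self.emitted = []  # one count per completed window

    def setup(self):
        self.calls.append("setup")

    def begin_window(self, window_id):
        self.calls.append(f"beginWindow({window_id})")
        self.count = 0

    def process(self, tup):
        self.calls.append("process")
        self.count += 1

    def end_window(self):
        self.calls.append("endWindow")
        self.emitted.append(self.count)

    def teardown(self):
        self.calls.append("teardown")


def drive(operator, windows):
    """Simplified engine loop over pre-batched streaming windows."""
    operator.setup()
    for window_id, tuples in enumerate(windows):
        operator.begin_window(window_id)
        for tup in tuples:
            operator.process(tup)
        operator.end_window()
    operator.teardown()


op = CountingOperator()
drive(op, [["a", "b"], ["c"]])
```

After the run, `op.emitted` holds one count per window and `op.calls` shows that every `process` call falls between a `beginWindow` and its matching `endWindow`.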

  8. Operator Library
     Connectors
     ● Messaging: Kafka; JMS (ActiveMQ etc.); Kinesis, SQS; Flume, NiFi; MQTT
     ● File Systems: HDFS / Hive; Local File; S3; FTP
     ● NoSQL: Cassandra, HBase; Aerospike, Accumulo; Couchbase, CouchDB; Redis, MongoDB; Geode, Kudu
     ● RDBMS: JDBC; MySQL; Oracle; MemSQL
     ● Other: Elastic Search; Solr; Twitter; WebSocket / HTTP; SMTP
     Stateless Transformations
     ● Parsers: XML, JSON, CSV, Avro
     ● Filter
     ● Enrich
     ● Configurable POJO schema
     ● Map, FlatMap (custom Java function)
     ● Script (JavaScript, Jython)
     Stateful Transformations
     ● Windowing: sliding, tumbling, session
     ● Accumulations: sum, merge, join, sort, top n, …
     ● Triggering, Watermarks
     ● Dimensional Aggregations (with state management for historical data + query)
     ● Deduplication
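To make the tumbling and sliding window types listed above concrete, here is a minimal plain-Python sketch of how an event timestamp could be assigned to windows, combined with a sum accumulation. The function names are hypothetical and are not taken from the Apex library; timestamps are assumed to be non-negative integers and windows are half-open intervals `[start, end)`.

```python
from collections import defaultdict


def tumbling_window(timestamp, size):
    """Return the single fixed-size window that contains the timestamp."""
    start = timestamp - timestamp % size
    return (start, start + size)


def sliding_windows(timestamp, size, slide):
    """Return every window of length `size`, started every `slide`, containing the timestamp."""
    windows = []
    start = timestamp - timestamp % slide
    while start > timestamp - size:
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)


def sum_per_tumbling_window(events, size):
    """Sum accumulation keyed by tumbling window; events are (timestamp, value) pairs."""
    totals = defaultdict(int)
    for timestamp, value in events:
        totals[tumbling_window(timestamp, size)] += value
    return dict(totals)


events = [(5, 1), (59, 2), (60, 3), (125, 4)]
totals = sum_per_tumbling_window(events, 60)
```

A tumbling window assigns each event to exactly one window, while a sliding window with `slide < size` assigns it to several overlapping ones.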

  9. Queryable State
     A set of operators in the library that support real-time queries of operator state.
     (Diagram: Twitter Feed Input Operator, Hashtag Extractor, CountByKey Window, TopN Window and Snapshot Server, with Query Input and Result exchanged through a Pub/Sub Broker over WebSocket / HTTP.)
     ● Example: https://github.com/tweise/apex-samples/tree/master/twitter
     ● Pub/Sub server: https://github.com/atrato/pubsub-server
     ● Grafana data source: https://github.com/atrato/apex-grafana-datasource-server
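The idea behind the Twitter example, keeping a running count per hashtag and answering top-N queries against that state while the stream keeps flowing, can be sketched as follows. This is a plain-Python illustration with hypothetical names (`HashtagCounter`, `query_top`); it does not reproduce the linked sample or its pub/sub transport.

```python
from collections import Counter


class HashtagCounter:
    """Hypothetical operator state: a running count per hashtag."""

    def __init__(self):
        self.counts = Counter()

    def process(self, text):
        # Extract hashtags and update the per-key count.
        for word in text.split():
            if word.startswith("#") and len(word) > 1:
                self.counts[word.lower()] += 1

    def query_top(self, n):
        """Answer a query from current state without stopping the stream."""
        ranked = sorted(self.counts.items(), key=lambda kv: (-kv[1], kv[0]))
        return ranked[:n]


counter = HashtagCounter()
for tweet in ["Loving #Apex and #streaming", "#apex on #YARN", "#apex"]:
    counter.process(tweet)

top_two = counter.query_top(2)
```

Ties are broken alphabetically here only to make the result deterministic; that choice is an assumption of the sketch.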

  10. Application API
     ● DAG API (compositional)
     ● Stream API (declarative)

  11. Fault Tolerance - Checkpointing
     ● Stream is divided into fixed time slices called streaming windows
     ● Checkpoint is performed by Worker Containers at streaming window boundaries
     ● Worker Containers send heartbeats to AM
     ● Recovery is incremental without resetting full DAG
     ● Checkpoints are purged after the corresponding window is committed
     ● AM is also checkpointed
     (Diagram: time axis with BeginWindow n, EndWindow n, BeginWindow n+1, EndWindow n+1; bookkeeping and checkpointing are done at the boundary between windows.)
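Checkpointing at window boundaries and recovery from the latest checkpoint can be sketched with a toy operator. This is plain Python with hypothetical names (`CheckpointedSum`, `checkpoint_every`); it keeps checkpoints in memory, whereas the slide describes them being written to distributed storage, and it assumes the upstream can replay windows on request.

```python
class CheckpointedSum:
    """Hypothetical stateful operator: a running sum, checkpointed every few windows."""

    def __init__(self, checkpoint_every=2):
        self.total = 0
        self.checkpoint_every = checkpoint_every
        self.checkpoints = {}  # window_id -> state saved at the end of that window

    def process_window(self, window_id, tuples):
        for value in tuples:
            self.total += value
        # State is only saved at a streaming window boundary.
        if (window_id + 1) % self.checkpoint_every == 0:
            self.checkpoints[window_id] = self.total

    def committed(self, window_id):
        """Purge checkpoints older than the committed window."""
        self.checkpoints = {w: s for w, s in self.checkpoints.items() if w >= window_id}

    def recover(self):
        """Restore the latest checkpoint and return the first window to replay."""
        if not self.checkpoints:
            self.total = 0
            return 0
        window_id = max(self.checkpoints)
        self.total = self.checkpoints[window_id]
        return window_id + 1


windows = [[1, 2], [3], [4, 5], [6], [7]]
op = CheckpointedSum(checkpoint_every=2)
for window_id, tuples in enumerate(windows):
    op.process_window(window_id, tuples)

op.total = None                # simulate losing the in-memory state
resume_from = op.recover()     # restores the state saved after window 3
for window_id in range(resume_from, len(windows)):
    op.process_window(window_id, windows[window_id])
```

Only the windows after the last checkpoint are reprocessed, which is the sense in which recovery is incremental.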

  12. In-Memory PubSub & Recovery
     • Buffer results until committed
     • Backpressure / spillover to disk
     • Ordering, idempotency
     (Diagram: Container 1 on Node 1 runs Operator 1 with a Buffer Server; Container 2 on Node 2 runs Operator 2. Notes: independent pipelines can be used for speculative execution; downstream operators reset.)
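The "buffer results until committed" behaviour can be sketched as a small in-memory buffer that supports replay for a recovering downstream operator. This is a plain-Python illustration; `BufferServer` and its methods are hypothetical names, and backpressure and spillover to disk are not modelled.

```python
class BufferServer:
    """Hypothetical buffer that holds published tuples until their window is committed."""

    def __init__(self):
        self.buffer = []  # (window_id, tuple) pairs in publish order

    def publish(self, window_id, tup):
        self.buffer.append((window_id, tup))

    def replay_from(self, window_id):
        """Re-send buffered tuples, in order, to a downstream operator that was reset."""
        return [(w, t) for w, t in self.buffer if w >= window_id]

    def committed(self, window_id):
        """Drop data that no downstream operator can ask for again."""
        self.buffer = [(w, t) for w, t in self.buffer if w > window_id]


server = BufferServer()
for window_id, tup in [(0, "a"), (1, "b"), (1, "c"), (2, "d")]:
    server.publish(window_id, tup)

server.committed(0)
replayed = server.replay_from(1)
```

Because the buffer preserves publish order, a replay delivers the same sequence the downstream operator saw the first time, which is what makes idempotent reprocessing possible.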

  13. Processing Guarantees

  14. End-to-End Exactly-Once
     Exactly-once results = at-least-once + idempotency + operator logic
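The equation on this slide can be illustrated with a sink that is written at least once per window but keyed so that repeated writes have no additional effect. This is a plain-Python sketch under stated assumptions: `IdempotentStore` and `deliver_at_least_once` are hypothetical, and replayed windows are assumed to contain the same tuples as the first delivery.

```python
class IdempotentStore:
    """Hypothetical external store keyed by window id, so rewrites overwrite."""

    def __init__(self):
        self.rows = {}
        self.write_calls = 0

    def write_window(self, window_id, total):
        self.write_calls += 1
        self.rows[window_id] = total  # same key, same value: a repeat is harmless


def deliver_at_least_once(store, windows, replayed_ids):
    """Write every window once, then write the replayed windows a second time."""
    for window_id, values in windows:
        store.write_window(window_id, sum(values))
    for window_id, values in windows:
        if window_id in replayed_ids:
            store.write_window(window_id, sum(values))


store = IdempotentStore()
windows = [(0, [1, 2]), (1, [3, 4]), (2, [5])]
deliver_at_least_once(store, windows, replayed_ids={1, 2})
```

The store receives five writes for three windows, yet its contents are the same as if each window had been written exactly once.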

  15. Scaling/Partitioning
     ● Partitioning with Unifiers (diagram: logical DAG of operators 0, 1, 2, 3 and the physical DAG with operator 1 with 3 partitions, 1a, 1b and 1c, followed by a unifier U)
     ● NxM Partitioning (Shuffle) (diagram: partitions of operator 1 feeding partitions of operator 2 through unifiers U1 and U2)
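Partitioning with a unifier can be sketched as hash-routing tuples to parallel instances and merging their partial results. This is a plain-Python illustration; `partition_for`, `run_partitioned` and `unify` are hypothetical names, and CRC32 is used only as a convenient stable hash, not because the engine is known to use it.

```python
import zlib
from collections import Counter


def partition_for(key, num_partitions):
    """Route a key to a partition with a hash that is stable across runs."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def run_partitioned(tuples, num_partitions):
    """Each partition counts only the keys routed to it."""
    partitions = [Counter() for _ in range(num_partitions)]
    for key in tuples:
        partitions[partition_for(key, num_partitions)][key] += 1
    return partitions


def unify(partials):
    """Unifier: merge the partial results of all partitions into one stream."""
    merged = Counter()
    for partial in partials:
        merged.update(partial)
    return merged


tuples = ["a", "b", "a", "c", "b", "a"]
partials = run_partitioned(tuples, 3)
merged = unify(partials)
```

The merged result equals what a single unpartitioned instance would have produced, and each key is handled by exactly one partition.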

  16. Scaling/Partitioning (cont’d)
     ● Parallel Partition (diagram)
     ● Cascading Unifiers (diagram)
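One way to read the cascading unifiers diagram is as a tree of merges, where no single unifier has to combine more than a fixed number of inputs. The sketch below is plain Python with a hypothetical `cascade` function and `fan_in` parameter; it assumes a non-empty list of numeric partial results and a `fan_in` of at least 2.

```python
def cascade(partials, fan_in):
    """Merge partial sums level by level; each unifier combines at most `fan_in` inputs."""
    level = list(partials)
    levels = 0
    while len(level) > 1:
        level = [sum(level[i:i + fan_in]) for i in range(0, len(level), fan_in)]
        levels += 1
    return level[0], levels


total_binary, depth_binary = cascade([1, 2, 3, 4, 5, 6, 7, 8], fan_in=2)
total_ternary, depth_ternary = cascade([1, 2, 3, 4, 5, 6, 7, 8], fan_in=3)
```

A smaller fan-in spreads the merge work over more unifiers at the cost of more levels, and the final result is the same either way.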

  17. Dynamic Scaling
     • Partitioning change while application is running
        ᵒ Change number of partitions at runtime based on stats
        ᵒ Determine initial number of partitions dynamically
           • Kafka operators scale according to number of Kafka partitions
        ᵒ Supports redistribution of state when number of partitions changes
        ᵒ API for custom scaler or partitioner
     (Diagram: physical plans with differing numbers of partitions, e.g. 1a to 1d; unifiers not shown.)
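Redistribution of keyed state when the partition count changes can be sketched as re-routing every key with the new count. This is a plain-Python illustration; `partition_for` and `repartition` are hypothetical names, not the Apex partitioner API, and CRC32 is an arbitrary stable hash chosen for the example.

```python
import zlib


def partition_for(key, num_partitions):
    """Route a key to a partition with a hash that is stable across runs."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def repartition(old_partitions, new_count):
    """Move each key's state to the partition that owns the key under the new count."""
    new_partitions = [dict() for _ in range(new_count)]
    for state in old_partitions:
        for key, value in state.items():
            new_partitions[partition_for(key, new_count)][key] = value
    return new_partitions


# Per-key counters spread over 2 partitions, then scaled out to 4.
counts = {"a": 3, "b": 2, "c": 1, "d": 7}
old = [dict(), dict()]
for key, value in counts.items():
    old[partition_for(key, 2)][key] = value

new = repartition(old, 4)
```

No state is lost or duplicated by the move, and after it each key lives on the partition that will receive that key's future tuples.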

  18. Compute Locality
     • By default operators are distributed on different nodes in the cluster
     • Can be collocated on machine, container or thread basis for efficiency
        Default:   serialization + IPC
        HOST:      serialization, loopback
        CONTAINER: in-process queue
        THREAD:    callstack
     • Host Locality
        ᵒ Operators can be deployed on specific hosts
     • (Anti-)Affinity
        ᵒ Ability to express relative deployment without specifying a host

  19. Compute Locality (cont’d)
     Message size   Default locality   CONTAINER_LOCAL   THREAD_LOCAL
     (bytes)        (bytes/s)          (bytes/s)         (bytes/s)
     64                59,176,512        204,748,032     2,480,432,448
     128               89,803,904        395,023,360     3,662,684,672
     256              137,019,648        671,409,664     5,218,227,968
     512              156,255,744      1,255,749,632     4,416,738,304
     1024             167,139,328      2,022,868,992     3,423,519,744
     2048             182,349,824      3,508,013,056     4,050,688,000
     4096             255,229,952      3,732,725,760     3,884,101,632
     Source: https://www.datatorrent.com/blog/blog-apex-performance-benchmark/
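The table is easier to compare once the throughput figures are turned into ratios against the default locality and into messages per second. The following plain-Python sketch does that arithmetic using only the numbers from the slide; the function names are hypothetical.

```python
# (message size in bytes, default, CONTAINER_LOCAL, THREAD_LOCAL), all in bytes/s
ROWS = [
    (64, 59_176_512, 204_748_032, 2_480_432_448),
    (128, 89_803_904, 395_023_360, 3_662_684_672),
    (256, 137_019_648, 671_409_664, 5_218_227_968),
    (512, 156_255_744, 1_255_749_632, 4_416_738_304),
    (1024, 167_139_328, 2_022_868_992, 3_423_519_744),
    (2048, 182_349_824, 3_508_013_056, 4_050_688_000),
    (4096, 255_229_952, 3_732_725_760, 3_884_101_632),
]


def speedups(rows):
    """Throughput of each locality relative to the default, per message size."""
    return {size: (container / default, thread / default)
            for size, default, container, thread in rows}


def messages_per_second(rows):
    """Convert the default-locality byte rate into a message rate."""
    return {size: default // size for size, default, _, _ in rows}


ratios = speedups(ROWS)
default_msg_rate = messages_per_second(ROWS)

for size, (container_ratio, thread_ratio) in ratios.items():
    print(f"{size:>5} bytes: CONTAINER_LOCAL x{container_ratio:.1f}, THREAD_LOCAL x{thread_ratio:.1f}")
```

In these figures both collocated modes beat the default at every message size, and the advantage of THREAD_LOCAL over CONTAINER_LOCAL narrows as messages grow.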

  20. Recent Additions & Roadmap
     ● Apex runner in Apache Beam
     ● Iterative processing
     ● Integrated with Apache Samoa, opens up ML
     ● Integrated with Apache Calcite, enables SQL
     ● Scalable, incremental state management
     ● User defined control tuples (watermarks, batch control, …)
     ● Apache Kudu connectors
     ● Support for Python
     ● Support for Docker, Mesos and Kubernetes
     ● Enhanced support for Batch Processing
     ● Encrypted Streams

  21. Adoption Challenges for Big Stream Processing
     ● Functionality
     ● Performance
     ● Usability
     ● Operations
     ● Testing

  22. Resources
     • http://apex.apache.org/
     • Powered by Apex - http://apex.apache.org/powered-by-apex.html
     • Learn more - http://apex.apache.org/docs.html
     • Getting involved - http://apex.apache.org/community.html
     • Download - http://apex.apache.org/downloads.html
     • Follow @ApacheApex - https://twitter.com/apacheapex
     • Meetups - https://www.meetup.com/topics/apache-apex/
     • Examples - https://github.com/apache/apex-malhar/tree/master/examples
     • Slideshare - http://www.slideshare.net/ApacheApex/presentations
