Apache Apex: Next Gen Big Data Analytics - Thomas Weise


  1. Apache Apex: Next Gen Big Data Analytics. Thomas Weise <thw@apache.org>, @thweise, PMC Chair Apache Apex, Architect at DataTorrent. Apache Big Data Europe, Sevilla, Nov 14th 2016

  2. Stream Data Processing
  [architecture diagram: sources (Events, Logs, Sensor Data, Social, Databases, CDC (roadmap)) feed an operator DAG (Oper1 → Oper2 → Oper3) for transform / analytics and real-time data delivery to visualization and other consumers; API layers on top of the DAG: Declarative, SQL, SAMOA API, Beam, Operator DAG API, operator Library]

  3. Industries & Use Cases
  • Financial Services: Fraud and risk monitoring; Credit risk assessment; Improve turnaround time of trade settlement processes
  • Ad-Tech: Real-time customer facing dashboards on key performance indicators; Click fraud detection; Billing optimization
  • Telecom: Call detail record (CDR) & extended data record (XDR) analysis; Understanding customer behavior AND context; Packaging and selling anonymous customer data
  • Manufacturing: Supply chain planning & optimization; Preventive maintenance; Product quality & defect tracking
  • Energy: Smart meter analytics; Reduce outages & improve resource utilization; Asset & workforce management
  • IoT: Data ingestion and processing; Predictive analytics; Data governance
  HORIZONTAL
  • Large scale ingest and distribution
  • Enforcing data quality and data governance requirements
  • Real-time ELTA (Extract Load Transform Analyze)
  • Real-time data enrichment with reference data
  • Dimensional computation & aggregation
  • Real-time machine learning model scoring

  4. Apache Apex
  • In-memory, distributed stream processing
    ᵒ Application logic broken into components (operators) that execute distributed in a cluster
    ᵒ Unobtrusive Java API to express (custom) logic
    ᵒ Maintain state and metrics in member variables
    ᵒ Windowing, event-time processing
  • Scalable, high throughput, low latency
    ᵒ Operators can be scaled up or down at runtime according to the load and SLA
    ᵒ Dynamic scaling (elasticity), compute locality
  • Fault tolerance & correctness
    ᵒ Automatically recover from node outages without having to reprocess from beginning
    ᵒ State is preserved, checkpointing, incremental recovery
    ᵒ End-to-end exactly-once
  • Operability
    ᵒ System and application metrics, record/visualize data
    ᵒ Dynamic changes and resource allocation, elasticity

  5. Native Hadoop Integration
  • YARN is the resource manager
  • HDFS for storing persistent state

  6. Application Development Model
  • A Directed Acyclic Graph (DAG) is made up of operators and streams
  • A Stream is a sequence of data tuples
  • A typical Operator takes one or more input streams, performs computations & emits one or more output streams
    ᵒ Each Operator is YOUR custom business logic in Java, or a built-in operator from our open source library
    ᵒ An Operator has many instances that run in parallel and each instance is single-threaded
  [diagram: operators connected by streams, with tuples flowing along an Output Stream into a Filtered Stream and an Enriched Stream]

  7. Development Process
  Example pipeline: Kafka Input → Parser → Filter → Word Counter → JDBC Output, carrying Lines → Words → Filtered words → Counts from Kafka into a database (a DAG API sketch of this pipeline follows below)
  • Operators from library or develop for custom logic
  • Connect operators to form application
  • Configure operator properties
  • Configure scaling and other platform attributes
  • Test functionality, performance, iterate
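
A minimal sketch of how this pipeline could be assembled with the compositional DAG API. The core calls (StreamingApplication, DAG.addOperator, DAG.addStream) are the Apex API; the operator classes and their port names (KafkaLineInput, LineParser, WordFilter, WordCounter, JdbcCountOutput) are hypothetical stand-ins for library or custom operators.

    import org.apache.hadoop.conf.Configuration;

    import com.datatorrent.api.DAG;
    import com.datatorrent.api.StreamingApplication;
    import com.datatorrent.api.annotation.ApplicationAnnotation;

    @ApplicationAnnotation(name = "WordCountToJdbc")
    public class Application implements StreamingApplication
    {
      @Override
      public void populateDAG(DAG dag, Configuration conf)
      {
        // Add the operators (hypothetical classes standing in for library/custom operators).
        KafkaLineInput kafkaInput = dag.addOperator("kafkaInput", new KafkaLineInput());
        LineParser parser = dag.addOperator("parser", new LineParser());
        WordFilter filter = dag.addOperator("filter", new WordFilter());
        WordCounter counter = dag.addOperator("counter", new WordCounter());
        JdbcCountOutput jdbcOutput = dag.addOperator("jdbcOutput", new JdbcCountOutput());

        // Connect output ports to input ports to form the streams of the DAG.
        dag.addStream("lines", kafkaInput.output, parser.input);
        dag.addStream("words", parser.output, filter.input);
        dag.addStream("filteredWords", filter.output, counter.input);
        dag.addStream("counts", counter.output, jdbcOutput.input);
      }
    }

Operator properties and platform attributes set here can equally be supplied through the application configuration, which is the usual place for the scaling and tuning settings mentioned above.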

  8. Application Specification
  • DAG API (compositional)
  • Java Stream API (declarative)
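
For the declarative style, a word-count sketch with the Java Stream API, in the same spirit as the snippet on the windowing slide below. Split and ConvertToKeyVal are hypothetical user functions, the input folder is a placeholder, and the package locations reflect the Malhar stream API of that era and may differ by version.

    import org.apache.apex.malhar.lib.window.WindowOption;
    import org.apache.apex.malhar.stream.api.ApexStream;
    import org.apache.apex.malhar.stream.api.impl.StreamFactory;

    // Read lines from a folder, split into words, count per word, print the counts.
    ApexStream<String> lines = StreamFactory.fromFolder("/tmp/words");
    lines.flatMap(new Split())                 // String -> words (hypothetical function)
        .window(new WindowOption.GlobalWindow())
        .countByKey(new ConvertToKeyVal())     // word -> (word, 1) (hypothetical function)
        .print();

Both styles produce the same kind of DAG; the declarative chain is translated into operators and streams when the application is launched.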

  9. Developing Operators
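
A minimal sketch of a custom operator, assuming the common pattern of extending BaseOperator and declaring ports as transient fields; the filter condition is an arbitrary example, and package locations are those of Apex 3.x.

    import com.datatorrent.api.DefaultInputPort;
    import com.datatorrent.api.DefaultOutputPort;
    import com.datatorrent.common.util.BaseOperator;

    // Passes through words of a minimum length. Non-transient fields are part of
    // the checkpointed operator state; ports are transient and re-created on restore.
    public class WordFilter extends BaseOperator
    {
      private int minLength = 3;   // operator property, settable via configuration
      private long passedCount;    // state, checkpointed automatically

      public final transient DefaultOutputPort<String> output = new DefaultOutputPort<>();

      public final transient DefaultInputPort<String> input = new DefaultInputPort<String>()
      {
        @Override
        public void process(String word)
        {
          // Called once per tuple; each operator instance is single-threaded.
          if (word.length() >= minLength) {
            passedCount++;
            output.emit(word);
          }
        }
      };

      public void setMinLength(int minLength)
      {
        this.minLength = minLength;
      }

      public int getMinLength()
      {
        return minLength;
      }
    }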

  10. Operator Library
  • Messaging: Kafka, JMS (ActiveMQ, …), Kinesis, SQS, Flume, NiFi
  • NoSQL: Cassandra, HBase, Aerospike, Accumulo, Couchbase/CouchDB, Redis, MongoDB, Geode
  • RDBMS: JDBC, MySQL, Oracle, MemSQL
  • File Systems: HDFS/Hive, NFS, S3
  • Parsers: XML, JSON, CSV, Avro, Parquet
  • Transformations: Filter, Expression, Enrich, Windowing, Aggregation, Join, Dedup
  • Analytics: Dimensional Aggregations (with state management for historical data + query)
  • Protocols: HTTP, FTP, WebSocket, MQTT, SMTP
  • Other: Elastic Search, Script (JavaScript, Python, R), Solr, Twitter

  11. Stateful Processing with Event Time
  [diagram: tuples with keys k=A, k=B and event times t=4:00 … t=5:59 arrive over processing time (+30s, +60s, +90s); the state accumulates counts overall, per event-time window, and per key and window, e.g. (All): 1 → 4 → 5, t=4:00: 1 → 2 → 2, t=5:00: 2 → 3, k=A/t=4:00: 1 → 2 → 2, k=A/t=5:00: 1, k=B/t=5:00: 2 → 2]
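
The state table above amounts to counts keyed by (key, event-time window) that grow as tuples arrive in processing time. A plain-Java sketch of that bookkeeping, independent of any specific Apex API; the one-hour window size and the types are assumptions for illustration.

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;

    // Accumulates a global count plus counts per key and event-time window,
    // mirroring the state shown on the slide (e.g. k=A / t=4:00, k=B / t=5:00).
    public class EventTimeCounts
    {
      private static final long WINDOW_MILLIS = 3_600_000L; // 1-hour buckets (assumption)

      private long total;                                        // (All)
      private final Map<String, Map<Long, Long>> counts = new HashMap<>();

      public void add(String key, long eventTimeMillis)
      {
        long windowStart = eventTimeMillis - (eventTimeMillis % WINDOW_MILLIS);
        counts.computeIfAbsent(key, k -> new HashMap<>())
              .merge(windowStart, 1L, Long::sum);
        total++;
      }

      public long total()
      {
        return total;
      }

      public long count(String key, long windowStart)
      {
        return counts.getOrDefault(key, Collections.<Long, Long>emptyMap())
                     .getOrDefault(windowStart, 0L);
      }
    }

In an Apex operator this map would simply be a non-transient member variable, so it is checkpointed and restored together with the operator.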

  12. Windowing - Apache Beam Model
  Concepts: Event-time, Session windows, Watermarks, Accumulation, Triggers, Keyed or Not Keyed, Allowed Lateness, Accumulation Mode, Merging streams
      ApexStream<String> stream = StreamFactory
          .fromFolder(localFolder)
          .flatMap(new Split())
          .window(new WindowOption.GlobalWindow(),
              new TriggerOption().withEarlyFiringsAtEvery(Duration.millis(1000)).accumulatingFiredPanes())
          .countByKey(new ConvertToKeyVal())
          .print();

  13. Fault Tolerance
  • Operator state is checkpointed to persistent store
    ᵒ Automatically performed by engine, no additional coding needed
    ᵒ Asynchronous and distributed
    ᵒ In case of failure operators are restarted from checkpoint state
  • Automatic detection and recovery of failed containers
    ᵒ Heartbeat mechanism
    ᵒ YARN process status notification
  • Buffering to enable replay of data from recovered point
    ᵒ Fast, incremental recovery, spike handling
  • Application master state checkpointed
    ᵒ Snapshot of physical (and logical) plan
    ᵒ Execution layer change log

  14. Checkpointing State
  • Distributed, asynchronous
  • No artificial latency
  • Periodic callbacks
  • Pluggable storage
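
A sketch of how checkpointing looks from operator code: non-transient fields are what the engine snapshots, and the CheckpointNotificationListener callbacks are the periodic hooks mentioned above (interface and method names as in Apex 3.x; the operator itself is a made-up example).

    import com.datatorrent.api.DefaultInputPort;
    import com.datatorrent.api.Operator;
    import com.datatorrent.common.util.BaseOperator;

    // Keeps a running total; only the non-transient field survives a failure.
    public class RunningTotal extends BaseOperator
        implements Operator.CheckpointNotificationListener
    {
      private long total;                    // checkpointed, restored on recovery
      private transient long sinceStartup;   // transient, not part of the checkpoint

      public final transient DefaultInputPort<Long> input = new DefaultInputPort<Long>()
      {
        @Override
        public void process(Long value)
        {
          total += value;
          sinceStartup += value;
        }
      };

      @Override
      public void beforeCheckpoint(long windowId)
      {
        // Invoked just before the state snapshot is taken, e.g. to flush buffers.
      }

      @Override
      public void checkpointed(long windowId)
      {
        // State up to this window has been written to the (pluggable) store.
      }

      @Override
      public void committed(long windowId)
      {
        // All operators have processed this window; older data can be purged.
      }
    }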

  15. Buffer Server & Recovery
  • In-memory PubSub
  • Stores results until committed
  • Backpressure / spillover to disk
  • Ordering, idempotency
  [diagram: Operator 1 (Container 1, Node 1) publishes to a Buffer Server that Operator 2 (Container 2, Node 2) subscribes to; pipelines are independent (can be used for speculative execution) and downstream operators reset and replay from the buffer]

  16. Recovery Scenario
  [diagram: a sum operator consumes the stream … EW2, 1, 3, BW2, EW1, 4, 2, 1, BW1 (BW/EW = begin/end window markers); its state advances 0 → 7 → 10 as tuples are processed, and after a failure it is restored to the checkpointed value 7 and the stream is replayed from that point]

  17. Processing Guarantees
  At-least-once
  • On recovery data will be replayed from a previous checkpoint
    ᵒ No messages lost
    ᵒ Default, suitable for most applications
  • Can be used to ensure data is written once to store
    ᵒ Transactions with meta information, rewinding output, feedback from external entity, idempotent operations
  At-most-once
  • On recovery the latest data is made available to operator
    ᵒ Useful in use cases where some data loss is acceptable and latest data is sufficient
  Exactly-once
    ᵒ At-least-once + idempotency + transactional mechanisms (operator logic) to achieve end-to-end exactly-once behavior
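
The guarantee is selected per operator via the PROCESSING_MODE attribute of the operator context; the operator variables below (counter, dashboardWriter) are hypothetical, in the spirit of the earlier pipeline sketch.

    import com.datatorrent.api.Context.OperatorContext;
    import com.datatorrent.api.Operator.ProcessingMode;

    // Inside populateDAG(dag, conf): AT_LEAST_ONCE is the default; AT_MOST_ONCE
    // trades possible data loss on recovery for resuming with the latest data.
    dag.setAttribute(counter, OperatorContext.PROCESSING_MODE, ProcessingMode.AT_LEAST_ONCE);
    dag.setAttribute(dashboardWriter, OperatorContext.PROCESSING_MODE, ProcessingMode.AT_MOST_ONCE);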

  18. End-to-End Exactly Once
  • Important when writing to external systems
  • Data should not be duplicated or lost in the external system in case of application failures
  • Common external systems
    ᵒ Databases
    ᵒ Files
    ᵒ Message queues
  • Exactly-once = at-least-once + idempotency + consistent state
  • Data duplication must be avoided when data is replayed from checkpoint
    ᵒ Operators implement the logic dependent on the external system
    ᵒ Platform provides checkpointing and repeatable windowing
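
A sketch of the idempotent-write pattern for a database sink (not the actual Malhar JDBC operator): the last fully written streaming window id is stored in the database in the same transaction as the data, so windows replayed after recovery can be detected and skipped. Table and column names, the JDBC URL and the row type are assumptions.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.util.ArrayList;
    import java.util.List;

    import com.datatorrent.api.Context.OperatorContext;
    import com.datatorrent.api.DefaultInputPort;
    import com.datatorrent.common.util.BaseOperator;

    // Exactly-once to a database = at-least-once replay + idempotent commit:
    // data and window id are committed atomically; already-recorded windows are skipped.
    public class IdempotentJdbcOutput extends BaseOperator
    {
      private transient Connection connection;
      private transient long committedWindowId;
      private transient long currentWindowId;
      private final transient List<String> buffer = new ArrayList<>();

      public final transient DefaultInputPort<String> input = new DefaultInputPort<String>()
      {
        @Override
        public void process(String row)
        {
          buffer.add(row);
        }
      };

      @Override
      public void setup(OperatorContext context)
      {
        try {
          connection = DriverManager.getConnection("jdbc:...");  // placeholder URL
          connection.setAutoCommit(false);
          try (ResultSet rs = connection.createStatement()
              .executeQuery("SELECT committed_window FROM apex_meta")) {
            committedWindowId = rs.next() ? rs.getLong(1) : -1;
          }
        } catch (SQLException e) {
          throw new RuntimeException(e);
        }
      }

      @Override
      public void beginWindow(long windowId)
      {
        currentWindowId = windowId;
        buffer.clear();
      }

      @Override
      public void endWindow()
      {
        if (currentWindowId <= committedWindowId) {
          return; // replayed window, already in the database: skip to stay idempotent
        }
        try {
          PreparedStatement insert = connection.prepareStatement(
              "INSERT INTO counts_out (line) VALUES (?)");
          for (String row : buffer) {
            insert.setString(1, row);
            insert.addBatch();
          }
          insert.executeBatch();
          PreparedStatement meta = connection.prepareStatement(
              "UPDATE apex_meta SET committed_window = ?");
          meta.setLong(1, currentWindowId);
          meta.executeUpdate();
          connection.commit();  // data + window id in a single transaction
          committedWindowId = currentWindowId;
        } catch (SQLException e) {
          throw new RuntimeException(e);
        }
      }
    }

This relies on the platform replaying the same windows with the same tuples after recovery (repeatable windowing via the buffer server), which is what makes the window id a safe idempotency key.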

  19. Scalability
  • Logical DAG: 0 → 1 → 2 → 3
  • Physical diagram with operator 1 partitioned into 1a, 1b, 1c and a Unifier merging the partitions in front of operator 2
  • Physical DAG with (1a, 1b, 1c) and (2a, 2b): NxM partitions funneled through a single intermediate Unifier create a bottleneck on that Unifier
  • Physical DAG with (1a, 1b, 1c) and (2a, 2b): one Unifier per downstream partition removes the bottleneck
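
Partitioning is normally requested through attributes rather than code changes; a sketch using the library's StatelessPartitioner on the hypothetical filter operator from the earlier pipeline (the engine inserts the unifiers automatically).

    import com.datatorrent.api.Context.OperatorContext;
    import com.datatorrent.common.partitioner.StatelessPartitioner;

    // Inside populateDAG(dag, conf): run "filter" as 3 parallel partitions.
    dag.setAttribute(filter, OperatorContext.PARTITIONER, new StatelessPartitioner<WordFilter>(3));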

  20. Advanced Partitioning
  • Parallel Partition: logical DAG 0 → 1 → 2 → 3 → 4; in the physical DAG with parallel partition, 1a → 2a → 3a and 1b → 2b → 3b run as independent pipelines that are unified only in front of operator 4
  • Cascading Unifiers: logical plan uopr → dopr; in the execution plan for N = 4, M = 1 the four upstream partitions (uopr1 … uopr4) feed a single unifier in the downstream container, concentrating load on that container's NIC; with K = 2 cascading unifiers, intermediate unifiers in separate containers pre-merge the streams before the final unifier and dopr
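
Parallel partitioning is enabled per input port, telling the engine not to shuffle between an upstream partition and this operator; a sketch using the hypothetical operators from the earlier pipeline.

    import com.datatorrent.api.Context.PortContext;

    // Inside populateDAG(dag, conf): each filter partition feeds its own counter
    // instance, forming independent pipelines with no intermediate unifier.
    dag.setInputPortAttribute(counter.input, PortContext.PARTITION_PARALLEL, true);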

  21. Dynamic Partitioning
  [diagram: operator 2 scaled from two partitions (2a, 2b) to four (2a … 2d) and operator 3 from one instance to two (3a, 3b); unifiers not shown]
  • Partitioning change while application is running
    ᵒ Change number of partitions at runtime based on stats
    ᵒ Determine initial number of partitions dynamically
  • Kafka operators scale according to number of Kafka partitions
    ᵒ Supports re-distribution of state when number of partitions change
    ᵒ API for custom scaler or partitioner
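
A sketch of stats-driven repartitioning with the throughput-based partitioner from the Malhar library; class and setter names are as found in the library of that era and may differ by version, the thresholds are arbitrary, and counter is the hypothetical operator from the earlier sketches.

    import java.util.Arrays;

    import com.datatorrent.api.Context.OperatorContext;
    import com.datatorrent.api.StatsListener;
    import com.datatorrent.lib.partitioner.StatelessThroughputBasedPartitioner;

    // Inside populateDAG(dag, conf): repartition "counter" at runtime when its
    // throughput leaves the [10k, 30k] events-per-window band, with a 10 s cooldown.
    StatelessThroughputBasedPartitioner<WordCounter> partitioner =
        new StatelessThroughputBasedPartitioner<>();
    partitioner.setMinimumEvents(10000);
    partitioner.setMaximumEvents(30000);
    partitioner.setCooldownMillis(10000);
    dag.setAttribute(counter, OperatorContext.STATS_LISTENERS,
        Arrays.asList(new StatsListener[]{partitioner}));
    dag.setAttribute(counter, OperatorContext.PARTITIONER, partitioner);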
