Continuous Intelligence Through Computation Sharing With Arcon

Paris Carbone, Senior Researcher @ RISE, Committer @ Apache Flink <paris.carbone@ri.se>. Castor Software Days.


  1. Continuous Intelligence Through Computation Sharing With Arcon
     Paris Carbone, Senior Researcher @ RISE, Committer @ Apache Flink <paris.carbone@ri.se>
     Castor Software Days

  2. [Figure: Tech (Data Science, Engineering) and Business]
     A lot is going on in Tech (deep learning, scalable processing, etc.), but with little contribution to critical real-time decision making.

  3. Continuous Intelligence
     A design pattern in which real-time analytics are integrated within a business operation, processing current and historical data to prescribe actions in response to events.
     [Figure: events and actions exchanged between Business and Tech]
     Source: https://www.gartner.com/en/newsroom/press-releases/2019-02-18-gartner-identifies-top-10-data-and-analytics-technolo

  4. What we think of data vs. actual data…

  5. The Paradigm Shift Some Missed
     [Figure: retrospective processing (queries, data, answers) vs. the paradigm shift (query, lots of real-time data, answers)]
     The Stream Analytics Stack:
       High-Level Models: Stream SQL, CEP…
       Compute: Flink, Beam, Kafka-Streams, Apex, Storm
       Storage: Kafka, Pub/Sub, Kinesis, Pravega…
     • Data Stream Processing as a 24/7 execution paradigm

  6. Similar Technologies
     [Figure: a byte stream arrives on a net socket and is handled as events by service logic]
     24/7 applications/services have always been event-driven, e.g., using actor programming.

  7. Actors vs Streams
     [Figure: a mesh of services, each with its own logic and state, vs. a declarative program]
     Actor Programming:
     • Low-Level Event-Based Programming
     • Manual/External State
     • Not Robust: Manual Fault Tolerance
     • Not flexible scaling
     Data Stream Computing:
     • Declarative Programming
     • State Managed by the system
     • Robust: Built-in Fault Tolerance
     • Scalable Deployments

  8. The Real-Time Analytics Stack
       High-Level Models: Stream SQL, CEP…
       Compute: Flink, Beam, Kafka-Streams, Apex, Storm, Spark Streaming…
       Storage: Kafka, Pub/Sub, Kinesis, Pravega…

  9. Apache Flink
     Foundations: Data Streams, Fault Tolerance, Window Aggregation
     • Top-level Apache Project
     • #1 stream processor (2019)
     • Production-proof
     • > 400 contributors
     • 100s of deployments
     [Figure labels: influenced; stream-SQL; Calcite; commercial deployments]

  10. Structure of a 24/7 Stream Application
      [Figure: Stream Processing with State, connected to Event Logs, Files, Historic Data, and Applications/Services]

  11. Programming Abstractions in Flink
      Each layer and what it automates:
      Domain-Specific APIs (SQL, CEP, Tables, ML)
      • Fully Declarative Programming
      • Event Patterns, Relations etc.
      DataStream API (window, map, filter etc.)
      • Higher-Order Streaming Functions
      • Event Windowing (sessions, time etc.)
      Event Processing API: f(input, state, time)
      • Dynamic program state
      • Operations on out-of-order streams
      Dataflow Engine
      • Fault Tolerance
      • Scalability
      • Monitoring/IO Management
      [Figure labels: Data Scientists beside the upper layers, Data Engineers beside the lower layers]
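
The f(input, state, time) signature on the Event Processing API layer can be read as a function that sees one event, the state the system keeps for that event's key, and the event's timestamp. The following is a minimal plain-Python sketch of that idea under stated assumptions; it is not Flink's API. The names `running_average` and `process_stream`, and the dict-based per-key state, are hypothetical and exist only for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical event shape: (key, value, event-time timestamp in ms)
Event = Tuple[str, float, int]


def running_average(value: float, state: Dict[str, float], time: int) -> Tuple[int, float]:
    """f(input, state, time): update this key's state, emit (time, average so far)."""
    state["count"] = state.get("count", 0) + 1
    state["total"] = state.get("total", 0.0) + value
    return time, state["total"] / state["count"]


def process_stream(events: Iterable[Event], f: Callable) -> List[Tuple[str, int, float]]:
    """Drive f over the stream, keeping one state dict per key on f's behalf."""
    states: Dict[str, Dict[str, float]] = defaultdict(dict)
    out = []
    for key, value, ts in events:
        emitted_time, avg = f(value, states[key], ts)
        out.append((key, emitted_time, avg))
    return out


events = [("a", 2.0, 1000), ("b", 10.0, 1500), ("a", 4.0, 2000)]
results = process_stream(events, running_average)
```

The user function never creates or looks up state itself; the driver hands it the right state for each key, which is the division of labour the slide attributes to the system.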

  12. Declarative Streaming Examples

      Average Tip per Hour with Stream SQL:

          SELECT HOUR(r.rideTime) AS hourOfDay, AVG(f.tip) AS avgTip
          FROM Rides r, Fares f
          WHERE r.rideId = f.rideId
            AND NOT r.isStart
            AND f.payTime BETWEEN r.rideTime - INTERVAL '5' MINUTE AND r.rideTime
          GROUP BY HOUR(r.rideTime);

      Completed Taxi Rides within 120min with Complex Event Processing:

          val completedRides = Pattern
            .begin[TaxiRide]("start").where(_.isStart)
            .next("end").where(!_.isStart)

          CEP.pattern[TaxiRide](allRides, completedRides.within(Time.minutes(120)))
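
To make the semantics of the Stream SQL query concrete, here is a small plain-Python sketch that evaluates the same join and aggregation over finite in-memory lists. It is only an illustration: a stream engine would evaluate this incrementally over unbounded input, and the sample records and the function name `avg_tip_per_hour` are invented for the example.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Invented sample data using the column names from the query.
rides = [
    {"rideId": 1, "isStart": True,  "rideTime": datetime(2019, 1, 1, 9, 40)},
    {"rideId": 1, "isStart": False, "rideTime": datetime(2019, 1, 1, 10, 5)},
    {"rideId": 2, "isStart": False, "rideTime": datetime(2019, 1, 1, 10, 30)},
    {"rideId": 3, "isStart": False, "rideTime": datetime(2019, 1, 1, 11, 0)},
]
fares = [
    {"rideId": 1, "payTime": datetime(2019, 1, 1, 10, 3), "tip": 2.0},
    {"rideId": 2, "payTime": datetime(2019, 1, 1, 10, 29), "tip": 4.0},
    # Paid 10 minutes before the ride ended: outside the 5-minute interval.
    {"rideId": 3, "payTime": datetime(2019, 1, 1, 10, 50), "tip": 9.0},
]


def avg_tip_per_hour(rides, fares, window=timedelta(minutes=5)):
    """Join end-of-ride events with fares paid in the preceding window, then average tips per hour."""
    fares_by_ride = defaultdict(list)
    for f in fares:
        fares_by_ride[f["rideId"]].append(f)

    tips_by_hour = defaultdict(list)
    for r in rides:
        if r["isStart"]:  # NOT r.isStart
            continue
        for f in fares_by_ride[r["rideId"]]:  # r.rideId = f.rideId
            # f.payTime BETWEEN r.rideTime - INTERVAL '5' MINUTE AND r.rideTime
            if r["rideTime"] - window <= f["payTime"] <= r["rideTime"]:
                tips_by_hour[r["rideTime"].hour].append(f["tip"])

    # GROUP BY HOUR(r.rideTime) with AVG(f.tip)
    return {hour: sum(tips) / len(tips) for hour, tips in tips_by_hour.items()}


avg_tips = avg_tip_per_hour(rides, fares)
```

With this sample data only rides 1 and 2 satisfy the time bound, so the single group is hour 10 with an average tip of 3.0.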

  13. Case Study Car Sharing Source: https://www.flink-forward.org/

  14. AthenaX - An Online Warehousing Platform (2017)
      A stream SQL query optimiser and executor based on Flink.
      Real-time estimations:
      • earnings
      • user satisfaction
      [Figure: UberEats event streams (Restaurants, Users) feed AthenaX; label: estimated delivery?]
      AthenaX was released and open sourced by Uber Technologies. It is capable of scaling across hundreds of machines and processing hundreds of billions of real-time events daily.
      https://eng.uber.com/athenax/
      https://github.com/uber/AthenaX

  15. Marketplace - Dynamic Ride Pricing with Apache Flink (2018)
      Geo-Sensitive Time-based Aggregations: compute location-sensitive trends in rider demand and driver availability.
      Input Streams:
      • supply
      • demand (taxi orders)
      • Trips
      • Traffic
      Output Decisions:
      • Pricing
      • Dispatch
      • Promotions
      • Driver Positioning
      [Figure labels: Prices; million events per sec]
      https://marketplace.uber.com/ (Flink Forward 2018)
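
A geo-sensitive, time-based aggregation can be sketched as counting events per geographic cell within fixed (tumbling) time windows. The snippet below is a minimal plain-Python illustration of that idea, not the pipeline described in the talk: the event shape, the cell ids, the 60-second window, and the function name `tumbling_counts` are all assumptions made for the example.

```python
from collections import Counter
from typing import Iterable, Tuple

# Hypothetical event shape: (timestamp in seconds, cell id, "supply" or "demand")
Event = Tuple[int, str, str]


def tumbling_counts(events: Iterable[Event], window_s: int = 60) -> Counter:
    """Count events per (window start, cell, kind) using tumbling event-time windows."""
    counts: Counter = Counter()
    for ts, cell, kind in events:
        window_start = ts - (ts % window_s)  # align the timestamp to its window
        counts[(window_start, cell, kind)] += 1
    return counts


events = [
    (5, "cellA", "demand"),
    (20, "cellA", "demand"),
    (42, "cellA", "supply"),
    (61, "cellA", "demand"),
    (75, "cellB", "supply"),
]
counts = tumbling_counts(events)
```

Per-cell supply and demand counts like these are the kind of aggregate a downstream pricing or dispatch decision could consume; how the real system turns them into decisions is not stated in the slides.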

  16. Dynamic Pricing - A Data Stream-Powered Standard
      • Dynamic Pricing
      • more profitable
      • best deals for users
      • competition had to adapt

  17. Dynamic Pricing (2019)
      • PrimeTime Real-Time Service
      • Price Multiplier per geog. cell
      • 3M Geohashes/min
      [Figure: Flink Pipeline; labels: too many, too few]

  18. The Bigger Picture
      Data Streams:
      • scalable, fault tolerant analytics
      • event-based business logic
      • out-of-order computation
      • dynamic relational tables (SQL)
      • event pattern-matching (CEP)
      But what about deeper analytics…
      • tensors
      • graph algorithms
      • deep learning
      • feature learning
      • reinforcement learning
      • …
      [Figure: Data Processing, with question marks beside the deeper-analytics items]

  19. Data Pipelines Today
      • Many Frameworks/Frontends for different needs
      • (ML Training & Serving, SQL, Streams, Tensors, Graphs)
      [Figure: relational operator plans (σ, θ, π, ⋈) alongside task labels: Streams, Feature Learning, AI, Feature Engineering, ML, Dynamic Tensor Programming, RL, Graphs, Model Serving, Reasoning, Simulation]

  20. Fundamental Problems
      • Framework/Library Silos
      • Fragmented Codebases/Runtimes
      • Unshared Hardware
      • Over-materialization of results
      • Ridiculously Unoptimised Programs
      • No continuous intelligence
      [Figure: offline/batch path (historic data; features, aggregates, ETL; Historic ML Model) separate from the online/streaming path (live data; Event Logic; model serving)]

  21. Next paradigm shift?
      [Figure: the offline/batch path (historic data; features, aggregates, ETL; Historic ML Model) and the online/streaming path (live data; Event Logic; model serving), now with a "?" and a Live Model for critical decision making]

  22. Secret Sauce?
      "A revolutionary technology that does NOT require you to throw tons of data at your problem to be able to solve it"
      The Compiler
      • Instead, compilers can understand instructions…
      • explained by humans in a high-level declarative language
      • and then optimise them
      • and translate them for primitive machines to execute reliably

  23. The Arcon Vision
      Unified Declarative Programming (Tensors, DataFrames, DataStreams, Graphs)
      → Cross-Compile
      → Arcon: Optimise and Generate Code
      → Shared Native Execution

  24. The Arcon Architecture
      • Unified Analytics DSL
      • Arc IR (Intermediate Representation)
      • Arcon Runtime

  25. Unified Analytics DSL
      • Host language-agnostic core
      • Compositional
      • First-class citizen support for streams, tensors, relations
      [Figure: Data Streams, Linear Algebra, Relational Algebra, … → Core DSL → Translation → Arc IR]

  26. IR Intuition
      • No cross-optimisation is possible, e.g. resource sharing
      • Data movement costs
      [Figure: Performance plotted against #Frameworks, comparing separate IR(f1), IR(f2), IR(f3) with a shared IR(f1 + f2 + f3)]

  27. Arcon Compiler Pipeline
      Unified Analytics DSL → Arc (High Level IR) → Arcon: Logical Dataflow IR → Physical Dataflow IR → Binaries

  28. Arc IR
      • A minimal yet feature-complete set of read/write-only types and expressions
      Read More:
      [Paper] Arc: An IR for Batch and Stream Programming @ DBPL19
      [Code] https://github.com/cda-group/arc

  29. Arc Optimisations
      • Arc supports both compiler and dataflow optimisations
      • Compiler: loop unrolling, partial evaluation
      • Dataflow: operator fusion, fission, reordering, predicate pushdown, specialisation, ...
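
Of the dataflow optimisations listed, operator fusion is the simplest to illustrate: a chain of map operators is merged into one operator so that no intermediate collection is materialised between them. The sketch below shows the idea in plain Python; it does not reflect how Arc implements fusion, and `run_unfused`, `fuse`, and `run_fused` are hypothetical helper names.

```python
from typing import Callable, List, Sequence

MapFn = Callable[[int], int]


def run_unfused(data: Sequence[int], fns: Sequence[MapFn]) -> List[int]:
    """Apply each map as a separate pass, materialising an intermediate list per operator."""
    result = list(data)
    for fn in fns:
        result = [fn(x) for x in result]
    return result


def fuse(fns: Sequence[MapFn]) -> MapFn:
    """Compose a chain of map functions into a single function (operator fusion)."""
    def fused(x: int) -> int:
        for fn in fns:
            x = fn(x)
        return x
    return fused


def run_fused(data: Sequence[int], fns: Sequence[MapFn]) -> List[int]:
    """Apply the fused operator in one pass over the data."""
    fused = fuse(fns)
    return [fused(x) for x in data]


maps = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
data = list(range(5))
unfused_result = run_unfused(data, maps)
fused_result = run_fused(data, maps)
```

Both variants compute the same output; the fused one makes a single pass and allocates one result list instead of one per operator, which is where the saving comes from in a real dataflow system.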

  30. Unlocking Speed
      Arc can boost even existing frameworks.
      Benchmark: 10M elements, 50 map operations on Apache Flink.
      [Figure: Execution Time (seconds) on a log scale from 10^0 to 10^3 for four variants: 1. Unoptimised, 2. Fused, 3. Inlined, 4. Partially Evaluated; shown beside the pipeline Arc (High Level IR) → Logical Dataflow IR → Physical Dataflow IR → Binaries]
      Result: 2 orders of magnitude faster.

  31. A Runtime Capable of Unified Analytics
      [Figure: Arcon Operational Plane (Statemaster, Appmaster; control and data; dataflow snapshots, deployment) above an Execution Plane of workers, each with IO - Channels / State and a Dynamic Scheduler]
      [Figure: systems placed on a Static-to-Dynamic axis: Hadoop, Spark, Flink, Storm, Neptune, Ray]
      • Flexible State Backends (external/shared, embedded)
      Reference: Neptune: Scheduling Suspendable Tasks for Unified Stream/Batch Applications. Garefalakis, Karanasos, Pietzuch. SoCC 2019.

  32. Performance Matters
      • Arc Optimiser: ~10x Speedup
      • Shared Hardware Acceleration: ~10^2x Speedup
      • Data Parallel Execution: ~10^3x Speedup

  33. Learn More
      Code: https://github.com/cda-group/arc and https://github.com/cda-group/arcon
      Project: https://cda-group.github.io
      https://twitter.com/SenorCarbone
