  1. The design of a hybrid stream processing system for heterogeneous servers. Alexandros Koliousis (a.koliousis@imperial.ac.uk). Joint work with Matthias Weidlich, Raul Castro Fernandez, Alexander L. Wolf, Paolo Costa, Peter Pietzuch and, more recently, George Theodorakis, Panos Garefalakis and Holger Pirk. Large-Scale Data & Systems Group, Department of Computing, Imperial College London. http://lsds.doc.ic.ac.uk

  2. Streams of Data Everywhere. Many new data sources are now available, and their linear access patterns make data processing a streaming problem.

  3. High-Throughput, Low-Latency Analytics. Facebook Insights processes 9 GB of page metrics/s in less than 10 s; Google Zeitgeist handles 40K user queries/s within ms; Feedzai checks 40K card transactions/s in 25 ms; NovaSparks processes 150M stock options/s in less than 1 ms.

  4. Algorithmic Complexity Increases. Streaming computations have grown from simple queries into dataflows that pre-process input (T1), share state, iterate, parallelise (T2) and aggregate (T3), e.g. grouping vehicle speed readings by highway, segment and direction. The spectrum of applications runs from complex pattern matching (complex event processing, CEP) through topic- and content-based filtering (publish/subscribe) to online machine learning and data mining (stream processing).

  5. Design Space for Data-Intensive Systems. There is a tension between performance and algorithmic complexity: plotting data amount (MBs to TBs) against result latency (10 s down to 1 ms), small data at relaxed latencies is easy for most algorithms, larger data at tighter latencies becomes hard for machine learning algorithms, and the extreme corner is hard for all algorithms.

  6. Scale Out in Data Centres

  7. Task vs Data Parallelism. Input data flows through servers in a data centre to produce results. Task parallelism runs multiple data processing jobs, e.g. a fraud-detection query alongside others:

    select distinct W.cid
    from Payments [range 300 seconds] as W,
         Payments [partition-by 1 row] as L
    where W.cid = L.cid and W.region != L.region

Data parallelism runs a single data processing job partitioned across servers, e.g.:

    select highway, segment, direction, AVG(speed)
    from Vehicles [range 5 seconds slide 1 second]
    group by highway, segment, direction
    having avg < 40

  8. Distributed Dataflow Systems. Idea: execute data-parallel tasks on cluster nodes, with tasks organised as a dataflow graph (e.g. one operator with parallelism degree 2 feeding another with parallelism degree 3). Almost all big data systems do this: Apache Hadoop, Apache Spark, Apache Storm, Apache Flink, Google TensorFlow, ...

  9. “Nobody Ever Got Fired for Using a Hadoop Cluster” [HotCDP’12] (or Flink, or Spark ;)). A 2012 study of MapReduce workloads found small median job sizes: Microsoft, < 14 GB; Yahoo, < 12.5 GB; Facebook, 90% of jobs < 100 GB. Workload sizes have grown since, but so has the size/price ratio of memory, so many data-intensive jobs easily fit into memory. Scaling out is expensive in hardware and engineering terms: in many cases a single server is cheaper and more efficient than a cluster.

  10. Exploit Single-Node Heterogeneous Hardware. Servers with CPUs and GPUs are now common. The GPU offers roughly 10x higher linear memory access throughput, but data transfer throughput to it is limited: a typical server couples two CPU sockets (8 cores each, with L3 caches and 10s of GB of RAM) to a GPU (10s of streaming multiprocessors, 1000s of cores, an L2 cache and its own DRAM) via DMA transfers over the PCIe bus, driven by a command queue. Use both CPU and GPU resources for stream processing.

  11. With Well-Defined High-Level Queries. CQL is an SQL-based declarative language for continuous queries [Arasu et al., VLDBJ’06]. Credit card fraud detection example: find attempts to use the same card in different regions within a 5-minute window. CQL offers correct window semantics; the query is a self-join:

    select distinct W.cid
    from Payments [range 300 seconds] as W,
         Payments [partition-by 1 row] as L
    where W.cid = L.cid and W.region != L.region

  12. SABER: a Window-Based Hybrid Stream Processing Engine for CPUs & GPUs. Challenges & contributions:
  1. How to parallelise sliding-window queries across CPU and GPU? Decouple query semantics from system parameters.
  2. When to use the CPU or the GPU for a CQL operator? Hybrid processing: offload tasks to both CPU and GPU.
  3. How to reduce GPU data movement costs? Amortise data movement delays with deep pipelining.
(Some details omitted.)

  13. How to Parallelise Window Computation? Problem: window semantics affect system throughput and latency. Should the task size be picked based on the window size? With a window of size 4 s and slide 1 s, tasks T1 and T2 each carry a whole window and output window results in order, but window-based parallelism results in redundant computation: tuples shared by overlapping windows are processed once per window.

  14. How to Parallelise Window Computation? Should the task size instead be picked based on the window slide? With a slide of 1 s, tasks T1, T2, T3, T4, T5, ... each carry one slide, and window results are composed from partial results. This avoids redundant work, but slide-based parallelism limits GPU parallelism: a small slide produces tasks too small to keep the GPU busy.

  15. How to Relate Slides to Tasks? Avoid coupling the throughput and latency of queries to the window definition. Spark, for example, imposes a lower bound on the window slide: the micro-batch size is limited by the window slide, and the window slide is limited by the minimum latency (~500 ms). [Figure: throughput (10^6 tuples/s) vs window slide; throughput collapses once the slide drops below the minimum batch size.]
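
This coupling is visible directly in Spark Streaming's DStream API. A minimal sketch, assuming a socket source and illustrative durations: both the window length and the slide passed to window() must be multiples of the micro-batch interval, so the batch interval is a hard lower bound on the achievable slide.

    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    public class WindowSlideBound {
        public static void main(String[] args) throws InterruptedException {
            SparkConf conf = new SparkConf().setAppName("window-slide-bound");
            // The micro-batch interval: no window slide can be smaller.
            JavaStreamingContext ssc =
                new JavaStreamingContext(conf, Durations.milliseconds(500));

            JavaDStream<String> stream = ssc.socketTextStream("localhost", 9999);

            // 4 s window sliding every 1 s; a slide of, say, 100 ms would be
            // rejected because it is not a multiple of the batch interval.
            JavaDStream<Long> counts =
                stream.window(Durations.seconds(4), Durations.seconds(1)).count();
            counts.print();

            ssc.start();
            ssc.awaitTermination();
        }
    }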

  16. SABER’s Window Processing Model. Idea: decouple the task size from the window size and slide, and pick it based on underlying hardware features (e.g. PCIe throughput). Example: with 5 tuples per task over tuples 1..15 and a window of size 7 rows sliding by 2 rows, windows w1..w5 cut across tasks T1, T2, T3, so a task contains one or more window fragments, e.g. closing, pending and opening windows in T2.
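
A minimal sketch of the fragment idea (illustrative, not SABER's code): given a task's tuple range, each overlapping window is complete, closing (it started in an earlier task), opening (it finishes in a later task), or pending (it spans the whole task).

    // Windows are row-based; a window starting at tuple index ws
    // covers [ws, ws + size) in 0-based positions.
    public class WindowFragments {
        enum Kind { COMPLETE, CLOSING, PENDING, OPENING }

        static Kind classify(long winStart, long winEnd,
                             long taskStart, long taskEnd) {
            boolean opensInTask  = winStart >= taskStart;
            boolean closesInTask = winEnd   <= taskEnd;
            if (opensInTask && closesInTask)  return Kind.COMPLETE;
            if (!opensInTask && closesInTask) return Kind.CLOSING;  // finishes here
            if (opensInTask)                  return Kind.OPENING;  // starts here
            return Kind.PENDING; // covers the whole task, completes later
        }

        public static void main(String[] args) {
            long size = 7, slide = 2;          // window definition (rows)
            long taskStart = 5, taskEnd = 10;  // task T2: tuples 6..10 (0-based 5..9)
            for (long ws = 0; ws < taskEnd; ws += slide) {
                long we = ws + size;
                if (we <= taskStart) continue; // window ended before this task
                System.out.println("window [" + ws + "," + we + ") -> "
                    + classify(ws, we, taskStart, taskEnd));
            }
        }
    }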

  17. Merging Window Fragment Results. Idea: decouple the task size from the window size/slide, assemble window fragment results, and output them in the correct order. The result stage keeps one slot per task in a circular buffer. Example: worker B finishes T2 first, stores its results in slot 2 and exits, since with T1 still outstanding there is nothing to forward; worker A then stores T1's results in slot 1, merges the fragment results of windows w1, w2 and w3 that span both tasks, and forwards complete windows downstream.
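
A simplified sketch of such a result stage, with names and structure invented for illustration (SABER's implementation differs, e.g. it reuses buffers and merges fragments in place): whichever worker completes the next task in output order takes over merging and forwarding for all consecutive ready slots.

    import java.util.List;
    import java.util.concurrent.atomic.AtomicInteger;
    import java.util.concurrent.atomic.AtomicReferenceArray;

    // Partial result of one window fragment computed by a task.
    record Partial(int windowId, long value) {}

    public class ResultStage {
        private final int n;                                     // number of slots
        private final AtomicReferenceArray<List<Partial>> slots;
        private final AtomicInteger next = new AtomicInteger(0); // next task to output

        ResultStage(int n) {
            this.n = n;
            this.slots = new AtomicReferenceArray<>(n);
        }

        // Called by the worker that finished task `taskId`.
        void store(int taskId, List<Partial> partials) {
            slots.set(taskId % n, partials);
            for (;;) {
                int t = next.get();
                List<Partial> ready = slots.get(t % n);
                if (ready == null) return;                   // earlier task still running: exit
                if (!next.compareAndSet(t, t + 1)) continue; // another worker advanced: re-check
                slots.set(t % n, null);                      // free the slot for reuse
                mergeAndForward(t, ready);
            }
        }

        private void mergeAndForward(int taskId, List<Partial> partials) {
            // Here closing/pending fragments would be merged with opening
            // fragments held back from the previous task before emitting
            // complete windows downstream.
            System.out.println("task " + taskId + ": " + partials);
        }
    }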

  18. Operator Implementations / API. With 5 tuples per task and a window of size 7 rows sliding by 2 rows, an operator is expressed as three functions: a fragment function f_f, which processes window fragments; an assembly function f_a, which merges partial window results (e.g. the w1 and w2 results spanning T1 and T2); and a batch function f_b, which composes fragment functions within a task and so allows incremental processing.
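
A sketch of what such an API could look like in Java (the generic signatures are assumptions for illustration, not SABER's actual interfaces):

    // B: a batch of tuples; P: a partial (fragment) result;
    // R: a complete window result.
    interface WindowOperator<B, P, R> {
        // Fragment function f_f: evaluate the operator over one
        // window fragment.
        P fragment(B tuples);

        // Batch function f_b: combine partial results of fragments
        // within a task, enabling incremental processing.
        P batch(P left, P right);

        // Assembly function f_a: merge partial results of the same
        // window across task boundaries.
        R assemble(P closing, P opening);
    }

For an AVG aggregation, for instance, P could be a (sum, count) pair: fragment sums the tuples of one fragment, batch adds pairs within a task, and assemble adds the pairs of a window's fragments and divides sum by count to produce R.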

  19. How to Pick the Task Size? [Figure: throughput (GB/s, 0 to 8) vs task size (32 KB to 4096 KB) for CPU and GPU; the two processors reach their best throughput at different task sizes, so the task size is chosen empirically.]
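
One way to make this choice is an offline calibration pass. A hypothetical sketch (Processor and runBenchmark are invented helpers, not SABER API): measure per-processor throughput for each candidate size and keep the size with the best combined throughput.

    public class TaskSizeCalibration {
        enum Processor { CPU, GPU }

        // Assumed helper: process `bytes` of buffered input on `p` and
        // return the elapsed time in nanoseconds (engine-specific).
        static long runBenchmark(Processor p, long bytes) {
            throw new UnsupportedOperationException("stub");
        }

        static int pickTaskSizeKB(int[] candidatesKB) {
            int best = candidatesKB[0];
            double bestTput = 0.0;
            for (int kb : candidatesKB) {
                long bytes = kb * 1024L;
                // Bytes per nanosecond on each processor for this size.
                double cpu = bytes / (double) runBenchmark(Processor.CPU, bytes);
                double gpu = bytes / (double) runBenchmark(Processor.GPU, bytes);
                double tput = cpu + gpu; // hybrid: both contribute
                if (tput > bestTput) { bestTput = tput; best = kb; }
            }
            return best;
        }
    }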

  20. How Does Window Slide Affect Performance? The performance of window-based queries remains predictable. [Figure: for the aggregation query avg [rows 1024, slide x], SABER's throughput (GB/s) and latency (s) stay roughly flat as the window slide x varies from 64 to 16384 bytes.]

  21. SABER: a Window-Based Hybrid Stream Processing Engine for CPUs & GPUs (recap). Challenges & contributions:
  1. How to parallelise sliding-window queries across CPU and GPU? Decouple query semantics from system parameters.
  2. When to use the CPU or the GPU for a CQL operator? Hybrid processing: offload tasks to both CPU and GPU.
  3. How to reduce GPU data movement costs? Amortise data movement delays with deep pipelining.

  22. SABER’s Hybrid Stream Processing Model. Idea: enable tasks to run on both processors, with a scheduler assigning tasks to whichever processor becomes idle first. Past behaviour per query: tasks of query QA take 3 ms on the CPU and 2 ms on the GPU; tasks of QB take 3 ms on the CPU and 1 ms on the GPU. For a queue T1..T10 of QA and QB tasks, first-come first-served (FCFS) assigns T1, T4, T8 and T10 to the CPU and T2, T3, T5, T6, T7 and T9 to the GPU, leaving the CPU idle towards the end. FCFS ignores the effectiveness of a processor for a given task.

  23. Heterogeneous Look-Ahead Scheduler (HLS). Idea: an idle processor skips queued tasks that could be executed faster by the other processor, basing the decision on the observed task throughput of each query on each processor. On the same workload, the CPU now executes T3, T7 and T10 while the GPU executes T1, T2, T4, T5, T6, T8 and T9: HLS fully utilises both processors.
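
A simplified sketch of the look-ahead idea (illustrative, not SABER's implementation): an idle processor scans the queue and claims the first task for which it is the preferred processor according to observed throughput; if no such task exists, it takes the head of the queue rather than idling.

    import java.util.Deque;
    import java.util.Iterator;
    import java.util.Map;

    // One queued query task; the query name keys the past behaviour.
    record Task(int id, String query) {}

    class HlsScheduler {
        private final Map<String, Double> cpuTput; // observed tasks/s on CPU
        private final Map<String, Double> gpuTput; // observed tasks/s on GPU

        HlsScheduler(Map<String, Double> cpuTput, Map<String, Double> gpuTput) {
            this.cpuTput = cpuTput;
            this.gpuTput = gpuTput;
        }

        // Called by an idle processor; onCpu says which one is asking.
        Task next(Deque<Task> queue, boolean onCpu) {
            synchronized (queue) {
                for (Iterator<Task> it = queue.iterator(); it.hasNext(); ) {
                    Task t = it.next();
                    boolean cpuFaster =
                        cpuTput.getOrDefault(t.query(), 0.0)
                            >= gpuTput.getOrDefault(t.query(), 0.0);
                    if (cpuFaster == onCpu) { it.remove(); return t; } // preferred here
                }
                return queue.pollFirst(); // no preferred task: still don't idle
            }
        }
    }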
