
Window-Based Hybrid Stream Processing for Heterogeneous Architectures (presentation by Alexandros Koliousis, a.koliousis@imperial.ac.uk)


  1. Window-Based Hybrid Stream Processing for Heterogeneous Architectures
     github.com/lsds/saber
     Alexandros Koliousis, a.koliousis@imperial.ac.uk
     Joint work with Matthias Weidlich, Raul Castro Fernandez, Alexander L. Wolf, Paolo Costa & Peter Pietzuch
     Large-Scale Distributed Systems Group (LSDS), Department of Computing, Imperial College London
     http://lsds.doc.ic.ac.uk

  2. High-Throughput Low-Latency Analytics
     – Facebook Insights: 9 GB of page metrics/s, in less than 10 s
     – Google Zeitgeist: 40K user queries/s, within ms
     – Feedzai: 40K card transactions/s, in 25 ms
     – NovaSparks: 150M stock options/s, in less than 1 ms
     [Figure: a window over the stream advancing from time t to t+1]

  3. Exploit Single-Node Heterogeneous Hardware
     Servers with CPUs and GPUs now common
     – 10x higher linear memory access throughput
     – Limited data transfer throughput
     [Figure: a two-socket CPU (cores C1–C8 per socket, L3 cache, DRAM) connected over the PCIe bus (DMA, command queue) to a GPU with 10s of streaming processors, 1000s of cores, an L2 cache and 10s of GB of RAM]
     Use both CPU & GPU resources for stream processing

  4. With Well-Defined High-Level Queries
     CQL: SQL-based declarative language for continuous queries [Arasu et al., VLDBJ’06]
     Credit card fraud detection example:
     – Find attempts to use same card in different regions within 5-min window
     CQL offers correct window semantics
     Self-join:
       select distinct W.cid
       from Payments [range 300 seconds] as W, Payments [partition-by 1 row] as L
       where W.cid = L.cid and W.region != L.region
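
As an illustration of what this query computes, here is a minimal Python sketch that emulates it over a finite, timestamp-ordered list of payments. The function name `suspicious_cards`, the tuple layout and the exact eviction boundary are assumptions made for this example; none of them come from SABER or from a CQL engine. The sketch also accumulates flagged card ids over the whole input instead of emitting a result per window.

```python
from collections import deque


def suspicious_cards(payments, window_seconds=300):
    """Flag cards used in two different regions within `window_seconds`.

    payments: iterable of (timestamp, cid, region), ordered by timestamp.
    """
    window = deque()  # payments seen in the last `window_seconds` (role of W)
    flagged = set()   # distinct card ids, like `select distinct W.cid`
    for ts, cid, region in payments:
        # Evict payments that have fallen out of the time window.
        while window and window[0][0] <= ts - window_seconds:
            window.popleft()
        # The newest payment plays the role of L (latest row for its card).
        if any(w_cid == cid and w_region != region
               for _, w_cid, w_region in window):
            flagged.add(cid)
        window.append((ts, cid, region))
    return flagged


example = [(0, "A", "EU"), (100, "A", "US"), (200, "B", "EU"), (600, "B", "US")]
print(suspicious_cards(example))  # {'A'}
```

Card B is not flagged because its two payments are more than 300 seconds apart.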

  5. SABER: Window-Based Hybrid Stream Processing Engine for CPUs & GPUs
     Challenges & Contributions
     1. How to parallelise sliding-window queries across CPU and GPU? Decouple query semantics from system parameters
     2. When to use CPU or GPU for a CQL operator? Hybrid processing: offload tasks to both CPU and GPU
     3. How to reduce GPU data movement costs? Amortise data movement delays with deep pipelining

  6. How to Parallelise Window Computation?
     Problem: Window semantics affect system throughput and latency
     – Pick task size based on window size?
     [Figure: stream of tuples 1–6 with window size 4 sec and slide 1 sec; tasks T1 and T2 each process a whole window, and window results are output in order]
     Window-based parallelism results in redundant computation

  7. How to Parallelise Window Computation?
     Problem: Window semantics affect system throughput and latency
     – Pick task size based on window size? On window slide?
     [Figure: stream of tuples 1–6 with window size 4 sec and slide 1 sec; tasks T1–T5 each process one slide, and window results are composed from partial results]
     Slide-based parallelism limits GPU parallelism
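
One way to read "compose window results from partial results": if each task produces a partial aggregate for one slide of the stream, a window result is the combination of size/slide consecutive partials. The sketch below assumes a sum aggregate and a window size that is a multiple of the slide; the function name `compose_windows` is hypothetical and not part of SABER.

```python
def compose_windows(slide_partials, window_size, slide):
    """Combine per-slide partial sums into sliding-window sums."""
    if window_size % slide != 0:
        raise ValueError("this sketch assumes the window size is a multiple of the slide")
    per_window = window_size // slide  # number of partials that make up one window
    return [
        sum(slide_partials[i:i + per_window])
        for i in range(len(slide_partials) - per_window + 1)
    ]


# Six 1-second partial sums, window size 4 s, slide 1 s.
print(compose_windows([1, 2, 3, 4, 5, 6], window_size=4, slide=1))  # [10, 14, 18]
```

Each tuple is aggregated once, which avoids the redundant work of window-based parallelism, but every task is only one slide long.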

  8. SABER’s Window Processing Model
     Idea: Decouple task size from window size/slide
     – Pick based on underlying hardware features
       • e.g. PCIe throughput
     – Task contains one or more window fragments
       • e.g. closing/pending/opening windows in T2
     [Figure: tuples 1–15 split into tasks T1–T3 of 5 tuples each; windows w1–w5 have size 7 rows and slide 2 rows]
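
To make the fragment idea concrete, the sketch below classifies the row-based windows that overlap a fixed-size task, using the slide's parameters (5 tuples per task, window size 7 rows, slide 2 rows). It assumes tuples and tasks are numbered from 1; the function name `window_fragments` and the extra label "complete" (a window that fits entirely inside one task) are choices made for this example, not SABER's API.

```python
def window_fragments(task_number, task_size, window_size, slide):
    """Map each window overlapping the task to its fragment kind."""
    first = (task_number - 1) * task_size + 1  # first tuple in the task
    last = first + task_size - 1               # last tuple in the task
    fragments = {}
    k = 1
    while True:
        start = (k - 1) * slide + 1            # first tuple of window k
        end = start + window_size - 1          # last tuple of window k
        if start > last:
            break                              # later windows start after the task
        if end >= first:                       # window overlaps the task
            if start >= first and end <= last:
                kind = "complete"
            elif start >= first:
                kind = "opening"               # starts here, ends in a later task
            elif end <= last:
                kind = "closing"               # started earlier, ends here
            else:
                kind = "pending"               # spans the whole task
            fragments[f"w{k}"] = kind
        k += 1
    return fragments


print(window_fragments(2, task_size=5, window_size=7, slide=2))
# {'w1': 'closing', 'w2': 'closing', 'w3': 'pending', 'w4': 'opening', 'w5': 'opening'}
```

For T2 (tuples 6–10) this yields closing, pending and opening fragments, matching the example named on the slide.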

  9. Merging Window Fragment Results
     Idea: Decouple task size from window size/slide
     – Assemble window fragment results
     – Output them in correct order
     [Figure: Worker A (T1) and Worker B (T2) write window fragment results for w1–w5 into slots of the result stage's circular buffer; merged window results are output in order]
     Worker A stores T1 results, merges window fragment results and forwards complete windows downstream
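
A minimal sketch of this merge step, assuming a sum aggregate: workers may finish out of order, so fragment results wait in a slot until all earlier tasks have been merged. The class `ResultStage`, its `submit` method and the `(window_id, partial_sum, closes_window)` tuple layout are hypothetical, and a plain dict stands in for the circular buffer.

```python
class ResultStage:
    """Merge per-task window fragment results and emit windows in order."""

    def __init__(self):
        self.slots = {}      # task number -> fragment results (stand-in for the circular buffer)
        self.next_task = 1   # next task whose results may be merged
        self.partials = {}   # window id -> partial sum of a still-open window
        self.output = []     # completed (window id, sum) pairs, in order

    def submit(self, task_number, fragments):
        """fragments: list of (window_id, partial_sum, closes_window)."""
        self.slots[task_number] = fragments
        # Drain slots strictly in task order so windows are emitted in order.
        while self.next_task in self.slots:
            for window_id, partial, closes in self.slots.pop(self.next_task):
                total = self.partials.pop(window_id, 0) + partial
                if closes:
                    self.output.append((window_id, total))
                else:
                    self.partials[window_id] = total
            self.next_task += 1


stage = ResultStage()
# Tuple values equal tuple numbers; w1 covers tuples 1-7, w2 covers tuples 3-9.
stage.submit(2, [(1, 6 + 7, True), (2, 6 + 7 + 8 + 9, True)])  # T2 finishes first
print(stage.output)  # []
stage.submit(1, [(1, 1 + 2 + 3 + 4 + 5, False), (2, 3 + 4 + 5, False)])
print(stage.output)  # [(1, 28), (2, 42)]
```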

  10. SABER: Window-Based Hybrid Stream Processing Engine for CPUs & GPUs
      Challenges & Contributions
      1. How to parallelise sliding-window queries across CPU and GPU? Decouple query semantics from system parameters
      2. When to use CPU or GPU for a CQL operator? Hybrid processing: offload tasks to both CPU and GPU
      3. How to reduce GPU data movement costs? Amortise data movement delays with deep pipelining

  11. SABER’s Hybrid Stream Processing Model
      Idea: Enable tasks to run on both processors
      – Scheduler assigns tasks to idle processors
      Past behavior: a QA task takes 3 ms on the CPU and 2 ms on the GPU; a QB task takes 3 ms on the CPU and 1 ms on the GPU
      [Figure: task queue T1–T10 holding QA and QB tasks, CPU comes first; First-Come First-Served timeline from 0 to 12 ms with T1, T4, T8, T10 on the CPU and T2, T3, T5, T6, T7, T9 on the GPU, which is idle part of the time]
      FCFS ignores effectiveness of processor for given task

  12. Heterogeneous Look-Ahead Scheduler (HLS)
      Idea: Idle processor skips tasks that could be executed faster by another processor
      – Decision based on observed query task throughput
      Past behavior: a QA task takes 3 ms on the CPU and 2 ms on the GPU; a QB task takes 3 ms on the CPU and 1 ms on the GPU
      [Figure: task queue T1–T10 holding QA and QB tasks, CPU comes first; HLS timeline from 0 to 12 ms with T3, T7, T10 on the CPU and T1, T2, T4, T5, T6, T8, T9 on the GPU]
      HLS fully utilises processors
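
The look-ahead decision can be sketched as follows. This is a simplified reading of the idea on the slide, not SABER's implementation: an idle processor walks the queue from the head, takes the first task for which it is the preferred (fastest) processor, and otherwise takes a task only once the work queued ahead of it for the preferred processors would take at least as long as running the task itself. The function name `hls_pick` and the `cost_ms` table layout are assumptions; the per-task times are the ones shown on the slide.

```python
def hls_pick(queue, processor, cost_ms):
    """Return the queue index the idle `processor` should run, or None.

    queue: list of query names, head of the queue first.
    cost_ms[query][processor]: observed time per task in milliseconds.
    """
    delay = 0.0  # time the preferred processors need for the tasks skipped so far
    for pos, query in enumerate(queue):
        preferred = min(cost_ms[query], key=cost_ms[query].get)
        if processor == preferred or delay >= cost_ms[query][processor]:
            return pos
        delay += cost_ms[query][preferred]
    return None


cost_ms = {"QA": {"CPU": 3, "GPU": 2}, "QB": {"CPU": 3, "GPU": 1}}
queue = ["QA", "QB", "QB", "QA"]
print(hls_pick(queue, "GPU", cost_ms))  # 0: the GPU is preferred for the head task
print(hls_pick(queue, "CPU", cost_ms))  # 2: the GPU needs 3 ms for the two tasks ahead
```

With these costs the CPU skips the first two tasks and picks the third, because by the time the GPU could reach it the CPU would already have finished it.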

  13. The SABER Architecture
      Implementation: 15K LOC of Java, 4K LOC of C & OpenCL
      – Dispatching stage: dispatch fixed-size tasks
      – Scheduling & execution stage: dequeue tasks based on HLS
      – Result stage: merge & forward partial window results
      [Figure: tasks T1 and T2 flow from the dispatching stage to operators running on the CPU and GPU, then to the result stage]

  14. Is Hybrid Stream Processing Effective?
      Different queries result in different CPU:GPU processing split that is hard to predict offline
      [Figure: bar chart of throughput (10^6 tuples/s) for queries CM2 (Cluster Mgmt.), SG1 and SG2 (Smart Grid), LRB3 and LRB4 (LRB), each split into the SABER CPU contribution (Intel Xeon 2.6 GHz, 16 cores) and the SABER GPU contribution (NVIDIA Quadro K5200, 2,304 cores); bars are annotated with operators such as select, group-by avg, group-by cnt and aggr avg]

  15. Is Hybrid Stream Processing Effective?
      Aggregate throughput of CPU and GPU is always higher than that of either processor alone
      – Not additive due to queue contention
      [Figure: throughput (GB/s) of SABER (CPU only), SABER (GPU only) and SABER for aggregation, group-by and θ-join; annotations mark where the GPU is faster and where the CPU is faster]

  16. Is Heterogeneous Look-Ahead Scheduling Effective?
      [Figure: throughput (GB/s) of FCFS, Static and HLS scheduling for workloads W1 (project, group-by cnt) and W2 (project, aggr sum), with per-operator CPU/GPU speed-up annotations for π, γ and α (5x, 1.5x, 1.5x, 6x)]
      W1 benefits from static scheduling but HLS fully utilises GPU:
      – GPU also runs ~1% of group-by tasks
      W2 benefits from FCFS but HLS better utilises GPU:
      – HLS CPU:GPU split is 1:2.5 for project and 1:0.5 for aggr

  17. Summary
      – Window processing model: decouples query semantics from system parameters
      – Hybrid stream processing model: can achieve aggregate throughput of heterogeneous processors
      – Heterogeneous Look-Ahead Scheduling (HLS): allows use of both CPU and GPU opportunistically for arbitrary workloads
      Thank you! Any Questions?
      Alexandros Koliousis, github.com/lsds/saber
