
FineStream: Fine-Grained Window-Based Stream Processing on CPU-GPU Integrated Architectures



  1. USENIX ATC '20: 2020 USENIX Annual Technical Conference, July 15–17, 2020
     FineStream: Fine-Grained Window-Based Stream Processing on CPU-GPU Integrated Architectures
     Feng Zhang, Lin Yang, Shuhao Zhang, Bingsheng He, Wei Lu, Xiaoyong Du
     Renmin University of China, Technische Universität Berlin, National University of Singapore

  2. Outline: 1. Background 2. Motivation 3. Challenges 4. FineStream 5. Evaluation 6. Conclusion

  3. 1. Background
     • Continuous operator model (this paper): operator granularity. Each operator of a query (operator 1, operator 2, …, operator n) is placed on the CPU or the GPU.
     • Bulk-synchronous parallel model ([SIGMOD'16] Saber: Window-based hybrid stream processing for heterogeneous architectures): query granularity. A whole query (operator 1, operator 2, …, operator n) is placed on the CPU or the GPU.
     • The CPU and the GPU can execute concurrently in both cases; only the granularity differs.

  4. 2. Integrated Architectures
     • Intel Ivy Bridge (2012, Jan)
     • AMD APU (2011, Jan)
     • Nvidia Tegra (2014, Apr)
     Benefits
     • No PCI-e transfer overhead
     • Shared global memory
     • High energy efficiency

  5. 1. Background
     • Integrated architectures vs. discrete architectures
                          Integrated architectures     Discrete architectures
       Architecture       A10-7850K    Ryzen5 2400G    GTX 1080Ti    V100
       #cores             512+4        704+4           3584          5120
       TFLOPS             0.9          1.7             11.3          14.1
       Bandwidth (GB/s)   25.6         38.4            484.4         900
       Price ($)          209          169             1100          8999
       TDP (W)            95           65              250           300

  6. 3. Stream Processing with SQL
     • Data stream
     • Window
     • Operator
     • Query
     • Batch
     [Figure: a data stream of tuples covered by windows w1 and w2, annotated with the window size and the window slide; a query made of operator 1, operator 2, …, operator n consumes the stream and produces results.]
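The relationship between window size and window slide can be illustrated with a short Python sketch. It assumes count-based windows (size and slide measured in tuples); the function name `sliding_windows` is hypothetical and is not part of FineStream.

```python
from collections import deque
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def sliding_windows(stream: Iterable[T], size: int, slide: int) -> Iterator[List[T]]:
    """Yield count-based windows of `size` tuples, advancing by `slide` tuples.

    Assumption: windows are defined by tuple counts, not by timestamps.
    """
    if size <= 0 or slide <= 0:
        raise ValueError("size and slide must be positive")
    buffer: deque = deque(maxlen=size)  # always holds the most recent `size` tuples
    for count, item in enumerate(stream, start=1):
        buffer.append(item)
        # The first window closes after `size` tuples, then one every `slide` tuples.
        if count >= size and (count - size) % slide == 0:
            yield list(buffer)


if __name__ == "__main__":
    for window in sliding_windows(range(5), size=3, slide=1):
        print(window)
```

With a slide smaller than the size, consecutive windows overlap, which is the case drawn for w1 and w2.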


  8. 2. Motivation
     • Varying operator-device preference
     [Figure: timelines of a two-operator query (operator 1: group-by, operator 2: aggregation). Query on GPU (GPU queue): group-by 5.2 ms, aggregation 18.2 ms. Query on CPU (CPU queue): group-by 6.7 ms, aggregation 5.8 ms.]
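The effect of per-operator placement can be checked with simple arithmetic on the timings from this slide. The sketch below assumes the two operators run one after the other and that handing data from one device to the other costs nothing; both are simplifications made here for illustration, not claims from the talk. The names `TIMES_MS`, `whole_query_time`, and `per_operator_plan` are hypothetical.

```python
# Per-operator execution times in milliseconds, as read from the slide.
TIMES_MS = {
    "group-by": {"CPU": 6.7, "GPU": 5.2},
    "aggregation": {"CPU": 5.8, "GPU": 18.2},
}


def whole_query_time(device: str) -> float:
    """Time when every operator of the query runs on the same device."""
    return sum(times[device] for times in TIMES_MS.values())


def per_operator_plan():
    """Pick the faster device for each operator (assumes free hand-off between devices)."""
    plan = {op: min(times, key=times.get) for op, times in TIMES_MS.items()}
    total = sum(TIMES_MS[op][device] for op, device in plan.items())
    return plan, total


if __name__ == "__main__":
    plan, total = per_operator_plan()
    print("CPU only:", round(whole_query_time("CPU"), 1), "ms")
    print("GPU only:", round(whole_query_time("GPU"), 1), "ms")
    print("per operator:", plan, round(total, 1), "ms")
```

Under these assumptions, running group-by on the GPU and aggregation on the CPU takes about 11.0 ms, compared with about 12.5 ms for the whole query on the CPU and 23.4 ms on the GPU.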

  9. 2. Motivation
     • Performance (tuples/s) of operators on the CPU and the GPU of the integrated architecture
       Operator      CPU only   GPU only   Device choice
       Projection    14.2       14.3       GPU
       Selection     13.1       14.1       GPU
       Aggregation   14.7       13.5       CPU
       Group-by      8.1        12.4       GPU
       Join          0.7        0.1        CPU

  10. 2. Motivation
      • Fine-grained stream processing
      • A fine-grained stream processing method that considers both the characteristics of the integrated architecture and the features of the operators should achieve better performance.
        • The memory bandwidth is limited.
        • Operators have preferred devices.
        • Both the CPU and the GPU deliver good performance.
        • The interplay between operator features and architectural differences needs to be considered.


  12. 3. Challenges
      • Challenge 1: Application topology combined with architectural characteristics
      [Figure: a topology of operators (OP2, OP3, OP5, OP6, OP7, OP9, OP10, OP11, …), each marked CPU or GPU, next to an integrated architecture in which CPU cores with a CPU cache and GPU cores with a GPU cache are connected through a Shared Memory Management Unit to the system DRAM.]

  13. 3. Challenges
      • Challenge 2: SQL query plan optimization with shared main memory
      [Figure: timelines of the GPU queue (query on GPU) and the CPU queue (query on CPU). Labels, in the order shown: GPU only, CPU only, CPU-GPU co-run; 6.7 ms, 18.2 ms, 22.4 ms.]

  14. 3. Challenges
      • Challenge 3: Adjustment for dynamic workload
      [Figure: OP1 feeds OP2 and OP3. In one case 90% of the workload goes to OP2 and 10% to OP3; in the other, 10% goes to OP2 and 90% to OP3.]


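  16. 4. FineStream
      • Overview
      [Figure: FineStream overview. Components shown: an input stream split into batches, the SQL query and its operators (op1, op2, …), online profiling, a performance model, a dispatcher, the device mapping of operators to devices (dev), and the results.]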

  17. 4. FineStream
      • Topology
      [Figure: operator topology with three branches (branch1: OP1, OP2, OP3; branch2: OP4, OP5, OP6; branch3: OP8, OP9, OP10), with OP7 and OP11 downstream; the critical path is marked.]

  18. 4. FineStream
      • Optimization 1: Branch Co-Running
      [Figure: execution timelines of branch 1, branch 2, and branch 3 across stages (t_stage1, t_stage2, t_stage3). (a) Branch parallelism. (b) Branch scheduling optimization.]

  19. 4. FineStream
      • Optimization 2: Batch Pipeline
      [Figure: (a) Phase partitioning: the operator topology (OP1 to OP11) is split into phase 1 and phase 2. (b) Batch pipeline: PH1 B1, PH1 B2, … and PH2 B1, PH2 B2 laid out over time, where PHi is phase i and Bi is batch i.]
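How a batch pipeline shortens total execution time can be sketched with a generic pipeline recurrence. This is an illustration under stated assumptions, not FineStream's scheduler: every phase is assumed to own its resources, all batches are assumed to be available at time 0, and the phase durations used in the example are made-up values. The function names are hypothetical.

```python
from typing import List, Sequence


def pipeline_finish_times(phase_ms: Sequence[float], num_batches: int) -> List[List[float]]:
    """Return finish[b][p], the time at which batch b completes phase p.

    A phase starts on a batch once (1) the batch has left the previous phase
    and (2) the phase has finished the previous batch.
    """
    finish = [[0.0] * len(phase_ms) for _ in range(num_batches)]
    for b in range(num_batches):
        for p, duration in enumerate(phase_ms):
            batch_ready = finish[b][p - 1] if p > 0 else 0.0
            phase_free = finish[b - 1][p] if b > 0 else 0.0
            finish[b][p] = max(batch_ready, phase_free) + duration
    return finish


def sequential_makespan(phase_ms: Sequence[float], num_batches: int) -> float:
    """Total time when each batch runs all phases before the next batch starts."""
    return sum(phase_ms) * num_batches


if __name__ == "__main__":
    phases = [3.0, 2.0]  # hypothetical durations of phase 1 and phase 2, in ms
    print(pipeline_finish_times(phases, 3)[-1][-1], "ms pipelined")
    print(sequential_makespan(phases, 3), "ms sequential")
```

The gain comes from overlapping phase 1 of batch i+1 with phase 2 of batch i; the longest phase bounds the steady-state rate.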

  20. 4. FineStream
      • Optimization 3: Handling Dynamic Workload
        • Light-weight resource reallocation
        • Query plan adjustment
      [Figure: on an integrated architecture with shared memory, OP1 feeds OP2 and OP3, which run on CPU and GPU compute units (CUs). (a) 90% of the workload goes to OP2. (b) 90% of the workload goes to OP3.]

  21. 4. FineStream
      • Execution flow
      [Figure: execution flow. Each stream is handled by a thread (stream 1, thread 1, thread 2, …) and its operators form a DAG (DAG 1, …, DAG i). Per-operator records hold CPU and GPU performance, CPU% and GPU% utilization, and bandwidth utilization. Batches (batch1, batch2, batch3, batch4, …) execute over time with branch co-running and batch pipeline by default. Dataflow monitoring performs dynamic-workload detection; the boxes that follow a detected workload change are resource reallocation, a "still low performance?" check, and query plan adjustment with operator migration.]


  23. 5. Evaluation
      • Platforms
        • AMD A10-7850K
        • Ryzen 5 2400G
      • Datasets
        • Google compute cluster monitoring
        • Anomaly detection in smart grids
        • Linear road benchmark
        • Synthetically generated dataset
      • Benchmarks
        • Nine queries
      • Example: Q1 (Google compute cluster monitoring)
          select timestamp, category, sum(cpu) as totalCPU
          from TaskEvents [range 256 slide 1]
          group by category
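To make the semantics of Q1 concrete, the following Python sketch evaluates a windowed group-by sum incrementally. It rests on assumptions that the slide does not state: the window is treated as count-based (`range` and `slide` in tuples), and each result carries the timestamp of the newest tuple in the window. The function `q1` and the tuple layout are hypothetical, and the incremental update is a choice made for this sketch rather than a description of FineStream's operators.

```python
from collections import deque
from typing import Dict, Iterable, Iterator, Tuple

Event = Tuple[int, str, float]  # (timestamp, category, cpu); assumed layout


def q1(events: Iterable[Event], window_range: int = 256,
       slide: int = 1) -> Iterator[Tuple[int, Dict[str, float]]]:
    """Yield (timestamp, {category: totalCPU}) for each count-based sliding window."""
    window: deque = deque()            # (category, cpu) pairs currently in the window
    total_cpu: Dict[str, float] = {}   # running sum(cpu) per category
    members: Dict[str, int] = {}       # number of tuples per category in the window
    for count, (timestamp, category, cpu) in enumerate(events, start=1):
        window.append((category, cpu))
        total_cpu[category] = total_cpu.get(category, 0.0) + cpu
        members[category] = members.get(category, 0) + 1
        if len(window) > window_range:
            # Evict the oldest tuple and undo its contribution.
            old_category, old_cpu = window.popleft()
            members[old_category] -= 1
            if members[old_category] == 0:
                del members[old_category]
                del total_cpu[old_category]
            else:
                total_cpu[old_category] -= old_cpu
        if count >= window_range and (count - window_range) % slide == 0:
            yield timestamp, dict(total_cpu)


if __name__ == "__main__":
    sample = [(1, "a", 1.0), (2, "b", 2.0), (3, "a", 3.0), (4, "b", 4.0)]
    for result in q1(sample, window_range=2, slide=1):
        print(result)
```

Incremental subtraction can accumulate floating-point error over long streams; recomputing the sums per window avoids that at a higher cost.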

  24. 5. Evaluation
      • Throughput: FineStream achieves the best performance in most cases.
      [Figure: throughput (1E5 tuples/s) of Single, Saber, and FineStream for Q1 to Q9 on the A10-7850K and the Ryzen5 2400G.]

  25. 5. Evaluation
      • Latency: FineStream has low latency in most cases.
      [Figure: latency (s) of Single, Saber, and FineStream for Q1 to Q9 on the A10-7850K and the Ryzen5 2400G.]

  26. 5. Evaluation
      • Throughput vs. latency
      • Queries with high throughput usually have low latency, and vice versa.
      [Figure: throughput (1E5 tuples/s) against latency (s) for FineStream and Saber on the A10-7850K and the Ryzen5.]

  27. 5. Evaluation
      • Utilization
      • FineStream utilizes the GPU device better on the integrated architecture.
      [Figure: CPU and GPU utilization (%) of Saber and FineStream for Q1 to Q9.]

  28. 5. Evaluation
      • Comparison with Discrete Architectures
      • Throughput: the discrete GPUs exhibit 1.8x to 5.7x higher throughput than the integrated architectures because discrete GPUs have more computational power.
      • Latency:
        • Discrete GPUs: T_total = T_PCIe_transmit + T_compute
        • Integrated GPUs: T_total = T_compute
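The two latency formulas can be written out as a small Python sketch. Modeling the PCIe term as batch size divided by PCIe bandwidth is an assumption made here for illustration, and every number in the example is hypothetical rather than a measurement from the evaluation.

```python
def discrete_latency(batch_bytes: float, pcie_bytes_per_s: float, compute_s: float) -> float:
    """T_total = T_PCIe_transmit + T_compute for a discrete GPU.

    Assumption: the transfer time is modeled as batch size / PCIe bandwidth.
    """
    return batch_bytes / pcie_bytes_per_s + compute_s


def integrated_latency(compute_s: float) -> float:
    """T_total = T_compute for an integrated GPU (no PCIe transfer)."""
    return compute_s


if __name__ == "__main__":
    # Hypothetical values: a 16 MB batch, 16 GB/s PCIe, and made-up compute times.
    print(discrete_latency(16e6, 16e9, compute_s=0.002))
    print(integrated_latency(compute_s=0.004))
```

The sketch only shows the structure of the comparison: the discrete GPU computes faster but pays a transfer cost on every batch, while the integrated GPU pays none.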

  29. 5. Evaluation
      • Comparison with Discrete Architectures
      • High price-throughput ratio
      [Figure: price-performance ratio (performance/USD) for Q1 to Q9 on the GTX 1080Ti, V100, A10-7850K, and Ryzen5 2400G.]

  30. 5. Evaluation
      • Comparison with Discrete Architectures
      • High energy efficiency
      [Figure: energy efficiency (performance/Watt) for Q1 to Q9 on the GTX 1080Ti, V100, A10-7850K, and Ryzen5 2400G.]


  32. 6. Conclusion
      • FineStream is the first fine-grained window-based relational stream processing framework for CPU-GPU integrated architectures.
      • Lightweight query plan adaptations handle dynamic workloads.
      • FineStream is evaluated on a set of stream queries.
      Feng Zhang, Lin Yang, Shuhao Zhang, Bingsheng He, Wei Lu, Xiaoyong Du
      Renmin University of China, Technische Universität Berlin, National University of Singapore
      fengzhang@ruc.edu.cn, yanglin2330@ruc.edu.cn, shuhao.zhang@tu-berlin.de, hebs@comp.nus.edu.sg, lu-wei@ruc.edu.cn, duyong@ruc.edu.cn
