Argo: Architecture-Aware Graph Partitioning


  1. Argo: Architecture-Aware Graph Partitioning. Angen Zheng, Alexandros Labrinidis, Panos K. Chrysanthis, and Jack Lange. Department of Computer Science, University of Pittsburgh. http://db.cs.pitt.edu/group/ http://www.prognosticlab.org/

  2. Big Graphs Are Everywhere [SIGMOD’16 Tutorial]

  3. A Balanced Partitioning = Even Load Distribution. Minimal Edge-Cut = Minimal Data Comm. [Figure: a graph partitioned across three nodes N1, N2, N3.] Assumption: the network is the bottleneck.
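For concreteness, here is a minimal Python sketch of the two quantities such a partitioner optimizes, edge-cut and load balance. This is our illustration, not code from the talk.

```python
from collections import Counter

def edge_cut(edges, part):
    """Number of edges whose endpoints fall in different partitions
    (each cut edge implies data communication between partitions)."""
    return sum(1 for u, v in edges if part[u] != part[v])

def load_imbalance(part, num_parts):
    """Largest partition size relative to the ideal; 1.0 means perfectly balanced."""
    sizes = Counter(part.values())
    ideal = len(part) / num_parts
    return max(sizes.get(p, 0) for p in range(num_parts)) / ideal

# Toy example: a 4-cycle split evenly across 2 partitions.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
part = {0: 0, 1: 0, 2: 1, 3: 1}
print(edge_cut(edges, part))       # 2 edges cross the cut
print(load_imbalance(part, 2))     # 1.0 -> perfectly balanced
```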

  4. The End of Slow Networks: Network is now as fast as DRAM [C. Binnig, VLDB’15] ✓ Dual-socket Xeon E5v2 server with DDR3-1600 and 2 FDR 4x NICs per socket ✓ InfiniBand: 1.7GB/s~37.5GB/s ✓ DDR3: 6.25GB/s~16.6GB/s

  5. The End of Slow Networks: Does edge-cut still matter?

  6. Roadmap  Introduction  Does edge-cut still matter?  Why edge-cut still matters  Argo  Evaluation  Conclusions

  7. The End of Slow Networks: Does edge-cut still matter?
     Graph Partitioners: METIS and LDG
     Graph Workloads: BFS, SSSP, and PageRank
     Graph Dataset: Orkut (|V|=3M, |E|=234M)
     Number of Partitions: 16 (one partition per core)

  8. The End of Slow Networks: Does edge-cut still matter?
     SSSP Execution Time (s), where m:s:c = # of machines : # of sockets per machine : # of cores used per socket:
     m:s:c   METIS   LDG
     1:2:8   633     2,632
     2:2:4   654     2,565
     4:2:2   521     631
     8:2:1   222     280
     (LDG ran up to 9x longer in the densest configuration than in the sparsest.)
     ✓ Denser configurations had longer execution times. ○ Contention on the memory subsystems impacted performance. ○ The network may not always be the bottleneck.

  9. The End of Slow Networks: Does edge-cut still matter?
     SSSP Execution Time (s) and LLC Misses (in millions):
     m:s:c   Time: METIS / LDG   LLC Misses: METIS / LDG
     1:2:8   633 / 2,632         10,292 / 44,117
     2:2:4   654 / 2,565         10,626 / 44,689
     4:2:2   521 / 631           2,541 / 1,061
     8:2:1   222 / 280           96 / 187
     (For LDG, the densest configuration had ~9x the execution time and ~235x the LLC misses of the sparsest.)
     ✓ Denser configurations had longer execution times. ○ Contention on the memory subsystems impacted performance. ○ The network may not always be the bottleneck.


  11. The End of Slow Networks: Does edge-cut still matter? (Table as in slide 9.) ✓ Denser configurations had longer execution times. ○ Contention on the memory subsystems impacted performance; the distribution of edge-cut matters. ○ The network may not always be the bottleneck.

  12. The End of Slow Networks: Does edge-cut still matter? (Table as in slide 9.) ✓ METIS had lower execution time and fewer LLC misses than LDG. ○ Edge-cut matters. ○ Higher edge-cut → more communication → more contention.

  13. The End of Slow Networks: Does edge-cut still matter? Yes! Both edge-cut and its distribution matter! ✓ Intra-node and inter-node data communication have different performance impacts on the memory subsystems of modern multicore machines.
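A small Python sketch of what the distribution of edge-cut means (the mapping and names below are our illustration): the same total cut can split very differently between intra-node and inter-node communication.

```python
def cut_distribution(edges, part, node_of_part):
    """Split cut edges into intra-node (different partitions, same machine)
    and inter-node (different machines)."""
    intra = inter = 0
    for u, v in edges:
        if part[u] == part[v]:
            continue                                   # not cut at all
        if node_of_part[part[u]] == node_of_part[part[v]]:
            intra += 1                                 # crosses cores within a node
        else:
            inter += 1                                 # crosses the network
    return intra, inter

# Example: 4 partitions on 2 nodes (partitions 0-1 on node 0, 2-3 on node 1).
edges = [(0, 1), (1, 2), (2, 3)]
part = {0: 0, 1: 1, 2: 2, 3: 3}
node_of_part = {0: 0, 1: 0, 2: 1, 3: 1}
print(cut_distribution(edges, part, node_of_part))     # (2, 1)
```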

  14. Roadmap  Introduction  Does edge-cut still matter?  Why edge-cut still matters?  Argo  Evaluation  Conclusions 14

  15. Intra-Node Data Comm: Shared Memory [Figure: message path between a sending core and a receiving core. Steps: (1) load from the send buffer; (2a/2b) load and write the shared buffer; (3) load from the shared buffer; (4a/4b) load and write the receive buffer. The shared buffer introduces an extra memory copy.]

  16. Intra-Node Data Comm: Shared Memory. The send, shared, and receive buffers all end up cached, causing cache pollution and contention for LLC and memory bandwidth.

  17. Intra-Node Data Comm: Shared Memory. The sending core caches the send and shared buffers; the receiving core caches the receive and shared buffers. Again: cache pollution, plus LLC and memory bandwidth contention.

  18. Excess intra-node data communication may hurt performance.
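To make the double copy concrete, here is a minimal Python sketch using multiprocessing.shared_memory as a stand-in for the shared-memory transport above (an analogy we chose, not the system's actual code): every message is copied once into the shared buffer and once out of it.

```python
from multiprocessing import shared_memory

MSG_SIZE = 1024
shm = shared_memory.SharedMemory(create=True, size=MSG_SIZE)

# Sending side: copy #1, private send buffer -> shared buffer.
send_buf = bytearray(b"x" * MSG_SIZE)
shm.buf[:MSG_SIZE] = send_buf

# Receiving side: copy #2, shared buffer -> private receive buffer.
recv_buf = bytearray(MSG_SIZE)
recv_buf[:] = shm.buf[:MSG_SIZE]

assert recv_buf == send_buf        # delivered, but it cost two memory copies
shm.close()
shm.unlink()
```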

  19. Inter-Node Data Comm: RDMA Read/Write [Figure: a core on Node#1 moves data from its send buffer through the InfiniBand HCAs directly into the receive buffer on Node#2.] No extra memory copy and no cache pollution.

  20. Offloading excess intra-node data comm across nodes may achieve better performance.

  21. Roadmap  Introduction  Does edge-cut still matter?  Why edge-cut still matters  Argo  Evaluation  Conclusions

  22. Argo: Graph Partitioning Model [Figure: vertices arrive one at a time as a vertex stream; the partitioner assigns each to a partition on arrival.] Streaming Graph Partitioning Model [I. Stanton, KDD’12]

  23. Argo: Architecture-Aware Vertex Placement. Place each vertex v in the partition Pi that maximizes a weighted edge-cut score, penalized by the current load of Pi. ✓ With the edge-cut weighted by the relative network comm cost, Argo avoids edge-cut across nodes (inter-node data comm). ○ Great for cases where the network is the bottleneck.
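A sketch of such a placement rule, modeled on LDG [I. Stanton, KDD’12] with a comm-cost weight added; the exact objective Argo maximizes is defined in the paper, so treat the scoring below as one plausible reading, with comm_cost normalized into [0, 1]. With a 0/1 cost matrix it reduces to plain LDG.

```python
def place_vertex(neighbor_parts, sizes, capacity, comm_cost):
    """Pick a partition for a newly streamed vertex.

    neighbor_parts: partition index of each already-placed neighbor.
    sizes:          current size of each partition.
    capacity:       per-partition capacity, for the load penalty.
    comm_cost:      comm_cost[i][j] in [0, 1]; 0 on the diagonal.
    """
    best, best_score = 0, float("-inf")
    for i in range(len(sizes)):
        # Cheap-to-reach neighbors score high; co-located ones score highest.
        affinity = sum(1.0 - comm_cost[i][j] for j in neighbor_parts)
        penalty = 1.0 - sizes[i] / capacity      # LDG-style load penalty
        score = affinity * penalty
        if score > best_score:
            best, best_score = i, score
    return best

# Example: 4 partitions, two per node; illustrative costs (co-located 0.0,
# same node 0.4, across the network 1.0).
C = [[0.0, 0.4, 1.0, 1.0],
     [0.4, 0.0, 1.0, 1.0],
     [1.0, 1.0, 0.0, 0.4],
     [1.0, 1.0, 0.4, 0.0]]
print(place_vertex([0, 2], [10, 10, 10, 10], 100, C))   # -> 0
```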

  24. Argo: Architecture-Aware Vertex Placement. Degree of contention 𝞵 ∈ [0, 1]: 𝞵 = 0 when the network is the bottleneck, 𝞵 = 1 when memory is. The refined intra-node network comm cost runs between the original intra-node comm cost (𝞵 = 0) and the maximal inter-node comm cost (𝞵 = 1). ✓ With the edge-cut weighted by the refined relative network comm cost, Argo also avoids edge-cut across cores of the same node (intra-node data comm).
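One way to read this refinement as code (our interpretation of the slide, not the paper's formula): interpolate each intra-node entry of the cost matrix between its measured value and the maximal inter-node cost.

```python
def refine_intra_node_cost(cost, same_node, mu):
    """cost:      partition-to-partition comm-cost matrix.
    same_node: same_node[i][j] is True when partitions i and j share a machine.
    mu:        degree of contention, 0 (network-bound) to 1 (memory-bound).
    """
    n = len(cost)
    max_inter = max(cost[i][j] for i in range(n) for j in range(n)
                    if not same_node[i][j])
    refined = [row[:] for row in cost]
    for i in range(n):
        for j in range(n):
            if i != j and same_node[i][j]:
                # mu = 0 keeps the original intra-node cost;
                # mu = 1 makes intra-node comm as expensive as the worst link.
                refined[i][j] = (1 - mu) * cost[i][j] + mu * max_inter
    return refined
```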

  25. Roadmap  Introduction  Does edge-cut still matter?  Why edge-cut still matters  Argo  Evaluation  Conclusions

  26. Evaluation: Workloads & Datasets
      Three Classic Graph Workloads
     o Breadth First Search (BFS)
     o Single Source Shortest Path (SSSP)
     o PageRank
      Three Real-World Large Graphs
     Dataset      |V|    |E|
     Orkut        3M     234M
     Friendster   124M   3.6B
     Twitter      52M    3.9B

  27. Evaluation: Platform
     Cluster Configuration
     # of Nodes: 32
     Network Topology: FDR InfiniBand (single switch)
     Network Bandwidth: 56Gbps
     Compute Node Configuration
     # of Sockets: 2 (Intel Haswell, 10 cores/socket)
     L3 Cache: 25MB

  28. Evaluation: Partitioners  METIS: the most well-known multi-level partitioner.  LDG: the most well-known streaming partitioner.  ARGO-H: assumes the network is the bottleneck. o Weights edge-cut by the original network comm costs.  ARGO: assumes memory is the bottleneck. o Weights edge-cut by the refined network comm costs.

  29. Evaluation: SSSP Exec. Time on Orkut dataset ★ Orkut: |V| = 3M, |E| = 234M ★ 60 Partitions: three 20-core machines [Chart: SSSP execution time vs. message grouping size, i.e., the number of msgs sent by a single SSSP process to the same destination that are grouped into one msg; annotations range from 5x down to 1x.] ✓ ARGO had the lowest SSSP execution time.
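A minimal sketch of the message-grouping idea (class and parameter names are ours): updates are buffered per destination and flushed as one message once the grouping size is reached.

```python
from collections import defaultdict

class GroupingSender:
    """Buffer outgoing messages per destination; send them in groups."""

    def __init__(self, group_size, transport):
        self.group_size = group_size
        self.transport = transport            # callable(dest, list_of_msgs)
        self.pending = defaultdict(list)

    def send(self, dest, msg):
        self.pending[dest].append(msg)
        if len(self.pending[dest]) >= self.group_size:
            self.flush(dest)

    def flush(self, dest):
        msgs = self.pending.pop(dest, [])
        if msgs:
            self.transport(dest, msgs)        # one network message, many updates

    def flush_all(self):
        for dest in list(self.pending):
            self.flush(dest)

# Example: group every 2 updates to the same destination into one message.
sent = []
s = GroupingSender(2, lambda dest, msgs: sent.append((dest, msgs)))
s.send(1, "a")
s.send(1, "b")                                # triggers a grouped send
s.flush_all()
print(sent)                                   # [(1, ['a', 'b'])]
```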

  30. Evaluation: SSSP LLC Misses on Orkut dataset ★ Orkut: |V| = 3M, |E| = 234M ★ 60 Partitions: three 20-core machines [Chart: SSSP LLC misses vs. message grouping size; annotations range from 50x down to 1x.] ✓ ARGO had the fewest LLC misses.

  31. Evaluation: SSSP Comm Vol. on Orkut dataset ★ Orkut: |V| = 3M, |E| = 234M ★ 60 Partitions: three 20-core machines [Chart: communication volume breakdown; intra-socket share: METIS 69%, LDG 49%, ARGO-H 70%.] ✓ ARGO had the lowest intra-node communication volume. ✓ The distribution of the edge-cut also matters.

  32. Evaluation: SSSP Exec. Time vs Graph Size ★ Twitter: |V| = 52M, |E| = 3.9B ★ 80 Partitions: four 20-core machines ★ Message Grouping Size: 512 ✓ ARGO had the lowest SSSP execution time. ✓ Up to 6x improvement over ARGO-H. ✓ The improvement grew as the graph size increased.

  33. Evaluation: SSSP Exec. Time vs # of Partitions ★ Twitter: |V| = 52M, |E| = 3.9B ★ 80~200 Partitions: four to ten 20-core machines ★ Message Grouping Size: 512 ✓ ARGO always outperformed LDG and ARGO-H. ✓ Up to 11x improvement over ARGO-H.

  34. Evaluation: SSSP Exec. Time vs # of Partitions ★ Twitter: |V| = 52M, |E| = 3.9B ★ 80~200 Partitions: four to ten 20-core machines ★ Message Grouping Size: 512 [Chart annotations: 160 partitions = 13h, 180 partitions = 6h.] ✓ Hours of CPU time saved.

  35. Evaluation: Partitioning Overhead ★ Twitter: |V| = 52M, |E| = 3.9B ★ 80~200 Partitions: four to ten 20-core machines [Charts: partitioning time, and partitioning time as a percentage of the CPU time saved on SSSP execution, vs. # of partitions.] ✓ ARGO is indeed slower than LDG. ✓ The overhead was negligible in comparison to the CPU time saved. ✓ Graph analytics usually run for much longer than partitioning takes.

  36. Conclusions  Findings o The network is not always the bottleneck. o Contention on memory subsystems can significantly hurt performance, due to excess intra-node data comm. o Both edge-cut and its distribution matter.  ARGO o Avoids contention by offloading excess intra-node data comm across nodes. o Achieves up to 11x improvement on real-world workloads. o Scales well in terms of both graph size and number of partitions. Thanks! Acknowledgments: Peyman Givi, Patrick Pisciuneri. Funding: NSF CBET-1609120, NSF CBET-1250171, BigData’16 Student Travel Award.
