Argo: Architecture-Aware Graph Partitioning
Angen Zheng, Alexandros Labrinidis, Panos K. Chrysanthis, and Jack Lange
Department of Computer Science, University of Pittsburgh
http://db.cs.pitt.edu/group/
http://www.prognosticlab.org/
Big Graphs Are Everywhere [SIGMOD'16 Tutorial]
A Balanced Partitioning = Even Load Distribution
Minimal Edge-Cut = Minimal Data Comm
[Figure: a graph split across three machines N1, N2, N3]
Assumption: the network is the bottleneck.
The End of Slow Networks: Network is now as fast as DRAM [C. Binnig, VLDB'15]
✓ Dual-socket Xeon E5v2 server with DDR3-1600
✓ InfiniBand: 1.7GB/s~37.5GB/s (2 FDR 4x NICs per socket)
✓ DDR3: 6.25GB/s~16.6GB/s
The End of Slow Networks: Does edge-cut still matter?
Roadmap
Introduction
Does edge-cut still matter?
Why does edge-cut still matter?
Argo
Evaluation
Conclusions
The End of Slow Networks: Does edge-cut still matter?
Graph Partitioners: METIS and LDG
Graph Workloads: BFS, SSSP, and PageRank
Graph Dataset: Orkut (|V|=3M, |E|=234M)
Number of Partitions: 16 (one partition per core)
The End of Slow Networks: Does edge-cut still matter?
SSSP Execution Time (s)
m:s:c   METIS   LDG
1:2:8   633     2,632
2:2:4   654     2,565
4:2:2   521     631
8:2:1   222     280
(m: # of machines used; s: # of sockets used per machine; c: # of cores used per socket; up to 9x gap across configurations)
✓ Denser configurations had longer execution time.
  ○ Contention on the memory subsystems impacted performance.
  ○ The network may not always be the bottleneck.
The End of Slow Networks: Does edge-cut still matter?
SSSP Execution Time (s)
m:s:c   METIS   LDG
1:2:8   633     2,632
2:2:4   654     2,565
4:2:2   521     631
8:2:1   222     280
SSSP LLC Misses (in Millions)
m:s:c   METIS    LDG
1:2:8   10,292   44,117
2:2:4   10,626   44,689
4:2:2   2,541    1,061
8:2:1   96       187
(up to 9x gap in execution time and 235x in LLC misses across configurations)
✓ Denser configurations had longer execution time.
  ○ Contention on the memory subsystems impacted performance.
  ○ The network may not always be the bottleneck.
  ○ The distribution of edge-cut matters.
✓ METIS had lower execution time and fewer LLC misses than LDG.
  ○ Edge-cut matters: higher edge-cut --> higher comm --> higher contention.
The End of Slow Networks: Does edge-cut still matter?
Yes! Both edge-cut and its distribution matter!
✓ Intra-node and inter-node data communication have different performance impacts on the memory subsystems of modern multicore machines.
Roadmap
Introduction
Does edge-cut still matter?
Why does edge-cut still matter?
Argo
Evaluation
Conclusions
Intra-Node Data Comm: Shared Memory
[Figure: the sending core loads the message from its send buffer and writes it into the shared buffer; the receiving core then loads it from the shared buffer and writes it into its receive buffer]
→ Extra memory copy through the shared buffer.
Intra-Node Data Comm: Shared Memory
The send, shared, and receive buffers all get cached along the way, causing:
✓ Cache pollution
✓ LLC and memory bandwidth contention
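The extra copy in shared-memory message passing can be sketched in a few lines; the function and buffer names here are illustrative, not Argo's implementation. The payload crosses memory twice, once into the shared buffer and once out of it, touching the cache hierarchy each time.

```python
# Minimal sketch (hypothetical names) of shared-memory message passing
# between two cores on the same node.

def send(payload: bytes, shared_buf: bytearray) -> int:
    """Sender: load the payload and write it into the shared buffer."""
    shared_buf[:len(payload)] = payload   # copy #1: send buffer -> shared buffer
    return len(payload)

def receive(shared_buf: bytearray, n: int) -> bytes:
    """Receiver: load from the shared buffer into a private receive buffer."""
    return bytes(shared_buf[:n])          # copy #2: shared buffer -> receive buffer

shared = bytearray(64)
n = send(b"edge-cut update", shared)
assert receive(shared, n) == b"edge-cut update"
```

Both copies pull the payload through the caches, which is the source of the cache pollution and bandwidth contention described above.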
Excess intra-node data communication may hurt performance.
Inter-Node Data Comm: RDMA Read/Write
[Figure: the sending core on Node #1 writes to its send buffer, and the InfiniBand HCAs transfer the data directly into the receive buffer on Node #2]
✓ No extra memory copy and no cache pollution.
Offloading excess intra-node data comm across nodes may achieve better performance.
Roadmap
Introduction
Does edge-cut still matter?
Why does edge-cut still matter?
Argo
Evaluation
Conclusions
Argo: Graph Partitioning Model
[Figure: vertices arrive as a stream, and the partitioner assigns each one to a partition on the fly]
Streaming Graph Partitioning Model [I. Stanton, KDD'12]
Argo: Architecture-Aware Vertex Placement
Place each vertex, v, in the partition, Pi, that maximizes the weighted edge-cut score, penalizing the placement by the load of Pi.
✓ Weighted by the relative network comm cost, Argo will avoid edge-cut across nodes (inter-node data comm).
  ○ Great for cases where the network is the bottleneck.
Argo: Architecture-Aware Vertex Placement
Degree of Contention (𝞵 ∈ [0, 1]):
✓ 𝞵 = 0 (network is the bottleneck): use the original network comm costs, where inter-node comm is maximally expensive.
✓ 𝞵 = 1 (memory is the bottleneck): use the refined network comm costs, which also penalize intra-node comm.
✓ Weighted by the refined relative network comm cost, Argo will also avoid edge-cut across cores of the same node (intra-node data comm).
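A rough sketch of this placement rule follows; the affinity values, the LDG-style load penalty, and all names are assumptions based on the slides' description, not the authors' implementation. With 𝞵 = 0 a neighbor on the same node is as good as one in the same partition (only inter-node cuts are avoided); with 𝞵 = 1 same-node neighbors in other partitions count for nothing (intra-node cuts are avoided too).

```python
def affinity(pi, pu, cores_per_node, mu):
    """Hypothetical relative gain of an edge whose endpoints end up in
    partitions pi and pu: 1 if same partition, 1 - mu if different
    partitions on the same node, 0 if different nodes."""
    if pi == pu:
        return 1.0
    if pi // cores_per_node == pu // cores_per_node:
        return 1.0 - mu
    return 0.0

def place(v, neighbors, assignment, loads, capacity, cores_per_node, mu):
    """Greedy streaming placement: maximize architecture-weighted
    neighbor affinity, penalized by the partition's current load."""
    best, best_score = 0, -1.0
    for pi in range(len(loads)):
        gain = sum(affinity(pi, assignment[u], cores_per_node, mu)
                   for u in neighbors if u in assignment)
        score = gain * (1.0 - loads[pi] / capacity)  # LDG-style balance penalty
        if score > best_score:
            best, best_score = pi, score
    assignment[v] = best
    loads[best] += 1
    return best
```

For example, with four partitions mapped two per node and equal loads, a new vertex whose only placed neighbor sits in partition 2 lands in partition 2; if partition 2 is full and 𝞵 = 0, its same-node sibling partition becomes the next best choice.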
Roadmap
Introduction
Does edge-cut still matter?
Why does edge-cut still matter?
Argo
Evaluation
Conclusions
Evaluation: Workloads & Datasets
Three Classic Graph Workloads
o Breadth First Search (BFS)
o Single Source Shortest Path (SSSP)
o PageRank
Three Real-World Large Graphs
Dataset      |V|     |E|
Orkut        3M      234M
Friendster   124M    3.6B
Twitter      52M     3.9B
Evaluation: Platform
Cluster Configuration
# of Nodes: 32
Network Topology: FDR InfiniBand (Single Switch)
Network Bandwidth: 56Gbps
Compute Node Configuration
# of Sockets: 2 (Intel Haswell, 10 cores/socket)
L3 Cache: 25MB
Evaluation: Partitioners
o METIS: the most well-known multi-level partitioner.
o LDG: the most well-known streaming partitioner.
o ARGO-H: assumes the network is the bottleneck; weights edge-cut by the original network comm costs.
o ARGO: assumes memory is the bottleneck; weights edge-cut by the refined network comm costs.
Evaluation: SSSP Exec. Time on Orkut Dataset
★ Orkut: |V| = 3M, |E| = 234M
★ 60 Partitions: three 20-core machines
[Chart: SSSP execution time vs. message grouping size; relative slowdowns of 1x~5x across partitioners]
(Message grouping: multiple msgs sent by a single SSSP process to the same destination are grouped into one msg.)
✓ ARGO had the lowest SSSP execution time.
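Message grouping as described above can be sketched as follows (class, parameter, and transport names are hypothetical, not Argo's API): messages headed to the same destination are buffered and flushed as one combined message once the group reaches the configured size.

```python
from collections import defaultdict

class GroupingSender:
    """Buffers messages per destination and sends them in groups,
    trading a little latency for fewer, larger messages."""

    def __init__(self, group_size, transport):
        self.group_size = group_size
        self.transport = transport            # callable(dest, list_of_msgs)
        self.buffers = defaultdict(list)

    def send(self, dest, msg):
        self.buffers[dest].append(msg)
        if len(self.buffers[dest]) >= self.group_size:
            self.flush(dest)

    def flush(self, dest):
        if self.buffers[dest]:
            self.transport(dest, self.buffers[dest])
            self.buffers[dest] = []

    def flush_all(self):
        """Drain partially filled groups, e.g. at the end of a superstep."""
        for dest in list(self.buffers):
            self.flush(dest)
```

With a grouping size of 2, five messages to one destination go out as three transfers instead of five.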
Evaluation: SSSP LLC Misses on Orkut Dataset
★ Orkut: |V| = 3M, |E| = 234M
★ 60 Partitions: three 20-core machines
[Chart: SSSP LLC misses vs. message grouping size; relative gaps of 1x~50x across partitioners]
✓ ARGO had the lowest LLC misses.
Evaluation: SSSP Comm Vol. on Orkut Dataset
★ Orkut: |V| = 3M, |E| = 234M
★ 60 Partitions: three 20-core machines
[Chart: communication volume breakdown; intra-socket share: METIS 69%, LDG 49%, ARGO-H 70%]
✓ ARGO had the lowest intra-node communication volume.
✓ The distribution of the edge-cut also matters.
Evaluation: SSSP Exec. Time vs Graph Size
★ Twitter: |V| = 52M, |E| = 3.9B
★ 80 Partitions: four 20-core machines
★ Message Grouping Size: 512
✓ ARGO had the lowest SSSP execution time.
✓ Up to 6x improvement against ARGO-H.
✓ The improvement grew as the graph size increased.
Evaluation: SSSP Exec. Time vs # of Partitions
★ Twitter: |V| = 52M, |E| = 3.9B
★ 80~200 Partitions: four up to ten 20-core machines
★ Message Grouping Size: 512
✓ ARGO always outperformed LDG and ARGO-H.
✓ Up to 11x improvement against ARGO-H.
Evaluation: SSSP Exec. Time vs # of Partitions
★ Twitter: |V| = 52M, |E| = 3.9B
★ 80~200 Partitions: four up to ten 20-core machines
★ Message Grouping Size: 512
✓ Hours of CPU time saved (e.g., 13h at 160 partitions, 6h at 180).
Evaluation: Partitioning Overhead
★ Twitter: |V| = 52M, |E| = 3.9B
★ 80~200 Partitions: four up to ten 20-core machines
[Charts: partitioning time as a percentage of the CPU time saved (SSSP execution), vs. # of partitions]
✓ ARGO is indeed slower than LDG.
✓ The overhead was negligible in comparison to the CPU time saved, since graph analytics usually have much longer execution times.
Conclusions
Findings
o The network is not always the bottleneck.
o Contention on the memory subsystems may significantly hurt performance due to excess intra-node data comm.
o Both edge-cut and its distribution matter.
ARGO
o Avoids contention by offloading excess intra-node data comm across nodes.
o Achieves up to 11x improvement on real-world workloads.
o Scales well in terms of both graph size and number of partitions.
Thanks!
Acknowledgments: Peyman Givi, Patrick Pisciuneri
Funding: NSF CBET-1609120, NSF CBET-1250171, BigData'16 Student Travel Award