Paragon: Parallel Architecture-Aware Graph Partition Refinement Algorithm
Angen Zheng, Alexandros Labrinidis, Patrick Pisciuneri, Panos K. Chrysanthis, and Peyman Givi
University of Pittsburgh
Importance of Graph Partitioning
Applications of Graph Partitioning:
o Scientific Simulations
o Distributed Graph Computation (Pregel, Hama, Giraph)
o VLSI Design
o Task Scheduling
o Linear Programming
Target Workloads
★ Vertex
○ a unique identifier
○ a modifiable, user-defined value
★ Edge
○ a modifiable, user-defined value
○ a target vertex identifier
★ Vertex-Centric UDF
○ Change vertex/edge state
○ Send msg to neighbors
○ Receive msg from neighbors
○ Mutate the graph topology
○ Deactivate at the end of the superstep
○ Reactivate by external msgs
Minimizing the comm cost!!!
Balanced load distribution!!!
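The vertex-centric model above can be sketched as a Pregel-style superstep loop. This is an illustrative reconstruction, not the paper's API: the `Vertex` fields mirror the slide, while the label-propagation UDF (connected components) and the driver loop are assumptions.

```python
# Minimal sketch of a Pregel-style vertex-centric workload.
# The Vertex fields mirror the slide; the connected-components UDF
# and the run() driver are illustrative assumptions.

class Vertex:
    def __init__(self, vid, value, out_edges):
        self.vid = vid              # a unique identifier
        self.value = value          # a modifiable, user-defined value
        self.out_edges = out_edges  # target vertex identifiers

def compute(vertex, messages, outbox):
    """Vertex-centric UDF: adopt the smallest label seen so far and
    forward it; a vertex that sees no improvement stays quiet
    (i.e., deactivates until reactivated by external msgs)."""
    new_value = min([vertex.value] + messages)
    if new_value < vertex.value:
        vertex.value = new_value             # change vertex state
        for dst in vertex.out_edges:         # send msg to neighbors
            outbox.setdefault(dst, []).append(new_value)

def run(vertices):
    # Superstep 0: every vertex broadcasts its own label.
    inbox = {}
    for v in vertices.values():
        for dst in v.out_edges:
            inbox.setdefault(dst, []).append(v.value)
    # Later supersteps: only vertices with pending msgs are active.
    while inbox:
        outbox = {}
        for v in vertices.values():
            msgs = inbox.get(v.vid)
            if msgs:                         # reactivated by external msgs
                compute(v, msgs, outbox)
        inbox = outbox
    return {v.vid: v.value for v in vertices.values()}
```

Since every superstep exchanges messages along cut edges, both goals on the slide (minimal comm cost, balanced load) fall directly out of how vertices are partitioned across workers.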
A Balanced Partitioning = Even Load Distribution
[Figure: a graph partitioned evenly across nodes N1, N2, and N3]
Minimal Edge-Cut = Minimal Data Comm
[Figure: a minimal edge-cut partitioning across nodes N1, N2, and N3]
Minimal Data Comm ≠ Minimal Comm Cost
Roadmap
(chart axes: x = # of slides, y = % of audience asleep)
o Introduction ✔
o Heterogeneity
o State of the Art
o PARAGON
o Contention
o Experiments
Nonuniform Inter-Node Network Comm Cost
Comm costs vary a lot as the locations of the communicating nodes change!
Nonuniform Intra-Node Network Comm Cost
Cores sharing more cache levels communicate faster!
Inter-Node Comm Cost > Intra-Node Comm Cost
[Figure: Node#1 and Node#2 communicating over the network (Ethernet, IPoIB)]
Minimal Edge-Cut = Minimal Data Comm ≠ Minimal Comm Cost
Relative comm cost matrix:
     N1  N2  N3
N1    -   1   6
N2    1   -   1
N3    6   1   -
• 3 edge-cut
• 3 unit data comm
• 8 unit comm cost (8 = 1 * 6 + 2 * 1)
Minimal Edge-Cut = Minimal Data Comm ≠ Minimal Comm Cost
Relative comm cost matrix:
     N1  N2  N3
N1    -   1   6
N2    1   -   1
N3    6   1   -
• 4 edge-cut
• 4 unit data comm
• 4 unit comm cost (4 = 1 * 1 + 3 * 1)
Group neighboring vertices as close as possible!
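The arithmetic on the two slides above can be checked with a small helper: comm cost sums, over all cut edges, the relative comm cost between the two nodes hosting the endpoints. The cost matrix is the one shown on the slides; the two cut-edge lists are illustrative reconstructions of the figures.

```python
# Comm cost = sum over cut edges of the relative comm cost between
# the nodes hosting the two endpoints (each edge carries one unit of
# data). Matrix from the slide; the cut-edge lists are assumptions.

# Relative comm cost matrix for N1, N2, N3 (indices 0, 1, 2).
COST = [[0, 1, 6],
        [1, 0, 1],
        [6, 1, 0]]

def comm_cost(cut_edges, cost=COST):
    """cut_edges: one (node_i, node_j) pair per cut edge."""
    return sum(cost[i][j] for i, j in cut_edges)

# Minimal edge-cut: 3 cut edges, one crossing the expensive N1-N3 link.
min_cut = [(0, 2), (0, 1), (1, 2)]            # 1*6 + 2*1 = 8
# Architecture-aware cut: 4 cut edges, all on cheap links.
aware_cut = [(0, 1), (1, 2), (1, 2), (1, 2)]  # 1*1 + 3*1 = 4
```

Even though the second cut has one more cut edge (and one more unit of data comm), it halves the comm cost, which is exactly why minimizing edge-cut alone is not enough.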
Roadmap
o Introduction ✔
o Heterogeneity ✔
o State of the Art
o PARAGON
o Contention
o Experiments
Overview of the State-of-the-Art
Balanced Graph (Re)Partitioning:
o Partitioners (static graphs)
  • Offline methods (high quality, poor scalability): Metis
  • Online methods (moderate quality, high scalability): DG/LDG, Fennel
o Repartitioners (dynamic graphs)
  • Offline methods (high quality, poor scalability): Parmetis; Aragon and Paragon (heterogeneity-aware)
  • Online methods (moderate~high quality, high scalability): Hermes, CatchW, xdgp, Mizan, LogGP
(Venues from the slide: ICA3PP'08, SoCC'12, TKDE'15, BigData'15)
Our Prior Work: Aragon
A sequential architecture-aware graph partition refinement algorithm.
o Input:
  • A partitioned graph
  • The relative network comm cost matrix
o Output:
  • A partitioning with an improved mapping of the comm pattern to the underlying hardware topology
[1] Angen Zheng, Alexandros Labrinidis, and Panos K. Chrysanthis. Architecture-Aware Graph Repartitioning for Data-Intensive Scientific Computing. BigGraphs, 2014
Our Prior Work: Aragon
[Figure: partitions P1~P9 on nodes N1~N9; one node (N5) runs Aragon's heterogeneity-aware refinement over the whole graph (more details in the paper)]
o The refining node must hold the entire graph in memory
o Prefers to work in offline mode
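The slide defers Aragon's details to the paper; as a rough illustration only, an architecture-aware refinement can score a vertex move by how much it lowers comm cost under the relative cost matrix. The function below is a simplification of the paper's algorithm, and every name in it is an assumption.

```python
# Rough sketch of an architecture-aware move gain in the spirit of
# Aragon-style refinement: moving a vertex pays off when it lowers
# the comm cost of that vertex's edges under the relative comm cost
# matrix. A simplification, not the paper's actual algorithm.

def move_gain(v, src, dst, part, adj, cost):
    """Gain of moving vertex v from partition src to partition dst.
    part: vertex -> partition; adj: vertex -> {neighbor: edge weight};
    cost: relative comm cost matrix (cost[p][p] == 0, local is free)."""
    old = sum(w * cost[src][part[u]] for u, w in adj[v].items())
    new = sum(w * cost[dst][part[u]] for u, w in adj[v].items())
    return old - new   # positive => the move reduces comm cost
```

With the slide's cost matrix, moving a vertex whose neighbor sits across the expensive N1-N3 link toward the cheap N1-N2 link yields a positive gain, matching the "group neighboring vertices as close as possible" goal.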
Roadmap
o Introduction ✔
o Heterogeneity ✔
o State of the Art ✔
o PARAGON
o Contention
o Experiments
Paragon
Overview:
o Parallel Architecture-Aware Graph Partition Refinement Algorithm
Goal:
o Group neighboring vertices as close as possible
Paragon vs Aragon:
○ lower overhead
○ scales to much larger graphs
Paragon: Partition Grouping
[Figure: partitions P1~P9 on nodes N1~N9, grouped into {P1, P2, P3}, {P4, P6, P9}, and {P5, P7, P8}]
Paragon: Group Server Selection
[Figure: one node per group is chosen as the group server: N2 for {P1, P2, P3}, N9 for {P4, P6, P9}, N8 for {P5, P7, P8}]
Paragon: Sending “Partition” to Group Servers
[Figure: every node ships its partition to its group server]
Only send boundary vertices!
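The "only send boundary vertices" optimization works because refinement only ever moves vertices that have a neighbor in another partition. A sketch of identifying them, under assumed data structures (not the paper's):

```python
# Sketch: a vertex must be shipped to the group server only if at
# least one of its neighbors lives in a different partition.
# The dict-based graph representation here is an assumption.

def boundary_vertices(my_part, part, adj):
    """my_part: this node's partition id;
    part: vertex -> partition; adj: vertex -> iterable of neighbors."""
    return {v for v, nbrs in adj.items()
            if part[v] == my_part
            and any(part[u] != my_part for u in nbrs)}
```

Interior vertices stay put, which is what lets Paragon keep the per-server memory footprint far below Aragon's whole-graph requirement.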
Paragon: Parallel Refinement
[Figure: each group server runs Aragon on the partitions of its group, concurrently with the other groups]
○ # of Groups ⇒ Degree of Parallelism
○ Parallelism vs Quality: more groups refine concurrently, but fewer partition pairs get refined per round
Paragon: Shuffle Refinement
[Figure: the group servers (N2, N9, N8) refine their groups with Aragon in parallel, then partitions are swapped across groups and refinement runs again]
Repeat k times to increase the # of partition pairs being refined!
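Putting the four steps above together, the overall flow can be sketched as follows. The concrete policies here (random grouping and shuffling, first group member's node as server) are simplifying assumptions rather than the paper's heuristics, and `aragon_refine` stands in for the real pairwise refinement running on the group server.

```python
import random

# High-level sketch of the Paragon flow: group partitions, pick a
# group server per group, refine every partition pair inside each
# group (groups proceed in parallel), then shuffle and repeat.
# Grouping/server/shuffle policies below are assumptions.

def paragon(partitions, group_size, k, aragon_refine, seed=0):
    """partitions: list of partition ids; k: # of shuffle rounds;
    aragon_refine(server, p, q): refine the pair (p, q) on server."""
    rng = random.Random(seed)
    for _ in range(k + 1):               # initial round + k shuffles
        rng.shuffle(partitions)          # shuffle => new pairs meet
        groups = [partitions[i:i + group_size]
                  for i in range(0, len(partitions), group_size)]
        for group in groups:             # groups run concurrently
            server = group[0]            # group server selection
            # Each node ships only its boundary vertices to `server`,
            # which refines every partition pair within the group.
            for i in range(len(group)):
                for j in range(i + 1, len(group)):
                    aragon_refine(server, group[i], group[j])
    return groups
```

With 9 partitions in groups of 3, one round refines 3 × C(3,2) = 9 pairs out of C(9,2) = 36 possible pairs; each shuffle round lets previously unpaired partitions meet, which is the parallelism-vs-quality trade-off from the previous slides.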
Roadmap
o Introduction ✔
o Heterogeneity ✔
o State of the Art ✔
o PARAGON ✔
o Contention
o Experiments
Inter-Node Comm Cost ? Intra-Node Comm Cost
[Figure: Node#1 and Node#2 communicating over an RDMA-enabled network]
Inter-Node Comm Cost ≅ Intra-Node Comm Cost
★ Dual-socket Xeon E5v2 server with:
  ○ DDR3-1600
  ○ 2 FDR 4x NICs per socket
★ Infiniband: 1.7GB/s~37.5GB/s
★ DDR3: 6.25GB/s~16.6GB/s
Revisit the impact of the memory subsystem carefully!
[2] C. Binnig, U. Çetintemel, A. Crotty, A. Galakatos, T. Kraska, E. Zamanian, and S. B. Zdonik. The End of Slow Networks: It's Time for a Redesign. CoRR, 2015
Intra-Node Shared Resource Contention
[Figure: shared-memory data transfer between a sending core and a receiving core:
1. the sender loads data from the send buffer; 2a/2b. it loads and writes the data into the shared buffer;
3. the receiver loads the data from the shared buffer; 4a/4b. it loads and writes it into the receive buffer]
Intra-Node Shared Resource Contention
[Figure: the send, shared, and receive buffers all cached in the LLC]
Multiple copies of the same data in the LLC, contending for the LLC and MC.
Intra-Node Shared Resource Contention
[Figure: the send/shared buffer cached in one socket's LLC, the receive/shared buffer cached in the other socket's LLC]
Multiple copies of the same data in the LLC, contending for the LLC, MC, and QPI.
Paragon: Avoiding Contention
o Intra-node network comm cost ← degree of contention (small HPC clusters)
o Inter-node network comm cost ← maximal network comm cost (cloud/large clusters)
Paragon: Avoiding Contention
[Figure: the sending core on Node#1 and the receiving core on Node#2 exchange send/receive buffers directly through their IB HCAs]
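One way to read the two slides above: Paragon folds contention into the same relative comm cost matrix it refines against, with intra-node entries derived from the measured degree of contention and inter-node entries from the network comm cost. The sketch below assembles such a matrix; the core-to-node mapping, numbers, and function names are illustrative assumptions, not measurements from the paper.

```python
# Sketch of building a relative comm cost matrix that folds shared-
# resource contention into intra-node costs. All inputs here are
# illustrative assumptions, not the paper's measured values.

def build_cost_matrix(core_node, intra_cost, inter_cost):
    """core_node: core id -> node id (cores numbered 0..n-1);
    intra_cost: contention-derived cost between same-node cores;
    inter_cost: (node_a, node_b) -> network comm cost, with a < b."""
    n = len(core_node)
    cost = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue                       # local: free
            if core_node[i] == core_node[j]:
                cost[i][j] = intra_cost        # degree of contention
            else:
                a, b = sorted((core_node[i], core_node[j]))
                cost[i][j] = inter_cost[(a, b)]
    return cost
```

Once the matrix is built, the refinement itself is unchanged: the same gain computation simply sees contention-aware costs for same-node partition pairs.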
Roadmap
o Introduction ✔
o Heterogeneity ✔
o State of the Art ✔
o PARAGON ✔
o Contention ✔
o Experiments
Evaluation
MicroBenchmarks
o Degree of Refinement Parallelism
o Varying Shuffle Refinement Times
o Varying Initial Partitioners
Real-World Workloads
o Breadth First Search (BFS)
o Single Source Shortest Path (SSSP)
Billion-Edge Graph Scaling
Degree of Refinement Parallelism: Refinement Time
★ com-lj: |V| = 4M, |E| = 69M
★ 40 partitions: two 20-core machines
★ Initial Partitioner: DG (deterministic greedy)
★ # of Shuffle Times: 0
[Figure: refinement time of Paragon vs Aragon as the degree of parallelism varies]
Degree of Refinement Parallelism: Partitioning Quality
★ com-lj: |V| = 4M, |E| = 69M
★ 40 partitions: two 20-core machines
★ Initial Partitioner: DG (deterministic greedy)
★ # of Shuffle Times: 0
[Figure: partitioning quality as the degree of parallelism varies]
Varying Shuffle Refinement Times
★ com-lj: |V| = 4M, |E| = 69M
★ 40 partitions: two 20-core machines
★ Initial Partitioner: DG (deterministic greedy)
★ Deg. of Parallelism: 8
With # of shuffle refinement times > 10:
○ Paragon had lower refinement overhead: 8~10s vs 33s (Paragon vs Aragon)
○ Paragon produced better decompositions: by 0~2.6% (Paragon vs Aragon)
Varying Initial Partitioners
Dataset: 12 datasets from various areas
# of Parts: 40 (two 20-core machines)
Initial Partitioner: HP/DG/LDG
Deg. of Parallelism: 8
# of Refinement Times: 8
HP: Hashing Partitioning; DG: Deterministic Greedy Partitioning; LDG: Linear Deterministic Greedy Partitioning
Impact of Varying Initial Partitioners: Partitioning Quality
Improv.   Max   Avg.
HP        58%   43%
DG        29%   17%
LDG       53%   36%