Planar: Parallel Lightweight Architecture-Aware Adaptive Graph Repartitioning
Angen Zheng, Alexandros Labrinidis, and Panos K. Chrysanthis
University of Pittsburgh

Graph Partitioning
Applications of Graph Partitioning
★ Scientific Simulations
★ Distributed Graph Computation
  ○ Pregel, Hama, Giraph
★ VLSI Design
★ Task Scheduling
★ Linear Programming

A Balanced Partitioning = Even Load Distribution
[Figure: example graph partitioned across nodes N1, N2, and N3]
Balanced: [equation not preserved]

Minimal Edge-Cut = Minimal Data Comm
[Figure: example graph partitioned across nodes N1, N2, and N3]
Minimizing Edge-Cut: [equation not preserved]

Minimal Edge-Cut = Minimal Data Comm
But Minimal Data Comm ≠ Minimal Comm Cost
Figure 1. Pair-Wise Network Bandwidth (J. Xue, BigData’15). Three panels, labeled STD DEV: 269.71 Mb/s, 416.82 Mb/s, and 358.34 Mb/s.
★ Group neighboring vertices as close as possible
★ The partitioner has to be Architecture-Aware

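The distinction between edge-cut and communication cost can be illustrated with a minimal sketch. The function names (`edge_cut`, `comm_cost`), the toy graph, and the relative link costs below are assumptions made for illustration and are not taken from the slides' figure. Two placements of the same three vertices cut the same number of edges, yet their cost differs once each cut edge is weighted by the cost of the link it crosses.

```python
from typing import Dict, FrozenSet, List, Tuple

Edge = Tuple[str, str]

# Assumed relative network costs between partitions (illustrative values only).
COST: Dict[FrozenSet[str], float] = {
    frozenset(("N1", "N2")): 6.0,
    frozenset(("N1", "N3")): 1.0,
    frozenset(("N2", "N3")): 1.0,
}


def edge_cut(edges: List[Edge], part: Dict[str, str]) -> int:
    """Count edges whose endpoints live in different partitions."""
    return sum(1 for u, v in edges if part[u] != part[v])


def comm_cost(edges: List[Edge], part: Dict[str, str],
              cost: Dict[FrozenSet[str], float]) -> float:
    """Sum the link cost of every cut edge (architecture-aware view)."""
    return sum(cost[frozenset((part[u], part[v]))]
               for u, v in edges if part[u] != part[v])


# Hypothetical toy graph: vertex a is connected to b and c.
EDGES: List[Edge] = [("a", "b"), ("a", "c")]
FAR = {"a": "N1", "b": "N2", "c": "N2"}   # neighbors across the expensive link
NEAR = {"a": "N1", "b": "N3", "c": "N3"}  # neighbors across the cheap link

if __name__ == "__main__":
    for name, part in (("FAR", FAR), ("NEAR", NEAR)):
        print(name, edge_cut(EDGES, part), comm_cost(EDGES, part, COST))
```

Both placements have an edge-cut of 2, so an edge-cut-only partitioner cannot tell them apart, while the cost-weighted view prefers the second.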
Overview of the State-of-the-Art
Balanced Graph (Re)Partitioning
★ Partitioners (static graphs)
  ○ Offline Methods (High Quality) (Poor Scalability)
  ○ Online Methods (Moderate Quality) (High Scalability)
★ Repartitioners (dynamic graphs)
  ○ Offline Methods (High Quality) (Poor Scalability)
  ○ Online Methods (Moderate~High Quality) (High Scalability)
[Two "Architecture-Aware" callouts mark entries in this taxonomy]

Roadmap
★ Introduction
★ Planar
★ Evaluation
★ Conclusions

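Planar: Problem Statement
Given G=(V, E) and an initial Partitioning P:
★ Balancing Load: [equation not preserved]
★ Minimizing Communication: [equation not preserved]
★ Minimizing Migration: [equation not preserved]
[Label on the equations: Network Cost]
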
Planar: Overview
[Figure: timeline of supersteps S k, S k+1, S k+2, S k+4, S k+5, with Planar applied at each]
Phase-1: Logical Vertex Migration
★ Migration Planning
  ○ What vertices to move?
  ○ Where to move?
★ Phase-1a: Minimizing Comm Cost
★ Phase-1b: Ensuring Balanced Partitions
Phase-2: Physical Vertex Migration
★ Perform the Migration Plan
Phase-3: Convergence Check
★ Still beneficial?

Phase-1a: Minimizing Comm Cost
[Figure: example graph partitioned across N1, N2, and N3, with a relative network cost matrix: N1-N2 = 6, N1-N3 = 1, N2-N3 = 1]

Phase-1a: Minimizing Comm Cost
★ Run Planar on each partition in Parallel
  ○ Each boundary vertex of my partition
    ■ make a migration decision on my own
    ■ Probabilistic vertex migration
[Figure: example graph on N1, N2, and N3 with relative network costs]

Phase-1a@N1: Use vertex a as an example
g(a, N1, N1) = 0
Max Gain: 0
Optimal Dest: N1

Phase-1a@N1: Move vertex a to N2?
old_comm(a, N1) = 2 * 6 + 1 * 1 = 13
new_comm(a, N2) = 1 * 6 + 1 * 1 = 7
mig(a, N1, N2) = 1 * 6 = 6
g(a, N1, N2) = 13 - 7 - 6 = 0
Max Gain: 0
Optimal Dest: N1

Phase-1a@N1: Move vertex a to N3?
old_comm(a, N1) = 2 * 6 + 1 * 1 = 13
new_comm(a, N3) = 1 * 1 + 2 * 1 = 3
mig(a, N1, N3) = 1 * 1 = 1
g(a, N1, N3) = 13 - 3 - 1 = 9
Max Gain: 9
Optimal Dest: N3

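The gain arithmetic for vertex a can be reproduced with a short sketch. The neighbor counts (one neighbor in N1, two in N2, one in N3) and the unit vertex size are inferred from the arithmetic on the slides rather than stated there, and the helper names (`link_cost`, `comm`, `gain`) are hypothetical.

```python
from typing import Dict

# Relative network costs read off the worked example (N1-N2 = 6, others = 1).
COST = {
    frozenset(("N1", "N2")): 6,
    frozenset(("N1", "N3")): 1,
    frozenset(("N2", "N3")): 1,
}


def link_cost(p: str, q: str) -> int:
    """Cost of talking between partitions p and q; zero within a partition."""
    return 0 if p == q else COST[frozenset((p, q))]


def comm(neighbor_counts: Dict[str, int], host: str) -> int:
    """Communication cost of a vertex hosted on `host`, given how many of its
    neighbors sit in each partition."""
    return sum(n * link_cost(host, p) for p, n in neighbor_counts.items())


def gain(neighbor_counts: Dict[str, int], src: str, dst: str, size: int = 1) -> int:
    """g = old_comm - new_comm - mig, where mig = vertex size * link cost."""
    old_comm = comm(neighbor_counts, src)
    new_comm = comm(neighbor_counts, dst)
    mig = size * link_cost(src, dst)
    return old_comm - new_comm - mig


# Assumed neighborhood of vertex a, inferred from the slide arithmetic.
A_NEIGHBORS = {"N1": 1, "N2": 2, "N3": 1}

GAINS = {dst: gain(A_NEIGHBORS, "N1", dst) for dst in ("N1", "N2", "N3")}
BEST_DEST = max(GAINS, key=GAINS.get)

if __name__ == "__main__":
    print(GAINS, BEST_DEST)
```

Under these assumptions the three gains come out as 0, 0, and 9, matching the slides, so N3 is the preferred destination for a.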
Phase-1a: Probabilistic Vertex Migration
Migration Planning

Boundary Vtx   Partition   Migration Dest   Gain   Max Gain   Probability
a              N1          N3               9      9          9/9
b              N2          N3               2      3          2/3
d              N2          N3               3      3          3/3
e              N3          N3               0      0          0
g              N3          N3               0      0          0

Migrate with a probability proportional to the gain

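A minimal sketch of the probabilistic decision follows, assuming the probability is the vertex gain divided by the maximum gain within the vertex's own partition, which is what the 9/9, 2/3, and 3/3 entries suggest. The assignment of vertices to partitions is inferred from those ratios, and the names `migration_probability` and `plan_migrations` are hypothetical.

```python
import random
from typing import Dict, List, Tuple

# (vertex, current partition, proposed destination, gain)
Candidate = Tuple[str, str, str, int]

CANDIDATES: List[Candidate] = [
    ("a", "N1", "N3", 9),
    ("b", "N2", "N3", 2),
    ("d", "N2", "N3", 3),
    ("e", "N3", "N3", 0),
    ("g", "N3", "N3", 0),
]


def migration_probability(gain: int, max_gain: int) -> float:
    """Probability proportional to gain; zero when there is nothing to gain."""
    if gain <= 0 or max_gain <= 0:
        return 0.0
    return gain / max_gain


def plan_migrations(candidates: List[Candidate],
                    rng: random.Random) -> List[Tuple[str, str]]:
    """Each boundary vertex decides independently whether to migrate."""
    max_gain: Dict[str, int] = {}
    for _, part, _, g in candidates:
        max_gain[part] = max(max_gain.get(part, 0), g)

    plan = []
    for vertex, part, dest, g in candidates:
        p = migration_probability(g, max_gain[part])
        # rng.random() is in [0, 1), so p == 1.0 always migrates and
        # p == 0.0 never does.
        if rng.random() < p:
            plan.append((vertex, dest))
    return plan


if __name__ == "__main__":
    print(plan_migrations(CANDIDATES, random.Random(0)))
```

With these inputs a and d always migrate, e and g never do, and b migrates in roughly two out of three runs.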
Phase-1b: Balancing Partitions
Quota-Based Vertex Migration
Q1: How much work should each overloaded partition migrate to each underloaded partition?
  ■ Potential Gain Computation
    ● Similar to Phase-1a vertex gain computation
  ■ Iteratively allocate quota starting from the partition pair having the largest gain.
Q2: What vertices to migrate?
  ■ Phase-1a vertex migration, but limited by the quota.

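One way to read the quota allocation in Q1 is as a greedy loop over partition pairs in descending order of potential gain. The slides do not say how much quota each pair receives, so the sketch below assumes each pair gets as much as the remaining surplus and deficit allow. The function `allocate_quota` and its inputs are hypothetical.

```python
from typing import Dict, Tuple

Pair = Tuple[str, str]  # (overloaded source, underloaded destination)


def allocate_quota(surplus: Dict[str, int],
                   deficit: Dict[str, int],
                   potential_gain: Dict[Pair, float]) -> Dict[Pair, int]:
    """Greedily hand out migration quota, best-gain pair first.

    surplus: load each overloaded partition must shed.
    deficit: load each underloaded partition can absorb.
    potential_gain: estimated benefit of moving work along each pair.
    """
    remaining_surplus = dict(surplus)
    remaining_deficit = dict(deficit)
    quota: Dict[Pair, int] = {}

    ranked = sorted(potential_gain.items(), key=lambda kv: kv[1], reverse=True)
    for (src, dst), _gain in ranked:
        # Assumption: a pair takes everything both sides can still afford.
        amount = min(remaining_surplus.get(src, 0), remaining_deficit.get(dst, 0))
        if amount > 0:
            quota[(src, dst)] = amount
            remaining_surplus[src] -= amount
            remaining_deficit[dst] -= amount
    return quota


if __name__ == "__main__":
    print(allocate_quota(
        surplus={"N1": 5},
        deficit={"N2": 3, "N3": 4},
        potential_gain={("N1", "N2"): 2.0, ("N1", "N3"): 7.0},
    ))
```

In this example N1 first sends 4 units to N3, the pair with the larger gain, and the remaining 1 unit goes to N2. The resulting quota then caps how many vertices the Phase-1a style migration may move along each pair.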
Planar: Physical Vertex Migration
Phase-2: Physical Vertex Migration
★ Perform the Migration Plan

Planar: Convergence Check
Phase-3: Convergence Check
★ Still beneficial?

Phase-3: Convergence
[Figure: timeline of supersteps S k, S k+1, S k+2, S k+4, S k+5, with Planar applied at each. A Repartitioning Epoch runs from "Enough changes (structure/load)" to "Converge".]
★ Converge
  ○ improvement achieved per adaptation superstep < 𝜀
  ○ after 𝜐 consecutive adaptation supersteps
𝜀 = 1% and 𝜐 = 10 (via Sensitivity Analysis)

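The convergence rule can be sketched as a small counter, under one reading of the slide: repartitioning stops once the relative improvement has stayed below 𝜀 for 𝜐 consecutive adaptation supersteps. The class `ConvergenceCheck`, its method `update`, and the choice of relative cost reduction as the improvement measure are assumptions for illustration.

```python
class ConvergenceCheck:
    """Track consecutive low-improvement supersteps (hypothetical helper)."""

    def __init__(self, epsilon: float = 0.01, upsilon: int = 10) -> None:
        self.epsilon = epsilon    # minimum relative improvement that counts
        self.upsilon = upsilon    # consecutive low-improvement steps to stop
        self.streak = 0

    def update(self, prev_cost: float, new_cost: float) -> bool:
        """Record one adaptation superstep; return True once converged."""
        # Assumption: improvement is the relative reduction in cost.
        improvement = (prev_cost - new_cost) / prev_cost if prev_cost > 0 else 0.0
        if improvement < self.epsilon:
            self.streak += 1
        else:
            self.streak = 0  # a useful superstep resets the count
        return self.streak >= self.upsilon


if __name__ == "__main__":
    check = ConvergenceCheck()
    cost = 100.0
    for step in range(1, 20):
        new_cost = cost * 0.995  # 0.5% improvement, below the 1% threshold
        if check.update(cost, new_cost):
            print("converged after superstep", step)
            break
        cost = new_cost
```

With the defaults 𝜀 = 1% and 𝜐 = 10 taken from the slide, the loop above stops at the tenth superstep.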
Evaluation
★ Microbenchmarks
  ○ Convergence Study (Param Selection)
  ○ Partitioning Quality
★ Real-World Workloads
  ○ Breadth First Search (BFS)
  ○ Single Source Shortest Path (SSSP)
★ Scalability Test
  ○ Scalability vs Graph Size
  ○ Scalability vs # of Partitions
  ○ Scalability vs Graph Size and # of Partitions

Partitioning Quality: Setup
Dataset: 12 datasets from various areas
# of Parts: 40 (two 20-core machines)
Initial Partitioners:
  ○ HP: Hashing Partitioning
  ○ DG: Deterministic Greedy
  ○ LDG: Linear Deterministic Greedy

Partitioning Quality: Datasets

Dataset           |V|            |E|              Description
wave              156,317        2,118,662
auto              448,695        6,629,222        FEM
333SP             3,712,815      22,217,266
CA-CondMat        108,300        373,756          Collaboration Network
DBLP              317,080        1,049,866
Email-Enron       36,692         183,831
as-skitter        1,696,415      22,190,596       Internet Topology
Amazon            334,863        925,872          Product Network
USA-roadNet       23,947,347     58,333,344       Road Network
roadNet-PA        1,090,919      6,167,592
YouTube           3,223,589      24,447,548
Com-LiveJournal   4,036,537      69,362,378       Social Network
Friendster        124,836,180    3,612,134,270

Partitioning Quality: Planar achieved up to 68% improvement

Improv.   Max    Avg.
HP        68%    53%
DG        46%    24%
LDG       69%    48%

Real-World Workload: Setup

Cluster Configuration   PittMPICluster (FDR Infiniband)     Gordon (QDR Infiniband)
# of Nodes              32                                  1024
Network Topology        Single Switch (32 nodes / switch)   4*4*4 3D Torus of Switches (16 nodes / switch)
Network Bandwidth       56Gbps                              8Gbps

Node Configuration      PittMPICluster (Intel Haswell)      Gordon (Intel Sandy Bridge)
# of Sockets            2 (10 cores / socket)               2 (8 cores / socket)
L3 Cache                25MB                                20MB
Memory Bandwidth        65GB/s                              85GB/s

Planar: Avoiding Resource Contention on the Memory Subsystems of Multicore Machines

                                        PittMPICluster   Gordon
System Bottleneck (A. Zheng, EDBT’16)   Memory (λ=1)     Network (λ=0)

[Figure labels: Degree of Contention; Intra-Node Network Comm Cost; Maximal Inter-Node Network Comm Cost]

Real-World Workload: Baselines
Balanced Graph (Re)Partitioning
★ Partitioners (static graphs)
  ○ Offline Methods (High Quality) (Poor Scalability)
  ○ Online Methods (Moderate Quality) (High Scalability)
★ Repartitioners (dynamic graphs)
  ○ Offline Methods (High Quality) (Poor Scalability)
  ○ Online Methods (Moderate~High Quality) (High Scalability)
uniPlanar
Initial Partitioner: DG

BFS Exec. Time on PittMPICluster (λ=1): Planar achieved up to 9x speedups
★ as-skitter: |V| = 1.6M, |E| = 22M
★ 60 Partitions: three 20-core machines
[Bar chart: speedup labels 9x, 7.5x, 5.8x, 4.1x, 1.48x, 1.37x, 1x]

BFS Comm Volume on PittMPICluster (λ=1): Planar had the lowest intra-node comm volume
★ as-skitter: |V| = 1.6M, |E| = 22M
★ 60 Partitions: three 20-core machines

Reduction    Intra-Socket   Inter-Socket
DG           51%            38%
METIS        51%            36%
PARMETIS     47%            34%
uniPLANAR    44%            28%
ARAGON       4.3%           0.8%
PARAGON      5.2%           2.6%

BFS Exec. Time on Gordon (λ=0): Planar achieved up to 3.2x speedups
★ as-skitter: |V| = 1.6M, |E| = 22M
★ 48 Partitions: three 16-core machines
[Bar chart: speedup labels 3.2x, 1.16x, 1.21x, 1x, 1.05x]

BFS Comm. Volume on Gordon (λ=0): Planar had the lowest inter-node comm volume
★ as-skitter: |V| = 1.6M, |E| = 22M
★ 48 Partitions: three 16-core machines
[Bar chart: labels 51%, 11%, 0.1%, 25%]

Conclusions
PLANAR: Architecture-Aware Adaptive Graph Repartitioner
• Communication Heterogeneity
• Shared Resource Contention
Up to 9x speedups on real-world workloads.
Scaled up to a graph with 3.6B edges.
Acknowledgments: Peyman Givi, Patrick Pisciuneri, Mark Silvis
Funding: NSF OIA-1028162, NSF CBET-1250171