Graph Partitioning for Scalable Distributed Graph Computations
Aydın Buluç (ABuluc@lbl.gov), Kamesh Madduri (madduri@cse.psu.edu)
10th DIMACS Implementation Challenge: Graph Partitioning and Graph Clustering
February 13-14, 2012, Atlanta, GA
Overview of our study
• We assess the impact of graph partitioning on computations over 'low-diameter' graphs
• Does minimizing edge cut lead to lower execution time?
• We choose parallel breadth-first search (BFS) as a representative distributed graph computation
• Performance analysis on DIMACS Challenge instances
Key observations for parallel BFS
• Well-balanced vertex and edge partitions do not guarantee load-balanced execution, particularly for real-world graphs
  – Relative speedups range from 8.8x to 50x at 256-way parallel concurrency for the low-diameter DIMACS graph instances
• Graph partitioning methods reduce the overall edge cut and communication volume, but lead to increased computational load imbalance
• Inter-node communication time is not the dominant cost in our tuned bulk-synchronous parallel BFS implementation
Talk outline
• Level-synchronous parallel BFS on distributed-memory systems
  – Analysis of communication costs
• Machine-independent counts for inter-node communication cost
• Parallel BFS performance results for several large-scale DIMACS graph instances
Parallel BFS strategies
1. Expand the current frontier (level-synchronous approach, suited for low-diameter graphs)
   • O(D) parallel steps for a graph of diameter D
   • Adjacencies of all vertices in the current frontier are visited in parallel
2. Stitch together multiple concurrent traversals (Ullman-Yannakakis approach, for high-diameter graphs)
   • Path-limited searches from "super vertices"
   • APSP between "super source vertices"
[Figure: example 10-vertex graph (vertices 0-9) illustrating both strategies, with the source vertex marked]
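A minimal serial sketch of the first (level-synchronous) strategy may help: the whole frontier is expanded before the search advances one level, so a graph of diameter D needs O(D) outer iterations. The graph below is the seven-vertex example used on the 1D-partitioning slides; the function and variable names are illustrative, not taken from the talk's implementation.

```cpp
#include <cstdio>
#include <vector>

// Level-synchronous BFS: process the entire frontier, then advance one level.
std::vector<int> bfs_levels(const std::vector<std::vector<int>>& adj, int source) {
    std::vector<int> level(adj.size(), -1);
    std::vector<int> frontier{source};
    level[source] = 0;
    for (int d = 1; !frontier.empty(); ++d) {    // O(D) outer iterations
        std::vector<int> next;
        for (int u : frontier)                   // adjacencies of all frontier vertices
            for (int v : adj[u])                 // can, in principle, be visited in parallel
                if (level[v] == -1) { level[v] = d; next.push_back(v); }
        frontier.swap(next);
    }
    return level;
}

int main() {
    // The 7-vertex undirected example graph from the 1D-partitioning slides.
    std::vector<std::vector<int>> adj = {
        {1, 3}, {0, 4, 6}, {3, 5, 6}, {0, 2, 6}, {1, 5, 6}, {2, 4}, {1, 2, 3, 4}};
    std::vector<int> level = bfs_levels(adj, 0);
    for (size_t v = 0; v < level.size(); ++v)
        std::printf("vertex %zu: level %d\n", v, level[v]);
    return 0;
}
```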
"2D" graph distribution
• Consider a logical 2D processor grid (p_r x p_c = p) and the dense adjacency-matrix view of the graph
• Assign each processor a sub-matrix (i.e., the edges within that sub-matrix)
[Figure: 9 vertices on a 3x3 processor grid; the sparse adjacency matrix is divided into 3x3 blocks, and each block is flattened into a per-processor local graph representation]
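As a concrete illustration of the idea, the sketch below maps an edge (a nonzero of the adjacency matrix) to its owning process in a p_r x p_c grid under a simple block distribution. The blocking scheme and function names are assumptions for illustration, not the exact distribution used in the talk's implementation.

```cpp
#include <cstdio>

// Illustrative block mapping of an n x n adjacency matrix onto a pr x pc grid:
// rows are split into pr blocks, columns into pc blocks; the process at grid
// position (i, j) stores the nonzeros (edges) falling in block (i, j).
int owner_of_edge(int u, int v, int n, int pr, int pc) {
    int block_rows = (n + pr - 1) / pr;   // ceiling division
    int block_cols = (n + pc - 1) / pc;
    int i = u / block_rows;               // processor-row index
    int j = v / block_cols;               // processor-column index
    return i * pc + j;                    // row-major rank in the grid
}

int main() {
    const int n = 9, pr = 3, pc = 3;      // 9 vertices on a 3x3 grid, as on the slide
    std::printf("edge (0,7) -> process %d\n", owner_of_edge(0, 7, n, pr, pc));
    std::printf("edge (8,2) -> process %d\n", owner_of_edge(8, 2, n, pr, pc));
    return 0;
}
```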
BFS with a 1D-partitioned graph
Consider an undirected graph with n vertices and m edges. Each processor 'owns' n/p vertices and stores their adjacencies (~2m/p per processor, assuming balanced partitions).
[Figure: 7-vertex example graph with its adjacencies stored as (vertex, neighbor) pairs: [0,1] [0,3] [1,0] [1,4] [1,6] [2,3] [2,5] [2,6] [3,0] [3,2] [3,6] [4,1] [4,5] [4,6] [5,2] [5,4] [6,1] [6,2] [6,3] [6,4], distributed over four processors]
Steps:
1. Local discovery: explore adjacencies of vertices in the current frontier.
2. Fold: all-to-all exchange of adjacencies.
3. Local update: update distances/parents for unvisited vertices.
(A code sketch of one BFS level appears after the walk-through below.)
BFS with a 1D-partitioned graph (step 1)
Current frontier: vertices 1 (partition Blue) and 6 (partition Green)
1. Local discovery:
   P0: [1,0] [1,4] [1,6]
   P1: no work
   P2: no work
   P3: [6,1] [6,2] [6,3] [6,4]
BFS with a 1D-partitioned graph (step 2, before the exchange)
Current frontier: vertices 1 (partition Blue) and 6 (partition Green)
2. All-to-all exchange (send buffers):
   P0: [1,0] [1,4] [1,6]
   P1: no work
   P2: no work
   P3: [6,1] [6,2] [6,3] [6,4]
BFS with a 1D-partitioned graph (step 2, after the exchange)
Current frontier: vertices 1 (partition Blue) and 6 (partition Green)
2. All-to-all exchange (received adjacencies):
   P0: [1,0] [6,1]
   P1: [6,2] [6,3]
   P2: [1,4] [6,4]
   P3: [1,6]
BFS with a 1D-partitioned graph (step 3)
Current frontier: vertices 1 (partition Blue) and 6 (partition Green)
3. Local update (received adjacencies -> frontier for the next iteration):
   P0: [1,0] [6,1] -> 0
   P1: [6,2] [6,3] -> 2, 3
   P2: [1,4] [6,4] -> 4
   P3: [1,6] -> (none)
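The walk-through above can be condensed into code. The sketch below simulates one level of 1D-partitioned BFS inside a single process: vertices are block-distributed over p 'owners', the fold step is only modeled by moving (parent, vertex) pairs into per-owner buckets rather than by an actual MPI all-to-all, and all names are illustrative rather than taken from the talk's implementation.

```cpp
#include <cstdio>
#include <utility>
#include <vector>

// One level of 1D-partitioned BFS, simulated for p owners in a single process.
// owner(v) = v / vertices_per_part under a simple block distribution.
void bfs_level_1d(const std::vector<std::vector<int>>& adj,
                  std::vector<int>& parent,
                  std::vector<int>& frontier,
                  int p) {
    int n = static_cast<int>(adj.size());
    int per = (n + p - 1) / p;
    auto owner = [per](int v) { return v / per; };

    // 1. Local discovery: the owner of each frontier vertex expands its adjacencies.
    std::vector<std::vector<std::pair<int, int>>> sendbuf(p);  // indexed by destination owner
    for (int u : frontier)
        for (int v : adj[u])
            sendbuf[owner(v)].push_back({u, v});   // (parent, discovered vertex)

    // 2. Fold: in a real distributed run this is an all-to-all exchange; here the
    //    per-destination buckets already play the role of the received buffers.

    // 3. Local update: each owner marks its previously unvisited vertices.
    std::vector<int> next;
    for (int dest = 0; dest < p; ++dest)
        for (auto [u, v] : sendbuf[dest])
            if (parent[v] == -1) { parent[v] = u; next.push_back(v); }
    frontier.swap(next);
}

int main() {
    // The 7-vertex example graph from the slides, with 4 owners (P0..P3).
    std::vector<std::vector<int>> adj = {
        {1, 3}, {0, 4, 6}, {3, 5, 6}, {0, 2, 6}, {1, 5, 6}, {2, 4}, {1, 2, 3, 4}};
    std::vector<int> parent(adj.size(), -1);
    std::vector<int> frontier = {1, 6};   // current frontier, as on the slide
    parent[1] = 1; parent[6] = 6;         // already visited in earlier levels (self used as marker)
    bfs_level_1d(adj, parent, frontier, 4);
    for (int v : frontier) std::printf("newly visited: %d\n", v);  // prints 0, 2, 3, 4
    return 0;
}
```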
Modeling parallel execution time
• Time is dominated by local memory references and inter-node communication
• Assuming perfectly balanced computation and communication:

Local memory references (per process):
  β_N · (m/p) + L_{L,n/p} · (n/p)
  where β_N is the inverse local RAM bandwidth and L_{L,n/p} is the local latency on a working set of size n/p.

Inter-node communication (per process):
  β_{N,a2a(p)} · (edgecut(p) / p)
  where β_{N,a2a(p)} is the inverse all-to-all remote bandwidth with p participating processors and edgecut(p) is the number of cut edges in the p-way partition.
BFS with a 2D-partitioned graph
• Avoids the expensive p-way all-to-all communication step
• Each process collectively 'owns' n/p_r vertices
• Additional 'allgather' communication step for processes in a row

Local memory references (per process):
  β_N · (m/p) + L_{L,n/p_r} · (n/p_r) + L_{L,n/p_c} · (n/p_c)

Inter-node communication (per process):
  β_{N,a2a(p_r)} · (edgecut(p) / p) + β_{N,gather(p_c)} · (n/p_r) · (1 − 1/p_c)

The all-to-all now involves only the p_r processes of a processor column, and the allgather involves the p_c processes of a processor row.
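To get a feel for what the two models above imply, the sketch below plugs purely hypothetical parameter values into the 1D and 2D communication terms and prints them. Every constant here is a made-up placeholder for illustration; none of these numbers are measurements from the study or from Hopper.

```cpp
#include <cstdio>

int main() {
    // Hypothetical inputs: 100M vertices, 25% of 2B directed edges cut, 256 processes (16x16 grid).
    const double n = 1e8, edgecut = 5e8;
    const double p = 256, pr = 16, pc = 16;
    // Placeholder inverse bandwidths (seconds per word exchanged); not real machine parameters.
    const double beta_a2a_p  = 8e-9;   // all-to-all over all p processes
    const double beta_a2a_pr = 4e-9;   // all-to-all over a processor column of pr processes
    const double beta_gather = 2e-9;   // allgather over a processor row of pc processes

    double t_1d = beta_a2a_p * (edgecut / p);
    double t_2d = beta_a2a_pr * (edgecut / p)
                + beta_gather * (n / pr) * (1.0 - 1.0 / pc);
    std::printf("1D communication term: %.4f s\n", t_1d);
    std::printf("2D communication term: %.4f s\n", t_2d);
    return 0;
}
```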
Temporal effects and communication-minimizing tuning prevent us from obtaining tighter bounds
• The volume of communication can be further reduced by maintaining state for non-local visited vertices: redundant (parent, vertex) pairs are pruned locally prior to the all-to-all step
[Figure: example in which one process's discovered pairs, e.g. [0,3] [0,3] [1,3] [0,4] [1,4] [0,6] [1,6] [1,6], are pruned to [0,3] [0,4] [1,6] before the exchange]
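A minimal sketch of this local pruning idea: before the exchange, keep at most one (parent, vertex) pair per discovered vertex in the outgoing buffer. The pairs below are adapted from the figure's example; in a real implementation the 'queued' state would persist across levels so that already-visited non-local vertices are also dropped. Names are illustrative assumptions.

```cpp
#include <cstdio>
#include <unordered_set>
#include <utility>
#include <vector>

// Local pruning before the all-to-all: keep at most one (parent, vertex) pair
// per discovered vertex in this process's outgoing buffer.
std::vector<std::pair<int, int>> prune_duplicates(
        const std::vector<std::pair<int, int>>& discovered) {
    std::unordered_set<int> queued;                 // non-local vertices already queued
    std::vector<std::pair<int, int>> pruned;
    for (auto [parent, v] : discovered)
        if (queued.insert(v).second)                // first time we see v this level
            pruned.push_back({parent, v});
    return pruned;
}

int main() {
    // Pairs discovered by one process in one BFS level (adapted from the figure).
    std::vector<std::pair<int, int>> discovered =
        {{0, 3}, {0, 3}, {1, 3}, {0, 4}, {1, 4}, {0, 6}, {1, 6}, {1, 6}};
    for (auto [parent, v] : prune_duplicates(discovered))
        std::printf("send (%d,%d)\n", parent, v);   // one pair per target vertex
    return 0;
}
```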
Predictable BFS execution time for synthetic small-world graphs
• Randomly permuting vertex IDs ensures load balance on R-MAT graphs (used in the Graph 500 benchmark)
• Our tuned parallel implementation for the NERSC Hopper system (Cray XE6) is ranked #2 on the current Graph 500 list
• Execution time is dominated by work performed in a few parallel phases
[Reference: Buluç & Madduri, "Parallel BFS on distributed memory systems," Proc. SC'11, 2011]
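A minimal sketch of the relabeling step the first bullet refers to: vertex IDs are randomly permuted so that, under a block distribution of the ID space, high-degree vertices end up spread roughly evenly across processes. The function and parameter names are illustrative; the talk's implementation details may differ.

```cpp
#include <algorithm>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

// Randomly relabel vertices before block-distributing the ID space across processes.
std::vector<int> random_relabeling(int n, unsigned seed) {
    std::vector<int> new_id(n);
    std::iota(new_id.begin(), new_id.end(), 0);     // identity labels 0..n-1
    std::mt19937 gen(seed);
    std::shuffle(new_id.begin(), new_id.end(), gen);
    return new_id;                                  // new_id[v] = permuted label of v
}

int main() {
    const int n = 10;
    std::vector<int> new_id = random_relabeling(n, 12345u);
    for (int v = 0; v < n; ++v)
        std::printf("vertex %d -> %d (owner %d of 2)\n", v, new_id[v], new_id[v] / (n / 2));
    return 0;
}
```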
Modeling BFS execution time for real-world graphs
• Can we further reduce communication time using existing partitioning methods?
• Does the model predict execution time for arbitrary low-diameter graphs?
• We try out various partitioning and graph distribution schemes on the DIMACS Challenge graph instances
  – Natural ordering, Random, Metis, PaToH
Experimental study
• The (weak) upper bound on the aggregate communicated data volume can be computed statically from the partitioning of the graph (see the sketch below)
• We determine estimates of
  – total aggregate communication volume
  – sum of the maximum communication volume over each BFS iteration
  – intra-node computational work balance
  – communication volume reduction with 2D partitioning
• We obtain and analyze execution times (at several different parallel concurrencies) on a Cray XE6 system (Hopper, NERSC)
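The static estimate mentioned in the first bullet boils down to counting cut edges: under a 1D distribution, every edge whose endpoints lie in different parts contributes to the all-to-all traffic. The sketch below does this counting for the seven-vertex example graph; it is an illustrative assumption of how such a count can be computed, not the exact code used in the study.

```cpp
#include <cstdio>
#include <utility>
#include <vector>

// Static estimate of 1D all-to-all traffic from a partition vector:
// each edge whose endpoints live in different parts contributes one unit of volume.
long long edge_cut(const std::vector<std::pair<int, int>>& edges,
                   const std::vector<int>& part) {
    long long cut = 0;
    for (auto [u, v] : edges)
        if (part[u] != part[v]) ++cut;
    return cut;
}

int main() {
    // The 7-vertex graph from the 1D BFS slides, block-partitioned into 4 parts.
    std::vector<std::pair<int, int>> edges =
        {{0, 1}, {0, 3}, {1, 4}, {1, 6}, {2, 3}, {2, 5}, {2, 6}, {3, 6}, {4, 5}, {4, 6}};
    std::vector<int> part = {0, 0, 1, 1, 2, 2, 3};   // part[v] = owner of vertex v
    std::printf("edge cut = %lld of %zu edges\n", edge_cut(edges, part), edges.size());
    return 0;
}
```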
Orderings for the CoPapersCiteseer graph
[Figure: adjacency-matrix spy plots of the CoPapersCiteseer graph under five orderings: Natural, Random, Metis, PaToH, and PaToH checkerboard]
BFS all-to-all phase: total communication volume normalized to the number of edges (m)
[Figure: bar chart; y-axis is communication volume as a percentage of m, grouped by graph name and number of partitions, for the Natural, Random, and PaToH orderings]
Ratio of the maximum communication volume across iterations to the total communication volume
[Figure: bar chart of this ratio, grouped by graph name and number of partitions, for the Natural, Random, and PaToH orderings]
Reduction in total all-to-all communication volume with 2D partitioning
[Figure: bar chart of the 2D communication volume as a ratio of the 1D volume, grouped by graph name and number of partitions, for the Natural, Random, and PaToH orderings]
Edge count balance with 2D partitioning
[Figure: bar chart of the max/average per-process edge count ratio, grouped by graph name and number of partitions, for the Natural, Random, and PaToH orderings]
Parallel speedup on Hopper with 16-way partitioning
[Figure: parallel speedup chart]
Execution time breakdown
[Figure: for kron-simple-logn18 and eu-2005, bar charts of BFS time (ms) split into Computation, Fold, and Expand phases, and of communication time (ms) split into Fold and Expand, for the Random-1D, Random-2D, Metis-1D, and PaToH-1D partitioning strategies]
Imbalance in parallel execution
[Figure: per-process execution timelines for eu-2005 with 16 processes (4 processes shown), comparing PaToH and Random partitionings]
The PaToH-partitioned graph suffers from severe load imbalance in the computational phases.
Conclusions
• Randomly permuting vertex identifiers improves computational and communication load balance, particularly at higher process concurrencies
• Partitioning methods reduce the overall communication volume, but introduce significant load imbalance
• Parallel speedup is substantially lower for real-world graphs than for synthetic graphs (8.8x vs. 50x at 256-way parallel concurrency)
  – This points to the need for dynamic load balancing
Thank you!
• Questions?
• Kamesh Madduri, madduri@cse.psu.edu
• Aydın Buluç, ABuluc@lbl.gov
• Acknowledgment of support