 
An Adaptive Parallel Algorithm for Computing Connectivity
Chirag Jain, Patrick Flick, Tony Pan, Oded Green, Srinivas Aluru
SIAM Workshop on Combinatorial Scientific Computing (CSC16)
October 10, 2016
Introduction Methods Experiments

Connected Components
• Finding connected components is at the heart of many graph applications.
• Sequentially, there are linear-time, O(|E|), solutions for a graph G(V,E):
  • Union-find
  • BFS / DFS
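As a point of reference for the sequential baseline, here is a minimal BFS-based sketch. The function name and the convention of labelling each component with its smallest vertex id are illustrative assumptions, not taken from the slides.

```python
from collections import deque


def connected_components(num_vertices, edges):
    """Label every vertex with the smallest vertex id in its component."""
    adjacency = [[] for _ in range(num_vertices)]
    for u, v in edges:
        adjacency[u].append(v)
        adjacency[v].append(u)

    label = [-1] * num_vertices  # -1 marks "not visited yet"
    for source in range(num_vertices):
        if label[source] != -1:
            continue
        # Sources are scanned in increasing order, so `source` is the
        # smallest id of the component it starts.
        label[source] = source
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if label[v] == -1:
                    label[v] = source
                    queue.append(v)
    return label


print(connected_components(6, [(0, 1), (1, 2), (3, 4)]))  # [0, 0, 0, 3, 3, 5]
```

Each vertex and each edge is touched a constant number of times, which is where the linear bound comes from.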
Scaling to Large Graphs
• Sizes of graph datasets continue to grow in multiple scientific domains.
• Bioinformatics: metagenomics, de Bruijn graphs
  • Sequencing machines generate ~10^9 DNA reads in 1 day.
  • Iowa Prairie (3.3B reads) - JGI
• Social networks, WWW
  • > 10^9 content uploads in 1 day
• We need a method that scales to graphs with billions or trillions of edges, irrespective of graph topology.
Background
A. Parallel connectivity algorithms
   1. Parallel BFS
      Buluç and Madduri, “Parallel breadth-first search …”, SC 11
      Beamer et al., “Distributed memory breadth-first search revisited …”, IPDPSW 13
   2. Shiloach-Vishkin PRAM algorithm (SV)
      Shiloach and Vishkin, “An O(log n) parallel connectivity algorithm”, 1982
B. Recent prior work
Background: Label Propagation and Shiloach-Vishkin
• Label propagation: O(|V|) iterations → O(|E| · |V|) work
• Shiloach-Vishkin: pointer jumping for faster convergence; O(log |V|) iterations → O(|E| log |V|) work
Shiloach and Vishkin, “An O(log n) parallel connectivity algorithm”, 1982
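The gap between the two iteration counts can be seen on a small example. The sketch below is only an illustration under simplifying assumptions. It runs synchronous min-label propagation on a path, and pointer jumping on a chain-shaped parent array. The hooking step of the full Shiloach-Vishkin algorithm is left out, and all names are hypothetical.

```python
def label_propagation_rounds(num_vertices, edges):
    """Synchronous min-label propagation.

    Returns the final labels and the number of rounds that changed a label.
    """
    label = list(range(num_vertices))
    rounds = 0
    while True:
        new_label = label[:]
        for u, v in edges:
            # Both endpoints adopt the smaller of the two *old* labels.
            smaller = min(label[u], label[v])
            new_label[u] = min(new_label[u], smaller)
            new_label[v] = min(new_label[v], smaller)
        if new_label == label:
            return label, rounds
        label = new_label
        rounds += 1


def pointer_jumping_rounds(parent):
    """Replace every parent by its grandparent until nothing changes."""
    parent = parent[:]
    rounds = 0
    while True:
        jumped = [parent[parent[i]] for i in range(len(parent))]
        if jumped == parent:
            return parent, rounds
        parent = jumped
        rounds += 1


n = 8
path_edges = [(i, i + 1) for i in range(n - 1)]  # path 0-1-...-7
chain_parent = [max(i - 1, 0) for i in range(n)]  # vertex i points to i-1

labels, lp_rounds = label_propagation_rounds(n, path_edges)
roots, pj_rounds = pointer_jumping_rounds(chain_parent)
print(lp_rounds, pj_rounds)  # 7 3
```

On the path, the smallest label advances one hop per round, while pointer jumping doubles the distance covered in every round.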
Background: Recent Prior Work
• Multistep algorithm: G(V,E) → 1 iteration of parallel BFS → parallel label propagation
• Part of popular graph analysis frameworks: GraphX, PowerLyra, PowerGraph
Slota et al., “A Case Study of Complex Graph Analysis …”, IPDPS 2016
Slota et al., “BFS and coloring-based parallel …”, IPDPS 2014
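A sequential stand-in for the Multistep idea (one BFS, then label propagation on whatever the BFS did not reach) might look as follows. Starting the BFS at the highest-degree vertex is an assumption made for this sketch, and the function name is hypothetical.

```python
from collections import deque


def multistep_components(num_vertices, edges):
    """One BFS from a high-degree vertex, then label propagation on the rest."""
    adjacency = [[] for _ in range(num_vertices)]
    for u, v in edges:
        adjacency[u].append(v)
        adjacency[v].append(u)

    label = list(range(num_vertices))
    if num_vertices == 0:
        return label

    # Step 1: a single BFS; assumed root choice is the highest-degree vertex.
    root = max(range(num_vertices), key=lambda v: len(adjacency[v]))
    visited = [False] * num_vertices
    visited[root] = True
    queue = deque([root])
    while queue:
        u = queue.popleft()
        label[u] = root
        for v in adjacency[u]:
            if not visited[v]:
                visited[v] = True
                queue.append(v)

    # Step 2: min-label propagation restricted to unvisited vertices.
    changed = True
    while changed:
        changed = False
        for u, v in edges:
            if visited[u]:
                continue  # v is in the same component, so it is visited too
            smaller = min(label[u], label[v])
            if label[u] != smaller or label[v] != smaller:
                label[u] = label[v] = smaller
                changed = True
    return label


edges = [(3, 0), (3, 1), (3, 2), (4, 5), (5, 6)]
print(multistep_components(8, edges))  # [3, 3, 3, 3, 4, 4, 4, 7]
```

The BFS component carries the root's id as its label, while the remaining components end up labelled by their smallest vertex id.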
Contributions
1. Novel edge-based adaptation of the Shiloach-Vishkin algorithm for distributed-memory parallel systems (Parallel SV).
2. Fast heuristic to guide algorithm selection at run-time (Parallel BFS or Parallel SV for G(V,E)).
Flick et al., “A parallel connectivity algorithm …”, SC 15
Parallel SV algorithm
• Initialization
  • We work with an array of tuples (call it A) to keep the partition id of each vertex.
  • O(|V|) partitions at the beginning
  • Size of A: O(|V| + |E|)
[Figure: vertex u with neighbors v1 and v2; the tuples of A drawn as two layers, "Current partition id" and "Vertex ids"]
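One way to read the initialization is sketched below. The two-field tuple layout (partition id, vertex id) follows the two layers named on the slide. The choice of one tuple per vertex plus two per edge is an assumption, picked so that there are |V| partitions at the start and A has |V| + 2|E| entries.

```python
def init_tuples(num_vertices, edges):
    """Build A as a list of (partition_id, vertex_id) pairs (assumed layout)."""
    # Every vertex starts in a partition named after itself.
    tuples = [(v, v) for v in range(num_vertices)]
    for u, v in edges:
        tuples.append((u, v))  # vertex v is also a member of partition u
        tuples.append((v, u))  # vertex u is also a member of partition v
    return tuples


# Vertex 0 plays the role of u; vertices 1 and 2 play v1 and v2.
A = init_tuples(3, [(0, 1), (0, 2)])
print(len(A))  # 7
```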
Parallel SV algorithm
• Which partition ids is vertex u a member of?
  • Sort A by the "vertex id" layer.
• Which vertices are members of a given partition?
  • Sort A by the "partition id" layer.
[Figure: tuples for vertices u, v, w drawn as two layers, "Current partition id" and "Vertex ids"]
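The two sorts can be put together into a small sequential sketch of one iteration. This is a reading of the slides, not the authors' implementation. The tuple layout, the deduplication step, and all function names are assumptions. Pointer jumping, load balancing, and early detection of converged components are left out.

```python
from itertools import groupby
from operator import itemgetter


def init_tuples(num_vertices, edges):
    """(partition_id, vertex_id) pairs: one per vertex, two per edge."""
    tuples = [(v, v) for v in range(num_vertices)]
    for u, v in edges:
        tuples.extend([(u, v), (v, u)])
    return sorted(set(tuples))


def sv_iteration(tuples):
    # Pass 1: sort by vertex id. For each vertex, find the smallest
    # partition id it currently belongs to (its candidate).
    with_candidate = []
    by_vertex = sorted(tuples, key=itemgetter(1))
    for _, bucket in groupby(by_vertex, key=itemgetter(1)):
        bucket = list(bucket)
        smallest = min(p for p, _ in bucket)
        with_candidate.extend((p, v, smallest) for p, v in bucket)

    # Pass 2: sort by partition id. Each partition adopts the smallest
    # candidate reported by any of its member vertices.
    updated = []
    by_partition = sorted(with_candidate, key=itemgetter(0))
    for _, bucket in groupby(by_partition, key=itemgetter(0)):
        bucket = list(bucket)
        new_id = min(c for _, _, c in bucket)
        updated.extend((new_id, v) for _, v, _ in bucket)

    # Dropping duplicates keeps the sketch small (an added simplification).
    return sorted(set(updated))


def connected_components_sv(num_vertices, edges):
    tuples = init_tuples(num_vertices, edges)
    while True:
        next_tuples = sv_iteration(tuples)
        if next_tuples == tuples:
            break
        tuples = next_tuples
    labels = [None] * num_vertices
    for partition, vertex in tuples:
        labels[vertex] = partition
    return labels


print(connected_components_sv(6, [(0, 1), (1, 2), (2, 3), (4, 5)]))
# [0, 0, 0, 0, 4, 4]
```

In a distributed setting both `sorted` calls would be replaced by a parallel sort, and the per-bucket minimum by a reduction over each bucket.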
Parallel SV algorithm
• In our implementation, we use parallel sample sort.
• Custom reduction operations efficiently compute the minimums.
• Additional details (see our preprint):
  • Pointer jumping
  • Detecting convergence of small components early, load balancing
• Runtime: (expression not preserved)
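For readers unfamiliar with sample sort, the sketch below simulates its phases in a single process, with each inner list standing in for the data held by one MPI rank. The evenly spaced sampling rule is one common choice and an assumption here. The slides do not describe the sampling strategy that was used.

```python
import bisect


def sample_sort(chunks):
    """Simulate sample sort; chunks[i] is the data of 'process' i."""
    p = len(chunks)

    # 1. Every process sorts its local data.
    local = [sorted(chunk) for chunk in chunks]

    # 2. Every process contributes up to p - 1 evenly spaced samples.
    samples = []
    for chunk in local:
        if chunk:
            step = max(len(chunk) // p, 1)
            samples.extend(chunk[step::step][:p - 1])
    samples.sort()

    # 3. Choose p - 1 splitters from the gathered samples.
    splitters = []
    if samples:
        splitters = [samples[(i * len(samples)) // p] for i in range(1, p)]

    # 4. Route each element to its destination bucket
    #    (this loop stands in for the all-to-all exchange).
    buckets = [[] for _ in range(p)]
    for chunk in local:
        for x in chunk:
            buckets[bisect.bisect_right(splitters, x)].append(x)

    # 5. Every process sorts what it received.
    return [sorted(bucket) for bucket in buckets]


result = sample_sort([[9, 3, 7], [1, 8, 2], [6, 5, 4, 0]])
flattened = [x for bucket in result for x in bucket]
print(flattened == sorted(flattened))  # True
```

Because the buckets are delimited by the splitters, concatenating them in process order gives the globally sorted sequence.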
Dynamic hybrid method
• Parallel BFS is close to work-efficient for a giant component of a small-world graph.
• Efficiency is lost when there is:
  • a large number of small components
  • a large-diameter graph component
• How to decide which algorithm to choose at runtime?
Dynamic hybrid method
1. Compute the degree distribution of the input graph.
2. Does the curve fit a power-law distribution?
   • Yes: run 1 BFS iteration, then run Parallel-SV on the remaining graph.
   • No: run Parallel-SV directly (the remaining graph is the whole input).
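The decision step can be sketched as follows. The slides do not say which goodness-of-fit test is used, so the log-log least-squares fit, the thresholds, and the function names are stand-in assumptions chosen only to make the control flow concrete.

```python
import math
from collections import Counter


def looks_power_law(degrees, min_distinct=3, r2_threshold=0.9):
    """Crude check: is log(count) close to a decreasing line in log(degree)?"""
    histogram = Counter(d for d in degrees if d > 0)
    if len(histogram) < min_distinct:
        return False  # too few distinct degrees to fit a curve

    xs = [math.log(d) for d in histogram]
    ys = [math.log(c) for c in histogram.values()]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if syy == 0:
        return False  # flat histogram, no decay at all

    slope = sxy / sxx
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope < 0 and r_squared >= r2_threshold


def choose_plan(degrees):
    """Mirror the flowchart: optional BFS iteration, then Parallel-SV."""
    if looks_power_law(degrees):
        return ["bfs_iteration", "parallel_sv"]
    return ["parallel_sv"]


# Hypothetical degree sequences for illustration only.
heavy_tailed = [k for k in range(1, 11) for _ in range(int(1000 / k**2))]
grid_like = [2] * 4 + [3] * 32 + [4] * 64

print(choose_plan(heavy_tailed))  # ['bfs_iteration', 'parallel_sv']
print(choose_plan(grid_like))     # ['parallel_sv']
```

A production heuristic would more likely use a proper statistical test, but the branch structure stays the same.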
Experimental Setup
• Software: C++14, MPI, CombBLAS library for parallel BFS
• Hardware: Cray XC30 (Edison) at Lawrence Berkeley National Laboratory
  • 5,576 nodes, each with 2 x 12-core Intel Ivy Bridge processors and 64 GB RAM
  • 1 MPI process per physical core
• Timing:
  • Excludes graph construction and I/O time
  • Profiling starts once the block-distributed list of edges is in memory
Buluç and Gilbert, “The Combinatorial BLAS: Design …”, IJHPCA 2011
Datasets
[Table of datasets not preserved; its rows are annotated with three categories:]
• Large number of components
• Small world graphs
• Large diameter graph
Dynamic Approach
Timings against the opposite choice, using 2K cores
[Figure: bar chart of time (sec) for datasets M1, M2, M3, G1, G2, G3, K1, K2, comparing Dynamic with Static (Opp. Choice); a "Run BFS?" row sits under the dataset axis; ratio annotations on the bars: 4.0x, 0.9x, 1.2x, 4.7x, 3.7x, 3.6x, 4.1x, 1.2x]
Dynamic Approach
Proportion of time spent in prediction (using 2K cores)
[Figure: bar chart of time (sec) for datasets M1, M2, M3, G1, G2, G3, K1, K2, Dynamic and Static (Opp. Choice), marking the proportion of time spent in prediction]
Strong Scalability
• Maximum speedup of ~8x using 4096 cores (ideal: 16x)
• A sorting benchmark with 2B integers achieves an 8.06x speedup as well.
[Figure, first panel: time (sec) vs. number of cores (log scale, 256 to 4096); timings for the largest graph M4]
[Figure, second panel: speedup vs. number of cores (log scale, 256 to 4096) for datasets G1, G2, G3, K1, M1, M2]
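The relation between the measured and ideal speedup is simple arithmetic. The 256-core baseline is inferred from the axis range and from the 16x ideal (4096 / 256). The timings in the example are hypothetical values picked only to reproduce an 8x speedup.

```python
def speedup_and_efficiency(base_cores, base_time, cores, time):
    """Return (speedup, ideal speedup, parallel efficiency) vs. a baseline run."""
    speedup = base_time / time
    ideal = cores / base_cores
    return speedup, ideal, speedup / ideal


# Hypothetical timings: 240 s on 256 cores, 30 s on 4096 cores.
speedup, ideal, efficiency = speedup_and_efficiency(256, 240.0, 4096, 30.0)
print(speedup, ideal, efficiency)  # 8.0 16.0 0.5
```

An 8x speedup against a 16x ideal corresponds to 50% parallel efficiency. The sorting benchmark reaches about the same figure (8.06x), which suggests that sorting sets the scaling limit.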
v/s Multistep method
[Figure: bar chart of time (sec) per dataset, Our method vs. Multistep; ratio annotations on the bars: 24x, 2.1x, 1.1x, 2.7x, 1.1x, 1.9x, 0.9x, 1.1x]
Dataset   M1  M2  M3  G1  G2  G3   K1  K2
Diameter  4K  4K  2K  16  17  25K  9   9
v/s Best sequential method
• Performance comparison against Rem’s algorithm (based on union-find)
• Using small graphs that fit in a single node (64 GB RAM)
E. W. Dijkstra, A Discipline of Programming, 1976
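For context, Rem's algorithm as it is commonly described in the union-find literature interleaves the two find operations and splices paths as it climbs. The sketch below follows that description. It is not necessarily the exact variant benchmarked here, and the helper names are hypothetical.

```python
def rem_union(parent, x, y):
    """Rem's union with splicing; returns True if two sets were merged.

    Invariant: parent[i] >= i, so every root is the largest id on its path.
    """
    rx, ry = x, y
    while parent[rx] != parent[ry]:
        if parent[rx] < parent[ry]:
            if rx == parent[rx]:
                # rx is a root with the smaller id: hook it under ry's parent.
                parent[rx] = parent[ry]
                return True
            # Splicing: redirect rx to the larger parent, then climb.
            next_rx = parent[rx]
            parent[rx] = parent[ry]
            rx = next_rx
        else:
            if ry == parent[ry]:
                parent[ry] = parent[rx]
                return True
            next_ry = parent[ry]
            parent[ry] = parent[rx]
            ry = next_ry
    return False  # both vertices were already in the same set


def count_components(num_vertices, edges):
    parent = list(range(num_vertices))
    merges = sum(rem_union(parent, u, v) for u, v in edges)
    return num_vertices - merges


print(count_components(6, [(0, 1), (1, 2), (3, 4), (0, 2)]))  # 3
```

Every successful union reduces the number of sets by one, so the component count is the vertex count minus the number of merges.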
Conclusions
1. An efficient distributed-memory parallel connectivity algorithm based on the Shiloach-Vishkin approach.
2. A proposed heuristic guides algorithm selection at runtime.
3. Efficient as well as generic; scales on a variety of large graphs.
4. Significant performance gains over the previous state of the art, particularly for large-diameter graphs.
Thank you!
arxiv.org/abs/1607.06156
cjain@gatech.edu
github.com/ParBLiSS/parconnect
Reproducibility Initiative Award