  1. Parallel Combinatorial BLAS and Applications in Graph Computations — Aydın Buluç and John R. Gilbert, University of California, Santa Barbara. Adapted from talks at SIAM conferences.

  2. Primitives for Graph Computations • By analogy to numerical linear algebra, what would the combinatorial BLAS look like? • BLAS 3: n-by-n matrix-matrix multiply • BLAS 2: n-by-n matrix-vector multiply • BLAS 1: sum of scaled n-vectors [Figure: performance relative to machine peak for the three BLAS levels]
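
For readers who do not have the BLAS levels at their fingertips, here is a tiny NumPy illustration of the three dense operations the analogy refers to (not part of the talk; names are ours):

```python
import numpy as np

# The three dense BLAS levels the slide alludes to (NumPy stand-ins).
n = 1000
A, B = np.random.rand(n, n), np.random.rand(n, n)
x, y = np.random.rand(n), np.random.rand(n)

z = 2.0 * x + y   # BLAS 1: sum of scaled n-vectors (axpy), O(n) work
w = A @ x         # BLAS 2: n-by-n matrix-vector multiply, O(n^2) work
C = A @ B         # BLAS 3: n-by-n matrix-matrix multiply, O(n^3) work
```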

  3. Real-World Graphs Properties: • Huge (billions of vertices/edges) • Very sparse (typically m = O(n)) • Scale-free [maybe] • Community structure [maybe] Examples: • World-wide web • Science citation graphs • Online social networks

  4. What Kinds of Computations? • Some are inherently latency-bound → S-T connectivity • Many graph mining algorithms are computationally intensive → graph clustering, centrality computations • Huge graphs + expensive kernels ⇒ massive parallelism • Very sparse graphs ⇒ sparse data structures (matrices)

  5. The Case for Sparse Matrices • Many irregular applications contain sufficient coarse-grained parallelism that can ONLY be exploited using abstractions at the proper level.
     Traditional graph computations: data driven, unpredictable communication patterns; irregular and unstructured, poor locality of reference; fine-grained data accesses, dominated by latency.
     Graphs in the language of linear algebra: fixed communication patterns, overlapping opportunities; operations on matrix blocks, exploits memory hierarchy; coarse-grained parallelism, bandwidth limited.

  6. The Case for Primitives • It takes a “certain” level of expertise to get any kind of performance in this jungle of parallel computing • I think you’ll agree with me by the end of the talk :) [Figure: cartoon contrasting “I can just implement it (w/ enough coffee)” and “What’s bandwidth anyway?” with “The right primitive!” — a 480× difference for all-pairs shortest paths on the GPU]

  7. Identification of Primitives ‣ Sparse matrix-matrix multiplication (SpGEMM): most general and challenging parallel primitive ‣ Sparse matrix-vector multiplication (SpMV) ‣ Sparse matrix-transpose-vector multiplication (SpMV_T): equivalently, multiplication from the left ‣ Addition and other point-wise operations (SpAdd): included in SpGEMM, “proudly” parallel ‣ Indexing and assignment (SpRef, SpAsgn): A(I,J) where I and J are arrays of indices; reduces to SpGEMM ‣ Matrices on semirings, e.g. (×, +), (and, or), (+, min)
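
A minimal sketch of these primitives using SciPy sparse matrices (illustrative only; the Combinatorial BLAS itself is a distributed-memory C++ library, and the variable names here are ours):

```python
import numpy as np
import scipy.sparse as sp

# Small directed graph as a sparse adjacency matrix (nonzero = edge weight).
A = sp.csr_matrix(np.array([[0, 1, 0, 0],
                            [0, 0, 1, 1],
                            [1, 0, 0, 0],
                            [0, 1, 0, 0]], dtype=float))
x = np.array([1.0, 0.0, 0.0, 0.0])

C  = A @ A              # SpGEMM: sparse matrix-matrix multiplication
y  = A @ x              # SpMV:   sparse matrix-vector multiplication
yt = A.T @ x            # SpMV_T: multiply by the transpose (multiplication from the left)
S  = A + A.T            # SpAdd:  point-wise addition
I, J = [0, 2], [1, 3]
B  = A[I, :][:, J]      # SpRef:  A(I, J) with index arrays I and J
```

SciPy fixes the scalar (×, +) algebra; swapping in other semirings such as (+, min), as the last bullet suggests, is one reason the actual library templates its operations over semirings.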

  8. Why focus on SpGEMM? • Graph clustering (Markov, peer pressure) • Shortest path calculations • Betweenness centrality • Subgraph / submatrix indexing • Graph contraction • Cycle detection • Multigrid interpolation & restriction • Colored intersection searching • Applying constraints in finite element computations • Context-free parsing • ...
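
One entry in this list, graph contraction, shows why SpGEMM keeps recurring: with a sparse assignment matrix S mapping each vertex to a supervertex, the contracted adjacency matrix is S^T A S, i.e. two SpGEMMs. A small sketch (our own example, not from the talk):

```python
import numpy as np
import scipy.sparse as sp

# 4-vertex graph; vertices {0, 1} collapse into supervertex 0, {2, 3} into supervertex 1.
A = sp.csr_matrix(np.array([[0, 1, 1, 0],
                            [1, 0, 0, 1],
                            [1, 0, 0, 1],
                            [0, 1, 1, 0]], dtype=float))
rows = np.arange(4)                      # vertex ids
cols = np.array([0, 0, 1, 1])            # supervertex of each vertex
S = sp.csr_matrix((np.ones(4), (rows, cols)), shape=(4, 2))

A_contracted = S.T @ A @ S               # entry (i, j) sums edge weights between supervertices
print(A_contracted.toarray())            # [[2. 2.] [2. 2.]]
```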

  9. Comparative Speedup of Sparse 1D & 2D • In practice, 2D algorithms have the potential to scale, if implemented correctly • Overlapping communication and maintaining load balance are crucial

  10. 2-D example: Sparse SUMMA • C_ij += A_ik * B_kj • Based on dense SUMMA • Generalizes to nonsquare matrices, etc. [Figure: block A_ik broadcast along processor row i and block B_kj along processor column j, updating block C_ij]
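
A serial sketch of the Sparse SUMMA stages on a grid × grid block decomposition. No actual communication happens here: where the real algorithm broadcasts A_ik along processor row i and B_kj along processor column j at stage k, this sketch simply indexes into the global matrices (the function and helper names are ours):

```python
import numpy as np
import scipy.sparse as sp

def sparse_summa(A, B, grid=2):
    """Serial sketch of Sparse SUMMA on a grid x grid block decomposition.
    Assumes square matrices whose dimension is divisible by `grid`."""
    n = A.shape[0]
    b = n // grid                                  # block size
    blk = lambda M, i, j: M[i*b:(i+1)*b, j*b:(j+1)*b]
    C = [[sp.csr_matrix((b, b)) for _ in range(grid)] for _ in range(grid)]
    for k in range(grid):                          # one "broadcast" stage per k
        for i in range(grid):
            for j in range(grid):
                C[i][j] = C[i][j] + blk(A, i, k) @ blk(B, k, j)   # C_ij += A_ik * B_kj
    return sp.vstack([sp.hstack(row) for row in C]).tocsr()

A = sp.random(8, 8, density=0.3, format="csr", random_state=0)
B = sp.random(8, 8, density=0.3, format="csr", random_state=1)
assert abs(sparse_summa(A, B) - A @ B).sum() < 1e-12   # matches the ordinary product
```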

  11. Sequential Kernel • Standard algorithm is O(nnz + flops + n) • Strictly O(nnz) data structure • Outer-product formulation • Work-efficient [Figure: relative magnitudes of n, nnz, and flops]
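
A minimal sketch of the outer-product formulation: C = Σ_k A(:,k) · B(k,:), one rank-1 update per column of A / row of B. Plain Python dictionaries stand in for the actual O(nnz) data structure, so this only illustrates the access pattern, not the performance:

```python
from collections import defaultdict
import scipy.sparse as sp

def spgemm_outer(A, B):
    """Outer-product SpGEMM sketch: C = sum_k A(:,k) outer B(k,:)."""
    A = A.tocsc()                       # fast access to columns of A
    B = B.tocsr()                       # fast access to rows of B
    acc = defaultdict(float)            # (i, j) -> accumulated value
    for k in range(A.shape[1]):
        a_rows = A.indices[A.indptr[k]:A.indptr[k+1]]
        a_vals = A.data[A.indptr[k]:A.indptr[k+1]]
        b_cols = B.indices[B.indptr[k]:B.indptr[k+1]]
        b_vals = B.data[B.indptr[k]:B.indptr[k+1]]
        for i, av in zip(a_rows, a_vals):           # one rank-1 update per k
            for j, bv in zip(b_cols, b_vals):
                acc[(i, j)] += av * bv
    rows, cols = zip(*acc) if acc else ((), ())
    return sp.csr_matrix((list(acc.values()), (rows, cols)),
                         shape=(A.shape[0], B.shape[1]))
```

For any pair of conforming SciPy sparse matrices, `spgemm_outer(A, B).toarray()` should match `(A @ B).toarray()`.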

  12. Node Level Considerations • Submatrices are hypersparse (i.e. nnz << n) • On a √p × √p grid of blocks, each block has an average of c/√p nonzeros per column • Total storage with per-block CSC is O(n√p + nnz) • A data structure or algorithm that depends on the matrix dimension n (e.g. CSR or CSC) is asymptotically too wasteful for submatrices
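
A back-of-the-envelope illustration of why per-block CSC/CSR is wasteful for hypersparse blocks (the numbers below are hypothetical, not from the talk's experiments):

```python
# Per-block CSC keeps n/sqrt(p) column pointers in every one of the p blocks,
# so the pointer arrays alone cost ~ n * sqrt(p) words, independent of nnz.
n, nnz, p = 10**6, 8 * 10**6, 4096      # hypothetical: n vertices, nnz edges, p processors
csc_pointer_words = n * int(p ** 0.5)   # ~ n * sqrt(p) column pointers in total
nnz_words = 2 * nnz                     # row indices + values: the O(nnz) part
print(csc_pointer_words, nnz_words)     # 64,000,000 vs 16,000,000 -- pointers dominate
```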

  13. Addressing the Load Balance • RMat: model for graphs with high variance in degree • Random permutations are useful, but bulk synchronous algorithms may still suffer • Asynchronous algorithms have no notion of stages • Overall, no significant imbalance
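
For concreteness, a symmetric random permutation (relabeling vertices to spread heavy rows and columns across the grid) is essentially a one-liner on a sparse adjacency matrix; a sketch with SciPy (our own stand-in matrix):

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
A = sp.random(8, 8, density=0.3, format="csr", random_state=0)   # stand-in adjacency matrix

perm = rng.permutation(A.shape[0])                 # random vertex relabeling
P = sp.identity(A.shape[0], format="csr")[perm]    # permutation matrix
A_perm = P @ A @ P.T                               # same graph, randomized labels
```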

  14. Asynchronous Implementation • Two-dimensional block layout: a Sparse2D<I,N> holding DCSC<I,N> blocks, strictly O(nnz) storage • Remote get using MPI-2 (passive target) remote-memory access • Avoids hot spots • With very high probability, a block is accessed by at most a single remote get operation at any given time
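
The talk does not show code for this, but the "passive target remote get" idea can be sketched with mpi4py one-sided communication (the buffer here is just a stand-in for a stored block; run with mpiexec):

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Each rank exposes a local buffer (standing in for one stored block) in an RMA window.
local = np.full(4, float(rank))
win = MPI.Win.Create(local, comm=comm)

# Passive-target get: the owner of the data does not participate in the transfer.
target = (rank + 1) % size
buf = np.empty(4)
win.Lock(target, MPI.LOCK_SHARED)
win.Get(buf, target)
win.Unlock(target)

win.Free()
print(rank, "fetched block of rank", target, ":", buf)
```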

  15. Scaling Results for SpGEMM • Asynchronous implementation, one-sided MPI-2 • Runs on TACC’s Lonestar cluster • Dual-core dual-socket Intel Xeon 2.66 GHz • RMat × RMat product, average degree (nnz/n) ≈ 8

  16. Applications and Algorithms • Betweenness Centrality C_B(v): among all the shortest paths, what fraction of them pass through the node of interest? • Computed with Brandes’ algorithm [Figure: a typical software stack for an application enabled with the Combinatorial BLAS]
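
For reference, the standard definition behind C_B(v) (not spelled out on the slide):

```latex
% sigma_{st}: number of shortest s-t paths; sigma_{st}(v): those passing through v
C_B(v) \;=\; \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}}
```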

  17. Betweenness Centrality using Sparse Matrices [Robinson, Kepner] • Adjacency matrix: sparse array with nonzeros for graph edges • Storage-efficient implementation from sparse data structures • Betweenness Centrality Algorithm: 1. Pick a starting vertex, v 2. Compute shortest paths from v to all other nodes 3. Starting with most distant nodes, roll back and tally paths [Figure: A^T and starting vector x for a 7-vertex example graph]

  18. Betweenness Centrality using BFS • Frontier update: x ← (A^T x) .* ¬x • Every iteration, another level of the BFS is discovered • Sparsity is preserved, but sparse matrix times sparse vector has very little potential parallelism (only o(nnz) work per step) • The tally vector accumulates the discovered frontiers as x̃ += x at each step (t1, t2, t3, t4 in the figure) [Figure: BFS levels on the 7-vertex example]
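
A sketch of one BFS expressed with that update rule, using SciPy (dense vectors stand in for the sparse vectors of the talk; the function and names are ours):

```python
import numpy as np
import scipy.sparse as sp

def bfs_levels(A, source):
    """Level-synchronous BFS as repeated SpMV on the adjacency matrix.
    Frontier update follows the slide's rule: x <- (A^T x) .* (not visited)."""
    n = A.shape[0]
    visited = np.zeros(n, dtype=bool)
    x = np.zeros(n); x[source] = 1.0          # current frontier
    level = np.full(n, -1); level[source] = 0
    d = 0
    while x.any():
        visited |= x.astype(bool)
        x = (A.T @ x) * (~visited)            # one SpMV per BFS level, mask out visited vertices
        d += 1
        level[(x != 0) & (level < 0)] = d
    return level

# Directed graph: edges 0->1, 0->2, 1->2 (A[i, j] = edge i -> j)
A = sp.csr_matrix(np.array([[0, 1, 1],
                            [0, 0, 1],
                            [0, 0, 0]], dtype=float))
print(bfs_levels(A, 0))   # [0 1 1]
```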

  19. Parallelism: Multiple-source BFS • Frontier update: X ← (A^T X) .* ¬X • Batch processing of multiple source vertices • Sparse matrix-matrix multiplication ⇒ work efficient • Potential parallelism is much higher • Same applies to the tallying phase [Figure: multi-source BFS on the 7-vertex example]
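
The same sketch batched over several sources, so each step is a sparse matrix-matrix product rather than a matrix-vector product (SciPy again; the dense visited mask is purely for brevity):

```python
import numpy as np
import scipy.sparse as sp

def multi_source_bfs_levels(A, sources):
    """Batched BFS sketch: one frontier column per source, advanced by SpGEMM.
    Mirrors the slide's update X <- (A^T X) .* (not visited), columnwise."""
    n, k = A.shape[0], len(sources)
    X = sp.csr_matrix((np.ones(k), (sources, np.arange(k))), shape=(n, k))
    visited = X.toarray().astype(bool)
    levels = np.full((n, k), -1)
    levels[sources, np.arange(k)] = 0
    d = 0
    while X.nnz:
        d += 1
        X = sp.csr_matrix((A.T @ X).multiply(~visited))   # SpGEMM, then mask visited entries
        X.eliminate_zeros()
        newly = X.toarray() != 0
        levels[newly & (levels < 0)] = d
        visited |= newly
    return levels

A = sp.csr_matrix(np.array([[0, 1, 1],
                            [0, 0, 1],
                            [0, 0, 0]], dtype=float))
print(multi_source_bfs_levels(A, [0, 1]))   # columns: BFS levels from vertex 0 and vertex 1
```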

  20. Betweenness Centrality on Combinatorial BLAS • BC performance measured in TEPS (Traversed Edges per Second) • RMAT scale N has 2^N vertices and 8·2^N edges • Batch processing greatly helps for large p • Likely to perform better on large inputs • Code only a few lines longer than the Matlab version [Figures: TEPS vs. number of processors (4 to 256) for scale-16 RMAT input with batch sizes 256 and 512]

  21. Betweenness Centrality on Combinatorial BLAS • Fundamental trade-off: parallelism vs. memory usage [Figure: TEPS vs. number of processors (64 to 256) for scale-16 and scale-17 RMAT inputs with batch sizes 256 and 512]

  22. Thank You! Questions?
