

  1. Towards a GraphBLAS Library in Chapel
Ariful Azad & Aydin Buluç, Lawrence Berkeley National Laboratory (LBNL)
CHIUW, IPDPS 2017

  2. Overview
High-level research objective:
– Enable productive and high-performance graph analytics
– We used GraphBLAS and Chapel to achieve this goal
GraphBLAS: building blocks for graph algorithms in the language of sparse linear algebra.
Chapel: an emerging parallel language designed for productive parallel computing at scale.
Both promise: productivity + performance.
Scope of this paper: a GraphBLAS library in Chapel.

  3. Outline
1. Overview of GraphBLAS primitives
2. Implementation of a subset of GraphBLAS primitives in Chapel, with experimental results
Warning: this is just an early evaluation, as Chapel's sparse matrix support is actively under development. All experiments were conducted on Chapel 1.13.1. The performance numbers are expected to improve significantly in future releases of Chapel.

  4. Part 1. GraphBLAS overview

  5. GraphBLAS analogy: a ready-to-assemble furniture shop (Ikea)
Building blocks → Objects (Algorithms) → Final product (Applications)

  6. Graph algorithm building blocks
GraphBLAS ( http://graphblas.org )
– Standard building blocks for graph algorithms in the language of sparse linear algebra
– Inspired by the Basic Linear Algebra Subprograms (BLAS)
– Participants from industry, academia, and national labs
– The C API is available on the website (Design of the GraphBLAS API for C, A. Buluç, T. Mattson, S. McMillan, J. Moreira, C. Yang, IPDPS Workshops 2017)

  7. GraphBLAS as algorithm building blocks
Employs graph-matrix duality
– Graphs => sparse matrices
– A subset of vertices/edges => sparse/dense vectors
Benefits
– Standard set of operations
– Learn from the rich history of numerical linear algebra
– Offers structured and regular memory accesses and communication (as opposed to the irregular memory accesses of traditional graph algorithms)
– Opportunity for communication-avoiding algorithms

  8. Some GraphBLAS basic primitives (function: parameters; returns; Matlab notation)
– MxM (SpGEMM): sparse matrices A and B, optional unary functions; returns a sparse matrix; C = A * B
– MxV (SpM{Sp}V): sparse matrix A, sparse/dense vector x; returns a sparse/dense vector; y = A * x
– EwiseMult, Add, ... (SpEWiseX): sparse matrices or vectors, binary function, optional unary functions; in place or a sparse matrix/vector; C = A .* B, C = A + B
– Reduce (Reduce): sparse matrix A and a function; returns a dense vector; y = sum(A, op)
– Extract (SpRef): sparse matrix A, index vectors p and q; returns a sparse matrix; B = A(p, q)
– Assign (SpAsgn): sparse matrices A and B, index vectors p and q; returns none; A(p, q) = B
– BuildMatrix (Sparse): list of edges/triples (i, j, v); returns a sparse matrix; A = sparse(i, j, v, m, n)
– ExtractTuples (Find): sparse matrix A; returns an edge list; [i, j, v] = find(A)
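A couple of these primitives already map onto short Chapel expressions once a sparse array is in hand. The sketch below is purely illustrative (the names spA and A are assumed, not part of the paper's library): it shows a Reduce-style sum over the stored nonzeros and an ExtractTuples-style walk over the stored indices.

    config const n = 4;
    const Dense = {0..n-1, 0..n-1};
    var spA: sparse subdomain(Dense);
    var A: [spA] real;
    spA += (0, 1);  A[0, 1] = 2.0;   // a couple of illustrative nonzeros
    spA += (3, 2);  A[3, 2] = 5.0;

    // Reduce: sum over all stored nonzeros (Matlab: sum(A(:)))
    const total = + reduce A;
    writeln("total = ", total);

    // ExtractTuples / find(A): walk the stored (i, j) indices with their values
    for (i, j) in spA do
      writeln(i, " ", j, " ", A[i, j]);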

  9. General-purpose operations via semirings (overloading the addition and multiplication operations)
– Real field (R, +, ×): classical numerical linear algebra
– Boolean algebra ({0, 1}, |, &): graph traversal
– Tropical semiring (R ∪ {∞}, min, +): shortest paths
– (S, select, select): select a subgraph, or contract nodes to form a quotient graph
– (edge/vertex attributes, vertex data aggregation, edge data processing): schema for user-specified computation at vertices and edges
– (R, max, +): graph matching and network alignment
– (R, min, ×): maximal independent set
Shortened semiring notation: (Set, Add, Multiply); both identities omitted.
– Add traverses edges; Multiply combines edges/paths at a vertex.
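As a concrete (and hedged) illustration of overloading (add, multiply), here is a minimal Chapel sketch of one step of the tropical-semiring matrix-vector product used for shortest paths. It is not the paper's code; dense storage and the names minPlusMatVec, A, and x are assumed purely to keep the example short.

    config const n = 4;
    const inf = max(real);                 // identity of min is +infinity

    proc minPlusMatVec(A: [?DA] real, x: [?Dx] real) {
      var y: [Dx] real = inf;
      forall i in Dx do
        for j in Dx do
          // semiring "multiply" is +, semiring "add" is min:
          // y[i] = min over j of (A[i, j] + x[j])
          y[i] = min(y[i], A[i, j] + x[j]);
      return y;
    }

Iterating y = minPlusMatVec(A, y) relaxes all paths by one more edge per step, which is exactly the Bellman-Ford view of shortest paths.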

  10. Example: exploring the next-level vertices via SpMSpV
Overload (multiply, add) with (select2nd, min).
[Figure: the adjacency matrix multiplied by the sparse current-frontier vector x; the nonzeros of the product mark the next frontier.]
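To make the (select2nd, min) overload concrete, below is a hedged sketch of one BFS step, not the paper's implementation. It assumes a dense boolean adjacency matrix A (A[i, j] is true when there is an edge from j to i, matching the y = A x convention), a parent array initialized to -1, and the current frontier stored as a sparse set of vertex ids.

    proc bfsStep(A: [?D] bool, frontier: domain(int), ref parent: [?Dv] int) {
      var next: domain(int);                 // sparse set of next-level vertices
      for j in frontier do                   // serial for clarity
        for i in Dv do
          if A[i, j] && parent[i] == -1 {
            parent[i] = j;                   // "multiply" = select2nd: record the parent id
            next += i;                       // "add": first discovery of i wins
          }
      return next;
    }

A full BFS would seed frontier with the root, set parent[root] to the root, and call bfsStep until the returned frontier is empty.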

  11. Algorithmic coverage
Higher-level combinatorial and machine learning algorithms:
– Miscellaneous: connectivity, traversal (BFS), independent sets (MIS), graph matching
– Classification: support vector machines, logistic regression
– Graph clustering: Markov cluster, peer pressure, spectral, local
– Centrality: PageRank, betweenness, closeness
– Dimensionality reduction: NMF, PCA
– Shortest paths: all-pairs, single-source, temporal
GraphBLAS primitives, in increasing arithmetic intensity: Sparse Matrix-Sparse Vector (SpMSpV), Sparse Matrix-Dense Vector (SpMV), Sparse Matrix Times Multiple Dense Vectors (SpMM), Sparse-Sparse Matrix Product (SpGEMM), Sparse-Dense Matrix Product (SpDM³).
• Develop high-performance algorithms for 10-12 primitives.
• Use them in many algorithms (boost productivity).

  12. Expectation: two-layer productivity
Graph algorithms (user space) use GraphBLAS operations (library), which in turn use Chapel's productivity features (language).

  13. Part 2. Implementing a subset of GraphBLAS operations in Chapel

  14. For Chapel: a subset of GraphBLAS operations (parameters; returns; semantics)
– Apply: x: sparse matrix/vector, f: unary function; returns none; x[i] = f(x[i])
– Assign: x: sparse matrix/vector, y: sparse matrix/vector; returns none; x[i] = y[i]
– eWiseMult: x: sparse matrix/vector, y: sparse matrix/vector; returns z: sparse matrix/vector; z[i] = x[i] * y[i]
– SpMSpV: A: sparse matrix, x: sparse vector; returns y: sparse vector; y = Ax
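Of these four, eWiseMult is the only one not illustrated later in the talk, so here is a minimal sketch for the simple case where x and y are stored over the same sparse domain (a general version would first intersect the two index sets). It is illustrative, not the paper's library code.

    proc eWiseMult(x: [?spD] real, y: [spD] real) {
      var z: [spD] real;          // result shares the common sparse domain
      forall i in spD do
        z[i] = x[i] * y[i];       // element-wise product over stored nonzeros
      return z;
    }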

  15. Experimental platform
Chapel details
– Chapel 1.13.1 (the latest version before the IPDPS deadline)
– Chapel built from source
– CHPL_COMM: gasnet/gemini
– Job launcher: slurm-srun
Experimental platform: NERSC/Edison
– Intel Ivy Bridge processor
– 24 cores on 2 sockets
– 64 GB memory per node
– 30 MB L3 cache

  16. Sparse matrices in Chapel
– Block-distributed sparse matrices: the dense container domain is block distributed.
– We used the compressed sparse row (CSR) layout to store local matrices.
In this example (#locales = 9):
    var n = 6;
    const D = {0..n-1, 0..n-1} dmapped Block({1..3, 1..3});
    var spD: sparse subdomain(D);
    var A: [spD] real;
In our results, we did not include the time to construct the arrays.
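Since array-construction time is excluded from the measurements, it may help to see what that construction looks like. The following hedged sketch (index values are made up, and it restates the declarations above) adds nonzeros one at a time: += on the sparse subdomain inserts an index, after which the matching array element can be written.

    use BlockDist;

    config const n = 6;
    const D = {0..n-1, 0..n-1} dmapped Block({1..3, 1..3});
    var spD: sparse subdomain(D);
    var A: [spD] real;

    // insert a few (row, column) indices, then set their values
    spD += (0, 1);
    spD += (2, 3);
    spD += (5, 0);
    forall ij in spD do
      A[ij] = 1.0;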

  17. The simplest GraphBLAS operation: Apply ( x[i] = f(x[i]) )
– Apply1: high-level (Chapel style)
– Apply2: manipulating internal arrays (MPI style)
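The slides do not reproduce the source, so here is a hedged sketch of what the Apply1 (high-level) style looks like: a forall over the sparse domain visits only the stored nonzeros. The Apply2 variant walks each locale's local nonzero-value buffers directly; because that relies on implementation-internal, version-specific arrays, it is only described in a comment.

    use BlockDist;

    config const n = 100;
    const Space = {0..n-1};
    const D = Space dmapped Block(boundingBox=Space);
    var spD: sparse subdomain(D);
    var x: [spD] real;

    proc twice(v: real): real { return 2*v; }   // an example unary function f

    // Apply1 (Chapel style): iterate the sparse domain; only stored
    // nonzeros of x are visited and updated in place.
    forall i in spD do
      x[i] = twice(x[i]);

    // Apply2 (MPI/C++ style) would loop on each locale over that locale's
    // local nonzero-value buffer, avoiding distributed iteration entirely;
    // it depends on internal arrays and is omitted here.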

  18. Example, simple case: Apply ( x[i] = f(x[i]) )
– Apply1: high-level (Chapel style); Apply2: manipulating internal arrays (C++ style)
– x: 10M nonzeros; platform: NERSC/Edison
[Plots: time (ms) vs. number of threads on a single node, and time (s) vs. number of nodes (24 threads per node), for Apply1 and Apply2.]
Data-parallel loops perform well in shared memory, but do not perform well in distributed memory.

  19. Performance on distributed memory
Using chplvis on four locales (red: data in, blue: data out).
[chplvis screenshots for Apply1 and Apply2.]
All work happens at locale 0. This issue with sparse arrays was addressed about a week ago.

  20. Assign ( x[i] = y[i] )
– Assign1: high-level (Chapel style)
– Assign2: manipulating internal arrays (MPI style)

  21. Shared-memory performance: Assign ( x[i] = y[i] )
– Assign1: high-level (Chapel style); Assign2: manipulating internal arrays (C++ style)
– x: 1M nonzeros; platform: NERSC/Edison
[Plot: time (ms) vs. number of threads on a single node for Assign1 and Assign2.]
There is a big performance gap even in shared memory. Why? Indexing a sparse domain uses binary search; for assignment it can be avoided.
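To make the gap concrete, below is a hedged sketch of the Assign1 style (illustrative, not the paper's exact code): every x[i] and y[i] access goes through sparse-array indexing, which performs a binary search into the stored index set. An Assign2-style copy would instead duplicate the underlying nonzero-value buffers in one pass and skip the search entirely.

    config const n = 1000;
    const D = {0..n-1};
    var spD: sparse subdomain(D);
    var x, y: [spD] real;

    // Assign1 (Chapel style): element-wise copy through sparse indexing;
    // each x[i] / y[i] access searches for i in the stored index set.
    forall i in spD do
      x[i] = y[i];

    // An Assign2-style copy would duplicate the nonzero-value buffer
    // directly, with no per-element search.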

  22. Distributed-memory performance: Assign ( x[i] = y[i] )
– Assign1: high-level (Chapel style); Assign2: manipulating internal arrays (C++ style)
– x: 1M nonzeros; platform: NERSC/Edison
[Plot: time (s) vs. number of nodes (24 threads per node) for Assign1 and Assign2.]
There is a big performance gap in distributed memory as well.

  23. Example, complex case: SpMSpV ( y = Ax )
Algorithm overview:
[Figure: y = A * x; entries of x are scattered/gathered and the contributions are accumulated into a sparse accumulator (SPA).]
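As a concrete illustration of the accumulate step, here is a hedged sketch (assumed data layout, not the paper's implementation) of a purely local SpMSpV with a dense sparse-accumulator (SPA): contributions A[i, j] * x[j] are accumulated per output row, and the touched rows are then gathered into the sparse result.

    proc spmspvSPA(A: [?spA] real, xIdx: domain(int), xVal: [xIdx] real, n: int) {
      var spa: [0..n-1] real;            // dense accumulator (SPA), one slot per row
      var hit: [0..n-1] bool;            // rows that received a contribution

      for (i, j) in spA do               // stored nonzeros (i, j) of A; serial for clarity
        if xIdx.contains(j) {            // does x have a nonzero in column j?
          spa[i] += A[i, j] * xVal[j];   // multiply and accumulate
          hit[i] = true;
        }

      var yIdx: domain(int);             // gather the touched rows into a sparse result
      for i in 0..n-1 do
        if hit[i] then yIdx += i;
      var yVal: [yIdx] real;
      forall i in yIdx do yVal[i] = spa[i];
      return (yIdx, yVal);
    }

The dense SPA trades O(n) scratch memory per call for constant-time accumulation, which matches the scatter/gather/accumulate picture above.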

  24. Sparse matrix-sparse vector multiply (SpMSpV)
P processors are arranged in a √p × √p processor grid.
Algorithm (MPI style):
1. Gather vertices in the processor column
2. Local multiplication
3. Scatter results in the processor row
Algorithm (Chapel style): multiply, accessing remote data as needed; no collective communication.
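The "Chapel style" version can be sketched in a few lines; the names xVal, xHas, and y below are assumptions, and this is a sketch of the idea rather than the paper's code. One forall ranges over the stored nonzeros of the block-distributed matrix, reading x[j] wherever it lives and updating y[i] with a (possibly remote) atomic add, with no collective communication; the next slide notes that such remote atomics are expensive in Chapel.

    use BlockDist;

    config const n = 8;
    const Space = {0..n-1, 0..n-1};
    const D = Space dmapped Block(boundingBox=Space);
    var spA: sparse subdomain(D);
    var A: [spA] real;

    var xVal: [0..n-1] real;        // dense copy of the sparse input vector's values
    var xHas: [0..n-1] bool;        // mask marking x's nonzeros
    var y:    [0..n-1] atomic real; // output, accumulated atomically

    // "Chapel style": implicit remote reads of x, atomic accumulation into y
    forall (i, j) in spA do
      if xHas[j] then
        y[i].add(A[i, j] * xVal[j]);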

  25. Distributed-memory performance of SpMSpV on Edison
– A: random, 16M nonzeros; x: random, 2000 nonzeros
[Plot: time (s) vs. number of nodes (24 threads/node), broken down into Gather Input, Local Multiply, and Scatter Output; we don't know the reason for the observed behavior.]
Remote atomics are expensive in Chapel.

  26. Requirements for achieving high performance
Exploit available spatial locality in sparse manipulations
– Efficient access to the nonzeros of sparse matrices/vectors
– Chapel is almost there, but needs improved parallel iterators
Use bulk-synchronous communication whenever possible
– Avoid latency-bound communication
– Team collectives are useful
