

  1. Towards a GraphBLAS Library in Chapel
Ariful Azad & Aydin Buluç, Lawrence Berkeley National Laboratory (LBNL)
CHIUW, IPDPS 2017

  2. Overview
High-level research objective:
– Enable productive and high-performance graph analytics
– We used GraphBLAS and Chapel to achieve this goal
GraphBLAS: building blocks for graph algorithms in the language of sparse linear algebra.
Chapel: an emerging parallel language designed for productive parallel computing at scale.
Both promise: productivity + performance.
Scope of this paper: a GraphBLAS library in Chapel.

  3. Outline
1. Overview of GraphBLAS primitives
2. Implementation of a subset of GraphBLAS primitives in Chapel, with experimental results
Warning: this is just an early evaluation, as Chapel's sparse matrix support is actively under development. All experiments were conducted on Chapel 1.13.1. The performance numbers are expected to improve significantly in future releases of Chapel.

  4. Part 1. GraphBLAS overview

  5. GraphBLAS analogy: a ready-to-assemble furniture shop (Ikea)
Building blocks → Objects (Algorithms) → Final product (Applications)

  6. Graph algorithm building blocks
GraphBLAS ( http://graphblas.org )
– Standard building blocks for graph algorithms in the language of sparse linear algebra
– Inspired by the Basic Linear Algebra Subprograms (BLAS)
– Participants from industry, academia, and national labs
– The C API is available on the website (Design of the GraphBLAS API for C, A. Buluç, T. Mattson, S. McMillan, J. Moreira, C. Yang, IPDPS Workshops 2017)

  7. GraphBLAS as algorithm building blocks
Employs graph-matrix duality
– Graphs => sparse matrices
– A subset of vertices/edges => sparse/dense vectors
Benefits
– Standard set of operations
– Learn from the rich history of numerical linear algebra
– Offers structured and regular memory accesses and communication (as opposed to the irregular memory accesses of traditional graph algorithms)
– Opportunity for communication-avoiding algorithms

  8. Some GraphBLAS basic primitives (function: parameters; returns; Matlab notation)
– MxM (SpGEMM): sparse matrices A and B, optional unary functions; returns a sparse matrix; C = A * B
– MxV (SpM{Sp}V): sparse matrix A, sparse/dense vector x; returns a sparse/dense vector; y = A * x
– EwiseMult, Add, ... (SpEWiseX): sparse matrices or vectors, binary function, optional unary functions; in place or a sparse matrix/vector; C = A .* B, C = A + B
– Reduce (Reduce): sparse matrix A and a function; returns a dense vector; y = sum(A, op)
– Extract (SpRef): sparse matrix A, index vectors p and q; returns a sparse matrix; B = A(p, q)
– Assign (SpAsgn): sparse matrices A and B, index vectors p and q; returns none; A(p, q) = B
– BuildMatrix (Sparse): list of edges/triples (i, j, v); returns a sparse matrix; A = sparse(i, j, v, m, n)
– ExtractTuples (Find): sparse matrix A; returns an edge list; [i, j, v] = find(A)
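A couple of these primitives already map onto short Chapel expressions once a sparse array is in hand. The sketch below is purely illustrative (the names spA and A are assumed, not part of the paper's library): it shows a Reduce-style sum over the stored nonzeros and an ExtractTuples-style walk over the stored indices.

    config const n = 4;
    const Dense = {0..n-1, 0..n-1};
    var spA: sparse subdomain(Dense);
    var A: [spA] real;
    spA += (0, 1);  A[0, 1] = 2.0;   // a couple of illustrative nonzeros
    spA += (3, 2);  A[3, 2] = 5.0;

    // Reduce: sum over all stored nonzeros (Matlab: sum(A(:)))
    const total = + reduce A;
    writeln("total = ", total);

    // ExtractTuples / find(A): walk the stored (i, j) indices with their values
    for (i, j) in spA do
      writeln(i, " ", j, " ", A[i, j]);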

  9. General-purpose operations via semirings (overloading the addition and multiplication operations)
– Real field (R, +, ×): classical numerical linear algebra
– Boolean algebra ({0, 1}, |, &): graph traversal
– Tropical semiring (R ∪ {∞}, min, +): shortest paths
– (S, select, select): select a subgraph, or contract nodes to form a quotient graph
– (edge/vertex attributes, vertex data aggregation, edge data processing): schema for user-specified computation at vertices and edges
– (R, max, +): graph matching and network alignment
– (R, min, ×): maximal independent set
Shortened semiring notation: (Set, Add, Multiply); both identities omitted.
– Add traverses edges; Multiply combines edges/paths at a vertex.
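As a concrete (and hedged) illustration of overloading (add, multiply), here is a minimal Chapel sketch of one step of the tropical-semiring matrix-vector product used for shortest paths. It is not the paper's code; dense storage and the names minPlusMatVec, A, and x are assumed purely to keep the example short.

    config const n = 4;
    const inf = max(real);                 // identity of min is +infinity

    proc minPlusMatVec(A: [?DA] real, x: [?Dx] real) {
      var y: [Dx] real = inf;
      forall i in Dx do
        for j in Dx do
          // semiring "multiply" is +, semiring "add" is min:
          // y[i] = min over j of (A[i, j] + x[j])
          y[i] = min(y[i], A[i, j] + x[j]);
      return y;
    }

Iterating y = minPlusMatVec(A, y) relaxes all paths by one more edge per step, which is exactly the Bellman-Ford view of shortest paths.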

  10. Example: exploring the next-level vertices via SpMSpV
Overload (multiply, add) with (select2nd, min).
[Figure: the adjacency matrix multiplied by the sparse current-frontier vector x; the nonzeros of the product mark the next frontier.]
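To make the (select2nd, min) overload concrete, below is a hedged sketch of one BFS step, not the paper's implementation. It assumes a dense boolean adjacency matrix A (A[i, j] is true when there is an edge from j to i, matching the y = A x convention), a parent array initialized to -1, and the current frontier stored as a sparse set of vertex ids.

    proc bfsStep(A: [?D] bool, frontier: domain(int), ref parent: [?Dv] int) {
      var next: domain(int);                 // sparse set of next-level vertices
      for j in frontier do                   // serial for clarity
        for i in Dv do
          if A[i, j] && parent[i] == -1 {
            parent[i] = j;                   // "multiply" = select2nd: record the parent id
            next += i;                       // "add": first discovery of i wins
          }
      return next;
    }

A full BFS would seed frontier with the root, set parent[root] to the root, and call bfsStep until the returned frontier is empty.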

  11. Algorithmic coverage
Higher-level combinatorial and machine learning algorithms:
– Miscellaneous: connectivity, traversal (BFS), independent sets (MIS), graph matching
– Classification: support vector machines, logistic regression
– Graph clustering: Markov cluster, peer pressure, spectral, local
– Centrality: PageRank, betweenness, closeness
– Dimensionality reduction: NMF, PCA
– Shortest paths: all-pairs, single-source, temporal
GraphBLAS primitives, in increasing arithmetic intensity: Sparse Matrix-Sparse Vector (SpMSpV), Sparse Matrix-Dense Vector (SpMV), Sparse Matrix Times Multiple Dense Vectors (SpMM), Sparse-Sparse Matrix Product (SpGEMM), Sparse-Dense Matrix Product (SpDM³).
• Develop high-performance algorithms for 10-12 primitives.
• Use them in many algorithms (boost productivity).

  12. Expectation: two-layer productivity
Graph algorithms (user space) use GraphBLAS operations (library), which in turn use Chapel's productivity features (language).

  13. Part 2. Implementing a subset of GraphBLAS operations in Chapel

  14. For Chapel: a subset of GraphBLAS operations (parameters; returns; semantics)
– Apply: x: sparse matrix/vector, f: unary function; returns none; x[i] = f(x[i])
– Assign: x: sparse matrix/vector, y: sparse matrix/vector; returns none; x[i] = y[i]
– eWiseMult: x: sparse matrix/vector, y: sparse matrix/vector; returns z: sparse matrix/vector; z[i] = x[i] * y[i]
– SpMSpV: A: sparse matrix, x: sparse vector; returns y: sparse vector; y = Ax
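Of these four, eWiseMult is the only one not illustrated later in the talk, so here is a minimal sketch for the simple case where x and y are stored over the same sparse domain (a general version would first intersect the two index sets). It is illustrative, not the paper's library code.

    proc eWiseMult(x: [?spD] real, y: [spD] real) {
      var z: [spD] real;          // result shares the common sparse domain
      forall i in spD do
        z[i] = x[i] * y[i];       // element-wise product over stored nonzeros
      return z;
    }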

  15. Experimental platform
Chapel details
– Chapel 1.13.1 (the latest version before the IPDPS deadline)
– Chapel built from source
– CHPL_COMM: gasnet/gemini
– Job launcher: slurm-srun
Experimental platform: NERSC/Edison
– Intel Ivy Bridge processor
– 24 cores on 2 sockets
– 64 GB memory per node
– 30 MB L3 cache

  16. Sparse matrices in Chapel
– Block-distributed sparse matrices: the dense container domain is block distributed.
– We used the compressed sparse row (CSR) layout to store local matrices.
In this example (#locales = 9):
    var n = 6;
    const D = {0..n-1, 0..n-1} dmapped Block({1..3, 1..3});
    var spD: sparse subdomain(D);
    var A: [spD] real;
In our results, we did not include the time to construct the arrays.
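Since array-construction time is excluded from the measurements, it may help to see what that construction looks like. The following hedged sketch (index values are made up, and it restates the declarations above) adds nonzeros one at a time: += on the sparse subdomain inserts an index, after which the matching array element can be written.

    use BlockDist;

    config const n = 6;
    const D = {0..n-1, 0..n-1} dmapped Block({1..3, 1..3});
    var spD: sparse subdomain(D);
    var A: [spD] real;

    // insert a few (row, column) indices, then set their values
    spD += (0, 1);
    spD += (2, 3);
    spD += (5, 0);
    forall ij in spD do
      A[ij] = 1.0;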

  17. The simplest GraphBLAS operation: Apply ( x[i] = f(x[i]) )
– Apply1: high-level (Chapel style)
– Apply2: manipulating internal arrays (MPI style)
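The slides do not reproduce the source, so here is a hedged sketch of what the Apply1 (high-level) style looks like: a forall over the sparse domain visits only the stored nonzeros. The Apply2 variant walks each locale's local nonzero-value buffers directly; because that relies on implementation-internal, version-specific arrays, it is only described in a comment.

    use BlockDist;

    config const n = 100;
    const Space = {0..n-1};
    const D = Space dmapped Block(boundingBox=Space);
    var spD: sparse subdomain(D);
    var x: [spD] real;

    proc twice(v: real): real { return 2*v; }   // an example unary function f

    // Apply1 (Chapel style): iterate the sparse domain; only stored
    // nonzeros of x are visited and updated in place.
    forall i in spD do
      x[i] = twice(x[i]);

    // Apply2 (MPI/C++ style) would loop on each locale over that locale's
    // local nonzero-value buffer, avoiding distributed iteration entirely;
    // it depends on internal arrays and is omitted here.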

  18. Example, simple case: Apply ( x[i] = f(x[i]) )
– Apply1: high-level (Chapel style); Apply2: manipulating internal arrays (C++ style)
– x: 10M nonzeros; platform: NERSC/Edison
[Plots: time (ms) vs. number of threads on a single node, and time (s) vs. number of nodes (24 threads per node), for Apply1 and Apply2.]
Data-parallel loops perform well in shared memory, but do not perform well in distributed memory.

  19. Performance on distributed memory
Using chplvis on four locales (red: data in, blue: data out).
[chplvis screenshots for Apply1 and Apply2.]
All work happens at locale 0. This issue with sparse arrays was addressed about a week ago.

  20. Assign ( x[i] = y[i] )
– Assign1: high-level (Chapel style)
– Assign2: manipulating internal arrays (MPI style)

  21. Shared-memory performance: Assign ( x[i] = y[i] )
– Assign1: high-level (Chapel style); Assign2: manipulating internal arrays (C++ style)
– x: 1M nonzeros; platform: NERSC/Edison
[Plot: time (ms) vs. number of threads on a single node for Assign1 and Assign2.]
There is a big performance gap even in shared memory. Why? Indexing a sparse domain uses binary search; for assignment it can be avoided.
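To make the gap concrete, below is a hedged sketch of the Assign1 style (illustrative, not the paper's exact code): every x[i] and y[i] access goes through sparse-array indexing, which performs a binary search into the stored index set. An Assign2-style copy would instead duplicate the underlying nonzero-value buffers in one pass and skip the search entirely.

    config const n = 1000;
    const D = {0..n-1};
    var spD: sparse subdomain(D);
    var x, y: [spD] real;

    // Assign1 (Chapel style): element-wise copy through sparse indexing;
    // each x[i] / y[i] access searches for i in the stored index set.
    forall i in spD do
      x[i] = y[i];

    // An Assign2-style copy would duplicate the nonzero-value buffer
    // directly, with no per-element search.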

  22. Distributed-memory performance: Assign ( x[i] = y[i] )
– Assign1: high-level (Chapel style); Assign2: manipulating internal arrays (C++ style)
– x: 1M nonzeros; platform: NERSC/Edison
[Plot: time (s) vs. number of nodes (24 threads per node) for Assign1 and Assign2.]
There is a big performance gap in distributed memory as well.

  23. Example, complex case: SpMSpV ( y = Ax )
Algorithm overview:
[Figure: y = A * x; entries of x are scattered/gathered and the contributions are accumulated into a sparse accumulator (SPA).]
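As a concrete illustration of the accumulate step, here is a hedged sketch (assumed data layout, not the paper's implementation) of a purely local SpMSpV with a dense sparse-accumulator (SPA): contributions A[i, j] * x[j] are accumulated per output row, and the touched rows are then gathered into the sparse result.

    proc spmspvSPA(A: [?spA] real, xIdx: domain(int), xVal: [xIdx] real, n: int) {
      var spa: [0..n-1] real;            // dense accumulator (SPA), one slot per row
      var hit: [0..n-1] bool;            // rows that received a contribution

      for (i, j) in spA do               // stored nonzeros (i, j) of A; serial for clarity
        if xIdx.contains(j) {            // does x have a nonzero in column j?
          spa[i] += A[i, j] * xVal[j];   // multiply and accumulate
          hit[i] = true;
        }

      var yIdx: domain(int);             // gather the touched rows into a sparse result
      for i in 0..n-1 do
        if hit[i] then yIdx += i;
      var yVal: [yIdx] real;
      forall i in yIdx do yVal[i] = spa[i];
      return (yIdx, yVal);
    }

The dense SPA trades O(n) scratch memory per call for constant-time accumulation, which matches the scatter/gather/accumulate picture above.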

  24. Sparse matrix-sparse vector multiply (SpMSpV)
P processors are arranged in a √p × √p processor grid.
Algorithm (MPI style):
1. Gather vertices in the processor column
2. Local multiplication
3. Scatter results in the processor row
Algorithm (Chapel style): multiply, accessing remote data as needed; no collective communication.
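The "Chapel style" version can be sketched in a few lines; the names xVal, xHas, and y below are assumptions, and this is a sketch of the idea rather than the paper's code. One forall ranges over the stored nonzeros of the block-distributed matrix, reading x[j] wherever it lives and updating y[i] with a (possibly remote) atomic add, with no collective communication; the next slide notes that such remote atomics are expensive in Chapel.

    use BlockDist;

    config const n = 8;
    const Space = {0..n-1, 0..n-1};
    const D = Space dmapped Block(boundingBox=Space);
    var spA: sparse subdomain(D);
    var A: [spA] real;

    var xVal: [0..n-1] real;        // dense copy of the sparse input vector's values
    var xHas: [0..n-1] bool;        // mask marking x's nonzeros
    var y:    [0..n-1] atomic real; // output, accumulated atomically

    // "Chapel style": implicit remote reads of x, atomic accumulation into y
    forall (i, j) in spA do
      if xHas[j] then
        y[i].add(A[i, j] * xVal[j]);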

  25. Distributed-memory performance of SpMSpV on Edison
– A: random, 16M nonzeros; x: random, 2000 nonzeros
[Plot: time (s) vs. number of nodes (24 threads/node), broken down into Gather Input, Local Multiply, and Scatter Output; we don't know the reason for the observed behavior.]
Remote atomics are expensive in Chapel.

  26. Requirements for achieving high performance
Exploit available spatial locality in sparse manipulations
– Efficient access to the nonzeros of sparse matrices/vectors
– Chapel is almost there, but needs improved parallel iterators
Use bulk-synchronous communication whenever possible
– Avoid latency-bound communication
– Team collectives are useful
