The Reverse Cuthill-McKee Algorithm in Distributed-Memory
Ariful Azad, Lawrence Berkeley National Laboratory (LBNL)
SIAM CSC 2016, Albuquerque
Acknowledgements
• Joint work with
– Aydın Buluç
– Mathias Jacquelin
– Esmond Ng
• Funding
– DOE Office of Science
– Time allocation at the DOE NERSC Center
Reordering a sparse matrix
• In this talk, I consider parallel algorithms for reordering sparse matrices
• Goal: find a permutation P so that the bandwidth/profile of PAPᵀ is small
[Figure: sparsity pattern of the matrix before and after permutation]
Why reorder a matrix?
• Better cache reuse in SpMV [Karantasis et al. SC '14]
• Faster iterative solvers such as preconditioned conjugate gradient (PCG)
– Example: the PCG implementation in PETSc
[Plot: solver time (s) vs. number of cores (1–256) for thermal2 (n = 1.2M, nnz = 4.9M), natural ordering vs. RCM ordering; RCM is up to 4x faster]
The case for the Reverse Cuthill-McKee (RCM) algorithm
• Finding a permutation that minimizes the bandwidth is NP-complete [Papadimitriou '76]
• Heuristics are used in practice
– Examples: the Reverse Cuthill-McKee algorithm, Sloan's algorithm
• We focus on the Reverse Cuthill-McKee (RCM) algorithm
– Simple to state
– Easy to understand
– Relatively easy to parallelize
The case for a distributed-memory algorithm
• Enables solving very large problems
• More practical: the matrix is often already distributed, and gathering it onto a single node for serial execution is expensive
[Bar chart: time (sec) to gather a graph onto one node from 45 nodes of NERSC/Edison (Cray XC30) for ldoor, dielFilterV3real, Serena, delaunay_n24, nlpkkt240, hugetrace-00020, rgg_n_2_24_s0]
• Distributed algorithms are cheaper and scalable
The RCM algorithm
• Start from a pseudo-peripheral start vertex (Cuthill-McKee order 1)
• Order its neighbors by increasing degree
• In each subsequent BFS level, order vertices by (parent's order, degree)
• Reverse the order of the vertices to obtain the RCM ordering
[Diagram: example graph with eight vertices labeled 1–8 in Cuthill-McKee order]
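To make these steps concrete, here is a minimal serial sketch of the Cuthill-McKee procedure outlined above, assuming an adjacency-list graph and a start vertex chosen in advance; the function name `rcm_order` and the data layout are illustrative, not taken from the talk's implementation.

```python
def rcm_order(adj, start):
    """Serial Cuthill-McKee from the given start vertex, then reversed.

    adj   : list of adjacency lists, adj[v] = neighbors of vertex v
            (assumed connected, undirected graph).
    start : a pseudo-peripheral vertex chosen beforehand.
    Returns the vertices in RCM order.
    """
    n = len(adj)
    degree = [len(adj[v]) for v in range(n)]
    order = [start]                 # Cuthill-McKee order, grown like a FIFO queue
    visited = [False] * n
    visited[start] = True
    head = 0
    while head < len(order):
        v = order[head]             # parents are processed in label order, so
        head += 1                   # children are automatically grouped by parent
        for u in sorted((u for u in adj[v] if not visited[u]),
                        key=lambda u: degree[u]):   # siblings by increasing degree
            visited[u] = True
            order.append(u)
    order.reverse()                 # reverse Cuthill-McKee
    return order


# Tiny usage example on the path graph 0-1-2-3:
# adj = [[1], [0, 2], [1, 3], [2]]
# print(rcm_order(adj, 0))   # -> [3, 2, 1, 0]
```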
RCM: challenges in parallelization (in addition to parallelizing BFS)
• Given a start vertex, the algorithm produces a fixed ordering except for tie breaks. Not parallelization friendly
• Unlike traditional BFS, the parent of a vertex is the candidate with the minimum label (i.e., bottom-up BFS is not beneficial)
• Within a level, vertices are labeled in lexicographic order of (parent's order, degree) pairs, which requires sorting
[Diagram: example graph with vertices a–h and their Cuthill-McKee labels 1–8]
Our approach to the parallelization challenges
• We use a specialized level-synchronous BFS
• Key differences from traditional BFS (Buluç and Madduri, SC '11):
1. A parent with a smaller label is preferred over another parent with a larger label
2. The labels of parents are passed to their children
3. Vertices within each BFS level are sorted lexicographically
• The first two are handled by sparse matrix-sparse vector multiplication (SpMSpV) over a semiring
• The third is handled by a lightweight sorting function
Exploring the next-level vertices via SpMSpV
• Overload (multiply, add) with (select2nd, min): each edge from a frontier vertex contributes the parent's label (select2nd), and a child reached from several parents keeps the minimum label (min)
[Diagram: current frontier (sparse vector x), adjacency matrix, and next frontier (sparse vector y) for the example graph]
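A minimal sketch of this frontier expansion, assuming a SciPy CSR adjacency matrix and a dictionary as the sparse frontier vector; the name `spmspv_select2nd_min` is hypothetical, and the caller is assumed to mask already-labeled vertices out of the result.

```python
import scipy.sparse as sp  # A is assumed to be a scipy.sparse.csr_matrix

def spmspv_select2nd_min(A, frontier):
    """One frontier expansion, y = A^T x over the (select2nd, min) semiring.

    A        : CSR adjacency matrix (symmetric for an undirected graph).
    frontier : dict {vertex: its Cuthill-McKee label} (the sparse vector x).
    Returns  : dict {child: minimum parent label reaching it} (the sparse vector y);
               already-labeled vertices are filtered out by the caller.
    """
    y = {}
    for v, label in frontier.items():
        # neighbors of v, read directly from the CSR structure of row v
        for u in A.indices[A.indptr[v]:A.indptr[v + 1]]:
            # multiply = select2nd: the edge contributes the parent's label
            # add      = min      : keep the smallest parent label seen so far
            if u not in y or label < y[u]:
                y[u] = label
    return y
```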
Ordering vertices via partial sorting
• Sort the degrees of siblings: many instances of small sorts (avoids an expensive full parallel sort)
• Example: next-frontier vertices c, f, h with (parent's label, degree) = (2, 4), (3, 2), (2, 1)
– Rules for ordering vertices:
1. c and h are ordered before f (smaller parent label)
2. h is ordered before c (same parent label, smaller degree)
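A sketch of this partial sort, taking the output of the SpMSpV step above; the helper `label_next_frontier` is hypothetical and uses a serial sort to stand in for the parallel version.

```python
def label_next_frontier(y, degree, next_label):
    """Assign Cuthill-McKee labels to the next frontier.

    y          : dict {vertex: parent's label} produced by the SpMSpV step.
    degree     : degree[v] = degree of vertex v.
    next_label : first unused Cuthill-McKee label.
    Vertices sharing a parent form small sibling groups, so the lexicographic
    sort on (parent's label, degree) is many tiny sorts rather than one big one.
    """
    ordered = sorted(y, key=lambda v: (y[v], degree[v]))
    return {v: next_label + i for i, v in enumerate(ordered)}

# With the slide's example, c, f, h having (parent label, degree) = (2, 4), (3, 2), (2, 1),
# the resulting order is h, c, f.
```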
Distributed-memory parallelization (SpMSpV)
• The p processors are arranged in a √p × √p processor grid; the matrix A and the frontier vector x are distributed over the grid
ALGORITHM:
1. Gather frontier vertices within the processor column [communication]
2. Local multiplication [computation]
3. Find the owners of the current frontier's adjacency and exchange adjacencies within the processor row [communication]
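A simplified mpi4py sketch of the three steps above on a √p × √p process grid; the block layout, the `vertex % √p` ownership rule, and the name `spmspv_2d_step` are assumptions made for illustration, not the talk's actual (CombBLAS-style) implementation.

```python
from mpi4py import MPI
import math

def spmspv_2d_step(A_block, frontier_local, comm=MPI.COMM_WORLD):
    """Sketch of one distributed SpMSpV step on a sqrt(p) x sqrt(p) process grid.

    A_block        : dict {frontier_vertex: [neighbor, ...]} with the edges stored
                     in this process's block of the adjacency matrix.
    frontier_local : dict {vertex: parent_label}, the locally owned part of the
                     sparse frontier vector.
    """
    p, rank = comm.Get_size(), comm.Get_rank()
    pr = math.isqrt(p)                                   # grid dimension
    assert pr * pr == p, "process count must be a perfect square"
    grid_row, grid_col = divmod(rank, pr)
    row_comm = comm.Split(color=grid_row, key=grid_col)  # processes in my grid row
    col_comm = comm.Split(color=grid_col, key=grid_row)  # processes in my grid column

    # 1. Gather frontier vertices within the processor column [communication]
    frontier_col = {}
    for chunk in col_comm.allgather(frontier_local):
        frontier_col.update(chunk)

    # 2. Local multiplication over the (select2nd, min) semiring [computation]
    outgoing = [dict() for _ in range(pr)]
    for v, label in frontier_col.items():
        for child in A_block.get(v, ()):
            dest = child % pr                            # assumed owner within my row
            prev = outgoing[dest].get(child, label)
            outgoing[dest][child] = min(prev, label)

    # 3. Exchange discovered adjacencies within the processor row and merge
    #    duplicates with min [communication]
    merged = {}
    for chunk in row_comm.alltoall(outgoing):
        for child, label in chunk.items():
            merged[child] = min(label, merged.get(child, label))

    row_comm.Free()
    col_comm.Free()
    return merged
```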
Distributed-memory partial sorting
• Bin vertices by their parents' labels
– All vertices in a bin are assigned to a single node
– Needs an AllToAll communication
• Sequentially sort the degrees of the vertices in each bin on its node
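A sketch of the binning and AllToAll exchange, again with mpi4py; the modulo mapping of bins to ranks, the omission of the final label assignment, and the helper `distributed_partial_sort` are illustrative assumptions.

```python
from mpi4py import MPI

def distributed_partial_sort(y_local, degree, comm=MPI.COMM_WORLD):
    """Sketch of the distributed partial sort.

    y_local : dict {vertex: parent's label} held locally after the SpMSpV step.
    degree  : mapping from vertex to its degree.
    Returns the locally owned bins, sorted by (parent's label, degree).
    """
    p = comm.Get_size()

    # Bin vertices by their parents' labels; each bin lives on exactly one rank.
    bins = [[] for _ in range(p)]
    for v, parent_label in y_local.items():
        bins[parent_label % p].append((parent_label, degree[v], v))

    # AllToAll communication: every rank receives all vertices of the bins it owns.
    received = [item for chunk in comm.alltoall(bins) for item in chunk]

    # Sequential sort of the received (parent label, degree) pairs on this rank.
    received.sort()
    return [v for (_, _, v) in received]
```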
Computation and communication complexity

Operation | Per-processor computation (lower bound) | Per-processor comm (latency) | Per-processor comm (bandwidth)
SpMSpV    | (m + n) / p                             | diameter · α·√p              | β · m / √p
Sorting   | (n / p) · log(n / p)                    | diameter · α·p               | β · n / p

n: number of vertices, m: number of edges
α: latency (0.25 μs to 3.7 μs MPI latency on Edison)
β: inverse bandwidth (~8 GB/sec MPI bandwidth on Edison)
p: number of processors
Other aspects of the algorithm
• Finding a pseudo-peripheral start vertex
– Repeated application of the usual BFS (no ordering of vertices within a level)
• Our SpMSpV is a hybrid OpenMP-MPI implementation
– Multithreaded SpMSpV is itself fairly involved and is the subject of separate work
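For reference, a serial sketch of a standard pseudo-peripheral vertex search by repeated BFS (in the spirit of the George-Liu heuristic), matching the "repeated application of the usual BFS" above; the talk's distributed version parallelizes each BFS, which is not shown here.

```python
from collections import deque

def pseudo_peripheral_vertex(adj, start=0):
    """Repeated-BFS search for a pseudo-peripheral vertex (serial sketch)."""
    def bfs_eccentricity(src):
        level = {src: 0}
        q = deque([src])
        while q:
            v = q.popleft()
            for u in adj[v]:
                if u not in level:
                    level[u] = level[v] + 1
                    q.append(u)
        ecc = max(level.values())
        last_level = [v for v, l in level.items() if l == ecc]
        return ecc, last_level

    ecc, last_level = bfs_eccentricity(start)
    while True:
        # restart from a minimum-degree vertex of the deepest BFS level
        candidate = min(last_level, key=lambda v: len(adj[v]))
        new_ecc, new_last = bfs_eccentricity(candidate)
        if new_ecc <= ecc:          # eccentricity stopped growing: good enough
            return candidate
        ecc, last_level = new_ecc, new_last
```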
Results: scalability on NERSC/Edison (6 threads per MPI process)
Matrix: dielFilterV3real (#vertices: 1.1M, #edges: 89M); bandwidth before: 1,036,475, after: 23,813
[Stacked bar chart: time (sec) vs. number of cores (1 to 4,056), broken down into Peripheral: SpMSpV, Peripheral: Other, Ordering: SpMSpV, Ordering: Sorting, Ordering: Other; roughly 30x speedup at the largest core count]
• Communication dominates at scale
Scalability on NERSC/Edison (6 threads per MPI process)
Matrix: nlpkkt240 (#vertices: 78M, #edges: 760M); bandwidth before: 14,169,841, after: 361,755
[Stacked bar chart: time (sec) vs. number of cores (54 to 4,056), same breakdown as the previous slide]
• Larger graphs continue scaling
Single-node performance on NERSC/Edison (2x12 cores)
• Comparison against the SpMP (Sparse Matrix Pre-processing) package by Park et al. (https://github.com/jspark1105/SpMP)
• We switch to MPI+OpenMP after 12 cores
• Matrix: ldoor (#vertices: 1M, #edges: 42M)
[Plot: time (s) vs. number of cores (1 to 32) for SpMP and our algorithm]
• If the matrix is already distributed on 1K cores (~45 nodes), the time to gather it is 0.82 s, making the distributed algorithm more profitable
Conclusions
• For many practical problems, the RCM ordering expedites iterative solvers
• No scalable distributed-memory algorithm for RCM ordering existed, forcing users to gather an already distributed matrix onto one node and run a serial algorithm (e.g., in PETSc), which is expensive
• We developed a distributed-memory RCM algorithm using SpMSpV and partial sorting
• The algorithm scales up to 1K cores on modern supercomputers
Thanks for your attention