The Reverse Cuthill-McKee Algorithm in Distributed-Memory
Ariful Azad, Lawrence Berkeley National Laboratory (LBNL)
SIAM CSC 2016, Albuquerque
Acknowledgements
• Joint work with
– Aydın Buluç
– Mathias Jacquelin
– Esmond Ng
• Funding
– DOE Office of Science
– Time allocation at the DOE NERSC Center
Reordering a sparse matrix
• In this talk, I consider parallel algorithms for reordering sparse matrices
• Goal: find a permutation P so that the bandwidth/profile of PAPᵀ is small
[Figure: sparsity pattern of the matrix before and after permutation]
Why reorder a matrix?
• Better cache reuse in SpMV [Karantasis et al. SC '14]
• Faster iterative solvers such as preconditioned conjugate gradient (PCG)
– Example: the PCG implementation in PETSc
[Plot: solver time (s) vs. number of cores (1–256) for thermal2 (n = 1.2M, nnz = 4.9M), natural ordering vs. RCM ordering; RCM is up to 4x faster]
The case for the Reverse Cuthill-McKee (RCM) algorithm
• Finding a permutation that minimizes the bandwidth is NP-complete [Papadimitriou '76]
• Heuristics are used in practice
– Examples: the Reverse Cuthill-McKee algorithm, Sloan's algorithm
• We focus on the Reverse Cuthill-McKee (RCM) algorithm
– Simple to state
– Easy to understand
– Relatively easy to parallelize
The case for a distributed-memory algorithm
• Enables solving very large problems
• More practical: the matrix is often already distributed, and gathering it onto a single node for serial execution is expensive
[Bar chart: time (sec) to gather a graph onto one node from 45 nodes of NERSC/Edison (Cray XC30) for ldoor, dielFilterV3real, Serena, delaunay_n24, nlpkkt240, hugetrace-00020, rgg_n_2_24_s0]
• Distributed algorithms are cheaper and scalable
The RCM algorithm
• Start from a pseudo-peripheral start vertex (Cuthill-McKee order 1)
• Order its neighbors by increasing degree
• In each subsequent BFS level, order vertices by (parent's order, degree)
• Reverse the order of the vertices to obtain the RCM ordering
[Diagram: example graph with eight vertices labeled 1–8 in Cuthill-McKee order]
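To make these steps concrete, here is a minimal serial sketch of the Cuthill-McKee procedure outlined above, assuming an adjacency-list graph and a start vertex chosen in advance; the function name `rcm_order` and the data layout are illustrative, not taken from the talk's implementation.

```python
def rcm_order(adj, start):
    """Serial Cuthill-McKee from the given start vertex, then reversed.

    adj   : list of adjacency lists, adj[v] = neighbors of vertex v
            (assumed connected, undirected graph).
    start : a pseudo-peripheral vertex chosen beforehand.
    Returns the vertices in RCM order.
    """
    n = len(adj)
    degree = [len(adj[v]) for v in range(n)]
    order = [start]                 # Cuthill-McKee order, grown like a FIFO queue
    visited = [False] * n
    visited[start] = True
    head = 0
    while head < len(order):
        v = order[head]             # parents are processed in label order, so
        head += 1                   # children are automatically grouped by parent
        for u in sorted((u for u in adj[v] if not visited[u]),
                        key=lambda u: degree[u]):   # siblings by increasing degree
            visited[u] = True
            order.append(u)
    order.reverse()                 # reverse Cuthill-McKee
    return order


# Tiny usage example on the path graph 0-1-2-3:
# adj = [[1], [0, 2], [1, 3], [2]]
# print(rcm_order(adj, 0))   # -> [3, 2, 1, 0]
```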
RCM: challenges in parallelization (in addition to parallelizing BFS)
• Given a start vertex, the algorithm produces a fixed ordering except for tie breaks. Not parallelization friendly
• Unlike traditional BFS, the parent of a vertex is the candidate with the minimum label (i.e., bottom-up BFS is not beneficial)
• Within a level, vertices are labeled in lexicographic order of (parent's order, degree) pairs, which requires sorting
[Diagram: example graph with vertices a–h and their Cuthill-McKee labels 1–8]
Our approach to the parallelization challenges
• We use a specialized level-synchronous BFS
• Key differences from traditional BFS (Buluç and Madduri, SC '11):
1. A parent with a smaller label is preferred over another parent with a larger label
2. The labels of parents are passed to their children
3. Vertices within each BFS level are sorted lexicographically
• The first two are handled by sparse matrix-sparse vector multiplication (SpMSpV) over a semiring
• The third is handled by a lightweight sorting function
Exploring the next-level vertices via SpMSpV
• Overload (multiply, add) with (select2nd, min): each edge from a frontier vertex contributes the parent's label (select2nd), and a child reached from several parents keeps the minimum label (min)
[Diagram: current frontier (sparse vector x), adjacency matrix, and next frontier (sparse vector y) for the example graph]
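A minimal sketch of this frontier expansion, assuming a SciPy CSR adjacency matrix and a dictionary as the sparse frontier vector; the name `spmspv_select2nd_min` is hypothetical, and the caller is assumed to mask already-labeled vertices out of the result.

```python
import scipy.sparse as sp  # A is assumed to be a scipy.sparse.csr_matrix

def spmspv_select2nd_min(A, frontier):
    """One frontier expansion, y = A^T x over the (select2nd, min) semiring.

    A        : CSR adjacency matrix (symmetric for an undirected graph).
    frontier : dict {vertex: its Cuthill-McKee label} (the sparse vector x).
    Returns  : dict {child: minimum parent label reaching it} (the sparse vector y);
               already-labeled vertices are filtered out by the caller.
    """
    y = {}
    for v, label in frontier.items():
        # neighbors of v, read directly from the CSR structure of row v
        for u in A.indices[A.indptr[v]:A.indptr[v + 1]]:
            # multiply = select2nd: the edge contributes the parent's label
            # add      = min      : keep the smallest parent label seen so far
            if u not in y or label < y[u]:
                y[u] = label
    return y
```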
Ordering vertices via partial sorting
• Sort the degrees of siblings: many instances of small sorts (avoids an expensive full parallel sort)
• Example: next-frontier vertices c, f, h with (parent's label, degree) = (2, 4), (3, 2), (2, 1)
– Rules for ordering vertices:
1. c and h are ordered before f (smaller parent label)
2. h is ordered before c (same parent label, smaller degree)
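A sketch of this partial sort, taking the output of the SpMSpV step above; the helper `label_next_frontier` is hypothetical and uses a serial sort to stand in for the parallel version.

```python
def label_next_frontier(y, degree, next_label):
    """Assign Cuthill-McKee labels to the next frontier.

    y          : dict {vertex: parent's label} produced by the SpMSpV step.
    degree     : degree[v] = degree of vertex v.
    next_label : first unused Cuthill-McKee label.
    Vertices sharing a parent form small sibling groups, so the lexicographic
    sort on (parent's label, degree) is many tiny sorts rather than one big one.
    """
    ordered = sorted(y, key=lambda v: (y[v], degree[v]))
    return {v: next_label + i for i, v in enumerate(ordered)}

# With the slide's example, c, f, h having (parent label, degree) = (2, 4), (3, 2), (2, 1),
# the resulting order is h, c, f.
```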
Distributed-memory parallelization (SpMSpV)
• The p processors are arranged in a √p × √p processor grid; the matrix A and the frontier vector x are distributed over the grid
ALGORITHM:
1. Gather frontier vertices within the processor column [communication]
2. Local multiplication [computation]
3. Find the owners of the current frontier's adjacency and exchange adjacencies within the processor row [communication]
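A simplified mpi4py sketch of the three steps above on a √p × √p process grid; the block layout, the `vertex % √p` ownership rule, and the name `spmspv_2d_step` are assumptions made for illustration, not the talk's actual (CombBLAS-style) implementation.

```python
from mpi4py import MPI
import math

def spmspv_2d_step(A_block, frontier_local, comm=MPI.COMM_WORLD):
    """Sketch of one distributed SpMSpV step on a sqrt(p) x sqrt(p) process grid.

    A_block        : dict {frontier_vertex: [neighbor, ...]} with the edges stored
                     in this process's block of the adjacency matrix.
    frontier_local : dict {vertex: parent_label}, the locally owned part of the
                     sparse frontier vector.
    """
    p, rank = comm.Get_size(), comm.Get_rank()
    pr = math.isqrt(p)                                   # grid dimension
    assert pr * pr == p, "process count must be a perfect square"
    grid_row, grid_col = divmod(rank, pr)
    row_comm = comm.Split(color=grid_row, key=grid_col)  # processes in my grid row
    col_comm = comm.Split(color=grid_col, key=grid_row)  # processes in my grid column

    # 1. Gather frontier vertices within the processor column [communication]
    frontier_col = {}
    for chunk in col_comm.allgather(frontier_local):
        frontier_col.update(chunk)

    # 2. Local multiplication over the (select2nd, min) semiring [computation]
    outgoing = [dict() for _ in range(pr)]
    for v, label in frontier_col.items():
        for child in A_block.get(v, ()):
            dest = child % pr                            # assumed owner within my row
            prev = outgoing[dest].get(child, label)
            outgoing[dest][child] = min(prev, label)

    # 3. Exchange discovered adjacencies within the processor row and merge
    #    duplicates with min [communication]
    merged = {}
    for chunk in row_comm.alltoall(outgoing):
        for child, label in chunk.items():
            merged[child] = min(label, merged.get(child, label))

    row_comm.Free()
    col_comm.Free()
    return merged
```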
Distributed-memory partial sorting
• Bin vertices by their parents' labels
– All vertices in a bin are assigned to a single node
– Needs an AllToAll communication
• Sequentially sort the degrees of the vertices in each bin on its node
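A sketch of the binning and AllToAll exchange, again with mpi4py; the modulo mapping of bins to ranks, the omission of the final label assignment, and the helper `distributed_partial_sort` are illustrative assumptions.

```python
from mpi4py import MPI

def distributed_partial_sort(y_local, degree, comm=MPI.COMM_WORLD):
    """Sketch of the distributed partial sort.

    y_local : dict {vertex: parent's label} held locally after the SpMSpV step.
    degree  : mapping from vertex to its degree.
    Returns the locally owned bins, sorted by (parent's label, degree).
    """
    p = comm.Get_size()

    # Bin vertices by their parents' labels; each bin lives on exactly one rank.
    bins = [[] for _ in range(p)]
    for v, parent_label in y_local.items():
        bins[parent_label % p].append((parent_label, degree[v], v))

    # AllToAll communication: every rank receives all vertices of the bins it owns.
    received = [item for chunk in comm.alltoall(bins) for item in chunk]

    # Sequential sort of the received (parent label, degree) pairs on this rank.
    received.sort()
    return [v for (_, _, v) in received]
```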
Computation and communication complexity

Operation | Per-processor computation (lower bound) | Per-processor comm (latency) | Per-processor comm (bandwidth)
SpMSpV    | (m + n) / p                             | diameter · α·√p              | β · m / √p
Sorting   | (n / p) · log(n / p)                    | diameter · α·p               | β · n / p

n: number of vertices, m: number of edges
α: latency (0.25 μs to 3.7 μs MPI latency on Edison)
β: inverse bandwidth (~8 GB/sec MPI bandwidth on Edison)
p: number of processors
Other aspects of the algorithm
• Finding a pseudo-peripheral start vertex
– Repeated application of the usual BFS (no ordering of vertices within a level)
• Our SpMSpV is a hybrid OpenMP-MPI implementation
– Multithreaded SpMSpV is itself fairly involved and is the subject of separate work
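For reference, a serial sketch of a standard pseudo-peripheral vertex search by repeated BFS (in the spirit of the George-Liu heuristic), matching the "repeated application of the usual BFS" above; the talk's distributed version parallelizes each BFS, which is not shown here.

```python
from collections import deque

def pseudo_peripheral_vertex(adj, start=0):
    """Repeated-BFS search for a pseudo-peripheral vertex (serial sketch)."""
    def bfs_eccentricity(src):
        level = {src: 0}
        q = deque([src])
        while q:
            v = q.popleft()
            for u in adj[v]:
                if u not in level:
                    level[u] = level[v] + 1
                    q.append(u)
        ecc = max(level.values())
        last_level = [v for v, l in level.items() if l == ecc]
        return ecc, last_level

    ecc, last_level = bfs_eccentricity(start)
    while True:
        # restart from a minimum-degree vertex of the deepest BFS level
        candidate = min(last_level, key=lambda v: len(adj[v]))
        new_ecc, new_last = bfs_eccentricity(candidate)
        if new_ecc <= ecc:          # eccentricity stopped growing: good enough
            return candidate
        ecc, last_level = new_ecc, new_last
```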
Results: scalability on NERSC/Edison (6 threads per MPI process)
Matrix: dielFilterV3real (#vertices: 1.1M, #edges: 89M); bandwidth before: 1,036,475, after: 23,813
[Stacked bar chart: time (sec) vs. number of cores (1 to 4,056), broken down into Peripheral: SpMSpV, Peripheral: Other, Ordering: SpMSpV, Ordering: Sorting, Ordering: Other; roughly 30x speedup at the largest core count]
• Communication dominates at scale
Scalability on NERSC/Edison (6 threads per MPI process)
Matrix: nlpkkt240 (#vertices: 78M, #edges: 760M); bandwidth before: 14,169,841, after: 361,755
[Stacked bar chart: time (sec) vs. number of cores (54 to 4,056), same breakdown as the previous slide]
• Larger graphs continue scaling
Single-node performance on NERSC/Edison (2x12 cores)
• Comparison against the SpMP (Sparse Matrix Pre-processing) package by Park et al. (https://github.com/jspark1105/SpMP)
• We switch to MPI+OpenMP after 12 cores
• Matrix: ldoor (#vertices: 1M, #edges: 42M)
[Plot: time (s) vs. number of cores (1 to 32) for SpMP and our algorithm]
• If the matrix is already distributed on 1K cores (~45 nodes), the time to gather it is 0.82 s, making the distributed algorithm more profitable
Conclusions
• For many practical problems, the RCM ordering expedites iterative solvers
• No scalable distributed-memory algorithm for RCM ordering existed, forcing users to gather an already distributed matrix onto one node and run a serial algorithm (e.g., in PETSc), which is expensive
• We developed a distributed-memory RCM algorithm using SpMSpV and partial sorting
• The algorithm scales up to 1K cores on modern supercomputers
Thanks for your attention