
Introduction to StarNEig: A Task-based Library for Solving Nonsymmetric Eigenvalue Problems. Mirko Myllykoski and Carl Christian Kjelgaard Mikkelsen, in collaboration with Angelika Schwarz, Lars Karlsson, Bo Kågström, and Mahmoud Eljammaly.


1. Introduction to StarNEig: A Task-based Library for Solving Nonsymmetric Eigenvalue Problems. Mirko Myllykoski and Carl Christian Kjelgaard Mikkelsen, in collaboration with Angelika Schwarz, Lars Karlsson, Bo Kågström, and Mahmoud Eljammaly. Department of Computing Science, Umeå University. PPAM 2019.

2. Eigenvalue problem
◮ StarNEig library aims to implement a complete stack of algorithms for solving dense nonsymmetric eigenvalue problems.
◮ Both standard (Ax_i = λ_i x_i) and generalized (Ax_i = λ_i Bx_i) eigenvalue problems are considered.

3. Eigenvalue problem (algorithm stack)
Figure: An illustration of the complete algorithm stack in the standard case (preprocessing, Hessenberg reduction, Schur reduction with AED, eigenvalue reordering, and eigenvectors, producing eigenvalues and invariant subspaces).

4. Motivation (eigenvalue reordering)
◮ In some cases, we want to reorder the Schur form S such that a selected cluster of eigenvalues appears in the leading diagonal blocks of the updated Schur form S̃.
◮ Gives an orthonormal basis for a desired invariant subspace.
Figure: An illustration of the reordering process in the standard case.
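For reference, the relation behind this slide (standard Schur reordering theory, not anything StarNEig-specific): an orthogonal Q is computed such that

    \tilde{S} = Q^T S Q =
    \begin{bmatrix} \tilde{S}_{11} & \tilde{S}_{12} \\ 0 & \tilde{S}_{22} \end{bmatrix},
    \qquad
    \Lambda(\tilde{S}_{11}) = \{ \text{selected eigenvalues} \}.

If S = Z^T A Z is the original Schur decomposition and k is the order of \tilde{S}_{11}, then the first k columns of ZQ form an orthonormal basis for the invariant subspace of A associated with the selected eigenvalues.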

5. Motivation (accumulated transformations)
◮ A modern algorithm
  ◮ groups a set of orthogonal transformations together and
  ◮ initially applies them only within a small diagonal window.
◮ The transformations are accumulated and later propagated with level 3 BLAS operations.
Figure: An illustration of accumulated transformations (group the transformations, apply them locally, propagate with BLAS-3 updates).
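To make the propagation step concrete, here is a minimal stand-alone sketch (not StarNEig code; the function name, argument layout, and the use of CBLAS are illustrative choices) of applying an accumulated w-by-w orthogonal matrix Q to one off-diagonal block from the right with a single level-3 BLAS call. The mirrored update from the left, Q^T times the block row of the window, is analogous.

    #include <cblas.h>
    #include <string.h>

    /* Propagate an accumulated w-by-w orthogonal matrix Q to an off-diagonal
       block: A_block <- A_block * Q, where A_block is m-by-w and stored in
       column-major order with leading dimension lda. The caller supplies an
       m-by-w scratch buffer in work. (Illustrative sketch only.) */
    static void propagate_from_right(int m, int w, double *A_block, int lda,
                                     const double *Q, double *work)
    {
        /* work <- A_block * Q (one level-3 BLAS update) */
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    m, w, w, 1.0, A_block, lda, Q, w, 0.0, work, m);

        /* copy the updated block back, column by column */
        for (int j = 0; j < w; j++)
            memcpy(A_block + (size_t)j * lda, work + (size_t)j * m,
                   m * sizeof(double));
    }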

6. Motivation (concurrent windows)
◮ Multiple diagonal windows can be active concurrently.
◮ The level 3 BLAS updates must be propagated in a sequentially consistent order.
◮ Requires careful coordination!
Figure: An illustration of two concurrent windows (the propagated updates conflict).

7. Motivation (ScaLAPACK-style approach)
◮ Eigenvalue reordering is implemented in ScaLAPACK¹.
◮ With p cores, we can have up to √p concurrent windows.
◮ The transformations are broadcast and applied in parallel.
◮ The theoretically possible degree of parallelism is p,
  ◮ but only if we have √p concurrent windows.
Figure: An illustration of a ScaLAPACK-style algorithm.
¹ Granat, R., Kågström, B., Kressner, D.: Parallel eigenvalue reordering in real Schur forms. Concurrency and Computation: Practice and Experience 21(9), 1225–1250 (2009).

8. Motivation (task-based approach and task graphs)
◮ Computational work is cut into self-contained tasks.
◮ A runtime system
  ◮ derives the task dependences and
  ◮ schedules the tasks to computational resources.
◮ The task dependences can be visualized as a task graph (see the sketch after this slide).
Figure: A simplified task graph arising from eigenvalue reordering.
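To illustrate what it means for a runtime system to derive the dependences, here is a minimal stand-alone StarPU example (generic StarPU usage, not StarNEig internals; the codelet, buffer, and scaling factors are invented for illustration). Two tasks are inserted with no explicit synchronization; because both access the same data handle in read-write mode, StarPU executes them in insertion order, while tasks on disjoint data would be free to run concurrently.

    #include <starpu.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Task body: scales the registered vector in place. */
    static void scale_cpu(void *buffers[], void *cl_arg)
    {
        double factor;
        starpu_codelet_unpack_args(cl_arg, &factor);

        int n = STARPU_VECTOR_GET_NX(buffers[0]);
        double *x = (double *)STARPU_VECTOR_GET_PTR(buffers[0]);
        for (int i = 0; i < n; i++)
            x[i] *= factor;
    }

    static struct starpu_codelet scale_cl = {
        .cpu_funcs = { scale_cpu },
        .nbuffers = 1,
        .modes = { STARPU_RW },
    };

    int main(void)
    {
        double x[4] = { 1.0, 2.0, 3.0, 4.0 };
        double two = 2.0, three = 3.0;

        if (starpu_init(NULL) != 0)
            return 1;

        /* Hand the data over to the runtime system. */
        starpu_data_handle_t x_handle;
        starpu_vector_data_register(&x_handle, STARPU_MAIN_RAM,
                                    (uintptr_t)x, 4, sizeof(double));

        /* Both tasks touch x_handle in RW mode, so the runtime serializes
           them; no barriers or explicit dependences are written by hand. */
        starpu_task_insert(&scale_cl, STARPU_RW, x_handle,
                           STARPU_VALUE, &two, sizeof(two), 0);
        starpu_task_insert(&scale_cl, STARPU_RW, x_handle,
                           STARPU_VALUE, &three, sizeof(three), 0);

        starpu_task_wait_for_all();
        starpu_data_unregister(x_handle);
        starpu_shutdown();

        printf("x[0] = %.1f\n", x[0]);   /* 1.0 * 2.0 * 3.0 = 6.0 */
        return 0;
    }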

9. Motivation (more opportunities for concurrency)
◮ Real-life task graphs are much more complex,
◮ but they enclose more opportunities for increased concurrency.
◮ The runtime system unrolls the task graph.
◮ No global synchronization.
◮ Computational steps are allowed to overlap and merge.
Figure: An illustration of a task-based algorithm².
² Myllykoski, M.: A task-based algorithm for reordering the eigenvalues of a matrix in real Schur form. In: Parallel Processing and Applied Mathematics, PPAM 2017. LNCS, vol. 10777, pp. 207–216. Springer International Publishing (2018).

10. Motivation (GPUs, distributed memory, other benefits)
◮ Other benefits of the task-based approach include
  ◮ better load balancing,
  ◮ task priorities,
  ◮ accelerator (GPU) support with performance models,
  ◮ automatic data transfers between memory spaces, and
  ◮ implicit MPI communications (see the sketch after this slide).
Figure: An illustration of implicit MPI communications between nodes (sends and receives are issued by the runtime system).
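The implicit MPI communications can likewise be sketched with StarPU-MPI, the layer StarNEig builds on in distributed memory (again generic StarPU-MPI usage rather than StarNEig code; all names are illustrative). Each handle has an owner rank, every rank inserts the same task, and the runtime issues the required sends and receives itself, by default executing the task on the rank that owns the data being written. Run with at least two MPI ranks, e.g. mpirun -np 2.

    #include <starpu.h>
    #include <starpu_mpi.h>
    #include <stdint.h>

    /* Task body: b <- b + a. */
    static void axpy_cpu(void *buffers[], void *cl_arg)
    {
        (void)cl_arg;
        int n = STARPU_VECTOR_GET_NX(buffers[0]);
        const double *a = (const double *)STARPU_VECTOR_GET_PTR(buffers[0]);
        double *b = (double *)STARPU_VECTOR_GET_PTR(buffers[1]);
        for (int i = 0; i < n; i++)
            b[i] += a[i];
    }

    static struct starpu_codelet axpy_cl = {
        .cpu_funcs = { axpy_cpu },
        .nbuffers = 2,
        .modes = { STARPU_R, STARPU_RW },
    };

    int main(int argc, char **argv)
    {
        double a[8], b[8];
        for (int i = 0; i < 8; i++) { a[i] = 1.0; b[i] = 0.0; }

        starpu_init(NULL);
        starpu_mpi_init(&argc, &argv, 1);   /* let StarPU initialize MPI */

        int rank;
        starpu_mpi_comm_rank(MPI_COMM_WORLD, &rank);

        /* Rank 0 owns a, rank 1 owns b; the other ranks register placeholders. */
        starpu_data_handle_t ha, hb;
        starpu_vector_data_register(&ha, rank == 0 ? STARPU_MAIN_RAM : -1,
                                    rank == 0 ? (uintptr_t)a : (uintptr_t)NULL,
                                    8, sizeof(double));
        starpu_vector_data_register(&hb, rank == 1 ? STARPU_MAIN_RAM : -1,
                                    rank == 1 ? (uintptr_t)b : (uintptr_t)NULL,
                                    8, sizeof(double));
        starpu_mpi_data_register(ha, /* tag */ 10, /* owner */ 0);
        starpu_mpi_data_register(hb, /* tag */ 11, /* owner */ 1);

        /* The task writes hb, so it runs on rank 1; the runtime sends a from
           rank 0 to rank 1 automatically (no MPI calls are written here). */
        starpu_mpi_task_insert(MPI_COMM_WORLD, &axpy_cl,
                               STARPU_R, ha, STARPU_RW, hb, 0);

        starpu_task_wait_for_all();
        starpu_data_unregister(ha);
        starpu_data_unregister(hb);
        starpu_mpi_shutdown();
        starpu_shutdown();
        return 0;
    }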

11. StarNEig library (overview)
◮ Designed and implemented at Umeå University as a part of the NLAFET project.
◮ Runs on top of the StarPU task-based runtime system.
◮ Targets both
  ◮ shared memory and
  ◮ distributed memory machines.
◮ Some components of the library support GPUs.
◮ Real arithmetic supported, complex arithmetic planned.
◮ Beta release (v0.1-beta.2) available at https://github.com/NLAFET/StarNEig.

12. StarNEig library (current status)

                  Standard case                       Generalized case
                  SM      DM           GPU            SM       DM           GPU
Hessenberg        Ready   Wrapper      Ready          Wrapper  Wrapper      Not planned
Schur             Ready   Ready        Experimental   Ready    Ready        Experimental
Reordering        Ready   Ready        Experimental   Ready    Ready        Experimental
Eigenvectors      Ready   In progress  Not planned    Ready    In progress  Not planned

Legend: Ready / Experimental / Wrapper = LAPACK or ScaLAPACK wrapper / In progress / Not planned.

13. Distributed memory (data distribution)
◮ StarNEig distributes matrices in rectangular blocks of a uniform size.
◮ The user has three options:
  1. Use the default data distribution.
  2. Use a two-dimensional block cyclic distribution.
  3. Define a data distribution function d : Z+ × Z+ → Z+ that maps the block indices to the MPI rank space (a sketch follows this slide).
Figure: Examples of various data distributions: (a) 2D block cyclic, (b) arbitrary.
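As an illustration of option 3, here is a sketch of a user-defined distribution function, in this case a plain two-dimensional block cyclic mapping onto a pm-by-pn process mesh written out by hand. The mapping function itself is ordinary C; how it is registered with StarNEig is left as a comment and marked as an assumption, since only starneig_distr_init_mesh appears on slide 16 and the exact name and signature of a function-based constructor should be checked against the StarNEig documentation.

    /* Parameters for a hand-written 2D block cyclic mapping. */
    struct mesh_arg {
        int pm;   /* process rows    */
        int pn;   /* process columns */
    };

    /* d : (block row, block column) -> MPI rank.
       Block (i, j) is owned by process (i mod pm, j mod pn) of the mesh,
       with the mesh enumerated in row-major order. */
    static int block_cyclic_owner(int i, int j, void *arg)
    {
        const struct mesh_arg *mesh = (const struct mesh_arg *)arg;
        return (i % mesh->pm) * mesh->pn + (j % mesh->pn);
    }

    /*
     * Registering the function is assumed to look roughly as follows;
     * starneig_distr_init_func is an assumption on my part and is not
     * verified against the StarNEig API (compare starneig_distr_init_mesh
     * on slide 16):
     *
     *   struct mesh_arg mesh = { pm, pn };
     *   starneig_distr_t distr =
     *       starneig_distr_init_func(&block_cyclic_owner, &mesh, sizeof(mesh));
     */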

14. Distributed memory (block size)
◮ StarNEig divides the distributed blocks into square tiles.
◮ The tile size is closely connected to task granularity.
  ◮ Tiny tile size ⇒ fine-grained task granularity ⇒ large scheduling overhead.
◮ The distributed blocks should therefore be relatively large.
◮ Many ScaLAPACK-style codes are designed for / perform optimally with smaller block sizes.
Figure: An illustration of how the blocks are divided into tiles.

15. Distributed memory (CPU core mapping)
◮ StarPU manages a set of worker threads.
  ◮ Usually one thread per CPU core / GPU, plus an MPI thread.
◮ One process per node (1ppn) configuration required.
  ◮ A node can be, e.g., a full compute node or a NUMA island.
◮ Many ScaLAPACK-style codes are designed for / perform optimally in a one process per core (1ppc) configuration.
Figure: Illustrations of CPU core mappings and data distributions (StarNEig: one process per node with many worker threads; ScaLAPACK: one MPI rank per core).

16. Distributed memory (ScaLAPACK compatibility)
◮ StarNEig is compatible with ScaLAPACK and provides a ScaLAPACK compatibility layer:

    // create a 2D block cyclic data distribution (pm X pn process mesh)
    starneig_distr_t distr =
        starneig_distr_init_mesh(pm, pn, STARNEIG_ORDER_DEFAULT);

    // create a n X n distributed matrix (bn X bn blocks)
    starneig_distr_matrix_t dA =
        starneig_distr_matrix_create(n, n, bn, bn, STARNEIG_REAL_DOUBLE, distr);

    ...

    // convert the data distribution to a BLACS context
    starneig_blacs_context_t context = starneig_distr_to_blacs_context(distr);

    // convert the distributed matrix to a BLACS descriptor and a local buffer
    starneig_blacs_descr_t descr_a;
    double *local_a;
    starneig_distr_matrix_to_blacs_descr(dA, context, &descr_a, (void **)&local_a);

    // ScaLAPACK subroutine for reducing a general distributed matrix to
    // upper Hessenberg form
    extern void pdgehrd_(int const *, int const *, int const *, double *,
        int const *, int const *, starneig_blacs_descr_t const *, double *,
        double *, int const *, int *);

    pdgehrd_(&n, &ilo, &ihi, local_a, &ia, &ja, &descr_a, tau, ...);

17. Regarding the presented numerical results
◮ Computational experiments were performed on the Kebnekaise system, HPC2N, Umeå University.
  ◮ Regular compute node: 28 Intel Xeon E5-2690v4 (Broadwell) cores, 128 GB memory, FDR Infiniband.
  ◮ V100 GPU node: 28 Intel Xeon Gold 6132 (Skylake) cores, 192 GB memory, two NVIDIA Tesla V100 GPUs.
◮ The results are extracted from
  ◮ Mirko Myllykoski, Carl Christian Kjelgaard Mikkelsen, Angelika Schwarz, Bo Kågström: D2.7 Eigenvalue solvers for nonsymmetric problems, public NLAFET deliverable, 2019.

18. Schur reduction (distributed memory performance)
Figure: StarNEig versus the ScaLAPACK-style approach (relative run-time improvement). (a) Standard case³: StarNEig versus PDHSEQR, a 1.6–2.9-fold speedup over matrix dimensions 20k–120k. (b) Generalized case⁴: StarNEig versus PDHGEQZ, a 1.4–7.9-fold speedup over matrix dimensions 20k–80k.
³ https://github.com/NLAFET/SEVP-PDHSEQR-Alg953/
⁴ https://github.com/NLAFET/GEVP-PDHGEQZ

19. Schur reduction (distributed memory scalability)
Figure: Run time versus matrix dimension (20k–160k) for 1, 4, 9, 16, and 25 nodes. Standard case, 28 cores / node, max 700 cores.
