Energy Efficient, High-Performance Solvers through Small Dense Matrix Computations on GPUs Azzam Haidar and Stan Tomov with J. Dongarra, P. Luszczek, I. Yamazaki, M. Gates, H. Anzt, and T. Dong University of Tennessee, Knoxville GPU Technology Conference March 17 – 20, 2015 San Jose, CA
Outline
• Motivation
• Overview of the MAGMA library
  – Hybrid GPU+CPU dense linear algebra
  – Sparse linear algebra
  – Batched linear algebra
  – Power considerations and GPU-only algorithms
• Batched one-sided factorizations
• Future Directions
Motivation
• Important scientific applications rely on sparse linear algebra
• K40 GPU computing efficiency on compute-intensive (dense LU) vs. memory-bound computation (SpMV)
Motivation
• Many dense and sparse direct solvers need high-performance, energy-efficient LA on many small independent dense matrices:
  – Tiled linear algebra algorithms
  – Multifrontal methods
  – Preconditioners (using DLA) in sparse iterative solvers, many applications, ...
• Batched LA in a DAG-based factorization of a sparse/dense matrix system:
  – LU, QR, or Cholesky on small diagonal matrices
  – TRSMs, QRs, or LUs; TRSMs, TRMMs
  – Updates (Schur complement): GEMMs, SYRKs, TRMMs
  – And many other BLAS/LAPACK routines, e.g., for application-specific solvers, preconditioners, and matrices
  (A batched GEMM sketch follows this list.)
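To make the "many small independent matrices" pattern concrete, here is a minimal sketch of a batched GEMM on the GPU. It uses cuBLAS's cublasDgemmBatched rather than MAGMA's own batched interface, and the batch count, matrix size, and array names are illustrative assumptions, not values from the talk.

    // Minimal sketch: one call launches many small, independent DGEMMs on the GPU.
    // Batch count and matrix size are made up for illustration; the matrices are
    // left uninitialized since only the calling pattern matters here.
    #include <cublas_v2.h>
    #include <cuda_runtime.h>
    #include <stdlib.h>

    int main(void) {
        const int batch = 1000, n = 32;            // 1000 independent 32x32 problems (assumed)
        double *dA, *dB, *dC, **dA_array, **dB_array, **dC_array;
        cudaMalloc(&dA, (size_t)batch * n * n * sizeof(double));
        cudaMalloc(&dB, (size_t)batch * n * n * sizeof(double));
        cudaMalloc(&dC, (size_t)batch * n * n * sizeof(double));

        // Batched BLAS takes arrays of per-matrix pointers; build them on the host.
        double **hA = malloc(batch * sizeof(double*));
        double **hB = malloc(batch * sizeof(double*));
        double **hC = malloc(batch * sizeof(double*));
        for (int i = 0; i < batch; ++i) {
            hA[i] = dA + (size_t)i * n * n;
            hB[i] = dB + (size_t)i * n * n;
            hC[i] = dC + (size_t)i * n * n;
        }
        cudaMalloc(&dA_array, batch * sizeof(double*));
        cudaMalloc(&dB_array, batch * sizeof(double*));
        cudaMalloc(&dC_array, batch * sizeof(double*));
        cudaMemcpy(dA_array, hA, batch * sizeof(double*), cudaMemcpyHostToDevice);
        cudaMemcpy(dB_array, hB, batch * sizeof(double*), cudaMemcpyHostToDevice);
        cudaMemcpy(dC_array, hC, batch * sizeof(double*), cudaMemcpyHostToDevice);

        cublasHandle_t handle;
        cublasCreate(&handle);
        const double alpha = 1.0, beta = 0.0;
        // All 'batch' GEMMs are scheduled by a single call -- the pattern batched LA builds on.
        cublasDgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                           &alpha, (const double * const *)dA_array, n,
                                   (const double * const *)dB_array, n,
                           &beta,  dC_array, n, batch);
        cudaDeviceSynchronize();

        cublasDestroy(handle);
        cudaFree(dA); cudaFree(dB); cudaFree(dC);
        cudaFree(dA_array); cudaFree(dB_array); cudaFree(dC_array);
        free(hA); free(hB); free(hC);
        return 0;
    }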
MAGMA: LAPACK for GPUs
• MAGMA – Matrix Algebra on GPU and Multicore Architectures
  – LAPACK/ScaLAPACK on hybrid architectures
  – http://icl.cs.utk.edu/magma/
• Next generation of DLA software: MAGMA hybrid algorithms
  – hybrid scheduler
  – hybrid kernels (for nested parallelism, heterogeneity friendly)
• Developers & collaborators:
  – UTK, UC Berkeley, UC Denver, INRIA (France), KAUST (Saudi Arabia)
  – Community effort, similar to LAPACK/ScaLAPACK
Key Features of MAGMA 1.6
HYBRID ALGORITHMS
MAGMA uses hybrid algorithms where the computation is split into tasks of varying granularity and their execution is scheduled over the hardware components. Scheduling can be static or dynamic. In either case, small non-parallelizable tasks, often on the critical path, are scheduled on the CPU, and larger, more parallelizable ones, often Level 3 BLAS, are scheduled on the GPUs.
PERFORMANCE & ENERGY EFFICIENCY
Methodology overview
A methodology to use all available resources:
• MAGMA uses hybrid CPU+GPU algorithms (small tasks for multicores, large tasks for GPUs) based on
  – Representing linear algebra algorithms as collections of tasks and data dependencies among them
  – Properly scheduling tasks' execution over the multicore and GPU hardware components
• Successfully applied to fundamental linear algebra algorithms
  – One- and two-sided factorizations and solvers
  – Iterative linear solvers and eigensolvers
• Productivity
  – 1) High level; 2) Leveraging prior developments; 3) Exceeding in performance homogeneous solutions
A Hybrid Algorithm Example
From sequential LAPACK to parallel hybrid MAGMA: left-looking hybrid Cholesky.

MAGMA (hybrid) version:

     1  for (j = 0; j < n; j += nb) {
     2      jb = min(nb, n-j);
     3      magma_zherk( MagmaUpper, MagmaConjTrans, jb, j,
                         one, dA(0,j), ldda, one, dA(j,j), ldda, queue );
     4      magma_zgetmatrix_async( jb, jb, dA(j,j), ldda, work, jb, queue, &event );
     5      if (j+jb < n)
     6          magma_zgemm( MagmaConjTrans, MagmaNoTrans, jb, n-j-jb, j, one,
                             dA(0,j), ldda, dA(0,j+jb), ldda, one, dA(j,j+jb), ldda, queue );
     7      magma_event_sync( event );
     8      zpotrf( MagmaUpperStr, &jb, work, &jb, info );
     9      if (info != 0)
    10          *info += j;
    11      magma_zsetmatrix_async( jb, jb, work, jb, dA(j,j), ldda, queue, &event );
    12      if (j+jb < n) {
    13          magma_event_sync( event );
    14          magma_ztrsm( MagmaLeft, MagmaUpper, MagmaConjTrans, MagmaNonUnit, jb, n-j-jb,
                             one, dA(j,j), ldda, dA(j,j+jb), ldda, queue );
            }
        }

The sequential LAPACK version (shown alongside on the slide) has the same loop structure, calling zherk, zgemm, zpotrf, and ztrsm on the corresponding submatrices on the CPU.

Note:
• MAGMA and LAPACK look similar
• The difference is the lines highlighted in red on the slide: the asynchronous data transfers, events, and queue arguments that specify data movement and dependencies
• These differences are further hidden in a dynamic scheduler, making the top-level representation of MAGMA algorithms almost identical to LAPACK
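For completeness, a user-level driver for this hybrid factorization looks roughly as follows. This is a minimal sketch assuming MAGMA's documented routine names (magma_init, magma_dmalloc, magma_dsetmatrix, magma_dpotrf_gpu, ...); exact signatures differ slightly between MAGMA 1.x and 2.x, so treat the calls as illustrative rather than a definitive API reference.

    /*
     * Sketch of a host driver: factor an SPD matrix on the GPU with MAGMA's
     * hybrid Cholesky. Routine names follow MAGMA's conventions, but the exact
     * signatures vary across MAGMA releases, so this is illustrative only.
     */
    #include <magma_v2.h>                                // <magma.h> in older (1.x) releases

    void factor_spd_on_gpu(magma_int_t n, double *hA, magma_int_t lda)
    {
        magma_init();

        magmaDouble_ptr dA;
        magma_int_t ldda = magma_roundup(n, 32);         // padded leading dimension on the GPU
        magma_dmalloc(&dA, (size_t)ldda * n);

        magma_queue_t queue;
        magma_queue_create(0, &queue);                   // queue on device 0

        magma_dsetmatrix(n, n, hA, lda, dA, ldda, queue);    // host -> device

        magma_int_t info;
        magma_dpotrf_gpu(MagmaUpper, n, dA, ldda, &info);    // hybrid CPU+GPU Cholesky

        magma_dgetmatrix(n, n, dA, ldda, hA, lda, queue);    // device -> host

        magma_queue_destroy(queue);
        magma_free(dA);
        magma_finalize();
    }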
A Hybrid Algorithm Example (continued)
MAGMA runtime environment for the left-looking hybrid Cholesky above:
• Scheduling can be static or dynamic
• Dynamic scheduling is based on QUARK
• CUDA streams are used to offload computation to the GPU
(The MAGMA and LAPACK listings and the note are the same as on the previous slide.)
A Hybrid Algorithm Example (continued)
[Execution trace of the hybrid Cholesky over time: tasks 3, 4, 6, 11, 13, and 14 are offloaded to the GPU through CUDA queues, task 8 (the panel zpotrf) runs on the CPU, and the PCIe transfers (tasks 4 and 11) run alongside the GPU work.]
• CPU task #8 and the CPU-GPU communications are overlapped with the GPU computations
(The code listing and the note are the same as on the previous slides.)
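The overlap visible in the trace, an asynchronous device-to-host copy of the panel, CPU work on that panel, and a copy back, all while independent GPU work proceeds in the same queue, can be sketched with the plain CUDA runtime API. The kernel, buffer, and function names below are placeholders, not MAGMA internals, and in a real code the host panel buffer would be pinned with cudaMallocHost so the copies are truly asynchronous.

    // Sketch of the overlap pattern from the trace: the CPU waits only for the
    // panel copy (an event), factors the panel, and sends it back, while the
    // trailing-matrix update queued behind the copy keeps the GPU busy.
    #include <cuda_runtime.h>

    __global__ void gpu_update(double *dA, int n)          // stand-in for the trailing GEMM update
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) dA[i] += 0.0;                            // placeholder work
    }

    static void cpu_factor_panel(double *panel, int jb)    // stand-in for the CPU zpotrf of the panel
    {
        (void)panel; (void)jb;                              // a real code would call LAPACK here
    }

    void overlapped_step(double *dA_panel, double *dA_trail, double *h_panel,
                         int jb, int n_trail)
    {
        cudaStream_t stream;
        cudaEvent_t  panel_arrived;
        cudaStreamCreate(&stream);
        cudaEventCreate(&panel_arrived);

        // 1) Start moving the panel to the CPU; record an event when the copy finishes.
        cudaMemcpyAsync(h_panel, dA_panel, (size_t)jb * jb * sizeof(double),
                        cudaMemcpyDeviceToHost, stream);
        cudaEventRecord(panel_arrived, stream);

        // 2) Queue the independent trailing-matrix update on the GPU (runs after the copy).
        gpu_update<<<(n_trail + 255) / 256, 256, 0, stream>>>(dA_trail, n_trail);

        // 3) The CPU waits only for the panel, factors it, and returns it asynchronously,
        //    overlapping with the GPU update from step 2.
        cudaEventSynchronize(panel_arrived);
        cpu_factor_panel(h_panel, jb);
        cudaMemcpyAsync(dA_panel, h_panel, (size_t)jb * jb * sizeof(double),
                        cudaMemcpyHostToDevice, stream);

        cudaStreamSynchronize(stream);
        cudaEventDestroy(panel_arrived);
        cudaStreamDestroy(stream);
    }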
Mixed precision iterative refinement
Solving general dense linear systems using mixed precision iterative refinement.
[Performance plot, GFlop/s vs. matrix size, comparing the SP solve, the DP solve via mixed-precision iterative refinement, and the plain DP solve. System: Keeneland; GPU M2090 (14 MPs @ 1.3 GHz, peak 583 GFlop/s); CPU Intel Xeon X5660 @ 2.80 GHz (2 x 6 cores).]
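The idea behind the technique is to do the O(n^3) factorization in fast single precision and recover double-precision accuracy with a cheap refinement loop. Below is a minimal host-side sketch of that loop using LAPACKE/CBLAS rather than MAGMA's GPU routine; the tolerance and iteration limit are illustrative, and MAGMA performs the same steps with the factorization and triangular solves on the GPU.

    // Mixed-precision iterative refinement for Ax = b: factor and solve in single
    // precision, compute residuals and corrections in double precision.
    #include <lapacke.h>
    #include <cblas.h>
    #include <math.h>
    #include <stdlib.h>

    int mp_refine(int n, const double *A, const double *b, double *x,
                  int max_iter, double tol)
    {
        float      *sA  = malloc((size_t)n * n * sizeof(float));
        float      *sx  = malloc((size_t)n * sizeof(float));
        double     *r   = malloc((size_t)n * sizeof(double));
        lapack_int *piv = malloc((size_t)n * sizeof(lapack_int));

        for (size_t i = 0; i < (size_t)n * n; ++i) sA[i] = (float)A[i];  // demote A once
        LAPACKE_sgetrf(LAPACK_COL_MAJOR, n, n, sA, n, piv);              // SP LU: the O(n^3) work

        for (int i = 0; i < n; ++i) sx[i] = (float)b[i];
        LAPACKE_sgetrs(LAPACK_COL_MAJOR, 'N', n, 1, sA, n, piv, sx, n);  // SP solve
        for (int i = 0; i < n; ++i) x[i] = (double)sx[i];

        int iters = max_iter;
        for (int it = 0; it < max_iter; ++it) {
            for (int i = 0; i < n; ++i) r[i] = b[i];                     // r = b - A*x in DP
            cblas_dgemv(CblasColMajor, CblasNoTrans, n, n, -1.0, A, n, x, 1, 1.0, r, 1);

            double rnorm = 0.0;
            for (int i = 0; i < n; ++i) rnorm = fmax(rnorm, fabs(r[i]));
            if (rnorm < tol) { iters = it; break; }                      // converged

            for (int i = 0; i < n; ++i) sx[i] = (float)r[i];             // solve A*z = r in SP
            LAPACKE_sgetrs(LAPACK_COL_MAJOR, 'N', n, 1, sA, n, piv, sx, n);
            for (int i = 0; i < n; ++i) x[i] += (double)sx[i];           // correct x in DP
        }
        free(sA); free(sx); free(r); free(piv);
        return iters;
    }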
Out of GPU Memory Algorithms
Solving large problems that do not fit in the GPU memory.
[Performance plot of DGETRF, GFlop/s vs. matrix size, for matrices of a size that does not fit in a specified GPU memory. System: Keeneland; GPU M2090 (14 MPs @ 1.3 GHz, peak 583 GFlop/s); CPU Intel Xeon X5660 @ 2.80 GHz (2 x 6 cores).]
Out-of-GPU-memory algorithms can now solve such large problems.
Out of GPU Memory Algorithm
• Perform left-looking factorizations on sub-matrices that fit in the GPU memory (using existing algorithms)
• The rest of the matrix stays on the CPU
• Left-looking versions minimize writing on the CPU
With an already factored sub-matrix A1 on the CPU, the next sub-matrix A2 to be factored on the GPU, and the rest of the matrix untouched:
  1) Copy A2 to the GPU
  2) Update A2 using A1 (a panel of A1 at a time)
  3) Factor the updated A2 using the existing hybrid code
  4) Copy the factored A2 back to the CPU
• Trivially extended to multi-GPUs: A2 is "larger", with a 1-D block-cyclic distribution, again reusing existing algorithms
(A sketch of the outer loop follows.)
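The loop below is a minimal sketch of this scheme under stated assumptions: the transfer calls follow MAGMA's magma_dsetmatrix/magma_dgetmatrix conventions, while update_with_panel() and hybrid_factor_in_gpu_memory() are hypothetical placeholders for the trailing update and the existing in-memory hybrid factorization; block sizes and buffer management are simplified.

    // Left-looking out-of-GPU-memory factorization, sketched. The full matrix hA
    // stays on the CPU; each block of columns A2 that fits on the GPU is brought
    // in, updated with the already factored part A1 one nb-wide panel at a time,
    // factored with the existing hybrid code, and written back.
    #include <magma_v2.h>

    void update_with_panel(magmaDouble_ptr dPanel, magma_int_t jb1,
                           magmaDouble_ptr dA2, magma_int_t jb2,
                           magma_queue_t queue);                    // hypothetical
    void hybrid_factor_in_gpu_memory(magmaDouble_ptr dA2, magma_int_t n,
                                     magma_int_t jb2);              // hypothetical

    void out_of_memory_factor(magma_int_t n, magma_int_t nb, magma_int_t gpu_cols,
                              double *hA, magma_int_t lda,          // full matrix on the CPU
                              magmaDouble_ptr dA2, magmaDouble_ptr dPanel,
                              magma_int_t ldda, magma_queue_t queue)
    {
        for (magma_int_t j2 = 0; j2 < n; j2 += gpu_cols) {
            magma_int_t jb2 = (gpu_cols < n - j2) ? gpu_cols : (n - j2);

            // 1) Copy the next column block A2 = A(:, j2 : j2+jb2) to the GPU.
            magma_dsetmatrix(n, jb2, &hA[(size_t)j2 * lda], lda, dA2, ldda, queue);

            // 2) Update A2 with the already factored columns A1 = A(:, 0 : j2),
            //    streaming one nb-wide panel of A1 through the GPU at a time.
            for (magma_int_t j1 = 0; j1 < j2; j1 += nb) {
                magma_int_t jb1 = (nb < j2 - j1) ? nb : (j2 - j1);
                magma_dsetmatrix(n, jb1, &hA[(size_t)j1 * lda], lda, dPanel, ldda, queue);
                update_with_panel(dPanel, jb1, dA2, jb2, queue);
            }

            // 3) Factor the updated A2 with the existing hybrid CPU+GPU code.
            hybrid_factor_in_gpu_memory(dA2, n, jb2);

            // 4) Copy the factored A2 back to the CPU; GPU buffers are reused next iteration.
            magma_dgetmatrix(n, jb2, dA2, ldda, &hA[(size_t)j2 * lda], lda, queue);
        }
    }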
MultiGPU Support
• Data distribution
  – 1-D block-cyclic distribution: nb-wide column blocks are assigned to GPU 0, 1, 2, 0, ... in round-robin order
• Algorithm
  – The GPU holding the current panel sends it to the CPU
  – All updates are done in parallel on the GPUs
  – Look-ahead is done with the GPU holding the next panel
(A sketch of the block-cyclic mapping follows.)
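As a small illustration of the 1-D block-cyclic layout, the mapping below assigns column block k (of width nb) to GPU k % ngpu and records where the block sits in that GPU's local storage; the sizes in main() are made up for the example.

    // 1-D block-cyclic column distribution over several GPUs: panels rotate over
    // the devices (0, 1, 2, 0, 1, 2, ... for ngpu = 3), which balances the updates
    // and lets the owner of the next panel run ahead (look-ahead).
    #include <stdio.h>

    typedef struct {
        int owner;        // which GPU holds this column block
        int local_block;  // position of the block within that GPU's local storage
    } block_loc_t;

    static block_loc_t block_cyclic_owner(int global_col, int nb, int ngpu)
    {
        int k = global_col / nb;                    // index of the nb-wide column block
        block_loc_t loc = { k % ngpu, k / ngpu };
        return loc;
    }

    int main(void)
    {
        int n = 8 * 64, nb = 64, ngpu = 3;          // illustrative sizes, not from the talk
        for (int j = 0; j < n; j += nb) {
            block_loc_t loc = block_cyclic_owner(j, nb, ngpu);
            printf("columns %4d..%4d -> GPU %d (local block %d)\n",
                   j, j + nb - 1, loc.owner, loc.local_block);
        }
        return 0;
    }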