

SPPEXA Annual Meeting 2016, January 25th, 2016, Garching, Germany. Feeding of the Thousands – Leveraging the GPU's Computing Power for Sparse Linear Algebra. Hartwig Anzt.


  1. SPPEXA Annual Meeting 2016, January 25th, 2016, Garching, Germany. Feeding of the Thousands – Leveraging the GPU's Computing Power for Sparse Linear Algebra. Hartwig Anzt

  2. Sparse Linear Algebra on GPUs
  • Inherently parallel operations
    • axpy, copy, gemv…
    • usually memory bound
    • kernel fusion
  • Sparse matrix vector product
    • often the computationally most expensive part
    • variety of storage formats and kernels
    • row-thread mapping can result in load imbalance and unfavorable memory access
  • Bottlenecks
    • sequential operations; unstructured, random memory access
    • incomplete factorizations (ILU/IC)
    • sparse triangular solves

  3. Sparse Matrix Vector Product (SpMV)
  • Sliced Ellpack (SELL) format as trade-off between CSR and Ellpack
  • Sorting rows by length can improve load balancing
  [Figure: the same sparse matrix stored in CSR, ELL, and SELL-P format as values, colind, and rowptr arrays; in CSR, rowptr points to the first element of each row, in SELL-P to the first element of each block]
  Kreutzer et al.: A Unified Sparse Matrix Data Format for Efficient General Sparse Matrix-Vector Multiplication on Modern Processors with Wide SIMD Units, SISC 36(5), 2014.
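The storage trade-off between ELL and SELL can be made concrete by counting padded entries. A minimal Python sketch (function names `csr_spmv`, `padded_entries_ell`, and `padded_entries_sell` are illustrative, not taken from any library):

```python
def csr_spmv(n, rowptr, colind, vals, x):
    """y = A*x with A in CSR format; rowptr points to the first element of each row."""
    y = [0.0] * n
    for i in range(n):
        for k in range(rowptr[i], rowptr[i + 1]):
            y[i] += vals[k] * x[colind[k]]
    return y

def padded_entries_ell(row_nnz):
    # ELL pads every row to the length of the longest row in the matrix
    return len(row_nnz) * max(row_nnz)

def padded_entries_sell(row_nnz, slice_size):
    # SELL pads each slice of rows only to the longest row in that slice,
    # so sorting rows by length keeps long rows together and reduces padding
    total = 0
    for s in range(0, len(row_nnz), slice_size):
        sl = row_nnz[s:s + slice_size]
        total += len(sl) * max(sl)
    return total

# a tiny CSR matrix [[1, 2], [0, 3]] applied to x = (1, 1)
assert csr_spmv(2, [0, 2, 3], [0, 1, 1], [1.0, 2.0, 3.0], [1.0, 1.0]) == [3.0, 3.0]

# row lengths sorted descending: SELL stores fewer padded entries than ELL
row_nnz = [4, 4, 1, 1]
assert padded_entries_ell(row_nnz) == 16
assert padded_entries_sell(row_nnz, 2) == 10
```

With unsorted rows the slices mix long and short rows and the SELL padding grows back toward the ELL count, which is why sorting improves the format.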

  4. Sparse Matrix Vector Product (SpMV)
  • Assign multiple threads to each row
    • 2-dimensional thread blocks
    • partial products combined via shared-memory reduction
  [Figure: thread blocks TB 0 and TB 1 accumulating partial sums per row, followed by a reduction]
  Kreutzer et al.: A Unified Sparse Matrix Data Format for Efficient General Sparse Matrix-Vector Multiplication on Modern Processors with Wide SIMD Units, SISC 36(5), 2014.
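The multiple-threads-per-row scheme can be mimicked sequentially. A Python sketch in which each simulated thread takes a stride-`threads` slice of one row and the per-thread partials are reduced at the end (the function name is illustrative; real kernels do the reduction in shared memory):

```python
def spmv_row_parallel(rowptr, colind, vals, x, i, threads=4):
    """One row of y = A*x, computed by `threads` simulated threads."""
    partial = [0.0] * threads
    # each "thread" t processes elements t, t + threads, t + 2*threads, ...
    # of row i, so consecutive threads touch consecutive memory locations
    for t in range(threads):
        for k in range(rowptr[i] + t, rowptr[i + 1], threads):
            partial[t] += vals[k] * x[colind[k]]
    # reduction over the partial sums (shared-memory reduction on the GPU)
    return sum(partial)

# a single row with 5 nonzeros: result is independent of the thread count
rowptr, colind = [0, 5], [0, 1, 2, 3, 4]
vals, x = [1.0, 2.0, 3.0, 4.0, 5.0], [1.0] * 5
assert spmv_row_parallel(rowptr, colind, vals, x, 0, threads=4) == 15.0
assert spmv_row_parallel(rowptr, colind, vals, x, 0, threads=1) == 15.0
```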

  5. Sparse Matrix Vector Product with multiple Vectors (SpMM)
  • 3-dimensional thread blocks for processing multiple vectors simultaneously
  [Figure: the 2D thread blocks of the SpMV kernel extended into the vector dimension; partial sums accumulated in shared memory per thread block, followed by a reduction]
  Anzt et al.: Energy efficiency and performance frontiers for sparse computations on GPU supercomputers, PMAM 2015.
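What makes SpMM faster than repeated SpMV is data reuse: each matrix entry is read once and applied to all vectors. A Python sketch of that access pattern (dense blocks `X` and `Y` are stored as lists of rows; names are illustrative):

```python
def csr_spmm(n, rowptr, colind, vals, X, m):
    """Y = A*X for CSR matrix A and a dense block X of m vectors (n x m)."""
    Y = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for k in range(rowptr[i], rowptr[i + 1]):
            a, j = vals[k], colind[k]
            # the matrix entry a is loaded once and reused for all m vectors,
            # instead of being re-read from memory for every SpMV
            for v in range(m):
                Y[i][v] += a * X[j][v]
    return Y

# A = [[1, 2], [0, 3]] applied to two vectors (1, 1) and (0, 1)
X = [[1.0, 0.0], [1.0, 1.0]]
assert csr_spmm(2, [0, 2, 3], [0, 1, 1], [1.0, 2.0, 3.0], X, 2) == [[3.0, 2.0], [3.0, 3.0]]
```

On the GPU this reuse is what the third thread-block dimension provides; here it shows up as the innermost loop over vectors.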

  6. Sparse Matrix Vector Product with multiple Vectors (SpMM)
  • 3-dimensional thread blocks for processing multiple vectors simultaneously
  • Performance on NVIDIA K40, 64 vectors, double precision
  [Figure: GFLOP/s (0–120) of cuSPARSE CSRSpMM, cuSPARSE CSRSpMM v2, and MAGMA SELL-P SpMM on the test matrices audikw_1, bmw3_2, bmwcra_1, bone_010, bone_S10, cant, crankseg_2, F1, Fault_639, Hook_1498, inline_1, ldoor, pwtk, Stoc_1465, stomach, xenon_2]
  Anzt et al.: Energy efficiency and performance frontiers for sparse computations on GPU supercomputers, PMAM 2015.

  7. Kernel Fusion in Sparse Iterative Algorithms
  • Memory bandwidth is in many cases the performance bottleneck
  • A sequence of consecutive vector updates (BLAS 1) benefits from enhanced data locality
  • Design of algorithm-specific kernels
  [Figure: two separate kernels, each with a read+write pass over every vector component per thread block, merged into a single fused kernel with one read+write pass]

  8. Kernel Fusion in Sparse Iterative Algorithms
  • Memory bandwidth is in many cases the performance bottleneck
  • A sequence of consecutive vector updates (BLAS 1) benefits from enhanced data locality

  CG iteration built from individual kernels:

    while ( ( k < maxiter ) && ( res > epsilon ) ) {
        Scalar_SpMV<<<Gs,Bs>>>( n, rowA, colA, valA, d, z );
        tmp = cublasSdot( n, d, 1, z, 1 );
        rho = beta / tmp;
        gamma = beta;
        cublasSaxpy( n, rho, d, 1, x, 1 );
        cublasSaxpy( n, -rho, z, 1, r, 1 );
        tmp = cublasSdot( n, r, 1, r, 1 );
        beta = tmp;
        alpha = beta / gamma;
        cublasSscal( n, alpha, d, 1 );
        cublasSaxpy( n, one, r, 1, d, 1 );
        res = sqrt( beta );
        k++;
    } // end-while

  The same iteration with fused kernels:

    while ( ( k < maxiter ) && ( res > epsilon ) ) {
        scalar_fusion_1<<<Gs,Bs,Ms>>>( n, rowA, colA, valA, d, z, beta, rho, gamma, vtmp );
        fusion_2( Gs, Bs, Ms, n, beta, rho, vtmp );
        fusion_3<<<Gs,Bs,Ms>>>( n, rho, d, x, z, r, vtmp );
        fusion_4( Gs, Bs, Ms, n, vtmp, vtmp2 );
        fusion_5<<<Gs,Bs>>>( n, beta, gamma, alpha, d, r, vtmp );
        cudaMemcpy( &res, beta, sizeof(float), cudaMemcpyDeviceToHost );
        res = sqrt( res );
        k++;
    } // end-while

  Aliaga et al.: Reformulated Conjugate Gradient for the Energy-Aware Solution of Linear Systems on GPUs, ICPP 2013.
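The gain of fusion comes from replacing several memory-bound passes with one. A Python sketch of the idea behind merging the two axpy updates and the following dot product of a CG step into a single sweep (plain lists stand in for device vectors; the function name is illustrative):

```python
def fused_cg_update(x, r, d, z, rho):
    """One pass over the vectors: x += rho*d, r -= rho*z, beta = <r, r>.

    Unfused, this is two axpy kernels plus one dot kernel, i.e. three
    full read/write passes over the data; fused, each component of
    x, r, d, z is touched exactly once.
    """
    beta = 0.0
    for i in range(len(x)):
        x[i] += rho * d[i]
        r[i] -= rho * z[i]
        beta += r[i] * r[i]
    return beta

x, r = [0.0, 0.0], [1.0, 1.0]
d, z = [1.0, 2.0], [1.0, 1.0]
beta = fused_cg_update(x, r, d, z, rho=0.5)
assert x == [0.5, 1.0]
assert r == [0.5, 0.5]
assert beta == 0.5
```

The dot product can join the fused kernel because each thread only adds its own local contribution; the final global reduction still needs a separate step.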

  9. Kernel Fusion in Sparse Iterative Algorithms
  • Which operations can be merged into a single kernel?
    • kernels compatible in terms of component-thread mapping
  • Example classification for Jacobi-CG:

    Operation    | Computes           | Input x / M^-1 x | Input y | Output alpha / y
    AXPY         | y = alpha*x + y    | mapped           | mapped  | mapped
    COPY         | y = x              | mapped           | mapped  | mapped
    DOT          | alpha = <x, y>     | mapped           | mapped  | unmapped
    Jacobi-Prec  | y = M^-1 x         | mapped           | mapped  | mapped
    SpMV CSR     | y = Ax             | unmapped         | -       | mapped
    SpMV ELL     | y = Ax             | unmapped         | -       | mapped
    SpMV SELL-P  | y = Ax             | unmapped         | -       | unmapped

  Aliaga et al.: Systematic Fusion of CUDA Kernels for Iterative Sparse Linear System Solvers, Euro-Par 2015, LNCS 9233, 2015.

  10. Kernel Fusion in Sparse Iterative Algorithms
  • Which operations can be merged into a single kernel?
  • Which kernels do we want to merge?
    • performance vs. flexibility…
  • Code Jacobi-preconditioned J-CG, J-BiCGSTAB, J-IDR, J-GMRES…? How about ILU/IC?

  11. Kernel Fusion in Sparse Iterative Algorithms
  • Which operations can be merged into a single kernel?
  • Which kernels do we want to merge?
    • performance vs. flexibility…
  • One stand-alone code for each SpMV kernel?
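Whether two consecutive operations may share a kernel can be read off the mapped/unmapped classification above. A Python sketch encoding that table (the dictionary and the `can_fuse` helper are hypothetical, for illustration only):

```python
# (input mapping, output mapping) per operation, following the
# Jacobi-CG classification table: "mapped" means each vector component
# is handled by the thread assigned to it, "unmapped" means the
# operation needs data or produces output outside that assignment
OPS = {
    "axpy":        ("mapped",   "mapped"),
    "copy":        ("mapped",   "mapped"),
    "dot":         ("mapped",   "unmapped"),
    "jacobi_prec": ("mapped",   "mapped"),
    "spmv_csr":    ("unmapped", "mapped"),
    "spmv_ell":    ("unmapped", "mapped"),
    "spmv_sellp":  ("unmapped", "unmapped"),
}

def can_fuse(first, second):
    # fusion across the boundary is only safe when no unmapped
    # dependency crosses it: the first op's output and the second
    # op's input must both stay in the component-thread mapping
    return OPS[first][1] == "mapped" and OPS[second][0] == "mapped"

assert can_fuse("axpy", "copy")
assert can_fuse("jacobi_prec", "axpy")
assert not can_fuse("dot", "axpy")       # dot ends in a global reduction
assert not can_fuse("axpy", "spmv_csr")  # SpMV reads many components per row
```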

  12. Kernel Fusion in Sparse Iterative Algorithms
  • Which operations can be merged into a single kernel?
  • Which kernels do we want to merge?
    • performance vs. flexibility…
  [Figure: runtime in seconds (0–10) of basic JCG, fusion JCG, and Jacobi-fusion JCG on four test matrices]

    Matrix | Size      | Nonzeros
    AUD    |   943,695 | 77,651,847
    G3     | 1,585,478 |  7,660,826
    INL    |   503,712 | 36,816,342
    LDO    |   952,203 | 46,522,475

  • Benefits from fusing the Jacobi preconditioner for very large and sparse matrices
  • Smaller benefits for more complex algorithms (BiCGSTAB, CGS, QMR, IDR…)

  13. Kernel Fusion in Sparse Iterative Algorithms
  • How close can kernel fusion bring us to the theoretical performance bound induced by memory bandwidth?
  • Cooperation with Moritz Kreutzer, Eduardo Ponce
  • NVIDIA K40, theoretical bandwidth: 288 GB/s…
  [Figure: achieved bandwidth b in GB/s (0–200) over vector length 10^3 to 10^7, reaching a plateau at b = 193 GB/s]

    Matrix | Size
    SCR    |   170,998
    TDK    |   204,316
    WEB    | 1,000,005
    DIE    | 1,157,456

  14. Kernel Fusion in Sparse Iterative Algorithms
  • Efficiency compared to roofline performance model: P = min( P_peak, I*b ) Gflop/s
    • P_peak theoretical compute peak, I arithmetic intensity, b bandwidth
  • IDR(s): a general Krylov solver (with Moritz Kreutzer, Eduardo Ponce)
  [Figure: efficiency in % (0–100) of IDR(1), IDR(2), IDR(4), IDR(8) relative to the roofline bound on ten test matrices]

    Matrix | Size      | Nonzeros
    SCR    |   170,998 |     958,936
    TDK    |   204,316 |   2,846,228
    WEB    | 1,000,005 |   3,105,536
    NLP    | 1,062,400 |  28,704,672
    DIE    | 1,157,456 |  48,538,952
    THM    | 1,228,045 |   8,580,313
    AFS    | 1,508,065 |  52,672,325
    MLG    | 1,504,002 | 110,879,972
    G3     | 1,585,478 |   7,660,826
    TRA    | 1,602,111 |  23,500,731
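The roofline bound itself is one line of arithmetic. A Python sketch (the K40 bandwidth of 288 GB/s is from the slides; the double-precision peak of 1430 Gflop/s and the axpy-like intensity of 0.25 flop/byte are illustrative assumptions, not measured values):

```python
def roofline(p_peak, bandwidth, intensity):
    """Attainable performance P = min(P_peak, I*b) in Gflop/s.

    p_peak: compute ceiling in Gflop/s, bandwidth: b in GB/s,
    intensity: arithmetic intensity I in flop/byte.
    """
    return min(p_peak, intensity * bandwidth)

# a BLAS 1-like kernel with I = 0.25 flop/byte is memory bound:
# its roofline limit on a 288 GB/s device is 72 Gflop/s
assert roofline(1430.0, 288.0, 0.25) == 72.0

# a kernel with high intensity hits the compute ceiling instead
assert roofline(1430.0, 288.0, 10.0) == 1430.0
```

Since sparse kernels sit far left on the roofline, raising the achieved bandwidth (kernel fusion) is worth more than raising the flop rate.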

  15. Kernel Overlap in Sparse Iterative Algorithms
  • Concurrent kernel execution to exploit unused GPU compute resources
  • Rare in sparse linear algebra (most algorithms are composed of memory-bound operations)
  • Small benefits, only when datasets are too small to saturate the memory bandwidth
  [Figure: performance improvement in % (0–10) from kernel overlap for IDR(1), IDR(2), IDR(4), IDR(8) on four test matrices]

    Matrix | Size
    SCR    |   170,998
    TDK    |   204,316
    WEB    | 1,000,005
    DIE    | 1,157,456
