Optimizing Krylov Subspace Solvers on Graphics Processing Units


  1. Optimizing Krylov Subspace Solvers on Graphics Processing Units
     Hartwig Anzt, W. Sawyer, S. Tomov, P. Luszczek, I. Yamazaki, J. Dongarra
     May 19, 2014
     Innovative Computing Laboratory (ICL), http://icl.cs.utk.edu/
     University of Tennessee, Knoxville

  2. Krylov Methods
     Named after Alexei Nikolajewitsch Krylov (1863-1945). Krylov methods are
     iterative solvers for linear systems: they search for a solution
     approximation in a subspace that is generated by repeated matrix-vector
     products (the Krylov subspace). They target sparse systems.
     Examples: CG, BiCGStab, GMRES, ...
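     For reference, the textbook definition behind "matrix-vector product
     generates the (Krylov) subspace" (standard material, not spelled out on
     the slide): starting from an initial guess x_0 with residual
     r_0 = b - A x_0, iteration k searches for an approximation in the space
     x_0 + K_k(A, r_0), where

         K_k(A, r_0) = span{ r_0, A r_0, A^2 r_0, ..., A^(k-1) r_0 }.

     Each iteration enlarges the subspace by one matrix-vector product, which
     is why the sparse matrix-vector product (SpMV) typically dominates the
     per-iteration cost.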

  3. Porting Krylov Methods to GPUs
     NVIDIA provides a set of useful routines (cuBLAS/cuSPARSE), so a Krylov
     method can be composed out of these building blocks. Is this efficient?

  4. Porting Krylov Methods to GPUs (cont.)
     NVIDIA provides a set of useful routines (cuBLAS/cuSPARSE), so a Krylov
     method can be composed out of these building blocks. Is this efficient?
     Krylov methods are usually memory-bound: every kernel has to access all
     the data it needs in main memory, and the iteration loop contains a high
     number of kernels. Idea: replace cuBLAS/cuSPARSE by algorithm-specific
     kernels, for reduced memory traffic and reduced kernel launch overhead.
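     A worked example of what "memory-bound" means here (the bandwidth figure
     is an illustrative assumption, not from the slides): a double-precision
     AXPY, y := α x + y, performs 2n flops but moves 3n doubles (2n reads,
     n writes) of 8 bytes each, so its arithmetic intensity is

         2n flops / 24n bytes = 1/12 flop per byte.

     On a GPU with, say, 200 GB/s of memory bandwidth this caps the kernel at
     roughly 200/12 ≈ 17 GFLOPS, far below the double-precision peak and
     independent of how fast the cores are. Merging kernels attacks the only
     levers left: the amount of data moved and the number of launches.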

  5. Reducing the kernel number: BiCGStab

      1:  while ( k < maxiter ) && ( res > τ )
      2:      k := k + 1
      3:      ρ_k := r̂_0^T r_{k-1}
      4:      β_k := ( ρ_k / ρ_{k-1} ) ( α_{k-1} / ω_{k-1} )
      5:      p_k := r_{k-1} + β_k ( p_{k-1} - ω_{k-1} v_{k-1} )
      6:      v_k := A p_k
      7:      α_k := ρ_k / ( r̂_0^T v_k )
      8:      s_k := r_{k-1} - α_k v_k
      9:      t_k := A s_k
      10:     ω_k := ( s_k^T t_k ) / ( t_k^T t_k )
      11:     x_{k+1} := x_k + α_k p_k + ω_k s_k
      12:     r_k := s_k - ω_k t_k
      13:     res := r_k^T r_k
      14: end
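     A worked count, derived from the cuBLAS listing on slide 8: mapped 1:1
     onto library calls, lines 3-13 issue

         4 dot products + 1 norm + 2 copies + 7 axpy/scal + 2 SpMV
         = 16 kernel launches per iteration,

     which is the "high number of kernels in the iteration loop" that the
     merging on the following slides removes.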

  6. Reducing the kernel number: merging vector updates
     The update p_k := r_{k-1} + β_k ( p_{k-1} - ω_{k-1} v_{k-1} ) composed
     out of cuBLAS calls:

         cublasDscal( n, beta, p, 1 );                 // p := beta*p
         cublasDaxpy( n, -omega * beta, v, 1, p, 1 );  // p := p - beta*omega*v
         cublasDaxpy( n, 1.0, r, 1, p, 1 );            // p := p + r

     3 kernels, 5n reads, 3n writes.

     Merged into one algorithm-specific kernel:

         __global__ void
         p_update( int n, double beta, double omega,
                   const double *v, const double *r, double *p )
         {
             int i = blockIdx.x * blockDim.x + threadIdx.x;
             if( i < n )
                 p[i] = r[i] + beta * ( p[i] - omega * v[i] );
         }

     1 kernel, 3n reads, 1n writes.
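     A minimal host-side launch for the merged kernel (the block size of 256
     is an assumption; the slide does not show the launch code):

         // replaces the three cuBLAS calls above with a single launch
         int threads = 256;
         int blocks  = ( n + threads - 1 ) / threads;   // ceil( n / threads )
         p_update<<< blocks, threads >>>( n, beta, omega, v, r, p );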

  7. Reducing the kernel number: merged kernel performance
     The slide repeats the cuBLAS sequence and the merged kernel of slide 6
     next to a performance comparison.
     [Plot: GFLOPS (0-25) vs. vector length n in 10^3 (0-1000) for the p_k
     update; the merged kernel (MERGE) is consistently faster than the
     three-call cuBLAS sequence (CUBLAS).]
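     Under a pure bandwidth-bound model (a sketch consistent with the
     read/write counts above), the expected speedup of the merged kernel is
     the ratio of data moved:

         ( 5n reads + 3n writes ) / ( 3n reads + 1n writes ) = 8n / 4n = 2.

     The measured gap additionally contains the two saved kernel launches, so
     it need not equal exactly 2, especially for small n where launch
     overhead is not amortized.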

  8. Reducing the kernel number: merged BiCGStab
     cuBLAS/cuSPARSE implementation:

         while( ( k < maxiter ) && ( res > epsilon ) ){
             rho_new = cublas_dot( n, r_hat, 1, r, 1 );
             beta = rho_new/rho_old * alpha/omega;
             cublas_dscal( n, beta, p, 1 );
             cublas_daxpy( n, -omega * beta, v, 1, p, 1 );
             cublas_daxpy( n, 1.0, r, 1, p, 1 );
             dSpMV<<<Gs,Bs>>>( n, rowA, colA, valA, p, v );
             alpha = rho_new / cublas_dot( n, r_hat, 1, v, 1 );
             cublas_dcopy( n, r, 1, s, 1 );
             cublas_daxpy( n, -1.0 * alpha, v, 1, s, 1 );
             dSpMV<<<Gs,Bs>>>( n, rowA, colA, valA, s, t );
             omega = cublas_dot( n, t, 1, s, 1 )
                   / cublas_dot( n, t, 1, t, 1 );
             cublas_daxpy( dofs, alpha, p, 1, x, 1 );
             cublas_daxpy( dofs, omega, s, 1, x, 1 );
             cublas_dcopy( n, s, 1, r, 1 );
             cublas_daxpy( n, -1.0 * omega, t, 1, r, 1 );
             res = cublas_dnrm2( n, r, 1 );
             rho_old = rho_new;
             k++;
         }

     4nnz + 29n reads, 11n writes per iteration.

     Merged implementation:

         while( ( k < maxiter ) && ( res_host > epsilon ) ){
             magma_dbicgmerge_p_update<<<Gs, Bs, 0>>>( n, skp, v, r, p );
             magma_dbicgmerge_spmv1<<<Gs, Bs, Ms1>>>( n, valA, rowA, colA, p, r, v, d1 );
             magma_dbicgmerge_reduce1( n, Gs, Bs, d1, d2, skp );
             magma_dbicgmerge_s_update<<<Gs, Bs, 0>>>( n, skp, r, v, s );
             magma_dbicgmerge_spmv2<<<Gs, Bs, Ms2>>>( n, valA, rowA, colA, s, t, d1 );
             magma_dbicgmerge_reduce2( n, Gs, Bs, d1, d2, skp );
             magma_dbicgmerge_xr_update<<<Gs, Bs, 0>>>( n, skp, r_hat, r, p, s, t, x, d1 );
             magma_dbicgmerge_reduce3( n, Gs, Bs, d1, d2, skp );
             magma_memcopy( 1, skp+5, res_host );
             k++;
         }

     4nnz + 16n reads, 6n writes per iteration; small variances possible due
     to the recursive reduction (depending on n). Savings of about 35 % when
     neglecting the SpMV.
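     Totaling the counts of the two listings (one word = one double; the same
     numbers as above, just summed):

         cuBLAS version:  4 nnz + 29n + 11n = 4 nnz + 40n words per iteration
         merged version:  4 nnz + 16n +  6n = 4 nnz + 22n words per iteration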

  9. Reducing the kernel number: merged BiCGStab (cont.)
     The slide repeats the two listings of slide 8 and highlights that the
     ω-update needs two consecutive reductions (the dot products s_k^T t_k
     and t_k^T t_k): in the cuBLAS version these are two separate cublas_dot
     calls, while the merged version evaluates both together in
     magma_dbicgmerge_spmv2 / magma_dbicgmerge_reduce2.

  10. Merging multiple dot products (MDOT)
      Consecutive dot products sharing one common vector can be merged into a
      single kernel; the shared vector is then read only once, and the
      reduced memory reads improve performance (see the sketch below).
      [Plot: GFLOPS (0-30) vs. vector length n in 10^3 (0-2000) for MDOT(1),
      MDOT(2), CUBLAS(1), CUBLAS(2); MDOT is consistently faster.]
      Where is the line between a merged dot-product and a matrix-vector
      product?
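      A minimal sketch of the idea (not the MAGMA kernel; the names mdot2 and
      partial, the block size, and the two-vector case are assumptions): one
      kernel computes the partial sums of <v1,r> and <v2,r> in a single pass
      over the shared vector r, leaving one partial result per block for a
      cheap final reduction.

          // Two dot products <v1,r> and <v2,r> sharing the vector r,
          // computed in one pass. Launch with blockDim.x == BLOCK.
          #define BLOCK 256

          __global__ void mdot2( int n, const double *v1, const double *v2,
                                 const double *r, double *partial )
          {
              __shared__ double s1[BLOCK], s2[BLOCK];
              int i   = blockIdx.x * blockDim.x + threadIdx.x;
              int tid = threadIdx.x;

              // r[i] is loaded once and used for both products
              double ri = ( i < n ) ? r[i] : 0.0;
              s1[tid]   = ( i < n ) ? v1[i] * ri : 0.0;
              s2[tid]   = ( i < n ) ? v2[i] * ri : 0.0;
              __syncthreads();

              // standard tree reduction in shared memory
              // (requires a power-of-two block size)
              for( int s = blockDim.x / 2; s > 0; s >>= 1 ){
                  if( tid < s ){
                      s1[tid] += s1[tid + s];
                      s2[tid] += s2[tid + s];
                  }
                  __syncthreads();
              }
              if( tid == 0 ){
                  partial[blockIdx.x]             = s1[0];  // block sum of <v1,r>
                  partial[gridDim.x + blockIdx.x] = s2[0];  // block sum of <v2,r>
              }
          }

      A second pass over the 2·gridDim.x partials finishes both reductions;
      this recursive step is the source of the "small variances depending on
      n" noted on slide 8.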

  11. Merged dot-product vs. matrix-vector product
      Computing several dot products against one shared vector is, in effect,
      a tall-and-skinny matrix-vector product, so the merged-dot kernels are
      compared against GeMV implementations.
      [Plots: GFLOPS vs. number of vectors (0-16) for CUDOT, MDOT, MDGM,
      CUGEMV and MAGMA_GeMV, at vector lengths n = 100,000 and n = 1,000,000.
      Crossover plot over vector length n in 10^3 (0-5000): MDGM is superior
      for small numbers of vectors, MAGMA_GeMV for larger ones.]
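      A back-of-envelope way to see why the crossover is an implementation
      question rather than a traffic question (an assumption-level model, not
      from the slides): k merged dot products against a shared vector of
      length n read (k+1)·n words and perform 2k·n flops, and a GeMV with an
      n × k matrix moves essentially the same data for the same flops. The
      arithmetic intensity

          2k·n / ( (k+1)·n ) = 2k / (k+1)  →  2 flops/word as k grows

      is identical for both, so which kernel wins depends on blocking, launch
      configuration, and reduction strategy, which is exactly what the
      measured crossover between MDGM and MAGMA_GeMV shows.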
