Symmetric dense matrix tridiagonalization on a GPU cluster


  1. Symmetric dense matrix tridiagonalization on a GPU cluster. Ichitaro Yamazaki, Tim Dong, Stan Tomov, Jack Dongarra. Innovative Computing Lab., University of Tennessee, Knoxville. Accelerators and Hybrid Exascale Systems (AsHES) Workshop, Boston, Massachusetts, 05/20/2013.

  2. Introduction
◮ Objective: reduce a dense (a_ij ≠ 0) and symmetric (A = A^T) matrix A to a tridiagonal matrix T through Q^T A Q = T, where Q is orthogonal.
◮ Motivation: this reduction is often the bottleneck in solving the symmetric dense eigenvalue problem Av = λv or the generalized problem Av = λBv.
◮ Such problems arise in many scientific and engineering simulations, e.g., electronic structure calculation, quantum physics, image processing, web analysis, etc.
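To make the target transformation concrete, below is a minimal unblocked Householder tridiagonalization in NumPy. It is an illustrative sketch only (not the MAGMA/ScaLAPACK routine discussed in this talk): it forms each reflector explicitly and checks that Q^T A Q is tridiagonal.

```python
import numpy as np

def householder_tridiagonalize(A):
    """Unblocked Householder reduction of a symmetric matrix to tridiagonal
    form: returns (T, Q) with Q orthogonal and Q.T @ A @ Q = T (sketch only)."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    Q = np.eye(n)
    for j in range(n - 2):
        x = A[j + 1:, j]                    # part of column j below the diagonal
        v = x.copy()
        v[0] += (1.0 if x[0] >= 0 else -1.0) * np.linalg.norm(x)
        nv = np.linalg.norm(v)
        if nv == 0.0:                       # column already reduced
            continue
        v /= nv
        H = np.eye(n)
        H[j + 1:, j + 1:] -= 2.0 * np.outer(v, v)   # Householder reflector
        A = H @ A @ H                       # two-sided update (H is symmetric)
        Q = Q @ H                           # accumulate the orthogonal factor
    return A, Q

rng = np.random.default_rng(0)
B = rng.standard_normal((6, 6))
A = B + B.T                                 # dense symmetric test matrix
T, Q = householder_tridiagonalize(A)
assert np.allclose(Q.T @ A @ Q, T)          # Q^T A Q = T
assert np.allclose(np.triu(T, 2), 0) and np.allclose(np.tril(T, -2), 0)
```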

  3. MAGMA (Matrix Algebra on GPU and Multicore Architectures)
- multi-GPU dense symmetric solver -
> multi-GPU SYTRD integrated in dense and sparse eigensolvers
> used for electronic structure calculation (e.g., Exciting, Elk, Abinit, Quantum Espresso)
> optimized GPU kernels with 1DBC on multiple GPUs
[chart: DORMTR, DSTEDC, DSYTRD time (s) vs. number of GPUs; 8-GPU results are from R. Solcà of ETH]
- multi-GPU dense symmetric generalized solver -
[chart: ZPOTRF, ZHEGST, ZHEEVD, ZTRSMM time (s) vs. number of GPUs]
- distributed sparse Lanczos solver -
[chart: total solution time (m), LAPACK vs. MAGMA, vs. number of nodes]

  4. Extension to distributed GPUs
Motivation: solving larger-scale problems
◮ not easy to develop an out-of-GPU-memory version: the whole trailing submatrix is accessed to reduce each column, so the problem size is limited by the GPU memory
◮ weak-scaling studies on tens of GPUs or nodes
Our first step: ScaLAPACK with GPUs
◮ any number of MPI processes/GPUs per node, but one MPI process dispatches the GPU kernels [diagram: Node-0 with GPU-0,0 and GPU-0,1, Node-1 with GPU-1,0 and GPU-1,1, and MPI ranks 0-5]
- larger GPU kernels and smaller communication
◮ 1DBC, with MPI ranks mapped to cores in a round-robin fashion among nodes (see the mapping sketch below)
- our GPU kernels are recycled
◮ same optimization techniques (e.g., static scheduling, overlapping CPU work with GPU work and MPI communication of vectors)
MPI id:  0    1    2    3    4    5    0    1    2    3    4    5    0    1    2
GPU id: 0,0  1,0  0,1  1,1  0,0  1,0  0,1  1,1  0,0  1,0  0,1  1,1  0,0  1,0  0,1
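As a purely illustrative reconstruction of the round-robin 1DBC mapping in the table above, the helper below assigns a global block column to an MPI rank and to a (node, GPU) pair; the function name and the counts (6 MPI ranks, 2 nodes with 2 GPUs each) are assumptions chosen to reproduce the table, not code or parameters from the talk.

```python
# Minimal sketch of the round-robin 1DBC ownership shown in the table above.
# The counts (6 MPI ranks, 2 nodes x 2 GPUs) mirror the slide's example; the
# helper itself is a hypothetical illustration, not code from the talk.

def block_column_owners(jb, num_mpi=6, num_nodes=2, gpus_per_node=2):
    """Return (mpi_rank, (node, gpu)) owning global block column jb."""
    mpi_rank = jb % num_mpi                      # MPI ranks cycle 0..num_mpi-1
    g = jb % (num_nodes * gpus_per_node)         # GPUs cycle over all GPUs
    node, gpu = g % num_nodes, g // num_nodes    # round-robin among nodes first
    return mpi_rank, (node, gpu)

# Reproduces the "MPI id / GPU id" rows of the mapping table:
for jb in range(15):
    mpi, (node, gpu) = block_column_owners(jb)
    print(f"block col {jb:2d} -> MPI {mpi}, GPU {node},{gpu}")
```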

  5. “Blocked” tridiagonalization algorithm
step 1: panel factorization [figure: reflectors q_1, q_2, ... generated from the panel]
- for each column in the panel: generate q_j to reduce the column, and accumulate it into a blocked update Q
step 2: trailing submatrix update [figure: Q^T (trailing submatrix) Q]
- reduce communication by applying the accumulated Q as one block update (a sketch of a single blocked step follows below)
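Below is a NumPy sketch of one blocked step, under the convention (stated on the next slide) that the trailing submatrix update has the form A := A - V W^T - W V^T. For clarity this sketch applies each rank-2 update eagerly inside the panel loop and only verifies afterwards that the accumulated V and W reproduce the same trailing submatrix through the single blocked correction; the real algorithm defers that correction and applies it with one SYR2K.

```python
import numpy as np

def blocked_tridiag_step(A, nb):
    """One blocked step: reduce nb panel columns and collect V, W such that the
    trailing submatrix satisfies A_new = A_old - V W^T - W V^T (sketch only)."""
    A = A.copy()
    n = A.shape[0]
    A_before = A.copy()
    V_cols, W_cols = [], []
    for j in range(nb):
        x = A[j + 1:, j]
        v = x.copy()
        v[0] += (1.0 if x[0] >= 0 else -1.0) * np.linalg.norm(x)
        v /= np.linalg.norm(v)               # unit Householder vector, H = I - 2 v v^T
        # apply H from the left to column j (zeros it below the subdiagonal)
        A[j + 1:, j] -= 2.0 * v * (v @ A[j + 1:, j])
        A[j, j + 1:] = A[j + 1:, j]          # keep A symmetric
        # rank-2 form of the two-sided update: H Â H = Â - v w^T - w v^T
        p = 2.0 * A[j + 1:, j + 1:] @ v      # SYMV: the dominant BLAS-2 work
        w = p - (0.5 * (p @ v)) * v
        A[j + 1:, j + 1:] -= np.outer(v, w) + np.outer(w, v)
        # keep the parts of v_j, w_j that touch the post-panel trailing block
        V_cols.append(v[nb - j - 1:])
        W_cols.append(w[nb - j - 1:])
    V = np.column_stack(V_cols)              # (n - nb) x nb
    W = np.column_stack(W_cols)
    # step 2 (check): one blocked SYR2K-style correction of the trailing block
    assert np.allclose(A[nb:, nb:], A_before[nb:, nb:] - V @ W.T - W @ V.T)
    return A, V, W

rng = np.random.default_rng(1)
B = rng.standard_normal((10, 10))
A = B + B.T
A1, V, W = blocked_tridiag_step(A, nb=4)
assert np.allclose(np.tril(A1[:, :4], -2), 0)   # the 4 panel columns are reduced
```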

  6. Computational kernels in the tridiagonalization: total of (4/3)n^3 + O(n^2) flops
1. panel factorization: tridiagonal reduction of a panel
◮ SYMV (bandwidth-limited BLAS-2), about 50% of the flops
- multiply with the whole n_t × n_t trailing submatrix Â to reduce each column: w_j := Â v_j
- per call: 2 n_t^2 flops on (n_t^2 + n_t)/2 data, i.e., about 4 flops per data element; total of (2/3)n^3 flops
→ exploit the greater memory bandwidth of GPUs
2. trailing submatrix update: blocked orthogonal update of the trailing submatrix
◮ SYR2K (more data-parallel BLAS-3), about 50% of the flops
- symmetric rank-k update with each panel: A := A − V W^T − W V^T
- per call (k = 32 or 64): 2 n_t^2 k flops on about n_t^2/2 + 2 n_t k data, i.e., about 4k flops per data element; total of (2/3)n^3 flops
→ exploit the larger computational throughput of GPUs
(a small flop-count check follows below)
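As a quick sanity check on these counts (an illustration only; n and k below are assumed values, not ones from the talk), the script sums the per-column SYMV flops and the per-panel SYR2K flops and compares each with (2/3)n^3:

```python
# Hypothetical flop-count check for the two dominant kernels (illustration only).
n, k = 19200, 64                                  # assumed matrix size and panel width

# SYMV: reducing column j multiplies with the remaining (n - j) x (n - j)
# trailing submatrix, costing about 2*(n - j)^2 flops.
symv_flops = sum(2 * (n - j) ** 2 for j in range(n))

# SYR2K: after the panel starting at column p, the trailing m x m submatrix
# (m = n - p - k) gets a rank-2k update costing about 2*m^2*k flops.
syr2k_flops = sum(2 * (n - p - k) ** 2 * k for p in range(0, n, k))

print(symv_flops / n ** 3)                   # -> about 2/3
print(syr2k_flops / n ** 3)                  # -> about 2/3
print((symv_flops + syr2k_flops) / n ** 3)   # -> about 4/3 in total
```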

  7. Breakdown of reduction time using one GPU
Keeneland: (two 8-core Intel SandyBridge CPUs + three NVIDIA M2090 GPUs) per node
[plot: time relative to total vs. matrix size n (up to about 1.8 x 10^4), for dsytrd, dsymv+dsyr2k+dlarfg, dsymv+dsyr2k, and dsymv]
◮ reduction time is dominated by dsymv and dsyr2k (up to 90%), especially the BLAS-2 dsymv (up to 80%).

  8. Multi-GPU kernel 1: BLAS-3 SYR2K (symmetric rank-k update)
A sequence of “small” CUBLAS calls on streams to exploit data parallelism, where A is statically distributed among the GPUs in 1DBC [figure: block columns of A updated with V and W by syr2k on the diagonal blocks and gemm on the off-diagonal blocks]
◮ each GPU updates the block columns of its local submatrix
◮ SYR2K(V_I, W_I) on the diagonal block, and GEMM(V_I^T, W_{I:N}) and GEMM(W_I^T, V_{I:N}) on the off-diagonal blocks
◮ multiple GPU streams are cycled over the block columns
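A NumPy model of this block-column decomposition (illustration only; in the real kernel each piece is a CUBLAS syr2k or gemm call issued on a GPU stream): the diagonal block of each block column receives a rank-2k update, the blocks below it receive GEMMs, and together they reproduce the lower triangle of the full A := A - V W^T - W V^T.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k, nb = 12, 3, 4                         # matrix size, panel width, block size (made up)
A0 = rng.standard_normal((n, n)); A0 = A0 + A0.T
V = rng.standard_normal((n, k))
W = rng.standard_normal((n, k))

# reference: the full symmetric rank-2k update done in one shot
A_ref = A0 - V @ W.T - W @ V.T

# block-column version, as each GPU/stream would do for its own block columns
A = A0.copy()
for I in range(0, n, nb):
    cols = slice(I, I + nb)
    diag = slice(I, I + nb)
    below = slice(I + nb, n)
    # diagonal block: rank-2k update (a real SYR2K would touch only its lower half)
    A[diag, cols] -= V[diag] @ W[cols].T + W[diag] @ V[cols].T
    # off-diagonal blocks below the diagonal: two GEMMs with V_I and W_I
    A[below, cols] -= V[below] @ W[cols].T + W[below] @ V[cols].T

# only the lower triangle is updated block-column-wise; it matches the reference
assert np.allclose(np.tril(A), np.tril(A_ref))
```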

  9. Performance of SYR2K on multiple GPUs (Keeneland: three NVIDIA M2090 GPUs per node)
[plots: Gflop/s vs. matrix size, double complex and double real, for 1-3 GPUs and MKL]
high data parallelism:
◮ peak double-precision performance = 665 Gflop/s → about 40% of the peak
◮ 430 and 380 Gflop/s by zgemm and dgemm → about 75% of gemm
◮ only about 15% of the reduction time is spent here

  10. Multi-GPU kernel 2: BLAS-2 SYMV (symmetric matrix-vector multiplication)
A specialized CUDA kernel to minimize data traffic for the multiply with the local matrix (an extension of our SC’11 paper) [figure: threads 1-8 each assigned a block row of the lower triangle]
◮ each GPU thread processes a block row
◮ as each block A_ij is read, we multiply with both A_ij and A_ij^T
- each thread writes its partial result to its own workspace
- another kernel is launched to sum the partial results
- A is read only once from the GPU memory
(a sketch of this one-pass scheme follows below)
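A NumPy model of the one-pass scheme (illustration only, not the CUDA kernel): only blocks on or below the diagonal are read, each block A_ij is used as both A_ij and A_ij^T as soon as it is loaded, partial results go to per-block-row workspaces, and a final reduction sums them.

```python
import numpy as np

def blocked_symv_lower(A, x, nb):
    """y = A @ x for symmetric A, reading only the lower-triangle blocks once.
    Mimics the multi-workspace scheme: partial results per block row, then a sum."""
    n = A.shape[0]
    nblk = (n + nb - 1) // nb
    work = np.zeros((nblk, n))               # one workspace per "thread" (block row)
    for i in range(nblk):
        ri = slice(i * nb, min((i + 1) * nb, n))
        for j in range(i + 1):               # only blocks on/below the diagonal
            rj = slice(j * nb, min((j + 1) * nb, n))
            Aij = A[ri, rj]                  # read A_ij once ...
            work[i, ri] += Aij @ x[rj]       # ... use it as A_ij
            if i != j:
                work[i, rj] += Aij.T @ x[ri] # ... and as A_ij^T
    return work.sum(axis=0)                  # the second "reduction" kernel

rng = np.random.default_rng(3)
B = rng.standard_normal((17, 17))
A = B + B.T
x = rng.standard_normal(17)
assert np.allclose(blocked_symv_lower(A, x, nb=4), A @ x)
```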

  11. Performance of SYMV on multiple GPUs (Keeneland: three NVIDIA M2090 GPUs per node)
[plots: Gflop/s vs. matrix size n, double complex and double real, for 1-3 GPUs and CUBLAS]
bandwidth-limited operation:
◮ zhemv performs twice as many operations per element → twice the Gflop/s
◮ 120 GB/s with ECC on → the peaks are about 120 and 60 Gflop/s for zhemv and dsymv → 65%-75% of this peak → 15%-20% of gemm
◮ up to 80% of the reduction time is spent here

  12. Putting them together: multi-GPU tridiagonalization
1. statically distribute A in 1DBC
2. for each panel:
2.1 panel factorization; for each column in the panel:
2.1.1 update a_j with the previous v and w on the CPUs
2.1.2 compute the Householder reflector on the CPUs
2.1.3 broadcast the reflector to all GPUs
2.1.4 SYMV with the local submatrices on the GPUs in parallel
2.1.5 copy the vector v_j back to the CPU
2.1.6 compute w_j on the CPUs
2.2 trailing submatrix update:
2.2.1 broadcast V and W to the GPUs
2.2.2 SYR2K with the local submatrix on the GPUs in parallel
for each GPU call, communicate:
◮ the local matrix ((n^2 + n) / (2 × #GPUs) data) from GPU memory
◮ vector(s) (n or nk data) between CPU and GPU (non-blocking MPI + GPU streams)
(a host-side orchestration sketch follows below)
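A schematic host-side simulation of steps 2.1.3-2.1.6 for a single column (illustration only: the "GPUs" are plain Python objects holding their 1DBC block columns in full rather than only the symmetric half, MPI/stream overlap is omitted, and all sizes are made up):

```python
import numpy as np

rng = np.random.default_rng(4)
n, num_gpus, nb = 12, 3, 2
B = rng.standard_normal((n, n)); A = B + B.T

# step 1: static 1DBC distribution, GPU g owns block columns g, g+num_gpus, ...
owned_cols = [np.concatenate([np.arange(s, min(s + nb, n))
                              for s in range(g * nb, n, num_gpus * nb)])
              for g in range(num_gpus)]

v = rng.standard_normal(n)                  # 2.1.2-2.1.3: a stand-in reflector vector,
                                            #              broadcast to the GPUs
partials = []
for g in range(num_gpus):                   # 2.1.4: local SYMV on each GPU in parallel
    cols = owned_cols[g]
    partials.append(A[:, cols] @ v[cols])   # each GPU multiplies with its local columns
y = sum(partials)                           # 2.1.5: copy the partial results back, sum on CPU
assert np.allclose(y, A @ v)

tau = 2.0 / (v @ v)                         # 2.1.6: form w_j on the CPU
w = tau * y - 0.5 * tau ** 2 * (y @ v) * v  # w_j = tau*A*v - 0.5*tau^2*(v^T A v)*v
# w_j then enters the accumulated V, W used by the SYR2K update (step 2.2)
```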

  13. Hybrid CPU-GPU computing with asynchronous MPI and GPU communication
◮ SYR2K: hide the communication of V or W behind the panel factorization using asynchronous MPI sends and a dedicated GPU stream
◮ SYMV: overlap CPU and GPU computation (and hide CPU-GPU communication) when multiplying with the updated trailing submatrix (left-looking within a panel, right-looking between panels): the product is (Â − V W^T − W V^T) v_j, where the trailing submatrix Â is n_t × n_t and V, W are n_t × (j − 1)
- Â v_j is computed on the GPUs, while (V W^T + W V^T) v_j (and the update of the next column) is computed on the CPUs
[figure: a piece of an execution trace on 3 GPUs]
(a small sketch of this split follows below)
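A tiny NumPy illustration of this CPU/GPU split (not the actual hybrid code; sizes are made up): the GPU-side piece multiplies with the not-yet-updated trailing submatrix, the CPU-side piece applies the deferred correction with V and W, and their difference equals the product with the fully updated matrix.

```python
import numpy as np

rng = np.random.default_rng(5)
nt, j = 10, 4                                 # trailing size and column index (made up)
B = rng.standard_normal((nt, nt)); A_hat = B + B.T   # not-yet-updated trailing submatrix
V = rng.standard_normal((nt, j - 1))          # Householder vectors accumulated so far
W = rng.standard_normal((nt, j - 1))          # corresponding w vectors
v_j = rng.standard_normal(nt)

gpu_part = A_hat @ v_j                        # Â v_j: SYMV on the GPUs (bulk of the work)
cpu_part = V @ (W.T @ v_j) + W @ (V.T @ v_j)  # (V W^T + W V^T) v_j: small, done on the CPUs

# together they give the product with the updated trailing submatrix
assert np.allclose(gpu_part - cpu_part, (A_hat - V @ W.T - W @ V.T) @ v_j)
```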

  14. Performance on Keeneland: ScaLAPACK
◮ Keeneland: two 8-core 2.6 GHz Intel Xeon (SandyBridge) processors and three NVIDIA Tesla M2090 GPUs per node
◮ weak-scaling studies: matrix size = 11,520 × (number of nodes)^(1/2)
[bar chart: time (s) on 1, 4, 16 nodes, broken down into PDSYR2K, PDSYMV, PDSYMV communication, and others]
◮ hybrid programming performed well
- 1 MPI/socket obtained the best computation and communication performance

  15. Performance on Keeneland: ScaLAPACK + 1 GPU/node
[bar charts: time (s) on 1, 4, 16 nodes, broken down into PDSYR2K, PDSYMV, PDSYMV communication, and others, with per-bar speedup annotations ranging from 0.8 to 4.8]
◮ the GPU extension obtained reasonable speedups
- significant speedups over 1 MPI/core or 1 MPI/node
- similar performance to 1 MPI/socket
