  1. Hierarchical-matrix Linear Solver on GPU clusters with MAGMA variable-size batched kernel
     Ichitaro Yamazaki∗, Ahmad Abdelfattah∗, Akihiro Ida†, Satoshi Ohshima‡, Stanimire Tomov∗, Rio Yokota♯, Jack Dongarra∗
     ∗ The University of Tennessee, Knoxville, USA; † The University of Tokyo, Japan; ‡ Kyushu University, Japan; ♯ Tokyo Institute of Technology, Japan
     GPU Technology Conference, San Jose, CA, 03/26/2018

  2. Boundary Element Method (BEM): from integral equation to linear equations
     ◮ many scientific and engineering applications (e.g., acoustics, electromagnetics, fracture and fluid mechanics)
     ◮ numerical solution of the integral equation ∫_Ω K(x, y) u(y) dy = f leads to the solution of a dense linear system Aφ = b
     ◮ problem sizes are limited by the cost of solving the linear system
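     One common way to see the step from the integral equation to the dense system (a sketch; the basis functions N_j and collocation points x_i are generic BEM notation, not taken from the slides): expand the unknown u in n basis functions and enforce the equation at n points, which gives

         u(y) \approx \sum_{j=1}^{n} \phi_j N_j(y), \qquad
         A_{ij} = \int_{\Omega} K(x_i, y)\, N_j(y)\, \mathrm{d}y, \qquad
         b_i = f(x_i) \quad\Longrightarrow\quad A\phi = b .

     Because the kernel K couples every pair of points, A is dense, which is what the H-matrix compression on the next slides addresses.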

  3. HACApK: dense linear solver
     ◮ solves dense linear systems of equations, e.g., for BEM (ppohBEM)
     ◮ reduces computational and storage costs by compressing the matrix into an H-matrix
       ⊲ reordered/partitioned using the geometry of the problem
     ◮ uses a Krylov solver such as BiCGStab to compute the solution
     ◮ available at http://ppopenhpc.cc.u-tokyo.ac.jp
     → this talk focuses on utilizing GPUs

  4. BiCGStab with H-matrix on a distributed-memory computer
      1:  t := A x
      2:  r := b − t;  r0 := r,  γ := ‖r0‖2
      3:  for iter = 1, 2, ..., maxiters do
      4:    p := r + β · (p − ζ · v)
      5:    v := A p, followed by Allgatherv
      6:    α := (r0, r) / (r0, v)
      7:    v := r − α · v
      8:    t := A v, followed by Allgatherv
      9:    ζ := (t, v) / (t, t)
     10:    x := x + α p + ζ v
     11:    r := v − ζ t
     12:    β := (α/ζ) · (r0, r) / γ
     13:    γ := ‖r‖
     14:  end for
     ◮ HiMV (H-matrix-vector multiply) dominates the iteration time and is parallelized
       ⊲ 1D block-row distribution, but with H-blocks and non-disjoint rows
     ◮ vector operations are insignificant and are computed redundantly
       ⊲ avoids all-reduces for the five dot-products per iteration
     ◮ MPI Allgatherv after each HiMV (sketched below)
     ◮ OpenMP threads may be used to parallelize local matrix/vector operations
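     A minimal sketch of the "HiMV followed by Allgatherv" pattern on lines 5 and 8 above. local_himv() and the counts/displs layout arrays are hypothetical placeholders, and the complication of non-disjoint rows is ignored here for brevity:

         /* sketch: local H-matrix-vector multiply, then gather the full
          * result vector on every rank */
         #include <mpi.h>

         void local_himv(const double *x, double *y);  /* hypothetical: multiplies
                                                          this rank's H-blocks */

         void himv_allgatherv(const double *x_global, double *y_global,
                              double *y_local, int n_local,
                              const int *counts, const int *displs, MPI_Comm comm)
         {
             /* local part: y_local := A_local x_global, where A_local holds this
              * rank's block rows of the H-matrix (dense and low-rank leaves) */
             local_himv(x_global, y_local);

             /* every rank receives the full vector, so the dot products and axpys
              * of BiCGStab can then be computed redundantly on each rank */
             MPI_Allgatherv(y_local, n_local, MPI_DOUBLE,
                            y_global, counts, displs, MPI_DOUBLE, comm);
         }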

  5. GPU testbeds
     ◮ Reedbush-H: two 18-core Intel Xeon CPUs and two NVIDIA P100 GPUs per node, connected with 2 × 56 Gb/s InfiniBand
     ◮ Tsubame-3: two 14-core Intel Xeon CPUs and four NVIDIA P100 GPUs per node, connected with 4 × 100 Gb/s Omni-Path

  6. BiCGStab with H-matrix on a GPU cluster (same iteration as on slide 4)
     ◮ all operations are on the GPUs (the CPUs schedule tasks)
       ⊲ CPU-GPU data copy before/after each MPI call (sketched below)
       ⊲ vector operations using cuBLAS
       ⊲ HiMV using a batched kernel
     !! fine-grained irregular computation + global communication
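     A minimal sketch of the copy-MPI-copy pattern noted above, assuming the MPI library is not CUDA-aware (which is what the explicit CPU-GPU copies imply); buffer names are hypothetical and error checking is omitted:

         #include <cuda_runtime.h>
         #include <mpi.h>

         void allgatherv_from_gpu(double *d_y_local, double *d_y_global,
                                  double *h_y_local, double *h_y_global,
                                  int n_local, int n_global,
                                  const int *counts, const int *displs,
                                  MPI_Comm comm)
         {
             /* device -> host copy of the locally computed piece of y */
             cudaMemcpy(h_y_local, d_y_local, n_local * sizeof(double),
                        cudaMemcpyDeviceToHost);

             /* exchange through MPI on the host */
             MPI_Allgatherv(h_y_local, n_local, MPI_DOUBLE,
                            h_y_global, counts, displs, MPI_DOUBLE, comm);

             /* host -> device copy of the assembled global vector */
             cudaMemcpy(d_y_global, h_y_global, n_global * sizeof(double),
                        cudaMemcpyHostToDevice);
         }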

  7. batched GPU kernels from MAGMA
     ◮ many small operations of the same kind performed in parallel
     ◮ hardware parallelism through data parallelization
     ◮ motivated by application needs (e.g., deep learning, structural mechanics, high-order FEM, astrophysics, sparse/dense solvers)
     ◮ MAGMA: http://www.icl.utk.edu/magma
       LU, QR, Cholesky (fixed); all BLAS-3 (fixed or variable); SYMV and GEMV (fixed or variable)
       http://www.icl.utk.edu/files/print/2017/magma-sc17.pdf (SC'17 handout)

  8. interface to the variable-size batched DGEMV kernel

         magmablas_dgemv_vbatched(
             magma_trans_t trans,
             magma_int_t *m, magma_int_t *n,
             double alpha,
             magmaDouble_ptr dA_array[], magma_int_t *ldda,
             magmaDouble_ptr dx_array[], magma_int_t *incx,
             double beta,
             magmaDouble_ptr dy_array[], magma_int_t *incy,
             magma_int_t batchCount,
             magma_queue_t queue)

     ◮ matrices/vectors are passed as arrays of size batchCount residing on the GPU (i.e., dA_array, dx_array, dy_array)
       ⊲ maximum batchCount is 65,536
     ◮ variable matrix sizes are passed as arrays on the GPU (e.g., m, n, ldda)
     ◮ the operation itself is the same across the batch (i.e., trans and alpha)
     ◮ layered interface (e.g., magmablas_dgemv_vbatched_nocheck)
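     A hedged usage sketch (not taken from the slides): three y_k := A_k x_k products of different sizes in one call. Error checking is omitted; the size and pointer arrays are allocated with one spare entry, which the MAGMA vbatched interfaces expect and which is harmless otherwise:

         #include <cuda_runtime.h>
         #include "magma_v2.h"

         int main(void)
         {
             magma_init();
             magma_queue_t queue;
             magma_queue_create(0, &queue);                 /* device 0 */

             const magma_int_t batchCount = 3;
             magma_int_t h_m[4]   = { 8, 16, 32, 0 };       /* rows of A_k */
             magma_int_t h_n[4]   = { 4,  8, 16, 0 };       /* cols of A_k */
             magma_int_t h_ld[4]  = { 8, 16, 32, 0 };       /* leading dims */
             magma_int_t h_one[4] = { 1, 1, 1, 1 };         /* incx = incy = 1 */
             double *h_A[4] = {0}, *h_x[4] = {0}, *h_y[4] = {0};

             for (int k = 0; k < batchCount; k++) {         /* per-GEMV operands */
                 cudaMalloc((void **)&h_A[k], h_m[k] * h_n[k] * sizeof(double));
                 cudaMalloc((void **)&h_x[k], h_n[k] * sizeof(double));
                 cudaMalloc((void **)&h_y[k], h_m[k] * sizeof(double));
                 /* ... fill A_k and x_k on the device ... */
             }

             /* the size and pointer arrays themselves must live on the GPU */
             magma_int_t *d_m, *d_n, *d_ld, *d_inc;
             double **d_A, **d_x, **d_y;
             cudaMalloc((void **)&d_m,   4 * sizeof(magma_int_t));
             cudaMalloc((void **)&d_n,   4 * sizeof(magma_int_t));
             cudaMalloc((void **)&d_ld,  4 * sizeof(magma_int_t));
             cudaMalloc((void **)&d_inc, 4 * sizeof(magma_int_t));
             cudaMalloc((void **)&d_A, 4 * sizeof(double *));
             cudaMalloc((void **)&d_x, 4 * sizeof(double *));
             cudaMalloc((void **)&d_y, 4 * sizeof(double *));
             cudaMemcpy(d_m,   h_m,   4 * sizeof(magma_int_t), cudaMemcpyHostToDevice);
             cudaMemcpy(d_n,   h_n,   4 * sizeof(magma_int_t), cudaMemcpyHostToDevice);
             cudaMemcpy(d_ld,  h_ld,  4 * sizeof(magma_int_t), cudaMemcpyHostToDevice);
             cudaMemcpy(d_inc, h_one, 4 * sizeof(magma_int_t), cudaMemcpyHostToDevice);
             cudaMemcpy(d_A, h_A, 4 * sizeof(double *), cudaMemcpyHostToDevice);
             cudaMemcpy(d_x, h_x, 4 * sizeof(double *), cudaMemcpyHostToDevice);
             cudaMemcpy(d_y, h_y, 4 * sizeof(double *), cudaMemcpyHostToDevice);

             /* one launch performs all three y_k := 1.0 * A_k x_k + 0.0 * y_k */
             magmablas_dgemv_vbatched(MagmaNoTrans, d_m, d_n,
                                      1.0, d_A, d_ld, d_x, d_inc,
                                      0.0, d_y, d_inc,
                                      batchCount, queue);
             magma_queue_sync(queue);

             magma_queue_destroy(queue);
             magma_finalize();
             return 0;
         }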

  9. integration of the variable-size batched kernel into HiMV

         for k = 1, 2, ..., nℓ do
             if dense block then
                 // multiply with dense B(k)
                 y(k) := B(k) x(k)
             else
                 // multiply with compressed U(k) V(k)
                 t(k) := V(k) x(k)
                 y(k) := U(k) t(k)
             end if
         end for

     ◮ the variable-size batched kernel performs a batch of dgemvs in parallel
       ⊲ group the dgemvs into multiple batches (e.g., of fixed batch count); see the sketch below
     ◮ HiMV is many small dgemvs with dense or compressed blocks
       ⊲ flat for-loop without hierarchical recursion
     → effective integration of the batched kernel
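     A sketch of how the flat leaf loop above can be staged into batched calls. The leaf_t record and the stage_gemv() helper are hypothetical illustrations, not HACApK data structures; stage_gemv() is assumed to append one (m, n, A, x, y) entry to host-side arrays and, once a chunk is full, copy them to the GPU and call magmablas_dgemv_vbatched as on the previous slide. Only the first pass (B(k) and V(k)) is shown; the dependent U(k) pass is the subject of the next slide.

         /* hypothetical per-leaf record */
         typedef struct {
             int is_dense;              /* dense leaf B(k) vs. low-rank leaf U(k)V(k) */
             int rows, cols, rank;      /* leaf dimensions */
             double *d_B, *d_U, *d_V;   /* device pointers to the leaf factors */
             double *d_x, *d_t, *d_y;   /* device pieces of x, temporary t, and y */
         } leaf_t;

         /* hypothetical helper: queue one dgemv (y := A x, A of size rows x cols);
          * flushes a full chunk through magmablas_dgemv_vbatched */
         void stage_gemv(int rows, int cols, double *A, double *x, double *y);

         void stage_first_pass(const leaf_t *leaf, int nleaves)
         {
             for (int k = 0; k < nleaves; k++) {
                 if (leaf[k].is_dense)
                     /* y(k) := B(k) x(k) */
                     stage_gemv(leaf[k].rows, leaf[k].cols,
                                leaf[k].d_B, leaf[k].d_x, leaf[k].d_y);
                 else
                     /* t(k) := V(k) x(k); the later U(k) pass does y(k) := U(k) t(k) */
                     stage_gemv(leaf[k].rank, leaf[k].cols,
                                leaf[k].d_V, leaf[k].d_x, leaf[k].d_t);
             }
         }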

  10. integration of the variable-size batched kernel into HiMV (loop as on slide 9)
      ◮ two data conflicts:
        ⊲ the outputs y(k) may overlap → use NVIDIA's atomic add on y
        ⊲ the multiply with U(k) depends on the t(k) produced by the multiply with V(k)
          → 1) launch the batches of B(k) and V(k), then 2) the batches of U(k),
             either on the same stream, or on multiple streams ordered with events (sketched below)
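      A sketch of the two ways to order the dependent passes described above; only the stream/event handling is shown, and the batch launches are represented by hypothetical helpers:

          #include <cuda_runtime.h>

          void launch_dense_and_V_batches(cudaStream_t s);  /* hypothetical: pass 1 */
          void launch_U_batches(cudaStream_t s);            /* hypothetical: pass 2 */

          void himv_two_pass(void)
          {
              /* option 1: one stream -- the pass-2 launches simply queue up behind
               * pass 1, so every t(k) is complete before U(k) reads it */
              cudaStream_t s;
              cudaStreamCreate(&s);
              launch_dense_and_V_batches(s);
              launch_U_batches(s);
              cudaStreamDestroy(s);

              /* option 2: two streams with an event -- pass 2 waits on the event
               * recorded after pass 1, letting unrelated work overlap */
              cudaStream_t s1, s2;
              cudaEvent_t pass1_done;
              cudaStreamCreate(&s1);
              cudaStreamCreate(&s2);
              cudaEventCreateWithFlags(&pass1_done, cudaEventDisableTiming);
              launch_dense_and_V_batches(s1);
              cudaEventRecord(pass1_done, s1);
              cudaStreamWaitEvent(s2, pass1_done, 0);
              launch_U_batches(s2);
              cudaEventDestroy(pass1_done);
              cudaStreamDestroy(s1);
              cudaStreamDestroy(s2);
          }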

  11. performance of the batched kernel for HiMV
      ◮ a wide range of block sizes
        ⊲ diagonal blocks: dense and square
        ⊲ off-diagonal blocks: dense or compressed, tall-skinny or short-wide
      ◮ overhead with variable sizes: e.g., to accommodate the largest block, smaller blocks get thread blocks with no work
      ◮ lower variable-size performance (Gflop/s)

  12. performance of the batched kernel for HiMV
      ◮ sort blocks to reduce the overhead associated with variable-size blocks (see the sketch below)
        ⊲ sort by the number of rows in each block
        ⊲ or group by number of rows, then sort by number of columns within each group
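      A sketch of the second sorting scheme above (group by rows, then order by columns within a group), using qsort on a hypothetical per-block record:

          #include <stdlib.h>

          typedef struct { int rows, cols, leaf_id; } block_desc_t;

          static int cmp_rows_then_cols(const void *pa, const void *pb)
          {
              const block_desc_t *a = pa, *b = pb;
              if (a->rows != b->rows) return a->rows - b->rows;  /* primary: rows  */
              return a->cols - b->cols;                          /* secondary: cols */
          }

          /* blocks that end up adjacent have similar shapes, so a batch built from
           * consecutive blocks wastes less work padding to its largest block */
          void sort_blocks(block_desc_t *blocks, size_t nblocks)
          {
              qsort(blocks, nblocks, sizeof(block_desc_t), cmp_rows_then_cols);
          }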

  13. performance of the batched kernel for HiMV
      ◮ an appropriate sorting scheme improves performance
        ⊲ up to 2.5× speedup

  14. performance (Gflop/s) of different HiMV implementations
      [bar chart: Gflop/s (0 to 160) of OpenMP+MKL, cuBLAS + 5 streams, fixed batch(5K) with padding, variable batch(5K), variable batch(20K), and variable batch(variable) + 3 streams, on the matrices 100ts, 338ts, human2, and human6]
      ◮ the variable-size GPU kernel obtains higher performance than the fixed-size kernel (which wastes operations on zero padding or is limited in batch count)
      ◮ last three configurations: variable batch counts to reduce overhead
        ⊲ a specific range of block sizes in each batch
        ⊲ GPU streams to execute the small batches in parallel

  15. BiCGStab performance with GPUs (strong scaling)
      [bar charts for Tsubame-3 and Reedbush-H: BiCGStab solution time (s) vs. number of nodes (1 GPU, then 1, 2, 4, 8 nodes), broken down into HiMV(comp), HiMV(copy), HiMV(MPI), and other, with CPU-only bars for comparison and speedups annotated on the bars]
      ◮ CPU runs (one process per socket, with threads) vs. GPU runs (one process per GPU)
      ◮ 2.1× speedup on 8 nodes of Tsubame-3
      ◮ 4.6× speedup on 8 nodes of Reedbush-H

  16. BiCGStab performance with GPUs on 8 nodes
      [bar charts for Tsubame-3 (matrices 100ts, 288ts, 388ts, 1ms, hum4, hum6) and Reedbush-H (matrices 100ts, 288ts, 338ts, 1ms, hum1, hum4): BiCGStab solution time (s), broken down into HiMV(comp), HiMV(copy), HiMV(MPI), and other, with CPU-only bars and per-matrix speedups annotated]
      ◮ CPU runs (one process per socket, with threads) vs. GPU runs (one process per GPU)
      ◮ up to 4.2× speedup on 8 nodes of Tsubame-3 (6.0× on one node)
      ◮ up to 4.6× speedup on 8 nodes of Reedbush-H (4.2× on one node)
      ◮ communication starts to become significant
        ⊲ 46% on Tsubame-3, 43% on Reedbush-H

  17. BiCGStab with multiple GPUs per process
      [plot for Tsubame-3: achieved bandwidth (GB/s) vs. node count (1, 2, 4, 8, 16) with one process per node, per socket, and per GPU]
      ◮ give each process multiple GPUs to lower inter-node communication by reducing the number of processes
      ◮ use NVLink for data transfers among the local GPUs (sketched below)
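      A sketch of moving data directly between the GPUs owned by one process using CUDA peer-to-peer copies, which go over NVLink on these P100 nodes; the device numbering, buffer names, and gather layout are hypothetical, not HACApK's actual multi-GPU code:

          #include <stddef.h>
          #include <cuda_runtime.h>

          /* d_dst_on_gpu0 points at where the remote GPUs' pieces should land */
          void gather_partial_results(double *d_dst_on_gpu0, double *const *d_src,
                                      const size_t *bytes, int ngpus)
          {
              /* enable peer access from GPU 0 to every other local GPU
               * (the flags argument must be 0) */
              cudaSetDevice(0);
              for (int g = 1; g < ngpus; g++)
                  cudaDeviceEnablePeerAccess(g, 0);

              /* copy each remote GPU's piece directly into GPU 0's buffer,
               * bypassing the host */
              size_t offset = 0;
              for (int g = 1; g < ngpus; g++) {
                  cudaMemcpyPeer((char *)d_dst_on_gpu0 + offset, 0,
                                 d_src[g], g, bytes[g]);
                  offset += bytes[g];
              }
          }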

  18. BiCGStab performance with multiple GPUs per process
      [bar charts for matrices 100ts and human6: time per iteration vs. number of nodes (4, 8, 16, 32) for the no-GPU, per-GPU, per-socket, and per-node process configurations]
      ◮ on a large number of nodes, inter-GPU communication may be reduced by a multi-GPU implementation with a careful communication scheme
