Accelerating Sparse Cholesky Factorization on the GPU


  1. ACCELERATING SPARSE CHOLESKY FACTORIZATION ON THE GPU. Steve Rennich, Sr. Engineer, NVIDIA Developer Technology; Darko Stosic, PhD Candidate, Univ. Federal de Pernambuco; Tim Davis, Professor, CSE, Texas A&M University.

  2. SPARSE MATRIX FACTORIZATION ON GPUS. Objective: find methods for GPU acceleration of sparse Cholesky factorization; experiments use SuiteSparse 4.4.3 / CHOLMOD. Outline: sparse Cholesky factorization; previous work / issues; the 'branches' approach.

  3. DIRECT SPARSE FACTORIZATION. Dense block Cholesky applied to supernodes (stored in compressed-column form):

     $\begin{pmatrix} A_{11} & A_{21}^T \\ A_{21} & A_{22} \end{pmatrix} = \begin{pmatrix} L_{11} & 0 \\ L_{21} & I \end{pmatrix} \begin{pmatrix} I & 0 \\ 0 & A_{22}^* \end{pmatrix} \begin{pmatrix} L_{11}^T & L_{21}^T \\ 0 & I \end{pmatrix}$

     computed in three steps:
     $L_{11} L_{11}^T = A_{11}$  (POTRF, dense Cholesky)
     $L_{11} L_{21}^T = A_{21}^T$  (TRSM, triangular solve)
     $A_{22}^* = A_{22} - L_{21} L_{21}^T$  (GEMM, Schur complement)
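
A minimal sketch (not the CHOLMOD implementation) of one such dense block step on the GPU using cuSOLVER and cuBLAS, assuming column-major blocks already resident on the device and a pre-allocated workspace; handles, sizes, and error handling are simplified.

```c
/* Sketch of one dense block-Cholesky step (POTRF / TRSM / SYRK) on the GPU.
 * dA11 (n1 x n1), dA21 (n2 x n1), dA22 (n2 x n2) hold the supernode's blocks
 * in column-major order on the device; dWork/dInfo come from the caller. */
#include <cublas_v2.h>
#include <cusolverDn.h>

void block_cholesky_step(cusolverDnHandle_t solver, cublasHandle_t blas,
                         int n1, int n2,
                         double *dA11, int ld11,
                         double *dA21, int ld21,
                         double *dA22, int ld22,
                         double *dWork, int lwork, int *dInfo)
{
    const double one = 1.0, minus_one = -1.0;

    /* L11 * L11^T = A11 : dense Cholesky of the diagonal block */
    cusolverDnDpotrf(solver, CUBLAS_FILL_MODE_LOWER, n1,
                     dA11, ld11, dWork, lwork, dInfo);

    /* L21 = A21 * L11^{-T} : triangular solve for the off-diagonal block */
    cublasDtrsm(blas, CUBLAS_SIDE_RIGHT, CUBLAS_FILL_MODE_LOWER,
                CUBLAS_OP_T, CUBLAS_DIAG_NON_UNIT,
                n2, n1, &one, dA11, ld11, dA21, ld21);

    /* A22* = A22 - L21 * L21^T : symmetric Schur-complement update */
    cublasDsyrk(blas, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_N,
                n2, n1, &minus_one, dA21, ld21, &one, dA22, ld22);
}
```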

  4.-11. DIRECT SPARSE FACTORIZATION. Apply block Cholesky to the supernodes ('left-looking supernodal'). [Animated elimination-tree diagram, supernodes 1-7: POTRF, TRSM, and GEMM are applied supernode by supernode, with fill entries appearing during the GEMM updates.] The bulk of the work is in assembling the supernodes, which involves a wide range of descendant sizes.
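
The loop implied by the animation, as a simplified C-style sketch (not CHOLMOD's code); the Etree/Supernode/Update types and the helper functions are hypothetical placeholders.

```c
/* Left-looking supernodal factorization, schematically:
 * each supernode gathers updates from its descendants, then is factored. */
void left_looking_factor(Etree *tree)
{
    /* visit supernodes in post-order so all descendants are finished first */
    for (int s = 0; s < tree->n_super; s++) {
        Supernode *sn = &tree->super[s];

        /* assembly ("looking left"): GEMM/SYRK updates from every descendant
         * that touches the rows of s, scattered into s (this creates fill) */
        for (int i = 0; i < sn->n_descendants; i++) {
            Update C = update_from_descendant(tree, sn->descendants[i], s);
            assemble_into(sn, &C);
        }

        dense_potrf(sn);   /* factor the diagonal block        (POTRF) */
        dense_trsm(sn);    /* finish the off-diagonal rows of s (TRSM) */
    }
}
```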

  12. DIRECT SPARSE FACTORIZATION. Lots of 'small' math; irregular access patterns. Larger matrices -> more dense math; greater connectivity -> more dense math. Factors can be large (> 128 GB).

  13. PREVIOUS WORK. Just send large BLAS-3 calls to the GPU. This works well for large, dense matrices, but not so well for small matrices or for large matrices with low connectivity (shells / beams in FEA). Goal: find methods for further GPU acceleration of sparse factorization.

  14. PREVIOUS WORK: SuiteSparse (CHOLMOD) 4.4.3. Send appropriately-sized BLAS calls to the GPU, 'hide' PCIe communication, assemble supernodes on the GPU, hybrid (CPU + GPU) computing. Row/column threshold for offloading: ndrow >= 256, ndcol >= 32, plus a supernode score. [Chart: GFlop/s for CPU vs. CPU + GPU over the Florida Sparse Matrix Collection, ordered by decreasing cost to assemble supernodes; roughly 1.5x speedup, with annotations 'why not higher?' for the GPU results and 'why so low?' for the CPU results. System: 2x Xeon E5-2698 v3 + K40 (max boost, ECC=off). http://faculty.cse.tamu.edu/davis/suitesparse.html]
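
A sketch of the size-threshold dispatch described above: only updates whose dimensions exceed the thresholds are worth the PCIe traffic. The constants mirror the slide (ndrow >= 256, ndcol >= 32); the actual scoring logic in CHOLMOD is more involved, so treat this as illustrative only.

```c
/* Decide whether a supernode update is large enough to send to the GPU.
 * Thresholds taken from the slide; the real CHOLMOD heuristic also uses
 * a supernode score, which is omitted here. */
#define GPU_MIN_NDROW 256
#define GPU_MIN_NDCOL  32

static int use_gpu_for_update(int ndrow, int ndcol)
{
    return (ndrow >= GPU_MIN_NDROW) && (ndcol >= GPU_MIN_NDCOL);
}
```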

  15. ISSUES. PCIe communication limits which BLAS operations can be accelerated on the GPU. Small BLAS calls suffer from low occupancy and launch overhead, and most BLAS calls don't get sent to the GPU at all. [Chart: fraction of BLAS work remaining on the CPU for audikw_1.mtx.] Seek methods which better accelerate factorization of small / minimally-connected matrices.

  16. PROPOSED SOLUTION. Factor branches entirely on the GPU, with no use of the CPU: this eliminates PCIe communication, requires POTRF, TRSM & GEMM on the GPU, and has no size restriction; the previous hybrid method is used for the root. Batch and stream BLAS operations within levels: batching amortizes launch overhead, streaming improves occupancy. Maps well to multi-GPU / hybrid computing (see the sketch below). [Diagram: elimination tree split into branches 1-4 below the root, with levels 0, 1, 2 marked within a branch.]
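
A hypothetical sketch of the branch decomposition: descend from the root of the elimination tree until each subtree ('branch') can be factored entirely on the GPU, leaving everything above the cut as the root part. The stopping criterion used here (subtree fits in a GPU memory budget) is an assumption for illustration; the slide does not specify how the cut is chosen, and all helpers are placeholders.

```c
/* Recursively split the elimination tree into GPU-sized branches.
 * Etree, subtree_bytes(), first_child(), next_sibling() are hypothetical. */
void split_into_branches(const Etree *tree, int node,
                         size_t gpu_mem_budget,
                         int *branches, int *n_branches)
{
    if (subtree_bytes(tree, node) <= gpu_mem_budget) {
        branches[(*n_branches)++] = node;   /* whole subtree fits: one branch */
        return;
    }
    /* too big: recurse into the children; 'node' stays in the root part */
    for (int c = first_child(tree, node); c >= 0; c = next_sibling(tree, c))
        split_into_branches(tree, c, gpu_mem_budget, branches, n_branches);
}
```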

  17. BATCHED / STREAMED BLAS. Batch all BLAS calls to amortize kernel launch latency; stream multiple batches to increase occupancy. Implemented by simply wrapping the cuBLAS subroutine with a batch loop. [Timeline diagram, host <-> device, DGEMM example with m,n,k = 16: data on host 100 Mflop/s; data on device 500 Mflop/s; batched 1.2 Gflop/s; batched + streamed 4.8 Gflop/s. DGEMM w/ m,n,k = 16 -> 40 GF.]
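
A sketch of the "wrap cuBLAS with a batch loop" idea: a list of small, independent DGEMMs is issued round-robin across several CUDA streams, so launch latency is amortized and occupancy improves. The array-of-arguments interface and NSTREAMS are illustrative, not the exact interface used in CHOLMOD.

```c
#include <cublas_v2.h>
#include <cuda_runtime.h>

#define NSTREAMS 8

/* Issue 'batch' independent DGEMMs (C[i] = A[i] * B[i]^T), cycling the
 * calls over NSTREAMS CUDA streams via cublasSetStream. */
void batched_streamed_dgemm(cublasHandle_t handle, int batch,
                            const int *m, const int *n, const int *k,
                            double **A, const int *lda,
                            double **B, const int *ldb,
                            double **C, const int *ldc)
{
    cudaStream_t streams[NSTREAMS];
    for (int s = 0; s < NSTREAMS; s++)
        cudaStreamCreate(&streams[s]);

    const double one = 1.0, zero = 0.0;
    for (int i = 0; i < batch; i++) {
        /* round-robin over the streams to keep the GPU busy */
        cublasSetStream(handle, streams[i % NSTREAMS]);
        cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T,
                    m[i], n[i], k[i],
                    &one,  A[i], lda[i],
                           B[i], ldb[i],
                    &zero, C[i], ldc[i]);
    }

    for (int s = 0; s < NSTREAMS; s++) {
        cudaStreamSynchronize(streams[s]);
        cudaStreamDestroy(streams[s]);
    }
}
```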

  18. BATCHED / STREAMED DGEMM. Batched / streamed cuBLAS performance matches MKL for small sizes. Created by wrapping the existing, non-batched routines and passing lists; 64 streams / threads. [Chart: square DGEMM GFlop/s vs. m,n,k from 0 to 500 (y-axis 0-1400), comparing batched/streamed GPU, streamed GPU, and streamed CPU. System: 2x Xeon E5-2698 v3 + K40 (max boost, ECC=off).]

  19. PLENTY OF PARALLELISM. Lower levels of the elimination tree have many supernodes with few descendants each; upper levels have few supernodes with many descendants. [Chart for audikw_1.mtx: number of supernodes and number of GEMM + SYRK operations per level.] The sketch below shows how this per-level work can be batched.
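
A sketch of exploiting the per-level parallelism above: all GEMM updates contributed by descendants of supernodes in one level are gathered into argument lists and issued as a single batch, e.g. with the batched/streamed wrapper sketched after slide 17. The Etree type, MAX_BATCH, and the gather helper are hypothetical.

```c
#define MAX_BATCH 65536

/* Collect every descendant update for one elimination-tree level and
 * launch them as a single batched / streamed DGEMM. */
void batch_level_updates(const Etree *tree, int level, cublasHandle_t handle)
{
    int batch = 0;
    int m[MAX_BATCH], n[MAX_BATCH], k[MAX_BATCH];
    int lda[MAX_BATCH], ldb[MAX_BATCH], ldc[MAX_BATCH];
    double *A[MAX_BATCH], *B[MAX_BATCH], *C[MAX_BATCH];

    for (int s = 0; s < tree->n_super; s++) {
        if (tree->level[s] != level) continue;
        /* one GEMM per descendant of supernode s */
        for (int d = 0; d < tree->n_desc[s]; d++)
            append_gemm_args(tree, s, d, &batch,
                             m, n, k, A, lda, B, ldb, C, ldc);
    }

    /* single batched / streamed launch for the whole level */
    batched_streamed_dgemm(handle, batch, m, n, k, A, lda, B, ldb, C, ldc);
}
```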

  20. BRANCHES. [Diagram: example elimination tree split into branches 1-4.]

      Matrix      | # branches | levels per branch | supernodes per branch | root levels | root supernodes
      Fault_639   |      2     |       18-19       |     14931 - 15794     |      1      |        1
      nd24k       |      2     |        11         |       302 - 325       |      1      |        1
      inline_1    |      4     |       16-17       |      3909 - 10633     |      1      |        1
      Emilia_923  |      4     |       17-18       |     10314 - 11570     |      3      |        4
      boneS10     |      4     |       18-23       |      7045 - 26182     |      1      |        1
      ldoor       |      3     |       19-20       |     17413 - 35704     |      1      |        1
      bone010     |      6     |       16-20       |      1957 - 23610     |      1      |        1
      Hook_1498   |      9     |        1-18       |         1 - 33608     |      3      |        5
      Geo_1438    |      8     |       17-18       |      8102 - 9335      |      5      |        9
      Serena      |     60     |       10-17       |       189 - 4910      |     10      |       60
      audikw_1    |      4     |       17-19       |      5631 - 22300     |      1      |        1
      Flan_1564   |      8     |       15-17       |      3937 - 16309     |      2      |        2

  21. CHOLMOD RESULTS. The 'branches' approach gives a 1.38x average speedup vs. the previous CPU + GPU code and a 2x average speedup vs. the CPU; the poorly performing matrices see the greatest speedup. [Chart: GFlop/s for CHOLMOD 4.4.3 CPU, CPU + GPU, and GPU Branches over the Florida Sparse Matrix Collection (y-axis 0-900). System: 2x Xeon E5-2698 v3 + K40 (max boost, ECC=off). http://faculty.cse.tamu.edu/davis/suitesparse.html]

  22. PCIE DEPENDENCE. Dropping from PCIe gen3 to gen1 cuts bandwidth from 12 GB/s to 3 GB/s (a 75% loss); the 4.4.3 CPU + GPU code loses 23% of its performance, while the branches code loses only 17%. [Chart: GFlop/s over the Florida Sparse Matrix Collection for GPU Branches and 4.4.3 CPU + GPU at PCIe gen1 and gen3 (y-axis 0-900). System: 1x i7 3930K + K40 (max boost, ECC=on).]

  23. SHELL MODEL PERFORMANCE. PCB shell model: 506,082 supernodes; 640 branches with 114-1,730 supernodes and 8-20 levels each; the root branch has 49 levels and 637 supernodes. [Chart: numerical factorization rate (GF/s, 0-450) vs. millions of degrees of freedom (0-12) for 4.4.3 CPU, 4.4.3 CPU + GPU, and Branches 1x K40. System: 2-socket x 16-core HSW E5-2698 v3 @ 2.3 GHz w/ 256 GB + 2x K40 (ECC=ON, full boost). PCB model courtesy of Dr. Serban Georgescu, Fujitsu Laboratories of Europe Ltd.]

  24. SHELL MODEL PERFORMANCE. The 'branches' algorithm is well suited to multi-GPU, and we have ported the previous algorithm to multi-GPU. Going from 1x K40 to 4x K40 gives a 1.5x overall speedup and a 3.1x speedup on the branches portion. [Timeline traces of host <-> device transfers and compute kernels for 1x K40 and 4x K40. Same shell model: 506,082 supernodes; 640 branches with 114-1,730 supernodes and 8-20 levels each; root branch with 49 levels and 637 supernodes.]
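
A hypothetical sketch of how branches can be mapped to multiple GPUs: each OpenMP thread binds one device and factors whole branches independently, with a shared counter handing out branches. factor_branch_on_gpu() stands in for the single-GPU 'branches' code described earlier; this is not the ported implementation itself.

```c
#include <cuda_runtime.h>
#include <omp.h>

void factor_branch_on_gpu(int branch);   /* hypothetical: single-GPU branch factorization */

/* Dynamically distribute n_branches branches over n_gpus devices. */
void factor_branches_multi_gpu(int n_branches, int n_gpus)
{
    int next = 0;
    #pragma omp parallel num_threads(n_gpus)
    {
        cudaSetDevice(omp_get_thread_num());   /* one GPU per thread */
        while (1) {
            int b;
            #pragma omp atomic capture
            b = next++;
            if (b >= n_branches) break;
            factor_branch_on_gpu(b);           /* factor this branch on my GPU */
        }
    }
}
```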

  25. SHELL MODEL PERFORMANCE. Multi-GPU scaling, with projections assuming 87.5% parallel efficiency. [Chart: numerical factorization rate (GF/s, 0-1400) vs. millions of degrees of freedom (0-12) for 4.4.3 CPU, 4.4.3 CPU + GPU, Branches 1x/2x/4x K40, and projected 2x/4x K40 curves. System: 2-socket x 16-core HSW E5-2698 v3 @ 2.3 GHz w/ 256 GB + 2x K40 (ECC=ON, full boost). PCB model courtesy of Dr. Serban Georgescu, Fujitsu Laboratories of Europe Ltd.]

  26. CONCLUSIONS. Factoring 'branches' on the GPU avoids the PCIe bottleneck; batching and streaming permit higher performance on small matrices; the approach is universally beneficial, and aspects of it apply to other factorization methods. Future work: improved performance of batched routines, support for hybrid computing, and complete multi-GPU support.
