Accelerating Sparse Cholesky Factorization on the GPU


  1. ACCELERATING SPARSE CHOLESKY FACTORIZATION ON THE GPU. Steve Rennich, Sr. Engineer, NVIDIA Developer Technology; Darko Stosic, PhD Candidate, Univ. Federal de Pernambuco; Tim Davis, Professor, CSE, Texas A&M University.

  2. SPARSE MATRIX FACTORIZATION ON GPUS. Objective: find methods for GPU acceleration of sparse Cholesky factorization; experiments use SuiteSparse 4.4.3 / CHOLMOD. Outline: sparse Cholesky factorization; previous work / issues; the 'branches' approach.

  3. DIRECT SPARSE FACTORIZATION. Dense block Cholesky applied to supernodes (stored in compressed-column form):

     $\begin{pmatrix} A_{11} & A_{21}^T \\ A_{21} & A_{22} \end{pmatrix} = \begin{pmatrix} L_{11} & 0 \\ L_{21} & I \end{pmatrix} \begin{pmatrix} I & 0 \\ 0 & A_{22}^* \end{pmatrix} \begin{pmatrix} L_{11}^T & L_{21}^T \\ 0 & I \end{pmatrix}$

     computed in three steps:
     $L_{11} L_{11}^T = A_{11}$  (POTRF, dense Cholesky)
     $L_{11} L_{21}^T = A_{21}^T$  (TRSM, triangular solve)
     $A_{22}^* = A_{22} - L_{21} L_{21}^T$  (GEMM, Schur complement)
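
A minimal sketch (not the CHOLMOD implementation) of one such dense block step on the GPU using cuSOLVER and cuBLAS, assuming column-major blocks already resident on the device and a pre-allocated workspace; handles, sizes, and error handling are simplified.

```c
/* Sketch of one dense block-Cholesky step (POTRF / TRSM / SYRK) on the GPU.
 * dA11 (n1 x n1), dA21 (n2 x n1), dA22 (n2 x n2) hold the supernode's blocks
 * in column-major order on the device; dWork/dInfo come from the caller. */
#include <cublas_v2.h>
#include <cusolverDn.h>

void block_cholesky_step(cusolverDnHandle_t solver, cublasHandle_t blas,
                         int n1, int n2,
                         double *dA11, int ld11,
                         double *dA21, int ld21,
                         double *dA22, int ld22,
                         double *dWork, int lwork, int *dInfo)
{
    const double one = 1.0, minus_one = -1.0;

    /* L11 * L11^T = A11 : dense Cholesky of the diagonal block */
    cusolverDnDpotrf(solver, CUBLAS_FILL_MODE_LOWER, n1,
                     dA11, ld11, dWork, lwork, dInfo);

    /* L21 = A21 * L11^{-T} : triangular solve for the off-diagonal block */
    cublasDtrsm(blas, CUBLAS_SIDE_RIGHT, CUBLAS_FILL_MODE_LOWER,
                CUBLAS_OP_T, CUBLAS_DIAG_NON_UNIT,
                n2, n1, &one, dA11, ld11, dA21, ld21);

    /* A22* = A22 - L21 * L21^T : symmetric Schur-complement update */
    cublasDsyrk(blas, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_N,
                n2, n1, &minus_one, dA21, ld21, &one, dA22, ld22);
}
```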

  4.-11. DIRECT SPARSE FACTORIZATION. Apply block Cholesky to the supernodes ('left-looking supernodal'). [Animated elimination-tree diagram, supernodes 1-7: POTRF, TRSM, and GEMM are applied supernode by supernode, with fill entries appearing during the GEMM updates.] The bulk of the work is in assembling the supernodes, which involves a wide range of descendant sizes.
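
The loop implied by the animation, as a simplified C-style sketch (not CHOLMOD's code); the Etree/Supernode/Update types and the helper functions are hypothetical placeholders.

```c
/* Left-looking supernodal factorization, schematically:
 * each supernode gathers updates from its descendants, then is factored. */
void left_looking_factor(Etree *tree)
{
    /* visit supernodes in post-order so all descendants are finished first */
    for (int s = 0; s < tree->n_super; s++) {
        Supernode *sn = &tree->super[s];

        /* assembly ("looking left"): GEMM/SYRK updates from every descendant
         * that touches the rows of s, scattered into s (this creates fill) */
        for (int i = 0; i < sn->n_descendants; i++) {
            Update C = update_from_descendant(tree, sn->descendants[i], s);
            assemble_into(sn, &C);
        }

        dense_potrf(sn);   /* factor the diagonal block        (POTRF) */
        dense_trsm(sn);    /* finish the off-diagonal rows of s (TRSM) */
    }
}
```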

  12. DIRECT SPARSE FACTORIZATION. Lots of 'small' math; irregular access patterns. Larger matrices -> more dense math; greater connectivity -> more dense math. Factors can be large (> 128 GB).

  13. PREVIOUS WORK. Just send large BLAS-3 calls to the GPU. This works well for large, dense matrices, but not so well for small matrices or for large matrices with low connectivity (shells / beams in FEA). Goal: find methods for further GPU acceleration of sparse factorization.

  14. PREVIOUS WORK: SuiteSparse (CHOLMOD) 4.4.3. Send appropriately-sized BLAS calls to the GPU, 'hide' PCIe communication, assemble supernodes on the GPU, hybrid (CPU + GPU) computing. Row/column threshold for offloading: ndrow >= 256, ndcol >= 32, plus a supernode score. [Chart: GFlop/s for CPU vs. CPU + GPU over the Florida Sparse Matrix Collection, ordered by decreasing cost to assemble supernodes; roughly 1.5x speedup, with annotations 'why not higher?' for the GPU results and 'why so low?' for the CPU results. System: 2x Xeon E5-2698 v3 + K40 (max boost, ECC=off). http://faculty.cse.tamu.edu/davis/suitesparse.html]
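
A sketch of the size-threshold dispatch described above: only updates whose dimensions exceed the thresholds are worth the PCIe traffic. The constants mirror the slide (ndrow >= 256, ndcol >= 32); the actual scoring logic in CHOLMOD is more involved, so treat this as illustrative only.

```c
/* Decide whether a supernode update is large enough to send to the GPU.
 * Thresholds taken from the slide; the real CHOLMOD heuristic also uses
 * a supernode score, which is omitted here. */
#define GPU_MIN_NDROW 256
#define GPU_MIN_NDCOL  32

static int use_gpu_for_update(int ndrow, int ndcol)
{
    return (ndrow >= GPU_MIN_NDROW) && (ndcol >= GPU_MIN_NDCOL);
}
```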

  15. ISSUES. PCIe communication limits which BLAS operations can be accelerated on the GPU. Small BLAS calls suffer from low occupancy and launch overhead, and most BLAS calls don't get sent to the GPU at all. [Chart: fraction of BLAS work remaining on the CPU for audikw_1.mtx.] Seek methods which better accelerate factorization of small / minimally-connected matrices.

  16. PROPOSED SOLUTION. Factor branches entirely on the GPU, with no use of the CPU: this eliminates PCIe communication, requires POTRF, TRSM & GEMM on the GPU, and has no size restriction; the previous hybrid method is used for the root. Batch and stream BLAS operations within levels: batching amortizes launch overhead, streaming improves occupancy. Maps well to multi-GPU / hybrid computing (see the sketch below). [Diagram: elimination tree split into branches 1-4 below the root, with levels 0, 1, 2 marked within a branch.]
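
A hypothetical sketch of the branch decomposition: descend from the root of the elimination tree until each subtree ('branch') can be factored entirely on the GPU, leaving everything above the cut as the root part. The stopping criterion used here (subtree fits in a GPU memory budget) is an assumption for illustration; the slide does not specify how the cut is chosen, and all helpers are placeholders.

```c
/* Recursively split the elimination tree into GPU-sized branches.
 * Etree, subtree_bytes(), first_child(), next_sibling() are hypothetical. */
void split_into_branches(const Etree *tree, int node,
                         size_t gpu_mem_budget,
                         int *branches, int *n_branches)
{
    if (subtree_bytes(tree, node) <= gpu_mem_budget) {
        branches[(*n_branches)++] = node;   /* whole subtree fits: one branch */
        return;
    }
    /* too big: recurse into the children; 'node' stays in the root part */
    for (int c = first_child(tree, node); c >= 0; c = next_sibling(tree, c))
        split_into_branches(tree, c, gpu_mem_budget, branches, n_branches);
}
```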

  17. BATCHED / STREAMED BLAS. Batch all BLAS calls to amortize kernel launch latency; stream multiple batches to increase occupancy. Implemented by simply wrapping the cuBLAS subroutine with a batch loop. [Timeline diagram, host <-> device, DGEMM example with m,n,k = 16: data on host 100 Mflop/s; data on device 500 Mflop/s; batched 1.2 Gflop/s; batched + streamed 4.8 Gflop/s. DGEMM w/ m,n,k = 16 -> 40 GF.]
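
A sketch of the "wrap cuBLAS with a batch loop" idea: a list of small, independent DGEMMs is issued round-robin across several CUDA streams, so launch latency is amortized and occupancy improves. The array-of-arguments interface and NSTREAMS are illustrative, not the exact interface used in CHOLMOD.

```c
#include <cublas_v2.h>
#include <cuda_runtime.h>

#define NSTREAMS 8

/* Issue 'batch' independent DGEMMs (C[i] = A[i] * B[i]^T), cycling the
 * calls over NSTREAMS CUDA streams via cublasSetStream. */
void batched_streamed_dgemm(cublasHandle_t handle, int batch,
                            const int *m, const int *n, const int *k,
                            double **A, const int *lda,
                            double **B, const int *ldb,
                            double **C, const int *ldc)
{
    cudaStream_t streams[NSTREAMS];
    for (int s = 0; s < NSTREAMS; s++)
        cudaStreamCreate(&streams[s]);

    const double one = 1.0, zero = 0.0;
    for (int i = 0; i < batch; i++) {
        /* round-robin over the streams to keep the GPU busy */
        cublasSetStream(handle, streams[i % NSTREAMS]);
        cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T,
                    m[i], n[i], k[i],
                    &one,  A[i], lda[i],
                           B[i], ldb[i],
                    &zero, C[i], ldc[i]);
    }

    for (int s = 0; s < NSTREAMS; s++) {
        cudaStreamSynchronize(streams[s]);
        cudaStreamDestroy(streams[s]);
    }
}
```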

  18. BATCHED / STREAMED DGEMM. Batched / streamed cuBLAS performance matches MKL for small sizes. Created by wrapping the existing, non-batched routines and passing lists; 64 streams / threads. [Chart: square DGEMM GFlop/s vs. m,n,k from 0 to 500 (y-axis 0-1400), comparing batched/streamed GPU, streamed GPU, and streamed CPU. System: 2x Xeon E5-2698 v3 + K40 (max boost, ECC=off).]

  19. PLENTY OF PARALLELISM. Lower levels of the elimination tree have many supernodes with few descendants each; upper levels have few supernodes with many descendants. [Chart for audikw_1.mtx: number of supernodes and number of GEMM + SYRK operations per level.] The sketch below shows how this per-level work can be batched.
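
A sketch of exploiting the per-level parallelism above: all GEMM updates contributed by descendants of supernodes in one level are gathered into argument lists and issued as a single batch, e.g. with the batched/streamed wrapper sketched after slide 17. The Etree type, MAX_BATCH, and the gather helper are hypothetical.

```c
#define MAX_BATCH 65536

/* Collect every descendant update for one elimination-tree level and
 * launch them as a single batched / streamed DGEMM. */
void batch_level_updates(const Etree *tree, int level, cublasHandle_t handle)
{
    int batch = 0;
    int m[MAX_BATCH], n[MAX_BATCH], k[MAX_BATCH];
    int lda[MAX_BATCH], ldb[MAX_BATCH], ldc[MAX_BATCH];
    double *A[MAX_BATCH], *B[MAX_BATCH], *C[MAX_BATCH];

    for (int s = 0; s < tree->n_super; s++) {
        if (tree->level[s] != level) continue;
        /* one GEMM per descendant of supernode s */
        for (int d = 0; d < tree->n_desc[s]; d++)
            append_gemm_args(tree, s, d, &batch,
                             m, n, k, A, lda, B, ldb, C, ldc);
    }

    /* single batched / streamed launch for the whole level */
    batched_streamed_dgemm(handle, batch, m, n, k, A, lda, B, ldb, C, ldc);
}
```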

  20. BRANCHES. [Diagram: example elimination tree split into branches 1-4.]

      Matrix      | # branches | levels per branch | supernodes per branch | root levels | root supernodes
      Fault_639   |      2     |       18-19       |     14931 - 15794     |      1      |        1
      nd24k       |      2     |        11         |       302 - 325       |      1      |        1
      inline_1    |      4     |       16-17       |      3909 - 10633     |      1      |        1
      Emilia_923  |      4     |       17-18       |     10314 - 11570     |      3      |        4
      boneS10     |      4     |       18-23       |      7045 - 26182     |      1      |        1
      ldoor       |      3     |       19-20       |     17413 - 35704     |      1      |        1
      bone010     |      6     |       16-20       |      1957 - 23610     |      1      |        1
      Hook_1498   |      9     |        1-18       |         1 - 33608     |      3      |        5
      Geo_1438    |      8     |       17-18       |      8102 - 9335      |      5      |        9
      Serena      |     60     |       10-17       |       189 - 4910      |     10      |       60
      audikw_1    |      4     |       17-19       |      5631 - 22300     |      1      |        1
      Flan_1564   |      8     |       15-17       |      3937 - 16309     |      2      |        2

  21. CHOLMOD RESULTS. The 'branches' approach gives a 1.38x average speedup vs. the previous CPU + GPU code and a 2x average speedup vs. the CPU; the poorly performing matrices see the greatest speedup. [Chart: GFlop/s for CHOLMOD 4.4.3 CPU, CPU + GPU, and GPU Branches over the Florida Sparse Matrix Collection (y-axis 0-900). System: 2x Xeon E5-2698 v3 + K40 (max boost, ECC=off). http://faculty.cse.tamu.edu/davis/suitesparse.html]

  22. PCIE DEPENDENCE. Dropping from PCIe gen3 to gen1 cuts bandwidth from 12 GB/s to 3 GB/s (a 75% loss); the 4.4.3 CPU + GPU code loses 23% of its performance, while the branches code loses only 17%. [Chart: GFlop/s over the Florida Sparse Matrix Collection for GPU Branches and 4.4.3 CPU + GPU at PCIe gen1 and gen3 (y-axis 0-900). System: 1x i7 3930K + K40 (max boost, ECC=on).]

  23. SHELL MODEL PERFORMANCE. PCB shell model: 506,082 supernodes; 640 branches with 114-1,730 supernodes and 8-20 levels each; the root branch has 49 levels and 637 supernodes. [Chart: numerical factorization rate (GF/s, 0-450) vs. millions of degrees of freedom (0-12) for 4.4.3 CPU, 4.4.3 CPU + GPU, and Branches 1x K40. System: 2-socket x 16-core HSW E5-2698 v3 @ 2.3 GHz w/ 256 GB + 2x K40 (ECC=ON, full boost). PCB model courtesy of Dr. Serban Georgescu, Fujitsu Laboratories of Europe Ltd.]

  24. SHELL MODEL PERFORMANCE. The 'branches' algorithm is well suited to multi-GPU, and we have ported the previous algorithm to multi-GPU. Going from 1x K40 to 4x K40 gives a 1.5x overall speedup and a 3.1x speedup on the branches portion. [Timeline traces of host <-> device transfers and compute kernels for 1x K40 and 4x K40. Same shell model: 506,082 supernodes; 640 branches with 114-1,730 supernodes and 8-20 levels each; root branch with 49 levels and 637 supernodes.]
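
A hypothetical sketch of how branches can be mapped to multiple GPUs: each OpenMP thread binds one device and factors whole branches independently, with a shared counter handing out branches. factor_branch_on_gpu() stands in for the single-GPU 'branches' code described earlier; this is not the ported implementation itself.

```c
#include <cuda_runtime.h>
#include <omp.h>

void factor_branch_on_gpu(int branch);   /* hypothetical: single-GPU branch factorization */

/* Dynamically distribute n_branches branches over n_gpus devices. */
void factor_branches_multi_gpu(int n_branches, int n_gpus)
{
    int next = 0;
    #pragma omp parallel num_threads(n_gpus)
    {
        cudaSetDevice(omp_get_thread_num());   /* one GPU per thread */
        while (1) {
            int b;
            #pragma omp atomic capture
            b = next++;
            if (b >= n_branches) break;
            factor_branch_on_gpu(b);           /* factor this branch on my GPU */
        }
    }
}
```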

  25. SHELL MODEL PERFORMANCE. Multi-GPU scaling, with projections assuming 87.5% parallel efficiency. [Chart: numerical factorization rate (GF/s, 0-1400) vs. millions of degrees of freedom (0-12) for 4.4.3 CPU, 4.4.3 CPU + GPU, Branches 1x/2x/4x K40, and projected 2x/4x K40 curves. System: 2-socket x 16-core HSW E5-2698 v3 @ 2.3 GHz w/ 256 GB + 2x K40 (ECC=ON, full boost). PCB model courtesy of Dr. Serban Georgescu, Fujitsu Laboratories of Europe Ltd.]

  26. CONCLUSIONS. Factoring 'branches' on the GPU avoids the PCIe bottleneck; batching and streaming permit higher performance on small matrices; the approach is universally beneficial, and aspects of it apply to other factorization methods. Future work: improved performance of batched routines, support for hybrid computing, and complete multi-GPU support.
