block solvers for multiple right hand sides on nvidia gpus
play

BLOCK SOLVERS FOR MULTIPLE RIGHT HAND SIDES ON NVIDIA GPUS Mathias - PowerPoint PPT Presentation

BLOCK SOLVERS FOR MULTIPLE RIGHT HAND SIDES ON NVIDIA GPUS Mathias Wagner, Lattice 2016 QUDA update AGENDA Dslash for multiple rhs Block Krylov solvers 2 UPDATE ON QUDA QUDA QCD on CUDA open source, BSD license Effort started at


  1. BLOCK SOLVERS FOR MULTIPLE RIGHT HAND SIDES ON NVIDIA GPUS Mathias Wagner, Lattice 2016

  2. QUDA update AGENDA Dslash for multiple rhs Block Krylov solvers 2

  3. UPDATE ON QUDA

  4. QUDA “QCD on CUDA” – open source, BSD license Effort started at Boston University in 2008, now in wide use as the GPU backend for BQCD, Chroma, CPS, MILC, TIFR, etc. Latest release 0.8.0 (8th February 2016) Provides: Various solvers for all major fermionic discretizations, with multi-GPU support Additional performance-critical routines needed for gauge-field generation Maximize performance Exploit physical symmetries to minimize memory traffic Mixed-precision methods Autotuning for high performance on all CUDA-capable architectures Domain-decomposed (Schwarz) preconditioners for strong scaling Eigenvector and deflated solvers (Lanczos, EigCG, GMRES-DR) Multigrid solvers for optimal convergence A research tool for how to reach the exascale 4

  5. QUDA - LATTICE QCD ON GPUS http://lattice.github.com/quda lattice / quda Watch 36 Unstar 45 Fork 29 Code Issues 107 Pull requests 2 Wiki Pulse Graphs Settings QUDA is a library for performing calculations in lattice QCD on GPUs. http://lattice.github.com/quda — Edit 4,621 commits 49 branches 19 releases 16 contributors Clone or download Clone or download Branch: develop New pull request Create new file Upload files Find file Latest commit f3e2aa7 a day ago … mathiaswagner committed on GitHub Merge pull request #487 from lattice/hotfix/checkerboard-reference include In ColorSpinorParam, if staggered fermions then set field dimension t… 11 days ago lib Correctly set volumeCB for parity subset references - need to check p… a day ago tests Requesting --test 1 with staggered_dslash_test now tests MdagM operator 11 days ago .gitignore Updates to .gitignore and renamed multigrid_benchmark to multigrid_be… 3 months ago 5 CMakeLists.txt added some comments to CMakeLists.txt 3 months ago

  6. QUDA AUTHORS Ron Babich (NVIDIA) Steve Gottlieb (Indiana University) Michael Baldhauf (Regensburg) Dean Howarth (Rensselaer Polytechnic Institute) Kip Barros (LANL) Bálint Joó (Jlab) Rich Brower (Boston University) Hyung-Jin Kim (BNL -> Samsung) Nuno Cardoso (Lisbon) Claudio Rebbi (Boston University) Kate Clark (NVIDIA) Guochun Shi (NCSA -> Google) Michael Cheng (Boston University) Mario Schröck (INFN) Carleton DeTar (Utah University) Alexei Strelchenko (FNAL) Justin Foley (Utah -> NIH) Alejandro Vaquero (INFN) Joel Giedt (Rensselaer Polytechnic Institute) Mathias Wagner (NVIDIA) Arjun Gambhir (William and Mary) Frank Winter (Jlab)

  7. SOLVERS FOR MULTIPLE RIGHT HAND SIDES

  8. CONJUGATE GRADIENT just as a reminder procedure CG r (0) = b � Ax (0) p (0) = r (0) for k = 1 , 2 , . . . until converged do z ( k − 1) = Ap ( k − 1) 2 | r ( k − 1) | α ( k − 1) = h ( p ( k − 1) ) ,z ( k − 1) i x ( k ) = x ( k � 1) + α ( k − 1) p ( k − 1) r ( k ) = r ( k − 1) � α ( k − 1) z ( k − 1) 2 β ( k − 1) = | r ( k − 1) | 2 | r ( k ) | p ( k ) = r ( k ) + β ( k − 1) p ( k − 1) end for end procedure 8

  9. QCD PERFORMANCE LIMITERS QCD is memory bandwidth bound Dslash arithmetic intensity for HISQ ~ 0.7 9

  10. QCD PERFORMANCE LIMITERS QCD is memory bandwidth bound Dslash arithmetic intensity for HISQ ~ 0.7 exploit SU(3) symmetry: reconstruct gauge field from 8/12 floats 9

  11. QCD PERFORMANCE LIMITERS QCD is memory bandwidth bound WILSON CLOVER DSLASH Dslash arithmetic intensity for HISQ ~ 0.7 Volume = 32 4 GFLOPS Reconstruct 8 Reconstruct 12 Reconstruct 18 2500 exploit SU(3) symmetry: 2000 reconstruct gauge field from 8/12 floats 1500 1000 500 0 K80 P100 K80 P100 K80 P100 Half Half Single Single Double Double 9

  12. QCD PERFORMANCE LIMITERS QCD is memory bandwidth bound Dslash arithmetic intensity for HISQ ~ 0.7 exploit SU(3) symmetry: reconstruct gauge field from 8/12 floats Smearing kills symmetry: stuck with 18 floats 9

  13. QCD PERFORMANCE LIMITERS QCD is memory bandwidth bound Dslash arithmetic intensity for HISQ ~ 0.7 exploit SU(3) symmetry: reconstruct gauge field from 8/12 floats Smearing kills symmetry: stuck with 18 floats Reuse gauge field for multiple rhs 9

  14. QCD PERFORMANCE LIMITERS reconstruct QCD is memory bandwidth bound 18 13 9 2.8 Dslash arithmetic intensity for HISQ ~ 0.7 arithmetic intensity 2.1 exploit SU(3) symmetry: reconstruct gauge field from 8/12 floats 1.4 Smearing kills symmetry: stuck with 18 floats 0.7 Reuse gauge field for multiple rhs 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 9 # rhs

  15. TESLA PASCAL P100 5.3 TF DP · 10.6 TF SP · 21 TF HP Tesla P100 720 GB/sec Memory Bandwidth for NVLink-enabled Servers 16 GB HBM2 4.7 TF DP · 9.3 TF SP · 18.7 TF HP Tesla P100 Config 1: 16 GB (HBM2), 720 GB/sec for PCIe-Based Servers Config 2: 12 GB (HBM2), 540 GB/sec 10

  16. TITAN X NVIDIA TITAN X 11 TF SP 480 GB/sec Memory Bandwidth 12 GB GDDR5X Pascal architecture 11.0 TF SP 480 GB/sec Memory Bandwidth 12 GB GDDR5X 11

  17. MULTI-SRC DSLASH ON PASCAL Volume 24 4 , HISQ, tuned gauge reconstruct P100 double P100 single Titan X single 2000 1500 1000 500 GFlOP/s 0 1 2 3 4 5 6 8 10 12 14 16 # rhs 12

  18. MULTI-SRC DSLASH ON PASCAL Volume 24 4 , HISQ, tuned gauge reconstruct P100 double P100 single Titan X single 2000 1500 1000 500 GFlOP/s 0 1 2 3 4 5 6 8 10 12 14 16 # rhs 12

  19. MULTI-SRC DSLASH ON PASCAL Volume 24 4 , HISQ, tuned gauge reconstruct P100 double P100 single Titan X single 2000 1500 1000 500 GFlOP/s 0 1 2 3 4 5 6 8 10 12 14 16 # rhs 12

  20. MULTI-SRC DSLASH ON PASCAL Volume 24 4 , HISQ, tuned gauge reconstruct P100 double P100 single Titan X single 2000 1500 1000 500 GFlOP/s 0 1 2 3 4 5 6 8 10 12 14 16 # rhs 12

  21. CONJUGATE GRADIENT using multi-src Dslash procedure CG WITH MULTI - SRC D SLASH r i = b i � Ax (0) i p (0) = r (0) i i for k = 1 , 2 , . . . until converged do exploit multi-src Dslash performance n o z ( k − 1) p ( k − 1) � = A i α ( k − 1) = | r ( k − 1) | 2 / h ( p ( k − 1) ) , z ( k − 1) i i i i i do all the linear algebra for each rhs x ( k ) = x ( k − 1) + α ( k − 1) p ( k − 1) i i i i r ( k ) = r ( k − 1) � α ( k − 1) z ( k − 1) i i i i β ( k − 1) = | r ( k − 1) | 2 / | r ( k ) | 2 i i i p ( k ) = r ( k ) + β ( k − 1) p ( k − 1) same iteration count as CG i i i i end for end procedure 13

  22. MULTI-SRC CG ON PASCAL Volume 24 4 , HISQ P100 single 1500 1000 500 GFlOP/s 0 1 2 3 4 5 6 8 10 12 14 16 # rhs 14

  23. BLOCK KRYLOV SPACE SOLVERS

  24. BLOCK KRYLOV SOLVERS Share the Krylov space BlockCG solver suggested by O’Leary in early 80’s retooled BlockCG by Dubrulle 2001 In exact precision converges in N / rhs iterations 16

  25. BLOCK KRYLOV SOLVERS Share the Krylov space BlockCG solver suggested by O’Leary in early 80’s retooled BlockCG by Dubrulle 2001 In exact precision converges in N / rhs iterations Nakamura et al., CPC 183 (2012) 34–37 Application in QCD: Nakamura et. (modified block BiCGStab) Birk and Frommer (block methods, including block methods for multi shift) 16

  26. BLOCK CG share Krylov space between multiple rhs procedure B LOCK CG R (0) = B − AX (0) P (0) = R (0) for k = 1 , 2 , . . . until converged do Z ( k − 1) = AP ( k − 1) ( P ( k − 1) ) H Z ( k − 1) ⇤ − 1 ( R ( k − 1) ) H R ( k − 1) α ( k − 1) = ⇥ X ( k ) = X ( k − 1) + P ( k − 1) α ( k − 1) R ( k ) = R ( k − 1) − Z ( k − 1) α ( k − 1) ( R ( k − 1) ) H R ( k − 1) ⇤ − 1 ( R ( k ) ) H R ( k ) β ( k − 1) = ⇥ P ( k ) = R ( k ) − P ( k − 1) β ( k − 1) end for end procedure 17

  27. BLOCK CG share Krylov space between multiple rhs procedure B LOCK CG R (0) = B − AX (0) n n P (0) = R (0) for k = 1 , 2 , . . . until converged do 1 m Z ( k − 1) = AP ( k − 1) ( P ( k − 1) ) H Z ( k − 1) ⇤ − 1 ( R ( k − 1) ) H R ( k − 1) α ( k − 1) = ⇥ X ( k ) = X ( k − 1) + P ( k − 1) α ( k − 1) R ( k ) = R ( k − 1) − Z ( k − 1) α ( k − 1) ( R ( k − 1) ) H R ( k − 1) ⇤ − 1 ( R ( k ) ) H R ( k ) β ( k − 1) = ⇥ P ( k ) = R ( k ) − P ( k − 1) β ( k − 1) end for end procedure 17

  28. BLOCK CG share Krylov space between multiple rhs procedure B LOCK CG R (0) = B − AX (0) n n P (0) = R (0) for k = 1 , 2 , . . . until converged do 1 m Z ( k − 1) = AP ( k − 1) = ( P ( k − 1) ) H Z ( k − 1) ⇤ − 1 ( R ( k − 1) ) H R ( k − 1) α ( k − 1) = ⇥ X ( k ) = X ( k − 1) + P ( k − 1) α ( k − 1) R ( k ) = R ( k − 1) − Z ( k − 1) α ( k − 1) ( R ( k − 1) ) H R ( k − 1) ⇤ − 1 ( R ( k ) ) H R ( k ) β ( k − 1) = ⇥ P ( k ) = R ( k ) − P ( k − 1) β ( k − 1) end for end procedure 17

  29. BLOCK CG share Krylov space between multiple rhs procedure B LOCK CG R (0) = B − AX (0) n n P (0) = R (0) for k = 1 , 2 , . . . until converged do 1 m Z ( k − 1) = AP ( k − 1) = ( P ( k − 1) ) H Z ( k − 1) ⇤ − 1 ( R ( k − 1) ) H R ( k − 1) α ( k − 1) = ⇥ X ( k ) = X ( k − 1) + P ( k − 1) α ( k − 1) R ( k ) = R ( k − 1) − Z ( k − 1) α ( k − 1) = + ( R ( k − 1) ) H R ( k − 1) ⇤ − 1 ( R ( k ) ) H R ( k ) β ( k − 1) = ⇥ P ( k ) = R ( k ) − P ( k − 1) β ( k − 1) end for end procedure = + 17

Recommend


More recommend