SPEEDING UP CONJUGATE GRADIENT SOLVERS BY 10X Mathias Wagner, Developer Technology Engineer GTC 2017 - S7387
HPC BEYOND MOORE’S LAW going wide 4000 CPUs and GPUs becoming wider 3000 increase in flops is driven by more cores CUDA cores 2000 also applies for CPUs (Server to mobile) 1000 need sufficient amount of parallelism to fill architectures 0 Tesla Fermi Kepler Maxwell Pascal 2
HPC BEYOND MOORE’S LAW staying local 10 Tesla GPU Intel CPU Intel MIC CPUs and GPUs becoming wider 8 increase in flops is driven by more 6 cores 4 need sufficient amount of parallelism to fill architectures FLOPS Byte 2 0 3
Lattice Quantum Chromodynamics Classic Conjugate Gradient Solver AGENDA Block Krylov solvers Efficient implementation Time to solution improvements in the wild 4
LATTICE QCD: SOME BASICS Z QCD partition function DAD ¯ Ψ D Ψ e − S E ( T,µ ) Z QCD ( T, µ ) = 4 dimensional grid (=Lattice) includes integral over space and time quarks live on lattice sites gluons live on the links typical sizes: 24 3 x 6 to 256 4 parallelization over lattice sites (10 5 to 10 9 ) 6
QUDA - LATTICE QCD ON GPUS http://lattice.github.com/quda lattice / quda Watch 36 Unstar 45 Fork 29 Code Issues 107 Pull requests 2 Wiki Pulse Graphs Settings QUDA is a library for performing calculations in lattice QCD on GPUs. http://lattice.github.com/quda — Edit 4,621 commits 49 branches 19 releases 16 contributors Clone or download Clone or download Branch: develop New pull request Create new file Upload files Find file Latest commit f3e2aa7 a day ago … mathiaswagner committed on GitHub Merge pull request #487 from lattice/hotfix/checkerboard-reference include In ColorSpinorParam, if staggered fermions then set field dimension t… 11 days ago lib Correctly set volumeCB for parity subset references - need to check p… a day ago tests Requesting --test 1 with staggered_dslash_test now tests MdagM operator 11 days ago .gitignore Updates to .gitignore and renamed multigrid_benchmark to multigrid_be… 3 months ago 7 CMakeLists.txt added some comments to CMakeLists.txt 3 months ago
QUDA AUTHORS Green: Contributors to this talk Ron Babich (NVIDIA) Dean Howarth (Rensselaer Polytechnic Institute) Michael Baldhauf (Regensburg) Bálint Joó (Jlab) Kip Barros (LANL) Hyung-Jin Kim (BNL -> Samsung) Rich Brower (Boston University) Claudio Rebbi (Boston University) Nuno Cardoso (Lisbon) Guochun Shi (NCSA -> Google) Kate Clark (NVIDIA) Mario Schröck (INFN) Michael Cheng (Boston University) Alexei Strelchenko (FNAL) Carleton DeTar (Utah University) Alejandro Vaquero (Utah University) Justin Foley (NIH) Mathias Wagner (NVIDIA) Joel Giedt (Rensselaer Polytechnic Institute) Evan Weinberg (Boston University) Arjun Gambhir (William and Mary) Frank Winter (Jlab) Steve Gottlieb (Indiana University)
THE OLD WORK HORSE
CONJUGATE GRADIENT procedure CG r (0) = b � Ax (0) p (0) = r (0) for k = 1 , 2 , . . . until converged do z ( k − 1) = Ap ( k − 1) 2 | r ( k − 1) | α ( k − 1) = h ( p ( k − 1) ) ,z ( k − 1) i x ( k ) = x ( k � 1) + α ( k − 1) p ( k − 1) r ( k ) = r ( k − 1) � α ( k − 1) z ( k − 1) 2 β ( k − 1) = | r ( k − 1) | 2 | r ( k ) | p ( k ) = r ( k ) + β ( k − 1) p ( k − 1) end for end procedure 10
CONJUGATE GRADIENT procedure CG r (0) = b � Ax (0) p (0) = r (0) matrix-vector operation dominates runtime for k = 1 , 2 , . . . until converged do z ( k − 1) = Ap ( k − 1) 2 | r ( k − 1) | α ( k − 1) = h ( p ( k − 1) ) ,z ( k − 1) i x ( k ) = x ( k � 1) + α ( k − 1) p ( k − 1) r ( k ) = r ( k − 1) � α ( k − 1) z ( k − 1) 2 β ( k − 1) = | r ( k − 1) | 2 | r ( k ) | p ( k ) = r ( k ) + β ( k − 1) p ( k − 1) end for end procedure 10
CONJUGATE GRADIENT procedure CG r (0) = b � Ax (0) p (0) = r (0) matrix-vector operation dominates runtime for k = 1 , 2 , . . . until converged do z ( k − 1) = Ap ( k − 1) 2 | r ( k − 1) | α ( k − 1) = caxpy BLAS operations caxpy BLAS operations h ( p ( k − 1) ) ,z ( k − 1) i x ( k ) = x ( k � 1) + α ( k − 1) p ( k − 1) r ( k ) = r ( k − 1) � α ( k − 1) z ( k − 1) 2 β ( k − 1) = | r ( k − 1) | 2 | r ( k ) | p ( k ) = r ( k ) + β ( k − 1) p ( k − 1) end for end procedure 10
CONJUGATE GRADIENT procedure CG r (0) = b � Ax (0) p (0) = r (0) matrix-vector operation dominates runtime for k = 1 , 2 , . . . until converged do z ( k − 1) = Ap ( k − 1) 2 | r ( k − 1) | α ( k − 1) = caxpy BLAS operations caxpy BLAS operations h ( p ( k − 1) ) ,z ( k − 1) i x ( k ) = x ( k � 1) + α ( k − 1) p ( k − 1) r ( k ) = r ( k − 1) � α ( k − 1) z ( k − 1) 2 β ( k − 1) = | r ( k − 1) | Reductions Reductions 2 | r ( k ) | p ( k ) = r ( k ) + β ( k − 1) p ( k − 1) end for end procedure 10
STAGGERED FERMION MATRIX (DSLASH) Krylov space solve of fermion matrix dominates runtime within inversion application of sparse Matrix (Dslash) dominates (>80%) 11
STAGGERED FERMION MATRIX (DSLASH) Krylov space solve of fermion matrix dominates runtime within inversion application of sparse Matrix (Dslash) dominates (>80%) Highly Improved Staggered Quarks (HISQ) use next and 3rd neighbor stencil 3 hn o n oi X µ − U † µ − N † w x = D x,x 0 v x 0 = U x,µ v x +ˆ µ,µ v x − ˆ N x,µ v x +3ˆ µ,µ v x − 3ˆ + x − ˆ µ x − 3ˆ µ µ =0 complex 3-dim vector complex 3x3 matrix complex 3x3 matrix complex 3-dim vector 24 byte for fp32 72 byte for fp32 72 byte for fp32 24 byte for fp32 11
STAGGERED FERMION MATRIX (DSLASH) Krylov space solve of fermion matrix dominates runtime within inversion application of sparse Matrix (Dslash) dominates (>80%) Highly Improved Staggered Quarks (HISQ) use next and 3rd neighbor stencil 3 hn o n oi X µ − U † µ − N † w x = D x,x 0 v x 0 = U x,µ v x +ˆ µ,µ v x − ˆ N x,µ v x +3ˆ µ,µ v x − 3ˆ + x − ˆ µ x − 3ˆ µ µ =0 complex 3-dim vector complex 3x3 matrix complex 3x3 matrix complex 3-dim vector 24 byte for fp32 72 byte for fp32 72 byte for fp32 24 byte for fp32 each site (x) loads 1024 bytes for links and 384 bytes for vectors, stores 24 bytes: total 1432 bytes / site performs 1146 flop: arithmetic intensity: 0.8 flop/byte 11
STAGGERED FERMION MATRIX (DSLASH) Krylov space solve of fermion matrix dominates runtime within inversion application of sparse Matrix (Dslash) dominates (>80%) Highly Improved Staggered Quarks (HISQ) use next and 3rd neighbor stencil 3 hn o n oi X µ − U † µ − N † w x = D x,x 0 v x 0 = U x,µ v x +ˆ µ,µ v x − ˆ N x,µ v x +3ˆ µ,µ v x − 3ˆ + x − ˆ µ x − 3ˆ µ µ =0 complex 3-dim vector complex 3x3 matrix complex 3x3 matrix complex 3-dim vector 24 byte for fp32 72 byte for fp32 72 byte for fp32 24 byte for fp32 each site (x) loads 1024 bytes for links and 384 bytes for vectors, stores 24 bytes: total 1432 bytes / site performs 1146 flop: arithmetic intensity: 0.8 flop/byte sensitive to memory bandwidth 11
THE NEW TRACTOR
BLOCK KRYLOV SPACE SOLVERS Share the Krylov space BlockCG solver suggested by O’Leary in early 80’s retooled BlockCG by Dubrulle 2001 In exact precision converges in N / rhs iterations 13
BLOCK KRYLOV SPACE SOLVERS Share the Krylov space BlockCG solver suggested by O’Leary in early 80’s retooled BlockCG by Dubrulle 2001 In exact precision converges in N / rhs iterations Nakamura et al., CPC 183 (2012) 34–37 Application in QCD: Nakamura et. (modified block BiCGStab) Birk and Frommer (block methods, including block methods for multi shift) 13
BLOCK CG share Krylov space between multiple rhs procedure B LOCK CG R (0) = B − AX (0) P (0) = R (0) for k = 1 , 2 , . . . until converged do Z ( k − 1) = AP ( k − 1) ( P ( k − 1) ) H Z ( k − 1) ⇤ − 1 ( R ( k − 1) ) H R ( k − 1) α ( k − 1) = ⇥ X ( k ) = X ( k − 1) + P ( k − 1) α ( k − 1) R ( k ) = R ( k − 1) − Z ( k − 1) α ( k − 1) ( R ( k − 1) ) H R ( k − 1) ⇤ − 1 ( R ( k ) ) H R ( k ) β ( k − 1) = ⇥ P ( k ) = R ( k ) − P ( k − 1) β ( k − 1) end for end procedure 14
BLOCK CG share Krylov space between multiple rhs procedure B LOCK CG R (0) = B − AX (0) n n P (0) = R (0) for k = 1 , 2 , . . . until converged do 1 m Z ( k − 1) = AP ( k − 1) ( P ( k − 1) ) H Z ( k − 1) ⇤ − 1 ( R ( k − 1) ) H R ( k − 1) α ( k − 1) = ⇥ X ( k ) = X ( k − 1) + P ( k − 1) α ( k − 1) R ( k ) = R ( k − 1) − Z ( k − 1) α ( k − 1) ( R ( k − 1) ) H R ( k − 1) ⇤ − 1 ( R ( k ) ) H R ( k ) β ( k − 1) = ⇥ P ( k ) = R ( k ) − P ( k − 1) β ( k − 1) end for end procedure 14
BLOCK CG share Krylov space between multiple rhs procedure B LOCK CG R (0) = B − AX (0) n n P (0) = R (0) for k = 1 , 2 , . . . until converged do 1 m Z ( k − 1) = AP ( k − 1) = ( P ( k − 1) ) H Z ( k − 1) ⇤ − 1 ( R ( k − 1) ) H R ( k − 1) α ( k − 1) = ⇥ X ( k ) = X ( k − 1) + P ( k − 1) α ( k − 1) R ( k ) = R ( k − 1) − Z ( k − 1) α ( k − 1) ( R ( k − 1) ) H R ( k − 1) ⇤ − 1 ( R ( k ) ) H R ( k ) β ( k − 1) = ⇥ P ( k ) = R ( k ) − P ( k − 1) β ( k − 1) end for end procedure 14
Recommend
More recommend