April 4-7, 2016 | Silicon Valley REVOLUTIONIZING LATTICE QCD PHYSICS WITH HETEROGENEOUS MULTIGRID Kate Clark, April 6th 2016
CONTENTS Introduction to LQCD QUDA Library Adaptive Multigrid QUDA Multigrid Results Conclusion
QUANTUM CHROMODYNAMICS
The strong force is one of the basic forces of nature (along with gravity, electromagnetism, and the weak force)
It is what binds together the quarks and gluons in the proton and the neutron (as well as hundreds of other particles seen in accelerator experiments)
QCD is the theory of the strong force
It's a beautiful theory…

$$\langle \Omega \rangle = \frac{1}{Z} \int [dU]\, e^{-\int d^4x\, \mathcal{L}(U)}\, \Omega(U)$$

…but
[Image: Fermi National Accelerator Laboratory]
LATTICE QUANTUM CHROMODYNAMICS
Theory is highly non-linear ⇒ cannot solve directly
Must resort to numerical methods to make predictions
Lattice QCD:
  Discretize spacetime ⇒ 4-dimensional lattice of size $L_x \times L_y \times L_z \times L_t$
  Make spacetime finite ⇒ periodic boundary conditions
  PDEs ⇒ finite difference equations
A high-precision tool that allows physicists to explore the contents of the nucleus from the comfort of their workstation (supercomputer)
Consumer of 10-20% of public supercomputer cycles
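To make the "discretize and make finite" step concrete, here is a minimal sketch (not QUDA's actual data layout) of lexicographic site indexing on an $L_x \times L_y \times L_z \times L_t$ lattice with periodic wrap-around; the struct and member names are invented for illustration.

```cpp
// Minimal sketch (illustrative only): lexicographic indexing of sites on a
// 4-d lattice of size Lx*Ly*Lz*Lt with periodic boundary conditions.
#include <array>

struct Lattice {
  int L[4];  // extents {Lx, Ly, Lz, Lt}

  // flatten 4-d coordinates (x,y,z,t) into a single site index
  int site_index(const std::array<int, 4> &x) const {
    return x[0] + L[0] * (x[1] + L[1] * (x[2] + L[2] * x[3]));
  }

  // index of the site one hop away in direction mu (sign = +1 or -1),
  // wrapping around the boundary to implement periodicity
  int neighbor(std::array<int, 4> x, int mu, int sign) const {
    x[mu] = (x[mu] + sign + L[mu]) % L[mu];
    return site_index(x);
  }
};
```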
STEPS IN AN LQCD CALCULATION
1. Generate an ensemble of gluon field ("gauge") configurations
   Produced in sequence, with hundreds needed per ensemble
   Strong scaling required, with O(100 TFLOPS) sustained for several months
   50-90% of the runtime is in the linear solver
   $D^{\alpha\beta}_{ij}(x,y;U)\,\psi^{\beta}_{j}(y) = \eta^{\alpha}_{i}(x)$, or $Ax = b$
2. "Analyze" the configurations
   Can be farmed out, assuming O(1 TFLOPS) per job
   80-99% of the runtime is in the linear solver
   Task parallelism means that clusters reign supreme here
[Figure: lattice with gauge links $U(x)$.]
[Images: the Large Hadron Collider; Brookhaven National Laboratory; lattice QCD results from Davies et al.]
QUDA
• "QCD on CUDA" – http://lattice.github.com/quda (open source)
• Effort started at Boston University in 2008, now in wide use as the GPU backend for BQCD, Chroma, CPS, MILC, TIFR, etc.
  – Latest release 0.8.0 (8th February 2016)
• Provides:
  – Various solvers for all major fermionic discretizations, with multi-GPU support
  – Additional performance-critical routines needed for gauge-field generation
• Maximize performance
  – Exploit physical symmetries to minimize memory traffic
  – Mixed-precision methods
  – Autotuning for high performance on all CUDA-capable architectures
  – Domain-decomposed (Schwarz) preconditioners for strong scaling
  – Eigenvector and deflated solvers (Lanczos, EigCG, GMRES-DR)
  – Multigrid solvers for optimal convergence
• A research tool for how to reach the exascale
QUDA COLLABORATORS (multigrid collaborators in green)
Ron Babich (NVIDIA), Michael Baldhauf (Regensburg), Kip Barros (LANL), Rich Brower (Boston University), Nuno Cardoso (NCSA), Kate Clark (NVIDIA), Michael Cheng (Boston University), Carleton DeTar (Utah University), Justin Foley (Utah -> NIH), Joel Giedt (Rensselaer Polytechnic Institute), Arjun Gambhir (William and Mary), Steve Gottlieb (Indiana University), Dean Howarth (Rensselaer Polytechnic Institute), Bálint Joó (Jlab), Hyung-Jin Kim (BNL -> Samsung), Claudio Rebbi (Boston University), Guochun Shi (NCSA -> Google), Mario Schröck (INFN), Alexei Strelchenko (FNAL), Alejandro Vaquero (INFN), Mathias Wagner (NVIDIA), Frank Winter (Jlab)
THE DIRAC OPERATOR
Quark interactions are described by the Dirac operator
First-order PDE in a background gauge field
Large sparse matrix:

$$M_{x,x'} = -\frac{1}{2}\sum_{\mu=1}^{4}\left( P^{-\mu} \otimes U_x^{\mu}\,\delta_{x+\hat\mu,x'} + P^{+\mu} \otimes U_{x-\hat\mu}^{\mu\,\dagger}\,\delta_{x-\hat\mu,x'} \right) + (4 + m + A_x)\,\delta_{x,x'} \;\equiv\; -\frac{1}{2} D_{x,x'} + (4 + m + A_x)\,\delta_{x,x'}$$

$U_x^{\mu}$: SU(3) QCD gauge field (link matrices, 3×3 in color space)
$P^{\pm\mu}$: Dirac spin projectors (4×4 in spin space)
$A_x$: the clover matrix (12×12 in spin ⊗ color space)
$m$: quark mass parameter

4-d nearest-neighbor stencil operator acting on a vector field
Eigenspectrum is complex (typically real positive)
MAPPING THE DIRAC OPERATOR TO CUDA
• The finite difference operator in LQCD is known as "Dslash"
• Assign a single space-time point to each thread
  – V = XYZT threads, e.g., V = 24^4 ⇒ 3.3×10^6 threads
• Looping over the directions, each thread must
  – Load the neighboring spinor (24 numbers ×8)
  – Load the color matrix connecting the sites (18 numbers ×8)
  – Do the computation
  – Save the result (24 numbers)
• Each thread has a naive arithmetic intensity of 0.92 (Wilson Dslash)
• QUDA reduces memory traffic
  – Exact SU(3) matrix compression (18 ⇒ 12 or 8 real numbers)
  – Use 16-bit fixed-point representation with mixed-precision solver
[Figure: nearest-neighbor stencil $D_{x,x'}$ at site x built from links $U_x^{\mu}$.]
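As an illustration of the one-thread-per-site mapping, here is a heavily simplified CUDA sketch of a Dslash-like stencil. The field layout, helper names, and kernel structure are assumptions made for exposition; the spin projectors, clover term, gauge compression, and halo exchange are all omitted, so this is not QUDA's kernel, only a sketch of the thread-to-site assignment and the neighbor/link loads it implies.

```cuda
#include <cuComplex.h>
#include <cuda_runtime.h>

// One thread per lattice site. "Spinor" = 12 complex numbers (4 spins x 3
// colors), "link" = 3x3 complex SU(3) matrix. Spin projection omitted: each
// neighbor contribution is just link * neighbor-spinor, accumulated per spin.

__device__ int neighbor_index(const int x[4], const int L[4], int mu, int sign) {
  int y[4] = {x[0], x[1], x[2], x[3]};
  y[mu] = (y[mu] + sign + L[mu]) % L[mu];          // periodic wrap-around
  return y[0] + L[0] * (y[1] + L[1] * (y[2] + L[2] * y[3]));
}

__global__ void dslash_sketch(cuDoubleComplex *out, const cuDoubleComplex *in,
                              const cuDoubleComplex *U,  // links: [site][mu][3x3]
                              int Lx, int Ly, int Lz, int Lt) {
  const int Ns = 4, Nc = 3, spinor_len = Ns * Nc, link_len = Nc * Nc;
  const int L[4] = {Lx, Ly, Lz, Lt};
  const int volume = Lx * Ly * Lz * Lt;
  const int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx >= volume) return;

  // recover this thread's 4-d coordinates from its linear index
  int x[4], tmp = idx;
  for (int d = 0; d < 4; d++) { x[d] = tmp % L[d]; tmp /= L[d]; }

  cuDoubleComplex acc[12];
  for (int i = 0; i < spinor_len; i++) acc[i] = make_cuDoubleComplex(0.0, 0.0);

  for (int mu = 0; mu < 4; mu++) {
    // forward hop: U_mu(x) * psi(x + mu_hat)
    const int fwd = neighbor_index(x, L, mu, +1);
    const cuDoubleComplex *Uf = U + (size_t)(idx * 4 + mu) * link_len;
    for (int s = 0; s < Ns; s++)
      for (int a = 0; a < Nc; a++)
        for (int b = 0; b < Nc; b++)
          acc[s * Nc + a] = cuCadd(acc[s * Nc + a],
              cuCmul(Uf[a * Nc + b], in[(size_t)fwd * spinor_len + s * Nc + b]));

    // backward hop: U_mu(x - mu_hat)^dagger * psi(x - mu_hat)
    const int bwd = neighbor_index(x, L, mu, -1);
    const cuDoubleComplex *Ub = U + (size_t)(bwd * 4 + mu) * link_len;
    for (int s = 0; s < Ns; s++)
      for (int a = 0; a < Nc; a++)
        for (int b = 0; b < Nc; b++)
          acc[s * Nc + a] = cuCadd(acc[s * Nc + a],
              cuCmul(cuConj(Ub[b * Nc + a]), in[(size_t)bwd * spinor_len + s * Nc + b]));
  }

  for (int i = 0; i < spinor_len; i++) out[(size_t)idx * spinor_len + i] = acc[i];
}
```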
WILSON-DSLASH PERFORMANCE
[Figure: Wilson-Dslash performance on K20X, ECC on, V = 24³×T. GFLOPS (roughly 200-800) vs. temporal extent T (8-128), for the Half 8 GF, Half 8, Half 12, Single 8 GF, Single 8, and Single 12 precision/reconstruction variants.]
LINEAR SOLVERS
QUDA supports a wide range of linear solvers:
  CG, BiCGstab, GCR, multi-shift solvers, etc.

Conjugate gradient (CG):
  while (|r_k| > ε) {
    β_k = (r_k, r_k) / (r_{k-1}, r_{k-1})
    p_{k+1} = r_k + β_k p_k
    q_{k+1} = A p_{k+1}
    α = (r_k, r_k) / (p_{k+1}, q_{k+1})
    r_{k+1} = r_k − α q_{k+1}
    x_{k+1} = x_k + α p_{k+1}
    k = k + 1
  }

Condition number is inversely proportional to the quark mass
Light (realistic) masses make the operator nearly singular
Naive Krylov solvers suffer from critical slowing down as the mass decreases
Entire solver algorithm must run on GPUs
Time-critical kernel is the stencil application
Also requires BLAS level-1 type operations
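A minimal host-side sketch of the CG iteration above, written against plain std::vector rather than QUDA's GPU fields (the names cg, dot, and the Mat callback are invented here). CG requires a Hermitian positive-definite matrix, so for the Wilson operator it would in practice be applied to the normal equations $M^\dagger M$.

```cpp
// Conjugate gradient sketch: solve A x = b for symmetric positive-definite A.
// Illustration only; QUDA runs the same iteration on the GPU with mixed
// precision and fused BLAS-1 kernels.
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::function<void(Vec &out, const Vec &in)>;  // out = A * in

static double dot(const Vec &a, const Vec &b) {
  double s = 0.0;
  for (std::size_t i = 0; i < a.size(); i++) s += a[i] * b[i];
  return s;
}

// assumes the caller passes x = 0, so the initial residual r0 = b
void cg(const Mat &A, Vec &x, const Vec &b, double tol, int maxiter) {
  Vec r = b, p = b, q(b.size());
  double rr = dot(r, r);
  for (int k = 0; k < maxiter && std::sqrt(rr) > tol; k++) {
    A(q, p);                                // q = A p
    const double alpha = rr / dot(p, q);
    for (std::size_t i = 0; i < x.size(); i++) {
      x[i] += alpha * p[i];                 // x_{k+1} = x_k + alpha p_{k+1}
      r[i] -= alpha * q[i];                 // r_{k+1} = r_k - alpha q_{k+1}
    }
    const double rr_new = dot(r, r);
    const double beta = rr_new / rr;        // beta = (r_{k+1},r_{k+1})/(r_k,r_k)
    for (std::size_t i = 0; i < p.size(); i++)
      p[i] = r[i] + beta * p[i];            // next search direction
    rr = rr_new;
  }
}
```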
MULTI-GPU DECOMPOSITION
[Figure: the lattice is partitioned across GPUs; each GPU exchanges faces with its neighbors, with wrap-around exchange for the periodic boundaries.]
STRONG SCALING
Chroma running on Titan with QUDA
[Figure: Clover propagator benchmark on Titan — strong scaling with QUDA+Chroma+QDP-JIT(PTX). TFLOPS (0-450) vs. number of Titan nodes/GPUs (0-4608), for 72³×256 and 96³×256 lattices, comparing BiCGStab against DD+GCR.]
B. Joo, F. Winter (JLab), M. Clark (NVIDIA)
ADAPTIVE MULTIGRID
WHY MULTIGRID?
[Left figure (Babich et al 2010): Dirac operator applications vs. quark mass (−0.43 to −0.40) for CG, Eig-CG, and MG-GCR on 16³×64, 24³×64, and 32³×96 lattices; y-axis spans roughly 100 to 1e+05 applications.]
[Right figure: wallclock time for light-quark solves in single precision — average run time for 1 source (seconds, 0-70) — for the Chroma propagator benchmark: QUDA (32 XK nodes) vs. MultiGrid (16 XE nodes). Figure by Balint Joo; MG Chroma integration by Saul Cohen; MG algorithm by James Osborn.]
INTRODUCTION TO MULTIGRID
Stationary iterative solvers are effective on high-frequency error components
They have minimal effect on low-frequency error
Example: free Laplace operator in 2d
  Ax = 0, x_0 = random
  Gauss-Seidel relaxation
  Plot the error e_i = −x_i
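A small self-contained sketch of the experiment described above: Gauss-Seidel sweeps on the free 2-d Laplace equation (5-point stencil, zero boundary) starting from a random vector. Since the exact solution of Ax = 0 is zero, the error is simply −x; the printed norm drops quickly during the first sweeps (high-frequency error removed) and then stalls on the smooth remainder. The grid size and sweep count are arbitrary choices.

```cpp
// Gauss-Seidel relaxation on the free 2-d Laplace equation A x = 0
// (5-point stencil, zero Dirichlet boundary), starting from random x.
#include <cstdio>
#include <random>
#include <vector>

int main() {
  const int N = 64;                                    // interior grid is N x N
  std::vector<double> x((N + 2) * (N + 2), 0.0);       // padded with boundary zeros
  auto at = [&](int i, int j) -> double & { return x[i * (N + 2) + j]; };

  std::mt19937 rng(0);
  std::uniform_real_distribution<double> u(-1.0, 1.0);
  for (int i = 1; i <= N; i++)
    for (int j = 1; j <= N; j++) at(i, j) = u(rng);    // random initial error

  for (int sweep = 1; sweep <= 10; sweep++) {
    // in-place lexicographic Gauss-Seidel update for the 5-point Laplacian
    for (int i = 1; i <= N; i++)
      for (int j = 1; j <= N; j++)
        at(i, j) = 0.25 * (at(i - 1, j) + at(i + 1, j) + at(i, j - 1) + at(i, j + 1));

    double norm2 = 0.0;                                // squared error norm ||e||^2 = ||x||^2
    for (int i = 1; i <= N; i++)
      for (int j = 1; j <= N; j++) norm2 += at(i, j) * at(i, j);
    std::printf("sweep %2d  ||e||^2 = %g\n", sweep, norm2);
  }
  return 0;
}
```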
INTRODUCTION TO MULTIGRID
Low-frequency error modes are smooth
They can be accurately represented on a coarse grid
Low frequency on the fine grid ⇒ high frequency on the coarse grid
Relaxation is effective again on the coarse grid
Interpolate back to the fine grid
(Figures after Falgout)
MULTIGRID V-CYCLE
1. Smooth
2. Compute residual
3. Restrict residual
4. Recurse on coarse problem
5. Prolongate correction
6. Smooth
7. If not converged, go to 1
[Figure: V-cycle diagram — smoother & residual, restriction down the levels to a direct solve on the coarsest grid, then interpolation and smoothing back up.]
Multigrid has optimal O(N) scaling:
  Linear scaling with problem size
  Convergence rate independent of condition number
For LQCD, we do not know a priori the null-space components that need to be preserved on the coarse grid
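The cycle above can be written as a short recursive routine. The sketch below is a structural skeleton only: the per-level smoother, residual, grid-transfer, and coarsest-grid solve are declared but left unimplemented, and none of the names correspond to QUDA's interface.

```cpp
// Schematic multigrid V-cycle following the numbered steps on the slide.
// Level 0 is the finest grid; levels.size()-1 is the coarsest.
#include <cstddef>
#include <vector>

using Field = std::vector<double>;

struct Level { /* per-level operator and transfer data would live here */ };

// placeholder building blocks (not implemented here)
void smooth(const Level &lvl, Field &x, const Field &b);
Field residual(const Level &lvl, const Field &x, const Field &b);     // b - A x
Field restrict_to_coarse(const Level &lvl, const Field &r);
Field prolongate(const Level &lvl, const Field &coarse_correction);
void direct_solve(const Level &lvl, Field &x, const Field &b);

void v_cycle(const std::vector<Level> &levels, std::size_t l, Field &x, const Field &b) {
  if (l + 1 == levels.size()) {                  // coarsest grid: solve directly
    direct_solve(levels[l], x, b);
    return;
  }
  smooth(levels[l], x, b);                       // 1. pre-smooth
  Field r = restrict_to_coarse(levels[l], residual(levels[l], x, b));  // 2.-3.
  Field ec(r.size(), 0.0);
  v_cycle(levels, l + 1, ec, r);                 // 4. recurse on coarse problem
  Field e = prolongate(levels[l], ec);           // 5. prolongate correction
  for (std::size_t i = 0; i < x.size(); i++) x[i] += e[i];
  smooth(levels[l], x, b);                       // 6. post-smooth
  // 7. the outer Krylov solver repeats the cycle until converged
}
```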
ADAPTIVE GEOMETRIC MULTIGRID
Adaptively find candidate null-space vectors
Dynamically learn the null space and use this to define the prolongator
The algorithm is self-learning

Setup
1. Set the solver to be a simple smoother
2. Apply the current solver to a random vector: $v_i = P(D)\,\eta_i$
3. If convergence is good enough, the solver setup is complete
4. Construct the prolongator using fixed geometric coarsening
   ➡ Typically use 4⁴ geometric blocks
   ➡ Null-space vectors are captured exactly: $(1 - P R)\,v_k = 0$
   ➡ Preserve chirality when coarsening: $R = \gamma_5 P^\dagger \gamma_5 = P^\dagger$
5. Construct the coarse operator ($D_c = R D P$)
6. Recurse on the coarse problem
7. Set the solver to be the augmented V-cycle, go to 2
(Babich et al 2010; figures after Falgout)
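In the same placeholder style, a schematic of the setup loop above: relax on $D v = 0$ from random starts so that the surviving components are near-null, block them into a prolongator, and form the coarse operator. All types and helper routines here are stand-ins, not QUDA's API; chirality-preserving coarsening is reflected only in the comment fixing $R = P^\dagger$.

```cpp
// Schematic adaptive-multigrid setup: learn near-null vectors, build P,
// then form the coarse operator Dc = R D P (with R = P^dagger).
#include <vector>

using Field = std::vector<double>;

struct Operator {};     // fine or coarse Dirac operator (placeholder)
struct Prolongator {};  // block-structured P built from the null-space vectors

// placeholder building blocks (not implemented here)
Field random_field(const Operator &D);
void relax(const Operator &D, Field &v, const Field &rhs);            // current solver/smoother
Prolongator build_prolongator(const std::vector<Field> &null_vecs);   // block-orthonormalize
Operator coarsen(const Operator &D, const Prolongator &P);            // Dc = P^dagger D P

Operator setup_level(const Operator &D, int n_null_vectors) {
  std::vector<Field> null_vecs;
  for (int i = 0; i < n_null_vectors; i++) {
    Field v = random_field(D);            // eta_i: random start
    Field zero(v.size(), 0.0);
    relax(D, v, zero);                    // relax on D v = 0: near-null modes survive
    null_vecs.push_back(v);
  }
  Prolongator P = build_prolongator(null_vecs);  // (1 - P R) v_k = 0 by construction
  return coarsen(D, P);                   // recurse on the returned coarse operator
}
```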
ADAPTIVE GEOMETRIC MULTIGRID
[Figure: residual norm $\|r\|^2$ vs. iteration count (0-100) for the 4-d Laplace operator with $N_v$ = 0, 1, 2, 3, 4 null-space vectors; y-axis spans 1e+05 down to 1e-20.]
Typically 20-30 vectors are needed to capture the Dirac null space
MULTIGRID ON GPUS
THE CHALLENGE OF MULTIGRID ON GPU
GPU requirements are very different from CPU:
  Each thread is slow, but there are O(10,000) threads per GPU
Fine grids run very efficiently:
  High parallel throughput problem
Coarse grids are the worst possible scenario:
  More cores than degrees of freedom
  Increasingly serial and latency bound
  Little's law (bytes = bandwidth × latency)
  Amdahl's law limiter
Multigrid exposes many of the problems expected at the exascale
THE CHALLENGE OF MULTIGRID ON GPU
[Diagram: GPU connected to the CPU over PCIe.]