Accelerating Quantum Chromodynamics Calculations on GPU-Based Systems with an Adaptive Multi-Grid Algorithm
K. Clark (NVIDIA), R. Brower (BU), M. Cheng (BU), A. Gambhir (W&M), B. Joó (Jefferson Lab), A. Strelchenko (FNAL), E. Weinberg (BU)
NVIDIA CUDA Theatre, SC'16
Introduction
• It is believed that the fundamental building blocks of matter are quarks, bound together by gluons via the strong nuclear force.
• Quantum Chromodynamics (QCD) is the theory which describes the strong interactions.
• Understanding how QCD makes up matter and how quarks and gluons behave is a subject of intense experimental scrutiny:
- only ~5% of the mass of a proton comes from the mass of its quarks; the rest comes from the binding
- gluon self-coupling and gluon excitations can create exotic forms of matter
[Figure: meson (2 quarks), baryon (3 quarks), glueball (only gluons); experiments at Jefferson Lab and Brookhaven National Lab. GlueX in the new Hall-D of Jefferson Lab@12 GeV: hunting for exotics!]
LQCD Calculation Workflow
Gauge Generation → Gauge Configurations → Analysis Phase 1 → Propagators, Correlation Functions → Analysis Phase 2 → Physics Result
• Gauge Generation: capability computing on Leadership Facilities
- configurations generated in sequence using a Markov Chain Monte Carlo technique
- focus the power of leadership computing onto a single task, exploiting data parallelism
• Analysis: capacity computing, cost effective on clusters
- task-parallelize over gauge configurations in addition to data parallelism
- can use clusters, but also LCFs in throughput (ensemble) mode
The Wilson-Clover Fermion Matrix

$$ M_{x,y} \;=\; \Big( N_d + M - \frac{i\,c_{SW}}{8} \sum_{\mu<\nu} [\gamma_\mu, \gamma_\nu]\, F_{\mu\nu}(x) \Big)\, \delta_{x,y} \;-\; \frac{1}{2} \sum_{\mu} \Big[ (1-\gamma_\mu)\, U_\mu(x)\, \delta_{y,\,x+\hat\mu} + (1+\gamma_\mu)\, U^\dagger_\mu(x-\hat\mu)\, \delta_{y,\,x-\hat\mu} \Big] $$

(hopping "Dslash" term; clover term; mass term)

• The "Dslash" term is a nearest-neighbor, stencil-like term: very sparse (see the sketch below).
• M is J-Hermitian with J = γ_5: γ_5 M = M† γ_5, where γ_5 = γ_1 γ_2 γ_3 γ_4, γ_5† = γ_5 and γ_5² = 1.
• γ_5 is maximally indefinite (its eigenvalues are +1 and -1).
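To make the stencil structure concrete, here is a minimal, self-contained sketch of the nearest-neighbor access pattern of the Dslash term. It is deliberately simplified to a U(1) gauge field with a single "spin" component so it compiles as-is; the production operator carries 4 spins × 3 colors, SU(3) link matrices, and the spin projectors (1 ∓ γ_μ), but the neighbor traversal is the same. All names here are illustrative, not QUDA's or Chroma's API.

```cpp
#include <array>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;

// Flatten a 4d site coordinate with periodic boundary conditions.
std::size_t site(const std::array<int,4>& c, const std::array<int,4>& L) {
    std::size_t idx = 0;
    for (int mu = 3; mu >= 0; --mu)
        idx = idx * L[mu] + ((c[mu] % L[mu]) + L[mu]) % L[mu];
    return idx;
}

// Toy Dslash: out(x) = -1/2 sum_mu [ U_mu(x) in(x+mu) + conj(U_mu(x-mu)) in(x-mu) ]
// (spin projectors omitted in this simplified U(1), single-component version).
void dslash(std::vector<cplx>& out, const std::vector<cplx>& in,
            const std::array<std::vector<cplx>,4>& U,
            const std::array<int,4>& L) {
    std::array<int,4> c{};
    for (c[0] = 0; c[0] < L[0]; ++c[0])
    for (c[1] = 0; c[1] < L[1]; ++c[1])
    for (c[2] = 0; c[2] < L[2]; ++c[2])
    for (c[3] = 0; c[3] < L[3]; ++c[3]) {
        const std::size_t x = site(c, L);
        cplx acc = 0.0;
        for (int mu = 0; mu < 4; ++mu) {
            std::array<int,4> fwd = c, bwd = c;
            ++fwd[mu]; --bwd[mu];
            acc += U[mu][x] * in[site(fwd, L)];                       // forward hop
            acc += std::conj(U[mu][site(bwd, L)]) * in[site(bwd, L)]; // backward hop
        }
        out[x] = -0.5 * acc;  // 8 neighbor reads per site: a sparse stencil
    }
}
```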
Gauge Configuration Generation
• Gauge generation proceeds via Hybrid Molecular Dynamics Monte Carlo (e.g. HMC).
• The momentum update step needs a 'force' term:
π_μ(x) ← π_μ(x) + F_μ(x) δτ
• Computing F needs to solve a linear system: M†M x = b
• For Wilson-Clover we can use a two-step solve (sketched below): M† y = b, then M x = y
[Figure: phase-space cartoon: an MD trajectory (U, p) → (U′, p′) along a hypersurface of constant H, followed by momentum refreshment from (U, p_old).]
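A hedged sketch of the two-step solve (the solver callbacks are hypothetical placeholders, not a specific QUDA entry point): solving M† y = b and then M x = y yields (M†M) x = b without ever forming the squared operator, which lets each step use a preconditioner built for M itself (e.g. multigrid).

```cpp
#include <complex>
#include <functional>
#include <vector>

using Field  = std::vector<std::complex<double>>;
using Solver = std::function<Field(const Field&)>;  // rhs -> solution

// Two sequential solves in place of one normal-equations solve:
//   step 1: M^dagger y = b
//   step 2: M x = y   =>   (M^dagger M) x = b
Field solve_normal_equations(const Solver& solve_M,
                             const Solver& solve_Mdag,
                             const Field& b) {
    Field y = solve_Mdag(b);
    return solve_M(y);
}
```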
Analysis
• Quark line diagrams describe physical processes.
• Each line is a quark propagator, the solution of: M q = s
• Many solves are needed for each field configuration:
- e.g. 256 values of t × 386 sources × 4 values of spin × 2 (light and strange quarks) = 790,528 solves
- typically 200-500 configurations are used
• Single precision is good enough.
• Same matrix, multiple right-hand sides (see the sketch below).
[Figure: quark line diagram for a π-π process, with quark lines q running between times t_0 and t.]
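Because every right-hand side on a given configuration shares the same matrix, any per-matrix setup cost can be amortized across all of them. A minimal sketch of that structure (the setup and solver callbacks are hypothetical placeholders):

```cpp
#include <complex>
#include <functional>
#include <vector>

using Field  = std::vector<std::complex<double>>;
using Solver = std::function<Field(const Field&)>;

// Build an expensive per-configuration solver (e.g. a multigrid hierarchy)
// once, then reuse it for every source on that configuration.
std::vector<Field> compute_propagators(
        const std::function<Solver()>& setup_for_configuration,
        const std::vector<Field>& sources) {
    Solver solve = setup_for_configuration();  // paid once per gauge configuration
    std::vector<Field> propagators;
    propagators.reserve(sources.size());
    for (const Field& s : sources)             // reused across ~10^5-10^6 solves
        propagators.push_back(solve(s));
    return propagators;
}
```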
Chroma Software Stack
• Layered software:
- algorithms in Chroma
- Chroma coded in terms of QDP++
- fast solvers come from libraries
• QUDA on NVIDIA GPUs.
• The Chroma software stack follows the USQCD SciDAC software layers: Chroma on top of QDP++ (QDP++, QDP-JIT/LLVM, QDP-JIT/PTX), with QPhiX, QUDA and QMP below.
• Different QDP++ implementations provide 'performance portability' for Chroma:
- Chroma is 99% coded in terms of QDP++ constructs
- QDP-JIT/PTX and QDP-JIT/LLVM use NVVM for GPUs
• Chroma wraps performance-optimized libraries:
- can give e.g. QUDA solvers a 'Chroma look & feel'

Example QDP++ code:

LatticeFermion psi, chi;
gaussian(psi);            // gaussian RNG fill
// shift sites from the forward 0 direction:
// nearest-neighbor communication
chi = shift(psi, FORWARD, 0);
// arithmetic expressions on lattice subsets
chi[ rb[0] ] += psi;
// global reduction
Double n2 = norm2(chi);
Adaptive Multigrid in LQCD
• Critical slowing down is caused by 'near-zero' modes of M.
• The Multi-Grid (MG) method (see the correction step below):
- separate (project) low-lying and high-lying modes
- reduce the error from high-lying modes with a "smoother"
- reduce the error from low modes on a coarse grid
• The gauge field is 'stochastic', so there is no geometric smoothness in the low modes => algebraic multigrid.
• Setting up the restriction/prolongation operators has a cost:
- easily amortized in Analysis with O(100,000) solves
Image credit: Joanna Griffin, Jefferson Lab Public Affairs
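For reference, the standard two-grid correction step behind this description, written in generic multigrid notation (the talk's own conventions may differ in details): with restriction R and prolongation P, the coarse operator and coarse-grid correction are

$$ M_c = R\,M\,P, \qquad x \;\leftarrow\; x + P\,M_c^{-1} R\,(b - M x), $$

followed by a few smoother iterations to damp the high-lying error components that the coarse grid cannot represent.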
QUDA Implementation
• Outer flexible Krylov method: GCR.
• MG V-cycle used as a preconditioner (see the sketch below):
- Null space:
  • solve M x = 0 for N_vec random x with BiCGStab
  • construct R, P, M_c
- Smoother: fixed number of iterations with MR
- 'Bottom solver': GCR
  • may be deflated (e.g. FGMRES-DR) later
  • is recursively preconditioned by the next MG level
• The coarsest levels may have very few sites:
- turn to other 'fine-grained' sources of parallelism
[Figure: V-cycle over three levels (fine, coarse 1, coarse 2) with pre-smooth S, restriction R, coarse solve, prolongation P, post-smooth S. Fine grid: parallelism over sites; coarse grid: parallelism over rows and directions.]
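A minimal sketch of the recursive V-cycle structure just described, in generic C++ (not QUDA's actual class hierarchy). The callbacks are hypothetical placeholders; the smoother would be a few MR iterations and the bottom solve a GCR, and as a simplification the coarse solve here is a single recursive V-cycle rather than a recursively preconditioned GCR.

```cpp
#include <complex>
#include <cstddef>
#include <functional>
#include <vector>

using Field = std::vector<std::complex<double>>;

// One multigrid level, with hypothetical operator/transfer callbacks.
struct Level {
    std::function<Field(const Field&)> apply_M;               // operator on this level
    std::function<Field(const Field&, const Field&)> smooth;  // (rhs, guess) -> x, e.g. a few MR steps
    std::function<Field(const Field&)> restrict_down;         // R: fine residual -> coarse rhs
    std::function<Field(const Field&)> prolong_up;            // P: coarse correction -> fine
    std::function<Field(const Field&)> bottom_solve;          // e.g. GCR on the coarsest level
};

Field residual(const Level& lvl, const Field& b, const Field& x) {
    Field r = lvl.apply_M(x);
    for (std::size_t i = 0; i < r.size(); ++i) r[i] = b[i] - r[i];
    return r;
}

// One V-cycle: pre-smooth, coarse-grid correction (recursive), post-smooth.
Field v_cycle(const std::vector<Level>& levels, std::size_t depth, const Field& b) {
    const Level& lvl = levels[depth];
    if (depth + 1 == levels.size())
        return lvl.bottom_solve(b);                           // coarsest level

    Field x = lvl.smooth(b, Field(b.size()));                 // pre-smooth from zero
    Field r = residual(lvl, b, x);
    Field xc = v_cycle(levels, depth + 1, lvl.restrict_down(r));
    Field e = lvl.prolong_up(xc);
    for (std::size_t i = 0; i < x.size(); ++i) x[i] += e[i];  // coarse correction
    return lvl.smooth(b, x);                                  // post-smooth
}
```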
Benefits of Multigrid: Speed
• Algorithmic speed improvements: 5x-10x compared to BiCGStab.
• BiCGStab running in its optimal configuration:
- mostly low precision with 'reliable update' flying restarts (see the sketch below)
- mixed precision (16-bit/64-bit)
- 'gauge field' compression
• MG is a preconditioner:
- can run in reduced precision with a flexible outer GCR solver
[Figure: wall-clock time (sec) vs Cray XK7 (Titan) nodes (64-512) for QUDA BiCGStab vs QUDA adaptive MG; V = 64³×128 sites, m_π ~ 200 MeV; MG shows a ~10x reduction in wall-clock time.]
from K. Clark et al., SC'16 - sneak preview
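A hedged sketch of the 'reliable update' idea (a generic mixed-precision restart loop, not QUDA's implementation): run the iteration arithmetic in float while accumulating the solution in double, and whenever the iterated residual has shrunk by a set factor, recompute the true residual in double and restart from it.

```cpp
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

using VecD = std::vector<double>;
using VecF = std::vector<float>;

// One hypothetical low-precision Krylov step (e.g. one BiCGStab iteration):
// updates the float correction and float residual in place.
using StepF  = std::function<void(VecF& dx, VecF& r)>;
// High-precision true residual r = b - M x, computed in double.
using ResidD = std::function<VecD(const VecD& x)>;

VecD reliable_update_solve(const StepF& step, const ResidD& true_residual,
                           std::size_t n, double tol, double delta = 0.1) {
    auto nrm = [](const auto& v) {
        double s = 0.0;
        for (auto e : v) s += double(e) * double(e);
        return std::sqrt(s);
    };
    VecD x(n, 0.0);
    VecD r = true_residual(x);        // with x = 0 this is just b
    const double bnorm = nrm(r);
    double rprev = bnorm;

    while (rprev > tol * bnorm) {
        VecF dx(n, 0.0f), rf(r.begin(), r.end());  // demote residual to float
        while (nrm(rf) > delta * rprev)            // iterate cheaply in float
            step(dx, rf);
        for (std::size_t i = 0; i < n; ++i) x[i] += dx[i];  // promote update
        r = true_residual(x);         // 'reliable update': true residual in double
        rprev = nrm(r);               // restart the float cycle from here
    }
    return x;
}
```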
Benefits of Multigrid: Optimality
• MG minimizes the error, rather than the residuum.
• The solver is better behaved than BiCGStab:
- the number of iterations is stable
- || error || / || residuum || is more stable (see the relation below)
• Important for t-to-same-t propagators:
- single precision is good enough BUT:
- we want a precision guarantee from solve to solve
[Figure: || error || / || residuum || per spin/color component (0-11), multigrid vs BiCGStab; V = 64³×128 sites, Null = {24,24}, 64 nodes of Titan, m_π ~ 192 MeV.]
from Clark et al., SC'16 - sneak preview
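Why the error/residuum ratio matters (standard linear-algebra reasoning, not specific to this talk): the error and residual are related through the inverse operator,

$$ e = x_\star - x = M^{-1} r, \qquad \|e\| \le \|M^{-1}\|\,\|r\|, $$

and since near-zero modes make \(\|M^{-1}\|\) large, two solves with equal residual norms can carry very different errors. A solver whose \(\|e\|/\|r\|\) is stable therefore provides the solve-to-solve precision guarantee asked for above.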
Benefits of Multigrid: Power Efficiency
• Power draw of a GPU node during BiCGStab and multigrid runs:
- GPU power only (nvidia-smi)
• Once setup is complete, the integrated power for 12 solves is much less for multigrid than for BiCGStab.
• Ongoing optimizations:
- smarter setup
- move more work to the GPU
[Figure: power consumption (W) vs wall-clock time (sec), 0-600 s, for 12 BiCGStab solves and for 12 MG solves; the MG trace shows the setup phases (level 1 null vectors, level 2 null vectors, coarse-operator construction on the CPU) preceding the solves.]
from Clark et al., SC'16 - sneak preview
In Praise of Hackathons
• Hackathons bring together members of a distributed group for a burst of concentrated activity, to accomplish a concrete development goal.
• Hackathons:
- clear the calendar
- are focused: no distractions
- bring together developers from different sides of an interface
- teach new things (me@OLCF Hack: NVProf, NVIDIA Visual Profiler, Allinea MAP)
• Hackathons for QUDA:
- JLab: multi-node QUDA (way back in 2011?), QUDA Multigrid & Chroma (Jan '16)
- Fermilab: QUDA deflation algorithms
- OLCFHack '16: Multigrid in Chroma Gauge Generation, BiCGStab-L, Staggered Multigrid (Oct '16)
[Photo: Team Hybrid Titans at OLCFHack. Photograph courtesy of Sherry Ray, OLCF.]
Summary
• Taking advantage of modern architectures needs development both in the algorithmic space and in the 'software' space:
- algorithmic optimality, performance optimization, integration with existing codes
• Recent QUDA improvements provide the Chroma code (and other users) with improved capability:
- the Multi-Grid solver is now in production for propagator calculations on Titan and GPU clusters
- the Multi-Grid solver has been integrated into Chroma for Gauge Generation projects
• Hackathons (a.k.a. Code-Fests) are a great way to make rapid advances:
- we love Hackathons!
• Please go and see Kate Clark's Technical Paper Presentation: 2:30pm, Rm 355-E
Thanks and Acknowledgements
• Thanks for organizing Hackathons:
- Chip Watson (Jefferson Lab) for the January Multi-Grid Mini-Hackathon at Jefferson Lab
- OLCF for the October OLCFHack GPU Hackathon in Knoxville
• Results in this talk were generated on the OLCF Titan system (Cray XK7) utilizing USQCD INCITE (LGT003) allocations.
• This work is supported by the U.S. Department of Energy, Office of Science, Offices of Nuclear Physics, High Energy Physics, and Advanced Scientific Computing Research.
• This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Nuclear Physics under contract DE-AC05-06OR23177.