Many-GPU calculations in Lattice Quantum Chromodynamics
Justin Foley, University of Utah
SuperComputing 2012, November 13, 2012
QCD

- Quantum Chromodynamics (QCD) is the theory of the strong nuclear force, one of the fundamental interactions in nature along with electromagnetism, the weak nuclear force, and gravity.
- It describes elementary particles called quarks and gluons.
- Analogy with electromagnetism:
  - quarks ~ electrons (fundamental matter particles)
  - gluons ~ photons (force carriers)
Quarks carry a ‘colour’ charge.

- Colour confinement: quarks and gluons bind together to form composite, colourless particles called hadrons, e.g. the protons and neutrons in atomic nuclei.
- Quarks and gluons can only break free at extreme temperatures (> 10^12 K) and densities - deconfinement.
Applications

- Nuclear physics: can we understand the forces inside atomic nuclei from first principles?
- Astrophysics: how did the early universe evolve? Quark stars, quark matter within neutron stars.
- High-energy physics: search for new physics that cannot be described by the current Standard Model.
Lattice QCD

- No analytic solutions for QCD in the low-energy (hadronic) regime.
- Solve QCD on a computer - K.G. Wilson (1975).
- Approximate space and time by a 4D grid.
- Quarks live on the lattice sites, and gluons reside on the links between sites.
Anatomy of a lattice calculation

Physical quantities are given in terms of integrals over the gluon fields:

  ⟨O⟩ = ∫ DG O[G] e^{-S[G]} / ∫ DG e^{-S[G]}

- Large lattices are needed to control discretization and finite-volume effects ⇒ 10^9-dimensional integrals.
- Markov-Chain Monte Carlo: the Hybrid Monte Carlo (HMC) algorithm is the method of choice [Duane, Kennedy, Pendleton, and Roweth (1987)].
- HMC combines molecular dynamics with a Metropolis accept-reject step (see the sketch below).
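The slides do not spell out the HMC algorithm itself, so the following is a minimal sketch of the idea on a toy one-dimensional action S(φ) = φ²/2, not the gauge-field HMC used in the MILC/QUDA production code. The field variable, step size, and trajectory length are illustrative choices only.

```cpp
// Toy Hybrid Monte Carlo: molecular dynamics (leapfrog) + Metropolis accept-reject.
// Illustrative only -- a single scalar "field" with action S(phi) = phi^2 / 2.
#include <cmath>
#include <cstdio>
#include <random>

double action(double phi) { return 0.5 * phi * phi; }
double force(double phi) { return -phi; }  // -dS/dphi

int main() {
    std::mt19937 rng(12345);
    std::normal_distribution<double> gauss(0.0, 1.0);
    std::uniform_real_distribution<double> uniform(0.0, 1.0);

    double phi = 0.0;
    const int n_traj = 10000, n_steps = 20;
    const double dt = 0.1;
    int accepted = 0;

    for (int traj = 0; traj < n_traj; ++traj) {
        double phi_old = phi;
        double p = gauss(rng);                     // refresh momentum
        double h_old = 0.5 * p * p + action(phi);  // initial Hamiltonian

        // Leapfrog molecular-dynamics trajectory.
        p += 0.5 * dt * force(phi);
        for (int s = 0; s < n_steps - 1; ++s) {
            phi += dt * p;
            p += dt * force(phi);
        }
        phi += dt * p;
        p += 0.5 * dt * force(phi);

        // Metropolis accept-reject step corrects the integration error exactly.
        double dh = 0.5 * p * p + action(phi) - h_old;
        if (uniform(rng) < std::exp(-dh)) {
            ++accepted;
        } else {
            phi = phi_old;  // reject: restore old configuration
        }
    }
    std::printf("acceptance rate: %.3f\n", double(accepted) / n_traj);
}
```

In the lattice setting the scalar φ becomes the gauge links, the force comes from the gauge and quark actions, and each force evaluation requires the linear solves discussed later.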
The use of large space-time lattices and the need to control statistical uncertainties in Monte Carlo integrals make Lattice QCD a major HPC application.
The MILC Collaboration uses a Lattice QCD package written in C+MPI, with support for the HISQ (Highly-Improved Staggered Quark) lattice formulation.

Four routines take up 99% of HMC time:
  1. Solving the linear system A φ = η
  2. Fermion force: the molecular-dynamics force due to quarks
  3. Link fattening: to suppress discretisation errors
  4. Gauge force
Linear solves

- Typically the most expensive part of HMC.
- Solve A φ = η with A = Q†Q, where Q is the HISQ quark matrix (stencil shown on the slide).
- Solve using iterative Krylov-subspace methods.
- Use Conjugate Gradient, since A is Hermitian and positive-definite (see the sketch below).
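For reference, here is a minimal Conjugate Gradient sketch for a Hermitian positive-definite system A x = b. The small dense test matrix is a stand-in for the HISQ normal operator A = Q†Q, which in QUDA/MILC is applied matrix-free on the GPU; none of the names below come from those codes.

```cpp
// Minimal Conjugate Gradient for a symmetric positive-definite system A x = b.
#include <cmath>
#include <cstdio>
#include <vector>

using Vec = std::vector<double>;

// Stand-in for the matrix-vector product A*x (in lattice QCD this would be an
// application of the HISQ operator Q^dagger Q on the GPU).
Vec apply_A(const Vec &x) {
    static const double A[3][3] = {{4, 1, 0}, {1, 3, 1}, {0, 1, 2}};  // SPD test matrix
    Vec y(3, 0.0);
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j) y[i] += A[i][j] * x[j];
    return y;
}

double dot(const Vec &a, const Vec &b) {
    double s = 0.0;
    for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

int main() {
    Vec b = {1.0, 2.0, 3.0};
    Vec x(3, 0.0);  // initial guess
    Vec r = b;      // residual r = b - A x  (x = 0)
    Vec p = r;      // search direction
    double rr = dot(r, r);

    for (int iter = 0; iter < 100 && std::sqrt(rr) > 1e-10; ++iter) {
        Vec Ap = apply_A(p);
        double alpha = rr / dot(p, Ap);
        for (int i = 0; i < 3; ++i) {
            x[i] += alpha * p[i];
            r[i] -= alpha * Ap[i];
        }
        double rr_new = dot(r, r);
        double beta = rr_new / rr;
        for (int i = 0; i < 3; ++i) p[i] = r[i] + beta * p[i];
        rr = rr_new;
        std::printf("iter %d  |r| = %g\n", iter, std::sqrt(rr));
    }
    std::printf("solution: %g %g %g\n", x[0], x[1], x[2]);
}
```

The dominant cost per iteration is the application of A, which on a distributed lattice requires exchanging boundary data between GPUs; this is what the domain-decomposition ideas later in the talk aim to reduce.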
Lattice QCD on GPUs

QUDA: an open-source library for QCD on GPUs - lattice.github.com/quda
- Written in C++ and CUDA.
- Linear-solver support for multiple lattice formulations.
- QUDA-0.5.0: HMC support for the HISQ formulation.
- Interfaces for common CPU packages: BQCD, Chroma, CPS, QDP, MILC.
QUDA performance on a 36^4 lattice on a single K20X (Gflops):

  Routine         Single   Double   Mixed
  Linear solve    156.5    77.1     157.4
  Fermion force   191.2    97.2
  Fattening       170.7    82.0
  Gauge force     194.8    98.3

- Does not include data transfer between QUDA and MILC.
- The mixed double-/single-precision solver uses reliable updating [Sleijpen and van der Vorst] (see the sketch below).
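The exact reliable-update scheme is not reproduced here; the sketch below shows a simplified defect-correction variant of the same mixed-precision idea: iterate cheaply in single precision, but periodically recompute the true residual in double precision so the final answer reaches double-precision accuracy. The Jacobi inner solver and the tiny test matrix are stand-ins, not QUDA's implementation.

```cpp
// Simplified mixed-precision solve in the spirit of reliable updating:
// cheap single-precision corrections, with the true residual recomputed in double.
#include <cmath>
#include <cstdio>
#include <vector>

static const double A[2][2] = {{4.0, 1.0}, {1.0, 3.0}};  // SPD test matrix

// High-precision residual r = b - A x.
std::vector<double> residual(const std::vector<double> &b, const std::vector<double> &x) {
    std::vector<double> r(2);
    for (int i = 0; i < 2; ++i) {
        double Ax = 0.0;
        for (int j = 0; j < 2; ++j) Ax += A[i][j] * x[j];
        r[i] = b[i] - Ax;
    }
    return r;
}

// "Sloppy" single-precision inner solve of A e = r (in practice a batch of CG
// iterations; a few Jacobi sweeps keep the sketch short).
std::vector<float> sloppy_solve(const std::vector<double> &r) {
    std::vector<float> e(2, 0.0f);
    for (int sweep = 0; sweep < 10; ++sweep)
        for (int i = 0; i < 2; ++i) {
            float sum = static_cast<float>(r[i]);
            for (int j = 0; j < 2; ++j)
                if (j != i) sum -= static_cast<float>(A[i][j]) * e[j];
            e[i] = sum / static_cast<float>(A[i][i]);
        }
    return e;
}

int main() {
    std::vector<double> b = {1.0, 2.0}, x = {0.0, 0.0};
    for (int k = 0; k < 20; ++k) {
        std::vector<double> r = residual(b, x);  // true residual in double
        double rnorm = std::sqrt(r[0] * r[0] + r[1] * r[1]);
        std::printf("update %d  |r| = %e\n", k, rnorm);
        if (rnorm < 1e-12) break;
        std::vector<float> e = sloppy_solve(r);  // correction in single precision
        for (int i = 0; i < 2; ++i) x[i] += e[i];  // accumulate solution in double
    }
    std::printf("x = %.15f %.15f\n", x[0], x[1]);
}
```

This is why the mixed solver in the table above runs at essentially single-precision speed while delivering a double-precision result.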
2+1-flavor RHMC on 2x(K20X + Sandy Bridge)

[Bar chart: wall-clock time (s) for MILC vs. MILC+QUDA, broken down into linear solves, fermion force*, fattening*, gauge force, and other.]

- Single-precision Rational HMC on a 24^3 x 64 lattice.
- 5.7x net gain in performance; > 7.7x gain by porting the remaining CPU routines.
(Mixed-) Double-precision Rational HMC on a 96^3 x 192 lattice on Titan
Preliminary look at strong scaling on Titan
Domain-decomposition methods

- Each solver iteration involves the transfer of data between GPUs; in large-scale simulations the linear solver is communication-bound.
- Domain decomposition: solve the preconditioned linear system MA φ = M η, where M ≈ A^{-1} but involves little or no inter-processor communication [additive Schwarz method, Schwarz alternating procedure].
- This reduces the number of applications of A and hence inter-GPU communication (a structural sketch is given below).
- Successfully employed in Lattice QCD and in QUDA [Lüscher, "Lattice QCD and the Schwarz alternating procedure"; Babich, Clark et al., "Scaling Lattice QCD beyond 100 GPUs"].
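As a structural illustration only, the sketch below uses a non-overlapping additive Schwarz (block-Jacobi) preconditioner inside preconditioned CG on a toy 1D Laplacian split into two "subdomains". In QUDA the blocks are per-GPU space-time domains and the production solvers differ, but the key point carries over: each local solve needs no inter-domain communication.

```cpp
// Block-Jacobi (non-overlapping additive Schwarz) preconditioned CG on a toy
// 1D Laplacian. Each block solve is communication-free, mimicking per-GPU domains.
#include <cmath>
#include <cstdio>
#include <vector>

constexpr int N = 8, NB = 4;  // global size, block (subdomain) size
using Vec = std::vector<double>;

// Global operator: 1D Laplacian A = tridiag(-1, 2, -1).
Vec apply_A(const Vec &x) {
    Vec y(N);
    for (int i = 0; i < N; ++i)
        y[i] = 2 * x[i] - (i > 0 ? x[i - 1] : 0) - (i < N - 1 ? x[i + 1] : 0);
    return y;
}

// Local subdomain solve (Thomas algorithm on one block, ignoring couplings
// across the block boundary) -- the communication-free part of the preconditioner.
void solve_block(const Vec &r, Vec &z, int start) {
    double c[NB], d[NB];
    c[0] = -1.0 / 2.0;
    d[0] = r[start] / 2.0;
    for (int i = 1; i < NB; ++i) {
        double m = 2.0 + c[i - 1];
        c[i] = -1.0 / m;
        d[i] = (r[start + i] + d[i - 1]) / m;
    }
    z[start + NB - 1] = d[NB - 1];
    for (int i = NB - 2; i >= 0; --i)
        z[start + i] = d[i] - c[i] * z[start + i + 1];
}

// Additive Schwarz / block-Jacobi preconditioner M: independent local solves.
Vec apply_M(const Vec &r) {
    Vec z(N, 0.0);
    for (int b = 0; b < N; b += NB) solve_block(r, z, b);
    return z;
}

double dot(const Vec &a, const Vec &b) {
    double s = 0;
    for (int i = 0; i < N; ++i) s += a[i] * b[i];
    return s;
}

int main() {
    Vec b(N, 1.0), x(N, 0.0);
    Vec r = b, z = apply_M(r), p = z;
    double rz = dot(r, z);
    for (int it = 0; it < 50 && std::sqrt(dot(r, r)) > 1e-12; ++it) {
        Vec Ap = apply_A(p);  // the only step that would need halo exchange
        double alpha = rz / dot(p, Ap);
        for (int i = 0; i < N; ++i) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
        z = apply_M(r);
        double rz_new = dot(r, z);
        for (int i = 0; i < N; ++i) p[i] = z[i] + (rz_new / rz) * p[i];
        rz = rz_new;
        std::printf("iter %d  |r| = %e\n", it, std::sqrt(dot(r, r)));
    }
}
```

Because the preconditioner improves convergence, fewer applications of the global operator A are needed, which is exactly where the inter-GPU communication lives.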
Demonstration of principle on a 32^3 x 64 lattice on 4 C2070s: preconditioning results in a 2.4x reduction in communication.
Summary and Outlook

- Reported on progress in porting Lattice QCD Monte Carlo to GPUs.
- 99% of MILC HMC has been ported to QUDA, giving an impressive improvement in performance.
- Amdahl's law: need to port the remaining CPU routines to QUDA.
- Persistent data types to reduce host-device data transfer.
- Linear solvers dominate state-of-the-art calculations. New algorithms are being implemented which will reduce inter-GPU communication and extend strong scaling to hundreds of GPUs.