S6623: Advances in NAMD GPU Performance
Antti-Pekka Hynninen
Oak Ridge Leadership Computing Facility (OLCF)
hynninena@ornl.gov
ORNL is managed by UT-Battelle for the US Department of Energy
Motivation
• Make NAMD run fast on the upcoming Summit supercomputer
• Center for Accelerated Application Readiness (CAAR) project
  – “Molecular Machinery of the Brain”
  – PI: Prof. Klaus Schulten, University of Illinois at Urbana-Champaign
  – Co-PI: James C. Phillips
Introduction to NAMD
• Popular classical molecular dynamics software
  – Free to download
  – Available at most supercomputer centers
• Runs across many hardware platforms
  – CPU, Intel Xeon Phi, NVIDIA GPU
[Figure: example simulation systems of 30K, 10M, and 64M atoms]
• See also: James C. Phillips, S6361 – Attacking HIV with Petascale Molecular Dynamics Simulations on Titan and Blue Waters, Thu 10am, Room 211A
Introduction to Summit
• ~3,400 nodes, each with
  – Multiple IBM POWER9 CPUs
  – Multiple NVIDIA Volta GPUs
  – NVLink (5x-12x faster than PCIe 3.0)
• 5x-10x faster than Titan (135-270 petaflops)
• Arrives in 2018
• About 90% of the FLOPS are in the GPUs
GPU-accelerated MD in NAMD
• GPU used only for non-bonded and PME reciprocal forces
• Bonded forces and time-step integration performed on the CPU
[Diagram: GPU computes non-bonded and PME forces; CPU computes bonded forces and performs time-step integration]
• No need to re-write the entire MD engine
• Enables us to use the same communication, thermostat, and sampling methods on all hardware platforms
Non-bonded force computation
• Computes forces between all pairs of atoms that are within a cutoff radius R
• Takes up approximately 80%-90% of simulation time
[Figure: cutoff sphere of radius R around an atom]
• The trivial algorithm scales as O(N^2)
• Use of neighbor lists brings the algorithm down to O(N)
  – Spatial sorting
Non-bonded force computation in NAMD
• Two levels of spatial sorting
  – The simulation box is divided into patches
  – Within a patch, atoms are sorted spatially into groups of 32 using the orthogonal recursive bisection method
[Figure: simulation box divided into patches]
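To make the second sorting level concrete, below is a minimal sketch of orthogonal recursive bisection that splits a patch's atoms into spatially compact groups of at most 32. This is not NAMD's implementation: the Atom struct, the orbSort name, and the detail of rounding the split point up to a multiple of 32 are assumptions for illustration only.

```cuda
// Hypothetical sketch: orthogonal recursive bisection of a patch's atoms into
// groups of <= 32 atoms, for use with 32x32 tile computation.
#include <algorithm>
#include <vector>

struct Atom { float x, y, z; };

// Recursively split [first, last) along its longest axis until each group holds
// at most 32 atoms; atoms end up spatially ordered in place.
// Initial call: orbSort(atoms, 0, (int)atoms.size());
void orbSort(std::vector<Atom>& atoms, int first, int last) {
  int n = last - first;
  if (n <= 32) return;
  // Bounding box of this group.
  float lo[3] = {1e30f, 1e30f, 1e30f}, hi[3] = {-1e30f, -1e30f, -1e30f};
  for (int i = first; i < last; i++) {
    const float r[3] = {atoms[i].x, atoms[i].y, atoms[i].z};
    for (int d = 0; d < 3; d++) { lo[d] = std::min(lo[d], r[d]); hi[d] = std::max(hi[d], r[d]); }
  }
  // Split along the longest axis; round the split point up to a multiple of 32
  // so the leading groups stay full (an assumed detail).
  int axis = 0;
  for (int d = 1; d < 3; d++) if (hi[d] - lo[d] > hi[axis] - lo[axis]) axis = d;
  int half = first + ((n / 2 + 31) / 32) * 32;
  auto key = [axis](const Atom& a) { return axis == 0 ? a.x : (axis == 1 ? a.y : a.z); };
  std::nth_element(atoms.begin() + first, atoms.begin() + half, atoms.begin() + last,
                   [&](const Atom& a, const Atom& b) { return key(a) < key(b); });
  orbSort(atoms, first, half);
  orbSort(atoms, half, last);
}
```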
Non-bonded force compute
• A compute = all pairwise interactions between two patches
• For the GPU, each compute is split into tiles of 32x32 atoms
[Diagram: Patches 1-3 and the Computes (patch pairs) assigned to them, each tiled into 32x32 blocks]
Non-bonded force computation
[Diagram: a 32x32 tile with the atoms of patch i on one axis and the atoms of patch j on the other; warps 1-4 accumulate the forces F_i and F_j]
• One warp per tile
• Loop through the 32x32 tile diagonally
  – Avoids race conditions when storing the forces F_i and F_j
• A bitmask is used for exclusion lookup
Non-bonded force computation kernel
• Warp-level programming
  – __shfl() commands
  – No shared memory
  – No __syncthreads()
• Requires SM 3.0 (Kepler) or newer
• Builds atom-based neighbor lists and exclusion bitmasks on the fly
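A minimal sketch of the warp-per-tile force loop is given below to make the diagonal traversal and the __shfl() rotation concrete. It is not NAMD's actual kernel: all names are hypothetical, a bare Coulomb term stands in for the real force formula, the exclusion bitmask is assumed to have bit j set when the j-atom is excluded (including self-interactions), and the Kepler-era __shfl() from the slide is used (on current CUDA this would be __shfl_sync()).

```cuda
// Hypothetical sketch: one warp handles a 32x32 tile; lane k owns i-atom k and
// walks the 32 j-atoms diagonally, so no two lanes ever touch the same F_j at
// the same time and no atomics or shared memory are needed.
__device__ void tileForces(float4 xyzqi, float4 xyzqj_in,
                           unsigned int exclMask, float cutoff2,
                           float3& fi, float3& fj) {
  const int lane = threadIdx.x & 31;
  float4 xyzqj = xyzqj_in;            // this lane's current j-atom; rotates every step
  fi = make_float3(0.f, 0.f, 0.f);
  fj = make_float3(0.f, 0.f, 0.f);    // force of the j-atom this lane currently holds
  for (int t = 0; t < 32; t++) {
    float dx = xyzqi.x - xyzqj.x;
    float dy = xyzqi.y - xyzqj.y;
    float dz = xyzqi.z - xyzqj.z;
    float r2 = dx*dx + dy*dy + dz*dz;
    int j = (lane + t) & 31;                // diagonal: lane k handles j-atom (k+t) mod 32
    bool excluded = (exclMask >> j) & 1;    // exclusion bitmask lookup
    if (r2 < cutoff2 && !excluded) {
      float rinv = rsqrtf(r2);
      // Bare Coulomb as a stand-in for the real LJ + PME-direct formula.
      float f = xyzqi.w * xyzqj.w * rinv * rinv * rinv;
      fi.x += f*dx; fi.y += f*dy; fi.z += f*dz;
      fj.x -= f*dx; fj.y -= f*dy; fj.z -= f*dz;
    }
    // Rotate the j-atom and its accumulated force to the next lane.
    const int src = (lane + 1) & 31;
    xyzqj.x = __shfl(xyzqj.x, src); xyzqj.y = __shfl(xyzqj.y, src);
    xyzqj.z = __shfl(xyzqj.z, src); xyzqj.w = __shfl(xyzqj.w, src);
    fj.x = __shfl(fj.x, src); fj.y = __shfl(fj.y, src); fj.z = __shfl(fj.z, src);
  }
  // After 32 rotations, lane k again holds fj for j-atom k, and fi is the force on i-atom k.
}
```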
Neighbor list construction on GPU
• Bounding-box neighbor list (within cutoff R)
  → Sort neighbor list
  → Compute forces, building the atom-based neighbor list (within cutoff R)
  → Sort neighbor list
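The first stage of such a pipeline is a coarse cull on tile bounding boxes; a sketch of that test is shown below. The BoundingBox layout and the function name are assumptions for illustration, not NAMD's data structures.

```cuda
#include <cuda_runtime.h>
#include <math.h>

// Axis-aligned bounding box of one 32-atom tile (hypothetical layout).
struct BoundingBox { float3 center, halfSize; };

// Keep a tile pair only if the closest distance between the two boxes is within
// the cutoff; the exact atom pairs are resolved later, on the fly.
__host__ __device__ inline bool boxesWithinCutoff(const BoundingBox& a,
                                                  const BoundingBox& b,
                                                  float cutoff) {
  // Gap between the boxes along each axis (zero if they overlap on that axis).
  float dx = fmaxf(fabsf(a.center.x - b.center.x) - (a.halfSize.x + b.halfSize.x), 0.f);
  float dy = fmaxf(fabsf(a.center.y - b.center.y) - (a.halfSize.y + b.halfSize.y), 0.f);
  float dz = fmaxf(fabsf(a.center.z - b.center.z) - (a.halfSize.z + b.halfSize.z), 0.f);
  return dx*dx + dy*dy + dz*dz < cutoff*cutoff;
}
```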
Neighbor list sorting
[Diagram: without sorting, the warps of a thread block finish at different times (load imbalance); after sorting, warps 1-4 get similar work]
• Tile lists executed in the same thread block should have approximately the same workload
• A simple solution is to sort according to tile list length
• Sorting also minimizes tail effects at the end of kernel execution
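The sort itself can be expressed with Thrust, for example. The sketch below is only one way such a sort by tile list length could be done; the array names are assumptions, not NAMD's.

```cuda
// Hypothetical sketch: sort tile lists by length (descending) so that warps in
// the same thread block get similar amounts of work and long lists run first,
// which shrinks the tail at the end of the kernel.
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>

void sortTileListsByLength(thrust::device_vector<int>& tileListLen,
                           thrust::device_vector<int>& tileListOrder) {
  // tileListOrder starts as the identity permutation 0, 1, 2, ...
  thrust::sequence(tileListOrder.begin(), tileListOrder.end());
  // Longest tile lists first.
  thrust::sort_by_key(tileListLen.begin(), tileListLen.end(),
                      tileListOrder.begin(), thrust::greater<int>());
}
```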
Non-bonded kernel performance*
• Explicit solvent
  – Non-bonded kernel: 30% faster
  – Neighbor list build: up to 2.7x faster
[Chart: speedup vs. NAMD 2.11 for DHFR (24K atoms), ApoA1 (92K atoms), and STMV (1.06M atoms)]
• GB implicit solvent
  – 13K atoms: 3.5x faster
  – 5.7M atoms: 38% faster
[Chart: GBIS and non-bonded neighbor list speedup vs. NAMD 2.11 for 13K-atom and 5.7M-atom systems]
* Titan supercomputer, K20 GPU
Simulation performance
• Simulation performance is influenced by communication and time-step integration
[Timeline: the GPU runs the non-bonded force kernel, the results are copied from GPU to CPU, and only then does the CPU perform communication and time-step integration for patches 1-3]
Streaming force computation
• Streaming: communication & time-step integration are done during kernel execution
[Timelines: without streaming, the CPU starts communication and integration only after the full non-bonded kernel; with streaming, the CPU polls for per-patch results while the kernel is still running, yielding savings]
Streaming force computation
[Timelines: without global sorting, patch results become available out of order while the CPU polls; with global sorting, patches 1, 2, 3 complete in order]
• Streaming: sort the computes “globally” and preserve patch order
• Kernel performance: sort the neighbor list “locally”
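One way the "polling for results" idea can be realized is with per-patch completion flags in mapped (zero-copy) host memory that the kernel raises and the CPU spins on. The sketch below is a simplified illustration of that pattern, not NAMD's implementation; all names are hypothetical and the force computation is left as a stub.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel stub: one block per patch computes that patch's forces,
// then raises the patch's flag so the host can start processing it immediately.
__global__ void forceKernelStub(volatile unsigned int* patchDoneFlag) {
  // ... compute and store this patch's forces here ...
  __threadfence_system();              // make the stored forces visible to the host first
  if (threadIdx.x == 0) patchDoneFlag[blockIdx.x] = 1;
}

void launchAndPoll(int numPatches) {
  volatile unsigned int* flagsHost;
  unsigned int* flagsDev;
  cudaHostAlloc((void**)&flagsHost, numPatches * sizeof(unsigned int), cudaHostAllocMapped);
  cudaHostGetDevicePointer((void**)&flagsDev, (void*)flagsHost, 0);
  for (int p = 0; p < numPatches; p++) flagsHost[p] = 0;

  forceKernelStub<<<numPatches, 32>>>(flagsDev);

  // Poll patches in order; as each one completes, do its communication and
  // time-step integration while the kernel keeps working on later patches.
  for (int p = 0; p < numPatches; p++) {
    while (flagsHost[p] == 0) { /* spin; a real code would also do other work */ }
    printf("patch %d ready: communicate and integrate it now\n", p);
  }
  cudaDeviceSynchronize();
  cudaFreeHost((void*)flagsHost);
}
```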
Neighbor list sorting - global
• Sort the computes into reverse order
• Compute forces and record the order in which the tile lists produce output
• Reverse that output order and use it to build the sort key, binned by tile list length
[Worked example: input order 10, 9, 8, 7, 6, 5, 4, 3, 2, 1; recorded output order 8, 10, 7, 6, 9, 4, 5, 3, 2, 1; reversed order 1, 2, 3, 5, 4, 9, 6, 7, 10, 8 used together with the tile list length to build the sort key]
Neighbor list sorting - local
[Diagram: tile lists are sorted by length only within a local sort window, preserving the global order]
Non-bonded kernel performance
[Chart: normalized kernel runtime for no streaming (incl. mem copy), streaming, streaming with global sort, and streaming with global & local sort (window 32)]
[Charts: tile list length vs. neighbor list index for the different sorting schemes]
Streaming time-step profile
[Profile: the non-bonded kernel on the GPU overlaps with time-step integration on the CPU]
Streaming simulation performance
[Chart: speedup with streaming vs. number of Titan nodes (1, 2, 4, 8) for DHFR (24K atoms) and ApoA1 (92K atoms)]
• 10%-30% faster simulations using streaming
Particle Mesh Ewald (PME) – NAMD 2.11
[Pipeline: charge spreading → 3D FFT (real to complex) → Poisson solver → 3D FFT (complex to real) → force gathering]
• Charge spreading and force gathering on the GPU
• 3D FFT and Poisson solver on the CPU
Particle Mesh Ewald (PME) – New
[Pipeline: charge spreading → 3D FFT (real to complex) → Poisson solver → 3D FFT (complex to real) → force gathering]
• Everything on the GPU
• Uses cuFFT
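A minimal sketch of what the cuFFT-based reciprocal path looks like is given below. The names, the single-precision R2C/C2R choice, and the grid layout are assumptions, and the spread, Poisson-solve, and gather kernels are left as stubs; in a real code the FFT plans would be created once and reused across steps.

```cuda
#include <cuda_runtime.h>
#include <cufft.h>

// Hypothetical sketch of one PME reciprocal-space step entirely on the GPU.
void pmeRecipStep(float* chargeGrid,      // real-space charge grid, nx*ny*nz
                  cufftComplex* kGrid,    // reciprocal-space grid, nx*ny*(nz/2+1)
                  int nx, int ny, int nz) {
  cufftHandle planR2C, planC2R;
  cufftPlan3d(&planR2C, nx, ny, nz, CUFFT_R2C);
  cufftPlan3d(&planC2R, nx, ny, nz, CUFFT_C2R);

  // 1. Charge spreading kernel would fill chargeGrid here (stub).
  // 2. Forward 3D FFT, real to complex.
  cufftExecR2C(planR2C, chargeGrid, kGrid);
  // 3. Poisson solver: multiply kGrid by the precomputed Ewald/B-spline
  //    structure factors in a small element-wise kernel (stub).
  // 4. Inverse 3D FFT, complex to real.
  cufftExecC2R(planC2R, kGrid, chargeGrid);
  // 5. Force gathering kernel would interpolate per-atom forces from chargeGrid (stub).

  cufftDestroy(planR2C);
  cufftDestroy(planC2R);
}
```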
PME performance on a single GPU*
• DHFR, 24K atoms
• 64x64x64 grid
[Chart: speedup vs. NAMD 2.11 for DHFR with order = 4, 6, and 8]
* Titan supercomputer: K20 GPU + AMD Opteron CPU
Finally, simulation performance
• Explicit solvent
  – 30%-57% faster simulations
[Chart: speedup vs. NAMD 2.11 for DHFR (24K atoms) and ApoA1 (92K atoms) on 1-8 Titan nodes]
• GB implicit solvent
  – Up to 3.5x faster simulations
[Charts: speedup vs. NAMD 2.11 for 13K-atom and 5.7M-atom systems on 1-4 Titan nodes]
Challenges: Simulation performance
• Single-GPU performance for DHFR of 47 ns/day on a K20 is still only about half the performance of GPU-only codes such as Amber* (95 ns/day)
• A major part of the runtime is now taken by memory copies, setup time, and time-step integration
  – This is where the effort has to go now
* http://ambermd.org/gpus/benchmarks.htm
Challenges: Simulation performance
[Profile: time-step breakdown showing bonded forces, the non-bonded kernel, CPU-GPU memory copy & force clear, time-step integration, and CPU-CPU memory copy]
Conclusions
• Explicit solvent non-bonded force kernels
  – 30% speedup
  – 2x speedup in the neighbor list builder
• Implicit solvent non-bonded force kernels
  – 38% speedup (up to 3.5x for small systems)
  – 3.5x speedup in the neighbor list builder
• Improved simulation performance
  – 30%-57% faster simulations than NAMD 2.11
  – Up to 3.5x faster for GB implicit solvent
• Challenge
  – The time-step integrator and the rest of the critical-path code need work
• Planned for release in NAMD 2.12