Improving NAMD Performance on Volta GPUs
David Hardy, Research Programmer, University of Illinois at Urbana-Champaign
Ke Li, HPC Developer Technology Engineer, NVIDIA
John Stone, Senior Research Programmer, University of Illinois at Urbana-Champaign
Journey of a Legacy HPC Application
• NAMD has been developed for more than 20 years
• Parallel scaling of large systems on supercomputers: Blue Waters (NCSA), Titan (ORNL), Stampede (TACC), Summit (ORNL)
• First full-featured molecular dynamics code to adopt CUDA
  - Stone et al., J Comput Chem, 28:2618-2640, 2007
• Why were certain design choices made?
• What lessons have we learned?
• Where does development need to go in the age of Volta?
NAMD & VMD: Computational Microscope
Enable researchers to investigate systems described at the atomic scale.
• NAMD: molecular dynamics simulation
• VMD: visualization, system preparation, and analysis
[Figure: example systems including a neuron, a ribosome, and a virus capsid]
Simulated System Sizes Have Increased Exponentially
[Chart: number of atoms (log scale, 10^4 to 10^8) vs. year (1986-2014), with example systems lysozyme, ApoA1, ATP synthase, STMV, ribosome, and HIV capsid]
Parallelism in Molecular Dynamics Is Limited to Each Timestep
Computational workflow of MD (sketched below):
• Initialize coordinates
• Force calculation: about 99% of the computational work
• Update coordinates: about 1% of the computational work
• Occasional output of reduced quantities (energy, temperature, pressure)
• Occasional output of coordinates (trajectory snapshot)
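To make the loop structure concrete, here is a minimal sketch of the per-timestep workflow described above; the types, function names, and output frequencies are illustrative assumptions, not NAMD's actual API.

```cuda
// Minimal sketch of the MD timestep loop (illustrative names, not NAMD's API).
// Force calculation dominates each step; integration and occasional output
// are comparatively cheap.
#include <cstdio>
#include <vector>

struct System {
    std::vector<float> x, v, f;   // coordinates, velocities, forces
};

void computeForces(System &s) { /* ~99% of the work: nonbonded, PME, bonded */ }
void integrate(System &s)     { /* ~1% of the work: update coordinates, constraints, thermostat */ }

void runMD(System &sys, int numSteps, int energyFreq, int trajFreq) {
    for (int step = 0; step < numSteps; ++step) {
        computeForces(sys);                       // the part parallelized across the machine
        integrate(sys);                           // advance coordinates by one timestep
        if (step % energyFreq == 0)
            printf("step %d: output energy/temperature/pressure\n", step);
        if (step % trajFreq == 0)
            printf("step %d: output trajectory snapshot\n", step);
    }
}
```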
Work Dominated by Nonbonded Forces
Force calculation:
• 90%: Non-bonded forces, short-range cutoff
• 5%: Long-range electrostatics, gridded (e.g., PME)
• 2%: Bonded forces (bonds, angles, etc.)
• 2%: Correction for excluded interactions
Update coordinates:
• 1%: Integration, constraints, thermostat, barostat
Apply GPU acceleration first to the most expensive part.
Parallelize by Spatial Decomposition of Atoms
Data parallelism is common to many codes.
NAMD Also Decomposes Compute Objects
Kale et al., J. Comp. Phys., 151:283-312, 1999
• Spatially decompose data and communication
• Separate but related work decomposition
• "Compute objects" create a much greater degree of parallelism, facilitating iterative, measurement-based load balancing, all provided by Charm++
Overlap Calculations, Offload Nonbonded Forces
Phillips et al., SC2002
• Nonbonded force computation is offloaded to the GPU
• Objects are assigned to processors and queued as data arrives
Early Nonbonded Forces Kernel Used All Memory Systems
Start with the most expensive calculation: direct nonbonded interactions.
• Decompose work into pairs of patches, identical to NAMD's structure.
• GPU hardware assigns patch pairs to multiprocessors dynamically.
• Force computation runs on a single multiprocessor (the GeForce 8800 GTX has 16).
[Diagram: 32-way SIMD multiprocessor with 32-256 multiplexed threads; 16 kB shared memory holds Patch A coordinates and parameters; 32 kB of registers hold Patch B coordinates, parameters, and forces; the texture unit (8 kB cache) interpolates the force table; constant memory (8 kB cache) holds exclusions; 768 MB main memory, no cache, 300+ cycle latency]
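As an illustration only (this is not the 2007 kernel), the sketch below shows how a patch-pair force kernel might use the memory systems named above: patch A staged in shared memory, each thread's patch B atom held in registers. The original kernel also interpolated a force table through the texture unit and read exclusion data from constant memory; those parts are reduced to comments here, and only a direct Coulomb term is computed. PATCH_MAX and all names are assumptions.

```cuda
// Hedged sketch of the memory placement described above, not the actual 2007
// kernel. Launched with one thread block per patch pair (shown for a single
// pair). The real kernel interpolated a force table via the texture unit and
// looked up exclusions from constant memory; both are omitted for brevity.
#include <cuda_runtime.h>

#define PATCH_MAX 128   // assumed upper bound on atoms per patch

__global__ void patchPairForces(const float4 *patchA, int nA,
                                const float4 *patchB, int nB,
                                float4 *forcesB, float cutoff2) {
    __shared__ float4 shA[PATCH_MAX];        // patch A coordinates & charges in shared memory
    for (int i = threadIdx.x; i < nA; i += blockDim.x)
        shA[i] = patchA[i];
    __syncthreads();

    int j = threadIdx.x;                     // one thread per patch B atom
    if (j >= nB) return;
    float4 bj = patchB[j];                   // patch B atom stays in registers
    float4 fj = make_float4(0.f, 0.f, 0.f, 0.f);

    for (int i = 0; i < nA; ++i) {           // stream patch A atoms from shared memory
        float dx = bj.x - shA[i].x;
        float dy = bj.y - shA[i].y;
        float dz = bj.z - shA[i].z;
        float r2 = dx*dx + dy*dy + dz*dz;
        if (r2 < cutoff2 && r2 > 0.f) {
            float invr = rsqrtf(r2);
            float f = bj.w * shA[i].w * invr * invr * invr;   // Coulomb only, for brevity
            fj.x += f*dx;  fj.y += f*dy;  fj.z += f*dz;
        }
    }
    forcesB[j] = fj;                         // accumulated force on the patch B atom
}
```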
NAMD Performance Improved Using Early GPUs
• Full NAMD, not a test harness
• Useful performance boost
  - 8x speedup for nonbonded forces
  - 5x speedup overall w/o PME
  - 3.5x speedup overall w/ PME
  - GPU = quad-core CPU
• Plans for better performance
  - Overlap GPU and CPU work
  - Tune or port remaining work (PME, bonded, integration, etc.)
[Chart: ApoA1 performance, seconds per step broken down into nonbond, PME, and other, for CPU vs. GPU; 2.67 GHz Core 2 Quad Extreme + GeForce 8800 GTX]
Reduce Communication Latency by Separating Work Units
Phillips et al., SC2008
[Timeline of one timestep: the GPU computes remote forces before local forces, so results needed by other nodes/processes can be communicated sooner, while the CPU overlaps its own force work and local update with the GPU]
Early GPU Fits Into Parallel NAMD as Coprocessor
• Offload the most expensive calculation: non-bonded forces
• Fits into existing parallelization
• Extends existing code without modifying core data structures
• Requires work aggregation and kernel scheduling considerations to optimize remote communication
• GPU is treated as a coprocessor
NAMD Scales Well on Kepler-Based Computers
[Chart: performance (ns per day, 2 fs timestep) vs. number of nodes (256 to 16384) for 21M-atom and 224M-atom systems on the Kepler-based Blue Waters XK7 and Titan XK7 (GTC16), Edison XC30 (SC14), and Blue Waters XE6 (SC14)]
Large Rate Difference Between Pascal and CPU
A 20x FLOP rate difference between GPU and CPU requires full use of CPU cores and vectorization!
• Balance between GPU and CPU capability keeps shifting towards the GPU
• NVIDIA plots show only through Pascal; Volta widens the performance gap!
• Difference made worse by multiple GPUs per CPU (e.g., DGX, Summit)
• Past efforts to balance work between GPU and CPU are now CPU bound
Balancing Work Between GPU and CPU
Case Study: Multilevel Summation CUDA Kernels in VMD
Hardy et al., J. Parallel Computing, 35:164-177, 2009
• Effort to balance computational work between GPUs and CPU
• GPU gets only straightforward data-parallel algorithms with regularized work units
• CPU gets algorithms less well suited to the GPU, plus "overflow" work
[Figure: electrostatic field of the chromatophore model (partial model, ~10M atoms) from the multilevel summation method, computed with 3 GPUs (G80) in ~90 seconds, 46x faster than a single CPU socket; Le et al., PLoS Comput Biol, 6:e1000939, 2010]
Time-Averaged Electrostatics Analysis on NCSA Blue Waters
Seconds per trajectory frame for one compute node:
• Cray XE6 compute node (32 CPU cores, 2x AMD 6200 CPUs): 9.33
• Cray XK6 GPU-accelerated compute node (16 CPU cores + NVIDIA X2090 Fermi GPU): 2.25
Speedup for GPU XK6 nodes vs. CPU XE6 nodes: XK6 nodes are 4.15x faster overall.
Tests on XK7 nodes with the Kepler K20X GPU (in progress) indicate MSM is CPU-bound:
• XK7 nodes are 4.3x faster overall
• Performance not much faster (yet) than the Fermi X2090
• Need to move spatial hashing, prolongation, and interpolation onto the GPU
Preliminary performance for VMD time-averaged electrostatics with the Multilevel Summation Method on the NCSA Blue Waters Early Science System.
Multilevel Summation on the GPU
Accelerate the short-range cutoff and lattice cutoff parts.
Performance profile for a 0.5 Å map of the potential for 1.5 M atoms. Hardware platform: Intel QX6700 CPU and NVIDIA GTX 280.

Computational steps           CPU (s)   w/ GPU (s)   Speedup
Short-range cutoff             480.07      14.87       32.3
Long-range:
  anterpolation                  0.18
  restriction                    0.16
  lattice cutoff                49.47       1.36       36.4
  prolongation                   0.17
  interpolation                  3.47
Total                          533.52      20.21       26.4

Multilevel summation of electrostatic potentials using graphics processing units. D. Hardy, J. Stone, K. Schulten. J. Parallel Computing, 35:164-177, 2009.
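For flavor, here is a simplified sketch of the short-range cutoff part: one thread per lattice point accumulates the softened short-range potential from atoms within the cutoff. The real VMD kernels bin atoms spatially and tile the lattice through shared memory; this version loops over all atoms, and the C1 Taylor softening shown is only one of the smoothing choices MSM supports. All names are illustrative.

```cuda
// Hedged sketch of the MSM short-range cutoff kernel: one thread per lattice
// point, brute-force loop over atoms (real kernels bin atoms spatially).
// atoms[n] = (x, y, z, charge); potential is a dim.x * dim.y * dim.z map.
#include <cuda_runtime.h>

__global__ void shortRangePotential(const float4 *atoms, int numAtoms,
                                    float *potential, int3 dim,
                                    float spacing, float cutoff) {
    int ix = blockIdx.x * blockDim.x + threadIdx.x;
    int iy = blockIdx.y * blockDim.y + threadIdx.y;
    int iz = blockIdx.z;
    if (ix >= dim.x || iy >= dim.y || iz >= dim.z) return;

    float3 p = make_float3(ix * spacing, iy * spacing, iz * spacing);
    float cut2 = cutoff * cutoff;
    float pot = 0.f;

    for (int n = 0; n < numAtoms; ++n) {
        float4 a = atoms[n];
        float dx = p.x - a.x, dy = p.y - a.y, dz = p.z - a.z;
        float r2 = dx*dx + dy*dy + dz*dz;
        if (r2 < cut2 && r2 > 0.f) {
            float r = sqrtf(r2);
            float s = r / cutoff;
            // short-range part: 1/r minus a smoothed 1/r (C1 Taylor softening)
            float g = (1.5f - 0.5f * s * s) / cutoff;
            pot += a.w * (1.f / r - g);
        }
    }
    potential[(iz * dim.y + iy) * dim.x + ix] += pot;
}
```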
Balancing Work Between GPU and CPU
Case Study: Multilevel Summation CUDA Kernels in VMD
Conclusions:
• Successful (at the time) exploitation of task-level and data-level parallelism
• Work was reasonably well balanced between CPUs and GPUs for Tesla and Fermi
• For Kepler and later, the approach was CPU bound
Reduce Latency, Offload All Force Computation
Emphasis on improving communication latency:
• Overlapped GPU communication and computation (2012)
• Offload atom-based work for PME (2013)
  - Use higher-order interpolation with a coarser grid
  - Reduce parallel FFT communication
Emphasis on using GPUs more effectively:
• Faster nonbonded force kernels (2016)
• Offload entire PME using cuFFT (for single-node use) (2016)
• Offload remaining force terms (2017)
  - Includes bonds, angles, dihedrals, impropers, crossterms, exclusions
Overlapped GPU Communication and Computation
• Allows incremental results from a single grid to be processed on the CPU before the grid finishes on the GPU
• Allows merging and prioritizing of remote and local work
• GPU side: write results to host-mapped memory (also without streaming); see the sketch below
  - __threadfence_system() and __syncthreads()
  - Atomic increment for the next output queue location
  - Write the result index to the output queue
• CPU side: poll the end of the output queue (an int array) in host memory
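A minimal sketch of this producer-consumer pattern, assuming hostForces and hostQueue point to host-mapped (zero-copy) memory allocated with cudaHostAlloc(cudaHostAllocMapped), queue slots initialized to -1, and queueTail zeroed with cudaMemcpyToSymbol before launch. Names and structure are illustrative, not NAMD's implementation.

```cuda
// Hedged sketch of overlapped GPU->CPU result streaming (not NAMD's code).
// hostForces and hostQueue are device pointers to host-mapped (zero-copy)
// memory; queue slots are initialized to -1 by the host.
#include <cuda_runtime.h>

__device__ unsigned int queueTail;           // next free queue slot, zeroed before launch

__global__ void forceKernel(float4 *hostForces, int *hostQueue, int numWorkUnits) {
    int unit = blockIdx.x;                   // one block computes one work unit
    // ... compute forces for this work unit and write them to hostForces ...

    __syncthreads();                         // all threads finished writing results
    __threadfence_system();                  // make the result writes visible to the host

    if (threadIdx.x == 0) {
        unsigned int slot = atomicAdd(&queueTail, 1u);
        hostQueue[slot] = unit;              // announce completion to the CPU
        __threadfence_system();              // flush the queue entry as well
    }
}

// CPU side: poll the end of the output queue while the kernel is still running,
// so work on each unit (e.g., integration) can start as soon as its results arrive.
void pollResults(volatile int *queue, int numWorkUnits) {
    for (int done = 0; done < numWorkUnits; ++done) {
        while (queue[done] < 0) { /* spin until the next slot is filled */ }
        int unit = queue[done];
        // processWorkUnit(unit): merge forces, begin integration for this unit
        (void)unit;
    }
}
```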
Non-overlapped Kernel Communication
Integration is unable to start until the GPU kernel finishes.
Overlapped Kernel Communication
The GPU kernel communicates results while running; patches begin integration as soon as their data arrives.
Non-bonded Force Computation in NAMD
• Two levels of spatial sorting
  - The simulation box is divided into patches
  - Within each patch, atoms are sorted spatially into groups of 32 using an orthogonal recursive bisection method (see the sketch below)
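Below is a minimal host-side sketch of the second sorting level under stated assumptions: recursively split an atom range along its widest axis until each group holds at most 32 atoms. NAMD's actual implementation differs in detail, and the types and names here are invented.

```cuda
// Hedged sketch of orthogonal recursive bisection into groups of <= 32 atoms
// (host code; illustrative, not NAMD's implementation).
#include <algorithm>
#include <vector>

struct Atom { float x, y, z; };

// Axis (0=x, 1=y, 2=z) with the largest coordinate extent over [lo, hi).
static int widestAxis(const std::vector<Atom> &a, int lo, int hi) {
    float mn[3] = { 1e30f, 1e30f, 1e30f }, mx[3] = { -1e30f, -1e30f, -1e30f };
    for (int i = lo; i < hi; ++i) {
        const float c[3] = { a[i].x, a[i].y, a[i].z };
        for (int d = 0; d < 3; ++d) {
            mn[d] = std::min(mn[d], c[d]);
            mx[d] = std::max(mx[d], c[d]);
        }
    }
    int best = 0;
    for (int d = 1; d < 3; ++d)
        if (mx[d] - mn[d] > mx[best] - mn[best]) best = d;
    return best;
}

// Recursively bisect [lo, hi) along its widest axis until each group has
// <= 32 atoms, so a group maps naturally onto one warp-sized tile.
void orbSort(std::vector<Atom> &a, int lo, int hi) {
    if (hi - lo <= 32) return;
    int axis = widestAxis(a, lo, hi);
    int mid = lo + (hi - lo) / 2;
    std::nth_element(a.begin() + lo, a.begin() + mid, a.begin() + hi,
                     [axis](const Atom &p, const Atom &q) {
                         const float pc[3] = { p.x, p.y, p.z };
                         const float qc[3] = { q.x, q.y, q.z };
                         return pc[axis] < qc[axis];
                     });
    orbSort(a, lo, mid);
    orbSort(a, mid, hi);
}
```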
Non-bonded Force Compute
• A compute is all pairwise interactions between two patches
• For the GPU, each compute is split into tiles of 32x32 atoms
[Diagram: patches 1-3 and the computes formed between pairs of patches]
Non-bonded Force Computation
• One warp per tile
• Loop through the 32x32 tile diagonally (sketched below)
  - Avoids race conditions when storing forces F_i and F_j
• Bitmask used for exclusion lookup
[Diagram: warps sweeping diagonally across a tile of 32 atoms from patch i by 32 atoms from patch j, accumulating F_i and F_j]
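The following sketch shows the diagonal traversal, assuming one warp-sized thread block per tile and invented array names: at each step every lane touches a distinct j index, so F_i (in registers) and F_j (in shared memory) accumulate without races, and a 32-bit mask per i-atom handles exclusions. The real kernel runs several warps per block, walks a tile list, and uses mixed-precision force accumulation; only a Coulomb term is computed here.

```cuda
// Hedged sketch of the warp-per-tile loop (not NAMD's actual kernel).
// Each block is a single warp handling one 32x32 tile; tiles[] gives the
// starting atom index of the tile's i- and j-side. xyzq[n] = (x,y,z,charge).
// exclMask holds one 32-bit mask per i-atom per tile: bit j set = excluded pair.
#include <cuda_runtime.h>

__global__ void tileForces(const float4 *xyzq, const int2 *tiles,
                           const unsigned int *exclMask,
                           float *forces,               // 3 floats per atom
                           float cutoff2) {
    int lane = threadIdx.x & 31;
    int2 tile = tiles[blockIdx.x];

    float4 ai = xyzq[tile.x + lane];        // my i-atom stays in registers
    __shared__ float4 aj[32];
    __shared__ float3 fj[32];
    aj[lane] = xyzq[tile.y + lane];
    fj[lane] = make_float3(0.f, 0.f, 0.f);
    float3 fi = make_float3(0.f, 0.f, 0.f);
    unsigned int excl = exclMask[blockIdx.x * 32 + lane];
    __syncwarp();

    for (int t = 0; t < 32; ++t) {
        int j = (lane + t) & 31;            // diagonal sweep: each lane hits a distinct j
        if (!((excl >> j) & 1u)) {
            float dx = ai.x - aj[j].x, dy = ai.y - aj[j].y, dz = ai.z - aj[j].z;
            float r2 = dx*dx + dy*dy + dz*dz;
            if (r2 < cutoff2 && r2 > 0.f) {
                float invr = rsqrtf(r2);
                float f = ai.w * aj[j].w * invr * invr * invr;   // Coulomb only, for brevity
                fi.x += f*dx;  fi.y += f*dy;  fi.z += f*dz;
                fj[j].x -= f*dx;  fj[j].y -= f*dy;  fj[j].z -= f*dz;  // no race: j unique per lane
            }
        }
        __syncwarp();                       // finish this diagonal before starting the next
    }

    // Atoms appear in many tiles, so accumulate into global forces atomically.
    atomicAdd(&forces[3*(tile.x + lane) + 0], fi.x);
    atomicAdd(&forces[3*(tile.x + lane) + 1], fi.y);
    atomicAdd(&forces[3*(tile.x + lane) + 2], fi.z);
    atomicAdd(&forces[3*(tile.y + lane) + 0], fj[lane].x);
    atomicAdd(&forces[3*(tile.y + lane) + 1], fj[lane].y);
    atomicAdd(&forces[3*(tile.y + lane) + 2], fj[lane].z);
}
```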