

  1. Improving NAMD Performance and Scaling on Heterogeneous Architectures
     David J. Hardy and Julio D. C. Maia
     NIH Center for Macromolecular Modeling and Bioinformatics
     Theoretical and Computational Biophysics Group
     Beckman Institute for Advanced Science and Technology
     University of Illinois at Urbana-Champaign

  2. NAMD: Scalable Molecular Dynamics
     J. Phillips, D. Hardy, J. Maia, et al., J. Chem. Phys. 153, 044130 (2020), https://doi.org/10.1063/5.0014475
     • Code written in C++/Charm++/CUDA
     • Performance scales to hundreds of thousands of CPU cores and tens of thousands of GPUs
       - Large systems (single-copy scaling)
       - Enhanced sampling (multiple-copy scaling)
     • Runs on everything from laptops up to supercomputers
     • Runs on AWS cloud and MS Azure
     • Tcl/Python script as input file
       - Workflow control
       - Method development at a higher level
     • Structure preparation and analysis with VMD
       - QwikMD
     [Images: E. coli chemosensory array; Zika virus]
     NAMD: http://www.ks.uiuc.edu/Research/namd/

  3. NAMD Highlights
     • User-defined forces
       - Grid forces
       - Interactive molecular dynamics
       - Steered molecular dynamics
     • Accelerated sampling methods
       - Replica exchange
     • Collective variables (Colvars)
       - Biased simulation
       - Enhanced sampling
     • Alchemical transformations
       - Free energy perturbation (FEP)
       - Thermodynamic integration (TI)
       - Constant-pH molecular dynamics
     • Hybrid QM/MM simulation
       - Multiple QM regions
     [Images: membrane vesicle fusion and formation, proteasome (MDFF+IMD) (grid forces), ABC transporter mechanism, DNA QM/MM simulation (Colvars)]
     Complete list of NAMD features: https://www.ks.uiuc.edu/Research/namd/2.14/ug/

  4. Molecular Dynamics Simulation
     Most fundamentally, integrate Newton's equations of motion:
       m_i d^2 r_i / dt^2 = F_i = -∇_i U(r_1, ..., r_N)
     • Integrate for up to billions of time steps
     • The nonbonded terms of U, Lennard-Jones and electrostatics, account for most of the computational work
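The Lennard-Jones and electrostatic terms named above dominate each force evaluation. Below is a minimal sketch of a single nonbonded pair interaction, not NAMD's actual kernel code; the function and parameter names, the Coulomb constant value, and the absence of exclusions, cutoff switching, and PME splitting are all simplifying assumptions.

```cuda
// Illustrative sketch only, not NAMD's kernel code: the pairwise
// Lennard-Jones + Coulomb interaction that dominates each timestep.
// Exclusions, cutoff switching, and PME splitting are omitted.
#include <cuda_runtime.h>
#include <math.h>
#include <stdio.h>

struct PairResult { double energy; double force_over_r; };

__host__ __device__ PairResult pair_interaction(double r2,      // squared distance
                                                double epsilon, // LJ well depth
                                                double sigma,   // LJ diameter
                                                double qi, double qj,
                                                double coulomb_const)
{
    double inv_r2 = 1.0 / r2;
    double s2  = sigma * sigma * inv_r2;
    double s6  = s2 * s2 * s2;   // (sigma/r)^6
    double s12 = s6 * s6;        // (sigma/r)^12
    double inv_r = sqrt(inv_r2);

    PairResult out;
    out.energy = 4.0 * epsilon * (s12 - s6) + coulomb_const * qi * qj * inv_r;
    // -dU/dr divided by r: multiply by the separation vector to get the force
    out.force_over_r = (24.0 * epsilon * (2.0 * s12 - s6)
                        + coulomb_const * qi * qj * inv_r) * inv_r2;
    return out;
}

int main() {
    // Example: opposite unit charges 3 Angstroms apart, toy LJ parameters;
    // 332.0636 is approximately the Coulomb constant in kcal*A/(mol*e^2).
    PairResult p = pair_interaction(3.0 * 3.0, 0.1, 3.0, 1.0, -1.0, 332.0636);
    printf("U = %g kcal/mol, F/r = %g kcal/(mol*A^2)\n", p.energy, p.force_over_r);
    return 0;
}
```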

  5. Parallelism for MD Simulation Is Limited to Each Timestep
     Computational workflow of MD:
     • Initialize particle positions
     • Loop over millions of timesteps:
       - Force calculation (about 99% of computational work)
       - Update positions (about 1% of computational work)
       - Occasional output: reduced quantities (energy, temperature, pressure) and position coordinates (trajectory snapshots)
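A minimal sketch of that workflow as a host-side loop launching two GPU kernels, roughly mirroring the 99%/1% split described on the slide. The kernel names (compute_forces, update_positions) are hypothetical and their bodies are trivial stubs; real MD kernels evaluate all bonded and nonbonded terms and use a proper integrator.

```cuda
// Sketch of the MD workflow above as a host loop launching two GPU kernels.
// compute_forces and update_positions are trivial stubs with hypothetical
// names; a real force kernel does ~99% of the work, the integrator ~1%.
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void compute_forces(const float4* pos, float4* force, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) force[i] = make_float4(0.f, 0.f, 0.f, 0.f);   // stub force evaluation
}

__global__ void update_positions(float4* pos, float4* vel,
                                 const float4* force, float dt, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {                                             // stub integrator (unit mass)
        vel[i].x += dt * force[i].x; vel[i].y += dt * force[i].y; vel[i].z += dt * force[i].z;
        pos[i].x += dt * vel[i].x;   pos[i].y += dt * vel[i].y;   pos[i].z += dt * vel[i].z;
    }
}

int main() {
    const int n = 1024, block = 128, grid = (n + block - 1) / block;
    const float dt = 2.0f;                        // femtoseconds
    float4 *d_pos, *d_vel, *d_force;
    cudaMalloc(&d_pos,   n * sizeof(float4));
    cudaMalloc(&d_vel,   n * sizeof(float4));
    cudaMalloc(&d_force, n * sizeof(float4));
    cudaMemset(d_pos, 0, n * sizeof(float4));     // initialize particle positions
    cudaMemset(d_vel, 0, n * sizeof(float4));

    for (long step = 0; step < 1000; ++step) {    // loop over (millions of) timesteps
        compute_forces<<<grid, block>>>(d_pos, d_force, n);               // ~99% of work
        update_positions<<<grid, block>>>(d_pos, d_vel, d_force, dt, n);  // ~1% of work
        if (step % 100 == 0) {
            // occasional output: copy back reduced quantities (energy, temperature,
            // pressure) and trajectory snapshots; omitted in this sketch
        }
    }
    cudaDeviceSynchronize();
    printf("MD loop sketch finished\n");
    cudaFree(d_pos); cudaFree(d_vel); cudaFree(d_force);
    return 0;
}
```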

  6. NAMD 2.14 Decomposes Force Terms into Fine-Grained Objects for Scalability
     [Diagram: decomposition of force terms into fine-grained objects; forces are offloaded to the GPU]

  7. NAMD 2.14 Excels at Scalable Parallelism on CPUs and GPUs
     [Plot: performance (ns/day) vs. number of nodes (10 to 1000+) on Summit and Frontera, for replications of the Satellite Tobacco Mosaic Virus (STMV): a 5x2x2 grid = 21M atoms and a 7x6x5 grid = 224M atoms]

  8. NAMD 2.14 Simulating SARS-CoV-2 on Summit
     Collaboration with the Amaro Lab at UCSD; images rendered by VMD: (A) virion, (B) spike, (C) glycan shield conformations
     Scaling performance:
     • ~305M-atom virion
     • ~8.5M-atom spike
     [Plots: performance (ns/day) vs. number of nodes (64 to 4096) for the spike-ACE2 system and the virion, comparing Summit CPU+GPU, Summit CPU-only, and Frontera CPU-only; 51% strong-scaling efficiency]

  9. Benchmarks on Single Nodes and Newer GPUs Reveal Problems
     [Chart: peak performance in TFLOPS: Pascal (P100) 9.3, Volta (V100) 15.7]
     NAMD 2.13 (2018) gains only ~20% performance going from P100 to V100, even though the hardware offers ~70% more peak performance!

  10. Profiling on Modern GPUs
     Profiling ApoA1 (92k atoms): NAMD 2.13 with 16 CPU cores and 1 Volta GPU
     [Profiler timeline: gaps in the GPU activity strip mean the GPU is idle!]

  11. NAMD 2.13 and 2.14 Have Limited GPU Performance
     • Offloading force calculation is not enough!
     • Overall utilization of modern GPUs is limited
     • We want better single-GPU performance
       - The majority of MD users run system sizes < 1M atoms on a single GPU
     • The DGX-2 has 16 V100 GPUs but only 48 CPU cores: we need to do more GPU work with less CPU power
     • Must transition from the GPU-offload approach to GPU-resident!

  12. NAMD 3.0: GPU-Resident NAMD
     https://www.ks.uiuc.edu/Research/namd/3.0alpha/
     • Fetches GPU force buffers directly from the force module
     • Bypasses any CPU-GPU memory transfers: only GPU kernels are called!
     • Converts forces into a structure-of-arrays (SOA) data structure using the GPU
     • Invokes GPU integration tasks once
     [Pipeline: Calculate Forces → Fetch GPU Force Buffers → Convert Buffers to SOA → Integrate All the Atoms]
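A minimal sketch of what a GPU-resident timestep might look like under the description above: the force buffers never leave the device, an AOS-to-SOA conversion kernel runs on the GPU, and the CPU's only job is to launch kernels. All names (calc_forces, aos_to_soa, integrate, ForcesSOA) are illustrative assumptions rather than NAMD's internal API, and the integrator is a simple one-step update instead of NAMD's actual scheme.

```cuda
// Sketch of a GPU-resident timestep: forces stay in device buffers, the
// AOS-to-SOA conversion runs on the GPU, and no data returns to the host.
#include <cuda_runtime.h>

struct ForcesSOA { double *fx, *fy, *fz; };      // structure-of-arrays force layout

// Stub standing in for the force module's kernels.
__global__ void calc_forces(const double3* pos, double3* f_aos, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) f_aos[i] = make_double3(0.0, 0.0, 0.0);
}

// Convert the force buffers from AOS to SOA entirely on the GPU.
__global__ void aos_to_soa(const double3* f_aos, ForcesSOA f, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) { f.fx[i] = f_aos[i].x; f.fy[i] = f_aos[i].y; f.fz[i] = f_aos[i].z; }
}

// Integrate all atoms in one kernel (simple one-step update for brevity).
__global__ void integrate(double3* pos, double3* vel, ForcesSOA f,
                          const double* inv_mass, double dt, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        vel[i].x += dt * inv_mass[i] * f.fx[i];
        vel[i].y += dt * inv_mass[i] * f.fy[i];
        vel[i].z += dt * inv_mass[i] * f.fz[i];
        pos[i].x += dt * vel[i].x;
        pos[i].y += dt * vel[i].y;
        pos[i].z += dt * vel[i].z;
    }
}

void gpu_resident_step(double3* d_pos, double3* d_vel, double3* d_f_aos,
                       ForcesSOA d_f, const double* d_inv_mass, double dt, int n) {
    int block = 128, grid = (n + block - 1) / block;
    calc_forces<<<grid, block>>>(d_pos, d_f_aos, n);   // force buffers stay on the GPU
    aos_to_soa<<<grid, block>>>(d_f_aos, d_f, n);      // SOA conversion on the GPU
    integrate<<<grid, block>>>(d_pos, d_vel, d_f, d_inv_mass, dt, n);
    // deliberately no cudaMemcpy anywhere: the CPU only launches kernels
}

int main() {
    const int n = 1000;
    double3 *d_pos, *d_vel, *d_f_aos;  double *d_inv_mass;  ForcesSOA d_f;
    cudaMalloc(&d_pos, n * sizeof(double3));     cudaMemset(d_pos, 0, n * sizeof(double3));
    cudaMalloc(&d_vel, n * sizeof(double3));     cudaMemset(d_vel, 0, n * sizeof(double3));
    cudaMalloc(&d_f_aos, n * sizeof(double3));
    cudaMalloc(&d_inv_mass, n * sizeof(double)); cudaMemset(d_inv_mass, 0, n * sizeof(double));
    cudaMalloc(&d_f.fx, n * sizeof(double));
    cudaMalloc(&d_f.fy, n * sizeof(double));
    cudaMalloc(&d_f.fz, n * sizeof(double));
    gpu_resident_step(d_pos, d_vel, d_f_aos, d_f, d_inv_mass, 2.0, n);
    cudaDeviceSynchronize();
    return 0;
}
```

In use, gpu_resident_step would be called once per timestep from a loop like the one sketched after slide 5; atom data would be copied to the device once at startup and returned only for occasional output.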

  13. NAMD 3.0 Has Better GPU Utilization
     [Profiler timelines: NAMD 2.14 shows gaps between GPU tasks (Forces, Integration); NAMD 3.0 runs Forces and Integration back-to-back with no CPU bottlenecks]

  14. NAMD 3.0: Performance on Different Systems
     NVE, 12 Å cutoff, 2 fs timestep; NVIDIA Titan V with Intel Xeon E5-2650 v2 (16 physical cores)
     [Bar chart: ns/day for NAMD 2.14 vs. NAMD 3.0 on JAC (23k atoms), ApoA1 (92k), F1-ATPase (327k), and STMV (1.06M), ranging from 3.8 ns/day (STMV, NAMD 2.14) up to 254.4 ns/day (JAC, NAMD 3.0); NAMD 3.0 is substantially faster on every system]

  15. NAMD 3.0: Multi-Copy Performance, Aggregate Throughput on a DGX-2
     ApoA1 (92k atoms), 16 replicas, one for each NVIDIA V100; aggregate throughput in ns/day:
                     NAMD 2.14    NAMD 3.0
     12 Å cutoff     283.7        1,924.36
     8 Å cutoff      283.84       3,005.65

  16. NAMD 3.0: Single-Trajectory Performance on Multiple GPUs
     STMV (1.06M atoms), 2 fs timestep, no PME yet
     [Chart: ns/day vs. number of GPUs (1 to 8). NAMD 3.0 scales from 11.5 ns/day on 1 GPU to 50.8 ns/day on 8 GPUs; NAMD 2.14 improves only from 8.3 to 13.0 ns/day]

  17. PME Impedes Scalability
     • For multi-node scaling, 3D FFT communication cost grows faster than computation cost
     • For single-node multi-GPU scaling:
       - 3D FFTs are too small to parallelize effectively with cuFFT
       - Too much latency is introduced by pencil decomposition with cuFFT 1D FFTs
       - Is task-based parallelism best, delegating one GPU to the 3D FFTs and the reciprocal-space calculation?
       - That requires gathering all grid data onto that one GPU and being careful not to overload it with other work
     • Why not use a better-scaling algorithm, such as MSM?
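For context, the PME reciprocal-space part is built around 3D FFTs of a charge grid. The sketch below shows only the forward real-to-complex transform of such a grid with cuFFT; the grid dimensions are an illustrative assumption, and charge spreading, the influence-function scaling, the inverse transform, and force interpolation are only indicated in comments. A single transform of this size is a small amount of work for one modern GPU, which is part of why splitting it across several GPUs adds more latency than it saves.

```cuda
// Minimal cuFFT sketch: forward 3D FFT of a PME-style charge grid.
// Grid dimensions here are illustrative (build with -lcufft).
#include <cuda_runtime.h>
#include <cufft.h>
#include <stdio.h>

int main() {
    const int nx = 108, ny = 108, nz = 108;      // charge-grid dimensions (example)
    cufftReal*    d_grid;                        // real-space charge grid
    cufftComplex* d_grid_hat;                    // reciprocal-space grid

    cudaMalloc(&d_grid,     sizeof(cufftReal)    * nx * ny * nz);
    cudaMalloc(&d_grid_hat, sizeof(cufftComplex) * nx * ny * (nz / 2 + 1));
    cudaMemset(d_grid, 0, sizeof(cufftReal) * nx * ny * nz);  // charges would be spread here

    cufftHandle plan;
    cufftPlan3d(&plan, nx, ny, nz, CUFFT_R2C);
    cufftExecR2C(plan, d_grid, d_grid_hat);
    // ...scale by the reciprocal-space influence function, then inverse
    // transform (CUFFT_C2R) and interpolate forces back to the atoms...
    cudaDeviceSynchronize();

    printf("3D R2C FFT of a %dx%dx%d grid done\n", nx, ny, nz);
    cufftDestroy(plan);
    cudaFree(d_grid);
    cudaFree(d_grid_hat);
    return 0;
}
```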

  18. Multilevel Summation Method (MSM)
     D. Hardy, et al., J. Chem. Theory Comput. 11(2), 766-779 (2015), https://doi.org/10.1021/ct5009075
     D. Hardy, et al., J. Chem. Phys. 144, 114112 (2016), https://doi.org/10.1063/1.4943868
     • Split the 1/r potential into a short-range cutoff part plus smoothed parts that are successively more slowly varying. All but the top-level potential are cut off.
     • Smoothed potentials are interpolated from successively coarser grids.
     • Finest grid spacing h and smallest cutoff distance a are doubled at each successive level.
     [Diagram: splitting of the 1/r potential at cutoffs a and 2a, with the smoothed potentials interpolated from the h-grid and 2h-grid]
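A small numerical sketch of the splitting idea, assuming a two-smoothed-level truncation and using a truncated Taylor expansion of s^(-1/2) about s = 1 (with s = (r/a)^2) as the smoothing function, which is one of the choices discussed in the MSM papers cited above; the function names are illustrative.

```cuda
// Numerical sketch of the MSM splitting: 1/r = (short-range part, cut off at a)
//   + (intermediate smoothed part, cut off at 2a) + (top-level smoothed part).
// gamma_soft is a truncated Taylor expansion of s^(-1/2) about s = 1, s = rho^2.
#include <math.h>
#include <stdio.h>

__host__ __device__ double gamma_soft(double rho) {
    if (rho >= 1.0) return 1.0 / rho;                   // unsoftened beyond the cutoff
    double s = rho * rho;
    return 15.0/8.0 - (5.0/4.0)*s + (3.0/8.0)*s*s;      // smooth and bounded near r = 0
}

// Smoothed kernel at level k: g_k(r) = gamma(r / (2^k a)) / (2^k a).
__host__ __device__ double g_level(double r, double a, int k) {
    double ak = ldexp(a, k);                            // a * 2^k
    return gamma_soft(r / ak) / ak;
}

int main() {
    const double a = 8.0;                               // short-range cutoff (Angstroms)
    for (double r = 2.0; r <= 26.0; r += 4.0) {
        double short_part = 1.0/r - g_level(r, a, 0);                // vanishes for r >= a
        double mid_part   = g_level(r, a, 0) - g_level(r, a, 1);     // vanishes for r >= 2a
        double top_part   = g_level(r, a, 1);           // slowly varying, never cut off
        printf("r = %5.1f   1/r = %.6f   split sum = %.6f\n",
               r, 1.0/r, short_part + mid_part + top_part);
    }
    return 0;
}
```

The printed split sum reproduces 1/r exactly at every distance, while each piece is either cut off or slowly varying, which is what lets the smoothed parts be interpolated from coarse grids.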

  19. MSM Calculation is O(N)
     exact force = short-range part + interpolated long-range part
     Computational steps (up and down the grid hierarchy: h-grid, 2h-grid, 4h-grid, ...):
     • anterpolation ⟺ PME charge spreading
     • restriction (h-grid → 2h-grid → 4h-grid)
     • grid cutoff ⟺ 3D convolution (the long-range cutoff parts, one per level)
     • prolongation (back down: 4h-grid → 2h-grid → h-grid)
     • interpolation ⟺ PME force interpolation
     • short-range cutoff part computed directly from atom positions and charges
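A purely structural sketch of the computational steps listed above; every function here is a hypothetical stub named after the step it stands for, and a real implementation operates on atom and grid data structures.

```cuda
// Structural sketch of the MSM steps; all functions are hypothetical stubs.
#include <stdio.h>

static void anterpolation(void)      { /* spread atom charges onto the finest (h) grid */ }
static void restriction(int level)   { /* coarsen grid charges: level -> level+1 (h -> 2h) */ }
static void grid_cutoff(int level)   { /* short-range 3D convolution on this level's grid */ }
static void top_level(void)          { /* full convolution on the small coarsest grid */ }
static void prolongation(int level)  { /* interpolate potentials: level+1 -> level (2h -> h) */ }
static void interpolation(void)      { /* grid potentials -> long-range forces on atoms */ }
static void short_range_part(void)   { /* exact pairwise part within the cutoff a */ }

static void msm_long_range(int num_levels) {
    anterpolation();                               // analogous to PME charge spreading
    for (int k = 0; k < num_levels - 1; ++k)       // go up the grid hierarchy
        restriction(k);
    for (int k = 0; k < num_levels - 1; ++k)       // local cutoff convolution per level
        grid_cutoff(k);
    top_level();                                   // coarsest grid handled in full
    for (int k = num_levels - 2; k >= 0; --k)      // come back down the hierarchy
        prolongation(k);
    interpolation();                               // analogous to PME force interpolation
}

int main() {
    short_range_part();                            // exact short-range force part
    msm_long_range(4);                             // interpolated long-range part
    printf("MSM step sketch complete\n");
    return 0;
}
```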

  20. Periodic MSM: Replaces PME
     • The previous implementation was fine for non-periodic boundaries but insufficient for periodic boundary conditions
       - Lower accuracy than PME; requires the system to be neutrally charged
     • New developments for MSM:
       - Interpolation with periodic B-spline basis functions gives the same accuracy as PME
       - The infinite 1/r tail is handled as a reciprocal-space calculation on the top-level grid
       - The number of grid levels can be terminated long before reaching a single point; use this to bound the size of the FFT
       - Communication is nearest-neighbor up the tree to the top grid level

  21. Extending NAMD 3.0 to Multiple Nodes
     • Reintroducing Charm++ communication
       - Fast GPU integration calls the force kernels directly
       - Unused Sequencer user-level threads are put to sleep
       - Threads are awakened for atom migration between patches and for coordinate output
     • Will GPU-direct messaging be the best alternative?
       - Charm++ support is being developed

  22. Additional Challenges for NAMD
     • Feature-complete GPU-resident version
       - NAMD 3.0 for now supports just a subset of features
     • Incorporating Colvars (collective variables) force biasing
       - Poses a significant performance penalty without reimplementing parts of Colvars on the GPU
     • Introducing support for other GPU vendors (AMD GPUs, Intel GPUs)
       - AMD HIP port of NAMD 2.14; still working on 3.0
       - Intel DPC++ port of the non-bonded CUDA kernels

  23. Acknowledgments
     • NAMD development is funded by NIH P41-GM104601
     • NAMD team: David Hardy, Julio Maia, Jim Phillips, John Stone, Mohammad Soroush Barhaghi, Mariano Spivak, Wei Jiang, Rafael Bernardi, Ronak Buch, Jaemin Choi
