

  1. Improving NAMD Performance and Scaling on Heterogeneous Architectures
     David J. Hardy and Julio D. C. Maia
     NIH Center for Macromolecular Modeling and Bioinformatics
     Theoretical and Computational Biophysics Group
     Beckman Institute for Advanced Science and Technology
     University of Illinois at Urbana-Champaign

  2. NAMD: Scalable Molecular Dynamics
     J. Phillips, D. Hardy, J. Maia, et al., J. Chem. Phys. 153, 044130 (2020), https://doi.org/10.1063/5.0014475
     • Code written in C++/Charm++/CUDA
     • Performance scales to hundreds of thousands of CPU cores and tens of thousands of GPUs
       - Large systems (single-copy scaling)
       - Enhanced sampling (multiple-copy scaling)
     • Runs on everything from laptops up to supercomputers
     • Runs on AWS cloud and MS Azure
     • Tcl/Python script as input file
       - Workflow control
       - Method development at a higher level
     • Structure preparation and analysis with VMD
       - QwikMD
     [Images: E. coli chemosensory array; Zika virus]
     NAMD: http://www.ks.uiuc.edu/Research/namd/

  3. NAMD Highlights
     • User-defined forces
       - Grid forces
       - Interactive molecular dynamics
       - Steered molecular dynamics
     • Accelerated sampling methods
       - Replica exchange
     • Collective variables (Colvars)
       - Biased simulation
       - Enhanced sampling
     • Alchemical transformations
       - Free energy perturbation (FEP)
       - Thermodynamic integration (TI)
       - Constant-pH molecular dynamics
     • Hybrid QM/MM simulation
       - Multiple QM regions
     [Images: membrane vesicle fusion and formation, proteasome (MDFF+IMD) (grid forces), ABC transporter mechanism, DNA QM/MM simulation (Colvars)]
     Complete list of NAMD features: https://www.ks.uiuc.edu/Research/namd/2.14/ug/

  4. Molecular Dynamics Simulation
     Most fundamentally, integrate Newton's equations of motion:
       m_i d^2 r_i / dt^2 = F_i = -∇_i U(r_1, ..., r_N)
     • Integrate for up to billions of time steps
     • The nonbonded terms of U, Lennard-Jones and electrostatics, account for most of the computational work
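The Lennard-Jones and electrostatic terms named above dominate each force evaluation. Below is a minimal sketch of a single nonbonded pair interaction, not NAMD's actual kernel code; the function and parameter names, the Coulomb constant value, and the absence of exclusions, cutoff switching, and PME splitting are all simplifying assumptions.

```cuda
// Illustrative sketch only, not NAMD's kernel code: the pairwise
// Lennard-Jones + Coulomb interaction that dominates each timestep.
// Exclusions, cutoff switching, and PME splitting are omitted.
#include <cuda_runtime.h>
#include <math.h>
#include <stdio.h>

struct PairResult { double energy; double force_over_r; };

__host__ __device__ PairResult pair_interaction(double r2,      // squared distance
                                                double epsilon, // LJ well depth
                                                double sigma,   // LJ diameter
                                                double qi, double qj,
                                                double coulomb_const)
{
    double inv_r2 = 1.0 / r2;
    double s2  = sigma * sigma * inv_r2;
    double s6  = s2 * s2 * s2;   // (sigma/r)^6
    double s12 = s6 * s6;        // (sigma/r)^12
    double inv_r = sqrt(inv_r2);

    PairResult out;
    out.energy = 4.0 * epsilon * (s12 - s6) + coulomb_const * qi * qj * inv_r;
    // -dU/dr divided by r: multiply by the separation vector to get the force
    out.force_over_r = (24.0 * epsilon * (2.0 * s12 - s6)
                        + coulomb_const * qi * qj * inv_r) * inv_r2;
    return out;
}

int main() {
    // Example: opposite unit charges 3 Angstroms apart, toy LJ parameters;
    // 332.0636 is approximately the Coulomb constant in kcal*A/(mol*e^2).
    PairResult p = pair_interaction(3.0 * 3.0, 0.1, 3.0, 1.0, -1.0, 332.0636);
    printf("U = %g kcal/mol, F/r = %g kcal/(mol*A^2)\n", p.energy, p.force_over_r);
    return 0;
}
```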

  5. Parallelism for MD Simulation Is Limited to Each Timestep
     Computational workflow of MD:
     • Initialize particle positions
     • Loop over millions of timesteps:
       - Force calculation (about 99% of computational work)
       - Update positions (about 1% of computational work)
       - Occasional output: reduced quantities (energy, temperature, pressure) and position coordinates (trajectory snapshots)
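A minimal sketch of that workflow as a host-side loop launching two GPU kernels, roughly mirroring the 99%/1% split described on the slide. The kernel names (compute_forces, update_positions) are hypothetical and their bodies are trivial stubs; real MD kernels evaluate all bonded and nonbonded terms and use a proper integrator.

```cuda
// Sketch of the MD workflow above as a host loop launching two GPU kernels.
// compute_forces and update_positions are trivial stubs with hypothetical
// names; a real force kernel does ~99% of the work, the integrator ~1%.
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void compute_forces(const float4* pos, float4* force, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) force[i] = make_float4(0.f, 0.f, 0.f, 0.f);   // stub force evaluation
}

__global__ void update_positions(float4* pos, float4* vel,
                                 const float4* force, float dt, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {                                             // stub integrator (unit mass)
        vel[i].x += dt * force[i].x; vel[i].y += dt * force[i].y; vel[i].z += dt * force[i].z;
        pos[i].x += dt * vel[i].x;   pos[i].y += dt * vel[i].y;   pos[i].z += dt * vel[i].z;
    }
}

int main() {
    const int n = 1024, block = 128, grid = (n + block - 1) / block;
    const float dt = 2.0f;                        // femtoseconds
    float4 *d_pos, *d_vel, *d_force;
    cudaMalloc(&d_pos,   n * sizeof(float4));
    cudaMalloc(&d_vel,   n * sizeof(float4));
    cudaMalloc(&d_force, n * sizeof(float4));
    cudaMemset(d_pos, 0, n * sizeof(float4));     // initialize particle positions
    cudaMemset(d_vel, 0, n * sizeof(float4));

    for (long step = 0; step < 1000; ++step) {    // loop over (millions of) timesteps
        compute_forces<<<grid, block>>>(d_pos, d_force, n);               // ~99% of work
        update_positions<<<grid, block>>>(d_pos, d_vel, d_force, dt, n);  // ~1% of work
        if (step % 100 == 0) {
            // occasional output: copy back reduced quantities (energy, temperature,
            // pressure) and trajectory snapshots; omitted in this sketch
        }
    }
    cudaDeviceSynchronize();
    printf("MD loop sketch finished\n");
    cudaFree(d_pos); cudaFree(d_vel); cudaFree(d_force);
    return 0;
}
```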

  6. NAMD 2.14 Decomposes Force Terms into Fine-Grained Objects for Scalability
     [Diagram: decomposition of force terms into fine-grained objects; forces are offloaded to the GPU]

  7. NAMD 2.14 Excels at Scalable Parallelism on CPUs and GPUs
     [Plot: performance (ns/day) vs. number of nodes (10 to 1000+) on Summit and Frontera, for replications of the Satellite Tobacco Mosaic Virus (STMV): a 5x2x2 grid = 21M atoms and a 7x6x5 grid = 224M atoms]

  8. NAMD 2.14 Simulating SARS-CoV-2 on Summit
     Collaboration with the Amaro Lab at UCSD; images rendered by VMD: (A) virion, (B) spike, (C) glycan shield conformations
     Scaling performance:
     • ~305M-atom virion
     • ~8.5M-atom spike
     [Plots: performance (ns/day) vs. number of nodes (64 to 4096) for the spike-ACE2 system and the virion, comparing Summit CPU+GPU, Summit CPU-only, and Frontera CPU-only; 51% strong-scaling efficiency]

  9. Benchmarks on Single Nodes and Newer GPUs Reveal Problems
     [Chart: peak performance in TFLOPS: Pascal (P100) 9.3, Volta (V100) 15.7]
     NAMD 2.13 (2018) gains only ~20% performance going from P100 to V100, even though the hardware offers ~70% more peak performance!

  10. Profiling on Modern GPUs
     Profiling ApoA1 (92k atoms): NAMD 2.13 with 16 CPU cores and 1 Volta GPU
     [Profiler timeline: gaps in the GPU activity strip mean the GPU is idle!]

  11. NAMD 2.13 and 2.14 Have Limited GPU Performance
     • Offloading force calculation is not enough!
     • Overall utilization of modern GPUs is limited
     • We want better single-GPU performance
       - The majority of MD users run system sizes < 1M atoms on a single GPU
     • The DGX-2 has 16 V100 GPUs but only 48 CPU cores: we need to do more GPU work with less CPU power
     • Must transition from the GPU-offload approach to GPU-resident!

  12. NAMD 3.0: GPU-Resident NAMD
     https://www.ks.uiuc.edu/Research/namd/3.0alpha/
     • Fetches GPU force buffers directly from the force module
     • Bypasses any CPU-GPU memory transfers: only GPU kernels are called!
     • Converts forces into a structure-of-arrays (SOA) data structure using the GPU
     • Invokes GPU integration tasks once
     [Pipeline: Calculate Forces → Fetch GPU Force Buffers → Convert Buffers to SOA → Integrate All the Atoms]
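A minimal sketch of what a GPU-resident timestep might look like under the description above: the force buffers never leave the device, an AOS-to-SOA conversion kernel runs on the GPU, and the CPU's only job is to launch kernels. All names (calc_forces, aos_to_soa, integrate, ForcesSOA) are illustrative assumptions rather than NAMD's internal API, and the integrator is a simple one-step update instead of NAMD's actual scheme.

```cuda
// Sketch of a GPU-resident timestep: forces stay in device buffers, the
// AOS-to-SOA conversion runs on the GPU, and no data returns to the host.
#include <cuda_runtime.h>

struct ForcesSOA { double *fx, *fy, *fz; };      // structure-of-arrays force layout

// Stub standing in for the force module's kernels.
__global__ void calc_forces(const double3* pos, double3* f_aos, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) f_aos[i] = make_double3(0.0, 0.0, 0.0);
}

// Convert the force buffers from AOS to SOA entirely on the GPU.
__global__ void aos_to_soa(const double3* f_aos, ForcesSOA f, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) { f.fx[i] = f_aos[i].x; f.fy[i] = f_aos[i].y; f.fz[i] = f_aos[i].z; }
}

// Integrate all atoms in one kernel (simple one-step update for brevity).
__global__ void integrate(double3* pos, double3* vel, ForcesSOA f,
                          const double* inv_mass, double dt, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        vel[i].x += dt * inv_mass[i] * f.fx[i];
        vel[i].y += dt * inv_mass[i] * f.fy[i];
        vel[i].z += dt * inv_mass[i] * f.fz[i];
        pos[i].x += dt * vel[i].x;
        pos[i].y += dt * vel[i].y;
        pos[i].z += dt * vel[i].z;
    }
}

void gpu_resident_step(double3* d_pos, double3* d_vel, double3* d_f_aos,
                       ForcesSOA d_f, const double* d_inv_mass, double dt, int n) {
    int block = 128, grid = (n + block - 1) / block;
    calc_forces<<<grid, block>>>(d_pos, d_f_aos, n);   // force buffers stay on the GPU
    aos_to_soa<<<grid, block>>>(d_f_aos, d_f, n);      // SOA conversion on the GPU
    integrate<<<grid, block>>>(d_pos, d_vel, d_f, d_inv_mass, dt, n);
    // deliberately no cudaMemcpy anywhere: the CPU only launches kernels
}

int main() {
    const int n = 1000;
    double3 *d_pos, *d_vel, *d_f_aos;  double *d_inv_mass;  ForcesSOA d_f;
    cudaMalloc(&d_pos, n * sizeof(double3));     cudaMemset(d_pos, 0, n * sizeof(double3));
    cudaMalloc(&d_vel, n * sizeof(double3));     cudaMemset(d_vel, 0, n * sizeof(double3));
    cudaMalloc(&d_f_aos, n * sizeof(double3));
    cudaMalloc(&d_inv_mass, n * sizeof(double)); cudaMemset(d_inv_mass, 0, n * sizeof(double));
    cudaMalloc(&d_f.fx, n * sizeof(double));
    cudaMalloc(&d_f.fy, n * sizeof(double));
    cudaMalloc(&d_f.fz, n * sizeof(double));
    gpu_resident_step(d_pos, d_vel, d_f_aos, d_f, d_inv_mass, 2.0, n);
    cudaDeviceSynchronize();
    return 0;
}
```

In use, gpu_resident_step would be called once per timestep from a loop like the one sketched after slide 5; atom data would be copied to the device once at startup and returned only for occasional output.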

  13. NAMD 3.0 Has Better GPU Utilization
     [Profiler timelines: NAMD 2.14 shows gaps between GPU tasks (Forces, Integration); NAMD 3.0 runs Forces and Integration back-to-back with no CPU bottlenecks]

  14. NAMD 3.0: Performance on Different Systems
     NVE, 12 Å cutoff, 2 fs timestep; NVIDIA Titan V with Intel Xeon E5-2650 v2 (16 physical cores)
     [Bar chart: ns/day for NAMD 2.14 vs. NAMD 3.0 on JAC (23k atoms), ApoA1 (92k), F1-ATPase (327k), and STMV (1.06M), ranging from 3.8 ns/day (STMV, NAMD 2.14) up to 254.4 ns/day (JAC, NAMD 3.0); NAMD 3.0 is substantially faster on every system]

  15. NAMD 3.0: Multi-Copy Performance, Aggregate Throughput on a DGX-2
     ApoA1 (92k atoms), 16 replicas, one for each NVIDIA V100; aggregate throughput in ns/day:
                     NAMD 2.14    NAMD 3.0
     12 Å cutoff     283.7        1,924.36
     8 Å cutoff      283.84       3,005.65

  16. NAMD 3.0: Single-Trajectory Performance on Multiple GPUs
     STMV (1.06M atoms), 2 fs timestep, no PME yet
     [Chart: ns/day vs. number of GPUs (1 to 8). NAMD 3.0 scales from 11.5 ns/day on 1 GPU to 50.8 ns/day on 8 GPUs; NAMD 2.14 improves only from 8.3 to 13.0 ns/day]

  17. PME Impedes Scalability
     • For multi-node scaling, 3D FFT communication cost grows faster than computation cost
     • For single-node multi-GPU scaling:
       - 3D FFTs are too small to parallelize effectively with cuFFT
       - Too much latency is introduced by pencil decomposition with cuFFT 1D FFTs
       - Is task-based parallelism best, delegating one GPU to the 3D FFTs and the reciprocal-space calculation?
       - That requires gathering all grid data onto that one GPU and being careful not to overload it with other work
     • Why not use a better-scaling algorithm, such as MSM?
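For context, the PME reciprocal-space part is built around 3D FFTs of a charge grid. The sketch below shows only the forward real-to-complex transform of such a grid with cuFFT; the grid dimensions are an illustrative assumption, and charge spreading, the influence-function scaling, the inverse transform, and force interpolation are only indicated in comments. A single transform of this size is a small amount of work for one modern GPU, which is part of why splitting it across several GPUs adds more latency than it saves.

```cuda
// Minimal cuFFT sketch: forward 3D FFT of a PME-style charge grid.
// Grid dimensions here are illustrative (build with -lcufft).
#include <cuda_runtime.h>
#include <cufft.h>
#include <stdio.h>

int main() {
    const int nx = 108, ny = 108, nz = 108;      // charge-grid dimensions (example)
    cufftReal*    d_grid;                        // real-space charge grid
    cufftComplex* d_grid_hat;                    // reciprocal-space grid

    cudaMalloc(&d_grid,     sizeof(cufftReal)    * nx * ny * nz);
    cudaMalloc(&d_grid_hat, sizeof(cufftComplex) * nx * ny * (nz / 2 + 1));
    cudaMemset(d_grid, 0, sizeof(cufftReal) * nx * ny * nz);  // charges would be spread here

    cufftHandle plan;
    cufftPlan3d(&plan, nx, ny, nz, CUFFT_R2C);
    cufftExecR2C(plan, d_grid, d_grid_hat);
    // ...scale by the reciprocal-space influence function, then inverse
    // transform (CUFFT_C2R) and interpolate forces back to the atoms...
    cudaDeviceSynchronize();

    printf("3D R2C FFT of a %dx%dx%d grid done\n", nx, ny, nz);
    cufftDestroy(plan);
    cudaFree(d_grid);
    cudaFree(d_grid_hat);
    return 0;
}
```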

  18. Multilevel Summation Method (MSM)
     D. Hardy, et al., J. Chem. Theory Comput. 11(2), 766-779 (2015), https://doi.org/10.1021/ct5009075
     D. Hardy, et al., J. Chem. Phys. 144, 114112 (2016), https://doi.org/10.1063/1.4943868
     • Split the 1/r potential into a short-range cutoff part plus smoothed parts that are successively more slowly varying. All but the top-level potential are cut off.
     • Smoothed potentials are interpolated from successively coarser grids.
     • Finest grid spacing h and smallest cutoff distance a are doubled at each successive level.
     [Diagram: splitting of the 1/r potential at cutoffs a and 2a, with the smoothed potentials interpolated from the h-grid and 2h-grid]
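A small numerical sketch of the splitting idea, assuming a two-smoothed-level truncation and using a truncated Taylor expansion of s^(-1/2) about s = 1 (with s = (r/a)^2) as the smoothing function, which is one of the choices discussed in the MSM papers cited above; the function names are illustrative.

```cuda
// Numerical sketch of the MSM splitting: 1/r = (short-range part, cut off at a)
//   + (intermediate smoothed part, cut off at 2a) + (top-level smoothed part).
// gamma_soft is a truncated Taylor expansion of s^(-1/2) about s = 1, s = rho^2.
#include <math.h>
#include <stdio.h>

__host__ __device__ double gamma_soft(double rho) {
    if (rho >= 1.0) return 1.0 / rho;                   // unsoftened beyond the cutoff
    double s = rho * rho;
    return 15.0/8.0 - (5.0/4.0)*s + (3.0/8.0)*s*s;      // smooth and bounded near r = 0
}

// Smoothed kernel at level k: g_k(r) = gamma(r / (2^k a)) / (2^k a).
__host__ __device__ double g_level(double r, double a, int k) {
    double ak = ldexp(a, k);                            // a * 2^k
    return gamma_soft(r / ak) / ak;
}

int main() {
    const double a = 8.0;                               // short-range cutoff (Angstroms)
    for (double r = 2.0; r <= 26.0; r += 4.0) {
        double short_part = 1.0/r - g_level(r, a, 0);                // vanishes for r >= a
        double mid_part   = g_level(r, a, 0) - g_level(r, a, 1);     // vanishes for r >= 2a
        double top_part   = g_level(r, a, 1);           // slowly varying, never cut off
        printf("r = %5.1f   1/r = %.6f   split sum = %.6f\n",
               r, 1.0/r, short_part + mid_part + top_part);
    }
    return 0;
}
```

The printed split sum reproduces 1/r exactly at every distance, while each piece is either cut off or slowly varying, which is what lets the smoothed parts be interpolated from coarse grids.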

  19. MSM Calculation is O(N)
     exact force = short-range part + interpolated long-range part
     Computational steps (up and down the grid hierarchy: h-grid, 2h-grid, 4h-grid, ...):
     • anterpolation ⟺ PME charge spreading
     • restriction (h-grid → 2h-grid → 4h-grid)
     • grid cutoff ⟺ 3D convolution (the long-range cutoff parts, one per level)
     • prolongation (back down: 4h-grid → 2h-grid → h-grid)
     • interpolation ⟺ PME force interpolation
     • short-range cutoff part computed directly from atom positions and charges
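A purely structural sketch of the computational steps listed above; every function here is a hypothetical stub named after the step it stands for, and a real implementation operates on atom and grid data structures.

```cuda
// Structural sketch of the MSM steps; all functions are hypothetical stubs.
#include <stdio.h>

static void anterpolation(void)      { /* spread atom charges onto the finest (h) grid */ }
static void restriction(int level)   { /* coarsen grid charges: level -> level+1 (h -> 2h) */ }
static void grid_cutoff(int level)   { /* short-range 3D convolution on this level's grid */ }
static void top_level(void)          { /* full convolution on the small coarsest grid */ }
static void prolongation(int level)  { /* interpolate potentials: level+1 -> level (2h -> h) */ }
static void interpolation(void)      { /* grid potentials -> long-range forces on atoms */ }
static void short_range_part(void)   { /* exact pairwise part within the cutoff a */ }

static void msm_long_range(int num_levels) {
    anterpolation();                               // analogous to PME charge spreading
    for (int k = 0; k < num_levels - 1; ++k)       // go up the grid hierarchy
        restriction(k);
    for (int k = 0; k < num_levels - 1; ++k)       // local cutoff convolution per level
        grid_cutoff(k);
    top_level();                                   // coarsest grid handled in full
    for (int k = num_levels - 2; k >= 0; --k)      // come back down the hierarchy
        prolongation(k);
    interpolation();                               // analogous to PME force interpolation
}

int main() {
    short_range_part();                            // exact short-range force part
    msm_long_range(4);                             // interpolated long-range part
    printf("MSM step sketch complete\n");
    return 0;
}
```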

  20. Periodic MSM: Replaces PME
     • The previous implementation was fine for non-periodic boundaries but insufficient for periodic boundary conditions
       - Lower accuracy than PME; requires the system to be neutrally charged
     • New developments for MSM:
       - Interpolation with periodic B-spline basis functions gives the same accuracy as PME
       - The infinite 1/r tail is handled as a reciprocal-space calculation on the top-level grid
       - The number of grid levels can be terminated long before reaching a single point; use this to bound the size of the FFT
       - Communication is nearest-neighbor up the tree to the top grid level

  21. Extending NAMD 3.0 to Multiple Nodes
     • Reintroducing Charm++ communication
       - Fast GPU integration calls the force kernels directly
       - Unused Sequencer user-level threads are put to sleep
       - Threads are awakened for atom migration between patches and for coordinate output
     • Will GPU-direct messaging be the best alternative?
       - Charm++ support is being developed

  22. Additional Challenges for NAMD
     • Feature-complete GPU-resident version
       - NAMD 3.0 for now supports just a subset of features
     • Incorporating Colvars (collective variables) force biasing
       - Poses a significant performance penalty without reimplementing parts of Colvars on the GPU
     • Introducing support for other GPU vendors (AMD GPUs, Intel GPUs)
       - AMD HIP port of NAMD 2.14; still working on 3.0
       - Intel DPC++ port of the non-bonded CUDA kernels

  23. Acknowledgments
     • NAMD development is funded by NIH P41-GM104601
     • NAMD team: David Hardy, Julio Maia, Jim Phillips, John Stone, Mohammad Soroush Barhaghi, Mariano Spivak, Wei Jiang, Rafael Bernardi, Ronak Buch, Jaemin Choi
