Efficient Parallelization of Molecular Dynamics Simulations on Hybrid CPU/GPU Supercomputers Jaewoon Jung (RIKEN, RIKEN AICS) Yuji Sugita (RIKEN, RIKEN AICS, RIKEN QBiC, RIKEN iTHES)
Molecular Dynamics (MD)
1. Energy/forces are described by a classical molecular mechanics force field.
2. The state is updated according to the equations of motion
   $\frac{d\mathbf{r}_i}{dt} = \frac{\mathbf{p}_i}{m_i}$,  $\frac{d\mathbf{p}_i}{dt} = \mathbf{F}_i$,
   integrated numerically step by step, e.g.
   $\mathbf{r}_i(t + \Delta t) = \mathbf{r}_i(t) + \frac{\mathbf{p}_i(t)}{m_i}\Delta t$,  $\mathbf{p}_i(t + \Delta t) = \mathbf{p}_i(t) + \mathbf{F}_i(t)\Delta t$
   (a velocity Verlet sketch follows below).
Equation of motion => Integration => Long time MD trajectory => Ensemble generation
Long time MD trajectories are important to obtain thermodynamic quantities of target systems.
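A minimal sketch of one such integration step (velocity Verlet, written as plain host code; the Particle struct and the compute_forces callback are illustrative placeholders, not GENESIS data structures):

```cuda
#include <cstddef>

struct Particle {
    double r[3];   // position
    double p[3];   // momentum
    double f[3];   // force from the molecular mechanics force field
    double mass;
};

// One velocity Verlet step; atoms[i].f must hold F(t) on entry and the force
// field must be re-evaluated between the two half kicks.
void velocity_verlet_step(Particle* atoms, std::size_t n, double dt,
                          void (*compute_forces)(Particle*, std::size_t)) {
    for (std::size_t i = 0; i < n; ++i)
        for (int d = 0; d < 3; ++d) {
            atoms[i].p[d] += 0.5 * dt * atoms[i].f[d];            // half kick
            atoms[i].r[d] += dt * atoms[i].p[d] / atoms[i].mass;  // drift
        }
    compute_forces(atoms, n);                                     // new forces F(t + dt)
    for (std::size_t i = 0; i < n; ++i)
        for (int d = 0; d < 3; ++d)
            atoms[i].p[d] += 0.5 * dt * atoms[i].f[d];            // second half kick
}
```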
Potential energy in MD using PME
$E_{\text{total}} = \sum_{\text{bonds}} k_b (b - b_0)^2$   O(N)   (N = total number of particles)
$\;+\; \sum_{\text{angles}} k_\theta (\theta - \theta_0)^2$   O(N)
$\;+\; \sum_{\text{dihedrals}} V_n [1 + \cos(n\phi - \delta)]$   O(N)
$\;+\; \sum_{i=1}^{N} \sum_{j>i}^{N} \left[ \epsilon_{ij} \left( \left(\frac{r_{0ij}}{r_{ij}}\right)^{12} - 2\left(\frac{r_{0ij}}{r_{ij}}\right)^{6} \right) + \frac{q_i q_j}{\epsilon r_{ij}} \right]$   O(N^2)   <= Main bottleneck in MD
With PME, the non-bonded term is split into:
Real space: $\sum_{i<j} \left[ \epsilon_{ij} \left( \left(\frac{r_{0ij}}{r_{ij}}\right)^{12} - 2\left(\frac{r_{0ij}}{r_{ij}}\right)^{6} \right) + \frac{q_i q_j\, \mathrm{erfc}(\alpha r_{ij})}{\epsilon r_{ij}} \right]$,   O(CN)
Reciprocal space: $\sum_{\mathbf{k} \neq 0} \frac{\exp(-\mathbf{k}^2 / 4\alpha^2)}{\mathbf{k}^2}\, |\mathrm{FFT}(Q)(\mathbf{k})|^2$,   O(N log N)
GENESIS MD software (GENeralized-Ensemble SImulation System)
1. Aims at developing efficient and accurate methodologies for free energy calculations in biological systems.
2. Efficient parallelization: suitable for massively parallel supercomputers, in particular the K computer.
3. Applicability to large scale simulations.
4. Algorithms coupled with different molecular models such as coarse-grained, all-atom, and hybrid QM/MM.
5. Generalized ensemble with Replica Exchange Molecular Dynamics.
Ref: J. Jung et al., WIREs Comput. Mol. Sci. 5, 310-323 (2015)
Website: http://www.riken.jp/TMS2012/cbp/en/research/software/genesis/index.html
Parallelization of the real space interaction: Midpoint cell method (1)
Midpoint method: the interaction between two particles is assigned to the domain containing the midpoint of their positions.
Midpoint cell method: the interaction between two particles is assigned based on the midpoint of the cells in which each particle resides.
Small communication, efficient energy/force evaluations (a sketch of the cell assignment follows below).
Ref: J. Jung, T. Mori and Y. Sugita, JCC 35, 1064 (2014)
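A minimal sketch of the cell assignment, assuming an orthorhombic box divided into ncx x ncy x ncz cells; the integer-average midpoint and the absence of a tie-breaking rule are simplifications for illustration, not the exact GENESIS scheme (see the JCC 2014 reference for the real one):

```cuda
struct CellIndex { int x, y, z; };

// Cell that a particle belongs to, assuming coordinates wrapped into [0, box).
CellIndex particle_cell(const double r[3], const double box[3],
                        int ncx, int ncy, int ncz) {
    return { (int)(r[0] / box[0] * ncx),
             (int)(r[1] / box[1] * ncy),
             (int)(r[2] / box[2] * ncz) };
}

// The pair (i, j) is evaluated by the process that owns the "midpoint cell",
// i.e. the cell halfway between the two particles' cells, so each process only
// needs particle data from a thin shell of neighboring cells.
CellIndex midpoint_cell(CellIndex ci, CellIndex cj) {
    return { (ci.x + cj.x) / 2, (ci.y + cj.y) / 2, (ci.z + cj.z) / 2 };
}
```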
Basic domain decomposition using the midpoint cell method (2)
1. Space is partitioned into fixed-size boxes whose dimensions are larger than the cutoff distance.
2. Only information from the neighboring spaces (domains) is needed to compute the energies.
3. The communication volume per process decreases as the number of processes increases.
4. This gives good parallel efficiency and is suitable for large systems on massively parallel supercomputers.
Parallelization of FFT in GENESIS: Volumetric decomposition scheme for the 3D FFT
1. More communication steps than existing FFT schemes.
2. MPI Alltoall communications are performed only within one-dimensional process groups (existing schemes: communications within two-/three-dimensional process groups).
3. This reduces the communication cost for large numbers of processors (see the communicator sketch below).
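A hedged sketch of the assumed communicator layout (MPI host code; the process-grid shape and the rank-to-coordinate mapping are illustrative, not the GENESIS implementation). The point is that every MPI_Alltoall runs only inside a one-dimensional line of the px x py x pz process grid:

```cuda
#include <mpi.h>

// Split MPI_COMM_WORLD into 1D line communicators along x, y, and z.
static void make_line_communicators(int px, int py, int pz,
                                    MPI_Comm* cx, MPI_Comm* cy, MPI_Comm* cz) {
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    // Coordinates of this rank in the 3D process grid.
    int ix = rank % px;
    int iy = (rank / px) % py;
    int iz = rank / (px * py);
    // Ranks sharing (iy, iz) exchange data along x (and similarly for y and z);
    // the transposes between the three 1D FFT stages are MPI_Alltoall calls on
    // these small line communicators only.
    MPI_Comm_split(MPI_COMM_WORLD, iy + py * iz, ix, cx);
    MPI_Comm_split(MPI_COMM_WORLD, ix + px * iz, iy, cy);
    MPI_Comm_split(MPI_COMM_WORLD, ix + px * iy, iz, cz);
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    MPI_Comm cx, cy, cz;
    make_line_communicators(4, 4, 4, &cx, &cy, &cz);  // assumes 64 ranks
    // ... 1D FFTs along x, Alltoall on cx, 1D FFTs along y, Alltoall on cy, ...
    MPI_Comm_free(&cx); MPI_Comm_free(&cy); MPI_Comm_free(&cz);
    MPI_Finalize();
    return 0;
}
```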
FFT in GENESIS (two-dimensional view)
GENESIS: identical domain decomposition for the real space and the reciprocal space.
NAMD, Gromacs: different domain decompositions for the two spaces.
GENESIS performance on K
[Figures: ns/day versus number of cores (128-65,536) for STMV (1,066,628 atoms), comparing GENESIS r1.0, GENESIS r1.1, and NAMD 2.9; and ns/day for ApoA1 (92,224 atoms) on 128 cores, comparing GENESIS r1.0/r1.1/r1.2, NAMD 2.9, and Gromacs 5.0.7/5.1.2.]
Why the midpoint cell method for a GPU+CPU cluster?
1. The main bottleneck of MD is the real space non-bonded interaction for small numbers of processors.
2. The main bottleneck moves to the reciprocal space non-bonded interaction as the number of processors increases.
3. When the real space non-bonded interaction is assigned to GPUs, the reciprocal space interaction becomes even more critical.
4. The midpoint cell method with volumetric decomposition FFT is a good way to optimize the reciprocal space interaction because it avoids communications before/after the FFT.
5. In particular, the midpoint cell method with volumetric decomposition FFT is very useful for massively parallel supercomputers with GPUs.
Overview of CPU+GPU calculations
1. Computation-intensive work: GPU
 • Pairlist generation
 • Real space non-bonded interaction
2. Communication-intensive or less computation-intensive work: CPU
 • Reciprocal space non-bonded interaction with FFT
 • Bonded interactions
 • Exclusion list
3. Integration is performed on the CPU because of file I/O.
Real space non-bonded interaction on GPU (1) - non-excluded particle list scheme
The non-excluded particle list scheme is suitable for GPUs because it requires only a small amount of memory for the pairlist (a hedged illustration follows below).
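One possible realization of such a scheme, shown only as a hedged illustration (the mask layout and the cell capacity are assumptions, not the actual GENESIS data structure): each cell pair carries a compact bit mask of excluded (bonded 1-2/1-3/1-4) pairs, so the GPU kernel can loop over all atom pairs of the two cells and test a single bit instead of reading a large per-atom neighbor list.

```cuda
#include <cstdint>

constexpr int MAX_ATOMS_PER_CELL = 32;     // assumed cell capacity for this sketch

struct CellPairMask {
    // Bit aj of excluded[ai] is set when the pair (atom ai of cell i, atom aj of
    // cell j) must be skipped in the real space non-bonded interaction.
    uint32_t excluded[MAX_ATOMS_PER_CELL];
};

__host__ __device__ inline bool is_excluded(const CellPairMask& m, int ai, int aj) {
    return (m.excluded[ai] >> aj) & 1u;
}
```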
Real space non-bonded interaction on GPU (2) - How to organize blocks/threads for each cell pair
We use 32-thread blocks for efficient calculation on the GPU, mapping the threads to 8 atoms of cell i and 4 atoms of cell j (a CUDA sketch follows below).
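A hedged CUDA sketch of this 8 x 4 mapping (one 32-thread block, i.e. one warp, per cell pair; the data layout, the bare-Coulomb placeholder interaction, and the omission of the exclusion mask and of the forces on cell j atoms are simplifications, not the GENESIS kernel):

```cuda
__global__ void cellpair_forces(const float4* coords,      // x, y, z, charge per atom
                                const int2*   cell_pairs,  // (first atom of cell i, first atom of cell j)
                                const int2*   cell_sizes,  // (atoms in cell i, atoms in cell j)
                                float4*       forces,
                                float         cutoff2) {
    int pair = blockIdx.x;          // one cell pair per 32-thread block
    int lane = threadIdx.x;         // 0..31
    int ti   = lane / 4;            // covers 8 atoms of cell i per tile
    int tj   = lane % 4;            // covers 4 atoms of cell j per tile

    int2 start = cell_pairs[pair];
    int2 size  = cell_sizes[pair];

    for (int ai = ti; ai < size.x; ai += 8) {
        float4 ri = coords[start.x + ai];
        float3 fi = make_float3(0.0f, 0.0f, 0.0f);
        for (int aj = tj; aj < size.y; aj += 4) {
            float4 rj = coords[start.y + aj];
            float dx = ri.x - rj.x, dy = ri.y - rj.y, dz = ri.z - rj.z;
            float r2 = dx * dx + dy * dy + dz * dz;
            if (r2 < cutoff2) {
                // Placeholder pair interaction (bare Coulomb); a real kernel uses
                // LJ + erfc-damped Coulomb and skips excluded pairs via the mask.
                float inv_r = rsqrtf(r2);
                float coef  = ri.w * rj.w * inv_r * inv_r * inv_r;
                fi.x += coef * dx;  fi.y += coef * dy;  fi.z += coef * dz;
            }
        }
        // Each lane holds a partial sum over its share of cell j atoms; atomics
        // keep the reduction simple here (Newton's third law is omitted).
        atomicAdd(&forces[start.x + ai].x, fi.x);
        atomicAdd(&forces[start.x + ai].y, fi.y);
        atomicAdd(&forces[start.x + ai].z, fi.z);
    }
}
```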
Overview of GPU+CPU calculations with a multiple time step integrator
1. With the multiple time step integrator, the reciprocal space interaction is not computed every step.
2. On steps where the reciprocal space interaction is skipped, a subset of the real space interaction is assigned to the CPU to maximize performance (see the sketch below).
3. Integration is performed on the CPU only.
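A hedged sketch of this scheduling (plain host code; all work functions are empty placeholders standing for the GPU kernel launch, the CPU force routines, and the integrator):

```cuda
static void launch_gpu_realspace_nonbonded() {}   // pairlist + real space forces (GPU, asynchronous)
static void compute_pme_reciprocal_on_cpu()    {} // FFT-based reciprocal space forces
static void compute_realspace_subset_on_cpu()  {} // CPU share of the real space cell pairs
static void compute_bonded_on_cpu()            {}
static void wait_for_gpu_forces()              {}
static void integrate_on_cpu(double, bool)     {} // slow-force impulse applied on slow steps only

// The reciprocal space part is evaluated only every `nslow` steps; on the other
// steps the CPU, instead of sitting idle during the GPU kernel, takes a subset
// of the real space work.
void respa_md_loop(int nsteps, int nslow, double dt) {
    for (int step = 0; step < nsteps; ++step) {
        const bool slow_step = (step % nslow == 0);

        launch_gpu_realspace_nonbonded();          // GPU work overlaps the CPU work below
        if (slow_step)
            compute_pme_reciprocal_on_cpu();
        else
            compute_realspace_subset_on_cpu();     // reciprocal space skipped this step
        compute_bonded_on_cpu();
        wait_for_gpu_forces();

        integrate_on_cpu(dt, slow_step);           // integration stays on the CPU (file I/O)
    }
}
```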
Validation Tests (Energy drift)
Machine    Precision  Integrator               Energy drift
CPU        Double     Velocity Verlet           3.37 × 10^-6
CPU        Single     Velocity Verlet           1.03 × 10^-5
CPU        Double     RESPA (4 fs)              1.01 × 10^-6
CPU        Single     RESPA (4 fs)              8.92 × 10^-5
CPU+GPU    Double     Velocity Verlet           7.03 × 10^-6
CPU+GPU    Single     Velocity Verlet          -4.56 × 10^-5
CPU+GPU    Double     RESPA (4 fs)             -3.21 × 10^-6
CPU+GPU    Single     RESPA (4 fs)             -3.68 × 10^-5
CPU+GPU    Single     Langevin RESPA (8 fs)     5.48 × 10^-5
CPU+GPU    Single     Langevin RESPA (10 fs)    1.63 × 10^-6
• Unit: kT/ns/degree of freedom
• 2 fs time step with SHAKE/RATTLE/SETTLE constraints
• For RESPA, the slow-force time step is given in parentheses
• Our energy drift is similar to AMBER double- and hybrid-precision calculations.
Benchmark conditions
1. MD program: GENESIS
2. Systems: TIP3P water (22,000), STMV (1 million), crowding system 1 (11.7 million), and crowding system 2 (100 million atoms)
3. Cutoff: 12.0 Å
4. PME grid sizes: 192^3 (STMV), 384^3 (crowding system 1), and 768^3 (crowding system 2)
5. Integrators: velocity Verlet (VVER), RESPA (PME reciprocal space every second step), and Langevin RESPA (PME reciprocal space every fourth step)
Acceleration of real space interactions (1)
• System: 9,250 TIP3P water molecules
• Cutoff distance: 12 Å, box size: 64 Å × 64 Å × 64 Å
1. One GPU increases the speed by a factor of 3, and two GPUs by a factor of 6.
2. By assigning the CPU as well as the GPU to the real space interaction on steps where the FFT on the CPU is skipped, the speedup reaches 7.7 times.
Acceleration of real space interactions (2) Benchmark system (11.7 million atoms, Cutoff = 12.0 Å)
Comparison between real space and reciprocal space interactions
STMV (1 million atoms) and the 11.7 million atom system
1. In both systems, the main bottleneck is the reciprocal space interaction irrespective of the number of processors.
2. Therefore, it is important to optimize the reciprocal space interaction when CPU+GPU clusters are used (=> the midpoint cell method could be the best choice).
Comparison between TSUBAME and K
STMV (1 million atoms) and the 11.7 million atom system
1. K shows better parallel efficiency for the reciprocal space interaction than TSUBAME.
2. Nevertheless, TSUBAME shows better overall performance than K owing to the efficient evaluation of the real space interaction on GPUs.
Benchmark on TSUBAME
[Figures: VVER and RESPA performance for the 1 million atom and 11.7 million atom systems.]
Performance on TSUBAME for the 100 million atom system
Integrator       Number of Nodes  Time per step (ms)  Simulation time (ns/day)
VVER             512              126.09              1.37
VVER             1024             97.87               1.77
RESPA            512              109.80              1.57
RESPA            1024             70.77               2.44
Langevin RESPA   512              78.92               2.19
Langevin RESPA   1024             44.13               3.92
Summary
1. We implemented MD for GPU+CPU clusters.
2. We assigned GPUs to the real space non-bonded interactions, and CPUs to the reciprocal space interactions, bonded interactions, and integration.
3. We introduced a non-excluded particle list scheme for efficient usage of GPU memory.
4. We also optimized the usage of GPUs and CPUs for multiple time step integrators.
5. Benchmark results on TSUBAME show very good strong/weak scalability for the 1 million, 11.7 million, and 100 million atom systems.