GROMACS simulation optimisation

Olivier Fisette
olivier.fisette@usask.ca
Advanced Research Computing, ICT
University of Saskatchewan
https://wiki.usask.ca/display/ARC/

WestGrid 2020 Summer School
https://wgschool.netlify.app/
2020-06-15

CC BY 4.0
Presentation

● What is this session about?
  – Maximising the performance and throughput of MD simulations performed with GROMACS
  – Understanding how GROMACS accelerates and parallelises simulations
● Intended audience
  – You have already performed MD simulations with GROMACS.
  – You do not have a deep knowledge of GROMACS’ architecture.
● The topics will be mostly technical rather than scientific, but the two cannot be separated entirely.
● The slides and a pre-recorded presentation are available online.
● An interactive Zoom session will be held at 11:00-13:00 PDT to allow attendees to ask their questions.
Contents

● Motivation
● Basics of parallel performance
● The limitations of non-bonded interactions
● GROMACS parallelism
  – Domain decomposition
  – Shared memory parallelism
  – Hardware acceleration (CPU)
● Optimising a simulation in practice
● GROMACS and GPUs
● Tuning non-bonded interactions
● Integrator tricks
● Concluding remarks
● References
● Annex: example MDP file for recent GROMACS
Motivation

● Why do we care about the performance of our MD simulations?
  – More simulation time means better sampling of biological events.
Motivation

● Why do we care about the performance of our MD simulations?
  – More simulation time means better sampling of biological events.

[Figure: timescales of biological events, from 10⁻¹⁵ s to 10³ s — vibration, H transfer / H bonding, libration, rotational diffusion, side-chain rotation, ligand binding, catalysis, folding/unfolding, allosteric regulation — alongside the simulation lengths reached by MD over the years (1977, 1995, 2008, 2010). Fisette et al. 2012, J. Biomed. Biotechnol.]
Motivation

● Why do we care about the performance of our MD simulations?
  – More simulation time means better sampling of biological events.
● How do we make GROMACS faster?
  – We use several CPUs in parallel.
  – We use GPUs.
● When using CPUs in parallel, there is a loss of efficiency (e.g. doubling the number of CPUs does not always double the performance).
  1. How do we measure efficiency?
  2. Why does efficiency decrease?
  3. How do we avoid or limit loss of efficiency?
  4. How can we best configure our simulations to use multiple CPUs?
Speedup and efficiency

● Speedup (S) is the ratio of serial over parallel execution time (t):

  S = t_serial / t_parallel

  – Example: running a program on a single CPU core takes 10 minutes to complete, but only 6 minutes when run on 2 cores; the speedup is 1.67.
● Efficiency (η) is the ratio of speedup over number of parallel tasks (s):

  η = S / s

  – Example: a 1.67 speedup on 2 cores yields an efficiency of 0.835, or 83.5 %.
  – When the speedup is equal to the number of parallel tasks (S = s), the efficiency is said to be linear (η = 1.0).
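The same arithmetic can be scripted when comparing benchmark timings. A minimal sketch, using the hypothetical timings from the example above:

#!/usr/bin/env bash
# Compute speedup and efficiency from two wall-clock timings.
# The values below are the example timings (in minutes) from this slide.
t_serial=10
t_parallel=6
cores=2

awk -v ts="$t_serial" -v tp="$t_parallel" -v n="$cores" 'BEGIN {
    s = ts / tp      # speedup  S = t_serial / t_parallel
    eta = s / n      # efficiency  eta = S / s (number of parallel tasks)
    printf "speedup = %.2f, efficiency = %.1f %%\n", s, 100 * eta
}'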
How well does GROMACS scale?

● Rule of thumb: the scaling limit is ~100 atoms / CPU core.
  – At that point, adding more CPUs will not make your simulation go any faster.
  – Efficiency decreases long before that!
● Efficiency depends on system size, composition, and simulation parameters.
● To avoid wasting resources, you should measure scaling for each new molecular system and parameter set.

[Figure: GROMACS scaling on SuperMUC for a ~150 000-atom simulation — performance (ns/day) versus number of cores (up to ~600), compared to linear scaling, with an annotation at ~300 atoms/core.]
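Measuring scaling boils down to running short benchmark jobs at several core counts and comparing ns/day. A minimal sketch, assuming the same modules as the job scripts later in this deck and a prepared input file named topol.tpr (a placeholder for your own system):

#!/usr/bin/env bash
# Submit short benchmark runs at several core counts on a single node.
for n in 8 16 32 64; do
    sbatch --nodes=1 --ntasks-per-node="$n" --time=0:30:0 \
        --job-name="bench_${n}" --wrap="
        module load gcc/7.3.0 openmpi/3.1.2 gromacs/2020.2
        srun gmx_mpi mdrun -s topol.tpr -deffnm bench_${n} \
            -nsteps 10000 -resethway -noconfout"
done
# Afterwards, compare the 'Performance:' line (ns/day) in each bench_*.log.
# -resethway excludes start-up cost from the timings; -noconfout skips the
# final coordinate output, which is not needed for a benchmark.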
Why are MD simulations so computationally expensive?

● Most time in MD simulations is spent computing interatomic potentials from the force field:

  V = Σ_bonds k_b (b − b_0)²
    + Σ_angles k_θ (θ − θ_0)²
    + Σ_dihedrals k_φ [1 + cos(nφ − δ)]
    + Σ_impropers k_ω (ω − ω_0)²
    + Σ_VdW ε [(r_min/r)¹² − (r_min/r)⁶]
    + Σ_Coulomb k_e q_i q_j / r

● Non-bonded interactions are the bulk of the work.
  – Adding one atom to a 1000-atom system adds 0 to 3 new bonds.
  – Adding one atom to a 1000-atom system adds 1000 new non-bonded pairs!
  – Complexity grows quadratically with the number of atoms: O(n²)
  – Clearly, this is not sustainable!
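To make the quadratic growth concrete: the number of unique non-bonded pairs among n atoms is n(n−1)/2. A quick sketch of how fast that grows:

#!/usr/bin/env bash
# Unique atom pairs, n(n-1)/2, for a few system sizes.
for n in 1000 10000 100000; do
    awk -v n="$n" 'BEGIN { printf "%7d atoms -> %.2e pairs\n", n, n*(n-1)/2 }'
done
# 1000 atoms -> ~5.0e+05 pairs; 100 000 atoms -> ~5.0e+09 pairs.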
Neighbour lists make large simulations possible

● Only non-bonded interactions between atoms that are close are considered.
  – Potentials between atoms farther apart than a cut-off (e.g. 10 Å) are not computed.
● Long-range electrostatics are computed with Particle Mesh Ewald (PME).
● Neighbour lists are used to keep track of atoms in proximity.
  – These lists are updated as the simulation progresses.
  – GROMACS uses Verlet lists.
● Complexity becomes O(n log(n)).
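In practice these choices appear as MDP parameters. A minimal sketch of the relevant fragment, meant to be merged into a full MDP file such as the one in the annex; the 1.0 nm cut-offs and grid spacing are illustrative values, not recommendations:

#!/usr/bin/env bash
# Write the non-bonded / PME portion of an MDP file (illustrative values only).
cat > nonbonded.mdp << 'EOF'
cutoff-scheme    = Verlet    ; buffered Verlet pair lists
nstlist          = 20        ; neighbour-list update interval (steps)
rcoulomb         = 1.0       ; short-range electrostatic cut-off (nm)
rvdw             = 1.0       ; van der Waals cut-off (nm)
coulombtype      = PME       ; Particle Mesh Ewald for long-range electrostatics
fourierspacing   = 0.12      ; PME grid spacing (nm)
EOF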
Overview of GROMACS parallelism

● GROMACS uses a three-level hybrid parallel approach.
  – Level 1: spatial domain decomposition (PP ranks and PME ranks)
  – Level 2: shared memory (CPU threads)
  – Level 3: hardware acceleration (SIMD operations, GPU cores)
  – All levels are independent.
  – All levels can be used together.
● This allows GROMACS to take full advantage of modern supercomputers and be very flexible at the same time.
● It requires the user to understand how the program works and to pay attention.
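The three levels map onto familiar mdrun controls. A hedged sketch of a single-node launch; the rank and thread counts are arbitrary examples, and -nb gpu only applies if you have a GPU build and a GPU available:

#!/usr/bin/env bash
# Level 1: 4 thread-MPI ranks -> domain decomposition into 4 cells
# Level 2: 8 OpenMP threads per rank -> shared-memory parallelism
# Level 3: SIMD is used automatically; non-bonded work can be offloaded to a GPU
gmx mdrun -ntmpi 4 -ntomp 8 -nb gpu -deffnm md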
Spatial domain decomposition

● Let us consider a water box as our MD system.
● When performing an MD simulation on a single CPU core, that core is responsible for all non-bonded potentials:
  – Short-range interactions (using cut-offs and neighbour lists)
  – Long-range interactions (using PME)
Spatial domain decomposition

● One strategy to use several CPU cores is to break up the system into smaller cells.
● GROMACS performs this domain decomposition (DD) using MPI.
● Some MPI ranks compute short-range particle-particle potentials (PP ranks).
● Other MPI ranks compute long-range electrostatics using PME (PME ranks).
● Domain decomposition can be performed in all three dimensions (2D case shown).

[Figure: a 2D grid of PP rank cells plus two separate PME ranks.]
Spatial domain decomposition

● Each PP rank is responsible for a subset of atoms.
● Adjacent PP ranks need to exchange information:
  – Potentials between nearby atoms
  – Atoms that move from one cell to another
● Non-adjacent PP ranks do not exchange information.
  – Communication is minimised.
● GROMACS optimises the way cells are organised and the split between PP and PME ranks automatically.
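The automatic PP/PME split can also be overridden by hand when benchmarking shows it is suboptimal. A hedged sketch; the 16-rank PME count and file names are just examples:

#!/usr/bin/env bash
# Override the automatic PP/PME rank split: dedicate 16 of the MPI ranks
# to PME, leaving the rest for particle-particle (PP) work.
srun gmx_mpi mdrun -npme 16 -deffnm md

# Alternatively, gmx tune_pme can benchmark several PP/PME splits and
# report the fastest one, e.g.:
#   gmx tune_pme -np 80 -s md.tpr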
Spatial domain decomposition

Advantages
● Can distribute a simulation on many compute nodes
  – It is the only way to run a GROMACS simulation on several nodes.
● Performs very well for large systems (~1000 atoms per domain or more)
● Minimises the necessary memory per CPU
  – Better use of CPU cache.

Disadvantages
● Adds a significant overhead
  – Sometimes not worth it for single-node simulations
● Performs poorly for small systems
  – There is a limit to how small DD cells can be…
● Requires a fast network interconnect
  – InfiniBand and OmniPath are appropriate.
  – Ethernet is too slow.
Spatial domain decomposition

#!/usr/bin/env bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=32

# Using one full 32-core node

module load gcc/7.3.0
module load openmpi/3.1.2
module load gromacs/2020.2

srun gmx_mpi mdrun
Spatial domain decomposition

#!/usr/bin/env bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=32

# Using two full 32-core nodes

module load gcc/7.3.0
module load openmpi/3.1.2
module load gromacs/2020.2

srun gmx_mpi mdrun
Spatial domain decomposition

#!/usr/bin/env bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8

# Using only 8 cores on a single node (very small
# systems may not scale well to a full node)

module load gcc/7.3.0
module load openmpi/3.1.2
module load gromacs/2020.2

srun gmx_mpi mdrun
Spatial domain decomposition

#!/usr/bin/env bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8

# BAD: Using 2 nodes and 16 cores, 8 cores on each
# node. This will be slower than 16 cores on a
# single node. Always use full nodes in multi-node
# jobs.

module load gcc/7.3.0
module load openmpi/3.1.2
module load gromacs/2020.2

srun gmx_mpi mdrun
Spatial domain decomposition

#!/usr/bin/env bash
#SBATCH --ntasks=32

# BAD: Using 32 CPU cores that could be spread on
# many nodes. Always specify the number of nodes
# explicitly.

module load gcc/7.3.0
module load openmpi/3.1.2
module load gromacs/2020.2

srun gmx_mpi mdrun
Spatial domain decomposition

$ cat md.log
...
MPI library: MPI
...
Running on 2 nodes with total 80 cores, 80 logical cores
  Cores per node: 40
...
Initializing Domain Decomposition on 80 ranks
Will use 64 particle-particle and 16 PME only ranks
Using 16 separate PME ranks, as guessed by mdrun
...
Using 80 MPI processes
...
NOTE: 11.1 % of the available CPU time was lost due to load imbalance
...
NOTE: 16.0 % performance was lost because the PME ranks
      had more work to do than the PP ranks.
...
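Notes like these are the first thing to look for after a run. A small sketch of how to pull them, and the overall throughput, out of one or more logs (bench_*.log refers to the hypothetical benchmark runs sketched earlier):

#!/usr/bin/env bash
# Extract the load-imbalance and PP/PME notes shown above from a log.
grep -E "load imbalance|performance was lost|PME ranks" md.log

# Compare overall throughput between benchmark runs: the "Performance:"
# line near the end of a GROMACS log reports ns/day and hour/ns.
grep -H "^Performance:" bench_*.log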