Tackling Performance Bottlenecks in the Diversifying CUDA HPC Ecosystem: a Molecular Dynamics Perspective
Szilárd Páll, KTH Royal Institute of Technology
GTC 2015
Diversifying hardware & complex parallelism
● Increasingly:
  – parallel on multiple levels
  – heterogeneous
  – power constrained
[Figure: levels of parallelism — widening SIMD/SIMT units, increasing CPU/GPU core count & NUMA on die, “skinny” workstations to fat compute nodes, mini-clusters to petaflop machines, on-demand compute cloud]
Diversifying hardware & complex parallelism
● Need to address each level
● Choice of parallelization is important
● How much of the burden is placed on the user?
[Figure: same levels-of-parallelism spectrum as on the previous slide]
Molecular dynamics: modelling physics
[Figure: molecular mechanics potential energy function terms; reproduced under CC BY-SA from http://commons.wikimedia.org/wiki/File:MM_PEF.png]
Molecular dynamics: basics
● Given: N particles, their masses, and a potential V
● Newton's equations of motion
● Integrate (leap-frog):
  – accelerations → velocities
  – velocities → coordinates
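For reference, a minimal sketch of the update the slide alludes to, written in the standard textbook leap-frog form (symbols are the usual ones, not GROMACS-specific notation):

```latex
\begin{align*}
\mathbf{F}_i &= -\nabla_i V(\mathbf{r}_1,\dots,\mathbf{r}_N), &
\mathbf{a}_i &= \mathbf{F}_i / m_i \\
\mathbf{v}_i\!\left(t+\tfrac{\Delta t}{2}\right) &=
\mathbf{v}_i\!\left(t-\tfrac{\Delta t}{2}\right) + \mathbf{a}_i(t)\,\Delta t \\
\mathbf{r}_i(t+\Delta t) &= \mathbf{r}_i(t) +
\mathbf{v}_i\!\left(t+\tfrac{\Delta t}{2}\right)\Delta t
\end{align*}
```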
Molecular dynamics: interactions
Molecular dynamics: forces
● Bonded forces: loop over all interactions
  – few but localized → imbalance challenge (threading & domain decomposition)
● Non-bonded forces: a double loop over all atom pairs?
  → too expensive: limit the interaction range (cut-off)
Pair interactions: cut-off
● LJ decays fast (~1/r^6): can use a cut-off
● Coulomb decays slowly (~1/r): a cut-off is not good enough
  → treat the long-range part separately: Particle Mesh Ewald (PME)
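For reference, the standard forms of the two pair potentials compared above (textbook notation, assumed here rather than quoted from the GROMACS manual):

```latex
\begin{align*}
V_{\mathrm{LJ}}(r) &= 4\varepsilon\left[\left(\frac{\sigma}{r}\right)^{12}
  - \left(\frac{\sigma}{r}\right)^{6}\right], &
V_{\mathrm{Coul}}(r) &= \frac{1}{4\pi\varepsilon_0}\,\frac{q_i q_j}{r}
\end{align*}
```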
GROMACS: fast, flexible, free
● Developers: Stockholm & Uppsala, SE, and many more worldwide
● Open source: LGPLv2
● Open development: https://gerrit.gromacs.org
● Large user base:
  – 10k's academic & industry
  – 100k's through F@H
● Supports all major force fields
[Figure: parallel constraints, arbitrary unit cells, virtual interaction sites, eighth-shell domain decomposition; triclinic unit cell with load balancing and staggered cell boundaries]
GROMACS: fast, flexible, free
● Code: portability is of great importance
  – C++98 (subset)
  – CMake
● Pretty large: ~2 million LOC, ½ of which is SIMD!
● Bottom-up performance tuning
  → absolute performance is what matters (to users)
[Figure: same illustrations as on the previous slide]
Costs in MD
● Every step: 10^6-10^8 flops
● Every simulation: 10^6-10^8 steps
● What are the flops spent on?
[Figure: flop breakdown by kernel — pair search (distance check); non-bonded (LJ + Coulomb tabulated, F and F+E; Coulomb tabulated, F and F+E; 1,4 non-bonded interactions); PME (calc weights, spread Q B-spline, gather F B-spline, 3D-FFT); bonded (angles, propers); Settle]
Molecular dynamics step
● MD iteration = step
● Pair-search step every 10-50 iterations
● One step: pair search (when needed) → bonded F → non-bonded F → PME → integration → constraints
● ~ milliseconds or less per step
● Goal: do it as fast as possible!
Heterogeneous accelerated GROMACS
● 2nd-gen GPU acceleration: offload, since GROMACS v4.6
● Advantages:
  – 2-4x speedup
  – offload → multi-GPU “for free”
  – wide feature support
● Challenges:
  – added latencies, overheads
  – load balancing
[Figure: “regular” MD step with pair search/domain decomposition every 10-50 iterations — pair search & DD, bonded F, non-bonded F (offloaded), PME, integration, constraints; 100s of microseconds per step at peak!]
Heterogeneous accelerated MD
● 2nd-gen GPU acceleration: since GROMACS v4.6
● Advantages:
  – 3-4x speedup
  – offload → multi-GPU “for free”
  – wide feature support
● Challenges:
  – added latency, re-casting work for GPUs
  – load balancing: intra-GPU, intra-node, ...
[Figure: per-step timeline, pair search every 10-50 iterations — CPU (OpenMP threads): launch GPU, pair search, bonded F, PME, wait for GPU, integration, constraints; GPU (CUDA): H2D pair-list, H2D x,q, non-bonded F & pair-list pruning, clear F, D2H F, otherwise idle; average CPU-GPU overlap: 60-80% per step]
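A hypothetical sketch of this kind of offload loop in plain CUDA: the non-bonded kernel and transfers run asynchronously in a stream while the CPU computes bonded forces and PME. All names (nbnxn_kernel, do_bonded_cpu, the buffers) are illustrative placeholders, not the actual GROMACS API, and buffers are assumed to be allocated elsewhere.

```cuda
#include <cuda_runtime.h>

// Offload sketch: GPU non-bonded work overlaps with CPU bonded + PME work.
cudaStream_t nb_stream;
cudaStreamCreate(&nb_stream);

for (int step = 0; step < nsteps; ++step)
{
    // 1. Ship coordinates/charges and launch the non-bonded kernel early
    //    (the pair list is assumed to already be on the GPU from the last
    //    search step).
    cudaMemcpyAsync(d_xq, h_xq, natoms * sizeof(float4),
                    cudaMemcpyHostToDevice, nb_stream);
    nbnxn_kernel<<<nblocks, 128, 0, nb_stream>>>(d_xq, d_pairlist, d_f);
    cudaMemcpyAsync(h_f_nb, d_f, natoms * sizeof(float3),
                    cudaMemcpyDeviceToHost, nb_stream);

    // 2. CPU works on bonded forces and PME while the GPU is busy.
    do_bonded_cpu(h_xq, h_f_cpu);
    do_pme_cpu(h_xq, h_f_cpu);

    // 3. Wait for the GPU forces, reduce, then integrate and constrain.
    cudaStreamSynchronize(nb_stream);
    reduce_forces(h_f_cpu, h_f_nb);
    integrate_and_constrain(h_xq, h_f_cpu);
}
cudaStreamDestroy(nb_stream);
```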
Parallel heterogeneous MD
● Pair-search/domain-decomposition step every 10-50 iterations; otherwise “regular” MD steps
[Figure: per-step timeline with MPI — CPU (OpenMP threads): local pair search, non-local pair search, bonded F, PME, MPI receive non-local x, wait for non-local F, MPI send non-local F, wait for local F, integration, constraints; GPU (CUDA), local stream: H2D local pair-list, H2D local x,q, local list pruning, local non-bonded F, clear F, D2H local F; non-local stream: H2D non-local pair-list, H2D non-local x,q, non-local list pruning, non-local non-bonded F, D2H non-local F]
Intra-node: The accelerator
SIMD/SIMT-targeted algorithms
● Cluster pair interaction algorithm:
  – lends itself well to efficient fine-grained parallelization
  – adaptable to the characteristics of the architecture
Particle cluster algorithm: SIMD implementation
● Cluster size and grouping are the “knobs” to adjust for a specific architecture:
  – data reuse
  – arithmetic intensity
  – cache efficiency
[Figure: classical 1x1 neighborlist on 4-way SIMD (traditional algorithm: cache pressure, ill data reuse); 4x4 setup on 4-way SIMD (cluster algorithm for SSE4, VMX, 128-bit AVX: cache friendly, 4-way j-reuse, in-register shuffle); 4x4 setup on SIMT (cluster algorithm for fine-grained hardware threading)]
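A simplified illustration of how a 4x4 cluster pair can map onto SIMT threads: a 4x4 tile of threads computes all 16 atom-pair interactions between one i-cluster and one j-cluster. This is a sketch of the idea only, with a toy Coulomb interaction and atomic force accumulation; the data layout, cut-off handling, and reductions of the real GROMACS nbnxn kernels are far more elaborate.

```cuda
// launch: cluster_pair_sketch<<<numClusterPairs, dim3(4, 4)>>>(...)
__global__ void cluster_pair_sketch(const float4 *xq,            // x, y, z, charge
                                    const int2   *cluster_pairs, // (i-cluster, j-cluster)
                                    float3       *f,             // forces
                                    float         rc2)           // cut-off squared
{
    const int pair = blockIdx.x;   // one cluster pair per block (sketch only)
    const int ti   = threadIdx.x;  // 0..3: i-atom within the i-cluster
    const int tj   = threadIdx.y;  // 0..3: j-atom within the j-cluster

    const int ia = cluster_pairs[pair].x * 4 + ti;
    const int ja = cluster_pairs[pair].y * 4 + tj;

    const float4 xi = xq[ia];
    const float4 xj = xq[ja];

    const float dx = xi.x - xj.x, dy = xi.y - xj.y, dz = xi.z - xj.z;
    const float r2 = dx * dx + dy * dy + dz * dz;

    if (r2 < rc2 && r2 > 0.0f)
    {
        // Toy interaction: plain Coulomb, force scale = q_i*q_j / r^3.
        const float rinv  = rsqrtf(r2);
        const float fscal = xi.w * xj.w * rinv * rinv * rinv;

        // Atomic accumulation of the i-forces stands in for the in-kernel
        // shuffle reductions used in the real implementation; j-forces
        // (Newton's third law) are omitted for brevity.
        atomicAdd(&f[ia].x, fscal * dx);
        atomicAdd(&f[ia].y, fscal * dy);
        atomicAdd(&f[ia].z, fscal * dz);
    }
}
```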
Tuning kernels: feeding the GPU
● Avoid load imbalance:
  – create enough independent work units (= blocks) per SM[X|M]
  – sort them
● Workload regularization improves performance by up to:
  – 2-3x on “narrower” GPUs
  – 3-5x on “wider” GPUs
● (Re-)tuning is needed for new architectures
● Tradeoffs:
  – lowers j-particle data reuse
  – atomic clashes
[Figure: raw vs. reshaped pair-list size distributions (raw lists: too few blocks, imbalanced execution; regularized lists: balanced SMX execution) and per-SMX workload in KCycles on a Tesla K20c, 1500 atoms (regularized: 4x faster execution)]
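A hedged, host-side illustration of the list-regularization idea: split long pair lists into bounded chunks and sort the chunks by size so every SM gets similarly sized blocks of work. The PairList type and the splitting/sorting policy are invented for this sketch; the actual nbnxn list reshaping works differently in detail.

```cuda
#include <algorithm>
#include <cstddef>
#include <vector>

// Toy pair-list entry: an i-cluster and the indices of its j-clusters.
struct PairList
{
    int              iCluster;
    std::vector<int> jClusters;
};

// Split lists into chunks of at most maxLen j-clusters and sort by
// decreasing length, so each GPU block gets a similar amount of work and
// the longest lists are scheduled first.
static std::vector<PairList> regularize(const std::vector<PairList> &raw,
                                        std::size_t                  maxLen)
{
    std::vector<PairList> out;
    for (const PairList &pl : raw)
    {
        for (std::size_t start = 0; start < pl.jClusters.size(); start += maxLen)
        {
            const std::size_t end = std::min(start + maxLen, pl.jClusters.size());
            PairList chunk;
            chunk.iCluster = pl.iCluster;
            chunk.jClusters.assign(pl.jClusters.begin() + start,
                                   pl.jClusters.begin() + end);
            out.push_back(chunk);
        }
    }
    std::sort(out.begin(), out.end(),
              [](const PairList &a, const PairList &b)
              { return a.jClusters.size() > b.jClusters.size(); });
    return out;
}
```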
Tuning kernels: ready for automation
[Figure: Tesla K20c force kernel tuning — kernel time (ms) vs. list-split setting scanned from 0 → 1000, for systems of 960, 1.5k, 3k, 6k, and 12k atoms]
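A hedged sketch of how such a parameter scan could be automated with CUDA event timing. run_with_split() is a placeholder that rebuilds the pair list with the given split factor and launches the non-bonded kernel; it is not an actual GROMACS function.

```cuda
#include <cuda_runtime.h>

// Scan a tunable parameter (here: the pair-list split factor) and keep the
// value that gives the shortest measured kernel time.
float best_time  = 1e30f;
int   best_split = 1;

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

for (int split = 1; split <= 1000; split *= 2)
{
    cudaEventRecord(start);
    run_with_split(split);              // rebuild list + launch kernel (placeholder)
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    if (ms < best_time)
    {
        best_time  = ms;
        best_split = split;
    }
}
cudaEventDestroy(start);
cudaEventDestroy(stop);
```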
Tuning kernels: still getting faster
● Up to 2x faster wrt the first version
● We keep finding ways to improve performance; most recently:
  – better inter-SM load balancing
  – more consistent list lengths
  – concurrent atomic operations
[Figure: Tesla C2070 CUDA non-bonded force kernel, PME, rc=1.0 nm, nstlist=20 — step time per 1000 atoms (ms) vs. system size (0.96k-768k atoms) for successive kernel versions: initial target, first alpha, second alpha, pair-list tweaks, Kepler back-port, improved sorting, CUDA 5.5 + buffer improvement, GMX 5.0, reduction tweak]
Tuning kernels: still getting faster
● Up to 2x faster wrt the first version
● We keep finding ways to improve performance; most recently:
  – better inter-SM load balancing
  – more consistent list lengths
  – concurrent atomic operations
● But NVIDIA does too!
[Figure: same plot as on the previous slide, with a GTX TITAN X curve added]
Adapting the cluster algorithm to the GK210
● Doubled register file size
  ⇒ can fit 2x threads/block: 1 i-cluster vs. 2 j-clusters per i-loop pass
[Figure: particle cluster processing on GF, GK1xx, GM (one j-cluster per i-loop pass) vs. on GK210 (two j-clusters per i-loop pass)]
K80 kernel performance
● Pre-GK210: 64 threads/block, 16 blocks per SM
  – occupancy: max 50%, achieved ~49.5%
● Extra registers allow 128 threads/block, still 16 blocks per SM
  – occupancy: max 100%, achieved ~92%

Kernel time by GPU and #threads x #blocks (ms):
  K80  64x16   3.95
  K80  128x16  2.90
  K80  128x15  2.91
  K80  128x14  3.08
  K80  256x8   3.06
  K40  64x16   3.41
  K40  128x8   3.57
  K40  256x4   5.50
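One common way such occupancy targets are expressed in CUDA is via __launch_bounds__; a hedged sketch follows (the kernel name and empty body are placeholders, not the real GROMACS kernel):

```cuda
// Ask the compiler to cap register use so that 128-thread blocks can reach
// 16 resident blocks per SM on GK210 (128 x 16 = 2048 threads, i.e. 100%
// theoretical occupancy). On pre-GK210 Kepler the smaller register file
// forces a lower thread count for the same per-thread register budget.
__global__ void __launch_bounds__(128, 16)
nbnxn_kernel_sketch(/* pair lists, coordinates, forces, ... */)
{
    // placeholder body; per the slide before, the real kernel processes one
    // i-cluster against two j-clusters per pass to use the extra registers
}
```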