Speed Without Compromise: Precision and Methodology Innovation in the AMBER GPU MD Software

Speed Without Compromise: Precision and Methodology Innovation in the AMBER GPU MD Software. Ross Walker, Associate Professor and NVIDIA CUDA Fellow, San Diego Supercomputer Center, UC San Diego Department of Chemistry & Biochemistry.

  1. Speed Without Compromise: Precision and Methodology Innovation in the AMBER GPU MD Software
     Ross Walker, Associate Professor and NVIDIA CUDA Fellow, San Diego Supercomputer Center, UC San Diego Department of Chemistry & Biochemistry
     1 SAN DIEGO SUPERCOMPUTER CENTER

  2. Molecular Dynamics for the 99%
     • Develop a GPU accelerated version of AMBER’s PMEMD.
     San Diego Supercomputer Center: Ross C. Walker
     NVIDIA: Scott Le Grand
     Taking MD to 11
     Partly funded under NSF SI2-SSE Program

  3. Project Info
     • AMBER Website: http://ambermd.org/gpus/
     Publications
     1. Salomon-Ferrer, R.; Goetz, A.W.; Poole, D.; Le Grand, S.; Walker, R.C.* "Routine microsecond molecular dynamics simulations with AMBER - Part II: Particle Mesh Ewald", J. Chem. Theory Comput., 2013, 9 (9), pp 3878-3888, DOI: 10.1021/ct400314y
     2. Goetz, A.W.; Williamson, M.J.; Xu, D.; Poole, D.; Le Grand, S.; Walker, R.C. "Routine microsecond molecular dynamics simulations with AMBER - Part I: Generalized Born", J. Chem. Theory Comput., 2012, 8 (5), pp 1542-1555, DOI: 10.1021/ct200909j
     3. Pierce, L.C.T.; Salomon-Ferrer, R.; de Oliveira, C.A.F.; McCammon, J.A.; Walker, R.C. "Routine access to millisecond timescale events with accelerated molecular dynamics", J. Chem. Theory Comput., 2012, 8 (9), pp 2997-3002, DOI: 10.1021/ct300284c
     4. Salomon-Ferrer, R.; Case, D.A.; Walker, R.C. "An overview of the Amber biomolecular simulation package", WIREs Comput. Mol. Sci., 2012, in press, DOI: 10.1002/wcms.1121
     5. Le Grand, S.; Goetz, A.W.; Walker, R.C. "SPFP: Speed without compromise - a mixed precision model for GPU accelerated molecular dynamics simulations", Comput. Phys. Commun., 2013, 184, pp 374-380, DOI: 10.1016/j.cpc.2012.09.022

  4. Design Goals
     Overriding Design Goal: Sampling for the 99%
     • Focus on ~< 4 million atoms.
     • Maximize single workstation performance.
     • Focus on minimizing costs.
     • Be able to use very cheap nodes.
     • Both gaming and Tesla cards.
     • Ease of use (same input, same output).
     [Figure labels: The <0.0001%, The 1.0%, The 99.0%]

  5. Simplicity - Appliances for the 99%

  6. AMBER Server (ca. 2013) $8999

  7. Digits Dev Box (ca. 2015) $15,000

  8. http://exxactcorp.com/index.php/solution/solu_detail/225

  9. DGX-99 (Deep Learning for the 99%)
     http://exxactcorp.com/index.php/solution/solu_detail/252
     20 x Titan-X = 133 TFLOPS FP32 in 1 node.
     DGX-1 = 85 TFLOPS FP32 in 1 node.


  11. Map problem onto GPU hardware
      Example: Nonbonded forces
      • Subdivide force matrix into 3 classes of independent tiles: off-diagonal, on-diagonal, and redundant.
      • Map non-redundant tiles to warps.
      • SMs consume tiles.
      • Avoid race conditions by dividing the calculation in both space (tiles) and time (warps).
      [Figure: atom i by atom j force matrix divided into tiles; SM 0 to SM m, each with shared memory, registers, and warps 0 to n.]
      Patent: US 8473948 B1
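
The tile classification on this slide can be sketched in a few lines of Python. This is only an illustration and not the PMEMD kernel: the tile width of 32 atoms (one warp per tile) and the function name `classify_tiles` are assumptions, and the sketch ignores cutoffs and neighbor lists. It relies on the pairwise force matrix being antisymmetric, so tiles below the diagonal repeat work already covered above it.

```python
def classify_tiles(n_atoms, tile=32):
    """Label each tile of the n_atoms x n_atoms pair matrix.

    tile=32 (one warp per tile) is an assumed value for illustration.
    """
    n_tiles = (n_atoms + tile - 1) // tile  # ceiling division
    classes = {"on-diagonal": [], "off-diagonal": [], "redundant": []}
    for ti in range(n_tiles):          # tile row (block of atoms i)
        for tj in range(n_tiles):      # tile column (block of atoms j)
            if ti == tj:
                classes["on-diagonal"].append((ti, tj))
            elif tj > ti:
                classes["off-diagonal"].append((ti, tj))
            else:
                # F_ji = -F_ij, so this tile mirrors tile (tj, ti)
                classes["redundant"].append((ti, tj))
    return classes


tiles = classify_tiles(n_atoms=23558)                 # DHFR-sized system
work = tiles["on-diagonal"] + tiles["off-diagonal"]   # tiles handed to warps
print(len(work), "non-redundant tiles out of",
      sum(len(v) for v in tiles.values()))
```

For T tiles per side, only T(T+1)/2 tiles need to be computed, which is roughly half of the full matrix.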

  12. Version History
      • AMBER 10 – Released Apr 2008
          • Implicit Solvent GB GPU support released as patch Sept 2009.
      • AMBER 11 – Released Apr 2010
          • Implicit and Explicit solvent supported internally on single GPU.
          • Oct 2010 – Bugfix.9 doubled performance on single GPU, added multi-GPU support.
      • AMBER 12 – Released Apr 2012
          • Added Umbrella Sampling Support, REMD, Simulated Annealing, aMD, IPS and Extra Points.
          • Aug 2012 – Bugfix.9 new SPFP precision model, support for Kepler I, GPU accelerated NMR restraints, improved performance.
          • Jan 2013 – Bugfix.14 support for CUDA 5.0, Jarzynski on GPU, GBSA, Kepler II support.

  13. Version History
      • AMBER 14 – Released Apr 2014
          • ~20-30% performance improvement for single GPU runs.
          • Peer to peer support for multi-GPU runs providing enhanced multi-GPU scaling.
          • Hybrid bitwise reproducible fixed point precision model as standard (SPFP).
          • Support for Extra Points in Multi-GPU runs.
          • Jarzynski Sampling.
          • GBSA support.
          • Support for off-diagonal modifications to VDW parameters.
          • Multi-dimensional Replica Exchange (Temperature and Hamiltonian).
          • Support for CUDA 5.0, 5.5 and 6.0.
          • Support for latest generation GPUs.
          • Monte Carlo barostat support providing NPT performance equivalent to NVT.
          • ScaledMD support.
          • Improved accelerated MD (aMD) support.
          • Explicit solvent constant pH support.
          • NMR restraint support on multiple GPUs.
          • Improved error messages and checking.
          • Hydrogen mass repartitioning support (4fs time steps).

  14. AMBER 16 (GPU), Coming Apr 2016
      • Enhanced NMR Restraints.
      • R^6 restraint averaging.
      • Gaussian Accelerated Molecular Dynamics.
      • Optimized binary IO support (mdcrd and restrt).
      • External electric field support.
      • Expanded Umbrella Sampling.
      • Maxwell specific optimizations.
      • Another 20 to 30% performance improvement!
      • New SPXP precision model for Maxwell and future hardware.
      [Image: Amber 2016 Reference Manual (Covers Amber16 and AmberTools16)]

  15. A Question of Dynamic Range
      32-bit floating point has approximately 7 significant figures.

         1.456702             1456702.0000000
        +0.3046714           +      0.3046714
        -----------          -----------------
         1.761373             1456702.0000000
        -1.456702            -1456702.0000000
        -----------          -----------------
         0.3046710                  0.0000000
        Lost a sig fig       Lost everything.

      When it happens: PBC, SHAKE, and Force Accumulation.
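
The arithmetic on this slide can be reproduced with a small Python sketch. It makes two assumptions that are not on the slide: "about 7 significant figures" is modeled with a 7-digit decimal context from the standard `decimal` module, and real binary single precision is emulated by rounding through `struct`. The helper names `add_then_subtract` and `f32` are made up for the illustration.

```python
import struct
from decimal import Context, Decimal

seven_digits = Context(prec=7)  # decimal model of "about 7 significant figures"


def add_then_subtract(big, small, ctx=seven_digits):
    """Compute (big + small) - big, rounding to 7 digits after each step."""
    total = ctx.add(Decimal(big), Decimal(small))
    return ctx.subtract(total, Decimal(big))


def f32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


similar = add_then_subtract("1.456702", "0.3046714")   # comparable magnitudes
disparate = add_then_subtract("1456702", "0.3046714")  # magnitudes far apart

# The disparate case again, in emulated binary float32 arithmetic.
binary = f32(f32(f32(1456702.0) + f32(0.3046714)) - f32(1456702.0))

print(similar, disparate, binary)
```

Under these assumptions the first case recovers 0.304671 and the second recovers 0, matching the slide. The binary float32 version recovers 0.25 rather than 0, because single-precision values near 1.4 million are spaced 0.125 apart, but the small term is still badly damaged.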

  16. Precision Models
      SPSP – Use single precision for the entire calculation with the exception of SHAKE, which is always done in double precision.
      SPDP – Use a combination of single precision for calculation and double precision for accumulation (default < AMBER 12.9).
      DPDP – Use double precision for the entire calculation.
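
The difference between a single-precision and a double-precision accumulator can be illustrated with the sketch below. It is not AMBER code: the repeated value 0.1 stands in for hypothetical force contributions, and float32 is emulated by rounding through `struct`. In both cases each term is first rounded to single precision; only the running total differs.

```python
import struct


def f32(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def accumulate(terms, single_precision_sum):
    """Sum terms, optionally rounding the running total to float32 each step."""
    total = 0.0
    for t in terms:
        total += t
        if single_precision_sum:
            total = f32(total)  # SPSP-style: the accumulator is float32 too
    return total


# Each contribution is held in single precision in both cases.
terms = [f32(0.1)] * 100_000    # hypothetical, identical contributions
reference = terms[0] * len(terms)

spsp_error = abs(accumulate(terms, single_precision_sum=True) - reference)
spdp_error = abs(accumulate(terms, single_precision_sum=False) - reference)
print(f"float32 accumulator error: {spsp_error:.3e}")
print(f"float64 accumulator error: {spdp_error:.3e}")
```

The float32 accumulator drifts once the running total is much larger than each term, which is the force accumulation case named on the dynamic range slide.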

  17. Validation and Precision Testing
      • Measure a combination of elements that depend on both static energies / forces and ensemble averages:
          • Energy conservation.
          • Optimized structures.
          • Free energy surfaces.
          • Order parameters.
          • RMSF.
          • Radial distribution functions, etc.
      • Two aims:
          • Is our implementation valid/correct?
          • What level of approximation with precision is acceptable?




  21. Explicit Solvent Performance 

  22. But then… GTX680 and K10 Ruined the Party.
      DP performance REALLY sucked.
      4 month delay in usefulness while we developed and tested a new precision model.

  23. SPFP
      • Single / Double / Fixed precision hybrid.
      • Designed for optimum performance on Kepler I.
      • Uses fire and forget atomic ops.
      • Fully deterministic, faster and more precise than SPDP, minimal memory overhead (default >= AMBER 12.9).
      • Q24.40 for Forces, Q34.30 for Energies / Virials.
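
Why fixed-point accumulation is deterministic can be sketched as follows. This is an illustration of the general technique and not the SPFP implementation: Python integers stand in for 64-bit atomic adds and never overflow, whereas a real Q24.40 value only covers about ±2^23. The function names, the random seed, and the force values are all made up.

```python
import random

FORCE_FRACTION_BITS = 40            # Q24.40: 24 integer bits, 40 fractional
FORCE_SCALE = 1 << FORCE_FRACTION_BITS


def to_fixed(x, scale=FORCE_SCALE):
    """Convert a float to a scaled integer (one rounding, done once per term)."""
    return round(x * scale)


def fixed_sum(terms, scale=FORCE_SCALE):
    """Accumulate with integer adds, which give the same total in any order."""
    total = 0
    for t in terms:
        total += to_fixed(t, scale)   # stands in for a 64-bit atomic add
    return total / scale


rng = random.Random(2013)             # arbitrary seed for the illustration
forces = [rng.uniform(-50.0, 50.0) for _ in range(1000)]  # made-up values

shuffled = forces[:]
rng.shuffle(shuffled)

print("float sums equal:      ", sum(forces) == sum(shuffled))
print("fixed-point sums equal:", fixed_sum(forces) == fixed_sum(shuffled))
```

Floating-point addition is not associative, so the order in which threads deliver their contributions can change the last bits of a float total. Integer addition is associative, so the fixed-point total is identical for every ordering.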

  24. Reproducibility: Critical for Debugging Software and Hardware
      • SPFP precision model is bitwise reproducible.
      • Same simulation from same random seed = same result.
      • Is used to validate hardware (misbehaving GPUs) (Exxact AMBER Certified Machines).
      • Based on this, successfully identified 3 GPU models with underlying hardware issues that needed post-release fixes (GTX-Titan, GTX-780TI, GTX-Titan-Black).

  25. Reproducibility
      Final Energy after 10^6 MD steps (~45 mins per run)

      Run     Good GPU               Bad GPU
      0.0     Etot = -58229.3134     Etot = -58229.3134
      0.1     Etot = -58229.3134     Etot = -58227.1072
      0.2     Etot = -58229.3134     Etot = -58229.3134
      0.3     Etot = -58229.3134     Etot = -58218.9033
      0.4     Etot = -58229.3134     Etot = -58217.2088
      0.5     Etot = -58229.3134     Etot = -58229.3134
      0.6     Etot = -58229.3134     Etot = -58228.3001
      0.7     Etot = -58229.3134     Etot = -58229.3134
      0.8     Etot = -58229.3134     Etot = -58229.3134
      0.9     Etot = -58229.3134     Etot = -58231.6743
      0.10    Etot = -58229.3134     Etot = -58229.3134
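
The hardware check implied by this slide amounts to repeating one simulation and requiring identical final energies. A minimal sketch, assuming the energies have already been extracted from the output files as text; the function name `is_reproducible` is hypothetical and this is not the actual validation tool.

```python
def is_reproducible(final_energies):
    """Return True when every repeated run ended with the identical energy."""
    return len(set(final_energies)) == 1


# Final Etot values copied from the slide above (runs 0.0 through 0.10),
# kept as printed text so the comparison is exact.
good_gpu = ["-58229.3134"] * 11
bad_gpu = [
    "-58229.3134", "-58227.1072", "-58229.3134", "-58218.9033",
    "-58217.2088", "-58229.3134", "-58228.3001", "-58229.3134",
    "-58229.3134", "-58231.6743", "-58229.3134",
]

for name, runs in (("good GPU", good_gpu), ("bad GPU", bad_gpu)):
    verdict = "pass" if is_reproducible(runs) else "FAIL: runs diverged"
    print(f"{name}: {verdict}")
```

A test like this only makes sense with a deterministic precision model; with order-dependent floating-point accumulation, small run-to-run differences would be expected even on healthy hardware.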

  26. Worked Great Until Maxwell
      DHFR (NVE) HMR 4fs, 23,558 Atoms

      Configuration                        Performance (ns/day)
      2x K80 boards (4 GPUs)               423.69
      1x K80 board (2 GPUs)                334.05
      1/2x K80 board (1 GPU)               229.29
      4X K40                               489.68
      2X K40                               364.67
      1X K40                               266.07
      2X K20                               263.85
      1X K20                               196.99
      1X K8                                116.09
      GTX-Titan-Z (2 GPU, full board)      356.48
      GTX-Titan-Z (1 GPU, 1/2 board)       261.82
      2X GTX Titan Black                   383.32
      1X GTX Titan Black                   280.54
      1X GTX 980                           262.39
      1X GTX 780                           251.43
      2X C2075                             129.79
      1X C2075                             81.26
      2xE5-2660v2 CPU (16 Cores)           30.21

  27. Titan-X Helps (But only through brute force)

