Molecular Dynamics (MD) on GPUs


May 5, 2016. Accelerating Discoveries: using a supercomputer powered by the Tesla Platform with over 3,000 Tesla accelerators, University of Illinois scientists performed the first all-atom simulation of the HIV capsid.


  1. FactorIX on M40s (PME - FactorIX_NPT)
     Running AMBER version 14, a PME (particle mesh Ewald) benchmark. The blue node contains a single Intel Xeon E5-2698 v3 @ 2.30 GHz (Haswell) CPU; the green nodes contain a single Intel Xeon E5-2697 v2 @ 2.70 GHz (Ivy Bridge) CPU plus Tesla M40 (autoboost) GPUs.
     Simulated time (ns/day):
       1 node (CPU only):    5.38
       1 node + 1x M40:     46.90  (8.7X)
       1 node + 2x M40:     67.37  (12.5X)
       1 node + 4x M40:     72.96  (13.6X)
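The speedup labels on these charts are simply the ratio of throughputs, since ns/day is a higher-is-better metric. A minimal sketch of that arithmetic, using the FactorIX_NPT values above:

```python
# Speedup derivation for the FactorIX_NPT chart (values from the slide).
baseline = 5.38  # ns/day on the CPU-only node
for gpus, ns_per_day in [(1, 46.90), (2, 67.37), (4, 72.96)]:
    print(f"{gpus}x M40: {ns_per_day / baseline:.1f}X")  # 8.7X, 12.5X, 13.6X
```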

  2. FactorIX on M40s (PME - FactorIX_NVE)
     Running AMBER version 14; same node configurations as slide 1.
     Simulated time (ns/day):
       1 node (CPU only):    5.47
       1 node + 1x M40:     49.33  (9.0X)
       1 node + 2x M40:     73.00  (13.3X)
       1 node + 4x M40:     80.04  (14.6X)

  3. JAC on M40s (PME - JAC_NPT)
     Running AMBER version 14; same node configurations as slide 1.
     Simulated time (ns/day):
       1 node (CPU only):   20.88
       1 node + 1x M40:    149.40  (7.2X)
       1 node + 2x M40:    211.97  (10.2X)
       1 node + 4x M40:    226.63  (10.9X)

  4. JAC on M40s (PME - JAC_NVE)
     Running AMBER version 14; same node configurations as slide 1.
     Simulated time (ns/day):
       1 node (CPU only):   21.11
       1 node + 1x M40:    157.68  (7.5X)
       1 node + 2x M40:    230.18  (10.9X)
       1 node + 4x M40:    246.15  (11.7X)

  5. Myoglobin on M40s (GB - Myoglobin)
     Running AMBER version 14, a GB (generalized Born, implicit solvent) benchmark; same node configurations as slide 1.
     Simulated time (ns/day):
       1 node (CPU only):    9.83
       1 node + 1x M40:    232.20  (23.6X)
       1 node + 2x M40:    300.86  (30.6X)
       1 node + 4x M40:    322.09  (32.8X)

  6. Nucleosome on M40s (GB - Nucleosome)
     Running AMBER version 14; same node configurations as slide 1.
     Simulated time (ns/day):
       1 node (CPU only):    0.13
       1 node + 1x M40:      4.67  (35.9X)
       1 node + 2x M40:      9.05  (69.6X)
       1 node + 4x M40:     16.11  (123.9X)

  7. TrpCage on M40s (GB - TrpCage)
     Running AMBER version 14; same node configurations as slide 1.
     Simulated time (ns/day):
       1 node (CPU only):  408.88
       1 node + 1x M40:    464.63  (1.1X)
       1 node + 2x M40:    551.36  (1.3X)
       1 node + 4x M40:    831.91  (2.03X)

  8. JAC on K40s and K80s
     AMBER 14, PME-JAC_NVE on Tesla K40s and K80s with Ivy Bridge CPUs (1 node; simulated time in ns/day). Each configuration uses dual Xeon E5-2697 v2 @ 2.70 GHz CPUs; K40s run at 875 MHz, K80s with autoboost. "0.5x K80" means one of the K80 board's two GPUs.
       CPU only:      24.94
       + 0.5x K80:   116.67
       + 1x K40:     131.03
       + 1x K80:     167.81
       + 2x K40:     195.39
       + 2x K80:     205.30
       + 4x K40:     218.62

  9. FactorIX on K40s and K80s
     AMBER 14, PME-FactorIX_NVE, same node layout as slide 8; K40s at 875 MHz, K80s at 562 MHz. Simulated time (ns/day):
       CPU only:      6.61
       + 0.5x K80:   31.52
       + 1x K40:     36.31
       + 1x K80:     46.65
       + 2x K40:     53.28
       + 2x K80:     53.90
       + 4x K40:     59.45

  10. Cellulose on K40s and K80s
     AMBER 14, PME-Cellulose_NVE, same node layout as slide 8; K40s at 875 MHz, K80s at 562 MHz. Simulated time (ns/day):
       CPU only:      1.35
       + 0.5x K80:    7.36
       + 1x K40:      8.45
       + 1x K80:     10.96
       + 2x K40:     12.49
       + 2x K80:     13.48
       + 4x K40:     14.68

  11. Kepler - Our Fastest Family of GPUs Yet (AMBER 14, SPFP-DHFR_production_NVE)
     The chart compares one- and two-socket Xeon E5-2697 v2 @ 2.70 GHz nodes (12 cores per CPU): CPU only, with an Intel Phi 5110p or 7120p in offload mode, or with 1x or 2x NVIDIA Tesla K20X, K40 @ 875 MHz, or K80.
     DHFR (JAC) results (ns/day): CPU only reaches 14.54 (one socket) and 25.80 (two sockets); Phi offload underperforms at 4.08 and 3.82; the GPU-accelerated configurations range from roughly 111 to 197 ns/day (topping out at 196.69 and 196.86 for the fastest 2-GPU setups).

  12. Kepler - Our Fastest Family of GPUs Yet (AMBER 14, SPFP-Factor_IX_Production_NVE)
     Same hardware matrix as slide 11.
     Factor IX results (ns/day): CPU only reaches 3.70 (one socket) and 6.87 (two sockets); Phi offload gives 3.29 and 3.35; the GPU-accelerated configurations range from roughly 32 to 58 ns/day (best results 57.89 and 57.83).

  13. Kepler - Our Fastest Family of GPUs Yet (AMBER 14, SPFP-Cellulose_Production_NVE)
     Same hardware matrix as slide 11.
     Cellulose results (ns/day): CPU only reaches 0.74 (one socket) and 1.38 (two sockets); Phi offload gives 1.50 and 1.56; the GPU-accelerated configurations range from 7.60 to 13.29 ns/day.

  14. Cost Comparison
     Scenario: 4 simultaneous simulations, 23,000 atoms, 250 ns each, 5 days maximum time to solution.
                                     Traditional Cluster          GPU Workstation
       Nodes required                12                           1 (4 GPUs)
       Interconnect                  QDR InfiniBand               None
       Time to complete              4.98 days                    2.25 days
       Power consumption             5.7 kW (681.3 kWh)           1.0 kW (54.0 kWh)
       System cost (per day)         $96,800 ($88.40)             $5,200 ($4.75)
       Simulation cost               (681.3 * 0.18) +             (54.0 * 0.18) +
                                     (88.40 * 4.98) = $562.87     (4.75 * 2.25) = $20.41
     More than 25x cheaper, and the solution is obtained in less than half the time. (Source: San Diego Supercomputer Center)
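The arithmetic in the last row implies an electricity rate of $0.18/kWh. A short sketch reproducing the slide's cost model (all inputs are from the table; the function is our paraphrase of it):

```python
# Total cost = electricity used + system cost amortized over the run.
def simulation_cost(energy_kwh, system_cost_per_day, days, rate=0.18):
    return energy_kwh * rate + system_cost_per_day * days

cluster = simulation_cost(681.3, 88.40, 4.98)    # ~$562.87
workstation = simulation_cost(54.0, 4.75, 2.25)  # ~$20.41
print(f"${cluster:.2f} vs ${workstation:.2f}: {cluster / workstation:.0f}x cheaper")
```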

  15. Replace 8 Nodes with 1 K20 GPU
     Running AMBER 12 GPU Support Revision 12.1, SPFP with CUDA 4.2.9, ECC off.
     The eight (8) blue nodes each contain 2x Intel E5-2687W CPUs (8 cores per CPU); the green node contains 2x Intel E5-2687W CPUs plus 1x NVIDIA K20 GPU.
     DHFR: the 8-node CPU cluster reaches 65.00 ns/day at a cost of $32,000; the single K20 node reaches 81.09 ns/day at $6,500.
     Note: typical CPU and GPU node pricing used; pricing may vary depending on node configuration. Contact your preferred HW vendor for actual pricing.
     Cut simulation costs down to a quarter and gain higher performance.

  16. Replace 7 Nodes with 1 K10 GPU (Cost Performance on JAC NVE)
     Running AMBER 12 GPU Support Revision 12.1, SPFP with CUDA 4.2.9, ECC off.
     The eight (8) blue nodes each contain 2x Intel E5-2687W CPUs (8 cores per CPU); the green node contains 2x Intel E5-2687W CPUs plus 1x NVIDIA K10 GPU.
     JAC (DHFR) NVE: the CPU-only cluster costs $32,000; the single GPU-enabled node costs $7,000.
     Note: typical CPU and GPU node pricing used; pricing may vary depending on node configuration. Contact your preferred HW vendor for actual pricing.
     Cut simulation costs down to a quarter and increase performance by 70%.

  17. Extra CPUs Decrease Performance (Cellulose NVE)
     Running AMBER 12 GPU Support Revision 12.1.
     The orange bars use one E5-2687W CPU (8 cores); the blue bars use dual E5-2687W CPUs. Each is measured CPU-only and with dual K20 GPUs (ns/day).
     When used with GPUs, dual CPU sockets perform worse than single CPU sockets.

  18. Kepler - Greener Science
     Running AMBER 12 GPU Support Revision 12.1. Energy used in simulating 1 ns of DHFR (JAC); lower is better.
     The blue node contains dual E5-2687W CPUs (150 W each, 8 cores per CPU); the green nodes contain dual E5-2687W CPUs plus 1x NVIDIA K10, K20, or K20X GPU (235 W each).
     Energy expended = power x time. The GPU-accelerated systems use 65-75% less energy.
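The saving falls out of the formula: the GPU raises node power, but cuts the time per simulated nanosecond by a larger factor. A sketch using the slide's TDPs and hypothetical throughputs (the deck does not give the underlying runtimes):

```python
def energy_kj(power_watts, ns_per_day):
    """Energy (kJ) to simulate 1 ns at the given node power and throughput."""
    seconds_per_ns = 86400.0 / ns_per_day
    return power_watts * seconds_per_ns / 1000.0

cpu_only = energy_kj(2 * 150, 5.0)        # dual 150 W CPUs, hypothetical 5 ns/day
with_gpu = energy_kj(2 * 150 + 235, 30.0) # + one 235 W GPU, hypothetical 30 ns/day
print(f"{cpu_only:.0f} kJ/ns vs {with_gpu:.0f} kJ/ns "
      f"({1 - with_gpu / cpu_only:.0%} less energy)")
```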

  19. Recommended GPU Node Configuration for AMBER
     Computational chemistry workstation or single-node configuration:
       # of CPU sockets:            2
       Cores per CPU socket:        6+ (1 CPU core drives 1 GPU)
       CPU speed:                   2.66 GHz or faster
       System memory per node:      16 GB
       GPUs:                        Kepler K20, K40, K80
       # of GPUs per CPU socket:    1-4
       GPU memory preference:       6 GB
       GPU-to-CPU connection:       PCIe 3.0 x16 or higher
       Server storage:              2 TB
       Network configuration:       InfiniBand QDR or better
     Scale to multiple nodes with the same single-node configuration.
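The "1 CPU core drives 1 GPU" line reflects how AMBER's GPU engine runs: each simulation needs only one core to feed its GPU. A hypothetical launcher in that spirit, running one independent pmemd.cuda job per GPU and pinning each to its own device (file names and the GPU count are placeholders):

```python
import os
import subprocess

NUM_GPUS = 4  # e.g. a workstation with four Kepler GPUs

procs = []
for gpu in range(NUM_GPUS):
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))  # one GPU per job
    procs.append(subprocess.Popen(
        ["pmemd.cuda", "-O", "-i", "md.in", "-p", "prmtop",
         "-c", f"rep{gpu}.rst", "-o", f"rep{gpu}.out", "-x", f"rep{gpu}.nc"],
        env=env))

for p in procs:
    p.wait()  # each job occupies one CPU core and one GPU
```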

  20. CHARMM

  21. Courtesy of Antti-Pekka Hynninen @ NREL

  22. Courtesy of Antti-Pekka Hynninen @ NREL

  23. Courtesy of Antti-Pekka Hynninen @ NREL

  24. Courtesy of Antti-Pekka Hynninen @ NREL

  25. Greener Science with NVIDIA
     Energy used in simulating 1 ns of the Daresbury G1nBP benchmark (61.2k atoms); lower is better. Running CHARMM release c37b1.
     The blue configuration uses 64 X5667 CPUs (95 W, 4 cores per CPU); the green nodes contain 2 X5667 CPUs plus 1 or 2 NVIDIA C2070 GPUs (238 W each).
     Energy expended = power x time. Note: typical CPU and GPU node pricing used; pricing may vary depending on node configuration. Contact your preferred HW vendor for actual pricing.
     Using GPUs decreases energy use by 75%.

  26. CHARMM c40a2 May 2016

  27. 465K System on K80s
     Running CHARMM version c40a2. The blue node contains dual Intel Xeon E5-2698 v3 @ 2.30 GHz (Haswell) CPUs; the green nodes add Tesla K80 (autoboost) GPUs.
     "gpuonly" means all forces are calculated on the GPU; "gpuon" means only non-bonded forces are calculated on the GPU.
     465K-atom system (ns/day):
       1 Haswell node (CPU only):    0.36
       1 node + 1x K80 (gpuonly):    1.62  (4.5X)
       1 node + 1x K80 (gpuon):      1.70  (4.7X)
       1 node + 2x K80 (gpuon):      1.80  (5.0X)
       1 node + 4x K80 (gpuon):      2.15  (6.0X)

  28. 534K System on K80s
     Same node configurations and CHARMM version as slide 27.
     534K-atom system (ns/day):
       1 Haswell node (CPU only):    0.18
       1 node + 1x K80 (gpuonly):    1.43  (8.0X)
       1 node + 1x K80 (gpuon):      1.44  (8.0X)
       1 node + 2x K80 (gpuon):      1.44  (8.0X)
       1 node + 4x K80 (gpuon):      1.86  (10.3X)

  29. GROMACS 5.1 October 2015

  30. Erik Lindahl (GROMACS developer) video

  31. 384K Waters on K40s and K80s (Water [PME] 384k)
     Running GROMACS version 5.1. The blue node contains dual Intel E5-2698 v3 @ 2.3 GHz CPUs; the green nodes add NVIDIA Tesla K40 @ 875 MHz or Tesla K80 @ 562 MHz (autoboost) GPUs.
     Simulated time (ns/day):
       1 Haswell node (CPU only):    7.16
       1 CPU node + 1x K40:         10.45  (1.5X)
       1 CPU node + 1x K80:         16.99  (2.4X)
       1 CPU node + 2x K40:         17.07  (2.4X)
       1 CPU node + 2x K80:         22.36  (3.1X)
       1 CPU node + 4x K40:         22.95  (3.2X)
       1 CPU node + 4x K80:         24.72  (3.5X)

  32. 384K Waters on Titan X (Water [PME] 384k)
     Running GROMACS version 5.1. The blue node contains dual Intel E5-2698 v3 @ 2.3 GHz CPUs; the green nodes add GeForce GTX Titan X @ 1000 MHz GPUs.
     Simulated time (ns/day):
       1 Haswell node:            7.16
       1 CPU node + 1x Titan X:  16.08  (2.2X)
       1 CPU node + 2x Titan X:  18.13  (2.5X)
       1 CPU node + 4x Titan X:  21.74  (3.0X)

  33. 768K Waters on K40s and K80s (Water [PME] 768k)
     Running GROMACS version 5.1; same node configurations as slide 31.
     Simulated time (ns/day):
       1 Haswell node (CPU only):    3.58
       1 CPU node + 1x K40:          5.37  (1.5X)
       1 CPU node + 1x K80:          8.50  (2.4X)
       1 CPU node + 2x K40:          8.60  (2.4X)
       1 CPU node + 2x K80:         11.31  (3.2X)
       1 CPU node + 4x K40:         11.36  (3.2X)
       1 CPU node + 4x K80:         12.78  (3.6X)

  34. 768K Waters on Titan X (Water [PME] 768k)
     Running GROMACS version 5.1; same node configurations as slide 32.
     Simulated time (ns/day):
       1 Haswell node:            3.58
       1 CPU node + 1x Titan X:   8.19  (2.3X)
       1 CPU node + 2x Titan X:   9.12  (2.5X)
       1 CPU node + 4x Titan X:  11.51  (3.2X)

  35. 1.5M Waters on K40s and K80s (Water [PME] 1.5M)
     Running GROMACS version 5.1; same node configurations as slide 31.
     Simulated time (ns/day):
       1 Haswell node (CPU only):    1.72
       1 CPU node + 1x K40:          2.69  (1.6X)
       1 CPU node + 1x K80:          4.13  (2.4X)
       1 CPU node + 2x K40:          4.16  (2.4X)
       1 CPU node + 2x K80:          5.61  (3.3X)
       1 CPU node + 4x K40:          5.67  (3.3X)
       1 CPU node + 4x K80:          6.07  (3.5X)

  36. 1.5M Waters on Titan X (Water [PME] 1.5M)
     Running GROMACS version 5.1; same node configurations as slide 32.
     Simulated time (ns/day):
       1 Haswell node:            1.72
       1 CPU node + 1x Titan X:   3.75  (2.2X)
       1 CPU node + 2x Titan X:   4.64  (2.7X)
       1 CPU node + 4x Titan X:   5.87  (3.4X)

  37. 3M Waters on K40s and K80s (Water [PME] 3M)
     Running GROMACS version 5.1; same node configurations as slide 31.
     Simulated time (ns/day):
       1 Haswell node (CPU only):    0.81
       1 CPU node + 1x K40:          1.32  (1.6X)
       1 CPU node + 1x K80:          1.85  (2.3X)
       1 CPU node + 2x K40:          1.88  (2.3X)
       1 CPU node + 2x K80:          2.72  (3.4X)
       1 CPU node + 4x K40:          2.76  (3.4X)
       1 CPU node + 4x K80:          3.23  (4.0X)

  38. 3M Waters on Titan X (Water [PME] 3M)
     Running GROMACS version 5.1; same node configurations as slide 32.
     Simulated time (ns/day):
       1 Haswell node:            0.81
       1 CPU node + 1x Titan X:   1.53  (1.9X)
       1 CPU node + 2x Titan X:   2.36  (2.9X)
       1 CPU node + 4x Titan X:   2.99  (3.7X)

  39. GROMACS 5.0: Phi vs. Kepler - K40 Fastest GPU
     GROMACS 5.0 RC1 (ns/day) on K40 with boost clocks and Intel Phi; 192K Waters benchmark (CUDA 6.0).
       1x Xeon E5-2697 v2 @ 2.70 GHz:               4.96
       1x Intel Phi 3120p / 5110p (native mode):    5.9 / 6.02
       2x Xeon E5-2697 v2 @ 2.70 GHz:               7.9
       1x Xeon + 1x Tesla K40 @ 875 MHz:           18.19
       1x Xeon + 2x Tesla K40 @ 875 MHz:           18.55
       2x Xeon + 1x Tesla K40 @ 875 MHz:           19.29
       2x Xeon + 2x Tesla K40 @ 875 MHz:           25.84

  40. GROMACS 5.0 & Fastest Kepler GPUs Yet (cresta_ion_channel, single node)
     GROMACS 5.0 on a single node with and without Kepler GPUs: Xeon E5-2697 v2 @ 2.70 GHz CPUs plus one Tesla K20X, K40 @ 875 MHz, or K80 (autoboost).
     ns/day: 7.92 with one CPU and 11.60 with two; the single-GPU configurations reach 18.63, 20.01, 21.79, 25.49, and 26.00 across the K20X/K40/K80 combinations.

  41. GROMACS 5.0 & Fastest Kepler GPUs Yet (cresta_ion_channel_vsites, single node)
     Same hardware matrix as slide 40, now including dual-GPU configurations.
     ns/day: 13.66 with one CPU and 17.98 with two; the GPU-accelerated configurations reach 31.86, 35.27, 37.00, 41.94, 42.57, 45.29, and 45.37 across the 1x/2x K20X, K40, and K80 combinations.

  42. GROMACS 5.0 & Fastest Kepler GPUs Yet (cresta_methanol, single node)
     Same single-node comparison for the cresta_methanol benchmark; all configurations fall below 0.45 ns/day.

  43. GROMACS 5.0 & Fastest Kepler GPUs Yet (cresta_methanol_rf, single node)
     Same hardware matrix as slide 41.
     ns/day: 0.12 with one CPU and 0.19 with two; the GPU-accelerated configurations reach 0.27, 0.30, 0.31, 0.34, 0.36, 0.46, and 0.52.

  44. GROMACS 5.0 & Fastest Kepler GPUs Yet (cresta_virus_capsid, single node)
     Same hardware matrix as slide 41.
     ns/day: 0.92 with one CPU and 1.54 with two; the GPU-accelerated configurations reach 2.79, 2.99, 3.24, 3.30, 3.83, 4.58, and 5.18.

  45. GROMACS 5.0 & Fastest Kepler GPUs Yet (cresta_ion_channel, 2 to 8 nodes)
     Dual-socket Xeon E5-2697 v2 @ 2.70 GHz nodes with one or two Tesla K20X or K40 @ 875 MHz GPUs per node. ns/day by node count, per the chart's data series:
       Nodes   CPU only   1x K20X/node   1x K40/node   2x K20X/node   2x K40/node
       2       21.32      31.80          33.76         44.49          45.92
       4       35.99      48.85          52.25         59.28          61.16
       8       54.72      62.95          68.11         72.18          78.48

  46. GROMACS 5.0 & Fastest Kepler GPUs Yet (cresta_ion_channel_vsites, 2 to 8 nodes)
     Same hardware matrix as slide 45. Results span 32.81 ns/day (2 CPU-only nodes) up to 140.66 ns/day for the largest GPU configuration, with intermediate results of 47.92, 53.98, 55.66, 70.02, 75.50, 76.48, 81.26, 82.31, 98.37, 99.26, 102.47, 105.78, and 131.88.

  47. GROMACS 5.0 & Fastest Kepler GPUs Yet (cresta_methanol, 2 to 8 nodes)
     Same hardware matrix as slide 45. ns/day by node count, per the chart's data series:
       Nodes   CPU only   1x K20X/node   1x K40/node   2x K20X/node   2x K40/node
       2       0.33       0.44           0.47          0.63           0.80
       4       0.60       0.84           0.97          1.38           1.53
       8       1.25       1.73           1.83          2.73           2.85

  48. GROMACS 5.0 & Fastest Kepler GPUs Yet (cresta_methanol_rf, 2 to 8 nodes)
     Same hardware matrix as slide 45. ns/day by node count, per the chart's data series:
       Nodes   CPU only   1x K20X/node   1x K40/node   2x K20X/node   2x K40/node
       2       0.38       0.49           0.57          0.89           1.05
       4       0.75       0.91           1.17          1.73           2.12
       8       1.48       1.86           2.23          3.65           4.16

  49. GROMACS 5.0 & Fastest Kepler GPUs Yet (cresta_virus_capsid, 2 to 8 nodes)
     Same hardware matrix as slide 45. ns/day by node count, per the chart's data series:
       Nodes   CPU only   1x K20X/node   1x K40/node   2x K20X/node   2x K40/node
       2       2.93       5.44           5.71           8.36          8.63
       4       5.53       8.99           9.81          14.20         12.93
       8       9.18       15.24          15.57         20.30         22.01

  50. Slides – courtesy of GROMACS Dev Team

  51. Slides – courtesy of GROMACS Dev Team

  52. Slides – courtesy of GROMACS Dev Team

  53. Slides – courtesy of GROMACS Dev Team

  54. Greener Science (ADH in Water, 134K Atoms)
     Running GROMACS 4.6 with CUDA 4.1; energy expended (kJ) per simulated nanosecond, lower is better.
     The blue configuration is 4 nodes, each with 2x Intel X5550 CPUs (95 W TDP, 4 cores per CPU), drawing 760 W total; the green configuration is 1 node with 2x Intel X5550 CPUs plus 2x NVIDIA M2090 GPUs (225 W TDP per GPU), drawing 640 W.
     Energy expended = power x time. In simulating each nanosecond, the GPU-accelerated system uses 33% less energy.

  55. Recommended GPU Node Configuration for GROMACS
     Computational chemistry workstation or single-node configuration:
       # of CPU sockets:            2
       Cores per CPU socket:        6+
       CPU speed:                   2.66 GHz or faster
       System memory per socket:    32 GB
       GPUs:                        Kepler K20, K40, K80
       # of GPUs per CPU socket:    1 (Kepler GPUs need a fast Sandy Bridge or Ivy Bridge CPU, or high-end AMD Opterons)
       GPU memory preference:       6 GB
       GPU-to-CPU connection:       PCIe 3.0 or higher
       Server storage:              500 GB or higher
       Network configuration:       Gemini, InfiniBand
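On a node built to this spec, a single-node GROMACS 5.x run might be launched as sketched below, with one thread-MPI rank per CPU socket and each rank mapped to its own GPU (the run name and thread counts are illustrative, not from the deck):

```python
import subprocess

# Two thread-MPI ranks (one per socket), 6 OpenMP threads each;
# -gpu_id "01" maps rank 0 to GPU 0 and rank 1 to GPU 1.
subprocess.run(
    ["gmx", "mdrun", "-deffnm", "topol",
     "-ntmpi", "2", "-ntomp", "6", "-gpu_id", "01"],
    check=True)
```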

  56. HOOMD-Blue March 2016

  57. HPMC: 2^23 Dodecahedra on Comet
     Hours to complete 10e6 sweeps of 2^23 dodecahedra in HPMC (hard particle Monte Carlo), on 1 to 64 nodes of Comet. Blue points are CPU runs on dual Intel Xeon E5-2680 v3 @ 2.50 GHz (Haswell) nodes, from 24 up to 1536 CPU cores; green points are Tesla K80 (autoboost) runs on 4, 8, or 16 K80 GPUs. A small number of K80s matches the completion time of much larger CPU-core counts.

  58. HOOMD-Blue 1.0 October 2015

  59. HOOMD-Blue 1.0, K40 & K80, Boost Impact (Liquid)
     Running HOOMD-Blue version 1.0. The nodes contain dual Intel E5-2697 v2 @ 2.70 GHz CPUs plus NVIDIA Tesla K40 @ 875 MHz or Tesla K80 (autoboost) GPUs.
     Average timesteps per second:
       1 CPU node + 1x K40:   1184.44
       1 CPU node + 1x K80:   1496.42
       1 CPU node + 2x K40:   1516.91
       1 CPU node + 2x K80:   2068.27

  60. HOOMD-Blue 1.0, K40 & K80, Boost Impact (Polymer)
     Running HOOMD-Blue version 1.0; same node configurations as slide 59.
     Average timesteps per second:
       1 CPU node + 1x K40:   1031.79
       1 CPU node + 1x K80:   1173.01
       1 CPU node + 2x K40:   1203.83
       1 CPU node + 2x K80:   1580.45

  61. HOOMD-Blue 1.0, Liquid: Single Node with 1 or 2 Kepler GPUs
     Same data as slide 59 (average timesteps per second): 1x K40 @ 875 MHz 1184.44; 1x K80 (autoboost) 1496.42; 2x K40 1516.91; 2x K80 2068.27.

  62. HOOMD-Blue, Polymer: Single Node with 1 or 2 Kepler GPUs
     Same data as slide 60 (average timesteps per second): 1x K40 @ 875 MHz 1031.79; 1x K80 (autoboost) 1173.01; 2x K40 1203.83; 2x K80 1580.45.

  63. HOOMD-Blue 1.0.0 and K40, Boost Impact
     HOOMD-Blue (timesteps/sec) on K40 with and without boost clocks; lj_liquid (64K particles) benchmark, CUDA 5.5, ECC on, gcc 4.7.3. Nodes contain dual Xeon E5-2690 v2 @ 3.00 GHz CPUs; K40s run at the 745 MHz base or 875 MHz boost clock.
       CPU only:              183.6
       + 1x K40 @ 745 MHz:   1017.4
       + 1x K40 @ 875 MHz:   1180.6
       + 2x K40 @ 745 MHz:   1412.9
       + 2x K40 @ 875 MHz:   1599.0
       + 4x K40 @ 745 MHz:   1989.7
       + 4x K40 @ 875 MHz:   2232.1
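The boost-clock gain can be sanity-checked against the clock ratio: 875/745 is about 1.17x, and the measured throughput gains land slightly below it, as expected when part of the work stays on the CPU. A quick check using the values above:

```python
base_clock, boost_clock = 745, 875
print(f"clock ratio: {boost_clock / base_clock:.2f}")  # ~1.17
# (745 MHz, 875 MHz) timesteps/sec pairs for 1, 2, and 4 K40s.
pairs = [(1017.4, 1180.6), (1412.9, 1599.0), (1989.7, 2232.1)]
for base, boost in pairs:
    print(f"throughput gain: {boost / base:.2f}")  # 1.16, 1.13, 1.12
```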

  64. HOOMD-Blue 1.0.0 and K40, Fastest GPU Yet (lj_liquid, 64K particles)
     Timesteps/sec on K40 with boost clocks (875 MHz); CUDA 5.5, ECC on, gcc 4.7.3; dual Xeon E5-2690 v2 @ 3.00 GHz per node.
       Nodes   CPU only   1x K40/node   2x K40/node   4x K40/node
       1       183.6      1180.6        1599.0        2232.1
       2       343.4      1621.9        2166.2        2721.6
       4       582.5      2257.0        2684.5        3235.4

  65. HOOMD-Blue 1.0.0 and K40, Fastest GPU Yet (polymer, 64,017 particles)
     Timesteps/sec on K40 with boost clocks; same configuration as slide 64.
       Nodes   CPU only   1x K40/node   2x K40/node   4x K40/node
       1       179.4      1015.5        1249.5        1759.0
       2       338.5      1214.2        1696.5        2082.1
       4       576.2      1773.6        2038.4        2434.8

  66. HOOMD-Blue 1.0.0 and K40, Fastest GPU Yet (lj_liquid, 512K particles)
     Timesteps/sec on K40 with boost clocks; same configuration as slide 64.
       Nodes   CPU only   1x K40/node   2x K40/node   4x K40/node
       1       20.6       161.6         268.3          458.0
       2       40.2       273.9         463.5          778.9
       4       77.5       474.0         757.5         1150.2

  67. HOOMD-Blue on ARM vs. Ivy Bridge, with & without K20
     HOOMD-Blue 1.0.0 (timesteps/sec), lj_liquid (64K particles) benchmark, OpenMPI 1.8.1. Equivalent performance on ARM + K20:
       ARMv8 64-bit (2.4 GHz), 8 cores, no GPU:              31.0
       Ivy Bridge (E5-2690 v2 @ 3.00GHz), 8 cores:           85.4
       Ivy Bridge (E5-2690 v2 @ 3.00GHz), 20 cores:         181.8
       ARMv8 64-bit (2.4 GHz), 8 cores, w/ K20:             896.2
       Ivy Bridge (E5-2690 v2 @ 3.00GHz), 20 cores, w/ K20: 896.2

  68. Application-Level Evaluation (HOOMD-blue)
     Average TPS for HOOMD-blue strong and weak scaling on 4 to 64 GPU nodes, comparing MV2-2.0b-GDR, MV2-NewGDR-Loopback, and MV2-NewGDR-Fastcopy.
     - Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
     - Strong scaling: fixed 64K particles. Loopback and Fastcopy get up to 45% and 48% improvement for 32 GPUs.
     - Weak scaling: fixed 2K particles per GPU. Loopback and Fastcopy get up to 54% and 56% improvement for 16 GPUs.
     (Webinar, June 2014)
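The two scaling modes size the problem differently: strong scaling holds the total particle count fixed while weak scaling holds the per-GPU count fixed. A small sketch using the particle counts from the bullets above:

```python
def problem_size(mode, num_gpus, total=64_000, per_gpu=2_000):
    if mode == "strong":
        return total               # fixed system, split across more GPUs
    if mode == "weak":
        return per_gpu * num_gpus  # system grows with the GPU count
    raise ValueError(mode)

for gpus in (4, 8, 16, 32, 64):
    print(gpus, problem_size("strong", gpus), problem_size("weak", gpus))
```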

  69. LAMMPS October 2015

  70. Lennard-Jones on K20X, K40s & K80s (single precision, 1 node)
     Running LAMMPS: atomic fluid, Lennard-Jones (2.5 cutoff), single precision, 2,048,000 atoms. The blue node contains dual Intel Xeon E5-2697 v2 @ 2.7 GHz CPUs; the green nodes add NVIDIA Tesla K20X @ 732 MHz, Tesla K40 @ 875 MHz, or Tesla K80 (autoboost) GPUs.
     Average loop time in seconds (lower is better):
       1 Ivy Bridge node:      6.19
       1 CPU node + 1x K20X:   2.86  (2.2X)
       1 CPU node + 1x K40:    2.51  (2.5X)
       1 CPU node + 1x K80:    2.32  (2.7X)
       1 CPU node + 2x K20X:   2.31  (2.7X)
       1 CPU node + 2x K40:    2.21  (2.8X)
       1 CPU node + 2x K80:    2.14  (2.9X)
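Note the metric flip relative to the ns/day charts earlier in the deck: LAMMPS reports average loop time, where lower is better, so each speedup label is the baseline time divided by the accelerated time. Reproducing the labels above:

```python
baseline = 6.19  # seconds, 1 Ivy Bridge node, CPU only
configs = [("1x K20X", 2.86), ("1x K40", 2.51), ("1x K80", 2.32),
           ("2x K20X", 2.31), ("2x K40", 2.21), ("2x K80", 2.14)]
for name, seconds in configs:
    print(f"{name}: {baseline / seconds:.1f}X")  # 2.2X ... 2.9X
```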

  71. Lennard-Jones on K20X, K40s & K80s (double precision, 1 node)
     Same benchmark and hardware as slide 70, in double precision (2,048,000 atoms).
     Average loop time in seconds (lower is better):
       1 Ivy Bridge node:      7.98
       1 CPU node + 1x K20X:   6.14  (1.3X)
       1 CPU node + 1x K40:    3.85  (2.1X)
       1 CPU node + 1x K80:    3.60  (2.2X)
       1 CPU node + 2x K20X:   2.62  (3.0X)
       1 CPU node + 2x K40:    2.56  (3.1X)
       1 CPU node + 2x K80:    2.47  (3.2X)

  72. Lennard-Jones on K20X, K40s & K80s (single precision, 2 nodes)
     Same benchmark, scaled to 2 nodes.
     Average loop time in seconds (lower is better):
       2 Ivy Bridge nodes:      3.15
       2 CPU nodes + 1x K20X:   1.60  (2.0X)
       2 CPU nodes + 1x K40:    1.34  (2.4X)
       2 CPU nodes + 1x K80:    1.11  (2.8X)
       2 CPU nodes + 2x K20X:   1.08  (3.0X)
       2 CPU nodes + 2x K40:    1.04  (3.0X)
       2 CPU nodes + 2x K80:    0.99  (3.2X)

  73. Lennard-Jones on K20X, K40s & K80s (double precision, 2 nodes)
     Same benchmark, double precision, 2 nodes.
     Average loop time in seconds (lower is better):
       2 Ivy Bridge nodes:      4.08
       2 CPU nodes + 1x K20X:   2.56  (1.6X)
       2 CPU nodes + 1x K40:    2.03  (2.0X)
       2 CPU nodes + 1x K80:    1.53  (2.7X)
       2 CPU nodes + 2x K20X:   1.30  (3.1X)
       2 CPU nodes + 2x K40:    1.29  (3.2X)
       2 CPU nodes + 2x K80:    1.17  (3.5X)

  74. Lennard-Jones on K20X, K40s & K80s (single precision, 4 nodes)
     Same benchmark, single precision, 4 nodes.
     Average loop time in seconds (lower is better):
       4 Ivy Bridge nodes:      1.64
       4 CPU nodes + 1x K20X:   1.00  (1.6X)
       4 CPU nodes + 1x K40:    0.80  (2.1X)
       4 CPU nodes + 1x K80:    0.65  (2.5X)
       4 CPU nodes + 2x K20X:   0.61  (2.7X)
       4 CPU nodes + 2x K40:    0.53  (3.1X)
       4 CPU nodes + 2x K80:    0.53  (3.1X)

  75. Lennard-Jones on K20X, K40s & K80s (double precision, 4 nodes)
     Same benchmark, double precision, 4 nodes.
     Average loop time in seconds (lower is better):
       4 Ivy Bridge nodes:      2.09
       4 CPU nodes + 1x K20X:   1.46  (1.4X)
       4 CPU nodes + 1x K40:    1.17  (1.8X)
       4 CPU nodes + 1x K80:    0.86  (2.4X)
       4 CPU nodes + 2x K20X:   0.77  (2.7X)
       4 CPU nodes + 2x K40:    0.71  (2.9X)
       4 CPU nodes + 2x K80:    0.61  (3.4X)
