MPAS on GPUs Using OpenACC: Portability, Scalability & Performance
Dr. Raghu Raj Kumar
Project Scientist I & Group Head, Special Technical Projects (STP) Group
National Center for Atmospheric Research
March 2018
Outline
• Motivation & Goals
• System & Software Specs
• Approach & Challenges
• Results
• Future Plans
• Questions
Motivation & Goals
• Motivation
  o Shallow water equation solver & discontinuous Galerkin kernel
    • Showed promising results on GPUs
    • A unified code was possible, at least for small-scale applications
• Goals
  o Port MPAS onto GPUs
  o Optimize performance on GPUs
    • No compromise on portability: stay within 10% of the original CPU performance
  o Scale MPAS on GPUs
System Specs
• NVIDIA's internal clusters
  o PSG: dual-socket Haswell (32 cores/node) with 4 P100s/node, 12 nodes, intra-node PCIe, inter-node FDR InfiniBand
  o PSG: dual-socket Haswell (32 cores/node) with 4 V100s/node, 2 nodes, intra-node PCIe, inter-node FDR InfiniBand
• NCAR's Cheyenne supercomputer
  o Dual-socket Broadwell (36 cores/node), 4,032 nodes
• IBM's R92 cluster (internal)
  o Minsky: dual-socket POWER8 (20 cores/node) with 4 P100s/node, 90+ nodes, intra-node NVLink, inter-node InfiniBand
Software Spec: MPAS Dry Dynamical Core
• Software
  o MPAS release (MPAS 5.2)
  o Intel Compiler 17.0, PGI Compiler 17.10
• Dry baroclinic instability test: no physics, no scalar transport
  o The dry-dynamics test case produces baroclinic storms from analytic initial conditions
  o Split dynamics: 2 sub-steps, 3 split steps
  o Resolutions: 120 km (40K grid points, dt=720 s), 60 km (163K grid points, dt=360 s), 30 km (655K grid points, dt=180 s), 15 km (2.6M grid points, dt=60 s)
  o Number of vertical levels = 56
  o Double precision (DP) and single precision (SP)
  o Simulation executed for 16 days; performance shown for 1 timestep
Why did we choose the dycore?
• Execution time: physics 45-50%, dycore 50-55%
• Lines of code: physics ~110,000, dycore ~10,000
• [Flow diagram by KISTI: MPAS physics schemes and their share of execution time - microphysics WSM6 (9.62%), boundary layer YSU (1.55%), gravity wave drag GWDO (0.71%), shortwave radiation RRTMG_SW (18.83%), longwave radiation RRTMG_LW (16.43%), convection New Tiedtke (4.19%)]
Dycore: Zoom In
MPAS-5, 120 km case with the Intel compiler, running 36 MPI ranks on an Intel Xeon E5-2697 v4 "Broadwell" node.
[Pie chart of dycore execution time: dyn_tend 32%, diagnostics 20%, small_step 16%, acoustic_step 13%, MPI 8%, substep 5%, large_step 4%, integration setup 1%, moist coefficients 1%, imp_coef 0%]
Approach
[Workflow diagram covering: KGen kernel extraction, OpenACC directives, software & architecture configuration, porting, baselining, profiling & analysis, optimization, benchmarking, accuracy testing, code verification, portability checks, refactoring, and integration]
Challenges Faced: Using the Right Directives
• Simple directives: lower time for porting, reasonable performance
• Hand-tuned directives: much higher time for porting, improved performance depending on the loop count (up to 50%)
(A hedged illustration of the two styles follows.)
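Below is a minimal illustration (not taken from MPAS itself) of the two directive styles being contrasted: letting the compiler map the loops with !$acc kernels versus hand-specifying the mapping with !$acc parallel loop. The loop body, array names, and dimensions are hypothetical, and in practice only one style would be applied to a given loop; both are shown here for comparison.

    subroutine update_theta(nVertLevels, nCells, dt, theta, tend_theta)
       implicit none
       integer, intent(in)    :: nVertLevels, nCells
       real(8), intent(in)    :: dt
       real(8), intent(inout) :: theta(nVertLevels, nCells)
       real(8), intent(in)    :: tend_theta(nVertLevels, nCells)
       integer :: iCell, k

       ! Style 1: compiler-driven. Quick to add; the compiler chooses the
       ! gang/vector mapping and the data movement.
       !$acc kernels
       do iCell = 1, nCells
          do k = 1, nVertLevels
             theta(k,iCell) = theta(k,iCell) + dt * tend_theta(k,iCell)
          end do
       end do
       !$acc end kernels

       ! Style 2: the same loop with an explicit, hand-tuned mapping.
       ! More porting effort; the payoff depends on the loop count.
       !$acc parallel loop gang vector collapse(2)
       do iCell = 1, nCells
          do k = 1, nVertLevels
             theta(k,iCell) = theta(k,iCell) + dt * tend_theta(k,iCell)
          end do
       end do
    end subroutine update_theta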
Challenges Faced: Using the Right Approach for Data
• Approach 1: move data on every call
    subroutine dyn_tend
      !$acc enter data
      !$acc data copy( … )
      … <code> …
      !$acc exit data
    end subroutine dyn_tend
  o Lower time for porting
  o Repeated, unnecessary data transfers: poor performance
• Approach 2: keep data resident on the device; only assert presence in the routine
    subroutine dyn_tend
      !$acc data present( … )
      … <code> …
    end subroutine dyn_tend
  o Creates a copy on host and device simultaneously
  o Harder to design
  o No unnecessary copies of data between host and device
(An expanded sketch of the second approach follows.)
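The second pattern can be expanded into a minimal sketch like the one below, assuming hypothetical routine bodies and array names: the device copy is created once at allocation time with !$acc enter data, and the per-timestep routine only asserts that the data is already present, so no transfers occur inside the time loop.

    ! Called once at start-up: allocate and create persistent device copies.
    subroutine allocate_state(nVertLevels, nCells, theta, rho)
       implicit none
       integer, intent(in) :: nVertLevels, nCells
       real(8), allocatable, intent(out) :: theta(:,:), rho(:,:)
       allocate(theta(nVertLevels, nCells), rho(nVertLevels, nCells))
       !$acc enter data create(theta, rho)
    end subroutine allocate_state

    ! Called every timestep: present(...) asserts the data is already on the
    ! device, so no host<->device transfers happen here.
    subroutine dyn_tend(nVertLevels, nCells, theta, rho)
       implicit none
       integer, intent(in)    :: nVertLevels, nCells
       real(8), intent(inout) :: theta(:,:)
       real(8), intent(in)    :: rho(:,:)
       integer :: iCell, k
       !$acc data present(theta, rho)
       !$acc parallel loop gang vector collapse(2)
       do iCell = 1, nCells
          do k = 1, nVertLevels
             theta(k,iCell) = theta(k,iCell) + 0.5d0 * rho(k,iCell)   ! placeholder update
          end do
       end do
       !$acc end data
    end subroutine dyn_tend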
Challenges Faced: Using the Right Techniques
• Understand how to optimize
  o Learn the basics of optimization
• Understand how GPUs work
  o Identify poorly performing code snippets
  o Know when to use global, register and shared memory
  o nvprof is your best friend!
• Understand how GPUs and CPUs work
  o Experiment with SIMD-friendly loops (code layout)
  o Experiment with the GPU's SIMT code (data layout)
  o Learn how to combine the two! (see the sketch below)
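As one hedged example of combining the two: with MPAS-style arrays dimensioned (nVertLevels, nCells), the contiguous vertical index can feed CPU SIMD lanes, while on the GPU the same inner loop maps to vector lanes (SIMT) and the cells spread across gangs. The routine and bounds below are illustrative, not MPAS code.

    subroutine apply_tend(nVertLevels, nCells, dt, theta, tend)
       implicit none
       integer, intent(in)    :: nVertLevels, nCells
       real(8), intent(in)    :: dt
       real(8), intent(inout) :: theta(nVertLevels, nCells)   ! k is the contiguous index
       real(8), intent(in)    :: tend(nVertLevels, nCells)
       integer :: iCell, k

       !$acc parallel loop gang
       do iCell = 1, nCells
          ! Stride-1 inner loop: vectorizes (SIMD) in the CPU build and is
          ! mapped to vector lanes (SIMT) in the GPU build.
          !$acc loop vector
          do k = 1, nVertLevels
             theta(k,iCell) = theta(k,iCell) + dt * tend(k,iCell)
          end do
       end do
    end subroutine apply_tend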
Single Node Performance: MPAS Dry Dycore
• Timers
  o MPAS GPTL timers reported in log files
• GPU timing includes no updates from device to host (a sketch of this pattern follows the table)
  o Host updates may be needed for printing values on screen
  o Host updates may be needed for netCDF file output

Time per timestep (s). Broadwell: fully subscribed node, OpenMP enabled, Intel-compiled base code. GPU columns: 1 GPU, PGI-compiled OpenACC code.

  Dataset        Prec.  Broadwell  P100+Power8  P100+Haswell  V100+Haswell  Speedup Broadwell vs P100  Speedup Broadwell vs V100
  120 km (40K)   SP     0.40       0.28         0.26          0.19          1.54                       2.16
  120 km (40K)   DP     0.88       0.40         0.35          0.29          2.51                       2.99
  60 km (163K)   SP     1.90       1.02         1.01          0.69          1.88                       2.74
  60 km (163K)   DP     3.80       1.54         1.41          1.12          2.70                       3.40

Taking 40K grid points per node for SP, 32.8M grid points (15 km & 3 km locally refined grid) would need ~800 Volta GPUs.
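A minimal sketch of the "host updates only when needed" pattern; the step test, array names, and write_history routine are hypothetical, not the actual MPAS output path.

    subroutine maybe_output(step, output_interval, theta, rho)
       implicit none
       integer, intent(in) :: step, output_interval
       real(8), intent(in) :: theta(:,:), rho(:,:)
       if (mod(step, output_interval) == 0) then
          ! Copy device data back to the host only when output is written,
          ! so the per-timestep GPU timings include no transfers.
          !$acc update self(theta, rho)
          call write_history(theta, rho)   ! hypothetical host-side netCDF writer
       end if
    end subroutine maybe_output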
Weak Scaling for MPAS Dry Dycore (SP & DP) on P100 GPUs
[Chart: time per timestep (s) vs. number of GPUs (2-16), for 40K and 163K grid points per node, SP and DP]
Time per timestep; 4 GPUs per node, 1 MPI rank per GPU, max of 4 MPI ranks per node, intra-node affinity for MPI ranks; OpenMPI; PCIe, no NVLink; PGI 17.10. (A generic rank-to-GPU binding sketch follows.)
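One common way to realize "1 MPI rank per GPU" from Fortran/OpenACC is sketched below: split the communicator by node, then select the device by node-local rank. This is a generic pattern, not necessarily the affinity scheme used for these runs, and it assumes the 0-based device numbering of the PGI runtime.

    subroutine bind_rank_to_gpu(world_comm)
       use mpi
       use openacc
       implicit none
       integer, intent(in) :: world_comm
       integer :: node_comm, local_rank, num_devices, ierr

       ! Node-local rank via a shared-memory sub-communicator.
       call MPI_Comm_split_type(world_comm, MPI_COMM_TYPE_SHARED, 0, &
                                MPI_INFO_NULL, node_comm, ierr)
       call MPI_Comm_rank(node_comm, local_rank, ierr)

       ! One rank per GPU: choose the device from the node-local rank.
       num_devices = acc_get_num_devices(acc_device_nvidia)
       call acc_set_device_num(mod(local_rank, num_devices), acc_device_nvidia)
    end subroutine bind_rank_to_gpu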
Strong Scaling for MPAS Dry Dycore (SP & DP) for 15 km (2.6M) on P100 GPUs
[Chart: inverse time per timestep vs. number of GPUs (8-32), SP and DP]
Time per timestep; 4 GPUs per node, 1 MPI rank per GPU, max of 4 MPI ranks per node, intra-node affinity for MPI ranks; OpenMPI; PCIe, no NVLink; PGI 17.10.
Portability: Performance Comparison of the Base Code with the OpenACC Code on a Fully Subscribed Broadwell Single Node
[Chart: execution time (s) for 40K SP, 40K DP, 163K SP, 163K DP; base code vs. OpenACC code]
The OpenACC code matches the base code closely on the CPU: the variation is <1% for the 40K dataset and <4% for the 163K dataset.
Future Work
• Improving MPAS scalability
  o MVAPICH instead of OpenMPI
  o NVLink systems
  o Moving halo-exchange book-keeping onto GPUs (a generic sketch follows)
  o MPS: preliminary results showed no improvement
• Scalar transport
  o Currently being integrated
• Physics
  o Port: 35% remaining
  o Optimize: 65% remaining
  o Radiation and land surface remain on the CPU
• Development
  o Lagged radiation
  o Adopting the SION library for faster I/O
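For the halo-exchange item, a generic GPU-resident exchange might look like the sketch below, assuming a CUDA-aware MPI build: pack and unpack on the device and hand device pointers to MPI via host_data. The buffer names, packing rule, and single-neighbor exchange are hypothetical, not MPAS's actual halo code.

    subroutine halo_exchange(n, neighbor, comm, field, sendbuf, recvbuf)
       use mpi
       implicit none
       integer, intent(in)    :: n, neighbor, comm
       real(8), intent(inout) :: field(:,:)                  ! assumed already on the device
       real(8), intent(inout) :: sendbuf(n), recvbuf(n)      ! assumed already on the device
       integer :: i, ierr, status(MPI_STATUS_SIZE)

       ! Pack the halo on the device; the book-keeping never touches the host.
       !$acc parallel loop present(field, sendbuf)
       do i = 1, n
          sendbuf(i) = field(1, i)          ! hypothetical packing rule
       end do

       ! Pass device pointers directly to MPI (requires CUDA-aware MPI).
       !$acc host_data use_device(sendbuf, recvbuf)
       call MPI_Sendrecv(sendbuf, n, MPI_DOUBLE_PRECISION, neighbor, 0, &
                         recvbuf, n, MPI_DOUBLE_PRECISION, neighbor, 0, &
                         comm, status, ierr)
       !$acc end host_data

       ! Unpack on the device.
       !$acc parallel loop present(field, recvbuf)
       do i = 1, n
          field(2, i) = recvbuf(i)          ! hypothetical unpacking rule
       end do
    end subroutine halo_exchange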
Thank you! Questions?