MPAS on GPUs Using OpenACC
Supreeth Suresh, Software Engineer II
Special Technical Projects (STP) Group, National Center for Atmospheric Research
26th September, 2019
Outline
• Team
• Introduction
• System and Software Specs
• Approach, Challenges & Performance
  o Dynamical core: optimizations, scalability
  o Physics
• Questions
Our Team of Developers
• NCAR
  o Supreeth Suresh, Software Engineer, STP
  o Cena Miller, Software Engineer, STP
  o Dr. Michael Duda, Software Engineer, MMM
• NVIDIA/PGI
  o Dr. Raghu Raj Kumar, DevTech, NVIDIA
  o Dr. Carl Ponder, Senior Applications Engineer
  o Dr. Craig Tierney, Solutions Architect
  o Brent Leback, PGI Compiler Engineering Manager
• University of Wyoming
  o GRAs: Pranay Kommera, Sumathi Lakshmiranganatha, Henry O'Meara, George Dylan
  o Undergrads: Brett Gilman, Briley James, Suzanne Piver
• IBM/TWC
• Korea Institute of Science and Technology Information (KISTI)
  o Jae Youp Kim, GRA
MPAS Grids [figure: horizontal and vertical mesh structure]
MPAS Time-Integration Design [diagram]
There are 100s of halo exchanges per timestep!
Where to begin?
• Execution time: Physics 45-50%, DyCore 50-55%
• Lines of code: Physics ~110,000, DyCore ~10,000
• MPAS physics schemes (share of execution time):
  o Microphysics: WSM6 (9.62%)
  o Boundary Layer: YSU (1.55%)
  o Gravity Wave Drag: GWDO (0.71%)
  o Radiation Short Wave: RRTMG_SW (18.83%)
  o Radiation Long Wave: RRTMG_LW (16.43%)
  o Convection: New Tiedtke (4.19%)
Flow diagram by KISTI
System Specs
• NCAR Cheyenne supercomputer
  o 2x 18-core Intel Xeon v4 (Broadwell)
  o Intel compiler 19
  o 1x EDR IB interconnect; HPE MPT MPI
• Summit and IBM "WSC" supercomputer
  o AC922 nodes with IB interconnect
  o 6 GPUs per node; 2x 22-core IBM POWER9
  o 2x EDR IB interconnect; IBM Spectrum MPI
Software Spec: MPAS Dynamical Core
• Software
  o MPAS 6.x
  o PGI Compiler 19.4, Intel Compiler 19
• Moist baroclinic instability test (no physics)
  o Moist dynamics test case produces baroclinic storms from analytic initial conditions
  o Split dynamics: 2 sub-steps, 3 split steps
  o Resolutions: 120 km (40k grid points, dt=720s), 60 km (163k grid points, dt=300s), 30 km (655k grid points, dt=150s), 15 km (2.6M grid points, dt=90s), 10 km (5.8M grid points, dt=60s), 5 km (23M grid points, dt=30s)
  o Number of levels = 56, single precision (SP)
  o Simulation executed for 16 days; performance shown for 1 timestep
Software Spec: MPAS with Full Physics
• Software
  o MPAS 6.x
  o PGI Compiler 19.4, Intel Compiler 19
• Full physics suite
  o Scale-aware New Tiedtke convection, WSM6 microphysics, Noah land surface, YSU boundary layer, Monin-Obukhov surface layer, RRTMG radiation, Xu-Randall cloud fraction
  o Radiation interval: 30 minutes
  o Single precision (SP)
  o Optimization and integration in progress; performance shown for 1 timestep
MPAS-GPU Process Layout on an IBM Node [diagram legend]
• Proc 0: MPI & Noah control path; CPU runs SW/LW radiation & Noah
• Proc 1: GPU runs everything else
• Asynchronous I/O process
• Idle processors
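This split (one CPU control rank per node driving radiation and Noah, the remaining ranks each owning a GPU) can be expressed with plain MPI plus the OpenACC runtime API. The program below is only a minimal sketch of such a layout, not MPAS code; the rank roles and the rank_layout name are assumptions.

program rank_layout
   use mpi
   use openacc
   implicit none
   integer :: ierr, world_rank, node_comm, node_rank, ngpus

   call MPI_Init(ierr)
   call MPI_Comm_rank(MPI_COMM_WORLD, world_rank, ierr)

   ! Node-local communicator gives each rank its index within the node.
   call MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, &
                            MPI_INFO_NULL, node_comm, ierr)
   call MPI_Comm_rank(node_comm, node_rank, ierr)

   ngpus = acc_get_num_devices(acc_device_nvidia)

   if (node_rank == 0 .or. ngpus == 0) then
      ! CPU-only rank: in a layout like the slide's, this rank would drive
      ! SW/LW radiation and the Noah land surface model.
      print *, 'rank', world_rank, ': CPU path (radiation + Noah)'
   else
      ! Each remaining rank binds to one of the node's GPUs and runs
      ! the dynamical core plus the rest of the physics there.
      call acc_set_device_num(mod(node_rank - 1, ngpus), acc_device_nvidia)
      print *, 'rank', world_rank, ': GPU', mod(node_rank - 1, ngpus)
   end if

   call MPI_Finalize(ierr)
end program rank_layout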
MPAS dycore halo exchange
• Approach
  o Original halo exchange was written with linked lists
    • OpenACC loved it!
  o MMM rewrote the halo exchange with arrays
    • Worked with OpenACC, but with huge overhead due to bookkeeping on the CPU
    • Moved the MPI bookkeeping onto the GPUs; the bottleneck became send/recv buffer allocations on the CPU
  o MMM rewrote the halo exchange with once-per-execution buffer allocation
    • No more CPU overheads
  o STP and NVIDIA rewrote the halo exchange to minimize data transfers of the buffers (pattern sketched below)
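The end state of these rewrites follows a common pattern: the send buffer is allocated once, packed by a GPU kernel, and its device address is handed straight to a CUDA-aware MPI, so no per-exchange host staging or allocation is needed. The routine below is a generic sketch of that pattern; the halo_send name, argument list, and exchange-list layout are assumptions, not the MPAS_dmpar implementation.

! Sketch of a GPU-resident halo send: pack on the device, then pass the
! device buffer directly to CUDA-aware MPI via host_data. Assumes field,
! sendIdx, and sendbuf were allocated and placed on the device once at startup.
subroutine halo_send(field, nVertLevels, nCells, nSend, sendIdx, sendbuf, dest, tag, req)
   use mpi
   implicit none
   integer, intent(in)    :: nVertLevels, nCells, nSend, dest, tag
   integer, intent(in)    :: sendIdx(nSend)           ! owned cells needed by rank "dest"
   real,    intent(in)    :: field(nVertLevels, nCells)
   real,    intent(inout) :: sendbuf(nVertLevels*nSend)
   integer, intent(out)   :: req
   integer :: i, k, ierr

   ! Pack the halo columns on the GPU; no host copy of the field is touched.
   !$acc parallel loop collapse(2) present(field, sendIdx, sendbuf)
   do i = 1, nSend
      do k = 1, nVertLevels
         sendbuf((i-1)*nVertLevels + k) = field(k, sendIdx(i))
      end do
   end do

   ! host_data exposes the device address of sendbuf to CUDA-aware MPI,
   ! so the message leaves the GPU without an intermediate host buffer.
   !$acc host_data use_device(sendbuf)
   call MPI_Isend(sendbuf, nVertLevels*nSend, MPI_REAL, dest, tag, &
                  MPI_COMM_WORLD, req, ierr)
   !$acc end host_data
end subroutine halo_send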
Improving MPAS-A halo exchange performance: coalescing kernels
[code figure] Coalescing these 9 kernels dropped the MPI overhead by 50%.
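The idea behind coalescing, sketched generically below, is to replace many small pack kernels (one per exchanged field) with a single kernel that fills one combined buffer, so there is one launch and one MPI exchange instead of nine. The array layout and names here are illustrative; the actual MPAS kernels differ.

! Illustrative coalesced pack: one launch covers all fields instead of
! launching one small kernel (and one MPI exchange) per field.
subroutine pack_all_fields(fields, nFields, nVertLevels, nCells, nSend, sendIdx, sendbuf)
   implicit none
   integer, intent(in)  :: nFields, nVertLevels, nCells, nSend
   integer, intent(in)  :: sendIdx(nSend)
   real,    intent(in)  :: fields(nVertLevels, nCells, nFields)
   real,    intent(out) :: sendbuf(nVertLevels, nSend, nFields)
   integer :: f, i, k

   ! A single collapsed loop nest gives the GPU nFields*nSend*nVertLevels
   ! elements of work at once, amortizing launch and synchronization cost.
   !$acc parallel loop collapse(3) present(fields, sendIdx, sendbuf)
   do f = 1, nFields
      do i = 1, nSend
         do k = 1, nVertLevels
            sendbuf(k, i, f) = fields(k, sendIdx(i), f)
         end do
      end do
   end do
end subroutine pack_all_fields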
Optimizing the MPAS-A dynamical core: lessons learned
• Module-level allocatable variables (20 of them) were unnecessarily copied by the compiler from host to device just to initialize them with zeroes. Moved the initialization to the GPUs.
• dyn_tend: eliminated dynamic allocation and deallocation of variables that introduced H<->D data copies; they are now statically created.
• MPAS_reconstruct: originally kept on the CPU, now ported to the GPUs.
• MPAS_reconstruct: mixed F77 and F90 array syntax caused the compiler to serialize execution on the GPUs. Rewrote it with F90 constructs.
• Printing summary info for every timestep (on by default) consumed time. Turned it into a debug option.
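The first two lessons boil down to: create device data once and fill it on the device, rather than zeroing it on the host and letting the compiler ship the zeroes across PCIe. A minimal sketch of that pattern follows, with placeholder module and variable names rather than the actual MPAS symbols.

! Sketch: initialize a device-resident module array with a GPU kernel instead
! of the host assignment "tend_u = 0.0" plus an update, which would move a
! full array of zeroes from host to device.
module tend_arrays
   implicit none
   real, allocatable :: tend_u(:,:)
contains
   subroutine init_tend(nVertLevels, nEdges)
      integer, intent(in) :: nVertLevels, nEdges
      integer :: k, i

      allocate(tend_u(nVertLevels, nEdges))
      !$acc enter data create(tend_u)   ! device copy lives for the whole run

      !$acc parallel loop collapse(2) present(tend_u)
      do i = 1, nEdges
         do k = 1, nVertLevels
            tend_u(k, i) = 0.0
         end do
      end do
   end subroutine init_tend
end module tend_arrays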
Scalable MPAS Initialization on Summit: CDF5 performance
[chart: MPAS initialization time (sec, log scale) vs. number of AC922 nodes (1-1000) for the 15 km and 10 km meshes]
Strong scaling benchmark test setup
• MPAS-A version 6.x
• Test case: moist dynamics
• Compiler: GPU - PGI 19.4, CPU - Intel 19
• MPI: GPU - IBM Spectrum MPI, CPU - Intel MPI
• CPU: dual-socket Broadwell node with 36 cores
• GPU: NVIDIA Volta V100
• 10 km and 5 km problems
  o Timestep: 60 s and 30 s
  o Horizontal points: 5,898,242 and 23,592,962 (uniform grid)
  o Vertical: 56 levels
Strong scaling
[chart: Moist Dynamics Strong Scaling on Summit and Cheyenne at 10 km; time per timestep (sec) vs. number of GPUs or dual-socket CPU nodes (50-400), comparing the 5.8M-point case on GPUs and on CPUs]
Moist dynamics strong scaling at 5 km
[chart: time per timestep (sec) vs. number of GPUs (up to ~1800) for the 23M-point case on GPUs]
Weak scaling benchmark test setup
• MPAS-A version 6.x
• Test case: moist dynamics
• Compiler: GPU - PGI 19.4, CPU - Intel 19
• MPI: GPU - IBM Spectrum MPI, CPU - Intel MPI
• CPU: dual-socket Broadwell node with 36 cores
• GPU: NVIDIA Volta V100
• 120-60-30-15-10-5 km problems
  o Timestep: 720, 300, 180, 90, 60, 30 s
  o Horizontal points/rank: 40,962 or 81,921 (uniform grid)
  o Vertical: 56 levels
Weak scaling
[chart: Weak Scaling, Moist Dynamics with 6 tracers, Summit, 120 km-5 km, 6 GPUs (6 MPI ranks) per node; time per timestep (sec) vs. number of GPUs/MPI ranks (up to ~600), for 40k and 80k points per GPU]
MPAS Physics: order of tasks
• Build a methodology that supports re-integration for all physics modules (50%)
  o Must be flexible enough to validate or integrate
  o Must be able to run individual portions on CPU or GPU as required
• Upgrade, integrate, validate & optimize WSM6 (20%)
• Benchmark dycore-scalar-WSM6
• Upgrade, integrate & validate YSU and gravity wave drag (15%)
• Benchmark dycore-scalar-WSM6-YSU-GWDO
• Upgrade, integrate & validate Monin-Obukhov (5%)
• Benchmark dycore-scalar-WSM6-YSU-Monin-Obukhov
• Upgrade, integrate & validate Ntiedtke (10%)
• Benchmark full MPAS
What does a methodology look like? [annotated code listing]
• Grep-searchable help string
• Preprocessor directive to offload a routine to the CPU
• Flip between GPU and CPU based on requirement
Methodology description
• Repeating this layout for all physics modules completes the framework
• The preprocessor directives will be removed after validation
• The methodology includes the required data directives
  o Noah & radiation included
(A sketch of the CPU/GPU flip is shown below.)
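As a rough illustration of such a flip, the driver below compiles either an OpenACC path or a CPU fallback depending on a build-time macro, with a grep-able string marking which path is active. The macro name MPAS_WSM6_ON_CPU, the routine name, and the placeholder computation are all assumptions, not the project's actual identifiers.

! Illustrative CPU/GPU flip for one physics driver; assumes qv already has a
! device copy created by the surrounding data region.
subroutine wsm6_driver(qv, nVertLevels, nCells)
   implicit none
   integer, intent(in)    :: nVertLevels, nCells
   real,    intent(inout) :: qv(nVertLevels, nCells)
   integer :: k, i

#ifdef MPAS_WSM6_ON_CPU
   ! HELP: "WSM6 on CPU" -- grep-able marker showing which path was built
   !$acc update host(qv)            ! pull state back for the CPU fallback
   do i = 1, nCells
      do k = 1, nVertLevels
         qv(k, i) = max(qv(k, i), 0.0)   ! stand-in for the real microphysics
      end do
   end do
   !$acc update device(qv)          ! push results back so the GPU dycore sees them
#else
   ! HELP: "WSM6 on GPU"
   !$acc parallel loop collapse(2) present(qv)
   do i = 1, nCells
      do k = 1, nVertLevels
         qv(k, i) = max(qv(k, i), 0.0)   ! stand-in for the real microphysics
      end do
   end do
#endif
end subroutine wsm6_driver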
Projected Full MPAS Performance
MPAS-A estimated timestep budget for 40k points per GPU
[pie chart: dynamics (dry), dynamics (moist), physics, radiation comms, halo comms, H<->D data transfer; labelled values include 0.139 sec, 0.085 sec, 0.06 sec, 0.03 sec, 0.003 sec, 0.22 sec]
• Dynamics (dry + moist + halo): 0.18 s instead of the expected 0.018 sec
• Physics (WSM6 + YSU): 0.078 s + 0.008 s = 0.086 s
  o Ntiedtke takes 0.04 s on the CPU
  o Noah and MO together take less than 1 ms on the CPU
• H<->D data transfer: pending
• Total time: 0.275 sec/step
• 15 km -> 64 V100 GPUs; throughput ~0.9 years/day
Future Work
• MPAS performance
  o Optimization of the remaining physics schemes
  o Verification and integration of the remaining physics schemes
  o Integrating lagged radiation
Thank you! Questions?
Moist Dynamics Strong Scaling on Summit at 10 & 15 km
[chart: days/hour (log scale) vs. number of GPUs (8-512) for the 15 km and 10 km cases, with the AVEC forecast threshold marked]
How does the scaling compare to dry dynamics? Splitting out tracer timings / tracer scaling
[chart: time per timestep (sec) vs. number of GPUs (100-350), comparing moist dynamics with 6 tracers to dry dynamics]