Accelerating the Cloud Scheme Within the Unified Model for CPU-GPU Based High-Performance Computing Systems
Wei Zhang, Min Xu, Mario Morales Hernandez, Matthew Norman, Salil Mahajan, and Katherine Evans
2019 MultiCore 9 Workshop, Sep 25, 2019
ORNL is managed by UT-Battelle, LLC for the US Department of Energy
Thanks to the US Air Force, DOE OLCF, Met Office, and EPCC for support and help!
Content
• Overview
  – Introduction of the project and motivation
  – The CASIM cloud scheme
  – The OLCF Summit supercomputer
• CASIM on Summit, from CPU to GPU: current status and future plans
Overview: Forecast Extension to Hydrology – From Rainfall to Flood
[Figure: the rainfall-to-flood chain – rainfall, runoff, streamflow, inundation – linking weather forecasting models, the NASA Land Information System (LIS), the ERDC Streamflow Prediction System (SPT), and the ORNL/TTU TRITON-GPU model, with UM optimization (cloud scheme, radiation scheme) feeding the weather forecasting stage.]
Overview: Air Force Weather, OLCF Summit, and ORNL collaboration
• Cloud AeroSol Interacting Microphysics (CASIM)
• Met Office Unified Model (UM)
What is cloud microphysics?
"Cloud microphysics concerns the mechanisms by which cloud droplets are generated from water vapor and the particles in the air, and grow to form raindrops, ice and snow."
-- John M. Wallace and Peter V. Hobbs, Atmospheric Science (Second Edition), 2006

Relative sizes of cloud droplets, raindrops, and cloud condensation nuclei (CCN)
(r: radius in um; n: number per liter of air; v: fall speed in cm/s)
• CCN: r = 0.1, n = 10^6, v = 0.0001
• Typical cloud droplet: r = 10, n = 10^6, v = 1
• Large cloud droplet: r = 50, n = 10^3, v = 27
• Typical raindrop: r = 1000, n = 1, v = 650
Why does cloud microphysics matter?
• The evolution of cloud/rain mass and the number concentrations of droplets and particles
• Latent heating/cooling and temperature
  – condensation, evaporation, deposition, sublimation, freezing, melting
• Affects surface processes, radiative transfer, cloud-aerosol-precipitation interactions, ...
[Figure: schematics of some of the warm cloud and precipitation microphysical processes]
Cloud AeroSol Interacting Microphysics (CASIM)
• Long-term replacement for the UM's default microphysics
• User definable:
  – number of cloud species (e.g., cloud, rain, ice, snow, graupel)
  – number of moments describing each species (1 = mass; 2 = mass + number; 3 = mass + number + radar reflectivity)
• Detailed representation of aerosol effects and in-cloud processing of aerosol
  – increased accuracy
  – more intensive calculation
• CASIM/src
  – Modern Fortran code
  – 16,329 total lines, 116 subroutines
[Figure: time cost (s) of bulk vs. bin schemes, run in the UM on the same COPE case with different microphysics schemes; adapted from a Met Office "HPC + GPU Computing" technical paper.]
[Figure: wallclock for KiD_1D simulations on Summit (no parallelism) – same model, same cumulus case, different microphysics schemes: Tau-bin, Tp07-1M, Tp09-1M, Morr-2m, and CASIM using the standard bulk scheme.]
Oak Ridge Leadership Computing Facility (OLCF) Summit
• Objectives
  – Apply new coding approaches to CASIM for GPUs
  – Develop algorithms suited to accelerated machines (Summit now, Frontier in the future)
• Compilers
  – PGI (19.7 on Summit)
  – Cray (will be available when Frontier comes out)
  – CLAW (a source-to-source translator that produces code for the target architectures and directive languages, https://github.com/claw-project/claw-compiler)
• Directive
  – OpenACC
• Considerations
  – Portability limitations, CPU-GPU communication
  – Validation and verification, robust testing
  – The software stack for these new computing systems
CASIM on Summit
• Parent model: the Kinematic Driver model (KiD; Shipway and Hill, 2011)
  – A kinematic framework that constrains the dynamics and isolates the microphysics
  – The original KiD has no parallelization directives
• Baseline case: 2D squall line
  – nx = 320, dx = 750 m; nz = 48, dz = 250 m
  – dt = 1 s, t_total = 3600 s, output saved every 60 s
• Step 1: Assess KiD-CASIM 2D-SQUALL performance on CPU
  – Profiling tool: General Purpose Timing Library (GPTL), https://jmrosinski.github.io/GPTL/
  – CASIM in KiD: 1019.095 s / 1187.963 s = 85.79% of total runtime
  – micro_main in CASIM: 987.515 s / 1019.095 s = 96.90% of CASIM time
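For reference, a minimal sketch of what GPTL instrumentation around the hotspot could look like; the region name and the loop body are placeholders, not the actual KiD-CASIM call sites. GPTL's Fortran module provides gptlinitialize, gptlstart, gptlstop, and gptlpr.

```fortran
! Minimal GPTL instrumentation sketch; 'micro_main' and the loop body
! are stand-ins for the real CASIM call sites.
program gptl_sketch
  use gptl
  implicit none
  integer :: ret, step
  real(8) :: work

  ret = gptlinitialize()                 ! must precede any timer call
  work = 0.0d0

  do step = 1, 3600
    ret = gptlstart('micro_main')        ! open the timed region
    work = work + sqrt(real(step, 8))    ! stand-in for the CASIM call
    ret = gptlstop('micro_main')         ! close it
  end do

  ret = gptlpr(0)                        ! write the per-rank timing report
  print *, work
end program gptl_sketch
```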
• Step 2: Get CASIM ready for GPU (ongoing)
• General idea:
  – Optimize the most time-consuming parts
  – Avoid/minimize data transfer between CPU and GPU
• Idealized solution: a GPU region sandwiched between two CPU calculation regions:

```fortran
do i = is, ie
  do j = js, je
    call cpu_calculation1()
  end do
end do
!---------------------------------------------------------
do i = is, ie
  do j = js, je
    call gpu_calculation()
  end do
end do
!---------------------------------------------------------
do i = is, ie
  do j = js, je
    call cpu_calculation2()
  end do
end do
```

but ...
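Before the complications below, here is a minimal OpenACC sketch of what that idealized sandwich could look like, assuming hypothetical array names (q, tend) and a placeholder kernel body; the CPU stages are represented by comments. This is not the actual CASIM interface.

```fortran
! Idealized CPU-GPU-CPU sandwich: one transfer in, one copy out.
subroutine gpu_sandwich(q, tend, is, ie, js, je, nz)
  implicit none
  integer, intent(in)    :: is, ie, js, je, nz
  real(8), intent(inout) :: q(nz, js:je, is:ie)
  real(8), intent(out)   :: tend(nz, js:je, is:ie)
  integer :: i, j, k

  ! ... cpu_calculation1 would run here, on the host ...

  !$acc data copy(q) copyout(tend)
  !$acc parallel loop collapse(2)
  do i = is, ie
    do j = js, je
      do k = 1, nz                      ! column work runs on the GPU
        tend(k, j, i) = 0.1d0 * q(k, j, i)
        q(k, j, i)    = q(k, j, i) + tend(k, j, i)
      end do
    end do
  end do
  !$acc end data

  ! ... cpu_calculation2 would run here, on the host ...
end subroutine gpu_sandwich
```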
• Challenge 1: derived data types
Two approaches:
1) -ta=tesla:deepcopy (testing)
2) change to a flat array (bit-for-bit agreement on CPU confirmed)

Original layout:

```fortran
type :: process_rate
  real(wp), allocatable :: column_data(:)
end type process_rate
...
type(process_rate), allocatable :: procs(:,:)
...
allocate(procs(ntotalq, nprocs))
...
call micro_common(..., procs, ...)
```

Flat-array layout:

```fortran
type :: process_rate
  real(wp), pointer :: column_data(:)
end type process_rate
...
real(wp), target, allocatable :: procs_flat(:,:,:)
type(process_rate), allocatable :: procs(:,:)
...
allocate(procs(ntotalq, nprocs))
allocate(procs_flat(nz, ntotalq, nprocs))
do iprocs = 1, nprocs
  do iq = 1, ntotalq
    procs(iq, iprocs)%column_data => procs_flat(1:nz, iq, iprocs)
  end do
end do
...
call micro_common(..., procs_flat, ...)
```
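To illustrate why the flat array helps: a derived type with pointer or allocatable members needs the PGI deepcopy machinery (-ta=tesla:deepcopy) so that the member data follows the struct to the device, while one contiguous array moves with a plain copy clause. A minimal sketch with illustrative sizes and a placeholder kernel body:

```fortran
! One contiguous array: a single host-device transfer, no deep copy.
program flat_array_sketch
  implicit none
  integer, parameter :: nz = 48, ntotalq = 10, nprocs = 20
  real(8), allocatable :: procs_flat(:,:,:)
  integer :: k, iq, ip

  allocate(procs_flat(nz, ntotalq, nprocs))
  procs_flat = 0.0d0

  !$acc data copy(procs_flat)
  !$acc parallel loop collapse(3)
  do ip = 1, nprocs
    do iq = 1, ntotalq
      do k = 1, nz
        procs_flat(k, iq, ip) = procs_flat(k, iq, ip) + 1.0d0
      end do
    end do
  end do
  !$acc end data

  print *, 'sum =', sum(procs_flat)     ! expect nz*ntotalq*nprocs
end program flat_array_sketch
```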
• Challenge 2: the n-loop and k-loop are not parallelizable now; the hotspots sit deep in the call tree, inside three levels of nested loops:

```fortran
do i = is, ie
  do j = js, je
    ...
    do n = 1, nsubsteps                       ! not parallelizable
      ...
      !! early exit if no hydrometeors and subsaturated
      if (.not. any(precondition(:))) exit
      !! do the business
      do k = 1, nz                            ! hotspots and vertical dependence
        ...
      end do !! k
      ...
    end do !! n
    ...
  end do !! j
end do !! i
```
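A minimal sketch, with hypothetical names, of the parallelism that is available today: the independent horizontal (i, j) loops collapse onto the GPU, while the n-loop (early exit) and the k-loop (vertical dependence) stay sequential within each thread. The per-column precondition test here replaces the slide's any(precondition(:)) check for illustration.

```fortran
subroutine offload_ij(q, rate, precondition, is, ie, js, je, nz, nsubsteps, dt_sub)
  implicit none
  integer, intent(in)    :: is, ie, js, je, nz, nsubsteps
  real(8), intent(in)    :: dt_sub
  real(8), intent(in)    :: rate(nz, js:je, is:ie)
  logical, intent(in)    :: precondition(js:je, is:ie)
  real(8), intent(inout) :: q(nz, js:je, is:ie)
  integer :: i, j, k, n

  !$acc parallel loop collapse(2) copyin(rate, precondition) copy(q)
  do i = is, ie
    do j = js, je
      do n = 1, nsubsteps
        if (.not. precondition(j, i)) exit     ! sequential early exit
        do k = 1, nz                           ! sequential vertical sweep
          q(k, j, i) = q(k, j, i) + dt_sub * rate(k, j, i)
        end do
      end do
    end do
  end do
end subroutine offload_ij
```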
• Earlier work at EPCC and the UK Met Office:
  – Porting the microphysics model CASIM to GPU and KNL Cray machines (Brown et al., 2016)
  – Parent model: the Met Office NERC Cloud model (MONC)
  – Compiler: Cray
  – Directive: OpenACC
  – Offloaded the whole of CASIM onto the GPU on Piz Daint XC50 and XC30
Lesson we learned: much more code refactoring is needed to
• maximize the amount of parallelism on the GPU
• minimize the amount of data transfer between CPU and GPU
[Figure: performance results running up against a memory limit; from "Accelerating the microphysics model CASIM using OpenACC", Alexandr Nigay, 2016]
How can we increase the parallelization?

```fortran
do n = 1, nsubsteps
  if (.not. any(precondition(:))) exit
  ...
  do k = nz-1, 1, -1                 ! downward sweep: flux(k) needs flux(k+1)
    flux(k) = functions(flux(k+1))
    call function(qfields(1:nz))
    ! ... update qfields(1:nz)
  end do !! k
end do !! n
```
Possible new way of parallelizing the n-loop and k-loop
[Animation sequence: a grid with substeps n = 1, 2, 3, ..., nsubstep-1, nsubstep across and levels k = nz, nz-1, nz-2, nz-3, ..., 3, 2, 1 down; each build advances the computation by one diagonal, so the (n, k) pairs along a diagonal are processed concurrently.]
Limitation: nsubstep >= nz
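One way to realize a schedule like this (an inference from the diagram, not the authors' stated implementation) is a wavefront sweep over anti-diagonals: pairs (n, k) on the same diagonal are independent and can run concurrently, and the pipeline only fills completely when nsubstep >= nz. A hedged sketch assuming the dependence pattern (n, k) <- (n, k+1) and (n, k) <- (n-1, k), with all names as placeholders:

```fortran
subroutine wavefront_sketch(a, nsubsteps, nz)
  implicit none
  integer, intent(in)    :: nsubsteps, nz
  real(8), intent(inout) :: a(nz, nsubsteps)
  integer :: d, n, k
  real(8) :: from_above, from_prev

  do d = 2, nsubsteps + nz                     ! sweep diagonals in order
    !$acc parallel loop private(k, from_above, from_prev)
    do n = max(1, d - nz), min(nsubsteps, d - 1)
      k = nz - (d - n) + 1                     ! level paired with substep n
      from_above = 0.0d0
      from_prev  = 0.0d0
      if (k < nz) from_above = a(k + 1, n)     ! computed on diagonal d-1
      if (n > 1)  from_prev  = a(k, n - 1)     ! computed on diagonal d-1
      a(k, n) = 0.5d0 * (from_above + from_prev)   ! placeholder update
    end do
  end do
end subroutine wavefront_sketch
```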
How can we reduce the memory traffic?
• many conditional if-branches
• a lookup table for the gamma function in sedimentation.F90
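A minimal sketch of what such a lookup table could look like, assuming a fixed argument range and linear interpolation; CASIM's actual table in sedimentation.F90 may be built differently. The table is filled once on the host with the Fortran 2008 gamma intrinsic and kept resident on the GPU.

```fortran
module gamma_lut
  implicit none
  integer, parameter :: ntab = 4096
  real(8), parameter :: xmin = 0.5d0, xmax = 20.0d0
  real(8) :: tab(ntab)
  !$acc declare create(tab)
contains
  subroutine build_table()
    integer :: i
    real(8) :: x
    do i = 1, ntab
      x = xmin + (xmax - xmin) * real(i - 1, 8) / real(ntab - 1, 8)
      tab(i) = gamma(x)              ! Fortran 2008 intrinsic, host side
    end do
    !$acc update device(tab)         ! keep a copy resident on the GPU
  end subroutine build_table

  function gamma_lookup(x) result(g)
    !$acc routine seq
    real(8), intent(in) :: x
    real(8) :: g, t
    integer :: i
    ! map x to a fractional table index, then interpolate linearly;
    ! no out-of-range guard beyond clamping the base index
    t = (x - xmin) / (xmax - xmin) * real(ntab - 1, 8) + 1.0d0
    i = max(1, min(ntab - 1, int(t)))
    g = tab(i) + (t - real(i, 8)) * (tab(i + 1) - tab(i))
  end function gamma_lookup
end module gamma_lut
```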