Accelerating the Cloud Scheme Within the Unified Model for CPU-GPU Based High-Performance Computing Systems
Wei Zhang, Min Xu, Mario Morales Hernandez, Matthew Norman, Salil Mahajan, and Katherine Evans
2019 MultiCore 9 Workshop, Sep 25, 2019
ORNL is managed by UT-Battelle, LLC for the US Department of Energy
Thanks to the US Air Force, DOE OLCF, Met Office, and EPCC for support and help!
Content
• Overview
  – Introduction of the project and motivation
  – The CASIM cloud scheme
  – The OLCF Summit supercomputer
• CASIM on Summit, from CPU to GPU: current status and future plans
Overview: Forecast Extension to Hydrology – From Rainfall to Flood
[Figure: the rainfall-to-flood chain – rainfall, runoff, streamflow, inundation – linking weather forecasting models, the NASA Land Information System (LIS), the ERDC Streamflow Prediction System (SPT), and the ORNL/TTU TRITON-GPU model, with UM optimization (cloud scheme, radiation scheme) feeding the weather forecasting stage.]
Overview: Air Force Weather, OLCF Summit, and ORNL collaboration
• Cloud AeroSol Interacting Microphysics (CASIM)
• Met Office Unified Model (UM)
What is cloud microphysics?
"Cloud microphysics concerns the mechanisms by which cloud droplets are generated from water vapor and the particles in the air, and grow to form raindrops, ice and snow."
-- John M. Wallace and Peter V. Hobbs, Atmospheric Science (Second Edition), 2006

Relative sizes of cloud droplets, raindrops, and cloud condensation nuclei (CCN)
(r: radius in um; n: number per liter of air; v: fall speed in cm/s)
• CCN: r = 0.1, n = 10^6, v = 0.0001
• Typical cloud droplet: r = 10, n = 10^6, v = 1
• Large cloud droplet: r = 50, n = 10^3, v = 27
• Typical raindrop: r = 1000, n = 1, v = 650
Why does cloud microphysics matter?
• The evolution of cloud/rain mass and the number concentrations of droplets and particles
• Latent heating/cooling and temperature
  – condensation, evaporation, deposition, sublimation, freezing, melting
• Affects surface processes, radiative transfer, cloud-aerosol-precipitation interactions, ...
[Figure: schematics of some of the warm cloud and precipitation microphysical processes]
Cloud AeroSol Interacting Microphysics (CASIM)
• Long-term replacement for the UM's default microphysics
• User definable:
  – number of cloud species (e.g., cloud, rain, ice, snow, graupel)
  – number of moments describing each species (1 = mass; 2 = mass + number; 3 = mass + number + radar reflectivity)
• Detailed representation of aerosol effects and in-cloud processing of aerosol
  – increased accuracy
  – more intensive calculation
• CASIM/src
  – Modern Fortran code
  – 16,329 total lines, 116 subroutines
[Figure: time cost (s) of bulk vs. bin schemes, run in the UM on the same COPE case with different microphysics schemes; adapted from a Met Office "HPC + GPU Computing" technical paper.]
[Figure: wallclock for KiD_1D simulations on Summit (no parallelism) – same model, same cumulus case, different microphysics schemes: Tau-bin, Tp07-1M, Tp09-1M, Morr-2m, and CASIM using the standard bulk scheme.]
Oak Ridge Leadership Computing Facility (OLCF) Summit
• Objectives
  – Apply new coding approaches to CASIM for GPUs
  – Develop algorithms suited to accelerated machines (Summit now, Frontier in the future)
• Compilers
  – PGI (19.7 on Summit)
  – Cray (will be available when Frontier comes out)
  – CLAW (a source-to-source translator that produces code for the target architectures and directive languages, https://github.com/claw-project/claw-compiler)
• Directive
  – OpenACC
• Considerations
  – Portability limitations, CPU-GPU communication
  – Validation and verification, robust testing
  – The software stack for these new computing systems
CASIM on Summit
• Parent model: the Kinematic Driver model (KiD; Shipway and Hill, 2011)
  – A kinematic framework that constrains the dynamics and isolates the microphysics
  – The original KiD has no parallelization directives
• Baseline case: 2D squall line
  – nx = 320, dx = 750 m; nz = 48, dz = 250 m
  – dt = 1 s, t_total = 3600 s, output saved every 60 s
• Step 1: Assess KiD-CASIM 2D-SQUALL performance on CPU
  – Profiling tool: General Purpose Timing Library (GPTL), https://jmrosinski.github.io/GPTL/
  – CASIM in KiD: 1019.095 s / 1187.963 s = 85.79% of total runtime
  – micro_main in CASIM: 987.515 s / 1019.095 s = 96.90% of CASIM time
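For reference, a minimal sketch of what GPTL instrumentation around the hotspot could look like; the region name and the loop body are placeholders, not the actual KiD-CASIM call sites. GPTL's Fortran module provides gptlinitialize, gptlstart, gptlstop, and gptlpr.

```fortran
! Minimal GPTL instrumentation sketch; 'micro_main' and the loop body
! are stand-ins for the real CASIM call sites.
program gptl_sketch
  use gptl
  implicit none
  integer :: ret, step
  real(8) :: work

  ret = gptlinitialize()                 ! must precede any timer call
  work = 0.0d0

  do step = 1, 3600
    ret = gptlstart('micro_main')        ! open the timed region
    work = work + sqrt(real(step, 8))    ! stand-in for the CASIM call
    ret = gptlstop('micro_main')         ! close it
  end do

  ret = gptlpr(0)                        ! write the per-rank timing report
  print *, work
end program gptl_sketch
```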
• Step 2: Get CASIM ready for GPU (ongoing)
• General idea:
  – Optimize the most time-consuming parts
  – Avoid/minimize data transfer between CPU and GPU
• Idealized solution: a GPU region sandwiched between two CPU calculation regions:

```fortran
do i = is, ie
  do j = js, je
    call cpu_calculation1()
  end do
end do
!---------------------------------------------------------
do i = is, ie
  do j = js, je
    call gpu_calculation()
  end do
end do
!---------------------------------------------------------
do i = is, ie
  do j = js, je
    call cpu_calculation2()
  end do
end do
```

but ...
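Before the complications below, here is a minimal OpenACC sketch of what that idealized sandwich could look like, assuming hypothetical array names (q, tend) and a placeholder kernel body; the CPU stages are represented by comments. This is not the actual CASIM interface.

```fortran
! Idealized CPU-GPU-CPU sandwich: one transfer in, one copy out.
subroutine gpu_sandwich(q, tend, is, ie, js, je, nz)
  implicit none
  integer, intent(in)    :: is, ie, js, je, nz
  real(8), intent(inout) :: q(nz, js:je, is:ie)
  real(8), intent(out)   :: tend(nz, js:je, is:ie)
  integer :: i, j, k

  ! ... cpu_calculation1 would run here, on the host ...

  !$acc data copy(q) copyout(tend)
  !$acc parallel loop collapse(2)
  do i = is, ie
    do j = js, je
      do k = 1, nz                      ! column work runs on the GPU
        tend(k, j, i) = 0.1d0 * q(k, j, i)
        q(k, j, i)    = q(k, j, i) + tend(k, j, i)
      end do
    end do
  end do
  !$acc end data

  ! ... cpu_calculation2 would run here, on the host ...
end subroutine gpu_sandwich
```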
• Challenge 1: derived data types
Two approaches:
1) -ta=tesla:deepcopy (testing)
2) change to a flat array (bit-for-bit agreement on CPU confirmed)

Original layout:

```fortran
type :: process_rate
  real(wp), allocatable :: column_data(:)
end type process_rate
...
type(process_rate), allocatable :: procs(:,:)
...
allocate(procs(ntotalq, nprocs))
...
call micro_common(..., procs, ...)
```

Flat-array layout:

```fortran
type :: process_rate
  real(wp), pointer :: column_data(:)
end type process_rate
...
real(wp), target, allocatable :: procs_flat(:,:,:)
type(process_rate), allocatable :: procs(:,:)
...
allocate(procs(ntotalq, nprocs))
allocate(procs_flat(nz, ntotalq, nprocs))
do iprocs = 1, nprocs
  do iq = 1, ntotalq
    procs(iq, iprocs)%column_data => procs_flat(1:nz, iq, iprocs)
  end do
end do
...
call micro_common(..., procs_flat, ...)
```
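To illustrate why the flat array helps: a derived type with pointer or allocatable members needs the PGI deepcopy machinery (-ta=tesla:deepcopy) so that the member data follows the struct to the device, while one contiguous array moves with a plain copy clause. A minimal sketch with illustrative sizes and a placeholder kernel body:

```fortran
! One contiguous array: a single host-device transfer, no deep copy.
program flat_array_sketch
  implicit none
  integer, parameter :: nz = 48, ntotalq = 10, nprocs = 20
  real(8), allocatable :: procs_flat(:,:,:)
  integer :: k, iq, ip

  allocate(procs_flat(nz, ntotalq, nprocs))
  procs_flat = 0.0d0

  !$acc data copy(procs_flat)
  !$acc parallel loop collapse(3)
  do ip = 1, nprocs
    do iq = 1, ntotalq
      do k = 1, nz
        procs_flat(k, iq, ip) = procs_flat(k, iq, ip) + 1.0d0
      end do
    end do
  end do
  !$acc end data

  print *, 'sum =', sum(procs_flat)     ! expect nz*ntotalq*nprocs
end program flat_array_sketch
```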
• Challenge 2: the n-loop and k-loop are not parallelizable now; the hotspots sit deep in the call tree, inside three levels of nested loops:

```fortran
do i = is, ie
  do j = js, je
    ...
    do n = 1, nsubsteps                       ! not parallelizable
      ...
      !! early exit if no hydrometeors and subsaturated
      if (.not. any(precondition(:))) exit
      !! do the business
      do k = 1, nz                            ! hotspots and vertical dependence
        ...
      end do !! k
      ...
    end do !! n
    ...
  end do !! j
end do !! i
```
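A minimal sketch, with hypothetical names, of the parallelism that is available today: the independent horizontal (i, j) loops collapse onto the GPU, while the n-loop (early exit) and the k-loop (vertical dependence) stay sequential within each thread. The per-column precondition test here replaces the slide's any(precondition(:)) check for illustration.

```fortran
subroutine offload_ij(q, rate, precondition, is, ie, js, je, nz, nsubsteps, dt_sub)
  implicit none
  integer, intent(in)    :: is, ie, js, je, nz, nsubsteps
  real(8), intent(in)    :: dt_sub
  real(8), intent(in)    :: rate(nz, js:je, is:ie)
  logical, intent(in)    :: precondition(js:je, is:ie)
  real(8), intent(inout) :: q(nz, js:je, is:ie)
  integer :: i, j, k, n

  !$acc parallel loop collapse(2) copyin(rate, precondition) copy(q)
  do i = is, ie
    do j = js, je
      do n = 1, nsubsteps
        if (.not. precondition(j, i)) exit     ! sequential early exit
        do k = 1, nz                           ! sequential vertical sweep
          q(k, j, i) = q(k, j, i) + dt_sub * rate(k, j, i)
        end do
      end do
    end do
  end do
end subroutine offload_ij
```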
• Earlier work at EPCC and the UK Met Office:
  – Porting the microphysics model CASIM to GPU and KNL Cray machines (Brown et al., 2016)
  – Parent model: the Met Office NERC Cloud model (MONC)
  – Compiler: Cray
  – Directive: OpenACC
  – Offloaded the whole of CASIM onto the GPU on Piz Daint XC50 and XC30
Lesson we learned: much more code refactoring is needed to
• maximize the amount of parallelism on the GPU
• minimize the amount of data transfer between CPU and GPU
[Figure: performance results running up against a memory limit; from "Accelerating the microphysics model CASIM using OpenACC", Alexandr Nigay, 2016]
How can we increase the parallelization?

```fortran
do n = 1, nsubsteps
  if (.not. any(precondition(:))) exit
  ...
  do k = nz-1, 1, -1                 ! downward sweep: flux(k) needs flux(k+1)
    flux(k) = functions(flux(k+1))
    call function(qfields(1:nz))
    ! ... update qfields(1:nz)
  end do !! k
end do !! n
```
Possible new way of parallelizing the n-loop and k-loop
[Animation sequence: a grid with substeps n = 1, 2, 3, ..., nsubstep-1, nsubstep across and levels k = nz, nz-1, nz-2, nz-3, ..., 3, 2, 1 down; each build advances the computation by one diagonal, so the (n, k) pairs along a diagonal are processed concurrently.]
Limitation: nsubstep >= nz
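One way to realize a schedule like this (an inference from the diagram, not the authors' stated implementation) is a wavefront sweep over anti-diagonals: pairs (n, k) on the same diagonal are independent and can run concurrently, and the pipeline only fills completely when nsubstep >= nz. A hedged sketch assuming the dependence pattern (n, k) <- (n, k+1) and (n, k) <- (n-1, k), with all names as placeholders:

```fortran
subroutine wavefront_sketch(a, nsubsteps, nz)
  implicit none
  integer, intent(in)    :: nsubsteps, nz
  real(8), intent(inout) :: a(nz, nsubsteps)
  integer :: d, n, k
  real(8) :: from_above, from_prev

  do d = 2, nsubsteps + nz                     ! sweep diagonals in order
    !$acc parallel loop private(k, from_above, from_prev)
    do n = max(1, d - nz), min(nsubsteps, d - 1)
      k = nz - (d - n) + 1                     ! level paired with substep n
      from_above = 0.0d0
      from_prev  = 0.0d0
      if (k < nz) from_above = a(k + 1, n)     ! computed on diagonal d-1
      if (n > 1)  from_prev  = a(k, n - 1)     ! computed on diagonal d-1
      a(k, n) = 0.5d0 * (from_above + from_prev)   ! placeholder update
    end do
  end do
end subroutine wavefront_sketch
```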
How can we reduce the memory traffic?
• many conditional if-branches
• a lookup table for the gamma function in sedimentation.F90
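A minimal sketch of what such a lookup table could look like, assuming a fixed argument range and linear interpolation; CASIM's actual table in sedimentation.F90 may be built differently. The table is filled once on the host with the Fortran 2008 gamma intrinsic and kept resident on the GPU.

```fortran
module gamma_lut
  implicit none
  integer, parameter :: ntab = 4096
  real(8), parameter :: xmin = 0.5d0, xmax = 20.0d0
  real(8) :: tab(ntab)
  !$acc declare create(tab)
contains
  subroutine build_table()
    integer :: i
    real(8) :: x
    do i = 1, ntab
      x = xmin + (xmax - xmin) * real(i - 1, 8) / real(ntab - 1, 8)
      tab(i) = gamma(x)              ! Fortran 2008 intrinsic, host side
    end do
    !$acc update device(tab)         ! keep a copy resident on the GPU
  end subroutine build_table

  function gamma_lookup(x) result(g)
    !$acc routine seq
    real(8), intent(in) :: x
    real(8) :: g, t
    integer :: i
    ! map x to a fractional table index, then interpolate linearly;
    ! no out-of-range guard beyond clamping the base index
    t = (x - xmin) / (xmax - xmin) * real(ntab - 1, 8) + 1.0d0
    i = max(1, min(ntab - 1, int(t)))
    g = tab(i) + (t - real(i, 8)) * (tab(i + 1) - tab(i))
  end function gamma_lookup
end module gamma_lut
```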