Unified CPU+GPU Programming for the Production Weather Model ASUCA
Michel Müller, Research Assistant, Aoki Laboratory
michel@sim.gsic.titech.ac.jp
Supervised by Prof. Dr. Takayuki Aoki, Tokyo Institute of Technology
Creative Commons: NASA Goddard Space Flight Center, 2010
ASUCA
• National Japanese weather model
• In production since 2014 (PowerPC)
• Meso-scale
• Non-hydrostatic
• Regular mesh FEM
• Dynamical core and physical processes

Unified
• Single Fortran code
• Performant on both CPU and GPU
• Applicable to both physics and dynamics
Motivation Hybrid Fortran Results Outlook ➤ ➤ ➤
Unified
• Single Fortran code
• Performant on both CPU and GPU
• Applicable to both physics and dynamics

Callouts added step by step around these goals:
• Coarse grained parallelism
• GPU unfriendly storage order
• Separation of device/host code
• Data movement to/from device memory
• CUDA boilerplate
[Figure sequence: coarse grained parallelism, ending on an open question "?"]
[Figure sequence: dynamical core and physical processes of ASUCA]
[Figure: coarse grained parallelism]
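To make the term concrete, here is a toy sketch of the two loop placements. It is written in Python only because that is easy to run; it is not ASUCA or Hybrid Fortran code. The routine names `calculate_all_columns` and `sum_column` echo the call graph on the Build System slide, while the grid sizes, the `_coarse` / `_fine` suffixes and `sum_column_kernel` are assumptions made for illustration. Plain loops stand in for parallel regions.

```python
# Toy illustration only: plain Python loops stand in for parallel regions.
NX, NY, NZ = 4, 3, 5  # hypothetical grid extents


def make_field():
    """Build field[i][j][k] with a distinct value per grid point."""
    return [[[i * 100 + j * 10 + k for k in range(NZ)]
             for j in range(NY)]
            for i in range(NX)]


def sum_column(column):
    """Work on a single vertical column; sequential over k."""
    total = 0
    for value in column:
        total += value
    return total


def calculate_all_columns_coarse(field):
    """Coarse grained: the (i, j) loop sits high in the call graph and
    each iteration calls a routine that handles one whole column."""
    return [[sum_column(field[i][j]) for j in range(NY)] for i in range(NX)]


def sum_column_kernel(field, result):
    """Fine grained: the routine receives the full field and the (i, j)
    loop is placed directly around the arithmetic, like a GPU kernel
    whose threads each process one column."""
    for i in range(NX):          # models the thread grid in i
        for j in range(NY):      # models the thread grid in j
            total = 0
            for k in range(NZ):  # stays sequential inside each thread
                total += field[i][j][k]
            result[i][j] = total


def calculate_all_columns_fine(field):
    result = [[0] * NY for _ in range(NX)]
    sum_column_kernel(field, result)
    return result


if __name__ == "__main__":
    field = make_field()
    print(calculate_all_columns_coarse(field) == calculate_all_columns_fine(field))
```

Both placements compute the same sums. What differs is how far up the call graph the parallel loop sits, which is typically where CPU threading (few heavy threads) and GPU kernels (many light threads) have opposite preferences.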
[Figure sequence: GPU unfriendly storage order]
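A small sketch of why one storage order can suit the CPU and hurt the GPU. It assumes a Fortran-style column-major layout (first index fastest), that "KIJ order" means the vertical index k is stored contiguously, and that neighbouring GPU threads handle neighbouring i. The grid extents and all helper names are made up for illustration.

```python
NX, NY, NZ = 8, 4, 6  # hypothetical grid extents


def offset_column_major(indices, extents):
    """Linear offset of an element in a column-major array
    (first index varies fastest, as in Fortran)."""
    offset, stride = 0, 1
    for index, extent in zip(indices, extents):
        offset += index * stride
        stride *= extent
    return offset


def offset_kij(i, j, k):
    """Array declared as a(k, i, j): k is contiguous in memory."""
    return offset_column_major((k, i, j), (NZ, NX, NY))


def offset_ijk(i, j, k):
    """Array declared as a(i, j, k): i is contiguous in memory."""
    return offset_column_major((i, j, k), (NX, NY, NZ))


def stride_between_threads(offset_fn):
    """Distance in elements between the data of two threads that
    handle neighbouring i (assumed thread mapping)."""
    return offset_fn(1, 0, 0) - offset_fn(0, 0, 0)


def stride_along_column(offset_fn):
    """Distance in elements between two vertical levels of one column."""
    return offset_fn(0, 0, 1) - offset_fn(0, 0, 0)


if __name__ == "__main__":
    for name, fn in (("KIJ", offset_kij), ("IJK", offset_ijk)):
        print(name, stride_between_threads(fn), stride_along_column(fn))
```

Under these assumptions, KIJ order keeps each column contiguous, which suits a CPU core walking along k, but places neighbouring threads NZ elements apart. IJK order gives neighbouring threads adjacent addresses, the access pattern that GPUs can coalesce.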
[Figure sequence: coarse grained parallelism and GPU unfriendly storage order]
Automate ALL THE THINGS!
Build System
[Diagram; legend: user defined file, file with CPU + GPU version, hybrid file, python program, GNU Make, input, output]
Configuration and build files: [project dir]/Makefile, MakeSettings, buildtools/Makefile, storage_order.F90
Pipeline:
• h90 Fortran source + directives → python 1 → XML callgraph + parsed directives
• → python 2 → XML callgraph + parsed directives + loop analysis
• → python 3 → F90 Fortran (CPU version and GPU version)
• → make → executable
Example call graph, shown for both the CPU version and the GPU version: calculate_all_columns → sum_column
ASUCA on Hybrid Fortran: Mid 2014
[Call graph figure with legend "Ported" / "Not ported". Node labels as far as recoverable: makegrid, makegrid_ideal, ideal, prep RK short, rungekutta short, dynamics rk short, dynamics rk long, diag adjust short, adjust long, diag long, diagnose rk long, physics long, physics rk long, radiation, convection, pbl/surface, sediment, microphys., monitflux, dt ∫, Max/Min/Ave, arashi, output]
Tests passed:
• Rad on CPU, KIJ order
• Rad on CPU, IJK order
• Gabls3 on CPU, KIJ order
• Gabls3 on CPU, IJK order
• Warmbubble on CPU, KIJ order
• Rad on GPU, KIJ order
• Rad on GPU, IJK order
• Gabls3 on GPU, KIJ order
• Gabls3 on GPU, IJK order
• Warmbubble on GPU, KIJ order
ASUCA on Hybrid Fortran: Now
[Same call graph figure as on the Mid 2014 slide, with legend "Ported" / "Not ported"]
Tests passed:
• Rad on CPU, KIJ order
• Rad on CPU, IJK order
• Gabls3 on CPU, KIJ order
• Gabls3 on CPU, IJK order
• Warmbubble on CPU, KIJ order
• Warmbubble on CPU, IJK order
• Rad on GPU, KIJ order
• Rad on GPU, IJK order
• Gabls3 on GPU, KIJ order
• Gabls3 on GPU, IJK order
• Warmbubble on GPU, KIJ order
• Warmbubble on GPU, IJK order
[Diagram: the reference ASUCA code with its separate variants (OpenMP ASUCA Dynamics, OpenACC ASUCA Dynamics, OpenMP ASUCA Physics, CUDA Fortran ASUCA Physics) next to Hybrid ASUCA Physics and Hybrid ASUCA Dynamics]
Components shown: rayleigh damping, HEVI, advection, diagnose, surface, Shortwave Radiation, Longwave Radiation, Planetary Boundary Layer
Table values, paired in slide order under the headers Hybrid ASUCA Physics and Hybrid ASUCA Dynamics:
• Kernels: 121, 112
• Code size: ~21k LOC, ~10k LOC
• CUDA Fortran: ✓, ✓
• OpenACC: ✓, ✓
• Validation: nRMS < 1E-9, nRMS < 1E-9
• Performance compared to reference code on Westmere Xeon: ~1x, ~3.6x
Further annotations on the slide: "inside of kernel", "outside of kernel(s)", "not affected", "kernel by kernel"
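The slide only states the threshold nRMS < 1E-9, not how nRMS is defined. As a sketch of what such a validation check could look like, the code below assumes the normalized root mean square error is the RMS of the differences divided by the RMS of the reference values. That normalization and the names `nrms_error`, `passes_validation` and `TOLERANCE` are illustrative assumptions, not taken from ASUCA or Hybrid Fortran.

```python
import math

TOLERANCE = 1e-9  # threshold quoted on the slide


def nrms_error(values, reference):
    """RMS of (values - reference) divided by the RMS of the reference.
    The choice of normalization is an assumption for this sketch."""
    if len(values) != len(reference) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    count = len(reference)
    rms_diff = math.sqrt(
        sum((v - r) ** 2 for v, r in zip(values, reference)) / count
    )
    rms_ref = math.sqrt(sum(r * r for r in reference) / count)
    if rms_ref == 0.0:
        raise ValueError("reference is identically zero; cannot normalize")
    return rms_diff / rms_ref


def passes_validation(values, reference, tolerance=TOLERANCE):
    """True if the output matches the reference within the tolerance."""
    return nrms_error(values, reference) < tolerance


if __name__ == "__main__":
    reference = [1.0, 2.0, 3.0]
    print(passes_validation([1.0, 2.0, 3.0], reference))
    print(passes_validation([1.0, 2.0, 3.001], reference))
```

A relative measure like this lets one tolerance apply to fields of very different magnitude, which is presumably why a single threshold can be quoted for both physics and dynamics.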
Further Results
[Feature matrix of example codes. The column position of each check mark is not recoverable; recoverable column headers include: hybrid ||-isation, stencil access, || region in branch, early return, impl. scheme, pointer swap, reduction, local module data, foreign module data, analytical validation, ref. data validation, # kernels]
Examples, with the number of features checked and the # kernels value:
• getting started: 4 ✓, 3
• 5D vector: 3 ✓, 1
• simple stencil: 4 ✓, 1
• stencil w/ local array: 5 ✓, 1
• scalar passed in: 5 ✓, 1
• multi kernel routines: 4 ✓, 4
• strides: 4 ✓, 2
• accessor functions: 4 ✓, 1
• II branches: 4 ✓, 2
• early returns: 5 ✓, 3
• schemes: 5 ✓, 4
• module data: 4 ✓, 10
• 3D diffusion: 4 ✓, 4
• particle push: 2 ✓, 1
• midaco solver: 3 ✓, 1
• poisson FEM solver: 4 ✓, 2
• The branched codebase has partially aged for more than 2 years, which has led to high code divergence.
• For a production version of the Hybrid code, we therefore basically need to start over.