
Unified CPU+GPU Programming for the Production Weather Model ASUCA



  1. Unified CPU+GPU Programming for the Production Weather Model ASUCA
     Michel Müller, Research Assistant, Aoki Laboratory (michel@sim.gsic.titech.ac.jp)
     Supervised by Prof. Dr. Takayuki Aoki, Tokyo Institute of Technology
     Creative Commons: NASA Goddard Space Flight Center, 2010

  2. Unified CPU+GPU Programming for the Production Weather Model ASUCA

  3. ASUCA
     • National Japanese weather model
     • In production since 2014 (PowerPC)
     • Meso-scale
     • Non-hydrostatic
     • Regular mesh FEM
     • Diagram labels: dynamical core, physical processes
     Unified
     • Single Fortran code
     • Performant on both CPU and GPU
     • Applicable to both physics and dynamics

  4. Motivation ➤ Hybrid Fortran ➤ Results ➤ Outlook

  5. Unified
     • Single Fortran code
     • Performant on both CPU and GPU
     • Applicable to both physics and dynamics

  6-10. Unified (bullets as on slide 5), with one annotation added per slide:
     • coarse grained parallelism
     • GPU unfriendly storage order
     • separation of device/host code
     • data movement to/from device memory
     • CUDA boilerplate

  11. coarse grained parallelism

  12. coarse grained parallelism

  13. ? coarse grained parallelism

  14. dynamical core, physical processes ... of ASUCA

  15. dynamical core, physical processes ... of ASUCA

  16. dynamical core, physical processes ... of ASUCA

  17. coarse grained parallelism

  18. GPU unfriendly storage order

  19. GPU unfriendly storage order

  20. GPU unfriendly storage order

  21. GPU unfriendly storage order
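The storage-order problem named on these slides can be illustrated with a little index arithmetic. The sketch below is not taken from the presentation: it assumes the KIJ and IJK labels used in the test names later in the deck follow the Fortran convention (the first-listed index is contiguous in memory), and `linear_index` and `stride` are hypothetical helper names.

```python
def linear_index(i, j, k, ni, nj, nk, order):
    """Memory offset of element (i, j, k), with the first-listed index contiguous.

    Assumption: "KIJ" means k varies fastest, "IJK" means i varies fastest.
    """
    if order == "KIJ":
        return k + nk * (i + ni * j)
    if order == "IJK":
        return i + ni * (j + nj * k)
    raise ValueError(f"unknown storage order: {order}")


def stride(axis, ni, nj, nk, order):
    """Distance in memory between two elements that are neighbours along `axis`."""
    step = {"i": (1, 0, 0), "j": (0, 1, 0), "k": (0, 0, 1)}[axis]
    return linear_index(*step, ni, nj, nk, order) - linear_index(0, 0, 0, ni, nj, nk, order)


ni, nj, nk = 4, 3, 5
for order in ("KIJ", "IJK"):
    print(order, "stride along i:", stride("i", ni, nj, nk, order),
          "stride along k:", stride("k", ni, nj, nk, order))
```

With KIJ order a loop over the vertical index k walks through contiguous memory, which suits a CPU core processing one column at a time. If neighbouring GPU threads each handle a neighbouring column (i and i+1), KIJ order puts their accesses nk elements apart, whereas IJK order makes them adjacent.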

  23. coarse grained parallelism

  24. coarse grained parallelism, GPU unfriendly storage order

  25. automate ALL THE THINGS!

  26. ?

  29. Build System
      Legend: user-defined file; file with CPU + GPU version; hybrid file; python program; GNU Make; input; output
      Diagram elements: h90 Fortran source + directives; python 1; XML callgraph + parsed directives; python 2; XML callgraph + parsed directives + loop analysis; python 3; F90 Fortran; make; executable
      Files: [project-dir]/Makefile, MakeSettings, buildtools/Makefile, storage_order.F90

  30. Build System (diagram as on slide 29), with the routines calculate_all_columns and sum_column shown in a CPU version and a GPU version

  31. Build System (diagram as on slide 29)
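The two routine names on slide 30, calculate_all_columns and sum_column, suggest how one source can yield a coarse-grained CPU version and a fine-grained GPU version. The sketch below is only an illustration in Python, not Hybrid Fortran output. It assumes that the CPU version keeps the parallel loop in the caller while the GPU version moves it into the callee, which the slide text does not state explicitly. The `_cpu` and `_gpu` suffixes are hypothetical.

```python
def sum_column(column):
    """Sum the values of one vertical column."""
    total = 0.0
    for value in column:
        total += value
    return total


def calculate_all_columns_cpu(field):
    # Coarse-grained: the loop over (i, j) lives in the caller,
    # and the callee works on a single column.
    return [[sum_column(column) for column in row] for row in field]


def sum_column_gpu(field):
    # Fine-grained: the loop over (i, j) has moved into the callee.
    # Each (i, j) iteration stands in for one GPU thread.
    ni, nj = len(field), len(field[0])
    result = [[0.0] * nj for _ in range(ni)]
    for i in range(ni):
        for j in range(nj):
            total = 0.0
            for value in field[i][j]:
                total += value
            result[i][j] = total
    return result


def calculate_all_columns_gpu(field):
    # The caller no longer loops; it hands the whole field to the kernel-like callee.
    return sum_column_gpu(field)


# field[i][j] is one column of values along k
field = [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
         [[0.5, 0.5, 1.0], [2.0, 2.0, 2.0]]]
print(calculate_all_columns_cpu(field))
print(calculate_all_columns_gpu(field))
```

Both variants compute the same column sums; only the position of the loop over the horizontal domain differs.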

  32. ASUCA on Hybrid Fortran: Mid 2014
      [Diagram: ASUCA module structure, each module marked "Ported" or "Not ported". Labels, with long/short qualifiers: radiation, convection, pbl/surface, microphys., sediment, monitflux, arashi, makegrid, makegrid_ideal, ideal, prep RK, rungekutta, physics, physics rk, physics adjust, dynamics rk, diagnose rk, diag, diag adjust, dt, Max/Min/Ave, output]
      Tests passed:
      • Rad on CPU, KIJ Order
      • Rad on CPU, IJK Order
      • Gabls3 on CPU, KIJ Order
      • Gabls3 on CPU, IJK Order
      • Warmbubble on CPU, KIJ Order
      • Rad on GPU, KIJ Order
      • Rad on GPU, IJK Order
      • Gabls3 on GPU, KIJ Order
      • Gabls3 on GPU, IJK Order
      • Warmbubble on GPU, KIJ Order

  33. ASUCA on Hybrid Fortran: Now
      [Diagram: same module structure as on slide 32, each module marked "Ported" or "Not ported"]
      Tests passed:
      • Rad on CPU, KIJ Order
      • Rad on CPU, IJK Order
      • Gabls3 on CPU, KIJ Order
      • Gabls3 on CPU, IJK Order
      • Warmbubble on CPU, KIJ Order
      • Warmbubble on CPU, IJK Order
      • Rad on GPU, KIJ Order
      • Rad on GPU, IJK Order
      • Gabls3 on GPU, KIJ Order
      • Gabls3 on GPU, IJK Order
      • Warmbubble on GPU, KIJ Order
      • Warmbubble on GPU, IJK Order

  34. [Diagram labels: ASUCA; OpenMP ASUCA Dynamics; OpenACC ASUCA Dynamics; OpenMP ASUCA Physics; CUDA Fortran ASUCA Physics; Hybrid ASUCA Physics; Hybrid ASUCA Dynamics]

  35. [Diagram labels as on slide 34]
      Components: rayleigh damping, Shortwave Radiation, Longwave Radiation, Planetary Boundary Layer, HEVI, advection, diagnose, surface
      Rows, two columns each (order as on the slide):
      • Kernels: 121 / 112
      • LOC: ~21k / ~10k
      • CUDA Fortran: ✓ / ✓
      • OpenACC: ✓ / ✓
      • Performance: ~1x / ~3.6x
      Notes: nRMS < 1E-9 compared to Reference Code; on Westmere Xeon; inside of kernel; outside of kernel; kernel(s) not affected; kernel by kernel
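The validation criterion on slide 35, nRMS < 1E-9 against the reference code, can be sketched as follows. The slides do not define nRMS. This sketch assumes it is the root mean square error normalised by the root mean square of the reference data. The names `nrms` and `passes_validation` are hypothetical.

```python
import math


def nrms(result, reference):
    """Normalised RMS error of `result` against `reference`.

    Assumption: normalisation by the RMS of the reference values.
    """
    if len(result) != len(reference):
        raise ValueError("result and reference must have the same length")
    squared_error = sum((r - ref) ** 2 for r, ref in zip(result, reference))
    squared_reference = sum(ref ** 2 for ref in reference)
    if squared_reference == 0.0:
        raise ValueError("reference data is all zero, nRMS is undefined")
    return math.sqrt(squared_error / squared_reference)


def passes_validation(result, reference, threshold=1e-9):
    """True if the result matches the reference within the threshold from the slide."""
    return nrms(result, reference) < threshold


reference = [280.0, 281.5, 283.25]
print(passes_validation(reference, reference))
print(passes_validation([280.0, 281.5, 283.26], reference))
```

A relative measure like this lets one threshold serve for fields of very different magnitude, such as temperature and mixing ratios.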

  36. Further Results

  37. [Table: feature coverage of the example codes. Column headings include # kernels, analytical validation, ref. data validation and hybrid ||-isation, followed by one column per feature.]
      Examples (number of kernels):
      • getting started (3)
      • 5D vector (1)
      • simple stencil (1)
      • stencil w/ local array (1)
      • scalar passed in (1)
      • multi kernel routines (4)
      • strides (2)
      • accessor functions (1)
      • branches (2)
      • early returns (3)
      • schemes (4)
      • module data (10)
      • 3D diffusion (4)
      • particle push (1)
      • midaco solver (1)
      • poisson FEM solver (2)

  38. The branched codebase has partially aged for more than 2 years, which has led to high code divergence.
      => For a production version of the hybrid code, we basically need to start over.
