

  1. Exascale challenges for Numerical Weather Prediction: the ESCAPE project. Olivier Marsden. This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 671627.

  2. European Centre for Medium-Range Weather Forecasts: independent intergovernmental organisation established in 1975, with 19 Member States and 15 Co-operating States.

  3. The success story of Numerical Weather Prediction: hurricanes. May be one of the best medium-range forecasts of all time!

  4. NWP: benefit of high resolution. [Figure: Hurricane Sandy, 28 Oct 2012. Panels compare 3-, 4- and 5-day forecasts at T639, T1279 and T3999 resolutions against the 30 Oct analysis: mean sea-level pressure, 10 m wind speed, wave height, and NEXRAD precipitation (27 Oct).]

  5. What is the challenge?
     Today    - Observations: volume 20 million (2 x 10^7) per day; 98% from 60 different satellite instruments; physical parameters of atmosphere, waves, ocean.
              - Model: 5 million grid points x 100 levels x 10 prognostic variables = 5 x 10^9.
     Tomorrow - Observations: volume 200 million (2 x 10^8) per day, a factor of 10 more; 98% from 80 different satellite instruments; physical and chemical parameters of atmosphere, waves, ocean, ice, vegetation.
              - Model: 500 million grid points x 200 levels x 100 prognostic variables = 1 x 10^13, a factor of 2000 more per time step.
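
     For reference, the model volumes quoted on the slide are simply the products of the quoted factors, and the per-time-step growth factor follows directly (worked arithmetic only, no additional data):

     $$5\times10^{6}\ \text{grid points}\times 100\ \text{levels}\times 10\ \text{variables}= 5\times10^{9},$$
     $$5\times10^{8}\times 200\times 100 = 1\times10^{13},\qquad \frac{1\times10^{13}}{5\times10^{9}} = 2000.$$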

  6. AVEC forecast model intercomparison: 13 km case. [Chart: fraction of the operational threshold (speed normalised to 8.5 minutes per forecast day) versus number of Edison cores (Cray XC-30), for IFS, NMM-UJ, FV3 single precision, FV3 double precision, NIM, MPAS and NEPTUNE, with the 13 km operational threshold marked.] [Michalakes et al. 2015: AVEC Report: NGGPS level-1 benchmarks and software evaluation]

  7. AVEC forecast model intercomparison: 3 km case. The Advanced Computing Evaluation Committee (AVEC) was set up to evaluate the HPC performance of five Next Generation Global Prediction System candidates against operational forecast requirements at the National Weather Service through 2025-30. [Chart: fraction of the operational threshold (speed normalised to 8.5 minutes per forecast day) versus number of Edison cores (Cray XC-30), for IFS, NMM-UJ, FV3 single precision, FV3 double precision, NIM, NIM with improved MPI communications, MPAS and NEPTUNE, with the 3 km operational threshold marked.]

  8. Technology applied at ECMWF for the last 30 years ...
     • A spectral transform, semi-Lagrangian, semi-implicit (compressible) hydrostatic model
     • How long can ECMWF continue to run such a model?
     • IFS data assimilation and model must EACH run in under ONE HOUR for a 10-day global forecast

  9. IFS 'today' (MPI + OpenMP parallel). IFS = Integrated Forecasting System.

  10. Predicted 2.5 km model scaling on an XC-30. [Chart (29 October 2014): predicted scaling versus the operational requirement for a single HRES forecast; annotations mark the 2 MW and 6 MW power levels and two XC-30 clusters, each with 85K cores.] ECMWF requires system capacity for 10 to 20 simultaneous HRES forecasts.

  11. Numerical methods - Code adaptation - Architecture. ESCAPE*, Energy efficient SCalable Algorithms for weather Prediction at Exascale:
      • Next generation IFS numerical building blocks and compute intensive algorithms
      • Compute/energy efficiency diagnostics
      • New approaches and implementation on novel architectures
      • Testing in operational configurations
      * Funded by the EC H2020 framework, Future and Emerging Technologies - High-Performance Computing.
      Partners: ECMWF, Météo-France, RMI, DMI, Meteo Swiss, DWD, U Loughborough, PSNC, ICHEC, Bull, NVIDIA, Optalysys

  12. Schematic description of the spectral transform method in the ECMWF IFS model.
      Grid-point space: semi-Lagrangian advection, physical parametrizations, products of terms.
      FFT / inverse FFT between grid-point space and Fourier space.
      Legendre transform (LT) / inverse LT between Fourier space and spectral space.
      Spectral space: horizontal gradients, semi-implicit calculations, horizontal diffusion.
      FFT: Fast Fourier Transform; LT: Legendre Transform.
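
      The split the schematic describes corresponds to the standard spherical-harmonic expansion of a field; the notation below is generic textbook notation, not taken from IFS documentation:

      $$\psi(\lambda,\theta)=\sum_{m=-M}^{M}\Big[\sum_{n=|m|}^{N}\psi_{n}^{m}\,P_{n}^{m}(\sin\theta)\Big]e^{im\lambda},$$

      where the inner sum over total wavenumber n is the (inverse) Legendre transform applied per zonal wavenumber m, and the outer sum over m is evaluated with an FFT along each latitude circle; the direct transforms invert these two steps, using Gaussian quadrature in latitude for the Legendre part.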

  13. Schematic description of the spectral transform dwarf in ESCAPE: the same grid-point / Fourier / spectral cycle as on the previous slide (FFT and inverse FFT, LT and inverse LT), run for 100 iterations. The time-stepping loop in dwarf1-atlas.F90:

      DO JSTEP=1,ITERS
        call trans%invtrans(spfields,gpfields)
        call trans%dirtrans(gpfields,spfields)
      ENDDO

      FFT: Fast Fourier Transform; LT: Legendre Transform.
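
      As a self-contained illustration of this loop's shape, here is a minimal sketch. The trans_t type and its invtrans/dirtrans stubs are hypothetical stand-ins for the real Atlas/trans transform object, whose interface is not reproduced here, and the field sizes are arbitrary.

          ! trans_mock is a made-up stand-in for the real transform object.
          module trans_mock
            implicit none
            type :: trans_t
            contains
              procedure :: invtrans   ! spectral -> grid-point (inverse transforms)
              procedure :: dirtrans   ! grid-point -> spectral (direct transforms)
            end type trans_t
          contains
            subroutine invtrans(self, spfields, gpfields)
              class(trans_t), intent(in)    :: self
              real,           intent(in)    :: spfields(:,:)
              real,           intent(inout) :: gpfields(:,:)
              gpfields = 0.0   ! placeholder: the real code does inverse LT + inverse FFT
            end subroutine invtrans
            subroutine dirtrans(self, gpfields, spfields)
              class(trans_t), intent(in)    :: self
              real,           intent(in)    :: gpfields(:,:)
              real,           intent(inout) :: spfields(:,:)
              spfields = 0.0   ! placeholder: the real code does FFT + direct LT
            end subroutine dirtrans
          end module trans_mock

          program dwarf_loop_sketch
            use trans_mock
            implicit none
            integer, parameter :: iters = 100              ! "100 iterations" as on the slide
            type(trans_t) :: trans
            real    :: spfields(100, 8), gpfields(200, 8)  ! arbitrary illustrative sizes
            integer :: jstep
            spfields = 1.0; gpfields = 0.0
            do jstep = 1, iters                            ! the dwarf's time-stepping loop
              call trans%invtrans(spfields, gpfields)
              call trans%dirtrans(gpfields, spfields)
            end do
          end program dwarf_loop_sketch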

  14. GPU-related work on this dwarf. Work carried out by George Mozdzynski, ECMWF.
      • An OpenACC port of a spectral transform test (transform_test.F90)
      • Using 1D parallelisation over spectral waves; contrast with the IFS, which uses 2D parallelisation (waves, levels) -- see the sketch after this slide
      • About 30 routines ported, 280 !$ACC directives
      • Major focus on FFTs, using the NVIDIA cuFFT library
      • Legendre Transform uses DGEMM_ACC
      • Fast Legendre Transform not ported (needs a working deep copy)
      • CRAY provided access to SWAN (6 NVIDIA K20X GPUs), using the latest 8.4 CRAY compilers
      • Larger runs performed on TITAN: each node has 16 AMD Interlagos cores and 1 NVIDIA K20X GPU (6 GB); CRESTA INCITE14 access; used the 8.3.1 CRAY compiler
      • Compare performance of an XK7/Titan node with an XC-30 node (24-core Ivy Bridge)
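
      A rough sketch of what "1D parallelisation over spectral waves" means in practice, under the assumption that each OpenACC gang owns one zonal wavenumber and processes all levels and latitudes for it. The array names, sizes and the simplified inverse-Legendre-like sum are illustrative only, not the actual ported routines (which call batched DGEMM and cuFFT).

          program waves_1d_sketch
            implicit none
            integer, parameter :: nwaves = 64    ! zonal wavenumbers owned by this task
            integer, parameter :: ncoef  = 64    ! spectral coefficients per wave
            integer, parameter :: nlat   = 96    ! Gaussian latitudes
            integer, parameter :: nlev   = 20    ! vertical levels (illustrative)
            real :: pleg(nlat, ncoef, nwaves)    ! Legendre-polynomial-like factors
            real :: spec(ncoef, nlev, nwaves)    ! spectral coefficients
            real :: four(nlat, nlev, nwaves)     ! result in Fourier space
            integer :: kw, jk, jl, jn
            real :: s

            call random_number(pleg)
            call random_number(spec)

            ! 1D decomposition: parallelism only over the wave index kw.
            !$acc parallel loop gang copyin(pleg, spec) copyout(four)
            do kw = 1, nwaves
              !$acc loop collapse(2) private(s)
              do jk = 1, nlev
                do jl = 1, nlat
                  s = 0.0
                  !$acc loop seq
                  do jn = 1, ncoef               ! inverse-LT-like sum over coefficients
                    s = s + pleg(jl, jn, kw) * spec(jn, jk, kw)
                  end do
                  four(jl, jk, kw) = s
                end do
              end do
            end do

            print *, 'checksum: ', sum(four)
          end program waves_1d_sketch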

  15. Tc1023 10 km model: spectral transform compute cost (40 nodes, 800 fields). [Bar chart: milliseconds per time-step for LTINV_CTL, LTDIR_CTL, FTDIR_CTL and FTINV_CTL on the XC-30 and on TITAN; values range from 86.1 to 239.5 ms.]

  16. Tc1999 5 km model: spectral transform compute cost (120 nodes, 800 fields). [Bar chart: milliseconds per time-step for LTINV_CTL, LTDIR_CTL, FTDIR_CTL and FTINV_CTL on the XC-30 and on TITAN; values range from 152.9 to 646.2 ms.]

  17. Tc3999 2.5 km model: spectral transform compute cost (400 nodes, 800 fields), milliseconds per time-step.
      LTINV_CTL: XC-30 1024.9, TITAN 324.3
      LTDIR_CTL: XC-30 1178.6, TITAN 279.8
      FTDIR_CTL: XC-30 428.3,  TITAN 342.3
      FTINV_CTL: XC-30 424.6,  TITAN 341.8
      (values as tabulated on slide 21)

  18. Relative FFT performance: NVIDIA K20X GPU (version 2 code) versus a 24-core Ivy Bridge CRAY XC-30 node running FFT992. [Surface plot: relative performance (0.0 to 1.6) as a function of spectral truncation (T95 to T7999) and batch (LOT) size (100 to 3200).] K20X GPU performance is up to 1.4 times faster than the 24-core Ivy Bridge XC-30 node.

  19. Comparison of FFT cost for LOT size 100. [Line chart: time versus FFT length (latitude points, up to about 8000) for four implementations: GPU version 1, GPU version 2, FFT992 and FFTW.]

  20. What about MPI communications?
      • Communication cost is very much greater than compute for the spectral transform test; a Tc3999 example follows
      • The XC-30 (Aries) is faster than XK7/Titan (Gemini), so a prediction was made for XC-30 communications combined with K20X GPU compute
      • Potential for compute/communications overlap: GPU compute while MPI transfers are taking place; not done (yet) -- see the sketch after this slide
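
      A hedged sketch of the overlap idea mentioned above (which the slide notes was not done at the time): launch the GPU work asynchronously on one buffer while the host drives an MPI exchange of another, then synchronise both. Buffer names, sizes and the ring exchange pattern are invented for the example.

          program overlap_sketch
            use mpi
            implicit none
            integer, parameter :: n = 100000
            real    :: work(n), halo_send(n), halo_recv(n)
            integer :: ierr, rank, nproc, next, prev, reqs(2), i

            call MPI_Init(ierr)
            call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
            call MPI_Comm_size(MPI_COMM_WORLD, nproc, ierr)
            next = mod(rank + 1, nproc)              ! ring exchange partners
            prev = mod(rank - 1 + nproc, nproc)

            work = 1.0; halo_send = real(rank); halo_recv = 0.0

            ! Kick off GPU compute on 'work' without blocking the host ...
            !$acc parallel loop async(1) copy(work)
            do i = 1, n
              work(i) = sqrt(work(i)) + 2.0
            end do

            ! ... while the host overlaps an MPI exchange of a different buffer.
            call MPI_Irecv(halo_recv, n, MPI_REAL, prev, 0, MPI_COMM_WORLD, reqs(1), ierr)
            call MPI_Isend(halo_send, n, MPI_REAL, next, 0, MPI_COMM_WORLD, reqs(2), ierr)
            call MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE, ierr)

            !$acc wait(1)                            ! GPU work and exchange are both done
            print *, rank, 'sum(work) =', sum(work), ' recv(1) =', halo_recv(1)
            call MPI_Finalize(ierr)
          end program overlap_sketch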

  21. Tc3999, 400 nodes, 800 fields (ms per time-step)
                      XC-30      TITAN   XC-30+GPU prediction
      LTINV_CTL      1024.9      324.3      324.3
      LTDIR_CTL      1178.6      279.8      279.8
      FTDIR_CTL       428.3      342.3      342.3
      FTINV_CTL       424.6      341.8      341.8
      MTOL            752.5     4763.0      752.5
      LTOM            407.9     4782.9      407.9
      LTOG           1225.9     1541.9     1225.9
      GTOL            401.5     1658.4      401.5
      HOST2GPU**        0.0      655.4      655.4
      GPU2HOST**        0.0      650.0      650.0
      Total          5844.2    14034.4     5381.4
      ** included in the communications times

  22. Spectral transforms experience
      • OpenACC not that difficult, but ~10 OpenMP directives (high-level parallelisation) were replaced by ~280 OpenACC directives (low-level parallelisation)
      • Most of the porting time was spent on:
        - a strategy for porting the IFS FFT992 interface (algor/fourier), replaced by calls to a new CUDA FFT993 interface calling NVIDIA cuFFT library routines
        - coding versions of FTDIR and FTINV in which FFT992 and FFT993 both ran on the same data, to compare results
        - writing several offline FFT tests to explore performance
        - performance issues, investigated with nvprof and gstats

  23. Physics dwarf: CLOUDSC
      • Work done by Sami Saarinen, ECMWF
      • Adaptation of the IFS physics' cloud scheme (CLOUDSC) to new architectures as part of the ECMWF Scalability Programme
      • Emphasis was on GPU migration by use of OpenACC directives
      • CLOUDSC consumes about 10% of IFS forecast time
      • Some 3500 lines of Fortran 2003, before OpenACC directives
      • Focus on performance comparison between the OpenMP version of CLOUDSC on Haswell and the OpenACC version of CLOUDSC on an NVIDIA K40

  24. Problem parameters:
      • Given 160,000 grid-point columns (NGPTOT), each with 137 levels (NLEV); about 80,000 columns fit into one K40 GPU
      • Grid-point columns are independent of each other, so there are no horizontal dependencies here, but the level dependency prevents parallelization along the vertical dimension
      • Arrays are organized in blocks of grid-point columns: instead of ARRAY(NGPTOT, NLEV) we use ARRAY(NPROMA, NLEV, NBLKS), where NPROMA is a fixed (runtime) blocking factor
      • Arrays are OpenMP thread-safe over NBLKS -- see the sketch after this slide
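
      A small sketch of the NPROMA blocking described above, with made-up array contents and a toy vertical recurrence standing in for the level dependency. The OpenMP loop over blocks is thread-safe because each block touches only its own slice, while the level loop stays sequential; the sizes are illustrative, not the operational ones.

          program nproma_blocking_sketch
            implicit none
            integer, parameter :: ngptot = 16000      ! total grid-point columns (toy size)
            integer, parameter :: nlev   = 137        ! vertical levels
            integer, parameter :: nproma = 32         ! blocking factor (runtime-fixed in IFS)
            integer, parameter :: nblks  = (ngptot + nproma - 1) / nproma
            real, allocatable :: field(:,:,:)         ! ARRAY(NPROMA, NLEV, NBLKS) layout
            integer :: jb, jk, jl

            allocate(field(nproma, nlev, nblks))
            field = 0.0
            field(:, 1, :) = 1.0

            ! Blocks are independent, so the block loop parallelises safely.
            !$omp parallel do private(jb, jk, jl)
            do jb = 1, nblks
              do jk = 2, nlev                         ! level dependency: stays sequential
                do jl = 1, nproma                     ! columns within a block: independent
                  field(jl, jk, jb) = 0.5 * field(jl, jk - 1, jb) + 1.0
                end do
              end do
            end do
            !$omp end parallel do

            print *, 'checksum: ', sum(field(:, nlev, :))
          end program nproma_blocking_sketch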
