Parallelization and Performance of the NIM Weather Model

  1. Parallelization and Performance of the NIM Weather Model on CPU, GPU and MIC Architectures
     Mark Govett, NOAA Earth System Research Laboratory

  2. We Need Better Numerical Weather Prediction
     Superstorm Sandy (Hurricane Sandy, October 28, 2012)
     • Second most destructive in U.S. history
     • $75B in damages
     • Over 200 deaths
     "A European forecast that closely predicted Hurricane Sandy's onslaught days ahead of U.S. and other models is raising complaints in the meteorological community." "The U.S. does not lead the world; we are not No. 1 in weather forecasting, I'm very sorry to say that," says AccuWeather's Mike Smith. (Source: USA Today, October 30, 2012)
     Congressional response:
     • High Impact Weather Prediction Program (HIWPP)
     • Next Generation Global Prediction System (NGGPS)

  3. Three Years Later… Hurricane Joaquin (October 2, 2015)
     Some improvement:
     • NOAA's Hurricane Weather Research & Forecast model intensity forecasts were accurate
     • US research models had 20" precipitation forecasts in South Carolina 36 hours in advance (verified)
     But…
     • European models predicted Joaquin would not make landfall (verified)
       – All U.S. models incorrectly predicted landfall
     • The National Hurricane Center correctly never issued any hurricane watches or warnings for the mainland
       – Forecasters relied on the European model for guidance
     NY Times: Why the U.S. weather model has fallen behind
     Washington Post: Why the forecast cone of uncertainty is inadequate

  4. Weather Prediction: Forecast Process
     • Operational weather prediction models at NWS are required to run in about 1 percent of real-time
       – A one-hour forecast produced in 8.5 minutes
       – Data assimilation and post-processing are similarly constrained
     [Diagram: the forecast workflow on HPC – Data Assimilation, NWP, Post-Processing, Forecaster, Stakeholders]
     "Accelerators" can speed up assimilation and Numerical Weather Prediction (NWP)

  5. Why Does NWP Need Accelerators?
     • Increasing computer power has provided linear forecast improvement for decades
     • CPU clock speeds have stalled
       – Increased number of processing cores: MIC, GPU
       – Lower energy requirements
     [Chart: NCEP Operational Forecast Skill, 36- and 72-hour forecasts @ 500 MB over North America, 100*(1-S1/70) method, 1955-2015, annotated with the operational computers of each era (IBM 701 and 704 through the CDC 6600, Cray Y-MP/C90, and IBM Power 6) and a "15 years" gap between the 36- and 72-hour skill curves. Source: NCEP Central Operations, January 2015]

  6. Resolution Matters: Large-Scale Ocean-Land-Atmosphere Interactions
     • Global operational weather models: 13 km

  7. Resolution Matters: Fine-Scale Simulation of a Tornado-Producing Supercell Storm
     • Comparison of 4-km and 1-km runs: the 1-km simulation produces a tornado and more intense updrafts
     Simulations with GFDL's variable-resolution FV3, non-hydrostatic (aka cloud-permitting) model. Courtesy of Lin and Harris (2015 manuscript)

  8. Better Data Assimilation = Better Forecasts
     Hurricane Joaquin track forecasts, 00Z October 1, 2015 (through 03Z 07 October)
     [Map, 25°N-50°N / 60°W-80°W: tracks from the US model with new data assimilation, the US model with old data assimilation, and the European model, compared against the actual track]
     Source: Corey Guastini, EMC's Model Evaluation Group

  9. Formula to Radically Improve U.S. Weather Prediction (and be #1)
     • Increase resolution of global models to 3 km or finer
       – Capture moisture and storm-scale features
       – Couple atmosphere, ocean, chemistry, land surface
     • Improve data assimilation
       – Use ensemble and time-based variational methods
       – Massive increase in the number of observations handled
       – Increase scalability to thousands of cores
     • Increase in computing
       – 100-1000 times more than current models use

  10. Non-hydrostatic Icosahedral Model (NIM)
     • Experimental global weather forecast model begun in 2008
     • Uniform icosahedral grid
     • Designed for GPU, MIC
       – Run on 10K GPUs, 600 MICs, 250K CPU cores
       – Tested at 3 km resolution
     • Single source code (Fortran)
       – Serial and parallel execution on CPU, GPU, MIC
     • Parallelization directives
       – GPU: OpenACC, F2C-ACC
       – CPU: OpenMP
       – MIC: OpenMP
       – MPI: SMS
     • Useful for evaluating compilers, GPU & MIC hardware
     Fine-grained parallelism:
     • GPU: "blocks" in the horizontal, "threads" in the vertical
     • CPU, MIC: "threading" in the horizontal, "vectorization" of the vertical
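     A minimal, hypothetical Fortran sketch (not NIM source) of the single-source directive style described above: the same loop nest carries both OpenACC and OpenMP directives, mapping horizontal columns to GPU gangs (blocks) or CPU/MIC threads, and vertical levels to GPU vector lanes (threads) or CPU/MIC SIMD; a given build enables one directive set.

     ! Single-source directive sketch: OpenACC for GPU, OpenMP for CPU/MIC.
     subroutine advance(u, tend, dt, nz, ncol)
       implicit none
       integer, intent(in)    :: nz, ncol
       real,    intent(in)    :: dt
       real,    intent(in)    :: tend(nz, ncol)   ! vertical innermost, horizontal outermost
       real,    intent(inout) :: u(nz, ncol)
       integer :: i, k

       !$acc parallel loop gang copy(u) copyin(tend)  ! GPU: horizontal columns -> gangs/blocks
       !$omp parallel do private(k)                   ! CPU/MIC: horizontal columns -> threads
       do i = 1, ncol
          !$acc loop vector                           ! GPU: vertical levels -> threads in a block
          !$omp simd                                  ! CPU/MIC: vertical loop vectorized
          do k = 1, nz
             u(k, i) = u(k, i) + dt * tend(k, i)      ! simple explicit update
          end do
       end do
     end subroutine advance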

  11. Hardware Comparisons
     • Performance comparisons in the literature and in presentations can be misleading
     • Ideally want:
       – Same source code
       – Optimized for all architectures
       – Standard, high-volume chips
       – Comparisons in terms of: device, single node, multi-node
       – Cost-benefit
       – Programmability

  12. Device Performance
     [Chart: NIM dynamics runtime (sec), 110 km resolution, 96 vertical levels, for Intel CPU, NVIDIA GPU, and Intel MIC by year; reported runtimes range from 49.8 sec down to 7.8 sec]
     Devices compared:
       Year    | Intel CPU (cores) | NVIDIA GPU (cores) | Intel MIC (cores)
       2010/11 | Westmere (12)     | Fermi (448)        | –
       2012    | SandyBridge (16)  | Kepler K20x (2688) | –
       2013    | IvyBridge (20)    | Kepler K40 (2880)  | Knights Corner (61)
       2014    | Haswell (24)      | Kepler K80 (4992)  | –

  13. Single Node Performance
     Results from NOAA / ESRL, August 2014; numeric values are node run-times for each configuration
     120 km resolution, 40,962 columns, 96 vertical levels, 100 time steps, symmetric mode; GPU run-times use F2C-ACC
     [Chart: execution run-time (sec) for node types IB20 only, IB24 only, MIC only, GPU only, IB24 + MIC, IB20 + GPU, and IB20 + 2 GPU; reported values: 81, 74, 73, 58, 46, 42, 33 sec]
     – IB20: Intel IvyBridge, 20 cores, 3.0 GHz
     – IB24: Intel IvyBridge, 24 cores, 2.70 GHz
     – GPU: Kepler K40, 2880 cores, 745 MHz
     – MIC: KNC 7120, 61 cores, 1.23 GHz

  14. Single Node Performance – Strong Scaling
     • Intel IvyBridge with up to 4 NVIDIA K80s
     • As the work per GPU decreases:
       – Inter-GPU communication increases slightly
       – Efficiency decreases
     • At least 10,000 columns per GPU is best
     [Chart: NIM single-node performance, 40,962 columns, 100 timesteps; runtime and communication time (sec) and parallel efficiency vs. number of GPUs; at 20481, 10241, 6827, and 5120 columns per GPU the parallel efficiency is 0.95, 0.90, 0.77, and 0.71]
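     The parallel-efficiency values in the chart are consistent with the standard strong-scaling definition (an assumption; the slide does not state the formula), with T(1) the single-GPU runtime and T(N) the runtime on N GPUs:

     E(N) = \frac{T(1)}{N \, T(N)}

     For example, E(2) = 0.95 means two GPUs finish in about 53% of the single-GPU time rather than the ideal 50%.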

  15. CPU – GPU Cost-Benefit
     • Dynamics only
     • Different CPU and GPU configurations
       – 40 Haswell CPUs, 20 K80 GPUs
       – Incorporate off-node MPI communications
     • All runs executed in the same time
       – Meets a ~1% operational time constraint for a 3 km resolution model
       – 20K columns / GPU used, which equates to 95% GPU strong-scaling efficiency

  16. Cost-Benefit – NIM Dynamics
     • 30 km resolution runs in the same execution time with:
       – 40 Intel Haswell CPU nodes (list price: $6,500)
       – 20 NVIDIA K80 GPUs (list price: $5,000)
     • Execution time represents ~1.5% of real-time for 3 km resolution
       – ~2.75% of real-time when model physics is included
     CPU versus GPU cost-benefit, NIM 30 km resolution:
       numCPUs      | 40  | 20  | 10  | 7     | 5
       K80s per CPU | 0   | 1   | 2   | 3     | 4
       Cost ($K)    | 260 | 230 | 165 | 145.5 | 132.5
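     The chart totals are consistent with simply summing the list prices quoted above (an assumption about how the costs were tallied); for instance, the CPU-only, 20-node, and 5-node configurations work out to

     40 \times \$6{,}500 = \$260{,}000
     20 \times \$6{,}500 + 20 \times \$5{,}000 = \$230{,}000
     5 \times \$6{,}500 + 20 \times \$5{,}000 = \$132{,}500

     matching the 260, 230, and 132.5 (thousand) entries in the table.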

  17. Lessons Learned: Code Design
     • Avoid language constructs that are less well supported or difficult for compilers to optimize
       – Pointers, derived types
     • Separate routines for fine-grain (GPU, MIC) and coarse-grain parallelism
     • Avoid single-loop kernels
       – High cost of kernel startup and synchronization
     • Avoid large kernels (GPU)
       – Limited fast registers, cache / shared memory
     • Use scientific formulations that are highly parallel
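     As a hedged illustration of the "avoid single-loop kernels" point (hypothetical fields and constants, not NIM code): rather than launching two one-statement kernels, fuse the work into a single kernel so the launch and synchronization cost is paid once and the intermediate value stays in registers.

     ! Fused kernel: computes pressure and potential temperature in one launch.
     subroutine diagnose(rho, t, p, theta, nz, ncol)
       implicit none
       integer, intent(in)  :: nz, ncol
       real,    intent(in)  :: rho(nz, ncol), t(nz, ncol)
       real,    intent(out) :: p(nz, ncol), theta(nz, ncol)
       real, parameter :: rd = 287.05, p0 = 1.0e5, kappa = 0.2854
       integer :: i, k

       !$acc parallel loop collapse(2) copyin(rho, t) copyout(p, theta)
       do i = 1, ncol
          do k = 1, nz
             p(k, i)     = rho(k, i) * rd * t(k, i)          ! ideal-gas pressure
             theta(k, i) = t(k, i) * (p0 / p(k, i))**kappa   ! potential temperature
          end do
       end do
     end subroutine diagnose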

  18. Lessons Learned: Inter-Process Communications
     • The icosahedral grid gave flexibility in how columns could be distributed among MPI ranks
       – MPI regions should be square to minimize the number of points to be communicated
       – Spiral ordering to eliminate MPI message packing and unpacking helped CPU, GPU, and MIC
     • GPUDirect gave a 30% performance improvement
     • CUDA Multi-Process Service (MPS) sped up NIM by 35% on Titan
       – Not reflected in the results shown

  19. Lessons Learned: Fine-Grain
     • Choice of innermost dimension is important
       – Vectorization on CPU, MIC
       – SIMD, coalesced memory access on GPU
       – For NIM, the vertical dimension is used for dynamics; the horizontal dimension for physics
     • Innermost dimension should be a multiple of 32 for GPU; bigger is better
       – A multiple of 8 is sufficient for MIC
     • Minimize branching
       – Very few special cases in NIM
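     A small sketch of the innermost-dimension sizing rule (assumed names and illustrative values, not NIM code): round the vertical extent up to a multiple of 32 so GPU warps are fully populated; a multiple of 8 would already suffice for MIC vector lanes.

     program pad_vertical
       implicit none
       integer, parameter :: nz     = 96                      ! physical number of levels
       integer, parameter :: nz_pad = ((nz + 31) / 32) * 32   ! next multiple of 32 (= 96 here)
       integer, parameter :: ncol   = 40962                   ! icosahedral columns
       real, allocatable  :: u(:, :)

       ! Vertical innermost: each column is contiguous, giving coalesced loads on
       ! the GPU and unit-stride SIMD on CPU/MIC.
       allocate(u(nz_pad, ncol))
       u = 0.0
       print *, 'padded vertical extent:', nz_pad
     end program pad_vertical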

  20. Improved OpenACC Compilers
     • PGI performance nearly matches F2C-ACC
       – Was 2.1X slower in 2014
     • Cray was 1.7X slower
     • PGI does a good job with analysis and data movement
       – Use !$acc kernels to get the application running
         • An 800-line MPAS kernel was running on the GPU in 10 minutes
       – Use !$acc parallel to optimize performance
       – Use !$acc data to handle data movement
       – Diagnostic output guides parallelization and optimization
     • Cray, IBM comparisons planned
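     A hedged sketch (hypothetical routine, not the MPAS or NIM source) of the directive workflow suggested above: an !$acc data region keeps arrays resident on the GPU across the whole time loop, and the hot loop nest uses !$acc parallel loop once !$acc kernels has gotten the code running.

     subroutine integrate(u, tend, dt, nz, ncol, nsteps)
       implicit none
       integer, intent(in)    :: nz, ncol, nsteps
       real,    intent(in)    :: dt
       real,    intent(in)    :: tend(nz, ncol)
       real,    intent(inout) :: u(nz, ncol)
       integer :: i, k, n

       !$acc data copy(u) copyin(tend)          ! one host<->device transfer, not one per step
       do n = 1, nsteps
          !$acc parallel loop gang vector collapse(2) present(u, tend)
          do i = 1, ncol
             do k = 1, nz
                u(k, i) = u(k, i) + dt * tend(k, i)
             end do
          end do
       end do
       !$acc end data
     end subroutine integrate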
