
Configuring and Optimizing the Weather Research and Forecast Model on the Cray XT - PowerPoint presentation



  1. Configuring and Optimizing the Weather Research and Forecast Model on the Cray XT
     Andrew Porter and Mike Ashworth
     Computational Science and Engineering Department, STFC Daresbury Laboratory
     andrew.porter@stfc.ac.uk
     24th May 2010, Cray User Group, Edinburgh

  2. Overview
     • Introduction
     • Machines
     • Benchmark Configuration
     • Choice of Compiler/Flags
     • MPI Versus Mixed Mode (MPI/OpenMP)
     • Memory Bandwidth Issues
     • Tuning Cache Usage
     • Input/Output
       – Default scheme
       – pNetCDF
       – I/O servers & process placement

  3. Introduction - WRF
     • Regional- to global-scale model for research and operational weather-forecast systems
     • Developed through a collaboration between various US bodies (NCAR, NOAA, ...)
     • Finite-difference scheme + physics parametrisations
     • F90 [+ MPI] [+ OpenMP]
     • 6000 registered users (June 2008)

  4. Introduction – this work
     • WRF accounts for a significant fraction of the usage of the UK national facility (HECToR)
     • Aim here is to investigate ways of ensuring this use is efficient
       – Mainly through (the many) configuration options
       – Code optimization when/if required

  5. Machines Used
     • HECToR – UK national academic supercomputing service
       – Cray XT4
       – 1x AMD Barcelona 2.3 GHz quad-core chip per compute node
       – SeaStar2 interconnect
     • Monte Rosa – Swiss National Supercomputing Service (CSCS)
       – Cray XT5
       – 2x AMD Istanbul 2.4 GHz hexa-core chips per compute node
       – SeaStar2 interconnect

  6. Benchmark Configuration – “Great North Run” (GNR)
     Three nested domains with two-way feedback between them (see the namelist sketch below):
     • D1 = 356 x 196
     • D2 = 319 x 322
     • D3 = 391 x 328
     D3 gives 1 km-resolution data over Northern England.
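
For concreteness, a hedged sketch of how these domain sizes might appear in WRF's namelist.input, assuming the quoted sizes are the west-east x south-north grid dimensions (e_we x e_sn); all other &domains entries are omitted:

      &domains
       max_dom = 3,
       e_we    = 356, 319, 391,
       e_sn    = 196, 322, 328,
      /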

  7. Choice of Compiler/Flags
     • HECToR offers four different compilers!
       – Portland Group (PGI)
       – PathScale (recently bought by Cray)
       – Cray
       – GNU (gcc + gfortran)
     • WRF can be built in serial, shared-memory (sm), distributed-memory (dm) and mixed (dm+sm) modes...

  8. Initial Compiler Comparison for dm (MPI) build

  9. Effect of Extra Flags

  10. Compiler notes I
     • 1.1K -> 1.2K time-steps/wall-clock hour on 1024 cores from increasing optimization with PGI
       – from -O3 -fast to -O3 -fastsse -Mvect=noaltcode -Msmartalloc -Mprefetch=distance:8 -Mfprelaxed
     • 1.2K -> 1.3K by rebuilding to remove the array initialization done prior to each inter-domain feedback stage
     • PathScale with extra optimization flags is only very slightly slower than PGI
     • GNU (default) is 25% slower than PGI (default) on 256 cores but only 10% slower on 1024
       – Deficit is much larger when extra optimization is turned on for PGI

  11. Verification of Results
     • Compare T at 2 m for a 6-hour run of the default & optimized binaries
     • Max. difference is only ~0.1 K

  12. Mixed mode versus dm on XT4 and XT5

  13. Compiler notes II
     • PathScale dm+sm binary is faster than the PGI version
     • dm+sm is faster than dm on 512+ cores
       – Reduced MPI communications
       – Better use of cache
     • WRF is generally faster on the 2.3 GHz quad-core XT4 than on the 2.4 GHz hexa-core XT5
       – Only the dm+sm version comes close to overcoming the difference

  14. Under-populating XT5 nodes
     • De-populating steadily reduces time in both user and MPI code
     • Rate of cache fills for user code steadily increases: the ‘memory wall’

  15. Improving cache usage
     • Efficient use of the large, on-chip memory cache is very important in getting high performance from x86-type chips
     • Under MPI, WRF gives each process a 'patch' to work on. These patches can be further decomposed into 'tiles' (used by the OpenMP implementation), e.g. decomposition of the domain into four patches with each patch containing six tiles (illustrated schematically in the sketch below).
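
To make the patch/tile structure concrete, here is a minimal, self-contained Fortran sketch of a tiled sweep over one MPI patch under OpenMP. Names such as num_tiles, i_start and j_start are illustrative (loosely modelled on the structure of WRF's solver), and the patch extents are arbitrary; this is not actual WRF source.

      ! Schematic tiled sweep over a single MPI 'patch' (illustrative, not WRF source).
      program tile_sweep
        implicit none
        integer, parameter :: num_tiles = 6        ! tiles per patch (cf. six tiles above)
        integer, parameter :: ips = 1, ipe = 178   ! patch extents (arbitrary)
        integer, parameter :: jps = 1, jpe = 98
        integer :: i_start(num_tiles), i_end(num_tiles)
        integer :: j_start(num_tiles), j_end(num_tiles)
        real    :: field(ips:ipe, jps:jpe)
        integer :: ij, i, j

        ! Cut the patch into num_tiles strips in j (one simple tiling choice);
        ! smaller tiles mean smaller working sets, which is what helps the cache.
        do ij = 1, num_tiles
          i_start(ij) = ips
          i_end(ij)   = ipe
          j_start(ij) = jps + (ij - 1) * (jpe - jps + 1) / num_tiles
          j_end(ij)   = jps + ij * (jpe - jps + 1) / num_tiles - 1
        end do
        field = 0.0

        ! Each OpenMP thread processes whole tiles (the 'sm' part of dm+sm).
        !$omp parallel do private(ij, i, j)
        do ij = 1, num_tiles
          do j = j_start(ij), j_end(ij)
            do i = i_start(ij), i_end(ij)
              field(i, j) = field(i, j) + 1.0
            end do
          end do
        end do
        !$omp end parallel do

        print *, 'sum over patch = ', sum(field)
      end program tile_sweep

Built with, for example, gfortran -fopenmp (or the PGI -mp flag); in a pure-MPI (dm) build the same tile loop simply runs serially within each patch.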

  16. Performance variation with tiling

  17. Notes on tiling performance
     • Most effect on low core-count jobs because these have large patches and thus large array extents
     • In this case, still get ~5% speed-up by using four tiles for both 512- and 1024-core MPI jobs (see the namelist sketch below)
     • HWPC data shows that the improvement is largely due to better use of the L2 ‘victim’ cache (20% hit rate => 70+% hit rate)
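
A minimal sketch of how the tile count is chosen at run time: numtiles lives in the &domains section of namelist.input, so no rebuild is needed (the value 4 simply mirrors the four-tile case above):

      &domains
       numtiles = 4,   ! tiles per MPI patch; the default is 1
      /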

  18. I/O Considerations
     • All benchmark results presented so far carefully exclude the effects of doing I/O
     • But data MUST be written to file for a job to be scientifically useful…
     • Data are written as ‘frames’ – a snapshot of the system at a given point in time (see the namelist sketch below)
       – One frame for GNR is ~1.6 GB in total, but this is spread across 3 files (1 per domain) and many variables
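
For reference, the frame frequency and the number of frames packed into each per-domain output file are also namelist settings; the values below are purely illustrative and are not those used for the GNR benchmark:

      &time_control
       history_interval   = 60, 60, 60,   ! minutes between history frames, per domain
       frames_per_outfile =  1,  1,  1,   ! frames written to each output file, per domain
      /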

  19. Approaches to I/O in WRF
     • Default: data for the whole model domain are gathered on the ‘master’ PE, which then writes to disk
       – All PEs block while the master is writing
       – Does not scale
       – Memory limitations

  20. Parallel netCDF (pNetCDF)
     • Uses the pNetCDF library from Argonne
     • Every PE writes (see the namelist sketch below)
     • Current method of last resort when the domain won’t fit into the memory of a single PE
       – Will become more of a problem as model sizes and numbers of cores/socket increase
     • Slow – lots of small writes
       – e.g. for a 256-core job, the mean time to write domain 3 with the default method is 12 s; this increases to 103 s with parallel netCDF!
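
Switching to pNetCDF is a namelist choice rather than a code change; a minimal sketch, assuming WRF has been built against the pNetCDF library (io_form value 11; the default serial-netCDF scheme is 2):

      &time_control
       io_form_history = 11,   ! 11 = pNetCDF; 2 = serial netCDF (the default scheme)
       io_form_restart = 11,
      /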

  21. I/O Quilting
     • Use dedicated ‘I/O servers’ to write data (see the namelist sketch below)
     • Compute PEs are free to continue once data are sent to the I/O servers
       – No longer have to block while data are written to disk
     • Number of I/O servers may be tuned to minimise the time taken to gather data
     • Only the ‘master’ I/O server currently writes
       – Domain must still fit into memory
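
Quilting is enabled through the &namelist_quilt section of namelist.input; a hedged sketch, with server counts that are purely illustrative and should be tuned as noted above (the I/O servers are extra MPI ranks on top of the compute decomposition):

      &namelist_quilt
       nio_tasks_per_group = 4,   ! I/O servers per group (0 disables quilting)
       nio_groups          = 1,   ! number of groups of I/O servers
      /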

  22. Process mapping
     (Diagram: compute processes, I/O processes and the MPI communicator used for I/O.)
     • How best to assign compute PEs to I/O servers?
     • By default, all I/O servers end up grouped together on a few compute nodes

  23. I/O quilting performance

  24. Effect of process mapping

  25. Conclusions
     • PGI is best for the dm build, PathScale for the sm+dm build
     • sm+dm scales best and performs much better than dm on the fatter nodes of the XT5
       – Less MPI communication
       – Better cache usage
     • Codes like WRF that are memory-bandwidth bound are not well served by the proliferation of cores per socket
     • I/O quilting reduces the time lost to I/O and is insensitive to process placement/mapping

  26. Acknowledgements
     • EPSRC and NAG, UK for funding
     • Alan Gadian, Ralph Burton (University of Leeds) and Michael Bane (University of Manchester) for project direction
     • John Michalakes (NCAR) for problem-solving assistance and advice
     andrew.porter@stfc.ac.uk
