An Agile Approach to Building a GPU-enabled and Performance-portable Global Cloud-resolving Atmospheric Model

An Agile Approach to Building a GPU-enabled and Performance-portable Global Cloud-resolving Atmospheric Model. Dr. Richard Loft*, Director, Technology Development, CISL/NCAR. *National Center for Atmospheric Research. GTC, San Jose, CA, March 26, 2018


  1. An Agile Approach to Building a GPU-enabled and Performance-portable Global Cloud-resolving Atmospheric Model
     Dr. Richard Loft*, Director, Technology Development, CISL/NCAR
     *National Center for Atmospheric Research
     GTC, San Jose, CA, March 26, 2018

  2. Outline
     • Origins / Backstory
     • The MPAS Model
     • Team
     • Tools and Design
     • Status

  3. Project began with research based on student projects
     • Over two years, the Summer Internships in Parallel Computational Science (SIParCS) program at NCAR funded student projects on architectural inter-comparison.
     • Projects focused on optimizing atmospheric numerical PDE solvers for both CPUs and GPUs with performance portability in mind.
     • Architectures compared:
       o Xeon Broadwell, Haswell;
       o Xeon Phi KNL;
       o NVIDIA Tesla P100 -> V100.

  4. Optimizing Stencils for different architectures
     Benchmark Problem
     • Shallow Water Equations (SWE)
       – A set of non-linear partial differential equations (PDE)
       – Capture features of atmospheric flow around the Earth
     • Radial basis function-generated finite difference (RBF-FD) methods
       – Evaluate differential operator D at every point
     [Figure: An example of a 75-point stencil on a sphere, showing stencil points and non-stencil points [1]]
     [Figure: RBF-FD solution to the SWE test case “Flow over an isolated mountain” (cone-shaped mountain) using 655,532 points, at Day 1 and Day 15 [1]]
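The step "evaluate differential operator D at every point" amounts to a weighted sum over each point's stencil neighbors. The following is a minimal Python sketch of that idea, not the presenters' code: the function name `apply_stencil_operator`, the neighbor and weight layout, and the small 3-point periodic example are all assumptions made for illustration (the slide describes 75-point stencils on a sphere, with weights that an RBF-FD method would precompute).

```python
def apply_stencil_operator(values, neighbors, weights):
    """Approximate (D u) at every point as a weighted sum over its stencil.

    values:    u at each of the N points
    neighbors: neighbors[i] lists the point indices in point i's stencil
    weights:   weights[i][k] is the weight applied to neighbors[i][k]
               (supplied by the caller here; this layout is an assumption)
    """
    result = []
    for idx, w in zip(neighbors, weights):
        # Indirect (unstructured) gather followed by a dot product.
        result.append(sum(wk * values[j] for j, wk in zip(idx, w)))
    return result


# Hypothetical 1-D example: centered first-derivative weights on a
# periodic grid with spacing h, applied to u(x) = x.
h = 0.5
u = [i * h for i in range(6)]
n = len(u)
nbrs = [[(i - 1) % n, i, (i + 1) % n] for i in range(n)]
wts = [[-1.0 / (2 * h), 0.0, 1.0 / (2 * h)] for _ in range(n)]
du = apply_stencil_operator(u, nbrs, wts)
```

Away from the periodic wrap-around, `du` recovers the slope of `u`. The indirect indexing through `neighbors` is what makes the stencil unstructured.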

  5. Directive-based portability in the RBF-FD shallow water equations (2-D unstructured stencil)
     • CI roofline model generally predicts performance well, even for more complicated algorithms.
     • Xeon performance crashes to DRAM BW limit when cache size is exceeded, with some state reuse.
     • Xeon Phi (KNL) HBM memory is less sensitive to problem size than Xeon, and saturates at the CI figure.
     • NVIDIA Pascal P100 performance fits the CI model. GPUs require higher levels of parallelism to reach saturation.
     [Figure: Performance (GFLOPS), 0 to 350, for Broadwell, KNL, and P100, with regions labeled "Insufficient Workload Parallelism" and "Sufficient Workload Parallelism"]
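The roofline reasoning behind these bullets can be sketched in a few lines, assuming "CI" denotes computational intensity in flops per byte. The basic roofline bound is the smaller of peak compute and CI times memory bandwidth. The function name and every machine number below are hypothetical placeholders, not measurements from the talk.

```python
def roofline_gflops(ci_flops_per_byte, peak_gflops, bandwidth_gb_per_s):
    """Attainable GFLOPS under the basic roofline model."""
    return min(peak_gflops, ci_flops_per_byte * bandwidth_gb_per_s)


# Hypothetical machine parameters (placeholders for illustration only).
PEAK_GFLOPS = 1000.0
CACHE_BW_GB_S = 400.0   # effective bandwidth while the problem fits in cache
DRAM_BW_GB_S = 100.0    # effective bandwidth once the cache is exceeded

ci = 0.5  # flops per byte, assumed fixed for the kernel

in_cache = roofline_gflops(ci, PEAK_GFLOPS, CACHE_BW_GB_S)
out_of_cache = roofline_gflops(ci, PEAK_GFLOPS, DRAM_BW_GB_S)
compute_bound = roofline_gflops(20.0, PEAK_GFLOPS, DRAM_BW_GB_S)
```

With these placeholder values, the bound drops by the bandwidth ratio when the working set leaves cache, which is the kind of fall to the DRAM limit described for Xeon.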

  6. What is MPAS? – The Model for Prediction Across Scales
     NCAR’s Global Meteorological/Climate Model; ~100,000 SLOC
     [Figure: Simulation of 2012 Tropical Cyclones at 4 km Resolution – Courtesy of Falko Judt, NCAR]

  7. Weather and Climate Alliance (WACA):
     • NCAR
     • NVIDIA Corporation
     • IBM Corporation/The Weather Company
     • University of Wyoming, CE&EE Department
     • Korea Institute of Science and Technology Information (KISTI)

  8. Initial Divide and Conquer Strategy
     [Diagram labels: MPAS Dynamics; MPAS Physics; Ideas and Results; Problem Reports and Support]

  9. Weather and Climate Alliance (WACA): A Collaboration for Earth System Model Acceleration
     • NCAR (2+4)
       o Dr. Rich Loft, Director TDD
       o Dr. Raghu Raj Kumar, Project Scientist TDD
       o Clint Olson, TDD
       o Bill Skamarock, Senior Scientist, MMM
       o Michael Duda, Software Engineer, MMM
       o Dave Gill, Software Engineer, MMM
     • KISTI (2+1)
       o Minsu Joh, KISTI Director, Disaster Management Research Center
       o Dr. Ji-Sun Kang, Senior Researcher
       o Jae-Youp Kim, GRA
     • NVIDIA/PGI (1+3)
       o Greg Branch, NVIDIA, Sales
       o Dr. Carl Ponder, Senior Applications Engineer
       o Brent Leback, PGI Compiler Engineering Manager
       o Craig Tierny, Solutions Architect
     • University of Wyoming (1+5)
       o Dr. Suresh Muknahallipatna, Professor E&CE, UW
       o Supreeth Suresh, Pranay Reddy, Sumathi Lakshmiranganathan, Cena Miller, Bradley Riotto (GRAs)
     ~6 PI + 13 technical staff
     Started in September 2016 (18 months)
     ~9 FTE-years

  10. Since September: added IBM and The Weather Company
      • IBM/TWC participants (1+2)
        o Jaime Moreno
        o Todd Hutchinson
        o Constantinos Evangelinos
      [Diagram label: Problem Reports and Support]

  11. Tools for Accelerating Code Optimization
      • Kernel GENerator (KGEN)
        o Extracts kernels from Fortran applications
        o Creates:
          • Standalone source code
          • Input and output state for verification
          • Added support for code coverage and representation
        o Broad user community
          • 8 domestic institutions
          • 5 international institutions
          • 1 company
        o Available on GitHub: https://github.com/NCAR/KGen
      KGEN is a useful tool for accelerating code porting and optimization

  12. MPAS Synchronous and Asynchronous Execution
      [Diagram: execution timeline over Δt. Labels: Dynamics and Physics; LW and SW Radiation; Land Surface; Asynch I/O; Disk]

  13. Phase 2: pushing on to a full MPAS port
      • Status of GPU-based model components
        o Ported, optimized, verified
          • Dry dynamical core
          • GPU-direct implementation of MPAS halo exchanges
        o Ported, optimized
          • Moist dynamics (tracer transport)
          • Xu-Randall Cloud fraction
        o Ported, undergoing optimization
          • WSM6 Microphysics
          • YSU Boundary layer scheme
        o Awaiting Porting
          • Scale Insensitive Tiedtke convection scheme
          • Monin-Obukhov surface layer scheme
      • CPU-based components
        o Overlapping SW and LW RRTMG Radiation (lagged radiation)
        o NOAH Land Surface Model (synchronous, remains on CPU)
        o SIONlib I/O subsystem
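The "overlapping ... (lagged radiation)" item suggests a control flow in which radiation for one step is computed in the background while the rest of the model advances, and the result is applied one step later. The Python sketch below illustrates only that control flow under this reading. The functions `radiation` and `dynamics_and_physics` are hypothetical stand-ins with made-up arithmetic, and a worker thread stands in for the CPU side of a CPU/GPU overlap.

```python
from concurrent.futures import ThreadPoolExecutor


def radiation(state):
    # Hypothetical stand-in for the CPU radiation calculation.
    return 0.1 * state


def dynamics_and_physics(state, rad_tendency):
    # Hypothetical stand-in for the accelerated dynamics/physics step.
    return state + 1.0 + rad_tendency


def run_lagged(state, nsteps):
    """Advance nsteps, applying radiation computed from the previous state."""
    rad_tendency = 0.0  # no radiation result exists before the first step
    with ThreadPoolExecutor(max_workers=1) as pool:
        for _ in range(nsteps):
            # Launch radiation on the current state in the background...
            pending = pool.submit(radiation, state)
            # ...while the model step proceeds with the lagged tendency.
            state = dynamics_and_physics(state, rad_tendency)
            # Collect the result; it is applied on the next step.
            rad_tendency = pending.result()
    return state


final_state = run_lagged(0.0, 3)
```

The design trade-off is that each step uses a radiation tendency that is one step old, in exchange for keeping the radiation calculation off the critical path.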

  14. IBM/TWC MPAS Objectives
      • MPAS grid with local refinement
        • 12 km global grid
        • 3 km refinement over selected regions
        • 32.8 M horizontal points
        • 56 layers
      • Forecast requirement
        • Complete 20 hour simulation
        • …in 45 minutes
        • xRe = 26.7
        • For Δt = 18 sec, timestep budget is 0.674 seconds
      [Figure: 24-hour global forecasts. Refined grids can be generated anywhere desired.]
      Dr. Kumar will show next that as few as 800 V100s could achieve this goal…
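The forecast-requirement numbers can be reproduced with a short calculation, assuming "xRe" denotes the ratio of simulated time to wall-clock time. The variable names are illustrative. The slide's 0.674 s follows from dividing the timestep by the rounded ratio 26.7; the unrounded budget is 0.675 s.

```python
SIM_HOURS = 20        # simulated forecast length from the slide
WALL_MINUTES = 45     # wall-clock limit from the slide
DT_SECONDS = 18       # model timestep from the slide

sim_seconds = SIM_HOURS * 3600          # 72000 s of simulated time
wall_seconds = WALL_MINUTES * 60        # 2700 s of wall-clock time

speedup = sim_seconds / wall_seconds    # simulated time / wall-clock time
n_steps = sim_seconds // DT_SECONDS     # number of timesteps required
budget = wall_seconds / n_steps         # wall-clock seconds per timestep

# Same budget computed from the rounded ratio, as the slide appears to do.
budget_from_rounded = DT_SECONDS / round(speedup, 1)

# Total grid cells: horizontal points times vertical layers.
total_cells = 32_800_000 * 56
```

This gives 4000 timesteps and roughly 1.84 billion grid cells to update within each per-step budget.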
