GPU Acceleration of Large Scale Fluid Dynamics Scientific Codes
Jenniffer Estrada and Joseph Schoonover | Los Alamos National Laboratory | LA-UR-17-23350


  1. Operated by Los Alamos National Security, LLC for the U.S. Department of Energy's NNSA

  2. Los Alamos National Laboratory | LA-UR-17-23350. GPU Acceleration of Large Scale Fluid Dynamics Scientific Codes. Jenniffer Estrada (LANL, Research Scientist, jme@lanl.gov) and Joseph Schoonover (CIRES, Research Scientist, jschoonover@lanl.gov). GTC 2017, May 10, 2017.

  3. Motivation • Scale Interactions • Resolution vs Resources • Why are they bragging?

  4. The Scientific Codes - SELF • Continuous and Discontinuous Galerkin (polynomial based) Nodal Spectral Element Method • Oceanographic and Geophysical Modelling • Target problem: model how a large scale flow (~100 km) catalyzes the formation of small features (~1 km) through interactions with topography (vorticity on the Gulf Stream shelf) • 10-million degrees of freedom

  5. SELF-DGSEM Algorithm

  6. Progression • Hot Spot Identification (MappedTimeDerivative) • Software Changes • Message Passing

  7. Progression (Continued) • CPU: AMD Opteron (16 core), GPU: Tesla K20X • Serial (single core): original 110.8 sec, after changes 127.2 sec • OpenACC: 5.3 sec
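
A minimal sketch of the directive-based approach behind these numbers, assuming a Fortran loop nest over spectral elements and quadrature nodes; the program, array names, and sizes below are illustrative and are not the actual SELF MappedTimeDerivative code:

program acc_port_sketch
  implicit none
  integer, parameter :: nEl = 1000          ! number of spectral elements (illustrative)
  integer, parameter :: nP  = 8             ! nodes per direction (polynomial degree 7)
  real(8) :: f(0:nP-1,0:nP-1,0:nP-1,nEl)    ! a solution field
  real(8) :: dfdx(0:nP-1,0:nP-1,0:nP-1,nEl) ! its computed derivative
  real(8) :: D(0:nP-1,0:nP-1)               ! 1-D spectral derivative matrix
  integer :: iEl, i, j, k, m

  call random_number(f)
  call random_number(D)

  ! Move data to the GPU once; only the result comes back to the host.
  !$acc data copyin(f, D) copyout(dfdx)
  ! Collapse the element and node loops to expose enough parallelism for the GPU.
  !$acc parallel loop collapse(4) private(m)
  do iEl = 1, nEl
    do k = 0, nP-1
      do j = 0, nP-1
        do i = 0, nP-1
          dfdx(i,j,k,iEl) = 0.0d0
          do m = 0, nP-1
            dfdx(i,j,k,iEl) = dfdx(i,j,k,iEl) + D(i,m)*f(m,j,k,iEl)
          end do
        end do
      end do
    end do
  end do
  !$acc end data

  print *, 'sample value:', dfdx(0,0,0,1)
end program acc_port_sketch

Built with a directive-aware compiler (for example nvfortran -acc) this runs on the GPU; the same source still compiles as plain serial Fortran when the directives are ignored, which matches the incremental porting progression described on this slide.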

  8. 1.5x to Ideal

  9. [Plots: speedup, scaling efficiency, and runtime (RK3, sec) versus number of threads (1-16), compared against ideal scaling, for three OpenMP strategies] • Multiple parallel regions (MPR): standard OpenMP • Single parallel region (SPR): high-level OpenMP without loop-bound assignments • Single parallel region with loop-bound assignments (SPR-LA): high-level OpenMP • Reference: Yuliana Zamora and Robert Robey, Effective OpenMP Implementations, https://www.lanl.gov/projects/national-security-education-center/information-science-technology/summer-schools/parallelcomputing/_assets/images/2016projects/Zamora.pdf; https://anl.app.box.com/v/IXPUG2016-presentation-23
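
For readers unfamiliar with the three strategies compared in these plots, here is a minimal, hedged sketch of what MPR, SPR, and SPR-LA can look like in Fortran with OpenMP; the arrays and the work inside the loops are stand-ins, not code from SELF or the cited study:

program omp_regions_sketch
  use omp_lib
  implicit none
  integer, parameter :: n = 1000000
  real(8), allocatable :: a(:), b(:)
  integer :: i, tid, nthreads, istart, iend, chunk

  allocate(a(n), b(n))
  a = 1.0d0
  b = 0.0d0

  ! --- MPR: multiple parallel regions, one fork/join per loop ---
  !$omp parallel do
  do i = 1, n
    b(i) = 2.0d0*a(i)
  end do
  !$omp end parallel do
  !$omp parallel do
  do i = 1, n
    a(i) = a(i) + b(i)
  end do
  !$omp end parallel do

  ! --- SPR: one enclosing parallel region with worksharing loops inside ---
  !$omp parallel private(i)
  !$omp do
  do i = 1, n
    b(i) = 2.0d0*a(i)
  end do
  !$omp end do
  !$omp do
  do i = 1, n
    a(i) = a(i) + b(i)
  end do
  !$omp end do
  !$omp end parallel

  ! --- SPR-LA: single parallel region, loop bounds assigned by hand per thread ---
  !$omp parallel private(i, tid, nthreads, istart, iend, chunk)
  tid      = omp_get_thread_num()
  nthreads = omp_get_num_threads()
  chunk    = (n + nthreads - 1)/nthreads
  istart   = tid*chunk + 1
  iend     = min((tid + 1)*chunk, n)
  do i = istart, iend
    b(i) = 2.0d0*a(i)
  end do
  !$omp barrier
  do i = istart, iend
    a(i) = a(i) + b(i)
  end do
  !$omp end parallel

  print *, 'checksum:', sum(a)
end program omp_regions_sketch

The point of SPR and SPR-LA is to pay the fork/join and scheduling cost once per region rather than once per loop, which is what the runtime and efficiency curves above are comparing.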

  10. Higher is better!

  11. Thermal Bubble • Initial conditions consist of an anomalous warm blob in an otherwise neutral stratification • Domain size: 10,000 m (cube) • Discretization: Discontinuous Galerkin Spectral Element Method, 20x20x20 elements, polynomial degree 7 • Laplacian diffusion: 0.8 m^2/s • Simulation time: 37 minutes
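
As a reading aid, the run configuration above could be captured in a Fortran namelist along these lines; the variable names are invented for illustration and this is not the actual SELF input format:

program thermal_bubble_config
  implicit none
  ! Values quoted on the slide; names are illustrative only.
  real(8) :: domain_length = 10000.0d0        ! m, edge of the cubic domain
  integer :: nElems(3)     = (/ 20, 20, 20 /) ! elements per direction
  integer :: poly_degree   = 7                ! polynomial degree
  real(8) :: diffusivity   = 0.8d0            ! m^2/s, Laplacian diffusion
  real(8) :: sim_time      = 37.0d0*60.0d0    ! s of simulated time (37 minutes)
  namelist /thermal_bubble/ domain_length, nElems, poly_degree, diffusivity, sim_time

  write(*, nml=thermal_bubble)   ! echo the configuration
end program thermal_bubble_config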

  12. [No slide text; figure only]

  13. Thermal Bubble (UNM Xena; CPU: Intel Xeon, GPU: Tesla K40m) • 1.2 million time steps • Wall time (CPU): 37 days • Wall time (GPU): 24 hours, 13 min

  14. Across Architectures • Benchmarks for ForwardStepRK3 (Euler 3-D) • Tests run with CUDA Fortran, polynomial degree = 7, Laplacian diffusion, 15x15x15 elements, 5 time steps (footprint: ~1.9 GB memory)
      GPU Model             | CPU Model            | Serial Time  | GPU Runtime | Speedup
      Tesla K40m            | Intel Xeon E5-2683   | 45.969 sec   | 1.282 sec   | 35.854x
      GeForce GTX Titan X   | Intel Xeon E3-1285L  | 35.672 sec   | 1.159 sec   | 30.775x
      Tesla P100-SXM2-16GB  | Power8 NVL           | 49.588 sec   | 0.439 sec   | 112.913x

  15. Going Forward • Initial development of hybrid GPU-MPI code is underway (to improve weak scaling) • Use GPU-Direct technology to avoid the explicit CPU-GPU copy (see the sketch below) • Continue to update the data structure layout and CUDA kernel implementations to improve memory access patterns on the GPU
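
A minimal sketch of the GPU-Direct idea, assuming a CUDA-aware MPI library and an OpenACC-managed buffer; the buffer size, ring exchange, and names are illustrative and not taken from the SELF code:

program gpudirect_sketch
  use mpi
  implicit none
  integer, parameter :: nFace = 64*64*5     ! illustrative boundary-buffer size
  real(8), allocatable :: sendBuf(:), recvBuf(:)
  integer :: ierr, rank, nprocs, dest, src, tag
  integer :: status(MPI_STATUS_SIZE)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  allocate(sendBuf(nFace), recvBuf(nFace))
  sendBuf = real(rank, 8)
  recvBuf = 0.0d0
  dest = mod(rank + 1, nprocs)            ! simple ring exchange for the sketch
  src  = mod(rank - 1 + nprocs, nprocs)
  tag  = 0

  !$acc data copyin(sendBuf) copy(recvBuf)
  ! host_data exposes the device addresses to the (CUDA-aware) MPI library,
  ! so the exchange goes GPU-to-GPU without staging through host memory.
  !$acc host_data use_device(sendBuf, recvBuf)
  call MPI_Sendrecv(sendBuf, nFace, MPI_DOUBLE_PRECISION, dest, tag, &
                    recvBuf, nFace, MPI_DOUBLE_PRECISION, src, tag,  &
                    MPI_COMM_WORLD, status, ierr)
  !$acc end host_data
  !$acc end data

  if (rank == 0) print *, 'received from neighbor:', recvBuf(1)
  call MPI_Finalize(ierr)
end program gpudirect_sketch

Without GPU-Direct, the same exchange would need an !$acc update host(sendBuf) before the call and an !$acc update device(recvBuf) after it; removing those staging copies is the saving this slide refers to.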

  16. The Scientific Codes - Higrad • The fluid dynamics core of Higrad solves the same set of equations (compressible Navier-Stokes) using a Finite Volume discretization • Atmospheric Modelling • Couples with other modules (e.g. FIRETEC for wildland fire modelling)

  17. OpenCL, OpenMP, CUDA Fortran, OpenMPI, OpenACC

  18. Progression • Bottom-Up Approach

  19. Progression (Continued)

  20. Progression • GPU enabled with OpenACC • Memory handling with CUDA Fortran • Compute-intensive kernels currently on the GPU (see the sketch below)
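
A hedged sketch of the arrangement this slide describes, assuming the NVIDIA/PGI compiler's CUDA Fortran and OpenACC interoperability (e.g. nvfortran -acc -cuda); the module, field names, and right-hand side are invented for illustration:

module fields
  use cudafor
  implicit none
  real(8), allocatable         :: u_host(:)             ! host mirror
  real(8), allocatable, device :: u_dev(:), rhs_dev(:)  ! resident on the GPU
contains
  subroutine advance(n, nsteps, dt)
    integer, intent(in) :: n, nsteps
    real(8), intent(in) :: dt
    integer :: i, step
    do step = 1, nsteps
      ! OpenACC kernel working directly on the CUDA Fortran device arrays;
      ! no data clauses are needed because the arrays already live on the GPU.
      !$acc parallel loop
      do i = 1, n
        rhs_dev(i) = -u_dev(i)               ! stand-in for the real right-hand side
        u_dev(i)   = u_dev(i) + dt*rhs_dev(i)
      end do
    end do
  end subroutine advance
end module fields

program cuf_memory_sketch
  use fields
  implicit none
  integer, parameter :: n = 1000000
  allocate(u_host(n), u_dev(n), rhs_dev(n))
  u_host = 1.0d0
  u_dev  = u_host        ! one host-to-device copy via CUDA Fortran assignment
  call advance(n, 100, 1.0d-3)
  u_host = u_dev         ! copy results back only when needed on the host
  print *, 'u(1) after 100 steps:', u_host(1)
end program cuf_memory_sketch

Keeping the fields device-resident across all time steps, with CUDA Fortran owning the allocations and OpenACC providing the kernels, is the division of labour the slide describes.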

  21. Going Forward (Continued) • Higrad: finish memory handling with CUDA Fortran; scaling with GPU-aware MPI; CUDA implementation • Higrad/FIRETEC: Gatlinburg fire simulation on Titan • Mission needs a 10x larger problem size (prior limit: 1.6 billion cells) run 10x faster

  22. Where Are We Going With This? • Suppose we are working on a problem with 100,000 elements and need to perform 10,000,000 time steps (not unrealistic for scale-interaction problems); the projected runtimes would then be: • T_serial = c_serial × 100,000 × 10,000,000 ≈ 4 years • T_ideal = c_ideal × 100,000 × 10,000,000 ≈ 3 months • T_gpu = c_gpu × 100,000 × 10,000,000 ≈ 1.8 months • The reduction in wall time for small problems translates to huge potential savings for larger problems! (A back-of-envelope check follows below.)
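
A back-of-envelope check of these estimates (not from the slides): the implied per-element, per-time-step cost constants, back-solved from the quoted wall times:

\begin{align*}
  N_{\text{elements}} \times N_{\text{steps}} &= 10^{5} \times 10^{7} = 10^{12}\ \text{element-steps}\\
  c_{\text{serial}} &\approx \frac{4\ \text{yr}\ (\approx 1.3\times10^{8}\ \text{s})}{10^{12}}
    \approx 1.3\times10^{-4}\ \text{s per element-step}\\
  c_{\text{ideal}} &\approx \frac{3\ \text{mo}\ (\approx 7.9\times10^{6}\ \text{s})}{10^{12}}
    \approx 7.9\times10^{-6}\ \text{s per element-step}\\
  c_{\text{gpu}} &\approx \frac{1.8\ \text{mo}\ (\approx 4.7\times10^{6}\ \text{s})}{10^{12}}
    \approx 4.7\times10^{-6}\ \text{s per element-step}
\end{align*}

At roughly 4.7 microseconds per element per step on the GPU versus about 130 microseconds serially, this is consistent with the order-of-magnitude speedups reported earlier in the deck.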

  23. Acknowledgements • This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725. • Special thanks to Fernanda Foertter (ORNL), Jeff Larkin (NVIDIA), David Norton (NVIDIA/PGI), Frank Winkler (ORNL), Matt Otten (LLNL)

  24. Questions?
