  1. OPENACC & OPENMP4.5 OFFLOADING: SPEEDING UP SIMULATIONS OF STELLAR EXPLOSIONS (Tom Papatheodore, May 9, 2017)

  2. OAK RIDGE LEADERSHIP COMPUTING FACILITY
  Center for Accelerated Application Readiness (CAAR): preparing codes to run on the upcoming (CORAL) Summit supercomputer at ORNL
  • Summit – IBM POWER9 + NVIDIA Volta
  • EA System – IBM POWER8 + NVIDIA Pascal
  FLASH – adaptive-mesh, multi-physics simulation code widely used in astrophysics: http://flash.uchicago.edu/site/
  This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.

  3. SUPERNOVAE: What are supernovae?
  • Supernovae – exploding stars
  • Among the most energetic events in the universe
  • Contribute to galactic dynamics
  • Create heavy elements (e.g. iron, calcium)
  [Image credits: NASA/CXC/Rutgers/J.Warren & J.Hughes et al., http://chandra.harvard.edu/photo/2005/tycho; NASA, ESA, J. Hester and A. Loll (Arizona State University), HubbleSite gallery release]

  4. SUPERNOVAE: Simulating Supernovae
  Requires a multi-physics code:
  • Hydrodynamics
  • Nuclear burning
  • Gravity
  • Equation of state – relationship between thermodynamic variables in a system (e.g. P = P(ρ, T, X))
    • Called many times during a simulation
    • Can be calculated independently
  [Image credit: Jordan et al. (2008), The Astrophysical Journal, 681, 1448-1457]

  5. EQUATION OF STATE: Helmholtz EOS (Timmes & Swesty, 2000)
  • Based on a Helmholtz free energy formulation
  • High-order interpolation from a table of free energy (quintic Hermite polynomials); a 1D sketch of this kind of interpolation follows below
  • Collaborators at Stony Brook University (Mike Zingale, Max Katz, Adam Jacobs) developed an OpenACC version of the Helmholtz EOS, part of a shared repository of microphysics (starkiller) that can run in FLASH as well as BoxLib-based codes such as CASTRO and MAESTRO
  FLASH-CAAR:
  • Install this accelerated version of the EOS into FLASH
  • Created a version using OpenMP4.5 with offloading (as part of a hackathon by IBM's CoE)
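  For illustration only, here is a minimal 1D sketch of quintic Hermite interpolation from tabulated endpoint values and first/second derivatives. The basis functions and helper names are my own assumptions; the actual Helmholtz EOS performs biquintic interpolation in density and temperature over its free-energy table.

      module quintic_hermite
        implicit none
      contains
        ! psi0: 1 at z=0, 0 at z=1; first and second derivatives vanish at both ends
        pure real(8) function psi0(z)
          real(8), intent(in) :: z
          psi0 = z**3 * (z*(-6.0d0*z + 15.0d0) - 10.0d0) + 1.0d0
        end function
        ! psi1: unit first derivative at z=0; all other endpoint data vanish
        pure real(8) function psi1(z)
          real(8), intent(in) :: z
          psi1 = z * (z**2 * (z*(-3.0d0*z + 8.0d0) - 6.0d0) + 1.0d0)
        end function
        ! psi2: unit second derivative at z=0; all other endpoint data vanish
        pure real(8) function psi2(z)
          real(8), intent(in) :: z
          psi2 = 0.5d0 * z*z * (z*(z*(-z + 3.0d0) - 3.0d0) + 1.0d0)
        end function
        ! Interpolate f on [x0, x0+h] from endpoint values and first/second
        ! derivatives, where z = (x - x0)/h is the coordinate within the table cell.
        pure real(8) function herm5(f0, f1, fp0, fp1, fpp0, fpp1, h, z)
          real(8), intent(in) :: f0, f1, fp0, fp1, fpp0, fpp1, h, z
          herm5 = f0*psi0(z) + f1*psi0(1.0d0 - z)                 &
                + h   * ( fp0*psi1(z)  - fp1*psi1(1.0d0 - z) )    &
                + h*h * ( fpp0*psi2(z) + fpp1*psi2(1.0d0 - z) )
        end function
      end module quintic_hermite

      program herm5_demo
        use quintic_hermite
        implicit none
        ! Quintic Hermite data reproduces f(x) = x**3 exactly on [1,2]; both
        ! printed values should be 3.375.
        print *, herm5(1.0d0, 8.0d0, 3.0d0, 12.0d0, 6.0d0, 12.0d0, 1.0d0, 0.5d0), (1.5d0)**3
      end program herm5_demo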

  6. HELMHOLTZ EOS: Driver Program
  To determine the best use of the accelerated EOS in FLASH, we created a driver program that:
  • Mimics the AMR block structure and time stepping in FLASH
  • Loops through several time steps
    • Changes the number of total grid zones
    • Fills these zones with new data
    • Calculates the interpolation in all grid zones
  • How many AMR blocks should we calculate (call the EOS on) at once per MPI rank?

  7. HELMHOLTZ EOS: Basic Flow of Driver Program
  1) Allocate the main data arrays on host and device
  • Arrays of Fortran derived types; each element holds the grid data for a single zone
  • Persist for the duration of the program
  • Used to pass zone data back and forth between host and device
    • Reduced set sent host-to-device
    • Full set sent device-to-host
  (A sketch of this layout follows below.)
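  A minimal sketch of step 1, assuming illustrative type and field names (the real starkiller/FLASH declarations differ); only the array names state and reduced_state are taken from the directives on slide 10:

      program persistent_state_sketch
        implicit none
        type :: eos_reduced_t              ! reduced set: inputs sent host-to-device
          real(8) :: rho, temp, abar, zbar
        end type
        type :: eos_state_t                ! full set: results sent device-to-host
          real(8) :: rho, temp, abar, zbar
          real(8) :: p, e, s, cv, cp
        end type
        integer, parameter :: nzones = 256 * 100   ! e.g. 100 blocks x 256 zones
        type(eos_reduced_t), allocatable :: reduced_state(:)
        type(eos_state_t),   allocatable :: state(:)

        allocate(reduced_state(nzones), state(nzones))
        ! Create matching device copies once; they persist for the whole run, so
        ! each timestep pays only for "update" transfers, never for reallocation.
        !$acc enter data create(reduced_state, state)

        ! ... timestep loop would go here (see slides 9 and 10) ...

        !$acc exit data delete(reduced_state, state)
        deallocate(reduced_state, state)
      end program persistent_state_sketch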

  8. HELMHOLTZ EOS: Basic Flow of Driver Program
  2) Read in the tabulated Helmholtz free energy data and make a copy on the device
  • The device copy persists for the duration of the program
  • Thermodynamic quantities are interpolated from this table
  (A sketch of the one-time copy follows below.)
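  A minimal sketch of step 2, with assumed array names and table dimensions (the real table is read from a file and spans a grid in density and temperature, with several tabulated quantities):

      program table_copy_sketch
        implicit none
        integer, parameter :: imax = 541, jmax = 201   ! assumed table dimensions
        real(8), allocatable :: f_tab(:,:), fd_tab(:,:), ft_tab(:,:)  ! free energy and derivatives

        allocate(f_tab(imax,jmax), fd_tab(imax,jmax), ft_tab(imax,jmax))
        f_tab = 0.0d0; fd_tab = 0.0d0; ft_tab = 0.0d0  ! placeholder for reading the table file

        ! One-time host-to-device copy; the device copy persists for the whole run
        ! and is only ever read by the EOS kernels.
        !$acc enter data copyin(f_tab, fd_tab, ft_tab)

        ! ... timestep loop ...

        !$acc exit data delete(f_tab, fd_tab, ft_tab)
      end program table_copy_sketch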

  9. HELMHOLTZ EOS: Basic Flow of Driver Program
  3) For each timestep, inside a traditional OpenMP (CPU) parallel region (each thread gets a portion of the total zones):
  • Change the number of AMR blocks
  • Update the device with new grid data
  • Launch the EOS kernel: calculate the interpolation for all grid zones
  • Update the host with the newly calculated quantities
  (A sketch of the host-side thread partitioning follows below; the offload directives themselves are on the next slide.)
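  A minimal sketch of that host-side partitioning, using the variable names that appear in the directive excerpt on slide 10 (the chunking formula itself is an assumption):

      program zone_partition_sketch
        use omp_lib
        implicit none
        integer, parameter :: nzones = 256 * 1000   ! e.g. 1000 blocks x 256 zones
        integer :: thread_id, nthreads, chunk, start_element, stop_element

        ! Each CPU thread owns a contiguous range of zones and drives its own
        ! H2D update, kernel launch, and D2H update for that range (slide 10).
        !$omp parallel private(thread_id, nthreads, chunk, start_element, stop_element)
        thread_id = omp_get_thread_num()
        nthreads  = omp_get_num_threads()
        chunk     = (nzones + nthreads - 1) / nthreads
        start_element = thread_id * chunk + 1
        stop_element  = min(nzones, start_element + chunk - 1)
        print '(a,i0,a,i0,a,i0)', 'thread ', thread_id, ': zones ', start_element, ' to ', stop_element
        !$omp end parallel
      end program zone_partition_sketch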

  10. HELMHOLTZ EOS: Basic Flow of Driver Program

  OpenACC:

      !$acc update device(reduced_state(start_element:stop_element)) async(thread_id + 1)
      !$acc kernels async(thread_id + 1)
      do zone = start_element, stop_element
        call eos(state(zone), reduced_state(zone))
      end do
      !$acc end kernels
      !$acc update self(state(start_element:stop_element)) async(thread_id + 1)
      !$acc wait

  OpenMP4.5:

      !$omp target update to(reduced_state(start_element:stop_element))
      !$omp target
      !$omp teams distribute parallel do thread_limit(128) num_threads(128)
      do zone = start_element, stop_element
        call eos(state(zone), reduced_state(zone))
      end do
      !$omp end teams distribute parallel do
      !$omp end target
      !$omp target update from(state(start_element:stop_element))

  11. HELMHOLTZ EOS: Driver Program Tests
  • Number of "AMR" blocks: 1, 10, 100, 1000, 10000 (each with 256 zones)
  • Ran with 1, 4, and 10 (CPU) OpenMP threads for each block count
  • OpenACC (PGI 16.10)
  • OpenMP4.5 (XL 15.1.5)

  12. OPENACC VS OPENMP4.5: Current Functionality for Offloading to GPUs
  • PGI's OpenACC implementation (version 16.10) has a more mature API
    • It has simply been around longer
  • IBM's XL Fortran implementation of OpenMP4.5 (version 15.1.5)
    • Does not currently allow pinned memory or asynchronous data transfers / kernel execution
    • Coming soon

  13. HELMHOLTZ EOS: Preliminary Performance Results
  • At low AMR block counts, kernel overhead is large and kernel execution time does not increase much with block count

  14. HELMHOLTZ EOS: Preliminary Performance Results
  • The same is true for the multi-threaded case, but the overhead increases for each "send, compute, receive" sequence

  15. HELMHOLTZ EOS: Preliminary Performance Results
  • At higher block counts, kernel overhead is negligible; the total time is now dominated by D2H transfers

  16. HELMHOLTZ EOS: Preliminary Performance Results
  • The same is true for the multi-threaded case, even with overlap of data transfers and kernel execution

  17. HELMHOLTZ EOS: Preliminary Performance Results
  • At low AMR block counts, kernel execution times are roughly the same, and these dominate the overall time of each "send, compute, receive" sequence
  • Why are the kernel times so large? Temporary variables? CUDA thread scheduling?

  18. HELMHOLTZ EOS: Preliminary Performance Results
  • Similar behavior for the multi-threaded case, but with serialized launch overhead

  19. HELMHOLTZ EOS: Preliminary Performance Results
  • At larger AMR block counts, kernel times still dominate, but now we are saturating the GPUs

  20. HELMHOLTZ EOS: Preliminary Performance Results
  • Similar behavior for the multi-threaded case, but no benefit from asynchronous execution

  21. HELMHOLTZ EOS: Preliminary Performance Results
  • Compare with CPU-only OpenMP (dashed lines): 1, 4, and 10 CPU threads
  • No current advantage to using GPUs

  22. HELMHOLTZ EOS: Preliminary Performance Results
  • Advantage from GPUs when computing >100 AMR blocks at a time
  • Can calculate 100 blocks in roughly the same time as 1 block
  • So in FLASH, we should compute 100s of blocks per MPI rank
    • Restructure the code to calculate many blocks at once (a sketch of the idea follows below)
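  A minimal sketch of that batching strategy. A stub, ideal-gas-like routine stands in for the real Helmholtz EOS (which, per slide 10, takes separate state and reduced_state arguments); the type, its fields, and the block counts here are illustrative only.

      module eos_stub
        implicit none
        type :: zone_t
          real(8) :: rho, temp, p
        end type
      contains
        subroutine eos(z)
          !$acc routine seq
          type(zone_t), intent(inout) :: z
          z%p = 8.314d7 * z%rho * z%temp   ! placeholder relation, not the Helmholtz EOS
        end subroutine
      end module eos_stub

      program batch_blocks_sketch
        use eos_stub
        implicit none
        integer, parameter :: zones_per_block = 256, nblocks = 1000
        integer, parameter :: nzones = zones_per_block * nblocks
        type(zone_t), allocatable :: state(:)
        integer :: zone

        allocate(state(nzones))
        state(:) = zone_t(1.0d0, 1.0d8, 0.0d0)
        !$acc enter data copyin(state)

        ! One update + one kernel covers hundreds of blocks, amortizing the fixed
        ! launch/transfer overhead that dominates when blocks are sent one at a time.
        !$acc update device(state)
        !$acc parallel loop
        do zone = 1, nzones
          call eos(state(zone))
        end do
        !$acc update self(state)

        !$acc exit data delete(state)
        print *, 'p(1) =', state(1)%p
      end program batch_blocks_sketch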

  23. CONCLUSIONS: Current Snapshot
  • OpenACC (PGI 16.10)
    • Mature API with more features implemented
    • Currently better kernel-execution performance
  • XL Fortran (15.1.5) OpenMP4.5 implementation is still in its early stages
    • Currently missing some features, but these are being developed as we speak
    • Some bugs are still being worked out
  • Looking into long kernel execution times (related to CUDA thread scheduling?)
