

  1. L8179 – ZERO TO GPU HERO WITH OPENACC Jeff Larkin, GTC 2019, March 2019

  2. OUTLINE Topics to be covered ▪ What is OpenACC ▪ Profile-driven Development ▪ OpenACC with CUDA Unified Memory ▪ OpenACC Data Directives ▪ OpenACC Loop Optimizations ▪ Where to Get Help

  3. ABOUT THIS SESSION ▪ The objective of this session is to give you a brief introduction to OpenACC programming for NVIDIA GPUs ▪ This is an instructor-led session; there will be no hands-on portion ▪ For hands-on experience, please consider attending DLIT903 - OpenACC - 2X in 4 Steps or L9112 - Programming GPU-Accelerated POWER Systems with OpenACC if your badge allows ▪ Feel free to interrupt with questions

  4. INTRODUCTION TO OPENACC

  5. OpenACC is a directives-based programming approach to parallel computing designed for performance and portability on CPUs and GPUs for HPC. Add a simple compiler directive:

     main()
     {
       <serial code>
       #pragma acc kernels
       {
         <parallel code>
       }
     }
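A minimal, compilable C sketch of the kernels directive shown above (the array size, data, and saxpy-style loop are illustrative choices, not taken from the slides):

     #include <stdio.h>

     #define N 1000000                      /* illustrative problem size */

     int main(void)
     {
         static float x[N], y[N];

         for (int i = 0; i < N; i++) {      /* ordinary serial setup code */
             x[i] = 1.0f;
             y[i] = 2.0f;
         }

         #pragma acc kernels                /* hint: compiler may parallelize this region */
         {
             for (int i = 0; i < N; i++)
                 y[i] = 2.0f * x[i] + y[i];
         }

         printf("y[0] = %f\n", y[0]);
         return 0;
     }

Compiled without OpenACC support, the pragma is simply ignored and the program runs sequentially.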

  6. 3 WAYS TO ACCELERATE APPLICATIONS ▪ Libraries: easy to use, most performance ▪ Compiler Directives (OpenACC): easy to use, portable code ▪ Programming Languages: most performance, most flexibility

  7. OPENACC PORTABILITY – Describing a generic parallel machine ▪ OpenACC is designed to be portable to many existing and future parallel platforms ▪ The programmer need not think about specific hardware details, but rather express the parallelism in generic terms ▪ An OpenACC program runs on a host (typically a CPU) that manages one or more parallel devices (GPUs, etc.). The host and device(s) are logically thought of as having separate memories. [Diagram: a host with host memory connected to a device with its own device memory]

  8. OPENACC – Three major strengths: Low Learning Curve, Incremental, Single Source

  9. OPENACC – Incremental ▪ Maintain existing sequential code ▪ Add annotations to expose parallelism ▪ After verifying correctness, annotate more of the code. Begin with a working sequential code, parallelize it with OpenACC, then rerun the code to verify correct behavior, removing or altering OpenACC code as needed.

     Enhance sequential code:

     #pragma acc parallel loop
     for( i = 0; i < N; i++ )
     {
       < loop code >
     }
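As a concrete sketch of the incremental approach (the routine and its loop are hypothetical examples, not from the lab code), one might annotate only a single hot loop nest and leave the rest of the program untouched:

     /* Hypothetical routine: only this loop nest is annotated so far;
        everything else in the program remains ordinary sequential C. */
     void scale_rows(int n, int m, double a[n][m], const double s[n])
     {
         #pragma acc parallel loop collapse(2)   /* the only OpenACC change */
         for (int j = 0; j < n; j++)
             for (int i = 0; i < m; i++)
                 a[j][i] = s[j] * a[j][i];
     }

After verifying the results still match the sequential version, further loops can be annotated in the same way.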

  10. OPENACC Low Learning Curve Incremental Single Source ▪ Maintain existing sequential code ▪ Add annotations to expose parallelism ▪ After verifying correctness, annotate more of the code

  11. OPENACC – Single Source ▪ The compiler can ignore your OpenACC code additions, so the same code can be used for parallel or sequential execution ▪ Rebuild the same code on multiple architectures ▪ Compiler determines how to parallelize for the desired machine ▪ Sequential code is maintained. Supported platforms: POWER, Sunway, x86 CPU, x86 Xeon Phi, NVIDIA GPU, PEZY-SC.

     int main(){
       ...
       #pragma acc parallel loop
       for(int i = 0; i < N; i++)
         < loop code >
     }
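One standard consequence of the single-source model (a general OpenACC feature, not shown on this slide) is the _OPENACC preprocessor macro, which is defined only when the compiler builds with OpenACC enabled; a small sketch:

     #include <stdio.h>

     int main(void)
     {
     #ifdef _OPENACC
         /* _OPENACC is defined by OpenACC-enabled compilations and expands
            to the supported specification version (yyyymm). */
         printf("Built with OpenACC version %d\n", _OPENACC);
     #else
         printf("Built as plain sequential code\n");
     #endif
         return 0;
     }

The same source file builds and runs either way; only the compiler flags change.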

  12. OPENACC – Incremental: maintain existing sequential code; add annotations to expose parallelism; after verifying correctness, annotate more of the code. Single Source: rebuild the same code on multiple architectures; compiler determines how to parallelize for the desired machine; sequential code is maintained.

  13. OPENACC – Low Learning Curve ▪ OpenACC is meant to be easy to use, and easy to learn ▪ Programmer remains in familiar C, C++, or Fortran ▪ No reason to learn low-level details of the hardware. The programmer gives hints to the compiler about which parts of the code to parallelize; the compiler then generates parallelism for the target parallel hardware.

     int main(){
       <sequential code>
       #pragma acc kernels   // hint to the compiler
       {
         <parallel code>
       }
     }

  14. OPENACC – Incremental: maintain existing sequential code; add annotations to expose parallelism; after verifying correctness, annotate more of the code. Single Source: rebuild the same code on multiple architectures; compiler determines how to parallelize for the desired machine; sequential code is maintained. Low Learning Curve: OpenACC is meant to be easy to use, and easy to learn; programmer remains in familiar C, C++, or Fortran; no reason to learn low-level details of the hardware.

  15. OPENACC SUCCESSES
     ▪ COSMO (Weather and Climate, MeteoSwiss / CSCS): 40X speedup, 3X energy efficiency
     ▪ LSDalton (Quantum Chemistry, Aarhus University): 12X speedup, 1 week
     ▪ PowerGrid (Medical Imaging, University of Illinois): 40 days to 2 hours
     ▪ INCOMP3D (CFD, NC State University): 4X speedup
     ▪ MAESTRO / CASTRO (Astrophysics, Stony Brook University): 4.4X speedup, 4 weeks effort
     ▪ NekCEM (Comp Electromagnetics, Argonne National Lab): 2.5X speedup, 60% less energy
     ▪ CloverLeaf (Comp Hydrodynamics, AWE): 4X speedup, single CPU/GPU code
     ▪ FINE/Turbo (CFD, NUMECA International): 10X faster routines, 2X faster app

  16. OPENACC SYNTAX

  17. OPENACC SYNTAX – Syntax for using OpenACC directives in code.

     C/C++:   #pragma acc directive clauses
              <code>
     Fortran: !$acc directive clauses
              <code>

     ▪ A pragma in C/C++ gives instructions to the compiler on how to compile the code. Compilers that do not understand a particular pragma can freely ignore it. ▪ A directive in Fortran is a specially formatted comment that likewise instructs the compiler in its compilation of the code and can be freely ignored. ▪ “acc” informs the compiler that what follows is an OpenACC directive ▪ Directives are commands in OpenACC for altering our code ▪ Clauses are specifiers or additions to directives.
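As an illustration of the directive-plus-clauses pattern in C, a small sketch (the reduction and copyin clauses are standard OpenACC clauses chosen for illustration; they are not taken from this slide):

     double sum_of_squares(int n, const double *restrict x)
     {
         double sum = 0.0;

         /* "parallel loop" is the directive; "reduction" and "copyin" are
            clauses refining how the loop is parallelized and how data moves. */
         #pragma acc parallel loop reduction(+:sum) copyin(x[0:n])
         for (int i = 0; i < n; i++)
             sum += x[i] * x[i];

         return sum;
     }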

  18. EXAMPLE CODE

  19. LAPLACE HEAT TRANSFER – Introduction to lab code. We will observe a simple simulation of heat distributing across a metal plate. We will apply a consistent heat to the top of the plate, then simulate the heat distributing across the plate. [Figure: plate colored from very hot at the heated top edge down to room temperature]

  20. EXAMPLE: JACOBI ITERATION ▪ Iteratively converges to the correct value (e.g. temperature) by computing new values at each point from the average of neighboring points ▪ Common, useful algorithm ▪ Example: solve the Laplace equation in 2D, $\nabla^2 f(x,y) = 0$. Each point A(i,j) is updated from the average of its four neighbors A(i-1,j), A(i+1,j), A(i,j-1), A(i,j+1):

     $A_{k+1}(i,j) = \frac{A_k(i-1,j) + A_k(i+1,j) + A_k(i,j-1) + A_k(i,j+1)}{4}$

  21. JACOBI ITERATION: C CODE

     while ( err > tol && iter < iter_max ) {          // Iterate until converged
       err = 0.0;

       for( int j = 1; j < n-1; j++) {                 // Iterate across matrix elements
         for( int i = 1; i < m-1; i++) {
           // Calculate new value from neighbors
           Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                                A[j-1][i] + A[j+1][i]);
           // Compute max error for convergence
           err = max(err, abs(Anew[j][i] - A[j][i]));
         }
       }

       for( int j = 1; j < n-1; j++) {                 // Swap input/output arrays
         for( int i = 1; i < m-1; i++ ) {
           A[j][i] = Anew[j][i];
         }
       }
       iter++;
     }

  22. PROFILE-DRIVEN DEVELOPMENT

  23. OPENACC DEVELOPMENT CYCLE ▪ Analyze your code to determine the most likely places needing parallelization or optimization ▪ Parallelize your code by starting with the most time-consuming parts and check for correctness ▪ Optimize your code to improve observed speed-up from parallelization. [Diagram: Analyze → Parallelize → Optimize as a repeating cycle]

  24. PROFILING SEQUENTIAL CODE – Profile your code to obtain detailed information about how the code ran. This can include information such as: ▪ Total runtime ▪ Runtime of individual routines ▪ Hardware counters. Identify the portions of code that took the longest to run; we want to focus on these “hotspots” when parallelizing. Lab code: Laplace Heat Transfer, total runtime 39.43 seconds (calcNext 21.49 s, swap 19.04 s).

  25. PROFILING SEQUENTIAL CODE – First sight when using PGPROF ▪ Profiling a simple, sequential code ▪ Our sequential program will run on the CPU ▪ To view information about how our code ran, we should select the “CPU Details” tab

  26. PROFILING SEQUENTIAL CODE CPU Details ▪ Within the “CPU Details” tab, we can see the various parts of our code, and how long they took to run ▪ We can reorganize this info using the three options in the top-right portion of the tab ▪ We will expand this information, and see more details about our code

  27. PROFILING SEQUENTIAL CODE CPU Details ▪ We can see that there are two places that our code is spending most of its time ▪ 21.49 seconds in the “ calcNext ” function ▪ 19.04 seconds in a memcpy function ▪ The c_mcopy8 that we see is actually a compiler optimization that is being applied to our “swap” function
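For context, a hedged sketch of what the lab's swap routine presumably looks like (the exact signature in the lab code may differ): it is a plain element-by-element copy of Anew back into A, which is why the compiler can replace the inner loop with an optimized memory copy that shows up in the profile as c_mcopy8.

     /* Hypothetical sketch of the swap routine from the lab code. */
     void swap(int n, int m, double A[n][m], double Anew[n][m])
     {
         for (int j = 1; j < n - 1; j++)
             for (int i = 1; i < m - 1; i++)
                 A[j][i] = Anew[j][i];   /* straight copy, recognized as a memcpy */
     }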

  28. PROFILING SEQUENTIAL CODE PGPROF ▪ We are also able to select the different elements in the CPU Details by double-clicking to open the associated source code ▪ Here we have selected the “calcNext:37” element, which opened up our code to show the exact line (line 37) that is associated with that element

  29. OPENACC PARALLEL DIRECTIVE

  30. OPENACC PARALLEL DIRECTIVE – Expressing parallelism. When encountering the parallel directive, the compiler will generate 1 or more parallel gangs, which execute redundantly.

     #pragma acc parallel
     {
       <code executed redundantly by each gang>
     }
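To connect this back to the Jacobi example, a hedged sketch of how the parallel directive is typically combined with a loop and applied to the calcNext-style loop nest; the reduction(max:err) clause is an assumption about how the maximum error would be handled in parallel, not something shown on this slide:

     #include <math.h>

     /* One Jacobi sweep with an OpenACC parallel loop (illustrative sketch). */
     double jacobi_sweep(int n, int m, double A[n][m], double Anew[n][m])
     {
         double err = 0.0;

         #pragma acc parallel loop reduction(max:err)
         for (int j = 1; j < n - 1; j++) {
             for (int i = 1; i < m - 1; i++) {
                 Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                                      A[j-1][i] + A[j+1][i]);
                 err = fmax(err, fabs(Anew[j][i] - A[j][i]));
             }
         }
         return err;
     }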
