April 4-7, 2016 | Silicon Valley
S6410 - Comparing OpenACC 2.5 and OpenMP 4.5
James Beyer, NVIDIA
Jeff Larkin, NVIDIA
GTC16 – April 7, 2016
AGENDA
- History of OpenMP & OpenACC
- Philosophical Differences
- Technical Differences
- Portability Challenges
- Conclusions
A Tale of Two Specs.
A Brief History of OpenMP
1996 – Architecture Review Board (ARB) formed by several vendors implementing their own directives for Shared Memory Parallelism (SMP).
1997 – Version 1.0 released for C/C++ and Fortran with support for parallelizing loops across threads.
2000, 2002 – Version 2.0 of the Fortran and C/C++ specifications released.
2005 – Version 2.5 released, combining both specs into one.
2008 – Version 3.0 released, added support for tasking.
2011 – Version 3.1 released, improved support for tasking.
2013 – Version 4.0 released, added support for offloading (and more).
2015 – Version 4.5 released, improved support for offloading targets (and more).
A Brief History of OpenACC
2010 – OpenACC founded by CAPS, Cray, PGI, and NVIDIA to unify the directives for accelerators being developed independently by CAPS, Cray, and PGI.
2011 – OpenACC 1.0 released.
2013 – OpenACC 2.0 released, adding support for unstructured data management and clarifying specification language.
2015 – OpenACC 2.5 released, contains primarily clarifications with some additional features.
Philosophical Differences
OpenMP: Compilers are dumb, users are smart. Restructuring non-parallel code is optional.
OpenACC: Compilers can be smart and smarter with the user's help. Non-parallel code must be made parallel.
Philosophical Differences
OpenMP: The OpenMP API covers only user-directed parallelization, wherein the programmer explicitly specifies the actions to be taken by the compiler and runtime system in order to execute the program in parallel. The OpenMP API does not cover compiler-generated automatic parallelization and directives to the compiler to assist such parallelization.
OpenACC: The programming model allows the programmer to augment information available to the compilers, including specification of data local to an accelerator, guidance on mapping of loops onto an accelerator, and similar performance-related details.
Philosophical Trade-offs
OpenMP:
- Consistent, predictable behavior between implementations
- Users can parallelize non-parallel code and protect data races explicitly
- Some optimizations are off the table
- Substantially different architectures require substantially different directives
OpenACC:
- Quality of implementation will greatly affect performance
- Users must restructure their code to be parallel and free of data races
- Compiler has more freedom and information to optimize
- High level parallel directives can be applied to different architectures by the compiler
Technical Differences
Parallel: Similar, but Different
OMP Parallel:
- Creates a team of threads
- Very well-defined how the number of threads is chosen
- May synchronize within the team
- Data races are the user's responsibility
ACC Parallel:
- Creates 1 or more gangs of workers
- Compiler free to choose number of gangs, workers, vector length
- May not synchronize between gangs
- Data races not allowed
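A minimal sketch of this contrast (the function names and the array-doubling loop are illustrative, not from the slides):

  #include <omp.h>

  // OpenMP: a parallel region creates a team of threads whose size is well
  // defined (num_threads clause, OMP_NUM_THREADS, ...). Threads may synchronize
  // with a barrier, and any data races are the user's responsibility.
  void double_omp(float *x, int n)
  {
      #pragma omp parallel
      {
          int nthreads = omp_get_num_threads();
          int tid      = omp_get_thread_num();
          for (int i = tid; i < n; i += nthreads)   // manual round-robin split, for illustration
              x[i] *= 2.0f;
          #pragma omp barrier                       // legal: synchronization within the team
      }
  }

  // OpenACC: a parallel region creates one or more gangs of workers; the compiler
  // chooses the number of gangs/workers/vector lanes, gangs cannot synchronize
  // with each other, and the enclosed code must be free of data races.
  void double_acc(float *x, int n)
  {
      #pragma acc parallel loop copy(x[0:n])
      for (int i = 0; i < n; i++)
          x[i] *= 2.0f;
  }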
OMP Teams vs. ACC Parallel
OMP Teams:
- Creates a league of 1 or more thread teams
- Compiler free to choose number of teams, threads, and simd lanes
- May not synchronize between teams
- Only available within target regions
ACC Parallel:
- Creates 1 or more gangs of workers
- Compiler free to choose number of gangs, workers, vector length
- May not synchronize between gangs
- May be used anywhere
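A hedged sketch of the placement difference (function names and loop bodies are illustrative): OpenMP teams is only legal inside a target region, while an OpenACC parallel construct may appear anywhere.

  // OpenMP: teams must live inside target; the league of teams is then spread
  // over the loop with distribute (parallel for adds threads within each team).
  void add_one_omp(float *x, int n)
  {
      #pragma omp target teams distribute parallel for map(tofrom: x[0:n])
      for (int i = 0; i < n; i++)
          x[i] += 1.0f;
  }

  // OpenACC: parallel may be used anywhere; the mapping to gangs/workers/vector
  // is left to the compiler unless the user overrides it.
  void add_one_acc(float *x, int n)
  {
      #pragma acc parallel loop copy(x[0:n])
      for (int i = 0; i < n; i++)
          x[i] += 1.0f;
  }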
Compiler-Driven Mode
OpenMP:
- Fully user-driven (no analogue)
- Some compilers choose to go above and beyond after applying OpenMP, but this is not guaranteed
OpenACC:
- Kernels directive declares the desire to parallelize a region of code, but places the burden of analysis on the compiler
- Compiler required to be able to do analysis and make decisions
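A small sketch of the kernels directive (the stencil loop is illustrative); OpenMP has no direct counterpart, so a user would have to state the parallelism explicitly, e.g. with target teams distribute parallel for.

  // OpenACC: kernels tells the compiler "parallelize this region if you can";
  // the compiler analyzes each loop nest and decides how (or whether) to map
  // it onto the accelerator.
  void smooth_acc(const float *restrict a, float *restrict b, int n)
  {
      #pragma acc kernels copyin(a[0:n]) copyout(b[0:n])
      {
          for (int i = 1; i < n - 1; i++)
              b[i] = 0.25f * (a[i-1] + 2.0f * a[i] + a[i+1]);
      }
  }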
Loop: Similar but Different
OMP Loop (For/Do):
- Splits ("workshares") the iterations of the next loop to threads in the team; guarantees the user has managed any data races
- Loop will be run over threads, and scheduling of loop iterations may restrict the compiler
ACC Loop:
- Declares the loop iterations as independent & race free (parallel) or interesting & should be analyzed (kernels)
- User able to declare independence w/o declaring scheduling
- Compiler free to schedule with gangs/workers/vector, unless overridden by user
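A minimal sketch using a saxpy loop (chosen only for illustration):

  // OpenMP: for workshares the iterations across the team's threads; the user
  // asserts the races are handled and may pin the schedule, which constrains
  // the compiler.
  void saxpy_omp(int n, float a, const float *x, float *y)
  {
      #pragma omp parallel for schedule(static)
      for (int i = 0; i < n; i++)
          y[i] = a * x[i] + y[i];
  }

  // OpenACC: loop declares the iterations independent without dictating a
  // schedule; the compiler is free to map them to gang/worker/vector unless
  // the user overrides it with explicit clauses.
  void saxpy_acc(int n, float a, const float *x, float *y)
  {
      #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
      for (int i = 0; i < n; i++)
          y[i] = a * x[i] + y[i];
  }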
Distribute vs. Loop
OMP Distribute:
- Must live in a TEAMS region
- Distributes loop iterations over 1 or more thread teams
- Only the master thread of each team runs iterations, until PARALLEL is encountered
- Loop iterations are implicitly independent, but some compiler optimizations are still restricted
ACC Loop:
- Declares the loop iterations as independent & race free (parallel) or interesting & should be analyzed (kernels)
- Compiler free to schedule with gangs/workers/vector, unless overridden by user
Distribute Example

OpenMP:
  #pragma omp target teams
  {
    #pragma omp distribute
    for(i=0; i<n; i++)
      for(j=0; j<m; j++)
        for(k=0; k<p; k++)
  }

OpenACC:
  #pragma acc parallel
  {
    #pragma acc loop
    for(i=0; i<n; i++)
      #pragma acc loop
      for(j=0; j<m; j++)
        #pragma acc loop
        for(k=0; k<p; k++)
  }

OpenMP: generates 1 or more thread teams and distributes "i" over the teams; no information is given about the "j" or "k" loops.
OpenACC: generates 1 or more gangs; all three loops are declared independent, so the compiler should do the right thing. What's the right thing? Interchange? Distribute? Workshare? Vectorize? Stripmine? Ignore? …
Synchronization
OpenMP:
- Users may use barriers, critical regions, and/or locks to protect data races
- It's possible to parallelize non-parallel code
OpenACC:
- Users expected to refactor code to remove data races
- Code should be made truly parallel and scalable
Synchronization Example

OpenMP:
  #pragma omp parallel private(p)
  {
    funcA(p);
    #pragma omp barrier
    funcB(p);
  }

OpenACC:
  function funcA(p[N]) {
    #pragma acc parallel
  }
  function funcB(p[N]) {
    #pragma acc parallel
  }
Synchronization Example

OpenMP:
  #pragma omp parallel for
  for (i=0; i<N; i++)
  {
    #pragma omp critical
    A[i] = rand();
    A[i] *= 2;
  }

OpenACC:
  parallelRand(A);
  #pragma acc parallel loop
  for (i=0; i<N; i++)
  {
    A[i] *= 2;
  }
Portability Challenges
How to Write Portable Code (OMP)
Ifdefs can be used to choose particular directives per device at compile-time.

  #ifdef GPU
  #pragma omp target teams distribute parallel for reduction(max:error) \
          collapse(2) schedule(static,1)
  #elif defined(CPU)
  #pragma omp parallel for reduction(max:error)
  #elif defined(SOMETHING_ELSE)
  #pragma omp …
  #endif
  for( int j = 1; j < n-1; j++) {
    #if defined(CPU) && defined(USE_SIMD)
    #pragma omp simd
    #endif
    for( int i = 1; i < m-1; i++ ) {
      Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]);
      error = fmax( error, fabs(Anew[j][i] - A[j][i]));
    }
  }
How to Write Portable Code (OMP)
Creative ifdefs might clean up the code, but still one target at a time.

  #pragma omp \
  #ifdef GPU
          target teams distribute \
  #endif
          parallel for reduction(max:error) \
  #ifdef GPU
          collapse(2) schedule(static,1)
  #endif
  for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++ ) {
      Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]);
      error = fmax( error, fabs(Anew[j][i] - A[j][i]));
    }
  }
How to Write Portable Code (OMP)
The OpenMP if clause can help some too (4.5 improves this).

  usegpu = 1;
  #pragma omp target teams distribute parallel for reduction(max:error) \
  #ifdef GPU
          collapse(2) schedule(static,1) \
  #endif
          if(target:usegpu)
  for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++ ) {
      Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]);
      error = fmax( error, fabs(Anew[j][i] - A[j][i]));
    }
  }

Note: This example assumes that a compiler will choose to generate 1 team when not in a target, making it the same as a standard "parallel for."
How to Write Portable Code (ACC)
Developer presents the desire to parallelize to the compiler; the compiler handles the rest.

  #pragma acc kernels
  {
    for( int j = 1; j < n-1; j++) {
      for( int i = 1; i < m-1; i++ ) {
        Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]);
        error = fmax( error, fabs(Anew[j][i] - A[j][i]));
      }
    }
  }