

  1. An Extension of OpenACC Directives for Out-of-Core Stencil Computation with Temporal Blocking
     Nobuhiro Miki, Fumihiko Ino, Kenichi Hagihara
     Graduate School of Information Science and Technology, Osaka University

  2. Stencil computation in OpenACC

     Stencil computation:
     – A fixed pattern is iteratively applied to every data element to solve time-evolution equations
     – Usually accelerated on a GPU, which offers high memory bandwidth

     OpenACC is the simplest method for developing GPU code:
     – Useful for separating accelerator-specific code from CPU code

     However, OpenACC is not a perfect solution for out-of-core data:
     ① The problem size is limited by the exhaustion of GPU memory
     ② Time-evolution iterations can transfer large amounts of data between CPU and GPU

     Memory architecture: host memory (1.5 TB, 77 GB/s to the CPU) and device
     memory (16 GB, 512 GB/s to the GPU) are connected by a 32 GB/s link.

     Stencil code in OpenACC:

         for (t = 0; t < T; t++) {            // time evolution loop
             #pragma acc kernels loop
             for (i = 0; i < N; i++)
                 #pragma acc loop
                 for (j = 0; j < N; j++)
                     a[i][j] = a[i][j-1] + a[i-1][j]
                             + a[i][j+1] + a[i+1][j];
         }

     WACCPD 2016

  3. Out-of-core code with temporal blocking

     – Data decomposition and temporal blocking are useful for tackling these issues
     – However, performance portability is degraded by the required code
       modification: accelerator-specific code is mixed with the essence of
       the computation

         allocate buf_p[0], ..., buf_p[num_queue] on host memory;
         // allocate buffers on both host and device
         #pragma acc create(buf_p[0:num_queue][0:b+2*h*k], ...)
         for (n = 0; n < T; n += k) {
             for (c = 0; c < d; c++) {
                 // select an asynchronous queue
                 set si as the id of an idle queue;   // 0 <= si < num_queue
                 copy chunk from p to buf_p[si];
                 #pragma acc update device(buf_p[si:1][0:b+2*h*k], ...) async(si)
                 // modified loop structure
                 for (i = 0; i < k; i++) {
                     #pragma acc kernels present(buf_p[si:1][0:b+2*h*k], ...) async(si)
                     {
                         offset = h*(i+1);
                         xsize = b + 2*h*(k-1-i);
                         #pragma acc loop independent
                         for (x = offset; x < offset+xsize; x++)
                             #pragma acc loop independent
                             for (y = 1; y < Y-1; y++)
                                 #pragma acc loop independent
                                 for (z = 1; z < Z-1; z++)
                                     // modified indexing scheme
                                     buf_q[si][x*Y*Z + y*Z + z] +=
                                         buf_p[si][(x+1)*Y*Z + y*Z + z] + ...;
                     }
                     buf_p[si] = buf_q[si];
                 }
                 #pragma acc update host(buf_p[si:1][0:b+2*h*k], ...) async(si)
                 copy chunk from buf_p[si] to p;
             }
         }

  4. Overview

     Goal: to facilitate data decomposition and temporal blocking for
     GPU-accelerated stencil computation

     Method: a directive-based approach
     ① Pipelined ACCelerator (PACC): an extension of the OpenACC directives
     ② A source-to-source translator for PACC-to-OpenACC translation

     Stencil code with PACC:

         #pragma pacc init
         #pragma pacc pipeline targetinout(work,a) size([0:Y][0:X]) halo([1:1][1:1]) async
         for (n = 0; n < nn; n++) {
             #pragma pacc loop dim(2)
             for (x = 1; x < X-1; x++)
                 #pragma pacc loop dim(1)
                 for (y = 1; y < Y-1; y++)
                     work[x][y] = (a[x-1][y] + ...);
             #pragma pacc loop dim(2)
             for (x = 1; x < X-1; x++)
                 #pragma pacc loop dim(1)
                 for (y = 1; y < Y-1; y++)
                     a[x][y] = work[x][y];
         }

  5. PACC (Pipelined ACCelerator) directives

     PACC extends the OpenACC directives with three constructs:
     – The init construct allocates the host and device buffers needed for
       realizing data decomposition
     – The pipeline construct specifies the code block to be processed in a
       pipeline; it can carry additional clauses:
         • targetin: names of read-only arrays
         • targetinout: names of writable arrays
         • size: array size
         • halo: halo region size
         • async: asynchronous-execution flag
     – The loop construct indicates which array dimension corresponds to the
       loop control variable (the dim clause in the slide-4 code)

  6. Overview of the PACC translator

     1. The C/C++ frontend generates an abstract syntax tree (AST) of the
        input code using the ROSE compiler infrastructure [2]
     2. The generated AST is traversed to detect the AST nodes that carry
        directive information
     3. In the next traversal, the detected nodes are updated according to
        the code rewrite rules we implemented for PACC
     4. Finally, the transformed AST is given to a code generator, which
        outputs out-of-core OpenACC code

     Pipeline (within the ROSE compiler framework):
     PACC code -> C/C++ frontend -> AST -> code rewrite rules ->
     code generator -> OpenACC code

     [2] ROSE compiler infrastructure, 2015. http://rosecompiler.org/.

  7. Rewrite rules for temporal blocking

     Temporal blocking is a cache optimization technique for time-evolving
     computation:
     – The computation area is advanced k (the blocking factor) time steps
       per data transfer
     – The number of data transfers between CPU and GPU is thus reduced to 1/k

     Naive implementation (one time step per transfer):

         for (n = 0; n < T; n++) {
             data transfer from CPU to GPU
             kernel invocation
             data transfer from GPU to CPU
         }

     With temporal blocking (k time steps per transfer):

         for (n = 0; n < T; n += k) {     // outer loop
             data transfer from CPU to GPU
             for (i = 0; i < k; i++) {    // inner loop
                 kernel invocation
             }
             data transfer from GPU to CPU
         }

  8. Data decomposition

     – A 1-D block decomposition scheme is used
     – Given a stencil of (2s+1)×(2s+1) elements, each block requires halos
       of size sk×Z to be processed independently over k time steps
       (s: the number of neighbor elements in the up/down/left/right directions)
     – The decomposed segments are processed independently
     – A software pipeline is used to overlap kernel execution with data transfer
     – There are two execution parameters: the blocking factor k and the
       block size b

     [Figure: the original data is decomposed along one dimension into
     blocks (computation areas); each block of size b plus its halo regions
     (boundary areas) of width sk forms the segment that is transferred.]

  9. Comparison with the in-core implementation

     Experimental machine:
     – Intel Xeon E5-2680 v2 (512 GB)
     – NVIDIA Tesla K40 (12 GB)
     – Ubuntu 15.3, CUDA 7.0, PGI compiler 15.5

     Results:
     – Out-of-core performance was only 11%-21% lower than in-core performance
     – If these slowdowns are acceptable, out-of-core data can be processed
       easily with the PACC directives

     Code     In-core        Out-of-core    In-core       Out-of-core   Slowdown
              size e1 (GB)   size e2 (GB)   q1 (GFLOPS)   q2 (GFLOPS)   (q1-q2)/q1 (%)
     Jacobi   4.6            18.4           32.2          28.5          11
     Himeno   2.3            15.0           47.5          37.5          21
     CIP      8.2            15.5           83.9          73.4          13

  10. Tradeoff relation under the CIP method

     The constrained interpolation profile (CIP) method:
     – A solver for hyperbolic partial differential equations
     – A 9-point 2-D stencil

     Observations:
     – As estimated beforehand, the best tradeoff point was found; at that
       point the data transfers were fully overlapped with kernel execution
     – Temporal blocking slightly increased kernel execution time due to
       redundant computation, but decreased data transfer time
     – Small blocking factors are memory-bound (transfer dominates); large
       ones are compute-bound (redundant computation dominates)

     [Figure: execution time (s) and effective performance (GFLOPS) versus
     blocking factor k = 4 to 256; the tradeoff point lies between the
     memory-bound and compute-bound regions.]

  11. Conclusion

     – PACC: an extension of the OpenACC directives capable of accelerating
       out-of-core stencil computation with temporal blocking on a GPU
       – Implemented as a translator using AST-based transformation
     – Experiments:
       – Out-of-core performance was only 11%-21% lower than in-core
         performance
       – A tradeoff relation exists between data transfer time and kernel
         execution time
     – Future work: an automated framework for finding the best execution
       parameters (block size and blocking factor)
