
Automatic Testing of OpenACC Applications
Khalid Ahmad, School of Computing, University of Utah
Michael Wolfe, NVIDIA/PGI
November 13th, 2017

Why Test?
When optimizing or porting:
• Validate the optimization or the port
• Identify where the computations start to diverge

Use Cases
General
• Validate a new machine (x86, ARM, OpenPower)
• Validate a different version of the compiler
• Validate a new compiler optimization
• Validate modifications / new algorithms
GPU / OpenACC
• Find where computations start to diverge
• Programmer error: missing data movement
• Hardware differences: different FMA, rounding, intrinsics, accumulation order
• Compiler bugs
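As a minimal illustration of the "accumulation order" item above, the sketch below sums the same single-precision values in two different orders. The function names are hypothetical and chosen only for this example; it assumes plain IEEE single-precision arithmetic with no fast-math reordering.

```c
#include <stddef.h>

/* Sum left to right, as a serial host loop would. */
float sum_forward(const float *x, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++) s += x[i];
    return s;
}

/* Sum right to left: same values, different accumulation order. */
float sum_reverse(const float *x, size_t n) {
    float s = 0.0f;
    for (size_t i = n; i > 0; i--) s += x[i - 1];
    return s;
}

/* Hypothetical demo: returns 1 if the two orders disagree. */
int order_changes_result(void) {
    float x[17];
    x[0] = 1.0e8f;                 /* adjacent floats near 1e8 are 8 apart */
    for (size_t i = 1; i < 17; i++) x[i] = 1.0f;
    /* Forward: each 1.0f is absorbed by 1e8f. Reverse: the ones add up to 16 first. */
    return sum_forward(x, 17) != sum_reverse(x, 17);
}
```

A parallel reduction on a GPU typically accumulates in yet another order, which is one reason a bitwise comparison against host results can report differences even when the code is correct.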

Code Example

    #include <math.h>
    #include <stdint.h>

    void vectorSinGPU(double *A, double *C, uint32_t N)
    {
        // Ensure the data is available on the device
        #pragma acc data copyin(A[0:N]) copyout(C[0:N])
        {
            // Compute construct
            #pragma acc kernels loop independent present(A[0:N],C[0:N])
            for (int i = 0; i < N; i++) {
                C[i] = sin(A[i]);
            }
        }
    }

1) General Compare
• The user may specify several parameters using environment variables
• User passes pointer to data, datatype, size
• User creates golden data file with known correct settings / program
• User reruns program to compare with golden data file

General Compare Code Example

    void vectorSinGPU(double *A, double *C, uint32_t N) {
        #pragma acc enter data copyin(A[0:N])
        #pragma acc enter data create(C[0:N])
        #pragma acc kernels loop present(A[0:N],C[0:N]) independent
        for (int i = 0; i < N; i++) {
            C[i] = sin(A[i]);
        }
        // Copy output data from the CUDA device to the host memory
        #pragma acc exit data copyout(C[0:N])
        #pragma acc exit data delete(A[0:N])
        pgi_compare(C, "double", N, __FILE__, __LINE__);
        pgi_compare(A, "double", N, __FILE__, __LINE__);
    }

How to use the general compare
1) export PGI_COMPARE=FILE=TRIAL,CREATE
2) Run program with function calls
3) export PGI_COMPARE=FILE=TRIAL,rel=5,COMPARE
4) Rerun program with function calls
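The create-then-compare workflow can be pictured with a small sketch. This is not the PGI implementation: golden_create and golden_compare are hypothetical helpers, the file layout (raw doubles) is an assumption, and the comparison here is exact, with tolerances left out.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical CREATE step: write n doubles to the golden file.
   Returns 0 on success, -1 on I/O failure. */
int golden_create(const char *path, const double *data, size_t n) {
    FILE *f = fopen(path, "wb");
    if (!f) return -1;
    size_t written = fwrite(data, sizeof(double), n, f);
    return (fclose(f) == 0 && written == n) ? 0 : -1;
}

/* Hypothetical COMPARE step: check n doubles against the golden file.
   Returns the number of elements that differ, or -1 on I/O failure. */
long golden_compare(const char *path, const double *data, size_t n) {
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    long diffs = 0;
    for (size_t i = 0; i < n; i++) {
        double golden;
        if (fread(&golden, sizeof golden, 1, f) != 1) {
            fclose(f);
            return -1;             /* golden file is shorter than expected */
        }
        if (golden != data[i]) diffs++;
    }
    fclose(f);
    return diffs;
}
```

In this sketch the first run would call golden_create with a known-good build, and later runs would call golden_compare at the same point in the program.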

PGI_COMPARE Environment Variable

    Option     Description
    abs=r      Use 10^-r as an absolute tolerance
    rel=r      Use 10^-r as a relative tolerance
    report=n   Report first n differences
    skip=n     Skip the first n differences
    patch      Patch erroneous values with correct values
    stop       Stop after report= differences
    summary    Print a summary of the comparisons and differences found at program exit
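To make the abs=r and rel=r tolerances concrete, here is a minimal sketch of a tolerance test. How the real tool combines the two options is not stated in the slides; this sketch assumes a value passes if either enabled test passes, and within_tolerance is a hypothetical name.

```c
/* 10^-r by repeated division; r is the number given to abs= or rel=. */
static double pow10_neg(int r) {
    double t = 1.0;
    for (int i = 0; i < r; i++) t /= 10.0;
    return t;
}

static double magnitude(double x) { return x < 0.0 ? -x : x; }

/* Returns 1 if value matches golden within tolerance, 0 otherwise.
   A negative abs_r or rel_r disables that test.
   ASSUMPTION: passing either enabled test is enough. */
int within_tolerance(double golden, double value, int abs_r, int rel_r) {
    double diff = magnitude(golden - value);
    if (diff == 0.0) return 1;
    if (abs_r >= 0 && diff <= pow10_neg(abs_r)) return 1;
    if (rel_r >= 0 && diff <= pow10_neg(rel_r) * magnitude(golden)) return 1;
    return 0;                      /* also reached when either input is NaN */
}
```

With rel=5, for example, a golden value of 1.0 accepts 1.000001 (difference 1e-6) but rejects 1.001 (difference 1e-3).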

OpenACC Background
• The OpenACC runtime manages two copies of the data, one on the host and one on the device, and tracks them in the present table.
• The present table is indexed by the host address and contains the device address, data size, and data type.
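A present table of this kind could be modeled as below. This is only a sketch: the struct layout, the type tags, and the linear search are assumptions for illustration and do not describe the actual runtime data structure.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical element-type tags; the runtime's real encoding is not shown in the slides. */
typedef enum { TYPE_FLOAT, TYPE_DOUBLE, TYPE_INT32 } elem_type;

/* One present-table entry, keyed by the host address. */
typedef struct {
    const void *host;    /* start of the host copy */
    void       *device;  /* start of the device copy */
    size_t      bytes;   /* data size */
    elem_type   type;    /* data type */
} present_entry;

/* Find the entry whose host range contains p; NULL if the data is not present.
   Linear search for clarity only. */
const present_entry *present_lookup(const present_entry *table, size_t count,
                                    const void *p) {
    uintptr_t addr = (uintptr_t)p;
    for (size_t i = 0; i < count; i++) {
        uintptr_t start = (uintptr_t)table[i].host;
        if (addr >= start && addr - start < table[i].bytes) return &table[i];
    }
    return NULL;
}
```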

2) Host Device Compare *
• User passes pointer to host resident data and size of data
• Function locates the relevant device data pointer in the present table
• Using the present table we can also know the data type being used
• Then we perform a data type based comparison
* The autocompare feature will be exposed through a command-line option when it is released in an upcoming PGI version, hopefully in early 2018.

Auto-compare flow chart
1) Serial code runs on the CPU
2) The compute region is executed on both the host (CPU) and the device (GPU)
3) Copy the data back from the device
4) Do the comparison and print out the results
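The flow above amounts to running each compute region twice and comparing the outputs. The sketch below imitates that by hand under stated assumptions: run_and_compare and the two region functions are hypothetical, a simple scaling kernel stands in for the sin loop so the example needs no math library, and a compiler without OpenACC support ignores the pragma and runs both versions on the host.

```c
/* Device version of the compute region; copyout brings the results back to the host. */
static void region_device(const double *a, double *c, int n) {
    #pragma acc kernels loop independent copyin(a[0:n]) copyout(c[0:n])
    for (int i = 0; i < n; i++) c[i] = 2.0 * a[i] + 1.0;
}

/* Host version of the same compute region. */
static void region_host(const double *a, double *c, int n) {
    for (int i = 0; i < n; i++) c[i] = 2.0 * a[i] + 1.0;
}

/* Run both versions, then count elements whose absolute difference exceeds tol. */
int run_and_compare(const double *a, double *c_dev, double *c_host,
                    int n, double tol) {
    region_device(a, c_dev, n);
    region_host(a, c_host, n);
    int diffs = 0;
    for (int i = 0; i < n; i++) {
        double d = c_dev[i] - c_host[i];
        if (d < 0.0) d = -d;
        if (d > tol) diffs++;
    }
    return diffs;
}
```

The auto-compare feature described in the slides does this redundancy automatically, so the user does not need to maintain two copies of each loop.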

Host Device Compare Code Example

    void vectorSinGPU(double *A, double *C, uint32_t N) {
        #pragma acc enter data copyin(A[0:N])
        #pragma acc enter data create(C[0:N])
        #pragma acc kernels loop present(A[0:N],C[0:N]) independent
        for (int i = 0; i < N; i++) {
            C[i] = sin(A[i]);
        }
        acc_compare(C, N);
        // Copy output data from the CUDA device to the host memory
        #pragma acc exit data copyout(C[0:N])
        #pragma acc exit data delete(A[0:N])
    }

3) Host Device Compare All *
• No parameters to pass; the data type is stored in the present table, so the compares are type-aware even though the user doesn't identify the data types
• The function traverses the present table
• And calls the compare function on each entry in the present table
* The autocompare feature will be exposed through a command-line option when it is released in an upcoming PGI version, hopefully in early 2018.
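Traversing a table and dispatching on the stored type might look like the following sketch. The table_entry layout, the type tags, and the function names are hypothetical, the device data is assumed to have already been copied back into host memory, and the comparison is exact for brevity.

```c
#include <stddef.h>

/* Hypothetical type tags and table entry, for illustration only. */
typedef enum { TYPE_FLOAT, TYPE_DOUBLE } elem_type;

typedef struct {
    const void *host;         /* host copy */
    const void *device_copy;  /* device data, assumed already copied back to host memory */
    size_t      bytes;
    elem_type   type;
} table_entry;

/* Type-aware comparison of one entry; returns the number of differing elements. */
static size_t compare_entry(const table_entry *e) {
    size_t diffs = 0;
    if (e->type == TYPE_DOUBLE) {
        const double *h = e->host, *d = e->device_copy;
        for (size_t i = 0; i < e->bytes / sizeof(double); i++) diffs += (h[i] != d[i]);
    } else {
        const float *h = e->host, *d = e->device_copy;
        for (size_t i = 0; i < e->bytes / sizeof(float); i++) diffs += (h[i] != d[i]);
    }
    return diffs;
}

/* Walk every entry in the table; the caller passes no types or sizes. */
size_t compare_all(const table_entry *table, size_t count) {
    size_t total = 0;
    for (size_t i = 0; i < count; i++) total += compare_entry(&table[i]);
    return total;
}
```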

Host Device Compare All Code Example

    void vectorSinGPU(double *A, double *C, uint32_t N) {
        #pragma acc enter data copyin(A[0:N])
        #pragma acc enter data create(C[0:N])
        #pragma acc kernels loop present(A[0:N],C[0:N]) independent
        for (int i = 0; i < N; i++) {
            C[i] = sin(A[i]);
        }
        acc_compare_all();
        // Copy output data from the CUDA device to the host memory
        #pragma acc exit data copyout(C[0:N])
        #pragma acc exit data delete(A[0:N])
    }

Implementation
Source files: acc_compare.c, acc_compare_all.c, pgi_compare.c, usercompare_all.c, usercompare.c, compare.c, check_mod.c
Options:
• FILE="name" √
• CREATE √
• COMPARE √
• VERBOSE
• PATCH
• STOP
• SKIP=#
• REPORT=# √
• ABS √
• REL √

Auto-compare Overhead Cost
[Figure: auto-compare overhead. Host: single-core Intel Haswell. Device: NVIDIA Pascal P100.]

Benchmark Statistics

    Benchmark   Variables and      Values           Variables and arrays   Differences
                arrays compared    compared         with differences       tolerated
    ostencil    202                3,388,997,632    0                      0
    olbm        61                 586,800,000      59                     520,634,266
    omriq       3                  68,608           2                      53,240
    palm        31,244             1,532,482,935    14,784                 374,679,922
    ep          4                  13               2                      2
    cg          186                621,600,195      168                    4,858,272
    csp         4,057              40,132,155,677   3,897                  5,693,059
    miniGhost   2,506              1,844,059,545    175                    175
    ilbdc       3,001              53,818,895,200   2,000                  35,305,830,600
    bt          5,036              15,041,440,200   4,798                  38,931,891

Comparing Byte Count vs Compare Time
[Figure: byte count compared vs. compare time]

Related Work
1) OpenARC compiler framework
• Similar to our auto-compare feature
• User specifies the desired compute region to test
• The rest of the program is run sequentially, including other compute regions
2) Cray Comparative Debugger (CCDB)
• Allows the programmer to launch two versions of a program and add breakpoints
• Does not support automatic testing

Future Work
General
• Implementing more options such as skip, patch, stop, bits ...
• Implement a pragma version
• Adding support for nested data structures and derived types
• Optimize the speed of the comparison
• Option that runs the comparisons in parallel
• Reduce the number of values being compared
• Compare only specific compute constructs to reduce the overall cost
Auto compare
• Running the host code in parallel
• Running the compare on the GPU

Summary
• A tool that automatically detects numerical differences and helps identify bugs
• The overhead of the redundant execution is dominated by the slower execution unit
• Debuggers and correctness checkers always introduce some overhead, which is acceptable and in most cases still much faster than a manual investigation
