Other Source-to-Source Technologies
PoCC (Polyhedral Compiler Collection)
o http://www.cse.ohio-state.edu/~pouchet/software/pocc/
ROSE source-to-source translators
o http://rosecompiler.org/
Cetus, source-to-source compiler infrastructure
o http://cetus.ecn.purdue.edu
EDG, Edison Design Group technology
o http://www.edg.com
Clang
o http://clang.llvm.org
Insieme Compiler
o http://www.insieme-compiler.org/
…
5/06/13 Uppsala 22
CAPS Auto-tuning Approach
An Embedded DSL Oriented Approach to Auto-Tuning
Source-to-source approach
o Exploits native compilers
Code partitioning to help the offline approach
o CodeletFinder
Scripting to implement DSL approaches
o Generate domain/code specific search and optimization spaces
o Static selection of variants
Runtime instantiation of code parameters
o Variant parameters fixed at runtime
Low-level API for "auto-tuning drivers"
o Separates objective functions from optimization space generation
An engineering issue
o How to embed code/domain specific strategies so they are ready for programmers to use
o Still dealing with legacy code
o Integration in the compilation process is a key feature
Focus on node-level issues
Auto-tuning Flow Overview
[Diagram: the source code goes through CodeletFinder; CT performance tools profile it and collect profiling data; optimizing scripts implement the optimizing strategy and explore the variant space; the HMPP compiler produces a CAPS auto-tunable executable exposing a tuning driver interface.]
Code Partitioning for Auto-Tuning
Tuning and analyzing the performance of large/complex applications is usually a challenge
o Execution time with real data sets is usually too long to be compatible with the trial/experiment cycle
o Many performance or tuning tools cannot be used at large scale
Compute-intensive parts of the code usually represent a small portion of the total application
o Extract these parts to focus on them
o Allows many analysis and optimization tools to be used
o Faster experiment cycles
Similar works: Code Isolator (Y.-J. Lee and M. W. Hall) and Rose Outliner (C. Liao, D. J. Quinlan, R. Vuduc, and T. Panas)
CodeletFinder
Decomposing applications into hotspots
Each hotspot can be efficiently analyzed separately
o Outlined hotspots
Mix of static and dynamic analysis
o Code and data
CodeletFinder Process Overview
Project Capture
o Captures the build process
o Replays the build on demand
Hotspot Finder
o Finds hotspots in the application using execution profiles
o Statically extracts potential hotspots
Codelet Builder
o Builds the codelets based on the identified hotspots (code outliner)
o Creates standalone micro-benchmarks; patterns are given to build the codelets
Capture
o Captures data for the micro-benches
o Captures execution parameters
Micro Bencher
o Runs the micro-benches
Scripting to Generate the Search Space
Most tuning strategies are code/domain specific
o In regard to the code structure and runtime properties
o Many codes live long, which amortizes code-specific approaches
Many different high-level approaches can be embedded
o Stencil code generators (e.g. Patus)
o Polyhedral model based approaches (e.g. PoCC)
o Libraries
o Data structure transformations
o …
New CAPS Compiler Features
Code transformation scripts (in Lua) can be added as a pre-compilation phase
o Application/domain specific scripts are fed to a scripting engine placed after the C/C++/Fortran frontends
Scripts can read and modify the source code AST
[Diagram: C/C++/Fortran frontends → scripting engine → extraction module (host code, kernel functions Fun #1–#3) → instrumentation module and OpenCL/CUDA generation → native CPU compilers (gcc, ifort, …).]
Tuning Script Implementation
Directives convey programmer knowledge
The code provides low-level information
o e.g. loop indices, variable names, …
Scripts hide low-level code transformation details
Many loop transformations can be implemented using hmppcg directives
Directive expressions provide high-level information to the scripts and name the script to be activated:

  !$capstune scriptName scriptInput
     <code region>
  !$capstune end scriptName
  …
Simple Example-1
Specify the script to generate an optimized stencil code using various methods:
- multiple variants
- external tools
- using a library

  …
  !$capstune stencil …
  !$acc kernels
  !$acc loop independent
  do i=1,10
    !$acc loop independent
    do j=1,10
      a(i,j) = … b(i,j) …
    end do
  end do
  !$acc end kernels
  !$capstune end stencil
  …
Simple Example-2
Transform a data structure for an accelerator:
- Take slices of a derived type
- The decision usually cannot be made from local code analysis

  TYPE foo
    REAL :: w(10,10)
    REAL :: x(10,10)
    REAL :: y(10,10)
    REAL :: z(10,10)
  END TYPE foo
  …
  !$capstune scalarize state_x => state%x , state_z => state%z
  !$acc parallel num_gangs(10) num_workers(10) copyout(state_x) copyin(state_z)
  !$acc loop gang
  do i=1,10
    !$acc loop worker
    do j=1,10
      state%x(i,j) = state%z(i,j) + i+j/1000.0
    end do
  end do
  !$acc end parallel
  !$capstune end scalarize
Making OpenMP Codes Heterogeneous
Code Generation Process Overview
Converts OpenMP code to use the GPU automatically
Currently focusing on AMD APUs
Incremental process to make the OpenMP code GPU friendly
Data Use Analysis
Necessary to allocate data on the accelerator and compute basic data transfer overheads
Keep the analysis overhead low
Analysis based on an abstract execution of the OpenMP loop nest sequence
Preliminary Example of Experiments
[Figure: display of experimental results.]
PART I Conclusion
OpenMP, a good start to migrate codes
o Data use analysis is a key feature
Source-to-source technology is well adapted to heterogeneity
o Avoids the "one compiler fits all" approach
Auto-tuning techniques help to simplify code tuning and deployment
The DSL approach helps to guide the auto-tuning process
PART II OpenACC Directives for Accelerators
Credits
http://www.openacc.org/
o OpenACC, Directives for Accelerators, V1.0 specification: November 2011
Nvidia slideware
CAPS Compilers-3.x OpenACC Reference Manual, CAPS entreprise
Agenda
OpenACC Overview and Compilers
o Lab Session 1: Using CAPS Compilers
Programming Model
o Lab Session 2: Offloading Computations
Managing Data
o Lab Session 3: Optimizing Data Transfers
Specifying Parallelization
o Lab Session 4: Optimizing Compute Kernels
Asynchronism
o Lab Session 5: Performing Asynchronous Computations
Runtime API
OpenACC 2.0 Draft Specification
OpenACC Overview and Compilers
Directive-based Programming (1)
Three ways of programming GPGPU applications:
o Libraries: ready-to-use acceleration
o Directives: quickly accelerate existing applications
o Programming languages: maximum performance
Directive-based Programming (2)
Advantages of Directive-based Programming
Simple and fast development of accelerated applications
Non-intrusive
Helps to keep a unique version of the code
o To preserve code assets
o To reduce maintenance costs
o To be portable across several accelerators
Incremental approach
Enables "portable" performance
OpenACC Initiative
A CAPS, CRAY, Nvidia and PGI initiative
Open standard
A directive-based approach for programming heterogeneous many-core hardware for C and Fortran applications
http://www.openacc-standard.com
OpenACC Compilers (1)
CAPS Compilers: source-to-source compiler
o Supports Intel Xeon Phi, NVIDIA GPUs, AMD GPUs and APUs
PGI Accelerator: extension of the x86 PGI compilers
o Supports Intel Xeon Phi, NVIDIA GPUs, AMD GPUs and APUs
Cray Compiler: provided with Cray systems only
CAPS Compilers (2)
Are source-to-source compilers, composed of 3 parts:
The directives (OpenACC or OpenHMPP)
o Define parts of the code to be accelerated
o Indicate resource allocation and communication
o Ensure portability
The toolchain
o Helps building manycore applications
o Includes compilers and target code generators
o Insulates hardware-specific computations
o Uses hardware vendor SDKs
The runtime
o Helps to adapt to the platform configuration
o Manages hardware resource availability
CAPS Compilers (3)
Take the original application as input and generate another application source code as output
o Automatically turn the OpenACC source code into accelerator-specific source code (CUDA, OpenCL)
Compile the entire hybrid application
Just prefix the original compilation line with capsmc to produce a hybrid application:

  $ capsmc gcc myprogram.c
  $ capsmc gfortran myprogram.f90
CAPS Compilers (4)
CAPS Compilers drive all compilation passes
Host application compilation
o Calls traditional CPU compilers (gcc, ifort, …)
o The CAPS Runtime is linked to the host part of the application
Device code production
o According to the specified target (CUDA or OpenCL compilers)
o A dynamic library is built
[Diagram: C/C++/Fortran frontends → extraction module (host code and codelets Fun #1–#3) → instrumentation module and CUDA/OpenCL code generation → CPU and accelerator compilers → executable (mybin.exe) linked with the CAPS Runtime and the HWA code as a dynamic library.]
CAPS Compilers Options
Usage:

  $ capsmc [CAPSMC_FLAGS] <host_compiler> [HOST_COMPILER_FLAGS] <source_files>

To display the compilation process:

  $ capsmc -d -c gcc myprogram.c

To specify accelerator-specific code:

  $ capsmc --openacc-target CUDA gcc myprogram.c    #(default)
  $ capsmc --openacc-target OPENCL gcc myprogram.c  #(AMD and Phi)
Lab Session 1: Using CAPS Compilers
Lab 1: Using CAPS Compilers
Compile and execute a simple "Hello world!" application
Use the -d and -c flags to display the compilation process
Use ldd on the output executable to print library dependencies
Programming Model
Programming Model
Express data and computations to be executed on an accelerator
o Using marked code regions
Main OpenACC constructs
o Parallel and kernels regions
o Parallel loops
o Data regions
o Runtime API
Data/stream/vector parallelism to be exploited by the HWA (e.g. CUDA / OpenCL)
CPU and HWA linked with a PCIx bus
Execution Model
Among the bulk of computations executed by the CPU, some regions can be offloaded to hardware accelerators
o Parallel regions
o Kernels regions
The host is responsible for:
o Allocating memory space on the accelerator
o Initiating data transfers
o Launching computations
o Waiting for completion
o Deallocating memory space
Accelerators execute parallel regions:
o Use work-sharing directives
o Specify the level of parallelization
OpenACC Execution Model
Host-controlled execution
Based on three parallelism levels
o Gangs – coarse grain
o Workers – fine grain
o Vectors – finest grain
[Diagram: a device runs several gangs; each gang contains workers; each worker executes vector operations.]
Gangs, Workers, Vectors
In CAPS Compilers, gangs, workers and vectors correspond to the following in a CUDA grid:
o gridDim.x = number of gangs
o gridDim.y = 1
o blockDim.y = number of workers
o blockDim.x = number of vectors
Beware: this implementation is compiler-dependent
Directive Syntax
C:

  #pragma acc directive-name [clause [, clause] …]
  { code to offload }

Fortran:

  !$acc directive-name [clause [, clause] …]
  code to offload
  !$acc end directive-name
Parallel Construct
Starts parallel execution on the accelerator
Creates gangs and workers
The number of gangs and workers remains constant for the parallel region
One worker in each gang begins executing the code in the region

  #pragma acc parallel […]
  {
    …
    for(i=0; i < n; i++) {     /* code executed on the    */
      for(j=0; j < n; j++) {   /* hardware accelerator    */
        …
      }
    }
    …
  }
Kernels Construct
Defines a region of code to be compiled into a sequence of accelerator kernels
o Typically, each loop nest will be a distinct kernel
The number of gangs and workers can be different for each kernel

C:

  #pragma acc kernels […]
  {
    for(i=0; i < n; i++) {   /* 1st kernel */
      …
    }
    …
    for(j=0; j < n; j++) {   /* 2nd kernel */
      …
    }
  }

Fortran:

  !$acc kernels […]
  DO i=1,n    ! 1st kernel
    …
  END DO
  …
  DO j=1,n    ! 2nd kernel
    …
  END DO
  !$acc end kernels
Lab Session 2: Offloading Computations
Lab 2: Offloading Computations
Offload two SAXPY operations on the accelerator device: Y = alpha * X + Y
o X, Y are vectors
o alpha is a scalar
Use the parallel and kernels constructs
Pay attention to the compiler notifications
Use the logger to understand the behavior of the accelerator:

  $ export HMPPRT_LOG_LEVEL=info

Use CUDA profiling to display CUDA grid properties
Managing Data
What is the problem with discrete accelerators?
PCIe transfers have huge latencies
In kernels and parallel regions, data are implicitly managed
o Data are automatically transferred to and from the device
o This implies possibly useless communications
Avoiding transfers leads to better performance
OpenACC offers a solution to control transfers
Device Memory Reuse
In this example:
o A and B are allocated and transferred for the first kernels region
o A and C are allocated and transferred for the second kernels region
How to reuse A between the two kernels regions, and save transfer and allocation time?

  float A[n];

  #pragma acc kernels
  {
    for(i=0; i < n; i++) {
      A[i] = B[n - i];
    }
  }
  …
  init(C)
  …
  #pragma acc kernels
  {
    for(i=0; i < n; i++) {
      C[i] += A[i] * alpha;
    }
  }
Memory Allocations
Avoid data reallocation using the create clause
o It declares variables, arrays or subarrays to be allocated in the device memory
o No data specified in this clause will be copied between host and device
The scope of such a clause corresponds to a data region
o Data regions are used to define such scopes (as is, they have no effect)
o They define scalars, arrays and subarrays to be allocated in the device memory for the duration of the region
Kernels and parallel regions implicitly define data regions
Data Presence
How to tell the compiler that data has already been allocated?
The present clause declares data that are already present on the device
o Thanks to a data region that contains this region of code
The CAPS Runtime will find and use the data on the device
Data Construct: Create and Present Clauses

  float A[n];

  /* Allocation of A of size n on the device */
  #pragma acc data create(A)
  {
    /* Reuse of A already allocated on the device */
    #pragma acc kernels present(A)
    {
      for(i=0; i < n; i++) {
        A[i] = B[n - i];
      }
    }
    …
    init(C)
    …
    /* Reuse of A already allocated on the device */
    #pragma acc kernels present(A)
    {
      for(i=0; i < n; i++) {
        C[i] += A[i] * alpha;
      }
    }
  } /* deallocation of A on the device */
Data Storage: Mirroring
How is the data stored in a data region?
A data construct defines a section of code where data are mirrored between host and device
Mirroring duplicates a CPU memory block into the HWA memory
The mirror identifier is the CPU memory block address
o Only one mirror per CPU block
o Users ensure consistency of the copies via directives
o The CAPS Runtime keeps a descriptor mapping the master copy in host memory to the mirror copy in HWA memory
Arrays and Subarrays (1)
In C and C++, specified with start and length

  #pragma acc data create(a[0:n])  OR  #pragma acc data create(a[:n])

o Allocation of an array a of size n

  #pragma acc data create(a[2:n/2])

o Allocation of a subarray of a of size n/2
  • i.e. elements a[2], a[3], …, a[n/2-1 + 2]
Static arrays can be allocated automatically
The length of dynamically allocated arrays must be explicitly specified
Arrays and Subarrays (2)
In Fortran, specified with a list of range specifications

  !$acc data create(a(0:n,0:m))

o Allocation of an array a of size (n+1)*(m+1)

  !$acc data create(a(1:3,5:5))

o Allocation of a subarray of a of size 3*1
  • i.e. elements a(1,5), a(2,5), a(3,5)
In any language, any array or subarray must be a contiguous block of memory
Arrays and Subarrays Example
C:

  #pragma acc data create(A[:n])
  {
    #pragma acc kernels present(A[:n])
    {
      for(i=0; i < n; i++) {
        A[i] = B[n - i];
      }
    }
    …
    init(C)
    …
    #pragma acc kernels present(A[:n])
    {
      for(i=0; i < n; i++) {
        C[i] += A[i] * alpha;
      }
    }
  }

Fortran:

  !$acc data create(A(1:n))
  !$acc kernels present(A(1:n))
  do i=1,n
    A(i) = B(n - i)
  end do
  !$acc end kernels
  …
  init(C)
  …
  !$acc kernels present(A(1:n))
  do i=1,n
    C(i) = A(i) * alpha + C(i)
  end do
  !$acc end kernels
  !$acc end data
Redundant Transfers
In this example:
o A is allocated for the data section
  • No data transfer of A between host and device
o B is allocated and transferred for the first kernels region (input and output transfers)
o C is allocated and transferred for the second kernels region (input and output transfers)
How to avoid useless data transfers for B and C?

  #pragma acc data create(A[:n])
  {
    #pragma acc kernels present(A[:n])
    {
      for(i=0; i < n; i++) {
        A[i] = B[n - i];
      }
    }
    …
    #pragma acc kernels present(A[:n])
    {
      for(i=0; i < n; i++) {
        C[i] = A[i] * alpha;
      }
    }
  }
Input Transfers: Copyin Clause
Declares data that need only to be copied from the host to the device when entering the data section
o Performs input transfers only
It defines scalars, arrays and subarrays to be allocated in the device memory for the duration of the data region

  #pragma acc data create(A[:n])
  {
    #pragma acc kernels present(A[:n]) copyin(B[:n])
    {
      for(i=0; i < n; i++) {
        A[i] = B[n - i];
      }
    }
    …
    #pragma acc kernels present(A[:n])
    {
      for(i=0; i < n; i++) {
        C[i] = A[i] * alpha;
      }
    }
  }
Output Transfers: Copyout Clause
Declares data that need only to be copied from the device to the host when exiting the data section
o Performs output transfers only
It defines scalars, arrays and subarrays to be allocated in the device memory for the duration of the data region

  #pragma acc data create(A[:n])
  {
    #pragma acc kernels present(A[:n]) copyin(B[:n])
    {
      for(i=0; i < n; i++) {
        A[i] = B[n - i];
      }
    }
    …
    #pragma acc kernels present(A[:n]) copyout(C[:n])
    {
      for(i=0; i < n; i++) {
        C[i] = A[i] * alpha;
      }
    }
  }
Input/Output Transfers: Copy Clause
If we change the example so that C is both read and written, how to express that input and output transfers of C are required?
Use the copy clause to:
o Declare data that need to be copied from the host to the device when entering the data section
o Assign values on the device that need to be copied back to the host when exiting the data section
o Allocate scalars, arrays and subarrays in the device memory for the duration of the data region
It corresponds to the default behavior in our example

  #pragma acc data create(A[:n])
  {
    #pragma acc kernels present(A[:n]) copyin(B[:n])
    {
      for(i=0; i < n; i++) {
        A[i] = B[n - i];
      }
    }
    …
    init(C)
    …
    #pragma acc kernels present(A[:n]) copy(C[:n])
    {
      for(i=0; i < n; i++) {
        C[i] += A[i] * alpha;
      }
    }
  }
Transfer Example: Summary

  /* Allocation of A of size n on the device */
  #pragma acc data create(A[:n])
  {
    /* Reuse of A already allocated on the device;
       allocation of B of size n and host-to-device transfer of B */
    #pragma acc kernels present(A[:n]) copyin(B[:n])
    {
      for(i=0; i < n; i++) {
        A[i] = B[n - i];
      }
    } /* deallocation of B on the device */
    …
    init(C)
    …
    /* Reuse of A already allocated on the device;
       allocation of C of size n and host-to-device transfer of C */
    #pragma acc kernels present(A[:n]) copy(C[:n])
    {
      for(i=0; i < n; i++) {
        C[i] += A[i] * alpha;
      }
    } /* device-to-host transfer of C and deallocation of C */
  } /* deallocation of A on the device */
Alternative Behaviors
In this example:
o X is allocated for the data region
o The first call to subroutine f1 reuses the data of X already allocated (as dummy argument A)
What happens for the second call to f1?
o A is specified as present, but its mirror was released at the end of the data section
o It leads to an error when executed

  program main
    …
    !$acc data create(X(1:n))
    call f1( n, X, Y )
    …
    !$acc end data
    …
    call f1( n, X, Z )
    …
  contains
    subroutine f1 ( n, A, B )
      …
      !$acc kernels present(A(1:n)) &
      !$acc&        copyin(B(1:n))
      do i=1,n
        A(i) = B(n - i)
      end do
      !$acc end kernels
    end subroutine f1
    …
  end program main
Present_or_create Clause
Combines two behaviors
Declares data that may be present
o If the data is already present, use the value in the device memory
o If not, allocate the data on the device when entering the region and deallocate it when exiting
May be shortened to pcreate
Present_or_copyin/copyout Clauses
If the data is already present, use the value in the device memory
If not:
o Both present_or_copyin and present_or_copyout allocate memory on the device at region entry
o present_or_copyin copies the value from the host at region entry
o present_or_copyout copies the value from the device to the host at region exit
o Both deallocate the memory at region exit
May be shortened to pcopyin and pcopyout
Present_or_copy Clause
If the data is already present, use the value in the device memory
If not:
o Allocates the data on the device and copies the value from the host at region entry
o Copies the value from the device to the host and deallocates the memory at region exit
May be shortened to pcopy
Present_or_* Clauses Example
First call (inside the data region):
o A of size n is allocated on the device by the data region and reused by f1
o B of size n is allocated on the device for the duration of the subroutine, with an input transfer of B
Second call (outside the data region):
o A and B of size n are allocated on the device for the duration of the subroutine
o Input transfer of B and output transfer of A
Present_or_* clauses are generally safer

  program main
    …
    !$acc data create(A(1:n))
    call f1( n, A, B )
    …
    !$acc end data
    …
    call f1( n, A, C )
    …
  contains
    subroutine f1 ( n, A, B )
      …
      !$acc kernels pcopyout(A(1:n)) &
      !$acc&        copyin(B(1:n))
      do i=1,n
        A(i) = B(n - i)
      end do
      !$acc end kernels
    end subroutine f1
    …
  end program main
Default Behavior
CAPS Compilers are able to detect the variables required on the device for the kernels and parallel constructs
According to the specification, variables follow these policies depending on their type:
o Arrays: present_or_copy behavior
o Scalars:
  • private behavior if not a live-in or live-out variable
  • copy behavior otherwise
Constructs and Directives
OpenACC defines two ways of managing accelerator allocations and transfers
o With data constructs followed by allocation or transfer clauses
o Or with standalone directives for allocations or transfers
Data constructs are declarative
o They define properties for a code region and its variables
Imperative directives are standalone statements
Declare Directive
In Fortran: used in the declaration section of a subroutine
In C/C++: follows a variable declaration
Specifies variables or arrays to be allocated in the device memory for the duration of the function, subroutine or program
Specifies the kind of transfer to realize (create, copy, copyin, etc.)

With a data construct:

  float A[n];
  #pragma acc data create(A)
  {
    #pragma acc kernels present(A)
    {
      for(i=0; i < n; i++) {
        A[i] = B[n - i];
      }
    }
  }

With a declare directive:

  float A[n];
  #pragma acc declare create(A)

  #pragma acc kernels present(A)
  {
    for(i=0; i < n; i++) {
      A[i] = B[n - i];
    }
  }
Update Directive
Used within an explicit or implicit data region
Updates all or part of host memory arrays with values from the device when used with the host clause
Updates all or part of device memory arrays with values from the host when used with the device clause

With copyout/copyin clauses:

  !$acc kernels copyout(A(1:n)) &
  !$acc&        copyin(B(1:n))
  do i=1,n
    A(i) = B(n - i)
  end do
  !$acc end kernels

With update directives:

  !$acc data create( A(1:n), B(1:n) )
  !$acc update device(B(1:n))
  !$acc kernels
  do i=1,n
    A(i) = B(n - i)
  end do
  !$acc end kernels
  !$acc update host(A(1:n))
  !$acc end data
Lab Session 3: Data Management
Lab 3: Data Management
Offload two SAXPY operations (cf. Lab 2)
o Where arrays are allocated dynamically
Specify data sizes on kernels and parallel regions and the appropriate transfers
Avoid deallocating and reallocating the data on the accelerator by defining a data section
Ensure the data displayed between the two compute regions are correct by updating the host mirror
Notice the performance evolution and understand why, thanks to the logger
Specifying Parallelization
Parallel and Kernels Constructs Default Behavior
By default, CAPS Compilers create 192 gangs and 256 workers containing 1 vector each for parallel and kernels regions
o The resulting CUDA grid size will be 192 thread blocks
o Each thread block contains 256*1 CUDA threads
CAPS Compilers detect data-independent loops and distribute iterations among gangs and workers:
  Loop 'i' was shared among gangs(192) and workers(256)
How to modify the number of gangs, workers or vectors?
Gangs, Workers, Vectors in Parallel Constructs
In parallel constructs, the number of gangs, workers and vectors is the same for the entire section
The clauses:
o num_gangs
o num_workers
o vector_length
specify the number of gangs, workers and vectors in the corresponding parallel section

  #pragma acc parallel num_gangs(128) \
                       num_workers(256)
  {
    …
    for(i=0; i < n; i++) {
      for(j=0; j < m; j++) {
        …
      }
    }
    …
  }
Loop Constructs
A loop directive applies to the loop that immediately follows the directive
The parallelism to use is described by one of the following clauses:
o gang for coarse-grain parallelism
o worker for middle-grain parallelism
o vector for fine-grain parallelism
Gangs (1)
Gang clause:
o The iterations of the following loop are executed in parallel
o Iterations are distributed among the available gangs
o In a parallel construct, no argument is allowed

  #pragma acc parallel num_gangs(128) \
                       num_workers(192)
  {
    …
    #pragma acc loop gang
    for(i=0; i < n; i++) {
      for(j=0; j < m; j++) {
        …
      }
    }
    …
  }
Gangs (2)
With two gangs, the iteration space is split between them:

  #pragma acc parallel num_gangs(2)
  {
    #pragma acc loop gang
    for(i = 0; i < n; i ++) {
      A[i] = B[i] * B[i] * 3.14;
    }
  }

  /* Gang 0 executes: */
  for(i = 0; i < n/2; i ++) {
    A[i] = B[i] * B[i] * 3.14;
  }

  /* Gang 1 executes: */
  for(i = n/2; i < n; i ++) {
    A[i] = B[i] * B[i] * 3.14;
  }
Workers
Worker clause:
o The iterations of the following loop are executed in parallel
o Iterations are distributed among the multiple workers within a single gang
o Loop iterations must be data independent, unless the loop performs a reduction operation
o In a parallel construct, no argument is allowed

  #pragma acc parallel num_gangs(128) \
                       num_workers(192)
  {
    …
    #pragma acc loop gang
    for(i=0; i < n; i++) {
      #pragma acc loop worker
      for(j=0; j < n; j++) {
        …
      }
    }
    …
  }
Vector
Vector clause:
o The iterations of the following loop are executed in SIMD mode
o Iterations are distributed among the vector lanes of a single worker
o In a parallel construct, no argument is allowed

  #pragma acc parallel num_gangs(128) \
                       num_workers(192)
  {
    …
    #pragma acc loop gang
    for(i=0; i < n; i++) {
      #pragma acc loop worker
      for(j=0; j < m; j++) {
        #pragma acc loop vector
        for(k=0; k < l; k++) {
          …
        }
      }
    }
    …
  }
Gang, Worker, Vector in Kernels Constructs
The parallelism description is the same as in parallel sections
However, these clauses accept an argument to specify the number of gangs, workers or vectors to use
Every loop can have a different number of gangs, workers or vectors in the same kernels region

  #pragma acc kernels
  {
    …
    #pragma acc loop gang(128)
    for(i=0; i < n; i++) {
      …
    }
    …
    #pragma acc loop gang(64)
    for(j=0; j < m; j++) {
      …
    }
    …
  }
Data Independency
In kernels sections, the independent clause specifies that the iterations of the loop are data-independent
The user does not have to think about gang, worker or vector parameters
Allows the compiler to generate code to execute the iterations in parallel with no synchronization

  A[0] = 0;
  #pragma acc loop independent
  for(i=1; i<n; i++) {
    A[i] = A[i-1];    /* programming error: iteration i reads A[i-1] */
  }

  A(1) = 0
  !$acc loop independent
  DO i=2,n
    A(i) = A(i-1)    ! programming error: iteration i reads A(i-1)
  END DO
Sequential Execution
It is possible to specify sequential loops using the seq clause
Useful to increase the work per thread, for example:

  !$acc loop independent
  DO i=0,n
    !$acc loop seq
    DO j=1,4
      A(j)…
    ENDDO
  ENDDO