Programming Heterogeneous Systems
F. Bodin, June 2013, Uppsala

Introduction
 HPC and embedded software are going through dramatic changes to adapt to massive parallelism
o A huge market issue
o Many codes and users are not ready for directive-based approaches


  1. Other Source-to-Source Technologies
 PoCC (Polyhedral Compiler Collection)
o http://www.cse.ohio-state.edu/~pouchet/software/pocc/
 Rose source-to-source translators
o http://rosecompiler.org/
 Cetus, a source-to-source compiler infrastructure
o http://cetus.ecn.purdue.edu
 EDG, Edison Design Group technology
o http://www.edg.com
 Clang
o http://clang.llvm.org
 Insieme Compiler
o http://www.insieme-compiler.org/
 …

  2. CAPS Auto-tuning Approach

  3. An Embedded DSL Oriented Approach to Auto-Tuning
 Source-to-source approach
o Exploits native compilers
 Code partitioning to help the offline approach
o CodeletFinder
 Scripting to implement DSL approaches
o Generates domain/code specific search and optimization spaces
o Static selection of variants
 Runtime instantiation of code parameters
o Variant parameters fixed at runtime
 Low-level API for "auto-tuning drivers"
o Separates objective functions from optimization space generation
 An engineering issue
o How to embed code/domain specific strategies so they are ready for programmers to use
o Still dealing with legacy code
o Integration in the compiling process is a key feature
 Focus on node-level issues

  4. Auto-tuning Flow Overview
[Flow diagram: CodeletFinder extracts codelets from the source code; the HMPP compiler, driven by optimizing scripts and an optimizing strategy, produces an auto-tunable executable; performance tools collect profiling data; the CAPS tuning driver explores the variant space through the auto-tuning code interface.]

  5. Code Partitioning for Auto-Tuning
 Tuning and analyzing the performance of large/complex applications is usually a challenge
o Execution time with real data sets is usually too long to be compatible with the trial/experiment cycle
o Many performance or tuning tools cannot be used at large scale
 Compute intensive parts of the code usually represent a small portion of the total application
o Extracting these parts makes it possible to focus on them
o Allows the use of many analysis and optimizing tools
o Faster experiment cycles
 Similar works: Code Isolator (Y.-J. Lee and M. W. Hall) and Rose Outliner (C. Liao, D. J. Quinlan, R. Vuduc, and T. Panas)

  6. CodeletFinder
 Decomposes applications into hotspots
 Each hotspot can be efficiently analyzed separately
o Outlined hotspots
 Mix of static and dynamic analysis
o Code and data

  7. CodeletFinder Process Overview
 Hotspot Finder
o Finds hotspots in the application using execution profiles
o Statically extracts potential hotspots
 Codelet Builder
o Builds the codelets based on the identified hotspots (code outliner)
o Patterns are given to build the codelets
 Capture
o Captures data for the micro-benches
o Captures execution parameters
 Micro Bencher
o Creates standalone micro-benches
o Runs the micro-benches
 Project Builder
o Captures the build process
o Replays the build on demand

  8. Scripting to Generate the Search Space
 Most tuning strategies are code/domain specific
o With regard to the code structure and runtime properties
o Many codes live long, which amortizes code-specific approaches
 Many different high-level approaches can be embedded
o Stencil code generators (e.g. Patus)
o Polyhedral model based approaches (e.g. PoCC)
o Libraries
o Data structure transformations
o …

  9. New CAPS Compiler Features
 Code transformation scripts (in Lua) can be added as a pre-compilation phase
o Application/domain specific scripts run in a scripting engine
 Scripts can read and modify the source code AST
[Diagram: C, C++ and Fortran frontends feed an extraction module that splits the application into host code (Fun #1, Fun #2, Fun #3) and kernels; the scripting engine transforms the kernels; an instrumentation module and OpenCL/CUDA generation feed the CPU compiler and the native compilers (gcc, ifort, …).]

  10. Tuning Script Implementation
 Directives convey programmer knowledge
 The code provides low level information
o e.g. loop indices, variable names, …
 Scripts hide low level code transformation details
 Many loop transformations can be implemented using hmppcg directives
 A directive pair delimits the code region; it names the script to be activated (scriptName) and passes expressions providing high level information to the script (scriptInput):

  !$capstune scriptName scriptInput
    <code region>
  !$capstune end scriptName

  11. Simple Example 1
 The directive specifies the script to generate an optimized stencil code using various methods:
o Multiple variants
o External tools
o Using a library

  …
  !$capstune stencil …
  !$acc kernels
  !$acc loop independent
  do i=1,10
    !$acc loop independent
    do j=1,10
      a(i,j) = … b(i,j) …
    end do
  end do
  !$acc end kernels
  !$capstune end stencil
  …

  12. Simple Example 2
 Transform a data structure for an accelerator: take slices of a derived type
o The decision usually cannot be made from local code analysis

  TYPE foo
    REAL :: w(10,10)
    REAL :: x(10,10)
    REAL :: y(10,10)
    REAL :: z(10,10)
  END TYPE foo
  …
  !$capstune scalarize state_x => state%x, state_z => state%z
  !$acc parallel num_gangs(10) num_workers(10) copyout(state_x) copyin(state_z)
  !$acc loop gang
  do i=1,10
    !$acc loop worker
    do j=1,10
      state%x(i,j) = state%z(i,j) + i + j/1000.0
    end do
  end do
  !$acc end parallel
  !$capstune end scalarize

  13. Making OpenMP Codes Heterogeneous

  14. Code Generation Process Overview
 Automatically converts OpenMP code to use the GPU
 Currently focusing on AMD APUs
 Incremental process to make the OpenMP code GPU friendly (see the sketch below)
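As a concrete illustration, this is the kind of OpenMP loop nest such a process targets (a hypothetical sketch; the variable names are not from the original slides):

  /* A data-parallel OpenMP loop with no cross-iteration
     dependences: a natural candidate for GPU code generation. */
  #pragma omp parallel for
  for (int i = 0; i < n; i++) {
      y[i] = alpha * x[i] + y[i];
  }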

  15. Data Use Analysis
 Necessary to allocate data on the accelerator and to compute basic data transfer overheads
 Keeps the analysis overhead low
 Analysis based on an abstract execution of the OpenMP loop nest sequence

  16. Preliminary Example of Experiments
[Slide shows a display of experimental results.]

  17. PART I Conclusion
 OpenMP is a good start for migrating codes
o Data use analysis is a key feature
 Source-to-source technology is well adapted to heterogeneity
o Avoids the "one compiler fits all" approach
 Auto-tuning techniques help to simplify code tuning and deployment
 The DSL approach helps to guide the auto-tuning process

  18. PART II OpenACC Directives for Accelerators

  19. Credits
 http://www.openacc.org/
o V1.0 specification: November 2011
 OpenACC, Directives for Accelerators, Nvidia slideware
 CAPS Compilers-3.x OpenACC Reference Manual, CAPS entreprise

  20. Agenda
 OpenACC Overview and Compilers
o Lab Session 1: Using CAPS Compilers
 Programming Model
o Lab Session 2: Offloading Computations
 Managing Data
o Lab Session 3: Optimizing Data Transfers
 Specifying Parallelization
o Lab Session 4: Optimizing Compute Kernels
 Asynchronism
o Lab Session 5: Performing Asynchronous Computations
 Runtime API
 OpenACC 2.0 Draft Specification

  21. OpenACC Overview and Compilers

  22. Directive-based Programming (1)
 Three ways of programming GPGPU applications:
o Libraries: ready-to-use acceleration
o Directives: quickly accelerate existing applications
o Programming languages: maximum performance

  23. Directive-based Programming (2)

  24. Advantages of Directive-based Programming
 Simple and fast development of accelerated applications
 Non-intrusive
 Helps to keep a unique version of the code
o To preserve code assets
o To reduce maintenance cost
o To be portable across several accelerators
 Incremental approach
 Enables "portable" performance

  25. OpenACC Initiative
 An initiative of CAPS, CRAY, Nvidia and PGI
 Open standard
 A directive-based approach for programming heterogeneous many-core hardware from C and Fortran applications
 http://www.openacc-standard.com

  26. OpenACC Compilers (1)
 CAPS Compilers
o Source-to-source compiler
o Supports Intel Xeon Phi, NVIDIA GPUs, AMD GPUs and APUs
 PGI Accelerator
o Extension of the x86 PGI compilers
o Supports Intel Xeon Phi, NVIDIA GPUs, AMD GPUs and APUs
 Cray Compiler
o Provided with Cray systems only

  27. CAPS Compilers (2)
Source-to-source compilers composed of three parts:
 The directives (OpenACC or OpenHMPP)
o Define parts of code to be accelerated
o Indicate resource allocation and communication
o Ensure portability
 The toolchain
o Helps building manycore applications
o Includes compilers and target code generators
o Insulates hardware specific computations
o Uses the hardware vendor's SDK
 The runtime
o Helps to adapt to the platform configuration
o Manages hardware resource availability

  28. CAPS Compilers (3)
 Take the original application as input and generate another application source code as output
o Automatically turn the OpenACC source code into an accelerator-specific source code (CUDA, OpenCL)
 Compile the entire hybrid application
 Just prefix the original compilation line with capsmc to produce a hybrid application:

  $ capsmc gcc myprogram.c
  $ capsmc gfortran myprogram.f90

  29. CAPS Compilers (4)
 CAPS Compilers drive all compilation passes
 Host application compilation
o Calls traditional CPU code compilers (gcc, ifort, …)
o The CAPS Runtime is linked to the host part of the application
 Device code production
o According to the specified target
o A dynamic library is built
[Diagram: C, C++ and Fortran frontends feed an extraction module that splits the application into host code (Fun #1, Fun #2, Fun #3) and codelets; an instrumentation module and OpenCL/CUDA code generation feed the CPU compiler and the OpenCL/CUDA compilers; the result is the executable (mybin.exe), the HWA code (dynamic library) and the CAPS Runtime.]

  30. CAPS Compilers Options
 Usage:
  $ capsmc [CAPSMC_FLAGS] <host_compiler> [HOST_COMPILER_FLAGS] <source_files>
 To display the compilation process:
  $ capsmc -d -c gcc myprogram.c
 To specify the accelerator-specific code target:
  $ capsmc --openacc-target CUDA gcc myprogram.c    #(default)
  $ capsmc --openacc-target OPENCL gcc myprogram.c  #(AMD and Phi)

  31. Lab Session 1: Using CAPS Compilers

  32. Lab 1: Using CAPS Compilers
 Compile and execute a simple "Hello world!" application
 Use the -d and -c flags to display the compilation process
 Use ldd on the output executable to print library dependencies

  33. Programming Model

  34. Programming Model
 Express data and computations to be executed on an accelerator
o Using marked code regions
 Main OpenACC constructs
o Parallel and kernels regions
o Parallel loops
o Data regions
o Runtime API
[Diagram: data/stream/vector parallelism in the marked regions is exploited by the HWA, e.g. through CUDA/OpenCL; the CPU and the HWA are linked with a PCIe bus.]

  35. Execution Model
 Among the bulk of computations executed by the CPU, some regions can be offloaded to hardware accelerators
o Parallel regions
o Kernels regions
 The host is responsible for:
o Allocating memory space on the accelerator
o Initiating data transfers
o Launching computations
o Waiting for completion
o Deallocating memory space
 Accelerators execute the parallel regions:
o Use work-sharing directives
o Specify the level of parallelization
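With directives, these host responsibilities are handled implicitly. A minimal sketch of the pattern, assuming C and a combined parallel loop with a copy clause (illustrative, not from the original slides):

  /* The copy clause makes the host allocate a[0:n] on the device,
     transfer it in, launch the computation, wait for completion,
     copy the result back and deallocate -- all implicitly. */
  #pragma acc parallel loop copy(a[0:n])
  for (int i = 0; i < n; i++) {
      a[i] = 2.0f * a[i];
  }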

  36. OpenACC Execution Model
 Host-controlled execution
 Based on three parallelism levels
o Gangs – coarse grain
o Workers – fine grain
o Vectors – finest grain
[Diagram: a device contains gangs; each gang contains workers; each worker executes vectors.]

  37. Gangs, Workers, Vectors
 In CAPS Compilers, gangs, workers and vectors correspond to the following in a CUDA grid:
o gridDim.x = number of gangs
o gridDim.y = 1
o blockDim.y = number of workers
o blockDim.x = number of vectors
 Beware: this implementation is compiler-dependent
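Under this mapping, a sketch of the CUDA geometry implied by some hypothetical clause values (illustrative numbers, not from the slides):

  /* Assumed CAPS mapping, compiler-dependent.               */
  int num_gangs     = 128;  /* from num_gangs(128)           */
  int num_workers   = 256;  /* from num_workers(256)         */
  int vector_length = 2;    /* from vector_length(2)         */
  /* Implied CUDA geometry:                                  */
  int gridDim_x  = num_gangs;      /* 128 thread blocks      */
  int gridDim_y  = 1;
  int blockDim_y = num_workers;    /* 256                    */
  int blockDim_x = vector_length;  /* 2                      */
  /* CUDA threads per block = blockDim_x * blockDim_y = 512  */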

  38. Directive Syntax
 C:

  #pragma acc directive-name [clause [, clause] …]
  { code to offload }

 Fortran:

  !$acc directive-name [clause [, clause] …]
    code to offload
  !$acc end directive-name

  39. Parallel Construct
 Starts parallel execution on the accelerator
 Creates gangs and workers
 The number of gangs and workers remains constant for the parallel region
 One worker in each gang begins executing the code in the region

  #pragma acc parallel […]
  {
    …
    for(i=0; i < n; i++) {
      for(j=0; j < n; j++) {
        …   /* code executed on the hardware accelerator */
      }
    }
    …
  }

  40. Kernels Construct
 Defines a region of code to be compiled into a sequence of accelerator kernels
o Typically, each loop nest will be a distinct kernel
 The number of gangs and workers can be different for each kernel

  C:
  #pragma acc kernels […]
  {
    for(i=0; i < n; i++) {   /* 1st kernel */
      …
    }
    …
    for(j=0; j < n; j++) {   /* 2nd kernel */
      …
    }
  }

  Fortran:
  !$acc kernels […]
  DO i=1,n    ! 1st kernel
    …
  END DO
  …
  DO j=1,n    ! 2nd kernel
    …
  END DO
  !$acc end kernels

  41. Lab Session 2: Offloading Computations

  42. Lab 2: Offloading Computations
 Offload two SAXPY operations onto the accelerator device: Y = alpha * X + Y
o X, Y are vectors
o alpha is a scalar
 Use the parallel and kernels constructs
 Pay attention to the compiler notifications
 Use the logger to understand the behavior of the accelerator:
  $ export HMPPRT_LOG_LEVEL=info
 Use CUDA profiling to display the CUDA grid properties
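For reference, a minimal sketch of what the offloaded SAXPY could look like with the kernels construct (a possible shape, not the official lab solution):

  /* Y = alpha * X + Y offloaded to the accelerator. */
  #pragma acc kernels
  {
      for (int i = 0; i < n; i++) {
          y[i] = alpha * x[i] + y[i];
      }
  }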

  43. Managing Data

  44. What is the problem with discrete accelerators?
 PCIe transfers have huge latencies
 In kernels and parallel regions, data are implicitly managed
o Data are automatically transferred to and from the device
o This implies possibly useless communications
 Avoiding transfers leads to better performance
 OpenACC offers a solution to control transfers

  45. Device Memory Reuse
 In this example:
o A and B are allocated and transferred for the first kernels region
o A and C are allocated and transferred for the second kernels region
 How to reuse A between the two kernels regions?
o And save transfer and allocation time

  float A[n];
  #pragma acc kernels
  {
    for(i=0; i < n; i++) {
      A[i] = B[n - i];
    }
  }
  …
  init(C)
  …
  #pragma acc kernels
  {
    for(i=0; i < n; i++) {
      C[i] += A[i] * alpha;
    }
  }

  46. Memory Allocations
 Avoid data reallocation using the create clause
o It declares variables, arrays or subarrays to be allocated in the device memory
o No data specified in this clause is copied between host and device
 The scope of such a clause corresponds to a data region
o Data regions are used to define such scopes (by themselves, they have no other effect)
o They define scalars, arrays and subarrays to be allocated in the device memory for the duration of the region
 Kernels and parallel regions implicitly define data regions

  47. Data Presence
 How to tell the compiler that data has already been allocated?
 The present clause declares data that are already present on the device
o Thanks to a data region that contains this region of code
 The CAPS Runtime will find and use the data on the device

  48. Data Construct: Create and Present Clauses

  float A[n];
  #pragma acc data create(A)          /* allocation of A of size n on the device */
  {
    #pragma acc kernels present(A)    /* reuse of A already allocated on the device */
    {
      for(i=0; i < n; i++) {
        A[i] = B[n - i];
      }
    }
    …
    init(C)
    …
    #pragma acc kernels present(A)    /* reuse of A already allocated on the device */
    {
      for(i=0; i < n; i++) {
        C[i] += A[i] * alpha;
      }
    }
  }                                   /* deallocation of A on the device */

  49. Data Storage: Mirroring
 How is the data stored in a data region?
 A data construct defines a section of code where data are mirrored between the host and the device
 Mirroring duplicates a CPU memory block into the HWA memory
 The mirror identifier is the CPU memory block address
o Only one mirror per CPU block
o Users ensure consistency of the copies via directives
o The CAPS Runtime keeps a descriptor for each mirror
[Diagram: a master copy in host memory and its mirror copy in HWA memory, linked through a CAPS Runtime descriptor.]

  50. Arrays and Subarrays (1)
 In C and C++, specified with start and length:

  #pragma acc data create(a[0:n])   /* or: create(a[:n]) */
o Allocation of an array a of size n

  #pragma acc data create(a[2:n/2])
o Allocation of a subarray of a of size n/2
• i.e. elements a[2], a[3], …, a[n/2-1 + 2]

o Static arrays can be allocated automatically
o The length of dynamically allocated arrays must be explicitly specified (see the sketch below)
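For instance, the last point implies a pattern like the following (a sketch, assuming a heap-allocated array):

  /* a is dynamically allocated, so its length cannot be inferred:
     the subarray bounds must be given explicitly in the clause. */
  float *a = (float *)malloc(n * sizeof(float));
  #pragma acc data create(a[0:n])
  {
      /* … kernels or parallel regions using a … */
  }
  free(a);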

  51. Arrays and Subarrays (2)
 In Fortran, specified with a list of range specifications:

  !$acc data create(a(0:n,0:m))
o Allocation of an array a of size (n+1)*(m+1)

  !$acc data create(a(1:3,5:5))
o Allocation of a subarray of a of size 3*1
• i.e. elements a(1,5), a(2,5), a(3,5)

 In any language, any array or subarray must be a contiguous block of memory

  52. Arrays and Subarrays Example
  C version:

  #pragma acc data create(A[:n])
  {
    #pragma acc kernels present(A[:n])
    {
      for(i=0; i < n; i++) {
        A[i] = B[n - i];
      }
    }
    …
    init(C)
    …
    #pragma acc kernels present(A[:n])
    {
      for(i=0; i < n; i++) {
        C[i] += A[i] * alpha;
      }
    }
  }

  Fortran version:

  !$acc data create(A(1:n))
  !$acc kernels present(A(1:n))
  do i=1,n
    A(i) = B(n - i)
  end do
  !$acc end kernels
  …
  init(C)
  …
  !$acc kernels present(A(1:n))
  do i=1,n
    C(i) = A(i) * alpha + C(i)
  end do
  !$acc end kernels
  !$acc end data

  53. Redundant Transfers
 In this example:
o A is allocated for the data section
• No data transfer of A between host and device
o B is allocated and transferred for the first kernels region
• Input transfer
• Output transfer
o C is allocated and transferred for the second kernels region
• Input transfer
• Output transfer
 How to avoid the useless data transfers for B and C?

  #pragma acc data create(A[:n])
  {
    #pragma acc kernels present(A[:n])
    {
      for(i=0; i < n; i++) {
        A[i] = B[n - i];
      }
    }
    …
    #pragma acc kernels present(A[:n])
    {
      for(i=0; i < n; i++) {
        C[i] = A[i] * alpha;
      }
    }
  }

  54. Input Transfers: Copyin Clause
 Declares data that only need to be copied from the host to the device when entering the data section
o Performs input transfers only
 It defines scalars, arrays and subarrays to be allocated in the device memory for the duration of the data region

  #pragma acc data create(A[:n])
  {
    #pragma acc kernels present(A[:n]) copyin(B[:n])
    {
      for(i=0; i < n; i++) {
        A[i] = B[n - i];
      }
    }
    …
    #pragma acc kernels present(A[:n])
    {
      for(i=0; i < n; i++) {
        C[i] = A[i] * alpha;
      }
    }
  }

  55. Output Transfers: Copyout Clause
 Declares data that only need to be copied from the device to the host when exiting the data section
o Performs output transfers only
 It defines scalars, arrays and subarrays to be allocated in the device memory for the duration of the data region

  #pragma acc data create(A[:n])
  {
    #pragma acc kernels present(A[:n]) copyin(B[:n])
    {
      for(i=0; i < n; i++) {
        A[i] = B[n - i];
      }
    }
    …
    #pragma acc kernels present(A[:n]) copyout(C[:n])
    {
      for(i=0; i < n; i++) {
        C[i] = A[i] * alpha;
      }
    }
  }

  56. Input/Output Transfers: Copy Clause
 If we change the example, how to express that both input and output transfers of C are required?
 Use the copy clause to:
o Declare data that need to be copied from the host to the device when entering the data section
o Assign values on the device that need to be copied back to the host when exiting the data section
o Allocate scalars, arrays and subarrays in the device memory for the duration of the data region
 It corresponds to the default behavior in our example

  #pragma acc data create(A[:n])
  {
    #pragma acc kernels present(A[:n]) copyin(B[:n])
    {
      for(i=0; i < n; i++) {
        A[i] = B[n - i];
      }
    }
    …
    init(C)
    …
    #pragma acc kernels present(A[:n]) copy(C[:n])
    {
      for(i=0; i < n; i++) {
        C[i] += A[i] * alpha;
      }
    }
  }

  57. Transfer Example: Summary

  #pragma acc data create(A[:n])                      /* allocate A (size n) on the device */
  {
    #pragma acc kernels present(A[:n]) copyin(B[:n])  /* reuse A; allocate B (size n) and
                                                         transfer B from host to device */
    {
      for(i=0; i < n; i++) {
        A[i] = B[n - i];
      }
    }                                                 /* deallocate B on the device */
    …
    init(C)
    …
    #pragma acc kernels present(A[:n]) copy(C[:n])    /* reuse A; allocate C (size n) and
                                                         transfer C from host to device */
    {
      for(i=0; i < n; i++) {
        C[i] += A[i] * alpha;
      }
    }                                                 /* transfer C from device to host and
                                                         deallocate C on the device */
  }                                                   /* deallocate A on the device */

  58. Alternative Behaviors
 In this example:
o X is allocated for the data region
o The first call to subroutine f1 reuses the data already allocated on the device (X is passed as dummy argument A)
 What happens for the second call to f1?
o A is specified as present, but the device copy was released at the end of the data region
o This leads to an error at execution

  program main
    …
    !$acc data create(X(1:n))
    call f1( n, X, Y )
    …
    !$acc end data
    …
    call f1( n, X, Z )
    …
  contains
    subroutine f1 ( n, A, B )
      …
      !$acc kernels present(A(1:n)) copyin(B(1:n))
      do i=1,n
        A(i) = B(n - i)
      end do
      !$acc end kernels
    end subroutine f1
    …
  end program main

  59. Present_or_create Clause
 Combines two behaviors
 Declares data that may already be present
o If the data is already present, use the value in the device memory
o If not, allocate the data on the device when entering the region and deallocate it when exiting
 May be shortened to pcreate

  60. Present_or_copyin/copyout Clauses
 If the data is already present, use the value in the device memory
 If not:
o Both present_or_copyin and present_or_copyout allocate memory on the device at region entry
o present_or_copyin copies the value from the host at region entry
o present_or_copyout copies the value from the device to the host at region exit
o Both deallocate the memory at region exit
 May be shortened to pcopyin and pcopyout

  61. Present_or_copy Clause
 If the data is already present, use the value in the device memory
 If not:
o Allocates the data on the device and copies the value from the host at region entry
o Copies the value from the device to the host and deallocates the memory at region exit
 May be shortened to pcopy
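A C sketch of how a present_or_* clause makes the earlier two-call pattern safe (hypothetical; it mirrors the Fortran example on the next slide):

  /* With pcopyout, f1 works whether or not A is on the device:
     inside a data region the existing device copy of A is reused;
     outside one, A is allocated, computed, and copied back. */
  void f1(int n, float *A, const float *B) {
      #pragma acc kernels pcopyout(A[0:n]) copyin(B[0:n])
      {
          for (int i = 0; i < n; i++) {
              A[i] = B[n - 1 - i];
          }
      }
  }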

  62. Present_or_* Clauses Example
 First call (inside the data region): A of size n is already allocated on the device and is reused; B of size n is allocated on the device for the duration of the subroutine, with an input transfer of B
 Second call (outside the data region): A and B of size n are allocated on the device for the duration of the subroutine, with an input transfer of B and an output transfer of A
 Present_or_* clauses are generally safer

  program main
    …
    !$acc data create(A(1:n))   ! allocation of A of size n on the device
    call f1( n, A, B )
    …
    !$acc end data              ! deallocation of A on the device
    …
    call f1( n, A, C )
    …
  contains
    subroutine f1 ( n, A, B )
      …
      !$acc kernels pcopyout(A(1:n)) copyin(B(1:n))
      do i=1,n
        A(i) = B(n - i)
      end do
      !$acc end kernels
    end subroutine f1
    …
  end program main

  63. Default Behavior
 CAPS Compilers are able to detect the variables required on the device for the kernels and parallel constructs
 According to the specification, variables follow these policies depending on their type:
o Arrays: present_or_copy behavior
o Scalars:
• private behavior if the variable is neither live-in nor live-out
• copy behavior otherwise
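A sketch of what these defaults mean on a simple region (an illustration of the described policies, not compiler output):

  /* No data clauses are given. Under the defaults above:       */
  #pragma acc kernels
  {
      for (int i = 0; i < n; i++) {  /* i: neither live-in nor
                                        live-out -> private     */
          a[i] = a[i] * alpha;       /* a: present_or_copy      */
      }                              /* alpha: live-in scalar
                                        -> copy behavior        */
  }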

  64. Constructs and Directives
 OpenACC defines two ways of managing accelerator allocations and transfers:
o With data constructs carrying allocation or transfer clauses
o Or with standalone directives for allocations or transfers
 Data constructs are declarative
o They define properties for a code region and its variables
 Imperative directives are standalone statements

  65. Declare Directive
 In Fortran: used in the declaration section of a subroutine
 In C/C++: follows a variable declaration
 Specifies variables or arrays to be allocated in the device memory for the duration of the function, subroutine or program
 Specifies the kind of transfer to perform (create, copy, copyin, etc.)

  With a data construct:

  float A[n];
  #pragma acc data create(A)
  {
    #pragma acc kernels present(A)
    {
      for(i=0; i < n; i++) {
        A[i] = B[n - i];
      }
    }
  }

  With a declare directive:

  float A[n];
  #pragma acc declare create(A)
  #pragma acc kernels present(A)
  {
    for(i=0; i < n; i++) {
      A[i] = B[n - i];
    }
  }

  66. Update Directive
 Used within an explicit or implicit data region
 Updates all or part of host memory arrays with values from the device when used with the host clause
 Updates all or part of device memory arrays with values from the host when used with the device clause

  Without update:

  !$acc kernels copyout(A(1:n)) copyin(B(1:n))
  do i=1,n
    A(i) = B(n - i)
  end do
  !$acc end kernels

  With update:

  !$acc data create( A(1:n), B(1:n) )
  !$acc update device(B(1:n))
  !$acc kernels
  do i=1,n
    A(i) = B(n - i)
  end do
  !$acc end kernels
  !$acc update host(A(1:n))
  !$acc end data
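A C sketch of the same "with update" pattern (assuming the C spelling of the directives; init and use are hypothetical helpers):

  #pragma acc data create(A[0:n], B[0:n])
  {
      init(B);                           /* B modified on the host   */
      #pragma acc update device(B[0:n])  /* push new B to the device */
      #pragma acc kernels
      {
          for (int i = 0; i < n; i++) {
              A[i] = B[n - 1 - i];
          }
      }
      #pragma acc update host(A[0:n])    /* pull A back to the host  */
      use(A);                            /* A read on the host       */
  }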

  67. Lab Session 3: Data Management

  68. Lab 3: Data Management
 Offload the two SAXPY operations (cf. Lab 2)
o Where the arrays are allocated dynamically
 Specify the data sizes on the kernels and parallel regions, with the appropriate transfers
 Avoid deallocating and reallocating the data on the accelerator by defining a data section
 Ensure the data displayed between the two compute regions are correct by updating the host mirror
 Notice the performance evolution and understand why, thanks to the logger
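One possible shape for the solution (a hedged sketch, not the official one):

  /* Keep x and y resident across both compute regions; refresh
     the host mirror of y only where its values are displayed. */
  #pragma acc data copyin(x[0:n]) copy(y[0:n])
  {
      #pragma acc kernels present(x[0:n], y[0:n])
      {
          for (int i = 0; i < n; i++) y[i] += alpha * x[i];
      }
      #pragma acc update host(y[0:n])  /* printed values are correct */
      printf("y[0] = %f\n", y[0]);
      #pragma acc kernels present(x[0:n], y[0:n])
      {
          for (int i = 0; i < n; i++) y[i] += alpha * x[i];
      }
  }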

  69. Specifying Parallelization

  70. Parallel and Kernels Constructs Default Behavior
 By default, CAPS Compilers create 192 gangs and 256 workers, each containing 1 vector, for parallel and kernels regions
o The resulting CUDA grid size is 192 thread blocks
o Each thread block contains 256*1 CUDA threads
 CAPS Compilers detect data-independent loops and distribute their iterations among gangs and workers
o In the earlier examples, the loop on 'i' was shared among the gangs (192) and workers (256)
 How to modify the number of gangs, workers or vectors?

  71. Gangs, Workers, Vectors in Parallel Constructs
 In parallel constructs, the number of gangs, workers and vectors is the same for the entire section
 The clauses:
o num_gangs
o num_workers
o vector_length
 They specify the number of gangs, workers and vectors in the corresponding parallel section

  #pragma acc parallel num_gangs(128) num_workers(256)
  {
    …
    for(i=0; i < n; i++) {
      for(j=0; j < m; j++) {
        …
      }
    }
    …
  }
  [Diagram: 128 gangs of 256 workers each.]

  72. Loop Constructs
 A loop directive applies to the loop that immediately follows the directive
 The parallelism to use is described by one of the following clauses:
o gang for coarse-grain parallelism
o worker for middle-grain parallelism
o vector for fine-grain parallelism

  73. Gangs (1)
 Gang clause:
o The iterations of the following loop are executed in parallel
o Iterations are distributed among the available gangs
o In a parallel construct, no argument is allowed

  #pragma acc parallel num_gangs(128) num_workers(192)
  {
    …
    #pragma acc loop gang
    for(i=0; i < n; i++) {
      for(j=0; j < m; j++) {
        …
      }
    }
    …
  }
  [Diagram: iterations i=0, 1, 2, … distributed over the 128 gangs of 192 workers.]

  74. Gangs (2)
 With two gangs, the gang-distributed loop below behaves as if each gang executed half of the iteration space:

  #pragma acc parallel num_gangs(2)
  {
    #pragma acc loop gang
    for(i = 0; i < n; i++) {
      A[i] = B[i] * B[i] * 3.14;
    }
  }

  Gang 0 executes:

  for(i = 0; i < n/2; i++) {
    A[i] = B[i] * B[i] * 3.14;
  }

  Gang 1 executes:

  for(i = n/2; i < n; i++) {
    A[i] = B[i] * B[i] * 3.14;
  }

  75. Workers
 Worker clause:
o The iterations of the following loop are executed in parallel
o Iterations are distributed among the multiple workers within a single gang
o Loop iterations must be data independent, unless the loop performs a reduction operation
o In a parallel construct, no argument is allowed

  #pragma acc parallel num_gangs(128) num_workers(192)
  {
    …
    #pragma acc loop gang
    for(i=0; i < n; i++) {
      #pragma acc loop worker
      for(j=0; j < n; j++) {
        …
      }
    }
    …
  }
  [Diagram: for each gang (i=0, 1, 2, …), iterations j=0, 1, 2, … are spread over the gang's 192 workers.]

  76. Vector
 Vector clause:
o The iterations of the following loop are executed in SIMD mode
o Iterations are distributed over the vector lanes within a single worker
o In a parallel construct, no argument is allowed

  #pragma acc parallel num_gangs(128) num_workers(192)
  {
    …
    #pragma acc loop gang
    for(i=0; i < n; i++) {
      #pragma acc loop worker
      for(j=0; j < m; j++) {
        #pragma acc loop vector
        for(k=0; k < l; k++) {
          …
        }
      }
    }
    …
  }
  [Diagram: for each gang and worker, iterations k=0, 1, 2, … execute as vector lanes.]

  77. Gang, Worker, Vector in Kernels Constructs
 The parallelism description is the same as in parallel sections
 However, in kernels constructs these clauses accept an argument to specify the number of gangs, workers or vectors to use
 Every loop can have a different number of gangs, workers or vectors in the same kernels region

  #pragma acc kernels
  {
    …
    #pragma acc loop gang(128)
    for(i=0; i < n; i++) {
      …
    }
    …
    #pragma acc loop gang(64)
    for(j=0; j < m; j++) {
      …
    }
    …
  }
  [Diagram: the i loop is distributed over 128 gangs, the j loop over 64 gangs.]

  78. Data Independency
 In kernels sections, the independent clause specifies that the iterations of the loop are data-independent
 The user does not have to think about gang, worker or vector parameters
 Allows the compiler to generate code that executes the iterations in parallel with no synchronization
 Beware: asserting independence on a loop that carries a dependence, as below, is a programming error

  C (programming error):

  A[0] = 0;
  #pragma acc loop independent
  for(i=1; i<n; i++) {
    A[i] = A[i-1];
  }

  Fortran (programming error):

  A(1) = 0
  !$acc loop independent
  DO i=2,n
    A(i) = A(i-1)
  END DO
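For contrast, a loop where asserting independence would be legitimate (a sketch):

  /* Each iteration writes a distinct element and reads only b,
     so the independent clause is correct here. */
  #pragma acc loop independent
  for (int i = 0; i < n; i++) {
      a[i] = b[i] + 1.0f;
  }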

  79. Sequential Execution
 It is possible to specify sequential loops using the seq clause
 Useful to increase the work per thread, for example

  !$acc loop independent
  DO i=0,n
    !$acc loop seq
    DO j=1,4
      A(j)…
    ENDDO
  ENDDO
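A C sketch of the same idea (assuming a small fixed-size inner loop kept sequential to raise the work per thread):

  /* The outer loop is parallelized; the 4-iteration inner loop
     runs sequentially inside each thread. */
  #pragma acc loop independent
  for (int i = 0; i < n; i++) {
      #pragma acc loop seq
      for (int j = 0; j < 4; j++) {
          a[i][j] = a[i][j] * 2.0f;
      }
  }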
