
GPU Computing with OpenACC Directives - PowerPoint PPT Presentation



  1. GPU Computing with OpenACC Directives

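  2. GPUs Reaching Broader Set of Developers. [Figure: the GPU developer base growing over time, from research early adopters in 2004 (100,000s of developers) to the present time (1,000,000s). Domains labeled on the figure: universities, supercomputing centers, oil & gas, defense, weather, climate, plasma physics, CAE, CFD, finance, rendering, data analytics, life sciences.]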

  3. 3 Ways to Accelerate Applications
     • Libraries: "drop-in" acceleration
     • OpenACC directives: easily accelerate applications
     • Programming languages: maximum flexibility

  4. OpenACC Directives

     Simple compiler hints are added to your original Fortran or C code, and the compiler parallelizes the code. Works on many-core GPUs and multicore CPUs.

         Program myscience
           ... serial code ...
         !$acc kernels              ! OpenACC compiler hint
           do k = 1,n1
             do i = 1,n2
               ... parallel code ...
             enddo
           enddo
         !$acc end kernels
           ...
         End Program myscience

  5. Familiar to OpenMP Programmers

     OpenMP (CPU):

         main() {
           double pi = 0.0; long i;

           #pragma omp parallel for reduction(+:pi)
           for (i=0; i<N; i++)
           {
             double t = (double)((i+0.05)/N);
             pi += 4.0/(1.0+t*t);
           }

           printf("pi = %f\n", pi/N);
         }

     OpenACC (CPU + GPU):

         main() {
           double pi = 0.0; long i;

           #pragma acc kernels
           for (i=0; i<N; i++)
           {
             double t = (double)((i+0.05)/N);
             pi += 4.0/(1.0+t*t);
           }

           printf("pi = %f\n", pi/N);
         }

  6. OpenACC: The Standard for GPU Directives
     • Easy: Directives are the easy path to accelerate compute-intensive applications
     • Open: OpenACC is an open GPU directives standard, making GPU programming straightforward and portable across parallel and multi-core processors
     • Powerful: GPU directives allow complete access to the massive parallel power of a GPU

  7. OpenACC Members and Supporters

  8. Directives: Easy & Powerful
     • Real-Time Object Detection (Global Manufacturer of Navigation Systems): 5x in 40 hours
     • Valuation of Stock Portfolios using Monte Carlo (Global Technology Consulting Company): 2x in 4 hours
     • Interaction of Solvents and Biomolecules (University of Texas at San Antonio): 5x in 8 hours

     "Optimizing code with directives is quite easy, especially compared to CPU threads or writing CUDA kernels. The most important thing is avoiding restructuring of existing code for production applications."
     -- Developer at the Global Manufacturer of Navigation Systems

  9. Focus on Exposing Parallelism

     With directives, tuning work focuses on exposing parallelism, which makes codes inherently better.

     Example: application tuning work using directives for the new Titan system at ORNL.

     S3D: research more efficient combustion with next-generation fuels
     • Tuning top 3 kernels (90% of runtime)
     • 3 to 6x faster on CPU+GPU vs. CPU+CPU
     • But also improved all-CPU version by 50%

     CAM-SE: answer questions about specific climate change adaptation and mitigation scenarios
     • Tuning top key kernel (50% of runtime)
     • 6.5x faster on CPU+GPU vs. CPU+CPU
     • Improved performance of CPU version by 100%

  10. A Very Simple Exercise: SAXPY

      SAXPY in C:

          void saxpy(int n,
                     float a,
                     float *x,
                     float *restrict y)
          {
          #pragma acc kernels
            for (int i = 0; i < n; ++i)
              y[i] = a*x[i] + y[i];
          }

          ...
          // Perform SAXPY on 1M elements
          saxpy(1<<20, 2.0, x, y);
          ...

      SAXPY in Fortran:

          subroutine saxpy(n, a, x, y)
            real :: x(:), y(:), a
            integer :: n, i
          !$acc kernels
            do i=1,n
              y(i) = a*x(i)+y(i)
            enddo
          !$acc end kernels
          end subroutine saxpy

          ...
          ! Perform SAXPY on 1M elements
          call saxpy(2**20, 2.0, x_d, y_d)
          ...

  11. Directive Syntax

      Fortran:

          !$acc directive [clause [[,] clause] ...]

      Often paired with a matching end directive surrounding a structured code block:

          !$acc end directive

      C:

          #pragma acc directive [clause [[,] clause] ...]

      Often followed by a structured code block.
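
As a hedged illustration of the C form, the sketch below attaches a `kernels` directive with two data clauses to a loop. The function name `scale_add` and the choice of the `copyin` and `copyout` clauses are assumptions made for this example and do not come from the slides. A compiler without OpenACC support ignores the pragma and runs the loop serially.

```c
/* Hypothetical helper (not from the slides): y[i] = a * x[i] + b. */
void scale_add(int n, float a, float b, const float *x, float *restrict y)
{
    /* directive: kernels
       clauses:   copyin(...)  copies x to the device before the region,
                  copyout(...) copies y back to the host after the region */
    #pragma acc kernels copyin(x[0:n]) copyout(y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + b;
}
```

Here the `for` loop is the structured block that the directive applies to.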

  12. kernels: Your first OpenACC Directive

      Each loop is executed as a separate kernel on the GPU. (Kernel: a parallel function that runs on the GPU.)

          !$acc kernels
            do i=1,n              ! kernel 1
              a(i) = 0.0
              b(i) = 1.0
              c(i) = 2.0
            end do

            do i=1,n              ! kernel 2
              a(i) = b(i) + c(i)
            end do
          !$acc end kernels
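
The same two-loop pattern can be sketched in C. This is an illustrative counterpart under assumptions, not code from the slides: the function name `init_and_add` is made up, and whether each loop really becomes its own GPU kernel depends on the compiler.

```c
/* Hypothetical C counterpart of the Fortran example: two loops inside one
   kernels region. An OpenACC compiler may generate one kernel per loop. */
void init_and_add(int n, float *restrict a, float *restrict b, float *restrict c)
{
    #pragma acc kernels
    {
        for (int i = 0; i < n; ++i) {   /* candidate for kernel 1 */
            a[i] = 0.0f;
            b[i] = 1.0f;
            c[i] = 2.0f;
        }

        for (int i = 0; i < n; ++i) {   /* candidate for kernel 2 */
            a[i] = b[i] + c[i];
        }
    }
}
```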

  13. Kernels Construct

      Fortran:

          !$acc kernels [clause ...]
            structured block
          !$acc end kernels

      C:

          #pragma acc kernels [clause ...]
          { structured block }

      Clauses:
      • if( condition )
      • async( expression )
      • Also, any data clause (more later)
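
A minimal sketch of the `if` clause, assuming a made-up size threshold: `OFFLOAD_MIN_N` and `vec_double` are hypothetical names, and the value 10000 is arbitrary rather than a recommendation.

```c
/* Hypothetical threshold below which offloading is assumed not to pay off. */
#define OFFLOAD_MIN_N 10000

void vec_double(int n, float *restrict v)
{
    /* if(condition): the region runs on the accelerator only when the
       condition is true; otherwise it executes on the host. */
    #pragma acc kernels if(n >= OFFLOAD_MIN_N)
    for (int i = 0; i < n; ++i)
        v[i] = 2.0f * v[i];
}
```

The result is the same on either path; the clause only chooses where the loop runs.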

  14. C tip: the restrict keyword
      • Declaration of intent given by the programmer to the compiler
      • Applied to a pointer, e.g. float *restrict ptr
      • Meaning: "for the lifetime of ptr, only it or a value directly derived from it (such as ptr + 1) will be used to access the object to which it points" *
      • Limits the effects of pointer aliasing
      • OpenACC compilers often require restrict to determine independence; otherwise the compiler can't parallelize loops that access ptr
      • Note: if the programmer violates the declaration, behavior is undefined

      * http://en.wikipedia.org/wiki/Restrict
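
To make the promise concrete, here is a small sketch in which both pointers are declared `restrict`. The function `add_one` is a hypothetical example, not part of the slides.

```c
/* src and dst are both restrict-qualified: the caller promises that the
   two arrays do not overlap, so the iterations can be treated as independent. */
void add_one(int n, const float *restrict src, float *restrict dst)
{
    #pragma acc kernels
    for (int i = 0; i < n; ++i)
        dst[i] = src[i] + 1.0f;
}
```

Calling it with two separate arrays is fine. A call such as `add_one(n - 1, buf, buf + 1)`, where the ranges overlap, would break the promise and leave the behavior undefined.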

  15. Complete SAXPY example code
      • Trivial first example
      • Apply a loop directive
      • Learn compiler commands

          #include <stdlib.h>

          /* *restrict: "I promise y does not alias x" */
          void saxpy(int n,
                     float a,
                     float *x,
                     float *restrict y)
          {
          #pragma acc kernels
            for (int i = 0; i < n; ++i)
              y[i] = a * x[i] + y[i];
          }

          int main(int argc, char **argv)
          {
            int N = 1<<20; // 1 million floats

            if (argc > 1)
              N = atoi(argv[1]);

            float *x = (float*)malloc(N * sizeof(float));
            float *y = (float*)malloc(N * sizeof(float));

            for (int i = 0; i < N; ++i) {
              x[i] = 2.0f;
              y[i] = 1.0f;
            }

            saxpy(N, 3.0f, x, y);

            return 0;
          }

  16. Compile and run

      C:

          pgcc -acc -ta=nvidia -Minfo=accel -o saxpy_acc saxpy.c

      Fortran:

          pgf90 -acc -ta=nvidia -Minfo=accel -o saxpy_acc saxpy.f90

      Compiler output:

          pgcc -acc -Minfo=accel -ta=nvidia -o saxpy_acc saxpy.c
          saxpy:
                8, Generating copyin(x[:n-1])
                   Generating copy(y[:n-1])
                   Generating compute capability 1.0 binary
                   Generating compute capability 2.0 binary
                9, Loop is parallelizable
                   Accelerator kernel generated
                    9, #pragma acc loop worker, vector(256) /* blockIdx.x threadIdx.x */
                       CC 1.0 : 4 registers; 52 shared, 4 constant, 0 local memory bytes; 100% occupancy
                       CC 2.0 : 8 registers; 4 shared, 64 constant, 0 local memory bytes; 100% occupancy

  17. Example: Jacobi Iteration

      Iteratively converges to the correct value (e.g. temperature) by computing new values at each point from the average of neighboring points. A common, useful algorithm.

      Example: solve the Laplace equation in 2D, $\nabla^2 f(x,y) = 0$:

      $$A_{k+1}(i,j) = \frac{A_k(i-1,j) + A_k(i+1,j) + A_k(i,j-1) + A_k(i,j+1)}{4}$$

      [Figure: five-point stencil showing A(i,j) and its neighbors A(i-1,j), A(i+1,j), A(i,j-1), A(i,j+1).]
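
As a quick worked instance of the update rule, with neighbor values made up purely for illustration: suppose that at iteration $k$ the four neighbors of point $(i,j)$ hold $A_k(i-1,j) = 1$, $A_k(i+1,j) = 2$, $A_k(i,j-1) = 3$ and $A_k(i,j+1) = 4$. Then the new value at the point is

```latex
A_{k+1}(i,j) = \frac{1 + 2 + 3 + 4}{4} = 2.5
```

The old value $A_k(i,j)$ does not enter the average; it is only used to measure how much the point changed.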

  18. Jacobi Iteration C Code

          /* fmax and fabs come from <math.h>; plain abs() would truncate doubles to int */
          while ( error > tol && iter < iter_max ) {      /* iterate until converged */
            error = 0.0;

            for ( int j = 1; j < n-1; j++ ) {             /* iterate across matrix elements */
              for ( int i = 1; i < m-1; i++ ) {

                /* calculate new value from neighbors */
                Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                                     A[j-1][i] + A[j+1][i]);

                /* compute max error for convergence */
                error = fmax(error, fabs(Anew[j][i] - A[j][i]));
              }
            }

            for ( int j = 1; j < n-1; j++ ) {             /* swap input/output arrays */
              for ( int i = 1; i < m-1; i++ ) {
                A[j][i] = Anew[j][i];
              }
            }

            iter++;
          }
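
The loop above leaves out its declarations and setup, so the following is a minimal serial sketch of how it could be completed. Everything beyond the loop structure is an assumption for illustration: the 16x16 grid size, the boundary condition (top row fixed at 1.0, all else 0.0) and the function name `jacobi_solve` are not from the slides. The maximum and absolute value are written out by hand so the sketch needs no math library.

```c
#define N 16   /* grid rows (hypothetical size) */
#define M 16   /* grid columns (hypothetical size) */

static double A[N][M];
static double Anew[N][M];

/* Runs Jacobi sweeps until the largest change is at most tol or iter_max
   sweeps have been done. Returns the number of sweeps performed and stores
   the last measured error in *final_error. */
int jacobi_solve(double tol, int iter_max, double *final_error)
{
    /* Assumed boundary condition: top row held at 1.0, everything else 0.0. */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < M; i++)
            A[j][i] = Anew[j][i] = (j == 0) ? 1.0 : 0.0;

    double error = tol + 1.0;   /* force at least one sweep when iter_max > 0 */
    int iter = 0;

    while (error > tol && iter < iter_max) {
        error = 0.0;

        /* New value at each interior point is the average of its neighbors. */
        for (int j = 1; j < N - 1; j++) {
            for (int i = 1; i < M - 1; i++) {
                Anew[j][i] = 0.25 * (A[j][i + 1] + A[j][i - 1] +
                                     A[j - 1][i] + A[j + 1][i]);
                double diff = Anew[j][i] - A[j][i];
                if (diff < 0.0) diff = -diff;
                if (diff > error) error = diff;
            }
        }

        /* Copy the new interior values back for the next sweep. */
        for (int j = 1; j < N - 1; j++)
            for (int i = 1; i < M - 1; i++)
                A[j][i] = Anew[j][i];

        iter++;
    }

    *final_error = error;
    return iter;
}
```

A call such as `jacobi_solve(1e-6, 10000, &err)` returns the sweep count. On a grid this small it should converge well before the iteration cap.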
