GPU COMPUTING WITH OPENACC: 3 WAYS TO ACCELERATE APPLICATIONS


  1. GPU COMPUTING WITH OPENACC

  2. 3 WAYS TO ACCELERATE APPLICATIONS
     - Libraries: "drop-in" acceleration
     - OpenACC Directives: easily accelerate applications
     - Programming Languages: maximum flexibility

  3. OPENACC DIRECTIVES
     Simple compiler hints added to your original Fortran or C code. The compiler
     parallelizes the marked region. OpenACC works on many-core GPUs and multicore CPUs.

       Program myscience
         ... serial code ...            ! runs on the CPU
       !$acc kernels                    ! OpenACC compiler hint
         do k = 1,n1
           do i = 1,n2
             ... parallel code ...      ! runs on the GPU
           enddo
         enddo
       !$acc end kernels
         ...
       End Program myscience

  4. FAMILIAR TO OPENMP PROGRAMMERS
     The same loop, parallelized with OpenMP (CPU) and with OpenACC (CPU or GPU);
     only the directive changes.

     OpenMP:
       main() {
         double pi = 0.0; long i;
         #pragma omp parallel for reduction(+:pi)
         for (i = 0; i < N; i++) {
           double t = (double)((i + 0.5) / N);
           pi += 4.0 / (1.0 + t * t);
         }
         printf("pi = %f\n", pi / N);
       }

     OpenACC:
       main() {
         double pi = 0.0; long i;
         #pragma acc kernels
         for (i = 0; i < N; i++) {
           double t = (double)((i + 0.5) / N);
           pi += 4.0 / (1.0 + t * t);
         }
         printf("pi = %f\n", pi / N);
       }

  5. OPENACC MEMBERS AND SUPPORTERS

  6. DIRECTIVES: EASY & POWERFUL
     Three real-world results:
       - Real-Time Object Detection (Global Manufacturer of Navigation Systems): 5x in 40 hours
       - Valuation of Stock Portfolios using Monte Carlo (Global Technology Consulting Company): 2x in 4 hours
       - Interaction of Solvents and Biomolecules (University of Texas at San Antonio): 5x in 8 hours

     "Optimizing code with directives is quite easy, especially compared to CPU threads
     or writing CUDA kernels. The most important thing is avoiding restructuring of
     existing code for production applications."
       -- Developer at the Global Manufacturer of Navigation Systems

  7. FOCUS ON EXPOSING PARALLELISM
     With directives, tuning work focuses on exposing parallelism, which makes codes
     inherently better. Example: application tuning work using directives for the new
     Titan system at ORNL.

     CAM-SE (answer questions about specific climate change adaptation and mitigation scenarios):
       - Tuned top 3 kernels (90% of runtime)
       - 3 to 6x faster on CPU+GPU vs. CPU+CPU
       - But also improved the all-CPU version by 50%

     S3D (research more efficient combustion with next-generation fuels):
       - Tuned top key kernel (50% of runtime)
       - 6.5x faster on CPU+GPU vs. CPU+CPU
       - Improved performance of the CPU version by 100%

  8. A VERY SIMPLE EXERCISE: SAXPY

     SAXPY in C:

       void saxpy(int n,
                  float a,
                  float *x,
                  float *restrict y)
       {
       #pragma acc kernels
         for (int i = 0; i < n; ++i)
           y[i] = a*x[i] + y[i];
       }

       ...
       // Perform SAXPY on 1M elements
       saxpy(1<<20, 2.0, x, y);
       ...

     SAXPY in Fortran:

       subroutine saxpy(n, a, x, y)
         real :: x(:), y(:), a
         integer :: n, i
       !$acc kernels
         do i=1,n
           y(i) = a*x(i)+y(i)
         enddo
       !$acc end kernels
       end subroutine saxpy

       ...
       ! Perform SAXPY on 1M elements
       call saxpy(2**20, 2.0, x_d, y_d)
       ...

  9. DIRECTIVE SYNTAX
     Fortran:
       !$acc directive [clause [,] clause] ...]
     Often paired with a matching end directive surrounding a structured code block:
       !$acc end directive

     C:
       #pragma acc directive [clause [,] clause] ...]
     Often followed by a structured code block

  10. KERNELS: YOUR FIRST OPENACC DIRECTIVE
      Each loop is executed as a separate kernel on the GPU.
      (Kernel: a parallel function that runs on the GPU.)

        !$acc kernels
          do i=1,n
            a(i) = 0.0          ! kernel 1
            b(i) = 1.0
            c(i) = 2.0
          end do

          do i=1,n
            a(i) = b(i) + c(i)  ! kernel 2
          end do
        !$acc end kernels

  11. KERNELS CONSTRUCT
      Fortran:
        !$acc kernels [clause ...]
          structured block
        !$acc end kernels

      C:
        #pragma acc kernels [clause ...]
        { structured block }

      Clauses:
        if( condition )
        async( expression )
        Also, any data clause (more later)

  12. COMPLETE SAXPY EXAMPLE CODE
      Trivial first example: apply a loop directive and learn the compiler commands.
      *restrict means "I promise y does not alias x".

        #include <stdlib.h>

        void saxpy(int n,
                   float a,
                   float *x,
                   float *restrict y)
        {
        #pragma acc kernels
          for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
        }

        int main(int argc, char **argv)
        {
          int N = 1<<20; // 1 million floats
          if (argc > 1)
            N = atoi(argv[1]);

          float *x = (float*)malloc(N * sizeof(float));
          float *y = (float*)malloc(N * sizeof(float));

          for (int i = 0; i < N; ++i) {
            x[i] = 2.0f;
            y[i] = 1.0f;
          }

          saxpy(N, 3.0f, x, y);

          return 0;
        }

  13. COMPILE AND RUN
      C:       pgcc  -acc -ta=nvidia -Minfo=accel -o saxpy_acc saxpy.c
      Fortran: pgf90 -acc -ta=nvidia -Minfo=accel -o saxpy_acc saxpy.f90

      Compiler output:
        pgcc -acc -Minfo=accel -ta=nvidia -o saxpy_acc saxpy.c
        saxpy:
              8, Generating copyin(x[:n-1])
                 Generating copy(y[:n-1])
                 Generating compute capability 1.0 binary
                 Generating compute capability 2.0 binary
              9, Loop is parallelizable
                 Accelerator kernel generated
                  9, #pragma acc loop worker, vector(256) /* blockIdx.x threadIdx.x */
                 CC 1.0 : 4 registers; 52 shared, 4 constant, 0 local memory bytes; 100% occupancy
                 CC 2.0 : 8 registers; 4 shared, 64 constant, 0 local memory bytes; 100% occupancy

  14. EXAMPLE: JACOBI ITERATION
      Iteratively converges to the correct value (e.g. temperature) by computing new
      values at each point from the average of the neighboring points. A common,
      useful algorithm. Example: solve the Laplace equation in 2D, \nabla^2 f(x,y) = 0:

        A_{k+1}(i,j) = ( A_k(i-1,j) + A_k(i+1,j) + A_k(i,j-1) + A_k(i,j+1) ) / 4

      Each point A(i,j) is updated from its four neighbors A(i-1,j), A(i+1,j),
      A(i,j-1), and A(i,j+1).

  15. JACOBI ITERATION C CODE
        while ( error > tol && iter < iter_max ) {   // iterate until converged
          error = 0.0;

          for (int j = 1; j < n-1; j++) {            // iterate across matrix elements
            for (int i = 1; i < m-1; i++) {
              // calculate new value from neighbors
              Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                                   A[j-1][i] + A[j+1][i]);
              // compute max error for convergence
              error = max(error, abs(Anew[j][i] - A[j][i]));
            }
          }

          for (int j = 1; j < n-1; j++) {            // swap input/output arrays
            for (int i = 1; i < m-1; i++) {
              A[j][i] = Anew[j][i];
            }
          }

          iter++;
        }

  16. JACOBI ITERATION FORTRAN CODE
        do while ( err > tol .and. iter < iter_max )   ! iterate until converged
          err = 0._fp_kind

          do j=1,m                                     ! iterate across matrix elements
            do i=1,n
              ! calculate new value from neighbors
              Anew(i,j) = .25_fp_kind * (A(i+1,j) + A(i-1,j) + &
                                         A(i,j-1) + A(i,j+1))
              ! compute max error for convergence
              err = max(err, Anew(i,j) - A(i,j))
            end do
          end do

          do j=1,m-2                                   ! swap input/output arrays
            do i=1,n-2
              A(i,j) = Anew(i,j)
            end do
          end do

          iter = iter + 1
        end do

  17. EXERCISES
      General instructions (compiling)
      Exercises are in the "exercises/openacc" directory
      Solutions are in the "exercise_solutions/openacc" directory

        module load pgi/14.6

      To compile, use one of the provided makefiles
        C:       > make
        Fortran: > make -f Makefile_f90

      Remember these flags: -acc -ta=nvidia -Minfo=accel

  18. EXERCISES
      General instructions (running)
      To run, use sbatch with one of the provided job files

        > sbatch runit.acc
        > qstat -u <username>   # prints job status

      Output is placed in slurm.* when finished.

  19. EXERCISE 1: JACOBI KERNELS
      Task: use acc kernels to parallelize the Jacobi loop nests
        - Edit laplace2D.c or laplace2D.f90 (your choice) in the 001-laplace2D-kernels directory
        - Add directives where they help
        - Figure out the proper compilation flags to use
        - Optionally: run the OpenACC version with laplace_acc

      Q: Can you get a speedup with just kernels directives?
         Versus 1 CPU core? Versus 6 CPU cores?
