COMP 633 - Parallel Computing
Lecture 15 – October 1, 2020
Programming Accelerators using Directives
Credits: Introduction to OpenACC and toolkit – Jeff Larkin, Nvidia – Oct 2015
Heterogeneous Parallel Computers
• Composed of
  – CPU(s)
    • low-latency processors optimized for sequential execution
    • large memory size and deep memory hierarchy
  – 1–8 accelerator(s)
    • high-throughput SIMD or MIMD processors optimized for data-parallel execution
    • high-performance local memory with limited size (16–24 GB) and a shallow memory hierarchy
• Example – multisocket compute server
  – host: two sockets with 20–40 Intel Xeon cores and 128–512 GB CC-NUMA shared memory
  – accelerators: 1–8 accelerators (e.g., Nvidia CUDA GPUs) connected via PCIe x16 interfaces (16 GB/s); the host controls data movement to/from accelerator memory
Scaling accelerators and interconnect
• DGX-2 (2018): 16 GPUs with a 300 GB/s full-bisection-bandwidth interconnect
Basic Programming Models
• Offload model
  – idea: offload computational kernels to the accelerator (GPU, Xeon Phi)
    • send data
    • call kernel(s)
    • retrieve data
  – accelerator-specific compiler support
    • CUDA compiler (nvcc) for Nvidia GPUs
    • Intel vectorizing compiler (icc -mmic) for Intel Xeon Phi KNL
      – #pragma offload target(mic:n) in(…) out(…) inout(…) — sketched below
  – accelerator-neutral: OpenCL
    • CUDA-like notation
    • OpenCL compilers can target Nvidia GPUs or Intel Xeon Phi
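Below is a minimal sketch of the offload pattern using the Intel offload pragma named above; the function, array names, and sizes are illustrative, not from the slides.

    // Offload-model sketch (hypothetical example: saxpy_offload, n, a, x, y).
    // The in/inout clauses direct the runtime to send x and y to the
    // coprocessor, run the loop there, and copy y back at region exit.
    void saxpy_offload(int n, float a, float *x, float *y)
    {
        #pragma offload target(mic:0) in(x:length(n)) inout(y:length(n))
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }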
Emerging Programming Models
• Directive model
  – idea: identify sections of code to be compiled for accelerator(s)
    • data transfers and kernel invocations are generated by the compiler
  – accelerator-neutral efforts
    • OpenACC
      – #pragma acc parallel loop
        for (…) { … }
      – gang, worker, vector levels (thread block, warp, warp in SIMT lockstep)
      – gcc 5, PGI, Cray, CAPS, Nvidia compilers
    • OpenMP 4.0
      – directives similar to (but more general than) OpenACC
      – implemented by the gcc 4.9 and icc compilers
  – accelerator-specific compiler support
    • Intel Cilk Plus and C++ compilers for Intel Xeon Phi
(A complete OpenACC example is sketched below.)
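A minimal sketch of the directive model in OpenACC (function and variable names are assumed for illustration): one directive expresses the data movement and kernel launch that the offload model spells out by hand.

    // Directive-model sketch (hypothetical example: saxpy_acc, n, a, x, y).
    // The copyin/copy clauses describe the data movement; the compiler
    // generates the accelerator kernel and its launch.
    void saxpy_acc(int n, float a, float *restrict x, float *restrict y)
    {
        #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }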
Introduction to OpenACC Jeff Larkin, NVIDIA Developer Technologies
Why OpenACC?
Why OpenACC? Success stories:
• University of Illinois, PowerGrid MRI reconstruction: 70x speed-up with 2 days of effort
• RIKEN Japan, NICAM climate modeling: 7–8x speed-up with 5% of code modified
• 8000+ developers using OpenACC – fueling the next wave of scientific discoveries in HPC

The typical pattern – simple, powerful, portable:

    main() {
        <serial code>
        #pragma acc kernels    // automatically runs on GPU
        {
            <parallel code>
        }
    }

References:
http://www.cray.com/sites/default/files/resources/OpenACC_213462.12_OpenACC_Cosmo_CS_FNL.pdf
http://www.hpcwire.com/off-the-wire/first-round-of-2015-hackathons-gets-underway
http://on-demand.gputechconf.com/gtc/2015/presentation/S5297-Hisashi-Yashiro.pdf
http://www.openacc.org/content/experiences-porting-molecular-dynamics-code-gpus-cray-xk7
OpenACC Directives

    #pragma acc data copyin(a,b) copyout(c)    // manage data movement
    {
        ...
        #pragma acc parallel                   // initiate parallel execution
        {
            #pragma acc loop gang vector       // optimize loop mappings
            for (i = 0; i < n; ++i) {
                z[i] = x[i] + y[i];
                ...
            }
        }
        ...
    }

• Incremental • Single source • Interoperable • Performance portable (CPU, GPU, MIC)
Accelerated Computing Fundamentals
Accelerated Computing
10x performance & 5x energy efficiency for HPC
• GPU accelerator: optimized for parallel tasks
• CPU: optimized for serial tasks
What is Heterogeneous Programming?
• Application code is split between the two processors:
  – compute-intensive functions – a few % of the code, a large % of the time – run on the GPU
  – the rest of the sequential code runs on the CPU
Portability & Performance
• Accelerated libraries
  – high performance with little or no code change
  – limited by what libraries are available
• Compiler directives
  – high level: based on existing languages; simple, familiar, portable
  – performance may not be optimal
• Parallel language extensions
  – greater flexibility and control for maximum performance
  – often less portable and more time consuming to implement
Code for Portability & Performance
• Libraries: implement as much as possible using portable libraries
• Directives: use directives for rapid and portable development
• Languages: use lower-level languages for important kernels
OpenACC Programming Cycle
The cycle: Identify Available Parallelism → Express Loop Parallelism → Express Data Movement → Optimize Performance
Example: Jacobi Iteration
• Iteratively converges to the correct value (e.g., temperature) by computing new values at each point from the average of the neighboring points.
• Common, useful algorithm
• Example: solve the Laplace equation in 2D: ∇²f(x,y) = 0
• Jacobi update (5-point stencil over grid A, neighbors A(i-1,j), A(i+1,j), A(i,j-1), A(i,j+1)):

    A_{k+1}(i,j) = ( A_k(i-1,j) + A_k(i+1,j) + A_k(i,j-1) + A_k(i,j+1) ) / 4
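The slide states the update without derivation; a short sketch of where it comes from, using standard central differences (our addition, not on the slide):

    % Discretize \nabla^2 f = 0 with central differences on a grid of spacing h:
    \[
    \frac{A(i{-}1,j) - 2A(i,j) + A(i{+}1,j)}{h^2}
    + \frac{A(i,j{-}1) - 2A(i,j) + A(i,j{+}1)}{h^2} = 0
    \]
    % Solving for A(i,j) yields the Jacobi update: each point becomes
    % the average of its four neighbors.
    \[
    A_{k+1}(i,j) = \tfrac{1}{4}\left( A_k(i{-}1,j) + A_k(i{+}1,j) + A_k(i,j{-}1) + A_k(i,j{+}1) \right)
    \]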
Jacobi Iteration: C Code

    while ( err > tol && iter < iter_max ) {    // iterate until converged
        err = 0.0;

        // iterate across matrix elements
        for ( int j = 1; j < n-1; j++ ) {
            for ( int i = 1; i < m-1; i++ ) {
                // calculate new value from neighbors
                Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1] +
                                      A[j-1][i] + A[j+1][i] );
                // compute max error for convergence
                err = max(err, abs(Anew[j][i] - A[j][i]));
            }
        }

        // swap input/output arrays
        for ( int j = 1; j < n-1; j++ ) {
            for ( int i = 1; i < m-1; i++ ) {
                A[j][i] = Anew[j][i];
            }
        }
        iter++;
    }
The cycle: Identify Available Parallelism → Express Loop Parallelism → Express Data Movement → Optimize Performance
Identify Parallelism

    while ( err > tol && iter < iter_max ) {    // data dependency between iterations
        err = 0.0;

        // independent loop iterations
        for ( int j = 1; j < n-1; j++ ) {
            for ( int i = 1; i < m-1; i++ ) {
                Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1] +
                                      A[j-1][i] + A[j+1][i] );
                err = max(err, abs(Anew[j][i] - A[j][i]));
            }
        }

        // independent loop iterations
        for ( int j = 1; j < n-1; j++ ) {
            for ( int i = 1; i < m-1; i++ ) {
                A[j][i] = Anew[j][i];
            }
        }
        iter++;
    }
The cycle: Identify Available Parallelism → Express Loop Parallelism → Express Data Movement → Optimize Performance
OpenACC kernels Directive
The kernels directive identifies a region that may contain loops that the compiler can turn into parallel kernels. Here the compiler identifies 2 parallel loops and generates 2 kernels:

    #pragma acc kernels
    {
        for (int i = 0; i < N; i++) {    // kernel 1
            x[i] = 1.0;
            y[i] = 2.0;
        }
        for (int i = 0; i < N; i++) {    // kernel 2
            y[i] = a * x[i] + y[i];
        }
    }
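For contrast, a sketch of the same loops written with the parallel loop directive introduced earlier (names N, a, x, y carried over from the example above): with kernels the compiler must prove the loops independent, whereas with parallel loop the programmer asserts it.

    // With "parallel loop" the programmer asserts that iterations are
    // independent; the compiler parallelizes without further analysis.
    #pragma acc parallel loop
    for (int i = 0; i < N; i++) {
        x[i] = 1.0;
        y[i] = 2.0;
    }
    #pragma acc parallel loop
    for (int i = 0; i < N; i++) {
        y[i] = a * x[i] + y[i];
    }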
Parallelize with OpenACC kernels

    while ( err > tol && iter < iter_max ) {
        err = 0.0;

        #pragma acc kernels    // look for parallelism within this region
        {
            for ( int j = 1; j < n-1; j++ ) {
                for ( int i = 1; i < m-1; i++ ) {
                    Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1] +
                                          A[j-1][i] + A[j+1][i] );
                    err = max(err, abs(Anew[j][i] - A[j][i]));
                }
            }
            for ( int j = 1; j < n-1; j++ ) {
                for ( int i = 1; i < m-1; i++ ) {
                    A[j][i] = Anew[j][i];
                }
            }
        }
        iter++;
    }
Building the code

    $ pgcc -fast -ta=tesla -Minfo=all laplace2d.c
    main:
         40, Loop not fused: function call before adjacent loop
             Generated vector sse code for the loop
         51, Loop not vectorized/parallelized: potential early exits
         55, Generating copyout(Anew[1:4094][1:4094])
             Generating copyin(A[:][:])
             Generating copyout(A[1:4094][1:4094])
             Generating Tesla code
         57, Loop is parallelizable
         59, Loop is parallelizable
             Accelerator kernel generated
             57, #pragma acc loop gang /* blockIdx.y */
             59, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
         63, Max reduction generated for error
         67, Loop is parallelizable
         69, Loop is parallelizable
             Accelerator kernel generated
             67, #pragma acc loop gang /* blockIdx.y */
             69, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
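A note on the flags (the command above is the deck's; this variant is our illustration): -ta=tesla selects the GPU target, and -Minfo=all makes the compiler report its decision for every loop. Restricting the report to accelerator decisions is often easier to read:

    $ pgcc -fast -ta=tesla -Minfo=accel laplace2d.c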
Speed-up (higher is better): CPU threads on an Intel Xeon E5-2698 v3 @ 2.30 GHz (Haswell) vs. OpenACC on an NVIDIA Tesla K40

    Single thread   1.00X
    2 threads       1.66X
    4 threads       2.77X
    6 threads       2.91X
    8 threads       3.29X
    OpenACC         0.90X   ← why did OpenACC slow down here?
Very low compute/memcpy ratio:

    compute        5 seconds
    memory copy   62 seconds
[Profiler timeline: 104 ms per iteration, dominated by PCIe copies]
Excessive Data Transfers

    while ( err > tol && iter < iter_max ) {
        err = 0.0;

        // A, Anew resident on host
        #pragma acc kernels    // entry: copy A, Anew to accelerator
        {
            // A, Anew resident on accelerator
            for ( int j = 1; j < n-1; j++ ) {
                for ( int i = 1; i < m-1; i++ ) {
                    Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1] +
                                          A[j-1][i] + A[j+1][i] );
                    err = max(err, abs(Anew[j][i] - A[j][i]));
                }
            }
            ...
        }    // exit: copy A, Anew back to host
        ...
    }

These copies happen on every iteration of the outer while loop!
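The standard remedy (a sketch, assuming the same A, Anew, and loop structure as above) is to hoist the data movement out of the convergence loop with a structured data region, so the arrays stay resident on the accelerator across iterations:

    // copy(A): transfer A in at region entry and out at region exit.
    // create(Anew): allocate Anew on the device with no host transfer.
    #pragma acc data copy(A) create(Anew)
    while ( err > tol && iter < iter_max ) {
        err = 0.0;
        #pragma acc kernels
        {
            // ... same loop nests as above; no per-iteration PCIe copies
        }
        iter++;
    }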