GPU WORKSHOP University of Maryland
AGENDA
1. Intro to GPU Computing
2. OpenACC with hands-on
3. CUDA C/C++ with hands-on
Parallel programming — Why do you care?
The world IS parallel
Accelerator Programming — Why do you care?
HPC's Biggest Challenge: Power
Power of a 300-petaflop CPU-only supercomputer = power for the city of San Francisco
UNPRECEDENTED VALUE TO SCIENTIFIC COMPUTING
AMBER Molecular Dynamics Simulation, DHFR NVE Benchmark:
64 Sandy Bridge CPUs: 58 ns/day
1 Tesla K40 GPU: 102 ns/day
3 WAYS TO ACCELERATE APPLICATIONS
Libraries: "Drop-in" Acceleration
OpenACC Directives: Easily Accelerate Applications
Programming Languages: Maximum Flexibility
GPU-ACCELERATED LIBRARIES
"Drop-in" Acceleration for your Applications
Linear Algebra (FFT, BLAS, SPARSE, Matrix): NVIDIA cuFFT, cuBLAS, cuSPARSE
Numerical & Math (RAND, Statistics): NVIDIA cuRAND, NVIDIA Math Lib
Data Structures & AI (Sort, Scan, Zero Sum): GPU AI – Path Finding, GPU AI – Board Games
Visual Processing (Image & Video): NVIDIA NPP, NVIDIA Video Encode
OPENACC DIRECTIVES
Simple compiler hints: the compiler parallelizes the code.
Works on many-core GPUs and multicore CPUs.
Your original Fortran or C code, with OpenACC hints added:

    Program myscience
      ... serial code ...
      !$acc kernels
      do k = 1,n1
        do i = 1,n2
          ... parallel code ...
        enddo
      enddo
      !$acc end kernels
      ...
    End Program myscience
FAMILIAR TO OPENMP PROGRAMMERS

OpenMP (CPU):

    main() {
      double pi = 0.0; long i;
      #pragma omp parallel for reduction(+:pi)
      for (i = 0; i < N; i++) {
        double t = (double)((i + 0.5) / N);
        pi += 4.0 / (1.0 + t*t);
      }
      printf("pi = %f\n", pi/N);
    }

OpenACC (CPU/GPU):

    main() {
      double pi = 0.0; long i;
      #pragma acc kernels
      for (i = 0; i < N; i++) {
        double t = (double)((i + 0.5) / N);
        pi += 4.0 / (1.0 + t*t);
      }
      printf("pi = %f\n", pi/N);
    }
DIRECTIVES: EASY & POWERFUL
Real-Time Object Detection (Global Manufacturer of Navigation Systems): 5x in 40 Hours
Valuation of Stock Portfolios using Monte Carlo (Global Technology Consulting Company): 2x in 4 Hours
Interaction of Solvents and Biomolecules (University of Texas at San Antonio): 5x in 8 Hours

"Optimizing code with directives is quite easy, especially compared to CPU threads or writing CUDA kernels. The most important thing is avoiding restructuring of existing code for production applications."
-- Developer at the Global Manufacturer of Navigation Systems
A VERY SIMPLE EXERCISE: SAXPY

SAXPY in C:

    void saxpy(int n, float a, float *x, float *restrict y)
    {
      #pragma acc kernels
      for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
    }
    ...
    // Perform SAXPY on 1M elements
    saxpy(1<<20, 2.0, x, y);
    ...

SAXPY in Fortran:

    subroutine saxpy(n, a, x, y)
      real :: x(:), y(:), a
      integer :: n, i
      !$acc kernels
      do i = 1, n
        y(i) = a*x(i) + y(i)
      enddo
      !$acc end kernels
    end subroutine saxpy
    ...
    ! Perform SAXPY on 1M elements
    call saxpy(2**20, 2.0, x_d, y_d)
    ...
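For reference, a minimal, self-contained driver for the C version might look like the sketch below; the file name, array sizes, initial values, and compile line are assumptions added for illustration and are not part of the slide.

    /* saxpy_acc.c — minimal host driver for the OpenACC SAXPY above (a sketch) */
    #include <stdio.h>
    #include <stdlib.h>

    void saxpy(int n, float a, float *x, float *restrict y)
    {
      #pragma acc kernels
      for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
    }

    int main(void)
    {
      int n = 1<<20;                                   /* 1M elements */
      float *x = (float*)malloc(n * sizeof(float));
      float *y = (float*)malloc(n * sizeof(float));
      for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

      saxpy(n, 2.0f, x, y);                            /* y = 2*x + y, so every y[i] should be 4.0 */

      printf("y[0] = %f\n", y[0]);
      free(x); free(y);
      return 0;
    }

With the PGI/NVIDIA HPC compilers this would typically be built with something like pgcc -acc -Minfo=accel saxpy_acc.c (or nvc -acc), which reports how the loop was parallelized.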
GPU Architecture
GPU ARCHITECTURE
Two Main Components:

Global memory
- Analogous to RAM in a CPU server
- Accessible by both GPU and CPU
- Currently up to 12 GB
- ECC on/off options for Quadro and Tesla products

Streaming Multiprocessors (SMs)
- Perform the actual computation
- Each SM has its own control units, registers, execution pipelines, and caches
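As an aside (not part of the original slide), the figures above can be read at run time through the CUDA runtime API; the sketch below only prints the properties this slide mentions.

    // device_query.cu — print a few of the properties discussed above (a sketch)
    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void)
    {
      cudaDeviceProp prop;
      cudaGetDeviceProperties(&prop, 0);       // properties of device 0

      printf("Device:               %s\n",     prop.name);
      printf("Global memory:        %zu MB\n", prop.totalGlobalMem >> 20);
      printf("SM count:             %d\n",     prop.multiProcessorCount);
      printf("Shared mem per block: %zu KB\n", prop.sharedMemPerBlock >> 10);
      printf("ECC enabled:          %d\n",     prop.ECCEnabled);
      return 0;
    }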
GPU ARCHITECTURE
Streaming Multiprocessor (SM)
- Many CUDA cores per SM (architecture dependent)
- Special-function units (cos/sin/tan, etc.)
- Shared memory + L1 cache
- Thousands of 32-bit registers
[SM block diagram: instruction cache, schedulers and dispatch units, register file, CUDA cores, 16 load/store units, 4 special-function units, interconnect network, 64 KB configurable shared memory/L1 cache, uniform cache]
GPU ARCHITECTURE
CUDA Core
- Floating-point & integer unit
  - IEEE 754-2008 floating-point standard
  - Fused multiply-add (FMA) instruction for both single and double precision
- Logic unit
- Move, compare unit
- Branch unit
[CUDA core diagram: dispatch port, operand collector, FP unit, INT unit, result queue]
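As a small illustrative aside (not from the slides), the FMA capability mentioned above is visible in device code through the standard fmaf()/fma() math functions, and the compiler will also generate FMA instructions automatically for expressions of the form a*b + c.

    // fma_demo.cu — FMA in device code (a sketch)
    __global__ void fma_demo(float a, float b, float c, float *out)
    {
      // a*b + c computed with a single rounding step (fused multiply-add)
      *out = fmaf(a, b, c);
    }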
GPU MEMORY HIERARCHY REVIEW
[Diagram: each SM (SM-0 ... SM-N) has its own registers, shared memory (SMEM), and L1 cache; all SMs share a unified L2 cache, which is backed by global memory]
GPU ARCHITECTURE
Memory system on each SM
- Extremely fast, but small (tens of KB)
- Programmer chooses whether to use this on-chip memory as L1 cache or as shared memory
- L1 cache
  - Hardware-managed; used for things like register spilling
  - Should NOT be treated like a CPU cache
- Shared memory — programmer MUST synchronize data accesses!!!
  - User-managed scratch pad
  - Useful for repeated access to the same data, or when multiple threads need the same data
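To make the synchronization point concrete, here is a minimal illustrative kernel (an assumption for this sketch, not from the slides): each block stages its elements in shared memory, and because threads then read elements written by other threads, a __syncthreads() barrier is required between the write and the read.

    // block_reverse.cu — reverse each block's elements using shared memory (a sketch)
    #define BLOCK 256

    __global__ void block_reverse(const float *in, float *out, int n)
    {
      __shared__ float tile[BLOCK];                    // user-managed scratch pad, one tile per block

      int i = blockIdx.x * blockDim.x + threadIdx.x;   // assumes n is a multiple of BLOCK
      tile[threadIdx.x] = in[i];                       // each thread stages one element

      __syncthreads();                                 // all writes must finish before any thread reads

      out[i] = tile[blockDim.x - 1 - threadIdx.x];     // read an element written by another thread
    }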
GPU ARCHITECTURE
Memory system on each GPU board
- Unified L2 cache (hundreds of KB)
  - Fast, coherent data sharing across all cores in the GPU
- ECC protection
  - DRAM: ECC supported for GDDR5 memory
  - All major internal memories are ECC protected: register file, L1 cache, L2 cache
CUDA Programming Model
ANATOMY OF A CUDA C/C++ APPLICATION
- Serial code executes in a Host (CPU) thread
- Parallel code executes in many Device (GPU) threads across multiple processing elements
[Diagram: a CUDA C/C++ application alternates between serial code on the Host (CPU) and parallel code on the Device (GPU)]
CUDA C: C WITH A FEW KEYWORDS

Standard C code:

    void saxpy_serial(int n, float a, float *x, float *y)
    {
      for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
    }
    // Invoke serial SAXPY kernel
    saxpy_serial(n, 2.0, x, y);

Parallel C code:

    __global__ void saxpy_parallel(int n, float a, float *x, float *y)
    {
      int i = blockIdx.x*blockDim.x + threadIdx.x;
      if (i < n) y[i] = a*x[i] + y[i];
    }
    // Invoke parallel SAXPY kernel with 256 threads/block
    int nblocks = (n + 255) / 256;
    saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
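The slide shows only the kernel and its launch; for completeness, a minimal host-side driver might look like the sketch below. The array sizes, initial values, and file name are assumptions added for illustration, not part of the original slide.

    // saxpy.cu — minimal host driver for saxpy_parallel above (a sketch)
    #include <stdio.h>
    #include <cuda_runtime.h>

    __global__ void saxpy_parallel(int n, float a, float *x, float *y)
    {
      int i = blockIdx.x*blockDim.x + threadIdx.x;
      if (i < n) y[i] = a*x[i] + y[i];
    }

    int main(void)
    {
      int n = 1<<20;                                   // 1M elements
      size_t bytes = n * sizeof(float);

      float *x = (float*)malloc(bytes), *y = (float*)malloc(bytes);
      for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

      float *d_x, *d_y;                                // device copies in global memory
      cudaMalloc((void**)&d_x, bytes);
      cudaMalloc((void**)&d_y, bytes);
      cudaMemcpy(d_x, x, bytes, cudaMemcpyHostToDevice);
      cudaMemcpy(d_y, y, bytes, cudaMemcpyHostToDevice);

      int nblocks = (n + 255) / 256;
      saxpy_parallel<<<nblocks, 256>>>(n, 2.0f, d_x, d_y);

      cudaMemcpy(y, d_y, bytes, cudaMemcpyDeviceToHost);
      printf("y[0] = %f\n", y[0]);                     // expect 4.0

      cudaFree(d_x); cudaFree(d_y);
      free(x); free(y);
      return 0;
    }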
CUDA KERNELS
- Parallel portion of the application: executed as a kernel
- The entire GPU executes the kernel, with many threads
- CUDA threads:
  - Lightweight
  - Fast switching
  - 1000s execute simultaneously
- CPU (Host): executes functions
- GPU (Device): executes kernels
CUDA KERNELS: PARALLEL THREADS
A kernel is a function executed on the GPU as an array of threads in parallel:

    float x = input[threadIdx.x];
    float y = func(x);
    output[threadIdx.x] = y;

- All threads execute the same code, but can take different paths
- Each thread has an ID, used to:
  - Select input/output data
  - Make control decisions
CUDA Kernels: Subdivide into Blocks
- Threads are grouped into blocks
- Blocks are grouped into a grid
- A kernel is executed as a grid of blocks of threads
[Diagram: grid of thread blocks mapped onto the GPU]
Kernel Execution
- Each CUDA thread is executed by a CUDA core
- Each CUDA thread block is executed by one Streaming Multiprocessor (SM) and does not migrate; several concurrent blocks can reside on one SM, depending on the blocks' memory requirements and the SM's memory resources
- Each CUDA kernel grid is executed on one CUDA-enabled GPU (device); multiple kernels can execute on a device at one time
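To tie the thread/block/grid hierarchy to actual launch syntax, the sketch below shows how a 2D grid of blocks can be configured with dim3; the kernel, image size, and 16x16 block shape are illustrative assumptions, not from the slides.

    // launch_config.cu — sketch of a 2D grid/block launch configuration (illustrative)
    #include <cuda_runtime.h>

    __global__ void scale_image(float *img, int width, int height, float s)
    {
      int x = blockIdx.x * blockDim.x + threadIdx.x;   // column index within the grid
      int y = blockIdx.y * blockDim.y + threadIdx.y;   // row index within the grid
      if (x < width && y < height)
        img[y * width + x] *= s;                       // one thread per pixel
    }

    void launch_scale(float *d_img, int width, int height)
    {
      dim3 block(16, 16);                              // 256 threads per block
      dim3 grid((width  + block.x - 1) / block.x,      // enough blocks to cover the image
                (height + block.y - 1) / block.y);
      scale_image<<<grid, block>>>(d_img, width, height, 2.0f);
    }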