Motivation to Learn GPGPU (Julius Parulek)

  1. Motivation to Learn GPGPU Julius Parulek

  2. Why Learn About the GPU? Computational power of the GPU vs. the CPU

  3. Why Learn About the GPU? Relative performance of NVIDIA GPUs

  4. Why Learn About the GPU? Hardware

  5. Why Learn About the GPU? Interactive rendering delivers almost offline-rendering quality

  6. Why Learn About the GPU? GPU programming makes it possible to parallelize computation via a data-parallel streaming strategy: SIMT (single instruction, multiple threads), with 10,000s of simultaneously active threads. Examples: OpenCL, CUDA. Direct graphical enhancements: OpenGL / DirectX shaders (GLSL, HLSL, Cg), replaceable state machines on the GPU.

  7. Why Learn About the GPU? Application fields:
      - Force-field simulations: particle systems, molecular simulations, graph drawing
      - Voronoi diagrams: data analysis, motion planning, geometric optimization
      - Sorting and searching: database and range queries
      - Matrix multiplication: physical simulation
      - FFT: signal processing
      - Visualization: volume rendering, raycasting

  8. Why Learn About the GPU? Application fields (same list as on the previous slide), with the overall aim: real-time and interactive visualization of complex phenomena!

  9. Introduction to CUDA Julius Parulek

  10. CUDA: Why learn CUDA?
      - Random and unlimited access to memory: read/write whenever necessary, cooperation through shared memory
      - Low learning curve: few extensions to C, no graphics overhead, quick implementation
      - Focus on parallelization rather than on programming: a pen and paper to analyze algorithms, then a number of blocks of a number of threads

  11. CUDA = a serial program with parallel kernels. Differences between CUDA and CPU threads:
      - CUDA threads are extremely lightweight: very little creation overhead, instant switching
      - CUDA uses 1000s of threads to achieve efficiency; multi-core CPUs can use only a few
      - On NVIDIA GPUs, CUDA threads are physical threads

  12. CUDA kernel: a simple C program. Kernels are executed by threads; thousands of threads execute the same kernel, which is the source of parallelization. Each thread has its own threadID.

  13. CUDA: Threads are organized into blocks; threads in a block can synchronize (a parallel task). Blocks are grouped into a grid; blocks are independent (though they might coordinate). A small sketch of this organization follows below.
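
      As a minimal sketch of this organization (the kernel name, image size, and launch configuration are illustrative assumptions, not from the slides), a 2D grid of 2D blocks can cover an image, with each thread computing its own global coordinates:

      // Hypothetical kernel: each thread processes one pixel of a W x H image.
      __global__ void processImage(float *img, int W, int H)
      {
          int x = blockIdx.x * blockDim.x + threadIdx.x;   // global column
          int y = blockIdx.y * blockDim.y + threadIdx.y;   // global row
          if (x < W && y < H)
              img[y * W + x] *= 2.0f;                      // per-thread work
      }

      // Launch: 16x16 threads per block, enough blocks to cover the image.
      // dim3 block(16, 16);
      // dim3 grid((W + 15) / 16, (H + 15) / 16);
      // processImage<<<grid, block>>>(d_img, W, H);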

  14. CUDA memory hierarchy: each thread (with its threadID) accesses its own registers, each block accesses shared memory, and the grid accesses global memory.

  15. CUDA Memory space overview

  16. CUDA kernel/thread execution: one kernel at a time is executed as a grid; a block executes on one thread processor (multiprocessor); several blocks can execute concurrently on one multiprocessor.

  17. CUDA variable keywords:
      __global__ void KernelFunc(...);   // kernel function, runs on the device
      __device__ int GlobalVar;          // variable in device memory
      __shared__ int SharedVar;          // variable in per-block shared memory
      Executing the kernel:
      kernelFunc<<<500, 128>>>(a, b);    // 1D grid of 500 blocks, each a 1D array of 128 threads = 500 x 128 = 64000 threads
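
      A short self-contained sketch tying these keywords together (the kernel name, array size, and scale factor are illustrative assumptions, not from the slides):

      __device__ float scale = 2.0f;               // lives in device global memory

      __global__ void scaleArray(float *data, int N)
      {
          __shared__ float tile[128];              // per-block shared memory
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i < N)
              tile[threadIdx.x] = data[i];         // stage the element in shared memory
          __syncthreads();                         // all threads in the block wait here
          if (i < N)
              data[i] = tile[threadIdx.x] * scale; // write the scaled value back
      }

      // Host side: 500 blocks of 128 threads, i.e. 64000 threads in total.
      // scaleArray<<<500, 128>>>(d_data, 64000);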

  18. CUDA: Getting the thread ID. CUDA built-in variables:
      dim3 threadIdx, blockIdx, blockDim, gridDim;
      int threadID = blockDim.x * blockIdx.x + threadIdx.x;

  19. CUDA: Getting the thread ID (continued).
      dim3 threadIdx, blockIdx, blockDim, gridDim;
      int threadID = blockDim.x * blockIdx.x + threadIdx.x;
      // A is in shared memory
      A[threadID] = begin[threadID];
      // sync threads within a block
      __syncthreads();
      int left = A[threadID - 1];

  20. CUDA: Getting the thread ID (continued).
      dim3 threadIdx, blockIdx, blockDim, gridDim;
      int threadID = blockDim.x * blockIdx.x + threadIdx.x;
      // A is in shared memory
      A[threadID] = begin[threadID];
      /* the atomic counter is increased when element threadID is accessed:
         atomicInc(&a, i) stores (*a >= i) ? 0 : (*a + 1) and returns the old value */
      val = atomicInc(&B[threadID], val);
      int left = A[threadID - 1];

  21. CUDA atomic functions: an atomic operation is guaranteed to be performed without interference from other threads!
      atomicAdd, atomicSub, atomicExch, atomicMin, atomicMax, atomicInc, atomicDec, atomicCAS, atomicAnd, atomicOr, atomicXor
      Notes: shared memory is faster than using atomic operations in global memory; minimize their usage, because atomics cause code serialization.
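
      A short illustration, not from the slides (the kernel, array, and counter names are assumptions): atomicAdd lets many threads update a single global counter safely.

      // Hypothetical kernel: count how many input values are positive.
      __global__ void countPositive(const float *in, int N, unsigned int *counter)
      {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i < N && in[i] > 0.0f)
              atomicAdd(counter, 1u);   // serialized only among the threads that reach it
      }

      In line with the note above, contention can be reduced by accumulating a per-block count in shared memory first and issuing a single atomicAdd per block.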

  22. CUDA: Hello World in CUDA.
      // add two arrays a + b in O(1)
      __global__ void increment_gpu(float *a, float *b, int N)
      {
          int idx = blockIdx.x * blockDim.x + threadIdx.x;
          if (idx < N)
              a[idx] = a[idx] + b[idx];
      }

      int main()
      {
          ...
          int blockSize = 500;
          dim3 dimBlock(blockSize);
          dim3 dimGrid(ceil(N / (float)blockSize));
          increment_gpu<<<dimGrid, dimBlock>>>(a, b, N);
      }
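
      The slide elides the host-side setup. A minimal sketch of what typically goes in its place, with host_a, host_b, and the value of N as assumed names not taken from the slides:

      int N = 64000;
      size_t bytes = N * sizeof(float);
      float *a, *b;                                            // device pointers
      cudaMalloc((void**)&a, bytes);                           // allocate device memory
      cudaMalloc((void**)&b, bytes);
      cudaMemcpy(a, host_a, bytes, cudaMemcpyHostToDevice);    // copy the inputs to the GPU
      cudaMemcpy(b, host_b, bytes, cudaMemcpyHostToDevice);

      increment_gpu<<<dimGrid, dimBlock>>>(a, b, N);           // launch as on the slide

      cudaMemcpy(host_a, a, bytes, cudaMemcpyDeviceToHost);    // read the result back
      cudaFree(a);
      cudaFree(b);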

  24. CUDA: Histogram in CUDA using atomicInc.
      // compute a histogram b of the array a with a bin size of d
      __global__ void histogram(float *a, unsigned int *b, float d, int N)
      {
          int idx = blockIdx.x * blockDim.x + threadIdx.x;
          if (idx < N) {
              int bin = int(a[idx] / d);
              atomicInc(b + bin, N);
          }
      }

      int main()
      {
          ...
          int blockSize = 500;
          dim3 dimBlock(blockSize);
          dim3 dimGrid(ceil(N / (float)blockSize));
          histogram<<<dimGrid, dimBlock>>>(a, b, d, N);
      }
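
      A practical detail the slide leaves out (an added note, not from the slides): the histogram buffer b must be zeroed before the launch, and its length must cover the largest bin index produced by a[idx] / d. A hedged host-side sketch with an assumed bin count:

      int numBins = 256;                                    // assumed; depends on the data range and d
      unsigned int *b;
      cudaMalloc((void**)&b, numBins * sizeof(unsigned int));
      cudaMemset(b, 0, numBins * sizeof(unsigned int));     // all counts start at zero
      // ... then launch histogram<<<dimGrid, dimBlock>>>(a, b, d, N) as above ...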

  25. CUDA thread access privileges. We already know that each thread can:
      - read/write per-thread registers
      - read/write per-block shared memory
      - read/write per-grid global memory
      Additionally, each thread can also:
      - read/write per-thread local memory
      - read per-grid constant memory (read only)
      - read per-grid texture memory (read only)
      texture<float, 2> my_texture;            // declare a texture reference
      float texel = tex2D(my_texture, u, v);   // fetch the texel value
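
      The read-only constant memory mentioned above can be illustrated with a short sketch; the symbol name, size, and kernel are assumptions, not from the slides:

      __constant__ float coeffs[16];                 // read-only for kernels, cached on chip

      __global__ void applyCoeffs(float *data, int N)
      {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i < N)
              data[i] *= coeffs[i % 16];             // every thread reads constant memory
      }

      // Host side: fill the constant symbol before launching the kernel.
      // float h_coeffs[16] = { ... };
      // cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));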

  26. CUDA and NVCC: any source file containing CUDA language extensions must be compiled with NVCC. NVCC separates the code that runs on the host (CPU, ordinary C++ source) from the code that runs on the device (GPU).
      nvcc -deviceemu   // use for debugging
      When running in device emulation mode, one can:
      - use the host's native debug support (breakpoints, inspection, etc.)
      - access any device-specific data from host code and vice versa
      - call any host function from device code (e.g. printf) and vice versa
      - detect deadlock situations caused by improper usage of __syncthreads
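
      For concreteness, typical invocations look like the following (the file and program names are assumptions; the -deviceemu flag exists only in the older CUDA toolkits this presentation targets):

      nvcc -o increment increment.cu              # normal build: host and device code in one .cu file
      nvcc -deviceemu -o increment increment.cu   # device emulation build, for debugging on the CPU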

  27. CUDA reduction problems: reduce N values to a single one.
      Sum(v_0, v_1, …, v_{N-2}, v_{N-1})
      Max(v_0, v_1, …, v_{N-2}, v_{N-1})
      Min(v_0, v_1, …, v_{N-2}, v_{N-1})

  28. CUDA example: Sum(V) = Sum(x_0, x_1, …, x_{k-2}, x_{k-1})

  29. CUDA example: Sum(V) = Sum(x_0, x_1, …, x_{k-2}, x_{k-1})
      __global__ void Sum(int *X, int *Y)
      {
          extern __shared__ int sdata[];
          // each thread loads one element from global to shared mem
          unsigned int tid = threadIdx.x;
          unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
          sdata[tid] = X[i];
          __syncthreads();
          // do reduction in shared mem (interleaved addressing)
          for (unsigned int s = 1; s < blockDim.x; s *= 2) {
              if (tid % (2 * s) == 0) {
                  sdata[tid] += sdata[tid + s];
              }
              __syncthreads();
          }
          // write the result for this block to global mem
          if (tid == 0) Y[blockIdx.x] = sdata[0];
      }

  30. CUDA example: Sum(V) = Sum(x_0, x_1, …, x_{k-2}, x_{k-1})

  31. CUDA example: Sum(V) = Sum(x_0, x_1, …, x_{k-2}, x_{k-1})
      __global__ void Sum(int *X, int *Y)
      {
          extern __shared__ int sdata[];
          // each thread loads one element from global to shared mem
          unsigned int tid = threadIdx.x;
          unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
          sdata[tid] = X[i];
          __syncthreads();
          // do reduction in shared mem (sequential addressing, avoids divergent branches)
          for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
              if (tid < s) {
                  sdata[tid] += sdata[tid + s];
              }
              __syncthreads();
          }
          // write the result for this block to global mem
          if (tid == 0) Y[blockIdx.x] = sdata[0];
      }

  32. CUDA example: Sum(V) = Sum(x_0, x_1, …, x_{k-2}, x_{k-1})
      __global__ void Sum(int *X, int *Y)
      {
          extern __shared__ int sdata[];
          // each thread loads one element from global to shared mem
          unsigned int tid = threadIdx.x;
          unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
          sdata[tid] = X[i];
          __syncthreads();
          // do the reduction in shared mem, unrolled for blockDim.x = 32
          if (tid < 16) sdata[tid] += sdata[tid + 16];
          __syncthreads();
          if (tid < 8) sdata[tid] += sdata[tid + 8];
          __syncthreads();
          if (tid < 4) sdata[tid] += sdata[tid + 4];
          __syncthreads();
          if (tid < 2) sdata[tid] += sdata[tid + 2];
          __syncthreads();
          if (tid < 1) sdata[tid] += sdata[tid + 1];
          __syncthreads();
          // write the result for this block to global mem
          if (tid == 0) Y[blockIdx.x] = sdata[0];
      }
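
      Because sdata is declared extern __shared__, the launch must supply the shared-memory size as the third parameter inside <<<...>>>, and the kernel leaves one partial sum per block, so it is usually invoked again on the partial sums. A hedged host-side sketch (names and sizes are assumptions, not from the slides):

      int threads = 128;                                      // threads per block
      int blocks  = N / threads;                              // assumes N is a multiple of threads
      size_t smem = threads * sizeof(int);                    // bytes of dynamic shared memory
      Sum<<<blocks, threads, smem>>>(d_X, d_Y);               // d_Y receives one partial sum per block
      Sum<<<1, blocks, blocks * sizeof(int)>>>(d_Y, d_out);   // second pass, valid while 'blocks' fits in one block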

  33. CUDA example: sparse matrix-vector multiplication. Kernel construction: y = Ax. Strategy: one thread is assigned to each row of matrix A.

  36. CUDA example: sparse matrix-vector multiplication. The sparse matrix A is represented by the arrays data, rows, and cols; a kernel sketch follows below.
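
      The slides do not show the kernel itself. The following minimal sketch assumes the common CSR-style layout, in which rows holds per-row offsets into data and cols, data holds the nonzero values, and cols their column indices; this interpretation of the three arrays, and all names below, are assumptions:

      // y = A * x, one thread per row of A (CSR-style storage assumed)
      __global__ void spmv(const int *rows, const int *cols, const float *data,
                           const float *x, float *y, int numRows)
      {
          int row = blockIdx.x * blockDim.x + threadIdx.x;
          if (row < numRows) {
              float dot = 0.0f;
              for (int j = rows[row]; j < rows[row + 1]; ++j)   // walk the nonzeros of this row
                  dot += data[j] * x[cols[j]];
              y[row] = dot;
          }
      }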
