GPU computing, Part 2: CUDA examples (with some admixture of introduction)
Ch. Hoelbling, Wuppertal University
Lattice Practices 2011
Outline
1 Leftovers from yesterday
2 Basics
3 Memory
4 Control flow
5 Applications
CUDA overview: CUDA PROGRAMMING MODEL
Threads are bundled in blocks
- 1, 2 or 3D index
- shared memory
- can synchronize
- branching with penalty
- current max. number: 1024
Blocks are bundled in grids
- 1 or 2D index
- same kernel for all threads
- blocks cannot synchronize
- branch without penalty
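A minimal sketch of this hierarchy (hypothetical kernel, array and launch sizes, not taken from the lecture): threads within one block can cooperate through shared memory and __syncthreads(), while blocks run independently and every thread executes the same kernel.

    /* hypothetical example: scale a vector, illustrating blocks and threads */
    #define N 1024

    __global__ void scale(float *out, const float *in, float factor)
    {
        __shared__ float buf[256];                      /* per-block shared memory     */
        int i = blockIdx.x * blockDim.x + threadIdx.x;  /* global index of this thread */

        buf[threadIdx.x] = in[i];                       /* stage the element           */
        __syncthreads();                                /* sync within the block only  */
        out[i] = factor * buf[threadIdx.x];             /* blocks never synchronize    */
    }

    /* launch: 4 blocks of 256 threads cover N = 1024 elements */
    /* scale<<<N / 256, 256>>>(d_out, d_in, 2.0f);             */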
CUDA overview: CUDA HARDWARE MODEL
Multiprocessor
- 32 scalar cores
- 4 FP ops per core per cycle
- SIMT (SIMD + branching)
- 32k registers
- shared memory
- works on N × 32 threads (= N warps) at a time
- extremely fast context switches
GPU contains e.g. 16 MPs
- share global and texture memory
- work independently
CUDA overview: CUDA MEMORY MODEL
Per thread
- registers
- local memory (RW), 512k
Per block
- shared memory (RW, max. 48k)
Per grid
- global memory (RW)
- constant memory (R)
- texture memory (R, separately cached)
From host (CPU)
- global memory (RW)
- constant memory (RW)
- texture memory (RW)
- all 3 are persistent
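A small sketch (hypothetical kernel and names) of how these spaces appear in device code; the host would fill the constant array with cudaMemcpyToSymbol before launching:

    __constant__ float coeff[16];              /* constant memory, read-only on device */

    __global__ void memspaces(float *gmem)     /* gmem points into global memory       */
    {
        __shared__ float tile[128];            /* shared memory, one copy per block    */
        float r = coeff[0];                    /* r lives in a register (spills to
                                                  local memory if registers run out)   */
        tile[threadIdx.x] = gmem[threadIdx.x] * r;
        __syncthreads();
        gmem[threadIdx.x] = tile[threadIdx.x];
    }

    /* host side, e.g.: cudaMemcpyToSymbol(coeff, h_coeff, 16 * sizeof(float)); */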
Application overview: BENCHMARK
[Plot: Gflops (0 to 35) versus lattice volume for GPU 8800 GTX, GPU 7900 GTX, GPU 7800 GTX and CPU P4 SSE]
Application overview: THINGS TO NOTE
GPUs need a large local lattice
- efficient use of massive parallelism
Further parallelization is difficult
- at GPU speeds, the bus is a huge bottleneck
- unnecessary memcopies due to drivers (rapidly improving)
- better stay on one GPU!
Computation is for free!
- GPUs are (memory) bandwidth limited
- it pays off e.g. to reconstruct the SU(3) matrix from its generators (see the sketch below)
Reliability is a very serious issue
- gamers don't care about one bad pixel
- some models have > 50% failure rate
- danger of unnoticed memory corruption (no ECC)
- TESLA more expensive but not necessarily more reliable
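As one concrete illustration of trading flops for bandwidth (a related but not identical trick to the generator reconstruction mentioned above), many lattice GPU codes store only the first two rows of each SU(3) link and rebuild the third row on the fly as the complex-conjugated cross product of the other two:

    #include <cuComplex.h>

    /* For U in SU(3), row2 = conj(row0 x row1). Storing 12 instead of 18 reals
       per link cuts memory traffic at the cost of a few extra flops.           */
    __device__ void su3_reconstruct_row(const cuFloatComplex *r0,
                                        const cuFloatComplex *r1,
                                        cuFloatComplex *r2)
    {
        r2[0] = cuConjf(cuCsubf(cuCmulf(r0[1], r1[2]), cuCmulf(r0[2], r1[1])));
        r2[1] = cuConjf(cuCsubf(cuCmulf(r0[2], r1[0]), cuCmulf(r0[0], r1[2])));
        r2[2] = cuConjf(cuCsubf(cuCmulf(r0[0], r1[1]), cuCmulf(r0[1], r1[0])));
    }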
Application overview: GPUs IN PRACTICE
Really cheap farming (e.g. thermodynamics)
Best for failsafe code
- matrix inversion: final residue check on CPU
- convergence in spite of intermittent error
- restart when error detected
- control part on CPU
Why not OpenCL?
- CUDA currently faster
- similar (low level) syntax
- nvcc compiles OpenCL via intermediate CUDA
- vendor lock-in: ATI OpenCL is getting better
Rapid development
- render to texture, CUDA, IEEE, double, caches, multi-GPU, device-to-device, ...
Device code: A SIMPLE KERNEL

    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>

    #define VL 128

    /* s = a + b, one element per thread */
    __global__ void add_s(float *s, float *a, float *b)
    {
        int i = threadIdx.x;
        s[i] = a[i] + b[i];
    }
Device code: A SIMPLE KERNEL

    __global__ void add_s(float *s, float *a, float *b)

__global__ ☞ function on device, callable from host
__device__ ☞ function on device, callable from device
__host__ ☞ function on host, callable from host (default)
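A small sketch (hypothetical functions) of the qualifiers in combination: a __device__ helper called from a kernel, and a function compiled for both host and device.

    __device__ float square(float x)           /* device only, called from device code   */
    {
        return x * x;
    }

    __host__ __device__ float twice(float x)   /* compiled for both host and device      */
    {
        return 2.0f * x;
    }

    __global__ void apply(float *v)            /* runs on the device, launched from host */
    {
        int i = threadIdx.x;
        v[i] = twice(square(v[i]));
    }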
Device code: A SIMPLE KERNEL

    __host__ __device__ void func()
    {
    #if __CUDA_ARCH__ == 100
        // Device code path for compute capability 1.0
    #elif __CUDA_ARCH__ == 200
        // Device code path for compute capability 2.0
    #elif !defined(__CUDA_ARCH__)
        // Host code path
    #endif
    }

__CUDA_ARCH__ ☞ macro for conditional compilation of host and device code paths
Device code: A SIMPLE KERNEL

    int i = threadIdx.x;

threadIdx ☞ thread index within the block, uint3 (x, y, z)
blockDim ☞ dimensions of the current thread block
blockIdx ☞ block index within the grid
gridDim ☞ dimensions of the current grid of blocks
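For vectors longer than one block, the usual pattern (a sketch, not part of the lecture code) combines these built-ins into a global index:

    __global__ void add_long(float *s, float *a, float *b, int n)
    {
        /* global index: block offset plus thread offset within the block */
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)              /* guard: the last block may be only partly filled */
            s[i] = a[i] + b[i];
    }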
Setup: CALLING THE KERNEL
How are the dimensions determined?

    __global__ void add_s(float *s, float *a, float *b)
    ...
    int main()
    {
        ...
        /* now call the compute kernel on the device */
        add_s<<<dim_grid,dim_block>>>(_s,_a,_b);
        ...
    }
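The slide leaves dim_grid and dim_block undefined; a typical way to set them (a sketch, assuming the VL = 128 element vector from before) uses dim3 variables or plain integers:

    /* one block of VL threads suffices for this small vector */
    dim3 dim_block(VL, 1, 1);    /* threads per block         */
    dim3 dim_grid(1, 1, 1);      /* blocks in the grid        */

    add_s<<<dim_grid, dim_block>>>(_s, _a, _b);

    /* for larger vectors: enough blocks of 256 threads to cover N elements */
    /* dim3 dim_block(256);                                                 */
    /* dim3 dim_grid((N + 255) / 256);                                      */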
Setup: BASIC MEM ALLOCATION
Allocating device memory:
    /* allocate device memory */
    cudaMalloc((void **)&_a, VL * sizeof(float));
Copying it to the device:
    /* copy a over to the device */
    cudaMemcpy(_a, a, VL * sizeof(float), cudaMemcpyHostToDevice);
Copying the result back from the device:
    /* copy result back to host */
    cudaMemcpy(s, _s, VL * sizeof(float), cudaMemcpyDeviceToHost);
Freeing device memory:
    /* free device memory */
    cudaFree(_a);
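Put together, a minimal host program around the add_s kernel could look like the following sketch (error checking omitted; array contents are made up for illustration):

    int main(void)
    {
        float a[VL], b[VL], s[VL];
        float *_a, *_b, *_s;

        for (int i = 0; i < VL; i++) { a[i] = i; b[i] = 2.0f * i; }

        /* allocate device memory */
        cudaMalloc((void **)&_a, VL * sizeof(float));
        cudaMalloc((void **)&_b, VL * sizeof(float));
        cudaMalloc((void **)&_s, VL * sizeof(float));

        /* copy the inputs to the device */
        cudaMemcpy(_a, a, VL * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(_b, b, VL * sizeof(float), cudaMemcpyHostToDevice);

        /* one block of VL threads */
        add_s<<<1, VL>>>(_s, _a, _b);

        /* copy the result back and clean up */
        cudaMemcpy(s, _s, VL * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(_a); cudaFree(_b); cudaFree(_s);

        printf("s[1] = %f\n", s[1]);
        return 0;
    }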
Setup: SOME SETUP
Checking devices present:
    /* get device info */
    int deviceCount;
    cudaGetDeviceCount(&deviceCount);
Query device properties:
    /* check device properties - only use device 0 */
    cudaDeviceProp deviceProp;
    if (cudaGetDeviceProperties(&deviceProp, 0) == cudaSuccess)
        printf("Device: %s\n", deviceProp.name);
Select a specific device:
    /* run on device 0 */
    cudaSetDevice(0);
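Combining these calls, a short sketch (hypothetical; the properties printed are just examples) that lists every device before picking one:

    int deviceCount;
    cudaGetDeviceCount(&deviceCount);

    for (int d = 0; d < deviceCount; d++) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, d) == cudaSuccess)
            printf("Device %d: %s, %d multiprocessors, %.1f GB global memory\n",
                   d, prop.name, prop.multiProcessorCount,
                   prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
    }

    cudaSetDevice(0);   /* run on device 0 */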
Setup: SIMPLE EXAMPLE CODE
All code in one file (demo.cu)
- no additional include files
- no libraries to link explicitly
Compile with the nvcc wrapper: nvcc -o demo.x demo.cu
Direct machine code possible (PTX pseudo-assembly)
- usually not efficient
Runtime: driver compiles PTX into the cubin binary format
- API allows for low level access
Timing: TIMING
Create event structures:
    /* create timing events */
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
Record an event:
    /* start time - all streams */
    cudaEventRecord(start, 0);
Timing: TIMING (ctd.)
Record the second event, synchronize and get the time:
    /* end time - all streams */
    cudaEventRecord(stop, 0);
    /* synchronize after the asynchronous call */
    cudaEventSynchronize(stop);
    /* get time between events */
    float dt;
    cudaEventElapsedTime(&dt, start, stop);
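Put together, timing a launch of the add_s kernel could look like this sketch (dt is reported in milliseconds):

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);               /* record start on stream 0          */
    add_s<<<1, VL>>>(_s, _a, _b);            /* asynchronous kernel launch        */
    cudaEventRecord(stop, 0);                /* record stop after the kernel      */

    cudaEventSynchronize(stop);              /* wait until the stop event is done */

    float dt;
    cudaEventElapsedTime(&dt, start, stop);  /* elapsed time in milliseconds      */
    printf("kernel took %f ms\n", dt);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);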
Host memory: MEMORY OPTIMIZATION
Bandwidth in device memory: up to ~200 GB/s
Bandwidth host-device: up to 8 GB/s (PCIe 2.0 × 16)
☞ Host-to-device transfer is often the bottleneck
Improve by using pinned (non-pageable) host memory:
    /* allocate non-pageable memory */
    cudaHostAlloc((void **)&a, VL * sizeof(float), cudaHostAllocDefault);
Host memory: MEMORY OPTIMIZATION
Even faster: write-combined memory
    /* allocate write-combined, non-pageable memory */
    cudaHostAlloc((void **)&a, VL * sizeof(float), cudaHostAllocWriteCombined);
Warning: slow CPU access (only for pushing data to the device)
Host memory: MEMORY OPTIMIZATION
Copy on demand: mapped memory
    /* allocate mapped memory */
    cudaHostAlloc((void **)&a, VL * sizeof(float), cudaHostAllocMapped);
Access it from within a device kernel through its device pointer:
    /* get the device pointer to the mapped memory */
    cudaHostGetDevicePointer((void **)&_a, a, 0);
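One detail not shown on the slide: mapping must be enabled before the CUDA context is created. A sketch of the full pattern (reusing the add_s kernel and VL from before as an assumption):

    cudaSetDeviceFlags(cudaDeviceMapHost);   /* enable mapping before the first CUDA call */

    float *a, *_a;
    cudaHostAlloc((void **)&a, VL * sizeof(float), cudaHostAllocMapped);

    /* ... fill a on the host ... */

    cudaHostGetDevicePointer((void **)&_a, a, 0);   /* device-side alias for a */

    /* the kernel reads and writes the host memory over the bus on demand      */
    add_s<<<1, VL>>>(_a, _a, _a);
    cudaDeviceSynchronize();                 /* make sure the device writes are visible */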
Host memory: HIDING MEMORY TRANSFER
Asynchronous memcopy to the device:
    /* copy a over to the device */
    cudaMemcpyAsync(_a, a, size, cudaMemcpyHostToDevice, 0);
Concurrent with CPU code
Default "stream" 0 serializes GPU code
Create non-default streams to overlap operations on the GPU:
    /* create streams */
    cudaStreamCreate(&stream1);
    cudaStreamCreate(&stream2);
    /* host to device copy */
    cudaMemcpyAsync(_a, a, size, cudaMemcpyHostToDevice, stream1);
    /* overlap with an independent operation */
    add_s<<<NB,NT,0,stream2>>>(_s,_c,_d);
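A slightly fuller sketch (hypothetical launch parameters NB, NT and pointers) including the stream declarations and the synchronization and cleanup calls the slide omits; the host buffer a should be pinned for the copy to be truly asynchronous:

    cudaStream_t stream1, stream2;
    cudaStreamCreate(&stream1);
    cudaStreamCreate(&stream2);

    /* host to device copy in stream1                             */
    cudaMemcpyAsync(_a, a, size, cudaMemcpyHostToDevice, stream1);

    /* an independent kernel in stream2 can overlap with the copy */
    add_s<<<NB, NT, 0, stream2>>>(_s, _c, _d);

    /* wait for both streams before using the results             */
    cudaStreamSynchronize(stream1);
    cudaStreamSynchronize(stream2);

    cudaStreamDestroy(stream1);
    cudaStreamDestroy(stream2);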