“Cartoon” parallel architectures; CPUs and GPUs
CSE 6230, Fall 2014, Th Sep 11
Thanks to Jee Choi (a senior PhD student) for a big “assist”
Levels of parallelism (the “cartoon” view): ~ socket, ~ core, ~ HWMT (hardware multithreading) + SIMD (“SIMT”)

Intel E5-2687W (“Sandy Bridge-EP”) vs. NVIDIA K20X (“Kepler”)
• Peak compute: ~ 500 GF/s (single) vs. ~ 4 TF/s (single)
• Memory bandwidth: ~ 50 GB/s vs. ~ 250 GB/s
• Host and device connected by a ~ 6 GB/s (PCIe) link
System Comparison

                              Intel Xeon E5-2687W              NVIDIA K20X                      Difference
  # Cores / SMXs              8                                14                               1.75 ×
  Clock frequency (max)       3.8 GHz                          735 MHz                          0.20 ×
  SIMD width                  256 bits                         --
  Thread processors           --                               2688 SP + 896 DP
  Performance (single)        8 cores × 3.8 GHz ×              2688 × 735 MHz × 2 (FMA)         8.12 ×
                              (8 Add + 8 Mul) = ~486 GF/s      = ~3951 GF/s
  Performance (double)        8 cores × 3.8 GHz ×              896 × 735 MHz × 2 (FMA)          5.42 ×
                              (4 Add + 4 Mul) = ~243 GF/s      = ~1317 GF/s
  Memory bandwidth            51.2 GB/s                        250 GB/s                         4.88 ×
  TDP                         150 W                            235 W                            1.57 ×
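As a quick sanity check on the single-precision peaks, a back-of-the-envelope sketch using only the figures from the table above:

    #include <stdio.h>

    int main(void)
    {
        /* peak single-precision throughput, from the table */
        double cpu_sp = 8 * 3.8e9 * (8 + 8);   /* 8 cores x 3.8 GHz x (8 add + 8 mul): ~486 GF/s  */
        double gpu_sp = 2688 * 735e6 * 2.0;    /* 2688 SP cores x 735 MHz x 2 (FMA):   ~3951 GF/s */
        printf("CPU %.0f GF/s, GPU %.0f GF/s, ratio %.2fx\n",
               cpu_sp / 1e9, gpu_sp / 1e9, gpu_sp / cpu_sp);   /* ratio ~ 8.12x */
        return 0;
    }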
“CUDA” is NVIDIA’s implementation of this execution model

Thread hierarchy: “single instruction, multiple thread” (SIMT)
An example to compare models

Naïve (serial):
    for (i = 0; i < N; i++)
        A[i] += 2;

OpenMP:
    #pragma omp parallel for
    for (i = 0; i < N; i++)
        A[i] += 2;

CUDA, with N threads (each thread handles one element):
    int i = f(global thread ID);
    A[i] += 2;
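A minimal sketch of what the CUDA fragment expands to, assuming a 1-D launch; the kernel name, the block size, the guard against a partial last block, and the device pointer d_A (allocated elsewhere) are illustrative:

    __global__ void add2(int *A, int N)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   /* f(global thread ID)       */
        if (i < N)                                       /* last block may be partial */
            A[i] += 2;
    }

    /* launch: one thread per element */
    int tbSize  = 256;
    int nBlocks = (N + tbSize - 1) / tbSize;
    add2<<<nBlocks, tbSize>>>(d_A, N);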
Global thread IDs (figure): four thread blocks (blockIdx.x = 0..3) of four threads each (threadIdx.x = 0..3) cover global IDs 0..15, i.e. global ID = threadIdx.x + blockDim.x * blockIdx.x, and each global ID indexes one element of the array A.
Thread hierarchy
• Given a 3-D grid of thread blocks
  – there are (gridDim.x * gridDim.y * gridDim.z) thread blocks in the grid
  – each block’s position is identified by blockIdx.x, blockIdx.y, and blockIdx.z
• Similarly for a 3-D thread block
  – blockDim.x, blockDim.y, blockDim.z
  – threadIdx.x, threadIdx.y, threadIdx.z
• Thread-to-data mapping depends on how the work is divided amongst the threads
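One common (but not the only) mapping, sketched with illustrative names: compute a 3-D global index from the block and thread indices, then linearize it in row-major order. The output array out is assumed to hold one int per thread in the launch.

    __global__ void whereAmI(int *out)
    {
        int gx = blockIdx.x * blockDim.x + threadIdx.x;
        int gy = blockIdx.y * blockDim.y + threadIdx.y;
        int gz = blockIdx.z * blockDim.z + threadIdx.z;

        int width  = gridDim.x * blockDim.x;   /* total threads along x */
        int height = gridDim.y * blockDim.y;   /* total threads along y */

        /* row-major linearization: x fastest, then y, then z */
        int gid = gx + gy * width + gz * width * height;
        out[gid] = gid;
    }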
Memory hierarchy

  (register) variables           per thread
  local memory                   per thread
  shared memory                  per thread block
  global memory                  per grid
  constant memory (read-only)    per grid
  texture memory (read-only)     per grid
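A sketch of how CUDA C qualifiers map onto these spaces (array sizes and names are illustrative; texture memory, which is accessed through texture references/objects, is omitted):

    __constant__ float coeff[16];       /* constant memory: read-only from the device */
    __device__   float table[1024];     /* global memory, statically allocated        */

    __global__ void memSpaces(float *out)          /* out points to global memory */
    {
        __shared__ float tile[256];     /* shared memory: one copy per thread block;
                                           assumes blockDim.x <= 256                 */
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float x = table[i % 1024] + coeff[threadIdx.x % 16];   /* x lives in a register
                                                                  (or local memory if spilled) */
        tile[threadIdx.x] = x;
        __syncthreads();
        out[i] = tile[threadIdx.x];
    }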
CUDA by example: Basic CUDA code

    __global__ void test(int *in, int *out, int N)
    {
        int gId = threadIdx.x + blockDim.x * blockIdx.x;
        out[gId] = in[gId];
    }

    int main(int argc, char **argv)
    {
        int N = 1048576;
        int tbSize = 256;
        int nBlocks = N / tbSize;

        int *d_in, *d_out;   /* allocated and filled on the next slide */

        dim3 grid(nBlocks);
        dim3 block(tbSize);

        test<<<grid, block>>>(d_in, d_out, N);
        cudaThreadSynchronize();
        return 0;
    }
CUDA by example: Basic CUDA code (host side)

    int main(int argc, char **argv)
    {
        int N = 1048576;

        /* allocate memory on the host and on the device */
        int *h_in, *h_out, *d_in, *d_out;
        h_in  = (int*) malloc(N * sizeof(int));
        h_out = (int*) malloc(N * sizeof(int));
        cudaMalloc((void**) &d_in,  N * sizeof(int));
        cudaMalloc((void**) &d_out, N * sizeof(int));

        /* copy data from host (CPU) to device (GPU) */
        cudaMemcpy(d_in, h_in, N * sizeof(int), cudaMemcpyHostToDevice);

        /* body of the problem here (kernel launch, previous slide) */
        . . .

        /* copy data back from device (GPU) to host (CPU) */
        cudaMemcpy(h_out, d_out, N * sizeof(int), cudaMemcpyDeviceToHost);

        /* free memory */
        free(h_in); free(h_out);
        cudaFree(d_in); cudaFree(d_out);
        return 0;
    }
CUDA by example: What is this code doing?

    __global__ void mysteryFunction(int *in)
    {
        int tidx = threadIdx.x;
        int tidy = threadIdx.y;
        int gIdX = tidx + blockDim.x * blockIdx.x;
        int gIdY = tidy + blockDim.y * blockIdx.y;
        int temp;

        __shared__ int buffer[16][16];                   /* shared memory */

        buffer[tidx][tidy] = in[gIdX + gIdY * blockDim.x * gridDim.x];
        __syncthreads();                                 /* why do we need this? */

        if (tidx > 0 && tidx < 15 && tidy > 0 && tidy < 15) {
            temp = (buffer[tidx][tidy - 1] + buffer[tidx][tidy + 1] +
                    buffer[tidx - 1][tidy] + buffer[tidx + 1][tidy] +
                    buffer[tidx][tidy]) / 5;
        } else {
            /* take care of boundary conditions; here, leave the value unchanged */
            temp = buffer[tidx][tidy];
        }
        in[gIdX + gIdY * blockDim.x * gridDim.x] = temp;
    }
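The kernel averages each element with its four neighbours inside a 16 × 16 tile (a 5-point stencil), staging the tile in shared memory; __syncthreads() guarantees the whole tile is loaded before any thread reads its neighbours’ entries. A possible launch, assuming the input is a W × H grid of ints with W and H multiples of 16 (W, H, and d_in are illustrative names):

    dim3 block(16, 16);                      /* must match the 16x16 shared buffer */
    dim3 grid(W / 16, H / 16);               /* one thread block per 16x16 tile    */
    mysteryFunction<<<grid, block>>>(d_in);
    cudaThreadSynchronize();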
Synchronization
• Within a thread block
  – via __syncthreads();
• Global synchronization
  – implicit synchronization between kernels
  – the only way to synchronize globally is to finish the grid and start another grid
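A sketch of the “finish the grid” idiom, with hypothetical kernels step1 and step2 that must be globally ordered:

    step1<<<grid, block>>>(d_data);   /* every block of step1 completes...            */
    step2<<<grid, block>>>(d_data);   /* ...before any block of step2 starts, because
                                         kernels launched to the same stream execute
                                         in order                                      */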
Scheduling
• Each thread block gets scheduled on a multiprocessor (SMX)
  – there is no guarantee of the order in which they get scheduled
  – thread blocks run independently of each other
• Multiple thread blocks can reside on a single SMX simultaneously (occupancy)
  – the number of resident thread blocks is determined by resource usage and availability (shared memory and registers)
• Once scheduled, each thread block runs to completion
Execution
• Minimum unit of execution: the warp
  – typically 32 threads
• At any given time, multiple warps will be executing
  – they can come from the same or from different thread blocks
• A warp of threads is either
  – executing
  – waiting (for data or for its turn)
• When a warp stalls, it can be switched out “instantaneously” so that another warp can start executing
  – hardware multithreading
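For a 1-D thread block, a thread’s warp and its lane within that warp fall out of simple arithmetic (warpSize is a built-in device variable, 32 on current hardware):

    int lane   = threadIdx.x % warpSize;   /* position within the warp: 0..31            */
    int warpId = threadIdx.x / warpSize;   /* which warp of the block this thread is in  */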
Performance Notes: Thread Divergence
• On a branch, threads in a warp can diverge
  – execution is serialized: threads taking one path execute while the others idle
• Avoid divergence!!!
  – use bitwise operations when possible
  – diverge at the granularity of warps (no penalty)
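A hedged illustration (kernel and array names are made up, and the launch is assumed to cover the array exactly): in the first kernel, even and odd lanes of every warp take different paths and are serialized; in the second, the choice is made per warp, so all 32 threads of a warp follow the same path and nothing is serialized.

    __global__ void divergent(int *a)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i % 2 == 0)                      /* splits every warp into two halves */
            a[i] *= 2;
        else
            a[i] += 1;
    }

    __global__ void warpUniform(int *a)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if ((i / warpSize) % 2 == 0)         /* whole warps take the same branch  */
            a[i] *= 2;
        else
            a[i] += 1;
    }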
Performance Notes: Occupancy
• Occupancy = # resident warps / max # warps
  – # resident warps is determined by per-thread register and per-block shared memory usage
  – max # warps is specific to the hardware generation
• More warps means more threads with which to hide latency
  – increases the chance of keeping the GPU busy at all times
  – does not necessarily mean better performance
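A back-of-the-envelope sketch (the per-kernel usage figures are made up; the per-SMX limits of 65,536 registers, 48 KB shared memory, and 64 resident warps are Kepler-class assumptions):

    /* Hypothetical kernel: 40 registers per thread, 8 KB shared memory per block,
       256 threads (8 warps) per block.                                            */
    int blocks_by_regs  = 65536 / (40 * 256);     /* register limit:   6 blocks    */
    int blocks_by_smem  = 49152 / 8192;           /* shared-mem limit: 6 blocks    */
    int resident_blocks = 6;                      /* min of the two limits         */
    int resident_warps  = resident_blocks * 8;    /* 48 warps                      */
    /* occupancy = 48 / 64 = 75%                                                   */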
Performance Notes: Bandwidth Utilization
• Reading from DRAM occurs at the granularity of 128-byte transactions
  – requests are further decomposed into aligned cache lines
    • read-only cache: 128 bytes
    • L2 cache: 32 bytes
• Minimize loading redundant cache lines to maximize bandwidth utilization
  – aligned access to memory
  – sequential access pattern
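A sketch contrasting the two patterns (names are illustrative, and i * 32 is assumed to fit in an int): in the first kernel the 32 threads of a warp read 32 consecutive 4-byte floats, i.e. one 128-byte line; in the second, a stride of 32 floats puts each thread of the warp on a different 128-byte line.

    __global__ void coalesced(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i];                 /* one 128-B transaction per warp */
    }

    __global__ void strided(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[(i * 32) % n];      /* 32 different lines per warp    */
    }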
Backup
GPU Architecture
Performance Notes: Bandwidth Utilization II
• Little’s Law: L = λW
  – L = average number of customers in a store
  – λ = arrival rate
  – W = average time each customer spends in the store
• Applied to memory: bandwidth plays the role of λ and memory latency the role of W, so L is the number of memory requests that must be in flight to keep the memory system busy
  – tens of thousands of in-flight requests!!!
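A rough worked example, assuming ~250 GB/s of bandwidth and ~500 ns of memory latency (both ballpark figures for a Kepler-class card):

    /* Little's Law: bytes in flight = bandwidth (lambda) x latency (W)          */
    double bandwidth = 250e9;                  /* bytes per second (assumed)     */
    double latency   = 500e-9;                 /* seconds (assumed)              */
    double in_flight = bandwidth * latency;    /* = 125,000 bytes                */
    /* ~977 concurrent 128-byte transactions, or ~31,000 concurrent 4-byte loads:
       hence "tens of thousands" of requests in flight                           */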
In summary
• Use as many “cheap” threads as possible
  – maximizes occupancy
  – increases the number of memory requests
• Avoid thread divergence
  – if unavoidable, diverge at the warp level
• Use aligned and sequential data access patterns
  – minimize redundant data loads
CUDA by example: Quicksort
• Let’s now consider quicksort on a GPU
• Step 1: partition the initial list
  – how do we partition the list amongst thread blocks? (one possible approach is sketched after the figure below)
  – recall that thread blocks CANNOT co-operate and thread blocks can go in ANY order
  – however, we need MANY threads and thread blocks in order to see good performance
CUDA by example: Quicksort (figure): the 16-element list [4 2 3 5 | 6 1 9 3 | 4 7 6 5 | 9 8 3 1] is split into four equal chunks, one per thread block (blocks 0 through 3).
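Not the lecture’s prescribed solution, just a minimal sketch of the partition step under these assumptions: a pivot has already been chosen, out is a scratch array the same size as in, and lo_count / hi_count are single ints in global memory initialized to zero. Each thread claims an output slot with an atomic, so thread blocks never need to cooperate and can run in any order (the resulting partition is not stable).

    __global__ void partitionStep(const int *in, int *out, int n, int pivot,
                                  int *lo_count, int *hi_count)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        int v = in[i];
        if (v < pivot) {
            int pos = atomicAdd(lo_count, 1);   /* next free slot from the left  */
            out[pos] = v;
        } else {
            int pos = atomicAdd(hi_count, 1);   /* next free slot from the right */
            out[n - 1 - pos] = v;
        }
    }

    /* After the grid finishes (implicit global sync), *lo_count elements smaller
       than the pivot sit at the front of out and the rest at the back; launch
       new grids to recurse on the two halves.                                    */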