Cartoon parallel architectures; CPUs and GPUs (CSE 6230, Fall 2014)


  1. “Cartoon” parallel architectures; CPUs and GPUs. CSE 6230, Fall 2014 (Th Sep 11). Thanks to Jee Choi (a senior PhD student) for a big “assist”.

  2.–14. [Figure-only slides: the “cartoon” model of a parallel machine, built up step by step]

  15. ~ socket

  16. ~ core

  17. ~ HWMT+SIMD (“SIMT”)


  18. Intel E5-2687W “Sandy Bridge-EP” vs. NVIDIA K20X “Kepler”

  19. Intel E5-2687W “Sandy Bridge-EP” vs. NVIDIA K20X “Kepler”: ~500 GF/s (single precision)

  20. Intel E5-2687W “Sandy Bridge-EP” vs. NVIDIA K20X “Kepler”: ~500 GF/s vs. ~4 TF/s (single precision)

  21. Intel E5-2687W “Sandy Bridge-EP” vs. NVIDIA K20X “Kepler”

  22. Intel E5-2687W “Sandy Bridge-EP” vs. NVIDIA K20X “Kepler”: ~50 GB/s

  23. Intel E5-2687W “Sandy Bridge-EP” vs. NVIDIA K20X “Kepler”: ~50 GB/s vs. ~250 GB/s

  24. Intel E5-2687W “Sandy Bridge-EP” vs. NVIDIA K20X “Kepler”: ~50 GB/s vs. ~250 GB/s, 6 GB/s

  25. System Comparison

                                    Intel Xeon E5-2687W              NVIDIA K20X                     Difference
      # cores / # SMXs              8                                14                              1.75×
      Clock frequency (max)         3.8 GHz                          735 MHz                         0.20×
      SIMD width                    256 bits                         2688 SP + 896 DP
                                                                     thread processors
      Performance (single)          8 cores × 3.8 GHz ×              2688 × 735 MHz ×                8.12×
                                    (8 Add + 8 Mul) = 486 GF/s       2 (FMA) = 3951 GF/s
      Performance (double)          8 cores × 3.8 GHz ×              896 × 735 MHz ×                 5.42×
                                    (4 Add + 4 Mul) = 243 GF/s       2 (FMA) = 1317 GF/s
      Memory bandwidth              51.2 GB/s                        250 GB/s                        4.88×
      TDP                           150 W                            235 W                           1.57×
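
The peak-performance rows follow from cores × clock × flops per cycle. A minimal sketch of that arithmetic (plain host code; the only inputs are the numbers already in the table above):

      #include <cstdio>

      int main ()
      {
        /* CPU: 8 cores x 3.8 GHz x (8 Add + 8 Mul) per cycle (single precision) */
        double cpu_sp = 8 * 3.8e9 * 16 / 1e9;      /* ~486 GF/s  */
        /* GPU: 2688 SP thread processors x 735 MHz x 2 flops per cycle (FMA)    */
        double gpu_sp = 2688 * 735e6 * 2 / 1e9;    /* ~3951 GF/s */

        printf ("CPU SP peak: %.1f GF/s\n", cpu_sp);
        printf ("GPU SP peak: %.1f GF/s (%.2fx)\n", gpu_sp, gpu_sp / cpu_sp);
        return 0;
      }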

  26.–27. [Figure-only slides]

  28. 6 GB/s [figure-only slide]

  29.–35. [Figure-only slides]

  36. “CUDA” is NVIDIA’s implementation of this execution model

  37. Thread hierarchy: “single instruction, multiple thread” (SIMT)

  38. An example to compare models

      Naïve:
        for (i = 0; i < N; i++)
          A[i] += 2;

      OpenMP:
        #pragma omp parallel for
        for (i = 0; i < N; i++)
          A[i] += 2;

      CUDA, with N threads:
        int i = f(global thread ID);
        A[i] += 2;
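
A compilable version of the CUDA column is sketched below; the kernel name addTwo, the bounds check, and the 256-thread block size are illustrative choices, not from the slides. f(global thread ID) becomes the usual threadIdx/blockIdx/blockDim expression.

      __global__ void addTwo (int* A, int N)
      {
        int i = threadIdx.x + blockDim.x * blockIdx.x;   /* f(global thread ID) */
        if (i < N)                                       /* guard: N need not be a multiple of the block size */
          A[i] += 2;
      }

      /* launch with (at least) N threads, e.g.: */
      /*   addTwo <<< (N + 255) / 256, 256 >>> (d_A, N); */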

  39. Global thread IDs [figure: 4 blocks of 4 threads; blockIdx.x and threadIdx.x combine to give global IDs 0–15]

  40. Global thread IDs [figure: the same global IDs 0–15 mapped onto the elements of array A]

  41. Thread hierarchy
      • Given a 3-D grid of thread blocks
        – there are (gridDim.x * gridDim.y * gridDim.z) thread blocks in the grid
        – each block’s position is identified by blockIdx.x, blockIdx.y, and blockIdx.z
      • Similarly for a 3-D thread block
        – blockDim.x, blockDim.y, blockDim.z
        – threadIdx.x, threadIdx.y, threadIdx.z
      • Thread-to-data mapping depends on how the work is divided amongst the threads (see the sketch below)
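
One common (assumed, illustrative) way to realize such a mapping is to flatten the 3-D block and thread indices into a single linear global ID:

      __global__ void kernel3d (int* out)
      {
        /* linear block ID within the 3-D grid */
        int blockId = blockIdx.x
                    + blockIdx.y * gridDim.x
                    + blockIdx.z * gridDim.x * gridDim.y;

        /* linear thread ID within the 3-D thread block */
        int threadId = threadIdx.x
                     + threadIdx.y * blockDim.x
                     + threadIdx.z * blockDim.x * blockDim.y;

        /* one possible global ID (bounds checks omitted); the actual
           thread-to-data mapping is entirely up to the programmer */
        int globalId = threadId + blockId * (blockDim.x * blockDim.y * blockDim.z);
        out[globalId] = globalId;
      }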

  42. Memory hierarchy

      Memory space                   Scope
      variables (registers)          thread
      local memory                   thread
      shared memory                  block
      global memory                  grid
      constant memory (read-only)    grid
      texture memory (read-only)     grid
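
A small sketch of how these spaces appear in CUDA C (the kernel, its names, and the 256-thread block assumption are illustrative; texture memory is omitted):

      __constant__ float coeff[4];        /* constant memory: read-only, visible to the whole grid */

      __global__ void scale (float* out, const float* in)   /* in/out live in global memory */
      {
        __shared__ float tile[256];       /* shared memory: one copy per thread block (assumes 256 threads/block) */
        float x;                          /* per-thread variable: register (or local memory if spilled) */

        int gId = threadIdx.x + blockDim.x * blockIdx.x;
        tile[threadIdx.x] = in[gId];
        __syncthreads();

        x = tile[threadIdx.x] * coeff[0];
        out[gId] = x;
      }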

  43. CUDA by example: Basic CUDA code

      __global__ void test (int* in, int* out, int N)
      {
        int gId = threadIdx.x + blockDim.x * blockIdx.x;
        out[gId] = in[gId];
      }

      int main (int argc, char** argv)
      {
        int N = 1048576;
        int tbSize = 256;
        int nBlocks = N / tbSize;

        dim3 grid (nBlocks);
        dim3 block (tbSize);

        /* d_in, d_out are device arrays, allocated as on the next slide */
        test <<<grid, block>>> (d_in, d_out, N);
        cudaThreadSynchronize ();
      }

  44. CUDA by example: Basic CUDA code

      int main (int argc, char** argv)
      {
        /* allocate memory for host and device */
        int *h_in, *h_out, *d_in, *d_out;
        h_in  = (int*) malloc (N * sizeof (int));
        h_out = (int*) malloc (N * sizeof (int));
        cudaMalloc ((void**) &d_in,  N * sizeof (int));    /* allocate memory on device */
        cudaMalloc ((void**) &d_out, N * sizeof (int));

        /* copy data from host to device (CPU to GPU) */
        cudaMemcpy (d_in, h_in, N * sizeof (int), cudaMemcpyHostToDevice);

        /* body of the problem here */
        . . .

        /* copy data back to host (GPU to CPU) */
        cudaMemcpy (h_out, d_out, N * sizeof (int), cudaMemcpyDeviceToHost);

        /* free memory */
        free (h_in); free (h_out);
        cudaFree (d_in); cudaFree (d_out);
      }

  45. CUDA by example: What is this code doing?

      __global__ void mysteryFunction (int* in)
      {
        int tidx, tidy, gIdX, gIdY;
        tidx = threadIdx.x;
        tidy = threadIdx.y;
        gIdX = tidx + blockDim.x * blockIdx.x;
        gIdY = tidy + blockDim.y * blockIdx.y;

        __shared__ int buffer[16][16];

        buffer[tidx][tidy] = in[gIdX + gIdY * blockDim.x * gridDim.x];
        __syncthreads();

        int temp;
        if (tidx > 0 && tidy > 0) {
          temp = (buffer[tidx][tidy - 1] +
                  buffer[tidx][tidy + 1] +
                  buffer[tidx - 1][tidy] +
                  buffer[tidx + 1][tidy] +
                  buffer[tidx][tidy]) / 5;
        } else {
          /* take care of boundary conditions */
        }
        in[gIdX + gIdY * blockDim.x * gridDim.x] = temp;
      }

  46. CUDA by example: What is this code doing?

      __global__ void mysteryFunction (int* in)
      {
        int tidx, tidy, gIdX, gIdY;
        tidx = threadIdx.x;
        tidy = threadIdx.y;
        gIdX = tidx + blockDim.x * blockIdx.x;
        gIdY = tidy + blockDim.y * blockIdx.y;

        __shared__ int buffer[16][16];           /* shared memory */

        buffer[tidx][tidy] = in[gIdX + gIdY * blockDim.x * gridDim.x];
        __syncthreads();                         /* why do we need this? */

        int temp;
        if (tidx > 0 && tidy > 0) {
          temp = (buffer[tidx][tidy - 1] +
                  buffer[tidx][tidy + 1] +
                  buffer[tidx - 1][tidy] +
                  buffer[tidx + 1][tidy] +
                  buffer[tidx][tidy]) / 5;
        } else {
          /* take care of boundary conditions */
        }
        in[gIdX + gIdY * blockDim.x * gridDim.x] = temp;
      }

  47. Synchronization
      • Within a thread block
        – via __syncthreads()
      • Global synchronization
        – implicit synchronization between kernels
        – the only way to synchronize globally is to finish the grid and start another grid (see the sketch below)
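
A minimal sketch of the two levels (the kernel names phase1/phase2 and the host wrapper are illustrative): __syncthreads() inside a kernel only covers one block; a kernel boundary is the global barrier.

      __global__ void phase1 (float* data) { /* ... uses __syncthreads() for block-level barriers ... */ }
      __global__ void phase2 (float* data) { /* ... */ }

      void run (float* d_data, int nBlocks, int tbSize)
      {
        phase1 <<<nBlocks, tbSize>>> (d_data);
        /* implicit global synchronization: on the same stream, phase2 does not
           start until every thread block of phase1's grid has finished */
        phase2 <<<nBlocks, tbSize>>> (d_data);
        cudaDeviceSynchronize ();               /* make the host wait for the GPU */
      }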

  48. Scheduling
      • Each thread block gets scheduled on a multiprocessor (SMX)
        – there is no guarantee of the order in which they get scheduled
        – thread blocks run independently of each other
      • Multiple thread blocks can reside on a single SMX simultaneously (occupancy)
        – the number of resident thread blocks is determined by resource usage and availability (shared memory and registers)
      • Once scheduled, each thread block runs to completion

  49. Execution
      • Minimum unit of execution: the warp
        – typically 32 threads
      • At any given time, multiple warps will be executing
        – they could be from the same or from different thread blocks
      • A warp of threads can be either
        – executing
        – waiting (for data, or for its turn)
      • When a warp stalls, it can be switched out “instantaneously” so that another warp can start executing
        – hardware multithreading

  50. Performance Notes: Thread Divergence
      • On a branch, threads in a warp can diverge
        – execution is serialized
        – threads taking one branch execute while the others idle
      • Avoid divergence!
        – use bitwise operations when possible
        – diverge at the granularity of warps (no penalty); see the sketch below
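
A sketch of the three cases in one (illustrative) kernel: a divergent branch, a branch-free bitwise alternative, and a branch that diverges only at warp granularity.

      __global__ void divergenceExamples (int* data)
      {
        int gId = threadIdx.x + blockDim.x * blockIdx.x;

        /* Divergent: even and odd threads of the SAME warp take different
           branches, so the two paths are serialized. */
        if (threadIdx.x % 2 == 0)
          data[gId] += 1;
        else
          data[gId] -= 1;

        /* Branch-free alternative using a bitwise operation. */
        int parity = threadIdx.x & 1;     /* 0 for even lanes, 1 for odd lanes */
        data[gId] += 1 - 2 * parity;      /* +1 or -1 without branching        */

        /* Diverging at warp granularity: all 32 threads of a warp take the
           same branch, so there is no divergence penalty. */
        if ((threadIdx.x / 32) % 2 == 0)
          data[gId] *= 2;
      }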

  51. Performance Notes: Occupancy
      • Occupancy = # resident warps / max # warps
        – # resident warps is determined by per-thread register and per-block shared memory usage
        – max # warps is specific to the hardware generation
      • More warps means more threads with which to hide latency
        – increases the chance of keeping the GPU busy at all times
        – does not necessarily mean better performance (a back-of-the-envelope calculation follows below)
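
A back-of-the-envelope occupancy calculation. The per-SMX limits (64 resident warps, 48 KB shared memory, 65536 registers) are the usual Kepler figures, and the per-block resource usage here is assumed purely for illustration:

      #include <cstdio>

      int main ()
      {
        const int maxWarps = 64, smemPerSMX = 48 * 1024, regsPerSMX = 65536;
        const int tbSize = 256;                        /* threads per block (8 warps)  */
        const int smemPerBlock = 12 * 1024;            /* assumed shared memory usage  */
        const int regsPerThread = 40;                  /* assumed register usage       */

        int blocksBySmem = smemPerSMX / smemPerBlock;              /* 4 blocks */
        int blocksByRegs = regsPerSMX / (regsPerThread * tbSize);  /* 6 blocks */
        int blocks = blocksBySmem < blocksByRegs ? blocksBySmem : blocksByRegs;

        int residentWarps = blocks * (tbSize / 32);
        printf ("occupancy = %d/%d warps = %.0f%%\n",
                residentWarps, maxWarps, 100.0 * residentWarps / maxWarps);
        return 0;
      }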

  52. Performance Notes: Bandwidth Utilization
      • Reads from DRAM occur at the granularity of 128-byte transactions
        – requests are further decomposed into aligned cache lines
          • read-only cache: 128 bytes
          • L2 cache: 32 bytes
      • Minimize loading redundant cache lines to maximize bandwidth utilization
        – aligned accesses to memory
        – sequential access patterns (see the sketch below)
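
An illustrative kernel contrasting the two access patterns (the kernel name and the stride parameter are assumptions):

      __global__ void accessPatterns (float* out, const float* in, int N, int stride)
      {
        int gId = threadIdx.x + blockDim.x * blockIdx.x;
        if (gId >= N) return;

        /* Aligned, sequential: consecutive threads read consecutive 4-byte words,
           so a warp's 32 loads fall into a single 128-byte transaction. */
        out[gId] = in[gId];

        /* Strided: consecutive threads touch addresses 'stride' elements apart,
           so each warp pulls in many cache lines but uses only part of each one. */
        out[gId] += in[(gId * stride) % N];
      }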

  53.–55. Performance Notes: Bandwidth Utilization [figure-only slides]

  56. Backup

  57. GPU Architecture

  58. Performance Notes: Bandwidth Utilization II
      • Little’s Law: L = λW
        – L = average number of customers in a store
        – λ = arrival rate
        – W = average time spent

  59. Performance Notes: Bandwidth Utilization II
      • Little’s Law: L = λW
        – L = average number of customers in a store
        – λ = arrival rate
        – W = average time spent
      • Applied to memory: bandwidth (λ), latency (W)

  60. Performance Notes: Bandwidth Utilization II
      • Little’s Law: L = λW
        – L = average number of customers in a store
        – λ = arrival rate
        – W = average time spent
      • Applied to memory: bandwidth (λ) × latency (W) means tens of thousands of in-flight requests!!! (worked out below)
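
Plugging in rough numbers (the ~1 µs global-memory latency is an assumed, order-of-magnitude value) shows where “tens of thousands” comes from:

      #include <cstdio>

      int main ()
      {
        double bandwidth = 250e9;      /* bytes/s (lambda), from the K20X figure */
        double latency   = 1e-6;       /* seconds (W), assumed order of magnitude */
        double inFlightBytes = bandwidth * latency;          /* L = lambda * W */

        printf ("bytes in flight: %.0f\n", inFlightBytes);   /* ~250 KB          */
        printf ("4-byte requests in flight: %.0f\n",
                inFlightBytes / 4);                          /* ~62,500 requests */
        return 0;
      }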

  61. In summary
      • Use as many “cheap” threads as possible
        – maximizes occupancy
        – increases the number of memory requests
      • Avoid thread divergence
        – if unavoidable, diverge at the warp level
      • Use aligned and sequential data access patterns
        – minimize redundant data loads

  62. CUDA by example: Quicksort
      • Let’s now consider quicksort on a GPU
      • Step 1: partition the initial list
        – how do we partition the list amongst thread blocks?
        – recall that thread blocks CANNOT cooperate and thread blocks can run in ANY order
        – however, we need MANY threads and thread blocks in order to see good performance (see the sketch after the next slide)

  63. CUDA by example: Quicksort
      [figure: the list 4 2 3 5 | 6 1 9 3 | 4 7 6 5 | 9 8 3 1 split across thread blocks 0, 1, 2, 3]
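
A minimal sketch of one way to start such a partition, not the lecture’s solution: each block independently counts how many of its elements fall below an (assumed) pivot; a later pass (prefix sum plus scatter, not shown) would use these per-block counts to place the elements. All names are illustrative.

      #include <cstdio>
      #include <cuda_runtime.h>

      __global__ void countLessThanPivot (const int* in, int N, int pivot, int* blockCounts)
      {
        __shared__ int count;                     /* per-block counter in shared memory */
        if (threadIdx.x == 0) count = 0;
        __syncthreads();

        int gId = threadIdx.x + blockDim.x * blockIdx.x;
        if (gId < N && in[gId] < pivot)
          atomicAdd (&count, 1);                  /* threads of one block cooperate via shared memory */
        __syncthreads();

        if (threadIdx.x == 0)
          blockCounts[blockIdx.x] = count;        /* blocks do NOT cooperate; each writes its own slot */
      }

      int main ()
      {
        const int N = 16, tbSize = 4, nBlocks = N / tbSize;
        int h_in[N] = {4,2,3,5, 6,1,9,3, 4,7,6,5, 9,8,3,1};   /* the list from the slide */
        int h_counts[nBlocks];

        int *d_in, *d_counts;
        cudaMalloc ((void**) &d_in, N * sizeof (int));
        cudaMalloc ((void**) &d_counts, nBlocks * sizeof (int));
        cudaMemcpy (d_in, h_in, N * sizeof (int), cudaMemcpyHostToDevice);

        countLessThanPivot <<<nBlocks, tbSize>>> (d_in, N, /* pivot = */ 5, d_counts);
        cudaMemcpy (h_counts, d_counts, nBlocks * sizeof (int), cudaMemcpyDeviceToHost);

        for (int b = 0; b < nBlocks; b++)
          printf ("block %d: %d elements < pivot\n", b, h_counts[b]);

        cudaFree (d_in); cudaFree (d_counts);
        return 0;
      }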
