  1. CUDA OPTIMIZATION WITH NVIDIA NSIGHT™ VISUAL STUDIO EDITION. CHRISTOPH ANGERER, NVIDIA | JULIEN DEMOUTH, NVIDIA

  2. WHAT YOU WILL LEARN: An iterative method to optimize your GPU code, and a way to conduct that method with NVIDIA Nsight VSE. Companion code: https://github.com/chmaruni/nsight-gtc2015

  3. INTRODUCING THE APPLICATION: An image-processing pipeline with three stages: Grayscale → Blur → Edges

  4. INTRODUCING THE APPLICATION: Grayscale Conversion
     // r, g, b: red, green, blue components of the pixel p
     foreach pixel p:
         p = 0.298839f*r + 0.586811f*g + 0.114350f*b;
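The per-pixel formula maps directly onto one CUDA thread per pixel. A minimal sketch of such a kernel, assuming a uchar4 RGBA input and a single-channel 8-bit output (the kernel name and data layout are illustrative assumptions, not the companion code):

    __global__ void grayscale_kernel(const uchar4 *rgba, unsigned char *gray,
                                     int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < height) {
            uchar4 p = rgba[y * width + x];   // p.x/p.y/p.z = r/g/b
            float luma = 0.298839f * p.x + 0.586811f * p.y + 0.114350f * p.z;
            gray[y * width + x] = (unsigned char)(luma + 0.5f);  // round to nearest
        }
    }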

  5. INTRODUCING THE APPLICATION: Blur, 7x7 Gaussian Filter
     foreach pixel p:
         p = weighted sum of p and its 48 neighbors
     Weights:
         1  2  3  4  3  2  1
         2  4  6  8  6  4  2
         3  6  9 12  9  6  3
         4  8 12 16 12  8  4
         3  6  9 12  9  6  3
         2  4  6  8  6  4  2
         1  2  3  4  3  2  1
     (Image from Wikipedia)
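As a reference point, a naive CUDA version of this filter (one thread per output pixel, 49 global loads each) might look like the sketch below. This is an assumed reconstruction of what a "_v0" kernel could look like, not the companion code; the weights live in constant memory and sum to 256, and borders are clamped:

    __constant__ int c_weights[7][7];   // filled with the 7x7 kernel above

    __global__ void gaussian_filter_7x7(const unsigned char *src,
                                        unsigned char *dst,
                                        int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= width || y >= height) return;

        int sum = 0;
        for (int j = -3; j <= 3; ++j)        // 7 rows
            for (int i = -3; i <= 3; ++i) {  // 7 columns
                int xi = min(max(x + i, 0), width  - 1);  // clamp to image
                int yj = min(max(y + j, 0), height - 1);
                sum += c_weights[j + 3][i + 3] * src[yj * width + xi];
            }
        dst[y * width + x] = (unsigned char)(sum / 256);  // weights sum to 256
    }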

  6. INTRODUCING THE APPLICATION: Edges, 3x3 Sobel Filters
     foreach pixel p:
         Gx = weighted sum of p and its 8 neighbors
         Gy = weighted sum of p and its 8 neighbors
         p = sqrt(Gx*Gx + Gy*Gy)
     Weights for Gx:       Weights for Gy:
         -1  0  1               1  2  1
         -2  0  2               0  0  0
         -1  0  1              -1 -2 -1
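A corresponding Sobel sketch for interior pixels (again an illustrative reconstruction, not the companion code); gx and gy unroll the two weight matrices directly:

    __global__ void sobel_filter_3x3(const unsigned char *src,
                                     unsigned char *dst,
                                     int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < 1 || y < 1 || x >= width - 1 || y >= height - 1) return;

        // 3x3 neighborhood (the center pixel has weight 0 in both filters)
        int p00 = src[(y-1)*width + x-1], p01 = src[(y-1)*width + x], p02 = src[(y-1)*width + x+1];
        int p10 = src[ y   *width + x-1],                             p12 = src[ y   *width + x+1];
        int p20 = src[(y+1)*width + x-1], p21 = src[(y+1)*width + x], p22 = src[(y+1)*width + x+1];

        int gx = -p00 + p02 - 2*p10 + 2*p12 - p20 + p22;   // horizontal gradient
        int gy =  p00 + 2*p01 + p02 - p20 - 2*p21 - p22;   // vertical gradient
        float mag = sqrtf((float)(gx*gx + gy*gy));
        dst[y*width + x] = (unsigned char)fminf(mag, 255.0f);
    }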

  7. ENVIRONMENT: NVIDIA GTX Titan X (GM200, SM 5.2), Windows 7, NVIDIA Nsight Visual Studio Edition 4.6

  8. PERFORMANCE OPTIMIZATION CYCLE
     1. Profile Application
     2. Identify Performance Limiter
     3. Analyze Profile & Find Indicators
     4. Reflect (4b. Build Knowledge)
     5. Change and Test Code
     (Chameleon image from http://www.vectorportal.com, Creative Commons)

  9. PREREQUISITES
     Basic understanding of the GPU memory hierarchy:
     - Global memory (slow, generous)
     - Shared memory (fast, limited)
     - Registers (very fast, very limited)
     - (Texture cache)
     Basic understanding of the CUDA execution model:
     - Grid: 1D/2D/3D
     - Block: 1D/2D/3D
     - Warp-synchronous execution (32 threads per warp)
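As a refresher on how the grid/block/warp hierarchy appears in code, a generic 2D launch that covers a width x height image with one thread per pixel (the block shape here is an arbitrary example; it reuses the grayscale_kernel sketch from slide 4):

    #include <cuda_runtime.h>

    void launch_grayscale(const uchar4 *d_rgba, unsigned char *d_gray,
                          int width, int height)
    {
        dim3 block(32, 8);                           // 2D block: 256 threads = 8 warps
        dim3 grid((width  + block.x - 1) / block.x,  // round up so the grid
                  (height + block.y - 1) / block.y); //   covers every pixel
        grayscale_kernel<<<grid, block>>>(d_rgba, d_gray, width, height);
    }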

  10. ITERATION 1

  11. TRACING THE APPLICATION: Verify the parameters, select "Trace Application", activate CUDA tracing, then Launch.

  12. NAVIGATING THE ANALYSIS REPORTS: Timeline, CUDA Summary, CUDA Launches

  13. TIMELINE

  14. IDENTIFY HOTSPOT (CUDA SUMMARY)
     The hotspot is gaussian_filter_7x7_v0().
     Kernel            | Time     | Speedup
     Original Version  | 1.971ms  | 1.00x

  15. PERFORM KERNEL ANALYSIS: Select "Profile CUDA Application", select the kernel, select the experiments (All), then Launch.

  16. THE CUDA LAUNCHES VIEW: Select the kernel, then the experiment, to view the experiment results.

  17. IDENTIFY MAIN PERFORMANCE LIMITER
     Compare Memory Utilization vs Compute Utilization (high means above roughly 60%).
     Four possible combinations:
     - Compute high, memory low: Compute Bound
     - Memory high, compute low: Bandwidth Bound
     - Both low: Latency Bound
     - Both high: Compute and Bandwidth Bound

  18. MEMORY BANDWIDTH (diagram): each SM has Registers and SMEM/L1$; the SMs share the L2$, which fronts Global Memory (the framebuffer).

  19. IDENTIFY PERFORMANCE LIMITER: Utilization of the L2$ bandwidth (BW) is low and DRAM BW utilization is < 2% → not limited by memory bandwidth.

  20. INSTRUCTION THROUGHPUT
     Each SM (Maxwell) has 4 schedulers, a 256KB register file, 96KB of shared
     memory, and a TEX/L1$. Schedulers issue instructions to pipes:
     - Each scheduler can issue up to 2 instructions per cycle
     - A scheduler issues instructions from a single warp
     - It cannot issue to a pipe if that pipe's issue slot is full

  21. INSTRUCTION THROUGHPUT
     (Diagram: three example profiles.) Schedulers and pipe saturated
     (scheduler utilization 92%), schedulers saturated (90%), pipe saturated
     (64%); the bars show per-pipe issue-slot utilization for the shared
     memory, control flow, ALU, and texture pipes.

  22. WARP ISSUE EFFICIENCY: Percentage of issue slots used (blue), aggregated over all the schedulers.

  23. PIPE UTILIZATION: Percentage of issue slots used per pipe, accounting for pipe throughputs. Four groups of pipes: Shared Memory, Texture, Control Flow, Arithmetic (ALU).

  24. INSTRUCTION THROUGHPUT: Neither the schedulers nor the pipes are saturated, so we are not limited by instruction throughput → our kernel is latency bound.

  25. LOOKING FOR INDICATORS: 56% of theoretical occupancy, 29.35 active warps per cycle, 1.18 eligible warps per cycle. Let's start with occupancy.

  26. OCCUPANCY
     Each SM has limited resources:
     - 64K registers (32-bit), shared between threads
     - Up to 48KB of shared memory per block (96KB per SMM)
     - 32 active blocks per SMM
     Full occupancy: 2048 threads per SM (64 warps). The values vary with
     Compute Capability.
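The CUDA runtime can report how many blocks of a kernel fit on an SM given the kernel's actual register and shared-memory usage. A small sketch of such an occupancy check (my_kernel is a placeholder):

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void my_kernel(float *data) { /* ... */ }

    void report_occupancy(int blockSize)
    {
        int blocksPerSM = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocksPerSM, my_kernel, blockSize, /*dynamicSMemBytes=*/0);

        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        float occupancy = (float)(blocksPerSM * blockSize)
                        / prop.maxThreadsPerMultiProcessor;
        printf("%d blocks/SM -> %.0f%% occupancy\n", blocksPerSM, occupancy * 100.f);
    }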

  27. LATENCY
     GPUs cover latencies by having a lot of work in flight: while one warp
     waits (latency) after issuing, other warps issue in the meantime.
     (Diagram: with warps 0-9 in flight the latency is fully covered; with
     too few warps the latency is exposed and there are cycles where no warp
     is issuing.)

  28. LATENCY: LACK OF OCCUPANCY
     Not enough active warps (diagram: only warps 0-3 in flight): there are
     cycles where no warp issues. The schedulers cannot find eligible warps
     at every cycle.

  29. LOOKING FOR MORE INDICATORS: We don't want to change the register count yet, and the block size seems OK.

  30. CONTINUE LOOKING FOR INDICATORS 4-8 L2 Transactions per 1 Request

  31. MEMORY TRANSACTIONS: BEST CASE
     A warp issues a 32x4B aligned and consecutive load/store request:
     threads read different elements of the same 128B segment.
     - 1x 128B load/store request per warp
     - 4x 32B L2 transactions per warp
     4x L2 transactions: 128B needed / 128B transferred.
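In code, the best case is the familiar pattern where thread k of a warp touches element k; a minimal illustration:

    // Thread k reads element k: the warp's 32 consecutive 4B loads fall into
    // one 128B segment -> 1 request, 4x 32B L2 transactions, nothing wasted.
    __global__ void coalesced_copy(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }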

  32. MEMORY TRANSACTIONS: WORST CASE
     Threads in a warp read/write 4B words with 128B between them
     (stride: 32x4B): each thread reads the first 4B of a different 128B
     segment.
     - 1x 128B load/store request per warp
     - 1x 32B L2 transaction per thread
     32x L2 transactions: 128B needed / 32x32B transferred.
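The worst case is the same copy with a 32-element (128B) stride between threads; an illustration:

    // Each thread's 4B load lands in a different 128B segment -> 1 request,
    // 32x 32B L2 transactions, 1024B transferred for 128B of useful data.
    __global__ void strided_copy(const float *in, float *out, int n)
    {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * 32;  // stride 32x4B
        if (i < n) out[i] = in[i];
    }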

  33. TRANSACTIONS AND REPLAYS
     A warp reads from addresses spanning 3 lines of 128B:
     - 1st line: threads 0-7 and 24-31
     - 2nd line: threads 8-15
     - 3rd line: threads 16-23
     The instruction is issued once and re-issued twice (1st and 2nd replay):
     1 instruction executed and 2 replays = 1 request and 3 transactions.

  34. TRANSACTIONS AND REPLAYS
     With replays, requests take more time and use more resources:
     - More instructions issued: extra work for the SM
     - More memory traffic
     - Increased execution time: extra latency until the last transaction for
       an instruction completes

  35. CHANGING THE BLOCK LAYOUT
     Our blocks are 8x8 threads. With stride-1 uchar accesses, warp 0
     (threads 0-31) covers four 8-pixel rows in four different image rows, so
     each 32B transaction carries only 8B of useful data (data overfetch). We
     should use blocks of size 32x2 instead: each warp then reads one
     contiguous 32B row. (Diagram: thread-index layout for both block shapes;
     a launch sketch follows below.)
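The fix is purely a launch-configuration change (assuming the gaussian_filter_7x7 sketch from slide 5):

    void launch_gaussian(const unsigned char *d_src, unsigned char *d_dst,
                         int width, int height)
    {
        dim3 block(32, 2);  // was (8, 8): same 64 threads per block, but each
                            //   warp now reads one contiguous 32B row of uchars
        dim3 grid((width  + block.x - 1) / block.x,
                  (height + block.y - 1) / block.y);
        gaussian_filter_7x7<<<grid, block>>>(d_src, d_dst, width, height);
    }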

  36. IMPROVED MEMORY ACCESS
     With blocks of size 32x2, memory is used more efficiently.
     Kernel                  | Time     | Speedup
     Original Version        | 1.971ms  | 1.00x
     Better Memory Accesses  | 0.725ms  | 2.72x

  37. PERF-OPT QUICK REFERENCE CARD
     Category: Latency Bound - Occupancy
     Problem: Latency is exposed due to low occupancy
     Goal: Hide latency behind more parallel work
     Indicators: Occupancy low (< 60%), execution dependency high
     Strategy: Increase occupancy by:
     • Varying block size
     • Varying shared memory usage
     • Varying register count
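One concrete way to act on the "varying register count" strategy is the __launch_bounds__ qualifier, which asks the compiler to cap register usage so that a given number of blocks can be resident per SM (the numbers below are illustrative, not tuned values from the talk):

    __global__ void
    __launch_bounds__(128 /* maxThreadsPerBlock */, 16 /* minBlocksPerSM */)
    gaussian_filter_7x7_bounded(const unsigned char *src, unsigned char *dst,
                                int width, int height)
    {
        // same body as the naive filter; the compiler now limits registers
        // so that 16 blocks x 128 threads = 2048 threads can be resident
        // per SM (full occupancy on Maxwell)
    }

A coarser alternative is the nvcc flag -maxrregcount, which applies one register cap to the whole compilation unit.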

  38. PERF-OPT QUICK REFERENCE CARD
     Category: Latency Bound - Coalescing
     Problem: Memory is accessed inefficiently => high latency
     Goal: Reduce #transactions/request to reduce latency
     Indicators: Low global load/store efficiency, high #transactions/#requests compared to ideal
     Strategy: Improve memory coalescing by:
     • Cooperative loading inside a block
     • Changing the block layout
     • Aligning data
     • Changing the data layout to improve locality
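A sketch of "cooperative loading inside a block": a 32x2 block stages a (32+6) x (2+6) tile (halo of radius 3 for the 7x7 filter) into shared memory with coalesced loads, then filters from the fast on-chip copy. The tile shape and clamping are illustrative assumptions; c_weights is the constant array from the slide-5 sketch:

    #define BX 32
    #define BY 2
    #define R  3   // filter radius

    __global__ void gaussian_filter_7x7_smem(const unsigned char *src,
                                             unsigned char *dst,
                                             int width, int height)
    {
        __shared__ unsigned char tile[BY + 2*R][BX + 2*R];

        int x0 = blockIdx.x * BX - R;   // tile origin in the image, incl. halo
        int y0 = blockIdx.y * BY - R;

        // All 64 threads cooperate: consecutive threadIdx.x values load
        // consecutive addresses, so every pass over the tile coalesces.
        for (int j = threadIdx.y; j < BY + 2*R; j += BY)
            for (int i = threadIdx.x; i < BX + 2*R; i += BX) {
                int xi = min(max(x0 + i, 0), width  - 1);
                int yj = min(max(y0 + j, 0), height - 1);
                tile[j][i] = src[yj * width + xi];
            }
        __syncthreads();

        int x = blockIdx.x * BX + threadIdx.x;
        int y = blockIdx.y * BY + threadIdx.y;
        if (x >= width || y >= height) return;

        int sum = 0;
        for (int j = -R; j <= R; ++j)
            for (int i = -R; i <= R; ++i)
                sum += c_weights[j + R][i + R]
                     * tile[threadIdx.y + R + j][threadIdx.x + R + i];
        dst[y * width + x] = (unsigned char)(sum / 256);
    }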

  39. PERF-OPT QUICK REFERENCE CARD
     Category: Bandwidth Bound - Coalescing
     Problem: Too much unused data clogging the memory system
     Goal: Reduce traffic, move more useful data per request
     Indicators: Low global load/store efficiency, high #transactions/#requests compared to ideal
     Strategy: Improve memory coalescing by:
     • Cooperative loading inside a block
     • Changing the block layout
     • Aligning data
     • Changing the data layout to improve locality

  40. ITERATION 2

  41. IDENTIFY HOTSPOT
     gaussian_filter_7x7_v0() is still the hotspot.
     Kernel                  | Time     | Speedup
     Original Version        | 1.971ms  | 1.00x
     Better Memory Accesses  | 0.725ms  | 2.72x
