NVIDIA Nsight Eclipse Edition

CUDA Optimization with NVIDIA Nsight Eclipse Edition, by Christoph Angerer, NVIDIA, and Julien Demouth, NVIDIA (PowerPoint presentation)



  1. CUDA OPTIMIZATION WITH NVIDIA NSIGHT™ ECLIPSE EDITION — CHRISTOPH ANGERER, NVIDIA; JULIEN DEMOUTH, NVIDIA

  2. WHAT YOU WILL LEARN
     An iterative method to optimize your GPU code, and a way to conduct that method with NVIDIA Nsight EE.
     Companion code: https://github.com/chmaruni/nsight-gtc2015

  3. INTRODUCING THE APPLICATION
     A three-stage image-processing pipeline: Grayscale, Blur, Edges.

  4. INTRODUCING THE APPLICATION — Grayscale Conversion

     // r, g, b: red, green, blue components of the pixel p
     foreach pixel p:
         p = 0.298839f*r + 0.586811f*g + 0.114350f*b;
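The per-pixel math above can be sketched as a plain CPU function (the companion code runs the same formula inside a CUDA kernel; the function name here is hypothetical):

```cpp
#include <cmath>

// CPU sketch of the grayscale conversion; the real application applies
// this per-pixel formula in a CUDA kernel. The coefficients are the
// luma weights from the slide.
inline float to_grayscale(float r, float g, float b) {
    return 0.298839f * r + 0.586811f * g + 0.114350f * b;
}
```

The three weights sum to 1.0, so a pure white pixel (255, 255, 255) maps back to 255.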

  5. INTRODUCING THE APPLICATION — Blur: 7x7 Gaussian Filter

     foreach pixel p:
         p = weighted sum of p and its 48 neighbors

     Weights:
         1  2  3  4  3  2  1
         2  4  6  8  6  4  2
         3  6  9 12  9  6  3
         4  8 12 16 12  8  4
         3  6  9 12  9  6  3
         2  4  6  8  6  4  2
         1  2  3  4  3  2  1

     Image from Wikipedia
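The weight matrix is separable: it is the outer product of the 1-D stencil {1, 2, 3, 4, 3, 2, 1}. A small sketch of that structure (helper names are hypothetical; the slide does not state the divisor, but normalizing by the weight sum keeps overall brightness unchanged):

```cpp
#include <array>

// The 7x7 Gaussian weights are the outer product of a 1-D stencil.
constexpr std::array<int, 7> kStencil = {1, 2, 3, 4, 3, 2, 1};

// Weight at offset (dy, dx), with dy, dx in [0, 7).
inline int gaussian_weight(int dy, int dx) {
    return kStencil[dy] * kStencil[dx];
}

// Sum of all 49 weights, the natural normalization divisor.
inline int gaussian_weight_sum() {
    int sum = 0;
    for (int dy = 0; dy < 7; ++dy)
        for (int dx = 0; dx < 7; ++dx)
            sum += gaussian_weight(dy, dx);
    return sum;
}
```

The center weight is 4*4 = 16 and the weights sum to 16*16 = 256, matching the matrix on the slide.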

  6. INTRODUCING THE APPLICATION — Edges: 3x3 Sobel Filters

     foreach pixel p:
         Gx = weighted sum of p and its 8 neighbors
         Gy = weighted sum of p and its 8 neighbors
         p = sqrt(Gx^2 + Gy^2)

     Weights for Gx:      Weights for Gy:
         -1  0  1              1  2  1
         -2  0  2              0  0  0
         -1  0  1             -1 -2 -1
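A CPU sketch of this Sobel step, using the standard gradient magnitude sqrt(Gx^2 + Gy^2) (the function name and 3x3-patch interface are hypothetical):

```cpp
#include <cmath>

// Sobel edge response for one pixel, given its 3x3 neighborhood `p`.
inline float sobel_magnitude(const float p[3][3]) {
    const int wx[3][3] = {{-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1}};   // Gx weights
    const int wy[3][3] = {{ 1, 2, 1}, { 0, 0, 0}, {-1, -2, -1}}; // Gy weights
    float gx = 0.f, gy = 0.f;
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j) {
            gx += wx[i][j] * p[i][j];
            gy += wy[i][j] * p[i][j];
        }
    return std::sqrt(gx * gx + gy * gy);
}
```

On a vertical step edge (a column of 255s next to columns of 0s) Gx responds strongly while Gy cancels to zero.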

  7. ENVIRONMENT
     NVIDIA Tesla K40m (GK110B, SM 3.5, ECC off, 3004 MHz memory clock, 875 MHz SM clock)
     NVIDIA CUDA 7.0 release candidate
     Similar results are obtained on Windows.

  8. PERFORMANCE OPTIMIZATION CYCLE
     1. Profile Application
     2. Identify Performance Limiter
     3. Analyze Profile & Find Indicators
     4. Reflect (4b. Build Knowledge)
     5. Change and Test Code
     Chameleon from http://www.vectorportal.com, Creative Commons

  9. PREREQUISITES
     Basic understanding of the GPU memory hierarchy:
     • Global memory (slow, generous)
     • Shared memory (fast, limited)
     • Registers (very fast, very limited)
     • (Texture cache)
     Basic understanding of the CUDA execution model:
     • Grid 1D/2D/3D
     • Block 1D/2D/3D
     • Warp-synchronous execution (32 threads per warp)

  10. ITERATION 1

  11. CREATE A NEW NVVP SESSION

  12. THE PROFILER WINDOW: Timeline, Summary, Guide, Analysis Results

  13. TIMELINE

  14. EXAMINE INDIVIDUAL KERNELS (GUIDED ANALYSIS) Launch

 15. IDENTIFY HOTSPOT
     The hotspot: gaussian_filter_7x7_v0()

     Kernel              Time      Speedup
     Original Version    5.233ms   1.00x

  16. PERFORM KERNEL ANALYSIS Select Launch

  17. IDENTIFY PERFORMANCE LIMITER

 18. PERFORMANCE LIMITER CATEGORIES
     Memory utilization vs compute utilization (with roughly 60% utilization as the dividing line) gives four possible combinations:
     • Compute high, memory low: Compute Bound
     • Memory high, compute low: Bandwidth Bound
     • Both low: Latency Bound
     • Both high: Compute and Bandwidth Bound

 19. IDENTIFY PERFORMANCE LIMITER
     Mostly load/store memory ops. Memory-related issues?

 20. LOOKING FOR INDICATORS
     A large number of memory operations is stalling the LSU (load/store unit).

 21. LOOKING FOR MORE INDICATORS (Unguided Analysis)
     4-5 global load/store transactions per request.

 22. MEMORY TRANSACTIONS: BEST CASE
     A warp issues a 32x4B aligned and consecutive load/store request; threads read different elements of the same 128B segment.
     • 1x 128B load/store request per warp
     • 1x 128B L1 transaction per warp
     • 4x 32B L2 transactions per warp
     1x L1 transaction: 128B needed / 128B transferred
     4x L2 transactions: 128B needed / 128B transferred

 23. MEMORY TRANSACTIONS: WORST CASE
     Threads in a warp read/write 4B words with 128B between words (stride: 32x4B); each thread reads the first 4B of a 128B segment.
     • 1x 128B load/store request per warp
     • 1x 128B L1 transaction per thread
     • 1x 32B L2 transaction per thread
     32x L1 transactions: 128B needed / 32x 128B transferred
     32x L2 transactions: 128B needed / 32x 32B transferred

 24. TRANSACTIONS AND REPLAYS
     A warp reads from addresses spanning 3 lines of 128B:
     • 1st line: threads 0-7 and 24-31
     • 2nd line: threads 8-15
     • 3rd line: threads 16-23
     1 instruction executed and 2 replays = 1 request and 3 transactions: the instruction is issued, then re-issued for the 1st replay and again for the 2nd replay.
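The counting above can be modeled in a few lines of host code: one executed load is one request, each distinct 128B line the warp touches costs one transaction, and every line beyond the first shows up as a replay. This is a simplified model for uniform strides, not profiler output:

```cpp
#include <cstdint>
#include <set>

// Number of distinct 128B lines touched when each of the 32 lanes of a
// warp accesses a 4B word at base + lane * stride_bytes.
inline int lines_touched(std::uint64_t base, int stride_bytes) {
    std::set<std::uint64_t> lines;
    for (int lane = 0; lane < 32; ++lane)
        lines.insert((base + std::uint64_t(lane) * stride_bytes) / 128);
    return static_cast<int>(lines.size());
}
```

Stride-1 aligned accesses hit a single line (the best case on slide 22); a 128B stride hits 32 lines (the worst case on slide 23); a misaligned stride-1 warp straddles two lines.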

 25. TRANSACTIONS AND REPLAYS
     With replays, requests take more time and use more resources:
     • More instructions issued (extra work for the SM)
     • More memory traffic (data transferred per replayed instruction)
     • Increased execution time (extra latency before each instruction completes)

 26. CHANGING THE BLOCK LAYOUT
     Our blocks are 8x8 (threadIdx.x is stride-1, uchar pixels): the 32 threads of a warp span four 8-pixel rows of the image, so each memory line the warp touches is mostly unused. Data overfetch.
     We should use blocks of size 32x2, so that each warp reads 32 consecutive pixels from a single row.
     [Figure: thread-index layout of warps 0 and 1 for 8x8 vs 32x2 blocks]
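The overfetch can be quantified with a small host-side model (hypothetical helper; image width assumed 1024, 1-byte pixels): count the distinct 32B L2 segments that warp 0 touches under each block shape.

```cpp
#include <cstdint>
#include <set>

// Distinct 32B L2 segments touched by warp 0 when each thread loads one
// 1-byte pixel at (threadIdx.x, threadIdx.y) of a row-major image.
inline int segments_touched(int block_w, int img_width) {
    std::set<std::uint64_t> segs;
    for (int lane = 0; lane < 32; ++lane) {
        int tx = lane % block_w;  // threadIdx.x
        int ty = lane / block_w;  // threadIdx.y within the first warp
        std::uint64_t addr = std::uint64_t(ty) * img_width + tx;
        segs.insert(addr / 32);
    }
    return static_cast<int>(segs.size());
}
```

With 8x8 blocks the warp touches 4 segments (128B moved for 32B of useful data); with 32x2 blocks it touches a single fully used segment.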

 27. IMPROVED MEMORY ACCESS
     Blocks of size 32x2: memory is used more efficiently.

     Kernel                    Time      Speedup
     Original Version          5.233ms   1.00x
     Better Memory Accesses    1.589ms   3.29x

 28. PERF-OPT QUICK REFERENCE CARD
     Category: Latency Bound – Coalescing
     Problem: Memory is accessed inefficiently => high latency
     Goal: Reduce #transactions/request to reduce latency
     Indicators: Low global load/store efficiency; high #transactions/#request compared to ideal
     Strategy: Improve memory coalescing by:
     • Cooperative loading inside a block
     • Changing the block layout
     • Aligning data
     • Changing the data layout to improve locality

 29. PERF-OPT QUICK REFERENCE CARD
     Category: Bandwidth Bound – Coalescing
     Problem: Too much unused data clogging the memory system
     Goal: Reduce traffic, move more useful data per request
     Indicators: Low global load/store efficiency; high #transactions/#request compared to ideal
     Strategy: Improve memory coalescing by:
     • Cooperative loading inside a block
     • Changing the block layout
     • Aligning data
     • Changing the data layout to improve locality

  30. ITERATION 2

 31. IDENTIFY HOTSPOT
     gaussian_filter_7x7_v0() is still the hotspot.

     Kernel                    Time      Speedup
     Original Version          5.233ms   1.00x
     Better Memory Accesses    1.589ms   3.29x

  32. IDENTIFY PERFORMANCE LIMITER Still Latency Bound

 33. LOOKING FOR INDICATORS
     A lot of idle time. Not enough work inside a thread to hide latency?

 34. STALL REASONS: EXECUTION DEPENDENCY

     a = b + c; // ADD          a = b[i];  // LOAD
     d = a + e; // ADD          d = a + e; // ADD

     Memory accesses may influence execution dependencies:
     • Global accesses create longer dependencies than shared accesses
     • Read-only/texture dependencies are counted in Texture

     Instruction-level parallelism can reduce dependencies:

     a = b + c; // Independent ADDs
     d = e + f;

 35. ILP AND MEMORY ACCESSES

     No ILP:
         float a = 0.0f;
         for( int i = 0 ; i < N ; ++i )
             a += logf(b[i]);

         // One long dependency chain:
         c = b[0]
         a += logf(c)
         c = b[1]
         a += logf(c)
         c = b[2]
         a += logf(c)
         c = b[3]
         a += logf(c)
         ...

     2-way ILP (with loop unrolling):
         float a, a0 = 0.0f, a1 = 0.0f;
         for( int i = 0 ; i < N ; i += 2 )
         {
             a0 += logf(b[i]);
             a1 += logf(b[i+1]);
         }
         a = a0 + a1;

         // Two independent chains:
         c0 = b[0]
         c1 = b[1]
         a0 += logf(c0)
         a1 += logf(c1)
         c0 = b[2]
         c1 = b[3]
         a0 += logf(c0)
         a1 += logf(c1)
         ...
         a = a0 + a1

     #pragma unroll is useful to extract ILP. Manually rewrite the code if it is not a simple loop.
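The two variants above as compilable host code (for illustration; on the GPU the same transformation gives each thread two independent dependency chains). N is assumed even in the unrolled version:

```cpp
#include <cmath>

// Serial sum: each iteration depends on the previous one through `a`.
inline float sum_log(const float* b, int n) {
    float a = 0.0f;
    for (int i = 0; i < n; ++i)
        a += std::log(b[i]);
    return a;
}

// 2-way unrolled sum: two independent accumulators break the chain.
inline float sum_log_ilp2(const float* b, int n) {
    float a0 = 0.0f, a1 = 0.0f;
    for (int i = 0; i < n; i += 2) {
        a0 += std::log(b[i]);
        a1 += std::log(b[i + 1]);
    }
    return a0 + a1;
}
```

Both compute the same sum (up to floating-point reassociation); only the dependency structure differs.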

  36. LOOKING FOR MORE INDICATORS

  37. LOOKING FOR MORE INDICATORS Not enough active warps to hide latencies?

 38. LATENCY
     GPUs cover latencies by having a lot of work in flight: while one warp waits (latency), other warps issue. With enough warps (e.g. warps 0-9 in flight) the latency is fully covered; with too few, latency is exposed and no warp is issuing.
     [Figure: issue/wait timeline for warps 0-9]

 39. LATENCY: LACK OF OCCUPANCY
     Not enough active warps (e.g. only warps 0-3): the schedulers cannot find eligible warps at every cycle, so on some cycles no warp issues.
     [Figure: issue/wait timeline for warps 0-3 with issue gaps]

 40. IMPROVED OCCUPANCY
     Bigger blocks of size 32x4 increase achieved occupancy slightly (from 47.6% to 52.4%).

     Kernel                    Time      Speedup
     Original Version          5.233ms   1.00x
     Better Memory Accesses    1.589ms   3.29x
     Higher Occupancy          1.562ms   3.35x
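The block-size effect can be checked against the SM 3.5 per-SM limits (64 resident warps, 16 resident blocks). This sketch ignores register and shared-memory limits, which is one reason achieved occupancy stays below these theoretical numbers:

```cpp
// Theoretical occupancy (percent) for a given block size on SM 3.5,
// considering only the warps-per-SM and blocks-per-SM limits.
inline int theoretical_occupancy_pct(int threads_per_block) {
    const int max_warps_per_sm  = 64;  // Kepler GK110 limit
    const int max_blocks_per_sm = 16;  // Kepler GK110 limit
    int warps_per_block = (threads_per_block + 31) / 32;
    int blocks = max_blocks_per_sm;
    if (blocks * warps_per_block > max_warps_per_sm)
        blocks = max_warps_per_sm / warps_per_block;
    return 100 * blocks * warps_per_block / max_warps_per_sm;
}
```

64-thread blocks (32x2) cap at 16 blocks x 2 warps = 32 warps, i.e. 50% theoretical occupancy; 128-thread blocks (32x4) allow the full 64 warps.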

 41. PERF-OPT QUICK REFERENCE CARD
     Category: Latency Bound – Occupancy
     Problem: Latency is exposed due to low occupancy
     Goal: Hide latency behind more parallel work
     Indicators: Occupancy low (< 60%); execution dependency high
     Strategy: Increase occupancy by:
     • Varying block size
     • Varying shared memory usage
     • Varying register count
