S8630 - WHAT THE PROFILER IS TELLING YOU: OPTIMIZING GPU KERNELS - PowerPoint PPT Presentation


  1. S8630 - WHAT THE PROFILER IS TELLING YOU: OPTIMIZING GPU KERNELS Jakob Progsch, Mathias Wagner GTC 2018

  2. BEFORE YOU START The five steps to enlightenment
     1. Know your hardware: What are the target machines, how many nodes? Are machine-specific optimizations okay?
     2. Know your tools: What are the strengths and weaknesses of each tool? Learn how to use them (and learn one well!)
     3. Know your application: What does it compute? How is it parallelized? What final performance is expected?
     4. Know your process: Performance optimization is a constant learning process.
     5. Make it so!

  3. THE APOD CYCLE
     1. Assess: Analyze Profile, Find Indicators, Identify Performance Limiter
     2. Parallelize
     3. Optimize (3b. Build Knowledge)
     4. Deploy and Test

  4. GUIDING OPTIMIZATION EFFORT “Drilling Down into the Metrics”
     Challenge: How to know where to start? Top-down approach to narrow the scope:
     • Find the hotspot kernel
     • Identify the performance limiter of the hotspot
     • Find performance bottleneck indicators related to the limiter
     • Identify the associated regions in the source code
     • Come up with a strategy to fix it and change the code
     • Start again

  5. KNOW YOUR HARDWARE: VOLTA ARCHITECTURE

  6. VOLTA V100 FEATURES The most productive GPU:
     • Volta Architecture: Most Productive GPU
     • Tensor Core: 120 Programmable TFLOPS Deep Learning
     • Improved NVLink & HBM2: Efficient Bandwidth
     • Improved SIMT Model: New Algorithms
     • Volta MPS: Inference Utilization

  7. GPU COMPARISON
                                    P100 (SXM2)     V100 (SXM2)
     Double/Single/Half TFlop/s     5.3/10.6/21.2   7.8/15.7/125 (Tensor Cores)
     Memory Bandwidth (GB/s)        732             900
     Memory Size                    16 GB           16 GB
     L2 Cache Size                  4096 KB         6144 KB
     Base/Boost Clock (MHz)         1328/1480       1312/1530
     TDP (Watts)                    300             300

  8. VOLTA GV100 SM
     FP32 units                 64
     FP64 units                 32
     INT32 units                64
     Tensor Cores               8
     Register File              256 KB
     Unified L1/Shared Memory   128 KB
     Active Threads             2048

  9. IMPROVED L1 CACHE
     Pascal SM: Load/Store Units, 64 KB Shared Memory, 24 KB Streaming L1$, 4 MB L2$
     Volta SM: Load/Store Units, 128 KB combined L1$ and Shared Memory (low latency), 6 MB L2$

  10. KNOW YOUR TOOLS: PROFILERS

  11. PROFILING TOOLS Many options!
     From NVIDIA: nvprof, NVIDIA Visual Profiler (nvvp), Nsight Visual Studio Edition. Coming soon: NVIDIA Nsight Systems, NVIDIA Nsight Compute.
     Third party: TAU Performance System, VampirTrace, PAPI CUDA component, HPC Toolkit (tools using CUPTI).
     Without loss of generality, this talk shows nvvp screenshots.

  12. THE NVVP PROFILER WINDOW (screenshot; labeled regions: Timeline, Summary, Guide, Analysis Results)

  13. KNOW YOUR APPLICATION: HPGMG

  14. HPGMG High-Performance Geometric Multi-Grid, Hybrid Implementation
     V-cycle / F-cycle: fine levels (smoother, smoother & residual) are executed on throughput-optimized processors (GPU); below a threshold, coarse levels (smoother, smoother & residual, direct solve) are executed on latency-optimized processors (CPU).
     http://crd.lbl.gov/departments/computer-science/PAR/research/hpgmg/

  15. MAKE IT SO: ITERATION 1: 2ND-ORDER 7-POINT STENCIL

  16. IDENTIFY HOTSPOT Identify the hotspot: smooth_kernel()
     Kernel             Time      Speedup
     Original Version   2.079 ms  1.00x

  17. IDENTIFY PERFORMANCE LIMITER (profiler screenshot: memory utilization vs. compute utilization)

  18. PERFORMANCE LIMITER CATEGORIES Memory utilization vs. compute utilization gives four possible combinations:
     • High compute, low memory: Compute Bound
     • Low compute, high memory: Bandwidth Bound
     • Both low: Latency Bound
     • Both high: Compute and Bandwidth Bound

  19. LATENCY BOUND ON P100

  20. BANDWIDTH BOUND ON V100

  21. DRILLING DOWN: LATENCY ANALYSIS (V100) The profiler warns about low occupancy: limited by a block size of only 8x4=32 threads.

  22. OCCUPANCY GPU utilization. Each SM has limited resources:
     • max. 64K registers (32-bit) distributed between threads
     • max. 48 KB (96 KB opt-in) of shared memory per block (96 KB per SM)
     • max. 32 active blocks per SM
     • Full occupancy: 2048 threads per SM (64 warps)
     When a resource is used up, occupancy is reduced. Values vary with compute capability.
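
     As a minimal, hypothetical sketch (not from the slides): the CUDA occupancy API reports how many blocks of a given size fit on one SM, which is a quick way to check whether a launch configuration can reach full occupancy. my_kernel is a placeholder.

     #include <cstdio>
     #include <cuda_runtime.h>

     __global__ void my_kernel(float *out) { out[threadIdx.x] = 0.0f; }  // placeholder kernel

     int main() {
         int blockSize = 256;
         int maxActiveBlocks = 0;
         // How many blocks of this size can be resident per SM for this kernel?
         cudaOccupancyMaxActiveBlocksPerMultiprocessor(
             &maxActiveBlocks, my_kernel, blockSize, /*dynamicSMemSize=*/0);

         cudaDeviceProp prop;
         cudaGetDeviceProperties(&prop, 0);
         float occupancy = (float)(maxActiveBlocks * blockSize)
                         / prop.maxThreadsPerMultiProcessor;  // 2048 threads/SM on Volta
         printf("Theoretical occupancy: %.0f%%\n", occupancy * 100.0f);
         return 0;
     }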

  23. LATENCY GPUs cover latencies by having a lot of work in flight: a warp issues, then waits (latency). (Diagram: with only warps 0-3 resident there are cycles where no warp can issue, so latency is exposed; with warps 0-9 resident the latency is fully covered.)

  24. LATENCY AT HIGH OCCUPANCY Many active warps, but with high-latency instructions. (Diagram: warps 0-9 all waiting at once, so latency is exposed and no warp is issuing even at high occupancy.)

  25. LOOKING FOR MORE INDICATORS Source code association shows 12 global load transactions per 1 request. For line numbers use: nvcc -lineinfo

  26. MEMORY TRANSACTIONS: BEST CASE A warp issues a 32x4B aligned and consecutive load/store request: threads read different elements of the same 128B segment.
     • 1x 128B load/store request per warp
     • 1x 128B L1 transaction per warp (128B needed / 128B transferred)
     • 4x 32B L2 transactions per warp (128B needed / 128B transferred)

  27. MEMORY TRANSACTIONS: WORST CASE Threads in a warp read/write 4B words with 128B between words: each thread reads the first 4B of its own 128B segment (stride: 32x4B).
     • 1x 128B load/store request per warp
     • 1x 128B L1 transaction per thread (32x L1 transactions: 128B needed / 32x 128B transferred)
     • 1x 32B L2 transaction per thread (32x L2 transactions: 128B needed / 32x 32B transferred)
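
     A minimal sketch contrasting the two patterns above (kernel and buffer names are hypothetical). In coalesced, the 32 threads of a warp read 32 consecutive floats, i.e. one 128B segment: 1 request, 4x 32B L2 transactions. In strided, each thread reads the first 4B of its own 128B segment: the same single request now needs 32 transactions.

     #include <cuda_runtime.h>

     __global__ void coalesced(const float *in, float *out) {
         int i = blockIdx.x * blockDim.x + threadIdx.x;
         out[i] = in[i];        // best case: consecutive 4B words per warp
     }

     __global__ void strided(const float *in, float *out) {
         int i = blockIdx.x * blockDim.x + threadIdx.x;
         out[i] = in[i * 32];   // worst case: 128B between the words a warp reads
     }

     int main() {
         const int n = 1 << 20;
         float *in, *out;
         cudaMalloc((void**)&in, 32 * (size_t)n * sizeof(float)); // large enough for the strided reads
         cudaMalloc((void**)&out, n * sizeof(float));
         coalesced<<<n / 256, 256>>>(in, out);
         strided<<<n / 256, 256>>>(in, out);
         cudaDeviceSynchronize();
         cudaFree(in); cudaFree(out);
         return 0;
     }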

  28. TRANSACTIONS AND REPLAYS With replays, requests take more time and use more resources: more instructions issued, more memory traffic, increased execution time. (Diagram: instructions 0-2 each issue for threads 0-7/24-31, then are replayed for threads 8-15 and 16-23 before completing, causing extra work on the SM, extra latency, and extra memory traffic.)

  29. FIX: BETTER GPU TILING Before vs. after: block size up, transactions per access down, memory utilization up (+10%).
     Kernel                   Time      Speedup
     Original Version         2.079 ms  1.00x
     Better Memory Accesses   1.756 ms  1.18x
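
     A minimal sketch of this kind of tiling change, with a hypothetical 2D copy kernel standing in for smooth_kernel(); the exact before/after block shapes are an assumption, since the slide only states that the block size went up.

     #include <cuda_runtime.h>

     __global__ void copy2d(const float *in, float *out, int nx, int ny) {
         int x = blockIdx.x * blockDim.x + threadIdx.x;
         int y = blockIdx.y * blockDim.y + threadIdx.y;
         if (x < nx && y < ny) out[y * nx + x] = in[y * nx + x];
     }

     void launch(const float *in, float *out, int nx, int ny) {
         // Before: dim3 block(8, 4) -- one warp spans four 8-element row
         // fragments, so a warp-wide L1 load touches four 128B lines.
         // After: the block's x-dimension covers a full warp, so a warp
         // reads 32 consecutive floats (one 128B line) per access.
         dim3 block(32, 4);
         dim3 grid((nx + block.x - 1) / block.x, (ny + block.y - 1) / block.y);
         copy2d<<<grid, block>>>(in, out, nx, ny);
     }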


  31. ITERATION 2: DATA MIGRATION

  32. PAGE FAULTS (profiler screenshot: page fault details in the timeline)

  33. MEMORY MANAGEMENT Using Unified Memory: no changes to data structures, no explicit data movements, a single pointer for CPU and GPU data. Use cudaMallocManaged for allocations.
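
     A minimal sketch of the Unified Memory setup, assuming a hypothetical kernel scale. The same pointer is valid on host and device; pages migrate on demand, which is exactly what produces the page faults profiled above.

     #include <cuda_runtime.h>

     __global__ void scale(float *data, int n, float s) {
         int i = blockIdx.x * blockDim.x + threadIdx.x;
         if (i < n) data[i] *= s;
     }

     int main() {
         const int n = 1 << 20;
         float *data;
         cudaMallocManaged(&data, n * sizeof(float));    // single pointer for CPU and GPU
         for (int i = 0; i < n; ++i) data[i] = 1.0f;     // pages touched on the CPU first
         scale<<<(n + 255) / 256, 256>>>(data, n, 2.0f); // pages fault and migrate to the GPU
         cudaDeviceSynchronize();                        // CPU access would fault them back
         cudaFree(data);
         return 0;
     }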

  34. UNIFIED MEMORY Eliminating page migrations and faults. (Diagram: in the F-cycle, page faults occur at the GPU/CPU threshold.) Solution: allocate the first CPU level with cudaMallocHost (zero-copy memory).
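
     A minimal sketch of the fix, with a hypothetical buffer standing in for the first CPU level. Pinned zero-copy memory is accessed by the GPU in place over the interconnect, so no pages migrate back and forth at the CPU/GPU threshold.

     #include <cuda_runtime.h>

     int main() {
         const size_t bytes = 1 << 20;
         float *cpu_level;                          // first level that stays on the CPU
         cudaMallocHost((void**)&cpu_level, bytes); // pinned zero-copy memory
         // ... build the coarse level on the CPU; GPU kernels that touch it
         // at the threshold read it directly instead of triggering faults ...
         cudaFreeHost(cpu_level);
         return 0;
     }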

  35. PAGE FAULTS Almost gone

  36. PAGE FAULTS Significant speedup for the affected kernel

  37. MEM ADVICE API (not used here)
     cudaMemPrefetchAsync(ptr, length, destDevice, stream)
     • Migrates data to destDevice, overlapping with compute
     • Updates the page table: much lower overhead than a page fault in a kernel
     • Async operation that follows CUDA stream semantics
     cudaMemAdvise(ptr, length, advice, device)
     • Specifies allocation and usage policy for a memory region
     • User can set and unset at any time
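
     A minimal sketch of both calls on a managed allocation (buffer name and sizes are hypothetical). Prefetching moves the pages before a kernel runs, so the kernel itself takes no page faults.

     #include <cuda_runtime.h>

     int main() {
         const size_t bytes = 1 << 20;
         float *ptr;
         cudaMallocManaged(&ptr, bytes);

         int device = 0;
         // Policy hint: keep these pages resident on the GPU when possible.
         cudaMemAdvise(ptr, bytes, cudaMemAdviseSetPreferredLocation, device);
         // Migrate to the GPU asynchronously, overlapping with other work.
         cudaMemPrefetchAsync(ptr, bytes, device, 0);
         // ... launch kernels that use ptr on the GPU ...
         cudaMemPrefetchAsync(ptr, bytes, cudaCpuDeviceId, 0); // bring back to the CPU
         cudaDeviceSynchronize();
         cudaFree(ptr);
         return 0;
     }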

  38. ITERATION 3: REGISTER OPTIMIZATION AND CACHING

  39. LIMITER: STILL MEMORY BANDWIDTH

  40. GPU MEMORY HIERARCHY V100 Bring reused data closer to the SMs:
     • Registers (256 KB/SM): good for intra-thread data reuse
     • Shared memory / L1$ (128 KB/SM): good for explicit intra-block data reuse
     • L2$ (6144 KB): implicit data reuse
     (Diagram: functional units and register file per SM above shared memory/L1$, a shared L2$, and global memory/framebuffer below.)

  41. CACHING IN REGISTERS No data loaded initially.

  42. CACHING IN REGISTERS Load the first set of data.

  43. CACHING IN REGISTERS Perform the calculation (stencil).

  44. CACHING IN REGISTERS Naively load the next set of data?

  45. CACHING IN REGISTERS Reusing already-loaded data is better: keep two values, load one.

  46. CACHING IN REGISTERS Repeat (stencil). Higher register usage may result in reduced occupancy, so it is a trade-off (run experiments!).
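
     A minimal sketch of register caching for a sweep along z, assuming a hypothetical 2nd-order 7-point stencil kernel (the deck's smooth_kernel() is more involved). Each thread keeps the center values of three consecutive planes in registers and loads only one new value per step instead of re-reading all three.

     __global__ void stencil_z_sweep(const float *in, float *out,
                                     int nx, int ny, int nz) {
         int x = blockIdx.x * blockDim.x + threadIdx.x;
         int y = blockIdx.y * blockDim.y + threadIdx.y;
         if (x < 1 || x >= nx - 1 || y < 1 || y >= ny - 1) return;

         long p = (long)nx * ny;           // plane stride
         long idx = (long)y * nx + x;      // position in plane z = 0
         float below  = in[idx];           // plane z-1, kept in a register
         float center = in[idx + p];       // plane z, kept in a register
         for (int z = 1; z < nz - 1; ++z) {
             float above = in[idx + (z + 1) * p];  // the only new z-load per step
             out[idx + z * p] = -6.0f * center + below + above
                              + in[idx + z * p - 1]  + in[idx + z * p + 1]
                              + in[idx + z * p - nx] + in[idx + z * p + nx];
             below  = center;              // keep: reuse instead of reloading
             center = above;               // keep
         }
     }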

  47. THE EFFECT OF REGISTER CACHING Transactions for cached loads reduced by a factor of 8. Memory utilization is still high, but less redundant data is transferred.
     Kernel                   Time      Speedup
     Original Version         2.079 ms  1.00x
     Better Memory Accesses   1.756 ms  1.18x
     Register Caching         1.486 ms  1.40x

  48. SHARED MEMORY Programmer-managed cache. Great for caching data reused across threads in a CTA. 128 KB is split between shared memory and L1 cache per SM; each block can use at most 96 KB of shared memory on GV100. Search for cudaFuncAttributePreferredSharedMemoryCarveout in the docs.

     __global__ void sharedMemExample(float *d) {
         __shared__ float s[64];      // shared memory, visible block-wide
         int t = threadIdx.x;         // t lives in registers
         s[t] = d[t];                 // global -> shared
         __syncthreads();
         if (t > 0 && t < 63)
             d[t] = -2.0f*s[t] + s[t-1] + s[t+1];  // stencil: shared -> global
     }
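
     A small usage sketch for the attribute mentioned above, applied to the slide's kernel (the launch configuration is illustrative). cudaSharedmemCarveoutMaxShared requests the maximum shared-memory split of the unified 128 KB L1/shared array; the value is a hint, not a guarantee.

     void configureAndLaunch(float *d) {
         cudaFuncSetAttribute(sharedMemExample,
                              cudaFuncAttributePreferredSharedMemoryCarveout,
                              cudaSharedmemCarveoutMaxShared);
         sharedMemExample<<<1, 64>>>(d);  // subsequent launches use the carveout hint
     }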


  50. ITERATION 4: KERNELS WITH INCREASED ARITHMETIC INTENSITY
