  1. CUDA OPTIMIZATION WITH NVIDIA NSIGHT™ VISUAL STUDIO EDITION. CHRISTOPH ANGERER, NVIDIA | JULIEN DEMOUTH, NVIDIA

  2. WHAT YOU WILL LEARN: An iterative method to optimize your GPU code, and a way to conduct that method with NVIDIA Nsight VSE. Companion code: https://github.com/chmaruni/nsight-gtc2015

  3. INTRODUCING THE APPLICATION: An image-processing pipeline with three stages: Grayscale → Blur → Edges

  4. INTRODUCING THE APPLICATION: Grayscale Conversion
     // r, g, b: red, green, blue components of the pixel p
     foreach pixel p:
         p = 0.298839f*r + 0.586811f*g + 0.114350f*b;
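The per-pixel formula maps directly onto one CUDA thread per pixel. A minimal sketch of such a kernel, assuming a uchar4 RGBA input and a single-channel 8-bit output (the kernel name and data layout are illustrative assumptions, not the companion code):

    __global__ void grayscale_kernel(const uchar4 *rgba, unsigned char *gray,
                                     int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < height) {
            uchar4 p = rgba[y * width + x];   // p.x/p.y/p.z = r/g/b
            float luma = 0.298839f * p.x + 0.586811f * p.y + 0.114350f * p.z;
            gray[y * width + x] = (unsigned char)(luma + 0.5f);  // round to nearest
        }
    }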

  5. INTRODUCING THE APPLICATION: Blur, 7x7 Gaussian Filter
     foreach pixel p:
         p = weighted sum of p and its 48 neighbors
     Weights:
         1  2  3  4  3  2  1
         2  4  6  8  6  4  2
         3  6  9 12  9  6  3
         4  8 12 16 12  8  4
         3  6  9 12  9  6  3
         2  4  6  8  6  4  2
         1  2  3  4  3  2  1
     (Image from Wikipedia)
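As a reference point, a naive CUDA version of this filter (one thread per output pixel, 49 global loads each) might look like the sketch below. This is an assumed reconstruction of what a "_v0" kernel could look like, not the companion code; the weights live in constant memory and sum to 256, and borders are clamped:

    __constant__ int c_weights[7][7];   // filled with the 7x7 kernel above

    __global__ void gaussian_filter_7x7(const unsigned char *src,
                                        unsigned char *dst,
                                        int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= width || y >= height) return;

        int sum = 0;
        for (int j = -3; j <= 3; ++j)        // 7 rows
            for (int i = -3; i <= 3; ++i) {  // 7 columns
                int xi = min(max(x + i, 0), width  - 1);  // clamp to image
                int yj = min(max(y + j, 0), height - 1);
                sum += c_weights[j + 3][i + 3] * src[yj * width + xi];
            }
        dst[y * width + x] = (unsigned char)(sum / 256);  // weights sum to 256
    }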

  6. INTRODUCING THE APPLICATION: Edges, 3x3 Sobel Filters
     foreach pixel p:
         Gx = weighted sum of p and its 8 neighbors
         Gy = weighted sum of p and its 8 neighbors
         p = sqrt(Gx*Gx + Gy*Gy)
     Weights for Gx:       Weights for Gy:
         -1  0  1               1  2  1
         -2  0  2               0  0  0
         -1  0  1              -1 -2 -1
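A corresponding Sobel sketch for interior pixels (again an illustrative reconstruction, not the companion code); gx and gy unroll the two weight matrices directly:

    __global__ void sobel_filter_3x3(const unsigned char *src,
                                     unsigned char *dst,
                                     int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < 1 || y < 1 || x >= width - 1 || y >= height - 1) return;

        // 3x3 neighborhood (the center pixel has weight 0 in both filters)
        int p00 = src[(y-1)*width + x-1], p01 = src[(y-1)*width + x], p02 = src[(y-1)*width + x+1];
        int p10 = src[ y   *width + x-1],                             p12 = src[ y   *width + x+1];
        int p20 = src[(y+1)*width + x-1], p21 = src[(y+1)*width + x], p22 = src[(y+1)*width + x+1];

        int gx = -p00 + p02 - 2*p10 + 2*p12 - p20 + p22;   // horizontal gradient
        int gy =  p00 + 2*p01 + p02 - p20 - 2*p21 - p22;   // vertical gradient
        float mag = sqrtf((float)(gx*gx + gy*gy));
        dst[y*width + x] = (unsigned char)fminf(mag, 255.0f);
    }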

  7. ENVIRONMENT: NVIDIA GTX Titan X (GM200, SM 5.2), Windows 7, NVIDIA Nsight Visual Studio Edition 4.6

  8. PERFORMANCE OPTIMIZATION CYCLE
     1. Profile Application
     2. Identify Performance Limiter
     3. Analyze Profile & Find Indicators
     4. Reflect (4b. Build Knowledge)
     5. Change and Test Code
     (Chameleon image from http://www.vectorportal.com, Creative Commons)

  9. PREREQUISITES
     Basic understanding of the GPU memory hierarchy:
     - Global memory (slow, generous)
     - Shared memory (fast, limited)
     - Registers (very fast, very limited)
     - (Texture cache)
     Basic understanding of the CUDA execution model:
     - Grid: 1D/2D/3D
     - Block: 1D/2D/3D
     - Warp-synchronous execution (32 threads per warp)
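As a refresher on how the grid/block/warp hierarchy appears in code, a generic 2D launch that covers a width x height image with one thread per pixel (the block shape here is an arbitrary example; it reuses the grayscale_kernel sketch from slide 4):

    #include <cuda_runtime.h>

    void launch_grayscale(const uchar4 *d_rgba, unsigned char *d_gray,
                          int width, int height)
    {
        dim3 block(32, 8);                           // 2D block: 256 threads = 8 warps
        dim3 grid((width  + block.x - 1) / block.x,  // round up so the grid
                  (height + block.y - 1) / block.y); //   covers every pixel
        grayscale_kernel<<<grid, block>>>(d_rgba, d_gray, width, height);
    }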

  10. ITERATION 1

  11. TRACING THE APPLICATION: Verify the parameters, select "Trace Application", activate CUDA tracing, then Launch.

  12. NAVIGATING THE ANALYSIS REPORTS: Timeline, CUDA Summary, CUDA Launches

  13. TIMELINE

  14. IDENTIFY HOTSPOT (CUDA SUMMARY)
     The hotspot is gaussian_filter_7x7_v0().
     Kernel            | Time     | Speedup
     Original Version  | 1.971ms  | 1.00x

  15. PERFORM KERNEL ANALYSIS: Select "Profile CUDA Application", select the kernel, select the experiments (All), then Launch.

  16. THE CUDA LAUNCHES VIEW: Select the kernel, then the experiment, to view the experiment results.

  17. IDENTIFY MAIN PERFORMANCE LIMITER
     Compare Memory Utilization vs Compute Utilization (high means above roughly 60%).
     Four possible combinations:
     - Compute high, memory low: Compute Bound
     - Memory high, compute low: Bandwidth Bound
     - Both low: Latency Bound
     - Both high: Compute and Bandwidth Bound

  18. MEMORY BANDWIDTH (diagram): each SM has Registers and SMEM/L1$; the SMs share the L2$, which fronts Global Memory (the framebuffer).

  19. IDENTIFY PERFORMANCE LIMITER: Utilization of the L2$ bandwidth (BW) is low and DRAM BW utilization is < 2% → not limited by memory bandwidth.

  20. INSTRUCTION THROUGHPUT
     Each SM (Maxwell) has 4 schedulers, a 256KB register file, 96KB of shared
     memory, and a TEX/L1$. Schedulers issue instructions to pipes:
     - Each scheduler can issue up to 2 instructions per cycle
     - A scheduler issues instructions from a single warp
     - It cannot issue to a pipe if that pipe's issue slot is full

  21. INSTRUCTION THROUGHPUT
     (Diagram: three example profiles.) Schedulers and pipe saturated
     (scheduler utilization 92%), schedulers saturated (90%), pipe saturated
     (64%); the bars show per-pipe issue-slot utilization for the shared
     memory, control flow, ALU, and texture pipes.

  22. WARP ISSUE EFFICIENCY: Percentage of issue slots used (blue), aggregated over all the schedulers.

  23. PIPE UTILIZATION: Percentage of issue slots used per pipe, accounting for pipe throughputs. Four groups of pipes: Shared Memory, Texture, Control Flow, Arithmetic (ALU).

  24. INSTRUCTION THROUGHPUT: Neither the schedulers nor the pipes are saturated, so we are not limited by instruction throughput → our kernel is latency bound.

  25. LOOKING FOR INDICATORS: 56% of theoretical occupancy, 29.35 active warps per cycle, 1.18 eligible warps per cycle. Let's start with occupancy.

  26. OCCUPANCY
     Each SM has limited resources:
     - 64K registers (32-bit), shared between threads
     - Up to 48KB of shared memory per block (96KB per SMM)
     - 32 active blocks per SMM
     Full occupancy: 2048 threads per SM (64 warps). The values vary with
     Compute Capability.
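The CUDA runtime can report how many blocks of a kernel fit on an SM given the kernel's actual register and shared-memory usage. A small sketch of such an occupancy check (my_kernel is a placeholder):

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void my_kernel(float *data) { /* ... */ }

    void report_occupancy(int blockSize)
    {
        int blocksPerSM = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocksPerSM, my_kernel, blockSize, /*dynamicSMemBytes=*/0);

        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        float occupancy = (float)(blocksPerSM * blockSize)
                        / prop.maxThreadsPerMultiProcessor;
        printf("%d blocks/SM -> %.0f%% occupancy\n", blocksPerSM, occupancy * 100.f);
    }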

  27. LATENCY
     GPUs cover latencies by having a lot of work in flight: while one warp
     waits (latency) after issuing, other warps issue in the meantime.
     (Diagram: with warps 0-9 in flight the latency is fully covered; with
     too few warps the latency is exposed and there are cycles where no warp
     is issuing.)

  28. LATENCY: LACK OF OCCUPANCY
     Not enough active warps (diagram: only warps 0-3 in flight): there are
     cycles where no warp issues. The schedulers cannot find eligible warps
     at every cycle.

  29. LOOKING FOR MORE INDICATORS: We don't want to change the register count yet, and the block size seems OK.

  30. CONTINUE LOOKING FOR INDICATORS 4-8 L2 Transactions per 1 Request

  31. MEMORY TRANSACTIONS: BEST CASE
     A warp issues a 32x4B aligned and consecutive load/store request:
     threads read different elements of the same 128B segment.
     - 1x 128B load/store request per warp
     - 4x 32B L2 transactions per warp
     4x L2 transactions: 128B needed / 128B transferred.
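In code, the best case is the familiar pattern where thread k of a warp touches element k; a minimal illustration:

    // Thread k reads element k: the warp's 32 consecutive 4B loads fall into
    // one 128B segment -> 1 request, 4x 32B L2 transactions, nothing wasted.
    __global__ void coalesced_copy(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }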

  32. MEMORY TRANSACTIONS: WORST CASE
     Threads in a warp read/write 4B words with 128B between them
     (stride: 32x4B): each thread reads the first 4B of a different 128B
     segment.
     - 1x 128B load/store request per warp
     - 1x 32B L2 transaction per thread
     32x L2 transactions: 128B needed / 32x32B transferred.
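The worst case is the same copy with a 32-element (128B) stride between threads; an illustration:

    // Each thread's 4B load lands in a different 128B segment -> 1 request,
    // 32x 32B L2 transactions, 1024B transferred for 128B of useful data.
    __global__ void strided_copy(const float *in, float *out, int n)
    {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * 32;  // stride 32x4B
        if (i < n) out[i] = in[i];
    }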

  33. TRANSACTIONS AND REPLAYS
     A warp reads from addresses spanning 3 lines of 128B:
     - 1st line: threads 0-7 and 24-31
     - 2nd line: threads 8-15
     - 3rd line: threads 16-23
     The instruction is issued once and re-issued twice (1st and 2nd replay):
     1 instruction executed and 2 replays = 1 request and 3 transactions.

  34. TRANSACTIONS AND REPLAYS
     With replays, requests take more time and use more resources:
     - More instructions issued: extra work for the SM
     - More memory traffic
     - Increased execution time: extra latency until the last transaction for
       an instruction completes

  35. CHANGING THE BLOCK LAYOUT
     Our blocks are 8x8 threads. With stride-1 uchar accesses, warp 0
     (threads 0-31) covers four 8-pixel rows in four different image rows, so
     each 32B transaction carries only 8B of useful data (data overfetch). We
     should use blocks of size 32x2 instead: each warp then reads one
     contiguous 32B row. (Diagram: thread-index layout for both block shapes;
     a launch sketch follows below.)
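The fix is purely a launch-configuration change (assuming the gaussian_filter_7x7 sketch from slide 5):

    void launch_gaussian(const unsigned char *d_src, unsigned char *d_dst,
                         int width, int height)
    {
        dim3 block(32, 2);  // was (8, 8): same 64 threads per block, but each
                            //   warp now reads one contiguous 32B row of uchars
        dim3 grid((width  + block.x - 1) / block.x,
                  (height + block.y - 1) / block.y);
        gaussian_filter_7x7<<<grid, block>>>(d_src, d_dst, width, height);
    }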

  36. IMPROVED MEMORY ACCESS
     With blocks of size 32x2, memory is used more efficiently.
     Kernel                  | Time     | Speedup
     Original Version        | 1.971ms  | 1.00x
     Better Memory Accesses  | 0.725ms  | 2.72x

  37. PERF-OPT QUICK REFERENCE CARD
     Category: Latency Bound - Occupancy
     Problem: Latency is exposed due to low occupancy
     Goal: Hide latency behind more parallel work
     Indicators: Occupancy low (< 60%), execution dependency high
     Strategy: Increase occupancy by:
     • Varying block size
     • Varying shared memory usage
     • Varying register count
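One concrete way to act on the "varying register count" strategy is the __launch_bounds__ qualifier, which asks the compiler to cap register usage so that a given number of blocks can be resident per SM (the numbers below are illustrative, not tuned values from the talk):

    __global__ void
    __launch_bounds__(128 /* maxThreadsPerBlock */, 16 /* minBlocksPerSM */)
    gaussian_filter_7x7_bounded(const unsigned char *src, unsigned char *dst,
                                int width, int height)
    {
        // same body as the naive filter; the compiler now limits registers
        // so that 16 blocks x 128 threads = 2048 threads can be resident
        // per SM (full occupancy on Maxwell)
    }

A coarser alternative is the nvcc flag -maxrregcount, which applies one register cap to the whole compilation unit.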

  38. PERF-OPT QUICK REFERENCE CARD
     Category: Latency Bound - Coalescing
     Problem: Memory is accessed inefficiently => high latency
     Goal: Reduce #transactions/request to reduce latency
     Indicators: Low global load/store efficiency, high #transactions/#requests compared to ideal
     Strategy: Improve memory coalescing by:
     • Cooperative loading inside a block
     • Changing the block layout
     • Aligning data
     • Changing the data layout to improve locality
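A sketch of "cooperative loading inside a block": a 32x2 block stages a (32+6) x (2+6) tile (halo of radius 3 for the 7x7 filter) into shared memory with coalesced loads, then filters from the fast on-chip copy. The tile shape and clamping are illustrative assumptions; c_weights is the constant array from the slide-5 sketch:

    #define BX 32
    #define BY 2
    #define R  3   // filter radius

    __global__ void gaussian_filter_7x7_smem(const unsigned char *src,
                                             unsigned char *dst,
                                             int width, int height)
    {
        __shared__ unsigned char tile[BY + 2*R][BX + 2*R];

        int x0 = blockIdx.x * BX - R;   // tile origin in the image, incl. halo
        int y0 = blockIdx.y * BY - R;

        // All 64 threads cooperate: consecutive threadIdx.x values load
        // consecutive addresses, so every pass over the tile coalesces.
        for (int j = threadIdx.y; j < BY + 2*R; j += BY)
            for (int i = threadIdx.x; i < BX + 2*R; i += BX) {
                int xi = min(max(x0 + i, 0), width  - 1);
                int yj = min(max(y0 + j, 0), height - 1);
                tile[j][i] = src[yj * width + xi];
            }
        __syncthreads();

        int x = blockIdx.x * BX + threadIdx.x;
        int y = blockIdx.y * BY + threadIdx.y;
        if (x >= width || y >= height) return;

        int sum = 0;
        for (int j = -R; j <= R; ++j)
            for (int i = -R; i <= R; ++i)
                sum += c_weights[j + R][i + R]
                     * tile[threadIdx.y + R + j][threadIdx.x + R + i];
        dst[y * width + x] = (unsigned char)(sum / 256);
    }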

  39. PERF-OPT QUICK REFERENCE CARD
     Category: Bandwidth Bound - Coalescing
     Problem: Too much unused data clogging the memory system
     Goal: Reduce traffic, move more useful data per request
     Indicators: Low global load/store efficiency, high #transactions/#requests compared to ideal
     Strategy: Improve memory coalescing by:
     • Cooperative loading inside a block
     • Changing the block layout
     • Aligning data
     • Changing the data layout to improve locality

  40. ITERATION 2

  41. IDENTIFY HOTSPOT
     gaussian_filter_7x7_v0() is still the hotspot.
     Kernel                  | Time     | Speedup
     Original Version        | 1.971ms  | 1.00x
     Better Memory Accesses  | 0.725ms  | 2.72x
