CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES
Stephen Jones, GTC 2017


  1. CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES Stephen Jones, GTC 2017

  2. The art of doing more with less

  3. RULE #1: DON’T TRY TOO HARD [Chart: performance vs. time, with a "Peak Performance" ceiling]

  4. RULE #1: DON’T TRY TOO HARD [Chart: the same curve, flagging the unrealistic effort/reward region near peak performance]

  5. RULE #1: DON’T TRY TOO HARD [Chart: a realistic optimization trajectory over time]

  6. RULE #1: DON’T TRY TOO HARD [Chart annotations: don’t waste the early flat time; get on the steep part of the curve; reduce the time spent there]

  7. RULE #1: DON’T TRY TOO HARD [Chart annotations along the curve: "Premature point of excitement"; "Wait, it’s going slower??"; "Trough of despair: most people give up here"; "4 weeks and this is it?"; "Point of diminishing returns"; "Here be ninjas"; "Hire an intern"]

  8. PERFORMANCE CONSTRAINTS [Pie chart: Memory 75%, Compute Intensity 10%, Occupancy 10%, Divergence 3%, Instruction 2%]

  9. PERFORMANCE CONSTRAINTS [Chart: the memory slice broken down into CPU <> GPU transfer, coalescence, divergent access, cache inefficiency, and register spilling, alongside compute intensity, divergence, instruction, and occupancy]

  10. MEMORY ORDERS OF MAGNITUDE [Diagram: SM registers & shared memory ~20,000 GB/sec; L2 cache ~2,000 GB/sec; GDRAM ~300 GB/sec; PCIe bus ~16 GB/sec; CPU DRAM ~150 GB/sec]

  11. TALK BREAKDOWN In no particular order: 1. Why Didn’t I Think Of That? 2. CPU Memory to GPU Memory (the PCIe Bus) 3. GPU Memory to the SM 4. Registers & Shared Memory 5. Occupancy, Divergence & Latency 6. Weird Things You Never Thought Of (and probably shouldn’t try)

  12. WHERE TO BEGIN?

  13. THE OBVIOUS: NVIDIA Visual Profiler. Start with the Visual Profiler.

  14. CPU <> GPU DATA MOVEMENT

  15. PCIe ISSUES Moving data over the PCIe bus [Diagram: the 16 GB/sec PCIe link between CPU memory and GPU registers/shared memory]

  16. PIN YOUR CPU MEMORY [Diagram: a data copy between pageable CPU memory and GPU memory]

  17. PIN YOUR CPU MEMORY [Diagram: the DMA controller moving the data]

  18. PIN YOUR CPU MEMORY [Diagram: the OS can swap a pageable page out from under the DMA controller]

  19. PIN YOUR CPU MEMORY For pageable memory, the CPU first allocates and pins a staging page and copies the data into it locally; only then can the DMA controller transfer the pinned copy to the GPU.

  20. PIN YOUR CPU MEMORY Make user data pinned so the DMA controller can read it directly:

    cudaHostAlloc( &data, size, cudaHostAllocMapped );        // allocate pinned memory
    cudaHostRegister( data, size, cudaHostRegisterDefault );  // or pin an existing allocation
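An illustrative sketch of the two pinning routes from the slide above (buffer names and the size are placeholders; error checking omitted): cudaHostAlloc allocates pinned memory outright, while cudaHostRegister pins an existing allocation, so subsequent async copies can use the DMA engine without a staging copy.

    // Sketch: pinned host memory enables direct DMA transfers.
    #include <cuda_runtime.h>
    #include <cstdlib>

    int main() {
        const size_t size = 1 << 24;             // 16 MB, arbitrary
        float *pinned, *registered, *device;

        // Option 1: allocate pinned memory up front
        cudaHostAlloc(&pinned, size, cudaHostAllocDefault);

        // Option 2: pin an existing allocation after the fact
        registered = (float *)malloc(size);
        cudaHostRegister(registered, size, cudaHostRegisterDefault);

        cudaMalloc(&device, size);

        // Copies from pinned memory skip the driver's staging copy
        cudaMemcpyAsync(device, pinned, size, cudaMemcpyHostToDevice);
        cudaDeviceSynchronize();

        cudaHostUnregister(registered);
        free(registered);
        cudaFreeHost(pinned);
        cudaFree(device);
        return 0;
    }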

  21. PIN YOUR CPU MEMORY

  22. REMEMBER: PCIe GOES BOTH WAYS

  23. STREAMS & CONCURRENCY Hiding the cost of data transfer. Operations in a single stream are ordered, but the hardware can copy and compute at the same time. [Timeline: a single stream serializes copy-to-GPU, compute, copy-to-host]

  24. STREAMS & CONCURRENCY [Timeline: with two streams, each stream’s copy up, work, and copy back overlap with the other’s, saving time versus the single-stream case]

  25. STREAMS & CONCURRENCY Can keep on breaking work into smaller chunks and saving time. [Timelines for 1, 2, and 8 streams]
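A minimal sketch of the chunking pattern from slides 23-25, assuming pinned host buffers and a placeholder process kernel: each stream copies its chunk up, computes, and copies back, so the copy engines and the SMs overlap across streams.

    // Sketch: chunk one big copy+compute+copy across streams.
    // "process" is a stand-in kernel; h_in/h_out must be pinned
    // or the copies will not overlap with compute.
    __global__ void process(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i] * 2.0f;     // stand-in for real work
    }

    void run(float *h_in, float *h_out, float *d_in, float *d_out, int total) {
        const int nStreams = 8;               // illustrative
        const int chunk = total / nStreams;   // assume it divides evenly
        cudaStream_t streams[nStreams];
        for (int i = 0; i < nStreams; i++) cudaStreamCreate(&streams[i]);

        for (int i = 0; i < nStreams; i++) {
            int off = i * chunk;
            cudaMemcpyAsync(d_in + off, h_in + off, chunk * sizeof(float),
                            cudaMemcpyHostToDevice, streams[i]);
            process<<<(chunk + 255) / 256, 256, 0, streams[i]>>>(
                d_in + off, d_out + off, chunk);
            cudaMemcpyAsync(h_out + off, d_out + off, chunk * sizeof(float),
                            cudaMemcpyDeviceToHost, streams[i]);
        }
        cudaDeviceSynchronize();
        for (int i = 0; i < nStreams; i++) cudaStreamDestroy(streams[i]);
    }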

  26. SMALL PCIe TRANSFERS PCIe is designed for large data transfers, but fine-grained copy/compute overlap prefers small transfers. So how small can we go? [Timelines for 1, 2, 8, and too many streams]

  27. APPARENTLY NOT THAT SMALL

  28. FROM GPU MEMORY TO GPU THREADS

  29. FEEDING THE MACHINE From GPU memory to the SMs [Diagram: GDRAM through L2 to SM registers and shared memory]

  30. USE THE PARALLEL ARCHITECTURE Hardware is optimized to use all SIMT threads at once: threads run in groups of 32, the cache is sized to service sets of 32 requests at a time, and high-speed GPU memory works best with linear access.

  31. VECTORIZE MEMORY LOADS Multi-word as well as multi-thread [Diagram: threads T0-T31 each load one int]

  32. VECTORIZE MEMORY LOADS [Diagram: threads T0-T15 and T16-T31 each load one int2] Fill multiple cache lines in a single fetch

  33. VECTORIZE MEMORY LOADS [Diagram: threads T0-T7, T8-T15, T16-T23, T24-T31 each load one int4] Fill multiple cache lines in a single fetch

  34. VECTORIZE MEMORY LOADS
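A minimal sketch of the vectorized load from slides 31-34: reading int4 instead of int moves 16 bytes per thread per fetch, so a warp fills four cache lines at once. Assumes 16-byte-aligned buffers and an element count divisible by 4; names are illustrative.

    // Sketch: each thread moves one int4 (16 bytes) instead of one int.
    __global__ void copy_int4(const int4 *input, int4 *output, int max4) {
        int id = threadIdx.x + blockDim.x * blockIdx.x;
        if (id < max4)                   // max4 = element count / 4
            output[id] = input[id];
    }

    // Launch: reinterpret int buffers as int4 (requires 16-byte alignment)
    // copy_int4<<<(n/4 + 255)/256, 256>>>((const int4 *)in, (int4 *)out, n/4);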

  35. DO MULTIPLE LOADS PER THREAD Multi-Thread, Multi-Word AND Multi-Iteration

    // One copy per thread: maximum overhead
    __global__ void copy(int2 *input, int2 *output, int max) {
        int id = threadIdx.x + blockDim.x * blockIdx.x;
        if( id < max ) {
            output[id] = input[id];
        }
    }

    // Multiple copies per thread: amortize overhead
    __global__ void copy(int2 *input, int2 *output, int max, int loadsPerThread) {
        int id = threadIdx.x + blockDim.x * blockIdx.x;
        for( int n = 0; n < loadsPerThread; n++ ) {
            if( id >= max ) {
                break;
            }
            output[id] = input[id];
            id += blockDim.x * gridDim.x;
        }
    }

  36. “MAXIMAL” LAUNCHES ARE BEST

  37. COALESCED MEMORY ACCESS It’s not just good enough to use all SIMT threads. Coalesced: sequential memory accesses are adjacent. Uncoalesced: sequential memory accesses are unassociated. [Diagram: threads 1-4 hitting adjacent vs. scattered addresses]

  38. SIMT PENALTIES WHEN NOT COALESCED x = data[threadIdx.x] is a single 32-wide operation; x = data[rand()] is 32 one-wide operations.

  39. SCATTER & GATHER Gathering: reading randomly, writing sequentially. Scattering: reading sequentially, writing randomly.

  40. AVOID SCATTER/GATHER IF YOU CAN

  41. AVOID SCATTER/GATHER IF YOU CAN

  42. SORTING MIGHT BE AN OPTION If reading non-sequential data is expensive, is it worth sorting it to make it sequential? [Diagram: gathering 2 4 1 3 is slow; after a sort, reading 1 2 3 4 is a fast coalesced read]

  43. SORTING MIGHT BE AN OPTION Even if you’re only going to read it twice, then yes!
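One low-effort way to do that sort is a library call. A sketch using Thrust’s sort_by_key (array names are illustrative): reorder the values by their access keys once, so later reads become sequential.

    // Sketch: sort values by their access keys so subsequent reads coalesce.
    #include <thrust/device_vector.h>
    #include <thrust/sort.h>

    void make_sequential(thrust::device_vector<int> &keys,
                         thrust::device_vector<float> &values) {
        // After this, values[i] corresponds to ascending keys[i],
        // so a linear pass over values replaces a random gather.
        thrust::sort_by_key(keys.begin(), keys.end(), values.begin());
    }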

  44. PRE-SORTING TURNS OUT TO BE GOOD

  45. DATA LAYOUT: “AOS vs. SOA” Sometimes you can’t just sort your data.

    #define NPTS 1024 * 1024

    // Array-of-Structures
    struct Coefficients_AOS {
        double u[3];
        double x[3][3];
        double p;
        double rho;
        double eta;
    };
    Coefficients_AOS gridData[NPTS];

    // Structure-of-Arrays
    struct Coefficients_SOA {
        double u[3][NPTS];
        double x[3][3][NPTS];
        double p[NPTS];
        double rho[NPTS];
        double eta[NPTS];
    };
    Coefficients_SOA gridData;

Single-thread code prefers arrays of structures, for cache efficiency; SIMT code prefers structures of arrays, for execution & memory efficiency.

  46. DATA LAYOUT: “AOS vs. SOA” [Diagram: the Coefficients_AOS structure definition beside its conceptual layout: u0 u1 u2, x00 through x22, p, rho, eta]

  47. AOS: STRIDED ARRAY ACCESS The GPU reads data one element at a time, but in parallel across the 32 threads of a warp. With the array-of-structures layout, double u0 = gridData[threadIdx.x].u[0]; makes consecutive threads stride through memory by the full struct size.

  48. SOA: COALESCED BUT COMPLEX The GPU reads data one element at a time, but in parallel across the 32 threads of a warp. With the structure-of-arrays layout, double u0 = gridData.u[0][threadIdx.x]; makes consecutive threads read consecutive addresses, so the access coalesces.
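A small sketch contrasting the two accesses in kernel form, using the struct definitions from slide 45 (kernel names are illustrative):

    // Same logical read, very different memory transactions.
    __global__ void read_u0_aos(const Coefficients_AOS *gridData, double *out) {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        // Strided: consecutive threads are sizeof(Coefficients_AOS) bytes apart
        out[i] = gridData[i].u[0];
    }

    __global__ void read_u0_soa(const Coefficients_SOA *gridData, double *out) {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        // Coalesced: consecutive threads read consecutive doubles
        out[i] = gridData->u[0][i];
    }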

  49. BLOCK-WIDE LOAD VIA SHARED MEMORY Read data linearly as bytes; use shared memory to convert to structs. [Diagram: the block copies device memory into shared memory]

  50. BLOCK-WIDE LOAD VIA SHARED MEMORY Read data linearly as bytes; use shared memory to convert to structs. [Diagram: the threads which own the data then grab it from shared memory]
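A hedged sketch of the block-wide load from slides 49-50, copying word-wise rather than byte-wise (assumes the struct size divides evenly into ints, which holds for Coefficients_AOS): the whole block streams the tile into shared memory with coalesced accesses, then each thread reads its own struct.

    // Sketch: load AOS data coalesced via shared memory, then read structs.
    __global__ void loadViaShmem(const Coefficients_AOS *in, double *out) {
        extern __shared__ int smem[];   // blockDim.x * sizeof(Coefficients_AOS) bytes
        const int wordsPerStruct = sizeof(Coefficients_AOS) / sizeof(int);
        const int totalWords = wordsPerStruct * blockDim.x;
        const int *src = (const int *)(in + (size_t)blockIdx.x * blockDim.x);

        // Phase 1: the whole block copies the tile linearly (coalesced)
        for (int w = threadIdx.x; w < totalWords; w += blockDim.x)
            smem[w] = src[w];
        __syncthreads();

        // Phase 2: each thread grabs its own struct from shared memory
        const Coefficients_AOS *tile = (const Coefficients_AOS *)smem;
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = tile[threadIdx.x].u[0];
    }

    // Launch with dynamic shared memory sized to the tile:
    // loadViaShmem<<<nBlocks, 256, 256 * sizeof(Coefficients_AOS)>>>(in, out);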

  51. CLEVER AOS/SOA TRICKS

  52. CLEVER AOS/SOA TRICKS Helps for any data size

  53. HANDY LIBRARY TO HELP YOU Trove: a utility library for fast AOS/SOA access and transposition. https://github.com/bryancatanzaro/trove

  54. (AB)USING THE CACHE

  55. MAKING THE MOST OF L2-CACHE L2 cache is fast (~2,000 GB/sec, vs. ~300 GB/sec for GDRAM) but small:

    Architecture   L2 Cache Size   Total Threads   Cache Bytes per Thread
    Kepler         1536 KB         30,720          51
    Maxwell        3072 KB         49,152          64
    Pascal         4096 KB         114,688         36

  56. TRAINING DEEP NEURAL NETWORKS

  57. LOTS OF PASSES OVER DATA [Diagram: an image passes through 3x3 (W1), 5x5 (W2), and 7x7 (W3) convolutions, implemented via FFT, combining to produce the answer "Cat!"]

  58. MULTI-RESOLUTION CONVOLUTIONS Pass 1: 3x3. Pass 2: 5x5. Pass 3: 7x7.

  59. TILED, MULTI-RESOLUTION CONVOLUTION Pass 1: 3x3. Pass 2: 5x5. Pass 3: 7x7. Each tile is sized to fit in L2 cache; do all 3 passes per tile.
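A host-side sketch of that tiling loop, under stated assumptions: the 4 MB figure is Pascal’s L2 from the table above, the conv* kernels are placeholders, and halo handling at tile edges is omitted.

    // Sketch: size each tile so input + output stay resident in L2,
    // then run all three passes on the tile before moving on, so
    // passes 2 and 3 hit cache instead of DRAM.
    __global__ void conv3x3(const float *in, float *out, size_t n);  // placeholder
    __global__ void conv5x5(const float *in, float *out, size_t n);  // placeholder
    __global__ void conv7x7(const float *in, float *out, size_t n);  // placeholder

    void convolveTiled(float *bufA, float *bufB, size_t nElems) {
        const size_t l2Bytes   = 4u * 1024 * 1024;              // e.g. Pascal: 4 MB
        const size_t tileElems = l2Bytes / (2 * sizeof(float)); // in + out resident

        for (size_t off = 0; off < nElems; off += tileElems) {
            size_t n = (nElems - off < tileElems) ? nElems - off : tileElems;
            int blocks = (int)((n + 255) / 256);
            conv3x3<<<blocks, 256>>>(bufA + off, bufB + off, n); // pass 1 fills L2
            conv5x5<<<blocks, 256>>>(bufB + off, bufA + off, n); // pass 2 hits L2
            conv7x7<<<blocks, 256>>>(bufA + off, bufB + off, n); // pass 3 hits L2
        }
    }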

  60. LAUNCHING FEWER THAN MAXIMUM THREADS

  61. SHARED MEMORY: DEFINITELY WORTH IT

  62. USING SHARED MEMORY WISELY Shared memory is arranged into “banks” for concurrent SIMT access: 32 threads can read simultaneously so long as they read from separate banks. Shared memory supports 4-byte and 8-byte “bank” sizes.
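The classic illustration of the bank rule is a shared-memory transpose tile: a 32x32 float tile read column-wise lands every thread of a warp in the same bank, while padding each row to 33 floats spreads those reads across all banks. A minimal sketch (assumes n is a multiple of 32):

    // Sketch: pad shared memory rows to avoid bank conflicts.
    #define TILE 32

    __global__ void transpose(float *out, const float *in, int n) {
        // +1 padding: column accesses now map to different banks
        __shared__ float tile[TILE][TILE + 1];

        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        tile[threadIdx.y][threadIdx.x] = in[y * n + x];   // coalesced read
        __syncthreads();

        x = blockIdx.y * TILE + threadIdx.x;              // swap block indices
        y = blockIdx.x * TILE + threadIdx.y;
        out[y * n + x] = tile[threadIdx.x][threadIdx.y];  // conflict-free read
    }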

  63. STENCIL ALGORITHM Many algorithms have high data re-use: potentially good for shared memory. “Stencil” algorithms accumulate data from neighbours onto a central point. A stencil has width “W” (in the case above, W=5), so adjacent threads share (W-1) items of data: good potential for re-use.

  64. STENCILS IN SHARED MEMORY
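A minimal 1D version of the W=5 stencil staged through shared memory, matching the re-use argument on slide 63 (boundary handling omitted; assumes blockDim.x == BLOCK and the halo reads stay in range):

    // Sketch: 1D stencil of width W=5 (RADIUS=2) via shared memory.
    #define RADIUS 2
    #define BLOCK  256

    __global__ void stencil5(const float *in, float *out, int n) {
        __shared__ float s[BLOCK + 2 * RADIUS];
        int g = blockIdx.x * blockDim.x + threadIdx.x;   // global index
        int l = threadIdx.x + RADIUS;                    // local index

        // Each thread loads its own element; edge threads load the halo
        s[l] = in[g];
        if (threadIdx.x < RADIUS) {
            s[l - RADIUS] = in[g - RADIUS];              // left halo
            s[l + BLOCK]  = in[g + BLOCK];               // right halo
        }
        __syncthreads();

        // Each input element is re-used by up to W=5 threads from shared memory
        float acc = 0.0f;
        for (int k = -RADIUS; k <= RADIUS; k++)
            acc += s[l + k];
        out[g] = acc;
    }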

  65. SIZE MATTERS
