Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time Tasks in Autonomous Systems


  1. Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time Tasks in Autonomous Systems. Ming Yang, Nathan Otterness, Tanya Amert, Joshua Bakita, James H. Anderson, F. Donelson Smith. All image sources and references are provided at the end.

  2. (Image-only slide.)

  3. (Venn diagram: Computer Vision & AI expertise, GPU expertise, and real-time behavior expertise.)

  4. Pitfalls for Real-Time GPU Usage
     ● Synchronization and blocking
     ● GPU concurrency and performance
     ● CUDA programming perils

  5. CUDA Programming Fundamentals
     (i) Allocate GPU memory: cudaMalloc(&devicePtr, bufferSize);
     (ii) Copy data from CPU to GPU: cudaMemcpy(devicePtr, hostPtr, bufferSize, cudaMemcpyHostToDevice);
     (iii) Launch the kernel: computeResult<<<numBlocks, threadsPerBlock>>>(devicePtr); (kernel = code that runs on the GPU)
     (iv) Copy results from GPU to CPU: cudaMemcpy(hostPtr, devicePtr, bufferSize, cudaMemcpyDeviceToHost);
     (v) Free GPU memory: cudaFree(devicePtr);
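
A minimal, self-contained sketch of these five steps (the kernel body, buffer size, and launch geometry are illustrative additions; note that cudaMemcpy also requires a direction flag, which the slide's abbreviated code omits):

    #include <cuda_runtime.h>
    #include <stdio.h>
    #include <stdlib.h>

    // Illustrative kernel: each GPU thread doubles one element in place.
    __global__ void computeResult(float *data, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) data[i] *= 2.0f;
    }

    int main(void) {
      const int n = 1 << 20;
      const size_t bufferSize = n * sizeof(float);
      float *hostPtr = (float *)malloc(bufferSize);
      for (int i = 0; i < n; i++) hostPtr[i] = (float)i;

      float *devicePtr;
      cudaMalloc(&devicePtr, bufferSize);                                 // (i)   allocate GPU memory
      cudaMemcpy(devicePtr, hostPtr, bufferSize, cudaMemcpyHostToDevice); // (ii)  copy CPU -> GPU
      const int threadsPerBlock = 256;
      const int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;
      computeResult<<<numBlocks, threadsPerBlock>>>(devicePtr, n);        // (iii) launch the kernel
      cudaMemcpy(hostPtr, devicePtr, bufferSize, cudaMemcpyDeviceToHost); // (iv)  copy GPU -> CPU
      cudaFree(devicePtr);                                                // (v)   free GPU memory

      printf("hostPtr[1] = %f\n", hostPtr[1]);  // expect 2.0
      free(hostPtr);
      return 0;
    }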

  11. Pitfalls for Real-Time GPU Usage
     ● Synchronization and blocking
     ● GPU concurrency and performance
     ● CUDA programming perils

  12. Explicit Synchronization

  13. Explicit Synchronization: Each CUDA stream is managed by a separate CPU thread in the same address space.

  14. Explicit Synchronization (timeline: K1 starts, K1 completes)

  15. Explicit Synchronization (timeline: kernels of 1024 threads and 256 threads)

  16. Explicit Synchronization
      1. Thread 3 calls cudaDeviceSynchronize (explicit synchronization). (a)
      2. Thread 3 sleeps for 0.2 seconds. (c)
      3. Thread 3 launches kernel K3. (d)

  17. Explicit Synchronization
      1. Thread 3 calls cudaDeviceSynchronize (explicit synchronization). (a)
      2. Thread 4 launches kernel K4. (b)
      3. Thread 3 sleeps for 0.2 seconds. (c)
      4. Thread 3 launches kernel K3. (d)

  18. Explicit Synchronization
      ➔ Pitfall 1. Explicit synchronization does not block future commands issued by other tasks.
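
A sketch of the setup behind slides 13-18, with two CPU threads in one address space, each issuing work to its own stream (the spin kernel and cycle counts are illustrative stand-ins for K1-K4, not the paper's benchmark): thread 3's cudaDeviceSynchronize waits for the GPU's outstanding work, but nothing prevents thread 4 from launching new kernels in the meantime.

    #include <cuda_runtime.h>
    #include <thread>

    // Busy-wait kernel: a stand-in for the long-running kernels K1-K4.
    __global__ void spin(long long cycles) {
      long long start = clock64();
      while (clock64() - start < cycles) { }
    }

    void worker(bool synchronize) {
      cudaStream_t stream;
      cudaStreamCreate(&stream);
      spin<<<1, 256, 0, stream>>>(1LL << 30);
      if (synchronize) {
        // Explicit synchronization: waits for all GPU work issued so far,
        // but does NOT stop the other thread from launching new kernels
        // in the meantime (Pitfall 1).
        cudaDeviceSynchronize();
      }
      spin<<<1, 256, 0, stream>>>(1LL << 30);
      cudaStreamSynchronize(stream);
      cudaStreamDestroy(stream);
    }

    int main() {
      std::thread t3(worker, true);   // plays the role of "thread 3"
      std::thread t4(worker, false);  // plays the role of "thread 4"
      t3.join();
      t4.join();
      return 0;
    }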

  19. Implicit Synchronization
      CUDA Toolkit 9.2.88 Programming Guide, Section 3.2.5.5.4, "Implicit Synchronization": two commands from different streams cannot run concurrently [if separated by]:
      1. A page-locked host memory allocation
      2. A device memory allocation
      3. A device memory set
      4. A memory copy between two addresses to the same device memory
      5. Any CUDA command to the NULL stream

  20. Implicit Synchronization
      ➔ Pitfall 2. Documented sources of implicit synchronization may not occur.
      1. A page-locked host memory allocation
      2. A device memory allocation
      3. A device memory set
      4. A memory copy between two addresses to the same device memory
      5. Any CUDA command to the NULL stream

  21. Implicit Synchronization

  22. Implicit Synchronization
      1. Thread 3 calls cudaFree. (a)
      2. Thread 3 sleeps for 0.2 seconds. (c)
      3. Thread 3 launches kernel K3. (d)

  23. Implicit Synchronization
      1. Thread 3 calls cudaFree. (a)
      2. Thread 4 is blocked on the CPU when trying to launch kernel K4. (b)
      3. Thread 4 finishes launching kernel K4; thread 3 sleeps for 0.2 seconds. (c)
      4. Thread 3 launches kernel K3. (d)

  24. Implicit Synchronization
      ➔ Pitfall 3. The CUDA documentation neglects to list some functions that cause implicit synchronization.
      ➔ Pitfall 4. Some CUDA API functions will block future CUDA tasks on the CPU.
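
A sketch of the cudaFree scenario from slides 22-23 (the spin kernel and sizes are illustrative; the blocking described in the comments is the observed behavior the slides report, not behavior the documentation guarantees):

    #include <cuda_runtime.h>
    #include <thread>

    // Busy-wait kernel: a stand-in for the long-running kernels on the slide.
    __global__ void spin(long long cycles) {
      long long start = clock64();
      while (clock64() - start < cycles) { }
    }

    int main() {
      float *devicePtr;
      cudaMalloc(&devicePtr, 1 << 20);

      cudaStream_t s3, s4;
      cudaStreamCreate(&s3);
      cudaStreamCreate(&s4);

      // A kernel is already in flight when cudaFree is called.
      spin<<<1, 256, 0, s3>>>(1LL << 31);

      std::thread t3([&] {
        // cudaFree is absent from the documented list of implicit-
        // synchronization sources, yet it waits for the in-flight
        // kernel (Pitfall 3)...
        cudaFree(devicePtr);
      });
      std::thread t4([&] {
        // ...and while cudaFree is pending, this launch blocks on the
        // CPU instead of being queued asynchronously (Pitfall 4).
        spin<<<1, 256, 0, s4>>>(1LL << 20);
        cudaStreamSynchronize(s4);
      });
      t3.join();
      t4.join();
      cudaStreamDestroy(s3);
      cudaStreamDestroy(s4);
      return 0;
    }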

  25. Pitfalls for Real-Time GPU Usage
     ● Synchronization and blocking
       ○ Suggestion: use CUDA Multi-Process Service (MPS).
     ● GPU concurrency and performance
     ● CUDA programming perils

  27. GPU Concurrency and Performance
      ● Implicit synchronization penalty = Processes with MPS vs. Threads

  28. GPU Concurrency and Performance
      ● Implicit synchronization penalty = Processes with MPS vs. Threads
      ● GPU concurrency benefit = Processes with MPS vs. Processes without MPS

  29. GPU Concurrency and Performance
      ● Implicit synchronization penalty = Processes with MPS vs. Threads
      ● GPU concurrency benefit = Processes with MPS vs. Processes without MPS
      ● MPS overhead = Threads vs. Threads with MPS (not in plots)

  30. GPU Concurrency and Performance

  31. GPU Concurrency and Performance

  32. GPU Concurrency and Performance: 70% of the time, a single Hough transform iteration completed in 12 ms or less.

  33. GPU Concurrency and Performance: This occurred when four concurrent instances were running in separate CPU threads.

  34. GPU Concurrency and Performance: The observed WCET (worst-case execution time) using threads was over 4x the WCET using multiple processes.

  35. GPU Concurrency and Performance

  36. GPU Concurrency and Performance

  37. GPU Concurrency and Performance
      ➔ Pitfall 5. The suggestion from NVIDIA’s documentation to exploit concurrency through user-defined streams may be of limited use.
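
For concreteness, the documented recipe that Pitfall 5 calls into question looks roughly like this (the kernel, buffer sizes, and four-stream structure are illustrative):

    #include <cuda_runtime.h>

    // Illustrative kernel: doubles each element of a buffer.
    __global__ void computeResult(float *data, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) data[i] *= 2.0f;
    }

    int main() {
      const int n = 1 << 20;
      const int kStreams = 4;
      cudaStream_t streams[kStreams];
      float *buffers[kStreams];
      for (int i = 0; i < kStreams; i++) {
        cudaStreamCreate(&streams[i]);
        cudaMalloc(&buffers[i], n * sizeof(float));
      }

      // One user-defined stream per independent job, so the kernels MAY
      // overlap on the GPU. Whether they actually run concurrently, and
      // whether that helps, is not guaranteed (Pitfall 5).
      for (int i = 0; i < kStreams; i++) {
        computeResult<<<(n + 255) / 256, 256, 0, streams[i]>>>(buffers[i], n);
      }

      cudaDeviceSynchronize();
      for (int i = 0; i < kStreams; i++) {
        cudaFree(buffers[i]);
        cudaStreamDestroy(streams[i]);
      }
      return 0;
    }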

  38. Pitfalls for Real-Time GPU Usage
     ● Synchronization and blocking
       ○ Suggestion: use CUDA Multi-Process Service (MPS).
     ● GPU concurrency and performance
     ● CUDA programming perils

  41. Synchronous Defaults
      if (!CheckCUDAError(cudaMemsetAsync(
          state->device_block_smids, 0, data_size))) {
        return 0;
      }
      Why does this cause implicit synchronization?

  42. Synchronous Defaults
      • The CUDA docs say that memset causes implicit synchronization...
      if (!CheckCUDAError(cudaMemsetAsync(
          state->device_block_smids, 0, data_size))) {
        return 0;
      }

  43. Synchronous Defaults
      • The CUDA docs say that memset causes implicit synchronization...
      • But didn't slide 20 say memset doesn't cause implicit synchronization?
      if (!CheckCUDAError(cudaMemsetAsync(
          state->device_block_smids, 0, data_size))) {
        return 0;
      }

  44. Synchronous Defaults
      Before (stream argument omitted):
      if (!CheckCUDAError(cudaMemsetAsync(
          state->device_block_smids, 0, data_size))) {
        return 0;
      }
      After (explicit user-defined stream):
      if (!CheckCUDAError(cudaMemsetAsync(
          state->device_block_smids, 0, data_size,
          state->stream))) {
        return 0;
      }
      ➔ Pitfall 6. Async CUDA functions use the GPU-synchronous NULL stream by default.
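
A self-contained sketch of the corrected pattern (the names and allocation size are illustrative): create a user-defined stream once and pass it explicitly to every Async call.

    #include <cuda_runtime.h>

    int main() {
      const size_t data_size = 1 << 20;  // illustrative size
      void *device_ptr;
      cudaMalloc(&device_ptr, data_size);

      // Create a user-defined (non-NULL) stream once, up front.
      cudaStream_t stream;
      cudaStreamCreate(&stream);

      // Without the final stream argument, cudaMemsetAsync runs in the
      // GPU-synchronous NULL stream (Pitfall 6); passing the stream
      // explicitly keeps it out of the NULL stream.
      cudaMemsetAsync(device_ptr, 0, data_size, stream);

      cudaStreamSynchronize(stream);
      cudaStreamDestroy(stream);
      cudaFree(device_ptr);
      return 0;
    }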

  45. Other Perils
      ➔ Pitfall 7. Observed CUDA behavior often diverges from what the documentation states or implies.

  46. Other Perils
      ➔ Pitfall 8. CUDA documentation can be contradictory.

  47. Other Perils
      ➔ Pitfall 8. CUDA documentation can be contradictory.
      CUDA Programming Guide, Section 3.2.5.1: "The following device operations are asynchronous with respect to the host: [...] memory copies performed by functions that are suffixed with Async."
      CUDA Runtime API documentation, Section 2: "For transfers from device memory to pageable host memory, [cudaMemcpyAsync] will return only once the copy has completed."
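
The contradiction is easy to trip over in practice. In the sketch below (sizes and names are illustrative), the two copies are written identically except for the destination; per the Runtime API text quoted above, only the copy into page-locked memory can actually overlap with host execution.

    #include <cuda_runtime.h>
    #include <stdlib.h>

    int main() {
      const size_t size = 1 << 24;  // illustrative size
      void *device_ptr, *pageable, *pinned;
      cudaMalloc(&device_ptr, size);
      pageable = malloc(size);        // ordinary pageable host memory
      cudaMallocHost(&pinned, size);  // page-locked ("pinned") host memory

      cudaStream_t stream;
      cudaStreamCreate(&stream);

      // Per the Runtime API text above, this device-to-pageable copy
      // returns only once the copy has completed, despite its Async name...
      cudaMemcpyAsync(pageable, device_ptr, size, cudaMemcpyDeviceToHost, stream);

      // ...whereas this device-to-pinned copy can genuinely proceed
      // asynchronously with respect to the host.
      cudaMemcpyAsync(pinned, device_ptr, size, cudaMemcpyDeviceToHost, stream);

      cudaStreamSynchronize(stream);
      cudaStreamDestroy(stream);
      cudaFreeHost(pinned);
      free(pageable);
      cudaFree(device_ptr);
      return 0;
    }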

  48. Other Perils
      ➔ Pitfall 9. What we learn about current black-box GPUs may not apply in the future.
