Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time Tasks in Autonomous Systems


  1. Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time Tasks in Autonomous Systems. Ming Yang, Nathan Otterness, Tanya Amert, Joshua Bakita, James H. Anderson, F. Donelson Smith. All image sources and references are provided at the end.

  2. (Image-only slide.)

  3. (Venn diagram: Computer Vision & AI expertise, GPU expertise, and real-time behavior expertise.)

  4. Pitfalls for Real-Time GPU Usage
     ● Synchronization and blocking
     ● GPU concurrency and performance
     ● CUDA programming perils

  5. CUDA Programming Fundamentals
     (i) Allocate GPU memory: cudaMalloc(&devicePtr, bufferSize);
     (ii) Copy data from CPU to GPU: cudaMemcpy(devicePtr, hostPtr, bufferSize, cudaMemcpyHostToDevice);
     (iii) Launch the kernel: computeResult<<<numBlocks, threadsPerBlock>>>(devicePtr); (kernel = code that runs on the GPU)
     (iv) Copy results from GPU to CPU: cudaMemcpy(hostPtr, devicePtr, bufferSize, cudaMemcpyDeviceToHost);
     (v) Free GPU memory: cudaFree(devicePtr);
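
A minimal, self-contained sketch of these five steps (the kernel body, buffer size, and launch geometry are illustrative additions; note that cudaMemcpy also requires a direction flag, which the slide's abbreviated code omits):

    #include <cuda_runtime.h>
    #include <stdio.h>
    #include <stdlib.h>

    // Illustrative kernel: each GPU thread doubles one element in place.
    __global__ void computeResult(float *data, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) data[i] *= 2.0f;
    }

    int main(void) {
      const int n = 1 << 20;
      const size_t bufferSize = n * sizeof(float);
      float *hostPtr = (float *)malloc(bufferSize);
      for (int i = 0; i < n; i++) hostPtr[i] = (float)i;

      float *devicePtr;
      cudaMalloc(&devicePtr, bufferSize);                                 // (i)   allocate GPU memory
      cudaMemcpy(devicePtr, hostPtr, bufferSize, cudaMemcpyHostToDevice); // (ii)  copy CPU -> GPU
      const int threadsPerBlock = 256;
      const int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;
      computeResult<<<numBlocks, threadsPerBlock>>>(devicePtr, n);        // (iii) launch the kernel
      cudaMemcpy(hostPtr, devicePtr, bufferSize, cudaMemcpyDeviceToHost); // (iv)  copy GPU -> CPU
      cudaFree(devicePtr);                                                // (v)   free GPU memory

      printf("hostPtr[1] = %f\n", hostPtr[1]);  // expect 2.0
      free(hostPtr);
      return 0;
    }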

  11. Pitfalls for Real-Time GPU Usage
     ● Synchronization and blocking
     ● GPU concurrency and performance
     ● CUDA programming perils

  12. Explicit Synchronization

  13. Explicit Synchronization: Each CUDA stream is managed by a separate CPU thread in the same address space.

  14. Explicit Synchronization (timeline: K1 starts, K1 completes)

  15. Explicit Synchronization (timeline: kernels of 1024 threads and 256 threads)

  16. Explicit Synchronization
      1. Thread 3 calls cudaDeviceSynchronize (explicit synchronization). (a)
      2. Thread 3 sleeps for 0.2 seconds. (c)
      3. Thread 3 launches kernel K3. (d)

  17. Explicit Synchronization
      1. Thread 3 calls cudaDeviceSynchronize (explicit synchronization). (a)
      2. Thread 4 launches kernel K4. (b)
      3. Thread 3 sleeps for 0.2 seconds. (c)
      4. Thread 3 launches kernel K3. (d)

  18. Explicit Synchronization
      ➔ Pitfall 1. Explicit synchronization does not block future commands issued by other tasks.
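
A sketch of the setup behind slides 13-18, with two CPU threads in one address space, each issuing work to its own stream (the spin kernel and cycle counts are illustrative stand-ins for K1-K4, not the paper's benchmark): thread 3's cudaDeviceSynchronize waits for the GPU's outstanding work, but nothing prevents thread 4 from launching new kernels in the meantime.

    #include <cuda_runtime.h>
    #include <thread>

    // Busy-wait kernel: a stand-in for the long-running kernels K1-K4.
    __global__ void spin(long long cycles) {
      long long start = clock64();
      while (clock64() - start < cycles) { }
    }

    void worker(bool synchronize) {
      cudaStream_t stream;
      cudaStreamCreate(&stream);
      spin<<<1, 256, 0, stream>>>(1LL << 30);
      if (synchronize) {
        // Explicit synchronization: waits for all GPU work issued so far,
        // but does NOT stop the other thread from launching new kernels
        // in the meantime (Pitfall 1).
        cudaDeviceSynchronize();
      }
      spin<<<1, 256, 0, stream>>>(1LL << 30);
      cudaStreamSynchronize(stream);
      cudaStreamDestroy(stream);
    }

    int main() {
      std::thread t3(worker, true);   // plays the role of "thread 3"
      std::thread t4(worker, false);  // plays the role of "thread 4"
      t3.join();
      t4.join();
      return 0;
    }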

  19. Implicit Synchronization
      CUDA Toolkit 9.2.88 Programming Guide, Section 3.2.5.5.4, "Implicit Synchronization": two commands from different streams cannot run concurrently [if separated by]:
      1. A page-locked host memory allocation
      2. A device memory allocation
      3. A device memory set
      4. A memory copy between two addresses to the same device memory
      5. Any CUDA command to the NULL stream

  20. Implicit Synchronization
      ➔ Pitfall 2. Documented sources of implicit synchronization may not occur.
      1. A page-locked host memory allocation
      2. A device memory allocation
      3. A device memory set
      4. A memory copy between two addresses to the same device memory
      5. Any CUDA command to the NULL stream

  21. Implicit Synchronization

  22. Implicit Synchronization
      1. Thread 3 calls cudaFree. (a)
      2. Thread 3 sleeps for 0.2 seconds. (c)
      3. Thread 3 launches kernel K3. (d)

  23. Implicit Synchronization
      1. Thread 3 calls cudaFree. (a)
      2. Thread 4 is blocked on the CPU when trying to launch kernel K4. (b)
      3. Thread 4 finishes launching kernel K4; thread 3 sleeps for 0.2 seconds. (c)
      4. Thread 3 launches kernel K3. (d)

  24. Implicit Synchronization
      ➔ Pitfall 3. The CUDA documentation neglects to list some functions that cause implicit synchronization.
      ➔ Pitfall 4. Some CUDA API functions will block future CUDA tasks on the CPU.
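
A sketch of the cudaFree scenario from slides 22-23 (the spin kernel and sizes are illustrative; the blocking described in the comments is the observed behavior the slides report, not behavior the documentation guarantees):

    #include <cuda_runtime.h>
    #include <thread>

    // Busy-wait kernel: a stand-in for the long-running kernels on the slide.
    __global__ void spin(long long cycles) {
      long long start = clock64();
      while (clock64() - start < cycles) { }
    }

    int main() {
      float *devicePtr;
      cudaMalloc(&devicePtr, 1 << 20);

      cudaStream_t s3, s4;
      cudaStreamCreate(&s3);
      cudaStreamCreate(&s4);

      // A kernel is already in flight when cudaFree is called.
      spin<<<1, 256, 0, s3>>>(1LL << 31);

      std::thread t3([&] {
        // cudaFree is absent from the documented list of implicit-
        // synchronization sources, yet it waits for the in-flight
        // kernel (Pitfall 3)...
        cudaFree(devicePtr);
      });
      std::thread t4([&] {
        // ...and while cudaFree is pending, this launch blocks on the
        // CPU instead of being queued asynchronously (Pitfall 4).
        spin<<<1, 256, 0, s4>>>(1LL << 20);
        cudaStreamSynchronize(s4);
      });
      t3.join();
      t4.join();
      cudaStreamDestroy(s3);
      cudaStreamDestroy(s4);
      return 0;
    }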

  25. Pitfalls for Real-Time GPU Usage
     ● Synchronization and blocking
       ○ Suggestion: use CUDA Multi-Process Service (MPS).
     ● GPU concurrency and performance
     ● CUDA programming perils

  27. GPU Concurrency and Performance
      ● Implicit synchronization penalty = Processes with MPS vs. Threads

  28. GPU Concurrency and Performance
      ● Implicit synchronization penalty = Processes with MPS vs. Threads
      ● GPU concurrency benefit = Processes with MPS vs. Processes without MPS

  29. GPU Concurrency and Performance
      ● Implicit synchronization penalty = Processes with MPS vs. Threads
      ● GPU concurrency benefit = Processes with MPS vs. Processes without MPS
      ● MPS overhead = Threads vs. Threads with MPS (not in plots)

  30. GPU Concurrency and Performance

  31. GPU Concurrency and Performance

  32. GPU Concurrency and Performance: 70% of the time, a single Hough transform iteration completed in 12 ms or less.

  33. GPU Concurrency and Performance: This occurred when four concurrent instances were running in separate CPU threads.

  34. GPU Concurrency and Performance: The observed WCET (worst-case execution time) using threads was over 4x the WCET using multiple processes.

  35. GPU Concurrency and Performance

  36. GPU Concurrency and Performance

  37. GPU Concurrency and Performance
      ➔ Pitfall 5. The suggestion from NVIDIA’s documentation to exploit concurrency through user-defined streams may be of limited use.
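
For concreteness, the documented recipe that Pitfall 5 calls into question looks roughly like this (the kernel, buffer sizes, and four-stream structure are illustrative):

    #include <cuda_runtime.h>

    // Illustrative kernel: doubles each element of a buffer.
    __global__ void computeResult(float *data, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) data[i] *= 2.0f;
    }

    int main() {
      const int n = 1 << 20;
      const int kStreams = 4;
      cudaStream_t streams[kStreams];
      float *buffers[kStreams];
      for (int i = 0; i < kStreams; i++) {
        cudaStreamCreate(&streams[i]);
        cudaMalloc(&buffers[i], n * sizeof(float));
      }

      // One user-defined stream per independent job, so the kernels MAY
      // overlap on the GPU. Whether they actually run concurrently, and
      // whether that helps, is not guaranteed (Pitfall 5).
      for (int i = 0; i < kStreams; i++) {
        computeResult<<<(n + 255) / 256, 256, 0, streams[i]>>>(buffers[i], n);
      }

      cudaDeviceSynchronize();
      for (int i = 0; i < kStreams; i++) {
        cudaFree(buffers[i]);
        cudaStreamDestroy(streams[i]);
      }
      return 0;
    }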

  38. Pitfalls for Real-Time GPU Usage
     ● Synchronization and blocking
       ○ Suggestion: use CUDA Multi-Process Service (MPS).
     ● GPU concurrency and performance
     ● CUDA programming perils

  41. Synchronous Defaults
      if (!CheckCUDAError(cudaMemsetAsync(
          state->device_block_smids, 0, data_size))) {
        return 0;
      }
      Why does this cause implicit synchronization?

  42. Synchronous Defaults
      • The CUDA docs say that memset causes implicit synchronization...
      if (!CheckCUDAError(cudaMemsetAsync(
          state->device_block_smids, 0, data_size))) {
        return 0;
      }

  43. Synchronous Defaults
      • The CUDA docs say that memset causes implicit synchronization...
      • But didn't slide 20 say memset doesn't cause implicit synchronization?
      if (!CheckCUDAError(cudaMemsetAsync(
          state->device_block_smids, 0, data_size))) {
        return 0;
      }

  44. Synchronous Defaults
      Before (stream argument omitted):
      if (!CheckCUDAError(cudaMemsetAsync(
          state->device_block_smids, 0, data_size))) {
        return 0;
      }
      After (explicit user-defined stream):
      if (!CheckCUDAError(cudaMemsetAsync(
          state->device_block_smids, 0, data_size,
          state->stream))) {
        return 0;
      }
      ➔ Pitfall 6. Async CUDA functions use the GPU-synchronous NULL stream by default.
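
A self-contained sketch of the corrected pattern (the names and allocation size are illustrative): create a user-defined stream once and pass it explicitly to every Async call.

    #include <cuda_runtime.h>

    int main() {
      const size_t data_size = 1 << 20;  // illustrative size
      void *device_ptr;
      cudaMalloc(&device_ptr, data_size);

      // Create a user-defined (non-NULL) stream once, up front.
      cudaStream_t stream;
      cudaStreamCreate(&stream);

      // Without the final stream argument, cudaMemsetAsync runs in the
      // GPU-synchronous NULL stream (Pitfall 6); passing the stream
      // explicitly keeps it out of the NULL stream.
      cudaMemsetAsync(device_ptr, 0, data_size, stream);

      cudaStreamSynchronize(stream);
      cudaStreamDestroy(stream);
      cudaFree(device_ptr);
      return 0;
    }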

  45. Other Perils
      ➔ Pitfall 7. Observed CUDA behavior often diverges from what the documentation states or implies.

  46. Other Perils
      ➔ Pitfall 8. CUDA documentation can be contradictory.

  47. Other Perils
      ➔ Pitfall 8. CUDA documentation can be contradictory.
      CUDA Programming Guide, Section 3.2.5.1: "The following device operations are asynchronous with respect to the host: [...] memory copies performed by functions that are suffixed with Async."
      CUDA Runtime API documentation, Section 2: "For transfers from device memory to pageable host memory, [cudaMemcpyAsync] will return only once the copy has completed."
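
The contradiction is easy to trip over in practice. In the sketch below (sizes and names are illustrative), the two copies are written identically except for the destination; per the Runtime API text quoted above, only the copy into page-locked memory can actually overlap with host execution.

    #include <cuda_runtime.h>
    #include <stdlib.h>

    int main() {
      const size_t size = 1 << 24;  // illustrative size
      void *device_ptr, *pageable, *pinned;
      cudaMalloc(&device_ptr, size);
      pageable = malloc(size);        // ordinary pageable host memory
      cudaMallocHost(&pinned, size);  // page-locked ("pinned") host memory

      cudaStream_t stream;
      cudaStreamCreate(&stream);

      // Per the Runtime API text above, this device-to-pageable copy
      // returns only once the copy has completed, despite its Async name...
      cudaMemcpyAsync(pageable, device_ptr, size, cudaMemcpyDeviceToHost, stream);

      // ...whereas this device-to-pinned copy can genuinely proceed
      // asynchronously with respect to the host.
      cudaMemcpyAsync(pinned, device_ptr, size, cudaMemcpyDeviceToHost, stream);

      cudaStreamSynchronize(stream);
      cudaStreamDestroy(stream);
      cudaFreeHost(pinned);
      free(pageable);
      cudaFree(device_ptr);
      return 0;
    }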

  48. Other Perils
      ➔ Pitfall 9. What we learn about current black-box GPUs may not apply in the future.
