
  1. Real-Time GPU Management Heechul Yun 1

  2. This Week • Topic: General Purpose Graphic Processing Unit (GPGPU) management • Today – GPU architecture – GPU programming model – Challenges – Real-Time GPU management 2

  3. History • GPU – Graphics is embarrassingly parallel by nature – GeForce 6800 (2003): 53 GFLOPS (MUL) – Some PhD students tried to use GPUs for general-purpose computing, but they were difficult to program • GPGPU – Ian Buck (Stanford PhD, 2004) joined Nvidia and created the CUDA language and runtime – General purpose: (relatively) easy to program; many scientific applications 3

  4. Discrete GPU • Add-on PCIe cards on a PC – GPU and CPU memories are separate – GPU memory (GDDR) is much faster than CPU memory (DDR) [Figure: Intel Core i7 (4 CPU cores) with host DRAM, connected over PCIe 3.0 to an Nvidia Tesla K80 (4992 GPU cores) with graphics DRAM] 4

  5. Integrated CPU-GPU SoC • Tighter integration of CPU and GPU – Memory is shared by both CPU and GPU – Good for embedded systems (e.g., smartphones) [Figure: Nvidia Tegra K1 with four CPU cores and GPU cores behind a shared memory controller and shared DRAM; from “GPUSync: A Framework for Real-Time GPU Management”] 5

  6. NVIDIA Titan Xp • 3840 CUDA cores, 12GB GDDR5X • Peak performance: 12 TFLOPS 6

  7. NVIDIA Jetson TX2 • 256 CUDA GPU cores + 4 CPU cores Image credit: T. Amert et al., “GPU Scheduling on the NVIDIA TX2: Hidden Details Revealed,” RTSS'17 7

  8. NVIDIA Jetson Platforms 8

  9. CPU vs. GPGPU • CPU – Designed to run sequential programs faster – High ILP: pipelining, superscalar, out-of-order execution, multi-level cache hierarchy – Powerful, but complex and big • GPGPU – Designed to compute math faster on embarrassingly parallel data (e.g., pixels) – No need for complex logic (no superscalar, out-of-order execution, or cache hierarchy) – Simple and less powerful, but small, so many cores fit on one chip 9

  10.–15. [Figure sequence] “From Shader Code to a Teraflop: How GPU Shader Cores Work”, Kayvon Fatahalian, Stanford University

  16. GPU Programming Model • Host = CPU • Device = GPU • Kernel – Function that executes on the device – Multiple threads execute each kernel 16
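The host/device/kernel model above can be sketched with the canonical CUDA vector-add example (a minimal sketch with error checking omitted; the names are illustrative):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Kernel: runs on the device; one thread computes one element.
__global__ void vadd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *ha = (float *)malloc(bytes), *hb = (float *)malloc(bytes),
          *hc = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { ha[i] = (float)i; hb[i] = 2.0f * i; }

    // Host memory and device memory are separate: allocate and copy explicitly.
    float *da, *db, *dc;
    cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    // Launch: many threads (a grid of 256-thread blocks) execute the kernel.
    vadd<<<(n + 255) / 256, 256>>>(da, db, dc, n);

    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);
    printf("c[42] = %f\n", hc[42]);
    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(ha); free(hb); free(hc);
    return 0;
}
```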

  17.–20. [CUDA basics, figure sequence] Source: http://www.sdsc.edu/us/training/assets/docs/NVIDIA-02-BasicsOfCUDA.pdf

  21. Challenges for Discrete GPU • Data movement problem – Host memory <-> GPU memory – Copy overhead can be high • Scheduling problem – Limited ability to prioritize important GPU kernels – Most (older) GPUs don’t support preemption – Newer GPUs support preemption within a process [Figure: user and kernel buffers on the CPU side, copied over PCIe 3.0 to the GPU] 21

  22. Data Movement Challenge • Data transfer is the bottleneck [Figure: Intel Core i7 (4 CPU cores) ↔ host DRAM at 25 GB/s; Nvidia Tesla K80 (4992 GPU cores) ↔ graphics DRAM at 480 GB/s; CPU ↔ GPU over PCIe 3.0 at 16 GB/s] 22

  23. An Example [Figure: data flows back and forth between CPU and GPU] PTask: Operating System Abstractions To Manage GPUs as Compute Devices, SOSP'11 23

  24. Inefficient Data Migration • #> capture | xform | filter | detect & • Each stage’s read()/write() goes through the OS executive and the GPU driver, so data is copied to and from the GPU over PCIe at every stage – A lot of copies [Figure: pipeline of capture, xform, filter, detect with repeated copies between camdrv, the GPU driver, and HIDdrv] Acknowledgement: This slide is from the paper’s author’s slides 24

  25. Scheduling Challenge • CPU priorities do not apply to the GPU • A long-running GPU task (xform) is not preemptible, delaying a short GPU task (mouse update) PTask: Operating System Abstractions To Manage GPUs as Compute Devices, SOSP'11

  26. Challenges for Integrated CPU-GPU • Memory is shared by both CPU and GPU • Data movement may be easier, but… [Figure: Nvidia Tegra X2 with four CPU cores (each with a PMC) and GPU cores behind a shared memory controller and shared DRAM at 16 GB/s] 26

  27. Memory Bandwidth Contention • Co-scheduling a memory-intensive CPU task affects GPU performance on an integrated CPU-GPU SoC [Figure: GPU slowdown as the number of CPU co-runners increases] Waqar Ali, Heechul Yun. Protecting Real-Time GPU Kernels on Integrated CPU-GPU SoC Platforms. Euromicro Conference on Real-Time Systems (ECRTS), 2018 27

  28. Summary • GPU Architecture – Many simple in-order cores • GPU Programming Model – SIMD • Challenges – Data movement cost – Scheduling – Bandwidth bottleneck – NOT time predictable! 28

  29. Real-Time GPU Management • Goal – Time predictable and efficient GPU sharing in multi-tasking environment • Challenges – High data copy overhead – Real-time scheduling support -- preemption – Shared resource (bandwidth) contention 29

  30. References • TimeGraph: GPU scheduling for real-time multi-tasking environments. In USENIX ATC, 2011. • Gdev: First-class GPU resource management in the operating system. In USENIX ATC, 2012. • GPES: A preemptive execution system for GPGPU computing. In RTAS, 2015. • GPUSync: A framework for real-time GPU management. In RTSS, 2013. • A server-based approach for predictable GPU access control. In RTCSA, 2017. 30

  31. Real-Time GPU Scheduling • Early real-time GPU schedulers – TimeGraph – Gdev • GPU kernel slicing – GPES • Synchronization (lock) based approach – GPUSync • Server-based approach – GPU server 31

  32. GPU Software Stack Acknowledgement: This slide is from the paper author’s slide 32 Gdev: First-class gpu resource management in the operating system. In ATC, 2012.

  33. TimeGraph • First work to support “soft” real-time GPU scheduling • Implemented at the device-driver level S. Kato, K. Lakshmanan, R. Rajkumar, and Y. Ishikawa, “TimeGraph: GPU scheduling for real-time multi-tasking environments,” in USENIX ATC, 2011 33

  34. TimeGraph Scheduling • GPU commands are not sent to the GPU immediately if it is busy • High-priority GPU commands are scheduled when the GPU becomes idle 34

  35. GDev • Implemented at the kernel level on top of the stock GPU driver Acknowledgement: This slide is from the paper author’s slide 35 Gdev: First-class gpu resource management in the operating system. In ATC, 2012.

  36. TimeGraph Scheduling • High-priority tasks can still suffer long delays • Due to lack of hardware preemption Acknowledgement: This slide is from the paper author’s slides. Gdev: First-class GPU resource management in the operating system. In ATC, 2012. 36

  37. GDev’s BAND Scheduler • Monitors consumed bandwidth and adds delay to make low-priority requests wait for high-priority ones • Non-work-conserving; no real-time guarantee Acknowledgement: This slide is from the paper author’s slides. Gdev: First-class GPU resource management in the operating system. In ATC, 2012. 37

  38. GPES • Based on Gdev • Implements kernel slicing to reduce latency (*) GPES: A Preemptive Execution System for GPGPU Computing, RTAS'15 38
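The kernel-slicing idea can be illustrated roughly as follows (an assumed sketch, not GPES's actual implementation; `work` and `launch_sliced` are made-up names). One logical kernel over n elements is launched as several smaller grids, and the gaps between slices become points where a higher-priority kernel can get onto the GPU:

```cuda
#include <cuda_runtime.h>

// Placeholder per-element work; each slice covers [offset, end).
__global__ void work(float *data, int offset, int end) {
    int i = offset + blockIdx.x * blockDim.x + threadIdx.x;
    if (i < end) data[i] *= 2.0f;
}

// Launch one logical kernel as `slices` smaller launches.
void launch_sliced(float *d_data, int n, int slices) {
    int per_slice = (n + slices - 1) / slices;
    for (int s = 0; s < slices; s++) {
        int offset = s * per_slice;
        int end = (offset + per_slice < n) ? offset + per_slice : n;
        int blocks = (end - offset + 255) / 256;
        work<<<blocks, 256>>>(d_data, offset, end);
        cudaDeviceSynchronize();  // slice boundary: other kernels may run here
    }
}
```

The trade-off is extra launch and synchronization overhead per slice in exchange for a bounded blocking time for higher-priority work.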

  39. GPUSync http://www.ece.ucr.edu/~hyoseung/pdf/rtcsa17-gpu-server-slides.pdf 39

  40.–43. [Figure sequence] http://www.ece.ucr.edu/~hyoseung/pdf/rtcsa17-gpu-server-slides.pdf

  44. Hardware Preemption • Recent GPUs (NVIDIA Pascal) support hardware preemption – Problem solved? • Issues – Works only between GPU streams within a single address space (process) – High context-switching overhead • ~100 us per context switch (*) (*) AnandTech, “Preemption Improved: Fine-Grained Preemption for Time-Critical Tasks” 44

  45. Hardware Preemption 45

  46. Discussion • Long running low priority GPU kernel? • Memory interference from the CPU? 46

  47. Challenges for Integrated CPU-GPU • Memory is shared by both CPU and GPU • Data movement may be easier, but… [Figure: Nvidia Tegra X2 with four CPU cores (each with a PMC) and GPU cores behind a shared memory controller and shared DRAM at 16 GB/s] 47

  48. References • SiGAMMA: server based integrated GPU arbitration mechanism for memory accesses, RTNS, 2017 • GPUguard: towards supporting a predictable execution model for heterogeneous SoC, DATE, 2017 • Protecting Real-Time GPU Kernels on Integrated CPU-GPU SoC Platforms, ECRTS, 2018 48

  49. SiGAMMA • Protects PREM-compliant real-time CPU tasks • Throttles the GPU when the CPU is in a memory phase 49

  50. SiGAMMA • GPU is throttled by launching a high-priority spinning kernel 50
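The spinning-kernel trick might look roughly like this in CUDA (an assumed sketch, not SiGAMMA's code): a kernel launched on a high-priority stream busy-waits on a host-visible flag, occupying the GPU so lower-priority kernels cannot issue memory traffic during a CPU memory phase:

```cuda
#include <cuda_runtime.h>

// Throttling kernel: spins until the CPU-side arbiter releases it.
__global__ void spin_kernel(volatile int *release) {
    while (*release == 0)
        ;  // busy-wait; the GPU stays occupied and issues no useful traffic
}
```

For this to work, the flag would live in host-mapped (zero-copy) memory so the CPU can clear it without issuing another kernel launch, and the kernel would run on a stream with higher priority than the application's kernels.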
