
GPGPU 03: NVIDIA case study

GeForce 7800 (2006). It is impossible to maximize throughput with such a rigid architecture: you can't keep the vertex and fragment shading units busy all the time. As a result, many bottlenecks in the


  1. Turing SM ● Divided into 4 pipelines (processing blocks), each housing ○ 16 FP32 units ○ 16 INT32 units ○ 2 Tensor Cores ○ 1 warp scheduler ○ 1 dispatch unit ● 96 KB L1/shared memory ○ 64 KB of it is “shader RAM” (per SM) when executing graphics work ● L0 instruction cache
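
The per-SM totals follow from multiplying the per-pipeline counts by four. A minimal sketch of that arithmetic, using only the numbers on the slide; the dictionary keys and the helper name `per_sm_totals` are made up for illustration:

```python
# Per-partition resources of a Turing SM, as listed on the slide.
PARTITIONS_PER_SM = 4
PER_PARTITION = {
    "fp32_units": 16,
    "int32_units": 16,
    "tensor_cores": 2,
    "warp_schedulers": 1,
    "dispatch_units": 1,
}


def per_sm_totals(per_partition=PER_PARTITION, partitions=PARTITIONS_PER_SM):
    """Multiply every per-partition count by the number of partitions."""
    return {name: count * partitions for name, count in per_partition.items()}


totals = per_sm_totals()
for name, count in totals.items():
    print(f"{name}: {count} per SM")
```

With these inputs a Turing SM ends up with 64 FP32 units, 64 INT32 units and 8 Tensor Cores.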

  2. Memory latencies

  3. A*B + C
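
The slide only shows the formula. Assuming it refers to the Tensor Core operation, a fused matrix multiply-accumulate D = A*B + C on small tiles, the computation can be sketched in plain Python. The 4x4 tile size and the function name `mma` are assumptions for illustration, and the mixed FP16/FP32 precision of the hardware is ignored here:

```python
def mma(a, b, c):
    """Return D = A*B + C for square matrices given as lists of rows."""
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) + c[i][j] for j in range(n)]
        for i in range(n)
    ]


# 4x4 tiles: A is the identity, so D should equal B + C.
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
b = [[float(4 * i + j) for j in range(4)] for i in range(4)]
c = [[0.5] * 4 for _ in range(4)]

d = mma(identity, b, c)
for row in d:
    print(row)
```

On the hardware the whole tile operation is a single instruction; the nested loops above only spell out what that instruction computes.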

  4. Raytracing

  5. Raytracing “before”

  6. Raytracing “now”

  7. DXR

  8. Raytracing in practice ● Hybrid solutions to minimize the number of rays ○ Low sample counts usually come with extreme noise, which has pushed denoising to the forefront of research ■ https://www.youtube.com/watch?v=5pxnDsFLAuY ■ https://research.nvidia.com/publication/interactive-reconstruction-monte-carlo-image-sequences-using-recurrent-denoising ■ https://www.youtube.com/watch?v=mtdRfl4fmvQ ● Acceleration structures considerably increase GPU memory usage ● Keep ray payload sizes as small as you can
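
Why low sample counts are so noisy can be illustrated with the standard error of a Monte Carlo average, which shrinks only with the square root of the sample count. A rough sketch, assuming independent samples; the pixel described in the comment is hypothetical:

```python
import math


def standard_error(sigma, samples):
    """Standard error of the mean of `samples` independent estimates."""
    return sigma / math.sqrt(samples)


# Hypothetical pixel: a binary visibility term that is 1 with probability 0.5,
# so its standard deviation is sqrt(0.5 * 0.5) = 0.5.
sigma = math.sqrt(0.5 * 0.5)

for spp in (1, 4, 16, 64):
    print(f"{spp:3d} samples per pixel -> standard error {standard_error(sigma, spp):.4f}")
```

Halving the noise takes four times as many rays, which is why real-time renderers tend to combine a few samples per pixel with a denoiser rather than trace more rays.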

  9. Raytracing in practice

  10. DXR

  11. Mesh shader - motivation

  12. Mesh shaders

  13. Mesh shaders ● Task shader: threads run in workgroups. Each workgroup can launch an arbitrary number (including zero) of mesh shader workgroups ● Mesh shader: each thread can create primitives.
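
The two-stage flow can be mimicked on the CPU. This is a toy model rather than shader code: the names `task_workgroup`, `mesh_workgroup`, `draw` and the `facing_camera` flag are invented for illustration, and each triangle stands in for one mesh shader thread:

```python
def task_workgroup(meshlets, is_visible):
    """Task stage: pick the meshlets that get a mesh workgroup (possibly none)."""
    return [m for m in meshlets if is_visible(m)]


def mesh_workgroup(meshlet):
    """Mesh stage: each 'thread' (one per triangle here) emits one primitive."""
    return [tuple(tri) for tri in meshlet["triangles"]]


def draw(meshlets, is_visible):
    """Run the task stage, then one mesh workgroup per surviving meshlet."""
    primitives = []
    for meshlet in task_workgroup(meshlets, is_visible):
        primitives.extend(mesh_workgroup(meshlet))
    return primitives


meshlets = [
    {"id": 0, "facing_camera": True, "triangles": [(0, 1, 2), (2, 1, 3)]},
    {"id": 1, "facing_camera": False, "triangles": [(4, 5, 6)]},
]

visible = draw(meshlets, lambda m: m["facing_camera"])
print(visible)
```

Culling in the task stage means the second meshlet never launches a mesh workgroup at all, which is the "including zero" case from the slide.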

  14. Mesh shaders

  15. Mesh shaders

  16. Mesh shaders

  17. Texture space shading ● Turing feature, only available via extensions (just like mesh shading) ● Store the shaded fragments of a triangle in a separate texture ● Independent of visibility ● Re-sample this stashed texture instead of re-evaluating the full shading ● This works unless the viewpoint has moved too much ● For certain applications it’s almost a given that we are at least roughly at the same place for a frame: VR left and right eyes
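
The reuse idea can be sketched as a cache keyed by texel: shade once, then re-sample. This is a conceptual model only, not the extension API; the class `TextureSpaceCache`, the placeholder `shade` function and the texel coordinates are all hypothetical:

```python
class TextureSpaceCache:
    """Caches shading results per texel so later lookups skip the shader."""

    def __init__(self, shade):
        self.shade = shade      # expensive shading function (hypothetical)
        self.texels = {}        # (triangle_id, u, v) -> shaded colour
        self.shade_calls = 0

    def sample(self, triangle_id, u, v):
        key = (triangle_id, u, v)
        if key not in self.texels:
            self.texels[key] = self.shade(triangle_id, u, v)
            self.shade_calls += 1
        return self.texels[key]


def shade(triangle_id, u, v):
    """Stand-in for a costly material evaluation."""
    return (triangle_id, u * 0.5, v * 0.5)


cache = TextureSpaceCache(shade)
texels_seen = [(7, 0, 0), (7, 1, 0), (7, 1, 1)]

# Both eyes see roughly the same surface points, so the second pass only re-samples.
left_eye = [cache.sample(*t) for t in texels_seen]
right_eye = [cache.sample(*t) for t in texels_seen]
print(cache.shade_calls)
```

The shading function runs once per texel even though two views request the result, which is the saving the VR example on the slide relies on.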

  18. Classic

  19. Texture space

  20. Texture space shading ● https://devblogs.nvidia.com/texture-space-shading/ ● https://www.youtube.com/watch?v=Rpy0-q0TyB0

  21. References ● Fermi whitepaper: ○ http://www.nvidia.com/content/pdf/fermi_white_papers/p.glaskowsky_nvidia's_fermi-the_first_complete_gpu_architecture.pdf ○ http://www.nvidia.com/content/pdf/fermi_white_papers/nvidia_fermi_compute_architecture_whitepaper.pdf ● Kepler whitepaper: ○ https://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf ● Maxwell whitepaper: ○ http://international.download.nvidia.com/geforce-com/international/pdfs/GeForce-GTX-750-Ti-Whitepaper.pdf ○ http://international.download.nvidia.com/geforce-com/international/pdfs/GeForce_GTX_980_Whitepaper_FINAL.PDF ● Pascal whitepaper: ○ http://international.download.nvidia.com/geforce-com/international/pdfs/GeForce_GTX_1080_Whitepaper_FINAL.pdf ○ https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf ● Volta whitepaper: ○ http://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf ● Turing whitepaper: ○ https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf

  22. References The lowest level details are unfortunately only available via reverse-engineering: ● Volta: https://arxiv.org/abs/1804.06826 ● Turing: https://arxiv.org/pdf/1903.07486.pdf

  23. Ampere

  24. In numbers ● 7 GPCs ● Each GPC contains ○ 6 TPCs ○ 1 raster engine ○ (NEW) 2 ROP partitions ○ (NEW) 8 ROP units per ROP partition ● Each TPC contains ○ 2 SMs ○ 1 polymorph engine ● Each SM contains ○ 128 CUDA cores ○ 4 texture units ○ 4 Tensor Cores (3rd gen) ○ 1 RT Core (2nd gen) ○ 256 KB register file, partitioned into four 64 KB parts ○ 128 KB of configurable L1/shared memory
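
Multiplying the hierarchy out gives the chip-wide totals. A small sketch using only the counts from the slide and assuming a fully enabled chip (no disabled units); the variable names are illustrative:

```python
# Counts taken from the slide; the chip-wide totals are derived here.
GPCS = 7
TPCS_PER_GPC = 6
SMS_PER_TPC = 2
ROP_PARTITIONS_PER_GPC = 2
ROPS_PER_ROP_PARTITION = 8

PER_SM = {
    "cuda_cores": 128,
    "texture_units": 4,
    "tensor_cores": 4,
    "rt_cores": 1,
    "register_file_kb": 256,
    "l1_shared_kb": 128,
}

sm_count = GPCS * TPCS_PER_GPC * SMS_PER_TPC
rop_count = GPCS * ROP_PARTITIONS_PER_GPC * ROPS_PER_ROP_PARTITION
chip_totals = {name: per_sm * sm_count for name, per_sm in PER_SM.items()}

print(f"SMs: {sm_count}, ROPs: {rop_count}")
for name, total in chip_totals.items():
    print(f"{name}: {total}")
```

With these inputs the chip has 84 SMs, 10752 CUDA cores and 112 ROP units.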

  25. In numbers ● 12 x 32-bit memory controllers (384-bit bus) ● 512 KB L2 cache per controller (6144 KB in total) ● An SM partition can now service 2 FP32 operations at once (in Turing it could only dual-issue a float and int operation pair)
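
The bus width and L2 size are simple products, and the FP32 change can be read off the per-partition unit counts. A sketch under the assumption that the 128 CUDA cores of an Ampere SM are spread evenly over its 4 partitions; the Turing figure is the 16 FP32 units per pipeline from the Turing SM slide:

```python
MEMORY_CONTROLLERS = 12
BITS_PER_CONTROLLER = 32
L2_KB_PER_CONTROLLER = 512

bus_width_bits = MEMORY_CONTROLLERS * BITS_PER_CONTROLLER
l2_total_kb = MEMORY_CONTROLLERS * L2_KB_PER_CONTROLLER

# FP32 units per SM partition (assumption: cores are split evenly over 4 partitions).
turing_fp32_per_partition = 16
ampere_fp32_per_partition = 128 // 4

print(f"bus width: {bus_width_bits} bit")
print(f"L2 cache: {l2_total_kb} KB")
print(f"FP32 units per partition: Turing {turing_fp32_per_partition}, "
      f"Ampere {ampere_fp32_per_partition}")
```

The doubling from 16 to 32 FP32 units per partition matches the slide's statement that a partition can now service two FP32 operations where Turing paired one FP32 with one INT32 operation.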
