Turing SM
● Divided into 4 pipelines, each housing
  ○ 16 FP32 units
  ○ 16 INT32 units
  ○ 2 Tensor Cores
  ○ 1 warp scheduler
  ○ 1 dispatch unit
● 96 KB L1/shared memory
  ○ 64 KB is "shader RAM" (per SM) when executing graphics work
● L0 instruction cache
Memory latencies
A*B + C
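The A*B + C operation above is the Tensor Core fused matrix multiply-accumulate: D = A×B + C over a small tile, with FP16 inputs and FP32 accumulation. A NumPy sketch of the numeric behavior only (the 16×16 tile shape and mixed precision follow the whitepapers; this models the arithmetic, not the hardware):

```python
import numpy as np

# Tensor Core-style op: D = A @ B + C on a 16x16 tile.
# A and B are FP16 operands; products are accumulated in FP32.
rng = np.random.default_rng(0)
A = rng.standard_normal((16, 16)).astype(np.float16)
B = rng.standard_normal((16, 16)).astype(np.float16)
C = rng.standard_normal((16, 16)).astype(np.float32)

# Model the mixed precision: FP16 inputs, FP32 accumulation.
D = A.astype(np.float32) @ B.astype(np.float32) + C
print(D.shape, D.dtype)  # (16, 16) float32
```

On the GPU this single tile op is issued as one instruction per warp (e.g. via CUDA's `nvcuda::wmma` API), which is where the throughput advantage over scalar FMAs comes from.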
Raytracing
Raytracing “before”
Raytracing “now”
DXR
Raytracing in practice
● Hybrid solutions to minimize the number of rays
  ○ Low sample counts usually come with extreme noise - this has pushed denoising to the forefront of research
    ■ https://www.youtube.com/watch?v=5pxnDsFLAuY
    ■ https://research.nvidia.com/publication/interactive-reconstruction-monte-carlo-image-sequences-using-recurrent-denoising
    ■ https://www.youtube.com/watch?v=mtdRfl4fmvQ
● Acceleration structures mean a considerable increase in GPU memory usage
● Decrease payload sizes as much as you can
Mesh shader - motivation
Mesh shaders
● Task shader: threads run in workgroups. Each workgroup can launch an arbitrary number (including zero) of mesh shader workgroups
● Mesh shader: each thread can create primitives
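The two stages can be illustrated with a toy CPU-side model (the names and the visibility test are invented for illustration; the real pipeline runs on the GPU via the mesh shading extensions): each task workgroup chooses how many mesh workgroups to launch, where launching zero culls the meshlet entirely, and each mesh workgroup emits primitives.

```python
# Toy model of the task -> mesh shader pipeline (illustration only).
meshlets = [
    {"name": "visible_a", "visible": True,  "triangles": 3},
    {"name": "culled_b",  "visible": False, "triangles": 5},
    {"name": "visible_c", "visible": True,  "triangles": 2},
]

def task_shader(meshlet):
    # Launch one mesh workgroup per visible meshlet, zero otherwise.
    return 1 if meshlet["visible"] else 0

def mesh_shader(meshlet):
    # A mesh workgroup emits this meshlet's primitives.
    return [(meshlet["name"], i) for i in range(meshlet["triangles"])]

primitives = []
for m in meshlets:
    for _ in range(task_shader(m)):  # zero launches == culled on-GPU
        primitives.extend(mesh_shader(m))

print(len(primitives))  # 5: culled_b contributed nothing
```

The point of the "zero workgroups" case is that culling happens on the GPU before any geometry work is done, instead of the CPU issuing per-meshlet draw calls.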
Texture space shading
● Turing feature, only available via extensions (just like mesh shading)
● Store the shaded fragments of a triangle in a separate texture
● Independent of visibility
● Re-sample this stashed texture instead of re-evaluating the full shading
  ○ Unless we moved around too much
● For certain applications it's almost a given that we stay at roughly the same place within a frame: the left and right eyes in VR
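A minimal sketch of the idea (all names here are hypothetical; a real implementation shades texels of the mesh parameterization on the GPU): shade each texel at most once into a cache texture, then let both eyes re-sample the cache instead of re-shading.

```python
# Toy texture-space shading: shade once, resample for both eyes.
shade_calls = 0
cache = {}  # texel -> shaded color

def shade(texel):
    # Stand-in for expensive shading work.
    global shade_calls
    shade_calls += 1
    return texel[0] * 0.1 + texel[1] * 0.2

def sample(texel):
    # Re-sample the stashed texture instead of re-evaluating shading.
    if texel not in cache:
        cache[texel] = shade(texel)
    return cache[texel]

visible_texels = [(x, y) for x in range(4) for y in range(4)]
left_eye  = [sample(t) for t in visible_texels]
right_eye = [sample(t) for t in visible_texels]  # all cache hits

print(shade_calls)  # 16, not 32: the right eye reused every texel
```

The VR case works precisely because the two eyes see nearly the same texels, so the second view is almost entirely cache hits.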
Classic
Texture space
Texture space shading
● https://devblogs.nvidia.com/texture-space-shading/
● https://www.youtube.com/watch?v=Rpy0-q0TyB0
References
● Fermi whitepaper:
  ○ http://www.nvidia.com/content/pdf/fermi_white_papers/p.glaskowsky_nvidia's_fermi-the_first_complete_gpu_architecture.pdf
  ○ http://www.nvidia.com/content/pdf/fermi_white_papers/nvidia_fermi_compute_architecture_whitepaper.pdf
● Kepler whitepaper: https://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf
● Maxwell whitepaper:
  ○ http://international.download.nvidia.com/geforce-com/international/pdfs/GeForce-GTX-750-Ti-Whitepaper.pdf
  ○ http://international.download.nvidia.com/geforce-com/international/pdfs/GeForce_GTX_980_Whitepaper_FINAL.PDF
● Pascal whitepaper:
  ○ http://international.download.nvidia.com/geforce-com/international/pdfs/GeForce_GTX_1080_Whitepaper_FINAL.pdf
  ○ https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf
● Volta whitepaper:
  ○ http://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf
● Turing whitepaper:
  ○ https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf
References
The lowest-level details are unfortunately only available via reverse engineering:
● Volta: https://arxiv.org/abs/1804.06826
● Turing: https://arxiv.org/pdf/1903.07486.pdf
Ampere
In numbers
● 7 GPCs
● Each GPC contains
  ○ 6 TPCs
  ○ 1 raster engine
  ○ (NEW) 2 ROP partitions
  ○ (NEW) 8 ROP units per ROP partition
● Each TPC contains
  ○ 2 SMs
  ○ 1 polymorph engine
● Each SM contains
  ○ 128 CUDA cores
  ○ 4 texture units
  ○ 4 Tensor Cores (3rd gen)
  ○ 1 RT Core (2nd gen)
  ○ 256 KB register file, partitioned into four 64 KB parts
  ○ 128 KB of configurable L1/shared memory
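Multiplying out the per-unit counts above gives the full-chip totals:

```python
# Full-chip totals from the per-unit counts on this slide.
gpcs = 7
tpcs_per_gpc = 6
sms_per_tpc = 2

sms = gpcs * tpcs_per_gpc * sms_per_tpc   # 84 SMs
cuda_cores = sms * 128                    # 10752 CUDA cores
tensor_cores = sms * 4                    # 336 Tensor Cores
rt_cores = sms * 1                        # 84 RT Cores
rops = gpcs * 2 * 8                       # 112 ROP units

print(sms, cuda_cores, tensor_cores, rt_cores, rops)
# 84 10752 336 84 112
```

Moving the ROPs into the GPC (rather than tying them to the memory controllers) is what the (NEW) markers refer to.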
In numbers
● 12 x 32-bit memory controllers (384-bit interface)
● 512 KB L2 cache per controller (6144 KB in total)
● An SM partition can now issue 2 FP32 operations per clock (Turing could only dual-issue an FP32 + INT32 operation pair)
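A quick check of the memory-side totals:

```python
# Memory interface totals from the per-controller counts.
controllers = 12
bus_width_bits = controllers * 32   # 384-bit memory interface
l2_total_kb = controllers * 512     # 6144 KB of L2 cache

print(bus_width_bits, l2_total_kb)  # 384 6144
```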