
OpenGL NVIDIA "Command-List": "Approaching Zero Driver Overhead" - PowerPoint PPT Presentation

OpenGL NVIDIA "Command-List": "Approaching Zero Driver Overhead". Tristan Lorach, Manager of Devtech for Professional Visualization group. GPU Scalability: GPUs are powerful (Quadro M6000: 3072 cores). How to leverage all this…


  1. OpenGL NVIDIA "Command-List": "Approaching Zero Driver Overhead". Tristan Lorach, Manager of Devtech for Professional Visualization group.

  2. GPU Scalability. GPUs are powerful: the Quadro M6000 has 3072 cores. How do we leverage all this power? Doing it right is the responsibility of both the application and the graphics API (driver): increase the amount of work per batch (job); minimize CPU ↔ GPU interactions; lower memory traffic; lower the number of API calls by batching things together; factorize data by re-using data already uploaded to video memory (instancing…). Siggraph 2015

  3. Use of the GPU. Games: ~1,500 to 3,000 drawcalls per scene; intensive use of multi-layer image processing over the scene; heavy shaders. CAD/DCC/professional applications: the user's work is hard to batch (CAD applications); ~10,000 to 300,000 drawcalls per scene, a heavy CPU workload for our driver; shaders are simpler than in games (but catching up these days); post-processing is used more and more (SSAO).

  4. Use of the GPU. New graphics APIs are trying to address these concerns: Vulkan, Metal, DX12. All propose better ways to issue commands and render states. But can we improve OpenGL to go down the same path? Yes: NV_command_list.

  5. Challenge of Issuing Commands. Issuing drawcalls and state changes can be a real bottleneck: excessive work from the app and driver on the CPU, while the GPU sits idle. Example models (courtesy of PTC): 650,000 triangles in 68,000 parts (~10 triangles per part); 3,700,000 triangles in 98,000 parts (~37 triangles per part); 14,338,275 triangles/lines in 300,528 drawcalls (parts) (~48 triangles per part).

  6. Big Picture: Typical Case. [Diagram] The application issues OpenGL commands to the driver, which turns them into command bundles sent through the push-buffer (FIFO) to the GPU front-end (decoder). OpenGL resources are referenced by ID and mapped to 64-bit handles and addresses; FBO resources (IDs) are mapped to 64-bit pointers. Pipeline stages: Vertex Puller (IA), Vertex Shader, TCS (Tessellation), Tessellator, TES (Tessellation), Geometry Shader, Transform Feedback, Rasterization, Fragment Shader, Per-Fragment Ops, Framebuffer. Resources: Element buffer (EBO), Draw Indirect Buffer, Vertex Buffer (VBO), Uniform Block, Texture Fetch, Image Load/Store, Atomic Counter, Shader Storage, Tr. Feedback buffer, FBO resources (Textures / RB).

  7. Big Picture. [Diagram] Creation of the command bundles is offloaded to the application: next to the OpenGL commands that still go through the driver and its push-buffer (FIFO), the application provides a token-buffer to the GPU front-end (decoder). Resources are referenced by 64-bit pointers (bindless), that is, by 64-bit GPU addresses. The pipeline stages and resources shown are the same as in the typical case.

  8. Big Picture: token-buffers + state objects. [Diagram] The token-buffer (== cmds) is joined by state objects and a command-list object. This means more work for the front-end (FE, the decoder), but it is fast! Resources are referenced by 64-bit pointers (bindless). The pipeline stages and resources shown are the same as in the typical case.

  9. Demo.

     | | Set of CAD models together | CAD Car model |
     |---|---|---|
     | Primitives | 14,338,275 | 29,344,075 |
     | Drawcalls | 300,528 | 26,144 |
     | Attribute updates | 348,862 | 18,371 |
     | Uniform updates | 12,004 | 13,632 |

  10. Demo.

  11. Demo. Set of CAD models together, K5000: 1,211,684,096 primitives/s; 1,079,545 drawcalls/s.

  12. Demo.

      | K5000 | Set of CAD models together | CAD Car model |
      |---|---|---|
      | Primitives/s | 920,363,200 | 1,211,684,096 |
      | Drawcalls/s | 19,290,668 | 1,079,545 |

  13. Demos: Maxwell.

      | | Set of CAD models together | CAD Car model |
      |---|---|---|
      | K5000 primitives/s | 920,363,200 | 1,211,684,096 |
      | K5000 drawcalls/s | 19,290,668 | 1,079,545 |
      | M6000 primitives/s | 1,782,257,920 | 3,012,795,392 |
      | M6000 drawcalls/s | 37,355,848 | 2,684,239 |
      | Drawcall time on CPU | 8 microseconds (!) | 42 microseconds |

  14. More Performances.

      5,000 shader changes: toggling between two shaders in "shaded & edges".

      | Timing | GPU (ms) | CPU (ms) |
      |---|---|---|
      | Regular OpenGL | 12.7 | 15.1 |
      | HW TOKEN-buffers | 2.9 (4.3x) | 0.4 (37x) |
      | Command-Lists object | 2.8 (4.5x) | 0.005 (BIG x) |

      5,000 fbo changes: similar to the above, but toggling the fbo instead of the shader. Almost no additional cost compared to rendering without fbo changes.

      | Timing | GPU (ms) | CPU (ms) |
      |---|---|---|
      | CPU-emulated TOKEN-buffers | 60.0 | 60.0 |
      | HW TOKEN-buffers | 1.8 (33x) | 0.9 (66x) |
      | Command-Lists object | 1.7 (35x) | 0.022 (BIG x) |

      Preliminary results on K5000.
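
The "x" factors on this slide appear to be the baseline time divided by the measured time, truncated rather than rounded. The sketch below recomputes them from the timings on the slide; the function names are illustrative and not part of the presentation.

```cpp
#include <cassert>
#include <cstdio>

// Speedup factor: baseline time divided by measured time (both in ms).
double speedup(double baselineMs, double measuredMs) {
    assert(measuredMs > 0.0);
    return baselineMs / measuredMs;
}

// Prints the factors for the shader-change table (timings copied from the slide).
void printShaderChangeSpeedups() {
    const double regularGpu = 12.7;  // Regular OpenGL, GPU time in ms
    const double regularCpu = 15.1;  // Regular OpenGL, CPU time in ms
    std::printf("HW token-buffers:    GPU %.1fx, CPU %.1fx\n",
                speedup(regularGpu, 2.9), speedup(regularCpu, 0.4));
    std::printf("Command-list object: GPU %.1fx, CPU %.0fx\n",
                speedup(regularGpu, 2.8), speedup(regularCpu, 0.005));
}
```

With these inputs the CPU factor for the command-list object comes out at roughly 3,000x, which is presumably what the slide abbreviates as "BIG x".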

  15. Bindless Technology. What is it about? Work from native GPU pointers/handles (NVIDIA pioneered this technology). A lot less CPU work (memory hopping, validation...). Allows the GPU to use flexible data structures. Bindless buffers: vertex and global memory, since the Tesla generation (CUDA capable). Bindless textures: since Kepler. Bindless constants (UBO): new driver feature, supported on Fermi and above. Bindless plays a central role for Command-List. [Diagram] The push buffer sends pointers to the front-end (decoder); the Vertex Puller (IA) reads indices from the Element buffer (EBO) and attributes from the Vertex Buffer (VBO) through 64-bit addresses into GPU virtual memory; Uniform Block and Texture Fetch feed the shader stages (Vertex Shader, TCS, Tessellator, TES, Geometry Shader, Transform Feedback, Rasterization, Fragment Shader, Per-Fragment Ops, Framebuffer).

  16. Example On Using Bindless UBO.

          #define UBA UNIFORM_BUFFER_ADDRESS_NV
          UpdateBuffers();
          // Regular UBO binds are now ignored!
          glEnableClientState(UNIFORM_BUFFER_UNIFIED_NV);
          // Pointer for UBO#0, updated once for all objects
          glBufferAddressRangeNV(UBA, 0, addrView, viewSize);
          foreach (obj in scene) {
            ...
            // glBindBufferRange(UBO, 1, uboMatrices, obj.matrixOffset, maSize);
            // New pointer for UBO#1, updated per object
            glBufferAddressRangeNV(UBA, 1, addrMatrices + obj.matrixOffset, maSize);
            foreach (batch in obj.primitive_group_material) {
              // glBindBufferRange(UBO, 2, uboMaterial, batch.materialOffset, mtlSize);
              // New pointer for UBO#2, updated per primitive group
              glBufferAddressRangeNV(UBA, 2, addrMaterial + batch.materialOffset, mtlSize);
              ...
            }
          }

  17. Bindless And Memory Alignment. Pointers must be aligned, and so must pointer offsets, as in glBufferAddressRangeNV(UBA, 2, addrMaterial + batch.materialOffset, mtlSize). Query the required alignment with glGetIntegerv(GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT, &offsetAlignment). It is normally 256 bytes, which leaves gaps between items: Uniform Material 0, 1 and 2 each start on their own 256-byte boundary in GPU virtual memory (64-bit addresses). Try to fit as much data as possible into each item. The gaps do not occur when passing an index for array access into a uniform array materials[] (MDI + "base-instance"), but that requires special GLSL code (indexing).
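
The alignment rule above can be sketched as follows: each item's start offset is rounded up to a multiple of the queried alignment. This is a minimal sketch; `alignUp`, `alignedOffsets` and the 80-byte material size in the usage note are hypothetical, and in a real application the alignment would come from the `glGetIntegerv(GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT, ...)` query shown on the slide rather than being hard-coded.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Round `offset` up to the next multiple of `alignment`.
std::uint64_t alignUp(std::uint64_t offset, std::uint64_t alignment) {
    assert(alignment > 0);
    return (offset + alignment - 1) / alignment * alignment;
}

// Start offsets for `count` items of `itemSize` bytes, each placed on an
// aligned boundary. The padding between items is the "gap" the slide mentions.
std::vector<std::uint64_t> alignedOffsets(std::size_t count,
                                          std::uint64_t itemSize,
                                          std::uint64_t alignment) {
    std::vector<std::uint64_t> offsets;
    offsets.reserve(count);
    const std::uint64_t stride = alignUp(itemSize, alignment);
    for (std::size_t i = 0; i < count; ++i) {
        offsets.push_back(i * stride);
    }
    return offsets;
}
```

For example, three hypothetical 80-byte materials with a 256-byte alignment land at offsets 0, 256 and 512, so 176 bytes after each material stay unused unless more data is packed into the item.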

  18. NV_command_list Key Concepts
      1. Tokenized rendering: some state changes and draw commands are encoded into a binary data stream. This depends on bindless technology.
      2. State objects: the whole OpenGL state (program, blending...) is captured as an object. This allows pre-validation of state combinations, so later reuse of the objects is very fast.
      3. Command list object: a "display-list" paradigm, but more flexible. Buffers live outside the list (referenced by address), so their content can still be modified (matrices, vertices...).

  19. Scene Drawing Converted To Token Buffer. The per-object loop

          foreach (obj in scene) {
            glBufferAddressRangeNV(VERTEX.., 0, obj.geometry->addrVBO, ...);
            glBufferAddressRangeNV(ELEMENT..., 0, obj.geometry->addrIBO, ...);
            glBufferAddressRangeNV(UNIFORM..., 1, addrMatrices ...);
            foreach (batch in obj.materialGroups) {
              glBufferAddressRangeNV(UNIFORM, 2, addrMaterial ...);
              glMultiDrawElements(...)
            }
          }

      becomes a single drawcall, which replaces 80k calls to GL (for the graphic-card model):

          glDrawCommandsNV(GL_TRIANGLES, tokenBuffer, offsets[], sizes[], count); // {0}, {bufferSize}, 1

      Token buffer contents, for all objects. Obj#1: Set Attr#0 on VBO address …; Set Attr#1 on VBO address …; Set Elements on EBO address …; Uniform Matrix on UBO address …; Uniform Material on UBO address …; DrawElements; Uniform Material on UBO address …; DrawElements. Obj#2: …
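
To make the conversion above concrete, here is a CPU-only sketch of how such a token stream could be laid out in memory. It makes several assumptions. The struct layouts follow my reading of the `...CommandNV` structures in the NV_command_list specification and should be checked against it. The header and stage constants are placeholders: a real application would obtain them from `glGetCommandHeaderNV` and `glGetStageIndexNV`. `SceneObject`, `MaterialGroup`, `encodeScene` and the choice of shader stage per uniform are hypothetical. No OpenGL call is made here; the resulting bytes would be uploaded to a buffer and submitted with `glDrawCommandsNV` as on the slide.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

// Token layouts (assumed from the extension spec; verify before use).
struct AttributeAddressToken { std::uint32_t header, index, addressLo, addressHi; };
struct ElementAddressToken   { std::uint32_t header, addressLo, addressHi, typeSizeInByte; };
struct UniformAddressToken   { std::uint32_t header; std::uint16_t index, stage;
                               std::uint32_t addressLo, addressHi; };
struct DrawElementsToken     { std::uint32_t header, count, firstIndex, baseVertex; };
static_assert(sizeof(UniformAddressToken) == 16, "unexpected padding in token");

// Placeholder values only: real headers come from glGetCommandHeaderNV,
// real stage indices from glGetStageIndexNV.
constexpr std::uint32_t kHeaderAttributeAddress = 1;
constexpr std::uint32_t kHeaderElementAddress   = 2;
constexpr std::uint32_t kHeaderUniformAddress   = 3;
constexpr std::uint32_t kHeaderDrawElements     = 4;
constexpr std::uint16_t kStageVertex   = 0;
constexpr std::uint16_t kStageFragment = 1;

// Split a 64-bit GPU address into the two 32-bit halves stored in a token.
inline std::uint32_t lo32(std::uint64_t a) { return static_cast<std::uint32_t>(a & 0xFFFFFFFFull); }
inline std::uint32_t hi32(std::uint64_t a) { return static_cast<std::uint32_t>(a >> 32); }

// Append the raw bytes of one token to the stream.
template <typename Token>
void appendToken(std::vector<std::uint8_t>& stream, const Token& token) {
    static_assert(std::is_trivially_copyable<Token>::value, "tokens must be plain data");
    const auto* bytes = reinterpret_cast<const std::uint8_t*>(&token);
    stream.insert(stream.end(), bytes, bytes + sizeof(Token));
}

// Copy one token back out of the stream (for inspection and testing).
template <typename Token>
Token readToken(const std::vector<std::uint8_t>& stream, std::size_t byteOffset) {
    assert(byteOffset + sizeof(Token) <= stream.size());
    Token token;
    std::memcpy(&token, stream.data() + byteOffset, sizeof(Token));
    return token;
}

// Hypothetical scene description holding 64-bit (bindless) GPU addresses.
struct MaterialGroup { std::uint64_t addrMaterial; std::uint32_t indexCount, firstIndex; };
struct SceneObject {
    std::uint64_t addrVBO, addrIBO, addrMatrix;
    std::vector<MaterialGroup> groups;
};

// Mirrors the slide's loop: per object, set vertex, element and matrix
// addresses; per material group, set the material address and draw.
std::vector<std::uint8_t> encodeScene(const std::vector<SceneObject>& scene) {
    std::vector<std::uint8_t> stream;
    for (const SceneObject& obj : scene) {
        appendToken(stream, AttributeAddressToken{kHeaderAttributeAddress, 0,
                                                  lo32(obj.addrVBO), hi32(obj.addrVBO)});
        appendToken(stream, ElementAddressToken{kHeaderElementAddress,
                                                lo32(obj.addrIBO), hi32(obj.addrIBO),
                                                4 /* 32-bit indices assumed */});
        appendToken(stream, UniformAddressToken{kHeaderUniformAddress, 1, kStageVertex,
                                                lo32(obj.addrMatrix), hi32(obj.addrMatrix)});
        for (const MaterialGroup& group : obj.groups) {
            appendToken(stream, UniformAddressToken{kHeaderUniformAddress, 2, kStageFragment,
                                                    lo32(group.addrMaterial),
                                                    hi32(group.addrMaterial)});
            appendToken(stream, DrawElementsToken{kHeaderDrawElements, group.indexCount,
                                                  group.firstIndex, 0});
        }
    }
    assert(stream.size() % 16 == 0);  // every token in this sketch is 16 bytes
    return stream;
}
```

Because the tokens only store addresses, the data behind them (matrices, materials, vertices) can still be updated after the stream has been built, which is the flexibility the key concepts describe.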
