10× Faster Transparency from Low Level Shader Optimisation
Pyarelal Knowles, Geoff Leach, Fabio Zambetta
RMIT University, Australia

[Figure labels: Opaque / Transparent] 460 (2010): 6 FPS; Titan X (now): 142 FPS
[Figure labels: 12 million polygons] 460 (base): 1 FPS; Titan X (base): 3 FPS; Titan X (ours): 30 FPS

10× Faster Transparency (Software)
● Gained with two techniques
  ○ Backwards Memory Allocation - better occupancy
  ○ Register Based Block Sort - faster sorting
● Involves low level optimizations (OpenGL+GLSL)
● Interesting technical details
● Insight from CUDA, similarities to OpenGL
● Important to know hardware and language
● Now within 10× of opaque rendering

Transparency
● Objects, glass, visualization
● Antialiasing
● Particles
● Shadow Maps
Images from: Creo Parametric 2, Portal 2, Shadow Of Mordor, marmoset.co, GRID 2, Unity3D, Chen et al.’s Real-Time Volumetric Shadows using 1D Min-Max Mipmaps, Lokovic and Veach’s Deep Shadow Maps.

Transparency
● Transparency uses alpha blending
● Weighted average
● Based on surface order
[Figure: the same scene blended not sorted vs. sorted]

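To illustrate why the weighted average depends on surface order, the sketch below blends two semi-transparent colours over a white background in both orders. The `mix` helper mirrors GLSL's built-in `mix`; the colours and alpha values are made up for this example.

```python
def mix(dst, src, alpha):
    """Linear interpolation like GLSL mix(): dst * (1 - alpha) + src * alpha."""
    return tuple(d * (1.0 - alpha) + s * alpha for d, s in zip(dst, src))


def composite(background, layers):
    """Blend (rgb, alpha) layers over a background in the order given."""
    colour = background
    for rgb, alpha in layers:
        colour = mix(colour, rgb, alpha)
    return colour


WHITE = (1.0, 1.0, 1.0)
red = ((1.0, 0.0, 0.0), 0.5)   # hypothetical surface: 50% opaque red
blue = ((0.0, 0.0, 1.0), 0.5)  # hypothetical surface: 50% opaque blue

red_then_blue = composite(WHITE, [red, blue])
blue_then_red = composite(WHITE, [blue, red])
print(red_then_blue, blue_then_red)
```

The two results differ, which is why unsorted surfaces produce the wrong image.
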
Sorting for Transparency
Sort triangles
● Geometry dependent
Sort fragments (potential pixel colors)
● Rasterize and store
● Geometry independent
● Order Independent Transparency (OIT)
...

Order Independent Transparency (OIT)
● Two passes
  ○ Build a deep image
  ○ Sort and blend fragments
● Exact OIT: sort all fragments
● Code snippets
  ○ On my poster
  ○ https://github.com/pknowles/oit

1. Deep Image
● Many fragments per pixel
● Construct in fragment shader
  ○ Race conditions
  ○ Different data structures
Knowles, P.: Real-Time deep image rendering and order independent transparency. PhD Thesis, RMIT University, 2015.

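One of the possible data structures is a per-pixel linked list. The sketch below is a CPU model of that layout, not code from the talk: the class name, array names and capacity are assumptions, and a plain integer stands in for the GPU atomic counter that avoids the race conditions mentioned above.

```python
class DeepImage:
    """Per-pixel linked lists stored in flat arrays (an assumed layout)."""

    def __init__(self, width, height, capacity):
        self.width = width
        self.heads = [-1] * (width * height)  # newest node per pixel, -1 = empty
        self.next = [-1] * capacity           # link to the older node of the same pixel
        self.data = [None] * capacity         # (depth, colour) payload
        self.counter = 0                      # stands in for a GPU atomic counter

    def add_fragment(self, x, y, depth, colour):
        # On a GPU the counter increment and the head swap must be atomic,
        # otherwise fragments racing on the same pixel corrupt the list.
        node = self.counter
        self.counter += 1
        if node >= len(self.data):
            return False  # out of storage: the fragment is dropped
        pixel = y * self.width + x
        self.next[node] = self.heads[pixel]
        self.heads[pixel] = node
        self.data[node] = (depth, colour)
        return True

    def fragments(self, x, y):
        """Walk one pixel's list, newest fragment first."""
        out = []
        node = self.heads[y * self.width + x]
        while node != -1:
            out.append(self.data[node])
            node = self.next[node]
        return out


img = DeepImage(2, 2, capacity=8)
img.add_fragment(0, 0, 0.7, "a")
img.add_fragment(1, 0, 0.2, "b")
img.add_fragment(0, 0, 0.3, "c")
print(img.fragments(0, 0))
```

Fragments arrive in rasterization order, so each pixel's list is unsorted; sorting happens in the second pass.
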
2. Sort and composite
Full screen pass:
1. Read all fragments
2. Sort
3. Blend
Bottleneck for large scenes

    vec2 frags[MAX_FRAGS];

    void main() {
        int count = loadFragments(gl_FragCoord.xy);
        sortFragments(count); //insertion sort
        colour = vec4(1.0);
        for (int i = 0; i < count; ++i) {
            vec4 c = unpackColour(frags[i].y);
            colour = mix(colour, c, c.a);
        }
    }

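The three steps of the full screen pass can be modelled on the CPU as follows. This is a sketch under assumptions: fragments are `(depth, (r, g, b, a))` tuples rather than packed `vec2` values, a larger depth means farther from the viewer, and the sort puts the farthest fragment first so that the loop blends back to front over a white background, as in the shader.

```python
def insertion_sort_by_depth(frags, count):
    """Sort frags[:count] in place, farthest (largest depth) first."""
    for i in range(1, count):
        key = frags[i]
        j = i - 1
        while j >= 0 and frags[j][0] < key[0]:
            frags[j + 1] = frags[j]
            j -= 1
        frags[j + 1] = key


def resolve_pixel(frags, count, background=(1.0, 1.0, 1.0)):
    """Sort one pixel's fragments, then blend them back to front."""
    insertion_sort_by_depth(frags, count)
    colour = background
    for i in range(count):
        _, (r, g, b, a) = frags[i]
        colour = tuple(c * (1.0 - a) + s * a for c, s in zip(colour, (r, g, b)))
    return colour


# Hypothetical pixel: a near blue surface stored before a far red one.
frags = [(0.2, (0.0, 0.0, 1.0, 0.5)), (0.8, (1.0, 0.0, 0.0, 0.5))]
result = resolve_pixel(frags, len(frags))
print(result)
```

Insertion sort is quadratic in the fragment count, which is one reason this pass becomes the bottleneck for large scenes.
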
OpenGL+GLSL vs CUDA

    OpenGL                               CUDA
    ----------------------------------   -------------------------------
                        (same hardware)
    Graphics                             Compute
    GLSL shaders                         Kernels
    Specification, not implementation    Implementation well documented
    Improving Nsight support             Good Nsight support

● CUDA gives insight into GLSL execution
● Some significant architectural differences...

GPU Architecture
● Global memory: slow
● L2 cache
● SM/SMX (several per GPU), each with:
  ○ L1 cache / shared / “local” memory: faster
  ○ Registers: fastest
  ○ SPs (SP SP SP SP …)

OpenGL vs CUDA - An Interesting Example

    #define SIZE set_by_application

    vec4 myArray[SIZE];
    uniform int zero;
    out vec4 fragColour;

    void main() {
        fragColour = myArray[zero];
    }

● Why would allocating more memory make a shader slower?

OpenGL vs CUDA - An Interesting Example
● In GLSL, local memory is reserved
● The more local memory a shader requires, the fewer threads can be active
● The result is low occupancy
[Diagram: the GPU memory hierarchy (global memory, L2 cache, SM/SMX with L1 cache / shared / “local” memory, registers and SPs), with threads drawn at the L1 cache / shared / “local” level]

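A toy model makes the trend concrete. The slide only states that a larger reservation means fewer active threads; the budget, thread limit and array sizes below are illustrative values, not figures for any real GPU, and `resident_threads` is a hypothetical helper.

```python
def resident_threads(array_elems, bytes_per_elem, local_budget_bytes, hw_max_threads):
    """Threads that fit when each one reserves a fixed-size local array."""
    per_thread = array_elems * bytes_per_elem
    return min(hw_max_threads, local_budget_bytes // per_thread)


# Illustrative numbers only (assumptions, not hardware specifications).
BUDGET = 512 * 1024   # bytes of local storage available to one SM/SMX
HW_MAX = 2048         # thread limit when memory is not the constraint
VEC4_BYTES = 16       # one vec4 of 32-bit floats

threads = {size: resident_threads(size, VEC4_BYTES, BUDGET, HW_MAX)
           for size in (8, 64, 512)}
print(threads)
```

In this model, growing `myArray` from 8 to 512 elements cuts the number of resident threads by a factor of 32, even though the shader reads a single element.
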
Sorting in OIT
● Local memory is fixed
● Use conservative maximum
● Want dynamic size

    #define MAX_FRAGS set_by_application

    vec2 frags[ MAX_FRAGS ]; //conservative max

    void main() {
        int count = loadFragments(gl_FragCoord.xy);
        sortFragments(count); //insertion sort
        ...

Backwards Memory Allocation
Knowles, P., Leach, G., Zambetta, F.: Backwards Memory Allocation and Improved OIT. In Proceedings of Pacific Graphics 2013, pages 59–64, October 2013.

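The slide gives only the name and the reference. One reading of the idea, stated here as an assumption rather than the authors' implementation: pixels are grouped by fragment count into power-of-two intervals, and each group is resolved by a shader variant compiled with that interval's array size, so shallow pixels no longer pay for the conservative maximum. The function names and size limits below are hypothetical.

```python
def interval_size(count, min_size=8, max_size=256):
    """Smallest power-of-two array size >= count, clamped to [min_size, max_size]."""
    size = min_size
    while size < count and size < max_size:
        size *= 2
    return size


def group_pixels(counts):
    """Map each array size to the pixels whose fragment count fits it."""
    groups = {}
    for pixel, count in counts.items():
        if count == 0:
            continue  # nothing to sort or blend for this pixel
        groups.setdefault(interval_size(count), []).append(pixel)
    return groups


# Hypothetical per-pixel fragment counts from the deep image pass.
counts = {(0, 0): 3, (1, 0): 9, (2, 0): 0, (3, 0): 100, (4, 0): 16}
groups = group_pixels(counts)
print(groups)
```

Each key in `groups` would correspond to one resolve pass with `MAX_FRAGS` set to that size.
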
Register-Based Block Sort
● Local memory still slow
● External sort in registers
  ○ From local memory
  ○ Copy blocks to registers
  ○ Sort
  ○ Copy back
  ○ k-way merge
Knowles, P., Leach, G., Zambetta, F.: Fast Sorting for Exact OIT of Complex Scenes. The Visual Computer (TVCJ), vol. 30, no. 6-8, pages 603–613, June 2014.

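The steps listed above can be sketched on the CPU as a block sort followed by a k-way merge. This is a model only: the block size is an assumed value, a Python list stands in for registers, and `heapq.merge` from the standard library stands in for the shader's merge step.

```python
import heapq

BLOCK = 4  # hypothetical number of fragments that fit in registers


def sort_block(block):
    """Insertion sort on a small block; stands in for sorting in registers."""
    regs = list(block)  # "copy block to registers"
    for i in range(1, len(regs)):
        key = regs[i]
        j = i - 1
        while j >= 0 and regs[j] > key:
            regs[j + 1] = regs[j]
            j -= 1
        regs[j + 1] = key
    return regs


def block_sort(frags):
    """Sort each block separately, copy it back, then k-way merge the blocks."""
    for start in range(0, len(frags), BLOCK):
        frags[start:start + BLOCK] = sort_block(frags[start:start + BLOCK])
    blocks = [frags[s:s + BLOCK] for s in range(0, len(frags), BLOCK)]
    return list(heapq.merge(*blocks))


depths = [0.9, 0.1, 0.5, 0.3, 0.7, 0.2, 0.8, 0.4, 0.6]
merged = block_sort(list(depths))
print(merged)
```

Only one block is held in the fast storage at a time, which is what makes this an external sort relative to local memory.
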
Intermediate Compiler Output
● glGetProgramBinary
● Provided by Nvidia driver
● Poor man’s --keep (CUDA)
Highlighted declarations: TEMP R0, R1; ... TEMP lmem[8];

    OPTION NV_bindless_texture;
    PARAM c[9] = { program.local[0..8] };
    TEMP R0, R1;
    TEMP RC, HC;
    TEMP lmem[8];
    MOV.F lmem[0].x, c[0];
    MOV.F lmem[1].x, c[1];
    MOV.F lmem[2].x, c[2];
    MOV.F lmem[3].x, c[3];
    MOV.S R0.y, {1, 0, 0, 0}.x;
    REP.S ;
      SGE.S.CC HC.x, R0.y, c[8];
      BRK (NE.x);
      MOV.S R0.z, R0.y;
      REP.S ;
        SLE.S.CC HC.x, R0.y, {0, 0, 0, 0};
        BRK (NE.x);
        ADD.S R0.w, R0.z, -{1, 0, 0, 0}.x;
        MOV.U R0.x, R0.z;
        MOV.U R0.w, R0;
        MOV.F R0.x, lmem[R0.x].x;
        MOV.F R1.x, lmem[R0.w].x;
        SGT.F R0.x, R1, R0;
        TRUNC.U.CC HC.x, R0;
        ADD.S R0.z, R0, -{1, 0, 0, 0}.x;
      ENDREP;
      ADD.S R0.y, R0, {1, 0, 0, 0}.x;
    ENDREP;
    ADD.F result.position, R0.y, R0.x;
    END

Results - Milliseconds per frame

    Scene              Atrium   Hairball (front / back)   Power Plant
    Baseline           7        170 / 652                 374
    BMA+RBS            6        195 / 212                 30
    Opaque (no OIT)    1        5 / 3                     9

● Up to 10× improvement, at worst minor overhead
Titan X, 1920x1080

GPU Progression
Power plant scene (milliseconds per frame)
● Speedup improves with each new GPU

    GPU (year)   460 (2010)   670 (2012)   Titan (2013)   Titan X (2015)
    Baseline     1004         670          476            374
    BMA+RBS      258          94           56             30
    Speedup      3.9          7.1          8.5            12.3

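The speedup row is the baseline time divided by the BMA+RBS time. The short check below recomputes it from the timings listed above and converts the optimized times to frames per second. Recomputing the Titan X entry from the rounded times gives about 12.5 rather than 12.3; the difference presumably comes from rounding in the displayed milliseconds, which is an assumption.

```python
baseline_ms = {"460 (2010)": 1004, "670 (2012)": 670,
               "Titan (2013)": 476, "Titan X (2015)": 374}
bma_rbs_ms = {"460 (2010)": 258, "670 (2012)": 94,
              "Titan (2013)": 56, "Titan X (2015)": 30}

# Speedup = baseline time / optimized time, per GPU.
speedup = {gpu: baseline_ms[gpu] / bma_rbs_ms[gpu] for gpu in baseline_ms}

# Frames per second = 1000 ms / frame time in ms.
fps = {gpu: 1000.0 / ms for gpu, ms in bma_rbs_ms.items()}

for gpu in baseline_ms:
    print(f"{gpu}: {speedup[gpu]:.1f}x speedup, {fps[gpu]:.0f} FPS with BMA+RBS")
```
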
Conclusion
● Low level optimizations are necessary despite the trend towards higher level languages
● Developers need exposure to the hardware architecture through the language and tools
● Perhaps increasingly necessary with newer GPUs
● 10× faster OIT with BMA+RBS
● Much bigger scenes possible (also bigger displays, e.g. 4K/8K)
● Better sorting and deep image rendering
● Much closer to opaque rendering speeds
● Sorting is no longer the bottleneck in many scenes

Questions? pyarelal.knowles@gmail.com