10× Faster Transparency from Low Level Shader Optimisation

10 × Faster Transparency from Low Level Shader Optimisation Pyarelal Knowles Geoff Leach Fabio Zambetta RMIT University, Australia

Opaque Transparent 460 (2010): 6 FPS Titan X (Now): 142 FPS

12 Million Polygons 460 (base): 1 FPS Titan X (base): 3 FPS Titan X (ours): 30 FPS

10× Faster Transparency (Software)
● Gained with two techniques
  ○ Backwards Memory Allocation - better occupancy
  ○ Register Based Block Sort - faster sorting
● Involves low level optimizations (OpenGL+GLSL)
● Interesting technical details
● Insight from CUDA, similarities to OpenGL
● Important to know hardware and language
● Now within 10× opaque rendering

Transparency
● Objects, glass, visualization
● Antialiasing
● Particles
● Shadow Maps
Images from: Creo Parametric 2, Portal 2, Shadow Of Mordor, marmoset.co, GRID 2, Unity3D, Chen et al.’s Real-Time Volumetric Shadows using 1D Min-Max Mipmaps, Lokovic and Veach’s Deep Shadow Maps.

Transparency
● Transparency uses alpha blending
● Weighted average
● Based on surface order
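A minimal CPU-side sketch of why the weighted average depends on surface order. It is written in C++ rather than GLSL so it can be run anywhere; `Rgb` and `blendOver` are names invented for this illustration, and `blendOver` mirrors what GLSL's `mix(dst, src, alpha)` computes.

```cpp
#include <array>
#include <cassert>

// RGB colour; alpha is passed separately to keep the sketch small.
using Rgb = std::array<float, 3>;

// Weighted average of the colour accumulated so far (dst) and the
// surface being drawn on top of it (src), like GLSL mix(dst, src, alpha).
Rgb blendOver(const Rgb& dst, const Rgb& src, float alpha) {
    assert(alpha >= 0.0f && alpha <= 1.0f);
    Rgb out{};
    for (int i = 0; i < 3; ++i)
        out[i] = dst[i] * (1.0f - alpha) + src[i] * alpha;
    return out;
}
```

Blending 50% red and then 50% blue over white gives (0.5, 0.25, 0.75); blending blue first and then red gives (0.75, 0.25, 0.5). The same two surfaces produce different pixels, which is why fragments have to be sorted before blending.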

Not Sorted Sorted

Sorting for Transparency
Sort triangles
● Geometry dependent
Sort fragments (potential pixel colors)
● Rasterize and store
● Geometry independent
● Order Independent Transparency (OIT) ...

Order Independent Transparency (OIT)
● Two passes
  ○ Build a deep image
  ○ Sort and blend fragments
● Exact OIT: sort all fragments
● Code snippets
  ○ On my poster
  ○ https://github.com/pknowles/oit

1. Deep Image
● Many fragments per pixel
● Construct in fragment shader
  ○ Race conditions
  ○ Different data structures
Knowles, P.: Real-Time deep image rendering and order independent transparency. PhD Thesis, RMIT University, 2015.

2. Sort and composite
Full screen pass:
1. Read all fragments
2. Sort
3. Blend
Bottleneck for large scenes

    vec2 frags[MAX_FRAGS];
    void main() {
        int count = loadFragments(gl_FragCoord.xy);
        sortFragments(count); //insertion sort
        colour = vec4(1.0);
        for (int i = 0; i < count; ++i) {
            vec4 c = unpackColour(frags[i].y);
            colour = mix(colour, c, c.a);
        }
    }
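The sort and blend steps above can be sketched on the CPU as follows. This is a C++ stand-in, not the shader from the talk: the `Fragment` struct, the `MAX_FRAGS` value of 8 and the `composite` function are assumptions made for illustration. The back-to-front order is inferred from the slide's loop, which starts from the background colour and mixes each fragment on top, so the last fragment blended ends up nearest the viewer.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical unpacked form of one vec2 entry: depth plus RGBA colour.
struct Fragment {
    float depth;                 // distance from the camera
    std::array<float, 4> rgba;   // colour with alpha in rgba[3]
};

constexpr std::size_t MAX_FRAGS = 8;   // illustrative value only

// Insertion sort, farthest fragment first (back to front).
void sortFragments(std::array<Fragment, MAX_FRAGS>& frags, std::size_t count) {
    assert(count <= MAX_FRAGS);
    for (std::size_t i = 1; i < count; ++i) {
        Fragment key = frags[i];
        std::size_t j = i;
        while (j > 0 && frags[j - 1].depth < key.depth) {
            frags[j] = frags[j - 1];   // shift nearer fragments right
            --j;
        }
        frags[j] = key;
    }
}

// Blend the sorted fragments over a white background, as in the slide's loop.
std::array<float, 4> composite(const std::array<Fragment, MAX_FRAGS>& frags,
                               std::size_t count) {
    std::array<float, 4> colour{1.0f, 1.0f, 1.0f, 1.0f};
    for (std::size_t i = 0; i < count; ++i) {
        const std::array<float, 4>& c = frags[i].rgba;
        for (int k = 0; k < 4; ++k)   // mix(colour, c, c.a)
            colour[k] = colour[k] * (1.0f - c[3]) + c[k] * c[3];
    }
    return colour;
}
```

Insertion sort does work proportional to the square of the fragment count in the worst case, which is consistent with the slide calling this pass the bottleneck for large scenes.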

OpenGL+GLSL vs CUDA

OpenGL                              CUDA
              (Same hardware)
Graphics                            Compute
GLSL Shaders                        Kernels
Specification, not implementation   Implementation well documented
Improving Nsight support            Good Nsight support

● CUDA gives insight into GLSL execution
● Some significant architectural differences...

GPU Architecture
(Diagram of the GPU memory hierarchy)
● Slow: Global Memory, then L2 Cache
● Faster: L1 Cache / Shared / “Local”, one per SM/SMX
● Fastest: Registers, feeding the SPs (SP SP SP SP …)

OpenGL vs CUDA - An Interesting Example

    #define SIZE set_by_application
    vec4 myArray[SIZE];
    uniform int zero;
    out vec4 fragColour;
    void main() {
        fragColour = myArray[zero];
    }

● Why would allocating more memory make a shader slower?

OpenGL vs CUDA - An Interesting Example
● In GLSL local memory is reserved
● The more that is required, the fewer threads can be active
● Low occupancy
(Diagram: GPU with Global Memory and L2 Cache; each SM/SMX has L1 Cache / Shared / “Local” memory; each thread has its own registers and runs on the SPs.)
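The relationship between reserved local memory and active threads can be illustrated with a toy model. Everything here is an assumption made for the sake of arithmetic: `residentThreads`, the byte budget and the hardware thread cap are hypothetical and do not describe any particular GPU or driver. The only facts used are that a `vec4` is 16 bytes and that a fixed budget divided among larger per-thread arrays yields fewer threads.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Toy occupancy model: how many threads fit when each thread reserves
// `arrayElems` vec4 slots (16 bytes each) from a fixed local-memory budget.
// budgetBytes and hardwareMaxThreads are hypothetical inputs, not real limits.
std::size_t residentThreads(std::size_t budgetBytes,
                            std::size_t arrayElems,
                            std::size_t hardwareMaxThreads) {
    assert(arrayElems > 0);
    const std::size_t bytesPerThread = arrayElems * 16;   // 16 = size of a vec4
    return std::min(hardwareMaxThreads, budgetBytes / bytesPerThread);
}
```

With a made-up budget of 65536 bytes and a cap of 2048 threads, an array of 2 elements leaves all 2048 threads resident, 8 elements leave 512, and 16 elements leave 256. Doubling `SIZE` halves the thread count once the budget becomes the limit, even if the shader only ever reads one element.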

Sorting in OIT
● Local memory is fixed
● Use conservative maximum
● Want dynamic size

    #define MAX_FRAGS set_by_application
    vec2 frags[MAX_FRAGS]; //conservative max
    void main() {
        int count = loadFragments(gl_FragCoord.xy);
        sortFragments(count); //insertion sort
        ...

Backwards Memory Allocation
Knowles, P., Leach, G., Zambetta, F.: Backwards Memory Allocation and Improved OIT. In Proceedings of Pacific Graphics 2013, pages 59–64, October 2013.

Register-Based Block Sort
● Local memory still slow
● External sort in registers
  ○ From local memory
  ○ Copy blocks to registers
  ○ Sort
  ○ Copy back
  ○ k-way merge
Knowles, P., Leach, G., Zambetta, F.: Fast Sorting for Exact OIT of Complex Scenes. The Visual Computer (TVCJ), vol. 30, no. 6-8, pages 603–613, June 2014.
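The steps listed above (copy a block, sort it, copy it back, then k-way merge) can be sketched on the CPU. This is a C++ illustration of the general external-sort pattern, not the authors' shader: `BLOCK`, `sortBlock` and `blockSort` are invented names, the block size of 4 is arbitrary, and a small `std::array` merely stands in for registers. It sorts plain depth values in ascending order for simplicity.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t BLOCK = 4;   // illustrative block size

// Sort one block held in a small fixed-size array. In a shader the loops
// would need constant indices for the array to stay in registers; here it
// is ordinary C++ and only models the data movement.
void sortBlock(std::array<float, BLOCK>& regs, std::size_t n) {
    for (std::size_t i = 1; i < n; ++i) {
        float key = regs[i];
        std::size_t j = i;
        while (j > 0 && regs[j - 1] > key) {
            regs[j] = regs[j - 1];
            --j;
        }
        regs[j] = key;
    }
}

// Returns the values of `local` in ascending order.
std::vector<float> blockSort(std::vector<float> local) {
    const std::size_t count = local.size();

    // Pass 1: copy each block to "registers", sort it, copy it back.
    for (std::size_t start = 0; start < count; start += BLOCK) {
        const std::size_t n = std::min(BLOCK, count - start);
        std::array<float, BLOCK> regs{};
        std::copy_n(local.begin() + start, n, regs.begin());
        sortBlock(regs, n);
        std::copy_n(regs.begin(), n, local.begin() + start);
    }

    // Pass 2: k-way merge with one read cursor per sorted block.
    const std::size_t k = (count + BLOCK - 1) / BLOCK;
    std::vector<std::size_t> cursor(k);
    for (std::size_t b = 0; b < k; ++b) cursor[b] = b * BLOCK;

    std::vector<float> out;
    out.reserve(count);
    while (out.size() < count) {
        std::size_t best = k;   // k means "no block chosen yet"
        for (std::size_t b = 0; b < k; ++b) {
            const std::size_t end = std::min((b + 1) * BLOCK, count);
            if (cursor[b] < end &&
                (best == k || local[cursor[b]] < local[cursor[best]]))
                best = b;
        }
        assert(best < k);
        out.push_back(local[cursor[best]++]);
    }
    return out;
}
```

Each element is moved through the small array once during the block pass, and the merge only reads the head of each block, so the slow memory is touched far less often than in an insertion sort over the whole list.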

Intermediate Compiler Output
● glGetProgramBinary
● Provided by Nvidia driver
● Poor man’s --keep (CUDA)

    OPTION NV_bindless_texture;
    PARAM c[9] = { program.local[0..8] };
    TEMP R0, R1;
    TEMP RC, HC;
    TEMP lmem[8];
    MOV.F lmem[0].x, c[0];
    MOV.F lmem[1].x, c[1];
    MOV.F lmem[2].x, c[2];
    MOV.F lmem[3].x, c[3];
    MOV.S R0.y, {1, 0, 0, 0}.x;
    REP.S ;
    SGE.S.CC HC.x, R0.y, c[8];
    BRK (NE.x);
    MOV.S R0.z, R0.y;
    REP.S ;
    SLE.S.CC HC.x, R0.y, {0, 0, 0, 0};
    BRK (NE.x);
    ...
    ADD.S R0.w, R0.z, -{1, 0, 0, 0}.x;
    MOV.U R0.x, R0.z;
    MOV.U R0.w, R0;
    MOV.F R0.x, lmem[R0.x].x;
    MOV.F R1.x, lmem[R0.w].x;
    SGT.F R0.x, R1, R0;
    TRUNC.U.CC HC.x, R0;
    ADD.S R0.z, R0, -{1, 0, 0, 0}.x;
    ENDREP;
    ADD.S R0.y, R0, {1, 0, 0, 0}.x;
    ENDREP;
    ADD.F result.position, R0.y, R0.x;
    END

Results - Milliseconds per frame
Titan X, 1920x1080

Scene             Atrium   Hairball (front / back)   Power Plant
Baseline          7        170 / 652                 374
BMA+RBS           6        195 / 212                 30
Opaque (no OIT)   1        5 / 3                     9

● Up to 10x improvement, at worst minor overhead

GPU Progression
Power plant scene (milliseconds per frame)
● Speedup improves with each new GPU

GPU (year)   460 (2010)   670 (2012)   Titan (2013)   Titan X (2015)
Baseline     1004         670          476            374
BMA+RBS      258          94           56             30
Speedup      3.9          7.1          8.5            12.3

Conclusion
● Low level optimizations necessary despite trend for higher level languages
● Need to be exposed to hardware architecture via language and tools
● Perhaps increasingly necessary with newer GPUs
● 10× faster OIT with BMA+RBS
● Much bigger scenes possible (also displays, i.e. 4K/8K)
● Better sorting and deep image rendering
● Much closer to opaque rendering speeds
● Sorting is no longer the bottleneck in many scenes

Questions? pyarelal.knowles@gmail.com
