An Optimized Diffusion Depth Of Field Solver (DDOF)
Holger Gruen – AMD
AMD's Favorite Effects, 28th February 2011

Agenda
• Motivation
• Recap of a high-level explanation of DDOF
• Recap of earlier DDOF solvers
• A Vanilla Cyclic Reduction (CR) DDOF solver
• A DX11 optimized CR solver for DDOF
• Results

Motivation
• Solver presented at GDC 2010 [RS2010] has some weaknesses
• Great implementation, but memory requirements and runtime are too high for many game developers
• Looking for a faster and more memory-efficient solver

Diffusion DOF recap 1
• DDOF is an enhanced way of blurring a picture that takes an arbitrary CoC (Circle of Confusion) at each pixel into account
• Interprets the input image as a heat distribution
• Uses the CoC at a pixel to derive a per-pixel heat conductivity

Diffusion DOF recap 2
• Blurring is done by time stepping a differential equation that models the diffusion of heat
• ADI (alternating direction implicit) method used to arrive at a separable solution for stepping
• Need to solve a tri-diagonal linear system for each row and then each column of the input

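DDOF Tri-diagonal system
$$
\begin{pmatrix}
b_1 & c_1 &        &        & 0 \\
a_2 & b_2 & c_2    &        &   \\
    & a_3 & b_3    & c_3    &   \\
    &     & \ddots & \ddots & \ddots \\
0   &     &        & a_n    & b_n
\end{pmatrix}
\begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{pmatrix}
=
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_n \end{pmatrix}
$$
• x: row/col of input image
• a, b, c: derived from CoC at each pixel of an input row/col
• y: resulting blurred row/col
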
Solver recap 1
• The GDC2010 solver [RS2010] is a 'hybrid' solver
  – Performs three PCR (parallel cyclic reduction) steps upfront
  – Performs a serial 'sweep' algorithm to solve the small resulting systems
  – Check [ZCO2010] for details on other hybrid solvers

Solver recap 2
• The GDC2010 solver [RS2010] has drawbacks
  – It uses a large UAV as a RW scratch-pad to store the modified coefficients of the sweep algorithm
    • GPUs without a RW cache will suffer
  – For high resolutions, three PCR steps still produce tri-diagonal systems of substantial size
    • This means a serial (sweep) algorithm is run on a 'big' system

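To make the scratch-pad issue concrete, here is a minimal CPU sketch of a serial sweep, assuming it is the standard Thomas algorithm (the slides do not spell out the exact variant). The names `sweep_solve`, `c_mod` and `x_mod` are hypothetical; the two arrays play the role of the modified coefficients that must be stored between the forward and the backward sweep.

```python
def sweep_solve(a, b, c, x):
    """Serial sweep (Thomas algorithm) for a tri-diagonal system.

    Row i reads a[i]*y[i-1] + b[i]*y[i] + c[i]*y[i+1] = x[i];
    a[0] and c[-1] do not influence the result.
    """
    n = len(x)
    c_mod = [0.0] * n  # modified upper diagonal (scratch storage)
    x_mod = [0.0] * n  # modified right-hand side (scratch storage)

    # Forward sweep: eliminate the lower diagonal, one row after the other.
    c_mod[0] = c[0] / b[0]
    x_mod[0] = x[0] / b[0]
    for i in range(1, n):
        denom = b[i] - a[i] * c_mod[i - 1]
        c_mod[i] = c[i] / denom
        x_mod[i] = (x[i] - a[i] * x_mod[i - 1]) / denom

    # Backward sweep: each y[i] depends on y[i+1], so this is serial as well.
    y = [0.0] * n
    y[-1] = x_mod[-1]
    for i in range(n - 2, -1, -1):
        y[i] = x_mod[i] - c_mod[i] * y[i + 1]
    return y
```

Both loops carry a dependency from one row to the next, which is why the cost of this part grows with the size of the system left over after the PCR steps.
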
Solver recap 3
• Cyclic Reduction (CR) solver
  – Used by [Kass2006] in the original DDOF paper
  – Runs in two phases
    1. Reduction phase
    2. Backward substitution phase

Solver recap 4
• According to [ZCO2010]:
  – The CR solver has the lowest computational complexity of all solvers
  – It suffers from a lack of parallelism, though
    • At the end of the reduction phase
    • At the start of the backward substitution phase

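Passes of a Vanilla CR Solver
• Input image X
• Pass 1: construct abc from CoC, which yields the tri-diagonal system
$$
\begin{pmatrix}
b_1 & c_1 &        & 0 \\
a_2 & b_2 & c_2    &   \\
    & \ddots & \ddots & \ddots \\
0   &     & a_n    & b_n
\end{pmatrix}
\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}
=
\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}
$$
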
Passes of a Vanilla CR Solver
• Pass 1: construct abc from CoC
• Input image X and abc: reduce → reduce → … (stop at size 1)
• Solve for the first y
• Substitute → substitute → … to obtain the blurred image Y

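The reduce / solve / substitute passes can be sketched on the CPU for a single row as follows. This is an illustrative reference under assumptions, not the GPU implementation from the talk: `cr_solve` is a hypothetical name, each reduction keeps the odd-indexed equations (0-based), and the recursion stands in for the chain of render passes.

```python
def cr_solve(a, b, c, x):
    """Cyclic reduction for a[i]*y[i-1] + b[i]*y[i] + c[i]*y[i+1] = x[i]."""
    n = len(x)
    a, c = list(a), list(c)
    a[0], c[-1] = 0.0, 0.0  # couplings that point outside the row
    if n == 1:
        return [x[0] / b[0]]  # "stop at size 1": solve for the first y

    # Reduction pass: fold the even-indexed neighbours into every odd-indexed
    # equation, which halves the size of the system.
    ra, rb, rc, rx = [], [], [], []
    for i in range(1, n, 2):
        k1 = a[i] / b[i - 1]
        new_a = -a[i - 1] * k1
        new_b = b[i] - c[i - 1] * k1
        new_x = x[i] - x[i - 1] * k1
        new_c = 0.0
        if i + 1 < n:
            k2 = c[i] / b[i + 1]
            new_b -= a[i + 1] * k2
            new_x -= x[i + 1] * k2
            new_c = -c[i + 1] * k2
        ra.append(new_a)
        rb.append(new_b)
        rc.append(new_c)
        rx.append(new_x)

    # Solve the half-size system, then run the substitution pass for the
    # even-indexed unknowns that were eliminated above.
    y = [0.0] * n
    y[1::2] = cr_solve(ra, rb, rc, rx)
    for i in range(0, n, 2):
        rhs = x[i]
        if i > 0:
            rhs -= a[i] * y[i - 1]
        if i + 1 < n:
            rhs -= c[i] * y[i + 1]
        y[i] = rhs / b[i]
    return y
```

Every level has to keep its reduced a, b, c and x around until the matching substitution pass has run, which is where the memory footprint of the vanilla solver comes from.
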
Vanilla Solver Results
• Higher performance than reported in [Bavoil2010] (~6 ms vs. ~8 ms at 1600x1200)
• Memory footprint prohibitively high
  – >200 MB at 1600x1200
• Need an answer to the lack-of-parallelism problem
  – Answer given in [ZCO2010]

Vanilla CR Solver
• Pass 1: construct abc from CoC
• Input image X and abc: reduce → reduce → … (stop at size 1)
• Solve for the first y
  – Stopping at size 1 to solve for the first y is what kills parallelism
• Substitute → substitute → … to obtain the blurred image Y

Keeping the parallelism high
• Pass 1: construct abc from CoC
• Input image X and abc: reduce → reduce → … (stop at a reasonable size)
• Solve for Y at that resolution to have a big enough parallel workload (e.g. using PCR, see [ZCO2010])
• Substitute → substitute → … to obtain the blurred image Y

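A CPU sketch of this hybrid scheme for one row: CR reduction passes run until the system has at most `stop_size` unknowns, and the remaining system is solved with PCR, in which every equation is updated in every step. The names `pcr_solve`, `cr_reduce`, `hybrid_solve` and the default `stop_size=4` are assumptions for illustration; the slides only say "a reasonable size".

```python
def pcr_solve(a, b, c, x):
    """Parallel cyclic reduction: all n equations are updated in each step."""
    n = len(x)
    a, b, c, x = list(a), list(b), list(c), list(x)
    a[0], c[-1] = 0.0, 0.0
    stride = 1
    while stride < n:
        na, nb, nc, nx = a[:], b[:], c[:], x[:]
        for i in range(n):  # independent per i, so this loop could run in parallel
            lo, hi = i - stride, i + stride
            na[i], nc[i] = 0.0, 0.0
            if lo >= 0:
                k = a[i] / b[lo]
                na[i] = -a[lo] * k
                nb[i] -= c[lo] * k
                nx[i] -= x[lo] * k
            if hi < n:
                k = c[i] / b[hi]
                nc[i] = -c[hi] * k
                nb[i] -= a[hi] * k
                nx[i] -= x[hi] * k
        a, b, c, x = na, nb, nc, nx
        stride *= 2
    return [x[i] / b[i] for i in range(n)]  # all equations are decoupled now


def cr_reduce(a, b, c, x):
    """One CR reduction pass that keeps the odd-indexed equations."""
    n = len(x)
    ra, rb, rc, rx = [], [], [], []
    for i in range(1, n, 2):
        k1 = a[i] / b[i - 1]
        has_next = i + 1 < n
        k2 = c[i] / b[i + 1] if has_next else 0.0
        ra.append(-a[i - 1] * k1)
        rb.append(b[i] - c[i - 1] * k1 - (a[i + 1] * k2 if has_next else 0.0))
        rc.append(-c[i + 1] * k2 if has_next else 0.0)
        rx.append(x[i] - x[i - 1] * k1 - (x[i + 1] * k2 if has_next else 0.0))
    return ra, rb, rc, rx


def hybrid_solve(a, b, c, x, stop_size=4):
    """CR reduction down to stop_size unknowns, PCR there, then substitution."""
    n = len(x)
    a, c = list(a), list(c)
    a[0], c[-1] = 0.0, 0.0
    if n <= stop_size:
        return pcr_solve(a, b, c, x)
    y = [0.0] * n
    y[1::2] = hybrid_solve(*cr_reduce(a, b, c, x), stop_size=stop_size)
    for i in range(0, n, 2):  # substitution pass for the eliminated unknowns
        rhs = x[i]
        if i > 0:
            rhs -= a[i] * y[i - 1]
        if i + 1 < n:
            rhs -= c[i] * y[i + 1]
        y[i] = rhs / b[i]
    return y
```

A larger `stop_size` trades a few extra PCR operations for fewer nearly empty passes at the coarse end of the chain.
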
Memory Optimizations 1
• Same passes as before: reduce X and abc, stop at a reasonable size, solve for Y at that resolution, then substitute
• Surface formats used at every level:
  – X: rgba32f
  – abc: rgba32f
  – Y: rgba32f

Memory Optimizations 1
• Store X and Y as rgba16f instead of rgba32f; abc stays rgba32f
• This saves a significant amount of memory
• We found no artifacts when going from rgba32f to rgba16f

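As a rough worked example of the saving, the size of one full-resolution four-channel surface at 1600x1200 can be computed directly. The helper name `surface_bytes` is hypothetical, and the sketch counts a single surface only; the solver also allocates the coarser levels, which are not included here.

```python
def surface_bytes(width, height, channels, bytes_per_channel):
    """Size of one uncompressed surface without padding or mip levels."""
    return width * height * channels * bytes_per_channel


MIB = 1024 * 1024
rgba32f_bytes = surface_bytes(1600, 1200, 4, 4)  # 4 channels of 32-bit floats
rgba16f_bytes = surface_bytes(1600, 1200, 4, 2)  # 4 channels of 16-bit floats

print(f"rgba32f: {rgba32f_bytes / MIB:.1f} MiB")
print(f"rgba16f: {rgba16f_bytes / MIB:.1f} MiB")
```

One full-resolution surface shrinks from about 29.3 MiB to about 14.6 MiB, and the same factor of two applies to every X and Y level.
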
Memory Optimizations 2
• Skip the abc construction pass and compute abc on the fly during the first reduction pass
• This again saves a significant amount of memory, as the full-resolution abc surface is the biggest surface used by the solver
• Surface formats: X and Y are rgba16f; abc (from the first reduction pass on) is rgba32f

Intermediate Results (1600x1200)

| Solver | Time in ms (HD5870) | Time in ms (GTX480) | Memory in Megabytes |
| GDC2010 hybrid solver on GTX480 [Bavoil2010] | ~8.5 | 8.00 | ~117 (guesstimate) |
| Standard Solver (already skips high res abc construction) | 3.66 | 3.33 | ~132 |

Memory Optimizations 3
• Reduce 4-to-1 in a special first reduction pass (reduce4); abc is still computed during this first reduction pass, without a separate construction pass
• Substitute 1-to-4 in a special substitution pass (substitute4)
• Yet again this saves a significant amount of memory!
• As before: stop at a reasonable size and solve for Y at that resolution; X and Y are rgba16f

Intermediate Results (1600x1200)

| Solver | Time in ms (HD5870) | Time in ms (GTX480) | Memory in Megabytes |
| GDC2010 hybrid solver on GTX480 [Bavoil2010] | ~8.5 | 8.00 | ~117 (guesstimate) |
| Standard Solver (already skips high res abc construction) | 3.66 | 3.33 | ~132 |
| 4-to-1 Reduction | 2.87 | 3.32 | ~73 |

DX11 Memory Optimizations 1
• Pack abc and X into one rgba_uint surface
• The passes are otherwise unchanged: reduce 4-to-1 in a special first reduction pass (computing abc during that pass), stop at a reasonable size, solve for Y at that resolution, substitute, and substitute 1-to-4 in a special substitution pass
• Y stays rgba16f

Using SM5 for data packing
• X (rgba16f) and abc (rgba32f) are packed into the uint channels of one surface
• One uint packs the x and y channels of X:
  f32tof16(X.x) + (f32tof16(X.y) << 16)
• One uint holds the upper 26 bits of abc.x; the 6 lowest mantissa bits of abc.x are stolen to store some bits of X.z:
  (asuint(abc.x) & 0xFFFFFFC0) | (f32tof16(X.z) & 0x3F)
• One uint holds the upper 26 bits of abc.y; the 6 lowest mantissa bits of abc.y are stolen to store some more bits of X.z:
  (asuint(abc.y) & 0xFFFFFFC0) | ((f32tof16(X.z) >> 6) & 0x3F)

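The bit layout of these three expressions can be checked on the CPU. The sketch below emulates the HLSL intrinsics `f32tof16`, `f16tof32`, `asuint` and `asfloat` with Python's `struct` module; the rounding of the emulated `f32tof16` may differ from what a GPU does. The function names `pack_words` and `unpack_words` are hypothetical, the unpacking side is an assumption (only the packing expressions are given above), and only the 12 bits of X.z covered by those expressions are handled.

```python
import struct


def f32tof16(value):
    """Half-float bit pattern of value in the low 16 bits (emulates HLSL f32tof16)."""
    return struct.unpack("<H", struct.pack("<e", value))[0]


def f16tof32(bits):
    """Float value of a half-float bit pattern (emulates HLSL f16tof32)."""
    return struct.unpack("<e", struct.pack("<H", bits & 0xFFFF))[0]


def asuint(value):
    """Bit pattern of a 32-bit float (emulates HLSL asuint)."""
    return struct.unpack("<I", struct.pack("<f", value))[0]


def asfloat(bits):
    """32-bit float with the given bit pattern (emulates HLSL asfloat)."""
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


def pack_words(X, abc):
    """Apply the three packing expressions to X = (x, y, z, ...) and abc = (a, b, ...)."""
    xz = f32tof16(X[2])
    word_xy = f32tof16(X[0]) + (f32tof16(X[1]) << 16)
    word_a = (asuint(abc[0]) & 0xFFFFFFC0) | (xz & 0x3F)
    word_b = (asuint(abc[1]) & 0xFFFFFFC0) | ((xz >> 6) & 0x3F)
    return word_xy, word_a, word_b


def unpack_words(word_xy, word_a, word_b):
    """Recover X.x, X.y, truncated abc.x and abc.y, and the low 12 bits of X.z."""
    x = f16tof32(word_xy & 0xFFFF)
    y = f16tof32(word_xy >> 16)
    a = asfloat(word_a & 0xFFFFFFC0)  # lost 6 mantissa bits
    b = asfloat(word_b & 0xFFFFFFC0)
    xz_low12 = (word_a & 0x3F) | ((word_b & 0x3F) << 6)
    return x, y, a, b, xz_low12


words = pack_words((0.5, 0.25, 0.75), (0.1, -2.7))
print([hex(w) for w in words])
print(unpack_words(*words))
```

Clearing the 6 lowest mantissa bits leaves 17 of the 23 stored mantissa bits, so the relative error introduced in abc.x and abc.y stays below 2^-17.
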