An Optimized Diffusion Depth Of Field Solver (DDOF), Holger Gruen

  1. An Optimized Diffusion Depth Of Field Solver (DDOF)
     • Holger Gruen, AMD
     • AMD's Favorite Effects, 28th February 2011

  2. Agenda
     • Motivation
     • Recap of a high-level explanation of DDOF
     • Recap of earlier DDOF solvers
     • A Vanilla Cyclic Reduction (CR) DDOF solver
     • A DX11 optimized CR solver for DDOF
     • Results

  3. Motivation
     • The solver presented at GDC 2010 [RS2010] has some weaknesses
     • Great implementation, but its memory requirements and runtime are too high for many game developers
     • Looking for a faster and more memory-efficient solver

  4. Diffusion DOF recap 1
     • DDOF is an enhanced way of blurring a picture that takes an arbitrary circle of confusion (CoC) at each pixel into account
     • Interprets the input image as a heat distribution
     • Uses the CoC at a pixel to derive a per-pixel heat conductivity

  5. Diffusion DOF recap 2
     • Blurring is done by time stepping a differential equation that models the diffusion of heat
     • The ADI method is used to arrive at a separable solution for stepping
     • Need to solve a tri-diagonal linear system for each row and then each column of the input

  6. DDOF Tri-diagonal system
     $$
     \begin{pmatrix}
     b_1 & c_1 &        &        & 0      \\
     a_2 & b_2 & c_2    &        &        \\
         & a_3 & b_3    & c_3    &        \\
         &     & \ddots & \ddots & \ddots \\
     0   &     &        & a_n    & b_n
     \end{pmatrix}
     \begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{pmatrix}
     =
     \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_n \end{pmatrix}
     $$
     • x: row/col of the input image
     • a, b, c: derived from the CoC at each pixel of an input row/col
     • y: resulting blurred row/col

  7. Solver recap 1
     • The GDC2010 solver [RS2010] is a 'hybrid' solver
       – Performs three PCR steps upfront
       – Performs the serial 'Sweep' algorithm to solve the small resulting systems
       – Check [ZCO2010] for details on other hybrid solvers

  8. Solver recap 2
     • The GDC2010 solver [RS2010] has drawbacks
       – It uses a large UAV as a RW scratch-pad to store the modified coefficients of the sweep algorithm
         • GPUs without a RW cache will suffer
       – For high resolutions, three PCR steps produce tri-diagonal systems of substantial size
         • This means a serial (sweep) algorithm is run on a 'big' system
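
To make the "modified coefficients" of the sweep concrete, here is a minimal CPU-side sketch in Python. It assumes that "sweep" refers to the standard Thomas algorithm and that the system is diagonally dominant, so no pivoting is needed. The function name `thomas_solve` and the list-based layout are illustrative choices, not taken from the slides or from [RS2010].

```python
def thomas_solve(a, b, c, x):
    """Solve a tri-diagonal system with the serial sweep (Thomas) algorithm.

    Row i reads: a[i]*y[i-1] + b[i]*y[i] + c[i]*y[i+1] = x[i].
    a[0] and c[-1] have no matching unknown and do not affect the result.
    """
    n = len(b)
    c_mod = [0.0] * n  # modified upper diagonal (scratch data)
    x_mod = [0.0] * n  # modified right-hand side (scratch data)
    c_mod[0] = c[0] / b[0]
    x_mod[0] = x[0] / b[0]
    # Forward sweep: each step depends on the previous one, so it is serial.
    for i in range(1, n):
        denom = b[i] - a[i] * c_mod[i - 1]
        c_mod[i] = c[i] / denom
        x_mod[i] = (x[i] - a[i] * x_mod[i - 1]) / denom
    # Backward sweep: also serial, and it re-reads the scratch data.
    y = [0.0] * n
    y[-1] = x_mod[-1]
    for i in range(n - 2, -1, -1):
        y[i] = x_mod[i] - c_mod[i] * y[i + 1]
    return y


# Small 4x4 example whose exact solution is [1, 2, 3, 4].
a = [0.0, -1.0, -1.0, -1.0]
b = [3.0, 3.0, 3.0, 3.0]
c = [-1.0, -1.0, -1.0, 0.0]
x = [1.0, 2.0, 3.0, 9.0]
y = thomas_solve(a, b, c, x)
```

The two scratch arrays `c_mod` and `x_mod` are written in the forward sweep and read back in the backward sweep, which is the kind of read/write traffic the scratch-pad UAV has to absorb.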

  9. Solver recap 3
     • Cyclic Reduction (CR) solver
       – Used by [Kass2006] in the original DDOF paper
       – Runs in two phases
         1. reduction phase
         2. backward substitution phase
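
The two phases can be sketched on the CPU as follows. This is a plain recursive Python illustration of cyclic reduction under the assumption of a diagonally dominant system (no pivoting). It is not the shader code from the talk, and the name `cyclic_reduction` is hypothetical.

```python
def cyclic_reduction(a, b, c, x):
    """Solve a tri-diagonal system by cyclic reduction.

    Row i reads: a[i]*y[i-1] + b[i]*y[i] + c[i]*y[i+1] = x[i].
    """
    n = len(b)
    if n == 1:
        return [x[0] / b[0]]  # the reduction has stopped at size 1

    # Phase 1 (reduction): eliminate the even-indexed unknowns, leaving a
    # tri-diagonal system of about half the size in the odd-indexed unknowns.
    ra, rb, rc, rx = [], [], [], []
    for i in range(1, n, 2):
        has_next = i + 1 < n
        k1 = a[i] / b[i - 1]
        k2 = c[i] / b[i + 1] if has_next else 0.0
        ra.append(-a[i - 1] * k1)
        rb.append(b[i] - c[i - 1] * k1 - (a[i + 1] * k2 if has_next else 0.0))
        rc.append(-c[i + 1] * k2 if has_next else 0.0)
        rx.append(x[i] - x[i - 1] * k1 - (x[i + 1] * k2 if has_next else 0.0))

    y_odd = cyclic_reduction(ra, rb, rc, rx)

    # Phase 2 (backward substitution): recover the even-indexed unknowns
    # from their already-solved odd-indexed neighbours.
    y = [0.0] * n
    y[1::2] = y_odd
    for i in range(0, n, 2):
        left = a[i] * y[i - 1] if i > 0 else 0.0
        right = c[i] * y[i + 1] if i + 1 < n else 0.0
        y[i] = (x[i] - left - right) / b[i]
    return y


# 4x4 example whose exact solution is [1, 2, 3, 4].
y = cyclic_reduction([0.0, -1.0, -1.0, -1.0],
                     [3.0, 3.0, 3.0, 3.0],
                     [-1.0, -1.0, -1.0, 0.0],
                     [1.0, 2.0, 3.0, 9.0])
```

Within one level, every iteration of the reduction loop and of the substitution loop is independent, so each level maps to one parallel pass. The amount of work per level halves, which is why the last reduction levels and the first substitution levels offer little parallelism.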

  10. Solver recap 4
      • According to [ZCO2010]:
        – The CR solver has the lowest computational complexity of all solvers
        – It suffers from a lack of parallelism though
          • At the end of the reduction phase
          • At the start of the backward substitution phase

  11. Passes of a Vanilla CR Solver
      • The input image X supplies the right-hand side x_1 … x_n of each row/col system
      • Pass 1: construct abc (the coefficients a_i, b_i, c_i of the tri-diagonal system) from the CoC

  12. Passes of a Vanilla CR Solver
      [Diagram: pass structure]
      • Pass 1: construct abc from CoC
      • Reduction passes: input image X and abc go through reduce, reduce, … and stop at size 1
      • Solve for the first y
      • Substitution passes: substitute, substitute, … yield the blurred image Y

  13. Vanilla Solver Results
      • Higher performance than reported in [Bavoil2010] (~6 ms vs. ~8 ms at 1600x1200)
      • Memory footprint prohibitively high
        – >200 MB at 1600x1200
      • Need an answer to the lack-of-parallelism problem
        – An answer is given in [ZCO2010]

  14. Vanilla CR Solver
      [Diagram: pass structure]
      • Pass 1: construct abc from CoC
      • Reduction passes: input image X and abc go through reduce, reduce, … and stop at size 1
      • Solve for the first y
      • Callout on the last two steps: this is what kills parallelism
      • Substitution passes: substitute, substitute, … yield the blurred image Y

  15. Keeping the parallelism high
      [Diagram: pass structure]
      • Pass 1: construct abc from CoC
      • Reduction passes: input image X and abc go through reduce, reduce, … and stop at a reasonable size
      • Solve for Y at that resolution to have a big enough parallel workload (e.g. using PCR, see [ZCO2010])
      • Substitution passes: substitute, substitute, … yield the blurred image Y
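
The idea of stopping the reduction early can be sketched on the CPU as below. This is an illustrative Python sketch, not the talk's shader code: it runs cyclic reduction until the system is no larger than a hypothetical `stop_size` parameter and then solves that system with parallel cyclic reduction (PCR). Diagonal dominance (no pivoting) is assumed, and all function names are made up for the example.

```python
def pcr_solve(a, b, c, x):
    """Parallel cyclic reduction: every row is updated in every step."""
    n = len(b)
    a, b, c, x = list(a), list(b), list(c), list(x)
    a[0] = 0.0   # there is no unknown to the left of row 0
    c[-1] = 0.0  # there is no unknown to the right of the last row
    stride = 1
    while stride < n:
        na, nb, nc, nx = a[:], b[:], c[:], x[:]
        for i in range(n):  # rows are independent: one GPU thread per row
            lo, hi = i - stride, i + stride
            k1 = a[i] / b[lo] if lo >= 0 else 0.0
            k2 = c[i] / b[hi] if hi < n else 0.0
            na[i] = -a[lo] * k1 if lo >= 0 else 0.0
            nc[i] = -c[hi] * k2 if hi < n else 0.0
            nb[i] = (b[i] - (c[lo] * k1 if lo >= 0 else 0.0)
                     - (a[hi] * k2 if hi < n else 0.0))
            nx[i] = (x[i] - (x[lo] * k1 if lo >= 0 else 0.0)
                     - (x[hi] * k2 if hi < n else 0.0))
        a, b, c, x = na, nb, nc, nx
        stride *= 2
    # All rows are now decoupled.
    return [x[i] / b[i] for i in range(n)]


def hybrid_solve(a, b, c, x, stop_size=8):
    """Cyclic reduction down to stop_size, then PCR on the small system."""
    n = len(b)
    if n <= stop_size:
        return pcr_solve(a, b, c, x)
    ra, rb, rc, rx = [], [], [], []
    for i in range(1, n, 2):  # reduction pass: keep odd-indexed unknowns
        has_next = i + 1 < n
        k1 = a[i] / b[i - 1]
        k2 = c[i] / b[i + 1] if has_next else 0.0
        ra.append(-a[i - 1] * k1)
        rb.append(b[i] - c[i - 1] * k1 - (a[i + 1] * k2 if has_next else 0.0))
        rc.append(-c[i + 1] * k2 if has_next else 0.0)
        rx.append(x[i] - x[i - 1] * k1 - (x[i + 1] * k2 if has_next else 0.0))
    y = [0.0] * n
    y[1::2] = hybrid_solve(ra, rb, rc, rx, stop_size)
    for i in range(0, n, 2):  # substitution pass: recover even-indexed unknowns
        left = a[i] * y[i - 1] if i > 0 else 0.0
        right = c[i] * y[i + 1] if i + 1 < n else 0.0
        y[i] = (x[i] - left - right) / b[i]
    return y
```

In the sketch, `stop_size` plays the role of the "reasonable size" on the slide: the larger it is, the more rows PCR processes in parallel at the coarsest level, at the cost of more arithmetic per row than plain cyclic reduction.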

  16. Memory Optimizations 1
      [Diagram: pass structure]
      • Pass 1: construct abc from CoC
      • Reduction passes: input image X and abc go through reduce, reduce, … and stop at a reasonable size
      • Solve for Y at that resolution
      • Substitution passes: substitute, substitute, … yield the blurred image Y

  17. Memory Optimizations 1
      [Diagram: pass structure annotated with surface formats]
      • X reduction chain (reduce, reduce, …): rgba32f surfaces
      • abc reduction chain (reduce, reduce, …): rgba32f surfaces
      • Stop at a reasonable size and solve for Y at that resolution
      • Y substitution chain (substitute, substitute, …): rgba32f surfaces

  18. Memory Optimizations 1
      [Diagram: pass structure annotated with surface formats]
      • X reduction chain (reduce, reduce, …): rgba16f surfaces
      • abc reduction chain (reduce, reduce, …): rgba32f surfaces
      • Stop at a reasonable size and solve for Y at that resolution
      • Y substitution chain (substitute, substitute, …): rgba16f surfaces
      • This saves a significant amount of memory. We found no artifacts when going from rgba32f to rgba16f
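
As a rough check of the scale involved, the following Python snippet computes the size of a single full-resolution four-channel surface at 1600x1200 in both formats. It only covers one surface, not the solver's whole chain of reduced surfaces, so it does not reproduce the memory totals quoted in the slides. The helper `surface_bytes` is a hypothetical name.

```python
def surface_bytes(width, height, channels, bytes_per_channel):
    """Size in bytes of one tightly packed surface (no padding assumed)."""
    return width * height * channels * bytes_per_channel


MIB = 1024 * 1024
width, height = 1600, 1200

rgba32f = surface_bytes(width, height, 4, 4)  # 4 channels x 32-bit float
rgba16f = surface_bytes(width, height, 4, 2)  # 4 channels x 16-bit float

print(f"rgba32f: {rgba32f / MIB:.1f} MiB")
print(f"rgba16f: {rgba16f / MIB:.1f} MiB")
```

Each full-resolution surface switched from rgba32f to rgba16f therefore shrinks from 30,720,000 bytes to 15,360,000 bytes.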

  19. Memory Optimizations 2
      [Diagram: pass structure annotated with surface formats]
      • X reduction chain: rgba16f surfaces
      • abc reduction chain: rgba32f surfaces
      • Stop at a reasonable size and solve for Y at that resolution
      • Y substitution chain: rgba16f surfaces
      • Callout (on the abc surface constructed in pass 1): this again saves a significant amount of memory, as this is the biggest surface used by the solver

  20. Memory Optimizations 2
      [Diagram: pass structure annotated with surface formats]
      • X reduction chain: rgba16f surfaces
      • Skip the abc construction pass and compute abc on-the-fly during the first reduction pass
      • abc after the first reduction: rgba32f surface
      • Stop at a reasonable size and solve for Y at that resolution
      • Y substitution chain: rgba16f surfaces

  21. Intermediate Results (1600x1200)
      Solver | Time in ms (HD5870) | Time in ms (GTX480) | Memory in megabytes
      GDC2010 hybrid solver on GTX480 [Bavoil2010] | ~8.5 | 8.00 | ~117 (guesstimate)
      Standard solver (already skips high-res abc construction) | 3.66 | 3.33 | ~132

  22. Memory Optimizations 3
      [Diagram: pass structure annotated with surface formats]
      • X reduction chain: rgba16f surfaces
      • Skip the abc construction pass; compute abc during the first reduction pass
      • abc after the first reduction: rgba32f surface
      • Stop at a reasonable size and solve for Y at that resolution
      • Y substitution chain: rgba16f surfaces
      • Callout: yet again this saves a significant amount of memory!

  23. Memory Optimizations 3
      [Diagram: pass structure annotated with surface formats]
      • X: a single reduce4 pass into an rgba16f surface; reduce 4-to-1 in a special first reduction pass
      • Skip the abc construction pass; compute abc during the first reduction pass
      • Stop at a reasonable size and solve for Y at that resolution
      • Y substitution chain (rgba16f): substitute, substitute, …, then substitute4; substitute 1-to-4 in a special substitution pass

  24. Intermediate Results (1600x1200)
      Solver | Time in ms (HD5870) | Time in ms (GTX480) | Memory in megabytes
      GDC2010 hybrid solver on GTX480 [Bavoil2010] | ~8.5 | 8.00 | ~117 (guesstimate)
      Standard solver (already skips high-res abc construction) | 3.66 | 3.33 | ~132
      4-to-1 reduction | 2.87 | 3.32 | ~73

  25. DX11 Memory Optimizations 1
      [Diagram: pass structure annotated with surface formats]
      • X: a single reduce4 pass into an rgba16f surface; reduce 4-to-1 in a special first reduction pass
      • Skip the abc construction pass; compute abc during the first reduction pass
      • Stop at a reasonable size and solve for Y at that resolution
      • Y substitution chain (rgba16f): substitute, substitute, …, then substitute4; substitute 1-to-4 in a special substitution pass

  26. DX11 Memory Optimizations 1
      [Diagram: pass structure annotated with surface formats]
      • Pack abc and X into one rgba_uint surface
      • X: a single reduce4 pass into an rgba16f surface; reduce 4-to-1 in a special first reduction pass
      • Skip the abc construction pass; compute abc during the first reduction pass
      • Stop at a reasonable size and solve for Y at that resolution
      • Y substitution chain (rgba16f): substitute, substitute, …, then substitute4; substitute 1-to-4 in a special substitution pass

  27. Using SM5 for data packing
      [Diagram: X (rgba16f) and abc (rgba32f) are packed into four uint channels]
      • Pack the x and y channels of X into one uint: (f32tof16(X.x) + (f32tof16(X.y) << 16))
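
The packing expression can be emulated on the CPU to see what ends up in the uint. The Python sketch below uses the `struct` module's half-precision format `e` as a stand-in for HLSL's `f32tof16`. The helper names are hypothetical, and the rounding of values that are not exactly representable in half precision may differ from what a GPU produces.

```python
import struct


def f32_to_f16_bits(value):
    """CPU stand-in for f32tof16: half-float bit pattern in the low 16 bits."""
    return struct.unpack("<H", struct.pack("<e", value))[0]


def f16_bits_to_f32(bits):
    """CPU stand-in for f16tof32: decode the low 16 bits as a half float."""
    return struct.unpack("<e", struct.pack("<H", bits & 0xFFFF))[0]


def pack_xy(x, y):
    """Mirror of (f32tof16(X.x) + (f32tof16(X.y) << 16))."""
    return f32_to_f16_bits(x) + (f32_to_f16_bits(y) << 16)


def unpack_xy(packed):
    """Recover both half floats from one 32-bit word."""
    return f16_bits_to_f32(packed), f16_bits_to_f32(packed >> 16)


packed = pack_xy(1.5, -0.25)
x_out, y_out = unpack_xy(packed)
```

Because the two 16-bit patterns occupy disjoint bit ranges, the `+` in the slide's expression behaves the same as a bitwise `|`.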

  28. Using SM5 for data packing
      [Diagram: X (rgba16f) and abc (rgba32f) are packed into four uint channels]
      • One uint holds the upper 26 bits of the abc x channel plus 6 bits of X.z: (asuint(abc.x) & 0xFFFFFFC0) | (f32tof16(X.z) & 0x3F)
      • Steal the 6 lowest mantissa bits of abc.x to store some bits of X.z

  29. Using SM5 for data packing
      [Diagram: X (rgba16f) and abc (rgba32f) are packed into four uint channels]
      • One uint holds the upper 26 bits of the abc y channel plus 6 more bits of X.z: (asuint(abc.y) & 0xFFFFFFC0) | ((f32tof16(X.z) >> 6) & 0x3F)
      • Steal the 6 lowest mantissa bits of abc.y to store some bits of X.z
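
The effect of stealing mantissa bits can be tried out on the CPU. The Python sketch below emulates `asuint` and `f32tof16` with the `struct` module and applies the two expressions from the last two slides. It only covers the 12 bits of X.z that those slides place into abc.x and abc.y; where the remaining 4 bits of the 16-bit half go is not shown in this part of the deck, so the sketch does not reconstruct X.z. All helper names are hypothetical.

```python
import struct


def as_uint(value):
    """CPU stand-in for asuint: raw bits of a 32-bit float."""
    return struct.unpack("<I", struct.pack("<f", value))[0]


def as_float(bits):
    """CPU stand-in for asfloat: reinterpret 32 bits as a float."""
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


def f32_to_f16_bits(value):
    """CPU stand-in for f32tof16 (rounding may differ from a GPU)."""
    return struct.unpack("<H", struct.pack("<e", value))[0]


def pack_words(abc_x, abc_y, x_z):
    """Store bits 0..5 of half(X.z) in abc.x and bits 6..11 in abc.y."""
    z_bits = f32_to_f16_bits(x_z)
    word_x = (as_uint(abc_x) & 0xFFFFFFC0) | (z_bits & 0x3F)
    word_y = (as_uint(abc_y) & 0xFFFFFFC0) | ((z_bits >> 6) & 0x3F)
    return word_x, word_y


def unpack_words(word_x, word_y):
    """Return truncated abc.x, abc.y and the low 12 bits of half(X.z)."""
    abc_x = as_float(word_x & 0xFFFFFFC0)
    abc_y = as_float(word_y & 0xFFFFFFC0)
    z_low12 = (word_x & 0x3F) | ((word_y & 0x3F) << 6)
    return abc_x, abc_y, z_low12


word_x, word_y = pack_words(0.123456, -2.5, 0.1)
abc_x, abc_y, z_low12 = unpack_words(word_x, word_y)
```

Clearing the 6 lowest of the 23 mantissa bits leaves 17 explicit mantissa bits, so the decoded abc values differ from the originals by a relative error below 2^-17.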
