
An Optimized Diffusion Depth Of Field Solver (DDOF)
Holger Gruen, AMD
AMD's Favorite Effects, 28th February 2011

Agenda
• Motivation
• Recap of a high-level explanation of DDOF
• Recap of earlier DDOF solvers
• A Vanilla Cyclic Reduction (CR) DDOF solver
• A DX11 optimized CR solver for DDOF
• Results

Motivation
• Solver presented at GDC 2010 [RS2010] has some weaknesses
• Great implementation, but memory requirements and runtime are too high for many game developers
• Looking for a faster and more memory-efficient solver

Diffusion DOF recap 1
• DDOF is an enhanced way of blurring a picture that takes an arbitrary CoC (Circle of Confusion) at each pixel into account
• Interprets the input image as a heat distribution
• Uses the CoC at a pixel to derive a per-pixel heat conductivity

Diffusion DOF recap 2
• Blurring is done by time stepping a differential equation that models the diffusion of heat
• ADI (alternating direction implicit) method used to arrive at a separable solution for stepping
• Need to solve a tri-diagonal linear system for each row and then each column of the input
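
The step from "heat conductivity" to "tri-diagonal system" can be made concrete with a small sketch. It assumes one backward-Euler time step of the 1D heat equation with unit time step, and it assumes the conductivity between two neighbouring pixels is the square of the smaller CoC. Both choices, and the name `build_tridiagonal`, are illustrative assumptions and are not taken from the slides.

```python
def build_tridiagonal(coc):
    """Return (a, b, c) for one row/column, given per-pixel CoC values.

    Assumption (not from the slides): beta[i], the conductivity between
    pixel i and pixel i + 1, is the squared smaller CoC of the two pixels.
    """
    n = len(coc)
    beta = [min(coc[i], coc[i + 1]) ** 2 for i in range(n - 1)]
    a = [0.0] * n
    b = [1.0] * n
    c = [0.0] * n
    for i in range(n):
        left = beta[i - 1] if i > 0 else 0.0       # coupling to pixel i - 1
        right = beta[i] if i < n - 1 else 0.0      # coupling to pixel i + 1
        # Backward Euler: y[i] - x[i] = right*(y[i+1]-y[i]) - left*(y[i]-y[i-1])
        a[i] = -left
        c[i] = -right
        b[i] = 1.0 + left + right
    return a, b, c
```

With this construction every row of the matrix sums to 1, so a constant image stays constant, and a pixel whose couplings are both zero keeps its input value (b = 1, a = c = 0).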

DDOF Tri-diagonal system

\[
\begin{pmatrix}
b_1 & c_1 &        &        & 0      \\
a_2 & b_2 & c_2    &        &        \\
    & a_3 & b_3    & c_3    &        \\
    &     & \ddots & \ddots & \ddots \\
0   &     &        & a_n    & b_n
\end{pmatrix}
\begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{pmatrix}
=
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_n \end{pmatrix}
\]

• x: row/col of input image
• a, b, c: derived from CoC at each pixel of an input row/col
• y: resulting blurred row/col

Solver recap 1
• The GDC2010 solver [RS2010] is a 'hybrid' solver
  – Performs three PCR (parallel cyclic reduction) steps upfront
  – Performs a serial 'sweep' algorithm to solve the small resulting systems
  – Check [ZCO2010] for details on other hybrid solvers

Solver recap 2
• The GDC2010 solver [RS2010] has drawbacks
  – It uses a large UAV as a RW scratch-pad to store the modified coefficients of the sweep algorithm
    • GPUs without RW cache will suffer
  – For high resolutions, three PCR steps produce tri-diagonal systems of substantial size
    • This means a serial (sweep) algorithm is run on a 'big' system
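
For reference, a serial sweep can be sketched as the textbook Thomas algorithm. This is a generic CPU version, not the [RS2010] shader. The arrays `c_mod` and `x_mod` (names chosen here) play the role of the modified coefficients that a GPU version would have to keep in a scratch-pad between the forward and backward loops.

```python
def sweep_solve(a, b, c, x):
    """Solve a tri-diagonal system with the serial Thomas (sweep) algorithm.

    a, b, c are the sub-, main and super-diagonal; x is the right-hand side.
    a[0] and c[-1] are ignored. Assumes no zero pivots (for example a
    diagonally dominant system).
    """
    n = len(x)
    c_mod = [0.0] * n   # modified super-diagonal
    x_mod = [0.0] * n   # modified right-hand side
    c_mod[0] = c[0] / b[0]
    x_mod[0] = x[0] / b[0]
    # Forward sweep: each step depends on the previous one, hence serial.
    for i in range(1, n):
        denom = b[i] - a[i] * c_mod[i - 1]
        c_mod[i] = c[i] / denom
        x_mod[i] = (x[i] - a[i] * x_mod[i - 1]) / denom
    # Backward sweep.
    y = [0.0] * n
    y[-1] = x_mod[-1]
    for i in range(n - 2, -1, -1):
        y[i] = x_mod[i] - c_mod[i] * y[i + 1]
    return y
```

Both loops carry a dependency from one element to the next, which is why the length of the system handed to the sweep matters on a GPU.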

Solver recap 3
• Cyclic Reduction (CR) solver
  – Used by [Kass2006] in the original DDOF paper
  – Runs in two phases
    1. reduction phase
    2. backward substitution phase

Solver recap 4
• According to [ZCO2010]:
  – CR solver has the lowest computational complexity of all solvers
  – It suffers from lack of parallelism though
    • At the end of the reduction phase
    • At the start of the backward substitution phase

Passes of a Vanilla CR Solver      b c 0 y x 1 1 1 1 Input image      X a b c y x      2 2 2 2 2       Pass 1: a b c y x 3 3 3 3 3 construct      abc from CoC                0 a b y x n n n n AMD ‘ s Favorite Effects 28th February 2011 12

Passes of a Vanilla CR Solver
• Input image X → reduce → reduce → … → stop at size 1
• Pass 1: construct abc from CoC → reduce → reduce → …
• Solve for the first y
• Blurred image Y ← substitute ← substitute ← …
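
The reduce / solve / substitute structure can be sketched on the CPU as a recursive cyclic reduction. This is a textbook formulation under the assumption that `a[0]` and `c[-1]` are zero and that no pivot becomes zero. The name `cr_solve` is chosen here, and the GPU passes of the talk are not reproduced.

```python
def cr_solve(a, b, c, x):
    """Solve a tri-diagonal system by cyclic reduction (reduce down to size 1).

    Assumes a[0] == 0 and c[-1] == 0.
    """
    n = len(x)
    if n == 1:
        return [x[0] / b[0]]                 # "solve for the first y"
    ra, rb, rc, rx = [], [], [], []
    # Reduction: keep the odd equations, eliminate their even neighbours.
    for i in range(1, n, 2):
        has_right = i + 1 < n
        k1 = a[i] / b[i - 1]
        k2 = c[i] / b[i + 1] if has_right else 0.0
        ra.append(-a[i - 1] * k1)
        rb.append(b[i] - c[i - 1] * k1 - (a[i + 1] * k2 if has_right else 0.0))
        rc.append(-c[i + 1] * k2 if has_right else 0.0)
        rx.append(x[i] - x[i - 1] * k1 - (x[i + 1] * k2 if has_right else 0.0))
    y = [0.0] * n
    y[1::2] = cr_solve(ra, rb, rc, rx)       # half-size system
    # Backward substitution: even unknowns from their already known neighbours.
    for i in range(0, n, 2):
        left = a[i] * y[i - 1] if i > 0 else 0.0
        right = c[i] * y[i + 1] if i + 1 < n else 0.0
        y[i] = (x[i] - left - right) / b[i]
    return y
```

Each level halves the number of equations, so the last reduction steps and the first substitution steps work on only a handful of unknowns.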

Vanilla Solver Results
• Higher performance than reported in [Bavoil2010] (~6 ms vs. ~8 ms at 1600x1200)
• Memory footprint prohibitively high
  – >200 MB at 1600x1200
• Need an answer to the lack-of-parallelism problem
  – Answer given in [ZCO2010]

Vanilla CR Solver
• Input image X → reduce → reduce → … → stop at size 1
• Pass 1: construct abc from CoC → reduce → reduce → …
• Solve for the first y
  – Stopping at size 1 and solving for the first y is what kills parallelism
• Blurred image Y ← substitute ← substitute ← …

Keeping the parallelism high
• Input image X → reduce → reduce → … → stop at a reasonable size
• Pass 1: construct abc from CoC → reduce → reduce → …
• Solve for Y at that resolution to have a big enough parallel workload (e.g. using PCR, see [ZCO2010])
• Blurred image Y ← substitute ← substitute ← …
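
A CPU sketch of this idea: reduce with CR only until the system has at most `stop_size` equations, then solve that remaining system with parallel cyclic reduction, where every equation is updated in every step. The names `pcr_solve`, `hybrid_solve` and `stop_size` are illustrative, the value 4 is arbitrary, and `a[0] == 0` and `c[-1] == 0` are assumed.

```python
def pcr_solve(a, b, c, x):
    """Parallel cyclic reduction: all equations are updated in each step."""
    n = len(x)
    a, b, c, x = list(a), list(b), list(c), list(x)
    stride = 1
    while stride < n:
        na, nb, nc, nx = list(a), list(b), list(c), list(x)
        for i in range(n):          # iterations are independent of each other
            lo, hi = i - stride, i + stride
            k1 = a[i] / b[lo] if lo >= 0 else 0.0
            k2 = c[i] / b[hi] if hi < n else 0.0
            na[i] = -a[lo] * k1 if lo >= 0 else 0.0
            nc[i] = -c[hi] * k2 if hi < n else 0.0
            nb[i] = (b[i] - (c[lo] * k1 if lo >= 0 else 0.0)
                     - (a[hi] * k2 if hi < n else 0.0))
            nx[i] = (x[i] - (x[lo] * k1 if lo >= 0 else 0.0)
                     - (x[hi] * k2 if hi < n else 0.0))
        a, b, c, x = na, nb, nc, nx
        stride *= 2
    return [x[i] / b[i] for i in range(n)]   # equations are now decoupled


def hybrid_solve(a, b, c, x, stop_size=4):
    """CR reduction down to stop_size, PCR on the rest, then substitution."""
    n = len(x)
    if n <= stop_size:
        return pcr_solve(a, b, c, x)
    ra, rb, rc, rx = [], [], [], []
    for i in range(1, n, 2):                 # same reduction step as plain CR
        has_right = i + 1 < n
        k1 = a[i] / b[i - 1]
        k2 = c[i] / b[i + 1] if has_right else 0.0
        ra.append(-a[i - 1] * k1)
        rb.append(b[i] - c[i - 1] * k1 - (a[i + 1] * k2 if has_right else 0.0))
        rc.append(-c[i + 1] * k2 if has_right else 0.0)
        rx.append(x[i] - x[i - 1] * k1 - (x[i + 1] * k2 if has_right else 0.0))
    y = [0.0] * n
    y[1::2] = hybrid_solve(ra, rb, rc, rx, stop_size)
    for i in range(0, n, 2):                 # backward substitution
        left = a[i] * y[i - 1] if i > 0 else 0.0
        right = c[i] * y[i + 1] if i + 1 < n else 0.0
        y[i] = (x[i] - left - right) / b[i]
    return y
```

PCR does more arithmetic than CR on the same system, but on the small remaining system that extra work buys a workload in which no equation has to wait for another.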

Memory Optimizations 1
• Input image X → reduce → reduce → … → stop at a reasonable size
• Pass 1: construct abc from CoC → reduce → reduce → …
• Solve for Y at that resolution
• Blurred image Y ← substitute ← substitute ← …

Memory Optimizations 1
• X surfaces (rgba32f): reduce → reduce → … → stop at a reasonable size
• abc surfaces (rgba32f): reduce → reduce → …
• Solve for Y at that resolution
• Y surfaces (rgba32f): … ← substitute ← substitute ← substitute

Memory Optimizations 1
• X surfaces (rgba16f): reduce → reduce → … → stop at a reasonable size
• abc surfaces (rgba32f): reduce → reduce → …
• Solve for Y at that resolution
• Y surfaces (rgba16f): … ← substitute ← substitute ← substitute
• This saves a significant amount of memory; we found no artifacts when going from rgba32f to rgba16f

Memory Optimizations 2
• X surfaces (rgba16f): reduce → reduce → … → stop at a reasonable size
• abc surfaces (rgba32f): reduce → reduce → …
• Solve for Y at that resolution
• Y surfaces (rgba16f): … ← substitute ← substitute ← substitute
• Dropping the biggest surface used by the solver again saves a significant amount of memory

Memory Optimizations 2
• X surfaces (rgba16f): reduce → reduce → … → stop at a reasonable size
• Skip the abc construction pass and compute abc on-the-fly during the first reduction pass
• abc surfaces (rgba32f): reduce → …
• Solve for Y at that resolution
• Y surfaces (rgba16f): … ← substitute ← substitute ← substitute

Intermediate Results 1600x1200

Solver | Time in ms (HD5870) | Time in ms (GTX480) | Memory in Megabytes
GDC2010 hybrid solver on GTX480 [Bavoil 2010] (guesstimate) | ~8.5 | 8.00 | ~117
Standard Solver (already skips high res abc construction) | 3.66 | 3.33 | ~132

Memory Optimizations 3
• X surfaces (rgba16f): reduce → reduce → … → stop at a reasonable size
• Skip the abc construction pass and compute abc during the first reduction pass
• abc surfaces (rgba32f): reduce → …
• Solve for Y at that resolution
• Y surfaces (rgba16f): … ← substitute ← substitute ← substitute
• Yet again, a significant amount of memory can be saved here

Memory Optimizations 3
• X surface (rgba16f): reduce4 → … → stop at a reasonable size
  – Reduce 4-to-1 in a special first reduction pass
• Skip the abc construction pass and compute abc during the first reduction pass
• Solve for Y at that resolution
• Y surface (rgba16f): … ← substitute ← substitute ← substitute4
  – Substitute 1-to-4 in a special substitution pass
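
One way to picture a 4-to-1 reduction is as two ordinary CR reductions fused into one pass, where the intermediate half-resolution equations are recomputed on the fly instead of being written to a surface. The sketch below illustrates that idea only. The slides do not show how the actual shader is organised, and `eliminate`, `reduce2`, `reduce4` and the sample system `eqs` are names and data chosen here. Each equation is a tuple `(a, b, c, x)`.

```python
def eliminate(left, mid, right):
    """One CR elimination: fold the two neighbouring equations into `mid`."""
    al, bl, cl, xl = left
    am, bm, cm, xm = mid
    k1 = am / bl
    if right is None:                    # no right neighbour at the boundary
        return (-al * k1, bm - cl * k1, 0.0, xm - xl * k1)
    ar, br, cr, xr = right
    k2 = cm / br
    return (-al * k1, bm - cl * k1 - ar * k2, -cr * k2, xm - xl * k1 - xr * k2)


def reduce2(eqs):
    """Ordinary 2-to-1 reduction: keeps equations 1, 3, 5, ..."""
    n = len(eqs)
    return [eliminate(eqs[i - 1], eqs[i], eqs[i + 1] if i + 1 < n else None)
            for i in range(1, n, 2)]


def reduce4(eqs):
    """Fused 4-to-1 reduction: keeps equations 3, 7, 11, ... in one pass."""
    n = len(eqs)

    def level1(i):
        # Half-resolution equation at odd index i, recomputed, never stored.
        return eliminate(eqs[i - 1], eqs[i], eqs[i + 1] if i + 1 < n else None)

    return [eliminate(level1(i - 2), level1(i),
                      level1(i + 2) if i + 2 < n else None)
            for i in range(3, n, 4)]


# Sample 9x9 system: a = c = -1 (zero at the ends), b = 4, x = 1..9.
eqs = [(0.0 if i == 0 else -1.0, 4.0, 0.0 if i == 8 else -1.0, float(i + 1))
       for i in range(9)]
```

In this sketch `reduce4(eqs)` produces the same equations as `reduce2(reduce2(eqs))`. The trade is extra arithmetic per output equation in exchange for not allocating the half-resolution level.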

Intermediate Results 1600x1200

Solver | Time in ms (HD5870) | Time in ms (GTX480) | Memory in Megabytes
GDC2010 hybrid solver on GTX480 [Bavoil 2010] (guesstimate) | ~8.5 | 8.00 | ~117
Standard Solver (already skips high res abc construction) | 3.66 | 3.33 | ~132
4-to-1 Reduction | 2.87 | 3.32 | ~73

DX11 Memory Optimizations 1
• X surface (rgba16f): reduce4 → … → stop at a reasonable size
  – Reduce 4-to-1 in a special first reduction pass
• Skip the abc construction pass and compute abc during the first reduction pass
• Solve for Y at that resolution
• Y surface (rgba16f): … ← substitute ← substitute ← substitute4
  – Substitute 1-to-4 in a special substitution pass

DX11 Memory Optimizations 1
• Pack abc and X into one rgba_uint surface
• X surface (rgba16f): reduce4 → … → stop at a reasonable size
  – Reduce 4-to-1 in a special first reduction pass
• Skip the abc construction pass and compute abc during the first reduction pass
• Solve for Y at that resolution
• Y surface (rgba16f): … ← substitute ← substitute ← substitute4
  – Substitute 1-to-4 in a special substitution pass

Using SM5 for data packing
• Sources: X (rgba16f) and abc (rgba32f); destination: one surface with four uint channels
• Pack the x and y channels of X into one uint:
  (f32tof16(X.x) + (f32tof16(X.y) << 16))
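
The packing expression can be tried out on the CPU. The sketch below emulates the HLSL intrinsics `f32tof16` and `f16tof32` with Python's `struct` half-float format. The helper names `pack_xy` and `unpack_xy` are chosen here, and the rounding of values that are not exactly representable as halves may differ from what a GPU does.

```python
import struct


def f32tof16(value):
    """Emulation of HLSL f32tof16: half-float bit pattern in the low 16 bits."""
    return struct.unpack("<H", struct.pack("<e", value))[0]


def f16tof32(bits):
    """Emulation of HLSL f16tof32: decodes the low 16 bits as a half float."""
    return struct.unpack("<e", struct.pack("<H", bits & 0xFFFF))[0]


def pack_xy(x, y):
    """X.x goes into the low 16 bits, X.y into the high 16 bits of one uint."""
    return (f32tof16(x) + (f32tof16(y) << 16)) & 0xFFFFFFFF


def unpack_xy(packed):
    """Inverse of pack_xy."""
    return f16tof32(packed), f16tof32(packed >> 16)
```

For example, `pack_xy(1.0, 0.5)` yields `0x38003C00`, since 1.0 is `0x3C00` and 0.5 is `0x3800` as half floats.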

Using SM5 for data packing
• Sources: X (rgba16f) and abc (rgba32f); destination: one surface with four uint channels
• One uint keeps the upper 26 bits of the x channel of abc:
  (asuint(abc.x) & 0xFFFFFFC0) | (f32tof16(X.z) & 0x3F)
• Steal the 6 lowest mantissa bits of abc.x to store some bits of X.z

Using SM5 for data packing
• Sources: X (rgba16f) and abc (rgba32f); destination: one surface with four uint channels
• One uint keeps the upper 26 bits of the y channel of abc:
  (asuint(abc.y) & 0xFFFFFFC0) | ((f32tof16(X.z) >> 6) & 0x3F)
• Steal the 6 lowest mantissa bits of abc.y to store some bits of X.z
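
The bit stealing for abc.x and abc.y can be checked with a small CPU emulation. `asuint`, `asfloat` and `f32tof16` mimic the HLSL intrinsics via `struct`; `pack_stolen_bits` and `unpack_stolen_bits` are hypothetical helper names. Only the 12 bits of X.z covered by these two expressions are handled here; where the remaining 4 bits of the half float go is not shown in this part of the deck, so the sketch leaves them out.

```python
import struct

KEEP_MASK = 0xFFFFFFC0   # clears the 6 lowest mantissa bits of a float32


def asuint(value):
    """Bit pattern of a float32, like HLSL asuint."""
    return struct.unpack("<I", struct.pack("<f", value))[0]


def asfloat(bits):
    """Float32 with the given bit pattern, like HLSL asfloat."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def f32tof16(value):
    """Emulation of HLSL f32tof16: half-float bit pattern in the low 16 bits."""
    return struct.unpack("<H", struct.pack("<e", value))[0]


def pack_stolen_bits(abc_x, abc_y, X_z):
    """Store bits 0..5 of half(X.z) in abc.x and bits 6..11 in abc.y."""
    h = f32tof16(X_z)
    px = (asuint(abc_x) & KEEP_MASK) | (h & 0x3F)
    py = (asuint(abc_y) & KEEP_MASK) | ((h >> 6) & 0x3F)
    return px, py


def unpack_stolen_bits(px, py):
    """Return truncated abc.x, abc.y and the low 12 bits of half(X.z)."""
    abc_x = asfloat(px & KEEP_MASK)
    abc_y = asfloat(py & KEEP_MASK)
    low12 = (px & 0x3F) | ((py & 0x3F) << 6)
    return abc_x, abc_y, low12
```

Clearing 6 of the 23 mantissa bits truncates a coefficient by a relative error below 2^-17, which gives a feel for how much precision of abc is traded for the packed bits.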
