An Optimized Diffusion Depth Of Field Solver (DDOF)
28th February 2011, AMD's Favorite Effects
Holger Gruen – AMD
Agenda
- Motivation
- Recap of a high-level explanation of DDOF
- Recap of earlier DDOF solvers
- A Vanilla Cyclic Reduction (CR) DDOF solver
- A DX11-optimized CR solver for DDOF
- Results
Motivation
- The solver presented at GDC 2010 [RS2010] has some weaknesses
- Great implementation, but memory requirements and runtime are too high for many game developers
- Looking for a faster and more memory-efficient solver
Diffusion DOF recap 1
- DDOF is an enhanced way of blurring a picture, taking an arbitrary CoC (Circle of Confusion) at each pixel into account
- Interprets the input image as a heat distribution
- Uses the CoC at a pixel to derive a per-pixel heat conductivity
Diffusion DOF recap 2
- Blurring is done by time-stepping a differential equation that models the diffusion of heat
- The ADI (alternating direction implicit) method is used to arrive at a separable solution for stepping
- Need to solve a tri-diagonal linear system for each row and then each column of the input
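DDOF Tri-diagonal system
$$
\begin{pmatrix}
b_1 & c_1 & & & \\
a_2 & b_2 & c_2 & & \\
& a_3 & b_3 & c_3 & \\
& & \ddots & \ddots & \ddots \\
& & & a_n & b_n
\end{pmatrix}
\begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{pmatrix}
=
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_n \end{pmatrix}
$$
- x: row/col of the input image
- a, b, c: derived from the CoC at each pixel of an input row/col
- y: resulting blurred row/col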
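Each row i of this system reads a_i y_{i-1} + b_i y_i + c_i y_{i+1} = x_i. The slides do not spell out how a, b and c follow from the CoC, so the Python sketch below rests on an assumption in the spirit of [Kass2006]: the conductivity between pixel i and pixel i+1 is taken as `scale * min(coc[i], coc[i+1]) ** 2`, and one implicit diffusion step gives the coefficients. The names `build_system`, `apply_system` and `scale` are hypothetical.

```python
def build_system(coc, scale=1.0):
    """Build a, b, c for one row from per-pixel CoC values.

    Assumption (not stated in the slides): the heat conductivity between
    pixel i and pixel i+1 is beta_i = scale * min(coc[i], coc[i+1]) ** 2,
    and one implicit time step yields
        a_i = -beta_{i-1},  c_i = -beta_i,  b_i = 1 + beta_{i-1} + beta_i.
    """
    n = len(coc)
    beta = [scale * min(coc[i], coc[i + 1]) ** 2 for i in range(n - 1)]
    a = [0.0] + [-bt for bt in beta]   # a_1 = 0: no left neighbour
    c = [-bt for bt in beta] + [0.0]   # c_n = 0: no right neighbour
    b = [1.0 - a[i] - c[i] for i in range(n)]
    return a, b, c


def apply_system(a, b, c, y):
    """Multiply the tri-diagonal matrix by y, i.e. recover x from y."""
    n = len(y)
    return [
        (a[i] * y[i - 1] if i > 0 else 0.0)
        + b[i] * y[i]
        + (c[i] * y[i + 1] if i < n - 1 else 0.0)
        for i in range(n)
    ]
```

With this choice every row sums to 1, so a constant row stays constant, and a CoC of zero gives the identity system (no blur at in-focus pixels).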
Solver recap 1
- The GDC2010 solver [RS2010] is a 'hybrid' solver
– Performs three PCR (parallel cyclic reduction) steps upfront
– Performs a serial 'sweep' algorithm to solve the small resulting systems
– Check [ZCO2010] for details on other hybrid solvers
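As a rough illustration of the serial part, the following Python sketch assumes that the 'sweep' is the classic Thomas algorithm (forward elimination followed by back substitution). The function name `thomas_solve` and the list-based interface are hypothetical; the PCR steps that precede the sweep in the hybrid solver are not shown.

```python
def thomas_solve(a, b, c, x):
    """Solve a tri-diagonal system serially (assumed form of the 'sweep').

    a, b, c are the sub-, main and super-diagonal; a[0] and c[-1] are unused.
    """
    n = len(x)
    c_mod = [0.0] * n  # modified super-diagonal, must be kept for the back sweep
    x_mod = [0.0] * n  # modified right-hand side
    c_mod[0] = c[0] / b[0]
    x_mod[0] = x[0] / b[0]
    for i in range(1, n):  # forward sweep: each step depends on the previous one
        denom = b[i] - a[i] * c_mod[i - 1]
        c_mod[i] = c[i] / denom
        x_mod[i] = (x[i] - a[i] * x_mod[i - 1]) / denom
    y = [0.0] * n
    y[-1] = x_mod[-1]
    for i in range(n - 2, -1, -1):  # backward sweep
        y[i] = x_mod[i] - c_mod[i] * y[i + 1]
    return y
```

Both loops carry a dependency from one element to the next, which is why this part runs serially per system, and the modified coefficients have to be stored between the two sweeps.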
Solver recap 2
- The GDC2010 solver [RS2010] has drawbacks
– It uses a large UAV as a RW scratch-pad to store the modified coefficients of the sweep algorithm
  - GPUs without a RW cache will suffer
– For high resolutions, three PCR steps still produce tri-diagonal systems of substantial size
  - This means a serial (sweep) algorithm is run on a 'big' system
Solver recap 3
- Cyclic Reduction (CR) solver
– Used by [Kass2006] in the original DDOF paper
– Runs in two phases
  - 1. reduction phase
  - 2. backward substitution phase
Solver recap 4
- According to [ZCO2010]:
– The CR solver has the lowest computational complexity of all solvers
– It suffers from a lack of parallelism, though
  - At the end of the reduction phase
  - At the start of the backward substitution phase
Passes of a Vanilla CR Solver
- Pass 1: construct abc from the CoC of the input image X
- Reduction passes: reduce X and abc repeatedly, stop at size 1
- Solve for the first y
- Substitution passes: substitute repeatedly to obtain the blurred image Y
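The passes of the vanilla solver can be sketched for a single row in Python. This is a CPU illustration under assumptions, not the shader code of the talk: each recursion level stands for one reduce pass, the unwinding of the recursion stands for the substitute passes, and the function name `cyclic_reduction` is hypothetical.

```python
def cyclic_reduction(a, b, c, x):
    """Solve a tri-diagonal system by cyclic reduction down to size 1.

    a, b, c are the sub-, main and super-diagonal; a[0] and c[-1] must be 0.
    """
    n = len(x)
    if n == 1:
        return [x[0] / b[0]]  # stop at size 1 and solve for the first y

    # Reduction: keep every second equation, eliminate its two neighbours.
    ra, rb, rc, rx = [], [], [], []
    for i in range(1, n, 2):
        k1 = a[i] / b[i - 1]
        new_a = -k1 * a[i - 1]
        new_b = b[i] - k1 * c[i - 1]
        new_x = x[i] - k1 * x[i - 1]
        new_c = 0.0
        if i + 1 < n:
            k2 = c[i] / b[i + 1]
            new_b -= k2 * a[i + 1]
            new_x -= k2 * x[i + 1]
            new_c = -k2 * c[i + 1]
        ra.append(new_a)
        rb.append(new_b)
        rc.append(new_c)
        rx.append(new_x)

    coarse = cyclic_reduction(ra, rb, rc, rx)  # half-size system

    # Backward substitution: fill in the unknowns that were eliminated.
    y = [0.0] * n
    y[1::2] = coarse
    for i in range(0, n, 2):
        left = a[i] * y[i - 1] if i > 0 else 0.0
        right = c[i] * y[i + 1] if i + 1 < n else 0.0
        y[i] = (x[i] - left - right) / b[i]
    return y
```

Every level halves the number of equations, so the last reduction levels and the first substitution levels operate on only a handful of values per row.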
Vanilla Solver Results
- Higher performance than reported in [Bavoil2010] (~6 ms vs. ~8 ms at 1600x1200)
- Memory footprint prohibitively high
– >200 MB at 1600x1200
- Need an answer to the lack-of-parallelism problem; one is given in [ZCO2010]
Vanilla CR Solver
- Same passes as before: construct abc from the CoC, reduce down to size 1, solve for the first y, substitute back up to the blurred image
- Reducing all the way down to size 1 and solving for the first y is what kills parallelism
Keeping the parallelism high
- Pass 1: construct abc from the CoC of the input image X
- Reduce repeatedly, but stop at a reasonable size
- Solve for Y at that resolution to have a big enough parallel workload (e.g. using PCR, see [ZCO2010])
- Substitute repeatedly to obtain the blurred image
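A minimal Python sketch of this idea for one row, assuming the reduction stops once at most `stop_size` equations are left and the remaining system is solved with PCR. The names `pcr_solve`, `cr_solve` and `stop_size` are hypothetical, and the default of 4 is only for illustration; the slides do not state the size used.

```python
def pcr_solve(a, b, c, x):
    """Parallel cyclic reduction: all equations are updated in every step."""
    n = len(x)
    stride = 1
    while stride < n:
        na, nc = [0.0] * n, [0.0] * n
        nb, nx = list(b), list(x)
        for i in range(n):  # on a GPU every i would be an independent thread
            lo, hi = i - stride, i + stride
            if lo >= 0:
                k = a[i] / b[lo]
                na[i] = -k * a[lo]
                nb[i] -= k * c[lo]
                nx[i] -= k * x[lo]
            if hi < n:
                k = c[i] / b[hi]
                nc[i] = -k * c[hi]
                nb[i] -= k * a[hi]
                nx[i] -= k * x[hi]
        a, b, c, x = na, nb, nc, nx
        stride *= 2
    return [x[i] / b[i] for i in range(n)]  # all equations are decoupled now


def cr_solve(a, b, c, x, stop_size=4):
    """Cyclic reduction that hands over to PCR at a 'reasonable size'."""
    n = len(x)
    if n <= max(stop_size, 1):
        return pcr_solve(a, b, c, x)
    ra, rb, rc, rx = [], [], [], []
    for i in range(1, n, 2):  # reduce: keep every second equation
        k1 = a[i] / b[i - 1]
        new_b = b[i] - k1 * c[i - 1]
        new_x = x[i] - k1 * x[i - 1]
        new_c = 0.0
        if i + 1 < n:
            k2 = c[i] / b[i + 1]
            new_b -= k2 * a[i + 1]
            new_x -= k2 * x[i + 1]
            new_c = -k2 * c[i + 1]
        ra.append(-k1 * a[i - 1])
        rb.append(new_b)
        rc.append(new_c)
        rx.append(new_x)
    y = [0.0] * n
    y[1::2] = cr_solve(ra, rb, rc, rx, stop_size)
    for i in range(0, n, 2):  # substitute: recover the eliminated unknowns
        left = a[i] * y[i - 1] if i > 0 else 0.0
        right = c[i] * y[i + 1] if i + 1 < n else 0.0
        y[i] = (x[i] - left - right) / b[i]
    return y
```

PCR does more arithmetic per unknown than CR, but every step touches all remaining equations, so the small system still provides parallel work across all rows of the image.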
Memory Optimizations 1
- Starting point: every surface of the solver is rgba32f (X, abc, their reduced levels, and the substituted results)
- Switch the X chain and the substituted results from rgba32f to rgba16f; abc stays rgba32f
- This saves a significant amount of memory
- We found no artifacts when going from rgba32f to rgba16f
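The effect of the format change on a single surface is plain arithmetic, sketched below in Python. The helper names are hypothetical, megabytes are taken as 1024 x 1024 bytes, and the chain helper assumes that each reduction level halves the width; the exact set of surfaces kept by the solver is not listed in the slides.

```python
RGBA32F_BYTES = 16  # 4 channels x 4 bytes
RGBA16F_BYTES = 8   # 4 channels x 2 bytes


def surface_megabytes(width, height, bytes_per_pixel):
    """Size of one surface in megabytes (1 MB = 1024 * 1024 bytes here)."""
    return width * height * bytes_per_pixel / (1024 * 1024)


def chain_megabytes(width, height, bytes_per_pixel, stop_width):
    """Total size of a horizontal reduction chain.

    Assumption: every reduction level halves the width until it is at or
    below stop_width; the real solver may keep a different set of surfaces.
    """
    total = 0.0
    w = width
    while True:
        total += surface_megabytes(w, height, bytes_per_pixel)
        if w <= stop_width:
            return total
        w //= 2


full_32f = surface_megabytes(1600, 1200, RGBA32F_BYTES)
full_16f = surface_megabytes(1600, 1200, RGBA16F_BYTES)
```

Because every level is half the size of the previous one, the full-resolution surfaces account for roughly half of a chain, which is why the later optimizations target them first.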
Memory Optimizations 2
- Skip the abc construction pass and compute abc on the fly during the first reduction pass
- This again saves a significant amount of memory, as the full-resolution abc surface is the biggest surface used by the solver
Intermediate Results 1600x1200
Solver | Time in ms (HD5870) | Time in ms (GTX480) | Memory in Megabytes
GDC2010 hybrid solver on GTX480 | ~8.5 | 8.00 [Bavoil2010] | ~117 (guesstimate)
Standard solver (already skips high-res abc construction) | 3.66 | 3.33 | ~132
Memory Optimizations 3
- Skip the abc construction pass and compute abc during the first reduction pass
- Reduce 4-to-1 in a special first reduction pass (reduce4)
- Substitute 1-to-4 in a special substitution pass (substitute4)
- Yet again this saves a significant amount of memory!
Intermediate Results 1600x1200
Solver | Time in ms (HD5870) | Time in ms (GTX480) | Memory in Megabytes
GDC2010 hybrid solver on GTX480 | ~8.5 | 8.00 [Bavoil2010] | ~117 (guesstimate)
Standard solver (already skips high-res abc construction) | 3.66 | 3.33 | ~132
4-to-1 Reduction | 2.87 | 3.32 | ~73
DX11 Memory Optimizations 1
- Skip the abc construction pass and compute abc during the first reduction pass
- Reduce 4-to-1 in a special first reduction pass (reduce4)
- Substitute 1-to-4 in a special substitution pass (substitute4)
- Pack abc and X into one rgba_uint surface
Using SM5 for data packing
- The rgba16f X and the rgba32f abc of a pixel are packed into four uints
- One uint packs the x and y channels of X:
  f32tof16(X.x) + (f32tof16(X.y) << 16)
- One uint keeps the upper 26 bits of abc.x and steals its 6 lowest mantissa bits to store bits 0-5 of X.z:
  (asuint(abc.x) & 0xFFFFFFC0) | (f32tof16(X.z) & 0x3F)
- One uint keeps the upper 26 bits of abc.y and steals its 6 lowest mantissa bits to store bits 6-11 of X.z:
  (asuint(abc.y) & 0xFFFFFFC0) | ((f32tof16(X.z) >> 6) & 0x3F)
- One uint keeps the upper 26 bits of abc.z and steals its 6 lowest mantissa bits to store the remaining bits 12-15 of X.z:
  (asuint(abc.z) & 0xFFFFFFC0) | ((f32tof16(X.z) >> 12) & 0x3F)
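The packing can be tried out on the CPU with a small Python emulation. The helpers `f32tof16`, `f16tof32`, `asuint` and `asfloat` below only mimic the HLSL intrinsics of the same names via the `struct` module (rounding of the half conversion may differ from a given GPU), the order of the four uints is an assumption, and `unpack` is an assumed inverse that the slides do not show.

```python
import struct


def f32tof16(v):
    """Bit pattern of v as a 16-bit half float (emulates the HLSL intrinsic)."""
    return struct.unpack("<H", struct.pack("<e", v))[0]


def f16tof32(bits):
    """Value of a 16-bit half-float bit pattern."""
    return struct.unpack("<e", struct.pack("<H", bits & 0xFFFF))[0]


def asuint(v):
    """Bit pattern of v as a 32-bit float."""
    return struct.unpack("<I", struct.pack("<f", v))[0]


def asfloat(bits):
    """Value of a 32-bit float bit pattern."""
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


KEEP_MASK = 0xFFFFFFC0  # upper 26 bits of a 32-bit float
LOW6 = 0x3F             # the 6 stolen mantissa bits


def pack(X, abc):
    """Pack X = (x, y, z) and abc = (a, b, c) into four 32-bit uints."""
    xz = f32tof16(X[2])
    return (
        f32tof16(X[0]) + (f32tof16(X[1]) << 16),
        (asuint(abc[0]) & KEEP_MASK) | (xz & LOW6),
        (asuint(abc[1]) & KEEP_MASK) | ((xz >> 6) & LOW6),
        (asuint(abc[2]) & KEEP_MASK) | ((xz >> 12) & LOW6),
    )


def unpack(p):
    """Assumed inverse of pack(); returns (X, abc)."""
    xz = (p[1] & LOW6) | ((p[2] & LOW6) << 6) | ((p[3] & LOW6) << 12)
    X = (f16tof32(p[0] & 0xFFFF), f16tof32(p[0] >> 16), f16tof32(xz))
    abc = tuple(asfloat(p[i] & KEEP_MASK) for i in (1, 2, 3))
    return X, abc
```

X comes back exactly at half precision, while each abc value loses its 6 lowest mantissa bits, which is a relative error below 2^-17.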
Sample Screenshot
[Screenshot not reproduced]
Abs(Packed-Unpacked) x 255.0f
[Difference image not reproduced]
DX11 Memory Optimizations 2
- The solver does a horizontal and a vertical pass
- The chain of lower-res RTs needs to be there twice
– Horizontal reduction/substitution chain
– Vertical reduction/substitution chain
- How can DX11 help?
DX11 Memory Optimizations 2
- UAVs allow us to reuse data of the horizontal chain for the vertical chain
- A proof-of-concept implementation shows that this works nicely but impacts the runtime significantly
– ~40% lower fps
- Stayed with RTs as memory was already quite low
- Use only if you are really concerned about memory
Final Results 1600x1200
Solver | Time in ms (HD5870) | Time in ms (GTX480) | Memory in Megabytes
GDC2010 hybrid solver on GTX480 | ~8.5 | 8.00 [Bavoil2010] | ~117 (guesstimate)
Standard solver (already skips high-res abc construction) | 3.66 | 3.33 | ~132
4-to-1 Reduction | 2.87 | 3.32 | ~73
4-to-1 Reduction + SM5 Packing | 2.75 | 3.14 | ~58
Future Work
- Look into CS (compute shader) acceleration of the solver
– 4-to-1 reduction pass
– 1-to-4 substitution pass
- Look into using heat diffusion for other effects
– e.g. motion blur
Conclusion
- The optimized CR solver is fast and memory-efficient
– Used in Dragon Age 2
– 4A Games is considering its use for new projects
– Detailed description in 'Game Engine Gems 2'
- Mail me (holger.gruen@amd.com) if you want access to the sources
References
- [Kass2006] "Interactive depth of field using simulated diffusion on a GPU", Michael Kass, Pixar Animation Studios, Pixar technical memo #06-01
- [ZCO2010] "Fast Tridiagonal Solvers on the GPU", Y. Zhang, J. Cohen, J. D. Owens, PPoPP 2010
- [RS2010] "DX11 Effects in Metro 2033: The Last Refuge", A. Rege, O. Shishkovtsov, GDC 2010
- [Bavoil2010] "Modern Real-Time Rendering Techniques", L. Bavoil, FGO2010
Backup
Results 1920x1200
Solver | Time in ms (HD5870) | Time in ms (GTX480) | Memory in Megabytes
Standard solver (already skips high-res abc construction) | 4.31 | 4.03 | ~158
4-to-1 Reduction | 3.36 | 4.02 | ~88
4-to-1 Reduction + SM5 Packing | 3.23 | 3.79 | ~70