An Optimized Diffusion Depth Of Field Solver (DDOF), Holger Gruen



slide-1
SLIDE 1
slide-2
SLIDE 2

An Optimized Diffusion Depth Of Field Solver (DDOF)

28th February 2011, AMD's Favorite Effects

Holger Gruen – AMD

slide-3
SLIDE 3

Agenda

  • Motivation
  • Recap of a high-level explanation of DDOF
  • Recap of earlier DDOF solvers
  • A Vanilla Cyclic Reduction (CR) DDOF solver
  • A DX11 optimized CR solver for DDOF
  • Results

slide-4
SLIDE 4

Motivation

  • Solver presented at GDC 2010 [RS2010] has some weaknesses
  • Great implementation, but memory requirements and runtime are too high for many game developers
  • Looking for a faster and more memory-efficient solver

slide-5
SLIDE 5

Diffusion DOF recap 1

  • DDOF is an enhanced way of blurring a picture that takes an arbitrary CoC (Circle of Confusion) at each pixel into account
  • Interprets the input image as a heat distribution
  • Uses the CoC at a pixel to derive a per-pixel heat conductivity

slide-6
SLIDE 6

Diffusion DOF recap 2

  • Blurring is done by time stepping a differential equation that models the diffusion of heat
  • ADI (alternating direction implicit) method used to arrive at a separable solution for stepping
  • Need to solve a tri-diagonal linear system for each row and then each column of the input

slide-7
SLIDE 7

DDOF Tri-diagonal system

\[
\begin{pmatrix}
b_1 & c_1 &        &        &        \\
a_2 & b_2 & c_2    &        &        \\
    & a_3 & b_3    & c_3    &        \\
    &     & \ddots & \ddots & \ddots \\
    &     &        & a_n    & b_n
\end{pmatrix}
\begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{pmatrix}
=
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_n \end{pmatrix}
\]

  • x: row/col of input image
  • a, b, c: derived from CoC at each pixel of an input row/col
  • y: resulting blurred row/col
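A minimal CPU-side sketch of one such row system may help make this concrete. It assumes the coefficients take the usual implicit heat-equation form (a_i = -beta_{i-1}, b_i = 1 + beta_{i-1} + beta_i, c_i = -beta_i); the per-pixel-pair conductivity list `beta` and the helper names `build_system` and `thomas` are hypothetical, and how `beta` is derived from the CoC is not shown in the slides. The serial solver here is the standard Thomas ("sweep") algorithm.

```python
def build_system(x, beta):
    """Build a, b, c for one row. beta[i] is an assumed conductivity
    between pixel i and pixel i + 1 (len(beta) == len(x) - 1)."""
    n = len(x)
    a, b, c = [0.0] * n, [0.0] * n, [0.0] * n
    for i in range(n):
        left = beta[i - 1] if i > 0 else 0.0
        right = beta[i] if i < n - 1 else 0.0
        a[i] = -left
        b[i] = 1.0 + left + right
        c[i] = -right
    return a, b, c


def thomas(a, b, c, x):
    """Serial sweep: forward elimination, then back substitution."""
    n = len(b)
    cp, xp = [0.0] * n, [0.0] * n
    cp[0] = c[0] / b[0]
    xp[0] = x[0] / b[0]
    for i in range(1, n):
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m
        xp[i] = (x[i] - a[i] * xp[i - 1]) / m
    y = [0.0] * n
    y[-1] = xp[-1]
    for i in range(n - 2, -1, -1):
        y[i] = xp[i] - cp[i] * y[i + 1]
    return y


# A single bright pixel spreads out; zero conductivity leaves it untouched.
row = [0.0, 0.0, 1.0, 0.0, 0.0]
print(thomas(*build_system(row, [1.0] * 4), row))
```

With this coefficient form every matrix column sums to 1, so the total "heat" of the row is preserved by the blur.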

slide-8
SLIDE 8

Solver recap 1

  • The GDC2010 solver [RS2010] is a 'hybrid' solver
    – Performs three PCR (parallel cyclic reduction) steps upfront
    – Performs the serial 'sweep' algorithm to solve the small resulting systems
    – Check [ZCO2010] for details on other hybrid solvers
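The hybrid idea can be sketched on the CPU as follows. This is not the [RS2010] shader code, only an illustration under assumptions: `pcr_step`, `thomas`, and `hybrid_solve` are hypothetical names, the system is stored as lists with a[0] = 0 and c[n-1] = 0, and each PCR step doubles the stride between coupled unknowns, so three steps leave eight independent interleaved systems for the serial sweep.

```python
def thomas(a, b, c, x):
    """Serial sweep solver for one tri-diagonal system."""
    n = len(b)
    cp, xp = [0.0] * n, [0.0] * n
    cp[0], xp[0] = c[0] / b[0], x[0] / b[0]
    for i in range(1, n):
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m
        xp[i] = (x[i] - a[i] * xp[i - 1]) / m
    y = [0.0] * n
    y[-1] = xp[-1]
    for i in range(n - 2, -1, -1):
        y[i] = xp[i] - cp[i] * y[i + 1]
    return y


def pcr_step(a, b, c, x, stride):
    """One PCR step: every equation i eliminates its neighbours at
    i - stride and i + stride; all i are independent (parallel on a GPU)."""
    n = len(b)
    na, nb, nc, nx = [0.0] * n, list(b), [0.0] * n, list(x)
    for i in range(n):
        lo, hi = i - stride, i + stride
        if lo >= 0:
            k = a[i] / b[lo]
            na[i] = -k * a[lo]
            nb[i] -= k * c[lo]
            nx[i] -= k * x[lo]
        if hi < n:
            k = c[i] / b[hi]
            nc[i] = -k * c[hi]
            nb[i] -= k * a[hi]
            nx[i] -= k * x[hi]
    return na, nb, nc, nx


def hybrid_solve(a, b, c, x, pcr_steps=3):
    stride = 1
    for _ in range(pcr_steps):
        a, b, c, x = pcr_step(a, b, c, x, stride)
        stride *= 2
    n = len(b)
    y = [0.0] * n
    # After the PCR steps, unknowns i, i + stride, i + 2*stride, ... form
    # an independent tri-diagonal system that the sweep solves serially.
    for start in range(min(stride, n)):
        idx = list(range(start, n, stride))
        sub = thomas([a[i] for i in idx], [b[i] for i in idx],
                     [c[i] for i in idx], [x[i] for i in idx])
        for i, v in zip(idx, sub):
            y[i] = v
    return y
```

For a row of width W, three steps leave serial systems of roughly W/8 unknowns each, which is the system size the sweep has to walk through.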

slide-9
SLIDE 9

Solver recap 2

  • The GDC2010 solver [RS2010] has drawbacks
    – It uses a large UAV as a RW scratch-pad to store the modified coefficients of the sweep algorithm
      • GPUs without RW cache will suffer
    – For high resolutions, three PCR steps produce tri-diagonal systems of substantial size
      • This means a serial (sweep) algorithm is run on a 'big' system

slide-10
SLIDE 10

Solver recap 3

  • Cyclic Reduction (CR) solver
    – Used by [Kass2006] in the original DDOF paper
    – Runs in two phases
      1. Reduction phase
      2. Backward substitution phase
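The two phases can be sketched recursively on the CPU. This is a plain textbook-style cyclic reduction under assumptions (lists with a[0] = 0 and c[n-1] = 0, the hypothetical function name `cyclic_reduction`), not the shader implementation from the talk: each reduction halves the system by eliminating the even-indexed unknowns, and each substitution recovers them from their already-solved neighbours.

```python
def cyclic_reduction(a, b, c, x):
    """Solve a tri-diagonal system by reduction and backward substitution."""
    n = len(b)
    if n == 1:
        return [x[0] / b[0]]  # system of size 1: solve directly

    # 1. Reduction phase: keep the odd-indexed equations (0-based) and
    #    eliminate their even-indexed neighbours i - 1 and i + 1.
    ra, rb, rc, rx = [], [], [], []
    for i in range(1, n, 2):
        k1 = a[i] / b[i - 1]
        na, nb, nc, nx = -k1 * a[i - 1], b[i] - k1 * c[i - 1], 0.0, x[i] - k1 * x[i - 1]
        if i + 1 < n:
            k2 = c[i] / b[i + 1]
            nb -= k2 * a[i + 1]
            nc = -k2 * c[i + 1]
            nx -= k2 * x[i + 1]
        ra.append(na)
        rb.append(nb)
        rc.append(nc)
        rx.append(nx)

    y_odd = cyclic_reduction(ra, rb, rc, rx)

    # 2. Backward substitution phase: the even-indexed unknowns follow
    #    from their original equations and the solved odd neighbours.
    y = [0.0] * n
    y[1::2] = y_odd
    for i in range(0, n, 2):
        left = a[i] * y[i - 1] if i > 0 else 0.0
        right = c[i] * y[i + 1] if i + 1 < n else 0.0
        y[i] = (x[i] - left - right) / b[i]
    return y
```

Within one level all eliminations are independent, which is what a GPU pass exploits; the available work halves with every level.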

slide-11
SLIDE 11

Solver recap 4

  • According to [ZCO2010]:
    – CR solver has the lowest computational complexity of all solvers
    – It suffers from a lack of parallelism, though
      • At the end of the reduction phase
      • At the start of the backward substitution phase

slide-12
SLIDE 12

Passes of a Vanilla CR Solver

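[Diagram: input image X; Pass 1 constructs abc from the CoC, giving one tri-diagonal system per row/col]

\[
\begin{pmatrix}
b_1 & c_1 &        &        &        \\
a_2 & b_2 & c_2    &        &        \\
    & a_3 & b_3    & c_3    &        \\
    &     & \ddots & \ddots & \ddots \\
    &     &        & a_n    & b_n
\end{pmatrix}
\begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{pmatrix}
=
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_n \end{pmatrix}
\]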

slide-13
SLIDE 13

Passes of a Vanilla CR Solver

[Diagram: input image X → Pass 1: construct abc from CoC → reduce → reduce → … → stop at size 1 and solve for the first y → substitute → substitute → … → blurred image Y]

slide-14
SLIDE 14

Vanilla Solver Results

  • Higher performance than reported in [Bavoil2010] (~6 ms vs. ~8 ms at 1600x1200)
  • Memory footprint prohibitively high
    – >200 MB at 1600x1200
  • Need an answer to the lack-of-parallelism problem; the answer is given in [ZCO2010]

slide-15
SLIDE 15

Vanilla CR Solver

[Diagram: input image X → Pass 1: construct abc from CoC → reduce → reduce → … → stop at size 1 and solve for the first y → substitute → substitute → … → blurred image Y]

Annotation on the diagram: this is what kills parallelism

slide-16
SLIDE 16

Keeping the parallelism high

[Diagram: input image X → Pass 1: construct abc from CoC → reduce → reduce → … → stop at a reasonable size → substitute → substitute → … → blurred image Y]

Stop at a reasonable size and solve for Y at that resolution to have a big enough parallel workload (e.g. using PCR, see [ZCO2010])
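The early-stop scheme can be sketched iteratively, with one list entry per resolution level standing in for the chain of surfaces. Everything here is an assumption for illustration: the names `reduce_level`, `substitute_level`, and `solve_with_early_stop`, the `stop_size` parameter (must be at least 1), and in particular the serial Thomas solver, which is only a CPU stand-in for the parallel solve (e.g. PCR) suggested on the slide.

```python
def thomas(a, b, c, x):
    """Stand-in direct solver for the small system at the stop level."""
    n = len(b)
    cp, xp = [0.0] * n, [0.0] * n
    cp[0], xp[0] = c[0] / b[0], x[0] / b[0]
    for i in range(1, n):
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m
        xp[i] = (x[i] - a[i] * xp[i - 1]) / m
    y = [0.0] * n
    y[-1] = xp[-1]
    for i in range(n - 2, -1, -1):
        y[i] = xp[i] - cp[i] * y[i + 1]
    return y


def reduce_level(a, b, c, x):
    """2-to-1 reduction: keep odd-indexed equations, eliminate neighbours."""
    n = len(b)
    ra, rb, rc, rx = [], [], [], []
    for i in range(1, n, 2):
        k1 = a[i] / b[i - 1]
        k2 = c[i] / b[i + 1] if i + 1 < n else 0.0
        ra.append(-k1 * a[i - 1])
        rb.append(b[i] - k1 * c[i - 1] - (k2 * a[i + 1] if i + 1 < n else 0.0))
        rc.append(-k2 * c[i + 1] if i + 1 < n else 0.0)
        rx.append(x[i] - k1 * x[i - 1] - (k2 * x[i + 1] if i + 1 < n else 0.0))
    return ra, rb, rc, rx


def substitute_level(a, b, c, x, y_odd):
    """1-to-2 substitution: recover even-indexed unknowns at this level."""
    n = len(b)
    y = [0.0] * n
    y[1::2] = y_odd
    for i in range(0, n, 2):
        left = a[i] * y[i - 1] if i > 0 else 0.0
        right = c[i] * y[i + 1] if i + 1 < n else 0.0
        y[i] = (x[i] - left - right) / b[i]
    return y


def solve_with_early_stop(a, b, c, x, stop_size=8):
    levels = [(a, b, c, x)]
    while len(levels[-1][1]) > stop_size:      # reduce, reduce, ...
        levels.append(reduce_level(*levels[-1]))
    y = thomas(*levels[-1])                    # solve at the stop level
    for level in reversed(levels[:-1]):        # substitute, substitute, ...
        y = substitute_level(*level, y)
    return y
```

Choosing `stop_size` trades the number of reduce/substitute passes against the size of the system handed to the final solver.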

slide-17
SLIDE 17

Memory Optimizations 1

[Diagram: input image X → Pass 1: construct abc from CoC → reduce → reduce → … → stop at a reasonable size and solve for Y at that resolution → substitute → substitute → … → blurred image Y]

slide-18
SLIDE 18

Memory Optimizations 1

[Diagram: the same reduce/solve/substitute pipeline annotated with surface formats. The X surfaces, the abc surfaces, and the output surfaces are all rgba32f]

slide-19
SLIDE 19

Memory Optimizations 1

[Diagram: the same pipeline annotated with surface formats. The X surfaces and the output surfaces are now rgba16f; the abc surfaces stay rgba32f]

This saves a significant amount of memory. We found no artifacts when going from rgba32f to rgba16f.
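As a rough illustration of the saving, the size of a single full-resolution surface can be computed directly. This sketch assumes tightly packed surfaces with no padding or alignment overhead (16 bytes per rgba32f texel, 8 bytes per rgba16f texel); `surface_bytes` is a hypothetical helper, and the solver's total footprint depends on the whole chain of surfaces, which is not reproduced here.

```python
RGBA32F_BYTES = 16  # 4 channels x 32-bit float
RGBA16F_BYTES = 8   # 4 channels x 16-bit float


def surface_bytes(width, height, bytes_per_texel):
    """Size of one tightly packed surface (no padding assumed)."""
    return width * height * bytes_per_texel


w, h = 1600, 1200
full32 = surface_bytes(w, h, RGBA32F_BYTES)
full16 = surface_bytes(w, h, RGBA16F_BYTES)
print(f"rgba32f: {full32 / 2**20:.1f} MiB, rgba16f: {full16 / 2**20:.1f} MiB")
```

At 1600x1200 this comes to about 29.3 MiB per rgba32f surface and about 14.6 MiB per rgba16f surface, so every surface that can drop to half precision halves its cost.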

slide-20
SLIDE 20

Memory Optimizations 2

[Diagram: the same pipeline annotated with surface formats. The X surfaces and the output surfaces are rgba16f; the abc surfaces are rgba32f]

This again saves a significant amount of memory, as this is the biggest surface used by the solver.

slide-21
SLIDE 21

Memory Optimizations 2

[Diagram: the same pipeline without a format on the full-resolution abc surface. The X surfaces and the output surfaces are rgba16f; the remaining abc surfaces are rgba32f]

Skip the abc construction pass and compute abc on-the-fly during the first reduction pass

slide-22
SLIDE 22

Intermediate Results 1600x1200

| Solver | Time in ms (HD5870) | Time in ms (GTX480) | Memory in Megabytes |
| GDC2010 hybrid solver on GTX480 | ~8.5 | 8.00 [Bavoil2010] | ~117 (guesstimate) |
| Standard Solver (already skips high res abc construction) | 3.66 | 3.33 | ~132 |

slide-23
SLIDE 23

Memory Optimizations 3

[Diagram: the same pipeline without a full-resolution abc surface. The X surfaces and the output surfaces are rgba16f; the remaining abc surfaces are rgba32f]

Skip the abc construction pass; compute abc during the first reduction pass

Yet again this saves a significant amount of memory!

slide-24
SLIDE 24

Memory Optimizations 3

[Diagram: rgba16f X → reduce4 (abc computed on the fly) → … → stop at a reasonable size and solve for Y at that resolution → substitute → substitute → … → substitute4 → rgba16f output]

  • Skip the abc construction pass; compute abc during the first reduction pass
  • Reduce 4-to-1 in a special first reduction pass
  • Substitute 1-to-4 in a special substitution pass
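One way to see why a 4-to-1 pass is valid: composing two ordinary 2-to-1 reductions yields a quarter-size system over every fourth unknown, so a single pass that performs both eliminations never has to write the half-resolution level to memory. The sketch below only demonstrates that equivalence on the CPU; `reduce_once`, `reduce4`, and `thomas` are hypothetical names, and the actual shader pass (which also computes abc on the fly) is not shown in the slides.

```python
def thomas(a, b, c, x):
    """Serial reference solver used to check the reduced system."""
    n = len(b)
    cp, xp = [0.0] * n, [0.0] * n
    cp[0], xp[0] = c[0] / b[0], x[0] / b[0]
    for i in range(1, n):
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m
        xp[i] = (x[i] - a[i] * xp[i - 1]) / m
    y = [0.0] * n
    y[-1] = xp[-1]
    for i in range(n - 2, -1, -1):
        y[i] = xp[i] - cp[i] * y[i + 1]
    return y


def reduce_once(a, b, c, x):
    """2-to-1 reduction keeping the odd-indexed equations (0-based)."""
    n = len(b)
    ra, rb, rc, rx = [], [], [], []
    for i in range(1, n, 2):
        k1 = a[i] / b[i - 1]
        k2 = c[i] / b[i + 1] if i + 1 < n else 0.0
        ra.append(-k1 * a[i - 1])
        rb.append(b[i] - k1 * c[i - 1] - (k2 * a[i + 1] if i + 1 < n else 0.0))
        rc.append(-k2 * c[i + 1] if i + 1 < n else 0.0)
        rx.append(x[i] - k1 * x[i - 1] - (k2 * x[i + 1] if i + 1 < n else 0.0))
    return ra, rb, rc, rx


def reduce4(a, b, c, x):
    """4-to-1 reduction as two fused 2-to-1 steps; a shader would keep the
    intermediate half-size system in registers instead of a surface."""
    return reduce_once(*reduce_once(a, b, c, x))


n = 16
a = [0.0] + [-1.0] * (n - 1)
c = [-1.0] * (n - 1) + [0.0]
b = [4.0] * n
x = [float(i % 3) for i in range(n)]
quarter = thomas(*reduce4(a, b, c, x))   # 4 unknowns
full = thomas(a, b, c, x)                # 16 unknowns
print(quarter, full[3::4])               # the two should agree
```

The quarter-size solution corresponds to unknowns 3, 7, 11, ... of the original row; the matching 1-to-4 substitution would recover the three unknowns in between.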

slide-25
SLIDE 25

Intermediate Results 1600x1200

| Solver | Time in ms (HD5870) | Time in ms (GTX480) | Memory in Megabytes |
| GDC2010 hybrid solver on GTX480 | ~8.5 | 8.00 [Bavoil2010] | ~117 (guesstimate) |
| Standard Solver (already skips high res abc construction) | 3.66 | 3.33 | ~132 |
| 4-to-1 Reduction | 2.87 | 3.32 | ~73 |

slide-26
SLIDE 26

DX11 Memory Optimizations 1

[Diagram: rgba16f X → reduce4 (abc computed on the fly) → … → stop at a reasonable size and solve for Y at that resolution → substitute → substitute → … → substitute4 → rgba16f output]

  • Skip the abc construction pass; compute abc during the first reduction pass
  • Reduce 4-to-1 in a special first reduction pass
  • Substitute 1-to-4 in a special substitution pass

slide-27
SLIDE 27

DX11 Memory Optimizations 1

[Diagram: rgba16f X → reduce4 (abc computed on the fly) → … → stop at a reasonable size and solve for Y at that resolution → substitute → substitute → … → substitute4 → rgba16f output]

  • Skip the abc construction pass; compute abc during the first reduction pass
  • Reduce 4-to-1 in a special first reduction pass
  • Substitute 1-to-4 in a special substitution pass
  • Pack abc and X into one rgba_uint surface
slide-28
SLIDE 28

Using SM5 for data packing

[Diagram: the rgba16f X texel and the rgba32f abc texel are packed into four uints]

One uint packs the x and y channels of X:

(f32tof16(X.x) + (f32tof16(X.y) << 16))

slide-29
SLIDE 29

Using SM5 for data packing

[Diagram: the rgba16f X texel and the rgba32f abc texel are packed into four uints]

One uint keeps the upper 26 bits of the abc x channel:

((asuint(abc.x) & 0xFFFFFFC0) | (f32tof16(X.z) & 0x3F))

Steal the 6 lowest mantissa bits of abc.x to store some bits of X.z

slide-30
SLIDE 30

Using SM5 for data packing

[Diagram: the rgba16f X texel and the rgba32f abc texel are packed into four uints]

One uint keeps the upper 26 bits of the abc y channel:

((asuint(abc.y) & 0xFFFFFFC0) | ((f32tof16(X.z) >> 6) & 0x3F))

Steal the 6 lowest mantissa bits of abc.y to store some bits of X.z

slide-31
SLIDE 31

SM5 Memory Optimizations 1

[Diagram: the rgba16f X texel and the rgba32f abc texel are packed into four uints]

One uint keeps the upper 26 bits of the abc z channel:

((asuint(abc.z) & 0xFFFFFFC0) | ((f32tof16(X.z) >> 12) & 0x3F))

Steal the 6 lowest mantissa bits of abc.z to store some bits of X.z
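The packing expressions above can be emulated on the CPU to check that they round-trip. This sketch makes several assumptions: the Python helpers `f32tof16`, `f16tof32`, `asuint`, and `asfloat` only mimic the HLSL intrinsics of the same names via the `struct` module (rounding may differ from GPU hardware), and the `unpack` side is a guess at the inverse, since the slides only show the packing direction.

```python
import struct


def f32tof16(v):
    """Half-float bit pattern of v in the low 16 bits (emulation)."""
    return struct.unpack("<H", struct.pack("<e", v))[0]


def f16tof32(h):
    return struct.unpack("<e", struct.pack("<H", h & 0xFFFF))[0]


def asuint(v):
    """Bit pattern of a 32-bit float as an unsigned int (emulation)."""
    return struct.unpack("<I", struct.pack("<f", v))[0]


def asfloat(u):
    return struct.unpack("<f", struct.pack("<I", u & 0xFFFFFFFF))[0]


KEEP_UPPER_26 = 0xFFFFFFC0  # clears the 6 lowest mantissa bits


def pack(X, abc):
    """Pack X (3 half floats) and abc (3 floats) into four uints."""
    xz = f32tof16(X[2])
    return (
        f32tof16(X[0]) + (f32tof16(X[1]) << 16),
        (asuint(abc[0]) & KEEP_UPPER_26) | (xz & 0x3F),
        (asuint(abc[1]) & KEEP_UPPER_26) | ((xz >> 6) & 0x3F),
        (asuint(abc[2]) & KEEP_UPPER_26) | ((xz >> 12) & 0x3F),
    )


def unpack(u):
    """Assumed inverse: reassemble X.z from the three 6-bit pieces."""
    xz = (u[1] & 0x3F) | ((u[2] & 0x3F) << 6) | ((u[3] & 0x3F) << 12)
    X = (f16tof32(u[0] & 0xFFFF), f16tof32(u[0] >> 16), f16tof32(xz))
    abc = tuple(asfloat(v & KEEP_UPPER_26) for v in u[1:])
    return X, abc


X_in, abc_in = (0.25, 0.5, 0.75), (-0.3, 1.6, -0.3)
X_out, abc_out = unpack(pack(X_in, abc_in))
print(X_out, abc_out)
```

The three 6-bit slots offer 18 bits, more than the 16 bits of X.z, and dropping 6 of the 23 mantissa bits of each coefficient leaves a relative error below 2^-17.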

slide-32
SLIDE 32

Sample Screenshot


slide-33
SLIDE 33

Abs(Packed-Unpacked) x 255.0f


slide-34
SLIDE 34

DX11 Memory Optimizations 2

  • Solver does a horizontal and a vertical pass
  • Chain of lower-res RTs (render targets) needs to be there twice
    – Horizontal reduction/substitution chain
    – Vertical reduction/substitution chain
  • How can DX11 help?

slide-35
SLIDE 35

DX11 Memory Optimizations 2

  • UAVs allow us to reuse data of the horizontal chain for the vertical chain
  • A proof-of-concept implementation shows that this works nicely but impacts the runtime significantly
    – ~40% lower fps
  • Stayed with RTs as memory was already quite low
  • Use only if you are really concerned about memory

slide-36
SLIDE 36

Final Results 1600x1200

| Solver | Time in ms (HD5870) | Time in ms (GTX480) | Memory in Megabytes |
| GDC2010 hybrid solver on GTX480 | ~8.5 | 8.00 [Bavoil2010] | ~117 (guesstimate) |
| Standard Solver (already skips high res abc construction) | 3.66 | 3.33 | ~132 |
| 4-to-1 Reduction | 2.87 | 3.32 | ~73 |
| 4-to-1 Reduction + SM5 Packing | 2.75 | 3.14 | ~58 |

slide-37
SLIDE 37

Future Work

  • Look into CS (compute shader) acceleration of the solver
    – 4-to-1 reduction pass
    – 1-to-4 substitution pass
  • Look into using heat diffusion for other effects
    – e.g. motion blur

slide-38
SLIDE 38

Conclusion

  • Optimized CR solver is fast and memory-efficient
    – Used in Dragon Age 2
    – 4A Games considering its use for new projects
    – Detailed description in 'Game Engine Gems 2'
  • Mail me (holger.gruen@amd.com) if you want access to the sources

slide-39
SLIDE 39

References

  • [Kass2006] "Interactive depth of field using simulated diffusion on a GPU", Michael Kass, Pixar Animation Studios, Pixar technical memo #06-01
  • [ZCO2010] "Fast Tridiagonal Solvers on the GPU", Y. Zhang, J. Cohen, J. D. Owens, PPoPP 2010
  • [RS2010] "DX11 Effects in Metro 2033: The Last Refuge", A. Rege, O. Shishkovtsov, GDC 2010
  • [Bavoil2010] "Modern Real-Time Rendering Techniques", L. Bavoil, FGO2010

slide-40
SLIDE 40

Backup


slide-41
SLIDE 41

Results 1920x1200

| Solver | Time in ms (HD5870) | Time in ms (GTX480) | Memory in Megabytes |
| Standard Solver (already skips high res abc construction) | 4.31 | 4.03 | ~158 |
| 4-to-1 Reduction | 3.36 | 4.02 | ~88 |
| 4-to-1 Reduction + SM5 Packing | 3.23 | 3.79 | ~70 |