Non-Gaussian posterior approximation
◮ Various methods make a Gaussian approximation, $p(f \mid y) \approx q(f) = \mathcal{N}(f \mid \mu = ?, C = ?)$.
◮ We only need to obtain an approximate posterior at the training locations.
◮ At test locations, the data only affects their probability via the posterior at the training locations.
$$p(f, f^* \mid x^*, x, y) = p(f^* \mid f, x^*)\, p(f \mid x, y)$$
Why do we want the posterior anyway?
The true posterior, a posterior approximation, or samples are needed to make predictions at new locations, $x^*$.
$$p(f^* \mid x^*, x, y) = \int p(f^* \mid f, x^*)\, p(f \mid y, x)\, df$$
$$q(f^* \mid x^*, x, y) = \int p(f^* \mid f, x^*)\, q(f \mid x)\, df$$
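When $q(f) = \mathcal{N}(\mu, C)$ and the prior is a zero-mean Gaussian process, the second integral above has a closed form, because $p(f^* \mid f, x^*)$ is Gaussian and linear in $f$. The sketch below is one way to evaluate it. The RBF kernel, the inputs, and the values of `mu` and `C` are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def rbf_kernel(a, b, variance=1.0, lengthscale=1.0):
    """Squared-exponential covariance between 1-D input arrays (assumed kernel)."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dist / lengthscale**2)

def predict_from_q(x, x_star, mu, C, jitter=1e-8):
    """Mean and covariance of q(f*) = int p(f* | f) q(f) df for q(f) = N(mu, C)."""
    K = rbf_kernel(x, x) + jitter * np.eye(len(x))
    K_s = rbf_kernel(x, x_star)            # cross-covariance, shape (n, n*)
    K_ss = rbf_kernel(x_star, x_star)
    A = np.linalg.solve(K, K_s).T          # A = K_s^T K^{-1}
    mean = A @ mu
    # Prior conditional covariance plus the uncertainty propagated from q(f).
    cov = K_ss - A @ K_s + A @ C @ A.T
    return mean, cov

# Hypothetical training inputs and Gaussian approximation at those inputs.
x = np.array([-1.0, 0.0, 1.0])
mu = np.array([0.5, 1.0, 0.5])
C = 0.1 * np.eye(3)

# One test point on a training input, one further away.
x_star = np.array([0.0, 3.0])
mean, cov = predict_from_q(x, x_star, mu, C)
print(mean, np.diag(cov))
```

At a test input that coincides with a training input, the prediction recovers the corresponding entries of `mu` and `C`; further from the data the variance moves back towards the prior variance.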
Outline
◮ Motivation
◮ Non-Gaussian posteriors
◮ Approximate methods
◮ Laplace approximation
◮ Variational Bayes
◮ Expectation propagation
◮ Comparisons
Methods overview
Given the choice of a Gaussian approximation to the posterior, how do we choose the parameter values $\mu$ and $C$?
There are a number of different methods for setting the parameters of our Gaussian approximation.
Parameters' effect - mean
Parameters' effect - variance
How to choose the parameters?
Two approaches that we might take:
◮ Match the mean and variance at some point, for example the mode.
◮ Attempt to minimise some divergence measure between the approximate distribution and the true distribution.
◮ Laplace takes the former.
◮ Variational Bayes takes the latter.
◮ EP kind of takes the latter.
Laplace approximation
Task: for some generic random variable, $f$, and data, $y$, find a good approximation to the difficult-to-compute posterior distribution, $p(f \mid y)$.
Laplace approach: fit a Gaussian by matching the curvature at the modal point of the posterior.
◮ Use a second-order Taylor expansion around the mode of the log-posterior.
◮ Use the expansion to find an equivalent Gaussian in the probability space.
Laplace approximation
◮ The log of a Gaussian distribution, $q(f) = \mathcal{N}(f \mid \mu, C)$, is a quadratic function of $f$.
◮ A second-order Taylor expansion approximates a function using terms up to quadratic order.
◮ The Laplace approximation expands the log of the un-normalised posterior, and then uses the expansion to set the linear and quadratic terms of $\log q(f)$.
◮ The first and second derivatives of the log-posterior at the mode match the derivatives of the log of the approximate Gaussian at this same point.
Second-order Taylor expansion
$$p(f \mid y) = \frac{1}{Z} h(f)$$
In our case: $h(f) = p(y \mid f)\, p(f)$.
[Figure: $h(f)$ plotted against $f$.]
Taking logs:
$$\log p(f \mid y) = \log \frac{1}{Z} + \log h(f)$$
[Figure: $\log h(f)$ plotted against $f$.]
Second-order Taylor expansion
Expanding $\log h(f)$ around a point $a$:
$$\log p(f \mid y) = \log \frac{1}{Z} + \log h(f) \approx \log \frac{1}{Z} + \log h(a) + \frac{d \log h(a)}{d a}(f - a) + \frac{1}{2}(f - a)^\top \frac{d^2 \log h(a)}{d a^2}(f - a) + \cdots$$
We want to make the expansion around the mode, $\hat{f}$:
$$\left. \frac{d \log h(a)}{d a} \right|_{a = \hat{f}} = 0$$
Second-order Taylor expansion
$$\log p(f \mid y) \approx \log \frac{1}{Z} + \log h(\hat{f}) + \frac{d \log h(\hat{f})}{d \hat{f}}(f - \hat{f}) + \frac{1}{2}(f - \hat{f})^\top \frac{d^2 \log h(\hat{f})}{d \hat{f}^2}(f - \hat{f})$$
The first-order term is zero at the mode, so only the constant and quadratic terms remain.
[Figure: $\log h(f)$ plotted against $f$ with the mode $\hat{f}$ marked, overlaid in turn with the Taylor expansions of order 1, 2 and 3 at $\hat{f}$.]
Second-order Taylor expansion
Exponentiating the second-order expansion:
$$p(f \mid y) \approx \frac{1}{Z} h(\hat{f}) \exp\left( -\frac{1}{2}(f - \hat{f})^\top \left( -\frac{d^2 \log h(\hat{f})}{d \hat{f}^2} \right) (f - \hat{f}) \right)$$
Normalising this expression gives a Gaussian:
$$p(f \mid y) \approx \mathcal{N}\left( f \,\middle|\, \hat{f},\ \left( -\frac{d^2 \log h(\hat{f})}{d \hat{f}^2} \right)^{-1} \right)$$
[Figure: $h(f)$ plotted against $f$ with the mode $\hat{f}$ marked, overlaid with the exponentiated second-order expansion, i.e. the Gaussian centred at $\hat{f}$.]
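The derivation can be checked numerically in one dimension. The sketch below assumes a Poisson likelihood with an exponential link and a zero-mean Gaussian prior, so that $\log h(f) = y f - e^{f} - f^2 / (2\sigma^2) + \text{const}$. The likelihood, the observation `y = 4` and the prior variance are assumptions chosen for illustration. The mode is found with Newton's method and the variance is set to the inverse negative curvature there.

```python
import numpy as np

y, prior_var = 4, 1.0   # assumed observation and prior variance

def grad_log_h(f):
    """First derivative of log h(f) = y*f - exp(f) - f^2 / (2*prior_var) + const."""
    return y - np.exp(f) - f / prior_var

def hess_log_h(f):
    """Second derivative of log h(f); always negative, so log h is concave."""
    return -np.exp(f) - 1.0 / prior_var

# Newton's method for the mode: f <- f - grad / hess.
f_hat = 0.0
for _ in range(50):
    step = grad_log_h(f_hat) / hess_log_h(f_hat)
    f_hat -= step
    if abs(step) < 1e-12:
        break

# Laplace variance: inverse of the negative second derivative at the mode.
laplace_var = -1.0 / hess_log_h(f_hat)
print(f"mode = {f_hat:.4f}, variance = {laplace_var:.4f}")
```

Because this particular $\log h$ is concave, the Newton iteration has a single stationary point to find; a multi-modal posterior would need more care over the starting point.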
Laplace approximation for Gaussian processes
In our case, $h(f) = p(y \mid f)\, p(f)$, so we need to evaluate
$$-\frac{d^2 \log h(\hat{f})}{d \hat{f}^2} = -\frac{d^2 \left( \log p(y \mid \hat{f}) + \log p(\hat{f}) \right)}{d \hat{f}^2} = -\frac{d^2 \log p(y \mid \hat{f})}{d \hat{f}^2} + K^{-1} \triangleq W + K^{-1}$$
giving a posterior approximation:
$$p(f \mid y) \approx q(f) = \mathcal{N}\left( f \,\middle|\, \hat{f},\ (W + K^{-1})^{-1} \right)$$
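The $K^{-1}$ term comes from the Gaussian process prior. As a short worked step, assuming a zero-mean prior $p(f) = \mathcal{N}(f \mid 0, K)$ (the zero mean is an assumption here; a non-zero mean would not change the second derivative):

```latex
\begin{align}
\log p(f) &= -\tfrac{1}{2} f^\top K^{-1} f - \tfrac{1}{2} \log \lvert 2 \pi K \rvert, \\
\frac{d \log p(f)}{d f} &= -K^{-1} f, \\
-\frac{d^2 \log p(f)}{d f^2} &= K^{-1}.
\end{align}
```

The prior therefore contributes the same curvature everywhere, and only the likelihood term $W$ depends on where the expansion is made.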
Laplace approximation - algorithm overview
◮ Find the mode, $\hat{f}$, of the true log posterior, via Newton's method.
◮ Use a second-order Taylor expansion around this modal value.
◮ Form a Gaussian approximation, setting the mean equal to the posterior mode, $\hat{f}$, and matching the curvature.
◮ $p(f \mid y) \approx q(f \mid \mu, C) = \mathcal{N}\left( f \mid \hat{f}, (K^{-1} + W)^{-1} \right)$
◮ $W \triangleq -\frac{d^2 \log p(y \mid \hat{f})}{d \hat{f}^2}$.
◮ For factorizing likelihoods (most), $W$ is diagonal.
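The steps above can be sketched for a small classification problem. This is a minimal illustration, assuming a Bernoulli likelihood with a logistic link, an RBF kernel and toy data, none of which are specified in the slides. It uses the identity $(K^{-1} + W)^{-1} = K (I + W K)^{-1}$ to avoid inverting $K$ directly; practical implementations often use a Cholesky-based variant for better numerical stability.

```python
import numpy as np

def rbf_kernel(a, b, variance=1.0, lengthscale=1.0):
    """Squared-exponential covariance between 1-D input arrays (assumed kernel)."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dist / lengthscale**2)

def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))

def laplace_gp_bernoulli(K, y, max_iter=100, tol=1e-10):
    """Newton search for the mode f_hat, then C = (K^{-1} + W)^{-1}."""
    n = len(y)
    f = np.zeros(n)
    for _ in range(max_iter):
        pi = sigmoid(f)
        grad = y - pi                       # d log p(y | f) / df
        W = pi * (1.0 - pi)                 # diagonal of -d^2 log p(y | f) / df^2
        B = np.eye(n) + W[:, None] * K      # I + W K
        # Newton update: f_new = (K^{-1} + W)^{-1} (W f + grad)
        f_new = K @ np.linalg.solve(B, W * f + grad)
        converged = np.max(np.abs(f_new - f)) < tol
        f = f_new
        if converged:
            break
    pi = sigmoid(f)
    W = pi * (1.0 - pi)
    C = K @ np.linalg.solve(np.eye(n) + W[:, None] * K, np.eye(n))
    return f, C

# Toy data: labels switch from 0 to 1 as x crosses zero.
x = np.linspace(-3.0, 3.0, 8)
y = (x > 0).astype(float)
K = rbf_kernel(x, x)

f_hat, C = laplace_gp_bernoulli(K, y)
print(np.round(f_hat, 3))
print(np.round(np.diag(C), 3))
```

Since the likelihood factorizes over data points, `W` is stored as a vector holding the diagonal, and `W[:, None] * K` forms the product $WK$ without building a full diagonal matrix.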
Visualization of Laplace
[Figure, built up over several frames: the log prior $\log p(f)$, the log likelihood for the observation $y = 4$, and the log posterior $\log p(f \mid y = 4)$ plotted against $f_1$, with the mode $\hat{f}$ marked and the curvature evaluated there.]
Visualization of Laplace
[Figure: the prior $p(f)$, the likelihood for $y = 4$, the posterior $p(f \mid y = 4)$ and the Laplace approximation $q(f)$ plotted against $f_1$, with the mode $\hat{f}$ marked.]
Visualization of Laplace - Bernoulli
[Figure: the prior $p(f)$, the Bernoulli likelihood for $y = 1$, the posterior $p(f \mid y = 1)$ and the Laplace approximation $q(f)$ plotted against $f_1$, with the mode $\hat{f}$ marked.]
Variational Bayes (VB)
Task: for some generic random variable, $z$, and data, $y$, find a good approximation to the difficult-to-compute posterior distribution, $p(z \mid y)$.
VB approach: minimise a divergence measure between an approximate posterior, $q(z)$, and the true posterior, $p(z \mid y)$.
◮ KL divergence, $\mathrm{KL}\left( q(z) \,\|\, p(z \mid y) \right)$.
◮ Minimise this with respect to the parameters of $q(z)$.
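A one-dimensional sketch of this idea follows. It assumes a Poisson likelihood with an exponential link, an observation `y = 4` and a standard normal prior, all chosen for illustration. The KL divergence is evaluated by quadrature on a grid, and the minimisation is a plain grid search over the mean and variance of $q(z)$. The grid search is a stand-in for the optimisation step, not how VB is usually implemented, and normalising the posterior numerically is only feasible because the example is one-dimensional.

```python
import numpy as np

y, prior_var = 4, 1.0                      # assumed observation and prior variance
z = np.linspace(-6.0, 6.0, 4001)           # quadrature grid
dz = z[1] - z[0]

# Unnormalised log posterior (additive constants dropped), then normalised numerically.
log_h = y * z - np.exp(z) - 0.5 * z**2 / prior_var
log_Z = np.log(np.sum(np.exp(log_h)) * dz)
log_p = log_h - log_Z

def kl_q_p(mu, var):
    """Quadrature estimate of KL(q || p) for q = N(mu, var)."""
    log_q = -0.5 * np.log(2 * np.pi * var) - 0.5 * (z - mu) ** 2 / var
    return np.sum(np.exp(log_q) * (log_q - log_p)) * dz

# Grid search over the parameters of q (illustrative ranges).
mus = np.linspace(0.0, 2.0, 101)
variances = np.linspace(0.05, 1.0, 96)
kl = np.array([[kl_q_p(m, v) for v in variances] for m in mus])

i, j = np.unravel_index(np.argmin(kl), kl.shape)
mu_vb, var_vb = mus[i], variances[j]
print(f"mu = {mu_vb:.2f}, var = {var_vb:.2f}, KL = {kl[i, j]:.4f}")
```

The minimising parameters need not coincide with those of a mode-matching Gaussian fitted to the same target, since the two methods use different criteria.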
KL divergence
◮ Defined for any two distributions $q(x)$ and $p(x)$.
◮ $\mathrm{KL}\left( q(x) \,\|\, p(x) \right)$ is the average additional amount of information lost when $p(x)$ is used to approximate $q(x)$. It is a measure of the divergence of one distribution from another.
◮ $\mathrm{KL}\left( q(x) \,\|\, p(x) \right) = \left\langle \log \frac{q(x)}{p(x)} \right\rangle_{q(x)}$
◮ Always zero or positive; not symmetric.
◮ Let's look at how it changes in response to changes in the approximating distribution.
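The definition can be evaluated directly by numerical integration. The sketch below approximates the expectation under $q$ with a sum on an evenly spaced grid and uses two Gaussians (an illustrative choice) to show the non-negativity and asymmetry noted above. The helper names are hypothetical.

```python
import numpy as np

def kl_quadrature(log_q, log_p, x):
    """Approximate KL(q || p) = <log q(x) - log p(x)>_q on an evenly spaced grid x."""
    dx = x[1] - x[0]
    lq, lp = log_q(x), log_p(x)
    return np.sum(np.exp(lq) * (lq - lp)) * dx

def gauss_logpdf(mu, var):
    """Return a function that evaluates log N(x | mu, var)."""
    return lambda x: -0.5 * np.log(2 * np.pi * var) - 0.5 * (x - mu) ** 2 / var

x = np.linspace(-20.0, 20.0, 40001)
narrow = gauss_logpdf(0.0, 1.0)
wide = gauss_logpdf(0.0, 2.0)

kl_same = kl_quadrature(narrow, narrow, x)          # identical distributions
kl_wide_narrow = kl_quadrature(wide, narrow, x)     # q wide, p narrow
kl_narrow_wide = kl_quadrature(narrow, wide, x)     # q narrow, p wide
print(kl_same, kl_wide_narrow, kl_narrow_wide)
```

Swapping the roles of the two distributions changes the value, which is why the direction of the divergence matters when choosing an approximation.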
KL varying mean
[Figure, five frames: the top panel shows the densities $q(x) = \mathcal{N}(\mu, \sigma^2 = 1.0)$ and $p(x) = \mathcal{N}(0.0, 1.0)$ against $x$ for $\mu = -2.0, -1.0, 0.0, 1.0, 2.0$; the bottom panel shows $\mathrm{KL}[q(x) \,\|\, p(x)]$ against $\mu$.]
0.8 0.8 0.8 0.8 q(x) ~ N ( µ =0 . 0 ,σ 2 =1 . 575) q(x) ~ N ( µ =0 . 0 ,σ 2 =0 . 725) q(x) ~ N ( µ =0 . 0 ,σ 2 =1 . 15) q(x) ~ N ( µ =0 . 0 ,σ 2 =2 . 0) 0.7 0.7 0.7 0.7 p(x) ~ N (0 . 0 , 1 . 0) p(x) ~ N (0 . 0 , 1 . 0) p(x) ~ N (0 . 0 , 1 . 0) p(x) ~ N (0 . 0 , 1 . 0) 0.6 0.6 0.6 0.6 0.5 0.5 0.5 0.5 pdf pdf pdf pdf 0.4 0.4 0.4 0.4 0.3 0.3 0.3 0.3 0.2 0.2 0.2 0.2 0.1 0.1 0.1 0.1 0.0 0.0 0.0 0.0 6 6 6 6 4 4 4 4 2 2 2 2 0 0 0 0 2 2 2 2 4 4 4 4 6 6 6 6 x x x x 0.5 0.5 0.5 0.5 0.4 0.4 0.4 0.4 KL [ q ( x ) || p ( x )] KL [ q ( x ) || p ( x )] KL [ q ( x ) || p ( x )] KL [ q ( x ) || p ( x )] 0.3 0.3 0.3 0.3 0.2 0.2 0.2 0.2 0.1 0.1 0.1 0.1 0.0 0.0 0.0 0.0 0.4 0.4 0.4 0.4 0.6 0.6 0.6 0.6 0.8 0.8 0.8 0.8 1.0 1.0 1.0 1.0 1.2 1.2 1.2 1.2 1.4 1.4 1.4 1.4 1.6 1.6 1.6 1.6 1.8 1.8 1.8 1.8 2.0 2.0 2.0 2.0 σ 2 σ 2 σ 2 σ 2 KL varying variance 0.8 q(x) ~ N ( µ =0 . 0 ,σ 2 =0 . 3) 0.7 p(x) ~ N (0 . 0 , 1 . 0) 0.6 0.5 pdf 0.4 0.3 0.2 0.1 0.0 6 4 2 0 2 4 6 x 0.5 0.4 KL [ q ( x ) || p ( x )] 0.3 0.2 0.1 0.0 0.4 0.6 0.8 1.0 1.2 1.4 1.6 1.8 2.0 σ 2
KL varying variance
[Figure, one frame per variance: top panel, pdf against x for q(x) ~ N(µ = 0.0, σ²) with σ² ∈ {0.3, 0.725, 1.15, 1.575, 2.0}, overlaid on the fixed p(x) ~ N(0.0, 1.0); bottom panel, KL[q(x) ‖ p(x)] against σ².]
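The quantity these slides plot, KL[q(x) ‖ p(x)] as the variance of q changes, has a closed form when both distributions are univariate Gaussians. A minimal sketch, assuming NumPy; the function name `kl_gauss` is illustrative and not taken from the slides:

```python
import numpy as np

def kl_gauss(mu_q, var_q, mu_p=0.0, var_p=1.0):
    """KL[q || p] for univariate Gaussians q = N(mu_q, var_q) and p = N(mu_p, var_p)."""
    var_q = np.asarray(var_q, dtype=float)
    ratio = var_q / var_p
    return 0.5 * (ratio + (mu_q - mu_p) ** 2 / var_p - 1.0 - np.log(ratio))

# Variances shown on the slides; the divergence is zero only when q equals p.
variances = np.array([0.3, 0.725, 1.15, 1.575, 2.0])
kl_values = kl_gauss(0.0, variances)
for v, kl in zip(variances, kl_values):
    print(f"sigma^2 = {v:5.3f}  KL[q||p] = {kl:.4f}")
```

The curve is asymmetric around σ² = 1: a q that is too narrow is penalised more heavily than one that is too wide by the same amount.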
Variational Bayes
We don't have access to, or can't compute for computational reasons, $p(z \mid y)$ or $p(y)$, and hence $\mathrm{KL}\left[q(z) \,\|\, p(z \mid y)\right]$. How can we minimize something we can't compute?
◮ We can compute $q(z)$ and $p(y \mid z)$ for any $z$.
◮ $q(z)$ is parameterised by 'variational parameters'.
◮ The true posterior follows from Bayes' rule, $p(z \mid y) = \frac{p(y \mid z)\, p(z)}{p(y)}$.
◮ $p(y)$ doesn't change when the variational parameters are changed.
Variational Bayes - Derivation
\[
\begin{aligned}
\mathrm{KL}\left[q(z) \,\|\, p(z \mid y)\right]
&= \int q(z) \log \frac{q(z)}{p(z \mid y)} \, dz \\
&= \int q(z) \left[ \log \frac{q(z)}{p(z)} - \log p(y \mid z) + \log p(y) \right] dz \\
&= \mathrm{KL}\left[q(z) \,\|\, p(z)\right] - \int q(z) \log p(y \mid z) \, dz + \log p(y) \\
\log p(y) &= \int q(z) \log p(y \mid z) \, dz - \mathrm{KL}\left[q(z) \,\|\, p(z)\right] + \mathrm{KL}\left[q(z) \,\|\, p(z \mid y)\right]
\end{aligned}
\]
Variational Bayes - Derivation
\[
\begin{aligned}
\log p(y) &= \int q(z) \log p(y \mid z) \, dz - \mathrm{KL}\left[q(z) \,\|\, p(z)\right] + \mathrm{KL}\left[q(z) \,\|\, p(z \mid y)\right] \\
&\geq \int q(z) \log p(y \mid z) \, dz - \mathrm{KL}\left[q(z) \,\|\, p(z)\right]
\end{aligned}
\]
◮ The tractable terms give a lower bound on $\log p(y)$, because $\mathrm{KL}\left[q(z) \,\|\, p(z \mid y)\right]$ is never negative.
◮ Adjust the variational parameters of $q(z)$ to make the tractable terms as large as possible. Since $\log p(y)$ is fixed, this makes $\mathrm{KL}\left[q(z) \,\|\, p(z \mid y)\right]$ as small as possible.
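The bound can be checked numerically on a model where everything is available in closed form. The sketch below assumes a toy conjugate model that is not in the slides: prior z ~ N(0, 1), likelihood y | z ~ N(z, 1), and q(z) = N(m, s2). The names `elbo` and `log_evidence` are illustrative.

```python
import numpy as np

def elbo(m, s2, y):
    """Lower bound on log p(y) for q(z) = N(m, s2), prior N(0, 1), likelihood N(y | z, 1)."""
    # E_q[log p(y | z)] for a unit-variance Gaussian likelihood.
    expected_loglik = -0.5 * np.log(2 * np.pi) - 0.5 * ((y - m) ** 2 + s2)
    # KL[q(z) || p(z)] between N(m, s2) and N(0, 1).
    kl_q_prior = 0.5 * (s2 + m ** 2 - 1.0 - np.log(s2))
    return expected_loglik - kl_q_prior

def log_evidence(y):
    """Exact log p(y); marginally y ~ N(0, 2) in this assumed model."""
    return -0.5 * np.log(2 * np.pi * 2.0) - y ** 2 / 4.0

y = 1.0
loose = elbo(0.0, 1.0, y)        # arbitrary variational parameters
tight = elbo(y / 2.0, 0.5, y)    # q set to the exact posterior N(y/2, 1/2)
print(loose, tight, log_evidence(y))
```

With arbitrary parameters the bound sits below log p(y); when q equals the exact posterior the KL gap vanishes and the bound meets log p(y).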
VB optimisation illustration
Variational Bayes for Gaussian processes
◮ Make a Gaussian approximation, $q(f) = \mathcal{N}(f \mid \mu, C)$, as similar as possible to the true posterior, $p(f \mid y)$.
◮ Treat $\mu$ and $C$ as 'variational parameters', affecting the quality of the approximation.
\[
\begin{aligned}
\mathrm{KL}\left[q(f) \,\|\, p(f \mid y)\right]
&= \left\langle \log \frac{q(f)}{p(f \mid y)} \right\rangle_{q(f)} \\
&= \left\langle \log \frac{q(f)}{p(f)} - \log p(y \mid f) + \log p(y) \right\rangle_{q(f)} \\
&= \mathrm{KL}\left[q(f) \,\|\, p(f)\right] - \left\langle \log p(y \mid f) \right\rangle_{q(f)} + \log p(y) \\
\log p(y) &= \left\langle \log p(y \mid f) \right\rangle_{q(f)} - \mathrm{KL}\left[q(f) \,\|\, p(f)\right] + \mathrm{KL}\left[q(f) \,\|\, p(f \mid y)\right]
\end{aligned}
\]
Variational Bayes for Gaussian processes - bound
\[
\begin{aligned}
\log p(y) &= \left\langle \log p(y \mid f) \right\rangle_{q(f)} - \mathrm{KL}\left[q(f) \,\|\, p(f)\right] + \mathrm{KL}\left[q(f) \,\|\, p(f \mid y)\right] \\
&\geq \left\langle \log p(y \mid f) \right\rangle_{q(f)} - \mathrm{KL}\left[q(f) \,\|\, p(f)\right]
\end{aligned}
\]
◮ Adjust the variational parameters $\mu$ and $C$ to make the tractable terms as large as possible, thus making $\mathrm{KL}\left[q(f) \,\|\, p(f \mid y)\right]$ as small as possible.
◮ With a factorizing likelihood, $\left\langle \log p(y \mid f) \right\rangle_{q(f)}$ reduces to a series of $n$ one-dimensional integrals.
◮ In practice, the number of variational parameters can be reduced by reparameterizing $C = (K_{ff}^{-1} - 2\Lambda)^{-1}$ with diagonal $\Lambda$, noting that the expected log-likelihood term depends on $C$ only through its diagonal.
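The two tractable terms of the bound can be sketched as follows. This is an illustration under stated assumptions, not the slides' implementation: a Bernoulli likelihood with a logistic link and labels in {0, 1}, a zero-mean GP prior, Gauss-Hermite quadrature for the one-dimensional integrals, and hypothetical values for K, µ and C.

```python
import numpy as np

def expected_log_lik(y, mu, var, n_points=20):
    """Sum over i of E_{N(f_i | mu_i, var_i)}[log p(y_i | f_i)] by Gauss-Hermite quadrature."""
    x, w = np.polynomial.hermite.hermgauss(n_points)  # nodes and weights for weight exp(-x^2)
    f = mu[:, None] + np.sqrt(2.0 * var)[:, None] * x[None, :]  # (n, n_points) locations
    sign = 2.0 * y[:, None] - 1.0                     # +1 for y = 1, -1 for y = 0
    log_lik = -np.logaddexp(0.0, -sign * f)           # stable log sigmoid(sign * f)
    return np.sum(log_lik @ w) / np.sqrt(np.pi)

def kl_to_prior(mu, C, K):
    """KL[N(mu, C) || N(0, K)] between the approximation and a zero-mean GP prior."""
    n = len(mu)
    _, logdet_K = np.linalg.slogdet(K)
    _, logdet_C = np.linalg.slogdet(C)
    trace_term = np.trace(np.linalg.solve(K, C))
    quad_term = mu @ np.linalg.solve(K, mu)
    return 0.5 * (trace_term + quad_term - n + logdet_K - logdet_C)

# Hypothetical two-point example.
K = np.array([[1.0, 0.5], [0.5, 1.0]])
y = np.array([1.0, 1.0])
mu, C = np.array([0.3, 0.3]), 0.5 * K
bound = expected_log_lik(y, mu, np.diag(C)) - kl_to_prior(mu, C, K)
print(bound)
```

Only the diagonal of C enters `expected_log_lik`, which is why the likelihood term needs just n one-dimensional integrals.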
VB optimisation illustration for Gaussian processes
Expectation propagation
\[
p(f \mid y) \propto p(f) \prod_{i=1}^{n} p(y_i \mid f_i)
\]
\[
q(f) \triangleq \frac{1}{Z_{ep}} \, p(f) \prod_{i=1}^{n} t_i(f_i \mid \tilde{Z}_i, \tilde{\mu}_i, \tilde{\sigma}_i^2) = \mathcal{N}(f \mid \mu, \Sigma)
\]
\[
t_i \triangleq \tilde{Z}_i \, \mathcal{N}\left(f_i \mid \tilde{\mu}_i, \tilde{\sigma}_i^2\right)
\]
◮ Individual likelihood terms, $p(y_i \mid f_i)$, are replaced by independent un-normalised 1D Gaussians, $t_i$.
◮ An iterative algorithm updates the $t_i$'s to get a more and more accurate approximation.
Expectation propagation
1. Remove one factor $t_i$ from the approximation $q(f)$.
2. The approximate marginal $q(f_i)$ with the $t_i$ contribution removed is called the cavity distribution, $q_{-i}(f_i)$.
3. Find the $t_i$ that minimises $\mathrm{KL}\left[ p(y_i \mid f_i) \, q_{-i}(f_i) / z_i \,\|\, q(f_i) \right]$ by matching moments.
4. Repeat until convergence.
This approximately minimises $\mathrm{KL}\left[ p(f \mid y) \,\|\, q(f) \right]$ locally, but not globally.
Expectation propagation - in math
Step 1 & 2. First choose a local likelihood contribution, $i$, to leave out, and find the marginal cavity distribution,
\[
q(f \mid y) \propto p(f) \prod_{j=1}^{n} t_j(f_j)
\;\rightarrow\; \frac{p(f) \prod_{j=1}^{n} t_j(f_j)}{t_i(f_i)}
\;\rightarrow\; p(f) \prod_{j \neq i} t_j(f_j)
\]
\[
\rightarrow\; \int p(f) \prod_{j \neq i} t_j(f_j) \, df_{j \neq i} \triangleq q_{-i}(f_i)
\]
Step 3.1. Find the un-normalised Gaussian $\hat{q}(f_i) = \hat{Z}_i \, \mathcal{N}\left(f_i \mid \hat{\mu}_i, \hat{\sigma}_i^2\right)$ that minimises
\[
\mathrm{KL}\left[ p(y_i \mid f_i) \, q_{-i}(f_i) \,\Big\|\, \hat{Z}_i \, \mathcal{N}\left(f_i \mid \hat{\mu}_i, \hat{\sigma}_i^2\right) \right]
\]
Step 3.2. Compute the parameters of $t_i(f_i \mid \tilde{Z}_i, \tilde{\mu}_i, \tilde{\sigma}_i^2)$ that make the moments of $q(f_i)$ match those of $\hat{Z}_i \, \mathcal{N}\left(f_i \mid \hat{\mu}_i, \hat{\sigma}_i^2\right)$.
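Steps 3.1 and 3.2 for a single site can be sketched as below. This is a minimal illustration under assumptions not in the slides: the moments of the tilted distribution are computed by Gauss-Hermite quadrature, the site normaliser is omitted, and the function names are hypothetical. A Gaussian likelihood is used only as a sanity check, since there the site should recover the likelihood itself.

```python
import numpy as np

def tilted_moments(lik, cav_mean, cav_var, n_points=50):
    """Normaliser, mean and variance of lik(f) * N(f | cav_mean, cav_var)."""
    x, w = np.polynomial.hermite.hermgauss(n_points)
    f = cav_mean + np.sqrt(2.0 * cav_var) * x   # quadrature locations under the cavity
    w = w / np.sqrt(np.pi)                      # weights now average over the cavity
    Z_hat = np.sum(w * lik(f))
    mu_hat = np.sum(w * f * lik(f)) / Z_hat
    var_hat = np.sum(w * (f - mu_hat) ** 2 * lik(f)) / Z_hat
    return Z_hat, mu_hat, var_hat

def site_update(mu_hat, var_hat, cav_mean, cav_var):
    """Site mean and variance such that cavity times site has moments (mu_hat, var_hat)."""
    site_prec = 1.0 / var_hat - 1.0 / cav_var   # precisions subtract for Gaussian products
    site_var = 1.0 / site_prec
    site_mean = site_var * (mu_hat / var_hat - cav_mean / cav_var)
    return site_mean, site_var

# Sanity check: Gaussian likelihood N(y_obs | f, s2) with a standard normal cavity.
y_obs, s2 = 0.5, 1.0

def gauss_lik(f):
    return np.exp(-0.5 * (y_obs - f) ** 2 / s2) / np.sqrt(2 * np.pi * s2)

Z_hat, mu_hat, var_hat = tilted_moments(gauss_lik, cav_mean=0.0, cav_var=1.0)
site_mean, site_var = site_update(mu_hat, var_hat, cav_mean=0.0, cav_var=1.0)
print(site_mean, site_var)
```

For likelihoods that are not log-concave the site precision computed this way can become negative, which a practical implementation would need to handle.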
Comparing posterior approximations
[Figure: two panels on axes $f_1$ (horizontal) and $f_2$ (vertical): Prior $p(f_1, f_2)$; Likelihood $p(y = 1 \mid f_1, f_2)$.]
◮ Gaussian prior between two function values $\{f_1, f_2\}$, at $\{x_1, x_2\}$ respectively.
◮ Bernoulli likelihood, $y_1 = 1$ and $y_2 = 1$.
Comparing posterior approximations
[Figure: two panels on axes $f_1$ (horizontal) and $f_2$ (vertical): True posterior; Laplace approximation.]
◮ $p(f \mid y) = \frac{p(y \mid f)\, p(f)}{p(y)}$
◮ The true posterior is non-Gaussian.
◮ Laplace approximates it with a Gaussian centred at the mode of the posterior.
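A Laplace approximation for this kind of two-point example can be sketched as follows. The slides give neither the link function nor the kernel values, so the logistic link, the prior covariance K and the function name `laplace_gp_binary` are all assumptions made for illustration.

```python
import numpy as np

def laplace_gp_binary(K, y, n_iter=50):
    """Laplace approximation N(f_hat, (K^-1 + W)^-1) for a logistic Bernoulli likelihood.

    y holds labels in {0, 1}; K is the prior covariance of the latent values f.
    """
    f = np.zeros(len(y))
    K_inv = np.linalg.inv(K)
    for _ in range(n_iter):
        pi = 1.0 / (1.0 + np.exp(-f))             # sigmoid(f)
        grad = (y - pi) - K_inv @ f               # gradient of log p(y | f) + log p(f)
        W = np.diag(pi * (1.0 - pi))              # negative Hessian of log p(y | f)
        f = f + np.linalg.solve(K_inv + W, grad)  # Newton step towards the mode
    pi = 1.0 / (1.0 + np.exp(-f))
    cov = np.linalg.inv(K_inv + np.diag(pi * (1.0 - pi)))  # inverse curvature at the mode
    return f, cov

# Hypothetical prior covariance between f_1 and f_2; both labels equal 1 as on the slide.
K = np.array([[4.0, 3.0], [3.0, 4.0]])
y = np.array([1.0, 1.0])
f_hat, cov = laplace_gp_binary(K, y)
print(f_hat)
print(cov)
```

The returned Gaussian matches the posterior's mode and curvature only, so it can miss the skew visible in the true posterior.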