

  1. Approximate Inference 9.520 Class 19 Ruslan Salakhutdinov BCS and CSAIL, MIT

  2. Plan 1. Introduction/Notation. 2. Examples of successful Bayesian models. 3. Laplace and Variational Inference. 4. Basic Sampling Algorithms. 5. Markov chain Monte Carlo algorithms.

  3. References/Acknowledgements • Chris Bishop’s book: Pattern Recognition and Machine Learning, chapter 11 (many figures are borrowed from this book). • David MacKay’s book: Information Theory, Inference, and Learning Algorithms, chapters 29-32. • Radford Neal’s technical report on Probabilistic Inference Using Markov Chain Monte Carlo Methods. • Zoubin Ghahramani’s ICML tutorial on Bayesian Machine Learning: http://www.gatsby.ucl.ac.uk/~zoubin/ICML04-tutorial.html • Ian Murray’s tutorial on Sampling Methods: http://www.cs.toronto.edu/~murray/teaching/

  4. Basic Notation P(x) denotes the probability of x, P(x|θ) the conditional probability of x given θ, and P(x, θ) the joint probability of x and θ. Bayes Rule: P(θ|x) = P(x|θ) P(θ) / P(x), where P(x) = ∫ P(x, θ) dθ (marginalization). I will use probability distribution and probability density interchangeably; it should be obvious from the context.
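To make the marginalization concrete, here is a small numerical sketch (the prior and likelihood values are made up) for a θ that takes three discrete values:

```python
import numpy as np

# Hypothetical discrete example: theta takes 3 values with prior P(theta),
# and P(x | theta) is the likelihood of one observed x under each value.
prior = np.array([0.5, 0.3, 0.2])        # P(theta)
likelihood = np.array([0.1, 0.4, 0.7])   # P(x | theta)

# Marginalization: P(x) = sum over theta of P(x | theta) P(theta)
p_x = np.sum(likelihood * prior)

# Bayes rule: P(theta | x) = P(x | theta) P(theta) / P(x)
posterior = likelihood * prior / p_x

print(p_x)              # 0.31
print(posterior.sum())  # the posterior is a proper distribution: 1.0
```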

  5. Inference Problem Given a dataset D = {x_1, ..., x_n}, Bayes Rule gives P(θ|D) = P(D|θ) P(θ) / P(D), where P(D|θ) is the likelihood function of θ, P(θ) is the prior probability of θ, and P(θ|D) is the posterior distribution over θ. Computing the posterior distribution is known as the inference problem. But: P(D) = ∫ P(D, θ) dθ. This integral can be very high-dimensional and difficult to compute.

  6. Prediction As before, P(θ|D) = P(D|θ) P(θ) / P(D), with likelihood P(D|θ), prior P(θ), and posterior P(θ|D). Prediction: Given D, computing the conditional probability of x* requires computing the following integral: P(x*|D) = ∫ P(x*|θ, D) P(θ|D) dθ = E_{P(θ|D)}[P(x*|θ, D)], which is sometimes called the predictive distribution. Computing the predictive distribution requires the posterior P(θ|D).
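The expectation form of the predictive distribution suggests a simple Monte Carlo estimate: draw θ from the posterior and average P(x*|θ). A sketch with a made-up Gaussian posterior, chosen so the exact answer is available for comparison:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: the posterior over a Gaussian mean theta is N(1.0, 0.2^2),
# and x* | theta ~ N(theta, 1). The predictive density at a point x_star is
# E_{P(theta|D)}[ P(x* | theta) ], estimated by Monte Carlo.
theta_samples = rng.normal(1.0, 0.2, size=100_000)  # draws from P(theta|D)

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

x_star = 0.5
predictive = gauss_pdf(x_star, theta_samples, 1.0).mean()

# Exact answer: integrating the Gaussians gives N(x* | 1.0, 1 + 0.2^2)
exact = gauss_pdf(x_star, 1.0, np.sqrt(1 + 0.2 ** 2))
print(abs(predictive - exact) < 0.01)  # the estimate matches closely
```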

  7. Model Selection Compare model classes, e.g. M_1 and M_2. Need to compute posterior probabilities given D: P(M|D) = P(D|M) P(M) / P(D), where P(D|M) = ∫ P(D|θ, M) P(θ|M) dθ is known as the marginal likelihood or evidence.

  8. Computational Challenges • Computing marginal likelihoods often requires computing very high-dimensional integrals. • Computing posterior distributions (and hence predictive distributions) is often analytically intractable. • In this class, we will concentrate on Markov chain Monte Carlo (MCMC) methods for performing approximate inference. • First, let us look at some specific examples: – Bayesian Probabilistic Matrix Factorization – Bayesian Neural Networks – Dirichlet Process Mixtures (last class)

  9. Bayesian PMF [Figure: a partially observed N × M rating matrix R, approximated by the product of user and movie feature matrices U and V.] We have N users, M movies, and integer rating values from 1 to K. Let r_ij be the rating of user i for movie j, and let U ∈ R^{D×N}, V ∈ R^{D×M} be latent user and movie feature matrices: R ≈ U^T V. Goal: Predict missing ratings.
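The factorization R ≈ U^T V makes predicting a missing rating just an inner product of latent features. A minimal sketch with hypothetical dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: D latent dimensions, N users, M movies.
D, N, M = 5, 4, 6
U = rng.normal(size=(D, N))  # latent user feature matrix
V = rng.normal(size=(D, M))  # latent movie feature matrix

# R ≈ U^T V: the predicted rating of user i for movie j is u_i . v_j
R_hat = U.T @ V
assert R_hat.shape == (N, M)

# A single missing entry is filled in the same way:
i, j = 2, 3
print(np.isclose(R_hat[i, j], U[:, i] @ V[:, j]))  # True
```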

  10. Bayesian PMF [Figure: graphical model for Bayesian PMF, with hyperpriors α_U, α_V over the hyperparameters Θ_U, Θ_V.] Probabilistic linear model with Gaussian observation noise. Likelihood: p(r_ij | u_i, v_j, σ^2) = N(r_ij | u_i^T v_j, σ^2). Gaussian priors over parameters: p(U | μ_U, Σ_U) = ∏_{i=1}^N N(u_i | μ_U, Σ_U), p(V | μ_V, Σ_V) = ∏_{j=1}^M N(v_j | μ_V, Σ_V). Conjugate Gaussian-inverse-Wishart priors are placed on the user and movie hyperparameters Θ_U = {μ_U, Σ_U} and Θ_V = {μ_V, Σ_V}: a hierarchical prior.

  11. Bayesian PMF Predictive distribution: Consider predicting a rating r*_ij for user i and query movie j: p(r*_ij | R) = ∫∫ p(r*_ij | u_i, v_j) p(U, V, Θ_U, Θ_V | R) d{U, V} d{Θ_U, Θ_V}, where p(U, V, Θ_U, Θ_V | R) is the posterior over parameters and hyperparameters. Exact evaluation of this predictive distribution is analytically intractable: the posterior distribution is complicated and does not have a closed-form expression. Need to approximate.

  12. Bayesian Neural Nets Regression problem: Given a set of i.i.d. observations X = {x_n}, n = 1..N, with corresponding targets D = {t_n}, n = 1..N. Likelihood: p(D | X, w) = ∏_{n=1}^N N(t_n | y(x_n, w), β^2). The mean is given by the output of the neural network: y_k(x, w) = ∑_{j=0}^M w^(2)_kj σ( ∑_{i=0}^D w^(1)_ji x_i ), where σ(x) is the sigmoid function. Gaussian prior over the network parameters: p(w) = N(0, α^2 I).
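The two-layer network can be sketched directly from the formula. The shapes below (D = 3 inputs, M = 4 hidden units, K = 2 outputs) are hypothetical; index 0 plays the role of the bias, as in the sums j = 0..M and i = 0..D:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# y_k(x, w) = sum_j w2[k, j] * sigmoid( sum_i w1[j, i] * x[i] ),
# with index 0 carrying the bias (x_0 = 1, hidden unit 0 fixed to 1).
def forward(x, w1, w2):
    x = np.concatenate(([1.0], x))            # prepend bias input x_0 = 1
    hidden = sigmoid(w1 @ x)                  # hidden activations, j = 1..M
    hidden = np.concatenate(([1.0], hidden))  # prepend hidden bias unit
    return w2 @ hidden                        # network outputs y_k

rng = np.random.default_rng(0)
w1 = rng.normal(size=(4, 3 + 1))  # first-layer weights (incl. input bias)
w2 = rng.normal(size=(2, 4 + 1))  # second-layer weights (incl. hidden bias)

y = forward(rng.normal(size=3), w1, w2)
print(y.shape)  # (2,)
```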

  13. Bayesian Neural Nets Likelihood: p(D | X, w) = ∏_{n=1}^N N(t_n | y(x_n, w), β^2). Gaussian prior over parameters: p(w) = N(0, α^2 I). The posterior is analytically intractable: p(w | D, X) = p(D | w, X) p(w) / ∫ p(D | w, X) p(w) dw. Remark: Under certain conditions, Radford Neal (1994) showed that, as the number of hidden units goes to infinity, a Gaussian prior over parameters results in a Gaussian process prior over functions.

  14. Undirected Models x is a binary random vector with x_i ∈ {+1, −1}: p(x) = (1/Z) exp( ∑_{(i,j)∈E} θ_ij x_i x_j + ∑_{i∈V} θ_i x_i ), where Z is known as the partition function: Z = ∑_x exp( ∑_{(i,j)∈E} θ_ij x_i x_j + ∑_{i∈V} θ_i x_i ). If x is 100-dimensional, we need to sum over 2^100 terms. The sum might decompose (e.g. junction tree); otherwise we need to approximate. Remark: Compare to the marginal likelihood.
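A brute-force evaluation of the partition function shows why the 2^100-term sum is hopeless, while a handful of nodes is still feasible. A sketch with small random parameters on a fully connected graph (all names hypothetical):

```python
import itertools
import numpy as np

# Brute-force partition function for a small Ising-style model on n nodes:
# Z = sum over all x in {+1,-1}^n of
#     exp( sum_{i<j} theta_ij x_i x_j + sum_i theta_i x_i )
def partition_function(theta_pair, theta_single):
    n = len(theta_single)
    Z = 0.0
    for x in itertools.product([+1, -1], repeat=n):
        x = np.array(x)
        s = sum(theta_pair[i, j] * x[i] * x[j]
                for i in range(n) for j in range(i + 1, n))
        s += theta_single @ x
        Z += np.exp(s)
    return Z

n = 4  # 2^4 = 16 configurations; at n = 100 the sum would have 2^100 terms
rng = np.random.default_rng(0)
theta_pair = rng.normal(scale=0.1, size=(n, n))
theta_single = rng.normal(scale=0.1, size=n)
Z = partition_function(theta_pair, theta_single)
print(Z > 0)  # True
```

With all parameters zero every configuration has weight 1, so Z = 2^n, which gives a handy sanity check.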

  15. Inference For most situations we will be interested in evaluating the expectation: E[f] = ∫ f(z) p(z) dz. We will use the following notation: p(z) = p̃(z)/Z, where we can evaluate p̃(z) pointwise but cannot evaluate Z. • Posterior distribution: P(θ|D) = (1/P(D)) P(D|θ) P(θ) • Markov random fields: P(z) = (1/Z) exp(−E(z))

  16. Laplace Approximation Consider: p(z) = p̃(z)/Z. Goal: Find a Gaussian approximation q(z) which is centered on a mode of the distribution p(z). [Figure: a skewed density and its Gaussian approximation centered on the mode.] At a stationary point z0 the gradient ∇p̃(z) vanishes. Consider a Taylor expansion of ln p̃(z): ln p̃(z) ≈ ln p̃(z0) − (1/2)(z − z0)^T A (z − z0), where A is the Hessian matrix: A = −∇∇ ln p̃(z) |_{z=z0}.

  17. Laplace Approximation Consider: p(z) = p̃(z)/Z. Goal: Find a Gaussian approximation q(z) which is centered on a mode of the distribution p(z). Exponentiating both sides: p̃(z) ≈ p̃(z0) exp( −(1/2)(z − z0)^T A (z − z0) ). We get a multivariate Gaussian approximation: q(z) = (|A|^{1/2} / (2π)^{D/2}) exp( −(1/2)(z − z0)^T A (z − z0) ).

  18. Laplace Approximation Remember p(z) = p̃(z)/Z, where we approximate: Z = ∫ p̃(z) dz ≈ p̃(z0) ∫ exp( −(1/2)(z − z0)^T A (z − z0) ) dz = p̃(z0) (2π)^{D/2} / |A|^{1/2}. Bayesian inference: P(θ|D) = (1/P(D)) P(D|θ) P(θ). Identify: p̃(θ) = P(D|θ) P(θ) and Z = P(D). • The posterior is approximately Gaussian around the MAP estimate θ_MAP: p(θ|D) ≈ (|A|^{1/2} / (2π)^{D/2}) exp( −(1/2)(θ − θ_MAP)^T A (θ − θ_MAP) ).
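In one dimension the quality of this estimate of Z is easy to check. A sketch using the (hypothetical) unnormalized density p̃(z) = z^a e^(−z), whose true normalizer is Γ(a + 1), so the Laplace estimate reproduces Stirling's approximation:

```python
import math

# 1-D Laplace approximation for the unnormalized density p~(z) = z^a exp(-z)
# (a hypothetical choice; its true normalizer is Z = Gamma(a + 1)).
a = 10.0

# Mode: d/dz [a ln z - z] = a/z - 1 = 0  =>  z0 = a
z0 = a
# A = -d^2/dz^2 ln p~(z) at z0:  a/z^2 at z = z0  =>  1/a
A = a / z0 ** 2

# Laplace estimate of the normalizer: Z ≈ p~(z0) * sqrt(2*pi / A)
p_tilde_z0 = z0 ** a * math.exp(-z0)
Z_laplace = p_tilde_z0 * math.sqrt(2 * math.pi / A)

Z_true = math.gamma(a + 1)  # 10! = 3628800
print(Z_laplace / Z_true)   # ≈ 0.99: close, but Laplace ignores the skewness
```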

  19. Laplace Approximation Remember p(z) = p̃(z)/Z, where we approximate: Z = ∫ p̃(z) dz ≈ p̃(z0) (2π)^{D/2} / |A|^{1/2}. Bayesian inference: P(θ|D) = (1/P(D)) P(D|θ) P(θ). Identify: p̃(θ) = P(D|θ) P(θ) and Z = P(D). • Can approximate the model evidence: P(D) = ∫ P(D|θ) P(θ) dθ. • Using the Laplace approximation: ln P(D) ≈ ln P(D|θ_MAP) + ln P(θ_MAP) + (D/2) ln 2π − (1/2) ln |A|, where the last three terms form the Occam factor, which penalizes model complexity.

  20. Bayesian Information Criterion BIC can be obtained from the Laplace approximation: ln P(D) ≈ ln P(D|θ_MAP) + ln P(θ_MAP) + (D/2) ln 2π − (1/2) ln |A|, by taking the large-sample limit (N → ∞), where N is the number of data points: ln P(D) ≈ ln P(D|θ_MAP) − (1/2) D ln N. • Quick and easy; does not depend on the prior. • Can use the maximum likelihood estimate of θ instead of the MAP estimate. • D denotes the number of “well-determined parameters”. • Danger: Counting parameters can be tricky (e.g. infinite models).
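A quick sketch of the BIC score for a hypothetical one-dimensional Gaussian model fit by maximum likelihood (D = 2 parameters, the mean and the standard deviation):

```python
import numpy as np

# BIC-style score ln P(D | theta_ML) - (D/2) ln N for made-up Gaussian data.
rng = np.random.default_rng(0)
data = rng.normal(2.0, 1.5, size=200)
N = len(data)

mu_ml = data.mean()
sigma_ml = data.std()  # ML estimate (divides by N, not N - 1)
log_lik = np.sum(-0.5 * np.log(2 * np.pi * sigma_ml ** 2)
                 - (data - mu_ml) ** 2 / (2 * sigma_ml ** 2))

D = 2  # number of well-determined parameters: mu and sigma
bic = log_lik - 0.5 * D * np.log(N)
print(bic < log_lik)  # True: the penalty always lowers the score
```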

  21. Variational Inference Key Idea: Approximate the intractable distribution p(θ|D) with a simpler, tractable distribution q(θ). We can lower-bound the marginal likelihood using Jensen’s inequality: ln p(D) = ln ∫ p(D, θ) dθ = ln ∫ q(θ) [p(D, θ)/q(θ)] dθ ≥ ∫ q(θ) ln [p(D, θ)/q(θ)] dθ = ∫ q(θ) ln p(D, θ) dθ + ∫ q(θ) ln(1/q(θ)) dθ, where the second term is the entropy functional. This variational lower bound satisfies L(q) = ln p(D) − KL(q(θ) || p(θ|D)), where KL(q||p) is the Kullback–Leibler divergence, a non-symmetric measure of the difference between two probability distributions q and p. The goal of variational inference is to maximize the variational lower bound w.r.t. the approximate distribution q(θ), or equivalently to minimize KL(q||p).
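For a θ with finitely many values the decomposition of ln p(D) into the lower bound plus a KL term can be verified directly. A sketch with made-up numbers for the joint p(D, θ):

```python
import numpy as np

# Discrete check of ln p(D) = L(q) + KL(q || p(theta|D)), using hypothetical
# numbers for the joint p(D, theta) over 3 values of theta.
joint = np.array([0.04, 0.10, 0.06])    # p(D, theta); implies p(D) = 0.20
p_D = joint.sum()
posterior = joint / p_D                 # p(theta | D)

q = np.array([0.3, 0.5, 0.2])           # any tractable approximation q(theta)

elbo = np.sum(q * np.log(joint / q))    # L(q) = E_q[ln p(D, theta) - ln q]
kl = np.sum(q * np.log(q / posterior))  # KL(q || p(theta|D)) >= 0

print(np.isclose(elbo + kl, np.log(p_D)))  # True: the bound is exact minus KL
```

Because KL(q||p) ≥ 0, maximizing the lower bound over q pushes q toward the true posterior.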
