Sandwiching the marginal likelihood using bidirectional Monte Carlo



  1. Sandwiching the marginal likelihood using bidirectional Monte Carlo
     Roger Grosse, Ryan Adams, Zoubin Ghahramani

  2. Introduction
     • When comparing different statistical models, we'd like a quantitative criterion which trades off model complexity and fit to the data
     • In a Bayesian setting, we often use the marginal likelihood
     • Defined as the probability of the data, with all parameters and latent variables integrated out
     • Motivation: plug into Bayes' Rule
       p(M_i \mid D) = \frac{p(M_i)\, p(D \mid M_i)}{\sum_j p(M_j)\, p(D \mid M_j)}

  3. Introduction: marginal likelihood
     [Figure: a matrix decomposition model built from component matrices; computing its marginal likelihood requires integrating out all of the component matrices and their hyperparameters]

  4. Introduction
     • Advantages of marginal likelihood (ML)
       - Accounts for model complexity in a sophisticated way
       - Closely related to description length
       - Measures the model's ability to generalize to unseen examples
     • ML is used in those rare cases where it is tractable, e.g. Gaussian processes, fully observed Bayes nets
     • Unfortunately, it's typically very hard to compute because it requires a very high-dimensional integral
     • While ML has been criticized on many fronts, the proposed alternatives pose similar computational difficulties

  5. Introduction
     • Focus on latent variable models: parameters \theta, latent variables z, observations y
     • Assume i.i.d. observations
     • The marginal likelihood requires summing or integrating out the latent variables and parameters:
       p(y) = \int p(\theta) \prod_{i=1}^N \sum_{z_i} p(z_i \mid \theta)\, p(y_i \mid z_i, \theta)\, d\theta
     • Similar to computing a partition function
       Z = \sum_{x \in \mathcal{X}} f(x)
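
To make the partition-function analogy concrete, here is a minimal sketch (not from the slides) that computes Z = \sum_x f(x) by brute-force enumeration for a tiny binary model with made-up random couplings; the 2^n terms in the sum are exactly why this, and the analogous marginal-likelihood integral, becomes intractable at realistic sizes.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n = 10                                    # number of binary variables (keep tiny!)
J = rng.normal(scale=0.1, size=(n, n))
J = (J + J.T) / 2                         # symmetric pairwise couplings

def log_f(x):
    """Unnormalized log-probability of a configuration x in {-1, +1}^n."""
    return float(x @ J @ x)

# Brute-force partition function: 2^n terms, feasible only for very small n.
log_terms = [log_f(np.array(x)) for x in itertools.product([-1, 1], repeat=n)]
log_Z = np.logaddexp.reduce(log_terms)
print(f"log Z = {log_Z:.3f}  (summed over {2**n} configurations)")
```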

  6. Introduction
     • Problem: exact marginal likelihood computation is intractable
     • There are many algorithms to approximate it, but we don't know how well they work

  7. Why evaluating ML estimators is hard The answer to life, the universe, and everything is... 42

  8. Why evaluating ML estimators is hard
     The marginal likelihood is… \log p(D) = -23814.7

  9. Why evaluating ML estimators is hard
     • How does one deal with this in practice?
       - polynomial-time approximations for partition functions of ferromagnetic Ising models
       - test on very small instances which can be solved exactly
       - run a bunch of estimators and see if they agree with each other

  10. Log-ML lower bounds
     • One marginal likelihood estimator is simple importance sampling:
       \hat{p}(D) = \frac{1}{K} \sum_{k=1}^K \frac{p(\theta^{(k)}, z^{(k)}, D)}{q(\theta^{(k)}, z^{(k)})}, \qquad \{\theta^{(k)}, z^{(k)}\}_{k=1}^K \sim q
     • This is an unbiased estimator: E[\hat{p}(D)] = p(D)
     • Unbiased estimators are stochastic lower bounds:
       E[\log \hat{p}(D)] \le \log p(D)   (Jensen's inequality)
       \Pr(\log \hat{p}(D) > \log p(D) + b) \le e^{-b}   (Markov's inequality)
     • Many widely used algorithms have the same property!
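
As a hedged illustration (not part of the talk), the toy model below has a closed-form log-ML, so we can check numerically that the log of the simple importance sampling estimate sits at or below it on average; the Gaussian prior/likelihood, the prior-as-proposal choice, and all constants are assumptions picked for convenience.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

rng = np.random.default_rng(0)
s0, s = 2.0, 1.0                          # prior std of mu, observation noise std
y = rng.normal(loc=1.5, scale=s, size=5)  # toy data
N = len(y)

# Exact log marginal likelihood: under the prior, y ~ N(0, s^2 I + s0^2 11^T).
cov = s**2 * np.eye(N) + s0**2 * np.ones((N, N))
true_log_ml = multivariate_normal(mean=np.zeros(N), cov=cov).logpdf(y)

def sis_log_ml_estimate(K):
    """log of the simple importance sampling estimate, with proposal q = prior."""
    mu = rng.normal(0.0, s0, size=K)                          # mu^(k) ~ q = prior
    log_w = norm.logpdf(y[:, None], loc=mu, scale=s).sum(0)   # weights = p(D | mu^(k))
    return np.logaddexp.reduce(log_w) - np.log(K)

estimates = [sis_log_ml_estimate(K=100) for _ in range(500)]
print(f"true log p(D)         = {true_log_ml:.3f}")
print(f"mean of log-estimates = {np.mean(estimates):.3f}  (on average <= true value)")
```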

  11. Log-ML lower bounds
     [Figure: lower-bound log-ML estimates from annealed importance sampling (AIS), sequential Monte Carlo (SMC), Chib-Murray-Salakhutdinov, and variational Bayes; where is the true value?]

  12. How to obtain an upper bound?
     • Harmonic Mean Estimator:
       \hat{p}(D) = \frac{K}{\sum_{k=1}^K 1 / p(D \mid \theta^{(k)}, z^{(k)})}, \qquad \{\theta^{(k)}, z^{(k)}\}_{k=1}^K \sim p(\theta, z \mid D)
     • Equivalent to simple importance sampling, but with the roles of the proposal and target distributions reversed
     • Unbiased estimate of the reciprocal of the ML:
       E\left[\frac{1}{\hat{p}(D)}\right] = \frac{1}{p(D)}
     • Gives a stochastic upper bound on the log-ML
     • Caveat 1: only an upper bound if you sample exactly from the posterior, which is generally intractable
     • Caveat 2: this is the Worst Monte Carlo Estimator (Neal, 2008)
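
Continuing the toy Gaussian model from the previous sketch (again an illustration, not the authors' code): the posterior over the mean is conjugate, so we can sample it exactly, and the harmonic mean estimate below is then a stochastic upper bound on log p(D); with approximate posterior samples even that guarantee disappears, and the variance can be enormous.

```python
import numpy as np
from scipy.stats import norm

def harmonic_mean_log_ml_estimate(y, s0, s, K, rng):
    """Harmonic mean log-ML estimate using exact (conjugate) posterior samples of mu."""
    N = len(y)
    post_var = 1.0 / (1.0 / s0**2 + N / s**2)          # conjugate Gaussian posterior
    post_mean = post_var * y.sum() / s**2
    mu = rng.normal(post_mean, np.sqrt(post_var), size=K)    # exact posterior samples
    log_lik = norm.logpdf(y[:, None], loc=mu, scale=s).sum(0)
    # log of  K / sum_k 1/p(D | mu^(k))  =  log K - logsumexp(-log_lik)
    return np.log(K) - np.logaddexp.reduce(-log_lik)

# e.g.  harmonic_mean_log_ml_estimate(y, s0=2.0, s=1.0, K=100,
#                                     rng=np.random.default_rng(1))
```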

  13. Annealed importance sampling (Neal, 2001)
     [Figure: a sequence of distributions p_0, p_1, p_2, \ldots, p_{K-1}, p_K interpolating between a tractable initial distribution (e.g. the prior) and an intractable target distribution (e.g. the posterior)]

  14. Annealed importance sampling (Neal, 2001)
     Given: unnormalized distributions f_0, \ldots, f_K; MCMC transition operators T_0, \ldots, T_K; f_0 is easy to sample from and to compute the partition function of.

     x \sim f_0
     w = 1
     For i = 0, \ldots, K-1:
       w := w \cdot \frac{f_{i+1}(x)}{f_i(x)}
       x \sim T_{i+1}(x)

     Then E[w] = \frac{Z_K}{Z_0}, and \hat{Z}_K = \frac{Z_0}{S} \sum_{s=1}^S w^{(s)}.
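
Below is a minimal AIS sketch under stated assumptions, not the authors' implementation: the intermediate distributions are geometric averages f_\beta \propto prior \times likelihood^\beta, and `transition` is a placeholder for any MCMC operator that leaves the current f_\beta invariant (e.g. a Metropolis-Hastings or Gibbs update).

```python
import numpy as np

def ais_log_weight(sample_prior, log_lik, transition, betas, rng):
    """One AIS run along the geometric path f_beta = prior * likelihood**beta.

    Returns log w, where E[w] = Z_K / Z_0 (= p(D) when the prior is normalized
    and beta goes from 0 to 1).
    """
    x = sample_prior(rng)                    # x ~ f_0 (the prior)
    log_w = 0.0
    for beta_prev, beta_next in zip(betas[:-1], betas[1:]):
        # w := w * f_{i+1}(x) / f_i(x); for the geometric path this ratio is
        # the likelihood of x raised to the increment in beta.
        log_w += (beta_next - beta_prev) * log_lik(x)
        x = transition(x, beta_next, rng)    # MCMC step targeting f_{beta_next}
    return log_w

def log_Z_ratio_estimate(log_weights):
    """Combine S independent runs: log((1/S) * sum_s w^(s)) estimates log(Z_K/Z_0)."""
    S = len(log_weights)
    return np.logaddexp.reduce(np.asarray(log_weights)) - np.log(S)
```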

  15. Annealed importance sampling (Neal, 2001)
     [Figure: the forward chain x_0 \to x_1 \to \cdots \to x_K run under transitions T_1, \ldots, T_K targeting p_0, \ldots, p_K, and the reverse chain run under the reverse operators \tilde{T}_K, \ldots, \tilde{T}_1]

     Forward:  w := \prod_{i=1}^K \frac{f_i(x_{i-1})}{f_{i-1}(x_{i-1})} = \frac{Z_K}{Z_0} \frac{q_{\mathrm{back}}(x_0, x_1, \ldots, x_K)}{q_{\mathrm{fwd}}(x_0, x_1, \ldots, x_K)}, \qquad E[w] = \frac{Z_K}{Z_0}

     Backward: w := \prod_{i=1}^K \frac{f_{i-1}(x_i)}{f_i(x_i)} = \frac{Z_0}{Z_K} \frac{q_{\mathrm{fwd}}(x_0, x_1, \ldots, x_K)}{q_{\mathrm{back}}(x_0, x_1, \ldots, x_K)}, \qquad E[w] = \frac{Z_0}{Z_K}

  16. Bidirectional Monte Carlo
     • Initial distribution: the prior p(\theta, z)
     • Target distribution: the posterior p(\theta, z \mid D) = \frac{p(\theta, z, D)}{p(D)}
     • Partition function: Z = \int p(\theta, z, D)\, d\theta\, dz = p(D)
     • Forward chain: E[w] = \frac{Z_K}{Z_0} = p(D), a stochastic lower bound
     • Backward chain (requires an exact posterior sample!): E[w] = \frac{Z_0}{Z_K} = \frac{1}{p(D)}, a stochastic upper bound

  17. Bidirectional Monte Carlo
     How to get an exact sample? There are two ways to sample from p(\theta, z, D):
     • forward: \theta, z \sim p(\theta, z), then D \sim p(D \mid \theta, z)  (generate data from the model)
     • backward: D \sim p(D), then \theta, z \sim p(\theta, z \mid D)  (sample data, then perform inference)
     Therefore, the parameters and latent variables used to generate the data are an exact posterior sample!

  18. Bidirectional Monte Carlo
     Summary of the algorithm:
     • Sample \theta^\star, z^\star \sim p_{\theta, z} and y \sim p_{y \mid \theta, z}(\cdot \mid \theta^\star, z^\star)
     • Obtain a stochastic lower bound on \log p(y) by running AIS forwards
     • Obtain a stochastic upper bound on \log p(y) by running AIS backwards, starting from (\theta^\star, z^\star)
     The two bounds will converge given enough intermediate distributions.
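
A hedged sketch of the full recipe, wiring together hypothetical model-specific helpers (`sample_prior_and_latents`, `sample_data`, `forward_ais`, `reverse_ais` are assumed names, not the authors' API):

```python
def bdmc_bounds(sample_prior_and_latents, sample_data, forward_ais, reverse_ais, rng):
    # 1. Simulate: (theta*, z*) ~ p(theta, z), then y ~ p(y | theta*, z*).
    theta_star, z_star = sample_prior_and_latents(rng)
    y = sample_data(theta_star, z_star, rng)

    # 2. Forward AIS (prior -> posterior): E[w] = p(y), so log w is a
    #    stochastic lower bound on log p(y).
    log_w_fwd = forward_ais(y, rng)

    # 3. Reverse AIS (posterior -> prior), initialized at the exact posterior
    #    sample (theta*, z*): E[w] = 1 / p(y), so -log w is a stochastic
    #    upper bound on log p(y).
    log_w_back = reverse_ais(y, theta_star, z_star, rng)

    return log_w_fwd, -log_w_back   # (lower, upper) bounds on log p(y)
```

Running several such chains and averaging each side gives a pair of bounds whose gap certifies how accurate the log-ML estimate is for that simulated dataset, which is the basis of the experiments that follow.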

  19. Experiments
     • BDMC lets us compute ground truth log-ML values for data simulated from a model
     • We can use these ground truth values to benchmark log-ML estimators!
     • Obtained ground truth ML for simulated data for: clustering, low rank approximation, binary attributes
     • Compared a wide variety of ML estimators
     • MCMC operators shared between all algorithms wherever possible

  20. Results: binary attributes
     [Figure: log-ML estimates vs. the true value for the harmonic mean estimator, the Bayesian information criterion (BIC), and likelihood weighting]

  21. Results: binary attributes
     [Figure: log-ML estimates vs. the true value for Chib-Murray-Salakhutdinov and variational Bayes]

  22. Results: binary attributes (zoomed in)
     [Figure: log-ML estimates vs. the true value for reverse SMC, reverse AIS, nested sampling, sequential Monte Carlo, and annealed importance sampling (AIS)]

  23. Results: binary attributes
     Which estimators give accurate results?
     [Figure: mean squared error vs. time (seconds) for likelihood weighting, the harmonic mean estimator, variational Bayes, Chib-Murray-Salakhutdinov, nested sampling, AIS, and sequential Monte Carlo (SMC), with a reference line marking the accuracy needed to distinguish simple matrix factorizations]

  24. Results: low rank approximation
     [Figure: results for annealed importance sampling (AIS)]

  25. Recommendations
     • Try AIS first
     • If AIS is too slow, try sequential Monte Carlo or nested sampling
     • You can't fix a bad algorithm by averaging many samples
     • Don't trust naive confidence intervals; you need to evaluate rigorously

  26. On the quantitative evaluation of decoder-based generative models
     Yuhuai Wu, Yuri Burda, Ruslan Salakhutdinov

  27. Decoder-based generative models
     • Define a generative process:
       - sample latent variables z from a simple (fixed) prior p(z)
       - pass them through a decoder network to get x = f(z)
     • Examples:
       - variational autoencoders (Kingma and Welling, 2014)
       - generative adversarial networks (Goodfellow et al., 2014)
       - generative moment matching networks (Li et al., 2015; Dziugaite et al., 2015)
       - nonlinear independent components estimation (Dinh et al., 2015)
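
As a minimal sketch of the generative process these models share (the two-layer MLP decoder and its random weights are placeholders, not any particular trained model):

```python
import numpy as np

rng = np.random.default_rng(0)
k, d, h = 10, 784, 256                 # latent dim, data dim, hidden units (arbitrary)
W1, b1 = rng.normal(scale=0.1, size=(h, k)), np.zeros(h)
W2, b2 = rng.normal(scale=0.1, size=(d, h)), np.zeros(d)

def decoder(z):
    """x = f(z): deterministic mapping from a latent code to data space."""
    hidden = np.tanh(W1 @ z + b1)
    return 1.0 / (1.0 + np.exp(-(W2 @ hidden + b2)))   # e.g. pixel intensities in [0, 1]

z = rng.standard_normal(k)             # z ~ p(z), a simple fixed prior
x = decoder(z)                         # sample lies on a k-dim submanifold of R^d
```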

  28. Decoder-based generative models
     • Variational autoencoder (VAE)
       - Train both a generator (decoder) and a recognition network (encoder)
       - Optimize a variational lower bound on the log-likelihood
     • Generative adversarial network (GAN)
       - Train a generator (decoder) and a discriminator
       - The discriminator wants to distinguish model samples from the training data; the generator wants to fool the discriminator
     • Generative moment matching network (GMMN)
       - Train a generative network such that certain statistics match between the generated samples and the data

  29. Decoder-based generative models
     Some impressive-looking samples:
     [Figure: samples from Denton et al. (2015) and Radford et al. (2016)]
     But how well do these models capture the distribution?

  30. Decoder-based generative models Looking at samples can be misleading:

  31. Decoder-based generative models
     [Figure: samples from three GANs: GAN, 10 dim (LLD = 328.7); GAN, 50 dim, 200 epochs (LLD = 543.5); GAN, 50 dim, 1000 epochs (LLD = 625.5)]

  32. Evaluating decoder-based models
     • We want to quantitatively evaluate generative models in terms of the probability of held-out data
     • Problem: a GAN or GMMN with k latent dimensions can only generate within a k-dimensional submanifold!
     • Standard (but unsatisfying) solution: impose a spherical Gaussian observation model
       p_\sigma(x \mid z) = \mathcal{N}(f(z), \sigma^2 I)
       - tune \sigma on a validation set
     • Problem: this still requires computing an intractable integral:
       p_\sigma(x) = \int p(z)\, p_\sigma(x \mid z)\, dz
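
To see why this integral is awkward, here is a hedged sketch (an illustration, not the paper's method) of the naive Monte Carlo estimator that averages the Gaussian observation density over prior samples; it is unbiased but essentially useless in high dimensions, because almost no prior sample lands near x, which motivates the more sophisticated estimators (such as the AIS machinery from the first part of the deck).

```python
import numpy as np

def naive_log_px(x, decoder, sigma, latent_dim, num_samples, rng):
    """log of (1/S) * sum_s N(x; decoder(z_s), sigma^2 I), with z_s ~ N(0, I)."""
    d = x.shape[0]
    log_terms = np.empty(num_samples)
    for s in range(num_samples):
        z = rng.standard_normal(latent_dim)
        resid = x - decoder(z)
        log_terms[s] = (-0.5 * resid @ resid / sigma**2
                        - 0.5 * d * np.log(2 * np.pi * sigma**2))
    return np.logaddexp.reduce(log_terms) - np.log(num_samples)
```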
