Deterministic Approximations

Bayesian logistic regression: already covered in lectures on classification.
Laplace and variational approximations. I will review Murphy pp. 256–259 on the
board. Similar material by MacKay, Ch. 41, pp. 492–503 (§41.4 uses
non-examinable MCMC methods).
http://www.inference.phy.cam.ac.uk/mackay/itila/book.html

Iain Murray, http://iainmurray.net/

Posterior distributions

$$p(\theta \mid \mathcal{D}, \mathcal{M}) = \frac{P(\mathcal{D} \mid \theta)\, p(\theta)}{P(\mathcal{D} \mid \mathcal{M})}$$

E.g., logistic regression:
$$p(\theta = \mathbf{w}) = \mathcal{N}(\mathbf{w};\, \mathbf{0},\, \sigma^2 I), \qquad
P(\mathcal{D} \mid \theta = \mathbf{w}) = \prod_n \sigma\!\big(z^{(n)} \mathbf{w}^\top \mathbf{x}^{(n)}\big), \quad \text{labels } z^{(n)} \in \pm 1.$$

Integrate a large product of non-linear functions. Goals: summarize the
posterior in simple form, estimate the model evidence $P(\mathcal{D} \mid \mathcal{M})$.

Non-Gaussian example

$$p(w) \propto \mathcal{N}(w;\, 0, 1), \qquad p(w \mid \mathcal{D}) \propto \mathcal{N}(w;\, 0, 1)\, \sigma(10 - 20w).$$

[Figure: prior and posterior densities plotted over $w \in [-4, 4]$.]
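The 1-D example above is easy to check numerically. The sketch below (an illustration added here, not from the slides) evaluates the unnormalized posterior $\mathcal{N}(w; 0, 1)\,\sigma(10 - 20w)$ on a grid and normalizes it by numerical integration; the grid range and spacing are arbitrary choices.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

w = np.linspace(-4, 4, 1001)                      # grid covering the plotted range
dw = w[1] - w[0]
log_prior = -0.5*np.log(2*np.pi) - 0.5*w**2       # log N(w; 0, 1)
unnorm = np.exp(log_prior) * sigmoid(10 - 20*w)   # prior times likelihood factor
post = unnorm / (unnorm.sum() * dw)               # normalize by numerical integration

print("posterior mean ~", (w * post).sum() * dw)  # shifted below zero
```

The result is visibly skewed, which is why any single-Gaussian summary of this posterior can only be approximate.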
Posterior after 500 datapoints

$N = 500$ labels generated with $w = 1$ at $x^{(n)} \sim \mathcal{N}(0, 10^2)$:
$$p(w) \propto \mathcal{N}(w;\, 0, 1), \qquad
p(w \mid \mathcal{D}) \propto \mathcal{N}(w;\, 0, 1) \prod_{n=1}^{500} \sigma\!\big(w x^{(n)} z^{(n)}\big).$$

[Figure: posterior over $w \in [-4, 4]$, with a Gaussian fit overlaid.]

Gaussian approximations

Finite parameter vector $\theta$. $P(\theta \mid \text{lots of data})$ is often
nearly Gaussian around the mode. We need to identify which Gaussian it is:
mean and covariance.

Laplace Approximation

MAP estimate:
$$\theta^* = \arg\max_\theta \big[\log P(\mathcal{D} \mid \theta) + \log P(\theta)\big].$$

Define an 'energy':
$$E(\theta) = -\log P(\theta \mid \mathcal{D}) = -\log P(\mathcal{D} \mid \theta) - \log P(\theta) + \log P(\mathcal{D}).$$

The log posterior doesn't need to be normalized: constants disappear from the
derivatives and second derivatives.

Laplace details

The matrix of second derivatives is called the Hessian:
$$H_{ij} = -\left.\frac{\partial^2}{\partial\theta_i\, \partial\theta_j} \log P(\theta \mid \mathcal{D})\right|_{\theta = \theta^*}$$

Find the posterior mode (MAP estimate) $\theta^*$ using your favourite
gradient-based optimizer. Because $\nabla_\theta E$ is zero at $\theta^*$ (a
turning point), a Taylor expansion gives
$$E(\theta^* + \delta) \approx E(\theta^*) + \tfrac{1}{2}\, \delta^\top H\, \delta.$$

Do the same thing to a Gaussian around its mean, and identify the Laplace
approximation:
$$P(\theta \mid \mathcal{D}) \approx \mathcal{N}(\theta;\, \theta^*,\, H^{-1}).$$
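A minimal numerical sketch of this recipe for the 1-D logistic-regression posterior above: find the mode of the energy with a standard optimizer, then take the second derivative there. The data are regenerated to match the slide's description ($w = 1$, $x \sim \mathcal{N}(0, 10^2)$, $N = 500$); the random seed, the finite-difference Hessian, and the variable names are illustrative choices, not course code.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
N = 500
x = rng.normal(0.0, 10.0, N)                       # inputs x ~ N(0, 10^2)
p1 = 1.0 / (1.0 + np.exp(-1.0 * x))                # P(z = +1 | x) with w_true = 1
z = np.where(rng.random(N) < p1, 1, -1)            # labels in {-1, +1}

def energy(w):
    # E(w) = -log p(w) - log P(D|w), with constants dropped
    log_prior = -0.5 * w**2                        # N(0, 1) prior, unnormalized
    log_lik = -np.logaddexp(0.0, -z * w * x).sum() # sum_n log sigma(z w x)
    return -(log_prior + log_lik)

res = minimize(lambda wv: energy(wv[0]), x0=[0.0]) # gradient-based (BFGS by default)
w_map = res.x[0]

eps = 1e-4                                         # finite-difference Hessian (scalar in 1-D)
H = (energy(w_map + eps) - 2*energy(w_map) + energy(w_map - eps)) / eps**2

print("Laplace approximation: N(w; %.3f, %.2g)" % (w_map, 1.0 / H))
```

In 1-D the Hessian is a single number, so the 'covariance' of the Laplace fit is just $1/H$; in higher dimensions you would invert (or factorize) the full Hessian matrix.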
Laplace picture

We only locally match one mode. Curvature and mode match. We can normalize the
Gaussian; the height at the mode won't match exactly!

Used to approximate the model likelihood (AKA 'evidence', 'marginal
likelihood'):
$$P(\mathcal{D}) = \frac{P(\mathcal{D} \mid \theta)\, P(\theta)}{P(\theta \mid \mathcal{D})}
\approx \frac{P(\mathcal{D} \mid \theta^*)\, P(\theta^*)}{\mathcal{N}(\theta^*;\, \theta^*,\, H^{-1})}
= P(\mathcal{D} \mid \theta^*)\, P(\theta^*)\, |2\pi H^{-1}|^{\frac{1}{2}}.$$
(A numerical continuation of the 1-D sketch appears below.)

Laplace problems

Weird densities won't work well. The mode may not have much mass, or may have
misleading curvature. High dimensions: the mode may be flat in some direction
→ ill-conditioned Hessian.

Other Gaussian approximations

Can match a Gaussian in other ways than derivatives at the mode. An accurate
approximation with a Gaussian may not be possible, but capturing the posterior
width is better than only fitting a point estimate.

Variational methods

Goal: fit a target distribution (e.g., a parameter posterior).
Define:
— a family of possible distributions $q(\theta)$
— a 'variational objective' (says 'how well does $q$ match?')
Optimize the objective: fit the parameters of $q(\theta)$ — e.g., the mean and
covariance of a Gaussian.
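Continuing the 1-D sketch above (this reuses `energy`, `w_map` and `H` from that block), the Laplace evidence estimate is a one-liner once the prior's dropped normalizing constant is restored. Again this is an illustration under those assumptions, not code from the course.

```python
import numpy as np

# Restore the dropped constant: log N(w; 0, 1) = -0.5*w^2 - 0.5*log(2*pi)
log_joint_at_mode = -energy(w_map) - 0.5*np.log(2*np.pi)
# |2*pi*H^{-1}|^{1/2} is sqrt(2*pi / H) in one dimension
log_evidence = log_joint_at_mode + 0.5*np.log(2*np.pi / H)
print("Laplace log-evidence estimate:", log_evidence)
```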
Kullback–Leibler Divergence

$$D_{\mathrm{KL}}(p \,\|\, q) = \int p(\theta) \log \frac{p(\theta)}{q(\theta)}\, \mathrm{d}\theta$$

$D_{\mathrm{KL}}(p \,\|\, q) \ge 0$. Minimized by $q(\theta) = p(\theta)$.

Information theory (non-examinable for MLPR): the KL divergence is the average
storage wasted by a compression system using model $q$ instead of the true
distribution $p$.

Minimizing $D_{\mathrm{KL}}(p \,\|\, q)$

Select a family: $q(\theta) = \mathcal{N}(\theta;\, \mu, \Sigma)$.
Minimizing $D_{\mathrm{KL}}(p \,\|\, q)$ matches the mean and covariance of $p$.

[Figure: example density and its Gaussian fit, plotted over $\theta \in [-4, 4]$.]

Optimizing $D_{\mathrm{KL}}(p \,\|\, q)$ tends to be hard. Even for Gaussian
$q$: how do we get the mean and covariance of $p$? MCMC? The answer may not be
what you want: Murphy Fig. 21.1.

Considering $D_{\mathrm{KL}}(q \,\|\, p)$

$$D_{\mathrm{KL}}(q \,\|\, p) = -\int q(\theta) \log p(\theta \mid \mathcal{D})\, \mathrm{d}\theta
+ \underbrace{\int q(\theta) \log q(\theta)\, \mathrm{d}\theta}_{\text{neg. entropy, } -H(q)}$$

1. "Don't put probability mass on implausible parameters."
2. Want to be spread out: high entropy.

$H$ is the standard symbol for entropy. Nothing to do with a Hessian, also $H$;
sorry!
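The two directions of the KL divergence can be compared numerically on the 1-D non-Gaussian example from the start of this section (this reuses the `w` grid, `dw` and `post` from that sketch). Within the Gaussian family, the minimizer of $D_{\mathrm{KL}}(p\|q)$ is the moment-matched Gaussian, so the sketch below fits that and evaluates both divergences by simple grid integration; the helper function and candidate fit are illustrative assumptions.

```python
import numpy as np

def kl(a, b, dw):
    # KL(a||b) = integral of a log(a/b); both densities are strictly positive on this grid
    return np.sum(a * np.log(a / b)) * dw

mu = (w * post).sum() * dw                         # posterior mean
var = ((w - mu)**2 * post).sum() * dw              # posterior variance
q_moment = np.exp(-0.5*(w - mu)**2 / var) / np.sqrt(2*np.pi*var)

print("KL(p||q_moment) =", kl(post, q_moment, dw))
print("KL(q_moment||p) =", kl(q_moment, post, dw))
```

KL($q\|p$) should come out larger here, because the moment-matched $q$ puts some mass where the posterior is essentially zero, which is exactly the behaviour point 1 above penalizes.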
Usual variational methods

Most variational methods in Machine Learning minimize $D_{\mathrm{KL}}(q \,\|\, p)$:
— All parameters are plausible.
— We know how to do it!
(There are other variational principles.)

$D_{\mathrm{KL}}(q \,\|\, p)$: fitting the posterior

Fit $q$ to $p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{p(\mathcal{D})}$.
Substitute into the KL divergence and get a spray of terms:
$$D_{\mathrm{KL}}(q \,\|\, p) = \mathrm{E}_q[\log q(\theta)] - \mathrm{E}_q[\log p(\mathcal{D} \mid \theta)] - \mathrm{E}_q[\log p(\theta)] + \log p(\mathcal{D}).$$

First three terms: minimize their sum, $J(q)$.
$\log p(\mathcal{D})$: model evidence. Usually intractable, but:
$$D_{\mathrm{KL}}(q \,\|\, p) \ge 0 \;\Rightarrow\; \log p(\mathcal{D}) \ge -J(q).$$
We optimize a lower bound on the log marginal likelihood.

$D_{\mathrm{KL}}(q \,\|\, p)$: optimization

The literature is full of clever (non-examinable) iterative ways to optimize
$D_{\mathrm{KL}}(q \,\|\, p)$; $q$ is not always Gaussian.
Use standard optimizers? The hardest term to evaluate is
$$\mathrm{E}_q[\log p(\mathcal{D} \mid \theta)] = \sum_{n=1}^{N} \mathrm{E}_q\big[\log p(\mathbf{x}^{(n)} \mid \theta)\big],$$
a sum of possibly simple integrals. Stochastic gradient descent is an option
(a sketch of such a fit appears after the summary below).

Summary

Laplace approximation:
— Straightforward to apply
— 2nd derivatives → certainty of parameters
— Incremental improvement on the MAP estimate

Variational methods:
— Fit variational parameters of $q$ (not $\theta$!)
— Usually $\mathrm{KL}(q \,\|\, p)$; compare to $\mathrm{KL}(p \,\|\, q)$
— Bound the marginal/model likelihood ('the evidence')
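As a concrete (illustrative) instance of the optimization discussed above, the sketch below fits $q(w) = \mathcal{N}(w;\, m, s^2)$ to the 500-datapoint logistic-regression posterior by minimizing a Monte Carlo estimate of $J(q)$. It reuses `x`, `z`, `w_map` and `H` from the Laplace sketch. Instead of stochastic gradient descent it fixes the standard-normal samples once, so that an off-the-shelf optimizer can be applied to a deterministic objective; the sample count, optimizer, warm start and variance parameterization are all assumptions, not part of the course material.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
eps = rng.standard_normal(256)                  # fixed reparameterization noise

def neg_elbo(params):
    # J(q) = E_q[log q(w)] - E_q[log p(w)] - E_q[log P(D|w)], Monte Carlo estimate
    m, log_s = params
    s = np.exp(log_s)
    ws = m + s * eps                            # samples w ~ q(w) via w = m + s*eps
    log_q = -0.5*np.log(2*np.pi) - log_s - 0.5*eps**2
    log_prior = -0.5*np.log(2*np.pi) - 0.5*ws**2
    log_lik = -np.logaddexp(0.0, -np.outer(ws, z * x)).sum(axis=1)
    return np.mean(log_q - log_prior - log_lik)

x0 = [w_map, 0.5*np.log(1.0 / H)]               # warm-start at the Laplace fit
res = minimize(neg_elbo, x0=x0, method="Nelder-Mead")
m_fit, log_s_fit = res.x

print("variational fit: N(w; %.3f, %.2g)" % (m_fit, np.exp(2*log_s_fit)))
print("lower bound on log P(D):", -neg_elbo(res.x))
```

The final value of $-J(q)$ is a lower bound on $\log P(\mathcal{D})$, which can be compared with the Laplace evidence estimate computed earlier.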