Variational Inference
Probabilistic Graphical Models
Sharif University of Technology, Spring 2016
Soleymani
Some slides are adapted from Xing's slides.
Inference queries
- Nodes: $\mathcal{X} = \{X_1, \dots, X_n\}$; evidence: $x_E$; query variables: $Y = \mathcal{X} \setminus X_E$
- Marginal probability (likelihood): $P(x_E) = \sum_{Y} P(Y, x_E)$
- Conditional probability (a posteriori belief): $P(Y \mid x_E) = \dfrac{P(Y, x_E)}{\sum_{Y} P(Y, x_E)}$
- Marginalized conditional probability (non-evidence variables split as $Y \cup Z$): $P(Y \mid x_E) = \dfrac{\sum_{Z} P(Y, Z, x_E)}{\sum_{Y, Z} P(Y, Z, x_E)}$
- Most probable assignment for some variables of interest given evidence $x_E$: $\mathrm{MPA}(Y \mid x_E) = \operatorname*{argmax}_{y} P(y \mid x_E)$
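As a concrete illustration (not part of the original slides), the queries above can be computed by brute-force enumeration on a tiny toy joint distribution; the probability table and variable names below are made up for illustration.

```python
import numpy as np

# Toy joint P(A, B, C) over binary variables, chosen arbitrarily for illustration.
rng = np.random.default_rng(0)
p_joint = rng.random((2, 2, 2))
p_joint /= p_joint.sum()  # normalize so the table is a valid distribution

# Evidence: C = 1. Query variable: A. Marginalized-out variable: B.
c_obs = 1

# Marginal probability of the evidence: P(C=1) = sum_{A,B} P(A, B, C=1)
p_evidence = p_joint[:, :, c_obs].sum()

# Conditional (a posteriori belief): P(A | C=1) = sum_B P(A, B, C=1) / P(C=1)
p_a_given_c = p_joint[:, :, c_obs].sum(axis=1) / p_evidence

# Most probable assignment of A given the evidence
a_map = int(np.argmax(p_a_given_c))

print("P(C=1)            =", p_evidence)
print("P(A | C=1)        =", p_a_given_c)
print("argmax_a P(a|C=1) =", a_map)
```

Exact inference algorithms exist precisely to avoid this exponential enumeration over the full joint table.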
Exact methods for inference
- Variable elimination
- Message passing (shared terms)
  - Sum-product (belief propagation)
  - Max-product
- Junction tree
Junction tree
- General algorithm on graphs with cycles
- Message passing on junction trees
[Figure: message passing between cliques and separators of a junction tree]
Why approximate inference
- The computational complexity of the junction tree algorithm is at least exponential in the size of the largest elimination clique (the largest clique in the triangulated graph).
  - The tree-width of an $n \times n$ grid is $n$.
- For a distribution $P$ associated with a complex graph, computing the marginal (or conditional) probability of arbitrary random variable(s) is intractable.
Learning and inference
- Learning is itself an inference problem, or usually needs inference.
  - In Bayesian inference, which is one of the principal foundations of machine learning, learning is just an inference problem.
  - In the maximum likelihood approach, we also need inference when we have incomplete data or when we work with an undirected model.
Approximate inference
- Variational algorithms
  - Loopy belief propagation
  - Mean field approximation
  - Expectation propagation
- Stochastic simulation / sampling methods
Variational methods
- "Variational": a general term for optimization-based formulations.
- Many problems can be expressed as an optimization problem in which the quantity being optimized is a functional.
- Variational inference is a deterministic framework that is widely used for approximate inference.
Variational inference methods
- Construct an approximation to the target distribution $P$, where the approximation takes a simpler form for inference:
  - Define a target class of distributions $\mathcal{Q}$.
  - Search for an instance $Q^*$ in $\mathcal{Q}$ that is the best approximation to $P$.
  - Answer queries using $Q^*$ rather than $P$.
- This is a constrained optimization problem:
  - $\mathcal{Q}$: a given family of distributions.
  - Simpler families make the optimization problem computationally tractable.
  - However, the family may not be sufficiently expressive to encode $P$.
Setup
- Observed variables $X = \{x_1, \dots, x_N\}$, hidden variables $Z = \{z_1, \dots, z_M\}$.
- Assume that we are interested in the posterior distribution
  $p(Z \mid X, \alpha) = \dfrac{p(Z, X \mid \alpha)}{\int p(Z, X \mid \alpha)\, dZ}$
- The problem of computing the posterior is an instance of the more general problems that variational inference solves.
- Main idea:
  - Pick a family of distributions over the latent variables with its own variational parameters.
  - Find the setting of the parameters that makes $q$ close to the posterior of interest.
  - Use $q$ with the fitted parameters as an approximation to the posterior.
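To see why the normalizer in the posterior is the hard part, here is a hypothetical toy example (not from the slides): the evidence of a small 1D Gaussian mixture with unknown assignments $z_1, \dots, z_N$, computed by enumerating all $K^N$ assignment vectors. The mixture parameters and data are made up; the point is the exponential number of terms.

```python
import itertools
import numpy as np
from scipy.stats import norm

# Hypothetical mixture: K components with known means/variance, unknown assignments.
K, N = 3, 8
means, sigma, pi = np.array([-2.0, 0.0, 3.0]), 1.0, np.ones(K) / K
x = np.array([-2.1, -1.9, 0.2, 0.1, 2.8, 3.2, 0.0, -2.3])

# Exact evidence p(x) = sum over all K**N assignment vectors z of p(z) p(x | z).
p_x = 0.0
for z in itertools.product(range(K), repeat=N):        # K**N terms
    p_z = np.prod(pi[list(z)])
    p_x_given_z = np.prod(norm.pdf(x, loc=means[list(z)], scale=sigma))
    p_x += p_z * p_x_given_z

print(f"exact p(x) summed over {K**N} assignments: {p_x:.3e}")
# The number of terms K**N grows exponentially in N, which is why the exact
# posterior p(z | x) is intractable for realistic N and motivates approximation.
```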
Approximation
- Goal: approximate a difficult distribution $p(Z \mid X)$ with a new distribution $q(Z)$ such that:
  - $p(Z \mid X)$ and $q(Z)$ are close,
  - computation on $q(Z)$ is easy.
- Typically, the true posterior is not in the variational family.
- How should we measure the distance between distributions?
  - The Kullback-Leibler divergence (KL divergence) between two distributions $p$ and $q$.
KL divergence
- Kullback-Leibler divergence between $p$ and $q$:
  $\mathrm{KL}(p \,\|\, q) = \int p(x) \log \frac{p(x)}{q(x)}\, dx$
- A result from information theory: for any $p$ and $q$, $\mathrm{KL}(p \,\|\, q) \ge 0$.
- $\mathrm{KL}(p \,\|\, q) = 0$ if and only if $p \equiv q$.
- KL is asymmetric.
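A minimal sketch of the definition above for discrete distributions, illustrating non-negativity and asymmetry; the two distributions are arbitrary examples.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) = sum_x p(x) log(p(x)/q(x)) for discrete distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0                     # terms with p(x) = 0 contribute 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = [0.5, 0.4, 0.1]
q = [0.3, 0.3, 0.4]

print(kl(p, q))   # >= 0
print(kl(q, p))   # generally different from kl(p, q): KL is asymmetric
print(kl(p, p))   # 0, since the two arguments are identical
```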
How to measure the distance between $p$ and $q$?
- We wish to find a distribution $q$ such that $q$ is a "good" approximation to $p$.
- We can therefore use the KL divergence as a scoring function to choose a good $q$.
- But $\mathrm{KL}(p(Z \mid X) \,\|\, q(Z)) \neq \mathrm{KL}(q(Z) \,\|\, p(Z \mid X))$.
M-projection vs. I-projection
- M-projection of $p$ onto $\mathcal{Q}$: $q^* = \operatorname*{argmin}_{q \in \mathcal{Q}} \mathrm{KL}(p \,\|\, q)$
- I-projection of $p$ onto $\mathcal{Q}$: $q^* = \operatorname*{argmin}_{q \in \mathcal{Q}} \mathrm{KL}(q \,\|\, p)$
- These two differ only when the minimization is over a restricted set of distributions $\mathcal{Q}$ (i.e., when $\mathcal{Q}$ is a proper subset of all possible distributions).
KL divergence: M-projection vs. I-projection
- Let $p$ be a 2D Gaussian and $q$ be a Gaussian with diagonal covariance matrix:
  - M-projection: $q^* = \operatorname*{argmin}_{q} \int p(z) \log \frac{p(z)}{q(z)}\, dz$
  - I-projection: $q^* = \operatorname*{argmin}_{q} \int q(z) \log \frac{q(z)}{p(z)}\, dz$
- In both cases the mean is matched: $E_{q^*}[z] = E_p[z]$.
[Figure: $p$ in green, $q^*$ in red, for each projection; Bishop]
KL divergence: M-projection vs. I-projection
- Let $p$ be a mixture of two 2D Gaussians and $q$ be a 2D Gaussian with an arbitrary covariance matrix:
  - M-projection: $q^* = \operatorname*{argmin}_{q} \int p(z) \log \frac{p(z)}{q(z)}\, dz$, which matches the moments: $E_{q^*}[z] = E_p[z]$, $\mathrm{Cov}_{q^*}[z] = \mathrm{Cov}_p[z]$.
  - I-projection: $q^* = \operatorname*{argmin}_{q} \int q(z) \log \frac{q(z)}{p(z)}\, dz$, which has two good solutions (one per mode).
[Figure: $p$ in blue, $q^*$ in red; Bishop]
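The following sketch reproduces the qualitative behavior of the two projections on a 1D stand-in for the slide's 2D mixture example (the mixture parameters, grid, and search ranges are assumptions): the M-projection is obtained by moment matching, the I-projection by a brute-force search over single Gaussians.

```python
import numpy as np
from scipy.stats import norm

# Target p: a 1D two-component Gaussian mixture, evaluated on a dense grid.
w, mu, sd = np.array([0.5, 0.5]), np.array([-3.0, 3.0]), np.array([1.0, 1.0])
grid = np.linspace(-10, 10, 4001)
dz = grid[1] - grid[0]
p = (w[:, None] * norm.pdf(grid, mu[:, None], sd[:, None])).sum(axis=0)

# M-projection onto single Gaussians: argmin_q KL(p || q) matches the moments of p.
m_M = np.sum(grid * p) * dz
s_M = np.sqrt(np.sum((grid - m_M) ** 2 * p) * dz)

# I-projection: argmin_q KL(q || p), found here by brute-force search over (m, s).
best = None
for m in np.linspace(-5, 5, 101):
    for s in np.linspace(0.3, 4.0, 75):
        q = norm.pdf(grid, m, s)
        kl_qp = np.sum(q * (np.log(q + 1e-300) - np.log(p + 1e-300))) * dz
        if best is None or kl_qp < best[0]:
            best = (kl_qp, m, s)

print(f"M-projection: mean={m_M:.2f}, std={s_M:.2f}   (broad, covers both modes)")
print(f"I-projection: mean={best[1]:.2f}, std={best[2]:.2f} (locks onto one mode)")
```

By symmetry the I-projection has two equally good solutions, one centered on each mode, matching the "two good solutions" remark above.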
M-projection
- Computing $\mathrm{KL}(p \,\|\, q)$ requires inference on $p$:
  $\mathrm{KL}(p \,\|\, q) = \int p(z) \log \frac{p(z)}{q(z)}\, dz = -H(p) - E_p[\log q(z)]$
  Inference on $p$ (which is difficult) is required.
- When $q$ is in the exponential family, setting the gradient of $\mathrm{KL}(p \,\|\, q)$ to zero gives $E_q[T(z)] = E_p[T(z)]$ (moment projection).
- Expectation propagation methods are based on minimizing $\mathrm{KL}(p \,\|\, q)$.
I-projection
- $\mathrm{KL}(q \,\|\, p)$ can be computed without performing inference on $p$:
  $\mathrm{KL}(q \,\|\, p) = \int q(z) \log \frac{q(z)}{p(z)}\, dz = -H(q) - E_q[\log p(z)]$
- Most variational inference algorithms make use of $\mathrm{KL}(q \,\|\, p)$.
- Computing expectations w.r.t. $q$ is tractable (by choosing a suitable class of distributions for $q$).
  - We choose a restricted family of distributions such that the expectations can be evaluated and optimized efficiently,
  - and yet one that is still sufficiently flexible to give a good approximation.
Example of variational approximation
[Figure: comparison of a variational approximation and the Laplace approximation to a posterior; Bishop]
Evidence Lower Bound (ELBO)
- $\ln p(X) = \mathcal{L}(q) + \mathrm{KL}(q \,\|\, p)$, where $X = \{x_1, \dots, x_N\}$, $Z = \{z_1, \dots, z_M\}$, and
  $\mathcal{L}(q) = \int q(Z) \ln \frac{p(X, Z)}{q(Z)}\, dZ$
  $\mathrm{KL}(q \,\|\, p) = -\int q(Z) \ln \frac{p(Z \mid X)}{q(Z)}\, dZ$
- We will also refer to $\mathcal{L}(q)$ as $F[p, q]$ later.
- We can maximize the lower bound $\mathcal{L}(q)$:
  - equivalent to minimizing the KL divergence;
  - if we allow any possible choice for $q(Z)$, the maximum of the lower bound occurs when the KL divergence vanishes,
  - which occurs when $q(Z)$ equals the posterior distribution $p(Z \mid X)$.
- The ELBO and the KL divergence sum to $\ln p(X)$, which is the quantity the ELBO bounds.
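A small numeric check (not from the slides) of the decomposition $\ln p(X) = \mathcal{L}(q) + \mathrm{KL}(q \,\|\, p)$ on a hypothetical three-state toy model; the prior, likelihood, and $q$ are arbitrary choices, and any valid $q$ satisfies both the identity and the bound.

```python
import numpy as np

# Hypothetical discrete toy model: Z takes 3 values, X is a fixed observation.
p_z = np.array([0.2, 0.5, 0.3])              # prior p(z)
p_x_given_z = np.array([0.7, 0.1, 0.4])      # likelihood p(x | z) at the observed x
p_zx = p_z * p_x_given_z                     # joint p(z, x)
p_x = p_zx.sum()                             # evidence p(x)
post = p_zx / p_x                            # exact posterior p(z | x)

q = np.array([0.3, 0.3, 0.4])                # any variational distribution over z

elbo = np.sum(q * np.log(p_zx / q))          # L(q) = E_q[ln p(z,x) - ln q(z)]
kl = np.sum(q * np.log(q / post))            # KL(q || p(z|x))

print(np.log(p_x), elbo + kl)                # identical: ln p(x) = L(q) + KL
print(bool(elbo <= np.log(p_x)))             # True: the ELBO lower-bounds ln p(x)
```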
Evidence Lower Bound (ELBO)
- Lower bound on the marginal likelihood:
  - This quantity should increase monotonically with each iteration.
  - We maximize the ELBO to find the parameters that give as tight a bound as possible on the marginal likelihood.
  - Coordinate ascent converges to a local maximum of the ELBO.
- Variational inference is closely related to EM.
Factorized distributions
- Restrict the distributions to a factorized form: $q(Z) = \prod_i q_i(Z_i)$
- $\mathcal{L}(q) = \int \prod_i q_i \left[ \ln p(X, Z) - \sum_i \ln q_i \right] dZ$
- Coordinate ascent to optimize $\mathcal{L}(q)$ (optimize w.r.t. $q_j$ while holding the other factors fixed):
  $\mathcal{L}(q) = \int q_j \left[ \int \ln p(X, Z) \prod_{i \neq j} q_i\, dZ_i \right] dZ_j - \int q_j \ln q_j\, dZ_j + \text{const}$
  $\phantom{\mathcal{L}(q)} = \int q_j\, E_{i \neq j}[\ln p(X, Z)]\, dZ_j - \int q_j \ln q_j\, dZ_j + \text{const}$
  where $E_{i \neq j}[\ln p(X, Z)] = \int \ln p(X, Z) \prod_{i \neq j} q_i\, dZ_i$
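A minimal coordinate-ascent sketch of these updates for the classic case of a factorized approximation to a correlated 2D Gaussian (the target mean and precision below are assumptions). For this target, the optimal factors $\ln q_j^*(z_j) = E_{i \neq j}[\ln p(z)] + \text{const}$ are Gaussians with precision $\Lambda_{jj}$ and mean $m_j = \mu_j - \Lambda_{jj}^{-1} \Lambda_{jk} (m_k - \mu_k)$.

```python
import numpy as np

# Target: a correlated 2D Gaussian p(z) = N(mu, inv(Lambda)), Lambda = precision.
mu = np.array([1.0, -1.0])
Lam = np.array([[2.0, 1.2],
                [1.2, 2.0]])

# Mean-field q(z) = q1(z1) q2(z2): coordinate ascent on the variational means.
m = np.zeros(2)                  # initial variational means
for _ in range(50):              # update each factor in turn, others held fixed
    m[0] = mu[0] - Lam[0, 1] / Lam[0, 0] * (m[1] - mu[1])
    m[1] = mu[1] - Lam[1, 0] / Lam[1, 1] * (m[0] - mu[0])

print("variational means:     ", m)                          # converge to the true mean
print("variational variances: ", 1 / Lam[0, 0], 1 / Lam[1, 1])
print("true marginal variances:", np.diag(np.linalg.inv(Lam)))
```

The variational means converge to the true mean, while the factorized approximation underestimates the marginal variances ($1/\Lambda_{jj}$ versus the diagonal of $\Lambda^{-1}$), the typical compactness of the I-projection seen earlier.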