Variational inference - Probabilistic Graphical Models, Sharif University of Technology (PowerPoint PPT Presentation)


  1. Variational inference
     Probabilistic Graphical Models, Sharif University of Technology, Spring 2016
     Soleymani
     Some slides are adapted from Xing's slides.

  2. Inference queries
     Nodes: 𝒳 = {X_1, …, X_n}; Evidence: X_E (observed values x_E); Query variables: Y = 𝒳 \ X_E
     • Marginal probability (likelihood): P(x_E) = Σ_Y P(Y, x_E)
     • Conditional probability (a posteriori belief): P(Y | x_E) = P(Y, x_E) / Σ_Y P(Y, x_E)
     • Marginalized conditional probability, splitting Y = Z ∪ W with Z the variables of interest:
       P(Z | x_E) = Σ_W P(Z, W, x_E) / Σ_Z Σ_W P(Z, W, x_E)
     • Most probable assignment for the variables of interest given evidence X_E = x_E:
       Z* = argmax_Z P(Z | x_E)
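
As a concrete illustration of these queries (not from the slides), here is a minimal brute-force sketch: a made-up joint table over three binary variables A, B, C with C observed, from which the likelihood of the evidence, the conditional, a marginalized conditional, and the most probable assignment are all read off by summation and maximization.

```python
# Brute-force inference queries on a tiny joint distribution P(A, B, C)
# over binary variables, stored as a 2x2x2 table.  Illustrative only; the
# variable names and the random table are made up for this sketch.
import numpy as np

rng = np.random.default_rng(0)
P = rng.random((2, 2, 2))
P /= P.sum()                      # normalize so the table is a valid joint

# Evidence: C = 1.  Query variables: A, B.
P_joint_evidence = P[:, :, 1]     # P(A, B, c=1), unnormalized over A, B

# Marginal probability (likelihood) of the evidence: P(c=1) = sum_{A,B} P(A, B, c=1)
likelihood = P_joint_evidence.sum()

# Conditional probability: P(A, B | c=1)
posterior = P_joint_evidence / likelihood

# Marginalized conditional: P(A | c=1) = sum_B P(A, B | c=1)
P_A_given_c = posterior.sum(axis=1)

# Most probable assignment of (A, B) given the evidence
a_star, b_star = np.unravel_index(posterior.argmax(), posterior.shape)

print(likelihood, P_A_given_c, (a_star, b_star))
```

Exact enumeration like this is exponential in the number of variables, which is exactly why the exact and approximate methods on the following slides matter.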

  3. Exact methods for inference
     • Variable elimination
     • Message passing (reuse of shared terms):
       - Sum-product (belief propagation)
       - Max-product
     • Junction tree

  4. Junction tree
     • General algorithm for graphs with cycles
     • Message passing on the junction tree
     [Figure: two cliques C_i and C_j connected by a separator S_ij, exchanging messages m_ij(S_ij) and m_ji(S_ij).]

  5. Why approximate inference?
     • The computational complexity of the junction tree algorithm is at least K^C, where C is the size of the largest elimination clique (the largest clique in the triangulated graph) and K is the number of states per variable.
       - The tree-width of an N × N grid is N.
     • For a distribution P associated with a complex graph, computing the marginal (or conditional) probability of arbitrary random variable(s) is intractable.

  6. Learning and inference
     • Learning is itself an inference problem, or usually requires inference as a subroutine.
     • In Bayesian inference, one of the principal foundations of machine learning, learning is just an inference problem.
     • In the maximum likelihood approach we also need inference when the data are incomplete or when the model is undirected.

  7. Approximate inference
     Approximate inference techniques:
     • Variational algorithms
       - Loopy belief propagation
       - Mean field approximation
       - Expectation propagation
     • Stochastic simulation / sampling methods

  8. Variational methods
     • "Variational": a general term for optimization-based formulations.
     • Many problems can be expressed as an optimization problem in which the quantity being optimized is a functional.
     • Variational inference is a deterministic framework that is widely used for approximate inference.

  9. Variational inference methods
     • Construct an approximation to the target distribution P that takes a simpler form for inference:
       - Define a target class of distributions 𝒬.
       - Search for an instance Q* in 𝒬 that is the best approximation to P.
       - Answer queries using Q* rather than P.
     • This is a constrained optimization problem:
       - 𝒬 is a given family of distributions.
       - Simpler families make the optimization problem computationally tractable.
       - However, the family may not be expressive enough to encode P.

  10. Setup
     • Observed variables X = {x_1, …, x_n}, hidden variables Z = {z_1, …, z_m}; we are interested in the posterior distribution
       P(Z | X, α) = P(Z, X | α) / ∫ P(Z, X | α) dZ
     • Computing this posterior is an instance of the more general problems that variational inference solves.
     • Main idea:
       - Pick a family of distributions over the latent variables with its own variational parameters.
       - Find the setting of the parameters that makes Q close to the posterior of interest.
       - Use Q with the fitted parameters as an approximation to the posterior.
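
To make the setup concrete (this example is not in the slides), consider a hypothetical conjugate model where the normalizing integral is tractable: a Gaussian prior on a mean z with prior variance α and unit-variance Gaussian observations. Here the exact posterior is available in closed form and can later serve as ground truth for a variational approximation.

```python
# A tiny concrete instance of the setup (hypothetical, not from the slides):
# hidden variable z (a Gaussian mean), observed data x_1..x_n, hyperparameter
# alpha = prior variance.  The normalizer integral is tractable, so the exact
# posterior P(z | x, alpha) is available in closed form.
import numpy as np

alpha = 2.0                                  # prior variance of z: z ~ N(0, alpha)
rng = np.random.default_rng(1)
x = rng.normal(loc=1.5, scale=1.0, size=20)  # observations, x_i | z ~ N(z, 1)

n = len(x)
post_var = 1.0 / (1.0 / alpha + n)           # standard Gaussian-Gaussian conjugacy
post_mean = post_var * x.sum()

print(f"P(z | x, alpha) = N({post_mean:.3f}, {post_var:.3f})")
```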

  11. Approximation
     • Goal: approximate a difficult distribution P(Z | X) with a new distribution Q(Z) such that:
       - P(Z | X) and Q(Z) are close
       - computation on Q(Z) is easy
     • Typically, the true posterior is not in the variational family.
     • How should we measure the distance between distributions?
       - The Kullback-Leibler divergence (KL divergence) between the two distributions P and Q.

  12. KL divergence
     • Kullback-Leibler divergence between P and Q:
       KL(P||Q) = ∫ P(x) log ( P(x) / Q(x) ) dx
     • A result from information theory: for any P and Q, KL(P||Q) ≥ 0.
     • KL(P||Q) = 0 if and only if P ≡ Q.
     • KL is asymmetric.
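
A minimal sketch (not from the slides) of the definition above for discrete distributions; the two example distributions p and q are arbitrary.

```python
# KL divergence between two discrete distributions, illustrating
# non-negativity and asymmetry.
import numpy as np

def kl(p, q):
    """KL(p || q) = sum_x p(x) * log(p(x) / q(x)) for discrete p, q."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0                       # terms with p(x) = 0 contribute 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.3, 0.4, 0.3])

print(kl(p, q), kl(q, p))   # both are >= 0, and generally KL(p||q) != KL(q||p)
print(kl(p, p))             # 0.0: KL vanishes iff the distributions coincide
```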

  13. How do we measure the distance between P and Q?
     • We wish to find a distribution Q that is a "good" approximation to P.
     • We can therefore use the KL divergence as a scoring function for choosing a good Q.
     • But KL(P(Z|X) || Q(Z)) ≠ KL(Q(Z) || P(Z|X)).

  14. M-projection vs. I-projection
     • M-projection of Q onto P:
       Q* = argmin_{Q ∈ 𝒬} KL(P||Q)
     • I-projection of Q onto P:
       Q* = argmin_{Q ∈ 𝒬} KL(Q||P)
     • The two projections differ only when the minimization is over a restricted set of distributions 𝒬 (i.e., when P ∉ 𝒬).
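
The following sketch (not from the slides) contrasts the two projections numerically, under the assumption of a 1D bimodal target P and a Gaussian family for Q, discretized on a grid. The M-projection is obtained by moment matching, while the I-projection is found with a generic optimizer and locks onto a single mode, mirroring the figures on the next two slides.

```python
# M-projection vs. I-projection of a Gaussian onto a bimodal target,
# everything discretized on a grid (illustration only).
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

z = np.linspace(-10, 10, 4001)
dz = z[1] - z[0]

# Target P: mixture of two Gaussians (plays the role of the "difficult" P).
p = 0.5 * norm.pdf(z, -3, 1) + 0.5 * norm.pdf(z, 3, 1)
p /= p.sum() * dz

def q_pdf(params):
    mu, log_sigma = params
    q = norm.pdf(z, mu, np.exp(log_sigma))
    return q / (q.sum() * dz)

def kl(a, b):                      # KL(a || b) evaluated on the grid
    return float(np.sum(a * np.log(a / b)) * dz)

# M-projection argmin_Q KL(P||Q): for a Gaussian family this is moment matching.
m_mu = np.sum(p * z) * dz
m_sigma = np.sqrt(np.sum(p * (z - m_mu) ** 2) * dz)

# I-projection argmin_Q KL(Q||P): found numerically; it locks onto one mode.
res = minimize(lambda params: kl(q_pdf(params), p), x0=np.array([2.0, 0.0]))
i_mu, i_sigma = res.x[0], np.exp(res.x[1])

print("M-projection:", m_mu, m_sigma)   # mean ~0, large std (covers both modes)
print("I-projection:", i_mu, i_sigma)   # mean ~3, std ~1 (locks onto one mode)
```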

  15. KL divergence: M-projection vs. I-projection
     • Let P be a 2D Gaussian and Q be a Gaussian with a diagonal covariance matrix:
       M-projection: Q* = argmin_Q ∫ P(z) log ( P(z) / Q(z) ) dz
       I-projection: Q* = argmin_Q ∫ Q(z) log ( Q(z) / P(z) ) dz
     • In both cases the mean is matched: E_P[z] = E_Q[z].
     [Figure (Bishop): P in green, Q* in red, for the two projections.]

  16. KL divergence: M-projection vs. I-projection
     • Let P be a mixture of two 2D Gaussians and Q be a single 2D Gaussian with an arbitrary covariance matrix:
       M-projection: Q* = argmin_Q ∫ P(z) log ( P(z) / Q(z) ) dz, which matches the moments: E_P[z] = E_Q[z], Cov_P[z] = Cov_Q[z]
       I-projection: Q* = argmin_Q ∫ Q(z) log ( Q(z) / P(z) ) dz, which has two good solutions, one per mode!
     [Figure (Bishop): P in blue, Q* in red.]
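
For the mixture example, the M-projection onto a single Gaussian can be written down directly by matching moments. A small sketch (not from the slides) with made-up mixture parameters:

```python
# The M-projection of a Gaussian mixture onto a single Gaussian matches the
# mixture's mean and covariance.  Weights, means, and covariances are
# made-up example values.
import numpy as np

weights = np.array([0.5, 0.5])
means = np.array([[-2.0, 0.0], [2.0, 0.0]])
covs = np.array([[[1.0, 0.3], [0.3, 1.0]],
                 [[1.0, -0.3], [-0.3, 1.0]]])

# Mean of the mixture: E_P[z] = sum_k w_k * mu_k
m_proj_mean = weights @ means

# Covariance of the mixture: Cov_P[z] = sum_k w_k * (Sigma_k + mu_k mu_k^T) - mu mu^T
second_moment = sum(w * (C + np.outer(m, m)) for w, m, C in zip(weights, means, covs))
m_proj_cov = second_moment - np.outer(m_proj_mean, m_proj_mean)

print(m_proj_mean)   # [0, 0]: sits between the two modes
print(m_proj_cov)    # inflated covariance that covers both components
```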

  17. M-projection
     • Computing KL(P||Q) requires inference on P:
       KL(P||Q) = Σ_z P(z) log ( P(z) / Q(z) ) = −H(P) − E_P[log Q(z)]
       Inference on P (which is exactly what is difficult) is required!
     • When Q is in the exponential family with sufficient statistics T(z), the minimizer satisfies
       E_P[T(z)] = E_Q[T(z)]   (moment projection)
     • Expectation Propagation methods are based on minimizing KL(P||Q).

  18. I-projection
     • KL(Q||P) can be computed without performing inference on P:
       KL(Q||P) = ∫ Q(z) log ( Q(z) / P(z) ) dz = −H(Q) − E_Q[log P(z)]
     • Most variational inference algorithms make use of KL(Q||P):
       - Computing expectations w.r.t. Q is tractable (by choosing a suitable class of distributions for Q).
       - We choose a restricted family of distributions such that the expectations can be evaluated and optimized efficiently,
       - and yet one that is still sufficiently flexible to give a good approximation.
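
A small illustration (not from the slides) of the direction of the expectation: since the expectation in KL(Q||P) is taken under Q, it can be estimated with samples drawn from Q alone, here by plain Monte Carlo with a Gaussian Q and a two-component mixture P as stand-ins. The slides' algorithms are deterministic; this is only meant to show which distribution the expectations are taken under.

```python
# Estimating KL(Q||P) = E_Q[log Q(z) - log P(z)] with samples from Q only.
import numpy as np
from scipy.stats import norm

def log_p(z):                      # log of the target density (easy to evaluate pointwise)
    return np.log(0.5 * norm.pdf(z, -3, 1) + 0.5 * norm.pdf(z, 3, 1))

mu, sigma = 3.0, 1.2               # parameters of the approximating Gaussian Q
rng = np.random.default_rng(0)
z = rng.normal(mu, sigma, size=100_000)   # samples from Q

# Average over samples from Q; no expectation under P is ever needed.
kl_qp = np.mean(norm.logpdf(z, mu, sigma) - log_p(z))
print(kl_qp)
```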

  19. Example of variational approximation
     [Figure (Bishop): comparison of a variational approximation and the Laplace approximation to a posterior.]

  20. Evidence Lower Bound (ELBO)
     X = {x_1, …, x_n}, Z = {z_1, …, z_m}
     ln P(X) = ℒ(Q) + KL(Q||P)
     ℒ(Q) = ∫ Q(Z) ln ( P(X, Z) / Q(Z) ) dZ
     KL(Q||P) = − ∫ Q(Z) ln ( P(Z | X) / Q(Z) ) dZ
     • We will also refer to ℒ(Q) as F[P, Q] later.
     • We can maximize the lower bound ℒ(Q):
       - equivalent to minimizing the KL divergence;
       - if we allow any possible choice for Q(Z), the maximum of the lower bound occurs when the KL divergence vanishes,
       - which happens when Q(Z) equals the posterior distribution P(Z | X).
     • Since ln P(X) = ℒ(Q) + KL(Q||P) and KL ≥ 0, the ELBO ℒ(Q) is a lower bound on ln P(X), and the gap is exactly the KL divergence.
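
A quick numerical check of the decomposition (not from the slides), on a tiny discrete model where the exact posterior and ln P(X) are available by enumeration; the joint table and the choice of Q are arbitrary.

```python
# Numerical check of ln P(X) = L(Q) + KL(Q||P) on a tiny discrete model.
import numpy as np

rng = np.random.default_rng(0)
joint = rng.random((4, 3))          # P(X = i, Z = j) over 4 observations x 3 latent values
joint /= joint.sum()

x = 2                               # index of the observed value of X
p_xz = joint[x]                     # P(X = x, Z)
p_x = p_xz.sum()                    # marginal likelihood P(X = x)
posterior = p_xz / p_x              # P(Z | X = x)

q = np.array([0.5, 0.3, 0.2])       # some variational distribution over Z

elbo = np.sum(q * np.log(p_xz / q))       # L(Q) = E_Q[ln P(X, Z) - ln Q(Z)]
kl = np.sum(q * np.log(q / posterior))    # KL(Q || P(Z | X))

print(np.log(p_x), elbo + kl)       # the two numbers agree
print(elbo <= np.log(p_x))          # True: the ELBO lower-bounds ln P(X)
```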

  21. Evidence Lower Bound (ELBO)
     • The ELBO is a lower bound on the log marginal likelihood.
     • This quantity should increase monotonically with each iteration of the optimization.
     • We maximize the ELBO to find the variational parameters that give as tight a bound as possible on the marginal likelihood.
     • The optimization converges to a local optimum of the ELBO (not necessarily the global one).
     • Variational inference is closely related to EM.

  22. Factorized distributions
     • Restrict the family 𝒬 by a factorization assumption (mean field):
       Q(Z) = Π_j Q_j(Z_j)
       ℒ(Q) = ∫ Π_j Q_j(Z_j) { ln P(X, Z) − Σ_j ln Q_j(Z_j) } dZ
     • Coordinate ascent to optimize ℒ(Q), one factor Q_k at a time:
       ℒ_k(Q) = ∫ Q_k(Z_k) [ ∫ ln P(X, Z) Π_{j≠k} Q_j(Z_j) dZ_j ] dZ_k − ∫ Q_k(Z_k) ln Q_k(Z_k) dZ_k + const
       ⇒ ℒ_k(Q) = ∫ Q_k(Z_k) E_{−k}[ln P(X, Z)] dZ_k − ∫ Q_k(Z_k) ln Q_k(Z_k) dZ_k + const
       where E_{−k}[ln P(X, Z)] = ∫ ln P(X, Z) Π_{j≠k} Q_j(Z_j) dZ_j
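
A minimal coordinate-ascent sketch of these updates (not from the slides): two discrete latent variables, a made-up joint table with the observation already fixed, and the resulting update ln Q_k(z_k) = E_{−k}[ln P(X, Z)] + const applied to each factor in turn. The printed ELBO increases monotonically, as slide 21 states.

```python
# Coordinate-ascent mean-field (CAVI) on a made-up discrete model with two
# latent variables; the observed x is baked into the (unnormalized) table.
import numpy as np

def softmax(a):
    a = a - a.max()
    e = np.exp(a)
    return e / e.sum()

rng = np.random.default_rng(0)
p_x_z1_z2 = rng.random((3, 4))          # unnormalized P(x, z1, z2) for the observed x
logp = np.log(p_x_z1_z2)

q1 = np.full(3, 1 / 3)                  # factor Q_1(z1), initialized uniform
q2 = np.full(4, 1 / 4)                  # factor Q_2(z2), initialized uniform

for it in range(20):
    q1 = softmax(logp @ q2)             # Q_1(z1) proportional to exp(E_{Q_2}[ln P(x, z1, z2)])
    q2 = softmax(logp.T @ q1)           # Q_2(z2) proportional to exp(E_{Q_1}[ln P(x, z1, z2)])
    elbo = q1 @ logp @ q2 - q1 @ np.log(q1) - q2 @ np.log(q2)
    print(it, elbo)                     # increases monotonically to a local optimum

# The mean-field posterior approximation is Q(z1, z2) = Q_1(z1) * Q_2(z2).
```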
