
CS480/680 Machine Learning Lecture 11: February 11th, 2020 - PowerPoint PPT Presentation



  1. CS480/680 Machine Learning Lecture 11: February 11th, 2020. Variational Inference. Zahra Sheikhbahaee. References: Variational Algorithms for Approximate Bayesian Inference (Beal 2003, Chapter 2); Variational Inference: A Review for Statisticians (Blei et al. 2016).

  2. • Variational lower bound derivation • Variational mean field approximation

  3. Full Bayesian Inference
  • Training stage: $p(\theta \mid X^{tr}, Y^{tr}) = \dfrac{p(Y^{tr} \mid X^{tr}, \theta)\, p(\theta)}{\int p(Y^{tr} \mid X^{tr}, \theta)\, p(\theta)\, d\theta}$
  • Testing stage: $p(y \mid x, X^{tr}, Y^{tr}) = \int p(y \mid x, \theta)\, p(\theta \mid X^{tr}, Y^{tr})\, d\theta$

  4. Full Bayesian Inference (continued)
  • The training-stage and testing-stage integrals above may be intractable: posterior distributions can be calculated analytically only for simple conjugate models!
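As a concrete illustration of the two stages above, here is a minimal sketch for a toy conjugate model (a Gaussian likelihood with known noise and a Gaussian prior on its mean); the model, data, and hyperparameters are illustrative choices, not taken from the lecture. In this conjugate case both integrals have closed forms.

```python
import numpy as np

# Toy conjugate model (illustrative, not from the slides):
#   likelihood  y_i ~ N(theta, sigma^2)   with sigma known
#   prior       theta ~ N(mu0, tau0^2)
# Both the training-stage posterior p(theta | Y) and the testing-stage
# predictive p(y* | Y) are Gaussian in closed form.

sigma, mu0, tau0 = 1.0, 0.0, 10.0
rng = np.random.default_rng(0)
Y = rng.normal(loc=2.0, scale=sigma, size=30)          # "training" observations
n = len(Y)

# Training stage: p(theta | Y) = N(mu_n, tau_n^2)
tau_n2 = 1.0 / (1.0 / tau0**2 + n / sigma**2)
mu_n = tau_n2 * (mu0 / tau0**2 + Y.sum() / sigma**2)

# Testing stage: p(y* | Y) = integral of p(y* | theta) p(theta | Y) dtheta = N(mu_n, sigma^2 + tau_n^2)
pred_mean, pred_var = mu_n, sigma**2 + tau_n2
print(f"posterior over theta:  N({mu_n:.3f}, {tau_n2:.3f})")
print(f"posterior predictive:  N({pred_mean:.3f}, {pred_var:.3f})")
```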

  5. Choice Of Priors
  • In any Bayesian inference model, what is essential is which type of prior knowledge (if any) is conveyed in the prior.
  • Subjective priors: the prior encapsulates information as fully as possible, using previous experimental data or expert knowledge. Conjugate priors in the exponential family are subjective priors:
  $f(\theta \mid \tilde{\nu}) = p(\theta \mid y) \propto f(\theta \mid \nu)\, p(y \mid \theta)$
  The likelihood function of an exponential-family model, assuming the n data points arrive independent and identically distributed, is
  $p(y_i \mid \theta) = g(\theta)\, f(y_i)\, \exp\{\phi(\theta)^{\top} u(y_i)\}$
  where $g(\theta)$ is a normalization constant, $\phi(\theta)$ is the vector of natural parameters, and $u(y_i)$ are the sufficient statistics. The conjugate prior is
  $p(\theta \mid \eta, \nu) = h(\eta, \nu)\, g(\theta)^{\eta}\, \exp\{\phi(\theta)^{\top} \nu\}$
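To make the notation concrete, the following worked example (mine, not from the slides) writes a Bernoulli likelihood in this exponential-family form and identifies the corresponding conjugate prior:

```latex
% Bernoulli likelihood in the exponential-family form of the slide
% (illustrative example, not taken from the lecture):
\begin{align*}
p(y_i \mid \theta) &= \theta^{y_i}(1-\theta)^{1-y_i}
  = \underbrace{(1-\theta)}_{g(\theta)} \cdot \underbrace{1}_{f(y_i)}
    \cdot \exp\Big\{ \underbrace{\ln\tfrac{\theta}{1-\theta}}_{\phi(\theta)}\,
                     \underbrace{y_i}_{u(y_i)} \Big\}, \\
p(\theta \mid \eta, \nu) &\propto g(\theta)^{\eta} \exp\{\phi(\theta)\,\nu\}
  = (1-\theta)^{\eta}\Big(\tfrac{\theta}{1-\theta}\Big)^{\nu}
  = \theta^{\nu}(1-\theta)^{\eta-\nu},
\end{align*}
% i.e. the conjugate prior is a Beta(nu+1, eta-nu+1) density, and h(eta, nu)
% is its normalizing constant.
```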

  6. Choice Of Priors
  • For a conjugate prior in the exponential family, the posterior distribution is
  $p(\theta \mid y) \propto p(\theta \mid \eta, \nu) \prod_{i=1}^{n} p(y_i \mid \theta) \propto g(\theta)^{\eta + n}\, \exp\{\phi(\theta)^{\top} \nu\} \prod_{i=1}^{n} f(y_i)\, \exp\{\phi(\theta)^{\top} u(y_i)\} \propto p(\theta \mid \tilde{\eta}, \tilde{\nu})$
  with updated hyperparameters $\tilde{\eta} = \eta + n$ and $\tilde{\nu} = \nu + \sum_{i=1}^{n} u(y_i)$.
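A minimal sketch of this conjugate update in code, continuing the Bernoulli/Beta example above (the prior hyperparameters and coin-flip data are made up):

```python
import numpy as np

# Conjugate update in the (eta, nu) parameterization of the slide, for the
# Bernoulli example above. Posterior hyperparameters:
#   eta_tilde = eta + n,   nu_tilde = nu + sum_i u(y_i),   with u(y_i) = y_i.

eta, nu = 2.0, 1.0                      # prior = Beta(nu+1, eta-nu+1) = Beta(2, 2)
y = np.array([1, 0, 1, 1, 0, 1, 1, 1])  # observed coin flips (made up)
n = len(y)

eta_tilde = eta + n
nu_tilde = nu + y.sum()                 # sufficient statistics accumulate

# Translate back to standard Beta parameters to inspect the posterior.
a, b = nu_tilde + 1, eta_tilde - nu_tilde + 1
print(f"posterior ~ Beta({a:.0f}, {b:.0f}), mean = {a / (a + b):.3f}")
```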

  7. Choice Of Priors
  • Objective priors: instead of attempting to encapsulate rich knowledge into the prior, the objective Bayesian tries to impart as little information as possible, in an attempt to allow the data to carry as much weight as possible in the posterior distribution. One class of noninformative priors are reference priors.
  • Hierarchical priors: utilize hierarchical modeling to transfer the reference-prior problem to a 'higher level' of the model. Hierarchical models allow a more "objective" approach to inference by estimating the parameters of prior distributions from data rather than requiring them to be specified using subjective information.

  8. Approximate Inference
  Probabilistic model: $p(x, \theta) = p(x \mid \theta)\, p(\theta)$
  • Variational Inference: approximate the posterior, $p(\theta \mid x) \approx q(\theta) \in \mathcal{Q}$; biased, but faster and more scalable.
  • Markov Chain Monte Carlo: draw samples from the unnormalized $p(\theta \mid x)$; unbiased, but needs a lot of samples.
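For contrast with the variational approach developed on the next slides, here is a minimal sketch of the MCMC column: a random-walk Metropolis sampler that only needs the unnormalized posterior. The bimodal target, step size, and chain length are illustrative choices of mine, not from the lecture.

```python
import numpy as np

# Random-walk Metropolis targeting an unnormalized posterior p(theta | x)
# (toy illustration; the target density and tuning constants are made up).
def log_unnorm_post(theta):
    # e.g. an unnormalized bimodal posterior over theta
    return np.logaddexp(-0.5 * (theta - 2.0) ** 2, -0.5 * (theta + 2.0) ** 2)

rng = np.random.default_rng(1)
samples, theta = [], 0.0
for _ in range(20000):
    prop = theta + 0.8 * rng.standard_normal()          # symmetric proposal
    if np.log(rng.uniform()) < log_unnorm_post(prop) - log_unnorm_post(theta):
        theta = prop                                     # accept
    samples.append(theta)                                # asymptotically unbiased, but needs many samples

samples = np.array(samples[5000:])                       # drop burn-in
print(f"posterior mean ~ {samples.mean():.3f}, std ~ {samples.std():.3f}")
```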

  9. Mathematical magic
  Consider a model with hidden variables $\boldsymbol{x} = x_1, \ldots, x_n$ and observed variables $\boldsymbol{y} = y_1, y_2, \ldots, y_n$, where the stochastic dependency between the variables is governed by the parameters $\theta$:
  $\mathcal{L}(\theta) \equiv \ln p(\boldsymbol{y} \mid \theta) = \sum_{i=1}^{n} \ln p(y_i \mid \theta) = \sum_{i=1}^{n} \ln \int dx_i\, p(x_i, y_i \mid \theta)$

  10. Mathematical magic
  Multiply and divide inside each integral by a distribution $q_{x_i}(x_i)$ over the hidden variable:
  $\mathcal{L}(\theta) = \sum_{i=1}^{n} \ln \int dx_i\, q_{x_i}(x_i)\, \frac{p(x_i, y_i \mid \theta)}{q_{x_i}(x_i)} = \sum_{i=1}^{n} \ln \mathbb{E}_{q_{x_i}}\!\left[\frac{p(x_i, y_i \mid \theta)}{q_{x_i}(x_i)}\right]$
  Jensen's inequality for a concave function $f$ is $f(\mathbb{E}_{q}[x]) \ge \mathbb{E}_{q}[f(x)]$.
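A quick numerical check of Jensen's inequality for the concave function $\ln$ (toy numbers, not from the lecture):

```python
import numpy as np

# Numerical check of Jensen's inequality for the concave function ln:
# ln E_q[X] >= E_q[ln X] for a positive random variable X (toy example).
rng = np.random.default_rng(2)
x = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)   # any positive X works

lhs = np.log(x.mean())        # ln E[X]
rhs = np.log(x).mean()        # E[ln X]
print(f"ln E[X] = {lhs:.4f}  >=  E[ln X] = {rhs:.4f}")
assert lhs >= rhs
```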

  11. Mathematical magic
  Since $\ln$ is concave, Jensen's inequality gives a lower bound:
  $\mathcal{L}(\theta) = \sum_{i=1}^{n} \ln \mathbb{E}_{q_{x_i}}\!\left[\frac{p(x_i, y_i \mid \theta)}{q_{x_i}(x_i)}\right] \ge \sum_{i=1}^{n} \mathbb{E}_{q_{x_i}}\!\left[\ln \frac{p(x_i, y_i \mid \theta)}{q_{x_i}(x_i)}\right]$

  12. Mathematical magic
  Writing the expectation as an integral:
  $\mathcal{L}(\theta) \ge \sum_{i=1}^{n} \mathbb{E}_{q_{x_i}}\!\left[\ln \frac{p(x_i, y_i \mid \theta)}{q_{x_i}(x_i)}\right] = \sum_{i=1}^{n} \int dx_i\, q_{x_i}(x_i) \ln \frac{p(x_i, y_i \mid \theta)}{q_{x_i}(x_i)}$

  13. Mathematical magic
  Splitting the logarithm expresses the bound as an expected log-joint plus an entropy term:
  $\mathcal{L}(\theta) \ge \sum_{i=1}^{n} \int dx_i\, q_{x_i}(x_i) \ln \frac{p(x_i, y_i \mid \theta)}{q_{x_i}(x_i)} = \sum_{i=1}^{n} \left[ \int dx_i\, q_{x_i}(x_i) \ln p(x_i, y_i \mid \theta) - \int dx_i\, q_{x_i}(x_i) \ln q_{x_i}(x_i) \right] \equiv \mathcal{F}(q_{x_1}(x_1), \ldots, q_{x_n}(x_n), \theta)$

  14. The Variational Lower Bound
  • The (negative) variational free energy $\mathcal{F}(q_x(x), \theta)$, or the evidence lower bound (ELBO): the expected log-joint under $q_x(x)$ plus the entropy of $q_x(x)$. It decomposes as
  $\mathcal{F}(q_x(x), \theta) = \sum_{i=1}^{n} \int dx_i\, q_{x_i}(x_i) \ln \frac{p(x_i, y_i \mid \theta)}{q_{x_i}(x_i)}$
  $= \sum_{i} \int dx_i\, q_{x_i}(x_i) \ln p(y_i \mid \theta) + \sum_{i} \int dx_i\, q_{x_i}(x_i) \ln \frac{p(x_i \mid y_i, \theta)}{q_{x_i}(x_i)}$
  $= \sum_{i} \ln p(y_i \mid \theta) - \sum_{i} D_{KL}[\, q_{x_i}(x_i) \,\|\, p(x_i \mid y_i, \theta) \,]$
  The last term is the KL divergence that we need for VI.
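The identity on this slide can be checked numerically on a toy model with a single observation and a binary hidden variable; the probabilities below are illustrative. The sketch evaluates $\mathcal{F}(q, \theta)$ directly and via $\ln p(y \mid \theta) - D_{KL}$, and shows the bound is tight when $q$ equals the true posterior.

```python
import numpy as np

# Toy check of F(q, theta) = ln p(y | theta) - KL(q || p(x | y, theta))
# for one observation y = 1 and a binary hidden variable x (made-up numbers).
p_x = np.array([0.6, 0.4])                 # prior p(x | theta)
p_y_given_x = np.array([0.2, 0.9])         # p(y = 1 | x, theta)

p_xy = p_x * p_y_given_x                   # joint p(x, y = 1 | theta)
log_evidence = np.log(p_xy.sum())          # ln p(y = 1 | theta)
posterior = p_xy / p_xy.sum()              # p(x | y = 1, theta)

def elbo(q):
    return np.sum(q * (np.log(p_xy) - np.log(q)))

def kl(q, p):
    return np.sum(q * (np.log(q) - np.log(p)))

q = np.array([0.5, 0.5])                   # an arbitrary variational distribution
print(f"F(q)         = {elbo(q):.4f}")
print(f"ln p(y) - KL = {log_evidence - kl(q, posterior):.4f}")                  # identical
print(f"F(posterior) = {elbo(posterior):.4f} = ln p(y) = {log_evidence:.4f}")   # bound is tight
```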

  15. ELBO = Evidence Lower BOund
  $\ln p(y \mid \theta) = \mathcal{L}(q, \theta) + D_{KL}(\, q(x) \,\|\, p(x \mid y, \theta) \,)$
  where $\ln p(y \mid \theta)$ is the (log) evidence and $\mathcal{L}(q, \theta)$ is the variational lower bound $\mathcal{F}$ from the previous slide. By Bayes' rule,
  $p(x \mid y, \theta) = \frac{p(y \mid x, \theta)\, p(x \mid \theta)}{\int p(y \mid x, \theta)\, p(x \mid \theta)\, dx} = \frac{\text{Likelihood} \times \text{Prior}}{\text{Evidence}}, \qquad \text{Evidence} = p(y \mid \theta) = \int p(y \mid x, \theta)\, p(x \mid \theta)\, dx$
  The evidence of the probabilistic model is the total probability of observing the data.
  Lower bound: $D_{KL} \ge 0 \;\Rightarrow\; \ln p(y \mid \theta) \ge \mathcal{L}(q, \theta)$

  16. Kullback-Leibler Divergence
  • Properties
  • $D_{KL}(p \| q) = 0$ if and only if (iff) $p = q$ (they may differ on sets of probability zero)
  • $D_{KL}(p \| q) \ne D_{KL}(q \| p)$
  • $D_{KL}(p \| q) \ge 0$
  Non-negativity follows from Jensen's inequality:
  $-D_{KL}(q \| p) = \mathbb{E}_q\!\left[-\log \frac{q}{p}\right] = \mathbb{E}_q\!\left[\log \frac{p}{q}\right] \le \log \mathbb{E}_q\!\left[\frac{p}{q}\right] = \log \int q(x)\, \frac{p(x)}{q(x)}\, dx = \log \int p(x)\, dx = 0$
  Figure: blue, a mixture of Gaussians (fixed); green, the (unimodal) Gaussian $q$ that minimises $D_{KL}(q \| p)$; red, the (unimodal) Gaussian $q$ that minimises $D_{KL}(p \| q)$.
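A small discrete example (made-up distributions) illustrating the non-negativity and asymmetry properties above:

```python
import numpy as np

# Discrete illustration of two KL properties: non-negativity and asymmetry.
def kl(p, q):
    return np.sum(p * (np.log(p) - np.log(q)))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.3, 0.4, 0.3])

print(f"KL(p || q) = {kl(p, q):.4f}")   # >= 0
print(f"KL(q || p) = {kl(q, p):.4f}")   # >= 0, but a different value
print(f"KL(p || p) = {kl(p, p):.4f}")   # 0 iff the arguments match
```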

  17. Variational Inference
  • Optimization problem with an intractable posterior distribution:
  $q^{*} = \operatorname*{argmin}_{q(x) \in \mathcal{Q}} D_{KL}(\, q(x) \,\|\, p(x \mid y, \theta) \,)$
  Figure: the variational family $\mathcal{Q}$, with the approximation $q(x)$ chosen as the member closest in KL divergence to the true posterior $p(x \mid y, \theta)$.
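A minimal numerical sketch of this optimization, under assumptions of my own choosing: the variational family is a single Gaussian, the target is an unnormalized two-component Gaussian mixture (echoing the figure on the previous slide), and the ELBO is maximized by a coarse grid search with the expectations computed by quadrature.

```python
import numpy as np

# Sketch of VI as optimization: pick the Gaussian q that minimizes KL(q || p),
# or equivalently maximizes the ELBO, for a fixed unnormalized target p~.
# (Toy target and grid-search settings; a real implementation would use
# stochastic gradients rather than a grid.)

def log_p_tilde(x):
    # unnormalized bimodal target: two Gaussian components centred at +/- 4
    return np.logaddexp(-0.5 * (x - 4.0) ** 2, -0.5 * (x + 4.0) ** 2)

def log_q(x, m, s):
    # log density of the variational Gaussian q = N(m, s^2)
    return -0.5 * ((x - m) / s) ** 2 - np.log(s) - 0.5 * np.log(2 * np.pi)

grid = np.linspace(-15, 15, 6001)          # quadrature grid for the expectations
dx = grid[1] - grid[0]

def elbo(m, s):
    # ELBO = E_q[log p~(x)] - E_q[log q(x)]; maximizing it minimizes KL(q || p)
    # up to the constant log-normalizer of p~.
    q = np.exp(log_q(grid, m, s))
    return np.sum(q * (log_p_tilde(grid) - log_q(grid, m, s))) * dx

best = max(((elbo(m, s), m, s)
            for m in np.linspace(-6.0, 6.0, 61)
            for s in np.linspace(0.2, 5.0, 49)),
           key=lambda t: t[0])
print(f"best Gaussian q: mean = {best[1]:.2f}, std = {best[2]:.2f}")
# With modes this far apart, the reverse KL(q || p) objective is mode-seeking,
# so the search settles on one component instead of covering both.
```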
