
EM & Variational Bayes, Hanxiao Liu, September 9, 2014



  1. EM & Variational Bayes
     Hanxiao Liu
     September 9, 2014

  2. Outline
     1. EM Algorithm
        1.1 Introduction
        1.2 Example: Mixture of vMFs
     2. Variational Bayes
        2.1 Introduction
        2.2 Example: Bayesian Mixture of Gaussians

  3. MLE by Gradient Ascent

     Goal: maximize L(θ; X) = log p(X | θ) w.r.t. θ.

     Gradient Ascent (GA)
     ◮ One-step view: θ^{t+1} ← θ^t + ∇L(θ^t; X)
     ◮ Two-step view:
       1. Q(θ; θ^t) = L(θ^t; X) + ∇L(θ^t; X)^⊤ (θ − θ^t) − (1/2) ‖θ − θ^t‖₂²
       2. θ^{t+1} ← argmax_θ Q(θ; θ^t)

     Drawbacks
     1. ∇L can be too complicated to work with.
     2. Too general to be efficient for structured problems.
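
A minimal Python sketch of the one-step view above, assuming the caller supplies a grad_loglik(theta) function returning ∇L(θ; X). The function name, the step size lr (the slide's update uses a unit step), and the iteration count are illustrative assumptions, not part of the slides.

    import numpy as np

    def gradient_ascent_mle(grad_loglik, theta0, lr=0.1, n_iters=100):
        # One-step view: theta^{t+1} <- theta^t + lr * grad L(theta^t; X).
        # A step size is added for practicality; the slide corresponds to lr = 1.
        theta = np.asarray(theta0, dtype=float)
        for _ in range(n_iters):
            theta = theta + lr * grad_loglik(theta)
        return theta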

  4. MLE by EM

     Expectation-Maximization (EM)
     1. Expectation: Q(θ; θ^t) = E_{Z|X,θ^t} L(θ; X, Z)
     2. Maximization: θ^{t+1} ← argmax_θ Q(θ; θ^t)

     ◮ Replace L(θ; X) (the log-likelihood) by L(θ; X, Z) (the complete log-likelihood).
     ◮ L(θ; X, Z) is a random function w.r.t. Z; use its expectation as a surrogate.
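
As a rough sketch of the alternation (not the authors' code), the two steps can be wired together generically; e_step and m_step are hypothetical callables supplied by the concrete model.

    def em(e_step, m_step, theta0, n_iters=50):
        # e_step(theta) returns the expected sufficient statistics defining
        # Q(.; theta^t); m_step(stats) returns argmax_theta of that Q.
        theta = theta0
        for _ in range(n_iters):
            stats = e_step(theta)    # Expectation
            theta = m_step(stats)    # Maximization
        return theta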

  5. Why EM is superior

     A comparison between the two Q(θ; θ^t), i.e., the local concave models:
     1. EM:
        Q(θ; θ^t) = E_{Z|X,θ^t} L(θ; X, Z)
                  = L(θ; X) − D_KL( p(Z | X, θ^t) ‖ p(Z | X, θ) ) + C
     2. GA:
        Q(θ; θ^t) = L(θ^t; X) + ∇L(θ^t; X)^⊤ (θ − θ^t) − (1/2) ‖θ − θ^t‖₂²

  6. Example: vMF mixture

     Notations
     ◮ X = {x_i}_{i=1}^n,  θ = ( π ∈ ∆^{k−1}, {(µ_j, κ_j)}_{j=1}^k )
     ◮ Z = {z_ij ∈ {0, 1}}
     ◮ z_ij = 1 ⇒ x_i ∼ the j-th mixture component

     Log-likelihood

       L(θ; X) = Σ_{i=1}^n log p(x_i | θ) = Σ_{i=1}^n log Σ_{j=1}^k π_j vMF(x_i | µ_j, κ_j)

     (the log of a sum couples the components)

     Complete log-likelihood

       L(θ; X, Z) = Σ_{i=1}^n log p(x_i, z_i | θ) = Σ_{i=1}^n Σ_{j=1}^k z_ij log( π_j vMF(x_i | µ_j, κ_j) )
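
A sketch of the incomplete log-likelihood above in Python/NumPy. The helper vmf_logpdf is hypothetical; it evaluates the normalizer C_p(κ) quoted on slide 11 via SciPy's exponentially scaled Bessel function for numerical stability.

    import numpy as np
    from scipy.special import ive  # exponentially scaled modified Bessel I_v

    def vmf_logpdf(X, mu, kappa):
        # log vMF(x | mu, kappa) for unit-norm rows of X, dimension p.
        p = X.shape[1]
        # log C_p(kappa); note I_v(k) = ive(v, k) * exp(k) for k > 0.
        log_cp = ((p / 2 - 1) * np.log(kappa)
                  - (p / 2) * np.log(2 * np.pi)
                  - (np.log(ive(p / 2 - 1, kappa)) + kappa))
        return log_cp + kappa * X @ mu

    def mixture_loglik(X, pis, mus, kappas):
        # L(theta; X) = sum_i log sum_j pi_j vMF(x_i | mu_j, kappa_j)
        logp = np.stack([np.log(pis[j]) + vmf_logpdf(X, mus[j], kappas[j])
                         for j in range(len(pis))], axis=1)   # n x k
        m = logp.max(axis=1, keepdims=True)                   # log-sum-exp trick
        return float((m[:, 0] + np.log(np.exp(logp - m).sum(axis=1))).sum())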

  7. E-step

     Compute Q(θ; θ^t) ≜ E_{Z|X,θ^t} L(θ; X, Z):

       Q(π, µ, κ; π^t, µ^t, κ^t)
         = E_{Z|X,π^t,µ^t,κ^t} Σ_{i=1}^n Σ_{j=1}^k z_ij log( π_j vMF(x_i | µ_j, κ_j) )
         = Σ_{i=1}^n Σ_{j=1}^k [ w^t_ij log vMF(x_i | µ_j, κ_j) + w^t_ij log π_j ]

     where

       w^t_ij = E_{Z|X,π^t,µ^t,κ^t}[z_ij] = p(z_ij = 1 | x_i, π^t, µ^t, κ^t)
              = π^t_j · vMF(x_i | µ^t_j, κ^t_j) / Σ_{u=1}^k π^t_u · vMF(x_i | µ^t_u, κ^t_u)
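
A sketch of the responsibility computation w^t_ij, continuing the previous block (numpy as np and the hypothetical vmf_logpdf are assumed in scope); the normalization is done in log space for stability.

    import numpy as np

    def e_step(X, pis, mus, kappas):
        # w_ij = pi_j vMF(x_i|mu_j,kappa_j) / sum_u pi_u vMF(x_i|mu_u,kappa_u)
        logp = np.stack([np.log(pis[j]) + vmf_logpdf(X, mus[j], kappas[j])
                         for j in range(len(pis))], axis=1)   # n x k
        logp -= logp.max(axis=1, keepdims=True)               # stabilize exp
        W = np.exp(logp)
        return W / W.sum(axis=1, keepdims=True)               # rows sum to 1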

  8. M-step

     Maximize

       Q(π, µ, κ; π^t, µ^t, κ^t) = Σ_{i=1}^n Σ_{j=1}^k [ w^t_ij log vMF(x_i | µ_j, κ_j) + w^t_ij log π_j ]

     w.r.t. π, µ and κ, subject to |π|_1 = 1 and ‖µ_j‖_2 = 1 for all j ∈ [k].

     To impose the constraints, maximize the Lagrangian

       Q̃ ≜ Q + λ (1 − π^⊤ 1) + Σ_{j=1}^k ν_j (1 − µ_j^⊤ µ_j)

  9. M-step

       Q̃(π, µ, κ; π^t, µ^t, κ^t) = Σ_{i=1}^n Σ_{j=1}^k [ w^t_ij log vMF(x_i | µ_j, κ_j) + w^t_ij log π_j ]
                                     + λ (1 − π^⊤ 1) + Σ_{j=1}^k ν_j (1 − µ_j^⊤ µ_j)

     Updating π_j
     Combining Σ_{j=1}^k π_j = 1 and Σ_{j=1}^k w^t_ij = 1 with
       ∂_{π_j} Q̃ = Σ_{i=1}^n w^t_ij / π_j − λ = 0
     gives
       π^{t+1}_j ← ( Σ_{i=1}^n w^t_ij ) / n
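
The resulting update in code, assuming W is the n × k responsibility matrix returned by the E-step sketch above:

    def update_pi(W):
        # pi_j^{t+1} = (1/n) * sum_i w_ij
        return W.mean(axis=0)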

  10. M-step

        Q̃(π, µ, κ; π^t, µ^t, κ^t) = Σ_{i=1}^n Σ_{j=1}^k [ w^t_ij log vMF(x_i | µ_j, κ_j) + w^t_ij log π_j ]
                                      + λ (1 − π^⊤ 1) + Σ_{j=1}^k ν_j (1 − µ_j^⊤ µ_j)

      Updating µ_j
        log vMF(x_i | µ_j, κ_j) = κ_j µ_j^⊤ x_i + C   (w.r.t. µ_j)
        ∂_{µ_j} Q̃ = κ_j Σ_{i=1}^n w^t_ij x_i − ν_j µ_j = 0
        ⇒ µ^{t+1}_j ← r_j / ‖r_j‖_2,  where r_j = Σ_{i=1}^n w^t_ij x_i
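
The mean-direction update, continuing the sketches above with the same hypothetical W (n × k) and X (n × p, unit-norm rows):

    import numpy as np

    def update_mu(W, X):
        # r_j = sum_i w_ij x_i ;  mu_j^{t+1} = r_j / ||r_j||_2
        R = W.T @ X                                           # k x p, rows r_j
        return R / np.linalg.norm(R, axis=1, keepdims=True)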

  11. M-step

      Updating κ_j
      ◮ C_p(κ_j) = κ_j^{p/2 − 1} / ( (2π)^{p/2} I_{p/2 − 1}(κ_j) )
      ◮ the recurrence property of the modified Bessel function¹:
          ∂_{κ_j} log I_{p/2 − 1}(κ_j) = I_{p/2}(κ_j) / I_{p/2 − 1}(κ_j) + (p/2 − 1) / κ_j

        ∂_{κ_j} Q̃ = Σ_{i=1}^n w^t_ij ( − I_{p/2}(κ_j) / I_{p/2 − 1}(κ_j) + µ_j^⊤ x_i ) = 0
        ⇒ I_{p/2}(κ_j) / I_{p/2 − 1}(κ_j) = r̄_j
        ⇒ κ^{t+1}_j = [?]   (no closed form; use the approximation)
           κ^{t+1}_j ≈ ( r̄_j p − r̄_j³ ) / ( 1 − r̄_j² )

      where r̄_j = Σ_{i=1}^n w^t_ij µ_j^⊤ x_i / Σ_{i=1}^n w^t_ij

      ¹ http://functions.wolfram.com/Bessel-TypeFunctions/BesselK/introductions/Bessels/05/
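
A sketch of the approximate concentration update quoted on the slide, continuing the blocks above; mus is the k × p matrix of current mean directions.

    import numpy as np

    def update_kappa(W, X, mus):
        # r_bar_j = sum_i w_ij mu_j^T x_i / sum_i w_ij
        r_bar = np.einsum('ij,jd,id->j', W, mus, X) / W.sum(axis=0)
        p = X.shape[1]
        # kappa_j^{t+1} ~ (r_bar_j * p - r_bar_j^3) / (1 - r_bar_j^2)
        return (r_bar * p - r_bar ** 3) / (1 - r_bar ** 2)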

  12. An alternative view of EM

      EM, original definition:
      1. Expectation: Q(θ; θ^t) = E_{Z|X,θ^t} L(θ; X, Z)   (why?)
      2. Maximization: θ^{t+1} ← argmax_θ Q(θ; θ^t)

      Decompose the log-likelihood under any distribution q(Z):

        L(θ; X) = E_q log p(X | θ)
                = E_q log[ p(X, Z | θ) / q(Z) ] + E_q log[ q(Z) / p(Z | X, θ) ]
                = VLB(q, θ) + D_KL( q(Z) ‖ p(Z | X, θ) )

      EM, coordinate ascent:
      1. q^{t+1} = argmax_q VLB(q, θ^t)
      2. θ^{t+1} = argmax_θ VLB(q^{t+1}, θ)

      Show the equivalence?
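
A toy numerical check of the decomposition L(θ; X) = VLB(q, θ) + D_KL(q ‖ p(Z | X, θ)) for a single observation with a discrete latent variable; all numbers here are made up purely for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    K = 4
    joint = 0.3 * rng.dirichlet(np.ones(K))       # p(x0, z) for one fixed x0
    log_evidence = np.log(joint.sum())            # L(theta; x0) = log p(x0 | theta)
    posterior = joint / joint.sum()               # p(z | x0, theta)

    q = rng.dirichlet(np.ones(K))                 # an arbitrary q(z)
    vlb = np.sum(q * (np.log(joint) - np.log(q)))        # E_q log p(x0, z)/q(z)
    kl = np.sum(q * (np.log(q) - np.log(posterior)))     # D_KL(q || p(z | x0))

    assert np.isclose(log_evidence, vlb + kl)     # the decomposition holds exactly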

  13. Bayes Inference

      Notations
      ◮ θ: hyperparameters
      ◮ Z: hidden variables + random parameters

      Goals
      1. find a good posterior q(Z) ≈ p(Z | X; θ)
      2. estimate θ by Empirical Bayes, i.e., maximize L(θ; X) w.r.t. θ

        L(θ; X) = E_q log[ p(X, Z | θ) / q(Z) ] + E_q log[ q(Z) / p(Z | X, θ) ]
                = VLB(q, θ) + D_KL( q(Z) ‖ p(Z | X, θ) )

      Both goals can be achieved via the same procedure as EM.

  14. Variational Bayes Inference

      One should have q → p(Z | X, θ*) by alternating between
      1. q^{t+1} = argmax_q VLB(q, θ^t)
      2. θ^{t+1} = argmax_θ VLB(q^{t+1}, θ)

      However, we do not want q to be too complicated
      ◮ e.g., Q(θ; θ^t) = E_q L(θ; X, Z) can be intractable

      Solution: modify the first step as
        q^{t+1} = argmax_{q ∈ Q} VLB(q, θ^t)
      where Q is some tractable family of distributions.
      ◮ Recall: without the restriction to Q, q^{t+1} ≡ p(Z | X, θ^t)

  15. Variational Bayes Inference

      Goal: solve argmax_{q ∈ Q} VLB(q, θ^t)

      Usually Q = { q | q(Z) = Π_{i=1}^M q_i(Z_i) } ≜ Π_{i=1}^M q_i   (mean-field factorization)

      Coordinate ascent (optimize one factor q_j with the others q_{−j} fixed):

        VLB(q_j; q_{−j}, θ^t) = E_q [ log p(X, Z; θ^t) / q(Z) ]
                              = E_q log p(X, Z; θ^t) − Σ_{i=1}^M E_q log q_i
                              = E_{q_j} [ E_{q_{−j}} log p(X, Z; θ^t) ] − E_{q_j} log q_j + C
                              = − D_KL( q_j ‖ q*_j ) + C

      where the optimal factor satisfies

        log q*_j = E_{q_{−j}} log p(X, Z; θ^t) + const
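
A generic sketch of the coordinate ascent loop implied by the update rule above; update_factor(j, q) is a hypothetical callable that returns new parameters for q_j by solving log q*_j = E_{q_{−j}} log p(X, Z; θ^t) + const for the model at hand.

    def cavi(update_factor, q0, n_sweeps=100):
        # q0: list of parameter objects for the factors q_1, ..., q_M.
        q = list(q0)
        for _ in range(n_sweeps):
            for j in range(len(q)):
                q[j] = update_factor(j, q)   # one coordinate update per factor
        return q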

  16. Example: Bayes Mixture of Gaussians

      Consider putting a prior over the means in a Gaussian mixture²:
      ◮ For k = 1, 2, ..., K:  µ_k ∼ N(0, τ²)
      ◮ For i = 1, 2, ..., N:
        1. z_i ∼ Mult(π)
        2. x_i ∼ N(µ_{z_i}, σ²)

      The exact posterior is intractable because of the integral in the evidence:

        p(z, µ | X) = p(X | z, µ) p(z) p(µ) / p(X)
                    = Π_{i=1}^N p(z_i) p(x_i | z_i, µ) Π_{k=1}^K p(µ_k)
                      / ∫ Σ_z Π_{i=1}^N p(z_i) p(x_i | z_i, µ) Π_{k=1}^K p(µ_k) dµ

      Mean-field variational family:

        q(z, µ) = Π_{i=1}^N q(z_i; φ_i) Π_{k=1}^K q(µ_k; µ̃_k, σ̃²_k)

      ² https://www.cs.princeton.edu/courses/archive/fall11/cos597C/lectures/variational-inference-i.pdf
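
A small sampler for the generative model on this slide, restricted to 1-D observations to keep the sketch short; the default values of τ, σ, and π are arbitrary assumptions.

    import numpy as np

    def sample_bmog(N=500, K=3, tau=5.0, sigma=1.0, pi=None, seed=0):
        rng = np.random.default_rng(seed)
        pi = np.full(K, 1.0 / K) if pi is None else np.asarray(pi)
        mu = rng.normal(0.0, tau, size=K)        # mu_k ~ N(0, tau^2)
        z = rng.choice(K, size=N, p=pi)          # z_i ~ Mult(pi)
        x = rng.normal(mu[z], sigma)             # x_i ~ N(mu_{z_i}, sigma^2)
        return x, z, mu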

  17. Example: Bayes Mixture of Gaussians

      Update for q(z_j):

        log q*(z_j) = E_{q∖z_j} log p(z, µ, X)
                    = E_{q∖z_j} [ Σ_{i=1}^N ( log p(z_i) + log p(x_i | z_i, µ) ) + Σ_{k=1}^K log p(µ_k) ] + C
                    = log p(z_j) + E_{q(µ_{z_j})} log p(x_j | z_j, µ_{z_j}) + C
                    = log π_{z_j} + x_j E_{q(µ_{z_j})}[µ_{z_j}] − (1/2) E_{q(µ_{z_j})}[µ²_{z_j}] + C

      where E_{q(µ_{z_j})}[µ_{z_j}] = µ̃_{z_j} and E_{q(µ_{z_j})}[µ²_{z_j}] = µ̃²_{z_j} + σ̃²_{z_j}.

      Since q*(z_j) is multinomial, we can update φ_j accordingly.
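
The corresponding φ update in code, following the expression above with σ² = 1 as the slide implicitly assumes; mu_tilde and sigma2_tilde are hypothetical arrays holding the current parameters of the q(µ_k) factors.

    import numpy as np

    def update_phi(x, pis, mu_tilde, sigma2_tilde):
        # phi_{i,k} propto pi_k * exp( x_i * E[mu_k] - 0.5 * E[mu_k^2] )
        e_mu = mu_tilde                               # E_q[mu_k]
        e_mu2 = mu_tilde ** 2 + sigma2_tilde          # E_q[mu_k^2]
        log_phi = (np.log(pis)[None, :]
                   + np.outer(x, e_mu)
                   - 0.5 * e_mu2[None, :])
        log_phi -= log_phi.max(axis=1, keepdims=True) # stabilize exp
        phi = np.exp(log_phi)
        return phi / phi.sum(axis=1, keepdims=True)   # normalize each row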

  18. Example: Bayes Mixture of Gaussians

      Update for q(µ_j):

        log q*(µ_j) = E_{q∖µ_j} log p(z, µ, X)
                    = E_{q∖µ_j} [ Σ_{i=1}^N ( log p(z_i) + log p(x_i | z_i, µ_{z_i}) ) + Σ_{k=1}^K log p(µ_k) ]
                    = E_{q∖µ_j} Σ_{i=1}^N Σ_{k=1}^K δ_{z_i = k} log N(x_i | µ_k) + log p(µ_j) + C
                    = Σ_{i=1}^N E_{z_i}[δ_{z_i = j}] log N(x_i | µ_j) + log p(µ_j) + C

      where E_{z_i}[δ_{z_i = j}] = φ_i^j.

      Observing that q*(µ_j) is Gaussian, µ̃_j and σ̃²_j can be updated accordingly.
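
And the Gaussian factor update, continuing the previous block. The slide only notes that q*(µ_j) is Gaussian; the closed form below is the standard conjugate completion of the square and should be read as a sketch under that assumption.

    def update_mu_factors(x, phi, tau2, sigma2=1.0):
        # log q*(mu_k) = sum_i phi_{i,k} log N(x_i | mu_k, sigma2)
        #              + log N(mu_k | 0, tau2) + C
        nk = phi.sum(axis=0)                          # sum_i phi_{i,k}
        sk = phi.T @ x                                # sum_i phi_{i,k} x_i
        sigma2_tilde = 1.0 / (1.0 / tau2 + nk / sigma2)
        mu_tilde = sigma2_tilde * sk / sigma2
        return mu_tilde, sigma2_tilde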

  19. Stay tuned

      Next topics
      ◮ LDA (Wanli)
      ◮ Bayes vMF
