Variational Inference for Bayes vMF Mixture, Hanxiao Liu, September 23, 2014 - PowerPoint PPT Presentation

  1. Variational Inference for Bayes vMF Mixture. Hanxiao Liu. September 23, 2014.

  2. Variational Inference Review
     Lower bound the likelihood:
       L(θ; X) = E_q log p(X|θ)
               = E_q log [ p(X, Z|θ) / q(Z) ] + E_q log [ q(Z) / p(Z|X, θ) ]
               = VLB(q, θ) + D_KL( q(Z) || p(Z|X, θ) )
     Raise VLB(q, θ) by coordinate ascent:
       1. q^{t+1} = argmax_{q = ∏_{i=1}^M q_i} VLB(q, θ^t)
       2. θ^{t+1} = argmax_θ VLB(q^{t+1}, θ)
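
As a sanity check on the decomposition above, here is a minimal numerical example; the probabilities are made up, not from the deck. For a single observation with a binary latent variable, the evidence splits exactly into the variational lower bound plus the KL term.

```python
import numpy as np

# Toy check of log p(X|theta) = VLB(q, theta) + D_KL(q(Z) || p(Z|X, theta))
# for one observation with a binary latent variable (numbers are made up).
p_xz = np.array([0.15, 0.35])    # p(x, z=0 | theta), p(x, z=1 | theta)
q_z  = np.array([0.6, 0.4])      # an arbitrary variational distribution q(Z)

log_px = np.log(p_xz.sum())                                # log p(x | theta)
vlb    = np.sum(q_z * np.log(p_xz / q_z))                  # E_q log[p(x, Z)/q(Z)]
kl     = np.sum(q_z * np.log(q_z / (p_xz / p_xz.sum())))   # KL(q || p(Z|x))

assert np.isclose(log_px, vlb + kl)                        # decomposition holds exactly
```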

  3. Variational Inference Review
     Goal: solve argmax_{q = ∏_{i=1}^M q_i} VLB(q, θ^t) by coordinate ascent, i.e. by sequentially updating a single q_i in each iteration.
     Each coordinate step has a closed-form solution:
       VLB(q_j; q_{-j}, θ^t) = E_q log [ p(X, Z|θ^t) / q(Z) ]
                             = E_q log p(X, Z|θ^t) - ∑_{i=1}^M E_q log q_i
                             = E_{q_j} [ E_{q_{-j}} log p(X, Z|θ^t) ] - E_{q_j} log q_j + const     (define log q̃_j := E_{q_{-j}} log p(X, Z|θ^t))
                             = E_{q_j} log ( q̃_j / q_j ) + const
                             = -D_KL( q_j || q̃_j ) + const
       ⇒ log q_j* = E_{q_{-j}} log p(X, Z|θ^t) + const

  4. Bayes vMF Mixture [Gopal and Yang, 2014]
     Generative model (left) and variational family (right):
     - π ∼ Dirichlet(·|α)                  q(π) ≡ Dirichlet(·|ρ)
     - µ_k ∼ vMF(·|µ_0, C_0)               q(µ_k) ≡ vMF(·|ψ_k, γ_k)
     - κ_k ∼ logNormal(·|m, σ^2)           q(κ_k) ≡ logNormal(·|a_k, b_k)
     - z_i ∼ Multi(·|π)                    q(z_i) ≡ Multi(·|λ_i)
     - x_i ∼ vMF(·|µ_{z_i}, κ_{z_i})
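
For reference, one possible way to organize these quantities in code is with two plain containers; the field names simply mirror the symbols on the slide and are an implementation choice, not part of the original deck.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Hyperparams:            # theta in the deck
    alpha: float              # Dirichlet concentration for pi
    mu0: np.ndarray           # vMF prior mean direction for mu_k, shape (D,)
    C0: float                 # vMF prior concentration for mu_k
    m: float                  # logNormal prior mean of log kappa_k
    sigma2: float             # logNormal prior variance of log kappa_k

@dataclass
class VariationalParams:      # parameters of the factorized q
    rho: np.ndarray           # Dirichlet(rho) for q(pi), shape (K,)
    lam: np.ndarray           # Multinomial weights for q(z_i), shape (N, K)
    psi: np.ndarray           # vMF mean directions for q(mu_k), shape (K, D)
    gamma: np.ndarray         # vMF concentrations for q(mu_k), shape (K,)
    a: np.ndarray             # logNormal means for q(kappa_k), shape (K,)
    b: np.ndarray             # logNormal variances for q(kappa_k), shape (K,)
```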

  5. Compute log p(X, Z|θ)
       p(X, Z|θ) = Dirichlet(π|α) × ∏_{i=1}^N Multi(z_i|π) vMF(x_i | µ_{z_i}, κ_{z_i}) × ∏_{k=1}^K vMF(µ_k | µ_0, C_0) logNormal(κ_k | m, σ^2)
       log p(X, Z|θ) = -log B(α) + ∑_{k=1}^K (α - 1) log π_k
                       + ∑_{i=1}^N ∑_{k=1}^K z_ik log π_k
                       + ∑_{i=1}^N ∑_{k=1}^K z_ik ( log C_D(κ_k) + κ_k x_i^⊤ µ_k )
                       + ∑_{k=1}^K ( log C_D(C_0) + C_0 µ_k^⊤ µ_0 )
                       + ∑_{k=1}^K ( -(log κ_k - m)^2 / (2σ^2) - log κ_k - (1/2) log(2πσ^2) )
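
The vMF log-normalizer log C_D(κ) appears repeatedly in the updates that follow. A small helper, assuming the usual convention C_D(κ) = κ^{D/2-1} / ((2π)^{D/2} I_{D/2-1}(κ)), might look like this; scipy's exponentially scaled Bessel function keeps it stable for large κ.

```python
import numpy as np
from scipy.special import ive  # exponentially scaled modified Bessel function I_v

def log_C_D(kappa, D):
    """log of the vMF normalizer C_D(kappa) = kappa^{D/2-1} / ((2 pi)^{D/2} I_{D/2-1}(kappa)).

    Uses ive(v, x) = I_v(x) * exp(-x), so log I_v(x) = log(ive(v, x)) + x stays
    finite even for large kappa.
    """
    v = D / 2.0 - 1.0
    kappa = np.asarray(kappa, dtype=float)
    log_bessel = np.log(ive(v, kappa)) + kappa
    return v * np.log(kappa) - (D / 2.0) * np.log(2.0 * np.pi) - log_bessel
```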

  6. Updating q(π)
     q(π) ≡ Dirichlet(·|ρ)
       log q*(π) = E_{q\π} log p(X, Z|θ) + const
                 = E_{q\π} [ ∑_{k=1}^K (α - 1) log π_k + ∑_{i=1}^N ∑_{k=1}^K z_ik log π_k ] + const
                 = ∑_{k=1}^K ( α + ∑_{i=1}^N E_q[z_ik] - 1 ) log π_k + const
       ⇒ q*(π) ∝ ∏_{k=1}^K π_k^{α + ∑_{i=1}^N E_q[z_ik] - 1} ∼ Dirichlet
       ⇒ ρ_k* = α + ∑_{i=1}^N E_q[z_ik]
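
In code this update is a one-line sketch, where `lam` holds E_q[z_ik] = λ_ik as an N×K array (names are illustrative, not from the deck).

```python
def update_q_pi(lam, alpha):
    """rho_k = alpha + sum_i E_q[z_ik], with E_q[z_ik] = lam[i, k]."""
    return alpha + lam.sum(axis=0)    # shape (K,)
```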

  7. Updating q(z_i)
     q(z_i) ≡ Multi(·|λ_i)
       log q*(z_i) = E_{q\z_i} log p(X, Z|θ) + const
                   = E_{q\z_i} [ ∑_{i=1}^N ∑_{k=1}^K z_ik log π_k + ∑_{i=1}^N ∑_{k=1}^K z_ik ( log C_D(κ_k) + κ_k x_i^⊤ µ_k ) ] + const
                   = ∑_{k=1}^K z_ik ( E_q log π_k + E_q log C_D(κ_k) + E_q[κ_k] x_i^⊤ E_q[µ_k] ) + const
       ⇒ q*(z_i) ∼ Multi,  λ_ik* ∝ exp( E_q log π_k + E_q log C_D(κ_k) + E_q[κ_k] x_i^⊤ E_q[µ_k] )
     Assume E_q log π_k, E_q log C_D(κ_k), E_q[κ_k] and E_q[µ_k] are already known. We will compute them explicitly later.
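
A vectorized sketch of this update, assuming the four expectations are passed in as arrays (they are obtained on the "Intermediate Quantities" slide); the names are illustrative.

```python
import numpy as np

def update_q_z(X, E_log_pi, E_log_CD, E_kappa, E_mu):
    """lambda_ik ∝ exp(E[log pi_k] + E[log C_D(kappa_k)] + E[kappa_k] x_i^T E[mu_k])."""
    # X: (N, D); E_mu: (K, D); E_log_pi, E_log_CD, E_kappa: (K,)
    logits = E_log_pi + E_log_CD + (X @ E_mu.T) * E_kappa   # (N, K)
    logits -= logits.max(axis=1, keepdims=True)             # stabilize the softmax
    lam = np.exp(logits)
    return lam / lam.sum(axis=1, keepdims=True)
```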

  8. Updating q(µ_k)
     q(µ_k) ≡ vMF(·|ψ_k, γ_k)
       log q*(µ_k) = E_{q\µ_k} log p(X, Z|θ) + const
                   = E_{q\µ_k} [ ∑_{i=1}^N ∑_{j=1}^K z_ij κ_j x_i^⊤ µ_j + ∑_{j=1}^K C_0 µ_j^⊤ µ_0 ] + const
                   = E_q[κ_k] ( ∑_{i=1}^N E_q[z_ik] x_i^⊤ ) µ_k + C_0 µ_k^⊤ µ_0 + const
       ⇒ q*(µ_k) ∝ exp{ ( E_q[κ_k] ∑_{i=1}^N E_q[z_ik] x_i + C_0 µ_0 )^⊤ µ_k } ∼ vMF
       ⇒ γ_k* = || E_q[κ_k] ∑_{i=1}^N E_q[z_ik] x_i + C_0 µ_0 ||,
          ψ_k* = ( E_q[κ_k] ∑_{i=1}^N E_q[z_ik] x_i + C_0 µ_0 ) / γ_k*
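
A possible implementation of this update, with `lam` again holding E_q[z_ik]; the natural-parameter vector on the slide is formed for all components at once.

```python
import numpy as np

def update_q_mu(X, lam, E_kappa, mu0, C0):
    """gamma_k = ||E[kappa_k] sum_i lam_ik x_i + C0 mu0||, psi_k = that vector / gamma_k."""
    # X: (N, D); lam: (N, K); E_kappa: (K,); mu0: (D,)
    natural = E_kappa[:, None] * (lam.T @ X) + C0 * mu0   # (K, D) natural parameters
    gamma = np.linalg.norm(natural, axis=1)               # (K,) concentrations
    psi = natural / gamma[:, None]                        # (K, D) unit mean directions
    return psi, gamma
```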

  9. Updating q(κ_k)
     q(κ_k) ≡ logNormal(·|a_k, b_k)?
       log q*(κ_k) = E_{q\κ_k} log p(X, Z|θ) + const
                   = E_{q\κ_k} [ ∑_{i=1}^N ∑_{j=1}^K z_ij ( log C_D(κ_j) + κ_j x_i^⊤ µ_j ) + ∑_{j=1}^K ( -log κ_j - (log κ_j - m)^2 / (2σ^2) ) ] + const
                   = ∑_{i=1}^N E_q[z_ik] ( log C_D(κ_k) + κ_k x_i^⊤ E_q[µ_k] ) - log κ_k - (log κ_k - m)^2 / (2σ^2) + const
       ⇒ q*(κ_k) is not logNormal, due to the existence of the log C_D(κ_k) term

  10. Intermediate Quantities
     Some intermediate quantities are available in closed form:
     - q(z_i) ≡ Multi(z_i|λ_i)  ⇒  E_q[z_ij] = λ_ij
     - q(π) ≡ Dirichlet(π|ρ)  ⇒  E_q log π_k = Ψ(ρ_k) - Ψ( ∑_j ρ_j )
     - q(µ_k) ≡ vMF(µ_k|ψ_k, γ_k)  ⇒  E_q[µ_k] = ( I_{D/2}(γ_k) / I_{D/2-1}(γ_k) ) ψ_k  [Rothenbuehler, 2005] (1)
     Some are not: E_q[κ_k] and E_q log C_D(κ_k), because
     1. there is no good parametric form for q(κ_k)  →  apply sampling;
     2. even if κ_k ∼ logNormal is assumed, E_q log C_D(κ_k) is still hard to deal with  →  bound log C_D(·) by some simple functions.
     (1) Can be derived from the characteristic function of the vMF.
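
Sketches of the two closed-form expectations, using scipy's digamma and exponentially scaled Bessel functions (the exponential scaling cancels in the ratio):

```python
import numpy as np
from scipy.special import digamma, ive

def dirichlet_E_log_pi(rho):
    """E_q[log pi_k] = Psi(rho_k) - Psi(sum_j rho_j)."""
    return digamma(rho) - digamma(rho.sum())

def vmf_E_mu(psi, gamma, D):
    """E_q[mu_k] = (I_{D/2}(gamma_k) / I_{D/2-1}(gamma_k)) * psi_k.

    ive(v, x) = I_v(x) exp(-x), so the ratio of ive values equals the ratio of
    the unscaled Bessel functions while avoiding overflow for large gamma.
    """
    ratio = ive(D / 2.0, gamma) / ive(D / 2.0 - 1.0, gamma)   # (K,)
    return ratio[:, None] * psi                                # (K, D)
```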

  11. Sampling
     In principle we could sample κ_k from p(κ_k | X, θ). Unfortunately, that sampling procedure requires samples of z_i, µ_k, π, ..., which are not maintained by variational inference.
     Recall that the optimal posterior for κ_k satisfies (2)
       log q*(κ_k) = ∑_{i=1}^N E[z_ik] ( log C_D(κ_k) + κ_k x_i^⊤ E_q[µ_k] ) - log κ_k - (log κ_k - m)^2 / (2σ^2) + const
       ⇒ q*(κ_k) ∝ exp{ ∑_{i=1}^N E[z_ik] ( log C_D(κ_k) + κ_k x_i^⊤ E_q[µ_k] ) } × logNormal(κ_k | m, σ^2)
     We can sample from q*(κ_k)!
     (2) See the derivation on p. 8.
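
The deck does not say which sampler is used; as one concrete option, a random-walk Metropolis sampler on log κ can target q*(κ_k) directly, reusing the log_C_D helper sketched earlier. Note that working in log κ space introduces a +log κ Jacobian that exactly cancels the -log κ of the logNormal density.

```python
import numpy as np

def sample_q_kappa(X, lam_k, E_mu_k, m, sigma2, D, n_samples=2000, step=0.3, seed=0):
    """Random-walk Metropolis on log(kappa), targeting the unnormalized q*(kappa_k) above.

    In log-kappa space the Jacobian (+log kappa) cancels the -log kappa of the
    logNormal density, leaving
        log target(log kappa) = S_k log C_D(kappa) + kappa r_k - (log kappa - m)^2 / (2 sigma2),
    where S_k = sum_i E[z_ik] and r_k = sum_i E[z_ik] x_i^T E[mu_k].
    Requires log_C_D from the earlier sketch.
    """
    rng = np.random.default_rng(seed)
    S_k = float(lam_k.sum())
    r_k = float(lam_k @ (X @ E_mu_k))

    def log_target(log_kappa):
        kappa = np.exp(log_kappa)
        return S_k * log_C_D(kappa, D) + kappa * r_k - (log_kappa - m) ** 2 / (2.0 * sigma2)

    log_kappa, samples = float(m), []
    for _ in range(n_samples):
        proposal = log_kappa + step * rng.standard_normal()
        if np.log(rng.uniform()) < log_target(proposal) - log_target(log_kappa):
            log_kappa = proposal
        samples.append(np.exp(log_kappa))
    kappa = np.array(samples)
    # Monte Carlo estimates of the two expectations needed by the other updates.
    return kappa.mean(), np.mean(log_C_D(kappa, D))
```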

  12. Bounding Outline
     - Assume q(κ_k) ≡ logNormal(·|a_k, b_k).
     - Lower bound E_q log C_D(κ_k) in the VLB by some simple terms.
     - To optimize q(κ_k), use gradient ascent w.r.t. a_k and b_k to raise the VLB.
     Empirically, sampling outperforms bounding.

  13. Empirical Bayes for Hyperparameters
     Raise VLB(q, θ) by coordinate ascent:
       1. q^{t+1} = argmax_{q = ∏_{i=1}^M q_i} VLB(q, θ^t)
       2. θ^{t+1} = argmax_θ VLB(q^{t+1}, θ) = argmax_θ E_{q^{t+1}} log p(X, Z|θ)
     For example, one can use gradient ascent to optimize α:
       max_{α>0}  -log B(α) + (α - 1) ∑_{k=1}^K E_{q^{t+1}}[log π_k]
     m, σ^2, µ_0 and C_0 can be optimized in a similar manner. (3)
     (3) Unlike α, their solutions can be written in closed form.
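
A sketch of the gradient-ascent step for α, assuming a symmetric Dirichlet with scalar concentration so that log B(α) = K log Γ(α) - log Γ(Kα); both the symmetry assumption and the fixed step size are illustrative choices, not from the deck.

```python
import numpy as np
from scipy.special import digamma

def update_alpha(E_log_pi, alpha_init=1.0, lr=1e-2, n_steps=200):
    """Gradient ascent on f(alpha) = -log B(alpha) + (alpha - 1) sum_k E[log pi_k],
    assuming a symmetric Dirichlet: f'(alpha) = K digamma(K alpha) - K digamma(alpha) + sum_k E[log pi_k]."""
    K, s = len(E_log_pi), float(np.sum(E_log_pi))
    alpha = alpha_init
    for _ in range(n_steps):
        grad = K * digamma(K * alpha) - K * digamma(alpha) + s
        alpha = max(alpha + lr * grad, 1e-6)   # keep alpha > 0
    return alpha
```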

  14. References
     Banerjee, A., Dhillon, I. S., Ghosh, J., and Sra, S. (2005). Clustering on the unit hypersphere using von Mises-Fisher distributions. Journal of Machine Learning Research, pages 1345-1382.
     Gopal, S. and Yang, Y. (2014). Von Mises-Fisher clustering models. In Proceedings of the 31st International Conference on Machine Learning, pages 154-162.
     Rothenbuehler, J. (2005). Dependence Structures beyond Copulas: A New Model of a Multivariate Regular Varying Distribution Based on a Finite von Mises-Fisher Mixture Model. PhD thesis, Cornell University.
