Probabilistic & Unsupervised Learning
Expectation Maximisation

Maneesh Sahani (maneesh@gatsby.ucl.ac.uk)
Gatsby Computational Neuroscience Unit, and MSc ML/CSML, Dept Computer Science, University College London
Term 1, Autumn 2018


2. The Expectation Maximisation (EM) algorithm

The EM algorithm (Dempster, Laird & Rubin, 1977; but with significant earlier precedents) finds a (local) maximum of a latent-variable model likelihood. Start from arbitrary values of the parameters, and iterate two steps:

E step: Fill in values of the latent variables according to the posterior given the data.
M step: Maximise the likelihood as if the latent variables were not hidden.

◮ Decomposes difficult problems into a series of tractable steps.
◮ An alternative to gradient-based iterative methods.
◮ No learning rate.
◮ In ML, the E step is called inference, and the M step learning. In statistics, these are often called imputation and inference (or estimation).
◮ Not essential for simple models (like MoGs/FA), though often more efficient than alternatives. Crucial for learning in complex settings.
◮ Provides a framework for principled approximations.
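The two steps form a generic loop. A minimal, illustrative sketch of that loop (my own, not from the slides); the `e_step` and `m_step` callables are model-specific placeholders to be supplied for a particular model.

```python
def em(e_step, m_step, theta_init, n_iter=100):
    """Generic EM loop: alternate inference over latents (E) and parameter updates (M)."""
    theta = theta_init
    for _ in range(n_iter):
        q = e_step(theta)    # E step: posterior over latent variables given data and theta
        theta = m_step(q)    # M step: maximise expected complete-data log-likelihood under q
    return theta
```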

11. Jensen’s inequality

One view: EM iteratively refines a lower bound on the log-likelihood.

[Figure: the graph of log(x) with a chord between x₁ and x₂ — the chord value α log(x₁) + (1 − α) log(x₂) lies below the curve value log(α x₁ + (1 − α) x₂).]

In general, for αᵢ ≥ 0 with Σᵢ αᵢ = 1 (and xᵢ > 0):

\[
\log\Bigl(\sum_i \alpha_i x_i\Bigr) \;\ge\; \sum_i \alpha_i \log(x_i).
\]

For a probability measure α and concave f:

\[
f\bigl(\mathbb{E}_\alpha[x]\bigr) \;\ge\; \mathbb{E}_\alpha\bigl[f(x)\bigr].
\]

Equality (if and) only if f(x) is almost surely constant or linear on the (convex) support of α.
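A quick numerical illustration of the finite form of the inequality (an example I have added, not from the slides): for random weights α on the simplex and positive points x, the log of the weighted mean is at least the weighted mean of the logs.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = rng.dirichlet(np.ones(5))        # alpha_i >= 0, summing to 1
x = rng.uniform(0.1, 10.0, size=5)       # x_i > 0

lhs = np.log(np.dot(alpha, x))           # log of the weighted mean
rhs = np.dot(alpha, np.log(x))           # weighted mean of the logs
assert lhs >= rhs
print(f"log(E[x]) = {lhs:.4f} >= E[log x] = {rhs:.4f}")
```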

19. The lower bound for EM – “free energy”

Observed data X = {xᵢ}; latent variables Z = {zᵢ}; parameters θ = {θ_x, θ_z}.

Log-likelihood:

\[
\ell(\theta) = \log P(\mathcal{X}\mid\theta) = \log \int \! d\mathcal{Z}\; P(\mathcal{Z},\mathcal{X}\mid\theta).
\]

By Jensen, any distribution q(Z) over the latent variables generates a lower bound:

\[
\ell(\theta) = \log \int \! d\mathcal{Z}\; q(\mathcal{Z})\,\frac{P(\mathcal{Z},\mathcal{X}\mid\theta)}{q(\mathcal{Z})}
\;\ge\; \int \! d\mathcal{Z}\; q(\mathcal{Z}) \log \frac{P(\mathcal{Z},\mathcal{X}\mid\theta)}{q(\mathcal{Z})}
\;\stackrel{\text{def}}{=}\; F(q, \theta).
\]

Now,

\[
\int \! d\mathcal{Z}\; q(\mathcal{Z}) \log \frac{P(\mathcal{Z},\mathcal{X}\mid\theta)}{q(\mathcal{Z})}
= \int \! d\mathcal{Z}\; q(\mathcal{Z}) \log P(\mathcal{Z},\mathcal{X}\mid\theta) - \int \! d\mathcal{Z}\; q(\mathcal{Z}) \log q(\mathcal{Z})
= \int \! d\mathcal{Z}\; q(\mathcal{Z}) \log P(\mathcal{Z},\mathcal{X}\mid\theta) + H[q],
\]

where H[q] is the entropy of q(Z). So:

\[
F(q, \theta) = \bigl\langle \log P(\mathcal{Z},\mathcal{X}\mid\theta) \bigr\rangle_{q(\mathcal{Z})} + H[q].
\]

22. The E and M steps of EM

The free-energy lower bound on ℓ(θ) is a function of θ and a distribution q:

\[
F(q, \theta) = \bigl\langle \log P(\mathcal{Z},\mathcal{X}\mid\theta) \bigr\rangle_{q(\mathcal{Z})} + H[q].
\]

The EM steps can be re-written:

◮ E step: optimise F(q, θ) with respect to the distribution over hidden variables, holding the parameters fixed:

\[
q^{(k)}(\mathcal{Z}) := \operatorname*{argmax}_{q(\mathcal{Z})} \; F\bigl(q(\mathcal{Z}),\, \theta^{(k-1)}\bigr).
\]

◮ M step: maximise F(q, θ) with respect to the parameters, holding the hidden distribution fixed:

\[
\theta^{(k)} := \operatorname*{argmax}_{\theta} \; F\bigl(q^{(k)}(\mathcal{Z}),\, \theta\bigr)
= \operatorname*{argmax}_{\theta} \; \bigl\langle \log P(\mathcal{Z},\mathcal{X}\mid\theta) \bigr\rangle_{q^{(k)}(\mathcal{Z})}.
\]

The second equality comes from the fact that H[q^{(k)}(Z)] does not depend directly on θ.

30. The E Step

The free energy can be re-written

\[
\begin{aligned}
F(q, \theta) &= \int \! d\mathcal{Z}\; q(\mathcal{Z}) \log \frac{P(\mathcal{Z},\mathcal{X}\mid\theta)}{q(\mathcal{Z})} \\
&= \int \! d\mathcal{Z}\; q(\mathcal{Z}) \log \frac{P(\mathcal{Z}\mid\mathcal{X},\theta)\,P(\mathcal{X}\mid\theta)}{q(\mathcal{Z})} \\
&= \int \! d\mathcal{Z}\; q(\mathcal{Z}) \log P(\mathcal{X}\mid\theta) + \int \! d\mathcal{Z}\; q(\mathcal{Z}) \log \frac{P(\mathcal{Z}\mid\mathcal{X},\theta)}{q(\mathcal{Z})} \\
&= \ell(\theta) - \mathrm{KL}\bigl[q(\mathcal{Z}) \,\big\|\, P(\mathcal{Z}\mid\mathcal{X},\theta)\bigr].
\end{aligned}
\]

The second term is the Kullback–Leibler divergence.

This means that, for fixed θ, F is bounded above by ℓ, and achieves that bound when KL[q(Z) ‖ P(Z|X, θ)] = 0. But KL[q ‖ p] is zero if and only if q = p (see appendix).

So, the E step sets q^(k)(Z) = P(Z|X, θ^(k−1)) [inference / imputation] and, after an E step, the free energy equals the likelihood.

31. Coordinate Ascent in F (Demo)

To visualise, we consider a one-parameter / one-latent mixture:

\[
s \sim \text{Bernoulli}[\pi], \qquad
x \mid s = 0 \sim \mathcal{N}[-1, 1], \qquad
x \mid s = 1 \sim \mathcal{N}[1, 1],
\]

with a single data point x₁ = 0.3. Here q(s) is a distribution on a single binary latent, and so is represented by r₁ ∈ [0, 1].

[Figure: the two component densities plotted over x ∈ [−2, 2], with the data point x₁ = 0.3.]
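A small sketch of this demo in code (the model is as on the slide; the implementation choices are my own). It alternates the E step, which sets r₁ to the posterior and makes F touch ℓ, with the M step, which maximises F over π (here simply π ← r₁), printing F and ℓ at each iteration.

```python
import numpy as np

x1 = 0.3                                 # the single data point from the slide

def log_gauss(x, mu, var=1.0):
    return -0.5 * np.log(2 * np.pi * var) - 0.5 * (x - mu) ** 2 / var

def loglik(pi):
    """l(pi) = log [ pi N(x1; 1, 1) + (1 - pi) N(x1; -1, 1) ]."""
    return np.logaddexp(np.log(pi) + log_gauss(x1, 1.0),
                        np.log(1 - pi) + log_gauss(x1, -1.0))

def free_energy(r1, pi):
    """F(r1, pi) = <log p(s, x1 | pi)>_q + H[q], with q(s=1) = r1."""
    entropy = -(r1 * np.log(r1) + (1 - r1) * np.log(1 - r1))
    expected_joint = (r1 * (np.log(pi) + log_gauss(x1, 1.0))
                      + (1 - r1) * (np.log(1 - pi) + log_gauss(x1, -1.0)))
    return expected_joint + entropy

pi = 0.2                                 # arbitrary starting parameter
for k in range(5):
    # E step: r1 <- posterior p(s=1 | x1, pi); F(r1, pi) now equals l(pi)
    r1 = 1.0 / (1.0 + np.exp(np.log(1 - pi) + log_gauss(x1, -1.0)
                             - np.log(pi) - log_gauss(x1, 1.0)))
    assert np.isclose(free_energy(r1, pi), loglik(pi))
    # M step: pi <- argmax_pi F(r1, pi) = r1
    pi = r1
    print(f"iter {k}: pi = {pi:.4f}, F = {free_energy(r1, pi):.4f}, "
          f"loglik = {loglik(pi):.4f}")
```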

32.–56. Coordinate Ascent in F (Demo)

[Figure-only slides: successive frames of the coordinate-ascent demo.]

62. EM Never Decreases the Likelihood

The E and M steps together never decrease the log likelihood:

\[
\ell\bigl(\theta^{(k-1)}\bigr)
\;\underset{\text{E step}}{=}\;
F\bigl(q^{(k)}, \theta^{(k-1)}\bigr)
\;\underset{\text{M step}}{\le}\;
F\bigl(q^{(k)}, \theta^{(k)}\bigr)
\;\underset{\text{Jensen}}{\le}\;
\ell\bigl(\theta^{(k)}\bigr).
\]

◮ The E step brings the free energy to the likelihood.
◮ The M step maximises the free energy with respect to θ.
◮ F ≤ ℓ by Jensen – or, equivalently, from the non-negativity of KL.

If the M step is executed so that θ^(k) ≠ θ^(k−1) iff F increases, then the overall EM iteration will step to a new value of θ iff the likelihood increases.

Can also show that fixed points of EM (generally) correspond to maxima of the likelihood (see appendices).

67. EM Summary

◮ An iterative algorithm that finds (local) maxima of the likelihood of a latent variable model:

\[
\ell(\theta) = \log P(\mathcal{X}\mid\theta) = \log \int \! d\mathcal{Z}\; P(\mathcal{X}\mid\mathcal{Z},\theta)\, P(\mathcal{Z}\mid\theta).
\]

◮ Increases a variational lower bound on the likelihood by coordinate ascent:

\[
F(q, \theta) = \bigl\langle \log P(\mathcal{Z},\mathcal{X}\mid\theta) \bigr\rangle_{q(\mathcal{Z})} + H[q]
= \ell(\theta) - \mathrm{KL}\bigl[q(\mathcal{Z}) \,\big\|\, P(\mathcal{Z}\mid\mathcal{X})\bigr]
\le \ell(\theta).
\]

◮ E step:

\[
q^{(k)}(\mathcal{Z}) := \operatorname*{argmax}_{q(\mathcal{Z})} F\bigl(q(\mathcal{Z}),\, \theta^{(k-1)}\bigr) = P\bigl(\mathcal{Z}\mid\mathcal{X}, \theta^{(k-1)}\bigr).
\]

◮ M step:

\[
\theta^{(k)} := \operatorname*{argmax}_{\theta} F\bigl(q^{(k)}(\mathcal{Z}),\, \theta\bigr)
= \operatorname*{argmax}_{\theta} \bigl\langle \log P(\mathcal{Z},\mathcal{X}\mid\theta) \bigr\rangle_{q^{(k)}(\mathcal{Z})}.
\]

◮ After the E step, F(q, θ) = ℓ(θ) ⇒ a maximum of the free energy is a maximum of the likelihood.

68. Partial M steps and Partial E steps

Partial M steps: The proof holds even if we just increase F with respect to θ rather than maximise it. (Dempster, Laird and Rubin (1977) call this the generalised EM, or GEM, algorithm.) In fact, immediately after an E step,

\[
\frac{\partial}{\partial\theta} \bigl\langle \log P(\mathcal{X},\mathcal{Z}\mid\theta) \bigr\rangle_{q^{(k)}(\mathcal{Z})\,[=P(\mathcal{Z}\mid\mathcal{X},\theta^{(k-1)})]} \,\bigg|_{\theta^{(k-1)}}
= \frac{\partial}{\partial\theta} \log P(\mathcal{X}\mid\theta) \,\bigg|_{\theta^{(k-1)}}
\]

[cf. mixture gradients from the last lecture]. So the E step (inference) can be used to construct other gradient-based optimisation schemes (e.g. “Expectation Conjugate Gradient”, Salakhutdinov et al., ICML 2003).

Partial E steps: We can also just increase F with respect to some of the q’s. For example, sparse or online versions of the EM algorithm would compute the posterior for a subset of the data points, or as the data arrive, respectively. One might also update the posterior over a subset of the hidden variables, while holding others fixed...
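A finite-difference check of this gradient identity, using the one-parameter mixture from the coordinate-ascent demo (my own illustration, not the lecture's code): with responsibilities fixed by an E step at π₀, the gradient of ⟨log P(x, s | π)⟩_q at π₀ matches the gradient of the log-likelihood.

```python
import numpy as np

x1, pi0, eps = 0.3, 0.2, 1e-6

def log_gauss(x, mu):
    return -0.5 * np.log(2 * np.pi) - 0.5 * (x - mu) ** 2

def loglik(pi):
    return np.logaddexp(np.log(pi) + log_gauss(x1, 1.0),
                        np.log(1 - pi) + log_gauss(x1, -1.0))

# E step at pi0: the responsibility r1 is held fixed below
n1, n0 = np.exp(log_gauss(x1, 1.0)), np.exp(log_gauss(x1, -1.0))
r1 = pi0 * n1 / (pi0 * n1 + (1 - pi0) * n0)

def expected_complete_ll(pi):
    """<log p(x, s | pi)>_q with q fixed at the E-step posterior."""
    return (r1 * (np.log(pi) + log_gauss(x1, 1.0))
            + (1 - r1) * (np.log(1 - pi) + log_gauss(x1, -1.0)))

grad_F = (expected_complete_ll(pi0 + eps) - expected_complete_ll(pi0 - eps)) / (2 * eps)
grad_l = (loglik(pi0 + eps) - loglik(pi0 - eps)) / (2 * eps)
print(grad_F, grad_l)               # the two central differences agree
assert np.isclose(grad_F, grad_l, atol=1e-5)
```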

69. EM for MoGs

◮ Evaluate responsibilities:

\[
r_{im} = \frac{\pi_m\, P_m(x_i)}{\sum_{m'} \pi_{m'}\, P_{m'}(x_i)}
\]

◮ Update parameters:

\[
\mu_m \leftarrow \frac{\sum_i r_{im}\, x_i}{\sum_i r_{im}}, \qquad
\Sigma_m \leftarrow \frac{\sum_i r_{im}\, (x_i - \mu_m)(x_i - \mu_m)^{\mathsf T}}{\sum_i r_{im}}, \qquad
\pi_m \leftarrow \frac{\sum_i r_{im}}{N}.
\]
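A compact numpy sketch of these updates (my own implementation, not the lecture's code), with responsibilities computed in the log domain for numerical stability. The printed log-likelihood should be non-decreasing from one iteration to the next, as argued on the "EM never decreases the likelihood" slide.

```python
import numpy as np

def log_gauss(X, mu, Sigma):
    """Log-density of each row of X under N(mu, Sigma)."""
    D = X.shape[1]
    diff = X - mu
    sol = np.linalg.solve(Sigma, diff.T).T
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (D * np.log(2 * np.pi) + logdet + np.sum(diff * sol, axis=1))

def em_mog(X, k, n_iter=10, seed=0):
    rng = np.random.default_rng(seed)
    N, D = X.shape
    mu = X[rng.choice(N, k, replace=False)]                # initialise means at random data points
    Sigma = np.stack([np.cov(X.T) + 1e-6 * np.eye(D)] * k)
    pi = np.full(k, 1.0 / k)
    for it in range(n_iter):
        # E step: responsibilities r[i, m] = pi_m P_m(x_i) / sum_m' pi_m' P_m'(x_i)
        log_p = np.stack([np.log(pi[m]) + log_gauss(X, mu[m], Sigma[m])
                          for m in range(k)], axis=1)
        log_norm = np.logaddexp.reduce(log_p, axis=1)
        r = np.exp(log_p - log_norm[:, None])
        print(f"iter {it}: log-likelihood = {log_norm.sum():.3f}")
        # M step: weighted means, covariances and mixing proportions
        Nm = r.sum(axis=0)
        mu = (r.T @ X) / Nm[:, None]
        for m in range(k):
            diff = X - mu[m]
            Sigma[m] = (r[:, m, None] * diff).T @ diff / Nm[m] + 1e-6 * np.eye(D)
        pi = Nm / N
    return pi, mu, Sigma, r

# Toy usage: two well-separated 2-D clusters
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([-2, 0], 1.0, (100, 2)), rng.normal([2, 0], 1.0, (100, 2))])
em_mog(X, k=2)
```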

74. The Gaussian mixture model (E-step)

In a univariate Gaussian mixture model, the density of a data point x is:

\[
p(x\mid\theta) = \sum_{m=1}^{k} p(s = m\mid\theta)\, p(x\mid s = m, \theta)
\;\propto\; \sum_{m=1}^{k} \frac{\pi_m}{\sigma_m} \exp\Bigl\{-\frac{(x - \mu_m)^2}{2\sigma_m^2}\Bigr\},
\]

where θ is the collection of parameters: means μ_m, variances σ_m² and mixing proportions π_m = p(s = m|θ).

The hidden variable sᵢ indicates which component generated observation xᵢ.

The E-step computes the posterior for sᵢ given the current parameters:

\[
q(s_i) = p(s_i \mid x_i, \theta) \propto p(x_i \mid s_i, \theta)\, p(s_i \mid \theta)
\]

\[
r_{im} \stackrel{\text{def}}{=} \bigl\langle \delta_{s_i = m} \bigr\rangle_q
\;\leftarrow\; q(s_i = m) \propto \frac{\pi_m}{\sigma_m} \exp\Bigl\{-\frac{(x_i - \mu_m)^2}{2\sigma_m^2}\Bigr\}
\qquad\text{(responsibilities)}
\]

with the normalisation such that Σ_m r_im = 1.

78. The Gaussian mixture model (M-step)

In the M-step we optimise the sum (since s is discrete):

\[
E = \bigl\langle \log p(x, s \mid \theta) \bigr\rangle_{q(s)}
= \sum q(s) \log\bigl[p(s\mid\theta)\, p(x\mid s, \theta)\bigr]
= \sum_{i,m} r_{im} \Bigl[ \log \pi_m - \log \sigma_m - \frac{(x_i - \mu_m)^2}{2\sigma_m^2} \Bigr].
\]

The optimum is found by setting the partial derivatives of E to zero:

\[
\frac{\partial E}{\partial \mu_m} = \sum_i \frac{r_{im}\,(x_i - \mu_m)}{\sigma_m^2} = 0
\;\Rightarrow\; \mu_m = \frac{\sum_i r_{im}\, x_i}{\sum_i r_{im}},
\]

\[
\frac{\partial E}{\partial \sigma_m} = \sum_i r_{im} \Bigl[ -\frac{1}{\sigma_m} + \frac{(x_i - \mu_m)^2}{\sigma_m^3} \Bigr] = 0
\;\Rightarrow\; \sigma_m^2 = \frac{\sum_i r_{im}\,(x_i - \mu_m)^2}{\sum_i r_{im}},
\]

\[
\frac{\partial E}{\partial \pi_m} + \lambda = \sum_i r_{im}\,\frac{1}{\pi_m} + \lambda = 0
\;\Rightarrow\; \pi_m = \frac{1}{n} \sum_i r_{im},
\]

where λ is a Lagrange multiplier ensuring that the mixing proportions sum to unity.
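A small numerical check of these closed forms (an illustration I have added, using arbitrary synthetic numbers): for fixed responsibilities r, the expected complete-data log-likelihood E is largest at the closed-form μ_m, σ_m, π_m, and drops under random feasible perturbations.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=20)                          # univariate data
r = rng.dirichlet(np.ones(3), size=20)           # fixed responsibilities, rows sum to 1

def expected_cll(mu, sigma, pi):
    """E = sum_{i,m} r_im [ log pi_m - log sigma_m - (x_i - mu_m)^2 / (2 sigma_m^2) ]."""
    quad = (x[:, None] - mu[None, :]) ** 2 / (2 * sigma[None, :] ** 2)
    return np.sum(r * ((np.log(pi) - np.log(sigma))[None, :] - quad))

# Closed-form M-step updates from the zero-derivative conditions above
Nm = r.sum(axis=0)
mu_hat = (r * x[:, None]).sum(axis=0) / Nm
sigma_hat = np.sqrt((r * (x[:, None] - mu_hat[None, :]) ** 2).sum(axis=0) / Nm)
pi_hat = Nm / len(x)

best = expected_cll(mu_hat, sigma_hat, pi_hat)
for _ in range(1000):                            # random feasible perturbations
    pi_trial = 0.9 * pi_hat + 0.1 * rng.dirichlet(np.ones(3))   # stays on the simplex
    trial = expected_cll(mu_hat + rng.normal(scale=0.1, size=3),
                         sigma_hat * np.exp(rng.normal(scale=0.1, size=3)),
                         pi_trial)
    assert trial <= best + 1e-9
print("closed-form M step maximises E over all sampled perturbations")
```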

79. EM for Factor Analysis

[Figure: graphical model with latent factors z₁ … z_K and observed variables x₁ … x_D.]

The model for x:

\[
p(x\mid\theta) = \int p(z\mid\theta)\, p(x\mid z, \theta)\, dz = \mathcal{N}\bigl(0,\; \Lambda\Lambda^{\mathsf T} + \Psi\bigr).
\]

Model parameters: θ = {Λ, Ψ}.

E step: For each data point x_n, compute the posterior distribution of hidden factors given the observed data: q_n(z_n) = p(z_n | x_n, θ_t).

M step: Find the θ_{t+1} that maximises F(q, θ):

\[
\begin{aligned}
F(q, \theta) &= \sum_n \int q_n(z_n) \bigl[\log p(z_n\mid\theta) + \log p(x_n\mid z_n, \theta) - \log q_n(z_n)\bigr]\, dz_n \\
&= \sum_n \int q_n(z_n) \bigl[\log p(z_n\mid\theta) + \log p(x_n\mid z_n, \theta)\bigr]\, dz_n + \text{c}.
\end{aligned}
\]

80. The E step for Factor Analysis

E step: For each data point x_n, compute the posterior distribution of hidden factors given the observed data:

\[
q_n(z_n) = p(z_n\mid x_n, \theta) = p(z_n, x_n\mid\theta)\,/\,p(x_n\mid\theta).
\]

Tactic: write p(z_n, x_n | θ), consider x_n to be fixed. What is this as a function of z_n?

\[
\begin{aligned}
p(z_n, x_n) &= p(z_n)\, p(x_n\mid z_n) \\
&= (2\pi)^{-\frac{K}{2}} \exp\bigl\{-\tfrac{1}{2} z_n^{\mathsf T} z_n\bigr\}\;
|2\pi\Psi|^{-\frac{1}{2}} \exp\bigl\{-\tfrac{1}{2}(x_n - \Lambda z_n)^{\mathsf T} \Psi^{-1} (x_n - \Lambda z_n)\bigr\} \\
&= c \times \exp\bigl\{-\tfrac{1}{2}\bigl[z_n^{\mathsf T} z_n + (x_n - \Lambda z_n)^{\mathsf T} \Psi^{-1} (x_n - \Lambda z_n)\bigr]\bigr\} \\
&= c' \times \exp\bigl\{-\tfrac{1}{2}\bigl[z_n^{\mathsf T} (I + \Lambda^{\mathsf T}\Psi^{-1}\Lambda) z_n - 2 z_n^{\mathsf T} \Lambda^{\mathsf T}\Psi^{-1} x_n\bigr]\bigr\} \\
&= c'' \times \exp\bigl\{-\tfrac{1}{2}\bigl[z_n^{\mathsf T} \Sigma^{-1} z_n - 2 z_n^{\mathsf T} \Sigma^{-1} \mu_n + \mu_n^{\mathsf T} \Sigma^{-1} \mu_n\bigr]\bigr\}.
\end{aligned}
\]

So Σ = (I + ΛᵀΨ⁻¹Λ)⁻¹ = I − βΛ and μ_n = ΣΛᵀΨ⁻¹ x_n = β x_n, where β = ΣΛᵀΨ⁻¹.

Note that μ_n is a linear function of x_n, and Σ does not depend on x_n.
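A numpy sketch of this E step (Λ and Ψ are random values chosen purely for illustration): compute Σ and β as above and verify the identity Σ = I − βΛ numerically.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K, N = 5, 2, 4
Lambda = rng.normal(size=(D, K))                 # factor loadings
Psi = np.diag(rng.uniform(0.5, 2.0, size=D))     # diagonal observation noise
X = rng.normal(size=(N, D))                      # observed data points (one per row)

Psi_inv = np.linalg.inv(Psi)
Sigma = np.linalg.inv(np.eye(K) + Lambda.T @ Psi_inv @ Lambda)   # posterior covariance
beta = Sigma @ Lambda.T @ Psi_inv                                # K x D "recognition" matrix
mu = X @ beta.T                                                  # row n holds mu_n = beta x_n

assert np.allclose(Sigma, np.eye(K) - beta @ Lambda)             # Sigma = I - beta Lambda
print("posterior means:\n", mu)
print("shared posterior covariance:\n", Sigma)
```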
