  1. Gaussian Mixture Models & EM
     CE-717: Machine Learning
     Sharif University of Technology
     M. Soleymani, Fall 2016

  2. Mixture Models: definition
     - Mixture models: a linear superposition of mixtures (components):
       $p(\mathbf{x} \mid \theta) = \sum_{k=1}^{K} P(z_k)\, p(\mathbf{x} \mid z_k; \theta_k)$, with $\sum_{k=1}^{K} P(z_k) = 1$
     - $P(z_k)$: the prior probability of the $k$-th mixture
     - $\theta_k$: the parameters of the $k$-th mixture
     - $p(\mathbf{x} \mid z_k; \theta_k)$: the probability of $\mathbf{x}$ according to the $k$-th mixture
     - A framework for building more complex probability distributions
     - Goal: estimate $p(\mathbf{x} \mid \theta)$, e.g., for multi-modal density estimation
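     For instance (illustrative numbers of my own, not from the slide), a mixture of just two 1-D Gaussians already yields a multi-modal density:

     ```latex
     \[
     p(x) \;=\; 0.5\,\mathcal{N}(x \mid -2,\, 1) \;+\; 0.5\,\mathcal{N}(x \mid 3,\, 1)
     \]
     % Each component contributes one mode, so p(x) is bimodal even though each
     % \mathcal{N}(\cdot \mid \mu, \sigma^2) on its own is unimodal.
     ```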

  3. Gaussian Mixture Models (GMMs)
     - Gaussian mixture model: each component is a Gaussian, $p(\mathbf{x} \mid z_k; \theta_k) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$:
       $p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$,  $0 \le \pi_k \le 1$,  $\sum_{k=1}^{K} \pi_k = 1$
     - Fitting the Gaussian mixture model:
       - Input: data points $\{\mathbf{x}^{(n)}\}_{n=1}^{N}$
       - Goal: find the parameters of the GMM ($\pi_k, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k$ for $k = 1, \dots, K$)
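     As a quick sanity check of the density formula above, the following sketch evaluates a GMM density at a batch of points. It is a minimal illustration of my own (the function name `gmm_density` is not from the course material), assuming SciPy's `multivariate_normal`; the numeric values echo the 2-D example on the next slides as best they can be read.

     ```python
     import numpy as np
     from scipy.stats import multivariate_normal

     def gmm_density(X, pis, mus, Sigmas):
         """Evaluate p(x) = sum_k pi_k * N(x | mu_k, Sigma_k) at each row of X."""
         density = np.zeros(len(X))
         for pi_k, mu_k, Sigma_k in zip(pis, mus, Sigmas):
             density += pi_k * multivariate_normal.pdf(X, mean=mu_k, cov=Sigma_k)
         return density

     # Parameters mirroring the 2-D example below (K = 3 components).
     pis = [0.6, 0.25, 0.15]
     mus = [np.array([-2.0, 3.0]), np.array([0.0, -4.0]), np.array([3.0, 2.0])]
     Sigmas = [np.array([[1.0, 0.5], [0.5, 4.0]]),
               np.eye(2),
               np.array([[3.0, 1.0], [1.0, 1.0]])]

     X = np.random.default_rng(0).normal(size=(5, 2))  # a few query points
     print(gmm_density(X, pis, mus, Sigmas))
     ```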

  4. GMM: 1-D Example
     [Figure: three 1-D Gaussian components with mixing coefficients $\pi_1 = 0.6$, $\pi_2 = 0.3$, $\pi_3 = 0.1$, and the resulting mixture density]

  5. GMM: 2-D Example ($K = 3$)
     - $\boldsymbol{\mu}_1 = \begin{bmatrix} -2 \\ 3 \end{bmatrix}$, $\boldsymbol{\Sigma}_1 = \begin{bmatrix} 1 & 0.5 \\ 0.5 & 4 \end{bmatrix}$, $\pi_1 = 0.6$
     - $\boldsymbol{\mu}_2 = \begin{bmatrix} 0 \\ -4 \end{bmatrix}$, $\boldsymbol{\Sigma}_2 = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$, $\pi_2 = 0.25$
     - $\boldsymbol{\mu}_3 = \begin{bmatrix} 3 \\ 2 \end{bmatrix}$, $\boldsymbol{\Sigma}_3 = \begin{bmatrix} 3 & 1 \\ 1 & 1 \end{bmatrix}$, $\pi_3 = 0.15$

  6. GMM: 2-D Example (continued)
     - The GMM distribution for the same three components ($K = 3$):
       $\boldsymbol{\mu}_1 = \begin{bmatrix} -2 \\ 3 \end{bmatrix}$, $\boldsymbol{\Sigma}_1 = \begin{bmatrix} 1 & 0.5 \\ 0.5 & 4 \end{bmatrix}$, $\pi_1 = 0.6$;
       $\boldsymbol{\mu}_2 = \begin{bmatrix} 0 \\ -4 \end{bmatrix}$, $\boldsymbol{\Sigma}_2 = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$, $\pi_2 = 0.25$;
       $\boldsymbol{\mu}_3 = \begin{bmatrix} 3 \\ 2 \end{bmatrix}$, $\boldsymbol{\Sigma}_3 = \begin{bmatrix} 3 & 1 \\ 1 & 1 \end{bmatrix}$, $\pi_3 = 0.15$
     [Figure: the resulting GMM density]
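     For intuition about how such 2-D data arises, here is a minimal sketch of my own (not from the slides) that samples from the three-component GMM above by first picking a component according to $\boldsymbol{\pi}$ and then drawing from that component's Gaussian (ancestral sampling).

     ```python
     import numpy as np

     rng = np.random.default_rng(0)

     # Parameters as read off the 2-D example slide.
     pis = np.array([0.6, 0.25, 0.15])
     mus = [np.array([-2.0, 3.0]), np.array([0.0, -4.0]), np.array([3.0, 2.0])]
     Sigmas = [np.array([[1.0, 0.5], [0.5, 4.0]]),
               np.eye(2),
               np.array([[3.0, 1.0], [1.0, 1.0]])]

     N = 500
     # Ancestral sampling: choose component k ~ Categorical(pi), then x ~ N(mu_k, Sigma_k).
     components = rng.choice(len(pis), size=N, p=pis)
     samples = np.stack([rng.multivariate_normal(mus[k], Sigmas[k]) for k in components])
     print(samples.shape)  # (500, 2)
     ```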

  7. How to Fit a GMM?
     - Maximize the log-likelihood of the data $\mathbf{X} = \{\mathbf{x}^{(1)}, \dots, \mathbf{x}^{(N)}\}$:
       $\ln p(\mathbf{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \sum_{n=1}^{N} \ln \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x}^{(n)} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$
     - The sum over components appears inside the log, so there is no closed-form maximum-likelihood solution.
     - Setting the derivatives to zero (with a Lagrange multiplier for the constraint $\sum_k \pi_k = 1$), for $j = 1, \dots, K$:
       $\frac{\partial \ln p(\mathbf{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma})}{\partial \boldsymbol{\mu}_j} = \mathbf{0}$
       $\frac{\partial \ln p(\mathbf{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma})}{\partial \boldsymbol{\Sigma}_j} = \mathbf{0}$
       $\frac{\partial}{\partial \pi_j}\left[ \ln p(\mathbf{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) + \lambda \left( \sum_{k=1}^{K} \pi_k - 1 \right) \right] = 0$
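     To connect this to the update equations on the next slide, here is the standard intermediate step (reconstructed in Bishop's notation, not transcribed from this slide) for the derivative with respect to $\boldsymbol{\mu}_k$:

     ```latex
     \[
     \frac{\partial \ln p(\mathbf{X}\mid\boldsymbol{\pi},\boldsymbol{\mu},\boldsymbol{\Sigma})}{\partial \boldsymbol{\mu}_k}
     = \sum_{n=1}^{N}
       \underbrace{\frac{\pi_k\,\mathcal{N}(\mathbf{x}^{(n)}\mid\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k)}
       {\sum_{j=1}^{K}\pi_j\,\mathcal{N}(\mathbf{x}^{(n)}\mid\boldsymbol{\mu}_j,\boldsymbol{\Sigma}_j)}}_{\gamma_k^{(n)}\ \text{(responsibility)}}
       \,\boldsymbol{\Sigma}_k^{-1}\bigl(\mathbf{x}^{(n)}-\boldsymbol{\mu}_k\bigr)=\mathbf{0}
     \quad\Rightarrow\quad
     \boldsymbol{\mu}_k=\frac{\sum_{n}\gamma_k^{(n)}\,\mathbf{x}^{(n)}}{\sum_{n}\gamma_k^{(n)}}
     \]
     % Each data point is weighted by its responsibility \gamma_k^{(n)}, which itself
     % depends on the parameters, so this is not a closed-form solution.
     ```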

  8. ML for GMM
     - Setting the derivatives to zero yields, with responsibilities $\gamma_j^{(n)} = \frac{\pi_j\, \mathcal{N}(\mathbf{x}^{(n)} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}{\sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x}^{(n)} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}$ and $N_j = \sum_{n=1}^{N} \gamma_j^{(n)}$:
       $\boldsymbol{\mu}_j^{\text{new}} = \frac{1}{N_j} \sum_{n=1}^{N} \gamma_j^{(n)}\, \mathbf{x}^{(n)}$
       $\boldsymbol{\Sigma}_j^{\text{new}} = \frac{1}{N_j} \sum_{n=1}^{N} \gamma_j^{(n)} (\mathbf{x}^{(n)} - \boldsymbol{\mu}_j^{\text{new}})(\mathbf{x}^{(n)} - \boldsymbol{\mu}_j^{\text{new}})^T$
       $\pi_j^{\text{new}} = \frac{N_j}{N}$
     - Useful matrix derivative identities: $\frac{\partial \ln |\mathbf{A}^{-1}|}{\partial \mathbf{A}^{-1}} = \mathbf{A}^T$ and $\frac{\partial\, \mathbf{x}^T \mathbf{A}\, \mathbf{x}}{\partial \mathbf{A}} = \mathbf{x}\mathbf{x}^T$

  9. EM algorithm
     - An iterative algorithm in which each iteration is guaranteed to improve the log-likelihood function
     - A general algorithm for finding ML estimates when the data is incomplete (missing or unobserved data)
     - EM finds the maximum-likelihood parameters in cases where the model involves unobserved variables $Z$ in addition to unknown parameters $\theta$ and known data observations $X$

  10. Mixture models: discrete latent variables
      $p(\mathbf{x}) = \sum_{k=1}^{K} P(z_k = 1)\, p(\mathbf{x} \mid z_k = 1) = \sum_{k=1}^{K} \pi_k\, p(\mathbf{x} \mid z_k = 1)$
      - $\mathbf{z}$: latent or hidden variable; it specifies the mixture component
      - $P(z_k = 1) = \pi_k$,  $0 \le \pi_k \le 1$,  $\sum_{k=1}^{K} \pi_k = 1$
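      Given this latent-variable view, the posterior over the component that generated a point follows from Bayes' rule (the standard Bishop-style step, added here as a bridge; the last equality specializes the components to Gaussians). This is exactly the quantity the E-step on the next slide computes.

      ```latex
      \[
      \gamma(z_k) \equiv P(z_k = 1 \mid \mathbf{x})
      = \frac{P(z_k = 1)\, p(\mathbf{x} \mid z_k = 1)}{\sum_{j=1}^{K} P(z_j = 1)\, p(\mathbf{x} \mid z_j = 1)}
      = \frac{\pi_k\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}
      \]
      ```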

  11. EM for GMM
      - $\theta = [\boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}]$; $z^{(n)} \in \{1, 2, \dots, K\}$ shows the mixture from which $\mathbf{x}^{(n)}$ is generated
      - Initialize $\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k, \pi_k$ for $k = 1, \dots, K$
      - E-step: for $n = 1, \dots, N$ and $k = 1, \dots, K$:
        $\gamma_k^{(n)} = P(z_k^{(n)} = 1 \mid \mathbf{x}^{(n)}, \theta^{\text{old}}) = \frac{\pi_k^{\text{old}}\, \mathcal{N}(\mathbf{x}^{(n)} \mid \boldsymbol{\mu}_k^{\text{old}}, \boldsymbol{\Sigma}_k^{\text{old}})}{\sum_{j=1}^{K} \pi_j^{\text{old}}\, \mathcal{N}(\mathbf{x}^{(n)} \mid \boldsymbol{\mu}_j^{\text{old}}, \boldsymbol{\Sigma}_j^{\text{old}})}$
      - M-step: for $k = 1, \dots, K$:
        $\boldsymbol{\mu}_k^{\text{new}} = \frac{\sum_{n=1}^{N} \gamma_k^{(n)}\, \mathbf{x}^{(n)}}{\sum_{n=1}^{N} \gamma_k^{(n)}}$
        $\boldsymbol{\Sigma}_k^{\text{new}} = \frac{\sum_{n=1}^{N} \gamma_k^{(n)} (\mathbf{x}^{(n)} - \boldsymbol{\mu}_k^{\text{new}})(\mathbf{x}^{(n)} - \boldsymbol{\mu}_k^{\text{new}})^T}{\sum_{n=1}^{N} \gamma_k^{(n)}}$
        $\pi_k^{\text{new}} = \frac{\sum_{n=1}^{N} \gamma_k^{(n)}}{N}$
      - Repeat the E and M steps until convergence (a minimal code sketch follows)
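      A compact NumPy sketch of the loop above, implementing these update equations. It is my own illustrative implementation (not the course's reference code): the function name `em_gmm`, the random initialization, and the small covariance regularizer `reg` (added for numerical stability) are all assumptions.

      ```python
      import numpy as np
      from scipy.stats import multivariate_normal

      def em_gmm(X, K, n_iters=100, reg=1e-6, seed=0):
          """Fit a K-component GMM to data X (N x d) with the EM updates above."""
          rng = np.random.default_rng(seed)
          N, d = X.shape
          # Initialization: K random data points as means, identity covariances, uniform weights.
          mus = X[rng.choice(N, size=K, replace=False)]
          Sigmas = np.stack([np.eye(d)] * K)
          pis = np.full(K, 1.0 / K)

          for _ in range(n_iters):
              # E-step: responsibilities gamma[n, k] = pi_k N(x_n | mu_k, Sigma_k) / sum_j (...)
              gamma = np.stack([pis[k] * multivariate_normal.pdf(X, mus[k], Sigmas[k])
                                for k in range(K)], axis=1)
              gamma /= gamma.sum(axis=1, keepdims=True)

              # M-step: responsibility-weighted ML updates.
              Nk = gamma.sum(axis=0)                 # effective number of points per component
              mus = (gamma.T @ X) / Nk[:, None]      # weighted means
              for k in range(K):
                  diff = X - mus[k]
                  Sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + reg * np.eye(d)
              pis = Nk / N

          return pis, mus, Sigmas

      # Usage (illustrative): pis, mus, Sigmas = em_gmm(X, K=3) for an (N x d) array X.
      ```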

  12. EM & GMM: example [Bishop]

  13. EM & GMM: example [Bishop]

  14. Local Minima

  15. Local Minima (example)
      - True (generating) parameters ($K = 3$): $\boldsymbol{\mu}_1 = (-2, 3)$, $\boldsymbol{\Sigma}_1 = \begin{bmatrix} 1 & 0.5 \\ 0.5 & 4 \end{bmatrix}$, $\pi_1 = 0.6$; $\boldsymbol{\mu}_2 = (0, -4)$, $\boldsymbol{\Sigma}_2 = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$, $\pi_2 = 0.25$; $\boldsymbol{\mu}_3 = (3, 2)$, $\boldsymbol{\Sigma}_3 = \begin{bmatrix} 3 & 1 \\ 1 & 1 \end{bmatrix}$, $\pi_3 = 0.15$
      - Two different fitted GMMs (different local optima):
        - Solution 1: $\boldsymbol{\mu}_1 = (0.36, -4.09)$, $\boldsymbol{\Sigma}_1 = \begin{bmatrix} 0.89 & 0.26 \\ 0.26 & 0.83 \end{bmatrix}$, $\pi_1 = 0.249$; $\boldsymbol{\mu}_2 = (3.25, 2.09)$, $\boldsymbol{\Sigma}_2 = \begin{bmatrix} 2.23 & 1.08 \\ 1.09 & 1.41 \end{bmatrix}$, $\pi_2 = 0.146$; $\boldsymbol{\mu}_3 = (-2.11, 3.36)$, $\boldsymbol{\Sigma}_3 = \begin{bmatrix} 1.12 & 0.61 \\ 0.61 & 3.61 \end{bmatrix}$, $\pi_3 = 0.604$
        - Solution 2: $\boldsymbol{\mu}_1 = (1.45, -1.81)$, $\boldsymbol{\Sigma}_1 = \begin{bmatrix} 3.30 & 4.76 \\ 4.76 & 10.01 \end{bmatrix}$, $\pi_1 = 0.392$; $\boldsymbol{\mu}_2 = (-2.20, 3.16)$, $\boldsymbol{\Sigma}_2 = \begin{bmatrix} 1.30 & 1.10 \\ 1.10 & 2.80 \end{bmatrix}$, $\pi_2 = 0.429$; $\boldsymbol{\mu}_3 = (-1.88, 3.74)$, $\boldsymbol{\Sigma}_3 = \begin{bmatrix} 5.83 & -0.82 \\ -0.82 & 5.83 \end{bmatrix}$, $\pi_3 = 0.178$
      [Figures: the fitted clusters $D_1, D_2, D_3$ for each solution]

  16. EM+GMM vs. k-means
      - k-means:
        - is not probabilistic
        - has fewer parameters (and is faster)
        - is limited by the underlying assumption of spherical clusters
        - can be extended to use covariances, giving "hard EM" (ellipsoidal k-means)
      - Both EM and k-means depend on initialization and can get stuck in local optima; EM+GMM has more local minima
      - Useful trick: first run k-means and then use its result to initialize EM (see the sketch below)
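      One way to apply this trick in practice is to seed the GMM's means with the k-means centroids. This is a minimal sketch assuming scikit-learn (the slide does not prescribe a particular library, and the dataset `X` here is a placeholder):

      ```python
      import numpy as np
      from sklearn.cluster import KMeans
      from sklearn.mixture import GaussianMixture

      rng = np.random.default_rng(0)
      X = rng.normal(size=(300, 2))  # placeholder data; substitute the real dataset

      K = 3
      # Step 1: run k-means to get rough cluster centers.
      kmeans = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X)

      # Step 2: initialize EM for the GMM from the k-means centroids.
      gmm = GaussianMixture(n_components=K, means_init=kmeans.cluster_centers_,
                            random_state=0).fit(X)
      print(gmm.weights_, gmm.means_)
      ```

      Note that scikit-learn's GaussianMixture already uses a k-means-based initialization by default (init_params='kmeans'); the explicit two-step version above mainly makes the trick visible.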

  17. EM algorithm: general
      - A general algorithm for finding ML estimates when the data is incomplete (missing or unobserved data)

  18. Incomplete log-likelihood
      - Complete log-likelihood: maximizing the likelihood (i.e., $\log P(X, Z \mid \theta)$) for labeled (fully observed) data is straightforward
      - Incomplete log-likelihood: with $Z$ unobserved, our objective becomes the log of a marginal probability:
        $\log P(X \mid \theta) = \log \sum_{Z} P(X, Z \mid \theta)$
      - This objective does not decouple, and we use the EM algorithm to solve it

  19. EM Algorithm
      - Assumptions: $X$ (observed or known variables), $Z$ (unobserved or latent variables); $X$ comes from a specific model with unknown parameters $\theta$
      - If $Z$ is relevant to $X$ (in any way), we can hope to extract information about it from $X$, assuming a specific parametric model on the data
      - Steps:
        - Initialization: initialize the unknown parameters $\theta$
        - Iterate the following steps until convergence:
          - Expectation step: find the probability of the unobserved variables given the current parameter estimates and the observed data
          - Maximization step: from the observed data and the probability of the unobserved data, find the most likely parameters (a better estimate of the parameters)

  20. EM algorithm: intuition
      - When learning with hidden variables, we are trying to solve two problems at once:
        - hypothesizing values for the unobserved variables in each data sample
        - learning the parameters
      - Each of these tasks is fairly easy when we have the solution to the other:
        - Given complete data, we have the statistics and can estimate the parameters using the MLE formulas
        - Conversely, computing the probability of the missing data given the parameters is a probabilistic inference problem

  21. EM algorithm

  22. EM theoretical analysis
      - What is the underlying theory for the use of the expected complete log-likelihood in the M-step?
        $\mathbb{E}_{P(Z \mid X, \theta^{\text{old}})}\left[ \log P(X, Z \mid \theta) \right]$
      - Now, we show that maximizing this function also maximizes the likelihood

  23. EM theoretical foundation: objective function

  24. Jensen's inequality
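      The body of this slide did not survive extraction; for reference, the standard statement of Jensen's inequality as it is used in this derivation, together with the lower bound it yields (matching the $F(\theta, q)$ used in the E-step proof two slides below), is:

      ```latex
      % Jensen's inequality for the (concave) log:  \log \mathbb{E}[Y] \ge \mathbb{E}[\log Y].
      % Applied to the incomplete log-likelihood with any distribution q(Z) over the latent variables:
      \[
      \ell(\theta; X) = \log \sum_{Z} P(X, Z \mid \theta)
      = \log \sum_{Z} q(Z)\, \frac{P(X, Z \mid \theta)}{q(Z)}
      \;\ge\; \sum_{Z} q(Z) \log \frac{P(X, Z \mid \theta)}{q(Z)}
      \;\equiv\; F(\theta, q)
      \]
      % EM alternately maximizes the lower bound F(\theta, q) over q (E-step) and over \theta (M-step).
      ```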

  25. EM theoretical foundation: algorithm in general form

  26. EM theoretical foundation: E-step
      $q^t = P(Z \mid X, \theta^t) \;\Rightarrow\; q^t = \underset{q}{\operatorname{argmax}}\; F(\theta^t, q)$
      Proof:
      $F\bigl(\theta^t, P(Z \mid X, \theta^t)\bigr) = \sum_{Z} P(Z \mid X, \theta^t) \log \frac{P(X, Z \mid \theta^t)}{P(Z \mid X, \theta^t)} = \sum_{Z} P(Z \mid X, \theta^t) \log P(X \mid \theta^t) = \log P(X \mid \theta^t) = \ell(\theta^t; X)$
      Since $F(\theta, q)$ is a lower bound on $\ell(\theta; X)$, $F(\theta^t, q)$ has been maximized by setting $q$ to $P(Z \mid X, \theta^t)$:
      $F\bigl(\theta^t, P(Z \mid X, \theta^t)\bigr) = \ell(\theta^t; X) \;\Rightarrow\; P(Z \mid X, \theta^t) = \underset{q}{\operatorname{argmax}}\; F(\theta^t, q)$

  27. EM algorithm: illustration
      [Figure: the log-likelihood $\ell(\theta; X)$ and the lower bound $F(\theta, q^t)$, with the points $\theta^t$ and $\theta^{t+1}$ marked]

  28. EM theoretical foundation: M-step
      The M-step can be equivalently viewed as maximizing the expected complete log-likelihood:
      $\theta^{t+1} = \underset{\theta}{\operatorname{argmax}}\; F(\theta, q^t) = \underset{\theta}{\operatorname{argmax}}\; \mathbb{E}_{q^t}\left[ \log P(X, Z \mid \theta) \right]$
      Proof:
      $F(\theta, q^t) = \sum_{Z} q^t(Z) \log \frac{P(X, Z \mid \theta)}{q^t(Z)} = \sum_{Z} q^t(Z) \log P(X, Z \mid \theta) - \sum_{Z} q^t(Z) \log q^t(Z)$
      $\Rightarrow F(\theta, q^t) = \mathbb{E}_{q^t}\left[ \log P(X, Z \mid \theta) \right] + H(q^t)$, where the entropy term $H(q^t)$ is independent of $\theta$.
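      Putting the two steps together gives the monotonicity claim from the earlier "EM algorithm" slide; this short chain is the standard argument, added here for completeness:

      ```latex
      \[
      \ell(\theta^{t+1}; X) \;\ge\; F(\theta^{t+1}, q^t) \;\ge\; F(\theta^t, q^t) \;=\; \ell(\theta^t; X)
      \]
      % The first inequality holds because F is a lower bound on \ell (Jensen),
      % the second because the M-step maximizes F over \theta,
      % and the final equality is the E-step result: with q^t = P(Z | X, \theta^t),
      % the bound is tight at \theta^t.
      ```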
