Gaussian Mixture Models & EM CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2016
Mixture Models: definition
Mixture models: a linear superposition of component densities
$p(\mathbf{x} \mid \theta) = \sum_{k=1}^{K} P(c_k)\, p(\mathbf{x} \mid c_k; \theta_k)$, with $\sum_{k=1}^{K} P(c_k) = 1$
$P(c_k)$: the prior probability of the $k$-th mixture component
$\theta_k$: the parameters of the $k$-th component
$p(\mathbf{x} \mid c_k; \theta_k)$: the probability of $\mathbf{x}$ under the $k$-th component
A framework for building more complex probability distributions.
Goal: estimate $p(\mathbf{x} \mid \theta)$, e.g., for multi-modal density estimation.
Gaussian Mixture Models (GMMs)
Gaussian mixture model: each component is a Gaussian, $p(\mathbf{x} \mid c_k; \theta_k) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$
$p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$, with $0 \le \pi_k \le 1$ and $\sum_{k=1}^{K} \pi_k = 1$
Fitting the Gaussian mixture model
Input: data points $\{\mathbf{x}^{(n)}\}_{n=1}^{N}$
Goal: find the parameters of the GMM ($\pi_k, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k$ for $k = 1, \dots, K$)
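The mixture density above is easy to evaluate once the parameters are fixed. Below is a minimal NumPy/SciPy sketch (the helper name gmm_density is illustrative, not from the slides); the example parameters are those of the 2-D example on the following slides.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_density(X, pis, mus, Sigmas):
    """Evaluate p(x) = sum_k pi_k * N(x | mu_k, Sigma_k) at each row of X."""
    X = np.atleast_2d(X)
    density = np.zeros(len(X))
    for pi_k, mu_k, Sigma_k in zip(pis, mus, Sigmas):
        density += pi_k * multivariate_normal(mean=mu_k, cov=Sigma_k).pdf(X)
    return density

# Parameters of the 2-D example mixture used later in the slides
pis = [0.6, 0.25, 0.15]
mus = [np.array([-2.0, 3.0]), np.array([0.0, -4.0]), np.array([3.0, 2.0])]
Sigmas = [np.array([[1.0, 0.5], [0.5, 4.0]]),
          np.eye(2),
          np.array([[3.0, 1.0], [1.0, 1.0]])]

print(gmm_density(np.array([[0.0, 0.0], [-2.0, 3.0]]), pis, mus, Sigmas))
```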
GMM: 1-D Example
[Figure: a one-dimensional Gaussian mixture with three components; the mixing coefficients are $\pi_1 = 0.6$, $\pi_2 = 0.3$, $\pi_3 = 0.1$.]
GMM: 2-D Example ($K = 3$ components)
$\boldsymbol{\mu}_1 = \begin{bmatrix} -2 \\ 3 \end{bmatrix}$, $\boldsymbol{\Sigma}_1 = \begin{bmatrix} 1 & 0.5 \\ 0.5 & 4 \end{bmatrix}$, $\pi_1 = 0.6$
$\boldsymbol{\mu}_2 = \begin{bmatrix} 0 \\ -4 \end{bmatrix}$, $\boldsymbol{\Sigma}_2 = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$, $\pi_2 = 0.25$
$\boldsymbol{\mu}_3 = \begin{bmatrix} 3 \\ 2 \end{bmatrix}$, $\boldsymbol{\Sigma}_3 = \begin{bmatrix} 3 & 1 \\ 1 & 1 \end{bmatrix}$, $\pi_3 = 0.15$
GMM: 2-D Example (GMM distribution)
Same parameters as the previous slide ($K = 3$): $\boldsymbol{\mu}_1 = (-2, 3)^T$, $\boldsymbol{\Sigma}_1 = \begin{bmatrix} 1 & 0.5 \\ 0.5 & 4 \end{bmatrix}$, $\pi_1 = 0.6$; $\boldsymbol{\mu}_2 = (0, -4)^T$, $\boldsymbol{\Sigma}_2 = I$, $\pi_2 = 0.25$; $\boldsymbol{\mu}_3 = (3, 2)^T$, $\boldsymbol{\Sigma}_3 = \begin{bmatrix} 3 & 1 \\ 1 & 1 \end{bmatrix}$, $\pi_3 = 0.15$
[Figure: contour/surface plot of the resulting mixture density.]
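To make the example concrete, here is a minimal sketch of ancestral sampling from this mixture (pick a component according to $\boldsymbol{\pi}$, then draw from that Gaussian); the function name sample_gmm is illustrative, not part of the slides.

```python
import numpy as np

def sample_gmm(n_samples, pis, mus, Sigmas, rng=None):
    """Ancestral sampling: k ~ Categorical(pi), then x ~ N(mu_k, Sigma_k)."""
    rng = np.random.default_rng(rng)
    ks = rng.choice(len(pis), size=n_samples, p=pis)
    X = np.stack([rng.multivariate_normal(mus[k], Sigmas[k]) for k in ks])
    return X, ks

pis = [0.6, 0.25, 0.15]
mus = [[-2.0, 3.0], [0.0, -4.0], [3.0, 2.0]]
Sigmas = [[[1.0, 0.5], [0.5, 4.0]], [[1.0, 0.0], [0.0, 1.0]], [[3.0, 1.0], [1.0, 1.0]]]

X, z = sample_gmm(500, pis, mus, Sigmas, rng=0)
print(X.shape, np.bincount(z) / len(z))   # component frequencies roughly (0.6, 0.25, 0.15)
```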
How to Fit a GMM?
We want to maximize the log-likelihood of the data $\mathbf{X} = \{\mathbf{x}^{(1)}, \dots, \mathbf{x}^{(N)}\}$:
$\ln p(\mathbf{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \sum_{n=1}^{N} \ln \left( \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x}^{(n)} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right)$
The sum over components appears inside the log, so there is no closed-form maximum-likelihood solution. Setting the derivatives to zero, for $k = 1, \dots, K$ (with a Lagrange multiplier for the constraint $\sum_k \pi_k = 1$):
$\dfrac{\partial}{\partial \boldsymbol{\mu}_k} \ln p(\mathbf{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \mathbf{0}$
$\dfrac{\partial}{\partial \boldsymbol{\Sigma}_k} \ln p(\mathbf{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \mathbf{0}$
$\dfrac{\partial}{\partial \pi_k} \left[ \ln p(\mathbf{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) + \lambda \left( \sum_{k=1}^{K} \pi_k - 1 \right) \right] = 0$
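Although there is no closed-form maximizer, the objective itself is cheap to evaluate. A minimal sketch, computed in log space for numerical stability (the helper name gmm_log_likelihood is illustrative):

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, pis, mus, Sigmas):
    """ln p(X) = sum_n ln sum_k pi_k N(x_n | mu_k, Sigma_k)."""
    log_terms = np.column_stack([
        np.log(pi_k) + multivariate_normal(mean=mu_k, cov=Sigma_k).logpdf(X)
        for pi_k, mu_k, Sigma_k in zip(pis, mus, Sigmas)
    ])                                     # shape (N, K)
    return logsumexp(log_terms, axis=1).sum()
```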
ML for GMM
Setting the derivatives to zero gives coupled equations in terms of the responsibilities $\gamma_k^{(n)} = \dfrac{\pi_k \, \mathcal{N}(\mathbf{x}^{(n)} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(\mathbf{x}^{(n)} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}$:
$\boldsymbol{\mu}_k = \dfrac{1}{N_k} \sum_{n=1}^{N} \gamma_k^{(n)} \, \mathbf{x}^{(n)}$
$\boldsymbol{\Sigma}_k = \dfrac{1}{N_k} \sum_{n=1}^{N} \gamma_k^{(n)} \, (\mathbf{x}^{(n)} - \boldsymbol{\mu}_k^{\text{new}})(\mathbf{x}^{(n)} - \boldsymbol{\mu}_k^{\text{new}})^T$
$\pi_k^{\text{new}} = \dfrac{N_k}{N}$, where $N_k = \sum_{n=1}^{N} \gamma_k^{(n)}$
Useful matrix derivatives: $\dfrac{\partial \, \mathbf{x}^T \mathbf{B} \mathbf{x}}{\partial \mathbf{B}} = \mathbf{x}\mathbf{x}^T$, $\quad \dfrac{\partial \ln |\mathbf{B}^{-1}|}{\partial \mathbf{B}^{-1}} = \mathbf{B}^T$
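As a sanity check of the responsibility formula, a minimal NumPy sketch (the helper name responsibilities is illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, pis, mus, Sigmas):
    """gamma[n, k] = pi_k N(x_n | mu_k, Sigma_k) / sum_j pi_j N(x_n | mu_j, Sigma_j)."""
    weighted = np.column_stack([
        pi_k * multivariate_normal(mean=mu_k, cov=Sigma_k).pdf(X)
        for pi_k, mu_k, Sigma_k in zip(pis, mus, Sigmas)
    ])                                     # shape (N, K)
    return weighted / weighted.sum(axis=1, keepdims=True)

# N_k, the effective number of points per component, is then gamma.sum(axis=0).
```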
EM algorithm
An iterative algorithm in which each iteration is guaranteed not to decrease the log-likelihood function.
A general algorithm for finding ML estimates when the data is incomplete (missing or unobserved data).
EM finds the maximum-likelihood parameters in cases where the model involves unobserved variables $Z$ in addition to unknown parameters $\theta$ and observed data $X$.
Mixture models: discrete latent variables
$p(\mathbf{x}) = \sum_{k=1}^{K} P(z_k = 1) \, p(\mathbf{x} \mid z_k = 1) = \sum_{k=1}^{K} \pi_k \, p(\mathbf{x} \mid z_k = 1)$
$\mathbf{z}$: a latent (hidden) variable that specifies the mixture component, with $P(z_k = 1) = \pi_k$, $0 \le \pi_k \le 1$, $\sum_{k=1}^{K} \pi_k = 1$
EM for GMM
Notation: $\theta = [\boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}]$; $z^{(n)} \in \{1, 2, \dots, K\}$ indicates the mixture component from which $\mathbf{x}^{(n)}$ is generated.
Initialize $\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k, \pi_k$ for $k = 1, \dots, K$
E-step (for $n = 1, \dots, N$ and $k = 1, \dots, K$):
$\gamma_k^{(n)} = P(z^{(n)} = k \mid \mathbf{x}^{(n)}, \theta^{\text{old}}) = \dfrac{\pi_k^{\text{old}} \, \mathcal{N}(\mathbf{x}^{(n)} \mid \boldsymbol{\mu}_k^{\text{old}}, \boldsymbol{\Sigma}_k^{\text{old}})}{\sum_{j=1}^{K} \pi_j^{\text{old}} \, \mathcal{N}(\mathbf{x}^{(n)} \mid \boldsymbol{\mu}_j^{\text{old}}, \boldsymbol{\Sigma}_j^{\text{old}})}$
M-step (for $k = 1, \dots, K$):
$\boldsymbol{\mu}_k^{\text{new}} = \dfrac{\sum_{n=1}^{N} \gamma_k^{(n)} \, \mathbf{x}^{(n)}}{\sum_{n=1}^{N} \gamma_k^{(n)}}$
$\boldsymbol{\Sigma}_k^{\text{new}} = \dfrac{\sum_{n=1}^{N} \gamma_k^{(n)} (\mathbf{x}^{(n)} - \boldsymbol{\mu}_k^{\text{new}})(\mathbf{x}^{(n)} - \boldsymbol{\mu}_k^{\text{new}})^T}{\sum_{n=1}^{N} \gamma_k^{(n)}}$
$\pi_k^{\text{new}} = \dfrac{1}{N} \sum_{n=1}^{N} \gamma_k^{(n)}$
Repeat the E and M steps until convergence. (A code sketch follows this slide.)
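A minimal NumPy/SciPy sketch of this loop (the function name em_gmm, the naive initialization, and the simple convergence test are illustrative choices, not prescribed by the slides):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=100, tol=1e-6, rng=0):
    """EM for a Gaussian mixture; returns (pi, mu, Sigma, final log-likelihood)."""
    rng = np.random.default_rng(rng)
    N, d = X.shape
    # Naive initialization: K random data points as means, the data covariance
    # for every component, uniform mixing weights (k-means init works better).
    mus = X[rng.choice(N, size=K, replace=False)].copy()
    Sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    pis = np.full(K, 1.0 / K)
    prev_ll = -np.inf
    for _ in range(n_iters):
        # E-step: responsibilities gamma[n, k] = P(z_n = k | x_n, theta_old)
        weighted = np.column_stack([
            pis[k] * multivariate_normal(mus[k], Sigmas[k]).pdf(X) for k in range(K)
        ])
        ll = np.sum(np.log(weighted.sum(axis=1)))      # incomplete log-likelihood
        gamma = weighted / weighted.sum(axis=1, keepdims=True)
        # M-step: weighted maximum-likelihood estimates
        Nk = gamma.sum(axis=0)                         # effective count per component
        mus = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            Sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
        pis = Nk / N
        if ll - prev_ll < tol:                         # the log-likelihood never decreases
            break
        prev_ll = ll
    return pis, mus, Sigmas, ll
```

Running it on samples from the 2-D example (e.g. X from the sample_gmm sketch) with K = 3 typically recovers parameters close to the generating ones, up to a permutation of the components and the local-optima issue discussed below.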
EM & GMM: example [Bishop]
EM & GMM: example (continued) [Bishop]
Local Minima
Local Minima
True parameters (the same mixture as the 2-D example): $\boldsymbol{\mu}_1 = (-2, 3)^T$, $\boldsymbol{\Sigma}_1 = \begin{bmatrix} 1 & 0.5 \\ 0.5 & 4 \end{bmatrix}$, $\pi_1 = 0.6$; $\boldsymbol{\mu}_2 = (0, -4)^T$, $\boldsymbol{\Sigma}_2 = I$, $\pi_2 = 0.25$; $\boldsymbol{\mu}_3 = (3, 2)^T$, $\boldsymbol{\Sigma}_3 = \begin{bmatrix} 3 & 1 \\ 1 & 1 \end{bmatrix}$, $\pi_3 = 0.15$
EM solution from one initialization (close to the true parameters, up to relabeling of the components):
$\boldsymbol{\mu}_1 = (0.36, -4.09)^T$, $\boldsymbol{\Sigma}_1 = \begin{bmatrix} 0.89 & 0.26 \\ 0.26 & 0.83 \end{bmatrix}$, $\pi_1 = 0.249$; $\boldsymbol{\mu}_2 = (3.25, 2.09)^T$, $\boldsymbol{\Sigma}_2 = \begin{bmatrix} 2.23 & 1.08 \\ 1.09 & 1.41 \end{bmatrix}$, $\pi_2 = 0.146$; $\boldsymbol{\mu}_3 = (-2.11, 3.36)^T$, $\boldsymbol{\Sigma}_3 = \begin{bmatrix} 1.12 & 0.61 \\ 0.61 & 3.61 \end{bmatrix}$, $\pi_3 = 0.604$
EM solution from another initialization (a different local optimum):
$\boldsymbol{\mu}_1 = (1.45, -1.81)^T$, $\boldsymbol{\Sigma}_1 = \begin{bmatrix} 3.30 & 4.76 \\ 4.76 & 10.01 \end{bmatrix}$, $\pi_1 = 0.392$; $\boldsymbol{\mu}_2 = (-2.20, 3.16)^T$, $\boldsymbol{\Sigma}_2 = \begin{bmatrix} 1.30 & 1.10 \\ 1.10 & 2.80 \end{bmatrix}$, $\pi_2 = 0.429$; $\boldsymbol{\mu}_3 = (-1.88, 3.74)^T$, $\boldsymbol{\Sigma}_3 = \begin{bmatrix} 5.83 & -0.82 \\ -0.82 & 5.83 \end{bmatrix}$, $\pi_3 = 0.178$
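One common mitigation is to run EM from several initializations and keep the run with the highest log-likelihood. A small sketch, assuming the em_gmm and sample_gmm helpers from the earlier sketches and the example parameters pis, mus, Sigmas are in scope:

```python
# Draw data from the example mixture, then run EM from 10 different random seeds.
X, _ = sample_gmm(1000, pis, mus, Sigmas, rng=0)
runs = [em_gmm(X, K=3, rng=seed) for seed in range(10)]
best_pis, best_mus, best_Sigmas, best_ll = max(runs, key=lambda r: r[-1])
print(sorted(round(r[-1], 1) for r in runs))   # usually not all equal: different local optima
```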
EM+GMM vs. k-means
k-means:
is not probabilistic
has fewer parameters (and is faster)
is limited by the underlying assumption of spherical clusters
can be extended to use covariances, giving a "hard EM" (ellipsoidal k-means)
Both EM and k-means depend on initialization and can get stuck in local optima; EM+GMM has more local minima.
Useful trick: first run k-means and then use its result to initialize EM (see the sketch after this slide).
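A minimal sketch of that trick, assuming scikit-learn is available (note that scikit-learn's GaussianMixture already uses a k-means-based initialization by default, so passing the centers explicitly as below is mainly illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# X: an (N, d) data matrix, e.g. samples drawn from the 2-D example mixture.
X = np.random.default_rng(0).normal(size=(500, 2))   # placeholder data

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
gmm = GaussianMixture(n_components=3,
                      means_init=km.cluster_centers_,  # start EM from the k-means centers
                      random_state=0).fit(X)
print(gmm.weights_, gmm.means_)
```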
EM algorithm: general
A general algorithm for finding ML estimates when the data is incomplete (missing or unobserved data).
Incomplete vs. complete log-likelihood
Complete log-likelihood: maximizing the likelihood $\log P(X, Z \mid \theta)$ for fully observed (labeled) data is straightforward.
Incomplete log-likelihood: with $Z$ unobserved, the objective becomes the log of a marginal probability,
$\log P(X \mid \theta) = \log \sum_{Z} P(X, Z \mid \theta)$
This objective does not decouple, so we use the EM algorithm to optimize it.
EM Algorithm
Assumptions: $X$ (observed or known variables), $Z$ (unobserved or latent variables); the data come from a specific parametric model with unknown parameters $\theta$.
If $Z$ is relevant to $X$ (in any way), we can hope to extract information about it from $X$, assuming a specific parametric model on the data.
Steps:
Initialization: initialize the unknown parameters $\theta$.
Iterate the following steps until convergence:
Expectation step: find the probability of the unobserved variables given the current parameter estimates and the observed data.
Maximization step: from the observed data and the probability of the unobserved data, find the most likely parameters (a better estimate of the parameters).
EM algorithm intuition
When learning with hidden variables, we are trying to solve two problems at once:
hypothesizing values for the unobserved variables in each data sample
learning the parameters
Each of these tasks is fairly easy when we have the solution to the other: given complete data, we have the sufficient statistics and can estimate the parameters using the MLE formulas; conversely, computing the probability of the missing data given the parameters is a probabilistic inference problem.
EM algorithm
EM theoretical analysis
What is the underlying theory justifying the use of the expected complete log-likelihood in the M-step?
$\mathbb{E}_{P(Z \mid X, \theta^{\text{old}})} \left[ \log P(X, Z \mid \theta) \right]$
We now show that maximizing this function also maximizes the likelihood.
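For the GMM case this quantity takes a concrete form (a standard result included here to connect with the earlier update equations; it is not spelled out on the slide):

$$
\mathbb{E}_{P(Z \mid X, \theta^{\text{old}})}\!\left[\log P(X, Z \mid \theta)\right]
  = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma_k^{(n)}
    \left[ \log \pi_k + \log \mathcal{N}\!\left(\mathbf{x}^{(n)} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\right) \right],
\qquad
\gamma_k^{(n)} = P\!\left(z^{(n)} = k \mid \mathbf{x}^{(n)}, \theta^{\text{old}}\right).
$$

Maximizing this expression over $\boldsymbol{\mu}_k$, $\boldsymbol{\Sigma}_k$, and $\pi_k$ (subject to $\sum_k \pi_k = 1$) yields exactly the weighted-average M-step updates given earlier.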
EM theoretical foundation: objective function
For any distribution $q(Z)$ over the latent variables, the log-likelihood can be lower-bounded (by Jensen's inequality, next slide):
$\ell(\theta; X) = \log P(X \mid \theta) = \log \sum_{Z} q(Z) \dfrac{P(X, Z \mid \theta)}{q(Z)} \ge \sum_{Z} q(Z) \log \dfrac{P(X, Z \mid \theta)}{q(Z)} \equiv F(\theta, q)$
EM performs coordinate ascent on the auxiliary function $F(\theta, q)$.
Jensen's inequality
For a concave function $f$ (such as $\log$): $f(\mathbb{E}[x]) \ge \mathbb{E}[f(x)]$, i.e., $\log \sum_i \lambda_i x_i \ge \sum_i \lambda_i \log x_i$ for $\lambda_i \ge 0$, $\sum_i \lambda_i = 1$.
This is what gives the lower bound $F(\theta, q) \le \ell(\theta; X)$ on the previous slide.
EM theoretical foundation: algorithm in general form
EM is coordinate ascent on $F(\theta, q)$:
E-step: $q^{t} = \operatorname{argmax}_{q} F(\theta^{t}, q)$
M-step: $\theta^{t+1} = \operatorname{argmax}_{\theta} F(\theta, q^{t})$
EM theoretical foundation: E-step
Claim: $q^{t} = P(Z \mid X, \theta^{t})$ satisfies $q^{t} = \operatorname{argmax}_{q} F(\theta^{t}, q)$.
Proof:
$F(\theta^{t}, P(Z \mid X, \theta^{t})) = \sum_{Z} P(Z \mid X, \theta^{t}) \log \dfrac{P(X, Z \mid \theta^{t})}{P(Z \mid X, \theta^{t})} = \sum_{Z} P(Z \mid X, \theta^{t}) \log P(X \mid \theta^{t}) = \log P(X \mid \theta^{t}) = \ell(\theta^{t}; X)$
Since $F(\theta, q)$ is a lower bound on $\ell(\theta; X)$, $F(\theta^{t}, q)$ is maximized by setting $q$ to $P(Z \mid X, \theta^{t})$:
$F(\theta^{t}, P(Z \mid X, \theta^{t})) = \ell(\theta^{t}; X) \;\Rightarrow\; P(Z \mid X, \theta^{t}) = \operatorname{argmax}_{q} F(\theta^{t}, q)$
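Equivalently (a standard decomposition, e.g. Bishop Ch. 9, stated here for completeness rather than taken from the slide), the gap between the bound and the log-likelihood is a KL divergence, which makes the E-step result immediate:

$$
\ell(\theta; X) = F(\theta, q) + \mathrm{KL}\!\left(q(Z) \,\|\, P(Z \mid X, \theta)\right),
\qquad \mathrm{KL} \ge 0,
$$

so for fixed $\theta^{t}$ the bound $F(\theta^{t}, q)$ is maximized, and the gap closed, exactly when $q(Z) = P(Z \mid X, \theta^{t})$.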
EM algorithm: illustration
[Figure: the lower bound $F(\theta, q^{t})$ touches the log-likelihood $\ell(\theta; X)$ at $\theta^{t}$; maximizing the bound over $\theta$ gives $\theta^{t+1}$.]
EM theoretical foundation: M-step
The M-step can be equivalently viewed as maximizing the expected complete log-likelihood:
$\theta^{t+1} = \operatorname{argmax}_{\theta} F(\theta, q^{t}) = \operatorname{argmax}_{\theta} \mathbb{E}_{q^{t}} \left[ \log P(X, Z \mid \theta) \right]$
Proof:
$F(\theta, q^{t}) = \sum_{Z} q^{t}(Z) \log \dfrac{P(X, Z \mid \theta)}{q^{t}(Z)} = \sum_{Z} q^{t}(Z) \log P(X, Z \mid \theta) - \sum_{Z} q^{t}(Z) \log q^{t}(Z)$
$\Rightarrow F(\theta, q^{t}) = \mathbb{E}_{q^{t}} \left[ \log P(X, Z \mid \theta) \right] + H(q^{t})$, where the entropy term $H(q^{t})$ is independent of $\theta$.