SLIDE 1

Gaussian Mixture Models & EM

CE-717: Machine Learning

Sharif University of Technology

M. Soleymani

Fall 2016

SLIDE 2

Mixture Models: definition

Mixture models: a linear superposition of mixtures (components):

$p(\mathbf{x} \mid \boldsymbol{\theta}) = \sum_{k=1}^{K} P(z_k)\, p(\mathbf{x} \mid z_k; \boldsymbol{\theta}_k), \qquad \sum_{k=1}^{K} P(z_k) = 1$

- $P(z_k)$: the prior probability of the $k$-th mixture component
- $\boldsymbol{\theta}_k$: the parameters of the $k$-th component
- $p(\mathbf{x} \mid z_k; \boldsymbol{\theta}_k)$: the probability of $\mathbf{x}$ according to the $k$-th component

A framework for building more complex probability distributions.

Goal: estimate $p(\mathbf{x} \mid \boldsymbol{\theta})$, e.g., for multi-modal density estimation.

SLIDE 3

Gaussian Mixture Models (GMMs)

Gaussian Mixture Model: each component is a Gaussian, $p(\mathbf{x} \mid z_k; \boldsymbol{\theta}_k) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$, so

$p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$

Fitting the Gaussian mixture model:
- Input: data points $\{\mathbf{x}^{(i)}\}_{i=1}^{N}$
- Goal: find the parameters of the GMM ($\pi_k, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k$ for $k = 1, \dots, K$), subject to $0 \le \pi_k \le 1$ and $\sum_{k=1}^{K} \pi_k = 1$

SLIDE 4

GMM: 1-D Example

[Figure: a 1-D mixture of three Gaussians with mixing coefficients $\pi_1 = 0.6$, $\pi_2 = 0.3$, $\pi_3 = 0.1$; the component means and variances are shown in the figure.]
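The latent-variable view of such an example suggests ancestral sampling: draw a component index with probabilities $\boldsymbol{\pi}$, then draw $x$ from that component. A sketch of this (the means and standard deviations below are illustrative assumptions, since only the mixing weights are legible on the slide):

```python
import numpy as np

rng = np.random.default_rng(0)
pis = np.array([0.6, 0.3, 0.1])      # mixing weights from the slide
mus = np.array([-4.0, 0.0, 4.0])     # illustrative means (assumed)
stds = np.array([1.0, 2.0, 0.5])     # illustrative std deviations (assumed)

# z ~ Categorical(pi), then x | z ~ N(mu_z, std_z^2)
z = rng.choice(len(pis), size=1000, p=pis)
x = rng.normal(mus[z], stds[z])
```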

SLIDE 5

GMM: 2-D Example

$K = 3$

$\boldsymbol{\mu}_1 = \begin{pmatrix} -2 \\ 3 \end{pmatrix}$, $\boldsymbol{\Sigma}_1 = \begin{pmatrix} 1 & 0.5 \\ 0.5 & 4 \end{pmatrix}$, $\pi_1 = 0.6$

$\boldsymbol{\mu}_2 = \begin{pmatrix} 0 \\ -4 \end{pmatrix}$, $\boldsymbol{\Sigma}_2 = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$, $\pi_2 = 0.25$

$\boldsymbol{\mu}_3 = \begin{pmatrix} 3 \\ 2 \end{pmatrix}$, $\boldsymbol{\Sigma}_3 = \begin{pmatrix} 3 & 1 \\ 1 & 1 \end{pmatrix}$, $\pi_3 = 0.15$

SLIDE 6

GMM: 2-D Example

$K = 3$, with the same component parameters as on Slide 5.

[Figure: the resulting GMM distribution.]

SLIDE 7

How to Fit GMM?

To fit a GMM we maximize the log-likelihood:

$\ln p(X \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \sum_{i=1}^{N} \ln \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x}^{(i)} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k), \qquad X = \{\mathbf{x}^{(1)}, \dots, \mathbf{x}^{(N)}\}$

The sum over components appears inside the log, so there is no closed-form solution for the maximum likelihood. Setting the derivatives to zero (for $l = 1, \dots, K$) gives coupled equations:

$\frac{\partial \ln p(X \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma})}{\partial \boldsymbol{\mu}_l} = \mathbf{0}, \qquad \frac{\partial \ln p(X \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma})}{\partial \boldsymbol{\Sigma}_l} = \mathbf{0}, \qquad \frac{\partial}{\partial \pi_l}\Big[\ln p(X \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) + \lambda \Big(\sum_{k=1}^{K} \pi_k - 1\Big)\Big] = 0$
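A small sketch of this objective (my addition). A production implementation would work in log space (e.g., with `scipy.special.logsumexp`) for numerical robustness, but the direct form mirrors the formula:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, pis, mus, sigmas):
    """ln p(X) = sum_i ln sum_k pi_k N(x_i | mu_k, Sigma_k), X of shape (N, d)."""
    densities = np.column_stack([
        pi * multivariate_normal.pdf(X, mean=mu, cov=sigma)
        for pi, mu, sigma in zip(pis, mus, sigmas)
    ])  # densities[i, k] = pi_k * N(x_i | mu_k, Sigma_k)
    return np.log(densities.sum(axis=1)).sum()
```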

SLIDE 8

ML for GMM

Define the responsibilities

$\gamma_l^{(i)} = \frac{\pi_l\, \mathcal{N}(\mathbf{x}^{(i)} \mid \boldsymbol{\mu}_l, \boldsymbol{\Sigma}_l)}{\sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x}^{(i)} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}, \qquad N_l = \sum_{i=1}^{N} \gamma_l^{(i)}$

Then the stationary conditions of the log-likelihood are:

$\boldsymbol{\mu}_l = \frac{1}{N_l} \sum_{i=1}^{N} \gamma_l^{(i)}\, \mathbf{x}^{(i)}$

$\boldsymbol{\Sigma}_l = \frac{1}{N_l} \sum_{i=1}^{N} \gamma_l^{(i)}\, (\mathbf{x}^{(i)} - \boldsymbol{\mu}_l^{\text{new}})(\mathbf{x}^{(i)} - \boldsymbol{\mu}_l^{\text{new}})^{T}$

$\pi_l^{\text{new}} = \frac{N_l}{N}$

(Useful matrix identities for the derivation: $\frac{\partial \ln |\mathbf{B}^{-1}|}{\partial \mathbf{B}^{-1}} = \mathbf{B}^{T}$ and $\frac{\partial\, \mathbf{x}^{T} \mathbf{B} \mathbf{x}}{\partial \mathbf{B}} = \mathbf{x}\mathbf{x}^{T}$.)

These are not closed-form solutions: the responsibilities on the right-hand side themselves depend on the parameters, which motivates the iterative EM scheme that follows.

SLIDE 9

EM algorithm

An iterative algorithm in which each iteration is guaranteed to improve the log-likelihood function.

A general algorithm for finding ML estimates when the data is incomplete (missing or unobserved data).

EM finds the maximum-likelihood parameters in cases where the model involves unobserved variables $Z$ in addition to unknown parameters $\boldsymbol{\theta}$ and known data observations $X$.

SLIDE 10

Mixture models: discrete latent variables

$p(\mathbf{x}) = \sum_{k=1}^{K} P(z_k = 1)\, p(\mathbf{x} \mid z_k = 1) = \sum_{k=1}^{K} \pi_k\, p(\mathbf{x} \mid z_k = 1)$

- $\mathbf{z}$: latent or hidden variable, specifying the mixture component
- $P(z_k = 1) = \pi_k$, with $0 \le \pi_k \le 1$ and $\sum_{k=1}^{K} \pi_k = 1$

SLIDE 11

EM for GMM

Initialize $\boldsymbol{\mu}_l, \boldsymbol{\Sigma}_l, \pi_l$ for $l = 1, \dots, K$. Here $\boldsymbol{\theta} = [\boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}]$, and the latent $z^{(i)} \in \{1, 2, \dots, K\}$ shows the mixture component from which $\mathbf{x}^{(i)}$ is generated.

E step: for $i = 1, \dots, N$ and $k = 1, \dots, K$:

$\gamma_k^{(i)} = P\big(z_k^{(i)} = 1 \mid \mathbf{x}^{(i)}, \boldsymbol{\theta}^{\text{old}}\big) = \frac{\pi_k^{\text{old}}\, \mathcal{N}(\mathbf{x}^{(i)} \mid \boldsymbol{\mu}_k^{\text{old}}, \boldsymbol{\Sigma}_k^{\text{old}})}{\sum_{l=1}^{K} \pi_l^{\text{old}}\, \mathcal{N}(\mathbf{x}^{(i)} \mid \boldsymbol{\mu}_l^{\text{old}}, \boldsymbol{\Sigma}_l^{\text{old}})}$

M step: for $k = 1, \dots, K$:

$\boldsymbol{\mu}_k^{\text{new}} = \frac{\sum_{i=1}^{N} \gamma_k^{(i)}\, \mathbf{x}^{(i)}}{\sum_{i=1}^{N} \gamma_k^{(i)}}$

$\boldsymbol{\Sigma}_k^{\text{new}} = \frac{1}{\sum_{i=1}^{N} \gamma_k^{(i)}} \sum_{i=1}^{N} \gamma_k^{(i)}\, (\mathbf{x}^{(i)} - \boldsymbol{\mu}_k^{\text{new}})(\mathbf{x}^{(i)} - \boldsymbol{\mu}_k^{\text{new}})^{T}$

$\pi_k^{\text{new}} = \frac{\sum_{i=1}^{N} \gamma_k^{(i)}}{N}$

Repeat the E and M steps until convergence.
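Putting the two steps together, a didactic NumPy/SciPy sketch (an addition, not from the slides; the random-means initialization, fixed iteration count, and the small ridge added to the covariances are choices of mine):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=100, seed=0):
    """Fit a GMM to data X (shape N x d) with EM."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    # Initialization: random data points as means, shared empirical
    # covariance, uniform mixing weights (one simple choice among many)
    mus = X[rng.choice(N, size=K, replace=False)]
    sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    pis = np.full(K, 1.0 / K)

    for _ in range(n_iters):
        # E step: responsibilities gamma[i, k]
        weighted = np.column_stack([
            pis[k] * multivariate_normal.pdf(X, mean=mus[k], cov=sigmas[k])
            for k in range(K)
        ])
        gamma = weighted / weighted.sum(axis=1, keepdims=True)

        # M step: re-estimate parameters from the responsibilities
        Nk = gamma.sum(axis=0)                 # effective counts per component
        mus = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] \
                        + 1e-6 * np.eye(d)     # small ridge for stability
        pis = Nk / N
    return pis, mus, sigmas
```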

SLIDE 12

EM & GMM: example

[Figure: EM iterations when fitting a Gaussian mixture. [Bishop]]

SLIDE 13

EM & GMM: Example

[Figure: EM iterations when fitting a Gaussian mixture, continued. [Bishop]]

SLIDE 14

Local Minima


SLIDE 15

Local Minima

True parameters (as on Slide 5): $\boldsymbol{\mu}_1 = \begin{pmatrix} -2 \\ 3 \end{pmatrix}$, $\boldsymbol{\Sigma}_1 = \begin{pmatrix} 1 & 0.5 \\ 0.5 & 4 \end{pmatrix}$, $\pi_1 = 0.6$; $\boldsymbol{\mu}_2 = \begin{pmatrix} 0 \\ -4 \end{pmatrix}$, $\boldsymbol{\Sigma}_2 = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$, $\pi_2 = 0.25$; $\boldsymbol{\mu}_3 = \begin{pmatrix} 3 \\ 2 \end{pmatrix}$, $\boldsymbol{\Sigma}_3 = \begin{pmatrix} 3 & 1 \\ 1 & 1 \end{pmatrix}$, $\pi_3 = 0.15$

Two EM runs from different initializations reach different solutions:

Run 1: $\boldsymbol{\mu}_1 = \begin{pmatrix} 0.36 \\ -4.09 \end{pmatrix}$, $\boldsymbol{\Sigma}_1 = \begin{pmatrix} 0.89 & 0.26 \\ 0.26 & 0.83 \end{pmatrix}$, $\pi_1 = 0.249$; $\boldsymbol{\mu}_2 = \begin{pmatrix} 3.25 \\ 2.09 \end{pmatrix}$, $\boldsymbol{\Sigma}_2 = \begin{pmatrix} 2.23 & 1.08 \\ 1.09 & 1.41 \end{pmatrix}$, $\pi_2 = 0.146$; $\boldsymbol{\mu}_3 = \begin{pmatrix} -2.11 \\ 3.36 \end{pmatrix}$, $\boldsymbol{\Sigma}_3 = \begin{pmatrix} 1.12 & 0.61 \\ 0.61 & 3.61 \end{pmatrix}$, $\pi_3 = 0.604$ (this run recovers the true components, up to relabeling)

Run 2: $\boldsymbol{\mu}_1 = \begin{pmatrix} 1.45 \\ -1.81 \end{pmatrix}$, $\boldsymbol{\Sigma}_1 = \begin{pmatrix} 3.30 & 4.76 \\ 4.76 & 10.01 \end{pmatrix}$, $\pi_1 = 0.392$; $\boldsymbol{\mu}_2 = \begin{pmatrix} -2.20 \\ 3.16 \end{pmatrix}$, $\boldsymbol{\Sigma}_2 = \begin{pmatrix} 1.30 & 1.10 \\ 1.10 & 2.80 \end{pmatrix}$, $\pi_2 = 0.429$; $\boldsymbol{\mu}_3 = \begin{pmatrix} -1.88 \\ 3.74 \end{pmatrix}$, $\boldsymbol{\Sigma}_3 = \begin{pmatrix} 5.83 & -0.82 \\ -0.82 & 5.83 \end{pmatrix}$, $\pi_3 = 0.178$ (this run is stuck in a poor local optimum)

SLIDE 16

EM+GMM vs. k-means

k-means:
- is not probabilistic
- has fewer parameters (and is faster)
- is limited by the underlying assumption of spherical clusters; it can be extended to use covariances, giving "hard EM" (ellipsoidal k-means)

Both EM and k-means depend on initialization and can get stuck in local optima; EM+GMM has more local minima.

Useful trick: first run k-means and then use its result to initialize EM (a sketch follows below).
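A minimal sketch of that trick (my addition), assuming scikit-learn's KMeans and clusters with at least two points each; the returned triple can seed an EM implementation such as the sketch after Slide 11:

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_init(X, K, seed=0):
    """Initialize GMM parameters from a k-means clustering of X (N x d)."""
    labels = KMeans(n_clusters=K, n_init=10, random_state=seed).fit_predict(X)
    N, d = X.shape
    pis = np.array([(labels == k).mean() for k in range(K)])       # cluster fractions
    mus = np.array([X[labels == k].mean(axis=0) for k in range(K)])  # cluster means
    sigmas = np.array([np.cov(X[labels == k].T) + 1e-6 * np.eye(d)   # cluster covariances
                       for k in range(K)])
    return pis, mus, sigmas
```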

SLIDE 17

EM algorithm: general

A general algorithm for finding ML estimates when the data is incomplete (missing or unobserved data).

SLIDE 18

Incomplete log likelihood

Complete log-likelihood:
- Maximizing the likelihood (i.e., $\log P(X, Z \mid \boldsymbol{\theta})$) for labeled (complete) data is straightforward.

Incomplete log-likelihood:
- With $Z$ unobserved, our objective becomes the log of a marginal probability: $\log P(X \mid \boldsymbol{\theta}) = \log \sum_{Z} P(X, Z \mid \boldsymbol{\theta})$
- This objective does not decouple, and we use the EM algorithm to optimize it.

SLIDE 19

EM Algorithm

Assumptions: $X$ (observed or known variables), $Z$ (unobserved or latent variables); $X$ comes from a specific model with unknown parameters $\boldsymbol{\theta}$.

If $Z$ is relevant to $X$ (in any way), we can hope to extract information about it from $X$, assuming a specific parametric model on the data.

Steps:
- Initialization: initialize the unknown parameters $\boldsymbol{\theta}$.
- Iterate the following steps until convergence:
  - Expectation step: find the probability of the unobserved variables given the current parameter estimates and the observed data.
  - Maximization step: from the observed data and the probability of the unobserved data, find the most likely parameters (a better estimate for the parameters).

SLIDE 20

EM algorithm intuition

When learning with hidden variables, we are trying to solve two problems at once:
- hypothesizing values for the unobserved variables in each data sample
- learning the parameters

Each of these tasks is fairly easy when we have the solution to the other:
- Given complete data, we have the sufficient statistics and can estimate the parameters using the MLE formulas.
- Conversely, computing the probability of the missing data given the parameters is a probabilistic inference problem.

SLIDE 21

EM algorithm


SLIDE 22

EM theoretical analysis

What is the underlying theory for the use of the expected complete log-likelihood in the M-step,

$E_{P(Z \mid X, \boldsymbol{\theta}^{\text{old}})}\big[\log P(X, Z \mid \boldsymbol{\theta})\big]\,?$

We now show that maximizing this function also maximizes the likelihood.

SLIDE 23

EM theoretical foundation: Objective function

The objective is the incomplete log-likelihood

$\ell(\boldsymbol{\theta}; X) = \log P(X \mid \boldsymbol{\theta}) = \log \sum_{Z} P(X, Z \mid \boldsymbol{\theta})$

For any distribution $q(Z)$ over the latent variables, define the auxiliary lower bound

$F(\boldsymbol{\theta}, q) = \sum_{Z} q(Z) \log \frac{P(X, Z \mid \boldsymbol{\theta})}{q(Z)}$

used in the slides that follow.

SLIDE 24

Jensen’s inequality

Jensen's inequality: for a concave function $f$ (such as $\log$), $f\big(E[u]\big) \ge E\big[f(u)\big]$. Applied with $f = \log$, it shows that $F(\boldsymbol{\theta}, q)$ defined on the previous slide is a lower bound on $\ell(\boldsymbol{\theta}; X)$ for every distribution $q$.

SLIDE 25

EM theoretical foundation: Algorithm in general form


SLIDE 26

EM theoretical foundation: E-step

$q_t = P(Z \mid X, \boldsymbol{\theta}_t) \;\Rightarrow\; q_t = \underset{q}{\arg\max}\; F(\boldsymbol{\theta}_t, q)$

Proof:

$F\big(\boldsymbol{\theta}_t, P(Z \mid X, \boldsymbol{\theta}_t)\big) = \sum_{Z} P(Z \mid X, \boldsymbol{\theta}_t) \log \frac{P(X, Z \mid \boldsymbol{\theta}_t)}{P(Z \mid X, \boldsymbol{\theta}_t)} = \sum_{Z} P(Z \mid X, \boldsymbol{\theta}_t) \log P(X \mid \boldsymbol{\theta}_t) = \log P(X \mid \boldsymbol{\theta}_t) = \ell(\boldsymbol{\theta}_t; X)$

$F(\boldsymbol{\theta}, q)$ is a lower bound on $\ell(\boldsymbol{\theta}; X)$. Thus $F(\boldsymbol{\theta}_t, q)$ has been maximized by setting $q$ to $P(Z \mid X, \boldsymbol{\theta}_t)$:

$F\big(\boldsymbol{\theta}_t, P(Z \mid X, \boldsymbol{\theta}_t)\big) = \ell(\boldsymbol{\theta}_t; X) \;\Rightarrow\; P(Z \mid X, \boldsymbol{\theta}_t) = \underset{q}{\arg\max}\; F(\boldsymbol{\theta}_t, q)$

SLIDE 27

EM algorithm: illustration

[Figure: $F(\boldsymbol{\theta}, q_t)$ is a lower bound that touches $\ell(\boldsymbol{\theta}; X)$ at $\boldsymbol{\theta}_t$; maximizing the bound over $\boldsymbol{\theta}$ yields $\boldsymbol{\theta}_{t+1}$.]

SLIDE 28

EM theoretical foundation: M-step

The M-step can be equivalently viewed as maximizing the expected complete log-likelihood:

$\boldsymbol{\theta}_{t+1} = \underset{\boldsymbol{\theta}}{\arg\max}\; F(\boldsymbol{\theta}, q_t) = \underset{\boldsymbol{\theta}}{\arg\max}\; E_{q_t}\big[\log P(X, Z \mid \boldsymbol{\theta})\big]$

Proof:

$F(\boldsymbol{\theta}, q_t) = \sum_{Z} q_t(Z) \log \frac{P(X, Z \mid \boldsymbol{\theta})}{q_t(Z)} = \sum_{Z} q_t(Z) \log P(X, Z \mid \boldsymbol{\theta}) - \sum_{Z} q_t(Z) \log q_t(Z)$

$\Rightarrow\; F(\boldsymbol{\theta}, q_t) = E_{q_t}\big[\log P(X, Z \mid \boldsymbol{\theta})\big] + H\big(q_t(Z)\big)$

where the entropy term $H\big(q_t(Z)\big)$ is independent of $\boldsymbol{\theta}$.

SLIDE 29

EM iteration increases $\ell(\boldsymbol{\theta}; X)$

$\ell(\boldsymbol{\theta}_t; X) = E_{q_t}\big[\log P(X, Z \mid \boldsymbol{\theta}_t)\big] + H\big(q_t(Z)\big)$

$\ell(\boldsymbol{\theta}_{t+1}; X) \ge E_{q_t}\big[\log P(X, Z \mid \boldsymbol{\theta}_{t+1})\big] + H\big(q_t(Z)\big)$

$\Rightarrow\; \ell(\boldsymbol{\theta}_{t+1}; X) - \ell(\boldsymbol{\theta}_t; X) \ge E_{q_t}\big[\log P(X, Z \mid \boldsymbol{\theta}_{t+1})\big] - E_{q_t}\big[\log P(X, Z \mid \boldsymbol{\theta}_t)\big]$

Moreover, since $\boldsymbol{\theta}_{t+1} = \arg\max_{\boldsymbol{\theta}} E_{q_t}\big[\log P(X, Z \mid \boldsymbol{\theta})\big]$, we have $E_{q_t}\big[\log P(X, Z \mid \boldsymbol{\theta}_{t+1})\big] \ge E_{q_t}\big[\log P(X, Z \mid \boldsymbol{\theta}_t)\big]$, and therefore $\ell(\boldsymbol{\theta}_{t+1}; X) - \ell(\boldsymbol{\theta}_t; X) \ge 0$.

EM is guaranteed to find a local maximum of the log-likelihood.

SLIDE 30


SLIDE 31

EM for GMM M step: details

$p(X, Z \mid \boldsymbol{\theta}) = \prod_{i=1}^{N} p(\mathbf{x}^{(i)}, \mathbf{z}^{(i)} \mid \boldsymbol{\theta}) = \prod_{i=1}^{N} p(\mathbf{x}^{(i)} \mid \mathbf{z}^{(i)}, \boldsymbol{\theta})\, p(\mathbf{z}^{(i)} \mid \boldsymbol{\pi}) = \prod_{i=1}^{N} \prod_{k=1}^{K} \mathcal{N}(\mathbf{x}^{(i)} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)^{z_k^{(i)}}\, \pi_k^{z_k^{(i)}}$

$\log p(X, Z \mid \boldsymbol{\theta}) = \sum_{i=1}^{N} \sum_{k=1}^{K} z_k^{(i)} \big[\log \mathcal{N}(\mathbf{x}^{(i)} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) + \log \pi_k\big]$

$E_{Z \sim P(Z \mid X, \boldsymbol{\theta}^{\text{old}})}\big[\log p(X, Z \mid \boldsymbol{\theta})\big] = \sum_{i=1}^{N} \sum_{k=1}^{K} E_{P(z_k^{(i)} \mid \mathbf{x}^{(i)}, \boldsymbol{\theta}^{\text{old}})}\big[z_k^{(i)}\big] \big[\log \mathcal{N}(\mathbf{x}^{(i)} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) + \log \pi_k\big]$

where $\boldsymbol{\theta} = [\boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}]$, $\boldsymbol{\theta}^{\text{old}} = [\boldsymbol{\pi}^{\text{old}}, \boldsymbol{\mu}^{\text{old}}, \boldsymbol{\Sigma}^{\text{old}}]$, and $E\big[z_k^{(i)}\big] = \gamma_k^{(i)}$ is the responsibility computed in the E-step.

SLIDE 32

EM for GMM M step: details

Write $Q(\boldsymbol{\theta}; \boldsymbol{\theta}^{\text{old}})$ for the expected complete log-likelihood from the previous slide. Setting its derivatives to zero gives the M-step updates:

$\frac{\partial Q(\boldsymbol{\theta}; \boldsymbol{\theta}^{\text{old}})}{\partial \boldsymbol{\mu}_k} = 0 \;\Rightarrow\; \boldsymbol{\mu}_k = \frac{\sum_{i=1}^{N} \gamma_k^{(i)}\, \mathbf{x}^{(i)}}{\sum_{i=1}^{N} \gamma_k^{(i)}}$

$\frac{\partial Q(\boldsymbol{\theta}; \boldsymbol{\theta}^{\text{old}})}{\partial \boldsymbol{\Sigma}_k} = 0 \;\Rightarrow\; \boldsymbol{\Sigma}_k = \frac{1}{\sum_{i=1}^{N} \gamma_k^{(i)}} \sum_{i=1}^{N} \gamma_k^{(i)}\, (\mathbf{x}^{(i)} - \boldsymbol{\mu}_k)(\mathbf{x}^{(i)} - \boldsymbol{\mu}_k)^{T}$

$\frac{\partial}{\partial \pi_k}\Big[Q(\boldsymbol{\theta}; \boldsymbol{\theta}^{\text{old}}) + \lambda \Big(\sum_{m=1}^{K} \pi_m - 1\Big)\Big] = 0 \;\Rightarrow\; \pi_k = \frac{\sum_{i=1}^{N} \gamma_k^{(i)}}{N}$

where $\lambda$ is a Lagrange multiplier due to the constraint $\sum_{k=1}^{K} \pi_k = 1$.

SLIDE 33

EM algorithm: general

EM: a general procedure for learning from partly observed data.

Define

$Q(\boldsymbol{\theta}; \boldsymbol{\theta}^{\text{old}}) = E_{Z \sim P(Z \mid X, \boldsymbol{\theta}^{\text{old}})}\big[\log p(X, Z \mid \boldsymbol{\theta})\big] = \sum_{Z} P(Z \mid X, \boldsymbol{\theta}^{\text{old}}) \log p(X, Z \mid \boldsymbol{\theta})$

the expectation of the complete log-likelihood evaluated using the current estimate for the parameters $\boldsymbol{\theta}^{\text{old}}$.

Choose an initial setting $\boldsymbol{\theta}^{\text{old}} = \boldsymbol{\theta}_0$ and iterate until convergence:
- E step: use $X$ and the current $\boldsymbol{\theta}^{\text{old}}$ to calculate $P(Z \mid X, \boldsymbol{\theta}^{\text{old}})$
- M step: $\boldsymbol{\theta}^{\text{new}} = \arg\max_{\boldsymbol{\theta}} Q(\boldsymbol{\theta}; \boldsymbol{\theta}^{\text{old}})$, then set $\boldsymbol{\theta}^{\text{old}} \leftarrow \boldsymbol{\theta}^{\text{new}}$
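In code form, the loop might look like the following skeleton (a sketch, not from the slides); `e_step` and `m_step` are model-specific callbacks the user supplies, as in the GMM case above:

```python
def em(X, theta0, e_step, m_step, n_iters=100, tol=1e-6):
    """Generic EM loop: alternate computing P(Z|X, theta) and maximizing Q."""
    theta = theta0
    prev_ll = -float("inf")
    for _ in range(n_iters):
        posterior, ll = e_step(X, theta)   # P(Z | X, theta) and log-likelihood
        theta = m_step(X, posterior)       # argmax_theta Q(theta; theta_old)
        if ll - prev_ll < tol:             # monotone improvement: stop when flat
            break
        prev_ll = ll
    return theta
```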

SLIDE 34

EM advantages and disadvantages

Some good things about EM:
- no learning-rate (step-size) parameter
- automatically enforces parameter constraints
- very fast for low dimensions
- each iteration is guaranteed to improve the likelihood

Some bad things about EM:
- can be slower than some other iterative gradient-based methods

SLIDE 35

Semi-supervised learning

Supervised learning models require labeled data:
- Supervised learning usually requires plenty of labeled data, which is usually expensive to obtain.
- Unlabeled data is often abundant at little or no cost.

Semi-supervised learning: learning from both labeled and unlabeled data.
- Labeled training data: $\mathcal{L} = \{(\mathbf{x}^{(i)}, z^{(i)})\}_{i=1}^{L}$
- Unlabeled data available during training: $\mathcal{U} = \{\mathbf{x}^{(i)}\}_{i=L+1}^{L+U}$

SLIDE 36

Semi-supervised learning: example


Zhu, Semi-Supervised Learning Tutorial, ICML, 2007.

SLIDE 37


Zhu, Semi-Supervised Learning Tutorial, ICML, 2007.

SLIDE 38

Semi-supervised generative model

Start from the MLE $\boldsymbol{\theta} = [\boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}]$ on the labeled data $\mathcal{L} = \{(\mathbf{x}^{(i)}, z^{(i)})\}_{i=1}^{L}$.

Repeat:
- E-step: compute $p(z^{(i)} \mid \mathbf{x}^{(i)}, \boldsymbol{\theta})$ for the unlabeled points, $i = L+1, \dots, L+U$.
- M-step: re-estimate the parameters $\boldsymbol{\theta} = [\boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}]$ from both the labeled data and the unlabeled data, using the distribution found over their labels in the E-step (a sketch follows below).
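A rough sketch of one such iteration for a GMM (my addition, not from the slides): labeled points get one-hot responsibilities since their component is observed, while unlabeled points get posterior responsibilities; labels in `z_lab` are assumed to be integers in $\{0, \dots, K-1\}$:

```python
import numpy as np
from scipy.stats import multivariate_normal

def semi_supervised_em_step(X_lab, z_lab, X_unl, pis, mus, sigmas):
    """One EM iteration over labeled (X_lab, z_lab) and unlabeled X_unl data."""
    K = len(pis)
    # Labeled points: one-hot responsibilities (their component is observed)
    gamma_lab = np.eye(K)[z_lab]
    # Unlabeled points: E-step posterior responsibilities
    weighted = np.column_stack([
        pis[k] * multivariate_normal.pdf(X_unl, mean=mus[k], cov=sigmas[k])
        for k in range(K)
    ])
    gamma_unl = weighted / weighted.sum(axis=1, keepdims=True)
    # M-step on the pooled data
    X = np.vstack([X_lab, X_unl])
    gamma = np.vstack([gamma_lab, gamma_unl])
    Nk = gamma.sum(axis=0)
    mus = (gamma.T @ X) / Nk[:, None]
    d = X.shape[1]
    sigmas = np.array([
        (gamma[:, k, None] * (X - mus[k])).T @ (X - mus[k]) / Nk[k]
        + 1e-6 * np.eye(d)   # small ridge for numerical stability (my choice)
        for k in range(K)
    ])
    pis = Nk / len(X)
    return pis, mus, sigmas
```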

SLIDE 39

Resource

C. Bishop, "Pattern Recognition and Machine Learning", Chapter 9.