  1. Applied Machine Learning Expectation Maximization for Mixture of Gaussians Siamak Ravanbakhsh COMP 551 (Fall 2020)

  2. Learning objectives
     - what is a latent variable model?
     - the Gaussian mixture model
     - the intuition behind the Expectation-Maximization algorithm
     - its relationship to k-means

  3. Probabilistic modeling so far...
     given data D = {x^(1), …, x^(N)}, we posit a model p(x; θ), e.g., a multivariate Gaussian or Bernoulli;
     or, if we have labels, D = {(x^(1), y^(1)), …, (x^(N), y^(N))} and p(x, y; θ) = p(y; θ) p(x ∣ y; θ)
     (the generative models we saw for classification).
     Learning: we used maximum likelihood to fit the data, θ̂ = arg max_θ ∑_n log p(x^(n), y^(n); θ),
     or Bayesian inference, p(θ ∣ D) ∝ p(θ) p(D ∣ θ).
     e.g., we used this to fit the naive Bayes model.

  4. Latent variable models
     sometimes we do not observe all the variables that we wish to model; these are called hidden variables or latent variables.
     data: D = {x^(1), …, x^(N)} is partial or incomplete.
     model: p(x, z; θ) accounts for both observed (x) and latent (z) variables.
     examples of latent variables:
     - bias (unobserved) leading to a hiring practice (observed)
     - a 3D scene (unobserved) producing a 2D photograph (observed)
     - gravity (unobserved) leading to an apple falling (observed)
     - genotype (unobserved) leading to some phenotype (observed)
     - input features (observed) having some unobserved class labels
     - ...

  5. Latent variable models
     sometimes we do not observe all the variables that we wish to model; these are called hidden variables or latent variables.
     data: D = {x^(1), …, x^(N)} is partial or incomplete.
     model: often we factor it as p(x, z; θ) = p(z; θ) p(x ∣ z; θ).
     this gives us a lot of flexibility in modeling the data: we can find hidden factors and learn how they lead to our observations.
     it is both a natural and powerful way to model complex observations,
     but it is difficult to "learn" the model from partial observations.

  6. Latent variable models
     sometimes we do not observe all the variables that we wish to model; these are called hidden variables or latent variables.
     data: D = {x^(1), …, x^(N)} is partial or incomplete.
     model: often we factor it as p(x, z; θ) = p(z; θ) p(x ∣ z; θ).
     if the latent variable is the class label, this resembles generative classification, p(x, y; θ) = p(y; θ) p(x ∣ y; θ),
     but here we don't observe the labels.
     we saw that clustering performs classification without having labels,
     so can we use latent variable models for clustering?

  7. Mixture models
     suppose the latent variable has a categorical distribution (an unobserved class label):
     p(x, z; θ, π) = p(z; π) p(x ∣ z; θ),  where p(z; π) = Categorical(z; π).
     we only observe x, so we marginalize out z to get the data distribution:
     p(x; θ, π) = ∑_k Categorical(z = k; π) p(x ∣ z = k; θ) = ∑_k π_k p(x ∣ z = k; θ_k)
     the π_k are the mixture weights: each datapoint x comes from component p(x ∣ z = k; θ_k) with probability π_k.
     the marginal over the observed variables is a mixture of K distributions;
     let's consider the case where each component is a Gaussian.

  8. Mixture of Gaussians
     model the data as a mixture of K Gaussian distributions:
     p(x; π, {μ_k, Σ_k}) = ∑_k π_k N(x; μ_k, Σ_k)
     [figure: a Gaussian mixture model for D = 2]
     we can calculate the probability of each datapoint x^(n) belonging to cluster k,
     also called the responsibility of cluster k for datapoint x^(n):
     p(z^(n) = k ∣ x^(n)) = π_k N(x^(n); μ_k, Σ_k) / ∑_c π_c N(x^(n); μ_c, Σ_c)
     i.e., the weighted density of the k'th Gaussian at x^(n) divided by the density of the whole mixture at that point.
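As a concrete illustration of the responsibility formula (not part of the slides), here is a minimal numpy/scipy sketch; the names X, pis, mus and Sigmas are assumed containers for the data and the model parameters.

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, pis, mus, Sigmas):
    """Soft memberships r[n, k] = p(z = k | x^(n)) under a Gaussian mixture.

    X: (N, D) data; pis: (K,) mixture weights; mus: (K, D) means; Sigmas: (K, D, D) covariances.
    """
    N, K = X.shape[0], len(pis)
    weighted = np.zeros((N, K))
    for k in range(K):
        # weighted density of the k'th Gaussian at every data point
        weighted[:, k] = pis[k] * multivariate_normal.pdf(X, mean=mus[k], cov=Sigmas[k])
    # divide by the density of the whole mixture at each point
    return weighted / weighted.sum(axis=1, keepdims=True)
```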

  9. Mixture of Gaussians
     [figure: visualizing samples from the joint distribution p(x, z); colors show the value of z]
     complete data (we have both x and z): sample z^(n) ∼ p(z; π), then x^(n) ∼ p(x ∣ z^(n); θ).
     incomplete data (we only have x): the marginal distribution is p(x) = ∑_k p(x ∣ z = k) p(z = k),
     and the responsibilities are p(z = k ∣ x^(n)) = π_k N(x^(n); μ_k, Σ_k) / ∑_c π_c N(x^(n); μ_c, Σ_c).
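One way to generate such complete-data samples is ancestral sampling: draw the cluster label, then draw the point from that cluster's Gaussian. A short sketch, where the K = 2, D = 2 parameter values are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# illustrative parameters (not from the slides): two 2-D Gaussian components
pis = np.array([0.4, 0.6])
mus = np.array([[0.0, 0.0], [3.0, 3.0]])
Sigmas = np.array([np.eye(2), 0.5 * np.eye(2)])

N = 500
z = rng.choice(len(pis), size=N, p=pis)                                # z^(n) ~ Categorical(pi)
x = np.array([rng.multivariate_normal(mus[k], Sigmas[k]) for k in z])  # x^(n) ~ N(mu_z, Sigma_z)
# (x, z) is the complete data; discarding z leaves the incomplete data x
```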

  10. Clustering with a Gaussian mixture
      mixture of Gaussians: p(x; π, {μ_k, Σ_k}) = ∑_k π_k N(x; μ_k, Σ_k)
      we can calculate the probability of each datapoint belonging to cluster k:
      r_{n,k} = p(z = k ∣ x^(n)) = π_k N(x^(n); μ_k, Σ_k) / ∑_c π_c N(x^(n); μ_c, Σ_c)
      this is a probabilistic alternative to K-means:
      - Gaussian mixture: soft cluster memberships (responsibilities) r_{n,k} = p(z = k ∣ x^(n)), cluster means μ_k, cluster covariance matrices Σ_k
      - K-means: hard cluster memberships r_{n,k} ∈ {0, 1}, cluster means μ_k

  11. Learning the Gaussian mixture model
      maximize the marginal likelihood of the observations under our model:
      ℓ(π, {μ_k, Σ_k}) = ∑_n log ( ∑_k π_k N(x^(n); μ_k, Σ_k) )
      set the derivatives to zero (see our references for the step-by-step calculation):
      ∂ℓ/∂μ_k = 0  ⇒  μ_k = (1 / ∑_n r_{n,k}) ∑_n r_{n,k} x^(n)   (weighted mean)
      where the weight r_{n,k} = p(z = k ∣ x^(n)) is the responsibility, the probability of sample (n) belonging to cluster k
      ∂ℓ/∂Σ_k = 0  ⇒  Σ_k = (1 / ∑_n r_{n,k}) ∑_n r_{n,k} (x^(n) − μ_k)(x^(n) − μ_k)^⊤   (weighted covariance)
      ∂ℓ/∂π_k = 0  ⇒  π_k = (∑_n r_{n,k}) / N   (the total amount of responsibility accepted by cluster k)
      problem: the model parameters depend on the responsibilities, and the responsibilities depend on the model parameters.
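These closed-form updates translate directly into a few lines of numpy. A minimal sketch of this M-step, assuming the responsibilities r (shape (N, K)) have already been computed as in the earlier sketch:

```python
import numpy as np

def m_step(X, r):
    """Closed-form GMM parameter updates given responsibilities r[n, k].

    X: (N, D) data; r: (N, K) responsibilities. Returns (pis, mus, Sigmas).
    """
    N, D = X.shape
    K = r.shape[1]
    Nk = r.sum(axis=0)                 # total responsibility accepted by each cluster
    pis = Nk / N                       # mixture weights
    mus = (r.T @ X) / Nk[:, None]      # weighted means
    Sigmas = np.zeros((K, D, D))
    for k in range(K):
        diff = X - mus[k]
        Sigmas[k] = (r[:, k, None] * diff).T @ diff / Nk[k]   # weighted covariance
    return pis, mus, Sigmas
```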

  12. Expectation Maximization algorithm for the Gaussian mixture
      solution: iteratively update both the parameters and the responsibilities until convergence.
      start from some initial model {μ_k, Σ_k}, π_k
      repeat until convergence:
        expectation step: update the responsibilities given the model, ∀ n, k:
          r_{n,k} ← π_k N(x^(n); μ_k, Σ_k) / ∑_c π_c N(x^(n); μ_c, Σ_c)
        maximization step: update the model given the responsibilities, ∀ k:
          μ_k ← (1 / ∑_n r_{n,k}) ∑_n r_{n,k} x^(n)
          Σ_k ← (1 / ∑_n r_{n,k}) ∑_n r_{n,k} (x^(n) − μ_k)(x^(n) − μ_k)^⊤
          π_k ← (∑_n r_{n,k}) / N
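Putting the two steps together, a self-contained sketch of the whole loop; the initialization scheme, the small diagonal jitter on the covariances, and the log-likelihood stopping rule are my assumptions rather than choices stated on the slides:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=100, tol=1e-6, seed=0):
    """Fit a K-component Gaussian mixture to X (N, D) with Expectation Maximization."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # initialization: random data points as means, shared empirical covariance, uniform weights
    mus = X[rng.choice(N, size=K, replace=False)].copy()
    Sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
    pis = np.full(K, 1.0 / K)
    prev_ll = -np.inf
    for _ in range(n_iters):
        # E-step: responsibilities r[n, k] = p(z = k | x^(n))
        weighted = np.column_stack([
            pis[k] * multivariate_normal.pdf(X, mean=mus[k], cov=Sigmas[k]) for k in range(K)
        ])
        r = weighted / weighted.sum(axis=1, keepdims=True)
        # M-step: closed-form parameter updates
        Nk = r.sum(axis=0)
        pis = Nk / N
        mus = (r.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            Sigmas[k] = (r[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
        # monitor the marginal log-likelihood and stop once it plateaus
        ll = np.log(weighted.sum(axis=1)).sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return pis, mus, Sigmas, r
```

Running such a sketch from several random initializations and comparing the final average log-likelihoods mirrors the multiple-runs comparison on the Iris example below.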

  13. EM algorithm for the Gaussian mixture: example
      [figure: EM run on a D = 2, K = 2 example; panels show the initialization, then the expectation step (finding responsibilities) and the maximization step (finding model parameters) at iterations 2, 5 and 20; EM converges after 20 iterations]

  14. EM algorithm for the Gaussian mixture: example
      Iris flowers dataset, multiple runs; which model is better?
      - converged after 50 iterations, average log-likelihood: -1.45
      - converged after 120 iterations, average log-likelihood: -1.47
      - converged after 34 iterations, average log-likelihood: -1.49
      - converged after 43 iterations, average log-likelihood: -1.45

  15. Comparison with K-means
      objective: K-means minimizes the sum of squared Euclidean distances to the cluster centers; EM for the Gaussian mixture minimizes the negative log-(marginal) likelihood
      parameters: K-means learns the cluster centers; the Gaussian mixture learns means, covariances and mixture weights
      responsibilities: K-means uses hard cluster memberships; the Gaussian mixture uses soft cluster memberships
      algorithm: both alternate between minimization w.r.t. the parameters and the responsibilities
      feature scaling: K-means is sensitive to feature scaling; the Gaussian mixture is more robust because it learns the covariance
      efficiency: K-means converges faster; EM for the Gaussian mixture converges more slowly
      both converge to a local optimum, and in both, swapping cluster indices makes no difference to the objective value

  16. Expectation Maximization
      we saw the application of EM to the Gaussian mixture;
      EM is a general algorithm for learning latent variable models:
      we have a model p(x, z; θ) and partial observations D = {x^(1), …, x^(N)};
      to learn the model parameters and infer the latent variables, use EM:
      start from some initial model θ
      repeat until convergence:
        E-step: do a probabilistic completion, p(z^(n) ∣ x^(n); θ) ∀ n
        M-step: fit the model θ to the (probabilistically) completed data

  17. Expectation Maximization
      a simple variation is called the hard EM algorithm:
      start from some initial model θ
      repeat until convergence:
        E-step: do a deterministic completion, z^(n) = arg max_z p(z ∣ x^(n); θ) ∀ n
        M-step: fit the model θ to the completed data using maximum likelihood
      K-means is performing hard EM with fixed covariances and mixture weights:
      finding the closest center is finding the Gaussian with the highest probability,
      and updating the centers is fitting Gaussians to the completed data (x, z).
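To make the connection concrete, here is a small sketch of hard EM under the assumptions of a shared spherical covariance σ²I and equal mixture weights (these assumptions, and the helper names, are mine, not the slides'): with them, the hard E-step reduces to the nearest-center assignment of K-means.

```python
import numpy as np

def hard_e_step(X, mus):
    """Deterministic completion z[n] = arg max_k p(z = k | x^(n)).

    With a shared spherical covariance and equal mixture weights, this argmax equals
    arg min_k ||x^(n) - mu_k||^2, i.e. the K-means assignment step.
    """
    d2 = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=-1)  # (N, K) squared distances
    return d2.argmin(axis=1)

def hard_m_step(X, z, K):
    """Fit the Gaussian means to the completed data (x, z): each center becomes the mean of its points.
    (Empty clusters are not handled in this sketch.)"""
    return np.array([X[z == k].mean(axis=0) for k in range(K)])
```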

  18. Summary
      Latent variable models: a general and powerful type of probabilistic model;
      we have only partial observations, and we can use EM to learn the parameters and infer the hidden values.
      Expectation Maximization (EM): useful when we have hidden variables or missing values;
      it tries to maximize the log-likelihood of the observations;
      it iterates between learning the model parameters and inferring the latents;
      it converges to a local optimum (performance depends on the initialization).
      The one concrete example that we saw: the Gaussian mixture model (GMM);
      EM in the GMM for soft clustering, and its relationship to K-means.
