
  1. Clustering: Models and Algorithms Shikui Tu 2019-03-07 1

  2. Outline • Gaussian Mixture Models (GMM) • Expectation-Maximization (EM) for maximum likelihood • Model selection, Bayesian learning 2

  3. From distance to probability. Map the squared distance ||x − µ||² to a likelihood via exp{−λ ||x − µ||²}: "the closer, the more likely." Normalizing so that the sum (or integral) equals one turns this into a probability. It is more powerful to consider everything in a probabilistic framework; the Gaussian distribution does exactly this, using the Mahalanobis distance (x − µ)^T Σ^{-1} (x − µ). 3
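
As a toy illustration of this mapping, a minimal NumPy sketch (the function name distance_to_probability and the example centers are my own, not from the slides): squared distances become weights exp(−λ·d²) that are normalized to sum to one.

```python
import numpy as np

def distance_to_probability(x, centers, lam=1.0):
    """Map squared distances to cluster centers into a probability over clusters:
    p_k is proportional to exp(-lam * ||x - mu_k||^2)."""
    sq_dist = np.sum((centers - x) ** 2, axis=1)   # ||x - mu_k||^2 for each center
    weights = np.exp(-lam * sq_dist)               # "the closer, the more likely"
    return weights / weights.sum()                 # normalize so the sum is one

# Example: a point near the first of two centers gets most of the probability
centers = np.array([[0.0, 0.0], [5.0, 5.0]])
print(distance_to_probability(np.array([0.5, 0.2]), centers))
```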

  4. Review the clustering problem again. Given the data shown in the scatter plot, we want to cluster the data into two clusters (red and blue). 4

  5. K-means vs. Gaussian Mixture Model (GMM). Instead of representing each cluster only by its mean in {µ_1, µ_2}, each cluster is represented as a Gaussian distribution. (Figure: the same two clusters, k = 1 and k = 2, as modelled by K-means and by a GMM.) 5

  6. Maximum likelihood (single Gaussian). Maximizing the log-likelihood function ln p(X | µ, Σ) = −(N/2) ln |Σ| − (1/2) Σ_n (x_n − µ)^T Σ^{-1} (x_n − µ) + const, setting ∂ ln p / ∂µ = 0 gives µ_ML = (1/N) Σ_n x_n, and similarly Σ_ML = (1/N) Σ_n (x_n − µ_ML)(x_n − µ_ML)^T. These are the maximum likelihood estimates of the mean and the covariance matrix. 6
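
These closed-form estimates are easy to check numerically; a minimal sketch (the function name gaussian_mle is my own):

```python
import numpy as np

def gaussian_mle(X):
    """Maximum likelihood estimates of mean and covariance for a single Gaussian
    fitted to the rows of X (shape N x D)."""
    mu = X.mean(axis=0)                    # mu_ML = (1/N) sum_n x_n
    diff = X - mu
    sigma = diff.T @ diff / X.shape[0]     # Sigma_ML = (1/N) sum_n (x_n - mu)(x_n - mu)^T
    return mu, sigma

# Example on synthetic 2-D data: the estimates recover the true parameters
rng = np.random.default_rng(0)
X = rng.multivariate_normal([1.0, -2.0], [[2.0, 0.3], [0.3, 0.5]], size=2000)
mu_hat, sigma_hat = gaussian_mle(X)
print(mu_hat)
print(sigma_hat)
```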

  7. Matrix derivatives (used in the derivation above): see the Matrix Cookbook, http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/3274/pdf/imm3274.pdf 7

  8. Gaussian Mixture Model (GMM). We use z_k = 1 to indicate that a point x belongs to cluster k, with z = (z_1, …, z_K). Assume the points in the same cluster follow a Gaussian distribution, p(x | z_k = 1) = N(x | µ_k, Σ_k), and give each cluster a mixing weight π_k = p(z_k = 1), the prior probability of a point belonging to that cluster. So we get a distribution for the data point x: p(x) = Σ_{k=1}^{K} π_k N(x | µ_k, Σ_k). 8

  9. Introduce a latent variable. We use z_k = 1 to indicate that a point x belongs to cluster k, with z = (z_1, …, z_K) a one-of-K indicator vector. A mixing weight for each cluster gives the prior probability of a point belonging to that cluster: p(z_k = 1) = π_k. Assume the points in the same cluster follow a Gaussian distribution: p(x | z_k = 1) = N(x | µ_k, Σ_k). 9

  10. Gaussian Mixture Model (GMM): generative process. • Randomly sample z from a categorical distribution with parameters [π_1, …, π_K]; • Generate x according to the Gaussian distribution of the selected cluster, N(x | µ_k, Σ_k). The graphical representation corresponds to the factorization p(x, z) = p(z) p(x | z). So we get a distribution for the data point x: p(x) = Σ_k π_k N(x | µ_k, Σ_k). 10
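
The generative process translates directly into a sampling routine; a hedged sketch (sample_gmm and the two-component example are my own illustration):

```python
import numpy as np

def sample_gmm(n, pis, mus, sigmas, rng=None):
    """Generative process of a GMM: draw z ~ Categorical(pi),
    then draw x ~ N(mu_z, Sigma_z)."""
    rng = rng or np.random.default_rng()
    z = rng.choice(len(pis), size=n, p=pis)        # latent cluster indicators
    X = np.array([rng.multivariate_normal(mus[k], sigmas[k]) for k in z])
    return X, z

# Example: 500 points from a two-component mixture in 2-D
pis = [0.3, 0.7]
mus = [np.array([0.0, 0.0]), np.array([4.0, 4.0])]
sigmas = [np.eye(2), 0.5 * np.eye(2)]
X, z = sample_gmm(500, pis, mus, sigmas)
```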

  11. From minimizing the sum of squared distances to maximizing the likelihood. Given data X = {x_1, ..., x_N}, K-means minimizes the sum of squared distances over hard assignments r_nk ∈ {0, 1} (e.g., r_n1 = 1, r_n2 = 0 for a point in cluster 1) and means µ = {µ_1, …, µ_K}; the GMM instead maximizes the likelihood p(X | π, µ, Σ) over the mixing weights π = {π_1, ..., π_K}, the means µ = {µ_1, …, µ_K}, and the covariances Σ = {Σ_1, ..., Σ_K}. Remember: the closer the distance, the higher the probability. 11

  12. Outline • Gaussian Mixture Models (GMM) • Expectation-Maximization (EM) for maximum likelihood • Model selection, Bayesian learning 12

  13. Expectation-Maximization (EM) algorithm for maximum likelihood. Initialization: choose starting values for the parameters of the clusters k = 1 and k = 2 (shown in the figure as two initial Gaussians). 13

  14. E Step. When the parameters are given, the assignments of the points are calculated from the posterior probability, i.e., the probability of a data point belonging to a cluster once we have observed that point: γ(z_nk) = π_k N(x_n | µ_k, Σ_k) / Σ_j π_j N(x_n | µ_j, Σ_j). This is a soft assignment: a point fractionally belongs to several clusters, for example 0.2 to cluster 1 and 0.8 to cluster 2. 14
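
A minimal sketch of this computation (the function name e_step is my own; SciPy's multivariate normal density stands in for N(x | µ_k, Σ_k)):

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, pis, mus, sigmas):
    """Responsibilities gamma[n, k] = p(z_k = 1 | x_n) under the current parameters."""
    weighted = np.column_stack([
        pis[k] * multivariate_normal.pdf(X, mean=mus[k], cov=sigmas[k])
        for k in range(len(pis))
    ])                                                       # pi_k * N(x_n | mu_k, Sigma_k)
    return weighted / weighted.sum(axis=1, keepdims=True)    # normalize over clusters k
```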

  15. M Step. When the assignments γ(z_nk) of the points to the clusters are known, the parameters can be calculated for each cluster (Gaussian) separately. With N_k = Σ_n γ(z_nk): the mixing weight π_k = N_k / N is the proportion of points in cluster k among all data points; the mean and the covariance matrix of each cluster are µ_k = (1/N_k) Σ_n γ(z_nk) x_n and Σ_k = (1/N_k) Σ_n γ(z_nk)(x_n − µ_k)(x_n − µ_k)^T. (In the accompanying figure, L denotes the number of completed EM cycles.) 15
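
The corresponding update, as a sketch (m_step is my own name, matching the e_step sketch above):

```python
import numpy as np

def m_step(X, gamma):
    """Re-estimate (pi_k, mu_k, Sigma_k) from the responsibilities gamma (shape N x K)."""
    N, K = gamma.shape
    Nk = gamma.sum(axis=0)                # effective number of points per cluster
    pis = Nk / N                          # pi_k = N_k / N
    mus = (gamma.T @ X) / Nk[:, None]     # mu_k = responsibility-weighted mean
    sigmas = []
    for k in range(K):
        diff = X - mus[k]
        sigmas.append((gamma[:, k, None] * diff).T @ diff / Nk[k])   # weighted covariance
    return pis, mus, sigmas
```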

  16. (Figure: the full EM run on the example data, from initialization through alternating E-steps and M-steps to convergence; L denotes the number of completed E-step/M-step cycles.) 16
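
Putting the pieces together, a minimal driver loop (a sketch reusing the hypothetical e_step and m_step functions above; the initialization and convergence test are my own choices):

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_likelihood(X, pis, mus, sigmas):
    dens = sum(pis[k] * multivariate_normal.pdf(X, mean=mus[k], cov=sigmas[k])
               for k in range(len(pis)))
    return np.log(dens).sum()

def fit_gmm(X, K, n_iter=100, tol=1e-6, rng=None):
    """EM for a GMM: alternate E and M steps until the log-likelihood stops improving."""
    rng = rng or np.random.default_rng(0)
    N, D = X.shape
    pis = np.full(K, 1.0 / K)
    mus = X[rng.choice(N, K, replace=False)]            # initialize means at random data points
    sigmas = [np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)]
    old_ll = -np.inf
    for _ in range(n_iter):
        gamma = e_step(X, pis, mus, sigmas)             # E step: soft assignments
        pis, mus, sigmas = m_step(X, gamma)             # M step: re-estimate parameters
        ll = log_likelihood(X, pis, mus, sigmas)
        if ll - old_ll < tol:                           # EM never decreases the likelihood
            break
        old_ll = ll
    return pis, mus, sigmas, gamma
```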

  17. Details of the EM Algorithm 17

  18. K-means is a hard-cut EM. K-means corresponds to a GMM with fixed spherical covariances Σ_k = εI and equal mixing weights, so it only estimates the means {µ_k} and uses a one-in-K (hard) assignment; the full GMM also learns the covariances and the mixing weights and uses soft assignments. 18
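
A brief sketch of this limit (my own illustration): with Σ_k = εI and equal weights, the responsibilities are a softmax of −||x_n − µ_k||²/(2ε), and as ε shrinks they collapse onto the nearest mean, which is exactly the K-means assignment.

```python
import numpy as np

def hard_assign(X, mus, eps=1e-3):
    """K-means as the small-epsilon limit of EM with Sigma_k = eps * I and equal weights."""
    sq_dist = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2)   # ||x_n - mu_k||^2
    logits = -sq_dist / (2 * eps)
    logits -= logits.max(axis=1, keepdims=True)     # stabilize before exponentiating
    gamma = np.exp(logits)
    gamma /= gamma.sum(axis=1, keepdims=True)       # soft responsibilities, nearly one-hot
    return gamma.argmax(axis=1)                     # one-in-K (hard) assignment
```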

  19. The General EM Algorithm. Given a joint distribution p(X, Z | θ) over observed variables X and latent variables Z, governed by parameters θ, the goal is to maximize the likelihood function p(X | θ) with respect to θ. 1. Choose an initial setting for the parameters θ_old. 2. E step: evaluate p(Z | X, θ_old). 3. M step: evaluate θ_new given by θ_new = arg max_θ Q(θ, θ_old), where Q(θ, θ_old) = Σ_Z p(Z | X, θ_old) ln p(X, Z | θ). 4. Check for convergence of either the log-likelihood or the parameter values; if the convergence criterion is not satisfied, set θ_old ← θ_new and return to step 2. 19

  20. Summary of the EM algorithm for GMM • Does it find the global optimum? No: like K-means, EM only finds the nearest local optimum, and the result depends on the initialization. • GMM is more general than K-means, since it models mixing weights, covariance matrices, and soft assignments. • Like K-means, it does not tell you the best K. 20

  21. EM never decreases the likelihood. The log-likelihood decomposes as ln p(X | θ) = F(q, θ) + KL(q || p(Z | X, θ)), where F(q, θ) is a lower bound on ln p(X | θ). At iteration t, the E-step sets q^(t+1) = p(Z | X, θ^(t)), so KL[q^(t+1) || p(Z | X, θ^(t))] = 0 and the new lower bound touches the old log-likelihood: F(q^(t+1), θ^(t)) = ln p(X | θ^(t)). The M-step then maximizes the lower bound over θ to get θ^(t+1), and since KL[q^(t+1) || p(Z | X, θ^(t+1))] ≥ 0, the new log-likelihood satisfies ln p(X | θ^(t+1)) ≥ F(q^(t+1), θ^(t+1)) ≥ F(q^(t+1), θ^(t)) = ln p(X | θ^(t)). 21

  22. 22

  23. The KL divergence KL[q(x) || p(x)] is non-negative, and zero iff p(x) = q(x) for all x. First consider discrete distributions; the Kullback-Leibler divergence is KL[q || p] = Σ_i q_i log(q_i / p_i). To find the distribution q that minimizes KL[q || p], add a Lagrange multiplier enforcing the normalization constraint: E := KL[q || p] + λ(1 − Σ_i q_i) = Σ_i q_i log(q_i / p_i) + λ(1 − Σ_i q_i). Taking partial derivatives and setting them to zero: ∂E/∂q_i = log q_i − log p_i + 1 − λ = 0 ⟹ q_i = p_i exp(λ − 1), and ∂E/∂λ = 1 − Σ_i q_i = 0 ⟹ Σ_i q_i = 1; together these give q_i = p_i for all i. The curvature (Hessian) is positive definite, with ∂²E/∂q_i² = 1/q_i > 0 and ∂²E/∂q_i∂q_j = 0 for i ≠ j, so q_i = p_i is a genuine minimum. At the minimum it is easily verified that KL[p || p] = 0. 23
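
A quick numerical check of the non-negativity claim (kl_divergence is my own helper for discrete distributions):

```python
import numpy as np

def kl_divergence(q, p):
    """KL[q || p] = sum_i q_i * log(q_i / p_i) for discrete distributions (0 log 0 := 0)."""
    q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

p = np.array([0.2, 0.5, 0.3])
print(kl_divergence([0.1, 0.6, 0.3], p))   # positive whenever q differs from p
print(kl_divergence(p, p))                 # exactly 0 when q equals p
```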

  24. Jensen's inequality (due to convexity). 24
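
The formula on this slide did not survive extraction; as a sketch, the standard statement and the way it yields the EM lower bound used on the neighbouring slides:

```latex
% Jensen's inequality for the concave function \ln and any distribution q(Z):
%   \ln \mathbb{E}_{q}[g(Z)] \;\ge\; \mathbb{E}_{q}[\ln g(Z)].
% Applied to the log-likelihood, it gives the lower bound \mathcal{F}(q, \theta):
\ln p(X \mid \theta)
  = \ln \sum_{Z} q(Z)\, \frac{p(X, Z \mid \theta)}{q(Z)}
  \;\ge\; \sum_{Z} q(Z) \ln \frac{p(X, Z \mid \theta)}{q(Z)}
  = \mathcal{F}(q, \theta).
```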

  25. EM never decreases the likelihood (recap of slide 21): each E-step closes the KL gap so that the lower bound F(q, θ) touches ln p(X | θ), and each M-step increases the bound, so ln p(X | θ^(t+1)) ≥ ln p(X | θ^(t)). 25

  26. Outline • Gaussian Mixture Models (GMM) • Expectation-Maximization (EM) for maximum likelihood • Model selection, Bayesian learning 26

  27. How to determine the cluster number K? (Figure: the K-means objective J and the GMM negative log-likelihood, each plotted against K.) J keeps decreasing as K grows, so it does not tell which K is better; the negative log-likelihood likewise decreases as K increases. 27

  28. Model selection in general. Probabilistic model p(X_N | Θ_K), with candidate models Θ_1 ⊆ Θ_2 ⊆ … ⊆ Θ_K ⊆ …; the models become more and more complex as K increases. The criterion is a trade-off between fitting the data well and keeping the model simple: the log-likelihood (or fitting error) measures how well the model with K components fits the data, and a penalty discourages complexity. Akaike's Information Criterion (AIC): ln p(X_N | Θ̂_K) − d_K, where d_K is the number of free parameters. Bayesian Information Criterion (BIC): ln p(X_N | Θ̂_K) − (1/2) d_K ln N, where N is the sample size. 28
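
A sketch of how these criteria could be computed for the GMM (gmm_free_params and aic_bic are my own helpers; the commented usage assumes the fit_gmm and log_likelihood sketches above). Note that the slide writes both criteria in "larger is better" form.

```python
import numpy as np

def gmm_free_params(K, D):
    """d_K for a full-covariance GMM: (K-1) mixing weights, K*D means,
    and K*D*(D+1)/2 covariance entries."""
    return (K - 1) + K * D + K * D * (D + 1) // 2

def aic_bic(loglik, K, D, N):
    """The slide's forms: AIC = ln p(X|theta) - d_K, BIC = ln p(X|theta) - (1/2) d_K ln N."""
    d = gmm_free_params(K, D)
    return loglik - d, loglik - 0.5 * d * np.log(N)

# Hypothetical usage with the earlier sketches:
# for K in range(1, 10):
#     pis, mus, sigmas, _ = fit_gmm(X, K)
#     print(K, aic_bic(log_likelihood(X, pis, mus, sigmas), K, X.shape[1], X.shape[0]))
```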

  29. Bayesian learning • Maximum A Posteriori (MAP): max_Θ p(Θ | X), which is equivalent to maximizing log p(X, Θ) = log p(X | Θ) + log p(Θ). Consider a simple example: p(x | Θ) = N(x | µ, Σ) with a Gaussian prior on the mean, p(µ) = N(µ | µ_0, σ_0² I). 29
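
Carrying the example one step further (a sketch under the assumption that Σ is known and only the mean µ is learned): the MAP estimate is a precision-weighted compromise between the data and the prior mean µ_0.

```latex
% MAP estimate of the mean, given x_1, ..., x_N ~ N(mu, Sigma) and prior
% p(mu) = N(mu | mu_0, sigma_0^2 I), with Sigma known:
\hat{\mu}_{\mathrm{MAP}}
  = \arg\max_{\mu} \Big[ \sum_{n=1}^{N} \ln \mathcal{N}(x_n \mid \mu, \Sigma)
      + \ln \mathcal{N}(\mu \mid \mu_0, \sigma_0^2 I) \Big]
  = \Big( N \Sigma^{-1} + \tfrac{1}{\sigma_0^2} I \Big)^{-1}
    \Big( \Sigma^{-1} \sum_{n=1}^{N} x_n + \tfrac{1}{\sigma_0^2} \mu_0 \Big).
```

As σ_0² grows the prior becomes uninformative and the estimate reduces to the maximum likelihood mean; a small σ_0² shrinks it toward µ_0.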

  30. Model selection. Probabilistic model p(X_N | Θ_K); candidate models Θ_1 ⊆ Θ_2 ⊆ … ⊆ Θ_K ⊆ …. 30

  31. Using Occam's Razor to Learn Model Structure. Compare model classes m using their posterior probability given the data: P(m | y) = P(y | m) P(m) / P(y), where the marginal likelihood is P(y | m) = ∫_{Θ_m} P(y | θ_m, m) P(θ_m | m) dθ_m. Interpretation of P(y | m): the probability that randomly selected parameter values from the model class would generate the data set y. Model classes that are too simple are unlikely to generate the data set; model classes that are too complex can generate many possible data sets, so again they are unlikely to generate that particular data set at random. (Figure: P(Y | M_i) over all possible data sets Y for a model that is too simple, one that is "just right", and one that is too complex.) 31

  32. Bayesian model selection • A model class m is a set of models parameterised by θ_m, e.g. the set of all possible mixtures of m Gaussians. • The marginal likelihood of model class m, P(y | m) = ∫_{Θ_m} P(y | θ_m, m) P(θ_m | m) dθ_m, is also known as the Bayesian evidence for model m. • The ratio of two marginal likelihoods, P(y | m) / P(y | m′), is known as the Bayes factor. • The Occam's Razor principle is, roughly speaking, that one should prefer simpler explanations over more complex ones. • Bayesian inference formalises and automatically implements the Occam's Razor principle. 32
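
To make the marginal likelihood concrete, a toy Monte Carlo sketch (entirely my own setup, not from the slides): P(y | m) is estimated by averaging the likelihood over parameters drawn from the prior, for a one-dimensional Gaussian model with unknown mean.

```python
import numpy as np
from scipy.stats import norm

def log_marginal_likelihood_mc(y, mu0=0.0, prior_sd=5.0, noise_sd=1.0,
                               n_samples=20000, rng=None):
    """Monte Carlo estimate of ln P(y | m) = ln E_{theta ~ prior}[ P(y | theta, m) ]
    for the toy model y_i ~ N(theta, noise_sd^2), theta ~ N(mu0, prior_sd^2)."""
    rng = rng or np.random.default_rng(0)
    thetas = rng.normal(mu0, prior_sd, size=n_samples)   # parameters sampled from the prior
    log_liks = norm.logpdf(y[:, None], loc=thetas[None, :], scale=noise_sd).sum(axis=0)
    m = log_liks.max()                                   # log-sum-exp for numerical stability
    return m + np.log(np.exp(log_liks - m).mean())

y = np.random.default_rng(1).normal(2.0, 1.0, size=30)
print(log_marginal_likelihood_mc(y))
```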
