

  1. COMS 4721: Machine Learning for Data Science, Lecture 16, 3/28/2017
     Prof. John Paisley, Department of Electrical Engineering & Data Science Institute, Columbia University

  2. SOFT CLUSTERING VS. HARD CLUSTERING MODELS

  3. HARD CLUSTERING MODELS

     Review: K-means clustering algorithm
     Given: Data $x_1, \dots, x_n$, where $x \in \mathbb{R}^d$
     Goal: Minimize $L = \sum_{i=1}^n \sum_{k=1}^K \mathbf{1}\{c_i = k\}\, \|x_i - \mu_k\|^2$.
     ◮ Iterate until the values no longer change
       1. Update $c$: For each $i$, set $c_i = \arg\min_k \|x_i - \mu_k\|^2$
       2. Update $\mu$: For each $k$, set $\mu_k = \sum_i x_i \mathbf{1}\{c_i = k\} \big/ \sum_i \mathbf{1}\{c_i = k\}$

     K-means is an example of a hard clustering algorithm because it assigns each observation to only one cluster. In other words, $c_i = k$ for some $k \in \{1, \dots, K\}$. There is no accounting for "boundary cases" by hedging on the corresponding $c_i$.
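
To make the two updates concrete, here is a minimal NumPy sketch of the K-means iteration above. It is an illustrative implementation, not the course's reference code; the initialization (K random data points as centroids) and the fixed iteration count are assumptions.

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Hard-assignment K-means on an (n, d) array X with K clusters."""
    rng = np.random.default_rng(seed)
    n, _ = X.shape
    mu = X[rng.choice(n, size=K, replace=False)].copy()   # init: K random points
    for _ in range(n_iters):
        # Update c: assign each x_i to its nearest centroid.
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (n, K)
        c = dists.argmin(axis=1)
        # Update mu: mean of the points currently assigned to each cluster.
        for k in range(K):
            if np.any(c == k):
                mu[k] = X[c == k].mean(axis=0)
    return c, mu
```

Each point contributes to exactly one centroid, which is what makes this a hard clustering method.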

  4. SOFT CLUSTERING MODELS

     A soft clustering algorithm splits the data across clusters intelligently.

     [Figure: three panels. (left) True cluster assignments of data from three Gaussians. (middle) The data as we see it. (right) A soft clustering of the data accounting for borderline cases.]

  5. WEIGHTED K-MEANS (SOFT CLUSTERING EXAMPLE)

     Weighted K-means clustering algorithm
     Given: Data $x_1, \dots, x_n$, where $x \in \mathbb{R}^d$
     Goal: Minimize $L = \sum_{i=1}^n \sum_{k=1}^K \phi_i(k)\, \frac{\|x_i - \mu_k\|^2}{\beta} - \sum_i H(\phi_i)$ over $\phi_i$ and $\mu_k$
     Conditions: $\phi_i(k) > 0$, $\sum_{k=1}^K \phi_i(k) = 1$, $H(\phi_i)$ = entropy. Set $\beta > 0$.
     ◮ Iterate the following
       1. Update $\phi$: For each $i$, update the cluster allocation weights
          $\phi_i(k) = \frac{\exp\{-\frac{1}{\beta}\|x_i - \mu_k\|^2\}}{\sum_j \exp\{-\frac{1}{\beta}\|x_i - \mu_j\|^2\}}$, for $k = 1, \dots, K$
       2. Update $\mu$: For each $k$, update $\mu_k$ with the weighted average
          $\mu_k = \frac{\sum_i x_i \phi_i(k)}{\sum_i \phi_i(k)}$
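
A sketch of the weighted K-means iteration, under the same assumptions as the K-means sketch above. The $\phi$ update is a softmax of the negative squared distances scaled by $1/\beta$; larger $\beta$ gives softer assignments.

```python
import numpy as np

def weighted_kmeans(X, K, beta=1.0, n_iters=100, seed=0):
    """Soft-assignment K-means: phi[i, k] is the weight of x_i on cluster k."""
    rng = np.random.default_rng(seed)
    n, _ = X.shape
    mu = X[rng.choice(n, size=K, replace=False)].copy()
    for _ in range(n_iters):
        # Update phi: softmax of -||x_i - mu_k||^2 / beta over k.
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (n, K)
        logits = -dists / beta
        logits -= logits.max(axis=1, keepdims=True)                   # numerical stability
        phi = np.exp(logits)
        phi /= phi.sum(axis=1, keepdims=True)
        # Update mu: weighted average of the data under phi.
        mu = (phi.T @ X) / phi.sum(axis=0)[:, None]
    return phi, mu
```

As $\beta \to 0$ the weights become effectively binary and the updates reduce to ordinary K-means.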

  6. SOFT CLUSTERING WITH WEIGHTED K-MEANS

     [Figure: a data point $x$ on the boundary between two clusters gets $\phi_i = 0.75$ on the green cluster and $0.25$ on the blue cluster, while a point well inside the blue cluster gets $\phi_i = 1$; a $\beta$-defined region is drawn around $\mu_1$.]

     When $\phi_i$ is binary, we get back the hard clustering model.

  7. MIXTURE MODELS

  8. PROBABILISTIC SOFT CLUSTERING MODELS

     Probabilistic vs. non-probabilistic soft clustering
     The weight vector $\phi_i$ is like a probability of $x_i$ being assigned to each cluster. A mixture model is a probabilistic model where $\phi_i$ actually is a probability distribution according to the model.

     Mixture models work by defining:
     ◮ A prior distribution on the cluster assignment indicator $c_i$
     ◮ A likelihood distribution on the observation $x_i$ given the assignment $c_i$

     Intuitively, we can connect a mixture model to the Bayes classifier:
     ◮ Class prior → cluster prior. This time, we don't know the "label."
     ◮ Class-conditional likelihood → cluster-conditional likelihood

  9. MIXTURE MODELS

     [Figure: (a) A probability distribution on $\mathbb{R}^2$. (b) Data sampled from this distribution.]

     Before introducing the math, some key features of a mixture model are:
     1. It is a generative model (it defines a probability distribution on the data).
     2. It is a weighted combination of simpler distributions.
        ◮ Each simple distribution is in the same distribution family (e.g., a Gaussian).
        ◮ The "weighting" is defined by a discrete probability distribution.

  10. MIXTURE MODELS

      Generating data from a mixture model
      Data: $x_1, \dots, x_n$, where each $x_i \in \mathcal{X}$ (can be complicated, but think $\mathcal{X} = \mathbb{R}^d$)
      Model parameters: A $K$-dimensional distribution $\pi$ and parameters $\theta_1, \dots, \theta_K$.
      Generative process: For observation number $i = 1, \dots, n$,
        1. Generate the cluster assignment: $c_i \overset{iid}{\sim} \text{Discrete}(\pi) \;\Rightarrow\; \text{Prob}(c_i = k \mid \pi) = \pi_k$.
        2. Generate the observation: $x_i \sim p(x \mid \theta_{c_i})$.

      Some observations about this procedure:
      ◮ First, each $x_i$ is randomly assigned to a cluster using the distribution $\pi$.
      ◮ $c_i$ indexes the cluster assignment for $x_i$.
      ◮ This picks out the index of the parameter $\theta$ used to generate $x_i$.
      ◮ If two $x$'s share a parameter, they are clustered together.
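
The generative process translates directly into code. Below is a sketch that uses Gaussian cluster-conditional distributions as the example family; the function name and the parameter values at the bottom are made up for illustration.

```python
import numpy as np

def sample_mixture(n, pi, mus, Sigmas, seed=0):
    """Draw n points from a mixture with weights pi and Gaussian components."""
    rng = np.random.default_rng(seed)
    K = len(pi)
    # Step 1: cluster assignments c_i ~ Discrete(pi).
    c = rng.choice(K, size=n, p=pi)
    # Step 2: x_i drawn from the component indexed by c_i.
    X = np.array([rng.multivariate_normal(mus[k], Sigmas[k]) for k in c])
    return X, c

# Example: three clusters in R^2 with uneven mixing weights (illustrative values).
pi = np.array([0.5, 0.3, 0.2])
mus = [np.array([0.0, 0.0]), np.array([3.0, 0.0]), np.array([0.0, 3.0])]
Sigmas = [0.5 * np.eye(2) for _ in range(3)]
X, c = sample_mixture(500, pi, mus, Sigmas)
```

Points that drew the same index $c_i$ share the same component parameters, which is exactly what "belonging to the same cluster" means here.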

  11. MIXTURE MODELS

      [Figure: (a) Uniform mixing weights. (b) Data sampled from this distribution. (c) Uneven mixing weights. (d) Data sampled from this distribution.]

  12. GAUSSIAN MIXTURE MODELS

  13. ILLUSTRATION

      Gaussian mixture models are mixture models where $p(x \mid \theta)$ is Gaussian.

      Mixture of two Gaussians (the red line is the density function):
      $\pi = [0.5, 0.5]$, $(\mu_1, \sigma_1^2) = (0, 1)$, $(\mu_2, \sigma_2^2) = (2, 0.5)$

      Influence of the mixing weights (the red line is the density function):
      $\pi = [0.8, 0.2]$, $(\mu_1, \sigma_1^2) = (0, 1)$, $(\mu_2, \sigma_2^2) = (2, 0.5)$
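
The plotted density is simply $p(x) = \sum_k \pi_k N(x \mid \mu_k, \sigma_k^2)$. A short sketch that evaluates it with the first set of parameters on this slide (the helper name is my own; SciPy's univariate normal density is used):

```python
import numpy as np
from scipy.stats import norm

def mixture_density(x, pi, mus, variances):
    """p(x) = sum_k pi_k * N(x | mu_k, sigma_k^2), evaluated elementwise."""
    return sum(p * norm.pdf(x, loc=m, scale=np.sqrt(v))
               for p, m, v in zip(pi, mus, variances))

xs = np.linspace(-4.0, 5.0, 200)
# First mixture from the slide: pi = [0.5, 0.5], components N(0, 1) and N(2, 0.5).
density = mixture_density(xs, pi=[0.5, 0.5], mus=[0.0, 2.0], variances=[1.0, 0.5])
```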

  14. GAUSSIAN MIXTURE MODELS (GMM)

      The model
      Parameters: Let $\pi$ be a $K$-dimensional probability distribution and $(\mu_k, \Sigma_k)$ be the mean and covariance of the $k$th Gaussian in $\mathbb{R}^d$.
      Generate data: For the $i$th observation,
        1. Assign the $i$th observation to a cluster, $c_i \sim \text{Discrete}(\pi)$
        2. Generate the value of the observation, $x_i \sim N(\mu_{c_i}, \Sigma_{c_i})$
      Definitions: $\mu = \{\mu_1, \dots, \mu_K\}$ and $\Sigma = \{\Sigma_1, \dots, \Sigma_K\}$.
      Goal: We want to learn $\pi$, $\mu$ and $\Sigma$.

  15. GAUSSIAN MIXTURE MODELS (GMM)

      Maximum likelihood
      Objective: Maximize the likelihood over the model parameters $\pi$, $\mu$ and $\Sigma$ by treating the $c_i$ as auxiliary data using the EM algorithm.
      $$p(x_1, \dots, x_n \mid \pi, \mu, \Sigma) = \prod_{i=1}^n p(x_i \mid \pi, \mu, \Sigma) = \prod_{i=1}^n \sum_{k=1}^K p(x_i, c_i = k \mid \pi, \mu, \Sigma)$$
      The summation over the values of each $c_i$ "integrates out" this variable.

      We can't simply take derivatives with respect to $\pi$, $\mu_k$ and $\Sigma_k$ and set them to zero to maximize this because there's no closed-form solution. We could use gradient methods, but EM is cleaner.
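
For reference, the marginal log-likelihood $\sum_i \ln \sum_k \pi_k N(x_i \mid \mu_k, \Sigma_k)$ that EM will maximize can be evaluated directly. The sketch below uses SciPy's multivariate normal density and a log-sum-exp over $k$ for numerical stability; the function name and argument layout are assumptions made for illustration.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, pi, mus, Sigmas):
    """Compute sum_i ln sum_k pi_k N(x_i | mu_k, Sigma_k) for an (n, d) array X."""
    # log_probs[i, k] = ln pi_k + ln N(x_i | mu_k, Sigma_k)
    log_probs = np.column_stack([
        np.log(pi[k]) + multivariate_normal.logpdf(X, mean=mus[k], cov=Sigmas[k])
        for k in range(len(pi))
    ])
    # Sum over k inside the log (log-sum-exp), then sum over i.
    return logsumexp(log_probs, axis=1).sum()
```

This is also the quantity to monitor for convergence when running EM.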

  16. EM ALGORITHM

      Q: Why not instead just include each $c_i$ and maximize $\prod_{i=1}^n p(x_i, c_i \mid \pi, \mu, \Sigma)$, since (we can show) this is easy to do using coordinate ascent?

      A: We would end up with a hard clustering model where $c_i \in \{1, \dots, K\}$. Our goal here is to have soft clustering, which EM provides.

      EM and the GMM
      We will not derive everything from scratch. However, we can treat $c_1, \dots, c_n$ as the auxiliary data that we integrate out. Therefore, we use EM to maximize $\sum_{i=1}^n \ln p(x_i \mid \pi, \mu, \Sigma)$ by using $\sum_{i=1}^n \ln p(x_i, c_i \mid \pi, \mu, \Sigma)$.

      Let's look at the outline of how to derive this.

  17. THE EM ALGORITHM AND THE GMM

      From the last lecture, the generic EM objective is
      $$\ln p(x \mid \theta_1) = \int q(\theta_2) \ln \frac{p(x, \theta_2 \mid \theta_1)}{q(\theta_2)}\, d\theta_2 + \int q(\theta_2) \ln \frac{q(\theta_2)}{p(\theta_2 \mid x, \theta_1)}\, d\theta_2$$
      The EM objective for the Gaussian mixture model is
      $$\sum_{i=1}^n \ln p(x_i \mid \pi, \mu, \Sigma) = \sum_{i=1}^n \sum_{k=1}^K q(c_i = k) \ln \frac{p(x_i, c_i = k \mid \pi, \mu, \Sigma)}{q(c_i = k)} + \sum_{i=1}^n \sum_{k=1}^K q(c_i = k) \ln \frac{q(c_i = k)}{p(c_i = k \mid x_i, \pi, \mu, \Sigma)}$$
      Because $c_i$ is discrete, the integral becomes a sum.

  18. EM SETUP (ONE ITERATION)

      First: Set $q(c_i = k) \Leftarrow p(c_i = k \mid x_i, \pi, \mu, \Sigma)$ using Bayes rule:
      $$p(c_i = k \mid x_i, \pi, \mu, \Sigma) \propto p(c_i = k \mid \pi)\, p(x_i \mid c_i = k, \mu, \Sigma)$$
      We can solve for the posterior of $c_i$ given $\pi$, $\mu$ and $\Sigma$:
      $$q(c_i = k) = \frac{\pi_k N(x_i \mid \mu_k, \Sigma_k)}{\sum_j \pi_j N(x_i \mid \mu_j, \Sigma_j)} \;\Longrightarrow\; \phi_i(k)$$
      E-step: Take the expectation using the updated $q$'s,
      $$L = \sum_{i=1}^n \sum_{k=1}^K \phi_i(k) \ln p(x_i, c_i = k \mid \pi, \mu_k, \Sigma_k) + \text{constant w.r.t. } \pi, \mu, \Sigma$$
      M-step: Maximize $L$ with respect to $\pi$ and each $\mu_k$, $\Sigma_k$.

  19. M-STEP CLOSE-UP

      Aside: How has EM made this easier?
      Original objective function:
      $$L = \sum_{i=1}^n \ln \sum_{k=1}^K p(x_i, c_i = k \mid \pi, \mu_k, \Sigma_k) = \sum_{i=1}^n \ln \sum_{k=1}^K \pi_k N(x_i \mid \mu_k, \Sigma_k).$$
      The log-sum form makes optimizing $\pi$ and each $\mu_k$ and $\Sigma_k$ difficult.

      Using EM here, we have the M-step:
      $$L = \sum_{i=1}^n \sum_{k=1}^K \phi_i(k) \underbrace{\{\ln \pi_k + \ln N(x_i \mid \mu_k, \Sigma_k)\}}_{\ln p(x_i, c_i = k \mid \pi, \mu_k, \Sigma_k)} + \text{constant w.r.t. } \pi, \mu, \Sigma$$
      The sum-log form is easier to optimize. We can take derivatives and solve.

  20. EM FOR THE GMM

      Algorithm: Maximum likelihood EM for the GMM
      Given: $x_1, \dots, x_n$ where $x \in \mathbb{R}^d$
      Goal: Maximize $L = \sum_{i=1}^n \ln p(x_i \mid \pi, \mu, \Sigma)$.
      ◮ Iterate until the incremental improvement to $L$ is "small"
        1. E-step: For $i = 1, \dots, n$, set
           $$\phi_i(k) = \frac{\pi_k N(x_i \mid \mu_k, \Sigma_k)}{\sum_j \pi_j N(x_i \mid \mu_j, \Sigma_j)}, \quad \text{for } k = 1, \dots, K$$
        2. M-step: For $k = 1, \dots, K$, define $n_k = \sum_{i=1}^n \phi_i(k)$ and update the values
           $$\pi_k = \frac{n_k}{n}, \qquad \mu_k = \frac{1}{n_k} \sum_{i=1}^n \phi_i(k)\, x_i, \qquad \Sigma_k = \frac{1}{n_k} \sum_{i=1}^n \phi_i(k) (x_i - \mu_k)(x_i - \mu_k)^T$$

      Comment: The updated value for $\mu_k$ is used when updating $\Sigma_k$.
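
Putting the two steps together, here is a compact sketch of the algorithm on this slide. It is an illustrative implementation rather than the course's reference code: the initialization (equal weights, random data points as means, the empirical covariance for every cluster) is a simple assumed choice, and it runs a fixed number of iterations instead of checking the incremental improvement in $L$.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=50, seed=0):
    """Maximum likelihood EM for a GMM on an (n, d) array X."""
    rng = np.random.default_rng(seed)
    n, _ = X.shape
    pi = np.full(K, 1.0 / K)
    mus = X[rng.choice(n, size=K, replace=False)].copy()
    Sigmas = np.array([np.cov(X, rowvar=False) for _ in range(K)])
    for _ in range(n_iters):
        # E-step: phi[i, k] proportional to pi_k N(x_i | mu_k, Sigma_k).
        dens = np.column_stack([
            pi[k] * multivariate_normal.pdf(X, mean=mus[k], cov=Sigmas[k])
            for k in range(K)
        ])
        phi = dens / dens.sum(axis=1, keepdims=True)
        # M-step: soft counts n_k, then pi, mu, Sigma in that order.
        n_k = phi.sum(axis=0)                          # (K,)
        pi = n_k / n
        mus = (phi.T @ X) / n_k[:, None]               # updated mu_k used for Sigma_k
        for k in range(K):
            diff = X - mus[k]
            Sigmas[k] = (phi[:, k, None] * diff).T @ diff / n_k[k]
    return pi, mus, Sigmas, phi
```

The returned $\phi$ matrix holds the soft cluster assignments, which is the GMM analogue of the weighted K-means allocation weights.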

  21. GAUSSIAN MIXTURE MODEL: EXAMPLE RUN
      [Figure (a): a random initialization.]

  22. GAUSSIAN MIXTURE MODEL: EXAMPLE RUN
      [Figure (b): Iteration 1 (E-step). Assign data to clusters.]

  23. GAUSSIAN MIXTURE MODEL: EXAMPLE RUN
      [Figure (c): Iteration 1 (M-step), L = 1. Update the Gaussians.]

  24. GAUSSIAN MIXTURE MODEL: EXAMPLE RUN
      [Figure (d): Iteration 2, L = 2. Assign data to clusters and update the Gaussians.]

  25. GAUSSIAN MIXTURE MODEL: EXAMPLE RUN
      [Figure (e): Iteration 5 (skipping ahead), L = 5. Assign data to clusters and update the Gaussians.]

  26. GAUSSIAN MIXTURE MODEL: EXAMPLE RUN
      [Figure (f): Iteration 20 (convergence), L = 20. Assign data to clusters and update the Gaussians.]
