Latent variable models.

Since
\[
  1 = \sum_{j=1}^k p_\theta(y_i = j \mid x_i)
  \qquad \text{and} \qquad
  p_\theta(y_i = j \mid x_i) = \frac{p_\theta(y_i = j, x_i)}{p_\theta(x_i)},
\]
then
\[
  L(\theta) = \sum_{i=1}^n \ln p_\theta(x_i)
  = \sum_{i=1}^n 1 \cdot \ln p_\theta(x_i)
  = \sum_{i=1}^n \sum_{j=1}^k p_\theta(y_i = j \mid x_i) \ln p_\theta(x_i)
  = \sum_{i=1}^n \sum_{j=1}^k p_\theta(y_i = j \mid x_i) \ln \frac{p_\theta(x_i, y_i = j)}{p_\theta(y_i = j \mid x_i)}.
\]
Therefore: define the augmented likelihood
\[
  L(\theta; R) := \sum_{i=1}^n \sum_{j=1}^k R_{ij} \ln \frac{p_\theta(x_i, y_i = j)}{R_{ij}};
\]
note that R_{ij} := p_\theta(y_i = j \mid x_i) implies L(\theta; R) = L(\theta).
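As a quick numeric sanity check of that last identity (a minimal sketch, not from the lecture; the toy sample and the mixture parameters `pi`, `mu` are made up), we can compare L(θ; R) at the posterior responsibilities against the plain log-likelihood for a one-dimensional two-component Gaussian mixture:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(size=5)                      # tiny toy sample, n = 5

pi = np.array([0.3, 0.7])                   # made-up mixture weights
mu = np.array([-1.0, 2.0])                  # made-up component means

# joint densities p_theta(x_i, y_i = j), shape (n, k)
joint = pi * norm.pdf(x[:, None], loc=mu, scale=1.0)
marg = joint.sum(axis=1)                    # marginals p_theta(x_i)
R = joint / marg[:, None]                   # posterior responsibilities

L_theta = np.log(marg).sum()                # plain log-likelihood L(theta)
L_aug = (R * np.log(joint / R)).sum()       # augmented likelihood L(theta; R)
assert np.isclose(L_theta, L_aug)           # equal when R is the posterior
```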
E-M method for latent variable models.

Define the augmented likelihood
\[
  L(\theta; R) := \sum_{i=1}^n \sum_{j=1}^k R_{ij} \ln \frac{p_\theta(x_i, y_i = j)}{R_{ij}},
\]
with responsibility matrix R \in \mathcal{R}_{n,k} := \{ R \in [0,1]^{n \times k} : R \mathbf{1}_k = \mathbf{1}_n \}.

Alternate two steps:
◮ E-step: set (R_t)_{ij} := p_{\theta_{t-1}}(y_i = j \mid x_i).
◮ M-step: set \theta_t = \arg\max_{\theta \in \Theta} L(\theta; R_t).

Soon: we’ll see this gives nondecreasing likelihood!
E-M for Gaussian mixtures.

Initialization: a standard choice is \pi_j = 1/k, \Sigma_j = I, and (\mu_j)_{j=1}^k given by k-means.

◮ E-step: set R_{ij} = p_\theta(y_i = j \mid x_i), meaning
\[
  R_{ij} = p_\theta(y_i = j \mid x_i) = \frac{p_\theta(y_i = j, x_i)}{p_\theta(x_i)}
  = \frac{\pi_j\, p_{\mu_j, \Sigma_j}(x_i)}{\sum_{l=1}^k \pi_l\, p_{\mu_l, \Sigma_l}(x_i)}.
\]
◮ M-step: solve \arg\max_{\theta \in \Theta} L(\theta; R), meaning
\[
  \pi_j := \frac{\sum_{i=1}^n R_{ij}}{\sum_{i=1}^n \sum_{l=1}^k R_{il}} = \frac{\sum_{i=1}^n R_{ij}}{n},
  \qquad
  \mu_j := \frac{\sum_{i=1}^n R_{ij} x_i}{\sum_{i=1}^n R_{ij}} = \frac{\sum_{i=1}^n R_{ij} x_i}{n \pi_j},
  \qquad
  \Sigma_j := \frac{\sum_{i=1}^n R_{ij} (x_i - \mu_j)(x_i - \mu_j)^\top}{n \pi_j}.
\]
(These are as before.)
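To make the updates concrete, here is a minimal numpy sketch of one full E-M iteration for a Gaussian mixture (not the lecture's demo code; the function and variable names are our own, and numerical safeguards such as covariance regularization are omitted):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, pi, mu, Sigma):
    """One E-M iteration for a GMM. X: (n, d); pi: (k,); mu: (k, d); Sigma: (k, d, d)."""
    n, d = X.shape
    k = len(pi)

    # E-step: R_ij = pi_j p_{mu_j,Sigma_j}(x_i) / sum_l pi_l p_{mu_l,Sigma_l}(x_i)
    R = np.empty((n, k))
    for j in range(k):
        R[:, j] = pi[j] * multivariate_normal.pdf(X, mean=mu[j], cov=Sigma[j])
    R /= R.sum(axis=1, keepdims=True)

    # M-step: weighted counts, means, and covariances
    Nj = R.sum(axis=0)                        # effective counts, Nj = n * pi_j
    pi_new = Nj / n
    mu_new = (R.T @ X) / Nj[:, None]
    Sigma_new = np.empty((k, d, d))
    for j in range(k):
        D = X - mu_new[j]
        Sigma_new[j] = (R[:, j, None] * D).T @ D / Nj[j]
    return pi_new, mu_new, Sigma_new, R
```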
Demo: spherical clusters.

[Figure: a sequence of E-M iterations fitting a Gaussian mixture to spherical clusters.]

(Initialized with k-means, thus not so dramatic.)
Demo: elliptical clusters.

[Figure: alternating E- and M-steps (E…, M…, E…, M…, …) of a Gaussian mixture fit to elliptical clusters, shown over several iterations.]
Theorem. Suppose (R_0, \theta_0) \in \mathcal{R}_{n,k} \times \Theta is arbitrary, and thereafter (R_t, \theta_t) is given by E-M:
\[
  (R_t)_{ij} := p_{\theta_{t-1}}(y_i = j \mid x_i)
  \qquad \text{and} \qquad
  \theta_t := \arg\max_{\theta \in \Theta} L(\theta; R_t).
\]
Then
\[
  L(\theta_t; R_t) \le \max_{R \in \mathcal{R}_{n,k}} L(\theta_t; R) = L(\theta_t; R_{t+1}) = L(\theta_t) \le L(\theta_{t+1}; R_{t+1}).
\]
In particular, L(\theta_t) \le L(\theta_{t+1}).

Remarks.
◮ We proved a similar guarantee for k-means, which is also an alternating minimization scheme.
◮ Similarly, MLE for Gaussian mixtures is NP-hard; it is also known that recovering the parameters information-theoretically requires exponentially many samples in k.
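The guarantee is easy to check empirically. A minimal sketch (synthetic data and initialization; it reuses the hypothetical `em_step` helper from the earlier GMM sketch) that tracks L(θ_t) across iterations and asserts it never decreases:

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_likelihood(X, pi, mu, Sigma):
    """L(theta) = sum_i ln sum_j pi_j p_{mu_j,Sigma_j}(x_i)."""
    dens = np.column_stack([
        pi[j] * multivariate_normal.pdf(X, mean=mu[j], cov=Sigma[j])
        for j in range(len(pi))
    ])
    return np.log(dens.sum(axis=1)).sum()

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1, (100, 2)),   # synthetic two-cluster data
               rng.normal([5, 5], 1, (100, 2))])

k, d = 2, 2
pi = np.full(k, 1 / k)                            # uniform initial weights
mu = X[rng.choice(len(X), k, replace=False)]      # random initial means
Sigma = np.stack([np.eye(d)] * k)                 # identity initial covariances

prev = -np.inf
for t in range(30):
    pi, mu, Sigma, _ = em_step(X, pi, mu, Sigma)  # em_step from the GMM sketch above
    cur = log_likelihood(X, pi, mu, Sigma)
    assert cur >= prev - 1e-9                     # nondecreasing, up to rounding
    prev = cur
```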
Proof. We’ve already shown:
◮ L(\theta_t; R_{t+1}) = L(\theta_t);
◮ L(\theta_t; R_{t+1}) \le \max_{\theta \in \Theta} L(\theta; R_{t+1}) = L(\theta_{t+1}; R_{t+1}), by definition of \theta_{t+1}.

We still need to show: \max_{R \in \mathcal{R}_{n,k}} L(\theta_t; R) = L(\theta_t; R_{t+1}). We’ll give two proofs.

By concavity of ln (“Jensen’s inequality” from the convexity lectures), for any R \in \mathcal{R}_{n,k},
\[
  L(\theta_t; R) = \sum_{i=1}^n \sum_{j=1}^k R_{ij} \ln \frac{p_{\theta_t}(x_i, y_i = j)}{R_{ij}}
  \le \sum_{i=1}^n \ln \sum_{j=1}^k R_{ij} \cdot \frac{p_{\theta_t}(x_i, y_i = j)}{R_{ij}}
  = \sum_{i=1}^n \ln p_{\theta_t}(x_i) = L(\theta_t) = L(\theta_t; R_{t+1}).
\]
Since R was arbitrary, \max_{R \in \mathcal{R}_{n,k}} L(\theta_t; R) = L(\theta_t; R_{t+1}).
Proof (continued). Here’s a second proof of that missing fact.

To evaluate \arg\max_{R \in \mathcal{R}_{n,k}} L(\theta; R), consider the Lagrangian
\[
  \sum_{i=1}^n \left( \sum_{j=1}^k R_{ij} \ln p_\theta(x_i, y = j) - \sum_{j=1}^k R_{ij} \ln R_{ij} + \lambda_i \left( \sum_{j=1}^k R_{ij} - 1 \right) \right).
\]
Fixing i and taking the derivative with respect to R_{ij} for any j,
\[
  0 = \ln p_\theta(x_i, y_i = j) - \ln R_{ij} - 1 + \lambda_i,
\]
giving R_{ij} = p_\theta(x_i, y = j) \exp(\lambda_i - 1). Since moreover
\[
  1 = \sum_j R_{ij} = \exp(\lambda_i - 1) \sum_j p_\theta(x_i, y = j) = \exp(\lambda_i - 1)\, p_\theta(x_i),
\]
it follows that \exp(\lambda_i - 1) = 1 / p_\theta(x_i), and the optimal R satisfies
\[
  R_{ij} = \frac{p_\theta(x_i, y = j)}{p_\theta(x_i)} = p_\theta(y = j \mid x_i). \qquad \square
\]
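One can also confirm the conclusion numerically (a minimal sketch reusing the made-up parameters from the earlier sanity check): for random responsibility matrices R, L(θ; R) never exceeds its value at the posterior.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
x = rng.normal(size=5)
pi = np.array([0.3, 0.7])                       # made-up weights
mu = np.array([-1.0, 2.0])                      # made-up means

joint = pi * norm.pdf(x[:, None], loc=mu)       # p_theta(x_i, y = j)
R_post = joint / joint.sum(axis=1, keepdims=True)

def L_aug(R):
    return (R * np.log(joint / R)).sum()        # augmented likelihood L(theta; R)

best = L_aug(R_post)
for _ in range(1000):
    R = rng.random(joint.shape)
    R /= R.sum(axis=1, keepdims=True)           # random rows on the simplex
    assert L_aug(R) <= best + 1e-9              # the posterior is the maximizer
```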
Related issues.
Parameter constraints.

E-M for GMMs still works if we freeze or constrain some parameters. Examples:
◮ No weights: initialize \pi = (1/k, \ldots, 1/k) and never update it.
◮ Diagonal covariance matrices: update everything as before, except
\[
  \Sigma_j := \operatorname{diag}\big( (\sigma_j)_1^2, \ldots, (\sigma_j)_d^2 \big)
  \qquad \text{where} \qquad
  (\sigma_j)_l^2 := \frac{\sum_{i=1}^n R_{ij} \big( (x_i)_l - (\mu_j)_l \big)^2}{n \pi_j};
\]
that is: we use coordinate-wise sample variances weighted by R (see the sketch after this list).

Why is this a good idea? Computation (of the inverse), sample complexity, …
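A sketch of how the diagonal constraint changes the M-step, relative to the full-covariance version earlier (again assuming the conventions of the hypothetical `em_step` sketch; only the covariance computation differs):

```python
import numpy as np

def m_step_diagonal(X, R):
    """M-step with diagonal covariances. X: (n, d); R: (n, k) responsibilities."""
    n, d = X.shape
    Nj = R.sum(axis=0)                              # effective counts, Nj = n * pi_j
    pi = Nj / n
    mu = (R.T @ X) / Nj[:, None]                    # (k, d) weighted means
    # coordinate-wise weighted variances: (sigma_j)_l^2
    var = np.stack([
        (R[:, j, None] * (X - mu[j]) ** 2).sum(axis=0) / Nj[j]
        for j in range(R.shape[1])
    ])
    Sigma = np.stack([np.diag(v) for v in var])     # diag((sigma_j)_1^2, ..., (sigma_j)_d^2)
    return pi, mu, Sigma
```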
Gaussian Mixture Model with diagonal covariances.

[Figure: a sequence of E-M iterations fitting a Gaussian mixture with diagonal covariances.]