Rademacher complexity examples

Definition. Given examples $(x_1, \ldots, x_n)$ and functions $\mathcal{F}$,
$$\mathrm{Rad}(\mathcal{F}) = \mathbb{E}_\epsilon \frac{1}{n} \max_{f \in \mathcal{F}} \sum_{i=1}^n \epsilon_i f(x_i),$$
where $(\epsilon_1, \ldots, \epsilon_n)$ are IID Rademacher random variables ($\Pr[\epsilon_i = 1] = \Pr[\epsilon_i = -1] = \tfrac{1}{2}$).

Examples.
◮ If $\|x\| \le R$, then $\mathrm{Rad}(\{x \mapsto x^\mathsf{T} w : \|w\| \le W\}) \le RW/\sqrt{n}$. For SVM, we can set $W = \sqrt{2/\lambda}$.
◮ For deep networks, we have $\mathrm{Rad}(\mathcal{F}) \le \text{Lipschitz} \cdot \text{Junk}/\sqrt{n}$; still very loose.

7 / 61
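A quick numerical illustration (not from the lecture; the helper name `rad_linear` and the data generation are ours): for the linear class $\{x \mapsto \langle w, x\rangle : \|w\| \le W\}$ the inner maximum has a closed form, so the Rademacher complexity can be estimated by Monte Carlo and compared to the $RW/\sqrt{n}$ bound.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, R, W = 200, 5, 1.0, 2.0

# Sample points on the radius-R sphere so that ||x_i|| <= R holds.
X = rng.normal(size=(n, d))
X = R * X / np.linalg.norm(X, axis=1, keepdims=True)

def rad_linear(X, W, trials=2000):
    """Monte Carlo estimate of Rad({x -> <w, x> : ||w|| <= W})."""
    n = X.shape[0]
    vals = []
    for _ in range(trials):
        eps = rng.choice([-1.0, 1.0], size=n)
        # sup_{||w|| <= W} <w, sum_i eps_i x_i> = W * ||sum_i eps_i x_i||
        vals.append(W * np.linalg.norm(eps @ X) / n)
    return np.mean(vals)

print("estimate:       ", rad_linear(X, W))
print("bound RW/sqrt(n):", R * W / np.sqrt(n))
```

The estimate should land below the bound.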
Unsupervised learning

Now we only receive $(x_i)_{i=1}^n$, and the goal is. . . ?
◮ Encoding data in some compact representation (and decoding it).
◮ Data analysis; recovering "hidden structure" in data (e.g., recovering cliques or clusters).
◮ Features for supervised learning.
◮ . . . ?

The task is less clear-cut. In 2019 we still have people trying to formalize it!

8 / 61
SVD reminder

1. SV triples: $(s, u, v)$ satisfies $Mv = su$ and $M^\mathsf{T} u = sv$.
2. Thin decomposition SVD: $M = \sum_{i=1}^r s_i u_i v_i^\mathsf{T}$.
3. Full factorization SVD: $M = USV^\mathsf{T}$.
4. "Operational" view of SVD: for $M \in \mathbb{R}^{n \times d}$,
$$M = \begin{bmatrix} u_1 \cdots u_r & u_{r+1} \cdots u_n \end{bmatrix}
\begin{bmatrix} \mathrm{diag}(s_1, \ldots, s_r) & 0 \\ 0 & 0 \end{bmatrix}
\begin{bmatrix} v_1 \cdots v_r & v_{r+1} \cdots v_d \end{bmatrix}^{\!\mathsf{T}}.$$

The first parts of $U$, $V$ span the column / row space (respectively), the second parts the left / right nullspaces (respectively).

New: let $(U_k, S_k, V_k)$ denote the truncated SVD with $U_k \in \mathbb{R}^{n \times k}$ (the first $k$ columns of $U$), and similarly for the others.

9 / 61
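A brief sketch (ours, assuming numpy; variable names arbitrary) showing the full, thin, and truncated SVD from items 3, 2, and the "New" definition above:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 8, 5, 2
M = rng.normal(size=(n, d))

# Full SVD: U is n x n, Vt is d x d, s holds the min(n, d) singular values.
U, s, Vt = np.linalg.svd(M, full_matrices=True)

# Thin SVD: keep only the first r = rank(M) columns/rows; reconstructs M exactly.
r = np.linalg.matrix_rank(M)
M_thin = (U[:, :r] * s[:r]) @ Vt[:r, :]
assert np.allclose(M, M_thin)

# Truncated SVD (U_k, S_k, V_k): the first k columns of U and V, top k singular values.
U_k, S_k, V_k = U[:, :k], np.diag(s[:k]), Vt[:k, :].T
M_k = U_k @ S_k @ V_k.T
print("rank-k approximation error:", np.linalg.norm(M - M_k))
```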
PCA properties

Theorem. Let $X \in \mathbb{R}^{n \times d}$ with SVD $X = USV^\mathsf{T}$ and integer $k \le r$ be given. Then
$$\min_{\substack{D \in \mathbb{R}^{k \times d} \\ E \in \mathbb{R}^{d \times k}}} \|X - XED\|_F^2
= \min_{\substack{D \in \mathbb{R}^{d \times k} \\ D^\mathsf{T} D = I}} \|X - XDD^\mathsf{T}\|_F^2
= \|X - XV_k V_k^\mathsf{T}\|_F^2
= \sum_{i=k+1}^r s_i^2.$$
Additionally,
$$\min_{\substack{D \in \mathbb{R}^{d \times k} \\ D^\mathsf{T} D = I}} \|X - XDD^\mathsf{T}\|_F^2
= \|X\|_F^2 - \max_{\substack{D \in \mathbb{R}^{d \times k} \\ D^\mathsf{T} D = I}} \|XD\|_F^2
= \|X\|_F^2 - \|XV_k\|_F^2
= \|X\|_F^2 - \sum_{i=1}^k s_i^2.$$

Remark 1. The SVD is not unique, but $\sum_{i=1}^r s_i^2$ is unique.
Remark 2. As written, this is not a convex optimization problem!
Remark 3. The second form is interesting. . .

10 / 61
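A small numerical check (ours, assuming numpy) of the two identities: the projection error onto the top-$k$ right singular vectors equals the tail sum $\sum_{i>k} s_i^2$, and the captured norm $\|XV_k\|_F^2$ equals $\sum_{i\le k} s_i^2$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 50, 6, 2
X = rng.normal(size=(n, d))

U, s, Vt = np.linalg.svd(X, full_matrices=False)
V_k = Vt[:k, :].T                        # d x k, orthonormal columns

proj_err = np.linalg.norm(X - X @ V_k @ V_k.T, 'fro') ** 2
tail = np.sum(s[k:] ** 2)
print(proj_err, tail)                    # should agree

captured = np.linalg.norm(X @ V_k, 'fro') ** 2
head = np.sum(s[:k] ** 2)
print(captured, head)                    # should agree
```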
Centered PCA

Some treatments replace $X$ with $X - \mathbf{1}\mu^\mathsf{T}$, with mean $\mu = \frac{1}{n}\sum_{i=1}^n x_i$.

$\frac{1}{n} X^\mathsf{T} X \in \mathbb{R}^{d \times d}$ is the data covariance;

$\frac{1}{n} (XD)^\mathsf{T}(XD)$ is the data covariance after projection;

lastly,
$$\frac{1}{n}\|XD\|_F^2 = \frac{1}{n}\,\mathrm{tr}\big((XD)^\mathsf{T}(XD)\big) = \frac{1}{n}\sum_{i=1}^k (XDe_i)^\mathsf{T}(XDe_i),$$
therefore PCA is maximizing the resulting per-coordinate variances!

11 / 61
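A brief numerical check (ours; numpy assumed) that, after centering, $\frac{1}{n}\|XD\|_F^2$ with $D = V_k$ is exactly the sum of the per-coordinate variances of the projected data:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, k = 200, 4, 2
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))   # correlated data
Xc = X - X.mean(axis=0)                 # center: X - 1 mu^T

_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
D = Vt[:k, :].T                         # top-k right singular vectors

Z = Xc @ D                              # projected (and still centered) data
lhs = np.linalg.norm(Z, 'fro') ** 2 / n
rhs = np.sum(Z.var(axis=0))             # per-coordinate variances (1/n convention)
print(lhs, rhs)                         # should agree
```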
Lloyd's method revisited

1. Choose initial clusters $(S_1, \ldots, S_k)$.
2. Repeat until convergence:
   2.1 (Recenter.) Set $\mu_j := \mathrm{mean}(S_j)$ for $j \in (1, \ldots, k)$.
   2.2 (Reassign.) Update $S_j := \{x_i : \mu(x_i) = \mu_j\}$ for $j \in (1, \ldots, k)$.
   ("$\mu(x_i)$" means "the center closest to $x_i$"; break ties arbitrarily.)
   (A code sketch of this loop follows this slide.)

Geometric perspective:
◮ Centers define a Voronoi diagram/partition: for each $\mu_j$, define the cell $V_j := \{x \in \mathbb{R}^d : \mu(x) = \mu_j\}$ (break ties arbitrarily).
◮ Reassignment leaves the assignment consistent with the Voronoi cells.
◮ Recentering might shift data outside Voronoi cells, except if we've converged!
◮ See http://mjt.cs.illinois.edu/htv/ for an interactive demo.

12 / 61
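A minimal numpy sketch of the recenter/reassign loop (the function name `lloyd` and the random-point initialization are our choices; the lecture's version starts from an initial partition instead):

```python
import numpy as np

def lloyd(X, k, iters=100, seed=0):
    """Lloyd's method: alternate reassignment and recentering until stable."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)]    # initial centers
    assign = None
    for _ in range(iters):
        # Reassign: each point goes to its closest center (ties broken by argmin).
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        new_assign = dists.argmin(axis=1)
        if assign is not None and np.array_equal(new_assign, assign):
            break                                         # converged
        assign = new_assign
        # Recenter: each center becomes the mean of its (nonempty) cluster.
        for j in range(k):
            if np.any(assign == j):
                mu[j] = X[assign == j].mean(axis=0)
    return mu, assign

X = np.random.default_rng(1).normal(size=(300, 2))
centers, labels = lloyd(X, k=3)
```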
Does Lloyd's method solve the original problem?

Theorem.
◮ For all $t$, $\phi(C_t; A_{t-1}) \ge \phi(C_t; A_t) \ge \phi(C_{t+1}; A_t)$.
◮ The method terminates.

Proof.
◮ The first property follows from the earlier theorem and the definition of the algorithm:
$$\phi(C_t; A_t) = \phi(C_t; A(C_t)) = \min_{A \in \mathcal{A}} \phi(C_t; A) \le \phi(C_t; A_{t-1}),$$
$$\phi(C_{t+1}; A_t) = \phi(C(A_t); A_t) = \min_{C \in \mathcal{C}} \phi(C; A_t) \le \phi(C_t; A_t).$$
◮ The previous property implies the cost is nonincreasing. Combined with the termination condition: all but the final partition are visited at most once. There are finitely many partitions of $(x_i)_{i=1}^n$. □

(That didn't answer the question. . . )

13 / 61
Seriously: does Lloyd's method solve the original problem?

◮ In practice, Lloyd's method seems to optimize well; in theory, the output can have unboundedly poor cost. (Suppose the width is $c > 1$ and the height is 1.)
◮ In practice, the method takes few iterations; in theory, it can take $2^{\Omega(\sqrt{n})}$ iterations! (Examples of this are painful; but note, the problem is NP-hard, and the convergence proof used the number of partitions. . . )

So: in practice, yes; in theory, we don't know. . .

14 / 61
Application: vector quantization

Vector quantization with $k$-means.
◮ Let $(x_i)_{i=1}^n$ be given.
◮ Run $k$-means to obtain $(\mu_1, \ldots, \mu_k)$.
◮ Replace each $(x_i)_{i=1}^n$ with $(\mu(x_i))_{i=1}^n$.

Encoding size reduces from $O(nd)$ to $O(kd + n\ln(k))$.

Examples.
◮ Audio compression.
◮ Image compression.

15 / 61
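A hedged sketch of patch-based image quantization with $k$-means, using scikit-learn's `KMeans` (whose default initialization is kmeans++); the non-overlapping-patch extraction and the random grayscale stand-in image are our simplifications, not necessarily how the figures below were produced.

```python
import numpy as np
from sklearn.cluster import KMeans

def quantize_patches(img, width=10, k=32):
    """Replace each non-overlapping width x width patch by its nearest exemplar."""
    h, w = img.shape[0] // width * width, img.shape[1] // width * width
    img = img[:h, :w]
    # Flatten patches into rows of a data matrix: one row per patch.
    patches = (img.reshape(h // width, width, w // width, width)
                  .transpose(0, 2, 1, 3)
                  .reshape(-1, width * width))
    km = KMeans(n_clusters=k, n_init=10).fit(patches)    # kmeans++ init by default
    coded = km.cluster_centers_[km.labels_]               # mu(x_i) for each patch
    # Reassemble the quantized patches back into an image.
    return (coded.reshape(h // width, w // width, width, width)
                 .transpose(0, 2, 1, 3)
                 .reshape(h, w))

img = np.random.default_rng(0).random((500, 500))          # stand-in for a real image
out = quantize_patches(img, width=10, k=32)
```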
[Figure series (slide 16); only the captions are recoverable:]
patch quantization, width 10, 8 exemplars
patch quantization, width 10, 32 exemplars
patch quantization, width 10, 128 exemplars
patch quantization, width 10, 512 exemplars
patch quantization, width 10, 2048 exemplars
patch quantization, width 25, 8 exemplars
patch quantization, width 25, 32 exemplars
patch quantization, width 25, 128 exemplars
patch quantization, width 25, 256 exemplars
patch quantization, width 50, 8 exemplars
patch quantization, width 50, 32 exemplars
patch quantization, width 50, 64 exemplars

16 / 61
Initialization matters!

◮ Easy choices:
  ◮ $k$ random points from the dataset.
  ◮ Random partition.
◮ Standard choice (theory and practice): "$D^2$-sampling" / kmeans++ (a code sketch follows below):
  1. Choose $\mu_1$ uniformly at random from the data.
  2. For $j \in (2, \ldots, k)$: choose $\mu_j := x_i$ with probability proportional to $\min_{l<j} \|x_i - \mu_l\|_2^2$.
◮ kmeans++ is a randomized furthest-first traversal; regular furthest-first fails with outliers.
◮ Scikit-learn and Matlab both default to kmeans++.

17 / 61
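A minimal sketch of $D^2$-sampling (the helper name `dsquared_init` is ours):

```python
import numpy as np

def dsquared_init(X, k, rng=None):
    """kmeans++ / D^2-sampling: pick centers with prob. proportional to squared distance."""
    rng = rng or np.random.default_rng()
    centers = [X[rng.integers(len(X))]]              # mu_1 uniform from the data
    for _ in range(2, k + 1):
        # min_{l<j} ||x_i - mu_l||^2 for every point x_i
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)

X = np.random.default_rng(0).normal(size=(500, 2))
init = dsquared_init(X, k=4)
```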
Maximum likelihood: abstract formulation

We've had one main "meta-algorithm" this semester:
◮ (Regularized) ERM principle: pick the model that minimizes an average loss over the training data.

We've also discussed another: the "maximum likelihood estimation (MLE)" principle:
◮ Pick a set of probability models for your data: $\mathcal{P} := \{p_\theta : \theta \in \Theta\}$.
◮ $p_\theta$ will denote both densities and masses; the literature is similarly inconsistent.
◮ Given samples $(z_i)_{i=1}^n$, pick the model that maximizes the likelihood:
$$\max_{\theta \in \Theta} L(\theta) = \max_{\theta \in \Theta} \ln \prod_{i=1}^n p_\theta(z_i) = \max_{\theta \in \Theta} \sum_{i=1}^n \ln p_\theta(z_i),$$
where the $\ln(\cdot)$ is for mathematical convenience, and $z_i$ can be a labeled pair $(x_i, y_i)$ or just $x_i$.

18 / 61
Example 1: coin flips.

◮ We flip a coin of bias $\theta \in [0, 1]$.
◮ Write down $x_i = 0$ for tails, $x_i = 1$ for heads; then $p_\theta(x_i) = x_i\theta + (1 - x_i)(1 - \theta)$, or alternatively $p_\theta(x_i) = \theta^{x_i}(1 - \theta)^{1 - x_i}$. The second form will be more convenient.
◮ Writing $H := \sum_i x_i$ and $T := \sum_i (1 - x_i) = n - H$ for convenience,
$$L(\theta) = \sum_{i=1}^n \big( x_i \ln\theta + (1 - x_i)\ln(1 - \theta) \big) = H\ln\theta + T\ln(1 - \theta).$$
Differentiating and setting to 0,
$$0 = \frac{H}{\theta} - \frac{T}{1 - \theta}, \qquad \text{which gives } \theta = \frac{H}{H + T} = \frac{H}{n}.$$
◮ In this way, we've justified a natural algorithm.

19 / 61
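A tiny numerical check (ours) that the closed form $\theta = H/n$ matches a brute-force maximization of $L(\theta)$ over a grid:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.7, size=100)       # flips from a coin with true bias 0.7
H, n = x.sum(), len(x)

thetas = np.linspace(0.01, 0.99, 999)
loglik = H * np.log(thetas) + (n - H) * np.log(1 - thetas)

print("closed form H/n:", H / n)
print("grid argmax:    ", thetas[loglik.argmax()])   # should be (nearly) the same
```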
Example 2: mean of a Gaussian

◮ Suppose $x_i \sim \mathcal{N}(\mu, \sigma^2)$, so $\theta = (\mu, \sigma^2)$, and
$$\ln p_\theta(x_i) = \ln \frac{\exp\big( -(x_i - \mu)^2 / (2\sigma^2) \big)}{\sqrt{2\pi\sigma^2}} = -\frac{(x_i - \mu)^2}{2\sigma^2} - \frac{\ln(2\pi\sigma^2)}{2}.$$
◮ Therefore
$$L(\theta) = -\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2 + \text{stuff without } \mu;$$
applying $\nabla_\mu$ and setting to zero gives $\mu = \frac{1}{n}\sum_i x_i$.
◮ A similar derivation gives $\sigma^2 = \frac{1}{n}\sum_i (x_i - \mu)^2$.

20 / 61
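The same kind of check for the Gaussian case (a sketch assuming numpy and scipy are available): the closed-form $\mu$ and $\sigma^2$ should match a numerical maximizer of the log-likelihood.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
x = rng.normal(loc=3.0, scale=2.0, size=1000)

def negloglik(params):
    # Negative of L(theta) for theta = (mu, sigma^2), up to the same constants as above.
    mu, sigma2 = params
    return 0.5 * np.sum((x - mu) ** 2) / sigma2 + 0.5 * len(x) * np.log(2 * np.pi * sigma2)

opt = minimize(negloglik, x0=[0.0, 1.0], bounds=[(None, None), (1e-6, None)])

print("closed form:", x.mean(), ((x - x.mean()) ** 2).mean())
print("numerical:  ", opt.x)
```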
Example 4: Naive Bayes

◮ Let's try a simple prediction setup, with (Bayes) optimal classifier
$$\operatorname*{arg\,max}_{y \in \mathcal{Y}} p(Y = y \mid X = x).$$
(We haven't discussed this concept a lot, but it's widespread in ML.)
◮ One way to proceed is to learn $p(Y \mid X)$ exactly; that's a pain.
◮ Let's assume the coordinates of $X = (X_1, \ldots, X_d)$ are independent given $Y$:
$$p(Y = y \mid X = x) = \frac{p(Y = y, X = x)}{p(X = x)} = \frac{p(X = x \mid Y = y)\, p(Y = y)}{p(X = x)} = \frac{p(Y = y) \prod_{j=1}^d p(X_j = x_j \mid Y = y)}{p(X = x)},$$
and
$$\operatorname*{arg\,max}_{y \in \mathcal{Y}} p(Y = y \mid X = x) = \operatorname*{arg\,max}_{y \in \mathcal{Y}} p(Y = y) \prod_{j=1}^d p(X_j = x_j \mid Y = y).$$

21 / 61
Example 4: Naive Bayes (part 2)

$$\operatorname*{arg\,max}_{y \in \mathcal{Y}} p(Y = y \mid X = x) = \operatorname*{arg\,max}_{y \in \mathcal{Y}} p(Y = y) \prod_{j=1}^d p(X_j = x_j \mid Y = y).$$

Examples where this helps:
◮ Suppose $X \in \{0, 1\}^d$ has an arbitrary distribution; it's specified with $2^d - 1$ numbers, whereas the factored form above needs $d$ numbers. To see how this can help: instead of having to learn a probability model over $2^d$ possibilities, we now have to learn $d + 1$ models each with 2 possibilities (binary labels).
◮ HW5 will use the standard "Iris dataset". This data is continuous, so Naive Bayes would approximate univariate distributions.

22 / 61
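A compact sketch of Naive Bayes for continuous features (one univariate Gaussian per class and coordinate, fit with the MLE formulas from Example 2). This is our illustration on synthetic data, not the HW5 solution; the class name `GaussianNB` is just a label (scikit-learn happens to ship an estimator of the same name).

```python
import numpy as np

class GaussianNB:
    """Naive Bayes with one univariate Gaussian per (class, coordinate) pair."""
    def fit(self, X, y):
        self.classes = np.unique(y)
        self.prior = np.array([(y == c).mean() for c in self.classes])
        self.mu = np.array([X[y == c].mean(axis=0) for c in self.classes])
        self.var = np.array([X[y == c].var(axis=0) + 1e-9 for c in self.classes])
        return self

    def predict(self, X):
        # log p(y) + sum_j log p(x_j | y), maximized over y.
        log_joint = (
            np.log(self.prior)[None, :]
            - 0.5 * np.sum(
                np.log(2 * np.pi * self.var)[None, :, :]
                + (X[:, None, :] - self.mu[None, :, :]) ** 2 / self.var[None, :, :],
                axis=2,
            )
        )
        return self.classes[log_joint.argmax(axis=1)]

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(3, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)
print((GaussianNB().fit(X, y).predict(X) == y).mean())   # training accuracy
```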
Gaussian Mixture Model

◮ Suppose data is drawn from $k$ Gaussians, meaning
$$Y = j \sim \mathrm{Discrete}(\pi_1, \ldots, \pi_k), \qquad X = x \mid Y = j \sim \mathcal{N}(\mu_j, \Sigma_j),$$
and the parameters are $\theta = ((\pi_1, \mu_1, \Sigma_1), \ldots, (\pi_k, \mu_k, \Sigma_k))$.
(Note: this is a generative model, and we have a way to sample.)
◮ The probability density (with parameters $\theta = ((\pi_j, \mu_j, \Sigma_j))_{j=1}^k$) at a given $x$ is
$$p_\theta(x) = \sum_{j=1}^k p_\theta(x \mid y = j)\, p_\theta(y = j) = \sum_{j=1}^k p_{\mu_j, \Sigma_j}(x \mid Y = j)\, \pi_j,$$
and the likelihood problem is
$$L(\theta) = \sum_{i=1}^n \ln \sum_{j=1}^k \frac{\pi_j}{\sqrt{(2\pi)^d |\Sigma_j|}} \exp\Big( -\frac{1}{2}(x_i - \mu_j)^\mathsf{T} \Sigma_j^{-1} (x_i - \mu_j) \Big).$$
The $\ln$ and the $\exp$ are no longer next to each other; we can't just take the derivative and set the answer to 0.

23 / 61
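Since the model is generative, here is a short sketch (ours; numpy and scipy assumed) that samples from a $k = 3$ mixture and evaluates $L(\theta)$ at the true parameters:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
pis  = np.array([0.5, 0.3, 0.2])
mus  = [np.array([0., 0.]), np.array([4., 0.]), np.array([0., 4.])]
covs = [np.eye(2), np.diag([2., .5]), np.array([[1., .8], [.8, 1.]])]

# Generative model: first draw y ~ Discrete(pi), then x | y ~ N(mu_y, Sigma_y).
y = rng.choice(3, size=500, p=pis)
X = np.vstack([rng.multivariate_normal(mus[j], covs[j]) for j in y])

# Log-likelihood L(theta) = sum_i ln sum_j pi_j p_{mu_j, Sigma_j}(x_i).
dens = np.column_stack([multivariate_normal(mus[j], covs[j]).pdf(X) for j in range(3)])
L = np.sum(np.log(dens @ pis))
print(L)
```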
Pearson's crabs.

Statistician Karl Pearson wanted to understand the distribution of "forehead breadth to body length" for 1000 crabs.

[Figure: histogram of the ratio, roughly over the range 0.58-0.68.]

Doesn't look Gaussian!

Pearson fit a mixture of two Gaussians.

Remark. Pearson did not use E-M. For this he invented the "method of moments" and obtained a solution by hand.

24-25 / 61
Gaussian mixture likelihood with responsibility matrix R

Let's replace $\sum_{i=1}^n \ln \sum_{j=1}^k \pi_j p_{\mu_j, \Sigma_j}(x_i)$ with
$$\sum_{i=1}^n \sum_{j=1}^k R_{ij} \ln\big( \pi_j p_{\mu_j, \Sigma_j}(x_i) \big),$$
where $R \in \mathcal{R}_{n,k} := \{R \in [0,1]^{n \times k} : R\mathbf{1}_k = \mathbf{1}_n\}$ is a responsibility matrix.

Holding $R$ fixed and optimizing $\theta$ gives
$$\pi_j := \frac{\sum_{i=1}^n R_{ij}}{\sum_{i=1}^n \sum_{l=1}^k R_{il}} = \frac{\sum_{i=1}^n R_{ij}}{n}, \qquad
\mu_j := \frac{\sum_{i=1}^n R_{ij} x_i}{\sum_{i=1}^n R_{ij}} = \frac{\sum_{i=1}^n R_{ij} x_i}{n\pi_j}, \qquad
\Sigma_j := \frac{\sum_{i=1}^n R_{ij}(x_i - \mu_j)(x_i - \mu_j)^\mathsf{T}}{n\pi_j}.$$
(Should use the new mean in $\Sigma_j$ so that all derivatives are 0.)

26 / 61
Generalizing the assignment matrix to GMMs

We introduced an assignment matrix $A \in \{0, 1\}^{n \times k}$:
◮ For each $x_i$, define $\mu(x_i)$ to be a closest center: $\|x_i - \mu(x_i)\| = \min_j \|x_i - \mu_j\|$.
◮ For each $i$, set $A_{ij} = \mathbf{1}[\mu(x_i) = \mu_j]$.
◮ Key property: by this choice,
$$\phi(C; A) = \sum_{i=1}^n \sum_{j=1}^k A_{ij} \|x_i - \mu_j\|^2 = \sum_{i=1}^n \min_j \|x_i - \mu_j\|^2 = \phi(C);$$
therefore we can decrease $\phi(C) = \phi(C; A)$ first by optimizing $C$ to get $\phi(C'; A) \le \phi(C; A)$, then setting $A'$ as above to get
$$\phi(C') = \phi(C'; A') \le \phi(C'; A) \le \phi(C; A) = \phi(C).$$
In other words: we minimize $\phi(C)$ via $\phi(C; A)$.

What fulfills the same role for $L$?

27 / 61
E-M method for latent variable models

Define the augmented likelihood
$$L(\theta; R) := \sum_{i=1}^n \sum_{j=1}^k R_{ij} \ln \frac{p_\theta(x_i, y_i = j)}{R_{ij}},$$
with responsibility matrix $R \in \mathcal{R}_{n,k} := \{R \in [0,1]^{n \times k} : R\mathbf{1}_k = \mathbf{1}_n\}$.

Alternate two steps:
◮ E-step: set $(R_t)_{ij} := p_{\theta_{t-1}}(y_i = j \mid x_i)$.
◮ M-step: set $\theta_t = \operatorname*{arg\,max}_{\theta \in \Theta} L(\theta; R_t)$.

Soon: we'll see this gives nondecreasing likelihood!

28 / 61
E-M for Gaussian mixtures

Initialization: a standard choice is $\pi_j = 1/k$, $\Sigma_j = I$, and $(\mu_j)_{j=1}^k$ given by $k$-means.

◮ E-step: set $R_{ij} = p_\theta(y_i = j \mid x_i)$, meaning
$$R_{ij} = p_\theta(y_i = j \mid x_i) = \frac{p_\theta(y_i = j, x_i)}{p_\theta(x_i)} = \frac{\pi_j p_{\mu_j, \Sigma_j}(x_i)}{\sum_{l=1}^k \pi_l p_{\mu_l, \Sigma_l}(x_i)}.$$
◮ M-step: solve $\operatorname*{arg\,max}_{\theta \in \Theta} L(\theta; R)$, meaning
$$\pi_j := \frac{\sum_{i=1}^n R_{ij}}{\sum_{i=1}^n \sum_{l=1}^k R_{il}} = \frac{\sum_{i=1}^n R_{ij}}{n}, \qquad
\mu_j := \frac{\sum_{i=1}^n R_{ij} x_i}{\sum_{i=1}^n R_{ij}} = \frac{\sum_{i=1}^n R_{ij} x_i}{n\pi_j}, \qquad
\Sigma_j := \frac{\sum_{i=1}^n R_{ij}(x_i - \mu_j)(x_i - \mu_j)^\mathsf{T}}{n\pi_j}.$$
(These are as before.)

29 / 61
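Putting the two steps together, a compact sketch of E-M for a Gaussian mixture (ours; the random-point initialization and the small covariance regularizer are simplifications relative to the $k$-means initialization suggested above):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, iters=50, seed=0):
    """E-M for a Gaussian mixture; returns (pi, mu, Sigma, R)."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(n, size=k, replace=False)]     # (k-means init is the standard choice)
    Sigma = np.stack([np.eye(d)] * k)
    pi = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: R_ij = pi_j p_{mu_j,Sigma_j}(x_i) / sum_l pi_l p_{mu_l,Sigma_l}(x_i).
        dens = np.column_stack(
            [pi[j] * multivariate_normal(mu[j], Sigma[j]).pdf(X) for j in range(k)]
        )
        R = dens / dens.sum(axis=1, keepdims=True)
        # M-step: closed-form updates from the slide.
        Nj = R.sum(axis=0)
        pi = Nj / n
        mu = (R.T @ X) / Nj[:, None]
        for j in range(k):
            diff = X - mu[j]
            # Small ridge keeps Sigma_j invertible; a simplification, not from the lecture.
            Sigma[j] = (R[:, j, None] * diff).T @ diff / Nj[j] + 1e-6 * np.eye(d)
    return pi, mu, Sigma, R

X = np.vstack([np.random.default_rng(1).normal(m, 1, (150, 2)) for m in ([0, 0], [5, 5])])
pi, mu, Sigma, R = em_gmm(X, k=2)
```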
Demo: elliptical clusters

E. . . M. . . E. . . M. . . E. . . M. . . E. . .

[Figure: Gaussian mixture fit to elliptical clusters, updated after each E and M step.]

30 / 61