
Final exam review — CS 446, selected lecture slides

Hoeffding's inequality

Theorem (Hoeffding's inequality). Given IID $Z_i \in [a, b]$,
$$\Pr\left[ \frac{1}{n}\sum_{i=1}^n Z_i \;\ge\; \mathbb{E}\,Z_1 + \epsilon \right] \;\le\; \exp\left( \frac{-2 n \epsilon^2}{(b-a)^2} \right).$$
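The bound can be checked empirically. Below is a minimal sketch (not from the slides) that draws IID uniform variables on $[a,b]$ and compares the empirical tail probability to the Hoeffding bound; all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b, n, eps, trials = 0.0, 1.0, 100, 0.1, 20000

# Draw `trials` independent samples of the mean of n IID Uniform[a, b] variables.
means = rng.uniform(a, b, size=(trials, n)).mean(axis=1)

empirical = np.mean(means - (a + b) / 2 >= eps)      # Pr[mean - E[Z_1] >= eps]
hoeffding = np.exp(-2 * n * eps**2 / (b - a)**2)     # Hoeffding upper bound

print(f"empirical tail: {empirical:.4f}, Hoeffding bound: {hoeffding:.4f}")
```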


Rademacher complexity examples

Definition. Given examples $(x_1, \dots, x_n)$ and functions $\mathcal{F}$,
$$\mathrm{Rad}(\mathcal{F}) = \mathbb{E}_\epsilon \max_{f\in\mathcal{F}} \frac{1}{n}\sum_{i=1}^n \epsilon_i f(x_i),$$
where $(\epsilon_1, \dots, \epsilon_n)$ are IID Rademacher random variables ($\Pr[\epsilon_i = 1] = \Pr[\epsilon_i = -1] = \tfrac{1}{2}$).

Examples.
- If $\|x\| \le R$, then $\mathrm{Rad}(\{ x \mapsto x^{\mathsf T} w : \|w\| \le W \}) \le RW/\sqrt{n}$. For SVM, we can set $W = 2/\lambda$.
- For deep networks, we have $\mathrm{Rad}(\mathcal{F}) \le \text{Lipschitz} \cdot \text{Junk}/n$; still very loose.
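As an illustration (not from the slides), the Rademacher complexity of the linear class $\{x \mapsto x^{\mathsf T} w : \|w\| \le W\}$ can be estimated by Monte Carlo: for that class the inner maximum has the closed form $\frac{W}{n}\|\sum_i \epsilon_i x_i\|$, and the estimate should fall below $RW/\sqrt{n}$. The constants below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, R, W = 200, 5, 1.0, 2.0

# Sample points with norm exactly R.
X = rng.normal(size=(n, d))
X = R * X / np.linalg.norm(X, axis=1, keepdims=True)

# Monte Carlo estimate of E_eps max_{||w||<=W} (1/n) sum_i eps_i <x_i, w>
#                        = E_eps (W/n) * || sum_i eps_i x_i ||.
vals = []
for _ in range(2000):
    eps = rng.choice([-1.0, 1.0], size=n)
    vals.append(W / n * np.linalg.norm(eps @ X))

print("estimated Rad:", np.mean(vals))
print("bound R*W/sqrt(n):", R * W / np.sqrt(n))
```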

Unsupervised learning

Now we only receive $(x_i)_{i=1}^n$, and the goal is...?
- Encoding data in some compact representation (and decoding it).
- Data analysis; recovering "hidden structure" in data (e.g., recovering cliques or clusters).
- Features for supervised learning.
- ...?

The task is less clear-cut. In 2019 we still have people trying to formalize it!

SVD reminder

1. SV triples: $(s, u, v)$ satisfies $Mv = su$ and $M^{\mathsf T} u = sv$.
2. Thin decomposition SVD: $M = \sum_{i=1}^r s_i u_i v_i^{\mathsf T}$.
3. Full factorization SVD: $M = USV^{\mathsf T}$.
4. "Operational" view of the SVD: for $M \in \mathbb{R}^{n\times d}$,
$$M = \begin{pmatrix} u_1 & \cdots & u_r & u_{r+1} & \cdots & u_n \end{pmatrix}
\begin{pmatrix} s_1 & & & 0 \\ & \ddots & & \\ & & s_r & \\ 0 & & & 0 \end{pmatrix}
\begin{pmatrix} v_1 & \cdots & v_r & v_{r+1} & \cdots & v_d \end{pmatrix}^{\mathsf T}.$$
The first parts of $U$, $V$ span the column / row space (respectively); the second parts span the left / right nullspaces (respectively).

New: let $(U_k, S_k, V_k)$ denote the truncated SVD with $U_k \in \mathbb{R}^{n\times k}$ (the first $k$ columns of $U$), and similarly for the others.
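A minimal NumPy sketch (assumed, not from the slides) of the full, thin, and truncated views:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 6, 4, 2
M = rng.normal(size=(n, d))

# Full SVD: M = U S V^T with U (n x n), V (d x d); s holds the singular values.
U, s, Vt = np.linalg.svd(M, full_matrices=True)

# Thin SVD: keep only the first r = rank(M) triples.
r = np.linalg.matrix_rank(M)
M_thin = sum(s[i] * np.outer(U[:, i], Vt[i]) for i in range(r))
assert np.allclose(M, M_thin)

# Truncated SVD: U_k, S_k, V_k from the first k columns / values.
U_k, S_k, V_k = U[:, :k], np.diag(s[:k]), Vt[:k].T
M_k = U_k @ S_k @ V_k.T   # best rank-k approximation of M in Frobenius norm
```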

PCA properties

Theorem. Let $X \in \mathbb{R}^{n\times d}$ with SVD $X = USV^{\mathsf T}$ and integer $k \le r$ be given. Then
$$\min_{\substack{D \in \mathbb{R}^{k\times d} \\ E \in \mathbb{R}^{d\times k}}} \|X - XED\|_F^2
= \min_{\substack{D \in \mathbb{R}^{d\times k} \\ D^{\mathsf T} D = I}} \|X - XDD^{\mathsf T}\|_F^2
= \|X - XV_k V_k^{\mathsf T}\|_F^2
= \sum_{i=k+1}^{r} s_i^2.$$
Additionally,
$$\min_{\substack{D \in \mathbb{R}^{d\times k} \\ D^{\mathsf T} D = I}} \|X - XDD^{\mathsf T}\|_F^2
= \|X\|_F^2 - \max_{\substack{D \in \mathbb{R}^{d\times k} \\ D^{\mathsf T} D = I}} \|XD\|_F^2
= \|X\|_F^2 - \|XV_k\|_F^2
= \|X\|_F^2 - \sum_{i=1}^{k} s_i^2.$$

Remark 1. The SVD is not necessarily unique, but the quantity $\sum_{i=1}^{r} s_i^2$ is.
Remark 2. As written, this is not a convex optimization problem!
Remark 3. The second form is interesting...
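A short numerical check of the theorem (an assumed sketch, not from the slides): the rank-$k$ projection error should equal the sum of trailing squared singular values, and equivalently $\|X\|_F^2$ minus the leading ones.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 50, 8, 3
X = rng.normal(size=(n, d))

U, s, Vt = np.linalg.svd(X, full_matrices=False)
V_k = Vt[:k].T                                   # top-k right singular vectors

err = np.linalg.norm(X - X @ V_k @ V_k.T, 'fro')**2
print(err, np.sum(s[k:]**2))                                 # these should match
print(err, np.linalg.norm(X, 'fro')**2 - np.sum(s[:k]**2))   # and so should these
```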

Centered PCA

Some treatments replace $X$ with $X - \mathbf{1}\mu^{\mathsf T}$, with mean $\mu = \frac{1}{n}\sum_{i=1}^n x_i$.
- $\frac{1}{n} X^{\mathsf T} X \in \mathbb{R}^{d\times d}$ is the data covariance;
- $\frac{1}{n} (XD)^{\mathsf T} (XD)$ is the data covariance after projection;
- lastly,
$$\frac{1}{n}\|XD\|_F^2 = \frac{1}{n}\operatorname{tr}\big((XD)^{\mathsf T}(XD)\big) = \frac{1}{n}\sum_{i=1}^{k} (XDe_i)^{\mathsf T}(XDe_i),$$
therefore PCA is maximizing the resulting per-coordinate variances!
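A small sketch of the centered view (assumed, not from the slides): after centering, $\frac{1}{n}\|XD\|_F^2$ is the total variance of the projected coordinates, and it is maximized by taking $D = V_k$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 200, 5, 2
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))   # correlated data

Xc = X - X.mean(axis=0)                  # center: X - 1 mu^T
cov = Xc.T @ Xc / n                      # data covariance (d x d)

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
D = Vt[:k].T                             # top-k principal directions

proj_var = np.linalg.norm(Xc @ D, 'fro')**2 / n    # total variance after projection
print(proj_var, np.sum(s[:k]**2) / n)              # equal: PCA maximizes this
```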

Lloyd's method revisited

1. Choose initial clusters $(S_1, \dots, S_k)$.
2. Repeat until convergence:
   2.1 (Recenter.) Set $\mu_j := \mathrm{mean}(S_j)$ for $j \in (1, \dots, k)$.
   2.2 (Reassign.) Update $S_j := \{ x_i : \mu(x_i) = \mu_j \}$ for $j \in (1, \dots, k)$. ("$\mu(x_i)$" means "the center closest to $x_i$"; break ties arbitrarily.)

Geometric perspective:
- Centers define a Voronoi diagram/partition: for each $\mu_j$, define the cell $V_j := \{ x \in \mathbb{R}^d : \mu(x) = \mu_j \}$ (break ties arbitrarily).
- Reassignment leaves the assignment consistent with the Voronoi cells.
- Recentering might shift data outside their Voronoi cells, except if we've converged!
- See http://mjt.cs.illinois.edu/htv/ for an interactive demo.

A minimal code sketch of the loop appears below.
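This is an illustration consistent with the slide, not the course's reference code; ties are broken by `argmin`, and empty clusters keep their old center.

```python
import numpy as np

def lloyd(X, k, iters=100, seed=0):
    """Plain Lloyd's method: alternate reassignment and recentering."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)]   # k random data points
    for _ in range(iters):
        # Reassign: each point goes to its closest center (ties -> lowest index).
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Recenter: each center becomes the mean of its cluster.
        new_mu = np.array([X[assign == j].mean(axis=0) if np.any(assign == j) else mu[j]
                           for j in range(k)])
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    return mu, assign

X = np.random.default_rng(1).normal(size=(300, 2))
centers, labels = lloyd(X, k=3)
```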

Does Lloyd's method solve the original problem?

Theorem.
- For all $t$, $\phi(C_t; A_{t-1}) \ge \phi(C_t; A_t) \ge \phi(C_{t+1}; A_t)$.
- The method terminates.

Proof.
- The first property follows from the earlier theorem and the definition of the algorithm:
$$\phi(C_t; A_t) = \phi(C_t; A(C_t)) = \min_{A \in \mathcal{A}} \phi(C_t; A) \le \phi(C_t; A_{t-1}),$$
$$\phi(C_{t+1}; A_t) = \phi(C(A_t); A_t) = \min_{C \in \mathcal{C}} \phi(C; A_t) \le \phi(C_t; A_t).$$
- The previous property implies the cost is nonincreasing. Combined with the termination condition: all but the final partition are visited at most once, and there are finitely many partitions of $(x_i)_{i=1}^n$. □

(That didn't answer the question...)

Seriously: does Lloyd's method solve the original problem?

- In practice, Lloyd's method seems to optimize well; in theory, the output can have unboundedly poor cost. (Suppose the width is $c > 1$ and the height is 1.)
- In practice, the method takes few iterations; in theory, it can take $2^{\Omega(\sqrt{n})}$ iterations! (Examples of this are painful; but note, the problem is NP-hard, and the convergence proof used the number of partitions...)

So: in practice, yes; in theory, we don't know...

Application: vector quantization

Vector quantization with $k$-means:
- Let $(x_i)_{i=1}^n$ be given.
- Run $k$-means to obtain $(\mu_1, \dots, \mu_k)$.
- Replace each $x_i$ with $\mu(x_i)$.

The encoding size reduces from $O(nd)$ to $O(kd + n\ln(k))$.

Examples:
- Audio compression.
- Image compression.

A small quantization sketch appears below.
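The sketch below is assumed (not the course code) and uses random vectors as stand-ins for image patches; a real compressor would extract patches from an actual image. It uses scikit-learn's `KMeans`.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Stand-in for image patches: n vectors in R^d (d = width * width for square patches).
n, d, k = 1000, 100, 32
patches = rng.normal(size=(n, d))

km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(patches)
codes = km.predict(patches)                 # n integers in {0, ..., k-1}: ~ n*log2(k) bits
quantized = km.cluster_centers_[codes]      # each patch replaced by its nearest exemplar

# Encoding cost drops from O(n*d) floats to O(k*d) floats plus n*log2(k) bits.
print("distortion:", np.mean(np.sum((patches - quantized) ** 2, axis=1)))
```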

[Figures: patch-quantization demos on a single image (axes 0–500). Panels: width 10 with 8, 32, 128, 512, and 2048 exemplars; width 25 with 8, 32, 128, and 256 exemplars; width 50 with 8, 32, and 64 exemplars.]

Initialization matters!

- Easy choices:
  - $k$ random points from the dataset.
  - A random partition.
- Standard choice (theory and practice): "$D^2$-sampling" / kmeans++:
  1. Choose $\mu_1$ uniformly at random from the data.
  2. For $j \in (2, \dots, k)$: choose $\mu_j = x_i$ with probability proportional to $\min_{l<j} \|x_i - \mu_l\|_2^2$.
- kmeans++ is a randomized furthest-first traversal; regular furthest-first fails with outliers.
- Scikit-learn and MATLAB both default to kmeans++.

A minimal $D^2$-sampling sketch follows below.
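This is an assumed sketch of the sampling step only (not the course's reference code):

```python
import numpy as np

def dsquared_init(X, k, seed=0):
    """kmeans++ initialization: sample each new center proportional to the
    squared distance to the nearest center chosen so far."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]              # mu_1 uniform from the data
    for _ in range(1, k):
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)

X = np.random.default_rng(1).normal(size=(500, 2))
init = dsquared_init(X, k=5)
```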

Maximum likelihood: abstract formulation

We've had one main "meta-algorithm" this semester:
- (Regularized) ERM principle: pick the model that minimizes an average loss over the training data.

We've also discussed another: the "maximum likelihood estimation (MLE)" principle:
- Pick a set of probability models for your data: $\mathcal{P} := \{ p_\theta : \theta \in \Theta \}$.
- $p_\theta$ will denote both densities and masses; the literature is similarly inconsistent.
- Given samples $(z_i)_{i=1}^n$, pick the model that maximizes the likelihood:
$$\max_{\theta\in\Theta} L(\theta) = \max_{\theta\in\Theta} \ln \prod_{i=1}^n p_\theta(z_i) = \max_{\theta\in\Theta} \sum_{i=1}^n \ln p_\theta(z_i),$$
where the $\ln(\cdot)$ is for mathematical convenience, and $z_i$ can be a labeled pair $(x_i, y_i)$ or just $x_i$.

Example 1: coin flips

- We flip a coin of bias $\theta \in [0, 1]$.
- Write $x_i = 0$ for tails, $x_i = 1$ for heads; then $p_\theta(x_i) = x_i\theta + (1 - x_i)(1 - \theta)$, or alternatively $p_\theta(x_i) = \theta^{x_i}(1-\theta)^{1-x_i}$. The second form will be more convenient.
- Writing $H := \sum_i x_i$ and $T := \sum_i (1 - x_i) = n - H$ for convenience,
$$L(\theta) = \sum_{i=1}^n \big( x_i \ln\theta + (1 - x_i)\ln(1-\theta) \big) = H\ln\theta + T\ln(1-\theta).$$
Differentiating and setting to 0,
$$0 = \frac{H}{\theta} - \frac{T}{1-\theta},$$
which gives $\theta = \frac{H}{H + T} = \frac{H}{n}$.
- In this way, we've justified a natural algorithm.
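A quick numerical sanity check (assumed, not from the slides): the closed form $\hat\theta = H/n$ should maximize $L(\theta)$ over a fine grid.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.3, size=200)          # coin flips with true bias 0.3
H, n = x.sum(), len(x)

thetas = np.linspace(1e-6, 1 - 1e-6, 10001)
L = H * np.log(thetas) + (n - H) * np.log(1 - thetas)

print("grid maximizer:", thetas[np.argmax(L)], "closed form H/n:", H / n)
```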

Example 2: mean of a Gaussian

- Suppose $x_i \sim \mathcal{N}(\mu, \sigma^2)$, so $\theta = (\mu, \sigma^2)$, and
$$\ln p_\theta(x_i) = \ln \frac{\exp\big( -(x_i-\mu)^2 / (2\sigma^2) \big)}{\sqrt{2\pi\sigma^2}} = -\frac{(x_i - \mu)^2}{2\sigma^2} - \frac{\ln(2\pi\sigma^2)}{2}.$$
- Therefore
$$L(\theta) = -\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2 + \text{stuff without } \mu;$$
applying $\nabla_\mu$ and setting to zero gives $\mu = \frac{1}{n}\sum_i x_i$.
- A similar derivation gives $\sigma^2 = \frac{1}{n}\sum_i (x_i - \mu)^2$.
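Similarly (an assumed check, not from the slides), the Gaussian MLE uses the $1/n$ variance, which matches `np.var` with `ddof=0`:

```python
import numpy as np

x = np.random.default_rng(0).normal(loc=2.0, scale=1.5, size=1000)
mu_hat = np.mean(x)                          # MLE of mu
sigma2_hat = np.mean((x - mu_hat) ** 2)      # MLE of sigma^2 (divides by n, not n-1)
print(mu_hat, sigma2_hat, np.var(x, ddof=0))
```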

Example 4: Naive Bayes

- Let's try a simple prediction setup, with the (Bayes) optimal classifier
$$\arg\max_{y\in\mathcal{Y}} p(Y = y \mid X = x).$$
(We haven't discussed this concept a lot, but it's widespread in ML.)
- One way to proceed is to learn $p(Y \mid X)$ exactly; that's a pain.
- Let's assume the coordinates of $X = (X_1, \dots, X_d)$ are independent given $Y$:
$$p(Y = y \mid X = x) = \frac{p(Y = y, X = x)}{p(X = x)} = \frac{p(X = x \mid Y = y)\, p(Y = y)}{p(X = x)} = \frac{p(Y = y)\prod_{j=1}^d p(X_j = x_j \mid Y = y)}{p(X = x)},$$
and
$$\arg\max_{y\in\mathcal{Y}} p(Y = y \mid X = x) = \arg\max_{y\in\mathcal{Y}} p(Y = y)\prod_{j=1}^d p(X_j = x_j \mid Y = y).$$

Example 4: Naive Bayes (part 2)

$$\arg\max_{y\in\mathcal{Y}} p(Y = y \mid X = x) = \arg\max_{y\in\mathcal{Y}} p(Y = y)\prod_{j=1}^d p(X_j = x_j \mid Y = y).$$

Examples where this helps:
- Suppose $X \in \{0,1\}^d$ has an arbitrary distribution; it's specified with $2^d - 1$ numbers, whereas the factored form above needs $d$ numbers. To see how this can help: instead of having to learn a probability model over $2^d$ possibilities, we now have to learn $d + 1$ models each with 2 possibilities (binary labels).
- HW5 will use the standard "Iris dataset". This data is continuous; Naive Bayes would approximate univariate distributions.

A minimal Bernoulli Naive Bayes sketch follows below.
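This sketch is an assumed illustration for binary features (not the HW5 setup, which uses continuous Iris features); the Laplace-smoothing parameter `alpha` is an added detail to avoid zero probabilities.

```python
import numpy as np

def fit_nb(X, y, alpha=1.0):
    """Estimate p(Y=c) and p(X_j=1 | Y=c) with Laplace smoothing alpha."""
    classes = np.unique(y)
    priors = np.array([(y == c).mean() for c in classes])
    cond = np.array([(X[y == c].sum(axis=0) + alpha) / ((y == c).sum() + 2 * alpha)
                     for c in classes])                  # shape (num_classes, d)
    return classes, priors, cond

def predict_nb(X, classes, priors, cond):
    # log p(y) + sum_j log p(x_j | y), evaluated for every class.
    log_like = X @ np.log(cond).T + (1 - X) @ np.log(1 - cond).T
    return classes[np.argmax(np.log(priors) + log_like, axis=1)]

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 10)).astype(float)
y = (X[:, 0] + X[:, 1] > 0).astype(int)                  # a simple labeling rule
classes, priors, cond = fit_nb(X, y)
print("train accuracy:", (predict_nb(X, classes, priors, cond) == y).mean())
```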

Gaussian Mixture Model

- Suppose the data is drawn from $k$ Gaussians, meaning
$$Y = j \sim \mathrm{Discrete}(\pi_1, \dots, \pi_k), \qquad X = x \mid Y = j \sim \mathcal{N}(\mu_j, \Sigma_j),$$
and the parameters are $\theta = ((\pi_1, \mu_1, \Sigma_1), \dots, (\pi_k, \mu_k, \Sigma_k))$. (Note: this is a generative model, and we have a way to sample.)
- The probability density (with parameters $\theta = ((\pi_j, \mu_j, \Sigma_j))_{j=1}^k$) at a given $x$ is
$$p_\theta(x) = \sum_{j=1}^k p_\theta(x \mid y = j)\, p_\theta(y = j) = \sum_{j=1}^k p_{\mu_j, \Sigma_j}(x)\, \pi_j,$$
and the likelihood problem is
$$L(\theta) = \sum_{i=1}^n \ln \sum_{j=1}^k \frac{\pi_j}{\sqrt{(2\pi)^d |\Sigma_j|}} \exp\Big( -\tfrac{1}{2}(x_i - \mu_j)^{\mathsf T}\Sigma_j^{-1}(x_i - \mu_j) \Big).$$
The $\ln$ and the $\exp$ are no longer next to each other; we can't just take the derivative and set the answer to 0.
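A sketch (assumed, not from the slides) of evaluating this log-likelihood, using a log-sum-exp over components for numerical stability:

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def gmm_loglik(X, pis, mus, Sigmas):
    """L(theta) = sum_i log sum_j pi_j N(x_i; mu_j, Sigma_j)."""
    log_terms = np.stack([np.log(pi) + multivariate_normal.logpdf(X, mean=mu, cov=S)
                          for pi, mu, S in zip(pis, mus, Sigmas)], axis=1)   # (n, k)
    return logsumexp(log_terms, axis=1).sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
pis = [0.5, 0.5]
mus = [np.zeros(2), np.ones(2)]
Sigmas = [np.eye(2), np.eye(2)]
print(gmm_loglik(X, pis, mus, Sigmas))
```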

Pearson's crabs

Statistician Karl Pearson wanted to understand the distribution of "forehead breadth to body length" for 1000 crabs. [Figure: histogram of the ratio, roughly 0.58–0.68 on the horizontal axis.] It doesn't look Gaussian! Pearson fit a mixture of two Gaussians.

Remark. Pearson did not use E-M. For this he invented the "method of moments" and obtained a solution by hand.

Gaussian mixture likelihood with responsibility matrix R

Let's replace $\sum_{i=1}^n \ln \sum_{j=1}^k \pi_j\, p_{\mu_j,\Sigma_j}(x_i)$ with
$$\sum_{i=1}^n \sum_{j=1}^k R_{ij} \ln\big( \pi_j\, p_{\mu_j,\Sigma_j}(x_i) \big),$$
where $R \in \mathcal{R}_{n,k} := \{ R \in [0,1]^{n\times k} : R\mathbf{1}_k = \mathbf{1}_n \}$ is a responsibility matrix.

Holding $R$ fixed and optimizing $\theta$ gives (see the sketch below)
$$\pi_j := \frac{\sum_{i=1}^n R_{ij}}{\sum_{i=1}^n\sum_{l=1}^k R_{il}} = \frac{\sum_{i=1}^n R_{ij}}{n}, \qquad
\mu_j := \frac{\sum_{i=1}^n R_{ij} x_i}{\sum_{i=1}^n R_{ij}} = \frac{\sum_{i=1}^n R_{ij} x_i}{n\pi_j}, \qquad
\Sigma_j := \frac{\sum_{i=1}^n R_{ij}(x_i - \mu_j)(x_i - \mu_j)^{\mathsf T}}{n\pi_j}.$$
(We should use the new mean in $\Sigma_j$ so that all derivatives are 0.)
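A vectorized sketch of those three updates for fixed $R$ (assumed, not the course's reference code):

```python
import numpy as np

def m_step(X, R):
    """Given data X (n x d) and responsibilities R (n x k), return pi, mu, Sigma."""
    n, d = X.shape
    Nj = R.sum(axis=0)                       # effective counts, shape (k,); Nj = n * pi_j
    pis = Nj / n
    mus = (R.T @ X) / Nj[:, None]            # mu_j = sum_i R_ij x_i / (n pi_j)
    Sigmas = []
    for j in range(len(Nj)):
        diff = X - mus[j]                    # use the *new* mean, as in the remark
        Sigmas.append((R[:, j, None] * diff).T @ diff / Nj[j])
    return pis, mus, np.array(Sigmas)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
R = rng.dirichlet(np.ones(4), size=50)       # random responsibilities, rows sum to 1
pis, mus, Sigmas = m_step(X, R)
```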

Generalizing the assignment matrix to GMMs

We introduced an assignment matrix $A \in \{0,1\}^{n\times k}$:
- For each $x_i$, define $\mu(x_i)$ to be a closest center: $\|x_i - \mu(x_i)\| = \min_j \|x_i - \mu_j\|$.
- For each $i$, set $A_{ij} = \mathbf{1}[\mu(x_i) = \mu_j]$.
- Key property: by this choice,
$$\phi(C; A) = \sum_{i=1}^n \sum_{j=1}^k A_{ij}\|x_i - \mu_j\|^2 = \sum_{i=1}^n \min_j \|x_i - \mu_j\|^2 = \phi(C);$$
therefore we can decrease $\phi(C) = \phi(C; A)$ first by optimizing $C$ to get $\phi(C'; A) \le \phi(C; A)$, then setting $A'$ as above to get
$$\phi(C') = \phi(C'; A') \le \phi(C'; A) \le \phi(C; A) = \phi(C).$$
In other words: we minimize $\phi(C)$ via $\phi(C; A)$.

What fulfills the same role for $L$?

E-M method for latent variable models

Define the augmented likelihood
$$L(\theta; R) := \sum_{i=1}^n \sum_{j=1}^k R_{ij} \ln \frac{p_\theta(x_i, y_i = j)}{R_{ij}},$$
with responsibility matrix $R \in \mathcal{R}_{n,k} := \{ R \in [0,1]^{n\times k} : R\mathbf{1}_k = \mathbf{1}_n \}$.

Alternate two steps:
- E-step: set $(R_t)_{ij} := p_{\theta_{t-1}}(y_i = j \mid x_i)$.
- M-step: set $\theta_t = \arg\max_{\theta\in\Theta} L(\theta; R_t)$.

Soon: we'll see this gives nondecreasing likelihood!

E-M for Gaussian mixtures

Initialization: a standard choice is $\pi_j = 1/k$, $\Sigma_j = I$, and $(\mu_j)_{j=1}^k$ given by $k$-means.
- E-step: set $R_{ij} = p_\theta(y_i = j \mid x_i)$, meaning
$$R_{ij} = p_\theta(y_i = j \mid x_i) = \frac{p_\theta(y_i = j, x_i)}{p_\theta(x_i)} = \frac{\pi_j\, p_{\mu_j,\Sigma_j}(x_i)}{\sum_{l=1}^k \pi_l\, p_{\mu_l,\Sigma_l}(x_i)}.$$
- M-step: solve $\arg\max_{\theta\in\Theta} L(\theta; R)$, meaning
$$\pi_j := \frac{\sum_{i=1}^n R_{ij}}{\sum_{i=1}^n\sum_{l=1}^k R_{il}} = \frac{\sum_{i=1}^n R_{ij}}{n}, \qquad
\mu_j := \frac{\sum_{i=1}^n R_{ij} x_i}{\sum_{i=1}^n R_{ij}} = \frac{\sum_{i=1}^n R_{ij} x_i}{n\pi_j}, \qquad
\Sigma_j := \frac{\sum_{i=1}^n R_{ij}(x_i - \mu_j)(x_i - \mu_j)^{\mathsf T}}{n\pi_j}.$$
(These are as before.)

A compact end-to-end sketch of this loop follows below.
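This is an assumed sketch, not the course's reference implementation: for brevity it initializes the means with random data points rather than $k$-means, and the small ridge added to each $\Sigma_j$ is an extra numerical safeguard not in the slides.

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def em_gmm(X, k, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pis = np.full(k, 1.0 / k)
    mus = X[rng.choice(n, size=k, replace=False)]
    Sigmas = np.array([np.eye(d) for _ in range(k)])
    for _ in range(iters):
        # E-step: responsibilities R_ij = pi_j N(x_i; mu_j, Sigma_j) / p_theta(x_i).
        logp = np.stack([np.log(pis[j]) + multivariate_normal.logpdf(X, mus[j], Sigmas[j])
                         for j in range(k)], axis=1)
        R = np.exp(logp - logsumexp(logp, axis=1, keepdims=True))
        # M-step: closed-form updates from the previous slide.
        Nj = R.sum(axis=0)
        pis = Nj / n
        mus = (R.T @ X) / Nj[:, None]
        Sigmas = np.array([((R[:, j, None] * (X - mus[j])).T @ (X - mus[j])) / Nj[j]
                           + 1e-6 * np.eye(d)       # small ridge keeps Sigma invertible
                           for j in range(k)])
    return pis, mus, Sigmas, R

X = np.vstack([np.random.default_rng(1).normal(0, 1, size=(100, 2)),
               np.random.default_rng(2).normal(5, 1, size=(100, 2))])
pis, mus, Sigmas, R = em_gmm(X, k=2)
```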

Demo: elliptical clusters

[Figure: a sequence of snapshots of E-M run on elliptical clusters, alternating E-steps and M-steps (axes roughly −5 to 15 by −10 to 10).]
