  1. EM and GMM Jia-Bin Huang Virginia Tech Spring 2019 ECE-5424G / CS-5824

  2. Administrative • HW 3 due March 27. • Final project discussion: Link • Final exam date/time • Exam Section: 14M • https://banweb.banner.vt.edu/ssb/prod/hzskexam.P_DispExamInfo • 2:05PM to 4:05PM May 13

  3. J. Mark Sowers Distinguished Lecture • Michael Jordan • Pehong Chen Distinguished Professor Department of Statistics and Electrical Engineering and Computer Sciences • University of California, Berkeley • 3/28/19 • 7:30 PM, McBryde 100

  4. K-means algorithm • Input: • K (number of clusters) • Training set {x^(1), x^(2), x^(3), ⋯, x^(m)} • x^(i) ∈ ℝ^n (note: drop the x_0 = 1 convention) Slide credit: Andrew Ng

  5. K-means algorithm • Randomly initialize K cluster centroids μ_1, μ_2, ⋯, μ_K ∈ ℝ^n • Repeat { • Cluster assignment step: for i = 1 to m, c^(i) := index (from 1 to K) of the cluster centroid closest to x^(i) • Centroid update step: for k = 1 to K, μ_k := average (mean) of the points assigned to cluster k • } Slide credit: Andrew Ng
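
A minimal MATLAB sketch of these two alternating steps (the variable names X, K, mu, c are my own, chosen to match the notation above; this is an illustrative sketch, not the course's reference code):

% Minimal k-means sketch: X is an m-by-n data matrix (one example per row), K is the number of clusters.
m = size(X, 1);
mu = X(randperm(m, K), :);                     % randomly initialize centroids from the data
for iter = 1:100                               % or repeat until the assignments stop changing
    % Cluster assignment step: index (from 1 to K) of the centroid closest to each example
    d = zeros(m, K);
    for k = 1:K
        d(:, k) = sum(bsxfun(@minus, X, mu(k, :)).^2, 2);
    end
    [~, c] = min(d, [], 2);
    % Centroid update step: mean of the points currently assigned to each cluster
    for k = 1:K
        if any(c == k)
            mu(k, :) = mean(X(c == k, :), 1);
        end
    end
end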

  6. K-means optimization objective • c^(i) = index of the cluster (1, 2, …, K) to which example x^(i) is currently assigned (example: if x^(i) is assigned to cluster 5, then c^(i) = 5) • μ_k = cluster centroid k (μ_k ∈ ℝ^n) • μ_{c^(i)} = cluster centroid of the cluster to which example x^(i) has been assigned (in the example above, μ_{c^(i)} = μ_5) • Optimization objective: J(c^(1), ⋯, c^(m), μ_1, ⋯, μ_K) = (1/m) Σ_{i=1}^{m} ‖x^(i) − μ_{c^(i)}‖² • Goal: minimize J(c^(1), ⋯, c^(m), μ_1, ⋯, μ_K) over c^(1), ⋯, c^(m) and μ_1, ⋯, μ_K Slide credit: Andrew Ng
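
Given assignments c and centroids mu (as in the sketch above), the objective J can be evaluated in one line of MATLAB (again a sketch using my own variable names):

% Distortion J(c, mu) = (1/m) * sum_i ||x^(i) - mu_{c^(i)}||^2
J = mean(sum((X - mu(c, :)).^2, 2));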

  7. K-means algorithm • Randomly initialize K cluster centroids μ_1, μ_2, ⋯, μ_K ∈ ℝ^n • Repeat { • Cluster assignment step: for i = 1 to m, c^(i) := index (from 1 to K) of the cluster centroid closest to x^(i); this minimizes J(c^(1), ⋯, c^(m), μ_1, ⋯, μ_K) = (1/m) Σ_{i=1}^{m} ‖x^(i) − μ_{c^(i)}‖² with respect to the assignments while the centroids are held fixed • Centroid update step: for k = 1 to K, μ_k := average (mean) of the points assigned to cluster k; this minimizes the same objective J with respect to the centroids while the assignments are held fixed • } Slide credit: Andrew Ng

  8. Hierarchical Clustering • A hierarchy might be more natural • Different users might care about different levels of granularity or even different prunings of the tree. Slide credit: Maria-Florina Balcan

  9. Hierarchical Clustering • Top-down (divisive) • Partition the data into two groups (e.g., with 2-means) • Recursively cluster each group • Bottom-up (agglomerative) • Start with every point in its own cluster • Repeatedly merge the “closest” two clusters • Different definitions of “closest” give different algorithms. Slide credit: Maria-Florina Balcan

  10. Bottom-up (agglomerative) • Have a distance measure on pairs of objects: d(x, x′) = distance between x and x′ • Single linkage: dist(A, B) = min_{x∈A, x′∈B} d(x, x′) • Complete linkage: dist(A, B) = max_{x∈A, x′∈B} d(x, x′) • Average linkage: dist(A, B) = average_{x∈A, x′∈B} d(x, x′) • Ward’s method: dist(A, B) = (|A||B| / (|A| + |B|)) ‖mean(A) − mean(B)‖² Slide credit: Maria-Florina Balcan

  11. Bottom-up (agglomerative) • Single linkage: dist(A, B) = min_{x∈A, x′∈B} d(x, x′) • At any merge threshold r, the clusters are the connected components of the graph joining points at distance < r, so two points end up in the same cluster exactly when a chain of such short hops connects them • Complete linkage: dist(A, B) = max_{x∈A, x′∈B} d(x, x′) • Keeps the maximum cluster diameter as small as possible at every level • Ward’s method: dist(A, B) = (|A||B| / (|A| + |B|)) ‖mean(A) − mean(B)‖² • Merges the two clusters for which the increase in the k-means cost is as small as possible • Works well in practice Slide credit: Maria-Florina Balcan
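
As an illustration, here is a minimal single-linkage agglomerative sketch in MATLAB (my own code, not the lecture's; in practice the Statistics and Machine Learning Toolbox functions linkage and cluster do this far more efficiently):

% Minimal single-linkage agglomerative clustering sketch: X is m-by-n, Ktarget is the number of clusters to keep.
m = size(X, 1);
clusters = num2cell(1:m);                       % start with every point in its own cluster
while numel(clusters) > Ktarget
    best = inf; bi = 0; bj = 0;
    for i = 1:numel(clusters)
        for j = i+1:numel(clusters)
            A = X(clusters{i}, :);
            B = X(clusters{j}, :);
            % single linkage: minimum pairwise squared distance between the two
            % clusters (same ordering as Euclidean distance)
            D = sum(bsxfun(@minus, permute(A, [1 3 2]), permute(B, [3 1 2])).^2, 3);
            if min(D(:)) < best
                best = min(D(:)); bi = i; bj = j;
            end
        end
    end
    clusters{bi} = [clusters{bi}, clusters{bj}]; % merge the "closest" two clusters
    clusters(bj) = [];
end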

  12. Things to remember • Intro to unsupervised learning • K-means algorithm • Optimization objective • Initialization and the number of clusters • Hierarchical clustering

  13. Today’s Class • Examples of Missing Data Problems • Detecting outliers • Latent topic models • Segmentation • Background • Maximum Likelihood Estimation • Probabilistic Inference • Dealing with “Hidden” Variables • EM algorithm, Mixture of Gaussians • Hard EM

  15. Missing Data Problems: Outliers You want to train an algorithm to predict whether a photograph is attractive. You collect annotations from Mechanical Turk. Some annotators try to give accurate ratings, but others answer randomly. Challenge: Determine which annotators to trust and the average rating given by the accurate annotators. Example annotator ratings: 10, 8, 9, 2, 8. Photo: Jam343 (Flickr)

  16. Missing Data Problems: Object Discovery You have a collection of images and have extracted regions from them. Each region is represented by a histogram of “visual words”. Challenge: Discover frequently occurring object categories without pre-trained appearance models. http://www.robots.ox.ac.uk/~vgg/publications/papers/russell06.pdf

  17. Missing Data Problems: Segmentation You are given an image and want to label each pixel as foreground or background. Challenge: Segment the image into figure and ground without knowing what the foreground looks like in advance. (Figure: example image with its foreground and background regions.)

  18. Missing Data Problems: Segmentation Challenge: Segment the image into figure and ground without knowing what the foreground looks like in advance. Three steps: 1. If we had labels, how could we model the appearance of foreground and background? • Maximum Likelihood Estimation 2. Once we have modeled the fg/bg appearance, how do we compute the likelihood that a pixel is foreground? • Probabilistic Inference 3. How can we get both labels and appearance models at once? • Expectation-Maximization (EM) Algorithm

  19. Maximum Likelihood Estimation 1. If we had labels, how could we model the appearance of foreground and background? (Figure: labeled foreground and background regions.)

  20. Maximum Likelihood Estimation • Data: x = x_1, …, x_N; parameters: θ • θ̂ = argmax_θ p(x | θ) • θ̂ = argmax_θ Π_n p(x_n | θ) (assuming the samples are independent given θ)

  21. Maximum Likelihood Estimation • Data: x = x_1, …, x_N • θ̂ = argmax_θ p(x | θ) = argmax_θ Π_n p(x_n | θ) • Gaussian distribution: p(x_n | μ, σ²) = (1 / √(2πσ²)) exp(−(x_n − μ)² / (2σ²))

  22. Maximum Likelihood Estimation • Gaussian distribution: p(x_n | μ, σ²) = (1 / √(2πσ²)) exp(−(x_n − μ)² / (2σ²)) • Log-likelihood: θ̂ = argmax_θ p(x | θ) = argmax_θ log p(x | θ) = argmax_θ Σ_n log p(x_n | θ) = argmax_θ L(θ) • L(θ) = −(N/2) log 2π − (N/2) log σ² − (1/(2σ²)) Σ_n (x_n − μ)² • ∂L(θ)/∂μ = (1/σ²) Σ_n (x_n − μ) = 0 → μ̂ = (1/N) Σ_n x_n • ∂L(θ)/∂σ = −N/σ + (1/σ³) Σ_n (x_n − μ)² = 0 → σ̂² = (1/N) Σ_n (x_n − μ̂)²

  23. Maximum Likelihood Estimation • Data: x = x_1, …, x_N • θ̂ = argmax_θ p(x | θ) = argmax_θ Π_n p(x_n | θ) • Gaussian distribution: p(x_n | μ, σ²) = (1 / √(2πσ²)) exp(−(x_n − μ)² / (2σ²)) • MLE solutions: μ̂ = (1/N) Σ_n x_n, σ̂² = (1/N) Σ_n (x_n − μ̂)²
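
A quick numerical sanity check of these closed forms (a sketch with made-up parameters); note that the maximum-likelihood variance divides by N, which in MATLAB is var(x, 1), not the default unbiased var(x):

% Sanity check of the Gaussian MLE on synthetic data (made-up parameters: mu = 0.6, sigma = 0.1).
x = 0.6 + 0.1 * randn(1, 100000);
mu_hat     = mean(x);                  % (1/N) * sum_n x_n
sigma2_hat = mean((x - mu_hat).^2);    % (1/N) * sum_n (x_n - mu_hat)^2, equal to var(x, 1)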

  24. Example: MLE • Parameters used to generate the data: fg: mu = 0.6, sigma = 0.1; bg: mu = 0.4, sigma = 0.1 • (Figure: the image im and its binary foreground mask labels.)
>> mu_fg = mean(im(labels))
mu_fg = 0.6012
>> sigma_fg = sqrt(mean((im(labels)-mu_fg).^2))
sigma_fg = 0.1007
>> mu_bg = mean(im(~labels))
mu_bg = 0.4007
>> sigma_bg = sqrt(mean((im(~labels)-mu_bg).^2))
sigma_bg = 0.1007
>> pfg = mean(labels(:));
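
With the four parameters and the prior pfg estimated above, the next step is to evaluate the two fitted Gaussians at every pixel. A sketch of what that might look like (gauss is a hypothetical helper written out explicitly so no toolbox normpdf call is needed):

% Evaluate the fitted foreground/background likelihoods at every pixel of im.
gauss = @(x, mu, sigma) exp(-(x - mu).^2 ./ (2 * sigma.^2)) ./ sqrt(2 * pi * sigma.^2);
px_fg = gauss(im, mu_fg, sigma_fg);    % p(x_n | foreground)
px_bg = gauss(im, mu_bg, sigma_bg);    % p(x_n | background)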

  25. Probabilistic Inference 2. Once we have modeled the fg/bg appearance, how do we compute the likelihood that a pixel is foreground? (Figure: labeled foreground and background regions.)

  26. Probabilistic Inference • Compute the likelihood that a particular model generated a sample (component or label): p(z_n = m | x_n, θ)

  27. Probabilistic Inference • Compute the likelihood that a particular model generated a sample (component or label): p(z_n = m | x_n, θ) = p(z_n = m, x_n | θ) / p(x_n | θ) • Conditional probability: P(A | B) = P(A, B) / P(B)

  28. Probabilistic Inference • Compute the likelihood that a particular model generated a sample (component or label): p(z_n = m | x_n, θ) = p(z_n = m, x_n | θ) / p(x_n | θ) = p(z_n = m, x_n | θ) / Σ_k p(z_n = k, x_n | θ) • Marginalization: P(A) = Σ_k P(A, B = k)
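
For the two-label foreground/background case, the conditional probability and marginalization above reduce to a couple of MATLAB lines (a sketch continuing the variable names px_fg, px_bg, pfg from the MLE example):

% Posterior foreground probability per pixel via Bayes' rule:
% p(z = fg | x) = p(x | fg) p(fg) / ( p(x | fg) p(fg) + p(x | bg) p(bg) )
pz_fg = px_fg * pfg ./ (px_fg * pfg + px_bg * (1 - pfg));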
