CS480/680 Machine Learning
Lecture 12: February 13th, 2020
Expectation-Maximization
Zahra Sheikhbahaee
University of Waterloo, Winter 2020
Outline
• K-means Clustering
• Gaussian Mixture Model
• EM for the Gaussian Mixture Model
• The EM Algorithm
K-means Clustering
• The organization of unlabeled data into similarity groups called clusters.
• A cluster is a collection of data items that are similar to each other and dissimilar to data items in other clusters.
K-means Clustering
• K-means clustering has been used for image segmentation.
• In image segmentation, one partitions an image into regions, each of which has a reasonably homogeneous visual appearance or corresponds to objects or parts of objects.
• In data compression, consider an RGB image with N pixels, where each of the three colour values is stored with 8-bit precision:
  – Transmitting the original image costs $24N$ bits.
  – Transmitting the identity of the nearest centroid for each pixel costs $N \log_2 K$ bits.
  – Transmitting the K centroid vectors costs $24K$ bits.
  – The compressed image therefore costs $24K + N \log_2 K$ bits.
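As a quick worked example of the cost formula (the image size and number of clusters below are made-up values, not from the slides):

```python
import math

N = 240 * 180   # hypothetical number of pixels (assumption for illustration)
K = 3           # hypothetical number of clusters

original_bits = 24 * N                        # 8 bits per R, G, B value for every pixel
compressed_bits = 24 * K + N * math.log2(K)   # K centroid vectors + one cluster index per pixel

print(f"original:   {original_bits} bits")
print(f"compressed: {compressed_bits:.0f} bits")
```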
𝐿 -mean clustering • Let a data set be 𝑦 ) , … , 𝑦 * which is 𝑂 observations of a random 𝐸 - dimensional Euclidean variable 𝒚 . • The 𝐿 -means algorithm partitions the given data into 𝐿 clusters ( 𝐿 is known): – Each cluster has a cluster center, called centroid ( 𝝂 ! where 𝑙 = 1 … 𝐿 ). – The sum of the squares of the distances of each data point to its closest vector 𝝂 ! , is a minimum – Each data point 𝑦 " has a corresponding set of binary indicator variables 𝑠 "! which represent whether data point 𝑦 # belongs to cluster 𝑙 or not 𝑠 "! = 2 1 if 𝑦 # is assigned to cluster 𝑙 0 otherwise University of Waterloo CS480/680 Winter 2020 Zahra Sheikhbahaee 5
𝐿 -mean clustering • Let a data set be 𝑦 ) , … , 𝑦 * which is 𝑂 observations of a random 𝐸 - dimensional Euclidean variable 𝒚 . • The 𝐿 -means algorithm partitions the given data into 𝐿 clusters ( 𝐿 is known): – Each cluster has a cluster center, called centroid ( 𝝂 ! where 𝑙 = 1 … 𝐿 ). – The sum of the squares of the distances of each data point to its closest vector 𝝂 ! , is a minimum – Each data point 𝑦 " has a corresponding set of binary indicator variables 𝑠 "! which represent whether data point 𝑦 " belongs to cluster 𝑙 or not # & 𝑠 "! ∥ 𝑦 " − 𝜈 ! ∥ ' 𝐾 = B B "$% !$% University of Waterloo CS480/680 Winter 2020 Zahra Sheikhbahaee 6
K-means Clustering
Algorithm:
Initialize $\mu_1, \dots, \mu_K$.
Iterate:
• Minimize J with respect to the $r_{nk}$, keeping the $\mu_k$ fixed, by assigning each data point to its closest centroid.
• Minimize J with respect to the $\mu_k$, keeping the $r_{nk}$ fixed, by recomputing the centroids from the current cluster memberships.
Repeat until convergence.
𝐿 -mean clustering Algorithm : Initialize 𝜈 " , … , 𝜈 # Iterations: • We minimize J with respect to the 𝑠 !" , keeping the 𝜈 " fixed by assigning each data point to the closest centroid ∥ 𝑦 ! − 𝜈 # ∥ $ 1 if 𝑙 = arg min 𝑠 !" = ( # 0 otherwise We minimize J with respect to the 𝜈 " , keeping the 𝑠 • !" fixed by recomputing the centroids using the current cluster membership ' 𝜖𝐾 ∑ ! 𝑠 !" 𝑦 ! = −2 ? 𝑠 !" 𝑦 ! − 𝜈 " = 0 → 𝜈 " = ∑ ! 𝑠 𝜖𝜈 " !" !%& Repeat until convergence University of Waterloo CS480/680 Winter 2020 Zahra Sheikhbahaee 8
K-means Clustering (Cons)
• The K-means algorithm may converge to a local rather than a global minimum of J.
• The K-means algorithm is based on squared Euclidean distance as the measure of dissimilarity between a data point and a prototype vector.
• In the K-means algorithm, every data point is assigned uniquely to one, and only one, cluster (a hard assignment to the nearest cluster).
Gaussian Mixture Model
• Let the data set be $\mathbf{X} = \{x_1, \dots, x_N\}$. A Gaussian mixture is a linear superposition of Gaussians,
$$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$$
where $\pi_k$ is the mixing coefficient of component $k$.
• Introduce a K-dimensional binary random variable $\mathbf{z}$ satisfying the condition
$$\sum_{k=1}^{K} z_k = 1, \quad z_k \in \{0, 1\}$$
Gaussian Mixture Model
• The joint distribution of the observed $x$ and the hidden variable $z$ is
$$p(x, z) = p(x \mid z)\, p(z)$$
• The marginal distribution over $z$:
$$p(z_k = 1) = \pi_k, \quad \text{where } 0 \le \pi_k \le 1 \text{ and } \textstyle\sum_k \pi_k = 1$$
$$p(z) = \prod_{k=1}^{K} \pi_k^{z_k} = \mathrm{Cat}(z \mid \boldsymbol{\pi})$$
Gaussian Mixture Model
• The joint distribution of the observed $x$ and the hidden variable $z$ is
$$p(x, z) = p(x \mid z)\, p(z)$$
• The marginal distribution over $z$:
$$p(z) = \prod_{k=1}^{K} \pi_k^{z_k} = \mathrm{Cat}(z \mid \boldsymbol{\pi})$$
• The conditional distribution of $x$ given a particular value of $z$:
$$p(x \mid z_k = 1) = \mathcal{N}(x \mid \mu_k, \Sigma_k)$$
Gaussian Mixture Model
• The joint distribution of the observed $x$ and the hidden variable $z$ is
$$p(x, z) = p(x \mid z)\, p(z)$$
• The marginal distribution over $z$:
$$p(z) = \prod_{k=1}^{K} \pi_k^{z_k} = \mathrm{Cat}(z \mid \boldsymbol{\pi})$$
• The conditional distribution of $x$ given $z$:
$$p(x \mid z) = \prod_{k=1}^{K} \mathcal{N}(x \mid \mu_k, \Sigma_k)^{z_k}$$
Gaussian Mixture Model
• The joint distribution of the observed $x$ and the hidden variable $z$ is
$$p(x, z) = p(x \mid z)\, p(z)$$
• The marginal distribution over $x$:
$$p(x) = \sum_{z} p(x \mid z)\, p(z) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$$
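This latent-variable construction suggests ancestral sampling: draw $z$ from $\mathrm{Cat}(\boldsymbol{\pi})$, then draw $x$ from the selected Gaussian. A small one-dimensional sketch, with made-up parameter values that are not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical 1-D mixture parameters (assumptions for illustration only)
pi = np.array([0.5, 0.3, 0.2])      # mixing coefficients, sum to 1
mu = np.array([-2.0, 0.0, 3.0])     # component means
sigma = np.array([0.5, 1.0, 0.8])   # component standard deviations

# ancestral sampling: z ~ Cat(pi), then x ~ N(mu_z, sigma_z^2)
z = rng.choice(len(pi), size=1000, p=pi)
x = rng.normal(mu[z], sigma[z])
```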
Gaussian Mixture Model
• The posterior distribution of $z$ (often called the responsibility $\gamma(z_k)$):
$$p(z_k = 1 \mid x) = \frac{p(x \mid z_k = 1)\, p(z_k = 1)}{\sum_{j=1}^{K} p(x \mid z_j = 1)\, p(z_j = 1)} = \frac{\pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x \mid \mu_j, \Sigma_j)}$$
• Assume an i.i.d. data set. The log-likelihood function is
$$\ln p(\mathbf{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \ln \prod_{n=1}^{N} \sum_{z_n} p(x_n \mid z_n)\, p(z_n) = \sum_{n=1}^{N} \ln \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)$$
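A sketch of how the responsibilities $\gamma(z_{nk}) = p(z_{nk} = 1 \mid x_n)$ could be evaluated with NumPy and SciPy (the function name and array layout are my own assumptions; the slides only give the formula):

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, pi, mu, Sigma):
    """gamma[n, k] = pi_k N(x_n | mu_k, Sigma_k) / sum_j pi_j N(x_n | mu_j, Sigma_j)."""
    K = len(pi)
    # numerator of Bayes' rule for each component, stacked column-wise into shape (N, K)
    gamma = np.column_stack([
        pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k]) for k in range(K)
    ])
    # normalize each row so the responsibilities for a data point sum to one
    return gamma / gamma.sum(axis=1, keepdims=True)
```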
Gaussian Mixture Model
The log-likelihood function is
$$\ln p(\mathbf{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \sum_{n=1}^{N} \ln \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)$$
• Problems:
  – Singularities: the likelihood can become arbitrarily large when one Gaussian explains a single point, i.e. whenever one of the Gaussian components collapses onto a specific data point.
  – Identifiability: the solution is invariant to permutations of the components; there are K! equivalent solutions, corresponding to the K! ways of assigning K sets of parameters to K components.
  – The objective is non-convex.
Expectation-Maximization for GMM
• Assume we do not have access to the complete data set $\{\mathbf{X}, \mathbf{Z}\}$. The actual observations $\mathbf{X}$ are then considered incomplete data, so we cannot use the complete-data log-likelihood
$$\mathcal{L}(\theta \mid \mathbf{X}, \mathbf{Z}) = \ln p(\mathbf{X}, \mathbf{Z} \mid \theta)$$
• Instead we consider the expected value of the complete-data log-likelihood under the posterior distribution of the latent variables,
$$\mathbb{E}_{p(\mathbf{Z} \mid \mathbf{X}, \theta^{\text{old}})}\!\left[\mathcal{L}(\theta \mid \mathbf{X}, \mathbf{Z})\right] = \sum_{\mathbf{Z}} p(\mathbf{Z} \mid \mathbf{X}, \theta^{\text{old}}) \ln p(\mathbf{X}, \mathbf{Z} \mid \theta)$$
which defines the Expectation step of the EM algorithm. In the Maximization step, we maximize this expectation with respect to $\theta$.
Expectation-Maximization for GMM
Algorithm:
Initialize $\theta^{\text{old}}$.
Iterate:
• E step: evaluate the posterior distribution of the latent variables $\mathbf{Z}$ and compute
$$Q(\theta, \theta^{\text{old}}) = \sum_{\mathbf{Z}} p(\mathbf{Z} \mid \mathbf{X}, \theta^{\text{old}}) \ln p(\mathbf{X}, \mathbf{Z} \mid \theta)$$
• M step: evaluate
$$\theta^{\text{new}} = \arg\max_{\theta} Q(\theta, \theta^{\text{old}})$$
• Check for convergence of either the log-likelihood or the parameter values; otherwise set $\theta^{\text{old}} \leftarrow \theta^{\text{new}}$ and repeat.
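Putting the E step (responsibilities) and M step together for a Gaussian mixture gives the update loop sketched below. This is a minimal illustration, not the lecture's reference implementation: the closed-form M-step updates for $\pi_k$, $\mu_k$, $\Sigma_k$ are the standard maximizers of $Q(\theta, \theta^{\text{old}})$, stated here without derivation, and the small ridge added to each covariance is my own guard against the singularity problem mentioned earlier.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=100, tol=1e-6, seed=0):
    """EM for a Gaussian mixture model; a sketch assuming the standard closed-form M step."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    pi = np.full(K, 1.0 / K)                              # start with equal mixing coefficients
    mu = X[rng.choice(N, size=K, replace=False)]          # initialize means from random data points
    Sigma = np.stack([np.eye(D)] * K)                     # initialize covariances to the identity
    prev_ll = -np.inf
    for _ in range(n_iters):
        # E step: responsibilities gamma[n, k] under the current parameters
        dens = np.column_stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k]) for k in range(K)])
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M step: maximize the expected complete-data log-likelihood
        Nk = gamma.sum(axis=0)                            # effective number of points per component
        pi = Nk / N
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
        # convergence check on the incomplete-data log-likelihood
        ll = np.log(dens.sum(axis=1)).sum()
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return pi, mu, Sigma
```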
Expectation-Maximization for GMM
• The likelihood function of the complete data:
$$p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\pi}) = \prod_{n=1}^{N} \prod_{k=1}^{K} \pi_k^{z_{nk}} \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)^{z_{nk}}$$
• The complete-data log-likelihood:
$$\mathcal{L}(\theta \mid \mathbf{X}, \mathbf{Z}) = \ln p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\pi}) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \left\{ \ln \pi_k + \ln \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right\}$$
• Note that, compared with the incomplete-data log-likelihood, the summation over $k$ and the logarithm have been interchanged.
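For completeness, the expected complete-data log-likelihood that the M step maximizes can be evaluated directly once the responsibilities are known, since $\gamma(z_{nk})$ plays the role of $\mathbb{E}[z_{nk}]$. A small sketch (function name and array layout are my own; it reuses the `responsibilities` output from the earlier snippet):

```python
import numpy as np
from scipy.stats import multivariate_normal

def expected_complete_ll(X, gamma, pi, mu, Sigma):
    """Q(theta, theta_old) = sum_n sum_k gamma_nk { ln pi_k + ln N(x_n | mu_k, Sigma_k) }."""
    K = len(pi)
    # per-point, per-component terms ln pi_k + ln N(x_n | mu_k, Sigma_k), shape (N, K)
    log_terms = np.column_stack([
        np.log(pi[k]) + multivariate_normal.logpdf(X, mean=mu[k], cov=Sigma[k]) for k in range(K)
    ])
    # weight by the responsibilities (posterior expectations of z_nk) and sum over n and k
    return (gamma * log_terms).sum()
```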