Expectation Maximization
Subhransu Maji
CMPSCI 689: Machine Learning
14 April 2015
Motivation
Suppose you are building a naive Bayes spam classifier. After you are done, your boss tells you that there is no money to label the data.
‣ You have a probabilistic model that assumes labelled data, but you don't have any labels. Can you still do something?
Amazingly, you can!
‣ Treat the labels as hidden variables and try to learn them simultaneously along with the parameters of the model
Expectation Maximization (EM)
‣ A broad family of algorithms for solving hidden variable problems
‣ In today's lecture we will derive EM algorithms for clustering and naive Bayes classification and learn why EM works
Gaussian mixture model for clustering
Suppose data comes from a Gaussian Mixture Model (GMM) — you have K clusters and the data from cluster k is drawn from a Gaussian with mean μ_k and variance σ_k².
We will assume that the data comes with labels (we will soon remove this assumption).
Generative story of the data:
‣ For each example n = 1, 2, ..., N
➡ Choose a label y_n ∼ Mult(θ_1, θ_2, ..., θ_K)
➡ Choose an example x_n ∼ N(μ_{y_n}, σ²_{y_n})
Likelihood of the data:
p(\mathcal{D}) = \prod_{n=1}^{N} p(y_n)\, p(x_n \mid y_n) = \prod_{n=1}^{N} \theta_{y_n}\, \mathcal{N}(x_n; \mu_{y_n}, \sigma^2_{y_n})
p(\mathcal{D}) = \prod_{n=1}^{N} \theta_{y_n}\, \left(2\pi\sigma^2_{y_n}\right)^{-D/2} \exp\!\left( -\frac{\|x_n - \mu_{y_n}\|^2}{2\sigma^2_{y_n}} \right)
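To make the generative story concrete, here is a minimal NumPy sketch that samples labelled data from it; the function name and arguments are my own, not from the lecture.

```python
import numpy as np

def sample_gmm(N, theta, mu, sigma2, seed=0):
    """Sample N labelled points from the GMM generative story above.
    theta: (K,) mixing weights, mu: (K, D) means, sigma2: (K,) variances."""
    rng = np.random.default_rng(seed)
    K, D = mu.shape
    y = rng.choice(K, size=N, p=theta)          # y_n ~ Mult(theta_1, ..., theta_K)
    # x_n ~ N(mu_{y_n}, sigma2_{y_n} I), spherical covariance as on the slide
    x = mu[y] + np.sqrt(sigma2[y])[:, None] * rng.standard_normal((N, D))
    return x, y
```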
GMM: known labels
Likelihood of the data:
p(\mathcal{D}) = \prod_{n=1}^{N} \theta_{y_n}\, \left(2\pi\sigma^2_{y_n}\right)^{-D/2} \exp\!\left( -\frac{\|x_n - \mu_{y_n}\|^2}{2\sigma^2_{y_n}} \right)
If you knew the labels y_n then the maximum-likelihood estimates of the parameters are easy:
\theta_k = \frac{1}{N} \sum_n [y_n = k]   // fraction of examples with label k
\mu_k = \frac{\sum_n [y_n = k]\, x_n}{\sum_n [y_n = k]}   // mean of all the examples with label k
\sigma^2_k = \frac{\sum_n [y_n = k]\, \|x_n - \mu_k\|^2}{\sum_n [y_n = k]}   // variance of all the examples with label k
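A short NumPy sketch of these known-label estimates (names are mine); it mirrors the three formulas above.

```python
import numpy as np

def gmm_mle_known_labels(X, y, K):
    """Maximum-likelihood estimates when the labels y_n are observed.
    X: N x D data matrix, y: length-N vector of labels in {0, ..., K-1}."""
    theta = np.array([(y == k).mean() for k in range(K)])            # label fractions
    mu = np.array([X[y == k].mean(axis=0) for k in range(K)])        # per-cluster means
    sigma2 = np.array([((X[y == k] - mu[k]) ** 2).sum(axis=1).mean() # mean squared distance
                       for k in range(K)])
    return theta, mu, sigma2
```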
GMM: unknown labels
Now suppose you didn't have labels y_n. Analogous to k-means, one solution is to iterate. Start by guessing the parameters and then repeat the two steps:
‣ Estimate labels given the parameters
‣ Estimate parameters given the labels
In k-means we assigned each point to a single cluster, also called a hard assignment (point 10 goes to cluster 2).
In expectation maximization (EM) we will use a soft assignment (point 10 goes half to cluster 2 and half to cluster 5).
Let's define a random variable z_n = [z_{n,1}, z_{n,2}, ..., z_{n,K}] to denote the assignment vector for the n-th point.
‣ Hard assignment: only one of the z_{n,k} is 1, the rest are 0
‣ Soft assignment: the z_{n,k} are positive and sum to 1
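As a concrete illustration of the point-10 example above (notation mine, with K = 5 clusters), the two kinds of assignment vectors are

z_{10}^{\text{hard}} = [0,\, 1,\, 0,\, 0,\, 0] \qquad z_{10}^{\text{soft}} = [0,\, 0.5,\, 0,\, 0,\, 0.5]

Both have non-negative entries that sum to 1, but only the soft version splits the point between clusters 2 and 5.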
GMM: parameter estimation
Formally, z_{n,k} is the probability that the n-th point goes to cluster k:
z_{n,k} = p(y_n = k \mid x_n) = \frac{p(y_n = k, x_n)}{p(x_n)} \propto p(y_n = k)\, p(x_n \mid y_n = k) = \theta_k\, \mathcal{N}(x_n; \mu_k, \sigma^2_k)
Given a set of parameters (θ_k, μ_k, σ_k²), z_{n,k} is easy to compute.
Given z_{n,k}, we can update the parameters (θ_k, μ_k, σ_k²) as:
\theta_k = \frac{1}{N} \sum_n z_{n,k}   // fraction of examples with label k
\mu_k = \frac{\sum_n z_{n,k}\, x_n}{\sum_n z_{n,k}}   // mean of all the fractional examples with label k
\sigma^2_k = \frac{\sum_n z_{n,k}\, \|x_n - \mu_k\|^2}{\sum_n z_{n,k}}   // variance of all the fractional examples with label k
GMM: example
We have replaced the indicator variable [y_n = k] with p(y_n = k), which is the expectation of [y_n = k]. This is our guess of the labels.
Just like k-means, EM is susceptible to local minima.
Clustering example: [figures comparing k-means and GMM clustering]
http://nbviewer.ipython.org/github/NICTA/MLSS/tree/master/clustering/
The EM framework
We have data with observations x_n and hidden variables y_n, and would like to estimate parameters θ.
The likelihood of the data and hidden variables:
p(\mathcal{D}) = \prod_n p(x_n, y_n \mid \theta)
Only x_n are known, so we can compute the data likelihood by marginalizing out the y_n:
p(X \mid \theta) = \prod_n \sum_{y_n} p(x_n, y_n \mid \theta)
Parameter estimation by maximizing the log-likelihood:
\theta_{\mathrm{ML}} \leftarrow \arg\max_\theta \sum_n \log\!\left( \sum_{y_n} p(x_n, y_n \mid \theta) \right)
This is hard to maximize since the sum is inside the log.
Jensen's inequality
Given a concave function f and a set of weights λ_i ≥ 0 with ∑_i λ_i = 1,
Jensen's inequality states that f(∑_i λ_i x_i) ≥ ∑_i λ_i f(x_i).
This is a direct consequence of concavity:
‣ f(ax + by) ≥ a f(x) + b f(y) when a ≥ 0, b ≥ 0, a + b = 1
[Figure: a concave curve with a chord from (x, f(x)) to (y, f(y)); the curve value f(ax + by) lies above the chord value a f(x) + b f(y).]
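A small worked instance (not on the slide) with the concave function f = log, which is exactly how the inequality is used on the next slide; take λ = (1/2, 1/2) and x = (1, 4):

\log\!\left( \tfrac{1}{2}\cdot 1 + \tfrac{1}{2}\cdot 4 \right) = \log 2.5 \approx 0.916 \;\;\geq\;\; \tfrac{1}{2}\log 1 + \tfrac{1}{2}\log 4 = \log 2 \approx 0.693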
The EM framework
Construct a lower bound on the log-likelihood using Jensen's inequality:
\mathcal{L}(X \mid \theta) = \sum_n \log\!\left( \sum_{y_n} p(x_n, y_n \mid \theta) \right)
= \sum_n \log\!\left( \sum_{y_n} q(y_n)\, \frac{p(x_n, y_n \mid \theta)}{q(y_n)} \right)
\geq \sum_n \sum_{y_n} q(y_n) \log \frac{p(x_n, y_n \mid \theta)}{q(y_n)}   // Jensen's inequality
= \sum_n \sum_{y_n} \left[ q(y_n) \log p(x_n, y_n \mid \theta) - q(y_n) \log q(y_n) \right]
\triangleq \hat{\mathcal{L}}(X \mid \theta)
Maximize the lower bound (the second term is independent of θ):
\theta \leftarrow \arg\max_\theta \sum_n \sum_{y_n} q(y_n) \log p(x_n, y_n \mid \theta)
Lower bound illustrated
Maximizing the lower bound increases the value of the original function if the lower bound touches the function at the current value.
[Figure: the log-likelihood L(X | θ) and a lower bound L̂(X | θ) that touches it at the current estimate θ_t; maximizing the bound yields θ_{t+1} with L̂(X | θ_{t+1}) ≥ L̂(X | θ_t).]
An optimal lower bound
Any choice of the probability distribution q(y_n) is valid as long as the lower bound touches the function at the current estimate of θ:
\mathcal{L}(X \mid \theta_t) = \hat{\mathcal{L}}(X \mid \theta_t)
We can then pick the optimal q(y_n) by maximizing the lower bound:
\arg\max_q \sum_{y_n} \left[ q(y_n) \log p(x_n, y_n \mid \theta) - q(y_n) \log q(y_n) \right]
This gives us q(y_n) ← p(y_n | x_n, θ_t)
‣ Proof: use Lagrange multipliers with the "sum to one" constraint (sketched below)
This is the distribution of the hidden variables conditioned on the data and the current estimate of the parameters.
‣ This is exactly what we computed in the GMM example
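A brief sketch of the Lagrange-multiplier argument mentioned above (the derivation is not spelled out on the slide); for a single example x_n, maximize over q the Lagrangian

\sum_{y} \left[ q(y) \log p(x_n, y \mid \theta) - q(y) \log q(y) \right] + \lambda \left( \sum_{y} q(y) - 1 \right)

Setting the derivative with respect to q(y) to zero gives

\log p(x_n, y \mid \theta) - \log q(y) - 1 + \lambda = 0 \quad \Rightarrow \quad q(y) \propto p(x_n, y \mid \theta)

and normalizing over y yields

q(y) = \frac{p(x_n, y \mid \theta)}{\sum_{y'} p(x_n, y' \mid \theta)} = p(y \mid x_n, \theta)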
The EM algorithm
We have data with observations x_n and hidden variables y_n, and would like to estimate the parameters θ of the distribution p(x | θ).
EM algorithm:
‣ Initialize the parameters θ randomly
‣ Iterate between the following two steps:
➡ E step: compute the probability distribution over the hidden variables
q(y_n) \leftarrow p(y_n \mid x_n, \theta)
➡ M step: maximize the lower bound
\theta \leftarrow \arg\max_\theta \sum_n \sum_{y_n} q(y_n) \log p(x_n, y_n \mid \theta)
The EM algorithm is a great candidate when the M step can be done easily but p(x | θ) cannot be easily optimized over θ.
‣ E.g., for GMMs it was easy to compute means and variances given the memberships
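Putting the E and M steps together for the spherical GMM from earlier, here is a minimal NumPy sketch (names are mine, not from the lecture); it implements the z_{n,k} computation and parameter updates from the parameter-estimation slide.

```python
import numpy as np

def gmm_em(X, K, n_iters=100, seed=0):
    """EM for a spherical GMM: mixing weights theta_k, means mu_k and a
    scalar variance sigma2_k per cluster, as on the slides."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    theta = np.full(K, 1.0 / K)                   # mixing weights
    mu = X[rng.choice(N, size=K, replace=False)]  # K x D means, random data points
    sigma2 = np.full(K, X.var())                  # scalar variances

    for _ in range(n_iters):
        # E step: z[n, k] proportional to theta_k * N(x_n; mu_k, sigma2_k)
        sq_dist = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)   # N x K
        log_z = (np.log(theta)
                 - 0.5 * D * np.log(2 * np.pi * sigma2)
                 - sq_dist / (2 * sigma2))
        log_z -= log_z.max(axis=1, keepdims=True)  # rescale rows for numerical stability
        z = np.exp(log_z)
        z /= z.sum(axis=1, keepdims=True)          # soft assignments z_{n,k}

        # M step: fractional-count versions of the known-label estimates
        Nk = z.sum(axis=0)                         # effective cluster sizes
        theta = Nk / N
        mu = (z.T @ X) / Nk[:, None]
        sq_dist = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
        sigma2 = (z * sq_dist).sum(axis=0) / Nk    # matches the slide's formula (no /D)

    return theta, mu, sigma2, z
```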
Naive Bayes: revisited
Consider the binary prediction problem.
Let the data be distributed according to a probability distribution:
p_\theta(y, x) = p_\theta(y, x_1, x_2, \ldots, x_D)
We can simplify this using the chain rule of probability:
p_\theta(y, x) = p_\theta(y)\, p_\theta(x_1 \mid y)\, p_\theta(x_2 \mid x_1, y) \cdots p_\theta(x_D \mid x_1, x_2, \ldots, x_{D-1}, y)
= p_\theta(y) \prod_{d=1}^{D} p_\theta(x_d \mid x_1, x_2, \ldots, x_{d-1}, y)
Naive Bayes assumption:
p_\theta(x_d \mid x_{d'}, y) = p_\theta(x_d \mid y), \quad \forall\, d' \neq d
E.g., the words "free" and "money" are independent given spam.
Naive Bayes: a simple case
Case: binary labels and binary features.
p_\theta(y) = \mathrm{Bernoulli}(\theta_0)
p_\theta(x_d \mid y = +1) = \mathrm{Bernoulli}(\theta^+_d)
p_\theta(x_d \mid y = -1) = \mathrm{Bernoulli}(\theta^-_d)
// 1 + 2D parameters in total
Probability of the data:
p_\theta(y, x) = p_\theta(y) \prod_{d=1}^{D} p_\theta(x_d \mid y)
= \theta_0^{[y=+1]} (1 - \theta_0)^{[y=-1]}
\;\times \prod_{d=1}^{D} (\theta^+_d)^{[x_d=1, y=+1]} (1 - \theta^+_d)^{[x_d=0, y=+1]}   // label +1
\;\times \prod_{d=1}^{D} (\theta^-_d)^{[x_d=1, y=-1]} (1 - \theta^-_d)^{[x_d=0, y=-1]}   // label -1
Naive Bayes: parameter estimation
Given data we can estimate the parameters by maximizing the data likelihood.
The maximum likelihood estimates are:
\hat{\theta}_0 = \frac{\sum_n [y_n = +1]}{N}   // fraction of the data with label +1
\hat{\theta}^+_d = \frac{\sum_n [x_{d,n} = 1, y_n = +1]}{\sum_n [y_n = +1]}   // fraction of the instances with x_d = 1 among label +1
\hat{\theta}^-_d = \frac{\sum_n [x_{d,n} = 1, y_n = -1]}{\sum_n [y_n = -1]}   // fraction of the instances with x_d = 1 among label -1
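A minimal NumPy sketch of these supervised estimates (function name is mine); with binary X the per-feature means are exactly the counting formulas above.

```python
import numpy as np

def nb_mle(X, y):
    """Supervised MLE for the binary naive Bayes model.
    X: N x D binary matrix, y: length-N vector of +1 / -1 labels."""
    pos, neg = (y == +1), (y == -1)
    theta0 = pos.mean()                 # fraction of the data with label +1
    theta_pos = X[pos].mean(axis=0)     # p(x_d = 1 | y = +1)
    theta_neg = X[neg].mean(axis=0)     # p(x_d = 1 | y = -1)
    return theta0, theta_pos, theta_neg
```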
Naive Bayes: EM
Now suppose you don't have labels y_n.
Initialize the parameters θ randomly.
E step: compute the distribution over the hidden variables q(y_n):
q(y_n = +1) = p(y_n = +1 \mid x_n, \theta) \propto \theta_0 \prod_{d=1}^{D} (\theta^+_d)^{[x_{d,n}=1]} (1 - \theta^+_d)^{[x_{d,n}=0]}
M step: estimate θ given the guesses:
\theta_0 = \frac{\sum_n q(y_n = +1)}{N}   // expected fraction of the data with label +1
\theta^+_d = \frac{\sum_n [x_{d,n} = 1]\, q(y_n = +1)}{\sum_n q(y_n = +1)}   // fractional count of x_d = 1 among label +1
\theta^-_d = \frac{\sum_n [x_{d,n} = 1]\, q(y_n = -1)}{\sum_n q(y_n = -1)}   // fractional count of x_d = 1 among label -1
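A minimal NumPy sketch of this unsupervised EM loop (names are mine); the E step works in log space for numerical stability and the M step uses the fractional counts above.

```python
import numpy as np

def nb_em(X, n_iters=50, seed=0):
    """EM for the binary naive Bayes model without labels.
    X: N x D binary matrix of features."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    theta0 = 0.5                             # p(y = +1)
    theta_pos = rng.uniform(0.25, 0.75, D)   # p(x_d = 1 | y = +1), random init
    theta_neg = rng.uniform(0.25, 0.75, D)   # p(x_d = 1 | y = -1), random init

    for _ in range(n_iters):
        # E step: q_n = p(y_n = +1 | x_n, theta), computed via log probabilities
        log_pos = (np.log(theta0)
                   + X @ np.log(theta_pos) + (1 - X) @ np.log(1 - theta_pos))
        log_neg = (np.log(1 - theta0)
                   + X @ np.log(theta_neg) + (1 - X) @ np.log(1 - theta_neg))
        q = 1.0 / (1.0 + np.exp(log_neg - log_pos))

        # M step: fractional-count versions of the supervised estimates
        theta0 = q.mean()
        theta_pos = (q @ X) / q.sum()
        theta_neg = ((1 - q) @ X) / (1 - q).sum()
        # keep probabilities away from 0/1 so the logs stay finite
        theta_pos = np.clip(theta_pos, 1e-6, 1 - 1e-6)
        theta_neg = np.clip(theta_neg, 1e-6, 1 - 1e-6)

    return theta0, theta_pos, theta_neg, q
```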