COMS 4721: Machine Learning for Data Science Lecture 15, 3/23/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University
MAXIMUM LIKELIHOOD
APPROACHES TO DATA MODELING
Our approaches to modeling data thus far have been either probabilistic or non-probabilistic in motivation.
- Probabilistic models: Probability distributions defined on data, e.g.,
  1. Bayes classifiers
  2. Logistic regression
  3. Least squares and ridge regression (using ML and MAP interpretation)
  4. Bayesian linear regression
- Non-probabilistic models: No probability distributions involved, e.g.,
  1. Perceptron
  2. Support vector machine
  3. Decision trees
  4. K-means
In every case, we have some objective function we are trying to optimize (greedily vs non-greedily, locally vs globally).
MAXIMUM LIKELIHOOD
As we've seen, one probabilistic objective function is maximum likelihood.
Setup: In the most basic scenario, we start with
1. some set of model parameters $\theta$
2. a set of data $\{x_1, \dots, x_n\}$
3. a probability distribution $p(x \mid \theta)$
4. an i.i.d. assumption, $x_i \overset{iid}{\sim} p(x \mid \theta)$
Maximum likelihood seeks the $\theta$ that maximizes the likelihood
$$\theta_{ML} = \arg\max_{\theta} \, p(x_1, \dots, x_n \mid \theta) \overset{(a)}{=} \arg\max_{\theta} \prod_{i=1}^{n} p(x_i \mid \theta) \overset{(b)}{=} \arg\max_{\theta} \sum_{i=1}^{n} \ln p(x_i \mid \theta)$$
(a) follows from the i.i.d. assumption.
(b) follows since $f(y) > f(x) \Rightarrow \ln f(y) > \ln f(x)$.
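Step (b) above can be checked numerically. The sketch below is only an illustration under assumptions not in the lecture: it uses a Bernoulli model, made-up data, and a grid search over $\theta$ (the names `likelihood`, `log_likelihood`, and `grid` are hypothetical). It shows that the product form and the sum-of-logs form pick the same maximizer.

```python
import math

# Hypothetical data: 10 Bernoulli draws (1 = success). Not from the lecture.
x = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]

def likelihood(theta, data):
    """Product form: p(x_1, ..., x_n | theta) under the i.i.d. assumption."""
    out = 1.0
    for xi in data:
        out *= theta if xi == 1 else (1.0 - theta)
    return out

def log_likelihood(theta, data):
    """Sum form: sum_i ln p(x_i | theta)."""
    return sum(math.log(theta if xi == 1 else 1.0 - theta) for xi in data)

# Candidate values of theta on a grid that avoids 0 and 1, where ln is undefined.
grid = [k / 100 for k in range(1, 100)]

# The logarithm is monotone, so both objectives share the same arg max.
theta_from_product = max(grid, key=lambda t: likelihood(t, x))
theta_from_log_sum = max(grid, key=lambda t: log_likelihood(t, x))
```

For this assumed model the maximizer coincides with the sample mean of the data, which is the closed-form Bernoulli maximum likelihood estimate.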
MAXIMUM LIKELIHOOD
We've discussed maximum likelihood for a few models, e.g., least squares linear regression and the Bayes classifier. Both of these models were "nice" because we could find their respective $\theta_{ML}$ analytically by writing an equation and plugging in data to solve.
Gaussian with unknown mean and covariance
In the first lecture, we saw that if $x_i \overset{iid}{\sim} N(\mu, \Sigma)$, where $\theta = \{\mu, \Sigma\}$, then
$$\nabla_{\theta} \sum_{i=1}^{n} \ln p(x_i \mid \theta) = 0$$
gives the following maximum likelihood values for $\mu$ and $\Sigma$:
$$\mu_{ML} = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad \Sigma_{ML} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu_{ML})(x_i - \mu_{ML})^T$$
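The two closed-form estimates above translate directly into code. This is a minimal sketch using plain Python lists; the function name `gaussian_mle` and the three data points are made up for illustration. Note the $1/n$ normalization, as in the formula for $\Sigma_{ML}$, rather than the $1/(n-1)$ of the unbiased sample covariance.

```python
def gaussian_mle(data):
    """Return (mu_ML, Sigma_ML) for a list of d-dimensional points."""
    n = len(data)
    d = len(data[0])
    # mu_ML = (1/n) * sum_i x_i
    mu = [sum(x[j] for x in data) / n for j in range(d)]
    # Sigma_ML = (1/n) * sum_i (x_i - mu)(x_i - mu)^T, built entry by entry
    sigma = [
        [sum((x[j] - mu[j]) * (x[k] - mu[k]) for x in data) / n for k in range(d)]
        for j in range(d)
    ]
    return mu, sigma

# Hypothetical 2-D data set.
data = [[1.0, 2.0], [3.0, 0.0], [5.0, 4.0]]
mu_ml, sigma_ml = gaussian_mle(data)
```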
COORDINATE ASCENT AND MAXIMUM LIKELIHOOD
In more complicated models, we might split the parameters into groups $\theta_1, \theta_2$ and try to maximize the likelihood over both of these,
$$\theta_{1,ML}, \theta_{2,ML} = \arg\max_{\theta_1, \theta_2} \sum_{i=1}^{n} \ln p(x_i \mid \theta_1, \theta_2).$$
Although we can solve for one given the other, we can't solve for both simultaneously.
Coordinate ascent (probabilistic version)
We saw how K-means presented a similar situation, and that we could optimize using coordinate ascent. This technique is generalizable.
Algorithm: For iteration $t = 1, 2, \dots$,
1. Optimize $\theta_1^{(t)} = \arg\max_{\theta_1} \sum_{i=1}^{n} \ln p(x_i \mid \theta_1, \theta_2^{(t-1)})$
2. Optimize $\theta_2^{(t)} = \arg\max_{\theta_2} \sum_{i=1}^{n} \ln p(x_i \mid \theta_1^{(t)}, \theta_2)$
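The update pattern of the algorithm above can be sketched as follows. The objective here is an assumption made purely for illustration: a toy concave quadratic standing in for $\sum_i \ln p(x_i \mid \theta_1, \theta_2)$, chosen because each one-variable maximization has a closed form. (This toy could also be solved jointly; it is used only to show the alternating updates.) All function names are hypothetical.

```python
def objective(t1, t2):
    """Toy concave stand-in for sum_i ln p(x_i | theta1, theta2)."""
    return -(t1 ** 2 + t2 ** 2 + t1 * t2) + 3.0 * t1 + 3.0 * t2

def best_theta1(t2):
    # Maximize over theta1 with theta2 fixed: solve -2*t1 - t2 + 3 = 0.
    return (3.0 - t2) / 2.0

def best_theta2(t1):
    # Maximize over theta2 with theta1 fixed: solve -2*t2 - t1 + 3 = 0.
    return (3.0 - t1) / 2.0

theta1, theta2 = 0.0, 0.0            # arbitrary initialization
history = [objective(theta1, theta2)]
for t in range(1, 31):
    theta1 = best_theta1(theta2)     # step 1: uses theta2 from iteration t-1
    theta2 = best_theta2(theta1)     # step 2: uses the fresh theta1
    history.append(objective(theta1, theta2))
```

Each step maximizes the objective over one group with the other held fixed, so the recorded objective values never decrease.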
COORDINATE ASCENT AND MAXIMUM LIKELIHOOD
There is a third, subtly different situation, where we really want to find
$$\theta_{1,ML} = \arg\max_{\theta_1} \sum_{i=1}^{n} \ln p(x_i \mid \theta_1).$$
Except this function is "tricky" to optimize directly. However, we figure out that we can add a second variable $\theta_2$ such that
$$\sum_{i=1}^{n} \ln p(x_i, \theta_2 \mid \theta_1) \qquad \text{(Function 2)}$$
is easier to work with. We'll make this clearer later.
- Notice that in Function 2, $\theta_2$ is on the left side of the conditioning bar. This implies a prior on $\theta_2$ (whatever "$\theta_2$" turns out to be).
- We will next discuss a fundamental technique called the EM algorithm for finding $\theta_{1,ML}$ by using Function 2 instead.
EXPECTATION-MAXIMIZATION ALGORITHM
A MOTIVATING EXAMPLE
Let $x_i \in \mathbb{R}^d$ be a vector with missing data. Split this vector into two parts:
1. $x_i^o$: observed portion (the sub-vector of $x_i$ that is measured)
2. $x_i^m$: missing portion (the sub-vector of $x_i$ that is still unknown)
The missing dimensions can be different for different $x_i$.
We assume that $x_i \overset{iid}{\sim} N(\mu, \Sigma)$, and want to solve
$$\mu_{ML}, \Sigma_{ML} = \arg\max_{\mu, \Sigma} \sum_{i=1}^{n} \ln p(x_i^o \mid \mu, \Sigma).$$
This is tricky. However, if we knew $x_i^m$ (and therefore $x_i$), then
$$\mu_{ML}, \Sigma_{ML} = \arg\max_{\mu, \Sigma} \sum_{i=1}^{n} \ln \underbrace{p(x_i^o, x_i^m \mid \mu, \Sigma)}_{= \, p(x_i \mid \mu, \Sigma)}$$
is very easy to optimize (we just did it on a previous slide).
CONNECTING TO A MORE GENERAL SETUP
We will discuss a method for optimizing $\sum_{i=1}^{n} \ln p(x_i^o \mid \mu, \Sigma)$ and imputing its missing values $\{x_1^m, \dots, x_n^m\}$. This is a very general technique.
General setup
Imagine we have two parameter sets $\theta_1, \theta_2$, where
$$p(x \mid \theta_1) = \int p(x, \theta_2 \mid \theta_1) \, d\theta_2 \qquad \text{(marginal distribution)}$$
Example: For the previous example we can show that
$$p(x_i^o \mid \mu, \Sigma) = \int p(x_i^o, x_i^m \mid \mu, \Sigma) \, dx_i^m = N(\mu_i^o, \Sigma_i^o),$$
where $\mu_i^o$ and $\Sigma_i^o$ are the sub-vector/sub-matrix of $\mu$ and $\Sigma$ defined by $x_i^o$.
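Picking out $\mu_i^o$ and $\Sigma_i^o$ amounts to indexing $\mu$ and $\Sigma$ by the observed dimensions. A minimal sketch, assuming the observed dimensions of $x_i$ are given as a list of zero-based indices (the helper name `observed_marginal` and the numbers are hypothetical):

```python
def observed_marginal(mu, sigma, observed_idx):
    """Parameters of the Gaussian marginal over the observed dimensions.

    mu_o keeps the entries of mu at observed_idx; sigma_o keeps the rows
    and columns of sigma at observed_idx.
    """
    mu_o = [mu[j] for j in observed_idx]
    sigma_o = [[sigma[j][k] for k in observed_idx] for j in observed_idx]
    return mu_o, sigma_o

# Hypothetical 3-D parameters; dimension 1 is missing for this x_i.
mu = [0.0, 1.0, 2.0]
sigma = [
    [2.0, 0.5, 0.1],
    [0.5, 1.0, 0.3],
    [0.1, 0.3, 1.5],
]
mu_o, sigma_o = observed_marginal(mu, sigma, [0, 2])
```

Because the missing dimensions can differ across data points, `observed_idx` would be chosen per $x_i$.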
THE EM OBJECTIVE FUNCTION
We need to define a general objective function that gives us what we want:
1. It lets us optimize the marginal $p(x \mid \theta_1)$ over $\theta_1$,
2. It uses $p(x, \theta_2 \mid \theta_1)$ in doing so purely for computational convenience.
The EM objective function
Before picking it apart, we claim that this objective function is
$$\ln p(x \mid \theta_1) = \int q(\theta_2) \ln \frac{p(x, \theta_2 \mid \theta_1)}{q(\theta_2)} \, d\theta_2 + \int q(\theta_2) \ln \frac{q(\theta_2)}{p(\theta_2 \mid x, \theta_1)} \, d\theta_2$$
Some immediate comments:
- $q(\theta_2)$ is any probability distribution (assumed continuous for now).
- We assume we know $p(\theta_2 \mid x, \theta_1)$. That is, given the data $x$ and fixed values for $\theta_1$, we can solve for the conditional posterior distribution of $\theta_2$.
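DERIVING THE EM OBJECTIVE FUNCTION
Let's show that this equality is actually true:
$$\begin{aligned}
\ln p(x \mid \theta_1) &= \int q(\theta_2) \ln \frac{p(x, \theta_2 \mid \theta_1)}{q(\theta_2)} \, d\theta_2 + \int q(\theta_2) \ln \frac{q(\theta_2)}{p(\theta_2 \mid x, \theta_1)} \, d\theta_2 \\
&= \int q(\theta_2) \ln \frac{p(x, \theta_2 \mid \theta_1) \, q(\theta_2)}{p(\theta_2 \mid x, \theta_1) \, q(\theta_2)} \, d\theta_2
\end{aligned}$$
Remember some rules of probability:
$$p(a, b \mid c) = p(a \mid b, c) \, p(b \mid c) \;\Rightarrow\; p(b \mid c) = \frac{p(a, b \mid c)}{p(a \mid b, c)}.$$
Letting $a = \theta_2$, $b = x$ and $c = \theta_1$, the ratio inside the logarithm reduces to $p(x \mid \theta_1)$. Since $q$ integrates to one, we conclude
$$\ln p(x \mid \theta_1) = \int q(\theta_2) \ln p(x \mid \theta_1) \, d\theta_2 = \ln p(x \mid \theta_1).$$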
THE EM OBJECTIVE FUNCTION
The EM objective function splits our desired objective into two terms:
$$\ln p(x \mid \theta_1) = \underbrace{\int q(\theta_2) \ln \frac{p(x, \theta_2 \mid \theta_1)}{q(\theta_2)} \, d\theta_2}_{\text{a function only of } \theta_1 \text{, we'll call it } \mathcal{L}} + \underbrace{\int q(\theta_2) \ln \frac{q(\theta_2)}{p(\theta_2 \mid x, \theta_1)} \, d\theta_2}_{\text{Kullback-Leibler divergence}}$$
Some more observations about the right-hand side:
1. The KL divergence is always $\geq 0$ and only $= 0$ when $q = p$.
2. We are assuming that the integral in $\mathcal{L}$ can be calculated, leaving a function only of $\theta_1$ (for a particular setting of the distribution $q$).
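The decomposition can be checked numerically on a small case. The sketch below makes two assumptions that are not in the lecture: $\theta_2$ is discrete with three values, so the integrals become sums, and the joint probabilities $p(x, \theta_2 = k \mid \theta_1)$ for the one observed $x$ are simply made-up numbers.

```python
import math

# Hypothetical values of p(x, theta2 = k | theta1) for a fixed observed x.
joint = [0.10, 0.05, 0.15]

marginal = sum(joint)                        # p(x | theta1)
posterior = [j / marginal for j in joint]    # p(theta2 = k | x, theta1)

q = [0.2, 0.5, 0.3]                          # any distribution over theta2

# L = sum_k q_k * ln( p(x, theta2 = k | theta1) / q_k )
L = sum(qk * math.log(jk / qk) for qk, jk in zip(q, joint))

# KL(q || p) = sum_k q_k * ln( q_k / p(theta2 = k | x, theta1) )
KL = sum(qk * math.log(qk / pk) for qk, pk in zip(q, posterior))

# With q equal to the posterior, every log ratio is ln(1) = 0.
KL_at_posterior = sum(pk * math.log(pk / pk) for pk in posterior)
```

Here `L + KL` matches `math.log(marginal)` up to floating-point error, `KL` is strictly positive because `q` differs from the posterior, and `KL_at_posterior` is zero.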
BIGGER PICTURE
Q: What does it mean to iteratively optimize $\ln p(x \mid \theta_1)$ w.r.t. $\theta_1$?
A: One way to think about it is that we want a method for generating:
1. A sequence of values for $\theta_1$ such that $\ln p(x \mid \theta_1^{(t)}) \geq \ln p(x \mid \theta_1^{(t-1)})$.
2. A sequence $\theta_1^{(t)}$ that converges to a local maximum of $\ln p(x \mid \theta_1)$.
It doesn't matter how we generate the sequence $\theta_1^{(1)}, \theta_1^{(2)}, \theta_1^{(3)}, \dots$
We will show that EM satisfies #1 and just mention that it satisfies #2.
THE EM ALGORITHM
The EM objective function
$$\ln p(x \mid \theta_1) = \underbrace{\int q(\theta_2) \ln \frac{p(x, \theta_2 \mid \theta_1)}{q(\theta_2)} \, d\theta_2}_{\text{define this to be } \mathcal{L}(x, \theta_1)} + \underbrace{\int q(\theta_2) \ln \frac{q(\theta_2)}{p(\theta_2 \mid x, \theta_1)} \, d\theta_2}_{\text{Kullback-Leibler divergence}}$$
Definition: The EM algorithm
Given the value $\theta_1^{(t)}$, find the value $\theta_1^{(t+1)}$ as follows:
E-step: Set $q_t(\theta_2) = p(\theta_2 \mid x, \theta_1^{(t)})$ and calculate
$$\mathcal{L}_{q_t}(x, \theta_1) = \int q_t(\theta_2) \ln p(x, \theta_2 \mid \theta_1) \, d\theta_2 - \underbrace{\int q_t(\theta_2) \ln q_t(\theta_2) \, d\theta_2}_{\text{can ignore this term}}.$$
M-step: Set $\theta_1^{(t+1)} = \arg\max_{\theta_1} \mathcal{L}_{q_t}(x, \theta_1)$.
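To make the E-step and M-step concrete, here is a sketch on a model that is not from the lecture and is assumed only for illustration: each data point is the number of heads in 10 flips of one of two coins, chosen with equal probability. Here $\theta_1$ is the pair of coin biases and $\theta_2$ is the (discrete) identity of the coin behind each data point, so the integrals become sums. The data and starting values are made up.

```python
import math

# Hypothetical data: number of heads in each of 5 sets of 10 flips.
heads = [5, 9, 8, 4, 7]
flips = 10

def set_likelihood(h, p):
    """p(h heads | bias p) without the binomial coefficient, which cancels
    in the posterior and is a constant offset in the log marginal."""
    return p ** h * (1.0 - p) ** (flips - h)

def log_marginal(theta):
    """ln p(x | theta1), up to the constant from the dropped coefficients."""
    return sum(
        math.log(0.5 * set_likelihood(h, theta[0]) + 0.5 * set_likelihood(h, theta[1]))
        for h in heads
    )

def e_step(theta):
    """q_t(theta2) = p(theta2 | x, theta1^(t)): posterior over the coin identity."""
    q = []
    for h in heads:
        a = 0.5 * set_likelihood(h, theta[0])
        b = 0.5 * set_likelihood(h, theta[1])
        q.append((a / (a + b), b / (a + b)))
    return q

def m_step(q):
    """arg max over theta1 of sum_i sum_k q_ik * ln p(x_i, theta2 = k | theta1)."""
    new_theta = []
    for k in range(2):
        expected_heads = sum(qi[k] * h for qi, h in zip(q, heads))
        expected_flips = sum(qi[k] * flips for qi in q)
        new_theta.append(expected_heads / expected_flips)
    return tuple(new_theta)

theta = (0.6, 0.5)                   # arbitrary initialization theta1^(0)
trace = [log_marginal(theta)]
for t in range(20):
    q = e_step(theta)
    theta = m_step(q)
    trace.append(log_marginal(theta))
```

The M-step here ignores the $\int q_t \ln q_t$ term, since it does not depend on $\theta_1$. The list `trace` records $\ln p(x \mid \theta_1^{(t)})$ (up to a constant) at each iteration.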
PROOF OF MONOTONIC IMPROVEMENT
Once we're comfortable with the moving parts, the proof that the sequence $\theta_1^{(t)}$ monotonically improves $\ln p(x \mid \theta_1)$ just requires analysis:
$$\begin{aligned}
\ln p(x \mid \theta_1^{(t)}) &= \mathcal{L}(x, \theta_1^{(t)}) + \underbrace{KL\big(q(\theta_2) \,\|\, p(\theta_2 \mid x, \theta_1^{(t)})\big)}_{= \, 0 \text{ by setting } q = p} \\
&= \mathcal{L}_{q_t}(x, \theta_1^{(t)}) && \leftarrow \text{E-step} \\
&\leq \mathcal{L}_{q_t}(x, \theta_1^{(t+1)}) && \leftarrow \text{M-step} \\
&\leq \mathcal{L}_{q_t}(x, \theta_1^{(t+1)}) + \underbrace{KL\big(q_t(\theta_2) \,\|\, p(\theta_2 \mid x, \theta_1^{(t+1)})\big)}_{\geq \, 0 \text{, and } > 0 \text{ when } q_t \neq p} \\
&= \mathcal{L}(x, \theta_1^{(t+1)}) + KL\big(q(\theta_2) \,\|\, p(\theta_2 \mid x, \theta_1^{(t+1)})\big) \\
&= \ln p(x \mid \theta_1^{(t+1)})
\end{aligned}$$
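ONE ITERATION OF EM
Start: Current setting of $\theta_1$ and $q(\theta_2)$.
For reference:
$$\ln p(x \mid \theta_1) = \mathcal{L} + KL$$
$$\mathcal{L} = \int q(\theta_2) \ln \frac{p(x, \theta_2 \mid \theta_1)}{q(\theta_2)} \, d\theta_2, \qquad KL = \int q(\theta_2) \ln \frac{q(\theta_2)}{p(\theta_2 \mid x, \theta_1)} \, d\theta_2$$
[Figure: vertical axis starting at some arbitrary point < 0, with levels marked $\mathcal{L}$ and $\ln p(X \mid \theta_1)$; the gap between them is labeled $KL(q \| p)$.]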
ONE ITERATION OF EM
E-step: Set $q(\theta_2) = p(\theta_2 \mid x, \theta_1)$ and update $\mathcal{L}$.
For reference: $\ln p(x \mid \theta_1) = \mathcal{L} + KL$, with $\mathcal{L}$ and $KL$ defined as on the previous slide.
[Figure: same axis, starting at some arbitrary point < 0; now $KL(q \| p) = 0$, so the level of $\mathcal{L}$ coincides with $\ln p(X \mid \theta_1)$.]
ONE ITERATION OF EM
M-step: Maximize $\mathcal{L}$ w.r.t. $\theta_1$. Now $q \neq p$.
For reference: $\ln p(x \mid \theta_1) = \mathcal{L} + KL$, with $\mathcal{L}$ and $KL$ defined as on the previous slides.
[Figure: same axis, starting at some arbitrary point < 0; levels marked $\mathcal{L}$ and $\ln p(X \mid \theta_1^{up})$ for the updated $\theta_1^{up}$, with the gap between them again labeled $KL(q \| p)$.]
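The three pictures (start, after the E-step, after the M-step) can be reproduced numerically by tracking $\mathcal{L}$, $KL$, and $\ln p(x \mid \theta_1)$ through one iteration. The model below is an assumption for illustration only: a two-component 1-D Gaussian mixture with equal weights and unit variances, where $\theta_1$ is the pair of means and $\theta_2$ is the discrete component label of each point. Data and starting values are made up.

```python
import math

x = [-2.1, -1.9, -2.0, 2.0, 2.2, 1.8]    # hypothetical 1-D data

def joint(xi, k, mu):
    """p(x_i, theta2_i = k | theta1 = mu): weight 0.5, unit variance."""
    return 0.5 * math.exp(-0.5 * (xi - mu[k]) ** 2) / math.sqrt(2.0 * math.pi)

def log_marginal(mu):
    return sum(math.log(joint(xi, 0, mu) + joint(xi, 1, mu)) for xi in x)

def posterior(mu):
    out = []
    for xi in x:
        a, b = joint(xi, 0, mu), joint(xi, 1, mu)
        out.append((a / (a + b), b / (a + b)))
    return out

def lower_bound(q, mu):
    """L = sum_i sum_k q_ik * ln( p(x_i, k | mu) / q_ik )."""
    return sum(qi[k] * math.log(joint(xi, k, mu) / qi[k])
               for qi, xi in zip(q, x) for k in range(2))

def kl(q, mu):
    """KL(q || p) = sum_i sum_k q_ik * ln( q_ik / p(k | x_i, mu) )."""
    post = posterior(mu)
    return sum(qi[k] * math.log(qi[k] / ri[k])
               for qi, ri in zip(q, post) for k in range(2))

# Start: arbitrary theta1 and arbitrary q.
mu = (0.0, 1.0)
q = [(0.5, 0.5)] * len(x)
start = (lower_bound(q, mu), kl(q, mu), log_marginal(mu))

# E-step: q becomes the posterior, so KL drops to 0 and L rises to ln p.
q = posterior(mu)
after_e = (lower_bound(q, mu), kl(q, mu), log_marginal(mu))

# M-step: maximize L over the means with q fixed (responsibility-weighted averages).
mu = tuple(sum(qi[k] * xi for qi, xi in zip(q, x)) / sum(qi[k] for qi in q)
           for k in range(2))
after_m = (lower_bound(q, mu), kl(q, mu), log_marginal(mu))
```

Each tuple holds $(\mathcal{L}, KL, \ln p(x \mid \theta_1))$. At every stage the first two entries sum to the third; the E-step changes only $q$, so $\ln p$ is unchanged while $KL$ goes to zero; the M-step raises $\mathcal{L}$ and therefore $\ln p$.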
EM FOR MISSING DATA