Notes on Neal and Hinton’s Generalized Expectation Maximization (GEM) Algorithm
Mark Johnson
Brown University
February 2005, updated November 2008
Talk overview
• What kinds of problems does expectation maximization solve?
• An example of EM
• Relaxation, and proving that EM converges
• Sufficient statistics and EM
• The generalized EM algorithm
Hidden Markov Models

[Diagram: a chain of hidden states y_0 → y_1 → … → y_4 (e.g., parts of speech), where each state y_i emits an observation x_i (e.g., words)]

  P(Y, X | θ) = Π_{i=1}^n P(Y_i | Y_{i−1}, θ) P(X_i | Y_i, θ)
  P(y_i | y_{i−1}, θ) = θ_{y_i, y_{i−1}}
  P(x_i | y_i, θ) = θ_{x_i, y_i}
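As a concrete reference point for the later slides, here is a minimal Python sketch of this joint probability for a toy two-state tagger. The state names ("D", "N"), the words, the start symbol "<s>" standing in for y_0, and all parameter values are made up purely for illustration.

```python
# theta_{y_i, y_{i-1}}: P(y_i | y_{i-1}), indexed as trans[prev][next]
trans = {"<s>": {"D": 0.9, "N": 0.1},
         "D":   {"D": 0.1, "N": 0.9},
         "N":   {"D": 0.3, "N": 0.7}}
# theta_{x_i, y_i}: P(x_i | y_i), indexed as emit[state][word]
emit = {"D": {"the": 0.8, "dog": 0.1, "runs": 0.1},
        "N": {"the": 0.1, "dog": 0.5, "runs": 0.4}}

def joint_prob(y, x):
    """P(y, x | theta) = prod_i P(y_i | y_{i-1}, theta) P(x_i | y_i, theta)."""
    p, prev = 1.0, "<s>"
    for yi, xi in zip(y, x):
        p *= trans[prev][yi] * emit[yi][xi]
        prev = yi
    return p

print(joint_prob(["D", "N"], ["the", "dog"]))   # 0.9 * 0.8 * 0.9 * 0.5 = 0.324
```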
Maximum likelihood estimation
• Given visible data (y, x), how can we estimate θ?
• Maximum likelihood principle: θ̂ = argmax_θ L_{(y,x)}(θ), where:
  L_{(y,x)}(θ) = log P_θ(y, x) = log P(y, x | θ)
• For an HMM, these estimates are simple to calculate:
  θ̂_{y_i,y_j} = n_{y_i,y_j}(y, x) / Σ_{y′_i} n_{y′_i,y_j}(y, x)
  θ̂_{x_i,y_i} = n_{x_i,y_i}(y, x) / Σ_{x′_i} n_{x′_i,y_i}(y, x)
  (n_{y_i,y_j}(y, x) counts how often state y_i follows y_j in (y, x); n_{x_i,y_i}(y, x) counts how often state y_i emits x_i)
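The count-and-normalise estimator can be written down directly. This is a sketch assuming fully labelled sequences; the tiny corpus and the start symbol "<s>" are invented for illustration.

```python
from collections import Counter

# (labels y, words x) pairs; a made-up labelled corpus
corpus = [(["D", "N"], ["the", "dog"]),
          (["D", "N", "N"], ["the", "dog", "runs"])]

trans_counts = Counter()   # n_{y_i, y_j}(y, x): transitions y_j -> y_i (y_0 = "<s>")
emit_counts = Counter()    # n_{x_i, y_i}(y, x): state y_i emits word x_i

for y, x in corpus:
    prev = "<s>"
    for yi, xi in zip(y, x):
        trans_counts[(yi, prev)] += 1
        emit_counts[(xi, yi)] += 1
        prev = yi

def normalise(counts):
    """theta_{a,b} = n_{a,b} / sum_{a'} n_{a',b}: normalise over the first slot."""
    totals = Counter()
    for (a, b), n in counts.items():
        totals[b] += n
    return {(a, b): n / totals[b] for (a, b), n in counts.items()}

theta_trans = normalise(trans_counts)   # estimates of P(y | y')
theta_emit = normalise(emit_counts)     # estimates of P(x | y)
print(theta_trans[("N", "D")], theta_emit[("dog", "N")])   # 1.0 and 2/3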
ML estimation from hidden data
• Our model defines P(Y, X), but our data only contains values for X, i.e., the variable Y is hidden
  – HMM example: D only contains words x but not their labels y
• The maximum likelihood principle still applies: θ̂ = argmax_θ L_x(θ), where:
  L_x(θ) = log P(x | θ) = log Σ_{y∈Y} P(y, x | θ)
• But maximizing L_x(θ) may now be a non-trivial problem!
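To see why L_x(θ) is harder to work with, here is a brute-force sketch that sums P(y, x | θ) over every possible label sequence. The sum has |Y|^n terms, so this is only feasible for toy inputs (for HMMs the forward algorithm computes the same quantity efficiently). The parameters are the same made-up values as in the earlier sketch.

```python
import math
from itertools import product

states = ["D", "N"]
trans = {"<s>": {"D": 0.9, "N": 0.1},
         "D":   {"D": 0.1, "N": 0.9},
         "N":   {"D": 0.3, "N": 0.7}}
emit = {"D": {"the": 0.8, "dog": 0.1, "runs": 0.1},
        "N": {"the": 0.1, "dog": 0.5, "runs": 0.4}}

def joint_prob(y, x):
    p, prev = 1.0, "<s>"
    for yi, xi in zip(y, x):
        p *= trans[prev][yi] * emit[yi][xi]
        prev = yi
    return p

def log_marginal(x):
    """L_x(theta) = log sum_{y in Y^n} P(y, x | theta), by explicit enumeration."""
    return math.log(sum(joint_prob(y, x)
                        for y in product(states, repeat=len(x))))

print(log_marginal(["the", "dog", "runs"]))
```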
What does Expectation Maximization do?
• Expectation Maximization (EM) is a maximum likelihood estimation procedure for problems with hidden variables
• EM is good for problems where:
  – our model P(Y, X | θ) involves variables Y and X
  – our training data contains x but not y
  – maximizing P(x | θ) is hard
  – maximizing P(y, x | θ) is easy
• In the HMM example: the training data consists of the words x alone, and does not contain their labels y
The EM algorithm
• The EM algorithm:
  – Guess an initial model θ^(0)
  – For t = 1, 2, …, compute Q^(t)(y) and θ^(t), where
    Q^(t)(y) = P(y | x, θ^(t−1))                                        (E-step)
    θ^(t) = argmax_θ E_{Y∼Q^(t)}[log P(Y, x | θ)]                       (M-step)
          = argmax_θ Σ_{y∈Y} Q^(t)(y) log P(y, x | θ)
          = argmax_θ Π_{y∈Y} P(y, x | θ)^{Q^(t)(y)}
• Q^(t)(y) is the probability of the “pseudo-data” y under model θ^(t−1)
• θ^(t) is the MLE based on pseudo-data (y, x), where each (y, x) is weighted according to Q^(t)(y)
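Here is the E-step/M-step loop on a model small enough that the M-step has a closed form: a mixture of two biased coins, where which coin produced each count of heads is hidden. This is a stand-in chosen for brevity, not the HMM from the slides; the data and initial parameters are made up.

```python
import math

m = 10                               # coin flips per trial
data = [9, 8, 2, 1, 9, 1, 8, 2]      # heads observed in each trial (the x's)

def binom(k, p):
    """P(k heads in m flips | coin bias p)."""
    return math.comb(m, k) * p ** k * (1 - p) ** (m - k)

pi, p = [0.5, 0.5], [0.6, 0.4]       # initial guess theta^(0)
for t in range(50):
    # E-step: Q^(t)(y_i = k) = P(coin k | x_i, theta^(t-1))
    q = []
    for x in data:
        w = [pi[k] * binom(x, p[k]) for k in range(2)]
        z = sum(w)
        q.append([wk / z for wk in w])
    # M-step: MLE from the pseudo-data, each (k, x_i) weighted by Q^(t)
    for k in range(2):
        nk = sum(qi[k] for qi in q)
        pi[k] = nk / len(data)
        p[k] = sum(qi[k] * x for qi, x in zip(q, data)) / (nk * m)

print([round(v, 2) for v in pi], [round(v, 2) for v in p])
```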
HMM example
• For an HMM, the EM formulae are:
  Q^(t)(y) = P(y | x, θ^(t−1)) = P(y, x | θ^(t−1)) / Σ_{y′∈Y} P(y′, x | θ^(t−1))
  θ^(t)_{y_i,y_j} = Σ_{y∈Y} Q^(t)(y) n_{y_i,y_j}(y, x) / Σ_{y∈Y} Σ_{y′_i} Q^(t)(y) n_{y′_i,y_j}(y, x)
  θ^(t)_{x_i,y_i} = Σ_{y∈Y} Q^(t)(y) n_{x_i,y_i}(y, x) / Σ_{y∈Y} Σ_{x′_i} Q^(t)(y) n_{x′_i,y_i}(y, x)
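These formulae can be implemented by brute force for toy inputs: enumerate every label sequence y, weight its counts by Q^(t)(y), and renormalise. In practice the expected counts would come from the forward–backward algorithm; the unlabelled sentences and the random initialisation below are purely illustrative.

```python
import random
from collections import Counter
from itertools import product

random.seed(0)
states = ["D", "N"]
vocab = ["the", "dog", "runs"]
corpus = [["the", "dog"], ["the", "dog", "runs"]]   # unlabelled sentences x

def random_dist(keys):
    w = [random.random() for _ in keys]
    return {k: wi / sum(w) for k, wi in zip(keys, w)}

trans = {prev: random_dist(states) for prev in ["<s>"] + states}   # theta^(0)
emit = {s: random_dist(vocab) for s in states}

def joint(y, x):
    p, prev = 1.0, "<s>"
    for yi, xi in zip(y, x):
        p *= trans[prev][yi] * emit[yi][xi]
        prev = yi
    return p

for t in range(20):
    etrans, eemit = Counter(), Counter()            # expected pseudo-counts
    for x in corpus:
        labelings = list(product(states, repeat=len(x)))
        z = sum(joint(y, x) for y in labelings)
        for y in labelings:
            q = joint(y, x) / z                     # Q^(t)(y) = P(y | x, theta^(t-1))
            prev = "<s>"
            for yi, xi in zip(y, x):
                etrans[(yi, prev)] += q
                eemit[(xi, yi)] += q
                prev = yi
    # M-step: normalise the expected counts, exactly as with visible data
    for prev in trans:
        z = sum(etrans[(s, prev)] for s in states)
        trans[prev] = {s: etrans[(s, prev)] / z for s in states}
    for s in states:
        z = sum(eemit[(w, s)] for w in vocab)
        emit[s] = {w: eemit[(w, s)] / z for w in vocab}

print(emit)
```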
EM converges — overview
• Neal and Hinton define a function F(Q, θ) where:
  – Q(Y) is a probability distribution over the hidden variables
  – θ are the model parameters
  argmax_θ max_Q F(Q, θ) = θ̂, the MLE of θ
  max_Q F(Q, θ) = L_x(θ), the log likelihood of θ
  argmax_Q F(Q, θ) = P(Y | x, θ) for all θ
• The EM algorithm is an alternating maximization of F:
  Q^(t) = argmax_Q F(Q, θ^(t−1))     (E-step)
  θ^(t) = argmax_θ F(Q^(t), θ)       (M-step)
The EM algorithm converges
  F(Q, θ) = E_{Y∼Q}[log P(Y, x | θ)] + H(Q)
          = L_x(θ) − KL(Q(Y) || P(Y | x, θ))
  H(Q) = entropy of Q
  L_x(θ) = log P(x | θ) = log likelihood of θ
  KL(Q || P) = KL divergence between Q and P
  Q^(t)(Y) = P(Y | x, θ^(t−1)) = argmax_Q F(Q, θ^(t−1))                     (E-step)
  θ^(t) = argmax_θ E_{Y∼Q^(t)}[log P(Y, x | θ)] = argmax_θ F(Q^(t), θ)      (M-step)
• The maximum value of F is achieved at θ = θ̂ and Q(Y) = P(Y | x, θ̂).
• The sequence of F values produced by the EM algorithm is non-decreasing and bounded above by L_x(θ̂).
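These claims can be checked numerically, again on the two-coin mixture rather than the HMM: after each E-step the value of F(Q, θ) equals the log likelihood L_x(θ), and the sequence of F values never decreases. The data and starting values are made up; the assert would fail if either property were violated.

```python
import math

m, data = 10, [9, 8, 2, 1, 9, 1, 8, 2]   # flips per trial, observed head counts

def binom(k, p):
    return math.comb(m, k) * p ** k * (1 - p) ** (m - k)

def log_lik(pi, p):
    """L_x(theta) = sum_i log sum_k pi_k P(x_i | coin k)."""
    return sum(math.log(sum(pi[k] * binom(x, p[k]) for k in range(2))) for x in data)

pi, p = [0.5, 0.5], [0.6, 0.4]
prev_F = -math.inf
for t in range(30):
    # E-step: Q(y_i = k) = P(coin k | x_i, theta); this makes KL(Q || P) = 0
    q = []
    for x in data:
        w = [pi[k] * binom(x, p[k]) for k in range(2)]
        z = sum(w)
        q.append([wk / z for wk in w])
    # F(Q, theta) = E_{Y~Q}[log P(Y, x | theta)] + H(Q)
    expected_ll = sum(qi[k] * math.log(pi[k] * binom(x, p[k]))
                      for qi, x in zip(q, data) for k in range(2))
    entropy = -sum(qi[k] * math.log(qi[k])
                   for qi in q for k in range(2) if qi[k] > 0)
    F = expected_ll + entropy
    assert prev_F - 1e-9 <= F <= log_lik(pi, p) + 1e-9   # non-decreasing, and F = L_x here
    prev_F = F
    # M-step
    for k in range(2):
        nk = sum(qi[k] for qi in q)
        pi[k] = nk / len(data)
        p[k] = sum(qi[k] * x for qi, x in zip(q, data)) / (nk * m)

print(round(prev_F, 4), [round(v, 2) for v in p])
```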
Generalized EM
• Idea: anything that increases F gets you closer to θ̂
• Idea: insert any additional operations you want into the EM algorithm, so long as they don’t decrease F
  – Update θ after each data item has been processed
  – Visit some data items more often than others
  – Only update some components of θ on some iterations
Incremental EM for factored models
• Data and model both factor: Y = (Y_1, …, Y_n), X = (X_1, …, X_n)
  P(Y, X | θ) = Π_{i=1}^n P(Y_i, X_i | θ)
• Incremental EM algorithm:
  – Initialize θ^(0) and Q_i^(0)(Y_i) for i = 1, …, n
  – E-step: Choose some data item i to be updated
    Q_j^(t) = Q_j^(t−1) for all j ≠ i
    Q_i^(t)(Y_i) = P(Y_i | x_i, θ^(t−1))
  – M-step: θ^(t) = argmax_θ E_{Y∼Q^(t)}[log P(Y, x | θ)]
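A sketch of this schedule on the factored two-coin mixture: only one item’s distribution Q_i is refreshed before θ is re-estimated from all the stored Q_j. Recomputing the M-step from scratch each time is wasteful; the sufficient-statistics version on the later slides avoids that. Data and start values are made up.

```python
import math

m, data = 10, [9, 8, 2, 1, 9, 1, 8, 2]

def binom(k, p):
    return math.comb(m, k) * p ** k * (1 - p) ** (m - k)

pi, p = [0.5, 0.5], [0.6, 0.4]                 # theta^(0)
Q = [[0.5, 0.5] for _ in data]                 # Q_i^(0)(Y_i), one per data item

for sweep in range(20):
    for i, x in enumerate(data):               # choose one data item at a time
        # E-step for item i only; Q_j is left unchanged for j != i
        w = [pi[k] * binom(x, p[k]) for k in range(2)]
        z = sum(w)
        Q[i] = [wk / z for wk in w]
        # M-step: MLE of theta from all the currently stored Q_j
        for k in range(2):
            nk = sum(qj[k] for qj in Q)
            pi[k] = nk / len(data)
            p[k] = sum(qj[k] * xj for qj, xj in zip(Q, data)) / (nk * m)

print([round(v, 2) for v in pi], [round(v, 2) for v in p])
```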
EM using sufficient statistics
• Model parameters θ are estimated from sufficient statistics S:
  (Y, X) → S → θ
• In the HMM example, the pseudo-counts are sufficient statistics
• EM algorithm with sufficient statistics:
  s̃^(t) = E_{Y∼P(Y|x,θ^(t−1))}[S]                                     (E-step)
  θ^(t) = the maximum likelihood value for θ based on s̃^(t)            (M-step)
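For the two-coin mixture, the sufficient statistics are, per coin, the expected number of trials assigned to it and the expected number of heads it produced; θ is then a closed-form function of those alone. A sketch with made-up data and start values:

```python
import math

m, data = 10, [9, 8, 2, 1, 9, 1, 8, 2]

def binom(k, p):
    return math.comb(m, k) * p ** k * (1 - p) ** (m - k)

def expected_stats(pi, p):
    """E-step: s~ = E_{Y ~ P(Y | x, theta)}[S], S = (trial count, head count) per coin."""
    n, h = [0.0, 0.0], [0.0, 0.0]
    for x in data:
        w = [pi[k] * binom(x, p[k]) for k in range(2)]
        z = sum(w)
        for k in range(2):
            n[k] += w[k] / z
            h[k] += (w[k] / z) * x
    return n, h

def mle_from_stats(n, h):
    """M-step: theta depends on the data only through the sufficient statistics."""
    pi = [nk / sum(n) for nk in n]
    p = [h[k] / (n[k] * m) for k in range(2)]
    return pi, p

pi, p = [0.5, 0.5], [0.6, 0.4]
for t in range(50):
    pi, p = mle_from_stats(*expected_stats(pi, p))

print([round(v, 2) for v in pi], [round(v, 2) for v in p])
```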
Incremental EM using sufficient statistics
• Incremental EM algorithm with sufficient statistics:
  (Y_i, X_i) → S_i → S → θ,   where S = Σ_i S_i
  – Initialize θ^(0) and s̃_i^(0) for i = 1, …, n
  – E-step: Choose some data item i to be updated
    s̃_j^(t) = s̃_j^(t−1) for all j ≠ i
    s̃_i^(t) = E_{Y_i∼P(Y_i|x_i,θ^(t−1))}[S_i]
    s̃^(t) = s̃^(t−1) + (s̃_i^(t) − s̃_i^(t−1))
  – M-step: θ^(t) = the maximum likelihood value for θ based on s̃^(t)
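The same two-coin example with the running-total trick above: each item’s expected statistics s̃_i are cached, and the global total s̃ is adjusted by the difference when an item is revisited, so the M-step never touches the other items. Data and start values are made up.

```python
import math

m, data = 10, [9, 8, 2, 1, 9, 1, 8, 2]

def binom(k, p):
    return math.comb(m, k) * p ** k * (1 - p) ** (m - k)

def item_stats(x, pi, p):
    """s~_i = E_{Y_i ~ P(Y_i | x_i, theta)}[S_i]: [trial weight, weighted heads] per coin."""
    w = [pi[k] * binom(x, p[k]) for k in range(2)]
    z = sum(w)
    return [[w[k] / z, (w[k] / z) * x] for k in range(2)]

pi, p = [0.5, 0.5], [0.6, 0.4]
cache = [item_stats(x, pi, p) for x in data]                         # s~_i^(0)
total = [[sum(s[k][j] for s in cache) for j in range(2)] for k in range(2)]

for sweep in range(20):
    for i, x in enumerate(data):
        new = item_stats(x, pi, p)                                   # E-step for item i only
        for k in range(2):
            for j in range(2):
                total[k][j] += new[k][j] - cache[i][k][j]            # s~ += s~_i(new) - s~_i(old)
        cache[i] = new
        # M-step: theta from the running totals s~ alone
        n_total = total[0][0] + total[1][0]
        pi = [total[k][0] / n_total for k in range(2)]
        p = [total[k][1] / (total[k][0] * m) for k in range(2)]

print([round(v, 2) for v in pi], [round(v, 2) for v in p])
```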
Conclusion
• The Expectation-Maximization algorithm is a general technique for using supervised maximum likelihood estimators to solve unsupervised estimation problems
• The E-step and the M-step can be viewed as steps of an alternating maximization procedure
  – The functional F is bounded above by the log likelihood
  – Each E and M step increases F
  ⇒ Eventually the EM algorithm converges to a local optimum (not necessarily a global optimum)
• We can insert any steps we like into the EM algorithm so long as they do not decrease F
  ⇒ Incremental versions of the EM algorithm