

  1. Notes on Neal and Hinton's Generalized Expectation Maximization (GEM) Algorithm
     Mark Johnson
     Brown University
     February 2005, updated November 2008

  2. Talk overview
     • What kinds of problems does expectation maximization solve?
     • An example of EM
     • Relaxation, and proving that EM converges
     • Sufficient statistics and EM
     • The generalized EM algorithm

  3. Hidden Markov Models
     [Figure: HMM graphical model; hidden states $y_0, y_1, y_2, y_3, y_4$ (e.g., parts of speech) generate observations $x_1, x_2, x_3, x_4$ (e.g., words)]
     $P(Y, X \mid \theta) = \prod_{i=1}^{n} P(Y_i \mid Y_{i-1}, \theta)\, P(X_i \mid Y_i, \theta)$
     $P(y_i \mid y_{i-1}, \theta) = \theta_{y_i, y_{i-1}}$
     $P(x_i \mid y_i, \theta) = \theta_{x_i, y_i}$
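To make this parameterization concrete, here is a minimal Python sketch (not part of the original slides) that evaluates the joint probability $P(y, x \mid \theta)$ as a product of transition and emission parameters. The dictionaries trans and emit, the state and word names, and the start symbol "<s>" standing in for $y_0$ are all illustrative assumptions.

def hmm_joint_prob(y, x, trans, emit, start="<s>"):
    """P(y, x | theta) = prod_i theta_{y_i, y_{i-1}} * theta_{x_i, y_i}."""
    prob = 1.0
    prev = start                       # plays the role of y_0
    for y_i, x_i in zip(y, x):
        prob *= trans[(y_i, prev)]     # transition parameter theta_{y_i, y_{i-1}}
        prob *= emit[(x_i, y_i)]       # emission parameter theta_{x_i, y_i}
        prev = y_i
    return prob

# Toy two-state example (hypothetical numbers):
trans = {("N", "<s>"): 0.7, ("V", "<s>"): 0.3,
         ("N", "N"): 0.4, ("V", "N"): 0.6,
         ("N", "V"): 0.8, ("V", "V"): 0.2}
emit = {("dog", "N"): 0.9, ("runs", "N"): 0.1,
        ("dog", "V"): 0.2, ("runs", "V"): 0.8}
print(hmm_joint_prob(["N", "V"], ["dog", "runs"], trans, emit))  # 0.7 * 0.9 * 0.6 * 0.8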

  4. Maximum likelihood estimation
     • Given visible data $(y, x)$, how can we estimate $\theta$?
     • Maximum likelihood principle: $\hat\theta = \arg\max_\theta L_{(y,x)}(\theta)$, where:
       $L_{(y,x)}(\theta) = \log P_\theta(y, x) = \log P(y, x \mid \theta)$
     • For a HMM, these are simple to calculate:
       $\hat\theta_{y_i, y_j} = \dfrac{n_{y_i, y_j}(y, x)}{\sum_{y'_i} n_{y'_i, y_j}(y, x)}$
       $\hat\theta_{x_i, y_i} = \dfrac{n_{x_i, y_i}(y, x)}{\sum_{x'_i} n_{x'_i, y_i}(y, x)}$
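A sketch of this supervised relative-frequency estimator, counting the events $n_{y_i, y_{i-1}}$ and $n_{x_i, y_i}$ and normalizing. The corpus format (a list of (tags, words) pairs) and the start symbol "<s>" are assumptions made for illustration.

from collections import Counter

def hmm_mle(corpus, start="<s>"):
    """Relative-frequency MLE for a fully observed HMM corpus."""
    trans_counts, emit_counts = Counter(), Counter()
    for tags, words in corpus:
        prev = start
        for tag, word in zip(tags, words):
            trans_counts[(tag, prev)] += 1   # n_{y_i, y_{i-1}}(y, x)
            emit_counts[(word, tag)] += 1    # n_{x_i, y_i}(y, x)
            prev = tag
    # Normalize the counts into conditional probabilities.
    trans_totals, emit_totals = Counter(), Counter()
    for (tag, prev), c in trans_counts.items():
        trans_totals[prev] += c
    for (word, tag), c in emit_counts.items():
        emit_totals[tag] += c
    trans = {(tag, prev): c / trans_totals[prev] for (tag, prev), c in trans_counts.items()}
    emit = {(word, tag): c / emit_totals[tag] for (word, tag), c in emit_counts.items()}
    return trans, emit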

  5. ML estimation from hidden data
     • Our model defines $P(Y, X)$, but our data only contains values for $X$, i.e., the variable $Y$ is hidden
       – HMM example: the data $D$ contains only the words $x$, not their labels $y$
     • The maximum likelihood principle still applies:
       $\hat\theta = \arg\max_\theta L_x(\theta)$, where:
       $L_x(\theta) = \log P(x \mid \theta) = \log \sum_{y \in \mathcal{Y}} P(y, x \mid \theta)$
     • But maximizing $L_x(\theta)$ may now be a non-trivial problem!
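A brute-force sketch of $L_x(\theta)$ for the HMM case, summing over every label sequence $y$. It is exponential in the sentence length and is only meant to illustrate why maximizing $L_x(\theta)$ directly is awkward; it reuses the hypothetical hmm_joint_prob from the earlier sketch.

from itertools import product
from math import log

def log_likelihood(x, states, trans, emit, start="<s>"):
    """L_x(theta) = log sum_y P(y, x | theta), by exhaustive enumeration."""
    total = 0.0
    for y in product(states, repeat=len(x)):   # all |states|^n label sequences
        total += hmm_joint_prob(list(y), x, trans, emit, start)
    return log(total)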

  6. What does Expectation Maximization do?
     • Expectation Maximization (EM) is a maximum likelihood estimation procedure for problems with hidden variables
     • EM is good for problems where:
       – our model $P(Y, X \mid \theta)$ involves variables $Y$ and $X$
       – our training data contains $x$ but not $y$
       – maximizing $P(x \mid \theta)$ is hard
       – maximizing $P(y, x \mid \theta)$ is easy
     • In the HMM example: EM applies if the training data consists of the words $x$ alone and does not contain their labels $y$

  7. The EM algorithm
     • The EM algorithm:
       – Guess an initial model $\theta^{(0)}$
       – For $t = 1, 2, \ldots$, compute $Q^{(t)}(y)$ and $\theta^{(t)}$, where:
         $Q^{(t)}(y) = P(y \mid x, \theta^{(t-1)})$   (E-step)
         $\theta^{(t)} = \arg\max_\theta E_{Y \sim Q^{(t)}}[\log P(Y, x \mid \theta)]$   (M-step)
         $\quad\;\; = \arg\max_\theta \sum_{y \in \mathcal{Y}} Q^{(t)}(y) \log P(y, x \mid \theta)$
         $\quad\;\; = \arg\max_\theta \prod_{y \in \mathcal{Y}} P(y, x \mid \theta)^{Q^{(t)}(y)}$
     • $Q^{(t)}(y)$ is the probability of “pseudo-data” $y$ under the model $\theta^{(t-1)}$
     • $\theta^{(t)}$ is the MLE based on the pseudo-data $(y, x)$, where each $(y, x)$ is weighted according to $Q^{(t)}(y)$
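A generic sketch of this loop for a model whose hidden variable ranges over a small finite set, so the E-step can be done by brute-force normalization. The callbacks joint(y, x, theta) (returning $P(y, x \mid \theta)$) and argmax_weighted(pseudo_data) (the "easy" supervised estimator applied to Q-weighted pseudo-data) are hypothetical names introduced for illustration.

def em_step(x, hidden_values, theta, joint, argmax_weighted):
    # E-step: Q(y) = P(y | x, theta), obtained by normalizing the joint.
    joints = {y: joint(y, x, theta) for y in hidden_values}
    z = sum(joints.values())
    q = {y: p / z for y, p in joints.items()}
    # M-step: maximize E_{Y~Q}[log P(Y, x | theta)] by running the supervised
    # estimator on pseudo-data (y, x) weighted by Q(y).
    return argmax_weighted([(y, x, q[y]) for y in hidden_values])

def em(x, hidden_values, theta0, joint, argmax_weighted, iterations=10):
    theta = theta0
    for t in range(iterations):
        theta = em_step(x, hidden_values, theta, joint, argmax_weighted)
    return theta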

  8. HMM example
     • For a HMM, the EM formulae are:
       $Q^{(t)}(y) = P(y \mid x, \theta^{(t-1)}) = \dfrac{P(y, x \mid \theta^{(t-1)})}{\sum_{y' \in \mathcal{Y}} P(y', x \mid \theta^{(t-1)})}$
       $\theta^{(t)}_{y_i, y_j} = \dfrac{\sum_{y \in \mathcal{Y}} Q^{(t)}(y)\, n_{y_i, y_j}(y, x)}{\sum_{y \in \mathcal{Y}} \sum_{y'_i} Q^{(t)}(y)\, n_{y'_i, y_j}(y, x)}$
       $\theta^{(t)}_{x_i, y_i} = \dfrac{\sum_{y \in \mathcal{Y}} Q^{(t)}(y)\, n_{x_i, y_i}(y, x)}{\sum_{y \in \mathcal{Y}} \sum_{x'_i} Q^{(t)}(y)\, n_{x'_i, y_i}(y, x)}$
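The same formulae as a brute-force Python sketch for a single observed sentence: $Q^{(t)}(y)$ is computed by normalizing the joint over all tag sequences, and the new parameters are relative frequencies of Q-weighted pseudo-counts. A real implementation would compute the expected counts with the forward-backward algorithm rather than by enumeration; the sketch reuses the hypothetical hmm_joint_prob above.

from collections import Counter
from itertools import product

def hmm_em_step(x, states, trans, emit, start="<s>"):
    """One EM iteration for a single observed sentence x, by enumeration."""
    joints = {y: hmm_joint_prob(list(y), x, trans, emit, start)
              for y in product(states, repeat=len(x))}
    z = sum(joints.values())
    exp_trans, exp_emit = Counter(), Counter()
    for y, p in joints.items():
        q = p / z                          # Q^{(t)}(y) = P(y | x, theta^{(t-1)})
        prev = start
        for tag, word in zip(y, x):
            exp_trans[(tag, prev)] += q    # expected count of n_{y_i, y_{i-1}}
            exp_emit[(word, tag)] += q     # expected count of n_{x_i, y_i}
            prev = tag
    # M-step: normalize the expected counts, exactly as in the supervised MLE.
    trans_tot, emit_tot = Counter(), Counter()
    for (tag, prev), c in exp_trans.items():
        trans_tot[prev] += c
    for (word, tag), c in exp_emit.items():
        emit_tot[tag] += c
    new_trans = {(tag, prev): c / trans_tot[prev] for (tag, prev), c in exp_trans.items()}
    new_emit = {(word, tag): c / emit_tot[tag] for (word, tag), c in exp_emit.items()}
    return new_trans, new_emit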

  9. EM converges — overview
     • Neal and Hinton define a function $F(Q, \theta)$ where:
       – $Q(Y)$ is a probability distribution over the hidden variables
       – $\theta$ are the model parameters
       $\arg\max_\theta \max_Q F(Q, \theta) = \hat\theta$, the MLE of $\theta$
       $\max_Q F(Q, \theta) = L_x(\theta)$, the log likelihood of $\theta$
       $\arg\max_Q F(Q, \theta) = P(Y \mid x, \theta)$ for all $\theta$
     • The EM algorithm is an alternating maximization of $F$:
       $Q^{(t)} = \arg\max_Q F(Q, \theta^{(t-1)})$   (E-step)
       $\theta^{(t)} = \arg\max_\theta F(Q^{(t)}, \theta)$   (M-step)

  10. The EM algorithm converges
     $F(Q, \theta) = E_{Y \sim Q}[\log P(Y, x \mid \theta)] + H(Q) = L_x(\theta) - \mathrm{KL}(Q(Y) \,\|\, P(Y \mid x, \theta))$
       $H(Q)$ = entropy of $Q$
       $L_x(\theta) = \log P(x \mid \theta)$ = log likelihood of $\theta$
       $\mathrm{KL}(Q \,\|\, P)$ = KL divergence between $Q$ and $P$
     $Q^{(t)}(Y) = P(Y \mid x, \theta^{(t-1)}) = \arg\max_Q F(Q, \theta^{(t-1)})$   (E-step)
     $\theta^{(t)} = \arg\max_\theta F(Q^{(t)}, \theta) = \arg\max_\theta E_{Y \sim Q^{(t)}}[\log P(Y, x \mid \theta)]$   (M-step)
     • The maximum value of $F$ is achieved at $\theta = \hat\theta$ and $Q(Y) = P(Y \mid x, \hat\theta)$.
     • The sequence of $F$ values produced by the EM algorithm is non-decreasing and bounded above by $L_x(\hat\theta)$.
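The two expressions for $F$ above are equal by Bayes rule; a short, standard derivation (writing $P(y, x \mid \theta) = P(y \mid x, \theta)\, P(x \mid \theta)$):

\begin{aligned}
F(Q,\theta) &= \sum_y Q(y)\,\log P(y, x \mid \theta) - \sum_y Q(y)\,\log Q(y) \\
            &= \sum_y Q(y)\,\log \frac{P(y \mid x, \theta)\, P(x \mid \theta)}{Q(y)} \\
            &= \log P(x \mid \theta) - \sum_y Q(y)\,\log \frac{Q(y)}{P(y \mid x, \theta)} \\
            &= L_x(\theta) - \mathrm{KL}\bigl(Q(Y) \,\|\, P(Y \mid x, \theta)\bigr).
\end{aligned}

Since $\mathrm{KL} \ge 0$, with equality exactly when $Q(Y) = P(Y \mid x, \theta)$, we get $F(Q, \theta) \le L_x(\theta)$, and the E-step attains that bound.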

  11. Generalized EM
     • Idea: anything that increases $F$ gets you closer to $\hat\theta$
     • Idea: insert any additional operations you want into the EM algorithm, so long as they don’t decrease $F$
       – Update $\theta$ after each data item has been processed
       – Visit some data items more often than others
       – Only update some components of $\theta$ on some iterations

  12. Incremental EM for factored models
     • Data and model both factor:
       $Y = (Y_1, \ldots, Y_n)$, $X = (X_1, \ldots, X_n)$
       $P(Y, X \mid \theta) = \prod_{i=1}^{n} P(Y_i, X_i \mid \theta)$
     • Incremental EM algorithm:
       – Initialize $\theta^{(0)}$ and $Q^{(0)}_i(Y_i)$ for $i = 1, \ldots, n$
       – E-step: choose some data item $i$ to be updated
         $Q^{(t)}_j = Q^{(t-1)}_j$ for all $j \neq i$
         $Q^{(t)}_i(Y_i) = P(Y_i \mid x_i, \theta^{(t-1)})$
       – M-step: $\theta^{(t)} = \arg\max_\theta E_{Y \sim Q^{(t)}}[\log P(Y, x \mid \theta)]$
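A sketch of this incremental scheme in the same brute-force style as the earlier EM sketch. It assumes per-item posteriors computed by normalization and the hypothetical argmax_weighted supervised estimator; only $Q_i$ is refreshed on each visit, while the stale $Q_j$ for $j \neq i$ are reused in the M-step.

def incremental_em(xs, hidden_values, theta0, joint, argmax_weighted, sweeps=5):
    theta = theta0

    def posterior(x_i, theta):
        # Q_i(y_i) = P(y_i | x_i, theta), by normalizing the per-item joint.
        joints = {y: joint(y, x_i, theta) for y in hidden_values}
        z = sum(joints.values())
        return {y: p / z for y, p in joints.items()}

    q = [posterior(x_i, theta) for x_i in xs]      # initialize Q_i^{(0)}
    for _ in range(sweeps):
        for i, x_i in enumerate(xs):
            q[i] = posterior(x_i, theta)           # E-step: update only Q_i
            # M-step: supervised estimator on all items, mixing the fresh Q_i
            # with the stale Q_j for j != i.
            pseudo = [(y, x_j, q_j[y])
                      for x_j, q_j in zip(xs, q)
                      for y in hidden_values]
            theta = argmax_weighted(pseudo)
    return theta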

  13. EM using sufficient statistics
     • Model parameters $\theta$ are estimated from sufficient statistics $S$: $(Y, X) \rightarrow S \rightarrow \theta$
     • In the HMM example, the pseudo-counts are sufficient statistics
     • EM algorithm with sufficient statistics:
       $\tilde{s}^{(t)} = E_{Y \sim P(Y \mid x, \theta^{(t-1)})}[S]$   (E-step)
       $\theta^{(t)}$ = maximum likelihood value for $\theta$ based on $\tilde{s}^{(t)}$   (M-step)
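An abstract sketch of this loop. The two callbacks are hypothetical names: expected_stats(x, theta) returns $E_{Y \sim P(Y \mid x, \theta)}[S]$, and mle_from_stats(s) maps sufficient statistics to maximum-likelihood parameters (pseudo-count normalization in the HMM case).

def em_sufficient_stats(x, theta0, expected_stats, mle_from_stats, iterations=10):
    theta = theta0
    for t in range(iterations):
        s = expected_stats(x, theta)   # E-step: expected sufficient statistics
        theta = mle_from_stats(s)      # M-step: MLE from those statistics
    return theta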

  14. Incremental EM using sufficient statistics
     • Incremental EM algorithm with sufficient statistics:
       $(Y_i, X_i) \rightarrow S_i \rightarrow S \rightarrow \theta$, where $S = \sum_i S_i$
       – Initialize $\theta^{(0)}$ and $\tilde{s}^{(0)}_i$ for $i = 1, \ldots, n$
       – E-step: choose some data item $i$ to be updated
         $\tilde{s}^{(t)}_j = \tilde{s}^{(t-1)}_j$ for all $j \neq i$
         $\tilde{s}^{(t)}_i = E_{Y_i \sim P(Y_i \mid x_i, \theta^{(t-1)})}[S_i]$
         $\tilde{s}^{(t)} = \tilde{s}^{(t-1)} + (\tilde{s}^{(t)}_i - \tilde{s}^{(t-1)}_i)$
       – M-step: $\theta^{(t)}$ = maximum likelihood value for $\theta$ based on $\tilde{s}^{(t)}$
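A sketch of this incremental version: keep each item's expected statistics $\tilde{s}_i$ and the running total $\tilde{s}$, and when item $i$ is revisited, swap its old contribution for its new one before re-estimating $\theta$. The callbacks expected_stats_i and mle_from_stats are hypothetical, and the statistics are represented as Counters so they can be added and subtracted componentwise.

from collections import Counter

def incremental_em_stats(xs, theta0, expected_stats_i, mle_from_stats, sweeps=5):
    theta = theta0
    s_i = [Counter(expected_stats_i(x_i, theta)) for x_i in xs]   # per-item stats
    s = Counter()
    for stats in s_i:
        s.update(stats)                            # total: s = sum_i s_i
    for _ in range(sweeps):
        for i, x_i in enumerate(xs):
            new_i = Counter(expected_stats_i(x_i, theta))   # E-step for item i only
            s.subtract(s_i[i])                     # remove item i's old contribution
            s.update(new_i)                        # s = s + (new_i - old_i)
            s_i[i] = new_i
            theta = mle_from_stats(s)              # M-step from the updated total
    return theta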

  15. Conclusion
     • The Expectation-Maximization algorithm is a general technique for using supervised maximum likelihood estimators to solve unsupervised estimation problems
     • The E-step and the M-step can be viewed as steps of an alternating maximization procedure
       – The functional $F$ is bounded above by the log likelihood
       – Neither the E-step nor the M-step decreases $F$
       ⇒ Eventually the EM algorithm converges to a local optimum (not necessarily a global optimum)
     • We can insert any steps we like into the EM algorithm so long as they do not decrease $F$
       ⇒ Incremental versions of the EM algorithm
