A Gentle Introduction to the EM Algorithm

Ted Pedersen
Department of Computer Science
University of Minnesota Duluth
tpederse@d.umn.edu
EMNLP, June 2001

A unifying methodology
• Dempster, Laird & Rubin (1977) unified many strands of apparently unrelated work under the banner of the EM Algorithm
• EM had gone incognito for many years
  – Newcomb (1887)
  – McKendrick (1926)
  – Hartley (1958)
  – Baum et al. (1970)

EM allows us to make MLEs under adverse circumstances
• What are Maximum Likelihood Estimates?
• What are these adverse circumstances?
• How does EM triumph over adversity?
• PANEL: When does it really work?

A general framework for solving many kinds of problems
• Filling in missing data in a sample
• Discovering the value of latent variables
• Estimating parameters of HMMs
• Estimating parameters of finite mixtures
• Unsupervised learning of clusters
• …

Maximum Likelihood Estimates
• Parameters describe the characteristics of a population. Their values are estimated from samples collected from that population.
• An MLE is a parameter estimate that is most consistent with the sampled data: it maximizes the likelihood function.

Coin Tossing!
• How likely am I to toss a head? A series of 10 trials/tosses yields (h,t,t,t,h,t,t,h,t,t)
  – (x1=3, x2=7), n=10
• Probability of tossing a head = 3/10
• That’s an MLE! This estimate is absolutely consistent with the observed data.
• A few underlying details are masked…
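The coin-toss MLE above is just the count ratio heads/n. A minimal sketch of that computation in Python (the toss sequence comes from the slide; the function name is mine):

```python
# Closed-form MLE for P(head) under a binomial model:
# theta_hat = (number of heads) / (number of tosses).
def coin_mle(tosses):
    """Return the maximum likelihood estimate of P(head)."""
    return tosses.count("h") / len(tosses)

tosses = ["h", "t", "t", "t", "h", "t", "t", "h", "t", "t"]
print(coin_mle(tosses))  # 0.3 -- 3 heads out of 10 tosses
```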
Coin tossing unmasked
• Coin tossing is well described by the binomial distribution, since there are n independent trials with two outcomes.
• Given 10 tosses, how likely is 3 heads?

  L(θ) = C(10,3) · θ³ · (1−θ)⁷

Maximum Likelihood Estimates
• We seek the estimate of the parameter that maximizes the likelihood function.
• Take the first derivative of the likelihood function with respect to the parameter θ, set it equal to zero, and solve. The resulting value maximizes the likelihood function and is the MLE.

Maximizing the likelihood

  L(θ) = C(10,3) · θ³ · (1−θ)⁷
  log L(θ) = log C(10,3) + 3·log θ + 7·log(1−θ)
  d log L(θ)/dθ = 3/θ − 7/(1−θ) = 0
  3/θ = 7/(1−θ)  ⇒  θ = 3/10

Multinomial MLE example
• There are n animals classified into one of four possible categories (Rao 1973).
  – Category counts are the sufficient statistics to estimate multinomial parameters
• The technique for finding MLEs is the same
  – Take the derivative of the likelihood function
  – Solve for zero

Multinomial MLE example

  There are n = 197 animals classified into one of 4 categories:
  Y = (y1, y2, y3, y4) = (125, 18, 20, 34)

  The probability associated with each category is given as:
  Θ = (½ + ¼·π, ¼·(1−π), ¼·(1−π), ¼·π)

  The resulting likelihood function for this multinomial is:
  L(π) = n! / (y1!·y2!·y3!·y4!) · (½ + ¼·π)^y1 · (¼·(1−π))^y2 · (¼·(1−π))^y3 · (¼·π)^y4

Multinomial MLE example

  log L(π) = y1·log(½ + ¼·π) + y2·log(¼·(1−π)) + y3·log(¼·(1−π)) + y4·log(¼·π)
  d log L(π)/dπ = y1/(2+π) − (y2+y3)/(1−π) + y4/π = 0
  d log L(π)/dπ = 125/(2+π) − 38/(1−π) + 34/π = 0  ⇒  π = 0.627
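The solution π = 0.627 can be checked numerically. A minimal sketch, assuming the Rao (1973) counts from the slide; the bisection approach is mine, not from the talk (the derivative is strictly decreasing on (0,1), so its unique root can be bracketed):

```python
# Observed category counts from Rao (1973).
Y1, Y2, Y3, Y4 = 125, 18, 20, 34

def dloglik(p):
    """Derivative of the multinomial log-likelihood with respect to pi."""
    return Y1 / (2 + p) - (Y2 + Y3) / (1 - p) + Y4 / p

# dloglik is +inf near 0 and -inf near 1, and strictly decreasing,
# so bisection on (0, 1) converges to the unique MLE.
lo, hi = 1e-9, 1 - 1e-9
for _ in range(100):
    mid = (lo + hi) / 2
    if dloglik(mid) > 0:
        lo = mid
    else:
        hi = mid
print(round((lo + hi) / 2, 3))  # 0.627
```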
Multinomial MLE runs aground?
• Adversity strikes! The observed data is incomplete. There are really 5 categories.
• y1 is the composite of 2 categories (x1+x2)
  – p(y1) = ½ + ¼·π, p(x1) = ½, p(x2) = ¼·π
• How can we make an MLE, since we can’t observe the category counts x1 and x2?!
  – Unobserved sufficient statistics!?

EM triumphs over adversity!
• E-STEP: Find the expected values of the sufficient statistics for the complete data X, given the incomplete data Y and the current parameter estimates
• M-STEP: Use those sufficient statistics to make an MLE as usual!

MLE for complete data

  X = (x1, x2, x3, x4, x5) = (x1, x2, 18, 20, 34), where x1 + x2 = 125
  Θ = (½, ¼·π, ¼·(1−π), ¼·(1−π), ¼·π)
  L(π) = n! / (x1!·x2!·x3!·x4!·x5!) · (½)^x1 · (¼·π)^x2 · (¼·(1−π))^x3 · (¼·(1−π))^x4 · (¼·π)^x5

MLE for complete data

  log L(π) = x2·log(¼·π) + x3·log(¼·(1−π)) + x4·log(¼·(1−π)) + x5·log(¼·π)
  (dropping the terms that do not depend on π)
  d log L(π)/dπ = (x2 + x5)/π − (x3 + x4)/(1−π) = 0
  d log L(π)/dπ = (x2 + 34)/π − 38/(1−π) = 0

E-step
• What are the sufficient statistics?
  – x1 ⇒ x2 = 125 − x1
• How can their expected value be computed?
  – E[x1|y1] = n·p(x1)
• The unobserved counts x1 and x2 are the categories of a binomial distribution with a sample size of 125.
  – p(x1) + p(x2) = p(y1) = ½ + ¼·π

E-Step
• E[x1|y1] = n·p(x1)
  – p(x1) = ½ / (½ + ¼·π)
• E[x2|y1] = n·p(x2) = 125 − E[x1|y1]
  – p(x2) = ¼·π / (½ + ¼·π)
• Iteration 1? Start with π = 0.5 (this is just a random guess…)
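A minimal sketch of this E-step in Python (the formulas are the slides'; the function name is mine):

```python
def e_step(pi, y1=125):
    """Split the composite count y1 into expected complete-data counts.

    x1 and x2 are a binomial split of y1 with cell probabilities
    1/2 and pi/4, so each expectation is y1 times a normalized
    cell probability.
    """
    p_y1 = 0.5 + 0.25 * pi
    e_x1 = y1 * (0.5 / p_y1)
    e_x2 = y1 - e_x1  # equivalently y1 * (0.25 * pi / p_y1)
    return e_x1, e_x2

print(e_step(0.5))  # (100.0, 25.0) -- matches E-Step Iteration 1 below
```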
E-Step Iteration 1
• E[x1|y1] = 125 · (½ / (½ + ¼·0.5)) = 100
• E[x2|y1] = 125 − 100 = 25
• These are the expected values of the sufficient statistics, given the observed data and the current parameter estimate (which was just a guess)

M-Step Iteration 1
• Given sufficient statistics, make MLEs as usual

  d log L(π)/dπ = (x2 + 34)/π − 38/(1−π) = 0
  (25 + 34)/π − 38/(1−π) = 0
  π = 0.608

E-Step Iteration 2
• E[x1|y1] = 125 · (½ / (½ + ¼·0.608)) = 95.86
• E[x2|y1] = 125 − 95.86 = 29.14
• These are the expected values of the sufficient statistics, given the observed data and the current parameter estimate (from iteration 1)

M-Step Iteration 2
• Given sufficient statistics, make MLEs as usual

  d log L(π)/dπ = (x2 + 34)/π − 38/(1−π) = 0
  (29.14 + 34)/π − 38/(1−π) = 0
  π = 0.624

Result?
• Converges in 4 iterations to π = 0.627
  – E[x1|y1] = 95.2
  – E[x2|y1] = 29.8

Conclusion
• Distribution must be appropriate to the problem
• Sufficient statistics should be identifiable and have computable expected values
• Maximization operation should be possible
• Initialization should be good or lucky, to avoid saddle points and local maxima
• Then…it might be safe to proceed…
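Putting the two steps together, here is a minimal sketch of the full EM loop for this example (function and variable names are mine; the update formulas come directly from the slides — the M-step uses the closed-form root π = (x2+x5)/(x2+x5+x3+x4) of the derivative above):

```python
def em(pi=0.5, y1=125, x3=18, x4=20, x5=34, iters=10):
    """EM for the Rao (1973) multinomial with a hidden split of y1."""
    for i in range(iters):
        # E-step: expected complete-data count x2, given y1 and current pi.
        e_x2 = y1 * (0.25 * pi) / (0.5 + 0.25 * pi)
        # M-step: closed-form root of (x2+x5)/pi - (x3+x4)/(1-pi) = 0.
        pi = (e_x2 + x5) / (e_x2 + x5 + x3 + x4)
        print(f"iteration {i + 1}: pi = {pi:.3f}")
    return pi

em()  # pi = 0.608, 0.624, 0.626, 0.627, ... -- matches the slides
```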