Expectation Maximization CMSC 691 UMBC
Outline: EM (Expectation Maximization) - basic idea; three coins example; why EM works
Expectation Maximization (EM): a two-step, iterative algorithm.
0. Assume some value for your parameters.
1. E-step: count under uncertainty (compute expectations).
2. M-step: maximize the log-likelihood, assuming these uncertain counts.
Expectation Maximization (EM): E-step. A two-step, iterative algorithm.
0. Assume some value for your parameters.
1. E-step: count under uncertainty, assuming these parameters: the expected count of each configuration $(z_i, w_i)$ is its posterior probability $p(z_i \mid w_i)$ under the current parameters. We've already seen this type of counting, when computing the gradient in maxent models.
2. M-step: maximize the log-likelihood, assuming these uncertain counts.
Expectation Maximization (EM): M-step. A two-step, iterative algorithm.
0. Assume some value for your parameters.
1. E-step: count under uncertainty, assuming these parameters.
2. M-step: maximize the log-likelihood, assuming these uncertain counts: the new distribution $p^{(t+1)}(z)$ is re-estimated from the counts estimated under the current $p^{(t)}(z)$.
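As a minimal illustrative sketch (not code from the course), the loop looks like this in Python; `e_step` and `m_step` are hypothetical callables supplied by whatever model is being trained:

```python
def run_em(data, init_params, e_step, m_step, n_iters=50):
    """Generic EM loop (illustrative sketch).

    e_step(params, data) -> expected ("uncertain") counts: counts of each
        (z, w) configuration weighted by the posterior p(z | w) under the
        current params.
    m_step(expected_counts) -> new params that maximize the log-likelihood
        as if the expected counts had actually been observed.
    """
    params = init_params
    for _ in range(n_iters):
        expected_counts = e_step(params, data)  # E-step: count under uncertainty
        params = m_step(expected_counts)        # M-step: maximize log-likelihood
    return params
```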
EM Math
Maximize the average log-likelihood of our complete data (z, w), averaged across all z and according to how likely our current model thinks z is:
$\max_{\theta} \; \mathbb{E}_{z \sim p^{(t)}(\cdot \mid w)} \left[ \log p_{\theta}(z, w) \right]$
Here $\theta$ denotes the new parameters, and $p^{(t)}(\cdot \mid w)$ is the posterior distribution under the current parameters.
E-step: count under uncertainty. M-step: maximize the log-likelihood.
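Written out as two alternating updates (standard EM notation, added here for clarity rather than taken verbatim from the slides):

E-step, using the current parameters to form the posterior: $q^{(t)}(z) = p_{\theta^{(t)}}(z \mid w)$

M-step, maximizing the expected complete-data log-likelihood: $\theta^{(t+1)} = \arg\max_{\theta} \; \mathbb{E}_{z \sim q^{(t)}}\left[\log p_{\theta}(z, w)\right]$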
Why EM? Unsupervised Learning
[Figure: a grid of examples all marked "?"; EM is applied to the unlabeled data alone.]
NO labeled data (human annotated; relatively small/few examples), only unlabeled data (raw, not annotated; plentiful). EM/generative models in this case can be seen as a type of clustering.
Why EM? Semi-Supervised Learning
[Figure: a small set of labeled examples (checkmarks) alongside a much larger set of unlabeled examples ("?"); EM lets the model learn from both.]
Labeled data: human annotated; relatively small/few examples. Unlabeled data: raw, not annotated; plentiful.
Outline: EM (Expectation Maximization) - basic idea; three coins example; why EM works
Three Coins Example
Imagine three coins. Flip the 1st coin (penny). If heads: flip the 2nd coin (dollar coin). If tails: flip the 3rd coin (dime).
Three Coins Example
Flip the 1st coin (penny): we don't observe this flip. If heads: flip the 2nd coin (dollar coin); if tails: flip the 3rd coin (dime): we only observe these (record the heads vs. tails outcome).
Three Coins Example
The unobserved penny flip is like a hidden label: part of speech? genre? The observed flip is like the data we actually see: the items a, b, e, etc., e.g., "We run the code" vs. "The run failed".
Three Coins Example
Flip the 1st coin (penny): p(heads) = λ, p(tails) = 1 - λ. If heads: flip the 2nd coin (dollar coin): p(heads) = γ, p(tails) = 1 - γ. If tails: flip the 3rd coin (dime): p(heads) = ψ, p(tails) = 1 - ψ.
Three Coins Example
Penny: p(heads) = λ, p(tails) = 1 - λ. Dollar coin: p(heads) = γ, p(tails) = 1 - γ. Dime: p(heads) = ψ, p(tails) = 1 - ψ.
Three parameters to estimate: λ, γ, and ψ.
Generative Story for Three Coins
Without latent variables:
$p(w_1, w_2, \ldots, w_N) = p(w_1)\, p(w_2) \cdots p(w_N) = \prod_i p(w_i)$
Add complexity to better explain what we see:
$p(z_1, w_1, z_2, w_2, \ldots, z_N, w_N) = p(z_1)\, p(w_1 \mid z_1) \cdots p(z_N)\, p(w_N \mid z_N) = \prod_i p(w_i \mid z_i)\, p(z_i)$
Generative Story: λ = distribution over the penny, γ = distribution for the dollar coin, ψ = distribution over the dime.
for item i = 1 to N:
  z_i ~ Bernoulli(λ)
  if z_i = H: w_i ~ Bernoulli(γ)
  else: w_i ~ Bernoulli(ψ)
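A small Python simulation of this generative story (a sketch; the function name and the default parameter values are illustrative, not from the slides):

```python
import random

def sample_three_coins(n, lam=0.6, gamma=0.8, psi=0.6, seed=0):
    """Sample n (z_i, w_i) pairs from the three-coins generative story.

    lam = p(penny = H), gamma = p(dollar = H), psi = p(dime = H).
    z_i is the hidden penny flip; w_i is the observed flip.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        z = 'H' if rng.random() < lam else 'T'        # z_i ~ Bernoulli(lambda)
        if z == 'H':
            w = 'H' if rng.random() < gamma else 'T'  # w_i ~ Bernoulli(gamma)
        else:
            w = 'H' if rng.random() < psi else 'T'    # w_i ~ Bernoulli(psi)
        pairs.append((z, w))
    return pairs

# e.g. sample_three_coins(6) returns six (penny, observed) pairs; only the
# second element of each pair would be visible to the learner.
```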
Three Coins Example
Penny flips (hidden): H H T H T H. Observed flips: H T H T T T.
If all flips were observed:
Penny: p(heads) = λ, p(tails) = 1 - λ. Dollar coin: p(heads) = γ, p(tails) = 1 - γ. Dime: p(heads) = ψ, p(tails) = 1 - ψ.
Three Coins Example
Penny flips (hidden): H H T H T H. Observed flips: H T H T T T.
If all flips were observed, estimate by counting:
Penny: p(heads) = 4/6, p(tails) = 2/6. Dollar coin: p(heads) = 1/4, p(tails) = 3/4. Dime: p(heads) = 1/2, p(tails) = 1/2.
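When both rows are visible, maximum likelihood estimation is just counting; a short Python sketch reproducing the fractions above:

```python
penny = ['H', 'H', 'T', 'H', 'T', 'H']   # hidden coin (pretend it is observed here)
obs   = ['H', 'T', 'H', 'T', 'T', 'T']   # observed coin

lam   = penny.count('H') / len(penny)                                   # 4/6
gamma = sum(z == 'H' and w == 'H' for z, w in zip(penny, obs)) \
        / penny.count('H')                                              # 1/4
psi   = sum(z == 'T' and w == 'H' for z, w in zip(penny, obs)) \
        / penny.count('T')                                              # 1/2
```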
Three Coins Example
Penny flips (not observed): H H T H T H. Observed flips: H T H T T T.
But not all flips are observed, so set parameter values:
Penny: p(heads) = λ = .6, p(tails) = .4. Dollar coin: p(heads) = .8, p(tails) = .2. Dime: p(heads) = .6, p(tails) = .4.
Use these values to compute posteriors, rewriting the joint using Bayes' rule:
$p(\text{heads} \mid \text{observed H}) = \frac{p(\text{heads} \,\&\, \text{H})}{p(\text{H})} = \frac{p(\text{H} \mid \text{heads})\, p(\text{heads})}{p(\text{H})}$, and similarly $p(\text{heads} \mid \text{observed T}) = \frac{p(\text{T} \mid \text{heads})\, p(\text{heads})}{p(\text{T})}$.
Here $p(\text{H} \mid \text{heads}) = .8$ and $p(\text{T} \mid \text{heads}) = .2$, and the marginal likelihood is
$p(\text{H}) = p(\text{H} \mid \text{heads})\, p(\text{heads}) + p(\text{H} \mid \text{tails})\, p(\text{tails}) = .8 \cdot .6 + .6 \cdot .4 = .72$.
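A quick numeric check of the marginal and posterior above, using the assumed parameter values (a sketch, not course code):

```python
lam, gamma, psi = 0.6, 0.8, 0.6     # assumed p(heads) for penny, dollar, dime

p_H = gamma * lam + psi * (1 - lam)      # p(H) = .8*.6 + .6*.4 = 0.72
p_heads_given_H = gamma * lam / p_H      # posterior p(heads | obs. H), about 0.667
print(p_H, p_heads_given_H)
```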
Three Coins Example
Penny flips (not observed): H H T H T H. Observed flips: H T H T T T.
Use the posteriors to update the parameters:
$p(\text{heads} \mid \text{obs. H}) = \frac{p(\text{H} \mid \text{heads})\, p(\text{heads})}{p(\text{H})} = \frac{.8 \cdot .6}{.8 \cdot .6 + .6 \cdot .4} \approx 0.667$
$p(\text{heads} \mid \text{obs. T}) = \frac{p(\text{T} \mid \text{heads})\, p(\text{heads})}{p(\text{T})} = \frac{.2 \cdot .6}{.2 \cdot .6 + .4 \cdot .4} \approx 0.429$
Q: Is p(heads | obs. H) + p(heads | obs. T) = 1?
A: No. The two posteriors condition on different observations, so there is no reason for them to sum to 1.
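Putting the E-step and M-step together for the three coins: the sketch below runs a few full EM iterations on the six observed flips, re-estimating λ, γ, and ψ from the posterior-weighted ("uncertain") counts. The update formulas are the standard mixture-of-Bernoullis ones, not copied verbatim from the slides.

```python
obs = ['H', 'T', 'H', 'T', 'T', 'T']   # only the second-coin outcomes are observed
lam, gamma, psi = 0.6, 0.8, 0.6        # initial guesses for penny, dollar, dime

for _ in range(20):
    # E-step: posterior probability that the hidden penny came up heads
    post = []
    for w in obs:
        joint_h = (gamma if w == 'H' else 1 - gamma) * lam       # p(penny=H, w)
        joint_t = (psi   if w == 'H' else 1 - psi) * (1 - lam)   # p(penny=T, w)
        post.append(joint_h / (joint_h + joint_t))               # p(penny=H | w)

    # M-step: re-estimate parameters from the expected (fractional) counts
    lam   = sum(post) / len(obs)
    gamma = sum(q for q, w in zip(post, obs) if w == 'H') / sum(post)
    psi   = sum(1 - q for q, w in zip(post, obs) if w == 'H') / sum(1 - q for q in post)

print(lam, gamma, psi)
```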