

  1. Expectation Maximization CMSC 691 UMBC

  2. Outline: EM (Expectation Maximization): basic idea, three coins example, why EM works

  3. Expectation Maximization (EM): a two-step, iterative algorithm. 0. Assume some value for your parameters. 1. E-step: count under uncertainty (compute expectations). 2. M-step: maximize log-likelihood, assuming these uncertain counts.

  4. Expectation Maximization (EM): E-step. A two-step, iterative algorithm. 0. Assume some value for your parameters. 1. E-step: count under uncertainty, assuming these parameters: accumulate $\mathrm{count}(z_i, x_i)$ weighted by $p(z_i \mid x_i)$. 2. M-step: maximize log-likelihood, assuming these uncertain counts.

  5. Expectation Maximization (EM): E-step. A two-step, iterative algorithm. 0. Assume some value for your parameters. 1. E-step: count under uncertainty, assuming these parameters: accumulate $\mathrm{count}(z_i, x_i)$ weighted by $p(z_i \mid x_i)$. 2. M-step: maximize log-likelihood, assuming these uncertain counts. We've already seen this type of counting when computing the gradient in maxent models.

  6. Expectation Maximization (EM): M-step. A two-step, iterative algorithm. 0. Assume some value for your parameters. 1. E-step: count under uncertainty, assuming these parameters. 2. M-step: maximize log-likelihood, assuming these uncertain counts: compute new parameters $p^{(t+1)}(z)$ from the counts estimated under the current $p^{(t)}(z)$.
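
A minimal sketch of this loop in Python; the `e_step` and `m_step` callables are placeholders for whatever model is being estimated (they are not defined on the slides), and only the structure matters here:

```python
def expectation_maximization(data, init_params, e_step, m_step, n_iters=50):
    """Generic EM skeleton: alternate expected counts and re-estimation."""
    params = init_params                  # 0. assume some value for your parameters
    for _ in range(n_iters):
        counts = e_step(data, params)     # 1. E-step: expected ("uncertain") counts
        params = m_step(counts)           # 2. M-step: maximize log-likelihood given those counts
    return params
```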

  7.–12. EM Math: maximize the average log-likelihood of our complete data (z, x), averaged across all z and according to how likely our current model thinks z is:

      $\max_{\theta} \; \mathbb{E}_{z \sim p_{\theta^{(t)}}(\cdot \mid x)} \big[ \log p_{\theta}(z, x) \big]$

      Here $\theta^{(t)}$ are the current parameters, $p_{\theta^{(t)}}(\cdot \mid x)$ is the posterior distribution they define, and $\theta$ are the new parameters being chosen. The expectation is the E-step (count under uncertainty); the maximization over $\theta$ is the M-step (maximize log-likelihood).
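
Written out with the standard textbook name for this objective (the Q notation below is not used on the slides, but the quantities are the same):

$Q(\theta \mid \theta^{(t)}) = \mathbb{E}_{z \sim p_{\theta^{(t)}}(\cdot \mid x)}\big[\log p_{\theta}(z, x)\big] = \sum_{z} p_{\theta^{(t)}}(z \mid x)\, \log p_{\theta}(z, x), \qquad \theta^{(t+1)} = \arg\max_{\theta} Q(\theta \mid \theta^{(t)})$

Computing the posterior weights $p_{\theta^{(t)}}(z \mid x)$ is the E-step; solving the arg max is the M-step.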

  13. Why EM? Unsupervised Learning: NO labeled data (labeled data = human annotated, relatively small/few examples), only unlabeled data (raw, not annotated, plentiful). EM applies in this setting; EM/generative models in this case can be seen as a type of clustering.

  14.–17. Why EM? Semi-Supervised Learning: a small amount of labeled data (human annotated, relatively small/few examples) together with unlabeled data (raw, not annotated, plentiful). EM lets the model learn from both.

  18. Outline: EM (Expectation Maximization): basic idea, three coins example, why EM works

  19. Three Coins Example. Imagine three coins. Flip the 1st coin (penny). If heads: flip the 2nd coin (dollar coin). If tails: flip the 3rd coin (dime).

  20. Three Coins Example. Flip the 1st coin (penny): we don't observe this flip. If heads: flip the 2nd coin (dollar coin); if tails: flip the 3rd coin (dime): we only observe these flips (record the heads vs. tails outcome).

  21. Three Coins Example. Flip the 1st coin (penny): unobserved (think: part of speech? genre?). If heads: flip the 2nd coin (dollar coin); if tails: flip the 3rd coin (dime): observed (a, b, e, etc.; e.g., "We run the code" vs. "The run failed": the same observed word can come from different hidden states).

  22. Three Coins Example. Imagine three coins. Flip the 1st coin (penny): p(heads) = λ, p(tails) = 1 − λ. If heads: flip the 2nd coin (dollar coin): p(heads) = γ, p(tails) = 1 − γ. If tails: flip the 3rd coin (dime): p(heads) = ψ, p(tails) = 1 − ψ.

  23. Three Coins Example. Penny: p(heads) = λ, p(tails) = 1 − λ. Dollar coin: p(heads) = γ, p(tails) = 1 − γ. Dime: p(heads) = ψ, p(tails) = 1 − ψ. Three parameters to estimate: λ, γ, and ψ.

  24. Generative Story for Three Coins.
      $p(x_1, x_2, \ldots, x_N) = p(x_1)\,p(x_2)\cdots p(x_N) = \prod_i p(x_i)$
      Add complexity to better explain what we see:
      $p(z_1, x_1, z_2, x_2, \ldots, z_N, x_N) = p(z_1)\,p(x_1 \mid z_1) \cdots p(z_N)\,p(x_N \mid z_N) = \prod_i p(z_i)\,p(x_i \mid z_i)$
      Generative Story: λ = distribution over the penny, γ = distribution for the dollar coin, ψ = distribution over the dime.
      For item i = 1 to N: $z_i \sim$ Bernoulli(λ); if $z_i$ = H: $x_i \sim$ Bernoulli(γ); else: $x_i \sim$ Bernoulli(ψ).
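
A small simulation of this generative story (a sketch; numpy is just one convenient way to sample, and the λ = 0.6, γ = 0.8, ψ = 0.6 values are the ones the slides assume later):

```python
import numpy as np

def sample_three_coins(n, lam=0.6, gamma=0.8, psi=0.6, seed=0):
    """Sample n (z, x) pairs from the three-coins generative story."""
    rng = np.random.default_rng(seed)
    z = rng.random(n) < lam                 # penny (hidden): P(heads) = lambda
    p_heads = np.where(z, gamma, psi)       # dollar coin if heads, dime if tails
    x = rng.random(n) < p_heads             # observed flip (True = heads)
    return z, x

z, x = sample_three_coins(10)
```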

  25. Three Coins Example. Suppose all flips were observed: penny flips H H T H T H, second-coin flips H T H T T T. (Penny: p(heads) = λ, p(tails) = 1 − λ; dollar coin: p(heads) = γ, p(tails) = 1 − γ; dime: p(heads) = ψ, p(tails) = 1 − ψ.)

  26. Three Coins Example. If all flips were observed (penny H H T H T H, second coin H T H T T T), just count: penny p(heads) = 4/6, p(tails) = 2/6; dollar coin p(heads) = 1/4, p(tails) = 3/4; dime p(heads) = 1/2, p(tails) = 1/2.
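
A sketch of that fully observed maximum-likelihood estimate, assuming the pairing of penny flips to outcome flips shown above:

```python
# Fully observed case: estimate each coin's P(heads) by simple counting.
penny   = ['H', 'H', 'T', 'H', 'T', 'H']   # hidden coin, here assumed visible
outcome = ['H', 'T', 'H', 'T', 'T', 'T']   # dollar flip if penny=H, dime flip if penny=T

lam = penny.count('H') / len(penny)                      # 4/6
dollar = [x for z, x in zip(penny, outcome) if z == 'H']
dime   = [x for z, x in zip(penny, outcome) if z == 'T']
gamma = dollar.count('H') / len(dollar)                  # 1/4
psi   = dime.count('H') / len(dime)                      # 1/2
```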

  27. Three Coins Example (observed flips H T H T T T; penny flips hidden). But not all flips are observed → set parameter values: penny p(heads) = λ = .6, p(tails) = .4; dollar coin p(heads) = γ = .8, p(tails) = .2; dime p(heads) = ψ = .6, p(tails) = .4.

  28. Three Coins Example (same parameter values). Use these values to compute posteriors: p(heads | observed item H) = p(heads & H) / p(H); p(heads | observed item T) = p(heads & T) / p(T).

  29. Three Coins Example (same parameter values). Rewrite the joint using Bayes rule: p(heads | observed item H) = p(H | heads) p(heads) / p(H), where the denominator p(H) is the marginal likelihood.

  30. Three Coins Example (same parameter values). p(heads | observed item H) = p(H | heads) p(heads) / p(H), with p(H | heads) = .8 and p(T | heads) = .2.

  31. Three Coins Example (same parameter values). p(heads | observed item H) = p(H | heads) p(heads) / p(H), with p(H | heads) = .8, p(T | heads) = .2, and p(H) = p(H | heads) p(heads) + p(H | tails) p(tails) = .8 × .6 + .6 × .4.
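
Carrying that arithmetic through (the slide leaves it implicit): $p(\mathrm{H}) = 0.8 \cdot 0.6 + 0.6 \cdot 0.4 = 0.48 + 0.24 = 0.72$, so $p(\text{heads} \mid \mathrm{H}) = 0.48 / 0.72 \approx 0.667$.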

  32. Three Coins Example. Use posteriors to update parameters: p(heads | obs. H) = p(H | heads) p(heads) / p(H) = (.8 × .6) / (.8 × .6 + .6 × .4) ≈ 0.667; p(heads | obs. T) = p(T | heads) p(heads) / p(T) = (.2 × .6) / (.2 × .6 + .4 × .4) ≈ 0.429. Q: Is p(heads | obs. H) + p(heads | obs. T) = 1?

  33. Three Coins Example. Q: Is p(heads | obs. H) + p(heads | obs. T) = 1? A: No. They are posteriors given two different observations, not complementary events, so there is no reason for them to sum to 1 (here 0.667 + 0.429 ≈ 1.10).
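
A sketch of one full EM pass on this example in Python. The observed flips and the starting values λ = .6, γ = .8, ψ = .6 come from the slides; the rest (variable names, number of iterations) is only illustrative:

```python
# Observed flips (True = heads); the penny flips that generated them are hidden.
obs = [True, False, True, False, False, False]      # H T H T T T
lam, gamma, psi = 0.6, 0.8, 0.6                      # initial parameter guesses

for _ in range(10):                                  # a few EM iterations
    # E-step: posterior probability that the hidden penny was heads, per observation.
    post = []
    for x in obs:
        ph = (gamma if x else 1 - gamma) * lam       # p(x | heads) * p(heads)
        pt = (psi if x else 1 - psi) * (1 - lam)     # p(x | tails) * p(tails)
        post.append(ph / (ph + pt))
    # M-step: re-estimate parameters from the fractional ("uncertain") counts.
    lam   = sum(post) / len(obs)
    gamma = sum(p * x for p, x in zip(post, obs)) / sum(post)
    psi   = sum((1 - p) * x for p, x in zip(post, obs)) / sum(1 - p for p in post)
```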
