
Probabilistic Modeling and Expectation Maximization CMSC 678 UMBC - PowerPoint PPT Presentation

Probabilistic Modeling and Expectation Maximization, CMSC 678, UMBC. Course overview (so far): basics of probability; maximum entropy models; requirements to be a distribution (proportionality and normalization); meanings of feature functions and weights.


  1. Learning Parameters for the Die Model. p(w_1, w_2, …, w_N) = p(w_1) p(w_2) ⋯ p(w_N) = ∏_i p(w_i). Maximize the (log-)likelihood to learn the probability parameters. Q: Why is maximizing log-likelihood a reasonable thing to do? A: It develops a good model for what we observe. Q: (For discrete observations) what loss function do we minimize to maximize log-likelihood?

  2. Learning Parameters for the Die Model. p(w_1, w_2, …, w_N) = ∏_i p(w_i). Maximize the (log-)likelihood to learn the probability parameters. Q: Why is maximizing log-likelihood a reasonable thing to do? A: It develops a good model for what we observe. Q: (For discrete observations) what loss function do we minimize to maximize log-likelihood? A: Cross-entropy.

  3. Learning Parameters for the Die Model: Maximum Likelihood (Intuition). p(w_1, w_2, …, w_N) = ∏_i p(w_i); maximize the (log-)likelihood to learn the probability parameters. If you observe these 9 rolls… what are "reasonable" estimates for p(w)? p(1) = ? p(2) = ? p(3) = ? p(4) = ? p(5) = ? p(6) = ?

  4. Learning Parameters for the Die Model: Maximum Likelihood (Intuition). p(w_1, w_2, …, w_N) = ∏_i p(w_i); maximize the (log-)likelihood to learn the probability parameters. If you observe these 9 rolls… the maximum likelihood estimates are p(1) = 2/9, p(2) = 1/9, p(3) = 1/9, p(4) = 3/9, p(5) = 1/9, p(6) = 1/9.
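A minimal sketch of the count-and-normalize estimate above. The slide's actual roll sequence is shown only as an image, so the list here is just one hypothetical sequence consistent with the stated counts:

```python
from collections import Counter

# One 9-roll sequence consistent with the estimates on the slide (the order is made up).
rolls = [1, 1, 2, 3, 4, 4, 4, 5, 6]

counts = Counter(rolls)
N = len(rolls)
mle = {face: counts[face] / N for face in range(1, 7)}
print(mle)  # {1: 0.22.., 2: 0.11.., 3: 0.11.., 4: 0.33.., 5: 0.11.., 6: 0.11..}
```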

  5. Learning Parameters for the Die Model: Maximum Likelihood (Math). N different (independent) rolls: p(w_1, w_2, …, w_N) = ∏_i p(w_i). Generative story: for roll i = 1 to N: w_i ~ Cat(θ). Example data: w_1 = 1, w_2 = 5, w_3 = 4, ⋯ Maximize the log-likelihood: ℒ(θ) = ∑_i log p_θ(w_i) = ∑_i log θ_{w_i}.

  6. Learning Parameters for the Die Model: Maximum Likelihood (Math). N different (independent) rolls: p(w_1, w_2, …, w_N) = ∏_i p(w_i). Generative story: for roll i = 1 to N: w_i ~ Cat(θ). Maximize the log-likelihood ℒ(θ) = ∑_i log θ_{w_i}. Q: What's an easy way to maximize this, as written exactly (even without calculus)?

  7. Learning Parameters for the Die Model: Maximum Likelihood (Math). Maximize the log-likelihood ℒ(θ) = ∑_i log θ_{w_i}. Q: What's an easy way to maximize this, as written exactly (even without calculus)? A: Just keep increasing θ_k (we know θ must be a distribution, but that constraint is not specified in the objective).

  8. Learning Parameters for the Die Model: Maximum Likelihood (Math). Maximize the log-likelihood with distribution constraints: ℒ(θ) = ∑_i log θ_{w_i} s.t. ∑_{k=1}^{6} θ_k = 1. (We could also include the inequality constraints 0 ≤ θ_k, but they complicate the problem and, right now, are not needed.) Solve using Lagrange multipliers.

  9. Learning Parameters for the Die Model: Maximum Likelihood (Math). The Lagrangian is ℱ(θ) = ∑_i log θ_{w_i} − λ (∑_{k=1}^{6} θ_k − 1), with partial derivatives ∂ℱ/∂θ_k = ∑_{i: w_i = k} 1/θ_k − λ and ∂ℱ/∂λ = −∑_{k=1}^{6} θ_k + 1.

  10. Learning Parameters for the Die Model: Maximum Likelihood (Math). Setting the derivatives of ℱ(θ) = ∑_i log θ_{w_i} − λ (∑_{k=1}^{6} θ_k − 1) to zero gives θ_k = (∑_{i: w_i = k} 1) / λ, with the optimal λ determined by the constraint ∑_{k=1}^{6} θ_k = 1.

  11. Learning Parameters for the Die Model: Maximum Likelihood (Math). Since ∑_k ∑_{i: w_i = k} 1 = N, the constraint ∑_{k=1}^{6} θ_k = 1 makes the optimal λ = N, so θ_k = (∑_{i: w_i = k} 1) / N.
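A quick numeric check (not from the slides) that the count/N solution satisfies the stationarity conditions just derived, using the hypothetical counts from the 9-roll example:

```python
import numpy as np

counts = np.array([2, 1, 1, 3, 1, 1], dtype=float)  # counts for faces 1..6
N = counts.sum()

theta = counts / N   # candidate MLE: theta_k = count_k / N
lam = N              # optimal Lagrange multiplier

# Gradient of F(theta) = sum_i log theta_{w_i} - lam * (sum_k theta_k - 1):
# dF/dtheta_k = count_k / theta_k - lam, which vanishes at the MLE.
print(counts / theta - lam)  # [0. 0. 0. 0. 0. 0.]
print(theta.sum())           # 1.0, so the equality constraint holds
```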

  12. Outline. Latent and probabilistic modeling: generative modeling; Example 1: a model of rolling a die; Example 2: a model of conditional die rolls. EM (Expectation Maximization): basic idea; three coins example; why EM works.

  13. Example: Conditionally Rolling a Die. p(w_1, w_2, …, w_N) = ∏_i p(w_i). Add complexity to better explain what we see: p(z_1, w_1, z_2, w_2, …, z_N, w_N) = p(z_1) p(w_1 | z_1) ⋯ p(z_N) p(w_N | z_N) = ∏_i p(w_i | z_i) p(z_i).

  14. Example: Conditionally Rolling a Die. p(z_1, w_1, …, z_N, w_N) = ∏_i p(w_i | z_i) p(z_i). First flip a coin… z_1 = T, z_2 = H, ⋯

  15. Example: Conditionally Rolling a Die. p(z_1, w_1, …, z_N, w_N) = ∏_i p(w_i | z_i) p(z_i). First flip a coin… then roll a different die depending on the coin flip: z_1 = T, w_1 = 1; z_2 = H, w_2 = 5; ⋯

  16. Learning in the Conditional Die Roll Model: Maximize the (Log-)Likelihood. p(w_1, …, w_N) = ∏_i p(w_i); add complexity to better explain what we see: p(z_1, w_1, …, z_N, w_N) = ∏_i p(w_i | z_i) p(z_i). If you observe the z_i values, this is easy!

  17. Learning in the Conditional Die Roll Model: Maximize the (Log-)Likelihood. p(z_1, w_1, …, z_N, w_N) = ∏_i p(w_i | z_i) p(z_i). If you observe the z_i values, this is easy! First: write the generative story. λ = distribution over the coin (z); γ^(H) = distribution for the die when the coin comes up heads; γ^(T) = distribution for the die when the coin comes up tails. For item i = 1 to N: z_i ~ Bernoulli(λ), w_i ~ Cat(γ^(z_i)).
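A small sketch of this generative story in Python; the specific parameter values are made up for illustration, since the slide does not fix them:

```python
import numpy as np

rng = np.random.default_rng(0)

lam = 0.6                                           # hypothetical p(z = H)
gamma = {"H": np.full(6, 1 / 6),                    # die used after heads (fair)
         "T": np.array([.5, .1, .1, .1, .1, .1])}   # die used after tails (loaded)

def sample(N):
    """Sample (z_i, w_i) pairs: z_i ~ Bernoulli(lam), w_i ~ Cat(gamma^(z_i))."""
    data = []
    for _ in range(N):
        z = "H" if rng.random() < lam else "T"
        w = int(rng.choice(np.arange(1, 7), p=gamma[z]))
        data.append((z, w))
    return data

print(sample(5))
```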

  18. Learning in the Conditional Die Roll Model: Maximize the (Log-)Likelihood. If you observe the z_i values, this is easy! First: write the generative story (λ = distribution over the coin z; γ^(H), γ^(T) = die distributions for heads and tails; for item i = 1 to N: z_i ~ Bernoulli(λ), w_i ~ Cat(γ^(z_i))). Second: turn the generative story into an objective, with Lagrange-multiplier constraints: ℱ(θ) = ∑_i (log λ_{z_i} + log γ^{(z_i)}_{w_i}) − α (∑_{k=1}^{2} λ_k − 1) − ∑_{k=1}^{2} β_k (∑_{j=1}^{6} γ^{(k)}_j − 1).

  19. Learning in the Conditional Die Roll Model: Maximize the (Log-)Likelihood. If you observe the z_i values, this is easy! Generative story: z_i ~ Bernoulli(λ), w_i ~ Cat(γ^(z_i)); objective: ℱ(θ) = ∑_i (log λ_{z_i} + log γ^{(z_i)}_{w_i}) − α (∑_{k=1}^{2} λ_k − 1) − ∑_{k=1}^{2} β_k (∑_{j=1}^{6} γ^{(k)}_j − 1).

  20. Learning in the Conditional Die Roll Model: Maximize the (Log-)Likelihood. p(z_1, w_1, …, z_N, w_N) = ∏_i p(w_i | z_i) p(z_i). If you observe the z_i values, this is easy (generative story: z_i ~ Bernoulli(λ), w_i ~ Cat(γ^(z_i)); objective: ℱ(θ) = ∑_i (log λ_{z_i} + log γ^{(z_i)}_{w_i}) − α (∑_{k=1}^{2} λ_k − 1) − ∑_{k=1}^{2} β_k (∑_{j=1}^{6} γ^{(k)}_j − 1)). But if you don't observe the z_i values, this is not easy!
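For the easy, fully observed case, the maximizer of the constrained objective above is again count-and-normalize. A minimal sketch, assuming (z, w) pairs in the format produced by the earlier sampler:

```python
from collections import Counter, defaultdict

def fit_fully_observed(pairs):
    """Closed-form MLEs when both z_i and w_i are observed."""
    N = len(pairs)
    lam = sum(1 for z, _ in pairs if z == "H") / N        # p(z = H)

    die_counts = defaultdict(Counter)
    for z, w in pairs:
        die_counts[z][w] += 1
    gamma = {z: [die_counts[z][face] / sum(die_counts[z].values())
                 for face in range(1, 7)]
             for z in ("H", "T")}
    return lam, gamma

# e.g. lam_hat, gamma_hat = fit_fully_observed(sample(10000))
```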

  21. Example: Conditionally Rolling a Die. p(z_1, w_1, …, z_N, w_N) = ∏_i p(w_i | z_i) p(z_i). Goal: maximize the (log-)likelihood. We don't actually observe these z values; we just see the items w. If we did observe z, estimating the probability parameters would be easy… but we don't! :(

  22. Example: Conditionally Rolling a Die. p(z_1, w_1, …, z_N, w_N) = ∏_i p(w_i | z_i) p(z_i). Goal: maximize the (log-)likelihood. We don't actually observe these z values; we just see the items w. If we did observe z, estimating the probability parameters would be easy… but we don't! :( If we knew the probability parameters, then we could estimate z and evaluate the likelihood… but we don't! :(

  23. Example: Conditionally Rolling a Die. p(z_1, w_1, …, z_N, w_N) = ∏_i p(w_i | z_i) p(z_i). We don't actually observe these z values. Goal: maximize the marginalized (log-)likelihood.

  24. Example: Conditionally Rolling a Die. p(z_1, w_1, …, z_N, w_N) = ∏_i p(w_i | z_i) p(z_i). We don't actually observe these z values. Goal: maximize the marginalized (log-)likelihood of the observed items w.

  25. Example: Conditionally Rolling a Die. p(z_1, w_1, …, z_N, w_N) = ∏_i p(w_i | z_i) p(z_i). We don't actually observe these z values. Goal: maximize the marginalized (log-)likelihood: each observed w could have been generated together with any value of the latent variable, i.e. any of the joint outcomes z_1 & w, z_2 & w, z_3 & w, z_4 & w, …

  26. Example: Conditionally Rolling a Die. p(z_1, w_1, …, z_N, w_N) = ∏_i p(w_i | z_i) p(z_i). We don't actually observe these z values, so we sum over them. Goal: maximize the marginalized (log-)likelihood p(w_1, w_2, …, w_N) = (∑_{z_1} p(z_1, w_1)) (∑_{z_2} p(z_2, w_2)) ⋯ (∑_{z_N} p(z_N, w_N)).

  27. Example: Conditionally Rolling a Die. p(z_1, w_1, …, z_N, w_N) = p(z_1) p(w_1 | z_1) ⋯ p(z_N) p(w_N | z_N). Goal: maximize the marginalized (log-)likelihood p(w_1, w_2, …, w_N) = (∑_{z_1} p(z_1, w_1)) ⋯ (∑_{z_N} p(z_N, w_N)). If we did observe z, estimating the probability parameters would be easy… but we don't! :( If we knew the probability parameters, then we could estimate z and evaluate the likelihood… but we don't! :(
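The marginalized likelihood above, written out for the two-valued z of this example; a sketch assuming the same lam/gamma layout as the earlier sampler:

```python
import math

def marginal_log_likelihood(ws, lam, gamma):
    """log p(w_1..w_N) = sum_i log( sum_z p(z) p(w_i | z) ):
    the unobserved coin flip is summed out for every item."""
    total = 0.0
    for w in ws:
        p_w = lam * gamma["H"][w - 1] + (1 - lam) * gamma["T"][w - 1]
        total += math.log(p_w)
    return total

# e.g. marginal_log_likelihood([1, 5, 4], lam, gamma)
```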

  28. If we knew the probability parameters, then we could estimate z and evaluate the likelihood… but we don't! :( If we did observe z, estimating the probability parameters would be easy… but we don't! :(

  29. If we knew the probability parameters, then we could estimate z and evaluate the likelihood… but we don't! :( If we did observe z, estimating the probability parameters would be easy… but we don't! :(

  30. If we knew the probability parameters, then we could estimate z and evaluate the likelihood… but we don't! :( If we did observe z, estimating the probability parameters would be easy… but we don't! :( Expectation Maximization gives model estimation the needed "spark." http://blog.innotas.com/wp-content/uploads/2015/08/chicken-or-egg-cropped1.jpg

  31. Outline. Latent and probabilistic modeling: generative modeling; Example 1: a model of rolling a die; Example 2: a model of conditional die rolls. EM (Expectation Maximization): basic idea; three coins example; why EM works.

  32. Expectation Maximization (EM). 0. Assume some value for your parameters. Two-step, iterative algorithm: 1. E-step: count under uncertainty (compute expectations). 2. M-step: maximize the log-likelihood, assuming these uncertain counts.

  33. Expectation Maximization (EM): E-step. 0. Assume some value for your parameters. Two-step, iterative algorithm: 1. E-step: count under uncertainty, assuming these parameters: accumulate the counts count(z_i, w_i), weighting each by the model's probability of z_i. 2. M-step: maximize the log-likelihood, assuming these uncertain counts.

  34. Expectation Maximization (EM): E-step. 1. E-step: count under uncertainty, assuming these parameters: accumulate the counts count(z_i, w_i), weighting each by the model's probability of z_i. 2. M-step: maximize the log-likelihood, assuming these uncertain counts. We've already seen this type of counting, when computing the gradient in maxent models.

  35. Expectation Maximization (EM): M-step. 0. Assume some value for your parameters. Two-step, iterative algorithm: 1. E-step: count under uncertainty, assuming these parameters. 2. M-step: maximize the log-likelihood, assuming these uncertain (estimated) counts: update the parameters, p^(t)(z) → p^(t+1)(z).

  36. EM Math. max 𝔼_{z ∼ p_{θ^(t)}(·|w)} [log p_θ(z, w)]: the average log-likelihood of our complete data (z, w), averaged across all z according to how likely our current model thinks z is.

  37. EM Math. Maximize the average log-likelihood of our complete data (z, w), averaged across all z according to how likely our current model thinks z is: max_θ 𝔼_{z ∼ p_{θ^(t)}(·|w)} [log p_θ(z, w)].

  38. EM Math. Maximize the average log-likelihood of our complete data (z, w), averaged across all z according to how likely our current model thinks z is: max_θ 𝔼_{z ∼ p_{θ^(t)}(·|w)} [log p_θ(z, w)].

  39. EM Math. max_θ 𝔼_{z ∼ p_{θ^(t)}(·|w)} [log p_θ(z, w)]: θ^(t) are the current parameters, and p_{θ^(t)}(·|w) is the posterior distribution over z.

  40. EM Math. max_θ 𝔼_{z ∼ p_{θ^(t)}(·|w)} [log p_θ(z, w)]: θ^(t) are the current parameters, p_{θ^(t)}(·|w) is the posterior distribution over z, and θ (in the max and inside log p_θ) are the new parameters.

  41. EM Math. max_θ 𝔼_{z ∼ p_{θ^(t)}(·|w)} [log p_θ(z, w)]. The expectation under the posterior is the E-step (count under uncertainty); the maximization over θ is the M-step (maximize the log-likelihood).
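To make the two steps concrete, here is a minimal EM loop for the coin-then-die model from earlier. It is a sketch: the initialization, iteration count, and variable names are choices of this example, not of the slides:

```python
import numpy as np

def em_coin_die(ws, n_iters=50, seed=0):
    """EM for the coin-then-die model. ws: iterable of die faces in {1..6}."""
    rng = np.random.default_rng(seed)
    ws = np.asarray(ws)
    W = np.eye(6)[ws - 1]                    # one-hot faces, shape (N, 6)

    lam = 0.5
    gamma_H = rng.dirichlet(np.ones(6))      # random initialization of the two dice
    gamma_T = rng.dirichlet(np.ones(6))

    for _ in range(n_iters):
        # E-step: posterior p(z_i = H | w_i) under the current parameters
        ph = lam * gamma_H[ws - 1]
        pt = (1 - lam) * gamma_T[ws - 1]
        r = ph / (ph + pt)

        # M-step: re-estimate each parameter from the expected (fractional) counts
        lam = r.mean()
        gamma_H = (r @ W) / r.sum()
        gamma_T = ((1 - r) @ W) / (1 - r).sum()
    return lam, gamma_H, gamma_T
```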

  42. Why EM? Unsupervised Learning. There is NO labeled data (human annotated; relatively small/few examples); there is only unlabeled data (raw, not annotated; plentiful). EM/generative models in this case can be seen as a type of clustering.

  43. Why EM? Semi-Supervised Learning. A small amount of labeled data (human annotated; relatively small/few examples) together with unlabeled data (raw, not annotated; plentiful).

  44. Why EM? Semi-Supervised Learning. A small amount of labeled data (human annotated; relatively small/few examples) together with unlabeled data (raw, not annotated; plentiful); EM combines the two.

  45. Why EM? Semi-Supervised Learning. A small amount of labeled data (human annotated; relatively small/few examples) together with unlabeled data (raw, not annotated; plentiful).

  46. Why EM? Semi-Supervised Learning. The labeled and unlabeled examples are pooled and fit together with EM.

  47. Outline. Latent and probabilistic modeling: generative modeling; Example 1: a model of rolling a die; Example 2: a model of conditional die rolls. EM (Expectation Maximization): basic idea; three coins example; why EM works.

  48. Three Coins Example. Imagine three coins. Flip the 1st coin (a penny). If heads: flip the 2nd coin (a dollar coin). If tails: flip the 3rd coin (a dime).

  49. Three Coins Example. Imagine three coins. Flip the 1st coin (a penny): you don't observe this flip. If heads: flip the 2nd coin (a dollar coin); if tails: flip the 3rd coin (a dime): you only observe these second flips (record the heads vs. tails outcome).

  50. Three Coins Example. Imagine three coins. Flip the 1st coin (a penny): unobserved, like a part of speech or a genre. If heads: flip the 2nd coin (a dollar coin); if tails: flip the 3rd coin (a dime): observed, like the symbols a, b, e, etc., e.g. "We run the code" vs. "The run failed".

  51. Three Coins Example. Imagine three coins. Penny: p(heads) = λ, p(tails) = 1 − λ. If heads, flip the dollar coin: p(heads) = γ, p(tails) = 1 − γ. If tails, flip the dime: p(heads) = ψ, p(tails) = 1 − ψ.

  52. Three Coins Example. Imagine three coins: a penny with p(heads) = λ, a dollar coin with p(heads) = γ, and a dime with p(heads) = ψ. Three parameters to estimate: λ, γ, and ψ.

  53. Generative Story for Three Coins. p(w_1, …, w_N) = ∏_i p(w_i); add complexity to better explain what we see: p(z_1, w_1, …, z_N, w_N) = ∏_i p(w_i | z_i) p(z_i). Generative story: λ = distribution for the penny; γ = distribution for the dollar coin; ψ = distribution for the dime. For item i = 1 to N: z_i ~ Bernoulli(λ); if z_i = H: w_i ~ Bernoulli(γ); else: w_i ~ Bernoulli(ψ).
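A quick sketch of this generative story; the parameter values in the usage line are the ones assumed a few slides later:

```python
import random

def sample_three_coins(N, lam, gamma, psi, seed=0):
    """Sample N observed flips; the penny flips z_i stay hidden."""
    rng = random.Random(seed)
    observed = []
    for _ in range(N):
        penny_heads = rng.random() < lam           # z_i ~ Bernoulli(lam), unobserved
        p_heads = gamma if penny_heads else psi    # choose dollar coin or dime
        observed.append("H" if rng.random() < p_heads else "T")
    return observed

# e.g. sample_three_coins(6, lam=0.6, gamma=0.8, psi=0.6)
```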

  54. Three Coins Example. Penny flips: H H T H T H; observed (second) flips: H T H T T T. If all flips were observed… Penny: p(heads) = λ, p(tails) = 1 − λ; dollar coin: p(heads) = γ, p(tails) = 1 − γ; dime: p(heads) = ψ, p(tails) = 1 − ψ.

  55. Three Coins Example. Penny flips: H H T H T H; observed (second) flips: H T H T T T. If all flips were observed, count and normalize: penny p(heads) = 4/6, p(tails) = 2/6; dollar coin p(heads) = 1/4, p(tails) = 3/4; dime p(heads) = 1/2, p(tails) = 1/2.

  56. Three Coins Example. Observed flips: H T H T T T. But not all flips are observed (the penny is hidden), so set parameter values: penny p(heads) = λ = .6, p(tails) = .4; dollar coin p(heads) = .8, p(tails) = .2; dime p(heads) = .6, p(tails) = .4.

  57. Three Coins Example. Observed flips: H T H T T T; parameters: penny λ = .6, dollar coin .8/.2, dime .6/.4. Use these values to compute posteriors: p(heads | observed item H) = p(heads & H) / p(H) and p(heads | observed item T) = p(heads & T) / p(T).

  58. Three Coins Example. Observed flips: H T H T T T; parameters: penny λ = .6, dollar coin .8/.2, dime .6/.4. Use these values to compute posteriors, rewriting the joint with Bayes' rule: p(heads | observed item H) = p(H | heads) p(heads) / p(H), where the denominator p(H) is the marginal likelihood.

  59. Three Coins Example. Observed flips: H T H T T T; parameters: penny λ = .6, dollar coin .8/.2, dime .6/.4. p(heads | observed item H) = p(H | heads) p(heads) / p(H), with p(H | heads) = .8 and p(T | heads) = .2 (the dollar coin).

  60. Three Coins Example. Observed flips: H T H T T T; parameters: penny λ = .6, dollar coin .8/.2, dime .6/.4. p(heads | observed item H) = p(H | heads) p(heads) / p(H), where p(H) = p(H | heads) p(heads) + p(H | tails) p(tails) = .8 · .6 + .6 · .4 = .72.
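The same posterior computation in code, using the parameter values just set (penny .6, dollar coin .8, dime .6); note that for an observed T the denominator uses p(T | tails) = 1 − .6 = .4:

```python
def posterior_penny_heads(obs, lam=0.6, gamma=0.8, psi=0.6):
    """p(penny = heads | observed flip), via Bayes' rule."""
    p_obs_given_heads = gamma if obs == "H" else 1 - gamma   # dollar coin
    p_obs_given_tails = psi if obs == "H" else 1 - psi       # dime
    p_obs = p_obs_given_heads * lam + p_obs_given_tails * (1 - lam)
    return p_obs_given_heads * lam / p_obs

print(posterior_penny_heads("H"))  # 0.48 / 0.72 ≈ 0.667
print(posterior_penny_heads("T"))  # 0.12 / 0.28 ≈ 0.43
```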

  61. Three Coins Example. Observed flips: H T H T T T. Use the posteriors to update parameters: p(heads | obs. H) = p(H | heads) p(heads) / p(H) = (.8 · .6) / (.8 · .6 + .6 · .4) ≈ 0.667, and p(heads | obs. T) = p(T | heads) p(heads) / p(T) = (.2 · .6) / (.2 · .6 + .4 · .4) ≈ 0.43. Q: Is p(heads | obs. H) + p(heads | obs. T) = 1?

  62. Three Coins Example. p(heads | obs. H) ≈ 0.667 and p(heads | obs. T) ≈ 0.43. Q: Is p(heads | obs. H) + p(heads | obs. T) = 1? A: No.

  63. Three Coins Example. p(heads | obs. H) ≈ 0.667 and p(heads | obs. T) ≈ 0.43. (In general, p(heads | obs. H) and p(heads | obs. T) do NOT sum to 1.) Fully observed setting: p(heads) = (# heads from the penny) / (# total flips of the penny). Our partially observed setting: p(heads) = (# expected heads from the penny) / (# total flips of the penny).

  64. Three Coins Example. p(heads | obs. H) ≈ 0.667 and p(heads | obs. T) ≈ 0.43. Our partially observed setting: p^(t+1)(heads) = (# expected heads from the penny) / (# total flips of the penny) = 𝔼_{p^(t)}[# heads from the penny] / (# total flips of the penny).

  65. Three Coins Example. p(heads | obs. H) ≈ 0.667 and p(heads | obs. T) ≈ 0.43. Our partially observed setting: p^(t+1)(heads) = 𝔼_{p^(t)}[# heads from the penny] / (# total flips of the penny) = (2 · p(heads | obs. H) + 4 · p(heads | obs. T)) / 6 ≈ 0.51.
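Putting the E-step and M-step for the penny parameter into code; a sketch matching the numbers above (re-estimating γ and ψ would reuse the same posteriors):

```python
def em_step_three_coins(observed, lam, gamma, psi):
    """One EM update of the penny parameter lam."""
    # E-step: posterior p(penny = heads | each observed flip)
    post = []
    for obs in observed:
        ph = (gamma if obs == "H" else 1 - gamma) * lam
        pt = (psi if obs == "H" else 1 - psi) * (1 - lam)
        post.append(ph / (ph + pt))
    # M-step: expected number of penny heads over total penny flips
    return sum(post) / len(observed)

observed = ["H", "T", "H", "T", "T", "T"]   # the six observed flips (2 H, 4 T)
print(em_step_three_coins(observed, lam=0.6, gamma=0.8, psi=0.6))  # ≈ 0.51
```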

  66. Expectation Maximization (EM). 0. Assume some value for your parameters. Two-step, iterative algorithm: 1. E-step: count under uncertainty (compute expectations). 2. M-step: maximize the log-likelihood, assuming these uncertain counts.

  67. Outline. Latent and probabilistic modeling: generative modeling; Example 1: a model of rolling a die; Example 2: a model of conditional die rolls. EM (Expectation Maximization): basic idea; three coins example; why EM works.

  68. Why does EM work? Y: observed data; Z: unobserved data. 𝒟(θ) = log-likelihood of the complete data (Y, Z); ℳ(θ) = marginal log-likelihood of the incomplete (observed) data Y; 𝒬(θ) = posterior log-likelihood of Z given the observed data Y. What do 𝒟, ℳ, 𝒬 look like?

  69. Why does EM work? Y: observed data; Z: unobserved data. 𝒟(θ) = log-likelihood of the complete data (Y, Z) = ∑_i log p(y_i, z_i).

  70. Why does EM work? Y: observed data; Z: unobserved data. 𝒟(θ) = ∑_i log p(y_i, z_i); ℳ(θ) = marginal log-likelihood of the incomplete data Y = ∑_i log p(y_i) = ∑_i log ∑_k p(y_i, z = k).

  71. Why does EM work? Y: observed data; Z: unobserved data. 𝒟(θ) = ∑_i log p(y_i, z_i); ℳ(θ) = ∑_i log p(y_i) = ∑_i log ∑_k p(y_i, z = k); 𝒬(θ) = posterior log-likelihood = ∑_i log p(z_i | y_i).

  72. Why does EM work? Y: observed data; Z: unobserved data. By the definition of conditional probability, p_θ(Z | Y) = p_θ(Y, Z) / p_θ(Y); rearranging (algebra), p_θ(Y) = p_θ(Y, Z) / p_θ(Z | Y).

  73. Why does EM work? p_θ(Y) = p_θ(Y, Z) / p_θ(Z | Y). With 𝒟(θ) = ∑_i log p(y_i, z_i), ℳ(θ) = ∑_i log p(y_i) = ∑_i log ∑_k p(y_i, z = k), and 𝒬(θ) = ∑_i log p(z_i | y_i), taking logs gives ℳ(θ) = 𝒟(θ) − 𝒬(θ).
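Spelled out, the identity on this slide follows directly from the conditional-probability rearrangement above, using the 𝒟, ℳ, 𝒬 defined a few slides back:

```latex
% log of  p_\theta(y_i) = p_\theta(y_i, z_i) / p_\theta(z_i \mid y_i),  summed over items i:
\mathcal{M}(\theta) = \sum_i \log p_\theta(y_i)
  = \sum_i \left[ \log p_\theta(y_i, z_i) - \log p_\theta(z_i \mid y_i) \right]
  = \mathcal{D}(\theta) - \mathcal{Q}(\theta)
```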

  74. Why does EM work? ℳ(θ) = 𝒟(θ) − 𝒬(θ). Now take a conditional expectation of both sides, with Z drawn from the current model θ^(t) given Y (why? we'll cover this more in variational inference): 𝔼_{Z∼θ^(t)}[ℳ(θ) | Y] = 𝔼_{Z∼θ^(t)}[𝒟(θ) | Y] − 𝔼_{Z∼θ^(t)}[𝒬(θ) | Y].

  75. Why does EM work? 𝔼_{Z∼θ^(t)}[ℳ(θ) | Y] = 𝔼_{Z∼θ^(t)}[𝒟(θ) | Y] − 𝔼_{Z∼θ^(t)}[𝒬(θ) | Y]. Since ℳ(θ) = ∑_i log p(y_i) = ∑_i log ∑_k p(y_i, z = k) already sums over z, it does not depend on Z and its conditional expectation is itself: ℳ(θ) = 𝔼_{Z∼θ^(t)}[𝒟(θ) | Y] − 𝔼_{Z∼θ^(t)}[𝒬(θ) | Y].
