Learning Parameters for the Die Model
p(w_1, w_2, …, w_N) = p(w_1) p(w_2) ⋯ p(w_N) = ∏_i p(w_i)
Maximize the (log-)likelihood to learn the probability parameters.
Q: Why is maximizing log-likelihood a reasonable thing to do?
A: It develops a good model for what we observe.
Q: (For discrete observations) What loss function do we minimize in order to maximize log-likelihood?
A: Cross-entropy.
Learning Parameters for the Die Model: Maximum Likelihood (Intuition)
p(w_1, w_2, …, w_N) = p(w_1) p(w_2) ⋯ p(w_N) = ∏_i p(w_i)
Maximize the (log-)likelihood to learn the probability parameters.
If you observe these 9 rolls (e.g., 1, 1, 2, 3, 4, 4, 4, 5, 6), what are "reasonable" estimates for p(w)?
Maximum likelihood estimates: p(1) = 2/9, p(2) = 1/9, p(3) = 1/9, p(4) = 3/9, p(5) = 1/9, p(6) = 1/9.
Learning Parameters for the Die Model: Maximum Likelihood (Math)
N different (independent) rolls: p(w_1, w_2, …, w_N) = p(w_1) p(w_2) ⋯ p(w_N) = ∏_i p(w_i)
Generative story: for roll i = 1 to N: w_i ∼ Cat(θ)
Observed rolls: w_1 = 1, w_2 = 5, w_3 = 4, …
Maximize the log-likelihood: ℓ(θ) = Σ_i log p_θ(w_i) = Σ_i log θ_{w_i}
Learning Parameters for the Die Model: Maximum Likelihood (Math)
Generative story: for roll i = 1 to N: w_i ∼ Cat(θ)
Maximize the log-likelihood: ℓ(θ) = Σ_i log θ_{w_i}
Q: What's an easy way to maximize this, exactly as written (even without calculus)?
A: Just keep increasing each θ_{w_i}. (We know θ must be a distribution, but that constraint isn't specified anywhere in the objective.)
Learning Parameters for the Die Model: Maximum Likelihood (Math)
Maximize the log-likelihood with the distribution constraint:
maximize ℓ(θ) = Σ_i log θ_{w_i}   s.t.   Σ_{k=1}^{6} θ_k = 1
(We could also include the inequality constraints 0 ≤ θ_k, but they complicate the problem and, right now, are not needed.)
Solve using Lagrange multipliers.
Learning Parameters for the Die Model: Maximum Likelihood (Math)
Form the Lagrangian: ℱ(θ) = Σ_i log θ_{w_i} − λ (Σ_{k=1}^{6} θ_k − 1)
Set the partial derivatives to zero:
∂ℱ(θ)/∂θ_k = Σ_{i: w_i = k} 1/θ_k − λ = 0   ⇒   θ_k = (Σ_{i: w_i = k} 1) / λ
∂ℱ(θ)/∂λ = −Σ_{k=1}^{6} θ_k + 1 = 0   (the optimal λ is whatever makes Σ_k θ_k = 1)
Since Σ_k Σ_{i: w_i = k} 1 = N, this gives λ = N, and so θ_k = count(k) / N.
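A minimal sketch of this closed-form estimate in Python, using the nine example rolls from the intuition slide; the variable names are illustrative, not from the slides:

```python
from collections import Counter

# The nine rolls from the intuition slide (a hypothetical sample).
rolls = [1, 1, 2, 3, 4, 4, 4, 5, 6]

counts = Counter(rolls)
N = len(rolls)

# Closed-form maximum-likelihood estimate: theta_k = count(k) / N
theta = {k: counts[k] / N for k in range(1, 7)}
print(theta)  # {1: 0.22.., 2: 0.11.., 3: 0.11.., 4: 0.33.., 5: 0.11.., 6: 0.11..}
```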
Outline
• Latent and probabilistic modeling
• Generative Modeling
  - Example 1: A Model of Rolling a Die
  - Example 2: A Model of Conditional Die Rolls
• EM (Expectation Maximization)
  - Basic idea
  - Three coins example
  - Why EM works
Example: Conditionally Rolling a Die
p(w_1, w_2, …, w_N) = p(w_1) p(w_2) ⋯ p(w_N) = ∏_i p(w_i)
Add complexity to better explain what we see:
p(z_1, w_1, z_2, w_2, …, z_N, w_N) = p(z_1) p(w_1 | z_1) ⋯ p(z_N) p(w_N | z_N) = ∏_i p(w_i | z_i) p(z_i)
First flip a coin (z_1 = T, z_2 = H, …), then roll a different die depending on the coin flip (w_1 = 1, w_2 = 5, …).
Learning in the Conditional Die Roll Model: Maximize (Log-)Likelihood
p(z_1, w_1, z_2, w_2, …, z_N, w_N) = ∏_i p(w_i | z_i) p(z_i)
If you observe the z_i values, this is easy!
Learning in the Conditional Die Roll Model: Maximize (Log-)Likelihood
If you observe the z_i values, this is easy!
First: write the generative story.
λ = distribution over the coin (z)
γ^(H) = distribution for the die when the coin comes up heads
γ^(T) = distribution for the die when the coin comes up tails
for item i = 1 to N:
  z_i ∼ Bernoulli(λ)
  w_i ∼ Cat(γ^(z_i))
Learning in the Conditional Die Roll Model: Maximize (Log-)Likelihood
Second: turn the generative story into an objective.
ℱ(θ) = Σ_i ( log λ_{z_i} + log γ^{(z_i)}_{w_i} ) − α (Σ_{z∈{H,T}} λ_z − 1) − Σ_{z∈{H,T}} α_z (Σ_{k=1}^{6} γ^{(z)}_k − 1)
(The last two terms are Lagrange-multiplier constraints forcing λ and each γ^(z) to be proper distributions.)
If you observe the z_i values, this is easy. But if you don't observe the z_i values, it is not!
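A sketch of the easy, fully observed case in Python, assuming a small hand-made dataset of (z, w) pairs; the data and variable names are hypothetical:

```python
from collections import Counter, defaultdict

# Hypothetical fully observed (coin, die) pairs: z in {"H", "T"}, w in {1..6}.
data = [("H", 1), ("T", 5), ("H", 4), ("H", 1), ("T", 2), ("H", 6)]

# MLE for the coin distribution lambda: relative frequency of each z.
z_counts = Counter(z for z, _ in data)
lam = {z: c / len(data) for z, c in z_counts.items()}

# MLE for each die distribution gamma^(z): relative frequency of w among items with that z.
zw_counts = defaultdict(Counter)
for z, w in data:
    zw_counts[z][w] += 1
gamma = {z: {w: c / sum(cnt.values()) for w, c in cnt.items()} for z, cnt in zw_counts.items()}

print(lam)    # e.g. {'H': 0.667, 'T': 0.333}
print(gamma)  # per-coin die distributions
```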
Example: Conditionally Rolling a Die
p(z_1, w_1, …, z_N, w_N) = ∏_i p(w_i | z_i) p(z_i)
Goal: maximize the (log-)likelihood. But we don't actually observe the z values; we just see the items w.
If we did observe z, estimating the probability parameters would be easy… but we don't! :(
If we knew the probability parameters, then we could estimate z and evaluate the likelihood… but we don't! :(
Example: Conditionally Rolling a Die
p(z_1, w_1, …, z_N, w_N) = ∏_i p(w_i | z_i) p(z_i)
We don't actually observe the z values, so the goal becomes: maximize the marginalized (log-)likelihood.
For each observed w, the latent z could have taken any of its values (z_1 & w, z_2 & w, z_3 & w, z_4 & w in the figure), so we sum over the possibilities:
p(w_1, w_2, …, w_N) = (Σ_{z_1} p(z_1, w_1)) (Σ_{z_2} p(z_2, w_2)) ⋯ (Σ_{z_N} p(z_N, w_N))
Example: Conditionally Rolling a Die
p(z_1, w_1, …, z_N, w_N) = p(z_1) p(w_1 | z_1) ⋯ p(z_N) p(w_N | z_N)
Goal: maximize the marginalized (log-)likelihood
p(w_1, w_2, …, w_N) = (Σ_{z_1} p(z_1, w_1)) (Σ_{z_2} p(z_2, w_2)) ⋯ (Σ_{z_N} p(z_N, w_N))
If we did observe z, estimating the probability parameters would be easy… but we don't! :(
If we knew the probability parameters, then we could estimate z and evaluate the likelihood… but we don't! :(
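A sketch of evaluating this marginalized likelihood in Python; the parameter values and observations below are made up purely for illustration:

```python
import math

# Hypothetical parameters: lam over z, gamma[z] over die faces 1..6.
lam = {"H": 0.6, "T": 0.4}
gamma = {
    "H": {w: 1 / 6 for w in range(1, 7)},                   # fair die for heads
    "T": {1: 0.5, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.1},  # loaded die for tails
}

observed = [1, 5, 4, 1]  # we only see the w's

# Marginal log-likelihood: sum_i log sum_z p(w_i | z) p(z)
marginal_ll = sum(
    math.log(sum(gamma[z][w] * lam[z] for z in lam))
    for w in observed
)
print(marginal_ll)
```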
If we knew the probability parameters, then we could estimate z and evaluate the likelihood… but we don't! :(
If we did observe z, estimating the probability parameters would be easy… but we don't! :(
Expectation Maximization (EM) gives model estimation the needed "spark" to break this circularity.
[Image: chicken or egg, http://blog.innotas.com/wp-content/uploads/2015/08/chicken-or-egg-cropped1.jpg]
Outline
• Latent and probabilistic modeling
• Generative Modeling
  - Example 1: A Model of Rolling a Die
  - Example 2: A Model of Conditional Die Rolls
• EM (Expectation Maximization)
  - Basic idea
  - Three coins example
  - Why EM works
Expectation Maximization (EM)
0. Assume some value for your parameters.
Then iterate a two-step algorithm:
1. E-step: count under uncertainty (compute expectations).
2. M-step: maximize the log-likelihood, assuming these uncertain counts.
Expectation Maximization (EM): E-step
0. Assume some value for your parameters.
1. E-step: count under uncertainty, assuming these parameters: compute expected counts such as count(z_i, w_i) and p(z_i).
2. M-step: maximize the log-likelihood, assuming these uncertain counts.
(We've already seen this type of counting, when computing the gradient in maxent models.)
Expectation Maximization (EM): M-step
0. Assume some value for your parameters.
1. E-step: count under uncertainty, assuming these parameters.
2. M-step: maximize the log-likelihood, assuming these uncertain counts: update p^(t)(z) to p^(t+1)(z) using the estimated (expected) counts.
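The two steps fit a very small generic loop; a sketch is below, where the function and argument names are placeholders and `e_step`/`m_step` are whatever the specific model requires (the three-coins slides later in the deck instantiate both):

```python
def em(observations, init_params, e_step, m_step, n_iters=50):
    """Schematic EM loop: alternate expected counts (E) and re-estimation (M)."""
    params = init_params
    for _ in range(n_iters):
        # E-step: soft counts, i.e. expectations of the sufficient statistics
        # under the posterior p(z | w, params).
        expected_counts = e_step(observations, params)
        # M-step: re-estimate params as if the expected counts were observed counts.
        params = m_step(expected_counts)
    return params
```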
EM Math
Maximize the average log-likelihood of our complete data (z, w), averaged across all z according to how likely our current model thinks each z is:
max_θ  E_{z ∼ p_{θ^(t)}(· | w)} [ log p_θ(z, w) ]
The posterior p_{θ^(t)}(· | w) uses the current parameters θ^(t); the joint log p_θ(z, w) uses the new parameters θ.
E-step: count under uncertainty (compute the posterior expectations).
M-step: maximize the log-likelihood (solve for the new parameters θ).
Why EM? Unsupervised Learning
We have NO labeled data (labeled data = human annotated, relatively small/few examples); we only have unlabeled data (raw, not annotated, but plentiful).
EM learns from the unlabeled examples alone; EM/generative models in this case can be seen as a type of clustering.
Why EM? Semi-Supervised Learning
We have a small amount of labeled data (human annotated, relatively small/few examples) and a large amount of unlabeled data (raw, not annotated, plentiful).
EM lets the model estimated from the labeled examples also make use of the plentiful unlabeled examples.
Outline
• Latent and probabilistic modeling
• Generative Modeling
  - Example 1: A Model of Rolling a Die
  - Example 2: A Model of Conditional Die Rolls
• EM (Expectation Maximization)
  - Basic idea
  - Three coins example
  - Why EM works
Three Coins Example
Imagine three coins. Flip the 1st coin (a penny); if heads, flip the 2nd coin (a dollar coin); if tails, flip the 3rd coin (a dime).
We don't observe the penny flip; we only observe the second flip (we record its heads vs. tails outcome).
Analogy: the unobserved flip is like a part of speech or a genre; the observed outcome is like the symbols or words we actually see (a, b, e, etc.; "We run the code" vs. "The run failed").
Three Coins Example
Penny: p(heads) = λ, p(tails) = 1 − λ.
Dollar coin: p(heads) = γ, p(tails) = 1 − γ.
Dime: p(heads) = ψ, p(tails) = 1 − ψ.
Three parameters to estimate: λ, γ, and ψ.
Generative Story for Three Coins
p(z_1, w_1, z_2, w_2, …, z_N, w_N) = p(z_1) p(w_1 | z_1) ⋯ p(z_N) p(w_N | z_N) = ∏_i p(w_i | z_i) p(z_i)
λ = distribution over the penny; γ = distribution for the dollar coin; ψ = distribution for the dime.
for item i = 1 to N:
  z_i ∼ Bernoulli(λ)
  if z_i = H: w_i ∼ Bernoulli(γ)
  else: w_i ∼ Bernoulli(ψ)
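A sketch of sampling from this generative story in Python; the default parameter values are the illustrative ones used a few slides later, and all names are hypothetical:

```python
import random

def sample_three_coins(n, lam=0.6, gamma=0.8, psi=0.6, seed=0):
    """Sample n (z, w) pairs from the three-coins generative story.

    lam = p(penny = H), gamma = p(H | penny = H), psi = p(H | penny = T).
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        z = "H" if rng.random() < lam else "T"       # flip the penny (unobserved)
        p_heads = gamma if z == "H" else psi         # pick the dollar coin or the dime
        w = "H" if rng.random() < p_heads else "T"   # flip it; only w is observed
        data.append((z, w))
    return data

print(sample_three_coins(6))
```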
Three Coins Example
If all flips were observed, e.g. penny flips H H T H T H with recorded flips H T H T T T, estimation is just counting:
λ (penny): p(heads) = 4/6, p(tails) = 2/6.
γ (dollar coin): p(heads) = 1/4, p(tails) = 3/4.
ψ (dime): p(heads) = 1/2, p(tails) = 1/2.
Three Coins Example
But not all flips are observed: we only see H T H T T T (2 heads, 4 tails); the penny flips are hidden. So first set (guess) parameter values:
Penny: p(heads) = λ = .6, p(tails) = .4. Dollar coin: p(heads) = .8, p(tails) = .2. Dime: p(heads) = .6, p(tails) = .4.
Use these values to compute posteriors over the hidden penny flip, rewriting the joint with Bayes' rule (the denominator is the marginal likelihood):
p(heads | observed item H) = p(H | heads) p(heads) / p(H)
p(heads | observed item T) = p(T | heads) p(heads) / p(T)
Here p(H | heads) = .8 and p(T | heads) = .2, and the marginal is
p(H) = p(H | heads) p(heads) + p(H | tails) p(tails) = .8 · .6 + .6 · .4 = .72
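A sketch of this posterior computation in Python, using the guessed parameter values above (the variable and function names are illustrative):

```python
lam, gamma, psi = 0.6, 0.8, 0.6   # current guesses: p(penny=H), p(H | H-penny), p(H | T-penny)

def posterior_heads(w):
    """p(penny = H | observed flip w), by Bayes' rule."""
    p_w_given_heads = gamma if w == "H" else 1 - gamma
    p_w_given_tails = psi if w == "H" else 1 - psi
    p_w = p_w_given_heads * lam + p_w_given_tails * (1 - lam)   # marginal p(w)
    return p_w_given_heads * lam / p_w

print(posterior_heads("H"))  # ~0.667
print(posterior_heads("T"))  # ~0.429
```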
Three Coins Example
Use the posteriors to update the parameters:
p(heads | obs. H) = p(H | heads) p(heads) / p(H) = (.8 · .6) / (.8 · .6 + .6 · .4) ≈ 0.667
p(heads | obs. T) = p(T | heads) p(heads) / p(T) = (.2 · .6) / (.2 · .6 + .4 · .4) ≈ 0.429
Q: Is p(heads | obs. H) + p(heads | obs. T) = 1?
A: No.
Three Coins Example
(In general, p(heads | obs. H) and p(heads | obs. T) do NOT sum to 1.)
Fully observed setting: λ(heads) = (# heads from penny) / (# total flips of penny).
Our partially observed setting: λ^(t+1)(heads) = (# expected heads from penny) / (# total flips of penny) = E_{λ^(t)}[# heads from penny] / (# total flips of penny)
= (2 · p(heads | obs. H) + 4 · p(heads | obs. T)) / 6 ≈ (2 · 0.667 + 4 · 0.429) / 6 ≈ 0.51
(2 and 4 are the numbers of observed H and T flips, respectively.)
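Putting the E-step and M-step together for this example, a sketch of one full EM iteration in Python; the γ and ψ updates follow the same expected-count logic as the λ update shown on the slide, and all names are illustrative:

```python
# One EM iteration for the three-coins model, on the observed flips from the
# slides (2 heads, 4 tails), starting from lambda=.6, gamma=.8, psi=.6.
observed = ["H", "T", "H", "T", "T", "T"]
lam, gamma, psi = 0.6, 0.8, 0.6

# E-step: posterior responsibility p(penny = H | w) for each observed flip.
def resp(w):
    ph = (gamma if w == "H" else 1 - gamma) * lam
    pt = (psi if w == "H" else 1 - psi) * (1 - lam)
    return ph / (ph + pt)

r = [resp(w) for w in observed]

# M-step: expected counts take the place of observed counts.
lam_new = sum(r) / len(observed)                                              # ~0.51
gamma_new = sum(ri for ri, w in zip(r, observed) if w == "H") / sum(r)        # expected H-rate of dollar coin
psi_new = (sum((1 - ri) for ri, w in zip(r, observed) if w == "H")
           / sum(1 - ri for ri in r))                                         # expected H-rate of dime
print(lam_new, gamma_new, psi_new)
```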
Expectation Maximization (EM)
0. Assume some value for your parameters.
Then iterate a two-step algorithm:
1. E-step: count under uncertainty (compute expectations).
2. M-step: maximize the log-likelihood, assuming these uncertain counts.
Outline
• Latent and probabilistic modeling
• Generative Modeling
  - Example 1: A Model of Rolling a Die
  - Example 2: A Model of Conditional Die Rolls
• EM (Expectation Maximization)
  - Basic idea
  - Three coins example
  - Why EM works
Why does EM work?
X: observed data; Y: unobserved data.
𝒞(θ) = log-likelihood of the complete data (X, Y); ℳ(θ) = marginal log-likelihood of the observed data X; 𝒫(θ) = posterior log-likelihood of the incomplete data Y.
What do 𝒞, ℳ, and 𝒫 look like?
𝒞(θ) = Σ_i log p(x_i, y_i)
ℳ(θ) = Σ_i log p(x_i) = Σ_i log Σ_k p(x_i, y = k)
𝒫(θ) = Σ_i log p(y_i | x_i)
Why does EM work?
X: observed data; Y: unobserved data. 𝒞(θ): complete-data log-likelihood; ℳ(θ): marginal log-likelihood of X; 𝒫(θ): posterior log-likelihood of Y.
By the definition of conditional probability (plus a little algebra): p_θ(Y | X) = p_θ(X, Y) / p_θ(X), i.e. p_θ(X) = p_θ(X, Y) / p_θ(Y | X).
Therefore ℳ(θ) = 𝒞(θ) − 𝒫(θ).
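Spelled out per training example, the identity is just the logarithm of the conditional-probability definition above, summed over i; a brief sketch in LaTeX notation:

```latex
% Log of p_\theta(y_i \mid x_i) = p_\theta(x_i, y_i) / p_\theta(x_i), rearranged:
\log p_\theta(x_i) = \log p_\theta(x_i, y_i) - \log p_\theta(y_i \mid x_i)
% Summing over examples i:
\underbrace{\sum_i \log p_\theta(x_i)}_{\mathcal{M}(\theta)}
  = \underbrace{\sum_i \log p_\theta(x_i, y_i)}_{\mathcal{C}(\theta)}
  - \underbrace{\sum_i \log p_\theta(y_i \mid x_i)}_{\mathcal{P}(\theta)}
```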
Why does EM work?
ℳ(θ) = 𝒞(θ) − 𝒫(θ). Now take a conditional expectation over the unobserved Y, drawn from the current model's posterior (why an expectation? we'll cover this more in variational inference):
E_{Y ∼ p^(t)}[ℳ(θ) | X] = E_{Y ∼ p^(t)}[𝒞(θ) | X] − E_{Y ∼ p^(t)}[𝒫(θ) | X]
But ℳ(θ) already sums over Y (it does not depend on Y), so the left side is just ℳ(θ):
ℳ(θ) = E_{Y ∼ p^(t)}[𝒞(θ) | X] − E_{Y ∼ p^(t)}[𝒫(θ) | X]