EM & Hidden Markov Models
CMSC 691, UMBC

Recap from last time — Expectation Maximization (EM), a two-step, iterative algorithm: 0. Assume some value for your parameters. 1. E-step: count under uncertainty, assuming these parameters. 2. M-step: maximize the log-likelihood, assuming these uncertain (estimated) counts.


  1. 2 State HMM Likelihood
  [Trellis diagram: two candidate state sequences (z1 = z2 = z3 = z4 = V and z1 = z2 = z3 = z4 = N) over words w1..w4, with transition arcs p(z_i | z_{i-1}) and emission arcs p(x_i | z_i); end emission not shown.]
  Transition table p(next | prev):
             N     V     end
    start    .7    .2    .1
    N        .15   .8    .05
    V        .6    .35   .05
  Emission table p(word | state):
             w1    w2    w3    w4
    N        .7    .2    .05   .05
    V        .2    .6    .1    .1

  2. 2 State HMM Likelihood
  Q: What's the probability of (N, w1), (V, w2), (V, w3), (N, w4)?

  3. 2 State HMM Likelihood
  A: p(N | start) p(w1 | N) * p(V | N) p(w2 | V) * p(V | V) p(w3 | V) * p(N | V) p(w4 | N)
     = (.7 * .7) * (.8 * .6) * (.35 * .1) * (.6 * .05) ≈ 0.000247

  4. 2 State HMM Likelihood
  Q: What's the probability of (N, w1), (V, w2), (N, w3), (N, w4)?

  5. 2 State HMM Likelihood
  A: (.7 * .7) * (.8 * .6) * (.6 * .05) * (.15 * .05) ≈ 0.0000529
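The two worked examples above can be checked with a short script. This is a minimal sketch of our own, not code from the slides — the dictionary encoding of the tables and the helper name `sequence_prob` are ours:

```python
# Transition and emission tables from slide 1 (each row sums to 1).
p_trans = {
    "start": {"N": 0.7,  "V": 0.2,  "end": 0.1},
    "N":     {"N": 0.15, "V": 0.8,  "end": 0.05},
    "V":     {"N": 0.6,  "V": 0.35, "end": 0.05},
}
p_emit = {
    "N": {"w1": 0.7, "w2": 0.2, "w3": 0.05, "w4": 0.05},
    "V": {"w1": 0.2, "w2": 0.6, "w3": 0.1,  "w4": 0.1},
}

def sequence_prob(tags, words):
    """Joint probability of a tagged sequence: product of p(tag | prev) * p(word | tag)."""
    prob, prev = 1.0, "start"
    for tag, word in zip(tags, words):
        prob *= p_trans[prev][tag] * p_emit[tag][word]
        prev = tag
    return prob

print(sequence_prob(["N", "V", "V", "N"], ["w1", "w2", "w3", "w4"]))  # ≈ 0.000247
print(sequence_prob(["N", "V", "N", "N"], ["w1", "w2", "w3", "w4"]))  # ≈ 0.0000529
```

As on the slides, this scores the four (transition, emission) pairs but leaves off the final transition into the end state.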

  6. Agenda
  - HMM Detailed Definition
  - HMM Parameter Estimation
  - EM for HMMs: General Approach; Expectation Calculation

  7. Estimating Parameters from Observed Data
  [Trellis diagrams of two fully tagged sequences; end emission not shown.] With the tags observed, fill in a table of transition counts (rows start/N/V, columns N/V/end) and a table of emission counts (rows N/V, columns w1..w4).

  8. Estimating Parameters from Observed Data
  Transition counts:
             N    V    end
    start    2    0    0
    N        1    2    2
    V        2    1    0
  Emission counts:
             w1   w2   w3   w4
    N        2    0    1    2
    V        0    2    1    0

  9. Estimating Parameters from Observed Data
  Transition MLE (each count divided by its row total):
             N     V     end
    start    1     0     0
    N        .2    .4    .4
    V        2/3   1/3   0
  Emission MLE:
             w1    w2    w3    w4
    N        .4    0     .2    .4
    V        0     2/3   1/3   0

  10. Estimating Parameters from Observed Data
  Same MLE tables as above; smooth these values if needed.
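The maximum-likelihood step above is just row-normalization of a count table, and the same operation will apply unchanged to expected counts later. A small sketch (the function name `normalize_rows` is ours):

```python
def normalize_rows(counts):
    """MLE: p(col | row) = c(row, col) / sum over cols of c(row, col)."""
    mle = {}
    for row, cols in counts.items():
        total = sum(cols.values())
        mle[row] = {col: c / total for col, c in cols.items()}
    return mle

# Transition counts read off the two tagged sequences in slide 8.
trans_counts = {
    "start": {"N": 2, "V": 0, "end": 0},
    "N":     {"N": 1, "V": 2, "end": 2},
    "V":     {"N": 2, "V": 1, "end": 0},
}
print(normalize_rows(trans_counts)["N"])  # {'N': 0.2, 'V': 0.4, 'end': 0.4}
```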

  11. What If We Don't Observe z?
  Approach: develop an EM algorithm.
  Goal: estimate p_trans(s' | s) and p_obs(w | s).
  Why: compute the expected counts E[z_i = s → z_{i+1} = s'] = c(s → s') and E[z_i = s → x_i = w] = c(s → w).

  12. Expectation Maximization (EM)
  0. Assume some value for your parameters.
  Two-step, iterative algorithm:
  1. E-step: count under uncertainty, assuming these parameters.
  2. M-step: maximize the log-likelihood, assuming these uncertain (estimated) counts.

  13. Expectation Maximization (EM)
  The parameters here are p_obs(w | s) and p_trans(s' | s); the two steps are as above.

  14. Expectation Maximization (EM)
  The E-step computes the posteriors
    p*(z_i = s | x_1, ..., x_N) = p(z_i = s, x_1, ..., x_N) / p(x_1, ..., x_N)
    p*(z_i = s, z_{i+1} = s' | x_1, ..., x_N) = p(z_i = s, z_{i+1} = s', x_1, ..., x_N) / p(x_1, ..., x_N)

  15. M-Step
  "Maximize log-likelihood, assuming these uncertain counts."
  If we observed the hidden transitions:
    p_new(s' | s) = c(s → s') / Σ_{s''} c(s → s'')

  16. M-Step
  We don't observe the hidden transitions, but we can approximately count:
    p_new(s' | s) = E[c(s → s')] / Σ_{s''} E[c(s → s'')]

  17. M-Step
  Same update; we compute these expected counts in the E-step.

  18. Expectation Maximization (EM)
  E-step: compute the posteriors p*(z_i = s | x_1, ..., x_N) and p*(z_i = s, z_{i+1} = s' | x_1, ..., x_N) as on slide 14.
  M-step: maximize the log-likelihood, assuming these estimated counts.
  For HMMs, this instance of EM is the Baum-Welch algorithm.

  19. Estimating Parameters from Unobserved Data
  [Trellis diagram, now with posterior arc probabilities p*( · | · ); end emission not shown.] Fill in tables of expected transition counts and expected emission counts.

  20. Estimating Parameters from Unobserved Data
  All of these p* arcs are specific to a time step.

  21. Estimating Parameters from Unobserved Data
  [Example posterior values attached to individual arcs in the trellis, e.g. .5, .3, .3 on transition arcs and .4, .6, .5 on emission arcs.]

  22. Estimating Parameters from Unobserved Data
  Summing arc posteriors across time steps begins to fill the expected-count tables, e.g. E[c(N → N)] = 1.5 and E[c(V → V)] = 1.1.

  23. Estimating Parameters from Unobserved Data
  Expected transition counts:
             N     V     end
    start    1.8   .1    .1
    N        1.5   .8    .1
    V        1.4   1.1   .4
  Expected emission counts:
             w1   w2   w3   w4
    N        .4   .3   .2   .2
    V        .1   .6   .3   .3
  (these numbers are made up)

  24. Estimating Parameters from Unobserved Data
  Expected transition MLE:
             N         V         end
    start    1.8/2     .1/2      .1/2
    N        1.5/2.4   .8/2.4    .1/2.4
    V        1.4/2.9   1.1/2.9   .4/2.9
  Expected emission MLE:
             w1        w2        w3        w4
    N        .4/1.1    .3/1.1    .2/1.1    .2/1.1
    V        .1/1.3    .6/1.3    .3/1.3    .3/1.3
  (these numbers are made up)

  25. Semi-Supervised Parameter Estimation
  Observed counts (from the labeled data):
  Transition counts:
             N    V    end
    start    2    0    0
    N        1    2    2
    V        2    1    0
  Emission counts:
             w1   w2   w3   w4
    N        2    0    1    2
    V        0    2    1    0

  26. Semi-Supervised Parameter Estimation
  Expected counts (from the unlabeled data):
  Expected transition counts:
             N     V     end
    start    1.8   .1    .1
    N        1.5   .8    .1
    V        1.4   1.1   .4
  Expected emission counts:
             w1   w2   w3   w4
    N        .4   .3   .2   .2
    V        .1   .6   .3   .3

  27. Semi-Supervised Parameter Estimation
  (Observed and expected count tables, repeated from slides 25 and 26.)

  28. Semi-Supervised Parameter Estimation
  Mixed counts = observed counts + expected counts:
  Mixed transition counts:
             N     V     end
    start    3.8   .1    .1
    N        2.5   2.8   2.1
    V        3.4   2.1   .4
  Mixed emission counts:
             w1    w2    w3    w4
    N        2.4   .3    1.2   2.2
    V        .1    2.6   1.3   .3
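The mixing step above is a cell-by-cell sum of the observed and expected count tables (which the M-step then normalizes). A minimal sketch; the helper name `add_tables` is ours:

```python
def add_tables(observed, expected):
    """Mixed counts: cell-by-cell sum of an observed and an expected count table."""
    return {row: {col: observed[row][col] + expected[row][col] for col in observed[row]}
            for row in observed}

# Transition counts from slides 25 and 26.
observed = {"start": {"N": 2, "V": 0, "end": 0},
            "N":     {"N": 1, "V": 2, "end": 2},
            "V":     {"N": 2, "V": 1, "end": 0}}
expected = {"start": {"N": 1.8, "V": 0.1, "end": 0.1},
            "N":     {"N": 1.5, "V": 0.8, "end": 0.1},
            "V":     {"N": 1.4, "V": 1.1, "end": 0.4}}
print(add_tables(observed, expected)["N"])  # {'N': 2.5, 'V': 2.8, 'end': 2.1}
```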

  29. Agenda
  - HMM Detailed Definition
  - HMM Parameter Estimation
  - EM for HMMs: General Approach; Expectation Calculation

  30. EM Math
  Maximize the average log-likelihood of our complete data (z, w), averaged across all z according to how likely our current model thinks z is:
    max_θ E_{z ~ p_{θ(t)}( · | w)} [ log p_θ(z, w) ]
  where θ(t) are the current parameters (defining the posterior distribution) and θ are the new parameters.

  31. EM Math
  Here z ∈ {s_1, ..., s_K}^N, and the complete-data log-likelihood decomposes as
    log p_θ(z, w) = Σ_i [ log p_θ(z_i | z_{i-1}) + log p_θ(w_i | z_i) ]

  32. Estimating Parameters from Unobserved Data
  (Expected transition and emission MLE tables, repeated from slide 24.)

  33. EM For HMMs (Baum-Welch Algorithm)
  L = p(x_1, ..., x_N)
  for(i = 1; i ≤ N; ++i) {
    for(state = 0; state < K*; ++state) {
      c_obs(obs_i | state) += p(z_i = state, x_1, ..., x_N) / L
      for(prev = 0; prev < K*; ++prev) {
        c_trans(state | prev) += p(z_{i-1} = prev, z_i = state, x_1, ..., x_N) / L
      }
    }
  }

  34. EM For HMMs (Baum-Welch Algorithm)
  L = p(x_1, ..., x_N)
  for(i = 1; i ≤ N; ++i) {
    for(state = 0; state < K*; ++state) {
      c_obs(obs_i | state) += p(z_i = state, x_1, ..., x_i = obs_i) * p(x_{i+1:N} | z_i = state) / L
      for(prev = 0; prev < K*; ++prev) {
        u = p_obs(obs_i | state) * p_trans(state | prev)
        c_trans(state | prev) += p(z_{i-1} = prev, x_{1:i-1}) * u * p(x_{i+1:N} | z_i = state) / L
      }
    }
  }

  35. EM For HMMs (Baum-Welch Algorithm)
  The factors above are exactly the forward and backward values:
  L = p(x_1, ..., x_N)
  for(i = 1; i ≤ N; ++i) {
    for(state = 0; state < K*; ++state) {
      c_obs(obs_i | state) += α(state, i) * β(state, i) / L
      for(prev = 0; prev < K*; ++prev) {
        u = p_obs(obs_i | state) * p_trans(state | prev)
        c_trans(state | prev) += α(prev, i-1) * u * β(state, i) / L
      }
    }
  }

  36. Why Do We Need Backward Values?
  [Trellis with states A, B, C at steps i-1, i, i+1.]
  α(i, s) is the total probability of all paths:
    1. that start from the beginning
    2. that end (currently) in s at step i
    3. that emit the observation obs at i
  β(i, s) is the total probability of all paths:
    1. that start at step i at state s
    2. that terminate at the end
    3. (that emit the observation obs at i+1)

  37. Why Do We Need Backward Values?
  [Same trellis, with α(i, B) covering the left half of the paths and β(i, B) the right half.]

  38. Why Do We Need Backward Values?
  α(i, B) * β(i, B) = total probability of paths through state B at step i

  39. Why Do We Need Backward Values?
  α(i, s) * β(i, s) = total probability of paths through state s at step i,
  so we can compute posterior state probabilities (normalize by the marginal likelihood).

  40. Why Do We Need Backward Values?
  [Same trellis, now pairing α(i, B) with β(i+1, s').]

  41. Why Do We Need Backward Values?
  α(i, B) * p(s' | B) * p(obs at i+1 | s') * β(i+1, s')
    = total probability of paths through the B → s' arc (at time i)

  42. Why Do We Need Backward Values?
  From these arc totals we can compute posterior transition probabilities (normalize by the marginal likelihood).

  43. With Both Forward and Backward Values
  α(i, s) * β(i, s) = total probability of paths through state s at step i
  α(i, s) * p(s' | s) * p(obs at i+1 | s') * β(i+1, s') = total probability of paths through the s → s' arc (at time i)

  44. With Both Forward and Backward Values
  p(z_i = s | x_1, ..., x_N) = α(i, s) * β(i, s) / α(N+1, END)

  45. With Both Forward and Backward Values
  p(z_i = s, z_{i+1} = s' | x_1, ..., x_N)
    = α(i, s) * p(s' | s) * p(obs_{i+1} | s') * β(i+1, s') / α(N+1, END)
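These posterior formulas can be sketched end-to-end on the running two-state example. Everything below is our own illustration (table values from slide 1; the names `forward`, `backward`, and `gamma` are ours), not code from the slides:

```python
STATES = ["N", "V"]
p_trans = {"start": {"N": 0.7,  "V": 0.2,  "end": 0.1},
           "N":     {"N": 0.15, "V": 0.8,  "end": 0.05},
           "V":     {"N": 0.6,  "V": 0.35, "end": 0.05}}
p_emit = {"N": {"w1": 0.7, "w2": 0.2, "w3": 0.05, "w4": 0.05},
          "V": {"w1": 0.2, "w2": 0.6, "w3": 0.1,  "w4": 0.1}}

def forward(words):
    """alpha[i][s] = total probability of paths ending in s at step i, emitting words[:i+1]."""
    alpha = []
    for i, w in enumerate(words):
        row = {}
        for s in STATES:
            if i == 0:
                row[s] = p_trans["start"][s] * p_emit[s][w]
            else:
                row[s] = sum(alpha[i - 1][o] * p_trans[o][s] for o in STATES) * p_emit[s][w]
        alpha.append(row)
    return alpha

def backward(words):
    """beta[i][s] = total probability of emitting words[i+1:] and ending, from state s at step i."""
    n = len(words)
    beta = [None] * n
    beta[n - 1] = {s: p_trans[s]["end"] for s in STATES}
    for i in range(n - 2, -1, -1):
        beta[i] = {s: sum(p_trans[s][t] * p_emit[t][words[i + 1]] * beta[i + 1][t]
                          for t in STATES)
                   for s in STATES}
    return beta

words = ["w1", "w2", "w3", "w4"]
alpha, beta = forward(words), backward(words)
L = sum(alpha[-1][s] * p_trans[s]["end"] for s in STATES)  # marginal likelihood p(w1..w4)

# Posterior state probabilities: p(z_i = s | words) = alpha[i][s] * beta[i][s] / L
gamma = [{s: alpha[i][s] * beta[i][s] / L for s in STATES} for i in range(len(words))]
print(gamma[1])  # posteriors over {N, V} at step 2; they sum to 1
```

The posterior transition probabilities (slide 45) come from the same `alpha` and `beta` via alpha[i][s] * p_trans[s][s'] * p_emit[s'][words[i+1]] * beta[i+1][s'] / L.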

  46. Agenda
  - HMM Detailed Definition
  - HMM Parameter Estimation
  - EM for HMMs: General Approach; Expectation Calculation

  47. HMM Expectation Calculation
  p(z_1, x_1, z_2, x_2, ..., z_N, x_N)
    = p(z_1 | z_0) p(x_1 | z_1) ··· p(z_N | z_{N-1}) p(x_N | z_N)
    = Π_i p(x_i | z_i) p(z_i | z_{i-1})
  with transition probabilities/parameters p(z_i | z_{i-1}) and emission probabilities/parameters p(x_i | z_i).
  Calculate the forward (log-)likelihood of an observed (sub-)sequence w_1, ..., w_J.
  Calculate the backward (log-)likelihood of an observed (sub-)sequence w_{J+1}, ..., w_N.

  48. HMM Likelihood Task
  Marginalize over all latent-sequence joint likelihoods:
    p(x_1, x_2, ..., x_N) = Σ_{z_1, ..., z_N} p(z_1, x_1, z_2, x_2, ..., z_N, x_N)
  Q: In a K-state HMM for a length-N observation sequence, how many summands (different latent sequences) are there?

  49. HMM Likelihood Task
  A: K^N

  50. HMM Likelihood Task
  Goal: find a way to compute this exponential sum efficiently (in polynomial time).
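The K^N blow-up is easy to see by brute force. For the running two-state example (tables from slide 1; the function name is ours), enumerating all 2^4 = 16 latent sequences reproduces the marginal likelihood that the forward algorithm will later compute in polynomial time:

```python
from itertools import product

p_trans = {"start": {"N": 0.7,  "V": 0.2,  "end": 0.1},
           "N":     {"N": 0.15, "V": 0.8,  "end": 0.05},
           "V":     {"N": 0.6,  "V": 0.35, "end": 0.05}}
p_emit = {"N": {"w1": 0.7, "w2": 0.2, "w3": 0.05, "w4": 0.05},
          "V": {"w1": 0.2, "w2": 0.6, "w3": 0.1,  "w4": 0.1}}

def brute_force_likelihood(words, states=("N", "V")):
    """p(words): sum the joint probability over all K^N latent sequences."""
    total = 0.0
    for tags in product(states, repeat=len(words)):  # K^N latent sequences
        prob, prev = 1.0, "start"
        for tag, w in zip(tags, words):
            prob *= p_trans[prev][tag] * p_emit[tag][w]
            prev = tag
        total += prob * p_trans[prev]["end"]  # close the sequence with the end transition
    return total

print(brute_force_likelihood(["w1", "w2", "w3", "w4"]))  # sums 2**4 = 16 path probabilities
```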

  51. 2 State HMM Likelihood
  [Trellis diagrams of the two example paths through z_1..z_4, with their transition and emission arcs.]

  52. 2 State HMM Likelihood
  Up until here (the shared prefix of the two highlighted paths), all the computation was the same.

  53. 2 State HMM Likelihood
  Let's reuse what computations we can.

  54. 2 State HMM Likelihood
  Solution: pass information "forward" in the graph, e.g., from time step 2 to 3.

  55. 2 State HMM Likelihood
  Issue: these highlighted paths are only 2 of the 16 possible paths through the trellis.

  56. 2 State HMM Likelihood
  Solution: marginalize out all information from previous timesteps.

  57. Reusing Computation
  [Trellis with states A, B, C at steps i-2, i-1, i.]
  Let's first consider "any shared path ending with B (AB, BB, or CB) → B".

  58. Reusing Computation
  Assume that all necessary information has been computed and stored in α(i-1, A), α(i-1, B), α(i-1, C).

  59. Reusing Computation
  Marginalize (sum) across the previous timestep's possible states to get α(i, B).

  60. Reusing Computation
  Marginalize across the previous hidden-state values:
    α(i, B) = Σ_s α(i-1, s) * p(B | s) * p(obs at i | B)

  61. Reusing Computation
  Computing α at time i-1 will correctly incorporate paths through time i-2: we correctly obey the Markov property.

  62. Forward Probability
  α(i, B) is the total probability of all paths to state B from the beginning:
    α(i, B) = Σ_{s'} α(i-1, s') * p(B | s') * p(obs at i | B)

  63. Forward Probability
  In general:
    α(i, s) = Σ_{s'} α(i-1, s') * p(s | s') * p(obs at i | s)
  α(i, s) is the total probability of all paths:
    1. that start from the beginning
    2. that end (currently) in s at step i
    3. that emit the observation obs at i

  64. Forward Probability
  In the recurrence: α(i-1, s') is the total probability up until now; p(s | s') covers the immediate ways to get into state s; p(obs at i | s) is how likely it is to get into state s this way (emitting the observation).

  65. Forward Algorithm
  α: a 2D table, (N+2) x K*.
  N+2: number of observations (+2 for the BOS & EOS symbols).
  K*: number of states.
  Use dynamic programming to build α left-to-right.

  66. Forward Algorithm
  α = double[N+2][K*]
  α[0][*] = 0.0
  α[0][START] = 1.0
  for(i = 1; i ≤ N+1; ++i) {
  }

  67. Forward Algorithm
  (add the inner loop over states)
    for(state = 0; state < K*; ++state) { ... }

  68. Forward Algorithm
  (look up the emission probability)
      p_obs = p_emission(obs_i | state)

  69. Forward Algorithm
  α = double[N+2][K*]
  α[0][*] = 0.0
  α[0][START] = 1.0
  for(i = 1; i ≤ N+1; ++i) {
    for(state = 0; state < K*; ++state) {
      p_obs = p_emission(obs_i | state)
      for(old = 0; old < K*; ++old) {
        p_move = p_transition(state | old)
        α[i][state] += α[i-1][old] * p_obs * p_move
      }
    }
  }

  70. Forward Algorithm
  We still need to learn the p_emission and p_transition parameters (EM if not observed).

  71. Forward Algorithm
  Q: What do we return? (How do we return the likelihood of the sequence?)

  72. Forward Algorithm
  A: α[N+1][END]
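The pseudocode above, translated into a runnable sketch over the running two-state example (the dictionary tables and the name `forward_likelihood` are ours):

```python
STATES = ["N", "V"]
p_trans = {"start": {"N": 0.7,  "V": 0.2,  "end": 0.1},
           "N":     {"N": 0.15, "V": 0.8,  "end": 0.05},
           "V":     {"N": 0.6,  "V": 0.35, "end": 0.05}}
p_emit = {"N": {"w1": 0.7, "w2": 0.2, "w3": 0.05, "w4": 0.05},
          "V": {"w1": 0.2, "w2": 0.6, "w3": 0.1,  "w4": 0.1}}

def forward_likelihood(words):
    """Forward algorithm: build alpha left-to-right; alpha[i][s] sums all paths ending in s at i."""
    alpha = [{s: 0.0 for s in STATES} for _ in words]
    for i, w in enumerate(words):
        for state in STATES:
            p_obs = p_emit[state][w]  # p_emission(obs_i | state)
            if i == 0:
                alpha[i][state] = p_trans["start"][state] * p_obs
            else:
                for old in STATES:  # marginalize over the previous state
                    alpha[i][state] += alpha[i - 1][old] * p_obs * p_trans[old][state]
    # "alpha[N+1][END]": fold in the transition into the end state.
    return sum(alpha[-1][s] * p_trans[s]["end"] for s in STATES)

print(forward_likelihood(["w1", "w2", "w3", "w4"]))  # ≈ 6.54e-05
```

This runs in O(N · K²) time instead of the O(K^N) enumeration of all latent sequences, and returns the same marginal likelihood.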

  73. Interactive HMM Example https://goo.gl/rbHEoc (Jason Eisner, 2002) Original: http://www.cs.jhu.edu/~jason/465/PowerPoint/lect24-hmm.xls

  74. Forward Algorithm in Log-Space
  α = double[N+2][K*]
  α[0][*] = -∞
  α[0][START] = 0.0
  for(i = 1; i ≤ N+1; ++i) {
    for(state = 0; state < K*; ++state) {
      p_obs = log p_emission(obs_i | state)
      for(old = 0; old < K*; ++old) {
        p_move = log p_transition(state | old)
        α[i][state] = logadd(α[i][state], α[i-1][old] + p_obs + p_move)
      }
    }
  }

  75. Forward Algorithm in Log-Space
  (same loop as above), where
    logadd(la, lb) = la + log(1 + exp(lb - la))   if la ≥ lb
                     lb + log(1 + exp(la - lb))   if lb > la
  (see scipy.misc.logsumexp; in current SciPy, scipy.special.logsumexp)
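The logadd on the final slide can be sketched directly. This is our own log1p-based version of the two-branch definition above, with -inf standing in for log(0):

```python
import math

def logadd(la, lb):
    """Stable log(exp(la) + exp(lb)); -inf represents log(0)."""
    if la == -math.inf:
        return lb
    if lb == -math.inf:
        return la
    hi, lo = (la, lb) if la >= lb else (lb, la)
    # Always exponentiate a non-positive difference, so exp never overflows.
    return hi + math.log1p(math.exp(lo - hi))

# log(0.2) "plus" log(0.3) should give log(0.5):
print(logadd(math.log(0.2), math.log(0.3)))  # ≈ log(0.5) ≈ -0.6931
```

Subtracting the larger argument before exponentiating is what keeps the log-space forward pass from underflowing on long sequences.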
