2-State HMM Likelihood

[Trellis diagram: the two hidden-state rows (N and V) over w1 w2 w3 w4, with transition arcs p(s' | s) and emission arcs p(w_i | s); this figure repeats across the next few slides.]

Transition probabilities p(s' | s):
         N     V     end
start   .7    .2    .1
N       .15   .8    .05
V       .6    .35   .05

Emission probabilities p(w | s):
        w1    w2    w3    w4
N       .7    .2    .05   .05
V       .2    .6    .1    .1

Q: What's the probability of (N, w1), (V, w2), (V, w3), (N, w4)?

A: (.7 × .7) × (.8 × .6) × (.35 × .1) × (.6 × .05) ≈ 0.000247
2-State HMM Likelihood

Q: What's the probability of (N, w1), (V, w2), (N, w3), (N, w4)?

A: (.7 × .7) × (.8 × .6) × (.6 × .05) × (.15 × .05) ≈ 0.0000529
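A minimal sketch (not from the slides) of the same computation in Python. The tables above are hard-coded, and, as in the products above, the final transition into end is left out:

```python
trans = {  # p(next | prev), from the transition table above
    "start": {"N": 0.7, "V": 0.2, "end": 0.1},
    "N": {"N": 0.15, "V": 0.8, "end": 0.05},
    "V": {"N": 0.6, "V": 0.35, "end": 0.05},
}
emit = {  # p(word | state), from the emission table above
    "N": {"w1": 0.7, "w2": 0.2, "w3": 0.05, "w4": 0.05},
    "V": {"w1": 0.2, "w2": 0.6, "w3": 0.1, "w4": 0.1},
}

def joint_prob(tags, words):
    """p(z, w): product of p(z_i | z_{i-1}) * p(w_i | z_i)."""
    p, prev = 1.0, "start"
    for tag, word in zip(tags, words):
        p *= trans[prev][tag] * emit[tag][word]
        prev = tag
    return p

print(joint_prob(["N", "V", "V", "N"], ["w1", "w2", "w3", "w4"]))  # ~0.000247
print(joint_prob(["N", "V", "N", "N"], ["w1", "w2", "w3", "w4"]))  # ~0.0000529
```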
Agenda
- HMM Detailed Definition
- HMM Parameter Estimation
- EM for HMMs
  - General Approach
  - Expectation Calculation
Estimating Parameters from Observed Data

[Trellis diagram: two fully observed tagged sequences, (N,w1)(V,w2)(V,w3)(N,w4) and (N,w1)(V,w2)(N,w3)(N,w4); end emission not shown.]

Transition counts c(s → s'):
         N    V    end
start    2    0    0
N        1    2    2
V        2    1    0

Emission counts c(s → w):
        w1   w2   w3   w4
N        2    0    1    2
V        0    2    1    0
Estimating Parameters from Observed Data

Transition MLE (each count row normalized by its row sum):
         N     V     end
start    1     0     0
N        .2    .4    .4
V        2/3   1/3   0

Emission MLE:
        w1    w2    w3    w4
N       .4    0     .2    .4
V       0     2/3   1/3   0

Smooth these values if needed.
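A sketch of that normalization in Python (the function name and dict layout are ours, not the slides'):

```python
def row_normalize(counts):
    """MLE: p(col | row) = count(row, col) / sum over count(row, .)."""
    mle = {}
    for row, cols in counts.items():
        total = sum(cols.values())
        mle[row] = {c: v / total for c, v in cols.items()}
    return mle

trans_counts = {"start": {"N": 2, "V": 0, "end": 0},
                "N": {"N": 1, "V": 2, "end": 2},
                "V": {"N": 2, "V": 1, "end": 0}}
print(row_normalize(trans_counts)["N"])  # {'N': 0.2, 'V': 0.4, 'end': 0.4}
```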
What If We Don't Observe z?

Approach: develop an EM algorithm.
Goal: estimate p_trans(s' | s) and p_obs(w | s).
Key quantities: the expected counts E_z[c(s → s')] and E_z[c(s → w)].
Expectation Maximization (EM)

0. Assume some initial value for your parameters p_obs(w | s) and p_trans(s' | s).

Then iterate a two-step algorithm:

1. E-step: count under uncertainty, assuming these parameters:

   p*(z_i = s | w_1, …, w_N) = p(z_i = s, w_1, …, w_N) / p(w_1, …, w_N)

   p*(z_i = s, z_{i+1} = s' | w_1, …, w_N) = p(z_i = s, z_{i+1} = s', w_1, …, w_N) / p(w_1, …, w_N)

2. M-step: maximize the log-likelihood, assuming these estimated counts.
M-Step: maximize the log-likelihood, assuming these estimated counts

If we observed the hidden transitions, the MLE would be

   p_new(s' | s) = c(s → s') / Σ_{s''} c(s → s'')
M-Step: maximize the log-likelihood, assuming these estimated counts

We don't observe the hidden transitions, but we can approximately count using the expectations we compute in the E-step:

   p_new(s' | s) = E[c(s → s')] / Σ_{s''} E[c(s → s'')]
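As a tiny illustration (a sketch; the values are borrowed from the "made up" expected-count table a few slides ahead), the M-step is the same row normalization as in the supervised case, just over fractional counts:

```python
# Expected (fractional) counts for transitions out of state N
expected_from_N = {"N": 1.5, "V": 0.8, "end": 0.1}
total = sum(expected_from_N.values())                   # 2.4
p_new = {s: c / total for s, c in expected_from_N.items()}
print(p_new)  # {'N': 0.625, 'V': 0.333..., 'end': 0.0416...}
```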
Expectation Maximization (EM)

For HMMs, this E-step (posterior expected counts p*) plus M-step (normalize expected counts) procedure is the Baum-Welch algorithm.
Estimating Parameters from Unobserved Data

[Trellis diagram: the same two sequences, now with every transition and emission arc labeled by a posterior probability p*; all of these p* arcs are specific to a time-step. End emission not shown.]

Summing the per-time-step p* values gives expected counts (these numbers are made up):

Expected transition counts:
         N     V     end
start   1.8   .1    .1
N       1.5   .8    .1
V       1.4   1.1   .4

Expected emission counts:
        w1    w2    w3    w4
N       .4    .3    .2    .2
V       .1    .6    .3    .3
Estimating Parameters from Unobserved Data

Expected transition MLE:
         N         V         end
start   1.8/2     .1/2      .1/2
N       1.5/2.4   .8/2.4    .1/2.4
V       1.4/2.9   1.1/2.9   .4/2.9

Expected emission MLE:
        w1       w2       w3       w4
N       .4/1.1   .3/1.1   .2/1.1   .2/1.1
V       .1/1.3   .6/1.3   .3/1.3   .3/1.3

(These numbers are made up; end emission not shown.)
Semi-Supervised Parameter Estimation

Observed (supervised) transition counts:
         N    V    end
start    2    0    0
N        1    2    2
V        2    1    0

Observed emission counts:
        w1   w2   w3   w4
N        2    0    1    2
V        0    2    1    0

Expected transition counts (from EM on unlabeled data):
         N     V     end
start   1.8   .1    .1
N       1.5   .8    .1
V       1.4   1.1   .4

Expected emission counts:
        w1    w2    w3    w4
N       .4    .3    .2    .2
V       .1    .6    .3    .3

Mixed counts = observed + expected, added cell by cell:

Mixed transition counts:
         N     V     end
start   3.8   .1    .1
N       2.5   2.8   2.1
V       3.4   2.1   .4

Mixed emission counts:
        w1    w2    w3    w4
N       2.4   .3    1.2   2.2
V       .1    2.6   1.3   .3
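A sketch of the mixing step in Python (one table row shown; the helper name is ours):

```python
def add_counts(a, b):
    """Add two count tables cell by cell before normalizing."""
    return {row: {col: a[row][col] + b[row][col] for col in a[row]} for row in a}

observed = {"start": {"N": 2.0, "V": 0.0, "end": 0.0}}
expected = {"start": {"N": 1.8, "V": 0.1, "end": 0.1}}
print(add_counts(observed, expected))  # {'start': {'N': 3.8, 'V': 0.1, 'end': 0.1}}
```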
Agenda
- HMM Detailed Definition
- HMM Parameter Estimation
- EM for HMMs
  - General Approach
  - Expectation Calculation
EM Math

Maximize the average log-likelihood of our complete data (z, w), averaged across all z according to how likely our current model thinks z is:

   max_θ  E_{z ~ p_{θ^(t)}(· | w)} [ log p_θ(z, w) ]

θ^(t): current parameters; θ: new parameters; p_{θ^(t)}(· | w): posterior distribution over z ∈ {s_1, …, s_K}^N.

For an HMM the complete-data log-likelihood factors:

   log p_θ(z, w) = Σ_i [ log p_θ(z_i | z_{i-1}) + log p_θ(w_i | z_i) ]
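Spelled out (not on the slide, but a direct consequence of the factored form above): grouping the sum by transition type and emission type, the objective separates into a transition term and an emission term, each maximized independently by normalizing its expected counts — which is exactly the M-step.

```latex
\mathbb{E}_{z \sim p_{\theta^{(t)}}(\cdot \mid w)}\big[\log p_\theta(z, w)\big]
  = \sum_{s,\,s'} \mathbb{E}\big[c(s \to s')\big]\,\log p_\theta(s' \mid s)
  \;+\; \sum_{s,\,w'} \mathbb{E}\big[c(s \to w')\big]\,\log p_\theta(w' \mid s)
```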
EM For HMMs (Baum-Welch Algorithm)

```
L = p(w_1, …, w_N)
for (i = 1; i ≤ N; ++i) {
  for (state = 0; state < K*; ++state) {
    c_obs(obs_i | state) += p(z_i = state, w_1, …, w_N) / L
    for (prev = 0; prev < K*; ++prev) {
      c_trans(state | prev) += p(z_{i-1} = prev, z_i = state, w_1, …, w_N) / L
    }
  }
}
```
EM For HMMs (Baum-Welch Algorithm)

The joint probabilities factor into forward (α) and backward (β) pieces:

```
L = p(w_1, …, w_N)
for (i = 1; i ≤ N; ++i) {
  for (state = 0; state < K*; ++state) {
    // p(z_i = state, w_1..w_i) * p(w_{i+1:N} | z_i = state) = α(state, i) · β(state, i)
    c_obs(obs_i | state) += α(state, i) * β(state, i) / L
    for (prev = 0; prev < K*; ++prev) {
      u = p_obs(obs_i | state) * p_trans(state | prev)
      // p(z_{i-1} = prev, x_{1:i-1}) · u · p(x_{i+1:N} | z_i = state) = α(prev, i-1) · u · β(state, i)
      c_trans(state | prev) += α(prev, i-1) * u * β(state, i) / L
    }
  }
}
```
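To make the loop concrete, here's a sketch of the full E-step in NumPy on the toy model from earlier. Everything here is our own framing, not the slides': the matrix names T and E, folding start/end into the transition matrix, and 0-indexed time steps.

```python
import numpy as np

T = np.array([           # rows: start, N, V; columns: N, V, end
    [0.70, 0.20, 0.10],
    [0.15, 0.80, 0.05],
    [0.60, 0.35, 0.05],
])
E = np.array([           # rows: N, V; columns: w1..w4
    [0.70, 0.20, 0.05, 0.05],
    [0.20, 0.60, 0.10, 0.10],
])
obs = [0, 1, 2, 3]       # the sequence w1 w2 w3 w4
n, K = len(obs), 2

# Forward pass: alpha[i, s] = p(w_1..w_i, z_i = s)   (i is 0-indexed here)
alpha = np.zeros((n, K))
alpha[0] = T[0, :K] * E[:, obs[0]]
for i in range(1, n):
    alpha[i] = (alpha[i - 1] @ T[1:, :K]) * E[:, obs[i]]

# Backward pass: beta[i, s] = p(w_{i+1}..w_n, end | z_i = s)
beta = np.zeros((n, K))
beta[-1] = T[1:, K]      # last step: transition into end
for i in range(n - 2, -1, -1):
    beta[i] = T[1:, :K] @ (E[:, obs[i + 1]] * beta[i + 1])

L = alpha[-1] @ beta[-1]           # marginal likelihood p(w, end)

# E-step: accumulate expected counts
gamma = alpha * beta / L           # gamma[i, s] = p(z_i = s | w_1..w_n)
c_obs = np.zeros_like(E)
for i, w in enumerate(obs):
    c_obs[:, w] += gamma[i]

c_trans = np.zeros((K, K))         # expected N/V -> N/V transition counts
for i in range(n - 1):
    c_trans += (alpha[i][:, None] * T[1:, :K]
                * (E[:, obs[i + 1]] * beta[i + 1])[None, :]) / L

print(L)        # likelihood of the observed sequence
print(gamma)    # posterior state probabilities; each row sums to 1
print(c_trans)  # expected transition counts for the M-step
```

A quick invariant worth asserting: each row of gamma sums to 1, because alpha[i] · beta[i] equals L at every step.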
Why Do We Need Backward Values?

[Trellis diagram: three states A, B, C at steps i−1, i, i+1, highlighting the paths that pass through state B at step i, with labels α(i, B) and β(i, B).]

α(i, s) is the total probability of all paths:
1. that start from the beginning
2. that end (currently) in s at step i
3. that emit the observation obs at i

β(i, s) is the total probability of all paths:
1. that start at step i at state s
2. that terminate at the end
3. (that emit the observation obs at i+1)

α(i, B) · β(i, B) = total probability of paths through state B at step i

Normalizing by the marginal likelihood turns these into posterior state probabilities.
Why Do We Need Backward Values?

[Trellis diagram: the same three states, now highlighting the B → s' arc between steps i and i+1, with labels α(i, B) and β(i+1, s').]

α(i, B) · p(s' | B) · p(obs at i+1 | s') · β(i+1, s') = total probability of paths through the B → s' arc (at time i)

Normalizing by the marginal likelihood turns these into posterior transition probabilities.
With Both Forward and Backward Values

α(i, s) · β(i, s) = total probability of paths through state s at step i:

   p(z_i = s | w_1, …, w_N) = α(i, s) · β(i, s) / α(N+1, END)

α(i, s) · p(s' | s) · p(obs at i+1 | s') · β(i+1, s') = total probability of paths through the s → s' arc (at time i):

   p(z_i = s, z_{i+1} = s' | w_1, …, w_N) = α(i, s) · p(s' | s) · p(obs_{i+1} | s') · β(i+1, s') / α(N+1, END)
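A useful sanity check that follows from these identities (not stated on the slide): summing α(i, s) · β(i, s) over states gives the same marginal likelihood at every time step,

```latex
\sum_{s} \alpha(i, s)\,\beta(i, s) \;=\; p(w_1, \dots, w_N) \;=\; \alpha(N{+}1, \textsc{end})
\qquad \text{for every } i
```

which makes a handy invariant to assert when implementing forward-backward.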
Agenda
- HMM Detailed Definition
- HMM Parameter Estimation
- EM for HMMs
  - General Approach
  - Expectation Calculation
HMM Expectation Calculation

p(z_1, w_1, z_2, w_2, …, z_N, w_N) = p(z_1 | z_0) p(w_1 | z_1) ⋯ p(z_N | z_{N-1}) p(w_N | z_N)
                                   = Π_i p(w_i | z_i) p(z_i | z_{i-1})

p(w_i | z_i): emission probabilities/parameters; p(z_i | z_{i-1}): transition probabilities/parameters.

Tasks:
- Calculate the forward (log-)likelihood of an observed (sub-)sequence w_1, …, w_J
- Calculate the backward (log-)likelihood of an observed (sub-)sequence w_{J+1}, …, w_N
HMM Likelihood Task

Marginalize over the joint likelihoods of all latent sequences:

   p(w_1, w_2, …, w_N) = Σ_{z_1, …, z_N} p(z_1, w_1, z_2, w_2, …, z_N, w_N)

Q: In a K-state HMM, for a length-N observation sequence, how many summands (different latent sequences) are there?

A: K^N

Goal: find a way to compute this exponential sum efficiently (in polynomial time).
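For the two-state toy model this exponential sum is still small enough to enumerate. A sketch that makes the K^N = 2^4 = 16 summands explicit, reusing the trans/emit tables and joint_prob helper from the earlier likelihood sketch (so not self-contained on its own):

```python
from itertools import product

states, words = ["N", "V"], ["w1", "w2", "w3", "w4"]
# Include the final transition into end so the total is the full p(w, end)
total = sum(joint_prob(list(tags), words) * trans[tags[-1]]["end"]
            for tags in product(states, repeat=len(words)))
print(total)  # marginal likelihood, summed over all 16 latent paths
```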
2-State HMM Likelihood

[Trellis diagram: the two latent sequences from before, N V V N and N V N N, which share their first two steps.]

Up until here (time step 2), all the computation was the same. Let's reuse what computations we can.

Solution: pass information "forward" in the graph, e.g., from time step 2 to 3…

Issue: these highlighted paths are only 2 of the 16 possible paths through the trellis.

Solution: marginalize out all information from previous timesteps.
Reusing Computation

[Trellis diagram: states A, B, C at steps i−2, i−1, i; the arcs A → B, B → B, and C → B converge on state B at step i.]

Let's first consider "any shared path ending with B (AB, BB, or CB) → B".

Assume that all necessary information has been computed and stored in α(i−1, A), α(i−1, B), α(i−1, C).

Marginalize (sum) across the previous timestep's possible states:

   α(i, B) = Σ_s α(i−1, s) · p(B | s) · p(obs at i | B)

Computing α at time i−1 will correctly incorporate paths through time i−2: we correctly obey the Markov property.
Forward Probability

   α(i, s) = Σ_{s'} α(i−1, s') · p(s | s') · p(obs at i | s)

- Σ_{s'}: what are the immediate ways to get into state s?
- α(i−1, s'): what's the total probability up until now?
- p(s | s') · p(obs at i | s): how likely is it to get into state s this way?

α(i, s) is the total probability of all paths:
1. that start from the beginning
2. that end (currently) in s at step i
3. that emit the observation obs at i
Forward Algorithm

α: a 2D table, (N+2) × K*
- N+2: number of observations (+2 for the BOS & EOS symbols)
- K*: number of states
Use dynamic programming to build α left-to-right.
Forward Algorithm

```
α = double[N+2][K*]
α[0][*] = 0.0
α[0][START] = 1.0

for (i = 1; i ≤ N+1; ++i) {
  for (state = 0; state < K*; ++state) {
    p_obs = p_emission(obs_i | state)   // we still need to learn these
    for (old = 0; old < K*; ++old) {    // (EM if not observed)
      p_move = p_transition(state | old)
      α[i][state] += α[i-1][old] * p_obs * p_move
    }
  }
}
```
Forward Algorithm

Q: What do we return? (How do we return the likelihood of the sequence?)

A: α[N+1][END]
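For concreteness, a direct Python transcription of the pseudocode might look like the following (a sketch: the dict-based tables and the convention that END emits nothing at step N+1 are our own choices). The returned value matches the brute-force sum over all K^N paths shown earlier, since both include the transition into end.

```python
STATES = ["start", "N", "V", "end"]   # K* = 4
p_transition = {("start", "N"): 0.7, ("start", "V"): 0.2, ("start", "end"): 0.1,
                ("N", "N"): 0.15, ("N", "V"): 0.8, ("N", "end"): 0.05,
                ("V", "N"): 0.6, ("V", "V"): 0.35, ("V", "end"): 0.05}
p_emission = {("N", "w1"): 0.7, ("N", "w2"): 0.2, ("N", "w3"): 0.05, ("N", "w4"): 0.05,
              ("V", "w1"): 0.2, ("V", "w2"): 0.6, ("V", "w3"): 0.1, ("V", "w4"): 0.1}

obs = [None, "w1", "w2", "w3", "w4", None]   # 1-indexed; slots 0 and N+1 are BOS/EOS
N = 4

alpha = {(0, s): 0.0 for s in STATES}
alpha[(0, "start")] = 1.0
for i in range(1, N + 2):
    for state in STATES:
        # EOS step: "end" emits nothing; otherwise look up the emission prob
        p_obs = 1.0 if (i == N + 1 and state == "end") \
                    else p_emission.get((state, obs[i]), 0.0)
        alpha[(i, state)] = sum(alpha[(i - 1, old)] * p_obs
                                * p_transition.get((old, state), 0.0)
                                for old in STATES)

print(alpha[(N + 1, "end")])   # likelihood of the sequence
```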
Interactive HMM Example

https://goo.gl/rbHEoc (Jason Eisner, 2002)
Original: http://www.cs.jhu.edu/~jason/465/PowerPoint/lect24-hmm.xls
Forward Algorithm in Log-Space

```
α = double[N+2][K*]
α[0][*] = -∞
α[0][START] = 0.0

for (i = 1; i ≤ N+1; ++i) {
  for (state = 0; state < K*; ++state) {
    p_obs = log p_emission(obs_i | state)
    for (old = 0; old < K*; ++old) {
      p_move = log p_transition(state | old)
      α[i][state] = logadd(α[i][state], α[i-1][old] + p_obs + p_move)
    }
  }
}
```

where

   logadd(la, lb) = la + log(1 + exp(lb − la))   if la ≥ lb
                    lb + log(1 + exp(la − lb))   otherwise

(see scipy.special.logsumexp; older SciPy versions shipped it as scipy.misc.logsumexp)
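A sketch of logadd in Python, alongside the library equivalent (the -∞ guard handles the empty-cell initialization above):

```python
import math
import numpy as np
from scipy.special import logsumexp

def logadd(la, lb):
    """log(exp(la) + exp(lb)), computed stably."""
    if la == -math.inf:          # adding into an empty (-inf) cell
        return lb
    if la < lb:
        la, lb = lb, la          # keep the larger value outside the exp
    return la + math.log1p(math.exp(lb - la))

print(logadd(math.log(0.2), math.log(0.3)))   # == log(0.5)
print(logsumexp(np.log([0.2, 0.3])))          # same thing, vectorized
```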