Deep Learning
Sequence to Sequence models: Connectionist Temporal Classification
5 March 2018

Sequence-to-sequence modelling
• Problem: A sequence goes in; a different sequence comes out


  1.–4. The actual output of the network
  [Figure: a grid of per-frame output probabilities, one row per symbol (/AH/ /B/ /D/ /EH/ /IY/ /F/ /G/) and one column per time step t = 0 ... 8, over inputs X_0 ... X_8.]
  • Option 1: Simply select the most probable symbol at each time
    – Merge adjacent repeated symbols, and place the actual emission of the symbol in the final instant
    – Problem: we cannot distinguish between an extended symbol and repetitions of the symbol
    – Problem: the resulting sequence may be meaningless (what word is "GFIYD"?)
  • Option 2: Impose external constraints on what sequences are allowed
    – E.g. only allow sequences corresponding to dictionary words
    – E.g. sub-symbol units (like in HW1 – what were they?)
  • We will refer to the process of obtaining an output from the network as decoding
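Option 1 can be sketched in a few lines of NumPy. This is a toy illustration, not code from the slides; the row ordering in `SYMBOLS` and the example probabilities are assumptions.

```python
import numpy as np

# Assumed row ordering for the symbols in the figure (an assumption,
# not taken from the slides).
SYMBOLS = ["AH", "B", "D", "EH", "IY", "F", "G"]

def greedy_decode(y):
    """Option 1: pick the most probable symbol at every time step, then
    merge adjacent repeats.  y is a (T, num_symbols) array of per-frame
    output probabilities."""
    best = np.argmax(y, axis=1)            # most probable symbol per frame
    merged = []
    for idx in best:
        if not merged or merged[-1] != idx:
            merged.append(int(idx))        # merge adjacent repetitions
    return [SYMBOLS[i] for i in merged]

# Toy example: 4 frames in which /B/ and then /IY/ dominate
y = np.array([[.05, .60, .05, .05, .10, .10, .05],
              [.05, .55, .05, .05, .15, .10, .05],
              [.05, .10, .05, .05, .60, .10, .05],
              [.05, .10, .05, .05, .60, .10, .05]])
print(greedy_decode(y))   # ['B', 'IY']
```

Note that, exactly as the slide warns, this merging step cannot tell an extended /IY/ apart from two consecutive /IY/s.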

  5. The sequence-to-sequence problem
  [Figure: the network emits /B/ /IY/ /F/ /IY/ at some of the steps over inputs X_0 ... X_9.]
  • How do we know when to output symbols?
    – In fact, the network produces outputs at every time
    – Which of these are the real outputs?
  • How do we train these models?

  6. Training
  [Figure: target phonemes /B/ /IY/ /F/ /IY/ positioned over inputs X_0 ... X_9.]
  • Given output symbols at the right locations
    – The phoneme /B/ ends at X_2, /IY/ at X_4, /F/ at X_6, /IY/ at X_9

  7.–8. Training: defining the divergence
  [Figure: Div computed against Y_2, Y_4, Y_6, Y_9 only, or against every one of Y_0 ... Y_9.]
  • Either just define the divergence as:
    DIV = Xent(Y_2, B) + Xent(Y_4, IY) + Xent(Y_6, F) + Xent(Y_9, IY)
  • Or repeat the symbols over their duration:
    DIV = Σ_t Xent(Y_t, symbol_t) = −Σ_t log y_t(symbol_t)

  9. The gradient of the divergence
  [Figure: repeated-symbol targets /B/ /IY/ /F/ /IY/ with Div at every time step over Y_0 ... Y_9.]
  DIV = Σ_t Xent(Y_t, symbol_t) = −Σ_t log y_t(symbol_t)
  • The gradient w.r.t. the t-th output vector Y_t:
    ∇_{Y_t} DIV = [0 ... 0, −1/y_t(symbol_t), 0 ... 0]
    – Zeros except at the component corresponding to the target
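With a fully specified alignment, the repeated-symbol divergence and its gradient can be sketched as follows. This is a minimal NumPy illustration; the 2-symbol example values are made up.

```python
import numpy as np

def aligned_divergence(y, alignment):
    """DIV = -sum_t log y[t, symbol_t], with symbols repeated over their
    duration.  alignment[t] is the target symbol index at time t."""
    return -float(sum(np.log(y[t, s]) for t, s in enumerate(alignment)))

def grad_wrt_outputs(y, alignment):
    """Gradient of DIV w.r.t. the t-th output vector: zero everywhere
    except -1 / y[t, symbol_t] at the target component."""
    g = np.zeros_like(y)
    for t, s in enumerate(alignment):
        g[t, s] = -1.0 / y[t, s]
    return g

# Toy example: 2 frames, 2 symbols
y = np.array([[0.50, 0.50],
              [0.25, 0.75]])
alignment = [0, 1]                 # symbol 0 at t=0, symbol 1 at t=1
d = aligned_divergence(y, alignment)   # -(ln 0.5 + ln 0.75) ≈ 0.9808
g = grad_wrt_outputs(y, alignment)     # g[0] = [-2, 0], g[1] = [0, -4/3]
```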

  10. Problem: No timing information provided
  [Figure: targets /B/ /IY/ /F/ /IY/ with unknown positions ("?") over outputs Y_0 ... Y_9 and inputs X_0 ... X_9.]
  • Only the sequence of output symbols is provided for the training data
    – But no indication of which one occurs where
  • How do we compute the divergence?
    – And how do we compute its gradient w.r.t. Y_t?

  11. Solution 1: Guess the alignment
  [Figure: an initial guessed alignment of /B/ /IY/ /F/ /IY/ over the ten inputs.]
  • Initialize: Assign an initial alignment
    – Either randomly, based on some heuristic, or any other rationale
  • Iterate:
    – Train the network using the current alignment
    – Re-estimate the alignment for each training instance
      • Using the decoding methods already discussed


  13. Estimating an alignment
  • Given:
    – The unaligned K-length symbol sequence S = S_0 ... S_{K−1} (e.g. /B/ /IY/ /F/ /IY/)
    – An N-length input (N ≥ K)
    – And a (trained) recurrent network
  • Find:
    – An N-length expansion s_0 ... s_{N−1} comprising the symbols of S in strict order
      • e.g. S_0 S_1 S_1 S_2 S_3 S_3 ... S_{K−1}, i.e. s_0 = S_0, s_1 = s_2 = S_1, s_3 = S_2, s_4 = s_5 = S_3, ..., s_{N−1} = S_{K−1}
      • E.g. /B/ /B/ /IY/ /IY/ /IY/ /F/ /F/ /F/ /F/ /IY/ ...
    – s_i = S_k ⇒ i ≥ k
    – s_i = S_k, s_j = S_m, i < j ⇒ k ≀ m
  • Outcome: an alignment of the target symbol sequence S_0 ... S_{K−1} to the input X_0 ... X_{N−1}
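The "Find" conditions above amount to saying that the expansion must contract back to S when adjacent repeats are merged, which is easy to check. A small helper sketch, not from the slides:

```python
def is_valid_expansion(s, S):
    """True if the frame-level sequence s is an order-preserving expansion
    of the symbol sequence S, i.e. merging adjacent repeats in s yields
    exactly S."""
    merged = []
    for x in s:
        if not merged or merged[-1] != x:
            merged.append(x)
    return merged == list(S)

print(is_valid_expansion(["B", "B", "IY", "IY", "F", "IY"],
                         ["B", "IY", "F", "IY"]))   # True
```

Note that this check alone does not handle repeated symbols in S that should stay separate in the output; that is exactly the issue the blank symbol of full CTC later addresses.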

  14. Recall: The actual output of the network
  [Figure: the grid of per-frame probabilities y_t(symbol) for /AH/ /B/ /D/ /EH/ /IY/ /F/ /G/ over times t = 0 ... 8.]
  • At each time the network outputs a probability for each output symbol

  15. Recall: unconstrained decoding
  [Figure: the same grid with the most probable symbol picked at each time.]
  • We find the most likely sequence of symbols
    – (Conditioned on input X_0 ... X_{N−1})
  • This may not correspond to an expansion of the desired symbol sequence
    – E.g. the unconstrained decode may be /AH/ /AH/ /AH/ /D/ /D/ /AH/ /F/ /IY/ /IY/
      • Contracts to /AH/ /D/ /AH/ /F/ /IY/
    – Whereas we want an expansion of /B/ /IY/ /F/ /IY/

  16. Constraining the alignment: Try 1
  [Figure: the full grid with all rows other than /B/, /IY/ and /F/ blocked out.]
  • Block out all rows that do not include symbols from the target sequence
    – E.g. block out rows that are not /B/, /IY/ or /F/

  17. Blocking out unnecessary outputs
  [Figure: a reduced grid with only the /B/, /IY/ and /F/ rows over times t = 0 ... 8.]
  • Compute the entire output (for all symbols)
  • Copy the output values for the target symbols into the secondary reduced structure
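The copy step can be sketched as a NumPy slice. This is an illustration; `target_rows` is a hypothetical list of row indices for the target symbols.

```python
import numpy as np

def reduced_grid(y, target_rows):
    """Copy the output values for the target symbols into the secondary
    reduced structure: row k of the result is the probability trajectory
    of the k-th target symbol.  y is (T, num_symbols)."""
    return y[:, target_rows].T        # shape (K, T)

y = np.arange(12.0).reshape(3, 4)     # toy grid: 3 frames, 4 symbols
r = reduced_grid(y, [1, 3, 2])        # keep, say, the /B/ /IY/ /F/ rows
```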

  18.–19. Constraining the alignment: Try 1
  [Figure: decoding on the reduced /B/ /IY/ /F/ grid.]
  • Only decode on the reduced grid
    – We are now assured that only the appropriate symbols will be hypothesized
  • Problem: This still doesn't assure that the decoded sequence correctly expands the target symbol sequence
    – E.g. the above decode is not an expansion of /B/ /IY/ /F/ /IY/
  • Still needs additional constraints

  20.–21. Try 2: Explicitly arrange the constructed table
  [Figure: the reduced grid with rows ordered /B/, /IY/, /F/, /IY/ from top to bottom.]
  • Arrange the constructed table so that from top to bottom it has the exact sequence of symbols required
  • Note: If a symbol occurs multiple times, we repeat the row in the appropriate location
    – E.g. the row for /IY/ occurs twice, in the 2nd and 4th positions

  22. Explicitly constrain alignment
  [Figure: the /B/ /IY/ /F/ /IY/ rows with a monotonic path from top left to bottom right.]
  • Constrain that the first symbol in the decode must be the top left block
  • The last symbol must be the bottom right
  • The rest of the symbols must follow a sequence that monotonically travels down from top left to bottom right
    – I.e. never goes up
  • This guarantees that the sequence is an expansion of the target sequence
    – /B/ /IY/ /F/ /IY/ in this case

  23. Explicitly constrain alignment
  [Figure: the alignment graph over the /B/ /IY/ /F/ /IY/ rows.]
  • Compose a graph such that every path in the graph from source to sink represents a valid alignment
    – Which maps on to the target symbol sequence (/B/ /IY/ /F/ /IY/ here)
  • Edge scores are 1
  • Node scores are the probabilities assigned to the symbols by the neural network
  • The "score" of a path is the product of the probabilities of all nodes along the path
  • Find the most probable path from source to sink using any dynamic programming algorithm
    – E.g. the Viterbi algorithm
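The path score described above (edge scores of 1, node scores from the network, so a path's score is a product of node probabilities) can be sketched directly. Toy values, not from the slides:

```python
import numpy as np

def path_score(y, S, align):
    """Score of one alignment path through the graph: the product of the
    network probabilities y[t, S[r]] of the nodes on the path.
    S lists the target rows' symbol indices; align[t] is the row at time t."""
    return float(np.prod([y[t, S[r]] for t, r in enumerate(align)]))

y = np.array([[0.9, 0.1],
              [0.5, 0.5],
              [0.2, 0.8]])
S = [0, 1]                            # two target rows, e.g. /B/ then /IY/
print(path_score(y, S, [0, 0, 1]))    # 0.9 * 0.5 * 0.8 = 0.36
```

Enumerating all monotonic paths and taking the best one is exactly what the Viterbi recursion on the following slides does without the exponential cost.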

  24. Viterbi algorithm
  [Figure: the alignment graph with best incoming edges marked.]
  • At each node, keep track of
    – The best incoming edge
    – The score of the best path from the source to the node
  • Dynamically compute the best path from source to sink

  25. Viterbi algorithm
  [Figure: the /B/ /IY/ /F/ /IY/ grid, t = 0 ... 8.]
  • First, some notation: S(r) is the target symbol assigned to the r-th row, and y_t^{S(r)} is the probability the network assigns to that symbol at time t (given inputs X_0 ... X_t)
    – E.g. S(0) = /B/: the scores in the 0th row have the form y_t^B
    – E.g. S(1) = S(3) = /IY/: the scores in the 1st and 3rd rows have the form y_t^{IY}
    – E.g. S(2) = /F/: the scores in the 2nd row have the form y_t^F

  26. Viterbi algorithm
  [Figure: the /B/ /IY/ /F/ /IY/ grid, t = 0 ... 8.]
  • Initialization:
    – BP(0, l) = null, for l = 0 ... K−1
    – Bscr(0, 0) = y_0^{S(0)};  Bscr(0, l) = −∞, for l = 1 ... K−1
  • for t = 1 ... T−1
      BP(t, 0) = 0;  Bscr(t, 0) = Bscr(t−1, 0) × y_t^{S(0)}
      for l = 1 ... K−1
        BP(t, l) = l − 1 if Bscr(t−1, l−1) > Bscr(t−1, l), else l
        Bscr(t, l) = Bscr(t−1, BP(t, l)) × y_t^{S(l)}


  36.–38. Viterbi algorithm: backtrace
  [Figure: the best path traced back through the /B/ /IY/ /F/ /IY/ grid.]
  • r(T−1) = K−1 (the path must end in the last row)
  • for t = T−1 downto 1
      r(t−1) = BP(t, r(t))
  • The aligned symbol at time t is S(r(t)); here the resulting alignment is /B/ /B/ /IY/ /F/ /F/ /IY/ /IY/ /IY/ /IY/
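The BP/Bscr recursion and the backtrace above can be sketched together in NumPy. A sketch under the stated recursion; it works in log space to avoid numerical underflow, whereas the slides multiply raw probabilities.

```python
import numpy as np

def viterbi_align(y, S):
    """Most probable monotonic alignment of the ordered target rows S
    (indices into the network's symbol set) to the T frames of y,
    following the BP / Bscr recursion but in log space."""
    T, K = y.shape[0], len(S)
    logy = np.log(y)
    bscr = np.full((T, K), -np.inf)     # best score of a path ending at (t, l)
    bp = np.zeros((T, K), dtype=int)    # backpointer: best previous row
    bscr[0, 0] = logy[0, S[0]]
    for t in range(1, T):
        bscr[t, 0] = bscr[t - 1, 0] + logy[t, S[0]]
        for l in range(1, K):
            bp[t, l] = l - 1 if bscr[t - 1, l - 1] > bscr[t - 1, l] else l
            bscr[t, l] = bscr[t - 1, bp[t, l]] + logy[t, S[l]]
    # backtrace: the path must end in the last row
    rows = [K - 1]
    for t in range(T - 1, 0, -1):
        rows.append(int(bp[t, rows[-1]]))
    return rows[::-1]                   # row index r(t) for every frame

# Toy check: two rows, the first symbol dominates early, the second late
a = viterbi_align(np.array([[.9, .1], [.9, .1], [.1, .9], [.1, .9]]), [0, 1])
print(a)   # [0, 0, 1, 1]
```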

  39. Gradients from the alignment
  [Figure: the estimated alignment /B/ /B/ /IY/ /F/ /F/ /IY/ /IY/ /IY/ /IY/ marked on the grid.]
  DIV = Σ_t Xent(Y_t, symbol_t) = −Σ_t log y_t(symbol_t), where symbol_t is taken from the estimated best-path alignment
  • The gradient w.r.t. the t-th output vector Y_t:
    ∇_{Y_t} DIV = [0 ... 0, −1/y_t(symbol_t), 0 ... 0]
    – Zeros except at the component corresponding to the target in the estimated alignment

  40. Iterative estimate and training
  [Figure: loop over — initialize alignments; train model with given alignments; decode to obtain new alignments.]
  • The "decode" and "train" steps may be combined into a single "decode, find alignment, compute derivatives" step for SGD and mini-batch updates

  41. Iterative update
  • Option 1:
    – Determine alignments for every training instance
    – Train the model (using SGD or your favorite approach) on the entire training set
    – Iterate
  • Option 2:
    – During SGD, for each training instance, find the alignment during the forward pass
    – Use it in the backward pass

  42. Iterative update: Problem
  • The approach is heavily dependent on the initial alignment
  • Prone to poor local optima
  • Alternate solution: do not commit to an alignment during any pass...

  43.–44. The reason for suboptimality
  [Figure: the /B/ /IY/ /F/ /IY/ grid with one selected path.]
  • We commit to the single "best" estimated alignment
    – The most likely alignment:  DIV = −Σ_t log y_t(symbol_t^{bestpath})
    – This can be way off, particularly in early iterations, or if the model is poorly initialized
  • Alternate view: there is a probability distribution over alignments of the target symbol sequence (to the input)
    – Selecting a single alignment is the same as drawing a single sample from it
    – Selecting the most likely alignment is the same as deterministically always drawing the most probable value from the distribution

  45. Averaging over all alignments
  [Figure: the grid with many possible paths, t = 0 ... 8.]
  • Instead of only selecting the most likely alignment, use the statistical expectation over all possible alignments:
    DIV = E[−Σ_t log y_t(s_t)]
    – Use the entire distribution of alignments
    – This will mitigate the issue of suboptimal selection of alignment

  46.–47. The expectation over all alignments
  [Figure: the grid with many possible paths, t = 0 ... 8.]
  DIV = E[−Σ_t log y_t(s_t)]
  • Using the linearity of expectation:
    DIV = −Σ_t E[log y_t(s_t)]
    – This reduces to finding the expected divergence at each input:
    DIV = −Σ_t Σ_r P(s_t = S_r | S, X) log y_t(S_r)
  • P(s_t = S_r | S, X) is the probability of seeing the specific symbol S_r at time t, given that the symbol sequence is an expansion of S = S_0 ... S_{K−1} and given the input sequence X = X_0 ... X_{N−1}
    – We need to be able to compute this
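Given the posteriors P(s_t = S_r | S, X), the expected divergence is a posterior-weighted cross-entropy. A minimal illustration; `gamma` here is a hypothetical posterior table, and the 2-frame values are made up.

```python
import numpy as np

def expected_divergence(y, S, gamma):
    """DIV = -sum_t sum_r P(s_t = S_r | S, X) * log y[t, S[r]],
    where gamma[t, r] is the posterior alignment probability."""
    T, K = gamma.shape
    return -float(sum(gamma[t, r] * np.log(y[t, S[r]])
                      for t in range(T) for r in range(K)))

# With a one-hot gamma this reduces to the single-alignment divergence
y = np.array([[0.50, 0.50],
              [0.25, 0.75]])
gamma = np.array([[1.0, 0.0],
                  [0.0, 1.0]])
d = expected_divergence(y, [0, 1], gamma)   # -(ln 0.5 + ln 0.75)
```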

  48. A posteriori probabilities of symbols
  [Figure: the /B/ /IY/ /F/ /IY/ grid, t = 0 ... 8.]
  • P(s_t = S_r | S, X) ∝ P(s_t = S_r, S | X)
  • P(s_t = S_r, S | X) is the total probability of all valid paths in the graph for target sequence S that go through the symbol S_r (the r-th symbol in the sequence S_0 ... S_{K−1}) at time t
  • We will compute this using the "forward-backward" algorithm

  49. A posteriori probabilities of symbols
  [Figure: the /B/ /IY/ /F/ /IY/ grid, t = 0 ... 8.]
  • Decompose P(s_t = S_r, S | X) as follows:
    P(s_t = S_r, S | X) = Σ_{s_0...s_{t−1} → S_0...[S_{r−}]} Σ_{s_{t+1}...s_{N−1} → [S_{r+}]...S_{K−1}} P(s_0 ... s_{t−1}, s_t = S_r, s_{t+1} ... s_{N−1}, S | X)
  • [S_{r+}] indicates that s_{t+1} may be either S_r or S_{r+1}
  • [S_{r−}] indicates that s_{t−1} may be either S_r or S_{r−1}
    = Σ_{s_0...s_{t−1} → S_0...[S_{r−}]} Σ_{s_{t+1}...s_{N−1} → [S_{r+}]...S_{K−1}} P(s_0 ... s_{t−1}, s_t = S_r, s_{t+1} ... s_{N−1} | X)
    – Because the target symbol sequence S is implicit in the time-synchronized sequences s_0 ... s_{N−1}, which are constrained to be expansions of S

  50.–51. A posteriori probabilities of symbols
  [Figure: the /B/ /IY/ /F/ /IY/ grid, t = 0 ... 8.]
  P(s_t = S_r, S | X)
    = Σ_{s_0...s_{t−1} → S_0...[S_{r−}]} Σ_{s_{t+1}...s_{N−1} → [S_{r+}]...S_{K−1}} P(s_0 ... s_{t−1}, s_t = S_r | X) · P(s_{t+1} ... s_{N−1} | s_0 ... s_{t−1}, s_t = S_r, X)
  • For a recurrent network without feedback from the output we can make the conditional independence assumption:
    P(s_{t+1} ... s_{N−1} | s_0 ... s_t, X) = P(s_{t+1} ... s_{N−1} | X)
    which gives
    P(s_t = S_r, S | X) = Σ_{s_0...s_{t−1} → S_0...[S_{r−}]} Σ_{s_{t+1}...s_{N−1} → [S_{r+}]...S_{K−1}} P(s_0 ... s_{t−1}, s_t = S_r | X) · P(s_{t+1} ... s_{N−1} | X)
  • Note: in reality, this assumption is not valid if the hidden states are unknown, but we will make it anyway

  52. Conditional independence
  [Figure: dependency graph — the input sequence X feeds the hidden states h_0 h_1 ... h_{N−1}, each of which feeds its own output y_t.]
  • Dependency graph: the input sequence X = X_0 X_1 ... X_{N−1} governs the hidden variables H = h_0 h_1 ... h_{N−1}
  • The hidden variables govern the output predictions y_0, y_1, ... y_{N−1} individually
  • y_0, y_1, ... y_{N−1} are conditionally independent given H
  • Since H is deterministically derived from X, y_0, y_1, ... y_{N−1} are also conditionally independent given X
    – This wouldn't be true if the relation between X and H were not deterministic or if X is unknown

  53.–54. A posteriori probabilities of symbols
  [Figure: the /B/ /IY/ /F/ /IY/ grid, t = 0 ... 8.]
  P(s_t = S_r, S | X)
    = Σ_{s_0...s_{t−1} → S_0...[S_{r−}]} Σ_{s_{t+1}...s_{N−1} → [S_{r+}]...S_{K−1}} P(s_0 ... s_{t−1}, s_t = S_r | X) · P(s_{t+1} ... s_{N−1} | X)
    = [ Σ_{s_0...s_{t−1} → S_0...[S_{r−}]} P(s_0 ... s_{t−1}, s_t = S_r | X) ] · [ Σ_{s_{t+1}...s_{N−1} → [S_{r+}]...S_{K−1}} P(s_{t+1} ... s_{N−1} | X) ]

  55. The expectation over all alignments
  [Figure: the /B/ /IY/ /F/ /IY/ grid, t = 0 ... 8.]
  P(s_t = S_r, S | X) = [ Σ_{s_0...s_{t−1} → S_0...[S_{r−}]} P(s_0 ... s_{t−1}, s_t = S_r | X) ] · [ Σ_{s_{t+1}...s_{N−1} → [S_{r+}]...S_{K−1}} P(s_{t+1} ... s_{N−1} | X) ]
  • We will call the first term the forward probability α(t, r)
  • We will call the second term the backward probability β(t, r)
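Assuming α and β have already been computed, the posterior P(s_t = S_r | S, X) is proportional to their product. A minimal sketch; normalizing per time step is one assumed way of resolving the proportionality, since the valid rows at each t must sum to one.

```python
import numpy as np

def alignment_posteriors(alpha, beta):
    """gamma[t, r] = P(s_t = S_r | S, X), proportional to the product of
    the forward and backward probabilities alpha(t, r) * beta(t, r),
    normalized over the rows at each time step."""
    g = alpha * beta
    return g / g.sum(axis=1, keepdims=True)

# Toy forward/backward tables for 2 frames and 2 rows
alpha = np.array([[2.0, 0.0],
                  [1.0, 3.0]])
beta = np.ones((2, 2))
gamma = alignment_posteriors(alpha, beta)   # rows: [1, 0] and [0.25, 0.75]
```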

  56. Forward algorithm
  [Figure: the /B/ /IY/ /F/ /IY/ grid, t = 0 ... 8.]
  α(t, r) = Σ_{s_0...s_{t−1} → S_0...[S_{r−}]} P(s_0 ... s_{t−1}, s_t = S_r | X)
    = Σ_{s_0...s_{t−1} → S_0...[S_{r−}]} P(s_0 ... s_{t−1} | X) · P(s_t = S_r | s_0 ... s_{t−1}, X)
    = Σ_{s_0...s_{t−1} → S_0...[S_{r−}]} P(s_0 ... s_{t−1} | X) · P(s_t = S_r | X)
    = [ Σ_{s_0...s_{t−2} → S_0...[S_{r−}]} P(s_0 ... s_{t−2}, s_{t−1} = S_r | X) + Σ_{s_0...s_{t−2} → S_0...[S_{(r−1)−}]} P(s_0 ... s_{t−2}, s_{t−1} = S_{r−1} | X) ] · P(s_t = S_r | X)


61. Forward algorithm

[Figure: trellis with arrows showing that each node α(t, r) is reached from α(t−1, r) and α(t−1, r−1)]

α(t, r) = ( α(t−1, r) + α(t−1, r−1) ) · y_t^{S(r)}

62. Forward algorithm

[Figure: trellis of network outputs y_t^{S(r)} for /B/ /IY/ /F/ /IY/, t = 0…8]

• Initialization: α(0, 1) = y_0^{S(1)}; α(0, r) = 0 for r > 1
• for t = 1…T−1:
    α(t, 1) = α(t−1, 1) · y_t^{S(1)}
    for l = 2…K:
        α(t, l) = ( α(t−1, l) + α(t−1, l−1) ) · y_t^{S(l)}

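The forward recursion translates directly into code. Below is a minimal NumPy sketch, assuming y is a T×V matrix of per-frame symbol posteriors and S is the list of K target symbol indices; these names and the 0-based indexing are choices of this sketch, not from the slides.

```python
import numpy as np

def ctc_forward(y, S):
    """Forward probabilities alpha(t, r) for a target symbol sequence S.

    y: (T, V) array of per-frame symbol posteriors y_t.
    S: list of K symbol indices (the target sequence).
    Returns alpha of shape (T, K), where alpha[t, r] sums the
    probabilities of all partial alignments s_0..s_t ending at S[r].
    """
    T, K = len(y), len(S)
    alpha = np.zeros((T, K))
    # Initialization: only the first target symbol can be active at t = 0.
    alpha[0, 0] = y[0, S[0]]
    for t in range(1, T):
        alpha[t, 0] = alpha[t - 1, 0] * y[t, S[0]]
        for l in range(1, K):
            # Either stay on S[l] or advance from S[l-1].
            alpha[t, l] = (alpha[t - 1, l] + alpha[t - 1, l - 1]) * y[t, S[l]]
    return alpha
```

With y set to all ones, alpha[T−1, K−1] simply counts the monotonic alignments of the K target symbols to the T frames, which is a quick way to sanity-check the recursion.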

68. In practice…

• The recursion α(t, l) = ( α(t−1, l) + α(t−1, l−1) ) · y_t^{S(l)} will generally underflow
• Instead we can do it in the log domain:
    log α(t, l) = log( e^{log α(t−1, l)} + e^{log α(t−1, l−1)} ) + log y_t^{S(l)}
  – This can be computed entirely without underflow using the log-sum-exp trick
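The log-domain recursion admits the same kind of sketch; np.logaddexp computes log(e^a + e^b) stably, which is exactly the log-sum-exp needed here. Assume log_y is a T×V matrix of per-frame log posteriors and S the list of target symbol indices (names chosen for this sketch).

```python
import numpy as np

def log_ctc_forward(log_y, S):
    """Forward recursion in the log domain to avoid underflow.

    log_y: (T, V) array of per-frame log posteriors log y_t.
    S: list of K target symbol indices.
    Returns log alpha of shape (T, K); -inf marks impossible cells.
    """
    T, K = len(log_y), len(S)
    log_alpha = np.full((T, K), -np.inf)
    log_alpha[0, 0] = log_y[0, S[0]]
    for t in range(1, T):
        log_alpha[t, 0] = log_alpha[t - 1, 0] + log_y[t, S[0]]
        for l in range(1, K):
            # Stable log(e^a + e^b); handles -inf operands correctly.
            log_alpha[t, l] = np.logaddexp(log_alpha[t - 1, l],
                                           log_alpha[t - 1, l - 1]) + log_y[t, S[l]]
    return log_alpha
```

Exponentiating log_alpha recovers the plain-domain forward probabilities whenever they are large enough to represent.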

69. Forward algorithm

[Figure: trellis of network outputs y_t^{S(r)} for /B/ /IY/ /F/ /IY/, t = 0…8]

• Initialization: α̂(0, 1) = 1; α̂(0, r) = 0 for r > 1; α(0, r) = α̂(0, r) · y_0^{S(r)} for 1 ≤ r ≤ K
• for t = 1…T−1:
    α̂(t, 1) = α(t−1, 1)
    for l = 2…K:
        α̂(t, l) = α(t−1, l) + α(t−1, l−1)
    α(t, r) = α̂(t, r) · y_t^{S(r)} for 1 ≤ r ≤ K

70. The forward probability

[Figure: trellis of network outputs y_t^{S(r)} for /B/ /IY/ /F/ /IY/, t = 0…8]

P(s_t = S_r, S | X) = α(t, r) · Σ_{s_{t+1}…s_{N−1} → S_{r+}…S_K} P(s_{t+1}…s_{N−1} | X)

• We have seen how to compute the first term, the forward probability α(t, r)
• Let's now look at the second term, the backward probability β(t, r)

73. Backward algorithm

[Figure: trellis of network outputs y_t^{S(r)} for /B/ /IY/ /F/ /IY/, t = 0…8]

β(t, r) = Σ_{s_{t+1}…s_{N−1} → S_{r+}…S_K} P(s_{t+1}…s_{N−1} | X)
        = Σ_{s_{t+2}…s_{N−1} → S_{r+}…S_K} P(s_{t+1} = S_r, s_{t+2}…s_{N−1} | X) + Σ_{s_{t+2}…s_{N−1} → S_{(r+1)+}…S_K} P(s_{t+1} = S_{r+1}, s_{t+2}…s_{N−1} | X)
        = P(s_{t+1} = S_r | X) · Σ_{s_{t+2}…s_{N−1} → S_{r+}…S_K} P(s_{t+2}…s_{N−1} | s_{t+1} = S_r, X) + P(s_{t+1} = S_{r+1} | X) · Σ_{s_{t+2}…s_{N−1} → S_{(r+1)+}…S_K} P(s_{t+2}…s_{N−1} | s_{t+1} = S_{r+1}, X)
        = P(s_{t+1} = S_r | X) · Σ_{s_{t+2}…s_{N−1} → S_{r+}…S_K} P(s_{t+2}…s_{N−1} | X) + P(s_{t+1} = S_{r+1} | X) · Σ_{s_{t+2}…s_{N−1} → S_{(r+1)+}…S_K} P(s_{t+2}…s_{N−1} | X)

• In the last line, P(s_{t+1} = S_r | X) = y_{t+1}^{S(r)} and P(s_{t+1} = S_{r+1} | X) = y_{t+1}^{S(r+1)}, while the two remaining sums are β(t+1, r) and β(t+1, r+1)

77. Backward algorithm

[Figure: trellis of network outputs y_t^{S(r)} for /B/ /IY/ /F/ /IY/, t = 0…8]

β(t, r) = y_{t+1}^{S(r)} · β(t+1, r) + y_{t+1}^{S(r+1)} · β(t+1, r+1)

78. Backward algorithm

[Figure: trellis of network outputs y_t^{S(r)} for /B/ /IY/ /F/ /IY/, t = 0…8]

• Initialization: β(T−1, K) = 1; β(T−1, r) = 0 for r < K
• for t = T−2 downto 0:
    β(t, K) = β(t+1, K) · y_{t+1}^{S(K)}
    for l = K−1 downto 1:
        β(t, l) = y_{t+1}^{S(l)} · β(t+1, l) + y_{t+1}^{S(l+1)} · β(t+1, l+1)

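The backward pseudocode admits the same kind of NumPy sketch, again assuming y is a T×V matrix of per-frame posteriors and S the list of K target symbol indices (a layout chosen for this sketch). Note that, per the definition used here, β(t, r) excludes frame t itself and covers only s_{t+1}…s_{T−1}.

```python
import numpy as np

def ctc_backward(y, S):
    """Backward probabilities beta(t, r).

    y: (T, V) array of per-frame symbol posteriors.
    S: list of K target symbol indices.
    beta[t, r] sums the probabilities of all suffixes s_{t+1}..s_{T-1}
    that continue from S[r] at time t and finish at S[K-1]; frame t
    itself is not included in the product.
    """
    T, K = len(y), len(S)
    beta = np.zeros((T, K))
    # Initialization: only the last target symbol may be active at T-1.
    beta[T - 1, K - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t, K - 1] = beta[t + 1, K - 1] * y[t + 1, S[K - 1]]
        for l in range(K - 2, -1, -1):
            # Either stay on S[l] at t+1 or advance to S[l+1].
            beta[t, l] = (y[t + 1, S[l]] * beta[t + 1, l]
                          + y[t + 1, S[l + 1]] * beta[t + 1, l + 1])
    return beta
```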

83. The joint probability

[Figure: trellis of network outputs y_t^{S(r)} for /B/ /IY/ /F/ /IY/, t = 0…8]

P(s_t = S_r, S | X) = α(t, r) · Σ_{s_{t+1}…s_{N−1} → S_{r+}…S_K} P(s_{t+1}…s_{N−1} | X)

• We will call the first term the forward probability α(t, r)
• We can now also compute the second term, the backward probability β(t, r)

84. The joint probability

[Figure: trellis of network outputs y_t^{S(r)} for /B/ /IY/ /F/ /IY/, t = 0…8]

P(s_t = S_r, S | X) = α(t, r) · β(t, r)

• α(t, r) is given by the forward algorithm, β(t, r) by the backward algorithm
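Combining the two passes gives a self-contained sketch (same assumed layout: y a T×V matrix of per-frame posteriors, S the target symbol indices) that computes the joint probability α(t, r)·β(t, r). With these definitions, summing the product over r yields the same total sequence probability P(S | X) at every time step, which is a convenient check on an implementation.

```python
import numpy as np

def ctc_posteriors(y, S):
    """Joint probabilities gamma[t, r] = alpha(t, r) * beta(t, r).

    y: (T, V) per-frame symbol posteriors; S: K target symbol indices.
    Since beta excludes frame t and alpha includes it, each complete
    alignment is counted exactly once, so gamma.sum(axis=1) equals
    P(S | X) at every t.
    """
    T, K = len(y), len(S)
    alpha = np.zeros((T, K))
    beta = np.zeros((T, K))
    alpha[0, 0] = y[0, S[0]]
    for t in range(1, T):            # forward pass
        alpha[t, 0] = alpha[t - 1, 0] * y[t, S[0]]
        for l in range(1, K):
            alpha[t, l] = (alpha[t - 1, l] + alpha[t - 1, l - 1]) * y[t, S[l]]
    beta[T - 1, K - 1] = 1.0
    for t in range(T - 2, -1, -1):   # backward pass (frame t excluded)
        beta[t, K - 1] = beta[t + 1, K - 1] * y[t + 1, S[K - 1]]
        for l in range(K - 2, -1, -1):
            beta[t, l] = (y[t + 1, S[l]] * beta[t + 1, l]
                          + y[t + 1, S[l + 1]] * beta[t + 1, l + 1])
    return alpha * beta
```

Checking that every column of the result sums to the same value exercises both recursions at once.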
