Introduction to Hidden Markov Models


Introduction to Hidden Markov Models
CMSC 473/673, UMBC

Recap from last time: Expectation Maximization (EM), a two-step, iterative algorithm.
0. Assume some value for your parameters.
1. E-step: count under uncertainty, assuming these parameters.
2. M-step: maximize the log-likelihood, assuming these uncertain counts.


  1. Parts of Speech
Open class words:
Nouns: cats, bread, Baltimore, UMBC, milk
Verbs: run, speak, give (transitive, intransitive, ditransitive)
Adjectives: red, large, happy, wettest; subsective vs. non-subsective (fake, would-be) (Kamp & Partee, 1995)
Adverbs: happily, recently
Numbers: 1,324; one
Closed class words:
Pronouns: I, you, there
Prepositions: top, in, under, because
Conjunctions: and, or, if
Determiners: a, the, every, what
Particles: up (as in "set up"), off, not, so (far)
Modals, auxiliaries: can, may, do (as in "do call")
Language evolves! "I'm reading this because I want to procrastinate." → "I'm reading this because procrastination."
https://www.theatlantic.com/technology/archive/2013/11/english-has-a-new-preposition-because-internet/281601/
Adapted from Luke Zettlemoyer

  2. Agenda
HMM Motivation (Part of Speech) and Brief Definition
What is Part of Speech?
HMM Detailed Definition
HMM Tasks

  3. Hidden Markov Models: Part of Speech
p(British Left Waffles on Falkland Islands)
(i): Adjective Noun Verb Prep Noun Noun
(ii): Noun Verb Noun Prep Noun Noun
Class-based model: a bigram model over all of the classes models the class sequences.
Emission term: p(w_i | z_i)

  4. Hidden Markov Models: Part of Speech
p(British Left Waffles on Falkland Islands)
(i): Adjective Noun Verb Prep Noun Noun
(ii): Noun Verb Noun Prep Noun Noun
Class-based model: a bigram model over all of the classes models the class sequences.
Transition term: p(z_i | z_{i-1}); emission term: p(w_i | z_i)

  5. Hidden Markov Models: Part of Speech
p(British Left Waffles on Falkland Islands)
(i): Adjective Noun Verb Prep Noun Noun
(ii): Noun Verb Noun Prep Noun Noun
Class-based model: a bigram model over all of the classes models the class sequences.
∑_{z_1, ..., z_N} p(z_1, w_1, z_2, w_2, ..., z_N, w_N) = ∑_{z_1, ..., z_N} ∏_i p(z_i | z_{i-1}) p(w_i | z_i)

  6. Hidden Markov Model
p(z_1, w_1, z_2, w_2, ..., z_N, w_N) = p(z_1 | z_0) p(w_1 | z_1) ⋯ p(z_N | z_{N-1}) p(w_N | z_N) = ∏_i p(w_i | z_i) p(z_i | z_{i-1})
Goal: maximize the (log-)likelihood.
In practice: we don't actually observe these z values; we just see the words w.

  7. Hidden Markov Model
p(z_1, w_1, z_2, w_2, ..., z_N, w_N) = p(z_1 | z_0) p(w_1 | z_1) ⋯ p(z_N | z_{N-1}) p(w_N | z_N) = ∏_i p(w_i | z_i) p(z_i | z_{i-1})
Goal: maximize the (log-)likelihood.
In practice: we don't actually observe these z values; we just see the words w.
If we did observe z, estimating the probability parameters would be easy… but we don't! :(
If we knew the probability parameters, then we could estimate z and evaluate the likelihood… but we don't! :(
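To make the factorization concrete, here is a minimal sketch (not from the slides) of computing this joint probability from transition and emission tables stored as nested dicts; the function name and the toy numbers in the example are illustrative assumptions:

    def hmm_joint_prob(tags, words, trans, emit, start="start"):
        """p(z_1, w_1, ..., z_N, w_N) = prod_i p(z_i | z_{i-1}) * p(w_i | z_i).

        trans[prev][cur] = p(cur | prev); emit[tag][word] = p(word | tag).
        The start symbol plays the role of z_0.
        """
        prob, prev = 1.0, start
        for z, w in zip(tags, words):
            prob *= trans[prev][z] * emit[z][w]
            prev = z
        return prob

    # Toy (made-up) parameters, just to show the call shape:
    trans = {"start": {"N": 0.7, "V": 0.3}, "N": {"N": 0.4, "V": 0.6}, "V": {"N": 0.8, "V": 0.2}}
    emit = {"N": {"cats": 0.5, "sleep": 0.5}, "V": {"cats": 0.1, "sleep": 0.9}}
    print(hmm_joint_prob(["N", "V"], ["cats", "sleep"], trans, emit))  # 0.7*0.5 * 0.6*0.9 = 0.189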

  8. Hidden Markov Model Terminology
p(z_1, w_1, z_2, w_2, ..., z_N, w_N) = p(z_1 | z_0) p(w_1 | z_1) ⋯ p(z_N | z_{N-1}) p(w_N | z_N) = ∏_i p(w_i | z_i) p(z_i | z_{i-1})
Each z_i can take the value of one of K latent states.

  9. Hidden Markov Model Terminology
(Factorization as above.) Each z_i can take the value of one of K latent states.
p(z_i | z_{i-1}) are the transition probabilities/parameters.

  10. Hidden Markov Model Terminology
(Factorization as above.) Each z_i can take the value of one of K latent states.
p(z_i | z_{i-1}) are the transition probabilities/parameters; p(w_i | z_i) are the emission probabilities/parameters.

  11. Hidden Markov Model Terminology
(Factorization as above.) Each z_i can take the value of one of K latent states.
p(z_i | z_{i-1}) are the transition probabilities/parameters; p(w_i | z_i) are the emission probabilities/parameters.
Transition and emission distributions do not change.

  12. Hidden Markov Model Terminology
(Factorization as above.) Each z_i can take the value of one of K latent states. Transition and emission distributions do not change.
Q: How many different probability values are there with K states and V vocab items?

  13. Hidden Markov Model Terminology
(Factorization as above.) Each z_i can take the value of one of K latent states. Transition and emission distributions do not change.
Q: How many different probability values are there with K states and V vocab items?
A: VK emission values and K² transition values.
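For example (a small worked case, not taken from the slides): with K = 2 states and a V = 4-word vocabulary, that is 2 * 4 = 8 emission probabilities and 2^2 = 4 state-to-state transition probabilities; a distinguished start (and end) state, as in the lattice examples below, adds a few more transition values.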

  14. Hidden Markov Model Representation
p(z_1, w_1, z_2, w_2, ..., z_N, w_N) = p(z_1 | z_0) p(w_1 | z_1) ⋯ p(z_N | z_{N-1}) p(w_N | z_N) = ∏_i p(w_i | z_i) p(z_i | z_{i-1})
(transition parameters p(z_i | z_{i-1}), emission parameters p(w_i | z_i))
Represent the probabilities and independence assumptions in a graph: a chain of hidden states z_1 → z_2 → z_3 → z_4 → ⋯, each emitting an observed word w_1, w_2, w_3, w_4.

  15. Hidden Markov Model Representation
(Same factorization and graph as above.)
Graphical Models (see CMSC 478/678, and also CMSC 691: Graphical & Statistical Models of Learning).

  16. Hidden Markov Model Representation
(Same factorization and graph as above,) now with the emission arcs labeled p(w_1 | z_1), p(w_2 | z_2), p(w_3 | z_3), p(w_4 | z_4).

  17. Hidden Markov Model Representation
(Same graph as above,) now also with the transition arcs labeled p(z_2 | z_1), p(z_3 | z_2), p(z_4 | z_3).

  18. Hidden Markov Model Representation
(Same graph as above,) now also with the initial starting distribution ("BOS"): p(z_1 | z_0).

  19. Hidden Markov Model Representation
(Same fully labeled graph as above.)
Each z_i can take the value of one of K latent states. Transition and emission distributions do not change.
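The graph also encodes a generative story: draw z_1 from the starting distribution, emit w_1, transition to z_2, emit w_2, and so on. A minimal sampling sketch of that story (the parameter values and dict names below are made-up illustrations, not from the slides):

    import random

    def sample_hmm(trans, emit, start="start", end="end", max_len=20):
        """Walk the HMM's generative story, returning a list of (state, word) pairs."""
        pairs, state = [], start
        for _ in range(max_len):
            states, probs = zip(*trans[state].items())
            state = random.choices(states, weights=probs)[0]  # p(z_i | z_{i-1})
            if state == end:
                break
            words, wprobs = zip(*emit[state].items())
            pairs.append((state, random.choices(words, weights=wprobs)[0]))  # p(w_i | z_i)
        return pairs

    # Toy parameters, for illustration only:
    trans = {"start": {"N": 0.7, "V": 0.3},
             "N": {"N": 0.2, "V": 0.6, "end": 0.2},
             "V": {"N": 0.5, "V": 0.2, "end": 0.3}}
    emit = {"N": {"cats": 0.6, "milk": 0.4}, "V": {"speak": 0.7, "run": 0.3}}
    print(sample_hmm(trans, emit))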

  20. Example: 2-state Hidden Markov Model as a Lattice
Two rows of states over time, z_1 ... z_4 = V in the top row and z_1 ... z_4 = N in the bottom row, each position emitting the observed word w_1 ... w_4.

  21. Example: 2-state Hidden Markov Model as a Lattice
(Same lattice as above,) with the emission arcs labeled: p(w_1 | V), p(w_2 | V), p(w_3 | V), p(w_4 | V) along the top row and p(w_1 | N), p(w_2 | N), p(w_3 | N), p(w_4 | N) along the bottom row.

  22. Example: 2-state Hidden Markov Model as a Lattice
(Same lattice as above,) plus the within-row transition arcs: p(V | start), p(V | V), p(V | V), p(V | V) along the top and p(N | start), p(N | N), p(N | N), p(N | N) along the bottom.

  23. Example: 2-state Hidden Markov Model as a Lattice
(Same lattice as above,) plus the cross-row transition arcs p(V | N) and p(N | V) at every step, completing the lattice.

  24. Comparison of Joint Probabilities
Unigram language model:
p(w_1, w_2, ..., w_N) = p(w_1) p(w_2) ⋯ p(w_N) = ∏_i p(w_i)

  25. Comparison of Joint Probabilities
Unigram language model:
p(w_1, w_2, ..., w_N) = p(w_1) p(w_2) ⋯ p(w_N) = ∏_i p(w_i)
Unigram class-based language model ("K" coins):
p(z_1, w_1, z_2, w_2, ..., z_N, w_N) = p(z_1) p(w_1 | z_1) ⋯ p(z_N) p(w_N | z_N) = ∏_i p(w_i | z_i) p(z_i)

  26. Comparison of Joint Probabilities
Unigram language model:
p(w_1, w_2, ..., w_N) = p(w_1) p(w_2) ⋯ p(w_N) = ∏_i p(w_i)
Unigram class-based language model ("K" coins):
p(z_1, w_1, z_2, w_2, ..., z_N, w_N) = p(z_1) p(w_1 | z_1) ⋯ p(z_N) p(w_N | z_N) = ∏_i p(w_i | z_i) p(z_i)
Hidden Markov Model:
p(z_1, w_1, z_2, w_2, ..., z_N, w_N) = p(z_1 | z_0) p(w_1 | z_1) ⋯ p(z_N | z_{N-1}) p(w_N | z_N) = ∏_i p(w_i | z_i) p(z_i | z_{i-1})

  27. Estimating Parameters from Observed Data
Two fully observed (tagged) sequences over w_1 w_2 w_3 w_4 are drawn as lattices: (N, w_1), (V, w_2), (V, w_3), (N, w_4) and (N, w_1), (V, w_2), (N, w_3), (N, w_4), each followed by the end state (end emission not shown).
From these we tabulate transition counts (rows: start, N, V; columns: N, V, end) and emission counts (rows: N, V; columns: w_1 ... w_4).

  28. Estimating Parameters from Observed Data
Transition counts:
          N    V    end
start     2    0    0
N         1    2    2
V         2    1    0

Emission counts:
         w_1  w_2  w_3  w_4
N         2    0    1    2
V         0    2    1    0

  29. Estimating Parameters from Observed Data
Transition MLE (normalize each row of the transition counts):
          N     V     end
start     1     0     0
N        .2    .4    .4
V        2/3   1/3    0

Emission MLE (normalize each row of the emission counts):
         w_1   w_2   w_3   w_4
N        .4     0    .2    .4
V         0    2/3   1/3    0

  30. Estimating Parameters from Observed Data
Same MLE tables as above. Smooth these values if needed.
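The counting and normalizing above is mechanical; here is a sketch that reproduces the tables, assuming the two tagged sequences depicted on the slides, (N, w1)(V, w2)(V, w3)(N, w4) and (N, w1)(V, w2)(N, w3)(N, w4):

    from collections import Counter, defaultdict

    # The two fully observed training sequences from the slides.
    data = [
        [("N", "w1"), ("V", "w2"), ("V", "w3"), ("N", "w4")],
        [("N", "w1"), ("V", "w2"), ("N", "w3"), ("N", "w4")],
    ]

    trans_counts = defaultdict(Counter)   # trans_counts[prev][cur]
    emit_counts = defaultdict(Counter)    # emit_counts[tag][word]

    for seq in data:
        prev = "start"
        for tag, word in seq:
            trans_counts[prev][tag] += 1
            emit_counts[tag][word] += 1
            prev = tag
        trans_counts[prev]["end"] += 1    # count the transition into the end state

    def normalize(counter):
        total = sum(counter.values())
        return {k: v / total for k, v in counter.items()}

    trans_mle = {prev: normalize(c) for prev, c in trans_counts.items()}
    emit_mle = {tag: normalize(c) for tag, c in emit_counts.items()}

    print(trans_mle["N"])   # p(V|N)=0.4, p(N|N)=0.2, p(end|N)=0.4, matching the slide
    print(emit_mle["V"])    # p(w2|V)=2/3, p(w3|V)=1/3, matching the slide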

  31. Agenda
HMM Motivation (Part of Speech) and Brief Definition
What is Part of Speech?
HMM Detailed Definition
HMM Tasks

  32. Hidden Markov Model Tasks
p(z_1, w_1, ..., z_N, w_N) = ∏_i p(w_i | z_i) p(z_i | z_{i-1}) (emission and transition parameters as before)
Calculate the (log-)likelihood of an observed sequence w_1, ..., w_N.
Calculate the most likely sequence of states (for an observed sequence).
Learn the emission and transition parameters.

  33. Hidden Markov Model Tasks (continued)
Same three tasks as above; the next slides focus on the first one: calculating the (log-)likelihood of an observed sequence w_1, ..., w_N.

  34. HMM Likelihood Task
Marginalize over all latent-sequence joint likelihoods:
p(w_1, w_2, ..., w_N) = ∑_{z_1, ..., z_N} p(z_1, w_1, z_2, w_2, ..., z_N, w_N)
Q: In a K-state HMM for a length-N observation sequence, how many summands (different latent sequences) are there?

  35. HMM Likelihood Task
Q: In a K-state HMM for a length-N observation sequence, how many summands (different latent sequences) are there?
A: K^N

  36. HMM Likelihood Task
A: K^N summands.
Goal: find a way to compute this exponential sum efficiently (in polynomial time).

  37. HMM Likelihood Task
Aside: like in language modeling, you need to model when to stop generating. This ending state is generally not included in "K."
A: K^N summands.
Goal: find a way to compute this exponential sum efficiently (in polynomial time).
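A quick sanity check on that count (a throwaway sketch, not from the slides):

    from itertools import product

    states, N = ["N", "V"], 4          # K = 2 states, length-4 observation sequence
    sequences = list(product(states, repeat=N))
    print(len(sequences))              # 16 = K**N
    print(sequences[:2])               # ('N', 'N', 'N', 'N'), ('N', 'N', 'N', 'V')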

  38. 2 (3)-State HMM Likelihood
Full lattice over w_1 ... w_4 with states N and V (plus start), transition and emission arcs as before.
Q: What are the latent sequences here (EOS excluded)?

  39. 2 (3)-State HMM Likelihood
Q: What are the latent sequences here (EOS excluded)?
A:
(N, w_1), (N, w_2), (N, w_3), (N, w_4)
(N, w_1), (N, w_2), (N, w_3), (V, w_4)
(N, w_1), (N, w_2), (V, w_3), (N, w_4)
(N, w_1), (N, w_2), (V, w_3), (V, w_4)
(N, w_1), (V, w_2), (N, w_3), (N, w_4)
(N, w_1), (V, w_2), (N, w_3), (V, w_4)
(N, w_1), (V, w_2), (V, w_3), (N, w_4)
(N, w_1), (V, w_2), (V, w_3), (V, w_4)
(V, w_1), (N, w_2), (N, w_3), (N, w_4)
(V, w_1), (N, w_2), (N, w_3), (V, w_4)
… (six more)

  40. 2 (3)-State HMM Likelihood (continued)
Same lattice and enumeration as the previous slide: 2^4 = 16 latent sequences in total (EOS excluded).

  41. 2 (3)-State HMM Likelihood
Same lattice, now with concrete parameter tables.

Transitions p(next | previous):
          N     V     end
start    .7    .2    .1
N        .15   .8    .05
V        .6    .35   .05

Emissions p(word | state):
         w_1   w_2   w_3   w_4
N        .7    .2    .05   .05
V        .2    .6    .1    .1

  42. 2 (3)-State HMM Likelihood
Q: What's the probability of (N, w_1), (V, w_2), (V, w_3), (N, w_4)? (Tables as on the previous slide.)

  43. 2 (3)-State HMM Likelihood
Q: What's the probability of (N, w_1), (V, w_2), (V, w_3), (N, w_4)?
A: (.7*.7) * (.8*.6) * (.35*.1) * (.6*.05) ≈ 0.000247

  44. 2 (3)-State HMM Likelihood
Q: What's the probability of (N, w_1), (V, w_2), (V, w_3), (N, w_4) with the ending included (unique ending symbol "#")? The emission table gains a "#" column (probability 0 for N and V) and an end row that emits "#" with probability 1.
A: (.7*.7) * (.8*.6) * (.35*.1) * (.6*.05) * (.05*1) ≈ 0.00001235
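Checking that arithmetic directly (a small sketch using the tables transcribed from these slides; path_prob is just a helper name):

    trans = {"start": {"N": 0.7,  "V": 0.2,  "end": 0.1},
             "N":     {"N": 0.15, "V": 0.8,  "end": 0.05},
             "V":     {"N": 0.6,  "V": 0.35, "end": 0.05}}
    emit = {"N": {"w1": 0.7, "w2": 0.2, "w3": 0.05, "w4": 0.05},
            "V": {"w1": 0.2, "w2": 0.6, "w3": 0.1,  "w4": 0.1}}

    def path_prob(tags, words, with_end=True):
        """Joint probability of one tagged path, optionally including the end transition."""
        prob, prev = 1.0, "start"
        for z, w in zip(tags, words):
            prob *= trans[prev][z] * emit[z][w]
            prev = z
        return prob * trans[prev]["end"] if with_end else prob

    words = ["w1", "w2", "w3", "w4"]
    print(path_prob(["N", "V", "V", "N"], words, with_end=False))  # ~0.000247
    print(path_prob(["N", "V", "V", "N"], words))                  # ~0.0000123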

  45. 2 (3)-State HMM Likelihood
Q: What's the probability of (N, w_1), (V, w_2), (N, w_3), (N, w_4)? (Same tables as before.)

  46. 2 (3)-State HMM Likelihood
Q: What's the probability of (N, w_1), (V, w_2), (N, w_3), (N, w_4)?
A: (.7*.7) * (.8*.6) * (.6*.05) * (.15*.05) ≈ 0.0000529

  47. 2 (3)-State HMM Likelihood
Q: What's the probability of (N, w_1), (V, w_2), (N, w_3), (N, w_4) with the ending included (unique ending symbol "#")?
A: (.7*.7) * (.8*.6) * (.6*.05) * (.15*.05) * (.05*1) = 0.000002646
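Summing the joint probability of all 2^4 = 16 tag sequences (each with its end transition) gives the total likelihood of w_1 ... w_4, which is what the forward algorithm below computes without enumerating paths. A brute-force sketch with the same tables:

    from itertools import product

    trans = {"start": {"N": 0.7,  "V": 0.2,  "end": 0.1},
             "N":     {"N": 0.15, "V": 0.8,  "end": 0.05},
             "V":     {"N": 0.6,  "V": 0.35, "end": 0.05}}
    emit = {"N": {"w1": 0.7, "w2": 0.2, "w3": 0.05, "w4": 0.05},
            "V": {"w1": 0.2, "w2": 0.6, "w3": 0.1,  "w4": 0.1}}
    words = ["w1", "w2", "w3", "w4"]

    total = 0.0
    for tags in product(["N", "V"], repeat=len(words)):   # all 16 latent sequences
        prob, prev = 1.0, "start"
        for z, w in zip(tags, words):
            prob *= trans[prev][z] * emit[z][w]
            prev = z
        total += prob * trans[prev]["end"]                 # include the end transition

    print(total)  # total likelihood p(w1, w2, w3, w4), roughly 6.5e-05 with these tables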

  48. 2 (3)-State HMM Likelihood
The two paths just computed, (N, V, V, N) and (N, V, N, N), drawn side by side on the lattice.

  49. 2 (3)-State HMM Likelihood
The same two paths, (N, V, V, N) and (N, V, N, N). Up until here (the first two steps, shared by both paths), all the computation was the same.

  50. 2 (3)-State HMM Likelihood
Up until here, all the computation was the same. Let's reuse what computations we can.

  51. 2 (3)-State HMM Likelihood
Solution: pass information "forward" in the graph, e.g., from time step 2 to 3…

  52. 2 (3)-State HMM Likelihood
Solution: pass information "forward" in the graph, e.g., from time step 2 to 3…
Issue: these highlighted paths are only 2 of the 16 possible paths through the trellis.

  53. 2 (3)-State HMM Likelihood
Solution: pass information "forward" in the graph, e.g., from time step 2 to 3…
Issue: these highlighted paths are only 2 of the 16 possible paths through the trellis.
Solution: marginalize out all information from previous timesteps.

  54. Reusing Computation
Three states A, B, C at each of time steps i−2, i−1, and i.
Let's first consider "any shared path ending with B (AB, BB, or CB) → B".

  55. Reusing Computation
Let's first consider "any shared path ending with B (AB, BB, or CB) → B".
Assume that all necessary information has been computed and stored in α(i−1, A), α(i−1, B), α(i−1, C).

  56. Reusing Computation
Let's first consider "any shared path ending with B (AB, BB, or CB) → B".
Assume that all necessary information has been computed and stored in α(i−1, A), α(i−1, B), α(i−1, C); we want α(i, B).
Marginalize (sum) across the previous timestep's possible states.

  57. Reusing Computation
Let's first consider "any shared path ending with B (AB, BB, or CB) → B".
Marginalize across the previous hidden state values:
α(i, B) = ∑_t α(i−1, t) · p(B | t) · p(obs at i | B)

  58. Reusing Computation
α(i, B) = ∑_t α(i−1, t) · p(B | t) · p(obs at i | B)
Computing α at time i−1 will correctly incorporate paths through time i−2: we correctly obey the Markov property.

  59. Forward Probability
α(i, B) is the total probability of all paths to that state B from the beginning.
α(i, B) = ∑_t α(i−1, t) · p(B | t) · p(obs at i | B)
Computing α at time i−1 will correctly incorporate paths through time i−2: we correctly obey the Markov property.
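That recursion translates almost line-for-line into code; here is a sketch of a single update, where alpha_prev, trans, and emit are assumed dictionaries holding the previous alpha values, the transition probabilities, and the emission probabilities:

    def forward_step(alpha_prev, state, obs, trans, emit):
        """alpha(i, state) = sum_t alpha(i-1, t) * p(state | t) * p(obs_i | state)."""
        return sum(alpha_prev[t] * trans[t][state] for t in alpha_prev) * emit[state][obs]

    # e.g., with two states: forward_step({"N": 0.49, "V": 0.04}, "V", "w2", trans, emit)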

  60. Forward Probability
α(i, s) is the total probability of all paths:
1. that start from the beginning
2. that end (currently) in s at step i
3. that emit the observation obs at i

  61. Forward Probability
Reading the recursion: what's the total probability up until now? what are the immediate ways to get into state s? how likely is it to get into state s this way?
α(i, s) is the total probability of all paths:
1. that start from the beginning
2. that end (currently) in s at step i
3. that emit the observation obs at i

  62. 2 (3)-State HMM Likelihood with Forward Probabilities
Build the forward values on the lattice (transition and emission tables as before):
α[1, N] = (.7*.7)
α[2, V] = α[1, N]*(.8*.6) + α[1, V]*(.35*.6)
α[3, V] = α[2, V]*(.35*.1) + α[2, N]*(.8*.1)
α[3, N] = α[2, V]*(.6*.05) + α[2, N]*(.15*.05)

  63. 2 (3)-State HMM Likelihood with Forward Probabilities
As above, now with the other starting value filled in:
α[1, N] = (.7*.7)
α[1, V] = (.2*.2)
α[2, V] = α[1, N]*(.8*.6) + α[1, V]*(.35*.6)
α[3, V] = α[2, V]*(.35*.1) + α[2, N]*(.8*.1)
α[3, N] = α[2, V]*(.6*.05) + α[2, N]*(.15*.05)

  64. 2 (3)-State HMM Likelihood with Forward Probabilities
α[1, N] = (.7*.7) = .49
α[1, V] = (.2*.2) = .04
α[2, N] = α[1, N]*(.15*.2) + α[1, V]*(.6*.2) = .0195
α[2, V] = α[1, N]*(.8*.6) + α[1, V]*(.35*.6) = .2436
α[3, V] = α[2, V]*(.35*.1) + α[2, N]*(.8*.1)
α[3, N] = α[2, V]*(.6*.05) + α[2, N]*(.15*.05)

  65. 2 (3)-State HMM Likelihood with Forward Probabilities (continued)
Same computation and values as the previous slide.

  66. 2 (3)-State HMM Likelihood with Forward Probabilities
Same values as above. Use dynamic programming to build the α table left-to-right.

  67. Forward Algorithm
α: a 2D table, (N+2) × K*
N+2: number of observations (+2 for the BOS & EOS symbols)
K*: number of states
Use dynamic programming to build the α table left-to-right.
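Putting it together, a sketch of the full forward pass using the deck's tables; the first two columns of alpha should come out to roughly the values on the earlier slides (alpha[1,N]=.49, alpha[1,V]=.04, alpha[2,N]=.0195, alpha[2,V]=.2436), up to floating-point rounding:

    trans = {"start": {"N": 0.7,  "V": 0.2,  "end": 0.1},
             "N":     {"N": 0.15, "V": 0.8,  "end": 0.05},
             "V":     {"N": 0.6,  "V": 0.35, "end": 0.05}}
    emit = {"N": {"w1": 0.7, "w2": 0.2, "w3": 0.05, "w4": 0.05},
            "V": {"w1": 0.2, "w2": 0.6, "w3": 0.1,  "w4": 0.1}}

    def forward(words, states=("N", "V")):
        """Total likelihood p(w_1 ... w_N) in O(N * K^2) time instead of O(K^N)."""
        # alpha[i][s]: total probability of all paths ending in state s
        # after emitting words[0..i]
        alpha = [{s: trans["start"][s] * emit[s][words[0]] for s in states}]
        for w in words[1:]:
            prev = alpha[-1]
            alpha.append({s: sum(prev[t] * trans[t][s] for t in states) * emit[s][w]
                          for s in states})
        # fold in the end-of-sequence transition
        return sum(alpha[-1][s] * trans[s]["end"] for s in states), alpha

    likelihood, alpha = forward(["w1", "w2", "w3", "w4"])
    print(alpha[0])    # approximately {'N': 0.49, 'V': 0.04}
    print(alpha[1])    # approximately {'N': 0.0195, 'V': 0.2436}
    print(likelihood)  # matches the brute-force sum over all 16 paths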
