

  1. ANLP Lecture 9: Algorithms for HMMs
     Sharon Goldwater, 4 Oct 2019

  2. Recap: HMM
  • Elements of HMM:
    – Set of states (tags)
    – Output alphabet (word types)
    – Start state (beginning of sentence)
    – State transition probabilities
    – Output probabilities from each state

  3. More general notation
  • Previous lecture:
    – Sequence of tags T = t_1 … t_n
    – Sequence of words S = w_1 … w_n
  • This lecture:
    – Sequence of states Q = q_1 … q_T
    – Sequence of outputs O = o_1 … o_T
    – So t is now a time step, not a tag! And T is the sequence length.

  4. Recap: HMM
  • Given a sentence O = o_1 … o_T with tags Q = q_1 … q_T, compute P(O, Q) as:

      P(O, Q) = \prod_{t=1}^{T} P(o_t | q_t) P(q_t | q_{t-1})

  • But we want to find argmax_Q P(Q | O) without enumerating all possible Q
    – Use the Viterbi algorithm to store partial computations.

  5. Today’s lecture
  • What algorithms can we use to
    – Efficiently compute the most probable tag sequence for a given word sequence?
    – Efficiently compute the likelihood for an HMM (the probability that it outputs a given sequence s)?
    – Learn the parameters of an HMM given unlabelled training data?
  • What are the properties of these algorithms (complexity, convergence, etc.)?

  6. Tagging example

      Words:          <s>   one   dog   bit   </s>
      Possible tags:  <s>   CD    NN    NN    </s>
      (ordered by           NN    VB    VBD
      frequency for         PRP
      each word)

  7. Tagging example

      Words:          <s>   one   dog   bit   </s>
      Possible tags:  <s>   CD    NN    NN    </s>
      (ordered by           NN    VB    VBD
      frequency for         PRP
      each word)

  • Choosing the best tag for each word independently gives the wrong answer (<s> CD NN NN </s>).
  • P(VBD|bit) < P(NN|bit), but choosing VBD may yield a better sequence (<s> CD NN VBD </s>)
    – because P(VBD|NN) and P(</s>|VBD) are high.
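  [Editor's note: to see why, observe that under the HMM factorization the two candidate sequences share every factor except the ones involving the tag of "bit". The comparison then reduces to (my reconstruction, not from the slides):

      P(NN | NN) · P(bit | NN) · P(</s> | NN)   vs.   P(VBD | NN) · P(bit | VBD) · P(</s> | VBD)

  The right-hand product can be larger even though NN is the more frequent tag for "bit" in isolation, because the transition factors P(VBD | NN) and P(</s> | VBD) are high.]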

  8. Viterbi: intuition

      Words:          <s>   one   dog   bit   </s>
      Possible tags:  <s>   CD    NN    NN    </s>
      (ordered by           NN    VB    VBD
      frequency for         PRP
      each word)

  • Suppose we have already computed
    a) The best tag sequence for <s> … bit that ends in NN.
    b) The best tag sequence for <s> … bit that ends in VBD.
  • Then, the best full sequence would be either
    – sequence (a) extended to include </s>, or
    – sequence (b) extended to include </s>.

  9. Viterbi: intuition

      Words:          <s>   one   dog   bit   </s>
      Possible tags:  <s>   CD    NN    NN    </s>
      (ordered by           NN    VB    VBD
      frequency for         PRP
      each word)

  • But similarly, to get
    a) The best tag sequence for <s> … bit that ends in NN,
  • We could extend one of:
    – The best tag sequence for <s> … dog that ends in NN.
    – The best tag sequence for <s> … dog that ends in VB.
  • And so on…

  10. Viterbi: high-level picture
  • Intuition: the best path of length t ending in state q must include the best path of length t-1 to the previous state. (t is now a time step, not a tag.)

  11. Viterbi: high-level picture
  • Intuition: the best path of length t ending in state q must include the best path of length t-1 to the previous state. (t is now a time step, not a tag.) So,
    – Find the best path of length t-1 to each state.
    – Consider extending each of those by one step, to state q.
    – Take the best of those options as the best path to state q.

  12. Notation
  • Sequence of observations over time: o_1, o_2, …, o_T
    – here, words in the sentence
  • Vocabulary size V of possible observations
  • Set of possible states: q_1, q_2, …, q_N (see note on next slide)
    – here, tags
  • A, an N×N matrix of transition probabilities
    – a_ij: the prob of transitioning from state i to j. (JM3 Fig 8.7)
  • B, an N×V matrix of output probabilities
    – b_i(o_t): the prob of emitting o_t from state i. (JM3 Fig 8.8)

  13. Note on notation
  • J&M use q_1, q_2, …, q_N for the set of states, but also use q_1, q_2, …, q_T for the state sequence over time.
    – So, just seeing q_1 is ambiguous (though usually disambiguated from context).
    – I’ll instead use q_i for state names, and q_t for the state at time t.
    – So we could have q_t = q_i, meaning: the state we’re in at time t is q_i.

  14. HMM example w/ new notation

  [State diagram, adapted from Manning & Schuetze, Fig 9.2: Start goes to q_1 with probability 1; self-loops q_1→q_1 (.7) and q_2→q_2 (.5); cross transitions q_1→q_2 (.3) and q_2→q_1 (.5); emission probabilities over {x, y, z} are .6/.1/.3 from q_1 and .1/.7/.2 from q_2.]

  • States {q_1, q_2} (or {<s>, q_1, q_2})
  • Output alphabet {x, y, z}

  15. Transition and Output Probabilities
  • Transition matrix A, with a_ij = P(q_j | q_i):

             q_1   q_2
      <s>    1     0
      q_1    .7    .3
      q_2    .5    .5

  • Output matrix B, with b_i(o) = P(o | q_i) for output o:

             x     y     z
      q_1    .6    .1    .3
      q_2    .1    .7    .2
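  [Editor's note: for the later worked examples it helps to have these parameters in code. A minimal sketch in Python, with dictionary names of my own choosing (not from the slides):

      # Transition probabilities: A[i][j] = P(q_j | q_i) = a_ij
      A = {
          "<s>": {"q1": 1.0, "q2": 0.0},
          "q1":  {"q1": 0.7, "q2": 0.3},
          "q2":  {"q1": 0.5, "q2": 0.5},
      }

      # Output probabilities: B[i][o] = P(o | q_i) = b_i(o)
      B = {
          "q1": {"x": 0.6, "y": 0.1, "z": 0.3},
          "q2": {"x": 0.1, "y": 0.7, "z": 0.2},
      }
  ]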

  16. Joint probability of (states, outputs)
  • Let λ = (A, B) be the parameters of our HMM.
  • Using our new notation, given state sequence Q = (q_1 … q_T) and output sequence O = (o_1 … o_T), we have:

      P(O, Q | λ) = \prod_{t=1}^{T} P(o_t | q_t) P(q_t | q_{t-1})

  17. Joint probability of (states, outputs)
  • Let λ = (A, B) be the parameters of our HMM.
  • Using our new notation, given state sequence Q = (q_1 … q_T) and output sequence O = (o_1 … o_T), we have:

      P(O, Q | λ) = \prod_{t=1}^{T} P(o_t | q_t) P(q_t | q_{t-1})

  • Or:

      P(O, Q | λ) = \prod_{t=1}^{T} b_{q_t}(o_t) · a_{q_{t-1} q_t}

  18. Joint probability of (states, outputs)
  • Let λ = (A, B) be the parameters of our HMM.
  • Using our new notation, given state sequence Q = (q_1 … q_T) and output sequence O = (o_1 … o_T), we have:

      P(O, Q | λ) = \prod_{t=1}^{T} P(o_t | q_t) P(q_t | q_{t-1})

  • Or:

      P(O, Q | λ) = \prod_{t=1}^{T} b_{q_t}(o_t) · a_{q_{t-1} q_t}

  • Example:

      P(O = (y, z), Q = (q_1, q_1) | λ) = b_1(y) · b_1(z) · a_{<s>,1} · a_{11} = (.1)(.3)(1)(.7) = .021
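  [Editor's note: a sketch of this computation in Python, using the A and B dictionaries defined after slide 15 (the function name is my own):

      def joint_prob(outputs, states, A, B, start="<s>"):
          # P(O, Q | lambda): one transition and one emission factor per time step.
          prob = 1.0
          prev = start
          for q, o in zip(states, outputs):
              prob *= A[prev][q] * B[q][o]
              prev = q
          return prob

      # The slide's example: O = (y, z), Q = (q1, q1)
      joint_prob(["y", "z"], ["q1", "q1"], A, B)   # (1)(.1)(.7)(.3) = 0.021
  ]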

  19. Viterbi: high-level picture
  • Want to find argmax_Q P(Q | O)
  • Intuition: the best path of length t ending in state q must include the best path of length t-1 to the previous state. So,
    – Find the best path of length t-1 to each state.
    – Consider extending each of those by one step, to state q.
    – Take the best of those options as the best path to state q.

  20. Viterbi algorithm
  • Use a chart to store partial results as we go
    – N×T table, where v(j, t) is the probability* of the best state sequence for o_1 … o_t that ends in state j.

  *Specifically, v(j, t) stores the max of the joint probability P(o_1 … o_t, q_1 … q_{t-1}, q_t = j | λ)

  21. Viterbi algorithm
  • Use a chart to store partial results as we go
    – N×T table, where v(j, t) is the probability* of the best state sequence for o_1 … o_t that ends in state j.
  • Fill in columns from left to right, with

      v(k, t) = \max_{j=1}^{N} v(j, t-1) · a_{jk} · b_k(o_t)

  *Specifically, v(j, t) stores the max of the joint probability P(o_1 … o_t, q_1 … q_{t-1}, q_t = j | λ)

  22. Viterbi algorithm
  • Use a chart to store partial results as we go
    – N×T table, where v(j, t) is the probability* of the best state sequence for o_1 … o_t that ends in state j.
  • Fill in columns from left to right, with

      v(k, t) = \max_{j=1}^{N} v(j, t-1) · a_{jk} · b_k(o_t)

  • Store a backtrace to show, for each cell, which state at t-1 we came from.

  *Specifically, v(j, t) stores the max of the joint probability P(o_1 … o_t, q_1 … q_{t-1}, q_t = j | λ)
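  [Editor's note: a compact sketch of the algorithm in Python (names are my own; unlike the tagging setup, it takes the max over final states rather than requiring a </s> transition):

      def viterbi(outputs, states, A, B, start="<s>"):
          # v[t][k]: prob of the best state sequence for o_1..o_t ending in state k.
          # back[t][k]: the state at t-1 on that best sequence (for the backtrace).
          T = len(outputs)
          v = [{} for _ in range(T)]
          back = [{} for _ in range(T)]
          for k in states:                       # first column: from the start state
              v[0][k] = A[start][k] * B[k][outputs[0]]
              back[0][k] = start
          for t in range(1, T):                  # fill in columns left to right
              for k in states:
                  best_j = max(states, key=lambda j: v[t-1][j] * A[j][k])
                  v[t][k] = v[t-1][best_j] * A[best_j][k] * B[k][outputs[t]]
                  back[t][k] = best_j
          last = max(states, key=lambda k: v[T-1][k])
          path = [last]                          # follow backpointers right to left
          for t in range(T - 1, 0, -1):
              path.append(back[t][path[-1]])
          return v[T-1][last], path[::-1]
  ]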

  23. Example
  • Suppose O = xzy. Our initially empty table:

             o_1=x   o_2=z   o_3=y
      q_1
      q_2

  24. Filling the first column

             o_1=x   o_2=z   o_3=y
      q_1    .6
      q_2    0

      v(1,1) = a_{<s>,1} · b_1(x) = (1)(.6) = .6
      v(2,1) = a_{<s>,2} · b_2(x) = (0)(.1) = 0

  25. Starting the second column

             o_1=x   o_2=z   o_3=y
      q_1    .6
      q_2    0

      v(1,2) = \max_{j=1}^{N} v(j,1) · a_{j1} · b_1(z)
             = max { v(1,1) · a_{11} · b_1(z) = (.6)(.7)(.3),
                     v(2,1) · a_{21} · b_1(z) = (0)(.5)(.3) }

  26. Starting the second column

             o_1=x   o_2=z   o_3=y
      q_1    .6      .126
      q_2    0

      v(1,2) = \max_{j=1}^{N} v(j,1) · a_{j1} · b_1(z)
             = max { v(1,1) · a_{11} · b_1(z) = (.6)(.7)(.3) = .126,
                     v(2,1) · a_{21} · b_1(z) = (0)(.5)(.3) = 0 }
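  [Editor's note: continuing the recurrence by hand, or running the viterbi sketch above on O = xzy, fills in the remaining cells; all values follow from the matrices on slide 15:

      prob, path = viterbi(["x", "z", "y"], ["q1", "q2"], A, B)
      # Column 2:  v(2,2) = (.6)(.3)(.2) = .036
      # Column 3:  v(1,3) = (.126)(.7)(.1) = .00882
      #            v(2,3) = (.126)(.3)(.7) = .02646   <- best final cell
      # prob = 0.02646, path = ['q1', 'q1', 'q2']

  The backtrace from the best final cell recovers the state sequence q_1 q_1 q_2.]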
