Lecture 11: Viterbi and Forward Algorithms


  1. Lecture 11: Viterbi and Forward Algorithms
     Kai-Wei Chang, CS @ University of Virginia, kw@kwchang.net
     Course webpage: http://kwchang.net/teaching/NLP16

  2. Quiz 1
     [Histogram of quiz scores over the bins 0-5, 6-10, 11-15, 16-20, 21-25]
     - Max: 24; Mean: 18.1; Median: 18; SD: 3.36

  3. This lecture
     - Two important algorithms for inference:
       - Forward algorithm
       - Viterbi algorithm


  5. Three basic problems for HMMs
     - Likelihood of the input (forward algorithm): how likely is it that the sentence "I love cat" occurs?
     - Decoding (tagging) the input (Viterbi algorithm): what are the POS tags of "I love cat"?
     - Estimation (learning): how do we learn the model? Find the best model parameters.
       - Case 1: supervised, tags are annotated: maximum likelihood estimation (MLE)
       - Case 2: unsupervised, only unannotated text: forward-backward algorithm

  6. Likelihood of the input
     - How likely is it that the sentence "I love cat" occurs?
     - Compute $P(\mathbf{x} \mid \mu)$ for the input $\mathbf{x}$ and HMM $\mu$
     - Remember, we model $P(\mathbf{t}, \mathbf{x} \mid \mu)$
     - $P(\mathbf{x} \mid \mu) = \sum_{\mathbf{t}} P(\mathbf{t}, \mathbf{x} \mid \mu)$ (marginal probability: sum over all possible tag sequences)

  7. Likelihood of the input
     - How likely is it that the sentence "I love cat" occurs?
     - $P(\mathbf{x} \mid \mu) = \sum_{\mathbf{t}} P(\mathbf{t}, \mathbf{x} \mid \mu) = \sum_{\mathbf{t}} \prod_{i=1}^{n} P(x_i \mid t_i)\, P(t_i \mid t_{i-1})$
     - Assume we have 2 tags, N and V:
       P("I love cat" | μ) = P("I love cat", "NNN" | μ) + P("I love cat", "NNV" | μ)
                           + P("I love cat", "NVN" | μ) + P("I love cat", "NVV" | μ)
                           + P("I love cat", "VNN" | μ) + P("I love cat", "VNV" | μ)
                           + P("I love cat", "VVN" | μ) + P("I love cat", "VVV" | μ)
     - Now, let's write down P("I love cat" | μ) with 45 tags…
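To make the cost of this sum concrete, here is a brute-force sketch that enumerates every tag sequence, which is exactly what the forward algorithm below avoids. The toy probabilities and names are made up for illustration; only the structure of the computation comes from the slides.

```python
from itertools import product

# Illustrative toy parameters (my own numbers, not from the slides): 2 tags, N and V.
start = {'N': 0.6, 'V': 0.4}                                      # P(t_1 = t)
trans = {('N', 'N'): 0.3, ('N', 'V'): 0.7,                        # P(t_i = t | t_{i-1} = t')
         ('V', 'N'): 0.8, ('V', 'V'): 0.2}
emit = {('N', 'I'): 0.2, ('N', 'love'): 0.1, ('N', 'cat'): 0.7,   # P(x_i | t_i)
        ('V', 'I'): 0.1, ('V', 'love'): 0.8, ('V', 'cat'): 0.1}

def brute_force_likelihood(words, tags):
    """P(x | mu) = sum of P(t, x | mu) over all |tags|^len(words) tag sequences."""
    total = 0.0
    for seq in product(tags, repeat=len(words)):      # NNN, NNV, ..., VVV for 3 words
        p = start[seq[0]] * emit[(seq[0], words[0])]
        for k in range(1, len(words)):
            p *= trans[(seq[k - 1], seq[k])] * emit[(seq[k], words[k])]
        total += p
    return total

print(brute_force_likelihood(['I', 'love', 'cat'], ['N', 'V']))   # sums the 8 terms above
```

With T tags and n words the loop visits T^n sequences, which is why slide 7 jokes about writing the sum out with 45 tags.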

  8. Trellis diagram
     (μ is the parameter set of the HMM; we ignore it on some slides for simplicity's sake.)
     - Goal: $P(\mathbf{x} \mid \mu) = \sum_{\mathbf{t}} \prod_{i=1}^{n} P(x_i \mid t_i)\, P(t_i \mid t_{i-1})$
     - [Trellis figure: one column per position i = 1, 2, 3, 4, one node per tag; edges carry transition probabilities $P(t_i \mid t_{i-1})$ and emission probabilities $P(x_i \mid t_i)$]

  9. Trellis diagram
     - P("I eat a fish", NVVA): one path N → V → V → A through the trellis (rows N, V, A; columns i = 1, 2, 3, 4), with
       P("I eat a fish", NVVA) = P(N | <S>) P(I | N) · P(V | N) P(eat | V) · P(V | V) P(a | V) · P(A | V) P(fish | A)

  10. Trellis diagram
     - $\sum_{\mathbf{t}} \prod_{i=1}^{n} P(x_i \mid t_i)\, P(t_i \mid t_{i-1})$: sum over all paths through the trellis (rows N, V, A; columns i = 1, 2, 3, 4)

  11. Dynamic programming
     - Recursively decompose a problem into smaller sub-problems
     - Similar to mathematical induction:
       - Base step: initial values for i = 1
       - Inductive step: assume we know the values for i = k; compute them for i = k + 1

  12. Forward algorithm
     - Inductive step: from i = k to i = k + 1
     - $\mathbf{t}_{1:k}$: tag sequence of length k; $\mathbf{x}_{1:k} = x_1, x_2, \ldots, x_k$
     - Group tag sequences by the tag at position i = k:
       $\sum_{\mathbf{t}_{1:k}} P(\mathbf{t}_{1:k}, \mathbf{x}_{1:k}) = \sum_{t} \sum_{\mathbf{t}_{1:k-1}} P(\mathbf{t}_{1:k-1}, \mathbf{x}_{1:k}, t_k = t) = \sum_{t} P(\mathbf{x}_{1:k}, t_k = t)$

  13. Forward algorithm
     - Inductive step: from i = k to i = k + 1
     - $\sum_{\mathbf{t}_{1:k}} P(\mathbf{t}_{1:k}, \mathbf{x}_{1:k}) = \sum_{t} P(\mathbf{x}_{1:k}, t_k = t)$
     - $P(\mathbf{x}_{1:k}, t_k = t) = \sum_{t'} P(\mathbf{x}_{1:k}, t_{k-1} = t', t_k = t) = \sum_{t'} P(\mathbf{x}_{1:k-1}, t_{k-1} = t')\, P(t_k = t \mid t_{k-1} = t')\, P(x_k \mid t_k = t)$

  14. Forward algorithm
     - Inductive step: from i = k to i = k + 1
     - $\sum_{\mathbf{t}_{1:k}} P(\mathbf{t}_{1:k}, \mathbf{x}_{1:k}) = \sum_{t} P(\mathbf{x}_{1:k}, t_k = t)$; let's call each summand $\alpha_k(t)$
     - $P(\mathbf{x}_{1:k}, t_k = t) = \sum_{t'} P(\mathbf{x}_{1:k-1}, t_{k-1} = t')\, P(t_k = t \mid t_{k-1} = t')\, P(x_k \mid t_k = t)$, where $P(\mathbf{x}_{1:k-1}, t_{k-1} = t')$ is $\alpha_{k-1}(t')$

  15. Forward algorithm
     - Inductive step: from i = k to i = k + 1
     - $\alpha_k(t) = \sum_{t'} \alpha_{k-1}(t')\, P(t_k = t \mid t_{k-1} = t')\, P(x_k \mid t_k = t)$

  16. Forward algorithm
     - Inductive step: from i = k to i = k + 1
     - $\alpha_k(t) = \sum_{t'} \alpha_{k-1}(t')\, P(t_k = t \mid t_{k-1} = t')\, P(x_k \mid t_k = t) = P(x_k \mid t_k = t) \sum_{t'} \alpha_{k-1}(t')\, P(t_k = t \mid t_{k-1} = t')$

  17. Forward algorithm
     - Base step: i = 1
     - $\alpha_1(t) = P(x_1 \mid t_1 = t)\, P(t_1 = t \mid t_0)$, where $P(t_1 = t \mid t_0)$ is the initial probability of tag t (the transition out of the start symbol $t_0$)
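Putting slides 12 through 17 together, the complete forward computation, restated here in one place (the termination line is just the marginalization from slide 12), is:

$$
\begin{aligned}
\text{Base } (i = 1):\quad & \alpha_1(t) = P(x_1 \mid t_1 = t)\, P(t_1 = t \mid t_0) \\
\text{Induction } (k = 2, \ldots, n):\quad & \alpha_k(t) = P(x_k \mid t_k = t) \sum_{t'} \alpha_{k-1}(t')\, P(t_k = t \mid t_{k-1} = t') \\
\text{Termination:}\quad & P(\mathbf{x} \mid \mu) = \sum_{t} \alpha_n(t)
\end{aligned}
$$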

  18. Implementation using an array (from Julia Hockenmaier, Intro to NLP)
     - Use an $n \times T$ table (the trellis) to keep $\alpha_k(t)$: one entry per position k and tag t

  19. Implementation using an array
     - Initialization: trellis[1][t] = $\alpha_1(t) = P(x_1 \mid t_1 = t)\, P(t_1 = t \mid t_0)$

  20. Implementation using an array
     - Induction: trellis[k][t] = $\alpha_k(t) = P(x_k \mid t_k = t) \sum_{t'} \alpha_{k-1}(t')\, P(t_k = t \mid t_{k-1} = t')$

  21. The forward algorithm (pseudo code)
     [Pseudo-code figure; only a fragment of the initialization ("fwd = 0") survives in this transcript.]
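Since only a fragment of the slide's pseudo code survives, here is a minimal runnable sketch of the same table-filling procedure. The function name and the dictionary encoding of the parameters (start, trans, emit) are illustrative choices of this write-up, not the slide's.

```python
def forward(words, tags, start, trans, emit):
    """Forward algorithm: returns P(words | mu) for an HMM given as
    start[t] = P(t_1 = t), trans[(t', t)] = P(t | t'), emit[(t, w)] = P(w | t)."""
    n = len(words)
    alpha = [{} for _ in range(n)]            # alpha[k][t] = P(x_1..x_{k+1}, tag t at position k)
    for t in tags:                            # base step: first word
        alpha[0][t] = start[t] * emit.get((t, words[0]), 0.0)
    for k in range(1, n):                     # inductive step: fill the trellis column by column
        for t in tags:
            alpha[k][t] = emit.get((t, words[k]), 0.0) * sum(
                alpha[k - 1][tp] * trans.get((tp, t), 0.0) for tp in tags)
    return sum(alpha[n - 1].values())         # termination: marginalize over the final tag
```

The two nested loops give O(n T^2) time instead of the O(T^n) brute-force enumeration from slide 7.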

  22. Jason's ice cream (forward example)

                p(…|C)   p(…|H)   p(…|START)
       p(1|…)    0.5      0.1
       p(2|…)    0.4      0.2              (#cones)
       p(3|…)    0.1      0.7
       p(C|…)    0.8      0.2      0.5
       p(H|…)    0.2      0.8      0.5

     - P("1, 2, 1")?
     - [Trellis figure: states C and H over three time steps, edges labeled with the transition and emission probabilities above]
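The slide asks for P("1, 2, 1"); the self-contained snippet below runs the forward recursion on exactly these numbers. The intermediate alpha values and the final total in the comments are my own hand arithmetic, not read off the slide.

```python
# Jason's ice-cream HMM, numbers taken from the table above.
start = {'C': 0.5, 'H': 0.5}
trans = {('C', 'C'): 0.8, ('C', 'H'): 0.2, ('H', 'C'): 0.2, ('H', 'H'): 0.8}
emit = {('C', '1'): 0.5, ('C', '2'): 0.4, ('C', '3'): 0.1,
        ('H', '1'): 0.1, ('H', '2'): 0.2, ('H', '3'): 0.7}

obs = ['1', '2', '1']
alpha = {t: start[t] * emit[(t, obs[0])] for t in ('C', 'H')}   # base: {C: 0.25, H: 0.05}
for x in obs[1:]:                                               # induction, one column per cone count
    alpha = {t: emit[(t, x)] * sum(alpha[tp] * trans[(tp, t)] for tp in ('C', 'H'))
             for t in ('C', 'H')}
# After "2": {C: 0.084, H: 0.018}; after the final "1": {C: 0.0354, H: 0.00312}
print(sum(alpha.values()))                                      # P("1,2,1") ≈ 0.03852
```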

  23. Three basic problems for HMMs (recap)
     - Likelihood of the input (forward algorithm): how likely is it that the sentence "I love cat" occurs?
     - Decoding (tagging) the input (Viterbi algorithm): what are the POS tags of "I love cat"?
     - Estimation (learning): how do we learn the model? Find the best model parameters.
       - Case 1: supervised, tags are annotated: maximum likelihood estimation (MLE)
       - Case 2: unsupervised, only unannotated text: forward-backward algorithm

  24. Prediction in a generative model
     - Inference: what is the most likely sequence of tags for the given sequence of words w?
     - Equivalently: what are the latent states that most likely generate the sequence of words w?

  25. Tagging the input
     - Find the best tag sequence for "I love cat"
     - Remember, we model $P(\mathbf{t}, \mathbf{x} \mid \mu)$
     - $\mathbf{t}^* = \arg\max_{\mathbf{t}} P(\mathbf{t}, \mathbf{x} \mid \mu)$: find the best one among all possible tag sequences

  26. Tagging the input
     - Assume we have 2 tags, N and V. Which one is the best?
       P("I love cat", "NNN" | μ), P("I love cat", "NNV" | μ), P("I love cat", "NVN" | μ), P("I love cat", "NVV" | μ),
       P("I love cat", "VNN" | μ), P("I love cat", "VNV" | μ), P("I love cat", "VVN" | μ), P("I love cat", "VVV" | μ)
     - Again, we need an efficient algorithm!
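As with the likelihood in slide 7, the naive approach enumerates every sequence and keeps the best. A hedged sketch (same illustrative start/trans/emit dictionary encoding as before, which is this write-up's convention rather than the slide's) shows why this is again exponential and motivates Viterbi:

```python
from itertools import product

def brute_force_decode(words, tags, start, trans, emit):
    """arg max over all |tags|^len(words) tag sequences of P(t, x | mu); for illustration only."""
    def joint(seq):
        p = start[seq[0]] * emit.get((seq[0], words[0]), 0.0)
        for (tp, t), w in zip(zip(seq, seq[1:]), words[1:]):
            p *= trans.get((tp, t), 0.0) * emit.get((t, w), 0.0)
        return p
    return max(product(tags, repeat=len(words)), key=joint)
```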

  27. Trellis diagram
     - Goal: $\arg\max_{\mathbf{t}} \prod_{i=1}^{n} P(x_i \mid t_i)\, P(t_i \mid t_{i-1})$
     - [Trellis figure: columns i = 1, 2, 3, 4; edges labeled with transition probabilities $P(t_i \mid t_{i-1})$ and emission probabilities $P(x_i \mid t_i)$]

  28. Trellis diagram
     - Goal: $\arg\max_{\mathbf{t}} \prod_{i=1}^{n} P(x_i \mid t_i)\, P(t_i \mid t_{i-1})$
     - Find the best path! [Trellis with rows N, V, A over columns i = 1, 2, 3, 4]
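The section ends before the Viterbi recursion is spelled out, but it mirrors the forward algorithm with max in place of sum, plus back-pointers to recover the best path. A minimal sketch under that standard formulation (not taken verbatim from these slides):

```python
def viterbi(words, tags, start, trans, emit):
    """Viterbi decoding: the forward recursion with max in place of sum,
    plus back-pointers to recover the highest-probability tag sequence."""
    n = len(words)
    delta = [{} for _ in range(n)]      # delta[k][t] = best score of any path ending in tag t at k
    back = [{} for _ in range(n)]       # back[k][t]  = best previous tag on that path
    for t in tags:                      # base step
        delta[0][t] = start[t] * emit.get((t, words[0]), 0.0)
    for k in range(1, n):               # inductive step: keep only the best incoming edge
        for t in tags:
            best_prev = max(tags, key=lambda tp: delta[k - 1][tp] * trans.get((tp, t), 0.0))
            back[k][t] = best_prev
            delta[k][t] = (delta[k - 1][best_prev] * trans.get((best_prev, t), 0.0)
                           * emit.get((t, words[k]), 0.0))
    last = max(tags, key=lambda t: delta[n - 1][t])   # best final tag
    path = [last]
    for k in range(n - 1, 0, -1):                     # follow back-pointers right to left
        path.append(back[k][path[-1]])
    return list(reversed(path)), delta[n - 1][last]
```

Like the forward algorithm, this fills the same n × T trellis in O(n T^2) time; only the sum is replaced by a max, and the back-pointer table records which path achieved it.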
