Sequence Models (Spring 2020)

  1. SFU NatLangLab CMPT 825: Natural Language Processing
     Sequence Models, Spring 2020
     Adapted from slides from Danqi Chen and Karthik Narasimhan (Princeton COS 484)

  2. Overview
     • Hidden Markov models (HMM)
     • Viterbi algorithm
     • Maximum entropy Markov models (MEMM)

  3. Sequence Tagging

  4. What are POS tags
     • Word classes or syntactic categories
     • Reveal useful information about a word (and its neighbors!)
     Examples:
       The/DT cat/NN sat/VBD on/IN the/DT mat/NN
       British/NNP left/NN waffles/NNS on/IN Falkland/NNP Islands/NNP
       The/DT old/NN man/VB the/DT boat/NN

  5. Parts of Speech
     • Different words have different functions
     • Closed class: fixed membership, function words
       • e.g. prepositions (in, on, of), determiners (the, a)
     • Open class: new words get added frequently
       • e.g. nouns (Twitter, Facebook), verbs (google), adjectives, adverbs

  6. Penn Treebank tagset [45 tags] (Marcus et al., 1993)
     Other corpora: Brown, WSJ, Switchboard

  8. Part of Speech Tagging
     • Disambiguation task: each word might have different senses/functions
       • The/DT man/NN bought/VBD a/DT boat/NN
       • The/DT old/NN man/VB the/DT boat/NN
     Some words have many functions!

  9. A simple baseline
     • Many words might be easy to disambiguate
     • Most frequent class: assign each token (word) the class it occurred with most often in the training set (e.g. man/NN)
     • Accurately tags 92.34% of word tokens on Wall Street Journal (WSJ)!
     • State of the art ~ 97%
     • Average English sentence ~ 14 words
     • Sentence-level accuracies: $0.92^{14} \approx 31\%$ vs $0.97^{14} \approx 65\%$
     • POS tagging not solved yet!
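The most-frequent-class baseline and the sentence-level arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not the course's implementation: the tiny training set, the function names, and the `default_tag` fallback for unseen words are all assumptions made for the example.

```python
from collections import Counter, defaultdict

def train_most_frequent_tag(tagged_sentences):
    """Map each (lowercased) word to the tag it carried most often in training."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_baseline(words, lexicon, default_tag="NN"):
    """Tag every word independently; unseen words get default_tag (an assumption)."""
    return [lexicon.get(w.lower(), default_tag) for w in words]

def sentence_accuracy(token_accuracy, sentence_length=14):
    """Chance that all tokens are right, assuming independent per-token errors."""
    return token_accuracy ** sentence_length

# Hypothetical toy training data, for illustration only.
train = [
    [("The", "DT"), ("man", "NN"), ("bought", "VBD"), ("a", "DT"), ("boat", "NN")],
    [("The", "DT"), ("old", "JJ"), ("man", "NN"), ("sat", "VBD")],
    [("An", "DT"), ("old", "JJ"), ("boat", "NN")],
    [("The", "DT"), ("old", "NN"), ("man", "VB"), ("the", "DT"), ("boat", "NN")],
]
lexicon = train_most_frequent_tag(train)
baseline_tags = tag_baseline(["The", "old", "man", "the", "boat"], lexicon)
print(baseline_tags)
print(round(sentence_accuracy(0.92), 2), round(sentence_accuracy(0.97), 2))
```

On this toy data the baseline tags "old" as JJ and "man" as NN, so it gets the garden-path sentence "The old man the boat" wrong: each word is tagged without looking at its context.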

  10. Hidden Markov Models

  11. Some observations
      • The function (or POS) of a word depends on its context
        • The/DT old/NN man/VB the/DT boat/NN
        • The/DT old/JJ man/NN bought/VBD the/DT boat/NN
      • Certain POS combinations are extremely unlikely
        • <JJ, DT> or <DT, IN>
      • Better to make decisions on entire sequences instead of individual words (Sequence modeling!)

  12. Markov chains
      [Figure: chain of states $s_1 \to s_2 \to s_3 \to s_4$]
      • Model probabilities of sequences of variables
      • Each state can take one of K values ({1, 2, ..., K} for simplicity)
      • Markov assumption: $P(s_t \mid s_{<t}) \approx P(s_t \mid s_{t-1})$
      Where have we seen this before?
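As a worked equation, the Markov assumption turns the exact chain-rule factorization of a state sequence into a product of pairwise terms. The sketch below uses the slide's symbols; writing the first factor as $P(s_1)$ is a notational assumption. It is the same first-order approximation that bigram language models make.

```latex
P(s_1, \dots, s_n)
  = \prod_{t=1}^{n} P(s_t \mid s_{<t})
  \approx P(s_1) \prod_{t=2}^{n} P(s_t \mid s_{t-1})
```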

  13. Markov chains
      [Figure: chain of states $s_1 \to s_2 \to s_3 \to s_4$]
      The/?? cat/?? sat/?? on/?? the/?? mat/??
      • We don't observe POS tags at test time

  14. Hidden Markov Model (HMM)
      [Figure: hidden tags $s_1, \dots, s_4$ above the observed words "the cat sat on"]
      The/?? cat/?? sat/?? on/?? the/?? mat/??
      • We don't observe POS tags at test time
      • But we do observe the words!
      • HMM allows us to jointly reason over both hidden and observed events.

  15. Components of an HMM
      [Figure: hidden tags $s_1, \dots, s_4$ above observed words $o_1, \dots, o_4$]
      1. Set of states S = {1, 2, ..., K} and observations O
      2. Initial state probability distribution $\pi(s_1)$
      3. Transition probabilities $P(s_{t+1} \mid s_t)$
      4. Emission probabilities $P(o_t \mid s_t)$

  16. Assumptions
      [Figure: hidden tags $s_1, \dots, s_4$ above observed words $o_1, \dots, o_4$]
      1. Markov assumption: $P(s_{t+1} \mid s_1, \dots, s_t) = P(s_{t+1} \mid s_t)$
      2. Output independence: $P(o_t \mid s_1, \dots, s_t) = P(o_t \mid s_t)$
      Which is a stronger assumption?

  17. Sequence likelihood
      [Figure: hidden tags $s_1, \dots, s_4$ above observed words $o_1, \dots, o_4$]
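The slide's own formula did not survive extraction. Under the two HMM assumptions (Markov and output independence) and the components listed earlier, the joint likelihood of a tag sequence and a word sequence factorizes as follows; this is the standard form, written with the deck's symbols rather than copied from the slide.

```latex
P(s_1, \dots, s_n, o_1, \dots, o_n)
  = \pi(s_1)\, P(o_1 \mid s_1) \prod_{t=2}^{n} P(s_t \mid s_{t-1})\, P(o_t \mid s_t)
```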

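  21. Learning
      • Maximum likelihood estimates, computed from counts C in a tagged training corpus:
        • Transitions: $P(s_i \mid s_j) = \frac{C(s_j, s_i)}{C(s_j)}$, where $C(s_j, s_i)$ counts how often tag $s_j$ is immediately followed by tag $s_i$
        • Emissions: $P(o \mid s) = \frac{C(s, o)}{C(s)}$, where $C(s, o)$ counts how often word $o$ is tagged $s$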
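The count-based estimates can be sketched as follows. This is a minimal, unsmoothed illustration on a made-up two-sentence corpus; the function name and data are hypothetical. One deliberate choice that differs slightly from the slide's notation: the transition denominator counts a tag only where it has a successor, so each transition row sums to one.

```python
from collections import Counter

def estimate_hmm(tagged_sentences):
    """Relative-frequency (maximum likelihood) estimates, no smoothing."""
    start_counts = Counter()   # how often each tag starts a sentence
    from_counts = Counter()    # C(s_j), counted only where s_j has a successor
    trans_counts = Counter()   # C(s_j, s_i): s_j immediately followed by s_i
    tag_counts = Counter()     # C(s)
    emit_counts = Counter()    # C(s, o)
    for sentence in tagged_sentences:  # sentences are assumed non-empty
        tags = [tag for _, tag in sentence]
        start_counts[tags[0]] += 1
        for word, tag in sentence:
            tag_counts[tag] += 1
            emit_counts[(tag, word.lower())] += 1
        for prev, nxt in zip(tags, tags[1:]):
            from_counts[prev] += 1
            trans_counts[(prev, nxt)] += 1
    n_sentences = sum(start_counts.values())
    pi = {s: c / n_sentences for s, c in start_counts.items()}
    trans = {(p, q): c / from_counts[p] for (p, q), c in trans_counts.items()}
    emit = {(s, o): c / tag_counts[s] for (s, o), c in emit_counts.items()}
    return pi, trans, emit

# Hypothetical toy corpus, for illustration only.
corpus = [
    [("the", "DT"), ("cat", "NN"), ("sat", "VBD")],
    [("the", "DT"), ("mat", "NN")],
]
pi, trans, emit = estimate_hmm(corpus)
print(pi["DT"], trans[("DT", "NN")], emit[("NN", "cat")])
```

In practice unseen transitions and emissions would get probability zero under these estimates, which is why real taggers add smoothing; that is outside what the slide covers.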

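  23. Example: POS tagging
      the/?? cat/?? sat/?? on/?? the/?? mat/??

      Initial probability: $\pi(\mathrm{DT}) = 0.8$

      Transition probabilities $P(s_{t+1} \mid s_t)$ (rows: $s_t$, columns: $s_{t+1}$):

              DT     NN     IN     VBD
        DT    0.5    0.8    0.05   0.1
        NN    0.05   0.2    0.15   0.6
        IN    0.5    0.2    0.05   0.25
        VBD   0.3    0.3    0.3    0.1

      Emission probabilities $P(o_t \mid s_t)$ (rows: $s_t$, columns: $o_t$):

              the    cat    sat    on     mat
        DT    0.5    0      0      0      0
        NN    0.01   0.2    0.01   0.01   0.2
        IN    0      0      0      0.4    0
        VBD   0      0.01   0.1    0.01   0.01

      Joint probability of the tagging the/DT cat/NN sat/VBD on/IN the/DT mat/NN: $1.84 \times 10^{-5}$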
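The figure of about 1.84 × 10⁻⁵ can be reproduced by multiplying the initial, transition, and emission probabilities along the tagging the/DT cat/NN sat/VBD on/IN the/DT mat/NN. The sketch below copies the table values as given on the slide; the variable and function names are illustrative, and only $\pi(\mathrm{DT})$ is defined because it is the only initial probability the slide provides.

```python
TAGS = ["DT", "NN", "IN", "VBD"]
WORDS = ["the", "cat", "sat", "on", "mat"]

PI = {"DT": 0.8}  # the slide gives only pi(DT)

# TRANS[s_t][s_{t+1}], rows copied from the slide's transition table
TRANS = {s: dict(zip(TAGS, row)) for s, row in {
    "DT":  [0.5, 0.8, 0.05, 0.1],
    "NN":  [0.05, 0.2, 0.15, 0.6],
    "IN":  [0.5, 0.2, 0.05, 0.25],
    "VBD": [0.3, 0.3, 0.3, 0.1],
}.items()}

# EMIT[s_t][o_t], rows copied from the slide's emission table
EMIT = {s: dict(zip(WORDS, row)) for s, row in {
    "DT":  [0.5, 0, 0, 0, 0],
    "NN":  [0.01, 0.2, 0.01, 0.01, 0.2],
    "IN":  [0, 0, 0, 0.4, 0],
    "VBD": [0, 0.01, 0.1, 0.01, 0.01],
}.items()}

def joint_probability(words, tags):
    """pi(s_1) P(o_1|s_1) * product over t >= 2 of P(s_t|s_{t-1}) P(o_t|s_t)."""
    prob = PI[tags[0]] * EMIT[tags[0]][words[0]]
    for t in range(1, len(words)):
        prob *= TRANS[tags[t - 1]][tags[t]] * EMIT[tags[t]][words[t]]
    return prob

sentence = "the cat sat on the mat".split()
tagging = ["DT", "NN", "VBD", "IN", "DT", "NN"]
p = joint_probability(sentence, tagging)
print(p)
```

Step by step the product is 0.4 × 0.16 × 0.06 × 0.12 × 0.25 × 0.16 = 1.8432 × 10⁻⁵.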

  24. Decoding with HMMs
      [Figure: unknown tags ? ? ? ? above observed words $o_1, \dots, o_4$]
      • Task: Find the most probable sequence of states $\langle s_1, s_2, \dots, s_n \rangle$ given the observations $\langle o_1, o_2, \dots, o_n \rangle$
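Written as an equation, the decoding task is an argmax over tag sequences. Because $P(o_1, \dots, o_n)$ does not depend on the tags, maximizing the posterior is equivalent to maximizing the HMM joint probability. This is a standard derivation, reconstructed here with the deck's symbols rather than copied from the slide.

```latex
\hat{S}
  = \arg\max_{s_1, \dots, s_n} P(s_1, \dots, s_n \mid o_1, \dots, o_n)
  = \arg\max_{s_1, \dots, s_n} \frac{P(s_1, \dots, s_n, o_1, \dots, o_n)}{P(o_1, \dots, o_n)}
  = \arg\max_{s_1, \dots, s_n} P(s_1, \dots, s_n, o_1, \dots, o_n)
```

With $K$ states and $n$ words there are $K^n$ candidate sequences, so enumerating them all is impractical for anything but very short sentences.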

  27. Greedy decoding
      [Figure: The/DT; tags for $o_2$, $o_3$, $o_4$ not yet decided]

  28. Greedy decoding
      [Figure: The/DT cat/NN; tags for $o_3$, $o_4$ not yet decided]

  29. Greedy decoding
      [Figure: The/DT cat/NN sat/VBD on/IN]
      • Not guaranteed to be optimal!
      • Local decisions
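Greedy decoding can be sketched as a single left-to-right pass that commits to the locally best tag at each word. The sketch reuses the toy tables from the POS tagging example on slide 23. The initial probabilities for NN, IN, and VBD are assumptions, since the slide gives only $\pi(\mathrm{DT}) = 0.8$, and the function name is illustrative.

```python
TAGS = ["DT", "NN", "IN", "VBD"]
WORDS = ["the", "cat", "sat", "on", "mat"]

# pi(DT) is from the slide; the other three values are assumed for illustration.
PI = {"DT": 0.8, "NN": 0.1, "IN": 0.05, "VBD": 0.05}

TRANS = {s: dict(zip(TAGS, row)) for s, row in {   # TRANS[s_t][s_{t+1}]
    "DT":  [0.5, 0.8, 0.05, 0.1],
    "NN":  [0.05, 0.2, 0.15, 0.6],
    "IN":  [0.5, 0.2, 0.05, 0.25],
    "VBD": [0.3, 0.3, 0.3, 0.1],
}.items()}

EMIT = {s: dict(zip(WORDS, row)) for s, row in {   # EMIT[s_t][o_t]
    "DT":  [0.5, 0, 0, 0, 0],
    "NN":  [0.01, 0.2, 0.01, 0.01, 0.2],
    "IN":  [0, 0, 0, 0.4, 0],
    "VBD": [0, 0.01, 0.1, 0.01, 0.01],
}.items()}

def greedy_decode(words):
    """Pick the best tag for each word given only the previously chosen tag."""
    tags = []
    for i, word in enumerate(words):
        def local_score(s):
            if i == 0:
                return PI[s] * EMIT[s][word]
            return TRANS[tags[-1]][s] * EMIT[s][word]
        tags.append(max(TAGS, key=local_score))
    return tags

greedy_tags = greedy_decode("the cat sat on the mat".split())
print(greedy_tags)
```

Each choice is final once made, so an early decision that looks good locally can never be revised in light of later words. That is the sense in which the result is not guaranteed to be optimal.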

  30. Viterbi decoding
      • Use dynamic programming!
      • Probability lattice, $M[T, K]$
        • $T$: number of time steps
        • $K$: number of states
      • $M[i, j]$: probability of the most probable sequence of states ending with state $j$ at time $i$

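  31. Viterbi decoding
      Forward pass, first word ("the"):
        $M[1, \mathrm{DT}] = \pi(\mathrm{DT})\, P(\text{the} \mid \mathrm{DT})$
        $M[1, \mathrm{NN}] = \pi(\mathrm{NN})\, P(\text{the} \mid \mathrm{NN})$
        $M[1, \mathrm{VBD}] = \pi(\mathrm{VBD})\, P(\text{the} \mid \mathrm{VBD})$
        $M[1, \mathrm{IN}] = \pi(\mathrm{IN})\, P(\text{the} \mid \mathrm{IN})$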

  32. Viterbi decoding
      Forward pass, second word ("cat"):
        $M[2, \mathrm{DT}] = \max_k M[1, k]\, P(\mathrm{DT} \mid k)\, P(\text{cat} \mid \mathrm{DT})$
        $M[2, \mathrm{NN}] = \max_k M[1, k]\, P(\mathrm{NN} \mid k)\, P(\text{cat} \mid \mathrm{NN})$
        $M[2, \mathrm{VBD}] = \max_k M[1, k]\, P(\mathrm{VBD} \mid k)\, P(\text{cat} \mid \mathrm{VBD})$
        $M[2, \mathrm{IN}] = \max_k M[1, k]\, P(\mathrm{IN} \mid k)\, P(\text{cat} \mid \mathrm{IN})$

  34. Viterbi decoding
      [Figure: lattice with one column of states DT, NN, VBD, IN for each word of "The cat sat on"]
      • Forward: $M[i, j] = \max_k M[i-1, k]\, P(s_j \mid s_k)\, P(o_i \mid s_j)$, for $1 \le k \le K$, $1 \le i \le n$
      • Backward: pick $\max_k M[n, k]$ and backtrack
      • Time complexity?
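The forward recursion and the backtracking step can be sketched as below, using the toy tables from the POS tagging example on slide 23. The initial probabilities other than $\pi(\mathrm{DT})$ are assumptions, the code is 0-indexed where the slides are 1-indexed, and the names are illustrative rather than taken from the course.

```python
TAGS = ["DT", "NN", "IN", "VBD"]
WORDS = ["the", "cat", "sat", "on", "mat"]

# pi(DT) is from the slide; the other three values are assumed for illustration.
PI = {"DT": 0.8, "NN": 0.1, "IN": 0.05, "VBD": 0.05}

TRANS = {s: dict(zip(TAGS, row)) for s, row in {   # TRANS[s_k][s_j] = P(s_j | s_k)
    "DT":  [0.5, 0.8, 0.05, 0.1],
    "NN":  [0.05, 0.2, 0.15, 0.6],
    "IN":  [0.5, 0.2, 0.05, 0.25],
    "VBD": [0.3, 0.3, 0.3, 0.1],
}.items()}

EMIT = {s: dict(zip(WORDS, row)) for s, row in {   # EMIT[s_j][o_i] = P(o_i | s_j)
    "DT":  [0.5, 0, 0, 0, 0],
    "NN":  [0.01, 0.2, 0.01, 0.01, 0.2],
    "IN":  [0, 0, 0, 0.4, 0],
    "VBD": [0, 0.01, 0.1, 0.01, 0.01],
}.items()}

def viterbi(words):
    n = len(words)
    # M[i][j]: probability of the best tag sequence for words[0..i] ending in tag j
    M = [{} for _ in range(n)]
    back = [{} for _ in range(n)]  # back[i][j]: best previous tag for (i, j)
    for j in TAGS:
        M[0][j] = PI[j] * EMIT[j][words[0]]
    for i in range(1, n):
        for j in TAGS:
            best_k = max(TAGS, key=lambda k: M[i - 1][k] * TRANS[k][j])
            M[i][j] = M[i - 1][best_k] * TRANS[best_k][j] * EMIT[j][words[i]]
            back[i][j] = best_k
    # Backward pass: start from the best final state and follow the pointers.
    last = max(TAGS, key=lambda k: M[n - 1][k])
    path = [last]
    for i in range(n - 1, 0, -1):
        path.append(back[i][path[-1]])
    path.reverse()
    return path, M[n - 1][last]

best_path, best_prob = viterbi("the cat sat on the mat".split())
print(best_path, best_prob)
```

The three nested loops (words, current tag, previous tag) do a constant amount of work per cell and predecessor, which gives a running time proportional to $n K^2$ for $n$ words and $K$ states.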

  36. Beam Search
      • If K (number of states) is too large, Viterbi is too expensive!
      [Figure: lattice with one column of states DT, NN, VBD, IN for each word of "The cat sat on"]
      Many paths have very low likelihood!

  37. Beam Search
      • If K (number of states) is too large, Viterbi is too expensive!
      • Keep a fixed number of hypotheses at each point
        • Beam width, $\beta$

  38. Beam Search
      • Keep a fixed number of hypotheses at each point
      [Figure: scores after "The", with $\beta = 2$]
        DT    score = -4.1
        NN    score = -9.8
        VBD   score = -6.7
        IN    score = -10.1

  39. Beam Search
      • Keep a fixed number of hypotheses at each point
      [Figure: scores after "The cat", with $\beta = 2$; listed in lattice order DT, NN, VBD, IN]
        DT    score = -16.5
        NN    score = -6.5
        VBD   score = -13.0
        IN    score = -22.1
      Step 1: Expand all partial sequences in current beam

  40. Beam Search
      • Keep a fixed number of hypotheses at each point
      [Figure: scores after "The cat", with $\beta = 2$; listed in lattice order DT, NN, VBD, IN]
        DT    score = -16.5
        NN    score = -6.5
        VBD   score = -13.0
        IN    score = -22.1
      Step 2: Prune set back to top $\beta$ sequences

  41. Beam Search
      • Keep a fixed number of hypotheses at each point
      [Figure: lattice with one column of states DT, NN, VBD, IN for each word of "The cat sat on", with $\beta = 2$]
      Pick $\max_k M[n, k]$ from within beam and backtrack

  42. Beam Search
      • If K (number of states) is too large, Viterbi is too expensive!
      • Keep a fixed number of hypotheses at each point
        • Beam width, $\beta$
      • Trade-off computation for (some) accuracy
      Time complexity?
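The expand-then-prune loop can be sketched as follows. Scores are log-probabilities, matching the negative scores shown on the earlier beam search slides. The tables are the toy values from slide 23, the initial probabilities other than $\pi(\mathrm{DT})$ are assumptions, and the helper names (`safe_log`, `beam_search`) are illustrative.

```python
import math

TAGS = ["DT", "NN", "IN", "VBD"]
WORDS = ["the", "cat", "sat", "on", "mat"]

# pi(DT) is from the slide; the other three values are assumed for illustration.
PI = {"DT": 0.8, "NN": 0.1, "IN": 0.05, "VBD": 0.05}

TRANS = {s: dict(zip(TAGS, row)) for s, row in {   # TRANS[s_t][s_{t+1}]
    "DT":  [0.5, 0.8, 0.05, 0.1],
    "NN":  [0.05, 0.2, 0.15, 0.6],
    "IN":  [0.5, 0.2, 0.05, 0.25],
    "VBD": [0.3, 0.3, 0.3, 0.1],
}.items()}

EMIT = {s: dict(zip(WORDS, row)) for s, row in {   # EMIT[s_t][o_t]
    "DT":  [0.5, 0, 0, 0, 0],
    "NN":  [0.01, 0.2, 0.01, 0.01, 0.2],
    "IN":  [0, 0, 0, 0.4, 0],
    "VBD": [0, 0.01, 0.1, 0.01, 0.01],
}.items()}

def safe_log(p):
    """Natural log, with log(0) mapped to -inf so impossible paths sort last."""
    return math.log(p) if p > 0 else float("-inf")

def beam_search(words, beam_width=2):
    # Each hypothesis is a pair (log score, tag sequence so far).
    beam = [(safe_log(PI[s]) + safe_log(EMIT[s][words[0]]), [s]) for s in TAGS]
    beam = sorted(beam, key=lambda h: h[0], reverse=True)[:beam_width]
    for word in words[1:]:
        # Step 1: expand every hypothesis in the beam by every possible tag.
        candidates = [
            (score + safe_log(TRANS[seq[-1]][s]) + safe_log(EMIT[s][word]), seq + [s])
            for score, seq in beam
            for s in TAGS
        ]
        # Step 2: prune back to the top beam_width hypotheses.
        beam = sorted(candidates, key=lambda h: h[0], reverse=True)[:beam_width]
    best_score, best_seq = beam[0]
    return best_seq, best_score

beam_tags, beam_score = beam_search("the cat sat on the mat".split(), beam_width=2)
print(beam_tags, beam_score)
```

Each step scores at most $\beta K$ candidates instead of $K^2$, so the work grows roughly as $n \beta K$ plus the cost of sorting the candidates. The price is that a sequence pruned early can never be recovered, even if it would have turned out best.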

  43. Beyond bigrams
      • Real-world HMM taggers have more relaxed assumptions
      • Trigram HMM: $P(s_{t+1} \mid s_1, s_2, \dots, s_t) \approx P(s_{t+1} \mid s_{t-1}, s_t)$
      [Figure: The/DT cat/NN sat/VBD on/IN]
      Pros? Cons?
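One way to make the trade-off concrete is to count transition parameters. A bigram HMM conditions on one previous tag and a trigram HMM on two, so with the 45-tag Penn Treebank tagset mentioned earlier the table sizes compare as below. This is a back-of-the-envelope sketch, not a figure from the slides.

```latex
\text{bigram transitions: } K^2 = 45^2 = 2025
\qquad
\text{trigram transitions: } K^3 = 45^3 = 91125
```

More context can capture longer-range patterns, but the larger table is harder to estimate reliably from the same amount of data and makes decoding more expensive.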

  44. Maximum Entropy Markov Models

  45. Generative vs Discriminative
      • HMM is a generative model
      • Can we model $P(s_1, \dots, s_n \mid o_1, \dots, o_n)$ directly?
      Generative:
        • Naive Bayes: $P(c)\, P(d \mid c)$
        • HMM: $P(s_1, \dots, s_n)\, P(o_1, \dots, o_n \mid s_1, \dots, s_n)$
      Discriminative:
        • Logistic Regression: $P(c \mid d)$
        • MEMM: $P(s_1, \dots, s_n \mid o_1, \dots, o_n)$

  46. MEMM
      [Figure: HMM and MEMM graphical models side by side, each over "The cat sat on" with tags DT NN VB IN]
      • Compute the posterior directly:
        $\hat{S} = \arg\max_S P(S \mid O) = \arg\max_S \prod_i P(s_i \mid o_i, s_{i-1})$
      • Use features: $P(s_i \mid o_i, s_{i-1}) \propto \exp(w \cdot f(s_i, o_i, s_{i-1}))$
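The local MEMM distribution can be sketched as a softmax over feature scores: compute $w \cdot f(s_i, o_i, s_{i-1})$ for every candidate tag, exponentiate, and normalize. The feature templates, the weight values, and the function names below are all made up for illustration; a real model would learn the weights from tagged data.

```python
import math

TAGS = ["DT", "NN", "IN", "VBD"]

def features(tag, word, prev_tag):
    """Hypothetical binary features f(s_i, o_i, s_{i-1}), keyed by name."""
    return {
        f"word={word.lower()}|tag={tag}": 1.0,
        f"prev={prev_tag}|tag={tag}": 1.0,
        f"suffix={word[-2:].lower()}|tag={tag}": 1.0,
    }

def local_distribution(word, prev_tag, weights):
    """P(s_i | o_i, s_{i-1}) as a softmax over w . f(s_i, o_i, s_{i-1})."""
    scores = {
        tag: sum(weights.get(name, 0.0) * value
                 for name, value in features(tag, word, prev_tag).items())
        for tag in TAGS
    }
    top = max(scores.values())  # subtract the max for numerical stability
    exp_scores = {tag: math.exp(s - top) for tag, s in scores.items()}
    z = sum(exp_scores.values())
    return {tag: e / z for tag, e in exp_scores.items()}

# Made-up weights, for illustration only.
WEIGHTS = {
    "word=cat|tag=NN": 2.0,
    "prev=DT|tag=NN": 1.5,
    "prev=DT|tag=DT": -2.0,
}

dist = local_distribution("cat", "DT", WEIGHTS)
print(max(dist, key=dist.get), dist)
```

Because the features can look at any property of the current word (suffixes, capitalization, and so on), the model is not limited to the word-identity emissions of an HMM. Decoding still multiplies these local distributions along the sequence, so Viterbi or beam search applies as before.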
