Algorithms for NLP, CS 11711, Fall 2019. Lecture 7: HMMs, POS tagging. Yulia Tsvetkov
Readings for today’s lecture ▪ J&M SLP3: https://web.stanford.edu/~jurafsky/slp3/8.pdf ▪ Collins (2011): http://www.cs.columbia.edu/~mcollins/hmms-spring2013.pdf
Levels of linguistic knowledge Slide credit: Noah Smith
Sequence Labeling ▪ map a sequence of words to a sequence of labels ▪ Part-of-speech tagging (Church, 1988; Brants, 2000) ▪ Named entity recognition (Bikel et al., 1999) ▪ Text chunking and shallow parsing (Ramshaw and Marcus, 1995) ▪ Word alignment of parallel text (Vogel et al., 1996) ▪ Compression (Conroy and O’Leary, 2001) ▪ Acoustic models, discourse segmentation, etc.
Sequence labeling as classification
Generative sequence labeling: Hidden Markov Models
Markov Chain: weather the future is independent of the past given the present
Markov Chain
Markov Chain: words the future is independent of the past given the present
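The Markov property above ("the future is independent of the past given the present") can be sketched with a toy weather chain; the states and probabilities below are made up for illustration:

```python
import random

# Hypothetical weather Markov chain: the next state depends only on the current one.
TRANSITIONS = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def sample_next(state, rng):
    """Sample the next state given only the current one (Markov property)."""
    r, cum = rng.random(), 0.0
    for nxt, p in TRANSITIONS[state].items():
        cum += p
        if r < cum:
            return nxt
    return nxt  # guard against floating-point rounding

def sample_chain(start, n, seed=0):
    """Generate a length-n state sequence starting from `start`."""
    rng = random.Random(seed)
    states = [start]
    for _ in range(n - 1):
        states.append(sample_next(states[-1], rng))
    return states
```

Each row of the transition table is a distribution over next states, which is all a first-order Markov chain needs.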
Hidden Markov Models ▪ In the real world, many events are not directly observable ▪ Speech recognition: we observe acoustic features but not the phones ▪ POS tagging: we observe words but not the POS tags [diagram: hidden states q_1 q_2 … q_n emit observations o_1 o_2 … o_n]
HMM From J&M
HMM example From J&M
Generative vs. Discriminative models ▪ Generative models specify a joint distribution P(x, y) over the labels and the data; with such a model you can generate new data ▪ Discriminative models specify the conditional distribution P(y | x) of the label y given the data x; these models focus on how to discriminate between the classes From Bamman
Types of HMMs ▪ + many more From J&M
HMM in Language Technologies ▪ Part-of-speech tagging (Church, 1988; Brants, 2000) ▪ Named entity recognition (Bikel et al., 1999) and other information extraction tasks ▪ Text chunking and shallow parsing (Ramshaw and Marcus, 1995) ▪ Word alignment of parallel text (Vogel et al., 1996) ▪ Acoustic models in speech recognition (emissions are continuous) ▪ Discourse segmentation (labeling parts of a document)
HMM Parameters From J&M
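Concretely, the parameters are an initial distribution π over tags, a transition matrix A, and an emission matrix B (the π/A/B notation of J&M). A minimal sketch with made-up numbers, plus the joint probability the model assigns to a tagged sentence:

```python
# Toy HMM for POS tagging; all probabilities are invented for illustration.
# pi: initial tag probs; A: tag -> tag transitions; B: tag -> word emissions.
pi = {"DT": 0.6, "NN": 0.3, "VB": 0.1}
A = {"DT": {"DT": 0.05, "NN": 0.9, "VB": 0.05},
     "NN": {"DT": 0.2, "NN": 0.3, "VB": 0.5},
     "VB": {"DT": 0.5, "NN": 0.3, "VB": 0.2}}
B = {"DT": {"the": 0.9, "a": 0.1, "dog": 0.0, "barks": 0.0},
     "NN": {"the": 0.0, "a": 0.1, "dog": 0.8, "barks": 0.1},
     "VB": {"the": 0.0, "a": 0.1, "dog": 0.1, "barks": 0.8}}

def joint_prob(tags, words):
    """P(tags, words) = pi[t1]*B[t1][w1] * prod_i A[t_{i-1}][t_i]*B[t_i][w_i]."""
    p = pi[tags[0]] * B[tags[0]][words[0]]
    for prev, t, w in zip(tags, tags[1:], words[1:]):
        p *= A[prev][t] * B[t][w]
    return p
```

The joint factorizes into one transition and one emission term per position, which is what makes dynamic-programming inference possible.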
HMMs: Questions From J&M
HMMs: Algorithms ▪ Forward ▪ Viterbi ▪ Forward–Backward; Baum–Welch From J&M
HMM tagging as decoding
HMM tagging as decoding How many possible choices? For a sentence of n words and a tagset of size |T|, there are |T|^n possible tag sequences, far too many to enumerate.
Part of speech tagging example Slide credit: Noah Smith
Part of speech tagging example Greedy decoding? Slide credit: Noah Smith
Part of speech tagging example Greedy decoding? Consider: “the old dog the footsteps of the young” Slide credit: Noah Smith
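Greedy decoding commits to the locally best tag at each position, which is exactly what garden-path sentences like the one above break. A minimal sketch with a made-up two-tag HMM in which the greedy path is not the globally best one:

```python
# Toy two-tag HMM (all numbers invented) where the locally best first tag
# leads away from the globally best path -- the garden-path failure mode.
pi = {"X": 0.6, "Y": 0.4}
A = {"X": {"X": 0.5, "Y": 0.5}, "Y": {"X": 0.1, "Y": 0.9}}
B = {"X": {"w1": 0.5, "w2": 0.1}, "Y": {"w1": 0.6, "w2": 0.9}}

def greedy_decode(words):
    """Commit to the locally best tag at each position, left to right."""
    seq = [max(pi, key=lambda t: pi[t] * B[t][words[0]])]
    for w in words[1:]:
        seq.append(max(pi, key=lambda t: A[seq[-1]][t] * B[t][w]))
    return seq

def joint(tags, words):
    """Joint probability of a full tag sequence under the toy HMM."""
    p = pi[tags[0]] * B[tags[0]][words[0]]
    for prev, t, w in zip(tags, tags[1:], words[1:]):
        p *= A[prev][t] * B[t][w]
    return p
```

Here greedy picks X first (0.6·0.5 > 0.4·0.6), but the path Y, Y has higher joint probability, so greedy misses the optimum.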
The Viterbi Algorithm
The Viterbi Algorithm Complexity? O(n·|T|²) time and O(n·|T|) space for a sentence of n words and a tagset of size |T|.
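A sketch of Viterbi decoding in log space, over a made-up toy HMM (tags and probabilities are illustrative only):

```python
import math

# Viterbi: best tag sequence in O(n * |tags|^2). Log-space avoids underflow.
# Toy parameters, invented for illustration.
pi = {"DT": 0.7, "NN": 0.2, "VB": 0.1}
A = {"DT": {"DT": 0.1, "NN": 0.8, "VB": 0.1},
     "NN": {"DT": 0.2, "NN": 0.3, "VB": 0.5},
     "VB": {"DT": 0.6, "NN": 0.3, "VB": 0.1}}
B = {"DT": {"the": 0.9, "dog": 0.05, "barks": 0.05},
     "NN": {"the": 0.05, "dog": 0.85, "barks": 0.1},
     "VB": {"the": 0.05, "dog": 0.15, "barks": 0.8}}

def viterbi(words):
    tags = list(pi)
    # delta[t] = best log-prob of any path ending in tag t at this position.
    delta = {t: math.log(pi[t]) + math.log(B[t][words[0]]) for t in tags}
    back = []  # back-pointers, one dict per position after the first
    for w in words[1:]:
        new, ptr = {}, {}
        for t in tags:
            best_prev = max(tags, key=lambda p: delta[p] + math.log(A[p][t]))
            new[t] = delta[best_prev] + math.log(A[best_prev][t]) + math.log(B[t][w])
            ptr[t] = best_prev
        delta, back = new, back + [ptr]
    # Recover the best path by following back-pointers from the best final tag.
    t = max(tags, key=lambda x: delta[x])
    seq = [t]
    for ptr in reversed(back):
        t = ptr[t]
        seq.append(t)
    return seq[::-1]
```

Each cell keeps only the best incoming path, which is why the table is |T| wide rather than |T|^n paths deep.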
Beam search
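Beam search approximates Viterbi by keeping only the k highest-scoring partial paths at each step, trading exactness for speed. A sketch over the same kind of toy HMM (made-up numbers):

```python
import math

# Beam search: prune to the k best partial paths instead of one cell per tag.
# Toy parameters, invented for illustration.
pi = {"DT": 0.7, "NN": 0.2, "VB": 0.1}
A = {"DT": {"DT": 0.1, "NN": 0.8, "VB": 0.1},
     "NN": {"DT": 0.2, "NN": 0.3, "VB": 0.5},
     "VB": {"DT": 0.6, "NN": 0.3, "VB": 0.1}}
B = {"DT": {"the": 0.9, "dog": 0.05, "barks": 0.05},
     "NN": {"the": 0.05, "dog": 0.85, "barks": 0.1},
     "VB": {"the": 0.05, "dog": 0.15, "barks": 0.8}}

def beam_decode(words, k=2):
    """Approximate decoding: keep the k highest-scoring paths per step."""
    beam = sorted(
        (((t,), math.log(pi[t]) + math.log(B[t][words[0]])) for t in pi),
        key=lambda x: x[1], reverse=True)[:k]
    for w in words[1:]:
        cand = [(path + (t,), s + math.log(A[path[-1]][t]) + math.log(B[t][w]))
                for path, s in beam for t in pi]
        beam = sorted(cand, key=lambda x: x[1], reverse=True)[:k]
    return list(beam[0][0])
```

With k = 1 this degenerates to greedy decoding; with k large enough to cover all paths it recovers the exact Viterbi answer.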
Viterbi ▪ n-best decoding ▪ relationship to sequence alignment
HMMs: Algorithms ▪ Forward ▪ Viterbi ▪ Forward–Backward; Baum–Welch From J&M
The Forward Algorithm ▪ the same recurrence as Viterbi, but with a sum instead of a max
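A sketch of the forward algorithm over a made-up toy HMM: the recurrence mirrors Viterbi with the max replaced by a sum, so it returns the total probability of the observed words under all tag sequences:

```python
# Forward algorithm: P(words) = sum over all tag sequences of P(tags, words).
# Toy parameters, invented for illustration.
pi = {"DT": 0.7, "NN": 0.2, "VB": 0.1}
A = {"DT": {"DT": 0.1, "NN": 0.8, "VB": 0.1},
     "NN": {"DT": 0.2, "NN": 0.3, "VB": 0.5},
     "VB": {"DT": 0.6, "NN": 0.3, "VB": 0.1}}
B = {"DT": {"the": 0.9, "dog": 0.05, "barks": 0.05},
     "NN": {"the": 0.05, "dog": 0.85, "barks": 0.1},
     "VB": {"the": 0.05, "dog": 0.15, "barks": 0.8}}

def forward(words):
    # alpha[t] = total probability of all paths ending in tag t at this position.
    alpha = {t: pi[t] * B[t][words[0]] for t in pi}
    for w in words[1:]:
        alpha = {t: B[t][w] * sum(alpha[p] * A[p][t] for p in alpha) for t in pi}
    return sum(alpha.values())
```

Because every path contributes, the result is always at least the probability of the single best (Viterbi) path.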
Parts of Speech
The closed classes
More Fine-Grained Classes
More Fine-Grained Classes
The Penn Treebank Part-of-Speech Tagset
The Universal POS tagset https://universaldependencies.org
POS tagging
POS tagging goal: resolve POS ambiguities
Most Frequent Class Baseline ▪ Training on the WSJ corpus and testing on sections 22-24 of the same corpus, the most-frequent-tag baseline achieves an accuracy of 92.34%.
Most Frequent Class Baseline ● 97% tag accuracy is achievable by most algorithms (HMMs, MEMMs, neural networks, rule-based algorithms)
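The baseline itself is only a few lines: tag each word with the tag it carried most often in training, backing off to the corpus-wide most frequent tag for unseen words. A sketch (function names are illustrative):

```python
from collections import Counter, defaultdict

def train_baseline(tagged_sents):
    """Learn each word's most frequent tag, plus a default for unseen words."""
    word_tags, all_tags = defaultdict(Counter), Counter()
    for sent in tagged_sents:
        for word, tag in sent:
            word_tags[word][tag] += 1
            all_tags[tag] += 1
    lexicon = {w: c.most_common(1)[0][0] for w, c in word_tags.items()}
    return lexicon, all_tags.most_common(1)[0][0]

def tag_baseline(words, lexicon, default):
    """Tag by lexicon lookup, falling back to the overall most frequent tag."""
    return [lexicon.get(w, default) for w in words]
```

Run on a tiny invented corpus this reproduces the intended behavior; real evaluation would use the WSJ splits mentioned above.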