Part-of-Speech Tagging
Berlin Chen, 2005
References:
1. Speech and Language Processing, chapter 8
2. Foundations of Statistical Natural Language Processing, chapter 10
Review
• Tagging (part-of-speech tagging)
  – The process of assigning (labeling) a part-of-speech or other lexical class marker to each word in a sentence (or a corpus)
    • Decide whether each word is a noun, verb, adjective, or whatever
      The/AT representative/NN put/VBD chairs/NNS on/IN the/AT table/NN
      or
      The/AT representative/JJ put/NN chairs/VBZ on/IN the/AT table/NN
  – An intermediate layer of representation of syntactic structure (when compared with syntactic parsing)
  – Above 96% accuracy for the most successful approaches
• Tagging can be viewed as a kind of syntactic disambiguation
Introduction
• Parts-of-speech
  – Also known as POS, word classes, lexical tags, morphological classes
• Tag sets
  – Penn Treebank: 45 word classes used (Marcus et al., 1993)
    • The Penn Treebank is a parsed corpus
  – Brown corpus: 87 word classes used (Francis, 1979)
  – …
  The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.
The Penn Treebank POS Tag Set
Disambiguation
• Resolve the ambiguities and choose the proper tag for the context
• Most English words are unambiguous (have only one tag), but many of the most common words are ambiguous
  – E.g., "can" can be an (auxiliary) verb or a noun
  – E.g., statistics from the Brown corpus:
    • 11.5% of word types are ambiguous
    • But 40% of tokens are ambiguous
    • However, the probabilities of the tags associated with a word are not equal, $P(t^1 \mid w) \neq P(t^2 \mid w) \neq \cdots$, so many ambiguous tokens are easy to disambiguate
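The skewed tag distribution can be seen directly by counting relative frequencies in a tagged corpus. Below is a minimal Python sketch using an invented toy corpus standing in for real data (the word/tag pairs are made up for illustration); with a corpus such as Brown, the same counts give maximum-likelihood estimates of P(t|w) and show which word types are ambiguous.

```python
from collections import Counter, defaultdict

# Toy tagged corpus (invented for illustration; in practice use Brown or similar).
tagged_tokens = [
    ("the", "AT"), ("can", "MD"), ("can", "MD"), ("can", "MD"),
    ("can", "NN"), ("race", "NN"), ("race", "VB"), ("table", "NN"),
]

tag_counts = defaultdict(Counter)
for word, tag in tagged_tokens:
    tag_counts[word][tag] += 1

# Relative-frequency estimate of P(tag | word) for each word type.
for word, counts in sorted(tag_counts.items()):
    total = sum(counts.values())
    dist = {tag: round(n / total, 2) for tag, n in counts.items()}
    status = "ambiguous" if len(counts) > 1 else "unambiguous"
    print(f"{word:6s} {status:12s} {dist}")
```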
Process of POS Tagging
• Input: a specified tagset and a string of words; the tagging algorithm produces a single best tag for each word
  Book/VB that/DT flight/NN ./.
  Does/VBZ that/DT flight/NN serve/VB dinner/NN ?
• Two information sources are used:
  – Syntagmatic information (looking at information about tag sequences)
  – Lexical information (predicting a tag based on the word concerned)
POS Tagging Algorithms Fall into One of Two Classes
• Rule-based tagger
  – Involves a large database of handcrafted disambiguation rules
    • E.g., a rule specifies that an ambiguous word is a noun rather than a verb if it follows a determiner (see the sketch after this list)
    • ENGTWOL: a simple rule-based tagger based on the constraint grammar architecture
• Stochastic/probabilistic tagger
  – Also called a model-based tagger
  – Uses a training corpus to compute the probability of a given word having a given tag in a given context
    • E.g., "a new play": P(NN|JJ) ≈ 0.45 vs. P(VBP|JJ) ≈ 0.0005
  – E.g., the HMM tagger chooses the best tag for a given word by maximizing the product of the word likelihood and the tag sequence probability
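As a concrete illustration of the determiner rule mentioned above, here is a minimal Python sketch. The function name, candidate-list format, and fallback behaviour are all invented for this example; they are not part of ENGTWOL or any particular tagger.

```python
# Hypothetical handcrafted rule: an ambiguous noun/verb word that follows a
# determiner is tagged as a noun.  Everything here is illustrative only.
def disambiguate(candidate_tags, prev_tag):
    if prev_tag == "DT" and {"NN", "VB"} <= set(candidate_tags):
        return "NN"
    return candidate_tags[0]  # naive fallback: keep the first candidate

# "the book" -> "book" follows a determiner, so the rule chooses NN over VB.
print(disambiguate(["VB", "NN"], prev_tag="DT"))  # NN
```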
POS Tagging Algorithms (cont.)
• Transformation-based/Brill tagger
  – A hybrid approach
  – Like the rule-based approach, it determines the tag of an ambiguous word based on rules
  – Like the stochastic approach, the rules are automatically induced from a previously tagged training corpus with machine learning techniques
    • Supervised learning
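To make the idea concrete, here is a minimal sketch of applying one learned transformation. The rule shown ("change NN to VB when the previous tag is TO") is a classic example from Brill's work, but the rule format and the helper function are invented for illustration rather than taken from the original tagger.

```python
# One Brill-style transformation: change tag NN to VB when the previous tag is TO.
# The dictionary-based rule format and apply_transformation() are illustrative only.
rule = {"from": "NN", "to": "VB", "when_prev": "TO"}

def apply_transformation(tagged_sentence, rule):
    out = list(tagged_sentence)
    for i in range(1, len(out)):
        word, tag = out[i]
        if tag == rule["from"] and out[i - 1][1] == rule["when_prev"]:
            out[i] = (word, rule["to"])
    return out

# Initial (most-frequent-tag) tagging, then one transformation pass.
sentence = [("to", "TO"), ("race", "NN"), ("tomorrow", "NN")]
print(apply_transformation(sentence, rule))
# [('to', 'TO'), ('race', 'VB'), ('tomorrow', 'NN')]
```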
Rule-based POS Tagging
• Two-stage architecture
  – First stage: use a dictionary to assign each word a list of potential parts-of-speech
  – Second stage: use large lists of hand-written disambiguation rules to winnow down this list to a single part-of-speech for each word
• Example input for the ENGTWOL tagger: "Pavlov had shown that salivation …"
  First-stage (dictionary) output:
    Pavlov      PAVLOV N NOM SG PROPER
    had         HAVE V PAST VFIN SVO        (preterite)
                HAVE PCP2 SVO               (past participle)
    shown       SHOW PCP2 SVOO SVO SV
    that        ADV
                PRON DEM SG
                DET CENTRAL DEM SG
                CS                          (complementizer)
    salivation  N NOM SG
  A set of 1,100 constraints can then be applied to the input sentence
Rule-based POS Tagging (cont.)
• Simple lexical entries in the ENGTWOL lexicon (table; PCP2 marks a past participle)
Rule-based POS Tagging (cont.)
• Example constraint for adverbial "that"
  – It isn't that odd!    ("that" is an adverb, ADV)
  – I consider that odd.  ("that" is a complement)
  – The constraint keeps the ADV reading of "that" when the next word is an adjective, adverb, or quantifier followed by a sentence boundary, and the preceding word is not a verb (like "consider") that licenses adjectives as object complements; otherwise the ADV reading is eliminated
  – A sketch of this constraint appears below
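Below is a minimal Python sketch of that constraint, following the description above. The tag names, the simplified context representation, and the function itself are assumptions made for illustration; ENGTWOL's actual constraint formalism is richer.

```python
# Simplified "adverbial that" constraint.  Tags and context encoding are
# invented for this sketch; ENGTWOL's real constraints operate on richer
# morphological readings.
def adverbial_that(candidates, next_tag, after_next, prev_tag):
    adj_adv_quant = {"JJ", "ADV", "QUANT"}
    object_complement_verbs = {"SVOC/A"}  # verbs like "consider"
    if next_tag in adj_adv_quant and after_next == "SENT-LIM" \
            and prev_tag not in object_complement_verbs:
        return [t for t in candidates if t == "ADV"]   # keep only ADV
    return [t for t in candidates if t != "ADV"]       # otherwise drop ADV

readings = ["ADV", "PRON", "DET", "CS"]
# "It isn't that odd!" -> next word is an adjective at the end of the sentence.
print(adverbial_that(readings, "JJ", "SENT-LIM", prev_tag="VBZ"))     # ['ADV']
# "I consider that odd." -> previous word licenses adjective complements.
print(adverbial_that(readings, "JJ", "SENT-LIM", prev_tag="SVOC/A"))  # non-ADV readings
```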
HMM-based Tagging
• Also called Maximum Likelihood Tagging
  – Pick the most likely tag for each word
• For a given sentence or word sequence, an n-gram HMM tagger chooses the tag sequence that maximizes the following probability
  – For a word at position i:

    $\text{tag}_i = \arg\max_{t_j} P(w_i \mid t_j)\, P(t_j \mid \text{previous } n-1 \text{ tags})$

    where $P(w_i \mid t_j)$ is the word/lexical likelihood and $P(t_j \mid \ldots)$ is the tag sequence probability
HMM-based Tagging (cont.)
• For a word $w_i$ at position $i$, following Bayes' theorem:

$$
\begin{aligned}
t_i &= \arg\max_{t_j} P(t_j \mid w_i, t_{i-1}, t_{i-2}, \ldots, t_1) \\
    &= \arg\max_{t_j} \frac{P(w_i, t_j \mid t_{i-1}, t_{i-2}, \ldots, t_1)}{P(w_i \mid t_{i-1}, t_{i-2}, \ldots, t_1)} \\
    &= \arg\max_{t_j} P(w_i, t_j \mid t_{i-1}, t_{i-2}, \ldots, t_1) \\
    &= \arg\max_{t_j} P(w_i \mid t_j, t_{i-1}, t_{i-2}, \ldots, t_1)\, P(t_j \mid t_{i-1}, t_{i-2}, \ldots, t_1) \\
    &\approx \arg\max_{t_j} P(w_i \mid t_j)\, P(t_j \mid t_{i-1}, t_{i-2}, \ldots, t_{i-n+1})
\end{aligned}
$$
HMM-based Tagging (cont.)
• Assumptions made here
  – Words are independent of each other
    • A word's identity only depends on its tag
  – "Limited horizon" and "time invariant" ("stationary")
    • Limited horizon: a word's tag only depends on the previous few tags
    • Time invariant: the tag dependency does not change as the tag sequence appears at different positions in a sentence
  – These assumptions do not model long-distance relationships well, e.g., wh-extraction, …
HMM-based Tagging (cont.)
• Apply a bigram-HMM tagger to choose the best tag for a given word
  – Choose the tag $t_i$ for word $w_i$ that is most probable given the previous tag $t_{i-1}$ and the current word $w_i$:

    $t_i = \arg\max_{t_j} P(t_j \mid t_{i-1}, w_i)$

  – Through some simplifying Markov assumptions:

    $t_i = \arg\max_{t_j} P(t_j \mid t_{i-1})\, P(w_i \mid t_j)$

    (tag sequence probability × word/lexical likelihood)
HMM-based Tagging (cont.)
• Apply a bigram-HMM tagger to choose the best tag for a given word:

$$
\begin{aligned}
t_i &= \arg\max_{t_j} P(t_j \mid t_{i-1}, w_i) \\
    &= \arg\max_{t_j} \frac{P(t_j, w_i \mid t_{i-1})}{P(w_i \mid t_{i-1})} && \text{(the denominator is the same for all tags)} \\
    &= \arg\max_{t_j} P(t_j, w_i \mid t_{i-1}) \\
    &= \arg\max_{t_j} P(w_i \mid t_{i-1}, t_j)\, P(t_j \mid t_{i-1}) && \text{(the probability of a word only depends on its tag)} \\
    &= \arg\max_{t_j} P(w_i \mid t_j)\, P(t_j \mid t_{i-1})
     = \arg\max_{t_j} P(t_j \mid t_{i-1})\, P(w_i \mid t_j)
\end{aligned}
$$
HMM-based Tagging (cont.)
• Example: choose the best tag for a given word
  Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NN
  to/TO race/???   (pretend the previous word has already been tagged)
  – P(VB|TO) = 0.34,  P(race|VB) = 0.00003  →  P(VB|TO)·P(race|VB) ≈ 0.00001
  – P(NN|TO) = 0.021, P(race|NN) = 0.00041  →  P(NN|TO)·P(race|NN) ≈ 0.000007
  – So the tagger chooses VB for "race"
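The comparison can be checked with a couple of lines of Python. The probabilities are simply the values quoted above, hard-coded for illustration rather than estimated from a corpus.

```python
# Probabilities quoted in the example above (hard-coded, not corpus-estimated).
p_tag_given_to = {"VB": 0.34, "NN": 0.021}           # P(tag | TO)
p_race_given_tag = {"VB": 0.00003, "NN": 0.00041}    # P(race | tag)

scores = {t: p_tag_given_to[t] * p_race_given_tag[t] for t in ("VB", "NN")}
print(scores)                       # {'VB': ~1.02e-05, 'NN': ~8.61e-06}
print(max(scores, key=scores.get))  # VB
```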
HMM-based Tagging (cont.)
• Apply a bigram-HMM tagger to choose the best sequence of tags for a given sentence
• Assumptions: words are independent of each other, and a word's identity only depends on its tag

$$
\begin{aligned}
\hat{T} &= \arg\max_T P(T \mid W)
         = \arg\max_T \frac{P(T)\, P(W \mid T)}{P(W)}
         = \arg\max_T P(T)\, P(W \mid T) \\
        &= \arg\max_{t_1, t_2, \ldots, t_n} P(t_1, t_2, \ldots, t_n)\, P(w_1, w_2, \ldots, w_n \mid t_1, t_2, \ldots, t_n) \\
        &= \arg\max_{t_1, t_2, \ldots, t_n} \prod_{i=1}^{n} P(t_i \mid t_1, t_2, \ldots, t_{i-1})\, P(w_i \mid t_1, t_2, \ldots, t_n) \\
        &\approx \arg\max_{t_1, t_2, \ldots, t_n} \prod_{i=1}^{n} P(t_i \mid t_{i-m+1}, t_{i-m+2}, \ldots, t_{i-1})\, P(w_i \mid t_i)
\end{aligned}
$$

  (tag M-gram assumption; the probability of a word only depends on its tag)
HMM-based Tagging (cont.)
• The Viterbi algorithm for the bigram-HMM tagger
  – States: the distinct tags
  – Observations: the input words generated by each state
  [Figure: trellis of tag states $t_1, \ldots, t_J$ (with initial probabilities $\pi_1, \ldots, \pi_J$) against the word sequence $w_1, w_2, \ldots, w_n$; at each position the maximum-probability path into each state is kept]
HMM-based Tagging (cont.)
• The Viterbi algorithm for the bigram-HMM tagger

1. Initialization: $\delta_1(j) = \pi_j\, P(w_1 \mid t_j), \quad 1 \le j \le J, \qquad \pi_j = P(t_j)$

2. Induction:
   $\delta_i(j) = \left[\max_{1 \le k \le J} \delta_{i-1}(k)\, P(t_j \mid t_k)\right] P(w_i \mid t_j), \quad 2 \le i \le n,\ 1 \le j \le J$
   $\psi_i(j) = \arg\max_{1 \le k \le J} \delta_{i-1}(k)\, P(t_j \mid t_k)$

3. Termination: $X_n = \arg\max_{1 \le j \le J} \delta_n(j)$
   for $i := n-1$ down to $1$: $X_i = \psi_{i+1}(X_{i+1})$
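The procedure above translates directly into a short Python function. This is a minimal sketch: the transition, emission, and initial probabilities are passed in as plain dictionaries (here filled with made-up numbers around the to/race example, except for the values quoted earlier), and no smoothing or log-space arithmetic is used.

```python
def viterbi(words, tags, pi, trans, emit):
    """Bigram-HMM Viterbi decoding.
    pi[t]              ~ P(t) for the first position
    trans[(t_prev, t)] ~ P(t | t_prev)
    emit[(w, t)]       ~ P(w | t)
    Returns the most probable tag sequence for `words`.
    """
    n = len(words)
    # Initialization
    delta = [{t: pi[t] * emit.get((words[0], t), 0.0) for t in tags}]
    psi = [{}]
    # Induction
    for i in range(1, n):
        delta.append({})
        psi.append({})
        for t in tags:
            best_prev = max(tags, key=lambda k: delta[i-1][k] * trans.get((k, t), 0.0))
            delta[i][t] = (delta[i-1][best_prev] * trans.get((best_prev, t), 0.0)
                           * emit.get((words[i], t), 0.0))
            psi[i][t] = best_prev
    # Termination and backtracking
    best_last = max(tags, key=lambda t: delta[n-1][t])
    path = [best_last]
    for i in range(n - 1, 0, -1):
        path.insert(0, psi[i][path[0]])
    return path

# Toy model loosely based on the to/race example (values not quoted above are invented).
tags = ["TO", "VB", "NN"]
pi = {"TO": 0.5, "VB": 0.25, "NN": 0.25}
trans = {("TO", "VB"): 0.34, ("TO", "NN"): 0.021, ("VB", "NN"): 0.5, ("NN", "NN"): 0.3}
emit = {("to", "TO"): 1.0, ("race", "VB"): 0.00003, ("race", "NN"): 0.00041,
        ("tomorrow", "NN"): 0.1}
print(viterbi(["to", "race", "tomorrow"], tags, pi, trans, emit))  # ['TO', 'VB', 'NN']
```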