Language Technology
Language Processing with Perl and Prolog
Chapter 8: Part-of-Speech Tagging Using Stochastic Techniques
Pierre Nugues, Lund University
Pierre.Nugues@cs.lth.se
http://cs.lth.se/pierre_nugues/
POS Annotation with Statistical Methods

Modeling the problem: $t_1, t_2, t_3, \ldots, t_n \rightarrow$ noisy channel $\rightarrow w_1, w_2, w_3, \ldots, w_n$.

The optimal part-of-speech sequence is

$$\hat{T} = \underset{t_1, t_2, t_3, \ldots, t_n}{\arg\max}\, P(t_1, t_2, t_3, \ldots, t_n \mid w_1, w_2, w_3, \ldots, w_n).$$

Bayes' rule on conditional probabilities, $P(A \mid B)P(B) = P(B \mid A)P(A)$, yields

$$\hat{T} = \underset{T}{\arg\max}\, P(T)\,P(W \mid T).$$

$P(T)$ and $P(W \mid T)$ are simplified and estimated on hand-annotated corpora, the "gold standard".
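The step from the conditional probability to the product $P(T)\,P(W \mid T)$ is worth spelling out; it is a standard Bayes manipulation, written out here for clarity: the denominator $P(W)$ does not depend on $T$ and can be dropped from the maximization.

$$\hat{T} = \underset{T}{\arg\max}\, P(T \mid W) = \underset{T}{\arg\max}\, \frac{P(T)\,P(W \mid T)}{P(W)} = \underset{T}{\arg\max}\, P(T)\,P(W \mid T).$$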
The First Term: N-Gram Approximation

$$P(T) = P(t_1, t_2, t_3, \ldots, t_n) \approx P(t_1)\,P(t_2 \mid t_1) \prod_{i=3}^{n} P(t_i \mid t_{i-2}, t_{i-1}).$$

If we use a start-of-sentence delimiter <s>, the first two terms of the product, $P(t_1)P(t_2 \mid t_1)$, are rewritten as $P(\texttt{<s>})\,P(t_1 \mid \texttt{<s>})\,P(t_2 \mid \texttt{<s>}, t_1)$, where $P(\texttt{<s>}) = 1$.

We estimate the probabilities with the maximum likelihood estimate, $P_{MLE}$:

$$P_{MLE}(t_i \mid t_{i-2}, t_{i-1}) = \frac{C(t_{i-2}, t_{i-1}, t_i)}{C(t_{i-2}, t_{i-1})}.$$
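A minimal sketch of this estimation in Perl, assuming the annotated corpus is already available as tag sequences; the toy data and variable names below are illustrative, not from the book:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Toy tag sequences standing in for a hand-annotated corpus,
# padded with two start-of-sentence delimiters.
my @sentences = (
    [qw(<s> <s> pro art verb)],
    [qw(<s> <s> pro pro noun)],
);

my (%trigram_count, %bigram_count);
for my $tags (@sentences) {
    for my $i (2 .. $#$tags) {
        $trigram_count{"$tags->[$i-2] $tags->[$i-1] $tags->[$i]"}++;
        $bigram_count{"$tags->[$i-2] $tags->[$i-1]"}++;
    }
}

# P_MLE(t_i | t_{i-2}, t_{i-1}) = C(t_{i-2}, t_{i-1}, t_i) / C(t_{i-2}, t_{i-1})
sub p_mle_trigram {
    my ($t2, $t1, $t) = @_;
    my $context = $bigram_count{"$t2 $t1"} or return 0;
    return ($trigram_count{"$t2 $t1 $t"} // 0) / $context;
}

printf "P(art | <s>, pro) = %.2f\n", p_mle_trigram('<s>', 'pro', 'art');   # 1/2 = 0.50
```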
Sparse Data

If $N_p$ is the number of different part-of-speech tags, there are $N_p \times N_p \times N_p$ values to estimate.

If data is missing, we can back off to bigrams:

$$P(T) = P(t_1, t_2, t_3, \ldots, t_n) \approx P(t_1) \prod_{i=2}^{n} P(t_i \mid t_{i-1}).$$

Or to unigrams:

$$P(T) = P(t_1, t_2, t_3, \ldots, t_n) \approx \prod_{i=1}^{n} P(t_i).$$

And finally, we can combine these approximations linearly:

$$P_{LinearInter}(t_i \mid t_{i-2}, t_{i-1}) = \lambda_1 P(t_i \mid t_{i-2}, t_{i-1}) + \lambda_2 P(t_i \mid t_{i-1}) + \lambda_3 P(t_i),$$

with $\lambda_1 + \lambda_2 + \lambda_3 = 1$, for example, $\lambda_1 = 0.6$, $\lambda_2 = 0.3$, $\lambda_3 = 0.1$.
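A small sketch of the linear interpolation, with hard-coded toy estimates in place of the corpus counts; the values and names are illustrative only:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Illustrative MLE estimates, hard-coded for brevity; in practice they are
# computed from corpus counts as on the previous slide.
my %p_tri = ('pro art verb' => 0.7);    # P(verb | pro, art)
my %p_bi  = ('art verb'     => 0.5);    # P(verb | art)
my %p_uni = ('verb'         => 0.2);    # P(verb)

# Interpolation weights, lambda1 + lambda2 + lambda3 = 1.
my ($l1, $l2, $l3) = (0.6, 0.3, 0.1);

sub p_interpolated {
    my ($t2, $t1, $t) = @_;             # t_{i-2}, t_{i-1}, t_i
    return $l1 * ($p_tri{"$t2 $t1 $t"} // 0)
         + $l2 * ($p_bi{"$t1 $t"}      // 0)
         + $l3 * ($p_uni{$t}           // 0);
}

# 0.6 * 0.7 + 0.3 * 0.5 + 0.1 * 0.2 = 0.59
printf "P_LinearInter(verb | pro, art) = %.2f\n",
    p_interpolated('pro', 'art', 'verb');
```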
The Second Term

The probability of the word sequence given the part-of-speech sequence is usually approximated as:

$$P(W \mid T) = P(w_1, w_2, w_3, \ldots, w_n \mid t_1, t_2, t_3, \ldots, t_n) \approx \prod_{i=1}^{n} P(w_i \mid t_i).$$

Like the previous probabilities, $P(w_i \mid t_i)$ is estimated from hand-annotated corpora using the maximum likelihood:

$$P_{MLE}(w_i \mid t_i) = \frac{C(w_i, t_i)}{C(t_i)}.$$

For $N_w$ different words, there are $N_p \times N_w$ values to obtain. But in this case, many of the estimates will be 0.
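The same kind of maximum-likelihood estimation works for the lexical probabilities; a brief sketch with a toy word/tag corpus (illustrative data only):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Toy hand-annotated corpus as word/tag pairs (illustrative only).
my @corpus = (
    [je => 'pro'], [le => 'art'], [donne => 'verb'],
    [le => 'pro'], [donne => 'noun'],
);

my (%pair_count, %tag_count);
for my $pair (@corpus) {
    my ($word, $tag) = @$pair;
    $pair_count{"$word $tag"}++;
    $tag_count{$tag}++;
}

# P_MLE(w_i | t_i) = C(w_i, t_i) / C(t_i)
sub p_word_given_tag {
    my ($word, $tag) = @_;
    return 0 unless $tag_count{$tag};
    return ($pair_count{"$word $tag"} // 0) / $tag_count{$tag};
}

printf "P(le | pro) = %.2f\n", p_word_given_tag('le', 'pro');   # 1/2 = 0.50
```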
An Example

Je le donne 'I give it'

Possible tags: je/PRO; le/ART or le/PRO; donne/VERB or donne/NOUN.

The four candidate tag sequences and their scores (∅ denotes the empty start-of-sentence context):

1. P(pro | ∅) × P(art | ∅, pro) × P(verb | pro, art) × P(je | pro) × P(le | art) × P(donne | verb)
2. P(pro | ∅) × P(art | ∅, pro) × P(noun | pro, art) × P(je | pro) × P(le | art) × P(donne | noun)
3. P(pro | ∅) × P(pro | ∅, pro) × P(verb | pro, pro) × P(je | pro) × P(le | pro) × P(donne | verb)
4. P(pro | ∅) × P(pro | ∅, pro) × P(noun | pro, pro) × P(je | pro) × P(le | pro) × P(donne | noun)
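With a handful of invented probabilities, the four sequences can be scored by brute force; the Viterbi algorithm on the following slides avoids this exhaustive enumeration. The numbers below are made up for illustration, and START stands for the empty context:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Invented probabilities for illustration.
my %p_tag = (                     # trigram tag probabilities P(t_i | t_{i-2}, t_{i-1})
    'START START pro' => 0.5,
    'START pro art'   => 0.4, 'START pro pro' => 0.1,
    'pro art verb'    => 0.6, 'pro art noun'  => 0.2,
    'pro pro verb'    => 0.3, 'pro pro noun'  => 0.3,
);
my %p_word = (                    # lexical probabilities P(w_i | t_i)
    'je pro'     => 0.1,
    'le art'     => 0.4,  'le pro'     => 0.05,
    'donne verb' => 0.05, 'donne noun' => 0.01,
);

my @words = qw(je le donne);
my @candidates = ([qw(pro art verb)], [qw(pro art noun)],
                  [qw(pro pro verb)], [qw(pro pro noun)]);

my ($best_score, $best_tags) = (0, undef);
for my $tags (@candidates) {
    my @context = ('START', 'START', @$tags);
    my $score = 1;
    for my $i (0 .. $#words) {
        $score *= ($p_tag{"$context[$i] $context[$i+1] $context[$i+2]"} // 0)
                * ($p_word{"$words[$i] $tags->[$i]"} // 0);
    }
    printf "%-14s %.2e\n", "@$tags", $score;
    ($best_score, $best_tags) = ($score, $tags) if $score > $best_score;
}
print "Best sequence: @$best_tags\n";       # pro art verb with these toy numbers
```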
Viterbi (Informal)

Je le donne demain dans la matinée 'I give it tomorrow in the morning'

The word lattice, with the possible tags for each word:

je: PRO
le: ART or PRO
donne: VERB or NOUN
demain: ADV
dans: PREP
la: ART or PRO
matinée: NOUN
Viterbi (Informal)

The term contributed by the word demain still retains the memory of the ambiguity of donne: P(adv | verb) × P(demain | adv) and P(adv | noun) × P(demain | adv).

This is no longer the case with dans. According to the noisy channel model and the bigram assumption, the term contributed by the word dans is P(dans | prep) × P(prep | adv).

It does not show the ambiguity of le and donne, and the subsequent terms will ignore it as well. We can therefore keep only the best partial path leading to demain/ADV and discard the others: the optimal path does not contain nonoptimal subpaths.
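A sketch of the Viterbi decoder for a bigram HMM, the simplest setting in which the argument above applies. The probabilities are invented toy values, and the tag set and data structures are illustrative, not the book's implementation:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Toy bigram transition and emission probabilities (invented for illustration).
my %p_trans = (        # P(t_i | t_{i-1})
    'START pro' => 0.5,
    'pro art'   => 0.4, 'pro pro'  => 0.1, 'pro verb' => 0.3, 'pro noun' => 0.2,
    'art verb'  => 0.5, 'art noun' => 0.3,
    'verb adv'  => 0.2, 'noun adv' => 0.1,
    'adv prep'  => 0.3,
    'prep art'  => 0.5, 'prep pro' => 0.1,
);
my %p_emit = (         # P(w_i | t_i)
    'je pro'     => 0.1,
    'le art'     => 0.4,  'le pro'     => 0.05,
    'donne verb' => 0.05, 'donne noun' => 0.01,
    'demain adv' => 0.1,
    'dans prep'  => 0.3,
    'la art'     => 0.4,  'la pro'     => 0.02,
    'matinée noun' => 0.01,
);
my @tagset = qw(pro art verb noun adv prep);

sub viterbi {
    my @words = @_;
    # $delta[$i]{$tag} = best score of a path ending with $tag at word $i
    # $back[$i]{$tag}  = previous tag on that best path
    my (@delta, @back);
    for my $i (0 .. $#words) {
        for my $tag (@tagset) {
            my $emit = $p_emit{"$words[$i] $tag"} // 0;
            next unless $emit;
            if ($i == 0) {
                $delta[0]{$tag} = ($p_trans{"START $tag"} // 0) * $emit;
                next;
            }
            for my $prev (keys %{ $delta[$i-1] }) {
                my $score = $delta[$i-1]{$prev}
                          * ($p_trans{"$prev $tag"} // 0) * $emit;
                if (!defined $delta[$i]{$tag} || $score > $delta[$i]{$tag}) {
                    $delta[$i]{$tag} = $score;
                    $back[$i]{$tag}  = $prev;
                }
            }
        }
    }
    # Follow the back pointers from the best final tag.
    my ($best) = sort { $delta[-1]{$b} <=> $delta[-1]{$a} } keys %{ $delta[-1] };
    my @tags = ($best);
    for (my $i = $#words; $i > 0; $i--) {
        unshift @tags, $back[$i]{ $tags[0] };
    }
    return @tags;
}

my @sentence = qw(je le donne demain dans la matinée);
print join(' ', viterbi(@sentence)), "\n";   # expected: pro art verb adv prep art noun
```

Only the best-scoring path into each tag survives at every position (the @delta and @back tables), which is exactly the pruning described above.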
Supervised Learning: A Summary

- Needs a manually annotated corpus called the Gold Standard
- The Gold Standard may contain errors (errare humanum est) that we ignore
- A classifier is trained on one part of the corpus, the training set, and evaluated on another part, the test set, where the automatic annotation is compared with the Gold Standard
- N-fold cross-validation is used to avoid the influence of a particular division (see the sketch below)
- Some algorithms may require additional optimization on a development set
- Classifiers can use statistical or symbolic methods
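A minimal sketch of the N-fold cross-validation loop; train_tagger() and accuracy() are hypothetical placeholders, stubbed out below, standing in for an actual tagger and its evaluation:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $n_folds   = 10;
my @sentences = map { "sentence_$_" } 1 .. 100;   # stand-ins for annotated sentences

sub train_tagger { return {} }       # hypothetical placeholder: returns a dummy model
sub accuracy     { return 0.95 }     # hypothetical placeholder: returns a fixed score

my $total = 0;
for my $fold (0 .. $n_folds - 1) {
    my (@train, @test);
    for my $i (0 .. $#sentences) {
        # Every N-th sentence goes to the held-out test fold.
        if ($i % $n_folds == $fold) { push @test,  $sentences[$i] }
        else                        { push @train, $sentences[$i] }
    }
    my $model = train_tagger(\@train);
    $total += accuracy($model, \@test);
}
printf "Mean accuracy over %d folds: %.3f\n", $n_folds, $total / $n_folds;
```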