Lecture 8: Sequence labeling with discriminative models


  1. CS498JH: Introduction to NLP (Fall 2012) http://cs.illinois.edu/class/cs498jh Lecture 8: Sequence labeling with discriminative models Julia Hockenmaier juliahmr@illinois.edu 3324 Siebel Center Office Hours: Wednesday, 12:15-1:15pm

  2. Sequence labeling

  3. POS tagging Pierre Vinken , 61 years old , will join IBM ‘s board as a nonexecutive director Nov. 29 . Pierre_NNP Vinken_NNP ,_, 61_CD years_NNS old_JJ ,_, will_MD join_VB IBM_NNP ‘s_POS board_NN as_IN a_DT nonexecutive_JJ director_NN Nov._NNP 29_CD ._. Task: assign POS tags to words

  4. Noun phrase (NP) chunking Pierre Vinken , 61 years old , will join IBM ‘s board as a nonexecutive director Nov. 29 . [NP Pierre Vinken] , [NP 61 years] old , will join [NP IBM] ‘s [NP board] as [NP a nonexecutive director] [NP Nov. 29] . Task: identify all non-recursive NP chunks

  5. The BIO encoding We define three new tags: – B-NP : beginning of a noun phrase chunk – I-NP : inside of a noun phrase chunk – O : outside of a noun phrase chunk [NP Pierre Vinken] , [NP 61 years] old , will join [NP IBM] ‘s [NP board] as [NP a nonexecutive director] [NP Nov. 29] . Pierre_B-NP Vinken_I-NP ,_O 61_B-NP years_I-NP old_O ,_O will_O join_O IBM_B-NP ‘s_O board_B-NP as_O a_B-NP nonexecutive_I-NP director_I-NP Nov._B-NP 29_I-NP ._O
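
Once the chunk spans are known, producing the BIO tags is mechanical. A minimal sketch (the function name and the span format are illustrative, not part of the lecture) that converts labeled spans into BIO tags:

```python
# Minimal sketch: convert chunk spans to BIO tags (names are illustrative).

def chunks_to_bio(tokens, chunks):
    """tokens: list of words; chunks: list of (start, end, label) spans,
    end exclusive, e.g. (0, 2, "NP") for [NP Pierre Vinken]."""
    tags = ["O"] * len(tokens)
    for start, end, label in chunks:
        tags[start] = "B-" + label
        for i in range(start + 1, end):
            tags[i] = "I-" + label
    return tags

tokens = ["Pierre", "Vinken", ",", "61", "years", "old"]
chunks = [(0, 2, "NP"), (3, 5, "NP")]
print(list(zip(tokens, chunks_to_bio(tokens, chunks))))
# [('Pierre', 'B-NP'), ('Vinken', 'I-NP'), (',', 'O'),
#  ('61', 'B-NP'), ('years', 'I-NP'), ('old', 'O')]
```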

  6. Shallow parsing Pierre Vinken , 61 years old , will join IBM ‘s board as a nonexecutive director Nov. 29 . [NP Pierre Vinken] , [NP 61 years] old , [VP will join] [NP IBM] ‘s [NP board] [PP as] [NP a nonexecutive director] [NP Nov. 29] . Task: identify all non-recursive NP, verb (“VP”) and preposition (“PP”) chunks

  7. The BIO encoding for shallow parsing We define several new tags: – B-NP , B-VP , B-PP : beginning of an NP, “VP”, “PP” chunk – I-NP , I-VP , I-PP : inside of an NP, “VP”, “PP” chunk – O : outside of any chunk [NP Pierre Vinken] , [NP 61 years] old , [VP will join] [NP IBM] ‘s [NP board] [PP as] [NP a nonexecutive director] [NP Nov. 29] . Pierre_B-NP Vinken_I-NP ,_O 61_B-NP years_I-NP old_O ,_O will_B-VP join_I-VP IBM_B-NP ‘s_O board_B-NP as_B-PP a_B-NP nonexecutive_I-NP director_I-NP Nov._B-NP 29_I-NP ._O

  8. Named Entity Recognition Pierre Vinken , 61 years old , will join IBM ‘s board as a nonexecutive director Nov. 29 . [PERS Pierre Vinken] , 61 years old , will join [ORG IBM] ‘s board as a nonexecutive director [DATE Nov. 29] . Task: identify all mentions of named entities (people, organizations, locations, dates)

  9. The BIO encoding for NER We define many new tags: – B-PERS , B-DATE , …: beginning of a mention of a person/date... – I-PERS , I-DATE , …: inside of a mention of a person/date... – O : outside of any mention of a named entity [PERS Pierre Vinken] , 61 years old , will join [ORG IBM] ‘s board as a nonexecutive director [DATE Nov. 29] . Pierre_B-PERS Vinken_I-PERS ,_O 61_O years_O old_O ,_O will_O join_O IBM_B-ORG ‘s_O board_O as_O a_O nonexecutive_O director_O Nov._B-DATE 29_I-DATE ._O

  10. Many NLP tasks are sequence labeling tasks Input: a sequence of tokens/words: Pierre Vinken , 61 years old , will join IBM ‘s board as a nonexecutive director Nov. 29 . Output: a sequence of labeled tokens/words: POS-tagging: Pierre_NNP Vinken_NNP ,_, 61_CD years_NNS old_JJ ,_, will_MD join_VB IBM_NNP ‘s_POS board_NN as_IN a_DT nonexecutive_JJ director_NN Nov._NNP 29_CD ._. Named Entity Recognition: Pierre_B-PERS Vinken_I-PERS ,_O 61_O years_O old_O ,_O will_O join_O IBM_B-ORG ‘s_O board_O as_O a_O nonexecutive_O director_O Nov._B-DATE 29_I-DATE ._O

  11. Graphical models for sequence labeling

  12. Graphical models Graphical models are a notation for probability models. Nodes represent distributions over random variables: – a node X stands for P(X) Arrows represent dependencies: – an arrow Y → X stands for P(Y) P(X | Y) – arrows Y → X ← Z stand for P(Y) P(Z) P(X | Y, Z) Shaded nodes represent observed variables; white nodes represent hidden variables: – Y → X with Y hidden and X observed stands for P(Y) P(X | Y)

  13. HMMs as graphical models HMMs are generative models of the observed input string w. They ‘generate’ w (together with the hidden tags t) with P( t , w ) = ∏_i P(t_i | t_{i-1}) P(w_i | t_i) We know w , but need to find t (Diagram: a chain of hidden tags t_1 → t_2 → t_3 → t_4, each tag t_i emitting the observed word w_i.)
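
To make the factorization concrete, here is a minimal sketch of scoring a (word, tag) sequence under a bigram HMM; the transition and emission tables are toy values, not estimated from any corpus:

```python
# Minimal sketch of the HMM factorization P(t, w) = prod_i P(t_i | t_{i-1}) P(w_i | t_i).
import math

transition = {("<s>", "NNP"): 0.4, ("NNP", "NNP"): 0.3, ("NNP", "VBZ"): 0.2}  # toy values
emission = {("NNP", "Pierre"): 0.01, ("NNP", "Vinken"): 0.005}                # toy values

def hmm_log_joint(words, tags):
    """log P(t, w) under a bigram HMM; unseen events get probability 0."""
    logp = 0.0
    prev = "<s>"
    for word, tag in zip(words, tags):
        p_trans = transition.get((prev, tag), 0.0)
        p_emit = emission.get((tag, word), 0.0)
        if p_trans == 0.0 or p_emit == 0.0:
            return float("-inf")
        logp += math.log(p_trans) + math.log(p_emit)
        prev = tag
    return logp

print(hmm_log_joint(["Pierre", "Vinken"], ["NNP", "NNP"]))
```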

  14. Models for sequence labeling Sequence labeling: Given an input sequence w = w_1 ... w_n, predict the best (most likely) label sequence t = t_1 ... t_n: argmax_t P( t | w ) Generative models use Bayes’ Rule: argmax_t P( t | w ) = argmax_t P( t , w ) / P( w ) = argmax_t P( t , w ) = argmax_t P( t ) P( w | t ) Discriminative (conditional) models model P( t | w ) directly

  15. Advantages of discriminative models We’re usually not really interested in P( w | t ). – w is given. We don’t need to predict it! Why not model what we’re actually interested in: P( t | w )? Modeling P( w | t ) well is quite difficult: – Prefixes (capital letters) or suffixes are good predictors for certain classes of t (proper nouns, adverbs, …) – But these features may not be independent (e.g. they are overlapping) – These features may also help us deal with unknown words Modeling P( t | w ) should be easier: – Now we can incorporate arbitrary features of the word, because we don’t need to predict w anymore

  16. Maximum Entropy Markov Models MEMMs are conditional models of the labels t given the observed input string w. They model P( t | w ) = ∏_i P(t_i | w_i, t_{i-1}) [NB: We also use dynamic programming for learning and labeling] (Diagram: each tag t_i depends on the previous tag t_{i-1} and on the observed word w_i.)

  17. Probabilistic classification Classification: Predict a class (label) c for an input x Probabilistic classification: – Model the probability P( c | x ) P( c | x ) is a probability if 0 ≤ P(c_i | x ) ≤ 1 and ∑_i P(c_i | x ) = 1 – Predict the class that has the highest probability

  18. Representing features Define a set of feature functions f_i( x ) over the input: – Binary feature functions: f_first-letter-capitalized(Urbana) = 1, f_first-letter-capitalized(computer) = 0 – Integer (or real-valued) feature functions: f_number-of-vowels(Urbana) = 3 Because each class might care only about certain features (e.g. capitalization for proper nouns), redefine feature functions f_i( x , c) to take the class label into account: f_first-letter-capitalized(Urbana, NNP) = 1, f_first-letter-capitalized(Urbana, VB) = 0 => We turn each feature f_i on or off depending on c
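
A minimal sketch of such class-conditioned feature functions (the feature names mirror the slide; the code itself is illustrative, not from the lecture):

```python
# Minimal sketch of class-conditioned feature functions f_i(x, c).

def f_first_letter_capitalized(x, c):
    # Fires only when the word is capitalized AND the candidate class is NNP.
    return 1 if x[0].isupper() and c == "NNP" else 0

def f_number_of_vowels(x, c):
    # Integer-valued feature; here it fires for every class.
    return sum(1 for ch in x.lower() if ch in "aeiou")

print(f_first_letter_capitalized("Urbana", "NNP"))  # 1
print(f_first_letter_capitalized("Urbana", "VB"))   # 0
print(f_number_of_vowels("Urbana", "NNP"))          # 3
```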

  19. From features to probabilities – We also associate a real-valued weight w_i (λ_i) with each feature f_i – Now we have a score for predicting class c for input x : score( x , c) = ∑_i w_i f_i( x , c) – This score could be negative, so we exponentiate it: score( x , c) = exp(∑_i w_i f_i( x , c)) – We normalize this score to define a probability: P( c | x ) = exp(∑_i w_i f_i( x , c)) / Z, where Z = ∑_c′ exp(∑_i w_i f_i( x , c′)) – Learning = finding the best weights w_i
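
A minimal sketch of this exponentiate-and-normalize step, using toy feature functions and hand-picked weights (nothing here is learned):

```python
# Minimal sketch of P(c | x) = exp(sum_i w_i f_i(x, c)) / Z.
import math

# Toy feature functions, each taking (x, c); toy weights, not learned.
features = [
    lambda x, c: 1 if x[0].isupper() and c == "NNP" else 0,
    lambda x, c: sum(1 for ch in x.lower() if ch in "aeiou"),
]
weights = [1.5, 0.1]

def prob(x, c, classes):
    """P(c | x), normalized over all candidate classes."""
    def score(cls):
        return sum(w * f(x, cls) for w, f in zip(weights, features))
    z = sum(math.exp(score(cls)) for cls in classes)  # the normalizer Z
    return math.exp(score(c)) / z

classes = ["NNP", "VB", "NN"]
print(prob("Urbana", "NNP", classes))
```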

  20. Learning: finding w We use conditional maximum likelihood estimation (and standard convex optimization algorithms) to find w Conditional MLE: Find the w that assigns highest probability to all observed outputs c_i given the inputs x_i : ŵ = argmax_w ∏_i P(c_i | x_i, w ) = argmax_w ∑_i log P(c_i | x_i, w ) = argmax_w ∑_i log [ exp(∑_j w_j f_j(x_i, c_i)) / ∑_c′ exp(∑_j w_j f_j(x_i, c′)) ]
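
The gradient of the conditional log-likelihood with respect to w_j is the observed count of f_j minus its expected count under the current model, which makes plain batch gradient ascent a workable (if naive) optimizer. A minimal sketch on toy data (the features, data points, step size and iteration count are all illustrative):

```python
# Minimal sketch of conditional MLE by batch gradient ascent.
# d/dw_j log-likelihood = sum_i ( f_j(x_i, c_i) - sum_c P(c | x_i, w) f_j(x_i, c) ).
import math

classes = ["NNP", "VB"]
features = [
    lambda x, c: 1 if x[0].isupper() and c == "NNP" else 0,
    lambda x, c: 1 if x.endswith("ing") and c == "VB" else 0,
]
data = [("Urbana", "NNP"), ("running", "VB"), ("Pierre", "NNP")]  # toy training data

def probs(x, w):
    scores = {c: math.exp(sum(wj * f(x, c) for wj, f in zip(w, features)))
              for c in classes}
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}

w = [0.0, 0.0]                      # MaxEnt default: all weights start at zero
for _ in range(200):                # fixed step size, no regularization
    grad = [0.0] * len(w)
    for x, gold in data:
        p = probs(x, w)
        for j, f in enumerate(features):
            grad[j] += f(x, gold) - sum(p[c] * f(x, c) for c in classes)
    w = [wj + 0.5 * gj for wj, gj in zip(w, grad)]

print(w, probs("Urbana", w))
```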

  21. Some terminology We also refer to these models as exponential models because we exponentiate (exp(∑ w f( x , c))) the weights and features We also refer to them as loglinear models because the log probability is a linear function: log P( c | x , w ) = log [ exp(∑_j w_j f_j( x , c)) / Z ] = ∑_j w_j f_j( x , c) − log Z Statisticians refer to them as multinomial logistic regression models.

  22. Maximum Entropy Markov Models MEMMs use a MaxEnt classifier for each P(t_i | w_i, t_{i-1}): P(t_i | w_i, t_{i-1}) = exp(∑_j w_j f_j(w_i, t_{i-1}, t_i)) / Z = exp(∑_j w_j f_j(w_i, t_{i-1}, t_i)) / ∑_{t_k} exp(∑_j w_j f_j(w_i, t_{i-1}, t_k)) (Diagram: t_{i-1} and w_i both feed into t_i.)
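
A minimal sketch of such a local classifier, plus greedy left-to-right decoding (the lecture notes that dynamic programming, i.e. Viterbi, is used for labeling; greedy decoding just keeps the sketch short). The features and weights are toy values:

```python
# Minimal sketch of an MEMM local classifier P(t_i | w_i, t_{i-1}) with greedy decoding.
import math

tagset = ["NNP", "VB", "O"]
features = [
    lambda w_i, t_prev, t_i: 1 if w_i[0].isupper() and t_i == "NNP" else 0,
    lambda w_i, t_prev, t_i: 1 if t_prev == "NNP" and t_i == "NNP" else 0,
    lambda w_i, t_prev, t_i: 1 if w_i.islower() and t_i == "VB" else 0,
]
weights = [2.0, 1.0, 2.0]  # toy weights, not learned

def local_prob(t_i, w_i, t_prev):
    """P(t_i | w_i, t_{i-1}) = exp(sum_j w_j f_j(w_i, t_{i-1}, t_i)) / Z."""
    def score(t):
        return sum(wj * f(w_i, t_prev, t) for wj, f in zip(weights, features))
    z = sum(math.exp(score(t)) for t in tagset)
    return math.exp(score(t_i)) / z

def greedy_decode(words):
    tags, prev = [], "<s>"
    for w in words:
        prev = max(tagset, key=lambda t: local_prob(t, w, prev))
        tags.append(prev)
    return tags

print(greedy_decode(["Pierre", "Vinken", "sleeps"]))  # ['NNP', 'NNP', 'VB']
```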

  23. Terminology II: Maximum Entropy Entropy: Measures uncertainty. It is highest for uniform distributions: H(P) = − ∑_x P(x) log_2 P(x) and H(P(y | x)) = − ∑_y P(y | x) log_2 P(y | x) We also refer to these models as Maximum Entropy (MaxEnt) models because conditional MLE finds the most uniform distribution (subject to the constraints that the expected counts equal the observed counts in the training data). The default value for all weights w_i is zero.
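
A small numeric check of the entropy definition, showing that the uniform distribution has the highest entropy:

```python
# Minimal sketch: entropy of a distribution, maximal when the distribution is uniform.
import math

def entropy(p):
    """H(P) = -sum_x P(x) log2 P(x), skipping zero-probability outcomes."""
    return -sum(px * math.log2(px) for px in p if px > 0)

print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 (uniform: maximum for 4 outcomes)
print(entropy([0.7, 0.1, 0.1, 0.1]))      # ~1.36
print(entropy([1.0, 0.0, 0.0, 0.0]))      # 0.0 (no uncertainty)
```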

  24. Chain Conditional Random Fields Chain CRFs are also conditional models of the labels t given the observed input string w , but instead of one classifier for each P(t_i | w_i, t_{i-1}) they learn global distributions P( t | w ) (Diagram: the whole tag sequence t_1 t_2 t_3 t_4 conditioned on the observed words w_1 w_2 w_3 w_4.)
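
The key difference from an MEMM is that the normalizer of a chain CRF sums over all tag sequences for the whole sentence, which can be computed with the forward algorithm. A minimal sketch with a toy scoring function (a real CRF would use weighted feature functions and learned weights):

```python
# Minimal sketch of a chain CRF: P(t | w) proportional to exp(global score),
# normalized by a sum over ALL tag sequences via the forward algorithm.
import math

tagset = ["NNP", "O"]

def score(t_prev, t, words, i):
    # Toy local score; a real CRF would compute sum_j w_j f_j(t_prev, t, words, i).
    s = 0.0
    if words[i][0].isupper() and t == "NNP":
        s += 2.0
    if t_prev == "NNP" and t == "NNP":
        s += 1.0
    return s

def log_partition(words):
    """log Z(w): forward algorithm in log space over all tag sequences."""
    alpha = {t: score("<s>", t, words, 0) for t in tagset}
    for i in range(1, len(words)):
        alpha = {t: math.log(sum(math.exp(alpha[tp] + score(tp, t, words, i))
                                 for tp in tagset))
                 for t in tagset}
    return math.log(sum(math.exp(a) for a in alpha.values()))

def log_prob(tags, words):
    # Global score of one tag sequence, minus the log normalizer.
    total = score("<s>", tags[0], words, 0)
    total += sum(score(tags[i - 1], tags[i], words, i) for i in range(1, len(tags)))
    return total - log_partition(words)

words = ["Pierre", "Vinken", "sleeps"]
print(math.exp(log_prob(["NNP", "NNP", "O"], words)))
```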
