Lecture 4: Language Model Evaluation and Advanced Methods

  1. Lecture 4: Language Model Evaluation and Advanced Methods
     Kai-Wei Chang, CS @ University of Virginia, kw@kwchang.net
     Course webpage: http://kwchang.net/teaching/NLP16

  2. This lecture
     - Kneser-Ney smoothing
     - Discriminative language models
     - Neural language models
     - Evaluation: cross-entropy and perplexity

  3. Recap: Smoothing
     - Add-one smoothing
     - Add-μ smoothing
       - the parameter μ is tuned by cross-validation
     - Witten-Bell smoothing
       - T: # word types, N: # tokens
       - T/(N+T): total probability mass for unseen words
       - N/(N+T): total probability mass for observed tokens
     - Good-Turing
       - Reallocate the probability mass of n-grams that occur r+1 times to n-grams that occur r times.
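As a concrete reference for the recap, here is a minimal sketch of add-μ smoothing for bigram probabilities (μ = 1 gives add-one smoothing). The function name and toy counts are illustrative, not from the slides:

```python
from collections import Counter

def add_mu_bigram_prob(w_prev, w, bigram_counts, unigram_counts, vocab_size, mu=1.0):
    """Add-mu estimate of P(w | w_prev); mu = 1.0 gives add-one smoothing."""
    return (bigram_counts[(w_prev, w)] + mu) / (unigram_counts[w_prev] + mu * vocab_size)

# Toy usage with made-up counts.
tokens = "a red glasses a red shoes a red glasses".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
print(add_mu_bigram_prob("red", "glasses", bigrams, unigrams, vocab_size=len(unigrams)))
print(add_mu_bigram_prob("red", "abacus", bigrams, unigrams, vocab_size=len(unigrams)))
```

Note that the unseen bigram "red abacus" still receives a small, non-zero probability.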

  4. Recap: Back-off and interpolation
     - Idea: even if we've never seen "red glasses", we know it is more likely to occur than "red abacus"
     - Interpolation:
       p_average(z | x y) = μ_3 p(z | x y) + μ_2 p(z | y) + μ_1 p(z),
       where μ_3 + μ_2 + μ_1 = 1 and all are ≥ 0
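A small sketch of the interpolation formula above; the probabilities passed in are hypothetical MLE estimates, just to exercise the formula:

```python
def interpolate(p_trigram, p_bigram, p_unigram, mus=(0.6, 0.3, 0.1)):
    """p_average(z | x y) = mu3 * p(z | x y) + mu2 * p(z | y) + mu1 * p(z)."""
    mu3, mu2, mu1 = mus
    assert abs(mu3 + mu2 + mu1 - 1.0) < 1e-9, "interpolation weights must sum to 1"
    return mu3 * p_trigram + mu2 * p_bigram + mu1 * p_unigram

# The trigram estimate is 0 (never seen), but the bigram/unigram terms keep the result non-zero.
print(interpolate(p_trigram=0.0, p_bigram=0.02, p_unigram=0.001))
```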

  5. Absolute Discounting
     - Save ourselves some time and just subtract 0.75 (or some d)!
       P_AbsoluteDiscounting(w_i | w_{i-1}) = [c(w_{i-1}, w_i) - d] / c(w_{i-1}) + λ(w_{i-1}) P(w_i)
       (discounted bigram, plus an interpolation weight times the unigram)
     - But should we really just use the regular unigram P(w)?
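A minimal sketch of absolute discounting with interpolation against the plain unigram P(w); the helper name, the way λ is computed from the number of observed followers, and the toy data are my own choices, not taken from the slides:

```python
from collections import Counter

def absolute_discount_prob(w_prev, w, bigram_counts, unigram_counts, total_tokens, d=0.75):
    """Discounted bigram estimate interpolated with the regular unigram P(w)."""
    c_prev = unigram_counts[w_prev]
    if c_prev == 0:
        return unigram_counts[w] / total_tokens      # unseen context: back off to the unigram
    discounted = max(bigram_counts[(w_prev, w)] - d, 0) / c_prev
    num_followers = len({b for b in bigram_counts if b[0] == w_prev})
    lam = d * num_followers / c_prev                 # the probability mass we discounted
    return discounted + lam * (unigram_counts[w] / total_tokens)

tokens = "a red glasses a red shoes a blue glasses".split()
uni = Counter(tokens)
bi = Counter(zip(tokens, tokens[1:]))
print(absolute_discount_prob("red", "glasses", bi, uni, len(tokens)))
```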

  6. Kneser-Ney Smoothing
     - Better estimate for probabilities of lower-order unigrams!
     - Shannon game: "I can't see without my reading ___________?" (Francisco? glasses?)
     - "Francisco" is more common than "glasses"
     - ... but "Francisco" always follows "San"

  7. Kneser-Ney Smoothing
     - Instead of P(w): "How likely is w?"
     - P_continuation(w): "How likely is w to appear as a novel continuation?"
     - For each word, count the number of bigram types it completes
     - Every bigram type was a novel continuation the first time it was seen:
       P_continuation(w) ∝ |{w_{i-1} : c(w_{i-1}, w) > 0}|

  8. Kneser-Ney Smoothing
     - How many times does w appear as a novel continuation?
       P_continuation(w) ∝ |{w_{i-1} : c(w_{i-1}, w) > 0}|
     - Normalized by the total number of word bigram types:
       P_continuation(w) = |{w_{i-1} : c(w_{i-1}, w) > 0}| / |{(w_{j-1}, w_j) : c(w_{j-1}, w_j) > 0}|

  9. Kneser-Ney Smoothing
     - Alternative metaphor: the number of word types seen to precede w,
       |{w_{i-1} : c(w_{i-1}, w) > 0}|,
     - normalized by the number of word types preceding all words:
       P_continuation(w) = |{w_{i-1} : c(w_{i-1}, w) > 0}| / Σ_{w'} |{w'_{i-1} : c(w'_{i-1}, w') > 0}|
     - A frequent word (Francisco) occurring in only one context (San) will have a low continuation probability

  10. Kneser-Ney Smoothing
     P_KN(w_i | w_{i-1}) = max(c(w_{i-1}, w_i) - d, 0) / c(w_{i-1}) + λ(w_{i-1}) P_continuation(w_i)
     - λ is a normalizing constant: the probability mass we have discounted,
       λ(w_{i-1}) = [d / c(w_{i-1})] · |{w : c(w_{i-1}, w) > 0}|,
       i.e. the normalized discount times the number of word types that can follow w_{i-1}
       (= # of word types we discounted = # of times we applied the normalized discount)
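Putting slides 6-10 together, here is a minimal interpolated Kneser-Ney sketch for bigrams; the data structures, helper names, and toy sentence are mine, not the lecture's code:

```python
from collections import Counter

def kneser_ney_bigram_model(tokens, d=0.75):
    """Interpolated Kneser-Ney for bigrams: discounted bigram + lambda * P_continuation."""
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    num_bigram_types = len(bigram_counts)

    left_contexts = {}   # w -> set of distinct words seen immediately before w
    followers = {}       # w_prev -> set of distinct words seen immediately after w_prev
    for (w_prev, w) in bigram_counts:
        left_contexts.setdefault(w, set()).add(w_prev)
        followers.setdefault(w_prev, set()).add(w)

    def p_continuation(w):
        # |{w_prev : c(w_prev, w) > 0}| / |{(w', w'') : c(w', w'') > 0}|
        return len(left_contexts.get(w, ())) / num_bigram_types

    def p_kn(w, w_prev):
        c_prev = unigram_counts[w_prev]
        if c_prev == 0:
            return p_continuation(w)          # unseen context: fall back to continuation prob
        discounted = max(bigram_counts[(w_prev, w)] - d, 0) / c_prev
        lam = d * len(followers.get(w_prev, ())) / c_prev   # discounted mass to redistribute
        return discounted + lam * p_continuation(w)

    return p_kn

p_kn = kneser_ney_bigram_model("i can not see without my reading glasses".split())
print(p_kn("glasses", "reading"))     # high: "reading glasses" was observed
print(p_kn("francisco", "reading"))   # low: never observed, and no continuation mass
```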

  11. Kneser-Ney Smoothing: Recursive formulation
     P_KN(w_i | w_{i-n+1}^{i-1}) = max(c_KN(w_{i-n+1}^{i}) - d, 0) / c_KN(w_{i-n+1}^{i-1})
                                   + λ(w_{i-n+1}^{i-1}) P_KN(w_i | w_{i-n+2}^{i-1})
     - c_KN(•) = count(•) for the highest order; continuation count(•) for lower orders
     - Continuation count = number of unique single-word contexts for •

  12. Practical issue: Huge web-scale n-grams
     - How to deal with, e.g., the Google N-gram corpus?
     - Pruning
       - Only store n-grams with count > threshold
       - Remove singletons of higher-order n-grams

  13. Huge web-scale n-grams
     - Efficiency
       - Efficient data structures, e.g. tries (https://en.wikipedia.org/wiki/Trie)
       - Store words as indexes, not strings
       - Quantize probabilities (4-8 bits instead of an 8-byte float)
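One possible realization of the "trie plus word indexes" idea above; this is my own sketch of a count-storing trie, not the slides' implementation, and it omits quantization:

```python
class NGramTrieNode:
    __slots__ = ("children", "count")
    def __init__(self):
        self.children = {}   # word index -> child node
        self.count = 0

class NGramTrie:
    """Stores n-gram counts along trie paths; words are mapped to integer indexes."""
    def __init__(self):
        self.root = NGramTrieNode()
        self.vocab = {}      # word string -> integer index

    def _index(self, word):
        return self.vocab.setdefault(word, len(self.vocab))

    def add(self, ngram, count=1):
        node = self.root
        for word in ngram:
            node = node.children.setdefault(self._index(word), NGramTrieNode())
        node.count += count

    def count(self, ngram):
        node = self.root
        for word in ngram:
            idx = self.vocab.get(word)
            if idx is None or idx not in node.children:
                return 0
            node = node.children[idx]
        return node.count

trie = NGramTrie()
trie.add(("a", "red", "glasses"))
print(trie.count(("a", "red", "glasses")), trie.count(("a", "red", "abacus")))
```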

  14. Smoothing
     - This dark art is why NLP is taught in the engineering school.
     - There are more principled smoothing methods, too. We'll look next at log-linear models, which are a good and popular general technique.
     (600.465 - Intro to NLP - J. Eisner)

  15. Conditional Modeling
     - Generative language model (trigram model):
       P(x_1, ..., x_n) = P(x_1) P(x_2 | x_1) ... P(x_n | x_{n-2}, x_{n-1})
     - Then we compute the conditional probabilities by maximum likelihood estimation
     - Can we model P(x_i | x_{i-2}, x_{i-1}) directly?
     - Given a context x, which outcomes y are likely in that context?
       P(NextWord = y | PrecedingWords = x)
     (600.465 - Intro to NLP - J. Eisner)

  16. Modeling conditional probabilities
     - Let's assume
       P(y | x) = exp(score(x, y)) / Σ_{y'} exp(score(x, y'))
       (y: NextWord, x: PrecedingWords)
     - P(y | x) is high ⇔ score(x, y) is high
     - This is called the soft-max
     - It requires that P(y | x) ≥ 0 and Σ_y P(y | x) = 1; neither is true of score(x, y) itself
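A small soft-max sketch matching the formula above. The scores are made up, and the max-subtraction is a standard numerical-stability trick that the slide does not mention:

```python
import math

def softmax_prob(scores, y):
    """P(y | x) = exp(score(x, y)) / sum over y' of exp(score(x, y')).
    `scores` maps each candidate next word y' to score(x, y') for a fixed context x."""
    m = max(scores.values())                      # shift by the max for numerical stability
    z = sum(math.exp(s - m) for s in scores.values())
    return math.exp(scores[y] - m) / z

scores = {"glasses": 2.0, "Francisco": 0.5, "abacus": -1.0}   # hypothetical scores
print(softmax_prob(scores, "glasses"))
```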

  17. Linear Scoring
     - Score(x, y): how well does y go with x?
     - Simplest option: a linear function of (x, y). But (x, y) isn't a number ⇒ describe it by some numbers (i.e. numeric features)
     - Then just use a linear function of those numbers:
       Score(x, y) = Σ_k θ_k f_k(x, y)
       - θ_k: weight of the k-th feature, to be learned
       - f_k(x, y): whether (x, y) has feature k (0 or 1), or how many times it fires (≥ 0), or how strongly it fires (a real number)
       - k ranges over all features

  18. What features should we use?
     - Model p(x_i | x_{i-1}, x_{i-2}): the features f_k("x_{i-1}, x_{i-2}", "x_i") for Score("x_{i-1}, x_{i-2}", "x_i") can be:
       - # of times "x_{i-1}" appears in the training corpus
       - 1 if "x_i" is an unseen word; 0 otherwise
       - 1 if the context "x_{i-2} x_{i-1}" = "a red"; 0 otherwise
       - 1 if "x_{i-2}" belongs to the "color" category; 0 otherwise
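A hypothetical feature extractor and linear scorer mirroring the bullets above, with context x = (x_{i-2}, x_{i-1}) and candidate next word y; the feature names, weights, and counts are mine, chosen only to illustrate Score(x, y) = Σ_k θ_k f_k(x, y):

```python
COLOR_WORDS = {"red", "yellow", "green", "blue"}   # assumed "color" category

def extract_features(x, y, train_counts):
    w_prev2, w_prev1 = x
    return {
        "count_prev1": float(train_counts.get(w_prev1, 0)),   # how often x_{i-1} was seen
        "y_unseen": 1.0 if y not in train_counts else 0.0,    # indicator: candidate is unseen
        "ctx_is_a_red": 1.0 if (w_prev2, w_prev1) == ("a", "red") else 0.0,
        "prev_is_color": 1.0 if w_prev1 in COLOR_WORDS else 0.0,
    }

def linear_score(x, y, theta, train_counts):
    """Score(x, y) = sum_k theta_k * f_k(x, y)."""
    return sum(theta.get(k, 0.0) * v for k, v in extract_features(x, y, train_counts).items())

theta = {"count_prev1": 0.01, "ctx_is_a_red": 1.5, "prev_is_color": 0.8, "y_unseen": -2.0}
counts = {"a": 10, "red": 4, "glasses": 3}
print(linear_score(("a", "red"), "glasses", theta, counts))
```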

  19. What features should we use?
     - Model p("glasses" | "a red"): the features f_k("red", "a", "glasses") for Score("red", "a", "glasses") can be:
       - # of times "red" appears in the training corpus
       - 1 if "a" is an unseen word; 0 otherwise
       - 1 if "a red" = "a red"; 0 otherwise
       - 1 if "red" belongs to the "color" category; 0 otherwise

  20. Log-Linear Conditional Probability
     - Unnormalized probability (at least it's positive!):
       u(y | x) = exp(Σ_k θ_k f_k(x, y))
     - where we choose Z(x), the partition function, to ensure the probabilities sum to 1:
       Z(x) = Σ_y exp(Σ_k θ_k f_k(x, y))
     - thus
       p(y | x) = exp(Σ_k θ_k f_k(x, y)) / Z(x)
     (600.465 - Intro to NLP - J. Eisner)
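A sketch of the normalized log-linear probability, with Z(x) computed by summing over an explicit candidate set; the `features` helper and all numbers are hypothetical:

```python
import math

def log_linear_prob(x, y, candidates, theta, features):
    """p(y | x) = exp(sum_k theta_k f_k(x, y)) / Z(x), with Z(x) summed over `candidates`."""
    def score(yy):
        return sum(theta.get(k, 0.0) * v for k, v in features(x, yy).items())
    scores = {yy: score(yy) for yy in candidates}
    m = max(scores.values())
    z = sum(math.exp(s - m) for s in scores.values())   # partition function (max-shifted)
    return math.exp(scores[y] - m) / z

# Toy usage with one indicator feature per (context, word) pair.
features = lambda x, y: {f"{x}->{y}": 1.0}
theta = {"red->glasses": 2.0, "red->shoes": 1.0, "red->abacus": -1.0}
print(log_linear_prob("red", "glasses", ["glasses", "shoes", "abacus"], theta, features))
```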

  21. "Discriminative training": train θ to learn to predict y from x, i.e. maximize p(y | x). Whereas in "generative models" we learn to model x too, by maximizing p(x, y).
     - n training examples
     - feature functions f_1, f_2, ...
     - Want to maximize p(training data | θ) = Π_i p(y_i | x_i, θ)
     - Easier to maximize the log of that: Σ_i log p(y_i | x_i, θ)
     - Alas, some weights θ_i may be optimal at −∞ or +∞. When would this happen? What's going "wrong"?

  22. Generalization via Regularization
     - n training examples
     - feature functions f_1, f_2, ...
     - Want to maximize p(training data | θ) · p_prior(θ)
     - Easier to maximize the log of that
     - Encourages weights close to 0:
       p_prior(θ) ∝ exp(−θ_k² / 2σ²) for each weight θ_k
     - "L2 regularization": corresponds to a Gaussian prior
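A sketch of the L2-regularized objective (conditional log-likelihood plus the log of the Gaussian prior, up to a constant); the helper names and toy setup are mine:

```python
import math

def regularized_log_likelihood(data, theta, features, candidates, sigma2=1.0):
    """sum_i log p(y_i | x_i; theta)  -  ||theta||^2 / (2 sigma^2)."""
    def log_prob(x, y):
        scores = {yy: sum(theta.get(k, 0.0) * v for k, v in features(x, yy).items())
                  for yy in candidates(x)}
        m = max(scores.values())
        log_z = m + math.log(sum(math.exp(s - m) for s in scores.values()))
        return scores[y] - log_z
    ll = sum(log_prob(x, y) for x, y in data)
    penalty = sum(w * w for w in theta.values()) / (2.0 * sigma2)   # L2 / Gaussian prior term
    return ll - penalty

# Toy usage with one indicator feature per (context, word) pair.
features = lambda x, y: {f"{x}->{y}": 1.0}
candidates = lambda x: ["glasses", "shoes", "abacus"]
print(regularized_log_likelihood([("red", "glasses")], {"red->glasses": 1.0}, features, candidates))
```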

  23. Gradient-based training
     - Gradually adjust θ in a direction that improves the objective f(θ)
     - Gradient ascent to gradually increase f(θ):
       while (∇f(θ) ≠ 0)      // not at a local max or min
           θ = θ + ε ∇f(θ)    // for some small ε > 0
     - Remember: ∇f(θ) = (∂f(θ)/∂θ_1, ∂f(θ)/∂θ_2, ...),
       so the update means: θ_k += ε ∂f(θ)/∂θ_k

  24. Gradient-based training
     - Gradually adjust θ in a direction that improves the objective
     - Gradient w.r.t. θ of the conditional log-likelihood:
       ∂/∂θ_k Σ_i log p(y_i | x_i) = Σ_i [ f_k(x_i, y_i) − Σ_y p(y | x_i) f_k(x_i, y) ]
       (observed feature counts minus expected feature counts)
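A sketch combining the two slides above: the gradient (observed minus expected feature counts) and one gradient-ascent update. The toy features, data, and learning rate are illustrative, and the regularization term is omitted for brevity:

```python
import math

def gradient_step(data, theta, features, candidates, lr=0.1):
    """One gradient-ascent step on the conditional log-likelihood:
    grad_k = sum_i [ f_k(x_i, y_i) - sum_y p(y | x_i) f_k(x_i, y) ]."""
    grad = {}
    for x, y_gold in data:
        ys = candidates(x)
        scores = {y: sum(theta.get(k, 0.0) * v for k, v in features(x, y).items()) for y in ys}
        m = max(scores.values())
        z = sum(math.exp(s - m) for s in scores.values())
        probs = {y: math.exp(scores[y] - m) / z for y in ys}
        for k, v in features(x, y_gold).items():          # observed feature counts
            grad[k] = grad.get(k, 0.0) + v
        for y in ys:                                       # minus expected feature counts
            for k, v in features(x, y).items():
                grad[k] = grad.get(k, 0.0) - probs[y] * v
    for k, g in grad.items():                              # theta_k += lr * grad_k
        theta[k] = theta.get(k, 0.0) + lr * g
    return theta

# Toy usage: the learned weights come to prefer "glasses" after "red".
features = lambda x, y: {f"{x}->{y}": 1.0}
candidates = lambda x: ["glasses", "shoes", "abacus"]
theta = {}
data = [("red", "glasses"), ("red", "shoes"), ("red", "glasses")]
for _ in range(50):
    theta = gradient_step(data, theta, features, candidates)
print(sorted(theta.items(), key=lambda kv: -kv[1])[:2])
```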

  25. More complex assumption?
     - P(y | x) = exp(score(x, y)) / Σ_{y'} exp(score(x, y'))
       (y: NextWord, x: PrecedingWords)
     - Assume we saw:
       red glasses; yellow glasses; green glasses; blue glasses
       red shoes; yellow shoes; green shoes;
       What is P(shoes | blue)?
     - Can we learn categories of words (representations) automatically?
     - Can we build a high-order n-gram model without blowing up the model size?

  26. Neural language model
     - Model P(y | x) with a neural network
     - Example 1: one-hot vector: each component of the vector represents one word, e.g. [0, 0, 1, 0, 0]
     - Example 2: word embeddings
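A toy illustration of the two input representations and a soft-max output layer over a tiny vocabulary. The vocabulary, dimensions, and weights are random placeholders rather than trained values; a real neural language model would learn the embeddings and output weights:

```python
import math, random

vocab = ["a", "red", "glasses", "shoes", "blue"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Example 1: one component per vocabulary word."""
    v = [0.0] * len(vocab)
    v[word_to_idx[word]] = 1.0
    return v

# Example 2: a word-embedding table (rows would normally be learned, not random).
random.seed(0)
dim = 3
embeddings = [[random.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]

def embed(word):
    return embeddings[word_to_idx[word]]

def next_word_probs(prev_word, output_weights):
    """score(y) = w_y . embed(prev_word); P(y | prev_word) = soft-max over the scores."""
    e = embed(prev_word)
    scores = [sum(wi * ei for wi, ei in zip(row, e)) for row in output_weights]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return {w: exps[i] / z for i, w in enumerate(vocab)}

output_weights = [[random.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]
print(one_hot("red"))
print(next_word_probs("red", output_weights))
```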
