Lecture 4: Language Model Evaluation and Advanced Methods
Kai-Wei Chang, CS @ University of Virginia, kw@kwchang.net
Course webpage: http://kwchang.net/teaching/NLP16
This lecture
- Kneser-Ney smoothing
- Discriminative language models
- Neural language models
- Evaluation: cross-entropy and perplexity
Recap: Smoothing
- Add-one smoothing
- Add-λ smoothing
  - parameter λ tuned by cross-validation
- Witten-Bell smoothing
  - T: # word types, N: # tokens
  - T/(N+T): total prob. mass for unseen words
  - N/(N+T): total prob. mass for observed tokens
- Good-Turing
  - Reallocate the probability mass of n-grams that occur r+1 times to n-grams that occur r times.
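To make the recap concrete, here is a minimal sketch (not from the slides) of add-λ smoothing for a bigram model; the toy corpus and the value λ = 0.5 are made-up assumptions for illustration.

```python
from collections import Counter

def addlambda_bigram_prob(w_prev, w, bigram_counts, unigram_counts, vocab_size, lam=0.5):
    """P(w | w_prev) with add-lambda smoothing: (c(w_prev, w) + lam) / (c(w_prev) + lam * V)."""
    # Using the unigram count of w_prev as the context count is fine for this toy example.
    return (bigram_counts[(w_prev, w)] + lam) / (unigram_counts[w_prev] + lam * vocab_size)

# Toy corpus (made up for illustration).
tokens = "the red glasses the red shoes the blue shoes".split()
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
V = len(unigram_counts)

print(addlambda_bigram_prob("red", "glasses", bigram_counts, unigram_counts, V))
print(addlambda_bigram_prob("red", "abacus", bigram_counts, unigram_counts, V))  # unseen, but > 0
```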
Recap: Back-off and interpolation
- Idea: even if we've never seen "red glasses", we know it is more likely to occur than "red abacus"
- Interpolation:
  p_average(z | xy) = μ_3 p(z | xy) + μ_2 p(z | y) + μ_1 p(z),
  where μ_3 + μ_2 + μ_1 = 1 and all are ≥ 0
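A minimal sketch (not from the slides) of the interpolation formula above; the weights 0.6/0.3/0.1 are an illustrative assumption, in practice they are tuned on held-out data.

```python
def interpolated_prob(p_trigram, p_bigram, p_unigram, mus=(0.6, 0.3, 0.1)):
    """p_avg(z | x y) = mu3 * p(z | x y) + mu2 * p(z | y) + mu1 * p(z), with weights summing to 1."""
    mu3, mu2, mu1 = mus
    assert abs(mu3 + mu2 + mu1 - 1.0) < 1e-9 and min(mus) >= 0
    return mu3 * p_trigram + mu2 * p_bigram + mu1 * p_unigram

# Combine maximum-likelihood estimates of different orders (toy numbers):
print(interpolated_prob(p_trigram=0.0, p_bigram=0.05, p_unigram=0.001))
```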
Absolute Discounting
- Save ourselves some time and just subtract 0.75 (or some d)!
  P_AbsoluteDiscounting(w_i | w_{i-1}) = (c(w_{i-1}, w_i) - d) / c(w_{i-1}) + λ(w_{i-1}) P(w_i)
  (discounted bigram, plus interpolation weight times unigram)
- But should we really just use the regular unigram P(w)?
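A minimal sketch (not from the slides) of the absolutely discounted bigram estimate above, interpolated with a plain unigram; d = 0.75 follows the slide's suggestion, while the toy corpus and everything else are illustrative assumptions.

```python
from collections import Counter

def absolute_discounting_prob(w_prev, w, bigram_counts, context_counts, p_unigram, d=0.75):
    """P(w | w_prev) = max(c(w_prev, w) - d, 0) / c(w_prev) + lambda(w_prev) * P_unigram(w)."""
    c_context = context_counts[w_prev]
    discounted = max(bigram_counts[(w_prev, w)] - d, 0.0) / c_context
    # lambda(w_prev): the probability mass removed by discounting, spread over the unigram
    n_followers = sum(1 for (u, _) in bigram_counts if u == w_prev)
    lam = d * n_followers / c_context
    return discounted + lam * p_unigram(w)

tokens = "san francisco is foggy but my reading glasses are clear".split()
bigrams = Counter(zip(tokens, tokens[1:]))
contexts = Counter(u for (u, _) in bigrams.elements())
unigrams = Counter(tokens)
p_uni = lambda w: unigrams[w] / len(tokens)

print(absolute_discounting_prob("my", "reading", bigrams, contexts, p_uni))
```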
Kneser-Ney Smoothing
- Better estimate for probabilities of lower-order unigrams!
- Shannon game: I can't see without my reading ___________? (Francisco? glasses?)
- "Francisco" is more common than "glasses"
- … but "Francisco" always follows "San"
Kneser-Ney Smoothing
- Instead of P(w): "How likely is w?"
- P_continuation(w): "How likely is w to appear as a novel continuation?"
- For each word, count the number of bigram types it completes
- Every bigram type was a novel continuation the first time it was seen
  P_CONTINUATION(w) ∝ |{w_{i-1} : c(w_{i-1}, w) > 0}|
Kneser-Ney Smoothing
- How many times does w appear as a novel continuation?
  P_CONTINUATION(w) ∝ |{w_{i-1} : c(w_{i-1}, w) > 0}|
- Normalized by the total number of word bigram types:
  P_CONTINUATION(w) = |{w_{i-1} : c(w_{i-1}, w) > 0}| / |{(w_{j-1}, w_j) : c(w_{j-1}, w_j) > 0}|
Kneser-Ney Smoothing
- Alternative metaphor: the number of word types seen to precede w,
  |{w_{i-1} : c(w_{i-1}, w) > 0}|
- normalized by the number of word types preceding all words:
  P_CONTINUATION(w) = |{w_{i-1} : c(w_{i-1}, w) > 0}| / Σ_{w'} |{w'_{i-1} : c(w'_{i-1}, w') > 0}|
- A frequent word (Francisco) occurring in only one context (San) will have a low continuation probability
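A minimal sketch (not from the slides) of the continuation probability: for each word, count the distinct left contexts it follows and normalize by the total number of bigram types; the toy corpus is a made-up assumption.

```python
from collections import defaultdict

def continuation_probs(tokens):
    """P_CONTINUATION(w) = |{w_prev : c(w_prev, w) > 0}| / (# distinct bigram types)."""
    left_contexts = defaultdict(set)              # word -> set of distinct predecessors
    bigram_types = set(zip(tokens, tokens[1:]))
    for w_prev, w in bigram_types:
        left_contexts[w].add(w_prev)
    total_bigram_types = len(bigram_types)
    return {w: len(preds) / total_bigram_types for w, preds in left_contexts.items()}

tokens = "san francisco . san francisco . red glasses . new glasses . my glasses .".split()
p_cont = continuation_probs(tokens)
# "francisco" follows only "san", while "glasses" follows several different words
print(p_cont["francisco"], p_cont["glasses"])
```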
Kneser-Ney Smoothing
  P_KN(w_i | w_{i-1}) = max(c(w_{i-1}, w_i) - d, 0) / c(w_{i-1}) + λ(w_{i-1}) P_CONTINUATION(w_i)
- λ is a normalizing constant: the probability mass we've discounted,
  λ(w_{i-1}) = (d / c(w_{i-1})) · |{w : c(w_{i-1}, w) > 0}|
  where d / c(w_{i-1}) is the normalized discount and |{w : c(w_{i-1}, w) > 0}| is the number of word types that can follow w_{i-1} (= # of word types we discounted = # of times we applied the normalized discount)
Kneser-Ney Smoothing: Recursive formulation
  P_KN(w_i | w_{i-n+1}^{i-1}) = max(c_KN(w_{i-n+1}^i) - d, 0) / c_KN(w_{i-n+1}^{i-1}) + λ(w_{i-n+1}^{i-1}) P_KN(w_i | w_{i-n+2}^{i-1})
  c_KN(·) = count(·) for the highest order; continuation count(·) for lower orders
- Continuation count = number of unique single-word contexts for ·
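Putting the pieces together, a minimal sketch (not from the slides, and bigram-only rather than the full recursive formulation) of interpolated Kneser-Ney; d = 0.75 and the toy corpus are illustrative assumptions, and `continuation_probs` is the helper sketched above.

```python
from collections import Counter

def kneser_ney_bigram(w_prev, w, tokens, p_continuation, d=0.75):
    """P_KN(w | w_prev) = max(c(w_prev, w) - d, 0) / c(w_prev) + lambda(w_prev) * P_CONTINUATION(w)."""
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    context_count = sum(c for (u, _), c in bigram_counts.items() if u == w_prev)
    n_followers = sum(1 for (u, _) in bigram_counts if u == w_prev)   # word types following w_prev
    lam = d * n_followers / context_count                             # mass freed by discounting
    discounted = max(bigram_counts[(w_prev, w)] - d, 0.0) / context_count
    return discounted + lam * p_continuation.get(w, 0.0)

tokens = "san francisco . san francisco . red glasses . new glasses . my glasses .".split()
p_cont = continuation_probs(tokens)   # from the earlier sketch
# "glasses" gets more of the reserved mass than "francisco" after the context "my"
print(kneser_ney_bigram("my", "glasses", tokens, p_cont))
print(kneser_ney_bigram("my", "francisco", tokens, p_cont))
```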
Practical issue: Huge web-scale n-grams
- How to deal with, e.g., the Google N-gram corpus?
- Pruning
  - Only store N-grams with count > threshold.
  - Remove singletons of higher-order n-grams.
Huge web-scale n-grams
- Efficiency
  - Efficient data structures, e.g. tries (https://en.wikipedia.org/wiki/Trie)
  - Store words as indexes, not strings
  - Quantize probabilities (4-8 bits instead of an 8-byte float)
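A minimal sketch (not from the slides) of the storage tricks above: words mapped to integer indexes, n-gram counts stored in a trie of nested dicts, and probabilities quantized to 8 bits; the toy data and the simple linear quantizer are illustrative assumptions.

```python
def quantize(prob, bits=8):
    """Map a probability in [0, 1] to an integer code with 2**bits levels (lossy)."""
    levels = (1 << bits) - 1
    return round(prob * levels)

def dequantize(code, bits=8):
    return code / ((1 << bits) - 1)

# Words as indexes, not strings
vocab = {"the": 0, "red": 1, "glasses": 2}

# Trie over word indexes: each path from the root spells an n-gram prefix
trie = {}
def insert_ngram(trie, ngram_ids, count):
    node = trie
    for idx in ngram_ids:
        node = node.setdefault(idx, {})
    node["count"] = count

insert_ngram(trie, [vocab["the"], vocab["red"], vocab["glasses"]], 42)
print(trie)                         # {0: {1: {2: {'count': 42}}}}
print(dequantize(quantize(0.137)))  # ~0.137, stored in one byte instead of an 8-byte float
```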
Smoothing
This dark art is why NLP is taught in the engineering school. There are more principled smoothing methods, too. We'll look next at log-linear models, which are a good and popular general technique.
Conditional Modeling
- Generative language model (trigram model):
  P(x_1, …, x_n) = P(x_1) P(x_2 | x_1) … P(x_n | x_{n-1}, x_{n-2})
- Then, we compute the conditional probabilities by maximum likelihood estimation
- Can we model P(x_i | x_{i-1}, x_{i-2}) directly?
- Given a context x, which outcomes y are likely in that context?
  P(NextWord = y | PrecedingWords = x)
Modeling conditional probabilities
- Let's assume
  P(y | x) = exp(score(x, y)) / Σ_{y'} exp(score(x, y'))
  (y: NextWord, x: PrecedingWords)
- P(y | x) is high ⇔ score(x, y) is high
- This is called soft-max
- We require P(y | x) ≥ 0 and Σ_y P(y | x) = 1; neither is true of score(x, y) itself
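A minimal sketch (not from the slides) of turning arbitrary scores into a soft-max distribution over next words; the candidate words and scores are made up, and the max-subtraction is just the usual numerical-stability trick.

```python
import math

def softmax_distribution(scores):
    """P(y | x) = exp(score(x, y)) / sum_y' exp(score(x, y'))."""
    m = max(scores.values())                      # subtract max for numerical stability
    exp_scores = {y: math.exp(s - m) for y, s in scores.items()}
    z = sum(exp_scores.values())                  # partition function Z(x)
    return {y: e / z for y, e in exp_scores.items()}

# Scores for candidate next words after the context "my reading" (made up)
scores = {"glasses": 2.1, "francisco": -0.3, "abacus": -1.5}
probs = softmax_distribution(scores)
print(probs)                     # probabilities are >= 0 ...
print(sum(probs.values()))       # ... and sum to 1
```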
Linear Scoring
- Score(x, y): How well does y go with x?
- Simplest option: a linear function of (x, y). But (x, y) isn't a number, so describe it by some numbers (i.e., numeric features)
- Then just use a linear function of those numbers:
  score(x, y) = Σ_k θ_k f_k(x, y)
  where k ranges over all features, θ_k is the weight of the k-th feature (to be learned), and f_k(x, y) is whether (x, y) has feature k (0 or 1), or how many times it fires (≥ 0), or how strongly it fires (a real number)
What features should we use?
- Model p(w_i | w_{i-1}, w_{i-2}): the features f_k(⟨w_{i-2}, w_{i-1}⟩, w_i) for Score(⟨w_{i-2}, w_{i-1}⟩, w_i) can be:
  - # of times "w_{i-1}" appears in the training corpus.
  - 1, if "w_i" is an unseen word; 0, otherwise.
  - 1, if "w_{i-2} w_{i-1}" = "a red"; 0, otherwise.
  - 1, if "w_{i-1}" belongs to the "color" category; 0, otherwise.
What features should we use?
- Model p("glasses" | "a red"): the features f_k(⟨"a", "red"⟩, "glasses") for Score(⟨"a", "red"⟩, "glasses") can be:
  - # of times "red" appears in the training corpus.
  - 1, if "glasses" is an unseen word; 0, otherwise.
  - 1, if "a red" = "a red"; 0, otherwise.
  - 1, if "red" belongs to the "color" category; 0, otherwise.
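A minimal sketch (not from the slides) of a feature vector like the one above for a trigram log-linear model; the toy training counts, the color list, and the exact feature choices are illustrative assumptions.

```python
def features(w_prev2, w_prev1, w, train_counts, colors=("red", "blue", "green", "yellow")):
    """Numeric features describing the (context, next-word) pair for a linear score."""
    return [
        train_counts.get(w_prev1, 0),                        # how often w_{i-1} was seen in training
        1.0 if train_counts.get(w, 0) == 0 else 0.0,         # is the predicted word unseen?
        1.0 if (w_prev2, w_prev1) == ("a", "red") else 0.0,  # specific context bigram fires
        1.0 if w_prev1 in colors else 0.0,                   # w_{i-1} is a color word
    ]

def score(theta, f):
    """Linear scoring: score(x, y) = sum_k theta_k * f_k(x, y)."""
    return sum(t * v for t, v in zip(theta, f))

train_counts = {"a": 120, "red": 15, "glasses": 3}   # toy counts
f = features("a", "red", "glasses", train_counts)
theta = [0.01, -2.0, 0.5, 1.2]                       # weights, to be learned
print(f, score(theta, f))
```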
Log-Linear Conditional Probability
- Unnormalized probability (at least it's positive!):
  u(x, y) = exp(Σ_k θ_k f_k(x, y))
- where we choose Z(x) to ensure that Σ_y p(y | x) = 1:
  Z(x) = Σ_y exp(Σ_k θ_k f_k(x, y))   (the partition function)
- thus,
  p(y | x) = exp(Σ_k θ_k f_k(x, y)) / Z(x)
Training θ
This version is "discriminative training": train θ to learn to predict y from x, i.e., maximize p(y | x). Whereas in "generative models", we learn to model x, too, by maximizing p(x, y).
- n training examples
- feature functions f_1, f_2, …
- Want to maximize p(training data | θ)
- Easier to maximize the log of that: Σ_i log p(y_i | x_i, θ)
- Alas, some weights θ_k may be optimal at -∞ or +∞. When would this happen? What's going "wrong"?
Generalization via Regularization
- n training examples; feature functions f_1, f_2, …
- Want to maximize p(training data | θ) · p_prior(θ)
- Easier to maximize the log of that
- Choosing p_prior(θ) ∝ exp(-‖θ‖² / 2σ²) encourages weights close to 0.
- This "L2 regularization" corresponds to a Gaussian prior.
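A minimal sketch (not from the slides) of the regularized conditional log-likelihood for a log-linear model like the one above; the single toy example, its features, and the regularization strength σ² = 1 are illustrative assumptions.

```python
import math

def log_prob(theta, f_by_candidate, gold):
    """log p(gold | x) under a log-linear model: score(gold) - log Z(x)."""
    scores = {y: sum(t * v for t, v in zip(theta, f)) for y, f in f_by_candidate.items()}
    log_z = math.log(sum(math.exp(s) for s in scores.values()))
    return scores[gold] - log_z

def regularized_objective(theta, data, sigma2=1.0):
    """sum_i log p(y_i | x_i, theta) - ||theta||^2 / (2 * sigma^2), the log of likelihood * Gaussian prior."""
    loglik = sum(log_prob(theta, f_by_candidate, gold) for f_by_candidate, gold in data)
    l2 = sum(t * t for t in theta) / (2.0 * sigma2)
    return loglik - l2

# One toy training example: two candidate next words with 2 features each, gold answer "glasses"
example = ({"glasses": [1.0, 1.0], "abacus": [1.0, 0.0]}, "glasses")
print(regularized_objective([0.5, 1.0], [example]))
```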
Gradient-based training
- Gradually adjust θ in a direction that improves the objective.
- Gradient ascent to gradually increase f(θ):
  while (∇f(θ) ≠ 0)      // not at a local max or min
      θ = θ + ε ∇f(θ)    // for some small ε > 0
- Remember: ∇f(θ) = (∂f(θ)/∂θ_1, ∂f(θ)/∂θ_2, …), so the update means θ_k += ε ∂f(θ)/∂θ_k
Gradient-based training
- Gradually adjust θ in a direction that improves the objective.
- Gradient w.r.t. θ:
  ∂/∂θ_k Σ_i log p(y_i | x_i, θ) = Σ_i ( f_k(x_i, y_i) - Σ_y p(y | x_i, θ) f_k(x_i, y) )
  (observed feature counts minus expected feature counts)
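A minimal sketch (not from the slides) of gradient ascent on the log-linear objective, with the gradient written as observed minus expected feature counts; the single toy example, step size, and iteration count are illustrative assumptions (no regularization term here).

```python
import math

def gradient_step(theta, f_by_candidate, gold, eps=0.1):
    """One ascent step on log p(gold | x): theta_k += eps * (f_k(x, gold) - E_p[f_k(x, y)])."""
    scores = {y: sum(t * v for t, v in zip(theta, f)) for y, f in f_by_candidate.items()}
    z = sum(math.exp(s) for s in scores.values())
    probs = {y: math.exp(s) / z for y, s in scores.items()}
    expected = [sum(probs[y] * f[k] for y, f in f_by_candidate.items()) for k in range(len(theta))]
    observed = f_by_candidate[gold]
    return [t + eps * (o - e) for t, o, e in zip(theta, observed, expected)]

example = ({"glasses": [1.0, 1.0], "abacus": [1.0, 0.0]}, "glasses")
theta = [0.0, 0.0]
for _ in range(50):          # repeated small steps increase log p(glasses | x)
    theta = gradient_step(theta, *example)
print(theta)                 # the weight on feature 2 grows, since it separates the gold word from the rest
```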
More complex assumption?
- P(y | x) = exp(score(x, y)) / Σ_{y'} exp(score(x, y'))   (y: NextWord, x: PrecedingWords)
- Assume we saw: red glasses; yellow glasses; green glasses; blue glasses; red shoes; yellow shoes; green shoes. What is P(shoes | blue)?
- Can we learn categories of words (representations) automatically?
- Can we build a high-order n-gram model without blowing up the model size?
Neural language model
- Model P(y | x) with a neural network
- Example 1: one-hot vector: each component of the vector represents one word, e.g., [0, 0, 1, 0, 0]
- Example 2: word embeddings (dense, low-dimensional vectors)
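A minimal sketch (not from the slides) of a feed-forward neural language model in the spirit of this slide: look up embeddings for the context words, pass them through a hidden layer, and apply a soft-max over the vocabulary; the tiny vocabulary, dimensions, and random untrained weights are illustrative assumptions (no training loop shown).

```python
import math, random

random.seed(0)
vocab = ["red", "yellow", "green", "blue", "glasses", "shoes"]
dim, hidden = 4, 8
V = len(vocab)

# Randomly initialized parameters (a real model would learn these)
emb = [[random.gauss(0, 0.1) for _ in range(dim)] for _ in vocab]
W1 = [[random.gauss(0, 0.1) for _ in range(2 * dim)] for _ in range(hidden)]
W2 = [[random.gauss(0, 0.1) for _ in range(hidden)] for _ in range(V)]

def next_word_probs(w_prev2, w_prev1):
    """P(y | x): concatenate context embeddings -> tanh hidden layer -> soft-max over the vocabulary."""
    x = emb[vocab.index(w_prev2)] + emb[vocab.index(w_prev1)]          # concatenation
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    scores = [sum(w * hi for w, hi in zip(row, h)) for row in W2]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return {w: e / z for w, e in zip(vocab, exps)}

print(next_word_probs("green", "blue"))
```

Because the context is represented by shared embeddings rather than discrete n-gram identities, a trained model of this kind can give "blue shoes" reasonable probability even if that exact bigram was never seen.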