Language Modeling
Professor Marie Roch

For details on N-gram models, see chapter 4 of: Jurafsky, D., and Martin, J. H. (2009). Speech and Language Processing (Pearson Prentice Hall, Upper Saddle River, NJ).

Narrowing search with a language model
• Don't move or I'll …
• Get 'er …
• What will she think of …
• This enables …
Applications
• Speech recognition
• Handwriting recognition
• Spelling correction
• Augmentative communication
• and more…

Constituencies
• Groupings of words
  – I didn't see you behind the bush.
  – She ate quickly as she was late for the meeting.
• Constituents move within the sentence as a unit:
  – As she was late for the meeting, she ate quickly.  (well formed)
  – As she was late for, she ate quickly the meeting.  (ill formed)
• Constituencies aid in prediction.
Strategies for construction
• Formal grammar
  – Requires intimate knowledge of the language
  – Usually context free; cannot be represented by a regular language
  – We will not be covering this in detail

N-gram models
• Suppose we wish to compute the probability of the sentence:
  She sells seashells down by the seashore.
• We can think of this as a sequence of words $w_1 \ldots w_7$:
  She ($w_1$) sells ($w_2$) seashells ($w_3$) down ($w_4$) by ($w_5$) the ($w_6$) seashore ($w_7$)

  $P(w_1^7) = P(w_1, w_2, w_3, w_4, w_5, w_6, w_7)$

• Think further: How would you determine if a die is fair?
Estimating word probability
• Suppose we wish to compute the probability of $w_2$ (sells in the previous example). We could estimate it using a relative frequency:

  $P(w_2) = \frac{\#\text{ times } w_2 \text{ occurs}}{\#\text{ times all words occur}}$

  but this ignores what we could have learned with the first word.

Conditional probability
• By definition of conditional probability,

  $P(A \mid B) = \frac{P(A \cap B)}{P(B)}$

  or, in our problem,

  $P(w_2 \mid w_1) = \frac{P(w_2 \cap w_1)}{P(w_1)} = \frac{P(w_1 \cap w_2)}{P(w_1)} = \frac{P(w_1, w_2)}{P(w_1)}$   (definition of $\cap$ for words)
Conditional probability
• Next, consider $P(w_1, w_2)$. Since $P(w_2 \mid w_1) = \frac{P(w_1, w_2)}{P(w_1)}$, clearly

  $P(w_1, w_2) = P(w_2 \mid w_1)\, P(w_1)$

Chain rule
• Now let us consider:

  $P(w_1, w_2, w_3) = P(w_3 \mid w_1, w_2)\, P(w_1, w_2)$   (we just did this part)
                    $= P(w_3 \mid w_1, w_2)\, P(w_2 \mid w_1)\, P(w_1)$

• By applying conditional probability repeatedly we end up with the chain rule:

  $P(W) = P(w_1 w_2 \cdots w_n) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1 w_2) \cdots P(w_n \mid w_1 w_2 \cdots w_{n-1}) = \prod_{i=1}^{n} P(w_i \mid w_1 w_2 \cdots w_{i-1})$
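As a brief worked illustration (added here, using the slide's seashell example sentence), the chain rule expands the sentence probability as:

  $P(\text{She sells seashells down by the seashore}) = P(\text{She})\, P(\text{sells} \mid \text{She})\, P(\text{seashells} \mid \text{She sells}) \cdots P(\text{seashore} \mid \text{She sells seashells down by the})$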
Sparse problem space
• Suppose $V$ distinct words. There are $V^i$ possible sequences of $i$ words.
• Tokens $N$
  – The number of N-grams (including repetitions) occurring in a corpus
• Problem: in general, unique($N$) << valid token sequences for the language.
  "The gently rolling hills were covered with bluebonnets" had no hits on Google at the time this slide was published.

Markov assumption
• A prediction is dependent on the current state but independent of previous conditions.  (Andrei Markov, 1856–1922)
• In our context:

  $P(w_n \mid w_1^{n-1}) = P(w_n \mid w_{n-1})$   by the Markov assumption

  which we at times relax to $N-1$ words:

  $P(w_n \mid w_1^{n-1}) = P(w_n \mid w_{n-N+1}^{n-1})$
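Under the bigram (first-order Markov) assumption, the seashell sentence probability from the chain-rule example collapses to a product of bigram terms — a worked illustration added for clarity:

  $P(\text{She sells seashells down by the seashore}) \approx P(\text{She})\, P(\text{sells} \mid \text{She})\, P(\text{seashells} \mid \text{sells})\, P(\text{down} \mid \text{seashells})\, P(\text{by} \mid \text{down})\, P(\text{the} \mid \text{by})\, P(\text{seashore} \mid \text{the})$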
Special N-grams
• Unigram
  – Only depends upon the word itself: $P(w_i)$
• Bigram
  – $P(w_i \mid w_{i-1})$
• Trigram
  – $P(w_i \mid w_{i-1}, w_{i-2})$
• Quadrigram
  – $P(w_i \mid w_{i-1}, w_{i-2}, w_{i-3})$

Preparing a corpus
• Make case independent
• Remove punctuation and add start & end of sentence markers <s> </s>
• Other possibilities
  – part of speech tagging
  – lemmas: mapping of words with similar roots, e.g. sing, sang, sung → sing
  – stemming: mapping of derived words to their root, e.g. parted → part, ostriches → ostrich
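A minimal preprocessing sketch in Python (the helper is hypothetical and uses a crude regular expression in place of a real tokenizer):

```python
import re

def prepare(sentence):
    """Lowercase, strip punctuation, and add sentence boundary markers.
    A minimal sketch of the corpus preparation steps on the slide."""
    tokens = re.findall(r"[a-z']+", sentence.lower())  # case folding + drop punctuation
    return ["<s>"] + tokens + ["</s>"]

print(prepare("I am Sam."))   # ['<s>', 'i', 'am', 'sam', '</s>']
```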
An Example
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>
  Dr. Seuss, Green Eggs and Ham, 1960.

• Relative-frequency (maximum likelihood) estimate of an N-gram probability from counts $C(\cdot)$ (a small counting sketch follows the Berkeley sentences below):

  $P(w_n \mid w_{n-N+1}^{n-1}) = \frac{C(w_{n-N+1}^{n-1}\, w_n)}{C(w_{n-N+1}^{n-1})}$

Berkeley Restaurant Project Sentences
• can you tell me about any good cantonese restaurants close by
• mid priced thai food is what i'm looking for
• tell me about chez panisse
• can you give me a listing of the kinds of food that are available
• i'm looking for a good place to eat breakfast
• when is caffe venezia open during the day
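A minimal Python sketch (function and variable names are illustrative, not from the slides) that builds bigram counts from the Dr. Seuss corpus above and applies the relative-frequency estimate:

```python
from collections import Counter

corpus = [
    "<s> i am sam </s>",
    "<s> sam i am </s>",
    "<s> i do not like green eggs and ham </s>",
]

unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    words = sentence.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

def p_bigram(w, prev):
    """Relative-frequency (maximum likelihood) estimate of P(w | prev)."""
    return bigrams[(prev, w)] / unigrams[prev]

print(p_bigram("i", "<s>"))    # 2/3 ≈ 0.67
print(p_bigram("sam", "am"))   # 1/2 = 0.5
```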
Bigram Counts
• Bigram counts $C(w_{i-1}\, w_i)$ from 9,222 sentences (table of counts, rows $w_{i-1}$, columns $w_i$, not reproduced here); e.g. "i want" occurs 827 times.

Bigram Probabilities
• Normalize each bigram count by the unigram count of the preceding word, e.g. with $C(\text{i}) = 2533$:

  $P(\text{want} \mid \text{i}) = \frac{C(\text{i want})}{C(\text{i})} = \frac{827}{2533} \approx 0.33$
Bigram Estimates of Sentence Probabilities

  P(<s> I want english food </s>)
    = P(I | <s>) P(want | I) P(english | want) P(food | english) P(</s> | food)
    ≈ 0.000031

Shakespeare as a corpus: N = 884,647 tokens, V = 29,066 word types.
• How will this work on Huckleberry Finn?
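A small sketch of the bigram sentence-probability computation above. The individual bigram values below are assumptions taken from the Jurafsky & Martin Berkeley Restaurant Project example, chosen to be consistent with the ≈0.000031 result:

```python
import math

# Assumed bigram probabilities (Berkeley Restaurant Project example values)
bigram_p = {
    ("<s>", "i"): 0.25,
    ("i", "want"): 0.33,
    ("want", "english"): 0.0011,
    ("english", "food"): 0.5,
    ("food", "</s>"): 0.68,
}

sentence = ["<s>", "i", "want", "english", "food", "</s>"]

# Sum log probabilities to avoid underflow, then exponentiate for display.
logp = sum(math.log(bigram_p[(prev, w)]) for prev, w in zip(sentence, sentence[1:]))
print(math.exp(logp))   # ≈ 3.1e-05
```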
The need for n-gram smoothing
• Data for estimation is sparse.
• In a sample text with several million words:
  – 50% of trigrams occurred only once
  – 80% of trigrams occurred less than 5 times
• Example: When pigs fly

  $P(\text{fly} \mid \text{when}, \text{pigs}) = \frac{C(\text{when}, \text{pigs}, \text{fly})}{C(\text{when}, \text{pigs})} = \frac{0}{C(\text{when}, \text{pigs})}$   if "when pigs fly" is unseen

Smoothing strategies
• Suppose P(fly | when, pigs) = 0
• Backoff strategies do the following:
  – When estimating P(Z | X, Y) where C(X, Y, Z) > 0, don't assign all of the probability; save some of it for the cases we haven't seen. This is called discounting and is based on Good-Turing counts.
Smoothing strategies
• For cases where C(X, Y, Z) = 0, use P(Z | Y), but scale it by the amount of leftover probability.
• To handle C(Y, Z) = 0, this process can be applied recursively. (A simplified sketch appears after the next slide.)

Neural language models
• Advantages
  – As the net learns a representation, similarities can be captured.
    Example: Consider food
    • Possible to learn common things about foods
    • Yet the individual items can still be considered distinct
    There are approaches to capture commonality in N-gram models (e.g. Kneser-Ney), but they lose the ability to distinguish the words.
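The sketch below illustrates the back-off-to-a-shorter-context idea from the smoothing slides using "stupid backoff" (Brants et al., 2007), a deliberately simplified stand-in: it uses a fixed back-off weight instead of Good-Turing discounting and returns scores rather than normalized probabilities.

```python
from collections import Counter

def train_counts(sentences, max_n=3):
    """Collect n-gram counts up to order max_n from tokenized sentences."""
    counts = Counter()
    for words in sentences:
        for n in range(1, max_n + 1):
            counts.update(zip(*(words[i:] for i in range(n))))
    return counts

def stupid_backoff(w, context, counts, total_words, alpha=0.4):
    """'Stupid backoff' score S(w | context): relative frequency when the
    n-gram was seen, otherwise back off to a shorter context scaled by alpha.
    Unlike Katz backoff with Good-Turing discounting (described on the
    slides), this yields a score, not a normalized probability."""
    if not context:
        return counts[(w,)] / total_words          # unigram relative frequency
    ngram = context + (w,)
    if counts[ngram] > 0:
        return counts[ngram] / counts[context]     # seen: relative frequency
    # unseen: drop the earliest context word and scale by alpha
    return alpha * stupid_backoff(w, context[1:], counts, total_words, alpha)

sentences = [s.split() for s in [
    "<s> i am sam </s>",
    "<s> sam i am </s>",
    "<s> i do not like green eggs and ham </s>",
]]
counts = train_counts(sentences)
total = sum(c for ng, c in counts.items() if len(ng) == 1)
print(stupid_backoff("ham", ("green", "eggs"), counts, total))  # seen trigram -> 1.0
print(stupid_backoff("sam", ("green", "eggs"), counts, total))  # unseen -> backs off
```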
Perplexity
• Measure of the ability of a language model to predict the next word.
• Related to the cross entropy of the language, H(L); perplexity is $2^{H(L)}$:

  $H(L) = \lim_{n \to \infty} \frac{1}{n} H(w_1, w_2, \ldots, w_n) = -\lim_{n \to \infty} \frac{1}{n} \sum_{W \in L} P(w_1, w_2, \ldots, w_n) \log P(w_1, w_2, \ldots, w_n)$

• Lower perplexity indicates better modeling (theoretically). (A small computation sketch appears after the next slide.)

Neural language models
• Advantages (continued)
  – Word embeddings can learn low dimensional representations of words that capture semantic information
• Disadvantages
  – Traditional prediction uses one-hot vectors over the vocabulary
    • High dimensional output space
    • Computationally expensive
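Returning to perplexity, a minimal sketch that evaluates a bigram model on a token sequence (reusing the hypothetical p_bigram estimator from the earlier Dr. Seuss sketch):

```python
import math

def perplexity(test_words, bigram_p):
    """Per-word perplexity of a bigram model on a token sequence:
    PP = 2 ** (-(1/N) * sum(log2 P(w_i | w_{i-1})))."""
    n = len(test_words) - 1                      # number of predictions made
    log2_sum = sum(math.log2(bigram_p(w, prev))
                   for prev, w in zip(test_words, test_words[1:]))
    return 2 ** (-log2_sum / n)

test = "<s> i am sam </s>".split()
print(perplexity(test, p_bigram))   # ≈ 1.7: low, since the sentence was in training
```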
Consider a softmax output layer
• Suppose
  – V words in vocabulary 𝕎
  – $n_h$ units in the last hidden layer
• ∴ softmax input to, and output of, each unit:

  $a_i = b_i + \sum_j W_{i,j} h_j, \qquad \hat{y}_i = \frac{e^{a_i}}{\sum_{k=1}^{V} e^{a_k}}, \qquad 1 \le i \le V$

• Total cost of the softmax layer: $O(V n_h)$, which is large in practice (see the sketch below).

Short List (hybrid neural/n-gram)
• Create a subset 𝕄 of frequently used words
• Train a neural net on 𝕄
• Treat the tail (remainder of the vocabulary) using n-gram models: 𝕌 = 𝕎\𝕄
• Reduces complexity, but the words we don't model are the hard ones…
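A small numpy sketch (the sizes are assumptions) illustrating why a full softmax output layer costs $O(V n_h)$:

```python
import numpy as np

V, n_h = 10_000, 512                       # assumed vocabulary and hidden sizes
rng = np.random.default_rng(0)

W = rng.standard_normal((V, n_h)) * 0.01   # output weights: V x n_h parameters
b = np.zeros(V)
h = rng.standard_normal(n_h)               # last hidden layer activations

a = W @ h + b                              # O(V * n_h) multiply-adds per prediction
y_hat = np.exp(a - a.max())                # softmax, shifted for numerical stability
y_hat /= y_hat.sum()
print(y_hat.shape, y_hat.sum())            # (10000,) 1.0
```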
Hierarchical softmax
• Addresses the large V problem
• Basic idea:
  – build a binary hierarchy of word categories
  – words are assigned to classes
  – each subnet has a small softmax layer
  – the last subnet has a manageable size
• Higher perplexity than a non-hierarchical model

Importance sampling
• Consider the gradient of a large-V softmax layer:

  $\frac{\partial \log P(y \mid C)}{\partial \theta} = \frac{\partial \log \mathrm{softmax}_y(a)}{\partial \theta} = \frac{\partial}{\partial \theta} \log \frac{e^{a_y}}{\sum_i e^{a_i}} = \frac{\partial}{\partial \theta}\left( a_y - \log \sum_i e^{a_i} \right) = \frac{\partial a_y}{\partial \theta} - \sum_i P(y = i \mid C)\, \frac{\partial a_i}{\partial \theta}$

  The second term requires computing $P(y = i \mid C)$ for every other word.
Importance sampling
• What if we could approximate the second half of
  $\frac{\partial a_y}{\partial \theta} - \sum_i P(y = i \mid C)\, \frac{\partial a_i}{\partial \theta}$ ?
• We could sample, but to sample we would need to know $P(y = i \mid C)$ … seems like a dead end…

Importance sampling
• Importance sampling lets us sample from a different distribution.
• Suppose we want to sample a function f on elements from distribution p, e.g.
  $E_p[f(X)] = \sum_i p(x_i) f(x_i)$,
  but we cannot draw from p.
Importance sampling
• Consider a new distribution q:

  $E_p[f(X)] = \sum_i q(x_i) \frac{p(x_i)}{q(x_i)} f(x_i)$

• We can sample x based on q:

  $\hat{E}_p[f(X)] = \frac{1}{N} \sum_{x_i \sim q} \frac{p(x_i)}{q(x_i)} f(x_i)$

Importance sampling
• We can use an n-gram model as q.
• Now we can sample and produce a cheaper estimate of the probability.
• See Goodfellow et al. 17.2 for more details on importance sampling.
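A toy numpy sketch of the importance-sampling estimator above; the distributions p, q and the function f are made-up values, not anything from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Estimate E_p[f(X)] over a small discrete support while drawing samples
# from a different (proposal) distribution q.
support = np.arange(5)
p = np.array([0.05, 0.10, 0.20, 0.30, 0.35])   # target distribution
q = np.array([0.20, 0.20, 0.20, 0.20, 0.20])   # proposal we can sample from
f = lambda x: x ** 2

exact = np.sum(p * f(support))                  # E_p[f(X)] computed directly

N = 100_000
samples = rng.choice(support, size=N, p=q)      # draw from q, not p
weights = p[samples] / q[samples]               # importance weights p(x)/q(x)
estimate = np.mean(weights * f(samples))

print(exact, estimate)                          # the two values should be close
```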