Lecture 2: N-gram Kai-Wei Chang CS @ University of Virginia kw@kwchang.net Course webpage: http://kwchang.net/teaching/NLP16 CS 6501: Natural Language Processing 1
This lecture Language Models What are N-gram models? How to use probabilities What does P(Y|X) mean? How can I manipulate it? How can I estimate its value in practice? CS 6501: Natural Language Processing 2
What is a language model? Probability distributions over sentences (i.e., word sequences) P(W) = P(x_1 x_2 x_3 x_4 … x_l) Can use them to generate strings P(x_l | x_1 x_2 x_3 x_4 … x_{l-1}) Rank possible sentences P("Today is Tuesday") > P("Tuesday Today is") P("Today is Tuesday") > P("Today is Virginia") CS 6501: Natural Language Processing 3
Language model applications Context-sensitive spelling correction CS 6501: Natural Language Processing 4
Language model applications Autocomplete CS 6501: Natural Language Processing 5
Language model applications Smart Reply CS 6501: Natural Language Processing 6
Language model applications Language generation https://pdos.csail.mit.edu/archive/scigen/ CS 6501: Natural Language Processing 7
Bag-of-Words with N-grams N-grams: a contiguous sequence of n tokens from a given piece of text http://recognize-speech.com/language-model/n-gram-model/comparison CS 6501: Natural Language Processing 8
N-Gram Models Unigram model: P(x_1) P(x_2) P(x_3) … P(x_n) Bigram model: P(x_1) P(x_2|x_1) P(x_3|x_2) … P(x_n|x_{n-1}) Trigram model: P(x_1) P(x_2|x_1) P(x_3|x_2, x_1) … P(x_n|x_{n-1}, x_{n-2}) N-gram model: P(x_1) P(x_2|x_1) … P(x_n|x_{n-1} x_{n-2} … x_{n-N+1}) CS 6501: Natural Language Processing 9
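As a minimal sketch (not from the slides), the general factorization can be written as a short Python function; `cond_prob` is a hypothetical lookup that returns P(word | context):

```python
def ngram_sentence_prob(words, n, cond_prob):
    """P(x_1 ... x_m) under an n-gram model: each word is conditioned
    on at most its n-1 predecessors."""
    prob = 1.0
    for i, w in enumerate(words):
        context = tuple(words[max(0, i - (n - 1)):i])  # last n-1 words (fewer at the sentence start)
        prob *= cond_prob(w, context)
    return prob

# n=1 recovers the unigram model, n=2 the bigram model, n=3 the trigram model.
```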
Random language via n-gram http://www.cs.jhu.edu/~jason/465/PowerPoint/lect01,3tr-ngram-gen.pdf Behind the scenes – probability theory CS 6501: Natural Language Processing 10
Sampling with replacement [Figure of colored balls omitted] 1. P(·) = ? 2. P(·) = ? 3. P(red, ·) = ? 4. P(blue) = ? 5. P(red | ·) = ? 6. P(· | red) = ? 7. P(·) = ? 8. P(·) = ? 9. P(2 × ·, 3 × ·, 4 × ·) = ? CS 6501: Natural Language Processing 11
Sampling words with replacement Example from Julia Hockenmaier, Intro to NLP CS 6501: Natural Language Processing 12
Implementation: how to sample? Sample from a discrete distribution p(X) Assume n outcomes in the event space X 1. Divide the interval [0,1] into n intervals according to the probabilities of the outcomes 2. Generate a random number r between 0 and 1 3. Return x_i, where the i-th interval is the one that r falls into CS 6501: Natural Language Processing 13
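A minimal Python sketch of this procedure, assuming the distribution is given as a dictionary mapping outcomes to probabilities (the example distribution is made up for illustration):

```python
import random

def sample(dist):
    """Sample one outcome from a discrete distribution given as
    {outcome: probability}; probabilities are assumed to sum to 1."""
    r = random.random()          # step 2: random number in [0, 1)
    cumulative = 0.0
    for outcome, p in dist.items():
        cumulative += p          # step 1: lay the outcomes along [0, 1]
        if r < cumulative:       # step 3: return the interval r falls into
            return outcome
    return outcome               # guard against floating-point round-off

# illustrative unigram distribution
print(sample({"the": 0.5, "today": 0.3, "is": 0.2}))
```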
Conditional on the previous word Example from Julia Hockenmaier, Intro to NLP CS 6501: Natural Language Processing 14
Conditional on the previous word Example from Julia Hockenmaier, Intro to NLP CS 6501: Natural Language Processing 15
Recap: Probability Theory Conditional probability: P(blue | ·) = ? P(A | B) = P(A, B) / P(B) Bayes' rule: P(A | B) = P(B | A) P(A) / P(B) Verify: P(red | ·), P(· | red), P(·), P(red) Independence: P(A | B) = P(A) Prove: if A and B are independent, P(A, B) = P(A) P(B) CS 6501: Natural Language Processing 16
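The requested proof is one line from the definition of conditional probability (standard probability, not specific to these slides):

```latex
P(A, B) = P(A \mid B)\, P(B) = P(A)\, P(B),
\quad \text{since independence means } P(A \mid B) = P(A).
```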
The Chain Rule The joint probability can be expressed in terms of the conditional probability: P(X, Y) = P(X | Y) P(Y) More variables: P(X, Y, Z) = P(X | Y, Z) P(Y, Z) = P(X | Y, Z) P(Y | Z) P(Z) P(X_1, X_2, …, X_n) = P(X_1) P(X_2 | X_1) P(X_3 | X_2, X_1) … P(X_n | X_1, …, X_{n-1}) = P(X_1) ∏_{i=2}^{n} P(X_i | X_1, …, X_{i-1}) CS 6501: Natural Language Processing 17
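For instance, applied to the three-word sentence from an earlier slide, the chain rule gives an exact rewriting with no modeling assumptions yet:

```latex
P(\text{Today is Tuesday})
  = P(\text{Today}) \cdot P(\text{is} \mid \text{Today})
    \cdot P(\text{Tuesday} \mid \text{Today, is})
```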
Language model for text Probability distribution over sentences Chain rule: from conditional probability to joint probability p(x_1 x_2 … x_n) = p(x_1) p(x_2 | x_1) p(x_3 | x_1, x_2) … p(x_n | x_1, x_2, …, x_{n-1}) Complexity: O(V^{n*}), where V is the vocabulary size and n* is the maximum sentence length We need independence assumptions! A rough estimate: 475,000 main headwords in Webster's Third New International Dictionary; average English sentence length is 14.3 words, so roughly O(475000^14) parameters How large is this? 475000^14 values × 8 bytes / 1024^4 ≈ 3.38e66 TB CS 6501: Natural Language Processing 18
Probability models Building a probability model: defining the model (making independence assumptions) estimating the model's parameters using the model (making inferences) Example: a trigram model definition is stated in terms of parameters like P("is" | "today"); estimation supplies the parameter values Θ CS 6501: Natural Language Processing 19
Independence assumption Even though X and Y are not actually independent, we treat them as if they were This makes the model compact (e.g., from 100k^14 parameters down to 100k^2) CS 6501: Natural Language Processing 20
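As a rough sanity check of that reduction (a minimal sketch; the 100,000-word vocabulary is just the slide's illustrative figure):

```python
V = 100_000                      # illustrative vocabulary size from the slide
full_joint = V ** 14             # one parameter per possible 14-word sentence
bigram = V ** 2                  # one parameter per (previous word, word) pair

print(f"full joint over 14-word sentences: {full_joint:.2e} parameters")  # ~1e70
print(f"bigram model:                      {bigram:.2e} parameters")      # 1e10
```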
Language model with N-gram The chain rule: P(X_1, X_2, …, X_n) = P(X_1) P(X_2 | X_1) P(X_3 | X_2, X_1) … P(X_n | X_1, …, X_{n-1}) N-gram language model assumes each word depends only on the last n-1 words (Markov assumption) CS 6501: Natural Language Processing 21
Language model with N-gram Example: trigram (3-gram) P(x_n | x_1, …, x_{n-1}) = P(x_n | x_{n-2}, x_{n-1}) P(x_1, …, x_n) = P(x_1) P(x_2 | x_1) … P(x_n | x_{n-2}, x_{n-1}) P("Today is a sunny day") = P("Today") P("is" | "Today") P("a" | "is", "Today") … P("day" | "sunny", "a") CS 6501: Natural Language Processing 22
Unigram model CS 6501: Natural Language Processing 23
Bigram model Condition on the previous word CS 6501: Natural Language Processing 24
N-gram model CS 6501: Natural Language Processing 25
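A minimal sketch of how such a model generates text, here for the bigram case, assuming the conditional probabilities are already stored in a nested dictionary; the tiny probability table, the `<s>`/`</s>` boundary symbols, and the function name are illustrative, not from the slides:

```python
import random

# hypothetical bigram table: bigram_probs[prev][word] = P(word | prev)
bigram_probs = {
    "<s>":     {"today": 0.6, "it": 0.4},
    "today":   {"is": 1.0},
    "it":      {"is": 1.0},
    "is":      {"sunny": 0.5, "tuesday": 0.5},
    "sunny":   {"</s>": 1.0},
    "tuesday": {"</s>": 1.0},
}

def generate(max_len=20):
    """Sample a sentence word by word, each word conditioned on the previous one."""
    prev, words = "<s>", []
    for _ in range(max_len):
        dist = bigram_probs[prev]
        word = random.choices(list(dist), weights=dist.values())[0]
        if word == "</s>":
            break
        words.append(word)
        prev = word
    return " ".join(words)

print(generate())   # e.g. "today is sunny"
```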
More examples Yoav's blog post: http://nbviewer.jupyter.org/gist/yoavg/d76121dfde2618422139 10-gram character-level LM: First Citizen: Nay, then, that was hers, It speaks against your other service: But since the youth of the circumstance be spoken: Your uncle and one Baptista's daughter. SEBASTIAN: Do I stand till the break off. BIRON: Hide thy head. CS 6501: Natural Language Processing 26
More examples Yoav's blog post: http://nbviewer.jupyter.org/gist/yoavg/d76121dfde2618422139 10-gram character-level LM: ~~/* * linux/kernel/time.c * Please report this on hardware. */ void irq_mark_irq(unsigned long old_entries, eval); /* * Divide only 1000 for ns^2 -> us^2 conversion values don't overflow: seq_puts(m, "\ttramp: %pS", (void *)class->contending_point]++; if (likely(t->flags & WQ_UNBOUND)) { /* * Update inode information. If the * slowpath and sleep time (abs or rel) * @rmtp: remaining (either due * to consume the state of ring buffer size. */ header_size - size, in bytes, of the chain. */ BUG_ON(!error); } while (cgrp) { if (old) { if (kdb_continue_catastrophic; #endif CS 6501: Natural Language Processing 27
Questions? CS 6501: Natural Language Processing 28
Maximum likelihood Estimation "Best" means "data likelihood reaches maximum": θ̂ = argmax_θ P(X | θ) Unigram language model estimation from a document (a paper, total #words = 100): word counts: text 10, mining 5, association 3, database 3, algorithm 2, query 1, efficient 1, … Estimated unigram LM p(w | θ): text 10/100, mining 5/100, association 3/100, database 3/100, algorithm 2/100, query 1/100, … CS 6501: Natural Language Processing 29
Which bag of words is more likely to generate "aaaDaaaKoaaaa"? [Figure of two bags of letters omitted; the letters shown include a K a K a o o P D a a a a D F E b a E a n] CS 6501: Natural Language Processing 30
Parameter estimation General setting: Given a (hypothesized & probabilistic) model that governs the random experiment The model gives a probability of any data p(X|θ) that depends on the parameter θ Now, given actual sample data X = {x_1, …, x_n}, what can we say about the value of θ? Intuitively, take our best guess of θ -- "best" means "best explaining/fitting the data" Generally an optimization problem CS 6501: Natural Language Processing 31
Maximum likelihood estimation Data: a collection of words w_1, w_2, …, w_n Model: multinomial distribution p(W) with parameters θ_i = p(w_i) Maximum likelihood estimator: θ̂ = argmax_{θ∈Θ} p(X|θ) p(X|θ) = (N choose c(w_1), …, c(w_N)) ∏_{i=1}^{N} θ_i^{c(w_i)} ∝ ∏_{i=1}^{N} θ_i^{c(w_i)} ⇒ log p(X|θ) = Σ_{i=1}^{N} c(w_i) log θ_i + const θ̂ = argmax_{θ∈Θ} Σ_{i=1}^{N} c(w_i) log θ_i, where c(w_i) is the count of word w_i in the data CS 6501: Natural Language Processing 32
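Solving this constrained optimization (the probabilities must sum to 1) gives the relative-frequency estimate θ̂_i = c(w_i) / N. A minimal sketch in Python, with a made-up toy corpus:

```python
from collections import Counter

def mle_unigram(words):
    """Maximum likelihood estimate of a unigram model: relative frequencies."""
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# toy corpus (illustrative)
corpus = "text mining text retrieval text".split()
print(mle_unigram(corpus))   # {'text': 0.6, 'mining': 0.2, 'retrieval': 0.2}
```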