Lecture 2: N-gram Kai-Wei Chang CS @ University of Virginia kw@kwchang.net Course webpage: http://kwchang.net/teaching/NLP16 CS 6501: Natural Language Processing 1
This lecture Language Models What are N-gram models? How to use probabilities What does P(Y|X) mean? How can I manipulate it? How can I estimate its value in practice? CS 6501: Natural Language Processing 2
What is a language model? Probability distributions over sentences (i.e., word sequences) P(W) = P(x_1 x_2 x_3 x_4 … x_l) Can use them to generate strings P(x_l | x_1 x_2 x_3 x_4 … x_{l-1}) Rank possible sentences P("Today is Tuesday") > P("Tuesday Today is") P("Today is Tuesday") > P("Today is Virginia") CS 6501: Natural Language Processing 3
Language model applications Context-sensitive spelling correction CS 6501: Natural Language Processing 4
Language model applications Autocomplete CS 6501: Natural Language Processing 5
Language model applications Smart Reply CS 6501: Natural Language Processing 6
Language model applications Language generation https://pdos.csail.mit.edu/archive/scigen/ CS 6501: Natural Language Processing 7
Bag-of-Words with N-grams N-grams: a contiguous sequence of n tokens from a given piece of text http://recognize-speech.com/language-model/n-gram-model/comparison CS 6501: Natural Language Processing 8
N-Gram Models Unigram model: P(x_1) P(x_2) P(x_3) … P(x_n) Bigram model: P(x_1) P(x_2|x_1) P(x_3|x_2) … P(x_n|x_{n-1}) Trigram model: P(x_1) P(x_2|x_1) P(x_3|x_2, x_1) … P(x_n|x_{n-1}, x_{n-2}) N-gram model: P(x_1) P(x_2|x_1) … P(x_n|x_{n-1} x_{n-2} … x_{n-N+1}) CS 6501: Natural Language Processing 9
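As a minimal sketch (not from the slides), the general factorization can be written as a short Python function; `cond_prob` is a hypothetical lookup that returns P(word | context):

```python
def ngram_sentence_prob(words, n, cond_prob):
    """P(x_1 ... x_m) under an n-gram model: each word is conditioned
    on at most its n-1 predecessors."""
    prob = 1.0
    for i, w in enumerate(words):
        context = tuple(words[max(0, i - (n - 1)):i])  # last n-1 words (fewer at the sentence start)
        prob *= cond_prob(w, context)
    return prob

# n=1 recovers the unigram model, n=2 the bigram model, n=3 the trigram model.
```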
Random language via n-gram http://www.cs.jhu.edu/~jason/465/PowerPoint/lect01,3tr-ngram-gen.pdf Behind the scenes – probability theory CS 6501: Natural Language Processing 10
Sampling with replacement [Figure of colored balls omitted] 1. P(·) = ? 2. P(·) = ? 3. P(red, ·) = ? 4. P(blue) = ? 5. P(red | ·) = ? 6. P(· | red) = ? 7. P(·) = ? 8. P(·) = ? 9. P(2 × ·, 3 × ·, 4 × ·) = ? CS 6501: Natural Language Processing 11
Sampling words with replacement Example from Julia Hockenmaier, Intro to NLP CS 6501: Natural Language Processing 12
Implementation: how to sample? Sample from a discrete distribution p(X) Assume n outcomes in the event space X 1. Divide the interval [0,1] into n intervals according to the probabilities of the outcomes 2. Generate a random number r between 0 and 1 3. Return x_i, where the i-th interval is the one that r falls into CS 6501: Natural Language Processing 13
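A minimal Python sketch of this procedure, assuming the distribution is given as a dictionary mapping outcomes to probabilities (the example distribution is made up for illustration):

```python
import random

def sample(dist):
    """Sample one outcome from a discrete distribution given as
    {outcome: probability}; probabilities are assumed to sum to 1."""
    r = random.random()          # step 2: random number in [0, 1)
    cumulative = 0.0
    for outcome, p in dist.items():
        cumulative += p          # step 1: lay the outcomes along [0, 1]
        if r < cumulative:       # step 3: return the interval r falls into
            return outcome
    return outcome               # guard against floating-point round-off

# illustrative unigram distribution
print(sample({"the": 0.5, "today": 0.3, "is": 0.2}))
```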
Conditional on the previous word Example from Julia Hockenmaier, Intro to NLP CS 6501: Natural Language Processing 14
Conditional on the previous word Example from Julia Hockenmaier, Intro to NLP CS 6501: Natural Language Processing 15
Recap: Probability Theory Conditional probability: P(blue | ·) = ? P(A | B) = P(A, B) / P(B) Bayes' rule: P(A | B) = P(B | A) P(A) / P(B) Verify: P(red | ·), P(· | red), P(·), P(red) Independence: P(A | B) = P(A) Prove: if A and B are independent, P(A, B) = P(A) P(B) CS 6501: Natural Language Processing 16
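The requested proof is one line from the definition of conditional probability (standard probability, not specific to these slides):

```latex
P(A, B) = P(A \mid B)\, P(B) = P(A)\, P(B),
\quad \text{since independence means } P(A \mid B) = P(A).
```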
The Chain Rule The joint probability can be expressed in terms of the conditional probability: P(X, Y) = P(X | Y) P(Y) More variables: P(X, Y, Z) = P(X | Y, Z) P(Y, Z) = P(X | Y, Z) P(Y | Z) P(Z) P(X_1, X_2, …, X_n) = P(X_1) P(X_2 | X_1) P(X_3 | X_2, X_1) … P(X_n | X_1, …, X_{n-1}) = P(X_1) ∏_{i=2}^{n} P(X_i | X_1, …, X_{i-1}) CS 6501: Natural Language Processing 17
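For instance, applied to the three-word sentence from an earlier slide, the chain rule gives an exact rewriting with no modeling assumptions yet:

```latex
P(\text{Today is Tuesday})
  = P(\text{Today}) \cdot P(\text{is} \mid \text{Today})
    \cdot P(\text{Tuesday} \mid \text{Today, is})
```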
Language model for text Probability distribution over sentences Chain rule: from conditional probability to joint probability p(x_1 x_2 … x_n) = p(x_1) p(x_2 | x_1) p(x_3 | x_1, x_2) … p(x_n | x_1, x_2, …, x_{n-1}) Complexity: O(V^{n*}), where V is the vocabulary size and n* is the maximum sentence length We need independence assumptions! A rough estimate: 475,000 main headwords in Webster's Third New International Dictionary; average English sentence length is 14.3 words, so roughly O(475000^14) parameters How large is this? 475000^14 values × 8 bytes / 1024^4 ≈ 3.38e66 TB CS 6501: Natural Language Processing 18
Probability models Building a probability model: defining the model (making independence assumptions) estimating the model's parameters using the model (making inferences) Example: a trigram model definition is stated in terms of parameters like P("is" | "today"); estimation supplies the parameter values Θ CS 6501: Natural Language Processing 19
Independence assumption Even though X and Y are not actually independent, we treat them as if they were This makes the model compact (e.g., from 100k^14 parameters down to 100k^2) CS 6501: Natural Language Processing 20
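As a rough sanity check of that reduction (a minimal sketch; the 100,000-word vocabulary is just the slide's illustrative figure):

```python
V = 100_000                      # illustrative vocabulary size from the slide
full_joint = V ** 14             # one parameter per possible 14-word sentence
bigram = V ** 2                  # one parameter per (previous word, word) pair

print(f"full joint over 14-word sentences: {full_joint:.2e} parameters")  # ~1e70
print(f"bigram model:                      {bigram:.2e} parameters")      # 1e10
```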
Language model with N-gram The chain rule: P(X_1, X_2, …, X_n) = P(X_1) P(X_2 | X_1) P(X_3 | X_2, X_1) … P(X_n | X_1, …, X_{n-1}) N-gram language model assumes each word depends only on the last n-1 words (Markov assumption) CS 6501: Natural Language Processing 21
Language model with N-gram Example: trigram (3-gram) P(x_n | x_1, …, x_{n-1}) = P(x_n | x_{n-2}, x_{n-1}) P(x_1, …, x_n) = P(x_1) P(x_2 | x_1) … P(x_n | x_{n-2}, x_{n-1}) P("Today is a sunny day") = P("Today") P("is" | "Today") P("a" | "is", "Today") … P("day" | "sunny", "a") CS 6501: Natural Language Processing 22
Unigram model CS 6501: Natural Language Processing 23
Bigram model Condition on the previous word CS 6501: Natural Language Processing 24
N-gram model CS 6501: Natural Language Processing 25
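A minimal sketch of how such a model generates text, here for the bigram case, assuming the conditional probabilities are already stored in a nested dictionary; the tiny probability table, the `<s>`/`</s>` boundary symbols, and the function name are illustrative, not from the slides:

```python
import random

# hypothetical bigram table: bigram_probs[prev][word] = P(word | prev)
bigram_probs = {
    "<s>":     {"today": 0.6, "it": 0.4},
    "today":   {"is": 1.0},
    "it":      {"is": 1.0},
    "is":      {"sunny": 0.5, "tuesday": 0.5},
    "sunny":   {"</s>": 1.0},
    "tuesday": {"</s>": 1.0},
}

def generate(max_len=20):
    """Sample a sentence word by word, each word conditioned on the previous one."""
    prev, words = "<s>", []
    for _ in range(max_len):
        dist = bigram_probs[prev]
        word = random.choices(list(dist), weights=dist.values())[0]
        if word == "</s>":
            break
        words.append(word)
        prev = word
    return " ".join(words)

print(generate())   # e.g. "today is sunny"
```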
More examples Yoav's blog post: http://nbviewer.jupyter.org/gist/yoavg/d76121dfde2618422139 10-gram character-level LM: First Citizen: Nay, then, that was hers, It speaks against your other service: But since the youth of the circumstance be spoken: Your uncle and one Baptista's daughter. SEBASTIAN: Do I stand till the break off. BIRON: Hide thy head. CS 6501: Natural Language Processing 26
More examples Yoav's blog post: http://nbviewer.jupyter.org/gist/yoavg/d76121dfde2618422139 10-gram character-level LM: ~~/* * linux/kernel/time.c * Please report this on hardware. */ void irq_mark_irq(unsigned long old_entries, eval); /* * Divide only 1000 for ns^2 -> us^2 conversion values don't overflow: seq_puts(m, "\ttramp: %pS", (void *)class->contending_point]++; if (likely(t->flags & WQ_UNBOUND)) { /* * Update inode information. If the * slowpath and sleep time (abs or rel) * @rmtp: remaining (either due * to consume the state of ring buffer size. */ header_size - size, in bytes, of the chain. */ BUG_ON(!error); } while (cgrp) { if (old) { if (kdb_continue_catastrophic; #endif CS 6501: Natural Language Processing 27
Questions? CS 6501: Natural Language Processing 28
Maximum likelihood Estimation "Best" means "data likelihood reaches maximum": θ̂ = argmax_θ P(X | θ) Unigram language model estimation from a document (a paper, total #words = 100): word counts: text 10, mining 5, association 3, database 3, algorithm 2, query 1, efficient 1, … Estimated unigram LM p(w | θ): text 10/100, mining 5/100, association 3/100, database 3/100, algorithm 2/100, query 1/100, … CS 6501: Natural Language Processing 29
Which bag of words is more likely to generate "aaaDaaaKoaaaa"? [Figure of two bags of letters omitted; the letters shown include a K a K a o o P D a a a a D F E b a E a n] CS 6501: Natural Language Processing 30
Parameter estimation General setting: Given a (hypothesized & probabilistic) model that governs the random experiment The model gives a probability of any data p(X|θ) that depends on the parameter θ Now, given actual sample data X = {x_1, …, x_n}, what can we say about the value of θ? Intuitively, take our best guess of θ -- "best" means "best explaining/fitting the data" Generally an optimization problem CS 6501: Natural Language Processing 31
Maximum likelihood estimation Data: a collection of words w_1, w_2, …, w_n Model: multinomial distribution p(W) with parameters θ_i = p(w_i) Maximum likelihood estimator: θ̂ = argmax_{θ∈Θ} p(X|θ) p(X|θ) = (N choose c(w_1), …, c(w_N)) ∏_{i=1}^{N} θ_i^{c(w_i)} ∝ ∏_{i=1}^{N} θ_i^{c(w_i)} ⇒ log p(X|θ) = Σ_{i=1}^{N} c(w_i) log θ_i + const θ̂ = argmax_{θ∈Θ} Σ_{i=1}^{N} c(w_i) log θ_i, where c(w_i) is the count of word w_i in the data CS 6501: Natural Language Processing 32
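Solving this constrained optimization (the probabilities must sum to 1) gives the relative-frequency estimate θ̂_i = c(w_i) / N. A minimal sketch in Python, with a made-up toy corpus:

```python
from collections import Counter

def mle_unigram(words):
    """Maximum likelihood estimate of a unigram model: relative frequencies."""
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# toy corpus (illustrative)
corpus = "text mining text retrieval text".split()
print(mle_unigram(corpus))   # {'text': 0.6, 'mining': 0.2, 'retrieval': 0.2}
```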