CS 533: Natural Language Processing
Language Modeling
Karl Stratos, Rutgers University
Motivation

How likely are the following sentences?
◮ the dog barked
◮ the cat barked
◮ dog the barked
◮ oqc shgwqw#w 1g0
Motivation

How likely are the following sentences?
◮ the dog barked        "probability 0.1"
◮ the cat barked        "probability 0.03"
◮ dog the barked        "probability 0.00005"
◮ oqc shgwqw#w 1g0      "probability 10^{-13}"
Language Model: Definition

A language model is a function that defines a probability distribution p(x_1 ... x_m) over all sentences x_1 ... x_m.

Goal: Design a good language model, in particular one with

  p(the dog barked) > p(the cat barked) > p(dog the barked) > p(oqc shgwqw#w 1g0)
Language Models Are Everywhere
Text Generation with Modern Language Models

Try it yourself: https://talktotransformer.com/
Overview

Probability of a Sentence

n-Gram Language Models
  Unigram, Bigram, Trigram Models
  Estimation from Data
  Evaluation
  Smoothing

Log-Linear Language Models
Problem Statement

◮ We'll assume a finite vocabulary V (i.e., the set of all possible word types).
◮ Sample space: Ω = { x_1 ... x_m ∈ V^m : m ≥ 1 }
◮ Task: Design a function p over Ω such that

    p(x_1 ... x_m) ≥ 0   ∀ x_1 ... x_m ∈ Ω
    Σ_{x_1 ... x_m ∈ Ω} p(x_1 ... x_m) = 1

◮ What are some challenges?
Challenge 1: Infinitely Many Sentences

◮ Can we "break up" the probability of a sentence into probabilities of individual words?
◮ Yes: Assume a generative process.
◮ We may assume that each sentence x_1 ... x_m is generated as
  (1) x_1 is drawn from p(·),
  (2) x_2 is drawn from p(· | x_1),
  (3) x_3 is drawn from p(· | x_1, x_2),
  ...
  (m) x_m is drawn from p(· | x_1, ..., x_{m-1}),
  (m+1) x_{m+1} is drawn from p(· | x_1, ..., x_m),
  where x_{m+1} = STOP is a special token at the end of every sentence.
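To make the generative process concrete, here is a small sketch (my own illustration, not from the slides): it draws words one at a time until STOP comes up, using a made-up stand-in for the conditional distributions p(· | x_1, ..., x_{i-1}).

```python
import random

STOP = "<STOP>"
VOCAB = ["the", "dog", "cat", "barked"]

def next_word_dist(history):
    # Toy stand-in for p(. | x_1, ..., x_{i-1}); a real model conditions on
    # the full history. Here: never stop on the first word, otherwise stop
    # with probability 0.3 and spread the rest uniformly over the vocabulary.
    if not history:
        return {w: 1.0 / len(VOCAB) for w in VOCAB}
    dist = {w: 0.7 / len(VOCAB) for w in VOCAB}
    dist[STOP] = 0.3
    return dist

def sample_sentence(max_len=20):
    history = []
    while len(history) < max_len:
        dist = next_word_dist(history)
        words, probs = zip(*dist.items())
        word = random.choices(words, weights=probs)[0]  # draw x_i
        if word == STOP:
            break
        history.append(word)
    return history

print(sample_sentence())
```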
Justification of the Generative Assumption

By the chain rule,

  p(x_1 ... x_m STOP) = p(x_1) × p(x_2 | x_1) × p(x_3 | x_1, x_2) × ···
                        ··· × p(x_m | x_1, ..., x_{m-1}) × p(STOP | x_1, ..., x_m)

Thus we have solved the first challenge.
◮ Each word is drawn from the finite set V (plus STOP).
◮ The model still defines a proper distribution over all sentences.

(Does the generative process need to be left-to-right?)
STOP Symbol

Ensures that there is probability mass left for longer sentences.

Probability mass of sentences with length ≥ 1:

  1 − p(STOP) = 1,   where p(STOP) = P(X_1 = STOP) = 0

Probability mass of sentences with length ≥ 2:

  1 − Σ_{x ∈ V} p(x STOP) > 0,   where the sum is P(X_2 = STOP)

Probability mass of sentences with length ≥ 3:

  1 − Σ_{x ∈ V} p(x STOP) − Σ_{x,x' ∈ V} p(x x' STOP) > 0,   where the sums are P(X_2 = STOP) and P(X_3 = STOP)
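To see the mass accounting numerically, the following sketch (a toy model of my own, not from the slides) enumerates all sentences up to a length cutoff under a simple generative process: the total mass approaches 1 while the mass reserved for longer sentences stays positive at every cutoff.

```python
from itertools import product

STOP = "<STOP>"
VOCAB = ["a", "b"]

def cond_prob(word, history):
    # Toy conditional p(word | history): the first word is never STOP;
    # afterwards STOP has probability 0.3 and the rest is split uniformly.
    if not history:
        return 0.0 if word == STOP else 1.0 / len(VOCAB)
    if word == STOP:
        return 0.3
    return 0.7 / len(VOCAB)

def sentence_prob(words):
    # p(x_1 ... x_m STOP) = prod_i p(x_i | x_1..x_{i-1}) * p(STOP | x_1..x_m)
    prob, history = 1.0, []
    for w in words:
        prob *= cond_prob(w, history)
        history.append(w)
    return prob * cond_prob(STOP, history)

total = 0.0
for m in range(1, 11):  # sentences of length 1..10
    mass_m = sum(sentence_prob(s) for s in product(VOCAB, repeat=m))
    total += mass_m
    print(f"mass of length-{m} sentences: {mass_m:.4f}, "
          f"mass left for longer sentences: {1 - total:.4f}")
```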
Challenge 2: Infinitely Many Distributions

Under the generative process, we need infinitely many conditional word distributions:

  p(x_1)                  ∀ x_1 ∈ V
  p(x_2 | x_1)            ∀ x_1, x_2 ∈ V
  p(x_3 | x_1, x_2)       ∀ x_1, x_2, x_3 ∈ V
  p(x_4 | x_1, x_2, x_3)  ∀ x_1, x_2, x_3, x_4 ∈ V
  ...

Now our goal is to redesign the model to have only a finite, compact set of associated values.
Overview

Probability of a Sentence

n-Gram Language Models
  Unigram, Bigram, Trigram Models
  Estimation from Data
  Evaluation
  Smoothing

Log-Linear Language Models
Independence Assumptions

X is independent of Y if

  P(X = x | Y = y) = P(X = x)

X is conditionally independent of Y given Z if

  P(X = x | Y = y, Z = z) = P(X = x | Z = z)

Can you think of such X, Y, Z?
Unigram Language Model

Assumption. A word is independent of all previous words:

  p(x_i | x_1 ... x_{i-1}) = p(x_i)

That is,

  p(x_1 ... x_m) = Π_{i=1}^{m} p(x_i)

Number of parameters: O(|V|)

Not a very good language model:  p(the dog barked) = p(dog the barked)
Bigram Language Model

Assumption. A word is independent of all previous words conditioned on the preceding word:

  p(x_i | x_1 ... x_{i-1}) = p(x_i | x_{i-1})

That is,

  p(x_1 ... x_m) = Π_{i=1}^{m} p(x_i | x_{i-1})

where x_0 = * is a special token at the start of every sentence.

Number of parameters: O(|V|^2)
Trigram Language Model

Assumption. A word is independent of all previous words conditioned on the two preceding words:

  p(x_i | x_1 ... x_{i-1}) = p(x_i | x_{i-2}, x_{i-1})

That is,

  p(x_1 ... x_m) = Π_{i=1}^{m} p(x_i | x_{i-2}, x_{i-1})

where x_{-1}, x_0 = * are special tokens at the start of every sentence.

Number of parameters: O(|V|^3)
n-Gram Language Model

Assumption. A word is independent of all previous words conditioned on the n−1 preceding words:

  p(x_i | x_1 ... x_{i-1}) = p(x_i | x_{i-n+1}, ..., x_{i-1})

Number of parameters: O(|V|^n)

This kind of conditional independence assumption ("depends only on the last n−1 states...") is called a Markov assumption.
◮ Is this a reasonable assumption for language modeling?
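To see what the Markov assumption buys computationally, here is a small sketch (my own illustration; the table q of conditional probabilities is a hypothetical stand-in) that scores a sentence under an n-gram model by truncating the history to the last n−1 words.

```python
START, STOP = "<*>", "<STOP>"

def ngram_sentence_prob(words, q, n):
    """Score p(x_1 ... x_m STOP) under an n-gram model.

    q maps (context tuple of length n-1, word) -> conditional probability;
    contexts are padded with START on the left.
    """
    padded = [START] * (n - 1) + list(words) + [STOP]
    prob = 1.0
    for i in range(n - 1, len(padded)):
        context = tuple(padded[i - n + 1 : i])  # only the n-1 preceding words
        prob *= q.get((context, padded[i]), 0.0)
        if prob == 0.0:
            break
    return prob

# Tiny hypothetical bigram table (n = 2), just to exercise the function.
q = {
    (("<*>",), "the"): 1.0,
    (("the",), "dog"): 0.4,
    (("dog",), "barked"): 0.5,
    (("barked",), "<STOP>"): 1.0,
}
print(ngram_sentence_prob(["the", "dog", "barked"], q, n=2))  # 0.2
```

The same function covers unigram (n = 1), bigram (n = 2), and trigram (n = 3) models; only the size of the stored table changes, which is where the O(|V|^n) parameter count comes from.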
Overview

Probability of a Sentence

n-Gram Language Models
  Unigram, Bigram, Trigram Models
  Estimation from Data
  Evaluation
  Smoothing

Log-Linear Language Models
A Practical Question

◮ Summary so far: We have designed probabilistic language models parametrized by finitely many values.
◮ Bigram model: Stores a table of O(|V|^2) values

    q(x' | x)   ∀ x, x' ∈ V

  (plus q(x | *) and q(STOP | x)) representing transition probabilities, and computes

    p(the cat barked) = q(the | *) × q(cat | the) × q(barked | cat) × q(STOP | barked)

◮ Q. But where do we get these values?
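One practical note not on the slide: multiplying many small q values underflows quickly, so implementations typically sum log probabilities instead. A minimal sketch, with a made-up transition table:

```python
import math

# Hypothetical bigram transition probabilities q(x' | x), stored per context.
q = {
    "*":      {"the": 0.5, "a": 0.5},
    "the":    {"cat": 0.2, "dog": 0.3},
    "cat":    {"barked": 0.1, "<STOP>": 0.4},
    "barked": {"<STOP>": 0.8},
}

def bigram_log_prob(words):
    # log p(x_1 ... x_m STOP) = sum_i log q(x_i | x_{i-1}), with x_0 = *.
    pairs = zip(["*"] + list(words), list(words) + ["<STOP>"])
    return sum(math.log(q[prev][cur]) for prev, cur in pairs)

log_p = bigram_log_prob(["the", "cat", "barked"])
print(log_p, math.exp(log_p))  # same value as multiplying the q's directly
```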
Estimation from Data

◮ Our data is a corpus of N sentences x^{(1)} ... x^{(N)}.
◮ Define count(x, x') to be the number of times x, x' appear together (called "bigram counts"):

    count(x, x') = Σ_{i=1}^{N} Σ_{j=1}^{l_i + 1} 1[ x_{j-1}^{(i)} = x,  x_j^{(i)} = x' ]

  (l_i = length of x^{(i)}, and x_{l_i+1}^{(i)} = STOP)

◮ Define count(x) := Σ_{x'} count(x, x') (called "unigram counts").
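A small sketch of collecting these counts from a tokenized corpus (function and variable names are mine, not from the slides):

```python
from collections import Counter

START, STOP = "<*>", "<STOP>"

def collect_counts(corpus):
    """corpus: list of sentences, each a list of word tokens.

    Returns (bigram_counts, unigram_counts), where bigram_counts[(x, x')]
    is count(x, x') and unigram_counts[x] = sum over x' of count(x, x').
    """
    bigram_counts, unigram_counts = Counter(), Counter()
    for sentence in corpus:
        padded = [START] + list(sentence) + [STOP]
        for prev, cur in zip(padded, padded[1:]):
            bigram_counts[(prev, cur)] += 1
            unigram_counts[prev] += 1   # count(x) = sum_{x'} count(x, x')
    return bigram_counts, unigram_counts
```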
Example Counts

Corpus:
◮ the dog chased the cat
◮ the cat chased the mouse
◮ the mouse chased the dog

Example bigram/unigram counts:

  count(x_0, the) = 3        count(the) = 6
  count(chased, the) = 3     count(chased) = 3
  count(the, dog) = 2        count(x_0) = 3
  count(cat, STOP) = 1       count(cat) = 2
Parameter Estimates

◮ For all x, x' with count(x, x') > 0, set

    q(x' | x) = count(x, x') / count(x)

  Otherwise q(x' | x) = 0.
◮ In the previous example:

    q(the | x_0) = 3/3 = 1        q(chased | dog) = 1/2 = 0.5
    q(dog | the) = 2/6 ≈ 0.33     q(STOP | cat) = 1/2 = 0.5
    q(dog | cat) = 0

◮ Called maximum likelihood estimation (MLE).
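Putting the counting and the MLE formula together on the toy corpus above (a sketch; the helper names are mine):

```python
from collections import Counter

START, STOP = "<*>", "<STOP>"
corpus = [
    "the dog chased the cat".split(),
    "the cat chased the mouse".split(),
    "the mouse chased the dog".split(),
]

bigram_counts, unigram_counts = Counter(), Counter()
for sentence in corpus:
    padded = [START] + sentence + [STOP]
    for prev, cur in zip(padded, padded[1:]):
        bigram_counts[(prev, cur)] += 1
        unigram_counts[prev] += 1

def q(cur, prev):
    # MLE estimate q(x' | x) = count(x, x') / count(x), and 0 for unseen bigrams.
    if bigram_counts[(prev, cur)] == 0:
        return 0.0
    return bigram_counts[(prev, cur)] / unigram_counts[prev]

print(q("the", START))   # 3/3 = 1.0
print(q("dog", "the"))   # 2/6 = 0.333...
print(q(STOP, "cat"))    # 1/2 = 0.5
print(q("dog", "cat"))   # unseen bigram -> 0.0
```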
Justification of MLE

Claim. The solution of the constrained optimization problem

  q* = argmax_q  Σ_{i=1}^{N} Σ_{j=1}^{l_i + 1} log q(x_j | x_{j-1})

  subject to  q(x' | x) ≥ 0  ∀ x, x'   and   Σ_{x' ∈ V} q(x' | x) = 1  ∀ x

is given by

  q*(x' | x) = count(x, x') / count(x)

(Proof?)
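One standard way to verify the claim (a proof sketch added here, not from the slides) is to rewrite the objective in terms of counts and attach a Lagrange multiplier to each normalization constraint:

```latex
% Group identical bigrams: the objective becomes
%   L(q) = \sum_{x,x'} \mathrm{count}(x,x') \log q(x' \mid x).
% Add a multiplier \lambda_x per constraint \sum_{x'} q(x' \mid x) = 1 and
% set the derivative with respect to q(x' \mid x) to zero:
\[
  \frac{\partial}{\partial q(x' \mid x)}
  \Bigl[ L(q) - \sum_{x} \lambda_x \Bigl( \sum_{x'} q(x' \mid x) - 1 \Bigr) \Bigr]
  = \frac{\mathrm{count}(x,x')}{q(x' \mid x)} - \lambda_x = 0
  \quad\Longrightarrow\quad
  q(x' \mid x) = \frac{\mathrm{count}(x,x')}{\lambda_x}.
\]
% Enforcing the constraint gives \lambda_x = \sum_{x'} \mathrm{count}(x,x')
% = \mathrm{count}(x), hence q^*(x' \mid x) = \mathrm{count}(x,x') / \mathrm{count}(x);
% the nonnegativity constraints hold automatically since counts are nonnegative.
```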
MLE: Other n-Gram Models

  Unigram:  q(x) = count(x) / N   (here N is the total number of tokens, not the number of sentences)

  Bigram:   q(x' | x) = count(x, x') / count(x)

  Trigram:  q(x'' | x, x') = count(x, x', x'') / count(x, x')
Overview

Probability of a Sentence

n-Gram Language Models
  Unigram, Bigram, Trigram Models
  Estimation from Data
  Evaluation
  Smoothing

Log-Linear Language Models
Evaluation of a Language Model

"How good is the model at predicting unseen sentences?"

Held-out corpus: Used for evaluation purposes only.
Do not use held-out data for training the model!

Popular evaluation metric: perplexity
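As a preview of the metric, one common formulation of perplexity is 2 raised to the negative average per-token log2 probability of the held-out corpus. A sketch under that convention, assuming a sentence-level log2-probability function like the bigram scorer above (the scorer used here is a fake stand-in):

```python
import math

def perplexity(heldout, sentence_log2_prob):
    """heldout: list of tokenized sentences.
    sentence_log2_prob: function returning log2 p(x_1 ... x_m STOP) under the model.

    One common convention: perplexity = 2^(-total log2 prob / total token count),
    where each sentence contributes len(sentence) + 1 tokens (the +1 for STOP).
    """
    total_log2 = sum(sentence_log2_prob(s) for s in heldout)
    total_tokens = sum(len(s) + 1 for s in heldout)
    return 2.0 ** (-total_log2 / total_tokens)

# Fake scorer that assigns probability 1/8 to every token, for illustration only.
fake_scorer = lambda s: (len(s) + 1) * math.log2(1.0 / 8.0)
print(perplexity([["the", "dog", "barked"]], fake_scorer))  # 8.0
```

Lower perplexity means the model assigns higher probability to the held-out text; a uniform model over k choices per token has perplexity k.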