CS 533: Natural Language Processing
Language Modeling
Karl Stratos, Rutgers University
Motivation

How likely are the following sentences?
◮ the dog barked
◮ the cat barked
◮ dog the barked
◮ oqc shgwqw#w 1g0
Motivation

How likely are the following sentences?
◮ the dog barked        "probability 0.1"
◮ the cat barked        "probability 0.03"
◮ dog the barked        "probability 0.00005"
◮ oqc shgwqw#w 1g0      "probability 10^{-13}"
Language Model: Definition

A language model is a function that defines a probability distribution p(x_1 ... x_m) over all sentences x_1 ... x_m.

Goal: Design a good language model, in particular one with

  p(the dog barked) > p(the cat barked) > p(dog the barked) > p(oqc shgwqw#w 1g0)
Language Models Are Everywhere
Text Generation with Modern Language Models

Try it yourself: https://talktotransformer.com/
Overview

Probability of a Sentence

n-Gram Language Models
  Unigram, Bigram, Trigram Models
  Estimation from Data
  Evaluation
  Smoothing

Log-Linear Language Models
Problem Statement

◮ We'll assume a finite vocabulary V (i.e., the set of all possible word types).
◮ Sample space: Ω = { x_1 ... x_m ∈ V^m : m ≥ 1 }
◮ Task: Design a function p over Ω such that

    p(x_1 ... x_m) ≥ 0   ∀ x_1 ... x_m ∈ Ω
    Σ_{x_1 ... x_m ∈ Ω} p(x_1 ... x_m) = 1

◮ What are some challenges?
Challenge 1: Infinitely Many Sentences

◮ Can we "break up" the probability of a sentence into probabilities of individual words?
◮ Yes: Assume a generative process.
◮ We may assume that each sentence x_1 ... x_m is generated as
  (1) x_1 is drawn from p(·),
  (2) x_2 is drawn from p(· | x_1),
  (3) x_3 is drawn from p(· | x_1, x_2),
  ...
  (m) x_m is drawn from p(· | x_1, ..., x_{m-1}),
  (m+1) x_{m+1} is drawn from p(· | x_1, ..., x_m),
  where x_{m+1} = STOP is a special token at the end of every sentence.
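To make the generative process concrete, here is a small sketch (my own illustration, not from the slides): it draws words one at a time until STOP comes up, using a made-up stand-in for the conditional distributions p(· | x_1, ..., x_{i-1}).

```python
import random

STOP = "<STOP>"
VOCAB = ["the", "dog", "cat", "barked"]

def next_word_dist(history):
    # Toy stand-in for p(. | x_1, ..., x_{i-1}); a real model conditions on
    # the full history. Here: never stop on the first word, otherwise stop
    # with probability 0.3 and spread the rest uniformly over the vocabulary.
    if not history:
        return {w: 1.0 / len(VOCAB) for w in VOCAB}
    dist = {w: 0.7 / len(VOCAB) for w in VOCAB}
    dist[STOP] = 0.3
    return dist

def sample_sentence(max_len=20):
    history = []
    while len(history) < max_len:
        dist = next_word_dist(history)
        words, probs = zip(*dist.items())
        word = random.choices(words, weights=probs)[0]  # draw x_i
        if word == STOP:
            break
        history.append(word)
    return history

print(sample_sentence())
```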
Justification of the Generative Assumption

By the chain rule,

  p(x_1 ... x_m STOP) = p(x_1) × p(x_2 | x_1) × p(x_3 | x_1, x_2) × ···
                        ··· × p(x_m | x_1, ..., x_{m-1}) × p(STOP | x_1, ..., x_m)

Thus we have solved the first challenge.
◮ Each word is drawn from the finite set V (plus STOP).
◮ The model still defines a proper distribution over all sentences.

(Does the generative process need to be left-to-right?)
STOP Symbol

Ensures that there is probability mass left for longer sentences.

Probability mass of sentences with length ≥ 1:

  1 − p(STOP) = 1,   where p(STOP) = P(X_1 = STOP) = 0

Probability mass of sentences with length ≥ 2:

  1 − Σ_{x ∈ V} p(x STOP) > 0,   where the sum is P(X_2 = STOP)

Probability mass of sentences with length ≥ 3:

  1 − Σ_{x ∈ V} p(x STOP) − Σ_{x,x' ∈ V} p(x x' STOP) > 0,   where the sums are P(X_2 = STOP) and P(X_3 = STOP)
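To see the mass accounting numerically, the following sketch (a toy model of my own, not from the slides) enumerates all sentences up to a length cutoff under a simple generative process: the total mass approaches 1 while the mass reserved for longer sentences stays positive at every cutoff.

```python
from itertools import product

STOP = "<STOP>"
VOCAB = ["a", "b"]

def cond_prob(word, history):
    # Toy conditional p(word | history): the first word is never STOP;
    # afterwards STOP has probability 0.3 and the rest is split uniformly.
    if not history:
        return 0.0 if word == STOP else 1.0 / len(VOCAB)
    if word == STOP:
        return 0.3
    return 0.7 / len(VOCAB)

def sentence_prob(words):
    # p(x_1 ... x_m STOP) = prod_i p(x_i | x_1..x_{i-1}) * p(STOP | x_1..x_m)
    prob, history = 1.0, []
    for w in words:
        prob *= cond_prob(w, history)
        history.append(w)
    return prob * cond_prob(STOP, history)

total = 0.0
for m in range(1, 11):  # sentences of length 1..10
    mass_m = sum(sentence_prob(s) for s in product(VOCAB, repeat=m))
    total += mass_m
    print(f"mass of length-{m} sentences: {mass_m:.4f}, "
          f"mass left for longer sentences: {1 - total:.4f}")
```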
Challenge 2: Infinitely Many Distributions

Under the generative process, we need infinitely many conditional word distributions:

  p(x_1)                  ∀ x_1 ∈ V
  p(x_2 | x_1)            ∀ x_1, x_2 ∈ V
  p(x_3 | x_1, x_2)       ∀ x_1, x_2, x_3 ∈ V
  p(x_4 | x_1, x_2, x_3)  ∀ x_1, x_2, x_3, x_4 ∈ V
  ...

Now our goal is to redesign the model to have only a finite, compact set of associated values.
Overview

Probability of a Sentence

n-Gram Language Models
  Unigram, Bigram, Trigram Models
  Estimation from Data
  Evaluation
  Smoothing

Log-Linear Language Models
Independence Assumptions

X is independent of Y if

  P(X = x | Y = y) = P(X = x)

X is conditionally independent of Y given Z if

  P(X = x | Y = y, Z = z) = P(X = x | Z = z)

Can you think of such X, Y, Z?
Unigram Language Model

Assumption. A word is independent of all previous words:

  p(x_i | x_1 ... x_{i-1}) = p(x_i)

That is,

  p(x_1 ... x_m) = Π_{i=1}^{m} p(x_i)

Number of parameters: O(|V|)

Not a very good language model:  p(the dog barked) = p(dog the barked)
Bigram Language Model

Assumption. A word is independent of all previous words conditioned on the preceding word:

  p(x_i | x_1 ... x_{i-1}) = p(x_i | x_{i-1})

That is,

  p(x_1 ... x_m) = Π_{i=1}^{m} p(x_i | x_{i-1})

where x_0 = * is a special token at the start of every sentence.

Number of parameters: O(|V|^2)
Trigram Language Model

Assumption. A word is independent of all previous words conditioned on the two preceding words:

  p(x_i | x_1 ... x_{i-1}) = p(x_i | x_{i-2}, x_{i-1})

That is,

  p(x_1 ... x_m) = Π_{i=1}^{m} p(x_i | x_{i-2}, x_{i-1})

where x_{-1}, x_0 = * are special tokens at the start of every sentence.

Number of parameters: O(|V|^3)
n-Gram Language Model

Assumption. A word is independent of all previous words conditioned on the n−1 preceding words:

  p(x_i | x_1 ... x_{i-1}) = p(x_i | x_{i-n+1}, ..., x_{i-1})

Number of parameters: O(|V|^n)

This kind of conditional independence assumption ("depends only on the last n−1 states...") is called a Markov assumption.
◮ Is this a reasonable assumption for language modeling?
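To see what the Markov assumption buys computationally, here is a small sketch (my own illustration; the table q of conditional probabilities is a hypothetical stand-in) that scores a sentence under an n-gram model by truncating the history to the last n−1 words.

```python
START, STOP = "<*>", "<STOP>"

def ngram_sentence_prob(words, q, n):
    """Score p(x_1 ... x_m STOP) under an n-gram model.

    q maps (context tuple of length n-1, word) -> conditional probability;
    contexts are padded with START on the left.
    """
    padded = [START] * (n - 1) + list(words) + [STOP]
    prob = 1.0
    for i in range(n - 1, len(padded)):
        context = tuple(padded[i - n + 1 : i])  # only the n-1 preceding words
        prob *= q.get((context, padded[i]), 0.0)
        if prob == 0.0:
            break
    return prob

# Tiny hypothetical bigram table (n = 2), just to exercise the function.
q = {
    (("<*>",), "the"): 1.0,
    (("the",), "dog"): 0.4,
    (("dog",), "barked"): 0.5,
    (("barked",), "<STOP>"): 1.0,
}
print(ngram_sentence_prob(["the", "dog", "barked"], q, n=2))  # 0.2
```

The same function covers unigram (n = 1), bigram (n = 2), and trigram (n = 3) models; only the size of the stored table changes, which is where the O(|V|^n) parameter count comes from.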
Overview

Probability of a Sentence

n-Gram Language Models
  Unigram, Bigram, Trigram Models
  Estimation from Data
  Evaluation
  Smoothing

Log-Linear Language Models
A Practical Question

◮ Summary so far: We have designed probabilistic language models parametrized by finitely many values.
◮ Bigram model: Stores a table of O(|V|^2) values

    q(x' | x)   ∀ x, x' ∈ V

  (plus q(x | *) and q(STOP | x)) representing transition probabilities, and computes

    p(the cat barked) = q(the | *) × q(cat | the) × q(barked | cat) × q(STOP | barked)

◮ Q. But where do we get these values?
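One practical note not on the slide: multiplying many small q values underflows quickly, so implementations typically sum log probabilities instead. A minimal sketch, with a made-up transition table:

```python
import math

# Hypothetical bigram transition probabilities q(x' | x), stored per context.
q = {
    "*":      {"the": 0.5, "a": 0.5},
    "the":    {"cat": 0.2, "dog": 0.3},
    "cat":    {"barked": 0.1, "<STOP>": 0.4},
    "barked": {"<STOP>": 0.8},
}

def bigram_log_prob(words):
    # log p(x_1 ... x_m STOP) = sum_i log q(x_i | x_{i-1}), with x_0 = *.
    pairs = zip(["*"] + list(words), list(words) + ["<STOP>"])
    return sum(math.log(q[prev][cur]) for prev, cur in pairs)

log_p = bigram_log_prob(["the", "cat", "barked"])
print(log_p, math.exp(log_p))  # same value as multiplying the q's directly
```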
Estimation from Data

◮ Our data is a corpus of N sentences x^{(1)} ... x^{(N)}.
◮ Define count(x, x') to be the number of times x, x' appear together (called "bigram counts"):

    count(x, x') = Σ_{i=1}^{N} Σ_{j=1}^{l_i + 1} 1[ x_{j-1}^{(i)} = x,  x_j^{(i)} = x' ]

  (l_i = length of x^{(i)}, and x_{l_i+1}^{(i)} = STOP)

◮ Define count(x) := Σ_{x'} count(x, x') (called "unigram counts").
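A small sketch of collecting these counts from a tokenized corpus (function and variable names are mine, not from the slides):

```python
from collections import Counter

START, STOP = "<*>", "<STOP>"

def collect_counts(corpus):
    """corpus: list of sentences, each a list of word tokens.

    Returns (bigram_counts, unigram_counts), where bigram_counts[(x, x')]
    is count(x, x') and unigram_counts[x] = sum over x' of count(x, x').
    """
    bigram_counts, unigram_counts = Counter(), Counter()
    for sentence in corpus:
        padded = [START] + list(sentence) + [STOP]
        for prev, cur in zip(padded, padded[1:]):
            bigram_counts[(prev, cur)] += 1
            unigram_counts[prev] += 1   # count(x) = sum_{x'} count(x, x')
    return bigram_counts, unigram_counts
```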
Example Counts

Corpus:
◮ the dog chased the cat
◮ the cat chased the mouse
◮ the mouse chased the dog

Example bigram/unigram counts:

  count(x_0, the) = 3        count(the) = 6
  count(chased, the) = 3     count(chased) = 3
  count(the, dog) = 2        count(x_0) = 3
  count(cat, STOP) = 1       count(cat) = 2
Parameter Estimates

◮ For all x, x' with count(x, x') > 0, set

    q(x' | x) = count(x, x') / count(x)

  Otherwise q(x' | x) = 0.
◮ In the previous example:

    q(the | x_0) = 3/3 = 1        q(chased | dog) = 1/2 = 0.5
    q(dog | the) = 2/6 ≈ 0.33     q(STOP | cat) = 1/2 = 0.5
    q(dog | cat) = 0

◮ Called maximum likelihood estimation (MLE).
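Putting the counting and the MLE formula together on the toy corpus above (a sketch; the helper names are mine):

```python
from collections import Counter

START, STOP = "<*>", "<STOP>"
corpus = [
    "the dog chased the cat".split(),
    "the cat chased the mouse".split(),
    "the mouse chased the dog".split(),
]

bigram_counts, unigram_counts = Counter(), Counter()
for sentence in corpus:
    padded = [START] + sentence + [STOP]
    for prev, cur in zip(padded, padded[1:]):
        bigram_counts[(prev, cur)] += 1
        unigram_counts[prev] += 1

def q(cur, prev):
    # MLE estimate q(x' | x) = count(x, x') / count(x), and 0 for unseen bigrams.
    if bigram_counts[(prev, cur)] == 0:
        return 0.0
    return bigram_counts[(prev, cur)] / unigram_counts[prev]

print(q("the", START))   # 3/3 = 1.0
print(q("dog", "the"))   # 2/6 = 0.333...
print(q(STOP, "cat"))    # 1/2 = 0.5
print(q("dog", "cat"))   # unseen bigram -> 0.0
```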
Justification of MLE

Claim. The solution of the constrained optimization problem

  q* = argmax_q  Σ_{i=1}^{N} Σ_{j=1}^{l_i + 1} log q(x_j | x_{j-1})

  subject to  q(x' | x) ≥ 0  ∀ x, x'   and   Σ_{x' ∈ V} q(x' | x) = 1  ∀ x

is given by

  q*(x' | x) = count(x, x') / count(x)

(Proof?)
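One standard way to verify the claim (a proof sketch added here, not from the slides) is to rewrite the objective in terms of counts and attach a Lagrange multiplier to each normalization constraint:

```latex
% Group identical bigrams: the objective becomes
%   L(q) = \sum_{x,x'} \mathrm{count}(x,x') \log q(x' \mid x).
% Add a multiplier \lambda_x per constraint \sum_{x'} q(x' \mid x) = 1 and
% set the derivative with respect to q(x' \mid x) to zero:
\[
  \frac{\partial}{\partial q(x' \mid x)}
  \Bigl[ L(q) - \sum_{x} \lambda_x \Bigl( \sum_{x'} q(x' \mid x) - 1 \Bigr) \Bigr]
  = \frac{\mathrm{count}(x,x')}{q(x' \mid x)} - \lambda_x = 0
  \quad\Longrightarrow\quad
  q(x' \mid x) = \frac{\mathrm{count}(x,x')}{\lambda_x}.
\]
% Enforcing the constraint gives \lambda_x = \sum_{x'} \mathrm{count}(x,x')
% = \mathrm{count}(x), hence q^*(x' \mid x) = \mathrm{count}(x,x') / \mathrm{count}(x);
% the nonnegativity constraints hold automatically since counts are nonnegative.
```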
MLE: Other n-Gram Models

  Unigram:  q(x) = count(x) / N   (here N is the total number of tokens, not the number of sentences)

  Bigram:   q(x' | x) = count(x, x') / count(x)

  Trigram:  q(x'' | x, x') = count(x, x', x'') / count(x, x')
Overview

Probability of a Sentence

n-Gram Language Models
  Unigram, Bigram, Trigram Models
  Estimation from Data
  Evaluation
  Smoothing

Log-Linear Language Models
Evaluation of a Language Model

"How good is the model at predicting unseen sentences?"

Held-out corpus: Used for evaluation purposes only.
Do not use held-out data for training the model!

Popular evaluation metric: perplexity
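As a preview of the metric, one common formulation of perplexity is 2 raised to the negative average per-token log2 probability of the held-out corpus. A sketch under that convention, assuming a sentence-level log2-probability function like the bigram scorer above (the scorer used here is a fake stand-in):

```python
import math

def perplexity(heldout, sentence_log2_prob):
    """heldout: list of tokenized sentences.
    sentence_log2_prob: function returning log2 p(x_1 ... x_m STOP) under the model.

    One common convention: perplexity = 2^(-total log2 prob / total token count),
    where each sentence contributes len(sentence) + 1 tokens (the +1 for STOP).
    """
    total_log2 = sum(sentence_log2_prob(s) for s in heldout)
    total_tokens = sum(len(s) + 1 for s in heldout)
    return 2.0 ** (-total_log2 / total_tokens)

# Fake scorer that assigns probability 1/8 to every token, for illustration only.
fake_scorer = lambda s: (len(s) + 1) * math.log2(1.0 / 8.0)
print(perplexity([["the", "dog", "barked"]], fake_scorer))  # 8.0
```

Lower perplexity means the model assigns higher probability to the held-out text; a uniform model over k choices per token has perplexity k.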