Language models (Chapter 3 in Martin/Jurafsky): Probabilistic Language Models

1. Language models (Chapter 3 in Martin/Jurafsky)

Probabilistic Language Models
• Goal: assign a probability to a sentence. Why?
– Machine Translation:
  » P(high winds tonite) > P(large winds tonite)
– Spell Correction:
  » "The office is about fifteen minuets from my house"
  » P(about fifteen minutes from) > P(about fifteen minuets from)
– Speech Recognition:
  » P(I saw a van) >> P(eyes awe of an)
– Plus summarization, question answering, etc.!

2. Probabilistic Language Modeling
• Goal: compute the probability of a sentence or sequence of words:
  P(W) = P(w_1, w_2, w_3, ..., w_n)
• Related task: probability of an upcoming word:
  P(w_5 | w_1, w_2, w_3, w_4)
• A model that computes either of these, P(W) or P(w_n | w_1, w_2, ..., w_{n-1}), is called a language model.

Probability theory
• Random variable: a variable whose possible values are the possible outcomes of a random phenomenon.
  Examples: a person's height, the outcome of a coin toss.
• Distinguish between discrete and continuous variables.
• The distribution of a discrete random variable: the probabilities of each value it can take.
  Notation: P(X = x_i). These numbers satisfy: Σ_i P(X = x_i) = 1.
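A minimal sketch, not from the slides, of the "probabilities sum to 1" property for a discrete random variable, using a fair die roll as the random phenomenon:

```python
# A discrete random variable's distribution as a dict mapping values to
# probabilities, and a check that they satisfy sum_i P(X = x_i) = 1.
from fractions import Fraction

# Outcome of a fair six-sided die roll.
die = {face: Fraction(1, 6) for face in range(1, 7)}

assert sum(die.values()) == 1  # the defining property of a distribution

# Probability of a single value, P(X = 4):
print(die[4])  # 1/6
```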

3. Joint probability distribution
• Notation: P(X = x_i, Y = y_j) = p_ij
• A joint probability distribution for two variables is a table. If the two variables are binary, how many entries does it have?
• Let's now consider the joint probability of d variables, P(X_1, ..., X_d). How many entries does it have if each variable is binary?

Example
• Consider the roll of a fair die. Let X denote whether the number is even (i.e., 2, 4, or 6) and let Y denote whether the number is prime (i.e., 2, 3, or 5).

  X \ Y    prime   non-prime
  even      1/6       2/6
  odd       2/6       1/6
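A small sketch (my own illustration, not from the slides) that builds the joint distribution in the table above by enumerating the six equally likely die faces:

```python
# Build the joint distribution P(X, Y) for X = even/odd, Y = prime/non-prime
# of a fair die by summing 1/6 for each matching face.
from fractions import Fraction
from collections import defaultdict

primes = {2, 3, 5}

joint = defaultdict(Fraction)  # joint[(x, y)] = P(X = x, Y = y)
for face in range(1, 7):
    x = "even" if face % 2 == 0 else "odd"
    y = "prime" if face in primes else "non-prime"
    joint[(x, y)] += Fraction(1, 6)

print(dict(joint))
# {('odd', 'non-prime'): 1/6, ('even', 'prime'): 1/6,
#  ('odd', 'prime'): 1/3, ('even', 'non-prime'): 1/3}   (1/3 is 2/6 reduced)
```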

4. Example
• Given P(X, Y), compute the probability that we picked an even number:
  P(X = even) = P(X = even, Y = prime) + P(X = even, Y = non-prime) = 3/6

  X \ Y    prime   non-prime
  even      1/6       2/6
  odd       2/6       1/6

Marginal probability
• Joint probability: P(X = x_i, Y = y_j) = p_ij
• Marginal probability: P(X = x_i) = Σ_j P(X = x_i, Y = y_j)
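A sketch of the marginalization step, assuming the joint table above is stored as a Python dict (this code is illustrative, not from the slides):

```python
# Marginalization: P(X = x) = sum_j P(X = x, Y = y_j).
from fractions import Fraction

joint = {  # the table from the slide: P(X, Y) for a fair die
    ("even", "prime"): Fraction(1, 6), ("even", "non-prime"): Fraction(2, 6),
    ("odd",  "prime"): Fraction(2, 6), ("odd",  "non-prime"): Fraction(1, 6),
}

p_even = sum(p for (x, y), p in joint.items() if x == "even")
print(p_even)  # 1/2, i.e. 3/6
```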

5. Conditional probability
• Compute the probability P(X = even | Y = non-prime):
  P(X = even | Y = non-prime) = P(X = even, Y = non-prime) / P(Y = non-prime)
                              = (2/6) / (1/2) = 2/3

  X \ Y    prime   non-prime
  even      1/6       2/6
  odd       2/6       1/6

Marginal and conditional probability
• Joint probability: P(X = x_i, Y = y_j) = p_ij
• Marginal probability: P(X = x_i) = Σ_j P(X = x_i, Y = y_j)
• Conditional probability: P(X = x_i | Y = y_j) = P(X = x_i, Y = y_j) / P(Y = y_j)
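The same table can be used to reproduce the conditional-probability computation; this is an illustrative sketch, not part of the slides:

```python
# Conditional probability from the definition:
# P(X = even | Y = non-prime) = P(X = even, Y = non-prime) / P(Y = non-prime).
from fractions import Fraction

joint = {
    ("even", "prime"): Fraction(1, 6), ("even", "non-prime"): Fraction(2, 6),
    ("odd",  "prime"): Fraction(2, 6), ("odd",  "non-prime"): Fraction(1, 6),
}

p_y = sum(p for (x, y), p in joint.items() if y == "non-prime")  # marginal P(Y = non-prime)
p_xy = joint[("even", "non-prime")]                              # joint P(X = even, Y = non-prime)
print(p_xy / p_y)  # 2/3
```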

6. The rules of probability
• Marginalization: P(x) = Σ_y P(x, y)
• Product rule: P(x, y) = P(x) P(y | x)
• Independence: X and Y are independent if P(Y = y | X = x) = P(Y = y).
  This implies P(x, y) = P(x) P(y).

Bayes' rule
• From the product rule: P(x, y) = P(y | x) P(x), and also P(x, y) = P(x | y) P(y)
• Therefore: P(y | x) = P(x | y) P(y) / P(x)
• This is known as Bayes' rule.
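A short sketch checking Bayes' rule numerically on the same die table; the variable names are my own:

```python
# Check P(prime | even) computed directly against Bayes' rule:
# P(prime | even) = P(even | prime) P(prime) / P(even).
from fractions import Fraction

joint = {
    ("even", "prime"): Fraction(1, 6), ("even", "non-prime"): Fraction(2, 6),
    ("odd",  "prime"): Fraction(2, 6), ("odd",  "non-prime"): Fraction(1, 6),
}
p_even  = sum(p for (x, y), p in joint.items() if x == "even")
p_prime = sum(p for (x, y), p in joint.items() if y == "prime")

p_prime_given_even = joint[("even", "prime")] / p_even    # from the definition
p_even_given_prime = joint[("even", "prime")] / p_prime
via_bayes = p_even_given_prime * p_prime / p_even         # Bayes' rule

assert p_prime_given_even == via_bayes
print(via_bayes)  # 1/3
```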

7. How to compute P(W)
• We would like to compute this joint probability:
  P(its, water, is, so, transparent, that)
• Let's use the chain rule!

Reminder: The Chain Rule
• For two variables: P(A, B) = P(A) P(B | A)
• More variables: P(A, B, C, D) = P(A) P(B | A) P(C | A, B) P(D | A, B, C)
• In general: P(x_1, x_2, x_3, ..., x_n) = P(x_1) P(x_2 | x_1) P(x_3 | x_1, x_2) ... P(x_n | x_1, ..., x_{n-1})
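A sketch that just prints the chain-rule factorization for a word sequence, to make the general formula concrete (the helper below is mine, not from the slides):

```python
# Print P(w1, ..., wn) = P(w1) P(w2|w1) ... P(wn|w1,...,wn-1) for a word list.
words = "its water is so transparent".split()

factors = []
for i, w in enumerate(words):
    history = " ".join(words[:i])            # all preceding words
    factors.append(f"P({w} | {history})" if history else f"P({w})")

print(" * ".join(factors))
# P(its) * P(water | its) * P(is | its water) * P(so | its water is) * ...
```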

8. The chain rule applied to the joint probability of words in a sentence
  P(w_1 w_2 ... w_n) = ∏_i P(w_i | w_1 w_2 ... w_{i-1})
• P("its water is so transparent") =
  P(its) × P(water | its) × P(is | its water) × P(so | its water is) × P(transparent | its water is so)

How not to estimate these probabilities
• Naive approach:
  P(the | its water is so transparent that) = Count(its water is so transparent that the) / Count(its water is so transparent that)
• Won't work: we'll never see enough data for estimating these probabilities.
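To see why the naive counting estimator fails, here is a sketch on a made-up ten-word "corpus" (the text is mine, purely for illustration); the exact history almost never occurs, so the ratio is undefined or hopelessly noisy in practice:

```python
# Naive estimator: Count(history + "the") / Count(history).
# On a tiny (made-up) corpus the full history is never seen at all.
corpus = "its water is so clear that the fish are visible".split()

def count(ngram, tokens):
    """Number of times the word list `ngram` occurs contiguously in `tokens`."""
    n = len(ngram)
    return sum(tokens[i:i + n] == ngram for i in range(len(tokens) - n + 1))

history = "its water is so transparent that".split()
num = count(history + ["the"], corpus)
den = count(history, corpus)
print(num, den)                                        # 0 0
print(num / den if den else "undefined: history never seen")
```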

9. Markov Assumption (Andrei Markov)
• Simplifying assumption:
  P(the | its water is so transparent that) ≈ P(the | that)
• Or maybe:
  P(the | its water is so transparent that) ≈ P(the | transparent that)

Markov Assumption
  P(w_1 w_2 ... w_n) ≈ ∏_i P(w_i | w_{i-k} ... w_{i-1})
• In other words, we approximate each component in the product as
  P(w_i | w_1 w_2 ... w_{i-1}) ≈ P(w_i | w_{i-k} ... w_{i-1})
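A sketch of the Markov truncation: keep only the last k words of the conditioning history. The markov_factors helper is hypothetical, used only to print the approximate factorizations:

```python
# Truncate the history so P(w_i | w_1 ... w_{i-1}) ~= P(w_i | w_{i-k} ... w_{i-1}).
def markov_factors(words, k):
    factors = []
    for i, w in enumerate(words):
        history = words[max(0, i - k):i]     # at most the last k words
        factors.append(f"P({w} | {' '.join(history)})" if history else f"P({w})")
    return " * ".join(factors)

words = "its water is so transparent that the".split()
print(markov_factors(words, 1))  # bigram-style approximation
print(markov_factors(words, 2))  # trigram-style approximation
```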

10. Simplest case: the unigram model
  P(w_1 w_2 ... w_n) ≈ ∏_i P(w_i)
• Some automatically generated sentences from a unigram model:
  fifth, an, of, futures, the, an, incorporated, a, a, the, inflation, most, dollars, quarter, in, is, mass
  thrift, did, eighty, said, hard, 'm, july, bullish
  that, or, limited, the

Bigram model
• Condition on the previous word:
  P(w_i | w_1 w_2 ... w_{i-1}) ≈ P(w_i | w_{i-1})
• Some automatically generated sentences from a bigram model:
  texaco, rose, one, in, this, issue, is, pursuing, growth, in, a, boiler, house, said, mr., gurria, mexico, 's, motion, control, proposal, without, permission, from, five, hundred, fifty, five, yen
  outside, new, car, parking, lot, of, the, agreement, reached
  this, would, be, a, record, november
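A sketch of a maximum-likelihood bigram model trained on a tiny made-up corpus and used to generate a word sequence, in the spirit of the generated examples above (the corpus, markers, and function names are my own):

```python
# Estimate P(w_i | w_{i-1}) from bigram counts, then sample words from it.
import random
from collections import Counter, defaultdict

corpus = [                       # made-up toy corpus with sentence markers
    "<s> its water is so transparent </s>",
    "<s> the water is so clear </s>",
    "<s> its water is clear </s>",
]

bigram_counts = defaultdict(Counter)      # bigram_counts[prev][cur] = count
for sent in corpus:
    tokens = sent.split()
    for prev, cur in zip(tokens, tokens[1:]):
        bigram_counts[prev][cur] += 1

def p(cur, prev):
    """Maximum-likelihood estimate of P(cur | prev)."""
    return bigram_counts[prev][cur] / sum(bigram_counts[prev].values())

def generate(max_len=10):
    """Sample words from the bigram model until </s> or max_len."""
    word, out = "<s>", []
    for _ in range(max_len):
        nxt = random.choices(*zip(*bigram_counts[word].items()))[0]
        if nxt == "</s>":
            break
        out.append(nxt)
        word = nxt
    return " ".join(out)

print(p("water", "its"))   # 1.0 in this toy corpus
print(generate())
```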

11. N-gram models
• We can extend to trigrams, 4-grams, 5-grams.
• In general this is an insufficient model of language, because language has long-distance dependencies:
  "The computer(s) which I had just put into the machine room on the fifth floor is (are) crashing."
• But these models are still very useful!
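A sketch of how the bigram counting generalizes to any order n, conditioning on the previous n-1 words; the ngram_counts helper is hypothetical. Longer contexts capture more, but still miss the long-distance dependencies mentioned on the slide.

```python
# Collect n-gram counts: counts[context][next_word], where context is the
# previous n-1 words, padded with <s> at the start of each sentence.
from collections import Counter, defaultdict

def ngram_counts(sentences, n):
    counts = defaultdict(Counter)
    for sent in sentences:
        tokens = ["<s>"] * (n - 1) + sent.split() + ["</s>"]
        for i in range(n - 1, len(tokens)):
            context = tuple(tokens[i - n + 1:i])
            counts[context][tokens[i]] += 1
    return counts

trigrams = ngram_counts(["its water is so transparent"], 3)
print(trigrams[("its", "water")])  # Counter({'is': 1})
```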
