Accelerated Natural Language Processing
Lecture 4: Models and probability estimation

Sharon Goldwater
23 September 2019


A famous quote

   It must be recognized that the notion "probability of a sentence" is an
   entirely useless one, under any known interpretation of this term.
                                                     Noam Chomsky, 1969

• "useless": To everyone? To linguists?
• "known interpretation": What are possible interpretations?


Today's lecture

• What do we mean by the "probability of a sentence" and what is it good for?
• What is probability estimation? What does it require?
• What is a generative model and what are model parameters?
• What is maximum-likelihood estimation and how do I compute likelihood?
Intuitive interpretation

• "Probability of a sentence" = how likely is it to occur in natural language.
  – Consider only a specific language (English).
  – Not including meta-language (e.g. linguistic discussion).

  P(She studies morphosyntax) > P(She studies more faux syntax)


Automatic speech recognition

Sentence probabilities (language model) help decide between similar-sounding
options.

  speech input
    ↓ (Acoustic model)
  possible outputs:   She studies morphosyntax
                      She studies more faux syntax
                      She's studies morph or syntax
                      ...
    ↓ (Language model)
  best-guess output:  She studies morphosyntax


Machine translation

Sentence probabilities help decide word choice and word order.

  non-English input
    ↓ (Translation model)
  possible outputs:   She is going home
                      She is going house
                      She is traveling to home
                      To home she is going
                      ...
    ↓ (Language model)
  best-guess output:  She is going home


So, not "entirely useless"...

• Sentence probabilities are clearly useful for language engineering
  [this course].
• Given time, I could argue why they're also useful in linguistic science
  (e.g., psycholinguistics). But that's another course...
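In both pipelines the language model plays the same role: rerank the candidate outputs by sentence probability. Here is a minimal sketch of that reranking step (not from the slides); the log-probability values are made up and stand in for a trained acoustic model and language model:

```python
import math

# Hypothetical language-model scores: log-probabilities for each candidate
# sentence. A real system would compute these with a trained model.
toy_log_probs = {
    "She studies morphosyntax": math.log(1e-9),
    "She studies more faux syntax": math.log(1e-12),
    "She's studies morph or syntax": math.log(1e-14),
}

# Candidates proposed by the acoustic model (assumed given here).
candidates = list(toy_log_probs)

# The language model breaks the tie: pick the highest-probability sentence.
best_guess = max(candidates, key=lambda s: toy_log_probs[s])
print(best_guess)  # -> She studies morphosyntax
```

Log-probabilities are used because multiplying many small probabilities underflows floating point; comparing sums of logs gives the same ranking.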
But, what about zero-probability sentences?

  the Archaeopteryx winged jaggedly amidst foliage
    vs
  jaggedly trees the on flew

• Neither has ever occurred before.
  ⇒ both have zero probability.
• But one is grammatical (and meaningful), the other not.
  ⇒ "Sentence probability" is useless as a measure of grammaticality.


The logical flaw

• "Probability of a sentence" = how likely is it to occur in natural language.
• Is the following statement true?
    Sentence has never occurred ⇒ sentence has zero probability
• More generally, is this one?
    Event has never occurred ⇒ event has zero probability


Events that have never occurred

• Each of these events has never occurred:
    My hair turns blue
    I injure myself in a skiing accident
    I travel to Finland
• Yet, they clearly have different (and non-zero!) probabilities.
• Most sentences (and events) have never occurred.
  – This doesn't make their probabilities zero (or meaningless), but
  – it does make estimating their probabilities trickier.
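To see the flaw concretely, here is a toy illustration (my addition, not the slides'): the naive relative-frequency estimator assigns probability zero to anything unseen, no matter how plausible it is.

```python
# Relative-frequency estimates from a tiny corpus: any sentence that never
# occurred gets estimated probability 0, even if it is perfectly plausible.
corpus = ["the cat sat", "the dog sat", "the cat ran"]

def p_hat(sentence):
    """Estimate P(sentence) as its relative frequency in the corpus."""
    return corpus.count(sentence) / len(corpus)

print(p_hat("the cat sat"))  # 1/3, was observed
print(p_hat("the dog ran"))  # 0.0: unseen, but clearly not impossible
```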
Probability theory vs estimation

• Probability theory can solve problems like:
  – I have a jar with 6 blue marbles and 4 red ones.
  – If I choose a marble uniformly at random, what's the probability it's red?
• But what about:
  – I have a jar of marbles.
  – I repeatedly choose a marble uniformly at random and then replace it
    before choosing again.
  – In ten draws, I get 6 blue marbles and 4 red ones.
  – On the next draw, what's the probability I get a red marble?
• The latter also requires estimation theory.


Example: weather forecasting

What is the probability that it will rain tomorrow?

• To answer this question, we need
  – data: measurements of relevant info (e.g., humidity, wind
    speed/direction, temperature).
  – model: equations/procedures to estimate the probability using the data.
• In fact, to build the model, we will need data (including outcomes) from
  previous situations as well.
• Note that we will never know the "true" probability of rain P(rain), only
  our estimated probability P̂(rain).


Example: language model

What is the probability of sentence w = w_1 ... w_n?

• To answer this question, we need
  – data: words w_1 ... w_n, plus a large corpus of sentences ("previous
    situations", or training data).
  – model: equations to estimate the probability using the data.
• Different models will yield different estimates, even with the same data.
• Deep question: what model/estimation method do humans use?
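For the marble question, one natural answer is the relative-frequency estimate (which, under a simple model, turns out to be the maximum-likelihood estimate discussed later). A minimal sketch:

```python
# Relative-frequency estimate for the marble example: in ten draws with
# replacement we observed 6 blue and 4 red marbles.
draws = ["blue"] * 6 + ["red"] * 4

# Estimate P(red) as count(red) / total number of draws.
p_red_hat = draws.count("red") / len(draws)
print(p_red_hat)  # -> 0.4
```

The slides' point still stands: this is only an estimate under one particular model, and a different model of the jar would give a different answer.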
How to get better probability estimates

Better estimates definitely help in language technology. How to improve them?

• More training data. Limited by time, money. (Varies a lot!)
• Better model. Limited by scientific and mathematical knowledge,
  computational resources.
• Better estimation method. Limited by mathematical knowledge,
  computational resources.

We will return to the question of how to know if estimates are "better".


Notation

• When the distinction is important, we will use
  – P(w) for true probabilities
  – P̂(w) for estimated probabilities
  – P_E(w) for estimated probabilities using a particular estimation
    method E.
• But since we almost always mean estimated probabilities, we may get lazy
  later and use P(w) for those too.


Example: estimation for coins

I flip a coin 10 times, getting 7T, 3H. What is P̂(T)?

• A: P̂(T) = 0.5
• B: P̂(T) = 0.7
• C: Neither of the above
• D: I don't know
Example: estimation for coins

I flip a coin 10 times, getting 7T, 3H. What is P̂(T)?

• Model 1: Coin is fair. Then P̂(T) = 0.5.
• Model 2: Coin is not fair.¹ Then P̂(T) = 0.7 (why?)
• Model 3: Two coins, one fair and one not; choose one at random to flip
  10 times. Then 0.5 < P̂(T) < 0.7.

Each is a generative model: a probabilistic process that describes how the
data were generated.

¹ Technically, the physical process of flipping a coin means that it's not
really possible to have a biased coin flip. To see a bias, we'd actually need
to spin the coin vertically and wait for it to tip over. See
https://www.stat.berkeley.edu/~nolan/Papers/dice.pdf for an interesting
discussion of this and other coin-flipping issues.
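A small sketch of the likelihood computation behind Model 2 (my addition, not from the slides): under a coin whose tail-probability is θ, the likelihood of this particular sequence of 7 tails and 3 heads is θ⁷(1 − θ)³, and θ = 0.7 maximizes it. That is why the maximum-likelihood answer is P̂(T) = 0.7.

```python
def likelihood(theta, tails=7, heads=3):
    """Likelihood of a specific sequence with the given numbers of tails
    and heads, under a coin whose tail-probability is theta."""
    return theta**tails * (1 - theta)**heads

# Model 1 (fair coin) vs. the maximum-likelihood value theta = 7/10:
print(likelihood(0.5))  # 0.0009765625
print(likelihood(0.7))  # about 0.00222: higher, so theta = 0.7 fits better

# Check that 0.7 is the maximizer over a grid of candidate thetas.
thetas = [i / 100 for i in range(101)]
print(max(thetas, key=likelihood))  # -> 0.7
```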
Defining a model

Usually, two choices in defining a model:

• Structure (or form) of the model: the form of the equations, usually
  determined by knowledge about the problem.
• Parameters of the model: specific values in the equations that are usually
  determined using the training data.


Example: height of 30-yr-old females

Assume the form of a normal distribution (or Gaussian), with
parameters (µ, σ):

  p(x | µ, σ) = 1/(√(2π) σ) · exp(−(x − µ)² / (2σ²))

Collect data to determine values of µ and σ that fit this particular dataset.
I could then make good predictions about the likely height of the next
30-yr-old female I meet.

What if our data looked like this?
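A minimal sketch of both choices for the height example (my addition; the height values below are made up): the structure is the Gaussian density above, and the parameters µ, σ are set from training data using the maximum-likelihood estimates, which are the sample mean and (population) standard deviation.

```python
import math

# Hypothetical sample of heights (cm) of 30-yr-old females; made-up data.
heights = [158.2, 162.5, 165.0, 167.3, 160.8, 171.1, 163.9, 166.4]

# Maximum-likelihood estimates of the Gaussian's parameters
# (note the 1/n variance, not the 1/(n-1) sample variance).
mu = sum(heights) / len(heights)
sigma = math.sqrt(sum((x - mu) ** 2 for x in heights) / len(heights))

def gaussian_pdf(x, mu, sigma):
    """Density p(x | mu, sigma) of the normal distribution."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma**2)) / (math.sqrt(2 * math.pi) * sigma)

# Use the fitted model to say how likely a new height is.
print(mu, sigma)
print(gaussian_pdf(165.0, mu, sigma))
```

If the data were not unimodal (the "what if our data looked like this?" case), this single-Gaussian structure would be the wrong form, no matter how the parameters are fit.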