Natural Language Processing (CSE 490U): Language Models


  1. Natural Language Processing (CSE 490U): Language Models. Noah Smith, © 2017 University of Washington, nasmith@cs.washington.edu. January 6–9, 2017.

  2–10. Very Quick Review of Probability
  - Event space (e.g., 𝒳, 𝒴): in this class, usually discrete
  - Random variables (e.g., X, Y)
  - Typical statement: "random variable X takes value x ∈ 𝒳 with probability p(X = x), or, in shorthand, p(x)"
  - Joint probability: p(X = x, Y = y)
  - Conditional probability: p(X = x | Y = y) = p(X = x, Y = y) / p(Y = y)
  - Always true (the chain rule): p(X = x, Y = y) = p(X = x | Y = y) · p(Y = y) = p(Y = y | X = x) · p(X = x)
  - Sometimes true (only when X and Y are independent): p(X = x, Y = y) = p(X = x) · p(Y = y)
  - The difference between true and estimated probability distributions
  A small numerical sketch of these identities appears below.
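To make the identities above concrete, here is a small sketch (not part of the original slides) that estimates joint, marginal, and conditional probabilities from a toy table of counts, checks the chain rule, and shows a case where the independence factorization fails. The weather/umbrella events and their counts are invented purely for illustration.

```python
from collections import Counter

# Toy joint counts over two discrete random variables X (weather) and Y (umbrella).
# These events and counts are made up purely for illustration.
counts = Counter({
    ("rain", "umbrella"): 30,
    ("rain", "no_umbrella"): 10,
    ("sun", "umbrella"): 5,
    ("sun", "no_umbrella"): 55,
})
total = sum(counts.values())

def p_joint(x, y):
    """Estimated joint probability p(X=x, Y=y) from counts."""
    return counts[(x, y)] / total

def p_x(x):
    """Marginal p(X=x): sum the joint over all values of Y."""
    return sum(c for (xv, _), c in counts.items() if xv == x) / total

def p_y(y):
    """Marginal p(Y=y): sum the joint over all values of X."""
    return sum(c for (_, yv), c in counts.items() if yv == y) / total

def p_x_given_y(x, y):
    """Conditional p(X=x | Y=y) = p(X=x, Y=y) / p(Y=y)."""
    return p_joint(x, y) / p_y(y)

# "Always true": the chain rule holds by construction.
assert abs(p_joint("rain", "umbrella")
           - p_x_given_y("rain", "umbrella") * p_y("umbrella")) < 1e-12

# "Sometimes true": independence would require p(x, y) = p(x) * p(y).
# Here it fails (0.3 vs. 0.14), because umbrella use depends on the weather.
print(p_joint("rain", "umbrella"), p_x("rain") * p_y("umbrella"))
```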

  11. Language Models: Definitions
  - 𝒱 is a finite set of (discrete) symbols ("words" or possibly characters); V = |𝒱|
  - 𝒱† is the (infinite) set of sequences of symbols from 𝒱 whose final symbol is the designated stop symbol
  - p : 𝒱† → ℝ, such that:
    - For any x ∈ 𝒱†, p(x) ≥ 0
    - ∑_{x ∈ 𝒱†} p(X = x) = 1
    (I.e., p is a proper probability distribution.)
  Language modeling: estimate p from examples, x_{1:n} = ⟨x_1, x_2, ..., x_n⟩.
  A toy sketch of such a distribution appears below.
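As an illustration of these definitions (not from the slides), the sketch below builds one very simple member of this family: a unigram model in which each word is drawn independently and a stop symbol ends the sequence. The vocabulary, the probabilities, and the "</s>" stop token are all invented; the point is only that p(x) ≥ 0 for every sequence and that the probability mass over 𝒱† sums to 1.

```python
import itertools

# A toy distribution p over V†: every position is drawn i.i.d. from `unigram`,
# and the sequence ends the first (and only) time "</s>" is drawn.
# Vocabulary and probabilities are invented for illustration.
unigram = {"the": 0.3, "cat": 0.25, "sat": 0.25, "</s>": 0.2}

def p(seq):
    """Probability of a sequence in V† (ends with, and only with, "</s>")."""
    assert seq[-1] == "</s>" and "</s>" not in seq[:-1]
    prob = 1.0
    for w in seq:
        prob *= unigram[w]
    return prob

# Nonnegativity is immediate.  For sum-to-one: sequences with L non-stop words
# contribute 0.8**L * 0.2 in total, and the geometric series sums to 1.
# Check numerically by enumerating all sequences up to a cutoff length.
mass = 0.0
words = [w for w in unigram if w != "</s>"]
for length in range(0, 12):
    for body in itertools.product(words, repeat=length):
        mass += p(list(body) + ["</s>"])
print(mass)  # approaches 1.0 as the cutoff length grows (about 0.93 here)
```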

  12. Immediate Objections
  1. Why would we want to do this?
  2. Are the nonnegativity and sum-to-one constraints really necessary?
  3. Is "finite 𝒱" realistic?

  13–16. Motivation: Noisy Channel Models
  A pattern for modeling a pair of random variables, X and Y:
      source → Y → channel → X
  - Y is the plaintext, the true message, the missing information, the output
  - X is the ciphertext, the garbled message, the observable evidence, the input
  - Decoding: select y given X = x:
      y* = argmax_y p(y | x)
         = argmax_y p(x | y) · p(y) / p(x)
         = argmax_y p(x | y) · p(y)
    where p(x | y) is the channel model and p(y) is the source model.
  A toy decoding sketch appears below.
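A minimal sketch of this decoding rule (not from the slides): it scores a handful of candidate messages y by log p(x | y) + log p(y) and returns the argmax. The candidate list, the mismatch-counting channel model, and the hand-picked source log-probabilities are hypothetical stand-ins; in a real system each would be a learned model.

```python
# Toy noisy-channel decoder: y* = argmax_y [log p(x | y) + log p(y)].
def channel_logprob(x, y):
    """log p(x | y) under a made-up channel: score decays with word mismatches."""
    x_words, y_words = x.split(), y.split()
    mismatches = sum(a != b for a, b in zip(x_words, y_words))
    mismatches += abs(len(x_words) - len(y_words))
    return -2.0 * mismatches

# log p(y): hypothetical language-model scores for a few candidate messages.
source_logprob = {
    "the station signs are in deep in english": -40.0,
    "the station signs are indeed in english": -30.0,
    "the station signs are indians in english": -38.0,
}

def decode(x, candidates):
    """Pick y* = argmax_y [log p(x | y) + log p(y)], computed in log space."""
    return max(candidates, key=lambda y: channel_logprob(x, y) + source_logprob[y])

observed = "the station signs are in deep in english"   # the garbled input x
print(decode(observed, list(source_logprob)))
# -> "the station signs are indeed in english": the source model outweighs the
#    channel model's preference for the literal transcription.
```

With these made-up numbers the language model pulls the decoder toward the fluent hypothesis even though the observation matches the disfluent one exactly, which is the same effect illustrated by the speech recognition n-best list below.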

  17. Noisy Channel Example: Speech Recognition
      source → sequence in 𝒱† → channel → acoustics
  - The acoustic model defines p(sounds | x) (the channel)
  - The language model defines p(x) (the source)

  18. Noisy Channel Example: Speech Recognition (credit: Luke Zettlemoyer)
  word sequence                                    log p(acoustics | word sequence)
  the station signs are in deep in english                 -14732
  the stations signs are in deep in english                -14735
  the station signs are in deep into english               -14739
  the station 's signs are in deep in english              -14740
  the station signs are in deep in the english             -14741
  the station signs are indeed in english                  -14757
  the station 's signs are indeed in english               -14760
  the station signs are indians in english                 -14790
  the station signs are indian in english                  -14799
  the stations signs are indians in english                -14807
  the stations signs are indians and english               -14815

  19. Noisy Channel Example: Machine Translation
  "Also knowing nothing official about, but having guessed and inferred considerable about, the powerful new mechanized methods in cryptography—methods which I believe succeed even when one does not know what language has been coded—one naturally wonders if the problem of translation could conceivably be treated as a problem in cryptography. When I look at an article in Russian, I say: 'This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.'"
  Warren Weaver, 1955

  20. Noisy Channel Examples
  - Speech recognition
  - Machine translation
  - Optical character recognition
  - Spelling and grammar correction

  21. Immediate Objections (revisited)
  1. Why would we want to do this?
  2. Are the nonnegativity and sum-to-one constraints really necessary?
  3. Is "finite 𝒱" realistic?

  22–26. Evaluation: Perplexity
  Intuitively, language models should assign high probability to real language they have not seen before. For out-of-sample ("held-out" or "test") data x̄_{1:m}:
  - Probability of x̄_{1:m} is ∏_{i=1}^{m} p(x̄_i)
  - Log-probability of x̄_{1:m} is ∑_{i=1}^{m} log₂ p(x̄_i)
  - Average log-probability per word of x̄_{1:m} is l = (1/M) ∑_{i=1}^{m} log₂ p(x̄_i), where M = ∑_{i=1}^{m} |x̄_i| (the total number of words in the corpus)
  - Perplexity (relative to x̄_{1:m}) is 2^(−l)
  Lower is better. A small computational sketch appears below.
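A minimal sketch of this computation (not from the slides). The uniform "one over vocabulary size" model and the tiny held-out corpus are hypothetical placeholders; any function returning log₂ p(sentence) could be dropped in.

```python
import math

VOCAB_SIZE = 10_000  # hypothetical vocabulary size V

def logprob(sentence):
    """log2 p(sentence) under a uniform model: every word has probability 1/V."""
    return len(sentence) * math.log2(1.0 / VOCAB_SIZE)

def perplexity(heldout):
    """Perplexity = 2**(-l), where l is the average log2-probability per word."""
    total_logprob = sum(logprob(sent) for sent in heldout)
    total_words = sum(len(sent) for sent in heldout)   # M in the slides
    l = total_logprob / total_words
    return 2.0 ** (-l)

heldout = [
    ["the", "station", "signs", "are", "indeed", "in", "english", "</s>"],
    ["lower", "is", "better", "</s>"],
]
print(perplexity(heldout))   # the uniform model's perplexity equals VOCAB_SIZE
```

A uniform model over V word types has perplexity exactly V, which is why perplexity is often read as an effective branching factor; a model that actually captures regularities in the held-out text scores lower.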
