Natural Language Processing (CSE 490U): Language Models


  1. Natural Language Processing (CSE 490U): Language Models. Noah Smith, © 2017 University of Washington, nasmith@cs.washington.edu. January 6–9, 2017.

  2–10. Very Quick Review of Probability
  - Event space (e.g., 𝒳, 𝒴): in this class, usually discrete
  - Random variables (e.g., X, Y)
  - Typical statement: "random variable X takes value x ∈ 𝒳 with probability p(X = x), or, in shorthand, p(x)"
  - Joint probability: p(X = x, Y = y)
  - Conditional probability: p(X = x | Y = y) = p(X = x, Y = y) / p(Y = y)
  - Always true (the chain rule): p(X = x, Y = y) = p(X = x | Y = y) · p(Y = y) = p(Y = y | X = x) · p(X = x)
  - Sometimes true (only when X and Y are independent): p(X = x, Y = y) = p(X = x) · p(Y = y)
  - The difference between true and estimated probability distributions
  A small numerical sketch of these identities appears below.
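To make the identities above concrete, here is a small sketch (not part of the original slides) that estimates joint, marginal, and conditional probabilities from a toy table of counts, checks the chain rule, and shows a case where the independence factorization fails. The weather/umbrella events and their counts are invented purely for illustration.

```python
from collections import Counter

# Toy joint counts over two discrete random variables X (weather) and Y (umbrella).
# These events and counts are made up purely for illustration.
counts = Counter({
    ("rain", "umbrella"): 30,
    ("rain", "no_umbrella"): 10,
    ("sun", "umbrella"): 5,
    ("sun", "no_umbrella"): 55,
})
total = sum(counts.values())

def p_joint(x, y):
    """Estimated joint probability p(X=x, Y=y) from counts."""
    return counts[(x, y)] / total

def p_x(x):
    """Marginal p(X=x): sum the joint over all values of Y."""
    return sum(c for (xv, _), c in counts.items() if xv == x) / total

def p_y(y):
    """Marginal p(Y=y): sum the joint over all values of X."""
    return sum(c for (_, yv), c in counts.items() if yv == y) / total

def p_x_given_y(x, y):
    """Conditional p(X=x | Y=y) = p(X=x, Y=y) / p(Y=y)."""
    return p_joint(x, y) / p_y(y)

# "Always true": the chain rule holds by construction.
assert abs(p_joint("rain", "umbrella")
           - p_x_given_y("rain", "umbrella") * p_y("umbrella")) < 1e-12

# "Sometimes true": independence would require p(x, y) = p(x) * p(y).
# Here it fails (0.3 vs. 0.14), because umbrella use depends on the weather.
print(p_joint("rain", "umbrella"), p_x("rain") * p_y("umbrella"))
```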

  11. Language Models: Definitions
  - 𝒱 is a finite set of (discrete) symbols ("words" or possibly characters); V = |𝒱|
  - 𝒱† is the (infinite) set of sequences of symbols from 𝒱 whose final symbol is the designated stop symbol
  - p : 𝒱† → ℝ, such that:
    - For any x ∈ 𝒱†, p(x) ≥ 0
    - ∑_{x ∈ 𝒱†} p(X = x) = 1
    (I.e., p is a proper probability distribution.)
  Language modeling: estimate p from examples, x_{1:n} = ⟨x_1, x_2, ..., x_n⟩.
  A toy sketch of such a distribution appears below.
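As an illustration of these definitions (not from the slides), the sketch below builds one very simple member of this family: a unigram model in which each word is drawn independently and a stop symbol ends the sequence. The vocabulary, the probabilities, and the "</s>" stop token are all invented; the point is only that p(x) ≥ 0 for every sequence and that the probability mass over 𝒱† sums to 1.

```python
import itertools

# A toy distribution p over V†: every position is drawn i.i.d. from `unigram`,
# and the sequence ends the first (and only) time "</s>" is drawn.
# Vocabulary and probabilities are invented for illustration.
unigram = {"the": 0.3, "cat": 0.25, "sat": 0.25, "</s>": 0.2}

def p(seq):
    """Probability of a sequence in V† (ends with, and only with, "</s>")."""
    assert seq[-1] == "</s>" and "</s>" not in seq[:-1]
    prob = 1.0
    for w in seq:
        prob *= unigram[w]
    return prob

# Nonnegativity is immediate.  For sum-to-one: sequences with L non-stop words
# contribute 0.8**L * 0.2 in total, and the geometric series sums to 1.
# Check numerically by enumerating all sequences up to a cutoff length.
mass = 0.0
words = [w for w in unigram if w != "</s>"]
for length in range(0, 12):
    for body in itertools.product(words, repeat=length):
        mass += p(list(body) + ["</s>"])
print(mass)  # approaches 1.0 as the cutoff length grows (about 0.93 here)
```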

  12. Immediate Objections
  1. Why would we want to do this?
  2. Are the nonnegativity and sum-to-one constraints really necessary?
  3. Is "finite 𝒱" realistic?

  13–16. Motivation: Noisy Channel Models
  A pattern for modeling a pair of random variables, X and Y:
      source → Y → channel → X
  - Y is the plaintext, the true message, the missing information, the output
  - X is the ciphertext, the garbled message, the observable evidence, the input
  - Decoding: select y given X = x:
      y* = argmax_y p(y | x)
         = argmax_y p(x | y) · p(y) / p(x)
         = argmax_y p(x | y) · p(y)
    where p(x | y) is the channel model and p(y) is the source model.
  A toy decoding sketch appears below.
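A minimal sketch of this decoding rule (not from the slides): it scores a handful of candidate messages y by log p(x | y) + log p(y) and returns the argmax. The candidate list, the mismatch-counting channel model, and the hand-picked source log-probabilities are hypothetical stand-ins; in a real system each would be a learned model.

```python
# Toy noisy-channel decoder: y* = argmax_y [log p(x | y) + log p(y)].
def channel_logprob(x, y):
    """log p(x | y) under a made-up channel: score decays with word mismatches."""
    x_words, y_words = x.split(), y.split()
    mismatches = sum(a != b for a, b in zip(x_words, y_words))
    mismatches += abs(len(x_words) - len(y_words))
    return -2.0 * mismatches

# log p(y): hypothetical language-model scores for a few candidate messages.
source_logprob = {
    "the station signs are in deep in english": -40.0,
    "the station signs are indeed in english": -30.0,
    "the station signs are indians in english": -38.0,
}

def decode(x, candidates):
    """Pick y* = argmax_y [log p(x | y) + log p(y)], computed in log space."""
    return max(candidates, key=lambda y: channel_logprob(x, y) + source_logprob[y])

observed = "the station signs are in deep in english"   # the garbled input x
print(decode(observed, list(source_logprob)))
# -> "the station signs are indeed in english": the source model outweighs the
#    channel model's preference for the literal transcription.
```

With these made-up numbers the language model pulls the decoder toward the fluent hypothesis even though the observation matches the disfluent one exactly, which is the same effect illustrated by the speech recognition n-best list below.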

  17. Noisy Channel Example: Speech Recognition
      source → sequence in 𝒱† → channel → acoustics
  - The acoustic model defines p(sounds | x) (the channel)
  - The language model defines p(x) (the source)

  18. Noisy Channel Example: Speech Recognition (credit: Luke Zettlemoyer)
  word sequence                                    log p(acoustics | word sequence)
  the station signs are in deep in english                 -14732
  the stations signs are in deep in english                -14735
  the station signs are in deep into english               -14739
  the station 's signs are in deep in english              -14740
  the station signs are in deep in the english             -14741
  the station signs are indeed in english                  -14757
  the station 's signs are indeed in english               -14760
  the station signs are indians in english                 -14790
  the station signs are indian in english                  -14799
  the stations signs are indians in english                -14807
  the stations signs are indians and english               -14815

  19. Noisy Channel Example: Machine Translation
  "Also knowing nothing official about, but having guessed and inferred considerable about, the powerful new mechanized methods in cryptography—methods which I believe succeed even when one does not know what language has been coded—one naturally wonders if the problem of translation could conceivably be treated as a problem in cryptography. When I look at an article in Russian, I say: 'This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.'"
  Warren Weaver, 1955

  20. Noisy Channel Examples
  - Speech recognition
  - Machine translation
  - Optical character recognition
  - Spelling and grammar correction

  21. Immediate Objections (revisited)
  1. Why would we want to do this?
  2. Are the nonnegativity and sum-to-one constraints really necessary?
  3. Is "finite 𝒱" realistic?

  22–26. Evaluation: Perplexity
  Intuitively, language models should assign high probability to real language they have not seen before. For out-of-sample ("held-out" or "test") data x̄_{1:m}:
  - Probability of x̄_{1:m} is ∏_{i=1}^{m} p(x̄_i)
  - Log-probability of x̄_{1:m} is ∑_{i=1}^{m} log₂ p(x̄_i)
  - Average log-probability per word of x̄_{1:m} is l = (1/M) ∑_{i=1}^{m} log₂ p(x̄_i), where M = ∑_{i=1}^{m} |x̄_i| (the total number of words in the corpus)
  - Perplexity (relative to x̄_{1:m}) is 2^(−l)
  Lower is better. A small computational sketch appears below.
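A minimal sketch of this computation (not from the slides). The uniform "one over vocabulary size" model and the tiny held-out corpus are hypothetical placeholders; any function returning log₂ p(sentence) could be dropped in.

```python
import math

VOCAB_SIZE = 10_000  # hypothetical vocabulary size V

def logprob(sentence):
    """log2 p(sentence) under a uniform model: every word has probability 1/V."""
    return len(sentence) * math.log2(1.0 / VOCAB_SIZE)

def perplexity(heldout):
    """Perplexity = 2**(-l), where l is the average log2-probability per word."""
    total_logprob = sum(logprob(sent) for sent in heldout)
    total_words = sum(len(sent) for sent in heldout)   # M in the slides
    l = total_logprob / total_words
    return 2.0 ** (-l)

heldout = [
    ["the", "station", "signs", "are", "indeed", "in", "english", "</s>"],
    ["lower", "is", "better", "</s>"],
]
print(perplexity(heldout))   # the uniform model's perplexity equals VOCAB_SIZE
```

A uniform model over V word types has perplexity exactly V, which is why perplexity is often read as an effective branching factor; a model that actually captures regularities in the held-out text scores lower.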
