  1. Language Modeling CS 6956: Deep Learning for NLP

  2. Overview
  • What is a language model?
  • How do we evaluate language models?
  • Traditional language models
  • Feedforward neural networks for language modeling
  • Recurrent neural networks for language modeling

  3. Language models
  What is the probability of a sentence?
  – Grammatically incorrect or rare sentences should have lower probability
  – Or, equivalently: what is the probability of a word following a sequence of words?
    "The cat chased a mouse" vs. "The cat chased a turnip"
  Can be framed as a sequence modeling task
  Two classes of models:
  – Count-based: Markov assumptions with smoothing
  – Neural models
  We have seen this difference before. In this lecture, we will look at some details.
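
To make the sequence-modeling framing concrete, the short Python sketch below scores a sentence as a product of next-word probabilities via the chain rule. The toy_next_word_prob function and its numbers are invented purely for illustration; a real language model would supply these conditional probabilities.

```python
# A toy illustration of the chain rule Q(x_1 ... x_n) = prod_i Q(x_i | x_{1:i-1}).
# toy_next_word_prob is a made-up stand-in for a real language model.
import math

def toy_next_word_prob(word, history):
    # Hypothetical conditional distribution: "a mouse" is a likely continuation
    # of "chased a", "a turnip" is not; everything else gets a small default.
    continuations = {("chased", "a"): {"mouse": 0.4, "ball": 0.3, "turnip": 0.001}}
    return continuations.get(tuple(history[-2:]), {}).get(word, 0.01)

def sentence_log_prob(words):
    # log Q(x_1 ... x_n) = sum_i log Q(x_i | x_{1:i-1})
    return sum(math.log(toy_next_word_prob(w, words[:i])) for i, w in enumerate(words))

print(sentence_log_prob("the cat chased a mouse".split()))   # higher log probability
print(sentence_log_prob("the cat chased a turnip".split()))  # lower log probability
```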

  4. Overview
  • What is a language model?
  • How do we evaluate language models?
  • Traditional language models
  • Feedforward neural networks for language modeling
  • Recurrent neural networks for language modeling

  5. Evaluating language models: extrinsic evaluation
  • A good language model should help with an end task such as machine translation
    – If we have an MT system that uses language models to produce outputs…
    – …a better language model can produce better outputs
  • To evaluate a language model, is a downstream task needed?
    – Can be slow, and depends on the quality of the downstream system
  Can we define an intrinsic evaluation?

  6. What is a good language model?
  • Should prefer good sentences to bad ones
    – It should assign higher probabilities to valid/grammatical/frequent sentences
    – It should assign lower probabilities to invalid/ungrammatical/rare sentences
  • Can we construct an evaluation metric that directly measures this?
  Answer: Perplexity

  7. Perplexity
  A good language model should assign high probability to sentences that occur in the real world
  – Need a metric that captures this intuition, but normalizes for the length of sentences
  Given a sentence $x_1 x_2 x_3 \cdots x_n$, define the perplexity of a language model as
  $$\text{Perplexity} = Q(x_1 x_2 x_3 \cdots x_n)^{-\frac{1}{n}}$$
  Lower perplexity corresponds to higher probability

  8. Example: Uniformly likely words
  Suppose we have n words in a sentence, and they are all independent and uniformly likely over a vocabulary V
  – Would be a strange language…
  $$\text{Perplexity} = Q(x_1 x_2 x_3 \cdots x_n)^{-\frac{1}{n}} = \left( \left( \frac{1}{|V|} \right)^{n} \right)^{-\frac{1}{n}} = |V|$$
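
As a quick numeric check of this example (the vocabulary size and sentence length below are arbitrary made-up values, purely for illustration):

```python
# Numeric check of the uniform-words example.
vocab_size = 10_000
n = 20  # number of words in the sentence

# Independent, uniform words: Q(x_1 ... x_n) = (1 / vocab_size) ** n
q_sentence = (1.0 / vocab_size) ** n

# Perplexity = Q(x_1 ... x_n) ** (-1 / n), which recovers the vocabulary size
perplexity = q_sentence ** (-1.0 / n)
print(perplexity)  # ~10000.0, up to floating-point error
```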

  9. Perplexity of history-based models
  Given a sentence $x_1 x_2 x_3 \cdots x_n$, define the perplexity of a language model as
  $$\text{Perplexity} = Q(x_1 x_2 x_3 \cdots x_n)^{-\frac{1}{n}}$$
  For a history-based model, we have
  $$Q(x_1 \cdots x_n) = \prod_i Q(x_i \mid x_{1:i-1})$$
  so
  $$\text{Perplexity} = \left( \prod_i Q(x_i \mid x_{1:i-1}) \right)^{-\frac{1}{n}} = 2^{\log_2 \left( \prod_i Q(x_i \mid x_{1:i-1}) \right)^{-\frac{1}{n}}} = 2^{-\frac{1}{n} \sum_i \log_2 Q(x_i \mid x_{1:i-1})}$$
  The exponent is the average number of bits needed to encode each word of the sentence
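
The derivation gives two equivalent ways to compute perplexity. The sketch below checks that the product form and the log-space form agree on some made-up per-word probabilities; in practice the log-space form is preferred because the raw product underflows for long sentences.

```python
# Checking that the product form and the log-space form of perplexity agree.
# The conditional probabilities below are made-up numbers, not model outputs.
import math

word_probs = [0.2, 0.05, 0.1, 0.3, 0.01]  # hypothetical Q(x_i | x_{1:i-1}) values
n = len(word_probs)

# Product form: Perplexity = (prod_i Q(x_i | x_{1:i-1})) ** (-1/n)
perplexity_product = math.prod(word_probs) ** (-1.0 / n)

# Log-space form: Perplexity = 2 ** (-(1/n) * sum_i log2 Q(x_i | x_{1:i-1})).
# The exponent is the average number of bits needed to encode each word.
avg_bits = -sum(math.log2(q) for q in word_probs) / n
perplexity_log = 2.0 ** avg_bits

print(perplexity_product, perplexity_log)  # both ≈ 12.7
```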

  10. Evaluating language models
  Several benchmark sets available
  – Penn Treebank Wall Street Journal corpus
    • Standard preprocessing by Mikolov
    • Vocabulary size: 10K words
    • Training size: 890K tokens
  – Billion Word Benchmark
    • English news text [Chelba et al., 2013]
    • Vocabulary size: ~793K words
    • Training size: ~800M tokens
  Standard methodology: train on the training set and evaluate on the test set
  – Some papers also continue training on the evaluation set, because no labels are needed

  11. Overview
  • What is a language model?
  • How do we evaluate language models?
  • Traditional language models
  • Feedforward neural networks for language modeling
  • Recurrent neural networks for language modeling

  12. Traditional language models
  Required counting n-grams
  The goal: to compute $Q(x_1 x_2 \cdots x_n)$ for any sequence of words
  The (k+1)-th order Markov assumption:
  $$Q(x_1 x_2 \cdots x_n) \approx \prod_i Q(x_{i+1} \mid x_{i-k:i})$$
  Need to get the conditional probabilities from data:
  $$Q(x_{i+1} \mid x_{i-k:i}) = \frac{\text{count}(x_{i-k:i}, x_{i+1})}{\text{count}(x_{i-k:i})}$$
  The problem: Zeros in the counts.
  The solution: Smoothing
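
As a minimal, hypothetical instance of this count-based recipe, the sketch below builds a bigram model (each word conditioned on the single previous word) from a toy corpus and uses add-one (Laplace) smoothing as a stand-in for the smoothing step; this is not the lecture's exact recipe or any benchmark setup.

```python
# A sketch of a count-based bigram model with add-one (Laplace) smoothing.
# The corpus is a toy example; real setups (e.g. PTB) are much larger.
from collections import Counter

corpus = [
    "the cat chased a mouse".split(),
    "the dog chased a cat".split(),
]

vocab = {w for sentence in corpus for w in sentence}
bigram_counts = Counter()
context_counts = Counter()

for sentence in corpus:
    padded = ["<s>"] + sentence  # <s> marks the start of the sentence
    for prev, curr in zip(padded, padded[1:]):
        bigram_counts[(prev, curr)] += 1
        context_counts[prev] += 1

def q(curr, prev, alpha=1.0):
    # Smoothed estimate: (count(prev, curr) + alpha) / (count(prev) + alpha * |V|)
    # Add-one smoothing keeps unseen bigrams from getting zero probability.
    return (bigram_counts[(prev, curr)] + alpha) / (context_counts[prev] + alpha * len(vocab))

print(q("mouse", "a"))  # seen bigram "a mouse": 2/8 = 0.25
print(q("dog", "a"))    # unseen bigram "a dog": 1/8 = 0.125, nonzero thanks to smoothing
```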
