

  1. Count-based Language Modeling. CMSC 473/673, UMBC. Some slides adapted from 3SLP, Jason Eisner

  2. Outline Defining Language Models Breaking & Fixing Language Models Evaluating Language Models

  3. Goal of Language Modeling: p_θ(…text…). Learn a probabilistic model of text. Accomplished by observing text and updating model parameters to make that text more likely

  4. Goal of Language Modeling: p_θ(…text…). Learn a probabilistic model of text, where 0 ≤ p_θ(…text…) ≤ 1 and ∑_{t : t is valid text} p_θ(t) = 1. Accomplished by observing text and updating model parameters to make that text more likely
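A minimal sketch of these two constraints, checked on a hypothetical hand-built model over a toy set of valid texts (the toy table is mine, not from the slides):

    # Toy model: probabilities over three one-word "texts".
    p_theta = {"yes": 0.5, "no": 0.3, "maybe": 0.2}

    # Every probability lies in [0, 1]...
    assert all(0.0 <= p <= 1.0 for p in p_theta.values())
    # ...and the probabilities of all valid texts sum to 1.
    assert abs(sum(p_theta.values()) - 1.0) < 1e-9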

  5. Design Question 1: What Part of Language Do We Estimate? p_θ(…text…). Is …text… a full document? A sequence of sentences? A sequence of words? A sequence of characters? A: It’s task-dependent!

  6. Design Question 2: How do we estimate robustly? p_θ(…typo-text…). What if …text… has a typo?

  7. Design Question 3: How do we generalize? p_θ(…synonymous-text…). What if …text… has a word (or character, or …) we’ve never seen before?

  8. “The Unreasonable Effectiveness of Recurrent Neural Networks” http://karpathy.github.io/2015/05/21/rnn-effectiveness/

  9. “The Unreasonable Effectiveness of Recurrent Neural Networks” http://karpathy.github.io/2015/05/21/rnn-effectiveness/ “The Unreasonable Effectiveness of Character-level Language Models” (and why RNNs are still cool) http://nbviewer.jupyter.org/gist/yoavg/d76121dfde2618422139

  10. Simple Count-Based: p(item)

  11. Simple Count-Based: p(item) ∝ count(item), where “∝” means “proportional to”

  12. Simple Count-Based: p(item) ∝ count(item), i.e. p(item) = count(item) / ∑_{all items y} count(y)

  13. Simple Count-Based: p(item) ∝ count(item), i.e. p(item) = count(item) / ∑_{all items y} count(y). The denominator is a constant: it does not depend on which item is being scored

  14. In Simple Count-Based Models, What Do We Count? p(item) ∝ count(item). Sequences of characters → pseudo-words; sequences of words → pseudo-phrases
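A minimal Python sketch of this count-and-normalize recipe, with word tokens as the items (the toy corpus and names are mine):

    from collections import Counter

    # Items here are word tokens; they could just as well be
    # characters or longer pseudo-phrases.
    corpus = "the film got a great opening and the film went on".split()

    counts = Counter(corpus)                  # count(item)
    total = sum(counts.values())              # the normalizing constant
    p = {item: c / total for item, c in counts.items()}  # p(item)

    print(p["film"])  # 2/11 in this toy corpus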

  15. Shakespearian Sequences of Characters

  16. Shakespearian Sequences of Words

  17. Novel Words, Novel Sentences. “Colorless green ideas sleep furiously” – Chomsky (1957). Let’s observe and record all sentences with our big, bad supercomputer. Red ideas? Read ideas?

  18. Probability Chain Rule: p(x_1, x_2, …, x_S) = p(x_1) p(x_2 | x_1) p(x_3 | x_1, x_2) ⋯ p(x_S | x_1, …, x_{S−1}) = ∏_{i=1}^{S} p(x_i | x_1, …, x_{i−1})

  19. N-Grams Maintaining an entire inventory over sentences could be too much to ask Store “smaller” pieces? p(Colorless green ideas sleep furiously)

  20. N-Grams Maintaining an entire joint inventory over sentences could be too much to ask Store “smaller” pieces? p(Colorless green ideas sleep furiously) = p(Colorless) *

  21. N-Grams Maintaining an entire joint inventory over sentences could be too much to ask Store “smaller” pieces? p(Colorless green ideas sleep furiously) = p(Colorless) * p(green | Colorless) *

  22. N-Grams Maintaining an entire joint inventory over sentences could be too much to ask Store “smaller” pieces? p(Colorless green ideas sleep furiously) = p(Colorless) * p(green | Colorless) * p(ideas | Colorless green) * p(sleep | Colorless green ideas) * p(furiously | Colorless green ideas sleep)

  23. N-Grams. Maintaining an entire joint inventory over sentences could be too much to ask. Store “smaller” pieces? Apply the chain rule: p(Colorless green ideas sleep furiously) = p(Colorless) * p(green | Colorless) * p(ideas | Colorless green) * p(sleep | Colorless green ideas) * p(furiously | Colorless green ideas sleep)

  25. N-Grams p(furiously | Colorless green ideas sleep) How much does “Colorless” influence the choice of “furiously?”

  26. N-Grams p(furiously | Colorless green ideas sleep) How much does “Colorless” influence the choice of “furiously?” Remove history and contextual info

  27. N-Grams p(furiously | Colorless green ideas sleep) How much does “Colorless” influence the choice of “furiously?” Remove history and contextual info: strike the distant words “Colorless green” from the conditioning context

  28. N-Grams p(furiously | Colorless green ideas sleep) How much does “Colorless” influence the choice of “furiously?” Remove history and contextual info p(furiously | Colorless green ideas sleep) ≈ p(furiously | ideas sleep)

  29. N-Grams p(Colorless green ideas sleep furiously) = p(Colorless) * p(green | Colorless) * p(ideas | Colorless green) * p(sleep | Colorless green ideas) * p(furiously | Colorless green ideas sleep)

  31. Trigrams p(Colorless green ideas sleep furiously) = p(Colorless) * p(green | Colorless) * p(ideas | Colorless green) * p(sleep | green ideas) * p(furiously | ideas sleep)

  33. Trigrams p(Colorless green ideas sleep furiously) = p(Colorless | <BOS> <BOS>) * p(green | <BOS> Colorless) * p(ideas | Colorless green) * p(sleep | green ideas) * p(furiously | ideas sleep). Consistent notation: pad the left with <BOS> (beginning-of-sentence) symbols

  34. Trigrams p(Colorless green ideas sleep furiously) = p(Colorless | <BOS> <BOS>) * p(green | <BOS> Colorless) * p(ideas | Colorless green) * p(sleep | green ideas) * p(furiously | ideas sleep) * p(<EOS> | sleep furiously). Consistent notation: pad the left with <BOS> (beginning-of-sentence) symbols. Fully proper distribution: pad the right with a single <EOS> (end-of-sentence) symbol
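A short sketch of this padding convention, listing the trigram factors for the example sentence (function and variable names are my own):

    def trigram_factors(sentence):
        # Two <BOS> symbols on the left give every word a two-word
        # history; one <EOS> on the right makes the distribution
        # over sentences sum to 1.
        tokens = ["<BOS>", "<BOS>"] + sentence.split() + ["<EOS>"]
        return [(tokens[i], tokens[i - 2], tokens[i - 1])
                for i in range(2, len(tokens))]

    for w, h1, h2 in trigram_factors("Colorless green ideas sleep furiously"):
        print(f"p({w} | {h1} {h2})")
    # p(Colorless | <BOS> <BOS>), p(green | <BOS> Colorless), ...,
    # p(<EOS> | sleep furiously)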

  35. N-Gram Terminology

      n   commonly called    history size (Markov order)   example
      1   unigram            0                             p(furiously)
      2   bigram             1                             p(furiously | sleep)
      3   trigram (3-gram)   2                             p(furiously | ideas sleep)
      4   4-gram             3                             p(furiously | green ideas sleep)
      n   n-gram             n−1                           p(w_i | w_{i−n+1} … w_{i−1})
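A small helper that makes the terminology concrete: an n-gram pairs each word with its history of n−1 preceding words (a sketch; the names are mine):

    def ngrams(tokens, n):
        # Each entry is (history of n-1 words, current word).
        return [(tuple(tokens[i - n + 1:i]), tokens[i])
                for i in range(n - 1, len(tokens))]

    tokens = "Colorless green ideas sleep furiously".split()
    print(ngrams(tokens, 1))  # unigrams: empty history (Markov order 0)
    print(ngrams(tokens, 3))  # trigrams: two-word history (Markov order 2)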

  39. N-Gram Probability: p(w_1, w_2, w_3, ⋯, w_S) = ∏_{i=1}^{S} p(w_i | w_{i−N+1}, ⋯, w_{i−1})
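Given any conditional estimate for p(w_i | history), the sentence probability is this product over positions. A sketch that computes it in log space to avoid underflow; `cond_prob` is a placeholder for whatever estimator is plugged in:

    import math

    def sentence_logprob(tokens, N, cond_prob):
        # log p(w_1 .. w_S) = sum_i log p(w_i | w_{i-N+1} .. w_{i-1}),
        # with <BOS>/<EOS> padding as on the earlier slides.
        padded = ["<BOS>"] * (N - 1) + tokens + ["<EOS>"]
        return sum(math.log(cond_prob(padded[i], tuple(padded[i - N + 1:i])))
                   for i in range(N - 1, len(padded)))

    # Runnable with a uniform placeholder estimator:
    print(sentence_logprob("colorless green ideas".split(), 2,
                           lambda w, history: 0.1))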

  40. Count-Based N-Grams (Unigrams): p(item) ∝ count(item)

  41. Count-Based N-Grams (Unigrams): p(z) ∝ count(z)

  42. Count-Based N-Grams (Unigrams): p(z) ∝ count(z), i.e. p(z) = count(z) / ∑_v count(v)

  43. Count-Based N-Grams (Unigrams): p(z) ∝ count(z), i.e. p(z) = count(z) / ∑_v count(v), where z is a word type and v ranges over all word types

  44. Count-Based N-Grams (Unigrams): p(z) ∝ count(z), i.e. p(z) = count(z) / W, where W is the number of tokens observed

  45. Count-Based N-Grams (Unigrams)

      The film got a great opening and the film went on to become a hit .

      Normalization: W = 16 tokens observed, so p(z) = count(z) / 16

      word type z   raw count count(z)   probability p(z)
      The           1                    1/16
      film          2                    1/8
      got           1                    1/16
      a             2                    1/8
      great         1                    1/16
      opening       1                    1/16
      and           1                    1/16
      the           1                    1/16
      went          1                    1/16
      on            1                    1/16
      to            1                    1/16
      become        1                    1/16
      hit           1                    1/16
      .             1                    1/16
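A few lines that reproduce the table above; note that “The” and “the” are distinct word types here, just as in the slide:

    from collections import Counter

    text = "The film got a great opening and the film went on to become a hit ."
    counts = Counter(text.split())   # raw count per word type
    W = sum(counts.values())         # 16 tokens observed

    for word, c in sorted(counts.items()):
        print(f"{word}\t{c}\t{c}/{W} = {c / W:.4f}")
    # film and a each appear twice -> 2/16 = 1/8; everything else 1/16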

  48. Count-Based N-Grams (Trigrams): p(z | x, y) ∝ count(x, y, z), the count of the sequence of items “x y z”. Order matters in the conditioning, and order matters in the count

  49. Count-Based N-Grams (Trigrams): p(z | x, y) ∝ count(x, y, z). Order matters: count(x, y, z) ≠ count(x, z, y) ≠ count(y, x, z) ≠ …
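A sketch of the resulting estimator, using the standard maximum-likelihood normalization count(x, y, z) / count(x, y) that the “∝” leaves implicit (the tiny corpus is mine):

    from collections import Counter

    tokens = "colorless green ideas sleep furiously".split()

    trigram_counts = Counter(zip(tokens, tokens[1:], tokens[2:]))  # count(x, y, z)
    bigram_counts  = Counter(zip(tokens, tokens[1:]))              # count(x, y)

    def p(z, x, y):
        # Relative frequency of z after the ordered context "x y".
        return trigram_counts[(x, y, z)] / bigram_counts[(x, y)]

    print(p("ideas", "colorless", "green"))  # 1.0 in this tiny corpus
    # Keys are ordered tuples, so count(x, y, z) != count(y, x, z) in general.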
