
Statistical Natural Language Processing: N-gram Language Models - PowerPoint PPT Presentation

Çağrı Çöltekin, Seminar für Sprachwissenschaft, University of Tübingen. Summer Semester 2017.


  1. Assigning probabilities to sentences: count and divide?
     • How do we calculate the probability of a sentence like P(I like pizza with spinach)?
     • Can we count the occurrences of the sentence and divide by the total number of sentences (in a large corpus)?
     • Short answer: No.
       – Many sentences are not observed even in very large corpora.
       – For the ones observed in a corpus, the probabilities will not reflect our intuition, or will not be useful in most applications.

  2. Assigning probabilities to sentences: applying the chain rule
     • The solution is to decompose: we use the probabilities of parts of the sentence (words) to calculate the probability of the whole sentence.
     • Using the chain rule of probability (without loss of generality), we can write
       P(w_1, w_2, …, w_m) = P(w_1) × P(w_2 | w_1) × P(w_3 | w_1, w_2) × … × P(w_m | w_1, w_2, …, w_{m−1})

  3. Example: applying the chain rule
     P(I like pizza with spinach) = P(I) × P(like | I) × P(pizza | I like) × P(with | I like pizza) × P(spinach | I like pizza with)
     • Did we solve the problem?
     • Not really: the last term is equally difficult to estimate.

  4. Assigning probabilities to sentences: the Markov assumption
     • We make a conditional independence assumption: the probability of a word is independent of the earlier history, given the preceding n − 1 words,
       P(w_i | w_1, …, w_{i−1}) = P(w_i | w_{i−n+1}, …, w_{i−1})
       and
       P(w_1, …, w_m) = ∏_{i=1}^{m} P(w_i | w_{i−n+1}, …, w_{i−1})
     • For example, with n = 2 (bigram, first-order Markov model):
       P(w_1, …, w_m) = ∏_{i=1}^{m} P(w_i | w_{i−1})

  5. Example: bigram probabilities of a sentence
     With n = 2, the chain-rule expansion is approximated as
       P(I like pizza with spinach) ≈ P(I) × P(like | I) × P(pizza | like) × P(with | pizza) × P(spinach | with)
     • Now, hopefully, we can count them in a corpus.

  6. Maximum-likelihood estimation (MLE)
     • Maximum-likelihood estimation of n-gram probabilities is based on their frequencies in a corpus.
     • We are interested in conditional probabilities of the form P(w_i | w_1, …, w_{i−1}), which we estimate using
       P(w_i | w_{i−n+1}, …, w_{i−1}) = C(w_{i−n+1} … w_i) / C(w_{i−n+1} … w_{i−1})
       where C(·) is the frequency (count) of the sequence in the corpus.
     • For example, the probability P(like | I) would be
       P(like | I) = C(I like) / C(I) = (number of times 'I like' occurs in the corpus) / (number of times 'I' occurs in the corpus)
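
The counting behind these estimates is easy to sketch in a few lines. The snippet below is a minimal illustration in Python, not code from the lecture; the two-sentence toy corpus and the whitespace tokenization are assumptions made for the example.

```python
from collections import Counter

# Toy corpus (an assumption for illustration); later slides use the same sentences.
corpus = ["I 'm sorry , Dave .", "I 'm afraid I can 't do that ."]
tokens = [w for sent in corpus for w in sent.split()]

unigrams = Counter(tokens)
# Note: counting over the concatenated token stream also creates the bigram ". I"
# across the sentence join; the boundary-marker slides below refine this.
bigrams = Counter(zip(tokens, tokens[1:]))

def p_mle(w2, w1):
    """MLE estimate of P(w2 | w1) = C(w1 w2) / C(w1)."""
    return bigrams[(w1, w2)] / unigrams[w1]

print(p_mle("'m", "I"))   # C(I 'm) / C(I) = 2/3
```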

  7. MLE estimation of an n-gram language model
     An n-gram model conditions on the n − 1 previous words:
     • In a 1-gram (unigram) model, P(w_i) = C(w_i) / N
     • In a 2-gram (bigram) model, P(w_i | w_{i−1}) = C(w_{i−1} w_i) / C(w_{i−1})
     • In a 3-gram (trigram) model, P(w_i | w_{i−2} w_{i−1}) = C(w_{i−2} w_{i−1} w_i) / C(w_{i−2} w_{i−1})
     Training an n-gram model involves estimating these parameters (the conditional probabilities).

  8. A small corpus: unigrams
     Unigrams are simply the single words (or tokens). Our toy corpus:
       I 'm sorry , Dave .
       I 'm afraid I can 't do that .
     When tokenized, we have 15 tokens and 11 types. Unigram counts:
       I 3    'm 2    . 2    sorry 1    , 1    Dave 1
       afraid 1    can 1    't 1    do 1    that 1
     Note: traditionally can't is tokenized as ca n't (similar to have n't, is n't, etc.), but for our purposes can 't is more readable.

  9. Unigram probability of a sentence
     Using the unigram counts above (N = 15):
       P(I 'm sorry , Dave .) = P(I) × P('m) × P(sorry) × P(,) × P(Dave) × P(.)
                              = 3/15 × 2/15 × 1/15 × 1/15 × 1/15 × 2/15
                              = 0.00000105
     • P(, 'm I . sorry Dave) = ?
     • What is the most likely sentence according to this model?
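
A short sketch of the unigram calculation above, under the same toy-corpus assumption; the function name p_unigram is ours, not the lecture's.

```python
from collections import Counter

tokens = "I 'm sorry , Dave . I 'm afraid I can 't do that .".split()
counts = Counter(tokens)   # 15 tokens, 11 types
N = len(tokens)

def p_unigram(sentence):
    """Unigram (bag-of-words) probability: product of word relative frequencies."""
    p = 1.0
    for w in sentence.split():
        p *= counts[w] / N
    return p

print(p_unigram("I 'm sorry , Dave ."))   # 3/15 * 2/15 * 1/15 * 1/15 * 1/15 * 2/15 ≈ 1.05e-06
print(p_unigram(", 'm I . sorry Dave"))   # same value: the unigram model ignores word order
```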

  10. N-gram models define probability distributions
      • An n-gram model defines a probability distribution over words:
        ∑_{w ∈ V} P(w) = 1
        For the toy corpus: P(I) = 0.200, P('m) = P(.) = 0.133, and P(w) = 0.067 for each of the remaining eight words, summing to 1.000.
      • They also define probability distributions over word sequences of equal size. For example (length 2),
        ∑_{w ∈ V} ∑_{v ∈ V} P(w) P(v) = 1
      • What about sentences?

  11. Unigram probabilities
      [Bar chart of the unigram relative frequencies in the toy corpus: 'I' at 0.2, ''m' and '.' at 0.133, and the remaining words at 0.067 each.]

  12. Unigram probabilities in a (slightly) larger corpus
      [Plot of MLE unigram probabilities by frequency rank in the Universal Declaration of Human Rights: a few very frequent words take most of the probability mass, and a long tail of rare words follows.]

  13. Zipf's law: a short digression
      The frequency of a word is inversely proportional to its rank:
        frequency ∝ 1 / rank,    or    rank × frequency = k
      • This is a recurring theme in (computational) linguistics: most linguistic units follow a more-or-less similar distribution.
      • Important consequence for us (in this lecture): even very large corpora will not contain some of the words (or n-grams).
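
If a tokenized corpus is available, the Zipf relation can be eyeballed with a few lines of Python. This is a hypothetical helper (the function name and the suggested corpus file are assumptions), shown only to make the rank × frequency ≈ constant claim concrete.

```python
from collections import Counter

def zipf_check(tokens, top=10):
    """Print rank, frequency, and rank * frequency for the most frequent words.
    Under Zipf's law the product stays roughly constant."""
    freqs = Counter(tokens).most_common()
    for rank, (word, freq) in enumerate(freqs[:top], start=1):
        print(f"{rank:4d} {word:15s} {freq:6d} {rank * freq:8d}")

# Usage with a corpus of your own, e.g. a plain-text copy of the UDHR:
# zipf_check(open("udhr.txt", encoding="utf-8").read().split())
```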

  14. Bigrams
      Bigrams are overlapping sequences of two tokens. Bigram counts for the toy corpus:
        I 'm 2       'm sorry 1   sorry , 1   , Dave 1   Dave . 1    'm afraid 1
        afraid I 1   I can 1      can 't 1    't do 1    do that 1   that . 1
      • What about the bigram '. I'?

  15. Sentence boundary markers
      If we want sentence probabilities, we need to mark the sentence boundaries:
        ⟨s⟩ I 'm sorry , Dave . ⟨/s⟩
        ⟨s⟩ I 'm afraid I can 't do that . ⟨/s⟩
      • The bigram '⟨s⟩ I' is not the same as the unigram 'I': including ⟨s⟩ allows us to predict likely words at the beginning of a sentence.
      • Including ⟨/s⟩ allows us to assign a proper probability distribution to sentences.
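
One possible way to implement the boundary markers is to pad each sentence before counting, as sketched below; the marker strings "<s>" and "</s>" and the helper name padded_ngrams are assumptions of this example.

```python
from collections import Counter

BOS, EOS = "<s>", "</s>"

def padded_ngrams(sentences, n):
    """Yield n-grams after adding n-1 start markers and one end marker per sentence."""
    for sent in sentences:
        toks = [BOS] * (n - 1) + sent.split() + [EOS]
        for i in range(len(toks) - n + 1):
            yield tuple(toks[i:i + n])

corpus = ["I 'm sorry , Dave .", "I 'm afraid I can 't do that ."]
bigram_counts = Counter(padded_ngrams(corpus, 2))
print(bigram_counts[("<s>", "I")])   # 2: 'I' starts both sentences
print(bigram_counts[(".", "<s>")])   # 0: no bigram crosses a sentence boundary
```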

  16. Calculating bigram probabilities (recap with some more detail)
      We want to calculate P(w_2 | w_1). From the chain rule:
        P(w_2 | w_1) = P(w_1, w_2) / P(w_1)
      and the MLE is
        P(w_2 | w_1) = (C(w_1 w_2) / N) / (C(w_1) / N) = C(w_1 w_2) / C(w_1)
      • P(w_2 | w_1) is the probability of w_2 given that the previous word is w_1.
      • P(w_1, w_2) is the probability of the sequence w_1 w_2.
      • P(w_1) here is the probability of w_1 occurring as the first item in a bigram, not its unigram probability.

  17. Bigram probabilities
      w_1 w_2     C(w_1 w_2)  C(w_1)  P(w_1 w_2)  P(w_1)  P(w_2|w_1)  P(w_2)
      ⟨s⟩ I           2         2       0.12       0.12     1.00       0.18
      I 'm            2         3       0.12       0.18     0.67       0.12
      'm sorry        1         2       0.06       0.12     0.50       0.06
      sorry ,         1         1       0.06       0.06     1.00       0.06
      , Dave          1         1       0.06       0.06     1.00       0.06
      Dave .          1         1       0.06       0.06     1.00       0.12
      'm afraid       1         2       0.06       0.12     0.50       0.06
      afraid I        1         1       0.06       0.06     1.00       0.18
      I can           1         3       0.06       0.18     0.33       0.06
      can 't          1         1       0.06       0.06     1.00       0.06
      't do           1         1       0.06       0.06     1.00       0.06
      do that         1         1       0.06       0.06     1.00       0.06
      that .          1         1       0.06       0.06     1.00       0.12
      . ⟨/s⟩          2         2       0.12       0.12     1.00       0.12
      Note: P(w_1) in this table is the bigram-initial probability, not the unigram probability.

  18. Sentence probability: bigram vs. unigram
      P_uni(⟨s⟩ I 'm sorry , Dave . ⟨/s⟩) = 2.83 × 10^-9
      P_bi(⟨s⟩ I 'm sorry , Dave . ⟨/s⟩) = 0.33
      [Bar chart comparing the per-word unigram and bigram probabilities across the sentence.]
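
The bigram value above can be reproduced from the padded counts. Again this is an illustrative sketch (the names and the toy corpus are assumptions), not the lecture's own implementation.

```python
from collections import Counter

corpus = ["I 'm sorry , Dave .", "I 'm afraid I can 't do that ."]
BOS, EOS = "<s>", "</s>"

tokens = []
for sent in corpus:
    tokens += [BOS] + sent.split() + [EOS]

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

def p_bigram(sentence):
    """MLE bigram probability of a sentence, including boundary markers."""
    toks = [BOS] + sentence.split() + [EOS]
    p = 1.0
    for w1, w2 in zip(toks, toks[1:]):
        p *= bigrams[(w1, w2)] / unigrams[w1]
    return p

print(p_bigram("I 'm sorry , Dave ."))    # 1 * 2/3 * 1/2 * 1 * 1 * 1 * 1 = 1/3
print(p_bigram(", 'm I . sorry Dave"))    # 0.0: contains bigrams never seen in training
```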

  19. Unigram vs. bigram probabilities in sentences and non-sentences
                                     P_uni          P_bi
      I 'm sorry , Dave .         2.83 × 10^-9      0.33
      , 'm I . sorry Dave         2.83 × 10^-9      0.00
      I 'm afraid , Dave .        2.83 × 10^-9      0.00
      • The unigram model assigns the same probability to the scrambled ', 'm I . sorry Dave' as to the original sentence.
      • The bigram model assigns probability 0 to both alternatives: the scrambled word order contains bigrams that never occur in the corpus, and the perfectly good 'I 'm afraid , Dave .' contains the unseen bigram 'afraid ,'.

  20. Bigram model as a finite-state automaton
      [Diagram: a finite-state automaton with one state per word of the toy corpus (plus ⟨s⟩ and ⟨/s⟩); transitions carry the bigram probabilities, e.g. ⟨s⟩ → I with probability 1.0 and I → 'm with probability 0.67.]

  21. Trigrams
      With boundary markers (two ⟨s⟩ markers for a trigram model) the corpus is
        ⟨s⟩ ⟨s⟩ I 'm sorry , Dave . ⟨/s⟩
        ⟨s⟩ ⟨s⟩ I 'm afraid I can 't do that . ⟨/s⟩
      Trigram counts:
        ⟨s⟩ ⟨s⟩ I 2    ⟨s⟩ I 'm 2     I 'm sorry 1    'm sorry , 1    sorry , Dave 1
        , Dave . 1     Dave . ⟨/s⟩ 1  I 'm afraid 1   'm afraid I 1   afraid I can 1
        I can 't 1     can 't do 1    't do that 1    do that . 1     that . ⟨/s⟩ 1
      • How many n-grams are there in a sentence of length m?

  22. Trigram probabilities of a sentence
      P_uni(I 'm sorry , Dave . ⟨/s⟩) = 2.83 × 10^-9
      P_bi(I 'm sorry , Dave . ⟨/s⟩) = 0.33
      P_tri(I 'm sorry , Dave . ⟨/s⟩) = 0.50
      [Bar chart comparing per-word unigram, bigram, and trigram probabilities across the sentence.]

  23. Short detour: colorless green ideas
      "But it must be recognized that the notion 'probability of a sentence' is an entirely useless one, under any known interpretation of this term." (Chomsky, 1968)
      • The following 'sentences' are categorically different:
        – Colorless green ideas sleep furiously
        – Furiously sleep ideas green colorless
      • Can n-gram models model the difference?
      • Should n-gram models model the difference?

  24. What do n-gram models model?
      • Some morphosyntax: the bigram 'ideas are' is (much) more likely than 'ideas is'.
      • Some semantics: 'bright ideas' is more likely than 'green ideas'.
      • Some cultural aspects of everyday language: 'Chinese food' is more likely than 'British food'.
      • …and more aspects of the 'usage' of language.
      N-gram models are practical tools, and they have been useful for many tasks.

  25. N-grams, so far …
      • N-gram language models are one of the basic tools in NLP.
      • They capture some linguistic (and non-linguistic) regularities that are useful in many applications.
      • The idea is to estimate the probability of a sentence based on the probabilities of its parts (sequences of words).
      • N-grams are n consecutive units in a sequence.
      • Typically we use sequences of words to estimate sentence probabilities, but other units are also possible: characters, phonemes, phrases, …
      • For most applications, we introduce sentence boundary markers.

  26. N-grams, so far …
      • The most straightforward method for estimating probabilities is using relative frequencies (which leads to the MLE).
      • Due to Zipf's law, as we increase n the counts become smaller (data sparseness) and many counts become 0.
      • If there are unknown words, we get 0 probabilities for both words and sentences.
      • In practice, bigrams or trigrams are used most commonly; depending on the application and dataset, n-grams of up to 5 are also used.

  27. How to test n-gram models?
      • Extrinsic evaluation: how (much) the model improves the target application, e.g.
        – speech recognition accuracy
        – BLEU score for machine translation
        – keystroke savings in predictive text
      • Intrinsic evaluation: the higher the probability assigned to a test set, the better the model. A few measures:
        – likelihood
        – (cross) entropy
        – perplexity

  28. Training and test set division
      • We (almost) never use a statistical (language) model on its training data.
      • Testing a model on the training set is misleading: the model may overfit the training set.
      • Always test your models on a separate test set.

  29. Intrinsic evaluation metrics: likelihood
      • The likelihood of a model M is the probability of the test set w given the model:
        L(M | w) = P(w | M) = ∏_{s ∈ w} P(s)
      • The higher the likelihood (for a given test set), the better the model.
      • Likelihood is sensitive to the test set size.
      • Practical note: (minus) log likelihood is more common, because of readability and ease of numerical manipulation.
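
A minimal sketch of the practical note: compute minus log likelihood by summing log probabilities rather than multiplying raw probabilities, which avoids numerical underflow. The per-sentence probabilities passed in are assumed to come from whatever model is being evaluated.

```python
import math

def neg_log_likelihood(sentence_probs):
    """Minus log (base 2) likelihood of a test set, given per-sentence probabilities P(s | M)."""
    return -sum(math.log2(p) for p in sentence_probs)

# Illustrative values only, e.g. as produced by the toy bigram model sketched earlier.
print(neg_log_likelihood([0.33, 0.01]))   # ≈ 8.24 bits; lower is better
```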

  30. Intrinsic evaluation metrics: cross entropy
      • The cross entropy of a language model on a test set w is
        H(w) = −(1/N) log2 P(w)
      • The lower the cross entropy, the better the model.
      • Remember that cross entropy is the average number of bits required to encode data coming from one distribution (the test set distribution) using an approximate distribution (the language model).
      • Note that cross entropy is not sensitive to the length of the test set.

  31. Intrinsic evaluation metrics: perplexity
      • Perplexity is a more common measure for evaluating language models:
        PP(w) = 2^{H(w)} = P(w)^{−1/N},  i.e. the N-th root of 1 / P(w)
      • Perplexity is the average branching factor.
      • Similar to cross entropy: lower is better, and it is not sensitive to the test set size.
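
A sketch of both metrics computed from per-token log probabilities. Here N is taken to be the number of tokens in the test set, which is one common convention and an assumption of this example rather than something fixed by the slides.

```python
import math

def cross_entropy(log2_word_probs):
    """H(w) = -(1/N) * log2 P(w), where log2 P(w) is the sum of per-token log2 probabilities."""
    n = len(log2_word_probs)
    return -sum(log2_word_probs) / n

def perplexity(log2_word_probs):
    """PP(w) = 2^H(w), the average branching factor."""
    return 2 ** cross_entropy(log2_word_probs)

# Example: a model that assigns probability 1/4 to every token of a 100-token test set.
logs = [math.log2(0.25)] * 100
print(cross_entropy(logs))   # 2.0 bits
print(perplexity(logs))      # 4.0
```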

  32. What do we do with unseen n-grams? (and other issues with MLE estimates)
      • Words (and word sequences) are distributed according to Zipf's law: many words are rare.
      • MLE assigns 0 probability to unseen words, and to sequences containing unseen words.
      • Even with non-zero probabilities, MLE overfits the training data.
      • One solution is smoothing: take some probability mass from seen events and assign it to unseen events.

  33. Smoothing: what is in the name?
      [Histograms of 5, 10, 30, and 1000 samples drawn from N(0, 1), illustrating how small samples give a jagged, incomplete estimate of a smooth underlying distribution.]

  34. Laplace smoothing (add-one smoothing)
      • The idea (from 1790): add one to all counts.
      • The probability of a word is estimated by
        P_{+1}(w) = (C(w) + 1) / (N + V)
        where N is the number of word tokens and V is the number of word types (the size of the vocabulary).
      • The probability of an unknown word is then (0 + 1) / (N + V).

  35. Laplace smoothing for n-grams
      • The probability of a bigram becomes
        P_{+1}(w_{i−1} w_i) = (C(w_{i−1} w_i) + 1) / (N + V^2)
      • and the conditional probability
        P_{+1}(w_i | w_{i−1}) = (C(w_{i−1} w_i) + 1) / (C(w_{i−1}) + V)
      • In general,
        P_{+1}(w_{i−n+1}^{i}) = (C(w_{i−n+1}^{i}) + 1) / (N + V^n)
        P_{+1}(w_i | w_{i−n+1}^{i−1}) = (C(w_{i−n+1}^{i}) + 1) / (C(w_{i−n+1}^{i−1}) + V)
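
A sketch of the add-one conditional bigram estimate on the toy corpus. Whether the boundary markers are counted in V is a modelling choice; here they are, which is an assumption of this example rather than something fixed by the slides.

```python
from collections import Counter

corpus = ["I 'm sorry , Dave .", "I 'm afraid I can 't do that ."]
BOS, EOS = "<s>", "</s>"
tokens = []
for sent in corpus:
    tokens += [BOS] + sent.split() + [EOS]

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
V = len(unigrams)   # vocabulary size; here this includes the boundary markers

def p_laplace(w2, w1):
    """Add-one smoothed P(w2 | w1) = (C(w1 w2) + 1) / (C(w1) + V)."""
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + V)

print(p_laplace("'m", "I"))    # seen bigram:   (2 + 1) / (3 + V)
print(p_laplace("Dave", "I"))  # unseen bigram: (0 + 1) / (3 + V), no longer zero
```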

  36. Bigram probabilities: non-smoothed vs. Laplace smoothing
      w_1 w_2     C+1   P_MLE(w_1 w_2)  P_+1(w_1 w_2)  P_MLE(w_2|w_1)  P_+1(w_2|w_1)
      ⟨s⟩ I        3       0.118           0.019          1.000           0.188
      I 'm         3       0.118           0.019          0.667           0.176
      'm sorry     2       0.059           0.012          0.500           0.125
      sorry ,      2       0.059           0.012          1.000           0.133
      , Dave       2       0.059           0.012          1.000           0.133
      Dave .       2       0.059           0.012          1.000           0.133
      'm afraid    2       0.059           0.012          0.500           0.125
      afraid I     2       0.059           0.012          1.000           0.133
      I can        2       0.059           0.012          0.333           0.118
      can 't       2       0.059           0.012          1.000           0.133
      't do        2       0.059           0.012          1.000           0.133
      do that      2       0.059           0.012          1.000           0.133
      that .       2       0.059           0.012          1.000           0.133
      . ⟨/s⟩       3       0.118           0.019          1.000           0.188
      ∑                    1.000           0.193

  37. MLE vs. Laplace probabilities: bigram probabilities in sentences and non-sentences
                                      P_MLE        P_+1
      I 'm sorry , Dave . ⟨/s⟩         0.33     1.44 × 10^-5
      , 'm I . sorry Dave ⟨/s⟩         0.00     3.34 × 10^-8
      I 'm afraid , Dave . ⟨/s⟩        0.00     7.22 × 10^-6
      Per-bigram values for the first sentence: P_MLE = 1.00, 0.67, 0.50, 1.00, 1.00, 1.00, 1.00 and P_+1 = 0.25, 0.23, 0.17, 0.18, 0.18, 0.18, 0.25.
      • Unlike MLE, the smoothed model assigns non-zero probability to the unseen word sequences, and it prefers the grammatical but unseen sentence to the scrambled one.

  38. How much mass does +1 smoothing steal?
      • Laplace smoothing reserves probability mass proportional to the size of the vocabulary:
        – unigrams: 3.33% of the mass goes to unseen events
        – bigrams: 83.33%
        – trigrams: 98.55%
      • This is just too much for large vocabularies and higher-order n-grams.
      • Note that only very few of the higher-order n-grams (e.g., trigrams) are actually possible.

  39. Lidstone correction (add-α smoothing)
      • A simple improvement over Laplace smoothing is to add 0 < α (typically α < 1) instead of 1:
        P_{+α}(w_{i−n+1}^{i}) = (C(w_{i−n+1}^{i}) + α) / (N + αV)
      • With smaller α values the model behaves similarly to MLE: it has high variance and overfits.
      • Larger α values reduce the variance, but have a large bias.

  40. How do we pick a good α value?
      • We want the α value that works best outside the training data.
      • Peeking at your test data during training/development is wrong.
      • This calls for another division of the available data: set aside a development set for tuning hyperparameters such as the smoothing parameters.
      • Alternatively, we can use k-fold cross-validation and take the α with the best average score (more on cross-validation later in this course).
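
One way to follow this advice is a simple grid search over α, scoring each candidate by perplexity on a held-out development set. The train/dev split below is hypothetical and tiny, purely to show the mechanics; the conditional add-α formula used is the standard analogue of the joint formula on the previous slide.

```python
import math
from collections import Counter

def train_counts(sentences):
    """Collect padded unigram and bigram counts from the training sentences."""
    tokens = []
    for sent in sentences:
        tokens += ["<s>"] + sent.split() + ["</s>"]
    return Counter(tokens), Counter(zip(tokens, tokens[1:]))

def dev_perplexity(alpha, unigrams, bigrams, dev_sentences):
    """Perplexity of an add-alpha smoothed bigram model on a development set."""
    V = len(unigrams)
    log_p, n_tokens = 0.0, 0
    for sent in dev_sentences:
        toks = ["<s>"] + sent.split() + ["</s>"]
        for w1, w2 in zip(toks, toks[1:]):
            p = (bigrams[(w1, w2)] + alpha) / (unigrams[w1] + alpha * V)
            log_p += math.log2(p)
            n_tokens += 1
    return 2 ** (-log_p / n_tokens)

# Hypothetical split; in practice train and dev would be separate, larger corpora.
train = ["I 'm sorry , Dave .", "I 'm afraid I can 't do that ."]
dev = ["I 'm sorry I can 't do that ."]
uni, bi = train_counts(train)
best = min((dev_perplexity(a, uni, bi, dev), a) for a in (0.01, 0.1, 0.5, 1.0))
print(best)   # (lowest dev perplexity, corresponding alpha)
```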

  41. Absolute discounting
      • An alternative to additive smoothing is to reserve an explicit amount of probability mass, ε, for the unseen events.
      • The probabilities of the known events have to be re-normalized.
      • How do we decide what ε value to use?
      • This is often not very convenient.

  42. Good-Turing smoothing: the 'discounting' view
      • Estimate the probability mass to be reserved for novel n-grams using the observed n-grams. Novel events, in our training set, are the ones that occur only once:
        p_0 = n_1 / n
        where n_1 is the number of distinct n-grams with frequency 1 in the training data, and n is the total number of observed n-grams.
      • Now we need to discount this mass from the higher counts. The probability of an n-gram that occurred r times in the corpus becomes
        P_GT = (r + 1) n_{r+1} / (n_r n)

  43. Some terminology: frequencies of frequencies and equivalence classes
      For the toy unigram counts, the frequencies of frequencies are n_1 = 8, n_2 = 2, n_3 = 1.
      • We often put n-grams into equivalence classes.
      • Good-Turing forms the equivalence classes based on frequency.
      • Note: n = ∑_r r × n_r

  44. Good-Turing estimation: a leave-one-out justification
      Leave each n-gram out in turn, and count how often the left-out n-gram has frequency r in the remaining data. This gives:
      • novel n-grams: n_1 / n
      • n-grams with frequency 1 (singletons): (1 + 1) n_2 / (n_1 n)
      • n-grams with frequency 2 (doubletons*): (2 + 1) n_3 / (n_2 n)
      * Yes, this seems to be a word.

  45. Adjusted counts
      Sometimes it is instructive to look at the 'effective count' of an n-gram under the smoothing method. For Good-Turing smoothing, the adjusted count r* is
        r* = (r + 1) n_{r+1} / n_r
      • novel items: a total effective count of n_1
      • singletons: r* = 2 × n_2 / n_1
      • doubletons: r* = 3 × n_3 / n_2
      • …

  46. Good-Turing example
      With the unigram counts above (n = 15, n_1 = 8, n_2 = 2, n_3 = 1):
      • total probability mass reserved for unseen words (the, a, …): p_0 = n_1 / n = 8/15
      • words seen once (that, do, …): P_GT = 2 × n_2 / (n_1 × n) = (2 × 2) / (8 × 15) ≈ 0.033
      • words seen twice ('m, .): P_GT = 3 × n_3 / (n_2 × n) = (3 × 1) / (2 × 15) = 0.1
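
These quantities can be reproduced by computing the frequencies of frequencies and the Good-Turing adjusted counts directly. The code below is an illustrative sketch on the toy unigram counts, and it already runs into the zero-count issue discussed on the next slide.

```python
from collections import Counter

tokens = "I 'm sorry , Dave . I 'm afraid I can 't do that .".split()
counts = Counter(tokens)          # word frequencies
n_r = Counter(counts.values())    # frequencies of frequencies: n_1=8, n_2=2, n_3=1
n = sum(counts.values())          # 15 tokens

p0 = n_r[1] / n                   # mass reserved for unseen words: 8/15
print(p0)

def adjusted_count(r):
    """Good-Turing adjusted count r* = (r + 1) * n_{r+1} / n_r.
    Breaks down when n_{r+1} is zero (see the next slide for a fix)."""
    return (r + 1) * n_r[r + 1] / n_r[r]

print(adjusted_count(1))   # 2 * n_2/n_1 = 2 * 2/8 = 0.5
print(adjusted_count(2))   # 3 * n_3/n_2 = 3 * 1/2 = 1.5
print(adjusted_count(3))   # 4 * n_4/n_3 = 0.0 -- the zero-count problem
```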

  47. Issues with Good-Turing discounting (with some solutions)
      • Zero counts: we cannot assign probabilities if n_{r+1} = 0.
      • The estimates for some of the frequencies of frequencies are unreliable.
      • A solution is to replace n_r with smoothed counts z_r.
      • A well-known technique (simple Good-Turing) smooths the n_r values with linear regression in log space:
        log z_r = a + b log r
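
A simplified sketch of this idea: fit log z_r = a + b log r by ordinary least squares over the observed (r, n_r) points, then read off smoothed values for any r. Real simple Good-Turing implementations do a bit more (for example, switching between raw and smoothed counts), so treat this as an assumption-laden illustration rather than the full method.

```python
import math

def fit_loglinear(n_r):
    """Fit log(z_r) = a + b*log(r) by least squares over the observed (r, n_r) pairs."""
    pts = [(math.log(r), math.log(nr)) for r, nr in n_r.items() if nr > 0]
    k = len(pts)
    mx = sum(x for x, _ in pts) / k
    my = sum(y for _, y in pts) / k
    b = sum((x - mx) * (y - my) for x, y in pts) / sum((x - mx) ** 2 for x, _ in pts)
    a = my - b * mx
    return a, b

def smoothed_n(r, a, b):
    """Smoothed frequency-of-frequency z_r = exp(a) * r**b, defined for every r > 0."""
    return math.exp(a + b * math.log(r))

# With the toy counts n_1 = 8, n_2 = 2, n_3 = 1 from the previous slides:
a, b = fit_loglinear({1: 8, 2: 2, 3: 1})
print(smoothed_n(4, a, b))   # an estimate for n_4, which was 0 in the raw counts
```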


  48. N-grams, so far …
      • There are two different ways of evaluating n-gram models:
        – intrinsic: likelihood, (cross) entropy, perplexity
        – extrinsic: success in an external application
      • Intrinsic evaluation metrics often correlate well with the extrinsic metrics.
      • Test your n-gram models on an 'unseen' test set.

  49. N-grams, so far …
      • Smoothing methods solve the zero-count problem (and also reduce the variance).
      • Smoothing takes away some probability mass from the observed n-grams and assigns it to unobserved ones.
      • Additive smoothing adds a constant α to all counts:
        – α = 1 (Laplace smoothing) simply adds one to all counts; simple, but often not very useful.
        – A simple correction is to add a smaller α, which requires tuning over a development set.
      • Discounting removes a fixed amount of probability mass, ε, from the observed n-grams; we need to re-normalize the probability estimates, and again we need a development set to tune ε.
      • Good-Turing discounting reserves the probability mass based on the n-grams seen only once: p_0 = n_1 / n.

  50. Not all (unknown) n-grams are equal
      • Let's assume black squirrel is an unknown bigram. With add-one smoothing,
        P_{+1}(squirrel | black) = (0 + 1) / (C(black) + V)
      • How about black wug, where wug is an unknown word?
        P_{+1}(wug | black) = (0 + 1) / (C(black) + V)
        The two unknown bigrams get exactly the same probability.
      • Would it make a difference if we used a better smoothing method (e.g., Good-Turing)?
