Example: bigram probabilities of a sentence

P(I like pizza with spinach) = P(like | I) × P(pizza | I like) × P(with | I like pizza) × P(spinach | I like pizza with)
Example: bigram probabilities of a sentence, with first-order Markov assumption

P(I like pizza with spinach) = P(like | I) × P(pizza | like) × P(with | pizza) × P(spinach | with)

• Now, hopefully, we can count them in a corpus
Maximum-likelihood estimation (MLE)

• The MLE of n-gram probabilities is based on their frequencies in a corpus
• We are interested in conditional probabilities of the form P(w_i | w_1, ..., w_{i−1}), which we estimate using
  P(w_i | w_{i−n+1}, ..., w_{i−1}) = C(w_{i−n+1} ... w_i) / C(w_{i−n+1} ... w_{i−1})
  where C(·) is the frequency (count) of the sequence in the corpus.
• For example, the probability P(like | I) would be
  P(like | I) = C(I like) / C(I) = number of times 'I like' occurs in the corpus / number of times 'I' occurs in the corpus
MLE estimation of an n-gram language model

An n-gram model conditions on the n − 1 previous words:

  unigram:  P(w_i) = C(w_i) / N
  bigram:   P(w_i | w_{i−1}) = C(w_{i−1} w_i) / C(w_{i−1})
  trigram:  P(w_i | w_{i−2} w_{i−1}) = C(w_{i−2} w_{i−1} w_i) / C(w_{i−2} w_{i−1})

The parameters of an n-gram model are these conditional probabilities.
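A minimal Python sketch of MLE estimation on the toy corpus used in the following slides (the two sentences are treated as one token stream here, and the helper names are made up for illustration):

```python
from collections import Counter

# Toy corpus from the slides (both sentences, tokenized, no boundary markers yet).
tokens = "I 'm afraid I can 't do that . I 'm sorry , Dave .".split()

def ngram_counts(tokens, n):
    """Count all n-grams (as tuples) in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

unigrams = ngram_counts(tokens, 1)
bigrams = ngram_counts(tokens, 2)

def p_mle(word, context):
    """MLE bigram estimate P(word | context) = C(context word) / C(context)."""
    return bigrams[(context, word)] / unigrams[(context,)]

print(p_mle("'m", "I"))   # C(I 'm) / C(I) = 2 / 3
```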
Unigrams

Unigrams are simply the single words (or tokens).

A small corpus:
  I 'm afraid I can 't do that .
  I 'm sorry , Dave .

When tokenized, we have 15 tokens and 11 types.

Unigram counts: I: 3, 'm: 2, .: 2, sorry: 1, ,: 1, Dave: 1, afraid: 1, can: 1, 't: 1, do: 1, that: 1

Note: traditionally, can't is tokenized as ca|n't (similar to have|n't, is|n't, etc.), but for our purposes can|'t is more readable.
Unigram probability of a sentence

Unigram counts: I: 3, 'm: 2, .: 2, sorry: 1, ,: 1, Dave: 1, afraid: 1, can: 1, 't: 1, do: 1, that: 1 (15 tokens)

P(I 'm sorry , Dave .) = P(I) × P('m) × P(sorry) × P(,) × P(Dave) × P(.)
                       = 3/15 × 2/15 × 1/15 × 1/15 × 1/15 × 2/15
                       = 0.00000105

• P(, 'm I . sorry Dave) = ?
• Where did all the probability mass go?
• What is the most likely sentence according to this model?
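The same computation in a few lines of Python; this is only meant to reproduce the toy example, not a serious model:

```python
from collections import Counter

tokens = "I 'm afraid I can 't do that . I 'm sorry , Dave .".split()
counts = Counter(tokens)
N = len(tokens)               # 15 tokens

def p_unigram(sentence):
    """Unigram probability of a token sequence: product of relative frequencies."""
    p = 1.0
    for w in sentence.split():
        p *= counts[w] / N
    return p

print(p_unigram("I 'm sorry , Dave ."))   # ~1.05e-06, as on the slide
print(p_unigram(", 'm I . sorry Dave"))   # the same value: unigrams ignore word order
```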
N-gram models define probability distributions over words

• An n-gram model defines a probability distribution over words:
  ∑_{w ∈ V} P(w) = 1
• They also define probability distributions over word sequences of equal size. For example (length 2):
  ∑_{w ∈ V} ∑_{v ∈ V} P(w) P(v) = 1
• What about sentences?

Unigram probabilities in the example corpus: I: 0.200, 'm: 0.133, .: 0.133, and 0.067 for each of the other eight words (sum: 1.000).
Unigram probabilities

[Bar chart of the unigram MLE probabilities in the toy corpus: I (count 3, 0.2), 'm and . (count 2, 0.133 each), and sorry, ,, Dave, afraid, can, 't, do, that (count 1, 0.067 each).]
Unigram probabilities in a (slightly) larger corpus

[Plot of MLE unigram probabilities against frequency rank in the Universal Declaration of Human Rights: the top-ranked word has probability around 0.06, the probabilities fall off sharply within the first few hundred ranks, and a long tail follows.]
Zipf's law – a short divergence

The frequency of a word is inversely proportional to its rank:

  frequency ∝ 1/rank   or   rank × frequency = k

• This is a recurring theme in (computational) linguistics: most linguistic units follow a more-or-less similar distribution
• Important consequences for us (in this lecture):
  – even very large corpora will not contain some of the words (or n-grams)
  – there will be many low-probability events (words/n-grams)
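A quick way to eyeball Zipf's law on your own data; `corpus.txt` is a placeholder path, not a file referenced by the slides:

```python
from collections import Counter

# 'corpus.txt' is a placeholder: any reasonably large plain-text file will do.
with open("corpus.txt", encoding="utf-8") as f:
    words = f.read().lower().split()

freqs = sorted(Counter(words).values(), reverse=True)

# Under Zipf's law, rank * frequency should stay in the same ballpark.
for rank in (1, 2, 5, 10, 50, 100):
    if rank <= len(freqs):
        print(f"rank {rank:4d}  freq {freqs[rank - 1]:6d}  rank*freq {rank * freqs[rank - 1]}")
```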
Bigrams

Bigrams are overlapping sequences of two tokens.

Bigram counts:
  I 'm: 2, 'm sorry: 1, sorry ,: 1, , Dave: 1, Dave .: 1, 'm afraid: 1, afraid I: 1, I can: 1, can 't: 1, 't do: 1, do that: 1, that .: 1

• What about the bigram '. I'?
Sentence boundary markers

If we want sentence probabilities, we need to mark them:

  ⟨s⟩ I 'm sorry , Dave . ⟨/s⟩
  ⟨s⟩ I 'm afraid I can 't do that . ⟨/s⟩

• The bigram '⟨s⟩ I' is not the same as the unigram 'I'
• Including ⟨s⟩ allows us to predict likely words at the beginning of a sentence
• Including ⟨/s⟩ allows us to assign a proper probability distribution to sentences
Calculating bigram probabilities: recap with some more detail

We want to calculate P(w_2 | w_1). From the chain rule:

  P(w_2 | w_1) = P(w_1, w_2) / P(w_1)

and the MLE is

  P(w_2 | w_1) = (C(w_1 w_2) / N) / (C(w_1) / N) = C(w_1 w_2) / C(w_1)

• P(w_2 | w_1) is the probability of w_2 given that the previous word is w_1
• P(w_1, w_2) is the probability of the sequence w_1 w_2
• P(w_1) is the probability of w_1 occurring as the first item in a bigram, not its unigram probability
Bigram probabilities

  w_1 w_2     C(w_1 w_2)  C(w_1)  P(w_1 w_2)  P(w_1)  P(w_2 | w_1)  P(w_2)
  ⟨s⟩ I           2          2       0.12      0.12       1.00       0.18
  I 'm            2          3       0.12      0.18       0.67       0.12
  'm sorry        1          2       0.06      0.12       0.50       0.06
  sorry ,         1          1       0.06      0.06       1.00       0.06
  , Dave          1          1       0.06      0.06       1.00       0.06
  Dave .          1          1       0.06      0.06       1.00       0.12
  'm afraid       1          2       0.06      0.12       0.50       0.06
  afraid I        1          1       0.06      0.06       1.00       0.18
  I can           1          3       0.06      0.18       0.33       0.06
  can 't          1          1       0.06      0.06       1.00       0.06
  't do           1          1       0.06      0.06       1.00       0.06
  do that         1          1       0.06      0.06       1.00       0.06
  that .          1          1       0.06      0.06       1.00       0.12
  . ⟨/s⟩          2          2       0.12      0.12       1.00       0.12

(P(w_2) is the unigram probability, shown for comparison.)
Sentence probability: bigram vs. unigram

[Bar chart comparing the per-word unigram and bigram probabilities in '⟨s⟩ I 'm sorry , Dave . ⟨/s⟩'.]

P_uni(⟨s⟩ I 'm sorry , Dave . ⟨/s⟩) = 2.83 × 10⁻⁹
P_bi(⟨s⟩ I 'm sorry , Dave . ⟨/s⟩) = 0.33
Unigram vs. bigram probabilities, in sentences and non-sentences

  w:      I     'm    sorry   ,     Dave  .       product
  P_uni   0.18  0.12  0.06    0.06  0.06  0.12    4.97 × 10⁻⁷
  P_bi    1.00  0.67  0.50    1.00  1.00  1.00    0.33

  w:      ,     'm    I       .     sorry Dave    product
  P_uni   0.06  0.12  0.18    0.12  0.06  0.06    4.97 × 10⁻⁷
  P_bi    0.00  0.00  0.00    0.00  0.00  0.00    0.00

  w:      I     'm    afraid  ,     Dave  .       product
  P_uni   0.18  0.12  0.06    0.06  0.06  0.12    4.97 × 10⁻⁷
  P_bi    1.00  0.67  0.50    0.00  1.00  1.00    0.00

(Each P_bi cell is the probability of the word given the preceding token, starting from ⟨s⟩.)
Bigram models as weighted finite-state automata

[State diagram of the bigram model: states for ⟨s⟩, I, 'm, sorry, ,, Dave, ., afraid, can, 't, do, that, ⟨/s⟩, with transitions weighted by the bigram probabilities (e.g. ⟨s⟩→I with 1.0, I→'m with 0.67, I→can with 0.33, 'm→sorry and 'm→afraid with 0.5 each, and the remaining transitions with 1.0).]
Trigrams

Trigram counts (with two start-of-sentence markers, so that the first word also has a full trigram context):

  ⟨s⟩ ⟨s⟩ I 'm sorry , Dave . ⟨/s⟩
  ⟨s⟩ ⟨s⟩ I 'm afraid I can 't do that . ⟨/s⟩

  ⟨s⟩ ⟨s⟩ I: 2, ⟨s⟩ I 'm: 2, I 'm sorry: 1, 'm sorry ,: 1, sorry , Dave: 1, , Dave .: 1, Dave . ⟨/s⟩: 1, I 'm afraid: 1, 'm afraid I: 1, afraid I can: 1, I can 't: 1, can 't do: 1, 't do that: 1, do that .: 1, that . ⟨/s⟩: 1

• How many n-grams are there in a sentence of length m? (see the counting sketch below)
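One way to convince yourself of the answer (m − n + 1 n-grams without boundary markers, and m + 1 once we pad with n − 1 start markers and one end marker) is a small counting sketch; the padding convention mirrors the doubled ⟨s⟩ above, written here as plain `<s>`/`</s>`:

```python
def ngrams(tokens, n):
    """All n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

m = 10
sentence = [f"w{i}" for i in range(m)]   # a dummy sentence of length m

for n in (1, 2, 3):
    plain = ngrams(sentence, n)
    padded = ngrams(["<s>"] * (n - 1) + sentence + ["</s>"], n)
    print(n, len(plain), len(padded))    # m - n + 1 without padding, m + 1 with it
```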
Trigram probabilities of a sentence

[Bar chart comparing the per-word unigram, bigram, and trigram probabilities in 'I 'm sorry , Dave . ⟨/s⟩'.]

P_uni(I 'm sorry , Dave . ⟨/s⟩) = 2.83 × 10⁻⁹
P_bi(I 'm sorry , Dave . ⟨/s⟩) = 0.33
P_tri(I 'm sorry , Dave . ⟨/s⟩) = 0.50
Short detour: colorless green ideas

  But it must be recognized that the notion 'probability of a sentence' is an entirely useless one, under any known interpretation of this term. — Chomsky (1968)

• The following 'sentences' are categorically different:
  – Colorless green ideas sleep furiously
  – Furiously sleep ideas green colorless
• Can n-gram models model the difference?
• Should n-gram models model the difference?
What do n-gram models model?

• Some morphosyntax: the bigram 'ideas are' is (much) more likely than 'ideas is'
• Some semantics: 'bright ideas' is more likely than 'green ideas'
• Some cultural aspects of everyday language: 'Chinese food' is more likely than 'British food'
• … and more aspects of the 'usage' of language
How to test n-gram models?

Like any ML method, the test set has to be different from the training set.

Extrinsic: improvement in the target application due to the language model, e.g.
  • speech recognition accuracy
  • BLEU score for machine translation
  • keystroke savings in predictive text applications

Intrinsic: the higher the probability assigned to a test set, the better the model. A few measures:
  • likelihood
  • (cross) entropy
  • perplexity
Intrinsic evaluation metrics: likelihood

• The likelihood of a model M is the probability of the (test) set w given the model:
  L(M | w) = P(w | M) = ∏_{s ∈ w} P(s)
• The higher the likelihood (for a given test set), the better the model
• Likelihood is sensitive to the test set size
• Practical note: (minus) log likelihood is used more commonly, because of ease of numerical manipulation
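A small sketch of how likelihood and log likelihood are computed in practice; the per-sentence probabilities below are hypothetical stand-ins for what a model M would assign:

```python
import math

# Hypothetical per-sentence probabilities P(s) assigned by some model M to a test set.
sentence_probs = [0.33, 0.050, 0.0012]

likelihood = math.prod(sentence_probs)                       # product over test sentences
log_likelihood = sum(math.log(p) for p in sentence_probs)    # what is usually reported

print(likelihood, log_likelihood)
```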
Intrinsic evaluation metrics: cross entropy

• The cross entropy of a language model on a test set w is
  H(w) = −(1/N) ∑_i log₂ P̂(w_i)
• The lower the cross entropy, the better the model
• Cross entropy is not sensitive to the test-set size
• Reminder: cross entropy is the number of bits required to encode data coming from P using another (approximate) distribution P̂:
  H(P, P̂) = −∑_x P(x) log P̂(x)
Intrinsic evaluation metrics: perplexity

• Perplexity is a more common measure for evaluating language models:
  PP(w) = 2^{H(w)} = P(w)^{−1/N}, i.e. the N-th root of 1/P(w)
• Perplexity is the average branching factor
• Similar to cross entropy:
  – lower is better
  – not sensitive to test set size
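A sketch showing that the two formulations of perplexity (2 raised to the cross entropy, and the inverse probability normalized by the number of tokens) agree; the per-token probabilities are hypothetical:

```python
import math

# Hypothetical per-token probabilities assigned by a model to a small test text.
token_probs = [0.2, 0.1, 0.05, 0.1, 0.3]
N = len(token_probs)

cross_entropy = -sum(math.log2(p) for p in token_probs) / N   # bits per token
pp_from_entropy = 2 ** cross_entropy
pp_direct = math.prod(token_probs) ** (-1 / N)                # P(w) ** (-1/N)

print(cross_entropy, pp_from_entropy, pp_direct)              # the last two agree
```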
What do we do with unseen n-grams? …and other issues with MLE estimates

• Words (and word sequences) are distributed according to Zipf's law: many words are rare
• MLE will assign 0 probability to unseen words, and to sequences containing unseen words
• Even with non-zero probabilities, MLE overfits the training data
• One solution is smoothing: take some probability mass from seen words and assign it to unseen words
Laplace smoothing (add-one smoothing)

• The idea (from 1790): add one to all counts
• The probability of a word is estimated by
  P+1(w) = (C(w) + 1) / (N + V)
  where N is the number of word tokens and V is the number of word types (the size of the vocabulary)
• The probability of an unknown word is then
  (0 + 1) / (N + V)
Laplace smoothing for n-grams

• The probability of a bigram becomes
  P+1(w_{i−1} w_i) = (C(w_{i−1} w_i) + 1) / (N + V²)
• and the conditional probability
  P+1(w_i | w_{i−1}) = (C(w_{i−1} w_i) + 1) / (C(w_{i−1}) + V)
• In general,
  P+1(w_{i−n+1}^{i}) = (C(w_{i−n+1}^{i}) + 1) / (N + Vⁿ)
  P+1(w_i | w_{i−n+1}^{i−1}) = (C(w_{i−n+1}^{i}) + 1) / (C(w_{i−n+1}^{i−1}) + V)
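A minimal sketch of add-one smoothed bigrams on the toy corpus; exact numbers depend on choices the slides leave open, e.g. whether the boundary markers count towards V, so they need not match the table on the next slide:

```python
from collections import Counter

sentences = ["I 'm sorry , Dave .", "I 'm afraid I can 't do that ."]
unigrams, bigrams = Counter(), Counter()
for s in sentences:
    toks = ["<s>"] + s.split() + ["</s>"]   # pad per sentence, no cross-sentence bigrams
    unigrams.update(toks)
    bigrams.update(zip(toks, toks[1:]))

V = len(unigrams)   # vocabulary size; counting the markers here is a modelling choice

def p_add1(word, context):
    """Add-one smoothed conditional: (C(context word) + 1) / (C(context) + V)."""
    return (bigrams[(context, word)] + 1) / (unigrams[context] + V)

print(p_add1("'m", "I"))        # a seen bigram
print(p_add1("wug", "black"))   # unseen bigram with unseen context: 1 / (0 + V)
```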
MLE vs. Laplace smoothing: bigram probabilities

  w_1 w_2     C+1   P_MLE(w_1 w_2)  P+1(w_1 w_2)  P_MLE(w_2 | w_1)  P+1(w_2 | w_1)
  ⟨s⟩ I        3        0.118          0.019           1.000            0.188
  I 'm         3        0.118          0.019           0.667            0.176
  'm sorry     2        0.059          0.012           0.500            0.125
  sorry ,      2        0.059          0.012           1.000            0.133
  , Dave       2        0.059          0.012           1.000            0.133
  Dave .       2        0.059          0.012           1.000            0.133
  'm afraid    2        0.059          0.012           0.500            0.125
  afraid I     2        0.059          0.012           1.000            0.133
  I can        2        0.059          0.012           0.333            0.118
  can 't       2        0.059          0.012           1.000            0.133
  't do        2        0.059          0.012           1.000            0.133
  do that      2        0.059          0.012           1.000            0.133
  that .       2        0.059          0.012           1.000            0.133
  . ⟨/s⟩       3        0.118          0.019           1.000            0.188
  ∑                     1.000          0.193
MLE vs. Laplace probabilities: probabilities of sentences and non-sentences (based on the bigram model)

  w:      I     'm    sorry   ,     Dave  .     ⟨/s⟩     product
  P_MLE   1.00  0.67  0.50    1.00  1.00  1.00  1.00     0.33
  P+1     0.19  0.18  0.13    0.13  0.13  0.13  0.19     1.84 × 10⁻⁶

  w:      ,     'm    I       .     sorry Dave  ⟨/s⟩     product
  P_MLE   0.00  0.00  0.00    0.00  0.00  0.00  0.00     0.00
  P+1     0.03  0.03  0.03    0.03  0.03  0.03  0.03     1.17 × 10⁻¹²

  w:      I     'm    afraid  ,     Dave  .     ⟨/s⟩     product
  P_MLE   1.00  0.67  0.50    0.00  1.00  1.00  1.00     0.00
  P+1     0.19  0.18  0.13    0.03  0.13  0.13  0.19     4.45 × 10⁻⁷

(Each cell is the probability of the word given the preceding token, starting from ⟨s⟩.)
How much probability mass does +1 smoothing steal?

• Laplace smoothing reserves probability mass proportional to the size of the vocabulary
• This is just too much for large vocabularies and higher-order n-grams
• Note that only very few of the possible higher-level n-grams (e.g., trigrams) are seen

[Pie charts of the probability mass going to unseen events in the toy corpus: unigrams 3.33 % unseen, bigrams 83.33 % unseen, trigrams 98.55 % unseen.]
Lidstone correction (add-α smoothing)

• A simple improvement over Laplace smoothing is adding α instead of 1:
  P+α(w_i | w_{i−n+1}^{i−1}) = (C(w_{i−n+1}^{i}) + α) / (C(w_{i−n+1}^{i−1}) + αV)
• With smaller α values the model behaves more like MLE: it overfits, i.e. it has high variance
• Larger α values reduce overfitting/variance, but result in larger bias

We need to tune α like any other hyperparameter (a small tuning sketch follows).
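A sketch of tuning α on held-out data by minimizing cross entropy, as one would for any hyperparameter; the held-out sentence is made up for illustration:

```python
import math
from collections import Counter

train = ["I 'm sorry , Dave .", "I 'm afraid I can 't do that ."]
heldout = ["I 'm sorry Dave ."]              # hypothetical held-out data

unigrams, bigrams = Counter(), Counter()
for s in train:
    toks = ["<s>"] + s.split() + ["</s>"]
    unigrams.update(toks)
    bigrams.update(zip(toks, toks[1:]))
V = len(unigrams)

def p_add_alpha(word, context, alpha):
    """Add-alpha smoothed conditional bigram probability."""
    return (bigrams[(context, word)] + alpha) / (unigrams[context] + alpha * V)

def heldout_cross_entropy(alpha):
    log_p, n = 0.0, 0
    for s in heldout:
        toks = ["<s>"] + s.split() + ["</s>"]
        for ctx, w in zip(toks, toks[1:]):
            log_p += math.log2(p_add_alpha(w, ctx, alpha))
            n += 1
    return -log_p / n

for alpha in (1.0, 0.5, 0.1, 0.01):
    print(alpha, heldout_cross_entropy(alpha))   # pick the alpha with the lowest value
```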
Absolute discounting

• An alternative to additive smoothing is to reserve an explicit amount of probability mass, ϵ, for the unseen events
• The probabilities of the known events have to be re-normalized
• How do we decide what ϵ value to use?
Good-Turing smoothing

• Estimate the probability mass to be reserved for novel n-grams using the observed n-grams
• Novel events in our training set are the ones that occur once:
  p_0 = n_1 / n
  where n_1 is the number of distinct n-grams with frequency 1 in the training data and n is the total number of observed n-grams
• Now we need to discount this mass from the higher counts
• The probability of an n-gram that occurred r times in the corpus becomes
  (r + 1) n_{r+1} / (n_r n)
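A sketch of the Good-Turing estimates for the unigram counts of the toy corpus; it reproduces the numbers on the example slide below and also exposes the zero-count problem discussed after that (n_4 = 0, so the most frequent word would get probability 0):

```python
from collections import Counter

# Unigram counts of the toy corpus.
counts = Counter({"I": 3, "'m": 2, ".": 2, "sorry": 1, ",": 1, "Dave": 1,
                  "afraid": 1, "can": 1, "'t": 1, "do": 1, "that": 1})
N = sum(counts.values())                 # 15 tokens

# n_r = number of distinct words seen exactly r times
n = Counter(counts.values())             # {1: 8, 2: 2, 3: 1}

p_unseen = n[1] / N                      # total mass reserved for novel words: 8/15

def p_gt(word):
    """Good-Turing probability of a seen word: (r+1) * n_{r+1} / (n_r * N)."""
    r = counts[word]
    return (r + 1) * n[r + 1] / (n[r] * N)

print(p_unseen)          # 8/15
print(p_gt("that"))      # 2 * 2 / (8 * 15)
print(p_gt("'m"))        # 3 * 1 / (2 * 15)
print(p_gt("I"))         # 0.0, since n_4 = 0 -- the zero-count issue
```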
Good-Turing example

[Bar chart of the unigram counts with the frequency-of-frequency bands marked: n_1 = 8 (words seen once), n_2 = 2 (seen twice), n_3 = 1 (seen three times).]

P_GT(the) + P_GT(a) + … = 8/15 (the mass reserved for all unseen words)
P_GT(that) = P_GT(do) = … = (2 × 2) / (15 × 8)
P_GT('m) = P_GT(.) = (3 × 1) / (15 × 2)
Issues with Good-Turing discounting, with some solutions

• Zero counts: we cannot assign probabilities if n_{r+1} = 0
• The estimates of some of the frequencies of frequencies are unreliable
• A solution is to replace n_r with smoothed counts z_r
• A well-known technique (simple Good-Turing) for smoothing n_r is to use linear interpolation:
  log z_r = a + b log r
Not all (unknown) n-grams are equal

• Let's assume that 'black squirrel' is an unknown bigram
• How do we calculate the smoothed probability?
  P+1(squirrel | black) = (0 + 1) / (C(black) + V)
• How about 'black wug'?
  P+1(black wug) = P+1(wug | black) = (0 + 1) / (C(black) + V)
• Would it make a difference if we used a better smoothing method (e.g., Good-Turing)?
Back-off and interpolation

The general idea is to fall back to a lower-order n-gram when the higher-order estimate is unreliable.

• Even if C(black squirrel) = C(black wug) = 0, in a reasonably sized corpus it is unlikely that C(squirrel) = C(wug)
Back-off

Back-off uses the higher-order estimate if it is available, and 'backs off' to the lower-order n-gram(s) otherwise:

  P(w_i | w_{i−1}) = P*(w_i | w_{i−1})   if C(w_{i−1} w_i) > 0
                   = α P(w_i)            otherwise

where
• P*(·) is the discounted probability
• α makes sure that the backed-off probabilities sum up to the discounted amount, so that ∑_w P(w | w_{i−1}) = 1 (see the sketch below)
• P(w_i) is, typically, a smoothed unigram probability
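A concrete sketch of back-off under stated assumptions: absolute discounting with a made-up discount D (rather than the Good-Turing discount used by Katz back-off later), backing off to MLE unigrams. The point is how α redistributes the reserved mass so that the conditional distribution still sums to one:

```python
from collections import Counter

sentences = ["I 'm sorry , Dave .", "I 'm afraid I can 't do that ."]
unigrams, bigrams = Counter(), Counter()
for s in sentences:
    toks = ["<s>"] + s.split() + ["</s>"]
    unigrams.update(toks)
    bigrams.update(zip(toks, toks[1:]))

vocab = set(unigrams)
N = sum(unigrams.values())
D = 0.5                                     # absolute discount (hypothetical value)

def p_star(word, context):
    """Discounted bigram estimate P*(w | context); 0 for unseen bigrams."""
    c = bigrams[(context, word)]
    return max(c - D, 0) / unigrams[context] if c > 0 else 0.0

def p_unigram(word):
    return unigrams[word] / N

def alpha(context):
    """Left-over mass for this context, spread over words never seen after it."""
    seen = {w for w in vocab if bigrams[(context, w)] > 0}
    leftover = 1.0 - sum(p_star(w, context) for w in seen)
    unseen_mass = sum(p_unigram(w) for w in vocab if w not in seen)
    return leftover / unseen_mass

def p_backoff(word, context):
    if bigrams[(context, word)] > 0:
        return p_star(word, context)
    return alpha(context) * p_unigram(word)

# The conditional distribution still sums to one:
print(sum(p_backoff(w, "I") for w in vocab))   # ~1.0
```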
Interpolation

Interpolation uses a linear combination:

  P_int(w_i | w_{i−1}) = λ P(w_i | w_{i−1}) + (1 − λ) P(w_i)

In general (recursive definition),

  P_int(w_i | w_{i−n+1}^{i−1}) = λ P(w_i | w_{i−n+1}^{i−1}) + (1 − λ) P_int(w_i | w_{i−n+2}^{i−1})

• ∑_i λ_i = 1
• The recursion terminates with
  – either smoothed unigram counts
  – or the uniform distribution 1/V
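A tiny sketch of two-level interpolation; the component probabilities are hypothetical, chosen to show that an unseen bigram still gets a non-zero interpolated probability:

```python
def interpolate(p_bigram, p_unigram, lam):
    """P_int(w | v) = lam * P(w | v) + (1 - lam) * P(w)."""
    return lambda w, v: lam * p_bigram(w, v) + (1 - lam) * p_unigram(w)

# Hypothetical component estimates for a single unseen bigram:
p_bi  = lambda w, v: 0.0    # MLE bigram estimate: never saw 'black squirrel'
p_uni = lambda w: 0.05      # but 'squirrel' itself has some unigram probability

p_int = interpolate(p_bi, p_uni, lam=0.8)
print(p_int("squirrel", "black"))   # 0.8 * 0.0 + 0.2 * 0.05 = 0.01
```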
Not all contexts are equal

• Back to our example: if both bigrams
  – black squirrel
  – wuggy squirrel
  are unknown, the above formulations assign the same probability to both
• To solve this, the back-off or interpolation parameters (α or λ) are often conditioned on the context
• For example,
  P_int(w_i | w_{i−n+1}^{i−1}) = λ_{w_{i−n+1}^{i−1}} P(w_i | w_{i−n+1}^{i−1}) + (1 − λ_{w_{i−n+1}^{i−1}}) P_int(w_i | w_{i−n+2}^{i−1})
Katz back-off

A popular back-off method is Katz back-off:

  P_Katz(w_i | w_{i−n+1}^{i−1}) = P*(w_i | w_{i−n+1}^{i−1})                          if C(w_{i−n+1}^{i}) > 0
                                = α_{w_{i−n+1}^{i−1}} P_Katz(w_i | w_{i−n+2}^{i−1})  otherwise

• P*(·) is the Good-Turing discounted probability estimate (only for n-grams with small counts)
• α_{w_{i−n+1}^{i−1}} makes sure that the back-off probabilities sum to the discounted amount
• α is high for frequent contexts. So, hopefully,
  α_black P(squirrel) > α_wuggy P(squirrel), i.e. P(squirrel | black) > P(squirrel | wuggy)
Kneser-Ney interpolation: intuition

• Use absolute discounting for the higher-order n-gram
• Estimate the lower-order n-gram probabilities based on the probability of the target word occurring in a new context
• Example: I can't see without my reading glasses.
• It turns out the word Francisco is more frequent than glasses (in a typical English corpus, the PTB)
• But Francisco occurs only in the context San Francisco
• Assigning probabilities to unigrams based on the number of unique contexts they appear in makes glasses more likely
Kneser-Ney interpolation for bigrams

  P_KN(w_i | w_{i−1}) = (C(w_{i−1} w_i) − D) / C(w_{i−1})  +  λ_{w_{i−1}} × |{v : C(v w_i) > 0}| / ∑_w |{v : C(v w) > 0}|

  (The first term is the absolutely discounted bigram estimate; in the second term, the numerator is the number of unique contexts w_i appears in, and the denominator is the number of all unique contexts.)

• The λs make sure that the probabilities sum to 1
• The same idea can be applied to back-off as well (interpolation seems to work better)
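A sketch of interpolated Kneser-Ney for bigrams on the toy corpus, with a made-up discount D; it is a simplification of a full implementation, but it shows the continuation counts and checks that the result is still a proper distribution:

```python
from collections import Counter

sentences = ["I 'm sorry , Dave .", "I 'm afraid I can 't do that ."]
unigrams, bigrams = Counter(), Counter()
for s in sentences:
    toks = ["<s>"] + s.split() + ["</s>"]
    unigrams.update(toks)
    bigrams.update(zip(toks, toks[1:]))

D = 0.75                                      # hypothetical discount

# Continuation counts: in how many distinct contexts does each word appear?
continuations = Counter(w for (_, w) in bigrams)
total_continuations = len(bigrams)            # = sum over w of |{v : C(v w) > 0}|

def p_continuation(word):
    return continuations[word] / total_continuations

def p_kn(word, context):
    """Interpolated Kneser-Ney bigram estimate (a sketch, not a full implementation)."""
    discounted = max(bigrams[(context, word)] - D, 0) / unigrams[context]
    # lambda(context): the discounted mass, one D per distinct continuation type
    distinct_after = len([1 for (v, _) in bigrams if v == context])
    lam = D * distinct_after / unigrams[context]
    return discounted + lam * p_continuation(word)

print(p_kn("'m", "I"))
print(sum(p_kn(w, "I") for w in unigrams))    # ~1.0: still a proper distribution
```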
Some shortcomings of n-gram language models

The n-gram language models are simple and successful, but …

• They cannot handle long-distance dependencies:
  In the last race, the horse he bought last year finally ___.
• Their success often drops in morphologically complex languages
• The smoothing methods are often 'a bag of tricks'
• They are highly sensitive to the training data: you do not want to use an n-gram model trained on business news for medical texts
Cluster-based n-grams

• The idea is to cluster the words, and fall back (back off or interpolate) to the cluster:
  – a clustering algorithm is likely to form a cluster containing words for food, e.g., {apple, pear, broccoli, spinach}
  – if you have never seen 'eat your broccoli', estimate
    P(broccoli | eat your) = P(FOOD | eat your) × P(broccoli | FOOD)
• Clustering can be
  hard: a word belongs to only one cluster (simplifies the model)
  soft: words can be assigned to clusters probabilistically (more flexible)
Skipping

• The contexts
  – boring | the lecture was
  – boring | (the) lecture yesterday was
  are completely different for an n-gram model
• A potential solution is to consider contexts with gaps, 'skipping' one or more words
• We would, for example, model P(e | abcd) with a combination (e.g., interpolation) of
  – P(e | abc_)
  – P(e | ab_d)
  – P(e | a_cd)
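A small sketch of generating skip-bigram contexts (the function name and gap convention are made up for illustration):

```python
def skip_bigrams(tokens, max_gap=2):
    """Pairs (w_i, w_j) with up to max_gap words skipped between them."""
    pairs = []
    for i, w in enumerate(tokens):
        for gap in range(0, max_gap + 1):
            j = i + 1 + gap
            if j < len(tokens):
                pairs.append((w, tokens[j]))
    return pairs

print(skip_bigrams("the lecture yesterday was boring".split(), max_gap=1))
# includes the ordinary bigrams ('was', 'boring'), ('yesterday', 'was'), ...
# plus skip pairs such as ('lecture', 'was') that bridge the inserted word
```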
Modeling sentence types

• Another way to improve a language model is to condition on the sentence type
• The idea: different types of sentences (e.g., ones related to different topics) have different behavior
• Sentence types are typically based on clustering
• We create multiple language models, one for each sentence type
• Often a 'general' language model is used as a fall-back