

  1. Language Modeling Michael Collins, Columbia University

  2. Overview ◮ The language modeling problem ◮ Trigram models ◮ Evaluating language models: perplexity ◮ Estimation techniques: ◮ Linear interpolation ◮ Discounting methods

  3. The Language Modeling Problem ◮ We have some (finite) vocabulary, say V = {the, a, man, telescope, Beckham, two, ...} ◮ We have an (infinite) set of strings, V†, for example: the STOP; a STOP; the fan STOP; the fan saw Beckham STOP; the fan saw saw STOP; the fan saw Beckham play for Real Madrid STOP

  4. The Language Modeling Problem (Continued) ◮ We have a training sample of example sentences in English

  5. The Language Modeling Problem (Continued) ◮ We have a training sample of example sentences in English ◮ We need to “learn” a probability distribution p, i.e., p is a function that satisfies p(x) ≥ 0 for all x ∈ V†, and Σ_{x ∈ V†} p(x) = 1

  6. The Language Modeling Problem (Continued) ◮ We have a training sample of example sentences in English ◮ We need to “learn” a probability distribution p, i.e., p is a function that satisfies p(x) ≥ 0 for all x ∈ V†, and Σ_{x ∈ V†} p(x) = 1, for example: p(the STOP) = 10^{-12}, p(the fan STOP) = 10^{-8}, p(the fan saw Beckham STOP) = 2 × 10^{-8}, p(the fan saw saw STOP) = 10^{-15}, ..., p(the fan saw Beckham play for Real Madrid STOP) = 2 × 10^{-9}, ...

  7. Why on earth would we want to do this?! ◮ Speech recognition was the original motivation. (Related problems are optical character recognition, handwriting recognition.)

  8. Why on earth would we want to do this?! ◮ Speech recognition was the original motivation. (Related problems are optical character recognition, handwriting recognition.) ◮ The estimation techniques developed for this problem will be VERY useful for other problems in NLP

  9. A Naive Method ◮ We have N training sentences ◮ For any sentence x_1 ... x_n, c(x_1 ... x_n) is the number of times the sentence is seen in our training data ◮ A naive estimate: p(x_1 ... x_n) = c(x_1 ... x_n) / N
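
A minimal sketch of this naive estimate, assuming the training data is a list of token lists ending in STOP (the function name and data format are my own, not from the slides):

    from collections import Counter

    def naive_sentence_lm(training_sentences):
        # Naive estimate from slide 9: p(x_1 ... x_n) = c(x_1 ... x_n) / N,
        # where c counts how often the whole sentence appears in the training data.
        counts = Counter(tuple(s) for s in training_sentences)
        N = len(training_sentences)
        def p(sentence):
            return counts[tuple(sentence)] / N   # 0 for any unseen sentence
        return p

Any sentence not seen verbatim in the training data gets probability 0 under this estimate, which is why the following slides move to trigram models.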

  10. Overview ◮ The language modeling problem ◮ Trigram models ◮ Evaluating language models: perplexity ◮ Estimation techniques: ◮ Linear interpolation ◮ Discounting methods

  11. Markov Processes ◮ Consider a sequence of random variables X_1, X_2, ..., X_n. Each random variable can take any value in a finite set V. For now we assume the length n is fixed (e.g., n = 100). ◮ Our goal: model P(X_1 = x_1, X_2 = x_2, ..., X_n = x_n)

  12. First-Order Markov Processes P(X_1 = x_1, X_2 = x_2, ..., X_n = x_n)

  13. First-Order Markov Processes P(X_1 = x_1, X_2 = x_2, ..., X_n = x_n) = P(X_1 = x_1) ∏_{i=2}^{n} P(X_i = x_i | X_1 = x_1, ..., X_{i-1} = x_{i-1})

  14. First-Order Markov Processes P(X_1 = x_1, X_2 = x_2, ..., X_n = x_n) = P(X_1 = x_1) ∏_{i=2}^{n} P(X_i = x_i | X_1 = x_1, ..., X_{i-1} = x_{i-1}) = P(X_1 = x_1) ∏_{i=2}^{n} P(X_i = x_i | X_{i-1} = x_{i-1})

  15. First-Order Markov Processes P(X_1 = x_1, X_2 = x_2, ..., X_n = x_n) = P(X_1 = x_1) ∏_{i=2}^{n} P(X_i = x_i | X_1 = x_1, ..., X_{i-1} = x_{i-1}) = P(X_1 = x_1) ∏_{i=2}^{n} P(X_i = x_i | X_{i-1} = x_{i-1}) The first-order Markov assumption: For any i ∈ {2 ... n}, for any x_1 ... x_i, P(X_i = x_i | X_1 = x_1, ..., X_{i-1} = x_{i-1}) = P(X_i = x_i | X_{i-1} = x_{i-1})

  16. Second-Order Markov Processes P(X_1 = x_1, X_2 = x_2, ..., X_n = x_n)

  17. Second-Order Markov Processes P(X_1 = x_1, X_2 = x_2, ..., X_n = x_n) = P(X_1 = x_1) × P(X_2 = x_2 | X_1 = x_1) × ∏_{i=3}^{n} P(X_i = x_i | X_{i-2} = x_{i-2}, X_{i-1} = x_{i-1})

  18. Second-Order Markov Processes P(X_1 = x_1, X_2 = x_2, ..., X_n = x_n) = P(X_1 = x_1) × P(X_2 = x_2 | X_1 = x_1) × ∏_{i=3}^{n} P(X_i = x_i | X_{i-2} = x_{i-2}, X_{i-1} = x_{i-1}) = ∏_{i=1}^{n} P(X_i = x_i | X_{i-2} = x_{i-2}, X_{i-1} = x_{i-1}) (For convenience we assume x_0 = x_{-1} = *, where * is a special “start” symbol.)

  19. Modeling Variable Length Sequences ◮ We would like the length of the sequence, n, to also be a random variable ◮ A simple solution: always define X_n = STOP, where STOP is a special symbol

  20. Modeling Variable Length Sequences ◮ We would like the length of the sequence, n, to also be a random variable ◮ A simple solution: always define X_n = STOP, where STOP is a special symbol ◮ Then use a Markov process as before: P(X_1 = x_1, X_2 = x_2, ..., X_n = x_n) = ∏_{i=1}^{n} P(X_i = x_i | X_{i-2} = x_{i-2}, X_{i-1} = x_{i-1}) (For convenience we assume x_0 = x_{-1} = *, where * is a special “start” symbol.)

  21. Trigram Language Models ◮ A trigram language model consists of: 1. A finite set V 2. A parameter q(w | u, v) for each trigram u, v, w such that w ∈ V ∪ {STOP} and u, v ∈ V ∪ {*}.

  22. Trigram Language Models ◮ A trigram language model consists of: 1. A finite set V 2. A parameter q(w | u, v) for each trigram u, v, w such that w ∈ V ∪ {STOP} and u, v ∈ V ∪ {*}. ◮ For any sentence x_1 ... x_n where x_i ∈ V for i = 1 ... (n − 1) and x_n = STOP, the probability of the sentence under the trigram language model is p(x_1 ... x_n) = ∏_{i=1}^{n} q(x_i | x_{i-2}, x_{i-1}), where we define x_0 = x_{-1} = *.
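
A minimal sketch of how this sentence probability might be computed, assuming q is available as a Python function q(w, u, v) returning q(w | u, v) (the function name and calling convention are my own, not from the slides):

    def sentence_probability(sentence, q):
        # p(x_1 ... x_n) = prod_{i=1}^{n} q(x_i | x_{i-2}, x_{i-1}),
        # with x_0 = x_{-1} = * and x_n = STOP (slide 22).
        words = ["*", "*"] + list(sentence)   # sentence is x_1 ... x_n, ending in STOP
        prob = 1.0
        for i in range(2, len(words)):
            u, v, w = words[i - 2], words[i - 1], words[i]
            prob *= q(w, u, v)                # q(w | u, v)
        return prob

For the sentence on the next slide, sentence_probability(["the", "dog", "barks", "STOP"], q) multiplies q(the | *, *), q(dog | *, the), q(barks | the, dog) and q(STOP | dog, barks).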

  23. An Example For the sentence the dog barks STOP we would have p(the dog barks STOP) = q(the | *, *) × q(dog | *, the) × q(barks | the, dog) × q(STOP | dog, barks)

  24. The Trigram Estimation Problem Remaining estimation problem: q(w_i | w_{i-2}, w_{i-1}) For example: q(laughs | the, dog)

  25. The Trigram Estimation Problem Remaining estimation problem: q(w_i | w_{i-2}, w_{i-1}) For example: q(laughs | the, dog) A natural estimate (the “maximum likelihood estimate”): q(w_i | w_{i-2}, w_{i-1}) = Count(w_{i-2}, w_{i-1}, w_i) / Count(w_{i-2}, w_{i-1}), e.g., q(laughs | the, dog) = Count(the, dog, laughs) / Count(the, dog)
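
One way the maximum likelihood estimate might be computed, assuming the training corpus is an iterable of token lists ending in STOP (the function names, the returned callable, and the exact counting of the denominator are my own choices, not from the slides):

    from collections import Counter

    def train_trigram_mle(sentences):
        # Maximum likelihood estimate (slide 25):
        #   q(w | u, v) = Count(u, v, w) / Count(u, v)
        trigram_counts = Counter()
        context_counts = Counter()   # Count(u, v), accumulated as sum over w of Count(u, v, w)
        for sentence in sentences:
            words = ["*", "*"] + list(sentence)      # x_0 = x_{-1} = *
            for i in range(2, len(words)):
                u, v, w = words[i - 2], words[i - 1], words[i]
                trigram_counts[(u, v, w)] += 1
                context_counts[(u, v)] += 1
        def q(w, u, v):
            if context_counts[(u, v)] == 0:
                return 0.0                           # estimate undefined for unseen contexts
            return trigram_counts[(u, v, w)] / context_counts[(u, v)]
        return q

A q trained this way can be passed directly to the sentence_probability sketch above.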

  26. Sparse Data Problems A natural estimate (the “maximum likelihood estimate”): q(w_i | w_{i-2}, w_{i-1}) = Count(w_{i-2}, w_{i-1}, w_i) / Count(w_{i-2}, w_{i-1}), e.g., q(laughs | the, dog) = Count(the, dog, laughs) / Count(the, dog) Say our vocabulary size is N = |V|; then there are N^3 parameters in the model, e.g., N = 20,000 ⇒ 20,000^3 = 8 × 10^{12} parameters

  27. Overview ◮ The language modeling problem ◮ Trigram models ◮ Evaluating language models: perplexity ◮ Estimation techniques: ◮ Linear interpolation ◮ Discounting methods

  28. Evaluating a Language Model: Perplexity ◮ We have some test data, m sentences s_1, s_2, s_3, ..., s_m

  29. Evaluating a Language Model: Perplexity ◮ We have some test data, m sentences s_1, s_2, s_3, ..., s_m ◮ We could look at the probability under our model, ∏_{i=1}^{m} p(s_i). Or more conveniently, the log probability: log ∏_{i=1}^{m} p(s_i) = Σ_{i=1}^{m} log p(s_i)

  30. Evaluating a Language Model: Perplexity ◮ We have some test data, m sentences s_1, s_2, s_3, ..., s_m ◮ We could look at the probability under our model, ∏_{i=1}^{m} p(s_i). Or more conveniently, the log probability: log ∏_{i=1}^{m} p(s_i) = Σ_{i=1}^{m} log p(s_i) ◮ In fact the usual evaluation measure is perplexity: Perplexity = 2^{-l}, where l = (1/M) Σ_{i=1}^{m} log p(s_i) and M is the total number of words in the test data.
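
A small sketch of this perplexity computation, assuming p(s) returns the model probability of a sentence s (and is positive for every test sentence), with the log taken base 2 as the 2^{-l} definition implies; whether STOP is counted towards M is a convention I have assumed here:

    import math

    def perplexity(test_sentences, p):
        # Perplexity = 2^{-l}, where l = (1/M) * sum_{i=1}^{m} log_2 p(s_i)
        # and M is the total number of words in the test data (slide 30).
        total_log_prob = 0.0
        M = 0
        for s in test_sentences:
            total_log_prob += math.log2(p(s))
            M += len(s)      # counts STOP as a word: an assumption, not from the slides
        l = total_log_prob / M
        return 2.0 ** (-l)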

  31. Some Intuition about Perplexity ◮ Say we have a vocabulary V, and N = |V| + 1, and a model that predicts q(w | u, v) = 1/N for all w ∈ V ∪ {STOP}, for all u, v ∈ V ∪ {*}. ◮ Easy to calculate the perplexity in this case: Perplexity = 2^{-l}, where l = log (1/N) ⇒ Perplexity = N. Perplexity is a measure of effective “branching factor”
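
Spelling out that calculation, with the log taken base 2 (which the 2^{-l} definition requires): every one of the M words in the test data is predicted with probability 1/N, so

    l = \frac{1}{M} \sum_{i=1}^{m} \log_2 p(s_i) = \frac{1}{M} \cdot M \cdot \log_2 \frac{1}{N} = -\log_2 N,
    \qquad \text{Perplexity} = 2^{-l} = 2^{\log_2 N} = N.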

  32. Typical Values of Perplexity ◮ Results from Goodman (“A bit of progress in language modeling”), where |V| = 50,000 ◮ A trigram model: p(x_1 ... x_n) = ∏_{i=1}^{n} q(x_i | x_{i-2}, x_{i-1}). Perplexity = 74

  33. Typical Values of Perplexity ◮ Results from Goodman (“A bit of progress in language modeling”), where |V| = 50,000 ◮ A trigram model: p(x_1 ... x_n) = ∏_{i=1}^{n} q(x_i | x_{i-2}, x_{i-1}). Perplexity = 74 ◮ A bigram model: p(x_1 ... x_n) = ∏_{i=1}^{n} q(x_i | x_{i-1}). Perplexity = 137

  34. Typical Values of Perplexity ◮ Results from Goodman (“A bit of progress in language modeling”), where |V| = 50,000 ◮ A trigram model: p(x_1 ... x_n) = ∏_{i=1}^{n} q(x_i | x_{i-2}, x_{i-1}). Perplexity = 74 ◮ A bigram model: p(x_1 ... x_n) = ∏_{i=1}^{n} q(x_i | x_{i-1}). Perplexity = 137 ◮ A unigram model: p(x_1 ... x_n) = ∏_{i=1}^{n} q(x_i). Perplexity = 955

  35. Some History ◮ Shannon conducted experiments on the entropy of English, i.e., how good are people at the perplexity game? C. Shannon. Prediction and entropy of printed English. Bell System Technical Journal, 30:50–64, 1951.
