


  1. Natural Language Processing Lecture 5: Language Models and Smoothing

  2. Language Modeling • Is this sentence good? – This is a pen – Pen this is a • Help choose between options, help score options – 他向记者介绍了发言的主要内容 (the Chinese source sentence) – He briefed to reporters on the chief contents of the statement – He briefed reporters on the chief contents of the statement – He briefed to reporters on the main contents of the statement – He briefed reporters on the main contents of the statement

  3. One-Slide Review of Probability Terminology • Random variables take different values, depending on chance. • Notation: p(X = x) is the probability that r.v. X takes value x; p(x) is shorthand for the same; p(X) is the distribution over the values X can take (a function) • Joint probability: p(X = x, Y = y) – Independence – Chain rule • Conditional probability: p(X = x | Y = y)

  4. Unigram Model • Every word in Σ is assigned some probability. • Random variables W1, W2, ... (one per word).
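Written out, the unigram assumption treats each word as an independent draw from a single distribution over Σ; a minimal LaTeX sketch of the factorization (standard notation, not copied from the slide):

    % Unigram model: each word is an independent draw from p(.) over \Sigma
    p(w_1, w_2, \ldots, w_n) = \prod_{i=1}^{n} p(w_i)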

  5. Part of a Unigram Distribution
      [rank 1]    p(the) = 0.038   p(of) = 0.023   p(and) = 0.021   p(to) = 0.017   p(is) = 0.013   p(a) = 0.012   p(in) = 0.012   p(for) = 0.009   ...
      [rank 1001] p(joint) = 0.00014   p(relatively) = 0.00014   p(plot) = 0.00014   p(DEL1SUBSEQ) = 0.00014   p(rule) = 0.00014   p(62.0) = 0.00014   p(9.1) = 0.00014   p(evaluated) = 0.00014   ...

  6. Unigram Model as a Generator (sample of randomly generated text omitted)

  7. Full History Model • Every word in Σ is assigned some probability, conditioned on the entire history of preceding words.

  8. (Example text under the full history model: a fluent news passage about the Clintons, Obama, and the South Carolina primary.)

  9. N-Gram Model • Every word in Σ is assigned some probability, conditioned on a fixed-length history of n – 1 words.
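As a LaTeX sketch (standard n-gram notation, assumed rather than taken from the slide), the exact chain rule and the n-gram truncation of the history are:

    % Chain rule over the full history
    p(w_1, \ldots, w_m) = \prod_{i=1}^{m} p(w_i \mid w_1, \ldots, w_{i-1})
    % N-gram approximation: keep only the last n-1 words of the history
    p(w_i \mid w_1, \ldots, w_{i-1}) \approx p(w_i \mid w_{i-n+1}, \ldots, w_{i-1})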

  10. Bigram Model as a Generator (sample of randomly generated text omitted)

  11. Trigram Model as a Generator (sample of randomly generated text omitted)

  12. What’s in a word? • Is punctuation a word? – Does knowing the last “word” is a “,” help? • In speech – I do uh main- mainly business processing – Is “uh” a word?

  13. For Thought • Do N-gram models “know” English? • Unknown words • N-gram models and finite-state automata

  14. Starting and Stopping Unigram model: ... Bigram model: ... Trigram model: ...
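The per-model formulas on this slide did not survive transcription. As a sketch of the usual convention for handling sentence starts and stops, the example below pads each sentence with boundary symbols so that every word has a full history; the symbol names <s> and </s> and the helper functions are illustrative assumptions, not part of the lecture.

    from typing import List, Tuple

    def pad(tokens: List[str], n: int) -> List[str]:
        """Pad a sentence with n-1 start symbols and one stop symbol,
        so every real word has a full (n-1)-word history."""
        return ["<s>"] * (n - 1) + tokens + ["</s>"]

    def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
        """Return the n-grams of a padded sentence as tuples."""
        padded = pad(tokens, n)
        return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]

    # Trigram histories for a short sentence:
    # [('<s>', '<s>', 'this'), ('<s>', 'this', 'is'), ('this', 'is', 'a'),
    #  ('is', 'a', 'pen'), ('a', 'pen', '</s>')]
    print(ngrams("this is a pen".split(), 3))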

  15. Evaluation

  16. Which model is better? • Can I get a number for how good my model is on a test set? • What is P(test set | model)? • We measure this by perplexity • Perplexity is the inverse probability of the test set, normalized by the number of words

  17. Perplexity
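The formula itself was lost from this slide; the standard definition, consistent with the previous slide's description (inverse probability of the test set, normalized by its length N), is, in LaTeX:

    % Perplexity of a test set W = w_1 ... w_N under model p
    \mathrm{PP}(W) = p(w_1, \ldots, w_N)^{-1/N} = \sqrt[N]{1 / p(w_1, \ldots, w_N)}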

  18. Perplexity of different models • Better models have lower perplexity – WSJ: Unigram 962; Bigram 170; Trigram 109 • Different tasks have different perplexity – WSJ (109) vs. Bus Information Queries (~25) • The higher the conditional probability, the lower the perplexity • Perplexity is the average branching factor
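A quick worked example of the branching-factor reading (the numbers are mine, not from the slide): if the model assigns every word the same probability 1/k regardless of context, the perplexity comes out to exactly k, the number of choices at each step.

    % Uniform model: every word has probability 1/k regardless of context
    p(w_1, \ldots, w_N) = (1/k)^N
    \mathrm{PP}(W) = \left( (1/k)^N \right)^{-1/N} = k
    % e.g. ten digits spoken at random: k = 10, so perplexity = 10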

  19. What about open-class words? • What is the probability of unseen words? – (The naïve answer is 0.0) • But that’s not what you want – The test set will usually include words not seen in training • What is the probability of – P(Nebuchadnezzur | son of)

  20. LM smoothing • Laplace or add-one smoothing – Add one to all counts – Or add “epsilon” to all counts – You still need to know your whole vocabulary • Have an OOV word in your vocabulary – It carries the probability of seeing an unseen word
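A minimal sketch of add-one (Laplace) bigram smoothing with an <OOV> entry in the vocabulary, as described above; the function name, the toy counts, and the add-k generalization are illustrative assumptions, not from the lecture.

    from collections import Counter

    def laplace_bigram_prob(w_prev, w, bigram_counts, unigram_counts, vocab, k=1.0):
        """P(w | w_prev) with add-k smoothing: (count + k) / (context count + k*|V|).
        Words outside the vocabulary are mapped to the <OOV> symbol."""
        w_prev = w_prev if w_prev in vocab else "<OOV>"
        w = w if w in vocab else "<OOV>"
        numerator = bigram_counts[(w_prev, w)] + k
        denominator = unigram_counts[w_prev] + k * len(vocab)
        return numerator / denominator

    # Toy counts: even the unseen bigram ("pen", "this") gets probability > 0
    vocab = {"this", "is", "a", "pen", "<OOV>"}
    unigrams = Counter({"this": 10, "is": 12, "a": 20, "pen": 3})
    bigrams = Counter({("this", "is"): 8, ("is", "a"): 9, ("a", "pen"): 2})
    print(laplace_bigram_prob("a", "pen", bigrams, unigrams, vocab))
    print(laplace_bigram_prob("pen", "this", bigrams, unigrams, vocab))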

  21. Good-Turing Smoothing • Good (1953), from an idea of Turing – Use the count of things you’ve seen once to estimate the count of things you’ve never seen • Calculate the frequency of frequencies of N-grams – Count of N-grams that appear 1 time – Count of N-grams that appear 2 times – Count of N-grams that appear 3 times – … – Estimate the new count as c* = (c + 1) · N_{c+1} / N_c • Change the counts a little so we get a better estimate for count 0
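A sketch of the adjusted-count computation c* = (c + 1) · N_{c+1} / N_c from a table of n-gram counts; the function and the toy counts are illustrative, and it skips the smoothing of the N_c curve that practical implementations need for large c.

    from collections import Counter

    def good_turing_adjusted_counts(ngram_counts):
        """Return ({c: c*}, p_unseen), where N_c is the number of distinct
        n-grams observed exactly c times and c* = (c + 1) * N_{c+1} / N_c."""
        freq_of_freqs = Counter(ngram_counts.values())          # N_c
        adjusted = {c: (c + 1) * freq_of_freqs.get(c + 1, 0) / n_c
                    for c, n_c in freq_of_freqs.items()}
        total = sum(ngram_counts.values())
        p_unseen = freq_of_freqs.get(1, 0) / total if total else 0.0  # mass reserved for count 0
        return adjusted, p_unseen

    # Toy bigram counts: "a b" seen twice, three other bigrams seen once
    counts = Counter(["a b", "a b", "b c", "c d", "d e"])
    print(good_turing_adjusted_counts(counts))
    # ({2: 0.0, 1: 0.666...}, 0.6)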

  22. Good-Turing’s Discounted Counts
      c    AP Newswire bigrams          Berkeley Restaurants bigrams    Smith Thesis bigrams
           N_c              c*          N_c            c*               N_c          c*
      0    74,671,100,000   0.0000270   2,081,496      0.002553         x            38,048 / x
      1    2,018,046        0.446       5,315          0.533960         38,048       0.21147
      2    449,721          1.26        1,419          1.357294         4,032        1.05071
      3    188,933          2.24        642            2.373832         1,409        2.12633
      4    105,668          3.24        381            4.081365         749          2.63685
      5    68,379           4.22        311            3.781350         395          3.91899
      6    48,190           5.19        196            4.500000         258          4.42248

  23. Backoff • If no trigram, use bigram • If no bigram, use unigram • If no unigram … smooth the unigrams
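A minimal sketch of this backoff chain over raw relative frequencies ("stupid" backoff with a fixed weight); the 0.4 constant, the add-one step at the unigram level, and the function name are illustrative assumptions, not from the slide.

    def backoff_prob(w1, w2, w3, tri_counts, bi_counts, uni_counts, total_words, alpha=0.4):
        """Score P(w3 | w1, w2): use the trigram estimate if its counts exist,
        otherwise back off to the bigram, otherwise to a smoothed unigram."""
        if tri_counts.get((w1, w2, w3), 0) > 0:
            return tri_counts[(w1, w2, w3)] / bi_counts[(w1, w2)]
        if bi_counts.get((w2, w3), 0) > 0:
            return alpha * bi_counts[(w2, w3)] / uni_counts[w2]
        # No bigram either: fall back to add-one-smoothed unigrams
        return alpha * alpha * (uni_counts.get(w3, 0) + 1) / (total_words + len(uni_counts))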

  24. Estimating p(w | history) • Relative frequencies (count & normalize) • Transform the counts: – Laplace/“add one”/“add λ” – Good-Turing discounting • Interpolate or “back off”: – With Good-Turing discounting: Katz backoff – “Stupid” backoff – Absolute discounting: Kneser-Ney
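For the interpolation option in this list, a minimal sketch of linear interpolation over trigram, bigram, and unigram estimators; the λ weights shown are placeholders and would normally be tuned on held-out data.

    def interpolated_trigram_prob(w1, w2, w3, p_tri, p_bi, p_uni,
                                  lambdas=(0.6, 0.3, 0.1)):
        """Linearly interpolate trigram, bigram, and unigram probabilities.
        p_tri, p_bi, p_uni are functions returning (possibly zero) MLE estimates;
        the lambdas must sum to 1 so the result is a proper probability."""
        l3, l2, l1 = lambdas
        return l3 * p_tri(w1, w2, w3) + l2 * p_bi(w2, w3) + l1 * p_uni(w3)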
