CSCI 5832 Natural Language Processing
Jim Martin, Lecture 8, 2/7/08


  1. CSCI 5832 Natural Language Processing, Jim Martin, Lecture 8 (2/7/08)

     Today 2/7
     • Finish remaining LM issues
       - Smoothing
       - Backoff and interpolation
     • Parts of speech
     • POS tagging
     • HMMs and Viterbi

     Laplace Smoothing
     • Also called add-one smoothing
     • Just add one to all the counts!
     • Very simple
     • MLE estimate: P_MLE(w_i) = c_i / N
     • Laplace estimate: P_Laplace(w_i) = (c_i + 1) / (N + V), where N is the number of tokens and V the vocabulary size
     • Reconstructed counts: c_i* = (c_i + 1) * N / (N + V)
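To make the add-one arithmetic concrete, here is a minimal Python sketch (the toy corpus and names are illustrative, not from the lecture) comparing the MLE and Laplace estimates:

    from collections import Counter

    corpus = "the cat sat on the mat the cat ran".split()
    counts = Counter(corpus)
    N = sum(counts.values())   # total tokens (9)
    V = len(counts)            # vocabulary size; here just the observed types (6)

    def p_mle(word):
        return counts[word] / N

    def p_laplace(word):
        # Add one to every count, seen or unseen
        return (counts[word] + 1) / (N + V)

    print(p_mle("cat"), p_laplace("cat"))   # 0.222 vs. 0.2: seen words shrink
    print(p_mle("dog"), p_laplace("dog"))   # 0.0 vs. 0.067: unseen words gain mass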

  2. Laplace-smoothed bigram counts (table)
     Laplace-smoothed bigrams (table)
     Reconstituted counts (table)

  3. Big Changes to Counts
     • C(want to) went from 608 to 238!
     • P(to|want) went from .66 to .26!
     • Discount d = c*/c
       - d for "chinese food" = .10, a 10x reduction!
       - So in general, Laplace is a blunt instrument
       - Could use a more fine-grained method (add-k); a small sketch follows below
     • Despite its flaws, Laplace (add-k) is still used to smooth other probabilistic models in NLP, especially
       - for pilot studies
       - in domains where the number of zeros isn't so huge

     Better Discounting Methods
     • The intuition used by many smoothing algorithms
       - Good-Turing
       - Kneser-Ney
       - Witten-Bell
     • is to use the count of things we've seen once to help estimate the count of things we've never seen

     Good-Turing
     • Imagine you are fishing
       - There are 8 species: carp, perch, whitefish, trout, salmon, eel, catfish, bass
     • You have caught
       - 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel = 18 fish (tokens), 6 species (types)
     • How likely is it that you'll next see another trout?
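The add-k sketch referenced above: a minimal illustration (the formula is the standard generalization of add-one; the toy corpus and k value are illustrative):

    from collections import Counter

    def p_add_k(word, counts, k=0.5):
        # Add-k smoothing: P(w) = (c(w) + k) / (N + k*V).
        # k = 1 recovers Laplace; smaller k discounts less aggressively.
        N = sum(counts.values())   # total tokens
        V = len(counts)            # vocabulary size (observed types here)
        return (counts[word] + k) / (N + k * V)

    counts = Counter("the cat sat on the mat".split())
    print(p_add_k("cat", counts))   # seen word
    print(p_add_k("dog", counts))   # unseen word gets k / (N + k*V)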

  4. Good-Turing
     • Now, how likely is it that the next species is new (i.e., catfish or bass)?
       - There were 18 events, 3 of which were singleton species, so 3/18

     Good-Turing
     • But that 3/18 isn't represented in our probability mass, certainly not in the estimate we used for another trout.

     Good-Turing Intuition
     • Notation: N_x is the frequency of frequency x
       - So N_10 = 1, N_1 = 3, etc.
     • To estimate the total probability of unseen species
       - use the number of species (words) we've seen once:
       - c_0* = c_1, p_0 = N_1/N
     • All other estimates are adjusted (down) to give probabilities for the unseen
     (Slide from Josh Goodman)

  5. Good-Turing Intuition (continued)
     • Notation: N_x is the frequency of frequency x
       - So N_10 = 1, N_1 = 3, etc.
     • To estimate the total probability of unseen species
       - use the number of species (words) we've seen once:
       - c_0* = c_1, p_0 = N_1/N = 3/18
     • All other estimates are adjusted (down) via c* = (c+1) * N_{c+1} / N_c
       - c*(eel) = (1+1) * N_2/N_1 = 2 * 1/3 = 2/3, so P_GT(eel) = (2/3)/18 = 1/27
     (Slide from Josh Goodman)

     Bigram frequencies of frequencies and GT re-estimates (table)
     GT-smoothed bigram probabilities (table)
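A minimal sketch of the Good-Turing arithmetic on the fishing example (the dict and function names are illustrative; real implementations also smooth the N_c curve, which this toy version does not):

    from collections import Counter

    catch = {"carp": 10, "perch": 3, "whitefish": 2,
             "trout": 1, "salmon": 1, "eel": 1}
    N = sum(catch.values())           # 18 tokens
    Nc = Counter(catch.values())      # frequency of frequencies:
                                      # Nc[1]=3, Nc[2]=1, Nc[3]=1, Nc[10]=1

    p0 = Nc[1] / N                    # mass reserved for unseen species: 3/18

    def c_star(c):
        # Good-Turing re-estimate c* = (c+1) * N_{c+1} / N_c
        # (breaks when Nc[c+1] == 0, hence the smoothing caveat above)
        return (c + 1) * Nc[c + 1] / Nc[c]

    print(p0)                # 0.1666...
    print(c_star(1) / N)     # P_GT(trout) = (2/3) / 18 = 1/27 ≈ 0.037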

  6. Backoff and Interpolation
     • Another really useful source of knowledge
     • If we are estimating:
       - a trigram p(z|x,y)
       - but c(xyz) is zero
     • Use info from:
       - the bigram p(z|y)
     • Or even:
       - the unigram p(z)
     • How to combine the trigram/bigram/unigram info?

     Backoff versus Interpolation
     • Backoff: use the trigram if you have it, otherwise the bigram, otherwise the unigram
     • Interpolation: mix all three

     Interpolation
     • Simple interpolation:
       - P_hat(z|x,y) = λ1 p(z|x,y) + λ2 p(z|y) + λ3 p(z), with the λs summing to 1
     • Lambdas conditional on context:
       - P_hat(z|x,y) = λ1(x,y) p(z|x,y) + λ2(x,y) p(z|y) + λ3(x,y) p(z)
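To make simple interpolation concrete, here is a minimal sketch (the lambda values and toy probability tables are illustrative placeholders, not numbers from the lecture):

    def p_interp(z, x, y, p_tri, p_bi, p_uni, lambdas=(0.6, 0.3, 0.1)):
        # Linear interpolation of trigram, bigram, and unigram estimates;
        # the lambdas must sum to 1.
        l1, l2, l3 = lambdas
        return (l1 * p_tri.get((x, y, z), 0.0)
                + l2 * p_bi.get((y, z), 0.0)
                + l3 * p_uni.get(z, 0.0))

    # Even with c(xyz) = 0 (empty trigram table), the bigram and unigram
    # terms keep the interpolated estimate nonzero.
    p_tri = {}
    p_bi = {("want", "to"): 0.26}
    p_uni = {"to": 0.03}                                     # made-up unigram probability
    print(p_interp("to", "I", "want", p_tri, p_bi, p_uni))   # 0.081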

  7. How to Set the Lambdas?
     • Use a held-out corpus
     • Choose the lambdas that maximize the probability of the held-out data
       - I.e., fix the N-gram probabilities,
       - then search for the lambda values
       - that, when plugged into the previous equation,
       - give the largest probability for the held-out set
       - Can use EM to do this search

     Practical Issues
     • We do everything in log space
       - Avoids underflow
       - (Also, adding is faster than multiplying)

     Language Modeling Toolkits
     • SRILM
     • CMU-Cambridge LM Toolkit
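A minimal demonstration of the underflow point (the sentence length and per-word probability are made up): multiplying many small probabilities hits floating-point zero, while summing their logs stays stable.

    import math

    word_probs = [1e-7] * 50   # 50 words, each with probability 1e-7

    prod = 1.0
    for p in word_probs:
        prod *= p              # 1e-350 is below the smallest positive float, so...
    print(prod)                # 0.0: the probability has underflowed

    logprob = sum(math.log(p) for p in word_probs)
    print(logprob)             # about -805.9: perfectly representable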

  8. Google N-Gram Release
     • serve as the incoming 92
     • serve as the incubator 99
     • serve as the independent 794
     • serve as the index 223
     • serve as the indication 72
     • serve as the indicator 120
     • serve as the indicators 45
     • serve as the indispensable 111
     • serve as the indispensible 40
     • serve as the individual 234
     (A worked relative-frequency example follows below.)

     LM Summary
     • Probability
       - Basic probability
       - Conditional probability
       - Bayes Rule
     • Language modeling (N-grams)
       - N-gram intro
       - The Chain Rule
       - Perplexity
       - Smoothing: Add-1, Good-Turing
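A tiny relative-frequency computation over the counts on the slide above. Note the caveat: these ten lines are only the continuations shown on the slide, so the denominator is a slice, not the true count of "serve as the":

    counts = {
        "incoming": 92, "incubator": 99, "independent": 794, "index": 223,
        "indication": 72, "indicator": 120, "indicators": 45,
        "indispensable": 111, "indispensible": 40, "individual": 234,
    }
    total = sum(counts.values())   # 1830, for this slice only
    for w, c in sorted(counts.items(), key=lambda kv: -kv[1])[:3]:
        print(f"{w}: {c}/{total} = {c/total:.3f}")
    # independent dominates even this slice: 794/1830 = 0.434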

  9. Break
     • Moving quiz to Thursday (2/14)
     • Readings
       - Chapter 2: all
       - Chapter 3: skip 3.4.1 and 3.12
       - Chapter 4: skip 4.7, 4.9, 4.10, and 4.11
       - Chapter 5: read 5.1 through 5.5

     Outline
     • Probability
     • Part-of-speech tagging
       - Parts of speech
       - Tag sets
       - Rule-based tagging
       - Statistical tagging
       - Simple most-frequent-tag baseline
       - Important ideas: training sets and test sets, unknown words, error analysis
       - HMM tagging

     Part-of-Speech Tagging
     • Part-of-speech tagging
       - Parts of speech
       - What's POS tagging good for, anyhow?
       - Tag sets
       - Rule-based tagging
       - Statistical tagging
       - Simple most-frequent-tag baseline
       - Important ideas: training sets and test sets, unknown words
       - HMM tagging

  10. Parts of Speech
      • 8 (ish) traditional parts of speech
        - Noun, verb, adjective, preposition, adverb, article, interjection, pronoun, conjunction, etc.
        - Called: parts of speech, lexical categories, word classes, morphological classes, lexical tags, POS
        - Lots of debate in linguistics about the number, nature, and universality of these
        - We'll completely ignore this debate.

      POS Examples
      • N    noun         chair, bandwidth, pacing
      • V    verb         study, debate, munch
      • ADJ  adjective    purple, tall, ridiculous
      • ADV  adverb       unfortunately, slowly
      • P    preposition  of, by, to
      • PRO  pronoun      I, me, mine
      • DET  determiner   the, a, that, those

      POS Tagging: Definition
      • The process of assigning a part-of-speech or lexical class marker to each word in a corpus:
        - WORDS: the, koala, put, the, keys, on, the, table
        - TAGS: N, V, P, DET

  11. POS Tagging Example
      WORD    TAG
      the     DET
      koala   N
      put     V
      the     DET
      keys    N
      on      P
      the     DET
      table   N

      What is POS Tagging Good For?
      • First step of a vast number of practical tasks
      • Speech synthesis
        - How to pronounce "lead"?
        - INsult vs. inSULT
        - OBject vs. obJECT
        - OVERflow vs. overFLOW
        - DIScount vs. disCOUNT
        - CONtent vs. conTENT
      • Parsing
        - Need to know if a word is an N or V before you can parse
      • Information extraction
        - Finding names, relations, etc.
      • Machine translation

      Open and Closed Classes
      • Closed class: a relatively fixed membership
        - Prepositions: of, in, by, ...
        - Auxiliaries: may, can, will, had, been, ...
        - Pronouns: I, you, she, mine, his, them, ...
        - Usually function words (short common words which play a role in grammar)
      • Open class: new ones can be created all the time
        - English has 4: nouns, verbs, adjectives, adverbs
        - Many languages have these 4, but not all!

  12. Open Class Words
      • Nouns
        - Proper nouns (Boulder, Granby, Eli Manning); English capitalizes these.
        - Common nouns (the rest)
        - Count nouns and mass nouns
          Count: have plurals, get counted: goat/goats, one goat, two goats
          Mass: don't get counted (snow, salt, communism) (*two snows)
      • Adverbs: tend to modify things
        - Unfortunately, John walked home extremely slowly yesterday
        - Directional/locative adverbs (here, home, downhill)
        - Degree adverbs (extremely, very, somewhat)
        - Manner adverbs (slowly, slinkily, delicately)
      • Verbs
        - In English, have morphological affixes (eat/eats/eaten)

      Closed Class Words
      • Idiosyncratic
      • Examples:
        - Prepositions: on, under, over, ...
        - Particles: up, down, on, off, ...
        - Determiners: a, an, the, ...
        - Pronouns: she, who, I, ...
        - Conjunctions: and, but, or, ...
        - Auxiliary verbs: can, may, should, ...
        - Numerals: one, two, three, third, ...

      Prepositions from CELEX (table)

  13. English Particles (table)
      Conjunctions (table)

      POS Tagging: Choosing a Tagset
      • There are many potential parts of speech and distinctions we could draw
      • To do POS tagging, we need to choose a standard set of tags to work with
      • Could pick a very coarse tagset
        - N, V, Adj, Adv
      • More commonly used is a finer-grained set, the "UPenn Treebank tagset": 45 tags
        - e.g., PRP$, WRB, WP$, VBG
      • Even more fine-grained tagsets exist

  14. Penn Treebank POS Tagset (table)

      Using the UPenn Tagset
      • The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.
      • Prepositions and subordinating conjunctions are marked IN ("although/IN I/PRP...")
      • Except the preposition/complementizer "to", which is just marked TO

      POS Tagging
      • Words often have more than one POS: back
        - The back door = JJ
        - On my back = NN
        - Win the voters back = RB
        - Promised to back the bill = VB
      • The POS tagging problem is to determine the POS tag for a particular instance of a word.
      (These examples from Dekang Lin)
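To see Penn Treebank tags assigned in practice, here is a short sketch using NLTK (not part of the course materials; assumes NLTK plus its tokenizer and tagger models are installed):

    import nltk
    # One-time model downloads:
    # nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

    tokens = nltk.word_tokenize("The grand jury commented on a number of other topics.")
    print(nltk.pos_tag(tokens))
    # Expected tags along the lines of:
    # [('The', 'DT'), ('grand', 'JJ'), ('jury', 'NN'), ('commented', 'VBD'), ...]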

  15. How Hard is POS Tagging? Measuring Ambiguity (table)

      Two Methods for POS Tagging
      1. Rule-based tagging
         - e.g., ENGTWOL
      2. Stochastic (= probabilistic) tagging
         - HMM (Hidden Markov Model) tagging

      Rule-Based Tagging
      • Start with a dictionary
      • Assign all possible tags to words from the dictionary
      • Write rules by hand to selectively remove tags
      • Leaving the correct tag for each word; a toy sketch follows below
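The toy sketch referenced above, of the dictionary-plus-elimination idea (the tiny lexicon and both rules are invented for illustration; a real rule-based tagger like ENGTWOL uses a full dictionary and thousands of hand-written constraints):

    lexicon = {
        "promised": {"VBD"}, "to": {"TO"},
        "back": {"JJ", "NN", "RB", "VB"},
        "the": {"DET"}, "bill": {"N", "VB"},
    }

    def tag(words):
        # Step 1: assign every dictionary tag each word could have
        candidates = [set(lexicon.get(w, {"UNK"})) for w in words]
        # Step 2: hand-written rules selectively remove tags
        for i, cands in enumerate(candidates):
            if i > 0 and words[i - 1] == "to" and "VB" in cands:
                cands.intersection_update({"VB"})     # after "to", keep only the verb reading
            if i > 0 and candidates[i - 1] == {"DET"}:
                cands.discard("VB")                   # no verb right after a determiner
        return [sorted(c) for c in candidates]

    print(tag("promised to back the bill".split()))
    # [['VBD'], ['TO'], ['VB'], ['DET'], ['N']]: only the correct tags remain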
