

  1. INFO 4300 / CS4300 Information Retrieval, slides adapted from Hinrich Schütze's, linked from http://informationretrieval.org/
      IR 25/25: Text Classification and Exam Overview
      Paul Ginsparg, Cornell University, Ithaca, NY, 2 Dec 2009

  2. Administrativa
      Assignment 4 due Fri 3 Dec (extended to Sun 5 Dec).
      Mon 13 Dec: early final examination, 2:00-4:30 p.m., Upson B17 (by prior notification of intent via CMS).
      Fri 17 Dec: final examination, 2:00-4:30 p.m., Hollister Hall B14.
      Office hours: Wed 8 Dec, Fri 10 Dec, Wed 15 Dec.
      No office hour Fri 3 Dec (due to a conflict with a talk I'm giving that afternoon).

  3. Overview
      1. Recap
      2. Discussion
      3. More Statistical Learning
      4. Naive Bayes, cont'd
      5. Evaluation of TC
      6. NB independence assumptions
      7. Structured Retrieval
      8. Exam Overview

  4. Outline
      1. Recap
      2. Discussion
      3. More Statistical Learning
      4. Naive Bayes, cont'd
      5. Evaluation of TC
      6. NB independence assumptions
      7. Structured Retrieval
      8. Exam Overview

  5. Formal definition of TC: summary
      Training. Given:
      - a document space X (documents are represented in some high-dimensional space);
      - a fixed set of classes C = {c_1, c_2, ..., c_J}, human-defined for the needs of the application (e.g., rel vs. non-rel);
      - a training set D of labeled documents ⟨d, c⟩ ∈ X × C.
      Using a learning method or learning algorithm, we then wish to learn a classifier γ that maps documents to classes: γ : X → C.
      Application/Testing. Given a description d ∈ X of a document, determine γ(d) ∈ C, i.e., the class most appropriate for d.

  6. Classification methods: summary
      1. Manual (accurate if done by experts; consistent when the problem size and team are small; difficult and expensive to scale)
      2. Rule-based (accuracy very high if a rule has been carefully refined over time by a subject expert; building and maintaining rules is expensive)
      3. Statistical/Probabilistic: as per our definition of the classification problem, text classification as a learning problem. Supervised learning of the classification function γ and its application to classifying new documents. We have looked at a couple of methods for doing this: Rocchio, kNN. Now Naive Bayes.
      No free lunch: requires hand-classified training data. But this manual classification can be done by non-experts.

  7. The Naive Bayes classifier
      The Naive Bayes classifier is a probabilistic classifier. We compute the probability of a document d being in a class c as follows:
          P(c|d) ∝ P(c) ∏_{1 ≤ k ≤ n_d} P(t_k|c)
      n_d is the length of the document (number of tokens).
      P(t_k|c) is the conditional probability of term t_k occurring in a document of class c; it serves as a measure of how much evidence t_k contributes that c is the correct class.
      P(c) is the prior probability of c. If a document's terms do not provide clear evidence for one class vs. another, we choose the c with the higher P(c).
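The decision rule above can be sketched in a few lines of Python. This is a minimal illustration, not code from the slides: the function name `nb_classify` and the dictionary layout for priors and conditional probabilities are assumptions, and log-probabilities are summed instead of multiplying raw probabilities, since the product over many small P(t_k|c) values would otherwise underflow.

```python
import math

def nb_classify(doc_tokens, priors, condprob):
    """Return argmax_c [ log P(c) + sum_k log P(t_k | c) ].

    priors:   dict class -> P(c)                (hypothetical layout)
    condprob: dict (term, class) -> P(t | c)    (hypothetical layout)
    Summing logs is equivalent to the product form on the slide,
    but avoids floating-point underflow for long documents.
    """
    best_class, best_score = None, float("-inf")
    for c, prior in priors.items():
        score = math.log(prior)
        for t in doc_tokens:
            score += math.log(condprob[(t, c)])
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```

Note that the proportionality in P(c|d) ∝ ... is harmless here: the classifier only compares scores across classes, so the common normalizer P(d) can be dropped.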

  8. To avoid zeros: add-one smoothing
      Add one to each count to avoid zeros:
          P̂(t|c) = (T_ct + 1) / ∑_{t′∈V} (T_ct′ + 1) = (T_ct + 1) / ((∑_{t′∈V} T_ct′) + B)
      B is the number of different words (in this case the size of the vocabulary: |V| = M).
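The smoothed estimator can be written directly from the formula. A minimal sketch (the function name `smoothed_condprob` and its interface are assumptions, not from the slides): it takes the concatenated token stream of one class and the vocabulary, and returns the add-one estimate for every vocabulary term.

```python
from collections import Counter

def smoothed_condprob(tokens_in_class, vocabulary):
    """Add-one (Laplace) estimate: P(t|c) = (T_ct + 1) / (sum_t' T_ct' + B),
    where T_ct is the count of t in class c's text and B = |V|."""
    counts = Counter(tokens_in_class)
    total = len(tokens_in_class)          # sum of all term counts in class c
    B = len(vocabulary)                   # number of different words
    return {t: (counts[t] + 1) / (total + B) for t in vocabulary}
```

Because every term gets +1 in the numerator and the denominator grows by exactly B, the smoothed values still form a proper distribution over the vocabulary.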

  9. Exercise
                     docID   words in document                       in c = China?
      training set   1       Chinese Beijing Chinese                 yes
                     2       Chinese Chinese Shanghai                yes
                     3       Chinese Macao                           yes
                     4       Tokyo Japan Chinese                     no
      test set       5       Chinese Chinese Chinese Tokyo Japan     ?
      Estimate the parameters of the Naive Bayes classifier and classify the test document.

  10. Example: parameter estimates
      Priors: P̂(c) = 3/4 and P̂(c̄) = 1/4.
      Conditional probabilities:
          P̂(Chinese|c) = (5 + 1)/(8 + 6) = 6/14 = 3/7
          P̂(Tokyo|c) = P̂(Japan|c) = (0 + 1)/(8 + 6) = 1/14
          P̂(Chinese|c̄) = P̂(Tokyo|c̄) = P̂(Japan|c̄) = (1 + 1)/(3 + 6) = 2/9
      The denominators are (8 + 6) and (3 + 6) because the texts of c and c̄ have lengths 8 and 3, respectively, and because the constant B is 6, since the vocabulary consists of six terms.
      Exercise: verify that P̂(Chinese|c) + P̂(Beijing|c) + P̂(Shanghai|c) + P̂(Macao|c) + P̂(Tokyo|c) + P̂(Japan|c) = 1, and that the corresponding sum over c̄ also equals 1.
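These estimates can be checked with exact rational arithmetic. A small sketch (the helper name `p_hat` is an assumption for illustration): it applies the add-one formula with B = 6 to the counts from the training set, where class c has 8 tokens and its complement has 3.

```python
from fractions import Fraction

def p_hat(count, total, B=6):
    """Add-one estimate (T_ct + 1) / (total tokens in class + B), B = |V| = 6."""
    return Fraction(count + 1, total + B)

# Class c (China): 5 Chinese, 1 each of Beijing/Shanghai/Macao, 0 Tokyo/Japan (8 tokens).
# Complement class: 1 each of Tokyo, Japan, Chinese (3 tokens).
p_chinese_c    = p_hat(5, 8)   # = 6/14 = 3/7
p_tokyo_c      = p_hat(0, 8)   # = 1/14
p_chinese_cbar = p_hat(1, 3)   # = 2/9
```

The exercise on the slide follows immediately: each class's six estimates share one denominator, and the six numerators sum to exactly that denominator.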

  11. Example: classification
      d_5 = (Chinese Chinese Chinese Tokyo Japan)
          P̂(c|d_5) ∝ 3/4 · (3/7)^3 · 1/14 · 1/14 ≈ 0.0003
          P̂(c̄|d_5) ∝ 1/4 · (2/9)^3 · 2/9 · 2/9 ≈ 0.0001
      Thus the classifier assigns the test document to c = China: the three occurrences of the positive indicator Chinese in d_5 outweigh the occurrences of the two negative indicators Japan and Tokyo.
      Exercise: evaluate P̂(c|d) and P̂(c̄|d) for d_6 = (Chinese Chinese Tokyo Japan) and d_7 = (Chinese Tokyo Japan).
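The two scores can be reproduced exactly with `fractions`, which makes the comparison independent of rounding. A minimal verification sketch (variable names are illustrative):

```python
from fractions import Fraction as F

# Scores for d_5 = (Chinese Chinese Chinese Tokyo Japan), computed exactly:
score_c    = F(3, 4) * F(3, 7) ** 3 * F(1, 14) ** 2   # prior 3/4, three Chinese, Tokyo, Japan
score_cbar = F(1, 4) * F(2, 9) ** 3 * F(2, 9) ** 2    # prior 1/4, same five tokens under c-bar
winner = "c" if score_c > score_cbar else "c-bar"
```

The same two lines, with one or two of the `** 3` exponents reduced, answer the exercise for d_6 and d_7.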

  12. Outline
      1. Recap
      2. Discussion
      3. More Statistical Learning
      4. Naive Bayes, cont'd
      5. Evaluation of TC
      6. NB independence assumptions
      7. Structured Retrieval
      8. Exam Overview

  13. Discussion 6: More Statistical Methods
      Peter Norvig, "How to Write a Spelling Corrector", http://norvig.com/spell-correct.html
      See also http://yehha.net/20794/facebook.com/peter-norvig.html , "Engineering@Facebook: Tech Talk with Peter Norvig" (roughly 00:11:00-00:19:15 of a one-hour video, but the whole first half or more if you have time), as well as http://videolectures.net/cikm08_norvig_slatuad/ , "Statistical Learning as the Ultimate Agile Development Tool".
      Additional related reference: A. Halevy, P. Norvig, F. Pereira, "The Unreasonable Effectiveness of Data", IEEE Intelligent Systems, Mar/Apr 2009, http://doi.ieeecomputersociety.org/10.1109/MIS.2009.36 (copy at readings/unrealdata.pdf).

  14. A little theory
      Find the correction c that maximizes the probability of c given the original word w: argmax_c P(c|w).
      By Bayes' Theorem, this is equivalent to argmax_c P(w|c) P(c)/P(w). P(w) is the same for every possible c, so we can ignore it and consider argmax_c P(w|c) P(c).
      Three parts:
      - P(c), the probability that a proposed correction c stands on its own. The language model: "how likely is c to appear in an English text?" (P("the") high, P("zxzxzxzyyy") near zero)
      - P(w|c), the probability that w would be typed when the author meant c. The error model: "how likely is the author to type w by mistake instead of c?"
      - argmax_c, the control mechanism: choose the c that gives the best combined probability score.

  15. Example
      w = "thew"; two candidate corrections, c = "the" and c = "thaw". Which has higher P(c|w)?
      "thaw" requires only the small change of "a" to "e", but "the" is a very common word, and perhaps the typist's finger slipped off the "e" onto the "w".
      To estimate P(c|w), we have to consider both the probability of c and the probability of the change from c to w.

  16. Complete Spelling Corrector (Python; the original slide's Python 2 file(...) is updated to open(...))

      import re, collections

      def words(text):
          return re.findall('[a-z]+', text.lower())

      def train(features):
          # Start every count at 1 so unseen words are not assigned zero probability
          model = collections.defaultdict(lambda: 1)
          for f in features:
              model[f] += 1
          return model

      NWORDS = train(words(open('big.txt').read()))

      alphabet = 'abcdefghijklmnopqrstuvwxyz'

  17. (Spelling corrector, continued; the flattened function names known edits2 are restored to known_edits2)

      def edits1(word):
          s = [(word[:i], word[i:]) for i in range(len(word) + 1)]
          deletes    = [a + b[1:] for a, b in s if b]
          transposes = [a + b[1] + b[0] + b[2:] for a, b in s if len(b) > 1]
          replaces   = [a + c + b[1:] for a, b in s for c in alphabet if b]
          inserts    = [a + c + b for a, b in s for c in alphabet]
          return set(deletes + transposes + replaces + inserts)

      def known_edits2(word):
          return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

      def known(words):
          return set(w for w in words if w in NWORDS)

      def correct(word):
          candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
          return max(candidates, key=NWORDS.get)
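A quick standalone check of the edit-distance machinery (edits1 is reproduced here, without the corpus-dependent pieces, so the snippet runs on its own): both candidate corrections for "thew" from the earlier example are a single edit away.

```python
def edits1(word):
    """All strings one edit away from word: deletes, transposes, replaces, inserts."""
    alphabet = 'abcdefghijklmnopqrstuvwxyz'
    s = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [a + b[1:] for a, b in s if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in s if len(b) > 1]
    replaces   = [a + c + b[1:] for a, b in s for c in alphabet if b]
    inserts    = [a + c + b for a, b in s for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

candidates = edits1('thew')
# 'the' arises by deleting 'w'; 'thaw' by replacing 'e' with 'a'.
```

In the full corrector, `known(...)` then intersects this candidate set with NWORDS, and `correct` picks the candidate with the highest corpus count, implementing the argmax over P(c) from slide 14 (with a uniform stand-in for the error model P(w|c)).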

  18. Outline
      1. Recap
      2. Discussion
      3. More Statistical Learning
      4. Naive Bayes, cont'd
      5. Evaluation of TC
      6. NB independence assumptions
      7. Structured Retrieval
      8. Exam Overview

  19. More Data
      [Figure 1: learning curves for confusion set disambiguation.]
      M. Banko and E. Brill, "Scaling to Very Very Large Corpora for Natural Language Disambiguation" (2001), http://acl.ldc.upenn.edu/P/P01/P01-1005.pdf

  20. More Data for this Task
      M. Banko and E. Brill, "Scaling to Very Very Large Corpora for Natural Language Disambiguation" (2001), http://acl.ldc.upenn.edu/P/P01/P01-1005.pdf :
      "The amount of readily available on-line text has reached hundreds of billions of words and continues to grow. Yet for most core natural language tasks, algorithms continue to be optimized, tested and compared after training on corpora consisting of only one million words or less. In this paper, we evaluate the performance of different learning methods on a prototypical natural language disambiguation task, confusion set disambiguation, when trained on orders of magnitude more labeled data than has previously been used. We are fortunate that for this particular application, correctly labeled training data is free. Since this will often not be the case, we examine methods for effectively exploiting very large corpora when labeled data comes at a cost."
      (Confusion set disambiguation is the problem of choosing the correct use of a word, given a set of words with which it is commonly confused. Example confusion sets include: {principle, principal}, {then, than}, {to, two, too}, and {weather, whether}.)
