

  1. INFO 4300 / CS4300 Information Retrieval, slides adapted from Hinrich Schütze's, linked from http://informationretrieval.org/ IR 24/25: Text Classification and Naive Bayes. Paul Ginsparg, Cornell University, Ithaca, NY. 30 Nov 2010

  2. Administrativa
Assignment 4 due Fri 3 Dec (extended to Sun 5 Dec).
Mon 13 Dec: early final examination, 2:00-4:30 p.m., Upson B17 (by prior notification of intent via CMS).
Fri 17 Dec: final examination, 2:00-4:30 p.m., Hollister Hall B14.

  3. Overview
     1. Recap
     2. Text classification
     3. Naive Bayes
     4. Discussion

  4. Outline
     1. Recap
     2. Text classification
     3. Naive Bayes
     4. Discussion

  5. Major issue in clustering: labeling
After a clustering algorithm finds a set of clusters: how can they be useful to the end user? We need a pithy label for each cluster. For example, in search result clustering for "jaguar", the labels of the three clusters could be "animal", "car", and "operating system". How can we automatically find good labels for clusters? Discriminative vs. non-discriminative labeling; titles?

  6. Feature selection
In text classification, we usually represent documents in a high-dimensional space, with each dimension corresponding to a term. In this lecture: axis = dimension = word = term = feature.
Many dimensions correspond to rare words. Rare words can mislead the classifier. Rare misleading features are called noise features. Eliminating noise features from the representation increases the efficiency and effectiveness of text classification. Eliminating features is called feature selection.

  7. Different feature selection methods
A feature selection method is mainly defined by the feature utility measure it employs. Feature utility measures:
- Frequency: select the most frequent terms.
- Mutual information: select the terms with the highest mutual information (mutual information is also called information gain in this context).
- χ² (chi-square).

  8. How to compute MI values
Based on maximum likelihood estimates, the formula we actually use is:

  I(U;C) = (N11/N) log2( N·N11 / (N1.·N.1) )
         + (N10/N) log2( N·N10 / (N1.·N.0) )
         + (N01/N) log2( N·N01 / (N0.·N.1) )
         + (N00/N) log2( N·N00 / (N0.·N.0) )          (1)

where
  N11: # of documents that contain t (et = 1) and are in c (ec = 1)
  N10: # of documents that contain t (et = 1) and are not in c (ec = 0)
  N01: # of documents that don't contain t (et = 0) and are in c (ec = 1)
  N00: # of documents that don't contain t (et = 0) and are not in c (ec = 0)
  N = N00 + N01 + N10 + N11
  p(t,c) ≈ N11/N,  p(t̄,c) ≈ N01/N,  p(t,c̄) ≈ N10/N,  p(t̄,c̄) ≈ N00/N
  N1. = N10 + N11: # documents that contain t, p(t) ≈ N1./N
  N.1 = N01 + N11: # documents in c, p(c) ≈ N.1/N
  N0. = N00 + N01: # documents that don't contain t, p(t̄) ≈ N0./N
  N.0 = N00 + N10: # documents not in c, p(c̄) ≈ N.0/N

  9. MI example for poultry/export in Reuters

                      ec = epoultry = 1   ec = epoultry = 0
  et = eexport = 1    N11 = 49            N10 = 141
  et = eexport = 0    N01 = 27,652        N00 = 774,106

Plug these values into the formula, with N = 801,948:

  I(U;C) = (49/801,948) log2( 801,948·49 / ((49+27,652)(49+141)) )
         + (141/801,948) log2( 801,948·141 / ((141+774,106)(49+141)) )
         + (27,652/801,948) log2( 801,948·27,652 / ((49+27,652)(27,652+774,106)) )
         + (774,106/801,948) log2( 801,948·774,106 / ((141+774,106)(27,652+774,106)) )
         ≈ 0.000105
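The computation above can be sketched in a few lines of Python (the function name and structure are ours, not from the slides):

```python
import math

def mutual_information(n11, n10, n01, n00):
    """Mutual information I(U;C) from a 2x2 table of document counts.

    n11: docs containing term t and in class c
    n10: docs containing t, not in c
    n01: docs not containing t, in c
    n00: docs neither containing t nor in c
    """
    n = n11 + n10 + n01 + n00
    n1_, n0_ = n11 + n10, n01 + n00   # docs with / without t
    n_1, n_0 = n11 + n01, n10 + n00   # docs in / not in c
    mi = 0.0
    for n_tc, n_t, n_c in [(n11, n1_, n_1), (n10, n1_, n_0),
                           (n01, n0_, n_1), (n00, n0_, n_0)]:
        if n_tc > 0:  # 0 * log 0 = 0 by convention
            mi += (n_tc / n) * math.log2(n * n_tc / (n_t * n_c))
    return mi

# Reuters counts for term "export" and class "poultry" (slide 9)
print(mutual_information(49, 141, 27_652, 774_106))  # ~1e-4
```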

  10. MI feature selection on Reuters
Terms with highest mutual information for three classes:

  coffee             sports             poultry
  coffee     0.0111  soccer     0.0681  poultry      0.0013
  bags       0.0042  cup        0.0515  meat         0.0008
  growers    0.0025  match      0.0441  chicken      0.0006
  kg         0.0019  matches    0.0408  agriculture  0.0005
  colombia   0.0018  played     0.0388  avian        0.0004
  brazil     0.0016  league     0.0386  broiler      0.0003
  export     0.0014  beat       0.0301  veterinary   0.0003
  exporters  0.0013  game       0.0299  birds        0.0003
  exports    0.0013  games      0.0284  inspection   0.0003
  crop       0.0012  team       0.0264  pathogenic   0.0003

I(export, poultry) ≈ 0.000105 is not among the ten highest for class poultry, but still potentially significant.

  11. χ² feature selection
χ² tests the independence of two events, p(A,B) = p(A)p(B) (equivalently p(A|B) = p(A), p(B|A) = p(B)). Test occurrence of the term and occurrence of the class, and rank with respect to:

  X²(D,t,c) = Σ_{et∈{0,1}} Σ_{ec∈{0,1}} (N_etec − E_etec)² / E_etec

where N = observed frequency in D and E = expected frequency (e.g., E11 is the expected frequency of t and c occurring together in a document, assuming term and class are independent).
A high value of X² indicates that the independence hypothesis is incorrect, i.e., observed and expected are too dissimilar. If occurrence of term and class are dependent events, then occurrence of the term makes the class more (or less) likely, and the term is hence helpful as a feature.

  12. χ² feature selection, example
Are class poultry and term export interdependent by the χ² test?

                      ec = epoultry = 1   ec = epoultry = 0
  et = eexport = 1    N11 = 49            N10 = 141
  et = eexport = 0    N01 = 27,652        N00 = 774,106

N = N11 + N10 + N01 + N00 = 801,948. Identify:

  p(t) = (N11 + N10)/N,   p(c) = (N11 + N01)/N,
  p(t̄) = (N01 + N00)/N,   p(c̄) = (N10 + N00)/N

Then estimate expected frequencies:

                      ec = epoultry = 1    ec = epoultry = 0
  et = eexport = 1    E11 = N p(t) p(c)    E10 = N p(t) p(c̄)
  et = eexport = 0    E01 = N p(t̄) p(c)    E00 = N p(t̄) p(c̄)

e.g., E11 = N · p(t) · p(c) = N · (N11 + N10)/N · (N11 + N01)/N
          = (49 + 141)(49 + 27,652)/801,948 ≈ 6.6

  13. Expected frequencies
From

                      ec = epoultry = 1    ec = epoultry = 0
  et = eexport = 1    E11 = N p(t) p(c)    E10 = N p(t) p(c̄)
  et = eexport = 0    E01 = N p(t̄) p(c)    E00 = N p(t̄) p(c̄)

the full table of expected frequencies is

                      ec = epoultry = 1   ec = epoultry = 0
  et = eexport = 1    E11 ≈ 6.6           E10 ≈ 183.4
  et = eexport = 0    E01 ≈ 27,694.4      E00 ≈ 774,063.6

Compared to the original data:

                      ec = epoultry = 1   ec = epoultry = 0
  et = eexport = 1    N11 = 49            N10 = 141
  et = eexport = 0    N01 = 27,652        N00 = 774,106

the question is now whether a quantity like the surplus of N11 = 49 over the expected E11 ≈ 6.6 is statistically significant.
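These expected counts follow directly from the marginals; a quick check in a few lines of Python (variable names are illustrative):

```python
# Observed counts for term "export" / class "poultry" (slide 12)
n11, n10, n01, n00 = 49, 141, 27_652, 774_106
n = n11 + n10 + n01 + n00          # 801,948

p_t = (n11 + n10) / n              # p(export)
p_c = (n11 + n01) / n              # p(poultry)

# Expected counts under independence: E = N * p(term) * p(class)
e11 = n * p_t * p_c                # ~6.6
e10 = n * p_t * (1 - p_c)          # ~183.4
e01 = n * (1 - p_t) * p_c          # ~27,694.4
e00 = n * (1 - p_t) * (1 - p_c)    # ~774,063.6
print(round(e11, 1), round(e10, 1), round(e01, 1), round(e00, 1))
```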

  14. For these values of N and E, the result for X² is

  X²(D,t,c) = Σ_{et∈{0,1}} Σ_{ec∈{0,1}} (N_etec − E_etec)² / E_etec ≈ 284

We are testing the assumption that the values of the N_etec are generated by two independent probabilities, fitting the three ratios with two parameters p(t) and p(c), leaving one degree of freedom. There is a tabulated distribution, called the χ² distribution (in this case with one degree of freedom), which assesses the statistical likelihood of any value of X² as defined above (analogous to the likelihood of standard deviations from the mean of a Gaussian distribution):

  p      χ² critical
  .1      2.71
  .05     3.84
  .01     6.63
  .005    7.88
  .001   10.83

The above X² ≈ 284 > 10.83, giving a less than .1% chance that so large a value of X² would occur if export/poultry were really independent (equivalently, a 99.9% chance they're dependent).
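The X² statistic for this example can likewise be verified with a short Python sketch (the function name is ours):

```python
def chi_square(n11, n10, n01, n00):
    """X^2 statistic for a 2x2 term/class contingency table."""
    n = n11 + n10 + n01 + n00
    p_t = (n11 + n10) / n      # fraction of docs containing the term
    p_c = (n11 + n01) / n      # fraction of docs in the class
    # Expected counts under independence: E = N * p(term) * p(class)
    expected = [n * p_t * p_c,              # E11
                n * p_t * (1 - p_c),        # E10
                n * (1 - p_t) * p_c,        # E01
                n * (1 - p_t) * (1 - p_c)]  # E00
    observed = [n11, n10, n01, n00]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

x2 = chi_square(49, 141, 27_652, 774_106)
print(round(x2))   # ~284, far above the .001 critical value 10.83
```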

  15. Outline
     1. Recap
     2. Text classification
     3. Naive Bayes
     4. Discussion

  16. Relevance feedback
In relevance feedback, the user marks documents as relevant/nonrelevant. Relevant/nonrelevant can be viewed as classes or categories. For each document, the user decides which of these two classes is correct. The IR system then uses these class assignments to build a better query ("model") of the information need, and returns better documents. Relevance feedback is thus a form of text classification. The notion of text classification (TC) is very general and has many applications within and beyond information retrieval.

  17. Another TC task: spam filtering

From: "" <takworlld@hotmail.com>
Subject: real estate is the only way... gem oalvgkay

Anyone can buy real estate with no money down
Stop paying rent TODAY !
There is no need to spend hundreds or even thousands for similar courses
I am 22 years old and I have already purchased 6 properties using the
methods outlined in this truly INCREDIBLE ebook.
Change your life NOW !
=================================================
Click Below to order:
http://www.wholesaledaily.com/sales/nmd.htm
=================================================

How would you write a program that would automatically detect and delete this type of message?
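One could start with hand-coded rules, e.g. a keyword score. The sketch below is a toy illustration (the word list and threshold are invented for the example); the brittleness of such rules is one motivation for the learned classifiers defined next:

```python
# Hypothetical indicator words for this kind of spam
SPAM_WORDS = {"buy", "order", "incredible", "click", "money", "now"}

def looks_like_spam(message, threshold=3):
    """Flag a message if it contains enough distinct spam-indicator words."""
    words = {w.strip(".,!?:") for w in message.lower().split()}
    return len(words & SPAM_WORDS) >= threshold

print(looks_like_spam("Click below to order now and change your life NOW!"))
```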

  18. Formal definition of TC: summary
Training. Given:
- a document space X (documents are represented in some high-dimensional space);
- a fixed set of classes C = {c1, c2, ..., cJ}, human-defined for the needs of the application (e.g., relevant vs. nonrelevant);
- a training set D of labeled documents ⟨d, c⟩ ∈ X × C.
Using a learning method or learning algorithm, we then wish to learn a classifier γ that maps documents to classes: γ : X → C.
Application/Testing. Given a description d ∈ X of a document, determine γ(d) ∈ C, i.e., the class most appropriate for d.
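As a concrete instance of learning such a γ, here is a minimal multinomial Naive Bayes sketch with add-one smoothing (a standard formulation assumed here; this excerpt does not show the lecture's own derivation):

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_docs):
    """Learn log priors and smoothed log word likelihoods from (words, class) pairs."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, c in labeled_docs:
        class_docs[c] += 1
        word_counts[c].update(words)
        vocab.update(words)
    n_docs = sum(class_docs.values())
    model = {}
    for c in class_docs:
        total = sum(word_counts[c].values())
        model[c] = (
            math.log(class_docs[c] / n_docs),  # log prior
            {w: math.log((word_counts[c][w] + 1) / (total + len(vocab)))
             for w in vocab},                  # add-one smoothing
        )
    return model

def classify(model, words):
    """gamma(d): return the class with the highest log posterior
    (out-of-vocabulary words are ignored)."""
    def score(c):
        log_prior, log_lik = model[c]
        return log_prior + sum(log_lik[w] for w in words if w in log_lik)
    return max(model, key=score)

model = train_nb([
    (["cheap", "pills", "order"], "spam"),
    (["order", "now", "cheap"], "spam"),
    (["meeting", "agenda", "notes"], "ham"),
])
print(classify(model, ["cheap", "order"]))  # spam
```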
