INFO 4300 / CS4300 Information Retrieval, slides adapted from Hinrich Schütze's, linked from http://informationretrieval.org/ IR 26/26: Feature Selection and Exam Overview Paul Ginsparg Cornell University, Ithaca, NY 3 Dec 2009 1 / 32
Administrativa Assignment 4 due Fri 4 Dec (extended to Sun 6 Dec). 2 / 32
Combiner in Simulator "Can be added, but it makes less sense to have a combiner in a simulator. Combiners help speed things up by providing local (in-memory) partial reduces. In a simulator we are not really concerned about efficiency." Hadoop Wiki: "When the map operation outputs its pairs they are already available in memory. For efficiency reasons, sometimes it makes sense to take advantage of this fact by supplying a combiner class to perform a reduce-type function. If a combiner is used then the map key-value pairs are not immediately written to the output. Instead they will be collected in lists, one list per each key value. When a certain number of key-value pairs have been written, this buffer is flushed by passing all the values of each key to the combiner's reduce method and outputting the key-value pairs of the combine operation as if they were created by the original map operation." 3 / 32
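As a rough illustration of that buffering idea (not the simulator's or Hadoop's actual code), here is a minimal Python sketch of a map phase that collects its key-value pairs in per-key lists and flushes them through a combiner; the function names and the flush threshold are hypothetical.

    from collections import defaultdict

    def word_count_map(doc):
        # map: emit (word, 1) for every token in the document
        for token in doc.split():
            yield token, 1

    def combine(key, values):
        # combiner: a local, reduce-type partial aggregation
        return sum(values)

    def run_map_with_combiner(docs, flush_every=1000):
        buffer = defaultdict(list)   # one list of map outputs per key
        written = 0
        for doc in docs:
            for key, value in word_count_map(doc):
                buffer[key].append(value)
                written += 1
                if written >= flush_every:
                    # flush: pass each key's buffered values to the combiner
                    for k, vals in buffer.items():
                        yield k, combine(k, vals)
                    buffer.clear()
                    written = 0
        for k, vals in buffer.items():   # final flush
            yield k, combine(k, vals)

    # partial counts; a downstream reduce would sum them per key
    print(list(run_map_with_combiner(["to be or not to be"], flush_every=6)))
    # [('to', 2), ('be', 2), ('or', 1), ('not', 1)]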
Assignment 3
The PageRank r_j of page j is determined self-consistently by the equation

    r_j = \frac{\alpha}{n} + (1 - \alpha) \sum_{i \,|\, i \to j} \frac{r_i}{d_i} ,

where α is a number between 0 and 1 (originally taken to be .15), the sum on i is over pages i pointing to j, and d_i is the outgoing degree of page i.
Incidence matrix: A_ij = 1 if i points to j, otherwise A_ij = 0. The transition probability from page i to page j is

    P_{ij} = \frac{\alpha}{n} O_{ij} + (1 - \alpha) \frac{1}{d_i} A_{ij} ,

where n = total # of pages, d_i is the outdegree of node i, and O_ij = 1 (∀ i, j). The matrix eigenvector relation \vec{r} P = \vec{r}, i.e., \vec{r} = P^T \vec{r}, is equivalent to the equation above (with \vec{r} normalized as a probability, so that Σ_i r_i = 1).
4 / 32
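Not the assignment's reference solution, just a minimal power-iteration sketch of the relation r = P^T r above, assuming a small hand-built incidence matrix; dangling nodes are left with teleport mass only in this sketch.

    import numpy as np

    def pagerank(A, alpha=0.15, iters=100):
        """A: n x n incidence matrix, A[i, j] = 1 if page i links to page j."""
        n = A.shape[0]
        d = A.sum(axis=1)                    # outdegrees d_i
        P = np.full((n, n), alpha / n)       # teleport part: (alpha/n) * O_ij
        for i in range(n):
            if d[i] > 0:                     # follow-link part: (1 - alpha) * A_ij / d_i
                P[i] += (1 - alpha) * A[i] / d[i]
        r = np.full(n, 1.0 / n)              # start from the uniform distribution
        for _ in range(iters):
            r = P.T @ r                      # r <- P^T r
        return r / r.sum()                   # normalized so that sum_i r_i = 1

    # made-up 3-page link structure, for illustration only
    A = np.array([[0, 1, 1],
                  [1, 0, 0],
                  [0, 1, 0]])
    print(pagerank(A))                       # stationary distribution over the three pages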
Overview 1. Recap 2. Feature selection 3. Structured Retrieval 4. Exam Overview 5 / 32
Outline 1. Recap 2. Feature selection 3. Structured Retrieval 4. Exam Overview 6 / 32
More Data
[Figure 1: Learning Curves for Confusion Set Disambiguation]
From "Scaling to Very Very Large Corpora for Natural Language Disambiguation," M. Banko and E. Brill (2001), http://acl.ldc.upenn.edu/P/P01/P01-1005.pdf
7 / 32
Statistical Learning
- Spelling with Statistical Learning
- Google Sets
- Statistical Machine Translation
- Canonical image selection from the web
- Learning people annotation from the web via consistency learning
- and others . . .
8 / 32
Outline 1. Recap 2. Feature selection 3. Structured Retrieval 4. Exam Overview 9 / 32
Feature selection
In text classification, we usually represent documents in a high-dimensional space, with each dimension corresponding to a term. In this lecture: axis = dimension = word = term = feature.
Many dimensions correspond to rare words, and rare words can mislead the classifier. Rare misleading features are called noise features. Eliminating noise features from the representation increases the efficiency and effectiveness of text classification. Eliminating features is called feature selection.
10 / 32
Different feature selection methods
A feature selection method is mainly defined by the feature utility measure it employs.
Feature utility measures:
- Frequency – select the most frequent terms
- Mutual information – select the terms with the highest mutual information (mutual information is also called information gain in this context)
- Chi-square
11 / 32
Information
Entropy H[p] = Σ_{i=1}^{n} (−p_i log_2 p_i) measures information uncertainty (p. 91 in book); it has its maximum H = log_2 n when all p_i = 1/n.
Consider two probability distributions: p(x) for x ∈ X and p(y) for y ∈ Y.
MI: I[X;Y] = H[p(x)] + H[p(y)] − H[p(x,y)] measures how much information p(x) gives about p(y) (and vice versa).
MI is zero iff p(x,y) = p(x) p(y) for all x ∈ X and y ∈ Y, i.e., x and y are independent; it can be as large as H[p(x)] or H[p(y)].

    I[X;Y] = \sum_{x \in X,\, y \in Y} p(x,y) \log_2 \frac{p(x,y)}{p(x)\,p(y)}

12 / 32
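A small sketch, using a made-up joint distribution, that checks the identity I[X;Y] = H[p(x)] + H[p(y)] − H[p(x,y)] against the direct sum:

    import math

    def H(probs):
        # entropy in bits; zero-probability outcomes contribute nothing
        return -sum(p * math.log2(p) for p in probs if p > 0)

    # hypothetical joint distribution p(x, y) over two binary variables
    p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
    p_x = {x: sum(v for (a, _), v in p_xy.items() if a == x) for x in (0, 1)}
    p_y = {y: sum(v for (_, b), v in p_xy.items() if b == y) for y in (0, 1)}

    mi_direct = sum(v * math.log2(v / (p_x[x] * p_y[y]))
                    for (x, y), v in p_xy.items() if v > 0)
    mi_entropy = H(p_x.values()) + H(p_y.values()) - H(p_xy.values())
    print(mi_direct, mi_entropy)   # both ≈ 0.1245 bits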
Mutual information
Compute the feature utility A(t, c) as the expected mutual information (MI) of term t and class c. MI tells us "how much information" the term contains about the class and vice versa. For example, if a term's occurrence is independent of the class (the same proportion of docs within/without the class contain the term), then MI is 0.
Definition:

    I(U;C) = \sum_{e_t \in \{1,0\}} \sum_{e_c \in \{1,0\}} P(U = e_t, C = e_c) \log_2 \frac{P(U = e_t, C = e_c)}{P(U = e_t)\, P(C = e_c)}

13 / 32
How to compute MI values
Based on maximum likelihood estimates, the formula we actually use is:

    I(U;C) = \frac{N_{11}}{N} \log_2 \frac{N\,N_{11}}{N_{1.}\,N_{.1}}
           + \frac{N_{01}}{N} \log_2 \frac{N\,N_{01}}{N_{0.}\,N_{.1}}
           + \frac{N_{10}}{N} \log_2 \frac{N\,N_{10}}{N_{1.}\,N_{.0}}
           + \frac{N_{00}}{N} \log_2 \frac{N\,N_{00}}{N_{0.}\,N_{.0}}

N_11: number of documents that contain t (e_t = 1) and are in c (e_c = 1);
N_10: number of documents that contain t (e_t = 1) and are not in c (e_c = 0);
N_01: number of documents that do not contain t (e_t = 0) and are in c (e_c = 1);
N_00: number of documents that do not contain t (e_t = 0) and are not in c (e_c = 0);
the marginals are N_1. = N_10 + N_11, N_0. = N_00 + N_01, N_.1 = N_01 + N_11, N_.0 = N_00 + N_10, and N = N_00 + N_01 + N_10 + N_11.
14 / 32
MI example for poultry / export in Reuters

                          e_c = e_poultry = 1    e_c = e_poultry = 0
    e_t = e_export = 1    N_11 = 49              N_10 = 27,652
    e_t = e_export = 0    N_01 = 141             N_00 = 774,106

Plug these values into the formula:

    I(U;C) = \frac{49}{801{,}948} \log_2 \frac{801{,}948 \cdot 49}{(49+27{,}652)(49+141)}
           + \frac{141}{801{,}948} \log_2 \frac{801{,}948 \cdot 141}{(141+774{,}106)(49+141)}
           + \frac{27{,}652}{801{,}948} \log_2 \frac{801{,}948 \cdot 27{,}652}{(49+27{,}652)(27{,}652+774{,}106)}
           + \frac{774{,}106}{801{,}948} \log_2 \frac{801{,}948 \cdot 774{,}106}{(141+774{,}106)(27{,}652+774{,}106)}
           ≈ 0.0001105

15 / 32
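A hedged Python sketch of the same computation (the function name expected_mi is mine, not from the book):

    import math

    def expected_mi(n11, n10, n01, n00):
        # maximum-likelihood estimate of I(U;C) from a 2x2 document count table
        n = n11 + n10 + n01 + n00
        total = 0.0
        for n_tc, n_t, n_c in [(n11, n11 + n10, n11 + n01),
                               (n10, n11 + n10, n10 + n00),
                               (n01, n01 + n00, n11 + n01),
                               (n00, n01 + n00, n10 + n00)]:
            if n_tc > 0:
                total += (n_tc / n) * math.log2(n * n_tc / (n_t * n_c))
        return total

    # poultry / export counts from the table above
    print(expected_mi(49, 27652, 141, 774106))   # ≈ 0.0001105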
MI feature selection on Reuters

    Class: coffee             Class: sports
    coffee       0.0111       soccer       0.0681
    bags         0.0042       cup          0.0515
    growers      0.0025       match        0.0441
    kg           0.0019       matches      0.0408
    colombia     0.0018       played       0.0388
    brazil       0.0016       league       0.0386
    export       0.0014       beat         0.0301
    exporters    0.0013       game         0.0299
    exports      0.0013       games        0.0284
    crop         0.0012       team         0.0264

16 / 32
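Given per-term utility scores like these, feature selection itself is just ranking and truncation; a tiny sketch using the coffee column above (the helper name is hypothetical, the scores are copied from the table):

    coffee_mi = {"coffee": 0.0111, "bags": 0.0042, "growers": 0.0025, "kg": 0.0019,
                 "colombia": 0.0018, "brazil": 0.0016, "export": 0.0014,
                 "exporters": 0.0013, "exports": 0.0013, "crop": 0.0012}

    def select_features(scores, k):
        # keep the k terms with the highest utility score
        return sorted(scores, key=scores.get, reverse=True)[:k]

    print(select_features(coffee_mi, 5))   # ['coffee', 'bags', 'growers', 'kg', 'colombia']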
χ² feature selection
χ² tests the independence of two events, p(A, B) = p(A) p(B) (or equivalently p(A|B) = p(A) and p(B|A) = p(B)). Here the two events are occurrence of the term and occurrence of the class, and terms are ranked w.r.t.:

    X^2(D, t, c) = \sum_{e_t \in \{0,1\}} \sum_{e_c \in \{0,1\}} \frac{(N_{e_t e_c} - E_{e_t e_c})^2}{E_{e_t e_c}}

where N = observed frequency in D and E = expected frequency (e.g., E_11 is the expected frequency of t and c occurring together in a document, assuming term and class are independent).
A high value of X² indicates that the independence hypothesis is incorrect, i.e., observed and expected are not similar. If occurrence of term and class are dependent events, then occurrence of the term makes the class more (or less) likely, hence it is helpful as a feature.
17 / 32
χ² feature selection, example

                          e_c = e_poultry = 1    e_c = e_poultry = 0
    e_t = e_export = 1    N_11 = 49              N_10 = 27,652
    e_t = e_export = 0    N_01 = 141             N_00 = 774,106

    E_{11} = N \cdot P(t) \cdot P(c) = N \cdot \frac{N_{11} + N_{10}}{N} \cdot \frac{N_{11} + N_{01}}{N} = N \cdot \frac{49 + 27{,}652}{N} \cdot \frac{49 + 141}{N} ≈ 6.6

                          e_c = e_poultry = 1    e_c = e_poultry = 0
    e_t = e_export = 1    E_11 ≈ 6.6             E_10 ≈ 27,694.4
    e_t = e_export = 0    E_01 ≈ 183.4           E_00 ≈ 774,063.6

    X^2(D, t, c) = \sum_{e_t \in \{0,1\}} \sum_{e_c \in \{0,1\}} \frac{(N_{e_t e_c} - E_{e_t e_c})^2}{E_{e_t e_c}} ≈ 284

18 / 32
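A minimal sketch of the same calculation (the helper name is mine), reproducing the expected counts and X² ≈ 284 for the export/poultry table:

    def chi_square(n11, n10, n01, n00):
        # X^2 statistic for the 2x2 term/class contingency table
        n = n11 + n10 + n01 + n00
        row = {1: n11 + n10, 0: n01 + n00}     # docs with / without the term (e_t)
        col = {1: n11 + n01, 0: n10 + n00}     # docs in / not in the class (e_c)
        observed = {(1, 1): n11, (1, 0): n10, (0, 1): n01, (0, 0): n00}
        x2 = 0.0
        for (e_t, e_c), n_obs in observed.items():
            expected = row[e_t] * col[e_c] / n   # E = N * P(term value) * P(class value)
            x2 += (n_obs - expected) ** 2 / expected
        return x2

    print(chi_square(49, 27652, 141, 774106))    # ≈ 284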
Naive Bayes: Effect of feature selection
[Figure: F1 measure vs. number of features selected (1 to 10,000), comparing multinomial, MI; multinomial, chi-square; multinomial, frequency; and binomial, MI (multinomial = multinomial Naive Bayes).]
19 / 32
Feature selection for Naive Bayes In general, feature selection is necessary for Naive Bayes to get decent performance. Also true for most other learning methods in text classification: you need feature selection for optimal performance. 20 / 32
Outline 1. Recap 2. Feature selection 3. Structured Retrieval 4. Exam Overview 21 / 32
XML markup

    <play>
      <author>Shakespeare</author>
      <title>Macbeth</title>
      <act number="I">
        <scene number="vii">
          <title>Macbeth's castle</title>
          <verse>Will I with wine and wassail ...</verse>
        </scene>
      </act>
    </play>

22 / 32
XML Doc as DOM object 23 / 32
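For illustration only (not from the lecture, and using Python's ElementTree rather than a full DOM API), a minimal sketch that parses the Macbeth fragment above into a tree and walks it:

    import xml.etree.ElementTree as ET

    xml = """<play>
      <author>Shakespeare</author>
      <title>Macbeth</title>
      <act number="I">
        <scene number="vii">
          <title>Macbeth's castle</title>
          <verse>Will I with wine and wassail ...</verse>
        </scene>
      </act>
    </play>"""

    root = ET.fromstring(xml)
    print(root.find("title").text)                            # Macbeth
    for scene in root.iter("scene"):
        print(scene.get("number"), scene.find("title").text)  # vii Macbeth's castle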
Outline 1. Recap 2. Feature selection 3. Structured Retrieval 4. Exam Overview 24 / 32
Definition of information retrieval (from Lecture 1) Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers). Three scales (web, enterprise/inst/domain, personal) 25 / 32
“Plan” (from Lecture 1)
Search full text: basic concepts
Web search
Probabilistic Retrieval
Interfaces
Metadata / Semantics
IR ⇔ NLP ⇔ ML
Prereqs: introductory courses in data structures and algorithms, in linear algebra, and in probability theory
26 / 32