INFO 4300 / CS4300 Information Retrieval slides adapted from Hinrich Schütze's, linked from http://informationretrieval.org/ IR 20/25: Linear Classifiers and Flat clustering Paul Ginsparg Cornell University, Ithaca, NY 11 Nov 2009 1 / 98
Administrativa Assignment 4 to be posted tomorrow, due Fri 3 Dec (last day of classes), permitted until Sun 5 Dec (no extensions) 2 / 98
Discussion 5 (16, 18 Nov) For this class, read and be prepared to discuss the following: Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters. Usenix OSDI '04, 2004. http://www.usenix.org/events/osdi04/tech/full_papers/dean/dean.pdf See also (Jan 2009): http://michaelnielsen.org/blog/write-your-first-mapreduce-program-in-20-minutes/ part of lectures on the "google technology stack": http://michaelnielsen.org/blog/lecture-course-the-google-technology-stack/ (including PageRank, etc.) 3 / 98
Overview
1. Recap
2. Linear classifiers
3. > two classes
4. Clustering: Introduction
5. Clustering in IR
6. K-means
4 / 98
Outline
1. Recap
2. Linear classifiers
3. > two classes
4. Clustering: Introduction
5. Clustering in IR
6. K-means
5 / 98
Classes in the vector space
[Figure: documents from the classes UK, China and Kenya in a 2D vector space, together with a new document ⋆ to be classified]
Should the document ⋆ be assigned to China, UK or Kenya?
Find separators between the classes.
Based on these separators: ⋆ should be assigned to China.
How do we find separators that do a good job at classifying new documents like ⋆?
6 / 98
Rocchio illustrated: a1 = a2, b1 = b2, c1 = c2
[Figure: Rocchio classification of the UK, China and Kenya examples; points on each class boundary are equidistant from the two adjacent class centroids (a1 = a2, b1 = b2, c1 = c2)]
7 / 98
kNN classification
kNN classification is another vector space classification method. It also is very simple and easy to implement. kNN is more accurate (in most cases) than Naive Bayes and Rocchio. If you need to get a pretty accurate classifier up and running in a short time, and you don't care about efficiency that much, use kNN.
8 / 98
kNN is based on Voronoi tessellation
[Figure: documents from two classes (x and ⋄) inducing a Voronoi tessellation of the plane; what are the 1NN and 3NN classification decisions for the new document ⋆?]
9 / 98
Exercise
[Figure: training documents from two classes (x and o) surrounding a new document ⋆]
How is ⋆ classified by: (i) 1-NN (ii) 3-NN (iii) 9-NN (iv) 15-NN (v) Rocchio?
10 / 98
kNN: Discussion
No training necessary. But linear preprocessing of documents is as expensive as training Naive Bayes. You will always preprocess the training set, so in reality training time of kNN is linear. kNN is very accurate if the training set is large. Optimality result: asymptotically zero error if the Bayes rate is zero. But kNN can be very inaccurate if the training set is small.
11 / 98
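To make the kNN decision rule concrete, here is a minimal sketch of a kNN classifier over sparse document vectors; the toy documents, class labels, and the choice of cosine similarity are illustrative assumptions, not data from the slides.

```python
from collections import Counter
import math

def cosine(u, v):
    # cosine similarity between two sparse vectors stored as dicts {term: weight}
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_classify(query, training_set, k=3):
    # training_set: list of (vector, label) pairs; no training step is needed,
    # we simply rank the stored documents by similarity to the query
    neighbors = sorted(training_set, key=lambda dl: cosine(query, dl[0]), reverse=True)[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# toy example (hypothetical documents)
train = [({"beijing": 1.0, "china": 0.8}, "China"),
         ({"london": 1.0, "uk": 0.9}, "UK"),
         ({"nairobi": 1.0, "kenya": 0.9}, "Kenya"),
         ({"shanghai": 1.0, "china": 0.7}, "China")]
print(knn_classify({"china": 1.0, "beijing": 0.5}, train, k=3))  # -> "China"
```

As the slide notes, all the cost is at query time: every classification scans the stored training vectors, which is why preprocessing and index structures matter in practice.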
Digression: "naive" Bayes
Spam classifier: Imagine a training set of 2000 messages, 1000 classified as spam (S), and 1000 classified as non-spam (S̄). 180 of the S messages contain the word "offer". 20 of the S̄ messages contain the word "offer". Suppose you receive a message containing the word "offer". What is the probability it is S?
Estimate: 180 / (180 + 20) = 9/10.
(Formally, assuming a "flat prior" p(S) = p(S̄):
p(S | offer) = p(offer | S) p(S) / [ p(offer | S) p(S) + p(offer | S̄) p(S̄) ] = (180/1000) / (180/1000 + 20/1000) = 9/10.)
12 / 98
Basics of probability theory
A = event, 0 ≤ p(A) ≤ 1
joint probability: p(A, B) = p(A ∩ B)
conditional probability: p(A | B) = p(A, B) / p(B)
Note p(A, B) = p(A | B) p(B) = p(B | A) p(A), which gives the posterior probability of A after seeing the evidence B:
Bayes' Thm: p(A | B) = p(B | A) p(A) / p(B)
In the denominator, use p(B) = p(B, A) + p(B, Ā) = p(B | A) p(A) + p(B | Ā) p(Ā)
Odds: O(A) = p(A) / p(Ā) = p(A) / (1 − p(A))
13 / 98
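A small sketch of Bayes' theorem with the denominator expanded by total probability, applied to the spam "offer" numbers from the earlier slide; the function names are just for illustration.

```python
def bayes_posterior(p_B_given_A, p_A, p_B_given_notA):
    # p(A|B) = p(B|A) p(A) / ( p(B|A) p(A) + p(B|not A) p(not A) )
    p_notA = 1.0 - p_A
    return p_B_given_A * p_A / (p_B_given_A * p_A + p_B_given_notA * p_notA)

def odds(p_A):
    # O(A) = p(A) / (1 - p(A))
    return p_A / (1.0 - p_A)

# spam example with flat prior p(S) = 0.5, p(offer|S) = 180/1000, p(offer|non-spam) = 20/1000
print(bayes_posterior(0.18, 0.5, 0.02))  # -> 0.9
print(odds(0.9))                         # -> 9.0
```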
"naive" Bayes, cont'd
Spam classifier: Imagine a training set of 2000 messages, 1000 classified as spam (S), and 1000 classified as non-spam (S̄). Words w_i = {"offer", "FF0000", "click", "unix", "job", "enlarge", ...}. n_i of the S messages contain the word w_i; m_i of the S̄ messages contain the word w_i. Suppose you receive a message containing the words w_1, w_4, w_5, ... What are the odds it is S?
Estimate:
p(S | w_1, w_4, w_5, ...) ∝ p(w_1, w_4, w_5, ... | S) p(S)
p(S̄ | w_1, w_4, w_5, ...) ∝ p(w_1, w_4, w_5, ... | S̄) p(S̄)
Odds are
p(S | w_1, w_4, w_5, ...) / p(S̄ | w_1, w_4, w_5, ...) = [ p(w_1, w_4, w_5, ... | S) p(S) ] / [ p(w_1, w_4, w_5, ... | S̄) p(S̄) ]
14 / 98
"naive" Bayes odds
Odds
p(S | w_1, w_4, w_5, ...) / p(S̄ | w_1, w_4, w_5, ...) = [ p(w_1, w_4, w_5, ... | S) p(S) ] / [ p(w_1, w_4, w_5, ... | S̄) p(S̄) ]
are approximated by
≈ [ p(w_1 | S) p(w_4 | S) p(w_5 | S) ··· p(w_ℓ | S) p(S) ] / [ p(w_1 | S̄) p(w_4 | S̄) p(w_5 | S̄) ··· p(w_ℓ | S̄) p(S̄) ]
≈ [ (n_1/1000)(n_4/1000)(n_5/1000) ··· (n_ℓ/1000) ] / [ (m_1/1000)(m_4/1000)(m_5/1000) ··· (m_ℓ/1000) ] = (n_1 n_4 n_5 ··· n_ℓ) / (m_1 m_4 m_5 ··· m_ℓ)
where we've assumed words are independent events, p(w_1, w_4, w_5, ... | S) ≈ p(w_1 | S) p(w_4 | S) p(w_5 | S) ··· p(w_ℓ | S), and p(w_i | S) ≈ n_i / |S|, p(w_i | S̄) ≈ m_i / |S̄| (recall n_i and m_i, respectively, counted the number of spam S and non-spam S̄ training messages containing the word w_i)
15 / 98
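A numerical sketch of the word-by-word odds product above; the per-word counts n_i and m_i below are made up for illustration.

```python
# counts of spam and non-spam training messages containing each word,
# out of 1000 messages in each class (hypothetical numbers)
n = {"offer": 180, "click": 150, "unix": 5}    # spam counts n_i
m = {"offer": 20,  "click": 40,  "unix": 100}  # non-spam counts m_i

def spam_odds(words, n, m, total=1000):
    # naive independence assumption with flat prior: odds = product of n_i / m_i
    odds = 1.0
    for w in words:
        odds *= (n[w] / total) / (m[w] / total)   # = n_i / m_i
    return odds

print(spam_odds(["offer", "click"], n, m))  # (180/20) * (150/40) = 33.75 -> likely spam
print(spam_odds(["unix"], n, m))            # 5/100 = 0.05 -> likely non-spam
```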
Outline
1. Recap
2. Linear classifiers
3. > two classes
4. Clustering: Introduction
5. Clustering in IR
6. K-means
16 / 98
Linear classifiers
Linear classifiers compute a linear combination or weighted sum Σ_i w_i x_i of the feature values.
Classification decision: Σ_i w_i x_i > θ? ... where θ (the threshold) is a parameter.
(First, we only consider binary classifiers.)
Geometrically, this corresponds to a line (2D), a plane (3D) or a hyperplane (higher dimensionalities).
Assumption: The classes are linearly separable.
Can find hyperplane (= separator) based on training set.
Methods for finding separator: Perceptron, Rocchio, Naive Bayes – as we will explain on the next slides.
17 / 98
A linear classifier in 1D
A linear classifier in 1D is a point described by the equation w_1 x_1 = θ, i.e., the point at x_1 = θ / w_1.
Points (x_1) with w_1 x_1 ≥ θ are in the class c.
Points (x_1) with w_1 x_1 < θ are in the complement class c̄.
18 / 98
A linear classifier in 2D
A linear classifier in 2D is a line described by the equation w_1 x_1 + w_2 x_2 = θ.
[Figure: example of a 2D linear classifier]
Points (x_1, x_2) with w_1 x_1 + w_2 x_2 ≥ θ are in the class c.
Points (x_1, x_2) with w_1 x_1 + w_2 x_2 < θ are in the complement class c̄.
19 / 98
A linear classifier in 3D
A linear classifier in 3D is a plane described by the equation w_1 x_1 + w_2 x_2 + w_3 x_3 = θ.
[Figure: example of a 3D linear classifier]
Points (x_1, x_2, x_3) with w_1 x_1 + w_2 x_2 + w_3 x_3 ≥ θ are in the class c.
Points (x_1, x_2, x_3) with w_1 x_1 + w_2 x_2 + w_3 x_3 < θ are in the complement class c̄.
20 / 98
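The same decision rule works in any number of dimensions; below is a minimal sketch of the Σ_i w_i x_i ≥ θ test from the previous slides, with illustrative weights and threshold.

```python
def linear_classify(x, w, theta):
    # assign to class c if the weighted sum w·x reaches the threshold θ
    score = sum(wi * xi for wi, xi in zip(w, x))
    return "c" if score >= theta else "not-c"

# 2D example: the line 0.5*x1 + 1.0*x2 = 2 separates the two classes
print(linear_classify([3.0, 1.0], [0.5, 1.0], theta=2.0))  # 2.5 >= 2 -> "c"
print(linear_classify([1.0, 0.5], [0.5, 1.0], theta=2.0))  # 1.0 <  2 -> "not-c"
```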
Rocchio as a linear classifier
Rocchio is a linear classifier defined by:
Σ_{i=1}^{M} w_i x_i = w · x = θ
where the normal vector w = μ(c_1) − μ(c_2) and θ = 0.5 · (|μ(c_1)|² − |μ(c_2)|²).
(follows from the decision boundary |x − μ(c_1)| = |x − μ(c_2)|)
21 / 98
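A sketch of how the Rocchio separator is obtained from the two class centroids, following the formulas above; the 2D training vectors are hypothetical.

```python
def centroid(vectors):
    # component-wise mean of the class's training vectors
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def rocchio_separator(class1_vecs, class2_vecs):
    mu1, mu2 = centroid(class1_vecs), centroid(class2_vecs)
    w = [a - b for a, b in zip(mu1, mu2)]                          # normal vector μ(c1) − μ(c2)
    theta = 0.5 * (sum(a * a for a in mu1) - sum(b * b for b in mu2))  # 0.5 (|μ(c1)|² − |μ(c2)|²)
    return w, theta

def rocchio_classify(x, w, theta):
    return "c1" if sum(wi * xi for wi, xi in zip(w, x)) >= theta else "c2"

# hypothetical 2D training data
c1 = [[2.0, 3.0], [3.0, 3.5]]
c2 = [[0.5, 1.0], [1.0, 0.5]]
w, theta = rocchio_separator(c1, c2)
print(rocchio_classify([2.5, 3.0], w, theta))  # closer to c1's centroid -> "c1"
```

Expanding |x − μ(c_1)|² = |x − μ(c_2)|² cancels the |x|² terms and leaves exactly the linear equation w · x = θ computed above.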
Naive Bayes classifier
x represents the document; what is the probability p(c | x) that the document is in class c?
p(c | x) = p(x | c) p(c) / p(x),    p(c̄ | x) = p(x | c̄) p(c̄) / p(x)
odds: p(c | x) / p(c̄ | x) = [ p(x | c) p(c) ] / [ p(x | c̄) p(c̄) ] ≈ [ p(c) / p(c̄) ] · Π_{1 ≤ k ≤ n_d} p(t_k | c) / p(t_k | c̄)
log odds: log [ p(c | x) / p(c̄ | x) ] = log [ p(c) / p(c̄) ] + Σ_{1 ≤ k ≤ n_d} log [ p(t_k | c) / p(t_k | c̄) ]
22 / 98
Naive Bayes as a linear classifier
Naive Bayes is a linear classifier defined by:
Σ_{i=1}^{M} w_i x_i = θ
where w_i = log [ p(t_i | c) / p(t_i | c̄) ], x_i = number of occurrences of t_i in d, and θ = − log [ p(c) / p(c̄) ].
(the index i, 1 ≤ i ≤ M, refers to terms of the vocabulary)
Linear in log space
23 / 98
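A sketch of Naive Bayes in this linear, log-space form; the term probabilities and priors below are made up, and a real implementation would estimate them from training counts with smoothing.

```python
import math

# hypothetical smoothed term probabilities for class c and its complement
p_t_c    = {"china": 0.20, "beijing": 0.15, "london": 0.01}
p_t_notc = {"china": 0.02, "beijing": 0.01, "london": 0.20}
p_c, p_notc = 0.3, 0.7

# linear-classifier form: weights w_i and threshold θ in log space
w = {t: math.log(p_t_c[t] / p_t_notc[t]) for t in p_t_c}
theta = -math.log(p_c / p_notc)

def nb_classify(term_counts):
    # x_i = number of occurrences of term t_i in the document
    score = sum(w[t] * x for t, x in term_counts.items() if t in w)
    return "c" if score >= theta else "not-c"

print(nb_classify({"china": 2, "beijing": 1}))  # strongly positive score -> "c"
```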
kNN is not a linear classifier
[Figure: two classes (x and ⋄) with a new document ⋆; the kNN decision boundaries between the classes are shown]
Classification decision based on majority of k nearest neighbors.
The decision boundaries between classes are piecewise linear ... but they are not linear classifiers that can be described as Σ_{i=1}^{M} w_i x_i = θ.
24 / 98