INFO 4300 / CS4300 Information Retrieval, slides adapted from Hinrich Schütze's, linked from http://informationretrieval.org/ IR 20/26: Linear Classifiers and Flat Clustering Paul Ginsparg Cornell University, Ithaca, NY 10 Nov 2009 1 / 92
Discussion 6, 12 Nov For this class, read and be prepared to discuss the following: Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters. USENIX OSDI '04, 2004. http://www.usenix.org/events/osdi04/tech/full_papers/dean/dean.pdf See also (Jan 2009): http://michaelnielsen.org/blog/write-your-first-mapreduce-program-in-20-minutes/ part of lectures on the "Google technology stack": http://michaelnielsen.org/blog/lecture-course-the-google-technology-stack/ (including PageRank, etc.) 2 / 92
Overview 1. Recap 2. Linear classifiers 3. > two classes 4. Clustering: Introduction 5. Clustering in IR 6. K-means 3 / 92
Outline 1. Recap 2. Linear classifiers 3. > two classes 4. Clustering: Introduction 5. Clustering in IR 6. K-means 4 / 92
Poisson Distribution
Bernoulli process with $N$ trials, each with probability $p$ of success: the probability of $m$ successes is
$$p(m) = \binom{N}{m} p^m (1-p)^{N-m}.$$
In the limit of $N$ very large and $p$ small, this is parametrized by just $\mu = Np$ ($\mu$ = mean number of successes).
For $N \gg m$, we have $\frac{N!}{(N-m)!} = N(N-1)\cdots(N-m+1) \approx N^m$, so $\binom{N}{m} = \frac{N!}{m!(N-m)!} \approx \frac{N^m}{m!}$, and
$$p(m) \approx \lim_{N\to\infty} \frac{N^m}{m!}\left(\frac{\mu}{N}\right)^m \left(1-\frac{\mu}{N}\right)^{N-m} = \frac{e^{-\mu}\mu^m}{m!}$$
(ignoring $(1-\mu/N)^{-m}$, since by assumption $N \gg \mu m$). The $N$ dependence drops out for $N \to \infty$, with the average $\mu$ fixed ($p \to 0$).
The form $p(m) = e^{-\mu}\mu^m/m!$ is known as a Poisson distribution (properly normalized: $\sum_{m=0}^{\infty} p(m) = e^{-\mu}\sum_{m=0}^{\infty}\frac{\mu^m}{m!} = e^{-\mu}\cdot e^{\mu} = 1$). 5 / 92
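As a numerical check (not part of the original slides), a minimal Python sketch comparing the exact binomial probability with its Poisson limit while holding $\mu = Np$ fixed and letting $N$ grow:

```python
from math import comb, exp, factorial

def binomial_pmf(m, N, p):
    # exact Bernoulli-process probability of m successes in N trials
    return comb(N, m) * p**m * (1 - p)**(N - m)

def poisson_pmf(m, mu):
    # limiting form p(m) = e^{-mu} mu^m / m!
    return exp(-mu) * mu**m / factorial(m)

mu, m = 10, 10
for N in (100, 1000, 10000):
    p = mu / N  # keep mu = N*p fixed as N grows
    print(N, binomial_pmf(m, N, p), poisson_pmf(m, mu))
# the binomial values approach the Poisson value as N increases
```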
Poisson Distribution for $\mu = 10$: $p(m) = e^{-10}\,10^m/m!$ [plot of $p(m)$ for $m = 0$ to $30$] Compare to power law $p(m) \propto 1/m^{2.1}$. 6 / 92
Classes in the vector space [figure: training documents from the classes UK (⋄), China, and Kenya (x), plus a new document ⋆] Should the document ⋆ be assigned to China, UK or Kenya? Find separators between the classes. Based on these separators: ⋆ should be assigned to China. How do we find separators that do a good job at classifying new documents like ⋆? 7 / 92
Rocchio illustrated: $a_1 = a_2$, $b_1 = b_2$, $c_1 = c_2$ [figure: documents and class centroids for UK, China and Kenya; each point on a decision boundary is equidistant from the two nearest class centroids] 8 / 92
kNN classification kNN classification is another vector space classification method. It is also very simple and easy to implement. kNN is more accurate (in most cases) than Naive Bayes and Rocchio. If you need to get a pretty accurate classifier up and running in a short time, and you don't care about efficiency that much, use kNN. 9 / 92
kNN is based on Voronoi tessellation [figure: training points from two classes (x and ⋄) with a test point ⋆; what is the 1NN and 3NN classification decision for ⋆?] 10 / 92
Exercise [figure: training points from two classes (x and o) with a test point ⋆] How is ⋆ classified by: (i) 1-NN (ii) 3-NN (iii) 9-NN (iv) 15-NN (v) Rocchio? 11 / 92
kNN: Discussion No training necessary But linear preprocessing of documents is as expensive as training Naive Bayes. You will always preprocess the training set, so in reality training time of kNN is linear. kNN is very accurate if training set is large. Optimality result: asymptotically zero error if Bayes rate is zero. But kNN can be very inaccurate if training set is small. 12 / 92
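A minimal kNN sketch (illustrative only, not from the slides), classifying a document vector by majority vote among its k nearest training vectors under Euclidean distance:

```python
import numpy as np
from collections import Counter

def knn_classify(x, X_train, y_train, k=3):
    # distances from the test vector to every training vector
    dists = np.linalg.norm(X_train - x, axis=1)
    # indices of the k nearest neighbors
    nearest = np.argsort(dists)[:k]
    # majority vote among their labels
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]

# toy usage: two classes in 2D
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = ["china", "china", "uk", "uk"]
print(knn_classify(np.array([0.2, 0.1]), X_train, y_train, k=3))  # "china"
```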
Outline 1. Recap 2. Linear classifiers 3. > two classes 4. Clustering: Introduction 5. Clustering in IR 6. K-means 13 / 92
Linear classifiers Linear classifiers compute a linear combination or weighted sum $\sum_i w_i x_i$ of the feature values. Classification decision: $\sum_i w_i x_i > \theta$? ... where $\theta$ (the threshold) is a parameter. (First, we only consider binary classifiers.) Geometrically, this corresponds to a line (2D), a plane (3D) or a hyperplane (higher dimensionalities). Assumption: the classes are linearly separable. Can find hyperplane (= separator) based on training set. Methods for finding separator: Perceptron, Rocchio, Naive Bayes – as we will explain on the next slides. 14 / 92
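A minimal sketch of the binary decision rule (illustrative, not from the slides): given a learned weight vector w and threshold θ, a document vector x is assigned to class c exactly when the weighted sum exceeds θ. The weights and threshold below are made-up values.

```python
import numpy as np

def linear_classify(x, w, theta):
    # assign to class c iff the weighted sum of features exceeds the threshold
    return np.dot(w, x) > theta

# hypothetical weights and threshold for a two-feature example
w = np.array([0.7, -0.4])
theta = 0.1
print(linear_classify(np.array([1.0, 0.0]), w, theta))  # True  -> class c
print(linear_classify(np.array([0.0, 1.0]), w, theta))  # False -> complement class
```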
A linear classifier in 1D A linear classifier in 1D is a point described by the equation $w_1 x_1 = \theta$, i.e., the point at $x_1 = \theta/w_1$. Points $(x_1)$ with $w_1 x_1 \ge \theta$ are in the class $c$. Points $(x_1)$ with $w_1 x_1 < \theta$ are in the complement class $\bar{c}$. 15 / 92
A linear classifier in 2D A linear classifier in 2D is a line described by the equation $w_1 x_1 + w_2 x_2 = \theta$ [example of a 2D linear classifier]. Points $(x_1, x_2)$ with $w_1 x_1 + w_2 x_2 \ge \theta$ are in the class $c$. Points $(x_1, x_2)$ with $w_1 x_1 + w_2 x_2 < \theta$ are in the complement class $\bar{c}$. 16 / 92
A linear classifier in 3D A linear classifier in 3D is a plane described by the equation $w_1 x_1 + w_2 x_2 + w_3 x_3 = \theta$ [example of a 3D linear classifier]. Points $(x_1, x_2, x_3)$ with $w_1 x_1 + w_2 x_2 + w_3 x_3 \ge \theta$ are in the class $c$. Points $(x_1, x_2, x_3)$ with $w_1 x_1 + w_2 x_2 + w_3 x_3 < \theta$ are in the complement class $\bar{c}$. 17 / 92
Rocchio as a linear classifier Rocchio is a linear classifier defined by
$$\sum_{i=1}^{M} w_i x_i = \vec{w} \cdot \vec{x} = \theta,$$
where the normal vector $\vec{w} = \vec{\mu}(c_1) - \vec{\mu}(c_2)$ and $\theta = 0.5\,\big(|\vec{\mu}(c_1)|^2 - |\vec{\mu}(c_2)|^2\big)$. (This follows from the decision boundary $|\vec{\mu}(c_1) - \vec{x}| = |\vec{\mu}(c_2) - \vec{x}|$.) 18 / 92
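A minimal sketch of this construction (illustrative, not from the slides): compute the two class centroids, take their difference as the normal vector, and set the threshold from the squared centroid norms; the toy vectors below are made up.

```python
import numpy as np

def train_rocchio(X1, X2):
    # centroids of the two classes
    mu1 = X1.mean(axis=0)
    mu2 = X2.mean(axis=0)
    # normal vector and threshold of the separating hyperplane
    w = mu1 - mu2
    theta = 0.5 * (np.dot(mu1, mu1) - np.dot(mu2, mu2))
    return w, theta

def rocchio_classify(x, w, theta):
    # class c1 iff w.x >= theta (equivalently, x is closer to mu1 than to mu2)
    return np.dot(w, x) >= theta

X1 = np.array([[1.0, 1.0], [1.2, 0.8]])   # toy documents of class c1
X2 = np.array([[0.0, 0.0], [0.2, 0.1]])   # toy documents of class c2
w, theta = train_rocchio(X1, X2)
print(rocchio_classify(np.array([0.9, 0.9]), w, theta))  # True: closer to the c1 centroid
```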
Naive Bayes classifier (just like BIM, see lecture 13)
$\vec{x}$ represents a document; what is $p(c\,|\,\vec{x})$, the probability that the document is in class $c$?
$$p(c\,|\,\vec{x}) = \frac{p(\vec{x}\,|\,c)\,p(c)}{p(\vec{x})}, \qquad p(\bar{c}\,|\,\vec{x}) = \frac{p(\vec{x}\,|\,\bar{c})\,p(\bar{c})}{p(\vec{x})}$$
odds: $\dfrac{p(c\,|\,\vec{x})}{p(\bar{c}\,|\,\vec{x})} = \dfrac{p(\vec{x}\,|\,c)\,p(c)}{p(\vec{x}\,|\,\bar{c})\,p(\bar{c})} \approx \dfrac{p(c)}{p(\bar{c})} \prod_{1 \le k \le n_d} \dfrac{p(t_k\,|\,c)}{p(t_k\,|\,\bar{c})}$
log odds: $\log \dfrac{p(c\,|\,\vec{x})}{p(\bar{c}\,|\,\vec{x})} = \log \dfrac{p(c)}{p(\bar{c})} + \sum_{1 \le k \le n_d} \log \dfrac{p(t_k\,|\,c)}{p(t_k\,|\,\bar{c})}$
19 / 92
Naive Bayes as a linear classifier Naive Bayes is a linear classifier defined by
$$\sum_{i=1}^{M} w_i x_i = \theta,$$
where $w_i = \log\big(p(t_i\,|\,c)/p(t_i\,|\,\bar{c})\big)$, $x_i$ = number of occurrences of $t_i$ in $d$, and $\theta = -\log\big(p(c)/p(\bar{c})\big)$. (The index $i$, $1 \le i \le M$, refers to terms of the vocabulary.) Linear in log space. 20 / 92
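A minimal sketch of this correspondence (illustrative; it assumes the term probabilities have already been estimated, e.g. with smoothing, and uses made-up values): the per-term weights are log likelihood ratios and the threshold is the negative log prior odds.

```python
import math

def nb_linear_params(p_t_given_c, p_t_given_cbar, p_c):
    # w_i = log( p(t_i|c) / p(t_i|cbar) ),  theta = -log( p(c) / p(cbar) )
    w = {t: math.log(p_t_given_c[t] / p_t_given_cbar[t]) for t in p_t_given_c}
    theta = -math.log(p_c / (1.0 - p_c))
    return w, theta

def nb_classify(term_counts, w, theta):
    # assign to class c iff sum_i w_i * x_i > theta (x_i = count of term t_i in d)
    score = sum(w.get(t, 0.0) * n for t, n in term_counts.items())
    return score > theta

# hypothetical estimated probabilities for a two-term vocabulary
p_t_c = {"rate": 0.3, "dlrs": 0.1}
p_t_cbar = {"rate": 0.05, "dlrs": 0.2}
w, theta = nb_linear_params(p_t_c, p_t_cbar, p_c=0.4)
print(nb_classify({"rate": 2, "dlrs": 1}, w, theta))  # True for this toy document
```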
kNN is not a linear classifier [figure: training points from two classes (x and ⋄) with a test point ⋆] Classification decision is based on the majority of the k nearest neighbors. The decision boundaries between classes are piecewise linear ... but they are not linear classifiers that can be described as $\sum_{i=1}^{M} w_i x_i = \theta$. 21 / 92
Example of a linear two-class classifier

t_i          w_i    x_{1i}  x_{2i}  |  t_i     w_i     x_{1i}  x_{2i}
prime        0.70   0       1       |  dlrs    -0.71   1       1
rate         0.67   1       0       |  world   -0.35   1       0
interest     0.63   0       0       |  sees    -0.33   0       0
rates        0.60   0       0       |  year    -0.25   0       0
discount     0.46   1       0       |  group   -0.24   0       0
bundesbank   0.43   0       0       |  dlr     -0.24   0       0

This is for the class interest in Reuters-21578. For simplicity: assume a simple 0/1 vector representation. $\vec{d}_1$: "rate discount dlrs world"; $\vec{d}_2$: "prime dlrs". Exercise: which class is $\vec{d}_1$ assigned to? Which class is $\vec{d}_2$ assigned to?
We assign document $\vec{d}_1$ "rate discount dlrs world" to interest since $\vec{w}^{\,T} \cdot \vec{d}_1 = 0.67 \cdot 1 + 0.46 \cdot 1 + (-0.71) \cdot 1 + (-0.35) \cdot 1 = 0.07 > 0 = b$.
We assign $\vec{d}_2$ "prime dlrs" to the complement class (not in interest) since $\vec{w}^{\,T} \cdot \vec{d}_2 = -0.01 \le b$.
(dlr and world have negative weights because they are indicators for the competing class currency.) 22 / 92
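A minimal sketch reproducing the slide's computation (illustrative only): score each document's 0/1 term vector against the weights above and compare to the threshold b = 0.

```python
weights = {
    "prime": 0.70, "rate": 0.67, "interest": 0.63, "rates": 0.60,
    "discount": 0.46, "bundesbank": 0.43, "dlrs": -0.71, "world": -0.35,
    "sees": -0.33, "year": -0.25, "group": -0.24, "dlr": -0.24,
}

def classify_interest(doc, b=0.0):
    # 0/1 representation: each distinct term contributes its weight once
    score = sum(weights.get(t, 0.0) for t in set(doc.split()))
    return score > b, score

print(classify_interest("rate discount dlrs world"))  # (True, ~0.07)  -> interest
print(classify_interest("prime dlrs"))                # (False, ~-0.01) -> not interest
```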
Which hyperplane? 23 / 92
Which hyperplane? For linearly separable training sets there are infinitely many separating hyperplanes. They all separate the training set perfectly, but they behave differently on test data. Error rates on new data are low for some, high for others. How do we find a low-error separator? Perceptron: generally bad; Naive Bayes, Rocchio: ok; linear SVM: good. 24 / 92
Linear classifiers: Discussion Many common text classifiers are linear classifiers: Naive Bayes, Rocchio, logistic regression, linear support vector machines etc. Each method has a different way of selecting the separating hyperplane Huge differences in performance on test documents Can we get better performance with more powerful nonlinear classifiers? Not in general: A given amount of training data may suffice for estimating a linear boundary, but not for estimating a more complex nonlinear boundary. 25 / 92
A nonlinear problem [figure: two classes in the unit square arranged so that no linear separator classifies them well] A linear classifier like Rocchio does badly on this task. kNN will do well (assuming enough training data). 26 / 92
A linear problem with noise Figure 14.10: hypothetical web page classification scenario: Chinese-only web pages (solid circles) and mixed Chinese-English web pages (squares). Linear class boundary, except for three noise documents. 27 / 92