kNN classification kNN classification is another vector space classification method. It is also very simple and easy to implement. kNN is more accurate (in most cases) than Naive Bayes and Rocchio. If you need to get a pretty accurate classifier up and running in a short time, and you don't care about efficiency that much, use kNN. 28 / 121
kNN classification kNN = k nearest neighbors kNN classification rule for k = 1 (1NN): Assign each test document to the class of its nearest neighbor in the training set. 1NN is not very robust – one document can be mislabeled or atypical. kNN classification rule for k > 1 (kNN): Assign each test document to the majority class of its k nearest neighbors in the training set. Rationale of kNN: contiguity hypothesis We expect a test document d to have the same label as the training documents located in the local region surrounding d . 29 / 121
Probabilistic kNN Probabilistic version of kNN: P ( c | d ) = fraction of k neighbors of d that are in c kNN classification rule for probabilistic kNN: Assign d to class c with highest P ( c | d ) 30 / 121
kNN is based on Voronoi tessellation [Figure: points of two classes (x and ⋄) in the plane; what is the 1NN, 3NN classification decision for the star ⋆?] 31 / 121
kNN algorithm
Train-kNN(C, D)
1  D′ ← Preprocess(D)
2  k ← Select-k(C, D′)
3  return D′, k

Apply-kNN(D′, k, d)
1  S_k ← ComputeNearestNeighbors(D′, k, d)
2  for each c_j ∈ C(D′)
3  do p_j ← |S_k ∩ c_j| / k
4  return arg max_j p_j
32 / 121
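Below is a minimal Python sketch of Apply-kNN (probabilistic majority vote); the dense vector representation, function names, and the toy data are assumptions for illustration, not part of the slides.

```python
import numpy as np
from collections import Counter

def apply_knn(train_vectors, train_labels, k, d):
    """Classify document vector d by majority vote of its k nearest
    training vectors (Euclidean distance), as in probabilistic kNN."""
    dists = np.linalg.norm(train_vectors - d, axis=1)  # distance to every training doc
    nearest = np.argsort(dists)[:k]                    # indices of the k nearest neighbors
    votes = Counter(train_labels[i] for i in nearest)  # class counts within S_k
    # p_j = |S_k ∩ c_j| / k; we return the class with the highest p_j
    return votes.most_common(1)[0][0]

# Tiny usage example with made-up 2D points
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y = ["circle", "circle", "x", "x"]
print(apply_knn(X, y, k=3, d=np.array([0.8, 0.9])))  # -> "x"
```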
Exercise [Figure: points of two classes (x and o) surrounding a star ⋆] How is the star classified by: (i) 1-NN (ii) 3-NN (iii) 9-NN (iv) 15-NN (v) Rocchio? 33 / 121
Time complexity of kNN
kNN with preprocessing of training set:
training: Θ(|D| L_ave)
testing: Θ(L_a + |D| M_ave M_a) = Θ(|D| M_ave M_a)
kNN test time proportional to the size of the training set! The larger the training set, the longer it takes to classify a test document. kNN is inefficient for very large training sets. 35 / 121
kNN: Discussion No training necessary But linear preprocessing of documents is as expensive as training Naive Bayes. You will always preprocess the training set, so in reality training time of kNN is linear. kNN is very accurate if training set is large. Optimality result: asymptotically zero error if Bayes rate is zero. But kNN can be very inaccurate if training set is small. 36 / 121
Outline  1. Recap  2. Rocchio  3. kNN  4. Linear classifiers  5. > two classes  6. Clustering: Introduction  7. Clustering in IR  8. K-means  37 / 121
Linear classifiers  Linear classifiers compute a linear combination or weighted sum ∑_i w_i x_i of the feature values. Classification decision: ∑_i w_i x_i > θ? ... where θ (the threshold) is a parameter. (First, we only consider binary classifiers.) Geometrically, this corresponds to a line (2D), a plane (3D) or a hyperplane (higher dimensionalities). Assumption: The classes are linearly separable. Can find hyperplane (= separator) based on training set. Methods for finding separator: Perceptron, Rocchio, Naive Bayes, as we will explain on the next slides. 38 / 121
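A hedged sketch of this decision rule in code; the weights, threshold, and three-term vocabulary are invented for illustration.

```python
import numpy as np

def linear_classify(w, theta, x):
    """Binary linear classifier: assign to class c iff the weighted sum
    of feature values exceeds the threshold theta."""
    return np.dot(w, x) > theta

# Hypothetical 3-term vocabulary with hand-picked weights
w = np.array([0.7, 0.5, -0.6])
theta = 0.2
x = np.array([1, 0, 1])              # term counts of a test document
print(linear_classify(w, theta, x))  # 0.7 - 0.6 = 0.1 <= 0.2 -> False
```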
A linear classifier in 1D  A linear classifier in 1D is a point described by the equation w_1 x_1 = θ. The point is at x_1 = θ/w_1. Points (x_1) with w_1 x_1 ≥ θ are in the class c. Points (x_1) with w_1 x_1 < θ are in the complement class c̄. 39 / 121
A linear classifier in 2D  A linear classifier in 2D is a line described by the equation w_1 x_1 + w_2 x_2 = θ. [Figure: example of a 2D linear classifier] Points (x_1, x_2) with w_1 x_1 + w_2 x_2 ≥ θ are in the class c. Points (x_1, x_2) with w_1 x_1 + w_2 x_2 < θ are in the complement class c̄. 40 / 121
A linear classifier in 3D  A linear classifier in 3D is a plane described by the equation w_1 x_1 + w_2 x_2 + w_3 x_3 = θ. [Figure: example of a 3D linear classifier] Points (x_1, x_2, x_3) with w_1 x_1 + w_2 x_2 + w_3 x_3 ≥ θ are in the class c. Points (x_1, x_2, x_3) with w_1 x_1 + w_2 x_2 + w_3 x_3 < θ are in the complement class c̄. 41 / 121
Rocchio as a linear classifier  Rocchio is a linear classifier defined by: ∑_{i=1}^{M} w_i x_i = w⃗ · x⃗ = θ, where the normal vector w⃗ = µ⃗(c_1) − µ⃗(c_2) and θ = 0.5 · (|µ⃗(c_1)|² − |µ⃗(c_2)|²). (This follows from the decision boundary |µ⃗(c_1) − x⃗| = |µ⃗(c_2) − x⃗|.) 42 / 121
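One way to realize this in code: the sketch below (function names and the dense numpy representation are my assumptions) derives w⃗ and θ directly from the two class centroids.

```python
import numpy as np

def rocchio_separator(class1_vectors, class2_vectors):
    """Derive the Rocchio decision hyperplane w·x = theta from training
    vectors of the two classes (rows of dense numpy arrays)."""
    mu1 = class1_vectors.mean(axis=0)   # centroid of class c1
    mu2 = class2_vectors.mean(axis=0)   # centroid of class c2
    w = mu1 - mu2                       # normal vector of the separator
    theta = 0.5 * (np.dot(mu1, mu1) - np.dot(mu2, mu2))
    return w, theta

def rocchio_classify(w, theta, x):
    # x is assigned to c1 iff w·x >= theta, i.e. x is closer to mu1 than to mu2
    return np.dot(w, x) >= theta
```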
Naive Bayes classifier
x⃗ represents a document; what is p(c|x⃗), the probability that the document is in class c?
p(c|x⃗) = p(x⃗|c) p(c) / p(x⃗)    p(c̄|x⃗) = p(x⃗|c̄) p(c̄) / p(x⃗)
odds: p(c|x⃗)/p(c̄|x⃗) = [p(x⃗|c) p(c)] / [p(x⃗|c̄) p(c̄)] ≈ [p(c)/p(c̄)] · ∏_{1≤k≤n_d} p(t_k|c)/p(t_k|c̄)
log odds: log [p(c|x⃗)/p(c̄|x⃗)] = log [p(c)/p(c̄)] + ∑_{1≤k≤n_d} log [p(t_k|c)/p(t_k|c̄)]
43 / 121
Naive Bayes as a linear classifier  Naive Bayes is a linear classifier defined by: ∑_{i=1}^{M} w_i x_i = θ, where w_i = log [p(t_i|c)/p(t_i|c̄)], x_i = number of occurrences of t_i in d, and θ = −log [p(c)/p(c̄)]. (The index i, 1 ≤ i ≤ M, refers to terms of the vocabulary.) Linear in log space. 44 / 121
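A sketch of how such weights could be estimated with add-one smoothing; the data layout (lists of token lists) and helper names are assumptions, not from the slides.

```python
import math
from collections import Counter

def nb_linear_weights(docs_c, docs_not_c, vocab):
    """Turn a smoothed Naive Bayes model into linear-classifier form:
    w_i = log(p(t_i|c)/p(t_i|c̄)), theta = -log(p(c)/p(c̄))."""
    counts_c = Counter(t for d in docs_c for t in d)
    counts_n = Counter(t for d in docs_not_c for t in d)
    total_c = sum(counts_c.values()) + len(vocab)   # add-one smoothing denominator
    total_n = sum(counts_n.values()) + len(vocab)
    w = {t: math.log((counts_c[t] + 1) / total_c)
            - math.log((counts_n[t] + 1) / total_n) for t in vocab}
    theta = -math.log(len(docs_c) / len(docs_not_c))  # -log(p(c)/p(c̄)) via class counts
    return w, theta

def nb_classify(w, theta, doc_tokens):
    # score = sum_i w_i x_i, where x_i is the count of term t_i in the document
    score = sum(w.get(t, 0.0) for t in doc_tokens)
    return score > theta
```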
kNN is not a linear classifier [Figure: points of two classes (x and ⋄) with a star ⋆] Classification decision based on majority of k nearest neighbors. The decision boundaries between classes are piecewise linear ... but they are not linear classifiers that can be described as ∑_{i=1}^{M} w_i x_i = θ. 45 / 121
Example of a linear two-class classifier
t_i          w_i    x_1i  x_2i    t_i     w_i     x_1i  x_2i
prime        0.70   0     1       dlrs    -0.71   1     1
rate         0.67   1     0       world   -0.35   1     0
interest     0.63   0     0       sees    -0.33   0     0
rates        0.60   0     0       year    -0.25   0     0
discount     0.46   1     0       group   -0.24   0     0
bundesbank   0.43   0     0       dlr     -0.24   0     0
This is for the class interest in Reuters-21578. For simplicity we assume a simple 0/1 vector representation. d⃗_1: "rate discount dlrs world", d⃗_2: "prime dlrs".
Exercise: Which class is d⃗_1 assigned to? Which class is d⃗_2 assigned to?
We assign document d⃗_1 "rate discount dlrs world" to interest since w⃗ᵀ · d⃗_1 = 0.67·1 + 0.46·1 + (−0.71)·1 + (−0.35)·1 = 0.07 > 0 = b.
We assign d⃗_2 "prime dlrs" to the complement class (not in interest) since w⃗ᵀ · d⃗_2 = 0.70·1 + (−0.71)·1 = −0.01 ≤ b.
(dlr and world have negative weights because they are indicators for the competing class currency.)
46 / 121
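To check the exercise numerically, a throwaway sketch; the weights and the threshold b = 0 are copied from the table and text above.

```python
# Weights for the Reuters-21578 class "interest" (from the table above)
weights = {"prime": 0.70, "rate": 0.67, "interest": 0.63, "rates": 0.60,
           "discount": 0.46, "bundesbank": 0.43, "dlrs": -0.71, "world": -0.35,
           "sees": -0.33, "year": -0.25, "group": -0.24, "dlr": -0.24}
b = 0.0  # threshold

def score(doc):
    # 0/1 representation: each distinct term contributes its weight once
    return sum(weights.get(t, 0.0) for t in set(doc.split()))

print(score("rate discount dlrs world"))  # ≈ 0.07  -> assigned to interest
print(score("prime dlrs"))                # ≈ -0.01 -> not assigned to interest
```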
Which hyperplane? 47 / 121
Which hyperplane? For linearly separable training sets: there are infinitely many separating hyperplanes. They all separate the training set perfectly . . . . . . but they behave differently on test data. Error rates on new data are low for some, high for others. How do we find a low-error separator? Perceptron: generally bad; Naive Bayes, Rocchio: ok; linear SVM: good 48 / 121
Linear classifiers: Discussion Many common text classifiers are linear classifiers: Naive Bayes, Rocchio, logistic regression, linear support vector machines etc. Each method has a different way of selecting the separating hyperplane Huge differences in performance on test documents Can we get better performance with more powerful nonlinear classifiers? Not in general: A given amount of training data may suffice for estimating a linear boundary, but not for estimating a more complex nonlinear boundary. 49 / 121
A nonlinear problem [Figure: scatter plot of a class with a nonlinear decision boundary] Linear classifier like Rocchio does badly on this task. kNN will do well (assuming enough training data). 50 / 121
A linear problem with noise  Figure 14.10: a hypothetical web page classification scenario: Chinese-only web pages (solid circles) and mixed Chinese-English web pages (squares). The class boundary is linear, except for three noise documents. 51 / 121
Which classifier do I use for a given TC problem? Is there a learning method that is optimal for all text classification problems? No, because there is a tradeoff between bias and variance. Factors to take into account: How much training data is available? How simple/complex is the problem? (linear vs. nonlinear decision boundary) How noisy is the problem? How stable is the problem over time? For an unstable problem, it’s better to use a simple and robust classifier. 52 / 121
Outline  1. Recap  2. Rocchio  3. kNN  4. Linear classifiers  5. > two classes  6. Clustering: Introduction  7. Clustering in IR  8. K-means  53 / 121
How to combine hyperplanes for > 2 classes? [Figure] (e.g.: rank and select top-ranked classes) 54 / 121
One-of problems One-of or multiclass classification Classes are mutually exclusive. Each document belongs to exactly one class. Example: language of a document (assumption: no document contains multiple languages) 55 / 121
One-of classification with linear classifiers Combine two-class linear classifiers as follows for one-of classification: Run each classifier separately Rank classifiers (e.g., according to score) Pick the class with the highest score 56 / 121
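A sketch of this combination scheme; classifiers is assumed to be a mapping from class name to a (w, θ) pair, and ranking by w·x − θ is one plausible choice of score.

```python
import numpy as np

def one_of_classify(classifiers, x):
    """One-of (multiclass) classification with binary linear classifiers:
    run every classifier on x, rank classes by score, pick the top one."""
    scores = {c: np.dot(w, x) - theta for c, (w, theta) in classifiers.items()}
    return max(scores, key=scores.get)  # class with the highest score
```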
Any-of problems Any-of or multilabel classification A document can be a member of 0, 1, or many classes. A decision on one class leaves decisions open on all other classes. A type of “independence” (but not statistical independence) Example: topic classification Usually: make decisions on the region, on the subject area, on the industry and so on “independently” 57 / 121
Any-of classification with linear classifiers Combine two-class linear classifiers as follows for any-of classification: Simply run each two-class classifier separately on the test document and assign document accordingly 58 / 121
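Continuing the same assumed layout, any-of classification keeps every positive decision instead of picking a single argmax.

```python
import numpy as np

def any_of_classify(classifiers, x):
    """Any-of (multilabel) classification: apply every binary classifier
    independently and return all classes whose decision is positive."""
    return [c for c, (w, theta) in classifiers.items() if np.dot(w, x) > theta]
```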
Outline  1. Recap  2. Rocchio  3. kNN  4. Linear classifiers  5. > two classes  6. Clustering: Introduction  7. Clustering in IR  8. K-means  59 / 121
What is clustering? (Document) clustering is the process of grouping a set of documents into clusters of similar documents. Documents within a cluster should be similar. Documents from different clusters should be dissimilar. Clustering is the most common form of unsupervised learning. Unsupervised = there are no labeled or annotated data. 60 / 121
Data set with clear cluster structure [Figure: scatter plot of points forming three well-separated clusters] 61 / 121
Classification vs. Clustering Classification: supervised learning Clustering: unsupervised learning Classification: Classes are human-defined and part of the input to the learning algorithm. Clustering: Clusters are inferred from the data without human input. However, there are many ways of influencing the outcome of clustering: number of clusters, similarity measure, representation of documents, . . . 62 / 121
Outline  1. Recap  2. Rocchio  3. kNN  4. Linear classifiers  5. > two classes  6. Clustering: Introduction  7. Clustering in IR  8. K-means  63 / 121
The cluster hypothesis Cluster hypothesis. Documents in the same cluster behave similarly with respect to relevance to information needs. All applications in IR are based (directly or indirectly) on the cluster hypothesis. 64 / 121
Applications of clustering in IR
Application               | What is clustered?      | Benefit                                                      | Example
Search result clustering  | search results          | more effective information presentation to user             | next slide
Scatter-Gather            | (subsets of) collection | alternative user interface: "search without typing"         | two slides ahead
Collection clustering     | collection              | effective information presentation for exploratory browsing | McKeown et al. 2002, news.google.com
Cluster-based retrieval   | collection              | higher efficiency: faster search                             | Salton 1971
65 / 121
Search result clustering for better navigation  "Jaguar" the cat is not among the top results, but is available via the menu at the left. 66 / 121
Scatter-Gather A collection of news stories is clustered (“scattered”) into eight clusters (top row), user manually gathers three into smaller collection ‘International Stories’ and performs another scattering. Process repeats until a small cluster with relevant documents is found (e.g., Trinidad). 67 / 121
Global navigation: Yahoo 68 / 121
Global navigation: MESH (upper level) 69 / 121
Global navigation: MESH (lower level) 70 / 121
Note: Yahoo/MESH are not examples of clustering. But they are well-known examples of using a global hierarchy for navigation. Some examples of global navigation/exploration based on clustering: Cartia Themescapes, Google News. 71 / 121
Global navigation combined with visualization (1) 72 / 121
Global navigation combined with visualization (2) 73 / 121
Global clustering for navigation: Google News http://news.google.com 74 / 121
Clustering for improving recall  To improve search recall: cluster docs in the collection a priori; when a query matches a doc d, also return other docs in the cluster containing d. Hope: if we do this, the query "car" will also return docs containing "automobile", because clustering groups together docs containing "car" with those containing "automobile": both types of documents contain words like "parts", "dealer", "mercedes", "road trip". 75 / 121
Data set with clear cluster structure [Figure: the same scatter plot with three well-separated clusters] Exercise: Come up with an algorithm for finding the three clusters in this case. 76 / 121
Document representations in clustering Vector space model As in vector space classification, we measure relatedness between vectors by Euclidean distance . . . . . . which is almost equivalent to cosine similarity. Almost: centroids are not length-normalized. For centroids, distance and cosine give different results. 77 / 121
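A tiny numerical illustration of the "almost equivalent" remark, using made-up 2D unit vectors.

```python
import numpy as np

def euclid(a, b):
    return np.linalg.norm(a - b)

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# For length-normalized vectors, |a - b|^2 = 2 - 2 cos(a, b),
# so ranking by Euclidean distance and by cosine similarity agree.
a = np.array([0.6, 0.8])   # unit length
b = np.array([1.0, 0.0])   # unit length
print(euclid(a, b) ** 2, 2 - 2 * cosine(a, b))  # both 0.8

# A centroid is generally not unit length, so the equivalence breaks down there.
```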
Issues in clustering General goal: put related docs in the same cluster, put unrelated docs in different clusters. But how do we formalize this? How many clusters? Initially, we will assume the number of clusters K is given. Often: secondary goals in clustering Example: avoid very small and very large clusters Flat vs. hierarchical clustering Hard vs. soft clustering 78 / 121
Flat vs. Hierarchical clustering Flat algorithms Usually start with a random (partial) partitioning of docs into groups Refine iteratively Main algorithm: K -means Hierarchical algorithms Create a hierarchy Bottom-up, agglomerative Top-down, divisive 79 / 121
Hard vs. Soft clustering  Hard clustering: Each document belongs to exactly one cluster. More common and easier to do. Soft clustering: A document can belong to more than one cluster. Makes more sense for applications like creating browsable hierarchies. You may want to put a pair of sneakers in two clusters: sports apparel and shoes. You can only do that with a soft clustering approach. For soft clustering, see course text: 16.5, 18. Today: flat, hard clustering. Next time: hierarchical, hard clustering. 80 / 121
Flat algorithms Flat algorithms compute a partition of N documents into a set of K clusters. Given: a set of documents and the number K Find: a partition in K clusters that optimizes the chosen partitioning criterion Global optimization: exhaustively enumerate partitions, pick optimal one Not tractable Effective heuristic method: K -means algorithm 81 / 121
Outline  1. Recap  2. Rocchio  3. kNN  4. Linear classifiers  5. > two classes  6. Clustering: Introduction  7. Clustering in IR  8. K-means  82 / 121
K -means Perhaps the best known clustering algorithm Simple, works well in many cases Use as default / baseline for clustering documents 83 / 121
K-means  Each cluster in K-means is defined by a centroid. Objective/partitioning criterion: minimize the average squared difference from the centroid. Recall the definition of the centroid: µ⃗(ω) = (1/|ω|) ∑_{x⃗ ∈ ω} x⃗, where we use ω to denote a cluster. We try to find the minimum average squared difference by iterating two steps: reassignment: assign each vector to its closest centroid; recomputation: recompute each centroid as the average of the vectors that were assigned to it in reassignment. 84 / 121
K-means algorithm
K-means({x⃗_1, ..., x⃗_N}, K)
1  (s⃗_1, s⃗_2, ..., s⃗_K) ← SelectRandomSeeds({x⃗_1, ..., x⃗_N}, K)
2  for k ← 1 to K
3  do µ⃗_k ← s⃗_k
4  while stopping criterion has not been met
5  do for k ← 1 to K
6     do ω_k ← {}
7     for n ← 1 to N
8     do j ← arg min_{j′} |µ⃗_{j′} − x⃗_n|
9        ω_j ← ω_j ∪ {x⃗_n}   (reassignment of vectors)
10    for k ← 1 to K
11    do µ⃗_k ← (1/|ω_k|) ∑_{x⃗ ∈ ω_k} x⃗   (recomputation of centroids)
12 return {µ⃗_1, ..., µ⃗_K}
85 / 121
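A compact numpy sketch of this algorithm; the fixed iteration count stands in for the unspecified stopping criterion, and the empty-cluster guard is my own addition.

```python
import numpy as np

def kmeans(X, K, iters=100, seed=0):
    """Plain K-means on the rows of X, following the pseudocode above."""
    rng = np.random.default_rng(seed)
    # SelectRandomSeeds: pick K distinct training vectors as initial centroids
    centroids = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    for _ in range(iters):
        # reassignment: index of the closest centroid for every vector
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # recomputation: each centroid becomes the mean of its assigned vectors
        for k in range(K):
            members = X[assign == k]
            if len(members) > 0:      # guard against empty clusters
                centroids[k] = members.mean(axis=0)
    return centroids, assign
```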
Set of points to be clustered [figure] 86 / 121
Random selection of initial cluster centers (k = 2 means) [figure] Centroids after convergence? 87 / 121
Assign points to closest centroid [figure] 88 / 121
Assignment [figure] 89 / 121
Recompute cluster centroids [figure] 90 / 121
Assign points to closest centroid [figure] 91 / 121
Assignment [figure] 92 / 121
Recompute cluster centroids [figure] 93 / 121
Assign points to closest centroid [figure] 94 / 121
Assignment [figure] 95 / 121
Recompute cluster centroids [figure] 96 / 121
Assign points to closest centroid [figure] 97 / 121
Assignment [figure] 98 / 121
Recompute cluster centroids [figure] 99 / 121
Assign points to closest centroid [figure] 100 / 121