
INFO 4300 / CS4300 Information Retrieval, slides adapted from Hinrich Schütze's, linked from http://informationretrieval.org/ IR 20/25: Linear Classifiers and Flat Clustering. Paul Ginsparg, Cornell University, Ithaca, NY. 10 Nov 2011. 1 / 121


  1. kNN classification. kNN classification is another vector space classification method. It is also very simple and easy to implement. kNN is more accurate (in most cases) than Naive Bayes and Rocchio. If you need to get a pretty accurate classifier up and running in a short time, and you don't care about efficiency that much, use kNN. 28 / 121

  2. kNN classification kNN = k nearest neighbors kNN classification rule for k = 1 (1NN): Assign each test document to the class of its nearest neighbor in the training set. 1NN is not very robust – one document can be mislabeled or atypical. kNN classification rule for k > 1 (kNN): Assign each test document to the majority class of its k nearest neighbors in the training set. Rationale of kNN: contiguity hypothesis We expect a test document d to have the same label as the training documents located in the local region surrounding d . 29 / 121

  3. Probabilistic kNN Probabilistic version of kNN: P ( c | d ) = fraction of k neighbors of d that are in c kNN classification rule for probabilistic kNN: Assign d to class c with highest P ( c | d ) 30 / 121

  4. kNN is based on Voronoi tessellation. [Figure: training points of two classes (x and ⋄) with a star ⋆ to be classified; what is the 1NN and 3NN classification decision for the star?] 31 / 121

  5. kNN algorithm

Train-kNN(C, D)
  D′ ← Preprocess(D)
  k ← Select-k(C, D′)
  return D′, k

Apply-kNN(D′, k, d)
  S_k ← ComputeNearestNeighbors(D′, k, d)
  for each c_j ∈ C(D′)
    do p_j ← |S_k ∩ c_j| / k
  return arg max_j p_j

32 / 121
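A minimal Python sketch of the Train-kNN / Apply-kNN pseudocode above (the function names, the use of Euclidean distance, and the toy data are my own choices; the slide leaves preprocessing and the distance measure abstract):

```python
import numpy as np

def train_knn(train_vectors, train_labels, k=3):
    """Train-kNN: 'training' just stores the (already preprocessed) vectors and picks k."""
    return np.asarray(train_vectors, dtype=float), list(train_labels), k

def apply_knn(model, d):
    """Apply-kNN: return the class with the highest fraction among the k nearest neighbors."""
    vectors, labels, k = model
    # Euclidean distance from the test vector d to every training vector.
    dists = np.linalg.norm(vectors - np.asarray(d, dtype=float), axis=1)
    nearest = np.argsort(dists)[:k]                 # indices of the k nearest neighbors
    classes = [labels[i] for i in nearest]
    # Probabilistic kNN: P(c|d) = fraction of the k neighbors of d that are in c.
    scores = {c: classes.count(c) / k for c in set(classes)}
    return max(scores, key=scores.get), scores

# Toy 2D example: three training points of class "x", two of class "o".
model = train_knn([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6]], ["x", "x", "x", "o", "o"], k=3)
print(apply_knn(model, [0.5, 0.5]))   # ('x', {'x': 1.0})
```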

  6. Exercise. [Figure: training points of two classes (x and o) with a star ⋆ to be classified.] How is the star classified by: (i) 1-NN (ii) 3-NN (iii) 9-NN (iv) 15-NN (v) Rocchio? 33 / 121

  7. Exercise (repeated). [Figure: same training points as the previous slide.] How is the star classified by: (i) 1-NN (ii) 3-NN (iii) 9-NN (iv) 15-NN (v) Rocchio? 34 / 121

  8. Time complexity of kNN (with preprocessing of the training set): training $\Theta(|D|\, L_{ave})$; testing $\Theta(L_a + |D|\, M_{ave} M_a) = \Theta(|D|\, M_{ave} M_a)$. kNN test time is proportional to the size of the training set! The larger the training set, the longer it takes to classify a test document. kNN is inefficient for very large training sets. 35 / 121

  9. kNN: Discussion No training necessary But linear preprocessing of documents is as expensive as training Naive Bayes. You will always preprocess the training set, so in reality training time of kNN is linear. kNN is very accurate if training set is large. Optimality result: asymptotically zero error if Bayes rate is zero. But kNN can be very inaccurate if training set is small. 36 / 121

  10. Outline: (1) Recap (2) Rocchio (3) kNN (4) Linear classifiers (5) > two classes (6) Clustering: Introduction (7) Clustering in IR (8) K-means. 37 / 121

  11. Linear classifiers. Linear classifiers compute a linear combination or weighted sum $\sum_i w_i x_i$ of the feature values. Classification decision: $\sum_i w_i x_i > \theta$?, where $\theta$ (the threshold) is a parameter. (First, we only consider binary classifiers.) Geometrically, this corresponds to a line (2D), a plane (3D), or a hyperplane (higher dimensionalities). Assumption: the classes are linearly separable. We can find a hyperplane (= separator) based on the training set. Methods for finding the separator: Perceptron, Rocchio, Naive Bayes, as we will explain on the next slides. 38 / 121
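As a concrete illustration of the decision rule, here is a minimal sketch in Python (the weights and threshold in the usage example are made up; in practice they come from a training method such as Rocchio or Naive Bayes):

```python
import numpy as np

def linear_classify(w, theta, x):
    """Assign x to class c iff the weighted sum of feature values exceeds the threshold."""
    return float(np.dot(w, x)) > theta   # sum_i w_i * x_i > theta ?

# Hypothetical 2D separator: the line x1 + x2 = 1 (w = (1, 1), theta = 1).
print(linear_classify([1.0, 1.0], 1.0, [0.8, 0.9]))   # True:  above the line, in class c
print(linear_classify([1.0, 1.0], 1.0, [0.1, 0.2]))   # False: below the line, in the complement class
```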

  12. A linear classifier in 1D is a point described by the equation $w_1 x_1 = \theta$, i.e. the point at $\theta / w_1$. Points $x_1$ with $w_1 x_1 \ge \theta$ are in the class $c$. Points $x_1$ with $w_1 x_1 < \theta$ are in the complement class $\bar{c}$. 39 / 121

  13. A linear classifier in 2D is a line described by the equation $w_1 x_1 + w_2 x_2 = \theta$. [Figure: example of a 2D linear classifier.] Points $(x_1, x_2)$ with $w_1 x_1 + w_2 x_2 \ge \theta$ are in the class $c$. Points $(x_1, x_2)$ with $w_1 x_1 + w_2 x_2 < \theta$ are in the complement class $\bar{c}$. 40 / 121

  14. A linear classifier in 3D is a plane described by the equation $w_1 x_1 + w_2 x_2 + w_3 x_3 = \theta$. [Figure: example of a 3D linear classifier.] Points $(x_1, x_2, x_3)$ with $w_1 x_1 + w_2 x_2 + w_3 x_3 \ge \theta$ are in the class $c$. Points $(x_1, x_2, x_3)$ with $w_1 x_1 + w_2 x_2 + w_3 x_3 < \theta$ are in the complement class $\bar{c}$. 41 / 121

  15. Rocchio as a linear classifier. Rocchio is a linear classifier defined by $\sum_{i=1}^{M} w_i x_i = \vec{w} \cdot \vec{x} = \theta$, where the normal vector is $\vec{w} = \vec{\mu}(c_1) - \vec{\mu}(c_2)$ and $\theta = 0.5 \cdot \big(|\vec{\mu}(c_1)|^2 - |\vec{\mu}(c_2)|^2\big)$. (This follows from the decision boundary $|\vec{\mu}(c_1) - \vec{x}| = |\vec{\mu}(c_2) - \vec{x}|$.) 42 / 121
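A small sketch of how the Rocchio separator could be computed from the two class centroids, following the formulas on this slide (function names and the NumPy representation are assumptions):

```python
import numpy as np

def rocchio_separator(class1_docs, class2_docs):
    """Derive w = mu(c1) - mu(c2) and theta = 0.5 * (|mu(c1)|^2 - |mu(c2)|^2)."""
    mu1 = np.mean(np.asarray(class1_docs, dtype=float), axis=0)
    mu2 = np.mean(np.asarray(class2_docs, dtype=float), axis=0)
    w = mu1 - mu2
    theta = 0.5 * (np.dot(mu1, mu1) - np.dot(mu2, mu2))
    return w, theta

def rocchio_classify(w, theta, x):
    # x is assigned to c1 iff it is at least as close to mu(c1) as to mu(c2),
    # which is equivalent to w . x >= theta.
    return float(np.dot(w, x)) >= theta
```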

  16. Naive Bayes classifier. $\vec{x}$ represents a document; what is $p(c \mid \vec{x})$, the probability that the document is in class $c$?
$p(c \mid \vec{x}) = \dfrac{p(\vec{x} \mid c)\, p(c)}{p(\vec{x})}$, $\quad p(\bar{c} \mid \vec{x}) = \dfrac{p(\vec{x} \mid \bar{c})\, p(\bar{c})}{p(\vec{x})}$
odds: $\dfrac{p(c \mid \vec{x})}{p(\bar{c} \mid \vec{x})} = \dfrac{p(\vec{x} \mid c)\, p(c)}{p(\vec{x} \mid \bar{c})\, p(\bar{c})} \approx \dfrac{p(c)}{p(\bar{c})} \prod_{1 \le k \le n_d} \dfrac{p(t_k \mid c)}{p(t_k \mid \bar{c})}$
log odds: $\log \dfrac{p(c \mid \vec{x})}{p(\bar{c} \mid \vec{x})} = \log \dfrac{p(c)}{p(\bar{c})} + \sum_{1 \le k \le n_d} \log \dfrac{p(t_k \mid c)}{p(t_k \mid \bar{c})}$
43 / 121

  17. Naive Bayes as a linear classifier. Naive Bayes is a linear classifier defined by $\sum_{i=1}^{M} w_i x_i = \theta$, where $w_i = \log\!\big(p(t_i \mid c) / p(t_i \mid \bar{c})\big)$, $x_i$ = number of occurrences of $t_i$ in $d$, and $\theta = -\log\!\big(p(c) / p(\bar{c})\big)$. (The index $i$, $1 \le i \le M$, refers to terms of the vocabulary.) Linear in log space. 44 / 121
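A hedged sketch of turning (already smoothed) Naive Bayes estimates into the weights and threshold of this linear classifier (the dictionary-based representation and names are my own; the two-class case with $p(\bar c) = 1 - p(c)$ is assumed):

```python
import math

def nb_linear_parameters(p_t_given_c, p_t_given_cbar, p_c):
    """Turn Naive Bayes estimates into linear-classifier parameters (log space).

    p_t_given_c / p_t_given_cbar: dicts term -> P(t|c) and P(t|c-bar), assumed smoothed;
    p_c: prior P(c); in the two-class case P(c-bar) = 1 - P(c).
    """
    w = {t: math.log(p_t_given_c[t] / p_t_given_cbar[t]) for t in p_t_given_c}
    theta = -math.log(p_c / (1.0 - p_c))
    return w, theta

def nb_classify(w, theta, term_counts):
    """term_counts: dict term -> number of occurrences of the term in document d."""
    score = sum(w.get(t, 0.0) * n for t, n in term_counts.items())
    return score > theta   # equivalent to 'log odds > 0'
```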

  18. kNN is not a linear classifier. [Figure: two classes (x and ⋄) with a star ⋆; the classification decision is based on the majority of the k nearest neighbors.] The decision boundaries between classes are piecewise linear, but they are not linear classifiers that can be described as $\sum_{i=1}^{M} w_i x_i = \theta$. 45 / 121

  19. Example of a linear two-class classifier. This is for the class interest in Reuters-21578.

| $t_i$ | $w_i$ | $x_{1i}$ | $x_{2i}$ |
|---|---|---|---|
| prime | 0.70 | 0 | 1 |
| rate | 0.67 | 1 | 0 |
| interest | 0.63 | 0 | 0 |
| rates | 0.60 | 0 | 0 |
| discount | 0.46 | 1 | 0 |
| bundesbank | 0.43 | 0 | 0 |
| dlrs | -0.71 | 1 | 1 |
| world | -0.35 | 1 | 0 |
| sees | -0.33 | 0 | 0 |
| year | -0.25 | 0 | 0 |
| group | -0.24 | 0 | 0 |
| dlr | -0.24 | 0 | 0 |

For simplicity, assume a simple 0/1 vector representation. $\vec{d}_1$: "rate discount dlrs world"; $\vec{d}_2$: "prime dlrs". Exercise: which class is $\vec{d}_1$ assigned to? Which class is $\vec{d}_2$ assigned to? We assign document $\vec{d}_1$ "rate discount dlrs world" to interest since $\vec{w}^T \cdot \vec{d}_1 = 0.67 \cdot 1 + 0.46 \cdot 1 + (-0.71) \cdot 1 + (-0.35) \cdot 1 = 0.07 > 0 = b$. We assign $\vec{d}_2$ "prime dlrs" to the complement class (not in interest) since $\vec{w}^T \cdot \vec{d}_2 = -0.01 \le b$. (dlr and world have negative weights because they are indicators for the competing class currency.) 46 / 121
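The two dot products on this slide can be reproduced with a few lines of Python (the weights and the threshold b = 0 are taken from the slide; the helper function is only illustrative):

```python
# Weights for the Reuters-21578 class "interest" (values copied from the table above).
w = {"prime": 0.70, "rate": 0.67, "interest": 0.63, "rates": 0.60,
     "discount": 0.46, "bundesbank": 0.43,
     "dlrs": -0.71, "world": -0.35, "sees": -0.33,
     "year": -0.25, "group": -0.24, "dlr": -0.24}
b = 0.0   # threshold used on the slide

def score(doc_terms):
    # 0/1 vector representation: each distinct term contributes its weight once.
    return sum(w.get(t, 0.0) for t in set(doc_terms))

d1 = "rate discount dlrs world".split()
d2 = "prime dlrs".split()
print(score(d1))   # 0.67 + 0.46 - 0.71 - 0.35 =  0.07 > b  -> assigned to "interest"
print(score(d2))   # 0.70 - 0.71               = -0.01 <= b -> not in "interest"
```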

  20. Which hyperplane? 47 / 121

  21. Which hyperplane? For linearly separable training sets there are infinitely many separating hyperplanes. They all separate the training set perfectly, but they behave differently on test data: error rates on new data are low for some, high for others. How do we find a low-error separator? Perceptron: generally bad; Naive Bayes, Rocchio: ok; linear SVM: good. 48 / 121

  22. Linear classifiers: Discussion Many common text classifiers are linear classifiers: Naive Bayes, Rocchio, logistic regression, linear support vector machines etc. Each method has a different way of selecting the separating hyperplane Huge differences in performance on test documents Can we get better performance with more powerful nonlinear classifiers? Not in general: A given amount of training data may suffice for estimating a linear boundary, but not for estimating a more complex nonlinear boundary. 49 / 121

  23. A nonlinear problem. [Figure: a two-class data set in the unit square that is not linearly separable.] A linear classifier like Rocchio does badly on this task. kNN will do well (assuming enough training data). 50 / 121

  24. A linear problem with noise. Figure 14.10: a hypothetical web page classification scenario: Chinese-only web pages (solid circles) and mixed Chinese-English web pages (squares). The class boundary is linear, except for three noise documents. 51 / 121

  25. Which classifier do I use for a given TC problem? Is there a learning method that is optimal for all text classification problems? No, because there is a tradeoff between bias and variance. Factors to take into account: How much training data is available? How simple/complex is the problem? (linear vs. nonlinear decision boundary) How noisy is the problem? How stable is the problem over time? For an unstable problem, it’s better to use a simple and robust classifier. 52 / 121

  26. Outline: (1) Recap (2) Rocchio (3) kNN (4) Linear classifiers (5) > two classes (6) Clustering: Introduction (7) Clustering in IR (8) K-means. 53 / 121

  27. How to combine hyperplanes for > 2 classes? (e.g.: rank and select top-ranked classes) 54 / 121

  28. One-of problems One-of or multiclass classification Classes are mutually exclusive. Each document belongs to exactly one class. Example: language of a document (assumption: no document contains multiple languages) 55 / 121

  29. One-of classification with linear classifiers Combine two-class linear classifiers as follows for one-of classification: Run each classifier separately Rank classifiers (e.g., according to score) Pick the class with the highest score 56 / 121
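A minimal sketch of this one-of scheme; using the signed margin $\vec w \cdot \vec x - \theta$ as the ranking score is an assumption, since the slide only says to rank "according to score":

```python
def one_of_classify(classifiers, x):
    """One-of: run every binary linear classifier and return the single best class.

    classifiers: dict class -> (w, theta); the score used for ranking is w . x - theta.
    """
    scores = {c: sum(wi * xi for wi, xi in zip(w, x)) - theta
              for c, (w, theta) in classifiers.items()}
    return max(scores, key=scores.get)   # exactly one class
```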

  30. Any-of problems Any-of or multilabel classification A document can be a member of 0, 1, or many classes. A decision on one class leaves decisions open on all other classes. A type of “independence” (but not statistical independence) Example: topic classification Usually: make decisions on the region, on the subject area, on the industry and so on “independently” 57 / 121

  31. Any-of classification with linear classifiers Combine two-class linear classifiers as follows for any-of classification: Simply run each two-class classifier separately on the test document and assign document accordingly 58 / 121
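For contrast, an any-of sketch under the same assumed representation: each binary classifier decides independently, so a document can end up in zero, one, or many classes:

```python
def any_of_classify(classifiers, x):
    """Any-of: each binary classifier decides independently; return all accepting classes."""
    return [c for c, (w, theta) in classifiers.items()
            if sum(wi * xi for wi, xi in zip(w, x)) > theta]   # 0, 1, or many classes
```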

  32. Outline: (1) Recap (2) Rocchio (3) kNN (4) Linear classifiers (5) > two classes (6) Clustering: Introduction (7) Clustering in IR (8) K-means. 59 / 121

  33. What is clustering? (Document) clustering is the process of grouping a set of documents into clusters of similar documents. Documents within a cluster should be similar. Documents from different clusters should be dissimilar. Clustering is the most common form of unsupervised learning. Unsupervised = there are no labeled or annotated data. 60 / 121

  34. Data set with clear cluster structure. [Figure: a 2D scatter plot with three well-separated clusters of points.] 61 / 121

  35. Classification vs. Clustering Classification: supervised learning Clustering: unsupervised learning Classification: Classes are human-defined and part of the input to the learning algorithm. Clustering: Clusters are inferred from the data without human input. However, there are many ways of influencing the outcome of clustering: number of clusters, similarity measure, representation of documents, . . . 62 / 121

  36. Outline: (1) Recap (2) Rocchio (3) kNN (4) Linear classifiers (5) > two classes (6) Clustering: Introduction (7) Clustering in IR (8) K-means. 63 / 121

  37. The cluster hypothesis Cluster hypothesis. Documents in the same cluster behave similarly with respect to relevance to information needs. All applications in IR are based (directly or indirectly) on the cluster hypothesis. 64 / 121

  38. Applications of clustering in IR

| Application | What is clustered? | Benefit | Example |
|---|---|---|---|
| Search result clustering | search results | more effective information presentation to user | next slide |
| Scatter-Gather | (subsets of) collection | alternative user interface: "search without typing" | two slides ahead |
| Collection clustering | collection | effective information presentation for exploratory browsing | McKeown et al. 2002, news.google.com |
| Cluster-based retrieval | collection | higher efficiency: faster search | Salton 1971 |

65 / 121

  39. Search result clustering for better navigation Jaguar the cat not among top results, but available via menu at left 66 / 121

  40. Scatter-Gather. A collection of news stories is clustered ("scattered") into eight clusters (top row). The user manually gathers three of them into a smaller collection 'International Stories' and performs another scattering. The process repeats until a small cluster with relevant documents is found (e.g., Trinidad). 67 / 121

  41. Global navigation: Yahoo 68 / 121

  42. Global navigation: MESH (upper level) 69 / 121

  43. Global navigation: MESH (lower level) 70 / 121

  44. Note: Yahoo/MESH are not examples of clustering, but they are well-known examples of using a global hierarchy for navigation. Some examples of global navigation/exploration based on clustering: Cartia Themescapes, Google News. 71 / 121

  45. Global navigation combined with visualization (1) 72 / 121

  46. Global navigation combined with visualization (2) 73 / 121

  47. Global clustering for navigation: Google News http://news.google.com 74 / 121

  48. Clustering for improving recall. To improve search recall: cluster docs in the collection a priori; when a query matches a doc d, also return other docs in the cluster containing d. Hope: if we do this, the query "car" will also return docs containing "automobile", because clustering groups together docs containing "car" with those containing "automobile" (both types of documents contain words like "parts", "dealer", "mercedes", "road trip"). 75 / 121
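A minimal sketch of this recall-improving lookup (the data structures doc_to_cluster and cluster_to_docs are hypothetical names for a clustering computed a priori, before any query is seen):

```python
def cluster_expanded_results(matching_docs, doc_to_cluster, cluster_to_docs):
    """Return the docs matching the query plus all other members of their clusters."""
    expanded = set(matching_docs)
    for d in matching_docs:
        expanded.update(cluster_to_docs[doc_to_cluster[d]])   # docs clustered a priori
    return expanded
```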

  49. Data set with clear cluster structure. Exercise: come up with an algorithm for finding the three clusters in this case. [Figure: the same 2D scatter plot with three well-separated clusters.] 76 / 121

  50. Document representations in clustering. Vector space model. As in vector space classification, we measure relatedness between vectors by Euclidean distance, which is almost equivalent to cosine similarity. Almost: centroids are not length-normalized; for centroids, distance and cosine give different results. 77 / 121
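The "almost equivalent" claim can be checked numerically: for length-normalized vectors, $|\vec x - \vec y|^2 = 2(1 - \cos(\vec x, \vec y))$, so ranking by Euclidean distance and ranking by cosine agree. A small sketch (the example vectors are arbitrary):

```python
import numpy as np

x, y = np.array([3.0, 4.0]), np.array([1.0, 1.0])
xn, yn = x / np.linalg.norm(x), y / np.linalg.norm(y)   # length-normalize both vectors

cos = float(np.dot(xn, yn))
eucl_sq = float(np.sum((xn - yn) ** 2))
print(np.isclose(eucl_sq, 2 * (1 - cos)))   # True: |x - y|^2 = 2(1 - cos) for unit vectors
```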

  51. Issues in clustering General goal: put related docs in the same cluster, put unrelated docs in different clusters. But how do we formalize this? How many clusters? Initially, we will assume the number of clusters K is given. Often: secondary goals in clustering Example: avoid very small and very large clusters Flat vs. hierarchical clustering Hard vs. soft clustering 78 / 121

  52. Flat vs. Hierarchical clustering Flat algorithms Usually start with a random (partial) partitioning of docs into groups Refine iteratively Main algorithm: K -means Hierarchical algorithms Create a hierarchy Bottom-up, agglomerative Top-down, divisive 79 / 121

  53. Hard vs. Soft clustering. Hard clustering: each document belongs to exactly one cluster. More common and easier to do. Soft clustering: a document can belong to more than one cluster. Makes more sense for applications like creating browsable hierarchies. You may want to put a pair of sneakers in two clusters: sports apparel; shoes. You can only do that with a soft clustering approach. For soft clustering, see course text: 16.5, 18. Today: flat, hard clustering. Next time: hierarchical, hard clustering. 80 / 121

  54. Flat algorithms Flat algorithms compute a partition of N documents into a set of K clusters. Given: a set of documents and the number K Find: a partition in K clusters that optimizes the chosen partitioning criterion Global optimization: exhaustively enumerate partitions, pick optimal one Not tractable Effective heuristic method: K -means algorithm 81 / 121

  55. Outline: (1) Recap (2) Rocchio (3) kNN (4) Linear classifiers (5) > two classes (6) Clustering: Introduction (7) Clustering in IR (8) K-means. 82 / 121

  56. K -means Perhaps the best known clustering algorithm Simple, works well in many cases Use as default / baseline for clustering documents 83 / 121

  57. K-means. Each cluster in K-means is defined by a centroid. Objective/partitioning criterion: minimize the average squared difference from the centroid. Recall the definition of the centroid: $\vec{\mu}(\omega) = \frac{1}{|\omega|} \sum_{\vec{x} \in \omega} \vec{x}$, where we use $\omega$ to denote a cluster. We try to find the minimum average squared difference by iterating two steps: reassignment: assign each vector to its closest centroid; recomputation: recompute each centroid as the average of the vectors that were assigned to it in reassignment. 84 / 121

  58. K-means algorithm

K-means({x_1, ..., x_N}, K)
  (s_1, ..., s_K) ← SelectRandomSeeds({x_1, ..., x_N}, K)
  for k ← 1 to K
    do μ_k ← s_k
  while stopping criterion has not been met
    do for k ← 1 to K
         do ω_k ← {}
       for n ← 1 to N
         do j ← arg min_{j'} |μ_{j'} − x_n|
            ω_j ← ω_j ∪ {x_n}                  (reassignment of vectors)
       for k ← 1 to K
         do μ_k ← (1/|ω_k|) Σ_{x ∈ ω_k} x      (recomputation of centroids)
  return {μ_1, ..., μ_K}

85 / 121
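A minimal NumPy sketch of the algorithm above (the stopping criterion, seed selection via a fixed RNG, and the empty-cluster guard are my own choices; the pseudocode leaves them open):

```python
import numpy as np

def kmeans(X, K, max_iter=100, tol=1e-6, seed=0):
    """Flat, hard K-means: returns (centroids, cluster assignment per vector)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    mu = X[rng.choice(len(X), size=K, replace=False)]          # SelectRandomSeeds
    for _ in range(max_iter):
        # Reassignment: each vector goes to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Recomputation: each centroid becomes the mean of the vectors assigned to it.
        new_mu = np.array([X[assign == k].mean(axis=0) if np.any(assign == k) else mu[k]
                           for k in range(K)])
        if np.linalg.norm(new_mu - mu) < tol:                  # stopping criterion (one choice)
            mu = new_mu
            break
        mu = new_mu
    return mu, assign

# Example usage on random 2D data: kmeans(np.random.rand(20, 2), K=2)
```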

  59. Set of points to be clustered. [Figure: the unlabeled points.] 86 / 121

  60. Random selection of initial cluster centers (k = 2 means), shown as ×. Where will the centroids be after convergence? 87 / 121

  61. Assign points to closest centroid. [Figure] 88 / 121

  62. Assignment: each point is labeled 1 or 2 according to its nearest centroid. 89 / 121

  63. Recompute cluster centroids: each × moves to the mean of its assigned points. 90 / 121

  64. Assign points to closest centroid. [Figure] 91 / 121

  65. Assignment. [Figure] 92 / 121

  66. Recompute cluster centroids. [Figure] 93 / 121

  67. Assign points to closest centroid. [Figure] 94 / 121

  68. Assignment. [Figure] 95 / 121

  69. Recompute cluster centroids. [Figure] 96 / 121

  70. Assign points to closest centroid. [Figure] 97 / 121

  71. Assignment. [Figure] 98 / 121

  72. Recompute cluster centroids. [Figure] 99 / 121

  73. Assign points to closest centroid. [Figure] 100 / 121
