Clustering CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2017 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)
Ch. 16 What is clustering? Clustering: grouping a set of objects into subsets of similar ones. Docs within a cluster should be similar; docs from different clusters should be dissimilar. Clustering is the commonest form of unsupervised learning: learning from raw data, as opposed to supervised learning, where a labeling of examples is given. A common and important task that finds many applications in IR and other places.
Ch. 16 A data set with clear cluster structure How would you design an algorithm for finding the three clusters in this case?
Applications of clustering in IR
Search result clustering
yippy.com – grouping search results
Clustering the collection Cluster-based navigation is an interesting alternative to keyword searching (i.e., the standard IR paradigm). Users may prefer browsing over searching when they are unsure about which terms to use. Well suited to a collection of news stories: news reading is not really search, but rather a process of selecting a subset of stories about recent events.
Yahoo! Hierarchy www.yahoo.com/Science … (30) [Figure: a directory tree rooted at Science, with categories such as agriculture, biology, physics, CS, space, and subcategories such as dairy, crops, agronomy, botany, evolution, cell, forestry, AI, HCI, courses, craft, missions, magnetism, relativity.] The Yahoo! hierarchy isn't clustering, but it is the kind of output you want from clustering.
Google News: automatic clustering gives an effective news presentation metaphor
To improve efficiency and effectiveness of the search system: Improve language modeling: replace the collection model used for smoothing by a model derived from the doc's cluster. Clustering can speed up search (via an inexact algorithm). Clustering can improve recall.
Sec. 16.1 For improving search recall Cluster hypothesis: docs in the same cluster behave similarly with respect to relevance to information needs. Therefore, to improve search recall: cluster the docs in the corpus a priori; when a query matches a doc d, also return other docs in the cluster containing d. Query car: also return docs containing automobile, because clustering grouped together docs containing car with those containing automobile. Why might this happen?
Sec. 16.2 Issues for clustering Representation for clustering: Doc representation: vector space? Normalization? (Centroids aren't length-normalized.) Need a notion of similarity/distance. How many clusters? Fixed a priori? Completely data-driven? Avoid "trivial" clusters, too large or too small. If too large: for navigation purposes you've wasted an extra user click without whittling down the set of docs much.
Notion of similarity/distance Ideal: semantic similarity. Practical: term-statistical similarity. We will use cosine similarity. For many algorithms, it is easier to think in terms of a distance (rather than a similarity). We will mostly speak of Euclidean distance, but real implementations use cosine similarity.
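The distance-vs-similarity remark can be made concrete. A minimal sketch (the function names are my own, not from the slides): for length-normalized vectors, squared Euclidean distance and cosine similarity are monotonically related, so "nearest centroid by distance" and "nearest centroid by similarity" agree.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two term vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

def euclidean_distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# For unit-length vectors: ||x - y||^2 = 2 * (1 - cos(x, y)),
# so ranking by smallest distance equals ranking by largest similarity.
```

This is why the slides can reason in Euclidean terms while implementations use cosine.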
Clustering algorithms categorization Flat algorithms (e.g., K-means): usually start with a random (partial) partitioning and refine it iteratively. Hierarchical algorithms: bottom-up (agglomerative) or top-down (divisive).
Hard vs. soft clustering Hard clustering: each doc belongs to exactly one cluster; more common and easier to do. Soft clustering: a doc can belong to more than one cluster.
Partitioning algorithms Construct a partition of N docs into K clusters. Given: a set of docs and the number K. Find: a partition of the docs into K clusters that optimizes the chosen partitioning criterion. Finding a global optimum is intractable for many clustering objective functions. Effective heuristic methods: K-means and K-medoids algorithms.
Sec. 16.4 K-means Assumes docs are real-valued vectors x^(1), …, x^(N). Clusters based on centroids (aka the center of gravity or mean) of the points in a cluster:
μ_k = (1/|C_k|) Σ_{x^(i) ∈ C_k} x^(i)
K-means cost function:
J(C) = Σ_{k=1}^{K} Σ_{x^(i) ∈ C_k} ||x^(i) − μ_k||²
where C = {C_1, C_2, …, C_K} and C_k is the set of data points assigned to the k-th cluster.
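The cost function J(C) can be computed directly from a partition. A minimal sketch (the name `kmeans_cost` and the list-of-lists representation are assumptions for illustration):

```python
def kmeans_cost(points, assignments, centroids):
    """J(C): sum of squared distances of each point to its cluster centroid.

    points      -- list of vectors (lists of floats)
    assignments -- assignments[i] is the cluster index of points[i]
    centroids   -- list of centroid vectors, one per cluster
    """
    return sum(
        sum((p - c) ** 2 for p, c in zip(points[i], centroids[k]))
        for i, k in enumerate(assignments)
    )
```

For example, two points [0,0] and [2,0] assigned to a single centroid [1,0] give J = 1 + 1 = 2.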
Sec. 16.4 K-means algorithm Select K random points {μ_1, μ_2, …, μ_K} as the clusters' initial centroids. Until the clustering converges (or another stopping criterion holds):
For each doc x^(i): assign x^(i) to the cluster C_k such that dist(x^(i), μ_k) is minimal.
For each cluster C_k: μ_k = (1/|C_k|) Σ_{x^(i) ∈ C_k} x^(i).
Reassignment of instances to clusters is based on distance to the current cluster centroids (can equivalently be phrased in terms of similarities).
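The two alternating steps can be sketched as follows. This is a minimal illustration, not a production implementation: the function name `kmeans` and the dense-list representation are my own choices, and a real IR system would use sparse tf-idf vectors with cosine similarity.

```python
import random

def kmeans(points, k, max_iters=100, seed=0):
    """Lloyd's K-means on dense vectors; returns (centroids, assignments)."""
    rng = random.Random(seed)
    # Seed selection: K random data points become the initial centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignments = None
    for _ in range(max_iters):
        # Assignment step: each point goes to its nearest centroid
        # (ties broken toward the lower cluster index, as min() does).
        new_assignments = [
            min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            for p in points
        ]
        if new_assignments == assignments:
            break  # partition unchanged: converged
        assignments = new_assignments
        # Update step: recompute each centroid as the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # keep the old centroid if a cluster emptied
                centroids[j] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assignments
```

On well-separated data this recovers the natural grouping regardless of which points are sampled as seeds.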
[Figure from Bishop: successive K-means iterations on a 2-D data set.]
Sec. 16.4 Termination conditions Several possibilities for the termination condition, e.g.: A fixed number of iterations. Doc partition unchanged. J < θ: the cost function falls below a threshold. ΔJ < θ: the decrease in the cost function (between two successive iterations) falls below a threshold.
Sec. 16.4 Convergence of K-means The K-means algorithm always reaches a fixed point in which the clusters don't change. We must use tie-breaking when a sample is at the same distance from two or more centroids (e.g., by assigning it to the lowest-index cluster).
Sec. 16.4 K-means decreases J(C) in each iteration (before convergence) First, the reassignment step monotonically decreases J(C), since each vector is assigned to its closest centroid. Second, the recomputation step monotonically decreases each Σ_{x^(i) ∈ C_l} ||x^(i) − μ_l||², because this sum reaches its minimum at μ_l = (1/|C_l|) Σ_{x^(i) ∈ C_l} x^(i). K-means typically converges quickly.
Sec. 16.4 Time complexity of K-means Computing the distance between two docs: O(M), where M is the dimensionality of the vectors. Reassigning clusters: O(KN) distance computations ⇒ O(KNM). Computing centroids: each doc gets added once to some centroid: O(NM). Assume these two steps are each done once for I iterations: O(IKNM).
Sec. 16.4 Seed choice Results can vary based on random seed selection. Some seed choices yield a poor convergence rate, or convergence to a sub-optimal clustering. Try out multiple starting points. Select good seeds using a heuristic (e.g., the doc least similar to any existing centroid). Initialize with the results of another method. [Example showing sensitivity to seeds, with points A–F: starting with B and E as centroids converges to {A,B,C} and {D,E,F}; starting with D and F converges to {A,B,D,E} and {C,F}.]
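The "try out multiple starting points" remedy can be sketched as a best-of-restarts loop. This is a minimal self-contained illustration on scalar data (the names `lloyd_1d` and `best_of_restarts` are my own); the idea is simply to run K-means from several random seeds and keep the run with the lowest cost J.

```python
import random

def lloyd_1d(xs, k, seed):
    """One K-means run on scalars, seeded from k random data points."""
    rng = random.Random(seed)
    centers = rng.sample(xs, k)
    for _ in range(50):
        # Assignment step: bucket each point with its nearest center.
        clusters = [[] for _ in range(k)]
        for x in xs:
            clusters[min(range(k), key=lambda j: (x - centers[j]) ** 2)].append(x)
        # Update step: each center becomes the mean of its bucket.
        new_centers = [sum(c) / len(c) if c else centers[j]
                       for j, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    cost = sum(min((x - c) ** 2 for c in centers) for x in xs)
    return cost, sorted(centers)

def best_of_restarts(xs, k, n_runs=20):
    """Keep the lowest-cost run over several random initializations."""
    return min(lloyd_1d(xs, k, seed) for seed in range(n_runs))
```

A bad seed set (e.g., two seeds from the same tight group) converges to a sub-optimal fixed point; restarting makes it likely that at least one run starts near the good solution.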
Sec. 16.4 K-means issues, variations, etc. Computes the centroids only after all points are reassigned. Instead, we can recompute the centroid after every single assignment; this can improve the speed of convergence of K-means. Assumes clusters are spherical in vector space: sensitive to coordinate changes, weighting, etc. Disjoint and exhaustive: doesn't have a notion of "outliers" by default, but outlier filtering can be added. Dhillon et al. (ICDM 2002): a variation that fixes some issues with small document clusters.
How many clusters? Number of clusters K is given: partition the n docs into a predetermined number of clusters. Finding the "right" number is part of the problem: given docs, partition them into an "appropriate" number of subsets. E.g., for query results, the ideal value of K is not known up front, though the UI may impose limits.
How many clusters? [Figure: the same data set partitioned into two, four, or six clusters.]
Selecting k
K not specified in advance Tradeoff between having better focus within each cluster and having too many clusters. Solve an optimization problem: penalize having lots of clusters (application-dependent, e.g., a compressed summary of a search results list):
k* = argmin_k [ J_min(k) + λk ]
where J_min(k) is the minimum value of J over {C_1, C_2, …, C_k} obtained in, e.g., 100 runs of K-means (with different initializations).
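The penalized selection rule is a one-liner once the best cost per k is known. A minimal sketch (the name `select_k` is hypothetical, and λ is application-dependent, as the slide notes):

```python
def select_k(costs_by_k, lam):
    """Pick k minimizing J_min(k) + lam * k.

    costs_by_k -- dict mapping k to the best (minimum) cost J_min(k)
                  found over many restarts of K-means with that k
    lam        -- penalty per cluster (application-dependent)
    """
    return min(costs_by_k, key=lambda k: costs_by_k[k] + lam * k)
```

With costs {1: 100, 2: 20, 3: 5, 4: 4.5, 5: 4.4} and λ = 2, the penalized objective is minimized at k = 3: beyond that, the small drop in J no longer pays for the extra cluster.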
Penalize lots of clusters Benefit for a doc: cosine similarity to its centroid. Total Benefit: the sum of the individual doc benefits. (Why is there always a clustering of Total Benefit n? With K = n, each doc is its own centroid, contributing similarity 1.) For each cluster, we have a Cost C; for K clusters, the Total Cost is KC. Value of a clustering = Total Benefit − Total Cost. Find the clustering of highest Value, over all choices of K. Total Benefit increases with increasing K, but we can stop when it doesn't increase by "much"; the Cost term enforces this.
Sec. 16.3 What is a good clustering? Internal criterion: intra-class (that is, intra-cluster) similarity is high; inter-class similarity is low. The measured quality of a clustering depends on both the doc representation and the similarity measure.