Clustering
CE-324: Modern Information Retrieval
Sharif University of Technology
M. Soleymani, Spring 2020
Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)
What is clustering? (Ch. 16)
- Clustering: grouping a set of objects into clusters of similar objects
  - Docs within a cluster should be similar.
  - Docs from different clusters should be dissimilar.
- The commonest form of unsupervised learning
  - Unsupervised learning: learning from raw data, as opposed to supervised learning, where a classification of examples is given
- A common and important task that finds many applications in IR and other places
A data set with clear cluster structure (Ch. 16)
- How would you design an algorithm for finding the three clusters in this case?
Applications of clustering in IR (Sec. 16.1)
- For better navigation of search results
  - Effective "user recall" will be higher
- Whole corpus analysis/navigation
  - Better user interface: search without typing
- For improving recall in search applications
  - Better search results (like pseudo RF)
- For speeding up vector space retrieval
  - Cluster-based retrieval gives faster search
Applications of clustering in IR
Search result clustering
yippy.com – grouping search results
Clustering the collection
- Cluster-based navigation is an interesting alternative to keyword searching (i.e., the standard IR paradigm)
- Users may prefer browsing over searching when they are unsure about which terms to use
- Well suited to a collection of news stories
  - News reading is not really search, but rather a process of selecting a subset of stories about recent events
Google News: automatic clustering gives an effective news presentation metaphor
To improve efficiency and effectiveness of a search system
- Improve language modeling: replace the collection model used for smoothing with a model derived from the doc's cluster
- Clustering can speed up search (via an inexact algorithm)
- Clustering can improve recall
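As a concrete illustration of the speed-up point, here is a minimal sketch (not from the slides) of cluster-pruned, inexact retrieval: the query is scored against the K cluster centroids first, and only the docs of the nearest cluster(s) are then scored. The names doc_vectors, centroids, assignments, and top_clusters are illustrative assumptions, and all vectors are assumed L2-normalized so dot products are cosine similarities.

```python
import numpy as np

def cluster_pruned_search(query, doc_vectors, centroids, assignments, top_clusters=1):
    """Return candidate doc indices from the nearest cluster(s), ranked by similarity."""
    # Score the query against the K centroids (cosine similarity for normalized vectors).
    centroid_scores = centroids @ query
    nearest = np.argsort(-centroid_scores)[:top_clusters]
    # Keep only docs assigned to the selected cluster(s), then rank them.
    candidates = np.flatnonzero(np.isin(assignments, nearest))
    doc_scores = doc_vectors[candidates] @ query
    return candidates[np.argsort(-doc_scores)]
```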
For improving search recall (Sec. 16.1)
- Cluster hypothesis: docs in the same cluster behave similarly with respect to relevance to information needs
- Therefore, to improve search recall:
  - Cluster docs in the corpus a priori
  - When a query matches a doc d, also return other docs in the cluster containing d
- Query "car": also return docs containing "automobile"
  - Because clustering grouped together docs containing "car" with those containing "automobile". Why might this happen?
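A hedged sketch of the recall idea above: whenever a doc matches the query, also return the other docs of its cluster. The array assignments (doc index to cluster id) is an assumed representation, not something defined on the slides.

```python
import numpy as np

def expand_with_cluster_mates(matched_docs, assignments):
    """Return all docs that share a cluster with at least one matched doc."""
    matched_clusters = np.unique(assignments[np.asarray(matched_docs)])
    return np.flatnonzero(np.isin(assignments, matched_clusters))  # superset of matched_docs
```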
Issues for clustering (Sec. 16.2)
- Representation for clustering
  - Doc representation
    - Vector space? Normalization?
    - Centroids aren't length normalized
  - Need a notion of similarity/distance
- How many clusters?
  - Fixed a priori?
  - Completely data driven?
    - Avoid "trivial" clusters - too large or small
    - Too large: for navigation purposes you've wasted an extra user click without whittling down the set of docs much
Notion of similarity/distance
- Ideal: semantic similarity.
- Practical: term-statistical similarity
  - We will use cosine similarity.
- For many algorithms, easier to think in terms of a distance (rather than similarity)
  - We will mostly speak of Euclidean distance
  - But real implementations use cosine similarity
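A small sketch of why the two views are interchangeable here: for L2-normalized vectors, squared Euclidean distance and cosine similarity are monotonically related, ||a - b||^2 = 2 - 2 cos(a, b), so ranking by one is ranking by the other. The vectors below are toy values chosen for illustration.

```python
import numpy as np

a = np.array([3.0, 1.0, 0.0]); a /= np.linalg.norm(a)   # length-normalize
b = np.array([1.0, 2.0, 2.0]); b /= np.linalg.norm(b)

cos_sim = float(a @ b)
sq_dist = float(np.sum((a - b) ** 2))
assert np.isclose(sq_dist, 2 - 2 * cos_sim)              # ||a - b||^2 = 2 - 2*cos(a, b)
```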
Clustering algorithms categorization
- Flat algorithms (e.g., k-means)
  - Usually start with a random (partial) partitioning
  - Refine it iteratively
- Hierarchical algorithms
  - Bottom-up, agglomerative
  - Top-down, divisive
Hard vs. soft clustering
- Hard clustering: each doc belongs to exactly one cluster
  - More common and easier to do
- Soft clustering: a doc can belong to more than one cluster.
Partitioning algorithms
- Construct a partition of N docs into K clusters
- Given: a set of docs and the number K
- Find: a partition of the docs into K clusters that optimizes the chosen partitioning criterion
- Finding a global optimum is intractable for many clustering objective functions
- Effective heuristic methods: K-means and K-medoids algorithms
K-means clustering
- Input: data $\{\mathbf{x}^{(1)}, \dots, \mathbf{x}^{(N)}\}$ and the number of clusters $K$
- Output: a partition $\mathcal{D}_1, \dots, \mathcal{D}_K$
- Optimization problem:
  $J(\mathcal{D}) = \sum_{k=1}^{K} \sum_{\mathbf{x}^{(i)} \in \mathcal{D}_k} \lVert \mathbf{x}^{(i)} - \mathbf{c}_k \rVert^2$
  - $\mathbf{c}_k$: the representative (center) of cluster $\mathcal{D}_k$
- This is an NP-hard problem in general.
K-means (Sec. 16.4)
- Assumes docs are real-valued vectors $\mathbf{x}^{(1)}, \dots, \mathbf{x}^{(N)}$.
- Clusters based on centroids (aka the center of gravity or mean) of the points in a cluster:
  $\boldsymbol{\mu}_k = \frac{1}{|\mathcal{D}_k|} \sum_{\mathbf{x}^{(i)} \in \mathcal{D}_k} \mathbf{x}^{(i)}$
- K-means cost function:
  $J(\mathcal{D}) = \sum_{k=1}^{K} \sum_{\mathbf{x}^{(i)} \in \mathcal{D}_k} \lVert \mathbf{x}^{(i)} - \boldsymbol{\mu}_k \rVert^2$
- $\mathcal{D} = \{\mathcal{D}_1, \mathcal{D}_2, \dots, \mathcal{D}_K\}$
- $\mathcal{D}_k$: the set of data points assigned to the $k$-th cluster
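A minimal sketch of the cost function $J(\mathcal{D})$ above, assuming X is an N x M data matrix, assignments holds a cluster id in {0, ..., K-1} for each row, and centroids is a K x M matrix (all names are illustrative assumptions):

```python
import numpy as np

def kmeans_cost(X, assignments, centroids):
    """Sum of squared distances of each point to the centroid of its assigned cluster."""
    diffs = X - centroids[assignments]   # each row minus its own cluster's centroid
    return float(np.sum(diffs ** 2))     # J(D)
```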
K-means algorithm (Sec. 16.4)
Select K random points $\{\boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \dots, \boldsymbol{\mu}_K\}$ as the clusters' initial centroids.
Until clustering converges (or another stopping criterion is met):
  For each doc $\mathbf{x}^{(i)}$:
    Assign $\mathbf{x}^{(i)}$ to the cluster $\mathcal{D}_k$ such that $\mathrm{dist}(\mathbf{x}^{(i)}, \boldsymbol{\mu}_k)$ is minimal.
  For each cluster $\mathcal{D}_k$:
    $\boldsymbol{\mu}_k = \frac{1}{|\mathcal{D}_k|} \sum_{\mathbf{x}^{(i)} \in \mathcal{D}_k} \mathbf{x}^{(i)}$
- Reassignment of instances to clusters is based on distance to the current cluster centroids (it can equivalently be done in terms of similarities)
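The following is a sketch of the iterations described above (random seeds, assignment step, centroid-update step), not an optimized or production implementation; the handling of empty clusters (keep the old centroid) is a simplifying choice of this sketch.

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]        # K random points as seeds
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest centroid (ties -> lower index).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assignments = np.argmin(dists, axis=1)
        # Update step: each centroid becomes the mean of the points assigned to it.
        new_centroids = np.array([
            X[assignments == k].mean(axis=0) if np.any(assignments == k) else centroids[k]
            for k in range(K)
        ])
        if np.allclose(new_centroids, centroids):                    # centroids unchanged
            break
        centroids = new_centroids
    return assignments, centroids
```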
[Figure slides adapted from Bishop]
Termination conditions (Sec. 16.4)
- Several possibilities for the termination condition, e.g.,
  - A fixed number of iterations
  - Doc partition unchanged
  - $J < \theta$: the cost function falls below a threshold
  - $\Delta J < \theta$: the decrease in the cost function (between two successive iterations) falls below a threshold
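A tiny sketch of the two cost-based stopping rules above; theta and the cost arguments are illustrative names (the fixed-iteration and partition-unchanged checks already appear in the K-means sketch earlier).

```python
def should_stop(prev_cost, cost, theta=1e-4):
    """Stop when J < theta, or when the decrease between successive iterations is < theta."""
    return cost < theta or (prev_cost - cost) < theta
```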
Convergence of K-means (Sec. 16.4)
- The K-means algorithm always reaches a fixed point, i.e., a state in which the clusters no longer change.
- We must use tie-breaking when a sample is at the same distance from two or more clusters (e.g., by assigning it to the cluster with the lowest index)
K-means decreases $J(\mathcal{D})$ in each iteration (before convergence) (Sec. 16.4)
- First, reassignment monotonically decreases $J(\mathcal{D})$, since each vector is assigned to the closest centroid.
- Second, recomputation monotonically decreases each $\sum_{\mathbf{x}^{(i)} \in \mathcal{D}_k} \lVert \mathbf{x}^{(i)} - \boldsymbol{\mu}_k \rVert^2$, since this sum reaches its minimum for $\boldsymbol{\mu}_k = \frac{1}{|\mathcal{D}_k|} \sum_{\mathbf{x}^{(i)} \in \mathcal{D}_k} \mathbf{x}^{(i)}$.
- K-means typically converges quickly.
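A quick numeric check (not a proof) of the second point: for random toy data, the sum of squared distances evaluated at the cluster mean is never larger than at any other candidate centroid.

```python
import numpy as np

rng = np.random.default_rng(0)
cluster = rng.normal(size=(50, 3))       # toy cluster
mu = cluster.mean(axis=0)                # the centroid

cost_at_mean = np.sum((cluster - mu) ** 2)
for _ in range(100):
    candidate = mu + rng.normal(scale=0.5, size=3)           # some other centroid choice
    assert np.sum((cluster - candidate) ** 2) >= cost_at_mean
```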
Time complexity of K-means (Sec. 16.4)
- Computing the distance between two docs: $O(M)$
  - $M$ is the dimensionality of the vectors.
- Reassigning clusters: $O(KN)$ distance computations $\Rightarrow O(KNM)$
- Computing centroids: each doc gets added once to some centroid: $O(NM)$
- Assume these two steps are each done once in each of $I$ iterations: $O(IKNM)$
Seed choice (Sec. 16.4)
- Results can vary based on the random selection of initial centroids.
  - Some initializations get a poor convergence rate, or convergence to a sub-optimal clustering.
- Remedies:
  - Exclude outliers from the seed set
  - Try out multiple starting points and choose the clustering with the lowest cost
  - Select good seeds using a heuristic (e.g., the doc least similar to any existing mean)
  - Obtain seeds from another method such as hierarchical clustering
- Example showing sensitivity to seeds: if you start with B and E as centroids, you converge to {A,B,C} and {D,E,F}; if you start with D and F, you converge to {A,B,D,E} and {C,F}.
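A sketch of the "multiple starting points" remedy: run K-means from several random seeds and keep the clustering with the lowest cost. It reuses the hypothetical kmeans() and kmeans_cost() sketches from the earlier slides.

```python
def best_of_restarts(X, K, n_restarts=10):
    """Return (lowest cost, assignments, centroids) over several random initializations."""
    best = None
    for seed in range(n_restarts):
        assignments, centroids = kmeans(X, K, seed=seed)
        cost = kmeans_cost(X, assignments, centroids)
        if best is None or cost < best[0]:
            best = (cost, assignments, centroids)
    return best
```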
K-means issues, variations, etc. (Sec. 16.4)
- Computes the centroid after all points are re-assigned
  - Instead, we can re-compute the centroid after every assignment
  - It can improve the speed of convergence of K-means
- Assumes clusters are spherical in vector space
  - Sensitive to coordinate changes, weighting, etc.
- Disjoint and exhaustive
  - Doesn't have a notion of "outliers" by default
  - But can add outlier filtering
- Dhillon et al., ICDM 2002 – a variation to fix some issues with small document clusters
How many clusters?
- Number of clusters K is given
  - Partition n docs into a predetermined number of clusters
- Finding the "right" number is part of the problem
  - Given docs, partition them into an "appropriate" number of subsets.
  - E.g., for query results: the ideal value of K is not known up front, though the UI may impose limits.
How many clusters? [Figure panels: Two Clusters, Four Clusters, Six Clusters]
Selecting k
- Is it possible to select k by assessing the cost function for different numbers of clusters?
- Keep adding clusters until adding more no longer decreases the cost significantly (e.g., by finding the "knee" of the cost curve)
K not specified in advance
- Tradeoff between having better focus within each cluster and having too many clusters
- Solve an optimization problem: penalize having lots of clusters
  - The penalty is application dependent
  - e.g., a compressed summary of the search results list
- $k^* = \arg\min_k \left[ J_{\min}(k) + \lambda k \right]$
  - $J_{\min}(k)$: the minimum value of $J(\{\mathcal{D}_1, \mathcal{D}_2, \dots, \mathcal{D}_k\})$ obtained in, e.g., 100 runs of k-means (with different initializations)
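A sketch of the penalized selection rule above, $k^* = \arg\min_k [J_{\min}(k) + \lambda k]$, where $J_{\min}(k)$ is approximated by the best cost over several restarts; lam is an application-dependent penalty weight and, like select_k itself, an illustrative assumption (it builds on the best_of_restarts sketch above).

```python
def select_k(X, k_values, lam=1.0, n_restarts=10):
    """Pick k minimizing J_min(k) + lam * k over the candidate values."""
    scores = {}
    for k in k_values:
        j_min, _, _ = best_of_restarts(X, k, n_restarts=n_restarts)
        scores[k] = j_min + lam * k
    return min(scores, key=scores.get)
```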
What is a good clustering? (Sec. 16.3)
- Internal criterion:
  - intra-class (that is, intra-cluster) similarity is high
  - inter-class similarity is low
- The measured quality of a clustering depends on both the doc representation and the similarity measure