Clustering - CE-324: Modern Information Retrieval, Sharif University



  1. Clustering. CE-324: Modern Information Retrieval, Sharif University of Technology. M. Soleymani, Fall 2017. Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford).

  2. Ch. 16 What is clustering?
  - Clustering: grouping a set of objects into clusters of similar objects
    - Docs within a cluster should be similar.
    - Docs from different clusters should be dissimilar.
  - The commonest form of unsupervised learning
    - Unsupervised learning: learning from raw data, as opposed to supervised learning, where a classification of examples is given
  - A common and important task that finds many applications in IR and other places

  3. Ch. 16 A data set with clear cluster structure
  - How would you design an algorithm for finding the three clusters in this case?

  4. Applications of clustering in IR

  5. Search result clustering

  6. yippy.com – grouping search results

  7. Clustering the collection
  - Cluster-based navigation is an interesting alternative to keyword searching (i.e., the standard IR paradigm)
  - Users may prefer browsing over searching when they are unsure about which terms to use
  - Well suited to a collection of news stories
    - News reading is not really search, but rather a process of selecting a subset of stories about recent events

  8. Yahoo! Hierarchy (www.yahoo.com/Science)
  [Figure: a hand-built topic hierarchy rooted at Science, with branches such as agriculture, biology, physics, CS, and space, and leaves such as dairy, botany, cell, AI, courses, crops, craft, magnetism, HCI, missions.]
  - The Yahoo! hierarchy isn't clustering, but it is the kind of output you want from clustering.

  9. Google News: automatic clustering gives an effective news presentation metaphor

  10. [Image-only slide.]

  11. To improve efficiency and effectiveness of the search system
  - Improve language modeling: replace the collection model used for smoothing with a model derived from the doc's cluster
  - Clustering can speed up search (via an inexact algorithm)
  - Clustering can improve recall
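One common way to write the cluster-based smoothing idea (a sketch of the general form, not necessarily the exact mixture used in these slides; λ is a hypothetical interpolation weight) is to interpolate the document's own language model with a model estimated from the doc's cluster:

\[
P(w \mid d) = \lambda \, P_{\mathrm{ML}}(w \mid d) + (1 - \lambda)\, P\bigl(w \mid \mathrm{Cluster}(d)\bigr), \qquad 0 \le \lambda \le 1 .
\]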

  12. Sec. 16.1 For improving search recall
  - Cluster hypothesis: docs in the same cluster behave similarly with respect to relevance to information needs
  - Therefore, to improve search recall:
    - Cluster docs in the corpus a priori
    - When a query matches a doc d, also return other docs in the cluster containing d
  - Query "car": also return docs containing "automobile"
    - Because clustering grouped docs containing "car" together with those containing "automobile". Why might this happen?
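The recall-improving step can be sketched in a few lines of Python (the names expand_with_clusters, cluster_of, and members_of are hypothetical, not from the slides): docs that match the query are expanded with the other members of their clusters.

```python
# A minimal sketch of cluster-based recall expansion (assumed precomputed
# mappings: cluster_of[doc_id] -> cluster id, members_of[cluster_id] -> doc ids).
def expand_with_clusters(matching_doc_ids, cluster_of, members_of):
    """Return the matched docs plus all docs sharing a cluster with them."""
    expanded = set(matching_doc_ids)
    for doc_id in matching_doc_ids:
        expanded.update(members_of[cluster_of[doc_id]])
    return expanded

# Toy example: doc 1 matches "car"; doc 2 (same cluster) contains "automobile".
cluster_of = {1: "c0", 2: "c0", 3: "c1"}
members_of = {"c0": [1, 2], "c1": [3]}
print(expand_with_clusters([1], cluster_of, members_of))  # {1, 2}
```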

  13. Sec. 16.2 Issues for clustering
  - Representation for clustering
    - Doc representation: vector space? Normalization?
      - Centroids aren't length normalized
    - Need a notion of similarity/distance
  - How many clusters?
    - Fixed a priori?
    - Completely data driven?
    - Avoid "trivial" clusters, too large or too small
      - Too large: for navigation purposes you've wasted an extra user click without whittling down the set of docs much

  14. Notion of similarity/distance
  - Ideal: semantic similarity
  - Practical: term-statistical similarity
    - We will use cosine similarity.
  - For many algorithms, easier to think in terms of a distance (rather than similarity)
    - We will mostly speak of Euclidean distance
    - But real implementations use cosine similarity
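The reason it is safe to reason in Euclidean terms while implementing with cosine: for length-normalized vectors, squared Euclidean distance is a monotone function of cosine similarity, ||x − y||² = 2(1 − cos(x, y)), so both lead to the same nearest-centroid decisions. A small numeric check (a sketch using NumPy; the vectors are made up):

```python
import numpy as np

x = np.array([2.0, 1.0, 0.0])
y = np.array([1.0, 3.0, 1.0])

# Length-normalize, as is standard for doc vectors in cosine-based retrieval.
x_hat, y_hat = x / np.linalg.norm(x), y / np.linalg.norm(y)

cos_sim = float(x_hat @ y_hat)
sq_euclidean = float(np.sum((x_hat - y_hat) ** 2))

# For unit vectors: ||x - y||^2 = 2 * (1 - cos(x, y))
print(np.isclose(sq_euclidean, 2 * (1 - cos_sim)))  # True
```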

  15. Clustering algorithms categorization
  - Flat algorithms (e.g., K-means)
    - Usually start with a random (partial) partitioning
    - Refine it iteratively
  - Hierarchical algorithms
    - Bottom-up, agglomerative
    - Top-down, divisive

  16. Hard vs. soft clustering
  - Hard clustering: each doc belongs to exactly one cluster
    - More common and easier to do
  - Soft clustering: a doc can belong to more than one cluster.

  17. Partitioning algorithms
  - Construct a partition of N docs into K clusters
  - Given: a set of docs and the number K
  - Find: a partition of the docs into K clusters that optimizes the chosen partitioning criterion
    - Finding a global optimum is intractable for many clustering objective functions
    - Effective heuristic methods: K-means and K-medoids algorithms

  18. Sec. 16.4 K-means
  - Assumes docs are real-valued vectors x^(1), ..., x^(N).
  - Clusters based on centroids (aka the center of gravity or mean) of the points in a cluster:
    \[ \mu_k = \frac{1}{\lvert C_k \rvert} \sum_{x^{(i)} \in C_k} x^{(i)} \]
  - K-means cost function:
    \[ J(C) = \sum_{k=1}^{K} \sum_{x^{(i)} \in C_k} \lVert x^{(i)} - \mu_k \rVert^2 , \qquad C = \{ C_1, C_2, \ldots, C_K \} \]
  - C_k: the set of data points assigned to the k-th cluster
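The cost function can be computed directly from an assignment and a set of centroids; the helper below (kmeans_cost is my name, not the slides') is a sketch that later snippets reuse.

```python
import numpy as np

def kmeans_cost(X, assignments, centroids):
    """J: sum of squared distances of every doc vector to its cluster centroid."""
    return float(sum(np.sum((X[assignments == k] - centroids[k]) ** 2)
                     for k in range(len(centroids))))

# Tiny worked example: 1-D clusters {1, 3} and {10} with centroids 2 and 10 -> J = 2.
X = np.array([[1.0], [3.0], [10.0]])
print(kmeans_cost(X, np.array([0, 0, 1]), np.array([[2.0], [10.0]])))  # 2.0
```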

  19. Sec. 16.4 K-means algorithm
  Select K random points {μ_1, μ_2, ..., μ_K} as the clusters' initial centroids.
  Until clustering converges (or another stopping criterion holds):
    For each doc x^(i): assign x^(i) to the cluster C_k for which dist(x^(i), μ_k) is minimal.
    For each cluster C_k: recompute its centroid
      \[ \mu_k = \frac{1}{\lvert C_k \rvert} \sum_{x^{(i)} \in C_k} x^{(i)} \]
  - Reassignment of instances to clusters is based on distance to the current cluster centroids (it can equivalently be done in terms of similarities).
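A minimal NumPy implementation of the algorithm above (a sketch, with Euclidean distance and "partition unchanged" as the stopping criterion; the handling of empty clusters is one of several possible choices):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal K-means sketch; X is an (N, M) array of doc vectors."""
    rng = np.random.default_rng(seed)
    # Select k random docs as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    prev = None
    for _ in range(n_iters):
        # Assignment step: each doc goes to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        if prev is not None and np.array_equal(labels, prev):
            break  # doc partition unchanged -> converged
        prev = labels
        # Update step: each centroid becomes the mean of its assigned docs.
        for j in range(k):
            if np.any(labels == j):          # keep the old centroid if a cluster empties
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# Toy usage on two well-separated blobs (made-up data).
X = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 5.0])
labels, centroids = kmeans(X, k=2)
```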

  20. [Figure from Bishop illustrating K-means iterations on example data.]

  21. [Image-only slide.]

  22. Sec. 16.4 Termination conditions
  - Several possibilities for the termination condition, e.g.:
    - A fixed number of iterations
    - Doc partition unchanged
    - J < θ: the cost function falls below a threshold
    - ΔJ < θ: the decrease in the cost function (over two successive iterations) falls below a threshold

  23. Sec. 16.4 Convergence of K-means
  - The K-means algorithm always reaches a fixed point in which the clusters don't change.
  - We must use consistent tie-breaking when a sample is at the same distance from two or more centroids (e.g., by assigning it to the lowest-index cluster).

  24. Sec. 16.4 K-means decreases J(C) in each iteration (before convergence)
  - First, reassignment monotonically decreases J(C), since each vector is assigned to the closest centroid.
  - Second, recomputation monotonically decreases each per-cluster term
    \[ \sum_{x^{(i)} \in C_k} \lVert x^{(i)} - \mu_k \rVert^2 , \]
    because this sum reaches its minimum for
    \[ \mu_k = \frac{1}{\lvert C_k \rvert} \sum_{x^{(i)} \in C_k} x^{(i)} . \]
  - K-means typically converges quickly.
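The "reaches its minimum" step is the standard one-line argument: set the gradient with respect to μ_k to zero (in the notation reconstructed above); since the objective is convex in μ_k, this stationary point is the minimizer.

\[
\frac{\partial}{\partial \mu_k} \sum_{x^{(i)} \in C_k} \lVert x^{(i)} - \mu_k \rVert^2
= -2 \sum_{x^{(i)} \in C_k} \bigl( x^{(i)} - \mu_k \bigr) = 0
\;\;\Longrightarrow\;\;
\mu_k = \frac{1}{\lvert C_k \rvert} \sum_{x^{(i)} \in C_k} x^{(i)} .
\]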

  25. Sec. 16.4 Time complexity of K-means
  - Computing the distance between two docs: O(M)
    - M is the dimensionality of the vectors.
  - Reassigning clusters: O(KN) distance computations ⇒ O(KNM).
  - Computing centroids: each doc gets added once to some centroid: O(NM).
  - Assume these two steps are each done once in each of I iterations: O(IKNM).
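A worked count with made-up sizes (the numbers are illustrative only; for sparse tf-idf vectors, M is effectively the number of non-zero entries per doc, so real costs are much lower):

```python
# Hypothetical sizes: N docs, M-dimensional vectors, K clusters, I iterations.
N, M, K, I = 100_000, 50_000, 10, 20

reassignment    = K * N * M            # O(KNM) distance work per iteration
centroid_update = N * M                # O(NM) per iteration
total_ops = I * (reassignment + centroid_update)   # O(IKNM) overall
print(f"{total_ops:.1e} basic operations")          # ~1.1e12
```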

  26. Sec. 16.4 Seed choice
  - Results can vary based on the random selection of initial centroids.
    - Some initializations give a poor convergence rate, or convergence to a sub-optimal clustering.
  - Try out multiple starting points (see the sketch below)
  - Select good seeds using a heuristic (e.g., a doc least similar to any existing mean)
  - Initialize with the results of another method.
  - Example showing sensitivity to seeds (points A-F): starting with B and E as centroids converges to {A,B,C} and {D,E,F}; starting with D and F converges to {A,B,D,E} and {C,F}.
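The "try out multiple starting points" remedy, sketched with the hypothetical kmeans() and kmeans_cost() helpers defined earlier: run K-means from several seeds and keep the clustering with the lowest cost J.

```python
import numpy as np

# Made-up data: three blobs, so k = 3 is a sensible choice here.
X = np.vstack([np.random.randn(30, 2),
               np.random.randn(30, 2) + 4.0,
               np.random.randn(30, 2) + np.array([4.0, -4.0])])

best_cost, best_labels, best_centroids = float("inf"), None, None
for seed in range(10):                        # 10 random restarts
    labels, centroids = kmeans(X, k=3, seed=seed)
    cost = kmeans_cost(X, labels, centroids)
    if cost < best_cost:                      # keep the lowest-cost clustering
        best_cost, best_labels, best_centroids = cost, labels, centroids
```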

  27. Sec. 16.4 K-means issues, variations, etc.
  - Computes the centroid only after all points are re-assigned
    - Instead, we can re-compute the centroid after every assignment
    - It can improve the speed of convergence of K-means
  - Assumes clusters are spherical in vector space
    - Sensitive to coordinate changes, weighting, etc.
  - Disjoint and exhaustive
    - Doesn't have a notion of "outliers" by default
    - But can add outlier filtering
  - Dhillon et al. ICDM 2002: a variation to fix some issues with small document clusters

  28. How many clusters?
  - Number of clusters K is given
    - Partition n docs into a predetermined number of clusters
  - Finding the "right" number is part of the problem
    - Given docs, partition them into an "appropriate" number of subsets.
    - E.g., for query results: the ideal value of K is not known up front, though the UI may impose limits.

  29. How many clusters?
  [Figure: the same set of points shown grouped into two clusters, four clusters, and six clusters.]

  30. Selecting k

  31. K not specified in advance
  - Tradeoff between having better focus within each cluster and having too many clusters
  - Solve an optimization problem: penalize having lots of clusters
    - Application dependent: e.g., a compressed summary of a search-results list.
    \[ K^{*} = \arg\min_{K} \bigl[ J_{\min}(K) + \lambda K \bigr] \]
  - J_min(K): the minimum value of J over clusterings {C_1, C_2, ..., C_K}, obtained in e.g. 100 runs of K-means (with different initializations)
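A sketch of this selection rule, again reusing the hypothetical kmeans() and kmeans_cost() helpers from earlier; lam is the application-dependent penalty weight λ, and 20 restarts stand in for the "e.g. 100 runs".

```python
def select_k(X, k_values, lam, n_restarts=20):
    """Return the K minimizing J_min(K) + lam * K (a sketch)."""
    best_k, best_score = None, float("inf")
    for k in k_values:
        # J_min(K): lowest cost over several random restarts of K-means.
        j_min = min(kmeans_cost(X, *kmeans(X, k, seed=s)) for s in range(n_restarts))
        score = j_min + lam * k              # penalized objective
        if score < best_score:
            best_k, best_score = k, score
    return best_k
```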

  32. Penalize lots of clusters
  - Benefit for a doc: cosine similarity to its centroid
  - Total Benefit: sum of the individual doc Benefits.
    - Why is there always a clustering of Total Benefit n?
  - For each cluster, we have a Cost C.
  - For K clusters, the Total Cost is KC.
  - Value of a clustering = Total Benefit - Total Cost.
  - Find the clustering of highest value, over all choices of K.
    - Total Benefit increases with increasing K.
    - But we can stop when it doesn't increase by "much". The Cost term enforces this.
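The Total-Benefit-n observation is why the Cost term is needed: with K = n, each doc is its own cluster and has cosine similarity 1 to its own centroid, so the Total Benefit is n. A sketch of the value criterion (cost_per_cluster is the hypothetical per-cluster cost C; doc vectors and centroids are assumed non-zero):

```python
import numpy as np

def clustering_value(X, labels, centroids, cost_per_cluster):
    """Value = Total Benefit (sum of cosine similarities to centroids) - K * C."""
    X_hat = X / np.linalg.norm(X, axis=1, keepdims=True)
    C_hat = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    total_benefit = float((X_hat * C_hat[labels]).sum())   # sum of per-doc cosines
    return total_benefit - cost_per_cluster * len(centroids)
```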

  33. Sec. 16.3 What is a good clustering?
  - Internal criterion:
    - Intra-class (that is, intra-cluster) similarity is high
    - Inter-class similarity is low
  - The measured quality of a clustering depends on both the doc representation and the similarity measure
