Clustering CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2017 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)
Ch. 16 What is clustering? Clustering: grouping a set of objects into subsets of similar ones. Docs within a cluster should be similar; docs from different clusters should be dissimilar. Clustering is the commonest form of unsupervised learning: learning from raw data, as opposed to supervised learning, where a labeling of examples is given. A common and important task that finds many applications in IR and other places.
Ch. 16 A data set with clear cluster structure How would you design an algorithm for finding the three clusters in this case?
Applications of clustering in IR
Search result clustering
yippy.com – grouping search results
Clustering the collection Cluster-based navigation is an interesting alternative to keyword searching (i.e., the standard IR paradigm). Users may prefer browsing over searching when they are unsure about which terms to use. Well suited to a collection of news stories: news reading is not really search, but rather a process of selecting a subset of stories about recent events.
Yahoo! Hierarchy www.yahoo.com/Science … (30) [Figure: a directory tree rooted at Science, with categories such as agriculture, biology, physics, CS, space, and subcategories such as dairy, crops, agronomy, botany, evolution, cell, forestry, AI, HCI, courses, craft, missions, magnetism, relativity.] The Yahoo! hierarchy isn't clustering, but it is the kind of output you want from clustering.
Google News: automatic clustering gives an effective news presentation metaphor
To improve efficiency and effectiveness of the search system: Improve language modeling: replace the collection model used for smoothing by a model derived from the doc's cluster. Clustering can speed up search (via an inexact algorithm). Clustering can improve recall.
Sec. 16.1 For improving search recall Cluster hypothesis: docs in the same cluster behave similarly with respect to relevance to information needs. Therefore, to improve search recall: cluster the docs in the corpus a priori; when a query matches a doc d, also return other docs in the cluster containing d. Query car: also return docs containing automobile, because clustering grouped together docs containing car with those containing automobile. Why might this happen?
Sec. 16.2 Issues for clustering Representation for clustering: Doc representation: vector space? Normalization? (Centroids aren't length-normalized.) Need a notion of similarity/distance. How many clusters? Fixed a priori? Completely data-driven? Avoid "trivial" clusters, too large or too small. If too large: for navigation purposes you've wasted an extra user click without whittling down the set of docs much.
Notion of similarity/distance Ideal: semantic similarity. Practical: term-statistical similarity. We will use cosine similarity. For many algorithms, it is easier to think in terms of a distance (rather than a similarity). We will mostly speak of Euclidean distance, but real implementations use cosine similarity.
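The distance-vs-similarity remark can be made concrete. A minimal sketch (the function names are my own, not from the slides): for length-normalized vectors, squared Euclidean distance and cosine similarity are monotonically related, so "nearest centroid by distance" and "nearest centroid by similarity" agree.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two term vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

def euclidean_distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# For unit-length vectors: ||x - y||^2 = 2 * (1 - cos(x, y)),
# so ranking by smallest distance equals ranking by largest similarity.
```

This is why the slides can reason in Euclidean terms while implementations use cosine.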
Clustering algorithms categorization Flat algorithms (e.g., K-means): usually start with a random (partial) partitioning and refine it iteratively. Hierarchical algorithms: bottom-up (agglomerative) or top-down (divisive).
Hard vs. soft clustering Hard clustering: each doc belongs to exactly one cluster; more common and easier to do. Soft clustering: a doc can belong to more than one cluster.
Partitioning algorithms Construct a partition of N docs into K clusters. Given: a set of docs and the number K. Find: a partition of the docs into K clusters that optimizes the chosen partitioning criterion. Finding a global optimum is intractable for many clustering objective functions. Effective heuristic methods: K-means and K-medoids algorithms.
Sec. 16.4 K-means Assumes docs are real-valued vectors x^(1), …, x^(N). Clusters based on centroids (aka the center of gravity or mean) of the points in a cluster:
μ_k = (1/|C_k|) Σ_{x^(i) ∈ C_k} x^(i)
K-means cost function:
J(C) = Σ_{k=1}^{K} Σ_{x^(i) ∈ C_k} ||x^(i) − μ_k||²
where C = {C_1, C_2, …, C_K} and C_k is the set of data points assigned to the k-th cluster.
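The cost function J(C) can be computed directly from a partition. A minimal sketch (the name `kmeans_cost` and the list-of-lists representation are assumptions for illustration):

```python
def kmeans_cost(points, assignments, centroids):
    """J(C): sum of squared distances of each point to its cluster centroid.

    points      -- list of vectors (lists of floats)
    assignments -- assignments[i] is the cluster index of points[i]
    centroids   -- list of centroid vectors, one per cluster
    """
    return sum(
        sum((p - c) ** 2 for p, c in zip(points[i], centroids[k]))
        for i, k in enumerate(assignments)
    )
```

For example, two points [0,0] and [2,0] assigned to a single centroid [1,0] give J = 1 + 1 = 2.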
Sec. 16.4 K-means algorithm Select K random points {μ_1, μ_2, …, μ_K} as the clusters' initial centroids. Until the clustering converges (or another stopping criterion holds):
For each doc x^(i): assign x^(i) to the cluster C_k such that dist(x^(i), μ_k) is minimal.
For each cluster C_k: μ_k = (1/|C_k|) Σ_{x^(i) ∈ C_k} x^(i).
Reassignment of instances to clusters is based on distance to the current cluster centroids (can equivalently be phrased in terms of similarities).
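The two alternating steps can be sketched as follows. This is a minimal illustration, not a production implementation: the function name `kmeans` and the dense-list representation are my own choices, and a real IR system would use sparse tf-idf vectors with cosine similarity.

```python
import random

def kmeans(points, k, max_iters=100, seed=0):
    """Lloyd's K-means on dense vectors; returns (centroids, assignments)."""
    rng = random.Random(seed)
    # Seed selection: K random data points become the initial centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignments = None
    for _ in range(max_iters):
        # Assignment step: each point goes to its nearest centroid
        # (ties broken toward the lower cluster index, as min() does).
        new_assignments = [
            min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            for p in points
        ]
        if new_assignments == assignments:
            break  # partition unchanged: converged
        assignments = new_assignments
        # Update step: recompute each centroid as the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # keep the old centroid if a cluster emptied
                centroids[j] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assignments
```

On well-separated data this recovers the natural grouping regardless of which points are sampled as seeds.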
[Figure from Bishop: successive K-means iterations on a 2-D data set.]
Sec. 16.4 Termination conditions Several possibilities for the termination condition, e.g.: A fixed number of iterations. Doc partition unchanged. J < θ: the cost function falls below a threshold. ΔJ < θ: the decrease in the cost function (between two successive iterations) falls below a threshold.
Sec. 16.4 Convergence of K-means The K-means algorithm always reaches a fixed point in which the clusters don't change. We must use tie-breaking when a sample is at the same distance from two or more centroids (e.g., by assigning it to the lowest-index cluster).
Sec. 16.4 K-means decreases J(C) in each iteration (before convergence) First, the reassignment step monotonically decreases J(C), since each vector is assigned to its closest centroid. Second, the recomputation step monotonically decreases each Σ_{x^(i) ∈ C_l} ||x^(i) − μ_l||², because this sum reaches its minimum at μ_l = (1/|C_l|) Σ_{x^(i) ∈ C_l} x^(i). K-means typically converges quickly.
Sec. 16.4 Time complexity of K-means Computing the distance between two docs: O(M), where M is the dimensionality of the vectors. Reassigning clusters: O(KN) distance computations ⇒ O(KNM). Computing centroids: each doc gets added once to some centroid: O(NM). Assume these two steps are each done once for I iterations: O(IKNM).
Sec. 16.4 Seed choice Results can vary based on random seed selection. Some seed choices yield a poor convergence rate, or convergence to a sub-optimal clustering. Try out multiple starting points. Select good seeds using a heuristic (e.g., the doc least similar to any existing centroid). Initialize with the results of another method. [Example showing sensitivity to seeds, with points A–F: starting with B and E as centroids converges to {A,B,C} and {D,E,F}; starting with D and F converges to {A,B,D,E} and {C,F}.]
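The "try out multiple starting points" remedy can be sketched as a best-of-restarts loop. This is a minimal self-contained illustration on scalar data (the names `lloyd_1d` and `best_of_restarts` are my own); the idea is simply to run K-means from several random seeds and keep the run with the lowest cost J.

```python
import random

def lloyd_1d(xs, k, seed):
    """One K-means run on scalars, seeded from k random data points."""
    rng = random.Random(seed)
    centers = rng.sample(xs, k)
    for _ in range(50):
        # Assignment step: bucket each point with its nearest center.
        clusters = [[] for _ in range(k)]
        for x in xs:
            clusters[min(range(k), key=lambda j: (x - centers[j]) ** 2)].append(x)
        # Update step: each center becomes the mean of its bucket.
        new_centers = [sum(c) / len(c) if c else centers[j]
                       for j, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    cost = sum(min((x - c) ** 2 for c in centers) for x in xs)
    return cost, sorted(centers)

def best_of_restarts(xs, k, n_runs=20):
    """Keep the lowest-cost run over several random initializations."""
    return min(lloyd_1d(xs, k, seed) for seed in range(n_runs))
```

A bad seed set (e.g., two seeds from the same tight group) converges to a sub-optimal fixed point; restarting makes it likely that at least one run starts near the good solution.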
Sec. 16.4 K-means issues, variations, etc. Computes the centroids only after all points are reassigned. Instead, we can recompute the centroid after every single assignment; this can improve the speed of convergence of K-means. Assumes clusters are spherical in vector space: sensitive to coordinate changes, weighting, etc. Disjoint and exhaustive: doesn't have a notion of "outliers" by default, but outlier filtering can be added. Dhillon et al. (ICDM 2002): a variation that fixes some issues with small document clusters.
How many clusters? Number of clusters K is given: partition the n docs into a predetermined number of clusters. Finding the "right" number is part of the problem: given docs, partition them into an "appropriate" number of subsets. E.g., for query results, the ideal value of K is not known up front, though the UI may impose limits.
How many clusters? [Figure: the same data set partitioned into two, four, or six clusters.]
Selecting k
K not specified in advance Tradeoff between having better focus within each cluster and having too many clusters. Solve an optimization problem: penalize having lots of clusters (application-dependent, e.g., a compressed summary of a search results list):
k* = argmin_k [ J_min(k) + λk ]
where J_min(k) is the minimum value of J over {C_1, C_2, …, C_k} obtained in, e.g., 100 runs of K-means (with different initializations).
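The penalized selection rule is a one-liner once the best cost per k is known. A minimal sketch (the name `select_k` is hypothetical, and λ is application-dependent, as the slide notes):

```python
def select_k(costs_by_k, lam):
    """Pick k minimizing J_min(k) + lam * k.

    costs_by_k -- dict mapping k to the best (minimum) cost J_min(k)
                  found over many restarts of K-means with that k
    lam        -- penalty per cluster (application-dependent)
    """
    return min(costs_by_k, key=lambda k: costs_by_k[k] + lam * k)
```

With costs {1: 100, 2: 20, 3: 5, 4: 4.5, 5: 4.4} and λ = 2, the penalized objective is minimized at k = 3: beyond that, the small drop in J no longer pays for the extra cluster.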
Penalize lots of clusters Benefit for a doc: cosine similarity to its centroid. Total Benefit: the sum of the individual doc benefits. (Why is there always a clustering of Total Benefit n? With K = n, each doc is its own centroid, contributing similarity 1.) For each cluster, we have a Cost C; for K clusters, the Total Cost is KC. Value of a clustering = Total Benefit − Total Cost. Find the clustering of highest Value, over all choices of K. Total Benefit increases with increasing K, but we can stop when it doesn't increase by "much"; the Cost term enforces this.
Sec. 16.3 What is a good clustering? Internal criterion: intra-class (that is, intra-cluster) similarity is high; inter-class similarity is low. The measured quality of a clustering depends on both the doc representation and the similarity measure.