Clustering Techniques
Berlin Chen, 2003

References:
1. Modern Information Retrieval, Chapters 5, 7
2. Foundations of Statistical Natural Language Processing, Chapter 14
Clustering
• Place similar objects in the same group and assign dissimilar objects to different groups
  – Word clustering
    • Neighbor overlap: words that occur with similar left and right neighbors (such as in and on)
  – Document clustering
    • Documents with similar topics or concepts are put together
• But clustering cannot give a comprehensive description of the object
  – How to label objects shown on the visual display
• Clustering is a way of learning
Clustering vs. Classification
• Classification is supervised and requires a set of labeled training instances for each group (class)
• Clustering is unsupervised and learns without a teacher to provide the labeling information of the training data set
  – Also called automatic or unsupervised classification
Types of Clustering Algorithms
• Two types of structures are produced by clustering algorithms
  – Flat or non-hierarchical clustering
  – Hierarchical clustering
• Flat clustering
  – Simply consists of a certain number of clusters; the relation between clusters is often undetermined
• Hierarchical clustering
  – A hierarchy with the usual interpretation that each node stands for a subclass of its mother node
    • The leaves of the tree are the single objects
    • Each node represents the cluster that contains all the objects of its descendants
Hard Assignment vs. Soft Assignment
• Another important distinction between clustering algorithms is whether they perform soft or hard assignment
• Hard assignment
  – Each object is assigned to one and only one cluster
• Soft assignment
  – Each object may be assigned to multiple clusters
  – An object $x_i$ has a probability distribution $P(\cdot \mid x_i)$ over clusters $c_j$, where $P(c_j \mid x_i)$ is the probability that $x_i$ is a member of $c_j$
  – Soft assignment is somewhat more appropriate for many tasks, such as NLP and IR
Hard Assignment vs. Soft Assignment
• Hierarchical clustering usually adopts hard assignment, while in flat clustering both types of assignment are common
Summarized Attributes of Clustering Algorithms
• Hierarchical clustering
  – Preferable for detailed data analysis
  – Provides more information than flat clustering
  – No single best algorithm (each algorithm is optimal only for some applications)
  – Less efficient than flat clustering (minimally has to compute an n × n matrix of similarity coefficients)
• Flat clustering
  – Preferable if efficiency is a consideration or the data sets are very large
  – K-means is the conceptually simplest method and should probably be used first on new data because its results are often sufficient
  – K-means assumes a simple Euclidean representation space, and so cannot be used for many data sets, e.g., nominal data like colors
  – The EM algorithm is the most flexible choice: it can accommodate definitions of clusters and allocation of objects based on complex probabilistic models
Hierarchical Clustering
Hierarchical Clustering
• Can proceed in either a bottom-up or a top-down manner
  – Bottom-up (agglomerative)
    • Start with individual objects and group the most similar ones
      – E.g., those with the minimum distance apart, converting distance to similarity via $sim(x, y) = \frac{1}{1 + d(x, y)}$
    • The procedure terminates when one cluster containing all objects has been formed
  – Top-down (divisive)
    • Start with all objects in one group and divide them into groups so as to maximize within-group similarity
Hierarchical Agglomerative Clustering (HAC)
• A bottom-up approach
• Assumes a similarity measure for determining the similarity of two objects
• Starts with every object in a separate cluster and then repeatedly joins the two most similar clusters until only one cluster remains
• The history of merging/clustering forms a binary tree or hierarchy
Hierarchical Agglomerative Clustering (HAC)
• Algorithm (the pseudocode figure is not reproduced; its annotations mark how the cluster number decreases at each merge — a sketch follows)
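Since the original pseudocode figure is not recoverable, here is a minimal Python sketch of the HAC loop. It assumes a hypothetical caller-supplied function `cluster_sim` that scores the similarity of two clusters (e.g., one of the measures on the following slides); clusters are plain lists of objects.

```python
def hac(objects, cluster_sim):
    """Hierarchical agglomerative clustering (bottom-up).

    objects: the data points; cluster_sim: a function taking two
    clusters (lists of objects) and returning their similarity.
    Returns the merge history, which defines the binary tree.
    """
    clusters = [[x] for x in objects]   # start: one cluster per object
    history = []
    while len(clusters) > 1:
        # search all cluster pairs for the one with greatest similarity
        i, j = max(
            ((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
            key=lambda p: cluster_sim(clusters[p[0]], clusters[p[1]]))
        # merge the winning pair and record the step
        history.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return history
```

Each iteration reduces the cluster count by one, so after n − 1 merges a single cluster containing all objects remains.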
Distance Metrics
• Euclidean distance ($L_2$ norm)
  $L_2(\vec{x}, \vec{y}) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}$
• $L_1$ norm
  $L_1(\vec{x}, \vec{y}) = \sum_{i=1}^{m} |x_i - y_i|$
• Cosine similarity (transformed to a distance by subtracting from 1)
  $1 - \frac{\vec{x} \cdot \vec{y}}{|\vec{x}|\,|\vec{y}|}$
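A direct transcription of the three metrics as a sketch (pure Python; `x` and `y` are equal-length sequences of numbers):

```python
import math

def l2_distance(x, y):
    # Euclidean (L2) distance
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def l1_distance(x, y):
    # L1 (city-block) distance
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def cosine_distance(x, y):
    # 1 minus the cosine similarity
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    return 1.0 - dot / (norm_x * norm_y)
```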
Measures of Cluster Similarity
• Especially for the bottom-up approaches
• Single-link clustering
  – The similarity between two clusters is the similarity of the two closest objects in the clusters
  – Search over all pairs of objects that come from the two different clusters and select the pair with the greatest similarity
    $sim(c_i, c_j) = \max_{\vec{x} \in c_i,\, \vec{y} \in c_j} sim(\vec{x}, \vec{y})$
• Complete-link clustering
  – The similarity between two clusters is the similarity of their two most dissimilar members (the pair with the least similarity)
    $sim(c_i, c_j) = \min_{\vec{x} \in c_i,\, \vec{y} \in c_j} sim(\vec{x}, \vec{y})$
  – Sphere-shaped clusters are achieved
  – Preferable for most IR and NLP applications
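Both measures plug directly into the `hac` sketch above. A minimal version, assuming `sim` is a pairwise object-similarity function such as the cosine:

```python
def single_link_sim(ci, cj, sim):
    # similarity of the two closest members of the two clusters
    return max(sim(x, y) for x in ci for y in cj)

def complete_link_sim(ci, cj, sim):
    # similarity of the two most dissimilar members
    return min(sim(x, y) for x in ci for y in cj)
```

For example, `hac(objects, lambda a, b: single_link_sim(a, b, sim))` runs single-link HAC.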
Measures of Cluster Similarity
• (Figure illustrating the cluster-similarity measures; not reproduced)
Measures of Cluster Similarity
• Group-average agglomerative clustering
  – A compromise between single-link and complete-link clustering
  – The similarity between two clusters is the average similarity between their members
  – If the objects are represented as length-normalized vectors and the similarity measure is the cosine, there exists a fast algorithm for computing the average similarity
    $sim(\vec{x}, \vec{y}) = \cos(\vec{x}, \vec{y}) = \frac{\vec{x} \cdot \vec{y}}{|\vec{x}|\,|\vec{y}|} = \vec{x} \cdot \vec{y}$
Measures of Cluster Similarity
• Group-average agglomerative clustering (cont.)
  – The average similarity SIM between vectors in a cluster $c_j$ is defined as
    $SIM(c_j) = \frac{1}{|c_j|(|c_j| - 1)} \sum_{\vec{x} \in c_j} \sum_{\vec{y} \in c_j,\, \vec{y} \neq \vec{x}} sim(\vec{x}, \vec{y})$
  – The sum of the members of a cluster $c_j$: $\vec{s}(c_j) = \sum_{\vec{x} \in c_j} \vec{x}$
  – Express $SIM(c_j)$ in terms of $\vec{s}(c_j)$:
    $\vec{s}(c_j) \cdot \vec{s}(c_j) = \sum_{\vec{x} \in c_j} \vec{x} \cdot \vec{s}(c_j) = \sum_{\vec{x} \in c_j} \sum_{\vec{y} \in c_j} \vec{x} \cdot \vec{y} = |c_j|(|c_j| - 1)\,SIM(c_j) + \sum_{\vec{x} \in c_j} \vec{x} \cdot \vec{x} = |c_j|(|c_j| - 1)\,SIM(c_j) + |c_j|$
    (using $\vec{x} \cdot \vec{x} = 1$ for length-normalized vectors)
    $\therefore\; SIM(c_j) = \frac{\vec{s}(c_j) \cdot \vec{s}(c_j) - |c_j|}{|c_j|(|c_j| - 1)}$
Measures of Cluster Similarity
• Group-average agglomerative clustering (cont.)
  – When merging two clusters $c_i$ and $c_j$, the cluster sum vectors $\vec{s}(c_i)$ and $\vec{s}(c_j)$ are known in advance
    $\vec{s}(c_{New}) = \vec{s}(c_i) + \vec{s}(c_j)$, $\quad |c_{New}| = |c_i| + |c_j|$
  – The average similarity for their union will be
    $SIM(c_i \cup c_j) = \frac{\left(\vec{s}(c_i) + \vec{s}(c_j)\right) \cdot \left(\vec{s}(c_i) + \vec{s}(c_j)\right) - (|c_i| + |c_j|)}{(|c_i| + |c_j|)(|c_i| + |c_j| - 1)}$
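This update is what makes group-average HAC fast: the merged cluster's average similarity comes from the sum vectors alone, with no pairwise loop. A sketch assuming NumPy arrays and length-normalized member vectors (function and argument names are illustrative):

```python
import numpy as np

def union_avg_sim(s_i, n_i, s_j, n_j):
    """Average pairwise similarity of the union of two clusters,
    given each cluster's sum vector s and size n.  Valid only for
    length-normalized vectors, where the dot product is the cosine."""
    s = s_i + s_j                       # sum vector of the union
    n = n_i + n_j                       # size of the union
    return (s @ s - n) / (n * (n - 1))
```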
An Example
• (Worked clustering example shown as a figure; not reproduced)
Divisive Clustering
• A top-down approach
• Start with all objects in a single cluster
• At each iteration, select the least coherent cluster and split it
• Continue the iterations until a predefined criterion (e.g., the number of clusters) is achieved
• The history of clustering forms a binary tree or hierarchy
Divisive Clustering
• To select the least coherent cluster, the measures used in bottom-up clustering can be used again here
  – Single-link measure
  – Complete-link measure
  – Group-average measure
• How to split a cluster
  – Splitting is itself a clustering task (finding two sub-clusters)
  – Any clustering algorithm can be used for the splitting operation, e.g.,
    • Bottom-up algorithms
    • Non-hierarchical clustering algorithms (e.g., K-means)
Divisive Clustering
• Algorithm (the pseudocode figure is not reproduced; a sketch follows)
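Since the algorithm figure is not recoverable, here is a minimal sketch. Both `coherence` (e.g., group-average similarity) and `split` (e.g., 2-means) are hypothetical caller-supplied functions:

```python
def divisive(objects, coherence, split, num_clusters):
    """Top-down clustering: repeatedly split the least coherent
    cluster until num_clusters clusters exist."""
    clusters = [list(objects)]                 # start: all objects together
    while len(clusters) < num_clusters:
        worst = min(clusters, key=coherence)   # least coherent cluster
        clusters.remove(worst)
        clusters.extend(split(worst))          # split returns two sub-clusters
    return clusters
```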
Non-Hierarchical Clustering
Non-hierarchical Clustering
• Start out with a partition based on randomly selected seeds (one seed per cluster) and then refine the initial partition
  – In a multi-pass manner
• Problems associated with non-hierarchical clustering
  – When to stop (e.g., based on MI, group-average similarity, or likelihood)
  – What is the right number of clusters (k−1 → k → k+1)?
    • Hierarchical clustering also has to face this problem
• Algorithms introduced here
  – The K-means algorithm
  – The EM algorithm
The K-means Algorithm
• A hard clustering algorithm
• Defines clusters by the center of mass of their members
• Initialization
  – A set of initial cluster centers is needed
• Recursion
  – Assign each object to the cluster whose center is closest
  – Then re-compute the center of each cluster as the centroid or mean of its members
• Using the medoid as the cluster center instead?
The K-means Algorithm
• Algorithm (the pseudocode figure is not reproduced; it alternates cluster assignment with calculation of each new centroid — see the sketch below)
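A minimal NumPy sketch of the loop just described (initial centers drawn from the data; all names are illustrative):

```python
import random
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Hard K-means on an (n, m) data matrix X."""
    rng = random.Random(seed)
    centers = X[rng.sample(range(len(X)), k)].copy()  # initialization
    for _ in range(max_iters):
        # assignment: each object goes to the closest center
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # re-computation: each center becomes the mean of its members
        new_centers = np.array([
            X[labels == j].mean(axis=0) if (labels == j).any() else centers[j]
            for j in range(k)])
        if np.allclose(new_centers, centers):
            break                       # assignments have stabilized
        centers = new_centers
    return centers, labels
```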
The K-means Algorithm
• Example 1 (figure not reproduced)
The K-means Algorithm
• Example 2: clustering documents by topic (government, finance, sports, research; figure not reproduced)