Partitioning Algorithms: Basic Concept
Partitioning method: Construct a partition of n objects into k clusters, s.t. intra-cluster similarity is maximized and inter-cluster similarity is minimized
One objective: minimize the sum of squared distances from each point to its cluster centroid:
E = \sum_{i=1}^{k} \sum_{p \in C_i} (p - m_i)^2
How to find the optimal partition?
Number of partitionings
Stirling partition number S(n, k): number of ways to partition n objects into k non-empty subsets
  (n = 5, k = 1, 2, 3, 4, 5): 1, 15, 25, 10, 1
  (n = 10, k = 1, 2, 3, 4, 5, ...): 1, 511, 9330, 34105, 42525, ...
Bell numbers B(n): number of ways to partition n objects into any number of subsets
  (n = 0, 1, 2, 3, 4, 5, ...): 1, 1, 2, 5, 15, 52, 203, 877, 4140, 21147, 115975, 678570, 4213597, 27644437, 190899322, 1382958545, 10480142147, 82864869804, 682076806159, 5832742205057, ...
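As an illustration of where these counts come from, here is a minimal Python sketch computing Stirling numbers of the second kind via the recurrence S(n, k) = k·S(n−1, k) + S(n−1, k−1), with Bell numbers as their row sums (function names are illustrative):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def stirling2(n, k):
    """Number of ways to partition n objects into k non-empty subsets."""
    if n == 0 and k == 0:
        return 1
    if n == 0 or k == 0:
        return 0
    # Either the n-th object joins one of the k existing subsets,
    # or it forms a new subset on its own.
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)

def bell(n):
    """Number of ways to partition n objects into any number of subsets."""
    return sum(stirling2(n, k) for k in range(n + 1))

print([stirling2(10, k) for k in range(1, 6)])  # [1, 511, 9330, 34105, 42525]
print([bell(n) for n in range(6)])              # [1, 1, 2, 5, 15, 52]
```

Even for modest n these counts explode, which is why exhaustive search over all partitions is infeasible and heuristic methods are used instead.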
Partitioning Algorithms: Basic Concept
Partitioning method: Construct a partition of n objects into k clusters, s.t. intra-cluster similarity is maximized and inter-cluster similarity is minimized
One objective: minimize the sum of squared distances from each point to its cluster centroid:
E = \sum_{i=1}^{k} \sum_{p \in C_i} (p - m_i)^2
Heuristic methods: k-means and k-medoids algorithms
  k-means (Lloyd '57, MacQueen '67): each cluster is represented by the center (mean) of the cluster
  k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw '87): each cluster is represented by one of the objects in the cluster
K-Means Clustering: Lloyd's Algorithm
1. Given k, randomly choose k initial cluster centers
2. Partition the objects into k nonempty subsets by assigning each object to the cluster with the nearest centroid
3. Update each centroid, i.e., the mean point of its cluster
4. Go back to Step 2; stop when there are no new assignments and the centroids do not change
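A minimal NumPy sketch of Lloyd's algorithm as outlined in the steps above (illustrative, not a reference implementation; the random initialization and convergence test are simple choices):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Lloyd's algorithm: X is an (n, d) array, returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly choose k initial cluster centers from the data
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each object to the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: update each centroid to the mean of its cluster
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop when the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```

Usage: centroids, labels = kmeans(X, k=3) for an (n, d) data array X.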
The K-Means Clustering Method: Example
[Figure: K = 2. Arbitrarily choose K objects as initial cluster centers; assign each object to the most similar center; update the cluster means; reassign objects; update the cluster means again; repeat until assignments stabilize.]
K-means Clustering – Details
Initial centroids are often chosen randomly
  Example: pick one point at random, then k-1 other points, each as far away as possible from the previously chosen points (see the sketch below)
The centroid is (typically) the mean of the points in the cluster
'Nearest' can be measured by Euclidean distance, cosine similarity, correlation, etc.
Most of the convergence happens in the first few iterations
  Often the stopping condition is relaxed to 'until relatively few points change clusters'
Complexity is O(tkn), where n is # objects, k is # clusters, and t is # iterations
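A small sketch of the farthest-point initialization heuristic mentioned above (assumes Euclidean distance; the function name is illustrative):

```python
import numpy as np

def farthest_point_init(X, k, seed=0):
    """Pick one point at random, then k-1 more, each maximizing its
    distance to the nearest already-chosen center."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Distance of every point to its closest chosen center
        d = np.min(
            np.linalg.norm(X[:, None, :] - np.array(centers)[None, :, :], axis=2),
            axis=1,
        )
        centers.append(X[d.argmax()])  # the farthest point becomes the next center
    return np.array(centers)
```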
Comments on the K-Means Method
Strengths
  Simple and works well for "regular" disjoint clusters
  Relatively efficient and scalable (normally, k, t << n)
Weaknesses
  Need to specify k, the number of clusters, in advance
  Depending on the initial centroids, may terminate at a local optimum
  Sensitive to noisy data and outliers
  Not suitable for clusters of different sizes or non-convex shapes
Getting the k right
How to select k?
Try different k, looking at the change in the average distance to centroid (or SSE) as k increases
The average falls rapidly until the right k, then changes little
[Figure: average distance to centroid vs. k; the best value of k is where the curve flattens]
Source: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
Example: Picking k
[Figure: too few clusters — many long distances to centroid]
Source: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
Example: Picking k
[Figure: just right — distances to centroid are rather short]
Source: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
Example: Picking k
[Figure: too many clusters — little improvement in average distance]
Source: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
Importance of Choosing Initial Centroids – Case 1
[Figure: snapshots of k-means assignments and centroids over iterations 1–6 for one choice of initial centroids]
Importance of Choosing Initial Centroids – Case 2
[Figure: snapshots of k-means assignments and centroids over iterations 1–5 for a different choice of initial centroids]
Limitations of K-means: Differing Sizes
[Figures: original points vs. the K-means result with 3 clusters]
Limitations of K-means: Non-convex Shapes
[Figures: original points vs. the K-means result with 2 clusters]
Overcoming K-means Limitations
[Figures: original points vs. K-means clusters]
Assignment 2
Implement k-means clustering
Evaluate the results
Cluster Analysis: Basic Concepts and Methods
  Cluster Analysis: Basic Concepts
  Similarity and distances
  Partitioning Methods
  Hierarchical Methods
  Density-Based Methods
  Probabilistic Methods
  Evaluation of Clustering
Cluster Evaluation
Determine the clustering tendency of the data, i.e., distinguish whether non-random structure exists
Determine the correct number of clusters
Evaluate the cohesion and separation of the clustering without external information
Evaluate how well the clustering results compare to externally known results
Compare different clustering algorithms/results
Measures
Unsupervised (internal): measure the goodness of a clustering structure without respect to external information
  Example: Sum of Squared Error (SSE)
Supervised (external): measure the extent to which cluster labels match externally supplied class labels
  Example: entropy
Relative: compare two different clustering results
  Often an external or internal index is used for this, e.g., SSE or entropy
Internal Measures: Cohesion and Separation
Cluster cohesion: how closely related the objects in a cluster are
Cluster separation: how distinct or well-separated a cluster is from other clusters
Example: squared error
  Cohesion: within-cluster sum of squares, WSS = \sum_{i} \sum_{x \in C_i} (x - m_i)^2
  Separation: between-cluster sum of squares, BSS = \sum_{i < j} (m_i - m_j)^2 (over pairs of cluster centroids)
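A sketch of these two quantities in NumPy, taking the slide's pairwise-centroid form of BSS (summed over unordered pairs of centroids); names are illustrative:

```python
import numpy as np

def cohesion_separation(X, labels):
    """Within-cluster SSE (cohesion) and pairwise between-centroid
    sum of squares (separation), following the formulas above."""
    ids = np.unique(labels)
    centroids = np.array([X[labels == i].mean(axis=0) for i in ids])
    # WSS: squared distances of points to their own centroid
    wss = sum(
        np.sum((X[labels == i] - centroids[n]) ** 2)
        for n, i in enumerate(ids)
    )
    # BSS: squared distances between unordered pairs of cluster centroids
    bss = sum(
        np.sum((centroids[a] - centroids[b]) ** 2)
        for a in range(len(ids)) for b in range(a + 1, len(ids))
    )
    return wss, bss
```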
Cluster Validity: Clusters found in Random Data
[Figures: uniformly random points, and the "clusters" found in them by DBSCAN, K-means, and complete-link hierarchical clustering]
Internal Measures: Cluster Validity
Statistical framework for cluster validity
  The more "atypical" a clustering result is, the more likely it reflects valid structure in the data
  Use values obtained from random data as a baseline
Example: clustering the data gives SSE = 0.005
[Figure: histogram of the SSE of three clusters found in 500 sets of random data points; the random SSE values fall roughly between 0.016 and 0.034, far above 0.005]
Internal Measures: Number of Clusters
SSE is good for comparing two clusterings
It can also be used to estimate the number of clusters
Elbow method: use the turning point in the curve of SSE vs. the number of clusters K
[Figure: an example data set (left) and its SSE vs. K curve (right); the elbow indicates a good choice of K]
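A short sketch of the elbow method, assuming scikit-learn is available (KMeans.inertia_ is the within-cluster SSE):

```python
import numpy as np
from sklearn.cluster import KMeans

def elbow_curve(X, k_max=10):
    """SSE (inertia) as a function of k; pick k near the bend of the curve."""
    sse = {}
    for k in range(1, k_max + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        sse[k] = km.inertia_  # within-cluster sum of squared distances
    return sse

# Example with toy data:
# X = np.random.rand(300, 2)
# for k, v in elbow_curve(X).items():
#     print(k, round(v, 3))
```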
Internal Measures: Number of Clusters
Another example: a more complicated data set with a varying number of clusters
[Figure: the data set with sub-clusters labeled 1–7, and the SSE of the clusters found using K-means as the number of clusters varies]
External Measures
Compare clustering results with "ground truth" or a manually constructed clustering
Still different from classification measures
Classification-oriented measures: entropy/purity-based, precision- and recall-based
Similarity-oriented measures: Jaccard scores
External Measures: Classification-Oriented Measures
Entropy-based measures: the degree to which each cluster consists of objects of a single class
Purity: based on the majority class in each cluster
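An illustrative sketch (not from the slides) computing size-weighted purity and entropy from cluster assignments and true class labels:

```python
import numpy as np
from collections import Counter

def purity_and_entropy(clusters, classes):
    """clusters, classes: sequences of labels for the same objects."""
    n = len(clusters)
    total_purity, total_entropy = 0.0, 0.0
    for c in set(clusters):
        members = [classes[i] for i in range(n) if clusters[i] == c]
        counts = np.array(list(Counter(members).values()), dtype=float)
        p = counts / counts.sum()
        total_purity += (len(members) / n) * p.max()            # majority-class fraction
        total_entropy += (len(members) / n) * -(p * np.log2(p)).sum()
    return total_purity, total_entropy

# Example with hypothetical labels:
# purity, entropy = purity_and_entropy([0, 0, 1, 1, 1], ['a', 'a', 'a', 'b', 'b'])
```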
External Measures: Classification-Oriented Measures
BCubed precision and recall: precision and recall computed per object
  Precision of an object: the proportion of objects in its cluster that belong to its category
  Recall of an object: the proportion of objects in its category that are assigned to its cluster
BCubed precision and recall are the averages of these per-object values over all objects
BCubed precision and recall
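A sketch of BCubed precision and recall following the per-object definitions above (variable names are my own):

```python
from collections import Counter

def bcubed(clusters, classes):
    """Average per-object BCubed precision and recall.
    clusters[i] and classes[i] are the cluster and category of object i."""
    n = len(clusters)
    cluster_sizes = Counter(clusters)
    class_sizes = Counter(classes)
    # joint[(c, g)] = number of objects in cluster c with category g
    joint = Counter(zip(clusters, classes))
    precision = recall = 0.0
    for i in range(n):
        same = joint[(clusters[i], classes[i])]   # objects sharing both cluster and category (incl. i)
        precision += same / cluster_sizes[clusters[i]]
        recall += same / class_sizes[classes[i]]
    return precision / n, recall / n
```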
External Measures: Similarity-Oriented Measures
Given a reference clustering T and a clustering S, count over all pairs of points:
  f_00: number of pairs belonging to different clusters in both T and S
  f_01: number of pairs belonging to different clusters in T but the same cluster in S
  f_10: number of pairs belonging to the same cluster in T but different clusters in S
  f_11: number of pairs belonging to the same cluster in both T and S
Rand statistic = (f_00 + f_11) / (f_00 + f_01 + f_10 + f_11)
Jaccard coefficient = f_11 / (f_01 + f_10 + f_11)
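These pair counts and indices can be computed directly, as in this illustrative sketch:

```python
from itertools import combinations

def rand_and_jaccard(T, S):
    """Pair-counting comparison of two clusterings T and S
    (sequences of cluster labels for the same objects)."""
    f00 = f01 = f10 = f11 = 0
    for i, j in combinations(range(len(T)), 2):
        same_t = T[i] == T[j]
        same_s = S[i] == S[j]
        if same_t and same_s:
            f11 += 1
        elif same_t:
            f10 += 1
        elif same_s:
            f01 += 1
        else:
            f00 += 1
    rand = (f00 + f11) / (f00 + f01 + f10 + f11)
    jaccard = f11 / (f01 + f10 + f11)   # undefined if no pair shares a cluster in either
    return rand, jaccard
```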
Cluster Analysis: Basic Concepts and Methods
  Cluster Analysis: Basic Concepts
  Similarity and distances
  Partitioning Methods
  Hierarchical Methods
  Density-Based Methods
  Probabilistic Methods
  Evaluation of Clustering
Variations of the K-Means Method
A few variants of k-means differ in
  Selection of the initial k means
  Dissimilarity calculations
  Strategies to calculate cluster means
Handling categorical data: k-modes (Huang '98)
  Replacing means of clusters with modes
  Using new dissimilarity measures to deal with categorical objects
  Using a frequency-based method to update modes of clusters
A mixture of categorical and numerical data: k-prototype method
K-Medoids Method
The k-means algorithm is sensitive to outliers, since an object with an extremely large value may substantially distort the mean of the data
K-Medoids: instead of using the mean as the cluster representative, use the medoid, the most centrally located object in the cluster
Possible number of solutions?
The K-Medoids Clustering Method
PAM (Partitioning Around Medoids) (Kaufman and Rousseeuw, 1987)
1. Arbitrarily select k objects as medoids
2. Assign each data object in the given data set to the most similar medoid
3. For each non-medoid object O' and medoid object O, compute the total cost S of swapping O with O' (cost = total sum of absolute error)
4. If min S < 0, swap O with O'
5. Repeat until there is no change in the medoids
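A compact NumPy sketch of PAM as outlined above, using total absolute (Manhattan) error as the cost, per the slide; the swap loop is the naive scan over all medoid/non-medoid pairs:

```python
import numpy as np

def pam(X, k, seed=0):
    """Partitioning Around Medoids. Returns the indices of the k medoids."""
    rng = np.random.default_rng(seed)
    D = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)  # pairwise Manhattan distances
    medoids = list(rng.choice(len(X), size=k, replace=False))

    def total_cost(meds):
        # Each object contributes its distance to the nearest medoid
        return D[:, meds].min(axis=1).sum()

    cost = total_cost(medoids)
    improved = True
    while improved:
        improved = False
        for mi in range(k):                        # each current medoid
            for o in range(len(X)):                # each non-medoid candidate
                if o in medoids:
                    continue
                trial = medoids.copy()
                trial[mi] = o                      # try swapping the medoid with the candidate
                c = total_cost(trial)
                if c < cost:                       # keep the swap if it lowers the cost
                    medoids, cost, improved = trial, c, True
    return medoids
```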
A Typical K-Medoids Algorithm (PAM)
[Figure: K = 2 example. Arbitrarily choose k objects as initial medoids and assign each remaining object to the nearest medoid (total cost = 20); repeatedly select a non-medoid object O_random, compute the total cost of swapping it with a medoid O (total cost = 26 in the example), and swap if the quality is improved; repeat until no change.]
What Is the Problem with PAM?
PAM is more robust than k-means in the presence of noise and outliers
PAM works efficiently for small data sets but does not scale well to large data sets
Complexity: O(k(n-k)^2), where n is # of data points and k is # of clusters
CLARA (Clustering Large Applications) (1990)
CLARA (Kaufmann and Rousseeuw, 1990)
Draws multiple samples of the data set, applies PAM on each sample, and returns the best clustering as the output
CLARANS ("Randomized" CLARA) (1994)
CLARANS (A Clustering Algorithm based on Randomized Search) (Ng and Han '94)
The clustering process can be represented as searching a graph where every node is a potential solution, that is, a set of k medoids
Search graph
[Figure: the CLARANS search graph, where each node is a set of k medoids and neighboring nodes differ in one medoid]
CLARANS ("Randomized" CLARA) (1994)
CLARANS (A Clustering Algorithm based on Randomized Search) (Ng and Han '94)
The clustering process can be represented as searching a graph where every node is a potential solution, that is, a set of k medoids
  PAM examines all neighbors to find a local minimum
  CLARA works on subgraphs induced by samples
  CLARANS examines neighbors dynamically
    Limits the number of neighbors to explore (maxneighbor)
    If a local optimum is found, restarts from a new randomly selected node to search for another local optimum (numlocal)
Cluster Analysis: Basic Concepts and Methods
  Cluster Analysis: Basic Concepts
  Similarity and distances
  Partitioning Methods
  Hierarchical Methods
  Density-Based Methods
  Probabilistic Methods
  Evaluation of Clustering
Hierarchical Clustering
Produces a set of nested clusters
Can be visualized as a dendrogram, a tree-like diagram
  The y-axis measures closeness (merge distance)
  A clustering is obtained by cutting the dendrogram at the desired level
Does not require assuming any particular number of clusters
May correspond to meaningful taxonomies
[Figure: six points (1–6) with nested clusters and the corresponding dendrogram]
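For reference, such a dendrogram can be produced and cut with SciPy, assuming SciPy and matplotlib are installed (a usage sketch, not part of the slides):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

X = np.random.rand(20, 2)                  # toy data
Z = linkage(X, method="single")            # agglomerative merge history
dendrogram(Z)                              # tree-like diagram; y-axis = merge distance
plt.show()

labels = fcluster(Z, t=0.15, criterion="distance")  # cut the tree at height 0.15
print(labels)
```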
Hierarchical Clustering
Two main types of hierarchical clustering
  Agglomerative (AGNES)
    Start with the points as individual clusters
    At each step, merge the closest pair of clusters until only one cluster (or k clusters) remains
  Divisive (DIANA)
    Start with one, all-inclusive cluster
    At each step, split a cluster until each cluster contains a single point (or there are k clusters)
Agglomerative Clustering Algorithm
1. Compute the proximity matrix
2. Let each data point be a cluster
3. Repeat
4.   Merge the two closest clusters
5.   Update the proximity matrix
6. Until only a single cluster remains
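A naive NumPy sketch of this loop (O(N^3), as discussed later), with the linkage rule left pluggable so it can realize the MIN, MAX, or group-average definitions that follow:

```python
import numpy as np

def agglomerative(X, k, linkage="min"):
    """Merge clusters until k remain. Returns a list of index lists."""
    clusters = [[i] for i in range(len(X))]                        # step 2: each point is a cluster
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)      # step 1: proximity matrix

    def cluster_dist(a, b):
        pair = D[np.ix_(a, b)]               # distances between members of the two clusters
        if linkage == "min":                 # single link
            return pair.min()
        if linkage == "max":                 # complete link
            return pair.max()
        return pair.mean()                   # group average

    while len(clusters) > k:                 # steps 3/6: repeat until k clusters remain
        # step 4: find and merge the two closest clusters
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: cluster_dist(clusters[ab[0]], clusters[ab[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
        # step 5: cluster-to-cluster proximities are recomputed from D on demand
    return clusters
```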
Starting Situation
Start with clusters of individual points and a proximity matrix
[Figure: points p1, p2, ..., p12 and their pairwise proximity matrix]
Intermediate Situation
[Figure: after some merges there are clusters C1–C5, and the proximity matrix is now defined between clusters]
How to Define Inter-Cluster Similarity?
[Figure: two clusters highlighted in the proximity matrix — which entry defines their similarity?]
Distance between Clusters
Single link: smallest distance between an element in one cluster and an element in the other, i.e., dist(K_i, K_j) = min(t_ip, t_jq)
Complete link: largest distance between an element in one cluster and an element in the other, i.e., dist(K_i, K_j) = max(t_ip, t_jq)
Average: average distance between an element in one cluster and an element in the other, i.e., dist(K_i, K_j) = avg(t_ip, t_jq)
Centroid: distance between the centroids of two clusters, i.e., dist(K_i, K_j) = dist(C_i, C_j)
Medoid: distance between the medoids of two clusters, i.e., dist(K_i, K_j) = dist(M_i, M_j)
  Medoid: a chosen, centrally located object in the cluster
Hierarchical Clustering: MIN
[Figure: nested single-link clusters over six points and the corresponding dendrogram]
View points/similarities as a graph
Start with clusters of individual points and a proximity matrix
[Figure: the points as graph nodes, with the proximity matrix giving the edge weights]
Single link clustering and MST (Minimum Spanning Tree)
An agglomerative algorithm using minimum distance (single-link clustering) is essentially the same as Kruskal's algorithm for the minimal spanning tree (MST)
MST: a subgraph that is a tree, connects all vertices, and has minimum total weight
Kruskal's algorithm: add edges in increasing order of weight, skipping those whose addition would create a cycle
Prim's algorithm: grow a tree from any root node, repeatedly adding the frontier edge with the smallest weight
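To make the correspondence concrete, here is an illustrative sketch that performs Kruskal-style edge merging with union-find and stops once k components remain, which yields exactly a single-link clustering into k clusters:

```python
import numpy as np

def single_link_kruskal(X, k):
    """Single-link clustering via Kruskal-style merging; returns labels."""
    n = len(X)
    parent = list(range(n))

    def find(a):                        # union-find with path halving
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    # All edges (pairwise distances), sorted by increasing weight
    edges = sorted(
        (np.linalg.norm(X[i] - X[j]), i, j)
        for i in range(n) for j in range(i + 1, n)
    )
    components = n
    for _, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                    # skip edges that would create a cycle
            parent[ri] = rj
            components -= 1
            if components == k:         # stop early: k single-link clusters
                break
    roots = {find(i) for i in range(n)}
    label_of = {r: idx for idx, r in enumerate(sorted(roots))}
    return [label_of[find(i)] for i in range(n)]
```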
Min vs. Max vs. Group Average
[Figures: the clusterings produced by MIN, MAX, and group-average linkage on the same six points]
Strength of MIN
[Figures: original points vs. the two clusters found]
• Can handle clusters with varying sizes
• Can also handle non-elliptical shapes
Limitations of MAX
[Figures: original points vs. the two clusters found]
• Tends to break large clusters
• Biased towards globular clusters
Limitations of MIN
[Figures: original points vs. the two clusters found]
• Chaining phenomenon
• Sensitive to noise and outliers
Strength of MAX
[Figures: original points vs. the two clusters found]
• Less susceptible to noise and outliers
Hierarchical Clustering: Group Average
Compromise between single and complete link
Strengths
  Less susceptible to noise and outliers
Limitations
  Biased towards globular clusters
Hierarchical Clustering: Major Weaknesses
Does not scale well (N: number of points)
  Space complexity: O(N^2)
  Time complexity: O(N^3); O(N^2 log N) for some cases/approaches
Cannot undo what was done previously
Quality varies depending on the distance measure
  MIN (single link): susceptible to noise/outliers
  MAX / group average: may not work well with non-globular clusters
Cluster Analysis: Basic Concepts and Methods
  Cluster Analysis: Basic Concepts
  Similarity and distances
  Partitioning Methods
  Hierarchical Methods
  Density-Based Methods
  Probabilistic Methods
  Evaluation of Clustering
Density-Based Clustering Methods
Clustering based on density
Major features:
  Discovers clusters of arbitrary shape
  Handles noise
  One scan
  Needs density parameters as termination condition
Several interesting studies:
  DBSCAN: Ester, et al. (KDD '96)
  OPTICS: Ankerst, et al. (SIGMOD '99)
  DENCLUE: Hinneburg & Keim (KDD '98)
  CLIQUE: Agrawal, et al. (SIGMOD '98) (more grid-based)
DBSCAN: Basic Concepts
Density = number of points within a specified radius
Core point: has high density
Border point: has lower density, but lies in the neighborhood of a core point
Noise point: neither a core point nor a border point
[Figure: example showing a core point, a border point, and a noise point]
DBSCAN: Definitions
Two parameters:
  Eps: radius of the neighborhood
  MinPts: minimum number of points in an Eps-neighborhood of that point
N_Eps(p) = {q in D | dist(p, q) <= Eps}
Core point: |N_Eps(q)| >= MinPts
[Figure: example with MinPts = 5 and Eps = 1 cm]
DBSCAN: Definitions
Directly density-reachable (p from q): p belongs to N_Eps(q) and q is a core point
Density-reachable (p from q): there is a chain of points p_1, ..., p_n with p_1 = q and p_n = p such that p_{i+1} is directly density-reachable from p_i
Density-connected (p and q): there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts
[Figure: examples with MinPts = 5 and Eps = 1 cm]
DBSCAN: Cluster Definition
A cluster is defined as a maximal set of density-connected points
[Figure: core, border, and outlier points for Eps = 1 cm, MinPts = 5]
DBSCAN: The Algorithm
1. Arbitrarily select an unvisited point p and retrieve all points density-reachable from p w.r.t. Eps and MinPts
2. If p is a core point, a cluster is formed: add all neighbors of p to the cluster, and recursively add their neighbors if they are core points
3. Otherwise, mark p as a noise point (it may later turn out to be a border point of some cluster)
4. Continue the process until all of the points have been processed
Complexity: O(n^2); with a spatial index, O(n log n)
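A straightforward O(n^2) sketch of this procedure without a spatial index (illustrative; the label conventions are my own):

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Returns a label per point: 0..c-1 for clusters, -1 for noise."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = [np.flatnonzero(D[i] <= eps) for i in range(n)]  # Eps-neighborhoods
    labels = np.full(n, -2)            # -2 = unvisited, -1 = noise
    cluster = 0
    for p in range(n):
        if labels[p] != -2:
            continue
        if len(neighbors[p]) < min_pts:
            labels[p] = -1             # not a core point: tentatively noise
            continue
        # p is a core point: grow a new cluster from it
        labels[p] = cluster
        queue = list(neighbors[p])
        while queue:
            q = queue.pop()
            if labels[q] == -1:
                labels[q] = cluster    # previously-noise point becomes a border point
            if labels[q] != -2:
                continue
            labels[q] = cluster
            if len(neighbors[q]) >= min_pts:
                queue.extend(neighbors[q])  # expand only through core points
        cluster += 1
    return labels
```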
DBSCAN: Sensitive to Parameters
[Figure: clusterings produced by DBSCAN under different Eps/MinPts settings]
DBSCAN: Determining Eps and MinPts
Basic idea (given MinPts = k, find Eps):
  For points in a cluster, their k-th nearest neighbors are at roughly the same distance
  Noise points have their k-th nearest neighbor at a farther distance
  Plot the sorted distance of every point to its k-th nearest neighbor and look for a "knee" in the curve
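A quick sketch of the k-distance plot described above (assumes matplotlib is available):

```python
import numpy as np
import matplotlib.pyplot as plt

def k_distance_plot(X, k):
    """Sorted distance of every point to its k-th nearest neighbor;
    a 'knee' in this curve suggests a value for Eps."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Column k of the row-sorted matrix is the k-th nearest neighbor
    # distance (column 0 is the point itself).
    kth = np.sort(D, axis=1)[:, k]
    kth_sorted = np.sort(kth)[::-1]
    plt.plot(kth_sorted)
    plt.xlabel("points sorted by k-distance")
    plt.ylabel(f"distance to {k}-th nearest neighbor")
    plt.show()
    return kth_sorted
```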