
Data Mining: Concepts and Techniques
Cluster Analysis
Li Xiong
Slide credits: Jiawei Han and Micheline Kamber; Tan, Steinbach, Kumar
Cluster Analysis: Basic Concepts and Methods


  1. Partitioning Algorithms: Basic Concept  Partitioning method: construct a partition of n objects into k clusters, s.t. intra-cluster similarity is maximized and inter-cluster similarity is minimized.  One objective: minimize the sum of squared distances from each object to its cluster centroid, $\sum_{i=1}^{k} \sum_{p \in C_i} (p - m_i)^2$.  How to find the optimal partition?

  2. Number of partitionings

  3. Number of partitionings  Stirling partition number: the number of ways to partition n objects into k non-empty subsets.  (n = 5, k = 1, 2, 3, 4, 5): 1, 15, 25, 10, 1.  (n = 10, k = 1, 2, 3, 4, 5, ...): 1, 511, 9330, 34105, 42525, ...  Bell numbers: the number of ways to partition n objects into any number of subsets.  (n = 0, 1, 2, 3, 4, 5, ...): 1, 1, 2, 5, 15, 52, 203, 877, 4140, 21147, 115975, 678570, 4213597, 27644437, 190899322, 1382958545, 10480142147, 82864869804, 682076806159, 5832742205057, ...
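These counts grow so quickly that exhaustive search over all partitions is hopeless. A small sketch (not from the slides) that reproduces them from the standard recurrences S(n, k) = k·S(n−1, k) + S(n−1, k−1) and B(n) = Σ_k S(n, k):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def stirling2(n: int, k: int) -> int:
    """Stirling number of the second kind: ways to partition n objects into k non-empty subsets."""
    if n == k:
        return 1          # includes S(0, 0) = 1
    if n == 0 or k == 0:
        return 0
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)

def bell(n: int) -> int:
    """Bell number: total number of ways to partition n objects."""
    return sum(stirling2(n, k) for k in range(n + 1))

print([stirling2(10, k) for k in range(1, 6)])  # [1, 511, 9330, 34105, 42525]
print([bell(n) for n in range(6)])              # [1, 1, 2, 5, 15, 52]
```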

  4. Partitioning Algorithms: Basic Concept  Partitioning method: construct a partition of n objects into k clusters, s.t. intra-cluster similarity is maximized and inter-cluster similarity is minimized.  One objective: minimize the sum of squared distances from each object to its cluster centroid, $\sum_{i=1}^{k} \sum_{p \in C_i} (p - m_i)^2$.  Heuristic methods: the k-means and k-medoids algorithms.  k-means (Lloyd'57, MacQueen'67): each cluster is represented by the center of the cluster.  k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw'87): each cluster is represented by one of the objects in the cluster.

  5. K-Means Clustering: Lloyd Algorithm  1. Given k, randomly choose k initial cluster centers.  2. Partition the objects into k non-empty subsets by assigning each object to the cluster with the nearest centroid.  3. Update each centroid, i.e., the mean point of its cluster.  4. Go back to Step 2; stop when there are no new assignments and the centroids do not change.
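A minimal sketch of these four steps with NumPy (not the slides' own code; the names kmeans, max_iter, and seed are illustrative):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: randomly choose k initial cluster centers from the data
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each object to the cluster with the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: update each centroid to the mean of its cluster
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        # Step 4: stop when the centroids no longer change
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Toy usage: two well-separated blobs
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centers = kmeans(X, k=2)
```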

  6. The K-Means Clustering Method: Example  (Figure, K = 2: arbitrarily choose K objects as initial cluster centers; assign each object to the most similar center; update the cluster means; reassign objects; update the means again; repeat until the assignments stabilize.)

  7. K-means Clustering – Details  Initial centroids are often chosen randomly.  Example heuristic: pick one point at random, then k−1 other points, each as far away as possible from the previously chosen points.  The centroid is (typically) the mean of the points in the cluster.  'Nearest' is measured by Euclidean distance, cosine similarity, correlation, etc.  Most of the convergence happens in the first few iterations, so the stopping condition is often relaxed to 'until relatively few points change clusters'.  Complexity is O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations.
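The "as far away as possible" initialization mentioned above could look like the following sketch (farthest-first seeding; the name farthest_first_init is illustrative):

```python
import numpy as np

def farthest_first_init(X, k, seed=0):
    """Pick one point at random, then k-1 more, each as far as possible
    from the centers chosen so far."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # distance from every point to its nearest already-chosen center
        d = np.min(np.linalg.norm(X[:, None, :] - np.array(centers)[None, :, :], axis=2), axis=1)
        centers.append(X[d.argmax()])       # take the point farthest from all chosen centers
    return np.array(centers)
```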

  8. Comments on the K-Means Method  Strength:  Simple and works well for "regular", disjoint clusters.  Relatively efficient and scalable (normally k, t << n).  Weakness:  Need to specify k, the number of clusters, in advance.  Depending on the initial centroids, may terminate at a local optimum.  Sensitive to noisy data and outliers.  Not suitable for clusters of different sizes or non-convex shapes.

  9. Getting the k right  How to select k?  Try different values of k, looking at the change in the average distance to centroid (or SSE) as k increases.  The average falls rapidly until the right k, then changes little.  (Figure: average distance to centroid vs. k; the best value of k is where the curve flattens.)  J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
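A hedged sketch of this procedure, reusing the kmeans function sketched earlier on an illustrative three-blob data set:

```python
import numpy as np

def avg_dist_to_centroid(X, labels, centers):
    # average distance of each object to the centroid it was assigned to
    return np.mean(np.linalg.norm(X - centers[labels], axis=1))

X = np.vstack([np.random.randn(50, 2) + c for c in ([0, 0], [5, 5], [0, 5])])
curve = {}
for k in range(1, 9):
    labels, centers = kmeans(X, k)          # the Lloyd sketch from earlier
    curve[k] = avg_dist_to_centroid(X, labels, centers)
# The values should fall sharply up to k = 3 (the number of blobs) and change little after.
```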

  10. Example: Picking k  (Figure: k too small; many long distances to centroid.)  J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

  11. Example: Picking k  (Figure: k just right; distances to centroid are rather short.)  J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

  12. Example: Picking k  (Figure: k too large; little improvement in the average distance.)  J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

  13. Importance of Choosing Initial Centroids – Case 1  (Figure: iterations 1–6 of k-means from one choice of initial centroids.)

  14. Importance of Choosing Initial Centroids – Case 2  (Figure: iterations 1–5 of k-means from a different choice of initial centroids.)

  15. Limitations of K-means: Differing Sizes  (Figure: original points vs. the 3 clusters found by K-means.)

  16. Limitations of K-means: Non-convex Shapes  (Figure: original points vs. the 2 clusters found by K-means.)

  17. Overcoming K-means Limitations  (Figure: original points vs. K-means clusters.)

  18. Overcoming K-means Limitations  (Figure: original points vs. K-means clusters.)

  19. Assignment 2  Implement k-means clustering.  Evaluate the results.

  20. Cluster Analysis: Basic Concepts and Methods  Cluster Analysis: Basic Concepts  Similarity and distances  Partitioning Methods  Hierarchical Methods  Density-Based Methods  Probabilistic Methods  Evaluation of Clustering

  21. Cluster Evaluation  Determine the clustering tendency of the data, i.e., distinguish whether non-random structure exists.  Determine the correct number of clusters.  Evaluate the cohesion and separation of the clustering without external information.  Evaluate how well the clustering results compare to externally known results.  Compare different clustering algorithms/results.

  22. Measures  Unsupervised (internal): used to measure the goodness of a clustering structure without respect to external information, e.g., Sum of Squared Error (SSE).  Supervised (external): used to measure the extent to which cluster labels match externally supplied class labels, e.g., entropy.  Relative: used to compare two different clustering results; often an external or internal index is used for this purpose, e.g., SSE or entropy.

  23. Internal Measures: Cohesion and Separation  Cluster Cohesion: how closely related the objects in a cluster are.  Cluster Separation: how distinct or well-separated a cluster is from the other clusters.  Example: squared error.  Cohesion: within-cluster sum of squares (SSE), $\mathrm{WSS} = \sum_{i} \sum_{x \in C_i} (x - m_i)^2$.  Separation: between-cluster sum of squares over pairs of cluster means, $\mathrm{BSS} = \sum_{i < j} (m_i - m_j)^2$.
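An illustrative computation of both quantities under these definitions (the pairwise BSS above; another common convention instead weights each cluster's squared distance to the overall mean by the cluster size):

```python
import numpy as np

def cohesion_separation(X, labels):
    groups = [X[labels == c] for c in np.unique(labels)]
    means = [g.mean(axis=0) for g in groups]
    # Cohesion: within-cluster sum of squares (WSS / SSE)
    wss = sum(np.sum((g - m) ** 2) for g, m in zip(groups, means))
    # Separation: between-cluster sum of squares over pairs of cluster means
    bss = sum(np.sum((means[i] - means[j]) ** 2)
              for i in range(len(means)) for j in range(i + 1, len(means)))
    return wss, bss
```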

  24. Cluster Validity: Clusters found in Random Data  (Figure: a set of uniformly random points and the "clusters" found in it by DBSCAN, K-means, and complete link.)

  25. Internal Measures: Cluster Validity  Statistical framework for cluster validity: the more "atypical" a clustering result, the more likely it reflects valid structure in the data.  Use the values obtained on random data as a baseline.  Example: the clustering of the data gives SSE = 0.005, while the SSE of three clusters in 500 sets of random data points spans roughly 0.016 to 0.034, so the observed value is highly atypical.  (Figure: the data set and a histogram of SSE over the random sets.)

  26. Internal Measures: Number of Clusters  Good for comparing two clusterings.  Can also be used to estimate the number of clusters.  Elbow method: use the turning point in the curve of SSE with respect to the number of clusters.  (Figure: an example data set and its SSE curve as a function of K.)

  27. Internal Measures: Number of Clusters  Another example: a more complicated data set with a varying number of clusters.  (Figure: SSE of the clusters found using K-means.)

  28. External Measures  Compare clustering results with "ground truth" or a manual clustering.  Still different from classification measures.  Classification-oriented measures: entropy/purity based, precision and recall based.  Similarity-oriented measures: Jaccard scores.

  29. External Measures: Classification-Oriented Measures  Entropy-based measures: the degree to which each cluster consists of objects of a single class.  Purity: based on the majority class in each cluster.

  30. External Measures: Classification-Oriented Measures  BCubed precision and recall: measure the precision and recall associated with each individual object.  Precision of an object: the proportion of objects in the same cluster that belong to the same category.  Recall of an object: the proportion of objects of the same category that are assigned to the same cluster.  BCubed precision and recall are the average precision and recall over all objects.
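A small sketch of these per-object averages under the usual assumptions (hard clusters, one gold category per object); the function name bcubed is illustrative:

```python
def bcubed(cluster_labels, class_labels):
    n = len(cluster_labels)
    precision = recall = 0.0
    for i in range(n):
        same_cluster = [j for j in range(n) if cluster_labels[j] == cluster_labels[i]]
        same_class = [j for j in range(n) if class_labels[j] == class_labels[i]]
        correct = sum(1 for j in same_cluster if class_labels[j] == class_labels[i])
        precision += correct / len(same_cluster)  # in i's cluster, fraction sharing i's category
        recall += correct / len(same_class)       # of i's category, fraction placed in i's cluster
    return precision / n, recall / n

# Example: three objects of class 'a' and one of 'b', clustered as {0,1} and {2,3}
print(bcubed([0, 0, 1, 1], ['a', 'a', 'a', 'b']))  # (0.75, 0.666...)
```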

  31. BCubed precision and recall

  32. External Measure: Similarity-Oriented Measures  Given a reference clustering T and a clustering S:  f00: number of pairs of points belonging to different clusters in both T and S.  f01: number of pairs of points belonging to different clusters in T but the same cluster in S.  f10: number of pairs of points belonging to the same cluster in T but different clusters in S.  f11: number of pairs of points belonging to the same cluster in both T and S.  $\mathrm{Rand} = \frac{f_{00} + f_{11}}{f_{00} + f_{01} + f_{10} + f_{11}}$, $\mathrm{Jaccard} = \frac{f_{11}}{f_{01} + f_{10} + f_{11}}$.
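A direct sketch of the pair counting and the two indices (T and S are label sequences of equal length; the names follow the slide):

```python
from itertools import combinations

def pair_counts(T, S):
    f00 = f01 = f10 = f11 = 0
    for i, j in combinations(range(len(T)), 2):
        same_T, same_S = T[i] == T[j], S[i] == S[j]
        if same_T and same_S:
            f11 += 1          # same cluster in both T and S
        elif same_T:
            f10 += 1          # same in T, different in S
        elif same_S:
            f01 += 1          # different in T, same in S
        else:
            f00 += 1          # different in both
    return f00, f01, f10, f11

def rand_index(T, S):
    f00, f01, f10, f11 = pair_counts(T, S)
    return (f00 + f11) / (f00 + f01 + f10 + f11)

def jaccard_index(T, S):
    f00, f01, f10, f11 = pair_counts(T, S)
    return f11 / (f01 + f10 + f11)
```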

  33. Cluster Analysis: Basic Concepts and Methods  Cluster Analysis: Basic Concepts  Similarity and distances  Partitioning Methods  Hierarchical Methods  Density-Based Methods  Probabilistic Methods  Evaluation of Clustering

  34. Variations of the K-Means Method  A few variants of k-means differ in:  Selection of the initial k means.  Dissimilarity calculations.  Strategies to calculate cluster means.  Handling categorical data: k-modes (Huang'98):  Replacing the means of clusters with modes.  Using new dissimilarity measures to deal with categorical objects.  Using a frequency-based method to update the modes of clusters.  For a mixture of categorical and numerical data: the k-prototype method.

  35. K-Medoids Method  The k-means algorithm is sensitive to outliers, since an object with an extremely large value may substantially distort the mean of the data.  K-Medoids: instead of using the mean as the cluster representative, use the medoid, the most centrally located object in the cluster.  Possible number of solutions?

  36. The K-Medoids Clustering Method  PAM (Partitioning Around Medoids) (Kaufman and Rousseeuw, 1987):  Arbitrarily select k objects as medoids.  Assign each data object in the given data set to the most similar medoid.  For each non-medoid object O' and medoid object O, compute the total cost S of swapping medoid O with O' (cost as the total sum of absolute error).  If min S < 0, then swap O with O'.  Repeat until there is no change in the medoids.
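A naive sketch of this swap loop: every medoid/non-medoid pair is tried and a swap is accepted whenever it lowers the total cost, which is why each pass costs on the order of k(n−k)^2 distance evaluations (see the complexity slides below). The names pam and total_cost are illustrative:

```python
import numpy as np

def total_cost(X, medoid_idx):
    # sum over all objects of the distance to the nearest medoid
    d = np.linalg.norm(X[:, None, :] - X[medoid_idx][None, :, :], axis=2)
    return d.min(axis=1).sum()

def pam(X, k, seed=0):
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(X), size=k, replace=False))
    improved = True
    while improved:
        improved = False
        current = total_cost(X, np.array(medoids))
        for pos in range(k):
            for o in range(len(X)):
                if o in medoids:
                    continue
                candidate = medoids.copy()
                candidate[pos] = o            # try swapping this medoid with non-medoid o
                cost = total_cost(X, np.array(candidate))
                if cost < current:            # accept the swap if it reduces total cost
                    medoids, current = candidate, cost
                    improved = True
    # final assignment of each object to its nearest medoid
    d = np.linalg.norm(X[:, None, :] - X[np.array(medoids)][None, :, :], axis=2)
    return d.argmin(axis=1), medoids
```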

  37. A Typical K-Medoids Algorithm (PAM)  (Figure, K = 2: arbitrarily choose k objects as initial medoids; assign each remaining object to the nearest medoid, total cost = 20; randomly select a non-medoid object O_random; compute the total cost of swapping O and O_random, here 26; swap only if the quality is improved; loop until no change.)

  38. What Is the Problem with PAM?  PAM is more robust than k-means in the presence of noise and outliers.  PAM works efficiently for small data sets but does not scale well to large data sets.  Complexity? (n is the number of data points, k is the number of clusters.)

  39. What Is the Problem with PAM?  PAM is more robust than k-means in the presence of noise and outliers.  PAM works efficiently for small data sets but does not scale well to large data sets.  Complexity? O(k(n-k)^2) per iteration, where n is the number of data points and k is the number of clusters.

  40. CLARA (Clustering Large Applications) (1990)  CLARA (Kaufmann and Rousseeuw, 1990):  Draws multiple samples of the data set, applies PAM on each sample, and returns the best clustering as the output.

  41. CLARANS ("Randomized" CLARA) (1994)  CLARANS (A Clustering Algorithm based on Randomized Search) (Ng and Han'94):  The clustering process can be represented as searching a graph where every node is a potential solution, that is, a set of k medoids.

  42. Search graph

  43. CLARANS ("Randomized" CLARA) (1994)  CLARANS (A Clustering Algorithm based on Randomized Search) (Ng and Han'94):  The clustering process can be represented as searching a graph where every node is a potential solution, that is, a set of k medoids.  PAM examines all neighbors to find a local minimum.  CLARA works on subgraphs induced by samples.  CLARANS examines neighbors dynamically:  It limits the number of neighbors to explore (maxneighbor).  If a local optimum is found, it restarts from a new randomly selected node to search for a new local optimum (numlocal).

  44. Cluster Analysis: Basic Concepts and Methods  Cluster Analysis: Basic Concepts  Similarity and distances  Partitioning Methods  Hierarchical Methods  Density-Based Methods  Probabilistic Methods  Evaluation of Clustering

  45. Overcoming K-means Limitations  (Figure: original points vs. K-means clusters.)

  46. Overcoming K-means Limitations  (Figure: original points vs. K-means clusters.)

  47. Hierarchical Clustering  Produces a set of nested clusters.  Can be visualized as a dendrogram, a tree-like diagram:  The y-axis measures closeness.  A clustering is obtained by cutting the dendrogram at the desired level.  Does not require assuming any particular number of clusters.  May correspond to meaningful taxonomies.  (Figure: nested clusters on six points and the corresponding dendrogram.)


  49. Hierarchical Clustering  Two main types of hierarchical clustering:  Agglomerative (AGNES): start with the points as individual clusters; at each step, merge the closest pair of clusters until only one cluster (or k clusters) is left.  Divisive (DIANA): start with one, all-inclusive cluster; at each step, split a cluster until each cluster contains a single point (or there are k clusters).

  50. Agglomerative Clustering Algorithm  1. Compute the proximity matrix.  2. Let each data point be a cluster.  3. Repeat:  4. Merge the two closest clusters.  5. Update the proximity matrix.  6. Until only a single cluster remains.
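A naive sketch of this loop, using single link (minimum pairwise distance) as the inter-cluster proximity; it is O(N^3) and meant only to make the steps concrete (names are illustrative):

```python
import numpy as np

def agglomerative(X, num_clusters=1):
    # Step 1: proximity matrix of pairwise Euclidean distances
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Step 2: every point starts as its own cluster
    clusters = [[i] for i in range(len(X))]
    # Steps 3-6: repeatedly merge the two closest clusters
    while len(clusters) > num_clusters:
        best = (None, None, np.inf)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single-link proximity: minimum distance between members
                d = min(D[i, j] for i in clusters[a] for j in clusters[b])
                if d < best[2]:
                    best = (a, b, d)
        a, b, _ = best
        clusters[a] += clusters[b]   # merge the two closest clusters
        del clusters[b]              # "update the proximity matrix" implicitly, by recomputing
    return clusters
```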

  51. Starting Situation  Start with clusters of individual points and a proximity matrix.  (Figure: points p1, p2, ..., p12 and the proximity matrix indexed by the individual points.)

  52. Intermediate Situation  (Figure: after several merges, clusters C1–C5 and the proximity matrix indexed by these clusters.)

  53. How to Define Inter-Cluster Similarity  (Figure: the proximity matrix; which value should represent the similarity between two clusters of points?)

  54. Distance between Clusters  Single link: smallest distance between an element in one cluster and an element in the other, i.e., dist(K_i, K_j) = min(t_ip, t_jq).  Complete link: largest distance between an element in one cluster and an element in the other, i.e., dist(K_i, K_j) = max(t_ip, t_jq).  Average: average distance between an element in one cluster and an element in the other, i.e., dist(K_i, K_j) = avg(t_ip, t_jq).  Centroid: distance between the centroids of two clusters, i.e., dist(K_i, K_j) = dist(C_i, C_j).  Medoid: distance between the medoids of two clusters, i.e., dist(K_i, K_j) = dist(M_i, M_j), where a medoid is a chosen, centrally located object in the cluster.

  55. Hierarchical Clustering: MIN  (Figure: the nested clusters and dendrogram produced by single-link (MIN) clustering on six points.)

  56. View points/similarities as a graph  Start with clusters of individual points and a proximity matrix.  (Figure: the points as vertices of a graph whose edges are weighted by proximity.)

  57. Single link clustering and MST (Minimum Spanning Tree)  An agglomerative algorithm using minimum distance (single-link clustering) is essentially the same as Kruskal's algorithm for the minimal spanning tree (MST).  MST: a subgraph which is a tree, connects all vertices, and has the minimum total weight.  Kruskal's algorithm: add edges in increasing order of weight, skipping those whose addition would create a cycle.  Prim's algorithm: grow a tree from any root node, repeatedly adding the frontier edge with the smallest weight.
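A sketch of that correspondence: processing edges in increasing weight with a union-find (Kruskal style) performs exactly the single-link merges, and stopping after N − k accepted edges leaves k clusters (names are illustrative):

```python
import numpy as np
from itertools import combinations

def single_link_via_kruskal(X, k):
    parent = list(range(len(X)))

    def find(i):
        # union-find with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # all edges of the complete graph, sorted by increasing weight
    edges = sorted((np.linalg.norm(X[i] - X[j]), i, j)
                   for i, j in combinations(range(len(X)), 2))
    merges = 0
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                      # skip edges that would create a cycle
            parent[ri] = rj               # this edge is exactly a single-link merge
            merges += 1
            if merges == len(X) - k:      # N - k merges leave k clusters
                break
    return [find(i) for i in range(len(X))]   # cluster id (root) for every point
```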

  58. Min vs. Max vs. Group Average  (Figure: the nested clusters produced on the same six points by MIN, MAX, and group-average linkage.)

  59. Strength of MIN  (Figure: original points vs. the two clusters found.)  • Can handle clusters of varying sizes  • Can also handle non-elliptical shapes

  60. Limitations of MAX  (Figure: original points vs. the two clusters found.)  • Tends to break large clusters  • Biased towards globular clusters

  61. Limitations of MIN  (Figure: original points vs. the two clusters found.)  • Chaining phenomenon  • Sensitive to noise and outliers

  62. Strength of MAX  (Figure: original points vs. the two clusters found.)  • Less susceptible to noise and outliers

  63. Hierarchical Clustering: Group Average  A compromise between single and complete link.  Strengths:  Less susceptible to noise and outliers.  Limitations:  Biased towards globular clusters.

  64. Hierarchical Clustering: Major Weaknesses  Do not scale well (N: number of points).  Space complexity?  Time complexity?

  65. Hierarchical Clustering: Major Weaknesses  Do not scale well (N: number of points).  Space complexity: O(N^2).  Time complexity: O(N^3); O(N^2 log N) for some cases/approaches.  Cannot undo what was done previously.  Quality varies with the distance measure:  MIN (single link): susceptible to noise/outliers.  MAX/group average: may not work well with non-globular clusters.

  66. Cluster Analysis: Basic Concepts and Methods  Cluster Analysis: Basic Concepts  Similarity and distances  Partitioning Methods  Hierarchical Methods  Density-Based Methods  Probabilistic Methods  Evaluation of Clustering

  67. Density-Based Clustering Methods  Clustering based on density.  Major features:  Discovers clusters of arbitrary shape.  Handles noise.  One scan.  Needs density parameters as a termination condition.  Several interesting studies:  DBSCAN: Ester et al. (KDD'96).  OPTICS: Ankerst et al. (SIGMOD'99).  DENCLUE: Hinneburg & Keim (KDD'98).  CLIQUE: Agrawal et al. (SIGMOD'98) (more grid-based).

  68. DBSCAN: Basic Concepts  Density = number of points within a specified radius.  Core point: has high density.  Border point: has lower density, but lies in the neighborhood of a core point.  Noise point: neither a core point nor a border point.  (Figure: core, border, and noise points.)

  69. DBScan: Definitions  Two parameters:  Eps: the radius of the neighbourhood.  MinPts: the minimum number of points in an Eps-neighbourhood of a point.  N_Eps(p) = {q in D | dist(p, q) <= Eps}.  Core point: |N_Eps(q)| >= MinPts.  (Example: MinPts = 5, Eps = 1 cm.)

  70. DBScan: Definitions  Directly density-reachable (p from q): p belongs to N_Eps(q) and q is a core point (example: MinPts = 5, Eps = 1 cm).  Density-reachable (p from q): there is a chain of points p_1, ..., p_n with p_1 = q and p_n = p such that p_{i+1} is directly density-reachable from p_i.  Density-connected (p and q): there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts.

  71. DBSCAN: Cluster Definition  A cluster is defined as a maximal set of density-connected points.  (Figure: core, border, and outlier points; Eps = 1 cm, MinPts = 5.)

  72. DBSCAN: The Algorithm  Arbitrarily select an unvisited point p and retrieve all points density-reachable from p w.r.t. Eps and MinPts.  If p is a core point, a cluster is formed: add all neighbors of p to the cluster, and recursively add their neighbors if they are core points.  Otherwise, mark p as a noise point.  Continue the process until all points have been processed.  Complexity: O(n^2); O(n log n) if a spatial index is used.
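A compact sketch of this procedure without a spatial index (hence O(n^2)); the label -1 marks noise, other integers are cluster ids, and the names dbscan, eps, min_pts are illustrative:

```python
import numpy as np

def dbscan(X, eps, min_pts):
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = [np.where(D[i] <= eps)[0] for i in range(n)]  # Eps-neighbourhoods
    labels = np.full(n, -1)                 # -1 = noise / not yet assigned
    visited = np.zeros(n, dtype=bool)
    cluster_id = 0
    for p in range(n):
        if visited[p]:
            continue
        visited[p] = True
        if len(neighbors[p]) < min_pts:
            continue                        # not a core point: noise for now (may become border later)
        # p is a core point: grow a new cluster from it
        labels[p] = cluster_id
        queue = list(neighbors[p])
        while queue:
            q = queue.pop()
            if labels[q] == -1:
                labels[q] = cluster_id      # border or core point joins the cluster
            if not visited[q]:
                visited[q] = True
                if len(neighbors[q]) >= min_pts:
                    queue.extend(neighbors[q])   # expand only through core points
        cluster_id += 1
    return labels
```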

  73. DBSCAN: Sensitive to Parameters

  74. DBSCAN: Determining Eps and MinPts  Basic idea (given MinPts = k, find Eps):  For points in a cluster, their k-th nearest neighbors are at roughly the same distance.  Noise points have their k-th nearest neighbor at a farther distance.  Plot the sorted distance of every point to its k-th nearest neighbor.
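A sketch of that k-distance heuristic: compute every point's distance to its k-th nearest neighbor, sort the values, and read Eps off near the "knee" of the resulting curve (the function name k_distance is illustrative):

```python
import numpy as np

def k_distance(X, k):
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # k-th nearest neighbor distance for every point (column 0 is the point itself)
    kth = np.sort(D, axis=1)[:, k]
    return np.sort(kth)

# Usage: pick MinPts = k, plot k_distance(X, k), and set Eps near the knee, e.g.
# import matplotlib.pyplot as plt; plt.plot(k_distance(X, 4)); plt.show()
```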
