

  1. CISC 4631 Data Mining Lecture 09: Clustering. These slides are based on the slides by Tan, Steinbach and Kumar (textbook authors), Eamonn Keogh (UC Riverside), and Raymond Mooney (UT Austin).

  2. What is Clustering? Finding groups of objects such that objects in a group are similar to one another and different from the objects in other groups. Also called unsupervised learning; sometimes called classification by statisticians, sorting by psychologists, and segmentation by people in marketing. [Figure: intra-cluster distances are minimized, inter-cluster distances are maximized.]

  3. What is a natural grouping among these objects? Clustering is subjective. [Figure: the same characters grouped as Simpson's Family, Females, Males, or School Employees.]

  4. Similarity is Subjective

  5. Intuitions behind desirable distance measure properties:
• D(A,B) = D(B,A) (Symmetry). Otherwise you could claim "Alex looks like Bob, but Bob looks nothing like Alex."
• D(A,A) = 0 (Constancy of Self-Similarity). Otherwise you could claim "Alex looks more like Bob than Bob does."
• D(A,B) = 0 iff A = B (Positivity / Separation). Otherwise there are objects in your world that are different, but you cannot tell apart.
• D(A,B) ≤ D(A,C) + D(C,B) (Triangle Inequality). Otherwise you could claim "Alex is very like Bob, and Alex is very like Carl, but Bob is very unlike Carl."
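Not part of the original slides: a minimal Python sketch that spot-checks these four properties for the ordinary Euclidean distance on a few toy points; the point names alex, bob, carl and their coordinates are illustrative only.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical feature vectors for the three "people" in the examples above.
alex, bob, carl = (0.0, 0.0), (3.0, 4.0), (6.0, 0.0)

assert euclidean(alex, bob) == euclidean(bob, alex)          # symmetry
assert euclidean(alex, alex) == 0.0                          # constancy of self-similarity
assert (euclidean(alex, bob) == 0.0) == (alex == bob)        # separation: zero distance iff same point
assert euclidean(alex, bob) <= euclidean(alex, carl) + euclidean(carl, bob)  # triangle inequality
```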

  6. Applications of Cluster Analysis
• Understanding – Group related documents for browsing, group genes and proteins that have similar functionality, group stocks with similar price fluctuations, or group customers that have similar buying habits.
• Summarization – Reduce the size of large data sets (e.g., clustering precipitation in Australia).
[Table: discovered stock clusters and their industry groups]
Cluster 1: Applied-Matl-DOWN, Bay-Network-Down, 3-COM-DOWN, Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN, Micron-Tech-DOWN, Texas-Inst-Down, Tellabs-Inc-Down, Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN, Sun-DOWN – Technology1-DOWN
Cluster 2: Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN, ADV-Micro-Device-DOWN, Andrew-Corp-DOWN, Computer-Assoc-DOWN, Circuit-City-DOWN, Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN – Technology2-DOWN
Cluster 3: Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN, Morgan-Stanley-DOWN – Financial-DOWN
Cluster 4: Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP – Oil-UP

  7. Notion of a Cluster can be Ambiguous. How many clusters do you see? [Figure: the same points interpreted as two clusters, four clusters, or six clusters.]

  8. Types of Clusterings
• A clustering is a set of clusters.
• Important distinction between hierarchical and partitional sets of clusters.
• Partitional Clustering – A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset.
• Hierarchical Clustering – A set of nested clusters organized as a hierarchical tree.

  9. Partitional Clustering. [Figure: the original points and a partitional clustering of them.]

  10. Hierarchical Clustering. [Figure: a traditional hierarchical clustering of points p1–p4, the corresponding traditional dendrogram, and a "Simpsonian" dendrogram.]

  11. Other Distinctions Between Sets of Clusters
• Exclusive versus non-exclusive – In non-exclusive clusterings, points may belong to multiple clusters; this can represent multiple classes or 'border' points.
• Fuzzy versus non-fuzzy – In fuzzy clustering, a point belongs to every cluster with some weight between 0 and 1, and the weights must sum to 1. Probabilistic clustering has similar characteristics.
• Partial versus complete – In some cases, we only want to cluster some of the data.

  12. Types of Clusters
• Well-separated clusters
• Center-based clusters (our main emphasis)
• Contiguous clusters
• Density-based clusters
• Clusters described by an objective function

  13. Types of Clusters: Well-Separated. A cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster. [Figure: 3 well-separated clusters.]

  14. Types of Clusters: Center-Based. A cluster is a set of objects such that an object in a cluster is closer (more similar) to the "center" of its cluster than to the center of any other cluster. The center of a cluster is often a centroid, the average of all the points in the cluster (assuming numerical attributes), or a medoid, the most "representative" point of the cluster (used if there are categorical features). [Figure: 4 center-based clusters.]
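Not on the original slide: a small Python sketch contrasting the two notions of "center" on a toy cluster. The centroid is the coordinate-wise mean (and need not be an actual data point), while the medoid is the member of the cluster with the smallest total distance to the others; the helper names are ours.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(points):
    """Coordinate-wise mean of the points (not necessarily a data point)."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def medoid(points):
    """The actual data point whose total distance to all other points is smallest."""
    return min(points, key=lambda p: sum(euclidean(p, q) for q in points))

cluster = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (8.0, 8.0)]  # toy cluster with one outlier
print(centroid(cluster))  # (3.0, 3.0) -- pulled toward the outlier
print(medoid(cluster))    # (2.0, 1.0) -- an actual member, less affected by the outlier
```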

  15. Types of Clusters: Contiguity-Based. Contiguous cluster (nearest neighbor or transitive): a cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster. [Figure: 8 contiguous clusters.]

  16. Types of Clusters: Density-Based. A cluster is a dense region of points, separated from other regions of high density by low-density regions. Used when the clusters are irregular or intertwined, and when noise and outliers are present. [Figure: 6 density-based clusters.]

  17. Types of Clusters: Objective Function. Clusters defined by an objective function:
• Find clusters that minimize or maximize an objective function.
• Enumerate all possible ways of dividing the points into clusters and evaluate the 'goodness' of each potential set of clusters using the given objective function (NP-hard).
• Example: sum of squared distances to each cluster center.
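Not written out on the original slide: the sum-of-squared-error (SSE) objective referred to above and used again on slide 27, where C_i is the i-th cluster and c_i its center.

```latex
\mathrm{SSE} \;=\; \sum_{i=1}^{K} \sum_{x \in C_i} \operatorname{dist}(c_i, x)^2
```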

  18. Clustering Algorithms
• K-means and its variants
• Hierarchical clustering
• Density-based clustering

  19. K-means Clustering
• Partitional clustering approach
• Each cluster is associated with a centroid (center point)
• Each point is assigned to the cluster with the closest centroid
• Number of clusters, K, must be specified
• The basic algorithm is very simple
• K-means tutorial available from http://maya.cs.depaul.edu/~classes/ect584/WEKA/k-means.html

  20. K-means Clustering
1. Ask user how many clusters they'd like (e.g., k = 3).
2. Randomly guess k cluster center locations.
3. Each datapoint finds out which center it's closest to.
4. Each center finds the centroid of the points it owns...
5. ...and jumps there.
6. ...Repeat until terminated!
[Figure: the algorithm illustrated on points in a 2-D plot.]
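Not part of the original slides: a minimal, self-contained Python sketch of the basic loop described above (random initial centers, assign each point to its closest center, move each center to the centroid of the points it owns, repeat until assignments stop changing). The function and variable names are ours, not from the lecture.

```python
import math
import random

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, max_iter=100, seed=0):
    """Basic K-means; returns (centers, assignments)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)            # step 2: random initial center locations
    assignments = [None] * len(points)
    for _ in range(max_iter):
        # step 3: each point finds the center it is closest to
        new_assignments = [
            min(range(k), key=lambda j: euclidean(p, centers[j])) for p in points
        ]
        if new_assignments == assignments:      # terminate when no point changes clusters
            break
        assignments = new_assignments
        # steps 4-5: each center jumps to the centroid of the points it owns
        for j in range(k):
            owned = [p for p, a in zip(points, assignments) if a == j]
            if owned:                           # keep the old center if it owns no points
                centers[j] = tuple(sum(c) / len(owned) for c in zip(*owned))
    return centers, assignments

# Toy usage: two obvious groups in 2-D.
pts = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centers, labels = kmeans(pts, k=2)
print(centers, labels)
```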

  21. K-means Clustering: Step 1. [Figure: points in a 2-D plot with initial cluster centers k1, k2, k3.]

  22. K-means Clustering. [Figure: second frame of the animation; centers k1, k2, k3.]

  23. K-means Clustering. [Figure: third frame of the animation; centers k1, k2, k3 have moved.]

  24. K-means Clustering. [Figure: fourth frame of the animation; centers k1, k2, k3.]

  25. K-means Clustering. [Figure: final frame of the animation; centers k1, k2, k3 on axes "expression in condition 1" vs. "expression in condition 2".]

  26. K-means Clustering – Details
• Initial centroids are often chosen randomly; the clusters produced vary from one run to another.
• The centroid is (typically) the mean of the points in the cluster.
• 'Closeness' is measured by Euclidean distance, correlation, etc.
• K-means will converge for the common similarity measures mentioned above.
• Most of the convergence happens in the first few iterations; often the stopping condition is changed to 'until relatively few points change clusters'.

  27. Evaluating K-means Clusters
• The most common measure is Sum of Squared Error (SSE).
• For each point, the error is the distance to its nearest cluster center; to get SSE, we square these errors and sum them.
• We can show that to minimize SSE, the best update strategy is to use the mean (center) of the cluster.
• Given two clusterings, we can choose the one with the smallest error.
• One easy way to reduce SSE is to increase K, the number of clusters; yet a good clustering with smaller K can have a lower SSE than a poor clustering with higher K.
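Not on the original slide: a short sketch of the SSE computation, reusing the hypothetical kmeans and euclidean helpers from the earlier example.

```python
def sse(points, centers, assignments):
    """Sum of squared distances from each point to the center of its assigned cluster."""
    return sum(euclidean(p, centers[a]) ** 2 for p, a in zip(points, assignments))

centers, labels = kmeans(pts, k=2)
print(sse(pts, centers, labels))
```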

  28. Two different K-means Clusterings. [Figure: the original points, an optimal clustering, and a sub-optimal clustering of the same data.]

  29. Importance of Choosing Initial Centroids. If you happen to choose good initial centroids, then you will get this after 6 iterations. [Figure: iterations 1–6 of a K-means run.]

  30. Importance of Choosing Initial Centroids (continued). [Figure: iterations 1–6 of another K-means run, converging to a good clustering.]
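These slides stress that the result depends on the initial centroids. One common remedy, not spelled out here, is to run K-means several times from different random starts and keep the run with the lowest SSE; a sketch, again reusing the hypothetical kmeans and sse helpers defined above:

```python
def best_of_n_runs(points, k, n_runs=10):
    """Run K-means n_runs times with different random seeds; keep the lowest-SSE result."""
    best = None
    for seed in range(n_runs):
        centers, labels = kmeans(points, k, seed=seed)
        err = sse(points, centers, labels)
        if best is None or err < best[0]:
            best = (err, centers, labels)
    return best

print(best_of_n_runs(pts, k=2))
```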
