CISC 4631 Data Mining Lecture 09: Clustering
These slides are based on the slides by
• Tan, Steinbach and Kumar (textbook authors)
• Eamonn Keogh (UC Riverside)
• Raymond Mooney (UT Austin)
What is Clustering? Finding groups of objects such that the objects in a group are similar to one another and different from the objects in other groups. Also called unsupervised learning; sometimes called classification by statisticians, sorting by psychologists, and segmentation by people in marketing. The goal is that intra-cluster distances are minimized while inter-cluster distances are maximized.
What is a natural grouping among these objects? Clustering is subjective: the same characters can be grouped as Simpson's Family vs. School Employees, or as Females vs. Males.
Similarity is Subjective
Intuitions behind desirable distance measure properties:
D(A,B) = D(B,A) (Symmetry). Otherwise you could claim "Alex looks like Bob, but Bob looks nothing like Alex."
D(A,A) = 0 (Constancy of Self-Similarity). Otherwise you could claim "Alex looks more like Bob than Bob does."
D(A,B) = 0 iff A = B (Positivity / Separation). Otherwise there are objects in your world that are different, but you cannot tell apart.
D(A,B) ≤ D(A,C) + D(B,C) (Triangular Inequality). Otherwise you could claim "Alex is very like Bob, and Alex is very like Carl, but Bob is very unlike Carl."
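To make these properties concrete, here is a minimal sketch (with three hypothetical 2-D points A, B, C) checking that plain Euclidean distance satisfies all four:

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length point tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Three hypothetical 2-D points.
A, B, C = (0.0, 0.0), (3.0, 4.0), (6.0, 0.0)

assert euclidean(A, B) == euclidean(B, A)                    # symmetry
assert euclidean(A, A) == 0.0                                # constancy of self-similarity
assert euclidean(A, B) > 0                                   # positivity: A != B
assert euclidean(A, B) <= euclidean(A, C) + euclidean(B, C)  # triangular inequality
```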
Applications of Cluster Analysis
Understanding
– Group related documents for browsing, group genes and proteins that have similar functionality, group stocks with similar price fluctuations, or customers that have similar buying habits.
– Example: discovered clusters of stocks with similar price movements, and the industry group each corresponds to:
  1. Technology1-DOWN: Applied-Matl-DOWN, Bay-Network-Down, 3-COM-DOWN, Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN, Micron-Tech-DOWN, Texas-Inst-Down, Tellabs-Inc-Down, Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN, Sun-DOWN
  2. Technology2-DOWN: Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN, ADV-Micro-Device-DOWN, Andrew-Corp-DOWN, Computer-Assoc-DOWN, Circuit-City-DOWN, Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN
  3. Financial-DOWN: Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN, Morgan-Stanley-DOWN
  4. Oil-UP: Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP
Summarization
– Reduce the size of large data sets (figure: clustering precipitation in Australia).
Notion of a Cluster can be Ambiguous
So tell me, how many clusters do you see? The same points can reasonably be read as two clusters, four clusters, or six clusters.
Types of Clusterings
A clustering is a set of clusters. An important distinction is between hierarchical and partitional sets of clusters.
Partitional Clustering
– A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset.
Hierarchical clustering
– A set of nested clusters organized as a hierarchical tree.
Partitional Clustering (figure: the original points, and a partitional clustering of them)
Hierarchical Clustering (figures: a traditional hierarchical clustering of points p1, p2, p3, p4; the corresponding traditional dendrogram; and a "Simpsonian" dendrogram)
Other Distinctions Between Sets of Clusters Exclusive versus non-exclusive – In non-exclusive clusterings points may belong to multiple clusters – Can represent multiple classes or ‘border’ points Fuzzy versus non-fuzzy – In fuzzy clustering, a point belongs to every cluster with some weight between 0 and 1 – Weights must sum to 1 – Probabilistic clustering has similar characteristics Partial versus complete – In some cases, we only want to cluster some of the data
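As a toy illustration of fuzzy membership, here is a sketch that gives a point a weight for every cluster, with the weights summing to 1. The inverse-distance scheme is a hypothetical simplification for illustration, not the actual fuzzy c-means update rule:

```python
import math

def fuzzy_weights(point, centers):
    """One simple way to get fuzzy memberships: weight each cluster
    inversely to the point's distance from its center, then normalize
    so the weights sum to 1. (Illustrative only; not fuzzy c-means.)"""
    eps = 1e-12  # avoids division by zero if the point sits on a center
    inv = [1.0 / (math.dist(point, c) + eps) for c in centers]
    total = sum(inv)
    return [w / total for w in inv]

# A point near the first of three hypothetical centers gets most of the weight.
print(fuzzy_weights((1.0, 1.0), [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]))
```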
Types of Clusters
Well-separated clusters
Center-based clusters (our main emphasis)
Contiguous clusters
Density-based clusters
Clusters described by an objective function
Types of Clusters: Well-Separated Well-Separated Clusters: – A cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster. 3 well-separated clusters
Types of Clusters: Center-Based
Center-based
– A cluster is a set of objects such that an object in a cluster is closer (more similar) to the "center" of its own cluster than to the center of any other cluster.
– The center of a cluster is often a centroid, the average of all the points in the cluster (assuming numerical attributes), or a medoid, the most "representative" point of a cluster (used if there are categorical features).
4 center-based clusters
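A brief sketch contrasting the two kinds of centers on three hypothetical 2-D points: the centroid is the coordinate-wise mean (and need not be an actual data point), while the medoid is the member that minimizes total distance to the rest:

```python
import math

points = [(1.0, 1.0), (2.0, 1.0), (1.5, 3.0)]  # hypothetical cluster members

# Centroid: the coordinate-wise mean; requires numerical attributes
# and need not coincide with any actual data point.
centroid = tuple(sum(c) / len(points) for c in zip(*points))

# Medoid: the actual member minimizing total distance to the others;
# usable whenever pairwise (dis)similarities exist, e.g. categorical data.
medoid = min(points, key=lambda p: sum(math.dist(p, q) for q in points))

print("centroid:", centroid, "medoid:", medoid)
```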
Types of Clusters: Contiguity-Based
Contiguous Cluster (Nearest neighbor or Transitive)
– A cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster.
8 contiguous clusters
Types of Clusters: Density-Based
Density-based
– A cluster is a dense region of points, separated from other regions of high density by low-density regions.
– Used when the clusters are irregular or intertwined, and when noise and outliers are present.
6 density-based clusters
Types of Clusters: Objective Function
Clusters Defined by an Objective Function
– Find clusters that minimize or maximize an objective function.
– In principle, enumerate all possible ways of dividing the points into clusters and evaluate the 'goodness' of each potential set of clusters with the given objective function (NP-hard).
– Example: the sum of squared distances to the cluster centers.
Clustering Algorithms K-means and its variants Hierarchical clustering Density-based clustering
K-means Clustering Partitional clustering approach Each cluster is associated with a centroid (center point) Each point is assigned to the cluster with the closest centroid Number of clusters, K, must be specified The basic algorithm is very simple – K-means tutorial available from http://maya.cs.depaul.edu/~classes/ect584/WEKA/k-means.html
K-means Clustering
1. Ask user how many clusters they'd like (e.g., k = 3).
2. Randomly guess k cluster center locations.
3. Each datapoint finds out which center it's closest to.
4. Each center finds the centroid of the points it owns…
5. …and jumps there.
6. …Repeat until terminated!
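A minimal sketch of this loop, assuming 2-D points stored as tuples and plain Euclidean distance (the function and parameter names are illustrative, not from the lecture):

```python
import math
import random

def kmeans(points, k, max_iters=100, tol=1e-6):
    """Plain k-means on a list of 2-D tuples; returns (centers, labels)."""
    centers = random.sample(points, k)  # step 2: random initial centers
    labels = [0] * len(points)
    for _ in range(max_iters):
        # Step 3: each point finds the center it is closest to.
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                  for p in points]
        # Steps 4-5: each center jumps to the centroid of the points it owns.
        new_centers = []
        for j in range(k):
            members = [p for p, lbl in zip(points, labels) if lbl == j]
            if members:
                new_centers.append(tuple(sum(c) / len(members)
                                         for c in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep an emptied cluster's center
        # Step 6: terminate once the centers have (nearly) stopped moving.
        if all(math.dist(a, b) < tol for a, b in zip(centers, new_centers)):
            break
        centers = new_centers
    return centers, labels
```

For example, `centers, labels = kmeans(points, 3)` would cluster a list of (x, y) tuples; the following slides illustrate how the outcome depends on the random initialization in step 2.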
K-means Clustering: Steps 1–5 (figures: the centers k1, k2, k3 are placed, points are assigned to their closest center, and each center moves to the centroid of its points, step by step; axes: expression in condition 1 vs. expression in condition 2)
K-means Clustering – Details Initial centroids are often chosen randomly. – Clusters produced vary from one run to another. The centroid is (typically) the mean of the points in the cluster ‘Closeness’ is measured by Euclidean distance, correlation, etc. K-means will converge for common similarity measures mentioned above. Most of the convergence happens in the first few iterations. – Often the stopping condition is changed to ‘Until relatively few points change clusters’
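The relaxed stopping condition from the last bullet could be expressed as a small helper like this (the 1% threshold is a hypothetical choice):

```python
def few_points_changed(old_labels, new_labels, frac=0.01):
    """Relaxed stopping test: stop once fewer than a fraction `frac`
    of the points changed clusters between consecutive iterations."""
    changed = sum(o != n for o, n in zip(old_labels, new_labels))
    return changed < frac * len(new_labels)
```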
Evaluating K-means Clusters
Most common measure is Sum of Squared Error (SSE)
– For each point, the error is the distance to the center of its cluster.
– To get SSE, we square these errors and sum them.
– It can be shown that to minimize SSE, the best update strategy is to use the centroid of each cluster.
– Given two clusterings, we can choose the one with the smallest error.
– One easy way to reduce SSE is to increase K, the number of clusters; yet a good clustering with smaller K can have a lower SSE than a poor clustering with higher K.
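A minimal sketch of the SSE computation described above, matching the tuple-based points and labels of the earlier k-means sketch:

```python
import math

def sse(points, centers, labels):
    """Sum of Squared Error: square each point's distance to the center
    of the cluster it is assigned to, then sum over all points."""
    return sum(math.dist(p, centers[lbl]) ** 2
               for p, lbl in zip(points, labels))
```

Running k-means several times from different random initializations and keeping the run with the smallest SSE is the usual cheap defense against the bad initial centroids discussed on the next slides.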
Two different K-means Clusterings (figure: the same original points partitioned two ways, one an optimal clustering and the other a sub-optimal clustering)
Importance of Choosing Initial Centroids (figure: iterations 1–6 of one run). If you happen to choose good initial centroids, then you will get this after 6 iterations.
Importance of Choosing Initial Centroids (figure: the six iterations shown one by one, ending in a good clustering)