CISC 4631 Data Mining
Lecture 09: Clustering
These slides are based on the slides by:
• Tan, Steinbach and Kumar (textbook authors)
• Eamonn Keogh (UC Riverside)
• Raymond Mooney (UT Austin)
What is Clustering? Finding groups of objects such that objects in a group are similar to one another and different from the objects in other groups. Also called unsupervised learning; sometimes called classification by statisticians, sorting by psychologists, and segmentation by people in marketing. Intra-cluster distances are minimized; inter-cluster distances are maximized.
What is a natural grouping among these objects? Clustering is subjective. (Possible groupings of the same objects: the Simpson's family; females vs. males; school employees vs. others.)
Similarity is Subjective
Intuitions behind desirable distance measure properties:
• D(A,B) = D(B,A) — Symmetry. Otherwise you could claim "Alex looks like Bob, but Bob looks nothing like Alex."
• D(A,A) = 0 — Constancy of Self-Similarity. Otherwise you could claim "Alex looks more like Bob than Bob does."
• D(A,B) = 0 iff A = B — Positivity (Separation). Otherwise there are objects in your world that are different, but you cannot tell apart.
• D(A,B) ≤ D(A,C) + D(B,C) — Triangle Inequality. Otherwise you could claim "Alex is very like Bob, and Alex is very like Carl, but Bob is very unlike Carl."
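The four properties above are easy to check mechanically. As a sketch, the snippet below verifies that Euclidean distance satisfies all of them on a handful of sample points (the points themselves are made up for illustration):

```python
import itertools
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0), (1.0, 1.0)]

# Symmetry: D(A, B) == D(B, A)
assert all(euclidean(a, b) == euclidean(b, a)
           for a, b in itertools.combinations(points, 2))

# Constancy of self-similarity: D(A, A) == 0
assert all(euclidean(p, p) == 0.0 for p in points)

# Positivity (separation): D(A, B) > 0 whenever A != B
assert all(euclidean(a, b) > 0
           for a, b in itertools.combinations(points, 2))

# Triangle inequality: D(A, B) <= D(A, C) + D(C, B)
# (small tolerance added for floating-point rounding)
assert all(euclidean(a, b) <= euclidean(a, c) + euclidean(c, b) + 1e-12
           for a, b, c in itertools.permutations(points, 3))
```

Checking a handful of points does not prove the properties in general, but it is a useful sanity check when you design a custom similarity measure.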
Applications of Cluster Analysis
Understanding
– Group related documents for browsing, group genes and proteins that have similar functionality, group stocks with similar price fluctuations, or customers that have similar buying habits.
Discovered stock clusters (Industry Group):
1. Applied-Matl-DOWN, Bay-Network-Down, 3-COM-DOWN, Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN, Micron-Tech-DOWN, Texas-Inst-Down, Tellabs-Inc-Down, Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN, Sun-DOWN → Technology1-DOWN
2. Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN, ADV-Micro-Device-DOWN, Andrew-Corp-DOWN, Computer-Assoc-DOWN, Circuit-City-DOWN, Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN → Technology2-DOWN
3. Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN, Morgan-Stanley-DOWN → Financial-DOWN
4. Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP → Oil-UP
Summarization
– Reduce the size of large data sets (e.g., clustering precipitation in Australia).
Notion of a Cluster can be Ambiguous — how many clusters do you see? The same set of points can plausibly be viewed as two clusters, four clusters, or six clusters.
Types of Clusterings A clustering is a set of clusters Important distinction between hierarchical and partitional sets of clusters Partitional Clustering – A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset Hierarchical clustering – A set of nested clusters organized as a hierarchical tree
Partitional Clustering Original Points A Partitional Clustering
Hierarchical Clustering: nested clusters of points p1, p2, p3, p4, shown both as a traditional hierarchical clustering of the points and as the corresponding traditional dendrogram (plus a "Simpsonian" dendrogram of the same idea).
Other Distinctions Between Sets of Clusters Exclusive versus non-exclusive – In non-exclusive clusterings points may belong to multiple clusters – Can represent multiple classes or ‘border’ points Fuzzy versus non-fuzzy – In fuzzy clustering, a point belongs to every cluster with some weight between 0 and 1 – Weights must sum to 1 – Probabilistic clustering has similar characteristics Partial versus complete – In some cases, we only want to cluster some of the data
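To make the fuzzy case concrete, here is a minimal sketch of one common membership scheme (the one used by fuzzy c-means): a point's weight for each cluster is inversely related to its distance from that cluster's center, normalized so the weights sum to 1. The fuzzifier exponent m = 2 and the sample coordinates are illustrative assumptions, not values from the slides.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def fuzzy_weights(point, centers, m=2.0, eps=1e-12):
    # Fuzzy-c-means-style membership: weight for cluster j is
    # proportional to (1/d_j)^(2/(m-1)), normalized to sum to 1.
    dists = [max(euclidean(point, c), eps) for c in centers]
    raw = [(1.0 / d) ** (2.0 / (m - 1.0)) for d in dists]
    total = sum(raw)
    return [r / total for r in raw]

centers = [(0.0, 0.0), (10.0, 0.0)]
# A point closer to the first center gets the larger weight,
# but still belongs to the second cluster with some weight.
w = fuzzy_weights((2.0, 0.0), centers)
```

Probabilistic clustering (e.g., mixture models) produces weights with the same "sum to 1" character, but derives them from estimated probabilities rather than raw distances.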
Types of Clusters Well-separated clusters Center-based clusters (our main emphasis) Contiguous clusters Density-based clusters Described by an Objective Function
Types of Clusters: Well-Separated Well-Separated Clusters: – A cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster. 3 well-separated clusters
Types of Clusters: Center-Based Center-based – A cluster is a set of objects such that an object in a cluster is closer (more similar) to the “center” of a cluster, than to the center of any other cluster – The center of a cluster is often a centroid, the average of all the points in the cluster (assuming numerical attributes), or a medoid, the most “representative” point of a cluster (used if there are categorical features) 4 center-based clusters
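The centroid/medoid distinction above can be sketched in a few lines: the centroid is the component-wise mean (and need not be an actual data point), while the medoid is the actual point with the smallest total distance to the rest of the cluster. The sample cluster is illustrative.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(points):
    """Component-wise mean of the points (requires numerical attributes)."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def medoid(points):
    """The actual data point with the smallest total distance to the others
    (usable even when averaging attributes is not meaningful)."""
    return min(points, key=lambda p: sum(euclidean(p, q) for q in points))

cluster = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
c = centroid(cluster)   # (1.5, 1.5) — not one of the original points
m = medoid(cluster)     # always one of the original points
```

For categorical features, the medoid only needs a pairwise distance function, which is why it is the natural choice there.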
Types of Clusters: Contiguity- Based Contiguous Cluster (Nearest neighbor or Transitive) – A cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster. 8 contiguous clusters
Types of Clusters: Density-Based Density-based – A cluster is a dense region of points, which is separated by low- density regions, from other regions of high density. – Used when the clusters are irregular or intertwined, and when noise and outliers are present. 6 density-based clusters
Types of Clusters: Objective Function Clusters Defined by an Objective Function – Finds clusters that minimize or maximize an objective function. – Enumerate all possible ways of dividing the points into clusters and evaluate the 'goodness' of each potential set of clusters by using the given objective function. (NP Hard) – Example: Sum of squares of distances to cluster center
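The exhaustive approach above is only feasible for toy inputs, which this sketch demonstrates: it enumerates every assignment of four sample points to two non-empty clusters and picks the one with the smallest sum-of-squares objective. The data points are made up for illustration.

```python
import itertools

def sse(clusters):
    """Sum of squared distances from each 2-D point to its cluster centroid."""
    total = 0.0
    for pts in clusters:
        cx = sum(p[0] for p in pts) / len(pts)
        cy = sum(p[1] for p in pts) / len(pts)
        total += sum((p[0] - cx) ** 2 + (p[1] - cy) ** 2 for p in pts)
    return total

def all_partitions(points, k):
    """Every way of assigning the points to k non-empty clusters."""
    for labels in itertools.product(range(k), repeat=len(points)):
        clusters = [[p for p, l in zip(points, labels) if l == j]
                    for j in range(k)]
        if all(clusters):  # skip assignments that leave a cluster empty
            yield clusters

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
best = min(all_partitions(points, 2), key=sse)
# The number of assignments grows exponentially with the number of points,
# which is why K-means resorts to an iterative heuristic instead.
```

Even with only 4 points and k = 2 there are 2^4 candidate label vectors; at realistic sizes the enumeration is hopeless, motivating the heuristic algorithms that follow.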
Clustering Algorithms K-means and its variants Hierarchical clustering Density-based clustering
K-means Clustering Partitional clustering approach Each cluster is associated with a centroid (center point) Each point is assigned to the cluster with the closest centroid Number of clusters, K, must be specified The basic algorithm is very simple – K-means tutorial available from http://maya.cs.depaul.edu/~classes/ect584/WEKA/k-means.html
K-means Clustering
1. Ask user how many clusters they'd like (e.g., k = 3).
2. Randomly guess k cluster center locations.
3. Each datapoint finds out which center it's closest to.
4. Each center finds the centroid of the points it owns…
5. …and jumps there.
6. Repeat until terminated!
K-means Clustering, Steps 1–5 illustrated: a sequence of 2-D scatter plots (expression in condition 1 vs. expression in condition 2) shows the three centers k1, k2, k3 being placed randomly, claiming their closest points, and jumping to the centroids of those points over successive iterations.
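The steps illustrated above can be sketched as a minimal pure-Python implementation of Lloyd's K-means. The sample 2-D points and the simple "stop when assignments stabilize" termination rule are illustrative choices, not taken from the slides.

```python
import random

def kmeans(points, k, max_iters=100, seed=0):
    """Minimal K-means on 2-D points: guess k centers, assign each point
    to its closest center, move each center to the centroid of its points,
    and repeat until the assignment stops changing."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # step 2: random initial centers
    assignment = None
    for _ in range(max_iters):
        # Step 3: each point finds the closest center (squared distance
        # suffices, since sqrt is monotonic).
        new_assignment = [
            min(range(k), key=lambda j: (p[0] - centers[j][0]) ** 2
                                        + (p[1] - centers[j][1]) ** 2)
            for p in points
        ]
        if new_assignment == assignment:     # step 6: terminate when stable
            break
        assignment = new_assignment
        # Steps 4-5: each center jumps to the centroid of the points it owns.
        for j in range(k):
            owned = [p for p, a in zip(points, assignment) if a == j]
            if owned:
                centers[j] = (sum(p[0] for p in owned) / len(owned),
                              sum(p[1] for p in owned) / len(owned))
    return centers, assignment

points = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0),
          (3.5, 5.0), (4.5, 5.0), (3.5, 4.5)]
centers, labels = kmeans(points, k=2)
```

Note the `if owned:` guard: a center that loses all its points simply stays put, which is one of several ad hoc ways of handling empty clusters.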
K-means Clustering – Details Initial centroids are often chosen randomly. – Clusters produced vary from one run to another. The centroid is (typically) the mean of the points in the cluster ‘Closeness’ is measured by Euclidean distance, correlation, etc. K-means will converge for common similarity measures mentioned above. Most of the convergence happens in the first few iterations. – Often the stopping condition is changed to ‘Until relatively few points change clusters’
Evaluating K-means Clusters Most common measure is Sum of Squared Error (SSE) – For each point, the error is the distance to its cluster's centroid; to get SSE, we square these errors and sum them. – We can show that to minimize SSE, the best update strategy is to use the centroid of the cluster. – Given two clusterings, we can choose the one with the smallest error. – One easy way to reduce SSE is to increase K, the number of clusters; still, a good clustering with smaller K can have a lower SSE than a poor clustering with higher K.
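As a small sketch of using SSE to compare clusterings, the snippet below scores two hand-made clusterings of the same four points with the same K and keeps the better one. The points, centers, and labels are made up for illustration.

```python
def sse(points, centers, labels):
    """Sum of squared Euclidean distances from each 2-D point
    to the center of the cluster it was assigned to."""
    return sum((p[0] - centers[l][0]) ** 2 + (p[1] - centers[l][1]) ** 2
               for p, l in zip(points, labels))

points = [(0.0, 0.0), (0.0, 2.0), (4.0, 0.0), (4.0, 2.0)]

# A good clustering: pair the points that sit close together.
good_centers = [(0.0, 1.0), (4.0, 1.0)]
good_labels = [0, 0, 1, 1]

# A poor clustering of the same points with the same K = 2.
poor_centers = [(2.0, 0.0), (2.0, 2.0)]
poor_labels = [0, 1, 0, 1]

# Given two clusterings, keep the one with the smaller SSE.
better = min([(good_centers, good_labels), (poor_centers, poor_labels)],
             key=lambda cl: sse(points, *cl))
```

Because K-means is sensitive to initialization, a common practice is to run it several times with different random starts and keep the run with the lowest SSE.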
Two different K-means Clusterings: starting from the same original points, K-means can produce either an optimal clustering or a sub-optimal clustering, depending on where the initial centroids fall.
Importance of Choosing Initial Centroids: if you happen to choose good initial centroids, then K-means converges to a good clustering after 6 iterations (iterations 1–6 shown).