Lecture 10: Clustering

Preview
! Introduction
! Partitioning methods
! Hierarchical methods
! Model-based methods
! Density-based methods

What is Clustering?
! Cluster: a collection of data objects
  ! Similar to one another within the same cluster
  ! Dissimilar to the objects in other clusters
! Cluster analysis
  ! Grouping a set of data objects into clusters
! Clustering is unsupervised classification: no predefined classes
! Typical applications
  ! As a stand-alone tool to get insight into the data distribution
  ! As a preprocessing step for other algorithms

Examples of Clustering Applications
! Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
! Land use: Identification of areas of similar land use in an earth observation database
! Insurance: Identifying groups of motor insurance policy holders with a high average claim cost
! Urban planning: Identifying groups of houses according to their house type, value, and geographical location
! Seismology: Observed earthquake epicenters should be clustered along continent faults

Requirements for Clustering in Data Mining
! Scalability
! Ability to deal with different types of attributes
! Discovery of clusters with arbitrary shape
! Minimal domain knowledge required to determine input parameters
! Ability to deal with noise and outliers
! Insensitivity to the order of input records
! Robustness wrt high dimensionality
! Incorporation of user-specified constraints
! Interpretability and usability

What Is a Good Clustering?
! A good clustering method will produce clusters with
  ! High intra-class similarity
  ! Low inter-class similarity
! A precise definition of clustering quality is difficult
  ! Application-dependent
  ! Ultimately subjective
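To make the intra-class vs. inter-class criterion concrete, here is a minimal sketch that scores an already-labelled dataset by its average within-cluster and between-cluster Euclidean distances. The function name, toy data, and labels are illustrative assumptions, not part of the lecture material.

```python
# Minimal sketch: quantify "high intra-class similarity, low inter-class
# similarity" as average pairwise distances within vs. between clusters.
import numpy as np

def avg_intra_inter_distance(X, labels):
    """Return (avg within-cluster distance, avg between-cluster distance)."""
    n = len(X)
    intra, inter = [], []
    for i in range(n):
        for j in range(i + 1, n):
            d = np.linalg.norm(X[i] - X[j])              # Euclidean distance
            (intra if labels[i] == labels[j] else inter).append(d)
    return np.mean(intra), np.mean(inter)

X = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],       # toy cluster 0
              [5.0, 5.0], [5.1, 4.8], [4.9, 5.2]])      # toy cluster 1
labels = [0, 0, 0, 1, 1, 1]
within, between = avg_intra_inter_distance(X, labels)
print(within, between)   # a good clustering: within << between
```

For this toy labelling the within-cluster average is far smaller than the between-cluster average, which is exactly the informal quality criterion stated above.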
Similarity and Dissimilarity Between Objects
! Same measures as we used for IBL (e.g., the Lp norm)
! Euclidean distance (p = 2):
  d(i,j) = sqrt(|x_i1 - x_j1|^2 + |x_i2 - x_j2|^2 + ... + |x_ip - x_jp|^2)
! Properties of a metric d(i,j):
  ! d(i,j) ≥ 0
  ! d(i,i) = 0
  ! d(i,j) = d(j,i)
  ! d(i,j) ≤ d(i,k) + d(k,j)

Major Clustering Approaches
! Partitioning: Construct various partitions and then evaluate them by some criterion
! Hierarchical: Create a hierarchical decomposition of the set of objects using some criterion
! Model-based: Hypothesize a model for each cluster and find the best fit of the models to the data
! Density-based: Guided by connectivity and density functions

Partitioning Algorithms
! Partitioning method: Construct a partition of a database D of n objects into a set of k clusters
! Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion
  ! Global optimum: exhaustively enumerate all partitions
  ! Heuristic methods: the k-means and k-medoids algorithms
  ! k-means (MacQueen, 1967): Each cluster is represented by the center of the cluster
  ! k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw, 1987): Each cluster is represented by one of the objects in the cluster

K-Means Clustering
! Given k, the k-means algorithm consists of four steps (a code sketch follows below):
  ! Select initial centroids at random.
  ! Assign each object to the cluster with the nearest centroid.
  ! Compute each centroid as the mean of the objects assigned to it.
  ! Repeat the previous two steps until no change.

K-Means Clustering (contd.)
! Example
[Figure: four snapshots of k-means on a 2-D dataset, showing the assignments and centroids converging over successive iterations]

Comments on the K-Means Method
! Strengths
  ! Relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations. Normally, k, t << n.
  ! Often terminates at a local optimum. The global optimum may be found using techniques such as simulated annealing and genetic algorithms
! Weaknesses
  ! Applicable only when the mean is defined (what about categorical data?)
  ! Need to specify k, the number of clusters, in advance
  ! Trouble with noisy data and outliers
  ! Not suitable for discovering clusters with non-convex shapes
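As a worked illustration of the four k-means steps listed above, here is a compact sketch in Python, assuming numeric data in a NumPy array and plain Euclidean distance (p = 2). The function name, the randomly generated toy blobs, and the max_iter safeguard are assumptions for the example, not part of the lecture.

```python
# Sketch of the four k-means steps: random initial centroids, nearest-centroid
# assignment, mean update, repeat until nothing changes. Not optimized.
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # step 1
    for _ in range(max_iter):
        # step 2: assign each object to the cluster with the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 3: recompute each centroid as the mean of its assigned objects
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # step 4: repeat until the centroids (and hence assignments) stop changing
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# three well-separated toy blobs in 2-D
X = np.vstack([np.random.default_rng(1).normal(loc=c, scale=0.3, size=(20, 2))
               for c in [(1, 1), (5, 5), (9, 1)]])
labels, centroids = k_means(X, k=3)
print(centroids)
```

The O(tkn) cost quoted on the comments slide is visible here: each iteration computes n·k distances, and t iterations are run until convergence.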
Hierarchical Clustering
! Uses the distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but needs a termination condition
[Figure: agglomerative clustering (AGNES) proceeds from Step 0, where each of the objects a, b, c, d, e is its own cluster, through the merges (a b), (d e), (c d e) to a single cluster (a b c d e) at Step 4; divisive clustering (DIANA) runs the same steps in reverse order]

AGNES (Agglomerative Nesting)
! Produces a tree of clusters (nodes)
! Initially: each object is a cluster (leaf)
! Recursively merges the nodes that have the least dissimilarity (a code sketch follows below)
! Criteria: min distance, max distance, avg distance, center distance
! Eventually all nodes belong to the same cluster (root)
[Figure: three scatter plots showing clusters being merged step by step]

A Dendrogram Shows How the Clusters are Merged Hierarchically
! Decompose the data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram.
! A clustering of the data objects is obtained by cutting the dendrogram at the desired level. Then each connected component forms a cluster.

DIANA (Divisive Analysis)
! Inverse order of AGNES
! Start with a root cluster containing all objects
! Recursively divide into subclusters
! Eventually each cluster contains a single object
[Figure: three scatter plots showing one cluster being split step by step]

Other Hierarchical Clustering Methods
! Major weaknesses of agglomerative clustering methods
  ! Do not scale well: time complexity of at least O(n^2), where n is the number of total objects
  ! Can never undo what was done previously
! Integration of hierarchical with distance-based clustering
  ! BIRCH: uses a CF-tree and incrementally adjusts the quality of sub-clusters
  ! CURE: selects well-scattered points from the cluster and then shrinks them towards the center of the cluster by a specified fraction

BIRCH
! BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies (Zhang, Ramakrishnan & Livny, 1996)
! Incrementally constructs a CF (Clustering Feature) tree
! Parameters: max diameter, max children
! Phase 1: scan the DB to build an initial in-memory CF tree (each node: #points, sum, sum of squares)
! Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree
! Scales linearly: finds a good clustering with a single scan and improves the quality with a few additional scans
! Weaknesses: handles only numeric data, sensitive to the order of the data records
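Tying the AGNES and dendrogram slides above to code: the sketch below performs agglomerative merging under the min-distance (single-link) criterion and terminates when k clusters remain, which corresponds to cutting the dendrogram at the level with k connected components. It is a deliberately naive version written for readability (its cost is even worse than the O(n^2) bound noted above); the function names and toy data are illustrative assumptions, not the lecture's code.

```python
# Sketch of AGNES-style agglomerative clustering with the min-distance
# (single-link) merge criterion, stopped when k clusters remain.
import numpy as np

def single_link_distance(A, B):
    """Minimum distance between any pair of points from clusters A and B."""
    return min(np.linalg.norm(a - b) for a in A for b in B)

def agnes(X, k):
    clusters = [[x] for x in X]          # initially each object is its own leaf cluster
    while len(clusters) > k:             # termination condition: k clusters remain
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_link_distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best                   # merge the least-dissimilar pair of clusters
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

X = np.array([[1, 1], [1.1, 1.2], [5, 5], [5.2, 4.9], [9, 1]])
for c in agnes(X, k=3):
    print([tuple(map(float, p)) for p in c])
```

Swapping single_link_distance for a max-, average-, or centroid-distance function gives the other merge criteria listed on the AGNES slide.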
Clustering Feature Vector
! Clustering Feature: CF = (N, LS, SS) (a code sketch follows below)
  ! N: the number of data points
  ! LS: the linear sum of the N data points, sum_{i=1..N} X_i
  ! SS: the sum of squares of the N data points, sum_{i=1..N} X_i^2
! Example: the points (3,4), (2,6), (4,5), (4,7), (3,8) are summarized by CF = (5, (16,30), (54,190))

CF Tree
[Figure: a CF tree with branching factor B = 7 (max children per non-leaf node) and L = 6 (max entries per leaf node). The root holds entries CF1 ... CF6, each pointing to a child; a non-leaf node holds entries CF1 ... CF5 with child pointers; leaf nodes hold up to six CF entries and are chained together by prev/next pointers.]

Drawbacks of Distance-Based Methods
! Drawbacks of square-error-based clustering methods
  ! Consider only one point as representative of a cluster
  ! Good only for convex clusters of similar size and density, and only if k can be reasonably estimated

CURE (Clustering Using REpresentatives)
! CURE: handles non-spherical clusters, robust wrt outliers
! Uses multiple representative points to evaluate the distance between clusters
! Stops the creation of the cluster hierarchy when a level consists of k clusters

CURE: The Algorithm
! Draw a random sample of size s
! Partition the sample into p partitions, each of size s/p
! Partially cluster each partition into s/pq clusters
! Cluster the partial clusters, shrinking the representatives towards the centroid
! Label the data on disk

Data Partitioning and Clustering
! Example parameters: s = 50, p = 2, s/p = 25, s/pq = 5
[Figure: scatter plots of the sampled data, the two partitions, and the partial clusters produced from them]
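Going back to the Clustering Feature vector at the top of this part, the sketch below shows why CF = (N, LS, SS) is enough for BIRCH to absorb points and merge sub-clusters incrementally in a single scan: both operations are simple additions, and summary statistics such as the centroid can be derived from the triple. The class and method names are illustrative, not BIRCH's actual implementation; the example reproduces the slide's CF = (5, (16,30), (54,190)).

```python
# Sketch of BIRCH's Clustering Feature CF = (N, LS, SS) and its additive update.
import numpy as np

class CF:
    def __init__(self, dim):
        self.N = 0                      # number of points summarized
        self.LS = np.zeros(dim)         # linear sum of the points
        self.SS = np.zeros(dim)         # sum of squares, per dimension

    def add(self, x):
        """Absorb a single data point into this CF entry."""
        x = np.asarray(x, dtype=float)
        self.N += 1
        self.LS += x
        self.SS += x * x

    def merge(self, other):
        """CFs are additive: merging two sub-clusters just adds their triples."""
        self.N += other.N
        self.LS += other.LS
        self.SS += other.SS

    def centroid(self):
        return self.LS / self.N

cf = CF(dim=2)
for p in [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]:   # the slide's example points
    cf.add(p)
print(cf.N, cf.LS, cf.SS)   # 5, [16. 30.], [54. 190.], matching the slide
print(cf.centroid())        # [3.2, 6.0]
```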