Clustering
Community Detection http://image.slidesharecdn.com/communitydetectionitilecture22june2011-110622095259-phpapp02/95/community-detection-in-social-media-1-728.jpg?cb=1308736811 Jian Pei: CMPT 741/459 Clustering (1) 2

Customer Relationship Management
• Partition customers into groups such that customers within a group are similar in some aspects
• A manager can be assigned to each group
• Customized products and services can be developed for each group

What Is Clustering?
• Group data into clusters
  – Objects are similar to one another within the same cluster
  – Objects are dissimilar to the objects in other clusters
  – Unsupervised learning: no predefined classes
[Figure: two clusters (Cluster 1, Cluster 2) and outliers]

Requirements of Clustering
• Scalability
• Ability to deal with various types of attributes
• Discovery of clusters with arbitrary shape
• Minimal requirements for domain knowledge to determine input parameters

Data Matrix
• For memory-based clustering
  – Also called object-by-variable structure
• Represents n objects with p variables (attributes, measures)
  – A relational table
$$
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
$$

Dissimilarity Matrix
• For memory-based clustering
  – Also called object-by-object structure
  – Proximities of pairs of objects
  – d(i, j): dissimilarity between objects i and j
  – Nonnegative
  – Close to 0: similar
$$
\begin{bmatrix}
0      &        &        &        &   \\
d(2,1) & 0      &        &        &   \\
d(3,1) & d(3,2) & 0      &        &   \\
\vdots & \vdots & \vdots & \ddots &   \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
$$
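
As a minimal sketch of the object-by-object structure, the following builds the lower-triangular dissimilarity matrix from a small data matrix. The function name `dissimilarity_matrix` and the choice of Euclidean distance as d(i, j) are assumptions made for illustration; the slide does not fix a distance function.

```python
import math

def dissimilarity_matrix(data, dist=math.dist):
    """Return the lower triangle of the dissimilarity matrix.

    Row i holds d(i, 0), ..., d(i, i); the diagonal entry d(i, i) is 0.
    `dist` defaults to Euclidean distance (an assumption, not from the slide).
    """
    n = len(data)
    return [[dist(data[i], data[j]) for j in range(i + 1)] for i in range(n)]

# Three objects described by two interval-scaled variables.
data = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
matrix = dissimilarity_matrix(data)
for row in matrix:
    print(row)
```

Only the lower triangle is stored because d(i, j) = d(j, i).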

How Good Is Clustering?
• Dissimilarity/similarity depends on the distance function
  – Different applications have different functions
• Judgment of clustering quality is typically highly subjective

Types of Data in Clustering
• Interval-scaled variables
• Binary variables
• Nominal, ordinal, and ratio variables
• Variables of mixed types

Interval-scaled Variables
• Continuous measurements on a roughly linear scale
  – Weight, height, latitude and longitude coordinates, temperature, etc.
• Effect of measurement units in attributes
  – Smaller unit → larger variable range → larger effect on the result
  – Standardization + background knowledge

Standardization
• Calculate the mean absolute deviation
$$
s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right),
\qquad
m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right)
$$
• Calculate the standardized measurement (z-score)
$$
z_{if} = \frac{x_{if} - m_f}{s_f}
$$
• Mean absolute deviation is more robust than the standard deviation
  – The effect of outliers is reduced but remains detectable
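
The two steps above can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `standardize` and the sample heights are made up, and raising an error when all values are identical is an assumption (the slide does not say what to do when s_f = 0).

```python
def standardize(values):
    """Return z-scores of one variable, scaled by the mean absolute deviation."""
    n = len(values)
    m = sum(values) / n                        # mean m_f
    s = sum(abs(x - m) for x in values) / n    # mean absolute deviation s_f
    if s == 0:
        # Assumption: treat a constant variable as an error.
        raise ValueError("all values are identical; z-scores are undefined")
    return [(x - m) / s for x in values]

heights_cm = [150.0, 160.0, 170.0, 180.0]
heights_m = [h / 100.0 for h in heights_cm]

z_cm = standardize(heights_cm)
z_m = standardize(heights_m)
print(z_cm)
print(z_m)
```

The z-scores come out the same (up to floating-point rounding) whether the heights are given in centimetres or metres, which is the point of standardizing before computing distances.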

Similarity and Dissimilarity
• Distances are the commonly used measures
• Minkowski distance: a generalization
$$
d(i, j) = \left(|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q\right)^{1/q} \quad (q > 0)
$$
• If q = 2, d is Euclidean distance
• If q = 1, d is Manhattan distance
• If q = ∞, d is Chebyshev distance
• Weighted distance
$$
d(i, j) = \left(w_1 |x_{i1} - x_{j1}|^q + w_2 |x_{i2} - x_{j2}|^q + \cdots + w_p |x_{ip} - x_{jp}|^q\right)^{1/q} \quad (q > 0)
$$
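
A minimal sketch of the Minkowski family, including the weighted form, is given below. The function name `minkowski` and its parameters are illustrative. For q = ∞ the sketch returns the largest coordinate difference and ignores the weights, which is the limit of the weighted formula when all weights are positive; that handling is an assumption, since the slide gives no weighted Chebyshev formula.

```python
import math

def minkowski(x, y, q=2.0, weights=None):
    """Weighted Minkowski distance between two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    if q <= 0:
        raise ValueError("q must be positive")
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(q):
        # Chebyshev distance: limit of the formula as q grows (weights ignored).
        return max(diffs)
    if weights is None:
        weights = [1.0] * len(diffs)
    return sum(w * d ** q for w, d in zip(weights, diffs)) ** (1.0 / q)

i, j = (1.0, 2.0), (4.0, 6.0)
manhattan = minkowski(i, j, q=1)
euclidean = minkowski(i, j, q=2)
chebyshev = minkowski(i, j, q=math.inf)
print(manhattan, euclidean, chebyshev)
```

For the two points shown, the coordinate differences are 3 and 4, so the three distances are 3 + 4, the square root of 9 + 16, and max(3, 4).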

Manhattan and Chebyshev Distance
[Figure: Manhattan distance and Chebyshev distance. When n = 2, Chebyshev distance is the chess distance. Picture from Wikipedia; http://brainking.com/images/rules/chess/02.gif]

Properties of Minkowski Distance
• Nonnegative: d(i, j) ≥ 0
• The distance of an object to itself is 0
  – d(i, i) = 0
• Symmetric: d(i, j) = d(j, i)
• Triangle inequality (holds for q ≥ 1)
  – d(i, j) ≤ d(i, k) + d(k, j)

Binary Variables
• A contingency table for binary data

                    Object j
                    1      0      Sum
  Object i   1      q      r      q+r
             0      s      t      s+t
             Sum    q+s    r+t    p

• Symmetric variable: each state carries the same weight
$$
d(i, j) = \frac{r + s}{q + r + s + t}
$$
  – Invariant similarity
• Asymmetric variable: the positive value carries more weight
$$
d(i, j) = \frac{r + s}{q + r + s}
$$
  – Noninvariant similarity (Jaccard)
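
The contingency counts q, r, s, t and the two dissimilarities can be sketched as follows. The function names and the sample vectors are illustrative, and returning 0 when q + r + s = 0 (two all-zero objects under the asymmetric measure) is an assumption the slide does not cover.

```python
def binary_counts(a, b):
    """Return (q, r, s, t) for two equal-length 0/1 vectors."""
    q = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)  # both 1
    r = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)  # i is 1, j is 0
    s = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)  # i is 0, j is 1
    t = sum(1 for x, y in zip(a, b) if x == 0 and y == 0)  # both 0
    return q, r, s, t

def symmetric_dissimilarity(a, b):
    q, r, s, t = binary_counts(a, b)
    return (r + s) / (q + r + s + t)

def asymmetric_dissimilarity(a, b):
    q, r, s, _ = binary_counts(a, b)   # negative matches t are ignored
    if q + r + s == 0:
        return 0.0                     # assumption: two all-zero objects
    return (r + s) / (q + r + s)

obj_i = [1, 0, 1, 0, 0, 0]
obj_j = [1, 0, 0, 0, 1, 0]
counts = binary_counts(obj_i, obj_j)
d_sym = symmetric_dissimilarity(obj_i, obj_j)
d_asym = asymmetric_dissimilarity(obj_i, obj_j)
print(counts, d_sym, d_asym)
```

Here q = 1, r = 1, s = 1, t = 3, so the symmetric dissimilarity is 2/6 and the asymmetric one is 2/3: dropping the three negative matches makes the objects look less alike.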

Nominal Variables
• A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green
• Method 1: simple matching
$$
d(i, j) = \frac{p - m}{p}
$$
  – m: # of matches, p: total # of variables
• Method 2: use a large number of binary variables
  – Create a new binary variable for each of the M nominal states
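
Both methods can be sketched briefly. The function names, the attribute values, and the list of colour states below are made up for illustration.

```python
def simple_matching_dissimilarity(a, b):
    """Method 1: d(i, j) = (p - m) / p for two tuples of nominal values."""
    p = len(a)                                    # total number of variables
    m = sum(1 for x, y in zip(a, b) if x == y)    # number of matches
    return (p - m) / p

def to_binary_variables(value, states):
    """Method 2: one binary variable per nominal state."""
    return [1 if value == state else 0 for state in states]

obj_i = ("red", "small", "round")
obj_j = ("red", "large", "round")
d = simple_matching_dissimilarity(obj_i, obj_j)

colour_states = ["red", "yellow", "blue", "green"]
encoded = to_binary_variables("blue", colour_states)
print(d, encoded)
```

The two objects agree on 2 of 3 variables, so d = 1/3. With Method 2, the resulting binary variables can then be compared with the binary dissimilarity measures.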

Ordinal Variables
• An ordinal variable can be discrete or continuous
• Order is important, e.g., rank
• Can be treated like interval-scaled
  – Replace $x_{if}$ by its rank $r_{if} \in \{1, \ldots, M_f\}$
  – Map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by
$$
z_{if} = \frac{r_{if} - 1}{M_f - 1}
$$
  – Compute the dissimilarity using methods for interval-scaled variables
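
The rank-and-rescale step can be sketched as below. The function name and the medal example are illustrative, and the sketch assumes the ordered list of states is known in advance and has at least two entries.

```python
def ordinal_to_unit_interval(values, ordered_states):
    """Map ordinal values to [0, 1] using z = (r - 1) / (M - 1)."""
    M = len(ordered_states)
    if M < 2:
        raise ValueError("need at least two ordered states")
    # Ranks start at 1, following r_if in {1, ..., M_f}.
    rank = {state: r for r, state in enumerate(ordered_states, start=1)}
    return [(rank[v] - 1) / (M - 1) for v in values]

states = ["bronze", "silver", "gold"]          # lowest to highest
z = ordinal_to_unit_interval(["gold", "bronze", "silver"], states)
print(z)
```

After this mapping, the lowest state is 0, the highest is 1, and the values can be fed to any distance for interval-scaled variables.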

Ratio-scaled Variables
• Ratio-scaled variable: a positive measurement on a nonlinear scale
  – E.g., approximately at exponential scale, such as $Ae^{Bt}$
• Treat them like interval-scaled variables?
  – Not a good choice: the scale can be distorted!
• Apply logarithmic transformation, $y_{if} = \log(x_{if})$
• Or treat them as continuous ordinal data and treat their ranks as interval-scaled
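
A small sketch of why the logarithmic transformation helps: values that grow by a constant factor become evenly spaced after taking logs. The sample values and the use of base 10 are assumptions for illustration; any logarithm base gives the same effect up to a constant scale.

```python
import math

# Hypothetical ratio-scaled measurements growing roughly exponentially.
x = [1.0, 10.0, 100.0, 1000.0]

# y_if = log(x_if); base 10 chosen only to keep the numbers readable.
y = [math.log10(v) for v in x]

raw_gaps = [b - a for a, b in zip(x, x[1:])]
log_gaps = [b - a for a, b in zip(y, y[1:])]
print(raw_gaps)
print(log_gaps)
```

On the raw scale the last gap dominates any distance computation; on the log scale all three gaps are about equal.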

Variables of Mixed Types
• A database may contain all the six types of variables
  – Symmetric binary, asymmetric binary, nominal, ordinal, interval and ratio
• One may use a weighted formula to combine their effects
$$
d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}
$$
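
The weighted formula can be sketched as a small function that takes, for each variable f, the indicator δ and the per-variable dissimilarity d. The slide does not define δ; the sketch assumes the common textbook convention that δ is 0 when a value is missing for variable f (or when both values are 0 for an asymmetric binary variable) and 1 otherwise. The function name and the sample numbers are illustrative.

```python
def mixed_dissimilarity(per_variable):
    """Combine per-variable dissimilarities.

    per_variable: list of (delta, d) pairs, one per variable f, where
    delta is 0 or 1 and d is a dissimilarity already scaled to [0, 1].
    """
    numerator = sum(delta * d for delta, d in per_variable)
    denominator = sum(delta for delta, _ in per_variable)
    if denominator == 0:
        raise ValueError("no variable contributes to the comparison")
    return numerator / denominator

# Four variables; the third is skipped (delta = 0), e.g. a missing value.
pairs = [(1, 0.5), (1, 0.0), (0, 1.0), (1, 1.0)]
d = mixed_dissimilarity(pairs)
print(d)
```

Only the three contributing variables are averaged, giving (0.5 + 0.0 + 1.0) / 3.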

Clustering Methods
• K-means and partitioning methods
• Hierarchical clustering
• Density-based clustering
• Grid-based clustering
• Pattern-based clustering
• Other clustering methods

Partitioning Algorithms: Ideas
• Partition n objects into k clusters
  – Optimize the chosen partitioning criterion
• Global optimum: examine all possible partitions
  – $(k^n - (k-1)^n - \cdots - 1)$ possible partitions, too expensive!
• Heuristic methods: k-means and k-medoids
  – K-means: a cluster is represented by its center
  – K-medoids or PAM (partitioning around medoids): each cluster is represented by one of the objects in the cluster

K-means
• Arbitrarily choose k objects as the initial cluster centers
• Until no change, do
  – (Re)assign each object to the cluster to which the object is the most similar, based on the mean value of the objects in the cluster
  – Update the cluster means, i.e., calculate the mean value of the objects for each cluster
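
The loop above can be sketched in plain Python. This is a minimal sketch under stated assumptions, not a reference implementation: similarity is taken to be Euclidean distance, the initial centers are passed in by the caller, an empty cluster keeps its previous center, and `max_iter` is a safety cap. The function name `kmeans` and the sample points are illustrative.

```python
import math

def kmeans(points, initial_centers, max_iter=100):
    """Return (centers, assignment) after iterating until no change."""
    centers = [tuple(c) for c in initial_centers]
    assignment = None
    for _ in range(max_iter):
        # (Re)assign each object to the nearest cluster mean.
        new_assignment = [
            min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break                      # no change: stop
        assignment = new_assignment
        # Update each cluster mean from its current members.
        for c in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                # assumption: empty cluster keeps its center
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, assignment

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
# Arbitrarily choose k = 2 objects as the initial centers.
centers, assignment = kmeans(points, initial_centers=[points[0], points[1]])
print(centers)
print(assignment)
```

Even though both initial centers are picked from the same group of points, the assignment settles after a few iterations with the three low points in one cluster and the three high points in the other. A different initial choice can lead to a different result, since the procedure only finds a local optimum.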

K-Means: Example
[Figure: k-means with K = 2 on a 10 × 10 plot. Panels: arbitrarily choose K objects as the initial cluster centers; assign each object to the most similar center; update the cluster means; reassign; update the cluster means; reassign]

Pros and Cons of K-means
• Relatively efficient: O(tkn)
  – n: # objects, k: # clusters, t: # iterations; normally k, t << n
• Often terminates at a local optimum
• Applicable only when the mean is defined
  – What about categorical data?
• Need to specify the number of clusters in advance
• Unable to handle noisy data and outliers well
• Unsuitable for discovering non-convex clusters

Variations of the K-means
• Aspects of variations
  – Selection of the initial k means
  – Dissimilarity calculations
  – Strategies to calculate cluster means
• Handling categorical data: k-modes
  – Use mode instead of mean
    • Mode: the most frequent item(s)
  – A mixture of categorical and numerical data: k-prototype method
• EM (expectation maximization): assign each object a probability of belonging to each cluster (will be discussed later)

A Problem of K-means
• Sensitive to outliers
  – Outlier: objects with extremely large values
    • May substantially distort the distribution of the data
• K-medoids: represent a cluster by its most centrally located object
[Figure: two 10 × 10 scatter plots comparing cluster centers (marked +) with and without an outlier]
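
A one-dimensional sketch of the problem: a single extreme value drags the mean far from the bulk of the data, while the medoid stays on an actual, centrally located object. The helper names and the sample values are made up for illustration; the medoid is computed here as the object with the smallest total absolute distance to all others, which is one common definition.

```python
def mean(values):
    return sum(values) / len(values)

def medoid(values):
    """Return the object with the smallest total distance to all objects."""
    return min(values, key=lambda c: sum(abs(c - v) for v in values))

cluster = [1, 2, 3, 4, 100]     # 100 is an outlier
center_mean = mean(cluster)
center_medoid = medoid(cluster)
print(center_mean, center_medoid)
```

The mean lands at 22, a value far from every non-outlier object, whereas the medoid is 3.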