
Clustering A Categorization of Major Clustering Methods - PowerPoint PPT Presentation



  1. Clustering
     • What is Clustering?
     • Types of Data in Cluster Analysis
     • A Categorization of Major Clustering Methods
     • Partitioning Methods
     • Hierarchical Methods

     What is Clustering?
     • Clustering of data is a method by which large sets of data are grouped into clusters of smaller sets of similar data.
     • Cluster: a collection of data objects
       • Similar to one another within the same cluster
       • Dissimilar to the objects in other clusters
     • Clustering is unsupervised classification: no predefined classes

     What is Clustering?
     • Typical applications
       • As a stand-alone tool to get insight into data distribution
       • As a preprocessing step for other algorithms
     • Use cluster detection when you suspect that there are natural groupings that may represent groups of customers or products that have a lot in common.
     • When there are many competing patterns in the data, making it hard to spot a single pattern, creating clusters of similar records reduces the complexity within clusters so that other data mining techniques are more likely to succeed.

  2. Examples of Clustering Applications
     • Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
     • Land use: Identification of areas of similar land use in an earth observation database
     • Insurance: Identifying groups of motor insurance policy holders with a high average claim cost
     • City planning: Identifying groups of houses according to their house type, value, and geographical location
     • Earthquake studies: Observed earthquake epicenters should be clustered along continent faults

     Clustering definition
     • Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that:
       • data points in one cluster are more similar to one another (high intra-class similarity)
       • data points in separate clusters are less similar to one another (low inter-class similarity)
     • Similarity measures: e.g. Euclidean distance if attributes are continuous.

     Notion of a Cluster is Ambiguous
     [Figure panels: Initial points; Two Clusters; Four Clusters; Six Clusters]

     Requirements of Clustering in Data Mining
     • Scalability
     • Ability to deal with different types of attributes
     • Discovery of clusters with arbitrary shape
     • Minimal requirements for domain knowledge to determine input parameters
     • Able to deal with noise and outliers
     • Insensitive to order of input records
     • High dimensionality
     • Incorporation of user-specified constraints
     • Interpretability and usability

  3. Data Matrix
     • Represents n objects with p variables (attributes, measures)
     • A relational table

     $$\begin{bmatrix}
     x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
     \vdots &        & \vdots &        & \vdots \\
     x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
     \vdots &        & \vdots &        & \vdots \\
     x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
     \end{bmatrix}$$

     Dissimilarity Matrix
     • Proximities of pairs of objects
     • d(i,j): dissimilarity between objects i and j
     • Nonnegative
     • Close to 0: similar

     $$\begin{bmatrix}
     0      &        &        &        \\
     d(2,1) & 0      &        &        \\
     d(3,1) & d(3,2) & 0      &        \\
     \vdots & \vdots & \vdots & \ddots \\
     d(n,1) & d(n,2) & \cdots & \cdots & 0
     \end{bmatrix}$$

     Type of data in clustering analysis
     • Continuous variables
     • Binary variables
     • Nominal and ordinal variables
     • Variables of mixed types
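The step from a data matrix to a dissimilarity matrix can be sketched as follows. This is a minimal illustration, not part of the original slides: the function names `dissimilarity_matrix` and `euclidean` and the sample points are assumptions, and Euclidean distance is used only because the deck names it as a typical measure for continuous attributes.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equally long numeric tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def dissimilarity_matrix(data, dist):
    """Return the lower-triangular dissimilarity matrix as a list of rows.

    Row i holds d(i, 0), ..., d(i, i), so the last entry of each row
    (the diagonal) is d(i, i) = 0 for any proper distance.
    """
    return [[dist(data[i], data[j]) for j in range(i + 1)]
            for i in range(len(data))]


# Hypothetical data matrix: n = 3 objects, p = 2 continuous variables.
data = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
D = dissimilarity_matrix(data, euclidean)
for row in D:
    print(row)
```

Only the lower triangle is stored because d(i,j) = d(j,i), which mirrors the triangular layout shown on the slide.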

  4. Continuous variables
     • To avoid dependence on the choice of measurement units, the data should be standardized.
     • Standardize data
       • Calculate the mean absolute deviation:
         $$s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \dots + |x_{nf} - m_f|\right)$$
         where
         $$m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \dots + x_{nf}\right)$$
       • Calculate the standardized measurement (z-score):
         $$z_{if} = \frac{x_{if} - m_f}{s_f}$$
     • Using the mean absolute deviation is more robust than using the standard deviation. Since the deviations are not squared, the effect of outliers is somewhat reduced, but their z-scores do not become too small; therefore, the outliers remain detectable.

     Similarity/Dissimilarity Between Objects
     • Distances are normally used to measure the similarity or dissimilarity between two data objects.
     • Euclidean distance is probably the most commonly chosen type of distance. It is the geometric distance in the multidimensional space:
       $$d(i,j) = \sqrt{\sum_{k=1}^{p} (x_{ik} - x_{jk})^2}$$
     • Properties
       • d(i,j) ≥ 0
       • d(i,i) = 0
       • d(i,j) = d(j,i)
       • d(i,j) ≤ d(i,k) + d(k,j)

     Similarity/Dissimilarity Between Objects
     • City-block (Manhattan) distance. This distance is simply the sum of absolute differences across dimensions. In most cases, this distance measure yields results similar to the Euclidean distance. However, note that in this measure, the effect of single large differences (outliers) is dampened (since they are not squared).
       $$d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \dots + |x_{ip} - x_{jp}|$$
     • The properties stated for the Euclidean distance also hold for this measure.

     Similarity/Dissimilarity Between Objects
     • Minkowski distance. Sometimes one may want to increase or decrease the progressive weight that is placed on dimensions on which the respective objects are very different. This measure accomplishes that and is computed as:
       $$d(i,j) = \left(|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \dots + |x_{ip} - x_{jp}|^q\right)^{1/q}$$
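The standardization and distance formulas on these slides can be sketched in a few lines. The names `standardize` and `minkowski` and the sample values are assumptions made for illustration. The sketch also assumes that a column is not constant, since the mean absolute deviation would then be zero.

```python
def standardize(column):
    """Z-scores of one variable, using the mean absolute deviation s_f."""
    n = len(column)
    m = sum(column) / n                      # m_f: mean of the variable
    s = sum(abs(x - m) for x in column) / n  # s_f: mean absolute deviation
    return [(x - m) / s for x in column]     # assumes s != 0


def minkowski(x, y, q):
    """Minkowski distance; q = 1 is Manhattan, q = 2 is Euclidean."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)


# Hypothetical values of one variable f for n = 5 objects.
z = standardize([1.0, 2.0, 3.0, 4.0, 5.0])
print([round(v, 3) for v in z])

# Two hypothetical objects with p = 2 variables.
print(minkowski((0.0, 0.0), (3.0, 4.0), 1))  # Manhattan
print(minkowski((0.0, 0.0), (3.0, 4.0), 2))  # Euclidean
```

Larger values of q put more weight on the dimensions where the two objects differ most, which is the "progressive weight" the Minkowski slide refers to.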

  5. Similarity/Dissimilarity Between Objects
     • If we have some idea of the relative importance that should be assigned to each variable, then we can weight them and obtain a weighted distance measure.
       $$d(i,j) = \sqrt{w_1 (x_{i1} - x_{j1})^2 + \dots + w_p (x_{ip} - x_{jp})^2}$$

     Binary Variables
     • A binary variable has only two states: 0 or 1.
     • A binary variable is symmetric if both of its states are equally valuable, that is, there is no preference on which outcome should be coded as 1.
     • A binary variable is asymmetric if the outcomes of its states are not equally important, such as positive or negative outcomes of a disease test.
     • Similarity that is based on symmetric binary variables is called invariant similarity.

     Binary Variables
     • A contingency table for binary data

                          Object j
                          1      0      sum
         Object i   1     a      b      a+b
                    0     c      d      c+d
                    sum   a+c    b+d    p

     • Simple matching coefficient (invariant, if the binary variable is symmetric):
       $$d(i,j) = \frac{b + c}{a + b + c + d}$$
     • Jaccard coefficient (noninvariant if the binary variable is asymmetric):
       $$d(i,j) = \frac{b + c}{a + b + c}$$

     Dissimilarity between Binary Variables
     • Example

         Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
         Jack   M        Y       N       P        N        N        N
         Mary   F        Y       N       P        N        P        N
         Jim    M        Y       P       N        N        N        N

     • Gender is a symmetric attribute.
     • The remaining attributes are asymmetric binary.
     • Let the values Y and P be set to 1, and the value N be set to 0.
     • Jaccard coefficient:
       $$d(\text{jack}, \text{mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33$$
       $$d(\text{jack}, \text{jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67$$
       $$d(\text{jim}, \text{mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75$$
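The Jack, Mary, and Jim example can be checked with a short sketch. The function names are assumptions. The vectors encode only the six asymmetric attributes (Fever, Cough, Test-1 to Test-4) with Y and P as 1 and N as 0, as the example prescribes. The Jaccard coefficient is undefined when a + b + c = 0, a case this sketch does not handle.

```python
def contingency(x, y):
    """Counts a, b, c, d of the 2x2 contingency table for two 0/1 vectors."""
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    d = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 0)
    return a, b, c, d


def simple_matching(x, y):
    """Dissimilarity for symmetric binary variables: (b + c) / (a + b + c + d)."""
    a, b, c, d = contingency(x, y)
    return (b + c) / (a + b + c + d)


def jaccard(x, y):
    """Dissimilarity for asymmetric binary variables: (b + c) / (a + b + c)."""
    a, b, c, _ = contingency(x, y)
    return (b + c) / (a + b + c)  # assumes at least one 1 in x or y


# Fever, Cough, Test-1, Test-2, Test-3, Test-4
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim = [1, 1, 0, 0, 0, 0]

print(round(jaccard(jack, mary), 2))  # 0.33
print(round(jaccard(jack, jim), 2))   # 0.67
print(round(jaccard(jim, mary), 2))   # 0.75
```

The Jaccard coefficient ignores d, the count of joint absences, which is why it suits asymmetric variables where two negative test results carry little information.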

  6. Nominal Variables
     • A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green
     • Method 1: simple matching
       • m: # of matches, p: total # of variables
         $$d(i,j) = \frac{p - m}{p}$$
     • Method 2: use a large number of binary variables
       • creating a new binary variable for each of the M nominal states

     Ordinal Variables
     • On ordinal variables order is important
       • e.g. Gold, Silver, Bronze
     • Can be treated like continuous
       • the ordered states define the ranking $1, \dots, M_f$
       • replace each $x_{if}$ by its rank $r_{if} \in \{1, \dots, M_f\}$
       • map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by
         $$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$
       • compute the dissimilarity using methods for continuous variables

     Variables of Mixed Types
     • A database may contain several/all types of variables
       • continuous, symmetric binary, asymmetric binary, nominal and ordinal.
     • One may use a weighted formula to combine their effects.
       $$d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$$
     • $\delta_{ij}^{(f)} = 0$ if $x_{if}$ is missing, or if $x_{if} = x_{jf} = 0$ and the variable f is asymmetric binary
     • $\delta_{ij}^{(f)} = 1$ otherwise
     • continuous and ordinal variables: $d_{ij}^{(f)}$ is the normalized absolute distance
     • binary and nominal variables: $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$; otherwise $d_{ij}^{(f)} = 1$
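The mixed-type formula can be sketched as below. Several details are assumptions rather than statements from the slides: the names `ordinal_to_unit` and `mixed_dissimilarity`, the use of `None` for a missing value, skipping a variable when either object's value is missing, and normalizing continuous variables by a caller-supplied range (ordinal variables are assumed to be already mapped onto [0, 1], so their range is 1).

```python
def ordinal_to_unit(rank, num_states):
    """Map a rank in 1..M_f onto [0, 1] via (r - 1) / (M_f - 1)."""
    return (rank - 1) / (num_states - 1)


def mixed_dissimilarity(x, y, kinds, ranges):
    """Weighted combination of per-variable dissimilarities.

    kinds[f] is one of "continuous", "ordinal", "symmetric_binary",
    "asymmetric_binary", "nominal"; ranges[f] is only used for
    continuous and ordinal variables (assumed normalization constant).
    """
    numerator = 0.0
    denominator = 0.0
    for xf, yf, kind, rng in zip(x, y, kinds, ranges):
        if xf is None or yf is None:
            continue  # delta = 0: value missing (assumed for either object)
        if kind == "asymmetric_binary" and xf == 0 and yf == 0:
            continue  # delta = 0: joint absence on an asymmetric variable
        if kind in ("continuous", "ordinal"):
            d = abs(xf - yf) / rng          # normalized absolute distance
        else:
            d = 0.0 if xf == yf else 1.0    # binary and nominal: match or not
        numerator += d
        denominator += 1.0                  # delta = 1
    if denominator == 0.0:
        raise ValueError("no comparable variables")
    return numerator / denominator


kinds = ["continuous", "ordinal", "asymmetric_binary", "nominal"]
ranges = [10.0, 1.0, None, None]
# Ordinal states ranked Gold = 1, Silver = 2, Bronze = 3 (M_f = 3).
obj_i = [2.0, ordinal_to_unit(1, 3), 0, "red"]
obj_j = [7.0, ordinal_to_unit(3, 3), 0, "blue"]
print(mixed_dissimilarity(obj_i, obj_j, kinds, ranges))
```

In this hypothetical pair the asymmetric binary variable is skipped because both values are 0, so the result averages the remaining three per-variable dissimilarities (0.5, 1.0, and 1.0).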
