Clustering

Outline:
- What is Clustering?
- Types of Data in Cluster Analysis
- A Categorization of Major Clustering Methods
- Partitioning Methods
- Hierarchical Methods

What is Clustering?
- Clustering of data is a method by which large sets of data are grouped into clusters of smaller sets of similar data.
- Cluster: a collection of data objects
  - Similar to one another within the same cluster
  - Dissimilar to the objects in other clusters
- Clustering is unsupervised classification: there are no predefined classes.

Typical applications:
- As a stand-alone tool to get insight into the data distribution
- As a preprocessing step for other algorithms
- Use cluster detection when you suspect that there are natural groupings that may represent groups of customers or products that have a lot in common.
- When there are many competing patterns in the data, making it hard to spot a single pattern, creating clusters of similar records reduces the complexity within clusters so that other data mining techniques are more likely to succeed.
Examples of Clustering Applications
- Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
- Land use: identification of areas of similar land use in an earth observation database
- Insurance: identifying groups of motor insurance policy holders with a high average claim cost
- City planning: identifying groups of houses according to their house type, value, and geographical location
- Earthquake studies: observed earthquake epicenters should be clustered along continent faults

Clustering definition
Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that:
- data points in one cluster are more similar to one another (high intra-class similarity)
- data points in separate clusters are less similar to one another (low inter-class similarity)
Similarity measures: e.g. Euclidean distance if the attributes are continuous.

Requirements of Clustering in Data Mining
- Scalability
- Ability to deal with different types of attributes
- Discovery of clusters with arbitrary shape
- Minimal requirements for domain knowledge to determine input parameters
- Ability to deal with noise and outliers
- Insensitivity to the order of input records
- High dimensionality
- Incorporation of user-specified constraints
- Interpretability and usability

[Image: http://webscripts.softpedia.com/screenshots/Efficient-K-Means-Clustering-using-JIT_1.png]
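The intra-class vs. inter-class criterion in the definition above can be checked numerically. The sketch below uses two made-up 2-D clusters (the point values are illustrative, not from the slides) and compares the average pairwise Euclidean distance within a cluster to the average distance between clusters:

```python
from itertools import combinations, product
from math import dist  # Euclidean distance between two points (Python 3.8+)

# Two hypothetical, well-separated clusters of 2-D points.
cluster_a = [(1.0, 1.0), (1.5, 1.2), (0.8, 0.9)]
cluster_b = [(8.0, 8.0), (8.4, 7.6), (7.9, 8.3)]

def avg_intra(cluster):
    """Average pairwise distance within one cluster (small = high intra-class similarity)."""
    pairs = list(combinations(cluster, 2))
    return sum(dist(p, q) for p, q in pairs) / len(pairs)

def avg_inter(c1, c2):
    """Average distance between points of different clusters (large = low inter-class similarity)."""
    pairs = list(product(c1, c2))
    return sum(dist(p, q) for p, q in pairs) / len(pairs)

intra = max(avg_intra(cluster_a), avg_intra(cluster_b))
inter = avg_inter(cluster_a, cluster_b)
print(intra < inter)  # a good clustering keeps intra-cluster distances
                      # well below inter-cluster distances
```

For a good clustering, `intra` should be much smaller than `inter`; if the two values are comparable, the grouping does not match the similarity structure of the data.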
http://wiki.na-mic.org/Wiki/index.php/Progress_Report:DTI_Clustering
A project aiming at developing tools in the 3D Slicer for automatic clustering of tractographic paths through diffusion tensor MRI (DTI) data, to 'characterize the strength of connectivity between selected regions in the brain'.

[Image: http://api.ning.com/files/uI4*osegkS5tF-JjFYZai3mGuslDu*-BQ1rFsozaAaDw9IBdc99OjNas3FPKIrdgPXAz34DU0KsbZwl7G8tM5-n4DXTk6Fab/clustering.gif]

Clustering (outline)
- What is Cluster Analysis?
- Types of Data in Cluster Analysis
- A Categorization of Major Clustering Methods
- Partitioning Methods
- Hierarchical Methods

Notion of a Cluster is Ambiguous
[Figure: the same set of initial points can plausibly be grouped into two clusters, four clusters, or six clusters.]
Data Matrix
- Represents n objects with p variables (attributes, measures)
- A relational table:

  x_11 ... x_1f ... x_1p
  ...
  x_i1 ... x_if ... x_ip
  ...
  x_n1 ... x_nf ... x_np

Dissimilarity Matrix
- Proximities of pairs of objects
- d(i,j): dissimilarity between objects i and j
- Nonnegative; close to 0 means similar

  0
  d(2,1)  0
  d(3,1)  d(3,2)  0
  ...
  d(n,1)  d(n,2)  ...  0

Types of data in clustering analysis
- Continuous variables
- Binary variables
- Nominal and ordinal variables
- Variables of mixed types

Continuous variables: standardize data
To avoid dependence on the choice of measurement units, the data should be standardized.
- Calculate the mean absolute deviation:
  s_f = (1/n)(|x_1f - m_f| + |x_2f - m_f| + ... + |x_nf - m_f|)
  where m_f = (1/n)(x_1f + x_2f + ... + x_nf)
- Calculate the standardized measurement (z-score):
  z_if = (x_if - m_f) / s_f
- Using the mean absolute deviation is more robust than using the standard deviation. Since the deviations are not squared, the effect of outliers is somewhat reduced, but their z-scores do not become too small; therefore, the outliers remain detectable.
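The standardization step above can be sketched directly from the two formulas (the function name and example values are illustrative, not from the slides):

```python
def standardize(values):
    """Z-scores using the mean absolute deviation:
    s_f = (1/n) * sum(|x_if - m_f|),   z_if = (x_if - m_f) / s_f
    """
    n = len(values)
    m = sum(values) / n                      # mean m_f
    s = sum(abs(x - m) for x in values) / n  # mean absolute deviation s_f
    return [(x - m) / s for x in values]

# An outlier (100) stays clearly detectable after standardization:
# m = 26.5, s = 36.75, so the outlier's z-score is 73.5 / 36.75 = 2.0.
z = standardize([1.0, 2.0, 3.0, 100.0])
print(z)
```

Had we divided by the (squared-deviation) standard deviation instead, the large outlier would inflate `s` much more, shrinking every z-score; that is the robustness argument the slide makes.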
Similarity/Dissimilarity Between Objects
- Distances are normally used to measure the similarity or dissimilarity between two data objects.
- Euclidean distance is probably the most commonly chosen type of distance. It is the geometric distance in the multidimensional space:

  d(i,j) = sqrt( sum_{k=1..p} (x_ki - x_kj)^2 )

- Required properties for a distance function:
  - d(i,j) >= 0
  - d(i,i) = 0
  - d(i,j) = d(j,i)
  - d(i,j) <= d(i,k) + d(k,j)
  http://uk.geocities.com/ahf_alternate/dist.htm#S2

City-block (Manhattan) distance
- The distance if you had to travel along coordinates only: simply the sum of absolute differences across dimensions.

  d(i,j) = |x_i1 - x_j1| + |x_i2 - x_j2| + ... + |x_ip - x_jp|

- In most cases, this distance measure yields results similar to the Euclidean distance. However, note that in this measure the effect of single large differences (outliers) is dampened (since they are not squared).
- Example: for x = (5,5) and y = (9,8), the Euclidean distance is sqrt(4^2 + 3^2) = 5, while the Manhattan distance is 4 + 3 = 7.
- The properties stated for the Euclidean distance also hold for this measure.
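Both distances are a few lines of code; the example from the slide, x = (5,5) and y = (9,8), serves as a check:

```python
from math import sqrt

def euclidean(x, y):
    """Geometric distance: sqrt of the sum of squared coordinate differences."""
    return sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def manhattan(x, y):
    """City-block distance: sum of absolute coordinate differences."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

x, y = (5, 5), (9, 8)
print(euclidean(x, y))  # sqrt(4^2 + 3^2) = 5.0
print(manhattan(x, y))  # 4 + 3 = 7
```

Note that the Manhattan distance is never smaller than the Euclidean distance for the same pair of points, which is why large single-coordinate differences dominate it less in relative terms.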
Similarity/Dissimilarity Between Objects

Minkowski distance
- Sometimes one may want to increase or decrease the progressive weight that is placed on dimensions on which the respective objects are very different. This measure makes that possible and is computed as:

  d(i,j) = ( |x_i1 - x_j1|^q + |x_i2 - x_j2|^q + ... + |x_ip - x_jp|^q )^(1/q)

Weighted distances
- If we have some idea of the relative importance that should be assigned to each variable, then we can weight them and obtain a weighted distance measure:

  d(i,j) = sqrt( w_1 (x_i1 - x_j1)^2 + ... + w_p (x_ip - x_jp)^2 )

Binary Variables
- A binary variable has only two states: 0 or 1.
- A binary variable is symmetric if both of its states are equally valuable, that is, there is no preference on which outcome should be coded as 1.
- A binary variable is asymmetric if the outcomes of the states are not equally important, such as the positive or negative outcomes of a disease test.
- Similarity that is based on symmetric binary variables is called invariant similarity.
- A contingency table for binary data:

               Object j
                1      0     sum
  Object i  1   a      b     a+b
            0   c      d     c+d
          sum  a+c    b+d     p
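A sketch of the Minkowski distance, plus an invariant similarity built from the contingency table above. The slides name the similarity but do not give its formula; the version below assumes the standard simple matching coefficient, (a + d) / p, which counts agreeing positions over all positions:

```python
def minkowski(x, y, q):
    """Minkowski distance: q = 1 gives Manhattan, q = 2 gives Euclidean."""
    return sum(abs(xi - yi) ** q for xi, yi in zip(x, y)) ** (1.0 / q)

def simple_matching(i, j):
    """Invariant similarity for symmetric binary variables, assumed here to be
    the simple matching coefficient (a + d) / (a + b + c + d)."""
    a = sum(1 for u, v in zip(i, j) if u == 1 and v == 1)  # both 1
    b = sum(1 for u, v in zip(i, j) if u == 1 and v == 0)  # i only
    c = sum(1 for u, v in zip(i, j) if u == 0 and v == 1)  # j only
    d = sum(1 for u, v in zip(i, j) if u == 0 and v == 0)  # both 0
    return (a + d) / (a + b + c + d)

print(minkowski((5, 5), (9, 8), 1))  # 7.0 (Manhattan)
print(minkowski((5, 5), (9, 8), 2))  # 5.0 (Euclidean)
print(simple_matching([1, 0, 1, 1], [1, 0, 0, 1]))  # a=2, b=1, c=0, d=1 -> 0.75
```

"Invariant" here means the value is unchanged if the 0 and 1 codings are swapped, which is exactly why this coefficient suits symmetric binary variables; for asymmetric ones the `d` (both-0) count should typically be dropped.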