Data Mining: Clustering
Hamid Beigy
Sharif University of Technology
Fall 1396
Table of contents
1 Introduction
2 Data matrix and dissimilarity matrix
3 Proximity Measures
4 Clustering methods
  Partitioning methods
  Hierarchical methods
  Model-based clustering
  Density-based clustering
  Grid-based clustering
5 Cluster validation and assessment
Introduction
Clustering is the process of grouping a set of data objects into multiple groups or clusters so that objects within a cluster have high similarity to one another, but are very dissimilar to objects in other clusters. Dissimilarities and similarities are assessed based on the attribute values describing the objects and often involve distance measures. Clustering as a data mining tool has its roots in many application areas such as biology, security, business intelligence, and Web search.
Requirements for cluster analysis
Clustering is a challenging research field, and the following are its typical requirements:
  Scalability
  Ability to deal with different types of attributes
  Discovery of clusters with arbitrary shape
  Requirements for domain knowledge to determine input parameters
  Ability to deal with noisy data
  Incremental clustering and insensitivity to input order
  Capability of clustering high-dimensionality data
  Constraint-based clustering
  Interpretability and usability
Comparing clustering methods
Clustering methods can be compared using the following aspects:
  The partitioning criteria: In some methods, all the objects are partitioned so that no hierarchy exists among the clusters.
  Separation of clusters: In some methods, data are partitioned into mutually exclusive clusters, while in other methods the clusters may not be exclusive; that is, a data object may belong to more than one cluster.
  Similarity measure: Some methods determine the similarity between two objects by the distance between them, while in other methods similarity may be defined by connectivity based on density or contiguity.
  Clustering space: Many clustering methods search for clusters within the entire data space. These methods are useful for low-dimensionality data sets. With high-dimensional data, however, there can be many irrelevant attributes, which can make similarity measurements unreliable. Consequently, clusters found in the full space are often meaningless. It is often better to instead search for clusters within different subspaces of the same data set.
Data matrix and dissimilarity matrix
Suppose that we have n objects described by p attributes. The objects are x_1 = (x_11, x_12, ..., x_1p), x_2 = (x_21, x_22, ..., x_2p), and so on, where x_ij is the value of the j-th attribute for object x_i. For brevity, we hereafter refer to object x_i as object i. The objects may be tuples in a relational database, and are also referred to as data samples or feature vectors.
Main memory-based clustering and nearest-neighbor algorithms typically operate on either of the following two data structures:
Data matrix: This structure stores the n objects in the form of a table or n x p matrix:

    x_11 ... x_1f ... x_1p
    ...
    x_i1 ... x_if ... x_ip
    ...
    x_n1 ... x_nf ... x_np

Dissimilarity matrix: This structure stores a collection of proximities that are available for all pairs of objects. It is often represented by an n x n matrix or table:

    0       d(1,2)  d(1,3)  ...  d(1,n)
    d(2,1)  0       d(2,3)  ...  d(2,n)
    ...
    d(n,1)  d(n,2)  d(n,3)  ...  0
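The relationship between the two structures can be sketched in a few lines of Python (a minimal illustration; the function and variable names are our own, and Euclidean distance stands in for whatever proximity measure suits the attribute types):

```python
import math

def dissimilarity_matrix(data, dist):
    """Build the n x n dissimilarity matrix from an n x p data matrix.

    `data` is a list of p-dimensional feature vectors; `dist` is any
    pairwise dissimilarity function d(x, y).
    """
    n = len(data)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i):
            # d(i,j) = d(j,i), and d(i,i) = 0 on the diagonal
            d[i][j] = d[j][i] = dist(data[i], data[j])
    return d

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

X = [(0.0, 0.0), (3.0, 4.0), (3.0, 0.0)]   # 3 x 2 data matrix
D = dissimilarity_matrix(X, euclidean)       # 3 x 3 dissimilarity matrix
```

Only the lower triangle needs to be computed, since the matrix is symmetric with a zero diagonal.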
Proximity Measures
Proximity measures for nominal attributes: Let the number of states of a nominal attribute be M. The dissimilarity between two objects i and j can be computed from the ratio of mismatches:

    d(i,j) = (p - m) / p

where m is the number of attributes on which i and j match and p is the total number of attributes describing the objects.
Proximity measures for binary attributes: Binary attributes are either symmetric or asymmetric. The contingency table for two binary objects is

                 Object j
                 1      0      sum
    Object i 1   q      r      q + r
             0   s      t      s + t
    sum          q + s  r + t  p

For symmetric binary attributes, dissimilarity is calculated as

    d(i,j) = (r + s) / (q + r + s + t)

For asymmetric binary attributes, where the number of negative matches t is unimportant and the number of positive matches q is important, dissimilarity is calculated as

    d(i,j) = (r + s) / (q + r + s)

The coefficient 1 - d(i,j) is called the Jaccard coefficient.
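The binary-attribute formulas can be sketched directly from the q, r, s, t counts (a minimal illustration; the function name and example vectors are our own):

```python
def binary_dissimilarity(x, y, asymmetric=False):
    """Dissimilarity between two binary vectors.

    q: 1-1 matches, r: 1-0 mismatches, s: 0-1 mismatches, t: 0-0 matches.
    Symmetric:  d = (r + s) / (q + r + s + t)
    Asymmetric: d = (r + s) / (q + r + s)   (0-0 matches ignored)
    """
    q = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    t = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    if asymmetric:
        return (r + s) / (q + r + s)
    return (r + s) / (q + r + s + t)

x = [1, 0, 1, 0, 0, 0]
y = [1, 1, 0, 0, 0, 0]
d_sym = binary_dissimilarity(x, y)                  # (1+1)/6
d_asym = binary_dissimilarity(x, y, asymmetric=True)  # (1+1)/3
jaccard = 1 - d_asym
```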
Proximity Measures (cont.)
Dissimilarity of numeric attributes: The most popular distance measure is the Euclidean distance

    d(i,j) = sqrt( (x_i1 - x_j1)^2 + (x_i2 - x_j2)^2 + ... + (x_ip - x_jp)^2 )

Another well-known measure is the Manhattan distance

    d(i,j) = |x_i1 - x_j1| + |x_i2 - x_j2| + ... + |x_ip - x_jp|

The Minkowski distance is a generalization of the Euclidean and Manhattan distances:

    d(i,j) = ( |x_i1 - x_j1|^h + |x_i2 - x_j2|^h + ... + |x_ip - x_jp|^h )^(1/h)

Dissimilarity of ordinal attributes: We first replace each x_if by its corresponding rank r_if in {1, ..., M_f} and then normalize it to [0, 1] using

    z_if = (r_if - 1) / (M_f - 1)

Dissimilarity can then be computed on the z_if values using the distance measures for numeric attributes.
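These measures can be sketched with a single Minkowski function, since h = 1 gives Manhattan and h = 2 gives Euclidean (a minimal illustration; the function names are our own):

```python
def minkowski(x, y, h):
    """Minkowski distance: (sum_f |x_f - y_f|^h)^(1/h).

    h = 1 is the Manhattan distance; h = 2 is the Euclidean distance.
    """
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1.0 / h)

def ordinal_to_numeric(rank, M):
    """Map a rank in {1, ..., M} onto [0, 1]: z = (r - 1) / (M - 1)."""
    return (rank - 1) / (M - 1)

x, y = [1.0, 2.0], [4.0, 6.0]
manhattan = minkowski(x, y, 1)   # |1-4| + |2-6| = 7
euclid = minkowski(x, y, 2)      # sqrt(9 + 16) = 5
```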
Proximity Measures (cont.)
Dissimilarity for attributes of mixed types: A preferable approach is to process all attribute types together, performing a single analysis:

    d(i,j) = ( sum_{f=1}^{p} delta_ij^(f) d_ij^(f) ) / ( sum_{f=1}^{p} delta_ij^(f) )

where the indicator delta_ij^(f) = 0 if either x_if or x_jf is missing, or if x_if = x_jf = 0 and attribute f is asymmetric binary; otherwise delta_ij^(f) = 1. The distance d_ij^(f) is computed based on the type of attribute f.
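The weighted combination can be sketched as follows (our own simplified code: it assumes numeric values are already scaled to [0, 1] and uses None to mark missing values):

```python
def mixed_dissimilarity(x, y, types):
    """d(i,j) = sum_f delta^(f) * d^(f) / sum_f delta^(f).

    `types[f]` is 'numeric', 'nominal', or 'asym_binary'.
    delta^(f) = 0 when a value is missing (None) or when f is
    asymmetric binary with a 0-0 match; otherwise delta^(f) = 1.
    """
    num, den = 0.0, 0.0
    for a, b, t in zip(x, y, types):
        if a is None or b is None:
            continue                      # delta = 0: missing value
        if t == 'asym_binary' and a == 0 and b == 0:
            continue                      # delta = 0: 0-0 match ignored
        if t == 'numeric':
            d = abs(a - b)                # assumes values pre-scaled to [0, 1]
        else:                             # nominal or asymmetric binary
            d = 0.0 if a == b else 1.0
        num += d
        den += 1.0
    return num / den

x = [0.2, 'red', 1, None]
y = [0.6, 'red', 0, 0.9]
d = mixed_dissimilarity(x, y, ['numeric', 'nominal', 'asym_binary', 'numeric'])
```

Here the last attribute is excluded because of the missing value, so the result averages over the remaining three attributes.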
Clustering methods
There are many clustering algorithms in the literature. It is difficult to provide a crisp categorization of clustering methods because the categories may overlap, so that a method may have features from several categories. In general, the major fundamental clustering methods can be classified into the following categories.
Partitioning methods:
  - Find mutually exclusive clusters of spherical shape
  - Distance-based
  - May use mean or medoid (etc.) to represent cluster center
  - Effective for small- to medium-size data sets
Hierarchical methods:
  - Clustering is a hierarchical decomposition (i.e., multiple levels)
  - Cannot correct erroneous merges or splits
  - May incorporate other techniques like microclustering or consider object "linkages"
Density-based methods:
  - Can find arbitrarily shaped clusters
  - Clusters are dense regions of objects in space that are separated by low-density regions
  - Cluster density: each point must have a minimum number of points within its "neighborhood"
  - May filter out outliers
Grid-based methods:
  - Use a multiresolution grid data structure
  - Fast processing time (typically independent of the number of data objects, yet dependent on grid size)
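As an illustration of the partitioning family, here is a minimal Lloyd-style k-means sketch (our own simplified code: it uses the first k points as initial centers, which a practical implementation would replace with random or k-means++ initialization):

```python
def kmeans(points, k, iters=100):
    """Minimal k-means: assign each point to its nearest center by
    squared Euclidean distance, then recompute centers as cluster means."""
    centers = list(points[:k])            # naive init: first k points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            c = min(range(k),
                    key=lambda i: sum((a - b) ** 2
                                      for a, b in zip(p, centers[i])))
            clusters[c].append(p)
        # Mean of each cluster; keep the old center if a cluster is empty.
        new_centers = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:        # converged: no center moved
            break
        centers = new_centers
    return centers, clusters

pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers, clusters = kmeans(pts, 2)
```

Note how the sketch matches the table: clusters are mutually exclusive, distance-based, and represented by their mean, which is what biases partitioning methods toward spherical clusters.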