Data Mining and Machine Learning: Fundamental Concepts and Algorithms, Chapter 17: Clustering Validation


  1. Data Mining and Machine Learning: Fundamental Concepts and Algorithms (dataminingbook.info). Mohammed J. Zaki, Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, USA. Wagner Meira Jr., Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil. Chapter 17: Clustering Validation.

  2. Clustering Validation and Evaluation

Cluster validation and assessment encompasses three main tasks:
- Clustering evaluation seeks to assess the goodness or quality of the clustering.
- Clustering stability seeks to understand the sensitivity of the clustering result to various algorithmic parameters, for example, the number of clusters.
- Clustering tendency assesses the suitability of applying clustering in the first place, that is, whether the data has any inherent grouping structure.

Validity measures can be divided into three main types:
- External: external validation measures employ criteria that are not inherent to the dataset, e.g., class labels.
- Internal: internal validation measures employ criteria that are derived from the data itself, e.g., intracluster and intercluster distances.
- Relative: relative validation measures aim to directly compare different clusterings, usually those obtained via different parameter settings for the same algorithm.

  3. External Measures

External measures assume that the correct or ground-truth clustering is known a priori and is used to evaluate a given clustering.

Let $D = \{\mathbf{x}_i\}_{i=1}^{n}$ be a dataset consisting of $n$ points in a $d$-dimensional space, partitioned into $k$ clusters. Let $y_i \in \{1, 2, \ldots, k\}$ denote the ground-truth cluster membership or label information for each point. The ground-truth clustering is given as $T = \{T_1, T_2, \ldots, T_k\}$, where the cluster $T_j$ consists of all the points with label $j$, i.e., $T_j = \{\mathbf{x}_i \in D \mid y_i = j\}$. We refer to $T$ as the ground-truth partitioning, and to each $T_j$ as a partition.

Let $C = \{C_1, \ldots, C_r\}$ denote a clustering of the same dataset into $r$ clusters, obtained via some clustering algorithm, and let $\hat{y}_i \in \{1, 2, \ldots, r\}$ denote the cluster label for $\mathbf{x}_i$.

  4. External Measures

External evaluation measures try to capture the extent to which points from the same partition appear in the same cluster, and the extent to which points from different partitions are grouped in different clusters.

All of the external measures rely on the $r \times k$ contingency table $\mathbf{N}$ that is induced by a clustering $C$ and the ground-truth partitioning $T$, defined as follows:

$$\mathbf{N}(i, j) = n_{ij} = |C_i \cap T_j|$$

The count $n_{ij}$ denotes the number of points that are common to cluster $C_i$ and ground-truth partition $T_j$. Let $n_i = |C_i|$ denote the number of points in cluster $C_i$, and let $m_j = |T_j|$ denote the number of points in partition $T_j$. The contingency table can be computed from $T$ and $C$ in $O(n)$ time by examining the partition and cluster labels, $y_i$ and $\hat{y}_i$, for each point $\mathbf{x}_i \in D$ and incrementing the corresponding count $n_{\hat{y}_i y_i}$.
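
As a concrete illustration of the $O(n)$ construction, here is a minimal Python sketch (not from the slides; the function and variable names are our own) that builds the contingency table from two 0-based integer label arrays:

```python
import numpy as np

def contingency_table(yhat, y):
    """Build the r x k contingency table N with N[i, j] = |C_i ∩ T_j|.

    yhat holds 0-based cluster labels and y holds 0-based ground-truth
    labels (the slides use 1-based labels). One pass over the points,
    one counter increment per point, gives the O(n) bound.
    """
    r, k = yhat.max() + 1, y.max() + 1
    N = np.zeros((r, k), dtype=int)
    for ci, tj in zip(yhat, y):
        N[ci, tj] += 1
    return N
```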

  5. Matching Based Measures: Purity

Purity quantifies the extent to which a cluster $C_i$ contains entities from only one partition:

$$\text{purity}_i = \frac{1}{n_i} \max_{j=1}^{k} \{n_{ij}\}$$

The purity of clustering $C$ is defined as the weighted sum of the clusterwise purity values:

$$\text{purity} = \sum_{i=1}^{r} \frac{n_i}{n}\, \text{purity}_i = \frac{1}{n} \sum_{i=1}^{r} \max_{j=1}^{k} \{n_{ij}\}$$

where the ratio $n_i/n$ denotes the fraction of points in cluster $C_i$.
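
Given the contingency table, purity is a one-liner; a sketch reusing the NumPy array `N` produced by the `contingency_table` helper above:

```python
def purity(N):
    """purity = (1/n) * sum_i max_j n_ij: take each cluster's largest
    overlap with any single partition, sum, and normalize by n."""
    return N.max(axis=1).sum() / N.sum()
```

On the good-case Iris table shown on a later slide, this yields $133/150 \approx 0.887$, matching the reported value.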

  6. Matching Based Measures: Maximum Matching

The maximum matching measure selects the mapping between clusters and partitions such that the sum of the number of common points ($n_{ij}$) is maximized, provided that only one cluster can match with a given partition.

Let $G$ be a bipartite graph over the vertex set $V = C \cup T$, and let the edge set be $E = \{(C_i, T_j)\}$ with edge weights $w(C_i, T_j) = n_{ij}$. A matching $M$ in $G$ is a subset of $E$ such that the edges in $M$ are pairwise nonadjacent, that is, they do not have a common vertex. The maximum weight matching in $G$ is given as:

$$\text{match} = \arg\max_{M} \left\{ \frac{w(M)}{n} \right\}$$

where $w(M)$ is the sum of all the edge weights in matching $M$, given as $w(M) = \sum_{e \in M} w(e)$.
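
The maximum weight matching can be computed with the Hungarian algorithm; a sketch using SciPy's `linear_sum_assignment` (assuming SciPy is available; the table is padded to a square matrix so the number of clusters and partitions may differ):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def max_matching(N):
    """Return w(M)/n for the maximum-weight one-to-one matching M
    between clusters (rows of N) and partitions (columns of N)."""
    r, k = N.shape
    W = np.zeros((max(r, k), max(r, k)), dtype=int)
    W[:r, :k] = N                      # pad with zero-weight edges
    rows, cols = linear_sum_assignment(W, maximize=True)
    return W[rows, cols].sum() / N.sum()
```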

  7. Matching Based Measures: F-measure

Given cluster $C_i$, let $j_i$ denote the partition that contains the maximum number of points from $C_i$, that is, $j_i = \arg\max_{j=1}^{k} \{n_{ij}\}$.

The precision of a cluster $C_i$ is the same as its purity:

$$\text{prec}_i = \frac{1}{n_i} \max_{j=1}^{k} \{n_{ij}\} = \frac{n_{ij_i}}{n_i}$$

The recall of cluster $C_i$ is defined as:

$$\text{recall}_i = \frac{n_{ij_i}}{|T_{j_i}|} = \frac{n_{ij_i}}{m_{j_i}}$$

where $m_{j_i} = |T_{j_i}|$.

  8. Matching Based Measures: F-measure

The F-measure is the harmonic mean of the precision and recall values for each cluster $C_i$:

$$F_i = \frac{2}{\frac{1}{\text{prec}_i} + \frac{1}{\text{recall}_i}} = \frac{2 \cdot \text{prec}_i \cdot \text{recall}_i}{\text{prec}_i + \text{recall}_i} = \frac{2\, n_{ij_i}}{n_i + m_{j_i}}$$

The F-measure for the clustering $C$ is the mean of the clusterwise F-measure values:

$$F = \frac{1}{r} \sum_{i=1}^{r} F_i$$
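
Precision, recall, and the harmonic mean combine into a short vectorized sketch (names again our own, operating on the NumPy contingency table `N`):

```python
import numpy as np

def f_measure(N):
    """F = (1/r) * sum_i F_i, with F_i = 2 n_{i,j_i} / (n_i + m_{j_i})."""
    n_i = N.sum(axis=1)          # cluster sizes n_i
    m_j = N.sum(axis=0)          # partition sizes m_j
    ji = N.argmax(axis=1)        # majority partition j_i for each cluster
    F_i = 2 * N[np.arange(N.shape[0]), ji] / (n_i + m_j[ji])
    return F_i.mean()
```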

  9. K-means: Iris Principal Components Data, Good Case

[Figure: K-means clustering of the Iris data projected onto the first two principal components, plotted in the $(u_1, u_2)$ plane with $u_1 \in [-4, 3]$ and $u_2 \in [-1.5, 1.0]$; squares, circles, and triangles mark clusters $C_1$, $C_2$, and $C_3$.]

Contingency table:

                 iris-setosa   iris-versicolor   iris-virginica
                 T_1           T_2               T_3              n_i
C_1 (squares)     0            47                14                61
C_2 (circles)    50             0                 0                50
C_3 (triangles)   0             3                36                39
m_j              50            50                50             n = 150

purity = 0.887, match = 0.887, F = 0.885.

  10. K-means: Iris Principal Components Data, Bad Case

[Figure: K-means clustering of the Iris data projected onto the first two principal components, plotted in the $(u_1, u_2)$ plane with $u_1 \in [-4, 3]$ and $u_2 \in [-1.5, 1.0]$; squares, circles, and triangles mark clusters $C_1$, $C_2$, and $C_3$.]

Contingency table:

                 iris-setosa   iris-versicolor   iris-virginica
                 T_1           T_2               T_3              n_i
C_1 (squares)    30             0                 0                30
C_2 (circles)    20             4                 0                24
C_3 (triangles)   0            46                50                96
m_j              50            50                50             n = 150

purity = 0.667, match = 0.560, F = 0.658.
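
The two Iris contingency tables above double as a spot-check for the sketches; entering them verbatim reproduces the reported scores (up to rounding):

```python
import numpy as np

N_good = np.array([[ 0, 47, 14],
                   [50,  0,  0],
                   [ 0,  3, 36]])
N_bad  = np.array([[30,  0,  0],
                   [20,  4,  0],
                   [ 0, 46, 50]])
for N in (N_good, N_bad):
    print(round(purity(N), 3), round(max_matching(N), 3),
          round(f_measure(N), 3))
# 0.887 0.887 0.885   (good case)
# 0.667 0.56 0.658    (bad case)
```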

  11. Entropy-based Measures: Conditional Entropy

The entropy of a clustering $C$ and of a partitioning $T$ are given as:

$$H(C) = -\sum_{i=1}^{r} p_{C_i} \log p_{C_i} \qquad H(T) = -\sum_{j=1}^{k} p_{T_j} \log p_{T_j}$$

where $p_{C_i} = n_i/n$ and $p_{T_j} = m_j/n$ are the probabilities of cluster $C_i$ and partition $T_j$.

The cluster-specific entropy of $T$, that is, the conditional entropy of $T$ with respect to cluster $C_i$, is defined as:

$$H(T \mid C_i) = -\sum_{j=1}^{k} \left(\frac{n_{ij}}{n_i}\right) \log \left(\frac{n_{ij}}{n_i}\right)$$
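
A sketch of the cluster-specific entropy, assuming base-2 logarithms (the slides leave the base unspecified) and the usual convention $0 \log 0 = 0$:

```python
import numpy as np

def cond_entropy_cluster(N, i):
    """H(T | C_i) = -sum_j (n_ij/n_i) log(n_ij/n_i), skipping zero
    counts (the 0 log 0 = 0 convention)."""
    p = N[i] / N[i].sum()
    p = p[p > 0]                     # drop empty cells
    return -(p * np.log2(p)).sum()
```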

  12. Entropy-based Measures: Conditional Entropy

The conditional entropy of $T$ given clustering $C$ is defined as the weighted sum:

$$H(T \mid C) = \sum_{i=1}^{r} \frac{n_i}{n} H(T \mid C_i) = -\sum_{i=1}^{r} \sum_{j=1}^{k} p_{ij} \log \left(\frac{p_{ij}}{p_{C_i}}\right) = H(C, T) - H(C)$$

where $p_{ij} = n_{ij}/n$ is the probability that a point in cluster $i$ also belongs to partition $j$, and where $H(C, T) = -\sum_{i=1}^{r} \sum_{j=1}^{k} p_{ij} \log p_{ij}$ is the joint entropy of $C$ and $T$.

$H(T \mid C) = 0$ if and only if $T$ is completely determined by $C$, corresponding to the ideal clustering. If $C$ and $T$ are independent of each other, then $H(T \mid C) = H(T)$.
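
The weighted sum then follows directly; this sketch builds on `cond_entropy_cluster` above and spot-checks the identity $H(T \mid C) = H(C, T) - H(C)$ on whatever table it is given:

```python
import numpy as np

def cond_entropy(N):
    """H(T | C) = sum_i (n_i/n) H(T | C_i)."""
    n = N.sum()
    p_c = N.sum(axis=1) / n                        # cluster priors p_{C_i}
    HTC = sum(w * cond_entropy_cluster(N, i)
              for i, w in enumerate(p_c) if w > 0)
    p = N[N > 0] / n                               # nonzero joint p_ij
    H_CT = -(p * np.log2(p)).sum()                 # joint entropy H(C, T)
    H_C = -(p_c[p_c > 0] * np.log2(p_c[p_c > 0])).sum()
    assert np.isclose(HTC, H_CT - H_C)             # H(T|C) = H(C,T) - H(C)
    return HTC
```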
