  1. Data Mining and Machine Learning: Fundamental Concepts and Algorithms
dataminingbook.info
Mohammed J. Zaki, Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, USA
Wagner Meira Jr., Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
Chapter 14: Hierarchical Clustering

  2. Hierarchical Clustering
The goal of hierarchical clustering is to create a sequence of nested partitions, which can be conveniently visualized via a tree or hierarchy of clusters, also called the cluster dendrogram. The clusters in the hierarchy range from fine-grained to coarse-grained: the lowest level of the tree (the leaves) consists of each point in its own cluster, whereas the highest level (the root) consists of all points in one cluster. Agglomerative hierarchical clustering methods work in a bottom-up manner. Starting with each of the n points in a separate cluster, they repeatedly merge the most similar pair of clusters until all points are members of the same cluster.

  3. Hierarchical Clustering: Nested Partitions
Given a dataset D = {x_1, ..., x_n}, where x_i ∈ R^d, a clustering C = {C_1, ..., C_k} is a partition of D. A clustering A = {A_1, ..., A_r} is said to be nested in another clustering B = {B_1, ..., B_s} if and only if r > s, and for each cluster A_i ∈ A there exists a cluster B_j ∈ B such that A_i ⊆ B_j. Hierarchical clustering yields a sequence of n nested partitions C_1, ..., C_n, where the clustering C_{t-1} is nested in the clustering C_t. The cluster dendrogram is a rooted binary tree that captures this nesting structure, with an edge between cluster C_i ∈ C_{t-1} and cluster C_j ∈ C_t if C_i is nested in C_j, that is, if C_i ⊂ C_j.
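To make the nesting condition concrete, here is a minimal Python sketch (not from the book; the function name is_nested is illustrative) that checks whether one clustering is nested in another, representing each cluster as a set of point labels.

```python
# Check whether clustering A is nested in clustering B: A must be the finer
# partition (strictly more clusters), and every cluster of A must be a
# subset of some cluster of B.

def is_nested(A, B):
    """A and B are lists of sets over the same point labels."""
    if len(A) <= len(B):                       # r > s must hold
        return False
    return all(any(a <= b for b in B) for a in A)

# Example: C2 = {AB},{C},{D},{E} is nested in C3 = {AB},{CD},{E}
C2 = [{"A", "B"}, {"C"}, {"D"}, {"E"}]
C3 = [{"A", "B"}, {"C", "D"}, {"E"}]
print(is_nested(C2, C3))                       # True
```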

  4. Hierarchical Clustering Dendrogram
[Dendrogram over the points A, B, C, D, E: A and B merge into AB, C and D merge into CD, AB and CD merge into ABCD, and finally ABCD and E merge into ABCDE.]
The dendrogram represents the following sequence of nested partitions:

Clustering | Clusters
C_1 | {A}, {B}, {C}, {D}, {E}
C_2 | {AB}, {C}, {D}, {E}
C_3 | {AB}, {CD}, {E}
C_4 | {ABCD}, {E}
C_5 | {ABCDE}

with C_{t-1} ⊂ C_t for t = 2, ..., 5. We assume that A and B are merged before C and D.

  5. Number of Hierarchical Clusterings
The total number of different dendrograms with n leaves is given as:

∏_{m=1}^{n-1} (2m − 1) = 1 × 3 × 5 × 7 × ⋯ × (2n − 3) = (2n − 3)!!

[Figure: the possible dendrograms for (a) n = 1, (b) n = 2, and (c) n = 3 leaves.]
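As a quick sanity check of the count, the following small Python sketch (illustrative, not from the book) evaluates the product ∏_{m=1}^{n-1} (2m − 1) = (2n − 3)!! for small n.

```python
# Number of distinct dendrograms on n leaves: (2n - 3)!! = 1 * 3 * 5 * ... * (2n - 3).

def num_dendrograms(n):
    count = 1
    for m in range(1, n):          # m = 1, ..., n-1
        count *= 2 * m - 1
    return count

for n in range(1, 6):
    print(n, num_dendrograms(n))   # 1:1, 2:1, 3:3, 4:15, 5:105
```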

  6. Agglomerative Hierarchical Clustering
In agglomerative hierarchical clustering, we begin with each of the n points in a separate cluster. We repeatedly merge the two closest clusters until all points are members of the same cluster. Given a set of clusters C = {C_1, C_2, ..., C_m}, we find the closest pair of clusters C_i and C_j and merge them into a new cluster C_ij = C_i ∪ C_j. Next, we update the set of clusters by removing C_i and C_j and adding C_ij, as follows: C = (C \ {C_i, C_j}) ∪ {C_ij}. We repeat the process until C contains only one cluster. If specified, we can stop the merging process when there are exactly k clusters remaining.

  7. Agglomerative Hierarchical Clustering Algorithm
AgglomerativeClustering(D, k):
1. C ← {C_i = {x_i} | x_i ∈ D}            // Each point in separate cluster
2. Δ ← {δ(x_i, x_j) : x_i, x_j ∈ D}       // Compute distance matrix
3. repeat
4.   Find the closest pair of clusters C_i, C_j ∈ C
5.   C_ij ← C_i ∪ C_j                     // Merge the clusters
6.   C ← (C \ {C_i, C_j}) ∪ {C_ij}        // Update the clustering
7.   Update distance matrix Δ to reflect the new clustering
8. until |C| = k
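The following Python sketch is a direct, unoptimized rendering of the algorithm above under single link. For clarity it recomputes inter-cluster distances on each iteration instead of maintaining the distance matrix Δ, and all function names are illustrative.

```python
import numpy as np
from itertools import combinations

def single_link(Ci, Cj, D):
    # Minimum Euclidean distance between a point of Ci and a point of Cj.
    return min(np.linalg.norm(D[a] - D[b]) for a in Ci for b in Cj)

def agglomerative_clustering(D, k, cluster_dist=single_link):
    """D: (n, d) array of points; k: desired number of clusters."""
    C = [[i] for i in range(len(D))]        # each point in a separate cluster
    while len(C) > k:
        # Find the closest pair of clusters.
        i, j = min(combinations(range(len(C)), 2),
                   key=lambda p: cluster_dist(C[p[0]], C[p[1]], D))
        Cij = C[i] + C[j]                   # merge the clusters
        C = [c for t, c in enumerate(C) if t not in (i, j)] + [Cij]
    return C
```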

  8. Distance between Clusters: Single, Complete, and Average
A typical distance between two points is the Euclidean distance or L_2-norm:

δ(x, y) = ‖x − y‖_2 = ( Σ_{i=1}^{d} (x_i − y_i)^2 )^{1/2}

Single Link: the minimum distance between a point in C_i and a point in C_j:
δ(C_i, C_j) = min{ δ(x, y) | x ∈ C_i, y ∈ C_j }

Complete Link: the maximum distance between points in the two clusters:
δ(C_i, C_j) = max{ δ(x, y) | x ∈ C_i, y ∈ C_j }

Group Average: the average pairwise distance between points in C_i and C_j:
δ(C_i, C_j) = ( Σ_{x ∈ C_i} Σ_{y ∈ C_j} δ(x, y) ) / (n_i · n_j)
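A minimal NumPy sketch of the three measures (the function names are illustrative; each cluster is assumed to be an array whose rows are its points):

```python
import numpy as np

def pairwise(Ci, Cj):
    """All pairwise Euclidean distances between rows of Ci and rows of Cj."""
    return np.linalg.norm(Ci[:, None, :] - Cj[None, :, :], axis=-1)

def single_link(Ci, Cj):
    return pairwise(Ci, Cj).min()          # minimum pairwise distance

def complete_link(Ci, Cj):
    return pairwise(Ci, Cj).max()          # maximum pairwise distance

def group_average(Ci, Cj):
    return pairwise(Ci, Cj).mean()         # sum of distances / (n_i * n_j)
```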

  9. Distance between Clusters: Mean and Ward's
Mean Distance: the distance between two clusters is defined as the distance between the means or centroids of the two clusters:
δ(C_i, C_j) = δ(μ_i, μ_j)

Minimum Variance or Ward's Method: the distance between two clusters is defined as the increase in the sum of squared errors (SSE) when the two clusters are merged:
δ(C_i, C_j) = ΔSSE_ij = SSE_ij − SSE_i − SSE_j
where the SSE for a given cluster C_i is given as SSE_i = Σ_{x ∈ C_i} ‖x − μ_i‖^2. After simplification, we get:
δ(C_i, C_j) = ( n_i n_j / (n_i + n_j) ) ‖μ_i − μ_j‖^2

Ward's measure is therefore a weighted version of the mean distance measure.
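Continuing the same kind of sketch, the mean distance and Ward's measure can be written directly from the formulas above (function names again illustrative; clusters are arrays of points):

```python
import numpy as np

def mean_distance(Ci, Cj):
    # Distance between the cluster centroids mu_i and mu_j.
    return np.linalg.norm(Ci.mean(axis=0) - Cj.mean(axis=0))

def ward_distance(Ci, Cj):
    # Increase in SSE on merging: (n_i * n_j / (n_i + n_j)) * ||mu_i - mu_j||^2
    ni, nj = len(Ci), len(Cj)
    return (ni * nj / (ni + nj)) * mean_distance(Ci, Cj) ** 2
```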

  10. Single Link Agglomerative Clustering
[Worked example: single link agglomerative clustering of five points A, B, C, D, E. The slide shows the pairwise distance matrix δ over the points, the updated matrices after each merge (AB, then CD, then ABCD, then ABCDE), and the resulting dendrogram, with each merge performed at the minimum inter-cluster distance.]
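The exact distances of the slide's example are not reproduced above, but a similar single-link run can be traced with SciPy on hypothetical coordinates for five points labeled A through E:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

# Hypothetical 2-D coordinates for A-E (not the slide's actual distances).
points = np.array([[0.0, 0.0],   # A
                   [1.0, 0.0],   # B
                   [4.0, 0.0],   # C
                   [5.5, 0.0],   # D
                   [9.0, 0.0]])  # E

Z = linkage(points, method="single")   # merge history: (idx1, idx2, distance, size)
print(Z)
# dendrogram(Z, labels=list("ABCDE"))  # draws the tree (requires matplotlib)
```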

  11. Lance–Williams Formula
Whenever two clusters C_i and C_j are merged into C_ij, we need to update the distance matrix by recomputing the distances from the newly created cluster C_ij to all other clusters C_r (r ≠ i and r ≠ j). The Lance–Williams formula provides a general equation to recompute the distances for all of the cluster proximity measures:

δ(C_ij, C_r) = α_i · δ(C_i, C_r) + α_j · δ(C_j, C_r) + β · δ(C_i, C_j) + γ · |δ(C_i, C_r) − δ(C_j, C_r)|

The coefficients α_i, α_j, β, and γ differ from one measure to another.

  12. Lance–Williams Formulas for Cluster Proximity

Measure        | α_i                             | α_j                             | β                              | γ
Single link    | 1/2                             | 1/2                             | 0                              | −1/2
Complete link  | 1/2                             | 1/2                             | 0                              | 1/2
Group average  | n_i / (n_i + n_j)               | n_j / (n_i + n_j)               | 0                              | 0
Mean distance  | n_i / (n_i + n_j)               | n_j / (n_i + n_j)               | −(n_i · n_j) / (n_i + n_j)^2   | 0
Ward's measure | (n_i + n_r) / (n_i + n_j + n_r) | (n_j + n_r) / (n_i + n_j + n_r) | −n_r / (n_i + n_j + n_r)       | 0
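A small Python sketch that applies the Lance–Williams update with the coefficients tabulated above (a sketch only; the function names are illustrative, and the distances d_ir, d_jr, d_ij are assumed to be expressed in whatever form the chosen measure uses):

```python
def lw_coefficients(measure, ni, nj, nr):
    """Return (alpha_i, alpha_j, beta, gamma) for the given proximity measure."""
    if measure == "single":
        return 0.5, 0.5, 0.0, -0.5
    if measure == "complete":
        return 0.5, 0.5, 0.0, 0.5
    if measure == "average":
        return ni / (ni + nj), nj / (ni + nj), 0.0, 0.0
    if measure == "mean":
        return ni / (ni + nj), nj / (ni + nj), -ni * nj / (ni + nj) ** 2, 0.0
    if measure == "ward":
        n = ni + nj + nr
        return (ni + nr) / n, (nj + nr) / n, -nr / n, 0.0
    raise ValueError(f"unknown measure: {measure}")

def lw_update(d_ir, d_jr, d_ij, ni, nj, nr, measure="single"):
    """Distance from the merged cluster C_ij to another cluster C_r."""
    ai, aj, beta, gamma = lw_coefficients(measure, ni, nj, nr)
    return ai * d_ir + aj * d_jr + beta * d_ij + gamma * abs(d_ir - d_jr)

# For single link this reduces to min(d_ir, d_jr):
print(lw_update(3.0, 2.0, 1.0, 2, 2, 1, measure="single"))   # 2.0
```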

  13. Iris Dataset: Complete Link Clustering
[Scatter plot of the Iris data in the (u_1, u_2) plane, with the three complete-link clusters shown as circles (C_1), triangles (C_2), and squares (C_3).]

Contingency Table:
               | iris-setosa | iris-virginica | iris-versicolor
C_1 (circle)   | 50          | 0              | 0
C_2 (triangle) | 0           | 1              | 36
C_3 (square)   | 0           | 49             | 14
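A comparable result can be reproduced with scikit-learn; the exact contingency counts may differ from the slide depending on preprocessing (for example, whether the data is first projected onto two dimensions):

```python
from sklearn.datasets import load_iris
from sklearn.cluster import AgglomerativeClustering
import pandas as pd

iris = load_iris()

# Complete-link agglomerative clustering into 3 clusters.
labels = AgglomerativeClustering(n_clusters=3, linkage="complete").fit_predict(iris.data)

# Contingency table: rows are the found clusters, columns the true species.
print(pd.crosstab(labels, iris.target_names[iris.target]))
```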

  14. Data Mining and Machine Learning: Fundamental Concepts and Algorithms
dataminingbook.info
Mohammed J. Zaki, Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, USA
Wagner Meira Jr., Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
Chapter 14: Hierarchical Clustering
