Chapter 5-2: Clustering
  Jilles Vreeken
  IRDM ’15/16, 12 Nov 2015
  Revision 1, November 20th: typos fixed (dendrogram)
  Revision 2, December 10th: clarified that we do consider a point $x$ as a member of its own $\varepsilon$-neighborhood
The First Midterm Test
  November 19th 2015
  Where: Günter-Hotz Hörsaal (E2.2)
  Material: the first four lectures, the first two homeworks
  You are allowed to bring one (1) sheet of A4 paper with handwritten or printed notes on both sides.
  No other material (notes, books, course materials) or devices (calculator, notebook, cell phone, toothbrush, etc.) allowed.
  Bring an ID; either your UdS card, or passport.
The Final Exam
  Preliminary dates: February 15th and 16th 2016
  Oral exam.
  Can only be taken when you passed two out of three mid-term tests.
  More details later.
IRDM Chapter 5, overview
  1. Basic idea
  2. Representative-based clustering
  3. Probabilistic clustering
  4. Validation
  5. Hierarchical clustering
  6. Density-based clustering
  7. Clustering high-dimensional data
  You’ll find this covered in Aggarwal Ch. 6, 7 and Zaki & Meira, Ch. 13–15
Chapter 5.5: Hierarchical Clustering
  Aggarwal Ch. 6.4
The basic idea
  Create a clustering for each number of clusters $k = 1, 2, \ldots, n$.
  The clusterings must be hierarchical:
  - every cluster of a $k$-clustering is a union of some clusters in an $l$-clustering, for all $k < l$
  - i.e. for all $l$, and for all $k > l$, every cluster in a $k$-clustering is a subset of some cluster in the $l$-clustering
  Example (one figure per slide): k = 6, 5, 4, 3, 2, 1
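The nesting requirement can be checked mechanically. The following is a minimal sketch, assuming each clustering is given as a list of Python sets of point ids, keyed by its number of clusters; the function name `is_hierarchical` and the toy data are made up for illustration.

```python
def is_hierarchical(clusterings):
    """Check the nesting property for a dict {k: list of sets of point ids}.

    Every cluster of a finer clustering (larger k) must be a subset of some
    cluster of every coarser clustering (smaller k).
    """
    ks = sorted(clusterings)
    for i, coarse_k in enumerate(ks):
        for fine_k in ks[i + 1:]:
            for fine in clusterings[fine_k]:
                if not any(fine <= coarse for coarse in clusterings[coarse_k]):
                    return False
    return True


nested = {
    1: [{0, 1, 2, 3}],
    2: [{0, 1}, {2, 3}],
    3: [{0, 1}, {2}, {3}],
    4: [{0}, {1}, {2}, {3}],
}
# Here the 2-clustering splits {0, 1}, which the 3-clustering keeps together.
not_nested = {2: [{0, 2}, {1, 3}], 3: [{0, 1}, {2}, {3}]}

print(is_hierarchical(nested))      # True
print(is_hierarchical(not_nested))  # False
```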
Dendrograms
  The difference in height between the tree and its subtrees shows the distance between the two branches.
  (Figure annotation: distance is ≈ 0.7)
Dendrograms and clusters
Dendrograms, revisited
  Dendrograms show the hierarchy of the clustering.
  The number of clusters can be deduced from a dendrogram:
  - look for the higher branches, i.e. merges that happen at a large distance
  Outliers can be detected from a dendrogram:
  - single points that are far from others
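One way to make "higher branches" concrete is to cut the dendrogram inside the widest gap between consecutive merge heights. This is a common heuristic and only one possible reading of the slide, not something the lecture prescribes; the function name and the example heights below are invented for illustration.

```python
def clusters_at_largest_gap(merge_heights):
    """Suggest a number of clusters from dendrogram merge heights.

    With n points there are n - 1 merges. Cutting the tree just above the
    i-th merge (1-based, by increasing height) leaves n - i clusters; we cut
    inside the widest gap between consecutive merge heights.
    """
    heights = sorted(merge_heights)
    n_points = len(heights) + 1
    gaps = [b - a for a, b in zip(heights, heights[1:])]
    if not gaps:
        return 1  # zero or one merge: there is no gap to compare
    widest = max(range(len(gaps)), key=gaps.__getitem__)
    merges_kept = widest + 1
    return n_points - merges_kept


# Five points, four merges; the last merge happens much higher than the rest.
print(clusters_at_largest_gap([0.1, 0.15, 0.2, 0.7]))  # 2
```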
Agglomerative and Divisive
  Agglomerative: bottom-up
  - start with $n$ clusters, one per point
  - repeatedly combine the two closest clusters into one bigger cluster
  Divisive: top-down
  - start with 1 cluster
  - divide the cluster into two
  - repeatedly divide the largest (per diameter) cluster into smaller clusters
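The bottom-up procedure can be sketched in a few lines. This is a deliberately naive illustration for 1-D points, assuming the smallest pairwise distance as the cluster distance; the names `agglomerative` and `single_link` are hypothetical, and the divisive variant is not shown.

```python
from itertools import combinations


def single_link(c1, c2):
    """Cluster distance: smallest pairwise distance (1-D points)."""
    return min(abs(x - y) for x in c1 for y in c2)


def agglomerative(points, cluster_distance=single_link):
    """Naive bottom-up clustering; returns the list of merges performed.

    Each merge is (cluster_a, cluster_b, distance). Every round compares all
    pairs of clusters, so this is only meant for tiny inputs.
    """
    clusters = [(p,) for p in points]  # start with n singleton clusters
    merges = []
    while len(clusters) > 1:
        # find the two closest clusters under the chosen cluster distance
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: cluster_distance(*pair))
        merges.append((a, b, cluster_distance(a, b)))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)  # replace them by their union
    return merges


for a, b, d in agglomerative([0, 1, 5, 6, 20]):
    print(a, "+", b, "at distance", d)
```

The recorded merge distances are exactly the heights at which branches join in the dendrogram. Passing a different `cluster_distance` function changes which clusters are merged first.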
Cluster distances
  The distance between two points $x$ and $y$ is $d(x, y)$.
  What is the distance between two clusters?
  Many intuitive definitions, no universal truth:
  - different cluster distances yield different clusterings
  - the selection of cluster distance depends on the application
  Some distances between clusters $B$ and $C$:
  - minimum distance: $d(B, C) = \min\{ d(x, y) : x \in B \text{ and } y \in C \}$
  - maximum distance: $d(B, C) = \max\{ d(x, y) : x \in B \text{ and } y \in C \}$
  - average distance: $d(B, C) = \operatorname{avg}\{ d(x, y) : x \in B \text{ and } y \in C \}$
  - distance of centroids: $d(B, C) = d(\mu_B, \mu_C)$, where $\mu_B$ is the centroid of $B$ and $\mu_C$ is the centroid of $C$
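The four cluster distances listed above can be written down directly. The sketch below assumes Euclidean distance between points given as coordinate tuples; the function names and the two toy clusters are chosen for illustration only.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)


def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))


def min_distance(B, C):
    return min(dist(x, y) for x in B for y in C)


def max_distance(B, C):
    return max(dist(x, y) for x in B for y in C)


def avg_distance(B, C):
    return sum(dist(x, y) for x in B for y in C) / (len(B) * len(C))


def centroid_distance(B, C):
    return dist(centroid(B), centroid(C))


B = [(0, 0), (0, 2)]
C = [(3, 0), (3, 2)]
for f in (min_distance, max_distance, avg_distance, centroid_distance):
    print(f.__name__, round(f(B, C), 3))
```

On this toy input the minimum and centroid distances coincide, while the maximum and average distances are larger, so the choice of cluster distance can already change which pair of clusters counts as closest.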
Single link
  The distance between two clusters is the distance between the closest points:
  - $d(B, C) = \min\{ d(x, y) : x \in B \text{ and } y \in C \}$
Strength of single-link
  Can handle non-spherical clusters of unequal size
Weaknesses of single-link
  Sensitive to noise and outliers
  Produces elongated clusters
Complete link
  The distance between two clusters is the distance between the furthest points:
  - $d(B, C) = \max\{ d(x, y) : x \in B \text{ and } y \in C \}$
Strengths of complete link
  Less susceptible to noise and outliers
Weaknesses of complete-link
  Breaks largest clusters
  Biased towards spherical clusters
Group average and Mean distance
  Group average is the average of pairwise distances:
  - $d(B, C) = \operatorname{avg}\{ d(x, y) : x \in B \text{ and } y \in C \} = \sum_{x \in B,\, y \in C} \frac{d(x, y)}{|B|\,|C|}$
  Mean distance is the distance of the cluster centroids:
  - $d(B, C) = d(\mu_B, \mu_C)$
Properties of group average
  A compromise between single and complete link
  Less susceptible to noise and outliers
  - similar to complete link
  Biased towards spherical clusters
  - similar to complete link
Ward’s method
  Ward’s distance between clusters $A$ and $B$ is the increase in sum of squared errors (SSE) when the two clusters are merged:
  - SSE for cluster $A$ is $\mathit{SSE}_A = \sum_{x \in A} \| x - \mu_A \|^2$
  - the difference for merging clusters $A$ and $B$ into cluster $C$ is then $d(A, B) = \Delta \mathit{SSE}_C = \mathit{SSE}_C - \mathit{SSE}_A - \mathit{SSE}_B$
  - or, equivalently, the weighted mean distance $d(A, B) = \frac{|A|\,|B|}{|A| + |B|} \| \mu_A - \mu_B \|^2$
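The equivalence of the two forms of Ward's distance can be checked numerically on a small example. The sketch below assumes Euclidean distance and clusters given as lists of coordinate tuples; the helper names are made up for illustration.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)


def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))


def sse(cluster):
    """Sum of squared distances from each point to the cluster centroid."""
    mu = centroid(cluster)
    return sum(dist(x, mu) ** 2 for x in cluster)


def ward_by_sse(A, B):
    """Increase in SSE when A and B are merged into one cluster."""
    return sse(A + B) - sse(A) - sse(B)


def ward_by_centroids(A, B):
    """Size-weighted squared distance between the two centroids."""
    weight = len(A) * len(B) / (len(A) + len(B))
    return weight * dist(centroid(A), centroid(B)) ** 2


A = [(0.0, 0.0), (2.0, 0.0)]
B = [(5.0, 0.0)]
# Both expressions evaluate to 32/3 here, up to floating-point rounding.
print(ward_by_sse(A, B), ward_by_centroids(A, B))
```

The centroid form only needs each cluster's size and centroid, which is why it is usually the more convenient one to compute during agglomeration.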
Discussion on Ward’s method
  Less susceptible to noise and outliers
  Biased towards spherical clusters
  Hierarchical analogue of $k$-means:
  - hence many shared pros and cons
  - can be used to initialise $k$-means
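To illustrate the last point, the sketch below merges clusters by smallest Ward distance until $k$ remain, then hands their centroids to plain Lloyd iterations as the starting centroids for $k$-means. It is a naive illustration under the assumption of Euclidean distance, not an efficient or authoritative implementation; all function names and the toy data are hypothetical.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)


def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))


def ward_distance(A, B):
    """Size-weighted squared distance between the two centroids."""
    weight = len(A) * len(B) / (len(A) + len(B))
    return weight * dist(centroid(A), centroid(B)) ** 2


def ward_clusters(points, k):
    """Merge the pair with the smallest Ward distance until k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs,
                   key=lambda ij: ward_distance(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


def k_means(points, centroids, iterations=20):
    """Plain Lloyd iterations starting from the given centroids."""
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda c: dist(p, centroids[c]))
            groups[nearest].append(p)
        # keep the old centroid if a group ends up empty
        centroids = [centroid(g) if g else c for g, c in zip(groups, centroids)]
    return centroids


points = [(0, 0), (0, 1), (1, 0), (8, 8), (8, 9), (9, 8)]
init = [centroid(c) for c in ward_clusters(points, k=2)]
print(k_means(points, init))
```

On well-separated data like this the Ward centroids are already a fixed point of $k$-means; on harder data they merely provide a deterministic starting point.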
Comparison
  (Figure panels: Single link, Complete link, Group average, Ward’s method)