Chapter 5-2: Clustering



  1. Chapter 5-2: Clustering. Jilles Vreeken. Revision 1, November 20th: typos fixed (dendrogram). Revision 2, December 10th: clarified that we do consider a point as a member of its own ε-neighborhood. IRDM '15/16, 12 Nov 2015

  2. The First Midterm Test. November 19th 2015. Where: Günter-Hotz Hörsaal (E2.2). Material: the first four lectures, the first two homeworks. You are allowed to bring one (1) sheet of A4 paper with handwritten or printed notes on both sides. No other material (notes, books, course materials) or devices (calculator, notebook, cell phone, toothbrush, etc.) allowed. Bring an ID; either your UdS card or passport.

  3. The Final Exam. Preliminary dates: February 15th and 16th 2016. Oral exam. Can only be taken when you passed two out of three mid-term tests. More details later.

  4. IRDM Chapter 5, overview: 1. Basic idea, 2. Representative-based clustering, 3. Probabilistic clustering, 4. Validation, 5. Hierarchical clustering, 6. Density-based clustering, 7. Clustering high-dimensional data. You'll find this covered in Aggarwal Ch. 6, 7 and Zaki & Meira Ch. 13-15.

  5. IRDM Chapter 5, today: 1. Basic idea, 2. Representative-based clustering, 3. Probabilistic clustering, 4. Validation, 5. Hierarchical clustering, 6. Density-based clustering, 7. Clustering high-dimensional data. You'll find this covered in Aggarwal Ch. 6, 7 and Zaki & Meira Ch. 13-15.

  6. Chapter 5.5: Hierarchical Clustering. Aggarwal Ch. 6.4

  7. The basic idea. Create a clustering for each number of clusters k = 1, 2, ..., n. The clusterings must be hierarchical:
  - every cluster of a k-clustering is a union of clusters of an l-clustering, for all k < l
  - i.e., for every l and every k > l, each cluster of the k-clustering is a subset of some cluster of the l-clustering.
  Example: k = 6

  8. The basic idea (as on the previous slide). Example: k = 5

  9. The basic idea (as above). Example: k = 4

  10. The basic idea (as above). Example: k = 3

  11. The basic idea (as above). Example: k = 2

  12. The basic idea (as above). Example: k = 1
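As a quick illustration of the nesting requirement, here is a minimal sketch, assuming NumPy and SciPy are available; the toy data and the choice of average linkage are arbitrary. It builds one hierarchy and checks that every cluster of a finer cut is contained in exactly one cluster of a coarser cut.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 2))            # toy data: 20 points in the plane

    Z = linkage(X, method='average')        # one hierarchy serves every k

    for k, l in [(2, 4), (3, 6)]:           # k < l: coarser vs. finer cut
        coarse = fcluster(Z, t=k, criterion='maxclust')
        fine = fcluster(Z, t=l, criterion='maxclust')
        for c in np.unique(fine):
            # every fine cluster lies inside exactly one coarse cluster
            assert len(np.unique(coarse[fine == c])) == 1
    print("nesting property holds")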

  13. Dendrograms. Distance is ≈ 0.7. The difference in height between the tree and its subtrees shows the distance between the two branches.
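A minimal sketch of how such a dendrogram is produced, assuming SciPy and Matplotlib are available; the two synthetic groups are illustrative only. The vertical axis is the distance at which two branches are merged.

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, dendrogram

    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0, 0.2, size=(10, 2)),   # two well-separated groups
                   rng.normal(3, 0.2, size=(10, 2))])

    Z = linkage(X, method='single')
    dendrogram(Z)                      # y-axis: distance at which branches merge
    plt.ylabel('merge distance')
    plt.show()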

  14. Dendrograms and clusters (figure)

  15. Dendrograms, revisited. Dendrograms show the hierarchy of the clustering. The number of clusters can be deduced from a dendrogram (higher branches). Outliers can be detected from a dendrogram (single points that are far from the others).

  16. Agglomerative and Divisive.
  Agglomerative: bottom-up
  - start with n clusters
  - combine the two closest clusters into one bigger cluster
  Divisive: top-down
  - start with 1 cluster
  - divide a cluster into two
  - divide the largest (per diameter) cluster into smaller clusters
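A minimal sketch of the agglomerative (bottom-up) strategy, assuming NumPy; the helper name agglomerative() is my own, and the loop is deliberately naive (a quadratic scan per merge) just to make the structure concrete.

    import numpy as np

    def agglomerative(X, cluster_dist):
        """Naive bottom-up clustering: start from n singleton clusters and
        repeatedly merge the closest pair; the merge order defines the hierarchy."""
        clusters = [[i] for i in range(len(X))]
        merges = []
        while len(clusters) > 1:
            pairs = [(i, j) for i in range(len(clusters))
                            for j in range(i + 1, len(clusters))]
            i, j = min(pairs, key=lambda p: cluster_dist(X[clusters[p[0]]],
                                                         X[clusters[p[1]]]))
            merges.append((clusters[i], clusters[j]))
            clusters[i] = clusters[i] + clusters[j]   # merge cluster j into i
            del clusters[j]
        return merges

    # single link as the cluster distance: minimum pairwise distance
    single_link = lambda A, B: np.min(np.linalg.norm(A[:, None] - B[None, :], axis=-1))

    X = np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.], [9., 9.]])
    print(agglomerative(X, single_link))

Real implementations avoid recomputing all cluster distances from scratch after each merge, for example via the Lance-Williams update formulas.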

  17. Cluster distances. The distance between two points x and y is d(x, y). What is the distance between two clusters? Many intuitive definitions, no universal truth:
  - different cluster distances yield different clusterings
  - the selection of the cluster distance depends on the application
  Some distances between clusters A and B:
  - minimum distance: d(A, B) = min { d(x, y) : x ∈ A and y ∈ B }
  - maximum distance: d(A, B) = max { d(x, y) : x ∈ A and y ∈ B }
  - average distance: d(A, B) = avg { d(x, y) : x ∈ A and y ∈ B }
  - distance of centroids: d(A, B) = d(μ_A, μ_B), where μ_A is the centroid of A and μ_B is the centroid of B
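The four variants can be written down directly; the sketch below uses a hypothetical cluster_distances() helper over NumPy arrays of points, and any of the returned values could be plugged into the agglomerative loop sketched earlier as its cluster_dist.

    import numpy as np

    def cluster_distances(A, B):
        """The four cluster distances from the slide, for point sets A and B
        (one point per row)."""
        pd = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)  # all pairwise d(x, y)
        return {
            'minimum (single link)':   pd.min(),
            'maximum (complete link)': pd.max(),
            'average (group average)': pd.mean(),
            'centroid distance':       np.linalg.norm(A.mean(axis=0) - B.mean(axis=0)),
        }

    A = np.array([[0., 0.], [0., 1.]])
    B = np.array([[3., 0.], [4., 1.]])
    print(cluster_distances(A, B))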

  18. Single link. The distance between two clusters is the distance between the closest points: d(A, B) = min { d(x, y) : x ∈ A and y ∈ B }

  19. Strength of single link. Can handle non-spherical clusters of unequal size.

  20. Weaknesses of single link. Sensitive to noise and outliers. Produces elongated clusters.

  21. Complete link. The distance between two clusters is the distance between the furthest points: d(A, B) = max { d(x, y) : x ∈ A and y ∈ B }

  22. Strengths of complete link. Less susceptible to noise and outliers.

  23. Weaknesses of complete link. Breaks largest clusters. Biased towards spherical clusters.

  24. Group average and mean distance.
  Group average is the average of the pairwise distances d(x, y):
  - d(A, B) = avg { d(x, y) : x ∈ A and y ∈ B } = (1 / (|A| |B|)) Σ_{x ∈ A, y ∈ B} d(x, y)
  Mean distance is the distance of the cluster centroids:
  - d(A, B) = d(μ_A, μ_B)
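A small numeric check, assuming NumPy, that the group average equals the pairwise sum divided by |A| |B|, and that it generally differs from the mean (centroid) distance; the random clusters are illustrative only.

    import numpy as np

    rng = np.random.default_rng(2)
    A = rng.normal(0, 1, size=(5, 3))
    B = rng.normal(2, 1, size=(7, 3))

    pd = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)      # |A| x |B| distances

    group_average = pd.sum() / (len(A) * len(B))                     # normalised pairwise sum
    mean_distance = np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))  # distance of centroids

    assert np.isclose(group_average, pd.mean())
    print(group_average, mean_distance)   # the two notions generally disagree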

  25. Properties of group average. A compromise between single and complete link. Less susceptible to noise and outliers (similar to complete link). Biased towards spherical clusters (similar to complete link).

  26. Ward's method. Ward's distance between clusters A and B is the increase in the sum of squared errors (SSE) when the two clusters are merged:
  - the SSE for cluster A is SSE_A = Σ_{x ∈ A} ‖x − μ_A‖²
  - the difference for merging clusters A and B into cluster C is then d(A, B) = ΔSSE_C = SSE_C − SSE_A − SSE_B
  - or, equivalently, the weighted mean distance d(A, B) = (|A| |B|) / (|A| + |B|) ‖μ_A − μ_B‖²
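A small numeric check, assuming NumPy and using a hypothetical sse() helper, that the ΔSSE definition and the weighted-mean-distance form of Ward's distance agree; the random clusters are illustrative only.

    import numpy as np

    def sse(A):
        """Sum of squared distances of the points in A to their centroid."""
        return ((A - A.mean(axis=0)) ** 2).sum()

    rng = np.random.default_rng(3)
    A = rng.normal(0, 1, size=(6, 2))
    B = rng.normal(1, 1, size=(4, 2))

    delta_sse = sse(np.vstack([A, B])) - sse(A) - sse(B)             # Ward's d(A, B)
    closed_form = (len(A) * len(B)) / (len(A) + len(B)) \
                  * np.linalg.norm(A.mean(axis=0) - B.mean(axis=0)) ** 2

    assert np.isclose(delta_sse, closed_form)                        # the two forms agree
    print(delta_sse)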

  27. Discussion on Ward's method. Less susceptible to noise and outliers. Biased towards spherical clusters. Hierarchical analogue of k-means: hence many shared pros and cons; can be used to initialise k-means.

  28. Comparison: single link, complete link, group average, Ward's method (figure)

  29. Comparison: single link, complete link, group average, Ward's method (figure)

  30. Comparison: single link, complete link, group average, Ward's method (figure)

  31. Comparison: single link, complete link, group average, Ward's method (figure)
