Chapter 5-2: Clustering
Jilles Vreeken
IRDM ‘15/16, 12 Nov 2015
Revision 1, November 20th: typos fixed (dendrogram)
Revision 2, December 10th: clarified that we do consider a point $x$ as a member of its own $\epsilon$-neighborhood

The First Midterm Test
November 19th 2015
Where: Günter-Hotz Hörsaal (E2.2)
Material: the first four lectures, the first two homeworks
You are allowed to bring one (1) sheet of A4 paper with handwritten or printed notes on both sides.
No other material (notes, books, course materials) or devices (calculator, notebook, cell phone, toothbrush, etc.) allowed.
Bring an ID; either your UdS card, or passport.

The Final Exam
Preliminary dates: February 15th and 16th 2016
Oral exam.
Can only be taken when you passed two out of three mid-term tests.
More details later.

IRDM Chapter 5, overview
1. Basic idea
2. Representative-based clustering
3. Probabilistic clustering
4. Validation
5. Hierarchical clustering
6. Density-based clustering
7. Clustering high-dimensional data
You’ll find this covered in Aggarwal Ch. 6, 7 and Zaki & Meira, Ch. 13-15

Chapter 5.5: Hierarchical Clustering
Aggarwal Ch. 6.4

The basic idea
Create a clustering for each number of clusters $k = 1, 2, \ldots, n$
The clusterings must be hierarchical
  - every cluster of a $k$-clustering is a union of some clusters in an $l$-clustering, for all $k < l$
  - i.e. for all $l$, and for all $k < l$, every cluster in an $l$-clustering is a subset of some cluster in the $k$-clustering
Example: k = 6
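The nesting requirement above can be checked mechanically. The following is a minimal sketch, not part of the lecture: the function names and the toy point labels are assumptions made for illustration. It tests whether every cluster of a finer clustering is contained in some cluster of each coarser one.

```python
def is_refinement(fine, coarse):
    """True if every cluster in `fine` is a subset of some cluster in `coarse`."""
    return all(any(f <= c for c in coarse) for f in fine)


def is_hierarchical(clusterings):
    """`clusterings` maps a cluster count to a list of sets of points.

    Checks every pair (fewer clusters, more clusters): the clustering with
    more clusters must refine the one with fewer clusters.
    """
    counts = sorted(clusterings)
    return all(
        is_refinement(clusterings[more], clusterings[fewer])
        for i, fewer in enumerate(counts)
        for more in counts[i + 1:]
    )


# Hypothetical toy data: four points labelled a-d.
nested = {
    1: [{"a", "b", "c", "d"}],
    2: [{"a", "b"}, {"c", "d"}],
    3: [{"a"}, {"b"}, {"c", "d"}],
    4: [{"a"}, {"b"}, {"c"}, {"d"}],
}
# Here {"b", "c"} straddles two clusters of the 2-clustering.
not_nested = {
    2: [{"a", "b"}, {"c", "d"}],
    3: [{"a"}, {"b", "c"}, {"d"}],
}

print(is_hierarchical(nested))      # True
print(is_hierarchical(not_nested))  # False
```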

Dendrograms
The difference in height between the tree and its subtrees shows the distance between the two branches
[Figure: dendrogram with one merge annotated "Distance is ≈ 0.7"]

Dendrograms and clusters
[Figure: dendrogram shown alongside the corresponding clusters]

Dendrograms, revisited
Dendrograms show the hierarchy of the clustering
Number of clusters can be deduced from a dendrogram
  - higher branches
Outliers can be detected from a dendrogram
  - single points that are far from others
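Reading clusters and outliers off a dendrogram amounts to cutting it at some height. The sketch below is an illustration under assumptions: the merge list is hypothetical hand-written data (heights chosen as if six points on a line had been merged by single link), and `cut_dendrogram` is a name introduced here, not something from the lecture.

```python
def cut_dendrogram(n_points, merges, height):
    """Return the clusters obtained by cutting the dendrogram at `height`.

    `merges` is a list of (merge_height, i, j): points i and j lie in the two
    branches joined at that height. Only merges at or below the cut apply.
    """
    parent = list(range(n_points))

    def find(i):
        # Follow parent links to the representative, shortening the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for merge_height, i, j in merges:
        if merge_height <= height:
            parent[find(i)] = find(j)

    clusters = {}
    for p in range(n_points):
        clusters.setdefault(find(p), []).append(p)
    return sorted(clusters.values())


# Hypothetical dendrogram over six points; point 5 joins last and very high.
merges = [(0.1, 0, 1), (0.2, 1, 2), (0.4, 3, 4), (4.7, 2, 3), (14.6, 4, 5)]

print(cut_dendrogram(6, merges, height=1.0))   # [[0, 1, 2], [3, 4], [5]]
print(cut_dendrogram(6, merges, height=10.0))  # [[0, 1, 2, 3, 4], [5]]
```

Point 5 stays a singleton until the very last, very high merge, which is the pattern that suggests an outlier.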

Agglomerative and Divisive
Agglomerative: bottom-up
  - start with $n$ clusters
  - combine the two closest clusters into one bigger cluster
Divisive: top-down
  - start with 1 cluster
  - divide the cluster into two
  - divide the largest (per diameter) cluster into smaller clusters
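The bottom-up procedure can be sketched in a few lines. This is a naive illustration, assuming Euclidean points given as tuples and a cluster distance passed in as a function; the names `agglomerative` and `closest_pair_distance` and the sample points are made up for this example, and `math.dist` needs Python 3.8 or later. The divisive variant is not shown.

```python
import math


def closest_pair_distance(c1, c2):
    """Distance between the closest points of two clusters (one possible choice)."""
    return min(math.dist(x, y) for x in c1 for y in c2)


def agglomerative(points, cluster_distance):
    """Bottom-up clustering; returns {number of clusters: list of clusters}."""
    clusters = [[p] for p in points]  # start with one cluster per point
    history = {len(clusters): [list(c) for c in clusters]}
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest distance.
        pairs = [
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        ]
        i, j = min(
            pairs, key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]])
        )
        # Merge the pair into one bigger cluster.
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
        history[len(clusters)] = [list(c) for c in clusters]
    return history


# Hypothetical sample: two tight pairs and one far-away point.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0), (10.0, 10.0)]
history = agglomerative(points, closest_pair_distance)
for k in sorted(history):
    print(k, sorted(sorted(c) for c in history[k]))
```

Every pass recomputes all pairwise cluster distances, which keeps the sketch short but makes it far slower than a practical implementation.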

Cluster distances
The distance between two points $x$ and $y$ is $d(x, y)$
What is the distance between two clusters?
Many intuitive definitions, no universal truth
  - different cluster distances yield different clusterings
  - the selection of cluster distance depends on application
Some distances between clusters $B$ and $C$:
  - minimum distance: $d(B, C) = \min \{ d(x, y) : x \in B \text{ and } y \in C \}$
  - maximum distance: $d(B, C) = \max \{ d(x, y) : x \in B \text{ and } y \in C \}$
  - average distance: $d(B, C) = \mathrm{avg} \{ d(x, y) : x \in B \text{ and } y \in C \}$
  - distance of centroids: $d(B, C) = d(\mu_B, \mu_C)$, where $\mu_B$ is the centroid of $B$ and $\mu_C$ is the centroid of $C$
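The four cluster distances listed above differ even on tiny inputs. The sketch below assumes Euclidean point distance and clusters given as lists of coordinate tuples; the function names and the two sample clusters are illustrative choices, not taken from the lecture.

```python
import math
from statistics import fmean


def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(fmean(coords) for coords in zip(*cluster))


def min_distance(c1, c2):
    return min(math.dist(x, y) for x in c1 for y in c2)


def max_distance(c1, c2):
    return max(math.dist(x, y) for x in c1 for y in c2)


def avg_distance(c1, c2):
    return fmean(math.dist(x, y) for x in c1 for y in c2)


def centroid_distance(c1, c2):
    return math.dist(centroid(c1), centroid(c2))


# Hypothetical clusters: the corners of a 3-by-4 rectangle, split left/right.
left = [(0.0, 0.0), (0.0, 4.0)]
right = [(3.0, 0.0), (3.0, 4.0)]

print(min_distance(left, right))       # 3.0
print(max_distance(left, right))       # 5.0
print(avg_distance(left, right))       # 4.0
print(centroid_distance(left, right))  # 3.0
```

The pairwise distances here are 3, 5, 5 and 3, so the average is 4 while the centroids (0, 2) and (3, 2) are only 3 apart: average distance and centroid distance are different quantities.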

Single link
The distance between two clusters is the distance between the closest points
  - $d(B, C) = \min \{ d(x, y) : x \in B \text{ and } y \in C \}$

Strength of single-link
Can handle non-spherical clusters of unequal size

Weaknesses of single-link
Sensitive to noise and outliers
Produces elongated clusters

Complete link
The distance between two clusters is the distance between the furthest points
  - $d(B, C) = \max \{ d(x, y) : x \in B \text{ and } y \in C \}$

Strengths of complete link
Less susceptible to noise and outliers

Weaknesses of complete-link
Breaks largest clusters
Biased towards spherical clusters
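The different behaviour of single and complete link shows up already on a handful of points on a line. The sketch below is illustrative only: the point positions are made-up data with slowly growing gaps, and `agglomerate` is a naive helper written for this example.

```python
def single_link(c1, c2):
    """Distance between the closest points of two 1-D clusters."""
    return min(abs(x - y) for x in c1 for y in c2)


def complete_link(c1, c2):
    """Distance between the furthest points of two 1-D clusters."""
    return max(abs(x - y) for x in c1 for y in c2)


def agglomerate(points, linkage, k):
    """Merge the two closest clusters (per `linkage`) until k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        pairs = [
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        ]
        i, j = min(pairs, key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return sorted(sorted(c) for c in clusters)


# Hypothetical chain of points whose gaps grow slowly: 1.0, 1.1, 1.2, 1.3, 1.4.
points = [0.0, 1.0, 2.1, 3.3, 4.6, 6.0]

print(agglomerate(points, single_link, 2))    # [[0.0, 1.0, 2.1, 3.3, 4.6], [6.0]]
print(agglomerate(points, complete_link, 2))  # [[0.0, 1.0, 2.1, 3.3], [4.6, 6.0]]
```

Single link keeps extending one cluster along the chain and leaves the last point on its own, while complete link, which penalises wide clusters, ends with two compact groups.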

Group average and Mean distance
Group average is the average of pairwise distances
  - $d(B, C) = \mathrm{avg} \{ d(x, y) : x \in B \text{ and } y \in C \} = \sum_{x \in B,\, y \in C} \frac{d(x, y)}{|B| \, |C|}$
Mean distance is the distance of the cluster centroids
  - $d(B, C) = d(\mu_B, \mu_C)$

Properties of group average
A compromise between single and complete link
Less susceptible to noise and outliers
  - similar to complete link
Biased towards spherical clusters
  - similar to complete link

Ward’s method
Ward’s distance between clusters $A$ and $B$ is the increase in sum of squared errors (SSE) when the two clusters are merged
  - SSE for cluster $A$ is $SSE_A = \sum_{x \in A} \lVert x - \mu_A \rVert^2$
  - difference for merging clusters $A$ and $B$ into cluster $C$ is then $d(A, B) = \Delta SSE_C = SSE_C - SSE_A - SSE_B$
  - or, equivalently, weighted mean distance $d(A, B) = \frac{|A| \, |B|}{|A| + |B|} \lVert \mu_A - \mu_B \rVert^2$
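The equivalence between the SSE increase and the weighted squared centroid distance can be checked numerically. The sketch below assumes Euclidean distance and clusters given as lists of coordinate tuples; the function names and the two sample clusters are chosen for illustration.

```python
import math
from statistics import fmean


def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(fmean(coords) for coords in zip(*cluster))


def sse(cluster):
    """Sum of squared distances from each point to the cluster centroid."""
    mu = centroid(cluster)
    return sum(math.dist(x, mu) ** 2 for x in cluster)


def ward_delta_sse(c1, c2):
    """Increase in SSE caused by merging the two clusters."""
    return sse(c1 + c2) - sse(c1) - sse(c2)


def ward_closed_form(c1, c2):
    """Size-weighted squared distance between the two centroids."""
    weight = len(c1) * len(c2) / (len(c1) + len(c2))
    return weight * math.dist(centroid(c1), centroid(c2)) ** 2


# Hypothetical clusters on a line (second coordinate is always 0).
first = [(0.0, 0.0), (2.0, 0.0)]
second = [(5.0, 0.0), (7.0, 0.0), (9.0, 0.0)]

print(ward_delta_sse(first, second))    # approximately 43.2
print(ward_closed_form(first, second))  # approximately 43.2
```

By hand: the separate SSEs are 2 and 8, the merged SSE is 53.2, so the increase is 43.2; the closed form gives (2 · 3 / 5) · 6² = 43.2 as well.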

Discussion on Ward’s method
Less susceptible to noise and outliers
Biased towards spherical clusters
Hierarchical analogue of $k$-means
  - hence many shared pros and cons
  - can be used to initialise $k$-means

Comparison
[Figure: clusterings produced by single link, complete link, group average, and Ward’s method]
