Foundations of Machine Learning CentraleSupélec — Fall 2017 12. Clustering Chloé-Agathe Azencott Centre for Computational Biology, Mines ParisTech chloe-agathe.azencott@mines-paristech.fr
Learning objectives ● Explain what clustering algorithms can be used for. ● Explain and implement three different ways to evaluate clustering algorithms. ● Implement hierarchical clustering, discuss its various flavors. ● Implement k-means clustering, discuss its advantages and drawbacks. ● Sketch out a density-based clustering algorithm.
Goals of clustering Group objects that are similar into clusters: classes that are unknown beforehand. E.g. – group genes that are similarly affected by a disease – group patients whose genes respond similarly to a disease – group pixels in an image that belong to the same object (image segmentation).
Applications of clustering ● Understand general characteristics of the data ● Visualize the data ● Infer some properties of a data point based on how it relates to other data points E.g. – find subtypes of diseases – visualize protein families – find categories among images – find patterns in financial transactions – detect communities in social networks
Distances and similarities
Distances & similarities ● Assess how close / far – data points are from each other – a data point is from a cluster – two clusters are from each other ● Distance metric: a function d satisfying non-negativity, d(x, x') = 0 iff x = x', symmetry d(x, x') = d(x', x), and the triangle inequality d(x, x'') ≤ d(x, x') + d(x', x''). ● E.g. Lq distances: $d_q(x, x') = \big( \sum_{j=1}^{p} |x_j - x'_j|^q \big)^{1/q}$
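A minimal NumPy sketch of the Lq (Minkowski) distance above, checked against SciPy's implementation; the vectors x and y are made-up examples:

```python
import numpy as np
from scipy.spatial.distance import minkowski

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 0.0, 4.0])

# L1 (Manhattan), L2 (Euclidean) and L5 distances via the Lq formula
for q in (1, 2, 5):
    d = np.sum(np.abs(x - y) ** q) ** (1.0 / q)
    assert np.isclose(d, minkowski(x, y, q))  # matches SciPy
    print(f"L{q} distance: {d:.4f}")
```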
Distances & similarities ● How do we get similarities? Transform distances into similarities? ● Kernels define similarities: for a given mapping $\varphi$ from the space of objects $\mathcal{X}$ to some Hilbert space $\mathcal{H}$, the kernel between two objects x and x' is the inner product of their images in the feature space: $k(x, x') = \langle \varphi(x), \varphi(x') \rangle_{\mathcal{H}}$
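As an illustration (not from the slides), two common kernels available in scikit-learn, applied to a made-up toy matrix X. The RBF kernel $k(x, x') = \exp(-\gamma \|x - x'\|^2)$ is one standard way to turn a distance into a similarity:

```python
import numpy as np
from sklearn.metrics.pairwise import linear_kernel, rbf_kernel

# Three made-up 2-D points: the first two are close, the third is far away
X = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])

# Linear kernel: plain inner products (the mapping phi is the identity)
print(linear_kernel(X))

# RBF (Gaussian) kernel: values near 1 for nearby points, near 0 for distant ones
print(rbf_kernel(X, gamma=0.5))
```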
Pearson's correlation ● Measure of the linear correlation between two variables: $\rho(u, v) = \frac{\sum_{i}(u_i - \bar{u})(v_i - \bar{v})}{\sqrt{\sum_{i}(u_i - \bar{u})^2}\,\sqrt{\sum_{i}(v_i - \bar{v})^2}}$ ● If the features are centered: $\rho(u, v) = \frac{\langle u, v \rangle}{\|u\|\,\|v\|}$, i.e. the normalized dot product = cosine similarity.
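A quick numerical check of this identity (random made-up vectors): center both vectors, take the normalized dot product, and compare with SciPy's Pearson coefficient.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
u, v = rng.normal(size=20), rng.normal(size=20)

# Center both vectors, then take the normalized dot product (cosine)
uc, vc = u - u.mean(), v - v.mean()
cosine = uc @ vc / (np.linalg.norm(uc) * np.linalg.norm(vc))

r, _ = pearsonr(u, v)
assert np.isclose(cosine, r)  # Pearson = cosine of the centered vectors
```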
Pearson vs Euclidean ● Pearson's coefficient: profiles of similar shapes will be close to each other, even if they differ in magnitude. ● Euclidean distance: magnitude is taken into account.
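A toy illustration of the contrast, with two made-up profiles of identical shape but different magnitudes:

```python
import numpy as np
from scipy.spatial.distance import euclidean
from scipy.stats import pearsonr

u = np.array([1.0, 2.0, 3.0, 4.0])
v = 10.0 * u  # same shape, 10x the magnitude

r, _ = pearsonr(u, v)
print(r)                # 1.0: same shape, perfectly correlated
print(euclidean(u, v))  # ~49.3: Euclidean distance penalizes the magnitude gap
```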
Evaluating clusters
Evaluating clusters ● Clustering is unsupervised: there is no ground truth. How do we evaluate the quality of a clustering algorithm? ● 1) Based on the shape of the clusters: points within the same cluster should be nearby/similar, and points far from each other should belong to different clusters. ● 2) Based on the stability of the clusters: we should get the same results if we remove some data points, add noise, etc. ● 3) Based on domain knowledge: the clusters should “make sense”.
Centroids and medoids ● Centroid: mean of the points in the cluster. ● Medoid: point in the cluster that is closest to the centroid.
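A small NumPy sketch of both notions, following the slide's definition of the medoid (the point closest to the centroid; elsewhere the medoid is sometimes defined as the point minimizing the average distance to all other cluster points). The cluster array is a made-up example:

```python
import numpy as np

def centroid(points):
    """Mean of the points in a cluster (rows of `points`)."""
    return points.mean(axis=0)

def medoid(points):
    """Point of the cluster closest to the centroid (slide's definition)."""
    c = centroid(points)
    distances = np.linalg.norm(points - c, axis=1)
    return points[np.argmin(distances)]

cluster = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [4.0, 4.0]])
print(centroid(cluster))  # [1.25, 1.25] -- need not be a data point
print(medoid(cluster))    # [1.0, 0.0]   -- always an actual data point
```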
Cluster shape: Tightness ● Tightness (homogeneity) of cluster $C_k$: the average distance of its points to the cluster centroid $\mu_k$: $T_k = \frac{1}{|C_k|} \sum_{x \in C_k} d(x, \mu_k)$ ● The smaller $T_k$, the more homogeneous the cluster.
Cluster shape: Separability ● Separation between clusters $C_k$ and $C_l$: the distance between their centroids: $S_{kl} = d(\mu_k, \mu_l)$ ● The larger $S_{kl}$, the better separated the two clusters.
Cluster shape: Davies-Bouldin index ● Combines cluster tightness (homogeneity) $T_k$ and cluster separation $S_{kl}$: $DB = \frac{1}{K} \sum_{k=1}^{K} \max_{l \neq k} \frac{T_k + T_l}{S_{kl}}$ ● The lower the index, the better the clustering.
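A sketch of the index under the definitions above (mean distance to the centroid for $T_k$, distance between centroids for $S_{kl}$); recent versions of scikit-learn (0.20+) also ship this as sklearn.metrics.davies_bouldin_score:

```python
import numpy as np

def davies_bouldin(X, labels):
    """Davies-Bouldin index from the slide's T_k and S_kl definitions.

    T_k  : mean Euclidean distance of cluster k's points to its centroid.
    S_kl : Euclidean distance between the centroids of clusters k and l.
    """
    ks = np.unique(labels)
    centroids = np.array([X[labels == k].mean(axis=0) for k in ks])
    T = np.array([np.linalg.norm(X[labels == k] - c, axis=1).mean()
                  for k, c in zip(ks, centroids)])
    db = 0.0
    for i in range(len(ks)):
        ratios = [(T[i] + T[j]) / np.linalg.norm(centroids[i] - centroids[j])
                  for j in range(len(ks)) if j != i]
        db += max(ratios)  # worst (tightest-to-separation) ratio for cluster i
    return db / len(ks)
```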
Cluster shape: Silhouette coefficient ● a(x): how well x fits in its cluster, the mean distance of x to the other points of its cluster. ● b(x): how well x would fit in another cluster, the minimum over the other clusters of the mean distance of x to their points. ● $s(x) = \frac{b(x) - a(x)}{\max(a(x), b(x))}$ ● if x is very close to the other points of its cluster: s(x) = 1 ● if x is very close to the points in another cluster: s(x) = -1
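Computing silhouettes with scikit-learn on made-up blob data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print(silhouette_score(X, labels))        # mean s(x) over all points
print(silhouette_samples(X, labels)[:5])  # per-point s(x) values
```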
Evaluating clusters ● 2) Based on the stability of the clusters: we should get the same results if we remove some data points, add noise, etc.
Cluster stability ● How many clusters? [Figure: clusterings of the same data with K=2 and with K=3.]
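One simple way (an assumed variant, not from the slides) to operationalize stability: cluster random subsamples of the data repeatedly, with different initializations, and measure the agreement between runs with the adjusted Rand index; a value of K that yields consistently high agreement is more trustworthy.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)
rng = np.random.default_rng(0)

def stability(X, k, n_rounds=20, frac=0.8):
    """Mean ARI between two differently seeded clusterings of each subsample."""
    scores = []
    for _ in range(n_rounds):
        idx = rng.choice(len(X), size=int(frac * len(X)), replace=False)
        a = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X[idx])
        b = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X[idx])
        scores.append(adjusted_rand_score(a, b))
    return np.mean(scores)

for k in (2, 3, 4, 5):
    print(k, stability(X, k))  # a good K tends to give the highest agreement
```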
Evaluating clusters ● 3) Based on domain knowledge: the clusters should “make sense”.
Domain knowledge ● Do the clusters match natural categories? – Check with human expertise
Ontology enrichment analysis ● Ontology: entities may be grouped, related within a hierarchy, and subdivided according to similarities and differences. Built by human experts. ● E.g. the Gene Ontology http://geneontology.org/ – Describes genes with a common vocabulary, organized in categories. E.g. cellular process > cell death > programmed cell death > apoptotic process > execution phase of apoptosis
Ontology enrichment analysis ● Enrichment analysis: are there more data points from ontology category G in cluster C than expected by chance? ● TANGO [Tanay et al., 2003] – Assume the data points are sampled from a hypergeometric distribution. – The probability for the intersection of G and C to contain more than t points is $P(|G \cap C| > t) = \sum_{i=t+1}^{\min(|G|,|C|)} \frac{\binom{|G|}{i}\binom{n-|G|}{|C|-i}}{\binom{n}{|C|}}$, where each term is the probability of getting i points from G when drawing |C| points from a total of n samples.
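This tail probability can be computed directly with SciPy's hypergeometric distribution; all the counts below are made-up for illustration:

```python
from scipy.stats import hypergeom

n = 20000  # total number of genes (made-up population size)
G = 150    # genes annotated with the ontology category
C = 300    # genes in the cluster
t = 5      # observed overlap threshold

# hypergeom(M=n, n=|G|, N=|C|) models the number of G-genes among |C| draws
# without replacement from n genes; sf(t) = P(|G ∩ C| > t), the p-value.
p_value = hypergeom(n, G, C).sf(t)
print(p_value)
```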
Hierarchical clustering
Hierarchical clustering Group data over a variety of possible scales, in a multi-level hierarchy.
Construction ● Agglomerative approach (bottom-up): start with each element in its own cluster; iteratively join neighboring clusters. ● Divisive approach (top-down): start with all elements in the same cluster; iteratively separate them into smaller clusters.
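An agglomerative (bottom-up) run with SciPy on toy blob data; each row of the linkage matrix Z records one merge (the indices of the two clusters joined, their distance, and the size of the new cluster):

```python
from scipy.cluster.hierarchy import linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=0)

Z = linkage(X, method="average", metric="euclidean")
print(Z[:5])  # the first five bottom-up merges
```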
Dendrogram ● The results of a hierarchical clustering algorithm are presented in a dendrogram. ● Branch length = cluster distance.
Dendrogram ● The height of each “U” = the distance between the two clusters it joins. ● How many clusters? Cut the dendrogram at a chosen height: each branch below the cut is one cluster. [Figure: a cut yielding four clusters, labeled 1 to 4.]
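Plotting the dendrogram and cutting it into a flat clustering with SciPy (toy data again):

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=4, random_state=0)
Z = linkage(X, method="average")

# Draw the dendrogram; the height of each "U" is the merge distance
dendrogram(Z)
plt.show()

# Cut the tree to get a flat clustering, e.g. into 4 clusters
labels = fcluster(Z, t=4, criterion="maxclust")
print(labels)
```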
Linkage: connecting two clusters ● Single linkage: distance between the two closest points, $d(C_k, C_l) = \min_{x \in C_k,\, x' \in C_l} d(x, x')$ ● Complete linkage: distance between the two farthest points, $d(C_k, C_l) = \max_{x \in C_k,\, x' \in C_l} d(x, x')$ ● Average linkage: mean of all pairwise distances, $d(C_k, C_l) = \frac{1}{|C_k||C_l|} \sum_{x \in C_k} \sum_{x' \in C_l} d(x, x')$ ● Centroid linkage: distance between the centroids, $d(C_k, C_l) = d(\mu_k, \mu_l)$ ● Ward: join clusters so as to minimize the within-cluster variance
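The same SciPy call exposes all five criteria through its method argument (note that centroid and ward linkage require the default Euclidean metric):

```python
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100, centers=3, random_state=0)

# Same data, different linkage criteria -> possibly different trees
for method in ("single", "complete", "average", "centroid", "ward"):
    Z = linkage(X, method=method)
    labels = fcluster(Z, t=3, criterion="maxclust")
    print(method, labels[:10])
```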
Example: gene expression clustering ● Breast cancer survival signature [Bergamashi et al. 2011] [Figure: hierarchical clustering of both the genes and the patients of an expression dataset.]
Hierarchical clustering ● Advantages – No need to pre-define the number of clusters – Interpretability ● Drawbacks – Computational complexity: e.g. naive single/complete linkage needs at least O(pn²) to compute all pairwise distances – Must decide at which level of the hierarchy to split – Lack of robustness (unstable)
K-means
K-means clustering ● Minimize the intra-cluster variance: $\arg\min_{C_1, \dots, C_K} \sum_{k=1}^{K} \sum_{x \in C_k} \|x - \mu_k\|^2$ ● What will this partition of the space look like? Each cluster consists of the points that are closer to its centroid than to any other centroid: the centroids induce a Voronoi partition of the space.
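A minimal sketch of Lloyd's algorithm, the standard way to (locally) minimize this objective: alternate assigning each point to its nearest centroid and recomputing each centroid as the mean of its points. This omits details a real implementation needs (k-means++ seeding, empty-cluster handling, multiple restarts):

```python
import numpy as np
from sklearn.datasets import make_blobs

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal Lloyd's algorithm: alternate assignment and centroid updates."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update step: recompute each centroid as the mean of its points
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centroids):
            break  # converged: assignments no longer change
        centroids = new
    return labels, centroids

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels, centroids = kmeans(X, k=3)
print(centroids)
```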