Foundations of Machine Learning CentraleSupélec — Fall 2017 12. Clustering Chloé-Agathe Azencott Centre for Computational Biology, Mines ParisTech chloe-agathe.azencott@mines-paristech.fr
Learning objectives ● Explain what clustering algorithms can be used for. ● Explain and implement three different ways to evaluate clustering algorithms. ● Implement hierarchical clustering, discuss its various flavors. ● Implement k-means clustering, discuss its advantages and drawbacks. ● Sketch out a density-based clustering algorithm.
Goals of clustering Group objects that are similar into clusters: classes that are unknown beforehand. E.g. – group genes that are similarly affected by a disease – group patients whose genes respond similarly to a disease – group pixels in an image that belong to the same object (image segmentation).
Applications of clustering ● Understand general characteristics of the data ● Visualize the data ● Infer some properties of a data point based on how it relates to other data points E.g. – find subtypes of diseases – visualize protein families – find categories among images – find patterns in financial transactions – detect communities in social networks
Distances and similarities
Distances & similarities ● Assess how close / far – data points are from each other – a data point is from a cluster – two clusters are from each other ● Distance metric: a function d satisfying non-negativity, d(x, x') = 0 iff x = x', symmetry d(x, x') = d(x', x), and the triangle inequality d(x, x'') ≤ d(x, x') + d(x', x''). ● E.g. Lq distances: $d_q(x, x') = \big( \sum_{j=1}^{p} |x_j - x'_j|^q \big)^{1/q}$
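A minimal NumPy sketch of the Lq (Minkowski) distance above, checked against SciPy's implementation; the vectors x and y are made-up examples:

```python
import numpy as np
from scipy.spatial.distance import minkowski

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 0.0, 4.0])

# L1 (Manhattan), L2 (Euclidean) and L5 distances via the Lq formula
for q in (1, 2, 5):
    d = np.sum(np.abs(x - y) ** q) ** (1.0 / q)
    assert np.isclose(d, minkowski(x, y, q))  # matches SciPy
    print(f"L{q} distance: {d:.4f}")
```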
Distances & similarities ● How do we get similarities? Transform distances into similarities? ● Kernels define similarities: for a given mapping $\varphi$ from the space of objects $\mathcal{X}$ to some Hilbert space $\mathcal{H}$, the kernel between two objects x and x' is the inner product of their images in the feature space: $k(x, x') = \langle \varphi(x), \varphi(x') \rangle_{\mathcal{H}}$
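As an illustration (not from the slides), two common kernels available in scikit-learn, applied to a made-up toy matrix X. The RBF kernel $k(x, x') = \exp(-\gamma \|x - x'\|^2)$ is one standard way to turn a distance into a similarity:

```python
import numpy as np
from sklearn.metrics.pairwise import linear_kernel, rbf_kernel

# Three made-up 2-D points: the first two are close, the third is far away
X = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])

# Linear kernel: plain inner products (the mapping phi is the identity)
print(linear_kernel(X))

# RBF (Gaussian) kernel: values near 1 for nearby points, near 0 for distant ones
print(rbf_kernel(X, gamma=0.5))
```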
Pearson's correlation ● Measure of the linear correlation between two variables: $\rho(u, v) = \frac{\sum_{i}(u_i - \bar{u})(v_i - \bar{v})}{\sqrt{\sum_{i}(u_i - \bar{u})^2}\,\sqrt{\sum_{i}(v_i - \bar{v})^2}}$ ● If the features are centered: $\rho(u, v) = \frac{\langle u, v \rangle}{\|u\|\,\|v\|}$, i.e. the normalized dot product = cosine similarity.
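A quick numerical check of this identity (random made-up vectors): center both vectors, take the normalized dot product, and compare with SciPy's Pearson coefficient.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
u, v = rng.normal(size=20), rng.normal(size=20)

# Center both vectors, then take the normalized dot product (cosine)
uc, vc = u - u.mean(), v - v.mean()
cosine = uc @ vc / (np.linalg.norm(uc) * np.linalg.norm(vc))

r, _ = pearsonr(u, v)
assert np.isclose(cosine, r)  # Pearson = cosine of the centered vectors
```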
Pearson vs Euclidean ● Pearson's coefficient: profiles of similar shapes will be close to each other, even if they differ in magnitude. ● Euclidean distance: magnitude is taken into account.
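A toy illustration of the contrast, with two made-up profiles of identical shape but different magnitudes:

```python
import numpy as np
from scipy.spatial.distance import euclidean
from scipy.stats import pearsonr

u = np.array([1.0, 2.0, 3.0, 4.0])
v = 10.0 * u  # same shape, 10x the magnitude

r, _ = pearsonr(u, v)
print(r)                # 1.0: same shape, perfectly correlated
print(euclidean(u, v))  # ~49.3: Euclidean distance penalizes the magnitude gap
```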
Evaluating clusters
Evaluating clusters ● Clustering is unsupervised: there is no ground truth. How do we evaluate the quality of a clustering algorithm? ● 1) Based on the shape of the clusters: points within the same cluster should be nearby/similar, and points far from each other should belong to different clusters. ● 2) Based on the stability of the clusters: we should get the same results if we remove some data points, add noise, etc. ● 3) Based on domain knowledge: the clusters should “make sense”.
Centroids and medoids ● Centroid: mean of the points in the cluster. ● Medoid: point in the cluster that is closest to the centroid.
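A small NumPy sketch of both notions, following the slide's definition of the medoid (the point closest to the centroid; elsewhere the medoid is sometimes defined as the point minimizing the average distance to all other cluster points). The cluster array is a made-up example:

```python
import numpy as np

def centroid(points):
    """Mean of the points in a cluster (rows of `points`)."""
    return points.mean(axis=0)

def medoid(points):
    """Point of the cluster closest to the centroid (slide's definition)."""
    c = centroid(points)
    distances = np.linalg.norm(points - c, axis=1)
    return points[np.argmin(distances)]

cluster = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [4.0, 4.0]])
print(centroid(cluster))  # [1.25, 1.25] -- need not be a data point
print(medoid(cluster))    # [1.0, 0.0]   -- always an actual data point
```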
Cluster shape: Tightness ● Tightness (homogeneity) of cluster $C_k$: the average distance of its points to the cluster centroid $\mu_k$: $T_k = \frac{1}{|C_k|} \sum_{x \in C_k} d(x, \mu_k)$ ● The smaller $T_k$, the more homogeneous the cluster.
Cluster shape: Separability ● Separation between clusters $C_k$ and $C_l$: the distance between their centroids: $S_{kl} = d(\mu_k, \mu_l)$ ● The larger $S_{kl}$, the better separated the two clusters.
Cluster shape: Davies-Bouldin index ● Combines cluster tightness (homogeneity) $T_k$ and cluster separation $S_{kl}$: $DB = \frac{1}{K} \sum_{k=1}^{K} \max_{l \neq k} \frac{T_k + T_l}{S_{kl}}$ ● The lower the index, the better the clustering.
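A sketch of the index under the definitions above (mean distance to the centroid for $T_k$, distance between centroids for $S_{kl}$); recent versions of scikit-learn (0.20+) also ship this as sklearn.metrics.davies_bouldin_score:

```python
import numpy as np

def davies_bouldin(X, labels):
    """Davies-Bouldin index from the slide's T_k and S_kl definitions.

    T_k  : mean Euclidean distance of cluster k's points to its centroid.
    S_kl : Euclidean distance between the centroids of clusters k and l.
    """
    ks = np.unique(labels)
    centroids = np.array([X[labels == k].mean(axis=0) for k in ks])
    T = np.array([np.linalg.norm(X[labels == k] - c, axis=1).mean()
                  for k, c in zip(ks, centroids)])
    db = 0.0
    for i in range(len(ks)):
        ratios = [(T[i] + T[j]) / np.linalg.norm(centroids[i] - centroids[j])
                  for j in range(len(ks)) if j != i]
        db += max(ratios)  # worst (tightest-to-separation) ratio for cluster i
    return db / len(ks)
```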
Cluster shape: Silhouette coefficient ● a(x): how well x fits in its cluster, the mean distance of x to the other points of its cluster. ● b(x): how well x would fit in another cluster, the minimum over the other clusters of the mean distance of x to their points. ● $s(x) = \frac{b(x) - a(x)}{\max(a(x), b(x))}$ ● if x is very close to the other points of its cluster: s(x) = 1 ● if x is very close to the points in another cluster: s(x) = -1
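Computing silhouettes with scikit-learn on made-up blob data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print(silhouette_score(X, labels))        # mean s(x) over all points
print(silhouette_samples(X, labels)[:5])  # per-point s(x) values
```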
Evaluating clusters ● 2) Based on the stability of the clusters: we should get the same results if we remove some data points, add noise, etc.
Cluster stability ● How many clusters? [Figure: clusterings of the same data with K=2 and with K=3.]
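One simple way (an assumed variant, not from the slides) to operationalize stability: cluster random subsamples of the data repeatedly, with different initializations, and measure the agreement between runs with the adjusted Rand index; a value of K that yields consistently high agreement is more trustworthy.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)
rng = np.random.default_rng(0)

def stability(X, k, n_rounds=20, frac=0.8):
    """Mean ARI between two differently seeded clusterings of each subsample."""
    scores = []
    for _ in range(n_rounds):
        idx = rng.choice(len(X), size=int(frac * len(X)), replace=False)
        a = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X[idx])
        b = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X[idx])
        scores.append(adjusted_rand_score(a, b))
    return np.mean(scores)

for k in (2, 3, 4, 5):
    print(k, stability(X, k))  # a good K tends to give the highest agreement
```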
Evaluating clusters ● 3) Based on domain knowledge: the clusters should “make sense”.
Domain knowledge ● Do the clusters match natural categories? – Check with human expertise
Ontology enrichment analysis ● Ontology: entities may be grouped, related within a hierarchy, and subdivided according to similarities and differences. Built by human experts. ● E.g. the Gene Ontology http://geneontology.org/ – Describes genes with a common vocabulary, organized in categories. E.g. cellular process > cell death > programmed cell death > apoptotic process > execution phase of apoptosis
Ontology enrichment analysis ● Enrichment analysis: are there more data points from ontology category G in cluster C than expected by chance? ● TANGO [Tanay et al., 2003] – Assume the data points are sampled from a hypergeometric distribution. – The probability for the intersection of G and C to contain more than t points is $P(|G \cap C| > t) = \sum_{i=t+1}^{\min(|G|,|C|)} \frac{\binom{|G|}{i}\binom{n-|G|}{|C|-i}}{\binom{n}{|C|}}$, where each term is the probability of getting i points from G when drawing |C| points from a total of n samples.
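This tail probability can be computed directly with SciPy's hypergeometric distribution; all the counts below are made-up for illustration:

```python
from scipy.stats import hypergeom

n = 20000  # total number of genes (made-up population size)
G = 150    # genes annotated with the ontology category
C = 300    # genes in the cluster
t = 5      # observed overlap threshold

# hypergeom(M=n, n=|G|, N=|C|) models the number of G-genes among |C| draws
# without replacement from n genes; sf(t) = P(|G ∩ C| > t), the p-value.
p_value = hypergeom(n, G, C).sf(t)
print(p_value)
```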
Hierarchical clustering
Hierarchical clustering Group data over a variety of possible scales, in a multi-level hierarchy.
Construction ● Agglomerative approach (bottom-up): start with each element in its own cluster; iteratively join neighboring clusters. ● Divisive approach (top-down): start with all elements in the same cluster; iteratively separate them into smaller clusters.
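An agglomerative (bottom-up) run with SciPy on toy blob data; each row of the linkage matrix Z records one merge (the indices of the two clusters joined, their distance, and the size of the new cluster):

```python
from scipy.cluster.hierarchy import linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=0)

Z = linkage(X, method="average", metric="euclidean")
print(Z[:5])  # the first five bottom-up merges
```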
Dendrogram ● The results of a hierarchical clustering algorithm are presented in a dendrogram. ● Branch length = cluster distance.
Dendrogram ● The height of each “U” = the distance between the two clusters it joins. ● How many clusters? Cut the dendrogram at a chosen height: each branch below the cut is one cluster. [Figure: a cut yielding four clusters, labeled 1 to 4.]
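Plotting the dendrogram and cutting it into a flat clustering with SciPy (toy data again):

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=4, random_state=0)
Z = linkage(X, method="average")

# Draw the dendrogram; the height of each "U" is the merge distance
dendrogram(Z)
plt.show()

# Cut the tree to get a flat clustering, e.g. into 4 clusters
labels = fcluster(Z, t=4, criterion="maxclust")
print(labels)
```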
Linkage: connecting two clusters ● Single linkage: distance between the two closest points, $d(C_k, C_l) = \min_{x \in C_k,\, x' \in C_l} d(x, x')$ ● Complete linkage: distance between the two farthest points, $d(C_k, C_l) = \max_{x \in C_k,\, x' \in C_l} d(x, x')$ ● Average linkage: mean of all pairwise distances, $d(C_k, C_l) = \frac{1}{|C_k||C_l|} \sum_{x \in C_k} \sum_{x' \in C_l} d(x, x')$ ● Centroid linkage: distance between the centroids, $d(C_k, C_l) = d(\mu_k, \mu_l)$ ● Ward: join clusters so as to minimize the within-cluster variance
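The same SciPy call exposes all five criteria through its method argument (note that centroid and ward linkage require the default Euclidean metric):

```python
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100, centers=3, random_state=0)

# Same data, different linkage criteria -> possibly different trees
for method in ("single", "complete", "average", "centroid", "ward"):
    Z = linkage(X, method=method)
    labels = fcluster(Z, t=3, criterion="maxclust")
    print(method, labels[:10])
```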
Example: gene expression clustering ● Breast cancer survival signature [Bergamashi et al. 2011] [Figure: hierarchical clustering of both the genes and the patients of an expression dataset.]
Hierarchical clustering ● Advantages – No need to pre-define the number of clusters – Interpretability ● Drawbacks – Computational complexity: e.g. naive single/complete linkage needs at least O(pn²) to compute all pairwise distances – Must decide at which level of the hierarchy to split – Lack of robustness (unstable)
K-means
K-means clustering ● Minimize the intra-cluster variance: $\arg\min_{C_1, \dots, C_K} \sum_{k=1}^{K} \sum_{x \in C_k} \|x - \mu_k\|^2$ ● What will this partition of the space look like? Each cluster consists of the points that are closer to its centroid than to any other centroid: the centroids induce a Voronoi partition of the space.
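A minimal sketch of Lloyd's algorithm, the standard way to (locally) minimize this objective: alternate assigning each point to its nearest centroid and recomputing each centroid as the mean of its points. This omits details a real implementation needs (k-means++ seeding, empty-cluster handling, multiple restarts):

```python
import numpy as np
from sklearn.datasets import make_blobs

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal Lloyd's algorithm: alternate assignment and centroid updates."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update step: recompute each centroid as the mean of its points
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centroids):
            break  # converged: assignments no longer change
        centroids = new
    return labels, centroids

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels, centroids = kmeans(X, k=3)
print(centroids)
```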