  1. Foundations of Machine Learning CentraleSupélec — Fall 2017 12. Clustering Chloé-Agathe Azencott Centre for Computational Biology, Mines ParisTech chloe-agathe.azencott@mines-paristech.fr

  2. Learning objectives ● Explain what clustering algorithms can be used for. ● Explain and implement three different ways to evaluate clustering algorithms. ● Implement hierarchical clustering, discuss its various flavors. ● Implement k-means clustering, discuss its advantages and drawbacks. ● Sketch out a density-based clustering algorithm. 2

  3. Goals of clustering Group objects that are similar into clusters: classes that are unknown beforehand. 3

  4. Goals of clustering Group objects that are similar into clusters: classes that are unknown beforehand. 4

  5. Goals of clustering Group objects that are similar into clusters: classes that are unknown beforehand. E.g. – group genes that are similarly affected by a disease – group patients whose genes respond similarly to a disease – group pixels in an image that belong to the same object (image segmentation). 5

  6. Applications of clustering ● Understand general characteristics of the data ● Visualize the data ● Infer some properties of a data point based on how it relates to other data points E.g. – find subtypes of diseases – visualize protein families – find categories among images – find patterns in financial transactions – detect communities in social networks 6

  7. Distances and similarities 7

  8. Distances & similarities ● Assess how close / far – data points are from each other – a data point is from a cluster – two clusters are from each other ● Distance metric 8

  9. Distances & similarities ● Assess how close / far – data points are from each other – a data point is from a cluster – two clusters are from each other ● Distance metric: symmetry d(x, x') = d(x', x); triangle inequality d(x, x'') ≤ d(x, x') + d(x', x'') ● E.g. Lq distances: d_q(x, x') = (Σ_j |x_j − x'_j|^q)^(1/q) 9
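
(Not part of the slides: a minimal Python/NumPy sketch of the Lq distance above, checked against SciPy's minkowski; the two vectors are toy values.)

```python
import numpy as np
from scipy.spatial.distance import minkowski

# Two data points (feature vectors); toy values for illustration.
x = np.array([1.0, 2.0, 3.0])
x_prime = np.array([2.0, 0.0, 3.5])

# Lq distance: d_q(x, x') = (sum_j |x_j - x'_j|^q)^(1/q)
for q in (1, 2, 3):
    d_manual = np.sum(np.abs(x - x_prime) ** q) ** (1.0 / q)
    d_scipy = minkowski(x, x_prime, p=q)  # same quantity via SciPy
    print(f"L{q} distance: {d_manual:.4f} (scipy: {d_scipy:.4f})")
```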

  10. Distance & similarities ● How do we get similarities? 10

  11. Distance & similarities ● Transform distances into similarities? ● Kernels define similarities: for a given mapping φ from the space of objects X to some Hilbert space H, the kernel between two objects x and x' is the inner product of their images in the feature space, k(x, x') = ⟨φ(x), φ(x')⟩_H. 11
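
(Not in the slides: a small sketch of two common kernels, the linear kernel, i.e. the plain inner product, and the Gaussian RBF kernel, which turns a Euclidean distance into a similarity in (0, 1]; the bandwidth gamma is an illustrative choice.)

```python
import numpy as np

def linear_kernel(x, x_prime):
    # Inner product in the original space (identity feature map).
    return float(np.dot(x, x_prime))

def rbf_kernel(x, x_prime, gamma=0.5):
    # Gaussian (RBF) kernel: turns the squared Euclidean distance into a similarity.
    # gamma is an illustrative bandwidth, not a value from the slides.
    return float(np.exp(-gamma * np.sum((x - x_prime) ** 2)))

x = np.array([1.0, 2.0, 3.0])
x_prime = np.array([2.0, 0.0, 3.5])
print(linear_kernel(x, x_prime), rbf_kernel(x, x_prime))
```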

  12. Pearson's correlation ● Measure of the linear correlation between two variables ● If the features are centered: ? 12

  13. Pearson's correlation ● Measure of the linear correlation between two variables ● If the features are centered: ρ(x, x') = Σ_j x_j x'_j / (√(Σ_j x_j²) · √(Σ_j x'_j²)) ● Normalized dot product = cosine 13

  14. Pearson vs. Euclidean ● Pearson's coefficient: profiles of similar shapes will be close to each other, even if they differ in magnitude. ● Euclidean distance: magnitude is taken into account. 14
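
(Illustration not in the slides: two toy profiles with the same shape but different magnitudes have Pearson correlation 1, yet a large Euclidean distance.)

```python
import numpy as np

def pearson(x, y):
    # Pearson correlation = cosine similarity of the centered vectors.
    xc, yc = x - x.mean(), y - y.mean()
    return float(np.dot(xc, yc) / (np.linalg.norm(xc) * np.linalg.norm(yc)))

# Two expression-like profiles with the same shape, different magnitude.
profile_a = np.array([1.0, 3.0, 2.0, 5.0, 4.0])
profile_b = 10.0 * profile_a

print("Pearson:  ", pearson(profile_a, profile_b))          # 1.0: identical shape
print("Euclidean:", np.linalg.norm(profile_a - profile_b))  # large: magnitudes differ
```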

  15. Pearson vs. Euclidean 15

  16. Evaluating clusters 16

  17. Evaluating clusters ● Clustering is unsupervised. ● There is no ground truth. How do we evaluate the quality of a clustering algorithm? 17

  18. Evaluating clusters ● Clustering is unsupervised. ● There is no ground truth. How do we evaluate the quality of a clustering algorithm? ● 1) Based on the shape of the clusters: Points within the same cluster should be nearby/similar and points far from each other should belong to different clusters. ● Based on the stability of the clusters: We should get the same results if we remove some data points, add noise, etc. ● Based on domain knowledge: The clusters should “make sense”. 18

  19. Evaluating clusters ● Clustering is unsupervised. ● There is no ground truth. How do we evaluate the quality of a clustering algorithm? ● 1) Based on the shape of the clusters: Points within the same cluster should be nearby/similar and points far from each other should belong to different clusters. ● Based on the stability of the clusters: We should get the same results if we remove some data points, add noise, etc. ● Based on domain knowledge: The clusters should “make sense”. 19

  20. Centroids and medoids ● Centroid: mean of the points in the cluster. ● Medoid: point in the cluster that is closest to the centroid. 20
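
(Not in the slides: a minimal NumPy sketch computing the centroid and, following the definition above, the medoid as the cluster point closest to the centroid; the points are toy values.)

```python
import numpy as np

# Toy cluster of points (rows = points, columns = features).
cluster = np.array([[0.0, 0.0],
                    [1.0, 0.5],
                    [0.5, 1.0],
                    [4.0, 4.0]])   # an outlier pulls the centroid, less so the medoid

centroid = cluster.mean(axis=0)                          # mean of the points
dists_to_centroid = np.linalg.norm(cluster - centroid, axis=1)
medoid = cluster[np.argmin(dists_to_centroid)]           # actual data point closest to the centroid

print("centroid:", centroid)
print("medoid:  ", medoid)
```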

  21. Cluster shape: Tightness (figure comparing a tight cluster and a more spread-out cluster) 21

  22. Cluster shape: Tightness ● Tightness of cluster C_k: the average distance of its points to its centroid μ_k, T_k = (1/|C_k|) Σ_{x ∈ C_k} d(x, μ_k) 22

  23. Cluster shape: Separability (figure comparing well-separated and overlapping clusters) 23

  24. Cluster shape: Separability ● Separability of clusters C_k and C_l: the distance between their centroids, S_kl = d(μ_k, μ_l) 24

  25. Cluster shape: Davies-Bouldin ● Cluster tightness (homogeneity) T_k ● Cluster separation S_kl ● Davies-Bouldin index: DB = (1/K) Σ_{k=1}^{K} max_{l ≠ k} (T_k + T_l) / S_kl (lower is better) 25
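
(Not in the slides: a sketch of the Davies-Bouldin index implemented directly from the T_k / S_kl definitions above, using Euclidean distances; the labels are assumed to be integer cluster assignments.)

```python
import numpy as np

def davies_bouldin(X, labels):
    """Davies-Bouldin index from the tightness/separation definitions above (lower is better)."""
    cluster_ids = np.unique(labels)
    centroids = np.array([X[labels == k].mean(axis=0) for k in cluster_ids])
    # T_k: average distance of the points of cluster k to its centroid.
    T = np.array([np.mean(np.linalg.norm(X[labels == k] - centroids[i], axis=1))
                  for i, k in enumerate(cluster_ids)])
    K = len(cluster_ids)
    db = 0.0
    for i in range(K):
        # S_kl: distance between centroids; take the worst-case ratio for cluster i.
        ratios = [(T[i] + T[j]) / np.linalg.norm(centroids[i] - centroids[j])
                  for j in range(K) if j != i]
        db += max(ratios)
    return db / K

# sklearn.metrics.davies_bouldin_score(X, labels) computes the same index with Euclidean distances.
```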

  26. Cluster shape: Silhouette coefficient ● a(x): how well x fits in its cluster (mean distance of x to the other points of its cluster) ● b(x): how well x would fit in another cluster (smallest mean distance of x to the points of another cluster) ● s(x) = (b(x) − a(x)) / max(a(x), b(x)) ● if x is very close to the other points of its cluster: s(x) ≈ 1 ● if x is very close to the points in another cluster: s(x) ≈ −1 26
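
(Not in the slides: silhouette scores computed with scikit-learn on toy two-blob data; the blob parameters are made up for illustration.)

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples

rng = np.random.default_rng(0)
# Toy data: two well-separated Gaussian blobs.
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(5.0, 0.5, size=(50, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("mean silhouette:", silhouette_score(X, labels))      # close to 1 for clean clusters
print("per-point s(x): ", silhouette_samples(X, labels)[:5])
```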

  27. Evaluating clusters ● Clustering is unsupervised. ● There is no ground truth. How do we evaluate the quality of a clustering algorithm? ● 1) Based on the shape of the clusters: Points within the same cluster should be nearby/similar and points far from each other should belong to different clusters. ● 2) Based on the stability of the clusters: We should get the same results if we remove some data points, add noise, etc. ● Based on domain knowledge: The clusters should “make sense”. 27

  28. Cluster stability ● How many clusters? 28

  29. Cluster stability ● K=2 ● K=3 29

  30. Cluster stability ● K=2 ● K=3 30
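
(Not in the slides: one possible way to turn the stability idea into code, namely reclustering random subsamples and measuring agreement with the clustering of the full data via the adjusted Rand index; the subsampling fraction and number of rounds are arbitrary illustrative choices.)

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def stability(X, k, n_rounds=20, frac=0.8, seed=0):
    """Mean agreement (adjusted Rand index) between the clustering of the full data
    and clusterings of random subsamples; values close to 1 suggest a stable choice of k."""
    rng = np.random.default_rng(seed)
    reference = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
    n = X.shape[0]
    scores = []
    for _ in range(n_rounds):
        idx = rng.choice(n, size=int(frac * n), replace=False)
        sub_labels = KMeans(n_clusters=k, n_init=10).fit_predict(X[idx])
        scores.append(adjusted_rand_score(reference[idx], sub_labels))
    return float(np.mean(scores))

# Usage sketch: compare stability(X, k=2) and stability(X, k=3) to choose between K=2 and K=3.
```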

  31. Evaluating clusters ● Clustering is unsupervised. ● There is no ground truth. How do we evaluate the quality of a clustering algorithm? ● 1) Based on the shape of the clusters: Points within the same cluster should be nearby/similar and points far from each other should belong to different clusters. ● 2) Based on the stability of the clusters: We should get the same results if we remove some data points, add noise, etc. ● 3) Based on domain knowledge: The clusters should “make sense”. 31

  32. Domain knowledge ● Do the clusters match natural categories? – Check with human expertise 32

  33. Ontology enrichment analysis ● Ontology: entities may be grouped, related within a hierarchy, and subdivided according to similarities and differences. Built by human experts. ● E.g.: The Gene Ontology http://geneontology.org/ – Describes genes with a common vocabulary, organized in categories. E.g. cellular process > cell death > programmed cell death > apoptotic process > execution phase of apoptosis 33

  34. Ontology enrichment analysis ● Enrichment analysis: are there more data points from ontology category G in cluster C than expected by chance? ● TANGO [Tanay et al., 2003] – Assume data points sampled from a hypergeometric distribution – The probability for the intersection of G and C to contain more than t points is: P(|G ∩ C| > t) = Σ_{i > t} C(|G|, i) · C(n − |G|, |C| − i) / C(n, |C|) 34

  35. Ontology enrichment analysis ● Enrichment analysis: are there more data points from ontology category G in cluster C than expected by chance? ● TANGO [Tanay et al., 2003] – Assume data points sampled from a hypergeometric distribution – The probability for the intersection of G and C to contain more than t points is: P(|G ∩ C| > t) = Σ_{i > t} C(|G|, i) · C(n − |G|, |C| − i) / C(n, |C|), where each term is the probability of getting i points from G when drawing |C| points from a total of n samples. 35
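
(Not in the slides: the tail probability above computed with SciPy's hypergeometric distribution; all counts are illustrative numbers, not real gene data.)

```python
from scipy.stats import hypergeom

# Illustrative numbers (not from the slides):
n_total = 5000   # total number of genes (samples)
size_G  = 120    # genes annotated with ontology category G
size_C  = 200    # genes in cluster C
t       = 15     # observed overlap |G intersection C|

# Expected overlap by chance: size_G * size_C / n_total = 4.8, so 15 looks enriched.
# P(|G ∩ C| > t) under the hypergeometric null: draw |C| genes out of n_total,
# of which size_G belong to G.  sf(t) = P(X > t).
p_value = hypergeom.sf(t, n_total, size_G, size_C)
print(f"enrichment p-value: {p_value:.3e}")
```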

  36. Hierarchical clustering 36

  37. Hierarchical clustering Group data over a variety of possible scales, in a multi-level hierarchy. 37

  38. Construction ● Agglomerative approach (bottom-up) Start with each element in its own cluster. Iteratively join neighboring clusters. ● Divisive approach (top-down) Start with all elements in the same cluster. Iteratively separate into smaller clusters. 38

  39. Dendrogram ● The results of a hierarchical clustering algorithm are presented in a dendrogram. ● Branch length = cluster distance. 39

  40. Dendrogram ● The results of a hierarchical clustering algorithm are presented in a dendrogram. ● U height = distance. How many clusters? 40

  41. Dendrogram ● The results of a hierarchical clustering algorithm are presented in a dendrogram. ● U height = distance. (figure: cutting the dendrogram at a given level yields clusters 1, 2, 3, 4) 41

  42. Linkage: connecting two clusters ● Single linkage: distance between the two closest points of the two clusters (minimum pairwise distance) 42

  43. Linkage: connecting two clusters ● Complete linkage: distance between the two farthest points of the two clusters (maximum pairwise distance) 43

  44. Linkage: connecting two clusters ● Average linkage: average of all pairwise distances between points of the two clusters 44

  45. Linkage: connecting two clusters ● Centroid linkage: distance between the centroids of the two clusters 45

  46. Linkage: connecting two clusters ● Ward: join the two clusters whose merger minimizes the increase in within-cluster variance 46
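
(Not in the slides: agglomerative clustering with SciPy, where the 'method' argument selects one of the linkages listed above; the toy data and the cut into 3 clusters are illustrative choices.)

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
# Toy data: three loose groups of points in 2D.
X = np.vstack([rng.normal(loc, 0.3, size=(20, 2)) for loc in (0.0, 3.0, 6.0)])

# Agglomerative clustering; 'method' selects the linkage:
# 'single', 'complete', 'average', 'centroid' or 'ward'.
Z = linkage(X, method='ward')

# Cut the dendrogram to obtain a flat clustering with 3 clusters.
labels = fcluster(Z, t=3, criterion='maxclust')
print(np.bincount(labels)[1:])   # cluster sizes

# dendrogram(Z) draws the tree (e.g. inside a matplotlib figure).
```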

  47. Example: Gene expression clustering ● Breast cancer survival signature [Bergamashi et al. 2011] (figure: clustered expression heatmap with dendrograms over (1) genes and (2) patients) 47

  48. Hierarchical clustering ● Advantages – No need to pre-define the number of clusters – Interpretability ● Drawbacks – Computational complexity ? 48

  49. Hierarchical clustering ● Advantages – No need to pre-define the number of clusters – Interpretability ● Drawbacks – Computational complexity E.g. single/complete linkage (naive): at least O(pn²) to compute all pairwise distances. – Must decide at which level of the hierarchy to split – Lack of robustness (unstable) 49

  50. K-means 50

  51. K-means clustering ● Minimize the intra-cluster variance: find clusters C_1, …, C_K minimizing Σ_k Σ_{x ∈ C_k} ||x − μ_k||², where μ_k is the centroid of C_k ● What will this partition of the space look like? 51

  52. K-means clustering ● Minimize the intra-cluster variance ● For each cluster, the points in that cluster are those that are closer to its centroid than to any other centroid (a Voronoi partition of the space) 52
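
(Not in the slides: a bare-bones NumPy sketch of the usual Lloyd iterations for k-means, alternating assignment to the closest centroid and centroid update; the random initialization and the stopping rule are simple illustrative choices.)

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd iterations: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(X.shape[0], size=k, replace=False)]  # random initial centroids
    for _ in range(n_iter):
        # Assignment step: each point goes to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Usage sketch: labels, centroids = kmeans(X, k=3)
```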
