A quick review The clustering problem: Different representations - PowerPoint PPT Presentation

A quick review
The clustering problem:
• Different representations
• Homogeneity vs. separation
• Many possible distance metrics
• Many possible linkage approaches
• Method matters; metric matters; definitions matter

A quick review  Hierarchical clustering:  Takes as input a distance matrix  Progressively regroups the closest objects/groups  The result is a tree - intermediate nodes represent clusters  Branch lengths represent distances between clusters branch object 1 Distance matrix c1 node object 1 object 2 object 3 object 4 object 5 object 5 c3 object 4 c4 c2 object 2 object 1 0.00 4.00 6.00 3.50 1.00 object 2 4.00 0.00 6.00 2.00 4.50 object 3 root object 3 6.00 6.00 0.00 5.50 6.50 object 4 3.50 2.00 5.50 0.00 4.00 leaf object 5 1.00 4.50 6.50 4.00 0.00 nodes

Hierarchical clustering result: five clusters

The “philosophy” of clustering - Summary
• An “unsupervised learning” problem
• No single solution is necessarily the true/correct one!
• There is usually a tradeoff between homogeneity and separation:
    • More clusters → increased homogeneity but decreased separation
    • Fewer clusters → increased separation but reduced homogeneity
• Method matters; metric matters; definitions matter
• In most cases, heuristic methods or approximations are used.

Clustering: k-mean clustering
Genome 559: Introduction to Statistical and Computational Genomics
Elhanan Borenstein

K-mean clustering: A different approach
• Clear definition of a ‘good’ clustering solution (in contrast to hierarchical clustering)
• Divisive rather than agglomerative (in contrast to hierarchical clustering)
• Obtained solution is non-hierarchical (in contrast to hierarchical clustering)
• A new algorithmic approach (unlike any algorithm we learned so far)

What constitutes a good clustering solution? (What exactly are we trying to find?)

Defining a good clustering solution
[Figure: plot with axes "Expression in condition 1" and "Expression in condition 2"]

Defining a good clustering solution
[Figure: two panels, each with axes "Expression in condition 1" and "Expression in condition 2"]

Defining a good clustering solution
[Figure: two panels with axes "Expression in condition 1" and "Expression in condition 2"; the red cluster center and the green cluster center are marked]
The K-mean approach: clustering of n observations/points into k clusters is ‘good’ if each observation is assigned to the cluster with the nearest mean/center

Defining a good clustering solution
[Figure: plot with axes "condition 1" and "condition 2"]

Defining a good clustering solution
[Figure: four panels, each with axes "condition 1" and "condition 2"; one panel is marked with an X]

K-mean clustering
• An algorithm for partitioning n observations/points into k clusters such that each observation belongs to the cluster with the nearest mean/center
[Figure: two clusters with "Cluster 1 center (mean)" and "Cluster 2 center (mean)" marked]

But how do we find a clustering solution with this property?

K-mean clustering: Chicken and egg?
• An algorithm for partitioning n observations/points into k clusters such that each observation belongs to the cluster with the nearest mean/center
• Note the two components of this definition:
    • Partitioning of n points into clusters
    • Clusters’ means
• A chicken and egg problem:
    • I do not know the means before I determine the partitioning
    • I do not know the partitioning before I determine the means
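One way to make the circularity explicit is to write the two components as a pair of conditions that must hold at the same time. The symbols are notation introduced here, not taken from the slides: x_i are the n points, C_j is the j-th cluster, and \mu_j is its mean; ties between equally close means are assumed to be broken arbitrarily.

```latex
\begin{align*}
C_j   &= \{\, x_i : \lVert x_i - \mu_j \rVert \le \lVert x_i - \mu_l \rVert
         \ \text{for all } l = 1, \dots, k \,\}
      && \text{(partition, given the means)} \\
\mu_j &= \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
      && \text{(means, given the partition)}
\end{align*}
```

Each line is easy to evaluate once the other is known, which is exactly the chicken and egg situation described above.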

The K-mean clustering algorithm: An iterative approach
• Key principle - cluster around mobile centers: start with some random locations of means/centers, partition into clusters according to these centers, then correct the centers according to the clusters, and repeat [similar to EM (expectation-maximization) algorithms]

K-mean clustering algorithm
• The number of centers, k, has to be specified a priori
• Algorithm:
    1. Arbitrarily select k initial centers
    2. Assign each element to the closest center
    3. Re-calculate centers (mean position of the assigned elements)
    4. Repeat 2 and 3 until …

K-mean clustering algorithm
• The number of centers, k, has to be specified a priori
• Algorithm:
    1. Arbitrarily select k initial centers
    2. Assign each element to the closest center
    3. Re-calculate centers (mean position of the assigned elements)
    4. Repeat 2 and 3 until one of the following termination conditions is reached:
        i. The clusters are the same as in the previous iteration (stable solution)
        ii. The clusters are as in some previous iteration (cycle)
        iii. The difference between two iterations is small??
        iv. The maximum number of iterations has been reached

K-mean clustering algorithm
• The number of centers, k, has to be specified a priori
• Algorithm:
    1. Arbitrarily select k initial centers
    2. Assign each element to the closest center (How can we do this efficiently?)
    3. Re-calculate centers (mean position of the assigned elements)
    4. Repeat 2 and 3 until one of the following termination conditions is reached:
        i. The clusters are the same as in the previous iteration (stable solution)
        ii. The clusters are as in some previous iteration (cycle)
        iii. The difference between two iterations is small??
        iv. The maximum number of iterations has been reached

Assigning elements to the closest center
• Could be computationally intensive …
[Figure: two centers, A and B]

Assigning elements to the closest center
• Could be computationally intensive …
• Preprocessing (by partitioning the space) can help
[Figure: two centers, A and B]

Assigning elements to the closest center
• Could be computationally intensive …
• Preprocessing (by partitioning the space) can help
[Figure: centers A and B; regions labeled "closer to A than to B" and "closer to B than to A"]

Assigning elements to the closest center
• Could be computationally intensive …
• Preprocessing (by partitioning the space) can help
[Figure: centers A, B, and C; regions labeled "closer to B than to A", "closer to A than to B", and "closer to B than to C"]

Assigning elements to the closest center
• Could be computationally intensive …
• Preprocessing (by partitioning the space) can help
[Figure: centers A, B, and C; regions labeled "closest to A", "closest to B", and "closest to C"]

Assigning elements to the closest center
• Could be computationally intensive …
• Preprocessing (by partitioning the space) can help
[Figure: centers A, B, and C]

Voronoi diagram
• Decomposition of a metric space determined by distances to a specified discrete set of “centers” in the space (each colored cell represents the collection of all points in this space that are closer to a specific center than to any other)
• Several algorithms exist to find the Voronoi diagram
• Numerous applications (e.g., the 1854 Broad Street cholera outbreak in Soho, England; aviation; and many others)
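The link between "closest center" and the partition of space can be checked with a small Python sketch. It does not build a Voronoi diagram; it only shows that a point's cell is the one that wins the "closer to A than to B" test against every other center, and that this agrees with a brute-force distance comparison. The center coordinates and function names are hypothetical.

```python
import math

def nearest_center(point, centers):
    """Brute force: compare the point against every center."""
    return min(centers, key=lambda name: math.dist(point, centers[name]))

def closer_to_a(point, a, b):
    """True if point lies on A's side of the perpendicular bisector of A and B.
    |p - a|^2 < |p - b|^2  is equivalent to  (p - m) . (b - a) < 0,
    where m is the midpoint of A and B."""
    mid = [(ai + bi) / 2 for ai, bi in zip(a, b)]
    return sum((p - m) * (bi - ai)
               for p, m, ai, bi in zip(point, mid, a, b)) < 0

def voronoi_cell(point, centers):
    """Name of the center whose cell contains the point, i.e. the center
    that wins the half-plane test against every other center."""
    for name, c in centers.items():
        if all(closer_to_a(point, c, other)
               for other_name, other in centers.items() if other_name != name):
            return name
    return None  # the point lies exactly on a boundary between cells

# Hypothetical coordinates for the three centers drawn on the slides.
centers = {"A": (0, 0), "B": (4, 0), "C": (2, 3)}

for p in [(1, 0.5), (3, 0.5), (2, 2.5)]:
    print(p, nearest_center(p, centers), voronoi_cell(p, centers))
```

The brute-force version costs k distance computations per point; the preprocessing idea on the slides is about answering the same question faster once the cells are known.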

K-mean clustering algorithm
• The number of centers, k, has to be specified a priori
• Algorithm:
    1. Arbitrarily select k initial centers
    2. Assign each element to the closest center (Voronoi)
    3. Re-calculate centers (mean position of the assigned elements)
    4. Repeat 2 and 3 until one of the following termination conditions is reached:
        i. The clusters are the same as in the previous iteration (stable solution)
        ii. The clusters are as in some previous iteration (cycle)
        iii. The difference between two iterations is small??
        iv. The maximum number of iterations has been reached
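The four steps and the four termination conditions can be put together in a short Python sketch. Several choices here are assumptions rather than part of the slides: step 2 uses a brute-force search instead of a Voronoi structure, "difference between two iterations" is read as the largest movement of any center, an empty cluster keeps its old center, and the demo data uses a standard deviation of 0.5.

```python
import math
import random

def closest(point, centers):
    """Step 2 for one point: index of the nearest center (brute force)."""
    return min(range(len(centers)), key=lambda j: math.dist(point, centers[j]))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def k_means(points, k, max_iter=100, tol=0.0, rng=random):
    centers = rng.sample(points, k)                  # 1. arbitrary initial centers
    labels, seen = None, set()
    for iteration in range(1, max_iter + 1):
        new_labels = tuple(closest(p, centers) for p in points)   # 2. assign
        if new_labels == labels:
            return centers, labels, "stable", iteration           # condition i
        if new_labels in seen:
            return centers, new_labels, "cycle", iteration        # condition ii
        seen.add(new_labels)
        labels = new_labels
        new_centers = []
        for j in range(k):                                        # 3. re-calculate
            members = [p for p, lab in zip(points, labels) if lab == j]
            # Assumption: an empty cluster simply keeps its previous center.
            new_centers.append(mean(members) if members else centers[j])
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:                              # disabled when tol == 0
            return centers, labels, "small change", iteration     # condition iii
    return centers, labels, "max iterations", max_iter            # condition iv

# Demo: 200 points around (0, 0) and 50 around (1, 1); the spread is assumed.
rng = random.Random(0)
points = [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(200)]
points += [(rng.gauss(1, 0.5), rng.gauss(1, 0.5)) for _ in range(50)]

centers, labels, reason, n_iter = k_means(points, k=2, rng=rng)
print(reason, "after", n_iter, "iterations")
print("centers:", centers)
```

Remembering every partition seen so far is a simple way to detect cycles, at the price of memory that grows with the number of iterations.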

K-mean clustering example
• Two sets of points randomly generated:
    • 200 centered on (0,0)
    • 50 centered on (1,1)

K-mean clustering example
• Two points are randomly chosen as centers (stars)

K-mean clustering example
• Each dot can now be assigned to the cluster with the closest center

K-mean clustering example
• First partition into clusters

K-mean clustering example
• Centers are re-calculated

K-mean clustering example
• And are again used to partition the points

K-mean clustering example
• Second partition into clusters

K-mean clustering example
• Re-calculating centers again

K-mean clustering example
• And we can again partition the points

K-mean clustering example
• Third partition into clusters

K-mean clustering example
• After 6 iterations:
    • The calculated centers remain stable

K-mean clustering: Summary
• The convergence of k-mean is usually quite fast (sometimes 1 iteration results in a stable solution)
• K-means is time- and memory-efficient
• Strengths:
    • Simple to use
    • Fast
    • Can be used with very large data sets
• Weaknesses:
    • The number of clusters has to be predetermined
    • The results may vary depending on the initial choice of centers
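The last weakness can be seen on a toy data set. The sketch below, with hypothetical data and function names, runs the assign/re-calculate loop from two different initial choices of centers on the four corners of a 4 x 1 rectangle. Both runs end in a stable solution, but the solutions differ. The sum of squared distances to the cluster centers is used here only as a convenient way to compare them; it is not a measure introduced in the slides.

```python
import math

def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points]

def update(points, labels, k):
    """Mean position of the points assigned to each cluster.
    Assumes no cluster is empty, which holds for the toy data below."""
    centers = []
    for j in range(k):
        members = [p for p, lab in zip(points, labels) if lab == j]
        centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return centers

def run(points, centers, max_iter=100):
    """Alternate assignment and re-centering until the partition stops changing."""
    labels = None
    for _ in range(max_iter):
        new_labels = assign(points, centers)
        if new_labels == labels:
            break
        labels = new_labels
        centers = update(points, labels, len(centers))
    return centers, labels

def sse(points, centers, labels):
    """Sum of squared distances from each point to its cluster center."""
    return sum(math.dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))

# Four corners of a 4 x 1 rectangle (hypothetical toy data).
points = [(0, 0), (0, 1), (4, 0), (4, 1)]

centers_a, labels_a = run(points, [(0, 0), (4, 0)])  # start from the two bottom corners
centers_b, labels_b = run(points, [(0, 0), (0, 1)])  # start from the two left corners

print(labels_a, sse(points, centers_a, labels_a))
print(labels_b, sse(points, centers_b, labels_b))
```

The first start groups the left corners against the right corners (sum of squares 1), while the second groups the bottom corners against the top corners (sum of squares 16). Each partition already assigns every point to its nearest mean, so neither run moves any further.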

K-mean clustering: Variations
• Expectation-maximization (EM): maintains probabilistic assignments to clusters, instead of deterministic assignments, and multivariate Gaussian distributions instead of means.
• k-means++: attempts to choose better starting points.
• Some variations attempt to escape local optima by swapping points between clusters.
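The slides do not spell out how k-means++ picks its starting points. As commonly described, the first center is drawn uniformly at random from the data, and each later center is drawn with probability proportional to its squared distance from the nearest center chosen so far. A minimal Python sketch of that seeding step only, with a hypothetical function name and toy data, assuming the data contains at least k distinct points:

```python
import math
import random

def kmeans_pp_init(points, k, rng=random):
    """Seeding step only: returns k initial centers chosen from the data.
    Assumes at least k distinct points, so the weights never sum to zero."""
    centers = [rng.choice(points)]                 # first center: uniform at random
    while len(centers) < k:
        # Squared distance from each point to its nearest chosen center.
        weights = [min(math.dist(p, c) ** 2 for c in centers) for p in points]
        # Far-away points are more likely to be picked; chosen points have weight 0.
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers

# Two well-separated groups of points (hypothetical toy data).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers = kmeans_pp_init(points, k=2, rng=random.Random(1))
print(centers)
```

The returned centers would then replace the arbitrary selection in step 1 of the algorithm; the remaining steps are unchanged.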

An important take-home message
[Figure: panels labeled "Hierarchical clustering", "K-mean clustering", and "?" (D’haeseleer, 2005)]

What else are we missing?

What else are we missing?
• What if the clusters are not “linearly separable”?

Defining a good clustering solution
The K-mean approach
[Figure: plot with axes "condition 1" and "condition 2"]
