Clustering
Aarti Singh (slides courtesy: Eric Xing)
Machine Learning 10-701/15-781, Oct 25, 2010
Unsupervised Learning
“Learning from unlabeled/unannotated data” (without supervision)
What can we predict from unlabeled data?
o Density estimation
o Groups or clusters in the data
o Low-dimensional structure
  - Principal Component Analysis (PCA) (linear)
  - Manifold learning (non-linear)
What is clustering?
• Clustering: the process of grouping a set of objects into classes of similar objects
  – high intra-class similarity
  – low inter-class similarity
• It is the most common form of unsupervised learning.
What is Similarity?
Hard to define! But we know it when we see it.
• The real meaning of similarity is a philosophical question. We will take a more pragmatic approach: think in terms of a distance (rather than similarity) between vectors, or correlations between random variables.
Distance metrics
x = (x_1, x_2, …, x_p), y = (y_1, y_2, …, y_p)
• Euclidean distance: $d(x, y) = \sqrt{\sum_{i=1}^{p} |x_i - y_i|^2}$
• Manhattan distance: $d(x, y) = \sum_{i=1}^{p} |x_i - y_i|$
• Sup-distance: $d(x, y) = \max_{1 \le i \le p} |x_i - y_i|$
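As a concrete illustration, here is a minimal NumPy sketch of the three distances above; NumPy and the sample vectors are my own choices, not part of the slides.

```python
import numpy as np

def euclidean(x, y):
    # d(x, y) = sqrt( sum_i |x_i - y_i|^2 )
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    # d(x, y) = sum_i |x_i - y_i|
    return np.sum(np.abs(x - y))

def sup_distance(x, y):
    # d(x, y) = max_i |x_i - y_i|
    return np.max(np.abs(x - y))

x = np.array([1.0, 4.0, 2.0])
y = np.array([3.0, 1.0, 2.0])
print(euclidean(x, y), manhattan(x, y), sup_distance(x, y))
# -> 3.605..., 5.0, 3.0
```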
Correlation coefficient
x = (x_1, x_2, …, x_p), y = (y_1, y_2, …, y_p): random vectors (e.g., expression levels of two genes under various drugs)
Pearson correlation coefficient:
$\rho(x, y) = \dfrac{\sum_{i=1}^{p} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{p} (x_i - \bar{x})^2 \, \sum_{i=1}^{p} (y_i - \bar{y})^2}}$
where $\bar{x} = \frac{1}{p}\sum_{i=1}^{p} x_i$ and $\bar{y} = \frac{1}{p}\sum_{i=1}^{p} y_i$.
(ρ near +1: positively correlated; ρ near –1: negatively correlated.)
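A small sketch of the Pearson correlation coefficient, again assuming NumPy; the example vectors are illustrative only.

```python
import numpy as np

def pearson(x, y):
    # rho(x, y) = sum_i (x_i - xbar)(y_i - ybar)
    #             / sqrt( sum_i (x_i - xbar)^2 * sum_i (y_i - ybar)^2 )
    xc = x - x.mean()
    yc = y - y.mean()
    return np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2))

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])
print(pearson(x, y))   #  1.0 (perfectly positively correlated)
print(pearson(x, -y))  # -1.0 (perfectly negatively correlated)
```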
Clustering Algorithms
• Partition algorithms
  – K-means clustering
  – Mixture-model based clustering
• Hierarchical algorithms
  – Single-linkage
  – Average-linkage
  – Complete-linkage
  – Centroid-based
Hierarchical Clustering
• Bottom-up agglomerative clustering
  Start with each object in a separate cluster, then repeat:
  – Join the most similar pair of clusters
  – Update the similarity of the new cluster to the other clusters
  until there is only one cluster.
  Greedy – less accurate but simple; typically computationally expensive.
• Top-down divisive clustering
  Start with all the data in a single cluster, then repeat:
  – Split each cluster into two using a partition-based algorithm
  until each object is in a separate cluster.
  More accurate but complex; can be computationally cheaper.
Bottom-up Agglomerative Clustering
Different algorithms differ in how the similarity between two clusters is defined (and hence updated):
• Single-link – nearest neighbor: similarity between their closest members.
• Complete-link – furthest neighbor: similarity between their furthest members.
• Centroid – similarity between the centers of gravity.
• Average-link – average similarity of all cross-cluster pairs.
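To make the four criteria concrete, here is a hedged sketch (using SciPy's cdist for pairwise distances; the function name and example clusters are my own, not from the slides). The slide phrases the rules in terms of similarity; stated with distances, single-link takes the minimum over cross-cluster pairs and complete-link the maximum.

```python
import numpy as np
from scipy.spatial.distance import cdist

def linkage_distance(A, B, method="single"):
    """Distance between clusters A and B (each a 2-D array, one point per row)."""
    D = cdist(A, B)  # all pairwise Euclidean distances between the two clusters
    if method == "single":    # nearest neighbor: closest cross-cluster pair
        return D.min()
    if method == "complete":  # furthest neighbor: furthest cross-cluster pair
        return D.max()
    if method == "average":   # mean over all cross-cluster pairs
        return D.mean()
    if method == "centroid":  # distance between the centers of gravity
        return np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))
    raise ValueError(method)

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[3.0, 0.0], [5.0, 0.0]])
print(linkage_distance(A, B, "single"))    # 2.0
print(linkage_distance(A, B, "complete"))  # 5.0
```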
Single-Link Method (Euclidean distance)
Distance matrix:
      b   c   d
  a   2   5   6
  b       3   5
  c           4
(1) Merge a and b (distance 2); updated distances: d({a,b}, c) = 3, d({a,b}, d) = 5, d(c, d) = 4.
(2) Merge {a,b} and c (distance 3); updated distance: d({a,b,c}, d) = 4.
(3) Merge {a,b,c} and d (distance 4).
Complete-Link Method (Euclidean distance)
Same distance matrix:
      b   c   d
  a   2   5   6
  b       3   5
  c           4
(1) Merge a and b (distance 2); updated distances: d({a,b}, c) = 5, d({a,b}, d) = 6, d(c, d) = 4.
(2) Merge c and d (distance 4).
(3) Merge {a,b} and {c,d} (distance 6).
Dendrograms
Single-link vs. complete-link dendrograms for the example above (leaves a, b, c, d; height axis from 0 to 6): single-link merges at heights 2, 3, 4; complete-link merges at heights 2, 4, 6.
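The worked example can be checked with SciPy's hierarchical clustering routines (SciPy is not mentioned in the slides; this is just one way to reproduce the merge heights from the distance matrix above).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Condensed distance matrix for points a, b, c, d, listing pairs in the order
# (a,b), (a,c), (a,d), (b,c), (b,d), (c,d).
dists = np.array([2.0, 5.0, 6.0, 3.0, 5.0, 4.0])

for method in ("single", "complete"):
    Z = linkage(dists, method=method)
    print(method, "merge heights:", Z[:, 2])
# single   merge heights: [2. 3. 4.]  -> matches the single-link steps above
# complete merge heights: [2. 4. 6.]  -> matches the complete-link steps above
```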
Another Example (figure)
Single vs. Complete Linkage
• Single-linkage: allows anisotropic and non-convex cluster shapes; sensitive to outliers/noise.
• Complete-linkage: assumes isotropic, convex cluster shapes; robust to outliers.
Computational Complexity
• All hierarchical clustering methods need to compute the similarity of all pairs of n individual instances, which is O(n^2).
• At each iteration,
  – Sort similarities to find the largest one: O(n^2 log n).
  – Update the similarity between the merged cluster and the other clusters.
• In order to maintain an overall O(n^2) performance, computing similarity to each other cluster must be done in constant time. (Homework)
• So we get O(n^2 log n) or O(n^3).
Partitioning Algorithms
• Partitioning method: construct a partition of n objects into a set of K clusters
• Given: a set of objects and the number K
• Find: a partition into K clusters that optimizes the chosen partitioning criterion
  – Globally optimal: exhaustively enumerate all partitions
  – Effective heuristic method: K-means algorithm
K-Means Algorithm
Input – the desired number of clusters, k.
Initialize – the k cluster centers (randomly if necessary).
Iterate –
  1. Decide the class memberships of the N objects by assigning them to the nearest cluster centers.
  2. Re-estimate the k cluster centers (aka the centroids or means), assuming the memberships found above are correct.
Termination – If none of the N objects changed membership in the last iteration, exit. Otherwise go to 1.
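A minimal NumPy sketch of this loop (Lloyd's algorithm); the function name, the choice to initialize from k random data points, and the iteration cap are my own and not prescribed by the slides.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """X: (N, p) float array of objects; returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    # Initialize: pick k of the N objects as the initial cluster centers.
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = None
    for _ in range(max_iter):
        # Step 1: assign each object to its nearest cluster center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Termination: exit if no object changed membership.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 2: re-estimate each center as the mean (centroid) of its members.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels
```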
K-means Clustering: Steps 1–5 (figures). Step 1 shows the initial cluster centers and the Voronoi diagram they induce; the remaining steps alternate between reassigning points to the nearest center and re-estimating the centers.
Computational Complexity
• At each iteration,
  – Computing the distance between each of the n objects and the K cluster centers is O(Kn).
  – Computing cluster centers: each object gets added once to some cluster: O(n).
• Assume these two steps are each done once for l iterations: O(lKn).
• Is K-means guaranteed to converge? (Homework)
Seed Choice
• Results are quite sensitive to seed selection and can vary with the random seed.
• Some seeds can result in a poor convergence rate, or convergence to a sub-optimal clustering.
  – Select good seeds using a heuristic (e.g., the object least similar to any existing mean)
  – Try out multiple starting points (very important!!!)
  – Initialize with the results of another method
  – Further reading: the k-means++ algorithm of Arthur and Vassilvitskii
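A hedged sketch of the k-means++ seeding cited above; the idea is Arthur and Vassilvitskii's, but the implementation details here are my own. Each new center is a data point sampled with probability proportional to its squared distance to the nearest center chosen so far, which tends to spread the seeds out.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """k-means++ seeding: return k initial centers drawn from the rows of X."""
    rng = np.random.default_rng(seed)
    # First center: a data point chosen uniformly at random.
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # D(x)^2: squared distance from each point to its nearest chosen center.
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        # Next center: sample a point with probability proportional to D(x)^2.
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)
```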
Other Issues
• Shape of clusters
  – K-means assumes isotropic, convex clusters.
• Sensitive to outliers
  – Use K-medoids (a sketch follows below).
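A rough sketch of a simple K-medoids variant, alternating assignment and medoid update (sometimes called Voronoi iteration); this is not the classical PAM algorithm, and the details are my own. Because each representative must be an actual data point and distances are not squared, a distant outlier pulls the representative around far less than it pulls a mean.

```python
import numpy as np
from scipy.spatial.distance import cdist

def kmedoids(X, k, max_iter=100, seed=0):
    """Alternate assignment and medoid update; medoids are actual data points."""
    rng = np.random.default_rng(seed)
    medoid_idx = rng.choice(len(X), size=k, replace=False)
    D = cdist(X, X)  # all pairwise distances (other metrics could be plugged in)
    for _ in range(max_iter):
        # Assign each point to its nearest medoid.
        labels = D[:, medoid_idx].argmin(axis=1)
        # For each cluster, pick the member minimizing total distance to the rest.
        new_idx = medoid_idx.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members) > 0:
                within = D[np.ix_(members, members)].sum(axis=1)
                new_idx[j] = members[within.argmin()]
        if np.array_equal(new_idx, medoid_idx):
            break
        medoid_idx = new_idx
    labels = D[:, medoid_idx].argmin(axis=1)
    return medoid_idx, labels
```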
Other Issues
• Number of clusters K
  – Plot the objective function as K varies and look for a “knee”.
  – Can you pick K by minimizing the objective over K? (Homework)
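A hedged sketch of the “knee” heuristic using scikit-learn (the library choice and toy data are my assumptions; the slides name no library). Here `inertia_` is the K-means objective, the sum of squared distances from each point to its assigned center: it keeps decreasing as K grows, so one looks for the K where the decrease levels off rather than for the minimum.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Toy data: three well-separated Gaussian blobs in 2-D.
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in ([0, 0], [5, 0], [0, 5])])

for k in range(1, 8):
    obj = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(k, round(obj, 1))
# The objective drops sharply up to K = 3 and then flattens: the "knee".
```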