Machine Learning - MT 2016 15. Clustering Varun Kanade University of Oxford November 28, 2016
Announcements ◮ No new practical this week ◮ All practicals must be signed off in sessions this week ◮ Firm Deadline: Reports handed in at CS reception by Friday noon ◮ Revision Class for M.Sc. + D.Phil. Thu Week 9 (2pm & 3pm) ◮ Work through ML HT2016 Exam (Problem 3 is optional) 1
Outline This week, we will study some approaches to clustering ◮ Defining an objective function for clustering ◮ k -Means formulation for clustering ◮ Multidimensional Scaling ◮ Hierarchical clustering ◮ Spectral clustering 2
England pushed towards Test defeat by India France election: Socialists scramble to avoid split after Fillon win Giants Add to the Winless Browns’ Misery Strictly Come Dancing: Ed Balls leaves programme Trump Claims, With No Evidence, That ‘Millions of People’ Voted Illegally Vive ‘La Binoche’, the reigning queen of French cinema 3
Sports England pushed towards Test defeat by India Politics France election: Socialists scramble to avoid split after Fillon win Sports Giants Add to the Winless Browns’ Misery Film&TV Strictly Come Dancing: Ed Balls leaves programme Politics Trump Claims, With No Evidence, That ‘Millions of People’ Voted Illegally Film&TV Vive ‘La Binoche’, the reigning queen of French cinema 3
England England pushed towards Test defeat by India France France election: Socialists scramble to avoid split after Fillon win USA Giants Add to the Winless Browns’ Misery England Strictly Come Dancing: Ed Balls leaves programme USA Trump Claims, With No Evidence, That ‘Millions of People’ Voted Illegally France Vive ‘La Binoche’, the reigning queen of French cinema 3
Clustering

Often data can be grouped together into subsets that are coherent. However, this grouping may be subjective, and it is hard to define a single general framework.

Two types of clustering algorithms
1. Feature-based: points are represented as vectors in $\mathbb{R}^D$
2. (Dis)similarity-based: we only know pairwise (dis)similarities

Two types of clustering methods
1. Flat: partition the data into $k$ clusters
2. Hierarchical: organise the data as clusters, clusters of clusters, and so on
Defining Dissimilarity

◮ Weighted dissimilarity between (real-valued) attributes

$$d(\mathbf{x}, \mathbf{x}') = f\left( \sum_{i=1}^{D} w_i \, d_i(x_i, x'_i) \right)$$

◮ In the simplest setting $w_i = 1$, $d_i(x_i, x'_i) = (x_i - x'_i)^2$ and $f(z) = z$, which corresponds to the squared Euclidean distance
◮ Weights allow us to emphasise features differently
◮ If features are ordinal or categorical then define distance suitably
◮ Standardisation (mean 0, variance 1) may or may not help
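As a concrete sketch (my own, with an assumed function name), the weighted dissimilarity with squared per-feature differences and $f(z) = z$ reduces to the squared Euclidean distance:

```python
import numpy as np

def weighted_dissimilarity(x, x_prime, w=None, f=lambda z: z):
    """d(x, x') = f( sum_i w_i * d_i(x_i, x'_i) ) with d_i taken as the squared difference."""
    x, x_prime = np.asarray(x, dtype=float), np.asarray(x_prime, dtype=float)
    w = np.ones_like(x) if w is None else np.asarray(w, dtype=float)
    return f(np.sum(w * (x - x_prime) ** 2))

# With w_i = 1 and f(z) = z this is exactly the squared Euclidean distance
assert np.isclose(weighted_dissimilarity([1.0, 2.0], [3.0, 5.0]), 13.0)
# Larger weights emphasise the corresponding features
print(weighted_dissimilarity([1.0, 2.0], [3.0, 5.0], w=[10.0, 1.0]))   # -> 49.0
```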
Helpful Standardisation 6
Unhelpful Standardisation 7
Partition Based Clustering

Want to partition the data into subsets $C_1, \ldots, C_k$, where $k$ is fixed in advance.

Define the quality of a partition by

$$W(C) = \frac{1}{2} \sum_{j=1}^{k} \frac{1}{|C_j|} \sum_{i, i' \in C_j} d(\mathbf{x}_i, \mathbf{x}_{i'})$$

If we use $d(\mathbf{x}, \mathbf{x}') = \|\mathbf{x} - \mathbf{x}'\|^2$, then

$$W(C) = \sum_{j=1}^{k} \sum_{i \in C_j} \|\mathbf{x}_i - \boldsymbol{\mu}_j\|^2, \qquad \text{where } \boldsymbol{\mu}_j = \frac{1}{|C_j|} \sum_{i \in C_j} \mathbf{x}_i$$

The objective is minimising the sum of squared distances to the mean within each cluster.
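As a quick numerical check of this equivalence (my own illustration, with assumed function names), the pairwise form of $W(C)$ with $d(\mathbf{x}, \mathbf{x}') = \|\mathbf{x} - \mathbf{x}'\|^2$ agrees with the sum of squared distances to the cluster means:

```python
import numpy as np

def W_pairwise(X, clusters):
    """(1/2) sum_j (1/|C_j|) sum_{i,i' in C_j} d(x_i, x_i') with d(x, x') = ||x - x'||^2."""
    total = 0.0
    for C in clusters:
        Xc = X[C]
        diffs = Xc[:, None, :] - Xc[None, :, :]     # all within-cluster pairs x_i - x_i'
        total += np.sum(diffs ** 2) / (2 * len(C))
    return total

def W_centroid(X, clusters):
    """sum_j sum_{i in C_j} ||x_i - mu_j||^2 with mu_j the cluster mean."""
    return sum(np.sum((X[C] - X[C].mean(axis=0)) ** 2) for C in clusters)

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))
clusters = [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
assert np.isclose(W_pairwise(X, clusters), W_centroid(X, clusters))
```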
Outline Clustering Objective k -Means Formulation of Clustering Multidimensional Scaling Hierarchical Clustering Spectral Clustering
Partition Based Clustering: k-Means Objective

Minimise jointly over partitions $C_1, \ldots, C_k$ and means $\boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_k$

$$W(C) = \sum_{j=1}^{k} \sum_{i \in C_j} \|\mathbf{x}_i - \boldsymbol{\mu}_j\|^2$$

This problem is NP-hard even for $k = 2$ for points in $\mathbb{R}^D$.

If we fix $\boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_k$, finding a partition $(C_j)_{j=1}^{k}$ that minimises $W$ is easy:

$$C_j = \{\, i \mid \|\mathbf{x}_i - \boldsymbol{\mu}_j\| = \min_{j'} \|\mathbf{x}_i - \boldsymbol{\mu}_{j'}\| \,\}$$

If we fix the clusters $C_1, \ldots, C_k$, minimising $W$ with respect to $(\boldsymbol{\mu}_j)_{j=1}^{k}$ is easy:

$$\boldsymbol{\mu}_j = \frac{1}{|C_j|} \sum_{i \in C_j} \mathbf{x}_i$$

Iteratively run these two steps: assignment and update.
k -Means Clusters ( k = 3) Ground Truth Clusters 11
The k-Means Algorithm

1. Initialise means $\boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_k$ ``randomly''
2. Repeat until convergence:
   a. Assign each datapoint to the cluster whose mean is closest, to obtain $C_1, \ldots, C_k$:
      $$C_j = \{\, i \mid j = \operatorname{argmin}_{j'} \|\mathbf{x}_i - \boldsymbol{\mu}_{j'}\|^2 \,\}$$
   b. Update the means using the current cluster assignments:
      $$\boldsymbol{\mu}_j = \frac{1}{|C_j|} \sum_{i \in C_j} \mathbf{x}_i$$

Note 1: Ties can be broken arbitrarily
Note 2: Choosing $k$ random datapoints to be the initial $k$ means is a good idea
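Below is a minimal numpy sketch of this procedure (Lloyd's algorithm), not the exact implementation from the lecture; the function name, the `max_iter` cap, the convergence test and the empty-cluster handling are my own choices, while the initialisation at $k$ random datapoints follows Note 2.

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Lloyd's algorithm on an (N, D) data matrix X."""
    rng = np.random.default_rng(seed)
    # Note 2: initialise the means at k randomly chosen datapoints
    mu = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    assign = np.full(len(X), -1)
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the closest mean
        dists = np.sum((X[:, None, :] - mu[None, :, :]) ** 2, axis=2)   # (N, k)
        new_assign = np.argmin(dists, axis=1)                           # ties -> lowest index
        if np.array_equal(new_assign, assign):                          # partition unchanged: converged
            break
        assign = new_assign
        # Update step: each mean becomes the average of its assigned points
        for j in range(k):
            if np.any(assign == j):            # keep the old mean if a cluster becomes empty
                mu[j] = X[assign == j].mean(axis=0)
    return mu, assign

# Example usage on three synthetic Gaussian blobs
X = np.vstack([np.random.randn(50, 2) + c for c in ([0, 0], [5, 5], [0, 5])])
mu, assign = k_means(X, k=3)
```

Since only a local minimum is guaranteed, it is common to run this from several random seeds and keep the solution with the lowest value of $W(C)$.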
The k-Means Algorithm

Does the algorithm always converge? Yes: each assignment or update step can only decrease

$$W(C) = \sum_{j=1}^{k} \sum_{i \in C_j} \|\mathbf{x}_i - \boldsymbol{\mu}_j\|^2$$

and there are only finitely many partitions, so no partition can be revisited.

Convergence may be very slow in the worst case, but is typically fast on real-world instances.

The algorithm is only guaranteed to reach a local minimum, so run it multiple times with random initialisations.

Other criteria can be used: k-medoids, k-centres, etc.

Selecting the right $k$ is not easy: plot $W$ against $k$ and identify a ``kink''.
k -Means Clusters ( k = 4 ) Ground Truth Clusters 14
Choosing the number of clusters k

[Figure: MSE on test set vs K for K-means — the objective decreases as K grows and flattens after an elbow]

◮ As in the case of PCA, a larger $k$ will give a better value of the objective
◮ Choose a suitable $k$ by identifying a ``kink'' or ``elbow'' in the curve

(Source: Kevin Murphy, Chap 11)
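One way to generate this kind of curve (an assumed illustration reusing the `k_means` sketch above, not the code behind the slide's figure):

```python
import numpy as np

# Assumes X (an (N, D) array) and the k_means sketch from above are in scope.
ks = range(1, 16)
objective = []
for k in ks:
    mu, assign = k_means(X, k, seed=k)
    # W(C): total squared distance of each point to its assigned cluster mean
    objective.append(np.sum((X - mu[assign]) ** 2))
# Plot `objective` against `ks` and pick k near the "elbow", where the decrease flattens out.
```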
Outline Clustering Objective k -Means Formulation of Clustering Multidimensional Scaling Hierarchical Clustering Spectral Clustering
Multidimensional Scaling (MDS)

In certain cases, it may be easier to define (dis)similarities between objects than to embed the objects in Euclidean space.

Algorithms such as k-means require points to lie in Euclidean space.

Ideal Setting: Suppose for some $N$ points in $\mathbb{R}^D$ we are given all pairwise Euclidean distances in a matrix $\mathbf{D}$. Can we reconstruct $\mathbf{x}_1, \ldots, \mathbf{x}_N$, i.e., all of $\mathbf{X}$?
Multidimensional Scaling

Distances are preserved under translation, rotation, reflection, etc. We cannot recover $\mathbf{X}$ exactly; we can aim to determine $\mathbf{X}$ up to these transformations.

If $D_{ij}$ is the distance between points $\mathbf{x}_i$ and $\mathbf{x}_j$, then

$$D_{ij}^2 = \|\mathbf{x}_i - \mathbf{x}_j\|^2 = \mathbf{x}_i^T \mathbf{x}_i - 2\mathbf{x}_i^T \mathbf{x}_j + \mathbf{x}_j^T \mathbf{x}_j = M_{ii} - 2M_{ij} + M_{jj}$$

Here $\mathbf{M} = \mathbf{X}\mathbf{X}^T$ is the $N \times N$ matrix of dot products.

Exercise: Show that, assuming $\sum_i \mathbf{x}_i = \mathbf{0}$, $\mathbf{M}$ can be recovered from $\mathbf{D}$.
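The following numpy sketch is one way to answer the exercise (my own sketch, not the solution given in the lecture): for centred points, double-centring the matrix of squared distances recovers $\mathbf{M} = -\tfrac{1}{2}\mathbf{J}\mathbf{D}^{(2)}\mathbf{J}$ with $\mathbf{J} = \mathbf{I} - \tfrac{1}{N}\mathbf{1}\mathbf{1}^T$.

```python
import numpy as np

def gram_from_distances(D):
    """Recover M = X X^T from pairwise Euclidean distances D, assuming sum_i x_i = 0."""
    N = D.shape[0]
    J = np.eye(N) - np.ones((N, N)) / N     # centring matrix
    return -0.5 * J @ (D ** 2) @ J          # double centring of the squared distances

# Sanity check on centred random points
X = np.random.randn(6, 3)
X = X - X.mean(axis=0)
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
assert np.allclose(gram_from_distances(D), X @ X.T)
```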
Multidimensional Scaling

Consider the (full) SVD: $\mathbf{X} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T$. We can write $\mathbf{M}$ as

$$\mathbf{M} = \mathbf{X}\mathbf{X}^T = \mathbf{U}\boldsymbol{\Sigma}\boldsymbol{\Sigma}^T\mathbf{U}^T$$

Starting from $\mathbf{M}$, we can reconstruct $\tilde{\mathbf{X}}$ using the eigendecomposition of $\mathbf{M}$:

$$\mathbf{M} = \mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^T$$

Because $\mathbf{M}$ is symmetric and positive semi-definite, $\mathbf{U}^T = \mathbf{U}^{-1}$ and all entries of the (diagonal) matrix $\boldsymbol{\Lambda}$ are non-negative. Let $\tilde{\mathbf{X}} = \mathbf{U}\boldsymbol{\Lambda}^{1/2}$.

If we are satisfied with an approximate reconstruction, we can use a truncated eigendecomposition.
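Continuing from the previous sketch, $\tilde{\mathbf{X}} = \mathbf{U}\boldsymbol{\Lambda}^{1/2}$ can be computed from the eigendecomposition of $\mathbf{M}$; the clipping of tiny negative eigenvalues and the optional truncation argument `d` are my own additions for numerical convenience.

```python
import numpy as np

def classical_mds(M, d=None):
    """Embedding X_tilde = U Lambda^{1/2} from the eigendecomposition M = U Lambda U^T."""
    eigvals, U = np.linalg.eigh(M)            # eigh returns eigenvalues in ascending order
    eigvals, U = eigvals[::-1], U[:, ::-1]    # put the largest eigenvalues first
    eigvals = np.clip(eigvals, 0.0, None)     # guard against small negative values from round-off
    if d is not None:                         # truncated eigendecomposition -> approximate embedding
        eigvals, U = eigvals[:d], U[:, :d]
    return U * np.sqrt(eigvals)               # column-wise scaling, i.e. U @ diag(sqrt(Lambda))

# Recovers the points up to rotation/reflection (exactly, in this idealised setting)
X_tilde = classical_mds(gram_from_distances(D))
X_2d = classical_mds(gram_from_distances(D), d=2)   # low-dimensional approximation
```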
Multidimensional Scaling: Additional Comments

In general, if we define (dis)similarities on objects such as text documents, genetic sequences, etc., we cannot be sure that the resulting similarity matrix $\mathbf{M}$ will be positive semi-definite, or that the dissimilarity matrix $\mathbf{D}$ is a valid (squared) Euclidean distance matrix.

In such cases, we cannot always find a Euclidean embedding that recovers the (dis)similarities exactly.

Minimise a stress function instead: find $\mathbf{z}_1, \ldots, \mathbf{z}_N$ that minimise

$$S(Z) = \sum_{i \neq j} \left( D_{ij} - \|\mathbf{z}_i - \mathbf{z}_j\| \right)^2$$

Several other types of stress function can be used.
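A rough sketch of minimising this stress by plain gradient descent (my own illustration; the step size, iteration count and random initialisation are arbitrary choices, and in practice a library optimiser or the SMACOF algorithm would normally be used instead):

```python
import numpy as np

def stress(Z, D):
    """S(Z) = sum_{i != j} (D_ij - ||z_i - z_j||)^2."""
    dist = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=2)
    off = ~np.eye(len(Z), dtype=bool)
    return np.sum((D[off] - dist[off]) ** 2)

def mds_by_stress(D, d=2, lr=0.005, n_iter=5000, seed=0):
    rng = np.random.default_rng(seed)
    Z = rng.normal(scale=0.1, size=(D.shape[0], d))
    for _ in range(n_iter):
        diff = Z[:, None, :] - Z[None, :, :]                    # z_i - z_j for all pairs
        dist = np.linalg.norm(diff, axis=2)
        np.fill_diagonal(dist, 1.0)                             # avoid dividing by zero on the diagonal
        coef = (D - dist) / dist
        np.fill_diagonal(coef, 0.0)
        grad = -4.0 * np.sum(coef[:, :, None] * diff, axis=1)   # dS/dz_i for symmetric D
        Z -= lr * grad
    return Z

# D can be any symmetric dissimilarity matrix, e.g. the one from the earlier sketch
Z = mds_by_stress(D, d=2)
```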
Multidimensional Scaling: Summary ◮ In certain applications, it may be easier to define pairwise similarities or distances, rather than construct a Euclidean embedding of discrete objects, e.g., genetic data, text data, etc. ◮ Many machine learning algorithms require (or are more naturally expressed with) data in some Euclidean space ◮ Multidimensional Scaling gives a way to find an embedding of the data in Euclidean space that (approximately) respects the original distance/similarity values 20
Outline Clustering Objective k -Means Formulation of Clustering Multidimensional Scaling Hierarchical Clustering Spectral Clustering
Hierarchical Clustering

Hierarchically structured data exists all around us
◮ Measurements of different species, and of individuals within species
◮ Top-level and low-level categories in news articles
◮ Country, county, town level data

Two algorithmic strategies for clustering
◮ Agglomerative: bottom-up; clusters are formed by merging smaller clusters
◮ Divisive: top-down; clusters are formed by splitting larger clusters

Visualise this as a dendrogram or tree.
Measuring Dissimilarity at Cluster Level

To find hierarchical clusters, we need to define dissimilarity at the cluster level, not just at the datapoint level. Suppose we have a dissimilarity at the datapoint level, e.g., $d(\mathbf{x}, \mathbf{x}') = \|\mathbf{x} - \mathbf{x}'\|$.

There are different ways to define the dissimilarity between two clusters $C$ and $C'$:

◮ Single Linkage
$$D(C, C') = \min_{\mathbf{x} \in C,\, \mathbf{x}' \in C'} d(\mathbf{x}, \mathbf{x}')$$

◮ Complete Linkage
$$D(C, C') = \max_{\mathbf{x} \in C,\, \mathbf{x}' \in C'} d(\mathbf{x}, \mathbf{x}')$$

◮ Average Linkage
$$D(C, C') = \frac{1}{|C| \cdot |C'|} \sum_{\mathbf{x} \in C,\, \mathbf{x}' \in C'} d(\mathbf{x}, \mathbf{x}')$$
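A direct numpy transcription of the three rules (a sketch with my own function names; in practice one would rely on a library routine such as scipy.cluster.hierarchy.linkage):

```python
import numpy as np

def pairwise_d(C, C_prime):
    """Matrix of Euclidean distances d(x, x') for x in C and x' in C_prime."""
    return np.linalg.norm(C[:, None, :] - C_prime[None, :, :], axis=2)

def single_linkage(C, C_prime):
    return pairwise_d(C, C_prime).min()

def complete_linkage(C, C_prime):
    return pairwise_d(C, C_prime).max()

def average_linkage(C, C_prime):
    return pairwise_d(C, C_prime).mean()    # (1 / (|C| |C'|)) * sum of all pairwise distances

C = np.array([[0.0, 0.0], [1.0, 0.0]])
C_prime = np.array([[3.0, 0.0], [5.0, 0.0]])
print(single_linkage(C, C_prime), complete_linkage(C, C_prime), average_linkage(C, C_prime))
# -> 2.0 5.0 3.5
```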