Pattern Recognition 2019
Clustering, Mixture Models and EM
Ad Feelders
Universiteit Utrecht
December 13, 2019
Objective of Clustering
Put objects (persons, images, web pages, ...) into a number of groups in such a way that objects within the same group are similar, but objects in different groups are dissimilar.
[Illustrative scatter plot: Variable 1 against Variable 2.]
Similarity between objects
Each object is described by a number of variables (also called features or attributes). The similarity between objects is determined on the basis of these variables. The measurement of similarity is central to many clustering methods.
Clustering ≠ Classification
In classification the group to which an object belongs is given, and the task is to discriminate between groups on the basis of the variables used to describe the objects.
In clustering the groups are not given, but the objective is to discover them.
Clustering is sometimes called unsupervised learning, and classification supervised learning.
Clustering Techniques
Many techniques have been developed to cluster objects into groups:
Hierarchical clustering (not discussed).
Partitioning methods (e.g. K-means, K-medoids).
Model-based clustering (mixture models).
Data Matrix
We have observations on N objects that we want to cluster into a number of groups. For each object we observe D variables, numbered 1, 2, ..., D.
Data matrix:
X = \begin{pmatrix} x_{11} & \cdots & x_{1j} & \cdots & x_{1D} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nj} & \cdots & x_{nD} \\ \vdots & & \vdots & & \vdots \\ x_{N1} & \cdots & x_{Nj} & \cdots & x_{ND} \end{pmatrix}
where x_{nj} denotes the value of object n for variable j.
Distance Measures: numeric variables
[Figure: Object 1 at (x_{11}, x_{12}) and Object 2 at (x_{21}, x_{22}) in the plane; the dashed line shows the Euclidean distance, the solid line the Manhattan distance along the coordinate differences x_{21} - x_{11} and x_{22} - x_{12}.]
Distance Measures: numeric variables
Manhattan distance between x_i and x_j:
\sum_{d=1}^{D} | x_{id} - x_{jd} |
Squared Euclidean distance between x_i and x_j:
\sum_{d=1}^{D} ( x_{id} - x_{jd} )^2 = \| x_i - x_j \|^2
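Both distances are easy to compute directly, or with R's built-in dist(); a small sketch with two made-up points:

x <- c(1, 4, 2)
y <- c(3, 1, 2)
sum(abs(x - y))    # Manhattan distance: 5
sum((x - y)^2)     # squared Euclidean distance: 13
# dist() returns the unsquared distances
dist(rbind(x, y), method = "manhattan")   # 5
dist(rbind(x, y), method = "euclidean")   # sqrt(13), about 3.61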
Standardization
Units of measurement should not be important for cluster structure. Therefore variables are often standardized. For example:
s_j = \sqrt{ \frac{1}{N-1} \sum_{n=1}^{N} ( x_{nj} - \bar{x}_j )^2 }
Standardized measurement:
x^*_{nj} = \frac{ x_{nj} - \bar{x}_j }{ s_j }
x^*_j then has mean zero and standard deviation 1.
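In R this is a one-liner with scale(), shown here next to the explicit computation, using the faithful data that appears later in these slides:

f <- as.matrix(faithful)
# standardize by hand: subtract the column mean, divide by the
# column standard deviation (sd() uses the 1/(N-1) definition)
m <- colMeans(f)
s <- apply(f, 2, sd)
faith.std <- sweep(sweep(f, 2, m), 2, s, "/")
faith.sc  <- scale(f)           # equivalent built-in
range(faith.std - faith.sc)     # zero, up to rounding error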
Partitioning methods
Search directly for a division of the N objects into K groups that maximizes the quality of the clustering.
The number of distinct partitions P(N, K) of N objects into K non-empty groups is O(K^N). For example: P(100, 5) \approx 10^{68}.
Exhaustive search is not feasible.
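As an aside not on the slide, the exact count is the Stirling number of the second kind, a standard identity:
S(N, K) = \frac{1}{K!} \sum_{j=0}^{K} (-1)^j \binom{K}{j} (K - j)^N
The j = 0 term dominates, so S(N, K) \approx K^N / K!; for N = 100 and K = 5 this gives about 5^{100}/120 \approx 6.6 \times 10^{67}, i.e. on the order of 10^{68}.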
K-means Clustering
There are many possibilities to measure the quality of a partition. In the case of numeric data, one can use for example
J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \| x_n - \mu_k \|^2    (9.1)
the sum of squared Euclidean distances of each data point to the center of the cluster to which it has been assigned.
Here r_{nk} = 1 if x_n has been assigned to cluster k, and r_{nk} = 0 otherwise (1-of-K coding).
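As a quick illustration of (9.1), J can be computed in R for a given assignment; all values below are made up:

# three 2-D points, two cluster centers, one cluster index per point
X    <- rbind(c(0, 0), c(1, 0), c(5, 5))
mu   <- rbind(c(0.5, 0), c(5, 5))
clus <- c(1, 1, 2)
# J: sum of squared Euclidean distances to the assigned centers
J <- sum(rowSums((X - mu[clus, ])^2))
J   # 0.25 + 0.25 + 0 = 0.5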
Minimize J with respect to r_{nk} (E-step)
Optimize for each point n separately by choosing r_{nk} = 1 for the value of k that gives the minimum distance \| x_n - \mu_k \|^2. More formally,
r_{nk} = \begin{cases} 1 & \text{if } k = \arg\min_j \| x_n - \mu_j \|^2 \\ 0 & \text{otherwise.} \end{cases}    (9.2)
Minimize J with respect to \mu_j (M-step)
Take the derivative of J with respect to \mu_j and equate it to zero:
-2 \sum_{n=1}^{N} r_{nj} ( x_n - \mu_j ) = 0    (9.3)
which gives
\mu_j = \frac{ \sum_n r_{nj} x_n }{ \sum_n r_{nj} }    (9.4)
i.e. the mean of the points that are assigned to cluster j.
K-means algorithm
1. Partition the observations into K initial clusters.
2. Calculate the mean of each cluster (M-step).
3. Assign each observation to the cluster whose mean is nearest (E-step).
4. If reassignments have taken place, return to step 2; otherwise stop.
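A compact R implementation of these four steps (a didactic sketch with made-up names, not the library kmeans() function; empty clusters and distance ties are not handled):

kmeans.sketch <- function(X, K, max.iter = 100) {
  X <- as.matrix(X)
  N <- nrow(X)
  # step 1: random initial partition into K clusters
  clus <- sample(rep(1:K, length.out = N))
  for (iter in 1:max.iter) {
    # step 2 (M-step): mean of each cluster
    mu <- t(sapply(1:K, function(k) colMeans(X[clus == k, , drop = FALSE])))
    # step 3 (E-step): squared distance of every point to every mean,
    # then assign each point to the nearest mean
    d2 <- sapply(1:K, function(k) rowSums(sweep(X, 2, mu[k, ])^2))
    new.clus <- max.col(-d2)
    # step 4: stop if no reassignments have taken place
    if (all(new.clus == clus)) break
    clus <- new.clus
  }
  list(cluster = clus, centers = mu)
}

# usage on the standardized Old Faithful data
res <- kmeans.sketch(scale(faithful), K = 2)
table(res$cluster)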
Old Faithful data set
[Scatter plot of the Old Faithful data: eruption duration against waiting time to the next eruption.]
Old Faithful data set
[Panels (a)-(i): successive E steps and M steps of K-means with K = 2 on the standardized Old Faithful data, from the initial configuration to convergence; both axes run from -2 to 2.]
Convergence of algorithm
[Plot of the cost J (y-axis, roughly 0 to 1000) against the iteration number (x-axis, 1 to 4): J decreases monotonically and has converged after a few iterations.]
How to do this in R
# load library/package MASS
> library(MASS)
# scale data
> faith.sc <- scale(faithful)
# K-means with K=2 applied to faithful data
> faithful.k2 <- kmeans(faith.sc, 2)
# plot resulting clusters
> plot(faith.sc[,1], faith.sc[,2], xlim=c(-2,2), type="n")
> points(faith.sc[,1], faith.sc[,2], col=faithful.k2$cluster*2, pch=19)
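One caveat, not on the slide: kmeans() starts from a random partition, so different runs can end in different local minima of J. A common remedy:

# fix the seed for reproducibility and try several random starts,
# keeping the best (lowest J) solution
> set.seed(1)
> faithful.k2 <- kmeans(faith.sc, centers = 2, nstart = 10)
> faithful.k2$tot.withinss   # the value of J for the best start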
Final clustering obtained
[Scatter plot of the standardized Old Faithful data, eruptions (x-axis) against waiting (y-axis), both from about -2 to 2; the two clusters found by K-means are plotted in different colours.]
How many clusters?
The required number of groups is usually not known in advance.
Determine the appropriate number of groups from the data.
Informal approach: plot the quality criterion against the number of groups, and look for large jumps to determine the appropriate number of groups.
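A sketch of this informal procedure in R, applied to the standardized faithful data from before (tot.withinss is the criterion J reported by kmeans()):

# within sum of squares for K = 1, ..., 6
wss <- sapply(1:6, function(k)
  kmeans(faith.sc, centers = k, nstart = 10)$tot.withinss)
plot(1:6, wss, type = "b",
     xlab = "number of groups", ylab = "within sum of squares")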
Example: Ruspini data
[Scatter plot of the Ruspini data: Variable 1 (0 to 120) against Variable 2 (0 to 150).]
Determining the number of groups
[Plot of the within sum of squares (from about 80,000 down to about 20,000) against the number of groups (2 to 6).]
K-medoids
Can be used with dissimilarity measures other than Euclidean distance.
Uses a number of representative objects (called medoids) instead of means.
Advantage: less sensitive to outliers than K-means (cf. the mean and the median of a sample).
K-medoids: cluster quality
Each object is assigned to the cluster corresponding to the nearest medoid.
The K representative objects should minimize the sum of the dissimilarities of all objects to their nearest medoid, i.e.
\tilde{J} = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} V( x_n, \mu_k )    (9.6)
where V is a general dissimilarity measure and \mu_k denotes the medoid of cluster k.
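In R, K-medoids is implemented as pam() ("partitioning around medoids") in the cluster package, which also contains the Ruspini data shown earlier; a short sketch, assuming that package is installed:

> library(cluster)
> data(ruspini)
# K-medoids with K = 4; pam() accepts a data matrix or a dissimilarity object
> ruspini.pam <- pam(ruspini, k = 4)
> ruspini.pam$medoids            # the four representative objects
> table(ruspini.pam$clustering)  # cluster sizes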