CLUSTER ANALYSIS WITH K-MEANS: what about the details?
Maurice ROUX, former professor, Paul Cezanne University, Marseille, France
mrhroux@yahoo.fr
K-means: what about the details ? Introduction
• k-means type algorithms are very popular
• they are fast
• they can handle huge data sets
• they use a very simple, easily comprehensible scheme
K-means: what about the details ? Introduction: practical problems
• the quality of the results depends heavily on the initialization
• k-means requires the number of clusters to be chosen beforehand
How to deal with these issues?
K-means: what about the details ? The classical solutions
1. Initialization: repeat many random initializations and retain the solution which maximizes the "between-cluster sum of squared distances" (BSS).
2. Number K of clusters: try several values of K and retain the one which gives the best value of some chosen criterion.
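The restart scheme above is straightforward to implement. Below is a minimal sketch (not from the slides), assuming scikit-learn and NumPy are available; the function name best_kmeans and the default of 50 restarts are illustrative choices:

```python
import numpy as np
from sklearn.cluster import KMeans

def best_kmeans(X, k, n_restarts=50, seed=0):
    """Run k-means n_restarts times from random initializations and keep
    the partition maximizing BSS = TSS - WSS (equivalently, minimizing
    the within-cluster sum of squares reported by KMeans as inertia_)."""
    tss = ((X - X.mean(axis=0)) ** 2).sum()   # total sum of squares
    rng = np.random.RandomState(seed)
    best_labels, best_bss = None, -np.inf
    for _ in range(n_restarts):
        km = KMeans(n_clusters=k, init="random", n_init=1,
                    random_state=rng.randint(1 << 30)).fit(X)
        bss = tss - km.inertia_               # between-cluster sum of squares
        if bss > best_bss:
            best_bss, best_labels = bss, km.labels_
    return best_labels, best_bss / tss        # best partition and its BSS/TSS
```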
K-means: what about the details ? The details to take care of
1. Initialization: how many random initializations?
2. Number K of clusters: which criterion to evaluate the results?
K-means: what about the details ? The present study: methods
• use real data sets to put into practice the usual methods, which are mostly tested on artificial data sets
• try to solve both the selection of K and the number of random initializations in the classical batch K-means algorithm
• can the processing of a set of partitions (a "cluster ensemble") bring more information on the data set?
K-means: what about the details ? Plan of the presentation
1. Some quality indexes of a partition, illustrated with an artificial data set
2. Real life data sets
3. Discussion
K-means: what about the details ? PART 1: quality indexes and an artificial data set
Quality indexes: 12 classical formulas for evaluating the fit of a partition to a given distance or dissimilarity.
Artificial data set: a 20-point sample in 2-D by J.P. Nakache and J. Confais (2010), «Approche pragmatique de la classification», Technip, Paris (p. 197).
K-means: what about the details ? Quality indexes: parametric

Index                          Type                     Variation
BSS / TSS                      isolation/compactness    [0; 1]
Theta (Guénoche, 2003)         isolation/compactness    [0; ∞]
Davies-Bouldin (1979)          compactness/isolation    [0; ∞]
Dunn (1974)                    isolation/compactness    [0; ∞]
Hubert & Levin (1976)          compactness              [0; 1]
Silhouette (Rousseeuw, 1987)   isolation/compactness    [-1; +1]
K-means: what about the details ? Quality indexes: non-parametric

Index                          Type                     Variation
Yule (1900)                    correlation              [-1; +1]
Adjusted Rand (1985)           correlation              [0; 1]
Fowlkes & Mallows (1983)       correlation              [0; 1]
Goodman & Kruskal (1954)       correlation              [-1; +1]
Kendall's tau (1938)           correlation              [-1; +1]
Contingency Khi-2 (chi-square) correlation              [0; ∞]
K-means: what about the details ? A small 2-D example by J.P. Nakache and J. Confais (2010)
[Scatter plot of the 20 points, labelled A to T, in the plane.]
K-means: what about the details ? Small example: optimal criteria values for 50 random restarts in the K-means algorithm

Index     K-m2c   K-m3c   K-m4c   K-m5c   K-m6c   K-m7c   K-m8c   K-m9c   K-m10c  K-m11c
BSS/TSS   0.4673  0.7354  0.8603  0.9001  0.9345  0.9486  0.9575  0.9669  0.9679  0.9798
Theta     1.6828  2.288   2.7846  2.993   3.6454  3.9283  4.1094  4.3701  4.2197  5.0056
DB        1.1781  0.949   0.7766  0.8269  0.9678  0.9063  0.829   0.7653  0.9622  0.6984
Dunn      1.6287  1.7433  2.0693  1.3138  1.4258  1.2447  1.0815  1.2909  0.6573  0.8751
HL        0.8402  0.9378  0.9747  0.9682  0.9793  0.9852  0.9869  0.9903  0.9756  0.9923
Silh      0.3889  0.4765  0.5523  0.5385  0.4866  0.4992  0.5144  0.5169  0.4537  0.5848
GK        0.6471  0.8741  0.9595  0.9535  0.9684  0.9758  0.9772  0.9816  0.9586  0.9897
Tau       0.3245  0.369   0.3487  0.2944  0.2149  0.1929  0.1768  0.1608  0.1315  0.1177
Yule      0.6754  0.9485  0.9815  0.9692  0.9838  0.9887  0.9933  0.9969  0.9816  0.9955
AdRand    0.3884  0.6992  0.7962  0.7258  0.7615  0.7859  0.8246  0.8708  0.6916  0.8221
Fowlkes   0.6813  0.7895  0.8444  0.7778  0.7917  0.8095  0.8421  0.8824  0.7143  0.8333
Khi-2     3808.5  5827.9  6062.7  5120.6  3789.8  3434.8  3154.3  2883.8  2306.4  2097.5
K-means: what about the details ? Small example: optimal criteria values for 50 random restarts
Theta = Mean(D_b) / Mean(D_w)
Dunn  = Min(D_b) / Max(D_w)
[Line plot of Theta and Dunn versus the number of clusters (2c to 11c).]
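Both ratios are easy to compute from a pairwise distance matrix and a label vector. A minimal sketch (illustrative, not from the slides), assuming D is a symmetric NumPy distance matrix and every cluster contains at least two objects:

```python
import numpy as np

def theta_and_dunn(D, labels):
    """Theta = mean between-cluster / mean within-cluster distance;
    Dunn  = min between-cluster / max within-cluster distance."""
    labels = np.asarray(labels)
    iu = np.triu_indices(len(labels), k=1)    # each object pair once
    same = labels[iu[0]] == labels[iu[1]]     # True for within-cluster pairs
    d_within, d_between = D[iu][same], D[iu][~same]
    theta = d_between.mean() / d_within.mean()
    dunn = d_between.min() / d_within.max()
    return theta, dunn
```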
K-means: what about the details ? Small example: optimal criteria values for 50 random restarts
[Line plot of the Davies-Bouldin index versus the number of clusters (2c to 11c).]
K-means: what about the details ? Davies-Bouldin index (1979)
DB(k) = Max { [D_w(k) + D_w(j)] / D_b(j, k) | j ≠ k }
D_w(k) = Mean { d_ii' | i ∈ k ; i' ∈ k ; i ≠ i' }
D_b(j, k) = Mean { d_ii' | i ∈ j ; i' ∈ k }
DB = Mean { DB(k) | k ∈ K }
Type: compactness / isolation
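A direct transcription of this slide's definition into Python (a sketch under the assumption that D is a precomputed distance matrix and every cluster has at least two objects; note that the slide uses mean pairwise distances rather than the original centroid-based formulation):

```python
import numpy as np

def davies_bouldin(D, labels):
    """DB = mean over clusters k of max_{j != k} (D_w(k) + D_w(j)) / D_b(j, k),
    with D_w the mean within-cluster distance and D_b the mean between-cluster
    distance, as defined on the slide.  Lower values are better."""
    labels = np.asarray(labels)
    clusters = {k: np.where(labels == k)[0] for k in np.unique(labels)}
    # mean within-cluster distance D_w(k)
    dw = {k: D[np.ix_(idx, idx)][np.triu_indices(len(idx), 1)].mean()
          for k, idx in clusters.items()}
    db_per_cluster = []
    for k, idx_k in clusters.items():
        ratios = [(dw[k] + dw[j]) / D[np.ix_(idx_k, idx_j)].mean()
                  for j, idx_j in clusters.items() if j != k]
        db_per_cluster.append(max(ratios))
    return float(np.mean(db_per_cluster))
```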
K-means: what about the details ? Small example: best partition in 4 clusters by k-means out of 50 random restarts, DB = 0.7766
[Scatter plot of the 20 points with the 4-cluster partition.]
K-means: what about the details ? Small example: partition in 4 clusters, DB = 0.7539
[Scatter plot of the 20 points with this alternative 4-cluster partition.]
K-means: what about the details ? Small example: partition in 4 clusters, DB = 0.7514
[Scatter plot of the 20 points with this alternative 4-cluster partition.]
K-means: what about the details ? Small example: optimal criteria values for 50 random restarts
[Line plot of the Yule, adjusted Rand and Fowlkes-Mallows indexes versus the number of clusters (2c to 11c).]
K-means: what about the details ?
[Line plot of the Goodman-Kruskal index and Kendall's tau versus the number of clusters (2c to 11c).]
K-means: what about the details ? Non-parametric indexes based on quadruples of objects

                        Partition distances
Initial distances       u_ii' < u_jj'    u_ii' > u_jj'
d_ii' < d_jj'           concordant       discordant
d_ii' > d_jj'           discordant       concordant
K-means: what about the details ? Kendall's tau (1938) and Goodman-Kruskal index (1954)
S+ = number of concordant quadruples
S- = number of discordant quadruples
N  = number of object pairs
GK  = (S+ - S-) / (S+ + S-)
Tau = (S+ - S-) / (N (N - 1) / 2)
Type: correlation coefficient
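These two coefficients can be computed by brute-force enumeration of the quadruples (pairs of object pairs). A sketch follows, with the assumption, not stated explicitly on the slide, that the partition distance u_ii' is 0 when i and i' share a cluster and 1 otherwise, and that tied quadruples are ignored; the enumeration is O(n^4), which is acceptable for small examples such as the 20-point data set:

```python
from itertools import combinations

def gk_and_tau(D, labels):
    """Goodman-Kruskal and Kendall's tau from concordant (S+) and
    discordant (S-) quadruples, following the slide's formulas."""
    n = len(labels)
    pairs = list(combinations(range(n), 2))
    N = len(pairs)                      # number of object pairs
    s_plus = s_minus = 0
    for (i, ip), (j, jp) in combinations(pairs, 2):
        du = int(labels[i] != labels[ip]) - int(labels[j] != labels[jp])
        dd = D[i][ip] - D[j][jp]
        if du == 0 or dd == 0:
            continue                    # ties contribute to neither count
        if (du > 0) == (dd > 0):
            s_plus += 1
        else:
            s_minus += 1
    gk = (s_plus - s_minus) / (s_plus + s_minus)
    tau = (s_plus - s_minus) / (N * (N - 1) / 2)
    return gk, tau
```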
K-means: what about the details ? Contingency Khi-2 over quadruples
[Line plot of the Khi-2 criterion versus the number of clusters (2c to 11c).]
K-means: what about the details ? Some indexes should be discarded
A. Uniformly increasing or decreasing with the number of clusters:
• BSS/TSS
• Theta
• Goodman-Kruskal
• Hubert-Levin
B. Preference for unbalanced partitions:
• Davies-Bouldin
K-means: what about the details ? Analyzing the results by correspondence analysis
A special data table for analysing the results of K-means: the confusion or co-association table.
From a set P of partitions (with the same number of clusters), count the number of times two objects i and i' fall in the same cluster:
c_ii' = Card { p ∈ P | k_p(i) = k_p(i') }
where k_p(i) is the cluster to which object i belongs in partition p.
Submit this table C to correspondence analysis.
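Building the co-association table is simple. A minimal sketch (illustrative, not from the slides), assuming each partition is given as an integer label vector over the same n objects; the resulting matrix C can then be passed to any correspondence analysis routine:

```python
import numpy as np

def co_association(partitions):
    """c_ii' = number of partitions p in which k_p(i) = k_p(i'),
    i.e. in which objects i and i' fall in the same cluster."""
    P = np.asarray(partitions)          # shape: (number of partitions, n objects)
    n = P.shape[1]
    C = np.zeros((n, n), dtype=int)
    for labels in P:
        C += labels[:, None] == labels[None, :]   # 1 where i and i' co-occur
    return C
```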
K-means: what about the details ? Small example: 15 distinct partitions in 4 clusters after 50 restarts. Correspondence analysis of the co-association matrix
[Factorial plane F1 (57.7 %) x F2 (35 %): groups {F, J, R}, {K, M, P}, {A, C, D, L, O, Q, S} and {G, N, T}, with points B, E, H and I lying apart. Annotations: "3 or 4 clusters?" and "intermediate or anomalous objects?".]
K-means: what about the details ? Correspondence analysis suggests the validity of the 3-cluster partition and reveals the border position of points B, H and I.
[Scatter plot of the 20 points with the 3-cluster partition.]
Cluster { K, M, P, R, F, J } contains 2 sub-clusters.
K-means: what about the details ? PART 2 : real life examples • Leukemia38 • Alpes55 • Yeast237
K-means: what about the details ? Real life examples: Leukemia38
Golub T.R., Slonim D.K., Tamayo P., Huard C., Gaasenbeek M., Mesirov J.P., Coller H., Loh M.L., Downing J.R., Caligiuri M.A., Bloomfield C.D., Lander E.S. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, vol. 286, pp. 531-537. www.sciencemag.org
Handl J., Knowles J. and Kell D.B. (2005) Computational cluster validation in post-genomic data analysis. Bioinformatics, 21(15): 3201-3212.
Data table: 38 tissues x 100 genes, quantitative levels of gene expression. There are 3 groups of tissues, known a priori.
K-means: what about the details ? Leukemia38: correspondence analysis of raw data
[Factorial plane F1 (27.5 %) x F2 (23.4 %): the 38 tissues, labelled B, T and M according to their a priori group.]