Evaluating clustering: Introduction to cluster analysis and classification



  1. HAL Id: hal-01810377, https://hal.inria.fr/hal-01810377, submitted on 7 Jun 2018. To cite this version: Christophe Biernacki. Introduction to cluster analysis and classification: Evaluating clustering. Summer School on Clustering, Data Analysis and Visualization of Complex Data, May 2018, Catania, Italy. HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

  2. Introduction to cluster analysis and classification: Evaluating clustering. C. Biernacki. Summer School on Clustering, Data Analysis and Visualization of Complex Data, May 21-25 2018, University of Catania, Italy.

  3. Evaluating clustering. "Technical" evaluation: $\hat{z} = f(x, \delta\,[, \Delta, \text{kernel}, \ldots], K, \text{algo})$. "User" evaluation: a good clustering result is an end-user-useful clustering result. Both evaluation points of view always need to be combined.

  4. Outline:
     1 Data factor
     2 Dissimilarity factor (and co)
     3 Algorithm factor
     4 Number of clusters factor
     5 User factor
     6 To go further

  5. The variable effect. Medicine¹: diseases may be classified by etiology (cause), pathogenesis (mechanism by which the disease is caused), or by symptom(s). Alternatively, diseases may be classified according to the organ system involved, though this is often complicated since many diseases affect more than one organ. And so on...
     [Figure: three 3D scatter plots of the same observations on Variables 1-3, each suggesting a different cluster structure depending on the variables retained.]
     ¹ Nosologie méthodique, dans laquelle les maladies sont rangées par classes, suivant le système de Sydenham, & l'ordre des botanistes, par François Boissier de Sauvages de Lacroix. Paris, Hérissant le fils, 1771.
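The variable effect is easy to reproduce numerically. Below is a minimal sketch (Python with scikit-learn; the synthetic data and variable subsets are invented for illustration, not the slide's medical example): two groups are separated along variable 1 only, so clustering on the noise variables alone loses the structure.

```python
# Minimal sketch of the variable effect: the same observations, clustered
# on different variable subsets, yield different partitions.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n = 100
# Two groups separated along variable 1 only; variables 2 and 3 are noise.
x = np.vstack([rng.normal([0, 0, 0], 1.0, size=(n, 3)),
               rng.normal([5, 0, 0], 1.0, size=(n, 3))])
z = np.repeat([0, 1], n)                       # true labels, for checking

for cols, name in [((0,), "variable 1 only"),
                   ((1, 2), "variables 2-3 only"),
                   ((0, 1, 2), "all variables")]:
    z_hat = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(x[:, cols])
    agree = max(np.mean(z == z_hat), np.mean(z != z_hat))  # fix label switch
    print(f"{name}: agreement with true partition = {agree:.2f}")
```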

  6. Need to compare partitions: empirical error rate. Given two partitions z and ẑ, and τ ranging over all permutations of {1, ..., K}, the empirical error rate is
     $$\mathrm{err}(z, \hat{z}) = \min_{\tau} \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\{z_i \neq \tau(\hat{z}_i)\} \in \left[0, \tfrac{K-1}{K}\right].$$
     Partitions are closer when err is small. This criterion is restricted to comparing partitions with the same number of clusters.
     Example: z: G₁ = {a, b, c}, G₂ = {d, e, f}; ẑ: Ĝ₁ = {e, f}, Ĝ₂ = {a, b, c, d}; err(z, ẑ) = (1/6) min{5, 1} = 1/6.
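As a hedged illustration (Python; the function name and toy labels below are ours, not from the slides), the error rate can be computed by brute force over the K! label permutations, reproducing the example's 1/6:

```python
# A minimal sketch of the empirical error rate: minimize the mismatch rate
# over all label permutations tau (brute force, feasible only for small K).
from itertools import permutations
import numpy as np

def empirical_error_rate(z, z_hat, K):
    z, z_hat = np.asarray(z), np.asarray(z_hat)
    best = 1.0
    for tau in permutations(range(K)):
        relabeled = np.array([tau[k] for k in z_hat])  # apply tau to z_hat
        best = min(best, float(np.mean(z != relabeled)))
    return best

# Slide example: z = {a,b,c | d,e,f}, z_hat = {e,f | a,b,c,d}
z     = [0, 0, 0, 1, 1, 1]   # a b c d e f
z_hat = [1, 1, 1, 1, 0, 0]   # a,b,c,d in cluster 2; e,f in cluster 1
print(empirical_error_rate(z, z_hat, K=2))   # 1/6, about 0.1667
```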

  7. Need to compare partitions: Rand index. Given two partitions z and ẑ, a measure based on agreement vs. disagreement between object pairs, not limited to the same number of clusters in both partitions. Rand index [Rand 1971]:
     A: #pairs of elements of x in the same subset in z and in the same subset in ẑ
     B: #pairs in different subsets in z and in different subsets in ẑ
     C: #pairs in the same subset in z and in different subsets in ẑ
     D: #pairs in different subsets in z and in the same subset in ẑ
     $$\mathrm{rand}(z, \hat{z}) = \frac{A + B}{A + B + C + D} = \frac{\text{nb. agree}}{\text{nb. agree} + \text{nb. disagree}} \in [0, 1].$$
     Partitions are closer when rand is high.
     Example: z: G₁ = {a, b, c}, G₂ = {d, e, f}; ẑ: Ĝ₁ = {a, b}, Ĝ₂ = {c, d, e}, Ĝ₃ = {f}; A = 2, B = 7, C = 4, D = 2; rand(z, ẑ) = 0.6 (intermediate).
     Caution: use the adjusted Rand index [Hubert and Arabie 1985] to compare rand(z, ẑ) and rand(z, z̃) when K̂ ≠ K̃.
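The example can be checked by enumerating the object pairs directly, and against scikit-learn (a sketch; rand_score is available from scikit-learn 0.24 onward):

```python
# Brute-force A, B, C, D counts for the slide's example, then the Rand and
# adjusted Rand indices from scikit-learn for comparison.
from itertools import combinations
from sklearn.metrics import rand_score, adjusted_rand_score

z     = {'a': 1, 'b': 1, 'c': 1, 'd': 2, 'e': 2, 'f': 2}   # G1, G2
z_hat = {'a': 1, 'b': 1, 'c': 2, 'd': 2, 'e': 2, 'f': 3}   # G1^, G2^, G3^

A = B = C = D = 0
for i, j in combinations(z, 2):
    same, same_hat = z[i] == z[j], z_hat[i] == z_hat[j]
    if same and same_hat:
        A += 1
    elif not same and not same_hat:
        B += 1
    elif same and not same_hat:
        C += 1
    else:
        D += 1
print(A, B, C, D, (A + B) / (A + B + C + D))    # 2 7 4 2 0.6

objs = list('abcdef')
print(rand_score([z[o] for o in objs], [z_hat[o] for o in objs]))           # 0.6
print(adjusted_rand_score([z[o] for o in objs], [z_hat[o] for o in objs]))  # adjusted version
```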

  8. Prostate cancer data: description². 475 patients out of 506 (individuals with missing values have been discarded); 8 quantitative variables and 4 categorical (some ordinal) variables. Two "evident" clusters for medical users: Stage 3 and Stage 4 of cancer.
     [Figure: two scatter plots, the continuous data on the first two PCA axes and the categorical data on the first two MCA axes.]
     ² Byar and Green (1980)

  9. Prostate cancer data: variable detail.

  10. Prostate cancer data: partition according to retained variables.

                  quantitative      categorical (raw)   mixed quali/quanti
                  err = 9.46%       err = 47.16%        err = 8.63%
                    1      2          1      2            1      2
       Stage 3    247     26        142    131          252     21
       Stage 4     19    183        120     82           20    182

      The partition varies with the retained variables, as expected. A general principle: categorical variables are less informative than quantitative ones. Here, however, adding the categorical variables improves on the quantitative variables alone.

  11. Prostate cancer data: partition according to recoded variables.

                  categorical (raw)   categorical (MCA)
                  err = 47.16%        err = 38.95%
                    1      2            1      2
       Stage 3    142    131          175     98
       Stage 4    120     82           87    115

      MCA is equivalent to recoding the categorical variables. The raw data and the MCA data are in a one-to-one mapping (no information loss), yet the recoding can drastically impact the clustering result, as sketched below. This opens the question of which data units/coding to use. Currently, the user is left to choose the units (a prior or posterior choice); the next lesson formalizes what is needed to go further.
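As a rough, hedged sketch of this coding effect (the inverse-frequency column weighting below is only a simplified stand-in for a full MCA, and the toy categorical data are invented): rescaling the one-hot recoded dummies is a one-to-one transformation, yet it changes the metric seen by k-means and hence possibly the partition.

```python
# K-means on raw 0/1 dummies vs. on rescaled dummies (columns divided by
# the square root of the category frequency, the weighting idea behind MCA).
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
cats = rng.integers(0, 3, size=(200, 4))                  # toy categorical data
dummies = OneHotEncoder().fit_transform(cats).toarray()   # disjunctive table

col_freq = dummies.mean(axis=0)                           # category frequencies
weighted = dummies / np.sqrt(col_freq)                    # MCA-style rescaling

km = dict(n_clusters=2, n_init=10, random_state=0)
z_raw = KMeans(**km).fit_predict(dummies)
z_rec = KMeans(**km).fit_predict(weighted)
# Agreement up to label switching (2 clusters): the partitions may differ.
print(max(np.mean(z_raw == z_rec), np.mean(z_raw != z_rec)))
```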

  12. Prostate cancer data: partition according to missing data. Either use the reduced data set without the individuals having missing data (n = 475), or the completed data set where missing data are imputed³ (n = 506). In both cases, use all the mixed variables (not all details at this step; see the next lesson).

       Data set    completed data   reduced data
       err         12.8             8.1

      It is common to apply a data "pretreatment" such as missing-data imputation. Be careful: it can impact the clustering. Imputation gives only an estimated data set x̂, which is a "deteriorated" data set; as a consequence, it can lead to a "deteriorated" clustering result. See the next lesson for a formalization of this problem.
      ³ We use the mice package: http://cran.r-project.org/web/packages/mice/mice.pdf
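A hedged sketch of this pretreatment effect on synthetic data (the slide's mice package is an R tool; scikit-learn's SimpleImputer below is only a crude mean-imputation analogue):

```python
# Cluster the reduced data set (rows with missing values dropped) vs. the
# completed data set (missing entries imputed); the two pretreatments can
# lead to different clusterings of the shared individuals.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
x = np.vstack([rng.normal(0, 1, (100, 4)), rng.normal(3, 1, (100, 4))])
x_miss = np.where(rng.random(x.shape) < 0.15, np.nan, x)  # 15% MCAR holes

keep = ~np.isnan(x_miss).any(axis=1)
x_reduced = x_miss[keep]                                  # reduced data set
x_completed = SimpleImputer(strategy="mean").fit_transform(x_miss)

km = dict(n_clusters=2, n_init=10, random_state=0)
z_reduced = KMeans(**km).fit_predict(x_reduced)
z_completed = KMeans(**km).fit_predict(x_completed)
# Compare the two partitions on the complete rows (up to label switching).
z_c = z_completed[keep]
print(max(np.mean(z_reduced == z_c), np.mean(z_reduced != z_c)))
```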

  13. Stability of a clustering result. Do not forget that ẑ is just an estimate of a (hypothetical true) z; the statistical properties of this estimate should be addressed, such as its stability (variance). A simple (but computationally demanding) attempt, sketched below:
      - use bootstrap samples x^(b) (b = 1, ..., B);
      - obtain the bootstrap partitions z^(b);
      - deduce, for instance, confidence regions on the centers μ through the related centers μ^(b);
      - be careful with the permutation of the labelling!
      See the next lesson for more on the statistical properties (formalization needed)...
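A minimal sketch of this bootstrap procedure (Python; all names and the synthetic data are illustrative), using the Hungarian algorithm to realign each replicate's labels with a reference fit before summarizing the variability of the centers:

```python
# Bootstrap the data, refit k-means, permute each replicate's centers to
# match a reference fit (handles label switching), then report the spread
# of the centers over replicates as a simple stability measure.
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def bootstrap_centers(x, K, B=100, seed=0):
    rng = np.random.default_rng(seed)
    ref = KMeans(n_clusters=K, n_init=10, random_state=seed).fit(x)
    centers = []
    for _ in range(B):
        xb = x[rng.integers(0, len(x), size=len(x))]       # bootstrap sample
        km = KMeans(n_clusters=K, n_init=10, random_state=seed).fit(xb)
        _, perm = linear_sum_assignment(
            cdist(ref.cluster_centers_, km.cluster_centers_))
        centers.append(km.cluster_centers_[perm])           # realigned centers
    return ref.cluster_centers_, np.array(centers)          # (K, d), (B, K, d)

rng = np.random.default_rng(0)
x = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])
ref, boot = bootstrap_centers(x, K=2)
print("bootstrap std of each center coordinate:\n", boot.std(axis=0))
```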

  14. Outline:
      1 Data factor
      2 Dissimilarity factor (and co)
      3 Algorithm factor
      4 Number of clusters factor
      5 User factor
      6 To go further

  15. Effect of the metric M (1/5).
      $$\mathcal{X} = \mathbb{R}^2, \qquad M = \begin{pmatrix} a & 0 \\ 0 & 1 \end{pmatrix}, \qquad x_1 = \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \quad x_2 = \begin{pmatrix} 3 \\ 0 \end{pmatrix}, \quad x_3 = \begin{pmatrix} 0 \\ 1 \end{pmatrix}$$
      $$\delta_M(x_1, x_2)^2 = (x_1 - x_2)' \begin{pmatrix} a & 0 \\ 0 & 1 \end{pmatrix} (x_1 - x_2) = a\,(x_{21} - x_{11})^2 = 9a$$
      $$\delta_M(x_1, x_3)^2 = (x_1 - x_3)' \begin{pmatrix} a & 0 \\ 0 & 1 \end{pmatrix} (x_1 - x_3) = (x_{32} - x_{12})^2 = 1$$

  16. Effect of the metric M (2/5).
      $$\delta_M(x_1, x_2)^2 \leq \delta_M(x_1, x_3)^2 \iff a \leq \tfrac{1}{9}$$
      The distance is impacted by the metric, so the clustering can be too. In a sense, the metric is also related to variable selection (try a = 0...).
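The computation is easy to verify numerically (a small sketch; the values of a are arbitrary test points around the 1/9 threshold):

```python
# delta_M(x, y)^2 = (x - y)' M (x - y) for M = diag(a, 1): variable 1 is
# down-weighted as a shrinks, and a = 0 amounts to dropping it entirely.
import numpy as np

def sq_dist(x, y, M):
    d = x - y
    return float(d @ M @ d)

x1, x2, x3 = np.array([0., 0.]), np.array([3., 0.]), np.array([0., 1.])
for a in [1.0, 1 / 9, 0.01, 0.0]:
    M = np.diag([a, 1.0])
    print(f"a = {a:<6.3g} delta(x1,x2)^2 = {sq_dist(x1, x2, M):<6.3g} "
          f"delta(x1,x3)^2 = {sq_dist(x1, x3, M):.3g}")
# a <= 1/9 makes x2 at least as close to x1 as x3 is.
```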
