Clustering with Mixed Type Variables and Determination of Cluster Numbers



  1. Clustering with Mixed Type Variables and Determination of Cluster Numbers
     Hana Řezanková, Dušan Húsek, Tomáš Löster
     University of Economics, Prague; ICS, Academy of Sciences of the Czech Republic
     COMPSTAT 2010

  2. Outline
     - Motivation
     - Methods for clustering with mixed type variables
     - Implementation in software packages
     - Proposal of new criteria for cluster evaluation
     - Application
     - Conclusion

  3. Motivation
     - Task: we are looking for groups of similar objects (e.g. respondents), i.e. we concentrate on the problem of object clustering
     - The objects are characterized by both quantitative and qualitative (nominal) variables (e.g. respondent opinions, numbers of actions)
     - The number of clusters is unknown in advance, i.e. an appropriate number of clusters must also be determined

  4. Methods for clustering with mixed type variables
     - Using a specialized dissimilarity measure (Gower's coefficient, cluster variability based) and applying agglomerative hierarchical cluster analysis (AHCA)
     - Clustering objects separately with quantitative and qualitative variables and combining the results by the cluster-based similarity partitioning algorithm (CSPA)
     - Latent class models
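
As an illustration of the first approach, Gower's coefficient can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the authors' implementation: all variables get equal weight, there are no missing values, quantitative variables contribute a range-normalized absolute difference, and nominal variables contribute 0 (match) or 1 (mismatch). The function name `gower_dissimilarity` and the toy data are hypothetical.

```python
def gower_dissimilarity(a, b, kinds, ranges):
    """Gower dissimilarity between two objects with mixed-type variables.

    kinds[l]  : "num" for a quantitative variable, "cat" for a nominal one
    ranges[l] : observed range (max - min) of a quantitative variable;
                ignored for nominal variables
    Assumes equal weights and no missing values.
    """
    total = 0.0
    for x, y, kind, rng in zip(a, b, kinds, ranges):
        if kind == "num":
            # Range-normalized absolute difference, in [0, 1]
            total += abs(x - y) / rng if rng else 0.0
        else:
            # Simple matching: 0 if the categories agree, 1 otherwise
            total += 0.0 if x == y else 1.0
    return total / len(kinds)


# Hypothetical respondents: (number of actions, opinion, group)
kinds = ["num", "cat", "cat"]
ranges = [10.0, None, None]
d = gower_dissimilarity((2.0, "yes", "A"), (7.0, "no", "A"), kinds, ranges)
print(d)
```

The resulting dissimilarity matrix could then be passed to any agglomerative hierarchical clustering routine that accepts precomputed dissimilarities.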

  5. Implementation in software packages
     - Specialized dissimilarity measures: not implemented for AHCA
     - Clustering objects with qualitative variables: implemented only rarely (disagreement coefficient)
     - Cluster-based similarity partitioning algorithm: not implemented, but it could be realized
     - LC Cluster models (Latent GOLD)
     - Log-likelihood distance measure between clusters: implemented in two-step cluster analysis (SPSS)
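
To illustrate how the cluster-based similarity partitioning step could be realized, the sketch below builds the co-association (similarity) matrix that CSPA starts from: the share of partitions in which two objects fall into the same cluster. It is a sketch only. The final step of CSPA, partitioning this similarity matrix again, is not shown, and the function name `co_association` and the label vectors are hypothetical.

```python
def co_association(partitions):
    """Fraction of partitions in which objects i and j share a cluster.

    partitions: list of label sequences, one per clustering,
                all over the same n objects.
    Returns an n x n matrix (list of lists) with entries in [0, 1].
    """
    n = len(partitions[0])
    r = len(partitions)
    sim = [[0.0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    sim[i][j] += 1.0 / r
    return sim


# Hypothetical results of two separate clusterings of five objects:
quant_labels = [0, 0, 1, 1, 1]            # from the quantitative variables
qual_labels = ["a", "a", "a", "b", "b"]   # from the qualitative variables
S = co_association([quant_labels, qual_labels])
print(S[0][1], S[1][2], S[0][3])
```

Objects that are grouped together by both clusterings get similarity 1, objects grouped together by only one of them get 0.5, and objects never grouped together get 0.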

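  6. Implementation in software packages
     - Log-likelihood distance measure between clusters: implemented in two-step cluster analysis (SPSS)
       $$D(h, h') = \xi_{\langle h,h' \rangle} - \xi_h - \xi_{h'}$$
       $$\xi_g = n_g \left( \sum_{l=1}^{m^{(1)}} \frac{1}{2} \ln(s_l^2 + s_{gl}^2) + \sum_{l=1}^{m^{(2)}} H_{gl} \right)$$
       $$H_{gl} = -\sum_{u=1}^{K_l} \frac{n_{glu}}{n_g} \ln \frac{n_{glu}}{n_g} \quad \text{(entropy)}$$
       where $m^{(1)}$ and $m^{(2)}$ are the numbers of quantitative and qualitative variables, $K_l$ is the number of categories of the $l$-th qualitative variable, and $n_{glu}$ is the number of objects in cluster $g$ with the $u$-th category of that variable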
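
The log-likelihood distance can be made concrete with a small sketch. It reads the slide's formula as D(h, h') = ξ⟨h,h'⟩ − ξ_h − ξ_h' with ξ_g = n_g(Σ ½ ln(s_l² + s_gl²) + Σ H_gl), i.e. with ξ_g treated as a positive variability. That sign convention, the use of the population variance, and all function names and data below are assumptions of this sketch, not details confirmed by the slides or by SPSS.

```python
import math
from collections import Counter


def variance(values):
    """Population variance (divides by n); an assumption of this sketch."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def entropy(values):
    """H = -sum_u p_u * ln(p_u) over the observed categories."""
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in Counter(values).values())


def xi(rows, num_cols, cat_cols, overall_var):
    """Variability of one cluster:
    n_g * (sum_l 0.5 * ln(s_l^2 + s_gl^2) + sum_l H_gl)."""
    n_g = len(rows)
    num_part = sum(
        0.5 * math.log(overall_var[l] + variance([r[l] for r in rows]))
        for l in num_cols
    )
    cat_part = sum(entropy([r[l] for r in rows]) for l in cat_cols)
    return n_g * (num_part + cat_part)


def distance(rows_h, rows_h2, num_cols, cat_cols, overall_var):
    """D(h, h') = xi(merged cluster) - xi(h) - xi(h')."""
    args = (num_cols, cat_cols, overall_var)
    return xi(rows_h + rows_h2, *args) - xi(rows_h, *args) - xi(rows_h2, *args)


# Hypothetical toy data: column 0 is quantitative, column 1 is nominal.
cluster_a = [(1.0, "x"), (2.0, "x"), (3.0, "x")]
cluster_b = [(8.0, "y"), (9.0, "y")]
overall_var = {0: variance([r[0] for r in cluster_a + cluster_b])}

d_ab = distance(cluster_a, cluster_b, [0], [1], overall_var)
print(round(d_ab, 3))
```

The distance is the increase in variability caused by merging the two clusters, so well-separated clusters yield a larger value than clusters with similar values and categories.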

  7. Implementation in software packages
     - Log-likelihood distance measure between objects: implemented in two-step cluster analysis (SPSS)
       $$D(h, h') = \xi_{\langle h,h' \rangle} - \xi_h - \xi_{h'}$$
       $$\xi_g = n_g \left( \sum_{l=1}^{m^{(1)}} \frac{1}{2} \ln(s_l^2 + s_{gl}^2) + \sum_{l=1}^{m^{(2)}} H_{gl} \right)$$
       $$D(x_i, x_j) = \xi_{\langle x_i, x_j \rangle}$$

  8. Evaluation criteria implemented in software packages
     - BIC (Bayesian Information Criterion) and AIC (Akaike Information Criterion): implemented in two-step cluster analysis (SPSS)
       $$I_{BIC} = 2 \sum_{g=1}^{k} \xi_g + w_k \ln(n) \quad \text{(minimum)}$$
       $$w_k = k \left( 2 m^{(1)} + \sum_{l=1}^{m^{(2)}} (K_l - 1) \right)$$
       $$I_{AIC} = 2 \sum_{g=1}^{k} \xi_g + 2 w_k$$
     - Only for initial estimation of the number of clusters
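
A short sketch of the two criteria follows, under the assumption that the total within-cluster variability Σ ξ_g is supplied as a positive number, so that both criteria take the form "2 × variability + penalty". The function names and the example values (one quantitative variable, three qualitative variables with 2, 3 and 2 categories, n = 50) are hypothetical.

```python
import math


def n_params(k, m_quant, category_counts):
    """w_k = k * (2 * m1 + sum_l (K_l - 1)).

    m_quant         : number of quantitative variables (m1)
    category_counts : K_l for each qualitative variable
    """
    return k * (2 * m_quant + sum(K - 1 for K in category_counts))


def bic(total_variability, k, m_quant, category_counts, n):
    """I_BIC = 2 * sum_g xi_g + w_k * ln(n)."""
    return 2.0 * total_variability + n_params(k, m_quant, category_counts) * math.log(n)


def aic(total_variability, k, m_quant, category_counts):
    """I_AIC = 2 * sum_g xi_g + 2 * w_k."""
    return 2.0 * total_variability + 2.0 * n_params(k, m_quant, category_counts)


# Hypothetical setting, not taken from the slides.
cats = [2, 3, 2]
print(n_params(3, 1, cats))
print(bic(100.0, 2, 1, cats, 50))
print(aic(100.0, 2, 1, cats))
```

Because the penalty grows linearly in k while the variability term decreases with k, the criterion reaches a minimum at some intermediate number of clusters.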

  9. Proposed evaluation criteria
     - Within-cluster variability for k clusters:
       $$\xi(k) = \sum_{g=1}^{k} \xi_g = \sum_{g=1}^{k} n_g \left( \sum_{l=1}^{m^{(1)}} \frac{1}{2} \ln(s_l^2 + s_{gl}^2) + \sum_{l=1}^{m^{(2)}} H_{gl} \right)$$
     - Variability of the whole data set:
       $$\xi(1) = n \left( \sum_{l=1}^{m^{(1)}} \frac{1}{2} \ln(2 s_l^2) + \sum_{l=1}^{m^{(2)}} H_l \right)$$

  10. Proposed evaluation criteria
     - Within-cluster variability for k clusters:
       $$\xi(k) = \sum_{g=1}^{k} \xi_g = \sum_{g=1}^{k} n_g \left( \sum_{l=1}^{m^{(1)}} \frac{1}{2} \ln(s_l^2 + s_{gl}^2) + \sum_{l=1}^{m^{(2)}} H_{gl} \right)$$
     - Variability difference:
       $$\xi_{diff}(k) = \xi(k-1) - \xi(k)$$
       It should be maximal for the suitable number of clusters.

  11. Evaluation criteria modified for qualitative variables
     1. Uncertainty index (R-square (RSQ) index)
        $$I_U(k) = \frac{V_B}{V_T} = \frac{V_T - V_W}{V_T} = \frac{\xi(1) - \xi(k)}{\xi(1)}$$
     2. Semipartial uncertainty index (optimal number of clusters: minimum)
        $$I_{SPU}(k) = I_U(k+1) - I_U(k)$$

  12. Evaluation criteria modified for qualitative variables
     3. Pseudo F index (Calinski and Harabasz): PSF (SAS), CHF (SYSTAT)
        $$I_{CHFU}(k) = \frac{V_B / (k-1)}{V_W / (n-k)} = \frac{(n-k)\,(\xi(1) - \xi(k))}{(k-1)\,\xi(k)}$$
     4. Pseudo T-squared statistic: PST2 (SAS), PTS (SYSTAT)
        $$I_{PTSU}(k) = \frac{\xi_{\langle h,h' \rangle} - \xi_h - \xi_{h'}}{(\xi_h + \xi_{h'}) / (n_h + n_{h'} - 2)}$$

  13. Evaluation criteria modified for qualitative variables (SYSTAT)

  14. Evaluation criteria modified for qualitative variables
     5. Modified Davies and Bouldin (DB) index
        $$I_{DB}(k) = \frac{1}{k} \sum_{h=1}^{k} \max_{h' \ne h} \frac{s_{D,h} + s_{D,h'}}{D_{h,h'}}$$
        $$I_{DBU}(k) = \frac{1}{k} \sum_{h=1}^{k} \max_{h' \ne h} \frac{\xi_h + \xi_{h'}}{\xi_{\langle h,h' \rangle}}$$

  15. Evaluation criteria modified for qualitative variables
     6. Dunn's index
        $$I_D(k) = \min_{1 \le h \le k} \; \min_{1 \le h' \le k,\, h' \ne h} \left\{ \frac{D_{h,h'}}{\max_{1 \le g \le k} diam_g} \right\}$$
        $$D_{h,h'} = \min_{x_i \in C_h,\, x_j \in C_{h'}} D(x_i, x_j)$$
        $$diam_g = \max_{x_i, x_j \in C_g} D(x_i, x_j)$$
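
Dunn's index is easy to compute once a distance between objects is available. The sketch below takes the distance as a function argument; the slides use the log-likelihood distance between objects, whereas the example uses a plain absolute difference on numbers as a stand-in. The function name `dunn_index` and the data are hypothetical, and the distance is assumed to be symmetric.

```python
from itertools import combinations


def dunn_index(clusters, dist):
    """Smallest between-cluster distance divided by the largest diameter.

    clusters : list of clusters, each a list of objects
    dist     : symmetric distance function between two objects
    """
    # Largest within-cluster distance over all clusters (max diam_g)
    max_diam = max(
        (dist(a, b) for c in clusters for a, b in combinations(c, 2)),
        default=0.0,
    )
    if max_diam == 0:
        raise ValueError("all cluster diameters are zero; index undefined")
    # Smallest distance between objects of two different clusters
    min_sep = min(
        dist(a, b)
        for ch, ch2 in combinations(clusters, 2)
        for a in ch
        for b in ch2
    )
    return min_sep / max_diam


# Stand-in distance and hypothetical one-dimensional data.
clusters = [[1.0, 2.0, 3.0], [10.0, 11.0]]
result = dunn_index(clusters, lambda a, b: abs(a - b))
print(result)
```

Since the denominator does not depend on h and h', the double minimum reduces to the smallest between-cluster distance divided by the largest diameter; higher values indicate compact, well-separated clusters.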

  16. Modified evaluation criteria
     - Cluster variability based on the variance and Gini's coefficient of mutability
       $$G_g = n_g \left( \sum_{l=1}^{m^{(1)}} \frac{1}{2} \ln(s_l^2 + s_{gl}^2) + \sum_{l=1}^{m^{(2)}} G_{gl} \right)$$
       $$G_{gl} = 1 - \sum_{u=1}^{K_l} \left( \frac{n_{glu}}{n_g} \right)^2 \quad \text{(Gini's coefficient of mutability)}$$
       $$I_{BGC} = 2 \sum_{g=1}^{k} G_g + w_k \ln(n), \qquad G(k) = \sum_{g=1}^{k} G_g$$
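
Gini's coefficient of mutability for one qualitative variable within one cluster can be sketched as follows; it plays the role that the entropy plays in the entropy-based variability. The function name `gini_mutability` and the example category lists are hypothetical.

```python
from collections import Counter


def gini_mutability(values):
    """G = 1 - sum_u (n_u / n)^2 over the observed categories."""
    n = len(values)
    return 1.0 - sum((c / n) ** 2 for c in Counter(values).values())


# A cluster in which all objects share one category has zero mutability;
# two equally frequent categories give 1 - 1/2 = 0.5.
pure = gini_mutability(["a", "a", "a"])
balanced = gini_mutability(["a", "a", "b", "b"])
print(pure, balanced)
```

Unlike the entropy, this coefficient is bounded above by 1 − 1/K for a variable with K categories, so qualitative variables with many categories contribute less extreme values to the cluster variability.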

  17. Evaluation criteria modified for qualitative variables
     1. Tau index (RSQ index)
        $$I_\tau(k) = \frac{V_B}{V_T} = \frac{V_T - V_W}{V_T} = \frac{G(1) - G(k)}{G(1)}$$
     2. Semipartial tau index (optimal number of clusters: minimum)
        $$I_{SP\tau}(k) = I_\tau(k+1) - I_\tau(k)$$

  18. Application to a real data file
     - Data from a questionnaire survey (for the participants of the chemistry seminar)
     - 7 qualitative and 1 quantitative (count) variables
     - Two-step cluster analysis for clustering of respondents (experiments for the numbers of clusters from 2 to 4)
     - LC Cluster model (experiments for the numbers of clusters from 2 to 6); the quantitative variable was recoded to 5 categories

  19. Application to a real data file
     Criteria based on the entropy (TSCA in SPSS)

     Measure                       Number of clusters
                                   1        2        3        4
     Within-cluster variability    273.92   241.17   206.39   186.51
     Variability difference        -        32.75    34.78    19.88
     I_U                           0        0.12     0.25     0.32
     I_SPU                         0.12     0.13     0.07     -
     I_CHFU                        0        6.52     7.69     7.19
     I_BIC                         590.85   568.41   541.88   545.15
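
The derived rows of this table can be recomputed from the within-cluster variability row alone, which is a useful check on how the criteria are defined. The sketch below assumes the forms diff(k) = ξ(k−1) − ξ(k), I_U(k) = (ξ(1) − ξ(k))/ξ(1), I_SPU(k) = I_U(k+1) − I_U(k) and I_CHFU(k) = (n−k)(ξ(1) − ξ(k))/((k−1)ξ(k)). The sample size n = 50 is not stated in the slides; it is an assumption inferred from the reported I_CHFU values.

```python
# Within-cluster variability from the table (entropy-based, TSCA in SPSS)
xi = {1: 273.92, 2: 241.17, 3: 206.39, 4: 186.51}

# ASSUMPTION: n is not given in the slides; 50 reproduces the I_CHFU row.
n = 50


def diff(k):
    """Variability difference between k-1 and k clusters."""
    return xi[k - 1] - xi[k]


def i_u(k):
    """Uncertainty index: share of variability explained by k clusters."""
    return (xi[1] - xi[k]) / xi[1]


def i_spu(k):
    """Semipartial uncertainty index."""
    return i_u(k + 1) - i_u(k)


def i_chfu(k):
    """Pseudo F (Calinski-Harabasz type) index based on the variability."""
    return (n - k) * (xi[1] - xi[k]) / ((k - 1) * xi[k])


for k in (2, 3, 4):
    print(k, round(diff(k), 2), round(i_u(k), 2), round(i_chfu(k), 2))
```

In this table both the variability difference and I_CHFU are largest for three clusters.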

  20. Application to a real data file
     Criteria based on Gini's coefficient (TSCA in SPSS)

     Measure                       Number of clusters
                                   1        2        3        4
     Within-cluster variability    185.41   162.57   137.83   127.86
     Variability difference        -        22.84    24.74    9.97
     I_τ                           0        0.12     0.26     0.31
     I_SPτ                         0.12     0.13     0.05     -
     I_CHFτ                        0        6.74     8.11     6.90
     I_BGC                         413.85   411.20   404.75   427.84
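
One way to read I_BIC and I_BGC together is to subtract twice the within-cluster variability from each and look at what remains. If both criteria have the form "2 × variability + w_k ln(n)" with w_k proportional to k, the remainder per cluster should be the same constant in both tables. The sketch below performs that check on the reported values; the assumed form of the criteria and the helper name `penalty_per_cluster` are assumptions of this sketch, and the entropy-based values are copied from the previous slide.

```python
# Reported values for k = 1..4 clusters
bic = {1: 590.85, 2: 568.41, 3: 541.88, 4: 545.15}   # entropy-based
xi = {1: 273.92, 2: 241.17, 3: 206.39, 4: 186.51}
bgc = {1: 413.85, 2: 411.20, 3: 404.75, 4: 427.84}   # Gini-based
gini = {1: 185.41, 2: 162.57, 3: 137.83, 4: 127.86}


def penalty_per_cluster(criterion, variability):
    """(criterion - 2 * variability) / k, i.e. the penalty term per cluster."""
    return {k: (criterion[k] - 2.0 * variability[k]) / k for k in criterion}


pen_entropy = penalty_per_cluster(bic, xi)
pen_gini = penalty_per_cluster(bgc, gini)

# Number of clusters at which each criterion is minimal
best_bic = min(bic, key=bic.get)
best_bgc = min(bgc, key=bgc.get)

print({k: round(v, 2) for k, v in pen_entropy.items()})
print({k: round(v, 2) for k, v in pen_gini.items()})
print(best_bic, best_bgc)
```

The per-cluster remainder is nearly constant across k and across the two tables, up to rounding of the reported values, which is consistent with a shared penalty term, and both criteria are minimal for the same number of clusters.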

  21. Application to a real data file
     Comparison of BIC

     Method                        Number of clusters
                                   1         2         3         4
     Two-step CA                   590.85    568.41    541.88    545.15
     LC Cluster Model              1397.01   1059.24   1036.90   1019.18
