Clustering with Mixed Type Variables and Determination of Cluster Numbers
Hana Řezanková, Dušan Húsek, Tomáš Löster
University of Economics, Prague; ICS, Academy of Sciences of the Czech Republic
COMPSTAT 2010

Outline
- Motivation
- Methods for clustering with mixed type variables
- Implementation in software packages
- Proposal of new criteria for cluster evaluation
- Application
- Conclusion

Motivation
- Task: we are looking for groups of similar objects (e.g. respondents), i.e. we concentrate on the problem of object clustering.
- The objects are characterized by both quantitative and qualitative (nominal) variables (e.g. respondent opinions, numbers of actions).
- The number of clusters is unknown in advance, i.e. we also have to determine an appropriate number of clusters.

Methods for clustering with mixed type variables
- Using a specialized dissimilarity measure (Gower's coefficient, cluster variability based) and applying agglomerative hierarchical cluster analysis (AHCA)
- Clustering objects separately with quantitative and qualitative variables and combining the results by the cluster-based similarity partitioning algorithm (CSPA)
- Latent class models

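The first approach can be illustrated with a minimal sketch of Gower's dissimilarity for mixed-type objects. The data, the variable names and the function `gower_dissimilarity` are hypothetical and only serve the illustration; the sketch assumes simple mismatch for nominal variables and range scaling for quantitative ones, with equal weights.

```python
def gower_dissimilarity(a, b, is_nominal, ranges):
    """Gower dissimilarity between two objects described by mixed-type variables."""
    total = 0.0
    for x, y, nominal, value_range in zip(a, b, is_nominal, ranges):
        if nominal:
            # Nominal variable: simple mismatch (0 if equal, 1 otherwise).
            total += 0.0 if x == y else 1.0
        elif value_range:
            # Quantitative variable: absolute difference scaled by the range.
            total += abs(x - y) / value_range
    return total / len(a)


# Hypothetical respondents: (opinion, region, number of actions).
objects = [("yes", "north", 2), ("yes", "south", 4), ("no", "south", 10)]
is_nominal = [True, True, False]
ranges = [None, None, 10 - 2]  # a range is only needed for the quantitative variable

# Full pairwise dissimilarity matrix, usable as input to an AHCA routine.
matrix = [[gower_dissimilarity(a, b, is_nominal, ranges) for b in objects]
          for a in objects]
```

The resulting matrix is symmetric with zeros on the diagonal, so any agglomerative procedure that accepts a precomputed dissimilarity matrix could take it as input.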
Implementation in software packages
- Specialized dissimilarity measures: not implemented for AHCA
- Clustering objects with qualitative variables: implemented only rarely (disagreement coefficient)
- Cluster-based similarity partitioning algorithm: not implemented, but it could be realized
- LC Cluster models (Latent GOLD)
- Log-likelihood distance measure between clusters: implemented in two-step cluster analysis (SPSS)

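Implementation in software packages
Log-likelihood distance measure between clusters, implemented in two-step cluster analysis (SPSS):
$$D_{h,h'} = \xi_{\langle h,h' \rangle} - (\xi_h + \xi_{h'})$$
$$\xi_g = n_g \left( \sum_{l=1}^{m^{(1)}} \frac{1}{2} \ln\left(s_l^2 + s_{gl}^2\right) + \sum_{l=1}^{m^{(2)}} H_{gl} \right)$$
Entropy:
$$H_{gl} = -\sum_{u=1}^{K_l} \frac{n_{glu}}{n_g} \ln \frac{n_{glu}}{n_g}$$
Here $m^{(1)}$ and $m^{(2)}$ are the numbers of quantitative and qualitative variables, $n_g$ is the size of cluster $g$, $s_l^2$ and $s_{gl}^2$ are the variances of the $l$-th quantitative variable in the whole data set and in cluster $g$, $K_l$ is the number of categories of the $l$-th qualitative variable, and $n_{glu}$ is the number of objects in cluster $g$ with the $u$-th category of that variable.
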
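A minimal sketch of this distance in plain Python follows. It assumes the cluster variability (called `xi` here) is taken with a positive sign, so that merging two clusters never decreases it, and it assumes population variances; the function names and the tiny data set are hypothetical.

```python
import math
from collections import Counter
from statistics import pvariance


def entropy(values):
    """Entropy of one nominal variable within a cluster."""
    n = len(values)
    return -sum(c / n * math.log(c / n) for c in Counter(values).values())


def xi(cluster, data, quant_cols, nominal_cols):
    """Cluster variability: n_g * (sum of 0.5*ln(s_l^2 + s_gl^2) + sum of entropies)."""
    quant = sum(
        0.5 * math.log(pvariance([row[l] for row in data])
                       + pvariance([row[l] for row in cluster]))
        for l in quant_cols
    )
    qual = sum(entropy([row[l] for row in cluster]) for l in nominal_cols)
    return len(cluster) * (quant + qual)


def loglik_distance(c1, c2, data, quant_cols, nominal_cols):
    """Increase in variability caused by merging clusters c1 and c2."""
    merged = xi(c1 + c2, data, quant_cols, nominal_cols)
    return merged - (xi(c1, data, quant_cols, nominal_cols)
                     + xi(c2, data, quant_cols, nominal_cols))


# Hypothetical data: (nominal answer, quantitative count).
data = [("a", 1.0), ("a", 2.0), ("b", 8.0), ("b", 9.0)]
distance = loglik_distance(data[:2], data[2:], data, quant_cols=[1], nominal_cols=[0])
```

The overall variance term keeps the logarithm defined even for a cluster with a single object, whose own variance is zero.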
Implementation in software packages
Log-likelihood distance measure between objects, implemented in two-step cluster analysis (SPSS):
$$D_{h,h'} = \xi_{\langle h,h' \rangle} - (\xi_h + \xi_{h'})$$
$$\xi_g = n_g \left( \sum_{l=1}^{m^{(1)}} \frac{1}{2} \ln\left(s_l^2 + s_{gl}^2\right) + \sum_{l=1}^{m^{(2)}} H_{gl} \right)$$
For a pair of objects $x_i$, $x_j$:
$$D(x_i, x_j) = \xi_{\langle x_i, x_j \rangle}$$

Evaluation criteria implemented in software packages
BIC (Bayesian Information Criterion) and AIC (Akaike Information Criterion), implemented in two-step cluster analysis (SPSS):
$$I_{BIC} = 2 \sum_{g=1}^{k} \xi_g + w_k \ln(n)$$
The suitable number of clusters corresponds to the minimum of $I_{BIC}$.
$$w_k = k \left( 2 m^{(1)} + \sum_{l=1}^{m^{(2)}} (K_l - 1) \right)$$
$$I_{AIC} = 2 \sum_{g=1}^{k} \xi_g + 2 w_k$$
Used only for the initial estimation of the number of clusters.

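The two criteria can be sketched as small functions. This is an illustration under stated assumptions, not the SPSS implementation: the cluster variabilities are assumed to be positive-signed (so the fit term enters with a plus sign), and the function names and numeric inputs are hypothetical.

```python
import math


def n_params(k, m_quant, categories):
    """w_k = k * (2 * m1 + sum of (K_l - 1) over the qualitative variables)."""
    return k * (2 * m_quant + sum(K - 1 for K in categories))


def bic(cluster_variabilities, n, m_quant, categories):
    """BIC-type criterion: 2 * total variability + w_k * ln(n)."""
    k = len(cluster_variabilities)
    return 2 * sum(cluster_variabilities) + n_params(k, m_quant, categories) * math.log(n)


def aic(cluster_variabilities, m_quant, categories):
    """AIC-type criterion: 2 * total variability + 2 * w_k."""
    k = len(cluster_variabilities)
    return 2 * sum(cluster_variabilities) + 2 * n_params(k, m_quant, categories)


# Hypothetical solution with two clusters, one quantitative variable and
# two qualitative variables with 2 and 3 categories, n = 20 objects.
example_bic = bic([10.0, 5.0], n=20, m_quant=1, categories=[2, 3])
example_aic = aic([10.0, 5.0], m_quant=1, categories=[2, 3])
```

Both criteria share the same fit term and differ only in how strongly the number of parameters is penalized.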
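Proposed evaluation criteria
Within-cluster variability for k clusters:
$$\xi(k) = \sum_{g=1}^{k} \xi_g = \sum_{g=1}^{k} n_g \left( \sum_{l=1}^{m^{(1)}} \frac{1}{2} \ln\left(s_l^2 + s_{gl}^2\right) + \sum_{l=1}^{m^{(2)}} H_{gl} \right)$$
Variability of the whole data set:
$$\xi(1) = n \left( \sum_{l=1}^{m^{(1)}} \frac{1}{2} \ln\left(2 s_l^2\right) + \sum_{l=1}^{m^{(2)}} H_l \right)$$
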
Proposed evaluation criteria
Within-cluster variability for k clusters:
$$\xi(k) = \sum_{g=1}^{k} \xi_g = \sum_{g=1}^{k} n_g \left( \sum_{l=1}^{m^{(1)}} \frac{1}{2} \ln\left(s_l^2 + s_{gl}^2\right) + \sum_{l=1}^{m^{(2)}} H_{gl} \right)$$
Difference:
$$\xi_{diff}(k) = \xi(k-1) - \xi(k)$$
It should be maximal for the suitable number of clusters.

Evaluation criteria modified for qualitative variables
1. Uncertainty index (R-square (RSQ) index)
$$I_U(k) = \frac{V_B}{V_T} = \frac{V_T - V_W}{V_T} = \frac{\xi(1) - \xi(k)}{\xi(1)}$$
2. Semipartial uncertainty index (optimal number of clusters: minimum)
$$I_{SPU}(k) = I_U(k+1) - I_U(k)$$

Evaluation criteria modified for qualitative variables
3. Pseudo F (Calinski and Harabasz) index: PSF (SAS), CHF (SYSTAT)
$$I_{CHFU}(k) = \frac{V_B / (k-1)}{V_W / (n-k)} = \frac{(n-k)\,(\xi(1) - \xi(k))}{(k-1)\,\xi(k)}$$
4. Pseudo T-squared statistic: PST2 (SAS), PTS (SYSTAT)
$$I_{PTSU}(k) = \frac{\xi_{\langle h,h' \rangle} - (\xi_h + \xi_{h'})}{\dfrac{\xi_h + \xi_{h'}}{n_h + n_{h'} - 2}}$$

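As a worked check, the uncertainty, semipartial uncertainty and pseudo F indices can be computed from the entropy-based within-cluster variabilities reported in the application part of this talk. The sample size `n = 50` is an assumption: it is not stated in the slides and was chosen here because it reproduces the reported pseudo F values. The function names are hypothetical.

```python
# Entropy-based within-cluster variabilities for k = 1..4 (from the application slides).
variability = {1: 273.92, 2: 241.17, 3: 206.39, 4: 186.51}
n = 50  # ASSUMPTION: sample size is not given in the slides


def uncertainty_index(k):
    """Share of the total variability explained by k clusters."""
    return (variability[1] - variability[k]) / variability[1]


def semipartial_uncertainty_index(k):
    """Change of the uncertainty index between k and k + 1 clusters."""
    return uncertainty_index(k + 1) - uncertainty_index(k)


def chf_index(k):
    """Pseudo F index for k >= 2 clusters."""
    between = variability[1] - variability[k]
    return (n - k) * between / ((k - 1) * variability[k])


for k in (2, 3, 4):
    print(k, round(uncertainty_index(k), 2), round(chf_index(k), 2))
```

With these inputs the rounded values agree with the reported rows for the uncertainty index and the pseudo F index, which supports the reconstruction of the formulas but does not confirm the assumed sample size.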
Evaluation criteria modified for qualitative variables
[Figure: SYSTAT output]

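Evaluation criteria modified for qualitative variables
5. Modified Davies and Bouldin (DB) index
$$I_{DB}(k) = \frac{1}{k} \sum_{h=1}^{k} \max_{h' \neq h} \frac{s_{D,h} + s_{D,h'}}{D_{h,h'}}$$
The modified index $I_{DBU}(k)$ has the same form, a mean over the clusters $h = 1, \dots, k$ of a maximum over $h' \neq h$, with the dispersions and the distance replaced by variability-based quantities for the clusters $h$, $h'$ and the merged cluster $\langle h,h' \rangle$.
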
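The standard form of the Davies and Bouldin index can be sketched as follows. The sketch takes precomputed cluster dispersions and between-cluster distances as inputs; the numbers are hypothetical, and the variability-based modification is not implemented here.

```python
def davies_bouldin(dispersion, distance):
    """Mean over clusters of the worst (largest) dispersion-to-distance ratio."""
    k = len(dispersion)
    return sum(
        max((dispersion[h] + dispersion[j]) / distance[h][j]
            for j in range(k) if j != h)
        for h in range(k)
    ) / k


# Hypothetical dispersions and symmetric between-cluster distances for 3 clusters.
dispersion = [1.0, 2.0, 1.0]
distance = [[0.0, 4.0, 8.0],
            [4.0, 0.0, 6.0],
            [8.0, 6.0, 0.0]]
db_value = davies_bouldin(dispersion, distance)
```

Lower values indicate compact clusters that are far apart, so the index is usually read by looking for its minimum over k.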
Evaluation criteria modified for qualitative variables
6. Dunn's index
$$I_D(k) = \min_{1 \le h \le k} \; \min_{1 \le h' \le k,\, h' \neq h} \frac{D_{h,h'}}{\max_{1 \le g \le k} diam_g}$$
$$D_{h,h'} = \min_{x_i \in C_h,\; x_j \in C_{h'}} D(x_i, x_j)$$
$$diam_g = \max_{x_i, x_j \in C_g} D(x_i, x_j)$$

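Dunn's index as defined above can be sketched for an arbitrary object-level distance function. The function name, the one-dimensional data and the absolute-difference distance are hypothetical; any distance between objects, including a log-likelihood based one, could be passed in instead.

```python
from itertools import combinations


def dunn_index(clusters, dist):
    """Smallest between-cluster distance divided by the largest cluster diameter."""
    min_between = min(
        dist(x, y)
        for c1, c2 in combinations(clusters, 2)
        for x in c1 for y in c2
    )
    max_diameter = max(
        dist(x, y)
        for c in clusters
        for x in c for y in c
    )
    return min_between / max_diameter


# Hypothetical one-dimensional clusters with absolute difference as the distance.
clusters = [[0.0, 1.0], [5.0, 6.0], [10.0]]
dunn_value = dunn_index(clusters, lambda x, y: abs(x - y))
```

The sketch divides by the largest diameter, so it is undefined when every cluster contains a single object or only identical objects.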
Modified evaluation criteria
Cluster variability based on the variance and Gini's coefficient of mutability:
$$G_g = n_g \left( \sum_{l=1}^{m^{(1)}} \frac{1}{2} \ln\left(s_l^2 + s_{gl}^2\right) + \sum_{l=1}^{m^{(2)}} G_{gl} \right)$$
Gini's coefficient of mutability:
$$G_{gl} = 1 - \sum_{u=1}^{K_l} \left( \frac{n_{glu}}{n_g} \right)^2$$
$$I_{BGC} = 2 \sum_{g=1}^{k} G_g + w_k \ln(n), \qquad G(k) = \sum_{g=1}^{k} G_g$$

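Gini's coefficient of mutability for one nominal variable within a cluster can be computed in a few lines. The function name and the example values are hypothetical.

```python
from collections import Counter


def gini_mutability(values):
    """1 minus the sum of squared relative category frequencies."""
    n = len(values)
    return 1.0 - sum((count / n) ** 2 for count in Counter(values).values())


uniform_two = gini_mutability(["a", "a", "b", "b"])   # two equally frequent categories
single = gini_mutability(["a", "a", "a"])             # one category only
uniform_three = gini_mutability(["a", "b", "c"])      # three equally frequent categories
```

Like the entropy, the coefficient is zero when all objects share one category and grows as the categories become more evenly represented, but it is bounded above by 1.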
Evaluation criteria modified for qualitative variables
1. Tau index (RSQ index)
$$I_{\tau}(k) = \frac{V_B}{V_T} = \frac{V_T - V_W}{V_T} = \frac{G(1) - G(k)}{G(1)}$$
2. Semipartial tau index (optimal number of clusters: minimum)
$$I_{SP\tau}(k) = I_{\tau}(k+1) - I_{\tau}(k)$$

Application to a real data file
- Data from a questionnaire survey (for the participants of the chemistry seminar)
- 7 qualitative and 1 quantitative (count) variables
- Two-step cluster analysis for clustering of respondents (experiments for the numbers of clusters from 2 to 4)
- LC Cluster model (experiments for the numbers of clusters from 2 to 6); the quantitative variable was recoded to 5 categories

Application to a real data file
Criteria based on the entropy (TSCA in SPSS)

Measure                       Number of clusters
                              1         2         3         4
Within-cluster variability    273.92    241.17    206.39    186.51
Variability difference        -         32.75     34.78     19.88
I_U                           0         0.12      0.25      0.32
I_SPU                         0.12      0.13      0.07      -
I_CHFU                        0         6.52      7.69      7.19
I_BIC                         590.85    568.41    541.88    545.15

Application to a real data file
Criteria based on Gini's coefficient (TSCA in SPSS)

Measure                       Number of clusters
                              1         2         3         4
Within-cluster variability    185.41    162.57    137.83    127.86
Variability difference        -         22.84     24.74     9.97
I_τ                           0         0.12      0.26      0.31
I_SPτ                         0.12      0.13      0.05      -
I_CHFτ                        0         6.74      8.11      6.90
I_BGC                         413.85    411.20    404.75    427.84

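The reported values can be read with two simple rules: the largest variability difference, and the smallest value of the BIC-type criterion. The sketch below applies both rules to the entropy-based and Gini-based results; it assumes that the Gini-based criterion I_BGC is read like BIC (smaller is better), and the function names are hypothetical.

```python
ks = [1, 2, 3, 4]
entropy_variability = [273.92, 241.17, 206.39, 186.51]
gini_variability = [185.41, 162.57, 137.83, 127.86]
bic_values = [590.85, 568.41, 541.88, 545.15]
bgc_values = [413.85, 411.20, 404.75, 427.84]


def best_k_by_difference(values):
    """Number of clusters k with the largest drop in variability from k - 1 to k."""
    diffs = {ks[i]: values[i - 1] - values[i] for i in range(1, len(ks))}
    return max(diffs, key=diffs.get)


def best_k_by_minimum(values):
    """Number of clusters with the smallest criterion value."""
    return ks[values.index(min(values))]


suggested = {
    "entropy difference": best_k_by_difference(entropy_variability),
    "Gini difference": best_k_by_difference(gini_variability),
    "BIC": best_k_by_minimum(bic_values),
    "BGC": best_k_by_minimum(bgc_values),
}
```

For these data all four readings point to the same number of clusters.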
Application to a real data file
Comparison of BIC

Method              Number of clusters
                    1          2          3          4
Two-step CA         590.85     568.41     541.88     545.15
LC Cluster Model    1397.01    1059.24    1036.90    1019.18