Clustering with Mixed Type Variables and Determination of Cluster Numbers

Hana Řezanková, Dušan Húsek, Tomáš Löster
University of Economics, Prague
ICS, Academy of Sciences of the Czech Republic
COMPSTAT 2010

Outline
- Motivation
- Methods for clustering with mixed type variables
- Implementation in software packages
- Proposal of new criteria for cluster evaluation
- Application
- Conclusion

Motivation
- Task: we are looking for groups of similar objects (e.g. respondents), i.e. we concentrate on the problem of object clustering.
- The objects are characterized by both quantitative and qualitative (nominal) variables (e.g. respondent opinions, numbers of actions).
- The number of clusters is unknown in advance, i.e. we also have to determine an appropriate number of clusters.

Methods for clustering with mixed type variables
- Using a specialized dissimilarity measure (Gower's coefficient, or a measure based on cluster variability) and applying agglomerative hierarchical cluster analysis (AHCA)
- Clustering the objects separately on the quantitative and on the qualitative variables and combining the results by the cluster-based similarity partitioning algorithm (CSPA)
- Latent class models
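To make the first approach concrete, the following is a minimal sketch of Gower's dissimilarity for mixed data: range-scaled absolute differences for quantitative variables and simple mismatch for nominal ones, averaged over all variables. The function name `gower_dissimilarity` and the input layout are hypothetical choices for illustration, not something taken from the presentation, and equal variable weights are assumed.

```python
def gower_dissimilarity(quant, qual):
    """Pairwise Gower dissimilarity matrix for mixed-type data.

    quant: list of rows with quantitative values (one row per object)
    qual:  list of rows with nominal category labels (one row per object)
    All variables get equal weight (an assumption of this sketch).
    """
    n = len(quant)
    m1 = len(quant[0]) if n else 0
    m2 = len(qual[0]) if n else 0

    # Range of each quantitative variable; a constant column contributes 0.
    ranges = []
    for l in range(m1):
        column = [row[l] for row in quant]
        r = max(column) - min(column)
        ranges.append(r if r > 0 else 1.0)

    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Quantitative part: |x_il - x_jl| / range_l
            total = sum(abs(quant[i][l] - quant[j][l]) / ranges[l] for l in range(m1))
            # Nominal part: 1 for a mismatch, 0 for a match
            total += sum(1.0 for l in range(m2) if qual[i][l] != qual[j][l])
            dist[i][j] = dist[j][i] = total / (m1 + m2)
    return dist
```

The resulting matrix can be passed to any agglomerative hierarchical procedure that accepts precomputed dissimilarities.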

Implementation in software packages
- Specialized dissimilarity measures: not implemented for AHCA
- Clustering objects with qualitative variables: implemented only rarely (disagreement coefficient)
- Cluster-based similarity partitioning algorithm: not implemented, but it could be realized
- LC Cluster models: Latent GOLD
- Log-likelihood distance measure between clusters: implemented in two-step cluster analysis (SPSS)

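Implementation in software packages
- Log-likelihood distance measure between clusters: implemented in two-step cluster analysis (SPSS)

\[ D_{hh'} = \xi_{\langle h,h' \rangle} - (\xi_h + \xi_{h'}) \]

\[ \xi_g = n_g \left( \sum_{l=1}^{m^{(1)}} \frac{1}{2} \ln\left(s_l^2 + s_{gl}^2\right) + \sum_{l=1}^{m^{(2)}} H_{gl} \right) \]

\[ H_{gl} = -\sum_{u=1}^{K_l} \frac{n_{glu}}{n_g} \ln \frac{n_{glu}}{n_g} \quad \text{(entropy)} \]

Here n_g is the size of cluster g, m^(1) and m^(2) are the numbers of quantitative and qualitative variables, s_l^2 and s_gl^2 are the variances of the l-th quantitative variable in the whole data set and in cluster g, K_l is the number of categories of the l-th qualitative variable, and n_glu is the number of objects in cluster g with the u-th category of that variable.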
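The cluster variability and the resulting distance can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the SPSS implementation: the helper names (`variance`, `entropy`, `xi`, `log_likelihood_distance`) are hypothetical, the variance uses the divisor n, and the sign convention takes the variability of a cluster as a positive quantity, so that merging two clusters increases it.

```python
import math
from collections import Counter

def variance(values):
    """Variance with divisor n (the choice of estimator is an assumption)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n

def entropy(categories):
    """H_gl = -sum_u (n_glu / n_g) * ln(n_glu / n_g)."""
    n = len(categories)
    return -sum((c / n) * math.log(c / n) for c in Counter(categories).values())

def xi(quant_rows, qual_rows, overall_var):
    """xi_g = n_g * (sum_l 0.5 * ln(s_l^2 + s_gl^2) + sum_l H_gl).

    overall_var holds s_l^2 for the whole data set and must be positive.
    """
    n_g = len(quant_rows)
    total = 0.0
    for l, s2 in enumerate(overall_var):
        s2_g = variance([row[l] for row in quant_rows])
        total += 0.5 * math.log(s2 + s2_g)
    for l in range(len(qual_rows[0])):
        total += entropy([row[l] for row in qual_rows])
    return n_g * total

def log_likelihood_distance(c1, c2, overall_var):
    """D_hh' = xi_<h,h'> - (xi_h + xi_h'); a cluster is (quant_rows, qual_rows)."""
    merged = (c1[0] + c2[0], c1[1] + c2[1])
    return xi(*merged, overall_var) - (xi(*c1, overall_var) + xi(*c2, overall_var))
```

For example, with one quantitative variable taking the values 1, 2, 3, 4 and one nominal variable a, a, b, b, the clusters `([[1.0], [2.0]], [["a"], ["a"]])` and `([[3.0], [4.0]], [["b"], ["b"]])` can be compared by calling `log_likelihood_distance(c1, c2, [variance([1.0, 2.0, 3.0, 4.0])])`.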

Implementation in software packages
- Log-likelihood distance measure between objects: implemented in two-step cluster analysis (SPSS)

\[ D_{hh'} = \xi_{\langle h,h' \rangle} - (\xi_h + \xi_{h'}) \]

\[ \xi_g = n_g \left( \sum_{l=1}^{m^{(1)}} \frac{1}{2} \ln\left(s_l^2 + s_{gl}^2\right) + \sum_{l=1}^{m^{(2)}} H_{gl} \right) \]

\[ D(\mathbf{x}_i, \mathbf{x}_j) = \xi_{\langle \mathbf{x}_i, \mathbf{x}_j \rangle} \]

Evaluation criteria implemented in software packages
- BIC (Bayesian Information Criterion) and AIC (Akaike Information Criterion): implemented in two-step cluster analysis (SPSS)

\[ I_{BIC} = 2 \sum_{g=1}^{k} \xi_g + w_k \ln(n) \quad \text{(the minimum BIC is sought)} \]

\[ w_k = k \left( 2 m^{(1)} + \sum_{l=1}^{m^{(2)}} (K_l - 1) \right) \]

\[ I_{AIC} = 2 \sum_{g=1}^{k} \xi_g + 2 w_k \]

- Used only for the initial estimation of the number of clusters
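A small sketch of how these two criteria could be evaluated once the cluster variabilities are known. The function names and the numbers in the usage note are hypothetical; the cluster variabilities are taken as positive quantities, in line with the within-cluster variabilities reported in the application.

```python
import math

def n_parameters(k, m1, category_counts):
    """w_k = k * (2 * m1 + sum_l (K_l - 1)).

    m1: number of quantitative variables
    category_counts: K_l for each qualitative variable
    """
    return k * (2 * m1 + sum(K - 1 for K in category_counts))

def bic(xi_values, w_k, n):
    """I_BIC = 2 * sum_g xi_g + w_k * ln(n); smaller is better."""
    return 2 * sum(xi_values) + w_k * math.log(n)

def aic(xi_values, w_k):
    """I_AIC = 2 * sum_g xi_g + 2 * w_k; smaller is better."""
    return 2 * sum(xi_values) + 2 * w_k
```

With made-up inputs, `n_parameters(2, 1, [3, 2])` gives the penalty weight for two clusters, one quantitative variable and two nominal variables with 3 and 2 categories; `bic([10.0, 5.0], 10, 100)` and `aic([10.0, 5.0], 10)` then differ only in the penalty term.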

Proposed evaluation criteria
- Within-cluster variability for k clusters:

\[ \xi(k) = \sum_{g=1}^{k} \xi_g = \sum_{g=1}^{k} n_g \left( \sum_{l=1}^{m^{(1)}} \frac{1}{2} \ln\left(s_l^2 + s_{gl}^2\right) + \sum_{l=1}^{m^{(2)}} H_{gl} \right) \]

- Variability of the whole data set:

\[ \xi(1) = n \left( \sum_{l=1}^{m^{(1)}} \frac{1}{2} \ln\left(2 s_l^2\right) + \sum_{l=1}^{m^{(2)}} H_l \right) \]

Proposed evaluation criteria
- Within-cluster variability for k clusters:

\[ \xi(k) = \sum_{g=1}^{k} \xi_g = \sum_{g=1}^{k} n_g \left( \sum_{l=1}^{m^{(1)}} \frac{1}{2} \ln\left(s_l^2 + s_{gl}^2\right) + \sum_{l=1}^{m^{(2)}} H_{gl} \right) \]

- Variability difference:

\[ \mathrm{diff}(k) = \xi(k-1) - \xi(k) \]

  It should be maximal for the suitable number of clusters.

Evaluation criteria modified for qualitative variables
1. Uncertainty index (R-square (RSQ) index)

\[ I_U(k) = \frac{V_B}{V_T} = \frac{V_T - V_W}{V_T} = \frac{\xi(1) - \xi(k)}{\xi(1)} \]

2. Semipartial uncertainty index (optimal number of clusters: minimum)

\[ I_{SPU}(k) = I_U(k+1) - I_U(k) \]

Evaluation criteria modified for qualitative variables
3. Pseudo F (Calinski and Harabasz) index: PSF (SAS), CHF (SYSTAT)

\[ I_{CHFU}(k) = \frac{V_B / (k-1)}{V_W / (n-k)} = \frac{(n-k)\,(\xi(1) - \xi(k))}{(k-1)\,\xi(k)} \]

4. Pseudo T-squared statistic: PST2 (SAS), PTS (SYSTAT)

\[ I_{PTSU}(k) = \frac{\xi_{\langle h,h' \rangle} - (\xi_h + \xi_{h'})}{(\xi_h + \xi_{h'}) / (n_h + n_{h'} - 2)} \]
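Both indices reduce to simple arithmetic on the variabilities, as the following sketch suggests. The function names are hypothetical. In the usage note, the variabilities are the entropy-based values reported in the application part of the presentation, while n = 50 is an assumption that is not stated in the slides; it is used here only because it reproduces the reported pseudo F values.

```python
def chf_index(var_total, var_within, n, k):
    """Pseudo F index: ((n - k) * (xi(1) - xi(k))) / ((k - 1) * xi(k)).

    var_total is xi(1), var_within is xi(k); defined for k >= 2.
    """
    if k < 2:
        raise ValueError("the pseudo F index needs at least two clusters")
    return (n - k) * (var_total - var_within) / ((k - 1) * var_within)

def pts_index(xi_h, xi_h2, xi_merged, n_h, n_h2):
    """Pseudo T-squared statistic for merging clusters h and h'.

    Numerator: increase in variability caused by the merge.
    Denominator: pooled variability of the two clusters per degree of freedom.
    """
    increase = xi_merged - (xi_h + xi_h2)
    return increase / ((xi_h + xi_h2) / (n_h + n_h2 - 2))
```

Under the assumed n = 50, `chf_index(273.92, 241.17, 50, 2)`, `chf_index(273.92, 206.39, 50, 3)` and `chf_index(273.92, 186.51, 50, 4)` round to 6.52, 7.69 and 7.19.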

Evaluation criteria modified for qualitative variables
[Figure: SYSTAT; image not reproduced]

Evaluation criteria modified for qualitative variables
5. Modified Davies and Bouldin (DB) index

\[ I_{DB}(k) = \frac{1}{k} \sum_{h=1}^{k} \max_{h' \ne h} \frac{s_{D,h} + s_{D,h'}}{D_{hh'}} \]

\[ I_{DBU}(k) = \frac{1}{k} \sum_{h=1}^{k} \max_{h' \ne h} \frac{\xi_h + \xi_{h'}}{\xi_{\langle h,h' \rangle} - (\xi_h + \xi_{h'})} \]

Evaluation criteria modified for qualitative variables
6. Dunn's index

\[ I_D(k) = \min_{1 \le h \le k} \; \min_{1 \le h' \le k,\; h' \ne h} \frac{D_{hh'}}{\max_{1 \le g \le k} \mathrm{diam}_g} \]

\[ D_{hh'} = \min_{\mathbf{x}_i \in C_h,\; \mathbf{x}_j \in C_{h'}} D(\mathbf{x}_i, \mathbf{x}_j) \]

\[ \mathrm{diam}_g = \max_{\mathbf{x}_i, \mathbf{x}_j \in C_g} D(\mathbf{x}_i, \mathbf{x}_j) \]
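Given any matrix of pairwise object distances, Dunn's index follows directly from the two definitions above. The sketch below assumes a precomputed distance matrix and a list of cluster labels; the name `dunn_index` is hypothetical, and the distance used in the usage note is a plain absolute difference rather than the log-likelihood distance.

```python
def dunn_index(dist, labels):
    """Dunn's index: smallest between-cluster distance / largest cluster diameter.

    dist:   square matrix (list of lists) of pairwise object distances
    labels: cluster label of each object; at least two clusters are required,
            and at least one cluster must contain two distinct objects.
    """
    clusters = sorted(set(labels))
    members = {c: [i for i, lab in enumerate(labels) if lab == c] for c in clusters}

    # diam_g: largest distance between two objects of the same cluster
    max_diam = max(
        max((dist[i][j] for i in idx for j in idx), default=0.0)
        for idx in members.values()
    )

    # D_hh': smallest distance between objects of two different clusters
    min_separation = min(
        dist[i][j]
        for a in clusters
        for b in clusters
        if a != b
        for i in members[a]
        for j in members[b]
    )
    return min_separation / max_diam
```

For four points on a line at 0, 1, 10 and 11 split into the clusters {0, 1} and {10, 11}, the diameters are 1 and the closest pair across clusters is 9 apart, so larger values indicate compact, well separated clusters.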

Modified evaluation criteria
- Cluster variability based on the variance and Gini's coefficient of mutability

\[ G_g = n_g \left( \sum_{l=1}^{m^{(1)}} \frac{1}{2} \ln\left(s_l^2 + s_{gl}^2\right) + \sum_{l=1}^{m^{(2)}} G_{gl} \right) \]

\[ G_{gl} = 1 - \sum_{u=1}^{K_l} \left( \frac{n_{glu}}{n_g} \right)^2 \quad \text{(Gini's coefficient of mutability)} \]

\[ I_{BGC} = 2 \sum_{g=1}^{k} G_g + w_k \ln(n), \qquad G(k) = \sum_{g=1}^{k} G_g \]
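The Gini-based variant only swaps the entropy term for the coefficient of mutability. A minimal sketch, with hypothetical function names and with the variances passed in precomputed:

```python
import math
from collections import Counter

def gini_mutability(categories):
    """G_gl = 1 - sum_u (n_glu / n_g)^2 for one nominal variable in one cluster."""
    n = len(categories)
    return 1.0 - sum((c / n) ** 2 for c in Counter(categories).values())

def gini_cluster_variability(n_g, overall_var, cluster_var, qual_columns):
    """G_g = n_g * (sum_l 0.5 * ln(s_l^2 + s_gl^2) + sum_l G_gl).

    overall_var:  s_l^2 for each quantitative variable (whole data set)
    cluster_var:  s_gl^2 for each quantitative variable (within the cluster)
    qual_columns: one list of category labels per qualitative variable
    """
    total = sum(0.5 * math.log(s2 + s2_g) for s2, s2_g in zip(overall_var, cluster_var))
    total += sum(gini_mutability(column) for column in qual_columns)
    return n_g * total
```

The coefficient is 0 when all objects of the cluster share one category and approaches 1 - 1/K_l when the K_l categories are equally frequent, e.g. `gini_mutability(list("aabb"))` for two balanced categories.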

Evaluation criteria modified for qualitative variables
1. Tau index (RSQ index)

\[ I_{\tau}(k) = \frac{V_B}{V_T} = \frac{V_T - V_W}{V_T} = \frac{G(1) - G(k)}{G(1)} \]

2. Semipartial tau index (optimal number of clusters: minimum)

\[ I_{SP\tau}(k) = I_{\tau}(k+1) - I_{\tau}(k) \]

Application to a real data file
- Data from a questionnaire survey (for the participants of the chemistry seminar)
- 7 qualitative variables and 1 quantitative (count) variable
- Two-step cluster analysis for clustering of respondents (experiments with 2 to 4 clusters)
- LC Cluster model (experiments with 2 to 6 clusters); the quantitative variable was recoded into 5 categories

Application to a real data file
Criteria based on the entropy (TSCA in SPSS)

                              Number of clusters
Measure                       1        2        3        4
Within-cluster variability    273.92   241.17   206.39   186.51
Variability difference        -        32.75    34.78    19.88
I_U                           0        0.12     0.25     0.32
I_SPU                         0.12     0.13     0.07     -
I_CHFU                        0        6.52     7.69     7.19
I_BIC                         590.85   568.41   541.88   545.15
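Several rows of this table follow from the within-cluster variabilities alone, which makes them easy to check. The sketch below recomputes the variability difference, the uncertainty index and the semipartial uncertainty index from the reported variabilities; the function names are hypothetical.

```python
# Entropy-based within-cluster variabilities reported for 1 to 4 clusters.
within = {1: 273.92, 2: 241.17, 3: 206.39, 4: 186.51}

def variability_difference(within, k):
    """diff(k) = variability for k - 1 clusters minus variability for k clusters."""
    return within[k - 1] - within[k]

def uncertainty_index(within, k):
    """I_U(k) = (variability(1) - variability(k)) / variability(1)."""
    return (within[1] - within[k]) / within[1]

def semipartial_uncertainty(within, k):
    """I_SPU(k) = I_U(k + 1) - I_U(k)."""
    return uncertainty_index(within, k + 1) - uncertainty_index(within, k)

for k in sorted(within):
    diff = f"{variability_difference(within, k):.2f}" if k > 1 else "-"
    spu = f"{semipartial_uncertainty(within, k):.2f}" if k + 1 in within else "-"
    print(k, diff, f"{uncertainty_index(within, k):.2f}", spu)

# Number of clusters with the largest variability difference.
best_k = max((k for k in within if k > 1), key=lambda k: variability_difference(within, k))
```

The recomputed values agree with the table after rounding to two decimals, and the variability difference is largest for three clusters, the same solution at which the reported I_BIC is smallest.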

Application to a real data file
Criteria based on Gini's coefficient (TSCA in SPSS)

                              Number of clusters
Measure                       1        2        3        4
Within-cluster variability    185.41   162.57   137.83   127.86
Variability difference        -        22.84    24.74    9.97
I_τ                           0        0.12     0.26     0.31
I_SPτ                         0.12     0.13     0.05     -
I_CHFτ                        0        6.74     8.11     6.90
I_BGC                         413.85   411.20   404.75   427.84

Application to a real data file
Comparison of BIC

                      Number of clusters
Method                1         2         3         4
Two-step CA           590.85    568.41    541.88    545.15
LC Cluster Model      1397.01   1059.24   1036.90   1019.18
