Model-based clustering with mixed/missing data using the new software MixtComp
https://modal-research.lille.inria.fr/BigStat/
Christophe Biernacki (with Thibault Deregnaucourt and Vincent Kubicki)
CMStatistics 2015 (ERCIM 2015), London (UK), 12-14 December 2015
Outline
1 The problem
2 Conditional independent clustering
3 Estimation
4 Clustering with MixtComp
5 Imputation with MixtComp
6 Conclusion
Clustering of complex data

Data: n individuals x = (x_1, ..., x_n) = (x^O, x^M) belonging to a space X, where x^O denotes the observed values and x^M the missing ones.

Aim: estimation of the partition z and of the number of clusters K.

Partition into K clusters G_1, ..., G_K: z = (z_1, ..., z_n), z_i = (z_i1, ..., z_iK)', with x_i ∈ G_k ⇔ z_ih = I{h = k}.

Mixed, missing and uncertain data (each row is an individual x_i^O; its group is unknown):

continuous   continuous    categorical   integer   | group
0.5          ?             red           5         | ?
0.3          0.1           green         3         | ?
0.3          0.6           {red, green}  3         | ?
0.9          [0.25, 0.45]  red           ?         | ?

Here "?" is a missing value, while {red, green} and the interval [0.25, 0.45] are uncertain (partially observed) values.
Model-based clustering

Cluster k is modelled by a parametric distribution: X_i | Z_ik = 1 ~ p(·; α_k), i.i.d.
Cluster k has probability π_k, with Σ_{k=1}^K π_k = 1: Z_i ~ Mult_K(1, π_1, ..., π_K), i.i.d.
Missing data x^M arise from a missing completely at random (MCAR) process; this could be relaxed to missing at random (MAR).

Observed mixture pdf: with parameter θ = (π_1, ..., π_K, α_1, ..., α_K), it is written

p(x_i^O; θ) = Σ_{k=1}^K π_k p(x_i^O; α_k) = Σ_{k=1}^K π_k ∫ p(x_i^O, x_i^M; α_k) dx_i^M

Maximum a posteriori (MAP): with t_k(x_i^O; θ) = p(Z_ik = 1 | x_i^O; θ) = π_k p(x_i^O; α_k) / p(x_i^O; θ),

ẑ_i = arg max_{k ∈ {1,...,K}} t_k(x_i^O; θ)

This seems suitable for missing/uncertain data, but which p(·; α_k) for mixed data?
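For concreteness, here is a minimal R sketch of the MAP rule for a two-component univariate Gaussian mixture; the parameter values are illustrative assumptions, not estimates from real data.

```r
## MAP classification for a toy two-component univariate Gaussian mixture.
## Parameter values (pi_k, mu, sd_k) are illustrative assumptions.
pi_k  <- c(0.4, 0.6); mu <- c(0, 3); sd_k <- c(1, 1)
x_obs <- c(-0.2, 1.4, 3.1)                        # observed values x_i^O
t_ik  <- sapply(1:2, function(k) pi_k[k] * dnorm(x_obs, mu[k], sd_k[k]))
t_ik  <- t_ik / rowSums(t_ik)                     # posteriors t_k(x_i^O; theta)
z_hat <- max.col(t_ik)                            # MAP rule: arg max_k t_k
z_hat                                             # estimated cluster labels
```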
Today's high-dimensional data [2]

[Figure: examples of modern high-dimensional datasets.]

[2] S. Alelyani, J. Tang and H. Liu (2013). Feature Selection for Clustering: A Review. Data Clustering: Algorithms and Applications, 29.
HD clustering: blessing (1/2)

A two-component d-variate Gaussian mixture with intra-dependency:

π_1 = π_2 = 1/2,  X_1 | z_11 = 1 ~ N_d(0, Σ),  X_1 | z_12 = 1 ~ N_d(1, Σ)

Each variable carries its own, equally informative share of the separation information. The theoretical error therefore decreases as d grows:

err_theo = Φ(−‖µ_2 − µ_1‖_{Σ^{-1}} / 2)

The empirical error rate with the (true) intra-correlated model gets worse as d grows, while the empirical error rate with the (false) intra-independent model gets better with d!

[Figure: empirical error rates ("Empirical corr.", "Empirical indep.") and the theoretical error as functions of d = 1, ..., 10 (left); a sample of the two components in the (x1, x2) plane (right).]
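The decrease of the theoretical error with d can be checked directly from the formula above. A minimal R sketch, assuming an equicorrelated Σ with correlation 0.5 (an illustrative choice):

```r
## Theoretical error err_theo = pnorm(-||mu_2 - mu_1||_{Sigma^{-1}} / 2)
## for the two-component Gaussian mixture above, as a function of d.
## The equicorrelation level rho = 0.5 is an illustrative assumption.
err_theo <- function(d, rho = 0.5) {
  Sigma <- matrix(rho, d, d); diag(Sigma) <- 1             # equicorrelated covariance
  delta <- rep(1, d)                                       # mu_2 - mu_1 = (1, ..., 1)'
  maha  <- sqrt(drop(t(delta) %*% solve(Sigma) %*% delta)) # Mahalanobis distance
  pnorm(-maha / 2)                                         # optimal (Bayes) error rate
}
round(sapply(1:10, err_theo), 3)   # decreases as the dimension d grows
```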
HD clustering: blessing (2/2)

[Figure: projections of the simulated data on the first two FDA (factorial discriminant analysis) axes for d = 2, 20, 200 and 400; the two clusters separate more clearly as d grows.]

Neglect intra-dependency in HD clustering for a better bias/variance trade-off (when variables convey no redundant cluster information; see conclusion).
Mixed data: conditional independence everywhere

The aim is to combine continuous, categorical and integer data:

x_1 = (x_1^cont, x_1^cat, x_1^int)

The proposed solution is to mix all types through inter-type conditional independence:

p(x_1; α_k) = p(x_1^cont; α_k^cont) × p(x_1^cat; α_k^cat) × p(x_1^int; α_k^int)

In addition, for symmetry between types, intra-type conditional independence is assumed. We then only need to define a univariate pdf for each variable type:

Continuous: Gaussian
Categorical: multinomial
Integer: Poisson
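A minimal R sketch of this product density for one cluster, with one variable of each type; all parameter values are illustrative assumptions.

```r
## Conditional-independence density p(x; alpha_k): Gaussian x multinomial x Poisson.
## All parameter values below are illustrative assumptions.
alpha_k <- list(
  cont = list(mean = 0, sd = 1),        # Gaussian parameters
  cat  = c(red = 0.7, green = 0.3),     # multinomial probabilities
  int  = list(lambda = 4)               # Poisson rate
)
dens_k <- function(x, a) {
  dnorm(x$cont, a$cont$mean, a$cont$sd) *   # continuous part
    a$cat[[x$cat]] *                        # categorical part
    dpois(x$int, a$int$lambda)              # integer part
}
dens_k(list(cont = 0.5, cat = "red", int = 5), alpha_k)
```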
SEM algorithm

An SEM algorithm estimates θ by maximizing the observed-data log-likelihood ℓ(θ; x^O) = ln p(x^O; θ).

Initialisation: θ^(0).
Iteration q:
E-step: compute the conditional probabilities p(x^M, z | x^O; θ^(q))
S-step: draw (x^M(q), z^(q)) from p(x^M, z | x^O; θ^(q))
M-step: maximize θ^(q+1) = arg max_θ ln p(x^O, x^M(q), z^(q); θ)
Stopping rule: a fixed number of iterations.

Properties:
simplicity, thanks to conditional independence
classical M-steps
avoids local maxima
the mean of the sequence (θ^(q)) approximates θ̂
the variance of the sequence (θ^(q)) gives confidence intervals
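A minimal SEM sketch in R for a univariate two-component Gaussian mixture without missing values (the x^M part is dropped to keep the example short; the toy data are an assumption):

```r
## SEM for a univariate two-component Gaussian mixture (toy data, no x^M).
set.seed(1)
x <- c(rnorm(100, 0), rnorm(100, 3))              # simulated data, an assumption
K <- 2; pi_k <- rep(1/K, K); mu <- c(-1, 1); sd_k <- c(1, 1)
n_iter <- 200; mu_hist <- matrix(NA, n_iter, K)
for (q in 1:n_iter) {
  ## E-step: conditional probabilities t_k(x_i; theta^(q))
  t_ik <- sapply(1:K, function(k) pi_k[k] * dnorm(x, mu[k], sd_k[k]))
  t_ik <- t_ik / rowSums(t_ik)
  ## S-step: draw z_i from Mult_K(1, t_i1, ..., t_iK)
  z <- apply(t_ik, 1, function(p) sample(1:K, 1, prob = p))
  ## M-step: classical complete-data maximum likelihood
  for (k in 1:K) {
    pi_k[k] <- mean(z == k)
    mu[k]   <- mean(x[z == k])
    sd_k[k] <- sd(x[z == k])
  }
  mu_hist[q, ] <- mu
}
colMeans(mu_hist[-(1:50), ])   # mean of the sequence approximates the MLE
```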
SE algorithm

An SE algorithm then estimates (x^M, z), with θ fixed at its estimate θ̂.

Iteration q:
E-step: compute the conditional probabilities p(x^M, z | x^O; θ̂)
S-step: draw (x^M(q), z^(q)) from p(x^M, z | x^O; θ̂)
Stopping rule: a fixed number of iterations.

Properties:
simplicity, thanks to conditional independence
the mean/mode of the sequence (x^M(q), z^(q)) estimates (x^M, z)
confidence intervals are also derived
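Continuing the Gaussian sketch above, an SE-style imputation of a fully missing x_i: draw z_i from its conditional distribution (which reduces to the proportions π_k here), then draw x_i^M from the selected component, and summarise the sequence of draws.

```r
## SE-style imputation of a fully missing value, reusing pi_k, mu, sd_k
## from the SEM sketch above.
impute_SE <- function(pi_k, mu, sd_k, n_draw = 500) {
  z_draw <- sample(seq_along(pi_k), n_draw, replace = TRUE, prob = pi_k)
  x_draw <- rnorm(n_draw, mu[z_draw], sd_k[z_draw])   # draws of x_i^M
  c(estimate = mean(x_draw),                          # mean of the sequence
    quantile(x_draw, c(0.025, 0.975)))                # confidence interval
}
impute_SE(pi_k, mu, sd_k)
```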
Prostate cancer data [3]

Individuals: 506 patients with prostatic cancer, grouped on clinical criteria into Stages 3 and 4 of the disease.

Variables: d = 12 pre-trial variates were measured on each patient: eight continuous variables (age, weight, systolic blood pressure, diastolic blood pressure, serum haemoglobin, size of primary tumour, index of tumour stage and histological grade, serum prostatic acid phosphatase) and four categorical variables with various numbers of levels (performance rating, cardiovascular disease history, electrocardiogram code, bone metastases).

Some missing data: 62 missing values (≈ 1%).

We discard the classes (the stages of the disease) before performing the clustering.

Questions: How many clusters? Which partition?

[3] Byar DP, Green SB (1980). Bulletin du Cancer, Paris 67:477-488.
Create a free account in MixtComp [4]

https://modal-research.lille.inria.fr/BigStat/

MixtComp implements this mixed/missing-data clustering as software as a service (SaaS).

[4] See the documentation at https://modal.lille.inria.fr/wikimodal/doku.php?id=mixtcomp
Two files to merge into a unique zip file

Variable descriptor file: descriptor.csv
Data file: data.csv
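The slide shows screenshots of the two files. As a rough illustration only, here is how such a pair could be produced in R; the column names, the "?" missing-value code and the model keywords are assumptions, not the authoritative MixtComp format (see the documentation linked above).

```r
## Hypothetical content for the two input files; names, missing-value code
## and model keywords are assumptions, not the official MixtComp format.
data_df <- data.frame(Age     = c(75, 54),
                      Weight  = c(76, "?"),         # "?" marks a missing value
                      EcgCode = c("normal", "benign"),
                      BoneMet = c(0, 1))
desc_df <- data.frame(Age = "Gaussian", Weight = "Gaussian",
                      EcgCode = "Categorical", BoneMet = "Categorical")
write.csv(data_df, "data.csv", row.names = FALSE)
write.csv(desc_df, "descriptor.csv", row.names = FALSE)
zip("input.zip", c("data.csv", "descriptor.csv"))   # single zip to upload
```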
Learn!

Step 1: upload the zip file and choose the number of clusters K.
Step 2: it is running!
Output

Option 1: download the output zip file.
Option 2: instant viewing of the clusters (variable-wise normalized entropy).
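The exact entropy criterion used by MixtComp is not detailed on the slide; a common definition, sketched below, normalizes the entropy of the posterior memberships t_ik by its maximum value log K, so that values near 0 indicate well-separated clusters.

```r
## One common normalized-entropy criterion (an assumption about MixtComp's
## exact definition): mean entropy of the posteriors t_ik, divided by log(K).
norm_entropy <- function(t_ik) {                     # t_ik: n x K posterior matrix
  K <- ncol(t_ik)
  H <- -rowSums(ifelse(t_ik > 0, t_ik * log(t_ik), 0))
  mean(H) / log(K)                                   # 0 = separated, 1 = no separation
}
```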
Output in R format

[Figure: screenshot of the output loaded as an R object.]