An evolutionary analysis of association patterns

Alfonso Iodice D’Enza (1), Francesco Palumbo (2)
(1) Università di Cassino; (2) Università degli Studi di Napoli

Correspondence Analysis and Related MEthods 2011, Rennes, 8-11 February 2011
1. Introduction
2. Notation
3. Criterion
4. Procedure
5. Example
Background

A common approach to finding patterns of association in high-dimensional, sparse data is to combine dimension reduction and clustering techniques.

Qualitative data                                      | Quantitative data
------------------------------------------------------+------------------------------
multiple correspondence analysis and clustering       | Tandem analysis
[Hwang et al. (2006)]                                 | [Arabie and Hubert (1994)]
------------------------------------------------------+------------------------------
non-symmetric correspondence analysis and clustering  | Factor K-means
[Palumbo and Iodice D’Enza (2010)]                    | [Vichi and Kiers (2001)]
Aim and scope

This contribution presents a dynamic clustering procedure for high-dimensional binary data arranged into successive batches: the first data batch is used to determine a ‘starting’ solution, which is updated as further data batches are processed.

Two-fold problem:
- clustering very large data sets, or data produced at a high rate (data flows);
- performing a comparative analysis of data stratified according to time or space.
Notation and data structures

- $n$: number of statistical units;
- $p$: number of binary attributes;
- $K$: number of groups of statistical units.

$Z_j$, $j = 1, \dots, p$: Bernoulli-distributed attribute with parameter $\pi_j$ (with $z$ indicating success and $\bar z$ failure).

$X = (X_1, X_2, \dots, X_K)$: multinomially distributed random vector with parameters $(n; \pi_1, \pi_2, \dots, \pi_K)$, where the $\pi_k$ ($k = 1, \dots, K$) are unknown.
Criterion

Cross-classification table $F$ of $X$ and a single binary attribute $Z$:

$$
\begin{array}{cc|cc|c}
  &        & \multicolumn{2}{c|}{Z} &        \\
  &        & \bar z & z             &        \\ \hline
  & 1      & f_{11} & f_{12}        & f_{1+} \\
X & \vdots & \vdots & \vdots        & \vdots \\
  & K      & f_{K1} & f_{K2}        & f_{K+} \\ \hline
  &        & f_{+1} & f_{+2}        & n
\end{array}
$$
Criterion

The qualitative variance, or heterogeneity, of $X$ can be measured by the Gini index

$$
G(X) = 1 - \sum_{k=1}^{K} \left( \frac{f_{k+}}{n} \right)^2 = 1 - \sum_{k=1}^{K} \frac{f_{k+}^2}{n^2}.
$$

The variation of $X$ within the categories of the variable $Z$ is obtained by averaging $G(X \mid z)$ and $G(X \mid \bar z)$, each weighted by the relative frequency $f_{+h}/n$ of its category:

$$
G(X \mid Z) = \sum_{h=1}^{2} \frac{f_{+h}}{n} \left[ 1 - \sum_{k=1}^{K} \frac{f_{kh}^2}{f_{+h}^2} \right] = 1 - \frac{1}{n} \sum_{k=1}^{K} \sum_{h=1}^{2} \frac{f_{kh}^2}{f_{+h}}.
$$
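The two indices can be checked numerically on a small table. The sketch below is only an illustration: the 3 x 2 table is made up, the function names are hypothetical, and every column of the table is assumed to have a non-zero total.

```python
import numpy as np

def gini(counts):
    """Gini heterogeneity 1 - sum_k (f_k / total)^2 of a vector of counts."""
    counts = np.asarray(counts, dtype=float)
    return 1.0 - np.sum((counts / counts.sum()) ** 2)

def conditional_gini(F):
    """G(X|Z): Gini index of each column of F, weighted by f_{+h} / n."""
    F = np.asarray(F, dtype=float)
    col_tot = F.sum(axis=0)
    n = F.sum()
    within = np.array([gini(F[:, h]) for h in range(F.shape[1])])
    return float(np.sum(col_tot / n * within))

# Made-up table: rows are the K = 3 groups of X, columns the two categories of Z
F = np.array([[30, 10],
              [10, 30],
              [20, 20]])

g_x = gini(F.sum(axis=1))            # heterogeneity of X
g_x_given_z = conditional_gini(F)    # heterogeneity of X within the categories of Z

# Closed form from the slide: 1 - (1/n) sum_k sum_h f_kh^2 / f_{+h}
closed_form = 1.0 - np.sum(F ** 2 / F.sum(axis=0)) / F.sum()
```

Here `g_x_given_z` and `closed_form` are two routes to the same quantity, so they should agree up to rounding.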
Criterion

The variation of $X$ explained by the categories of $Z$ is

$$
\begin{aligned}
G(X) - G(X \mid Z) &= \left( 1 - \sum_{k=1}^{K} \frac{f_{k+}^2}{n^2} \right) - \left( 1 - \frac{1}{n} \sum_{k=1}^{K} \sum_{h=1}^{2} \frac{f_{kh}^2}{f_{+h}} \right) \\
&= \frac{1}{n} \sum_{k=1}^{K} \sum_{h=1}^{2} \frac{f_{kh}^2}{f_{+h}} - \frac{1}{n} \sum_{k=1}^{K} \frac{f_{k+}^2}{n}.
\end{aligned}
$$
Criterion

In the case of $p$ binary attributes, the criterion being maximized is

$$
\sum_{j=1}^{p} \left( G(X) - G(X \mid Z_j) \right),
$$

that is, the sum of the variations of $X$ explained by each of the attributes $Z_j$.
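A minimal sketch of how this sum could be evaluated for a given partition. The function names and the toy data are hypothetical; group labels are assumed to be coded 0, ..., K-1 and attributes as 0/1 integers.

```python
import numpy as np

def explained_variation(labels, z, K):
    """G(X) - G(X|Z_j) for group labels and one 0/1 attribute z."""
    n = labels.size
    F = np.zeros((K, 2))              # column 0 counts z = 0, column 1 counts z = 1
    np.add.at(F, (labels, z), 1)      # cross-classification of groups by attribute
    row_tot = F.sum(axis=1)
    col_tot = F.sum(axis=0)
    safe = np.where(col_tot > 0, col_tot, 1.0)   # an empty category contributes 0
    return np.sum(F ** 2 / safe) / n - np.sum(row_tot ** 2) / n ** 2

def criterion(labels, Z, K):
    """Sum over the p attributes (columns of Z) of the explained variation."""
    return sum(explained_variation(labels, Z[:, j], K) for j in range(Z.shape[1]))

# Toy data: attribute 0 separates the two groups exactly, attribute 1 carries no information
labels = np.array([0, 0, 1, 1])
Z = np.array([[1, 0],
              [1, 1],
              [0, 0],
              [0, 1]])
value = criterion(labels, Z, K=2)
```

In this toy case the first attribute explains all of the heterogeneity of the partition and the second explains none of it.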
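Algebraic formalization

Quantity to maximize:

$$
\operatorname{tr}\left[ \frac{1}{n} F \Delta^{-1} F^{\mathsf T} - \frac{1}{n^2} \left( X^{\mathsf T} \mathbf{1}\mathbf{1}^{\mathsf T} X \right) \right]
\equiv
\operatorname{tr}\left[ \frac{1}{n} X^{\mathsf T} Z \Delta^{-1} Z^{\mathsf T} X - \frac{1}{n^2} \left( X^{\mathsf T} \mathbf{1}\mathbf{1}^{\mathsf T} X \right) \right],
$$

where $X$ is an $(n \times K)$ matrix with $x_{ik} = 1$ if unit $i$ is assigned to group $k$, $F = [F_1 \dots F_p] = X^{\mathsf T} Z$, $\Delta = \operatorname{diag}(Z^{\mathsf T} Z)$ and $\mathbf{1}$ is an $n$-dimensional vector of ones.

Eigenvalue decomposition:

$$
\frac{1}{n} \left[ X^{\mathsf T} Z \Delta^{-1} Z^{\mathsf T} X - \frac{1}{n} \left( X^{\mathsf T} \mathbf{1}\mathbf{1}^{\mathsf T} X \right) \right] U = \Lambda U.
$$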
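A small numerical sketch of the matrix form, on random data. It assumes that $Z$ is coded with two indicator columns per attribute (so that $F$ is $K \times 2p$), and that the columns of $U$ are the eigenvectors; the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, K = 50, 4, 3

labels = rng.integers(0, K, size=n)        # arbitrary assignment of units to groups
B = rng.integers(0, 2, size=(n, p))        # raw 0/1 attributes

X = np.eye(K)[labels]                                     # n x K group indicator
Z = np.stack([1 - B, B], axis=2).reshape(n, 2 * p) * 1.0  # n x 2p, (failure, success) per attribute

ones = np.ones((n, 1))
F = X.T @ Z                                # K x 2p block cross-tabulation
delta = np.diag(Z.T @ Z)                   # diagonal of Delta, kept as a vector

# (1/n) F Delta^{-1} F^T - (1/n^2) X^T 1 1^T X
M = (F / delta) @ F.T / n - (X.T @ ones) @ (ones.T @ X) / n ** 2

eigvals, U = np.linalg.eigh(M)             # M is symmetric, eigenvalues in ascending order
criterion = np.trace(M)                    # the quantity to maximize
```

Since the trace equals the sum of the eigenvalues, the quantity to maximize can be read directly off the decomposition.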
Back to the aim

This contribution presents a dynamic clustering procedure for high-dimensional binary data arranged into successive batches: the first data batch is used to determine a ‘starting’ solution, which is updated as further data batches are processed.

Two-fold problem:
- clustering very large data sets, or data produced at a high rate (data flows);
- performing a comparative analysis of data stratified according to time or space.
The overall procedure

The proposed procedure consists of three phases:
- phase 1, analysis of the starting batch: the i-FCB (iterative factorial clustering of binary data) procedure is applied to obtain the starting solution [Palumbo and Iodice D’Enza (2010)];
- phase 2, new batch processing: incoming statistical units are assigned to the $K$ groups;
- phase 3, updating process: all the quantities are updated according to the new data.

Phases 2 and 3 are repeated for each new data batch.
phase 1: starting batch

The i-FCB iterative algorithm runs through the following steps:
- step 0: pseudo-random generation of the matrix $X$;
- step 1: an eigenvalue decomposition is performed on the matrix resulting from expression (1), obtaining the matrix $\Psi$ such that

$$
\Psi = \left( Z \Delta^{-1} Z^{\mathsf T} - \frac{1}{n} \mathbf{1}\mathbf{1}^{\mathsf T} \right) X U \Lambda^{1/2}; \tag{1}
$$

- step 2: the matrix $X$ is updated by a non-hierarchical clustering algorithm based on squared Euclidean distances ($k$-means), applied to the projected statistical units (the matrix $\Psi$).

Steps 1 and 2 are iterated until the stopping rule is met: the quantity in (1) does not increase significantly from one iteration to the next.
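The steps above can be sketched as follows. This is an illustration under several assumptions, not the authors' implementation: the stopping rule monitors the trace criterion, all $K$ dimensions of $\Psi$ are kept, $k$-means is a plain Lloyd iteration started from the current partition, and no attribute category is empty. All function names and the toy data are hypothetical.

```python
import numpy as np

def criterion_matrix(X, Z, delta):
    """(1/n) F Delta^{-1} F^T - (1/n^2) X^T 1 1^T X, with F = X^T Z."""
    n = X.shape[0]
    F = X.T @ Z
    sizes = X.sum(axis=0)
    return (F / delta) @ F.T / n - np.outer(sizes, sizes) / n ** 2

def kmeans(points, K, labels, n_iter=20):
    """Plain Lloyd iterations started from the current partition."""
    for _ in range(n_iter):
        centers = np.vstack([
            points[labels == k].mean(axis=0) if np.any(labels == k)
            else points.mean(axis=0)             # empty group: use the grand mean
            for k in range(K)])
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

def ifcb(Z, K, tol=1e-6, max_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    delta = Z.sum(axis=0)                         # diagonal of Z^T Z for 0/1 columns
    labels = rng.integers(0, K, size=Z.shape[0])  # step 0: pseudo-random partition
    history = []
    for _ in range(max_iter):
        X = np.eye(K)[labels]
        M = criterion_matrix(X, Z, delta)
        history.append(np.trace(M))
        if len(history) > 1 and history[-1] - history[-2] < tol:
            break                                 # no significant increase
        lam, U = np.linalg.eigh(M)                # step 1: eigenvalue decomposition
        W = X @ U * np.sqrt(np.clip(lam, 0, None))
        # Psi = (Z Delta^{-1} Z^T - (1/n) 1 1^T) X U Lambda^{1/2}, without forming n x n matrices
        Psi = Z @ ((Z.T @ W) / delta[:, None]) - W.mean(axis=0)
        labels = kmeans(Psi, K, labels)           # step 2: k-means on the projected units
    return labels, history

# Toy data: two blocks of units with opposite attribute profiles
rng = np.random.default_rng(1)
B = np.vstack([rng.random((60, 6)) < [.9, .9, .9, .1, .1, .1],
               rng.random((60, 6)) < [.1, .1, .1, .9, .9, .9]]).astype(int)
Z = np.stack([1 - B, B], axis=2).reshape(B.shape[0], -1).astype(float)
labels, history = ifcb(Z, K=2)
```

`history` records the criterion value at each iteration, which is the quantity one would plot to inspect convergence.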
convergence of the criterion

[Figure: value of the criterion versus number of iterations, 1000 repetitions. Panels: unstructured data; structured data.]
phase 3: updating process

- update of the number of units: $n^* = n + n^+$;
- update of the cross-tabulation block matrix: $F^* = F + F^+$, with $F^+ = X^{+\mathsf T} Z^+$;
- update of the diagonal matrix of margins: $\Delta^* = \Delta + \Delta^+$, with $\Delta^+ = \operatorname{diag}\left( Z^{+\mathsf T} Z^+ \right)$;
- update of the eigenvalue decomposition:

$$
\frac{1}{n^*} \left[ F^* (\Delta^*)^{-1} F^{*\mathsf T} - \frac{1}{n^*} \left( f^* f^{*\mathsf T} \right) \right] U^* = \Lambda^* U^*,
$$

where $f^*$ is the row-margin vector of the $F^*$ matrix.
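A sketch of the update on random data, showing that only the counts need to be stored between batches. Two assumptions are made here: the incoming units have already been assigned to groups (phase 2, replaced below by random labels), and $f^*$ is taken as the vector of group sizes, computed as the row margins of the first attribute block of $F^*$. Function names are hypothetical.

```python
import numpy as np

def disjunctive(B):
    """Code each binary attribute as two indicator columns (failure, success)."""
    return np.stack([1 - B, B], axis=2).reshape(B.shape[0], -1).astype(float)

def update(n, F, delta, X_new, Z_new):
    """Add a new batch: n* = n + n+, F* = F + F+, Delta* = Delta + Delta+."""
    return (n + X_new.shape[0],
            F + X_new.T @ Z_new,
            delta + Z_new.sum(axis=0))   # diagonal of Z+^T Z+ for 0/1 columns

def decomposition(n, F, delta):
    """Eigenvalue decomposition of the updated matrix, from stored counts only."""
    sizes = F[:, :2].sum(axis=1)         # assumption: group sizes from the first attribute block
    M = ((F / delta) @ F.T - np.outer(sizes, sizes) / n) / n
    return np.linalg.eigh(M)

rng = np.random.default_rng(0)
K, p = 3, 5

def random_batch(m):
    """Hypothetical batch: random group indicator matrix and random binary data."""
    labels = rng.integers(0, K, size=m)
    return np.eye(K)[labels], disjunctive(rng.integers(0, 2, size=(m, p)))

X0, Z0 = random_batch(200)               # starting batch
X1, Z1 = random_batch(100)               # incoming batch, already assigned to groups

n, F, delta = X0.shape[0], X0.T @ Z0, Z0.sum(axis=0)
n_star, F_star, delta_star = update(n, F, delta, X1, Z1)
eigvals, U_star = decomposition(n_star, F_star, delta_star)
```

The updated quantities coincide with those computed from the stacked data, so the raw units of earlier batches do not have to be kept.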
Application: synthetic data

The number of binary attributes is $p = 12$: $V_1, V_2, \dots, V_{12}$.
- starting block: 200 statistical units described by uncorrelated items;
- first block: 100 statistical units with $V_1, V_2, V_3$ highly correlated, 100 statistical units with $V_{10}, V_{11}, V_{12}$ highly correlated;
- second block: 400 statistical units described by uncorrelated items;
- third block: 100 statistical units with $V_4, V_5, V_6$ highly correlated, 100 statistical units with $V_7, V_8, V_9$ highly correlated.

Notes: the number of clusters is $K = 3$. Synthetic data are obtained using the R package bindata, by Leisch.
Visualization of the results

A common visualization support: the procedure produces a different factorial plane for each update. To visualize how the association structure of the attributes evolves as new data come in, a three-way multidimensional scaling (MDS) is used.

MDS visualization: for the starting matrix $F$ and for each of its updates $F^*$, a matrix of chi-square distances among attributes is computed. A three-way MDS is then performed on the resulting three-way distance array, using the R package smacof by de Leeuw and Mair.
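The distance step can be sketched as follows, assuming the usual chi-square distance between column profiles (squared differences weighted by the inverse row masses) and a table with no empty row or column. The tables are made up and the function name is hypothetical; the three-way MDS itself is left to smacof and is not reproduced here.

```python
import numpy as np

def chi_square_distances(F):
    """Chi-square distances between the column profiles of a K x q table F."""
    F = np.asarray(F, dtype=float)
    row_mass = F.sum(axis=1) / F.sum()         # f_{k+} / total
    profiles = F / F.sum(axis=0)               # each column sums to 1
    scaled = profiles / np.sqrt(row_mass)[:, None]
    diff = scaled[:, :, None] - scaled[:, None, :]
    return np.sqrt((diff ** 2).sum(axis=0))    # q x q symmetric matrix

# Made-up starting table and one update (K = 2 groups, q = 2 columns)
tables = [np.array([[10, 20], [30, 20]]),
          np.array([[15, 25], [35, 25]])]

# One distance matrix per table, stacked into a three-way array (tables x q x q)
D = np.stack([chi_square_distances(F) for F in tables])
```

Each slice `D[t]` is the distance matrix for one state of the table; the stacked array is the kind of input a three-way MDS would take.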
Application: real-world data

The ‘retail’ data set: the retail market basket data set is supplied by an anonymous Belgian retail supermarket store. The data were collected over three non-consecutive periods, covering approximately 5 months in total. The total number of receipts (statistical units) collected is $n = 88163$; the number of products (binary attributes) is $p = 28549$.