symbolic pca of compositional data

Symbolic PCA of compositional data. Sun Makosso Kallyth & Edwin - PowerPoint PPT Presentation

Introduction. Presentation of the first methodology. Second approach: use of an angular transformation. Applications of the two approaches. Conclusion. Bibliography. Symbolic PCA of compositional data. Sun Makosso Kallyth & Edwin Diday


  1. Symbolic PCA of compositional data. Sun Makosso Kallyth & Edwin Diday, Université Paris Dauphine. COMPSTAT 2010, the 19th International Conference on Computational Statistics.

  2. Outline.
     1. Introduction: context and contribution of symbolic data analysis; compositional data and example.
     2. Presentation of the first methodology: coding of bins; PCA of means of variables; representation of the dispersion of individuals.
     3. Second approach, use of an angular transformation: problem of the unit constraint; resolution of the unit-constraint problem by angular transformation.
     4. Applications of the two approaches.
     5. Conclusion.

  3. Context. Data are becoming more and more complex: sequential, textual, structured in blocks, and so on. Such data are difficult to analyze with the usual tools of data analysis, so classical methods need to be extended to complex data.

  4. Contribution of symbolic data analysis. Complex data can be studied efficiently at a higher level of generality (towns → regions, countries → continents, players → teams). Variables can be symbolic interval-valued, symbolic multi-valued, histogram-valued, and so on. The output of the proposed methodology must itself be symbolic.

  5. Compositional data and histogram data. The $m$ classical variables $x_1, \ldots, x_m$ are compositional if they are non-negative and $x_1 + \cdots + x_m = 1$. Symbolic histogram variables are an example of compositional variables. Let $n$ be the number of observations, $p$ the number of variables, and $m_j$ the number of bins of variable $j$. Then $Y_j = (Y_{ij})_{i=1,\ldots,n;\ j=1,\ldots,p}$ is a symbolic histogram variable if

     $$Y_{ij} = \{\xi_j, H_{ij}\}, \qquad \xi_j = \left(\xi_j^{(1)}, \ldots, \xi_j^{(m_j)}\right),$$

     where the $\xi_j^{(k_j)}$ are the bins of variable $j$ and the $H_{ij}^{(k_j)}$ are relative frequencies:

     $$H_{ij}^{(1)} + \cdots + H_{ij}^{(m_j)} = 1.$$

  6. Example of a symbolic histogram variable.

     Table: Example of a symbolic histogram variable. GDP per inhabitant is in thousands of dollars (k$).

     | Region | GDP ≤ 1 | GDP ]1, 20] | GDP > 20 | Mortality rate ≤ 0.10 | Mortality rate > 0.10 |
     | --- | --- | --- | --- | --- | --- |
     | Africa | 0.340 | 0.660 | 0.000 | 0.245 | 0.755 |
     | NAFTA | 0.000 | 0.333 | 0.667 | 1.000 | 0.000 |
     | East Asia | 0.067 | 0.801 | 0.133 | 1.000 | 0.000 |
     | Europe | 0.000 | 0.322 | 0.677 | 0.742 | 0.258 |

     For instance, $Y_{11} = \{\xi_1, H_{11}\}$ with $\xi_1 = \{\,]-\infty, 1],\ ]1, 20],\ ]20, +\infty[\,\}$ and $H_{11} = (0.340;\ 0.660;\ 0.000)$.
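The unit-sum constraint on the example table can be checked numerically. The sketch below is only an illustration: the names `table` and `is_compositional` and the tolerance value are assumptions, not part of the slides. The tolerance is deliberately loose because the published frequencies are rounded to three decimals.

```python
import math

# Relative frequencies from the example table, per region and variable.
# Region labels are English renderings of Afrique, Alena, AsieOrientale, Europe.
table = {
    "Africa":    {"gdp": [0.340, 0.660, 0.000], "mortality": [0.245, 0.755]},
    "NAFTA":     {"gdp": [0.000, 0.333, 0.667], "mortality": [1.000, 0.000]},
    "East Asia": {"gdp": [0.067, 0.801, 0.133], "mortality": [1.000, 0.000]},
    "Europe":    {"gdp": [0.000, 0.322, 0.677], "mortality": [0.742, 0.258]},
}


def is_compositional(parts, tol=2e-3):
    """True if all parts are non-negative and sum to 1 within `tol`.

    The loose default tolerance is an assumption that absorbs the rounding
    of the table entries to three decimals.
    """
    return all(x >= 0 for x in parts) and math.isclose(sum(parts), 1.0, abs_tol=tol)


# One check per (region, variable) histogram.
checks = {
    (region, variable): is_compositional(freqs)
    for region, row in table.items()
    for variable, freqs in row.items()
}
```

Every entry of `checks` is `True` for this table; a vector such as `[0.5, 0.6]` fails the test because its sum is 1.1.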

  7. Parametric coding. Let $D_j = (\alpha_j, \beta_j)$ be the domain of all possible values of the bins of variable $j$. For the first variable (GDP), $\alpha_1 = 0$ and $\beta_1 = +\infty$; for the second variable (mortality rate), $\alpha_2 = 0$ and $\beta_2 = 100$. Define

     $$\delta_j = \inf_{k_j = 1, \ldots, m_j} L_{k_j},$$

     where $L_{k_j}$ is the length of the interval $\xi_j^{(k_j)}$. If $\xi_j^{(k_j)} = \,]-\infty, a_j]$, then $\xi_j^{(k_j)} \to {\xi'}_j^{(k_j)} = \,]e, a_j]$, where

     $$e = \begin{cases} \alpha_j & \text{if } a_j - \delta_j < \alpha_j, \\ a_j - \delta_j & \text{otherwise.} \end{cases}$$

     If $\xi_j^{(k_j)} = \,]b_j, +\infty[$, then $\xi_j^{(k_j)} \to {\xi'}_j^{(k_j)} = \,]b_j, f_j]$, with

     $$f_j = \begin{cases} \beta_j & \text{if } b_j + \delta_j > \beta_j, \\ b_j + \delta_j & \text{otherwise.} \end{cases}$$

  8. Parametric coding (continued). In the example, $\xi_1^{(1)} = \,]-\infty, 1]$ and $\xi_1^{(2)} = \,]1, 20]$, so $L_2 = 20 - 1 = 19$. We replace

     $$\xi_1^{(1)} \to {\xi'}_1^{(1)} = \,]\max(1 - 19,\ 0),\ 1] = \,]0, 1]$$

     and

     $$\xi_1^{(3)} \to {\xi'}_1^{(3)} = \,]20,\ \min(20 + 19,\ +\infty)] = \,]20,\ \min(39,\ +\infty)] = \,]20, 39].$$

     If the bins of the variables do not have the same unit, each interval $]a', b']$ is replaced by the adjusted interval $]a'/(b' - a');\ b'/(b' - a')]$. Parametric coding assigns to the bins of variable $j$ a vector of scores $s_j = \left(s_j^{(1)}, \ldots, s_j^{(m_j)}\right)$, where $s_j^{(k_j)}$ is the center of the adjusted interval, for $k_j = 1, \ldots, m_j$.
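The closing rule for unbounded bins and the resulting scores can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `close_bins` and `bin_scores` are hypothetical, bins are passed as `(low, high)` pairs, and when `adjust=False` the plain bin center is used as the score (the slides only define the score as the center of the adjusted interval).

```python
import math


def close_bins(bins, alpha, beta):
    """Replace unbounded end bins by bounded ones.

    bins  : list of (low, high) pairs; use -math.inf / math.inf for open ends
    alpha : lower limit of the domain D_j
    beta  : upper limit of the domain D_j
    """
    # delta_j is the smallest bin length; unbounded bins have infinite
    # length, so only bounded bins can attain the minimum.
    lengths = [high - low for low, high in bins
               if math.isfinite(low) and math.isfinite(high)]
    if not lengths:
        raise ValueError("at least one bounded bin is needed to define delta")
    delta = min(lengths)

    closed = []
    for low, high in bins:
        if low == -math.inf:
            low = max(high - delta, alpha)   # e = alpha if a - delta < alpha
        if high == math.inf:
            high = min(low + delta, beta)    # f = beta if b + delta > beta
        closed.append((low, high))
    return closed


def bin_scores(closed_bins, adjust=False):
    """Center of each bin, or of the adjusted bin ]a/(b-a), b/(b-a)] if adjust."""
    scores = []
    for low, high in closed_bins:
        width = high - low if adjust else 1.0
        scores.append((low + high) / (2.0 * width))
    return scores


# GDP bins of the example: ]-inf, 1], ]1, 20], ]20, +inf[ on D_1 = (0, +inf).
gdp_bins = [(-math.inf, 1), (1, 20), (20, math.inf)]
closed_gdp = close_bins(gdp_bins, alpha=0, beta=math.inf)
gdp_scores = bin_scores(closed_gdp)
```

For the GDP bins, `closed_gdp` contains the bounded bins ]0, 1], ]1, 20] and ]20, 39], which agrees with the worked example on the slide.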

  9. Non-parametric coding. Non-parametric coding uses the rank of each bin as its score: $s_j^{(1)} = 1,\ s_j^{(2)} = 2,\ \ldots,\ s_j^{(m_j)} = m_j$. In the example table of histogram data, the scores of the bins are $s_1^{(1)} = 1,\ s_1^{(2)} = 2,\ s_1^{(3)} = 3$ for GDP and $s_2^{(1)} = 1,\ s_2^{(2)} = 2$ for the mortality rate.

  10. PCA of means of variables. Compute the mean $g_{ij}$ of each histogram:

     $$g_{ij} = \sum_{k_j = 1}^{m_j} s_j^{(k_j)} H_{ij}^{(k_j)}.$$

     Table: Means of the histogram variables.

     | Variable | $Y_1$ | $\cdots$ | $Y_p$ |
     | --- | --- | --- | --- |
     | $\omega_1$ | $g_{11}$ | $\cdots$ | $g_{1p}$ |
     | $\omega_2$ | $g_{21}$ | $\cdots$ | $g_{2p}$ |
     | $\vdots$ | $\vdots$ | | $\vdots$ |
     | $\omega_n$ | $g_{n1}$ | $\cdots$ | $g_{np}$ |

     An ordinary PCA is then applied to the $n \times p$ table $(g_{ij})_{i=1,\ldots,n;\ j=1,\ldots,p}$. Let $u_\alpha$ denote the principal axes of the means of the variables.
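The two steps on this slide (histogram means, then an ordinary PCA of the table of means) can be sketched with the standard library only. Several choices here are assumptions rather than details from the slides: rank scores are used for the bins, the PCA is run on the covariance matrix with divisor $n$, and only the first principal axis is extracted, by power iteration instead of a full eigendecomposition. The function names are hypothetical.

```python
import math


def histogram_means(scores, freqs):
    """g[i][j] = sum over k of scores[j][k] * freqs[i][j][k]."""
    return [
        [sum(s * h for s, h in zip(scores[j], unit[j])) for j in range(len(scores))]
        for unit in freqs
    ]


def covariance(table):
    """Column means and covariance matrix (divisor n) of an n x p table."""
    n, p = len(table), len(table[0])
    means = [sum(row[j] for row in table) / n for j in range(p)]
    cov = [
        [sum((row[a] - means[a]) * (row[b] - means[b]) for row in table) / n
         for b in range(p)]
        for a in range(p)
    ]
    return means, cov


def first_principal_axis(cov, iterations=500):
    """Leading unit eigenvector of cov by power iteration.

    Assumes cov is non-zero and the start vector is not orthogonal
    to the leading eigenvector.
    """
    p = len(cov)
    u = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        v = [sum(cov[a][b] * u[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in v))
        u = [x / norm for x in v]
    return u


# Rank scores for GDP (3 bins) and mortality rate (2 bins).
scores = [[1, 2, 3], [1, 2]]
# Relative frequencies of the four regions of the example table.
freqs = [
    [[0.340, 0.660, 0.000], [0.245, 0.755]],
    [[0.000, 0.333, 0.667], [1.000, 0.000]],
    [[0.067, 0.801, 0.133], [1.000, 0.000]],
    [[0.000, 0.322, 0.677], [0.742, 0.258]],
]
g = histogram_means(scores, freqs)   # 4 x 2 table of means
means, cov = covariance(g)
u1 = first_principal_axis(cov)       # first principal axis u_1
```

The table `g` plays the role of $(g_{ij})$, and `u1` is one of the axes $u_\alpha$ onto which the units are later projected. A standardized PCA would use the correlation matrix in place of `cov`.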

  11. Transformation of $\{s_j; H_{ij}\} = \left\{s_j^{(k)}; H_{ij}^{(k)}\right\}$ into an interval $[\underline{x}_{ij}, \overline{x}_{ij}]$ via Chebyshev's rule: if $X$ is a random variable with mean $g_{ij}$ and standard deviation $\sigma_{ij}$, then

     $$P\left(X \in [g_{ij} - t\sigma_{ij},\ g_{ij} + t\sigma_{ij}]\right) \ge 1 - \frac{1}{t^2} \qquad \forall\, t > 0, \tag{2.1}$$

     where $g_{ij} = \sum_{k_j = 1}^{m_j} s_j^{(k_j)} H_{ij}^{(k_j)}$ and $\sigma_{ij}$ is the standard deviation. For a chosen $t$, the histogram is thus replaced by $[\underline{x}_{ij}, \overline{x}_{ij}] = [g_{ij} - t\sigma_{ij},\ g_{ij} + t\sigma_{ij}]$.

     Table: Histograms transformed into intervals via Chebyshev's rule.

     | Variable | $Y_1$ | $Y_2$ | $\cdots$ | $Y_p$ |
     | --- | --- | --- | --- | --- |
     | $\omega_1$ | $[\underline{x}_{11}, \overline{x}_{11}]$ | $[\underline{x}_{12}, \overline{x}_{12}]$ | $\cdots$ | $[\underline{x}_{1p}, \overline{x}_{1p}]$ |
     | $\omega_2$ | $[\underline{x}_{21}, \overline{x}_{21}]$ | $[\underline{x}_{22}, \overline{x}_{22}]$ | $\cdots$ | $[\underline{x}_{2p}, \overline{x}_{2p}]$ |
     | $\vdots$ | $\vdots$ | $\vdots$ | | $\vdots$ |
     | $\omega_n$ | $[\underline{x}_{n1}, \overline{x}_{n1}]$ | $[\underline{x}_{n2}, \overline{x}_{n2}]$ | $\cdots$ | $[\underline{x}_{np}, \overline{x}_{np}]$ |
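As a small illustration of Chebyshev's rule, the sketch below turns one histogram into an interval. It assumes that $\sigma_{ij}$ is the standard deviation of the bin scores weighted by the relative frequencies, which seems the natural reading of the slide. The function name and the value $t = 2$ (coverage of at least 75%) are illustrative choices.

```python
import math


def chebyshev_interval(scores, freqs, t=2.0):
    """Interval [g - t*sigma, g + t*sigma] for one histogram.

    By Chebyshev's inequality the interval carries at least
    1 - 1/t**2 of the probability mass.
    """
    if t <= 0:
        raise ValueError("t must be positive")
    # Mean of the histogram: scores weighted by relative frequencies.
    g = sum(s * h for s, h in zip(scores, freqs))
    # Weighted variance of the scores around that mean.
    variance = sum(h * (s - g) ** 2 for s, h in zip(scores, freqs))
    sigma = math.sqrt(variance)
    return g - t * sigma, g + t * sigma


# First region of the example table, GDP variable, rank scores 1, 2, 3.
low, high = chebyshev_interval([1, 2, 3], [0.340, 0.660, 0.000], t=2.0)
```

Larger values of `t` give wider intervals with a higher guaranteed coverage, so `t` trades precision of the interval against the share of the histogram it is guaranteed to contain.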

  12. Representation of the dispersion of individuals. Construction of hypercubes: the hypercube of individual $i$ is represented by a $2^p \times p$ matrix whose rows are its vertices. For $p = 2$:

     $$M_i = \begin{pmatrix} \underline{x}_{i1} & \underline{x}_{i2} \\ \underline{x}_{i1} & \overline{x}_{i2} \\ \overline{x}_{i1} & \underline{x}_{i2} \\ \overline{x}_{i1} & \overline{x}_{i2} \end{pmatrix}$$

     The hypercube is projected onto the principal axes $u_\alpha$ of the PCA of the means of the variables. The minimum and maximum of the $2^p$ projected points are determined on each axis, and the individual is then represented by the corresponding rectangle.
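The projection step can be sketched as follows. This is a hedged illustration: the function names are hypothetical, the numbers in the toy example are made up, and the vertices are centered with the same column means as the PCA of means, which is an assumption about how the projection is set up.

```python
from itertools import product


def hypercube_vertices(intervals):
    """All 2**p vertices of the box defined by p (low, high) intervals."""
    return [list(vertex) for vertex in product(*intervals)]


def project_hypercube(intervals, centre, axis):
    """Min and max coordinate of the projected vertices on one principal axis.

    centre : column means used to center the table before the PCA
    axis   : unit vector u_alpha
    """
    coords = [
        sum((x - c) * u for x, c, u in zip(vertex, centre, axis))
        for vertex in hypercube_vertices(intervals)
    ]
    return min(coords), max(coords)


# Toy individual with p = 2 interval-valued variables (made-up numbers).
box = [(0.0, 2.0), (1.0, 3.0)]
vertices = hypercube_vertices(box)
segment = project_hypercube(box, centre=[0.0, 0.0], axis=[1.0, 0.0])
```

The rows of `vertices` come out in the same order as the rows of $M_i$. Calling `project_hypercube` once per retained axis gives the sides of the rectangle drawn for each individual.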

  13. Problem of the unit constraint. The relative frequencies $H_{ij}^{(k_j)}$ are compositional data because of the unit constraint. The unit constraint (cf. Aitchison (1986)) causes:
     1. Spurious correlation
     2. Negative bias
     3. Lack of normality
     4. Instability of variance
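The negative bias follows directly from the constraint. The short derivation below is a standard argument for compositional data, written in the notation of the slides. It assumes that, for a fixed variable $j$, the frequencies are treated as random across individuals $i$.

```latex
\begin{aligned}
\sum_{k=1}^{m_j} H_{ij}^{(k)} = 1
&\;\Longrightarrow\;
\operatorname{cov}\Bigl(H_{ij}^{(1)},\ \sum_{k=1}^{m_j} H_{ij}^{(k)}\Bigr) = 0 \\
&\;\Longrightarrow\;
\sum_{k=2}^{m_j} \operatorname{cov}\bigl(H_{ij}^{(1)}, H_{ij}^{(k)}\bigr)
= -\operatorname{var}\bigl(H_{ij}^{(1)}\bigr) \le 0 .
\end{aligned}
```

So whenever $H_{ij}^{(1)}$ actually varies, at least one of its covariances with the other bins must be negative, whatever the underlying phenomenon.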
