Correspondence Analysis and Moderate Outliers
Anna Langovaya, Sonja Kuhnt
TU Dortmund
February 9, 2011

Overview

1 Correspondence Analysis
    Statistical model
    Correspondence Analysis
2 Moderate outliers in contingency tables
    Idea of moderate outliers
    Simulation study design
    Results
3 Spatial confidence regions
    One outlier in the table
    Several outliers in the table
    Outlook

Motivation

Behaviour of Correspondence Analysis (CA) with outliers in multidimensional contingency tables.
Consider 'outliers' that break independence in the table but are not immediately conspicuous.
Question: How do outliers affect the CA coordinates?

Notation

$X_1, X_2, X_3$: random variables, $X = \{1, \dots, I\} \times \{1, \dots, J\} \times \{1, \dots, K\}$
$n_{ijk}$, $i = 1, \dots, I$, $j = 1, \dots, J$, $k = 1, \dots, K$: observed frequency of $(X_1 = i, X_2 = j, X_3 = k)$
$N_{ijk}$: random variables, $n$: sample size
$(N_{111}, \dots, N_{IJK}) \sim \mathrm{Multinomial}(n, (\pi_{111}, \dots, \pi_{IJK}))$
Under the null hypothesis of total independence:
$\pi_{ijk} = \pi_{i\cdot\cdot} \, \pi_{\cdot j\cdot} \, \pi_{\cdot\cdot k}$
$\pi_{i\cdot\cdot} = \sum_{j} \sum_{k} \pi_{ijk}$, $\pi_{\cdot j\cdot} = \sum_{i} \sum_{k} \pi_{ijk}$, $\pi_{\cdot\cdot k} = \sum_{i} \sum_{j} \pi_{ijk}$

Example of a 3-dimensional contingency table

                  X3
X1   X2    1       2       ...   k       ...   K       Sum
1    1     n_111   n_112   ...   n_11k   ...   n_11K   n_11·
     ...
     j     n_1j1   n_1j2   ...           ...   n_1jK
     ...
     J     n_1J1   n_1J2   ...   n_1Jk   ...   n_1JK
...
i    1     n_i11
     ...
     j                     ...   n_ijk   ...           n_ij·
     ...
     J                                         n_iJK
...
I    1     n_I11
     ...
     J                                         n_IJK   n_IJ·
Sum        n_··1           ...   n_··k   ...   n_··K   n

Correspondence analysis

$S$: matrix of standardized residuals, dimension $(I \cdot J) \times K$
Elements of $S$: $s_{(ij)k} = \dfrac{(n_{ijk}/n) - r_{(ij)} c_k}{\sqrt{r_{(ij)} c_k}}$ with $c_k = n_{\cdot\cdot k}/n$ and $r_{(ij)} = n_{ij\cdot}/n$
$D_r = \mathrm{diag}(r_{11}, \dots, r_{IJ})$, $D_c = \mathrm{diag}(c_1, \dots, c_K)$
Singular value decomposition of $S$: $S = U \Sigma V^T$
Correspondence analysis representation by $F = D_r^{-1/2} U \Sigma$, $G = D_c^{-1/2} V \Sigma$

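The formulas on this slide can be traced numerically. The following is a minimal sketch in Python with NumPy, not the R package used in the talk; the function name `correspondence_analysis` and the toy table are hypothetical, and the sketch assumes every row sum $n_{ij\cdot}$ and column sum $n_{\cdot\cdot k}$ is positive so that the divisions are defined.

```python
import numpy as np

def correspondence_analysis(table3d):
    """CA of an I x J x K table whose (i, j) pairs are stacked into rows.

    Hypothetical helper: assumes all row and column sums are positive.
    """
    n_ijk = np.asarray(table3d, dtype=float)
    I, J, K = n_ijk.shape
    P = n_ijk.reshape(I * J, K) / n_ijk.sum()    # n_ijk / n as an (I*J) x K matrix
    r = P.sum(axis=1)                            # r_(ij) = n_ij. / n
    c = P.sum(axis=0)                            # c_k = n_..k / n
    expected = np.outer(r, c)
    S = (P - expected) / np.sqrt(expected)       # standardized residuals
    U, sigma, Vt = np.linalg.svd(S, full_matrices=False)
    F = (U * sigma) / np.sqrt(r)[:, None]        # F = D_r^{-1/2} U Sigma
    G = (Vt.T * sigma) / np.sqrt(c)[:, None]     # G = D_c^{-1/2} V Sigma
    return F, G, sigma

# Toy 2 x 3 x 4 table with strictly positive counts (illustrative data only).
toy = np.arange(1, 25).reshape(2, 3, 4)
F, G, sigma = correspondence_analysis(toy)
```

The rows of `F` are the coordinates of the stacked $(i, j)$ categories and the rows of `G` those of the $X_3$ categories; the first two columns give the usual two-dimensional CA plot.
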
Outliers in contingency tables

Idea
Outliers are defined as specific cell frequencies of the underlying contingency table.
Outlier: deviation from the null model
Null model: independence model

Simulation study design

Procedure 1: Independence
1 Randomly generate marginal probabilities $\pi_{i\cdot\cdot}$, $\pi_{\cdot j\cdot}$, $\pi_{\cdot\cdot k}$
2 Define probabilities $\pi_{ijk} = \pi_{i\cdot\cdot} \cdot \pi_{\cdot j\cdot} \cdot \pi_{\cdot\cdot k}$
3 Simulate $n$ observations from $\mathrm{Multinomial}(n, (\pi_l)_{l = 1, \dots, IJK})$, matrix of observations $X(I, J, K)$
4 Apply correspondence analysis (R-package: ca)

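Steps 1 to 3 of Procedure 1 can be sketched with the Python standard library. The slides do not say how the marginal probabilities are drawn, so normalising uniform random numbers is an assumption here; the helper names, the seed, the 4 x 4 x 4 dimensions and `n = 1000` are likewise hypothetical choices for illustration.

```python
import random
from collections import Counter
from itertools import product

def random_marginal(size, rng):
    """Random probability vector (assumed scheme: normalised uniform draws)."""
    weights = [rng.random() for _ in range(size)]
    total = sum(weights)
    return [w / total for w in weights]

def independence_probs(I, J, K, rng):
    """Cell probabilities pi_ijk = pi_i.. * pi_.j. * pi_..k under independence."""
    pi_i = random_marginal(I, rng)
    pi_j = random_marginal(J, rng)
    pi_k = random_marginal(K, rng)
    return {(i, j, k): pi_i[i] * pi_j[j] * pi_k[k]
            for i, j, k in product(range(I), range(J), range(K))}

def simulate_table(probs, n, rng):
    """Draw n observations from Multinomial(n, probs) and count them per cell."""
    cells = list(probs)
    draws = rng.choices(cells, weights=[probs[c] for c in cells], k=n)
    counts = Counter(draws)
    return {cell: counts[cell] for cell in cells}

rng = random.Random(2011)                       # fixed seed, illustrative only
probs = independence_probs(4, 4, 4, rng)
table = simulate_table(probs, n=1000, rng=rng)  # cell counts keyed by (i, j, k)
```

Step 4, the correspondence analysis itself, is left out of this sketch.
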
Simulation study design

Procedure 2: with an outlier
1 Randomly generate marginal probabilities $\pi_{i\cdot\cdot}$, $\pi_{\cdot j\cdot}$, $\pi_{\cdot\cdot k}$
2 Define probabilities $\pi_{ijk} = \pi_{i\cdot\cdot} \cdot \pi_{\cdot j\cdot} \cdot \pi_{\cdot\cdot k}$
3 Outlier generation: replace chosen $\pi_{ijk}$ by $(1.2) \max(\pi_{ijk})$
4 Rescale probabilities to $\sum_{ijk} \pi_{ijk} = 1$
5 Simulate $n$ observations from $\mathrm{Multinomial}(n, (\pi_l)_{l = 1, \dots, IJK})$, matrix of observations $X(I, J, K)$
6 Apply correspondence analysis (R-package: ca)

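Steps 3 and 4 of Procedure 2 can be sketched as follows, reading the replacement value as 1.2 times the largest cell probability of the independence model. The function name `add_outlier` and the uniform 2 x 2 x 2 starting table are hypothetical and serve only to make the arithmetic easy to follow.

```python
from itertools import product

def add_outlier(probs, cell, factor=1.2):
    """Replace probs[cell] by factor * max(probs), then rescale to sum to 1."""
    contaminated = dict(probs)
    contaminated[cell] = factor * max(probs.values())
    total = sum(contaminated.values())
    return {c: p / total for c, p in contaminated.items()}

# Illustrative null model: uniform probabilities on a 2 x 2 x 2 table.
cells = list(product(range(2), repeat=3))
null_probs = {c: 0.125 for c in cells}
outlier_probs = add_outlier(null_probs, cell=(0, 0, 0))
```

In this toy case the chosen cell is raised from 0.125 to 0.15 before rescaling, so after rescaling it remains 1.2 times as likely as each of the other seven cells.
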
Tables with outlier in first cell

Four simulated tables A, B, C and D; rows are indexed by (X1, X2), columns by the categories a, b, c, d of X3.

             A                 B                 C                 D
X1  X2       a   b   c   d     a   b   c   d     a   b   c   d     a   b   c   d
1   1       62   0   0   1    61  20  15   4    54   5   3   4    90  13  37  35
1   2        0   0   0   2     1  64  47  39     0   4   2  11    10  21  53  63
1   3        0   0   1   2     0  21  20  14     3   2   6   6     0   2  13   9
1   4        0   0   0   1     3  53  51  41     2   1   8   5     4   7  28  25
2   1        4  22   0  16     0  11  10   6     8  15  23  24     0   1   3   3
2   2       19  45   7  32     0  46  34  31    13  29  22  52     1   3   2   6
2   3        5  42   7  39     0  20  10  11    10  40  24  43     0   0   2   0
2   4       15  50   6  23     0  27  34  37     7  18  17  33     1   1   3   2
3   1        4  25   1  13     0   3   3   2    13  22  18  28     5  18  65  71
3   2       12  73   8  41     0  11   3   9    13  45  33  50     7  19  84  84
3   3       12  60   4  29     0   4   2   1    10  42  39  60     1   5  15  13
3   4       12  58  11  43     0   5   6   5     7  33  13  38     5  17  39  34
4   1        1  11   0   9     1  11   6   7     2   2   3   4     1   3   9  10
4   2       10  29   4  21     0  32  28  16     1   5   2   1     2   1  12  16
4   3        9  31   7  22     0   7   7   7     2   2   6   5     0   0   7   1
4   4        8  19   1  11     0  31  40  22     1   0   3   3     1   3   6   8

CA-Plots of 4 simulations with outlier in the first cell

[Figure: four CA plots, panels A, B, C and D, each showing the row points 1 to 16 and the column points a, b, c, d; horizontal axis from −3 to 2, vertical axis from −2 to 2]

Frequency of the CA-coordinates of the cell 1

[Figure: four 3-D histograms with axes Dimension 1, Dimension 2 and Frequency. Panels: Independence_Rows, Independence_Columns, Outlier_Rows and Outlier_Columns, each showing the coordinates of the cell 1]