The Aggregate Prediction Index and Non-Symmetric Correspondence Analysis of Aggregate Data: The 2 × 2 Table
Eric J. Beh, School of Mathematical and Physical Sciences, University of Newcastle, Australia
Rosaria Lombardo, Economics Faculty, Second University of Naples, Italy
CARME 2011, Rennes, France, February 8-11
The 2 × 2 Contingency Table

Cross-classify a sample of size n according to two dichotomous variables:

             Column 1   Column 2   Total
    Row 1     p11 (?)    p12 (?)    p1•
    Row 2     p21 (?)    p22 (?)    p2•
    Total      p•1        p•2        1

"Let us blot out the contents of the table, leaving only the marginal frequencies . . . [they] by themselves supply no information on . . . the proportionality of the frequencies in the body of the table . . ." – Fisher (1935)

Symmetric association – Pearson chi-squared statistic. Define P1 = p11/p1•. Then

    X²(P1 | p•1, p1•) = n (p1•/p2•) (P1 − p•1)² / (p•1 p•2)

Aggregate Association Index (Beh, 2010, CS&DA).
Bounds & Accounting Identity

Duncan & Davis (1953) bounds:

    L1 = max(0, (n•1 − n2•)/n1•) ≤ P1 ≤ min(n•1/n1•, 1) = U1
    L2 = max(0, (n•1 − n1•)/n2•) ≤ P2 ≤ min(n•1/n2•, 1) = U2

The accounting identity (King, 1997; and others):

    n1• P1 + n2• P2 = n•1
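As a quick numerical sketch, the Duncan & Davis bounds and the accounting identity can be checked directly from the margins alone. The margins used below (13/17 for the rows, 12/18 for the columns, n = 30) are assumed for illustration; they correspond to the Fisher twin example discussed later in the talk.

```python
# Duncan & Davis (1953) bounds on P1 and P2 from the margins alone,
# plus a check of the accounting identity n1.P1 + n2.P2 = n.1.

def duncan_davis_bounds(n1_dot, n2_dot, n_dot1):
    """Return (L1, U1, L2, U2) for P1 = n11/n1. and P2 = n21/n2.."""
    L1 = max(0.0, (n_dot1 - n2_dot) / n1_dot)
    U1 = min(n_dot1 / n1_dot, 1.0)
    L2 = max(0.0, (n_dot1 - n1_dot) / n2_dot)
    U2 = min(n_dot1 / n2_dot, 1.0)
    return L1, U1, L2, U2

# Assumed margins (Fisher twin example): 13 and 17 row totals, 12 in column 1.
n1_dot, n2_dot, n_dot1 = 13, 17, 12

L1, U1, L2, U2 = duncan_davis_bounds(n1_dot, n2_dot, n_dot1)
print(L1, U1)  # 0.0 0.923...

# Accounting identity: any pair (P1, P2) consistent with the margins
# satisfies n1.P1 + n2.P2 = n.1, so P2 is determined by P1.
P1 = 0.5
P2 = (n_dot1 - n1_dot * P1) / n2_dot
assert abs(n1_dot * P1 + n2_dot * P2 - n_dot1) < 1e-12
```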
Non-Symmetric Correspondence Analysis

Define

    π_ij = p_ij/p_i• − p•j

as the difference between the unconditional marginal prediction p•j (column marginal proportion) and the conditional prediction p_ij/p_i• (row profile).

Rows → predictor variable. Columns → response variable.

Goodman–Kruskal tau index (1954), for a 2 × 2 contingency table:

    τ = ( Σ_{i=1}^{2} Σ_{j=1}^{2} p_i• π_ij² ) / ( 1 − Σ_{j=1}^{2} p•j² )

Light & Margolin (1971):

    C = (n − 1) τ ~ χ²₁

NSCA (D'Ambra & Lauro, 1989).
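The tau index and the C statistic can be computed directly from a full 2 × 2 table of joint proportions. A minimal sketch; the table and sample size below are made up for illustration, not taken from the talk.

```python
# Goodman-Kruskal tau and the Light & Margolin C statistic for a 2x2 table.

def tau_and_C(p, n):
    """p: 2x2 list of joint proportions (summing to 1); n: sample size."""
    row = [p[0][0] + p[0][1], p[1][0] + p[1][1]]               # p_i.
    col = [p[0][0] + p[1][0], p[0][1] + p[1][1]]               # p_.j
    # pi_ij = p_ij / p_i. - p_.j  (row profile minus column margin)
    num = sum(row[i] * (p[i][j] / row[i] - col[j]) ** 2
              for i in range(2) for j in range(2))
    tau = num / (1.0 - sum(c ** 2 for c in col))
    C = (n - 1) * tau                                          # ~ chi^2, 1 df
    return tau, C

# Illustrative (made-up) table with n = 100.
p = [[0.3, 0.2], [0.1, 0.4]]
tau, C = tau_and_C(p, 100)
print(round(tau, 4), round(C, 2))  # 0.1667 16.5
```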
Non-Symmetric Correspondence Analysis

Decomposition of π_ij:

    π_ij = ρ x_i y_j

Akin to the SVD and BMD of a general two-way contingency table (Lancaster, 1969).

Orthonormality:

    x_i = (−1)^{i−1} √(p_{3−i}•/p_i•),   i = 1, 2,   with p1• x1 + p2• x2 = 0 and p1• x1² + p2• x2² = 1
    y_j = (−1)^{j−1} / √2,               j = 1, 2,   with y1 + y2 = 0 and y1² + y2² = 1
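The rank-one decomposition and the two sets of orthonormality constraints can be verified numerically. A sketch, using an arbitrary illustrative 2 × 2 table (not from the talk):

```python
import math

# Rank-one decomposition pi_ij = rho * x_i * y_j of the 2x2 NSCA residuals,
# with x weighted by the row margins and y unweighted.

p = [[0.3, 0.2], [0.1, 0.4]]                    # arbitrary joint proportions
row = [p[0][0] + p[0][1], p[1][0] + p[1][1]]    # p_i.
col = [p[0][0] + p[1][0], p[0][1] + p[1][1]]    # p_.j

x = [math.sqrt(row[1] / row[0]), -math.sqrt(row[0] / row[1])]
y = [1 / math.sqrt(2), -1 / math.sqrt(2)]

# Orthonormality of x (row-margin weighted) and of y (unweighted).
assert abs(row[0] * x[0] + row[1] * x[1]) < 1e-12            # centred
assert abs(row[0] * x[0]**2 + row[1] * x[1]**2 - 1) < 1e-12  # unit norm
assert abs(y[0] + y[1]) < 1e-12 and abs(y[0]**2 + y[1]**2 - 1) < 1e-12

# rho recovered from the (1,1) residual reproduces every pi_ij.
pi = [[p[i][j] / row[i] - col[j] for j in range(2)] for i in range(2)]
rho = pi[0][0] / (x[0] * y[0])
for i in range(2):
    for j in range(2):
        assert abs(pi[i][j] - rho * x[i] * y[j]) < 1e-12
print(round(rho, 4))
```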
Bounds

Under the hypothesis of independence, ρ is an asymptotic standard normal random variable and can be expressed as a function of P1 and of the marginal information:

    ρ = π11 / (x1 y1) = √(2 p1•/p2•) (P1 − p•1)

Duncan & Davis (1953) showed that

    −√2 min( p•1 √(p1•/p2•), p•2 √(p2•/p1•) )  ≤  ρ  ≤  √2 min( p•1 √(p2•/p1•), p•2 √(p1•/p2•) )

which only requires the marginal information.
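The bounds on ρ follow by pushing the Duncan & Davis bounds on P1 through the linear map ρ(P1) = √(2 p1•/p2•)(P1 − p•1). A sketch, again using the assumed margins of the Fisher twin example for illustration:

```python
import math

# rho as a function of P1 given only the margins, and its bounds.

p1_dot, p2_dot = 13/30, 17/30    # row margins (assumed, Fisher twin example)
p_dot1, p_dot2 = 12/30, 18/30    # column margins

def rho(P1):
    return math.sqrt(2 * p1_dot / p2_dot) * (P1 - p_dot1)

# Duncan & Davis bounds on P1 ...
L1 = max(0.0, (p_dot1 - p2_dot) / p1_dot)
U1 = min(p_dot1 / p1_dot, 1.0)

# ... give the closed-form bounds on rho.
rho_lo = -math.sqrt(2) * min(p_dot1 * math.sqrt(p1_dot / p2_dot),
                             p_dot2 * math.sqrt(p2_dot / p1_dot))
rho_hi = math.sqrt(2) * min(p_dot1 * math.sqrt(p2_dot / p1_dot),
                            p_dot2 * math.sqrt(p1_dot / p2_dot))

# The closed forms agree with rho evaluated at the P1 bounds.
assert abs(rho(L1) - rho_lo) < 1e-12 and abs(rho(U1) - rho_hi) < 1e-12
print(round(rho_lo, 3), round(rho_hi, 3))  # -0.495 0.647
```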
NSCA and Classical Coordinates

Some insight into the asymmetric association may be gained using NSCA, by constructing a classical plot or a biplot graphical display. For a classical plot,

    f_i = ρ x_i = ρ (−1)^{i−1} √(p_{3−i}•/p_i•),     g_j = ρ y_j = ρ (−1)^{j−1} / √2

These coordinates may be expressed in terms of P1 and the marginal proportions:

    f1 = √2 (P1 − p•1)                        g1 = √(p1•/p2•) (P1 − p•1)
    f2 = −√2 (p1•/p2•) (P1 − p•1)             g2 = −√(p1•/p2•) (P1 − p•1)
NSCA and the Biplot

To depict the asymmetric relationship between row and column categories, consider a row metric preserving biplot (Kroonenberg & Lombardo, 1999). The biplot coordinates are, for the ith row,

    f_i = ρ x_i = ρ (−1)^{i−1} √(p_{3−i}•/p_i•)

and for the jth column,

    g_j = y_j = (−1)^{j−1} / √2

In the row isometric biplot the column coordinates are projected onto the line defined by the row coordinates; the shorter the distance, the stronger the predictability!

Bounds can be computed for the coordinates. For example,

    −√2 min( p•1, p2• p•2/p1• )  ≤  f1  ≤  √2 min( p2• p•1/p1•, p•2 )
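As with ρ, the coordinate bounds are just the Duncan & Davis bounds on P1 mapped through f1 = √2 (P1 − p•1). A sketch with the same assumed margins:

```python
import math

# Bounds on the first row biplot coordinate f1, from the margins alone.

p1_dot, p2_dot = 13/30, 17/30    # row margins (assumed, Fisher twin example)
p_dot1, p_dot2 = 12/30, 18/30    # column margins

def f1(P1):
    return math.sqrt(2) * (P1 - p_dot1)

L1 = max(0.0, (p_dot1 - p2_dot) / p1_dot)
U1 = min(p_dot1 / p1_dot, 1.0)

# Closed-form bounds on f1.
f1_lo = -math.sqrt(2) * min(p_dot1, p2_dot * p_dot2 / p1_dot)
f1_hi = math.sqrt(2) * min(p2_dot * p_dot1 / p1_dot, p_dot2)

# They agree with f1 evaluated at the P1 bounds.
assert abs(f1(L1) - f1_lo) < 1e-9 and abs(f1(U1) - f1_hi) < 1e-9
print(round(f1_lo, 3), round(f1_hi, 3))  # -0.566 0.74
```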
Bounds of P1

100(1 − α)% confidence bounds under the null hypothesis of independence:

    L* = p•1 − z_{1−α/2} √( (1 − Σ_j p•j²) p2• / (2 (n − 1) p1•) )  ≤  P1  ≤  p•1 + z_{1−α/2} √( (1 − Σ_j p•j²) p2• / (2 (n − 1) p1•) ) = U*

    L = max(0, L*),    U = min(1, U*)

Given α and the aggregate data, there is a significant asymmetric association between the two dichotomous variables if L1 ≤ P1 < L or U < P1 ≤ U1.
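A sketch of this computation, assuming the half-width z_{1−α/2} √((1 − Σ_j p•j²) p2• / (2(n−1) p1•)) and the margins of the Fisher twin example; the normal quantile is hard-coded to avoid a scipy dependency. Reassuringly, it reproduces the "no prediction when 0.19 ≤ P1 ≤ 0.60" interval quoted in the example.

```python
import math

# 95% bounds (L, U) on P1 under independence, from the margins alone.

n = 30
p1_dot, p2_dot = 13/30, 17/30    # row margins (assumed, Fisher twin example)
p_dot1, p_dot2 = 12/30, 18/30    # column margins

z = 1.959964                      # z_{1-alpha/2} for alpha = 0.05

half_width = z * math.sqrt((1 - (p_dot1**2 + p_dot2**2)) * p2_dot
                           / (2 * (n - 1) * p1_dot))
L_star, U_star = p_dot1 - half_width, p_dot1 + half_width
L, U = max(0.0, L_star), min(1.0, U_star)
print(round(L, 2), round(U, 2))  # 0.2 0.6
```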
Aggregate Prediction Index (API)

[Figure: the chi-squared statistic and the C statistic plotted against P1, with the critical value χ²_α, the observed value p̂1, and the bounds L1, L, U, U1 marked; the shaded regions indicate a statistically significant association.]

Consider a plot of the chi-squared statistic versus P1. If the area under C but above χ²_α is large, then there is evidence that the row categories are good predictors of the column categories.
Aggregate Prediction Index (API)

[Figure: as before – the C statistic plotted against P1, with χ²_α and the bounds L1, L, U, U1 marked.]

This area may be calculated by

    API = 100 ( 1 − ∫_L^U C(P1 | p1•, p•1) dP1 / ∫_{L1}^{U1} C(P1 | p1•, p•1) dP1 )
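Assuming the API takes a ratio-of-areas form, API = 100 (1 − ∫_L^U C dP1 / ∫_{L1}^{U1} C dP1), it can be evaluated by simple numerical integration. This sketch uses composite Simpson's rule and the assumed Fisher-twin margins; the exact normalisation on the slide is only partially recoverable, so the numeric value is illustrative rather than a reproduction of the quoted API.

```python
import math

# Numerical evaluation of an API of ratio-of-areas form (assumed definition).

n = 30
p1_dot, p2_dot, p_dot1, p_dot2 = 13/30, 17/30, 12/30, 18/30

def C(P1):
    # C(P1 | p1., p.1) = (n-1) * tau(P1), with tau written via rho.
    rho2 = 2 * (p1_dot / p2_dot) * (P1 - p_dot1) ** 2
    return (n - 1) * rho2 / (1 - (p_dot1**2 + p_dot2**2))

def simpson(f, a, b, m=1000):
    """Composite Simpson's rule on 2m subintervals."""
    h = (b - a) / (2 * m)
    s = (f(a) + f(b)
         + 4 * sum(f(a + (2 * k - 1) * h) for k in range(1, m + 1))
         + 2 * sum(f(a + 2 * k * h) for k in range(1, m)))
    return s * h / 3

# Duncan & Davis bounds and 95% independence bounds on P1.
L1 = max(0.0, (p_dot1 - p2_dot) / p1_dot)
U1 = min(p_dot1 / p1_dot, 1.0)
z = 1.959964
hw = z * math.sqrt((1 - (p_dot1**2 + p_dot2**2)) * p2_dot
                   / (2 * (n - 1) * p1_dot))
L, U = max(0.0, p_dot1 - hw), min(1.0, p_dot1 + hw)

api = 100 * (1 - simpson(C, L, U) / simpson(C, L1, U1))
print(round(api, 1))
```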
Example – Fisher's Twin Data

Fisher's data study 30 criminal twins, classified according to whether they are monozygotic or dizygotic. The table also classifies whether the twins have been convicted of a criminal offence.

The Goodman–Kruskal tau index is 0.434. The C statistic is 12.597, with p-value = 0.0004 → the type of twin is a good predictor of the conviction status of a criminal.
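These values can be reproduced from the full table. The interior counts are not shown on the slide, but Fisher's (1935) twin data are well known (monozygotic: 10 convicted, 3 not; dizygotic: 2 convicted, 15 not), so this sketch assumes those cells:

```python
# Reproducing tau = 0.434 and C = 12.597 from Fisher's twin data.

counts = [[10, 3],   # monozygotic: convicted, not convicted (assumed cells)
          [2, 15]]   # dizygotic:   convicted, not convicted
n = sum(sum(r) for r in counts)                 # 30 criminal twins
p = [[c / n for c in r] for r in counts]

row = [sum(r) for r in p]                                   # p_i.
col = [sum(p[i][j] for i in range(2)) for j in range(2)]    # p_.j

num = sum(row[i] * (p[i][j] / row[i] - col[j]) ** 2
          for i in range(2) for j in range(2))
tau = num / (1 - sum(c ** 2 for c in col))
C = (n - 1) * tau

print(round(tau, 3), round(C, 3))  # 0.434 12.597
```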
Example – Fisher's Twin Data

But, as Fisher (1935) did, suppose we "blot out" the cells of the table.

Question: What information do the margins provide in understanding the extent to which the variables are associated?

We shall
• consider the non-symmetric correspondence analysis using only the aggregate data, and
• calculate the aggregate prediction index.
Example – Fisher’s Twin Data 51.7 API 0.05 = 56,85 If we consider the 5% C – Statistic level of significance, 23.0 the margins provide strong evidence that there may exist a significant prediction of conviction status 0.0 given twin type 0.92 1.00 0.00 0.20 0.40 0.60 0.80 12 p 1 0 . 4 30 2 26 30 P 12 1 C P No prediction when 0,19 ≤ P 1 ≤ 0.60 1 34 17
Example – Fisher's Twin Data

[Figure: classical plot, as proposed in the CA of aggregate data (Beh, 2008), showing Row = monozygotic, Row = dizygotic, Column = convicted, Column = not convicted.]

No prediction if 0.19 ≤ P1 ≤ 0.60.
Example – Fisher's Twin Data

[Figure: row isometric biplot showing Row = monozygotic, Row = dizygotic, Column = convicted, Column = not convicted.]

No prediction if 0.19 ≤ P1 ≤ 0.60. Inverse prediction if 0.00 ≤ P1 ≤ 0.19. Direct prediction when 0.60 ≤ P1 ≤ 0.92.