Categorical data Modelling and Independence R.W. Oldford
Eikosograms - Dependence/independence As with continuous data, interest lies in assessing whether the values of Categorical variates might have been generated independently. Independence occurs between X and Y if, and only if, Pr ( Y = y X = x ) = Pr ( Y = y ) OR, equivalently, Pr ( X = x Y = y ) = Pr ( X = x ) for all possible values x and y . This condition shows up in an eikosogram when the eikosogram is flat , e.g. x_4 y_3 x_3 Y y_2 X x_2 y_1 x_1 x_1 x_2 x_3 x_4 X y_1 y_2 y_3 Y
Eikosograms - Dependence/independence This is simplest to see when each variate is only binary. e.g. y_2 x_2 Y X y_1 x_1 x_1 x_2 X y_1 y_2 Y These two variates appear to be independent. Why?
Eikosograms - Dependence/independence We could use our lineup test to get a visual test of independence. We set this up mathematically as follows. Let n ij denote the number of values when X = x j and Y = y i . X = x 1 X = x 2 X = x J − 1 X = x J Row totals · · · Y = y I n I 1 n I 2 n I ( J − 1) n IJ n I + · · · Y = y ( I − 1) n ( I − 1)1 n ( I − 1)2 n ( I − 1)( J − 1) n ( I − 1) J n ( I − 1)+ · · · . . . . . . . . . . . . . . . . . . · · · Y = y 2 n 21 n 22 n 2( J − 1) n 2 J n 2+ · · · Y = y 1 n 11 n 12 n 1( J − 1) n 1 J n 1+ · · · Column totals n +1 n +2 n +( J − 1) n + J n ++ · · · p + j = � We could estimate the marginal probabilities for X as � Pr ( X = x j ) = n + j / n ++ and the j = � X as � conditional probabilities for Y p i Pr ( Y = y i X = x j ) = n ij / n + j . Similarly we could estimate the marginal probabilities for Y and the conditional probabilities for X Y .
Eikosograms - Dependence/independence Under independence � j = � X = x j ) = � Pr ( Y = y i ) and � p ij = � Pr ( Y = y i ) × � p i Pr ( Y = y i Pr ( X = x j ). We now use these estimates (under the hypothesis of independence) to generate values of the categorical variates when the hypothesis holds. For example, suppose we are looking only at the eikosogram for Y X - i.e. X on the horizontal, Y on the vertical. We could choose to condition on the known values of n + j and then for each j sample a vector of n ij values which are drawn from a multinomial distribution Mult ( n + j , p j ) where � � n 1+ n 2+ n I + p j = = p , , . . . , n ++ n ++ n ++ Notes This p j does not depend on j ; it is our marginal estimate of ( Pr ( Y = y 1 ) , Pr ( Y = y 2 ) , . . . , Pr ( Y = y I ) . ) Doing this for every j = 1 , . . . , J will produce a new table, one that has the same column totals as the original, but whose column entries have been generated according to the hypothesis of independence. We note also that we could have first generated column totals using a Mult ( n ++ , q ) where � � n +1 n +2 n + J q = , , . . . , n ++ n ++ n ++ and use these � n + j values in place of n + j to generate the columns of the new table. Following this methodology, the column widths of the corresponding eikosogram would change with each table. The second approach is an unconditional approach, the first a conditional one (here being conditioned on holding the column totals fixed).
Eikosograms - Dependence/independence In R , the apply() allows us to get the sums we want. For example, consider the table x x ## lower ## upper w x y z ## A 3 3 3 3 ## B 4 3 2 1
Eikosograms - Dependence/independence In R , the apply() allows us to get the sums we want. For example, consider the table x x ## lower ## upper w x y z ## A 3 3 3 3 ## B 4 3 2 1 We now get the sums we want using apply(X, MARGIN, FUN, ...) # row sums (sum across other dimensions for each value of the first) apply (x, 1, sum) ## A B ## 12 10
Eikosograms - Dependence/independence In R , the apply() allows us to get the sums we want. For example, consider the table x x ## lower ## upper w x y z ## A 3 3 3 3 ## B 4 3 2 1 We now get the sums we want using apply(X, MARGIN, FUN, ...) # row sums (sum across other dimensions for each value of the first) apply (x, 1, sum) ## A B ## 12 10 # col sums (sum across other dimensions for each value of dimension 2) apply (x, 2, sum) ## w x y z ## 7 6 5 4 This works for multiway tables ( MARGIN can be a vector of dimensions).
Eikosograms - Dependence/independence An R function which would generate a new table following the hypothesis of independence according to the first (conditional test) way would be generateTable <- function (TwoWayTable, y, x, conditionalTest=TRUE){ varnames <- names ( dimnames (TwoWayTable)) nvars <- length (varnames) respID <- (1 : nvars)[varnames == y] condID <- (1 : nvars)[varnames == x] new_table <- TwoWayTable n <- sum (TwoWayTable) # Get the marginal probabilities for the response variate p <- apply (TwoWayTable, respID, sum) / n # Get the marginal probabilities for the conditioning variate q <- apply (TwoWayTable, condID, sum) / n n_resp <- length (p) n_cond <- length (q) if (conditionalTest) { # Preserve the conditional totals m <- apply (TwoWayTable, condID, sum) } else { m <- rmultinom (1,n, q)} for (c in 1 : n_cond){ # Generate new counts from a multinomial newRespVals <- rmultinom (1,m[c],p) if (respID < condID){ for (i in 1 :length (newRespVals)){ new_table[i,c] <- newRespVals[i] } } else { for (i in 1 :length (newRespVals)){ new_table[c,i] <- newRespVals[i] } } } new_table }
Eikosograms - Dependence/independence The graphic of choice to assess independence is an eikosogram. Because the eikos function is implemented using grid , we need to update the lineup function: lineup <- function (data, showSuspect=NULL, generateSuspect=NULL, trueLoc=NULL, layout = c (5,4), # We add the "pkg" argument pkg= c ("graphics","grid", "ggplot2")) { # # Get the number of suspects in total nSuspects <- layout[1] * layout[2] if ( is.null (trueLoc)) {trueLoc <- sample (1 : nSuspects, 1)} if ( is.null (showSuspect)) { stop ("need a plot function for the suspect")} if ( is.null (generateSuspect)) { stop ("need a function to generate suspect")} # Need to decide which subject to present presentSuspect <- function (suspectNo) { if (suspectNo != trueLoc) {data <- generateSuspect (data)} showSuspect (data, suspectNo) } # Up to here, there is no change beyond the additional "pkg" argument. # CONTINUED ON NEXT SLIDE
Eikosograms - Dependence/independence # CONTINUED FROM PREVIOUS SLIDE # Plotting depends on the plotting package pkg <- match.arg (pkg) switch (pkg, "graphics" = { savePar <- par (mfrow=layout, mar= c (2.5, 0.1, 3, 0.1), oma= rep (0,4)) sapply (1 : nSuspects, FUN = presentSuspect) # The plotLayout here is of no value for graphics plotLayout <- layout par (savePar) }, "grid" = { grobs <- lapply (1 : nSuspects, FUN = presentSuspect) plotLayout <- marrangeGrob (grobs=grobs, nrow=layout[1], ncol=layout[2]) }, "ggplot2" = { ##ggplot2 plots <- lapply (1 : nSuspects, FUN = presentSuspect) plotLayout <- marrangeGrob (grobs=plots, nrow=layout[1], ncol=layout[2]) }, stop ("Wrong 'pkg'") ) # CONTINUED ON NEXT SLIDE
Eikosograms - Dependence/independence # CONTINUED FROM PREVIOUS SLIDE # Obfuscate location to keep us honest possibleBaseVals <- 3 :min (2 * nSuspects, 50) # remove easy base values possibleBaseVals <- possibleBaseVals[possibleBaseVals != 10 & possibleBaseVals != 5] base <- sample (possibleBaseVals, 1) offset <- sample (5 :min (5 * nSuspects, 125),1) # return obfuscated location # return obfuscated location and plot (if not "graphics") list (trueLoc = paste0 ("log(",base ^ (trueLoc + offset), ", base=",base,") - ", offset), plotLayout = plotLayout ) } We’ll work with the SAheart data from the package ElemStatLearn and assess whether coronary heard disease is independent of family history. library (ElemStatLearn) # First, get the chd events, ordered so that chd =1 qppears as the bottom bar of an eikosogram chd <- c ("None", "CHD")[1 + SAheart $ chd] # Create the two way table heart <- table (SAheart $ famhist, chd, dnn = c ("famhist", "coronary"))
Eikosograms - Dependence/independence To test independence for two way tables, the data structure passed to line-up needs to be a little richer than that we had before when, say, comparing distributions. # Here is the data structure that we will use for two way tables data <- list (table = heart, y = "coronary", x = "famhist") We need to adapt the function for generating a new table to this data structure. We’ll introduce a function for each type of test (conditional or unconditional) generateTableDataCond <- function (data){ newtable <- generateTable (data $ table, data $ y, data $ x) list (table=newtable, y = data $ y, x = data $ x) } generateTableDataUncond <- function (data){ newtable <- generateTable (data $ table, data $ y, data $ x, FALSE) list (table=newtable, y = data $ y, x = data $ x) }
Recommend
More recommend