Categorical data Reasoning by diagrams R.W. Oldford
Crossed data - tables
The main data structure for crossed categorical data is a table. Each variate has a finite number of values (categories):
city <- c("Kitchener", "Waterloo")
housing <- c("House", "Apartment", "Residence")
All combinations of one value from each variate are possible (crossed), and we have the number of times each combination occurs:
# fake data
counts <- rpois(6, lambda = 50)
Arranged in a rectangular array:
vacancy <- matrix(counts, nrow = length(city), ncol = length(housing), byrow = TRUE,
                  dimnames = list(city = city, housing = housing))
And now coerced to be an object of class table:
vacancy <- as.table(vacancy)
vacancy
##            housing
## city        House Apartment Residence
##   Kitchener    52        53        46
##   Waterloo     47        64        43
Crossed data - tables
The table can be a many-way array from crossing many categorical variates:
term <- c("Fall", "Winter", "Spring")
# more fake counts
counts <- seq(from = 10, to = 180, by = 10)
vacancy <- array(counts, dim = c(length(city), length(housing), length(term)),
                 dimnames = list(city = city, housing = housing, term = term))
vacancy <- as.table(vacancy)
vacancy
## , , term = Fall
##
##            housing
## city        House Apartment Residence
##   Kitchener    10        30        50
##   Waterloo     20        40        60
##
## , , term = Winter
##
##            housing
## city        House Apartment Residence
##   Kitchener    70        90       110
##   Waterloo     80       100       120
##
## , , term = Spring
##
##            housing
## city        House Apartment Residence
##   Kitchener   130       150       170
##   Waterloo    140       160       180
Note that when filling the array, the earlier indices change more quickly than the later indices.
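The fill order noted above can be checked directly. The following is a minimal sketch that rebuilds the same fake array and maps a cell (i, j, k) back to its position in the `counts` vector; the helper name `cell_position` is hypothetical and introduced only for illustration.

```r
city    <- c("Kitchener", "Waterloo")
housing <- c("House", "Apartment", "Residence")
term    <- c("Fall", "Winter", "Spring")
counts  <- seq(from = 10, to = 180, by = 10)
vacancy <- array(counts, dim = c(2, 3, 3),
                 dimnames = list(city = city, housing = housing, term = term))

# Hypothetical helper: position in the fill vector of cell (i, j, k),
# with the first index varying fastest and the last index slowest.
cell_position <- function(i, j, k, dims = dim(vacancy)) {
  i + (j - 1) * dims[1] + (k - 1) * dims[1] * dims[2]
}

pos <- cell_position(2, 3, 2)   # Waterloo, Residence, Winter
counts[pos]                     # value taken from the fill vector
vacancy["Waterloo", "Residence", "Winter"]   # same cell, looked up by name
```

Both lookups return the same count, which agrees with the Winter slice printed above.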
Crossed data - tables
The order of dimensions can be rearranged with the R function aperm(...):
aperm(vacancy, perm = c(3, 2, 1))
## , , city = Kitchener
##
##         housing
## term     House Apartment Residence
##   Fall      10        30        50
##   Winter    70        90       110
##   Spring   130       150       170
##
## , , city = Waterloo
##
##         housing
## term     House Apartment Residence
##   Fall      20        40        60
##   Winter    80       100       120
##   Spring   140       160       180
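As a small check on what aperm does, the sketch below rebuilds the fake `vacancy` array, permutes it, and confirms that a cell addressed by name holds the same count before and after. Applying the same permutation c(3, 2, 1) a second time restores the original order, because reversing the dimensions twice is the identity.

```r
vacancy <- array(seq(from = 10, to = 180, by = 10), dim = c(2, 3, 3),
                 dimnames = list(city = c("Kitchener", "Waterloo"),
                                 housing = c("House", "Apartment", "Residence"),
                                 term = c("Fall", "Winter", "Spring")))

permuted <- aperm(vacancy, perm = c(3, 2, 1))
dim(permuted)               # dimensions are now term x housing x city
names(dimnames(permuted))   # the variate names are permuted as well

# The same cell, addressed in the new dimension order
permuted["Winter", "Apartment", "Waterloo"]
vacancy["Waterloo", "Apartment", "Winter"]

# Permuting again with c(3, 2, 1) undoes the first permutation
restored <- aperm(permuted, perm = c(3, 2, 1))
all.equal(restored, vacancy)
```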
Crossed data - constructing tables from data
Have an existing data frame with categorical variates:
SAheart[1:3, ]
##   sbp tobacco  ldl adiposity famhist typea obesity alcohol age chd
## 1 160   12.00 5.73     23.11 Present    49   25.30   97.20  52   1
## 2 144    0.01 4.41     28.61  Absent    55   28.87    2.06  63   1
## 3 118    0.08 3.48     32.28 Present    52   29.14    3.81  46   0
Create the table directly from individual factors (like famhist) or unique values (like chd):
table(SAheart$chd, SAheart$famhist, dnn = c("chd", "famhist"))
##    famhist
## chd Absent Present
##   0    206      96
##   1     64      96
Or, by cross-tabulation (“cross tabs” or xtabs):
xtabs(~ chd + famhist, data = SAheart) # Note formula
##    famhist
## chd Absent Present
##   0    206      96
##   1     64      96
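The two constructions can be compared on any data frame. The sketch below uses the built-in `mtcars` data (an assumption made so the example runs without loading SAheart) and crosses `cyl` with `gear`; it also shows `as.data.frame()` turning the table back into one row per cell.

```r
# Cross-classify two variates of the built-in mtcars data in both ways
tab_table <- table(mtcars$cyl, mtcars$gear, dnn = c("cyl", "gear"))
tab_xtabs <- xtabs(~ cyl + gear, data = mtcars)

all(tab_table == tab_xtabs)   # the same counts cell by cell
sum(tab_table)                # one count per row of the data frame

# Back to "long" form: one row per combination, with its count in Freq
long <- as.data.frame(tab_xtabs)
head(long, 3)
```

The formula interface tends to be convenient when the variates live in one data frame, since the dimension names come from the formula rather than from `dnn`.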
Crossed data - working with tables
Consider the three-way table (a 4 x 4 x 2 array) HairEyeColor:
## , , Sex = Male
##
##        Eye
## Hair    Brown Blue Hazel Green
##   Black    32   11    10     3
##   Brown    53   50    25    15
##   Red      10   10     7     7
##   Blond     3   30     5     8
##
## , , Sex = Female
##
##        Eye
## Hair    Brown Blue Hazel Green
##   Black    36    9     5     2
##   Brown    66   34    29    14
##   Red      16    7     7     7
##   Blond     4   64     5     8
The names of its variates (the names of its dimnames), in order, are:
names(dimnames(HairEyeColor))
## [1] "Hair" "Eye"  "Sex"
These names, and the dimension order they imply, are used to create interesting sub-tables or alternative tables.
Crossed data - working with tables
Selecting slices (conditioning)
HairEyeColor["Black", , ]
##        Sex
## Eye     Male Female
##   Brown   32     36
##   Blue    11      9
##   Hazel   10      5
##   Green    3      2
HairEyeColor[, "Green", ]
##        Sex
## Hair    Male Female
##   Black    3      2
##   Brown   15     14
##   Red      7      7
##   Blond    8      8
HairEyeColor["Black", "Blue", ]
##   Male Female
##     11      9
HairEyeColor["Black", "Green", "Male"]
## [1] 3
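Slicing drops the conditioned dimension by default. The sketch below shows the effect on the dimensions, how `drop = FALSE` keeps the slice as a three-way array, and how proportions within a slice give the conditional distribution of the remaining variates.

```r
# Condition on Hair == "Black": the Hair dimension is dropped
black <- HairEyeColor["Black", , ]
dim(black)       # Eye x Sex

# Keep all three dimensions, with Hair reduced to a single level
black_3d <- HairEyeColor["Black", , , drop = FALSE]
dim(black_3d)

# Proportions within the slice: the joint distribution of Eye and Sex
# among those with black hair
round(prop.table(black), 3)
sum(black)       # number of people with black hair
```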
Crossed data - working with tables
Collapsing dimensions (marginalizing, projecting)
# Zero dimensional
margin.table(HairEyeColor)
## [1] 592
# 1 dimensional -- here margin 1 ("Hair") is preserved
margin.table(HairEyeColor, margin = 1)
## Hair
## Black Brown   Red Blond
##   108   286    71   127
# 2 dimensional -- here margins 1 and 2 ("Hair", "Eye") are preserved
margin.table(HairEyeColor, margin = c(1, 2))
##        Eye
## Hair    Brown Blue Hazel Green
##   Black    68   20    15     5
##   Brown   119   84    54    29
##   Red      26   17    14    14
##   Blond     7   94    10    16
# Note: except for the 0-dimensional case, these are the same as using "apply" with "sum"
apply(HairEyeColor, MARGIN = 1, FUN = sum)
## Black Brown   Red Blond
##   108   286    71   127
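The equivalence with apply holds for two preserved margins as well, and marginalizing can be done in stages. A minimal sketch:

```r
# Two preserved margins: margin.table versus apply with sum
hair_eye  <- margin.table(HairEyeColor, margin = c(1, 2))
via_apply <- apply(HairEyeColor, MARGIN = c(1, 2), FUN = sum)
all(hair_eye == via_apply)

# Collapsing in stages gives the same result as collapsing at once
hair_staged <- margin.table(hair_eye, margin = 1)
hair_direct <- margin.table(HairEyeColor, margin = 1)
all(hair_staged == hair_direct)

# Every margin carries the same grand total
sum(hair_eye)
sum(hair_direct)
```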
Crossed data - working with tables
Summing along every margin (new variate value Sum for each variate)
# Every margin is summed
addmargins(HairEyeColor)
## , , Sex = Male
##
##        Eye
## Hair    Brown Blue Hazel Green Sum
##   Black    32   11    10     3  56
##   Brown    53   50    25    15 143
##   Red      10   10     7     7  34
##   Blond     3   30     5     8  46
##   Sum      98  101    47    33 279
##
## , , Sex = Female
##
##        Eye
## Hair    Brown Blue Hazel Green Sum
##   Black    36    9     5     2  52
##   Brown    66   34    29    14 143
##   Red      16    7     7     7  37
##   Blond     4   64     5     8  81
##   Sum     122  114    46    31 313
##
## , , Sex = Sum
##
##        Eye
## Hair    Brown Blue Hazel Green Sum
##   Black    68   20    15     5 108
##   Brown   119   84    54    29 286
##   Red      26   17    14    14  71
##   Blond     7   94    10    16 127
##   Sum     220  215    93    64 592
Crossed data - working with tables
Summing along a single margin
# Just produce marginal sums over dimension 2 ("Eye") values
# for each pair (i, k) of remaining variates "Hair" and "Sex"
addmargins(HairEyeColor, margin = 2)
## , , Sex = Male
##
##        Eye
## Hair    Brown Blue Hazel Green Sum
##   Black    32   11    10     3  56
##   Brown    53   50    25    15 143
##   Red      10   10     7     7  34
##   Blond     3   30     5     8  46
##
## , , Sex = Female
##
##        Eye
## Hair    Brown Blue Hazel Green Sum
##   Black    36    9     5     2  52
##   Brown    66   34    29    14 143
##   Red      16    7     7     7  37
##   Blond     4   64     5     8  81
Crossed data - working with tables
Summing along two margins
# Produce marginal sums over both dimensions 1 and 2 ("Hair" and "Eye")
# for each value of "Sex"
addmargins(HairEyeColor, margin = c(1, 2))
## , , Sex = Male
##
##        Eye
## Hair    Brown Blue Hazel Green Sum
##   Black    32   11    10     3  56
##   Brown    53   50    25    15 143
##   Red      10   10     7     7  34
##   Blond     3   30     5     8  46
##   Sum      98  101    47    33 279
##
## , , Sex = Female
##
##        Eye
## Hair    Brown Blue Hazel Green Sum
##   Black    36    9     5     2  52
##   Brown    66   34    29    14 143
##   Red      16    7     7     7  37
##   Blond     4   64     5     8  81
##   Sum     122  114    46    31 313
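Each summed margin adds one level named Sum to that variate, so the result can be indexed by name like any other table. The sketch below checks the resulting dimensions and reads off a few totals.

```r
# Sum over Hair and Eye only: Sex keeps its two levels
two_margins <- addmargins(HairEyeColor, margin = c(1, 2))
dim(two_margins)                       # 5 x 5 x 2
two_margins["Sum", "Sum", "Male"]      # total number of males
two_margins["Sum", "Sum", "Female"]    # total number of females

# Sum over every margin: Sex gains a "Sum" level as well
all_margins <- addmargins(HairEyeColor)
dim(all_margins)                       # 5 x 5 x 3
all_margins["Sum", "Sum", "Sum"]       # grand total
```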
Crossed data - working with tables
Proportions (depends on which margin is fixed)
# No margins fixed, just total ... single multinomial
round(prop.table(HairEyeColor), 3)
## , , Sex = Male
##
##        Eye
## Hair    Brown  Blue Hazel Green
##   Black 0.054 0.019 0.017 0.005
##   Brown 0.090 0.084 0.042 0.025
##   Red   0.017 0.017 0.012 0.012
##   Blond 0.005 0.051 0.008 0.014
##
## , , Sex = Female
##
##        Eye
## Hair    Brown  Blue Hazel Green
##   Black 0.061 0.015 0.008 0.003
##   Brown 0.111 0.057 0.049 0.024
##   Red   0.027 0.012 0.012 0.012
##   Blond 0.007 0.108 0.008 0.014
Possible generative model: multinomial. Here the counts $n_{ijk}$ have fixed total $n = n_{+++} = \sum_{ijk} n_{ijk} = 592$, and
$$\Pr(\text{Data}) = \binom{n}{n_{111}\; n_{211}\; \cdots\; n_{442}} \, p_{111}^{n_{111}} \cdots p_{442}^{n_{442}}$$
with $p_{+++} = \sum_{i=1}^{4} \sum_{j=1}^{4} \sum_{k=1}^{2} p_{ijk} = 1$.
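To make the single-multinomial model concrete, the sketch below plugs the observed proportions in as the cell probabilities (an assumption made for illustration; the slide does not say how the probabilities would be chosen), simulates one table with the same fixed total, and evaluates the log-probability of the observed counts. The seed is arbitrary.

```r
n     <- sum(HairEyeColor)                      # fixed total, 592
p_hat <- as.vector(prop.table(HairEyeColor))    # 32 cell proportions, summing to 1

set.seed(1)                                     # arbitrary seed, for repeatability only
sim_counts <- rmultinom(1, size = n, prob = p_hat)

# Put the simulated counts back into the 4 x 4 x 2 layout
sim_table <- as.table(array(sim_counts,
                            dim = dim(HairEyeColor),
                            dimnames = dimnames(HairEyeColor)))
sum(sim_table)                                  # always n: only the total is fixed
margin.table(sim_table, margin = 3)             # Sex totals vary between simulations

# Log-probability of the observed table under these plug-in probabilities
log_prob <- dmultinom(as.vector(HairEyeColor), size = n, prob = p_hat, log = TRUE)
log_prob
```

Under this model none of the margins is fixed, so the simulated numbers of males and females need not match the observed 279 and 313.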
Crossed data - working with tables
Proportions (depends on which margin is fixed)
# One margin (the third here, i.e. Sex) is fixed ... as many multinomials as in
round(prop.table(HairEyeColor, margin = 3), 2)
## , , Sex = Male
##
##        Eye
## Hair    Brown Blue Hazel Green
##   Black  0.11 0.04  0.04  0.01
##   Brown  0.19 0.18  0.09  0.05
##   Red    0.04 0.04  0.03  0.03
##   Blond  0.01 0.11  0.02  0.03
##
## , , Sex = Female
##
##        Eye
## Hair    Brown Blue Hazel Green
##   Black  0.12 0.03  0.02  0.01
##   Brown  0.21 0.11  0.09  0.04
##   Red    0.05 0.02  0.02  0.02
##   Blond  0.01 0.20  0.02  0.03
Possible generative model:
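The slide stops before naming the model. One natural reading, offered here as an assumption rather than as the author's statement, is a separate multinomial for each level of the fixed margin: the proportions sum to 1 within each sex, and each sex total is held fixed. A minimal sketch, again using the observed proportions as plug-in probabilities and an arbitrary seed:

```r
# Proportions conditional on Sex: each Sex slice sums to 1
p_given_sex <- prop.table(HairEyeColor, margin = 3)
apply(p_given_sex, MARGIN = 3, FUN = sum)

# Fixed totals for the conditioned margin (Male and Female counts)
n_sex <- apply(HairEyeColor, MARGIN = 3, FUN = sum)

set.seed(2)   # arbitrary seed, for repeatability only
# One multinomial draw per sex, each with its own fixed total
sim <- sapply(names(n_sex), function(s) {
  rmultinom(1, size = n_sex[[s]], prob = as.vector(p_given_sex[, , s]))
})

# sim is 16 x 2 (one column per sex); reshape to the 4 x 4 x 2 layout
sim_table <- as.table(array(sim,
                            dim = dim(HairEyeColor),
                            dimnames = dimnames(HairEyeColor)))
margin.table(sim_table, margin = 3)   # Sex totals are reproduced exactly
```

In contrast with a single multinomial over all cells, the Sex margin of every simulated table matches the observed one.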