ACMS 20340 Statistics for Life Sciences Chapter 22: The Chi-Square Test for Two-Way Tables
Survey High School students were given the following survey. 1. Do you smoke? ◮ Yes ◮ No 2. Do your parents smoke? ◮ Both parents smoke ◮ Only one parent smokes ◮ No The results were then tabulated.
The Results of the Survey These were the results. Do you smoke? Yes 1004 No 4371 Do your parents smoke? Both 1780 One 2239 Neither 1356
Placing the Results on a Table But we can also tabulate the data by counting the number of each possible pair of responses. Student Smokes Student Doesn’t Smoke Both Parents Smoke 400 1380 One Parent Smokes 416 1823 Neither Parent Smokes 188 1168 Each row corresponds to a possible outcome of the first factor. Each column corresponds to a possible outcomes of the second factor. Each cell contains the count of responses having that pair of answers corresponding to its row and column.
Two-Way Tables But we can also tabulate the data by counting the number of each possible pair of responses. Student Smokes Student Doesn’t Smoke Both Parents Smoke 400 1380 One Parent Smokes 416 1823 Neither Parent Smokes 188 1168 This is called a two-way table, since the individuals are simultaneously divided up in two different ways. We can construct a two-way table for any pair of categorical variables.
Two-Way Tables, continued Usually, row and column totals are included in two-way tables: Smokes Doesn’t Smoke Total Both Parents Smoke 400 1380 1780 One Parent Smokes 416 1823 2239 Neither Parent Smokes 188 1168 1356 1004 4371 5375 The last column gives the same summary of the parents smoking that we initially found. Further, the last row give exactly the results of our first question. Because the totals are “in the margins”, they are sometimes called marginal distributions (of the table).
Representing the Data We have many tools to represent this data. For instance, here is a description using a bar chart. Both One None 1500 1000 500 0 Smokes Doesn't Smoke But is there a significant relationship between the two variables?
Adjusting for the Difference between Groups Since the number of smokers and the number of non-smokers are different, let’s adjust by dividing each row entry by the size of corresponding group, which is the total number of responses in the given column. Smokes Doesn’t Smoke Total Both Parents Smoke 400/1004 1380/4371 1780 One Parent Smokes 416/1004 1823/4371 2239 Neither Parent Smokes 188/1004 1168/4371 1356 1004 4371 5375
Adjusting for the Difference between Groups Since the number of smokers and the number of non-smokers are different, let’s adjust by dividing each row entry by the size of corresponding group, which is the total number of responses in the given column. Smokes Doesn’t Smoke Total Both Parents Smoke 0.398 0.316 One Parent Smokes 0.414 0.417 Neither Parent Smokes 0.188 0.267 1.0 1.0
A More Informative Pair of Bar Charts 0.4 Both One None 0.3 0.2 0.1 0.0 Smokes Doesn't Smoke There are still differences, but are they significant?
Formulating the Null Hypothesis To answer this question, we formulate the hypothesis H 0 : There is no relationship between the two variables . One approach is to test each variable separately using the two sample proportion tests. For example: ◮ n 1 = 1004 smokers; ◮ n 2 = 4371 non-smokers; ◮ X 1 = 400 smokers have two parents who smoke; ◮ X 2 = 1380 non-smokers have two parents who smoke. Performing a two sample test of proportion yields a P -value of 6 . 2 × 10 − 7 . Problem: We end up with three different P -values, but we want a single P -value that tells use how the two groups compare.
Another Approach Let’s return to the two-way table. Smokes Doesn’t Smoke Total Both 400 1380 1780 One 416 1823 2239 Neither 188 1168 1356 1004 4371 5375 If the null hypothesis were true, then we would expect 1780 / 5375 = 0 . 33 of the respondents to report that both parents are smokers. Why? Because we would expect the same percentage of smokers as non-smokers to have two smoking parents. Thus it wouldn’t matter which group our respondent were in to determine the percentage of having two parents who smoke (i.e. we can pool the smokers with non-smokers to determine the % having both parents).
Expected Proportions Smokes Doesn’t Smoke Total Both 400 1380 1780 ← p 1 = 1780 / 5375 One 416 1823 2239 ← p 2 = 2239 / 5375 Neither 188 1168 1356 ← p 3 = 1356 / 5375 1004 4371 5375
Expected Proportions Smokes Doesn’t Smoke Total Both 400 1380 1780 ← p 1 = 0 . 33 One 416 1823 2239 ← p 2 = 0 . 42 Neither 188 1168 1356 ← p 3 = 0 . 25 1004 4371 5375 So then how many would counts we expect to be in each cell? There are 1004 smokers so we would expect 1004 × p 1 = 331 . 32 of them to have both parents smoking. We perform similar calculations for all the cells.
Expected Counts The expected counts are highlighted. Smokes Doesn’t Smoke Total Both 400 (331.32) 1380 (1442.43) 1780 ← p 1 = 0 . 33 One 416 (421.68) 1823 (1832.82) 2239 ← p 2 = 0 . 42 Neither 188 (251) 1168 (1092.75) 1356 ← p 3 = 0 . 25 1004 4371 5375 Now we can calculate a chi-square value for this data in the usual way: For each box we calculate (observed − expected) 2 . expected Adding all of these terms together yields χ 2 ≈ 38 . 06.
Degrees of Freedom In this example there are 3 rows and 2 columns, giving df = (3 − 1)(2 − 1) = 2 degrees of freedom. If we look up χ 2 ≈ 38 . 06 on the chi-square table gives a P -value of 2 . 7186 × 10 − 9 (a software program was used to get this exact number). This value is extremely small, so the null hypothesis should be rejected. Hence, there is strong evidence that there is a relationship between the two variables. We can further investigate the situation by looking at the residuals just as we did before.
Examining the Residuals As before, larger residuals indicate where there is more deviation from what was expected. Smokes Doesn’t Smoke Both 14.24 2.70 One 0.077 0.053 Neither 15.81 5.18 The biggest residuals are in the smokers column. There seems to be a relationship between a high schooler smoking and having parents who smoke. But the test doesn’t tell us what this relationship is; further investigation is needed.
Chi-square Test For Two-way Tables (1) Given a two-way table between two factors, we test the hypothesis H 0 : There is no relationship between the two factors by performing a χ 2 test for two-way tables. For a table with r rows and c columns (so factor 1 has r possibilities and factor 2 has c ). For each cell, we compute an expected count as expected = (row total)(column total) . table total Then we compute the χ 2 statistic value by summing over each possible square � (observed − expected) 2 χ 2 = . expected
Chi-square Test For Two-way Tables (2) The computed χ 2 statistic is then compared to the chi-square distribution having ( r − 1)( c − 1) degrees of freedom. The P -value is the area under the curve to the right of the test statistic.
Conditions for the χ 2 Test for Two-Way Tables The conditions for the χ 2 test for two-way tables are the same as the conditions for a chi-square test for goodness of fit. ◮ The sample is random, and each individual is independent of the others. ◮ The expected number of observations for each square is ≥ 1. ◮ At least 80% of the squares have an expected count ≥ 5.
Another Way to Construct a Two-Way Table In the previous example we took a single sample and broke it up into subsets determined by categorical variables. We can also compare a single categorical variable between different samples.
Pizza! Subjects are divided into four groups. Each group is put in a room painted a different color (blue, green, white, and yellow) and asked what their pizza preference is (out of pepperoni, cheese, green pepper, or other). The responses were tabulated as follows. Blue Green White Yellow Total Pepperoni 5 15 15 12 47 Cheese 15 10 6 7 38 Green Pepper 7 12 6 16 41 Other 12 23 29 24 88 39 60 56 59 214 Is there a significant difference between the responses of the groups? We use a chi-square test for a two-way table to find out.
Expected Counts and Residuals Expected Counts Blue Green White Yellow Total Pepperoni 8.56 13.18 12.30 12.96 47 Cheese 6.95 10.65 9.94 10.48 38 Green Pepper 7.47 11.50 10.73 11.30 41 Other 16.03 24.67 23.03 24.26 88 39 60 56 59 214 Residuals Blue Green White Yellow Pepperoni 1.48 0.25 0.59 0.07 Cheese 9.42 0.04 1.56 1.15 Green Pepper 0.03 0.02 2.08 1.95 Other 1.02 0.11 1.55 0.00
The χ 2 Statistic and the P -value The sum of the residuals is χ 2 = 21 . 3422. The table is 4 × 4, so the number of degrees of freedom is df = (4 − 1)(4 − 1) = 9 Consulting the χ 2 table, we find a P-value between 0.02 and 0.01. This indicates there is probably a relationship between the color of the wall and the type of pizza preferred.
Recommend
More recommend