
Lecture 5: ANOVA and Correlation, Ani Manichaikul, amanicha@jhsph.edu



  1. Lecture 5: ANOVA and Correlation. Ani Manichaikul, amanicha@jhsph.edu, 23 April 2007

  2. Comparing Multiple Groups. Continuous data: comparing means; analysis of variance. Binary data: comparing proportions; Pearson's chi-square tests for r × 2 tables (independence, goodness of fit, homogeneity). Categorical data: r × c tables; Pearson chi-square tests; odds ratio and relative risk.

  3. ANOVA: Definition. Statistical technique for comparing means for multiple populations. Partitions the total variation in a data set into components defined by specific sources. ANOVA = ANalysis Of VAriance.

  4. ANOVA: Concepts. Estimate group means. Assess the magnitude of variation attributable to specific sources. Extension of the 2-sample t-test to multiple groups. Population model. Sample model: estimates, standard errors. Partition of variability.

  5. Types of ANOVA. One-way ANOVA: one factor (e.g. smoking status). Two-way ANOVA: two factors (e.g. gender and smoking status). Three-way ANOVA: three factors (e.g. gender, smoking and beer).

  6. Emphasis. One-way ANOVA is an extension of the t-test to 3 or more samples; it focuses the analysis on group differences. Two-way ANOVA (and higher) focuses on the interaction of factors: does the effect due to one factor change as the level of another factor changes?

  7. ANOVA Rationale I. Variation in all observations = variation between each observation and its group mean + variation between each group mean and the overall mean. In other words: total sum of squares = within group sum of squares + between groups sum of squares.

  8. ANOVA Rationale II. In shorthand: SST = SSW + SSB. If the group means are not very different, the variation between them and the overall mean (SSB) will not be much more than the variation between the observations within a group (SSW).
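
The partition SST = SSW + SSB can be checked numerically. The following is a minimal sketch on made-up data; the group labels and values are hypothetical and are not taken from the lecture.

```python
# Hypothetical toy data: three groups of measurements (not from the lecture).
groups = {
    "never": [4.0, 5.0, 6.0],
    "former": [6.0, 7.0, 8.0],
    "current": [9.0, 10.0, 11.0],
}


def mean(values):
    return sum(values) / len(values)


all_obs = [x for obs in groups.values() for x in obs]
grand_mean = mean(all_obs)

# Total: every observation against the overall mean.
sst = sum((x - grand_mean) ** 2 for x in all_obs)
# Within: every observation against its own group mean.
ssw = sum((x - mean(obs)) ** 2 for obs in groups.values() for x in obs)
# Between: every group mean against the overall mean, weighted by group size.
ssb = sum(len(obs) * (mean(obs) - grand_mean) ** 2 for obs in groups.values())

print(sst, ssw, ssb)
```

Up to floating-point rounding, the first printed value equals the sum of the other two.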

  9. ANOVA: One-Way

  10. MSW. We can pool the estimates of $\sigma^2$ across groups and use an overall estimate for the population variance: variation within a group $= \hat{\sigma}_W^2 = \frac{SSW}{N - k} = MSW$. MSW is called the "within groups mean square".

  11. MSB. We can also look at systematic variation among groups: variation between groups $= \hat{\sigma}_B^2 = \frac{SSB}{k - 1} = MSB$.

  12. An ANOVA table. Suppose there are k groups (e.g. if smoking status has categories current, former or never, then k = 3). We calculate our test statistic using the sums of squares as follows:

  13. Hypothesis testing with ANOVA. In performing ANOVA, we may want to ask: is there truly a difference in means across groups? Formally, we can specify the hypotheses $H_0: \mu_1 = \mu_2 = \cdots = \mu_k$ and $H_a$: at least one of the $\mu_i$'s is different. The null hypothesis specifies a global relationship. If the result of the test is significant, then perform individual comparisons.

  14. Goal of the comparisons. Compare the two variability estimates, MSW and MSB. If $F_{obs} = \frac{MSB}{MSW} = \frac{\hat{\sigma}_B^2}{\hat{\sigma}_W^2}$ is small, then variability between groups is negligible compared to variation within groups ⇒ the grouping does not explain much variation in the data.
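
As an illustration of how the two mean squares combine into the F statistic, here is a short sketch. The function name `one_way_anova` and the data are hypothetical; it simply applies MSB = SSB / (k − 1) and MSW = SSW / (N − k).

```python
def one_way_anova(groups):
    """Return (MSB, MSW, F) for a list of groups, each a list of numbers."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Between groups: group means against the overall mean, weighted by size.
    ssb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Within groups: observations against their own group mean.
    ssw = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)

    msb = ssb / (k - 1)        # between groups mean square
    msw = ssw / (n_total - k)  # within groups mean square
    return msb, msw, msb / msw


# Hypothetical toy data, not from the lecture.
msb, msw, f_obs = one_way_anova(
    [[4.0, 5.0, 6.0], [6.0, 7.0, 8.0], [9.0, 10.0, 11.0]]
)
print(msb, msw, f_obs)
```

In this toy data the group means are well separated relative to the spread inside each group, so the ratio is large.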

  15. The F-statistic. For our observations, we assume $X \sim N(\mu_{gp}, \sigma^2)$, where $\mu_{gp} = E(X \mid gp) = \beta_0 + \beta_1 \cdot I(\text{group}=2) + \beta_2 \cdot I(\text{group}=3) + \cdots$ and $I(\text{group}=i)$ is an indicator denoting whether or not each individual is in the $i$th group. Note: we have assumed the same variance $\sigma^2$ for all groups; it is important to check this assumption. Under these assumptions, we know the null distribution of the statistic $F = \frac{MSB}{MSW}$. The distribution is called an F-distribution.

  16. The F-distribution. Remember that a $\chi^2$ distribution is always specified by its degrees of freedom. An F-distribution is the distribution of the quotient of two independent $\chi^2$ random variables, each divided by its degrees of freedom. When we specify an F-distribution, we must state two parameters, which correspond to the degrees of freedom of the two $\chi^2$ distributions. If $X_1 \sim \chi^2_{df_1}$ and $X_2 \sim \chi^2_{df_2}$ are independent, we write $\frac{X_1 / df_1}{X_2 / df_2} \sim F_{df_1, df_2}$.
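
The ratio construction can be explored by simulation. This is a rough sketch using only the Python standard library; the helper name `simulate_f`, the seed and the number of draws are arbitrary choices, and the degrees of freedom (2 and 116) are borrowed from the HDL example in this lecture.

```python
import random


def simulate_f(df1, df2, n_draws, seed=0):
    """Draw F(df1, df2) variates as ratios of scaled chi-square draws."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        # A chi-square with df degrees of freedom is Gamma(shape=df/2, scale=2).
        x1 = rng.gammavariate(df1 / 2, 2.0)
        x2 = rng.gammavariate(df2 / 2, 2.0)
        draws.append((x1 / df1) / (x2 / df2))
    return draws


draws = sorted(simulate_f(df1=2, df2=116, n_draws=100_000))
# Empirical upper 5% point of the simulated distribution.
upper_5_percent_point = draws[int(0.95 * len(draws))]
print(round(upper_5_percent_point, 2))
```

The printed value is a Monte Carlo estimate, so it should land near, but not exactly on, the tabulated critical value for these degrees of freedom.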

  17. Back to the hypothesis test. Knowing the null distribution of $\frac{MSB}{MSW}$, we can define a decision rule to test the hypothesis for ANOVA: reject $H_0$ if $F \ge F_{\alpha;\, k-1,\, N-k}$; fail to reject $H_0$ if $F < F_{\alpha;\, k-1,\, N-k}$.

  18. ANOVA: F-tests I

  19. ANOVA: F-tests II

  20. Example: ANOVA for HDL. Study design: randomized controlled trial. 132 men randomized to one of three groups: diet + exercise, diet, or control. Follow-up one year later: 119 men remaining in the study. Outcome: mean change in plasma levels of HDL cholesterol from baseline to one-year follow-up in the three groups.

  21. Model for HDL outcomes. We model the means for each group as follows: $\mu_c = E(HDL \mid gp = c)$ = mean change in the control group; $\mu_d = E(HDL \mid gp = d)$ = mean change in the diet group; $\mu_{de} = E(HDL \mid gp = de)$ = mean change in the diet and exercise group. We could also write the model as $E(HDL \mid gp) = \beta_0 + \beta_1 I(gp = d) + \beta_2 I(gp = de)$. Recall that $I(gp = d)$ and $I(gp = de)$ are 0/1 group indicators.

  22. HDL ANOVA Table. We obtain the following results from the HDL experiment:

  23. HDL ANOVA results. F-test: $H_0: \mu_c = \mu_d = \mu_{de}$ (or $H_0: \beta_1 = \beta_2 = 0$); $H_a$: at least one mean is different from the others. Test statistic: $F_{obs} = 13$, with $df_1 = k - 1 = 3 - 1 = 2$ and $df_2 = N - k = 119 - 3 = 116$.

  24. HDL ANOVA Conclusions. Rejection region: $F > F_{0.05;\, 2,\, 116} = 3.07$. Since $F_{obs} = 13.0 > 3.07$, we reject $H_0$. We conclude that at least one of the group means is different from the others.
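
The critical value quoted above can be reproduced without tables in this particular case. When the numerator degrees of freedom equal 2, the F upper-tail probability has the closed form $(1 + 2x/df_2)^{-df_2/2}$; this shortcut does not hold for other numerator degrees of freedom. The sketch below uses it, with hypothetical function names.

```python
def f_upper_tail_df1_equals_2(x, df2):
    """P(F > x) for F(2, df2); closed form valid only for numerator df = 2."""
    return (1 + 2 * x / df2) ** (-df2 / 2)


def f_critical_df1_equals_2(alpha, df2):
    """Value x with P(F > x) = alpha for F(2, df2); inverts the formula above."""
    return (df2 / 2) * (alpha ** (-2 / df2) - 1)


k, n = 3, 119    # groups and men remaining in the study, from the slides
f_obs = 13.0     # observed F statistic, from the slides

critical_value = f_critical_df1_equals_2(0.05, n - k)
p_value = f_upper_tail_df1_equals_2(f_obs, n - k)
reject_h0 = f_obs >= critical_value

print(round(critical_value, 2), reject_h0)
```

For general degrees of freedom, a statistics library or an F table would be the usual route.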

  25. Which groups are different? We might proceed to make individual comparisons. Conduct two-sample t-tests for each pair of groups: $t = \frac{\hat{\theta} - \theta_0}{SE(\hat{\theta})} = \frac{\bar{X}_i - \bar{X}_j - 0}{\sqrt{\frac{s_p^2}{n_i} + \frac{s_p^2}{n_j}}}$

  26. Multiple Comparisons. Performing individual comparisons requires multiple hypothesis tests. If α = 0.05 for each comparison, each comparison has a 5% chance of falsely being called significant. Overall, the probability of at least one Type I error is elevated above 5%. Question: how can we address this multiple comparisons issue?
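
To see how far above 5% the overall error rate can climb with three comparisons, a short worked calculation helps. The first line assumes the three tests are independent, which is only an approximation for pairwise comparisons that share groups and a pooled variance; the second line is the union bound, which needs no such assumption.

```latex
\begin{align*}
P(\text{at least one false rejection})
  &= 1 - (1 - \alpha)^m = 1 - 0.95^3 \approx 0.143
  && \text{(if the $m = 3$ tests were independent)} \\
P(\text{at least one false rejection})
  &\le m\,\alpha = 3 \times 0.05 = 0.15
  && \text{(union bound, no independence needed)}
\end{align*}
```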

  27. Bonferroni adjustment. A possible correction for multiple comparisons. Test each hypothesis at level $\alpha^* = \alpha / 3 = 0.0167$. The adjustment ensures the overall Type I error rate does not exceed α = 0.05. However, this adjustment may be too conservative.

  28. Multiple comparisons

     | Hypothesis | $\alpha^* = \alpha / 3$ |
     |---|---|
     | $H_0: \mu_c = \mu_d$ (or $\beta_1 = 0$) | 0.0167 |
     | $H_0: \mu_c = \mu_{de}$ (or $\beta_2 = 0$) | 0.0167 |
     | $H_0: \mu_d = \mu_{de}$ (or $\beta_1 - \beta_2 = 0$) | 0.0167 |

     Overall α = 0.05

  29. HDL: Pairwise comparisons I. Control and Diet groups. $H_0: \mu_c = \mu_d$ (or $\beta_1 = 0$). $t = \frac{-0.05 - 0.02}{\sqrt{\frac{0.028}{40} + \frac{0.028}{40}}} = -1.87$, p-value = 0.06

  30. HDL: Pairwise comparisons II. Control and Diet + exercise groups. $H_0: \mu_c = \mu_{de}$ (or $\beta_2 = 0$). $t = \frac{-0.05 - 0.14}{\sqrt{\frac{0.028}{40} + \frac{0.028}{39}}} = -5.05$, p-value = $4.4 \times 10^{-7}$

  31. HDL: Pairwise comparisons III. Diet and Diet + exercise groups. $H_0: \mu_d = \mu_{de}$ (or $\beta_1 - \beta_2 = 0$). $t = \frac{0.02 - 0.14}{\sqrt{\frac{0.028}{40} + \frac{0.028}{39}}} = -3.19$, p-value = 0.0014
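
The three pairwise statistics can be recomputed from the numbers on the slides. This sketch assumes mean HDL changes of −0.05 (control), 0.02 (diet) and 0.14 (diet + exercise), group sizes of 40, 40 and 39, and a pooled variance of 0.028. The p-values use a standard normal approximation, which is an assumption here; it appears to match the reported values to the precision shown.

```python
import math

# Values read from the slides: mean HDL changes, group sizes, pooled variance.
means = {"control": -0.05, "diet": 0.02, "diet+exercise": 0.14}
sizes = {"control": 40, "diet": 40, "diet+exercise": 39}
pooled_variance = 0.028


def pairwise_t(a, b):
    """Two-sample t statistic for groups a and b using the pooled variance."""
    se = math.sqrt(pooled_variance / sizes[a] + pooled_variance / sizes[b])
    return (means[a] - means[b]) / se


def two_sided_normal_p(t):
    """Two-sided p-value under a standard normal reference (an assumption)."""
    return math.erfc(abs(t) / math.sqrt(2))


pairs = [("control", "diet"), ("control", "diet+exercise"), ("diet", "diet+exercise")]
results = {pair: pairwise_t(*pair) for pair in pairs}
for pair, t in results.items():
    print(pair, round(t, 2), two_sided_normal_p(t))
```

A t reference distribution with 116 degrees of freedom would give slightly larger p-values than the normal approximation.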

  32. Bonferroni corrected p-values

     | Hypothesis | p-value | Adjusted p-value |
     |---|---|---|
     | $H_0: \mu_c = \mu_d$ | 0.06 | 0.18 |
     | $H_0: \mu_c = \mu_{de}$ | $4.4 \times 10^{-7}$ | $1.3 \times 10^{-6}$ |
     | $H_0: \mu_d = \mu_{de}$ | 0.0014 | 0.0042 |

     Overall α = 0.05. Conclusion: significant difference in HDL change for the DE group compared to the other groups.
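
The adjusted p-values are the raw p-values multiplied by the number of comparisons, capped at 1. A minimal sketch, with a hypothetical helper name and the raw p-values taken from the slides:

```python
def bonferroni_adjust(p_values):
    """Multiply each p-value by the number of tests, capping the result at 1."""
    m = len(p_values)
    return {name: min(1.0, p * m) for name, p in p_values.items()}


# Raw p-values as reported on the slides.
raw = {"c vs d": 0.06, "c vs de": 4.4e-7, "d vs de": 0.0014}
adjusted = bonferroni_adjust(raw)

alpha = 0.05
significant = {name: p < alpha for name, p in adjusted.items()}
print(adjusted, significant)
```

Comparing an adjusted p-value with α = 0.05 is equivalent to comparing the raw p-value with α/3.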

  33. Two-way ANOVA. Uses the same idea as one-way ANOVA by partitioning variability. Allows us to look at the interaction of factors: does the effect due to one factor change as the level of another factor changes?

  34. Example: Public health students' medical expenditures. Study design: in an observational study, total medical expenditures and various demographic characteristics were recorded for 200 public health students. Goal: determine how gender and smoking status affect total medical expenditures in this population.

  35. Example: Set-up. Y = total medical expenditures. F = indicator of female: 1 if gender is female, 0 otherwise. S = indicator of smoking: 1 if smoked 100 cigarettes or more, 0 otherwise.

  36. Interaction model. We assume the model $Y \sim N(\mu, \sigma^2)$, where $\mu = E(Y) = \beta_0 + \beta_1 F + \beta_2 S + \beta_3 F \cdot S$. What are the interpretations of $\beta_0$, $\beta_1$, $\beta_2$, and $\beta_3$?

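  37. Two-way ANOVA: Interactions. Mean model: $\mu = E(Y) = \beta_0 + \beta_1 F + \beta_2 S + \beta_3 F \cdot S$

     | Gender | Smoker: No | Smoker: Yes |
     |---|---|---|
     | Male | $\beta_0$ | $\beta_0 + \beta_2$ |
     | Female | $\beta_0 + \beta_1$ | $\beta_0 + \beta_1 + \beta_2 + \beta_3$ |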

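  38. Mean Model
     $E(\text{Expenditure} \mid \text{Male, non-smoker}) = \beta_0 + \beta_1 \cdot 0 + \beta_2 \cdot 0 + \beta_3 \cdot 0 = \beta_0$
     $E(\text{Expenditure} \mid \text{Female, non-smoker}) = \beta_0 + \beta_1 \cdot 1 + \beta_2 \cdot 0 + \beta_3 \cdot 0 = \beta_0 + \beta_1$
     $E(\text{Expenditure} \mid \text{Male, smoker}) = \beta_0 + \beta_1 \cdot 0 + \beta_2 \cdot 1 + \beta_3 \cdot 0 = \beta_0 + \beta_2$
     $E(\text{Expenditure} \mid \text{Female, smoker}) = \beta_0 + \beta_1 \cdot 1 + \beta_2 \cdot 1 + \beta_3 \cdot 1 = \beta_0 + \beta_1 + \beta_2 + \beta_3$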
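
The four cell means follow mechanically from the interaction model, which a few lines of code make concrete. The coefficient values below are hypothetical, chosen only so the arithmetic is easy to follow; they are not estimates from the study.

```python
def expected_expenditure(female, smoker, beta):
    """Mean model: beta0 + beta1*F + beta2*S + beta3*F*S, with F and S in {0, 1}."""
    b0, b1, b2, b3 = beta
    return b0 + b1 * female + b2 * smoker + b3 * female * smoker


# Hypothetical coefficients, for illustration only.
beta = (1000.0, 200.0, 500.0, 300.0)

cells = {
    (f, s): expected_expenditure(f, s, beta) for f in (0, 1) for s in (0, 1)
}

# Effect of smoking within each gender.
smoking_effect_male = cells[(0, 1)] - cells[(0, 0)]    # equals beta2
smoking_effect_female = cells[(1, 1)] - cells[(1, 0)]  # equals beta2 + beta3

print(cells, smoking_effect_male, smoking_effect_female)
```

The difference between the two smoking effects equals $\beta_3$, which is why $\beta_3$ is read as the interaction: how much the effect of smoking changes between males and females.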

  39. Medical Expenditures: ANOVA table

     | Source of variation | Sum of squares | df | Mean square | F | p-value |
     |---|---|---|---|---|---|
     | Model (between groups) | $1.7 \times 10^9$ | 3 | $5.6 \times 10^8$ | 28.11 | < 0.001 |
     | Error (within groups) | $3.9 \times 10^9$ | 196 | $2.0 \times 10^7$ | | |
     | Total | $5.6 \times 10^9$ | 199 | | | |
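
The entries of this table are linked by simple arithmetic: each mean square is a sum of squares divided by its degrees of freedom, and F is the ratio of the two mean squares. The sketch below redoes that arithmetic from the rounded sums of squares shown in the table, so its F value only approximates the reported 28.11, which was presumably computed from unrounded values.

```python
# Rounded values as shown in the ANOVA table.
ss_model, df_model = 1.7e9, 3    # four gender-by-smoking cells, so 4 - 1 = 3
ss_error, df_error = 3.9e9, 196  # 200 students minus 4 cells

ms_model = ss_model / df_model   # between groups mean square
ms_error = ss_error / df_error   # within groups mean square
f_stat = ms_model / ms_error

ss_total = ss_model + ss_error
df_total = df_model + df_error

print(ms_model, ms_error, f_stat, ss_total, df_total)
```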
