ANOVA: Analysis of Variance Marc H. Mehlman marcmehlman@yahoo.com University of New Haven “The analysis of variance is (not a mathematical theorem but) a simple method of arranging arithmetical facts so as to isolate and display the essential features of a body of data with the utmost simplicity.” – Sir Ronald A. Fisher Marc Mehlman Marc Mehlman (University of New Haven) ANOVA: Analysis of Variance 1 / 31
Table of Contents
1. ANOVA: One Way Layout
2. Comparing Means
3. ANOVA: Two Way Layout
4. Chapter #11 R Assignment
ANOVA (analysis of variance) tests whether the means of k different populations are equal when all the populations are independent, normal, and have the same unknown variance. An ANOVA test compares the randomness (variance) within groups (populations) to the randomness between groups. To test whether the means of all the populations are equal, one uses the ratio

$$\frac{\text{variance between groups}}{\text{variance within groups}}$$

as a test statistic. A large ratio indicates a difference in means between the groups.
ANOVA: One Way Layout
The Idea of ANOVA

Compare two sets, (a) and (b), of three samples each:
- The sample means for the three samples are the same for each set.
- The variation among sample means for (a) is identical to (b).
- The variation among the individuals within the three samples is much less for (b).

CONCLUSION: the samples in (b) contain a larger amount of variation among the sample means relative to the amount of variation within the samples, so ANOVA will find more significant differences among the means in (b).
- This assumes equal sample sizes for (a) and (b).
- Note: with larger samples, a given difference among the means is more likely to be found significant.
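The comparison of sets (a) and (b) can be sketched numerically in R. The data below are hypothetical and built deterministically (the names `centers`, `offsets`, and `make_set` are illustrative, not from the slides): both sets share the same three sample means, and only the spread within each sample differs.

```r
# Hypothetical illustration: same sample means, different within-sample spread.
centers <- c(4, 5, 6)               # sample means, identical in both sets
offsets <- c(-1, -0.5, 0, 0.5, 1)   # within-sample deviations (sum to zero)
group <- factor(rep(c("s1", "s2", "s3"), each = length(offsets)))

# Build a data set whose within-sample deviations are scaled by `spread`.
make_set <- function(spread) {
  as.vector(sapply(centers, function(m) m + spread * offsets))
}
set_a <- make_set(3)     # large variation within samples, like (a)
set_b <- make_set(0.5)   # small variation within samples, like (b)

# F statistic from the one-way ANOVA table.
f_stat <- function(y) anova(lm(y ~ group))[["F value"]][1]
f_a <- f_stat(set_a)
f_b <- f_stat(set_b)
c(F_a = f_a, F_b = f_b)
```

Because the between-sample variation is the same in both sets, the F statistic grows only because the within-sample variation shrinks.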
Note: When k = 2, one usually uses the two-sample t test. However, ANOVA will give the same result as the t test with pooled variance.

When k > 2, testing the populations two at a time does not work well. For instance, if one has four populations and each test is at significance level 0.05, then the overall significance level of all $\binom{4}{2} = 6$ tests (treating the tests as independent) would be $1 - (1 - 0.05)^6 = 0.265$.

The ANOVA procedure is computationally intensive, so one usually uses a computer program.
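Both remarks can be checked with a few lines of R. This is a minimal sketch: the vectors `x` and `y` are hypothetical data, and the familywise calculation treats the six pairwise tests as independent, as in the arithmetic above.

```r
# Familywise significance level for all pairwise tests among k = 4 populations.
k <- 4
alpha <- 0.05
n_tests <- choose(k, 2)                  # number of pairs
familywise <- 1 - (1 - alpha)^n_tests    # assumes independent tests

# For k = 2, the pooled two-sample t test and one-way ANOVA agree:
# the squared t statistic equals the F statistic (hypothetical data).
x <- c(5.1, 4.8, 6.0, 5.5, 5.2)
y <- c(6.3, 5.9, 6.8, 7.1, 6.0)
t_res <- t.test(x, y, var.equal = TRUE)
score <- c(x, y)
group <- factor(rep(c("x", "y"), each = 5))
a_res <- anova(lm(score ~ group))

t_sq <- unname(t_res$statistic)^2
f_val <- a_res[["F value"]][1]
c(t_squared = t_sq, F = f_val)
```

Note that `var.equal = TRUE` matters here: R's default Welch t test does not pool the variances and will generally not match the ANOVA F test exactly.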
Assumptions for doing ANOVA
1. The populations are normal.
2. The populations have the same (unknown) variance.

The above conditions are robust in the sense that one can use ANOVA if the populations are approximately normal (otherwise use the Kruskal–Wallis test, a nonparametric test) and the population variances are approximately equal.

Convention: Rule for establishing equal variance
If the largest sample standard deviation is less than twice the smallest sample standard deviation, one can use ANOVA techniques under the assumption that the variances are all the same. Some textbooks use four times the smallest sample variance instead of just twice.
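The rule of thumb is easy to automate. The helper below, `equal_var_ok`, is a hypothetical name introduced here for illustration, and both sample lists are made-up data.

```r
# Rule of thumb: largest sample sd must be less than twice the smallest sample sd.
equal_var_ok <- function(samples) {
  sds <- sapply(samples, sd)
  max(sds) < 2 * min(sds)
}

# Hypothetical samples whose spreads are similar.
similar <- list(g1 = c(4, 5, 6), g2 = c(6, 7.5, 9), g3 = c(10, 11, 12, 13))
# Hypothetical samples where one group is far more variable.
unequal <- list(g1 = c(4, 5, 6), g2 = c(1, 10, 20))

equal_var_ok(similar)
equal_var_ok(unequal)
```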
The Treatment or Factor is what differs between populations.

Example
A blood pressure drug is administered to k populations in k different doses. One samples from each of the k populations.

dosage #1:  $X_{11}, \dots, X_{1 n_1}$
    ...
dosage #k:  $X_{k1}, \dots, X_{k n_k}$
Definition
Let

$$\begin{aligned}
k &\overset{\text{def}}{=} \text{\# of levels (populations)} \\
n_j &\overset{\text{def}}{=} \text{sample size of random sample from the } j\text{th population} \\
N &\overset{\text{def}}{=} n_1 + n_2 + \cdots + n_k = \text{total number of random variables} \\
\bar{x}_j &\overset{\text{def}}{=} \text{sample mean from the } j\text{th population} \\
s_j^2 &\overset{\text{def}}{=} \text{sample variance from the } j\text{th population} \\
\bar{x} &\overset{\text{def}}{=} \frac{1}{N} \sum_{i=1}^{k} \sum_{j=1}^{n_i} x_{ij} = \text{the grand mean}
\end{aligned}$$
Definition

$$\begin{aligned}
SS_{TOT} &\overset{\text{def}}{=} \sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x})^2 = \text{Sum of Squares Total} \\
SS_A &\overset{\text{def}}{=} n_1(\bar{x}_1 - \bar{x})^2 + n_2(\bar{x}_2 - \bar{x})^2 + \cdots + n_k(\bar{x}_k - \bar{x})^2 = \text{Sum of Squares between levels} \\
SS_E &\overset{\text{def}}{=} (n_1 - 1)s_1^2 + (n_2 - 1)s_2^2 + \cdots + (n_k - 1)s_k^2 = \text{Sum of Squares within the levels}
\end{aligned}$$

Theorem
$SS_{TOT} = SS_A + SS_E$.
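A short sketch of why the theorem holds, using only the definitions above: split each deviation from the grand mean into a within-level part and a between-level part, then expand the square.

```latex
\begin{aligned}
SS_{TOT}
  &= \sum_{i=1}^{k} \sum_{j=1}^{n_i}
     \bigl[ (x_{ij} - \bar{x}_i) + (\bar{x}_i - \bar{x}) \bigr]^2 \\
  &= \sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2
     + 2 \sum_{i=1}^{k} (\bar{x}_i - \bar{x}) \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)
     + \sum_{i=1}^{k} n_i (\bar{x}_i - \bar{x})^2 \\
  &= SS_E + 0 + SS_A .
\end{aligned}
```

The cross term vanishes because deviations from a sample mean sum to zero, $\sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i) = 0$, and the first term equals $SS_E$ because $\sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2 = (n_i - 1)s_i^2$.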
Definition

$$\begin{aligned}
MS_A &\overset{\text{def}}{=} \text{Mean Squares between levels (groups)} \\
     &= \frac{n_1(\bar{x}_1 - \bar{x})^2 + n_2(\bar{x}_2 - \bar{x})^2 + \cdots + n_k(\bar{x}_k - \bar{x})^2}{k - 1} = \frac{SS_A}{k - 1} \\
MS_E &\overset{\text{def}}{=} \text{Mean Squares within the levels} = \text{pooled sample variance} = \text{Mean Squared Error} \\
     &= \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2 + \cdots + (n_k - 1)s_k^2}{N - k} = \frac{SS_E}{N - k}
\end{aligned}$$

Theorem
The Mean Square Error, $MS_E$, is an unbiased estimator of $\sigma^2$.
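The definitions translate almost line for line into R. The sketch below uses a small hypothetical data set (three levels with unequal sample sizes); the variable names are chosen here to mirror the symbols and are not part of the slides.

```r
# Hypothetical data: three levels with unequal sample sizes.
samples <- list(g1 = c(4, 5, 6), g2 = c(6, 8, 10), g3 = c(10, 11, 12, 13))

k <- length(samples)                  # number of levels
n <- sapply(samples, length)          # n_j
N <- sum(n)                           # total number of observations
xbar_j <- sapply(samples, mean)       # sample means
s2_j <- sapply(samples, var)          # sample variances
grand_mean <- mean(unlist(samples))   # grand mean over all observations

ss_tot <- sum((unlist(samples) - grand_mean)^2)   # total sum of squares
ss_a <- sum(n * (xbar_j - grand_mean)^2)          # between levels
ss_e <- sum((n - 1) * s2_j)                       # within levels

ms_a <- ss_a / (k - 1)   # mean squares between levels
ms_e <- ss_e / (N - k)   # mean squared error (pooled sample variance)

c(SS_TOT = ss_tot, SS_A = ss_a, SS_E = ss_e, MS_A = ms_a, MS_E = ms_e)
```

For these data the decomposition can be checked by hand: $SS_A = 73.5$, $SS_E = 15$, and $SS_{TOT} = 88.5$.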
Theorem (ANOVA F Test)
To test $H_0: \mu_1 = \cdots = \mu_k$ vs $H_A$: not $H_0$, use the test statistic

$$F = \frac{MS_A}{MS_E} \sim F(k - 1, N - k) \quad \text{under } H_0.$$

Not $H_0$ ⇒ F large, so use a right tail test.

One creates an ANOVA table:

Source    df      SS         MS      F            p
Between   k − 1   SS_A       MS_A    MS_A/MS_E    P(F(k − 1, N − k) ≥ f)
Within    N − k   SS_E       MS_E
Total     N − 1   SS_TOT
Example
Judges at the Parisian photography contest, FotoGras, numerically scored photographs submitted by a number of photographers on a scale of 0–10. A one-way ANOVA test was performed to see whether the brand of camera a photograph was taken with had anything to do with the judges' numerical scores. A summary of the data is given below:

Brand     Sample Size   Sample Mean   Sample Variance
Canon     11            7.6           2.1
Nikon     9             8.0           3.3
Pentax    5             8.7           2.9
Samsung   3             8.3           2.0
Sony      8             8.0           1.9

The scores awarded to each brand were verified as being (mostly) normally distributed and independent of the scores awarded to other brands. Create an ANOVA table from the scores and decide whether there was a “brand effect” at a 0.05 significance level.
Example (cont.)
Solution: Since the largest sample standard deviation, $\sqrt{3.3}$, is less than twice the smallest sample standard deviation, $\sqrt{1.9}$, we can assume the population variances are all the same.

$$\begin{aligned}
k &= 5 \\
N &= 11 + 9 + 5 + 3 + 8 = 36 \\
\bar{x} &= \frac{11(7.6) + 9(8.0) + 5(8.7) + 3(8.3) + 8(8.0)}{36} = 8.0 \\
SS_A &= 11(7.6 - 8.0)^2 + 9(8.0 - 8.0)^2 + 5(8.7 - 8.0)^2 + 3(8.3 - 8.0)^2 + 8(8.0 - 8.0)^2 = 4.48 \\
SS_E &= (11 - 1)2.1 + (9 - 1)3.3 + (5 - 1)2.9 + (3 - 1)2.0 + (8 - 1)1.9 = 76.3 \\
SS_{TOT} &= SS_A + SS_E = 4.48 + 76.3 = 80.78 \\
MS_A &= \frac{SS_A}{k - 1} = \frac{4.48}{5 - 1} = 1.12 \\
MS_E &= \frac{SS_E}{N - k} = \frac{76.3}{36 - 5} = 2.46129 \\
f &= \frac{MS_A}{MS_E} = \frac{1.12}{2.46129} = 0.4550459 \\
p\text{-value} &= P(F(4, 31) \ge f) = 0.7679706
\end{aligned}$$

Source    df   SS      MS        F         p
Between   4    4.48    1.12      0.45505   0.76797
Within    31   76.3    2.46129
Total     35   80.78

Since the p-value exceeds 0.05, one fails to reject the hypothesis that there is no “brand” effect.
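The same table can be rebuilt in R from the summary statistics alone. This is a sketch rather than a standard routine: `anova_from_summary` is a hypothetical helper written for this example, and it uses `pf()` for the right-tail probability.

```r
# Build a one-way ANOVA table from per-group sizes, means, and variances.
anova_from_summary <- function(n, xbar, s2) {
  k <- length(n)
  N <- sum(n)
  grand_mean <- sum(n * xbar) / N
  ss_a <- sum(n * (xbar - grand_mean)^2)   # between levels
  ss_e <- sum((n - 1) * s2)                # within levels
  ms_a <- ss_a / (k - 1)
  ms_e <- ss_e / (N - k)
  f <- ms_a / ms_e
  p <- pf(f, k - 1, N - k, lower.tail = FALSE)   # right-tail p-value
  data.frame(
    Source = c("Between", "Within", "Total"),
    df = c(k - 1, N - k, N - 1),
    SS = c(ss_a, ss_e, ss_a + ss_e),
    MS = c(ms_a, ms_e, NA),
    F = c(f, NA, NA),
    p = c(p, NA, NA)
  )
}

# FotoGras summary data: Canon, Nikon, Pentax, Samsung, Sony.
tab <- anova_from_summary(
  n = c(11, 9, 5, 3, 8),
  xbar = c(7.6, 8.0, 8.7, 8.3, 8.0),
  s2 = c(2.1, 3.3, 2.9, 2.0, 1.9)
)
print(tab)
```

Working from summaries is useful when the raw scores are unavailable, since `anova(lm(...))` needs the individual observations.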
Example
Given data on carpet durability

> cdat = read.table("carpet.dat", header = TRUE)
> cdat
  Durability Carpet
       18.95      1
       12.62      1
       11.94      1
       14.42      1
       10.06      2
        7.19      2
        7.03      2
       14.66      2
       10.92      3
       13.28      3
       14.52      3
       12.51      3
       10.46      4
       21.40      4
       18.10      4
       22.50      4

Test whether durability depends on which carpet type one chooses.
Example (continued)

> cdat = read.table("carpet.dat", header = TRUE)
> Carpet.F = as.factor(cdat$Carpet)   # change to a categorical variable
> g.lm = lm(cdat$Durability ~ Carpet.F)
> anova(g.lm)
Analysis of Variance Table

Response: cdat$Durability
          Df  Sum Sq Mean Sq F value  Pr(>F)
Carpet.F   3 146.374  48.791  3.5815 0.04674 *
Residuals 12 163.477  13.623
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

> kruskal.test(cdat$Durability ~ Carpet.F)   # Kruskal-Wallis Test

        Kruskal-Wallis rank sum test

data:  cdat$Durability by Carpet.F
Kruskal-Wallis chi-squared = 5.2059, df = 3, p-value = 0.1573
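Since `carpet.dat` may not be at hand, the session can be rerun with the sixteen listed values typed in directly. This is a sketch: the variable names `durability` and `carpet` are chosen here and do not appear in the original session.

```r
# Carpet durability data entered inline (four observations per carpet type).
durability <- c(18.95, 12.62, 11.94, 14.42,
                10.06,  7.19,  7.03, 14.66,
                10.92, 13.28, 14.52, 12.51,
                10.46, 21.40, 18.10, 22.50)
carpet <- factor(rep(1:4, each = 4))   # categorical carpet type

# One-way ANOVA F test.
fit <- anova(lm(durability ~ carpet))
print(fit)

# Kruskal-Wallis rank sum test, the nonparametric alternative.
kw <- kruskal.test(durability ~ carpet)
print(kw)
```

At the 0.05 level the two tests disagree here: the F test gives a p-value just under 0.05, while the Kruskal-Wallis p-value is well above it.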
Comparing Means
If $H_0$ is rejected, i.e. not all means are equal, how do you find how the population means differ from each other?

Answer:
- boxplots (all in one graph).
- multiple comparison methods such as the Bonferroni Multiple Comparison Test.
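Both suggestions are available in base R. The sketch below reuses the carpet durability values from the earlier example, entered inline, and assumes that `pairwise.t.test()` with Bonferroni-adjusted p-values is an acceptable stand-in for the Bonferroni Multiple Comparison Test named above.

```r
# Carpet durability data from the earlier example, entered inline.
durability <- c(18.95, 12.62, 11.94, 14.42,
                10.06,  7.19,  7.03, 14.66,
                10.92, 13.28, 14.52, 12.51,
                10.46, 21.40, 18.10, 22.50)
carpet <- factor(rep(1:4, each = 4))

# Side-by-side boxplots, all in one graph.
boxplot(durability ~ carpet, xlab = "Carpet type", ylab = "Durability")

# Pairwise t tests (pooled sd by default) with Bonferroni-adjusted p-values.
pw <- pairwise.t.test(durability, carpet, p.adjust.method = "bonferroni")
print(pw)
```

Each entry of `pw$p.value` is the adjusted p-value for one pair of carpet types, so pairs with small values are the ones whose means appear to differ.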