SEVERAL $\mu$S: COMPARISON
Business Statistics
CONTENTS
▪ Comparing two $\mu$s
▪ Comparing more than two $\mu$s
▪ Analysis of variance
▪ Testing significance of ANOVA
▪ Performing ANOVA
▪ The statistical model
▪ Equal variances
▪ Old exam question
▪ Further study
COMPARING TWO $\mu$S
Recall the comparison of the means of two independent samples ($\mu_1$ and $\mu_2$) with the $t$-test:
▪ $H_0: \mu_1 = \mu_2$
▪ under $H_0$: $\frac{\bar{Y}_1 - \bar{Y}_2}{s_{\bar{Y}_1 - \bar{Y}_2}} \sim t_{df}$, where $df$ is the number of degrees of freedom, which depends on the assumption made about the variances
▪ reject when $t_{calc} = \frac{\bar{y}_1 - \bar{y}_2}{s_{\bar{Y}_1 - \bar{Y}_2}} > t_{crit}$
Notation: earlier we used other symbols for the two samples; here we write $Y_1$ and $Y_2$.
Can we do this for three samples as well?
▪ $H_0: \mu_1 = \mu_2 = \mu_3$
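As a reminder of how the two-sample statistic is computed, here is a minimal sketch. It assumes equal population variances (so the pooled variance is used and $df = n_1 + n_2 - 2$); the function name `pooled_t_statistic` and the data are made up for illustration and are not part of the course material.

```python
import math
import statistics


def pooled_t_statistic(sample1, sample2):
    """Return (t_calc, df) for two independent samples.

    Assumption: both populations have the same variance, so the
    pooled variance estimate is used.
    """
    n1, n2 = len(sample1), len(sample2)
    mean1, mean2 = statistics.mean(sample1), statistics.mean(sample2)
    # statistics.variance uses the n - 1 denominator (sample variance)
    var1, var2 = statistics.variance(sample1), statistics.variance(sample2)
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / df
    std_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / std_error, df


# Hypothetical data, for illustration only
group1 = [12.0, 14.0, 16.0, 18.0]
group2 = [10.0, 11.0, 12.0, 15.0]
t_calc, df = pooled_t_statistic(group1, group2)
print(f"t_calc = {t_calc:.3f} with df = {df}")
```

The value of `t_calc` would then be compared with a critical value from the $t$-distribution with `df` degrees of freedom.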
COMPARING MORE THAN TWO $\mu$S
A first attempt:
▪ think about how $H_0: \mu_1 = \mu_2$ leads to $\frac{\bar{Y}_1 - \bar{Y}_2}{S_{\text{diff}}} \sim t_{df}$
▪ try whether $H_0: \mu_1 = \mu_2 = \mu_3$ leads to $\frac{\bar{Y}_1 - \bar{Y}_2 - \bar{Y}_3}{S_{\text{diff}}} \sim t_{df}$
Very wrong!
▪ every year a few students try this in their exam, but it is a dead end!
COMPARING MORE THAN TWO $\mu$S
A second attempt:
▪ pairwise comparisons: $\mu_1$ vs. $\mu_2$ and $\mu_2$ vs. $\mu_3$
  ▪ $\mu_1$ vs. $\mu_3$ not needed
▪ that implies doing two tests
  ▪ not one
For each test the probability of incorrectly rejecting $H_0$ is $\alpha$
▪ so with 2 (independent) tests, the probability of at least one incorrect rejection becomes $1 - (1 - \alpha)^2$
▪ example: $\alpha = 0.05$ gives $0.0975$, so almost double
So that will not work
▪ think about testing 10 means (45 pairwise comparisons): the probability of at least one wrong decision becomes $1 - 0.95^{45} \approx 0.90$, so 90%
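The arithmetic above can be checked with a short sketch. It assumes the tests are independent, which is what the formula $1 - (1 - \alpha)^k$ requires; the function names are made up for illustration.

```python
def familywise_error_rate(alpha, n_tests):
    """Probability of at least one incorrect rejection in n_tests
    independent tests, each performed at significance level alpha."""
    return 1 - (1 - alpha) ** n_tests


def n_pairwise_comparisons(n_means):
    """Number of distinct pairs among n_means group means."""
    return n_means * (n_means - 1) // 2


two_tests = familywise_error_rate(0.05, 2)
many_tests = familywise_error_rate(0.05, n_pairwise_comparisons(10))
print(f"2 tests:  {two_tests:.4f}")
print(f"{n_pairwise_comparisons(10)} tests: {many_tests:.4f}")
```

With two tests the error rate is close to $2\alpha$; with all pairwise comparisons among 10 means it is close to 0.90, which is why a single overall test is preferred.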
COMPARING MORE THAN TWO $\mu$S
A third attempt:
▪ We can partition the sum of squares into two parts:
  ▪ between the three groups
  ▪ within each group
▪ Try to identify sources of variation in a numerical dependent variable $Y$ (the response variable)
▪ Variation in $Y$ about its mean is partly explained by a categorical independent variable (the factor, with different levels)
▪ and partly unexplained (random error)
COMPARING MORE THAN TWO $\mu$S
Example
▪ Chip defect rates differ from batch to batch, but are there systematic differences between manufacturers?
  ▪ numerical dependent variable: chip defect rate ($Y$)
  ▪ categorical independent variable (one factor with four levels): manufacturer (1-4)
COMPARING MORE THAN TWO $\mu$S
Statistical model (formulation 1):
▪ manufacturer 1: $Y_{i1} = \mu_1 + \varepsilon_{i1}$
▪ ...
▪ manufacturer 4: $Y_{i4} = \mu_4 + \varepsilon_{i4}$
Where:
▪ $Y$ is the defect rate
▪ $\mu_1$ is the mean for manufacturer 1
▪ ...
▪ $\mu_4$ is the mean for manufacturer 4
▪ $\varepsilon$ is the random, unexplained part
Null hypothesis:
▪ $H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4$
COMPARING MORE THAN TWO $\mu$S
Statistical model (formulation 2):
▪ manufacturer 1: $Y_{i1} = \mu + \alpha_1 + \varepsilon_{i1}$
▪ ...
▪ manufacturer 4: $Y_{i4} = \mu + \alpha_4 + \varepsilon_{i4}$
Note: this "group effect" $\alpha_j$ has nothing to do with the significance level $\alpha$!
Where:
▪ $Y$ is the defect rate
▪ $\mu$ is the overall mean defect rate
▪ $\alpha_1$ is manufacturer 1's mean deviation from $\mu$
▪ ...
▪ $\alpha_4$ is manufacturer 4's mean deviation from $\mu$
▪ $\varepsilon$ is the random, unexplained part
Null hypothesis:
▪ $\mu + \alpha_1 = \cdots = \mu + \alpha_4$, that is,
▪ $H_0: \alpha_1 = \alpha_2 = \alpha_3 = \alpha_4 = 0$
COMPARING MORE THAN TWO $\mu$S
▪ What is the alternative hypothesis $H_1$?
  ▪ when $H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4$
▪ Formulation 1:
  ▪ wrong: $H_1: \mu_1 \neq \mu_2 \neq \mu_3 \neq \mu_4$
  ▪ correct: $H_1$: not $\mu_1 = \mu_2 = \mu_3 = \mu_4$
  ▪ or: at least one of the $\mu$s differs from the other $\mu$s
▪ Examples of situations covered by $H_1$: $\mu_1 = \mu_2 = \mu_3 \neq \mu_4$, but also $\mu_1 \neq \mu_2 \neq \mu_3 \neq \mu_4$
COMPARING MORE THAN TWO $\mu$S
▪ What is the alternative hypothesis $H_1$?
  ▪ when $H_0: \alpha_1 = \alpha_2 = \alpha_3 = \alpha_4 = 0$
▪ Formulation 2:
  ▪ wrong: $H_1: \alpha_1 \neq \alpha_2 \neq \alpha_3 \neq \alpha_4 \neq 0$
  ▪ correct: $H_1$: not $\alpha_1 = \alpha_2 = \alpha_3 = \alpha_4 = 0$
  ▪ or: at least one of the $\alpha$s differs from 0
▪ Examples of situations covered by $H_1$: $\alpha_1 = \alpha_2 = \alpha_3 \neq \alpha_4$, but also $\alpha_1 \neq \alpha_2 \neq \alpha_3 \neq \alpha_4$
EXERCISE 1
We want to investigate possible differences in mean income in Atlanta, Boston, Chicago and Detroit.
a. What is the null hypothesis?
b. Suppose the null hypothesis is rejected. What can you conclude?
ANALYSIS OF VARIANCE
▪ Define notation:
  ▪ $y$ is the numerical value (e.g., chip defect rate)
  ▪ $y_{ij}$ is the value for observation #$i$ within treatment #$j$ (e.g., machine #$j$)
  ▪ $\bar{y}_{\cdot j}$ is the average over all observations ($i = 1, \ldots, n_j$) within treatment #$j$
  ▪ $\bar{y}_{\cdot\cdot}$ is the average over all observations within all treatments ($j = 1, \ldots, c$)
Observe the position of the dots: a dot indicates that the index in that position has been averaged over.
▪ Analysis of variance (ANOVA) model:
  $Y_{ij} = \mu_j + \varepsilon_{ij}$ or $Y_{ij} = \mu + \alpha_j + \varepsilon_{ij}$
ANALYSIS OF VARIANCE
Compare variation within groups to variation between groups
▪ variation within group #$j$:
  ▪ $SSE_j = \sum_{i=1}^{n_j} (y_{ij} - \bar{y}_{\cdot j})^2$
▪ variation within all groups $j = 1, \ldots, c$:
  ▪ $SSE = \sum_{j=1}^{c} SSE_j = \sum_{j=1}^{c} \sum_{i=1}^{n_j} (y_{ij} - \bar{y}_{\cdot j})^2$
▪ variation between the $c$ groups
  ▪ so due to the $\alpha$s:
  ▪ $SSA = \sum_{j=1}^{c} n_j (\bar{y}_{\cdot j} - \bar{y}_{\cdot\cdot})^2$
So $SSA$ is the variation around the mean $\bar{y}_{\cdot\cdot}$ that is explained by the model, by factor "A"
ANALYSIS OF VARIANCE
Together, $SSA$ and $SSE$ make up the total variation
▪ variation in entire data set:
  ▪ $SST = \sum_{j=1}^{c} \sum_{i=1}^{n_j} (y_{ij} - \bar{y}_{\cdot\cdot})^2$
  ▪ so $SST$ is the total variation around the grand mean $\bar{y}_{\cdot\cdot}$
▪ so
  ▪ $SST = SSA + SSE$
▪ Think about the logic: we are comparing several means by comparing two variances
  ▪ analysis of variance is used to compare $\mu_1, \mu_2, \ldots, \mu_c$
[Figure: $SST$ split into $SSA$ and $SSE$]
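The partition $SST = SSA + SSE$ can be verified numerically. The sketch below computes the three sums of squares directly from their definitions; the function name `sums_of_squares` and the data are hypothetical, chosen only to show unequal group sizes $n_j$.

```python
def sums_of_squares(groups):
    """Return (SSA, SSE, SST) for a list of groups of observations."""
    all_values = [y for group in groups for y in group]
    grand_mean = sum(all_values) / len(all_values)          # y-bar..
    group_means = [sum(g) / len(g) for g in groups]         # y-bar.j

    # Between groups: squared distance of each group mean from the
    # grand mean, weighted by the group size n_j
    ssa = sum(len(g) * (m - grand_mean) ** 2
              for g, m in zip(groups, group_means))
    # Within groups: squared distance of each value from its group mean
    sse = sum((y - m) ** 2
              for g, m in zip(groups, group_means) for y in g)
    # Total: squared distance of each value from the grand mean
    sst = sum((y - grand_mean) ** 2 for y in all_values)
    return ssa, sse, sst


# Hypothetical data: three groups with n_j = 3, 2 and 4 observations
data = [[2.0, 4.0, 6.0], [5.0, 7.0], [7.0, 9.0, 11.0, 13.0]]
ssa, sse, sst = sums_of_squares(data)
print(f"SSA = {ssa:.3f}, SSE = {sse:.3f}, SST = {sst:.3f}")
print(f"SSA + SSE = {ssa + sse:.3f}")
```

Up to floating-point rounding, the last line prints the same value as $SST$.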
ANALYSIS OF VARIANCE
Source of variation                    SS     df      MS                    F-ratio
between groups (due to factor "A")     SSA    c - 1   MSA = SSA / (c - 1)   F = MSA / MSE
within groups                          SSE    n - c   MSE = SSE / (n - c)
total                                  SST    n - 1
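The table can be filled in mechanically once $SSA$, $SSE$, $c$ and $n$ are known. The following is a minimal sketch; the function name `anova_table` and the input numbers are made up for illustration.

```python
def anova_table(ssa, sse, c, n):
    """Build the one-factor ANOVA table from the sums of squares,
    the number of groups c and the total number of observations n."""
    df_between = c - 1
    df_within = n - c
    msa = ssa / df_between      # mean square between groups
    mse = sse / df_within       # mean square within groups
    return {
        "between": {"SS": ssa, "df": df_between, "MS": msa, "F": msa / mse},
        "within": {"SS": sse, "df": df_within, "MS": mse},
        "total": {"SS": ssa + sse, "df": n - 1},
    }


# Hypothetical sums of squares for c = 3 groups and n = 15 observations
table = anova_table(ssa=40.0, sse=24.0, c=3, n=15)
for source, row in table.items():
    print(source, row)
```

Note that the degrees of freedom add up in the same way as the sums of squares: $(c - 1) + (n - c) = n - 1$.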
EXERCISE 2
We sample the incomes of 100 persons from the four cities (30 from Atlanta, 20 from Boston, and 25 each from Chicago and Detroit).
a. What are $c$ and $n$ in the previous scheme?
b. Specify $Y_{ij} = \mu_j + \varepsilon_{ij}$ for the case of the 8th respondent from Chicago.
TESTING SIGNIFICANCE OF ANOVA
What do we test?
▪ $H_0: \mu_1 = \mu_2 = \cdots = \mu_c$
▪ or equivalently $H_0: \alpha_1 = \alpha_2 = \cdots = \alpha_c = 0$
How do we test?
▪ by comparing $SSA$ and $SSE$
▪ or rather, after dividing each by its degrees of freedom, $MSA$ and $MSE$ (which are variances!)
▪ if $H_0$ is true, $MSA$ and $MSE$ both estimate the same variance, so they are expected to be about equal
▪ their ratio is the test statistic: $F = \frac{MSA}{MSE}$
▪ if this ratio is large, the group averages are likely to differ
TESTING SIGNIFICANCE OF ANOVA
The test statistic $F$
▪ is likely to be around 1 if $H_0$ is true
▪ is likely to be much larger than 1 if $H_1$ is true
▪ has a sampling distribution $F_{c-1,\,n-c}$ under $H_0$
Here $F_{df_1, df_2}$ is the $F$-distribution with $df_1$ and $df_2$ degrees of freedom
Reject for large values of $F = \frac{MSA}{MSE}$ only, because we only reject $H_0$ if the variation between groups is larger than expected under $H_0$
TESTING SIGNIFICANCE OF ANOVA
So step 3 becomes:
▪ under $H_0$, $F = \frac{MSA}{MSE} \sim F_{c-1,\,n-c}$
▪ under the assumption: $\varepsilon_{ij} \sim N(0, \sigma^2)$
In other words, the assumptions of ANOVA are:
▪ the observations $y_{ij}$ should be independent
▪ the sub-populations should be normally distributed
▪ the sub-populations should have equal variances
Fortunately, ANOVA is somewhat robust to
▪ departures from normality and
▪ violations of the equal-variance assumption
TESTING SIGNIFICANCE OF ANOVA
Five step procedure for ANOVA
▪ Step 1:
  ▪ $H_0: \alpha_1 = \alpha_2 = \cdots = \alpha_c = 0$; $H_1$: not $H_0$; $\alpha = 0.05$
▪ Step 2:
  ▪ sample statistic: $F = \frac{MSA}{MSE}$; reject for large values
▪ Step 3:
  ▪ under $H_0$: $F = \frac{MSA}{MSE} \sim F_{c-1,\,n-c}$
  ▪ requirement: normal populations with equal variance
▪ Step 4:
  ▪ calculate $F_{crit} = F_{upper;\,df_1, df_2, \alpha}$
  ▪ or calculate $p\text{-value} = P(F \geq F_{calc})$
▪ Step 5:
  ▪ reject $H_0$ if $F_{calc} > F_{crit}$
  ▪ or reject $H_0$ if $p\text{-value} < \alpha$
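The five steps can be traced in a small end-to-end sketch using the critical-value route of steps 4 and 5. Everything specific here is an assumption for illustration: the data are made up, the function name `one_way_f` is hypothetical, and the critical value is the rounded table value of about 3.89 for $\alpha = 0.05$ with $df_1 = 2$ and $df_2 = 12$, which matches this example only.

```python
def one_way_f(groups):
    """Return (F_calc, df1, df2) for a one-factor ANOVA."""
    n = sum(len(g) for g in groups)      # total number of observations
    c = len(groups)                      # number of groups
    grand_mean = sum(sum(g) for g in groups) / n
    ssa = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    sse = sum((y - sum(g) / len(g)) ** 2 for g in groups for y in g)
    df1, df2 = c - 1, n - c
    return (ssa / df1) / (sse / df2), df1, df2


# Step 1: H0: all group effects are 0; H1: not H0; alpha = 0.05
ALPHA = 0.05

# Hypothetical data: c = 3 groups with 5 observations each
samples = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [3.0, 4.0, 5.0, 6.0, 7.0],
    [6.0, 7.0, 8.0, 9.0, 10.0],
]

# Steps 2 and 3: F = MSA / MSE, which follows F(c-1, n-c) under H0
f_calc, df1, df2 = one_way_f(samples)

# Step 4: critical value looked up in an F table for alpha = 0.05
# and (2, 12) degrees of freedom (rounded; valid for this example only)
F_CRIT = 3.89

# Step 5: reject H0 for large values of F only
reject_h0 = f_calc > F_CRIT
print(f"F_calc = {f_calc:.3f} with df = ({df1}, {df2})")
print("reject H0" if reject_h0 else "do not reject H0")
```

For other group counts or sample sizes the critical value has to be looked up again, or replaced by a $p$-value computed with statistical software.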