business statistics


  1. SEVERAL $\mu$S: COMPARISON. Business Statistics

  2. CONTENTS
     ▪ Comparing two $\mu$s
     ▪ Comparing more than two $\mu$s
     ▪ Analysis of variance
     ▪ Testing significance of ANOVA
     ▪ Performing ANOVA
     ▪ The statistical model
     ▪ Equal variances
     ▪ Old exam question
     ▪ Further study

  3. COMPARING TWO $\mu$S
     Recall the comparison of the means of two independent samples ($Y_1$ and $Y_2$) with the $t$-test:
     ▪ $H_0: \mu_1 = \mu_2$
     ▪ under $H_0$: $\dfrac{\bar{Y}_1 - \bar{Y}_2}{s_{\bar{Y}_1 - \bar{Y}_2}} \sim t_{\text{df}}$, where df is the number of degrees of freedom, which depends on the assumption made about the variances
     ▪ reject when $t_{\text{calc}} = \dfrac{\bar{y}_1 - \bar{y}_2}{s_{\bar{Y}_1 - \bar{Y}_2}} > t_{\text{crit}}$
     Note on notation: we earlier wrote $X_1$ and $X_2$, or $X$ and $Y$, so why not $Y_1$ and $Y_2$?
     Can we do this for three samples as well?
     ▪ $H_0: \mu_1 = \mu_2 = \mu_3$
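
As a reminder of how the two-sample statistic is computed, here is a minimal sketch. It assumes equal population variances (the pooled-variance version of the test, one of the possible assumptions mentioned above); the function name and the data are hypothetical and serve only as an illustration.

```python
import math
from statistics import mean, variance


def pooled_t_statistic(sample1, sample2):
    """Return (t_calc, df) for two independent samples, assuming equal variances."""
    n1, n2 = len(sample1), len(sample2)
    # Pooled variance: weighted average of the two sample variances.
    pooled_var = ((n1 - 1) * variance(sample1) + (n2 - 1) * variance(sample2)) / (n1 + n2 - 2)
    # Estimated standard error of the difference between the two sample means.
    se_diff = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t_calc = (mean(sample1) - mean(sample2)) / se_diff
    return t_calc, n1 + n2 - 2


# Hypothetical data, for illustration only.
t_calc, df = pooled_t_statistic([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
print(round(t_calc, 4), df)
```

Under the unequal-variance assumption both the standard error and df would be computed differently, so this sketch covers only one of the cases.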

  4. COMPARING MORE THAN TWO $\mu$S
     A first attempt:
     ▪ think about how $H_0: \mu_1 = \mu_2$ leads to $\dfrac{\bar{Y}_1 - \bar{Y}_2}{\sigma_{\text{diff}}} \sim t_{\text{df}}$
     ▪ try if $H_0: \mu_1 = \mu_2 = \mu_3$ leads to $\dfrac{\bar{Y}_1 - \bar{Y}_2 - \bar{Y}_3}{\sigma_{\text{diff}}} \sim t_{\text{df}}$
     Very wrong!
     ▪ every year a few students try this at their exam, but it is a dead end!

  5. COMPARING MORE THAN TWO $\mu$S
     A second attempt:
     ▪ pairwise comparisons: $Y_1$ vs. $Y_2$ and $Y_2$ vs. $Y_3$
     ▪ $Y_1$ vs. $Y_3$ not needed
     ▪ that implies doing two tests
     ▪ not one
     For each test the probability of incorrectly rejecting $H_0$ is $\alpha$
     ▪ so with 2 (independent) tests, the probability of at least one incorrect rejection becomes $1 - (1 - \alpha)^2$
     ▪ example: $\alpha = 0.05$ gives $0.0975$, so almost double
     So that will not work
     ▪ think about testing 10 means (45 pairwise comparisons): the probability of a wrong decision becomes about 90%
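
The two numbers on this slide can be checked with a few lines of code. This is a small sketch that assumes the tests are independent, which is what the formula $1 - (1 - \alpha)^k$ presupposes; the function name is hypothetical.

```python
from math import comb


def familywise_error(alpha, n_tests):
    """Probability of at least one incorrect rejection in n_tests independent tests."""
    return 1 - (1 - alpha) ** n_tests


# Two tests at alpha = 0.05: approximately 0.0975.
error_two_tests = familywise_error(0.05, 2)

# Ten means compared pairwise: comb(10, 2) = 45 comparisons.
n_pairs = comb(10, 2)
error_all_pairs = familywise_error(0.05, n_pairs)

print(round(error_two_tests, 4), n_pairs, round(error_all_pairs, 2))
```

With 45 comparisons the result is roughly 0.90, which matches the 90% quoted above.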

  6. COMPARING MORE THAN TWO $\mu$S
     A third attempt:
     ▪ We can partition the sum of squares into two parts:
       ▪ between the three groups
       ▪ within each group
     ▪ Try to identify sources of variation in a numerical dependent variable $Y$ (the response variable)
     ▪ Variation in $Y$ about its mean is partly explained by a categorical independent variable (the factor, with different levels)
     ▪ and partly unexplained (random error)

  7. COMPARING MORE THAN TWO $\mu$S
     Example
     ▪ Chip defect rates differ from batch to batch, but are there systematic differences between manufacturers?
     ▪ numerical dependent variable: chip defect rate ($Y$)
     ▪ categorical independent variable (one factor with four levels): manufacturer (1-4)

  8. COMPARING MORE THAN TWO $\mu$S
     Statistical model (formulation 1):
     ▪ manufacturer 1: $Y_{i1} = \mu_1 + \varepsilon_{i1}$
     ▪ ...
     ▪ manufacturer 4: $Y_{i4} = \mu_4 + \varepsilon_{i4}$
     Where:
     ▪ $Y$ is the defect rate
     ▪ $\mu_1$ is the mean for manufacturer 1
     ▪ ...
     ▪ $\mu_4$ is the mean for manufacturer 4
     ▪ $\varepsilon$ is the random, unexplained, part
     Null hypothesis:
     ▪ $\mu_1 = \mu_2 = \mu_3 = \mu_4$

  9. COMPARING MORE THAN TWO $\mu$S
     Statistical model (formulation 2):
     ▪ manufacturer 1: $Y_{i1} = \mu + \alpha_1 + \varepsilon_{i1}$
     ▪ ...
     ▪ manufacturer 4: $Y_{i4} = \mu + \alpha_4 + \varepsilon_{i4}$
     Note: this "group effect" $\alpha$ has nothing to do with the significance level $\alpha$!
     Where:
     ▪ $Y$ is the defect rate
     ▪ $\mu$ is the overall mean defect rate
     ▪ $\alpha_1$ is manufacturer 1's mean deviation from $\mu$
     ▪ ...
     ▪ $\alpha_4$ is manufacturer 4's mean deviation from $\mu$
     ▪ $\varepsilon$ is the random, unexplained, part
     Null hypothesis: $\mu + \alpha_1 = \cdots = \mu + \alpha_4$, that is
     ▪ $\alpha_1 = \alpha_2 = \alpha_3 = \alpha_4 = 0$

  10. COMPARING MORE THAN TWO $\mu$S
     ▪ What is the alternative hypothesis $H_1$?
       ▪ when $H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4$
     ▪ Formulation 1:
       ▪ wrong: $H_1: \mu_1 \neq \mu_2 \neq \mu_3 \neq \mu_4$
       ▪ correct: $H_1: \text{not } (\mu_1 = \mu_2 = \mu_3 = \mu_4)$
       ▪ or: at least one of the $\mu$s differs from the other $\mu$s
     ▪ Cases shown on the slide: $\mu_1 = \mu_2 = \mu_3 \neq \mu_4$ and $\mu_1 \neq \mu_2 \neq \mu_3 \neq \mu_4$

  11. COMPARING MORE THAN TWO $\mu$S
     ▪ What is the alternative hypothesis $H_1$?
       ▪ when $H_0: \alpha_1 = \alpha_2 = \alpha_3 = \alpha_4 = 0$
     ▪ Formulation 2:
       ▪ wrong: $H_1: \alpha_1 \neq \alpha_2 \neq \alpha_3 \neq \alpha_4 \neq 0$
       ▪ correct: $H_1: \text{not } (\alpha_1 = \alpha_2 = \alpha_3 = \alpha_4 = 0)$
       ▪ or: at least one of the $\alpha$s differs from 0
     ▪ Cases shown on the slide: $\alpha_1 = \alpha_2 = \alpha_3 \neq \alpha_4$ and $\alpha_1 \neq \alpha_2 \neq \alpha_3 \neq \alpha_4$

  12. EXERCISE 1 We want to investigate possible differences in mean income in Atlanta, Boston, Chicago and Detroit. a. What is the null hypothesis? b. Suppose the null hypothesis is rejected. What can you conclude?

  13. ANALYSIS OF VARIANCE
     ▪ Define notation:
       ▪ $y$ is the numerical value (e.g., chip defect rate)
       ▪ $y_{ij}$ is the value for observation #$i$ within treatment #$j$ (e.g., machine #$j$)
       ▪ $\bar{y}_{\bullet j}$ is the average over all observations ($i = 1, \ldots, n_j$) within treatment #$j$
       ▪ $\bar{\bar{y}}_{\bullet\bullet}$ is the average over all observations within all treatments ($j = 1, \ldots, c$)
     ▪ Observe the position of the dots: a dot indicates that the index has been averaged over
     ▪ Analysis of variance (ANOVA) model:
       $Y_{ij} = \mu_j + \varepsilon_{ij}$ or $Y_{ij} = \mu + \alpha_j + \varepsilon_{ij}$

  14. ANALYSIS OF VARIANCE
     Compare variation within groups to variation between groups
     ▪ variation within group #$j$:
       ▪ $SSW_j = \sum_{i=1}^{n_j} \left( y_{ij} - \bar{y}_{\bullet j} \right)^2$
     ▪ variation within all groups $j = 1, \ldots, c$:
       ▪ $SSW = \sum_{j=1}^{c} SSW_j = \sum_{j=1}^{c} \sum_{i=1}^{n_j} \left( y_{ij} - \bar{y}_{\bullet j} \right)^2$
     ▪ variation between the $c$ groups
       ▪ so due to the $\alpha$s:
       ▪ $SSA = \sum_{j=1}^{c} n_j \left( \bar{y}_{\bullet j} - \bar{\bar{y}}_{\bullet\bullet} \right)^2$
     So $SSA$ is the variation around the mean $\bar{\bar{y}}_{\bullet\bullet}$ that is explained by the model, by factor "A"

  15. ANALYSIS OF VARIANCE
     Together, $SSA$ and $SSW$ make up the total variation
     ▪ variation in entire data set:
       ▪ $SST = \sum_{j=1}^{c} \sum_{i=1}^{n_j} \left( y_{ij} - \bar{\bar{y}}_{\bullet\bullet} \right)^2$
     ▪ so $SST$ is the total variation around the grand mean $\bar{\bar{y}}_{\bullet\bullet}$, and
       ▪ $SST = SSA + SSW$
     ▪ Think about the logic: we are comparing several means by comparing two variances
       ▪ analysis of variance is used to compare $\mu_1, \mu_2, \ldots, \mu_c$
     [Diagram: $SST$ split into $SSA$ and $SSE$]
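
The partition $SST = SSA + SSW$ can be verified numerically. Below is a minimal sketch that follows the definitions on the two slides above; the group labels and values are hypothetical data made up for illustration, and the function name is an assumption.

```python
def sums_of_squares(groups):
    """Return (SSA, SSW, SST) for a list of groups of numerical observations."""
    n = sum(len(g) for g in groups)
    # Grand mean: average over all observations in all groups.
    grand_mean = sum(sum(g) for g in groups) / n
    # Group means: average over the observations within each group.
    group_means = [sum(g) / len(g) for g in groups]

    # Between-group variation, weighted by group size n_j.
    ssa = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Within-group variation, summed over all groups.
    ssw = sum((y - m) ** 2 for g, m in zip(groups, group_means) for y in g)
    # Total variation around the grand mean.
    sst = sum((y - grand_mean) ** 2 for g in groups for y in g)
    return ssa, ssw, sst


# Hypothetical observations for three groups of unequal size.
data = [[4.0, 5.0, 6.0], [7.0, 8.0, 9.0, 8.0], [5.0, 6.0]]
ssa, ssw, sst = sums_of_squares(data)
print(round(ssa, 4), round(ssw, 4), round(sst, 4))
```

For these made-up numbers $SSW = 4.5$, and $SSA + SSW$ agrees with $SST$ up to floating-point rounding.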

  16. ANALYSIS OF VARIANCE

  17. ANALYSIS OF VARIANCE

     | Source of variation | SS | df | MS | F-ratio |
     |---|---|---|---|---|
     | between groups (due to factor "A") | $SSA$ | $c - 1$ | $MSA = \dfrac{SSA}{c - 1}$ | $F = \dfrac{MSA}{MSW}$ |
     | within groups | $SSW$ | $n - c$ | $MSW = \dfrac{SSW}{n - c}$ | |
     | total | $SST$ | $n - 1$ | | |
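
The table translates directly into a small calculation. The sketch below fills in the ANOVA table from given sums of squares; the function name, the dictionary layout and the input values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def anova_table(ssa, ssw, n, c):
    """Build the one-way ANOVA table from SSA, SSW, sample size n and group count c."""
    msa = ssa / (c - 1)  # mean square between groups
    msw = ssw / (n - c)  # mean square within groups
    return {
        "between": {"SS": ssa, "df": c - 1, "MS": msa, "F": msa / msw},
        "within": {"SS": ssw, "df": n - c, "MS": msw},
        "total": {"SS": ssa + ssw, "df": n - 1},
    }


# Hypothetical sums of squares: 13 observations in 3 groups.
table = anova_table(ssa=20.0, ssw=30.0, n=13, c=3)
for source, row in table.items():
    print(source, row)
```

Note that the degrees of freedom add up in the same way as the sums of squares: $(c - 1) + (n - c) = n - 1$.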

  18. EXERCISE 2
     We sample the incomes of 100 persons from the four cities (30 from Atlanta, 20 from Boston, and 25 each from Chicago and Detroit).
     a. What are $n$ and $c$ in the previous scheme?
     b. Specify $Y_{ij} = \mu_j + \varepsilon_{ij}$ for the case of the 8th respondent from Chicago.

  19. TESTING SIGNIFICANCE OF ANOVA
     What do we test?
     ▪ $H_0: \mu_1 = \mu_2 = \cdots = \mu_c$
     ▪ or equivalently $H_0: \alpha_1 = \alpha_2 = \cdots = \alpha_c = 0$
     How do we test?
     ▪ by comparing $SSA$ and $SSW$
     ▪ or equivalently $MSA$ and $MSW$ (which are variances!)
     ▪ if $H_0$ is true, $MSA$ and $MSW$ are expected to be roughly equal, since both then estimate the same variance $\sigma^2$
     ▪ their ratio is the test statistic: $F = \dfrac{MSA}{MSW}$
     ▪ if this ratio is large, the group averages are likely to differ

  20. TESTING SIGNIFICANCE OF ANOVA
     The test statistic $F$
     ▪ is likely to be around 1 if $H_0$ is true
     ▪ is likely to be much larger than 1 if $H_1$ is true
     ▪ has a sampling distribution $F_{c-1,\,n-c}$ under $H_0$
     Here $F_{\text{df}_1,\text{df}_2}$ is the $F$-distribution with $\text{df}_1$ and $\text{df}_2$ degrees of freedom
     Reject for large values of $F = \dfrac{MSA}{MSW}$ only, because we reject $H_0$ only if the variation between groups is larger than expected under $H_0$
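
The claim that $F$ hovers around 1 under $H_0$ and grows under $H_1$ can be illustrated with a small simulation. This is a sketch under stated assumptions: normal errors with equal variance, four groups of 10 observations, and made-up group means. All names and settings are hypothetical.

```python
import random


def f_statistic(groups):
    """Compute F = MSA / MSW for a list of groups."""
    n = sum(len(g) for g in groups)
    c = len(groups)
    grand_mean = sum(sum(g) for g in groups) / n
    group_means = [sum(g) / len(g) for g in groups]
    ssa = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    ssw = sum((y - m) ** 2 for g, m in zip(groups, group_means) for y in g)
    return (ssa / (c - 1)) / (ssw / (n - c))


def average_f(group_means, n_per_group=10, sigma=1.0, runs=2000, seed=1):
    """Average F over many simulated data sets with the given true group means."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    total = 0.0
    for _ in range(runs):
        groups = [[rng.gauss(mu, sigma) for _ in range(n_per_group)] for mu in group_means]
        total += f_statistic(groups)
    return total / runs


# H0 true: all four (hypothetical) group means are equal.
avg_f_h0 = average_f([5.0, 5.0, 5.0, 5.0])
# H1 true: one group mean differs from the others.
avg_f_h1 = average_f([5.0, 5.0, 5.0, 6.5])
print(round(avg_f_h0, 2), round(avg_f_h1, 2))
```

The first average should come out close to 1 and the second clearly larger, which is why only large values of $F$ count as evidence against $H_0$.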

  21. TESTING SIGNIFICANCE OF ANOVA
     So step 3 becomes:
     ▪ under $H_0$, $F = \dfrac{MSA}{MSW} \sim F_{c-1,\,n-c}$
     ▪ under the assumption: $\varepsilon_{ij} \sim N(0, \sigma^2)$
     In other words, the assumptions of ANOVA are:
     ▪ the observations $y_{ij}$ should be independent
     ▪ the sub-populations should be normally distributed
     ▪ the sub-populations should have equal variances
     Fortunately, ANOVA is somewhat robust to departures from
     ▪ normality and
     ▪ the equal-variance assumption

  22. TESTING SIGNIFICANCE OF ANOVA
     Five step procedure for ANOVA
     ▪ Step 1:
       ▪ $H_0: \alpha_1 = \alpha_2 = \cdots = \alpha_c = 0$; $H_1$: not $H_0$; significance level $\alpha = 0.05$
     ▪ Step 2:
       ▪ sample statistic: $F = \dfrac{MSA}{MSW}$; reject for large values
     ▪ Step 3:
       ▪ under $H_0$: $F = \dfrac{MSA}{MSW} \sim F_{c-1,\,n-c}$
       ▪ requirement: normal populations with equal variance
     ▪ Step 4:
       ▪ calculate $F_{\text{crit}} = F_{\text{upper};\,\text{df}_1,\text{df}_2,\alpha}$
       ▪ or calculate $p\text{-value} = P(F \geq F_{\text{calc}})$
     ▪ Step 5:
       ▪ reject $H_0$ if $F_{\text{calc}} > F_{\text{crit}}$
       ▪ or reject $H_0$ if $p\text{-value} < \alpha$
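
The five steps can be walked through in code using the $p$-value route of steps 4 and 5. The sketch below is not a definitive implementation: it uses only the standard library and obtains the upper-tail probability of the $F$-distribution by numerically integrating the regularized incomplete beta function with Simpson's rule, which is an assumption of this example (a statistics package would normally supply this probability). The data and all function names are hypothetical.

```python
from math import exp, lgamma


def f_statistic(groups):
    """Return (F_calc, df1, df2) for a list of groups (step 2)."""
    n = sum(len(g) for g in groups)
    c = len(groups)
    grand_mean = sum(sum(g) for g in groups) / n
    group_means = [sum(g) / len(g) for g in groups]
    ssa = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    ssw = sum((y - m) ** 2 for g, m in zip(groups, group_means) for y in g)
    return (ssa / (c - 1)) / (ssw / (n - c)), c - 1, n - c


def f_upper_tail(f_calc, df1, df2, steps=2000):
    """Approximate P(F >= f_calc) for F ~ F(df1, df2); requires df2 >= 2.

    Uses P(F >= f) = I_x(df2/2, df1/2) with x = df2 / (df2 + df1 * f),
    where I_x is the regularized incomplete beta function.
    """
    if df2 < 2:
        raise ValueError("this sketch only handles df2 >= 2")
    if f_calc <= 0:
        return 1.0
    a, b = df2 / 2, df1 / 2
    x = df2 / (df2 + df1 * f_calc)
    h = x / steps  # steps must be even for Simpson's rule

    def integrand(t):
        return t ** (a - 1) * (1 - t) ** (b - 1)

    total = integrand(0.0) + integrand(x)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * integrand(k * h)
    integral = total * h / 3
    # Divide by the beta function B(a, b), computed via log-gamma.
    return integral / exp(lgamma(a) + lgamma(b) - lgamma(a + b))


# Step 1: H0: all group effects are 0; H1: not H0; significance level 0.05.
significance_level = 0.05
# Hypothetical observations for three groups.
data = [[4.0, 5.0, 6.0], [7.0, 8.0, 9.0, 8.0], [5.0, 6.0]]
# Steps 2 and 3: F_calc, which follows F(df1, df2) under H0.
f_calc, df1, df2 = f_statistic(data)
# Step 4: p-value = P(F >= F_calc).
p_value = f_upper_tail(f_calc, df1, df2)
# Step 5: reject H0 if the p-value is below the significance level.
reject_h0 = p_value < significance_level
print(round(f_calc, 3), df1, df2, round(p_value, 4), reject_h0)
```

The critical-value route of step 4 would need the inverse of this tail probability, which is usually read from a table or obtained from software, so it is left out of the sketch.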
