BUS41100 Applied Regression Analysis
Week 7: Regression Issues
Standardized and Studentized residuals, outliers and leverage, nonconstant variance, non-normality, nonlinearity, transformations, multicollinearity
Max H. Farrell
The University of Chicago Booth School of Business
Model assumptions

Y | X ∼ N(β_0 + β_1 X, σ²)

Key assumptions of our linear regression model:
(i) The conditional mean of Y is linear in X.
(ii) The additive errors (deviations from the line)
  ◮ are Normally distributed
  ◮ independent from each other
  ◮ identically distributed (i.e., they have constant variance)
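One quick way to see what these assumptions mean in practice is to simulate data that satisfies them. The sketch below is not from the slides; the intercept, slope, error standard deviation, and sample size are made-up values for illustration.

## Minimal sketch (illustrative values, not from the slides): data that
## satisfies the SLR assumptions -- linear mean, additive iid Normal errors.
set.seed(41100)
n <- 100
X <- runif(n, 0, 10)
Y <- 2 + 0.5*X + rnorm(n, mean=0, sd=1)   # beta0=2, beta1=0.5, sigma=1
fit <- lm(Y ~ X)
plot(fit$fitted.values, fit$residuals)    # should look like patternless noise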
Inference and prediction rely on this model being true!

If the model assumptions do not hold, then all bets are off:
◮ prediction can be systematically biased
◮ standard errors and confidence intervals are wrong (but how wrong?)

We will focus on using graphical methods (plots!) to detect violations of the model assumptions. You'll see that
◮ it is more of an art than a science,
◮ but it is grounded in mathematics.
Example model violations

Anscombe's quartet comprises four datasets that have similar statistical properties . . .

> attach(anscombe <- read.csv("anscombe.csv"))
> c(x.m1=mean(x1), x.m2=mean(x2), x.m3=mean(x3), x.m4=mean(x4))
x.m1 x.m2 x.m3 x.m4
   9    9    9    9
> c(y.m1=mean(y1), y.m2=mean(y2), y.m3=mean(y3), y.m4=mean(y4))
    y.m1     y.m2     y.m3     y.m4
7.500909 7.500909 7.500000 7.500909
> c(x.sd1=sd(x1), x.sd2=sd(x2), x.sd3=sd(x3), x.sd4=sd(x4))
   x.sd1    x.sd2    x.sd3    x.sd4
3.316625 3.316625 3.316625 3.316625
> c(y.sd1=sd(y1), y.sd2=sd(y2), y.sd3=sd(y3), y.sd4=sd(y4))
   y.sd1    y.sd2    y.sd3    y.sd4
2.031568 2.031657 2.030424 2.030579
> c(cor1=cor(x1,y1), cor2=cor(x2,y2), cor3=cor(x3,y3), cor4=cor(x4,y4))
     cor1      cor2      cor3      cor4
0.8164205 0.8162365 0.8162867 0.8165214
. . . but vary considerably when graphed.

[Figure: scatter plots of the four Anscombe datasets, y1 vs x1 through y4 vs x4, on common axes.]
Similarly, let's consider linear regression for each dataset.

[Figure: the same four scatter plots, each with its fitted least squares line added.]
The regression lines and even R² values are the same...

> ansreg <- list(reg1=lm(y1~x1), reg2=lm(y2~x2),
+                reg3=lm(y3~x3), reg4=lm(y4~x4))
> attach(ansreg)
> cbind(reg1$coef, reg2$coef, reg3$coef, reg4$coef)
                 [,1]     [,2]      [,3]      [,4]
(Intercept) 3.0000909 3.000909 3.0024545 3.0017273
x1          0.5000909 0.500000 0.4997273 0.4999091
> smry <- lapply(ansreg, summary)
> c(smry$reg1$r.sq, smry$reg2$r.sq,
+   smry$reg3$r.sq, smry$reg4$r.sq)
[1] 0.6665425 0.6662420 0.6663240 0.6667073
...but the residuals (plotted against Ŷ) look totally different.

[Figure: residuals plotted against fitted values for reg1 through reg4 (regK$residuals vs regK$fitted).]
Plotting e vs Ŷ is your #1 tool for finding fit problems.

Why?
◮ Because it gives a quick visual indicator of whether or not the model assumptions are true.

What should we expect to see if they are true?
1. Each ε_i has the same variance (σ²).
2. Each ε_i has the same mean (0).
3. The ε_i collectively have the same Normal distribution.

Remember: Ŷ is made from X in SLR and MLR, so one plot summarizes across the X.
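As a sketch, the residual-vs-fitted panels above can be reproduced directly from the ansreg list built on the earlier slide; this assumes that list is still in the workspace.

## Sketch: residuals vs fitted values for each Anscombe regression
## (assumes the ansreg list from the earlier slide exists).
par(mfrow=c(2,2))
for (r in ansreg) plot(r$fitted.values, r$residuals,
                       xlab="fitted", ylab="residuals")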
How do we check these?

Well, the true ε_i errors are unknown, so we must look instead at the least squares estimated residuals.

◮ We estimate Y_i = b_0 + b_1 X_i + e_i, such that the sample least squares regression residuals are

  e_i = Y_i − Ŷ_i

What should the e_i look like if the SLR model is true?
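A small check of this definition (a sketch using the first Anscombe regression): the residuals stored in the fitted lm object are exactly Y minus the fitted values.

## Sketch: e_i = Y_i - Yhat_i, checked for reg1.
e1 <- y1 - reg1$fitted.values
all.equal(as.numeric(e1), as.numeric(reg1$residuals))   # TRUE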
If the SLR model is true, it turns out that:

  e_i ∼ N(0, σ²[1 − h_i]),   where   h_i = 1/n + (X_i − X̄)² / Σ_{j=1}^n (X_j − X̄)².

The h_i term is referred to as the i-th observation's leverage:
◮ It is that point's share of the data (1/n) plus its proportional contribution to variability in X.

Notice that as n → ∞, h_i → 0 and residuals e_i "obtain" the same distribution as the unknown errors ε_i, i.e., e_i ∼ N(0, σ²).

—————————————
See handout on course page for derivations.
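A sketch of the leverage formula in R, using the first Anscombe regression: hatvalues() returns the h_i, and the manual formula above should reproduce them.

## Sketch: leverage two ways for reg1 -- built-in hatvalues() vs the formula.
h.builtin <- hatvalues(reg1)
h.manual  <- 1/length(x1) + (x1 - mean(x1))^2 / sum((x1 - mean(x1))^2)
all.equal(as.numeric(h.builtin), h.manual)   # TRUE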
Understanding Leverage

The h_i leverage term measures sensitivity of the estimated least squares regression line to changes in Y_i.

The term "leverage" provides a mechanical intuition:
◮ The farther you are from a pivot joint, the more torque you have pulling on a lever.

Online illustration of leverage: https://rstudio-class.chicagobooth.edu

Outliers do more damage if they have high leverage!
Standardized residuals

Since e_i ∼ N(0, σ²[1 − h_i]), we know that

  e_i / (σ √(1 − h_i)) ∼ N(0, 1).

These transformed e_i's are called the standardized residuals.
◮ They all have the same distribution if the SLR model assumptions are true.
◮ They are almost (close enough) independent (iid ∼ N(0, 1)).
◮ Estimate σ² using σ̂² or s².
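A sketch of the computation, plugging in s for the unknown σ; R's rstandard() does the same thing.

## Sketch: standardized residuals by hand (with s in place of sigma),
## compared to R's rstandard().
s <- summary(reg1)$sigma
h <- hatvalues(reg1)
std.res <- reg1$residuals / (s * sqrt(1 - h))
all.equal(as.numeric(std.res), as.numeric(rstandard(reg1)))   # TRUE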
About estimating s under sketchy SLR assumptions ...

We want to see whether any particular e_i is "too big", but we don't want a single outlier to make s artificially large.

> plot(x3, y3, col=3, pch=20, cex=1.5)
> abline(reg3, col=3)

[Figure: scatter plot of y3 vs x3 with the fitted regression line.]

◮ One big outlier can make s overestimate σ.
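A sketch of this effect on dataset 3: refit without the outlying point (observation 3, as the rstudent() output on the next slide confirms) and compare the two estimates of σ.

## Sketch: one outlier inflates s. Compare sigma-hat with and without obs 3.
summary(reg3)$sigma                    # with the outlier
summary(lm(y3[-3] ~ x3[-3]))$sigma     # without it -- much smaller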
Studentized residuals

We thus define a Studentized residual as

  r_i = e_i / (s_{−i} √(1 − h_i)),   where   s²_{−i} = (1/(n − p − 1)) Σ_{j≠i} e_j²

is σ̂² calculated without e_i.

These are easy to get in R with the rstudent() function.

> rstudent(reg3)
 [1]   -0.4390554   -0.1855022 1203.5394638   -0.3138441
 [5]   -0.5742948   -1.1559818    0.0664074    0.3618514
 [9]   -0.7356770   -0.0657680    0.2002633
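One way to see the leave-one-out idea is with an explicit refit: drop observation i, take σ̂ from that fit, and plug it into the formula. This sketch uses the refit rather than the summation formula above; the result should match the corresponding rstudent() value.

## Sketch: the leave-one-out idea behind rstudent(), checked for one point.
i   <- 3
h   <- hatvalues(reg3)
s.i <- summary(lm(y3[-i] ~ x3[-i]))$sigma        # sigma-hat without obs i
reg3$residuals[i] / (s.i * sqrt(1 - h[i]))       # compare to rstudent(reg3)[i]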
Outliers and Studentized residuals

Since the studentized residuals should be ≈ N(0, 1), we should be concerned about any r_i outside of about [−3, 3].

[Figure: raw residuals (left) and rstudent(reg3) (right) plotted against reg3$fitted.]

These aren't hard and fast cutoffs. As n gets bigger, we expect to see some very rare events (big ε_i) and should not get worried unless |r_i| > 4.
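A sketch of turning the rule of thumb into a check: flag any observation whose studentized residual falls outside the chosen cutoff.

## Sketch: flag observations with |r_i| above a cutoff (3 here).
r <- rstudent(reg3)
which(abs(r) > 3)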
How to deal with outliers

When should you delete outliers?
◮ Only when you have a really good reason!

There is nothing wrong with running a regression with and without potential outliers to see whether results are significantly impacted. Any time outliers are dropped, the reasons for doing so should be clearly noted.
◮ I maintain that both a statistical and a non-statistical reason are required. (What?)
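A sketch of the with/without comparison for dataset 3, treating observation 3 as the suspected outlier; report both fits rather than silently dropping the point.

## Sketch: fit with and without the suspected outlier and compare coefficients.
with.all    <- lm(y3 ~ x3)
without.obs <- lm(y3[-3] ~ x3[-3])
cbind(with=coef(with.all), without=coef(without.obs))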