BUS41100 Applied Regression Analysis
Week 7: Regression Issues
Standardized and Studentized residuals, outliers and leverage, nonconstant variance, non-normality, nonlinearity, transformations, multicollinearity
Max H. Farrell
The University of Chicago Booth School of Business
Model assumptions

Y | X ∼ N(β_0 + β_1 X, σ²)

Key assumptions of our linear regression model:
(i) The conditional mean of Y is linear in X.
(ii) The additive errors (deviations from the line)
  ◮ are Normally distributed
  ◮ independent from each other
  ◮ identically distributed (i.e., they have constant variance)
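One quick way to see what these assumptions mean in practice is to simulate data that satisfies them. The sketch below is not from the slides; the intercept, slope, error standard deviation, and sample size are made-up values for illustration.

## Minimal sketch (illustrative values, not from the slides): data that
## satisfies the SLR assumptions -- linear mean, additive iid Normal errors.
set.seed(41100)
n <- 100
X <- runif(n, 0, 10)
Y <- 2 + 0.5*X + rnorm(n, mean=0, sd=1)   # beta0=2, beta1=0.5, sigma=1
fit <- lm(Y ~ X)
plot(fit$fitted.values, fit$residuals)    # should look like patternless noise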
Inference and prediction rely on this model being true!

If the model assumptions do not hold, then all bets are off:
◮ prediction can be systematically biased
◮ standard errors and confidence intervals are wrong (but how wrong?)

We will focus on using graphical methods (plots!) to detect violations of the model assumptions. You'll see that
◮ it is more of an art than a science,
◮ but it is grounded in mathematics.
Example model violations

Anscombe's quartet comprises four datasets that have similar statistical properties . . .

> attach(anscombe <- read.csv("anscombe.csv"))
> c(x.m1=mean(x1), x.m2=mean(x2), x.m3=mean(x3), x.m4=mean(x4))
x.m1 x.m2 x.m3 x.m4
   9    9    9    9
> c(y.m1=mean(y1), y.m2=mean(y2), y.m3=mean(y3), y.m4=mean(y4))
    y.m1     y.m2     y.m3     y.m4
7.500909 7.500909 7.500000 7.500909
> c(x.sd1=sd(x1), x.sd2=sd(x2), x.sd3=sd(x3), x.sd4=sd(x4))
   x.sd1    x.sd2    x.sd3    x.sd4
3.316625 3.316625 3.316625 3.316625
> c(y.sd1=sd(y1), y.sd2=sd(y2), y.sd3=sd(y3), y.sd4=sd(y4))
   y.sd1    y.sd2    y.sd3    y.sd4
2.031568 2.031657 2.030424 2.030579
> c(cor1=cor(x1,y1), cor2=cor(x2,y2), cor3=cor(x3,y3), cor4=cor(x4,y4))
     cor1      cor2      cor3      cor4
0.8164205 0.8162365 0.8162867 0.8165214
. . . but vary considerably when graphed.

[Figure: scatter plots of the four Anscombe datasets, y1 vs x1 through y4 vs x4, on common axes.]
Similarly, let's consider linear regression for each dataset.

[Figure: the same four scatter plots, each with its fitted least squares line added.]
The regression lines and even R² values are the same...

> ansreg <- list(reg1=lm(y1~x1), reg2=lm(y2~x2),
+                reg3=lm(y3~x3), reg4=lm(y4~x4))
> attach(ansreg)
> cbind(reg1$coef, reg2$coef, reg3$coef, reg4$coef)
                 [,1]     [,2]      [,3]      [,4]
(Intercept) 3.0000909 3.000909 3.0024545 3.0017273
x1          0.5000909 0.500000 0.4997273 0.4999091
> smry <- lapply(ansreg, summary)
> c(smry$reg1$r.sq, smry$reg2$r.sq,
+   smry$reg3$r.sq, smry$reg4$r.sq)
[1] 0.6665425 0.6662420 0.6663240 0.6667073
...but the residuals (plotted against Ŷ) look totally different.

[Figure: residuals plotted against fitted values for reg1 through reg4 (regK$residuals vs regK$fitted).]
Plotting e vs Ŷ is your #1 tool for finding fit problems.

Why?
◮ Because it gives a quick visual indicator of whether or not the model assumptions are true.

What should we expect to see if they are true?
1. Each ε_i has the same variance (σ²).
2. Each ε_i has the same mean (0).
3. The ε_i collectively have the same Normal distribution.

Remember: Ŷ is made from X in SLR and MLR, so one plot summarizes across the X.
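As a sketch, the residual-vs-fitted panels above can be reproduced directly from the ansreg list built on the earlier slide; this assumes that list is still in the workspace.

## Sketch: residuals vs fitted values for each Anscombe regression
## (assumes the ansreg list from the earlier slide exists).
par(mfrow=c(2,2))
for (r in ansreg) plot(r$fitted.values, r$residuals,
                       xlab="fitted", ylab="residuals")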
How do we check these?

Well, the true ε_i errors are unknown, so we must look instead at the least squares estimated residuals.

◮ We estimate Y_i = b_0 + b_1 X_i + e_i, such that the sample least squares regression residuals are

  e_i = Y_i − Ŷ_i

What should the e_i look like if the SLR model is true?
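A small check of this definition (a sketch using the first Anscombe regression): the residuals stored in the fitted lm object are exactly Y minus the fitted values.

## Sketch: e_i = Y_i - Yhat_i, checked for reg1.
e1 <- y1 - reg1$fitted.values
all.equal(as.numeric(e1), as.numeric(reg1$residuals))   # TRUE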
If the SLR model is true, it turns out that:

  e_i ∼ N(0, σ²[1 − h_i]),   where   h_i = 1/n + (X_i − X̄)² / Σ_{j=1}^n (X_j − X̄)².

The h_i term is referred to as the i-th observation's leverage:
◮ It is that point's share of the data (1/n) plus its proportional contribution to variability in X.

Notice that as n → ∞, h_i → 0 and residuals e_i "obtain" the same distribution as the unknown errors ε_i, i.e., e_i ∼ N(0, σ²).

—————————————
See handout on course page for derivations.
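A sketch of the leverage formula in R, using the first Anscombe regression: hatvalues() returns the h_i, and the manual formula above should reproduce them.

## Sketch: leverage two ways for reg1 -- built-in hatvalues() vs the formula.
h.builtin <- hatvalues(reg1)
h.manual  <- 1/length(x1) + (x1 - mean(x1))^2 / sum((x1 - mean(x1))^2)
all.equal(as.numeric(h.builtin), h.manual)   # TRUE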
Understanding Leverage

The h_i leverage term measures sensitivity of the estimated least squares regression line to changes in Y_i.

The term "leverage" provides a mechanical intuition:
◮ The farther you are from a pivot joint, the more torque you have pulling on a lever.

Online illustration of leverage: https://rstudio-class.chicagobooth.edu

Outliers do more damage if they have high leverage!
Standardized residuals

Since e_i ∼ N(0, σ²[1 − h_i]), we know that

  e_i / (σ √(1 − h_i)) ∼ N(0, 1).

These transformed e_i's are called the standardized residuals.
◮ They all have the same distribution if the SLR model assumptions are true.
◮ They are almost (close enough) independent (iid ∼ N(0, 1)).
◮ Estimate σ² using σ̂² or s².
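A sketch of the computation, plugging in s for the unknown σ; R's rstandard() does the same thing.

## Sketch: standardized residuals by hand (with s in place of sigma),
## compared to R's rstandard().
s <- summary(reg1)$sigma
h <- hatvalues(reg1)
std.res <- reg1$residuals / (s * sqrt(1 - h))
all.equal(as.numeric(std.res), as.numeric(rstandard(reg1)))   # TRUE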
About estimating s under sketchy SLR assumptions ...

We want to see whether any particular e_i is "too big", but we don't want a single outlier to make s artificially large.

> plot(x3, y3, col=3, pch=20, cex=1.5)
> abline(reg3, col=3)

[Figure: scatter plot of y3 vs x3 with the fitted regression line.]

◮ One big outlier can make s overestimate σ.
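A sketch of this effect on dataset 3: refit without the outlying point (observation 3, as the rstudent() output on the next slide confirms) and compare the two estimates of σ.

## Sketch: one outlier inflates s. Compare sigma-hat with and without obs 3.
summary(reg3)$sigma                    # with the outlier
summary(lm(y3[-3] ~ x3[-3]))$sigma     # without it -- much smaller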
Studentized residuals

We thus define a Studentized residual as

  r_i = e_i / (s_{−i} √(1 − h_i)),   where   s²_{−i} = (1/(n − p − 1)) Σ_{j≠i} e_j²

is σ̂² calculated without e_i.

These are easy to get in R with the rstudent() function.

> rstudent(reg3)
 [1]   -0.4390554   -0.1855022 1203.5394638   -0.3138441
 [5]   -0.5742948   -1.1559818    0.0664074    0.3618514
 [9]   -0.7356770   -0.0657680    0.2002633
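One way to see the leave-one-out idea is with an explicit refit: drop observation i, take σ̂ from that fit, and plug it into the formula. This sketch uses the refit rather than the summation formula above; the result should match the corresponding rstudent() value.

## Sketch: the leave-one-out idea behind rstudent(), checked for one point.
i   <- 3
h   <- hatvalues(reg3)
s.i <- summary(lm(y3[-i] ~ x3[-i]))$sigma        # sigma-hat without obs i
reg3$residuals[i] / (s.i * sqrt(1 - h[i]))       # compare to rstudent(reg3)[i]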
Outliers and Studentized residuals

Since the studentized residuals should be ≈ N(0, 1), we should be concerned about any r_i outside of about [−3, 3].

[Figure: raw residuals (left) and rstudent(reg3) (right) plotted against reg3$fitted.]

These aren't hard and fast cutoffs. As n gets bigger, we expect to see some very rare events (big ε_i) and should not get worried unless |r_i| > 4.
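A sketch of turning the rule of thumb into a check: flag any observation whose studentized residual falls outside the chosen cutoff.

## Sketch: flag observations with |r_i| above a cutoff (3 here).
r <- rstudent(reg3)
which(abs(r) > 3)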
How to deal with outliers

When should you delete outliers?
◮ Only when you have a really good reason!

There is nothing wrong with running a regression with and without potential outliers to see whether results are significantly impacted. Any time outliers are dropped, the reasons for doing so should be clearly noted.
◮ I maintain that both a statistical and a non-statistical reason are required. (What?)
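A sketch of the with/without comparison for dataset 3, treating observation 3 as the suspected outlier; report both fits rather than silently dropping the point.

## Sketch: fit with and without the suspected outlier and compare coefficients.
with.all    <- lm(y3 ~ x3)
without.obs <- lm(y3[-3] ~ x3[-3])
cbind(with=coef(with.all), without=coef(without.obs))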