residual analysis

ST 430/514 Introduction to Regression Analysis/Statistics for Management and the Social Sciences II


  1. Residual Analysis
     Inferences about a regression model are valid only under assumptions about the random errors in the observations.
     Objectives:
       Show how residuals reveal departures from assumptions;
       Suggest procedures for coping with such departures.

  2. Regression Residuals
     The random errors $\epsilon$ satisfy $Y = E(Y) + \epsilon$, or $\epsilon = Y - E(Y)$.
     We observe $Y$, but we do not know $E(Y)$, so we cannot calculate $\epsilon$.
     We estimate $E(Y)$ by $\hat{y}$, the predicted (or fitted) value.
     We approximate the random errors by the regression residuals:
       $$\hat{\epsilon}_i = y_i - \hat{y}_i, \quad i = 1, 2, \dots, n.$$

  3. Properties of residuals
     If the model contains an intercept, the sum of the residuals, and also their mean, is zero:
       $$\sum_{i=1}^{n} \hat{\epsilon}_i = 0, \quad \text{and so} \quad \bar{\hat{\epsilon}} = 0.$$
     The covariance of the residuals and any term in the regression model is zero:
       $$\sum_{i=1}^{n} \hat{\epsilon}_i x_{i,j} = 0, \quad j = 1, 2, \dots, k.$$
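
The two properties above can be checked numerically. The following is a minimal sketch on simulated data; the variable names (`x1`, `x2`, `y`, `fit`, `e`) and the simulated coefficients are assumptions made for illustration and do not come from the course data sets.

```r
# Simulated data (hypothetical; not one of the course files)
set.seed(1)
n  <- 50
x1 <- runif(n, 0, 10)
x2 <- rnorm(n)
y  <- 2 + 0.5 * x1 - 1.5 * x2 + rnorm(n)

fit <- lm(y ~ x1 + x2)   # the model contains an intercept
e   <- residuals(fit)    # e_i = y_i - yhat_i

# Residuals are the observed values minus the fitted values
max(abs(e - (y - fitted(fit))))

# With an intercept, the residuals sum to zero (up to rounding error)
sum(e)

# The residuals are orthogonal to each predictor in the model
sum(e * x1)
sum(e * x2)
```

All four printed quantities should be zero apart from floating-point rounding.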

  4. Detecting Lack of Fit
     A misspecified model is one that leaves out a relevant predictor (for example, a higher-order term in $x$).
     The errors of a misspecified model do not have mean zero at every value of the predictors, and the residuals show this as a systematic pattern.
     Example: serum cholesterol ($y$) and dietary fat ($x$) in Olympic athletes.
         ath <- read.table("Text/Exercises&Examples/OLYMPIC.txt", header = TRUE)
         pairs(ath)   # scatterplot matrix of all variables

  5. Suppose we ignore the graph, and fit a first-order model:
         l1 <- lm(CHOLESTEROL ~ FAT, data = ath)
         summary(l1)
         plot(ath$FAT, residuals(l1))   # residuals against x
     The summary of the fitted model looks reasonable.
     But the graph of the residuals against $x$ shows that the assumption $E(\epsilon) = 0$ is violated.
     Because this is a straight-line model, the fitted values are a linear function of $x$, so this graph is effectively the same as the “residuals versus fitted value” graph from plot(l1).

  6. Because of the curvature, we could fit the second-order (quadratic) model:
         l2 <- lm(CHOLESTEROL ~ FAT + I(FAT^2), data = ath)
         summary(l2)
         plot(ath$FAT, residuals(l2))   # residuals against x
     The residual plot suggests that the model is satisfactory.
     The quadratic term is highly significant.
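
The same pattern can be reproduced without the course data. The sketch below simulates a response whose mean is quadratic in `x`, fits both a straight line and a quadratic, and averages the residuals within three ranges of `x`. All names and numbers here (`fit1`, `fit2`, `bins`, the coefficients) are illustrative assumptions, not values from the OLYMPIC data.

```r
# Simulated data (hypothetical): the true mean is quadratic in x
set.seed(2)
n <- 60
x <- runif(n, 0, 10)
y <- 5 + 3 * x - 0.25 * x^2 + rnorm(n, sd = 0.5)

fit1 <- lm(y ~ x)            # misspecified: omits the x^2 term
fit2 <- lm(y ~ x + I(x^2))   # includes the curvature

# Average residual in the lower, middle, and upper third of the x range
bins <- cut(x, breaks = c(0, 10 / 3, 20 / 3, 10))
m1 <- tapply(residuals(fit1), bins, mean)
m2 <- tapply(residuals(fit2), bins, mean)
m1   # a systematic pattern: the sign changes across the range of x
m2   # no comparable pattern is expected here

# The residual plots tell the same story graphically
plot(x, residuals(fit1)); abline(h = 0, lty = 2)
plot(x, residuals(fit2)); abline(h = 0, lty = 2)
```

Although the residuals of `fit1` average to zero overall, their local averages are far from zero, which is what the curved residual plot displays.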

  7. Partial residuals
     Sometimes the effect of an independent variable is better described by a transformed version: $\log(x)$, $1/x$, etc.
     The partial residual plot can help identify the transformation:
       The partial residuals for independent variable $x_j$ are
         $$\hat{\epsilon}^* = \hat{\epsilon} + \hat{\beta}_j x_j.$$
       Plot $\hat{\epsilon}^*$ against $x_j$.
     Also known as a “Component + Residual” plot.
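
Partial residuals can be computed directly from the definition above. The following sketch uses simulated data in which the true effect of `x1` is `1/x1`; the names `x1`, `x2`, `pr`, and `built_in` are assumptions for illustration. It also compares the hand computation with base R's `residuals(fit, type = "partial")`, which uses centered terms and therefore differs from the definition above by a constant shift.

```r
# Simulated data (hypothetical): the true effect of x1 is 1/x1
set.seed(3)
n  <- 80
x1 <- runif(n, 1, 10)
x2 <- runif(n, 0, 5)
y  <- 10 + 20 / x1 + 2 * x2 + rnorm(n, sd = 0.5)

fit <- lm(y ~ x1 + x2)   # first-order model

# Partial residuals for x1: residual + beta_hat_1 * x1
pr <- residuals(fit) + coef(fit)[["x1"]] * x1

# Base R version, which uses beta_hat_1 * (x1 - mean(x1)) as the component
built_in <- residuals(fit, type = "partial")[, "x1"]
shift <- coef(fit)[["x1"]] * mean(x1)
max(abs(unname(pr) - shift - unname(built_in)))   # zero up to rounding

# Component + Residual plot with a smooth curve to show the shape
plot(x1, pr, ylab = "partial residual")
lines(lowess(x1, pr))

# Rough numerical check of which transformation straightens the plot
cor(pr, x1)
cor(pr, 1 / x1)
```

A constant shift does not change the shape of the plot, so either version can be used to look for curvature.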

  8. Example
     Effect of price ($p$) and advertising ($x_2$) on demand ($y$) for coffee.
         coffee <- read.table("Text/Exercises&Examples/COFFEE2.txt", header = TRUE)
         pairs(coffee)   # scatterplot matrix of all variables
     Try a first-order model:
         l1 <- lm(DEMAND ~ PRICE + AD, data = coffee)
         summary(l1)
         plot(coffee$PRICE, residuals(l1))   # residuals against price
     The residual plot shows misspecification.

  9. The Component + Residual plot:
         library(car)
         crPlot(l1, variable = "PRICE")
     The curvature suggests either adding PRICE^2, or transforming to log(PRICE) or 1/PRICE.
     $R^2$ and $R^2_a$ (adjusted $R^2$) are highest for 1/PRICE.
     Note: the partial regression (added-variable) plot is a different plot.

  10. Detecting Unequal Variances
      Homoscedasticity versus heteroscedasticity: that is, constant variance versus varying variance.
      When the variance is not constant, it is most often related to the mean.
      For Poisson-distributed data (counts), $\mathrm{var}(Y) = E(Y)$.
      When errors are multiplicative, $Y = E(Y) \times (1 + \epsilon)$, and $\mathrm{var}(Y) \propto E(Y)^2$.

  11. Sometimes the variance can be made constant by transforming $Y$.
      For example, with multiplicative errors and small $\epsilon$,
        $$\log(Y) = \log[E(Y) \times (1 + \epsilon)] = \log[E(Y)] + \log[1 + \epsilon] \approx \log[E(Y)] + \epsilon.$$
      So $\mathrm{var}[\log Y]$ is (approximately) constant.
      Even when a transformation makes the variance constant, a different method may be better than using the transformation.
      For example, with Poisson-distributed counts, $\sqrt{Y}$ has approximately constant variance, but a generalized linear model may be more satisfactory.
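
Both variance-stabilizing claims can be illustrated by simulation. The sketch below is only an illustration: the mean levels, the error standard deviation of 0.1, and the sample sizes are arbitrary assumptions.

```r
# Simulated data (hypothetical): three mean levels, many draws at each
set.seed(4)
mu <- rep(c(10, 100, 1000), each = 2000)

# Multiplicative errors: Y = E(Y) * (1 + eps)
eps <- rnorm(length(mu), sd = 0.1)
y   <- mu * (1 + eps)
sd_raw <- tapply(y, mu, sd)        # grows in proportion to the mean
sd_log <- tapply(log(y), mu, sd)   # roughly the same at every mean level
sd_raw
sd_log

# Poisson counts: var(Y) = E(Y), while sqrt(Y) has variance near 1/4
y_pois   <- rpois(length(mu), lambda = mu)
var_raw  <- tapply(y_pois, mu, var)
var_sqrt <- tapply(sqrt(y_pois), mu, var)
var_raw
var_sqrt
```

The value 1/4 for the variance of $\sqrt{Y}$ is the usual large-mean approximation for Poisson counts, so it is least accurate at the smallest mean level.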

  12. Example
      Salary and experience for social workers.
          workers <- read.table("Text/Exercises&Examples/SOCWORK.txt", header = TRUE)
          pairs(workers)   # scatterplot matrix of all variables
      Try a second-order model:
          l2 <- lm(SALARY ~ EXP + I(EXP^2), data = workers)
          summary(l2)
          plot(l2)   # standard diagnostic plots

  13. The “Residuals vs Fitted” plot shows a fan-shaped scatter, and the “Scale-Location” plot shows an upward trend.
      This suggests $\mathrm{sd}(Y) \propto E(Y)$, hence $\mathrm{var}(Y) \propto E(Y)^2$, so try logarithms.

  14. Second-order model for log(SALARY):
          lLog2 <- lm(log(SALARY) ~ EXP + I(EXP^2), data = workers)
          summary(lLog2)
      The quadratic term is not significant, so try a first-order model:
          lLog1 <- lm(log(SALARY) ~ EXP, data = workers)
          summary(lLog1)
          plot(lLog1)   # standard diagnostic plots
      The residual plots are more satisfactory.

  15. Simple Test for Heteroscedasticity
      Divide the data set in two, for instance low fitted values versus high fitted values.
      Fit the model separately to each part, and compare the MSEs (Mean Squares for Error).
      Under $H_0$: variance is constant,
        $$F^* = \frac{\mathrm{MSE}_1}{\mathrm{MSE}_2}$$
      has the $F$-distribution with $\nu_1 = n_1 - (k + 1)$ and $\nu_2 = n_2 - (k + 1)$ degrees of freedom.

  16. This is usually a two-sided test; $H_a$: variance is not constant.
      Reject $H_0$ at level $\alpha$ if $F^*$ differs too much from 1 in either direction; that is, if
        $F^* < F_{1-\alpha/2}(\nu_1, \nu_2)$, the lower $\alpha/2$-point of the distribution, or
        $F^* > F_{\alpha/2}(\nu_1, \nu_2)$, the upper $\alpha/2$-point of the distribution.

  17. Note: $F_{1-\alpha/2}(\nu_1, \nu_2) = 1 / F_{\alpha/2}(\nu_2, \nu_1)$, so an equivalent method is based on
        $$F = \frac{\text{Larger MSE}}{\text{Smaller MSE}} = \max\left(F^*, \frac{1}{F^*}\right).$$
      Then we reject $H_0$ if $F > F_{\alpha/2}(\nu_{\text{Larger}}, \nu_{\text{Smaller}})$, where $\nu_{\text{Larger}}$ and $\nu_{\text{Smaller}}$ are the degrees of freedom of the larger and the smaller MSE.
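
The test can be carried out in a few lines of R. The sketch below uses simulated data with multiplicative errors, splits the observations at the median fitted value, and applies both forms of the rejection rule. The data, the split at the median, and all object names (`low`, `fit1`, `fit2`, `Fstar`, `Fmax`) are assumptions for illustration. Note that in R the upper $\alpha/2$-point $F_{\alpha/2}(\nu_1, \nu_2)$ is `qf(1 - alpha / 2, nu1, nu2)`.

```r
# Simulated data (hypothetical): the error sd grows with the mean
set.seed(5)
n <- 100
x <- runif(n, 1, 10)
y <- (2 + 3 * x) * (1 + rnorm(n, sd = 0.15))

fit <- lm(y ~ x)
low <- fitted(fit) <= median(fitted(fit))   # low versus high fitted values

fit1 <- lm(y ~ x, subset = low)
fit2 <- lm(y ~ x, subset = !low)

nu1  <- df.residual(fit1)                # n1 - (k + 1)
nu2  <- df.residual(fit2)                # n2 - (k + 1)
mse1 <- sum(residuals(fit1)^2) / nu1
mse2 <- sum(residuals(fit2)^2) / nu2

Fstar <- mse1 / mse2
alpha <- 0.05
lower <- qf(alpha / 2, nu1, nu2)         # lower alpha/2-point
upper <- qf(1 - alpha / 2, nu1, nu2)     # upper alpha/2-point
reject <- (Fstar < lower) || (Fstar > upper)

# Equivalent form: larger MSE over smaller MSE, upper point only
Fmax      <- max(Fstar, 1 / Fstar)
nu_larger  <- if (mse1 >= mse2) nu1 else nu2
nu_smaller <- if (mse1 >= mse2) nu2 else nu1
reject2 <- Fmax > qf(1 - alpha / 2, nu_larger, nu_smaller)

c(Fstar = Fstar, lower = lower, upper = upper)
c(reject = reject, reject2 = reject2)
```

The two decisions agree because the lower point for $(\nu_1, \nu_2)$ is the reciprocal of the upper point for $(\nu_2, \nu_1)$.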
