Regression Diagnostics: Introduction to Regression
Why do we need to do all this?
• Regression theory is based on assumptions
• Diagnostics focus on the residuals and fitted values
• They validate the model
• They give us clues about how to change the model
• Is the model appropriate?
• Lots of statistical tests rely on these assumptions
What shall we look at?
• Calculate residuals for each case: observed value – predicted value
• Standardise them by dividing by their (approximate) SD
• There are different types of standardised residuals
• We need to do a series of plots
• Remember what the model assumes, e.g. constant variance
• A sketch of the calculation follows below
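A minimal sketch, assuming a simple regression fitted in Python with statsmodels; the file name and the column names x and y are placeholders, not taken from the slides:

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("example.csv")               # hypothetical data with columns "x" and "y"

X = sm.add_constant(df["x"])                  # design matrix with an intercept
fit = sm.OLS(df["y"], X).fit()

raw_resid = df["y"] - fit.fittedvalues        # observed value - predicted value
std_resid = fit.get_influence().resid_studentized_internal  # scaled by their (approx.) SD

print(std_resid[:5])
```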
Standardised Residuals
• Should be
  – Small in size (most within about ±2)
  – Independent
  – Normally distributed
  – Constant in variance
  – Unrelated to the fitted values
  – Unrelated to the independent variables
Plots to compute
• Normal probability plot of the residuals
• Look for large standardised residuals and check those values
• Plot residuals vs fitted/predicted values
• Plot residuals vs each independent variable
• Plot residuals against time (if that is appropriate)
• A sketch of these plots is given below
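A hedged sketch of these plots, continuing from the hypothetical `df`, `fit` and `std_resid` defined above:

```python
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Normal probability (Q-Q) plot of the standardised residuals
sm.qqplot(std_resid, line="45")

# Large standardised residuals: list the corresponding cases for checking
print(df[abs(std_resid) > 2])

# Residuals vs fitted values, and vs each independent variable
for xvals, label in [(fit.fittedvalues, "Fitted values"), (df["x"], "x")]:
    plt.figure()
    plt.scatter(xvals, std_resid)
    plt.axhline(0, linestyle="--")
    plt.xlabel(label)
    plt.ylabel("Standardised residuals")

# Residuals against observation order / time, if that is meaningful
plt.figure()
plt.plot(std_resid, marker="o")
plt.show()
```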
The regression equation is sqrtrooms = 0.200 + 1.90 sqrtcrews
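A rough Python equivalent of fitting this square-root-transformed model; the data frame, file name and the column names rooms and crews are assumptions, not given on the slides:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

ships = pd.read_csv("ships.csv")              # hypothetical file with "rooms" and "crews"
ships["sqrtrooms"] = np.sqrt(ships["rooms"])
ships["sqrtcrews"] = np.sqrt(ships["crews"])

fit = smf.ols("sqrtrooms ~ sqrtcrews", data=ships).fit()
print(fit.params)                             # the slide reports intercept 0.200, slope 1.90
```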
Leverage
• Measures the distance from a case's x-values to the mean of the x-values
• High-leverage points may influence the results
• With p predictors and n observations, flag cases with leverage > 2(p+1)/n
• Be careful here
• A sketch of this check is given below
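A sketch of the leverage check with statsmodels, continuing the hypothetical `fit` and `df` from earlier:

```python
influence = fit.get_influence()
leverage = influence.hat_matrix_diag          # h_ii for each observation

n = len(leverage)
p = int(fit.df_model)                         # number of predictors
cutoff = 2 * (p + 1) / n                      # flag leverage above 2(p+1)/n

print(df[leverage > cutoff])                  # potential high-leverage cases
```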
Outliers and bad leverage points
• Examine them and see if they are genuinely different
• They flag a problem with the model
• Consider fitting another model
Cook's distance
• Measures the influence of an observation on the set of regression coefficients; influential observations can be leverage points, outliers, or both
• It is a function of the leverage and the standardised residual
• Look for gaps in the values
• A suggested cutoff is 4/(n-2)
• See what happens when you omit the flagged points (sketch below)
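A sketch of this check in statsmodels, again using the hypothetical `fit` and `df`; the cutoff follows the slide:

```python
import statsmodels.api as sm

influence = fit.get_influence()
cooks_d, _ = influence.cooks_distance         # distances (second element holds p-values)

n = len(cooks_d)
cutoff = 4 / (n - 2)                          # suggested cutoff from the slide
flagged = cooks_d > cutoff
print(df[flagged])

# Refit without the flagged points to see how the coefficients change
keep = ~flagged
refit = sm.OLS(df["y"][keep], sm.add_constant(df["x"][keep])).fit()
print(fit.params, refit.params)
```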
Made-up data

  SRESID   Leverage   Cook's
    1.54       0.26     0.42
   -4.35       0.26     3.35
Results including all points (output not shown)
Results without the point x=20, y=10 (output not shown)
Results without the point x=20, y=95 (output not shown)
DFITS
• Measures the influence of each observation on the fitted values
• Roughly the number of standard deviations that the fitted value changes when the observation is removed from the data set and the model is refit
• A sketch of this check is given below
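A sketch of the DFITS check, with the same hypothetical `fit` and `df`; the cutoff shown is a common rule of thumb, not one given on the slide:

```python
import numpy as np

influence = fit.get_influence()
dffits, _ = influence.dffits                  # DFITS value for each observation

n = len(dffits)
p = int(fit.df_model)
cutoff = 2 * np.sqrt((p + 1) / n)             # common rule of thumb (assumption)

print(df[np.abs(dffits) > cutoff])            # observations that strongly move their own fit
```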
What model to fit?
• Suppose we start with the linear model
  Salaries = α + β₁·Experience + ε
• And look at residuals vs fitted values
So what model should we fit?
• We started with the linear model Salaries = α + β₁·Experience + ε
• Create a new variable, Experience×Experience, and add it to the model
• Salaries = α + β₁·Experience + β₂·Experience² + ε
• This is a polynomial model (a fitting sketch follows below)
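A sketch of fitting both models with the statsmodels formula interface; the data frame `salaries` and its column names are assumed for illustration:

```python
import statsmodels.formula.api as smf

# Linear model: Salaries = a + b1*Experience + e
linear = smf.ols("Salaries ~ Experience", data=salaries).fit()

# Polynomial model: add the squared term
quadratic = smf.ols("Salaries ~ Experience + I(Experience**2)", data=salaries).fit()

print(linear.rsquared, quadratic.rsquared)    # compare fits, then re-check the residual plots
```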
Oregon Housing
• 76 single-family homes in Eugene, Oregon, during 2005
• Estate agents have their own methods of determining price
• The seller wanted a method of determining an asking price
Variables
• Price (thousands of $)
• Floor size (thousands of sq ft)
• Age of house
• Number of bedrooms
• Number of bathrooms
• Garage size
• School area
• Lot size (coded 1–11) – an interesting variable too
Coding of lot size

  Category   Lot size
      1      0–3k
      2      3–5k
      3      5–7k
      4      7–10k
      5      10–15k
      6      15–20k
      7      20k–1 acre
      8      1–3 ac
      9      3–5 ac
     10      5–10 ac
     11      10–20 ac

(0–3k means 0–3,000 sq ft; 1 acre = 43,560 sq ft)
Model
• We focus on three variables: Price, Size and Age
• Age is coded as (Year – 70)/10
• We fit two models:
  – Price = α + β₁·Size + β₂·Age + ε
  – Price = α + β₁·Size + β₂·Age + β₃·Age² + ε
• First we draw some graphs (a fitting sketch follows below)
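A hedged sketch of the two models; the data frame `homes` and the column names Price, Size and Age (already coded as above) are assumptions based on the slide:

```python
import statsmodels.formula.api as smf

model1 = smf.ols("Price ~ Size + Age", data=homes).fit()
model2 = smf.ols("Price ~ Size + Age + I(Age**2)", data=homes).fit()

print(model1.summary())
print(model2.summary())
```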
First model (output not shown)
Residuals vs Age (plot not shown)
Second model (output not shown)
Article: Pardoe, I. (2008). "Modeling Home Prices Using Realtor Data", Journal of Statistics Education, 16(2). Lundquist College of Business, University of Oregon. www.amstat.org/publications/jse/v16n2/datasets.pardoe.html
Conclusion
• Be sure to run diagnostics
• Examine the plots
• Check any funny-looking points
• Try out some changes to the model
Added variable plots
• Added-variable plots let us visually assess the effect of each predictor, having adjusted for the effects of the other predictors
• Take Y and two predictor variables X and Z
• Regress Y on X and calculate the residuals – Set 1
• Regress Z on X and calculate the residuals – Set 2
• Plot the Set 1 residuals vs the Set 2 residuals
And more…
• Residuals from Y on X = the part of Y not predicted by X
• Residuals from Z on X = the part of Z not predicted by X
• The added-variable plot for predictor Z shows the part of Y not predicted by X against the part of Z not predicted by X (sketch below)
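A minimal sketch of this construction for a response Y, a predictor of interest Z and an adjusting predictor X; the data frame `data` and its column names are placeholders:

```python
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Part of Y not predicted by X
resid_y = sm.OLS(data["Y"], sm.add_constant(data["X"])).fit().resid

# Part of Z not predicted by X
resid_z = sm.OLS(data["Z"], sm.add_constant(data["X"])).fit().resid

# Added-variable plot for Z; its slope equals Z's coefficient in the full model of Y on X and Z
plt.scatter(resid_z, resid_y)
plt.xlabel("Z adjusted for X")
plt.ylabel("Y adjusted for X")
plt.show()
```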
Another dataset
• Price = the price (in US$) of dinner (including one drink and a tip)
• Food = customer rating of the food (out of 30)
• Décor = customer rating of the décor (out of 30)
• Service = customer rating of the service (out of 30)
• East = 1 (0) if the restaurant is east (west) of Fifth Avenue
Added variable plots
• For the Food variable:
• Regress Price on Décor, Service and East – calculate the residuals
• Regress Food on Décor, Service and East – calculate the residuals
• Plot the two sets of residuals against each other (sketch below)
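The same recipe applied to the restaurant data for Food; the data frame `nyc` and the unaccented column name Decor are assumptions:

```python
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

# Residuals from Price regressed on the other predictors
price_resid = smf.ols("Price ~ Decor + Service + East", data=nyc).fit().resid

# Residuals from Food regressed on the same predictors
food_resid = smf.ols("Food ~ Decor + Service + East", data=nyc).fit().resid

# Added-variable plot for Food
plt.scatter(food_resid, price_resid)
plt.xlabel("Food adjusted for the others")
plt.ylabel("Price adjusted for the others")
plt.show()
```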