Gov 2000: 12. Troubleshooting the Linear Model
Matthew Blackwell
Fall 2016
1. Outliers, leverage points, and influential observations
2. Heteroskedasticity
3. Nonlinearity of the regression function
Where are we? Where are we going?
• Last few weeks: estimation and inference for the linear model under Gauss-Markov assumptions (and sometimes conditional Normality)
• This week: what happens when the assumptions fail? Can we tell? Can we fix it?
• Next weeks: dealing with panel data.
Review of the OLS assumptions
1. Linearity: $y_i = \mathbf{x}_i'\boldsymbol{\beta} + u_i$
2. Random sample: $(y_i, \mathbf{x}_i')$ are an iid sample from the population.
3. Full rank: $\mathbf{X}$ is an $n \times (k + 1)$ matrix with rank $k + 1$
4. Zero conditional mean: $\mathbb{E}[u_i \mid \mathbf{x}_i] = 0$
5. Homoskedasticity: $\mathbb{V}[u_i \mid \mathbf{x}_i] = \sigma^2_u$
6. Normality: $u_i \mid \mathbf{x}_i \sim N(0, \sigma^2_u)$
• 1–4 give us unbiasedness/consistency
• 1–5 are the Gauss-Markov assumptions, which allow for large-sample inference
• 1–6 allow for small-sample inference
Violations of the assumptions
Three issues today:
1. Influential observations that skew regression estimates
2. Violations of homoskedasticity
▶ ⇝ SEs are biased (usually downward)
3. Incorrect functional form/nonlinearity
▶ ⇝ biased/inconsistent estimates
1/ Outliers, leverage points, and influential observations
Example: Buchanan votes in Florida, 2000
• 2000 Presidential election in FL (Wand et al., 2001, APSR)
Example: Buchanan votes in Florida, 2000
[Figure: scatterplot of county-level Buchanan votes (0–3,500) against total votes (0–600,000)]
Example: Buchanan votes in Florida, 2000
[Figure: the same scatterplot with county labels; Palm Beach sits far above every other county, while large counties such as Pinellas, Hillsborough, Broward, and Miami-Dade lie along the main trend]
Example: Buchanan votes
mod <- lm(edaybuchanan ~ edaytotal, data = flvote)
summary(mod)
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 54.22945   49.14146    1.10     0.27
## edaytotal    0.00232    0.00031    7.48  2.4e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 333 on 65 degrees of freedom
## Multiple R-squared:  0.463, Adjusted R-squared:  0.455
## F-statistic: 56 on 1 and 65 DF,  p-value: 2.42e-10
Three types of extreme values
1. Leverage point: extreme in the $x$ direction
2. Outlier: extreme in the $y$ direction
3. Influence point: extreme in both directions
• Not all of these are problematic
• If the data are truly “contaminated” (come from a different distribution), extreme values can cause inefficiency and possibly bias
• Can be a violation of iid (not identically distributed)
• Diagnostics here are loose rules of thumb, not formal tests
Leverage point definition
[Figure: scatterplot with a leverage point at extreme $x$; regression lines shown for the full sample and without the leverage point]
• Values that are extreme in the $x$ direction
• That is, values far from the center of the covariate distribution
• Decrease SEs (more $x$ variation)
• No bias if typical in the $y$ dimension
Hat matrix
• First we need to define an important matrix:
$$\mathbf{H} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$$
• $\mathbf{H}$ is the hat matrix because it puts the “hat” on $\mathbf{y}$: $\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}$
• It also generates the residuals:
$$\hat{\mathbf{u}} = \mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{y} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} \equiv \mathbf{y} - \mathbf{H}\mathbf{y} = (\mathbf{I} - \mathbf{H})\mathbf{y}$$
▶ $\mathbf{H}$ is an $n \times n$ symmetric matrix
▶ $\mathbf{H}$ is idempotent: $\mathbf{H}\mathbf{H} = \mathbf{H}$
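A minimal R sketch of these properties, reusing the mod and flvote objects from the Buchanan example above:

X <- model.matrix(mod)                 # n x (k+1) design matrix
H <- X %*% solve(t(X) %*% X) %*% t(X)  # H = X(X'X)^{-1}X'
y <- flvote$edaybuchanan

all.equal(as.numeric(H %*% y), unname(fitted(mod)))  # Hy puts the hat on y: TRUE
all.equal(H, t(H))                                   # symmetric: TRUE
all.equal(H %*% H, H)                                # idempotent: TRUE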
Hat values
• $\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} = \mathbf{H}\mathbf{y}$
• For a particular observation $i$, we can show this means: $\hat{y}_i = \sum_{j=1}^{n} h_{ij} y_j$
• $h_{ij}$ = importance of observation $j$ for the fitted value $\hat{y}_i$
• Leverage/hat values: $h_i = h_{ii}$, the diagonal entries of the hat matrix
• With a simple linear regression, we have
$$h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j=1}^{n} (x_j - \bar{x})^2}$$
▶ ⇝ how far $i$ is from the center of the $x$ distribution
• Rule of thumb: examine hat values greater than $2(k+1)/n$
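A sketch checking the simple-regression formula and the rule of thumb in R (again assuming the Buchanan mod and flvote objects):

x <- flvote$edaytotal
h_manual <- 1 / length(x) + (x - mean(x))^2 / sum((x - mean(x))^2)
all.equal(unname(hatvalues(mod)), h_manual)  # TRUE

cutoff <- 2 * (1 + 1) / length(x)  # 2(k+1)/n, with k = 1 here
which(hatvalues(mod) > cutoff)     # counties worth inspecting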
Buchanan hats
head(hatvalues(mod), 5)
##       1       2       3       4       5
## 0.04179 0.02285 0.22066 0.01556 0.01493
Buchanan hats
[Figure: dot plot of hat values by county, ranging from roughly 0.01 to 0.25; a handful of counties stand well above the rest]
Outlier definition
[Figure: scatterplot with an outlier far above the rest of the data; regression lines shown for the full sample and without the outlier]
• An outlier is a data point with very large regression errors, $u_i$
• Very distant from the rest of the data in the $y$ dimension
• Increases standard errors (by increasing $\hat{\sigma}^2$)
• No bias if typical in the $x$’s
Detecting outliers
• Look for big residuals, right?
▶ Problem: the $\hat{u}_i$ are not identically distributed.
▶ Variance of the $i$th residual: $\mathbb{V}[\hat{u}_i \mid \mathbf{X}] = \sigma^2_u (1 - h_{ii})$
• Rescale to get standardized residuals with constant variance:
$$\hat{u}_i' = \frac{\hat{u}_i}{\hat{\sigma}\sqrt{1 - h_{ii}}}$$
• Rules of thumb:
▶ $|\hat{u}_i'| > 2$ will be relatively rare.
▶ $|\hat{u}_i'| > 4\text{–}5$ should definitely be checked.
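A sketch verifying this rescaling against R's built-in rstandard():

u_hat <- residuals(mod)
h <- hatvalues(mod)
sigma_hat <- summary(mod)$sigma            # residual standard error
std_manual <- u_hat / (sigma_hat * sqrt(1 - h))
all.equal(std_manual, rstandard(mod))      # TRUE

which(abs(rstandard(mod)) > 2)             # flag by the rule of thumb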
Buchanan outliers
std.resids <- rstandard(mod)
[Figure: standardized residuals plotted by observation index; Palm Beach is labeled far above the rest, with a standardized residual around 6, while the other counties fall roughly between -2 and 2]
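One way to produce a plot like this (the styling details are an assumption, not the slides' exact code):

plot(std.resids, xlab = "Index", ylab = "Standardized Residuals")
abline(h = c(-2, 2), lty = 2)  # rule-of-thumb bands
text(which.max(std.resids), max(std.resids), labels = "Palm Beach", pos = 2)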
Detecting outliers
• Standardized or regular residuals are not good for detecting outliers because the outliers might pull the regression line close to them.
• Better: leave-one-out prediction errors, $\tilde{u}_i$:
1. Regress $\mathbf{y}_{(-i)}$ on $\mathbf{X}_{(-i)}$, where these omit unit $i$:
$$\hat{\boldsymbol{\beta}}_{(-i)} = (\mathbf{X}_{(-i)}'\mathbf{X}_{(-i)})^{-1}\mathbf{X}_{(-i)}'\mathbf{y}_{(-i)}$$
2. Calculate the predicted value of $y_i$ from that regression: $\tilde{y}_i = \mathbf{x}_i'\hat{\boldsymbol{\beta}}_{(-i)}$
3. Calculate the prediction error: $\tilde{u}_i = y_i - \tilde{y}_i$
• It is possible to relate the prediction errors to the residuals:
$$\tilde{u}_i = \frac{\hat{u}_i}{1 - h_i}$$
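A sketch checking that this shortcut matches a literal leave-one-out fit (object names from the Buchanan example):

u_tilde <- residuals(mod) / (1 - hatvalues(mod))  # all n prediction errors at once

mod_loo <- lm(edaybuchanan ~ edaytotal, data = flvote[-1, ])  # drop unit 1
pred_1 <- predict(mod_loo, newdata = flvote[1, ])
flvote$edaybuchanan[1] - pred_1                   # equals u_tilde[1]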
Influence points
[Figure: scatterplot with an influence point at extreme $x$ and $y$; regression lines shown for the full sample and without the influence point]
• An influence point is one that is both an outlier and a leverage point.
• Extreme in both the $x$ and $y$ dimensions
• Causes the regression line to move toward it (bias?)
Overall measures of influence
• A rough measure of influence is to look at the difference between the fitted value and the predicted leave-one-out value: $\hat{y}_i - \tilde{y}_i$
▶ This is equivalent to $\tilde{u}_i \times h_i$, which is just “outlier-ness × leverage”
• Cook’s distance ( cooks.distance() ):
$$D_i = \frac{(\hat{u}_i')^2}{k + 1} \times \frac{h_i}{1 - h_i}$$
▶ Basically: “normalized outlier-ness × leverage”
▶ $D_i > 4/(n - k - 1)$ considered “large”, but cutoffs are arbitrary
• Influence plot:
▶ x-axis: hat values, $h_i$
▶ y-axis: standardized residuals, $\hat{u}_i'$
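A sketch verifying this formula against R's cooks.distance() (k = 1 in the Buchanan regression):

h <- hatvalues(mod)
r <- rstandard(mod)                        # standardized residuals
k <- length(coef(mod)) - 1
D_manual <- r^2 / (k + 1) * h / (1 - h)
all.equal(D_manual, cooks.distance(mod))   # TRUE

n <- nobs(mod)
which(D_manual > 4 / (n - k - 1))          # "large" by the rough cutoff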
Influence plot from lm output
plot(mod, which = 5, labels.id = flvote$county)
[Figure: residuals vs. leverage plot with Cook's distance contours; Palm Beach stands out with a standardized residual above 6, and Broward and Miami-Dade are also labeled]
Limitations of the standard tools
[Figure: scatterplot with two influence points; red and blue lines show the fits that drop one point each]
• What happens when there are two influence points?
• Red line drops the red influence point
• Blue line drops the blue influence point
• “Leave-one-out” approaches struggle to recover the line: dropping either point alone leaves the other to pull the fit
What to do about outliers and influential units?
• Is the data corrupted?
▶ Fix the observation (obvious data entry errors)
▶ Remove the observation
▶ Be transparent either way
• Is the outlier part of the data generating process?
▶ Transform the dependent variable ( $\log(y)$ )
▶ Use a method that is robust to outliers (robust regression, least absolute deviations)
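A sketch of those two strategies in R; MASS::rlm (robust M-estimation) and quantreg::rq (least absolute deviations) are standard implementations, though neither appears in the slides' own code:

mod_log <- lm(log(edaybuchanan) ~ edaytotal, data = flvote)  # assumes no zero vote counts

library(MASS)
mod_rlm <- rlm(edaybuchanan ~ edaytotal, data = flvote)      # robust regression

library(quantreg)
mod_lad <- rq(edaybuchanan ~ edaytotal, tau = 0.5, data = flvote)  # median (LAD) regression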
2/ Heteroskedasticity