

  1. 201ab Quantitative methods Linear model diagnostics.

  2. Model assumptions, in order of importance (1) Validity (2) Additivity and linearity (3) Independent errors (4) Normal errors (5) Homoscedastic errors (6) Error in y, not in x

  3. Validity & Generalization • What conclusions are drawn from a data analysis, and how do they relate to the data and the analysis? – How do the measured / manipulated variables correspond to the concepts in the conclusions? (Linking assumptions? Their justifiability?) – Which aspects of the desired generalization are represented in the measured variability? (Subjects? Stimuli? Manipulations?) – Are the premises and logic of your analysis sound? (“Availability” of k* words vs *k* words?)

  4. Additivity and Linearity • The linear model assumes linearity + additivity: y = B0 + B1x1 + B2x2 + … • Important violations to beware of: – Lots of measures are fundamentally not linear (need for linearizing transforms, etc.) – Lots of effects are fundamentally not linear (e.g., a dose-response curve cannot be linear)
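
A minimal sketch (entirely made-up variables) of the "linearizing transform" idea: a multiplicative relationship becomes linear after taking logs.

  set.seed(1)
  dose     <- runif(100, 1, 10)
  response <- 2 * dose^0.5 * exp(rnorm(100, sd = 0.1))   # multiplicative, nonlinear in dose
  m.raw <- lm(response ~ dose)               # assumes additivity/linearity: misspecified here
  m.log <- lm(log(response) ~ log(dose))     # log-log transform makes the model linear
  plot(fitted(m.raw), residuals(m.raw))      # curvature in the residuals reveals the problem
  plot(fitted(m.log), residuals(m.log))      # roughly patternless after the transform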

  5. Independent errors • The standard linear model assumes i.i.d. errors: y = … + e, e ~ N(0, σe) • Critical violations: – Measuring the same person many times (repeated measures) – Measuring a fixed set of stimuli (item random effects) – Measuring over time/space (smoothness/autocorrelation) – Error correlates with an explanatory variable (endogeneity). In these cases you need to use models that can handle it. • Less critical violations: – Weak correlations orthogonal to explanatory variables
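
For the repeated-measures / item-effects cases, one standard remedy is a mixed-effects model. A minimal sketch, assuming the lme4 package is installed; the data frame and variable names are hypothetical.

  library(lme4)
  # hypothetical repeated-measures data: 20 subjects, 10 observations each
  d <- data.frame(subject = factor(rep(1:20, each = 10)), x = rnorm(200))
  d$y <- 1 + 0.5 * d$x + rnorm(20)[as.integer(d$subject)] + rnorm(200, sd = 0.5)
  # a random intercept per subject absorbs the correlation among that subject's errors
  m <- lmer(y ~ x + (1 | subject), data = d)
  summary(m)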

  6. Normal, homoscedastic errors • Small deviations from normality / homoscedasticity are often not a big deal. • Large deviations from normality, in particular extreme outliers, may yield large errors in estimated coefficients that are not captured by our measures of uncertainty. This undermines generalization.
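
A tiny simulation (made-up numbers) of the point about extreme outliers: one wild observation can shift the fitted coefficients well beyond what the reported standard errors would suggest.

  set.seed(2)
  x <- rnorm(50)
  y <- 1 + 0.5 * x + rnorm(50, sd = 0.5)
  summary(lm(y ~ x))$coefficients    # slope near 0.5, small standard error
  y[50] <- 20                        # one extreme outlier
  summary(lm(y ~ x))$coefficients    # estimates can move substantially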

  7. Error in y, not in x • Error in x will cause us to underestimate coefficients. • Not really a big deal. • Errors-in-variables models deal with this if need be.
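
A quick simulation (hypothetical numbers) of the attenuation this refers to: noise in x shrinks the estimated slope toward zero, while noise in y does not bias it.

  set.seed(3)
  x.true <- rnorm(1000)
  y      <- 2 * x.true + rnorm(1000, sd = 1)
  x.obs  <- x.true + rnorm(1000, sd = 1)   # x measured with error
  coef(lm(y ~ x.true))   # slope near 2
  coef(lm(y ~ x.obs))    # slope attenuated toward ~1 here, by var(x.true)/(var(x.true)+1)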

  8. Model assumptions, in order of importance (1) Validity (2) Additivity and linearity (3) Independent errors (4) Normal errors (5) Homoscedastic errors (6) Error in y, not in x Caveat: “Importance” here is determined by my estimate of the expected magnitude of the problems caused by violations of these assumptions in the kinds of analyses people in this class will typically undertake in their research.

  9. Diagnostics you should undertake • Look at marginal histograms – Check for outliers – Check for skew and heavy tails

  10. Diagnostics you should undertake • Look at marginal histograms – Check for outliers – Check for skew and heavy tails • Look at scatterplots. – Check for major 2d non-linearities – Check for outliers. – Check for generalized weirdness.
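
A minimal sketch of these first-pass looks in base R, on made-up data (d, x, y are placeholders, not from the slides).

  d <- data.frame(x = rnorm(100))            # hypothetical data, for illustration only
  d$y <- 1 + 2 * d$x + rnorm(100)
  hist(d$x); hist(d$y)                       # marginal histograms: outliers, skew, heavy tails
  plot(d$x, d$y)                             # scatterplot: nonlinearity, outliers, general weirdness
  pairs(d)                                   # all pairwise scatterplots when there are several variables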

  12. Diagnostics you should undertake • Look at marginal histograms – Check for outliers – Check for skew and heavy tails • Look at scatterplots. – Check for major 2d non-linearities – Check for outliers. • Check various plots of residuals – Residuals ~ y_hat to check for non-linearities.

  13. Checking for non-linearity [Panels: Residual ~ x; Residual ~ y.hat] Residual plots highlight the non-linearity. For high-dimensional data, only Residual ~ y.hat is really possible to look at.
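
A runnable sketch of the Residual ~ y.hat plot, using toy data constructed to be nonlinear.

  x <- runif(100, 0, 10)
  y <- x^2 + rnorm(100, sd = 5)                             # deliberately nonlinear relationship
  m <- lm(y ~ x)
  plot(fitted(m), residuals(m)); abline(h = 0, lty = 2)     # U-shape betrays the nonlinearity
  plot(m, which = 1)                                        # built-in residuals-vs-fitted plot, same idea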

  14. Diagnostics you could undertake • Look at marginal histograms – Check for outliers – Check for skew and heavy tails • Look at scatterplots. – Check for major 2d non-linearities – Check for outliers. • Check various plots of residuals – Residuals ~ y_hat to check for non-linearities. – Absolute residuals ~ y_hat to check for homoscedasticity.

  15. Checking for homoscedasticity Homoscedasticity: the variance of the residuals is constant. Plots: |residual| ~ y.hat, spreadLevelPlot(lm(y~x)), plot(lm(y~x), which=3). Test for non-constant variance (heteroscedasticity) based on a regression of the squared errors on the fitted y values: the “Breusch-Pagan test”. (For categorical predictors there is a different, somewhat more powerful procedure.) ncvTest(lm(y~x)) Non-constant Variance Score Test Variance formula: ~ fitted.values Chisquare = 10.68375 Df = 1 p = 0.00108081
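
The slide's commands assembled into a runnable sketch (car package assumed installed; toy data constructed to be heteroscedastic).

  library(car)
  x <- runif(200, 1, 10)
  y <- 1 + 2 * x + rnorm(200, sd = x)     # error sd grows with x: heteroscedastic by construction
  m <- lm(y ~ x)
  plot(fitted(m), abs(residuals(m)))      # |residual| ~ y.hat: look for a trend in the spread
  plot(m, which = 3)                      # scale-location plot
  spreadLevelPlot(m)                      # car's spread-level plot (also suggests a power transform)
  ncvTest(m)                              # Breusch-Pagan-style non-constant variance test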

  16. Diagnostics you could undertake • Look at marginal histograms – Check for outliers – Check for skew and heavy tails • Look at scatterplots. – Check for major 2d non-linearities – Check for outliers. • Check various plots of residuals – Residuals ~ y_hat to check for non-linearities. – Absolute residuals ~ y_hat to check for homoscedasticity. – Standardized residual QQ plots check for Normality

  17. Studentized / Standardized residuals Residuals (estimated error): ε̂i = yi − ŷi, the deviation of the real y value from the line. Standardized residuals: ri = ε̂i / s, the residual divided by the sd of the residuals (s). These should be t distributed, so we can compare them to the t distribution to look for abnormalities / outliers: qqPlot(lm(y~x)). Large deviations from the theoretical t distribution can be tested for (via t-test!) and extreme outliers will be evident this way.
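
A sketch of the built-in extractors for these quantities (car's qqPlot assumed; toy data).

  library(car)
  x <- rnorm(100); y <- 1 + 0.5 * x + rnorm(100)   # toy data
  m <- lm(y ~ x)
  e.hat <- residuals(m)             # raw residuals: y_i - y.hat_i
  r.std <- rstandard(m)             # internally studentized ("standardized") residuals
  r.ext <- rstudent(m)              # externally studentized residuals, roughly t distributed
  qqPlot(m)                         # plots studentized residuals against t quantiles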

  18. Checking for normal residuals Look at the qq plot; test with a Kolmogorov-Smirnov test: qqPlot(lm(y~x)) ks.test(rstudent(lm(y~x)), "pt", length(y)-2) One-sample Kolmogorov-Smirnov test data: rstudent(lm(y ~ x)) D = 0.1398, p-value = 0.04002 alternative hypothesis: two-sided Generally, though, it’s fine to ignore slight but significant deviations

  19. Diagnostics you could undertake • Look at marginal histograms – Check for outliers – Check for skew and heavy tails • Look at scatterplots. – Check for major 2d non-linearities – Check for outliers. • Check various plots of residuals – Residuals ~ y_hat to check for non-linearities. – Absolute residuals ~ y_hat to check for homoscedasticity. – Standardized residual QQ plots check for Normality – Standardized residual ~ leverage for outlier effects, Cook’s dist.

  20. Testing for outliers outlierTest(lm(y~x)) reports, for the most extreme observations, the studentized residual, the uncorrected p-value, and the Bonferroni-corrected p-value, e.g.: obs 6: 4.31, 0.0004, 0.0088; obs 16: −4.31, 0.0004, 0.0088. These tests for outliers tend to be less sensitive than the eye: if there is a significant outlier, we will be able to see it, but if we can see it, it may still not be significant (usually due to the heavy tails of the low-df t distribution).
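
Roughly what outlierTest is doing can be reproduced by hand; a sketch on toy data with a planted outlier.

  library(car)                                    # for outlierTest; assumed installed
  x <- rnorm(50); y <- 1 + 0.5 * x + rnorm(50)
  y[6] <- y[6] + 5                                # plant an outlier at observation 6
  m  <- lm(y ~ x)
  rs <- rstudent(m)
  n <- length(rs); k <- 1                         # k = number of predictors
  p.unadj <- 2 * pt(-abs(rs), df = n - k - 2)     # two-sided p for each studentized residual
  p.bonf  <- pmin(1, n * p.unadj)                 # Bonferroni correction for testing all n residuals
  which(p.bonf < 0.05)                            # candidate outliers
  outlierTest(m)                                  # car's packaged version of the same idea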

  21. Leverage Leverage in statistics is like leverage in physics: with a long enough lever (a predictor far enough away from the mean) you can make a regression line do whatever you want. Leverage is potential influence. With many predictors what matters is ~Mahalanobis distance: distance from the center of mass scaled by the covariance matrix. This is hard to visualize, so it’s useful to just look at the leverage numbers, and particularly, whether there are large residuals at large leverage – that is bad.
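
Leverage values (hat values) can be extracted directly; a sketch on toy data, using one common rule-of-thumb cutoff (not the only one).

  x <- c(rnorm(49), 8)                  # one x value far from the rest: high leverage
  y <- 1 + 0.5 * x + rnorm(50)
  m <- lm(y ~ x)
  h <- hatvalues(m)                     # leverage of each observation
  n <- length(y); k <- 1
  which(h > 2 * (k + 1) / n)            # a common (not universal) flag for "high" leverage
  plot(h, rstudent(m))                  # a big residual at big leverage is the worrying combination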

  22. Cook’s distance plot(lm(y~x), which=5) A data point with a lot of leverage and large residuals is exerting undue influence on the regression. Cook’s distance measures this. Several, equally correct, ways to think about Cook’s distance: (1) How much will my regression coefficients change without this data point? (2) How much will the predicted Y values change without this data point? (3) A combination of leverage and residual to ascertain point’s influence.

  23. Outliers and extreme influence Data points with large residuals and/or high leverage. How do we measure this extreme influence? Outlier detection: qqPlot, outlierTest. Look at residuals as a function of leverage: plot(lm(y~x), which=5). Compute Cook’s distance: plot(lm(y~x), which=4).

  24. Outliers and extreme influence Data points with large residuals, and/or high leverage

  25. Cook’s distance Several equally correct ways to think about Cook’s distance: (1) How much will my regression coefficients change without this data point? (2) How much will the predicted Y values change without this data point? (3) A combination of leverage and residual to ascertain a point’s influence. plot(lm(y~x), which=4) We can just look at the Cook’s distance for different data points, to see if some are extremely influential. How much influence is too much? (a) D > 1? (b) D > 4/n? (c) D > 4/(n-k-1)? Different folks have different standards…
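
A sketch of extracting Cook's distances and applying the (debatable) cutoffs from the slide, on toy data.

  x <- c(rnorm(49), 8)
  y <- 1 + 0.5 * x + rnorm(50)
  y[50] <- 30                            # make the high-leverage point an outlier too
  m <- lm(y ~ x)
  d.cook <- cooks.distance(m)
  n <- length(y); k <- 1
  which(d.cook > 4 / n)                  # flagged by the 4/n rule
  which(d.cook > 4 / (n - k - 1))        # flagged by the 4/(n-k-1) rule
  plot(m, which = 4)                     # Cook's distance for each observation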

  26. Diagnostics you could undertake • Look at marginal histograms – Check for outliers – Check for skew and heavy tails • Look at scatterplots. – Check for major 2d non-linearities – Check for outliers. • Check various plots of residuals – Residuals ~ y_hat to check for non-linearities. – Absolute residuals ~ y_hat to check for homoscedasticity. – Standardized residual QQ plots check for Normality – Standardized residual ~ leverage for outlier effects, Cook’s dist. – Residual as a function of observation to look for autocorrelation

  27. Autocorrelated errors • Something fishy… plot(x,y) • Residuals as a function of observation number: plot(residuals(lm(y~x))) • Autocorrelation function: acf(residuals(lm(y~x)))
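
A runnable sketch with deliberately autocorrelated errors; the Durbin-Watson test from the car package is not on the slide but is added here as one common formal check.

  library(car)                                                  # for durbinWatsonTest; assumed installed
  x <- rnorm(200)
  e <- as.numeric(arima.sim(model = list(ar = 0.7), n = 200))   # AR(1) errors: autocorrelated by construction
  y <- 1 + 0.5 * x + e
  m <- lm(y ~ x)
  plot(residuals(m), type = "l")     # residuals in observation order: long runs above/below zero
  acf(residuals(m))                  # autocorrelation function of the residuals
  durbinWatsonTest(m)                # a common formal test for first-order autocorrelation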
