
Marcel Dettling Institute for Data Analysis and Process Design - PowerPoint PPT Presentation

Applied Statistical Regression HS 2011 Week 04 Marcel Dettling Institute for Data Analysis and Process Design Zurich University of Applied Sciences marcel.dettling@zhaw.ch http://stat.ethz.ch/~dettling ETH Zürich, October 17, 2011


  1. Applied Statistical Regression HS 2011 – Week 04 Marcel Dettling Institute for Data Analysis and Process Design Zurich University of Applied Sciences marcel.dettling@zhaw.ch http://stat.ethz.ch/~dettling ETH Zürich, October 17, 2011 Marcel Dettling, Zurich University of Applied Sciences 1

  2. Curvilinear Fitting
     All models such as:
     • $Y_i = \beta_0 + \beta_1 \ln(x_i) + E_i$
     • $Y_i = \beta_0 + \beta_1 \sqrt{x_i} + E_i$
     • $Y_i = \beta_0 + \beta_1 x_i^{-1} + E_i$
     are simple linear regression models. There is only one single predictor, and the relation is linear in the parameters. None of these models fits a straight line in the scatterplot; these are all curvilinear relations. Linear regression is very versatile!

  3. Logged Predictor and Response
     Regression models of the form
     $Y_i' = \beta_0' + \beta_1 x_i' + E_i'$, where $Y_i' = \log(Y_i)$ and $x_i' = \log(x_i)$,
     are very important and often encountered in practice. Back transformation shows that the initial relation is
     $Y_i = \beta_0 \cdot x_i^{\beta_1} \cdot E_i$, with $\beta_0 = \exp(\beta_0')$ and $E_i = \exp(E_i')$,
     i.e. a non-linear relation with multiplicative error. Through the transformation, the parameter estimation problem is linearized and can be solved with the least squares method.
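
The linearization can be tried out on a toy example. The following is a minimal Python sketch, not the lecture's own code: the helper `fit_simple_ols` and the noise-free data following $y = 3 \cdot x^{0.5}$ are assumptions made purely for illustration.

```python
import math

def fit_simple_ols(x, y):
    """Least squares intercept and slope for y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical, noise-free data on the original scale: y = 3 * x^0.5
x = [1.0, 2.0, 4.0, 8.0, 16.0]
y = [3.0 * xi ** 0.5 for xi in x]

# Fit a straight line after taking logs of both predictor and response
b0_log, b1 = fit_simple_ols([math.log(v) for v in x], [math.log(v) for v in y])

# Back transform the intercept to obtain the multiplicative constant
b0 = math.exp(b0_log)
print(b0, b1)
```

Because the toy data contain no error, the fit recovers the constant 3 and the exponent 0.5 up to floating-point rounding. With real data the estimates would of course carry sampling error.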

  4. Example: Daily Cost in Rehabilitation
     [Figure, two panels: "Daily Cost in Rehab vs. ADL" (Daily Cost against ADL) and "Residuals vs. Fitted Values" (Residuals against Fitted values).]

  5. Logged Response Model
     We transform the response variable and try to explain it using a linear model with our previous predictors:
     $Y' = \log(Y) = \beta_0 + \beta_1 x + E$
     In the original scale, we can write the logged response model using the same predictors:
     $Y = \exp(\beta_0 + \beta_1 x) \cdot \exp(E)$
     → Multiplicative model
     → $E \sim N(0, \sigma_E^2)$, and thus $\exp(E)$ has a lognormal distribution

  6. Also This Transformation Works!
     [Figure, two panels: "Residuals vs Fitted" (Residuals against Fitted values) and "Normal Q-Q" (Standardized residuals against Theoretical Quantiles).]

  7. Dealing with Zero Response
     • The logged response model is only applicable when the response is strictly positive.
     • What if there are some cases with $Y = 0$?
       - never omit these
       - additive shifting is possible
     • How to additively shift?
       - usual choice: c = 1
       - not good, because the effect is scale-dependent
     → Shift with the value of the smallest positive observation!
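
The shifting rule can be sketched in a few lines of Python. The function name `shifted_log` and the cost values are hypothetical and only serve as an illustration of the idea on the slide.

```python
import math

def shifted_log(y):
    """Log-transform a non-negative response after an additive shift.

    The shift c is the smallest strictly positive observation.
    """
    if any(v < 0 for v in y):
        raise ValueError("response must be non-negative")
    positives = [v for v in y if v > 0]
    if not positives:
        raise ValueError("need at least one strictly positive observation")
    c = min(positives)
    return c, [math.log(v + c) for v in y]

# Hypothetical response values, including a zero
costs = [0.0, 0.5, 2.0, 10.0]
c, logged = shifted_log(costs)

# Same data expressed in a unit that is 1000 times smaller
c_scaled, logged_scaled = shifted_log([1000.0 * v for v in costs])

# Difference between the two logged versions, observation by observation
offsets = [b - a for a, b in zip(logged, logged_scaled)]
print(c, c_scaled, offsets)
```

With the data-dependent shift, changing the unit moves every logged value by the same constant, log(1000), so only the intercept of a subsequent regression is affected. With a fixed c = 1 the offsets would differ between observations, which is the scale dependence mentioned on the slide.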

  8. Back Transforming the Fitted Values
     • In principle, we can "simply back transform": $\hat{y} = \exp(\hat{y}')$
     • This is an estimate for the median, but not the mean!
     • If unbiased estimation is required, then use: $\hat{y} = \exp\left(\hat{y}' + \frac{\hat{\sigma}_E^2}{2}\right)$
     • Confidence/prediction intervals are not problematic: $[l, u] \rightarrow [\exp(l), \exp(u)]$
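
A small Python sketch of the two back transformations and of the interval rule follows. The fitted value on the log scale, the error variance estimate and the interval bounds are made-up numbers, and the function names are hypothetical.

```python
import math

def back_transform(fit_log, sigma2_hat):
    """Return (median estimate, mean estimate) on the original scale."""
    median_est = math.exp(fit_log)
    # Lognormal correction: add half the estimated error variance
    mean_est = math.exp(fit_log + sigma2_hat / 2.0)
    return median_est, mean_est

def back_transform_interval(lower, upper):
    """Intervals are back transformed endpoint by endpoint."""
    return math.exp(lower), math.exp(upper)

# Hypothetical fitted value, variance estimate and interval on the log scale
fit_log = 6.4
sigma2_hat = 0.25
median_est, mean_est = back_transform(fit_log, sigma2_hat)
lo, hi = back_transform_interval(6.0, 6.8)
print(median_est, mean_est, lo, hi)
```

The mean estimate always exceeds the median estimate by the factor $\exp(\hat{\sigma}_E^2 / 2)$, which grows with the scatter of the errors on the log scale.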

  9. Back Transforming: Example
     [Figure: "Daily Cost in Rehabilitation vs. ADL-Score" (Daily Cost against ADL).]

  10. Interpretation of the Coefficients
      Important: there is no back transformation for the coefficients to the original scale, but there is still a good interpretation:
      $\log(\hat{y}) = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \dots + \hat{\beta}_p x_p$
      $\hat{y} = \exp(\hat{\beta}_0) \cdot \exp(\hat{\beta}_1 x_1) \cdots \exp(\hat{\beta}_p x_p)$
      An increase by one unit in $x_1$ would multiply the fitted value in the original scale with $\exp(\hat{\beta}_1)$.
      → Coefficients are interpreted multiplicatively!
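
The multiplicative interpretation can be checked numerically. In this Python sketch the coefficient values and predictor settings are invented for illustration, and `fitted_original_scale` is a hypothetical helper.

```python
import math

def fitted_original_scale(coefs, x):
    """Fitted value on the original scale for a logged response model.

    coefs = [b0, b1, ..., bp] are coefficients estimated on the log scale.
    """
    linear = coefs[0] + sum(b * xi for b, xi in zip(coefs[1:], x))
    return math.exp(linear)

# Hypothetical coefficients of a model with two predictors
coefs = [6.0, -0.02, 0.3]

base = fitted_original_scale(coefs, [10.0, 1.0])
bumped = fitted_original_scale(coefs, [11.0, 1.0])  # x1 increased by one unit

ratio = bumped / base
print(ratio, math.exp(coefs[1]))
```

The ratio of the two fitted values equals $\exp(\hat{\beta}_1)$ regardless of the starting values of the predictors, which is why the effect is described as a factor rather than as a difference.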

  11. First-Aid Transformations
      These are intended to stabilize the variance.
      First-aid transformations:
      → always apply these (unless there are practical reasons against it)
      → apply them to both response and predictors
      Absolute values and concentrations: log-transformation, $y' = \log(y)$
      Count data: square-root transformation, $y' = \sqrt{y}$
      Proportions: arcsine transformation, $y' = \sin^{-1}(\sqrt{y})$
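
As a compact summary, the three rules can be written as one small Python function. The function name `first_aid` and the labels "amount", "count" and "proportion" are hypothetical choices for this sketch.

```python
import math

def first_aid(value, kind):
    """Apply the first-aid transformation matching the type of variable."""
    if kind == "amount":        # absolute values and concentrations
        return math.log(value)
    if kind == "count":         # count data
        return math.sqrt(value)
    if kind == "proportion":    # proportions between 0 and 1
        return math.asin(math.sqrt(value))
    raise ValueError("unknown kind: " + kind)

print(first_aid(100.0, "amount"))
print(first_aid(9, "count"))
print(first_aid(0.25, "proportion"))
```

In a regression setting the function would be applied to every observation of a variable before fitting, for the response as well as for the predictors.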

  12. Multiple Linear Regression
      The model is:
      $Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + E_i$
      • we have $p$ predictors now
      • visualization is no longer possible
      • we are still given $n$ data points, and still:
      • the goal is to estimate the regression coefficients

  13. Assumptions on the Error Term
      The assumptions are identical to simple linear regression:
      - $E[E_i] = 0$, i.e. the hyperplane is the correct fit
      - $\mathrm{Var}(E_i) = \sigma_E^2$, constant scatter for the error term
      - $\mathrm{Cov}(E_i, E_j) = 0$ for $i \neq j$, uncorrelated errors
      As in simple linear regression, we do not require any specific distribution for parameter estimation and certain optimality results of the least squares approach. The distributional assumption only comes into play when we do inference on the parameters.

  14. Don't Do Many Simple Regressions
      Doing many simple linear regressions is not equivalent to multiple linear regression. Check the example:

      x1    0   1   2   3   0   1   2   3
      x2   -1   0   1   2   1   2   3   4
      yy    1   2   3   4  -1   0   1   2

      We have $Y_i = \hat{y}_i = 2 x_{i1} - x_{i2}$, a perfect fit.
      Thus, all residuals are 0 and $\hat{\sigma}_E^2 = 0$.
      But what is the result from simple linear regressions?

  15. Don't Do Many Simple Regressions
      [Figure, two panels: "yy ~ x1" (yy against x1) and "yy ~ x2" (yy against x2).]
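
The toy data from the example can be used to compare the two approaches numerically. The Python sketch below uses the closed-form least squares solution for one and for two predictors; the helper names `mean` and `centered_cross` are hypothetical.

```python
def mean(v):
    return sum(v) / len(v)

def centered_cross(u, v):
    """Sum of products of deviations from the respective means."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))

# Data from the example
x1 = [0, 1, 2, 3, 0, 1, 2, 3]
x2 = [-1, 0, 1, 2, 1, 2, 3, 4]
yy = [1, 2, 3, 4, -1, 0, 1, 2]

s11, s22, s12 = centered_cross(x1, x1), centered_cross(x2, x2), centered_cross(x1, x2)
s1y, s2y = centered_cross(x1, yy), centered_cross(x2, yy)

# Two separate simple linear regressions
slope_x1 = s1y / s11
slope_x2 = s2y / s22

# Multiple linear regression: normal equations for two centered predictors
det = s11 * s22 - s12 ** 2
b1 = (s22 * s1y - s12 * s2y) / det
b2 = (s11 * s2y - s12 * s1y) / det
b0 = mean(yy) - b1 * mean(x1) - b2 * mean(x2)

print("simple:  ", slope_x1, slope_x2)
print("multiple:", b0, b1, b2)
```

The simple regressions yield slopes of 1 for x1 and 1/9 for x2, whereas the multiple regression recovers the coefficients 2 and -1 of the perfect fit. The discrepancy arises because x1 and x2 are correlated in this data set.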

  16. An Example
      Researchers at General Motors collected data on 60 US Standard Metropolitan Statistical Areas (SMSAs) in a study of whether air pollution contributes to mortality.

      City            Mortality  JanTemp  JulyTemp  RelHum  Rain  Educ  Dens  NonWhite  WhiteCollar      Pop  House  Income  HC  NOx  SO2
      Akron, OH          921.87       27        71      59    36  11.4  3243       8.8         42.6   660328   3.34   29560  21   15   59
      Albany, NY         997.87       23        72      57    35    11  4281       3.5         50.7   835880   3.14   31458   8   10   39
      Allentown, PA      962.35       29        74      54    44   9.8  4260       0.8         39.4   635481   3.21   31856   6    6   33
      Atlanta, GA        982.29       45        79      56    47  11.1  3125      27.1         50.2  2138231   3.41   32452  18    8   24
      Baltimore, MD     1071.29       35        77      55    43   9.6  6441      24.4         43.7  2199531   3.44   32368  43   38  206
      Birmingham, AL    1030.38       45        80      54    53  10.2  3325      38.5         43.1   883946   3.45   27835  30   32   72

      http://lib.stat.cmu.edu/DASL/Stories/AirPollutionandMortality.html

  17. Some Simple Linear Regressions
      [Figure, four panels: Mortality against SO2, Mortality against log(SO2), Mortality against %NonWhite, and Mortality against Rain.]

  18. Coefficient Estimates
      log(SO2):  $\hat{y} = 886.34 + 16.86 \cdot \log(\mathrm{SO2})$
      NonWhite:  $\hat{y} = 887.90 + 4.49 \cdot \mathrm{NonWhite}$
      Rain:      $\hat{y} = 851.22 + 2.34 \cdot \mathrm{Rain}$

      > lm(Mortality ~ log(SO2) + NonWhite + Rain, data=mortality)
      Coefficients:
      (Intercept)     log(SO2)     NonWhite         Rain
          773.020       17.502        3.649        1.763

      The regression coefficient is the increase in the response if the predictor increases by 1 unit, but all other predictors remain unchanged.
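
The "all other predictors remain unchanged" reading can be illustrated with the reported coefficients. The following Python sketch hard-codes the estimates from the output above; the function `predict_mortality` is a hypothetical helper, and the predictor values are taken from the Akron, OH row of the data excerpt. R's `log` is the natural logarithm, so `math.log` is used here.

```python
import math

# Coefficient estimates as reported in the lm() output
COEFS = {"intercept": 773.020, "log_so2": 17.502, "nonwhite": 3.649, "rain": 1.763}

def predict_mortality(so2, nonwhite, rain):
    """Fitted mortality from the multiple regression with log(SO2), NonWhite, Rain."""
    return (COEFS["intercept"]
            + COEFS["log_so2"] * math.log(so2)
            + COEFS["nonwhite"] * nonwhite
            + COEFS["rain"] * rain)

# Akron, OH: SO2 = 59, NonWhite = 8.8, Rain = 36
base = predict_mortality(59, 8.8, 36)

# One more unit of Rain, everything else unchanged
rain_effect = predict_mortality(59, 8.8, 37) - base

# Doubling SO2, everything else unchanged
doubling_effect = predict_mortality(118, 8.8, 36) - base

print(base, rain_effect, doubling_effect)
```

One additional unit of Rain changes the fitted value by exactly the Rain coefficient. Because SO2 enters on the log scale, its coefficient describes the change per unit of log(SO2), so doubling SO2 adds the coefficient times log(2).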
