
Regression: Choosing Variables, LIR 832, November 14, 2006



  1. Regression: Choosing Variables
     LIR 832, November 14, 2006

     Topics of the Day
     - Choosing independent variables
       - What variables should be in a model?
       - What is the effect of leaving out important variables?
       - What is the effect of adding in irrelevant variables?
       - How do we decide about this? Why not just toss everything in and let our t-statistics or R-squared solve this for us?

  2. Example: Effect of Unions (x) on Weekly Earnings (y)

     reg lnwage cbc2

           Source |          SS      df          MS     Number of obs =   156130
     -------------+--------------------------------     F( 1,156128)  =  3897.11
            Model |  1234.14281       1  1234.14281     Prob > F      =   0.0000
         Residual |  49442.8436  156128  .316681464     R-squared     =   0.0244
     -------------+--------------------------------     Adj R-squared =   0.0243
            Total |  50676.9864  156129  .324584071     Root MSE      =   .56274

     ------------------------------------------------------------------------------
          lnwage3 |      Coef.  Std. Err.        t   P>|t|     [95% Conf. Interval]
     -------------+----------------------------------------------------------------
             cbc2 |   .2488057   .0039856    62.43   0.000     .2409941    .2566173
            _cons |   2.469369    .001545  1598.30   0.000     2.466341    2.472397
     ------------------------------------------------------------------------------

     Example: Effect of Unions (x) on Weekly Earnings (y)

     reg lnwage cbc2 age

           Source |          SS      df          MS     Number of obs =   156130
     -------------+--------------------------------     F( 2,156127)  =  7530.01
            Model |  4458.26229       2  2229.13115     Prob > F      =   0.0000
         Residual |  46218.7241  156127  .296032871     R-squared     =   0.0880
     -------------+--------------------------------     Adj R-squared =   0.0880
            Total |  50676.9864  156129  .324584071     Root MSE      =   .54409

     ------------------------------------------------------------------------------
          lnwage3 |      Coef.  Std. Err.        t   P>|t|     [95% Conf. Interval]
     -------------+----------------------------------------------------------------
             cbc2 |   .2014921     .00388    51.93   0.000     .1938874    .2090969
              age |   .0111539   .0001069   104.36   0.000     .0109444    .0113634
            _cons |   2.043437   .0043461   470.17   0.000     2.034918    2.051955
     ------------------------------------------------------------------------------
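The summary statistics in the first output can be recomputed from the sums of squares, the coefficient, and its standard error. The sketch below copies its inputs from the `reg lnwage cbc2` output; the variable names are illustrative, and the 1.96 critical value is a normal approximation that is reasonable only because the sample is very large.

```python
# Figures copied from the first regression output (reg lnwage cbc2).
ss_model, df_model = 1234.14281, 1
ss_resid, df_resid = 49442.8436, 156128
coef, se = 0.2488057, 0.0039856

ss_total = ss_model + ss_resid

# R-squared: share of the total variation in the dependent variable explained by the model.
r_squared = ss_model / ss_total

# F statistic: mean square of the model over mean square of the residual.
f_stat = (ss_model / df_model) / (ss_resid / df_resid)

# t statistic: coefficient divided by its standard error.
t_stat = coef / se

# Approximate 95% confidence interval (normal critical value, an assumption).
ci_low, ci_high = coef - 1.96 * se, coef + 1.96 * se

print(f"R-squared = {r_squared:.4f}")
print(f"F         = {f_stat:.2f}")
print(f"t (cbc2)  = {t_stat:.2f}")
print(f"95% CI    = [{ci_low:.7f}, {ci_high:.7f}]")
```

Each value should agree with the corresponding entry in the output up to rounding.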

  3. reg lnwage cbc2 age female married black other NE Midwest South city1mil ed3 ed4 aa ed6 ed7

           Source |          SS      df          MS     Number of obs =   156130
     -------------+--------------------------------     F( 15,156114) =  5888.11
            Model |  18311.0587      15  1220.73725     Prob > F      =   0.0000
         Residual |  32365.9277  156114   .20732239     R-squared     =   0.3613
     -------------+--------------------------------     Adj R-squared =   0.3613
            Total |  50676.9864  156129  .324584071     Root MSE      =   .45533

     ------------------------------------------------------------------------------
          lnwage3 |      Coef.  Std. Err.        t   P>|t|     [95% Conf. Interval]
     -------------+----------------------------------------------------------------
             cbc2 |   .1360972   .0032913    41.35   0.000     .1296462    .1425481
              age |   .0067085    .000096    69.85   0.000     .0065203    .0068968
           female |  -.2151269    .002322   -92.65   0.000    -.2196779   -.2105759
          married |    .127496   .0025106    50.78   0.000     .1225752    .1324168
            black |  -.0645881   .0039931   -16.17   0.000    -.0724145   -.0567617
            other |  -.0454844   .0052715    -8.63   0.000    -.0558164   -.0351524
               NE |   .0089504   .0034877     2.57   0.010     .0021146    .0157862
          Midwest |  -.0148798   .0033238    -4.48   0.000    -.0213944   -.0083653
            South |  -.0260961   .0032539    -8.02   0.000    -.0324736   -.0197186
         city1mil |   .1118365   .0023835    46.92   0.000     .1071648    .1165081
              ed3 |   .2875855   .0038465    74.77   0.000     .2800464    .2951246
              ed4 |   .3676268   .0041132    89.38   0.000      .359565    .3756885
               aa |   .4949227   .0050869    97.29   0.000     .4849525    .5048929
              ed6 |   .7416187   .0042642   173.92   0.000     .7332609    .7499764
              ed7 |    .896922    .005259   170.55   0.000     .8866146    .9072295
            _cons |   1.813933   .0050728   357.58   0.000     1.803991    1.823876
     ------------------------------------------------------------------------------

     reg lnwage cbc2 age female married black other NE Midwest South city1mil ed3 ed4 aa ed6 ed7 manager prof tech sales privhh protect servocc servocc farmer craft oper transop laborer

           Source |          SS      df          MS     Number of obs =   156130
     -------------+--------------------------------     F( 27,156102) =  4558.99
            Model |  22342.7173      27  827.508049     Prob > F      =   0.0000
         Residual |  28334.2691  156102  .181511249     R-squared     =   0.4409
     -------------+--------------------------------     Adj R-squared =   0.4408
            Total |  50676.9864  156129  .324584071     Root MSE      =   .42604

     ------------------------------------------------------------------------------
          lnwage3 |      Coef.  Std. Err.        t   P>|t|     [95% Conf. Interval]
     -------------+----------------------------------------------------------------
             cbc2 |   .1348609   .0031501    42.81   0.000     .1286866    .1410351
              age |   .0056959   .0000906    62.84   0.000     .0055183    .0058736
           female |  -.1960792   .0023927   -81.95   0.000    -.2007688   -.1913895
          married |   .0945142   .0023617    40.02   0.000     .0898854    .0991431
            black |  -.0497951   .0037475   -13.29   0.000      -.05714   -.0424501
            other |  -.0287192   .0049378    -5.82   0.000    -.0383971   -.0190413
               NE |   .0106994   .0032661     3.28   0.001     .0042979    .0171009
          Midwest |  -.0160232   .0031147    -5.14   0.000    -.0221278   -.0099185
            South |     -.0345    .003048   -11.32   0.000     -.040474    -.028526
         city1mil |   .1006931   .0022359    45.04   0.000     .0963108    .1050754
              ed3 |   .2163545   .0036596    59.12   0.000     .2091817    .2235273
              ed4 |   .2570192   .0039814    64.55   0.000     .2492157    .2648228
               aa |   .3307331   .0049498    66.82   0.000     .3210316    .3404345
              ed6 |   .5085537    .004477   113.59   0.000     .4997789    .5173285
              ed7 |   .6125842   .0056601   108.23   0.000     .6014905    .6236779
          manager |   .3553568   .0039626    89.68   0.000     .3475901    .3631235
             prof |   .2786787   .0041472    67.20   0.000     .2705503    .2868071
             tech |   .2750721   .0062083    44.31   0.000      .262904    .2872401
            sales |   .0288982   .0040054     7.21   0.000     .0210478    .0367487
           privhh |  -.3069562   .0139645   -21.98   0.000    -.3343264   -.2795861
          protect |   .0610202   .0081706     7.47   0.000      .045006    .0770344
          servocc |  -.3478074   .0052614   -66.11   0.000    -.3581196   -.3374952
           farmer |  -.1941755   .0089707   -21.65   0.000    -.2117578   -.1765931
            craft |   .1923506   .0043155    44.57   0.000     .1838922    .2008089
             oper |   .0161818   .0051605     3.14   0.002     .0060673    .0262963
          transop |  -.0171413   .0066874    -2.56   0.010    -.0302485    -.004034
          laborer |  -.1110402   .0058008   -19.14   0.000    -.1224096   -.0996708
            _cons |   1.896043   .0055862   339.42   0.000     1.885094    1.906992
     ------------------------------------------------------------------------------
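One way to see why R-squared alone cannot pick the specification: it rises every time regressors are added, whether or not they belong in the model. The sketch below recomputes R-squared and adjusted R-squared for the four regressions shown so far, using the model sums of squares and regressor counts copied from the outputs. The labels are illustrative shorthand, not the author's names for the models.

```python
# Total sum of squares and sample size are the same in all four outputs.
N_OBS = 156130
SS_TOTAL = 50676.9864

# (illustrative label, number of regressors k, model SS, reported R2, reported adj. R2)
models = [
    ("union only",               1,  1234.14281, 0.0244, 0.0243),
    ("+ age",                    2,  4458.26229, 0.0880, 0.0880),
    ("+ demographics, educ.",    15, 18311.0587, 0.3613, 0.3613),
    ("+ occupation",             27, 22342.7173, 0.4409, 0.4408),
]

results = []
for label, k, ss_model, _, _ in models:
    r2 = ss_model / SS_TOTAL
    # Adjusted R-squared penalizes added regressors through the degrees of freedom.
    adj_r2 = 1 - (1 - r2) * (N_OBS - 1) / (N_OBS - k - 1)
    results.append((label, k, r2, adj_r2))
    print(f"{label:<24} k={k:>2}  R2={r2:.4f}  adj R2={adj_r2:.4f}")
```

With more than 150,000 observations the degrees-of-freedom penalty is tiny, so adjusted R-squared tracks R-squared almost exactly here and offers little protection against adding irrelevant variables.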

  4. Example: Effect of Unions (x) on Weekly Earnings (y)

     Some observations:
     - The estimated return to union membership is sensitive to age and educational attainment. Union members tend to be older and have higher educational attainment than other members of the labor force. Once we control for those factors, the estimated return to union membership is lower: the cbc2 coefficient falls from .249 with no controls to .201 with age and to .136 with the demographic, regional, and education controls.
     - Similarly, union members tend to be male. Absent a control for gender, part of the male wage advantage is attributed to union membership.
     - In contrast with the first two points, once all the other controls are in the model, adding controls for occupation barely changes the union estimate (.136 versus .135).

     Example: Effect of Unions (x) on Weekly Earnings (y)

     Conclusions:
     - What you have in the model may affect your estimates.
     - This is not always the case.

     Linguistics:
     - We call the variables we place in models to remove the effects of correlates of the variables we are interested in “CONTROLS”. They are there to control for other factors that influence our dependent variable.
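The observations above can be checked by lining up the cbc2 coefficient from each of the four regressions. The sketch below copies the coefficients from the outputs. The conversion to a percentage premium assumes that lnwage is the natural log of earnings, which the slides imply but do not state, and the specification labels are illustrative.

```python
import math

# cbc2 coefficients copied from the four regression outputs.
union_coef = [
    ("no controls", 0.2488057),
    ("+ age", 0.2014921),
    ("+ demographics, region, education", 0.1360972),
    ("+ occupation", 0.1348609),
]

changes = []
previous = None
for label, coef in union_coef:
    # Percentage premium implied by a log-point coefficient (assumes natural logs).
    premium_pct = (math.exp(coef) - 1) * 100
    change = None if previous is None else coef - previous
    changes.append(change)
    shown = "" if change is None else f"  change={change:+.4f}"
    print(f"{label:<36} coef={coef:.4f}  premium={premium_pct:5.1f}%{shown}")
    previous = coef
```

The first two sets of controls each move the coefficient by several log points, while the occupation controls move it by roughly one tenth of a log point.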

  5. Choosing Model Specification (“What variables do I use?”)

     - Q: How do we decide what should be in the model?
     - A: It depends on the question we are trying to answer.
       - Example: If we just want to know how much more a union member earns than a non-member overall, then our first estimate is fine.
       - Example: If we want to measure how much union membership increases earnings all else equal (ceteris paribus), then we need to build a regression model that controls for the other influences on earnings:
         - Education
         - Occupation
         - Experience
         - Gender
         - And on and on…

     What is Misspecification?

     - “Misspecification” is:
       1. Omitting variables that should be included.
       2. Adding variables that should not be included.
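The first example works because a regression on a single 0/1 indicator reproduces the raw difference in group means: the slope is the mean of y for members minus the mean for non-members, and the intercept is the non-member mean. A minimal sketch with made-up numbers (the data and the helper `ols_simple` are hypothetical, not taken from the slides):

```python
from statistics import mean

# Hypothetical toy data: log wages and a union indicator (1 = member).
union = [1, 1, 1, 0, 0, 0, 0, 0]
lnwage = [2.9, 2.7, 2.8, 2.4, 2.6, 2.5, 2.3, 2.7]

def ols_simple(x, y):
    """Return (intercept, slope) from a one-regressor least-squares fit."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

intercept, slope = ols_simple(union, lnwage)
mean_union = mean([w for w, u in zip(lnwage, union) if u == 1])
mean_nonunion = mean([w for w, u in zip(lnwage, union) if u == 0])

print(f"slope     = {slope:.4f}  mean difference = {mean_union - mean_nonunion:.4f}")
print(f"intercept = {intercept:.4f}  non-member mean = {mean_nonunion:.4f}")
```

This is an overall comparison only; it says nothing about what the difference would be with other influences on earnings held fixed.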

  6. Omitted Variables

     - Let’s define the “true” model as the correct model for explaining the issue. We are going to work with population models so we don’t have the added problem of sampling variability. Let’s write this out in our typical form:

       $$Y_i = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon \qquad \text{(Equation 1)}$$

       where Y is the dependent variable, the X’s are the explanatory variables, and $\varepsilon$ is the error term.

     Omitted Variables

     - Now, suppose we estimate a model leaving out $X_2$:

       $$Y_i = \alpha_0 + \alpha_1 X_1 + \varepsilon^{*} \qquad \text{(Equation 2)}$$

       where Y is the dependent variable, the X’s are the explanatory variables, and $\varepsilon^{*}$ is the error term.
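A small simulation can illustrate what leaving X2 out of Equation 1 does to the coefficient on X1 in Equation 2. The sketch below is built on assumptions that are not in the slides: the "true" coefficients, the correlation between X1 and X2, and the helper `slope` are all hypothetical, and the error term is set to zero to mimic the slide's choice to ignore sampling variability. It relies on the standard result that the short-regression slope equals beta1 plus beta2 times the slope from regressing X2 on X1.

```python
import random

random.seed(832)

# Hypothetical "true" coefficients for Equation 1 (not from the slides).
beta0, beta1, beta2 = 1.0, 2.0, 3.0

n = 5000
x2 = [random.gauss(0, 1) for _ in range(n)]
# X1 is built to be positively correlated with X2.
x1 = [0.5 * v + random.gauss(0, 1) for v in x2]
# Error term set to zero so the only discrepancy comes from omitting X2.
y = [beta0 + beta1 * a + beta2 * b for a, b in zip(x1, x2)]

def slope(x, y):
    """Least-squares slope from regressing y on x alone."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx

alpha1 = slope(x1, y)    # coefficient on X1 in Equation 2 (X2 omitted)
delta = slope(x1, x2)    # slope from regressing X2 on X1

print(f"alpha1 (X2 omitted)   = {alpha1:.4f}")
print(f"beta1 + beta2 * delta = {beta1 + beta2 * delta:.4f}")
print(f"true beta1            = {beta1:.4f}")
```

In this setup alpha1 overstates beta1 because X1 picks up part of the effect of the omitted X2. This is the same mechanism by which the union coefficient in the first regression absorbs part of the effects of age, gender, and education.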
