Variable Screening
ST 430/514 Introduction to Regression Analysis / Statistics for Management and the Social Sciences II


1. Variable Screening

You will often have many candidate variables to use as independent variables in a regression model. Using all of them may be infeasible (more parameters than observations). Even if feasible, a prediction equation with many parameters may not perform well, either in validation or in application.

2. Stepwise Regression

How to choose the subset to use? One approach: stepwise regression.

Example: executive salary, with 10 candidate variables:

execSal <- read.table("Text/Exercises&Examples/EXECSAL2.txt", header = TRUE)
execSal[1:5, ]
pairs(execSal[, -1])

3. Variables

X1  Experience (years)
X2  Education (years)
X3  Gender (1 if male, 0 if female)
X4  Number of employees supervised
X5  Corporate assets ($ millions)
X6  Board member (1 if yes, 0 if no)
X7  Age (years)
X8  Company profits (past 12 months, $ millions)
X9  Has international responsibility (1 if yes, 0 if no)
X10 Company's total sales (past 12 months, $ millions)

4. Note that X3, X6, and X9 are indicator variables. The complete second-order model is quadratic in the other 7 variables, with interactions with all combinations of the indicator variables. A quadratic function of 7 variables has 36 coefficients (1 intercept, 7 linear terms, 7 squared terms, and C(7, 2) = 21 pairwise interactions). The three indicators define 2^3 = 8 combinations, so the complete second-order model has 36 × 8 = 288 parameters. Infeasible: the data set has only 100 observations.
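The parameter counts on this slide can be checked with a line of R arithmetic (a quick sketch, not from the slides):

```r
p <- 7                               # number of non-indicator variables
quad <- 1 + p + p + choose(p, 2)     # intercept + linear + squared + pairwise
quad                                 # 36
quad * 2^3                           # 288: crossed with the 8 indicator combinations
```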

5. Forward Stepwise Selection

First, consider all the one-variable models

E(Y) = β0 + βj xj,   j = 1, 2, ..., k.

For each, test the hypothesis H0: βj = 0 at some level α. If none is significant, the model is E(Y) = β0. Otherwise, choose the best (in terms of R², adjusted R², |t|, or |r|; it doesn't matter which); call the variable x_j1.
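In R, this first forward step can be sketched with base R's add1(), which F-tests each one-variable addition to the current model (a sketch assuming the execSal data frame loaded earlier, with response Y):

```r
# F-test every single-variable model against the intercept-only model
base  <- lm(Y ~ 1, execSal)
scope <- Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 + X10
add1(base, scope = scope, test = "F")  # one row per candidate, with Pr(>F)
```

The candidate with the smallest Pr(>F) below α would become x_j1.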

6. Now consider all two-variable models that include x_j1:

E(Y) = β0 + β_j1 x_j1 + βj xj,   j ≠ j1.

For each, test the significance of the new coefficient βj. If none is significant, the model is E(Y) = β0 + β_j1 x_j1. Otherwise, choose the best new variable; call it x_j2. Continue adding variables until no remaining variable is significant at level α.

7. Backward Stepwise Elimination

Alternatively, begin with the model containing all the variables, the full first-order model (assuming you can fit it). Test the significance of each coefficient at some level α. If all are significant, use that model. Otherwise, eliminate the least significant variable (smallest |t|, smallest reduction in R², ...; again, it doesn't matter which). Continue eliminating variables until all remaining variables are significant at level α.
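The backward counterpart can be sketched with base R's drop1(), which F-tests each single-variable deletion from the current model (again assuming the execSal data frame):

```r
full <- lm(Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 + X10, execSal)
drop1(full, test = "F")  # the row with the largest Pr(>F) is the
                         # least significant candidate for elimination
```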

8. Either forward selection or backward elimination could be used to select a subset of variables for further study. Problem: forward selection and backward elimination may identify different subsets.

9. Bidirectional Stepwise Regression

A combination of forward selection and backward elimination. Choose a starting model; it could be:
- no independent variables;
- all independent variables;
- some other subset of independent variables suggested a priori.

Look for a variable to add to the model, by adding each candidate, one at a time, and testing the significance of the coefficient. Then look for a variable to eliminate, by testing all coefficients. You could use a different α-to-enter and α-to-remove, with α_enter < α_remove.
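One possible implementation of this procedure, using add1()/drop1() F-tests. The function name stepwise_pf and its defaults are hypothetical, not from the slides; this is a sketch, not a vetted routine:

```r
# Hypothetical helper: bidirectional stepwise selection with separate
# alpha-to-enter and alpha-to-remove thresholds.
stepwise_pf <- function(data, response, candidates,
                        alpha_enter = 0.05, alpha_remove = 0.10) {
  fit  <- lm(reformulate("1", response), data)   # start with the empty model
  full <- reformulate(candidates, response)
  repeat {
    changed <- FALSE
    # Forward step: add the most significant absent variable, if any
    a <- add1(fit, scope = full, test = "F")
    p <- a[["Pr(>F)"]][-1]                       # drop the <none> row
    if (length(p) > 0 && min(p, na.rm = TRUE) < alpha_enter) {
      fit <- update(fit, as.formula(paste(". ~ . +",
                                          rownames(a)[-1][which.min(p)])))
      changed <- TRUE
    }
    # Backward step: drop the least significant present variable, if any
    d <- drop1(fit, test = "F")
    q <- d[["Pr(>F)"]][-1]
    if (length(q) > 0 && max(q, na.rm = TRUE) > alpha_remove) {
      fit <- update(fit, as.formula(paste(". ~ . -",
                                          rownames(d)[-1][which.max(q)])))
      changed <- TRUE
    }
    if (!changed) return(fit)
  }
}
# e.g. stepwise_pf(execSal, "Y", paste0("X", 1:10))
```

Note that α_enter < α_remove guarantees a just-added variable is not immediately removed, which prevents cycling.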

10. Repeat both steps until no variable can be added or eliminated. The final model is one at which both forward selection and backward elimination would terminate. But it is still possible that you get different final models depending on the choice of initial model.

11. Criterion-Based Stepwise Regression

In hypothesis-test-based subset selection, many tests are used. Each test, in isolation, has a specified error rate α. The per-test error rate α controls the choice of final subset, but in an indirect way.

12. Modern methods are instead based on improving a criterion such as:
- the adjusted coefficient of determination, R²a;
- MSE, s² (equivalent to R²a);
- Mallows's Cp criterion;
- the PRESS criterion;
- Akaike's information criterion, AIC.

PRESS and AIC are equivalent when n is large.
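Several of these criteria can be computed for a single fitted model in a few lines of R. The subset Y ~ X1 + X4 below is purely illustrative; note that for linear models PRESS (the sum of squared leave-one-out residuals) can be obtained from the hat values without refitting:

```r
fit <- lm(Y ~ X1 + X4, execSal)           # an illustrative subset
summary(fit)$adj.r.squared                # adjusted R²
AIC(fit)                                  # Akaike's information criterion
BIC(fit)                                  # Bayesian information criterion
sum((residuals(fit) / (1 - hatvalues(fit)))^2)  # PRESS
```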

13. In R, using AIC, starting with the empty model:

start <- lm(Y ~ 1, execSal)
all <- Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 + X10
summary(step(start, scope = all))

Starting with the full model:

# no scope, so direction defaults to "backward":
summary(step(lm(all, execSal)))
summary(step(lm(all, execSal), direction = "both"))

14. Note:

AIC = n log(σ̂²) + 2(k + 1)
    = n log(SSE / n) + 2(k + 1)
    = n log(SSE) + 2(k + 1) [− n log n].

This works well when choosing from nested models. But in the example, the 5-variable model is the best of C(10, 5) = 252 possible models. Some statisticians prefer the Bayesian information criterion

BIC = n log(σ̂²) + (log n)(k + 1).
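The formula can be checked against extractAIC(), which is what step() calls internally for lm fits; it returns n log(SSE/n) + 2(k + 1), i.e. AIC up to the additive constant dropped on the slide. A sketch, assuming a fitted model on the execSal data:

```r
fit <- lm(Y ~ X1 + X4, execSal)          # illustrative subset
n   <- nrow(execSal)
sse <- sum(residuals(fit)^2)
k   <- length(coef(fit)) - 1             # number of slope parameters
c(by_formula    = n * log(sse / n) + 2 * (k + 1),
  by_extractAIC = extractAIC(fit)[2])    # these should agree
```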

15. BIC imposes a higher penalty on the number of parameters in the model. In R:

summary(step(start, scope = all, k = log(nrow(execSal))))

The final model is the same in this case; it will never be larger than the AIC choice, but it may be smaller.

16. Best Subset Regression

When used with a criterion, stepwise regression terminates with a subset of variables that cannot be improved by adding or dropping a single variable. That is, it is locally optimal. But some other subset may have a better value of the criterion. In R, the bestglm package implements best subset regression for various criteria.
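An exhaustive search over all 2^10 subsets can be sketched with the leaps and bestglm packages (both assumed installed from CRAN; column names as in the execSal example):

```r
library(leaps)
subs <- regsubsets(reformulate(paste0("X", 1:10), "Y"),
                   data = execSal, nvmax = 10)
summary(subs)$adjr2        # best adjusted R² at each subset size

library(bestglm)
# bestglm expects a data frame with the response as the LAST column
Xy <- execSal[, c(paste0("X", 1:10), "Y")]
bestglm(Xy, IC = "BIC")    # globally best subset under BIC
```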

17. Concerning best subset methods, the text asserts that these techniques lack the objectivity of a stepwise regression procedure. I disagree. Finding the subset of variables that optimizes some criterion is completely objective. In fact, because of the opaque way that choosing α controls the procedure, I argue that stepwise regression lacks the transparency of best subset regression.

18. Caveats: Why not use stepwise methods to build a complete model?

We need to try second-order terms like products of independent variables (interactions) and squared terms (curvature). Some software tools do not know that an interaction should be included only if both main effects are also included, but step() does. Try the full second-order model:

all <- Y ~ ((X1 + X2 + X4 + X5 + X7 + X8 + X10)^2 +
            I(X1^2) + I(X2^2) + I(X4^2) + I(X5^2) +
            I(X7^2) + I(X8^2) + I(X10^2)) * X3 * X6 * X9
summary(step(start, scope = all, k = log(nrow(execSal))))
