Subsampling versus bootstrap in resampling-based model selection for multivariable regression
Riccardo De Bin (1), Silke Janitza (1), Willi Sauerbrei (2) & Anne-Laure Boulesteix (1)
(1) Department of Medical Informatics, Biometry and Epidemiology, University of Munich, Germany
(2) Department of Medical Biometry and Medical Informatics, University Medical Center Freiburg, Germany
Statistical Computing 2014, Günzburg, July 23rd 2014
Outline
• Introduction
• Methods
• Data
• Results
• Prediction accuracy
• Conclusions
Introduction
• model selection aims to identify the “best” model to describe an outcome;
• if the study is replicated, the procedure should ideally produce the same result → stability;
• model selection for multivariable regression can be based on inclusion frequencies (Gong, 1982; Sauerbrei & Schumacher, 1992);
• this approach relies on a resampling technique:
  ◮ the classical (most commonly used) choice is the bootstrap;
  ◮ the bootstrap has some pitfalls (Janitza et al., 2014);
  ◮ alternatives such as subsampling should be considered;
• Aim: compare bootstrap and subsampling in a model selection process for multivariable regression based on inclusion frequencies.
Methods: Inclusion frequencies
• through a resampling technique, we generate several pseudo-samples, each containing small perturbations of the original data;
• in each pseudo-sample, we apply a model selection procedure;
• we define the proportion of selected models in which a variable appears as its “inclusion frequency” (IF);
• ideally, we can distinguish between:
  ◮ relevant variables, related to the outcome → high IF;
  ◮ noise variables, significant only in specific samples → low IF;
• possible issues:
  ◮ variables with a weak effect (their IF may depend on chance);
  ◮ co-selection (e.g., two highly correlated variables may be selected alternately, leading to an IF around 0.5 for both of them).
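As a minimal sketch of the definition above, the inclusion frequency of a variable is the fraction of pseudo-samples whose selected model contains it. The function name `inclusion_frequencies` and the toy list of selected models are illustrative assumptions, not part of the original analysis. The toy models mimic the co-selection issue: the hypothetical correlated pair `x1` / `x2` is selected alternately.

```python
from collections import Counter


def inclusion_frequencies(selected_models, variables):
    """Proportion of selected models that contain each variable.

    selected_models: one collection of variable names per pseudo-sample.
    variables: all candidate variable names.
    """
    counts = Counter()
    for model in selected_models:
        counts.update(set(model))  # count each variable once per model
    n_models = len(selected_models)
    return {v: counts[v] / n_models for v in variables}


# Hypothetical toy input: "age" is always selected, the correlated pair
# "x1"/"x2" is selected alternately, and "noise" is selected once.
models = [
    {"age", "x1"},
    {"age", "x2"},
    {"age", "x1", "noise"},
    {"age", "x2"},
]
freqs = inclusion_frequencies(models, ["age", "x1", "x2", "noise"])
```

In this toy input the alternately selected pair ends up with an IF of 0.5 each, although one of the two is present in every model.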
Methods: Model selection
• we would like to select a model which:
  ◮ contains all the relevant variables, to correctly explain the outcome and to avoid underfitting;
  ◮ contains as few variables as possible, to favor interpretability and to avoid overfitting;
• several approaches are available in the literature:
  ◮ backward elimination, forward selection, all-subsets approach, ...
  ◮ here we use backward elimination with no re-inclusion (for arguments in favor of this choice, see Mantel, 1970);
• the inclusion criterion is a key aspect:
  ◮ significance level, information criterion, total number of variables, ...
  ◮ we base our analysis on the significance level (here 0.05, 0.10, 0.157).
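Backward elimination with a significance-level criterion and no re-inclusion can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it handles only ordinary least squares linear regression (not Cox regression), always keeps the intercept, uses a normal approximation to the t-distribution for the p-values, and assumes a non-singular design matrix. All function names and the simulated data are hypothetical.

```python
import math
import random
from statistics import NormalDist


def invert(a):
    """Invert a square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(a)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def ols_pvalues(columns, y):
    """Two-sided p-values of the slopes (intercept included but not reported)."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    xtx = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(k)]
           for a in range(k)]
    xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(k)]
    inv = invert(xtx)
    beta = [sum(inv[a][b] * xty[b] for b in range(k)) for a in range(k)]
    resid = [y[i] - sum(X[i][a] * beta[a] for a in range(k)) for i in range(n)]
    sigma2 = sum(r * r for r in resid) / (n - k)
    z = [beta[a] / math.sqrt(sigma2 * inv[a][a]) for a in range(k)]
    # Normal approximation to the t-distribution (assumption of this sketch).
    return [2.0 * (1.0 - NormalDist().cdf(abs(v))) for v in z[1:]]


def backward_elimination(data, y, alpha=0.05):
    """Drop the least significant variable until all p-values are <= alpha.

    A dropped variable is never reconsidered (no re-inclusion).
    """
    selected = list(data)
    while selected:
        pvals = ols_pvalues([data[v] for v in selected], y)
        worst = max(range(len(selected)), key=pvals.__getitem__)
        if pvals[worst] <= alpha:
            break
        del selected[worst]
    return selected


# Hypothetical simulated data: only x1 is related to the outcome.
rng = random.Random(1)
n = 200
data = {name: [rng.gauss(0, 1) for _ in range(n)] for name in ("x1", "x2", "x3")}
y = [2.0 * data["x1"][i] + rng.gauss(0, 1) for i in range(n)]
selected = backward_elimination(data, y, alpha=0.05)
```

Raising `alpha` (for example to 0.10 or 0.157) makes the criterion less strict, so larger models tend to be selected.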
Methods: Resampling strategies (1/2)
In our study, we consider the following resampling strategies:
• bootstrap(n):
  ◮ the classical bootstrap technique (Efron, 1979);
  ◮ n observations drawn from the original data with replacement (hereafter n denotes the sample size);
  ◮ its asymptotic properties have been studied extensively over the past decades, starting from Bickel & Freedman (1981);
  ◮ there are counterexamples in which consistency is not achieved (see, e.g., Mammen, 1992; Bickel et al., 1997);
  ◮ bootstrap(n) shows pitfalls in several cases (for a recent review, see Janitza et al., 2014).
Methods: Resampling strategies (2/2)
• subsample(m):
  ◮ intensively investigated in the literature (Shao & Wu, 1989; Politis & Romano, 1994; Politis et al., 1999);
  ◮ m < n observations drawn from the original data without replacement;
  ◮ also known as delete-d jackknife (see Wu, 1986);
  ◮ shows asymptotic consistency also in cases where the classical bootstrap fails (Davison et al., 2003);
• bootstrap(m):
  ◮ m < n observations drawn from the original data with replacement;
  ◮ already considered in Bickel & Freedman (1981);
• here m = 0.632n, approximately the expected number of unique observations in a bootstrap(n) sample;
• subsample(m) and bootstrap(m) share the same pseudo-sample size, which makes them comparable.
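The three strategies differ only in the pseudo-sample size and in whether observations are drawn with or without replacement. The sketch below illustrates this with Python's standard `random` module; the function names and the toy data (500 observation indices) are assumptions made for illustration. It also evaluates the expected fraction of distinct observations in a bootstrap(n) sample, 1 - (1 - 1/n)^n, which tends to 1 - 1/e ≈ 0.632 as n grows and motivates the choice m = 0.632n.

```python
import random


def bootstrap(data, size, rng):
    """Draw `size` observations with replacement."""
    return [rng.choice(data) for _ in range(size)]


def subsample(data, size, rng):
    """Draw `size` observations without replacement."""
    return rng.sample(data, size)


def expected_unique_fraction(n):
    """Expected fraction of distinct observations in a bootstrap(n) sample."""
    return 1.0 - (1.0 - 1.0 / n) ** n


rng = random.Random(0)
data = list(range(500))   # hypothetical data: observation indices
n = len(data)
m = round(0.632 * n)      # pseudo-sample size of the two m-versions

boot_n = bootstrap(data, n, rng)   # bootstrap(n)
boot_m = bootstrap(data, m, rng)   # bootstrap(m)
sub_m = subsample(data, m, rng)    # subsample(m)
```

Only `sub_m` is guaranteed to contain no repeated observations; `boot_n` and `boot_m` usually contain ties.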
Methods: Resampling censored data
• one of our examples (Glioma data) deals with survival data;
• the presence of censored observations may raise some complications;
• we apply the resampling technique directly, even if it produces pseudo-samples with different effective sizes (numbers of events);
• alternatives are available (e.g., resampling events and censored observations separately);
• arguments in favor of the direct approach can be found in Burr (1994) and Zelterman et al. (1996).
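The difference between the direct approach and the separate-resampling alternative can be illustrated with a small sketch. The records, sample sizes and function names below are hypothetical; each record is a pair (survival time, status), with status 1 for an event and 0 for a censored observation. Direct subsampling lets the number of events vary between pseudo-samples, whereas resampling events and censored observations separately keeps it fixed.

```python
import random


def direct_subsample(records, m, rng):
    """Subsample m records, ignoring the event indicator."""
    return rng.sample(records, m)


def stratified_subsample(records, fraction, rng):
    """Subsample events and censored observations separately."""
    events = [r for r in records if r[1] == 1]
    censored = [r for r in records if r[1] == 0]
    return (rng.sample(events, round(fraction * len(events)))
            + rng.sample(censored, round(fraction * len(censored))))


def n_events(sample):
    """Effective size of a pseudo-sample: its number of events."""
    return sum(status for _, status in sample)


rng = random.Random(0)
# Hypothetical toy data: 60 events and 40 censored observations.
records = ([(rng.expovariate(1.0), 1) for _ in range(60)]
           + [(rng.expovariate(1.0), 0) for _ in range(40)])
m = round(0.632 * len(records))

direct_counts = [n_events(direct_subsample(records, m, rng))
                 for _ in range(200)]
strat_counts = [n_events(stratified_subsample(records, 0.632, rng))
                for _ in range(200)]
```

Comparing the spread of `direct_counts` with the constant value in `strat_counts` shows how much the effective size fluctuates under the direct approach.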
Data: Glioma dataset
• original study: Ulm et al. (1989);
• publicly available at http://portal.uni-freiburg.de/imbi/Royston-Sauerbrei-book ;
• time-to-event data: survival time of patients with malignant glioma;
• 411 patients, 274 events (median follow-up: 712 days);
• 15 variables available: 1 continuous, 8 binary, 6 dummy variables representing 3 originally categorical variables;
• the proportional hazards assumption is acceptable → Cox model.
Data: Ozone dataset
• original study: Ihorst et al. (2004);
• we use the subset defined by Buchholz et al. (2008);
• information about the effect of ozone on the lung growth of 496 schoolchildren;
• 24 variables available: 7 continuous and 17 binary;
• classical multivariable linear regression model.
Results
• the results are based on 10000 iterations of the following procedure:
  ◮ we draw a pseudo-sample from the original data;
  ◮ we select a model by applying a backward elimination procedure with inclusion criterion α = 0.05;
• for each of the three resampling techniques, we consider:
  ◮ inclusion frequencies of the variables;
  ◮ average number of variables included in the models;
  ◮ number of unique models selected;
  ◮ structure of the models.
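The procedure above (draw a pseudo-sample, select a model, repeat, then summarise) can be sketched as a generic loop. Everything named here is an assumption made for illustration: `resampling_summary` accepts any drawing and selection function, and `toy_select` is a deliberately simple stand-in for backward elimination, applied to simulated data with 1000 rather than 10000 iterations.

```python
import random
from collections import Counter


def resampling_summary(data, draw, select_model, n_iter, rng):
    """Repeat draw -> select and summarise the selected models."""
    models = Counter()
    for _ in range(n_iter):
        pseudo_sample = draw(data, rng)
        models[frozenset(select_model(pseudo_sample))] += 1
    variables = sorted({v for model in models for v in model})
    inclusion = {v: sum(c for mdl, c in models.items() if v in mdl) / n_iter
                 for v in variables}
    avg_size = sum(len(mdl) * c for mdl, c in models.items()) / n_iter
    return {
        "inclusion_frequencies": inclusion,
        "average_number_of_variables": avg_size,
        "number_of_unique_models": len(models),
        "most_frequent_models": models.most_common(10),
    }


def toy_select(sample):
    """Stand-in selector (NOT backward elimination): keep a variable when a
    z-test of 'mean = 0' with known unit variance rejects at the 5% level."""
    n = len(sample)
    keep = []
    for v in sample[0]:
        mean = sum(row[v] for row in sample) / n
        if abs(mean) * n ** 0.5 > 1.96:
            keep.append(v)
    return keep


rng = random.Random(0)
# Hypothetical simulated data with a strong, a weak and a noise variable.
means = {"strong": 1.0, "weak": 0.15, "noise": 0.0}
data = [{v: rng.gauss(mu, 1.0) for v, mu in means.items()} for _ in range(100)]
m = round(0.632 * len(data))
summary = resampling_summary(
    data, lambda d, r: r.sample(d, m), toy_select, n_iter=1000, rng=rng)
```

Passing a different `draw` function (for example, sampling with replacement) gives the corresponding summary for another resampling technique, so the three techniques can be compared on the same quantities.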
Glioma data: Variables’ inclusion frequencies
[Figure: frequency of inclusion (vertical axis, from 0.0 to 1.0) of each variable (sex, time, gradd1, gradd2, age, kard1, kard2, surgd1, surgd2, convul, cort, epi, amnesia, ops, aph) for bootstrap(n), bootstrap(m) and subsample(m).]
Glioma data: Number of unique models and average number of variables
resampling method   average number of variables   number of unique models
bootstrap(n)        6.864                         1787
bootstrap(m)        5.856                         1829
subsample(m)        5.057                         580
Glioma data: Models’ selection frequencies
                              bootstrap(n)    bootstrap(m)    subsample(m)
model                         rank   freq.    rank   freq.    rank   freq.
basic+kard1                      2     124       1     326       1    1615
basic+kard1+epi                  8      93       7     128       2     417
basic                            -       -       8     123       3     398
basic+kard1+surgd2               6     103       3     163       4     352
basic+kard1+cort                 5     106       4     148       5     298
basic+kard1+sex                  3     108       2     187       6     290
basic+cort+ops                   -       -       4     148       7     264
basic+epi                        -       -       -       -       8     242
basic+kard1+sex+epi              1     156       6     140       9     225
basic*                           -       -      10     117      10     205
basic+ops                        -       -       9     121       -       -
basic*+kard1+cort+ops            3     108       -       -       -       -
basic+cort+ops                   7      97       -       -       -       -
basic*+kard1+cort                8      93       -       -       -       -
basic+kard1+surgd2+sex+epi      10      89       -       -       -       -
basic = intercept+gradd1+age+surgd1; basic* = intercept+gradd2+age+surgd1
Glioma data: Models’ structures
Model structure             bootstrap(n)   bootstrap(m)   subsample(m)
basic                                 15            123            398
basic + 1 additional                 247            878           2432
basic + 2 additional                1030           1923           2786
basic + 3 additional                2071           2123           1956
basic + 4 additional                2505           1451            653
basic + 5 additional                1742            676            155
basic + > 5 additional              1103            275             27
others                              1287           2551           1593

Model structure             bootstrap(n)   bootstrap(m)   subsample(m)
basic*                                17            178            473
basic* + 1 additional                304           1213           2841
basic* + 2 additional               1272           2590           3441
basic* + 3 additional               2472           2772           2309
basic* + 4 additional               2832           1825            730
basic* + 5 additional               1904            803            163
basic* + > 5 additional             1180            321             27
Without at least 1 core*              19            298             16