Detection of influential points as a byproduct of resampling-based variable selection procedures
Riccardo De Bin, Department of Mathematics, University of Oslo
Based on joint work with Anne-Laure Boulesteix (University of Munich) and Willi Sauerbrei (University Medical Center Freiburg)
Seminars in Statistics, February 21st, 2019
Resampling-based detection of influential points

Outline of the talk
• Introduction
• Methods
• Detection of possible influential points
• Conclusions
Introduction: overview
• Importance of model stability:
◮ small data perturbations may lead to different selected models;
◮ it is not clear which model is the best one (if a single best model exists).
• Among the approaches that handle this issue:
◮ resampling-based variable selection (Chen & George, 1985);
◮ frequentist model averaging (Buckland et al., 1997).
• These approaches may rely on variable selection performed on several pseudo-samples (inclusion frequencies, data-driven weights, . . . );
• we use the same information to detect possible influential points:
◮ outliers;
◮ single observations which have a high impact on the results.
Introduction: body fat data
• Reference: Johnson (1996);
• sample size: 252;
• outcome: percentage of body fat (continuous);
• covariates: age, weight, height and 10 other continuous body circumference measurements;
• data: http://portal.uni-freiburg.de/imbi/Royston-Sauerbrei-book/Multivariable_Model-building/downloads/datasets/edu_bodyfat_both.zip
• this dataset contains at least one influential point (Royston & Sauerbrei, 2007): observation 39.
Introduction: body fat data
Variables selected (✓) by backward elimination under BIC, significance level α = 0.05, and AIC, on the full data (“in”) and after removing observation 39 (“out”):

variable (BIC in/out, α = 0.05 in/out, AIC in/out):
age: ✓ ✓ ✓ ✓
weight: ✓ ✓ ✓
height: ✓ ✓ ✓
neck: ✓ ✓
chest: ✓
ab: ✓ ✓ ✓ ✓ ✓ ✓
hip: ✓
thigh: ✓
knee: (not selected)
ankle: (not selected)
biceps: (not selected)
forearm: ✓ ✓ ✓ ✓
wrist: ✓ ✓ ✓ ✓ ✓ ✓
NB: the models are obtained by using backward elimination.
Methods: resampling-based variable selection
To perform a resampling-based variable selection:
• generate several pseudo-samples through a resampling technique (e.g., bootstrap, subsampling, . . . );
• apply a variable selection procedure on each pseudo-sample (e.g., backward elimination);
• for each variable, compute the proportion of pseudo-samples in which the variable has been selected → its inclusion frequency;
• keep only the variables with the largest inclusion frequencies.
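The steps above can be sketched in code. This is a minimal illustration on simulated data, not the authors' implementation: the data, the |t| > 2 selection rule (a crude stand-in for backward elimination), and all thresholds are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data (not the body fat set): y depends on the first two
# of five covariates; the other three are pure noise.
n, q = 100, 5
X = rng.normal(size=(n, q))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=n)

def select_variables(X, y, threshold=2.0):
    # Toy stand-in for backward elimination: keep covariates whose OLS
    # t-statistic exceeds the threshold in absolute value.
    Xc = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xc, y, rcond=None)
    resid = y - Xc @ beta
    sigma2 = resid @ resid / (len(y) - Xc.shape[1])
    t = beta / np.sqrt(sigma2 * np.diag(np.linalg.inv(Xc.T @ Xc)))
    return (np.abs(t[1:]) > threshold).astype(int)  # drop the intercept

# One row of 0/1 selection indicators per bootstrap pseudo-sample.
B = 200
inclusion = np.zeros((B, q), dtype=int)
for b in range(B):
    idx = rng.integers(0, n, size=n)          # bootstrap pseudo-sample
    inclusion[b] = select_variables(X[idx], y[idx])

# Column means of the 0/1 matrix = inclusion frequencies.
inclusion_freq = inclusion.mean(axis=0)
```

With this setup the two signal variables end up with inclusion frequencies near 1 and the noise variables with low ones, mirroring the 0.96 / 0.05-type values on the next slide.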
Methods: resampling-based variable selection

pseudo-sample        V1     V2     V3     . . .  V(q−1)  Vq
1                    1      0      1      . . .  0       1
2                    0      1      1      . . .  0       0
3                    1      0      1      . . .  0       1
. . .
B                    1      0      1      . . .  0       0
inclusion frequency  0.96   0.24   1.00   . . .  0.05    0.69
Methods: model averaging with resampling-based weights
• Fit k models M_1, . . . , M_k on the data;
• for each model, compute the estimate θ̂_{M_j};
• compute the overall estimate as the weighted average θ̂ = Σ_{j=1}^{k} w_j θ̂_{M_j};
• a highly relevant point is the choice of the weights w_j:
◮ based on information criteria or Mallows’ criterion (e.g., Buckland et al., 1997; Hjort & Claeskens, 2003; Hansen, 2007);
◮ resampling-based (Burnham & Anderson, 2002; Augustin et al., 2005);
• we focus on the latter:
◮ find the best model on each of several pseudo-samples;
◮ w_j is the proportion of pseudo-samples in which model M_j is selected.
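A minimal sketch of resampling-based weights, under assumed toy conditions: two candidate models for the effect of x1 (with or without a correlated covariate x2), a BIC comparison on each bootstrap pseudo-sample, and simulated data. Names like `M_small`/`M_big` are illustrative, not from the talk.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(1)

# Hypothetical data; the quantity of interest is the coefficient of x1.
n = 200
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)
y = 1.0 * x1 + 0.8 * x2 + rng.normal(size=n)

def design(x1, x2, model):
    cols = [np.ones_like(x1), x1] + ([x2] if model == "M_big" else [])
    return np.column_stack(cols)

def fit(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

def bic(X, y):
    resid = y - X @ fit(X, y)
    m = len(y)
    return m * np.log(resid @ resid / m) + X.shape[1] * np.log(m)

# Find the best model on each bootstrap pseudo-sample.
B = 200
counts = Counter()
for b in range(B):
    idx = rng.integers(0, n, size=n)
    best = min(("M_small", "M_big"),
               key=lambda m: bic(design(x1[idx], x2[idx], m), y[idx]))
    counts[best] += 1

# Resampling-based weight w_j: proportion of pseudo-samples selecting M_j.
weights = {m: c / B for m, c in counts.items()}

# Model-averaged estimate: weighted average of the per-model estimates
# of the x1 coefficient (position 1 in each design matrix).
theta_hat = sum(w * fit(design(x1, x2, m), y)[1] for m, w in weights.items())
```

The weights sum to one by construction, and θ̂ lies between the two per-model estimates of the x1 coefficient.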
Methods: model averaging with resampling-based weights

pseudo-sample   V1   V2   V3   . . .  V(q−1)  Vq   model
1               1    0    1    . . .  0       1    → M_1
2               0    1    1    . . .  0       0    → M_2
3               1    0    1    . . .  0       1    → M_1
. . .
B               1    0    1    . . .  0       0    → M_k
Methods: inclusion matrix
• Both approaches rely on the inclusion matrix;
• each row shows which variables are included in the best model on a particular pseudo-sample.

pseudo-sample        V1     V2     V3     . . .  V(q−1)  Vq   model
1                    1      0      1      . . .  0       1    → M_1
2                    0      1      1      . . .  0       0    → M_2
3                    1      0      1      . . .  0       1    → M_1
. . .
B                    1      0      1      . . .  0       0    → M_k
inclusion frequency  0.96   0.24   1.00   . . .  0.05    0.69
Detection of possible influential points: towards the frequency matrix
For each row we know which observations belong to the specific pseudo-sample and which do not;
⇓
we can contrast the results for pseudo-samples with and without a specific observation:
• for each variable, two separate inclusion frequencies:
◮ inclusion frequencies computed on the pseudo-samples including the i-th observation → I-frequencies;
◮ inclusion frequencies computed on the pseudo-samples without the i-th observation → O-frequencies.
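The split into I- and O-frequencies can be computed directly from the inclusion matrix and the resampled index sets. Here both inputs are simulated placeholders; in practice they would come from the resampling run itself.

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed inputs from a resampling run: the B x q inclusion matrix and,
# for each pseudo-sample, the indices of the observations it contains.
n, q, B = 50, 4, 300
inclusion = rng.integers(0, 2, size=(B, q))               # 0/1 indicators
samples = [rng.integers(0, n, size=n) for _ in range(B)]  # bootstrap indices

def io_frequencies(i, inclusion, samples):
    # I-frequencies: columns of the inclusion matrix averaged over the
    # pseudo-samples that contain observation i; O-frequencies: averaged
    # over those that do not.
    has_i = np.array([i in s for s in samples])
    return inclusion[has_i].mean(axis=0), inclusion[~has_i].mean(axis=0)

i_freq, o_freq = io_frequencies(0, inclusion, samples)
```

Stacking the `i_freq` rows for i = 1, . . . , n gives the n × q I-frequency matrix shown on the next slide.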
Detection of possible influential points: I-frequency matrix
• We can organize the I-frequencies in an I-frequency matrix:
◮ each row corresponds to the I-frequencies computed only on the pseudo-samples containing a specific observation.

observation   V1      V2      V3      . . .  V(q−1)  Vq
1             0.969   0.215   1.000   . . .  0.023   0.692
2             0.902   0.260   1.000   . . .  0.056   0.776
3             0.994   0.241   1.000   . . .  0.087   0.614
. . .
n−1           0.978   0.301   1.000   . . .  0.061   0.661
n             0.984   0.292   1.000   . . .  0.047   0.676
Detection of possible influential points: I-frequency matrix
• No influential points ↔ similar I-frequencies (i.e., similar values in the same column);
• a strongly separated I-frequency may reveal the presence of an influential point;
• how do we identify possibly separated values? → graphical approach / analytical approach
Detection of possible influential points: graphical approach
• The idea is to take advantage of “the human gift for pattern recognition” (Friedman & Tukey, 1974);
• the box-plot is a simple and effective tool:
◮ extreme observations are not included in the whiskers;
◮ usually points farther than 1.5 IQR (interquartile range) from the first/third quartile;
◮ the extreme points are those of interest → anomalous inclusion frequencies.
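The 1.5-IQR whisker rule can be applied column by column to the I-frequency matrix. A minimal sketch with a hypothetical column of I-frequencies (the values are made up for illustration):

```python
import numpy as np

# Box-plot rule: flag I-frequencies lying more than k * IQR below the
# first quartile or above the third quartile of a column.
def flag_outliers(freqs, k=1.5):
    q1, q3 = np.percentile(freqs, [25, 75])
    iqr = q3 - q1
    return np.where((freqs < q1 - k * iqr) | (freqs > q3 + k * iqr))[0]

# Hypothetical column of the I-frequency matrix: most values cluster
# around 0.95; one observation drags its frequency down to 0.40.
col = np.array([0.96, 0.94, 0.95, 0.97, 0.93, 0.96, 0.40, 0.95])
suspects = flag_outliers(col)  # indices of anomalous observations
```

Only the observation with the strongly separated frequency (index 6 here) is flagged, which is exactly what the whiskers of a box-plot would show.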
Detection of possible influential points: body fat example
[Figure: box-plots of the I-frequencies (y-axis: inclusion frequency, 0 to 1) for each variable: age, weight, height, neck, chest, ab, hip, thigh, knee, ankle, biceps, forearm, wrist. Points outside the whiskers mark anomalous I-frequencies.]
Detection of possible influential points: body fat example
[Figure: the same box-plots with the extreme points labelled; the flagged anomalous I-frequencies correspond to observation 39.]