Super Learner with Application to Predicting HIV-1 Drug Resistance

Beilin Jia
University of North Carolina at Chapel Hill
BIOS 740 Final Presentation
April 24, 2018
Overview

1. Super learner
2. Simulation
3. HIV-1 example
4. Discussion
Super learner

The super learner is a prediction algorithm that applies a set of candidate learners to the observed data and chooses the optimal learner for a given prediction problem based on cross-validated risk.
Super learner (Sinisi et al., 2007)

Based on unified loss-based estimation theory. Three main steps:
1. Define the parameter of interest in terms of a loss function.
2. Construct a set of candidate estimators based on the loss function.
3. Select the optimal estimator by cross-validation.

In the paper, the candidate learning algorithms are:
- Least Angle Regression (LARS)
- Logic Regression
- Deletion/Substitution/Addition (D/S/A) algorithm
- Classification and Regression Trees (CART)
- Ridge Regression
- Linear Regression

The cross-validation selector picks the learner with the best performance on the validation sets; a minimal sketch of this discrete selection step follows.
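The sketch below illustrates the discrete cross-validation selector in Python with scikit-learn. It is a hedged illustration, not the authors' implementation: Logic Regression and D/S/A have no standard scikit-learn counterparts, so a CART-style tree stands in alongside LARS, ridge, and linear regression, and the toy data are invented for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Lars, Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.binomial(1, 0.4, size=(500, 10)).astype(float)  # toy binary covariates
y = 2 * X[:, 0] * X[:, 9] + rng.normal(size=500)        # toy outcome

candidates = {
    "linear": LinearRegression(),
    "lars": Lars(),
    "ridge": Ridge(alpha=1.0),
    "cart": DecisionTreeRegressor(max_depth=4, random_state=0),
}

# 10-fold cross-validated risk (mean squared error) for each candidate
cv_risk = {
    name: -cross_val_score(est, X, y, cv=10,
                           scoring="neg_mean_squared_error").mean()
    for name, est in candidates.items()
}
best = min(cv_risk, key=cv_risk.get)  # the cross-validation selector
print(cv_risk, "->", best)
```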
Super learner: oracle property

The super learner asymptotically performs as well as the best of the candidate estimators it uses, as long as the number of candidate learners is at most polynomial in the sample size. If one of the candidate estimators achieves a parametric rate of convergence, the super learner converges at an almost-parametric rate. (Sinisi and van der Laan, 2004; van der Laan and Dudoit, 2003; van der Laan et al., 2006; van der Vaart et al., 2006)
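Informally, the finite-sample oracle inequality behind this result has the following flavor; this is a schematic restatement under bounded-loss conditions, not the theorem verbatim (see van der Laan and Dudoit, 2003, for the precise constants and assumptions):

$$\mathbb{E}\,\tilde d\big(\hat\psi_{\hat k},\,\psi_0\big) \;\le\; (1+2\delta)\,\mathbb{E}\min_{k}\,\tilde d\big(\hat\psi_k,\,\psi_0\big) \;+\; C(\delta)\,\frac{\log K(n)}{n},$$

where $\tilde d$ denotes excess risk, $\hat k$ is the cross-validation selector, $K(n)$ is the number of candidate learners, and $\delta > 0$ is arbitrary. Since $\log K(n)/n \to 0$ even when $K(n)$ grows polynomially in $n$, the selector is asymptotically equivalent to the oracle choice.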
Simulation setup

n = 500 observations drawn as
$$y_i = 2w_1w_{10} + 4w_2w_7 + 3w_4w_5 - 5w_6w_{10} + 3w_8w_9 + \varepsilon_i, \qquad \varepsilon_i \sim N(0,1), \quad i = 1, \dots, 500,$$
with $w_j \sim \mathrm{Bernoulli}(0.4)$, $j = 1, \dots, 10$.

Y: outcome. W: 10-dimensional covariate vector.

10-fold cross-validation; internal cross-validation is used to select the optimal fraction in LARS and the fine-tuning parameters for Logic Regression and D/S/A.
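A minimal simulation of this data-generating distribution (reading the slide's Bin(0.4) as Bernoulli(0.4); the seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2018)
n = 500
W = rng.binomial(1, 0.4, size=(n, 10))       # w_j ~ Bernoulli(0.4), j = 1..10
w = {j: W[:, j - 1] for j in range(1, 11)}   # 1-based access matching the formula
y = (2 * w[1] * w[10] + 4 * w[2] * w[7] + 3 * w[4] * w[5]
     - 5 * w[6] * w[10] + 3 * w[8] * w[9]
     + rng.normal(size=n))                   # N(0, 1) noise
```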
Simulation results

Table 1: Super learner: cross-validated risks of candidate learners (n = 500)

Method                  Median   Mean    Std. Error
Linear Regression (1)   4.477    4.414   0.76
Linear Regression (2)   1.182    1.165   0.16
LARS (1)                4.594    4.719   0.92
LARS (2)                1.179    1.183   0.13
Logic Regression        1.026    1.043   0.21
D/S/A                   1.055    1.026   0.19
CART                    1.773    1.828   0.60
Ridge Regression        1.176    1.157   0.16
Simulation results

Next step: apply Logic Regression to the entire dataset. The final logic tree:
$$-3.09 \times ((\text{not } w_9)\ \text{or}\ (\text{not } w_8)) + 4.58 \times ((\text{not } w_{10})\ \text{or}\ (\text{not } w_6)) + 4.17 \times (((\text{not } w_6)\ \text{and}\ w_6)\ \text{or}\ (w_7\ \text{and}\ w_2)) - 3.09 \times ((\text{not } w_5)\ \text{or}\ (\text{not } w_4)) + 0.839 \times w_1$$

Due to the close competition between Logic Regression and D/S/A, Sinisi et al. (2007) evaluated the performance of the two estimators on an independent test set of sample size 5,000:
- Logic Regression: mean squared prediction error (MSPE) = 1.37, R^2 = 0.84
- D/S/A: MSPE = 1.05, R^2 = 0.88
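For reference, the two test-set metrics can be computed as below; treating R^2 as 1 - MSPE/Var(y) is one common convention and an assumption of this sketch:

```python
import numpy as np

def mspe_r2(y_true, y_pred):
    """Mean squared prediction error and R^2 (as 1 - MSPE/Var) on a test set."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mspe = np.mean((y_true - y_pred) ** 2)
    return mspe, 1.0 - mspe / np.var(y_true)
```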
Simulation results

Sinisi et al. (2007) applied the super learner to datasets of increasing sample size, n = 100, n = 1,000, n = 10,000, with the same set of candidate learners.
- The estimated cross-validated risks vary less as the sample size increases.
- Both D/S/A and Logic Regression converge at a parametric rate to the true model.
Simulation results

Try a different data-generating distribution:
$$y_i = 2w_1w_{10} + 4w_2w_7 + 3w_4w_5 - 5w_6w_{10} + 3w_8w_9 + w_1w_2w_4 - 2w_7(1-w_6)w_9 - 4(1-w_{10})w_1(1-w_4) + \varepsilon_i, \qquad \varepsilon_i \sim N(0,1),$$
with $w_j \sim \mathrm{Bernoulli}(0.4)$, $j = 1, \dots, 10$.
- Under this model, no single candidate learner converges at a parametric rate to the true model.
- Add a new candidate learner: a convex combination of the other candidate learners, e.g. $\hat y_{\mathrm{convex},\alpha} = \alpha\,\hat y_{\mathrm{DSA}} + (1-\alpha)\,\hat y_{\mathrm{Logic}}$.
- D/S/A beats Logic Regression in terms of cross-validated risk, but the convex combination of the two ($\alpha = 0.8316$) outperforms both; a sketch of the $\alpha$ search follows.
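A hedged sketch of how the weight alpha might be chosen: evaluate the risk of the combination over a grid of alpha values on held-out predictions and keep the minimizer. (For squared-error loss the optimal alpha also has a closed form, but the grid makes the idea explicit.) The prediction vectors in the usage line are hypothetical placeholders.

```python
import numpy as np

def best_convex_weight(y, pred_a, pred_b, n_grid=10_001):
    """Grid-search alpha minimizing the risk of alpha*pred_a + (1-alpha)*pred_b.

    y, pred_a, pred_b should be held-out (cross-validated) outcomes and
    predictions, so the chosen alpha is not tuned on the training folds.
    """
    alphas = np.linspace(0.0, 1.0, n_grid)
    combos = alphas[:, None] * pred_a + (1.0 - alphas[:, None]) * pred_b
    risks = np.mean((combos - y) ** 2, axis=1)   # squared-error risk per alpha
    i = np.argmin(risks)
    return alphas[i], risks[i]

# hypothetical usage with cross-validated predictions from D/S/A and Logic Regression:
# alpha, risk = best_convex_weight(y_holdout, yhat_dsa, yhat_logic)
```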
HIV-1 example

Data from the Stanford HIV Reverse Transcriptase and Protease Sequence Database (Rhee et al., 2006).

Goal: predict viral susceptibility to protease inhibitors (PIs) based on mutations in the protease region of the viral strand. Predictors: non-polymorphic treatment-selected mutations (TSMs); 58 TSMs are used, occurring at 34 positions in the protease.

Outcome: standardized log fold change in drug susceptibility, where
$$\text{fold change} = \frac{\mathrm{IC}_{50}\ \text{of an isolate}}{\mathrm{IC}_{50}\ \text{of a standard wild-type control isolate}}$$
and IC50 (inhibitory concentration) is the concentration of a drug needed to inhibit viral replication by 50%.
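A small sketch of the outcome construction; the log base (10) and the standardization to mean 0 and unit variance are assumptions of this illustration, not specifics taken from the paper:

```python
import numpy as np

def standardized_log_fold_change(ic50_isolate, ic50_wildtype):
    """Standardized log fold change in IC50 relative to a wild-type control."""
    lfc = np.log10(np.asarray(ic50_isolate, dtype=float) / ic50_wildtype)
    return (lfc - lfc.mean()) / lfc.std()  # assumed standardization: mean 0, sd 1
```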
HIV-1 example, results

Apply the super learner to predicting susceptibility to a single PI, nelfinavir (NFV), with 10-fold cross-validation. Candidate learners: LARS, Logic Regression, D/S/A, CART, Ridge Regression, and linear regression.

Rhee et al. (2006) found that including all two-way interactions among the mutations as inputs did not improve prediction accuracy.

Optimal learner: linear regression with all 58 main terms, average cross-validated risk = 0.187.
HIV-1 example, results

The D/S/A estimator was a close second, with average cross-validated risk = 0.188. Both linear regression and D/S/A were then applied to the entire dataset. Cross-validation selects a final D/S/A estimator with 40 main terms; the fit is only marginally improved by including the other 18 mutations in the prediction model. A rough sketch of this size-selection step follows.
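The D/S/A search over models of each size has no standard Python implementation, and the sketch below is not the authors' algorithm (D/S/A also allows deletion and substitution moves). As a rough stand-in, it approximates the size-indexed search with forward selection and picks the size by cross-validated risk; the data are simulated placeholders for the 58 mutation indicators and the standardized outcome.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.binomial(1, 0.3, size=(300, 58)).astype(float)  # placeholder mutation matrix
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=300)        # placeholder outcome

risks = {}
for k in (5, 10, 20, 40):                               # candidate model sizes
    sfs = SequentialFeatureSelector(LinearRegression(),
                                    n_features_to_select=k, cv=10)
    cols = sfs.fit(X, y).get_support()
    # 10-fold CV risk (MSE) of the best size-k model found
    risks[k] = -cross_val_score(LinearRegression(), X[:, cols], y,
                                scoring="neg_mean_squared_error", cv=10).mean()

best_size = min(risks, key=risks.get)
```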
HIV-1 example, results

[Figure 1: D/S/A estimator applied to the learning sample, model sizes 1 to 50.]
HIV-1 example, results

[Figure 2: Linear regression model fit.]
HIV-1 example, results

[Figure 3: D/S/A estimator: best model of each size from 1 to 20 (e.g., best model of size 1: L90M; best model of size 2: L90M and 30N; etc.).]

The p-values for the coefficients from the linear regression fit, together with the list of best D/S/A models of each size, indicate the importance of each candidate mutation for resistance to NFV. The two models yield quite comparable insight into the set of mutations key to predicting susceptibility to NFV.
Discussion

- Instead of committing to a single learning algorithm for a prediction problem, a better approach is to apply as many candidate learners as feasible and choose the optimal one.
- There is a benefit to including convex combinations of candidates as additional candidate learners.
- In practice, there is no guarantee that the super learner will always select the optimal learner: the variability in the estimated cross-validated risk of each candidate learner clearly depends on the size of the dataset, and the candidate learner selected can shift as the sample size increases.
- It is therefore worthwhile to evaluate not only the final optimal estimator but also the competitive estimators.
References I

Soo-Yon Rhee, Jonathan Taylor, Gauhar Wadhera, Asa Ben-Hur, Douglas L. Brutlag, and Robert W. Shafer. Genotypic predictors of human immunodeficiency virus type 1 drug resistance. Proceedings of the National Academy of Sciences, 103(46):17355–17360, 2006.

Sandra E. Sinisi and Mark J. van der Laan. Deletion/substitution/addition algorithm in learning with applications in genomics. Statistical Applications in Genetics and Molecular Biology, 3(1):1–38, 2004.

Sandra E. Sinisi, Eric C. Polley, Maya L. Petersen, Soo-Yon Rhee, and Mark J. van der Laan. Super learning: an application to the prediction of HIV-1 drug resistance. Statistical Applications in Genetics and Molecular Biology, 6(1), 2007.

Mark J. van der Laan and Sandrine Dudoit. Unified cross-validation methodology for selection among estimators and a general cross-validated adaptive epsilon-net estimator: finite sample oracle inequalities and examples. U.C. Berkeley Division of Biostatistics Working Paper Series, 2003.
References II

Mark J. van der Laan, Sandrine Dudoit, and Aad W. van der Vaart. The cross-validated adaptive epsilon-net estimator. Statistics & Decisions, 24(3):373–395, 2006.

Aad W. van der Vaart, Sandrine Dudoit, and Mark J. van der Laan. Oracle inequalities for multi-fold cross validation. Statistics & Decisions, 24(3):351–371, 2006.
Thanks for listening!