  1. Optimally Combining Outcomes to Improve Prediction David Benkeser benkeser@berkeley.edu UC Berkeley Biostatistics November 15, 2016

  2. Acknowledgments Collaborators: Mark van der Laan, Alan Hubbard, Ben Arnold, Jack Colford, Andrew Mertens, Oleg Sofrygin. Funding: Bill and Melinda Gates Foundation OPP1147962

  3. Motivation The Gates Foundation’s HBGDki is a program aimed at improving early childhood development. [1] A “precision public health” initiative: getting the right intervention to the right child at the right time.

  4. Cebu Study The Cebu Longitudinal Health and Nutrition Survey enrolled pregnant women between 1983 and 1984. [2] Children were followed every 6 months for two years after birth and again at ages 8, 11, 16, 18, and 21. Research is focused on the long-term effects of prenatal and early childhood exposures on later outcomes.

  5. Cebu Study Questions: 1. Can a screening tool be constructed using information available in early childhood to predict neurocognitive deficits later in life? 2. What variables improve prediction of neurocognitive outcomes? 3. Do somatic growth measures improve predictions of neurocognitive outcomes?

  6. Observed data The observed data are n = 2,166 observations $O = (X, Y)$. $X$: D covariates, early childhood measurements (health care access, sanitation, parental information, gestational age). $Y$: J outcomes, achievement test scores at 11 years old (Math, English, Cebuano).

  7. Combining test scores Reasons to combine test scores into single score: 1. No scientific reason to prefer one score 2. Predicting deficit in any domain is important 3. Avoid multiple comparisons 4. Improve prediction?

  8. Existing methods Scores could be standardized and summed: $Z_i = \sum_j (Y_{ij} - \bar{Y}_j)/\hat{\sigma}_j$. Downsides: 1. Somewhat ad-hoc 2. Outcomes may not be strongly related to X
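As a concrete illustration (not from the slides), a minimal Python/NumPy sketch of the standardize-and-sum composite; the function name and array layout are assumptions:

```python
import numpy as np

def standardized_sum(Y):
    """Standardize each outcome column and sum across outcomes:
    Z_i = sum_j (Y_ij - Ybar_j) / sigma_hat_j.

    Y: (n, J) array of outcome scores. Returns an (n,) composite score."""
    Z = (Y - Y.mean(axis=0)) / Y.std(axis=0, ddof=1)
    return Z.sum(axis=1)
```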

  9. Existing methods Principal components / factor analysis of Y: perform some transformation of Y, look at eigenvalues, scree plots, etc. to choose factors, then decide on a linear combination of the factors. Downsides: 1. Very ad-hoc 2. Difficult to interpret/explain 3. Outcomes may not be strongly related to X

  10. Existing methods Supervised methods, e.g., canonical correlation, redundancy analysis, partial least squares, etc... Downsides: 1. Difficult to interpret 2. Outcomes not naturally combined into single score 3. Inference not straightforward

  11. Combining scores What are characteristics of a good composite score? 1. Simple to interpret 2. Reflects the scientific goal (prediction) 3. Procedure can be fully pre-specified. Consider a simple weighted combination of outcomes, $Y_\omega = \sum_j \omega_j Y_j$, with $\omega_j > 0$ for all $j$ and $\sum_j \omega_j = 1$. Can we choose the weights to optimize our predictive power?
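A small sketch of the weighted composite for a given weight vector (Python assumed; names are illustrative):

```python
import numpy as np

def weighted_composite(Y, omega):
    """Y_omega = sum_j omega_j * Y_j for a single weight vector.

    Y: (n, J) outcome array; omega: (J,) weights with omega_j > 0 summing to 1."""
    omega = np.asarray(omega, dtype=float)
    assert np.all(omega > 0) and np.isclose(omega.sum(), 1.0)
    return Y @ omega
```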

  12. Predicting composite outcome Consider predicting the composite outcome $Y_\omega$ with a prediction function $\psi_\omega(X)$. A measure of the performance of $\psi_\omega$ is $R^2_{0,\omega}(\psi_\omega) = 1 - \frac{E_0[\{Y_\omega - \psi_\omega(X)\}^2]}{E_0[\{Y_\omega - \mu_{0,\omega}\}^2]}$, where $\mu_{0,\omega} = E_0(Y_\omega)$.
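An empirical analogue of this R-squared measure, as a sketch (sample means stand in for $E_0$; the names are assumptions):

```python
import numpy as np

def r_squared(y_omega, psi_of_x):
    """Empirical analogue of R^2_{0,omega}(psi_omega): one minus the MSE of the
    predictions divided by the MSE of the marginal mean mu_{0,omega}."""
    mse_pred = np.mean((y_omega - psi_of_x) ** 2)
    mse_mean = np.mean((y_omega - np.mean(y_omega)) ** 2)
    return 1.0 - mse_pred / mse_mean
```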

  13. Predicting composite outcome Easy to show that $R^2_{0,\omega}$ is maximized by using $\psi_{0,\omega}(X) = E_0(Y_\omega \mid X) = E_0\bigl(\textstyle\sum_j \omega_j Y_j \mid X\bigr) = \sum_j \omega_j E_0(Y_j \mid X) = \sum_j \omega_j \psi_{0,j}(X)$. The best predictor of the composite outcome is the weighted combination of the best predictor for each outcome.

  14. Choosing weights For each choice of weights, $R^2_{0,\omega}(\psi_{0,\omega}) = 1 - \frac{E_0[\{Y_\omega - \psi_{0,\omega}(X)\}^2]}{E_0[\{Y_\omega - \mu_{0,\omega}\}^2]}$ is the best we could do predicting the composite outcome. Now we choose the weights that maximize R-squared, $\omega_0 = \operatorname{argmax}_\omega R^2_{0,\omega}(\psi_{0,\omega})$. The statistical goal is to estimate $\omega_0$ and $\psi_{0,\omega_0}$.
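A minimal sketch of this maximization, assuming per-outcome predictions are already in hand and using a generic constrained optimizer from SciPy; the function and argument names are assumptions, not the method from the talk:

```python
import numpy as np
from scipy.optimize import minimize

def best_weights(Y, psi_hat):
    """Maximize the empirical R^2 of the composite outcome over simplex weights.

    Y: (n, J) outcomes; psi_hat: (n, J) per-outcome predictions psi_j(X_i)."""
    n, J = Y.shape

    def neg_r2(omega):
        y_w = Y @ omega
        psi_w = psi_hat @ omega          # linearity: psi_omega = sum_j omega_j psi_j
        mse = np.mean((y_w - psi_w) ** 2)
        var = np.mean((y_w - y_w.mean()) ** 2)
        return -(1.0 - mse / var)

    constraints = ({"type": "eq", "fun": lambda w: w.sum() - 1.0},)
    bounds = [(1e-6, 1.0)] * J
    result = minimize(neg_r2, np.full(J, 1.0 / J), bounds=bounds, constraints=constraints)
    return result.x
```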

  15. Simple example Let $X = (X_1, \ldots, X_6)$ and $Y = (Y_1, \ldots, Y_6)$, with $X_d \sim \mathrm{Uniform}(0, 4)$ for $d = 1, \ldots, 6$; $Y_j \sim \mathrm{Normal}(0, 25)$ for $j = 1, 2, 3$; and $Y_j \sim \mathrm{Normal}(X_1 + 2X_2 + 4X_3 + 2X_j, 25)$ for $j = 4, 5, 6$. So $Y_1$, $Y_2$, $Y_3$ are noise; $X_1$, $X_2$, $X_3$ predict $Y_4$, $Y_5$, $Y_6$; and $X_j$ predicts only $Y_j$ for $j = 4, 5, 6$. Resulting $R^2_0$: $Y_1, Y_2, Y_3$: 0.00; $Y_4, Y_5, Y_6$: 0.57; standardized composite: 0.37; optimal composite: 0.87.
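A sketch of simulating this example in Python/NumPy, reading the second Normal parameter as a variance (so the standard deviation is 5); the sample size and seed are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
X = rng.uniform(0, 4, size=(n, 6))            # X_1, ..., X_6 ~ Uniform(0, 4)
Y = np.empty((n, 6))
Y[:, :3] = rng.normal(0, 5, size=(n, 3))      # Y_1, Y_2, Y_3: pure noise, variance 25
for j in range(3, 6):                         # Y_4, Y_5, Y_6 depend on X_1, X_2, X_3, X_j
    mean = X[:, 0] + 2 * X[:, 1] + 4 * X[:, 2] + 2 * X[:, j]
    Y[:, j] = mean + rng.normal(0, 5, size=n)
```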

  16. Predicting each outcome The best predictor of the composite outcome is the weighted combination of the best predictor for each outcome. How should we go about estimating a prediction function for each outcome? Linear regression, with interactions and nonlinear terms, or splines (with different degrees?) Penalized linear regression, with different penalties? Random forests, with different tuning parameters? Gradient boosting? Support vector machines? Deep neural networks? Ad infinitum...

  17. Predicting each outcome We have no way of knowing a priori which algorithm is best. This depends on the truth! The best prediction function might be different across outcomes. How can we objectively evaluate M algorithms for $Y_j$? Option 1: Train algorithms, see which has the best $R^2$: overfit! Option 2: Train algorithms, do a new experiment to evaluate: expensive! Option 3: Cross validation!

  18. Cross validation Consider randomly splitting the data into K different pieces, $S_1, S_2, S_3, S_4, S_5$ (here K = 5).
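A minimal sketch of making such splits (Python; the helper name and seed are assumptions):

```python
import numpy as np

def make_splits(n, K=5, seed=1):
    """Randomly partition the indices {0, ..., n-1} into K roughly equal splits S_1, ..., S_K."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n), K)
```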

  19. Cross validation Validation sample $V_1 = \{i \in S_1\}$ and training sample $T_1 = \{i \notin S_1\}$. [Diagram: split 1 is the validation sample; splits 2 through 5 form the training sample.]

  20. Cross validation For $m = 1, \ldots, M$, fit algorithms using the training sample. For example, $\Psi_m$ could correspond to a linear regression: 1. Estimate the parameters of the regression model using $\{O_i : i \in T_1\}$. 2. $\Psi_m(T_1)$ is now a prediction function. To predict on a new observation x, $\Psi_m(T_1)(x) = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \ldots + \hat{\beta}_D x_D$.
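A sketch of this step with scikit-learn's linear regression (an assumed implementation choice, not necessarily the one used in the talk):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_linear_algorithm(X_train, y_train):
    """Train on {O_i : i in T_1} and return the prediction function Psi_m(T_1)."""
    model = LinearRegression().fit(X_train, y_train)
    return lambda x_new: model.predict(np.atleast_2d(x_new))
```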

  21. Cross validation The algorithm $\Psi_m$ could be more complicated: 1. Estimate the parameters of the full regression model using $\{O_i : i \in T_1\}$. 2. Use backward selection, eliminating variables with p-value > 0.1. 3. Stop when no more variables are eliminated. 4. $\Psi_m(T_1)$ is now a prediction function. To predict on a new observation x, $\Psi_m(T_1)(x) = x_{\mathrm{final}} \hat{\beta}_{\mathrm{final}}$.
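One way such a backward-selection algorithm might look, sketched with statsmodels OLS and dropping one covariate per iteration; the helper name, data layout, and the one-at-a-time elimination order are assumptions:

```python
import numpy as np
import statsmodels.api as sm

def backward_selection(X_train, y_train, alpha=0.1):
    """Fit OLS on the training fold, then repeatedly drop the covariate with the
    largest p-value while that p-value exceeds alpha; return a prediction function."""
    cols = list(range(X_train.shape[1]))
    fit = sm.OLS(y_train, sm.add_constant(X_train[:, cols])).fit()
    while len(cols) > 0:
        pvals = np.asarray(fit.pvalues)[1:]        # skip the intercept
        worst = int(np.argmax(pvals))
        if pvals[worst] <= alpha:
            break                                  # nothing left to eliminate
        cols.pop(worst)
        fit = sm.OLS(y_train, sm.add_constant(X_train[:, cols])).fit()

    def predict(x_new):
        x_new = np.atleast_2d(x_new)
        return fit.predict(sm.add_constant(x_new[:, cols], has_constant="add"))

    return predict
```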

  22. Cross validation The algorithm $\Psi_m$ could be a machine learning algorithm: 1. Train a support vector machine using $\{O_i : i \in T_1\}$. 2. $\Psi_m(T_1)$ is now a “black box” prediction function. To predict on a new observation x, feed x into the black box and get a prediction back.
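A sketch of the same step with support vector regression from scikit-learn (default tuning parameters, purely illustrative):

```python
from sklearn.svm import SVR

def fit_svm_algorithm(X_train, y_train):
    """Train an SVM regressor on the training sample; return its prediction function."""
    model = SVR().fit(X_train, y_train)
    return model.predict          # the "black box": x in, prediction out
```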

  23. Cross validation For $m = 1, \ldots, M$, compute the error on the validation sample: $E_{m,1} = \frac{1}{|V_1|} \sum_{i \in V_1} \{Y_i - \Psi_m(T_1)(X_i)\}^2$.

  24. Cross validation For $m = 1, \ldots, M$, compute the error on the validation sample, $E_{m,1} = \frac{1}{|V_1|} \sum_{i \in V_1} \{Y_i - \Psi_m(T_1)(X_i)\}^2$. To compute: 1. Obtain predictions on the validation sample using the algorithms fit on the training sample. 2. Average the squared residual for each observation. As though we did another experiment (of size $|S_1|$) to evaluate the algorithms!
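A sketch of computing this validation error for one fitted algorithm (names and array layout assumed):

```python
import numpy as np

def validation_error(psi_m, X, y, val_idx):
    """E_{m,k}: average squared residual of a fitted algorithm over a validation sample."""
    preds = psi_m(X[val_idx])
    return np.mean((y[val_idx] - preds) ** 2)
```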

  25. Cross validation Validation sample $V_2 = \{i \in S_2\}$ and training sample $T_2 = \{i \notin S_2\}$. [Diagram: split 2 is now the validation sample; the remaining splits form the training sample.]

  26. Cross validation For $m = 1, \ldots, M$, fit the algorithms using the training sample $T_2$.

  27. Cross validation For $m = 1, \ldots, M$, compute the error on the validation sample: $E_{m,2} = \frac{1}{|V_2|} \sum_{i \in V_2} \{Y_i - \Psi_m(T_2)(X_i)\}^2$.

  28. Cross validation Continue until each split has served as the validation sample once (here, split 3).

  29. Cross validation Continue until each split has served as the validation sample once (here, split 4).

  30. Cross validation Continue until each split has served as the validation sample once (here, split 5).

  31. Cross validation selector The overall performance of algorithm m is $\bar{E}_m = \frac{1}{K} \sum_k E_{m,k}$, the average mean squared error across splits. At this point, we could choose $m^*$, the algorithm with the lowest error. This is called the cross-validation selector. Its prediction function is $\Psi_{m^*}(F)$, where $F = \{1, \ldots, n\}$ is the full data.
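A sketch of the cross-validation selector given the matrix of fold-specific errors (the array layout is an assumption):

```python
import numpy as np

def cv_selector(errors):
    """errors: (M, K) array of E_{m,k}. Returns the index of the algorithm with the
    lowest average cross-validated error and that average error."""
    e_bar = errors.mean(axis=1)        # bar{E}_m = (1/K) * sum_k E_{m,k}
    m_star = int(np.argmin(e_bar))
    return m_star, e_bar[m_star]
```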

  32. Super learner Alternatively, consider an ensemble prediction function $\Psi = \sum_m \alpha_m \Psi_m$, with $\alpha_m > 0$ for all m and $\sum_m \alpha_m = 1$, and choose $\alpha$ to minimize the cross-validated error. Often seen to have superior performance to choosing the single best algorithm. This estimator is referred to as the Super Learner. [3]
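One common way to compute such convex weights from the cross-validated predictions is non-negative least squares followed by normalization; the sketch below assumes that choice, since the slides do not specify the meta-learning step:

```python
import numpy as np
from scipy.optimize import nnls

def super_learner_weights(Z, y):
    """Z: (n, M) cross-validated predictions, column m holding Psi_m fit on the folds
    not containing observation i; y: (n,) outcome. Returns convex weights alpha."""
    alpha, _ = nnls(Z, y)
    return alpha / alpha.sum()         # normalize; assumes at least one nonzero weight
```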

  33. Super learner For example, linear regression might capture one feature, but support vector machines capture another. The prediction function $\Psi(x) = 0.5\,\Psi_{\mathrm{linmod}}(x) + 0.5\,\Psi_{\mathrm{svm}}(x)$ might be better than $\Psi_{\mathrm{linmod}}$ or $\Psi_{\mathrm{svm}}$ alone. Computing the best weighting of algorithms is computationally simple after cross validation.

  34. Oracle inequality The name Super Learner derives from an important theoretical result called the oracle inequality. For large enough sample size, the Super Learner predicts as well as the (unknown) best algorithm considered. The number of algorithms one may consider is large and allowed to grow with n, e.g., $M_n = n^2$. [4]

  35. Combined prediction function We now have a super learner prediction function $\Psi_j(F)$ for each outcome $Y_j$. For any choice of weights, we have a prediction function for the combined outcome: $\psi_{n,\omega} = \sum_j \omega_j \Psi_j(F)$. Still need to choose the weights that maximize the predictive performance of the combined outcome.
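A sketch of assembling this combined prediction function from the per-outcome fits (function and argument names are assumptions):

```python
import numpy as np

def combined_predictor(sl_fits, omega):
    """sl_fits: list of J per-outcome prediction functions Psi_j(F); omega: (J,) weights.
    Returns psi_{n,omega}(x) = sum_j omega_j * Psi_j(F)(x)."""
    omega = np.asarray(omega, dtype=float)

    def psi(x_new):
        preds = np.column_stack([f(x_new) for f in sl_fits])   # (n_new, J)
        return preds @ omega

    return psi
```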

  36. Choosing optimal weights How do we go about estimating $\omega_0$, the weights that yield the highest $R^2$? Option 1: Get SL predictions, maximize $R^2$ over the weights: overfit! Option 2: Do a new experiment, maximize $R^2$ on the new data: expensive! Option 3: Cross validation!
