Cross-validation
1. Split the data into $V$ "folds" of size roughly $n / V$.
2. For each fold $v = 1, \ldots, V$:
  • the data in folds other than $v$ is called the training set;
  • the data in fold $v$ is called the test/validation set;
  • we obtain $\hat\mu_{k,v}$ using the training set;
  • we obtain $\hat\mu_{k,v}(X_i)$ for $X_i$ in the validation set $\mathcal{V}_v$.
3. Our cross-validated MSE is
$$\widehat{\mathrm{MSE}}_{CV}(\hat\mu_k) = \frac{1}{V} \sum_{v=1}^{V} \frac{1}{|\mathcal{V}_v|} \sum_{i \in \mathcal{V}_v} \left[ Y_i - \hat\mu_{k,v}(X_i) \right]^2 .$$
We average the MSEs of the $V$ validation sets.

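The three steps above can be sketched in plain Python. This is a minimal illustration, not the lab's implementation: the helper names (`make_folds`, `fit_mean`, `cv_mse`), the toy data, and the mean-only candidate algorithm are all assumptions made for the example.

```python
import random

def make_folds(n, V, seed=0):
    """Randomly partition indices 0..n-1 into V folds of size roughly n / V."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[v::V] for v in range(V)]

def fit_mean(X_train, Y_train):
    """Toy candidate algorithm (hypothetical): predict the training mean of Y."""
    m = sum(Y_train) / len(Y_train)
    return lambda x: m

def cv_mse(fit, X, Y, folds):
    """Average over folds of the validation MSE of the fit from the other folds."""
    fold_mses = []
    for val in folds:
        held_out = set(val)
        train = [i for i in range(len(Y)) if i not in held_out]
        # mu_hat_{k,v}: candidate k trained without fold v
        mu_kv = fit([X[i] for i in train], [Y[i] for i in train])
        # MSE of mu_hat_{k,v} on the validation set V_v
        fold_mses.append(sum((Y[i] - mu_kv(X[i])) ** 2 for i in val) / len(val))
    return sum(fold_mses) / len(folds)

# Toy data, made up for illustration only.
X = [float(i) for i in range(20)]
Y = [2.0 * x + 1.0 for x in X]
folds = make_folds(len(Y), V=5)
mse_mean = cv_mse(fit_mean, X, Y, folds)
```

Any function that takes training data and returns a prediction function can be passed as `fit`, so the same `cv_mse` can score every candidate in a library on the same folds.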
[Figure: Schematic of 10-fold cross-validation, with one column of CV predictions per fold (Fold 1 through Fold 10). Gray: training sets. Yellow: validation sets.]

How do we choose $V$?
• Large $V$:
  – more training data, so better for small $n$
  – more computation time
  – well-suited to high-dimensional covariates
  – well-suited to complicated or non-smooth $\mu$
• Small $V$:
  – more test data
  – less computation time
(People typically use $V = 5$ or $V = 10$.)

"Discrete" Super Learner
• At this point, we have cross-validated MSE estimates $\widehat{\mathrm{MSE}}_{CV}(\hat\mu_1), \ldots, \widehat{\mathrm{MSE}}_{CV}(\hat\mu_K)$ for each of our candidate algorithms.
• We could simply take as our estimator the $\hat\mu_k$ minimizing these cross-validated MSEs.
• We call this the "discrete Super Learner".

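As a small illustration of the selection rule, the snippet below picks the candidate with the smallest cross-validated MSE. The algorithm names and MSE values are made-up placeholders, not results from any analysis.

```python
# Hypothetical cross-validated MSEs for K = 3 candidate algorithms
# (illustrative numbers only).
cv_mses = {"mean": 4.10, "linear": 1.25, "tree": 1.80}

# Discrete Super Learner: select the candidate minimizing the CV MSE.
discrete_sl = min(cv_mses, key=cv_mses.get)
print(discrete_sl)  # linear
```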
Super Learner
• Let $\lambda = (\lambda_1, \ldots, \lambda_K)$ be an element of $S_K$, the $K$-dimensional simplex: each $\lambda_k \in [0, 1]$ and $\sum_k \lambda_k = 1$.
• Super Learner considers as its set of candidate algorithms all convex combinations $\hat\mu_\lambda := \sum_{k=1}^{K} \lambda_k \hat\mu_k$.
• The Super Learner is $\hat\mu_{\hat\lambda}$, where
$$\hat\lambda := \arg\min_{\lambda \in S_K} \widehat{\mathrm{MSE}}_{CV}\left( \sum_{k=1}^{K} \lambda_k \hat\mu_k \right).$$
(We use constrained optimization to compute the argmin.)

Super Learner
$$\hat\lambda := \arg\min_{\lambda \in S_K} \widehat{\mathrm{MSE}}_{CV}\left( \sum_{k=1}^{K} \lambda_k \hat\mu_k \right),$$
where
$$\widehat{\mathrm{MSE}}_{CV}\left( \sum_{k=1}^{K} \lambda_k \hat\mu_k \right) = \frac{1}{V} \sum_{v=1}^{V} \frac{1}{|\mathcal{V}_v|} \sum_{i \in \mathcal{V}_v} \left[ Y_i - \sum_{k=1}^{K} \lambda_k \hat\mu_{k,v}(X_i) \right]^2 .$$

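The slides say only that constrained optimization is used for the argmin. As one possible choice, the sketch below uses the Frank-Wolfe method with exact line search, which keeps $\lambda$ inside the simplex at every step. It is an assumption for illustration, not necessarily the solver used in practice. Here `Z[i][k]` holds the CV prediction $\hat\mu_{k,v}(X_i)$ and `w[i]` is the weight $1 / (V |\mathcal{V}_v|)$ of observation $i$; both names and the toy data are hypothetical.

```python
def cv_mse(lam, Z, y, w):
    """Weighted CV MSE of the combination sum_k lam[k] * Z[i][k]."""
    return sum(wi * (yi - sum(l * z for l, z in zip(lam, zi))) ** 2
               for wi, yi, zi in zip(w, y, Z))

def sl_weights(Z, y, w, n_iter=500):
    """Frank-Wolfe with exact line search over the simplex (illustrative)."""
    K = len(Z[0])
    # Start at the best single candidate (the discrete Super Learner).
    vertices = [[1.0 if j == k else 0.0 for j in range(K)] for k in range(K)]
    lam = min(vertices, key=lambda e: cv_mse(e, Z, y, w))
    for _ in range(n_iter):
        resid = [yi - sum(l * z for l, z in zip(lam, zi)) for yi, zi in zip(y, Z)]
        grad = [-2.0 * sum(wi * r * zi[k] for wi, r, zi in zip(w, resid, Z))
                for k in range(K)]
        # Move toward the vertex with the most negative gradient coordinate.
        j = min(range(K), key=lambda k: grad[k])
        d = [(1.0 if k == j else 0.0) - lam[k] for k in range(K)]
        Zd = [sum(dk * z for dk, z in zip(d, zi)) for zi in Z]
        curv = sum(wi * zd * zd for wi, zd in zip(w, Zd))
        if curv == 0.0:
            break
        slope = sum(g * dk for g, dk in zip(grad, d))
        # The objective along d is quadratic: f + gamma*slope + gamma^2*curv.
        gamma = min(1.0, max(0.0, -slope / (2.0 * curv)))
        lam = [l + gamma * dk for l, dk in zip(lam, d)]
    return lam

# Toy CV predictions: candidate 0 is always 1 too low, candidate 1 is always
# 1 too high, candidate 2 always predicts 0 (made-up values).
y = [1.0, 2.0, 3.0, 4.0]
Z = [[yi - 1.0, yi + 1.0, 0.0] for yi in y]
w = [1.0 / len(y)] * len(y)  # equal fold sizes, so every weight is 1/n
lam_hat = sl_weights(Z, y, w)
```

Because the search starts at the discrete Super Learner and each step cannot increase the objective, the returned weights have a CV MSE no larger than that of the best single candidate. In this toy case, averaging the two biased candidates cancels their errors.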
Super Learner: steps
Putting it all together:
1. Define a library of candidate algorithms $\hat\mu_1, \ldots, \hat\mu_K$.
2. Obtain the CV-predictions $\hat\mu_{k,v}(X_i)$ for all $k$, $v$ and $i \in \mathcal{V}_v$.
3. Use constrained optimization to compute the SL weights $\hat\lambda := \arg\min_{\lambda \in S_K} \widehat{\mathrm{MSE}}_{CV}\left( \sum_{k=1}^{K} \lambda_k \hat\mu_k \right)$.
4. Take $\hat\mu_{SL} = \sum_{k=1}^{K} \hat\lambda_k \hat\mu_k$.

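The four steps can be strung together in a short end-to-end sketch. Everything here is an illustrative assumption: the two-algorithm library (a mean-only fit and a one-covariate least-squares line), the function names, and the toy data. With $K = 2$ the weight vector is $(t, 1 - t)$, so the constrained minimization reduces to a one-dimensional quadratic whose minimizer is clipped to $[0, 1]$. Step 4 is read as combining the candidates refit on all the data.

```python
import random

def fit_mean(xs, ys):
    m = sum(ys) / len(ys)
    return lambda x: m

def fit_line(xs, ys):
    """Least-squares line in a single covariate."""
    xbar, ybar = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - xbar) ** 2 for x in xs)
    slope = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
    return lambda x: ybar + slope * (x - xbar)

def super_learner(X, Y, V=5, seed=0):
    library = [fit_mean, fit_line]               # step 1: library with K = 2
    n = len(Y)
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[v::V] for v in range(V)]
    Z = [[0.0, 0.0] for _ in range(n)]           # step 2: CV predictions
    w = [0.0] * n                                # weight 1 / (V * |V_v|)
    for val in folds:
        held_out = set(val)
        train = [i for i in range(n) if i not in held_out]
        fits = [fit([X[i] for i in train], [Y[i] for i in train])
                for fit in library]
        for i in val:
            Z[i] = [f(X[i]) for f in fits]
            w[i] = 1.0 / (V * len(val))
    # step 3: minimize sum_i w_i * (y_i - z_i1 - t * (z_i0 - z_i1))^2 over t
    num = sum(wi * (y - z[1]) * (z[0] - z[1]) for wi, y, z in zip(w, Y, Z))
    den = sum(wi * (z[0] - z[1]) ** 2 for wi, z in zip(w, Z))
    t = min(1.0, max(0.0, num / den)) if den > 0 else 1.0
    lam = [t, 1.0 - t]
    # step 4: combine the candidates refit on all the data (assumed reading)
    full = [fit(X, Y) for fit in library]
    return lam, (lambda x: sum(l * f(x) for l, f in zip(lam, full)))

# Toy data: a noisy line, made up for illustration.
X = [float(i) for i in range(20)]
Y = [2.0 * x + 1.0 + (0.1 if i % 2 == 0 else -0.1) for i, x in enumerate(X)]
lam_hat, mu_sl = super_learner(X, Y)
```

For larger libraries the closed-form step 3 no longer applies and a general simplex-constrained solver is needed.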
II. Lab 1: Vanilla SL for a continuous outcome
III. Into the weeds: a mathematical presentation of SL

Review
Recall the construction of SL for a continuous outcome:
1. Define a library of candidate algorithms $\hat\mu_1, \ldots, \hat\mu_K$.
2. Obtain the CV-predictions $\hat\mu_{k,v}(X_i)$ for all $k$, $v$ and $i \in \mathcal{V}_v$.
3. Use constrained optimization to compute the SL weights $\hat\lambda := \arg\min_{\lambda \in S_K} \widehat{\mathrm{MSE}}_{CV}\left( \sum_{k=1}^{K} \lambda_k \hat\mu_k \right)$.
4. Take $\hat\mu_{SL} = \sum_{k=1}^{K} \hat\lambda_k \hat\mu_k$.

In this section, we generalize this procedure to estimation of any summary of the observed data distribution given an appropriate loss for the summary of interest.

Loss and risk: setup
• Denote by $O$ the observed data unit
  – e.g. $O = (Y, X)$.
• Denote by $\mathcal{O}$ the sample space of $O$.
• Let $\mathcal{M}$ denote our statistical model.
• Denote by $P_0 \in \mathcal{M}$ the true distribution of $O$.
• Thus, we observe i.i.d. copies $O_1, \ldots, O_n \sim P_0$.
• Suppose we want to estimate a parameter $\theta : \mathcal{M} \to \Theta$.
• Denote by $\theta_0 := \theta(P_0)$ the true parameter value.

Loss and risk
• Let $L$ be a map from $\mathcal{O} \times \Theta$ to $\mathbb{R}$.
• We call $L$ a loss function for $\theta$ if it holds that
$$\theta_0 = \arg\min_{\theta \in \Theta} E_{P_0}[L(O, \theta)].$$
• $R_0(\theta) = E_{P_0}[L(O, \theta)]$ is called the oracle risk.
• These definitions of loss and risk come from the statistical learning literature (see, e.g. Vapnik, 1992, 1999, 2013) and are not to be confused with loss and risk from the decision theory literature (e.g. Ferguson, 2014).

Loss and risk: MSE example
MSE is the oracle risk corresponding to a squared-error loss function:
• $O = (Y, X)$.
• $\theta(P) = \mu(P) = \{ x \mapsto E_P[Y \mid X = x] \}$.
• $L(O, \mu) = [Y - \mu(X)]^2$ is the squared-error loss.
• $R_0(\mu) = \mathrm{MSE}(\mu) = E_{P_0}\{[Y - \mu(X)]^2\}$.

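As a check that squared-error loss meets the definition of a loss function for the conditional mean, write $\mu_0 := \mu(P_0)$ and expand the risk around $\mu_0$. This is the standard decomposition, added here as a worked step and assuming $E_{P_0}[Y^2] < \infty$:

```latex
\begin{align*}
R_0(\mu)
  &= E_{P_0}\{[Y - \mu_0(X) + \mu_0(X) - \mu(X)]^2\} \\
  &= E_{P_0}\{[Y - \mu_0(X)]^2\}
     + 2\, E_{P_0}\{[Y - \mu_0(X)]\,[\mu_0(X) - \mu(X)]\}
     + E_{P_0}\{[\mu_0(X) - \mu(X)]^2\} \\
  &= R_0(\mu_0) + E_{P_0}\{[\mu_0(X) - \mu(X)]^2\}
  \;\geq\; R_0(\mu_0).
\end{align*}
```

The cross term vanishes because $E_{P_0}[Y - \mu_0(X) \mid X] = 0$, so $\mu_0$ minimizes the oracle risk, as the definition requires.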
Estimating the oracle risk
$$\theta_0 = \arg\min_{\theta \in \Theta} R_0(\theta), \qquad R_0(\theta) = E_{P_0}[L(O, \theta)]$$
• Suppose that $\hat\theta_1, \ldots, \hat\theta_K$ are candidate estimators.
• As before, we need to estimate $R_0(\theta)$ to evaluate each $\hat\theta_k$.
• The naive estimator is $\hat{R}(\hat\theta_k) = \frac{1}{n} \sum_{i=1}^{n} L(O_i, \hat\theta_k)$.
• We instead estimate $R_0(\theta)$ using the cross-validated risk
$$\hat{R}_{CV}(\hat\theta_k) = \frac{1}{V} \sum_{v=1}^{V} \frac{1}{|\mathcal{V}_v|} \sum_{i \in \mathcal{V}_v} L(O_i, \hat\theta_{k,v}).$$

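The two risk estimators above differ only in which data are used to fit and which to evaluate. The sketch below makes that concrete with the loss passed in as a function. The names (`naive_risk`, `cv_risk`, `squared_error_loss`, `fit_mean`), the toy data, and the fixed non-random folds are assumptions made for the example.

```python
def naive_risk(fit, loss, data):
    """Naive estimator: fit on all the data, then average the loss on the same data."""
    theta_k = fit(data)
    return sum(loss(o, theta_k) for o in data) / len(data)

def cv_risk(fit, loss, data, folds):
    """Cross-validated risk: average over folds of the mean validation loss."""
    fold_risks = []
    for val in folds:
        held_out = set(val)
        train = [data[i] for i in range(len(data)) if i not in held_out]
        theta_kv = fit(train)  # theta_hat_{k,v}: fit without fold v
        fold_risks.append(sum(loss(data[i], theta_kv) for i in val) / len(val))
    return sum(fold_risks) / len(folds)

def squared_error_loss(o, mu):
    """L(O, mu) = [Y - mu(X)]^2 with O = (Y, X)."""
    y, x = o
    return (y - mu(x)) ** 2

def fit_mean(train):
    """Toy candidate estimator (hypothetical): constant prediction at the mean of Y."""
    m = sum(y for y, _ in train) / len(train)
    return lambda x: m

# Toy observations O_i = (Y_i, X_i), made up for illustration.
data = [(float(i % 3), float(i)) for i in range(12)]
# Fixed folds for reproducibility; in practice the split would be random.
folds = [list(range(v, 12, 4)) for v in range(4)]
risk_naive = naive_risk(fit_mean, squared_error_loss, data)
risk_cv = cv_risk(fit_mean, squared_error_loss, data, folds)
```

Swapping in a different `loss` (for a different parameter $\theta$) requires no change to `cv_risk`, which is the point of the generalization.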
Super Learner: general steps
Using this framework, we can generalize the SL recipe: