Introduction to Super Learning


Ted Westling, PhD
Postdoctoral Researcher, Center for Causal Inference
Perelman School of Medicine, University of Pennsylvania
September 25, 2018

Learning Goals: Conceptual understanding of Super ...


Cross-validation

1. Split the data into V "folds" of size roughly n/V.
2. For each fold v = 1, ..., V:
   • the data in folds other than v is called the training set;
   • the data in fold v is called the test/validation set;
   • we obtain $\hat\mu_{k,v}$ using the training set;
   • we obtain $\hat\mu_{k,v}(X_i)$ for $X_i$ in the validation set $\mathcal{V}_v$.
3. Our cross-validated MSE is
   $$\widehat{\mathrm{MSE}}_{\mathrm{CV}}(\hat\mu_k) = \frac{1}{V} \sum_{v=1}^{V} \frac{1}{|\mathcal{V}_v|} \sum_{i \in \mathcal{V}_v} \left[ Y_i - \hat\mu_{k,v}(X_i) \right]^2.$$

We average the MSEs of the V validation sets.

CV preds.

[Figure: schematic of 10-fold cross-validation, showing the observations split into folds 1-10. Gray: training sets. Yellow: validation sets.]
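The out-of-fold predictions depicted in the schematic can be computed directly. Below is a minimal Python sketch, assuming scikit-learn-style candidates with fit/predict methods and array-valued X, y; the helper names and the example library are placeholders, not part of the slides.

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import KFold

def cv_predictions(X, y, learners, V=10, seed=1):
    """Out-of-fold predictions Z[i, k] = mu_hat_{k,v(i)}(X_i) for each candidate k."""
    Z = np.empty((len(y), len(learners)))
    kf = KFold(n_splits=V, shuffle=True, random_state=seed)
    for train_idx, val_idx in kf.split(X):
        for k, learner in enumerate(learners):
            fit = clone(learner).fit(X[train_idx], y[train_idx])  # fit on the training folds
            Z[val_idx, k] = fit.predict(X[val_idx])               # predict on the held-out fold
    return Z

def cv_mse(y, Z):
    """Cross-validated MSE of each candidate; with (near-)equal fold sizes this matches
    the average over folds of within-fold MSEs in the slide formula."""
    return np.mean((y[:, None] - Z) ** 2, axis=0)

# Example usage (placeholder library of two candidates):
# from sklearn.linear_model import LinearRegression
# from sklearn.ensemble import RandomForestRegressor
# learners = [LinearRegression(), RandomForestRegressor(n_estimators=200)]
# Z = cv_predictions(X, y, learners, V=10)
# print(cv_mse(y, Z))
```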

How do we choose V?

• Large V:
  – more training data, so better for small n
  – more computation time
  – well-suited to high-dimensional covariates
  – well-suited to complicated or non-smooth $\mu$
• Small V:
  – more test data
  – less computation time.

(People typically use V = 5 or V = 10.)

"Discrete" Super Learner

• At this point, we have cross-validated MSE estimates $\widehat{\mathrm{MSE}}_{\mathrm{CV}}(\hat\mu_1), \ldots, \widehat{\mathrm{MSE}}_{\mathrm{CV}}(\hat\mu_K)$ for each of our candidate algorithms.
• We could simply take as our estimator the $\hat\mu_k$ minimizing these cross-validated MSEs.
• We call this the "discrete Super Learner".
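Continuing the earlier sketch (same hypothetical Z, y, X, and learner list), the discrete Super Learner is simply the candidate with the smallest cross-validated MSE, refit on the full data:

```python
# Discrete Super Learner: refit the single best candidate on all the data.
# (Z, y, X, learners, cv_mse, clone come from the cv_predictions sketch above.)
best_k = int(np.argmin(cv_mse(y, Z)))
discrete_sl = clone(learners[best_k]).fit(X, y)
```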

Super Learner

• Let $\lambda = (\lambda_1, \ldots, \lambda_K)$ be an element of $\mathcal{S}_K$, the K-dimensional simplex: each $\lambda_k \in [0, 1]$ and $\sum_k \lambda_k = 1$.
• Super Learner considers as its set of candidate algorithms all convex combinations $\hat\mu_\lambda := \sum_{k=1}^{K} \lambda_k \hat\mu_k$.
• The Super Learner is $\hat\mu_{\hat\lambda}$, where
  $$\hat\lambda := \arg\min_{\lambda \in \mathcal{S}_K} \widehat{\mathrm{MSE}}_{\mathrm{CV}}\left( \sum_{k=1}^{K} \lambda_k \hat\mu_k \right).$$
  (We use constrained optimization to compute the argmin.)

Super Learner

$$\hat\lambda := \arg\min_{\lambda \in \mathcal{S}_K} \widehat{\mathrm{MSE}}_{\mathrm{CV}}\left( \sum_{k=1}^{K} \lambda_k \hat\mu_k \right),$$
where
$$\widehat{\mathrm{MSE}}_{\mathrm{CV}}\left( \sum_{k=1}^{K} \lambda_k \hat\mu_k \right) = \frac{1}{V} \sum_{v=1}^{V} \frac{1}{|\mathcal{V}_v|} \sum_{i \in \mathcal{V}_v} \left[ Y_i - \sum_{k=1}^{K} \lambda_k \hat\mu_{k,v}(X_i) \right]^2.$$
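The simplex-constrained minimization above is a small quadratic program in $\lambda$. A minimal sketch using scipy (the choice of the SLSQP solver and the reuse of the out-of-fold matrix Z from the earlier sketch are mine, not part of the slides):

```python
import numpy as np
from scipy.optimize import minimize

def sl_weights(y, Z):
    """Minimize the CV MSE of the convex combination Z @ lam over the K-simplex."""
    K = Z.shape[1]
    objective = lambda lam: np.mean((y - Z @ lam) ** 2)
    constraints = ({"type": "eq", "fun": lambda lam: np.sum(lam) - 1.0},)  # weights sum to 1
    bounds = [(0.0, 1.0)] * K                                              # each weight in [0, 1]
    res = minimize(objective, x0=np.full(K, 1.0 / K), method="SLSQP",
                   bounds=bounds, constraints=constraints)
    return res.x

# lam_hat = sl_weights(y, Z)   # Z from the cv_predictions sketch above
```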

Super Learner: steps

Putting it all together:
1. Define a library of candidate algorithms $\hat\mu_1, \ldots, \hat\mu_K$.
2. Obtain the CV predictions $\hat\mu_{k,v}(X_i)$ for all k, v and $i \in \mathcal{V}_v$.
3. Use constrained optimization to compute the SL weights $\hat\lambda := \arg\min_{\lambda \in \mathcal{S}_K} \widehat{\mathrm{MSE}}_{\mathrm{CV}}\left( \sum_{k=1}^{K} \lambda_k \hat\mu_k \right)$.
4. Take $\hat\mu_{\mathrm{SL}} = \sum_{k=1}^{K} \hat\lambda_k \hat\mu_k$.
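Putting the pieces from the earlier sketches together, a minimal end-to-end version of these four steps might look as follows (a sketch of the recipe above, not the SuperLearner R package's implementation; cv_predictions and sl_weights are the placeholder helpers defined earlier):

```python
import numpy as np
from sklearn.base import clone

def super_learner(X, y, learners, V=10):
    """Steps 1-4: CV predictions, simplex-constrained weights, full-data refits."""
    Z = cv_predictions(X, y, learners, V=V)        # step 2 (sketch above)
    lam = sl_weights(y, Z)                         # step 3 (sketch above)
    fits = [clone(l).fit(X, y) for l in learners]  # refit each candidate on all data
    def predict(X_new):                            # step 4: weighted combination
        preds = np.column_stack([f.predict(X_new) for f in fits])
        return preds @ lam
    return predict, lam

# learners = [LinearRegression(), RandomForestRegressor(n_estimators=200)]  # step 1
# sl_predict, lam_hat = super_learner(X, y, learners, V=10)
```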

II. Lab 1: Vanilla SL for a continuous outcome

III. Into the weeds: a mathematical presentation of SL

Review

Recall the construction of SL for a continuous outcome:
1. Define a library of candidate algorithms $\hat\mu_1, \ldots, \hat\mu_K$.
2. Obtain the CV predictions $\hat\mu_{k,v}(X_i)$ for all k, v and $i \in \mathcal{V}_v$.
3. Use constrained optimization to compute the SL weights $\hat\lambda := \arg\min_{\lambda \in \mathcal{S}_K} \widehat{\mathrm{MSE}}_{\mathrm{CV}}\left( \sum_{k=1}^{K} \lambda_k \hat\mu_k \right)$.
4. Take $\hat\mu_{\mathrm{SL}} = \sum_{k=1}^{K} \hat\lambda_k \hat\mu_k$.

In this section, we generalize this procedure to estimation of any summary of the observed data distribution, given an appropriate loss for the summary of interest.

Loss and risk: setup

• Denote by O the observed data unit, e.g. O = (Y, X).
• Denote by $\mathcal{O}$ the sample space of O.
• Let $\mathcal{M}$ denote our statistical model.
• Denote by $P_0 \in \mathcal{M}$ the true distribution of O.
• Thus, we observe i.i.d. copies $O_1, \ldots, O_n \sim P_0$.
• Suppose we want to estimate a parameter $\theta : \mathcal{M} \to \Theta$.
• Denote by $\theta_0 := \theta(P_0)$ the true parameter value.

Loss and risk

• Let L be a map from $\mathcal{O} \times \Theta$ to $\mathbb{R}$.
• We call L a loss function for $\theta$ if it holds that
  $$\theta_0 = \arg\min_{\theta \in \Theta} E_{P_0}[L(O, \theta)].$$
• $R_0(\theta) = E_{P_0}[L(O, \theta)]$ is called the oracle risk.
• These definitions of loss and risk come from the statistical learning literature (see, e.g., Vapnik, 1992, 1999, 2013) and are not to be confused with loss and risk from the decision theory literature (e.g., Ferguson, 2014).

Loss and risk: MSE example

MSE is the oracle risk corresponding to the squared-error loss function:
• O = (Y, X).
• $\theta(P) = \mu(P) = \{x \mapsto E_P[Y \mid X = x]\}$.
• $L(O, \mu) = [Y - \mu(X)]^2$ is the squared-error loss.
• $R_0(\mu) = \mathrm{MSE}(\mu) = E_{P_0}[Y - \mu(X)]^2$.
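To see why squared error is a valid loss for the conditional mean (a standard argument, not spelled out on the slide), write $\mu_0(x) = E_{P_0}[Y \mid X = x]$ and expand the risk of any candidate $\mu$:
$$E_{P_0}[Y - \mu(X)]^2 = E_{P_0}[Y - \mu_0(X)]^2 + E_{P_0}[\mu_0(X) - \mu(X)]^2 \;\ge\; E_{P_0}[Y - \mu_0(X)]^2,$$
since the cross term $2\,E_{P_0}\{(Y - \mu_0(X))(\mu_0(X) - \mu(X))\}$ vanishes by the tower property ($E_{P_0}[Y - \mu_0(X) \mid X] = 0$). Equality holds only when $\mu(X) = \mu_0(X)$ $P_0$-almost surely, so $\mu_0 = \arg\min_\mu R_0(\mu)$.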

Estimating the oracle risk

$$\theta_0 = \arg\min_{\theta \in \Theta} R_0(\theta), \qquad R_0(\theta) = E_{P_0}[L(O, \theta)].$$

• Suppose that $\hat\theta_1, \ldots, \hat\theta_K$ are candidate estimators.
• As before, we need to estimate $R_0(\theta)$ to evaluate each $\hat\theta_k$.
• The naive estimator is $\widehat{R}(\hat\theta_k) = \frac{1}{n} \sum_{i=1}^{n} L(O_i, \hat\theta_k)$.
• We instead estimate $R_0(\theta)$ using the cross-validated risk
  $$\widehat{R}_{\mathrm{CV}}(\hat\theta_k) = \frac{1}{V} \sum_{v=1}^{V} \frac{1}{|\mathcal{V}_v|} \sum_{i \in \mathcal{V}_v} L(O_i, \hat\theta_{k,v}).$$
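Relative to the continuous-outcome sketches, the only change is the loss plugged into the fold averages. A minimal sketch of the cross-validated risk for an arbitrary loss; the helper cv_risk, its fold_ids argument (the fold label of each observation), and the binary-outcome log-loss example are illustrations of mine, not notation from the slides:

```python
import numpy as np

def cv_risk(loss, y, Z, fold_ids):
    """Cross-validated risk: average over folds of the within-fold mean loss,
    where Z[i, k] holds the out-of-fold prediction of candidate k for observation i."""
    folds = np.unique(fold_ids)
    per_fold = np.array([[np.mean(loss(y[fold_ids == v], Z[fold_ids == v, k]))
                          for k in range(Z.shape[1])] for v in folds])
    return per_fold.mean(axis=0)  # one cross-validated risk per candidate

# Squared-error loss recovers the CV MSE used earlier:
squared_error = lambda y, pred: (y - pred) ** 2

# For a binary outcome with predicted probabilities p in (0, 1), the negative
# log-likelihood (log-loss) is another valid loss for the regression function:
log_loss = lambda y, p: -(y * np.log(p) + (1 - y) * np.log(1 - p))
```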

Super Learner: general steps

Using this framework, we can generalize the SL recipe:
