Cross Validation & Ensembling
Shan-Hung Wu (shwu@cs.nthu.edu.tw)
Department of Computer Science, National Tsing Hua University, Taiwan
Machine Learning
Outline
1 Cross Validation
  How Many Folds?
2 Ensemble Methods
  Voting
  Bagging
  Boosting
  Why AdaBoost Works?
Cross Validation

So far, we have used the holdout method for:
  Hyperparameter tuning: validation set
  Performance reporting: testing set
What if we get an "unfortunate" split?

K-fold cross validation:
1 Split the data set $\mathbb{X}$ evenly into $K$ subsets $\mathbb{X}^{(i)}$ (called folds)
2 For $i = 1, \cdots, K$, train $f_N^{(i)}$ using all data but the $i$-th fold ($\mathbb{X} \setminus \mathbb{X}^{(i)}$)
3 Report the cross-validation error $C_{CV}$ by averaging the testing errors $C[f_N^{(i)}]$'s on $\mathbb{X}^{(i)}$, as sketched below
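A minimal sketch of the procedure, assuming scikit-learn is available and using squared error as the cost $C[\cdot]$; the function name cv_error and the toy data are our own choices, not fixed by the slides:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

def cv_error(model, X, y, K=5, seed=0):
    """Return C_CV: the average held-out error over K folds."""
    errors = []
    for train_idx, test_idx in KFold(n_splits=K, shuffle=True,
                                     random_state=seed).split(X):
        # Train f_N^(i) on all data but the i-th fold (X \ X^(i)).
        model.fit(X[train_idx], y[train_idx])
        # Evaluate C[f_N^(i)] on the held-out fold X^(i).
        y_pred = model.predict(X[test_idx])
        errors.append(mean_squared_error(y[test_idx], y_pred))
    return np.mean(errors)

# Usage: C_CV of a linear model on a toy regression dataset.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (100, 1))
y = 2 * X[:, 0] + rng.normal(0, 0.1, 100)
print(cv_error(LinearRegression(), X, y, K=5))
```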
Nested Cross Validation

Cross validation (CV) can be applied to both hyperparameter tuning and performance reporting. E.g., 5×2 nested CV:
1 Inner loop (2 folds): select the hyperparameters giving the lowest $C_{CV}$
  Can be wrapped by grid search
2 Train the final model using both the training and validation sets with the selected hyperparameters
3 Outer loop (5 folds): report $C_{CV}$ as the test error
(See the sketch after this list.)
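A sketch of 5×2 nested CV with scikit-learn; the SVC model, its parameter grid, and the synthetic data are placeholder assumptions of ours:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)

# Inner loop (2 folds): grid search picks the hyperparameters with the
# lowest C_CV, then (refit=True, the default) retrains the final model on
# the full inner data, i.e., training + validation folds combined.
inner = GridSearchCV(SVC(), param_grid={"C": [0.1, 1.0, 10.0]},
                     cv=KFold(n_splits=2, shuffle=True, random_state=0))

# Outer loop (5 folds): each outer test fold scores a model whose
# hyperparameters were tuned without ever seeing that fold; the average
# is reported as the test error.
scores = cross_val_score(inner, X, y,
                         cv=KFold(n_splits=5, shuffle=True, random_state=1))
print("nested CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```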
How Many Folds K? (I)

The cross-validation error $C_{CV}$ is an average of the $C[f_N^{(i)}]$'s.
Regard each $C[f_N^{(i)}]$ as an estimator of the expected generalization error $E_{\mathbb{X}}(C[f_N])$.
$C_{CV}$ is an estimator too, and we have
$$\mathrm{MSE}(C_{CV}) = E_{\mathbb{X}}[(C_{CV} - E_{\mathbb{X}}(C[f_N]))^2] = \mathrm{Var}_{\mathbb{X}}(C_{CV}) + \mathrm{bias}(C_{CV})^2$$
Point Estimation Revisited: Mean Square Error

Let $\hat{\theta}_n$ be an estimator of a quantity $\theta$ related to a random variable $\mathrm{x}$, mapped from $n$ i.i.d. samples of $\mathrm{x}$.
Mean square error of $\hat{\theta}_n$:
$$\mathrm{MSE}(\hat{\theta}_n) = E_{\mathbb{X}}\big[(\hat{\theta}_n - \theta)^2\big]$$
It can be decomposed into the bias and variance:
$$\begin{aligned}
E_{\mathbb{X}}\big[(\hat{\theta}_n - \theta)^2\big]
&= E\big[(\hat{\theta}_n - E[\hat{\theta}_n] + E[\hat{\theta}_n] - \theta)^2\big] \\
&= E\big[(\hat{\theta}_n - E[\hat{\theta}_n])^2 + (E[\hat{\theta}_n] - \theta)^2 + 2(\hat{\theta}_n - E[\hat{\theta}_n])(E[\hat{\theta}_n] - \theta)\big] \\
&= E\big[(\hat{\theta}_n - E[\hat{\theta}_n])^2\big] + E\big[(E[\hat{\theta}_n] - \theta)^2\big] + 2E\big[\hat{\theta}_n - E[\hat{\theta}_n]\big](E[\hat{\theta}_n] - \theta) \\
&= E\big[(\hat{\theta}_n - E[\hat{\theta}_n])^2\big] + (E[\hat{\theta}_n] - \theta)^2 + 2 \cdot 0 \cdot (E[\hat{\theta}_n] - \theta) \\
&= \mathrm{Var}_{\mathbb{X}}(\hat{\theta}_n) + \mathrm{bias}(\hat{\theta}_n)^2
\end{aligned}$$
The MSE of an unbiased estimator is its variance.
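A quick numerical check of the decomposition, using the (biased) maximum-likelihood variance estimator as $\hat{\theta}_n$; the example estimator and all constants are our assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 10, 200_000
theta = 1.0  # true variance of N(0, 1)

# The MLE of the variance divides by n, so it is biased:
# E[est] = (n-1)/n * theta.
est = np.array([np.var(rng.normal(0.0, 1.0, n)) for _ in range(trials)])

mse = np.mean((est - theta) ** 2)
var = np.var(est)
bias = np.mean(est) - theta  # approximately -theta/n
print(mse, var + bias ** 2)  # the two agree up to Monte Carlo error
```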
Example: 5-Fold vs. 10-Fold CV

$$\mathrm{MSE}(C_{CV}) = E_{\mathbb{X}}[(C_{CV} - E_{\mathbb{X}}(C[f_N]))^2] = \mathrm{Var}_{\mathbb{X}}(C_{CV}) + \mathrm{bias}(C_{CV})^2$$
Consider polynomial regression on data generated by $y = \sin(x) + \varepsilon$, $\varepsilon \sim \mathcal{N}(0, \sigma^2)$.
Let $C[\cdot]$ be the MSE of the predictions (made by a function) to the true labels.
[Figure: $E_{\mathbb{X}}(C[f_N])$ is the red line; $\mathrm{bias}(C_{CV})$ is the gap between the red line and the other solid lines ($E_{\mathbb{X}}[C_{CV}]$).]
A Monte Carlo sketch of this experiment follows.
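A sketch of the simulation behind the figure; the sample size, polynomial degree, noise level, and number of repetitions are assumptions of ours, not values from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
N, degree, sigma = 40, 3, 0.3

def sample(n):
    """Draw one dataset from y = sin(x) + eps, eps ~ N(0, sigma^2)."""
    x = rng.uniform(0.0, 2.0 * np.pi, n)
    return x, np.sin(x) + rng.normal(0.0, sigma, n)

def c_cv(x, y, K):
    """C_CV: average squared error over K folds of one dataset."""
    errs = []
    for test in np.array_split(rng.permutation(len(x)), K):
        train = np.setdiff1d(np.arange(len(x)), test)
        coefs = np.polyfit(x[train], y[train], degree)  # fit on X \ X^(i)
        errs.append(np.mean((np.polyval(coefs, x[test]) - y[test]) ** 2))
    return np.mean(errs)

# Draw many datasets to estimate E[C_CV] and Var(C_CV) for each K.
for K in (5, 10):
    scores = np.array([c_cv(*sample(N), K) for _ in range(500)])
    print("K=%2d  mean C_CV=%.4f  Var(C_CV)=%.6f"
          % (K, scores.mean(), scores.var()))
```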