 
              MODEL SELECTION AND REGULARISATION
MODEL SELECTION ESTIMATING THE ACCURACY OF THE MODEL
▸ We train the model on the available data, and then we make predictions on unseen examples.
▸ Why do we need to estimate the accuracy of our model? (Instead of accuracy, let's say figure of merit, FoM: accuracy, log loss, correlation, AUC, etc.)
  ▸ To know what to expect (should we risk our money on our stock market predictions?)
  ▸ To select the most accurate model!
▸ Model selection
  ▸ The type of the model (kNN, linear, trees, SVM, NN, etc.)
  ▸ The best combination of input variables or engineered features
MODEL SELECTION ESTIMATING THE ACCURACY OF THE MODEL
▸ Usually we cannot estimate the FoM on the training data!
▸ Every model is able to learn the noise to some extent. How much worse will it be on unseen data?
▸ We cannot really tell whether one model is more accurate than another based on the training error alone.
▸ Maybe it learned something general that will work on unseen data points, or maybe it simply memorised more of the noise.
▸ There are solutions for simple models (AIC, BIC), but not for more complex ones.
▸ We need a general framework for FoM estimation and model selection.
MODEL SELECTION TRAIN - VALIDATION SPLIT
▸ Cut the training data into 2 parts, training and validation.
▸ Not ideal: we use only a subset for validation, so our estimate of the FoM will not be the most precise.
▸ The sizes of the training and validation sets must be balanced; usually 10, 20 or 30% of the data is used for validation.
▸ Typically used when datasets are huge and training is very expensive: this is the standard in image recognition.
▸ Note that in small sets the accuracy may vary widely among different splits.
▸ Note: we split the training data! Competitions use held-out test data to ensure fair evaluation, but that is a different setup.
[Diagram: observations 1, 2, 3, …, n split into a training part and a validation part]
[Figure: miles per gallon vs. horsepower, with linear, degree 2 and degree 5 fits]
[Figure, two panels: validation mean squared error vs. degree of polynomial]
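The split described on this slide can be sketched in a few lines. This is a minimal illustration, not the course's own code: the data are synthetic (a made-up quadratic trend plus noise, not the horsepower data from the figure), and the helper name `train_validation_mse` and its parameters are assumptions chosen for the example.

```python
import numpy as np

def train_validation_mse(x, y, degree, val_fraction=0.2, seed=0):
    """Fit a polynomial on the training part, return the MSE on the validation part."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))            # random order of the data points
    n_val = int(round(val_fraction * len(x)))
    val, train = idx[:n_val], idx[n_val:]    # first part validates, the rest trains
    coeffs = np.polyfit(x[train], y[train], degree)
    pred = np.polyval(coeffs, x[val])
    return float(np.mean((y[val] - pred) ** 2))

# Synthetic data, made up for illustration: quadratic trend plus Gaussian noise.
rng = np.random.default_rng(42)
x = rng.uniform(-2, 2, size=200)
y = 1.0 + 0.5 * x + 2.0 * x**2 + rng.normal(scale=0.5, size=200)

for degree in (1, 2, 5):
    # Different seeds give different splits, hence different FoM estimates.
    scores = [train_validation_mse(x, y, degree, seed=s) for s in range(5)]
    print(degree, np.round(scores, 3))
```

Running the loop with several seeds shows the point made above: the estimate depends on which points happen to land in the validation set.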
MODEL SELECTION LEAVE ONE OUT CROSS VALIDATION
▸ With N data points, cut the training data into 2 parts, training and validation, N times. Each point is used for validation exactly once.
▸ The FoM is estimated as the mean over the held-out examples. This is trivial for metrics that are defined per example (MSE, accuracy).
▸ Not so trivial for ranking or correlation metrics! One can compute them on the LOOCV predictions for the full dataset, but then predictions from different models are mixed.
▸ Every data point is used for the estimate, so the estimate of the accuracy is as accurate as possible.
▸ It can take forever: the model has to be trained N times, which is practically impossible for most datasets.
▸ Note: there is no random process!
▸ Magic formula (only) for linear regression! With n = N data points, a single fit on the full data is enough:
$$ \mathrm{CV}_{(n)} = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat{y}_i}{1 - h_i} \right)^2 $$
  Here $\hat{y}_i$ is the prediction of the model fitted on all n points and $h_i$ is the leverage of point i.
[Diagram: observations 1, 2, 3, …, n repeated for each split, with a different single observation held out each time]
[Figure: LOOCV mean squared error vs. degree of polynomial]
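The shortcut for linear regression can be checked numerically against the brute-force procedure. The sketch below assumes ordinary least squares with an intercept column and uses made-up synthetic data; the function names are illustrative only.

```python
import numpy as np

def loocv_mse_brute_force(X, y):
    """Refit the model N times, each time leaving one point out."""
    n = len(y)
    errors = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        errors[i] = y[i] - X[i] @ beta       # error on the held-out point
    return float(np.mean(errors ** 2))

def loocv_mse_shortcut(X, y):
    """One fit on all points, residuals rescaled by 1 - leverage."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    hat_matrix = X @ np.linalg.pinv(X)       # H = X (X^T X)^-1 X^T for full-rank X
    leverage = np.diag(hat_matrix)           # h_i in the formula
    return float(np.mean((residuals / (1.0 - leverage)) ** 2))

# Synthetic data, made up for illustration.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 3.0 - 0.7 * x + rng.normal(scale=1.0, size=50)
X = np.column_stack([np.ones_like(x), x])    # intercept column + predictor

print(loocv_mse_brute_force(X, y), loocv_mse_shortcut(X, y))
```

Both functions should print the same value up to floating-point error, while the second one fits the model only once.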
MODEL SELECTION K-FOLD CROSS VALIDATION
▸ With N data points, cut the training data into K parts, use K-1 parts for training and 1 for testing, K times. K is usually 3, 5 or 10.
▸ The FoM is estimated as the mean accuracy over the K validation sets. This works for ranking and correlation metrics too, because each validation set has a large number of points.
▸ With K = 5, 80% of the data points are used in each training run: the model will be close to as good as it can be.
▸ Each data point is used for validation, so the estimate of the FoM will be good.
▸ Usually feasible even for large sets, since the model needs to be trained only 5-10 times. One exception is image recognition, where it is too expensive to train even 5 times (a week vs. a month).
▸ Note: the splitting is a random process, but the variation is not as wild as in a train-validation split, because in the end the same data points are used.
▸ THE STANDARD
[Diagram: observations 1, 2, 3, …, n shuffled and split into 5 folds, a different fold held out each time]
[Figure: 10-fold CV mean squared error vs. degree of polynomial]
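A minimal K-fold loop for the same polynomial-degree question might look as follows. The data are synthetic and the helper `k_fold_mse` is a hypothetical name introduced for this sketch; a library implementation would normally be used in practice.

```python
import numpy as np

def k_fold_mse(x, y, degree, k=5, seed=0):
    """Mean validation MSE of a polynomial fit over K folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)   # K disjoint index sets
    fold_mse = []
    for i in range(k):
        val = folds[i]                                   # 1 part for validation
        train = np.concatenate([folds[j] for j in range(k) if j != i])  # K-1 parts
        coeffs = np.polyfit(x[train], y[train], degree)
        pred = np.polyval(coeffs, x[val])
        fold_mse.append(np.mean((y[val] - pred) ** 2))
    return float(np.mean(fold_mse))

# Synthetic data, made up for illustration: quadratic trend plus Gaussian noise.
rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, size=200)
y = 1.0 + 0.5 * x + 2.0 * x**2 + rng.normal(scale=0.5, size=200)

cv_error = {degree: k_fold_mse(x, y, degree) for degree in range(1, 7)}
best_degree = min(cv_error, key=cv_error.get)
print(cv_error, best_degree)
```

Every point is held out exactly once, so all of the data contributes to the estimate even though each individual model sees only (K-1)/K of it.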
MODEL SELECTION BOOTSTRAP
▸ We estimate parameters from data points (samples).
▸ These samples are random examples from a true population. Parameters inferred from these data points will be somewhat different from the true relationship.
▸ We need to estimate the uncertainties of the estimates.
▸ E.g.: linear regression coefficients. But generally (e.g. for AUC) we do not have formulas like in the case of linear regression, so we need a more general approach.
▸ We cannot generate new data points, but we can resample!
▸ Select N data points from the N data points with replacement.
▸ Estimate the parameter on each resampled dataset.
▸ Calculate the standard error (or quantiles) of those estimates:
$$ \mathrm{SE}_B(\hat{\alpha}) = \sqrt{ \frac{1}{B-1} \sum_{r=1}^{B} \left( \hat{\alpha}^{*r} - \frac{1}{B} \sum_{r'=1}^{B} \hat{\alpha}^{*r'} \right)^2 } $$
▸ Note the replacement, and keep it in mind.
[Figure: original data Z with (Obs, X, Y) = (1, 4.3, 2.4), (2, 2.1, 1.1), (3, 5.3, 2.8); bootstrap samples Z*1 = observations 3, 1, 3; Z*2 = observations 2, 3, 1; …; Z*B = observations 2, 2, 1; each sample gives its own estimate $\hat{\alpha}^{*1}, \hat{\alpha}^{*2}, \dots, \hat{\alpha}^{*B}$]
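The resampling recipe on this slide can be sketched as below for the slope of a straight-line fit. The data are synthetic, and `bootstrap_se` and `slope` are illustrative names, not part of any library; any other estimator (for example an AUC function) could be passed in the same way.

```python
import numpy as np

def bootstrap_se(x, y, estimator, n_boot=1000, seed=0):
    """Bootstrap standard error of estimator(x, y), plus the raw estimates."""
    rng = np.random.default_rng(seed)
    n = len(x)
    estimates = np.empty(n_boot)
    for r in range(n_boot):
        idx = rng.integers(0, n, size=n)     # N draws from N points, with replacement
        estimates[r] = estimator(x[idx], y[idx])
    # ddof=1 gives the 1 / (B - 1) normalisation of the SE_B formula
    return float(np.std(estimates, ddof=1)), estimates

def slope(x, y):
    """Slope of a least-squares straight line."""
    return np.polyfit(x, y, 1)[0]

# Synthetic data, made up for illustration.
rng = np.random.default_rng(7)
x = rng.uniform(0, 10, size=100)
y = 1.0 + 2.0 * x + rng.normal(scale=1.0, size=100)

se, estimates = bootstrap_se(x, y, slope)
low, high = np.percentile(estimates, [2.5, 97.5])   # quantile-based interval
print(se, low, high)
```

Because the sampling is with replacement, each resampled dataset contains some points several times and omits others, which is what makes the estimates differ from one resample to the next.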
LINEAR MODEL SELECTION AND REGULARISATION MODIFYING THE SOLUTION OF LINEAR REGRESSION
▸ Before: least squares (LS) fit. Why change it?
  ▸ LS is the most accurate on the training data, but that is not our goal: our goal is accuracy on unseen new data.
  ▸ LS uses all predictors, and with many predictors the model is hard to interpret. Can we select a good model with only a few predictors?
▸ Pursuing test accuracy:
  ▸ LS works well when the number of data points (N) is multiple orders of magnitude larger than the number of predictors (p).
  ▸ When N is not much larger than p (e.g. 100 vs. 10), we might fit the noise to some extent; in other words, the coefficients will have large variance.
  ▸ When N < p, there is no unique LS fit.
LINEAR MODEL SELECTION AND REGULARISATION SUBSET SELECTION
▸ Fewer variables:
  ▸ When the number of data points is low, a simpler model will fit the noise less, and it will be more accurate on unseen data.
  ▸ It is also easier to interpret the model.
▸ Brute force: best subset selection
  ▸ With p predictors there are 2^p possible subsets. How to identify the best? Start from 0 variables and move to 1, 2, …
  ▸ Select the best model for each number of variables. Comparing models with the same number of variables on training error is OK.
  ▸ Compare the best models with different numbers of variables.
  ▸ This is tricky: the training error will always be smaller when using more variables. Use cross-validation error, AIC, BIC or adjusted R² instead.
▸ Computational problem: with 10 predictors there are about 1000 fits (2^10 = 1024), with 20 predictors about 1 million fits…
[Figure: R² vs. number of predictors]
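The brute-force search can be written directly with `itertools.combinations`. This is a sketch on made-up synthetic data in which only columns 0 and 3 carry signal; `rss` and `best_subset_per_size` are hypothetical helper names.

```python
import itertools
import numpy as np

def rss(X, y, cols):
    """Training residual sum of squares of an LS fit on the chosen columns plus intercept."""
    A = np.column_stack([np.ones(len(y))] + [X[:, c] for c in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.sum((y - A @ beta) ** 2))

def best_subset_per_size(X, y):
    """For each model size, the subset with the lowest training RSS."""
    p = X.shape[1]
    best = {}
    for size in range(p + 1):
        candidates = itertools.combinations(range(p), size)
        # Same size, so comparing on training error is fair.
        best[size] = min(candidates, key=lambda cols: rss(X, y, cols))
    return best

# Synthetic data, made up for illustration: only columns 0 and 3 matter.
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=1.0, size=100)

best = best_subset_per_size(X, y)            # 2**6 = 64 fits in total
for size, cols in best.items():
    print(size, cols, round(rss(X, y, cols), 2))
```

The printed training RSS never increases with model size, which is why the final comparison across sizes needs a cross-validation error or a penalised criterion rather than the training error.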
LINEAR MODEL SELECTION AND REGULARISATION SUBSET SELECTION
▸ Fast approach: stepwise greedy selection
  ▸ Forward: start with 0 variables, iteratively add the one which improves the training error the most (stop after a while).
  ▸ Backward: start with all variables, iteratively remove the one which degrades the training error the least.
▸ Compare the best models with different numbers of variables using cross-validation error, AIC, BIC or adjusted R²:
$$ C_p = \frac{1}{n} \left( \mathrm{RSS} + 2 d \hat{\sigma}^2 \right) $$
$$ \mathrm{AIC} = \frac{1}{n \hat{\sigma}^2} \left( \mathrm{RSS} + 2 d \hat{\sigma}^2 \right) $$
$$ \mathrm{BIC} = \frac{1}{n \hat{\sigma}^2} \left( \mathrm{RSS} + \log(n) \, d \hat{\sigma}^2 \right) $$
$$ \text{Adjusted } R^2 = 1 - \frac{\mathrm{RSS} / (n - d - 1)}{\mathrm{TSS} / (n - 1)} $$
  Here d is the number of predictors in the model and $\hat{\sigma}^2$ is an estimate of the noise variance.
▸ Note: use a different metric for selecting the next variable and for choosing the best number of variables, to avoid overfitting!
[Figure, three panels: square root of BIC, validation set error and cross-validation error vs. number of predictors]
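Forward selection and the four criteria can be combined as in the sketch below. It follows the note on the slide: the next variable is chosen by training RSS, the model size by BIC. The data are synthetic, the function names are illustrative, and estimating the noise variance from the full model's residuals is an assumption of this example rather than something the slide specifies.

```python
import numpy as np

def fit_rss(X, y, cols):
    """Training RSS of a least-squares fit on the chosen columns plus intercept."""
    A = np.column_stack([np.ones(len(y))] + [X[:, c] for c in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.sum((y - A @ beta) ** 2))

def forward_stepwise(X, y):
    """Greedy path: list of (selected columns, training RSS), one entry per size."""
    remaining = list(range(X.shape[1]))
    selected = []
    path = [((), fit_rss(X, y, []))]
    while remaining:
        # Add the variable that lowers the training RSS the most.
        best = min(remaining, key=lambda c: fit_rss(X, y, selected + [c]))
        selected.append(best)
        remaining.remove(best)
        path.append((tuple(selected), fit_rss(X, y, selected)))
    return path

def criteria(rss, d, n, sigma2, tss):
    """C_p, AIC, BIC and adjusted R^2 as defined on the slide."""
    cp = (rss + 2 * d * sigma2) / n
    aic = (rss + 2 * d * sigma2) / (n * sigma2)
    bic = (rss + np.log(n) * d * sigma2) / (n * sigma2)
    adj_r2 = 1 - (rss / (n - d - 1)) / (tss / (n - 1))
    return cp, aic, bic, adj_r2

# Synthetic data, made up for illustration: only columns 1, 4 and 6 matter.
rng = np.random.default_rng(5)
X = rng.normal(size=(200, 8))
y = 3.0 * X[:, 1] - 2.0 * X[:, 4] + 1.5 * X[:, 6] + rng.normal(size=200)

n, p = X.shape
path = forward_stepwise(X, y)
tss = path[0][1]                      # RSS of the intercept-only model is the TSS
sigma2 = path[-1][1] / (n - p - 1)    # assumption: noise variance from the full model
bic_by_size = [criteria(rss, len(cols), n, sigma2, tss)[2] for cols, rss in path]
best_size = int(np.argmin(bic_by_size))
print(path[best_size][0], bic_by_size[best_size])
```

Forward selection needs on the order of p² fits instead of 2^p, at the price of possibly missing the best subset of a given size.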
SHRINKAGE SHRINKAGE (RIDGE, LASSO)
▸ Shrinkage: a simpler model has smaller coefficients. Instead of removing coefficients outright (setting coefficients to zero is subset selection), we penalise large coefficients.
▸ Ridge regression, L2 penalty (weight decay):
$$ \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 = \mathrm{RSS} + \lambda \sum_{j=1}^{p} \beta_j^2 $$
▸ Lasso regression, L1 penalty. Sparse regression: it often results in coefficients that are exactly 0.
$$ \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j| = \mathrm{RSS} + \lambda \sum_{j=1}^{p} |\beta_j| $$
▸ Elastic net: both penalties.
▸ They both increase the training error! (But we hope to improve generalisation by creating a simpler model.)
▸ Essential for underdetermined problems! (SVD would not fail, but it is not ideal.)
▸ Note: the scale of the predictive variables did not matter before! Now it does!
[Figure, two panels: standardized coefficients vs. λ, with curves labelled Income, Limit, Rating and Student]
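The two penalised objectives can be minimised with very little code. The sketch below uses the closed-form solution for ridge and plain coordinate descent with soft thresholding for the lasso; both are common textbook choices, not necessarily what the course used. It assumes standardised predictors and a centred response, so the unpenalised intercept drops out, and the data are synthetic.

```python
import numpy as np

def standardise(X):
    """Zero mean and unit variance per column, since the penalty depends on scale."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def ridge(X, y, lam):
    """Minimise RSS + lam * sum(beta_j^2) in closed form."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold, returning 0 inside the band."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)

def lasso(X, y, lam, n_iter=200):
    """Minimise RSS + lam * sum(|beta_j|) by cyclic coordinate descent."""
    beta = np.zeros(X.shape[1])
    col_sq = np.sum(X ** 2, axis=0)
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            # Residual with the contribution of predictor j added back.
            partial_residual = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ partial_residual
            beta[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]
    return beta

# Synthetic data, made up for illustration: only columns 0 and 3 matter.
rng = np.random.default_rng(11)
X_raw = rng.normal(size=(100, 5))
y_raw = 3.0 * X_raw[:, 0] - 2.0 * X_raw[:, 3] + rng.normal(size=100)
Xs, yc = standardise(X_raw), y_raw - y_raw.mean()

beta_ols = ridge(Xs, yc, 0.0)        # lam = 0 is ordinary least squares
beta_ridge = ridge(Xs, yc, 100.0)    # all coefficients shrunk, none exactly zero
beta_lasso = lasso(Xs, yc, 100.0)    # some coefficients exactly zero
print(np.round(beta_ols, 3), np.round(beta_ridge, 3), np.round(beta_lasso, 3))
```

The value λ = 100 is arbitrary here; in practice λ is chosen with the cross-validation machinery from the earlier slides.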