Day 6: Model Selection II
Lucas Leemann
Essex Summer School, Introduction to Statistical Learning
1 Repetition Week 1
2 Regularization Approaches
  Ridge Regression
  Lasso
  Lasso vs Ridge
Repetition: Fundamental Problem
[Figure: red: test error; blue: training error (Hastie et al, 2008: 220)]
Tuesday: Linear Models
[Figure: fitted regression line with intercept $\alpha$ and slope $\beta = \Delta Y / \Delta X$; one observation is shown with $Y_i = 2.45$, fitted value $\hat{Y}_i = 1.85$, and residual $\hat{u}_i = 0.6$]
Wednesday: Classification (James et al, 2013: 140)
Thursday: Resampling (James et al, 2013: 181)
Friday: Model Selection I
Subset Selection:
1 Start with the null model, containing no explanatory variables, and call it $M_0$.
2 For $k = 1, \ldots, p$:
  i) generate all $\binom{p}{k}$ possible models with $k$ explanatory variables;
  ii) determine the model with the best criterion value (e.g. $R^2$) and call it $M_k$.
3 Determine the best model within the set $M_0, \ldots, M_p$: rely on a criterion like AIC, BIC, $R^2$, $C_p$, or use CV and estimate the test error (a short R sketch of the whole procedure follows below).
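A minimal R sketch of this procedure, assuming a hypothetical data frame df whose outcome column is y; regsubsets() from the leaps package carries out steps 1 and 2, and step 3 is then done with an information criterion.

```r
# Best subset selection (sketch). Assumes a hypothetical data frame 'df'
# with outcome 'y'; regsubsets() finds the best model of each size M_1, ..., M_p.
library(leaps)

fit <- regsubsets(y ~ ., data = df, nvmax = ncol(df) - 1)
sm  <- summary(fit)

# Step 3: pick the best model size with a criterion (BIC and C_p shown here).
which.min(sm$bic)              # model size preferred by BIC
which.min(sm$cp)               # model size preferred by Mallows' C_p
coef(fit, which.min(sm$bic))   # coefficients of the BIC-selected model
```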
Regularization Approaches
Shrinkage Methods: Ridge Regression and Lasso
• The subset selection methods use least squares to fit a linear model that contains a subset of the predictors.
• As an alternative, we can fit a model containing all p predictors using a technique that constrains or regularizes the coefficient estimates, or equivalently, that shrinks the coefficient estimates towards zero.
• It may not be immediately obvious why such a constraint should improve the fit, but it turns out that shrinking the coefficient estimates can significantly reduce their variance.
Regularization
• Recall that the least squares fitting procedure estimates $\beta_0, \beta_1, \ldots, \beta_p$ using the values that minimize
  $$\sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{J} \beta_j x_{ij} \Big)^2 = RSS$$
• In contrast, the regularization approach minimizes
  $$\sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{J} \beta_j x_{ij} \Big)^2 + \lambda f(\beta_j) = RSS + \lambda f(\beta_j),$$
  where $\lambda \ge 0$ is a tuning parameter, to be determined separately.
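To make the generic criterion concrete, here is a minimal sketch that minimizes RSS plus a penalty directly with optim(); the simulated data and the choice of a squared-coefficient penalty (the ridge case that follows) are assumptions made only for illustration.

```r
# Sketch: minimize RSS + lambda * f(beta) numerically with optim().
# The simulated data and the l2 penalty f(beta) = sum(beta^2) are illustrative.
set.seed(1)
n <- 100; p <- 5
X <- matrix(rnorm(n * p), n, p)
y <- 1 + X %*% c(2, -1, 0, 0, 0.5) + rnorm(n)

penalized_rss <- function(theta, lambda) {
  beta0 <- theta[1]
  beta  <- theta[-1]
  sum((y - beta0 - X %*% beta)^2) + lambda * sum(beta^2)  # intercept not penalized
}

fit <- optim(rep(0, p + 1), penalized_rss, lambda = 10, method = "BFGS")
round(fit$par, 3)   # slope estimates are shrunk towards zero relative to OLS
```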
Ridge Regression
• Ridge regression minimizes this expression:
  $$\underbrace{\sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{J} \beta_j x_{ij} \Big)^2}_{\text{standard OLS criterion}} + \underbrace{\lambda \sum_{j=1}^{J} \beta_j^2}_{\text{penalty}}$$
• $\lambda$ is a tuning parameter, i.e. different values of $\lambda$ lead to different models and predictions.
• When $\lambda$ is very big, the estimates get pushed towards 0.
• When $\lambda$ is 0, ridge regression and OLS are identical.
• We can find an optimal value for $\lambda$ by relying on cross-validation (see the sketch below).
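A minimal glmnet sketch of ridge regression with $\lambda$ chosen by cross-validation; the predictor matrix x and response y are assumed to exist (for example built with model.matrix), and alpha = 0 selects the ridge penalty.

```r
# Ridge regression with glmnet; alpha = 0 gives the ridge (l2) penalty.
# 'x' and 'y' are assumed to exist, e.g. x <- model.matrix(y ~ ., df)[, -1].
library(glmnet)

ridge.mod <- glmnet(x, y, alpha = 0)        # fits a whole path of lambda values
cv.out    <- cv.glmnet(x, y, alpha = 0)     # 10-fold cross-validation over lambda

best.lam  <- cv.out$lambda.min              # lambda with the smallest CV error
coef(ridge.mod, s = best.lam)               # coefficients at that lambda
```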
Example: Credit data
[Figure: ridge coefficient paths for Income, Limit, Rating, and Student on the Credit data, plotted against $\lambda$ (left) and against $\|\hat{\beta}^R_\lambda\|_2 / \|\hat{\beta}\|_2$ (right), where $\|\beta\|_2 = \sqrt{\sum_{j=1}^{p} \beta_j^2}$ (James et al, 2013: 216)]
Ridge Regression: Details
• Shrinkage is not applied to the model constant $\beta_0$; the model estimate for the conditional mean should be un-shrunk.
• Ridge regression is an example of $\ell_2$ regularization:
  • $\ell_1$: $f(\beta_j) = \sum_{j=1}^{J} |\beta_j|$
  • $\ell_2$: $f(\beta_j) = \sum_{j=1}^{J} \beta_j^2$
• Predictors are standardized before fitting: $\tilde{x}_{ij} = x_{ij} \Big/ \sqrt{\tfrac{1}{n}\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2}$
Ridge regression: scaling of predictors
• The standard least squares coefficient estimates are scale equivariant: multiplying $X_j$ by a constant $c$ simply leads to a scaling of the least squares coefficient estimates by a factor of $1/c$. In other words, regardless of how the $j$th predictor is scaled, $X_j \hat{\beta}_j$ will remain the same.
• In contrast, the ridge regression coefficient estimates can change substantially when multiplying a given predictor by a constant, due to the sum-of-squared-coefficients term in the penalty part of the ridge regression objective function.
• Therefore, it is best to apply ridge regression after standardizing the predictors, using the formula
  $$\tilde{x}_{ij} = \frac{x_{ij}}{\sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2}}$$
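A small sketch of this standardization in R, dividing each predictor by its 1/n standard deviation as in the formula above; note that glmnet() already performs this standardization internally by default (standardize = TRUE), so doing it by hand is rarely necessary.

```r
# Standardize each column of the predictor matrix 'x' (assumed from above)
# by its 1/n standard deviation, matching the formula on the slide.
standardize <- function(X) {
  apply(X, 2, function(xj) xj / sqrt(mean((xj - mean(xj))^2)))
}
x.std <- standardize(x)
# glmnet() does this internally by default (standardize = TRUE).
```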
Why Does Ridge Regression Improve Over Least Squares?
[Figure: squared bias (black), variance (green), and test MSE (purple) for ridge regression, plotted against $\lambda$ (left) and against $\|\hat{\beta}^R_\lambda\|_2 / \|\hat{\beta}\|_2$ (right) (James et al, 2013: 218)]
• Simulated data with n = 50 observations and p = 45 predictors, all having nonzero coefficients.
• Squared bias (black), variance (green), and test mean squared error (purple).
• The purple crosses indicate the ridge regression models for which the MSE is smallest.
• OLS with p variables has low bias but high variance; shrinkage lowers the variance at the price of some bias (see the simulation sketch below).
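A small simulation sketch in the spirit of this figure, comparing the test MSE of OLS with that of cross-validated ridge when n = 50 and p = 45; the data-generating process (coefficient values, noise level) is an assumption chosen purely for illustration.

```r
# Simulation sketch: OLS vs ridge test MSE with n = 50, p = 45 (illustrative DGP).
library(glmnet)
set.seed(2)
n <- 50; p <- 45
beta   <- runif(p, 0.5, 1.5)                       # all coefficients nonzero (assumed)
X      <- matrix(rnorm(n * p), n, p)
X.test <- matrix(rnorm(1000 * p), 1000, p)
y      <- as.numeric(X %*% beta + rnorm(n, sd = 3))
y.test <- as.numeric(X.test %*% beta + rnorm(1000, sd = 3))

# OLS: (nearly) unbiased, but very high variance when p is close to n.
ols     <- lm(y ~ X)
mse.ols <- mean((y.test - cbind(1, X.test) %*% coef(ols))^2)

# Ridge with lambda chosen by cross-validation: some bias, much lower variance.
cv        <- cv.glmnet(X, y, alpha = 0)
mse.ridge <- mean((y.test - predict(cv, newx = X.test, s = "lambda.min"))^2)

c(OLS = mse.ols, Ridge = mse.ridge)                # ridge is typically much smaller
```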
The Lasso
• Ridge regression does have one obvious disadvantage: unlike subset selection, which will generally select models that involve just a subset of the variables, ridge regression will include all p predictors in the final model.
• The lasso is a relatively recent alternative to ridge regression that overcomes this disadvantage. The lasso coefficients, $\hat{\beta}^L_\lambda$, minimize the quantity
  $$\sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| = RSS + \lambda \sum_{j=1}^{p} |\beta_j|$$
• In statistical parlance, the lasso uses an $\ell_1$ (pronounced "ell 1") penalty instead of an $\ell_2$ penalty. The $\ell_1$ norm of a coefficient vector $\beta$ is given by $\|\beta\|_1 = \sum |\beta_j|$.
The Lasso: continued
• As with ridge regression, the lasso shrinks the coefficient estimates towards zero.
• However, in the case of the lasso, the $\ell_1$ penalty has the effect of forcing some of the coefficient estimates to be exactly equal to zero when the tuning parameter $\lambda$ is sufficiently large.
• Hence, much like best subset selection, the lasso performs variable selection.
• We say that the lasso yields sparse models, that is, models that involve only a subset of the variables.
• As in ridge regression, selecting a good value of $\lambda$ for the lasso is critical; cross-validation is again the method of choice (see the sketch below).
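A minimal glmnet sketch of the lasso, again assuming the predictor matrix x and response y from before; alpha = 1 selects the $\ell_1$ penalty, and the printed output illustrates the resulting sparsity.

```r
# The lasso with glmnet; alpha = 1 gives the l1 penalty ('x' and 'y' as above).
library(glmnet)

cv.out <- cv.glmnet(x, y, alpha = 1)        # lasso path plus 10-fold CV over lambda

# Unlike ridge, many coefficients are exactly zero at the selected lambda:
b <- as.matrix(coef(cv.out, s = "lambda.min"))   # convert the sparse coefficient matrix
b[b != 0, ]                                      # only the nonzero (selected) coefficients
```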
Example: Credit data
[Figure: lasso coefficient paths for Income, Limit, Rating, and Student on the Credit data, plotted against $\lambda$ (left) and against $\|\hat{\beta}^L_\lambda\|_1 / \|\hat{\beta}\|_1$ (right) (James et al, 2013: 220)]
Example: Baseball Data
[Figure: lasso applied to the baseball data. Top panel: coefficient paths plotted against log(Lambda), with the number of nonzero coefficients along the top axis. Bottom panel: mean squared error plotted against log(Lambda).]
Lasso Example 4
> lasso.pred <- predict(lasso.mod, s = cv.out$lambda.1se, newx = x[test, ])   # s expects lambda itself, not log(lambda)
> plot(lasso.pred, y[test], ylim = c(0, 2500), xlim = c(0, 2500),
+      ylab = "True Value in Test Data", xlab = "Predicted Value in Test Data")
> abline(coef = c(0, 1), lty = 2)   # 45-degree line: perfect predictions
[Figure: true values in the test data plotted against predicted values, with the 45-degree line dashed]
Comparing the Lasso and Ridge Regression
[Figure: left panel indexed by $\lambda$, right panel indexed by $R^2$ on the training data]
• Left: plots of squared bias (black), variance (green), and test MSE (purple) for the lasso on a simulated data set.
• Right: comparison of squared bias, variance, and test MSE between lasso (solid) and ridge (dashed).
• Both are plotted against their $R^2$ on the training data, as a common form of indexing.
• The crosses in both plots indicate the lasso model for which the MSE is smallest.
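To compare the two penalties on a given data set, a quick sketch is to cross-validate both with glmnet and compare the minimum CV errors; x and y are again assumed from before, and which method wins depends on how many of the true coefficients are (close to) zero.

```r
# Sketch: compare the cross-validated MSE of ridge and lasso on the same data
# ('x' and 'y' as above). Neither method dominates in general.
library(glmnet)
set.seed(3)
cv.ridge <- cv.glmnet(x, y, alpha = 0)
cv.lasso <- cv.glmnet(x, y, alpha = 1)
c(ridge = min(cv.ridge$cvm), lasso = min(cv.lasso$cvm))   # smaller is better
```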