Lecture 5: Regularization and ML Methodology. Aykut Erdem, February 2016, Hacettepe University
Recall from last time… Linear Regression
• Model: y(x) = w_0 + w_1 x, with parameters w = (w_0, w_1)
• Loss: ℓ(w) = Σ_{n=1}^{N} [t^(n) − (w_0 + w_1 x^(n))]²
• Gradient Descent Update Rule: w ← w + 2λ (t^(n) − y(x^(n))) x^(n)  (here λ is the learning rate)
• Closed Form Solution: w = (XᵀX)⁻¹ Xᵀ t
[Figure from Bishop: root-mean-square error E_RMS on the training and test sets versus model order M]
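The closed-form solution and the gradient-descent rule above can be checked in a few lines of NumPy. This is a minimal sketch on made-up 1-D data, not code from the lecture; the toy data, learning rate, and variable names are illustrative assumptions.

```python
import numpy as np

# Toy 1-D data (illustrative only, not from the lecture): t ~ 1 + 2x + noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=20)
t = 1.0 + 2.0 * x + 0.1 * rng.standard_normal(20)

# Design matrix with a bias column: X[n] = [1, x_n].
X = np.column_stack([np.ones_like(x), x])

# Closed-form solution: w = (X^T X)^{-1} X^T t.
w_closed = np.linalg.solve(X.T @ X, X.T @ t)

# Batch gradient descent on the same squared-error loss,
# averaged over examples for a stable step size.
w = np.zeros(2)
lr = 0.1
for _ in range(2000):
    residual = t - X @ w            # t^(n) - y(x^(n)) for every n
    w = w + 2 * lr * (X.T @ residual) / len(t)

print(w_closed, w)                  # the two estimates should roughly agree
```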
1-D regression illustrates key concepts
• Data fits – is a linear model best (model selection)?
  - The simplest models do not capture all the important variations (signal) in the data: underfit
  - A more complex model may overfit the training data (fit not only the signal but also the noise in the data), especially if there is not enough data to constrain the model
• One method of assessing fit: test generalization = the model's ability to predict held-out data
• Optimization is essential: stochastic and batch iterative approaches; analytic when available
slide by Richard Zemel 3
Today
• Regularization
• Machine Learning Methodology
  - validation
  - cross-validation (k-fold, leave-one-out)
  - model selection
4
Regularization 5
Regularized Least Squares
• A technique to control the overfitting phenomenon
• Add a penalty term to the error function in order to discourage the coefficients from reaching large values
• Ridge regression: E(w) = (1/2) Σ_{n=1}^{N} {y(x_n, w) − t_n}² + (λ/2) ∥w∥²
  where ∥w∥² ≡ wᵀw = w_0² + w_1² + … + w_M², and λ governs the importance of the regularization term compared with the sum-of-squares error
• This regularized error is minimized in closed form by w = (λI + ΦᵀΦ)⁻¹ Φᵀ t
slide by Erik Sudderth 6
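A minimal sketch of ridge regression as defined above, fitting a degree-9 polynomial to noisy sinusoidal data in the style of Bishop's running example. Only the closed-form minimizer comes from the slide; the helper names, toy data, and the two λ values are my own assumptions.

```python
import numpy as np

def fit_ridge(Phi, t, lam):
    """Minimize 0.5 * sum_n (y(x_n, w) - t_n)^2 + 0.5 * lam * ||w||^2.

    Closed form: w = (lam * I + Phi^T Phi)^{-1} Phi^T t.
    """
    D = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(D) + Phi.T @ Phi, Phi.T @ t)

# Illustrative data (not the lecture's): noisy samples of sin(2*pi*x).
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(10)
Phi = np.vander(x, 10, increasing=True)      # degree-9 polynomial features

w_small_lam = fit_ridge(Phi, t, lam=np.exp(-18))   # ln lambda = -18
w_large_lam = fit_ridge(Phi, t, lam=np.exp(0))     # ln lambda = 0
print(np.round(w_small_lam, 2))   # larger-magnitude coefficients
print(np.round(w_large_lam, 2))   # heavily shrunk coefficients
```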
The effect of regularization
[Figure: M = 9 polynomial fits, t plotted against x, for ln λ = −18 (left) and ln λ = 0 (right)]
slide by Erik Sudderth 7
The effect of regularization

          ln λ = −∞    ln λ = −18    ln λ = 0
  w*_0          0.35          0.35       0.13
  w*_1        232.37          4.74      -0.05
  w*_2      -5321.83         -0.77      -0.06
  w*_3      48568.31        -31.97      -0.05
  w*_4    -231639.30         -3.89      -0.03
  w*_5     640042.26         55.28      -0.02
  w*_6   -1061800.52         41.32      -0.01
  w*_7    1042400.18        -45.95      -0.00
  w*_8    -557682.99        -91.53       0.00
  w*_9     125201.43         72.68       0.01

[Figure: E_RMS on the training and test sets versus ln λ]

The corresponding coefficients from the fitted polynomials, showing that regularization has the desired effect of reducing the magnitude of the coefficients.
slide by Erik Sudderth 8
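The qualitative behaviour in the table can be reproduced by sweeping λ and watching the largest coefficient magnitude together with the training/test E_RMS. A hedged sketch on synthetic data; the exact numbers will differ from the slide's.

```python
import numpy as np

def rms_error(Phi, w, t):
    return np.sqrt(np.mean((Phi @ w - t) ** 2))

# Synthetic Bishop-style data (illustrative assumption).
rng = np.random.default_rng(1)
x_train = np.linspace(0, 1, 10)
t_train = np.sin(2 * np.pi * x_train) + 0.2 * rng.standard_normal(10)
x_test = np.linspace(0, 1, 100)
t_test = np.sin(2 * np.pi * x_test) + 0.2 * rng.standard_normal(100)

M = 9
Phi_train = np.vander(x_train, M + 1, increasing=True)
Phi_test = np.vander(x_test, M + 1, increasing=True)

for ln_lam in [-30, -18, -10, 0]:
    lam = np.exp(ln_lam)
    w = np.linalg.solve(lam * np.eye(M + 1) + Phi_train.T @ Phi_train,
                        Phi_train.T @ t_train)
    print(f"ln lambda = {ln_lam:4d}: max|w| = {np.max(np.abs(w)):10.1f}, "
          f"train E_RMS = {rms_error(Phi_train, w, t_train):.3f}, "
          f"test E_RMS = {rms_error(Phi_test, w, t_test):.3f}")
```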
A more general regularizer
E(w) = (1/2) Σ_{n=1}^{N} {t_n − wᵀφ(x_n)}² + (λ/2) Σ_{j=1}^{M} |w_j|^q
[Figure: contours of the regularization term for q = 0.5, q = 1, q = 2, q = 4]
slide by Richard Zemel 9
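The choice of q matters in practice: q = 2 recovers ridge regression, while q = 1 (the lasso) tends to drive some weights exactly to zero. The sketch below contrasts the two on synthetic data, solving the q = 1 case with iterative soft-thresholding (ISTA); the data, λ, and solver are my own illustrative choices rather than anything from the lecture, and the penalty is written as λ Σ|w_j| instead of the slide's λ/2 factor.

```python
import numpy as np

def soft_threshold(z, thr):
    return np.sign(z) * np.maximum(np.abs(z) - thr, 0.0)

# Synthetic regression where only two of ten features matter (illustrative).
rng = np.random.default_rng(0)
n, d = 100, 10
Phi = rng.standard_normal((n, d))
w_true = np.zeros(d)
w_true[0], w_true[3] = 2.0, -1.5
t = Phi @ w_true + 0.1 * rng.standard_normal(n)

lam = 3.0

# q = 2 (ridge): closed form; weights are shrunk but generally stay non-zero.
w_ridge = np.linalg.solve(lam * np.eye(d) + Phi.T @ Phi, Phi.T @ t)

# q = 1 (lasso): no closed form; ISTA = gradient step followed by soft-thresholding.
eta = 1.0 / np.linalg.norm(Phi.T @ Phi, 2)   # step size = 1 / Lipschitz constant
w_lasso = np.zeros(d)
for _ in range(2000):
    grad = Phi.T @ (Phi @ w_lasso - t)
    w_lasso = soft_threshold(w_lasso - eta * grad, eta * lam)

print("true w :", w_true)
print("q = 2  :", np.round(w_ridge, 3))   # everything shrunk a little, nothing exactly 0
print("q = 1  :", np.round(w_lasso, 3))   # irrelevant weights typically driven exactly to 0
```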
Machine Learning Methodology 10
Recap: Regression
• In regression, labels y_i are continuous
• Classification/regression are solved very similarly
• Everything we have done so far transfers to classification with very minor changes
• Error: sum of distances from examples to the fitted model
[Figure: a fitted regression line through scattered (x, y) data points]
slide by Olga Veksler 11
Training/Test Data Split
• Talked about splitting data into training/test sets
  - training data is used to fit parameters
  - test data is used to assess how the classifier generalizes to new data
• What if the classifier has "non-tunable" parameters?
  - a parameter is "non-tunable" if tuning (or training) it on the training data leads to overfitting
  - Examples:
    ‣ k in a kNN classifier
    ‣ number of hidden units in an MNN
    ‣ number of hidden layers in an MNN
    ‣ etc.
slide by Olga Veksler 12
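A minimal sketch of the basic training/test split described above; the function name, split fraction, and toy arrays are illustrative assumptions rather than anything prescribed by the slides.

```python
import numpy as np

def train_test_split(X, y, test_frac=0.2, seed=0):
    """Shuffle the examples, then hold out the last test_frac of them."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(round(test_frac * len(X)))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]

# Toy data, purely for illustration.
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.arange(10, dtype=float)
X_tr, y_tr, X_te, y_te = train_test_split(X, y)
print(len(X_tr), len(X_te))   # 8 training examples, 2 test examples
```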
Example of Overfitting
• Want to fit a polynomial machine f(x, w)
• Instead of fixing the polynomial degree, make it a parameter d
  - learning machine f(x, w, d)
• Consider just three choices for d
  - degree 1
  - degree 2
  - degree 3
• Training error is a bad measure to choose d
  - degree 3 is the best according to the training error, but overfits the data
[Figure: degree-1, degree-2, and degree-3 fits to the same (x, y) training points]
slide by Olga Veksler 13
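The point that training error always favours the most complex model can be seen directly: fit degrees 1, 2, and 3 to (nearly) linear data and watch the training error only go down. The toy data and degrees here are illustrative assumptions.

```python
import numpy as np

# Illustrative, nearly linear data (not the slide's dataset).
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 12)
y = 1.0 + 2.0 * x + 0.3 * rng.standard_normal(12)

for d in [1, 2, 3]:
    w = np.polyfit(x, y, deg=d)                       # least-squares polynomial fit
    train_mse = np.mean((np.polyval(w, x) - y) ** 2)
    print(f"degree {d}: training MSE = {train_mse:.4f}")

# The training MSE can only decrease as d grows, so it always "prefers" the
# most complex model -- exactly why it is a bad criterion for choosing d.
```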
Training/Test Data Split
• What about test error? Seems appropriate
  - degree 2 is the best model according to the test error
• Except what do we report as the test error now?
• Test error should be computed on data that was not used for training at all!
• Here we used the "test" data for training, i.e. for choosing the model
slide by Olga Veksler 14
Validation data
• Same question when choosing among several classifiers
  - our polynomial degree example can be looked at as choosing among 3 classifiers (degree 1, 2, or 3)
• Solution: split the labeled data into three parts

  labeled data:  Training (60%)  |  Validation (20%)  |  Test (20%)
  - Training: train parameters w
  - Validation: train other "tunable" parameters, or select the classifier
  - Test: use only to assess final performance

slide by Olga Veksler 15
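A small sketch of the 60% / 20% / 20% split shown above; only the proportions come from the slide, while the function name and toy arrays are my own.

```python
import numpy as np

def three_way_split(X, y, fracs=(0.6, 0.2, 0.2), seed=0):
    """Shuffle and split into training / validation / test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_train = int(fracs[0] * len(X))
    n_val = int(fracs[1] * len(X))
    tr = idx[:n_train]
    va = idx[n_train:n_train + n_val]
    te = idx[n_train + n_val:]
    return (X[tr], y[tr]), (X[va], y[va]), (X[te], y[te])

# Toy data, purely for illustration.
X = np.random.default_rng(0).standard_normal((100, 3))
y = np.random.default_rng(1).standard_normal(100)
train, val, test = three_way_split(X, y)
print(len(train[0]), len(val[0]), len(test[0]))   # 60, 20, 20
```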
Training/Validation
  labeled data:  Training (60%)  |  Validation (20%)  |  Test (20%)
  - Training error: computed on training examples
  - Validation error: computed on validation examples
  - Test error: computed on test examples
slide by Olga Veksler 16
Training/Validation/Test Data
• Training Data: fit models of degree d = 1, 2, 3
• Validation Data: validation errors are 3.3, 1.8, and 3.4 for the three fitted models
  - d = 2 is chosen (lowest validation error)
• Test Data
  - 1.3 test error computed for d = 2
slide by Olga Veksler 17
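Putting the pieces together, here is a hedged sketch of the full procedure: fit each candidate degree on the training set, choose the degree with the lowest validation error, and only then report the test error of that single chosen model. The synthetic data and resulting numbers are illustrative and will not reproduce the 3.3 / 1.8 / 3.4 / 1.3 values on the slide.

```python
import numpy as np

# Synthetic data with a quadratic ground truth (illustrative assumption).
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=60)
y = 0.5 * x**2 - x + 1.0 + 0.5 * rng.standard_normal(60)

# 60% / 20% / 20% split.
idx = rng.permutation(60)
tr, va, te = idx[:36], idx[36:48], idx[48:]

def mse(w, xs, ys):
    return np.mean((np.polyval(w, xs) - ys) ** 2)

# Fit each candidate degree on the training set, score it on the validation set.
fits, val_errors = {}, {}
for d in [1, 2, 3]:
    w = np.polyfit(x[tr], y[tr], deg=d)
    fits[d] = w
    val_errors[d] = mse(w, x[va], y[va])

best_d = min(val_errors, key=val_errors.get)
print("validation errors:", {d: round(e, 2) for d, e in val_errors.items()})
print("chosen degree:", best_d)
print("test error for chosen degree:", round(mse(fits[best_d], x[te], y[te]), 2))
```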
Choosing Parameters: Example
[Figure: training error and validation error versus the number of hidden units]
• Need to choose the number of hidden units for an MNN
  - The more hidden units, the better we can fit the training data
  - But at some point we overfit the data
slide by Olga Veksler 18
Diagnosing Underfitting/Overfitting
• Underfitting: large training error, large validation error
• Just Right: small training error, small validation error
• Overfitting: small training error, large validation error
slide by Olga Veksler 19
Fixing Underfitting/Overfitting
• Fixing Underfitting
  - getting more training examples will not help
  - get more features
  - try a more complex classifier
    ‣ if using an MNN, try more hidden units
• Fixing Overfitting
  - getting more training examples might help
  - try a smaller set of features
  - try a less complex classifier
    ‣ if using an MNN, try fewer hidden units
slide by Olga Veksler 20
Train/Test/Validation Method
• Good news:
  - Very simple
• Bad news:
  - Wastes data
    ‣ in general, the more data we have, the better the estimated parameters are
    ‣ we estimate parameters on 40% less data, since 20% is removed for the test set and 20% for the validation set
  - If we have a small dataset, our test (validation) set might just be lucky or unlucky
• Cross-Validation is a method for performance evaluation that wastes less data
slide by Olga Veksler 21
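k-fold cross-validation (listed in today's outline; leave-one-out, covered next, is the special case k = n) reuses every example for both training and validation, which is why less data is "wasted". A minimal sketch, with my own function name and synthetic data:

```python
import numpy as np

def k_fold_cv_mse(x, y, degree, k=5, seed=0):
    """Estimate generalization MSE of a degree-`degree` polynomial fit with k-fold CV."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        w = np.polyfit(x[train_idx], y[train_idx], deg=degree)
        errors.append(np.mean((np.polyval(w, x[val_idx]) - y[val_idx]) ** 2))
    return np.mean(errors)   # average validation error over the k folds

# Illustrative data with a quadratic ground truth.
rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, 30)
y = 0.5 * x**2 - x + 1.0 + 0.5 * rng.standard_normal(30)
for d in [1, 2, 3]:
    print(f"degree {d}: 5-fold CV MSE = {k_fold_cv_mse(x, y, d):.2f}")
```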
Small Dataset
[Figure: the same small dataset fitted three ways]
• Linear Model: Mean Squared Error = 2.4
• Quadratic Model: Mean Squared Error = 0.9
• Join-the-dots Model: Mean Squared Error = 2.2
slide by Olga Veksler 22
LOOCV (Leave-one-out Cross Validation)
For k = 1 to n
  1. Let (x_k, y_k) be the k-th example
[Figure: the full dataset plotted as points in the (x, y) plane]
slide by Olga Veksler 23
LOOCV (Leave-one-out Cross Validation)
For k = 1 to n
  1. Let (x_k, y_k) be the k-th example
  2. Temporarily remove (x_k, y_k) from the dataset
[Figure: the dataset with the k-th point removed]
slide by Olga Veksler 24
LOOCV (Leave-one-out Cross Validation)
For k = 1 to n
  1. Let (x_k, y_k) be the k-th example
  2. Temporarily remove (x_k, y_k) from the dataset
  3. Train on the remaining n − 1 examples
[Figure: the model fitted to the remaining n − 1 points]
slide by Olga Veksler 25
LOOCV (Leave-one-out Cross Validation)
For k = 1 to n
  1. Let (x_k, y_k) be the k-th example
  2. Temporarily remove (x_k, y_k) from the dataset
  3. Train on the remaining n − 1 examples
  4. Note your error on (x_k, y_k)
[Figure: the prediction error measured at the held-out point (x_k, y_k)]
slide by Olga Veksler 26
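A direct translation of steps 1–4 into code; averaging the per-example errors at the end (the usual final step of LOOCV, not shown on these slides) gives the leave-one-out estimate. The data and function name are illustrative assumptions.

```python
import numpy as np

def loocv_mse(x, y, degree):
    """Leave-one-out cross-validation error for a polynomial fit of the given degree."""
    errors = []
    for k in range(len(x)):                                # For k = 1 to n
        mask = np.ones(len(x), dtype=bool)
        mask[k] = False                                    # temporarily remove (x_k, y_k)
        w = np.polyfit(x[mask], y[mask], deg=degree)       # train on remaining n - 1 examples
        errors.append((np.polyval(w, x[k]) - y[k]) ** 2)   # note the error on (x_k, y_k)
    return np.mean(errors)                                 # average over all held-out points

# Illustrative small dataset with a quadratic ground truth.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 15)
y = 0.5 * x**2 - x + 1.0 + 0.5 * rng.standard_normal(15)
for d in [1, 2, 3]:
    print(f"degree {d}: LOOCV MSE = {loocv_mse(x, y, d):.2f}")
```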