Lecture 5: Regularization and ML Methodology. Aykut Erdem, February 2016, Hacettepe University
Recall from last time… Linear Regression
• Model: y(x) = w_0 + w_1 x, with parameters w = (w_0, w_1)
• Loss: ℓ(w) = Σ_{n=1}^{N} [t^(n) − (w_0 + w_1 x^(n))]²
• Gradient Descent Update Rule: w ← w + 2λ (t^(n) − y(x^(n))) x^(n)  (here λ is the learning rate)
• Closed Form Solution: w = (XᵀX)⁻¹ Xᵀ t
[Figure from Bishop: root-mean-square error E_RMS on the training and test sets versus model order M]
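The closed-form solution and the gradient-descent rule above can be checked in a few lines of NumPy. This is a minimal sketch on made-up 1-D data, not code from the lecture; the toy data, learning rate, and variable names are illustrative assumptions.

```python
import numpy as np

# Toy 1-D data (illustrative only, not from the lecture): t ~ 1 + 2x + noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=20)
t = 1.0 + 2.0 * x + 0.1 * rng.standard_normal(20)

# Design matrix with a bias column: X[n] = [1, x_n].
X = np.column_stack([np.ones_like(x), x])

# Closed-form solution: w = (X^T X)^{-1} X^T t.
w_closed = np.linalg.solve(X.T @ X, X.T @ t)

# Batch gradient descent on the same squared-error loss,
# averaged over examples for a stable step size.
w = np.zeros(2)
lr = 0.1
for _ in range(2000):
    residual = t - X @ w            # t^(n) - y(x^(n)) for every n
    w = w + 2 * lr * (X.T @ residual) / len(t)

print(w_closed, w)                  # the two estimates should roughly agree
```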
1-D regression illustrates key concepts
• Data fits – is a linear model best (model selection)?
  - The simplest models do not capture all the important variations (signal) in the data: underfit
  - A more complex model may overfit the training data (fit not only the signal but also the noise in the data), especially if there is not enough data to constrain the model
• One method of assessing fit: test generalization = the model's ability to predict held-out data
• Optimization is essential: stochastic and batch iterative approaches; analytic when available
slide by Richard Zemel 3
Today
• Regularization
• Machine Learning Methodology
  - validation
  - cross-validation (k-fold, leave-one-out)
  - model selection
4
Regularization 5
Regularized Least Squares
• A technique to control the overfitting phenomenon
• Add a penalty term to the error function in order to discourage the coefficients from reaching large values
• Ridge regression: E(w) = (1/2) Σ_{n=1}^{N} {y(x_n, w) − t_n}² + (λ/2) ∥w∥²
  where ∥w∥² ≡ wᵀw = w_0² + w_1² + … + w_M², and λ governs the importance of the regularization term compared with the sum-of-squares error
• This regularized error is minimized in closed form by w = (λI + ΦᵀΦ)⁻¹ Φᵀ t
slide by Erik Sudderth 6
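A minimal sketch of ridge regression as defined above, fitting a degree-9 polynomial to noisy sinusoidal data in the style of Bishop's running example. Only the closed-form minimizer comes from the slide; the helper names, toy data, and the two λ values are my own assumptions.

```python
import numpy as np

def fit_ridge(Phi, t, lam):
    """Minimize 0.5 * sum_n (y(x_n, w) - t_n)^2 + 0.5 * lam * ||w||^2.

    Closed form: w = (lam * I + Phi^T Phi)^{-1} Phi^T t.
    """
    D = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(D) + Phi.T @ Phi, Phi.T @ t)

# Illustrative data (not the lecture's): noisy samples of sin(2*pi*x).
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(10)
Phi = np.vander(x, 10, increasing=True)      # degree-9 polynomial features

w_small_lam = fit_ridge(Phi, t, lam=np.exp(-18))   # ln lambda = -18
w_large_lam = fit_ridge(Phi, t, lam=np.exp(0))     # ln lambda = 0
print(np.round(w_small_lam, 2))   # larger-magnitude coefficients
print(np.round(w_large_lam, 2))   # heavily shrunk coefficients
```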
The effect of regularization
[Figure: M = 9 polynomial fits, t plotted against x, for ln λ = −18 (left) and ln λ = 0 (right)]
slide by Erik Sudderth 7
The effect of regularization

          ln λ = −∞    ln λ = −18    ln λ = 0
  w*_0          0.35          0.35       0.13
  w*_1        232.37          4.74      -0.05
  w*_2      -5321.83         -0.77      -0.06
  w*_3      48568.31        -31.97      -0.05
  w*_4    -231639.30         -3.89      -0.03
  w*_5     640042.26         55.28      -0.02
  w*_6   -1061800.52         41.32      -0.01
  w*_7    1042400.18        -45.95      -0.00
  w*_8    -557682.99        -91.53       0.00
  w*_9     125201.43         72.68       0.01

[Figure: E_RMS on the training and test sets versus ln λ]

The corresponding coefficients from the fitted polynomials, showing that regularization has the desired effect of reducing the magnitude of the coefficients.
slide by Erik Sudderth 8
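The qualitative behaviour in the table can be reproduced by sweeping λ and watching the largest coefficient magnitude together with the training/test E_RMS. A hedged sketch on synthetic data; the exact numbers will differ from the slide's.

```python
import numpy as np

def rms_error(Phi, w, t):
    return np.sqrt(np.mean((Phi @ w - t) ** 2))

# Synthetic Bishop-style data (illustrative assumption).
rng = np.random.default_rng(1)
x_train = np.linspace(0, 1, 10)
t_train = np.sin(2 * np.pi * x_train) + 0.2 * rng.standard_normal(10)
x_test = np.linspace(0, 1, 100)
t_test = np.sin(2 * np.pi * x_test) + 0.2 * rng.standard_normal(100)

M = 9
Phi_train = np.vander(x_train, M + 1, increasing=True)
Phi_test = np.vander(x_test, M + 1, increasing=True)

for ln_lam in [-30, -18, -10, 0]:
    lam = np.exp(ln_lam)
    w = np.linalg.solve(lam * np.eye(M + 1) + Phi_train.T @ Phi_train,
                        Phi_train.T @ t_train)
    print(f"ln lambda = {ln_lam:4d}: max|w| = {np.max(np.abs(w)):10.1f}, "
          f"train E_RMS = {rms_error(Phi_train, w, t_train):.3f}, "
          f"test E_RMS = {rms_error(Phi_test, w, t_test):.3f}")
```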
A more general regularizer
E(w) = (1/2) Σ_{n=1}^{N} {t_n − wᵀφ(x_n)}² + (λ/2) Σ_{j=1}^{M} |w_j|^q
[Figure: contours of the regularization term for q = 0.5, q = 1, q = 2, q = 4]
slide by Richard Zemel 9
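The choice of q matters in practice: q = 2 recovers ridge regression, while q = 1 (the lasso) tends to drive some weights exactly to zero. The sketch below contrasts the two on synthetic data, solving the q = 1 case with iterative soft-thresholding (ISTA); the data, λ, and solver are my own illustrative choices rather than anything from the lecture, and the penalty is written as λ Σ|w_j| instead of the slide's λ/2 factor.

```python
import numpy as np

def soft_threshold(z, thr):
    return np.sign(z) * np.maximum(np.abs(z) - thr, 0.0)

# Synthetic regression where only two of ten features matter (illustrative).
rng = np.random.default_rng(0)
n, d = 100, 10
Phi = rng.standard_normal((n, d))
w_true = np.zeros(d)
w_true[0], w_true[3] = 2.0, -1.5
t = Phi @ w_true + 0.1 * rng.standard_normal(n)

lam = 3.0

# q = 2 (ridge): closed form; weights are shrunk but generally stay non-zero.
w_ridge = np.linalg.solve(lam * np.eye(d) + Phi.T @ Phi, Phi.T @ t)

# q = 1 (lasso): no closed form; ISTA = gradient step followed by soft-thresholding.
eta = 1.0 / np.linalg.norm(Phi.T @ Phi, 2)   # step size = 1 / Lipschitz constant
w_lasso = np.zeros(d)
for _ in range(2000):
    grad = Phi.T @ (Phi @ w_lasso - t)
    w_lasso = soft_threshold(w_lasso - eta * grad, eta * lam)

print("true w :", w_true)
print("q = 2  :", np.round(w_ridge, 3))   # everything shrunk a little, nothing exactly 0
print("q = 1  :", np.round(w_lasso, 3))   # irrelevant weights typically driven exactly to 0
```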
Machine Learning Methodology 10
Recap: Regression
• In regression, labels y_i are continuous
• Classification/regression are solved very similarly
• Everything we have done so far transfers to classification with very minor changes
• Error: sum of distances from examples to the fitted model
[Figure: a fitted regression line through scattered (x, y) data points]
slide by Olga Veksler 11
Training/Test Data Split
• Talked about splitting data into training/test sets
  - training data is used to fit parameters
  - test data is used to assess how the classifier generalizes to new data
• What if the classifier has "non-tunable" parameters?
  - a parameter is "non-tunable" if tuning (or training) it on the training data leads to overfitting
  - Examples:
    ‣ k in a kNN classifier
    ‣ number of hidden units in an MNN
    ‣ number of hidden layers in an MNN
    ‣ etc.
slide by Olga Veksler 12
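A minimal sketch of the basic training/test split described above; the function name, split fraction, and toy arrays are illustrative assumptions rather than anything prescribed by the slides.

```python
import numpy as np

def train_test_split(X, y, test_frac=0.2, seed=0):
    """Shuffle the examples, then hold out the last test_frac of them."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(round(test_frac * len(X)))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]

# Toy data, purely for illustration.
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.arange(10, dtype=float)
X_tr, y_tr, X_te, y_te = train_test_split(X, y)
print(len(X_tr), len(X_te))   # 8 training examples, 2 test examples
```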
Example of Overfitting
• Want to fit a polynomial machine f(x, w)
• Instead of fixing the polynomial degree, make it a parameter d
  - learning machine f(x, w, d)
• Consider just three choices for d
  - degree 1
  - degree 2
  - degree 3
• Training error is a bad measure to choose d
  - degree 3 is the best according to the training error, but overfits the data
[Figure: degree-1, degree-2, and degree-3 fits to the same (x, y) training points]
slide by Olga Veksler 13
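The point that training error always favours the most complex model can be seen directly: fit degrees 1, 2, and 3 to (nearly) linear data and watch the training error only go down. The toy data and degrees here are illustrative assumptions.

```python
import numpy as np

# Illustrative, nearly linear data (not the slide's dataset).
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 12)
y = 1.0 + 2.0 * x + 0.3 * rng.standard_normal(12)

for d in [1, 2, 3]:
    w = np.polyfit(x, y, deg=d)                       # least-squares polynomial fit
    train_mse = np.mean((np.polyval(w, x) - y) ** 2)
    print(f"degree {d}: training MSE = {train_mse:.4f}")

# The training MSE can only decrease as d grows, so it always "prefers" the
# most complex model -- exactly why it is a bad criterion for choosing d.
```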
Training/Test Data Split
• What about test error? Seems appropriate
  - degree 2 is the best model according to the test error
• Except what do we report as the test error now?
• Test error should be computed on data that was not used for training at all!
• Here we used the "test" data for training, i.e. for choosing the model
slide by Olga Veksler 14
Validation data
• Same question when choosing among several classifiers
  - our polynomial degree example can be looked at as choosing among 3 classifiers (degree 1, 2, or 3)
• Solution: split the labeled data into three parts

  labeled data:  Training (60%)  |  Validation (20%)  |  Test (20%)
  - Training: train parameters w
  - Validation: train other "tunable" parameters, or select the classifier
  - Test: use only to assess final performance

slide by Olga Veksler 15
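A small sketch of the 60% / 20% / 20% split shown above; only the proportions come from the slide, while the function name and toy arrays are my own.

```python
import numpy as np

def three_way_split(X, y, fracs=(0.6, 0.2, 0.2), seed=0):
    """Shuffle and split into training / validation / test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_train = int(fracs[0] * len(X))
    n_val = int(fracs[1] * len(X))
    tr = idx[:n_train]
    va = idx[n_train:n_train + n_val]
    te = idx[n_train + n_val:]
    return (X[tr], y[tr]), (X[va], y[va]), (X[te], y[te])

# Toy data, purely for illustration.
X = np.random.default_rng(0).standard_normal((100, 3))
y = np.random.default_rng(1).standard_normal(100)
train, val, test = three_way_split(X, y)
print(len(train[0]), len(val[0]), len(test[0]))   # 60, 20, 20
```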
Training/Validation
  labeled data:  Training (60%)  |  Validation (20%)  |  Test (20%)
  - Training error: computed on training examples
  - Validation error: computed on validation examples
  - Test error: computed on test examples
slide by Olga Veksler 16
Training/Validation/Test Data
• Training Data: fit models of degree d = 1, 2, 3
• Validation Data: validation errors are 3.3, 1.8, and 3.4 for the three fitted models
  - d = 2 is chosen (lowest validation error)
• Test Data
  - 1.3 test error computed for d = 2
slide by Olga Veksler 17
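Putting the pieces together, here is a hedged sketch of the full procedure: fit each candidate degree on the training set, choose the degree with the lowest validation error, and only then report the test error of that single chosen model. The synthetic data and resulting numbers are illustrative and will not reproduce the 3.3 / 1.8 / 3.4 / 1.3 values on the slide.

```python
import numpy as np

# Synthetic data with a quadratic ground truth (illustrative assumption).
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=60)
y = 0.5 * x**2 - x + 1.0 + 0.5 * rng.standard_normal(60)

# 60% / 20% / 20% split.
idx = rng.permutation(60)
tr, va, te = idx[:36], idx[36:48], idx[48:]

def mse(w, xs, ys):
    return np.mean((np.polyval(w, xs) - ys) ** 2)

# Fit each candidate degree on the training set, score it on the validation set.
fits, val_errors = {}, {}
for d in [1, 2, 3]:
    w = np.polyfit(x[tr], y[tr], deg=d)
    fits[d] = w
    val_errors[d] = mse(w, x[va], y[va])

best_d = min(val_errors, key=val_errors.get)
print("validation errors:", {d: round(e, 2) for d, e in val_errors.items()})
print("chosen degree:", best_d)
print("test error for chosen degree:", round(mse(fits[best_d], x[te], y[te]), 2))
```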
Choosing Parameters: Example
[Figure: training error and validation error versus the number of hidden units]
• Need to choose the number of hidden units for an MNN
  - The more hidden units, the better we can fit the training data
  - But at some point we overfit the data
slide by Olga Veksler 18
Diagnosing Underfitting/Overfitting
• Underfitting: large training error, large validation error
• Just Right: small training error, small validation error
• Overfitting: small training error, large validation error
slide by Olga Veksler 19
Fixing Underfitting/Overfitting
• Fixing Underfitting
  - getting more training examples will not help
  - get more features
  - try a more complex classifier
    ‣ if using an MNN, try more hidden units
• Fixing Overfitting
  - getting more training examples might help
  - try a smaller set of features
  - try a less complex classifier
    ‣ if using an MNN, try fewer hidden units
slide by Olga Veksler 20
Train/Test/Validation Method
• Good news:
  - Very simple
• Bad news:
  - Wastes data
    ‣ in general, the more data we have, the better the estimated parameters are
    ‣ we estimate parameters on 40% less data, since 20% is removed for the test set and 20% for the validation set
  - If we have a small dataset, our test (validation) set might just be lucky or unlucky
• Cross-Validation is a method for performance evaluation that wastes less data
slide by Olga Veksler 21
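k-fold cross-validation (listed in today's outline; leave-one-out, covered next, is the special case k = n) reuses every example for both training and validation, which is why less data is "wasted". A minimal sketch, with my own function name and synthetic data:

```python
import numpy as np

def k_fold_cv_mse(x, y, degree, k=5, seed=0):
    """Estimate generalization MSE of a degree-`degree` polynomial fit with k-fold CV."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        w = np.polyfit(x[train_idx], y[train_idx], deg=degree)
        errors.append(np.mean((np.polyval(w, x[val_idx]) - y[val_idx]) ** 2))
    return np.mean(errors)   # average validation error over the k folds

# Illustrative data with a quadratic ground truth.
rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, 30)
y = 0.5 * x**2 - x + 1.0 + 0.5 * rng.standard_normal(30)
for d in [1, 2, 3]:
    print(f"degree {d}: 5-fold CV MSE = {k_fold_cv_mse(x, y, d):.2f}")
```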
Small Dataset
[Figure: the same small dataset fitted three ways]
• Linear Model: Mean Squared Error = 2.4
• Quadratic Model: Mean Squared Error = 0.9
• Join-the-dots Model: Mean Squared Error = 2.2
slide by Olga Veksler 22
LOOCV (Leave-one-out Cross Validation)
For k = 1 to n
  1. Let (x_k, y_k) be the k-th example
[Figure: the full dataset plotted as points in the (x, y) plane]
slide by Olga Veksler 23
LOOCV (Leave-one-out Cross Validation)
For k = 1 to n
  1. Let (x_k, y_k) be the k-th example
  2. Temporarily remove (x_k, y_k) from the dataset
[Figure: the dataset with the k-th point removed]
slide by Olga Veksler 24
LOOCV (Leave-one-out Cross Validation)
For k = 1 to n
  1. Let (x_k, y_k) be the k-th example
  2. Temporarily remove (x_k, y_k) from the dataset
  3. Train on the remaining n − 1 examples
[Figure: the model fitted to the remaining n − 1 points]
slide by Olga Veksler 25
LOOCV (Leave-one-out Cross Validation)
For k = 1 to n
  1. Let (x_k, y_k) be the k-th example
  2. Temporarily remove (x_k, y_k) from the dataset
  3. Train on the remaining n − 1 examples
  4. Note your error on (x_k, y_k)
[Figure: the prediction error measured at the held-out point (x_k, y_k)]
slide by Olga Veksler 26
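A direct translation of steps 1–4 into code; averaging the per-example errors at the end (the usual final step of LOOCV, not shown on these slides) gives the leave-one-out estimate. The data and function name are illustrative assumptions.

```python
import numpy as np

def loocv_mse(x, y, degree):
    """Leave-one-out cross-validation error for a polynomial fit of the given degree."""
    errors = []
    for k in range(len(x)):                                # For k = 1 to n
        mask = np.ones(len(x), dtype=bool)
        mask[k] = False                                    # temporarily remove (x_k, y_k)
        w = np.polyfit(x[mask], y[mask], deg=degree)       # train on remaining n - 1 examples
        errors.append((np.polyval(w, x[k]) - y[k]) ** 2)   # note the error on (x_k, y_k)
    return np.mean(errors)                                 # average over all held-out points

# Illustrative small dataset with a quadratic ground truth.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 15)
y = 0.5 * x**2 - x + 1.0 + 0.5 * rng.standard_normal(15)
for d in [1, 2, 3]:
    print(f"degree {d}: LOOCV MSE = {loocv_mse(x, y, d):.2f}")
```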