Tufts COMP 135: Introduction to Machine Learning
https://www.cs.tufts.edu/comp/135/2019s/

Cross Validation and Penalized Linear Regression

Many slides attributable to: Prof. Mike Hughes, Erik Sudderth (UCI), Finale Doshi-Velez (Harvard), James, Witten, Hastie, Tibshirani (ISL/ESL books)
CV & Penalized LR Objectives
• Regression with transformations of features
• Cross validation
• L2 penalties
• L1 penalties
What will we learn?
[Course overview diagram: supervised learning takes training data of label pairs {x_n, y_n}, n = 1...N, defines a task and a performance measure, and is evaluated on prediction; unsupervised learning and reinforcement learning appear alongside.]
Task: Regression
y is a numeric variable, e.g. sales in $$
[Diagram: within supervised learning, regression maps an input x to a numeric output y.]
Review: Linear Regression

Optimization problem: "Least Squares"

$$\min_{w, b} \sum_{n=1}^{N} \left( y_n - \hat{y}(x_n, w, b) \right)^2$$

Stack each example's features with a trailing 1 (for the bias):

$$\tilde{X} = \begin{bmatrix} x_{11} & \ldots & x_{1F} & 1 \\ x_{21} & \ldots & x_{2F} & 1 \\ & \vdots & & \\ x_{N1} & \ldots & x_{NF} & 1 \end{bmatrix}$$

An exact formula for the optimal values of w, b exists:

$$[w_1 \; \ldots \; w_F \; b]^T = (\tilde{X}^T \tilde{X})^{-1} \tilde{X}^T y$$

Math works in 1D and for many dimensions.
Recap: solving linear regression
• More examples than features (N > F): if the inverse of X^T X exists (needs to be full rank), then an optimal weight vector exists and the formula applies. The fit likely has non-zero error (overdetermined).
• Same number of examples and features (N = F): if the inverse of X^T X exists (needs to be full rank), then an optimal weight vector exists and the formula applies. The fit will have zero error on the training set.
• Fewer examples than features (N < F) or low rank: infinitely many optimal weight vectors exist with zero error, and the inverse of X^T X does not exist (naïvely, the formula will fail).
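For concreteness, here is a minimal NumPy sketch of the closed-form solution above; the toy data X and y are made up for illustration and are not from the slides.

```python
import numpy as np

# Hypothetical toy data: N=5 examples, F=2 features.
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.5],
              [4.0, 3.0],
              [5.0, 2.5]])
y = np.array([3.1, 3.9, 6.2, 9.8, 11.0])

# Append a constant column of ones so the bias b is the last weight.
X_tilde = np.hstack([X, np.ones((X.shape[0], 1))])

# Normal-equation solution: theta = (X^T X)^{-1} X^T y.
# np.linalg.solve is preferred over forming the inverse explicitly.
theta = np.linalg.solve(X_tilde.T @ X_tilde, X_tilde.T @ y)
w, b = theta[:-1], theta[-1]
print("weights:", w, "bias:", b)
```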
Recap
• Squared error is special: exact formulas for estimating parameters
• Most metrics do not have exact formulas: take derivative, set to zero, try to solve ... HARD!
• Example: absolute error
• General algorithm: gradient descent!
• As long as the first derivative exists, we can iterate to estimate optimal parameters
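Since the slides point to gradient descent as the general algorithm, here is a minimal sketch of a fixed-step-size gradient descent loop for the least-squares objective; the function name, step size, and iteration count are assumptions, not the course's reference implementation.

```python
import numpy as np

def gradient_descent_least_squares(X_tilde, y, step_size=0.01, n_iters=5000):
    """Minimize mean squared error (1/N) * sum_n (y_n - X_tilde[n] @ theta)^2."""
    N, G = X_tilde.shape
    theta = np.zeros(G)
    for _ in range(n_iters):
        residual = X_tilde @ theta - y             # prediction minus target, shape (N,)
        grad = (2.0 / N) * (X_tilde.T @ residual)  # gradient of the MSE w.r.t. theta
        theta = theta - step_size * grad           # step downhill; step_size may need tuning
    return theta
```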
Transformations of Features
Fitting a line isn’t always ideal
Can fit linear functions to nonlinear features

A nonlinear function of x:

$$\hat{y}(x_i) = \theta_0 + \theta_1 x_i + \theta_2 x_i^2 + \theta_3 x_i^3$$

With the feature transform $\phi(x_i) = [1 \;\; x_i \;\; x_i^2 \;\; x_i^3]$, this can be written as a linear function of the features:

$$\hat{y}(x_i) = \sum_{g=1}^{4} \theta_g \phi_g(x_i) = \theta^T \phi(x_i)$$

"Linear regression" means linear in the parameters (weights, biases). Features can be arbitrary transforms of raw data.
What feature transform to use?
• Anything that works for your data!
• sin / cos for periodic data
• polynomials for high-order dependencies: $\phi(x_i) = [1 \;\; x_i \;\; x_i^2 \;\; \ldots]$
• interactions between feature dimensions: $\phi(x_i) = [1 \;\; x_{i1} x_{i2} \;\; x_{i3} x_{i4} \;\; \ldots]$
• Many other choices possible
Linear Regression with Transformed Features

$$\phi(x_i) = [1 \;\; \phi_1(x_i) \;\; \phi_2(x_i) \;\; \ldots \;\; \phi_{G-1}(x_i)], \qquad \hat{y}(x_i) = \theta^T \phi(x_i)$$

Optimization problem: "Least Squares"

$$\min_{\theta} \sum_{n=1}^{N} \left( y_n - \theta^T \phi(x_n) \right)^2$$

Stack the transformed features into an N x G matrix:

$$\Phi = \begin{bmatrix} 1 & \phi_1(x_1) & \ldots & \phi_{G-1}(x_1) \\ 1 & \phi_1(x_2) & \ldots & \phi_{G-1}(x_2) \\ & & \vdots & \\ 1 & \phi_1(x_N) & \ldots & \phi_{G-1}(x_N) \end{bmatrix}$$

Exact solution:

$$\theta^* = (\Phi^T \Phi)^{-1} \Phi^T y$$
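As an illustrative sketch (scikit-learn is assumed here, not prescribed by the slides), PolynomialFeatures builds the Φ matrix and LinearRegression solves the least-squares problem; the data are made up.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical 1D data with a nonlinear trend.
rng = np.random.RandomState(0)
x = np.linspace(-2, 2, 40).reshape(-1, 1)
y = 0.5 * x[:, 0] ** 3 - x[:, 0] + rng.normal(scale=0.3, size=40)

# Degree-3 polynomial features phi(x) = [1, x, x^2, x^3], then plain least squares.
model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
model.fit(x, y)
print(model.predict(np.array([[1.5]])))
```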
Cross Validation
Generalize: sample to population
Labeled dataset
[Table sketch: columns x and y.]
Each row represents one example. Assume rows are arranged "uniformly at random" (order doesn’t matter).
Split into train and test
[Diagram: the labeled dataset (columns x, y) is split row-wise into a train portion and a test portion.]
Model Complexity vs Error
[Plot: error vs model complexity, with the underfitting regime at low complexity and the overfitting regime at high complexity.]
How to fit best model? Option: fit on train, select on validation
1) Fit each model to training data
2) Evaluate each model on validation data
3) Select the model with lowest validation error
4) Report error on the test set
[Diagram: the dataset (columns x, y) split row-wise into train, validation, and test portions.]
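A hedged sketch of steps 1–4, assuming polynomial degree is the complexity knob and scikit-learn is available; the data and the 60/20/20 split sizes are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical data; in practice use your own x (N, F) and y (N,).
rng = np.random.RandomState(0)
x = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(x[:, 0]) + rng.normal(scale=0.2, size=200)

# Split: 60% train, 20% validation, 20% test.
x_tr, x_rest, y_tr, y_rest = train_test_split(x, y, test_size=0.4, random_state=0)
x_va, x_te, y_va, y_te = train_test_split(x_rest, y_rest, test_size=0.5, random_state=0)

best_degree, best_err = None, np.inf
for degree in range(10):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_tr, y_tr)                                # 1) fit on train
    err = mean_squared_error(y_va, model.predict(x_va))  # 2) evaluate on validation
    if err < best_err:                                   # 3) keep lowest validation error
        best_degree, best_err = degree, err

final = make_pipeline(PolynomialFeatures(best_degree), LinearRegression()).fit(x_tr, y_tr)
print("chosen degree:", best_degree,
      "test MSE:", mean_squared_error(y_te, final.predict(x_te)))  # 4) report on test
```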
How to fit best model? Option: fit on train, select on validation (procedure above)

Concerns:
• Will the train set be too small?
• Can we make better use of the data?
Estimating Heldout Error with Fixed Validation Set
[Figure: heldout error estimated from a single random split (left panel) and from 10 other random splits (right panel). Credit: ISL Textbook, Chapter 5.]
3-fold Cross Validation

Divide the labeled dataset into 3 even-sized parts (fold 1, fold 2, fold 3).

Fit the model 3 independent times. Each time, leave one fold out as validation and keep the remaining folds as training.
[Diagram: the three resulting train/validation assignments.]

Heldout error estimate: average of the validation error across all 3 fits.
K-fold CV: How many folds K?
• Can go as low as 2 folds
• Can go as high as N folds ("leave-one-out": each fold holds a single example)
• Usual rule of thumb: 5-fold or 10-fold CV
• Computation runtime scales linearly with K
• Larger K also means each fit uses more training data, so each fit might take longer too
• Each fit is independent and parallelizable
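A brief sketch of the heldout-error estimate with scikit-learn's K-fold utilities (an assumed tooling choice), using the 5-fold rule of thumb on hypothetical data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Hypothetical data; replace with your own feature matrix and targets.
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# 5-fold CV: each fold is held out once; scores are negated MSE by sklearn convention.
scores = cross_val_score(LinearRegression(), X, y,
                         scoring="neg_mean_squared_error", cv=5)
print("heldout MSE estimate:", -scores.mean())
```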
Estimating Heldout Error with Cross Validation
[Figure: heldout error estimated with leave-one-out CV (left panel) and with 9 separate runs of 10-fold CV (right panel). Credit: ISL Textbook, Chapter 5.]
What to do about underfitting?
• Increase model complexity
• Add more features!
What to do about overfitting?
• Select complexity with cross validation
• Control single-fit complexity with a penalty!
Zero degree polynomial fit (figure). Credit: Slides from course by Prof. Erik Sudderth (UCI)
1st degree polynomial fit (figure). Credit: Slides from course by Prof. Erik Sudderth (UCI)
3rd degree polynomial fit (figure). Credit: Slides from course by Prof. Erik Sudderth (UCI)
9th degree polynomial fit (figure). Credit: Slides from course by Prof. Erik Sudderth (UCI)
Error vs Complexity
[Plot: square root of mean squared error vs polynomial degree.]
Polynomial degree
[Figure comparing fits of degree 0, 1, 3, and 9. Credit: Slides from course by Prof. Erik Sudderth (UCI)]
Idea: Penalize magnitude of weights

$$J(\theta) = \frac{1}{2} \sum_{n=1}^{N} \left( y_n - \theta^T \tilde{x}_n \right)^2 + \alpha \sum_{f} \theta_f^2, \qquad \alpha \ge 0$$

Penalty strength: larger alpha means we prefer smaller-magnitude weights.
Idea: Penalize magnitude of weights

$$J(\theta) = \frac{1}{2} \sum_{n=1}^{N} \left( y_n - \theta^T \tilde{x}_n \right)^2 + \alpha \sum_{f} \theta_f^2$$

Written via matrix/vector product notation:

$$J(\theta) = \frac{1}{2} (y - \tilde{X}\theta)^T (y - \tilde{X}\theta) + \alpha\, \theta^T \theta$$
Exact solution for L2 penalized linear regression

Optimization problem: "Penalized Least Squares"

$$\min_{\theta} \; \frac{1}{2} (y - \tilde{X}\theta)^T (y - \tilde{X}\theta) + \alpha\, \theta^T \theta$$

Solution:

$$\theta^* = (\tilde{X}^T \tilde{X} + \alpha I)^{-1} \tilde{X}^T y$$

If alpha > 0, this is always invertible!
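A minimal NumPy sketch of this closed form, applied to hypothetical data; note that, following the slide's formula, the bias column is penalized along with the other weights.

```python
import numpy as np

def fit_ridge(X_tilde, y, alpha=1.0):
    """L2-penalized least squares: theta = (X^T X + alpha * I)^{-1} X^T y."""
    G = X_tilde.shape[1]
    A = X_tilde.T @ X_tilde + alpha * np.eye(G)   # always invertible when alpha > 0
    return np.linalg.solve(A, X_tilde.T @ y)

# Hypothetical use: 100 examples, 5 features plus a ones column for the bias.
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.1, size=100)
X_tilde = np.hstack([X, np.ones((100, 1))])
print(fit_ridge(X_tilde, y, alpha=0.5))
```

Using np.linalg.solve on (X^T X + alpha * I) avoids forming the inverse explicitly, and the added alpha * I keeps the system solvable even when N < F.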
Slides on L1/L2 penalties
See slides 71-82 from the UC-Irvine course here: https://canvas.eee.uci.edu/courses/8278/files/2735313/
Pair Coding Activity
https://github.com/tufts-ml-courses/comp135-19s-assignments/blob/master/labs/GradientDescentDemo.ipynb
• Try the existing gradient descent code: it optimizes a scalar slope to produce minimum error. Try step sizes of 0.0001, 0.02, 0.05, 0.1
• Add an L2 penalty with alpha > 0: write calc_penalized_loss and calc_penalized_grad. What happens to the estimated slope value w?
• Repeat with an L1 penalty with alpha > 0
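For reference, here is one hedged sketch of what the two penalized-loss helpers could look like for a scalar slope w; the function names come from the activity, but these signatures and the use of mean squared error are assumptions, and the notebook may define them differently.

```python
import numpy as np

# Assumed signatures -- the GradientDescentDemo notebook may use different ones.
def calc_penalized_loss(w, x, y, alpha):
    """Mean squared error of the scalar-slope model y_hat = w * x, plus an L2 penalty alpha * w^2."""
    residual = y - w * x
    return np.mean(residual ** 2) + alpha * w ** 2

def calc_penalized_grad(w, x, y, alpha):
    """Derivative of calc_penalized_loss with respect to w."""
    residual = y - w * x
    return np.mean(-2.0 * x * residual) + 2.0 * alpha * w
```

For the L1 variant, the penalty term becomes alpha * abs(w) and its (sub)gradient alpha * np.sign(w); with alpha > 0 both penalties pull the estimated slope toward zero.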