  1. Tufts COMP 135: Introduction to Machine Learning
  https://www.cs.tufts.edu/comp/135/2020f/
  Linear Regression with Polynomial Features, Cross Validation, and Hyperparameter Selection
  Many slides attributable to: Prof. Mike Hughes, Erik Sudderth (UCI), Finale Doshi-Velez (Harvard), James, Witten, Hastie, Tibshirani (ISL/ESL books)

  2. Objectives for Today (day 04)
  • Regression with transformations of features
  • Especially, polynomial features
  • Ways to estimate generalization error
  • Fixed Validation Set
  • K-fold Cross Validation
  • Hyperparameter Selection

  3. What will we learn?
  [Course overview diagram: supervised learning maps training data of label pairs $\{x_n, y_n\}_{n=1}^N$ through a task to a prediction, with a performance measure used for evaluation; unsupervised learning and reinforcement learning are shown alongside.]

  4. Task: Regression
  y is a numeric variable, e.g. sales in $$.
  [Figure: within the supervised learning family, regression fits a curve to (x, y) data.]

  5. Review: Linear Regression
  Optimization problem (“least squares”):
  $$\min_{w,b} \; \sum_{n=1}^{N} \big( y_n - \hat{y}(x_n, w, b) \big)^2$$
  An exact formula for the optimal values of w, b exists. Stack each example, followed by a constant 1, into the matrix $\tilde{X}$:
  $$\tilde{X} = \begin{bmatrix} x_{11} & \dots & x_{1F} & 1 \\ x_{21} & \dots & x_{2F} & 1 \\ \vdots & & \vdots & \vdots \\ x_{N1} & \dots & x_{NF} & 1 \end{bmatrix}, \qquad [w_1 \; \dots \; w_F \; b]^T = (\tilde{X}^T \tilde{X})^{-1} \tilde{X}^T y$$
  The same math works in 1D and for many dimensions.
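To make the closed-form solution on this slide concrete, here is a minimal NumPy sketch (not from the slides): it appends the column of ones and solves the least-squares problem. `np.linalg.lstsq` is used instead of an explicit matrix inverse for numerical stability, but it recovers the same solution as $(\tilde{X}^T \tilde{X})^{-1} \tilde{X}^T y$.

```python
# Minimal sketch (assumption: plain NumPy; not part of the original slides).
import numpy as np

def fit_linear_regression(X, y):
    """Return (w, b) minimizing sum_n (y_n - w^T x_n - b)^2 for X of shape (N, F)."""
    N = X.shape[0]
    X_tilde = np.hstack([X, np.ones((N, 1))])      # append a constant column for the bias b
    # lstsq solves the same normal equations as (X~^T X~)^{-1} X~^T y, more stably.
    theta, *_ = np.linalg.lstsq(X_tilde, y, rcond=None)
    return theta[:-1], theta[-1]                   # weights w_1..w_F, bias b
```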

  6. Transformations of Features

  7. Fitting a line isn’t always ideal

  8. Can fit linear functions to nonlinear features
  A nonlinear function of x:
  $$\hat{y}(x_i) = \theta_0 + \theta_1 x_i + \theta_2 x_i^2 + \theta_3 x_i^3$$
  can be written as a linear function of the transformed features
  $$\phi(x_i) = [1 \;\; x_i \;\; x_i^2 \;\; x_i^3], \qquad \hat{y}(x_i) = \sum_{g=1}^{4} \theta_g \phi_g(x_i) = \theta^T \phi(x_i)$$
  “Linear regression” means linear in the parameters (weights, biases); the features can be arbitrary transforms of the raw data.

  9. What feature transform to use?
  • Anything that works for your data!
  • sin / cos for periodic data
  • Polynomials for high-order dependencies: $\phi(x_i) = [1 \;\; x_i \;\; x_i^2 \;\; \dots]$
  • Interactions between feature dimensions: $\phi(x_i) = [1 \;\; x_{i1} x_{i2} \;\; x_{i3} x_{i4} \;\; \dots]$
  • Many other choices possible
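As an illustration of hand-built transforms like those in the list above, the following sketch builds a feature matrix from a 1D input using a constant, polynomial terms, and optional sin/cos features. The function name `make_features` and its arguments are hypothetical, not from the course code.

```python
# Hypothetical helper illustrating hand-built feature transforms phi(x).
import numpy as np

def make_features(x, degree=3, add_periodic=False):
    """Map x of shape (N,) to a feature matrix Phi of shape (N, G)."""
    cols = [np.ones_like(x)]                          # constant feature (for the bias)
    cols += [x ** d for d in range(1, degree + 1)]    # polynomial terms x, x^2, ..., x^degree
    if add_periodic:
        cols += [np.sin(x), np.cos(x)]                # useful when the data is periodic
    return np.column_stack(cols)
```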

  10. Linear Regression with Transformed Features
  $$\phi(x_i) = [1 \;\; \phi_1(x_i) \;\; \phi_2(x_i) \;\; \dots \;\; \phi_{G-1}(x_i)], \qquad \hat{y}(x_i) = \theta^T \phi(x_i)$$
  Optimization problem (“least squares”):
  $$\min_{\theta} \; \sum_{n=1}^{N} \big( y_n - \theta^T \phi(x_n) \big)^2$$
  Exact solution, using the $N \times G$ feature matrix $\Phi$:
  $$\Phi = \begin{bmatrix} 1 & \phi_1(x_1) & \dots & \phi_{G-1}(x_1) \\ 1 & \phi_1(x_2) & \dots & \phi_{G-1}(x_2) \\ \vdots & & & \vdots \\ 1 & \phi_1(x_N) & \dots & \phi_{G-1}(x_N) \end{bmatrix}, \qquad \theta^* = (\Phi^T \Phi)^{-1} \Phi^T y$$
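In practice this pipeline is often assembled with scikit-learn; the slides show no code, so the sketch below is one assumed implementation that chains `PolynomialFeatures` with `LinearRegression`, so that calling `fit` solves the least-squares problem above on the transformed features.

```python
# Sketch of linear regression on polynomial features with scikit-learn.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

def make_poly_regression(degree):
    # PolynomialFeatures builds [1, x, x^2, ..., x^degree] (plus interaction terms when x
    # has more than one dimension); LinearRegression then estimates theta by least squares.
    return make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())

# Usage (x arrays must be 2D, i.e. shape (N, 1) for a single raw feature):
# model = make_poly_regression(degree=3)
# model.fit(x_train, y_train)
# y_pred = model.predict(x_test)
```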

  11. 0th-degree polynomial features
  [Figure: --- true function, o training data, -- predictions from LR using polynomial features. Credit: slides from a course by Prof. Erik Sudderth (UCI).]

  12. 1st-degree polynomial features
  [Figure: --- true function, o training data, -- predictions from LR using polynomial features. Credit: slides from a course by Prof. Erik Sudderth (UCI).]

  13. 3rd-degree polynomial features
  [Figure: --- true function, o training data, -- predictions from LR using polynomial features. Credit: slides from a course by Prof. Erik Sudderth (UCI).]

  14. 9th-degree polynomial features
  [Figure: --- true function, o training data, -- predictions from LR using polynomial features. Credit: slides from a course by Prof. Erik Sudderth (UCI).]

  15. Error vs. Degree
  [Figure: mean squared error as a function of polynomial degree.]

  16. Error vs. Model Complexity
  [Figure: error as a function of model complexity, from a 0-degree polynomial (underfitting) to a high-degree polynomial (overfitting).]

  17. What to do about underfitting?
  Increase model complexity (add more features!)
  What to do about overfitting?
  • Select among several complexity levels the one that generalizes best (today)
  • Control complexity with a penalty in the training objective (next class)

  18. Hyperparameter Selection
  Selection problem: what polynomial degree to use?
  [Figure: mean squared error vs. polynomial degree for the training and test sets.]
  If we picked the lowest training error, we’d select a 9-degree polynomial. If we picked the lowest test error, we’d select a 3- or 4-degree polynomial.
  “Parameter” (e.g. weight values in linear regression): a numerical variable controlling quality of fit that we can effectively estimate by minimizing error on the training set.
  “Hyperparameter” (e.g. degree of polynomial features): a numerical variable controlling model complexity / quality of fit whose value we cannot effectively estimate from the training set.

  19. Goal of regression (supervised ML) is to generalize: sample to population
  For any regression task, we might want to:
  • Train a model (estimate parameters)
    • Requires calling `fit` on a training labeled dataset
  • Select hyperparameters (e.g. which degree of polynomial?)
    • Requires evaluating predictions on a validation labeled dataset
  • Report its ability on data it has never seen before (“generalization error” or “test error”)
    • Requires comparing predictions to a test labeled dataset
  Should ALWAYS use different labeled datasets to do each of these things!

  20. Two Ways to Measure Generalization Error
  - Fixed Validation Set
  - Cross-Validation

  21. Labeled dataset
  [Figure: feature matrix x of shape N x F and label vector y of shape N x 1.]
  Each row represents one example. Assume rows are arranged “uniformly at random” (order doesn’t matter).

  22. Split into train and test
  [Figure: the labeled dataset (x, y) partitioned by rows into a train set and a test set.]

  23. Selection via Fixed Validation Set
  Option: fit on train, select on validation
  1) Fit each model to training data
  2) Evaluate each model on validation data
  3) Select the model with lowest validation error
  4) Report error on test set
  [Figure: the labeled dataset (x, y) partitioned by rows into train, validation, and test sets.]
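A short sketch of steps 1-4 above, assuming the hypothetical `make_poly_regression` helper from the earlier sketch and pre-made train/validation/test splits; this is one possible implementation, not code from the course.

```python
# Fixed-validation-set selection over polynomial degree (sketch).
from sklearn.metrics import mean_squared_error

def select_degree(x_tr, y_tr, x_va, y_va, x_te, y_te, degrees=range(10)):
    best = None
    for degree in degrees:
        model = make_poly_regression(degree)
        model.fit(x_tr, y_tr)                                   # 1) fit on training data
        va_err = mean_squared_error(y_va, model.predict(x_va))  # 2) evaluate on validation data
        if best is None or va_err < best[1]:
            best = (degree, va_err, model)                      # 3) keep lowest validation error
    degree, _, model = best
    test_err = mean_squared_error(y_te, model.predict(x_te))    # 4) report error on test set
    return degree, test_err
```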

  24. Selection via Fixed Validation Set
  Option: fit on train, select on validation
  1) Fit each model to training data
  2) Evaluate each model on validation data
  3) Select the model with lowest validation error
  4) Report error on test set
  Concerns:
  • What sizes to pick? Will train be too small?
  • Is the validation set used effectively (only to evaluate predictions)?
  [Figure: train / validation / test split of the labeled dataset.]

  25. For small datasets, randomness in the validation split will impact selection
  [Figure: validation error curves for a single random split vs. 10 other random splits. Credit: ISL textbook, Chapter 5.]

  26. 3-fold Cross Validation
  Divide the labeled dataset into 3 even-sized parts (folds 1, 2, 3).
  Fit the model 3 independent times; each time, leave one fold out as validation and keep the remaining folds as training.
  Heldout error estimate: average of the validation error across all 3 fits.
  [Figure: the labeled dataset (x, y) split into folds 1-3, with each of the 3 fits using a different fold as validation.]
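The same heldout estimate can be computed with scikit-learn's `KFold` splitter; the sketch below is an assumed implementation (the slide shows only the picture) that averages the per-fold validation MSE exactly as described above.

```python
# K-fold cross validation for one candidate model (sketch, K = 3 by default).
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

def cross_val_mse(model_factory, x, y, n_folds=3):
    errors = []
    for tr_idx, va_idx in KFold(n_splits=n_folds, shuffle=True, random_state=0).split(x):
        model = model_factory()                    # fresh model for each of the K fits
        model.fit(x[tr_idx], y[tr_idx])            # fit on the K-1 training folds
        pred = model.predict(x[va_idx])            # evaluate on the held-out fold
        errors.append(mean_squared_error(y[va_idx], pred))
    return np.mean(errors)                         # heldout error estimate

# heldout_mse = cross_val_mse(lambda: make_poly_regression(3), x, y, n_folds=3)
```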

  27. K-fold CV: How many folds K?
  • Can do as low as 2 folds
  • Can do as high as N folds (“leave one out”: each fold holds out a single example)
  • Usual rule of thumb: 5-fold or 10-fold CV
  • Computation runtime scales linearly with K
  • Larger K also means each fit uses more training data, so each fit might take longer too
  • Each fit is independent and parallelizable
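If scikit-learn is available, the choice of K maps directly onto the `cv` argument of `cross_val_score`; this is a hedged sketch of that usage, since the slides do not prescribe any particular library.

```python
# Sketch: comparing 5-fold CV with leave-one-out using scikit-learn utilities.
from sklearn.model_selection import cross_val_score, LeaveOneOut

# model = make_poly_regression(3)
# The scorer returns negated MSE, so flip the sign to get an error.
# mse_5fold = -cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error").mean()
# mse_loo   = -cross_val_score(model, X, y, cv=LeaveOneOut(), scoring="neg_mean_squared_error").mean()
```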

  28. Estimating Heldout Error with Cross Validation
  [Figure: heldout error estimates from leave-one-out CV and from 9 separate 10-fold CV runs. Credit: ISL textbook, Chapter 5.]
