Cross Validation and Penalized Linear Regression


  1. Tufts COMP 135: Introduction to Machine Learning (https://www.cs.tufts.edu/comp/135/2019s/). Cross Validation and Penalized Linear Regression. Many slides attributable to: Prof. Mike Hughes; Erik Sudderth (UCI); Finale Doshi-Velez (Harvard); James, Witten, Hastie, Tibshirani (ISL/ESL books)

  2. CV & Penalized LR Objectives • Regression with transformations of features • Cross Validation • L2 penalties • L1 penalties Mike Hughes - Tufts COMP 135 - Spring 2019 3

  3. What will we learn? [Diagram: supervised learning pipeline. Training data of label pairs {x_n, y_n}, n = 1...N, feeds a learning task; a performance measure evaluates the resulting predictions. Supervised learning (data x, label y) is contrasted with unsupervised learning and reinforcement learning.]

  4. Task: Regression. y is a numeric variable, e.g. sales in $$. [Plot: regression fit of y vs. x.]

  5. Review: Linear Regression. Optimization problem: "Least Squares"
$$\min_{w,b} \sum_{n=1}^{N} \left( y_n - \hat{y}(x_n, w, b) \right)^2$$
An exact formula for the optimal values of w, b exists. Stack the features with a column of ones:
$$\tilde{X} = \begin{bmatrix} x_{11} & \dots & x_{1F} & 1 \\ x_{21} & \dots & x_{2F} & 1 \\ \vdots & & & \vdots \\ x_{N1} & \dots & x_{NF} & 1 \end{bmatrix}, \qquad [w_1 \; \dots \; w_F \; b]^T = (\tilde{X}^T \tilde{X})^{-1} \tilde{X}^T y$$
The math works in 1D and in many dimensions.
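A minimal numpy sketch of this closed-form solution (the function name fit_least_squares and the data layout are illustrative, not from the course materials):

```python
import numpy as np

def fit_least_squares(X, y):
    """Least squares via the normal equations: [w; b] = (X~^T X~)^{-1} X~^T y."""
    N = X.shape[0]
    X_tilde = np.hstack([X, np.ones((N, 1))])        # append a column of ones for the bias b
    theta = np.linalg.solve(X_tilde.T @ X_tilde,     # solve (X~^T X~) theta = X~^T y
                            X_tilde.T @ y)
    return theta[:-1], theta[-1]                     # weights w (F,), bias b (scalar)
```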

  6. Recap: solving linear regression.
• More examples than features (N > F): if the inverse of X^T X exists (needs to be full rank), then an optimal weight vector exists and the formula applies. The fit likely has non-zero error (overdetermined).
• Same number of examples and features (N = F): if the inverse of X^T X exists (needs to be full rank), then an optimal weight vector exists and the formula applies. The fit will have zero error on the training set.
• Fewer examples than features (N < F), or low rank: infinitely many optimal weight vectors exist with zero error. The inverse of X^T X does not exist (naively, the formula will fail); a small illustration follows below.
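As a small illustration of the third case (the random data here is made up, not from the slides), numpy's least-squares solver still returns one of the infinitely many zero-error solutions when the naive formula would fail:

```python
import numpy as np

rng = np.random.default_rng(0)
X_tilde = rng.normal(size=(5, 10))    # N=5 examples, 10 columns: fewer examples than features
y = rng.normal(size=5)

# Naively inverting X_tilde.T @ X_tilde (a rank-5, 10x10 matrix) is not meaningful here.
theta, *_ = np.linalg.lstsq(X_tilde, y, rcond=None)   # returns the minimum-norm optimal solution
print(np.allclose(X_tilde @ theta, y))                 # True: zero error on the training set
```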

  7. Recap. Squared error is special: exact formulas exist for estimating the parameters. Most metrics do not have exact formulas; taking the derivative, setting it to zero, and trying to solve is HARD. Example: absolute error. General algorithm: gradient descent. As long as the first derivative exists, we can iterate to estimate optimal parameters (a sketch follows below).
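A hedged sketch of (sub)gradient descent for a metric with no exact formula, here mean absolute error with a single scalar slope w (the data and step size are made up for illustration):

```python
import numpy as np

def gradient_descent(grad, w_init=0.0, step_size=0.05, n_steps=2000):
    """Generic (sub)gradient descent on a single scalar parameter."""
    w = w_init
    for _ in range(n_steps):
        w = w - step_size * grad(w)
    return w

# Mean absolute error (1/N) sum_n |y_n - w x_n| has no closed-form minimizer in general.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.1, 3.9, 6.2])
grad_mae = lambda w: np.mean(-x * np.sign(y - w * x))   # subgradient of the MAE with respect to w
w_hat = gradient_descent(grad_mae)
```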

  8. Transformations of Features

  9. Fitting a line isn't always ideal

  10. Can fit linear functions to nonlinear features. A nonlinear function of x:
$$\hat{y}(x_i) = \theta_0 + \theta_1 x_i + \theta_2 x_i^2 + \theta_3 x_i^3$$
can be written as a linear function of the transformed features $\phi(x_i) = [1 \;\; x_i \;\; x_i^2 \;\; x_i^3]$:
$$\hat{y}(x_i) = \sum_{g=1}^{4} \theta_g \, \phi_g(x_i) = \theta^T \phi(x_i)$$
"Linear regression" means linear in the parameters (weights, biases). Features can be arbitrary transforms of the raw data.

  11. What feature transform to use? Anything that works for your data!
• sin / cos for periodic data
• polynomials for high-order dependencies: $\phi(x_i) = [1 \;\; x_i \;\; x_i^2 \;\; \dots]$
• interactions between feature dimensions, e.g. products of pairs: $\phi(x_i) = [1 \;\; x_{i1} x_{i2} \;\; x_{i3} x_{i4} \;\; \dots]$
• Many other choices possible

  12. Linear Regression with Transformed Features.
$$\phi(x_i) = [1 \;\; \phi_1(x_i) \;\; \phi_2(x_i) \;\; \dots \;\; \phi_{G-1}(x_i)], \qquad \hat{y}(x_i) = \theta^T \phi(x_i)$$
Optimization problem: "Least Squares"
$$\min_{\theta} \sum_{n=1}^{N} \left( y_n - \theta^T \phi(x_n) \right)^2$$
Exact solution, with $\Phi$ the $N \times G$ matrix whose n-th row is $[1 \;\; \phi_1(x_n) \;\; \dots \;\; \phi_{G-1}(x_n)]$:
$$\theta^* = (\Phi^T \Phi)^{-1} \Phi^T y$$
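A hedged numpy sketch of least squares on polynomial-transformed features (function names and the toy data are illustrative):

```python
import numpy as np

def poly_features(x, degree):
    """Map a 1-D array x to rows [1, x, x^2, ..., x^degree]: an N x G matrix, G = degree + 1."""
    return np.vstack([x ** g for g in range(degree + 1)]).T

def fit_transformed_least_squares(x, y, degree):
    """theta* = (Phi^T Phi)^{-1} Phi^T y with Phi the transformed design matrix."""
    Phi = poly_features(x, degree)
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

# Example: fit a cubic to noisy sine data.
x = np.linspace(0.0, 1.0, 20)
y = np.sin(2 * np.pi * x) + 0.1 * np.random.randn(20)
theta = fit_transformed_least_squares(x, y, degree=3)
y_hat = poly_features(x, 3) @ theta
```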

  13. Cross Validation

  14. Generalize: sample to population

  15. Labeled dataset. [Table: columns x and y.] Each row represents one example. Assume rows are arranged "uniformly at random" (order doesn't matter).

  16. Split into train and test. [Figure: the labeled dataset (x, y) split by rows into a train portion and a test portion.]

  17. Model Complexity vs Error. [Plot: error vs. model complexity; underfitting on the low-complexity side, overfitting on the high-complexity side.]

  18. How to fit the best model? Option: fit on train, select on validation. 1) Fit each model to training data. 2) Evaluate each model on validation data. 3) Select the model with lowest validation error. 4) Report error on the test set. [Figure: dataset split by rows into train, validation, and test.]

  19. How to fit the best model? (Same fit-on-train, select-on-validation procedure as above; a sketch appears below.) Concerns: • Will the train set be too small? • Can we make better use of the data?
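One way the fit-on-train, select-on-validation procedure might look in numpy, assuming the candidate models are polynomials of different degrees (function names and the degree list are illustrative):

```python
import numpy as np

def poly_design(x, degree):
    return np.vstack([x ** g for g in range(degree + 1)]).T

def select_on_validation(x_tr, y_tr, x_va, y_va, degrees=(0, 1, 3, 9)):
    """Fit each candidate degree on train, return the degree with lowest validation MSE."""
    best_err, best_degree, best_theta = np.inf, None, None
    for d in degrees:
        theta = np.linalg.lstsq(poly_design(x_tr, d), y_tr, rcond=None)[0]  # step 1: fit on train
        err = np.mean((y_va - poly_design(x_va, d) @ theta) ** 2)           # step 2: evaluate on validation
        if err < best_err:                                                  # step 3: keep the lowest
            best_err, best_degree, best_theta = err, d, theta
    return best_degree, best_theta   # step 4: report this model's error on a separate test set
```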

  20. Estimating Heldout Error with a Fixed Validation Set. [Figure: estimated heldout error from a single random split (left) and from 10 other random splits (right). Credit: ISL Textbook, Chapter 5.]

  21. 3-fold Cross Validation. Divide the labeled dataset into 3 even-sized parts (folds 1, 2, 3). Fit the model 3 independent times; each time leave one fold out as validation and keep the remaining folds as training. Heldout error estimate: the average of the validation error across all 3 fits.

  22. K-fold CV: how many folds K? • Can go as low as 2 folds. • Can go as high as N folds ("leave one out": each fold holds out a single example). • Usual rule of thumb: 5-fold or 10-fold CV. • Computation runtime scales linearly with K. • Larger K also means each fit uses more training data, so each fit might take longer too. • Each fit is independent and parallelizable. (A sketch appears below.)
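A hedged sketch of K-fold CV in numpy; x and y are assumed to be numpy arrays indexed by row, and `fit` and `error` stand in for whatever model-fitting and error functions you are using (they are assumptions, not course code):

```python
import numpy as np

def kfold_heldout_error(x, y, fit, error, K=5, seed=0):
    """Average the validation error over K train/validation splits."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, K)
    errors = []
    for k in range(K):
        va = folds[k]                                                  # fold k is held out for validation
        tr = np.concatenate([folds[j] for j in range(K) if j != k])    # remaining folds are training data
        model = fit(x[tr], y[tr])                                      # fit returns a predictor callable
        errors.append(error(y[va], model(x[va])))
    return np.mean(errors)
```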

  23. Estimating Heldout Error with Cross Validation. [Figure: leave-one-out CV error estimate (left) and 9 separate runs of 10-fold CV (right). Credit: ISL Textbook, Chapter 5.]

  24. What to do about underfitting? • Increase model complexity • Add more features!

  25. What to do about overfitting? • Select complexity with cross validation • Control single-fit complexity with a penalty!

  26. Zero degree polynomial. Credit: Slides from course by Prof. Erik Sudderth (UCI)

  27. 1st degree polynomial. Credit: Slides from course by Prof. Erik Sudderth (UCI)

  28. 3rd degree polynomial. Credit: Slides from course by Prof. Erik Sudderth (UCI)

  29. 9th degree polynomial. Credit: Slides from course by Prof. Erik Sudderth (UCI)

  30. Error vs Complexity. [Plot: square root of mean squared error vs. polynomial degree.]

  31. [Figure: comparison of fits across polynomial degrees 0, 1, 3, and 9.] Credit: Slides from course by Prof. Erik Sudderth (UCI)

  32. Idea: Penalize magnitude of weights.
$$J(\theta) = \frac{1}{2} \sum_{n=1}^{N} \left( y_n - \theta^T \tilde{x}_n \right)^2 + \alpha \sum_{f} \theta_f^2, \qquad \alpha \ge 0$$
Penalty strength: larger alpha means we prefer smaller magnitude weights.

  33. Idea: Penalize magnitude of weights. The same objective, written via matrix/vector product notation:
$$J(\theta) = \frac{1}{2} (y - \tilde{X}\theta)^T (y - \tilde{X}\theta) + \alpha\, \theta^T \theta$$

  34. Exact solution for L2 penalized linear regression. Optimization problem: "Penalized Least Squares"
$$\min_{\theta} \; \frac{1}{2} (y - \tilde{X}\theta)^T (y - \tilde{X}\theta) + \alpha\, \theta^T \theta$$
Solution:
$$\theta^* = (\tilde{X}^T \tilde{X} + \alpha I)^{-1} \tilde{X}^T y$$
If alpha > 0, this matrix is always invertible!
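A minimal sketch of this closed-form solution (the function name is illustrative; note that the formula as written penalizes every column of X~, while scikit-learn's Ridge, for comparison, fits an unpenalized intercept separately):

```python
import numpy as np

def fit_l2_penalized(X_tilde, y, alpha=1.0):
    """theta* = (X~^T X~ + alpha I)^{-1} X~^T y, always solvable when alpha > 0."""
    G = X_tilde.shape[1]
    return np.linalg.solve(X_tilde.T @ X_tilde + alpha * np.eye(G),
                           X_tilde.T @ y)
```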

  35. Slides on L1/L2 penalties. See slides 71-82 from the UC Irvine course here: https://canvas.eee.uci.edu/courses/8278/files/2735313/

  36. Pair Coding Activity: https://github.com/tufts-ml-courses/comp135-19s-assignments/blob/master/labs/GradientDescentDemo.ipynb
• Try the existing gradient descent code: it optimizes a scalar slope to produce minimum error. Try step sizes of 0.0001, 0.02, 0.05, 0.1.
• Add an L2 penalty with alpha > 0: write calc_penalized_loss and calc_penalized_grad. What happens to the estimated slope value w?
• Repeat with an L1 penalty with alpha > 0. (A hypothetical sketch of the penalized functions appears below.)
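The notebook's actual function signatures aren't reproduced here, so the following is only a hypothetical sketch of what calc_penalized_loss and calc_penalized_grad could look like for a single scalar slope w with an L2 penalty:

```python
import numpy as np

def calc_penalized_loss(w, x, y, alpha=1.0):
    """Sum of squared errors plus an L2 penalty on the scalar slope w (hypothetical signature)."""
    return np.sum((y - w * x) ** 2) + alpha * w ** 2

def calc_penalized_grad(w, x, y, alpha=1.0):
    """Gradient of the penalized loss with respect to w."""
    return -2.0 * np.sum(x * (y - w * x)) + 2.0 * alpha * w

# For the L1 variant, the penalty becomes alpha * np.abs(w) and its
# (sub)gradient becomes alpha * np.sign(w); larger alpha pulls w toward zero.
```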
