Linear Regression & Gradient Descent (Tufts COMP 135: Introduction to Machine Learning)



  1. Tufts COMP 135: Introduction to Machine Learning (https://www.cs.tufts.edu/comp/135/2019s/) Linear Regression & Gradient Descent. Many slides attributable to: Prof. Mike Hughes; Erik Sudderth (UCI); Finale Doshi-Velez (Harvard); James, Witten, Hastie, Tibshirani (ISL/ESL books).

  2. LR & GD Unit Objectives
     • Exact solutions of least squares
       • 1D case without bias
       • 1D case with bias
       • General case
     • Gradient descent for least squares

  3. What will we learn?
     [Course overview diagram: supervised learning, trained on data/label pairs $\{x_n, y_n\}_{n=1}^{N}$ for a given task, evaluated with a performance measure, and used for prediction; shown alongside unsupervised learning and reinforcement learning.]

  4. Task: Regression
     y is a numeric variable, e.g. sales in $$.
     [Figure: plot of y versus x for a regression task.]

  5. Visualizing errors

  6. Regression: Evaluation Metrics
     • mean squared error: $\frac{1}{N} \sum_{n=1}^{N} (y_n - \hat{y}_n)^2$
     • mean absolute error: $\frac{1}{N} \sum_{n=1}^{N} |y_n - \hat{y}_n|$
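Both metrics are one-liners in code; below is a minimal NumPy sketch (not from the slides), assuming y_true and y_pred are equal-length 1-D arrays of true and predicted values.

    import numpy as np

    def mean_squared_error(y_true, y_pred):
        # (1/N) * sum_n (y_n - yhat_n)^2
        return np.mean((y_true - y_pred) ** 2)

    def mean_absolute_error(y_true, y_pred):
        # (1/N) * sum_n |y_n - yhat_n|
        return np.mean(np.abs(y_true - y_pred))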

  7. Linear Regression
     Parameters:
     • weight vector $w = [w_1, w_2, \ldots, w_f, \ldots, w_F]$
     • bias scalar $b$
     Prediction: $\hat{y}(x_i) \triangleq \sum_{f=1}^{F} w_f x_{if} + b$
     Training: find weights and bias that minimize error.
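As an illustration of this prediction rule (not from the slides), here is a minimal NumPy sketch; x_NF, w_F, and b are hypothetical names for an (N, F) feature array, a length-F weight vector, and the scalar bias.

    import numpy as np

    def predict(x_NF, w_F, b):
        # yhat_n = sum_f w_f * x_nf + b, computed for all N examples at once
        return np.dot(x_NF, w_F) + b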

  8. Sales vs. Ad Budgets

  9. Linear Regression: Training
     Optimization problem: "Least Squares"
     $\min_{w,b} \sum_{n=1}^{N} \left( y_n - \hat{y}(x_n, w, b) \right)^2$

  10. Linear Regression: Training
      Optimization problem: "Least Squares"
      $\min_{w,b} \sum_{n=1}^{N} \left( y_n - \hat{y}(x_n, w, b) \right)^2$
      Exact formulas for the optimal values of w, b exist!
      With only one feature (F = 1):
      $\bar{x} = \mathrm{mean}(x_1, \ldots, x_N), \quad \bar{y} = \mathrm{mean}(y_1, \ldots, y_N)$
      $w = \frac{\sum_{n=1}^{N} (x_n - \bar{x})(y_n - \bar{y})}{\sum_{n=1}^{N} (x_n - \bar{x})^2}, \quad b = \bar{y} - w \bar{x}$
      Where does this come from?
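A minimal NumPy sketch of this 1D closed-form solution (not from the slides), assuming x_N and y_N are equal-length 1-D arrays:

    import numpy as np

    def fit_least_squares_1d(x_N, y_N):
        # Closed-form least-squares fit for a single feature (F = 1)
        xbar = np.mean(x_N)
        ybar = np.mean(y_N)
        w = np.sum((x_N - xbar) * (y_N - ybar)) / np.sum((x_N - xbar) ** 2)
        b = ybar - w * xbar
        return w, b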

  11. Linear Regression: Training
      Optimization problem: "Least Squares"
      $\min_{w,b} \sum_{n=1}^{N} \left( y_n - \hat{y}(x_n, w, b) \right)^2$
      Exact formulas for the optimal values of w, b exist!
      With many features (F >= 1):
      $\tilde{X} = \begin{bmatrix} x_{11} & \cdots & x_{1F} & 1 \\ x_{21} & \cdots & x_{2F} & 1 \\ \vdots & & \vdots & \vdots \\ x_{N1} & \cdots & x_{NF} & 1 \end{bmatrix}$
      $[w_1 \; \ldots \; w_F \; b]^T = (\tilde{X}^T \tilde{X})^{-1} \tilde{X}^T y$
      Where does this come from?
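A minimal NumPy sketch of the general closed-form fit (not from the slides). It solves the same least-squares problem with np.linalg.lstsq rather than forming the explicit inverse, which is the numerically safer route; x_NF and y_N are hypothetical names.

    import numpy as np

    def fit_least_squares(x_NF, y_N):
        # Append a constant column of ones so the last coefficient is the bias b
        xtilde_NG = np.hstack([x_NF, np.ones((x_NF.shape[0], 1))])
        # Minimizes ||xtilde @ theta - y||^2, equivalent to the formula above
        theta_G, _, _, _ = np.linalg.lstsq(xtilde_NG, y_N, rcond=None)
        return theta_G[:-1], theta_G[-1]   # weights w, bias b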

  12. Derivation Notes
      http://www.cs.tufts.edu/comp/135/2019s/notes/day03_linear_regression.pdf

  13. When does the Least Squares estimator exist?
      • Fewer examples than features (N < F): infinitely many solutions!
      • Same number of examples and features (N = F): optimum exists if X is full rank
      • More examples than features (N > F): optimum exists if X is full rank
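One way to check the full-rank condition numerically (a sketch, not from the slides): a unique least-squares optimum exists exactly when the ones-augmented design matrix has full column rank, i.e. when $\tilde{X}^T \tilde{X}$ is invertible.

    import numpy as np

    def has_unique_least_squares_solution(xtilde_NG):
        # Full column rank of xtilde  <=>  xtilde^T xtilde is invertible
        n_rows, n_cols = xtilde_NG.shape
        return np.linalg.matrix_rank(xtilde_NG) == n_cols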

  14. More compact notation
      $\theta = [b \; w_1 \; w_2 \; \ldots \; w_F]$
      $\tilde{x}_n = [1 \; x_{n1} \; x_{n2} \; \ldots \; x_{nF}]$
      $\hat{y}(x_n, \theta) = \theta^T \tilde{x}_n$
      $J(\theta) \triangleq \sum_{n=1}^{N} (y_n - \hat{y}(x_n, \theta))^2$
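A minimal NumPy sketch of this cost function (not from the slides), assuming theta_G is the bias-plus-weights vector and xtilde_NG already carries the constant column of ones:

    import numpy as np

    def calc_cost(theta_G, xtilde_NG, y_N):
        # J(theta) = sum_n (y_n - theta^T xtilde_n)^2
        yhat_N = np.dot(xtilde_NG, theta_G)
        return np.sum((y_N - yhat_N) ** 2)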

  15. Idea: Optimize via small steps

  16. Derivatives point uphill

  17. To minimize, go downhill
      Step in the opposite direction of the derivative.

  18. Steepest descent algorithm
      input: initial $\theta \in \mathbb{R}$
      input: step size $\alpha \in \mathbb{R}^+$
      while not converged:
          $\theta \leftarrow \theta - \alpha \frac{d}{d\theta} J(\theta)$

  19. Steepest descent algorithm (algorithm repeated from the previous slide)
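Not from the slides (the course's own demo notebook is linked below on slide 25): a minimal NumPy sketch of steepest descent applied to the least-squares cost J above, with hypothetical names. For that cost the gradient is $-2 \tilde{X}^T (y - \tilde{X}\theta)$.

    import numpy as np

    def calc_grad(theta_G, xtilde_NG, y_N):
        # dJ/dtheta = -2 * xtilde^T (y - xtilde @ theta)
        return -2.0 * np.dot(xtilde_NG.T, y_N - np.dot(xtilde_NG, theta_G))

    def steepest_descent(xtilde_NG, y_N, alpha=0.01, max_iters=10000, tol=1e-8):
        theta_G = np.zeros(xtilde_NG.shape[1])        # initial theta
        for _ in range(max_iters):
            grad_G = calc_grad(theta_G, xtilde_NG, y_N)
            theta_G = theta_G - alpha * grad_G        # step opposite the gradient
            if np.max(np.abs(alpha * grad_G)) < tol:  # crude convergence check
                break
        return theta_G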

  20. How to set step size?

  21. How to set step size?
      • Simple and usually effective: pick a small constant, e.g. $\alpha = 0.01$
      • Improve: decay over iterations, e.g. $\alpha_t = C / t$ or $\alpha_t = (C + t)^{-0.9}$
      • Improve: line search for the best value at each step
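As a small illustration (not from the slides, and the exact schedule is an assumption here), a decaying step size is just a function of the iteration counter t:

    def decayed_step_size(t, C=2.0):
        # alpha_t shrinks as iterations grow; C is a tunable constant
        return (C + t) ** -0.9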

  22. How to assess convergence?
      • Ideal: stop when the derivative equals zero
      • Practical heuristics: stop when ...
        • the change in loss becomes small: $|J(\theta_t) - J(\theta_{t-1})| < \epsilon$
        • the step size is indistinguishable from zero: $\alpha \left| \frac{d}{d\theta} J(\theta) \right| < \epsilon$
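Both heuristics are one line of code; a sketch (not from the slides), with the previous and current loss, step size, and gradient magnitude passed in as hypothetical arguments:

    def has_converged(loss_prev, loss_curr, alpha, grad_magnitude, eps=1e-6):
        # Stop when the loss barely changes OR the scaled gradient is near zero
        return abs(loss_curr - loss_prev) < eps or alpha * grad_magnitude < eps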

  23. Visualizing the cost function
      "Level set" contours: all points with the same function value.

  24. In 2D parameter space
      gradient = vector of partial derivatives
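To make "vector of partial derivatives" concrete, here is a generic finite-difference sketch (not from the slides, and not how the gradient is computed in practice): each entry is the derivative of the cost with respect to one parameter, holding the others fixed.

    import numpy as np

    def numerical_gradient(f, theta_G, h=1e-6):
        # Approximate each partial derivative with a central finite difference
        grad_G = np.zeros_like(theta_G, dtype=float)
        for g in range(theta_G.size):
            step = np.zeros_like(theta_G, dtype=float)
            step[g] = h
            grad_G[g] = (f(theta_G + step) - f(theta_G - step)) / (2 * h)
        return grad_G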

  25. Gradient Descent DEMO
      https://github.com/tufts-ml-courses/comp135-19s-assignments/blob/master/labs/GradientDescentDemo.ipynb

  26. Fitting a line isn't always ideal

  27. Can fit linear functions to nonlinear features
      A nonlinear function of x:
      $\hat{y}(x_i) = \theta_0 + \theta_1 x_i + \theta_2 x_i^2 + \theta_3 x_i^3$
      can be written as a linear function of $\phi(x_i) = [x_i \;\; x_i^2 \;\; x_i^3]$:
      $\hat{y}(\phi(x_i)) = \theta_0 + \theta_1 \phi(x_i)_1 + \theta_2 \phi(x_i)_2 + \theta_3 \phi(x_i)_3$
      "Linear regression" means linear in the parameters (weights, biases).
      Features can be arbitrary transforms of the raw data.
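A minimal sketch of this polynomial feature transform (not from the slides); poly_features is a hypothetical helper whose output can be handed to any of the least-squares fits sketched earlier.

    import numpy as np

    def poly_features(x_N, degree=3):
        # phi(x_i) = [x_i, x_i^2, ..., x_i^degree], stacked as columns
        return np.stack([x_N ** d for d in range(1, degree + 1)], axis=1)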

  28. What feature transform to use?
      • Anything that works for your data!
      • sin / cos for periodic data
      • polynomials for high-order dependencies: $\phi(x_i) = [x_i \;\; x_i^2 \;\; x_i^3]$
      • interactions between feature dimensions: $\phi(x_i) = [x_{i1} x_{i2} \;\; x_{i3} x_{i4}]$
      • Many other choices possible
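A sketch of such transforms (not from the slides; the function name and column choices are purely illustrative):

    import numpy as np

    def custom_features(x_NF):
        # Keep the raw columns, add periodic transforms of column 0,
        # and add one pairwise interaction (product of columns 0 and 1)
        return np.column_stack([
            x_NF,
            np.sin(x_NF[:, 0]),
            np.cos(x_NF[:, 0]),
            x_NF[:, 0] * x_NF[:, 1],
        ])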
