Tufts COMP 135: Introduction to Machine Learning
https://www.cs.tufts.edu/comp/135/2019s/
Linear Regression & Gradient Descent
Many slides attributable to: Prof. Mike Hughes, Erik Sudderth (UCI), Finale Doshi-Velez (Harvard), James, Witten, Hastie, Tibshirani (ISL/ESL books)
LR & GD Unit Objectives
• Exact solutions of least squares
  • 1D case without bias
  • 1D case with bias
  • General case
• Gradient descent for least squares
What will we learn?
[Diagram: the course map — supervised learning takes training data of feature/label pairs $\{x_n, y_n\}_{n=1}^{N}$, learns to predict labels $y$ from data $x$, and is evaluated with a performance measure; unsupervised learning and reinforcement learning shown as the other settings.]
Task: Regression
y is a numeric variable, e.g. sales in $$
[Plot: regression example with numeric target y on the vertical axis versus input x on the horizontal axis]
Visualizing errors
Regression: Evaluation Metrics
• mean squared error: $\frac{1}{N} \sum_{n=1}^{N} (y_n - \hat{y}_n)^2$
• mean absolute error: $\frac{1}{N} \sum_{n=1}^{N} |y_n - \hat{y}_n|$
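A minimal NumPy sketch of both metrics (the function and array names here are illustrative, not from the course code):

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    # average of squared residuals: (1/N) * sum_n (y_n - yhat_n)^2
    return np.mean((y_true - y_pred) ** 2)

def mean_absolute_error(y_true, y_pred):
    # average of absolute residuals: (1/N) * sum_n |y_n - yhat_n|
    return np.mean(np.abs(y_true - y_pred))

# tiny usage example with made-up values
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])
print(mean_squared_error(y_true, y_pred))   # 0.375
print(mean_absolute_error(y_true, y_pred))  # 0.5
```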
Linear Regression
Parameters:
$w = [w_1, w_2, \ldots, w_f, \ldots, w_F]$  weight vector
$b$  bias scalar
Prediction:
$\hat{y}(x_i) \triangleq \sum_{f=1}^{F} w_f x_{if} + b$
Training: find weights and bias that minimize error
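The prediction rule is just a matrix-vector product plus the bias. A hedged sketch, assuming X holds one example per row:

```python
import numpy as np

def predict(X, w, b):
    """Linear prediction yhat_n = sum_f w_f * x_nf + b for each row of X.

    X : (N, F) feature array, w : (F,) weight vector, b : scalar bias.
    (These names are illustrative, not the course starter code.)
    """
    return X @ w + b

# example with F = 2 features
X = np.array([[1.0, 2.0], [3.0, 4.0]])
w = np.array([0.5, -1.0])
b = 0.1
print(predict(X, w, b))  # [1*0.5 - 2 + 0.1, 3*0.5 - 4 + 0.1] = [-1.4, -2.4]
```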
Sales vs. Ad Budgets
Linear Regression: Training
Optimization problem: “Least Squares”
$\min_{w,b} \sum_{n=1}^{N} \left( y_n - \hat{y}(x_n, w, b) \right)^2$
Linear Regression: Training
Optimization problem: “Least Squares”
$\min_{w,b} \sum_{n=1}^{N} \left( y_n - \hat{y}(x_n, w, b) \right)^2$
An exact formula for the optimal values of w, b exists!
With only one feature (F = 1):
$\bar{x} = \text{mean}(x_1, \ldots, x_N), \quad \bar{y} = \text{mean}(y_1, \ldots, y_N)$
$w = \frac{\sum_{n=1}^{N} (x_n - \bar{x})(y_n - \bar{y})}{\sum_{n=1}^{N} (x_n - \bar{x})^2}, \quad b = \bar{y} - w \bar{x}$
Where does this come from?
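A small sketch of the F = 1 closed-form solution above (names are illustrative):

```python
import numpy as np

def fit_least_squares_1d(x, y):
    # closed-form solution for one feature:
    #   w = sum_n (x_n - xbar)(y_n - ybar) / sum_n (x_n - xbar)^2
    #   b = ybar - w * xbar
    xbar, ybar = x.mean(), y.mean()
    w = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
    b = ybar - w * xbar
    return w, b

# sanity check on data that lies exactly on y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0
print(fit_least_squares_1d(x, y))  # approximately (2.0, 1.0)
```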
Linear Regression: Training
Optimization problem: “Least Squares”
$\min_{w,b} \sum_{n=1}^{N} \left( y_n - \hat{y}(x_n, w, b) \right)^2$
An exact formula for the optimal values of w, b exists!
With many features (F >= 1), stack each example’s features plus a trailing 1 into the design matrix:
$\tilde{X} = \begin{bmatrix} x_{11} & \ldots & x_{1F} & 1 \\ x_{21} & \ldots & x_{2F} & 1 \\ & \vdots & & \\ x_{N1} & \ldots & x_{NF} & 1 \end{bmatrix}$
$[w_1 \; \ldots \; w_F \; b]^T = (\tilde{X}^T \tilde{X})^{-1} \tilde{X}^T y$
Where does this come from?
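A sketch of the general closed-form solution, assuming X is an (N, F) NumPy array; solving the normal equations with np.linalg.solve avoids forming the inverse explicitly but implements the same formula:

```python
import numpy as np

def fit_least_squares(X, y):
    """Solve [w_1 ... w_F b]^T = (Xt^T Xt)^{-1} Xt^T y,
    where Xt is X with a column of ones appended for the bias."""
    N = X.shape[0]
    X_tilde = np.hstack([X, np.ones((N, 1))])
    # solving the linear system is numerically preferable to an explicit inverse
    theta = np.linalg.solve(X_tilde.T @ X_tilde, X_tilde.T @ y)
    w, b = theta[:-1], theta[-1]
    return w, b

# example: data generated from y = 3*x1 - 2*x2 + 5 with no noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([3.0, -2.0]) + 5.0
print(fit_least_squares(X, y))  # approximately (array([3., -2.]), 5.0)
```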
Derivation Notes
http://www.cs.tufts.edu/comp/135/2019s/notes/day03_linear_regression.pdf
When does the Least Squares estimator exist?
• Fewer examples than features (N < F): infinitely many solutions!
• Same number of examples and features (N = F): optimum exists if X is full rank
• More examples than features (N > F): optimum exists if X is full rank
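One way to see these cases numerically is to check the rank of $\tilde{X}^T \tilde{X}$; a hedged sketch with made-up random data:

```python
import numpy as np

rng = np.random.default_rng(0)

# N < F: fewer examples than features, so X_tilde^T X_tilde is singular
# and infinitely many weight vectors fit the data exactly.
X_wide = rng.normal(size=(3, 5))
X_tilde = np.hstack([X_wide, np.ones((3, 1))])
print(np.linalg.matrix_rank(X_tilde.T @ X_tilde))  # at most 3 < 6: not invertible

# N > F with full-rank features: the normal equations have a unique solution.
X_tall = rng.normal(size=(50, 5))
X_tilde = np.hstack([X_tall, np.ones((50, 1))])
print(np.linalg.matrix_rank(X_tilde.T @ X_tilde))  # 6: invertible
```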
More compact notation
$\theta = [b \; w_1 \; w_2 \; \ldots \; w_F]$
$\tilde{x}_n = [1 \; x_{n1} \; x_{n2} \; \ldots \; x_{nF}]$
$\hat{y}(x_n, \theta) = \theta^T \tilde{x}_n$
$J(\theta) \triangleq \sum_{n=1}^{N} (y_n - \hat{y}(x_n, \theta))^2$
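In code, the cost $J(\theta)$ in this compact notation might look like the following sketch (the leading column of ones supplies the bias term, matching $\tilde{x}_n$ above):

```python
import numpy as np

def cost_J(theta, X, y):
    """Sum-of-squared-errors cost J(theta) = sum_n (y_n - theta^T xtilde_n)^2.

    theta = [b, w_1, ..., w_F]; X is an (N, F) feature array.
    """
    X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend the all-ones column
    residuals = y - X_tilde @ theta
    return np.sum(residuals ** 2)
```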
Idea: Optimize via small steps
Derivatives point uphill
To minimize, go downhill
Step in the opposite direction of the derivative
Steepest descent algorithm
input: initial $\theta \in \mathbb{R}$
input: step size $\alpha \in \mathbb{R}^+$
while not converged:
    $\theta \leftarrow \theta - \alpha \frac{d}{d\theta} J(\theta)$
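A minimal sketch of this loop, using a fixed iteration budget in place of the “not converged” test (convergence checks are discussed a few slides later):

```python
def steepest_descent(grad_J, theta_init, alpha=0.01, n_iters=1000):
    """Repeatedly step opposite the derivative: theta <- theta - alpha * dJ/dtheta.

    grad_J : function returning the derivative of the cost at theta.
    """
    theta = theta_init
    for _ in range(n_iters):
        theta = theta - alpha * grad_J(theta)
    return theta

# toy example: minimize J(theta) = (theta - 3)^2, whose derivative is 2*(theta - 3)
theta_star = steepest_descent(lambda t: 2.0 * (t - 3.0), theta_init=0.0)
print(theta_star)  # close to 3.0
```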
How to set step size?
How to set step size?
• Simple and usually effective: pick a small constant, e.g. $\alpha = 0.01$
• Improve: decay over iterations, e.g. $\alpha_t = \frac{C}{t}$ or $\alpha_t = (C + t)^{-0.9}$
• Improve: line search for the best value at each step
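The exact decay schedules above are reconstructed from the slide, so treat the forms below as illustrative; the pattern is simply that $\alpha_t$ shrinks as the iteration count $t$ grows (C is a tuning constant):

```python
C = 1.0
alpha_constant = lambda t: 0.01              # simple: small constant step
alpha_inverse = lambda t: C / (t + 1)        # decay like C / t
alpha_power = lambda t: (C + t) ** -0.9      # decay like (C + t)^{-0.9}

for t in [0, 10, 100]:
    print(t, alpha_constant(t), alpha_inverse(t), alpha_power(t))
```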
How to assess convergence?
• Ideal: stop when the derivative equals zero
• Practical heuristics: stop when …
  • the change in loss becomes small: $|J(\theta_t) - J(\theta_{t-1})| < \epsilon$
  • the step size is indistinguishable from zero: $\alpha \, |\frac{d}{d\theta} J(\theta)| < \epsilon$
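These heuristics translate directly into a stopping test; the sketch below extends the scalar test $\alpha |\frac{d}{d\theta} J(\theta)| < \epsilon$ to a vector $\theta$ by taking the largest partial derivative (one reasonable choice, not the only one):

```python
import numpy as np

def has_converged(J_curr, J_prev, grad, alpha, eps=1e-6):
    """Return True if either practical heuristic fires (eps is a small tolerance)."""
    small_loss_change = abs(J_curr - J_prev) < eps      # loss barely changed
    tiny_step = alpha * np.max(np.abs(grad)) < eps      # update indistinguishable from zero
    return small_loss_change or tiny_step
```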
Visualizing the cost function
“Level set” contours: all points with the same function value
In 2D parameter space
gradient = vector of partial derivatives
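For the least-squares cost $J(\theta)$ in the compact notation above, each partial derivative has a closed form, and stacking them gives the gradient vector. A hedged sketch:

```python
import numpy as np

def grad_J(theta, X_tilde, y):
    # For J(theta) = sum_n (y_n - theta^T xtilde_n)^2, each partial derivative is
    #   dJ/dtheta_f = -2 * sum_n (y_n - theta^T xtilde_n) * xtilde_nf,
    # and stacking over f gives the gradient vector computed below.
    residuals = y - X_tilde @ theta
    return -2.0 * X_tilde.T @ residuals
```

Plugging a gradient like this into the steepest-descent loop sketched earlier gives a gradient-descent solver for least squares (with a suitably small step size).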
Gradient Descent DEMO
https://github.com/tufts-ml-courses/comp135-19s-assignments/blob/master/labs/GradientDescentDemo.ipynb
Fitting a line isn’t always ideal
Can fit linear functions to nonlinear features
A nonlinear function of x:
$\hat{y}(x_i) = \theta_0 + \theta_1 x_i + \theta_2 x_i^2 + \theta_3 x_i^3$
can be written as a linear function of $\phi(x_i) = [x_i \; x_i^2 \; x_i^3]$:
$\hat{y}(\phi(x_i)) = \theta_0 + \theta_1 \phi(x_i)_1 + \theta_2 \phi(x_i)_2 + \theta_3 \phi(x_i)_3$
“Linear regression” means linear in the parameters (weights, biases).
Features can be arbitrary transforms of raw data.
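A short sketch of this idea: build the cubic feature transform $\phi(x_i)$ and fit it with an ordinary linear least-squares solver (the data here is synthetic, generated just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=200)
y = 1.0 + 2.0 * x - 0.5 * x**2 + 0.3 * x**3 + rng.normal(scale=0.1, size=200)

# phi(x) = [x, x^2, x^3], plus a column of ones for the bias
Phi = np.column_stack([x, x**2, x**3, np.ones_like(x)])
theta = np.linalg.lstsq(Phi, y, rcond=None)[0]
print(theta)  # approximately [2.0, -0.5, 0.3, 1.0]
```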
What feature transform to use?
• Anything that works for your data!
• sin / cos for periodic data
• polynomials for high-order dependencies: $\phi(x_i) = [x_i \; x_i^2 \; x_i^3]$
• interactions between feature dimensions: $\phi(x_i) = [x_{i1} x_{i2} \; x_{i3} x_{i4}]$
• Many other choices possible
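One hypothetical transform mixing these ideas (sin/cos terms plus a pairwise interaction); the exact choice is problem-dependent:

```python
import numpy as np

def make_features(X):
    """Example transform: raw features, sin/cos of the first column for
    periodic structure, and an interaction between columns 0 and 1.
    Assumes X is (N, F) with F >= 2; purely illustrative."""
    return np.column_stack([
        X,                      # raw features
        np.sin(X[:, 0]),        # periodic terms
        np.cos(X[:, 0]),
        X[:, 0] * X[:, 1],      # interaction between two feature dimensions
    ])
```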