Lecture 4: Optimization
Justin Johnson
September 16, 2019
Waitlist Update
We will open the course for enrollment later today / tomorrow.
Reminder: Assignment 1
Was due yesterday! (But you do have late days…)
Assignment 2
- Will be released today
- Use SGD to train linear classifiers and fully-connected networks
- After today, can do the linear classifiers section
- After Wednesday, can do the fully-connected networks section
- If you have a hard time computing derivatives, wait for next Monday's lecture on backprop
- Due Monday, September 30, 11:59pm (two weeks from today)
Course Update
- A1: 10%
- A2: 10%
- A3: 10%
- A4: 10%
- A5: 10%
- A6: 10%
- Midterm: 20%
- Final: 20%
Course Update: No Final Exam
            Old    New
A1:         10%    10%
A2:         10%    13%
A3:         10%    13%
A4:         10%    13%
A5:         10%    13%
A6:         10%    13%
Midterm:    20%    25%
Final:      20%    (none)
Expect A5 and A6 to be longer than the other homeworks.
Last Time: Linear Classifiers
- Algebraic Viewpoint: f(x,W) = Wx
- Visual Viewpoint: one template per class
- Geometric Viewpoint: hyperplanes cutting up space
Last Time: Loss Functions quantify preferences
- We have some dataset of (x, y)
- We have a score function (linear classifier): s = f(x; W) = Wx
- We have a loss function:
  Softmax: L_i = -\log\left( e^{s_{y_i}} / \sum_j e^{s_j} \right)
  SVM: L_i = \sum_{j \neq y_i} \max(0, s_j - s_{y_i} + 1)
  Full loss: L = \frac{1}{N} \sum_{i=1}^{N} L_i + \lambda R(W)
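For reference, here is a short sketch of how those two losses are computed from raw scores; the score matrix and labels below are made-up toy numbers, not data from the course.

```python
import numpy as np

# Toy scores s = f(x, W) = Wx for N = 3 examples, C = 3 classes (made-up numbers)
scores = np.array([[3.2, 5.1, -1.7],
                   [1.3, 4.9,  2.0],
                   [2.2, 2.5, -3.1]])
y = np.array([0, 1, 2])                           # correct class for each example
N = scores.shape[0]

# Multiclass SVM loss: L_i = sum_{j != y_i} max(0, s_j - s_{y_i} + 1)
correct = scores[np.arange(N), y][:, None]
margins = np.maximum(0.0, scores - correct + 1.0)
margins[np.arange(N), y] = 0.0
svm_loss = margins.sum(axis=1).mean()

# Softmax loss: L_i = -log(e^{s_{y_i}} / sum_j e^{s_j}), with a max-shift for stability
shifted = scores - scores.max(axis=1, keepdims=True)
log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
softmax_loss = -log_probs[np.arange(N), y].mean()

print(svm_loss, softmax_loss)
```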
Q: How do we find the best W?
Optimization
[Figure: a person walking across a hilly landscape, the visual metaphor for searching for the bottom of the loss landscape. Both images are CC0 1.0 public domain.]
Idea #1: Random Search (bad idea!)
Try many random Ws and keep the one with the lowest loss.
15.5% accuracy! Not bad! (SOTA is ~95%)
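A minimal sketch of the random-search idea, using a toy SVM loss and synthetic data as stand-ins for the real CIFAR-10 setup:

```python
import numpy as np

def svm_loss(W, X, y):
    """Multiclass SVM loss (no regularization) for scores = X @ W.T."""
    N = X.shape[0]
    scores = X @ W.T
    correct = scores[np.arange(N), y][:, None]
    margins = np.maximum(0.0, scores - correct + 1.0)
    margins[np.arange(N), y] = 0.0
    return margins.sum(axis=1).mean()

# Synthetic stand-in for CIFAR-10 features: 3073-dim inputs (with bias), 10 classes
rng = np.random.default_rng(0)
X_train = rng.standard_normal((100, 3073))
y_train = rng.integers(0, 10, size=100)

best_loss, best_W = float("inf"), None
for trial in range(1000):
    W = rng.standard_normal((10, 3073)) * 1e-4    # random candidate weights
    loss = svm_loss(W, X_train, y_train)
    if loss < best_loss:                          # keep the best W seen so far
        best_loss, best_W = loss, W
print(best_loss)
```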
Idea #2: Follow the Slope

In 1 dimension, the derivative of a function gives the slope:
\frac{df(x)}{dx} = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}

In multiple dimensions, the gradient is the vector of partial derivatives along each dimension.
The slope in any direction is the dot product of the direction with the gradient.
The direction of steepest descent is the negative gradient.
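A quick illustration of the limit definition: approximate the slope by plugging in a small finite h (the function and step size below are just for illustration).

```python
def numeric_slope(f, x, h=1e-5):
    """Forward-difference approximation of the 1-D derivative df/dx."""
    return (f(x + h) - f(x)) / h

# Example: the slope of f(x) = x**2 at x = 3 is 6; the estimate is close for small h
print(numeric_slope(lambda x: x ** 2, 3.0))   # ~6.00001
```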
Example: numeric gradient, one dimension at a time (h = 0.0001)

current W: [0.34, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, …], loss 1.25347

W + h (first dim): [0.34 + 0.0001, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, …], loss 1.25322
→ dL/dW[0] ≈ (1.25322 - 1.25347) / 0.0001 = -2.5

W + h (second dim): [0.34, -1.11 + 0.0001, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, …], loss 1.25353
→ dL/dW[1] ≈ (1.25353 - 1.25347) / 0.0001 = 0.6

W + h (third dim): [0.34, -1.11, 0.78 + 0.0001, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, …], loss 1.25347
→ dL/dW[2] ≈ (1.25347 - 1.25347) / 0.0001 = 0.0

gradient dL/dW: [-2.5, 0.6, 0.0, ?, ?, ?, ?, ?, ?, …]

Numeric Gradient:
- Slow: O(#dimensions)
- Approximate
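A minimal sketch of this procedure, with a toy loss function standing in for the real classifier loss:

```python
import numpy as np

def numeric_gradient(f, W, h=1e-4):
    """Estimate dL/dW one entry at a time: slow (O(#dims) evaluations of f), approximate."""
    grad = np.zeros_like(W)
    loss_0 = f(W)
    for idx in np.ndindex(*W.shape):
        old = W[idx]
        W[idx] = old + h                      # nudge one dimension by h
        grad[idx] = (f(W) - loss_0) / h       # finite-difference estimate
        W[idx] = old                          # restore the original value
    return grad

# Toy usage: L(W) = sum(W**2) has gradient 2W
W = np.array([0.34, -1.11, 0.78])
print(numeric_gradient(lambda W: float(np.sum(W ** 2)), W))   # ≈ [0.68, -2.22, 1.56]
```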
Loss is a function of W: Analytic Gradient
We want \nabla_W L: use calculus to compute an analytic gradient.
current W:                                  gradient dL/dW:
[0.34, -1.11, 0.78, 0.12, 0.55,             [-2.5, 0.6, 0, 0.2, 0.7,
 2.81, -3.1, -1.5, 0.33, …]                  -0.5, 1.1, 1.3, -2.1, …]
loss 1.25347

dL/dW = ... (some function of the data and W)
(In practice we will compute dL/dW using backpropagation; see Lecture 6)
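As one concrete, purely illustrative example of "some function of data and W": for the multiclass SVM loss with a linear classifier, the analytic gradient has a simple closed form.

```python
import numpy as np

def svm_loss_and_grad(W, X, y, reg=0.0):
    """Multiclass SVM loss and analytic gradient dL/dW for scores = X @ W.T.
    W: (C, D) weights, X: (N, D) data, y: (N,) integer labels."""
    N = X.shape[0]
    scores = X @ W.T
    correct = scores[np.arange(N), y][:, None]
    margins = np.maximum(0.0, scores - correct + 1.0)
    margins[np.arange(N), y] = 0.0
    loss = margins.sum() / N + reg * np.sum(W * W)

    # Each violated margin contributes +x_i to row j of dW and -x_i to row y_i
    indicator = (margins > 0).astype(X.dtype)            # (N, C)
    indicator[np.arange(N), y] = -indicator.sum(axis=1)
    dW = indicator.T @ X / N + 2 * reg * W               # (C, D)
    return loss, dW
```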
Computing Gradients
- Numeric gradient: approximate, slow, easy to write
- Analytic gradient: exact, fast, error-prone
In practice: always use the analytic gradient, but check your implementation with the numerical gradient. This is called a gradient check.
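A minimal gradient-check sketch; the `loss_and_grad(W)` callable and the toy usage at the bottom are illustrative assumptions, not course code:

```python
import numpy as np

def gradient_check(loss_and_grad, W, num_checks=10, h=1e-6, seed=0):
    """Compare the analytic gradient to a centered numeric estimate at a few
    random coordinates of W and report the relative error."""
    rng = np.random.default_rng(seed)
    _, grad_analytic = loss_and_grad(W)
    for _ in range(num_checks):
        idx = tuple(rng.integers(0, s) for s in W.shape)
        old = W[idx]
        W[idx] = old + h
        loss_plus, _ = loss_and_grad(W)
        W[idx] = old - h
        loss_minus, _ = loss_and_grad(W)
        W[idx] = old                                    # restore the original value
        grad_numeric = (loss_plus - loss_minus) / (2 * h)
        denom = abs(grad_numeric) + abs(grad_analytic[idx]) + 1e-12
        print(f"numeric: {grad_numeric:+.6f}  analytic: {grad_analytic[idx]:+.6f}  "
              f"rel. error: {abs(grad_numeric - grad_analytic[idx]) / denom:.2e}")

# Toy usage: L(W) = sum(W**2) has analytic gradient 2W
W = np.random.default_rng(1).standard_normal((3, 4))
gradient_check(lambda W: (float(np.sum(W ** 2)), 2 * W), W)
```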
Gradient Descent
Iteratively step in the direction of the negative gradient (the direction of local steepest descent); see the sketch below.

Hyperparameters:
- Weight initialization method
- Number of steps
- Learning rate

[Figure: contour plot over (W_1, W_2) showing the original W and the negative gradient direction.]
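A minimal sketch of the update loop on a toy quadratic loss; the loss function and hyperparameter values are illustrative stand-ins for a real training setup:

```python
import numpy as np

# Vanilla gradient descent on a toy quadratic loss L(w) = ||w - w_star||^2
# (an illustrative stand-in for a real training loss).
rng = np.random.default_rng(0)
w_star = np.array([1.0, -2.0, 3.0])

def loss_and_grad(w):
    diff = w - w_star
    return float(diff @ diff), 2.0 * diff

w = rng.standard_normal(3)            # weight initialization method
learning_rate = 0.1                   # learning rate
num_steps = 100                       # number of steps
for t in range(num_steps):
    loss, dw = loss_and_grad(w)
    w -= learning_rate * dw           # step in the direction of the negative gradient
print(w)                              # ≈ w_star
```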
Batch Gradient Descent
L(W) = \frac{1}{N} \sum_{i=1}^{N} L_i(x_i, y_i, W) + \lambda R(W)
\nabla_W L(W) = \frac{1}{N} \sum_{i=1}^{N} \nabla_W L_i(x_i, y_i, W) + \lambda \nabla_W R(W)
The full sum is expensive when N is large!
Stochastic Gradient Descent (SGD)
The full sum is expensive when N is large! Approximate the sum using a minibatch of examples; 32 / 64 / 128 are common.

Hyperparameters:
- Weight initialization
- Number of steps
- Learning rate
- Batch size
- Data sampling
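A minimal SGD sketch on a toy least-squares problem; the dataset, loss, and hyperparameter values are illustrative stand-ins:

```python
import numpy as np

# SGD on a toy least-squares problem: each step uses a random minibatch
# instead of the full sum over all N examples.
rng = np.random.default_rng(0)
N, D = 10_000, 20
X = rng.standard_normal((N, D))
w_true = rng.standard_normal(D)
y = X @ w_true + 0.01 * rng.standard_normal(N)

w = np.zeros(D)                       # weight initialization
learning_rate = 0.1
batch_size = 64                       # 32 / 64 / 128 are common
num_steps = 1_000

for t in range(num_steps):
    idx = rng.integers(0, N, size=batch_size)      # data sampling: draw a minibatch
    Xb, yb = X[idx], y[idx]
    dw = 2.0 * Xb.T @ (Xb @ w - yb) / batch_size   # minibatch estimate of the gradient
    w -= learning_rate * dw
print(np.max(np.abs(w - w_true)))                  # small: w ≈ w_true
```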