  1. Lecture 4: Optimization. Justin Johnson, September 16, 2019.

  2. Waitlist Update: We will open the course for enrollment later today or tomorrow.

  3. Reminder: Assignment 1 was due yesterday! (But you do have late days…)

  4. Assignment 2
     • Will be released today
     • Use SGD to train linear classifiers and fully-connected networks
     • After today, you can do the linear classifiers section
     • After Wednesday, you can do the fully-connected networks section
     • If you have a hard time computing derivatives, wait for next Monday's lecture on backprop
     • Due Monday, September 30, 11:59pm (two weeks from today)

  5. Course Update: A1: 10%, A2: 10%, A3: 10%, A4: 10%, A5: 10%, A6: 10%, Midterm: 20%, Final: 20%.

  6. Course Update: No Final Exam. Revised grading: A1: 10% → 10%, A2: 10% → 13%, A3: 10% → 13%, A4: 10% → 13%, A5: 10% → 13%, A6: 10% → 13%, Midterm: 20% → 25%, Final: 20% → dropped. Expect A5 and A6 to be longer than the other homeworks.

  7. Last Time: Linear Classifiers. Three viewpoints: algebraic (f(x, W) = Wx), visual (one template per class), geometric (hyperplanes cutting up space).

  8. Last Time: Loss Functions quantify preferences. We have some dataset of (x, y), a score function (the linear classifier f(x, W) = Wx), and a loss function (softmax or SVM per-example loss, plus the full loss over the dataset).

  9. Last Time: Loss Functions quantify preferences. Same setup as above, plus the question for today: how do we find the best W?
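
As a reminder of the previous lecture, the standard forms of the losses named above (written for a score vector s = f(x_i, W); the notation on the original slides may differ slightly) are:

```latex
% Multiclass SVM (hinge) loss for example i, with margin 1:
L_i = \sum_{j \neq y_i} \max\bigl(0,\, s_j - s_{y_i} + 1\bigr)

% Softmax (cross-entropy) loss for example i:
L_i = -\log \frac{e^{s_{y_i}}}{\sum_j e^{s_j}}

% Full loss: average per-example loss plus a regularizer R(W):
L(W) = \frac{1}{N} \sum_{i=1}^{N} L_i\bigl(f(x_i, W), y_i\bigr) + \lambda R(W)
```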

  10. Optimization

  11. [Image slide: optimization pictured as a person walking across a landscape. Both images are CC0 1.0 public domain.]

  13. Idea #1: Random Search (bad idea!)

  14. Idea #1: Random Search (bad idea!) 15.5% accuracy! Not bad!

  15. Idea #1: Random Search (bad idea!) 15.5% accuracy! Not bad! (SOTA is ~95%)
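
A minimal sketch of what random search looks like in code, assuming synthetic stand-in data and an SVM loss (the array shapes mirror CIFAR-10 with the bias trick; this is illustrative, not the exact code from the slides):

```python
import numpy as np

# Synthetic stand-ins for the training data: 100 examples, 3073-dim, 10 classes.
rng = np.random.default_rng(0)
X_train = rng.standard_normal((100, 3073))
y_train = rng.integers(0, 10, size=100)

def svm_loss(W, X, y):
    """Multiclass SVM loss with margin 1 (one possible choice of loss)."""
    scores = X @ W.T                                   # (N, 10) class scores
    correct = scores[np.arange(len(y)), y][:, None]    # score of the true class
    margins = np.maximum(0, scores - correct + 1)
    margins[np.arange(len(y)), y] = 0                  # don't count the true class
    return margins.sum(axis=1).mean()

# Idea #1: try many random W and keep the best one (bad idea, but a baseline).
best_loss, best_W = float('inf'), None
for trial in range(1000):
    W = rng.standard_normal((10, 3073)) * 1e-4         # random guess for the weights
    loss = svm_loss(W, X_train, y_train)
    if loss < best_loss:
        best_loss, best_W = loss, W
```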

  16. Idea #2: Follow the slope

  17. Idea #2: Follow the slope. In 1 dimension, the derivative of a function gives the slope.

  18. Idea #2: Follow the slope. In 1 dimension, the derivative of a function gives the slope. In multiple dimensions, the gradient is the vector of partial derivatives along each dimension. The slope in any direction is the dot product of that direction with the gradient. The direction of steepest descent is the negative gradient.
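
In standard notation, the one-dimensional definition the slide refers to, its multi-dimensional analogue, and the directional slope are:

```latex
% Derivative in one dimension:
\frac{df(x)}{dx} = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}

% Gradient in multiple dimensions, and the slope along a unit direction v
% (most negative when v points along -\nabla_W L):
\nabla_W L = \Bigl( \tfrac{\partial L}{\partial W_1}, \dots, \tfrac{\partial L}{\partial W_d} \Bigr),
\qquad
D_v L = \nabla_W L \cdot v
```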

  19. Numeric gradient walkthrough. Current W: [0.34, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, …], loss 1.25347. Gradient dL/dW: [?, ?, ?, ?, ?, ?, ?, ?, ?, …].

  20. W + h (first dim): [0.34 + 0.0001, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, …], loss 1.25322. Gradient dL/dW: [?, ?, ?, ?, ?, ?, ?, ?, ?, …].

  21. First slope: (1.25322 - 1.25347) / 0.0001 = -2.5. Gradient dL/dW: [-2.5, ?, ?, ?, ?, ?, ?, ?, ?, …].

  22. W + h (second dim): [0.34, -1.11 + 0.0001, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, …], loss 1.25353. Gradient dL/dW: [-2.5, ?, ?, ?, ?, ?, ?, ?, ?, …].

  23. Second slope: (1.25353 - 1.25347) / 0.0001 = 0.6. Gradient dL/dW: [-2.5, 0.6, ?, ?, ?, ?, ?, ?, ?, …].

  24. W + h (third dim): [0.34, -1.11, 0.78 + 0.0001, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, …], loss 1.25347. Gradient dL/dW: [-2.5, 0.6, ?, ?, ?, ?, ?, ?, ?, …].

  25. Third slope: (1.25347 - 1.25347) / 0.0001 = 0.0. Gradient dL/dW: [-2.5, 0.6, 0.0, ?, ?, ?, ?, ?, ?, …].

  26. Numeric gradient: slow (O(#dimensions) loss evaluations) and approximate.
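
A minimal sketch of the numeric-gradient procedure the slides step through, with the same step size h = 0.0001 (the function names are illustrative; any scalar-valued loss of W will do):

```python
import numpy as np

def numeric_gradient(loss_fn, W, h=1e-4):
    """Approximate dL/dW by nudging one coordinate of W at a time:
    slow (one loss evaluation per dimension) and approximate.
    W must be a float array; it is modified in place and then restored."""
    grad = np.zeros_like(W)
    base = loss_fn(W)                        # loss at the current W
    for idx in np.ndindex(W.shape):
        old = W[idx]
        W[idx] = old + h                     # W + h along this one dimension
        grad[idx] = (loss_fn(W) - base) / h  # (new loss - old loss) / h
        W[idx] = old                         # restore the original value
    return grad

# Usage (with any loss_fn mapping a weight array to a scalar):
#   dW = numeric_gradient(loss_fn, W)
```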

  27. Loss is a function of W; we want dL/dW.

  28. Loss is a function of W: Analytic Gradient. We want dL/dW; use calculus to compute an analytic gradient. (Slide images are in the public domain.)

  29. Analytic gradient example. Current W: [0.34, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, …], loss 1.25347. dL/dW = … (some function of the data and W), giving the gradient dL/dW: [-2.5, 0.6, 0, 0.2, 0.7, -0.5, 1.1, 1.3, -2.1, …].

  30. Same as above, plus the note: in practice we will compute dL/dW using backpropagation (see Lecture 6).
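
Concretely, for the full loss used in this course, "some function of the data and W" decomposes as follows (a standard identity, not spelled out on the slide):

```latex
L(W) = \frac{1}{N} \sum_{i=1}^{N} L_i\bigl(f(x_i, W), y_i\bigr) + \lambda R(W)
\;\;\Longrightarrow\;\;
\nabla_W L = \frac{1}{N} \sum_{i=1}^{N} \nabla_W L_i + \lambda \nabla_W R(W)
```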

  31. Computing Gradients. Numeric gradient: approximate, slow, easy to write. Analytic gradient: exact, fast, error-prone. In practice: always use the analytic gradient, but check your implementation with the numeric gradient; this is called a gradient check (a sketch follows below).

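A minimal gradient-check sketch in the spirit of slide 31: compare a few randomly chosen coordinates of an analytic gradient against centered finite differences (the relative-error formula here is common practice, not taken from the slides):

```python
import numpy as np

def gradient_check(loss_fn, analytic_grad, W, num_checks=10, h=1e-6):
    """Spot-check an analytic gradient against numeric finite differences."""
    rng = np.random.default_rng(0)
    for _ in range(num_checks):
        idx = tuple(rng.integers(0, s) for s in W.shape)  # random coordinate of W
        old = W[idx]
        W[idx] = old + h
        f_plus = loss_fn(W)                 # loss at W + h along this coordinate
        W[idx] = old - h
        f_minus = loss_fn(W)                # loss at W - h
        W[idx] = old                        # restore
        numeric = (f_plus - f_minus) / (2 * h)
        analytic = analytic_grad[idx]
        rel_err = abs(numeric - analytic) / max(abs(numeric) + abs(analytic), 1e-12)
        print(f"numeric {numeric:+.6f}  analytic {analytic:+.6f}  rel. error {rel_err:.2e}")
```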

  35. Gradient Descent: iteratively step in the direction of the negative gradient (the direction of local steepest descent). Hyperparameters: weight initialization method, number of steps, learning rate.

  36. Gradient Descent, illustrated in the (W_1, W_2) plane: from the original W, step along the negative gradient direction. Same hyperparameters as above.

  37. Gradient Descent, continued (same procedure and hyperparameters as slide 35); a loop implementing this is sketched below.
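
A vanilla gradient descent loop with the three hyperparameters from the slide, run here on a tiny illustrative least-squares problem so the gradient is easy to write by hand (the data and loss are stand-ins, not the classifier from the lecture):

```python
import numpy as np

# Illustrative data: 100 examples, 5 features, a known linear relationship plus noise.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.standard_normal(100)

# Hyperparameters from the slide.
w = rng.standard_normal(5) * 1e-2      # weight initialization method
num_steps = 500                        # number of steps
learning_rate = 0.1                    # learning rate

for t in range(num_steps):
    residual = X @ w - y
    dw = X.T @ residual / len(y)       # analytic gradient of (mean squared error) / 2
    w -= learning_rate * dw            # step along the negative gradient
```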

  38. Batch Gradient Descent: the full sum over the training set is expensive when N is large!

  39. Stochastic Gradient Descent (SGD): the full sum is expensive when N is large, so approximate it using a minibatch of examples (32 / 64 / 128 common). Hyperparameters: weight initialization, number of steps, learning rate, batch size, data sampling.
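
The same loop turned into minibatch SGD, adding the batch-size and data-sampling hyperparameters from slide 39 (again on the illustrative least-squares problem, with uniform sampling as one simple choice):

```python
import numpy as np

# Larger illustrative dataset, so the full sum over N would actually be costly.
rng = np.random.default_rng(0)
X = rng.standard_normal((10000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.standard_normal(10000)

w = rng.standard_normal(5) * 1e-2      # weight initialization
num_steps = 2000                       # number of steps
learning_rate = 0.05                   # learning rate
batch_size = 64                        # batch size (32 / 64 / 128 are common)

for t in range(num_steps):
    idx = rng.integers(0, len(y), size=batch_size)   # data sampling: uniform, with replacement
    Xb, yb = X[idx], y[idx]
    dw = Xb.T @ (Xb @ w - yb) / batch_size           # gradient estimate from the minibatch
    w -= learning_rate * dw                          # SGD step
```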
