

  1. CS6501: Deep Learning for Visual Recognition Stochastic Gradient Descent (SGD)

  2. Today's Class
  • SGD Recap
  • Regression vs Classification
  • Generalization / Overfitting / Underfitting
  • Regularization
  • Momentum Updates / ADAM Updates

  3. Our function L(w)
  L(w) = 3 + (w − 4)^2

  4. Our function L(w)
  L(w) = 3 + (w − 4)^2
  Easy way to find the minimum (and max): find where dL(w)/dw = 0.
  Here dL(w)/dw = 2(w − 4), which is zero when w = 4.

  5. Our function L(w)
  L(w) = 3 + (w − 4)^2
  But this is not easy for complex functions with many parameters, e.g.
  L(w_1, w_2, …, w_12) = −logsoftmax(f(w_1, w_2, …, w_12), label_1) − logsoftmax(f(w_1, w_2, …, w_12), label_2) − … − logsoftmax(f(w_1, w_2, …, w_12), label_n)

  6. Our function L(w)
  L(w) = 3 + (w − 4)^2
  Or even for simpler functions: L(x) = e^(−x) + x^2.
  Then dL(x)/dx = −e^(−x) + 2x, and setting −e^(−x) + 2x = 0 has no closed-form solution. How do you find x?

  7. Gradient Descent (GD) (idea)
  1. Start with a random value of w (e.g. w = 12).
  2. Compute the gradient (derivative) of L(w) at point w = 12 (e.g. dL/dw = 6).
  3. Recompute w as: w = w − lambda * (dL/dw)
  [Plot of L(w) with the current point at w = 12]

  8. Gradient Descent (GD) (idea)
  2. Compute the gradient (derivative) of L(w) at the current point.
  3. Recompute w as: w = w − lambda * (dL/dw)
  [Plot of L(w) with the current point moved to w = 10]

  9. Gradient Descent (GD) (idea)
  2. Compute the gradient (derivative) of L(w) at the current point.
  3. Recompute w as: w = w − lambda * (dL/dw)
  [Plot of L(w) with the current point moved to w = 8]
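A tiny Python sketch of these three steps, added for illustration (not from the slides), applied to the quadratic L(w) = 3 + (w − 4)^2 from earlier; the step-size value is an assumption:

    # Gradient descent on L(w) = 3 + (w - 4)^2, whose derivative is dL/dw = 2(w - 4)
    w = 12.0                       # step 1: start with a value of w
    lam = 0.1                      # step size (lambda); assumed value
    for step in range(25):
        dL_dw = 2 * (w - 4)        # step 2: compute the gradient at the current w
        w = w - lam * dL_dw        # step 3: recompute w
        print(step, w)             # w approaches the minimizer w = 4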

  10. Gradient Descent (GD)
  lambda = 0.01
  Loss: L(w, b) = Σ_{i ∈ D} −log f_{i, label_i}(w, b)
  Initialize w and b randomly
  for e = 0, num_epochs do
    Compute: ∂L(w, b)/∂w and ∂L(w, b)/∂b
    Update w: w = w − lambda ∂L(w, b)/∂w
    Update b: b = b − lambda ∂L(w, b)/∂b
    Print: L(w, b)   // Useful to see if this is becoming smaller or not.
  end

  11. Gradient Descent (GD): expensive
  lambda = 0.01
  Loss: L(w, b) = Σ_{i ∈ D} −log f_{i, label_i}(w, b)   (the sum runs over the entire dataset D, so each update is expensive)
  Initialize w and b randomly
  for e = 0, num_epochs do
    Compute: ∂L(w, b)/∂w and ∂L(w, b)/∂b
    Update w: w = w − lambda ∂L(w, b)/∂w
    Update b: b = b − lambda ∂L(w, b)/∂b
    Print: L(w, b)   // Useful to see if this is becoming smaller or not.
  end

  12. (mini-batch) Stochastic Gradient Descent (SGD)
  lambda = 0.01
  Loss: L(w, b) = Σ_{i ∈ B} −log f_{i, label_i}(w, b)
  Initialize w and b randomly
  for e = 0, num_epochs do
    for b = 0, num_batches do
      Compute: ∂L(w, b)/∂w and ∂L(w, b)/∂b
      Update w: w = w − lambda ∂L(w, b)/∂w
      Update b: b = b − lambda ∂L(w, b)/∂b
      Print: L(w, b)   // Useful to see if this is becoming smaller or not.
    end
  end

  13. (mini-batch) Stochastic Gradient Descent (SGD)
  lambda = 0.01
  Loss: L(w, b) = Σ_{i ∈ B} −log f_{i, label_i}(w, b), with |B| = 1 (one example per update)
  Initialize w and b randomly
  for e = 0, num_epochs do
    for b = 0, num_batches do
      Compute: ∂L(w, b)/∂w and ∂L(w, b)/∂b
      Update w: w = w − lambda ∂L(w, b)/∂w
      Update b: b = b − lambda ∂L(w, b)/∂b
      Print: L(w, b)   // Useful to see if this is becoming smaller or not.
    end
  end
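Below is a minimal numpy sketch of this mini-batch loop for a linear softmax classifier, matching the −log softmax loss written above; the toy data, batch size (larger than 1 here), and initialization details are assumptions added for illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    # Made-up 2-class data in 2-D; labels in {0, 1}
    X = rng.standard_normal((200, 2))
    labels = (X[:, 0] + X[:, 1] > 0).astype(int)

    D, K = 2, 2
    W = 0.01 * rng.standard_normal((D, K))   # initialize w randomly (small)
    b = np.zeros(K)                          # initialize b
    lam = 0.01                               # step size, as on the slide
    batch_size = 20                          # assumed batch size

    def loss_and_grads(Xb, yb):
        scores = Xb @ W + b                                   # f(x; w, b)
        scores -= scores.max(axis=1, keepdims=True)           # numerical stability
        probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
        loss = -np.log(probs[np.arange(len(yb)), yb]).sum()   # sum of -log softmax
        dscores = probs.copy()
        dscores[np.arange(len(yb)), yb] -= 1                  # gradient of -log softmax
        return loss, Xb.T @ dscores, dscores.sum(axis=0)

    for epoch in range(20):
        order = rng.permutation(len(X))
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            L, dW, db = loss_and_grads(X[idx], labels[idx])
            W = W - lam * dW                  # update w
            b = b - lam * db                  # update b
        print(epoch, L)                       # useful to see if the loss shrinks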

  14. Regression vs Classification
  Regression:
  • Labels are continuous variables, e.g. distance.
  • Losses: distance-based losses, e.g. sum of distances to true values.
  • Evaluation: mean distances, correlation coefficients, etc.
  Classification:
  • Labels are discrete variables (1 out of K categories).
  • Losses: cross-entropy loss, margin losses, logistic regression (binary cross-entropy).
  • Evaluation: classification accuracy, etc.
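To make the two columns concrete, here is a small numpy sketch, added for illustration (not part of the slides), computing one loss of each kind on made-up values:

    import numpy as np

    # Regression: distance-based loss (sum of squared distances to true values)
    y_true = np.array([1.5, 0.2, 3.0])        # continuous targets (made-up values)
    y_pred = np.array([1.2, 0.5, 2.7])
    sq_loss = np.sum((y_pred - y_true) ** 2)  # sum of squared errors

    # Classification: cross-entropy loss over K = 3 categories
    labels = np.array([0, 2, 1])              # discrete targets, 1 out of K
    logits = np.array([[2.0, 0.1, -1.0],
                       [0.3, 0.2,  1.5],
                       [0.0, 1.0,  0.2]])
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # softmax
    ce_loss = -np.log(probs[np.arange(len(labels)), labels]).sum()

    print(sq_loss, ce_loss)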

  15. Linear Regression – 1 output, 1 input
  [Scatter plot of training points (x_i, y_i)]

  16. Linear Regression – 1 output, 1 input
  [Scatter plot of training points (x_i, y_i)]
  Model: ŷ = wx + b

  17. Linear Regression – 1 output, 1 input
  [Scatter plot of training points (x_i, y_i)]
  Model: ŷ = wx + b

  18. Linear Regression – 1 output, 1 input
  [Scatter plot of training points (x_i, y_i)]
  Model: ŷ = wx + b
  Loss: L(w, b) = Σ_{i=1}^{N} (ŷ_i − y_i)^2
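For the gradient updates used by GD/SGD, it helps to write out the derivatives of this squared loss; this short derivation is added here and is not on the original slides:

    \[
    L(w, b) = \sum_{i=1}^{N} (\hat{y}_i - y_i)^2, \qquad \hat{y}_i = w x_i + b
    \]
    \[
    \frac{\partial L}{\partial w} = \sum_{i=1}^{N} 2(\hat{y}_i - y_i)\, x_i, \qquad
    \frac{\partial L}{\partial b} = \sum_{i=1}^{N} 2(\hat{y}_i - y_i)
    \]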

  19. Quadratic Regression
  [Scatter plot of training points (x_i, y_i)]
  Model: ŷ = w_2 x^2 + w_1 x + b
  Loss: L(w, b) = Σ_{i=1}^{N} (ŷ_i − y_i)^2

  20. n-polynomial Regression
  [Scatter plot of training points (x_i, y_i)]
  Model: ŷ = w_n x^n + ⋯ + w_1 x + b
  Loss: L(w, b) = Σ_{i=1}^{N} (ŷ_i − y_i)^2
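A small numpy sketch, added for illustration (the data and degrees are made up), showing how the training loss of this model falls as the polynomial degree grows, which leads into the next slide on overfitting:

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0, 1, 8)
    y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(8)  # noisy targets (made up)

    for degree in (1, 3, 7):
        coeffs = np.polyfit(x, y, degree)        # least-squares polynomial fit
        y_hat = np.polyval(coeffs, x)
        train_loss = np.sum((y_hat - y) ** 2)    # same squared loss as above
        print(degree, train_loss)                # the loss shrinks as the degree grows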

  21. Overfitting
  • ŷ is linear: Loss(w) is high. Underfitting (high bias).
  • ŷ is cubic: Loss(w) is low.
  • ŷ is a polynomial of degree 9: Loss(w) is zero! Overfitting (high variance).
  Christopher M. Bishop – Pattern Recognition and Machine Learning

  22. Regularization
  • Large weights lead to large variance, i.e. the model fits the training data too strongly.
  • Solution: minimize the loss but also try to keep the weight values small, by minimizing:
    L(w, b) + α Σ_k |w_k|^2

  23. Regularization
  • Large weights lead to large variance, i.e. the model fits the training data too strongly.
  • Solution: minimize the loss but also try to keep the weight values small, by minimizing:
    L(w, b) + α Σ_k |w_k|^2
  The added term is the regularizer, e.g. an L2-regularizer as shown here.

  24. SGD with Regularization (L-2)
  lambda = 0.01
  Loss: L(w, b) + α Σ_k |w_k|^2
  Initialize w and b randomly
  for e = 0, num_epochs do
    for b = 0, num_batches do
      Compute: ∂L(w, b)/∂w and ∂L(w, b)/∂b
      Update w: w = w − lambda ∂L(w, b)/∂w − lambda α w
      Update b: b = b − lambda ∂L(w, b)/∂b
      Print: L(w, b)   // Useful to see if this is becoming smaller or not.
    end
  end
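A minimal numpy sketch of this regularized mini-batch loop for the 1-D linear model from the earlier slides; the synthetic data, batch size, and the value of alpha are assumptions made for illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(100)
    y = 2.0 * x + 0.5 + 0.1 * rng.standard_normal(100)   # synthetic data (made up)

    w, b = rng.standard_normal(), 0.0    # initialize w and b
    lam, alpha = 0.01, 0.1               # step size and L2 coefficient (alpha assumed)
    batch_size = 10

    for epoch in range(50):
        order = rng.permutation(len(x))
        for start in range(0, len(x), batch_size):
            idx = order[start:start + batch_size]
            err = w * x[idx] + b - y[idx]
            dw = np.sum(2 * err * x[idx])        # gradient of the squared loss w.r.t. w
            db = np.sum(2 * err)                 # gradient of the squared loss w.r.t. b
            w = w - lam * dw - lam * alpha * w   # L2 regularization acts on w (weight decay)
            b = b - lam * db
        print(epoch, np.sum((w * x + b - y) ** 2))   # should shrink over epochs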

  25. Revisiting Another Problem with SGD
  (Same mini-batch SGD loop as slide 24.)
  The per-batch derivatives ∂L(w, b)/∂w and ∂L(w, b)/∂b are only approximations to the true gradient of the loss over the whole training set.

  26. Revisiting Another Problem with SGD
  (Same mini-batch SGD loop as slide 24.)
  Because each update follows only an approximate gradient, it could lead to "un-learning" what has been learned in some previous steps of training.

  27. Solution: Momentum Updates
  (Same mini-batch SGD loop as slide 24.)
  Idea: keep track of previous gradients in an accumulator variable, and use a weighted average with the current gradient.

  28. Solution: Momentum Updates
  lambda = 0.01, gamma = 0.9
  Loss: L(w, b) + α Σ_k |w_k|^2
  Initialize w and b randomly; global v
  for e = 0, num_epochs do
    for b = 0, num_batches do
      Compute: ∂L(w, b)/∂w
      Compute: v = gamma * v + ∂L(w, b)/∂w + α w
      Update w: w = w − lambda v
      Print: L(w, b)   // Useful to see if this is becoming smaller or not.
    end
  end
  Keep track of previous gradients in the accumulator variable v, and use a weighted average with the current gradient.
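A numpy sketch of the momentum update on the same kind of 1-D linear-regression problem; lambda = 0.01 and gamma = 0.9 follow the slide, while the data, batch size, and alpha are assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(100)
    y = 2.0 * x + 0.5 + 0.1 * rng.standard_normal(100)   # synthetic data (made up)

    w, b = 0.0, 0.0
    lam, gamma, alpha = 0.01, 0.9, 0.1   # step size, momentum, L2 coefficient
    v_w, v_b = 0.0, 0.0                  # accumulator ("velocity") variables
    batch_size = 10

    for epoch in range(50):
        order = rng.permutation(len(x))
        for start in range(0, len(x), batch_size):
            idx = order[start:start + batch_size]
            err = w * x[idx] + b - y[idx]
            dw = np.sum(2 * err * x[idx])
            db = np.sum(2 * err)
            # Momentum: blend previous gradients with the current one
            v_w = gamma * v_w + dw + alpha * w
            v_b = gamma * v_b + db
            w = w - lam * v_w
            b = b - lam * v_b
        print(epoch, np.sum((w * x + b - y) ** 2))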

  29. More on Momentum https://distill.pub/2017/momentum/

  30. Questions?
