CS6501: Deep Learning for Visual Recognition
Stochastic Gradient Descent (SGD)
Today’s Class: Stochastic Gradient Descent (SGD)
• SGD Recap
• Regression vs Classification
• Generalization / Overfitting / Underfitting
• Regularization
• Momentum Updates / ADAM Updates
Our function L(w)
L(w) = 3 + (w − 4)²
Our function L(w)
L(w) = 3 + (w − 4)²
Easy way to find the minimum (and max): find where dL(w)/dw = 0.
dL(w)/dw = 2(w − 4), which is zero when w = 4.
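As a quick sanity check, this analytic minimum can be verified in a few lines of Python; a small sketch (not from the slides) that uses SymPy to solve dL/dw = 0 symbolically:

```python
import sympy as sp

w = sp.Symbol('w')
L = 3 + (w - 4) ** 2            # our function L(w)

dL_dw = sp.diff(L, w)           # derivative: 2*w - 8, i.e. 2*(w - 4)
minimizer = sp.solve(dL_dw, w)  # where the derivative equals zero
print(dL_dw, minimizer)         # prints: 2*w - 8 [4]
```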
Our function L(w)
L(w) = 3 + (w − 4)²
But this is not easy for complex functions:
L(w₁, w₂, …, w₁₂) = −logsoftmax(w₁, w₂, …, w₁₂, x_a)[label a]
                    − logsoftmax(w₁, w₂, …, w₁₂, x_b)[label b]
                    − …
                    − logsoftmax(w₁, w₂, …, w₁₂, x_n)[label n]
Our function L(w)
L(w) = 3 + (w − 4)²
Or even for simpler functions:
L(x) = e^(−x) + x²
dL(x)/dx = −e^(−x) + 2x = 0
How do you find x?
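For this function the condition dL/dx = 0 has no closed-form solution, so the root has to be found numerically. A minimal sketch (assuming SciPy is available; the bracket [0, 1] is chosen by inspection) that illustrates why an iterative procedure is needed:

```python
import numpy as np
from scipy.optimize import brentq

# L(x) = exp(-x) + x**2, so dL/dx = -exp(-x) + 2*x
def dL_dx(x):
    return -np.exp(-x) + 2 * x

# dL/dx is negative at x = 0 and positive at x = 1, so the root lies in between.
x_star = brentq(dL_dx, 0.0, 1.0)
print(x_star)  # about 0.3517, the minimizer of L(x)
```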
Gradient Descent (GD) (idea)
1. Start with a random value of w (e.g. w = 12)
2. Compute the gradient (derivative) of L(w) at point w = 12 (e.g. dL/dw = 6)
3. Recompute w as: w = w − lambda * (dL/dw)
[Figure: the curve L(w) with the current point at w = 12]
Gradient Descent (GD) (idea)
Repeating steps 2 and 3 moves the point down the curve: after one update w = 10, after another w = 8, approaching the minimum of L(w).
[Figure: the same curve L(w) with the current point at w = 10, then at w = 8]
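A minimal Python sketch of this idea on the toy loss L(w) = 3 + (w − 4)² from earlier; the starting value w = 12 matches the slide, while the learning rate here is just an illustrative choice:

```python
def L(w):
    return 3 + (w - 4) ** 2

def dL_dw(w):
    return 2 * (w - 4)          # gradient of L at w

w = 12.0                        # step 1: start from some value of w
lam = 0.1                       # learning rate ("lambda"), illustrative
for step in range(25):
    w = w - lam * dL_dw(w)      # steps 2-3: move against the gradient
    print(step, round(w, 3), round(L(w), 3))
# w approaches 4.0, the minimizer of L(w)
```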
Gradient Descent (GD)

λ = 0.01
L(w, b) = Σ_{i=1..n} −log f_{i,label}(w, b)      (sum over the entire training set)

Initialize w and b randomly
for e = 0, num_epochs do
    Compute: dL(w, b)/dw and dL(w, b)/db
    Update w: w = w − λ dL(w, b)/dw
    Update b: b = b − λ dL(w, b)/db
    Print: L(w, b)    // Useful to see if this is becoming smaller or not.
end
Gradient Descent (GD): expensive
Same procedure as above, but note that computing dL(w, b)/dw and dL(w, b)/db requires a pass over the entire training set for every single update, which is expensive.
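A rough NumPy sketch of the full-batch loop, using binary logistic regression as a stand-in model (the data, model, and learning rate are illustrative assumptions, not from the slides). The point to notice is that every single update touches all N examples:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10_000
x = rng.normal(size=(N, 2))                 # N training examples, 2 features
y = (x[:, 0] + x[:, 1] > 0).astype(float)   # binary labels

w, b = rng.normal(size=2), 0.0
lam = 0.01                                  # learning rate

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(100):
    p = sigmoid(x @ w + b)                  # predictions on the *entire* training set
    loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    dL_dw = x.T @ (p - y) / N               # gradient accumulated over all N examples
    dL_db = np.mean(p - y)
    w -= lam * dL_dw                        # update w
    b -= lam * dL_db                        # update b
    if epoch % 20 == 0:
        print(epoch, loss)                  # useful to see if the loss is shrinking
```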
(mini-batch) Stochastic Gradient Descent (SGD)

λ = 0.01
L(w, b) = Σ_{i∈B} −log f_{i,label}(w, b)      (sum over the current mini-batch B only)

Initialize w and b randomly
for e = 0, num_epochs do
    for b = 0, num_batches do
        Compute: dL(w, b)/dw and dL(w, b)/db
        Update w: w = w − λ dL(w, b)/dw
        Update b: b = b − λ dL(w, b)/db
        Print: L(w, b)    // Useful to see if this is becoming smaller or not.
    end
end
(mini-batch) Stochastic Gradient Descent (SGD)
The same procedure with |B| = 1, i.e. each gradient is computed from a single training example; this is the classic form of SGD.
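The same hypothetical setup with mini-batches: a sketch where each update is computed from only |B| examples (batch size and data are illustrative), so individual updates become much cheaper than in the full-batch version above:

```python
import numpy as np

rng = np.random.default_rng(0)
N, batch_size = 10_000, 32
x = rng.normal(size=(N, 2))
y = (x[:, 0] + x[:, 1] > 0).astype(float)

w, b, lam = rng.normal(size=2), 0.0, 0.01

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(5):
    order = rng.permutation(N)                      # shuffle once per epoch
    for start in range(0, N, batch_size):
        idx = order[start:start + batch_size]       # indices of the current mini-batch B
        p = sigmoid(x[idx] @ w + b)
        dL_dw = x[idx].T @ (p - y[idx]) / len(idx)  # gradient from |B| examples only
        dL_db = np.mean(p - y[idx])
        w -= lam * dL_dw
        b -= lam * dL_db
    print(epoch)   # one could also print the mini-batch loss here to monitor training
```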
Regression vs Classification

Regression
• Labels are continuous variables, e.g. distance.
• Losses: distance-based losses, e.g. sum of distances to true values.
• Evaluation: mean distances, correlation coefficients, etc.

Classification
• Labels are discrete variables (1 out of K categories).
• Losses: cross-entropy loss, margin losses, logistic regression (binary cross-entropy).
• Evaluation: classification accuracy, etc.
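To make the two loss families concrete, a small illustrative sketch (all numbers made up) computing a distance-based regression loss and a cross-entropy classification loss:

```python
import numpy as np

# Regression: continuous targets, distance-based loss (here squared distances)
y_true = np.array([1.2, 0.7, 3.4])
y_pred = np.array([1.0, 1.1, 3.0])
regression_loss = np.sum((y_pred - y_true) ** 2)

# Classification: discrete labels (1 out of K), cross-entropy on predicted probabilities
labels = np.array([0, 2, 1])                       # true class indices
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6],
                  [0.2, 0.5, 0.3]])                # softmax outputs, rows sum to 1
cross_entropy = -np.mean(np.log(probs[np.arange(3), labels]))

print(regression_loss, cross_entropy)
```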
Linear Regression – 1 output, 1 input
[Figure: scatter plot of training points (x₁, y₁), (x₂, y₂), …, (xₙ, yₙ)]
Linear Regression – 1 output, 1 input
[Figure: the same scatter plot with a fitted line]
Model: ŷ = wx + b
Linear Regression – 1 output, 1 input
[Figure: the same scatter plot with the fitted line and the errors to each point]
Model: ŷ = wx + b
Loss: L(w, b) = Σ_{i=1..n} (ŷᵢ − yᵢ)²
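A sketch of fitting this model by gradient descent on the squared-error loss (synthetic data, learning rate, and step count are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = rng.uniform(-3, 3, size=n)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=n)   # noisy line: true w = 2, b = 1

w, b, lam = 0.0, 0.0, 0.01
for step in range(2000):
    y_hat = w * x + b                   # model: y_hat = w*x + b
    # L(w, b) = sum_i (y_hat_i - y_i)^2, so the gradients are:
    dL_dw = 2 * np.sum((y_hat - y) * x)
    dL_db = 2 * np.sum(y_hat - y)
    w -= lam * dL_dw / n                # dividing by n keeps the step size stable
    b -= lam * dL_db / n
print(w, b)                             # close to the true slope and intercept
```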
Quadratic Regression
[Figure: the same scatter plot with a fitted quadratic curve]
Model: ŷ = w₂x² + w₁x + b
Loss: L(w, b) = Σ_{i=1..n} (ŷᵢ − yᵢ)²
n-polynomial Regression
[Figure: the same scatter plot with a fitted degree-n polynomial]
Model: ŷ = wₙxⁿ + … + w₁x + b
Loss: L(w, b) = Σ_{i=1..n} (ŷᵢ − yᵢ)²
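Since the polynomial model is still linear in its weights, it can also be fitted directly; a sketch using NumPy's polyfit (degrees and data are illustrative) shows the training loss shrinking as the degree grows, which connects to the next slide on overfitting:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-2, 2, 30)
y = 0.5 * x**3 - x + rng.normal(scale=0.3, size=x.shape)   # noisy cubic data

# Each fit minimizes the same squared-error loss sum_i (y_hat_i - y_i)^2
for degree in (1, 3, 9):
    coeffs = np.polyfit(x, y, degree)       # weights w_n ... w_1 and bias b
    y_hat = np.polyval(coeffs, x)
    loss = np.sum((y_hat - y) ** 2)
    print(degree, round(loss, 3))           # higher degree -> lower training loss
```

A lower training loss from a higher-degree model does not mean a better model, as the next slide illustrates.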
Overfitting
• ŷ is linear: Loss(ŷ) is high. Underfitting (high bias).
• ŷ is cubic: Loss(ŷ) is low.
• ŷ is a polynomial of degree 9: Loss(ŷ) is zero! Overfitting (high variance).
Christopher M. Bishop – Pattern Recognition and Machine Learning
Regularization
• Large weights lead to large variance, i.e. the model fits the training data too strongly.
• Solution: minimize the loss but also try to keep the weight values small, by minimizing
  L(w, b) + α Σⱼ |wⱼ|²
  where α Σⱼ |wⱼ|² is the regularizer term, e.g. the L2 regularizer.
SGD with Regularization (L-2)

λ = 0.01
L'(w, b) = L(w, b) + α Σⱼ |wⱼ|²

Initialize w and b randomly
for e = 0, num_epochs do
    for b = 0, num_batches do
        Compute: dL(w, b)/dw and dL(w, b)/db
        Update w: w = w − λ dL(w, b)/dw − λαw
        Update b: b = b − λ dL(w, b)/db
        Print: L(w, b)    // Useful to see if this is becoming smaller or not.
    end
end
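In code, the L2 term simply adds a "weight decay" step to the weight update. A minimal sketch reusing the hypothetical logistic-regression setup from the earlier examples (the regularization strength alpha is an illustrative value):

```python
import numpy as np

rng = np.random.default_rng(0)
N, batch_size = 10_000, 32
x = rng.normal(size=(N, 2))
y = (x[:, 0] + x[:, 1] > 0).astype(float)

w, b = rng.normal(size=2), 0.0
lam, alpha = 0.01, 0.001            # learning rate and L2 regularization strength

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(5):
    order = rng.permutation(N)
    for start in range(0, N, batch_size):
        idx = order[start:start + batch_size]
        p = sigmoid(x[idx] @ w + b)
        dL_dw = x[idx].T @ (p - y[idx]) / len(idx)
        dL_db = np.mean(p - y[idx])
        w -= lam * dL_dw + lam * alpha * w   # extra -lam*alpha*w term from the L2 regularizer
        b -= lam * dL_db                     # the bias is typically left unregularized
```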
Revisiting Another Problem with SGD
In the SGD loop above, the mini-batch gradients dL(w, b)/dw and dL(w, b)/db are only approximations to the true gradient of L(w, b) over the full training set.
Revisiting Another Problem with SGD
Because each step follows such a noisy, approximate gradient, an update can "un-learn" what has been learned in some previous steps of training.
Solution: Momentum Updates
Keep track of previous gradients in an accumulator variable, and use a weighted average of that accumulator and the current gradient for each update.
Solution: Momentum Updates

λ = 0.01
γ = 0.9
L'(w, b) = L(w, b) + α Σⱼ |wⱼ|²

Initialize w and b randomly
global v = 0
for e = 0, num_epochs do
    for b = 0, num_batches do
        Compute: dL(w, b)/dw
        Compute: v = γv + dL(w, b)/dw + αw
        Update w: w = w − λv
        Print: L(w, b)    // Useful to see if this is becoming smaller or not.
    end
end
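A sketch of the momentum update in the same hypothetical setup (gamma = 0.9 as on the slide; the rest of the values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, batch_size = 10_000, 32
x = rng.normal(size=(N, 2))
y = (x[:, 0] + x[:, 1] > 0).astype(float)

w, b = rng.normal(size=2), 0.0
lam, gamma, alpha = 0.01, 0.9, 0.001
v = np.zeros_like(w)                          # accumulator of past gradients

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(5):
    order = rng.permutation(N)
    for start in range(0, N, batch_size):
        idx = order[start:start + batch_size]
        p = sigmoid(x[idx] @ w + b)
        dL_dw = x[idx].T @ (p - y[idx]) / len(idx)
        v = gamma * v + dL_dw + alpha * w     # blend past gradients with the current one
        w -= lam * v                          # update w using the accumulated direction
        b -= lam * np.mean(p - y[idx])        # plain update for the bias
```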
More on Momentum https://distill.pub/2017/momentum/
Questions?