 
CS4501: Introduction to Computer Vision
Max-Margin Classifier, Regularization, Generalization, Momentum, Regression, Multi-label Classification / Tagging
Previous Class • Softmax Classifier • Inference vs Training • Gradient Descent (GD) • Stochastic Gradient Descent (SGD) • mini-batch Stochastic Gradient Descent (SGD)
Today's Class • Generalization • Regularization / Momentum • Max-Margin Classifier • Regression / Tagging
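(mini-batch) Stochastic Gradient Descent (SGD)
For the Softmax Classifier, with $B$ the current mini-batch and learning rate $\lambda$:
$\ell(w, b) = \sum_{i \in B} -\log f_{i,\text{label}}(w, b)$
$\lambda = 0.01$
Initialize w and b randomly
for e = 0, num_epochs do
    for t = 0, num_batches do
        Compute: $\partial \ell(w, b) / \partial w$ and $\partial \ell(w, b) / \partial b$
        Update w: $w = w - \lambda \, \partial \ell(w, b) / \partial w$
        Update b: $b = b - \lambda \, \partial \ell(w, b) / \partial b$
        // Useful to see if this is becoming smaller or not.
        Print: $\ell(w, b)$
    end
end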
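The loop above can be sketched in NumPy as follows. This is a minimal illustration, not the course's reference code: the function names (`softmax_rows`, `softmax_loss`, `train_softmax_sgd`), the toy Gaussian-blob data, the batch size, and the number of epochs are all assumptions made for the example. The gradient uses the standard softmax result that the derivative of $-\log f_{i,\text{label}}$ with respect to the scores is the predicted distribution minus the one-hot label.

```python
import numpy as np

def softmax_rows(G):
    """Row-wise softmax; subtracting the row max avoids overflow."""
    G = G - G.max(axis=1, keepdims=True)
    E = np.exp(G)
    return E / E.sum(axis=1, keepdims=True)

def softmax_loss(W, b, X, labels):
    """Sum over examples of -log f_{i,label}."""
    F = softmax_rows(X @ W.T + b)
    return -np.log(F[np.arange(len(labels)), labels]).sum()

def train_softmax_sgd(X, labels, num_classes, lr=0.01,
                      num_epochs=20, batch_size=10, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = 0.01 * rng.standard_normal((num_classes, d))  # small random init
    b = 0.01 * rng.standard_normal(num_classes)
    history = [softmax_loss(W, b, X, labels)]
    for epoch in range(num_epochs):
        order = rng.permutation(n)                     # reshuffle each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], labels[idx]
            F = softmax_rows(Xb @ W.T + b)
            dG = F.copy()                              # d loss / d scores
            dG[np.arange(len(idx)), yb] -= 1.0         # f - one_hot(label)
            W -= lr * (dG.T @ Xb)                      # w = w - lr * dl/dw
            b -= lr * dG.sum(axis=0)                   # b = b - lr * dl/db
        history.append(softmax_loss(W, b, X, labels))  # should become smaller
    return W, b, history

# Toy data (hypothetical): three 2-D Gaussian blobs, 30 points each.
rng = np.random.default_rng(1)
centers = np.array([[2.0, 0.0], [-2.0, 2.0], [-2.0, -2.0]])
labels = np.repeat(np.arange(3), 30)
X = centers[labels] + 0.5 * rng.standard_normal((90, 2))
W, b, history = train_softmax_sgd(X, labels, num_classes=3)
```

Here `history` holds the full-dataset loss before training and after each epoch, which plays the role of the "Print" step in the pseudocode.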
Supervised Learning – Softmax Classifier
Extract features:
$x_i = [x_{i1}\ x_{i2}\ x_{i3}\ x_{i4}]$
Run features through classifier:
$g_c = w_{c1} x_{i1} + w_{c2} x_{i2} + w_{c3} x_{i3} + w_{c4} x_{i4} + b_c$
$g_d = w_{d1} x_{i1} + w_{d2} x_{i2} + w_{d3} x_{i3} + w_{d4} x_{i4} + b_d$
$g_b = w_{b1} x_{i1} + w_{b2} x_{i2} + w_{b3} x_{i3} + w_{b4} x_{i4} + b_b$
Get predictions:
$f_c = e^{g_c} / (e^{g_c} + e^{g_d} + e^{g_b})$
$f_d = e^{g_d} / (e^{g_c} + e^{g_d} + e^{g_b})$
$f_b = e^{g_b} / (e^{g_c} + e^{g_d} + e^{g_b})$
$\hat{y}_i = [f_c\ f_d\ f_b]$
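As a small sketch of the forward pass on this slide, the three linear scores and the softmax can be computed in one step with a weight matrix. The function name `softmax_predict` and all numeric values below are hypothetical, chosen only to make the example runnable; subtracting the maximum score is a common numerical-stability trick that does not change the result.

```python
import numpy as np

def softmax_predict(x, W, b):
    """Return [f_c, f_d, f_b] = softmax(W x + b) for one feature vector x."""
    g = W @ x + b          # linear scores g_c, g_d, g_b
    g = g - g.max()        # shift for numerical stability; output unchanged
    e = np.exp(g)
    return e / e.sum()

# Hypothetical parameters: 3 classes, 4 features.
W = np.array([[0.2, -0.1, 0.0, 0.5],
              [0.1, 0.3, -0.2, 0.0],
              [-0.3, 0.2, 0.1, 0.1]])
b = np.array([0.0, 0.1, -0.1])
x = np.array([1.0, 2.0, 0.5, -1.0])

f = softmax_predict(x, W, b)   # three positive numbers that sum to 1
```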
Linear Max-Margin Classifier
Training Data
inputs                                        targets / labels / ground truth    predictions
$x_1 = [x_{11}\ x_{12}\ x_{13}\ x_{14}]$      $y_1 = [1\ 0\ 0]$                  $\hat{y}_1 = [4.3\ -1.3\ 1.1]$
$x_2 = [x_{21}\ x_{22}\ x_{23}\ x_{24}]$      $y_2 = [0\ 1\ 0]$                  $\hat{y}_2 = [0.5\ 5.6\ -4.2]$
$x_3 = [x_{31}\ x_{32}\ x_{33}\ x_{34}]$      $y_3 = [1\ 0\ 0]$                  $\hat{y}_3 = [3.3\ 3.5\ 1.1]$
. . .
$x_n = [x_{n1}\ x_{n2}\ x_{n3}\ x_{n4}]$      $y_n = [0\ 0\ 1]$                  $\hat{y}_n = [1.1\ -5.3\ -9.4]$
Linear Max-Margin Classifier – Inference
$x_i = [x_{i1}\ x_{i2}\ x_{i3}\ x_{i4}]$, $y_i = [1\ 0\ 0]$, $\hat{y}_i = [f_c\ f_d\ f_b]$
$f_c = w_{c1} x_{i1} + w_{c2} x_{i2} + w_{c3} x_{i3} + w_{c4} x_{i4} + b_c$
$f_d = w_{d1} x_{i1} + w_{d2} x_{i2} + w_{d3} x_{i3} + w_{d4} x_{i4} + b_d$
$f_b = w_{b1} x_{i1} + w_{b2} x_{i2} + w_{b3} x_{i3} + w_{b4} x_{i4} + b_b$
Training: How do we find a good w and b?
$x_i = [x_{i1}\ x_{i2}\ x_{i3}\ x_{i4}]$, $y_i = [1\ 0\ 0]$, $\hat{y}_i = [f_c(w, b)\ f_d(w, b)\ f_b(w, b)]$
We need to find w and b that minimize the following:
$\ell(w, b) = \sum_{i=1}^{n} \sum_{j \neq \text{label}} \max(0, \hat{y}_{ij} - \hat{y}_{i,\text{label}} + \Delta)$
Why might this be good compared to softmax?
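To make the max-margin loss concrete, the sketch below evaluates it on the four prediction vectors listed on the training-data slide. The function name `max_margin_loss` and the margin value $\Delta = 1$ are assumptions for illustration; the slides do not fix a value for $\Delta$.

```python
import numpy as np

def max_margin_loss(scores, labels, delta=1.0):
    """Sum over examples i and wrong classes j of max(0, y_ij - y_i,label + delta)."""
    n = scores.shape[0]
    correct = scores[np.arange(n), labels][:, None]   # score of the true class
    margins = np.maximum(0.0, scores - correct + delta)
    margins[np.arange(n), labels] = 0.0               # skip the j == label term
    return margins.sum()

# Predictions and ground-truth class indices from the training-data slide.
scores = np.array([[4.3, -1.3, 1.1],
                   [0.5, 5.6, -4.2],
                   [3.3, 3.5, 1.1],
                   [1.1, -5.3, -9.4]])
labels = np.array([0, 1, 0, 2])

loss = max_margin_loss(scores, labels, delta=1.0)   # 0 + 0 + 1.2 + 16.6 = 17.8
predicted = scores.argmax(axis=1)                   # inference: highest score wins
```

The first two examples contribute zero because the correct class already beats every other class by at least the margin. Softmax cross-entropy, by contrast, keeps penalizing an example until its predicted probability reaches 1.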
Regression vs Classification
Regression
• Labels are continuous variables – e.g. distance.
• Losses: Distance-based losses, e.g. sum of distances to true values.
• Evaluation: Mean distances, correlation coefficients, etc.
Classification
• Labels are discrete variables (1 out of K categories)
• Losses: Cross-entropy loss, margin losses, logistic regression (binary cross entropy)
• Evaluation: Classification accuracy, etc.
Linear Regression – 1 output, 1 input
[Figure: scatter plot of eight training points $(x_1, y_1), \dots, (x_8, y_8)$, with input $x$ on the horizontal axis and output $y$ on the vertical axis]
Model: $\hat{y} = wx + b$
Loss: $\ell(w, b) = \sum_{i=1}^{8} (\hat{y}_i - y_i)^2$
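A minimal sketch of fitting this model by gradient descent on the squared loss follows. The eight data points are made up (the slide's actual values are not given), and the names `squared_loss` and `fit_linear_gd`, the learning rate, and the step count are assumptions. The gradients come from differentiating the loss: $\partial \ell / \partial w = 2 \sum_i (\hat{y}_i - y_i) x_i$ and $\partial \ell / \partial b = 2 \sum_i (\hat{y}_i - y_i)$.

```python
import numpy as np

# Eight hypothetical (x, y) points scattered around the line y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
y = 2.0 * x + 1.0 + np.array([0.3, -0.2, 0.1, -0.4, 0.2, 0.0, -0.1, 0.1])

def squared_loss(w, b, x, y):
    """Sum of squared differences between predictions w*x + b and targets y."""
    return np.sum((w * x + b - y) ** 2)

def fit_linear_gd(x, y, lr=0.001, num_steps=5000):
    """Full-batch gradient descent on the squared loss."""
    w, b = 0.0, 0.0
    for _ in range(num_steps):
        err = w * x + b - y               # prediction error per point
        w -= lr * 2.0 * np.sum(err * x)   # dl/dw
        b -= lr * 2.0 * np.sum(err)       # dl/db
    return w, b

w, b = fit_linear_gd(x, y)
final_loss = squared_loss(w, b, x, y)
```

For this particular loss a closed-form least-squares solution also exists (for example via `np.polyfit(x, y, 1)`), which is a convenient check on the gradient-descent result.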
Quadratic Regression
[Figure: scatter plot of the same eight training points $(x_1, y_1), \dots, (x_8, y_8)$]
Model: $\hat{y} = w_1 x^2 + w_2 x + b$
Loss: $\ell(w, b) = \sum_{i=1}^{8} (\hat{y}_i - y_i)^2$
n-polynomial Regression
[Figure: scatter plot of the same eight training points $(x_1, y_1), \dots, (x_8, y_8)$]
Model: $\hat{y} = w_n x^n + \dots + w_1 x + b$
Loss: $\ell(w, b) = \sum_{i=1}^{8} (\hat{y}_i - y_i)^2$
Overfitting
[Figure: three fits to the same data, taken from Christopher Bishop's book Pattern Recognition and Machine Learning]
• Model is linear: Loss(w) is high. Underfitting, High Bias.
• Model is cubic: Loss(w) is low.
• Model is a polynomial of degree 9: Loss(w) is zero! Overfitting, High Variance.
Detecting Overfitting • Look at the values of the weights in the polynomial
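One way to see this in practice is to fit polynomials of increasing degree and inspect both the training loss and the largest weight magnitude. The sketch below assumes a setup similar in spirit to the book's example (ten noisy samples of a sine curve); the noise level, random seed, and the use of `np.polyfit` are choices made here for illustration, not taken from the slides.

```python
import numpy as np

# Ten hypothetical noisy samples of sin(2*pi*x) on [0, 1].
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)
y = np.sin(2.0 * np.pi * x) + 0.3 * rng.standard_normal(10)

results = {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x, y, degree)                 # least-squares fit, highest power first
    train_loss = np.sum((np.polyval(coeffs, x) - y) ** 2)
    max_weight = np.abs(coeffs).max()
    results[degree] = (train_loss, max_weight)
    print(f"degree {degree}: training loss = {train_loss:.4f}, max |w| = {max_weight:.1f}")
```

The degree-9 polynomial passes through all ten points, so its training loss is essentially zero, but its coefficients are typically orders of magnitude larger than those of the linear fit. That blow-up in weight values is the symptom to look for.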
Recommended Reading
• http://users.isr.ist.utl.pt/~wurmd/Livros/school/Bishop%20-%20Pattern%20Recognition%20And%20Machine%20Learning%20-%20Springer%20%202006.pdf
Print and Read Chapter 1 (at minimum)
More … • Regularization • Momentum updates
Regularization
• Large weights lead to large variance, i.e. the model fits the training data too strongly.
• Solution: Minimize the loss but also try to keep the weight values small by doing the following:
minimize $\ell(w, b) + \beta \sum_i |w_i|^2$
The second term is the regularizer term, e.g. the L2-regularizer.
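SGD with Regularization (L2)
$\ell_{\text{reg}}(w, b) = \ell(w, b) + \beta \sum_i |w_i|^2$
$\lambda = 0.01$
Initialize w and b randomly
for e = 0, num_epochs do
    for t = 0, num_batches do
        Compute: $\partial \ell(w, b) / \partial w$ and $\partial \ell(w, b) / \partial b$
        Update w: $w = w - \lambda \, \partial \ell(w, b) / \partial w - \lambda \beta w$
        Update b: $b = b - \lambda \, \partial \ell(w, b) / \partial b$
        // Useful to see if this is becoming smaller or not.
        Print: $\ell(w, b)$
    end
end
The regularizer involves only w, so the update for b has no regularization term. The constant factor 2 from differentiating $|w_i|^2$ is absorbed into $\beta$.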
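A single regularized update step might look like the sketch below. The function name `sgd_step_l2` and the values of `lr` and `beta` are assumptions for illustration. Passing zero gradients isolates the effect of the regularizer: each step shrinks every weight by a factor of $(1 - \lambda\beta)$, which is why this term is often called weight decay, while the bias is left alone.

```python
import numpy as np

def sgd_step_l2(w, b, grad_w, grad_b, lr=0.01, beta=0.1):
    """One SGD step with an L2 penalty on w (the bias b is not regularized)."""
    w_new = w - lr * grad_w - lr * beta * w   # data gradient plus weight decay
    b_new = b - lr * grad_b                   # plain gradient step
    return w_new, b_new

w = np.array([1.0, -2.0])
b = 0.5

# With zero data gradient, only the regularizer acts: w shrinks toward 0.
w1, b1 = sgd_step_l2(w, b, grad_w=np.zeros(2), grad_b=0.0)
```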
Revisiting Another Problem with SGD
• The gradients $\partial \ell(w, b) / \partial w$ and $\partial \ell(w, b) / \partial b$ computed on each mini-batch are only approximations to the true gradient with respect to $\ell(w, b)$ over the full training set.
• This could lead to "un-learning" what has been learned in some previous steps of training.
Solution: Momentum Updates
Keep track of previous gradients in an accumulator variable, and use a weighted average with the current gradient.
$\ell_{\text{reg}}(w, b) = \ell(w, b) + \beta \sum_i |w_i|^2$
$\lambda = 0.01$, $\tau = 0.9$
Initialize w and b randomly
global v
for e = 0, num_epochs do
    for t = 0, num_batches do
        Compute: $\partial \ell(w, b) / \partial w$
        Compute: $v = \tau v + \partial \ell(w, b) / \partial w + \beta w$
        Update w: $w = w - \lambda v$
        // Useful to see if this is becoming smaller or not.
        Print: $\ell(w, b)$
    end
end
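The momentum update can be sketched as a small function. The name `momentum_step`, the zero initialization of the accumulator `v`, and the constant gradient used in the demonstration are assumptions for illustration. With the same gradient on two consecutive steps, the second step is 1.9 times larger than the first, which shows how the accumulator reinforces directions that persist across mini-batches.

```python
import numpy as np

def momentum_step(w, v, grad_w, lr=0.01, tau=0.9, beta=0.0):
    """One momentum update: accumulate gradients in v, then step along v."""
    v_new = tau * v + grad_w + beta * w   # decayed history plus current gradient
    w_new = w - lr * v_new
    return w_new, v_new

w = np.zeros(1)
v = np.zeros(1)      # accumulator starts empty (an assumption; the slide only says "global v")
g = np.ones(1)       # hypothetical constant gradient

w, v = momentum_step(w, v, g)   # v = 1.0, w = -0.01 (same as plain SGD)
w, v = momentum_step(w, v, g)   # v = 1.9, w = -0.029
```

Conversely, if consecutive mini-batch gradients point in opposite directions, their contributions partly cancel inside `v`, which dampens the "un-learning" effect of noisy gradients.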
More on Momentum https://distill.pub/2017/momentum/
Questions?