
Support Vector Machines: Training with Stochastic Gradient Descent

  1. Support Vector Machines: Training with Stochastic Gradient Descent (Machine Learning)

  2. Support vector machines
     • Training by maximizing margin
     • The SVM objective
     • Solving the SVM optimization problem
     • Support vectors, duals and kernels

  3. SVM objective function
     Regularization term (maximize the margin):
     • Imposes a preference over the hypothesis space and pushes for better generalization
     • Can be replaced with other regularization terms which impose other preferences
     Empirical loss (hinge loss):
     • Penalizes weight vectors that make mistakes
     • Can be replaced with other loss functions which impose other preferences
     A hyper-parameter controls the tradeoff between a large margin and a small hinge-loss.
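The objective itself appears only as an image in the original slide; a standard soft-margin form consistent with this description (a margin-maximizing regularizer plus the hinge loss, with the trade-off hyper-parameter written C here) is:

$$\min_{w}\; J(w) \;=\; \frac{1}{2}\, w^{\top} w \;+\; C \sum_{i} \max\bigl(0,\; 1 - y_i\, w^{\top} x_i \bigr)$$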

  4. Outline: Training SVM by optimization
     1. Review of convex functions and gradient descent
     2. Stochastic gradient descent
     3. Gradient descent vs stochastic gradient descent
     4. Sub-derivatives of the hinge loss
     5. Stochastic sub-gradient descent for SVM
     6. Comparison to perceptron

  6. Solving the SVM optimization problem: the objective J(w) above is convex in w.

  7. Recall: Convex functions
     A function g is convex if for every v, w in the domain, and for every μ ∈ [0,1], we have
     g(μv + (1−μ)w) ≤ μg(v) + (1−μ)g(w).
     From a geometric perspective: every tangent plane lies below the function.
     [figure: f plotted between two points u and v, with values f(u) and f(v)]
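For differentiable g, the geometric statement is the first-order characterization of convexity (a standard fact, written out here because the slide shows it only as a picture):

$$g(v) \;\ge\; g(w) + \nabla g(w)^{\top} (v - w) \quad \text{for all } v, w,$$

i.e., the tangent at any point lies below the graph.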

  9. Convex functions
     Linear functions are convex; a maximum of convex functions is convex.
     Some ways to show that a function is convex:
     1. Using the definition of convexity
     2. Showing that the second derivative is non-negative (for one-dimensional functions)
     3. Showing that the matrix of second derivatives (the Hessian) is positive semi-definite (for vector functions)
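As a concrete instance of these tests: the hinge loss is a maximum of two linear, hence convex, functions of w, so it is convex by the max rule:

$$\ell(w) = \max\bigl(0,\; 1 - y\, w^{\top} x\bigr),$$

and in one dimension, g(x) = x^2 is convex because g''(x) = 2 ≥ 0.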

  10. Not all functions are convex
      [figure panels: some functions are concave, others are neither convex nor concave]
      Concave functions satisfy the reverse inequality:
      g(μv + (1−μ)w) ≥ μg(v) + (1−μ)g(w)

  11. Convex functions are convenient
      A function g is convex if for every v, w in the domain, and for every μ ∈ [0,1], we have
      g(μv + (1−μ)w) ≤ μg(v) + (1−μ)g(w).
      In general, a necessary condition for x to be a minimum of the function f is ∇f(x) = 0.
      For convex functions, this is both necessary and sufficient.

  12. Solving the SVM optimization problem
      This function is convex in w.
      • This is a quadratic optimization problem because the objective is quadratic
      • Older methods: used techniques from Quadratic Programming
        – Very slow
      • No constraints, so we can use gradient descent
        – Still very slow!

  13. Gradient descent
      We are trying to minimize J(w). General strategy for minimizing a function J(w):
      • Start with an initial guess for w, say w_0
      • Iterate till convergence:
        – Compute the gradient of J at w_t
        – Update w_t to get w_{t+1} by taking a step in the opposite direction of the gradient
      Intuition: the gradient is the direction of steepest increase in the function. To get to the minimum, go in the opposite direction.
      [figure: successive iterates w_0, w_1, w_2, w_3 descending the curve J(w), built up over slides 13–16]
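Written out as a formula (the update itself is an image in the deck; this matches the learning-rate notation r introduced on slide 17), one step of this strategy is:

$$w_{t+1} = w_t - r\, \nabla J(w_t)$$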

  17. Gradient descent for SVM
      We are trying to minimize J(w).
      1. Initialize w_0
      2. For t = 0, 1, 2, ….
         1. Compute the gradient of J(w) at w_t. Call it ∇J(w_t)
         2. Update w as follows: w_{t+1} ← w_t − r ∇J(w_t)
      r: called the learning rate.
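A minimal NumPy sketch of this loop, assuming the soft-margin objective written after slide 3; the hyper-parameters C, r, T and the function names are illustrative choices, not from the slides:

    import numpy as np

    def svm_subgradient(w, X, y, C):
        # Sub-gradient of J(w) = 1/2 ||w||^2 + C * sum_i max(0, 1 - y_i w.x_i):
        # w itself, plus -C * y_i * x_i for every margin-violating example.
        violated = y * (X @ w) < 1.0
        return w - C * (y[violated][:, None] * X[violated]).sum(axis=0)

    def gradient_descent(X, y, C=1.0, r=0.01, T=1000):
        # Batch gradient descent: every update touches the whole training set.
        w = np.zeros(X.shape[1])
        for _ in range(T):
            w = w - r * svm_subgradient(w, X, y, C)
        return w

As slide 19 notes, each update here sums over the entire training set, which is exactly what stochastic gradient descent avoids.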

  18. Outline: Training SVM by optimization
      ✓ Review of convex functions and gradient descent
      2. Stochastic gradient descent
      3. Gradient descent vs stochastic gradient descent
      4. Sub-derivatives of the hinge loss
      5. Stochastic sub-gradient descent for SVM
      6. Comparison to perceptron

  19. Gradient descent for SVM
      We are trying to minimize J(w).
      1. Initialize w_0
      2. For t = 0, 1, 2, ….
         1. Compute the gradient of J(w) at w_t. Call it ∇J(w_t)
         2. Update: w_{t+1} ← w_t − r ∇J(w_t)
      r: called the learning rate.
      Problem: the gradient of the SVM objective requires summing over the entire training set. Slow, does not really scale.

  20. Stochastic gradient descent for SVM
      Given a training set S = {(x_i, y_i)}, x ∈ ℝ^n, y ∈ {−1, 1}
      1. Initialize w_0 = 0 ∈ ℝ^n
      2. For epoch = 1 … T:
         1. Pick a random example (x_i, y_i) from the training set S
         2. Treat (x_i, y_i) as a full dataset and take the derivative of the SVM objective at the current w_{t−1} to be ∇J_t(w_{t−1})
         3. Update: w_t ← w_{t−1} − γ_t ∇J_t(w_{t−1})
      3. Return final w

  25. Stochastic gradient descent for SVM (notes on the algorithm above)
      • This algorithm is guaranteed to converge to the minimum of J if γ_t is small enough. Why? The objective J(w) is a convex function.
      • But what is the gradient of the hinge loss with respect to w? (The hinge loss is not a differentiable function!)
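The deck takes this question up under "sub-derivatives of the hinge loss"; for reference, a standard per-example sub-gradient, assuming the per-example objective J_t(w) = 1/2 w^T w + C max(0, 1 − y_i w^T x_i) (the exact scaling is not fixed by the extracted slides), is:

$$\nabla J_t(w) = \begin{cases} w - C\, y_i x_i & \text{if } y_i\, w^{\top} x_i < 1 \\ w & \text{otherwise,} \end{cases}$$

with any convex combination of the two valid at the kink y_i w^T x_i = 1. A minimal NumPy sketch of the loop above using this sub-gradient; the decay schedule γ_t = γ_0 / (1 + t), C, and all names are illustrative assumptions:

    import numpy as np

    def sgd_svm(X, y, C=1.0, T=10, gamma0=0.1):
        # Stochastic sub-gradient descent: each update uses one random example.
        n, d = X.shape
        w = np.zeros(d)                          # w_0 = 0
        t = 0
        for epoch in range(T):
            for i in np.random.permutation(n):   # pick examples in random order
                gamma = gamma0 / (1.0 + t)       # gamma_t: one common decaying choice
                if y[i] * (X[i] @ w) < 1:        # margin violated: hinge term active
                    grad = w - C * y[i] * X[i]
                else:                            # hinge term zero: only the regularizer
                    grad = w
                w = w - gamma * grad             # w_t <- w_{t-1} - gamma_t * grad
                t += 1
        return w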

  26. Outline: Training SVM by optimization
      ✓ Review of convex functions and gradient descent
      ✓ Stochastic gradient descent
      3. Gradient descent vs stochastic gradient descent
      4. Sub-derivatives of the hinge loss
      5. Stochastic sub-gradient descent for SVM
      6. Comparison to perceptron

  27. Gradient Descent vs SGD
      [figure: gradient descent]

  28. Gradient Descent vs SGD
      [figure: stochastic gradient descent, built up over slides 28–34]
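The point of this figure sequence, stated in symbols since only the slide titles survive extraction: one gradient descent update costs a pass over all N training examples, O(Nd) for d-dimensional inputs, while one stochastic update costs O(d), so SGD trades a few exact, expensive steps for many cheap, noisy ones.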
