Support Vector Machines: Training with Stochastic Gradient Descent

Machine Learning
Support vector machines
• Training by maximizing margin
• The SVM objective
• Solving the SVM optimization problem
• Support vectors, duals and kernels
SVM objective function

min over 𝒘:  J(𝒘) = ½ 𝒘ᵀ𝒘 + C Σᵢ max(0, 1 − yᵢ 𝒘ᵀ𝒙ᵢ)

Regularization term (maximize the margin):
• Imposes a preference over the hypothesis space and pushes for better generalization
• Can be replaced with other regularization terms which impose other preferences

Empirical loss (hinge loss):
• Penalizes weight vectors that make mistakes
• Can be replaced with other loss functions which impose other preferences

C: a hyper-parameter that controls the tradeoff between a large margin and a small hinge loss
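To make the two terms concrete, here is a minimal Python sketch of this objective (our own illustration, not from the slides; the names svm_objective, X, y, C follow the notation above):

```python
import numpy as np

def svm_objective(w, X, y, C):
    """J(w) = 1/2 ||w||^2 + C * sum_i max(0, 1 - y_i w^T x_i)."""
    margins = y * (X @ w)                    # y_i w^T x_i for every example
    hinge = np.maximum(0.0, 1.0 - margins)   # per-example hinge loss
    return 0.5 * w @ w + C * hinge.sum()     # regularizer + empirical loss
```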
Outline: Training SVM by optimization
1. Review of convex functions and gradient descent
2. Stochastic gradient descent
3. Gradient descent vs stochastic gradient descent
4. Sub-derivatives of the hinge loss
5. Stochastic sub-gradient descent for SVM
6. Comparison to perceptron
Solving the SVM optimization problem

min over 𝒘:  J(𝒘) = ½ 𝒘ᵀ𝒘 + C Σᵢ max(0, 1 − yᵢ 𝒘ᵀ𝒙ᵢ)

This function is convex in 𝒘.
Recall: Convex functions

A function 𝑔 is convex if for every 𝒗, 𝒘 in the domain, and for every 𝜇 ∈ [0,1], we have

𝑔(𝜇𝒗 + (1 − 𝜇)𝒘) ≤ 𝜇𝑔(𝒗) + (1 − 𝜇)𝑔(𝒘)

From a geometric perspective: every tangent plane lies below the function.

[Figure: a convex curve with points v, u and values f(v), f(u); the chord between them lies above the curve]
Convex functions

Examples: linear functions are convex; the max function is convex.

Some ways to show that a function is convex:
1. Using the definition of convexity
2. Showing that the second derivative is positive (for one-dimensional functions)
3. Showing that the matrix of second derivatives (the Hessian) is positive semi-definite (for vector functions)
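As a quick sanity check of method 1, here is a small Python sketch (our own illustration) that numerically tests the convexity inequality for the one-dimensional hinge function max(0, 1 − x) at random points:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: max(0.0, 1.0 - x)   # one-dimensional hinge function

for _ in range(10_000):
    v, w = rng.uniform(-5, 5, size=2)
    mu = rng.uniform(0, 1)
    # Convexity: f(mu*v + (1 - mu)*w) <= mu*f(v) + (1 - mu)*f(w)
    assert f(mu * v + (1 - mu) * w) <= mu * f(v) + (1 - mu) * f(w) + 1e-12
print("Convexity inequality held at all sampled points.")
```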
Not all functions are convex

[Figure: example curves, some concave, some neither convex nor concave]

A function 𝑔 is concave if the inequality is reversed:

𝑔(𝜇𝒗 + (1 − 𝜇)𝒘) ≥ 𝜇𝑔(𝒗) + (1 − 𝜇)𝑔(𝒘)
Convex functions are convenient

A function 𝑔 is convex if for every 𝒗, 𝒘 in the domain, and for every 𝜇 ∈ [0,1], we have

𝑔(𝜇𝒗 + (1 − 𝜇)𝒘) ≤ 𝜇𝑔(𝒗) + (1 − 𝜇)𝑔(𝒘)

[Figure: a convex curve with points v, u and values f(v), f(u)]

In general, a necessary condition for 𝒙 to be a minimum of the function 𝑓 is ∇𝑓(𝒙) = 0. For convex functions, this condition is both necessary and sufficient.
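A one-line worked example (our own illustration): for the convex function f(x) = (x − 3)², setting the gradient to zero, f′(x) = 2(x − 3) = 0, gives x = 3; because f is convex, this point is guaranteed to be the global minimum, with no further checks needed.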
Solving the SVM optimization problem

The objective J(𝒘) is convex in 𝒘.

• This is a quadratic optimization problem because the objective is quadratic.
• Older methods used techniques from Quadratic Programming.
  – Very slow
• There are no constraints, so we can use gradient descent.
  – Still very slow!
Gradient descent

We are trying to minimize J(𝒘).

General strategy for minimizing a function J(𝒘):
• Start with an initial guess for 𝒘, say 𝒘⁰
• Iterate till convergence:
  – Compute the gradient of J at 𝒘ᵗ
  – Update 𝒘ᵗ to get 𝒘ᵗ⁺¹ by taking a step in the opposite direction of the gradient

Intuition: the gradient is the direction of steepest increase in the function. To get to the minimum, go in the opposite direction.

[Figure: successive iterates 𝒘⁰, 𝒘¹, 𝒘², 𝒘³ stepping down the curve of J(𝒘), built up over four slides]
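A minimal sketch of this strategy in Python (our own illustration, assuming a fixed learning rate in place of a convergence test), applied to the simple convex function J(w) = (w − 3)²:

```python
def gradient_descent_1d(grad, w0=0.0, lr=0.1, steps=100):
    """Minimize a one-dimensional function given its gradient, using a fixed step size."""
    w = w0                       # initial guess w^0
    for _ in range(steps):
        w = w - lr * grad(w)     # step in the direction opposite the gradient
    return w

# J(w) = (w - 3)^2 has gradient 2(w - 3); the minimum is at w = 3.
print(gradient_descent_1d(lambda w: 2 * (w - 3)))   # approximately 3.0
```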
Gradient descent for SVM

We are trying to minimize J(𝒘).

1. Initialize 𝒘⁰
2. For t = 0, 1, 2, ….
   1. Compute the gradient of J(𝒘) at 𝒘ᵗ. Call it ∇J(𝒘ᵗ)
   2. Update: 𝒘ᵗ⁺¹ ← 𝒘ᵗ − γₜ ∇J(𝒘ᵗ)

γₜ: called the learning rate.
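A hedged Python sketch of this loop for the objective above (our own illustration; since the hinge loss is not differentiable everywhere, we use a sub-gradient at the kink, a point later slides return to):

```python
import numpy as np

def svm_subgradient(w, X, y, C):
    """(Sub-)gradient of J(w) = 1/2 ||w||^2 + C * sum_i max(0, 1 - y_i w^T x_i)."""
    margins = y * (X @ w)
    active = margins < 1                                   # examples with nonzero hinge loss
    return w - C * (y[active, None] * X[active]).sum(axis=0)

def svm_gradient_descent(X, y, C=1.0, lr=0.01, steps=1000):
    w = np.zeros(X.shape[1])                               # initialize w^0
    for _ in range(steps):
        w = w - lr * svm_subgradient(w, X, y, C)           # w^{t+1} = w^t - lr * gradient
    return w
```

Note that each call to svm_subgradient touches every training example, which is exactly the scaling problem the next slides address.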
Outline: Training SVM by optimization
✓ Review of convex functions and gradient descent
2. Stochastic gradient descent
3. Gradient descent vs stochastic gradient descent
4. Sub-derivatives of the hinge loss
5. Stochastic sub-gradient descent for SVM
6. Comparison to perceptron
Gradient descent for SVM

We are trying to minimize J(𝒘).

1. Initialize 𝒘⁰
2. For t = 0, 1, 2, ….
   1. Compute the gradient of J(𝒘) at 𝒘ᵗ. Call it ∇J(𝒘ᵗ)
   2. Update: 𝒘ᵗ⁺¹ ← 𝒘ᵗ − γₜ ∇J(𝒘ᵗ)

γₜ: called the learning rate.

The gradient of the SVM objective requires summing over the entire training set: slow, does not really scale.
Stochastic gradient descent for SVM

Given a training set S = {(𝒙ᵢ, yᵢ)}, 𝒙 ∈ ℝⁿ, y ∈ {−1, 1}

1. Initialize 𝒘⁰ = 0 ∈ ℝⁿ
2. For epoch = 1 … T:
   1. Pick a random example (𝒙ᵢ, yᵢ) from the training set S
   2. Treat (𝒙ᵢ, yᵢ) as a full dataset and take the derivative of the SVM objective at the current 𝒘ᵗ⁻¹ to be ∇Jᵗ(𝒘ᵗ⁻¹)
   3. Update: 𝒘ᵗ ← 𝒘ᵗ⁻¹ − γₜ ∇Jᵗ(𝒘ᵗ⁻¹)
3. Return final 𝒘
Stochastic gradient descent for SVM (continued)

What is the gradient of the hinge loss with respect to 𝒘? (The hinge loss is not a differentiable function!)

This algorithm is guaranteed to converge to the minimum of J if γₜ is small enough. Why? The objective J(𝒘) is a convex function.
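A minimal Python sketch of this loop (our own illustration): treating a single example as the whole dataset gives the per-step objective Jₜ(𝒘) = ½‖𝒘‖² + C·max(0, 1 − yᵢ 𝒘ᵀ𝒙ᵢ), and at the hinge's kink we use a sub-gradient, the topic taken up next. Shuffling once per epoch is a common stand-in for picking examples at random.

```python
import numpy as np

def sgd_svm(X, y, C=1.0, T=10, lr=0.01, seed=0):
    """Stochastic sub-gradient descent on J_t(w) = 1/2 ||w||^2 + C * max(0, 1 - y_i w^T x_i)."""
    rng = np.random.default_rng(seed)
    N, n = X.shape
    w = np.zeros(n)                            # w^0 = 0
    for _ in range(T):                         # epochs
        for i in rng.permutation(N):           # one example at a time, in random order
            if y[i] * (X[i] @ w) < 1:          # hinge active: sub-gradient is w - C*y_i*x_i
                g = w - C * y[i] * X[i]
            else:                              # hinge inactive: gradient is just w
                g = w
            w = w - lr * g                     # w^t = w^{t-1} - lr * g
    return w
```

Each update here costs O(n) regardless of the training set size N, which is what makes the stochastic version scale where full gradient descent does not.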
Outline: Training SVM by optimization
✓ Review of convex functions and gradient descent
✓ Stochastic gradient descent
3. Gradient descent vs stochastic gradient descent
4. Sub-derivatives of the hinge loss
5. Stochastic sub-gradient descent for SVM
6. Comparison to perceptron
Gradient Descent vs SGD

[Figure: trajectory of gradient descent on the objective]
[Figure: trajectory of stochastic gradient descent, built up step by step over successive slides]