
Support Vector Machines: Training with Stochastic Gradient Descent

  1. Support Vector Machines: Training with Stochastic Gradient Descent (Machine Learning)

  2. Support vector machines
     • Training by maximizing margin
     • The SVM objective
     • Solving the SVM optimization problem
     • Support vectors, duals and kernels

  3. SVM objective function
     Regularization term (maximize the margin):
     • Imposes a preference over the hypothesis space and pushes for better generalization
     • Can be replaced with other regularization terms which impose other preferences
     Empirical loss (hinge loss):
     • Penalizes weight vectors that make mistakes
     • Can be replaced with other loss functions which impose other preferences
     A hyper-parameter controls the tradeoff between a large margin and a small hinge-loss.
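The objective itself appears only as an image in the original slide; a standard soft-margin form consistent with this description (a margin-maximizing regularizer plus the hinge loss, with the trade-off hyper-parameter written C here) is:

$$\min_{w}\; J(w) \;=\; \frac{1}{2}\, w^{\top} w \;+\; C \sum_{i} \max\bigl(0,\; 1 - y_i\, w^{\top} x_i \bigr)$$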

  4. Outline: Training SVM by optimization
     1. Review of convex functions and gradient descent
     2. Stochastic gradient descent
     3. Gradient descent vs stochastic gradient descent
     4. Sub-derivatives of the hinge loss
     5. Stochastic sub-gradient descent for SVM
     6. Comparison to perceptron

  6. Solving the SVM optimization problem: the objective J(w) above is convex in w.

  7. Recall: Convex functions
     A function g is convex if for every v, w in the domain, and for every μ ∈ [0,1], we have
     g(μv + (1−μ)w) ≤ μg(v) + (1−μ)g(w).
     From a geometric perspective: every tangent plane lies below the function.
     [figure: f plotted between two points u and v, with values f(u) and f(v)]
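For differentiable g, the geometric statement is the first-order characterization of convexity (a standard fact, written out here because the slide shows it only as a picture):

$$g(v) \;\ge\; g(w) + \nabla g(w)^{\top} (v - w) \quad \text{for all } v, w,$$

i.e., the tangent at any point lies below the graph.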

  9. Convex functions
     Linear functions are convex; a maximum of convex functions is convex.
     Some ways to show that a function is convex:
     1. Using the definition of convexity
     2. Showing that the second derivative is non-negative (for one-dimensional functions)
     3. Showing that the matrix of second derivatives (the Hessian) is positive semi-definite (for vector functions)
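As a concrete instance of these tests: the hinge loss is a maximum of two linear, hence convex, functions of w, so it is convex by the max rule:

$$\ell(w) = \max\bigl(0,\; 1 - y\, w^{\top} x\bigr),$$

and in one dimension, g(x) = x^2 is convex because g''(x) = 2 ≥ 0.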

  10. Not all functions are convex
      [figure panels: some functions are concave, others are neither convex nor concave]
      Concave functions satisfy the reverse inequality:
      g(μv + (1−μ)w) ≥ μg(v) + (1−μ)g(w)

  11. Convex functions are convenient
      A function g is convex if for every v, w in the domain, and for every μ ∈ [0,1], we have
      g(μv + (1−μ)w) ≤ μg(v) + (1−μ)g(w).
      In general, a necessary condition for x to be a minimum of the function f is ∇f(x) = 0.
      For convex functions, this is both necessary and sufficient.

  12. Solving the SVM optimization problem
      This function is convex in w.
      • This is a quadratic optimization problem because the objective is quadratic
      • Older methods: used techniques from Quadratic Programming
        – Very slow
      • No constraints, so we can use gradient descent
        – Still very slow!

  13. Gradient descent
      We are trying to minimize J(w). General strategy for minimizing a function J(w):
      • Start with an initial guess for w, say w_0
      • Iterate till convergence:
        – Compute the gradient of J at w_t
        – Update w_t to get w_{t+1} by taking a step in the opposite direction of the gradient
      Intuition: the gradient is the direction of steepest increase in the function. To get to the minimum, go in the opposite direction.
      [figure: successive iterates w_0, w_1, w_2, w_3 descending the curve J(w), built up over slides 13–16]
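Written out as a formula (the update itself is an image in the deck; this matches the learning-rate notation r introduced on slide 17), one step of this strategy is:

$$w_{t+1} = w_t - r\, \nabla J(w_t)$$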

  17. Gradient descent for SVM
      We are trying to minimize J(w).
      1. Initialize w_0
      2. For t = 0, 1, 2, ….
         1. Compute the gradient of J(w) at w_t. Call it ∇J(w_t)
         2. Update w as follows: w_{t+1} ← w_t − r ∇J(w_t)
      r: called the learning rate.
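A minimal NumPy sketch of this loop, assuming the soft-margin objective written after slide 3; the hyper-parameters C, r, T and the function names are illustrative choices, not from the slides:

    import numpy as np

    def svm_subgradient(w, X, y, C):
        # Sub-gradient of J(w) = 1/2 ||w||^2 + C * sum_i max(0, 1 - y_i w.x_i):
        # w itself, plus -C * y_i * x_i for every margin-violating example.
        violated = y * (X @ w) < 1.0
        return w - C * (y[violated][:, None] * X[violated]).sum(axis=0)

    def gradient_descent(X, y, C=1.0, r=0.01, T=1000):
        # Batch gradient descent: every update touches the whole training set.
        w = np.zeros(X.shape[1])
        for _ in range(T):
            w = w - r * svm_subgradient(w, X, y, C)
        return w

As slide 19 notes, each update here sums over the entire training set, which is exactly what stochastic gradient descent avoids.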

  18. Outline: Training SVM by optimization
      ✓ Review of convex functions and gradient descent
      2. Stochastic gradient descent
      3. Gradient descent vs stochastic gradient descent
      4. Sub-derivatives of the hinge loss
      5. Stochastic sub-gradient descent for SVM
      6. Comparison to perceptron

  19. Gradient descent for SVM
      We are trying to minimize J(w).
      1. Initialize w_0
      2. For t = 0, 1, 2, ….
         1. Compute the gradient of J(w) at w_t. Call it ∇J(w_t)
         2. Update: w_{t+1} ← w_t − r ∇J(w_t)
      r: called the learning rate.
      Problem: the gradient of the SVM objective requires summing over the entire training set. Slow, does not really scale.

  20. Stochastic gradient descent for SVM
      Given a training set S = {(x_i, y_i)}, x ∈ ℝ^n, y ∈ {−1, 1}
      1. Initialize w_0 = 0 ∈ ℝ^n
      2. For epoch = 1 … T:
         1. Pick a random example (x_i, y_i) from the training set S
         2. Treat (x_i, y_i) as a full dataset and take the derivative of the SVM objective at the current w_{t−1} to be ∇J_t(w_{t−1})
         3. Update: w_t ← w_{t−1} − γ_t ∇J_t(w_{t−1})
      3. Return final w

  25. Stochastic gradient descent for SVM (notes on the algorithm above)
      • This algorithm is guaranteed to converge to the minimum of J if γ_t is small enough. Why? The objective J(w) is a convex function.
      • But what is the gradient of the hinge loss with respect to w? (The hinge loss is not a differentiable function!)
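The deck takes this question up under "sub-derivatives of the hinge loss"; for reference, a standard per-example sub-gradient, assuming the per-example objective J_t(w) = 1/2 w^T w + C max(0, 1 − y_i w^T x_i) (the exact scaling is not fixed by the extracted slides), is:

$$\nabla J_t(w) = \begin{cases} w - C\, y_i x_i & \text{if } y_i\, w^{\top} x_i < 1 \\ w & \text{otherwise,} \end{cases}$$

with any convex combination of the two valid at the kink y_i w^T x_i = 1. A minimal NumPy sketch of the loop above using this sub-gradient; the decay schedule γ_t = γ_0 / (1 + t), C, and all names are illustrative assumptions:

    import numpy as np

    def sgd_svm(X, y, C=1.0, T=10, gamma0=0.1):
        # Stochastic sub-gradient descent: each update uses one random example.
        n, d = X.shape
        w = np.zeros(d)                          # w_0 = 0
        t = 0
        for epoch in range(T):
            for i in np.random.permutation(n):   # pick examples in random order
                gamma = gamma0 / (1.0 + t)       # gamma_t: one common decaying choice
                if y[i] * (X[i] @ w) < 1:        # margin violated: hinge term active
                    grad = w - C * y[i] * X[i]
                else:                            # hinge term zero: only the regularizer
                    grad = w
                w = w - gamma * grad             # w_t <- w_{t-1} - gamma_t * grad
                t += 1
        return w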

  26. Outline: Training SVM by optimization
      ✓ Review of convex functions and gradient descent
      ✓ Stochastic gradient descent
      3. Gradient descent vs stochastic gradient descent
      4. Sub-derivatives of the hinge loss
      5. Stochastic sub-gradient descent for SVM
      6. Comparison to perceptron

  27. Gradient Descent vs SGD
      [figure: gradient descent]

  28. Gradient Descent vs SGD
      [figure: stochastic gradient descent, built up over slides 28–34]
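The point of this figure sequence, stated in symbols since only the slide titles survive extraction: one gradient descent update costs a pass over all N training examples, O(Nd) for d-dimensional inputs, while one stochastic update costs O(d), so SGD trades a few exact, expensive steps for many cheap, noisy ones.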
