Support Vector Machines 290N, 2014
Support Vector Machines (SVM)
Supervised learning methods for classification and regression. They can represent non-linear functions and have an efficient training algorithm.
Derived from statistical learning theory by Vapnik and Chervonenkis (COLT-92).
SVMs entered the mainstream because of their exceptional performance in handwritten digit recognition: a 1.1% error rate, comparable to a very carefully constructed (and complex) ANN.
Two-Class Problem: Linearly Separable Case
Many decision boundaries can separate these two classes (Class 1 and Class 2). Which one should we choose?
Example of Bad Decision Boundaries
(Figure: two examples of decision boundaries between Class 1 and Class 2 that pass very close to one of the classes.)
Another intuition
If you have to place a fat separator between the classes, you have fewer choices, and so the capacity of the model has been decreased.
Support Vector Machine (SVM)
SVMs maximize the margin around the separating hyperplane, a.k.a. large margin classifiers.
The decision function is fully specified by a subset of the training samples, the support vectors.
Maximizing the margin becomes a quadratic programming problem.
Training examples for document ranking
Two ranking signals are used: the cosine text similarity score and the proximity of term appearances (window size).

Example | DocID | Query                   | Cosine score | Term proximity window | Judgment
1       | 37    | linux operating system  | 0.032        | 3                     | relevant
2       | 37    | penguin logo            | 0.02         | 4                     | nonrelevant
3       | 238   | operating system        | 0.043        | 2                     | relevant
4       | 238   | runtime environment     | 0.004        | 2                     | nonrelevant
5       | 1741  | kernel layer            | 0.022        | 3                     | relevant
6       | 2094  | device driver           | 0.03         | 2                     | relevant
7       | 3191  | device driver           | 0.027        | 5                     | nonrelevant
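To make the setup concrete, here is a small sketch (using scikit-learn, which is my choice and not named on the slide) that encodes these rows as two-feature training vectors, with +1 for relevant and -1 for nonrelevant, and fits a linear classifier to them:

```python
import numpy as np
from sklearn.svm import SVC

# Each row: (cosine score, term proximity window); label +1 = relevant, -1 = nonrelevant
X = np.array([[0.032, 3], [0.020, 4], [0.043, 2], [0.004, 2],
              [0.022, 3], [0.030, 2], [0.027, 5]])
y = np.array([+1, -1, +1, -1, +1, +1, -1])

# Note: the two signals live on very different scales, so in practice
# feature scaling would usually be applied before training.
clf = SVC(kernel="linear", C=1.0).fit(X, y)
print(clf.coef_, clf.intercept_)   # weights on the two ranking signals and the offset
```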
Proposed scoring function for ranking
(Figure: the training examples plotted by term proximity (x-axis) and cosine score (y-axis), each labeled R (relevant) or N (nonrelevant).)
Formalization
w: weight coefficients
x_i: data point i
y_i: class label of data point i (+1 or -1)
Classifier: f(x_i) = sign(w^T x_i + b)
Functional margin of x_i: y_i (w^T x_i + b)
We can increase this margin simply by scaling w and b…
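A minimal NumPy sketch of these definitions (the weight vector, bias, and data points below are made-up illustrative values, not from the slides):

```python
import numpy as np

# Hypothetical linear classifier parameters (illustrative values only)
w = np.array([2.0, -1.0])   # weight coefficients
b = 0.5                     # bias term

# A few data points x_i with class labels y_i in {+1, -1}
X = np.array([[1.0, 0.0], [0.0, 2.0], [2.0, 1.0]])
y = np.array([+1, -1, +1])

scores = X @ w + b                 # w^T x_i + b
predictions = np.sign(scores)      # f(x_i) = sign(w^T x_i + b)
functional_margins = y * scores    # y_i (w^T x_i + b); > 0 means correctly classified

print(predictions)
print(functional_margins)          # note: doubling w and b doubles these margins
```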
Linear Support Vector Machine (SVM)
Hyperplane: w^T x + b = 0
Margin boundaries: w^T x + b = 1 and w^T x + b = -1 (so w^T x_a + b = 1 and w^T x_b + b = -1 for points x_a, x_b on the two boundaries)
Margin: ρ = ||x_a - x_b||_2 = 2/||w||_2
Support vectors: the datapoints that the margin pushes up against.
Geometric View: Margin of a Point
The distance from an example x to the separator is r = y (w^T x + b) / ||w||.
Examples closest to the hyperplane are support vectors.
The margin ρ of the separator is the width of separation between the support vectors of the two classes.
Geometric View of Margin
Distance from an example x to the separator: r = y (w^T x + b) / ||w||.
Derivation: let x lie on the line w^T x + b = z, and let x' be its projection onto the separating hyperplane (so w^T x' + b = 0). Since x - x' is parallel to w, (w^T x + b) - (w^T x' + b) = z - 0 gives ||w|| ||x - x'|| = |z| = y (w^T x + b), and thus ||w|| r = y (w^T x + b).
Linear Support Vector Machine (SVM)
For support vectors x_a and x_b on opposite margin boundaries of the hyperplane w^T x + b = 0: w^T x_a + b = 1 and w^T x_b + b = -1.
This implies w^T (x_a - x_b) = 2, so ρ = ||x_a - x_b||_2 = 2/||w||_2.
Support vectors: the datapoints that the margin pushes up against.
Linear SVM Mathematically
Assume that all data is at least distance 1 from the hyperplane; then the following two constraints hold for a training set {(x_i, y_i)}:
w^T x_i + b ≥ 1 if y_i = 1
w^T x_i + b ≤ -1 if y_i = -1
For support vectors, the inequality becomes an equality.
Then, since each example's distance from the hyperplane is r = y (w^T x + b) / ||w||, the margin of the dataset is ρ = 2/||w||.
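A short sketch of the geometric distance and the resulting margin, reusing the made-up w, b, and data from the earlier snippet:

```python
import numpy as np

w = np.array([2.0, -1.0])
b = 0.5
X = np.array([[1.0, 0.0], [0.0, 2.0], [2.0, 1.0]])
y = np.array([+1, -1, +1])

# Geometric distance of each example from the hyperplane: r = y (w^T x + b) / ||w||
r = y * (X @ w + b) / np.linalg.norm(w)

# If w, b were scaled so the closest points satisfy y (w^T x + b) = 1,
# the margin of the dataset would be 2 / ||w||
margin = 2.0 / np.linalg.norm(w)

print(r, margin)
```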
The Optimization Problem
Let {x_1, ..., x_n} be our data set and let y_i ∈ {1, -1} be the class label of x_i.
The decision boundary should classify all points correctly: y_i (w^T x_i + b) ≥ 1 for all i.
This gives a constrained optimization problem: minimize ½ ||w||^2 subject to y_i (w^T x_i + b) ≥ 1 for all i, where ||w||^2 = w^T w.
Lagrangian of the Original Problem
The Lagrangian is L(w, b, α) = ½ w^T w - Σ_i α_i [ y_i (w^T x_i + b) - 1 ], with Lagrangian multipliers α_i ≥ 0.
Note that ||w||^2 = w^T w.
Setting the gradient of L w.r.t. w and b to zero gives w = Σ_i α_i y_i x_i and Σ_i α_i y_i = 0.
The Dual Optimization Problem
We can transform the problem to its dual: maximize W(α) = Σ α_i - ½ ΣΣ α_i α_j y_i y_j x_i^T x_j subject to α_i ≥ 0 and Σ α_i y_i = 0.
The data appear only through dot products of the x's; the new variables α_i are the Lagrangian multipliers.
This is a convex quadratic programming (QP) problem, so the global maximum of the α_i can always be found, and well-established tools exist for solving it (e.g. CPLEX).
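As a sketch of handing this dual to an off-the-shelf QP solver, here is one way to do it with the cvxopt package (cvxopt, the helper name, the toy data, and the small ridge added to the quadratic term for numerical stability are my own choices, not the slides'):

```python
import numpy as np
from cvxopt import matrix, solvers

def hard_margin_dual(X, y):
    """Solve max_a sum(a) - 1/2 sum_ij a_i a_j y_i y_j x_i.x_j
    subject to a_i >= 0 and sum_i a_i y_i = 0, as a QP in standard form."""
    n = X.shape[0]
    K = X @ X.T                                        # matrix of dot products x_i^T x_j
    P = matrix(np.outer(y, y) * K + 1e-8 * np.eye(n))  # quadratic term (tiny ridge for stability)
    q = matrix(-np.ones(n))                            # minimize -sum(a)  <=>  maximize sum(a)
    G = matrix(-np.eye(n))                             # -a_i <= 0  <=>  a_i >= 0
    h = matrix(np.zeros(n))
    A = matrix(y.reshape(1, -1).astype(float))         # equality constraint sum_i a_i y_i = 0
    b = matrix(0.0)
    solvers.options["show_progress"] = False
    sol = solvers.qp(P, q, G, h, A, b)
    return np.ravel(sol["x"])                          # the alphas

# Tiny separable toy problem (made-up data)
X = np.array([[2.0, 2.0], [2.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
print(hard_margin_dual(X, y))   # non-zero entries correspond to support vectors
```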
A Geometrical Interpretation
(Figure: Class 1 and Class 2 points with the separating plane; each training point is annotated with its multiplier, e.g. α_1 = 0.8, α_6 = 1.4, α_8 = 0.6, and α = 0 for all other points.)
The α's with values different from zero correspond to the support vectors (they hold up the separating plane)!
The Optimization Problem Solution
The solution has the form: w = Σ α_i y_i x_i and b = y_k - w^T x_k for any x_k such that α_k ≠ 0.
Each non-zero α_i indicates that the corresponding x_i is a support vector.
Then the classifying function will have the form: f(x) = Σ α_i y_i x_i^T x + b
Notice that it relies on an inner product between the test point x and the support vectors x_i – we will return to this later.
Also keep in mind that solving the optimization problem involved computing the inner products x_i^T x_j between all pairs of training points.
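A short scikit-learn sketch of this solution structure (the toy data and the very large C used to approximate the hard-margin case are my own choices). SVC exposes α_i y_i for the support vectors as dual_coef_, so w, b, and f(x) can be reconstructed exactly as above:

```python
import numpy as np
from sklearn.svm import SVC

# Made-up linearly separable toy data
X = np.array([[2.0, 2.0], [2.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C approximates the hard margin

w = clf.dual_coef_ @ clf.support_vectors_     # w = sum_i alpha_i y_i x_i
print(w, clf.coef_)                           # the two agree for a linear kernel

x_new = np.array([1.5, 1.0])
# f(x) = sum_i alpha_i y_i x_i^T x + b, using only the support vectors
f = (clf.dual_coef_ @ (clf.support_vectors_ @ x_new) + clf.intercept_)[0]
print(np.sign(f), clf.decision_function([x_new]))   # the two scores agree
```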
Classification with SVMs
Given a new point (x_1, x_2), we can score its projection onto the hyperplane normal: in 2 dimensions, score = w_1 x_1 + w_2 x_2 + b.
I.e., compute the score w^T x + b = Σ α_i y_i x_i^T x + b.
Set a confidence threshold t:
Score > t: yes
Score < -t: no
Else: don't know
Soft Margin Classification
If the training set is not linearly separable, slack variables ξ_i can be added to allow misclassification of difficult or noisy examples.
Allow some errors: let some points be moved to where they belong, at a cost (the slack variables ξ_i, ξ_j in the figure).
Still, try to minimize training set errors and to place the hyperplane "far" from each class (large margin).
Soft margin
We allow an "error" ξ_i in classification; it is based on the output of the discriminant function w^T x + b.
Σ ξ_i approximates the number of misclassified samples.
New objective function: minimize ½ w^T w + C Σ ξ_i
C is a tradeoff parameter between error and margin, chosen by the user; a large C means a higher penalty on errors.
Soft Margin Classification Mathematically
The old formulation: Find w and b such that Φ(w) = ½ w^T w is minimized and, for all {(x_i, y_i)}, y_i (w^T x_i + b) ≥ 1.
The new formulation incorporating slack variables: Find w and b such that Φ(w) = ½ w^T w + C Σ ξ_i is minimized and, for all {(x_i, y_i)}, y_i (w^T x_i + b) ≥ 1 - ξ_i and ξ_i ≥ 0 for all i.
Parameter C can be viewed as a way to control overfitting – a regularization term.
The Optimization Problem
The dual of the problem is: maximize Σ α_i - ½ ΣΣ α_i α_j y_i y_j x_i^T x_j subject to Σ α_i y_i = 0 and 0 ≤ α_i ≤ C.
w is again recovered as w = Σ α_i y_i x_i.
The only difference from the linearly separable case is that there is an upper bound C on the α_i.
Once again, a QP solver can be used to find the α_i efficiently!
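Under the same assumptions as the earlier cvxopt sketch, the only change needed for the soft-margin dual is the extra upper bound α_i ≤ C, encoded by stacking a second set of inequality rows:

```python
import numpy as np
from cvxopt import matrix, solvers

def soft_margin_dual(X, y, C=1.0):
    """Sketch: solve max_a sum(a) - 1/2 sum_ij a_i a_j y_i y_j x_i.x_j
    subject to sum_i a_i y_i = 0 and 0 <= a_i <= C."""
    n = X.shape[0]
    K = X @ X.T
    P = matrix(np.outer(y, y) * K + 1e-8 * np.eye(n))
    q = matrix(-np.ones(n))
    # Stack the two one-sided constraints: -a_i <= 0 and a_i <= C
    G = matrix(np.vstack([-np.eye(n), np.eye(n)]))
    h = matrix(np.hstack([np.zeros(n), C * np.ones(n)]))
    A = matrix(y.reshape(1, -1).astype(float))
    b = matrix(0.0)
    solvers.options["show_progress"] = False
    sol = solvers.qp(P, q, G, h, A, b)
    return np.ravel(sol["x"])

# Noisy, non-separable toy data (made up)
X = np.array([[2.0, 2.0], [2.0, 3.0], [0.0, 0.0], [1.0, 0.0], [1.8, 1.9]])
y = np.array([1.0, 1.0, -1.0, -1.0, -1.0])
print(soft_margin_dual(X, y, C=10.0))   # alphas at the bound C mark margin violators
```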
Soft Margin Classification – Solution
The dual problem for soft margin classification:
Find α_1 … α_N such that Q(α) = Σ α_i - ½ ΣΣ α_i α_j y_i y_j x_i^T x_j is maximized and (1) Σ α_i y_i = 0, (2) 0 ≤ α_i ≤ C for all α_i.
Neither the slack variables ξ_i nor their Lagrange multipliers appear in the dual problem!
Again, x_i with non-zero α_i will be support vectors.
Solution to the dual problem: w = Σ α_i y_i x_i and b = y_k (1 - ξ_k) - w^T x_k where k = argmax_k α_k.
But w is not needed explicitly for classification: f(x) = Σ α_i y_i x_i^T x + b
Linear SVMs: Summary
The classifier is a separating hyperplane.
The most "important" training points are the support vectors; they define the hyperplane.
Quadratic optimization algorithms can identify which training points x_i are support vectors, i.e. those with non-zero Lagrangian multipliers α_i.
Both in the dual formulation of the problem and in the solution, training points appear only inside inner products:
Find α_1 … α_N such that Q(α) = Σ α_i - ½ ΣΣ α_i α_j y_i y_j x_i^T x_j is maximized and (1) Σ α_i y_i = 0, (2) 0 ≤ α_i ≤ C for all α_i
f(x) = Σ α_i y_i x_i^T x + b
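In practice this is all wrapped up in standard libraries. A short scikit-learn sketch (made-up overlapping data; the particular C values are arbitrary) showing how the soft-margin parameter C trades margin width against errors:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two noisy, overlapping classes (made-up data)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(2.0, 1.0, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_)   # margin width 2/||w||
    print(f"C={C:>6}: {clf.support_vectors_.shape[0]} support vectors, margin={margin:.2f}")

# Small C -> wider margin, more support vectors (more slack allowed);
# large C -> narrower margin, fewer support vectors (errors penalized heavily).
```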
Non-linear SVMs
Datasets that are linearly separable (with some noise) work out great.
But what are we going to do if the dataset is just too hard?
How about mapping the data to a higher-dimensional space?
(Figure: 1-D points on the x axis that are not linearly separable become separable when mapped into the (x, x^2) plane.)
Non-linear SVMs: Feature Spaces
General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable: Φ : x → φ(x)
Transformation to Feature Space
"Kernel tricks": make the non-separable problem separable by mapping the data into a better representational space.
(Figure: points in the input space mapped by φ(·) into the feature space.)
Modification Due to the Kernel Function
Change all inner products to kernel functions.
For training: the original formulation uses the inner products x_i^T x_j; with a kernel function it uses K(x_i, x_j) = φ(x_i)^T φ(x_j) instead.
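A minimal sketch of this substitution (the polynomial kernel and its degree are just an example choice): in the dual, the matrix of inner products is replaced by a kernel Gram matrix, and the classifier becomes f(x) = Σ α_i y_i K(x_i, x) + b.

```python
import numpy as np

def polynomial_kernel(X1, X2, degree=2):
    """K(x, y) = (1 + x^T y)^degree, computed for all pairs of rows."""
    return (1.0 + X1 @ X2.T) ** degree

X = np.array([[2.0, 2.0], [2.0, 3.0], [0.0, 0.0], [1.0, 0.0]])

K_linear = X @ X.T                 # inner products x_i^T x_j (original formulation)
K_poly = polynomial_kernel(X, X)   # kernel values K(x_i, x_j) used in their place

print(K_linear.shape, K_poly.shape)   # same shape; only the entries change
```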
Example Transformation
Consider a transformation φ(·) of the input and define the kernel function K(x, y) so that K(x, y) = φ(x)^T φ(y).
The inner product φ(x)^T φ(y) can then be computed by K without going through the map φ(·) explicitly!
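The slide's concrete transformation and kernel are not reproduced in the text, so here is the textbook instance of this idea, stated as an assumption rather than as the slide's own example: φ((x1, x2)) = (1, √2·x1, √2·x2, x1², x2², √2·x1·x2) with K(x, y) = (1 + x^T y)². The sketch below checks numerically that φ(x)^T φ(y) = K(x, y).

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for a 2-D input (assumed example, not from the slides)."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1,
                     np.sqrt(2) * x2,
                     x1 ** 2,
                     x2 ** 2,
                     np.sqrt(2) * x1 * x2])

def K(x, y):
    """Polynomial kernel K(x, y) = (1 + x^T y)^2."""
    return (1.0 + x @ y) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

print(phi(x) @ phi(y))   # inner product in the 6-D feature space
print(K(x, y))           # same value, computed directly in the 2-D input space
```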