Support Vector Machines 290N, 2014
Support Vector Machines (SVM)
Supervised learning methods for classification and regression. They can represent non-linear functions and have an efficient training algorithm.
Derived from statistical learning theory by Vapnik and Chervonenkis (COLT-92).
SVMs entered the mainstream because of their exceptional performance in handwritten digit recognition: a 1.1% error rate, comparable to a very carefully constructed (and complex) ANN.
Two-Class Problem: Linearly Separable Case
Many decision boundaries can separate these two classes (Class 1 and Class 2). Which one should we choose?
Example of Bad Decision Boundaries
(Figure: two examples of decision boundaries between Class 1 and Class 2 that pass very close to one of the classes.)
Another intuition
If you have to place a fat separator between the classes, you have fewer choices, and so the capacity of the model has been decreased.
Support Vector Machine (SVM)
SVMs maximize the margin around the separating hyperplane, a.k.a. large margin classifiers.
The decision function is fully specified by a subset of the training samples, the support vectors.
Maximizing the margin becomes a quadratic programming problem.
Training examples for document ranking
Two ranking signals are used: the cosine text similarity score and the proximity of term appearances (window size).

Example | DocID | Query                   | Cosine score | Term proximity window | Judgment
1       | 37    | linux operating system  | 0.032        | 3                     | relevant
2       | 37    | penguin logo            | 0.02         | 4                     | nonrelevant
3       | 238   | operating system        | 0.043        | 2                     | relevant
4       | 238   | runtime environment     | 0.004        | 2                     | nonrelevant
5       | 1741  | kernel layer            | 0.022        | 3                     | relevant
6       | 2094  | device driver           | 0.03         | 2                     | relevant
7       | 3191  | device driver           | 0.027        | 5                     | nonrelevant
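To make the setup concrete, here is a small sketch (using scikit-learn, which is my choice and not named on the slide) that encodes these rows as two-feature training vectors, with +1 for relevant and -1 for nonrelevant, and fits a linear classifier to them:

```python
import numpy as np
from sklearn.svm import SVC

# Each row: (cosine score, term proximity window); label +1 = relevant, -1 = nonrelevant
X = np.array([[0.032, 3], [0.020, 4], [0.043, 2], [0.004, 2],
              [0.022, 3], [0.030, 2], [0.027, 5]])
y = np.array([+1, -1, +1, -1, +1, +1, -1])

# Note: the two signals live on very different scales, so in practice
# feature scaling would usually be applied before training.
clf = SVC(kernel="linear", C=1.0).fit(X, y)
print(clf.coef_, clf.intercept_)   # weights on the two ranking signals and the offset
```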
Proposed scoring function for ranking
(Figure: the training examples plotted by term proximity (x-axis) and cosine score (y-axis), each labeled R (relevant) or N (nonrelevant).)
Formalization
w: weight coefficients
x_i: data point i
y_i: class label of data point i (+1 or -1)
Classifier: f(x_i) = sign(w^T x_i + b)
Functional margin of x_i: y_i (w^T x_i + b)
We can increase this margin simply by scaling w and b…
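A minimal NumPy sketch of these definitions (the weight vector, bias, and data points below are made-up illustrative values, not from the slides):

```python
import numpy as np

# Hypothetical linear classifier parameters (illustrative values only)
w = np.array([2.0, -1.0])   # weight coefficients
b = 0.5                     # bias term

# A few data points x_i with class labels y_i in {+1, -1}
X = np.array([[1.0, 0.0], [0.0, 2.0], [2.0, 1.0]])
y = np.array([+1, -1, +1])

scores = X @ w + b                 # w^T x_i + b
predictions = np.sign(scores)      # f(x_i) = sign(w^T x_i + b)
functional_margins = y * scores    # y_i (w^T x_i + b); > 0 means correctly classified

print(predictions)
print(functional_margins)          # note: doubling w and b doubles these margins
```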
Linear Support Vector Machine (SVM)
Hyperplane: w^T x + b = 0
Margin boundaries: w^T x + b = 1 and w^T x + b = -1 (so w^T x_a + b = 1 and w^T x_b + b = -1 for points x_a, x_b on the two boundaries)
Margin: ρ = ||x_a - x_b||_2 = 2/||w||_2
Support vectors: the datapoints that the margin pushes up against.
Geometric View: Margin of a Point
The distance from an example x to the separator is r = y (w^T x + b) / ||w||.
Examples closest to the hyperplane are support vectors.
The margin ρ of the separator is the width of separation between the support vectors of the two classes.
Geometric View of Margin
Distance from an example x to the separator: r = y (w^T x + b) / ||w||.
Derivation: let x lie on the line w^T x + b = z, and let x' be its projection onto the separating hyperplane (so w^T x' + b = 0). Since x - x' is parallel to w, (w^T x + b) - (w^T x' + b) = z - 0 gives ||w|| ||x - x'|| = |z| = y (w^T x + b), and thus ||w|| r = y (w^T x + b).
Linear Support Vector Machine (SVM)
For support vectors x_a and x_b on opposite margin boundaries of the hyperplane w^T x + b = 0: w^T x_a + b = 1 and w^T x_b + b = -1.
This implies w^T (x_a - x_b) = 2, so ρ = ||x_a - x_b||_2 = 2/||w||_2.
Support vectors: the datapoints that the margin pushes up against.
Linear SVM Mathematically
Assume that all data is at least distance 1 from the hyperplane; then the following two constraints hold for a training set {(x_i, y_i)}:
w^T x_i + b ≥ 1 if y_i = 1
w^T x_i + b ≤ -1 if y_i = -1
For support vectors, the inequality becomes an equality.
Then, since each example's distance from the hyperplane is r = y (w^T x + b) / ||w||, the margin of the dataset is ρ = 2/||w||.
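A short sketch of the geometric distance and the resulting margin, reusing the made-up w, b, and data from the earlier snippet:

```python
import numpy as np

w = np.array([2.0, -1.0])
b = 0.5
X = np.array([[1.0, 0.0], [0.0, 2.0], [2.0, 1.0]])
y = np.array([+1, -1, +1])

# Geometric distance of each example from the hyperplane: r = y (w^T x + b) / ||w||
r = y * (X @ w + b) / np.linalg.norm(w)

# If w, b were scaled so the closest points satisfy y (w^T x + b) = 1,
# the margin of the dataset would be 2 / ||w||
margin = 2.0 / np.linalg.norm(w)

print(r, margin)
```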
The Optimization Problem
Let {x_1, ..., x_n} be our data set and let y_i ∈ {1, -1} be the class label of x_i.
The decision boundary should classify all points correctly: y_i (w^T x_i + b) ≥ 1 for all i.
This gives a constrained optimization problem: minimize ½ ||w||^2 subject to y_i (w^T x_i + b) ≥ 1 for all i, where ||w||^2 = w^T w.
Lagrangian of the Original Problem
The Lagrangian is L(w, b, α) = ½ w^T w - Σ_i α_i [ y_i (w^T x_i + b) - 1 ], with Lagrangian multipliers α_i ≥ 0.
Note that ||w||^2 = w^T w.
Setting the gradient of L w.r.t. w and b to zero gives w = Σ_i α_i y_i x_i and Σ_i α_i y_i = 0.
The Dual Optimization Problem
We can transform the problem to its dual: maximize W(α) = Σ α_i - ½ ΣΣ α_i α_j y_i y_j x_i^T x_j subject to α_i ≥ 0 and Σ α_i y_i = 0.
The data appear only through dot products of the x's; the new variables α_i are the Lagrangian multipliers.
This is a convex quadratic programming (QP) problem, so the global maximum of the α_i can always be found, and well-established tools exist for solving it (e.g. CPLEX).
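As a sketch of handing this dual to an off-the-shelf QP solver, here is one way to do it with the cvxopt package (cvxopt, the helper name, the toy data, and the small ridge added to the quadratic term for numerical stability are my own choices, not the slides'):

```python
import numpy as np
from cvxopt import matrix, solvers

def hard_margin_dual(X, y):
    """Solve max_a sum(a) - 1/2 sum_ij a_i a_j y_i y_j x_i.x_j
    subject to a_i >= 0 and sum_i a_i y_i = 0, as a QP in standard form."""
    n = X.shape[0]
    K = X @ X.T                                        # matrix of dot products x_i^T x_j
    P = matrix(np.outer(y, y) * K + 1e-8 * np.eye(n))  # quadratic term (tiny ridge for stability)
    q = matrix(-np.ones(n))                            # minimize -sum(a)  <=>  maximize sum(a)
    G = matrix(-np.eye(n))                             # -a_i <= 0  <=>  a_i >= 0
    h = matrix(np.zeros(n))
    A = matrix(y.reshape(1, -1).astype(float))         # equality constraint sum_i a_i y_i = 0
    b = matrix(0.0)
    solvers.options["show_progress"] = False
    sol = solvers.qp(P, q, G, h, A, b)
    return np.ravel(sol["x"])                          # the alphas

# Tiny separable toy problem (made-up data)
X = np.array([[2.0, 2.0], [2.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
print(hard_margin_dual(X, y))   # non-zero entries correspond to support vectors
```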
A Geometrical Interpretation
(Figure: Class 1 and Class 2 points with the separating plane; each training point is annotated with its multiplier, e.g. α_1 = 0.8, α_6 = 1.4, α_8 = 0.6, and α = 0 for all other points.)
The α's with values different from zero correspond to the support vectors (they hold up the separating plane)!
The Optimization Problem Solution
The solution has the form: w = Σ α_i y_i x_i and b = y_k - w^T x_k for any x_k such that α_k ≠ 0.
Each non-zero α_i indicates that the corresponding x_i is a support vector.
Then the classifying function will have the form: f(x) = Σ α_i y_i x_i^T x + b
Notice that it relies on an inner product between the test point x and the support vectors x_i – we will return to this later.
Also keep in mind that solving the optimization problem involved computing the inner products x_i^T x_j between all pairs of training points.
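A short scikit-learn sketch of this solution structure (the toy data and the very large C used to approximate the hard-margin case are my own choices). SVC exposes α_i y_i for the support vectors as dual_coef_, so w, b, and f(x) can be reconstructed exactly as above:

```python
import numpy as np
from sklearn.svm import SVC

# Made-up linearly separable toy data
X = np.array([[2.0, 2.0], [2.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C approximates the hard margin

w = clf.dual_coef_ @ clf.support_vectors_     # w = sum_i alpha_i y_i x_i
print(w, clf.coef_)                           # the two agree for a linear kernel

x_new = np.array([1.5, 1.0])
# f(x) = sum_i alpha_i y_i x_i^T x + b, using only the support vectors
f = (clf.dual_coef_ @ (clf.support_vectors_ @ x_new) + clf.intercept_)[0]
print(np.sign(f), clf.decision_function([x_new]))   # the two scores agree
```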
Classification with SVMs
Given a new point (x_1, x_2), we can score its projection onto the hyperplane normal: in 2 dimensions, score = w_1 x_1 + w_2 x_2 + b.
I.e., compute the score w^T x + b = Σ α_i y_i x_i^T x + b.
Set a confidence threshold t:
Score > t: yes
Score < -t: no
Else: don't know
Soft Margin Classification
If the training set is not linearly separable, slack variables ξ_i can be added to allow misclassification of difficult or noisy examples.
Allow some errors: let some points be moved to where they belong, at a cost (the slack variables ξ_i, ξ_j in the figure).
Still, try to minimize training set errors and to place the hyperplane "far" from each class (large margin).
Soft margin
We allow an "error" ξ_i in classification; it is based on the output of the discriminant function w^T x + b.
Σ ξ_i approximates the number of misclassified samples.
New objective function: minimize ½ w^T w + C Σ ξ_i
C is a tradeoff parameter between error and margin, chosen by the user; a large C means a higher penalty on errors.
Soft Margin Classification Mathematically
The old formulation: Find w and b such that Φ(w) = ½ w^T w is minimized and, for all {(x_i, y_i)}, y_i (w^T x_i + b) ≥ 1.
The new formulation incorporating slack variables: Find w and b such that Φ(w) = ½ w^T w + C Σ ξ_i is minimized and, for all {(x_i, y_i)}, y_i (w^T x_i + b) ≥ 1 - ξ_i and ξ_i ≥ 0 for all i.
Parameter C can be viewed as a way to control overfitting – a regularization term.
The Optimization Problem
The dual of the problem is: maximize Σ α_i - ½ ΣΣ α_i α_j y_i y_j x_i^T x_j subject to Σ α_i y_i = 0 and 0 ≤ α_i ≤ C.
w is again recovered as w = Σ α_i y_i x_i.
The only difference from the linearly separable case is that there is an upper bound C on the α_i.
Once again, a QP solver can be used to find the α_i efficiently!
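Under the same assumptions as the earlier cvxopt sketch, the only change needed for the soft-margin dual is the extra upper bound α_i ≤ C, encoded by stacking a second set of inequality rows:

```python
import numpy as np
from cvxopt import matrix, solvers

def soft_margin_dual(X, y, C=1.0):
    """Sketch: solve max_a sum(a) - 1/2 sum_ij a_i a_j y_i y_j x_i.x_j
    subject to sum_i a_i y_i = 0 and 0 <= a_i <= C."""
    n = X.shape[0]
    K = X @ X.T
    P = matrix(np.outer(y, y) * K + 1e-8 * np.eye(n))
    q = matrix(-np.ones(n))
    # Stack the two one-sided constraints: -a_i <= 0 and a_i <= C
    G = matrix(np.vstack([-np.eye(n), np.eye(n)]))
    h = matrix(np.hstack([np.zeros(n), C * np.ones(n)]))
    A = matrix(y.reshape(1, -1).astype(float))
    b = matrix(0.0)
    solvers.options["show_progress"] = False
    sol = solvers.qp(P, q, G, h, A, b)
    return np.ravel(sol["x"])

# Noisy, non-separable toy data (made up)
X = np.array([[2.0, 2.0], [2.0, 3.0], [0.0, 0.0], [1.0, 0.0], [1.8, 1.9]])
y = np.array([1.0, 1.0, -1.0, -1.0, -1.0])
print(soft_margin_dual(X, y, C=10.0))   # alphas at the bound C mark margin violators
```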
Soft Margin Classification – Solution
The dual problem for soft margin classification:
Find α_1 … α_N such that Q(α) = Σ α_i - ½ ΣΣ α_i α_j y_i y_j x_i^T x_j is maximized and (1) Σ α_i y_i = 0, (2) 0 ≤ α_i ≤ C for all α_i.
Neither the slack variables ξ_i nor their Lagrange multipliers appear in the dual problem!
Again, x_i with non-zero α_i will be support vectors.
Solution to the dual problem: w = Σ α_i y_i x_i and b = y_k (1 - ξ_k) - w^T x_k where k = argmax_k α_k.
But w is not needed explicitly for classification: f(x) = Σ α_i y_i x_i^T x + b
Linear SVMs: Summary
The classifier is a separating hyperplane.
The most "important" training points are the support vectors; they define the hyperplane.
Quadratic optimization algorithms can identify which training points x_i are support vectors, i.e. those with non-zero Lagrangian multipliers α_i.
Both in the dual formulation of the problem and in the solution, training points appear only inside inner products:
Find α_1 … α_N such that Q(α) = Σ α_i - ½ ΣΣ α_i α_j y_i y_j x_i^T x_j is maximized and (1) Σ α_i y_i = 0, (2) 0 ≤ α_i ≤ C for all α_i
f(x) = Σ α_i y_i x_i^T x + b
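In practice this is all wrapped up in standard libraries. A short scikit-learn sketch (made-up overlapping data; the particular C values are arbitrary) showing how the soft-margin parameter C trades margin width against errors:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two noisy, overlapping classes (made-up data)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(2.0, 1.0, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_)   # margin width 2/||w||
    print(f"C={C:>6}: {clf.support_vectors_.shape[0]} support vectors, margin={margin:.2f}")

# Small C -> wider margin, more support vectors (more slack allowed);
# large C -> narrower margin, fewer support vectors (errors penalized heavily).
```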
Non-linear SVMs
Datasets that are linearly separable (with some noise) work out great.
But what are we going to do if the dataset is just too hard?
How about mapping the data to a higher-dimensional space?
(Figure: 1-D points on the x axis that are not linearly separable become separable when mapped into the (x, x^2) plane.)
Non-linear SVMs: Feature Spaces
General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable: Φ : x → φ(x)
Transformation to Feature Space
"Kernel tricks": make the non-separable problem separable by mapping the data into a better representational space.
(Figure: points in the input space mapped by φ(·) into the feature space.)
Modification Due to the Kernel Function
Change all inner products to kernel functions.
For training: the original formulation uses the inner products x_i^T x_j; with a kernel function it uses K(x_i, x_j) = φ(x_i)^T φ(x_j) instead.
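A minimal sketch of this substitution (the polynomial kernel and its degree are just an example choice): in the dual, the matrix of inner products is replaced by a kernel Gram matrix, and the classifier becomes f(x) = Σ α_i y_i K(x_i, x) + b.

```python
import numpy as np

def polynomial_kernel(X1, X2, degree=2):
    """K(x, y) = (1 + x^T y)^degree, computed for all pairs of rows."""
    return (1.0 + X1 @ X2.T) ** degree

X = np.array([[2.0, 2.0], [2.0, 3.0], [0.0, 0.0], [1.0, 0.0]])

K_linear = X @ X.T                 # inner products x_i^T x_j (original formulation)
K_poly = polynomial_kernel(X, X)   # kernel values K(x_i, x_j) used in their place

print(K_linear.shape, K_poly.shape)   # same shape; only the entries change
```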
Example Transformation
Consider a transformation φ(·) of the input and define the kernel function K(x, y) so that K(x, y) = φ(x)^T φ(y).
The inner product φ(x)^T φ(y) can then be computed by K without going through the map φ(·) explicitly!
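The slide's concrete transformation and kernel are not reproduced in the text, so here is the textbook instance of this idea, stated as an assumption rather than as the slide's own example: φ((x1, x2)) = (1, √2·x1, √2·x2, x1², x2², √2·x1·x2) with K(x, y) = (1 + x^T y)². The sketch below checks numerically that φ(x)^T φ(y) = K(x, y).

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for a 2-D input (assumed example, not from the slides)."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1,
                     np.sqrt(2) * x2,
                     x1 ** 2,
                     x2 ** 2,
                     np.sqrt(2) * x1 * x2])

def K(x, y):
    """Polynomial kernel K(x, y) = (1 + x^T y)^2."""
    return (1.0 + x @ y) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

print(phi(x) @ phi(y))   # inner product in the 6-D feature space
print(K(x, y))           # same value, computed directly in the 2-D input space
```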