  1. Support Vector Machines 290N, 2014

  2. Support Vector Machines (SVM)
  - Supervised learning methods for classification and regression.
  - They can represent non-linear functions and have an efficient training algorithm.
  - Derived from statistical learning theory by Vapnik and Chervonenkis (COLT-92).
  - SVMs entered the mainstream because of their exceptional performance on handwritten digit recognition: a 1.1% error rate, comparable to a very carefully constructed (and complex) ANN.

  3. Two-Class Problem: Linearly Separable Case
  - Many decision boundaries can separate these two classes (Class 1 and Class 2).
  - Which one should we choose?

  4. Example of Bad Decision Boundaries
  [Figure: decision boundaries that separate Class 1 from Class 2 but are poor choices.]

  5. Another Intuition
  - If you have to place a fat separator between the classes, you have fewer choices, and so the capacity of the model has been decreased.

  6. Support Vector Machine (SVM)
  - SVMs maximize the margin around the separating hyperplane (a.k.a. large-margin classifiers).
  - The decision function is fully specified by a subset of the training samples, the support vectors.
  - Maximizing the margin is a quadratic programming problem.
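
  A minimal sketch of these ideas in code, assuming scikit-learn and a small synthetic dataset (both are illustrative choices, not part of the slides):

    # Fit a large-margin linear classifier on toy data and inspect its support vectors.
    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[1.0, 1.0], [2.0, 1.5], [2.5, 2.0],        # class +1
                  [-1.0, -1.0], [-2.0, -1.5], [-2.5, -2.0]])  # class -1
    y = np.array([1, 1, 1, -1, -1, -1])

    clf = SVC(kernel="linear", C=1e6)   # a very large C approximates the hard margin
    clf.fit(X, y)

    w, b = clf.coef_[0], clf.intercept_[0]
    print("w =", w, "b =", b)
    print("support vectors:", clf.support_vectors_)   # the points the margin pushes against
    print("margin width = 2/||w|| =", 2.0 / np.linalg.norm(w))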

  7. Training Examples for Document Ranking
  Two ranking signals are used: the cosine text-similarity score and the proximity of the term appearance window.

  DocID  Query                   Cosine score  Window  Judgment
  37     linux operating system  0.032         3       relevant
  37     penguin logo            0.02          4       nonrelevant
  238    operating system        0.043         2       relevant
  238    runtime environment     0.004         2       nonrelevant
  1741   kernel layer            0.022         3       relevant
  2094   device driver           0.03          2       relevant
  3191   device driver           0.027         5       nonrelevant
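
  As a hedged sketch, the table can be encoded as a feature matrix (cosine score, window size) with +1/-1 relevance labels so an SVM can be trained on it; this encoding is an assumption for illustration, not part of the slides.

    # Encode the table above: two features per example and a +1/-1 relevance label.
    import numpy as np

    X = np.array([[0.032, 3], [0.02, 4], [0.043, 2], [0.004, 2],
                  [0.022, 3], [0.03, 2], [0.027, 5]])
    y = np.array([1, -1, 1, -1, 1, 1, -1])   # relevant = +1, nonrelevant = -1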

  8. Proposed Scoring Function for Ranking
  [Figure: the training examples plotted with term proximity on one axis and cosine score on the other; relevant (R) and nonrelevant (N) examples, with a proposed linear scoring function separating them.]

  9. Formalization
  - w: weight coefficients
  - x_i: data point i
  - y_i: class of data point i (+1 or -1)
  - Classifier: f(x_i) = sign(w^T x_i + b)
  - Functional margin of x_i: y_i (w^T x_i + b)
  - We can increase this margin simply by scaling w and b...
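
  A small sketch of these definitions; the weights, bias, and data point below are made-up numbers, not from the slides.

    # Classifier output and functional margin for one labeled point.
    import numpy as np

    w, b = np.array([2.0, -1.0]), 0.5           # hypothetical weights and bias
    x_i, y_i = np.array([1.0, 0.5]), 1          # hypothetical labeled point

    f_xi = np.sign(w @ x_i + b)                 # classifier: sign(w^T x_i + b)
    functional_margin = y_i * (w @ x_i + b)     # y_i (w^T x_i + b)
    print(f_xi, functional_margin)              # scaling (w, b) scales this margin too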

  10. Linear Support Vector Machine (SVM)
  - Hyperplane: w^T x + b = 0
  - Margin boundaries pass through the support vectors: w^T x_a + b = 1 and w^T x_b + b = -1
  - Support vectors: the data points that the margin pushes up against
  - Margin width: ρ = ||x_a - x_b||_2 = 2/||w||_2

  11. Geometric View: Margin of a Point
  - Distance from an example x to the separator: r = y (w^T x + b) / ||w||
  - Examples closest to the hyperplane are the support vectors.
  - The margin ρ of the separator is the width of separation between the support vectors of the two classes.

  12. Geometric View of Margin
  - Distance from an example to the separator: r = y (w^T x + b) / ||w||
  - Derivation: let x lie on the line w^T x + b = z and let x' be its projection onto the separator (so w^T x' + b = 0). Then (w^T x + b) - (w^T x' + b) = z - 0, i.e. w^T (x - x') = z. Since x - x' is parallel to w and has length r, we get ||w|| r = |z| = y (w^T x + b), and hence r = y (w^T x + b) / ||w||.
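
  A numeric check of this distance formula; the hyperplane and point below are hypothetical values chosen for illustration.

    # Verify r = y (w^T x + b) / ||w|| against an explicit projection onto the plane.
    import numpy as np

    w, b = np.array([3.0, 4.0]), -5.0           # hypothetical hyperplane w^T x + b = 0
    x, y = np.array([4.0, 3.0]), 1              # hypothetical positive example

    r = y * (w @ x + b) / np.linalg.norm(w)     # formula from the slide
    x_proj = x - (w @ x + b) / (w @ w) * w      # project x onto the hyperplane
    print(r, np.linalg.norm(x - x_proj))        # the two distances agree (both 3.8)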

  13. Linear Support Vector Machine (SVM)
  - Hyperplane: w^T x + b = 0, with support vectors on w^T x_a + b = 1 and w^T x_b + b = -1
  - Subtracting the two equations implies w^T (x_a - x_b) = 2
  - Hence the margin width is ρ = ||x_a - x_b||_2 = 2/||w||_2
  - Support vectors: the data points that the margin pushes up against

  14. Linear SVM Mathematically
  - Assume that all data is at least distance 1 from the hyperplane; then for a training set {(x_i, y_i)} the following two constraints hold:
      w^T x_i + b ≥ 1   if y_i = 1
      w^T x_i + b ≤ -1  if y_i = -1
  - For support vectors, the inequality becomes an equality.
  - Since each example's distance from the hyperplane is r = y (w^T x + b) / ||w||, the margin of the dataset is ρ = 2/||w||.

  15. The Optimization Problem
  - Let {x_1, ..., x_n} be our data set and let y_i ∈ {1, -1} be the class label of x_i.
  - The decision boundary should classify all points correctly: y_i (w^T x_i + b) ≥ 1 for all i.
  - A constrained optimization problem: minimize ½ ||w||^2 = ½ w^T w subject to these constraints.
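
  A sketch of this constrained problem fed to a generic convex solver; cvxpy and the toy data are assumptions for illustration, not the slides' own tooling.

    # Primal problem: minimize 0.5 ||w||^2  subject to  y_i (w^T x_i + b) >= 1.
    import numpy as np
    import cvxpy as cp

    X = np.array([[1.0, 1.0], [2.0, 1.5], [-1.0, -1.0], [-2.0, -1.5]])
    y = np.array([1.0, 1.0, -1.0, -1.0])

    w = cp.Variable(2)
    b = cp.Variable()
    constraints = [cp.multiply(y, X @ w + b) >= 1]
    problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints)
    problem.solve()
    print(w.value, b.value)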

  16. Lagrangian of the Original Problem
  - The Lagrangian is L(w, b, α) = ½ w^T w - Σ_i α_i [y_i (w^T x_i + b) - 1], with Lagrangian multipliers α_i ≥ 0.
  - Note that ||w||^2 = w^T w.
  - Setting the gradient of L with respect to w and b to zero gives w = Σ_i α_i y_i x_i and Σ_i α_i y_i = 0.

  17. The Dual Optimization Problem
  - We can transform the problem into its dual, in the new variables α_i (the Lagrangian multipliers):
      maximize  Σ_i α_i - ½ Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j
      subject to  α_i ≥ 0 and Σ_i α_i y_i = 0
  - The data enter only through the dot products x_i^T x_j.
  - This is a convex quadratic programming (QP) problem, so the global maximum over the α_i can always be found.
  - There are well-established tools for solving this optimization problem (e.g. CPLEX).
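
  A hedged sketch of the dual with the same generic solver and toy data as above (again an assumption, not the slides' solver); note that the data appear only through the Gram matrix of dot products.

    # Hard-margin dual: maximize sum(alpha) - 0.5 alpha^T Q alpha, Q_ij = y_i y_j x_i^T x_j,
    # subject to alpha >= 0 and sum_i alpha_i y_i = 0.
    import numpy as np
    import cvxpy as cp

    X = np.array([[1.0, 1.0], [2.0, 1.5], [-1.0, -1.0], [-2.0, -1.5]])
    y = np.array([1.0, 1.0, -1.0, -1.0])
    n = len(y)

    Q = np.outer(y, y) * (X @ X.T)              # only dot products of the data appear
    Q += 1e-8 * np.eye(n)                       # tiny ridge to keep Q numerically PSD

    alpha = cp.Variable(n)
    objective = cp.Maximize(cp.sum(alpha) - 0.5 * cp.quad_form(alpha, Q))
    constraints = [alpha >= 0, y @ alpha == 0]
    cp.Problem(objective, constraints).solve()
    print(alpha.value)                          # non-zero entries mark the support vectors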

  18. A Geometrical Interpretation
  [Figure: Class 1 and Class 2 points with the separating plane. The α's with values different from zero (e.g. α_1 = 0.8, α_6 = 1.4, α_8 = 0.6) belong to the support vectors, which hold up the separating plane; all other points have α_i = 0.]

  19. The Optimization Problem Solution
  - The solution has the form: w = Σ α_i y_i x_i and b = y_k - w^T x_k for any x_k such that α_k ≠ 0.
  - Each non-zero α_i indicates that the corresponding x_i is a support vector.
  - The classifying function then has the form: f(x) = Σ α_i y_i x_i^T x + b.
  - Notice that it relies on an inner product between the test point x and the support vectors x_i (we will return to this later).
  - Also keep in mind that solving the optimization problem involved computing the inner products x_i^T x_j between all pairs of training points.
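
  A sketch of recovering w, b, and f(x) from the dual variables; here the α_i y_i products are read off scikit-learn's fitted SVC as an illustrative shortcut, not the slides' own solver.

    # Recover w = sum_i alpha_i y_i x_i, b = y_k - w^T x_k, and f(x) = sum_i alpha_i y_i x_i^T x + b.
    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[1.0, 1.0], [2.0, 1.5], [-1.0, -1.0], [-2.0, -1.5]])
    y = np.array([1, 1, -1, -1])

    clf = SVC(kernel="linear", C=1e6).fit(X, y)     # hard margin approximated by a large C
    sv = clf.support_vectors_
    alpha_y = clf.dual_coef_[0]                      # entries are alpha_i * y_i for each support vector

    w = alpha_y @ sv                                 # w = sum_i alpha_i y_i x_i
    k = clf.support_[0]                              # index of any support vector
    b = y[k] - w @ X[k]                              # b = y_k - w^T x_k

    x_new = np.array([0.5, 2.0])
    f = alpha_y @ (sv @ x_new) + b                   # f(x) = sum_i alpha_i y_i x_i^T x + b
    print(np.sign(f), f - (w @ x_new + b))           # matches the primal form, up to rounding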

  20. Classification with SVMs
  - Given a new point x, we can score its projection onto the hyperplane normal: in 2 dimensions, score = w_1 x_1 + w_2 x_2 + b.
  - I.e., compute the score: w^T x + b = Σ α_i y_i x_i^T x + b.
  - Set a confidence threshold t:
      score > t: yes
      score < -t: no
      else: don't know
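
  A minimal sketch of this thresholded decision rule; the weights, bias, threshold, and test point are made-up values.

    # Score a new point and apply the confidence threshold t.
    import numpy as np

    def classify(x, w, b, t):
        score = w @ x + b
        if score > t:
            return "yes"
        if score < -t:
            return "no"
        return "don't know"

    w, b, t = np.array([2.0, -1.0]), 0.5, 1.0
    print(classify(np.array([1.0, 0.2]), w, b, t))   # score = 2.3 -> "yes"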

  21. Soft Margin Classification
  - If the training set is not linearly separable, slack variables ξ_i can be added to allow misclassification of difficult or noisy examples.
  - Allow some errors: let some points be moved to where they belong, at a cost ξ_i.
  - Still, try to minimize the training-set errors and to place the hyperplane "far" from each class (large margin).

  22. Soft Margin
  - We allow an "error" ξ_i in classification; it is based on the output of the discriminant function w^T x + b.
  - Σ ξ_i approximates the number of misclassified samples.
  - New objective function: minimize ½ w^T w + C Σ ξ_i.
  - C is a tradeoff parameter between error and margin, chosen by the user; a large C means a higher penalty on errors.

  23. Soft Margin Classification Mathematically
  - The old formulation: find w and b such that Φ(w) = ½ w^T w is minimized and, for all {(x_i, y_i)}, y_i (w^T x_i + b) ≥ 1.
  - The new formulation incorporating slack variables: find w and b such that Φ(w) = ½ w^T w + C Σ ξ_i is minimized and, for all {(x_i, y_i)}, y_i (w^T x_i + b) ≥ 1 - ξ_i and ξ_i ≥ 0 for all i.
  - The parameter C can be viewed as a way to control overfitting (a regularization term).
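
  A hedged sketch of the effect of C using scikit-learn's soft-margin SVC on noisy toy data; the dataset and the particular values of C are illustrative assumptions.

    # Smaller C tolerates more slack (wider margin, more margin violations);
    # larger C penalizes errors more heavily.
    import numpy as np
    from sklearn.datasets import make_blobs
    from sklearn.svm import SVC

    X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.5, random_state=0)
    for C in (0.01, 1.0, 100.0):
        clf = SVC(kernel="linear", C=C).fit(X, y)
        margin = 2.0 / np.linalg.norm(clf.coef_[0])
        print(f"C={C}: margin width={margin:.3f}, "
              f"#support vectors={len(clf.support_)}")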

  24. The Optimization Problem
  - The dual of the soft-margin problem has the same form as before, and w is again recovered as w = Σ α_i y_i x_i.
  - The only difference from the linearly separable case is that there is an upper bound C on the α_i.
  - Once again, a QP solver can be used to find the α_i efficiently!

  25. Soft Margin Classification: Solution
  - The dual problem for soft margin classification: find α_1 ... α_N such that Q(α) = Σ α_i - ½ ΣΣ α_i α_j y_i y_j x_i^T x_j is maximized and (1) Σ α_i y_i = 0, (2) 0 ≤ α_i ≤ C for all α_i.
  - Neither the slack variables ξ_i nor their Lagrange multipliers appear in the dual problem!
  - Again, the x_i with non-zero α_i will be the support vectors.
  - Solution to the dual problem: w = Σ α_i y_i x_i and b = y_k (1 - ξ_k) - w^T x_k, where k = argmax_k α_k.
  - But w is not needed explicitly for classification: f(x) = Σ α_i y_i x_i^T x + b.

  26. Linear SVMs: Summary
  - The classifier is a separating hyperplane.
  - The most "important" training points are the support vectors; they define the hyperplane.
  - Quadratic optimization algorithms can identify which training points x_i are support vectors, i.e. have non-zero Lagrangian multipliers α_i.
  - Both in the dual formulation of the problem and in the solution, training points appear only inside inner products:
      Find α_1 ... α_N such that Q(α) = Σ α_i - ½ ΣΣ α_i α_j y_i y_j x_i^T x_j is maximized and (1) Σ α_i y_i = 0, (2) 0 ≤ α_i ≤ C for all α_i
      f(x) = Σ α_i y_i x_i^T x + b

  27. Non-linear SVMs
  - Datasets that are linearly separable (with some noise) work out great.
  - But what are we going to do if the dataset is just too hard?
  - How about mapping the data to a higher-dimensional space, e.g. from x on the line to the (x, x^2) plane?
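
  A sketch of that idea, assuming the classic one-dimensional example where mapping x to (x, x^2) makes the classes separable; the specific data points are made up.

    # 1-D data with the negative class sitting between the positives is not linearly
    # separable on the line, but becomes separable after the map x -> (x, x^2).
    import numpy as np
    from sklearn.svm import SVC

    x = np.array([-3.0, -2.5, -0.5, 0.0, 0.5, 2.5, 3.0])
    y = np.array([1, 1, -1, -1, -1, 1, 1])

    phi = np.column_stack([x, x ** 2])           # explicit feature map
    clf = SVC(kernel="linear", C=1e6).fit(phi, y)
    print(clf.score(phi, y))                     # 1.0: perfectly separated in feature space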

  28. Non-linear SVMs: Feature Spaces
  - General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable: Φ: x → φ(x)

  29. Transformation to Feature Space
  - "Kernel trick": make a non-separable problem separable by mapping the data into a better representational space.
  [Figure: points x in the input space are mapped by φ(·) to points φ(x) in the feature space.]

  30. Modification Due to the Kernel Function
  - Change all inner products to kernel functions.
  - For training: the original formulation uses x_i^T x_j; with a kernel function it uses K(x_i, x_j) = φ(x_i)^T φ(x_j).
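
  A hedged sketch of this substitution: hand the SVM a precomputed kernel matrix K(x_i, x_j) in place of the raw inner products. The RBF kernel and the ring-shaped toy data are illustrative choices, not from the slides.

    # Replace the Gram matrix of inner products x_i^T x_j with a kernel matrix K(x_i, x_j).
    import numpy as np
    from sklearn.svm import SVC
    from sklearn.metrics.pairwise import rbf_kernel

    rng = np.random.default_rng(0)
    X = rng.normal(size=(40, 2))
    y = np.where(np.linalg.norm(X, axis=1) < 1.0, 1, -1)    # a ring-shaped, non-linear problem

    K_train = rbf_kernel(X, X, gamma=1.0)                   # K(x_i, x_j) in place of x_i^T x_j
    clf = SVC(kernel="precomputed", C=10.0).fit(K_train, y)
    print(clf.score(rbf_kernel(X, X, gamma=1.0), y))        # evaluate with the same kernel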

  31. Example Transformation
  - Consider a transformation φ(·) of the input and define the kernel function K(x, y) accordingly.
  - The inner product φ(x)^T φ(y) can then be computed by K(x, y) without going through the map φ(·) explicitly!
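
  The slide's specific transformation does not survive in the text, so as an assumed stand-in here is the classic degree-2 example: the kernel K(x, y) = (1 + x^T y)^2 equals the inner product of an explicit 6-dimensional map φ.

    # Check that (1 + x^T y)^2 equals phi(x)^T phi(y) for the explicit degree-2 map.
    import numpy as np

    def phi(v):
        x1, x2 = v
        return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                         x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

    def K(x, y):
        return (1.0 + x @ y) ** 2

    x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
    print(K(x, y), phi(x) @ phi(y))   # identical values, without using phi explicitly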
