
Support Vector Machine – Supervised Learning: Classification (Ricco Rakotomalala) – PowerPoint PPT Presentation



  1. SVM – Support Vector Machine. Supervised Learning: Classification. Ricco Rakotomalala, Université Lumière Lyon 2. Tutoriels Tanagra - http://tutoriels-data-mining.blogspot.fr/

  2. Outline
     1. Binary classification – Linear classifier
     2. Maximize the margin (I) – Primal form
     3. Maximize the margin (II) – Dual form
     4. Noisy labels – Soft margin
     5. Nonlinear classification – Kernel trick
     6. Estimating class membership probabilities
     7. Feature selection
     8. Extension to multiclass problems
     9. SVM in practice – Tools and software
     10. Conclusion – Pros and cons
     11. References

  3. Binary classification – LINEAR SVM

  4. Binary classification
     Supervised learning: Y = f(x1, x2, …, xp; β) for a binary problem, i.e. Y ∈ {+, −} or Y ∈ {+1, −1}. The data are linearly separable.
     Toy dataset (10 instances):
        x1   x2    y
         1    3   -1
         2    1   -1
         4    5   -1
         6    9   -1
         8    7   -1
         5    1   +1
         7    1   +1
         9    4   +1
        12    7   +1
        13    6   +1
     [Figure: scatter plot of the 10 points in the (x1, x2) plane]
     The aim is to find a hyperplane which separates the “+” and the “−” instances perfectly. The classifier comes in the form of a linear combination of the variables:
        f(x) = β^T x + β_0   (here, with p = 2: β_1 x1 + β_2 x2 + β_0)
     where β = (β_1, β_2, …, β_p) and β_0 are the (p + 1) parameters (coefficients) to estimate.
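To make the notation concrete, here is a minimal Python sketch (an addition, not part of the original slides) that encodes the toy dataset above and evaluates a linear score f(x) = β^T x + β_0; the coefficient values are arbitrary illustrative choices, not the optimal ones.

```python
import numpy as np

# Toy dataset from the slide: 10 points (x1, x2) with labels y in {-1, +1}
X = np.array([[1, 3], [2, 1], [4, 5], [6, 9], [8, 7],
              [5, 1], [7, 1], [9, 4], [12, 7], [13, 6]], dtype=float)
y = np.array([-1, -1, -1, -1, -1, 1, 1, 1, 1, 1])

def f(X, beta, beta0):
    """Linear score f(x) = beta^T x + beta0, evaluated for every row of X."""
    return X @ beta + beta0

# Arbitrary (non-optimal) coefficients: any (beta, beta0) such that
# sign(f(x_i)) == y_i for every i separates this data perfectly.
beta, beta0 = np.array([1.0, -1.0]), -2.0
print(np.sign(f(X, beta, beta0)) == y)   # all True: this particular line separates the two classes
```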

  5. Finding the “optimal” solution
     Once the “shape” of the decision boundary is defined, we have to choose a solution among the infinite number of possible solutions.
     Two key issues, always present in the supervised learning framework:
     (1) Choosing the “representation bias” (or “hypothesis bias”) → we define the shape of the separator.
     (2) Choosing the “search bias”, i.e. the way to select the best solution among all the possible ones → it often boils down to setting the objective function to optimize.
     Example: Linear Discriminant Analysis. The separating hyperplane lies halfway between the two conditional centroids, in the sense of the Mahalanobis distance (see the sketch below).
     [Figures: two scatter plots of the toy data with candidate separating lines]
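As a concrete illustration of one particular search bias, the sketch below (added here, not in the slides) fits scikit-learn's LinearDiscriminantAnalysis on the same toy data; LDA also yields a linear boundary, but one selected by a different criterion than the SVM margin.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X = np.array([[1, 3], [2, 1], [4, 5], [6, 9], [8, 7],
              [5, 1], [7, 1], [9, 4], [12, 7], [13, 6]], dtype=float)
y = np.array([-1, -1, -1, -1, -1, 1, 1, 1, 1, 1])

# LDA is also a linear classifier: its boundary is coef_ . x + intercept_ = 0,
# placed between the class centroids in the Mahalanobis metric.
lda = LinearDiscriminantAnalysis().fit(X, y)
print(lda.coef_, lda.intercept_)
print(lda.score(X, y))   # training accuracy of the LDA boundary on the toy data
```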

  6. Primal problem – MAXIMIZE THE MARGIN (I)

  7. Hard-margin principle – Intuitive layout
     The optimal separating hyperplane separates the two classes and maximizes the distance to the closest point from either class (Vapnik, 1996) [HTF, page 132].
     • Distance from any point x to the boundary (see projection): d = |β^T x + β_0| / ‖β‖
     • The maximum margin is γ = 2 / ‖β‖
     • The instances on which the margins rely are the “support vectors”. If we remove them from the sample, the optimal solution is modified.
     • Several areas are defined in the representation space:
       f(x) = 0 : the maximum margin hyperplane
       f(x) > 0 : the area of “+” instances
       f(x) < 0 : the area of “−” instances
       f(x) = +1 or −1 : these hyperplanes are the margins
     [Figure: scatter plot of the toy data with the separating hyperplane, the two margins and the distance d]
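The geometry above can be checked numerically. The sketch below (an addition) takes the hyperplane coefficients found later on the toy example (slide 9) and computes each point's distance to the boundary and the resulting margin 2/‖β‖.

```python
import numpy as np

X = np.array([[1, 3], [2, 1], [4, 5], [6, 9], [8, 7],
              [5, 1], [7, 1], [9, 4], [12, 7], [13, 6]], dtype=float)
y = np.array([-1, -1, -1, -1, -1, 1, 1, 1, 1, 1])

# Coefficients of the maximum margin hyperplane on this data (see the Excel example, slide 9)
beta, beta0 = np.array([0.667, -0.667]), -1.667

norm_beta = np.linalg.norm(beta)              # ||beta|| ~ 0.943
dist = np.abs(X @ beta + beta0) / norm_beta   # distance of each point to the boundary f(x) = 0
margin = 2.0 / norm_beta                      # maximum margin ~ 2.121

# The closest points (the support vectors) lie exactly at distance 1/||beta|| on each side
print(np.round(dist, 3), round(margin, 3))
```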

  8. Maximizing the margin – Mathematical formulation
     Maximizing the margin, max γ = 2/‖β‖, is equivalent to minimizing the norm of the vector of parameters:
        min_{β, β_0} ‖β‖
        subject to  y_i · f(x_i) ≥ 1 ,  i = 1, …, n
     • The constraints state that all points are on the right side of the boundary, or at worst on the hyperplane of the support vectors.
     • We have a convex optimization problem (quadratic objective function, linear constraints). A global optimum exists.
     • But there is no closed-form solution. We must go through numerical optimization programs (see the sketch below).
     Note: the problem is often equivalently written as  min_{β, β_0} (1/2)‖β‖²
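Because there is no closed-form solution, the primal problem has to be solved numerically. Here is a minimal sketch (not from the slides) using scipy.optimize.minimize with the SLSQP method on the toy data; a dedicated quadratic programming solver would normally be preferred.

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[1, 3], [2, 1], [4, 5], [6, 9], [8, 7],
              [5, 1], [7, 1], [9, 4], [12, 7], [13, 6]], dtype=float)
y = np.array([-1, -1, -1, -1, -1, 1, 1, 1, 1, 1], dtype=float)

# Decision variables: theta = (beta_1, beta_2, beta_0)
def objective(theta):
    beta = theta[:-1]
    return 0.5 * beta @ beta                 # (1/2) ||beta||^2

def constraints(theta):
    beta, beta0 = theta[:-1], theta[-1]
    return y * (X @ beta + beta0) - 1.0      # each component must be >= 0

res = minimize(objective, x0=np.zeros(3), method="SLSQP",
               constraints=[{"type": "ineq", "fun": constraints}])
print(np.round(res.x, 3))   # expected to be close to (0.667, -0.667, -1.667)
```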

  9. Maximizing the margin – A toy example under EXCEL (!)
     We use the SOLVER to solve the optimization problem.
     (p + 1) variable cells:  beta.1 = 0.667 ;  beta.2 = -0.667 ;  beta.0 = -1.667
     Objective cell (minimized): Norme.Beta = ‖β‖ = 0.943
     n = 10 constraints, one per instance:
        n°   x1   x2    y      f(x)    f(x)·y
         1    1    3   -1   -3.000     3.000
         2    2    1   -1   -1.000     1.000
         3    4    5   -1   -2.333     2.333
         4    6    9   -1   -3.667     3.667
         5    8    7   -1   -1.000     1.000
         6    5    1   +1    1.000     1.000
         7    7    1   +1    2.333     2.333
         8    9    4   +1    1.667     1.667
         9   12    7   +1    1.667     1.667
        10   13    6   +1    3.000     3.000
     Saturated constraints [f(x)·y = 1]: 3 support vectors were found (n° 2, 5 and 6).
     Maximum margin hyperplane:  0.667·x1 − 0.667·x2 − 1.667 = 0
     Margin hyperplanes (β^T x + β_0 = ±1):  0.667·x1 − 0.667·x2 − 1.667 = +1  and  0.667·x1 − 0.667·x2 − 1.667 = −1
     [Figure: scatter plot of the data with the maximum margin hyperplane and the two margins]
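The same toy problem can be cross-checked with a dedicated SVM library. The sketch below (an addition) uses scikit-learn's SVC with a linear kernel and a very large C, so that the soft-margin solution approximates the hard margin; it should recover the same hyperplane and the same three support vectors.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 3], [2, 1], [4, 5], [6, 9], [8, 7],
              [5, 1], [7, 1], [9, 4], [12, 7], [13, 6]], dtype=float)
y = np.array([-1, -1, -1, -1, -1, 1, 1, 1, 1, 1])

# A very large C makes the soft-margin SVM behave (almost) like the hard-margin one
clf = SVC(kernel="linear", C=1e6).fit(X, y)

print(clf.coef_, clf.intercept_)   # expected close to (0.667, -0.667) and -1.667
print(clf.support_ + 1)            # 1-based indices of the support vectors (n° 2, 5, 6)
```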

  10. Primal problem – Comments
     Assignment rule for an instance i*, based on the estimated coefficients β̂_j:
        f(x_{i*}) ≥ 0  ⟹  ŷ_{i*} = +1
        f(x_{i*}) < 0  ⟹  ŷ_{i*} = −1
     Applied to the toy data (beta.1 = 0.667, beta.2 = -0.667, beta.0 = -1.667):
        x1   x2    y      f(x)   prediction
         1    3   -1   -3.000      -1
         2    1   -1   -1.000      -1
         4    5   -1   -2.333      -1
         6    9   -1   -3.667      -1
         8    7   -1   -1.000      -1
         5    1   +1    1.000      +1
         7    1   +1    2.333      +1
         9    4   +1    1.667      +1
        12    7   +1    1.667      +1
        13    6   +1    3.000      +1
     Drawbacks of this primal form:
     (1) Algorithms for numerical optimization (quadratic programming) are not operational when p is large (more than a few hundred variables). This often happens on real problems (e.g. text mining, image processing, …) with few examples and many descriptors.
     (2) This primal form does not highlight the possibility of using “kernel” functions, which enable going beyond linear classifiers.
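A short sketch (an addition) of the assignment rule applied with the estimated coefficients, reproducing the f(x) and prediction columns of the table above:

```python
import numpy as np

X = np.array([[1, 3], [2, 1], [4, 5], [6, 9], [8, 7],
              [5, 1], [7, 1], [9, 4], [12, 7], [13, 6]], dtype=float)

beta, beta0 = np.array([0.667, -0.667]), -1.667   # estimated coefficients (slide 9)
fx = X @ beta + beta0
y_pred = np.where(fx >= 0, 1, -1)                 # f(x) >= 0 -> +1, otherwise -1
print(np.round(fx, 3), y_pred)
```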

  11. Dual problem – MAXIMIZE THE MARGIN (II)

  12. Dual problem – Lagrangian multiplier method
     A convex optimization problem has a dual form obtained by using Lagrange multipliers.
     The primal problem…
        min_{β, β_0} (1/2)‖β‖²
        s.c.  y_i (β^T x_i + β_0) ≥ 1 ,  i = 1, …, n
     …becomes, under the dual form, the Lagrangian:
        L_P(β, β_0, α) = (1/2)‖β‖² − Σ_{i=1..n} α_i [ y_i (β^T x_i + β_0) − 1 ]
     where the α_i are the Lagrange multipliers.
     By setting each partial derivative equal to zero:
        ∂L/∂β = 0   ⟹  β = Σ_{i=1..n} α_i y_i x_i   (we can obtain the parameters of the hyperplane from the Lagrange multipliers)
        ∂L/∂β_0 = 0 ⟹  Σ_{i=1..n} α_i y_i = 0
     The solution must satisfy the Karush-Kuhn-Tucker (KKT) conditions, in particular:
        y_i (β^T x_i + β_0) − 1 ≥ 0 ,  ∀ i
        α_i [ y_i (β^T x_i + β_0) − 1 ] = 0 ,  ∀ i
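For completeness, here is a short derivation (an addition, not in the original slide) showing how substituting the two stationarity conditions back into L_P yields the dual objective L_D used on the next slide:

```latex
% Using beta = sum_i alpha_i y_i x_i and sum_i alpha_i y_i = 0 in L_P:
\begin{aligned}
L_P &= \tfrac{1}{2}\lVert\beta\rVert^2
       - \sum_{i=1}^{n} \alpha_i \bigl[\, y_i(\beta^T x_i + \beta_0) - 1 \,\bigr] \\
    &= \tfrac{1}{2}\sum_{i=1}^{n}\sum_{i'=1}^{n} \alpha_i \alpha_{i'} y_i y_{i'} \langle x_i, x_{i'}\rangle
       - \sum_{i=1}^{n}\sum_{i'=1}^{n} \alpha_i \alpha_{i'} y_i y_{i'} \langle x_i, x_{i'}\rangle
       - \beta_0 \underbrace{\sum_{i=1}^{n} \alpha_i y_i}_{=\,0}
       + \sum_{i=1}^{n} \alpha_i \\
    &= \sum_{i=1}^{n} \alpha_i
       - \tfrac{1}{2}\sum_{i=1}^{n}\sum_{i'=1}^{n} \alpha_i \alpha_{i'} y_i y_{i'} \langle x_i, x_{i'}\rangle
     \;=\; L_D
\end{aligned}
```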

  13. Dual problem – Optimization
     By using the information from the partial derivatives of the Lagrangian, the result relies only on the multipliers:
        max_α  L_D = Σ_{i=1..n} α_i − (1/2) Σ_{i=1..n} Σ_{i'=1..n} α_i α_{i'} y_i y_{i'} ⟨x_i, x_{i'}⟩
        subject to  α_i ≥ 0 , ∀ i   and   Σ_{i=1..n} α_i y_i = 0
     • ⟨x_i, x_{i'}⟩ = Σ_{j=1..p} x_{ij} x_{i'j} is the scalar product between the vectors of values of the instances i and i'.
     • The multipliers α_i > 0 identify the important instances, i.e. the support vectors.
     • Inevitably, there will be support vectors from both class labels, otherwise the condition Σ_i α_i y_i = 0 cannot be met.
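A minimal numerical sketch of this dual problem (an addition, not part of the slides), again with scipy.optimize on the toy data: it maximizes L_D under the two constraints, then recovers β from the multipliers and β_0 from the support vectors.

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[1, 3], [2, 1], [4, 5], [6, 9], [8, 7],
              [5, 1], [7, 1], [9, 4], [12, 7], [13, 6]], dtype=float)
y = np.array([-1, -1, -1, -1, -1, 1, 1, 1, 1, 1], dtype=float)
n = len(y)

K = X @ X.T                              # Gram matrix of scalar products <x_i, x_i'>
Q = (y[:, None] * y[None, :]) * K        # Q[i, i'] = y_i y_i' <x_i, x_i'>

def neg_dual(alpha):                     # maximizing L_D  <=>  minimizing -L_D
    return 0.5 * alpha @ Q @ alpha - alpha.sum()

res = minimize(neg_dual, x0=np.zeros(n), method="SLSQP",
               bounds=[(0.0, None)] * n,                               # alpha_i >= 0
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])   # sum_i alpha_i y_i = 0

alpha = res.x
sv = alpha > 1e-6                        # support vectors: strictly positive multipliers
beta = (alpha * y) @ X                   # beta = sum_i alpha_i y_i x_i
beta0 = np.mean(y[sv] - X[sv] @ beta)    # from y_i (beta^T x_i + beta_0) = 1 on the support vectors
print(np.where(sv)[0] + 1, np.round(beta, 3), round(beta0, 3))   # expected: [2 5 6], (0.667, -0.667), -1.667
```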
