Classifiers: Support Vector Machine


  1. MACHINE LEARNING. Classifiers: Support Vector Machine

  2. What is Classification? Example: detecting facial attributes (e.g. female adult vs. children) [He & Zhang, Pattern Recognition, 2011; Sony (Make Believe)]. The training set must be as unambiguous as possible. This is not easy, especially since members of different classes may share similar attributes. Learning implies generalization: identifying which features of each class member make the class most distinguishable from the other classes.

  3. Multi-Class Classification: male adult, female adult, children. Whenever possible, the classes should be balanced. Garbage model: male adult versus anything that is neither a female adult nor a child; the classes can then no longer be balanced.

  4. Classifiers. There is a plethora of classifiers, e.g.:
     - Neural networks (feed-forward with backpropagation, multi-layer perceptron)
     - Decision trees (C4.5, random forest)
     - Kernel methods (support vector machine, Gaussian process classifier)
     - Mixtures of linear classifiers (boosting)
     In this class, we will see only SVM, and boosting for mixtures of classifiers. Each classifier type has its pros and cons:
     - Complex models embed non-linearity but are computationally heavy.
     - Simple models often require a high number of models, hence high memory usage.
     - A high number of hyperparameters requires extensive cross-validation to determine the optimal classifier (see the sketch after this slide).
     - Some classifiers come with guarantees of a globally optimal solution; others have only local optimality guarantees.
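The slide notes that many hyperparameters call for extensive cross-validation. Below is a minimal sketch of what that looks like in practice, assuming scikit-learn (not mentioned in the slides); the toy dataset and the parameter grid are made up for illustration.

```python
# Hedged sketch: cross-validated hyperparameter search for an SVM classifier.
# scikit-learn, the toy data and the grid below are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Toy two-class dataset standing in for the "training set" of the slides.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Hyperparameter grid: regularization strength C and RBF kernel width gamma.
param_grid = {"C": [0.1, 1.0, 10.0], "gamma": [0.01, 0.1, 1.0]}

# 5-fold cross-validation over the grid selects the best classifier.
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

print("best hyperparameters:", search.best_params_)
print("cross-validated accuracy: %.3f" % search.best_score_)
```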

  5. Support Vector Machine. Brief history: SVM was invented by Vladimir Vapnik. It started with the invention of statistical learning theory (Vapnik, 1979); the current form of SVM was presented in Boser, Guyon and Vapnik (1992) and Cortes and Vapnik (1995). Textbooks: a good survey of the theory behind SVM is given in Learning with Kernels by Bernhard Schölkopf and Alexander Smola; an easy introduction to SVM is given in Support Vector Machines and Other Kernel-Based Learning Methods by Nello Cristianini and John Shawe-Taylor.

  6. Support Vector Machine. It has been applied to numerous classification problems:
     - Computer vision (face detection, object recognition, feature categorization, etc.)
     - Bioinformatics (categorization of gene expression and microarray data)
     - WWW (categorization of websites)
     - Production (quality control, detection of defects)
     - Robotics (categorization of sensor readings)
     - Finance (bankruptcy prediction)
     The success of SVM is mainly due to:
     - Its ease of use (lots of software available, good documentation)
     - Excellent performance on a variety of datasets
     - Good solvers that make the optimization (learning phase) very quick
     - Very fast retrieval time, which does not hinder practical applications

  7. Optimal Linear Classification. (Figure: candidate separating lines labeled 'good', 'OK' and 'bad'.) Which choice is better? How could we formulate this problem?

  8. Linear Classifiers. Input x, classifier f(x; w, b) = sgn(<w, x> + b), output estimate y_est; the labels -1 and +1 denote the two classes. How would you classify this data? (A code sketch of this decision function follows slide 12.)

  9. Linear Classifiers (same setup; figure shows another candidate separating line). How would you classify this data?

  10. Linear Classifiers (same setup; figure shows another candidate separating line). How would you classify this data?

  11. Linear Classifiers (same setup; figure shows another candidate separating line). How would you classify this data?

  12. Linear Classifiers. Any of these would be fine... but which is best?
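A minimal numpy sketch of the decision function f(x; w, b) = sgn(<w, x> + b) from slides 8 to 12; the weight vector, bias and query points are made-up values, not from the slides.

```python
# Minimal sketch of the linear decision function f(x; w, b) = sgn(<w, x> + b).
# The weight vector w, bias b and the query points are made-up values.
import numpy as np

def linear_classify(X, w, b):
    """Return the estimated labels y_est in {-1, +1} for the rows of X."""
    return np.sign(X @ w + b)

w = np.array([2.0, -1.0])   # normal vector of the separating hyperplane
b = -0.5                    # offset

X = np.array([[1.0, 0.0],   # <w, x> + b =  1.5 -> +1
              [0.0, 1.0],   # <w, x> + b = -1.5 -> -1
              [0.5, 0.5]])  # <w, x> + b =  0.0 -> on the boundary (sign = 0)

print(linear_classify(X, w, b))   # [ 1. -1.  0.]
```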

  13. Classifier Margin. For a linear classifier f(x; w, b) = sgn(<w, x> + b), define the margin as the width by which the boundary could be increased before hitting a datapoint (the labels -1 and +1 denote the two classes).

  14. Classifier Margin. The maximum margin linear classifier is the linear classifier with the maximum margin. This is the simplest kind of SVM, called an LSVM (Linear SVM).

  15. Classifier Margin. Support vectors are those datapoints that the margin pushes up against. The maximum margin linear classifier is the linear classifier with the maximum margin; this is the simplest kind of SVM, called an LSVM (Linear SVM).

  16. Classifier Margin. We need to determine a measure of the margin.

  17. Classifier Margin. We need to determine a measure of the margin, and then maximize this measure.

  18. Determining the Optimal Separating Hyperplane. Separating hyperplane: {x : <w, x> + b = 0}; margin boundaries: {x : <w, x> + b = +1} and {x : <w, x> + b = -1}. Definition: the margin boundaries on either side of the hyperplane satisfy |<w, x> + b| = 1.

  19. Determining the Optimal Separating Hyperplane. Decision function: f(x; w, b) = sgn(<w, x> + b). Points on either side of the separating plane have negative and positive values of <w, x> + b, respectively: the class with label y = -1 lies on the negative side and the class with label y = +1 on the positive side, with level sets such as {x : <w, x> + b = ±2} and {x : <w, x> + b = ±3} lying progressively further from the plane.

  20. Determining the Optimal Separating Hyperplane. What is the distance from a point x to the hyperplane <w, x> + b = 0?

  21. Determining the Optimal Separating Hyperplane. Take a point x' on the hyperplane, i.e. <w, x'> + b = 0. Then <w, x - x'> = <w, x> - <w, x'> = <w, x> + b. Projecting x - x' onto the unit vector w/||w|| gives the distance from x to the hyperplane: |<w, x> + b| / ||w||. Since the closest points of each class satisfy |<w, x> + b| = 1, the margin between the two classes is at least 2/||w||.
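The distance formula derived on slide 21, as a small numpy sketch; w, b and the test point are made-up values.

```python
# Sketch of the point-to-hyperplane distance |<w, x> + b| / ||w|| from slide 21.
# w, b and the test point are made-up values.
import numpy as np

def distance_to_hyperplane(x, w, b):
    return abs(np.dot(w, x) + b) / np.linalg.norm(w)

w = np.array([3.0, 4.0])          # ||w|| = 5
b = -5.0
x = np.array([3.0, 4.0])          # <w, x> + b = 9 + 16 - 5 = 20

print(distance_to_hyperplane(x, w, b))   # 20 / 5 = 4.0

# If the closest points of each class satisfy |<w, x> + b| = 1, each lies at
# distance 1/||w|| from the hyperplane, so the margin between the two classes
# is at least 2/||w||.
print(2.0 / np.linalg.norm(w))           # 0.4
```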

  22. Determining the Optimal Separating Hyperplane. Take two points on either side of the margin, x1 in the class with label y = +1 and x2 in the class with label y = -1, lying on the margin boundaries: <w, x1> + b = +1 and <w, x2> + b = -1. Then <w, x1 - x2> = 2, and therefore ||x1 - x2|| >= 2/||w||: the margin between the two classes is at least 2/||w|| (see the derivation below).
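The step from <w, x1 - x2> = 2 to the bound on ||x1 - x2|| uses the Cauchy-Schwarz inequality; a short derivation in the slides' notation:

```latex
% From the two margin conditions
%   <w, x_1> + b = +1   and   <w, x_2> + b = -1 :
\langle w, x_1 - x_2 \rangle
  = (\langle w, x_1\rangle + b) - (\langle w, x_2\rangle + b) = 2 .
% Cauchy-Schwarz gives |<w, x_1 - x_2>| <= ||w|| ||x_1 - x_2||, hence
\|x_1 - x_2\| \;\ge\; \frac{\langle w, x_1 - x_2\rangle}{\|w\|} \;=\; \frac{2}{\|w\|} .
```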

  23. Determining the Optimal Separating Hyperplane. The quality of the separation is measured by the margin 2/||w||. Maximizing it is equivalent to minimizing ||w||; better even is to minimize the convex quadratic form ||w||^2 / 2, which is differentiable and has the same minimizer.

  24. Determining the Optimal Separating Hyperplane. Finding the optimal separating hyperplane turns out to be an optimization problem of the following form:
     min_{w, b}  (1/2) ||w||^2
     subject to  <w, x_i> + b >= +1 when y_i = +1,  and  <w, x_i> + b <= -1 when y_i = -1,
     i.e.  y_i (<w, x_i> + b) >= 1,  for i = 1, 2, ..., M.
     There are N + 1 parameters (N: dimension of the data) and M constraints (M: number of datapoints). This is called the primal problem (a QP sketch follows this slide).
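A hedged sketch of solving the primal problem of slide 24 as a quadratic program. The cvxopt solver and the linearly separable toy data are assumptions for illustration (the slides do not name a solver); the optimization variable is z = (w, b).

```python
# Hedged sketch: solve the hard-margin primal QP
#     min_{w,b} (1/2)||w||^2   s.t.  y_i (<w, x_i> + b) >= 1
# with the cvxopt QP solver. The solver choice and the toy data are
# illustrative assumptions, not part of the slides. Variable z = (w_1,...,w_N, b).
import numpy as np
from cvxopt import matrix, solvers

# Linearly separable toy data (M = 6 points, N = 2 dimensions).
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.0, 3.0],
              [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
M, N = X.shape

# Quadratic term covers ||w||^2 only; a tiny epsilon on b keeps P positive definite.
P = np.diag(np.concatenate([np.ones(N), [1e-8]]))
q = np.zeros(N + 1)

# Constraints y_i(<w, x_i> + b) >= 1 rewritten as G z <= h.
G = -y[:, None] * np.hstack([X, np.ones((M, 1))])
h = -np.ones(M)

solvers.options["show_progress"] = False
sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))
z = np.array(sol["x"]).ravel()
w, b = z[:N], z[N]
print("w =", w, " b =", b)
print("margin 2/||w|| =", 2.0 / np.linalg.norm(w))
```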

  25. Determining the Optimal Separating Hyperplane. Rephrase the minimization-under-constraints problem in terms of the Lagrange multipliers α_i, i = 1, ..., M (M: number of datapoints), one for each of the inequality constraints, and we obtain the Lagrangian leading to the dual problem:
     L(w, b, α) = (1/2) ||w||^2 - sum_{i=1}^{M} α_i [ y_i (<w, x_i> + b) - 1 ],   with α_i >= 0.
     (Minimization of a convex function under linear constraints through Lagrange multipliers gives the globally optimal solution.)

  26. Determining the Optimal Separating Hyperplane. The solution of this problem is found by maximizing over α and minimizing over w and b:
     max_{α >= 0}  min_{w, b}  L(w, b, α),
     where L(w, b, α) = (1/2) ||w||^2 - sum_{i=1}^{M} α_i [ y_i (<w, x_i> + b) - 1 ].

  27. Determining the Optimal Separating Hyperplane. Requiring that the gradient of L with respect to w vanishes:
     ∂L(w, b, α)/∂w = 0   =>   w = sum_{i=1}^{M} α_i y_i x_i.
     The vector defining the hyperplane is determined by the training points. Note that while w is unique (minimization of a convex function), the α_i are not necessarily unique.

  28. Determining the Optimal Separating Hyperplane. Requiring that the gradient of L with respect to b vanishes:
     ∂L(w, b, α)/∂b = 0   =>   sum_{i=1}^{M} α_i y_i = 0.
     This requires at least one datapoint in each class (a sketch of the resulting dual QP follows this slide).
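Substituting the two stationarity conditions of slides 27 and 28 (w = sum_i α_i y_i x_i and sum_i α_i y_i = 0) back into L(w, b, α) yields the standard dual quadratic program in α alone, which these slides stop just short of writing out. Below is a hedged sketch, again assuming cvxopt and the same toy data as before, that solves the dual, recovers w from the training points, and checks sum_i α_i y_i = 0.

```python
# Hedged sketch: the dual of the hard-margin problem, obtained by substituting
#   w = sum_i alpha_i y_i x_i   and   sum_i alpha_i y_i = 0   into L(w, b, alpha):
#   max_alpha  sum_i alpha_i - (1/2) sum_{i,j} alpha_i alpha_j y_i y_j <x_i, x_j>
#   s.t.       alpha_i >= 0,   sum_i alpha_i y_i = 0.
# cvxopt and the toy data are illustrative assumptions, not from the slides.
import numpy as np
from cvxopt import matrix, solvers

X = np.array([[2.0, 2.0], [3.0, 3.0], [2.0, 3.0],
              [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
M = len(y)

K = X @ X.T                          # Gram matrix of inner products <x_i, x_j>
P = matrix(np.outer(y, y) * K)       # Q_ij = y_i y_j <x_i, x_j>
q = matrix(-np.ones(M))              # maximize sum_i alpha_i -> minimize -sum_i alpha_i
G = matrix(-np.eye(M))               # alpha_i >= 0
h = matrix(np.zeros(M))
A = matrix(y.reshape(1, M))          # equality constraint sum_i alpha_i y_i = 0
b_eq = matrix(np.zeros(1))

solvers.options["show_progress"] = False
alpha = np.array(solvers.qp(P, q, G, h, A, b_eq)["x"]).ravel()

# w is determined by the training points (slide 27); only the support vectors
# (alpha_i > 0) contribute.
w = (alpha * y) @ X
sv = alpha > 1e-6
# b from any support vector, using y_i (<w, x_i> + b) = 1.
b = np.mean(y[sv] - X[sv] @ w)

print("alpha =", np.round(alpha, 3))
print("sum_i alpha_i y_i =", float(alpha @ y))   # ~0, as required by slide 28
print("w =", w, " b =", b)
```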
