Kernel Machines
Steven J Zeil, Old Dominion Univ., Fall 2010


  1. Kernel Machines. Steven J Zeil, Old Dominion Univ., Fall 2010.

  2. Outline. Support Vector Machines: optimal separating hyperplanes, soft margin hyperplanes. Kernel Machines: kernel functions, multi-class problems, regression, outlier detection, dimensionality reduction.

  3. Seeking Separation. Recall the earlier discussion: the most specific hypothesis S is the tightest rectangle enclosing the positive examples (prone to false negatives); the most general hypothesis G is the largest rectangle enclosing the positive examples but containing no negative examples (prone to false positives). Perhaps we should choose something in between?

  4. Support Vector Machines. Non-parametric and discriminant-based: the discriminant is defined as a combination of support vectors, and training is a convex optimization problem with a unique solution.

  5. Separating Hyperplanes. In our earlier linear discrimination techniques we sought any separating hyperplane: some of the points could come arbitrarily close to the border, distance from the plane was a measure of confidence, and if the classes were not linearly separable, too bad.

  6. Support Vector Machines: Margins. SVMs seek the plane that maximizes the margin between the plane and the closest instances of each class. At test time we do not insist on the margin, but distance from the plane is still an indication of confidence. With minor modification, the approach extends to classes that are not linearly separable.

  7. Defining the Margin. Given a sample $X = \{\vec{x}^t, r^t\}$, where $r^t = +1$ if $\vec{x}^t \in C_1$ and $r^t = -1$ if $\vec{x}^t \in C_2$, find $\vec{w}$ and $w_0$ such that
      $\vec{w}^T \vec{x}^t + w_0 \ge +1$ for $r^t = +1$
      $\vec{w}^T \vec{x}^t + w_0 \le -1$ for $r^t = -1$
      or, equivalently, $r^t(\vec{w}^T \vec{x}^t + w_0) \ge +1$.

  8. Maximizing the Margin. The margin is the distance from the discriminant to the closest instances on either side. The distance of $\vec{x}^t$ to the hyperplane is $|\vec{w}^T \vec{x}^t + w_0| / \|\vec{w}\|$. Let $\rho$ denote the margin: $\forall t,\ |\vec{w}^T \vec{x}^t + w_0| / \|\vec{w}\| \ge \rho$. If we try to maximize $\rho$ directly, there are infinitely many solutions formed by simply rescaling $\vec{w}$, so fix $\rho \|\vec{w}\| = 1$ and minimize $\|\vec{w}\|$ to maximize $\rho$. That is: minimize $\frac{1}{2}\|\vec{w}\|^2$ subject to $\forall t,\ r^t(\vec{w}^T \vec{x}^t + w_0) \ge +1$.
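
The distance formula can be checked in a few lines of NumPy; this is only an illustrative sketch, and the weight vector, bias, and instances below are made-up values rather than anything from the slides.

      import numpy as np

      w = np.array([2.0, -1.0])           # hypothetical weight vector
      w0 = 0.5                            # hypothetical bias
      X = np.array([[1.0, 1.0],           # hypothetical instances, one per row
                    [0.0, 2.0],
                    [3.0, -1.0]])

      # Distance of each instance to the hyperplane w^T x + w0 = 0
      dist = np.abs(X @ w + w0) / np.linalg.norm(w)
      rho = dist.min()                    # the margin is the smallest such distance
      print(dist, rho)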

  9. Margin. Circled inputs are the ones that determine the border.

  10. Margin (continued). After training, we could forget about the other inputs.

  11. Margin (continued). In fact, we might be able to eliminate some of the others before training.

  12. Derivation (1/3). Minimize $\frac{1}{2}\|\vec{w}\|^2$ subject to $\forall t,\ r^t(\vec{w}^T \vec{x}^t + w_0) \ge +1$. The Lagrangian is
      $L_p = \frac{1}{2}\|\vec{w}\|^2 - \sum_{t=1}^{N} \alpha^t \left[ r^t(\vec{w}^T \vec{x}^t + w_0) - 1 \right]$
      $\quad\; = \frac{1}{2}\|\vec{w}\|^2 - \sum_{t=1}^{N} \alpha^t r^t(\vec{w}^T \vec{x}^t + w_0) + \sum_{t=1}^{N} \alpha^t$
      Setting $\frac{\partial L_p}{\partial \vec{w}} = 0$ gives $\vec{w} = \sum_{t=1}^{N} \alpha^t r^t \vec{x}^t$.

  13. Derivation (2/3).
      $\frac{\partial L_p}{\partial \vec{w}} = 0 \Rightarrow \vec{w} = \sum_{t=1}^{N} \alpha^t r^t \vec{x}^t$
      $\frac{\partial L_p}{\partial w_0} = 0 \Rightarrow \sum_{t=1}^{N} \alpha^t r^t = 0$
      Maximizing $L_p$ is equivalent to maximizing the dual
      $L_d = \frac{1}{2}\vec{w}^T\vec{w} - \vec{w}^T \sum_t \alpha^t r^t \vec{x}^t - w_0 \sum_t \alpha^t r^t + \sum_t \alpha^t$
      subject to $\sum_t \alpha^t r^t = 0$ and $\alpha^t \ge 0$.

  14. Derivation (3/3).
      $L_d = \frac{1}{2}\vec{w}^T\vec{w} - \vec{w}^T \sum_t \alpha^t r^t \vec{x}^t - w_0 \sum_t \alpha^t r^t + \sum_t \alpha^t$
      $\quad\; = -\frac{1}{2}\vec{w}^T\vec{w} + \sum_t \alpha^t$
      $\quad\; = -\frac{1}{2}\sum_t \sum_s \alpha^t \alpha^s r^t r^s (\vec{x}^t)^T \vec{x}^s + \sum_t \alpha^t$
      subject to $\sum_t \alpha^t r^t = 0$ and $\alpha^t \ge 0$. Solve numerically. Most $\alpha^t$ are 0; the small number with $\alpha^t > 0$ are the support vectors:
      $\vec{w} = \sum_{t=1}^{N} \alpha^t r^t \vec{x}^t$

  15. Support Vectors. Circled inputs are the support vectors: $\vec{w} = \sum_{t=1}^{N} \alpha^t r^t \vec{x}^t$. Compute $w_0$ as the average, over the support vectors, of $w_0 = r^t - \vec{w}^T \vec{x}^t$.
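
To make "solve numerically" concrete, the sketch below maximizes the dual with a general-purpose constrained solver (SciPy's SLSQP) on a tiny made-up separable data set and then recovers $\vec{w}$ and $w_0$ from the support vectors. The data set and the 1e-6 support-vector threshold are assumptions; a real implementation would use a dedicated QP/SMO solver such as libsvm.

      import numpy as np
      from scipy.optimize import minimize

      # Tiny hypothetical linearly separable data set
      X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
      r = np.array([1.0, 1.0, -1.0, -1.0])
      N = len(r)
      G = (r[:, None] * X) @ (r[:, None] * X).T    # G[t, s] = r^t r^s (x^t)^T x^s

      def neg_dual(a):
          # Negative of L_d, so that minimizing it maximizes the dual
          return 0.5 * a @ G @ a - a.sum()

      cons = {'type': 'eq', 'fun': lambda a: a @ r}    # sum_t alpha^t r^t = 0
      bnds = [(0.0, None)] * N                         # alpha^t >= 0
      alpha = minimize(neg_dual, np.zeros(N), bounds=bnds, constraints=cons).x

      sv = alpha > 1e-6                                # support vectors: alpha^t > 0
      w = ((alpha * r)[:, None] * X).sum(axis=0)       # w = sum_t alpha^t r^t x^t
      w0 = np.mean(r[sv] - X[sv] @ w)                  # average over the support vectors
      print(np.where(sv)[0], w, w0)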

  16. Demo. Applet on http://www.csie.ntu.edu.tw/~cjlin/libsvm/
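
For readers without the applet, scikit-learn's SVC class wraps the same libsvm library; a minimal sketch, with made-up data and parameters that are not part of the demo:

      import numpy as np
      from sklearn.svm import SVC

      X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
      r = np.array([1, 1, -1, -1])

      clf = SVC(kernel='linear', C=1e6)            # a large C approximates the hard margin
      clf.fit(X, r)
      print(clf.support_vectors_)                  # the "circled" inputs
      print(clf.coef_, clf.intercept_)             # w and w0 for the linear kernel
      print(clf.decision_function([[1.0, 0.5]]))   # w^T x + w0 for a new point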

  17. Soft Margin Hyperplanes. Suppose the classes are almost, but not quite, linearly separable. Relax the constraints to $r^t(\vec{w}^T \vec{x}^t + w_0) \ge 1 - \xi^t$, where the slack variables $\xi^t \ge 0$ store the deviation from the margin: $\xi^t = 0$ means $\vec{x}^t$ is more than 1 away from the hyperplane; $0 < \xi^t < 1$ means $\vec{x}^t$ is within the margin but correctly classified; $\xi^t \ge 1$ means $\vec{x}^t$ is misclassified. $\sum_t \xi^t$ is a measure of error; add it as a penalty term:
      $L_p = \frac{1}{2}\|\vec{w}\|^2 + C \sum_t \xi^t$
      where C is a penalty factor that trades off complexity against data misfitting.

  18. Soft Margin Derivation. With $L_p = \frac{1}{2}\|\vec{w}\|^2 + C \sum_t \xi^t$, where C is a penalty factor that trades off complexity against data misfitting, the derivation leads to the same numerical optimization problem with the new constraint $0 \le \alpha^t \le C$.
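
A small sketch of the C trade-off using scikit-learn; the clusters, noise level, and C values below are made up. Smaller C tolerates more slack (and typically keeps more support vectors), larger C penalizes violations harder, and the dual coefficients stay bounded by C.

      import numpy as np
      from sklearn.svm import SVC

      rng = np.random.default_rng(0)
      # Two overlapping hypothetical clusters, so some slack is unavoidable
      X = np.vstack([rng.normal(1.0, 1.2, (50, 2)), rng.normal(-1.0, 1.2, (50, 2))])
      r = np.array([1] * 50 + [-1] * 50)

      for C in (0.01, 1.0, 100.0):
          clf = SVC(kernel='linear', C=C).fit(X, r)
          # dual_coef_ holds alpha^t * r^t for the support vectors, so |alpha^t| <= C
          print(C, clf.n_support_, np.abs(clf.dual_coef_).max())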

  19. Soft Margin Example. Applet on http://www.csie.ntu.edu.tw/~cjlin/libsvm/

  20. Hinge Loss.
      $L_{\text{hinge}}(y^t, r^t) = \begin{cases} 0 & \text{if } y^t r^t \ge 1 \\ 1 - y^t r^t & \text{otherwise} \end{cases}$
      Its behavior on the interval 0..1 makes it more robust than the 0/1 and squared-error losses, and it is close to cross-entropy over much of its range.
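
A one-line NumPy version of the hinge loss, compared with the 0/1 loss; the discriminant outputs y and labels r below are arbitrary illustrative values.

      import numpy as np

      def hinge_loss(y, r):
          # L_hinge(y, r) = 0 if y*r >= 1, else 1 - y*r
          return np.maximum(0.0, 1.0 - y * r)

      def zero_one_loss(y, r):
          # 1 when the sign of y disagrees with r, else 0
          return (np.sign(y) != r).astype(float)

      y = np.array([2.0, 0.4, -0.3, -2.0])   # hypothetical discriminant outputs
      r = np.array([1, 1, 1, -1])            # true labels
      print(hinge_loss(y, r))                # [0.  0.6 1.3 0. ]
      print(zero_one_loss(y, r))             # [0. 0. 1. 0.]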

  21. Non-Linear SVM. Replace the inputs $\vec{x}$ by a sequence of basis functions, $\vec{z} = \vec{\phi}(\vec{x})$.
      Linear SVM: $\vec{w} = \sum_t \alpha^t r^t \vec{x}^t$, so $g(\vec{x}) = \vec{w}^T \vec{x} = \sum_t \alpha^t r^t (\vec{x}^t)^T \vec{x}$.
      Kernel SVM: $\vec{w} = \sum_t \alpha^t r^t \vec{z}^t = \sum_t \alpha^t r^t \vec{\phi}(\vec{x}^t)$, so $g(\vec{x}) = \vec{w}^T \vec{\phi}(\vec{x}) = \sum_t \alpha^t r^t \vec{\phi}(\vec{x}^t)^T \vec{\phi}(\vec{x})$.

  22. The Kernel. $g(\vec{x}) = \sum_t \alpha^t r^t (\vec{x}^t)^T \vec{x}$: the inner product $(\vec{x}^t)^T \vec{x}$ is a measure of similarity between $\vec{x}$ and a support vector.

  23. The Kernel (continued). In the basis space, $g(\vec{x}) = \sum_t \alpha^t r^t \vec{\phi}(\vec{x}^t)^T \vec{\phi}(\vec{x})$: here $\vec{\phi}(\vec{x}^t)^T \vec{\phi}(\vec{x})$ can be seen as a similarity measure in the non-linear basis space.

  24. The Kernel (continued). To generalize, let $K(\vec{x}^t, \vec{x}) = \vec{\phi}(\vec{x}^t)^T \vec{\phi}(\vec{x})$, so that $g(\vec{x}) = \sum_t \alpha^t r^t K(\vec{x}^t, \vec{x})$.

  25. The Kernel (continued). K is a kernel function.
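
A small sketch of the kernelized discriminant $g(\vec{x}) = \sum_t \alpha^t r^t K(\vec{x}^t, \vec{x})$ in plain NumPy; the support vectors, multipliers, labels, and kernels below are made-up illustrative values, and the bias $w_0$ is omitted for brevity.

      import numpy as np

      def g(x, support_X, alpha, r, kernel):
          # g(x) = sum_t alpha^t r^t K(x^t, x), summed over the support vectors
          return sum(a * rt * kernel(xt, x) for a, rt, xt in zip(alpha, r, support_X))

      support_X = np.array([[1.0, 2.0], [-1.0, -1.0]])   # hypothetical support vectors
      alpha = np.array([0.7, 0.7])                       # hypothetical multipliers
      r = np.array([1, -1])                              # their labels

      linear = lambda u, v: u @ v                 # K(u, v) = u^T v
      poly2 = lambda u, v: (u @ v + 1.0) ** 2     # K(u, v) = (u^T v + 1)^2

      x = np.array([0.5, 0.5])
      print(g(x, support_X, alpha, r, linear), g(x, support_X, alpha, r, poly2))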

  26. Polynomial Kernels. $K_q(\vec{x}^t, \vec{x}) = (\vec{x}^T \vec{x}^t + 1)^q$. E.g., for q = 2 in two dimensions,
      $K(\vec{x}, \vec{y}) = (\vec{x}^T \vec{y} + 1)^2 = (x_1 y_1 + x_2 y_2 + 1)^2 = 1 + 2 x_1 y_1 + 2 x_2 y_2 + 2 x_1 x_2 y_1 y_2 + x_1^2 y_1^2 + x_2^2 y_2^2$
      which corresponds to the basis $\vec{\phi}(\vec{x}) = [1,\ \sqrt{2}\, x_1,\ \sqrt{2}\, x_2,\ \sqrt{2}\, x_1 x_2,\ x_1^2,\ x_2^2]$ (FWIW).
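
A quick numerical check that the q = 2 kernel is indeed an inner product in that expanded basis; the two test vectors are arbitrary made-up values.

      import numpy as np

      def phi(x):
          # Explicit basis for the q = 2 polynomial kernel in two dimensions
          x1, x2 = x
          return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                           np.sqrt(2) * x1 * x2, x1 ** 2, x2 ** 2])

      x = np.array([0.3, -1.2])
      y = np.array([2.0, 0.5])
      lhs = (x @ y + 1.0) ** 2               # K(x, y) = (x^T y + 1)^2
      rhs = phi(x) @ phi(y)                  # phi(x)^T phi(y)
      print(lhs, rhs, np.isclose(lhs, rhs))  # the two agree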

  27. Radial-Basis Kernels. $K(\vec{x}^t, \vec{x}) = \exp\left( -\frac{\|\vec{x}^t - \vec{x}\|^2}{2 s^2} \right)$. Other options include sigmoidal kernels (typically tanh).
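
A sketch of the radial-basis kernel in NumPy, together with the corresponding scikit-learn call; sklearn's rbf kernel is parameterized as exp(-gamma ||u - v||^2), so gamma plays the role of 1/(2 s^2). The spread s and the data below are made-up values.

      import numpy as np
      from sklearn.svm import SVC

      def rbf(xt, x, s=1.0):
          # K(x^t, x) = exp(-||x^t - x||^2 / (2 s^2))
          return np.exp(-np.sum((xt - x) ** 2) / (2.0 * s ** 2))

      s = 0.8
      X = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0], [4.0, 4.0]])
      r = np.array([1, 1, -1, -1])

      clf = SVC(kernel='rbf', gamma=1.0 / (2.0 * s ** 2)).fit(X, r)
      print(rbf(X[0], X[1], s), clf.predict([[0.5, 0.5], [3.5, 3.5]]))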

  28. Selecting Kernels. Kernels can be customized to the application by choosing appropriate measures of similarity: bag of words (normalized cosines between vocabulary vectors); genetics (edit distance between strings); graphs (length of the shortest path between nodes, or number of connecting paths). For input sets with very large dimension, it may be cheaper to pre-compute and save the matrix of kernel values (the Gram matrix) rather than keeping all the inputs available.
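
A sketch of training from a pre-computed Gram matrix using scikit-learn's 'precomputed' kernel option; the normalized-cosine similarity (as in the bag-of-words example) and the toy count vectors below are made up.

      import numpy as np
      from sklearn.svm import SVC

      def cosine_gram(A, B):
          # Gram matrix of normalized cosines between the rows of A and the rows of B
          An = A / np.linalg.norm(A, axis=1, keepdims=True)
          Bn = B / np.linalg.norm(B, axis=1, keepdims=True)
          return An @ Bn.T

      X_train = np.array([[3.0, 0.0, 1.0], [2.0, 1.0, 0.0],   # hypothetical word-count vectors
                          [0.0, 2.0, 3.0], [0.0, 3.0, 2.0]])
      r_train = np.array([1, 1, -1, -1])
      X_test = np.array([[2.0, 0.0, 0.0], [0.0, 1.0, 4.0]])

      clf = SVC(kernel='precomputed')
      clf.fit(cosine_gram(X_train, X_train), r_train)    # N x N Gram matrix
      print(clf.predict(cosine_gram(X_test, X_train)))   # test-vs-train kernel values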

  29. Multi-Classes.
      1 versus all: K separate N-variable problems.
      Pairwise separation: K(K-1)/2 separate N-variable problems.
      Single multiclass optimization: minimize $\frac{1}{2}\sum_{i=1}^{K} \|\vec{w}_i\|^2 + C \sum_t \sum_i \xi_i^t$ subject to
      $\vec{w}_{z^t}^T \vec{x}^t + w_{z^t 0} \ge \vec{w}_i^T \vec{x}^t + w_{i 0} + 2 - \xi_i^t,\ \forall i \ne z^t$
      where $z^t$ is the index of the class of $\vec{x}^t$: one problem in K*N variables.
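
A short sketch contrasting the first two strategies with scikit-learn on a made-up three-class data set: LinearSVC trains one-vs-rest (K linear problems), while SVC votes over pairwise (one-vs-one) classifiers.

      import numpy as np
      from sklearn.svm import SVC, LinearSVC

      rng = np.random.default_rng(1)
      # Three hypothetical Gaussian clusters, K = 3 classes
      X = np.vstack([rng.normal(c, 0.5, (30, 2)) for c in ((0, 0), (3, 0), (0, 3))])
      y = np.repeat([0, 1, 2], 30)

      ovr = LinearSVC().fit(X, y)                                           # one-vs-rest
      ovo = SVC(kernel='linear', decision_function_shape='ovo').fit(X, y)   # pairwise voting
      print(ovr.coef_.shape)                        # (3, 2): one w_i per class
      print(ovo.decision_function(X[:1]).shape)     # (1, 3): K(K-1)/2 = 3 pairwise scores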
