  1. Support Vector Machines Part 2 Yingyu Liang Computer Sciences 760 Fall 2017 http://pages.cs.wisc.edu/~yliang/cs760/ Some of the slides in these lectures have been adapted/borrowed from materials developed by Mark Craven, David Page, Jude Shavlik, Tom Mitchell, Nina Balcan, Matt Gormley, Elad Hazan, Tom Dietterich, and Pedro Domingos.

  2. Goals for the lecture you should understand the following concepts • soft margin SVM • support vector regression • the kernel trick • polynomial kernel • Gaussian/RBF kernel • valid kernels and Mercer’s theorem • kernels and neural networks

  3. Variants: soft-margin and SVR

  4. Hard-margin SVM • Optimization (Quadratic Programming): $\min_{w,b} \ \frac{1}{2}\|w\|^2$ subject to $y_j (w^\top x_j + b) \ge 1, \ \forall j$
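
A minimal sketch of solving this QP directly, not from the slides: it assumes a small linearly separable toy set and uses cvxpy (an extra dependency not mentioned in the lecture) as the QP solver; all variable names are illustrative.

```python
# Sketch: solve the hard-margin SVM QP directly (assumes linearly separable data).
import numpy as np
import cvxpy as cp

# toy linearly separable data, labels in {-1, +1}
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(X.shape[1])
b = cp.Variable()
# min (1/2)||w||^2  s.t.  y_j (w^T x_j + b) >= 1 for all j
constraints = [cp.multiply(y, X @ w + b) >= 1]
prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints)
prob.solve()
print("w =", w.value, "b =", b.value)
```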

  5. Soft-margin SVM [Cortes & Vapnik, Machine Learning 1995] • if the training instances are not linearly separable, the previous formulation will fail • we can adjust our approach by using slack variables (denoted by $\xi_j$) to tolerate errors: $\min_{w,b,\xi_j} \ \frac{1}{2}\|w\|^2 + C \sum_j \xi_j$ subject to $y_j (w^\top x_j + b) \ge 1 - \xi_j, \ \xi_j \ge 0, \ \forall j$ • $C$ determines the relative importance of maximizing the margin vs. minimizing the slack
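
A minimal sketch of the soft-margin formulation in practice (not part of the slides), assuming scikit-learn is available; SVC's C parameter plays the role of $C$ above, and the data is illustrative.

```python
# Sketch: soft-margin linear SVM via scikit-learn (C is the slack penalty above).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=+1.5, size=(50, 2)),
               rng.normal(loc=-1.5, size=(50, 2))])   # overlapping classes
y = np.array([1] * 50 + [-1] * 50)

clf = SVC(kernel="linear", C=1.0)   # smaller C = wider margin, more slack allowed
clf.fit(X, y)
print("num support vectors:", clf.support_vectors_.shape[0])
```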

  6. The effect of $C$ in soft-margin SVM Figure from Ben-Hur & Weston, Methods in Molecular Biology 2010

  7. Hinge loss • when we covered neural nets, we talked about minimizing squared loss and cross-entropy loss • SVMs minimize hinge loss [Figure: loss (error) vs. model output $h(\boldsymbol{x})$ when $y = 1$, comparing squared loss, 0/1 loss, and hinge loss]
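
As a quick numerical illustration (a sketch, not from the slides), the three losses can be compared at $y = 1$ for a range of model outputs:

```python
# Sketch: compare squared, 0/1, and hinge loss for a positive example (y = +1).
import numpy as np

y = 1.0
h = np.linspace(-2, 2, 9)              # model outputs
squared = (y - h) ** 2
zero_one = (np.sign(h) != y).astype(float)
hinge = np.maximum(0.0, 1.0 - y * h)   # no loss once the margin y*h >= 1
for out, s, z, g in zip(h, squared, zero_one, hinge):
    print(f"h={out:+.1f}  squared={s:.2f}  0/1={z:.0f}  hinge={g:.2f}")
```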

  8. Support Vector Regression • the SVM idea can also be applied in regression tasks • an $\varepsilon$-insensitive error function specifies that a training instance is well explained if the model’s prediction is within $\varepsilon$ of $y_j$, i.e. between the boundaries $(w^\top x + b) - y = \varepsilon$ and $y - (w^\top x + b) = \varepsilon$

  9. Support Vector Regression • regression using slack variables (denoted by $\xi_j, \hat{\xi}_j$) to tolerate errors: $\min_{w,b,\xi_j,\hat{\xi}_j} \ \frac{1}{2}\|w\|^2 + C \sum_j (\xi_j + \hat{\xi}_j)$ subject to $(w^\top x_j + b) - y_j \le \varepsilon + \xi_j$, $\ y_j - (w^\top x_j + b) \le \varepsilon + \hat{\xi}_j$, $\ \xi_j, \hat{\xi}_j \ge 0$ • slack variables allow predictions for some training instances to be off by more than $\varepsilon$
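
A minimal sketch (not from the slides), assuming scikit-learn; SVR's epsilon and C parameters correspond to $\varepsilon$ and $C$ above, and the data is illustrative.

```python
# Sketch: epsilon-insensitive support vector regression via scikit-learn.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3, 3, size=(80, 1)), axis=0)
y = 0.5 * X.ravel() + 0.3 * rng.normal(size=80)   # noisy linear target

reg = SVR(kernel="linear", C=1.0, epsilon=0.1)    # points inside the 0.1 tube incur no loss
reg.fit(X, y)
print("slope ~", reg.coef_, "intercept ~", reg.intercept_)
```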

  10. Kernel methods

  11. Features [Figure: an input image $x$ is mapped to a feature vector $\varphi(x)$ by extracting a color histogram over the Red, Green, and Blue channels]

  12. Features • a proper feature mapping can turn a non-linear problem into a linear one!

  13. Recall: SVM dual form (only depends on inner products) • reduces to the dual problem: $L(w, b, \boldsymbol{\alpha}) = \sum_j \alpha_j - \frac{1}{2} \sum_{jk} \alpha_j \alpha_k y_j y_k \, x_j^\top x_k$, with $\sum_j \alpha_j y_j = 0, \ \alpha_j \ge 0$ • since $w = \sum_j \alpha_j y_j x_j$, we have $w^\top x + b = \sum_j \alpha_j y_j \, x_j^\top x + b$
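
Because the dual touches the data only through inner products, an SVM can be trained from a precomputed Gram matrix alone. A minimal sketch (not from the slides), assuming scikit-learn's kernel="precomputed" option; data and names are illustrative.

```python
# Sketch: train an SVM from the Gram (inner-product) matrix only.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(60, 2))
y_train = np.sign(X_train[:, 0] + X_train[:, 1])

K_train = X_train @ X_train.T          # Gram matrix of inner products (linear kernel)
clf = SVC(kernel="precomputed")
clf.fit(K_train, y_train)

X_test = rng.normal(size=(5, 2))
K_test = X_test @ X_train.T            # kernel between test and training points
print(clf.predict(K_test))
```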

  14. Features • using the SVM on the feature space $\{\varphi(x_j)\}$: we only need the inner products $\varphi(x_j)^\top \varphi(x_k)$ • conclusion: no need to design $\varphi(\cdot)$, only need to design $k(x_j, x_k) = \varphi(x_j)^\top \varphi(x_k)$
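
One way to exercise this "design $k$ directly" idea in scikit-learn is to pass a callable kernel to SVC. A minimal sketch, not from the slides; the Laplacian-style kernel below is just an illustrative choice whose feature map is never computed explicitly.

```python
# Sketch: supply a hand-designed kernel function directly, never forming phi(x).
import numpy as np
from sklearn.svm import SVC

def my_kernel(A, B):
    # Laplacian-style kernel k(x, x') = exp(-||x - x'||_1), pairwise over rows of A and B
    d1 = np.abs(A[:, None, :] - B[None, :, :]).sum(axis=2)
    return np.exp(-d1)

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)   # non-linear labels

clf = SVC(kernel=my_kernel).fit(X, y)
print("training accuracy:", clf.score(X, y))
```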

  15. Polynomial kernels • fix a degree $d$ and constant $c$: $k(x, x') = (x^\top x' + c)^d$ • what is $\varphi(x)$? • expand the expression to get $\varphi(x)$
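
A minimal sketch of using this kernel through scikit-learn (not from the slides); SVC's poly kernel is $(\gamma \, x^\top x' + \text{coef0})^{\text{degree}}$, so gamma=1, coef0=c, degree=d matches the formula above. The data is illustrative.

```python
# Sketch: polynomial-kernel SVM; degree/coef0/gamma map onto d, c above.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)   # XOR-like pattern, not linearly separable

clf = SVC(kernel="poly", degree=2, coef0=1.0, gamma=1.0, C=1.0).fit(X, y)
print("training accuracy:", clf.score(X, y))
```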

  16. Polynomial kernels Figure from Foundations of Machine Learning, by M. Mohri, A. Rostamizadeh, and A. Talwalkar

  17. SVMs with polynomial kernels Figure from Ben-Hur & Weston, Methods in Molecular Biology 2010

  18. Gaussian/RBF kernels • fix a bandwidth $\sigma$: $k(x, x') = \exp(-\|x - x'\|^2 / 2\sigma^2)$ • also called radial basis function (RBF) kernels • what is $\varphi(x)$? Consider the un-normalized version $k'(x, x') = \exp(x^\top x' / \sigma^2)$ • power series expansion: $k'(x, x') = \sum_{j=0}^{+\infty} \frac{(x^\top x')^j}{\sigma^{2j} \, j!}$
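
A small numerical check (a sketch, not part of the lecture), assuming scikit-learn: the Gaussian kernel computed by hand agrees with sklearn's rbf_kernel when gamma is set to $1/(2\sigma^2)$.

```python
# Sketch: Gaussian kernel by hand vs. sklearn's rbf_kernel (gamma = 1 / (2 sigma^2)).
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
sigma = 1.5

sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
K_manual = np.exp(-sq_dists / (2 * sigma ** 2))
K_sklearn = rbf_kernel(X, gamma=1.0 / (2 * sigma ** 2))
print(np.allclose(K_manual, K_sklearn))   # True
```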

  19. The RBF kernel illustrated [Figures: RBF-kernel decision boundaries for three settings of the kernel parameter; from openclassroom.stanford.edu (Andrew Ng)]

  20. Mercer’s condition for kernels • Theorem: $k(x, x')$ has an expansion $k(x, x') = \sum_{j=1}^{+\infty} a_j \, \varphi_j(x) \, \varphi_j(x')$ if and only if for any function $c(x)$, $\int \int c(x) \, c(x') \, k(x, x') \, dx \, dx' \ge 0$ (omitting some technical conditions on $k$ and $c$)
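
Mercer's condition cannot be verified exhaustively in code, but a common sanity check (a sketch, not from the slides) is that any Gram matrix built from a valid kernel must be positive semidefinite.

```python
# Sketch: the Gram matrix of a valid kernel is positive semidefinite
# (eigenvalues >= 0 up to numerical error), checked for the RBF kernel.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
K = rbf_kernel(X, gamma=0.5)

eigvals = np.linalg.eigvalsh(K)           # K is symmetric, so eigvalsh applies
print("min eigenvalue:", eigvals.min())   # should be >= about -1e-10
```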

  21. Constructing new kernels • kernels are closed under positive scaling, sum, product, pointwise limit, and composition with a power series $\sum_{j=0}^{+\infty} a_j \, k(x, x')^j$ with non-negative coefficients $a_j$ • example: if $k_1(x, x')$ and $k_2(x, x')$ are kernels, then so is $k(x, x') = 2 k_1(x, x') + 3 k_2(x, x')$ • example: if $k_1(x, x')$ is a kernel, then so is $k(x, x') = \exp(k_1(x, x'))$
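
A small numerical illustration of the two examples (a sketch, not from the slides): build new Gram matrices from existing ones and confirm they remain positive semidefinite up to numerical error.

```python
# Sketch: 2*K1 + 3*K2 and exp(K1), built at the Gram-matrix level, stay positive semidefinite.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel, polynomial_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))

K1 = rbf_kernel(X, gamma=1.0)
K2 = polynomial_kernel(X, degree=2, coef0=1.0)

for name, K in [("2*K1 + 3*K2", 2 * K1 + 3 * K2),
                ("exp(K1)", np.exp(K1))]:   # elementwise exp = power-series composition
    print(name, "min eigenvalue:", np.linalg.eigvalsh(K).min())
```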

  22. Kernel algebra • given a valid kernel, we can make new valid kernels using a variety of operators (kernel composition → mapping composition):
• $k(x, v) = k_a(x, v) + k_b(x, v)$ → $\varphi(x) = (\varphi_a(x), \varphi_b(x))$
• $k(x, v) = \gamma \, k_a(x, v), \ \gamma > 0$ → $\varphi(x) = \sqrt{\gamma} \, \varphi_a(x)$
• $k(x, v) = k_a(x, v) \, k_b(x, v)$ → $\varphi_l(x) = \varphi_{ai}(x) \, \varphi_{bj}(x)$
• $k(x, v) = f(x) \, f(v) \, k_a(x, v)$ → $\varphi(x) = f(x) \, \varphi_a(x)$

  23. Kernels vs. neural networks

  24. Features [Figure: an input image $x$ → extract features (color histogram over Red, Green, Blue) → build the hypothesis $y = w^\top \varphi(x)$]

  25. Features: part of the model • build the hypothesis $y = w^\top \varphi(x)$ • the feature map $\varphi$ is the nonlinear part of the model; $w^\top(\cdot)$ is the linear part

  26. Polynomial kernels Figure from Foundations of Machine Learning, by M. Mohri, A. Rostamizadeh, and A. Talwalkar

  27. Polynomial kernel SVM as a two-layer neural network • for degree $d = 2$ and $x \in \mathbb{R}^2$: $\varphi(x) = (x_1^2, \ x_2^2, \ \sqrt{2}\, x_1 x_2, \ \sqrt{2c}\, x_1, \ \sqrt{2c}\, x_2, \ c)$ and $y = \mathrm{sign}(w^\top \varphi(x) + b)$ • the first layer is fixed; if we also learn the first layer, it becomes a two-layer neural network
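
A quick numerical check (a sketch, not from the slides) that this explicit $\varphi$ reproduces the degree-2 polynomial kernel:

```python
# Sketch: verify phi(x).phi(x') == (x.x' + c)^2 for the degree-2 feature map above.
import numpy as np

def phi(x, c):
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2)*x1*x2,
                     np.sqrt(2*c)*x1, np.sqrt(2*c)*x2, c])

rng = np.random.default_rng(0)
x, xp = rng.normal(size=2), rng.normal(size=2)
c = 1.0
lhs = phi(x, c) @ phi(xp, c)
rhs = (x @ xp + c) ** 2
print(np.isclose(lhs, rhs))   # True
```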

  28. Comments on SVMs • we can find solutions that are globally optimal (maximize the margin) • because the learning task is framed as a convex optimization problem • no local minima, in contrast to multi-layer neural nets • there are two formulations of the optimization: primal and dual • the dual represents the classifier decision in terms of support vectors • the dual enables the use of kernel functions • we can use a wide range of optimization methods to learn SVMs • standard quadratic programming solvers • SMO [Platt, 1999] • linear programming solvers for some formulations • etc.

  29. Comments on SVMs • kernels provide a powerful way to • allow nonlinear decision boundaries • represent/compare complex objects such as strings and trees • incorporate domain knowledge into the learning task • using the kernel trick, we can implicitly use high-dimensional mappings without explicitly computing them • one SVM can represent only a binary classification task; multi-class problems are handled using multiple SVMs and some encoding • empirically, SVMs have shown (close to) state-of-the-art accuracy for many tasks • the kernel idea can be extended to other tasks (anomaly detection, regression, etc.)
