Chapter 8. Support Vector Machines
Wei Pan
Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455
Email: weip@biostat.umn.edu
PubH 7475/8475
© Wei Pan
Introduction
◮ SVM: §4.5.2, 12.1-12.3; Vapnik (1996).
◮ Training data: (Y_i, X_i), Y_i = ±1, i = 1, ..., n.
◮ Fig 4.14: with two separable classes there are many possible separating hyperplanes, e.g., least squares (or LDA): 1 misclassification; perceptron: the solution depends on the (random) starting values; SVC: maximize the "separation" (margin) between the two classes; Fig 4.16.
FIGURE 4.14 (Elements of Statistical Learning, 2nd Ed., © Hastie, Tibshirani & Friedman 2009, Chap 4). A toy example with two classes separable by a hyperplane. The orange line is the least squares solution, which misclassifies one of the training points. Also shown are two blue separating hyperplanes found by the perceptron learning algorithm with different random starts.
FIGURE 4.16 (Elements of Statistical Learning, 2nd Ed., © Hastie, Tibshirani & Friedman 2009, Chap 4). The same data as in Figure 4.14. The shaded region delineates the maximum margin separating the two classes. There are three support points indicated, which lie on the boundary of the margin, and the optimal separating hyperplane (blue line) bisects the slab. Included in the figure is the boundary found using logistic regression (red line), which is very close to the optimal separating hyperplane (see Section 12.3.3).
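A minimal numerical illustration of the idea in Figs 4.14/4.16 (not part of the original slides): fit a linear SVC on toy separable data and inspect the maximal-margin solution and its support vectors. The use of scikit-learn/NumPy, the toy data, and all variable names are assumptions made only for illustration.

# Minimal sketch (assumes NumPy and scikit-learn are installed).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two separable Gaussian classes in R^2, labels coded as +/- 1.
X = np.vstack([rng.normal([-2, -2], 0.5, size=(20, 2)),
               rng.normal([2, 2], 0.5, size=(20, 2))])
y = np.r_[-np.ones(20), np.ones(20)]

# A very large C approximates the hard-margin (maximal-margin) classifier.
fit = SVC(kernel="linear", C=1e6).fit(X, y)

beta, beta0 = fit.coef_.ravel(), fit.intercept_[0]
print("beta:", beta, "beta0:", beta0)
print("margin width 2/||beta||:", 2 / np.linalg.norm(beta))
print("support vectors:\n", fit.support_vectors_)   # the points lying on the margin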
Review
◮ Hyperplane L: f(x) = β_0 + β′x = 0.
◮ 1) Any x_1, x_2 ∈ L ⇒ β′(x_1 − x_2) = 0 ⇒ β ⊥ L; β* = β/||β|| is the unit vector normal to L.
◮ 2) x_0 ∈ L ⇒ β_0 + β′x_0 = 0.
◮ 3) The signed distance of any x to L is β*′(x − x_0) = (β′x − β′x_0)/||β|| = (β′x + β_0)/||β|| = f(x)/||β||, so f(x) ∝ the signed distance of x to L. Fig 4.15.
FIGURE 4.15 (Elements of Statistical Learning, 2nd Ed., © Hastie, Tibshirani & Friedman 2009, Chap 4). The linear algebra of a hyperplane (affine set): β_0 + β′x = 0, with unit normal β* and a point x_0 on the hyperplane.
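A minimal numerical check of item 3) above (not part of the original slides): for the hyperplane f(x) = β_0 + β′x = 0, the signed distance of x to the hyperplane equals f(x)/||β||. Only NumPy is assumed, and the specific numbers are illustrative.

# Minimal sketch (assumes NumPy is installed).
import numpy as np

beta = np.array([3.0, 4.0])      # normal vector of the hyperplane
beta0 = -5.0
f = lambda x: beta0 + beta @ x   # f(x) = beta0 + beta'x

x = np.array([2.0, 1.0])         # an arbitrary point
x0 = np.array([1.0, 0.5])        # a point on the hyperplane: beta0 + beta'x0 = 0
assert abs(f(x0)) < 1e-12

beta_star = beta / np.linalg.norm(beta)          # unit normal beta* = beta/||beta||
signed_dist = beta_star @ (x - x0)               # projection of (x - x0) onto beta*
print(signed_dist, f(x) / np.linalg.norm(beta))  # the two quantities agree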
Case I: two classes are separable
◮ WLOG, assume ||β|| = 1 in f(x) = β_0 + β′x. Classifier: G(x) = sign(f(x)).
◮ Since the two classes are separable:
1) there exists an f(x) = β_0 + β′x = 0 s.t. Y_i f(X_i) > 0 for all i;
2) there exists an f(x) = β_0 + β′x = 0 s.t. the margin is maximized; Fig 12.1.
◮ Optimization problem:
max_{β_0, β, ||β||=1} M s.t. Y_i(β_0 + β′X_i) ≥ M for i = 1, ..., n.
Q: what is β_0 + β′X_i? (With ||β|| = 1, it is the signed distance of X_i to the hyperplane.)
◮ Or, dropping the constraint ||β|| = 1:
max_{β_0, β} M s.t. Y_i(β_0 + β′X_i)/||β|| ≥ M for i = 1, ..., n.
◮ Set ||β|| = 1/M; then
min_{β_0, β} ||β||, or equivalently min_{β_0, β} ½||β||², s.t. Y_i(β_0 + β′X_i) ≥ 1 for i = 1, ..., n.
(A numerical sketch of this problem follows Fig 12.1 below.)
FIGURE 12.1 (Elements of Statistical Learning, 2nd Ed., © Hastie, Tibshirani & Friedman 2009, Chap 12). Support vector classifiers. The left panel shows the separable case. The decision boundary is the solid line, while broken lines bound the shaded maximal margin of width 2M = 2/||β||. The right panel shows the nonseparable (overlap) case. The points labeled ξ*_j are on the wrong side of their margin by an amount ξ*_j = M ξ_j; points on the correct side have ξ*_j = 0. The margin is maximized subject to a total budget Σ ξ_i ≤ constant. Hence Σ ξ*_j is the total distance of points on the wrong side of their margin.
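A minimal sketch of the separable-case optimization above, posed directly as the quadratic program min ½||β||² s.t. Y_i(β_0 + β′X_i) ≥ 1. The choice of cvxpy (the course software is not specified here), the toy data, and the variable names are assumptions for illustration only.

# Minimal sketch (assumes NumPy and cvxpy are installed).
#   min (1/2)||beta||^2   s.t.   Y_i (beta0 + beta' X_i) >= 1 for all i.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, p = 40, 2
# Two well-separated classes, so a separating hyperplane exists
# (the hard-margin problem is infeasible otherwise).
X = np.vstack([rng.normal(-3.0, 0.7, size=(n // 2, p)),
               rng.normal(+3.0, 0.7, size=(n // 2, p))])
y = np.r_[-np.ones(n // 2), np.ones(n // 2)]

beta = cp.Variable(p)
beta0 = cp.Variable()
prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(beta)),
                  [cp.multiply(y, X @ beta + beta0) >= 1])
prob.solve()

margin = 1 / np.linalg.norm(beta.value)          # M = 1/||beta||
on_margin = np.isclose(y * (X @ beta.value + beta0.value), 1, atol=1e-4)
print("margin M:", margin, "number of support points:", int(np.sum(on_margin)))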
◮ Convex programming: a quadratic objective with linear inequality constraints.
◮ Rewritten as a Lagrange function, ... the solution β is determined by a few support points/vectors X_i. Fig 4.16: 3 SVs.
◮ Remarks: 1) SVC: a large margin leads to better separation/prediction on test data!?
◮ 2) Robustness: β_0 and β are determined only by the SVs, but ...
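For reference, a supplement (not part of the original slide) with the standard results behind the remarks above, as in ESL §4.5.2 and §12.2.1, written in the notation of these slides:
◮ Lagrange (primal) function: L_P = ½||β||² − Σ_{i=1}^n α_i [Y_i(β_0 + β′X_i) − 1], with multipliers α_i ≥ 0.
◮ Setting the derivatives w.r.t. β and β_0 to zero: β = Σ_i α_i Y_i X_i and Σ_i α_i Y_i = 0.
◮ Substituting back gives the (Wolfe) dual, maximized over α_i ≥ 0: L_D = Σ_i α_i − ½ Σ_i Σ_j α_i α_j Y_i Y_j X_i′X_j.
◮ KKT: α_i [Y_i(β_0 + β′X_i) − 1] = 0 for all i, so α_i > 0 only for points on the margin boundary; these are the support vectors, and β = Σ_i α_i Y_i X_i depends on them alone.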
Case II: non-separable
◮ Introduce slack variables ξ_i:
max_{β_0, β, ||β||=1} M s.t. Y_i(β_0 + β′X_i) ≥ M(1 − ξ_i), ξ_i ≥ 0 for i = 1, ..., n, and Σ_{i=1}^n ξ_i ≤ B.
◮ Rewrite:
min_{β_0, β} ½||β||² + C Σ_{i=1}^n ξ_i s.t. ξ_i ≥ 0 and Y_i(β_0 + β′X_i) ≥ 1 − ξ_i ∀ i,
where C is the "cost", a tuning parameter. Fig 12.2
◮ ... similar results as before (e.g., convex programming, SVs). (A numerical sketch follows below.)
(8000): Computing, §12.2.1.
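A minimal sketch of the soft-margin problem above, written directly with slack variables ξ_i and cost C. As before, cvxpy, the toy data, and the value of C are assumptions for illustration only; the same problem is what standard SVM software solves (usually via its dual).

# Minimal sketch (assumes NumPy and cvxpy are installed).
#   min (1/2)||beta||^2 + C * sum_i xi_i
#   s.t. xi_i >= 0 and Y_i (beta0 + beta' X_i) >= 1 - xi_i for all i.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(1)
n, p, C = 60, 2, 1.0                              # C is the "cost" tuning parameter
X = np.vstack([rng.normal(-1.0, 1.0, size=(n // 2, p)),
               rng.normal(+1.0, 1.0, size=(n // 2, p))])   # overlapping classes
y = np.r_[-np.ones(n // 2), np.ones(n // 2)]

beta = cp.Variable(p)
beta0 = cp.Variable()
xi = cp.Variable(n, nonneg=True)                  # slack variables xi_i >= 0

obj = cp.Minimize(0.5 * cp.sum_squares(beta) + C * cp.sum(xi))
cons = [cp.multiply(y, X @ beta + beta0) >= 1 - xi]
cp.Problem(obj, cons).solve()

print("beta:", beta.value, "beta0:", beta0.value)
print("points with positive slack (wrong side of their margin):",
      int(np.sum(xi.value > 1e-6)))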