
The Support Vector Machine (Ken Kreutz-Delgado, Nuno Vasconcelos) - PDF document



  1. The Support Vector Machine. Ken Kreutz-Delgado, Nuno Vasconcelos, UC San Diego.

  2. Classification. A classification problem has two types of variables:
     • X - vector of observations (features) in the world
     • Y - state (class) of the world
     E.g. X ∈ X ⊂ R², X = (fever, blood pressure), Y ∈ Y = {disease, no disease}.
     X and Y are stochastically related, and this relationship can be well approximated by an
     "optimal" classifier function f, with $\hat{y} = f(x)$.
     Goal: design a "good" classifier h ≈ f, h : X → Y.

  3. Loss Functions and Risk. Usually h(·) is a parametric function, h(x, α). Generally it cannot
     estimate the value y arbitrarily well; indeed, the best we can (optimistically) hope for is
     that h will well approximate the unknown optimal classifier f, h ≈ f. We define a loss
     function L[y, h(x, α)].
     Goal: find the parameter values (equivalently, find the classifier) that minimize the expected
     value of the loss:
     Risk = Average Loss: $R(\alpha) = E_{X,Y}\{ L[y, h(x, \alpha)] \}$
     In particular, under the "0-1" loss the optimal solution is the Bayes Decision Rule (BDR):
     $h^*(x) = \arg\max_i P_{Y|X}(i \mid x)$
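The two quantities on this slide can be made concrete with a few lines of numpy. This is a minimal sketch of my own, not from the slides: the posterior table, sample points, and the function names `bdr` and `empirical_risk` are illustrative assumptions, and the empirical average of the 0-1 loss stands in for the expected risk.

```python
import numpy as np

# Toy setting: two classes, discrete observations x in {0, 1, 2}.
# Assumed-known posteriors P(Y = i | X = x); rows indexed by x, columns by class i.
posterior = np.array([[0.9, 0.1],
                      [0.6, 0.4],
                      [0.2, 0.8]])

def bdr(x):
    """Bayes decision rule under 0-1 loss: pick the most probable class given x."""
    return np.argmax(posterior[x])

def empirical_risk(h, xs, ys):
    """Average 0-1 loss of classifier h on a sample (xs, ys)."""
    return np.mean([h(x) != y for x, y in zip(xs, ys)])

xs = np.array([0, 1, 2, 2, 1])      # illustrative observations
ys = np.array([0, 0, 1, 0, 1])      # illustrative true classes
print(empirical_risk(bdr, xs, ys))
```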

  4. Bayes Decision Rule. The BDR carves up the observation space X, assigning a label to each
     region. Clearly, h* depends on the class densities:
     $h^*(x) = \arg\max_i \left[ \log P_{X|Y}(x \mid i) + \log P_Y(i) \right]$
     Problematic! Usually we don't know these densities!
     Key idea of discriminant learning:
     • First estimating the densities, followed by deriving the decision boundaries, is a
       computationally intractable (hence bad) strategy.
     • Vapnik's rule: "When solving a problem, avoid solving a more general (and thus usually much
       harder) problem as an intermediate step!"
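To see what the density-based form of the BDR looks like in code, here is a short sketch assuming two Gaussian class-conditional densities; the means, covariances, and priors are made-up illustrative values, and the log-density is hand-coded so that only numpy is needed.

```python
import numpy as np

def log_gauss(x, mu, Sigma):
    """Log of the multivariate Gaussian density N(x; mu, Sigma)."""
    d = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (d * np.log(2 * np.pi) + logdet
                   + diff @ np.linalg.solve(Sigma, diff))

# Illustrative class-conditional parameters and priors (assumed, not from the slides).
mus = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
Sigmas = [np.eye(2), np.eye(2)]
priors = [0.5, 0.5]

def bdr(x):
    """h*(x) = argmax_i [ log P_X|Y(x|i) + log P_Y(i) ]."""
    scores = [log_gauss(x, mus[i], Sigmas[i]) + np.log(priors[i]) for i in range(2)]
    return int(np.argmax(scores))

print(bdr(np.array([1.5, 0.5])))
```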

  5. Discriminant Learning. Work directly with the decision function:
     1. Postulate a (parametric) family of decision boundaries.
     2. Pick the element in this family that produces the best classifier.
     Q: What is a good family of decision boundaries? Consider two equal-probability Gaussian
     class-conditional densities of equal covariance:
     $h^*(x) = \arg\max_i \left\{ \log G(x, \mu_i, \Sigma) + \log \tfrac{1}{2} \right\}
             = \arg\min_i (x - \mu_i)^T \Sigma^{-1} (x - \mu_i)$
     i.e. $h^*(x) = 0$ if $(x - \mu_0)^T \Sigma^{-1} (x - \mu_0) < (x - \mu_1)^T \Sigma^{-1} (x - \mu_1)$,
     and $h^*(x) = 1$ otherwise.
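The equal-covariance, equal-prior reduction above amounts to a nearest-mean rule in Mahalanobis distance. A minimal numpy sketch of that rule, with my own toy means and covariance rather than anything from the slides:

```python
import numpy as np

# Illustrative class means and shared covariance (assumed values).
mu = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)

def mahalanobis2(x, m):
    """Squared Mahalanobis distance (x - m)^T Sigma^{-1} (x - m)."""
    d = x - m
    return d @ Sigma_inv @ d

def h_star(x):
    """argmin_i (x - mu_i)^T Sigma^{-1} (x - mu_i): class 0 if it is closer to mu_0."""
    return 0 if mahalanobis2(x, mu[0]) < mahalanobis2(x, mu[1]) else 1

print(h_star(np.array([0.5, 0.2])), h_star(np.array([1.8, 0.9])))
```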

  6. The Linear Discriminant Function. The decision boundary is the set of points
     $(x - \mu_0)^T \Sigma^{-1} (x - \mu_0) = (x - \mu_1)^T \Sigma^{-1} (x - \mu_1)$
     which, after some algebra, becomes
     $2(\mu_1 - \mu_0)^T \Sigma^{-1} x + \mu_0^T \Sigma^{-1} \mu_0 - \mu_1^T \Sigma^{-1} \mu_1 = 0$
     This is the equation of the hyperplane $w^T x + b = 0$ with
     $w = 2\,\Sigma^{-1} (\mu_1 - \mu_0)$ and $b = \mu_0^T \Sigma^{-1} \mu_0 - \mu_1^T \Sigma^{-1} \mu_1$.
     This is a linear discriminant.
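The "some algebra" step can be checked numerically. The sketch below uses the same hypothetical μ0, μ1, Σ as in the previous snippet and verifies that $w^T x + b$ equals the difference of the two Mahalanobis quadratic forms at arbitrary points, so it vanishes exactly on the boundary.

```python
import numpy as np

mu0, mu1 = np.array([0.0, 0.0]), np.array([2.0, 1.0])   # illustrative values
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)

# Linear discriminant parameters from the slide.
w = 2 * Sigma_inv @ (mu1 - mu0)
b = mu0 @ Sigma_inv @ mu0 - mu1 @ Sigma_inv @ mu1

def quad_diff(x):
    """(x - mu0)^T S^{-1} (x - mu0) - (x - mu1)^T S^{-1} (x - mu1)."""
    return (x - mu0) @ Sigma_inv @ (x - mu0) - (x - mu1) @ Sigma_inv @ (x - mu1)

for x in [np.array([0.1, -0.3]), np.array([1.0, 0.5]), np.array([3.0, 2.0])]:
    # The linear form agrees with the quadratic difference at every point.
    print(np.isclose(w @ x + b, quad_diff(x)))
```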

  7. Linear Discriminants. The hyperplane equation can also be written as
     $w^T x + b = 0 \;\Leftrightarrow\; \frac{w^T}{\|w\|} x + \frac{b}{\|w\|} = 0
      \;\Leftrightarrow\; \frac{w^T}{\|w\|} (x - x_0) = 0$, with $x_0 = -\frac{b}{\|w\|^2}\, w$.
     Geometric interpretation:
     • hyperplane of normal w
     • the hyperplane passes through x_0
     • x_0 is the point of the hyperplane closest to the origin
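A quick numerical check of the two claims about x_0, using an arbitrary illustrative normal w and offset b (my own values, not from the slides): x_0 lies on the hyperplane, and its norm is the plane-to-origin distance |b|/||w||.

```python
import numpy as np

w = np.array([3.0, -1.0])   # any nonzero normal (illustrative)
b = 2.0

x0 = -b * w / np.dot(w, w)  # x_0 = -(b / ||w||^2) w

print(np.isclose(w @ x0 + b, 0.0))            # x0 lies on the hyperplane
print(np.isclose(np.linalg.norm(x0),          # and ||x0|| = |b| / ||w||
                 abs(b) / np.linalg.norm(w)))
```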

  8. Linear Discriminants. For the given model, the quadratic discriminant function
     $h^*(x) = 0$ if $(x - \mu_0)^T \Sigma^{-1} (x - \mu_0) < (x - \mu_1)^T \Sigma^{-1} (x - \mu_1)$,
     and $h^*(x) = 1$ if $(x - \mu_0)^T \Sigma^{-1} (x - \mu_0) > (x - \mu_1)^T \Sigma^{-1} (x - \mu_1)$,
     is equivalent to the linear discriminant function
     $h^*(x) = 0$ if $g(x) > 0$, and $h^*(x) = 1$ if $g(x) < 0$,
     where
     $g(x) = w^T (x - x_0) = \|w\|\,\|x - x_0\|\cos\theta$
     g(x) > 0 if x is on the side w points to ("w points to the positive side").
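A small check, again with an illustrative w and b of my own, that $w^T(x - x_0)$ coincides with $w^T x + b$ and that its sign identifies the side of the plane that w points to.

```python
import numpy as np

w, b = np.array([3.0, -1.0]), 2.0        # illustrative hyperplane
x0 = -b * w / np.dot(w, w)

def g(x):
    return w @ (x - x0)                  # equals w @ x + b, since w @ x0 = -b

x_pos = x0 + 0.5 * w                     # step from the plane along w: positive side
x_neg = x0 - 0.5 * w                     # step against w: negative side
print(np.isclose(g(x_pos), w @ x_pos + b), g(x_pos) > 0, g(x_neg) < 0)
```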

  9. Linear Discriminants. Finally, note that
     $\frac{g(x)}{\|w\|} = \frac{w^T}{\|w\|} (x - x_0)$
     is:
     • the projection of x - x_0 onto the unit vector in the direction of w
     • the length of the component of x - x_0 orthogonal to the plane
     I.e. g(x)/||w|| is the perpendicular distance from x to the plane. Similarly, |b|/||w|| is the
     distance from the plane to the origin, since $x_0 = -\frac{b}{\|w\|^2}\, w$.
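The signed-distance interpretation can also be verified directly: subtracting the normal component from x should land on the plane, at a distance |g(x)|/||w|| from x. A sketch with the same illustrative w and b:

```python
import numpy as np

w, b = np.array([3.0, -1.0]), 2.0
w_norm = np.linalg.norm(w)

x = np.array([1.0, 2.0])                     # an arbitrary test point
signed_dist = (w @ x + b) / w_norm           # g(x) / ||w||

# Perpendicular foot of x on the plane: remove the component along the unit normal.
foot = x - signed_dist * w / w_norm
print(np.isclose(w @ foot + b, 0.0))                            # foot lies on the plane
print(np.isclose(np.linalg.norm(x - foot), abs(signed_dist)))   # distance matches |g(x)|/||w||
```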

  10. Geometric Interpretation. Summarizing, the linear discriminant decision rule
      $h^*(x) = 0$ if $g(x) > 0$ and $h^*(x) = 1$ if $g(x) < 0$, with $g(x) = w^T x + b$,
      has the following properties:
      • It divides X into two "half-spaces".
      • The boundary is the hyperplane with normal w and distance |b|/||w|| to the origin.
      • g(x)/||w|| gives the signed distance from point x to the boundary.
      • g(x) = 0 for points on the plane.
      • g(x) > 0 for points on the side w points to ("positive side").
      • g(x) < 0 for points on the "negative side".

  11. The Linear Discriminant Function. When is it a good decision function? We have just seen that
      it is optimal for Gaussian classes having equal class probabilities and covariances. But this
      sounds too much like an artificial, toy problem. However, it is also optimal if the data is
      linearly separable, i.e., if there is a hyperplane which has
      • all "class 0" data on one side
      • all "class 1" data on the other
      Note: this holding on the training set only guarantees optimality in the minimum training
      error sense, not in the sense of minimizing the true risk.

  12. Linear Discriminants. For now, our goal is to explore the simplicity of the linear
      discriminant, so let's assume linear separability of the training data. One handy trick is to
      use class labels y ∈ {-1, 1} instead of y ∈ {0, 1}, where
      • y = 1 for points on the positive side
      • y = -1 for points on the negative side
      The decision function then becomes
      $h^*(x) = 1$ if $g(x) > 0$ and $h^*(x) = -1$ if $g(x) < 0$, i.e. $h^*(x) = \operatorname{sgn}[g(x)]$.
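With the ±1 label convention, the decision rule is literally a sign function. A tiny sketch (illustrative w, b, and test point of my own):

```python
import numpy as np

w, b = np.array([3.0, -1.0]), 2.0   # illustrative hyperplane

def h_star(x):
    """h*(x) = sgn(g(x)), with labels y in {-1, +1}."""
    return int(np.sign(w @ x + b))

# A correct classification is one where the predicted sign matches the label y.
x, y = np.array([1.0, 0.0]), +1
print(h_star(x) == y)
```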

  13. Linear Discriminants & Separable Data We have a classification error if We have a classification error if • y = 1 and g(x) < 0 or y = -1 and g(x) > 0 • i.e., if yg(x) < 0 yg(x) < 0 • i e if We have a correct classification if y = 1 and g(x) > 0 y 1 and g(x) > 0 or y 1 and g(x) < 0 y = -1 and g(x) < 0 • or • i.e., if yg(x) > 0 Note that if the data is linearly separable given a training set Note that, if the data is linearly separable, given a training set D = {( x 1 ,y 1 ) , ... , ( x n ,y n )} we can have zero training error. The necessary & sufficient condition for this is that ( ) + > ∀ = T 0, 1, ···, y w x b i n i i 13

  14. The Margin. The margin is the distance from the boundary to the closest point:
      $\gamma = \min_i \frac{|w^T x_i + b|}{\|w\|}$
      There will be no error on the training set if it is strictly greater than zero:
      $y_i (w^T x_i + b) > 0 \;\; \forall i \;\Leftrightarrow\; \gamma > 0$
      Note that this is ill-defined in the sense that γ does not change if both w and b are scaled
      by a common scalar λ. We need a normalization.
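The scale invariance that motivates the normalization can be seen numerically: rescaling (w, b) by any common factor leaves γ unchanged. A sketch reusing the same toy data and candidate hyperplane as above (illustrative values):

```python
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]])
w, b = np.array([1.0, 1.0]), 0.0

def margin(w, b):
    """gamma = min_i |w^T x_i + b| / ||w||."""
    return np.min(np.abs(X @ w + b)) / np.linalg.norm(w)

lam = 7.3                                        # arbitrary common scaling
print(margin(w, b), margin(lam * w, lam * b))    # identical: gamma is scale-invariant
```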

  15. Support Vector Machine (SVM). A convenient normalization is to make |g(x)| = 1 for the closest
      point, i.e.
      $\min_i |w^T x_i + b| \equiv 1$
      under which γ = 1/||w||. The Support Vector Machine (SVM) is the linear discriminant
      classifier that maximizes the margin subject to these constraints:
      $\min_{w,b} \|w\|^2 \quad \text{subject to} \quad y_i (w^T x_i + b) \geq 1 \;\; \forall i$
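The slides pose the hard-margin problem directly. As a stand-in sketch, one can approximate it on separable toy data with scikit-learn's soft-margin linear SVC and a very large C; the choice of C, the toy data, and the use of SVC here are my own assumptions, not part of the slides. The checks at the end confirm the normalization (closest points have |w^T x_i + b| ≈ 1) and the resulting margin γ = 1/||w||.

```python
import numpy as np
from sklearn.svm import SVC

# Toy separable data reused from the earlier sketches (illustrative).
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([+1, +1, -1, -1])

# A very large C makes the soft-margin SVC behave, on separable data, like the
# hard-margin problem: min ||w||^2 subject to y_i (w^T x_i + b) >= 1.
clf = SVC(kernel="linear", C=1e6).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]

margins = y * (X @ w + b)
print(margins.min())               # ~1: the closest points meet the constraint with equality
print(1.0 / np.linalg.norm(w))     # the maximized margin gamma = 1 / ||w||
```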
