
About this class / Maximizing the Margin / Maximum margin classifiers



  1. About this class / Maximizing the Margin
     Outline:
     - Maximum margin classifiers
     - Picture of large and small margin hyperplanes
     - SVMs: geometric derivation of the primal problem
     - Statement of the dual problem
     - The "kernel trick"
     - SVMs as the solution to a regularization problem
     Maximizing the margin:
     - Intuition: the large margin condition acts as a regularizer and should generalize better.
     - The Support Vector Machine (SVM) makes this formal. Not only that, it is amenable to the kernel trick, which will allow us to get much greater representational power!

  2. Deriving the SVM
     (Derivation based on Ryan Rifkin's slides in MIT 9.520 from Spring 2003)
     - Assume we classify a point x as sgn(w · x).
     - Let x be a datapoint on the margin, and z the point on the separating hyperplane closest to x. We want to maximize ||x − z||.
     - For some k (assumed positive):
       w · x = k,   w · z = 0   ⇒   w · (x − z) = k
     - Since x − z is parallel to w (both are perpendicular to the separating hyperplane):
       k = w · (x − z)   ⇒   k = ||w|| ||x − z||   ⇒   ||x − z|| = k / ||w||
     - So maximizing ||x − z|| is equivalent to minimizing ||w||. We can fix k = 1 (this is just a rescaling).
     - Now we have an optimization problem:
       min_{w ∈ R^n} ||w||^2   subject to:   y_i (w · x_i) ≥ 1,  i = 1, ..., l
     - Can be solved using quadratic programming.
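As a rough illustration (not from the slides), here is a minimal sketch that hands the primal problem, min ||w||^2 subject to y_i (w · x_i) ≥ 1, to a generic constrained optimizer. The toy data and the choice of SciPy's SLSQP solver are my own assumptions.

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (hyperplane through the origin, as on the slide).
X = np.array([[2.0, 2.0], [1.5, 2.5], [-2.0, -1.0], [-1.0, -2.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Primal problem: minimize ||w||^2 subject to y_i (w . x_i) >= 1 for all i.
objective = lambda w: w @ w
constraints = [{"type": "ineq", "fun": lambda w, i=i: y[i] * (w @ X[i]) - 1.0}
               for i in range(len(y))]

res = minimize(objective, x0=np.ones(2), method="SLSQP", constraints=constraints)
w = res.x
print("w =", w)
# Margin width is 2 / ||w||; points with y_i (w . x_i) == 1 lie on the margin.
print("margin =", 2.0 / np.linalg.norm(w))
print("y_i (w . x_i) =", y * (X @ w))
```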

  3. When a Separating Hyperplane Does Not Exist
     - We introduce slack variables ξ_i. The new optimization problem becomes:
       min_{w ∈ R^n, ξ ∈ R^l}  C Σ_{i=1}^l ξ_i + (1/2) ||w||^2
       subject to:  y_i (w · x_i + b) ≥ 1 − ξ_i,  i = 1, ..., l
                    ξ_i ≥ 0,  i = 1, ..., l
     - Now we are trading the error off against the margin. Think about this expression in terms of training set error and inductive bias!
     - Typically we also use a bias term b to shift the hyperplane around (so it doesn't have to pass through the origin). Now f(x) = sgn(w · x + b).
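As a hedged sketch (mine, not from the slides): this soft-margin problem is what scikit-learn's SVC solves for a linear kernel, and the slack variables can be read off from the fitted (w, b) as ξ_i = max(0, 1 − y_i (w · x_i + b)). The toy data and C = 1.0 are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

# Toy data that is NOT linearly separable: one point of each class sits on the wrong side.
X = np.array([[2.0, 2.0], [1.5, 2.5], [-1.0, -1.5],
              [-2.0, -1.0], [-1.0, -2.5], [1.5, 2.0]])
y = np.array([1, 1, 1, -1, -1, -1])

# Soft-margin SVM: min_{w,b,xi}  C * sum(xi_i) + 0.5 * ||w||^2
#                  s.t. y_i (w . x_i + b) >= 1 - xi_i,  xi_i >= 0.
clf = SVC(kernel="linear", C=1.0).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]

# Slack variables: how far each point falls short of the margin requirement.
xi = np.maximum(0.0, 1.0 - y * (X @ w + b))
print("w =", w, "b =", b)
print("slacks xi =", xi)
print("objective =", 1.0 * xi.sum() + 0.5 * (w @ w))
```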

  4. The Dual Formulation
       max_{α ∈ R^l}  Σ_{i=1}^l α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j (x_i · x_j)
       subject to:  Σ_{i=1}^l y_i α_i = 0
                    0 ≤ α_i ≤ C,  i = 1, ..., l
     - This allows for a more efficient solution of the QP than we could get otherwise.
     - The hypothesis is then:
       f(x) = sgn( Σ_{i=1}^l α_i y_i (x · x_i) )
     - Sparsity: it turns out that (with f(x_i) here denoting the real-valued sum before taking the sign):
       y_i f(x_i) > 1  ⇒  α_i = 0
       y_i f(x_i) < 1  ⇒  α_i = C
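A small sketch (my own construction) of the dual: minimize the negated objective under the box and equality constraints with a generic solver, then recover w = Σ α_i y_i x_i and look at which α_i are nonzero. The data, the value of C, and the use of SciPy's SLSQP are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import minimize

# Toy data and box constant C (both illustrative).
X = np.array([[2.0, 2.0], [1.5, 2.5], [-2.0, -1.0], [-1.0, -2.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
C = 10.0
l = len(y)

# Q_ij = y_i y_j (x_i . x_j); the dual objective (to *minimize*) is
#   0.5 * a^T Q a - sum(a),   s.t.  sum(y_i a_i) = 0,  0 <= a_i <= C.
Q = (y[:, None] * y[None, :]) * (X @ X.T)
dual = lambda a: 0.5 * a @ Q @ a - a.sum()

res = minimize(dual, x0=np.full(l, 1e-3), method="SLSQP",
               bounds=[(0.0, C)] * l,
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])
alpha = res.x

# Sparsity: only points on or inside the margin end up with alpha_i > 0.
print("alpha =", np.round(alpha, 4))

# Recover the weight vector and the hypothesis f(x) = sgn(sum_i alpha_i y_i (x . x_i)).
w = (alpha * y) @ X
print("w =", w)
print("predictions:", np.sign(X @ w))
```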

  5. The Kernel Trick
     - The really nice thing: the optimization depends only on the dot product between examples.
     - Now suppose we go from the representation x = <x_1, x_2> to the representation F(x) = <x_1^2, x_2^2, √2 x_1 x_2>. (An example from Russell & Norvig.)
     - Now F(x_i) · F(x_j) = (x_i · x_j)^2.
     - We don't need to compute the actual feature representation in the higher-dimensional space, because of Mercer's theorem. For a Mercer kernel K, the dot product of F(x_i) and F(x_j) is given by K(x_i, x_j).
     - What is a Mercer kernel? Continuous, symmetric, and positive definite.
     [Figure: the data plotted in the original (x_1, x_2) space and again in the three-dimensional feature space (x_1^2, x_2^2, √2 x_1 x_2)]
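A quick numeric check of the identity on this slide: compute F(x_i) · F(x_j) with the explicit feature map and compare it with (x_i · x_j)^2 evaluated in the original space. The two sample points are arbitrary.

```python
import numpy as np

# Explicit feature map F(x) = <x1^2, x2^2, sqrt(2) x1 x2> from the slide.
def feature_map(x):
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2])

xi = np.array([0.8, -1.3])   # two arbitrary 2-D points
xj = np.array([2.0, 0.5])

lhs = feature_map(xi) @ feature_map(xj)   # dot product in the 3-D feature space
rhs = (xi @ xj) ** 2                      # kernel evaluated in the original 2-D space
print(lhs, rhs)                           # the two numbers agree (up to float rounding)
```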

  6. Positive definiteness: for any m-size subset of the input space, the matrix K with K_ij = K(X_i, X_j) is positive definite. Remember positive definiteness: for all non-zero vectors z, z^T K z > 0.
     - This allows us to work with very high-dimensional spaces! Examples (see the sketch after this list):
       1. Polynomial: K(X_i, X_j) = (1 + x_i · x_j)^d (the feature space is exponential in d!)
       2. Gaussian: K(X_i, X_j) = e^{−||x_i − x_j||^2 / (2σ^2)} (an infinite-dimensional feature space!)
       3. String kernels, protein kernels!
     - How do we choose which kernel and which λ to use? (The first could be harder!)
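A small sketch (mine, not from the slides) that builds the polynomial and Gaussian Gram matrices for a random sample and checks their eigenvalues; for a Mercer kernel the smallest eigenvalue should be nonnegative up to rounding. The sample size, degree d = 3, and σ = 1 are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))   # a small random sample of 2-D inputs

def polynomial_gram(X, d=3):
    # K(x_i, x_j) = (1 + x_i . x_j)^d
    return (1.0 + X @ X.T) ** d

def gaussian_gram(X, sigma=1.0):
    # K(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 sigma^2))
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

for name, K in [("polynomial", polynomial_gram(X)), ("gaussian", gaussian_gram(X))]:
    eigvals = np.linalg.eigvalsh(K)             # K is symmetric, so eigvalsh is enough
    print(name, "min eigenvalue:", eigvals.min())   # >= 0 (up to rounding) for a Mercer kernel
```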

  7. Selecting the Best Hypothesis
     (Based on notes from Poggio, Mukherjee and Rifkin)
     - Define the performance of a hypothesis by a loss function V.
     - Commonly used for regression: V(f(x), y) = (f(x) − y)^2. Could use the absolute value: V(f(x), y) = |f(x) − y|.
     - What about classification? The 0-1 loss: V(f(x), y) = I[y ≠ f(x)]. The hinge loss: V(f(x), y) = (1 − y·f(x))_+.
     - Hypothesis space H: the space of functions that we search.
     - Expected error of a hypothesis: the expected error on a sample drawn from the underlying (unknown) distribution,
       I[f] = ∫ V(f(x), y) dµ(x, y)
       In discrete terms we would replace the integral with a sum and µ with P.
     - Empirical error, or empirical risk, is the average loss over the training set:
       I_S[f] = (1/l) Σ_{i=1}^l V(f(x_i), y_i)
     - Empirical risk minimization (ERM): find the hypothesis in the hypothesis space that minimizes the empirical risk,
       min_{f ∈ H} (1/l) Σ_{i=1}^l V(f(x_i), y_i)
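To make the definitions concrete, here is a minimal sketch (my own) of the four losses and of the empirical risk I_S[f] as an average over a training sample; the scores and labels are made up.

```python
import numpy as np

# The four losses from the slide, for a single prediction f(x) and label y.
def squared_loss(fx, y):  return (fx - y) ** 2
def absolute_loss(fx, y): return abs(fx - y)
def zero_one_loss(fx, y): return float(np.sign(fx) != y)     # classification, y in {-1, +1}
def hinge_loss(fx, y):    return max(0.0, 1.0 - y * fx)      # (1 - y f(x))_+

def empirical_risk(loss, fx_values, y_values):
    # I_S[f] = (1/l) * sum_i V(f(x_i), y_i)
    return np.mean([loss(fx, y) for fx, y in zip(fx_values, y_values)])

# Illustrative real-valued scores f(x_i) and labels y_i in {-1, +1}.
fx = np.array([1.7, 0.3, -0.4, -2.1])
y  = np.array([1,   1,    1,   -1])
for loss in (zero_one_loss, hinge_loss):
    print(loss.__name__, empirical_risk(loss, fx, y))
```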

  8. For most hypothesis spaces, ERM is an ill-posed problem. A problem is ill-posed if it is not well-posed; a problem is well-posed if its solution exists, is unique, and depends continuously on the data.
     - Regularization restores well-posedness. Ivanov regularization directly constrains the hypothesis space, and Tikhonov regularization imposes a penalty on hypothesis complexity.
     - Ivanov regularization:
       min_{f ∈ H} (1/l) Σ_{i=1}^l V(f(x_i), y_i)   subject to   ω(f) ≤ τ
     - Tikhonov regularization:
       min_{f ∈ H} (1/l) Σ_{i=1}^l V(f(x_i), y_i) + λ ω(f)
     - ω is the regularization or smoothness functional. The mathematical machinery for defining it is complex, and we won't get into it much more, but the interesting thing is that if we use the hinge loss and the linear kernel, the SVM comes out of solving the Tikhonov regularization problem!
     - Meaning of using an unregularized bias term? We punish function complexity but not an arbitrary translation of the origin. However, in the case of SVMs, the answer will end up being different if we add a fictional "1" to each example, because now we punish the weight we put on it!
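A rough sketch (not from the slides) of the Tikhonov problem with the hinge loss and a linear hypothesis f(x) = w · x, minimized by plain subgradient descent. Comparing objectives, λ here roughly plays the role of 1/(2Cl) in the earlier soft-margin formulation; the data, λ, step size, and iteration count are all illustrative choices.

```python
import numpy as np

# Tikhonov regularization with the hinge loss and a linear hypothesis f(x) = w . x:
#   min_w  (1/l) * sum_i (1 - y_i (w . x_i))_+  +  lam * ||w||^2
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 2)) + np.array([1.5, 1.5])
X[20:] -= 3.0                       # shift half the points to make two classes
y = np.concatenate([np.ones(20), -np.ones(20)])
lam, step, l = 0.01, 0.1, len(y)

w = np.zeros(2)
for _ in range(500):
    margins = y * (X @ w)
    active = margins < 1.0          # points whose hinge loss is nonzero
    # Subgradient: -y_i x_i for active points, plus the gradient of lam * ||w||^2.
    grad = -(y[active, None] * X[active]).sum(axis=0) / l + 2.0 * lam * w
    w -= step * grad

print("w =", w)
print("training accuracy:", np.mean(np.sign(X @ w) == y))
```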

  9. Generalization Bounds
     Important concepts of error:
     1. Sample (estimation) error: the difference between the hypothesis we find in H and the best hypothesis in H.
     2. Approximation error: the difference between the best hypothesis in H and the true function in some other space T.
     3. Generalization error: the difference between the hypothesis we find in H and the true function in T, which is the sum of the two above.
     Tradeoff: making H bigger makes the approximation error smaller, but the estimation error larger (see the sketch below).
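A rough illustration of this tradeoff (mine, not from the slides): fit polynomials of increasing degree, i.e. a growing hypothesis space H, to noisy samples of a fixed function, and compare the empirical error with the error on a large held-out sample. All the data and degrees are made up.

```python
import numpy as np

rng = np.random.default_rng(2)
true_f = lambda x: np.sin(2.0 * x)          # the "true function" living outside H

x_train = rng.uniform(-2, 2, size=15)
y_train = true_f(x_train) + 0.2 * rng.normal(size=x_train.size)
x_test = np.linspace(-2, 2, 1000)
y_test = true_f(x_test)

for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    # Larger H: the empirical (training) error keeps falling, but the held-out error
    # eventually rises as the estimation error dominates.
    print(f"degree {degree:2d}: train {train_err:.3f}  test {test_err:.3f}")
```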
