Support Vector Machines
Léon Bottou
COS 424 – 4/1/2010
Agenda

Goals: classification, clustering, regression, other.

Representation:
– parametric vs. kernels vs. nonparametric
– probabilistic vs. nonprobabilistic
– linear vs. nonlinear
– deep vs. shallow

Capacity control:
– explicit: architecture, feature selection; regularization, priors
– implicit: approximate optimization; Bayesian averaging, ensembles

Operational considerations: loss functions, budget constraints, online vs. offline.

Computational considerations:
– exact algorithms for small datasets
– stochastic algorithms for big datasets
– parallel algorithms
Summary

1. Maximizing margins.
2. Soft margins.
3. Kernels.
4. Kernels everywhere.
The curse of dimensionality

Polynomial classifiers in dimension $d$.
Discriminant function: $f(x) = w^\top \Phi(x) + b$.

– degree 1: $\Phi(x) = [x_i]_{1 \le i \le d}$, dimension $d$
– degree 2: add $[x_i x_j]_{1 \le i \le j \le d}$, dimension $\approx d^2/2$
– degree 3: add $[x_i x_j x_k]_{1 \le i \le j \le k \le d}$, dimension $\approx d^3/6$
– degree $n$: dimension $\approx d^n/n!$

The number of parameters increases quickly. Training such a classifier directly requires a number of examples that increases just as quickly as the number of parameters.
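To get a feel for the growth, here is a quick illustrative Python sketch (not from the slides) counting the monomials of each degree; $d = 784$ is an arbitrary image-sized example:

    from math import comb

    def poly_feature_dim(d, degree):
        # There are C(d + k - 1, k) monomials of degree exactly k
        # in d variables; sum them up to the requested degree.
        return sum(comb(d + k - 1, k) for k in range(1, degree + 1))

    for degree in (1, 2, 3):
        print(degree, poly_feature_dim(784, degree))
    # degree 1 ->      784
    # degree 2 ->   308504  (~ d^2 / 2)
    # degree 3 -> 80931144  (~ d^3 / 6)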
Beating the curse of dimensionality?

Capacity ≪ number of parameters.

Assume the patterns $x_1 \dots x_{2l}$ are known beforehand; only the classes are unknown. Let $R = \max_i \|x_i\|$.

We say that a hyperplane $w^\top x + b$, with $w, x \in \mathbb{R}^d$ and $\|w\| = 1$, separates the patterns with margin $\Delta$ if
$|w^\top x_i + b| \ge \Delta$ for all $i = 1 \dots 2l$.

The family $\mathcal{F}$ of $\Delta$-margin separating hyperplanes satisfies
$\log N(\mathcal{F}, D) \le h \log \frac{2le}{h}$ with $h \le \min\left( \frac{R^2}{\Delta^2},\, d \right) + 1$.
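A throwaway numeric illustration of the capacity term (all values made up): with radius-1 data and margin 0.1, $h$ is controlled by the margin, not by the dimension.

    def capacity_bound(R, delta, d):
        # h <= min(R^2 / Delta^2, d) + 1
        return min((R / delta) ** 2, d) + 1

    # Hypothetical numbers: data of radius 1 in a million dimensions,
    # separated with margin 0.1: h is at most 101, not a million.
    print(capacity_bound(R=1.0, delta=0.1, d=10**6))   # 101.0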
Maximizing margins

Patterns $x_i \in \mathbb{R}^d$, classes $y_i = \pm 1$.

$\max_{w,b,\Delta} \Delta$ subject to $\|w\| = 1$ and $\forall i \; y_i(w^\top x_i + b) \ge \Delta$.

[Figure: separating hyperplane with normal $w$ and a margin band of width $2\Delta$]
Maximizing margins

Classic formulation:

$\min_{w,b} \|w\|^2$ subject to $\forall i \; y_i(w^\top x_i + b) \ge 1$.

[Figure: separating hyperplane flanked by the margin hyperplanes $w^\top x + b = +1$ and $w^\top x + b = -1$]

This is a quadratic programming problem with linear constraints.
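As a sketch of how directly this maps onto an off-the-shelf QP solver (assuming cvxpy is installed; the four toy points are made up):

    import numpy as np
    import cvxpy as cp

    # Toy linearly separable data: two points per class.
    X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
    y = np.array([1.0, 1.0, -1.0, -1.0])

    w = cp.Variable(2)
    b = cp.Variable()
    # min ||w||^2  subject to  y_i (w^T x_i + b) >= 1
    prob = cp.Problem(cp.Minimize(cp.sum_squares(w)),
                      [cp.multiply(y, X @ w + b) >= 1])
    prob.solve()
    print(w.value, b.value)   # maximal-margin separating hyperplane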
Maximizing margins

Equivalence between the formulations:

Let $w' = w/\Delta$ and $b' = b/\Delta$.
– The constraint $y_i(w^\top x_i + b) \ge \Delta$ becomes $y_i(w'^\top x_i + b') \ge 1$.
– The problem $\max_{w,b,\Delta} \Delta$ subject to $\|w\| = 1$ becomes $\min_{w',b'} \|w'\|$.

Both discriminant functions $w^\top x + b$ and $w'^\top x + b'$ describe the same decision boundary.
Primal and dual formulation

Karush-Kuhn-Tucker theory:
– Refined theory for convex optimization under constraints.
– Construct a dual optimization problem whose constraints are simpler, and whose solution is related to the solution we seek.

Primal formulation: maximize the margin between the classes.
Dual formulation: minimize the distance between the convex hulls of the classes.

[Figure: the two convex hulls, with A and B the closest points of the positive and negative hulls]
Dual formulation

Minimum distance between the convex hulls:
– Point A: $\sum_{i \in \mathrm{Pos}} \beta_i x_i$ subject to $\beta_i \ge 0$ and $\sum_{i \in \mathrm{Pos}} \beta_i = 1$.
– Point B: $\sum_{i \in \mathrm{Neg}} \beta_i x_i$ subject to $\beta_i \ge 0$ and $\sum_{i \in \mathrm{Neg}} \beta_i = 1$.
– Vector BA: $\sum_i y_i \beta_i x_i$ subject to $\beta_i \ge 0$, $\sum_i \beta_i = 2$, and $\sum_i y_i \beta_i = 0$.
Dual formulation

Minimize the squared distance between the hulls:

$\min_\beta \sum_{ij} y_i y_j \beta_i \beta_j \, x_i^\top x_j$ subject to $\forall i \; \beta_i \ge 0$, $\sum_i y_i \beta_i = 0$, and $\sum_i \beta_i = 2$.

Then $w = \sum_i y_i \beta_i x_i$, and $b$ is easy to find by projecting all examples on $w$.
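One way to make the geometry concrete: hand the $\beta$ problem to a generic constrained optimizer. A sketch using scipy's SLSQP on made-up toy data:

    import numpy as np
    from scipy.optimize import minimize

    X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
    y = np.array([1.0, 1.0, -1.0, -1.0])
    n = len(y)

    def objective(beta):
        # || sum_i y_i beta_i x_i ||^2 = sum_ij y_i y_j beta_i beta_j x_i^T x_j
        v = (y * beta) @ X
        return v @ v

    constraints = [
        {"type": "eq", "fun": lambda b: y @ b},        # sum_i y_i beta_i = 0
        {"type": "eq", "fun": lambda b: b.sum() - 2},  # sum_i beta_i = 2
    ]
    res = minimize(objective, x0=np.full(n, 0.5), method="SLSQP",
                   bounds=[(0, None)] * n, constraints=constraints)
    beta = res.x
    w = (y * beta) @ X    # vector BA, normal to the max-margin hyperplane
    print(beta, w)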
Dual formulation

Classic formulation:

$\max_\alpha \sum_i \alpha_i - \frac{1}{2} \sum_{ij} y_i y_j \alpha_i \alpha_j \, x_i^\top x_j$ subject to $\forall i \; \alpha_i \ge 0$ and $\sum_i y_i \alpha_i = 0$.

This is equivalent to the minimum-distance problem, with $\alpha_i = \beta_i \Delta^{-2}$, but the proof is nontrivial.
Support Vector Machines

$\min_\beta \sum_{ij} y_i y_j \beta_i \beta_j \, x_i^\top x_j$ subject to $\forall i \; \beta_i \ge 0$, $\sum_i y_i \beta_i = 0$, and $\sum_i \beta_i = 2$.

The only nonzero $\beta_i$ are those corresponding to support vectors.
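scikit-learn exposes these quantities after fitting; a sketch on synthetic Gaussian blobs, with a huge C standing in for the hard-margin case:

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(3, 1, (50, 2)), rng.normal(-3, 1, (50, 2))])
    y = np.array([1] * 50 + [-1] * 50)

    clf = SVC(kernel="linear", C=1e6).fit(X, y)   # huge C ~ hard margin
    print(clf.support_)      # indices of the support vectors
    print(clf.n_support_)    # count per class: typically a small fraction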
Leave-One-Out

Leave one out = $n$-fold cross-validation:
– Compute classifiers $f_i$ using the training set minus example $(x_i, y_i)$.
– Estimate the test misclassification rate as $E_{\mathrm{LOO}} = \frac{1}{n} \sum_{i=1}^n \mathbb{1}\{ y_i f_i(x_i) \le 0 \}$.

Leave one out for the maximal margin classifier:
– Removing a non support vector does not change the classifier, hence
$E_{\mathrm{LOO}} \le \dfrac{\#\text{support vectors}}{\#\text{examples}}$.
– The important quantity is not the dimension but the number of support vectors.
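Once a classifier is fitted, the bound is one line; continuing the scikit-learn sketch on made-up overlapping blobs:

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(1.5, 1, (100, 2)),
                   rng.normal(-1.5, 1, (100, 2))])
    y = np.array([1] * 100 + [-1] * 100)

    clf = SVC(kernel="linear", C=10.0).fit(X, y)
    loo_bound = clf.n_support_.sum() / len(y)
    print(f"E_LOO <= {loo_bound:.2f}")   # #support vectors / #examples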
Soft margins

When the examples are not linearly separable, the constraints $y_i(w^\top x_i + b) \ge 1$ cannot be satisfied. Adding slack variables $\xi_i$:

$\min_{w,b,\xi} \|w\|^2 + C \sum_{i=1}^n \xi_i$ subject to $\forall i \; y_i(w^\top x_i + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$.

The parameter $C$ controls the relative importance of:
– correctly classifying all the training examples,
– obtaining the separation with the largest margin.

Reduces to hard margins when $C = \infty$.
Soft margins and hinge loss

The soft margin problem

$\min_{w,b,\xi} \|w\|^2 + C \sum_{i=1}^n \xi_i$ subject to $\forall i \; y_i(w^\top x_i + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$

is the same thing as the unconstrained problem

$\min_{w,b} \|w\|^2 + C \sum_{i=1}^n \ell(y_i(w^\top x_i + b))$ with $\ell(z) = \max(0, 1 - z)$.
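This unconstrained form is what gradient-based solvers attack. A minimal subgradient-descent sketch (step size and iteration count chosen arbitrarily):

    import numpy as np

    def hinge_objective(w, b, X, y, C):
        margins = y * (X @ w + b)
        return w @ w + C * np.maximum(0, 1 - margins).sum()

    def subgradient_step(w, b, X, y, C, lr=1e-3):
        margins = y * (X @ w + b)
        active = margins < 1                 # examples with nonzero loss
        grad_w = 2 * w - C * (y[active][:, None] * X[active]).sum(axis=0)
        grad_b = -C * y[active].sum()
        return w - lr * grad_w, b - lr * grad_b

    rng = np.random.default_rng(2)
    X = np.vstack([rng.normal(2, 1, (50, 2)), rng.normal(-2, 1, (50, 2))])
    y = np.array([1.0] * 50 + [-1.0] * 50)
    w, b = np.zeros(2), 0.0
    for _ in range(1000):
        w, b = subgradient_step(w, b, X, y, C=1.0)
    print(hinge_objective(w, b, X, y, C=1.0))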
Soft Margins

Primal formulation:

$\min_{w,b,\xi} \|w\|^2 + C \sum_{i=1}^n \xi_i$ subject to $\forall i \; y_i(w^\top x_i + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$.

Dual formulation:

$\max_\alpha \sum_i \alpha_i - \frac{1}{2} \sum_{ij} y_i y_j \alpha_i \alpha_j \, x_i^\top x_j$ subject to $\forall i \; 0 \le \alpha_i \le C$ and $\sum_i y_i \alpha_i = 0$.

The primal and dual solutions obey the relation $w = \sum_{i=1}^n y_i \alpha_i x_i$.
The threshold $b$ is easy to find once $w$ is known.
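In scikit-learn, dual_coef_ stores the products $y_i \alpha_i$ for the support vectors, so the primal-dual relation can be checked directly:

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(3)
    X = np.vstack([rng.normal(2, 1, (50, 2)), rng.normal(-2, 1, (50, 2))])
    y = np.array([1] * 50 + [-1] * 50)

    clf = SVC(kernel="linear", C=1.0).fit(X, y)
    # dual_coef_ holds y_i * alpha_i for the support vectors only.
    w = clf.dual_coef_ @ clf.support_vectors_
    print(np.allclose(w, clf.coef_))   # True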
Soft Margins

[Figure: examples positioned relative to the margin, labeled by their multipliers: $\alpha_i = 0$ for points beyond the margin, $0 < \alpha_i < C$ for points on the margin, and $\alpha_i = C$ for margin violators with slack $\xi_i > 0$]
Beyond linear separation

Reintroducing $\Phi(x)$:
– Define $K(x, v) = \Phi(x)^\top \Phi(v)$.
– Dual optimization problem:
$\max_\alpha \sum_i \alpha_i - \frac{1}{2} \sum_{ij} y_i y_j \alpha_i \alpha_j \, K(x_i, x_j)$ subject to $\forall i \; 0 \le \alpha_i \le C$ and $\sum_i y_i \alpha_i = 0$.
– Discriminant function:
$f(x) = w^\top \Phi(x) + b = \sum_{i=1}^n y_i \alpha_i K(x_i, x) + b$.

Curious fact:
– We do not really need to compute $\Phi(x)$.
– The dot products $K(x, v) = \Phi(x)^\top \Phi(v)$ are enough.
– Can we take advantage of this?
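To see that the $\alpha$'s and the kernel values are all we need, one can recompute a fitted model's decision function by hand. A sketch assuming scikit-learn, whose polynomial kernel is $(\gamma\, x^\top v + c_0)^{\text{degree}}$:

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(4)
    X = rng.normal(0, 1, (100, 2))
    y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)   # not linearly separable

    clf = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0, C=1.0).fit(X, y)

    def decision(x):
        # f(x) = sum_i y_i alpha_i K(x_i, x) + b, over support vectors only
        K = (1.0 * clf.support_vectors_ @ x + 1.0) ** 2
        return clf.dual_coef_ @ K + clf.intercept_

    x_test = np.array([0.5, 0.7])
    print(decision(x_test), clf.decision_function([x_test]))  # should match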
Quadratic Kernel

Quadratic basis:

$\Phi(x) = \left( [x_i]_i,\; [x_i^2]_i,\; [\sqrt{2}\, x_i x_j]_{i<j} \right)$

Dot product:

$\Phi(x)^\top \Phi(v) = \sum_i x_i v_i + \sum_i x_i^2 v_i^2 + 2 \sum_{i<j} x_i v_i\, x_j v_j$

– Are there $d(d+3)/2$ terms to add?
Quadratic Kernel

Completing the dot product computation:

$\Phi(x)^\top \Phi(v) = \sum_i x_i v_i + \sum_i x_i^2 v_i^2 + 2 \sum_{i<j} x_i v_i\, x_j v_j$
$\quad = \sum_i x_i v_i + \sum_{i,j} x_i v_i\, x_j v_j$
$\quad = \sum_i x_i v_i + \left( \sum_i x_i v_i \right)^2 = (x^\top v) + (x^\top v)^2$

– There are only $d$ terms to add!
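The identity is easy to check numerically; a small numpy sketch with random vectors:

    import numpy as np

    rng = np.random.default_rng(5)
    d = 10
    x, v = rng.normal(size=d), rng.normal(size=d)

    def phi(x):
        # ( x_i, x_i^2, sqrt(2) x_i x_j for i < j )
        i, j = np.triu_indices(len(x), k=1)
        return np.concatenate([x, x**2, np.sqrt(2) * x[i] * x[j]])

    explicit = phi(x) @ phi(v)            # d(d+3)/2 = 65 terms
    kernel = (x @ v) + (x @ v) ** 2       # d = 10 terms
    print(np.isclose(explicit, kernel))   # True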
Polynomial kernel

– degree 1: $K(x,v) = (x^\top v)$, dimension $d$
– degree 2: $K(x,v) = (x^\top v) + (x^\top v)^2$, dimension $\approx d^2/2$
– degree 3: $K(x,v) = (x^\top v) + (x^\top v)^2 + (x^\top v)^3$, dimension $\approx d^3/6$
– degree $n$: $K(x,v) = (1 + x^\top v)^n$, dimension $\approx d^n/n!$

The number of parameters increases exponentially, but the total computation remains nearly constant.
[Figure: decision boundary with a linear kernel]
[Figure: decision boundary with a quadratic kernel]
[Figure: decision boundary with a polynomial kernel of degree 3]
[Figure: decision boundary with a polynomial kernel of degree 5]
Polynomial kernels and more

Weighted polynomial kernel: $K_d(x, v) = \sum_{i=0}^{d} \frac{\gamma^i}{i!} (x^\top v)^i$.
– This is a polynomial kernel.
– The coefficient $\gamma$ controls the relative importance of the terms of various degrees.
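A direct transcription in Python ($\gamma$ and the cutoff degree $d$ are free choices):

    import numpy as np
    from math import factorial

    def weighted_poly_kernel(x, v, gamma=1.0, degree=3):
        # K_d(x, v) = sum_{i=0}^{d} gamma^i / i! * (x^T v)^i
        s = x @ v
        return sum(gamma**i / factorial(i) * s**i for i in range(degree + 1))

    x, v = np.array([1.0, 2.0]), np.array([0.5, -1.0])
    print(weighted_poly_kernel(x, v, gamma=0.5, degree=4))

As the cutoff degree grows, the sum converges to $\exp(\gamma\, x^\top v)$, a kernel that dispenses with any finite-dimensional $\Phi$ altogether.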