Natural Language Processing and Information Retrieval: Support Vector Machines
Alessandro Moschitti
Department of Information and Communication Technology, University of Trento
Email: moschitti@disi.unitn.it
Summary: Support Vector Machines; Hard-margin SVMs; Soft-margin SVMs
Which hyperplane to choose?
Classifier with a Maximum Margin
IDEA 1: Select the hyperplane with the maximum margin.
[Figure: two classes plotted on axes Var 1 / Var 2, with the margin shown on both sides of the separating hyperplane.]
Support Vectors
[Figure: the support vectors are the points lying on the margin boundaries (axes Var 1 / Var 2).]
Support Vector Machine Classifiers
The margin is equal to $\frac{2k}{\|\vec{w}\|}$.
[Figure: the hyperplane $\vec{w} \cdot \vec{x} + b = 0$ with the two margin boundaries $\vec{w} \cdot \vec{x} + b = k$ and $\vec{w} \cdot \vec{x} + b = -k$ (axes Var 1 / Var 2).]
Support Vector Machines
The margin is equal to $\frac{2k}{\|\vec{w}\|}$. We need to solve:
$\max \frac{2k}{\|\vec{w}\|}$
subject to: $\vec{w} \cdot \vec{x} + b \geq +k$ if $\vec{x}$ is positive, $\vec{w} \cdot \vec{x} + b \leq -k$ if $\vec{x}$ is negative.
[Figure: hyperplane $\vec{w} \cdot \vec{x} + b = 0$ with margin boundaries at $\pm k$.]
Support Vector Machines
There is a scale for which $k = 1$. The problem transforms into:
$\max \frac{2}{\|\vec{w}\|}$
subject to: $\vec{w} \cdot \vec{x} + b \geq +1$ if $\vec{x}$ is positive, $\vec{w} \cdot \vec{x} + b \leq -1$ if $\vec{x}$ is negative.
[Figure: hyperplane $\vec{w} \cdot \vec{x} + b = 0$ with margin boundaries at $\pm 1$.]
Final Formulation
$\max \frac{2}{\|\vec{w}\|}$ s.t. $\vec{w} \cdot \vec{x}_i + b \geq +1$ if $y_i = 1$, and $\vec{w} \cdot \vec{x}_i + b \leq -1$ if $y_i = -1$
$\;\Rightarrow\; \max \frac{2}{\|\vec{w}\|}$ s.t. $y_i(\vec{w} \cdot \vec{x}_i + b) \geq 1$
$\;\Rightarrow\; \min \frac{\|\vec{w}\|}{2}$ s.t. $y_i(\vec{w} \cdot \vec{x}_i + b) \geq 1$
$\;\Rightarrow\; \min \frac{\|\vec{w}\|^2}{2}$ s.t. $y_i(\vec{w} \cdot \vec{x}_i + b) \geq 1$
Optimization Problem
Optimal hyperplane:
Minimize $\tau(\vec{w}) = \frac{1}{2}\,\vec{w} \cdot \vec{w}$
Subject to $y_i\big((\vec{w} \cdot \vec{x}_i) + b\big) \geq 1$, $i = 1, \ldots, m$
The dual problem is simpler.
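As an illustration (not part of the original slides), the primal problem above can be approximated with an off-the-shelf solver; the sketch below uses scikit-learn's SVC with a very large C so that the soft-margin solver behaves like a hard-margin one on separable data. The toy data and the value C=1e10 are arbitrary choices for the example.

```python
# Minimal sketch (assumption: scikit-learn and numpy are available).
# A very large C makes the soft-margin solver approximate the hard-margin
# problem: minimize 1/2 ||w||^2 subject to y_i (w . x_i + b) >= 1.
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data (invented for the example).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e10)   # C -> infinity ~ hard margin
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
margin = 2.0 / np.linalg.norm(w)     # the geometric margin 2/||w||
print("w =", w, "b =", b, "margin =", margin)
print("support vectors:", clf.support_vectors_)
```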
Lagrangian Definition
Dual Optimization Problem
Dual Transformation
Given the Lagrangian associated with our problem, to solve the dual problem we first minimize the Lagrangian with respect to $\vec{w}$ and $b$.
Let us impose that the derivative with respect to $\vec{w}$ is 0:
$\frac{\partial L}{\partial \vec{w}} = \vec{w} - \sum_{i=1}^m \alpha_i y_i \vec{x}_i = 0 \;\Rightarrow\; \vec{w} = \sum_{i=1}^m \alpha_i y_i \vec{x}_i$
Dual Transformation (cont’d)
...and with respect to $b$:
$\frac{\partial L}{\partial b} = -\sum_{i=1}^m \alpha_i y_i = 0 \;\Rightarrow\; \sum_{i=1}^m \alpha_i y_i = 0$
Then we substitute them back into the Lagrangian.
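For completeness, here is the substitution step written out; this is the standard derivation consistent with the slides, not copied from them:

$L(\vec{w}, b, \vec{\alpha}) = \frac{1}{2}\,\vec{w}\cdot\vec{w} - \sum_{i=1}^m \alpha_i\big[y_i(\vec{w}\cdot\vec{x}_i + b) - 1\big]$

Substituting $\vec{w} = \sum_i \alpha_i y_i \vec{x}_i$ and using $\sum_i \alpha_i y_i = 0$:

$L = \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j\,\vec{x}_i\cdot\vec{x}_j - \sum_{i,j}\alpha_i\alpha_j y_i y_j\,\vec{x}_i\cdot\vec{x}_j - b\sum_i \alpha_i y_i + \sum_i \alpha_i = \sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i,j=1}^m \alpha_i\alpha_j y_i y_j\,\vec{x}_i\cdot\vec{x}_j$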
Final Dual Problem
maximize $\sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j\,\vec{x}_i \cdot \vec{x}_j$
subject to $\alpha_i \geq 0,\; i = 1, \ldots, m$, and $\sum_{i=1}^m \alpha_i y_i = 0$
Kuhn-Tucker Theorem
Necessary and sufficient conditions for optimality:
$\frac{\partial L(\vec{w}^*, \vec{\alpha}^*, \vec{\beta}^*)}{\partial \vec{w}} = \vec{0}$
$\frac{\partial L(\vec{w}^*, \vec{\alpha}^*, \vec{\beta}^*)}{\partial b} = 0$
$\alpha_i^*\, g_i(\vec{w}^*) = 0,\; i = 1, \ldots, m$
$g_i(\vec{w}^*) \leq 0,\; i = 1, \ldots, m$
$\alpha_i^* \geq 0,\; i = 1, \ldots, m$
Properties Coming from the Constraints
Lagrange constraints: $\sum_{i=1}^m \alpha_i y_i = 0$ and $\vec{w} = \sum_{i=1}^m \alpha_i y_i \vec{x}_i$
Karush-Kuhn-Tucker constraints: $\alpha_i\big[y_i(\vec{x}_i \cdot \vec{w} + b) - 1\big] = 0,\; i = 1, \ldots, m$
Support vectors have non-null $\alpha_i$.
To evaluate $b$, we can apply the following equation (valid for any support vector $\vec{x}_k$, i.e., any $k$ with $\alpha_k > 0$): $b = y_k - \vec{w} \cdot \vec{x}_k$
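To make the recovery of $\vec{w}$ and $b$ concrete, here is a small sketch of my own (not from the slides); the names alphas, X, y are hypothetical and stand for a dual solution and the training set.

```python
# Sketch (assumption: a dual solution `alphas` is already available,
# e.g. from a QP solver). Recovers w and b from the KKT conditions.
import numpy as np

def primal_from_dual(alphas, X, y, tol=1e-8):
    # w = sum_i alpha_i * y_i * x_i
    w = (alphas * y) @ X
    # Support vectors are the points with non-null alpha_i.
    sv = alphas > tol
    # For any support vector x_k:  y_k (w . x_k + b) = 1  =>  b = y_k - w . x_k.
    # Averaging over all support vectors is numerically more stable.
    b = np.mean(y[sv] - X[sv] @ w)
    return w, b
```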
Warning!
In the graphical examples we always consider normalized hyperplanes (hyperplanes whose gradient $\vec{w}$ has norm 1). In this case $b$ is exactly the distance of the hyperplane from the origin.
If instead the equation is not normalized, e.g. $\vec{x} \cdot \vec{w}' + b = 0$ with $\vec{x} = (x, y)$ and $\vec{w}' = (1, 1)$, then $b$ is not the distance.
Warning! (cont’d)
Let us consider a normalized gradient $\vec{w} = (1/\sqrt{2}, 1/\sqrt{2})$:
$(1/\sqrt{2}, 1/\sqrt{2}) \cdot (x, y) + b = 0 \;\Rightarrow\; x/\sqrt{2} + y/\sqrt{2} = -b \;\Rightarrow\; y = -x - b\sqrt{2}$
Now we see that $-b$ is exactly the distance: for $x = 0$ the intersection with the $y$-axis is $-b\sqrt{2}$, and this distance projected onto $\vec{w}$ is $-b$.
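A quick numerical check of this claim (my own illustration, not from the slides): for a unit-norm $\vec{w}$, the distance of the hyperplane from the origin equals $|b|$. The offset value below is arbitrary.

```python
# Sketch: verify that for a unit-norm w the hyperplane w.x + b = 0
# lies at distance |b| from the origin (distance = |w.0 + b| / ||w||).
import numpy as np

w = np.array([1.0, 1.0]) / np.sqrt(2)   # normalized gradient from the slide
b = -3.0                                 # arbitrary offset for the example

distance_from_origin = abs(b) / np.linalg.norm(w)
print(distance_from_origin)              # equals |b| = 3.0 since ||w|| = 1
```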
Soft Margin SVMs
Slack variables $\xi_i$ are added: some errors are allowed, but they should penalize the objective function.
[Figure: hyperplane $\vec{w} \cdot \vec{x} + b = 0$ with margin boundaries $\vec{w} \cdot \vec{x} + b = \pm 1$ and slack variables $\xi_i$ for the points violating the margin (axes Var 1 / Var 2).]
Soft Margin SVMs
The new constraints are $y_i(\vec{w} \cdot \vec{x}_i + b) \geq 1 - \xi_i$, where $\xi_i \geq 0$.
The objective function penalizes the incorrectly classified examples:
$\min \frac{1}{2}\|\vec{w}\|^2 + C\sum_i \xi_i$
$C$ is the trade-off between the margin and the error.
[Figure: hyperplane $\vec{w} \cdot \vec{x} + b = 0$ with margin boundaries at $\pm 1$ (axes Var 1 / Var 2).]
Dual Formulation
$\min \frac{1}{2}\|\vec{w}\|^2 + C\sum_{i=1}^m \xi_i$
$y_i(\vec{w} \cdot \vec{x}_i + b) \geq 1 - \xi_i,\; \forall i = 1, \ldots, m$
$\xi_i \geq 0,\; i = 1, \ldots, m$
$L(\vec{w}, b, \vec{\xi}, \vec{\alpha}, \vec{\beta}) = \frac{1}{2}\,\vec{w} \cdot \vec{w} + C\sum_{i=1}^m \xi_i - \sum_{i=1}^m \alpha_i\big[y_i(\vec{w} \cdot \vec{x}_i + b) - 1 + \xi_i\big] - \sum_{i=1}^m \beta_i \xi_i$
By deriving with respect to $\vec{w}$, $\vec{\xi}$ and $b$:
Partial Derivatives
$\frac{\partial L}{\partial \vec{w}} = \vec{w} - \sum_{i=1}^m \alpha_i y_i \vec{x}_i = 0 \;\Rightarrow\; \vec{w} = \sum_{i=1}^m \alpha_i y_i \vec{x}_i$
$\frac{\partial L}{\partial b} = -\sum_{i=1}^m \alpha_i y_i = 0$
$\frac{\partial L}{\partial \xi_i} = C - \alpha_i - \beta_i = 0$
Substitution in the Objective Function (using the Kronecker delta $\delta_{ij}$)
Final Dual Optimization Problem
maximize $\sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j\,\vec{x}_i \cdot \vec{x}_j$
subject to $0 \leq \alpha_i \leq C,\; i = 1, \ldots, m$, and $\sum_{i=1}^m \alpha_i y_i = 0$
Soft Margin Support Vector Machines
$\min \frac{1}{2}\|\vec{w}\|^2 + C\sum_i \xi_i$ subject to $y_i(\vec{w} \cdot \vec{x}_i + b) \geq 1 - \xi_i$ and $\xi_i \geq 0$
The algorithm tries to keep $\xi_i$ low and to maximize the margin.
NB: the number of errors is not directly minimized (that would be an NP-complete problem); the distances of the errors from the hyperplane are minimized.
If $C \to \infty$, the solution tends to the one of the hard-margin algorithm.
Attention: if $C = 0$ we get $\|\vec{w}\| = 0$, since the constraints reduce to $y_i b \geq 1 - \xi_i$.
If $C$ increases, the number of errors decreases. When $C$ tends to infinity the number of errors must be 0, i.e., we recover the hard-margin formulation.
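A small empirical illustration of the role of C (my own sketch; the data and C values are arbitrary): larger C leaves fewer margin violations, at the price of a smaller margin.

```python
# Sketch (assumption: scikit-learn and numpy available). Trains a linear
# soft-margin SVM for several values of C and reports the margin width and
# the number of margin violations (points with y_i (w.x_i + b) < 1).
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) + [2, 2], rng.randn(50, 2) - [2, 2]])
y = np.array([1] * 50 + [-1] * 50)

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w, b = clf.coef_[0], clf.intercept_[0]
    margin = 2.0 / np.linalg.norm(w)
    violations = np.sum(y * (X @ w + b) < 1 - 1e-9)
    print(f"C={C:7.2f}  margin={margin:.3f}  margin violations={violations}")
```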
Robustness of Soft vs. Hard Margin SVMs
[Figure: two plots (axes Var 1 / Var 2) comparing the hyperplane $\vec{w} \cdot \vec{x} + b = 0$ learned by the hard-margin SVM and by the soft-margin SVM; the slack $\xi_i$ is shown in the soft-margin plot.]
Soft vs. Hard Margin SVMs
Soft-margin SVMs always have a solution.
Soft-margin SVMs are more robust to odd (outlier) examples.
Hard-margin SVMs do not require parameters.
Parameters
$\min \frac{1}{2}\|\vec{w}\|^2 + C\sum_i \xi_i = \min \frac{1}{2}\|\vec{w}\|^2 + C\sum_{i \in +} \xi_i + C\sum_{i \in -} \xi_i$
$\;\Rightarrow\; \min \frac{1}{2}\|\vec{w}\|^2 + C\Big(J\sum_{i \in +} \xi_i + \sum_{i \in -} \xi_i\Big)$
C: trade-off parameter; J: cost factor (relative weight of the errors on positive examples).
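In practice (my own note, not from the slides) the cost factor J corresponds to giving the positive class a larger per-class penalty; in scikit-learn this can be expressed through the class_weight argument, which rescales C per class. C and J below are arbitrary example values.

```python
# Sketch (assumption: scikit-learn available). The cost factor J is applied
# to the positive class by rescaling its penalty: C_positive = J * C.
from sklearn.svm import SVC

C, J = 1.0, 5.0   # arbitrary example values
clf = SVC(kernel="linear", C=C, class_weight={1: J, -1: 1.0})
# clf.fit(X, y) would now penalize slack on positive examples J times more.
```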
Theoretical Justification
Definition of Training Set Error
Training data: $(\vec{x}_1, y_1), \ldots, (\vec{x}_m, y_m) \in R^N \times \{\pm 1\}$, and a classifier $f: R^N \to \{\pm 1\}$
Empirical risk (error): $R_{emp}[f] = \frac{1}{m}\sum_{i=1}^m \frac{1}{2}\,\big|f(\vec{x}_i) - y_i\big|$
Risk (error): $R[f] = \int \frac{1}{2}\,\big|f(\vec{x}) - y\big|\, dP(\vec{x}, y)$
Error Characterization (part 1)
From PAC-learning theory (Vapnik):
$R(\alpha) \leq R_{emp}(\alpha) + \varepsilon(d, m, \delta)$, where
$\varepsilon(d, m, \delta) = \sqrt{\dfrac{d\left(\log\frac{2m}{d} + 1\right) - \log\frac{\delta}{4}}{m}}$
where $d$ is the VC-dimension, $m$ is the number of examples, $\delta$ is a bound on the probability to get such error, and $\alpha$ is a classifier parameter.
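To get a feel for the bound, here is a short numeric sketch of my own; it simply evaluates the $\varepsilon(d, m, \delta)$ term of the slide for a few sample sizes (d and delta are arbitrary example values).

```python
# Sketch: evaluate the VC confidence term of the bound
#   R(alpha) <= R_emp(alpha) + sqrt( (d*(log(2m/d) + 1) - log(delta/4)) / m )
import math

def vc_confidence(d, m, delta):
    return math.sqrt((d * (math.log(2 * m / d) + 1) - math.log(delta / 4)) / m)

for m in [100, 1_000, 10_000, 100_000]:
    print(m, round(vc_confidence(d=10, m=m, delta=0.05), 3))
```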
There are many versions for different bounds
Error Characterization (part 2)
Ranking, Regression and Multiclassification
The Ranking SVM [Herbrich et al. 1999, 2000; Joachims et al. 2002]
The aim is to classify instance pairs as correctly ranked or incorrectly ranked. This turns an ordinal regression problem back into a binary classification problem.
We want a ranking function f such that x_i > x_j iff f(x_i) > f(x_j), or at least one that tries to do this with minimal error.
Suppose that f is a linear function: f(x_i) = w · x_i
The Ranking SVM (Sec. 15.4.2)
Ranking model: f(x_i)
The Ranking SVM (Sec. 15.4.2)
Then (combining the two equations on the last slide):
x_i > x_j iff w · x_i − w · x_j > 0
x_i > x_j iff w · (x_i − x_j) > 0
Let us then create a new instance space from such pairs:
z_k = x_i − x_k
y_k = +1, −1 as x_i ≥, < x_k
Support Vector Ranking
$\min \frac{1}{2}\|\vec{w}\|^2 + C\sum_k \xi_k$
$y_k\big(\vec{w} \cdot (\vec{x}_i - \vec{x}_j) + b\big) \geq 1 - \xi_k,\; \forall i, j = 1, \ldots, m$
$\xi_k \geq 0,\; k = 1, \ldots, m^2$
$y_k = 1$ if $rank(\vec{x}_i) > rank(\vec{x}_j)$, $-1$ otherwise, where $k = i \times m + j$
Given two examples $(\vec{x}_i, \vec{x}_j)$ we build one (pair) example $z_k$.
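Here is a compact sketch of the pairwise transformation (my own illustration; the ranks and features are invented): every pair of items with different ranks yields one difference vector, and an ordinary linear SVM is trained on these.

```python
# Sketch (assumption: scikit-learn and numpy available). Builds pairwise
# difference examples z_k = x_i - x_j with label +1 if rank(x_i) > rank(x_j),
# -1 otherwise, then trains a standard linear SVM on them.
import numpy as np
from sklearn.svm import LinearSVC

def make_pairs(X, ranks):
    Z, y = [], []
    for i in range(len(X)):
        for j in range(len(X)):
            if ranks[i] == ranks[j]:
                continue                      # skip ties
            Z.append(X[i] - X[j])
            y.append(1 if ranks[i] > ranks[j] else -1)
    return np.array(Z), np.array(y)

X = np.array([[1.0, 0.2], [0.4, 0.9], [0.1, 0.1]])   # toy feature vectors
ranks = np.array([3, 2, 1])                           # higher = better

Z, y = make_pairs(X, ranks)
ranker = LinearSVC(C=1.0).fit(Z, y)
scores = X @ ranker.coef_[0]                          # f(x) = w . x
print(np.argsort(-scores))                            # predicted ranking order
```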
Support Vector Regression (SVR)
Solution: $f(x) = \vec{w} \cdot \vec{x} + b$
Minimize $\frac{1}{2}\,\vec{w}^T\vec{w}$
Constraints:
$y_i - \vec{w}^T\vec{x}_i - b \leq \varepsilon$
$\vec{w}^T\vec{x}_i + b - y_i \leq \varepsilon$
[Figure: the regression function f(x) with the $\varepsilon$-tube around it.]
Support Vector Regression (SVR)
$f(x) = \vec{w} \cdot \vec{x} + b$
Minimize $\frac{1}{2}\,\vec{w}^T\vec{w} + C\sum_{i=1}^N (\xi_i + \xi_i^*)$
Constraints:
$y_i - \vec{w}^T\vec{x}_i - b \leq \varepsilon + \xi_i$
$\vec{w}^T\vec{x}_i + b - y_i \leq \varepsilon + \xi_i^*$
$\xi_i, \xi_i^* \geq 0$
[Figure: the $\varepsilon$-tube with slack variables $\xi$ and $\xi^*$ for the points outside it.]
Support Vector Regression
$y_i$ is no longer $-1$ or $1$; now it is a real value.
$\varepsilon$ is the tolerance on the value of our function.
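As an illustration (not in the slides), scikit-learn exposes this formulation directly through SVR; the values of C and epsilon below are arbitrary example choices.

```python
# Sketch (assumption: scikit-learn and numpy available). Fits a linear
# epsilon-insensitive SVR: deviations smaller than epsilon are not penalized.
import numpy as np
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = np.linspace(0, 5, 60).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0 + 0.1 * rng.randn(60)   # noisy line (toy data)

reg = SVR(kernel="linear", C=1.0, epsilon=0.2).fit(X, y)
print(reg.coef_, reg.intercept_)                  # close to slope 2 and offset 1
```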
From Binary to Multiclass Classifiers
Three different approaches. ONE-vs-ALL (OVA):
Given the example sets {E1, E2, E3, …} for the categories {C1, C2, C3, …}, the binary classifiers {b1, b2, b3, …} are built. For b1, E1 is the set of positives and E2 ∪ E3 ∪ … is the set of negatives, and so on.
For testing: given a classification instance x, the category is the one associated with the maximum margin among all binary classifiers.
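A minimal sketch of the OVA scheme described above (my own; the function and variable names are invented): one binary SVM per category, prediction by the largest decision value (margin).

```python
# Sketch (assumption: scikit-learn and numpy available). One-vs-All:
# train one binary linear SVM per category, classify by the maximum margin.
import numpy as np
from sklearn.svm import LinearSVC

def train_ova(X, labels, categories):
    classifiers = {}
    for c in categories:
        y_c = np.where(labels == c, 1, -1)     # E_c positive, the rest negative
        classifiers[c] = LinearSVC(C=1.0).fit(X, y_c)
    return classifiers

def predict_ova(classifiers, x):
    # The category whose classifier gives the largest decision value (margin).
    scores = {c: clf.decision_function(x.reshape(1, -1))[0]
              for c, clf in classifiers.items()}
    return max(scores, key=scores.get)
```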