Natural Language Processing and Information Retrieval: Support Vector Machines
Alessandro Moschitti
Department of Information and Communication Technology, University of Trento
Email: moschitti@disi.unitn.it
Summary
Support Vector Machines
Hard-margin SVMs
Soft-margin SVMs
Which hyperplane to choose?
Classifier with a Maximum Margin
IDEA 1: Select the hyperplane with maximum margin
[Figure: two classes of points in the Var 1 / Var 2 plane, with the margin drawn on both sides of the separating hyperplane]
Support Vectors
[Figure: the support vectors are the points lying on the margin boundaries in the Var 1 / Var 2 plane]
Support Vector Machine Classifiers
The margin is equal to 2k / ||w||
[Figure: separating hyperplane w·x + b = 0 with the two boundary hyperplanes w·x + b = +k and w·x + b = −k]
Support Vector Machines
The margin is equal to 2k / ||w||. We need to solve:
  max 2k / ||w||
  subject to:  w·x + b ≥ +k, if x is positive
               w·x + b ≤ −k, if x is negative
Support Vector Machines
There is a scale for which k = 1. The problem transforms into:
  max 2 / ||w||
  subject to:  w·x + b ≥ +1, if x is positive
               w·x + b ≤ −1, if x is negative
Final Formulation
  max 2 / ||w||   subject to   w·x_i + b ≥ +1 if y_i = +1,   w·x_i + b ≤ −1 if y_i = −1
which can be rewritten as:
  max 2 / ||w||    subject to   y_i(w·x_i + b) ≥ 1
  min ||w|| / 2    subject to   y_i(w·x_i + b) ≥ 1
  min ||w||² / 2   subject to   y_i(w·x_i + b) ≥ 1
Optimization Problem
Optimal Hyperplane:
  Minimize    τ(w) = ½ (w·w)
  Subject to  y_i((w·x_i) + b) ≥ 1,   i = 1,...,m
The dual problem is simpler
Lagrangian Definition
  L(w, b, α) = ½ ||w||² − Σ_{i=1..m} α_i [y_i(w·x_i + b) − 1],   with α_i ≥ 0
Dual Optimization Problem
Dual Transformation
Given the Lagrangian associated with our problem, to solve the dual problem we need to evaluate w.
Let us set the derivative to 0 with respect to w:
  ∂L/∂w = 0   ⇒   w = Σ_{i=1..m} α_i y_i x_i
Dual Transformation (cont'd)
and with respect to b:
  ∂L/∂b = 0   ⇒   Σ_{i=1..m} α_i y_i = 0
Then we substitute them into the Lagrange function
Final Dual Problem
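In its standard form, the dual obtained from the substitutions above is:
  maximize (over α)   Σ_{i=1..m} α_i − ½ Σ_{i,j=1..m} α_i α_j y_i y_j (x_i · x_j)
  subject to          α_i ≥ 0,  i = 1,...,m        Σ_{i=1..m} α_i y_i = 0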
Kuhn-Tucker Theorem
Necessary and sufficient conditions for optimality:
  ∂L(w*, α*, β*)/∂w = 0
  ∂L(w*, α*, β*)/∂b = 0
  α*_i g_i(w*) = 0,   i = 1,...,m
  g_i(w*) ≤ 0,        i = 1,...,m
  α*_i ≥ 0,           i = 1,...,m
Properties coming from constraints
Lagrange constraints:
  Σ_{i=1..m} α_i y_i = 0        w = Σ_{i=1..m} α_i y_i x_i
Karush-Kuhn-Tucker constraints:
  α_i [y_i(x_i·w + b) − 1] = 0,   i = 1,...,m
Support vectors have non-null α_i
To evaluate b, we can apply the following equation (for any support vector x_k):  b = y_k − w·x_k
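As an illustrative sketch (scikit-learn and the toy data below are assumptions, not part of the original slides), the support vectors, the weight vector w = Σ_i α_i y_i x_i and the bias b can be read off a trained linear SVM:

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data (made up for illustration)
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
              [0.0, 0.0], [1.0, 0.5], [0.5, 1.0]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6)   # very large C approximates the hard margin
clf.fit(X, y)

# dual_coef_ holds alpha_i * y_i for the support vectors only
w = clf.dual_coef_ @ clf.support_vectors_   # w = sum_i alpha_i y_i x_i
b = clf.intercept_[0]

# b can also be recovered from any support vector on the margin: b = y_k - w . x_k
x_k = clf.support_vectors_[0]
y_k = y[clf.support_[0]]
b_check = y_k - x_k @ w.ravel()

print("support vectors:\n", clf.support_vectors_)
print("w =", w.ravel(), " b =", b, " b (from KKT) =", b_check)
```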
Warning!
In the graphical examples we always consider a normalized hyperplane (a hyperplane with normalized gradient).
In this case b is exactly the distance of the hyperplane from the origin.
So if the equation is not normalized, we may have
  w'·x + b = 0   with   x = (x, y)  and  w' = (1, 1),
and b is not the distance.
Warning!
Let us consider a normalized gradient w = (1/√2, 1/√2):
  (1/√2, 1/√2)·(x, y) + b = 0   ⇒   x/√2 + y/√2 = −b   ⇒   y = −x − b√2
Now we see that −b is exactly the distance: for x = 0 we have the intersection with the y-axis at −b√2, and this distance projected onto w is −b.
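As a concrete check (numbers chosen only for illustration): with w = (1/√2, 1/√2) and b = −√2, the hyperplane x/√2 + y/√2 − √2 = 0 is the line x + y = 2; its closest point to the origin is (1, 1), at distance √2 from the origin, which is exactly −b.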
Soft Margin SVMs
Slack variables ξ_i are added.
Some errors are allowed, but they should penalize the objective function.
[Figure: the hyperplanes w·x + b = +1, w·x + b = 0 and w·x + b = −1 in the Var 1 / Var 2 plane, with slack variables ξ_i measuring the margin violations]
Soft Margin SVMs
The new constraints are:
  y_i(w·x_i + b) ≥ 1 − ξ_i,   where ξ_i ≥ 0
The objective function penalizes the incorrectly classified examples:
  min ½ ||w||² + C Σ_i ξ_i
C is the trade-off between margin and error
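A minimal sketch of this trade-off using scikit-learn (the library and the toy data are illustrative assumptions, not part of the slides); the C argument of SVC plays exactly the role of the trade-off parameter above:

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2-D data; the third "positive" point lies among the negatives,
# so a separating hyperplane needs slack variables
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.5, 0.6],
              [0.0, 0.0], [1.0, 0.5], [0.5, 1.0]])
y = np.array([1, 1, 1, -1, -1, -1])

# Small C: wide margin, violations tolerated; large C: violations penalized heavily
for C in (0.1, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_)
    print(f"C={C:6.1f}  margin width={margin:.3f}  support vectors={clf.n_support_.sum()}")
```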
Dual formulation
  min ½ ||w||² + C Σ_{i=1..m} ξ_i
  subject to:  y_i(w·x_i + b) ≥ 1 − ξ_i,   ∀ i = 1,...,m
               ξ_i ≥ 0,   i = 1,...,m
The Lagrangian is:
  L(w, b, ξ, α) = ½ (w·w) + C Σ_{i=1..m} ξ_i − Σ_{i=1..m} α_i [y_i(w·x_i + b) − 1 + ξ_i]
By deriving wrt w, ξ and b
Partial Derivatives
Substitution in the objective function (using the Kronecker delta δ_ij)
Final dual optimization problem
Soft Margin Support Vector Machines
  min ½ ||w||² + C Σ_i ξ_i
  subject to:  y_i(w·x_i + b) ≥ 1 − ξ_i,   ξ_i ≥ 0
The algorithm tries to keep ξ_i low and maximize the margin.
NB: the number of errors is not directly minimized (that is an NP-complete problem); the distances from the hyperplane are minimized.
If C → ∞, the solution tends to the one of the hard-margin algorithm.
Attention!!!: if C = 0 we get ||w|| = 0, since the constraints y_i b ≥ 1 − ξ_i can be satisfied by the slacks alone.
If C increases, the number of errors decreases. When C tends to infinity, the number of errors must be 0, i.e. the hard-margin formulation.
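To make the role of the slacks concrete, a small sketch (scikit-learn and the toy data are assumptions for illustration) that computes ξ_i = max(0, 1 − y_i f(x_i)) for increasing C; as C grows, the total slack and the number of training errors drop toward the hard-margin behaviour:

```python
import numpy as np
from sklearn.svm import SVC

# Toy data: separable, but only with a narrow margin
X = np.array([[2.0, 2.0], [3.0, 3.0], [1.2, 1.1],
              [0.0, 0.0], [1.0, 0.5], [0.5, 1.0]])
y = np.array([1, 1, 1, -1, -1, -1])

for C in (0.01, 1.0, 1e4):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    f = clf.decision_function(X)          # f(x_i) = w . x_i + b
    xi = np.maximum(0.0, 1.0 - y * f)     # slack: xi_i = max(0, 1 - y_i f(x_i))
    print(f"C={C:g}  total slack={xi.sum():.3f}  training errors={(y * f < 0).sum()}")
```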
Robustness of Soft vs. Hard Margin SVMs
[Figure: two panels in the Var 1 / Var 2 plane showing the hyperplane w·x + b = 0 learned by a hard-margin SVM and by a soft-margin SVM with slack variables ξ_i]
Soft vs. Hard Margin SVMs
Soft-Margin always has a solution
Soft-Margin is more robust to odd examples
Hard-Margin does not require parameters
Parameters
  min ½ ||w||² + C Σ_i ξ_i
    = min ½ ||w||² + C⁺ Σ_{i ∈ positive examples} ξ_i + C⁻ Σ_{i ∈ negative examples} ξ_i
    = min ½ ||w||² + C (J Σ_{i ∈ positive examples} ξ_i + Σ_{i ∈ negative examples} ξ_i)
C: trade-off parameter
J: cost factor
Theoretical Justification
Definition of Training Set Error
Training Data:  (x_1, y_1), ..., (x_m, y_m) ∈ R^N × {±1},   f : R^N → {±1}
Empirical Risk (error):
  R_emp[f] = (1/m) Σ_{i=1..m} ½ |f(x_i) − y_i|
Risk (error):
  R[f] = ∫ ½ |f(x) − y| dP(x, y)
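A tiny numpy sketch (illustrative only; the classifier and the data are made up) of the empirical risk defined above, i.e. the average 0/1 loss over the training sample:

```python
import numpy as np

def empirical_risk(f, X, y):
    """R_emp[f] = (1/m) * sum_i (1/2) |f(x_i) - y_i|, with f(x) and y in {-1, +1}."""
    preds = np.array([f(x) for x in X])
    return 0.5 * np.mean(np.abs(preds - y))

# Hypothetical classifier: sign of the first feature
f = lambda x: 1 if x[0] >= 0 else -1
X = np.array([[0.5, 1.0], [-0.2, 0.3], [1.5, -1.0], [-1.0, -1.0]])
y = np.array([1, 1, -1, -1])
print(empirical_risk(f, X, y))   # fraction of misclassified training examples
```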
Error Characterization (part 1)
From PAC-learning Theory (Vapnik):
  R(α) ≤ R_emp(α) + ε(d, m, δ)
  ε(d, m, δ) = sqrt( ( d (log(2m/d) + 1) − log(δ/4) ) / m )
where d is the VC-dimension, m is the number of examples, δ is a bound on the probability to get such error and α is a classifier parameter.
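For concreteness, the confidence term ε(d, m, δ) of the bound can be evaluated directly (a sketch with arbitrary example values for d, m and δ):

```python
import math

def vc_confidence(d, m, delta):
    """epsilon(d, m, delta) = sqrt( (d*(log(2m/d) + 1) - log(delta/4)) / m )"""
    return math.sqrt((d * (math.log(2 * m / d) + 1) - math.log(delta / 4)) / m)

# e.g. VC-dimension 10, 1000 training examples, 95% confidence (delta = 0.05)
print(vc_confidence(d=10, m=1000, delta=0.05))
```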
There are many versions for different bounds
Error Characterization (part 2)
Ranking, Regression and Multiclassification
The Ranking SVM [Herbrich et al. 1999, 2000; Joachims et al. 2002]
The aim is to classify instance pairs as correctly ranked or incorrectly ranked.
This turns an ordinal regression problem back into a binary classification problem.
We want a ranking function f such that x_i > x_j iff f(x_i) > f(x_j) ... or at least one that tries to do this with minimal error.
Suppose that f is a linear function:  f(x_i) = w·x_i
The Ranking SVM (Sec. 15.4.2)
Ranking Model: f(x_i)
The Ranking SVM (Sec. 15.4.2)
Then (combining the two equations on the last slide):
  x_i > x_j  iff  w·x_i − w·x_j > 0
  x_i > x_j  iff  w·(x_i − x_j) > 0
Let us then create a new instance space from such pairs:
  z_k = x_i − x_k
  y_k = +1, −1  as  x_i ≥, < x_k
Support Vector Ranking
  min ½ ||w||² + C Σ_k ξ_k
  y_k (w·(x_i − x_j) + b) ≥ 1 − ξ_k,   ∀ i, j = 1,...,m
  ξ_k ≥ 0,   k = 1,...,m²
  y_k = 1 if rank(x_i) > rank(x_j), 0 otherwise, where k = i × m + j
Given two examples (x_i, x_j), we build one training example z_k
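A minimal sketch of the pairwise construction (scikit-learn, the feature vectors and the ranks are illustrative assumptions): build z_k = x_i − x_j for every pair with different ranks, label it +1 when x_i should be ranked above x_j and −1 otherwise, and train an ordinary linear SVM on the z_k:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Hypothetical feature vectors and their relevance ranks (higher = better)
X = np.array([[1.0, 0.2], [0.8, 0.1], [0.3, 0.9], [0.1, 0.4]])
rank = np.array([3, 2, 1, 0])

# Build pairwise difference examples z_k = x_i - x_j with labels +1 / -1
Z, y = [], []
for i in range(len(X)):
    for j in range(len(X)):
        if rank[i] == rank[j]:
            continue
        Z.append(X[i] - X[j])
        y.append(1 if rank[i] > rank[j] else -1)

ranker = LinearSVC(C=1.0).fit(np.array(Z), np.array(y))
w = ranker.coef_.ravel()
scores = X @ w                       # f(x) = w . x induces the learned ranking
print("scores:", scores, "-> order:", np.argsort(-scores))
```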
Support Vector Regression (SVR)
Solution:  f(x) = wx + b
  Min  ½ wᵀw
Constraints:
  y_i − wᵀx_i − b ≤ ε
  wᵀx_i + b − y_i ≤ ε
[Figure: the regression line f(x) = wx + b with the ε-tube (+ε, 0, −ε) around it]
Support Vector Regression (SVR)
Minimise:  ½ wᵀw + C Σ_{i=1..N} (ξ_i + ξ_i*)
Constraints:
  y_i − wᵀx_i − b ≤ ε + ξ_i
  wᵀx_i + b − y_i ≤ ε + ξ_i*
  ξ_i, ξ_i* ≥ 0
[Figure: the ε-tube around f(x) = wx + b, with slack variables ξ and ξ* for points lying outside the tube]
Support Vector Regression
y_i is no longer −1 or +1; now it is a real value.
ε is the tolerance of our function value.
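An illustrative scikit-learn sketch of ε-insensitive regression (the data and parameter values are made up); epsilon and C correspond to the ε tolerance and the trade-off of the formulation above:

```python
import numpy as np
from sklearn.svm import SVR

# Noisy samples of a linear function (toy data)
rng = np.random.default_rng(0)
X = np.linspace(0, 5, 40).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0 + rng.normal(scale=0.1, size=40)

# epsilon: width of the tube inside which errors are not penalized
svr = SVR(kernel="linear", C=10.0, epsilon=0.2).fit(X, y)
print("slope ~", svr.coef_.ravel(), " intercept ~", svr.intercept_)
```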
From Binary to Multiclass Classifiers
Three different approaches:
ONE-vs-ALL (OVA)
Given the example sets {E1, E2, E3, ...} for the categories {C1, C2, C3, ...}, the binary classifiers {b1, b2, b3, ...} are built. For b1, E1 is the set of positives and E2 ∪ E3 ∪ ... is the set of negatives, and so on.
For testing: given a classification instance x, the category is the one associated with the maximum margin among all binary classifiers.
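A minimal ONE-vs-ALL sketch (scikit-learn and the toy three-class data are illustrative assumptions): one binary SVM is trained per category, and a test instance is assigned to the class whose classifier gives the maximum margin:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy 3-class data (assumed for illustration)
X = np.array([[0.0, 0.1], [0.2, 0.0], [2.0, 2.1], [2.2, 1.9], [0.1, 2.0], [0.0, 2.2]])
y = np.array([0, 0, 1, 1, 2, 2])

classifiers = {}
for c in np.unique(y):
    # class c is the set of positives, every other class is negative
    classifiers[c] = LinearSVC(C=1.0).fit(X, np.where(y == c, 1, -1))

def predict(x):
    # pick the category whose binary classifier gives the largest (signed) margin
    margins = {c: clf.decision_function([x])[0] for c, clf in classifiers.items()}
    return max(margins, key=margins.get)

print(predict([2.1, 2.0]))   # expected: class 1
```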