

  1. Natural Language Processing and Information Retrieval Support Vector Machines Alessandro Moschitti Department of Information and Communication Technology University of Trento Email: moschitti@disi.unitn.it

  2. Summary Support Vector Machines Hard-margin SVMs Soft-margin SVMs

  3. Which hyperplane to choose?

  4. Classifier with a Maximum Margin IDEA 1: Select the hyperplane with maximum margin [Figure: two classes plotted on Var 1 vs. Var 2, with the margin around the separating hyperplane highlighted]

  5. Support Vectors [Figure, Var 1 vs. Var 2: the training points lying on the margin are the support vectors]

  6. Support Vector Machine Classifiers The margin is equal to 2k/||w|| [Figure, Var 1 vs. Var 2: the margin hyperplanes w · x + b = +k and w · x + b = −k around the separating hyperplane w · x + b = 0]

  7. Support Vector Machines The margin is equal to 2k/||w||. We need to solve: max 2k/||w|| subject to: w · x_i + b ≥ +k, if x_i is positive; w · x_i + b ≤ −k, if x_i is negative [Figure, Var 1 vs. Var 2: hyperplanes w · x + b = +k, w · x + b = −k and w · x + b = 0]

  8. Support Vector Machines There is a scale for which k = 1. The problem transforms into: max 2/||w|| subject to: w · x_i + b ≥ +1, if x_i is positive; w · x_i + b ≤ −1, if x_i is negative [Figure, Var 1 vs. Var 2: hyperplanes w · x + b = +1, w · x + b = −1 and w · x + b = 0]

  9. Final Formulation max 2/||w|| subject to: w · x_i + b ≥ +1 if y_i = +1, w · x_i + b ≤ −1 if y_i = −1 ⇔ max 2/||w|| subject to y_i(w · x_i + b) ≥ 1 ⇔ min ||w||/2 subject to y_i(w · x_i + b) ≥ 1 ⇔ min ||w||²/2 subject to y_i(w · x_i + b) ≥ 1

  10. Optimization Problem Optimal Hyperplane: Minimize τ(w) = ½ w · w Subject to y_i((w · x_i) + b) ≥ 1, i = 1,..., m The dual problem is simpler
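
A minimal sketch of this primal problem on a toy separable 2-D dataset, assuming the cvxpy package as the QP solver (not part of the original slides; data and names are illustrative only):

```python
import numpy as np
import cvxpy as cp

# Toy linearly separable data: two clusters in 2D, labels in {-1, +1}.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()

# Minimize (1/2)||w||^2 subject to y_i (w . x_i + b) >= 1 for all i.
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print("w =", w.value, "b =", b.value)
print("margin =", 2.0 / np.linalg.norm(w.value))  # the 2/||w|| of slides 8-9
```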

  11. Lagrangian Definition

  12. Dual Optimization Problem

  13. Dual Transformation Given the Lagrangian associated with our problem, to solve the dual problem we need to evaluate its minimum. Let us set the derivatives to 0, with respect to w

  14. Dual Transformation (cont’d) …and with respect to b. Then we substitute them into the Lagrange function

  15. Final Dual Problem

  16. Kuhn-Tucker Theorem Necessary and sufficient conditions for optimality: ∂L(w*, α*, β*)/∂w = 0; ∂L(w*, α*, β*)/∂b = 0; α*_i g_i(w*) = 0, i = 1,..., m; g_i(w*) ≤ 0, i = 1,..., m; α*_i ≥ 0, i = 1,..., m

  17. Properties coming from constraints Lagrange constraints: Σ_{i=1}^m α_i y_i = 0 and w = Σ_{i=1}^m α_i y_i x_i Karush-Kuhn-Tucker constraints: α_i [y_i(x_i · w + b) − 1] = 0, i = 1,..., m Support Vectors have non-zero α_i To evaluate b, we can apply the equation b = y_k − w · x_k for any support vector x_k
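
A small numpy sketch of these two properties, assuming the dual variables alpha have already been produced by some QP solver (they are a hypothetical input here, not part of the slides):

```python
import numpy as np

def primal_from_dual(alpha, X, y, tol=1e-8):
    """Recover w = sum_i alpha_i y_i x_i and b = y_k - w . x_k
    from any support vector x_k (alpha_k > 0)."""
    w = (alpha * y) @ X                    # sum_i alpha_i y_i x_i
    support = np.where(alpha > tol)[0]     # indices with non-zero alpha_i
    k = support[0]                         # any support vector works
    b = y[k] - w @ X[k]
    return w, b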

  18. Warning! In the graphical examples, we always consider normalized hyperplanes (hyperplanes with normalized gradient). In this case b is exactly the distance of the hyperplane from the origin. So if we have a non-normalized equation w' · x + b = 0, with x = (x, y) and w' = (1, 1), then b is not the distance

  19. Warning! Let us consider a normalized gradient w = (1/√2, 1/√2): (1/√2, 1/√2) · (x, y) + b = 0 ⟹ x/√2 + y/√2 = −b ⟹ y = −x − b√2 Now we see that −b is exactly the distance: for x = 0, the intersection with the y-axis is at −b√2, and this distance projected on w is −b
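
A quick numeric check of this remark, using an arbitrary example offset b' = −3 (an assumption for illustration): the distance from the origin is |b'|/||w'||, which coincides with |b| of the normalized form.

```python
import numpy as np

# Non-normalized form: w' = (1, 1), hyperplane w'.x + b' = 0.
w_prime, b_prime = np.array([1.0, 1.0]), -3.0

# Normalize the gradient: w = w'/||w'||, b = b'/||w'||.
norm = np.linalg.norm(w_prime)
w, b = w_prime / norm, b_prime / norm

print(abs(b_prime) / norm)  # distance of the hyperplane from the origin
print(abs(b))               # same value: |b| of the normalized form
```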

  20. Soft Margin SVMs Slack variables ξ_i are added. Some errors are allowed, but they should penalize the objective function [Figure, Var 1 vs. Var 2: hyperplanes w · x + b = +1, w · x + b = −1 and w · x + b = 0, with slacks ξ_i for points violating the margin]

  21. Soft Margin SVMs The new constraints are y_i(w · x_i + b) ≥ 1 − ξ_i, where ξ_i ≥ 0 The objective function penalizes the incorrectly classified examples: min ½||w||² + C Σ_i ξ_i C is the trade-off between the margin and the error [Figure, Var 1 vs. Var 2: hyperplanes w · x + b = +1, w · x + b = −1 and w · x + b = 0]
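
A sketch of this soft-margin primal, again assuming cvxpy as the solver (C, the data and the function name are placeholders, not from the slides):

```python
import cvxpy as cp

def soft_margin_svm(X, y, C=1.0):
    """min (1/2)||w||^2 + C * sum_i xi_i
    s.t. y_i (w . x_i + b) >= 1 - xi_i, xi_i >= 0."""
    m, d = X.shape
    w, b, xi = cp.Variable(d), cp.Variable(), cp.Variable(m)
    objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
    constraints = [cp.multiply(y, X @ w + b) >= 1 - xi, xi >= 0]
    cp.Problem(objective, constraints).solve()
    return w.value, b.value, xi.value
```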

  22. Dual formulation min ½||w||² + C Σ_{i=1}^m ξ_i subject to: y_i(w · x_i + b) ≥ 1 − ξ_i, ∀ i = 1,..., m; ξ_i ≥ 0, i = 1,..., m L(w, b, ξ, α) = ½ w · w + C Σ_{i=1}^m ξ_i − Σ_{i=1}^m α_i [y_i(w · x_i + b) − 1 + ξ_i] By deriving with respect to w, ξ and b

  23. Partial Derivatives

  24. Substitution in the objective function (using the Kronecker delta δ_ij)

  25. Final dual optimization problem
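
As a reminder of the standard form of this dual (quoted from standard SVM theory, not transcribed from the slide): max_α Σ_i α_i − ½ Σ_{i,j} α_i α_j y_i y_j (x_i · x_j), subject to 0 ≤ α_i ≤ C and Σ_i α_i y_i = 0. A cvxpy sketch of that standard form, written with the equivalent expression ||Σ_i α_i y_i x_i||² for the quadratic term:

```python
import numpy as np
import cvxpy as cp

def svm_dual(X, y, C=1.0):
    """Standard soft-margin dual with a linear kernel:
    max  sum_i alpha_i - 1/2 ||sum_i alpha_i y_i x_i||^2
    s.t. 0 <= alpha_i <= C,  sum_i alpha_i y_i = 0."""
    m = X.shape[0]
    alpha = cp.Variable(m)
    quad = cp.sum_squares(X.T @ cp.multiply(alpha, y))   # alpha' Q alpha
    objective = cp.Maximize(cp.sum(alpha) - 0.5 * quad)
    constraints = [alpha >= 0, alpha <= C, y @ alpha == 0]
    cp.Problem(objective, constraints).solve()
    return alpha.value   # support vectors are the examples with alpha_i > 0
```

With C set very large, this tends to the hard-margin dual discussed earlier.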

  26. Soft Margin Support Vector Machines min ½||w||² + C Σ_i ξ_i subject to: y_i(w · x_i + b) ≥ 1 − ξ_i; ξ_i ≥ 0 The algorithm tries to keep ξ_i low and maximize the margin NB: the number of errors is not directly minimized (that would be an NP-complete problem); the distances from the hyperplane are minimized If C → ∞, the solution tends to the one of the hard-margin algorithm Attention: if C = 0 we get ||w|| = 0, since the constraints reduce to y_i b ≥ 1 − ξ_i, ∀ x_i If C increases, the number of errors decreases; when C tends to infinity the number of errors must be 0, i.e., the hard-margin formulation
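
The effect of C can be seen empirically; a sketch assuming scikit-learn (not part of the slides), counting margin violations (points with functional margin below 1) as C grows:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

# Two overlapping clusters, labels in {0, 1}.
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Functional margin y_i (w . x_i + b) with y mapped to {-1, +1}.
    margins = clf.decision_function(X) * np.where(y == 1, 1, -1)
    violations = np.sum(margins < 1)   # points inside or beyond the margin
    print(f"C={C:<6} ||w||={np.linalg.norm(clf.coef_):.3f} violations={violations}")
```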

  27. Robustness of Soft vs. Hard Margin SVMs [Two figures, Var 1 vs. Var 2, each with separating hyperplane w · x + b = 0: left, Hard Margin SVM; right, Soft Margin SVM absorbing an outlier with a slack ξ_i]

  28. Soft vs. Hard Margin SVMs Soft-Margin always has a solution Soft-Margin is more robust to odd examples Hard-Margin does not require parameters

  29. Parameters min ½||w||² + C Σ_i ξ_i = min ½||w||² + C⁺ Σ_{i ∈ pos} ξ_i + C⁻ Σ_{i ∈ neg} ξ_i = min ½||w||² + C (J Σ_{i ∈ pos} ξ_i + Σ_{i ∈ neg} ξ_i) C: trade-off parameter J: cost factor
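
In practice the cost factor J corresponds to per-class penalties; for example, scikit-learn's class_weight option multiplies C per class, which plays the role of J here (a sketch under that assumption, not from the slides):

```python
from sklearn.svm import SVC

# class_weight multiplies C per class, so giving the positive class weight J
# reproduces min 1/2||w||^2 + C (J * sum_{i in pos} xi_i + sum_{i in neg} xi_i).
C, J = 1.0, 5.0
clf = SVC(kernel="linear", C=C, class_weight={1: J, 0: 1.0})
# clf.fit(X, y) with y in {0, 1}: slacks on positives now cost J times more.
```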

  30. Theoretical Justification

  31. Definition of Training Set error Training Data: (x_1, y_1), ..., (x_m, y_m) ∈ R^N × {±1}, with f: R^N → {±1} Empirical Risk (error): R_emp[f] = (1/m) Σ_{i=1}^m ½ |f(x_i) − y_i| Risk (error): R[f] = ∫ ½ |f(x) − y| dP(x, y)
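
For ±1 labels the empirical risk above is simply the training error rate; a tiny numpy version (illustrative only):

```python
import numpy as np

def empirical_risk(f, X, y):
    """R_emp[f] = (1/m) * sum_i 1/2 |f(x_i) - y_i|, with y_i in {-1, +1}.
    For +-1 predictions this is the fraction of misclassified examples."""
    predictions = np.array([f(x) for x in X])
    return 0.5 * np.mean(np.abs(predictions - y))
```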

  32. Error Characterization (part 1) From PAC-learning Theory (Vapnik): R(α) ≤ R_emp(α) + ε(d, m, δ), with ε(d, m, δ) = √[(d(log(2m/d) + 1) − log(δ/4)) / m] where d is the VC-dimension, m is the number of examples, δ is a bound on the probability to get such error and α is a classifier parameter.
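
A small helper computing the confidence term of the bound as written above; the example values (a linear classifier in R², so d = 3, with m = 1000 and δ = 0.05) are arbitrary:

```python
import numpy as np

def vc_bound_term(d, m, delta):
    """epsilon(d, m, delta) = sqrt((d*(log(2m/d) + 1) - log(delta/4)) / m),
    the confidence term added to the empirical risk in the bound above."""
    return np.sqrt((d * (np.log(2 * m / d) + 1) - np.log(delta / 4)) / m)

print(vc_bound_term(d=3, m=1000, delta=0.05))
```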

  33. There are many versions for different bounds

  34. Error Characterization (part 2)

  35. Ranking, Regression and Multiclassification

  36. The Ranking SVM [Herbrich et al. 1999, 2000; Joachims et al. 2002] The aim is to classify instance pairs as correctly ranked or incorrectly ranked This turns an ordinal regression problem back into a binary classification problem We want a ranking function f such that x_i > x_j iff f(x_i) > f(x_j) … or at least one that tries to do this with minimal error Suppose that f is a linear function f(x_i) = w · x_i

  37. The Ranking SVM (Sec. 15.4.2) Ranking Model: f(x_i)

  38. The Ranking SVM (Sec. 15.4.2) Then (combining the two equations on the last slide): x_i > x_j iff w · x_i − w · x_j > 0 x_i > x_j iff w · (x_i − x_j) > 0 Let us then create a new instance space from such pairs: z_k = x_i − x_k; y_k = +1, −1 as x_i ≥, < x_k

  39. Support Vector Ranking min ½||w||² + C Σ_{k=1}^{m²} ξ_k subject to: y_k(w · (x_i − x_j) + b) ≥ 1 − ξ_k, ∀ i, j = 1,..., m, k = 1,..., m²; ξ_k ≥ 0, k = 1,..., m² y_k = +1 if rank(x_i) > rank(x_j), −1 otherwise, where k = i × m + j Given two examples (x_i, x_j), we build one example z_k
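
A sketch of this pairwise construction, assuming scikit-learn for the final linear SVM; pairwise_transform and the commented usage are hypothetical names, not from the slides:

```python
import numpy as np
from sklearn.svm import LinearSVC

def pairwise_transform(X, ranks):
    """Build z_k = x_i - x_j with label +1 if rank(x_i) > rank(x_j), -1 otherwise,
    for all ordered pairs with different ranks."""
    Z, y_pairs = [], []
    for i in range(len(X)):
        for j in range(len(X)):
            if ranks[i] != ranks[j]:
                Z.append(X[i] - X[j])
                y_pairs.append(1 if ranks[i] > ranks[j] else -1)
    return np.array(Z), np.array(y_pairs)

# Hypothetical usage: X has one row per document, ranks holds relevance scores.
# Z, y_pairs = pairwise_transform(X, ranks)
# ranker = LinearSVC(C=1.0).fit(Z, y_pairs)   # rank by the score ranker.coef_ @ x
```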

  40. Support Vector Regression (SVR) Solution: f(x) = w^T x + b Min ½ w^T w Constraints: y_i − w^T x_i − b ≤ ε; w^T x_i + b − y_i ≤ ε [Figure: f(x) plotted against x with the ε-tube around it]

  41. Support Vector Regression (SVR) Minimise: ½ w^T w + C Σ_{i=1}^N (ξ_i + ξ_i*) with f(x) = w^T x + b Constraints: y_i − w^T x_i − b ≤ ε + ξ_i; w^T x_i + b − y_i ≤ ε + ξ_i*; ξ_i, ξ_i* ≥ 0 [Figure: f(x) against x with the ε-tube and slacks ξ_i, ξ_i* for points outside the tube]
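
This ε-insensitive formulation is what scikit-learn's SVR implements; a short sketch on synthetic 1-D data (the data and parameter values are illustrative assumptions):

```python
import numpy as np
from sklearn.svm import SVR

# Noisy 1-D regression data.
rng = np.random.default_rng(0)
X = np.linspace(0, 5, 50).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0 + rng.normal(scale=0.3, size=50)

# epsilon is the width of the tolerance tube; C trades flatness against slack.
model = SVR(kernel="linear", C=1.0, epsilon=0.2).fit(X, y)
print(model.coef_, model.intercept_)   # recovered w and b
```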

  42. Support Vector Regression y_i is no longer −1 or +1: it is now a real value ε is the tolerance of our function value

  43. From Binary to Multiclass classifiers Three different approaches. ONE-vs-ALL (OVA): given the example sets {E1, E2, E3, …} for the categories {C1, C2, C3, …}, the binary classifiers {b1, b2, b3, …} are built. For b1, E1 is the set of positives and E2 ∪ E3 ∪ … is the set of negatives, and so on For testing: given a classification instance x, the category is the one associated with the maximum margin among all binary classifiers
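
A sketch of the ONE-vs-ALL scheme described above, assuming scikit-learn; train_ova and predict_ova are hypothetical helper names (scikit-learn also provides this scheme ready-made as OneVsRestClassifier):

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_ova(X, y, labels, C=1.0):
    """One binary classifier per category: category c vs. the rest.
    X is a feature matrix, y a numpy array of category labels."""
    return {c: LinearSVC(C=C).fit(X, (y == c).astype(int)) for c in labels}

def predict_ova(classifiers, x):
    """Pick the category whose classifier gives the maximum margin on x."""
    scores = {c: clf.decision_function(x.reshape(1, -1))[0]
              for c, clf in classifiers.items()}
    return max(scores, key=scores.get)
```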
