Support Vector Machines


  1. Support Vector Machines. Charlie Frogner, MIT, 2011. Slides mostly stolen from Ryan Rifkin (Google).

  2. Plan. Regularization derivation of SVMs. Analyzing the SVM problem: optimization, duality. Geometric derivation of SVMs. Practical issues.

  3. The Regularization Setting (Again). Given $n$ examples $(x_1, y_1), \ldots, (x_n, y_n)$, with $x_i \in \mathbb{R}^n$ and $y_i \in \{-1, 1\}$ for all $i$, we can find a classification function by solving a regularized learning problem:

$$\operatorname*{argmin}_{f \in \mathcal{H}} \; \frac{1}{n} \sum_{i=1}^{n} V(y_i, f(x_i)) + \lambda \|f\|_{\mathcal{H}}^2.$$

Note that in this class we are specifically considering binary classification.

  4. The Hinge Loss. The classical SVM arises by considering the specific loss function $V(y, f(x)) \equiv (1 - y f(x))_+$, where $(k)_+ \equiv \max(k, 0)$.
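
As a quick illustration (not from the slides), here is a minimal NumPy sketch of the hinge loss; the example values are made up:

```python
import numpy as np

def hinge_loss(y, fx):
    """V(y, f(x)) = (1 - y*f(x))_+ = max(1 - y*f(x), 0), elementwise."""
    return np.maximum(1.0 - y * fx, 0.0)

# Correct predictions with margin >= 1 incur zero loss;
# the loss grows linearly as y*f(x) falls below 1.
y  = np.array([1.0, -1.0,  1.0])
fx = np.array([2.0, -0.5, -1.0])
print(hinge_loss(y, fx))  # [0.  0.5 2. ]
```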

  5. The Hinge Loss. [Figure: plot of the hinge loss against $y \cdot f(x)$; the loss is zero for $y f(x) \ge 1$ and grows linearly as $y f(x)$ decreases below 1.]

  6. Substituting In The Hinge Loss. With the hinge loss, our regularization problem becomes

$$\operatorname*{argmin}_{f \in \mathcal{H}} \; \frac{1}{n} \sum_{i=1}^{n} (1 - y_i f(x_i))_+ + \lambda \|f\|_{\mathcal{H}}^2.$$

Note that we don't have a $\frac{1}{2}$ multiplier on the regularization term.

  7. Slack Variables. This problem is non-differentiable (because of the "kink" in $V$), so we rewrite the "max" function using slack variables $\xi_i$:

$$\operatorname*{argmin}_{f \in \mathcal{H}} \; \frac{1}{n} \sum_{i=1}^{n} \xi_i + \lambda \|f\|_{\mathcal{H}}^2$$
$$\text{subject to: } \quad \xi_i \ge 1 - y_i f(x_i), \quad \xi_i \ge 0, \quad i = 1, \ldots, n.$$

  8. Applying The Representer Theorem. Substituting in

$$f^*(x) = \sum_{i=1}^{n} c_i K(x, x_i),$$

we get a constrained quadratic programming problem:

$$\operatorname*{argmin}_{c \in \mathbb{R}^n, \, \xi \in \mathbb{R}^n} \; \frac{1}{n} \sum_{i=1}^{n} \xi_i + \lambda \, c^T K c$$
$$\text{subject to: } \quad \xi_i \ge 1 - y_i \sum_{j=1}^{n} c_j K(x_i, x_j), \quad \xi_i \ge 0, \quad i = 1, \ldots, n.$$
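
To make this concrete, here is a minimal sketch of building the kernel matrix $K$ and evaluating $f(x) = \sum_i c_i K(x, x_i)$. The Gaussian kernel and the toy data are assumptions; the slides leave $K$ generic:

```python
import numpy as np

def rbf_kernel(X1, X2, sigma=1.0):
    """Gaussian kernel K(x, x') = exp(-||x - x'||^2 / (2 sigma^2)).
    The kernel choice is an assumption; any positive-definite K works."""
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

X = np.random.randn(20, 2)   # toy data: n = 20 points in R^2
K = rbf_kernel(X, X)         # the n x n matrix appearing in c^T K c

def f(x, c):
    """Evaluate f(x) = sum_i c_i K(x, x_i) for a single point x."""
    return rbf_kernel(x[None, :], X)[0] @ c
```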

  9. Adding A Bias Term. Adding an unregularized bias term $b$ (which presents some theoretical difficulties) we get the "primal" SVM:

$$\operatorname*{argmin}_{c \in \mathbb{R}^n, \, b \in \mathbb{R}, \, \xi \in \mathbb{R}^n} \; \frac{1}{n} \sum_{i=1}^{n} \xi_i + \lambda \, c^T K c$$
$$\text{subject to: } \quad \xi_i \ge 1 - y_i \left( \sum_{j=1}^{n} c_j K(x_i, x_j) + b \right), \quad \xi_i \ge 0, \quad i = 1, \ldots, n.$$

  10. Standard Notation. In most of the SVM literature, instead of $\lambda$, a parameter $C$ is used to control regularization:

$$C = \frac{1}{2 \lambda n}.$$

Using this definition (after multiplying our objective function by the constant $\frac{1}{2\lambda}$), the regularization problem becomes

$$\operatorname*{argmin}_{f \in \mathcal{H}} \; C \sum_{i=1}^{n} V(y_i, f(x_i)) + \frac{1}{2} \|f\|_{\mathcal{H}}^2.$$

Like $\lambda$, the parameter $C$ also controls the tradeoff between classification accuracy and the norm of the function. The primal problem becomes...
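
Spelling out the rescaling the slide refers to (a one-line check, using only the definitions above):

```latex
% Multiply the original objective by the constant 1/(2*lambda):
\frac{1}{2\lambda}\left( \frac{1}{n}\sum_{i=1}^{n} V(y_i, f(x_i))
    + \lambda \|f\|_{\mathcal{H}}^2 \right)
  = \underbrace{\frac{1}{2\lambda n}}_{=\,C} \sum_{i=1}^{n} V(y_i, f(x_i))
    + \frac{1}{2}\|f\|_{\mathcal{H}}^2
```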

  11. The Reparametrized Problem.

$$\operatorname*{argmin}_{c \in \mathbb{R}^n, \, b \in \mathbb{R}, \, \xi \in \mathbb{R}^n} \; C \sum_{i=1}^{n} \xi_i + \frac{1}{2} c^T K c$$
$$\text{subject to: } \quad \xi_i \ge 1 - y_i \left( \sum_{j=1}^{n} c_j K(x_i, x_j) + b \right), \quad \xi_i \ge 0, \quad i = 1, \ldots, n.$$

  12. How to Solve?

$$\operatorname*{argmin}_{c \in \mathbb{R}^n, \, b \in \mathbb{R}, \, \xi \in \mathbb{R}^n} \; C \sum_{i=1}^{n} \xi_i + \frac{1}{2} c^T K c$$
$$\text{subject to: } \quad \xi_i \ge 1 - y_i \left( \sum_{j=1}^{n} c_j K(x_i, x_j) + b \right), \quad \xi_i \ge 0, \quad i = 1, \ldots, n.$$

This is a constrained optimization problem. The general approach: form the primal problem (we did this); form the Lagrangian from the primal (just like Lagrange multipliers); form the dual (one dual variable associated to each primal constraint in the Lagrangian).

  13. Lagrangian. We derive the dual from the primal using the Lagrangian:

$$L(c, \xi, b, \alpha, \zeta) = C \sum_{i=1}^{n} \xi_i + \frac{1}{2} c^T K c - \sum_{i=1}^{n} \alpha_i \left( y_i \left\{ \sum_{j=1}^{n} c_j K(x_i, x_j) + b \right\} - 1 + \xi_i \right) - \sum_{i=1}^{n} \zeta_i \xi_i.$$

  14. Dual I. The dual problem is:

$$\operatorname*{argmax}_{\alpha, \zeta \ge 0} \; \inf_{c, \xi, b} L(c, \xi, b, \alpha, \zeta).$$

First, minimize $L$ w.r.t. $(c, \xi, b)$:

$$\frac{\partial L}{\partial c} = 0 \implies c_i = \alpha_i y_i \quad (1)$$
$$\frac{\partial L}{\partial b} = 0 \implies \sum_{i=1}^{n} \alpha_i y_i = 0 \quad (2)$$
$$\frac{\partial L}{\partial \xi_i} = 0 \implies C - \alpha_i - \zeta_i = 0 \implies 0 \le \alpha_i \le C \quad (3)$$

  15. Dual II. Dual:

$$\operatorname*{argmax}_{\alpha, \zeta \ge 0} \; \inf_{c, \xi, b} L(c, \xi, b, \alpha, \zeta).$$

Optimality conditions: (1) $c_i = \alpha_i y_i$; (2) $\sum_{i=1}^{n} \alpha_i y_i = 0$; (3) $\alpha_i \in [0, C]$. Plug in (2) and (3):

$$\operatorname*{argmax}_{\alpha \ge 0} \; \inf_{c} L(c, \alpha) = \frac{1}{2} c^T K c + \sum_{i=1}^{n} \alpha_i \left( 1 - y_i \sum_{j=1}^{n} K(x_i, x_j) c_j \right).$$

  16. Dual II. Dual:

$$\operatorname*{argmax}_{\alpha, \zeta \ge 0} \; \inf_{c, \xi, b} L(c, \xi, b, \alpha, \zeta).$$

Optimality conditions: (1) $c_i = \alpha_i y_i$; (2) $\sum_{i=1}^{n} \alpha_i y_i = 0$; (3) $\alpha_i \in [0, C]$. Plug in (1):

$$\operatorname*{argmax}_{\alpha \ge 0} \; L(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i y_i K(x_i, x_j) \alpha_j y_j = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \alpha^T (\operatorname{diag} Y) K (\operatorname{diag} Y) \alpha.$$

  17. The Primal and Dual Problems Again. Primal:

$$\operatorname*{argmin}_{c \in \mathbb{R}^n, \, b \in \mathbb{R}, \, \xi \in \mathbb{R}^n} \; C \sum_{i=1}^{n} \xi_i + \frac{1}{2} c^T K c$$
$$\text{subject to: } \quad \xi_i \ge 1 - y_i \left( \sum_{j=1}^{n} c_j K(x_i, x_j) + b \right), \quad \xi_i \ge 0, \quad i = 1, \ldots, n.$$

Dual:

$$\max_{\alpha \in \mathbb{R}^n} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \alpha^T Q \alpha$$
$$\text{subject to: } \quad \sum_{i=1}^{n} y_i \alpha_i = 0, \quad 0 \le \alpha_i \le C, \quad i = 1, \ldots, n,$$

where $Q = (\operatorname{diag} Y) K (\operatorname{diag} Y)$, i.e. $Q_{ij} = y_i y_j K(x_i, x_j)$.
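
As a minimal sketch of solving this dual with an off-the-shelf QP routine (SciPy's SLSQP here; the solver choice is an assumption, and real SVM packages use the specialized decomposition methods mentioned on the next slide):

```python
import numpy as np
from scipy.optimize import minimize

def solve_svm_dual(K, y, C):
    """Maximize sum(a) - 0.5 a^T Q a  s.t.  y^T a = 0, 0 <= a_i <= C,
    where Q = diag(y) K diag(y). A generic-QP sketch, not production code."""
    n = len(y)
    Q = np.outer(y, y) * K
    obj  = lambda a: 0.5 * a @ Q @ a - a.sum()   # minimize the negated dual
    grad = lambda a: Q @ a - np.ones(n)
    res = minimize(obj, np.zeros(n), jac=grad, method="SLSQP",
                   bounds=[(0.0, C)] * n,
                   constraints=[{"type": "eq", "fun": lambda a: a @ y}])
    return res.x  # the optimal alphas; c_i = y_i * alpha_i (slide 14)
```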

  18. SVM Training. Basic idea: solve the dual problem to find the optimal $\alpha$'s, and use them to find $b$ and $c$. The dual problem is easier to solve than the primal problem: it has simple box constraints and a single equality constraint, and the problem can be decomposed into a sequence of smaller problems (see appendix).
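
In practice one rarely hand-rolls this solver. Assuming scikit-learn is available (an assumption; the slides do not mention it), its SVC class solves this same dual via a decomposition method:

```python
import numpy as np
from sklearn.svm import SVC

X = np.random.randn(100, 2)
y = np.sign(X[:, 0] + X[:, 1])   # toy labels in {-1, +1}

clf = SVC(C=1.0, kernel="rbf").fit(X, y)
print(clf.support_)      # indices i with alpha_i > 0: the support vectors
print(clf.dual_coef_)    # y_i * alpha_i for each support vector (our c_i)
print(clf.intercept_)    # the bias b
```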

  19. Interpreting the solution. $\alpha$ tells us: $c$ and $b$; the identities of the misclassified points. How to analyze? Use the optimality conditions. Already used: the derivative of $L$ w.r.t. $(c, \xi, b)$ is zero at optimality. Haven't used: complementary slackness, primal/dual constraints.

  20. Optimality Conditions: all of them. All optimal solutions must satisfy:

$$\sum_{j=1}^{n} c_j K(x_i, x_j) - \sum_{j=1}^{n} y_j \alpha_j K(x_i, x_j) = 0, \quad i = 1, \ldots, n$$
$$\sum_{i=1}^{n} \alpha_i y_i = 0$$
$$C - \alpha_i - \zeta_i = 0, \quad i = 1, \ldots, n$$
$$y_i \left( \sum_{j=1}^{n} y_j \alpha_j K(x_i, x_j) + b \right) - 1 + \xi_i \ge 0, \quad i = 1, \ldots, n$$
$$\alpha_i \left[ y_i \left( \sum_{j=1}^{n} y_j \alpha_j K(x_i, x_j) + b \right) - 1 + \xi_i \right] = 0, \quad i = 1, \ldots, n$$
$$\zeta_i \xi_i = 0, \quad i = 1, \ldots, n$$
$$\xi_i, \alpha_i, \zeta_i \ge 0, \quad i = 1, \ldots, n$$
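
These conditions can be checked numerically for a candidate solution. A minimal sketch (it substitutes $c_i = y_i \alpha_i$, $\zeta_i = C - \alpha_i$, and the smallest feasible $\xi_i$, so only the remaining conditions need checking; the max-violation format is a design choice, not from the slides):

```python
import numpy as np

def kkt_violation(K, y, alpha, b, C):
    """Largest violation of the optimality conditions above, after
    substituting c = y*alpha, zeta = C - alpha, xi = (1 - y*f)_+."""
    f  = K @ (y * alpha) + b              # f(x_i) at the training points
    xi = np.maximum(1.0 - y * f, 0.0)     # smallest feasible slacks
    return max(
        abs(alpha @ y),                            # sum_i alpha_i y_i = 0
        np.abs(alpha * (y * f - 1.0 + xi)).max(),  # complementary slackness
        np.abs((C - alpha) * xi).max(),            # zeta_i * xi_i = 0
    )
```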

  21. Optimality Conditions II. These optimality conditions are both necessary and sufficient for optimality: $(c, \xi, b, \alpha, \zeta)$ satisfy all of the conditions if and only if they are optimal for both the primal and the dual. (Also known as the Karush-Kuhn-Tucker (KKT) conditions.)

  22. Interpreting the solution — c.

$$\frac{\partial L}{\partial c} = 0 \implies c_i = \alpha_i y_i, \quad \forall i.$$

  23. Interpreting the solution — b. Suppose we have the optimal $\alpha_i$'s. Also suppose that there exists an $i$ satisfying $0 < \alpha_i < C$. Then:

$$\alpha_i < C \implies \zeta_i > 0 \implies \xi_i = 0$$
$$\implies y_i \left( \sum_{j=1}^{n} y_j \alpha_j K(x_i, x_j) + b \right) - 1 = 0$$
$$\implies b = y_i - \sum_{j=1}^{n} y_j \alpha_j K(x_i, x_j).$$
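
In code, this gives a direct way to recover $b$ from the optimal $\alpha$'s. A small sketch (averaging over all free support vectors is a common numerical-stability choice, not something the slide prescribes):

```python
import numpy as np

def compute_b(K, y, alpha, C, tol=1e-8):
    """b = y_i - sum_j y_j alpha_j K(x_i, x_j) for any i with 0 < alpha_i < C.
    Assumes at least one such 'free' support vector exists, as the slide does."""
    free = (alpha > tol) & (alpha < C - tol)
    return np.mean(y[free] - K[free] @ (y * alpha))
```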

  24. Interpreting the solution — sparsity. (Remember we defined $f(x) = \sum_{i=1}^{n} y_i \alpha_i K(x, x_i) + b$.)

$$y_i f(x_i) > 1 \implies (1 - y_i f(x_i)) < 0 \implies \xi_i \ne (1 - y_i f(x_i)) \implies \alpha_i = 0.$$

(Since $\xi_i \ge 0 > 1 - y_i f(x_i)$, we have $y_i f(x_i) - 1 + \xi_i > 0$, so complementary slackness forces $\alpha_i = 0$.)

  25. Interpreting the solution — support vectors.

$$y_i f(x_i) < 1 \implies (1 - y_i f(x_i)) > 0 \implies \xi_i > 0 \implies \zeta_i = 0 \implies \alpha_i = C.$$

  26. Interpreting the solution — support vectors. So $y_i f(x_i) < 1 \implies \alpha_i = C$. Conversely, suppose $\alpha_i = C$:

$$\alpha_i = C \implies \xi_i = 1 - y_i f(x_i) \implies y_i f(x_i) \le 1.$$

  27. Interpreting the solution. Here are all of the derived conditions:

$$\alpha_i = 0 \implies y_i f(x_i) \ge 1$$
$$0 < \alpha_i < C \implies y_i f(x_i) = 1$$
$$\alpha_i = C \implies y_i f(x_i) \le 1$$
$$\alpha_i = 0 \impliedby y_i f(x_i) > 1$$
$$\alpha_i = C \impliedby y_i f(x_i) < 1$$
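
These conditions sort the training points into three groups using the $\alpha_i$ values alone. A minimal sketch (the tolerance is a numerical assumption, not part of the math):

```python
import numpy as np

def categorize_points(alpha, C, tol=1e-8):
    """Group points by the derived conditions above."""
    labels = np.empty(len(alpha), dtype=object)
    labels[alpha <= tol] = "alpha=0: y_i f(x_i) >= 1 (not a support vector)"
    labels[alpha >= C - tol] = "alpha=C: y_i f(x_i) <= 1 (on or inside margin)"
    free = (alpha > tol) & (alpha < C - tol)
    labels[free] = "0<alpha<C: y_i f(x_i) = 1 (exactly on the margin)"
    return labels
```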

  28. Geometric Interpretation of Reduced Optimality Conditions. [Figure.]

  29. Summary so far. The SVM is a Tikhonov regularization problem, using the hinge loss:

$$\operatorname*{argmin}_{f \in \mathcal{H}} \; \frac{1}{n} \sum_{i=1}^{n} (1 - y_i f(x_i))_+ + \lambda \|f\|_{\mathcal{H}}^2.$$

Solving the SVM means solving a constrained quadratic program. Solutions can be sparse: some coefficients are zero. The nonzero coefficients correspond to points that aren't classified correctly enough; this is where the "support vector" in SVM comes from.
