
CS 6316 Machine Learning: Support Vector Machines and Kernel Methods - PowerPoint PPT Presentation



  1. CS 6316 Machine Learning: Support Vector Machines and Kernel Methods. Yangfeng Ji, Department of Computer Science, University of Virginia

  2. About Online Lectures

  3. Course Information Update
     ◮ Record the lectures and upload the videos on Collab
     ◮ By default, turn off the video and mute yourself
     ◮ If you have a question:
       ◮ Unmute yourself and chime in anytime
       ◮ Use the raise hand feature
       ◮ Send me a private message

  4. Course Information Update
     ◮ Record the lectures and upload the videos on Collab
     ◮ By default, turn off the video and mute yourself
     ◮ If you have a question:
       ◮ Unmute yourself and chime in anytime
       ◮ Use the raise hand feature
       ◮ Send me a private message
     ◮ Slack: a stable communication channel to
       ◮ send out instant messages if my network connection is unreliable
       ◮ hold online discussion

  5. Course Information Update
     ◮ Homework
       ◮ Subject to change

  6. Course Information Update
     ◮ Homework
       ◮ Subject to change
     ◮ Final project
       ◮ Send out my feedback later this week
       ◮ Continue your collaboration with your teammates
       ◮ Presentation: record a presentation video and share it with me

  7. Course Information Update
     ◮ Homework
       ◮ Subject to change
     ◮ Final project
       ◮ Send out my feedback later this week
       ◮ Continue your collaboration with your teammates
       ◮ Presentation: record a presentation video and share it with me
     ◮ Office hour
       ◮ Wednesday 11 AM: I will be on Zoom
       ◮ You can also send me an email or Slack message anytime

  8. Separable Cases

  9. Geometric Margin
     The geometric margin of a linear binary classifier h(x) = ⟨w, x⟩ + b at a point x is its distance to the hyper-plane ⟨w, x⟩ + b = 0:

         ρ_h(x) = |⟨w, x⟩ + b| / ‖w‖₂    (1)

  10. Geometric Margin (II)
      The geometric margin of h(x) for a set of examples T = {x_1, . . . , x_m} is the minimal distance over these examples:

          ρ_h(T) = min_{x′∈T} ρ_h(x′)    (2)

      [Mohri et al., 2018, Page 80]
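
As a quick numerical illustration of definitions (1) and (2), here is a minimal sketch (not from the slides); the values of w, b and the set T below are made up:

```python
import numpy as np

def geometric_margin(w, b, x):
    """rho_h(x) = |<w, x> + b| / ||w||_2, the distance of x to the hyper-plane."""
    return abs(np.dot(w, x) + b) / np.linalg.norm(w)

def geometric_margin_set(w, b, T):
    """rho_h(T) = min over x' in T of rho_h(x')."""
    return min(geometric_margin(w, b, x) for x in T)

w, b = np.array([3.0, 4.0]), 1.0
T = [np.array([1.0, 1.0]), np.array([0.0, -2.0])]
print(geometric_margin_set(w, b, T))  # min(1.6, 1.4) = 1.4
```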

  11. Half-Space Hypothesis Space
      ◮ Training set S = {(x_1, y_1), . . . , (x_m, y_m)} with x_i ∈ R^d and y_i ∈ {+1, −1}
      ◮ If the training set is linearly separable,

              y_i(⟨w, x_i⟩ + b) > 0, ∀i ∈ [m]    (3)

      ◮ Linearly separable cases:
        ◮ There exists some (w, b) satisfying equation 3
        ◮ All halfspace predictors that satisfy the condition in equation 3 are ERM hypotheses
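
Condition (3) is easy to check numerically for a candidate (w, b); a minimal sketch (the names X, y, w, b are illustrative, not from the slides):

```python
import numpy as np

def separates(X, y, w, b):
    """Check condition (3): y_i * (<w, x_i> + b) > 0 for all i in [m].

    X: (m, d) array of examples, y: (m,) array of labels in {+1, -1},
    w: (d,) weight vector, b: scalar bias.
    """
    return bool(np.all(y * (X @ w + b) > 0))

# Example: two points on opposite sides of the hyper-plane x1 + x2 = 0
X = np.array([[1.0, 1.0], [-1.0, -1.0]])
y = np.array([1.0, -1.0])
print(separates(X, y, w=np.array([1.0, 1.0]), b=0.0))  # True
```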

  12. Which Hypothesis is Better? [Shalev-Shwartz and Ben-David, 2014, Page 203]

  13. Which Hypothesis is Better?
      ◮ Intuitively, a hypothesis with larger margin is better, because it is more robust to noise
      ◮ Final definition of margin will be provided later
      [Shalev-Shwartz and Ben-David, 2014, Page 203]

  14. Hard SVM/Separable Cases
      The mathematical formulation of the previous idea:

              ρ = max_{(w,b)} min_{i∈[m]} |⟨w, x_i⟩ + b| / ‖w‖₂    (4)
              s.t. y_i(⟨w, x_i⟩ + b) > 0, ∀i    (5)

      ◮ y_i(⟨w, x_i⟩ + b) > 0, ∀i: guarantees that (w, b) is an ERM hypothesis

  15. Hard SVM/Separable Cases
      The mathematical formulation of the previous idea:

              ρ = max_{(w,b)} min_{i∈[m]} |⟨w, x_i⟩ + b| / ‖w‖₂    (4)
              s.t. y_i(⟨w, x_i⟩ + b) > 0, ∀i    (5)

      ◮ y_i(⟨w, x_i⟩ + b) > 0, ∀i: guarantees that (w, b) is an ERM hypothesis
      ◮ min_{i∈[m]}: calculates the margin between a hyper-plane and a set of examples

  16. Hard SVM/Separable Cases
      The mathematical formulation of the previous idea:

              ρ = max_{(w,b)} min_{i∈[m]} |⟨w, x_i⟩ + b| / ‖w‖₂    (4)
              s.t. y_i(⟨w, x_i⟩ + b) > 0, ∀i    (5)

      ◮ y_i(⟨w, x_i⟩ + b) > 0, ∀i: guarantees that (w, b) is an ERM hypothesis
      ◮ min_{i∈[m]}: calculates the margin between a hyper-plane and a set of examples
      ◮ max_{(w,b)}: maximizes the margin

  17. Illustration
      Original form:

              ρ = max_{(w,b)} min_{i∈[m]} |⟨w, x_i⟩ + b| / ‖w‖₂    (6)
              s.t. y_i(⟨w, x_i⟩ + b) > 0, ∀i    (7)

  18. Alternative Forms
      ◮ Original form

              ρ = max_{(w,b)} min_{i∈[m]} |⟨w, x_i⟩ + b| / ‖w‖₂    (8)
              s.t. y_i(⟨w, x_i⟩ + b) > 0, ∀i    (9)

  19. Alternative Forms
      ◮ Original form

              ρ = max_{(w,b)} min_{i∈[m]} |⟨w, x_i⟩ + b| / ‖w‖₂    (8)
              s.t. y_i(⟨w, x_i⟩ + b) > 0, ∀i    (9)

      ◮ Alternative form 1

              ρ = max_{(w,b)} min_{i∈[m]} y_i(⟨w, x_i⟩ + b) / ‖w‖₂    (10)

  20. Alternative Forms
      ◮ Original form

              ρ = max_{(w,b)} min_{i∈[m]} |⟨w, x_i⟩ + b| / ‖w‖₂    (8)
              s.t. y_i(⟨w, x_i⟩ + b) > 0, ∀i    (9)

      ◮ Alternative form 1

              ρ = max_{(w,b)} min_{i∈[m]} y_i(⟨w, x_i⟩ + b) / ‖w‖₂    (10)

      ◮ Alternative form 2

              ρ = max_{(w,b): min_{i∈[m]} y_i(⟨w, x_i⟩ + b) = 1} 1/‖w‖₂    (11)
                = max_{(w,b): y_i(⟨w, x_i⟩ + b) ≥ 1} 1/‖w‖₂    (12)
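
The step from form 1 to form 2 relies on the objective in (10) being invariant to rescaling (w, b); the following short sketch of that argument is not on the slides but fills in the reasoning:

```latex
% For any c > 0, replacing (w, b) with (cw, cb) leaves
% min_i y_i(<w, x_i> + b) / ||w||_2 unchanged, so the scale can be fixed
% such that the smallest functional margin equals 1. The objective then
% reduces to 1 / ||w||_2, which is exactly forms (11) and (12).
\max_{(w,b)} \min_{i\in[m]} \frac{y_i(\langle w, x_i\rangle + b)}{\|w\|_2}
  \;=\; \max_{(w,b):\, \min_{i\in[m]} y_i(\langle w, x_i\rangle + b) = 1} \frac{1}{\|w\|_2}
```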

  21. Alternative Forms (II)
      ◮ Alternative form 2

              ρ = max_{(w,b): y_i(⟨w, x_i⟩ + b) ≥ 1} 1/‖w‖₂    (13)

      ◮ Alternative form 3: quadratic programming (QP)

              min_{(w,b)} ½‖w‖₂²    (14)
              s.t. y_i(⟨w, x_i⟩ + b) ≥ 1, ∀i ∈ [m]

        which is a constrained optimization problem that can be solved by standard QP packages

  22. Alternative Forms (II)
      ◮ Alternative form 2

              ρ = max_{(w,b): y_i(⟨w, x_i⟩ + b) ≥ 1} 1/‖w‖₂    (13)

      ◮ Alternative form 3: quadratic programming (QP)

              min_{(w,b)} ½‖w‖₂²    (14)
              s.t. y_i(⟨w, x_i⟩ + b) ≥ 1, ∀i ∈ [m]

        which is a constrained optimization problem that can be solved by standard QP packages
      ◮ Exercise: solve an SVM problem with quadratic programming (a sketch follows below)
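
As a hedged sketch of that exercise (not from the slides), the primal problem (14) can be handed to the cvxopt QP solver by stacking the variables as z = [w; b]; the data X, y below are made up for illustration:

```python
import numpy as np
from cvxopt import matrix, solvers

def hard_margin_svm(X, y):
    """Solve min 1/2 ||w||_2^2 s.t. y_i(<w, x_i> + b) >= 1 as a QP over z = [w; b]."""
    m, d = X.shape
    # Objective 1/2 z^T P z with P = diag(1, ..., 1, 0); a tiny ridge on the
    # b-entry keeps P well conditioned for the solver.
    P = np.eye(d + 1)
    P[d, d] = 1e-8
    q = np.zeros(d + 1)
    # Rewrite y_i(<w, x_i> + b) >= 1 as G z <= h with G = -[y_i x_i, y_i], h = -1.
    G = -np.hstack([y[:, None] * X, y[:, None]])
    h = -np.ones(m)
    solvers.options['show_progress'] = False
    sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))
    z = np.array(sol['x']).ravel()
    return z[:d], z[d]  # w, b

# Toy separable data (illustrative only)
X = np.array([[2.0, 2.0], [2.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = hard_margin_svm(X, y)
print(w, b)
```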

  23. Unconstrained Optimization Problem
      The quadratic programming problem with constraints can be converted to an unconstrained optimization problem with the Lagrangian method:

              L(w, b, α) = ½‖w‖₂² − Σ_{i=1}^m α_i (y_i(⟨w, x_i⟩ + b) − 1)    (15)

      where
      ◮ α = {α_1, . . . , α_m} are the Lagrange multipliers, and
      ◮ α_i ≥ 0 is associated with the i-th training example

  24. Constrained Optimization Problems

  25. Constrained Optimization Problems: Definition
      Given
      ◮ X ⊆ R^d and
      ◮ f, g_i : X → R, ∀i ∈ [m]
      a constrained optimization problem is defined in the form of

              min_{x∈X} f(x)    (16)
              s.t. g_i(x) ≤ 0, ∀i ∈ [m]    (17)

  26. Constrained Optimization Problems: Definition
      Given
      ◮ X ⊆ R^d and
      ◮ f, g_i : X → R, ∀i ∈ [m]
      a constrained optimization problem is defined in the form of

              min_{x∈X} f(x)    (16)
              s.t. g_i(x) ≤ 0, ∀i ∈ [m]    (17)

      Comments
      ◮ In the general definition, x is the target variable for optimization
      ◮ Special cases of g_i(x): (1) g_i(x) = 0, (2) g_i(x) ≥ 0, and (3) g_i(x) ≤ b
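
As an illustration of the general form (16)-(17), here is a minimal sketch (not from the slides) using scipy; the objective and constraints are made-up examples, and since scipy expects inequality constraints written as c(x) ≥ 0, each g_i(x) ≤ 0 is passed as -g_i(x) ≥ 0:

```python
import numpy as np
from scipy.optimize import minimize

# Made-up instance of (16)-(17): minimize f(x) = ||x - (1, 2)||^2
# subject to g_1(x) = x_1 + x_2 - 2 <= 0 and g_2(x) = -x_1 <= 0.
f = lambda x: (x[0] - 1.0) ** 2 + (x[1] - 2.0) ** 2
g = [lambda x: x[0] + x[1] - 2.0,   # g_1(x) <= 0
     lambda x: -x[0]]               # g_2(x) <= 0

# scipy's 'ineq' constraints mean c(x) >= 0, so pass -g_i.
constraints = [{'type': 'ineq', 'fun': (lambda x, gi=gi: -gi(x))} for gi in g]
result = minimize(f, x0=np.zeros(2), method='SLSQP', constraints=constraints)
print(result.x)  # roughly (0.5, 1.5): the projection of (1, 2) onto the feasible set
```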

  27. Lagrangian
      The Lagrangian associated to the general constrained optimization problem defined in equations 16 and 17 is the function defined over X × R^m_+ as

              L(x, α) = f(x) + Σ_{i=1}^m α_i g_i(x)    (18)

      where
      ◮ α = (α_1, . . . , α_m) ∈ R^m_+
      ◮ α_i ≥ 0 for any i ∈ [m]

  28. Karush-Kuhn-Tucker's Theorem
      Assume that f, g_i : X → R, ∀i ∈ [m] are convex and differentiable and that the constraints are qualified. Then x′ is a solution of the constrained problem if and only if there exists α′ ≥ 0 such that

              ∇_x L(x′, α′) = ∇_x f(x′) + α′ · ∇_x g(x′) = 0    (19)
              ∇_α L(x′, α′) = g(x′) ≤ 0    (20)
              α′ · g(x′) = Σ_{i=1}^m α′_i g_i(x′) = 0    (21)

      Equations 19-21 are called the KKT conditions [Mohri et al., 2018, Thm B.30]
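
A tiny worked example (not from the slides) makes the KKT conditions concrete: minimize f(x) = x² subject to the single constraint g(x) = 1 − x ≤ 0.

```latex
L(x, \alpha) = x^2 + \alpha (1 - x), \qquad \alpha \ge 0
% Stationarity:            \nabla_x L = 2x - \alpha = 0
% Primal feasibility:      1 - x \le 0
% Complementary slackness: \alpha (1 - x) = 0
% \alpha = 0 would force x = 0, which violates feasibility; hence \alpha > 0,
% so 1 - x = 0, giving x' = 1 and \alpha' = 2: the minimizer of x^2 over x >= 1.
```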

  29. KKT in SVM
      Apply the KKT conditions to the SVM problem

              L(w, b, α) = ½‖w‖₂² − Σ_{i=1}^m α_i (y_i(⟨w, x_i⟩ + b) − 1)    (22)

      We have

              ∇_w L = w − Σ_{i=1}^m α_i y_i x_i = 0  ⇒  w = Σ_{i=1}^m α_i y_i x_i

  30. KKT in SVM
      Apply the KKT conditions to the SVM problem

              L(w, b, α) = ½‖w‖₂² − Σ_{i=1}^m α_i (y_i(⟨w, x_i⟩ + b) − 1)    (22)

      We have

              ∇_w L = w − Σ_{i=1}^m α_i y_i x_i = 0  ⇒  w = Σ_{i=1}^m α_i y_i x_i
              ∇_b L = −Σ_{i=1}^m α_i y_i = 0  ⇒  Σ_{i=1}^m α_i y_i = 0

  31. KKT in SVM
      Apply the KKT conditions to the SVM problem

              L(w, b, α) = ½‖w‖₂² − Σ_{i=1}^m α_i (y_i(⟨w, x_i⟩ + b) − 1)    (22)

      We have

              ∇_w L = w − Σ_{i=1}^m α_i y_i x_i = 0  ⇒  w = Σ_{i=1}^m α_i y_i x_i
              ∇_b L = −Σ_{i=1}^m α_i y_i = 0  ⇒  Σ_{i=1}^m α_i y_i = 0
              ∀i, α_i (y_i(⟨w, x_i⟩ + b) − 1) = 0  ⇒  α_i = 0 or y_i(⟨w, x_i⟩ + b) = 1
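
One step the slides leave implicit is how to recover b once the multipliers are known; a short sketch, using any index i with α_i > 0:

```latex
% For any i with \alpha_i > 0, complementary slackness gives
%   y_i(\langle w, x_i \rangle + b) = 1 .
% Since y_i \in \{+1, -1\} implies 1/y_i = y_i, it follows that
b = y_i - \langle w, x_i \rangle,
  \qquad \text{where } w = \sum_{j=1}^{m} \alpha_j y_j x_j .
```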

  32. Support Vectors
      Consider the implication of the last equation on the previous slide: ∀i,
      ◮ α_i > 0 and y_i(⟨w, x_i⟩ + b) = 1, or

  33. Support Vectors
      Consider the implication of the last equation on the previous slide: ∀i,
      ◮ α_i > 0 and y_i(⟨w, x_i⟩ + b) = 1, or
      ◮ α_i = 0 and y_i(⟨w, x_i⟩ + b) ≥ 1

  34. Support Vectors
      Consider the implication of the last equation on the previous slide: ∀i,
      ◮ α_i > 0 and y_i(⟨w, x_i⟩ + b) = 1, or
      ◮ α_i = 0 and y_i(⟨w, x_i⟩ + b) ≥ 1

              w = Σ_{i=1}^m α_i y_i x_i    (23)

      ◮ Examples with α_i > 0 are called support vectors
      ◮ In R^d, d + 1 examples are sufficient to define a hyper-plane
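
To see support vectors concretely, here is a minimal sketch (not from the slides) using scikit-learn's SVC with a linear kernel and a large C to approximate the hard-margin setting; dual_coef_ stores the products α_i y_i for the support vectors, so w can be rebuilt as in (23):

```python
import numpy as np
from sklearn.svm import SVC

# Toy separable data (illustrative only)
X = np.array([[2.0, 2.0], [2.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

# A linear kernel with a very large C approximates the hard-margin SVM.
clf = SVC(kernel='linear', C=1e6).fit(X, y)

print(clf.support_vectors_)               # the examples with alpha_i > 0
alpha_times_y = clf.dual_coef_[0]         # alpha_i * y_i for each support vector
w = alpha_times_y @ clf.support_vectors_  # w = sum_i alpha_i y_i x_i, eq. (23)
print(w, clf.coef_[0])                    # the two should (approximately) agree
print(clf.intercept_[0])                  # the learned bias b
```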

  35. Non-separable Cases
