CS 6316 Machine Learning: Support Vector Machines and Kernel Methods. Yangfeng Ji, Department of Computer Science, University of Virginia
About Online Lectures
Course Information Update
◮ Record the lectures and upload the videos to Collab
◮ By default, turn off your video and mute yourself
◮ If you have a question
  ◮ Unmute yourself and chime in anytime
  ◮ Use the raise-hand feature
  ◮ Send me a private message
◮ Slack: a stable communication channel to
  ◮ send out instant messages if my network connection is unreliable
  ◮ hold online discussions
Course Information Update
◮ Homework
  ◮ Subject to change
◮ Final project
  ◮ I will send out my feedback later this week
  ◮ Continue the collaboration with your teammates
  ◮ Presentation: record a presentation video and share it with me
◮ Office hours
  ◮ Wednesday 11 AM: I will be on Zoom
  ◮ You can also send me an email or a Slack message anytime
Separable Cases
Geometric Margin
The geometric margin of a linear binary classifier h(x) = ⟨w, x⟩ + b at a point x is its distance to the hyperplane ⟨w, x⟩ + b = 0:

  ρ_h(x) = |⟨w, x⟩ + b| / ‖w‖_2    (1)
Geometric Margin (II)
The geometric margin of h(x) for a set of examples T = {x_1, . . . , x_m} is the minimal distance over these examples:

  ρ_h(T) = min_{x′ ∈ T} ρ_h(x′)    (2)

[Mohri et al., 2018, Page 80]
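As a quick sanity check on equations 1 and 2, here is a minimal NumPy sketch (not from the slides) that computes the point margin and the set margin; the hyperplane (w, b) and the points are made-up toy values.

```python
import numpy as np

def point_margin(w, b, x):
    """Geometric margin of h(x) = <w, x> + b at a single point x (equation 1)."""
    return abs(np.dot(w, x) + b) / np.linalg.norm(w)

def set_margin(w, b, X):
    """Geometric margin over a set of points: the minimum point margin (equation 2)."""
    return min(point_margin(w, b, x) for x in X)

# Toy hyperplane and points, chosen only for illustration
w, b = np.array([1.0, 1.0]), -1.0
X = np.array([[2.0, 2.0], [0.0, 3.0], [1.5, 0.0]])
print(set_margin(w, b, X))
```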
Half-Space Hypothesis Space
◮ Training set S = {(x_1, y_1), . . . , (x_m, y_m)} with x_i ∈ ℝ^d and y_i ∈ {+1, −1}
◮ If the training set is linearly separable, there exists (w, b) such that

  y_i(⟨w, x_i⟩ + b) > 0  ∀ i ∈ [m]    (3)

◮ Linearly separable cases
  ◮ Existence of a pair (w, b) satisfying equation 3
  ◮ All halfspace predictors that satisfy the condition in equation 3 are ERM hypotheses
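The separability condition in equation 3 is easy to test numerically. Below is a small illustrative check, assuming toy data and a candidate (w, b) of my own choosing; it simply verifies y_i(⟨w, x_i⟩ + b) > 0 for all i.

```python
import numpy as np

def separates(w, b, X, y):
    """Check condition (3): y_i(<w, x_i> + b) > 0 for every training example."""
    return bool(np.all(y * (X @ w + b) > 0))

# Assumed toy data and an assumed candidate hyperplane (illustration only)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1, 1, -1, -1])
print(separates(np.array([1.0, 1.0]), 0.0, X, y))  # True: this (w, b) satisfies equation 3
```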
Which Hypothesis is Better?
◮ Intuitively, a hypothesis with a larger margin is better, because it is more robust to noise
◮ The final definition of margin will be provided later
[Shalev-Shwartz and Ben-David, 2014, Page 203]
Hard SVM/Separable Cases
The mathematical formulation of the previous idea:

  ρ = max_{(w, b)} min_{i ∈ [m]} |⟨w, x_i⟩ + b| / ‖w‖_2    (4)
  s.t.  y_i(⟨w, x_i⟩ + b) > 0  ∀ i    (5)

◮ y_i(⟨w, x_i⟩ + b) > 0 ∀ i: guarantees that (w, b) is an ERM hypothesis
◮ min_{i ∈ [m]}: computes the margin between a hyperplane and a set of examples
◮ max_{(w, b)}: maximizes the margin
Illustration
Original form:

  ρ = max_{(w, b)} min_{i ∈ [m]} |⟨w, x_i⟩ + b| / ‖w‖_2    (6)
  s.t.  y_i(⟨w, x_i⟩ + b) > 0  ∀ i    (7)
Alternative Forms
◮ Original form

  ρ = max_{(w, b)} min_{i ∈ [m]} |⟨w, x_i⟩ + b| / ‖w‖_2    (8)
  s.t.  y_i(⟨w, x_i⟩ + b) > 0  ∀ i    (9)

◮ Alternative form 1

  ρ = max_{(w, b)} min_{i ∈ [m]} y_i(⟨w, x_i⟩ + b) / ‖w‖_2    (10)

◮ Alternative form 2

  ρ = max_{(w, b): min_{i ∈ [m]} y_i(⟨w, x_i⟩ + b) = 1} 1 / ‖w‖_2    (11)
    = max_{(w, b): y_i(⟨w, x_i⟩ + b) ≥ 1} 1 / ‖w‖_2    (12)
Alternative Forms (II)
◮ Alternative form 2

  ρ = max_{(w, b): y_i(⟨w, x_i⟩ + b) ≥ 1} 1 / ‖w‖_2    (13)

◮ Alternative form 3: quadratic programming (QP)

  min_{(w, b)} (1/2)‖w‖_2^2    (14)
  s.t.  y_i(⟨w, x_i⟩ + b) ≥ 1,  ∀ i ∈ [m]

  which is a constrained optimization problem that can be solved by standard QP packages
◮ Exercise: solve an SVM problem with quadratic programming (see the sketch below)
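One way to approach the exercise is sketched below: cast problem 14 as a QP over the stacked variable z = [w; b] and pass it to cvxopt, one standard QP package. The toy data and the variable stacking are my own illustrative choices, not part of the slides.

```python
import numpy as np
from cvxopt import matrix, solvers  # assumes cvxopt is installed

def hard_margin_svm(X, y):
    """Solve min (1/2)||w||^2  s.t.  y_i(<w, x_i> + b) >= 1 as a QP over z = [w; b]."""
    m, d = X.shape
    # Quadratic term: only w is penalized, so P is positive semidefinite (b is free)
    P = np.zeros((d + 1, d + 1))
    P[:d, :d] = np.eye(d)
    q = np.zeros(d + 1)
    # Rewrite y_i(<w, x_i> + b) >= 1 as G z <= h with G_i = -y_i [x_i, 1], h_i = -1
    G = -np.hstack([y[:, None] * X, y[:, None]])
    h = -np.ones(m)
    solvers.options['show_progress'] = False
    sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))
    z = np.array(sol['x']).ravel()
    return z[:d], z[d]  # w, b

# Toy linearly separable data (illustration only)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = hard_margin_svm(X, y)
print(w, b)
```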
Unconstrained Optimization Problem
The quadratic programming problem with constraints can be converted to an unconstrained optimization problem with the Lagrangian method:

  L(w, b, α) = (1/2)‖w‖_2^2 − Σ_{i=1}^m α_i (y_i(⟨w, x_i⟩ + b) − 1)    (15)

where
◮ α = {α_1, . . . , α_m} are the Lagrange multipliers, and
◮ α_i ≥ 0 is associated with the i-th training example
Constrained Optimization Problems
Constrained Optimization Problems: Definition
Let
◮ X ⊆ ℝ^d and
◮ f, g_i : X → ℝ, ∀ i ∈ [m]
Then, a constrained optimization problem is defined in the form

  min_{x ∈ X} f(x)    (16)
  s.t.  g_i(x) ≤ 0,  ∀ i ∈ [m]    (17)

Comments
◮ In the general definition, x is the target variable for optimization
◮ Special cases of g_i(x): (1) g_i(x) = 0, (2) g_i(x) ≥ 0, and (3) g_i(x) ≤ b
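For concreteness, here is a hedged sketch of solving one instance of the general form in equations 16-17 with scipy.optimize.minimize (SLSQP). The objective, the single constraint, and the sign flip needed to match SLSQP's "fun(x) >= 0" convention are illustrative assumptions, not part of the lecture.

```python
import numpy as np
from scipy.optimize import minimize

# Example instance: f(x) = x_1^2 + x_2^2, one constraint g(x) = 1 - x_1 - x_2 <= 0
f = lambda x: np.sum(x ** 2)
g = lambda x: 1.0 - x[0] - x[1]

# SLSQP expects inequality constraints as fun(x) >= 0, so pass -g
cons = [{'type': 'ineq', 'fun': lambda x: -g(x)}]
res = minimize(f, x0=np.zeros(2), method='SLSQP', constraints=cons)
print(res.x)  # approximately [0.5, 0.5]
```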
Lagrangian
The Lagrangian associated with the general constrained optimization problem defined in equations 16-17 is the function defined over X × ℝ^m_+ as

  L(x, α) = f(x) + Σ_{i=1}^m α_i g_i(x)    (18)

where
◮ α = (α_1, . . . , α_m) ∈ ℝ^m_+
◮ α_i ≥ 0 for any i ∈ [m]
Karush-Kuhn-Tucker's Theorem
Assume that f, g_i : X → ℝ, ∀ i ∈ [m], are convex and differentiable and that the constraints are qualified. Then x′ is a solution of the constrained problem if and only if there exists α′ ≥ 0 such that

  ∇_x L(x′, α′) = ∇_x f(x′) + α′ · ∇_x g(x′) = 0    (19)
  ∇_α L(x′, α′) = g(x′) ≤ 0    (20)
  α′ · g(x′) = Σ_{i=1}^m α′_i g_i(x′) = 0    (21)

Equations 19-21 are called the KKT conditions.
[Mohri et al., 2018, Thm B.30]
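A tiny worked check of the theorem, on an assumed one-dimensional problem that is not from the slides: minimize f(x) = x^2 subject to g(x) = 1 − x ≤ 0 (that is, x ≥ 1). The candidate solution x′ = 1 with multiplier α′ = 2 should satisfy all three conditions.

```python
import numpy as np

# Problem: minimize f(x) = x^2  s.t.  g(x) = 1 - x <= 0; candidate x' = 1, alpha' = 2
x_opt, alpha_opt = 1.0, 2.0

grad_f = lambda x: 2.0 * x   # df/dx
grad_g = lambda x: -1.0      # dg/dx
g      = lambda x: 1.0 - x

print(np.isclose(grad_f(x_opt) + alpha_opt * grad_g(x_opt), 0.0))  # stationarity, equation 19
print(g(x_opt) <= 0.0)                                             # primal feasibility, equation 20
print(np.isclose(alpha_opt * g(x_opt), 0.0))                       # complementary slackness, equation 21
```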
KKT in SVM
Apply the KKT conditions to the SVM problem

  L(w, b, α) = (1/2)‖w‖_2^2 − Σ_{i=1}^m α_i (y_i(⟨w, x_i⟩ + b) − 1)    (22)

We have

  ∇_w L = w − Σ_{i=1}^m α_i y_i x_i = 0  ⇒  w = Σ_{i=1}^m α_i y_i x_i
  ∇_b L = −Σ_{i=1}^m α_i y_i = 0  ⇒  Σ_{i=1}^m α_i y_i = 0
  ∀ i, α_i (y_i(⟨w, x_i⟩ + b) − 1) = 0  ⇒  α_i = 0 or y_i(⟨w, x_i⟩ + b) = 1
Support Vectors
Consider the implication of the last equation on the previous page: for every i,
◮ α_i > 0 and y_i(⟨w, x_i⟩ + b) = 1, or
◮ α_i = 0 and y_i(⟨w, x_i⟩ + b) ≥ 1

  w = Σ_{i=1}^m α_i y_i x_i    (23)

◮ Examples with α_i > 0 are called support vectors
◮ In ℝ^d, d + 1 examples are sufficient to define a hyperplane
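To see support vectors in practice, one option (not covered on the slides) is scikit-learn's SVC with a very large C, which approximates the hard-margin SVM. The attribute support_ lists the examples with α_i > 0, and dual_coef_ stores α_i y_i, so equation 23 can be checked directly; the toy data are made up.

```python
import numpy as np
from sklearn.svm import SVC

# Toy separable data (illustration only)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1, 1, -1, -1])

# A very large C approximates the hard-margin SVM
clf = SVC(kernel='linear', C=1e6).fit(X, y)

print(clf.support_)  # indices of the support vectors (examples with alpha_i > 0)
# dual_coef_ holds alpha_i * y_i for the support vectors, so equation 23 becomes:
w = clf.dual_coef_ @ clf.support_vectors_
print(w, clf.intercept_)  # should match clf.coef_ for a linear kernel
```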
Non-separable Cases