
CS 6316 Machine Learning: Support Vector Machines and Kernel Methods - PowerPoint PPT Presentation



  1. CS 6316 Machine Learning: Support Vector Machines and Kernel Methods. Yangfeng Ji, Department of Computer Science, University of Virginia

  2. About Online Lectures

  3. Course Information Update
     ◮ Record the lectures and upload the videos on Collab
     ◮ By default, turn off the video and mute yourself
     ◮ If you have a question:
       ◮ Unmute yourself and chime in anytime
       ◮ Use the raise hand feature
       ◮ Send me a private message

  4. Course Information Update
     ◮ Record the lectures and upload the videos on Collab
     ◮ By default, turn off the video and mute yourself
     ◮ If you have a question:
       ◮ Unmute yourself and chime in anytime
       ◮ Use the raise hand feature
       ◮ Send me a private message
     ◮ Slack: a stable communication channel to
       ◮ send out instant messages if my network connection is unreliable
       ◮ hold online discussion

  5. Course Information Update
     ◮ Homework
       ◮ Subject to change

  6. Course Information Update
     ◮ Homework
       ◮ Subject to change
     ◮ Final project
       ◮ Send out my feedback later this week
       ◮ Continue your collaboration with your teammates
       ◮ Presentation: record a presentation video and share it with me

  7. Course Information Update
     ◮ Homework
       ◮ Subject to change
     ◮ Final project
       ◮ Send out my feedback later this week
       ◮ Continue your collaboration with your teammates
       ◮ Presentation: record a presentation video and share it with me
     ◮ Office hour
       ◮ Wednesday 11 AM: I will be on Zoom
       ◮ You can also send me an email or Slack message anytime

  8. Separable Cases

  9. Geometric Margin
     The geometric margin of a linear binary classifier h(x) = ⟨w, x⟩ + b at a point x is its distance to the hyper-plane ⟨w, x⟩ + b = 0:

         ρ_h(x) = |⟨w, x⟩ + b| / ‖w‖₂    (1)

  10. Geometric Margin (II)
      The geometric margin of h(x) for a set of examples T = {x_1, . . . , x_m} is the minimal distance over these examples:

          ρ_h(T) = min_{x′∈T} ρ_h(x′)    (2)

      [Mohri et al., 2018, Page 80]
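
As a quick numerical illustration of definitions (1) and (2), here is a minimal sketch (not from the slides); the values of w, b and the set T below are made up:

```python
import numpy as np

def geometric_margin(w, b, x):
    """rho_h(x) = |<w, x> + b| / ||w||_2, the distance of x to the hyper-plane."""
    return abs(np.dot(w, x) + b) / np.linalg.norm(w)

def geometric_margin_set(w, b, T):
    """rho_h(T) = min over x' in T of rho_h(x')."""
    return min(geometric_margin(w, b, x) for x in T)

w, b = np.array([3.0, 4.0]), 1.0
T = [np.array([1.0, 1.0]), np.array([0.0, -2.0])]
print(geometric_margin_set(w, b, T))  # min(1.6, 1.4) = 1.4
```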

  11. Half-Space Hypothesis Space
      ◮ Training set S = {(x_1, y_1), . . . , (x_m, y_m)} with x_i ∈ R^d and y_i ∈ {+1, −1}
      ◮ If the training set is linearly separable,

              y_i(⟨w, x_i⟩ + b) > 0, ∀i ∈ [m]    (3)

      ◮ Linearly separable cases:
        ◮ There exists some (w, b) satisfying equation 3
        ◮ All halfspace predictors that satisfy the condition in equation 3 are ERM hypotheses
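
Condition (3) is easy to check numerically for a candidate (w, b); a minimal sketch (the names X, y, w, b are illustrative, not from the slides):

```python
import numpy as np

def separates(X, y, w, b):
    """Check condition (3): y_i * (<w, x_i> + b) > 0 for all i in [m].

    X: (m, d) array of examples, y: (m,) array of labels in {+1, -1},
    w: (d,) weight vector, b: scalar bias.
    """
    return bool(np.all(y * (X @ w + b) > 0))

# Example: two points on opposite sides of the hyper-plane x1 + x2 = 0
X = np.array([[1.0, 1.0], [-1.0, -1.0]])
y = np.array([1.0, -1.0])
print(separates(X, y, w=np.array([1.0, 1.0]), b=0.0))  # True
```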

  12. Which Hypothesis is Better? [Shalev-Shwartz and Ben-David, 2014, Page 203]

  13. Which Hypothesis is Better?
      ◮ Intuitively, a hypothesis with larger margin is better, because it is more robust to noise
      ◮ Final definition of margin will be provided later
      [Shalev-Shwartz and Ben-David, 2014, Page 203]

  14. Hard SVM/Separable Cases
      The mathematical formulation of the previous idea:

              ρ = max_{(w,b)} min_{i∈[m]} |⟨w, x_i⟩ + b| / ‖w‖₂    (4)
              s.t. y_i(⟨w, x_i⟩ + b) > 0, ∀i    (5)

      ◮ y_i(⟨w, x_i⟩ + b) > 0, ∀i: guarantees that (w, b) is an ERM hypothesis

  15. Hard SVM/Separable Cases
      The mathematical formulation of the previous idea:

              ρ = max_{(w,b)} min_{i∈[m]} |⟨w, x_i⟩ + b| / ‖w‖₂    (4)
              s.t. y_i(⟨w, x_i⟩ + b) > 0, ∀i    (5)

      ◮ y_i(⟨w, x_i⟩ + b) > 0, ∀i: guarantees that (w, b) is an ERM hypothesis
      ◮ min_{i∈[m]}: calculates the margin between a hyper-plane and a set of examples

  16. Hard SVM/Separable Cases
      The mathematical formulation of the previous idea:

              ρ = max_{(w,b)} min_{i∈[m]} |⟨w, x_i⟩ + b| / ‖w‖₂    (4)
              s.t. y_i(⟨w, x_i⟩ + b) > 0, ∀i    (5)

      ◮ y_i(⟨w, x_i⟩ + b) > 0, ∀i: guarantees that (w, b) is an ERM hypothesis
      ◮ min_{i∈[m]}: calculates the margin between a hyper-plane and a set of examples
      ◮ max_{(w,b)}: maximizes the margin

  17. Illustration
      Original form:

              ρ = max_{(w,b)} min_{i∈[m]} |⟨w, x_i⟩ + b| / ‖w‖₂    (6)
              s.t. y_i(⟨w, x_i⟩ + b) > 0, ∀i    (7)

  18. Alternative Forms
      ◮ Original form

              ρ = max_{(w,b)} min_{i∈[m]} |⟨w, x_i⟩ + b| / ‖w‖₂    (8)
              s.t. y_i(⟨w, x_i⟩ + b) > 0, ∀i    (9)

  19. Alternative Forms
      ◮ Original form

              ρ = max_{(w,b)} min_{i∈[m]} |⟨w, x_i⟩ + b| / ‖w‖₂    (8)
              s.t. y_i(⟨w, x_i⟩ + b) > 0, ∀i    (9)

      ◮ Alternative form 1

              ρ = max_{(w,b)} min_{i∈[m]} y_i(⟨w, x_i⟩ + b) / ‖w‖₂    (10)

  20. Alternative Forms
      ◮ Original form

              ρ = max_{(w,b)} min_{i∈[m]} |⟨w, x_i⟩ + b| / ‖w‖₂    (8)
              s.t. y_i(⟨w, x_i⟩ + b) > 0, ∀i    (9)

      ◮ Alternative form 1

              ρ = max_{(w,b)} min_{i∈[m]} y_i(⟨w, x_i⟩ + b) / ‖w‖₂    (10)

      ◮ Alternative form 2

              ρ = max_{(w,b): min_{i∈[m]} y_i(⟨w, x_i⟩ + b) = 1} 1/‖w‖₂    (11)
                = max_{(w,b): y_i(⟨w, x_i⟩ + b) ≥ 1} 1/‖w‖₂    (12)
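
The step from form 1 to form 2 relies on the objective in (10) being invariant to rescaling (w, b); the following short sketch of that argument is not on the slides but fills in the reasoning:

```latex
% For any c > 0, replacing (w, b) with (cw, cb) leaves
% min_i y_i(<w, x_i> + b) / ||w||_2 unchanged, so the scale can be fixed
% such that the smallest functional margin equals 1. The objective then
% reduces to 1 / ||w||_2, which is exactly forms (11) and (12).
\max_{(w,b)} \min_{i\in[m]} \frac{y_i(\langle w, x_i\rangle + b)}{\|w\|_2}
  \;=\; \max_{(w,b):\, \min_{i\in[m]} y_i(\langle w, x_i\rangle + b) = 1} \frac{1}{\|w\|_2}
```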

  21. Alternative Forms (II)
      ◮ Alternative form 2

              ρ = max_{(w,b): y_i(⟨w, x_i⟩ + b) ≥ 1} 1/‖w‖₂    (13)

      ◮ Alternative form 3: quadratic programming (QP)

              min_{(w,b)} ½‖w‖₂²    (14)
              s.t. y_i(⟨w, x_i⟩ + b) ≥ 1, ∀i ∈ [m]

        which is a constrained optimization problem that can be solved by standard QP packages

  22. Alternative Forms (II)
      ◮ Alternative form 2

              ρ = max_{(w,b): y_i(⟨w, x_i⟩ + b) ≥ 1} 1/‖w‖₂    (13)

      ◮ Alternative form 3: quadratic programming (QP)

              min_{(w,b)} ½‖w‖₂²    (14)
              s.t. y_i(⟨w, x_i⟩ + b) ≥ 1, ∀i ∈ [m]

        which is a constrained optimization problem that can be solved by standard QP packages
      ◮ Exercise: solve an SVM problem with quadratic programming (a sketch follows below)
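
As a hedged sketch of that exercise (not from the slides), the primal problem (14) can be handed to the cvxopt QP solver by stacking the variables as z = [w; b]; the data X, y below are made up for illustration:

```python
import numpy as np
from cvxopt import matrix, solvers

def hard_margin_svm(X, y):
    """Solve min 1/2 ||w||_2^2 s.t. y_i(<w, x_i> + b) >= 1 as a QP over z = [w; b]."""
    m, d = X.shape
    # Objective 1/2 z^T P z with P = diag(1, ..., 1, 0); a tiny ridge on the
    # b-entry keeps P well conditioned for the solver.
    P = np.eye(d + 1)
    P[d, d] = 1e-8
    q = np.zeros(d + 1)
    # Rewrite y_i(<w, x_i> + b) >= 1 as G z <= h with G = -[y_i x_i, y_i], h = -1.
    G = -np.hstack([y[:, None] * X, y[:, None]])
    h = -np.ones(m)
    solvers.options['show_progress'] = False
    sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))
    z = np.array(sol['x']).ravel()
    return z[:d], z[d]  # w, b

# Toy separable data (illustrative only)
X = np.array([[2.0, 2.0], [2.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = hard_margin_svm(X, y)
print(w, b)
```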

  23. Unconstrained Optimization Problem
      The quadratic programming problem with constraints can be converted to an unconstrained optimization problem with the Lagrangian method:

              L(w, b, α) = ½‖w‖₂² − Σ_{i=1}^m α_i (y_i(⟨w, x_i⟩ + b) − 1)    (15)

      where
      ◮ α = {α_1, . . . , α_m} are the Lagrange multipliers, and
      ◮ α_i ≥ 0 is associated with the i-th training example

  24. Constrained Optimization Problems

  25. Constrained Optimization Problems: Definition
      Given
      ◮ X ⊆ R^d and
      ◮ f, g_i : X → R, ∀i ∈ [m]
      a constrained optimization problem is defined in the form of

              min_{x∈X} f(x)    (16)
              s.t. g_i(x) ≤ 0, ∀i ∈ [m]    (17)

  26. Constrained Optimization Problems: Definition
      Given
      ◮ X ⊆ R^d and
      ◮ f, g_i : X → R, ∀i ∈ [m]
      a constrained optimization problem is defined in the form of

              min_{x∈X} f(x)    (16)
              s.t. g_i(x) ≤ 0, ∀i ∈ [m]    (17)

      Comments
      ◮ In the general definition, x is the target variable for optimization
      ◮ Special cases of g_i(x): (1) g_i(x) = 0, (2) g_i(x) ≥ 0, and (3) g_i(x) ≤ b
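
As an illustration of the general form (16)-(17), here is a minimal sketch (not from the slides) using scipy; the objective and constraints are made-up examples, and since scipy expects inequality constraints written as c(x) ≥ 0, each g_i(x) ≤ 0 is passed as -g_i(x) ≥ 0:

```python
import numpy as np
from scipy.optimize import minimize

# Made-up instance of (16)-(17): minimize f(x) = ||x - (1, 2)||^2
# subject to g_1(x) = x_1 + x_2 - 2 <= 0 and g_2(x) = -x_1 <= 0.
f = lambda x: (x[0] - 1.0) ** 2 + (x[1] - 2.0) ** 2
g = [lambda x: x[0] + x[1] - 2.0,   # g_1(x) <= 0
     lambda x: -x[0]]               # g_2(x) <= 0

# scipy's 'ineq' constraints mean c(x) >= 0, so pass -g_i.
constraints = [{'type': 'ineq', 'fun': (lambda x, gi=gi: -gi(x))} for gi in g]
result = minimize(f, x0=np.zeros(2), method='SLSQP', constraints=constraints)
print(result.x)  # roughly (0.5, 1.5): the projection of (1, 2) onto the feasible set
```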

  27. Lagrangian
      The Lagrangian associated to the general constrained optimization problem defined in equations 16 and 17 is the function defined over X × R^m_+ as

              L(x, α) = f(x) + Σ_{i=1}^m α_i g_i(x)    (18)

      where
      ◮ α = (α_1, . . . , α_m) ∈ R^m_+
      ◮ α_i ≥ 0 for any i ∈ [m]

  28. Karush-Kuhn-Tucker's Theorem
      Assume that f, g_i : X → R, ∀i ∈ [m] are convex and differentiable and that the constraints are qualified. Then x′ is a solution of the constrained problem if and only if there exists α′ ≥ 0 such that

              ∇_x L(x′, α′) = ∇_x f(x′) + α′ · ∇_x g(x′) = 0    (19)
              ∇_α L(x′, α′) = g(x′) ≤ 0    (20)
              α′ · g(x′) = Σ_{i=1}^m α′_i g_i(x′) = 0    (21)

      Equations 19-21 are called the KKT conditions [Mohri et al., 2018, Thm B.30]
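
A tiny worked example (not from the slides) makes the KKT conditions concrete: minimize f(x) = x² subject to the single constraint g(x) = 1 − x ≤ 0.

```latex
L(x, \alpha) = x^2 + \alpha (1 - x), \qquad \alpha \ge 0
% Stationarity:            \nabla_x L = 2x - \alpha = 0
% Primal feasibility:      1 - x \le 0
% Complementary slackness: \alpha (1 - x) = 0
% \alpha = 0 would force x = 0, which violates feasibility; hence \alpha > 0,
% so 1 - x = 0, giving x' = 1 and \alpha' = 2: the minimizer of x^2 over x >= 1.
```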

  29. KKT in SVM
      Apply the KKT conditions to the SVM problem

              L(w, b, α) = ½‖w‖₂² − Σ_{i=1}^m α_i (y_i(⟨w, x_i⟩ + b) − 1)    (22)

      We have

              ∇_w L = w − Σ_{i=1}^m α_i y_i x_i = 0  ⇒  w = Σ_{i=1}^m α_i y_i x_i

  30. KKT in SVM
      Apply the KKT conditions to the SVM problem

              L(w, b, α) = ½‖w‖₂² − Σ_{i=1}^m α_i (y_i(⟨w, x_i⟩ + b) − 1)    (22)

      We have

              ∇_w L = w − Σ_{i=1}^m α_i y_i x_i = 0  ⇒  w = Σ_{i=1}^m α_i y_i x_i
              ∇_b L = −Σ_{i=1}^m α_i y_i = 0  ⇒  Σ_{i=1}^m α_i y_i = 0

  31. KKT in SVM
      Apply the KKT conditions to the SVM problem

              L(w, b, α) = ½‖w‖₂² − Σ_{i=1}^m α_i (y_i(⟨w, x_i⟩ + b) − 1)    (22)

      We have

              ∇_w L = w − Σ_{i=1}^m α_i y_i x_i = 0  ⇒  w = Σ_{i=1}^m α_i y_i x_i
              ∇_b L = −Σ_{i=1}^m α_i y_i = 0  ⇒  Σ_{i=1}^m α_i y_i = 0
              ∀i, α_i (y_i(⟨w, x_i⟩ + b) − 1) = 0  ⇒  α_i = 0 or y_i(⟨w, x_i⟩ + b) = 1
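
One step the slides leave implicit is how to recover b once the multipliers are known; a short sketch, using any index i with α_i > 0:

```latex
% For any i with \alpha_i > 0, complementary slackness gives
%   y_i(\langle w, x_i \rangle + b) = 1 .
% Since y_i \in \{+1, -1\} implies 1/y_i = y_i, it follows that
b = y_i - \langle w, x_i \rangle,
  \qquad \text{where } w = \sum_{j=1}^{m} \alpha_j y_j x_j .
```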

  32. Support Vectors
      Consider the implication of the last equation on the previous slide: ∀i,
      ◮ α_i > 0 and y_i(⟨w, x_i⟩ + b) = 1, or

  33. Support Vectors
      Consider the implication of the last equation on the previous slide: ∀i,
      ◮ α_i > 0 and y_i(⟨w, x_i⟩ + b) = 1, or
      ◮ α_i = 0 and y_i(⟨w, x_i⟩ + b) ≥ 1

  34. Support Vectors
      Consider the implication of the last equation on the previous slide: ∀i,
      ◮ α_i > 0 and y_i(⟨w, x_i⟩ + b) = 1, or
      ◮ α_i = 0 and y_i(⟨w, x_i⟩ + b) ≥ 1

              w = Σ_{i=1}^m α_i y_i x_i    (23)

      ◮ Examples with α_i > 0 are called support vectors
      ◮ In R^d, d + 1 examples are sufficient to define a hyper-plane
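
To see support vectors concretely, here is a minimal sketch (not from the slides) using scikit-learn's SVC with a linear kernel and a large C to approximate the hard-margin setting; dual_coef_ stores the products α_i y_i for the support vectors, so w can be rebuilt as in (23):

```python
import numpy as np
from sklearn.svm import SVC

# Toy separable data (illustrative only)
X = np.array([[2.0, 2.0], [2.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

# A linear kernel with a very large C approximates the hard-margin SVM.
clf = SVC(kernel='linear', C=1e6).fit(X, y)

print(clf.support_vectors_)               # the examples with alpha_i > 0
alpha_times_y = clf.dual_coef_[0]         # alpha_i * y_i for each support vector
w = alpha_times_y @ clf.support_vectors_  # w = sum_i alpha_i y_i x_i, eq. (23)
print(w, clf.coef_[0])                    # the two should (approximately) agree
print(clf.intercept_[0])                  # the learned bias b
```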

  35. Non-separable Cases
