Support Vector Machines Part 1 CS 760@UW-Madison
Goals for the lecture
You should understand the following concepts:
• the margin
• the linear support vector machine
• the primal and dual formulations of SVM learning
• support vectors
• Optional: variants of SVM
• Optional: Lagrange multipliers
Motivation
Linear classification
(Figure: a separating hyperplane $(w^*)^\top x = 0$, with the half-space $(w^*)^\top x > 0$ containing Class +1, the half-space $(w^*)^\top x < 0$ containing Class -1, and $w^*$ normal to the hyperplane.)
Assume perfect separation between the two classes.
Attempt
• Given training data $\{(x_i, y_i) : 1 \le i \le n\}$ i.i.d. from distribution $D$
• Hypothesis $y = \mathrm{sign}(f_w(x)) = \mathrm{sign}(w^\top x)$
  • $y = +1$ if $w^\top x > 0$
  • $y = -1$ if $w^\top x < 0$
• Let's assume that we can optimize to find $w$
Multiple optimal solutions?
(Figure: three separating hyperplanes $w_1, w_2, w_3$ between Class +1 and Class -1.)
Same empirical loss; different test/expected loss.
What about $w_1$?
(Figure: $w_1$ shown with new test data from Class +1 and Class -1.)
What about $w_3$?
(Figure: $w_3$ shown with new test data from Class +1 and Class -1.)
Most confident: $w_2$
(Figure: $w_2$ shown with new test data from Class +1 and Class -1.)
Intuition: margin
(Figure: $w_2$ separates Class +1 and Class -1 with a large margin.)
Margin
Margin
We are going to prove the following expression for the margin using a geometric argument.
• Lemma 1: $x$ has distance $\frac{|f_w(x)|}{\|w\|}$ to the hyperplane $f_w(x) = w^\top x = 0$
• Lemma 2: $x$ has distance $\frac{|f_{w,b}(x)|}{\|w\|}$ to the hyperplane $f_{w,b}(x) = w^\top x + b = 0$
Need two geometric facts:
• $w$ is orthogonal to the hyperplane $f_{w,b}(x) = w^\top x + b = 0$
• Let $v$ be a direction (i.e., a unit vector). Then the length of the projection of $x$ onto $v$ is $v^\top x$
Margin
• Lemma 1: $x$ has distance $\frac{|f_w(x)|}{\|w\|}$ to the hyperplane $f_w(x) = w^\top x = 0$
Proof:
• $w$ is orthogonal to the hyperplane
• The unit direction is $\frac{w}{\|w\|}$
• The projection of $x$ onto this direction has length $\left(\frac{w}{\|w\|}\right)^\top x = \frac{f_w(x)}{\|w\|}$
Margin: with bias
• Lemma 2: $x$ has distance $\frac{|f_{w,b}(x)|}{\|w\|}$ to the hyperplane $f_{w,b}(x) = w^\top x + b = 0$
Proof:
• Let $x = x_\perp + r \frac{w}{\|w\|}$, where $x_\perp$ lies on the hyperplane; then $|r|$ is the distance
• Multiply both sides by $w^\top$ and add $b$
• Left hand side: $w^\top x + b = f_{w,b}(x)$
• Right hand side: $w^\top x_\perp + r \frac{w^\top w}{\|w\|} + b = 0 + r\|w\|$
• So $r = \frac{f_{w,b}(x)}{\|w\|}$, and the distance is $|r| = \frac{|f_{w,b}(x)|}{\|w\|}$
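As a sanity check on Lemma 2, here is a minimal NumPy sketch with made-up values for $w$, $b$, and $x$ (not from the lecture), comparing the formula against an explicit projection onto the hyperplane:

```python
import numpy as np

# Made-up example values (not from the lecture).
w = np.array([3.0, 4.0])   # normal vector of the hyperplane w^T x + b = 0
b = -5.0                   # bias term
x = np.array([4.0, 6.0])   # query point

# Distance by Lemma 2: |f_{w,b}(x)| / ||w||
dist_formula = abs(w @ x + b) / np.linalg.norm(w)

# Distance by explicit projection: x_perp = x - r * w/||w||, with r = f_{w,b}(x)/||w||
r = (w @ x + b) / np.linalg.norm(w)
x_perp = x - r * w / np.linalg.norm(w)
dist_projection = np.linalg.norm(x - x_perp)

print(dist_formula, dist_projection)        # both are 6.2 for these values
assert np.isclose(w @ x_perp + b, 0.0)      # x_perp indeed lies on the hyperplane
assert np.isclose(dist_formula, dist_projection)
```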
Margin: with bias
The notation here is: $y(x) = w^\top x + w_0$
Figure from Pattern Recognition and Machine Learning, Bishop
Support Vector Machine (SVM)
SVM: objective
• Absolute margin over all training data points:
$$\gamma = \min_i \frac{|f_{w,b}(x_i)|}{\|w\|}$$
• Since we only want a correct $f_{w,b}$, and recall $y_i \in \{+1, -1\}$, we define the margin to be
$$\gamma = \min_i \frac{y_i f_{w,b}(x_i)}{\|w\|}$$
• If $f_{w,b}$ is incorrect on some $x_i$, the margin is negative
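As a quick illustration, here is a minimal NumPy sketch (toy data and hyperplanes made up for illustration) that evaluates this margin definition; note that it becomes negative as soon as some point is misclassified:

```python
import numpy as np

def margin(w, b, X, y):
    """Margin of the hyperplane (w, b) over a dataset: min_i y_i (w^T x_i + b) / ||w||."""
    return np.min(y * (X @ w + b)) / np.linalg.norm(w)

# Toy, linearly separable data (made up for illustration).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([+1, +1, -1, -1])

print(margin(np.array([1.0, 1.0]), 0.0, X, y))    # positive: all points on the correct side
print(margin(np.array([-1.0, -1.0]), 0.0, X, y))  # negative: every point misclassified
```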
SVM: objective
• Maximize the margin over all training data points:
$$\max_{w,b} \gamma = \max_{w,b} \min_i \frac{y_i f_{w,b}(x_i)}{\|w\|} = \max_{w,b} \min_i \frac{y_i (w^\top x_i + b)}{\|w\|}$$
• A bit complicated …
SVM: simplified objective
• Observation: when $(w, b)$ is scaled by a factor $c > 0$, the margin is unchanged:
$$\frac{y_i (c\,w^\top x_i + c\,b)}{\|c\,w\|} = \frac{y_i (w^\top x_i + b)}{\|w\|}$$
• Let's fix the scale such that
$$y_{i^*} (w^\top x_{i^*} + b) = 1$$
where $x_{i^*}$ is the point closest to the hyperplane
SVM: simplified objective
• Let's fix the scale such that $y_{i^*} (w^\top x_{i^*} + b) = 1$, where $x_{i^*}$ is the point closest to the hyperplane
• Now for all data points $y_i (w^\top x_i + b) \ge 1$, and equality holds for at least one $i$
• Then the margin over all training points is $\frac{1}{\|w\|}$
SVM: simplified objective
• Optimization simplified to
$$\min_{w,b} \ \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i (w^\top x_i + b) \ge 1, \ \forall i$$
• How to find the optimum $\widehat{w}^*$?
• Solved by the Lagrange multiplier method
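For concreteness, here is a minimal sketch of this quadratic program using the cvxpy modeling library (the toy data is made up, and cvxpy is my choice here, not part of the lecture):

```python
import cvxpy as cp
import numpy as np

# Toy, linearly separable data (made up for illustration).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([+1.0, +1.0, -1.0, -1.0])
n, d = X.shape

w = cp.Variable(d)
b = cp.Variable()

# min (1/2)||w||^2  subject to  y_i (w^T x_i + b) >= 1 for all i
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print("w =", w.value, "b =", b.value)
print("margin =", 1.0 / np.linalg.norm(w.value))  # equals 1/||w|| at the optimum
```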
SVM: optimization
SVM: optimization
• Optimization (Quadratic Programming):
$$\min_{w,b} \ \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i (w^\top x_i + b) \ge 1, \ \forall i$$
• Generalized Lagrangian:
$$\mathcal{L}(w, b, \boldsymbol{\alpha}) = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i \left[ y_i (w^\top x_i + b) - 1 \right]$$
where the $\alpha_i$ are the Lagrange multipliers
SVM: optimization
• KKT conditions:
$$\frac{\partial \mathcal{L}}{\partial w} = 0 \ \Rightarrow \ w = \sum_i \alpha_i y_i x_i \quad (1)$$
$$\frac{\partial \mathcal{L}}{\partial b} = 0 \ \Rightarrow \ 0 = \sum_i \alpha_i y_i \quad (2)$$
• Plug into $\mathcal{L}$:
$$\mathcal{L}(w, b, \boldsymbol{\alpha}) = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i^\top x_j \quad (3)$$
combined with $0 = \sum_i \alpha_i y_i$, $\alpha_i \ge 0$
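To fill in the step between (1)–(2) and (3), the substitution works out as follows:

```latex
\begin{align*}
\mathcal{L}(w, b, \boldsymbol{\alpha})
  &= \tfrac{1}{2}\|w\|^2 - \sum_i \alpha_i \big[ y_i (w^\top x_i + b) - 1 \big] \\
  &= \tfrac{1}{2} w^\top w - w^\top \sum_i \alpha_i y_i x_i - b \sum_i \alpha_i y_i + \sum_i \alpha_i \\
  &= \tfrac{1}{2} w^\top w - w^\top w - 0 + \sum_i \alpha_i
     && \text{using (1) } w = \textstyle\sum_i \alpha_i y_i x_i \text{ and (2) } \textstyle\sum_i \alpha_i y_i = 0 \\
  &= \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i^\top x_j.
\end{align*}
```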
SVM: optimization
• Reduces to the dual problem:
$$\max_{\boldsymbol{\alpha}} \ \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i^\top x_j \quad \text{s.t.} \quad \sum_i \alpha_i y_i = 0, \ \alpha_i \ge 0$$
• Note: the dual only depends on the inner products $x_i^\top x_j$
• Since $w = \sum_i \alpha_i y_i x_i$, we have $w^\top x + b = \sum_i \alpha_i y_i \, x_i^\top x + b$
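To see the dual in action, here is a minimal sketch (toy data; the helper names and the use of scipy.optimize.minimize are my own choices, not from the lecture) that solves the dual numerically and recovers $(w, b)$ via (1):

```python
import numpy as np
from scipy.optimize import minimize

# Toy, linearly separable data (made up for illustration).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([+1.0, +1.0, -1.0, -1.0])
n = len(y)

# K[i, j] = y_i y_j x_i^T x_j
K = (y[:, None] * X) @ (y[:, None] * X).T

# Dual objective, negated because scipy minimizes: -(sum_i a_i - 1/2 a^T K a)
def neg_dual(a):
    return 0.5 * a @ K @ a - np.sum(a)

constraints = [{"type": "eq", "fun": lambda a: a @ y}]  # sum_i a_i y_i = 0
bounds = [(0.0, None)] * n                              # a_i >= 0

alpha = minimize(neg_dual, np.zeros(n), bounds=bounds, constraints=constraints).x

# Recover the primal solution: w = sum_i a_i y_i x_i; b from any support vector (a_i > 0)
w = (alpha * y) @ X
sv = alpha > 1e-6
b = np.mean(y[sv] - X[sv] @ w)

print("alpha =", np.round(alpha, 3))
print("w =", w, "b =", b)
```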
Support Vectors
• The final solution is a sparse linear combination of the training instances
• Those instances with $\alpha_i > 0$ are called support vectors
• They lie on the margin boundary
• The solution is NOT changed if we delete the instances with $\alpha_i = 0$
(Figure: the support vectors lie on the margin boundary.)
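For comparison, scikit-learn's SVC (not part of the lecture) exposes this sparsity directly; with kernel='linear' and a very large C it approximates the hard-margin SVM:

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable data (made up for illustration).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0], [4.0, 5.0]])
y = np.array([+1, +1, -1, -1, +1])

# Large C approximates the hard margin; kernel='linear' keeps the model w^T x + b.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

print("support vector indices:", clf.support_)        # typically only a few instances
print("support vectors:", clf.support_vectors_)
print("w =", clf.coef_[0], "b =", clf.intercept_[0])
print("alpha_i * y_i =", clf.dual_coef_[0])            # nonzero only for support vectors
```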
Optional: Lagrange Multiplier
Lagrangian
• Consider the optimization problem:
$$\min_x f(x) \quad \text{s.t.} \quad h_j(x) = 0, \ \forall\, 1 \le j \le l$$
• Lagrangian:
$$\mathcal{L}(x, \boldsymbol{\beta}) = f(x) + \sum_j \beta_j h_j(x)$$
where the $\beta_j$'s are called Lagrange multipliers
Lagrangian
• Consider the optimization problem:
$$\min_x f(x) \quad \text{s.t.} \quad h_j(x) = 0, \ \forall\, 1 \le j \le l$$
• Solved by setting the derivatives of the Lagrangian to 0:
$$\frac{\partial \mathcal{L}}{\partial x_i} = 0; \qquad \frac{\partial \mathcal{L}}{\partial \beta_j} = 0$$
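A tiny worked example of this recipe (not from the slides): minimize $x_1^2 + x_2^2$ subject to $x_1 + x_2 = 1$:

```latex
\begin{align*}
\mathcal{L}(x, \beta) &= x_1^2 + x_2^2 + \beta\,(x_1 + x_2 - 1), \\
\frac{\partial \mathcal{L}}{\partial x_1} &= 2x_1 + \beta = 0, \qquad
\frac{\partial \mathcal{L}}{\partial x_2} = 2x_2 + \beta = 0, \qquad
\frac{\partial \mathcal{L}}{\partial \beta} = x_1 + x_2 - 1 = 0, \\
&\Rightarrow\ x_1 = x_2 = \tfrac{1}{2}, \quad \beta = -1.
\end{align*}
```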
Generalized Lagrangian
• Consider the optimization problem:
$$\min_x f(x) \quad \text{s.t.} \quad g_i(x) \le 0, \ \forall\, 1 \le i \le k, \qquad h_j(x) = 0, \ \forall\, 1 \le j \le l$$
• Generalized Lagrangian:
$$\mathcal{L}(x, \boldsymbol{\alpha}, \boldsymbol{\beta}) = f(x) + \sum_i \alpha_i g_i(x) + \sum_j \beta_j h_j(x)$$
where the $\alpha_i$'s and $\beta_j$'s are called Lagrange multipliers
Generalized Lagrangian
• Consider the quantity:
$$\theta_P(x) := \max_{\boldsymbol{\alpha}, \boldsymbol{\beta}:\, \alpha_i \ge 0} \mathcal{L}(x, \boldsymbol{\alpha}, \boldsymbol{\beta})$$
• Why?
$$\theta_P(x) = \begin{cases} f(x), & \text{if } x \text{ satisfies all the constraints} \\ +\infty, & \text{if } x \text{ does not satisfy the constraints} \end{cases}$$
• So minimizing $f(x)$ subject to the constraints is the same as minimizing $\theta_P(x)$:
$$\min_x f(x) = \min_x \theta_P(x) = \min_x \max_{\boldsymbol{\alpha}, \boldsymbol{\beta}:\, \alpha_i \ge 0} \mathcal{L}(x, \boldsymbol{\alpha}, \boldsymbol{\beta})$$
Lagrange duality
• The primal problem:
$$p^* := \min_x f(x) = \min_x \max_{\boldsymbol{\alpha}, \boldsymbol{\beta}:\, \alpha_i \ge 0} \mathcal{L}(x, \boldsymbol{\alpha}, \boldsymbol{\beta})$$
• The dual problem:
$$d^* := \max_{\boldsymbol{\alpha}, \boldsymbol{\beta}:\, \alpha_i \ge 0} \min_x \mathcal{L}(x, \boldsymbol{\alpha}, \boldsymbol{\beta})$$
• Always true (weak duality): $d^* \le p^*$
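Weak duality follows from a general max–min inequality; a short argument:

```latex
\text{For all } \boldsymbol{\alpha} \ge 0,\ \boldsymbol{\beta},\ x:\qquad
\min_{x'} \mathcal{L}(x', \boldsymbol{\alpha}, \boldsymbol{\beta})
\;\le\; \mathcal{L}(x, \boldsymbol{\alpha}, \boldsymbol{\beta})
\;\le\; \max_{\boldsymbol{\alpha}',\boldsymbol{\beta}':\, \alpha_i' \ge 0} \mathcal{L}(x, \boldsymbol{\alpha}', \boldsymbol{\beta}') = \theta_P(x).
```

Taking the max over $(\boldsymbol{\alpha}, \boldsymbol{\beta})$ on the left and the min over $x$ on the right gives $d^* \le p^*$.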
Lagrange duality
• The primal problem:
$$p^* := \min_x f(x) = \min_x \max_{\boldsymbol{\alpha}, \boldsymbol{\beta}:\, \alpha_i \ge 0} \mathcal{L}(x, \boldsymbol{\alpha}, \boldsymbol{\beta})$$
• The dual problem:
$$d^* := \max_{\boldsymbol{\alpha}, \boldsymbol{\beta}:\, \alpha_i \ge 0} \min_x \mathcal{L}(x, \boldsymbol{\alpha}, \boldsymbol{\beta})$$
• Interesting case: when do we have $d^* = p^*$?
Lagrange duality
• Theorem: under proper conditions, there exist $x^*, \boldsymbol{\alpha}^*, \boldsymbol{\beta}^*$ such that
$$d^* = \mathcal{L}(x^*, \boldsymbol{\alpha}^*, \boldsymbol{\beta}^*) = p^*$$
• Moreover, $x^*, \boldsymbol{\alpha}^*, \boldsymbol{\beta}^*$ satisfy the Karush-Kuhn-Tucker (KKT) conditions:
$$\frac{\partial \mathcal{L}}{\partial x_i} = 0, \qquad \underbrace{\alpha_i g_i(x) = 0}_{\text{dual complementarity}}, \qquad \underbrace{g_i(x) \le 0, \quad h_j(x) = 0}_{\text{primal constraints}}, \qquad \underbrace{\alpha_i \ge 0}_{\text{dual constraints}}$$
Lagrange duality
• What are the proper conditions?
• One set of conditions (Slater's conditions):
  • $f$ and the $g_i$ are convex, the $h_j$ are affine, and there exists an $x$ satisfying $g_i(x) < 0$ for all $i$
• There exist other sets of conditions
• Check textbooks, e.g., Convex Optimization by Boyd and Vandenberghe
Optional: Variants of SVM
Hard-margin SVM
• Optimization (Quadratic Programming):
$$\min_{w,b} \ \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i (w^\top x_i + b) \ge 1, \ \forall i$$