Support Vector Machines Part 1

  1. Support Vector Machines Part 1 Yingyu Liang Computer Sciences 760 Fall 2017 http://pages.cs.wisc.edu/~yliang/cs760/ Some of the slides in these lectures have been adapted/borrowed from materials developed by Mark Craven, David Page, Jude Shavlik, Tom Mitchell, Nina Balcan, Matt Gormley, Elad Hazan, Tom Dietterich, and Pedro Domingos.

  2. Goals for the lecture. You should understand the following concepts: • the margin • the linear support vector machine • the primal and dual formulations of SVM learning • support vectors • VC-dimension and maximizing the margin

  3. Motivation

  4. Linear classification. The figure shows the decision boundary (w*)^T x = 0, with (w*)^T x > 0 on the Class +1 side, (w*)^T x < 0 on the Class −1 side, and the weight vector w* normal to the boundary. Assume perfect separation between the two classes.

  5. Attempt • Given training data (x_i, y_i), 1 ≤ i ≤ n, i.i.d. from distribution D • Hypothesis: y = sign(f_w(x)) = sign(w^T x) • y = +1 if w^T x > 0 • y = −1 if w^T x < 0 • Let's assume that we can optimize to find w
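
As a concrete illustration of this hypothesis class (not part of the slides), here is a minimal Python sketch of the bias-free linear classifier; the names w and X are assumptions made for the example.

```python
import numpy as np

def predict(w, X):
    # Hypothesis from the slide: y_hat = sign(w^T x), one prediction per row of X.
    return np.sign(X @ w)

# Toy usage with an assumed weight vector and two points.
w = np.array([1.0, -2.0])
X = np.array([[3.0, 1.0],    # w^T x =  1   -> +1
              [0.5, 2.0]])   # w^T x = -3.5 -> -1
print(predict(w, X))         # [ 1. -1.]
```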

  6. Multiple optimal solutions? The figure shows three separators w_1, w_2, w_3 between Class +1 and Class −1: they are the same on empirical loss, but different on test/expected loss.

  7. What about w_1? The figure adds new test data for Class +1 and Class −1 around the separator w_1.

  8. What about w_3? The figure adds new test data for Class +1 and Class −1 around the separator w_3.

  9. Most confident: w_2. The figure shows the same new test data with the separator w_2.

  10. Intuition: margin. The figure shows w_2 separating Class +1 from Class −1 with a large margin.

  11. Margin

  12. Margin • Lemma 1: x has distance |f_w(x)| / ‖w‖ to the hyperplane f_w(x) = w^T x = 0. Proof: • w is orthogonal to the hyperplane • the unit direction is w/‖w‖ • the projection of x onto this direction is (w/‖w‖)^T x = f_w(x)/‖w‖
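
A quick numerical check of Lemma 1 (a sketch, not from the lecture; the example vectors are arbitrary):

```python
import numpy as np

def distance_to_hyperplane(w, x):
    # Lemma 1: distance from x to the hyperplane {z : w^T z = 0} is |w^T x| / ||w||.
    return abs(w @ x) / np.linalg.norm(w)

w = np.array([3.0, 4.0])   # ||w|| = 5
x = np.array([1.0, 2.0])   # w^T x = 11
print(distance_to_hyperplane(w, x))   # 2.2

# Cross-check: removing the projection onto w/||w|| leaves a point on the hyperplane.
x_perp = x - (w @ x) / (w @ w) * w
print(np.isclose(w @ x_perp, 0.0))    # True
```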

  13. Margin: with bias • Claim 1: w is orthogonal to the hyperplane f_{w,b}(x) = w^T x + b = 0. Proof: • pick any x_1 and x_2 on the hyperplane • w^T x_1 + b = 0 • w^T x_2 + b = 0 • so w^T (x_1 − x_2) = 0

  14. Margin: with bias • Claim 2: 0 has distance −b/‖w‖ to the hyperplane w^T x + b = 0. Proof: • pick any x_1 on the hyperplane • project x_1 onto the unit direction w/‖w‖ to get the distance • (w/‖w‖)^T x_1 = −b/‖w‖ since w^T x_1 + b = 0

  15. Margin: with bias • Lemma 2: x has distance |f_{w,b}(x)| / ‖w‖ to the hyperplane f_{w,b}(x) = w^T x + b = 0. Proof: • let x = x_⊥ + r w/‖w‖ with x_⊥ on the hyperplane; then |r| is the distance • multiply both sides by w^T and add b • left-hand side: w^T x + b = f_{w,b}(x) • right-hand side: w^T x_⊥ + r (w^T w)/‖w‖ + b = 0 + r‖w‖
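
The same check with a bias term, verifying Lemma 2 numerically (a sketch with arbitrary example values for w, b, x):

```python
import numpy as np

w = np.array([3.0, 4.0])
b = -5.0
x = np.array([2.0, 1.0])

dist = abs(w @ x + b) / np.linalg.norm(w)        # Lemma 2: |f_{w,b}(x)| / ||w||

# Verify by dropping a perpendicular from x onto the hyperplane w^T z + b = 0.
x_perp = x - (w @ x + b) / (w @ w) * w           # foot of the perpendicular
print(np.isclose(w @ x_perp + b, 0.0))           # x_perp lies on the hyperplane: True
print(np.isclose(np.linalg.norm(x - x_perp), dist))  # distances agree: True
```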

  16. The notation here is: y(x) = w^T x + w_0. Figure from Pattern Recognition and Machine Learning, Bishop.

  17. Support Vector Machine (SVM)

  18. SVM: objective • Margin over all training data points: γ = min_i |f_{w,b}(x_i)| / ‖w‖ • Since we only want a correct f_{w,b}, and recall y_i ∈ {+1, −1}, we have γ = min_i y_i f_{w,b}(x_i) / ‖w‖ • If f_{w,b} is incorrect on some x_i, the margin is negative
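
The margin of a fixed (w, b) over a dataset can be computed directly; a minimal sketch (the array names and toy data are assumptions, not from the slides):

```python
import numpy as np

def margin(w, b, X, y):
    # gamma = min_i y_i (w^T x_i + b) / ||w||; negative if some point is misclassified.
    return np.min(y * (X @ w + b)) / np.linalg.norm(w)

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
print(margin(np.array([1.0, 1.0]), 0.0, X, y))   # margin of one candidate separator
```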

  19. SVM: objective • Maximize the margin over all training data points: max_{w,b} γ = max_{w,b} min_i y_i f_{w,b}(x_i) / ‖w‖ = max_{w,b} min_i y_i (w^T x_i + b) / ‖w‖ • A bit complicated …

  20. SVM: simplified objective • Observation: when (w, b) is scaled by a factor c, the margin is unchanged: y_i (c w^T x_i + c b) / ‖c w‖ = y_i (w^T x_i + b) / ‖w‖ • Let's consider a fixed scale such that y_{i*} (w^T x_{i*} + b) = 1, where x_{i*} is the point closest to the hyperplane

  21. SVM: simplified objective • Let's consider a fixed scale such that y_{i*} (w^T x_{i*} + b) = 1, where x_{i*} is the point closest to the hyperplane • Now we have y_i (w^T x_i + b) ≥ 1 for all data, and for at least one i the equality holds • Then the margin is 1/‖w‖

  22. SVM: simplified objective • Optimization simplified to: min_{w,b} ½‖w‖², subject to y_i (w^T x_i + b) ≥ 1 for all i • How to find the optimum ŵ*? • Solved by the Lagrange multiplier method
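
To make this concrete, here is a minimal sketch that solves the quadratic program above on a toy separable dataset using cvxpy; the solver choice and the data are assumptions, not part of the lecture.

```python
import numpy as np
import cvxpy as cp

# Toy linearly separable data: rows of X are points, y holds labels in {+1, -1}.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()
objective = cp.Minimize(0.5 * cp.sum_squares(w))      # (1/2) ||w||^2
constraints = [cp.multiply(y, X @ w + b) >= 1]        # y_i (w^T x_i + b) >= 1 for all i
cp.Problem(objective, constraints).solve()

print(w.value, b.value)                  # the max-margin separator
print(1.0 / np.linalg.norm(w.value))     # the achieved margin 1 / ||w||
```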

  23. Lagrange multiplier

  24. Lagrangian • Consider the optimization problem: min_w f(w), subject to h_i(w) = 0 for all 1 ≤ i ≤ l • Lagrangian: ℒ(w, β) = f(w) + Σ_i β_i h_i(w), where the β_i's are called Lagrange multipliers

  25. Lagrangian • Consider the optimization problem: min_w f(w), subject to h_i(w) = 0 for all 1 ≤ i ≤ l • Solved by setting the derivatives of the Lagrangian to 0: ∂ℒ/∂w_i = 0; ∂ℒ/∂β_i = 0
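
A tiny worked example of this recipe (my own example, not from the slides): minimize f(w) = w_1² + w_2² subject to h(w) = w_1 + w_2 − 1 = 0. Setting the Lagrangian's derivatives to zero gives w_1 = w_2 = ½ and β = −1; the sympy sketch below reproduces this.

```python
import sympy as sp

w1, w2, beta = sp.symbols('w1 w2 beta')
L = w1**2 + w2**2 + beta * (w1 + w2 - 1)       # Lagrangian f(w) + beta * h(w)

# Stationary point: all partial derivatives of the Lagrangian equal to zero.
solution = sp.solve([sp.diff(L, w1), sp.diff(L, w2), sp.diff(L, beta)],
                    [w1, w2, beta], dict=True)
print(solution)    # [{beta: -1, w1: 1/2, w2: 1/2}]
```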

  26. Generalized Lagrangian • Consider the optimization problem: min_w f(w), subject to g_i(w) ≤ 0 for all 1 ≤ i ≤ k, and h_j(w) = 0 for all 1 ≤ j ≤ l • Generalized Lagrangian: ℒ(w, α, β) = f(w) + Σ_i α_i g_i(w) + Σ_j β_j h_j(w), where the α_i's and β_j's are called Lagrange multipliers

  27. Generalized Lagrangian • Consider the quantity: θ_P(w) := max_{α,β: α_i ≥ 0} ℒ(w, α, β) • Why? θ_P(w) = f(w) if w satisfies all the constraints, and +∞ if it does not • So minimizing f(w) is the same as minimizing θ_P(w): min_w f(w) = min_w θ_P(w) = min_w max_{α,β: α_i ≥ 0} ℒ(w, α, β)

  28. Lagrange duality • The primal problem: p* := min_w f(w) = min_w max_{α,β: α_i ≥ 0} ℒ(w, α, β) • The dual problem: d* := max_{α,β: α_i ≥ 0} min_w ℒ(w, α, β) • Always true: d* ≤ p*

  29. Lagrange duality • The primal problem: p* := min_w f(w) = min_w max_{α,β: α_i ≥ 0} ℒ(w, α, β) • The dual problem: d* := max_{α,β: α_i ≥ 0} min_w ℒ(w, α, β) • Interesting case: when do we have d* = p*?
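
A one-dimensional example where the answer is yes (my own example, not from the slides): minimize f(w) = w² subject to g(w) = 1 − w ≤ 0. The Lagrangian is ℒ(w, α) = w² + α(1 − w); minimizing over w gives w = α/2 and the dual function α − α²/4, which is maximized at α = 2 with value 1. The primal optimum is also 1 (at w = 1), so d* = p* here. The following slides give conditions (KKT, Slater) under which this equality holds.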

  30. Lagrange duality • Theorem: under proper conditions, there exist w*, α*, β* such that d* = ℒ(w*, α*, β*) = p*. Moreover, w*, α*, β* satisfy the Karush-Kuhn-Tucker (KKT) conditions: ∂ℒ/∂w_i = 0; α_i g_i(w) = 0; g_i(w) ≤ 0; h_j(w) = 0; α_i ≥ 0

  31. Lagrange duality • The same theorem, with the condition α_i g_i(w) = 0 highlighted: it is called dual complementarity

  32. Lagrange duality • The same theorem, with the remaining KKT conditions labeled: g_i(w) ≤ 0 and h_j(w) = 0 are the primal constraints, and α_i ≥ 0 are the dual constraints

  33. Lagrange duality • What are the proper conditions? • A set of conditions (Slater conditions): f and the g_i are convex, the h_j are affine, and there exists w satisfying all g_i(w) < 0 • There exist other sets of conditions • Check textbooks, e.g., Convex Optimization by Boyd and Vandenberghe

  34. SVM: optimization

  35. SVM: optimization • Optimization (Quadratic Programming): min_{w,b} ½‖w‖², subject to y_i (w^T x_i + b) ≥ 1 for all i • Generalized Lagrangian: ℒ(w, b, α) = ½‖w‖² − Σ_i α_i [y_i (w^T x_i + b) − 1], where α is the vector of Lagrange multipliers

  36. SVM: optimization • KKT conditions: ∂ℒ/∂w = 0 gives w = Σ_i α_i y_i x_i  (1); ∂ℒ/∂b = 0 gives 0 = Σ_i α_i y_i  (2) • Plug into ℒ: ℒ(w, b, α) = Σ_i α_i − ½ Σ_{ij} α_i α_j y_i y_j x_i^T x_j  (3), combined with 0 = Σ_i α_i y_i and α_i ≥ 0
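
Filling in the substitution behind (3) (my algebra, consistent with the slide): write S = Σ_{ij} α_i α_j y_i y_j x_i^T x_j. With w = Σ_i α_i y_i x_i from (1), the quadratic term is ½‖w‖² = ½S; the constraint term contributes −Σ_i α_i [y_i (w^T x_i + b) − 1] = −S − b Σ_i α_i y_i + Σ_i α_i = −S + Σ_i α_i, using (2) to drop the b term. Adding the two pieces gives Σ_i α_i − ½S, which is exactly (3).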

  37. SVM: optimization • Reduces to the dual problem: maximize over α the objective Σ_i α_i − ½ Σ_{ij} α_i α_j y_i y_j x_i^T x_j, subject to Σ_i α_i y_i = 0 and α_i ≥ 0 • Note that the objective only depends on inner products x_i^T x_j • Since w = Σ_i α_i y_i x_i, we have w^T x + b = Σ_i α_i y_i x_i^T x + b
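
A minimal sketch of solving this dual numerically and recovering (w, b) from the KKT conditions; the toy data and the choice of scipy's SLSQP solver are assumptions, not part of the lecture (a dedicated QP solver would normally be used).

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
G = (y[:, None] * X) @ (y[:, None] * X).T       # G_ij = y_i y_j x_i^T x_j (inner products only)

def neg_dual(alpha):
    # Negated dual objective, so a minimizer can perform the maximization.
    return 0.5 * alpha @ G @ alpha - alpha.sum()

res = minimize(neg_dual, np.zeros(len(y)), method='SLSQP',
               bounds=[(0.0, None)] * len(y),                         # alpha_i >= 0
               constraints=[{'type': 'eq', 'fun': lambda a: a @ y}])  # sum_i alpha_i y_i = 0

alpha = res.x
w = (alpha * y) @ X                             # KKT condition (1): w = sum_i alpha_i y_i x_i
sv = alpha > 1e-6                               # instances with alpha_i > 0
b = np.mean(y[sv] - X[sv] @ w)                  # from y_i (w^T x_i + b) = 1 on those instances
print(alpha, w, b)
```

The dual touches the data only through the inner products collected in G, matching the slide's observation, and the recovered (w, b) should agree with the primal solution computed earlier.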

  38. Support Vectors • the final solution is a sparse linear combination of the training instances • those instances with α_i > 0 are called support vectors • they lie on the margin boundary • the solution is NOT changed if we delete the instances with α_i = 0
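
A sketch of the same point using scikit-learn's SVC (an assumed tool, not used in the lecture): with a linear kernel and a large C it approximates the hard-margin SVM, exposes the support vector indices, and refitting on the support vectors alone should leave the separator essentially unchanged.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -3.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel='linear', C=1e6).fit(X, y)     # large C approximates the hard margin
print(clf.support_)                             # indices of instances with alpha_i > 0
print(clf.coef_, clf.intercept_)                # w and b

# Deleting the non-support instances and refitting gives (essentially) the same separator.
clf_sv = SVC(kernel='linear', C=1e6).fit(X[clf.support_], y[clf.support_])
print(np.allclose(clf.coef_, clf_sv.coef_, atol=1e-6),
      np.allclose(clf.intercept_, clf_sv.intercept_, atol=1e-6))
```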
