  1. Machine Learning Basics, Lecture 4: SVM I. Princeton University COS 495. Instructor: Yingyu Liang

  2. Review: machine learning basics

  3. Math formulation
     • Given training data $(x_i, y_i)$, $1 \le i \le n$, i.i.d. from distribution $D$
     • Find $y = f(x) \in \mathcal{H}$ that minimizes $\hat{L}(f) = \frac{1}{n}\sum_{i=1}^{n} l(f, x_i, y_i)$
     • s.t. the expected loss is small: $L(f) = \mathbb{E}_{(x,y)\sim D}[\, l(f, x, y) \,]$
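To make the two quantities concrete, here is a minimal numpy sketch (made-up data and a squared loss, purely illustrative) of the empirical loss $\hat{L}(f)$ as an average over the training sample; the expected loss $L(f)$ is the same average taken over the whole distribution $D$, which we can only estimate.

```python
import numpy as np

# Minimal sketch: empirical loss of a linear predictor f(x) = w^T x
# under squared loss, i.e. L_hat(f) = (1/n) * sum_i (f(x_i) - y_i)^2.
# Data and weights below are made up for illustration.

def empirical_loss(w, X, y):
    """Average squared loss of f(x) = x^T w over the sample (X, y)."""
    return np.mean((X @ w - y) ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # n = 100 points, 3 features
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=100)

print(empirical_loss(w_true, X, y))      # roughly the noise variance, ~0.01
```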

  4. Machine learning 1-2-3
     • Collect data and extract features
     • Build model: choose hypothesis class $\mathcal{H}$ and loss function $l$
     • Optimization: minimize the empirical loss

  5. Loss function
     • $l_2$ loss: linear regression
     • Cross-entropy: logistic regression
     • Hinge loss: Perceptron
     • General principle: maximum likelihood estimation (MLE)
       • $l_2$ loss: corresponds to a Normal distribution
       • Logistic regression: corresponds to a sigmoid conditional distribution
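For concreteness, a small Python sketch (not from the slides) of the per-example losses named above, for a linear score $f(x) = w^T x$ and labels $y \in \{+1, -1\}$ (real-valued $y$ for the squared loss). Note the slide pairs "hinge loss" with the Perceptron; the classical Perceptron criterion is the unshifted variant $\max(0, -y f(x))$ of the standard hinge $\max(0, 1 - y f(x))$.

```python
import numpy as np

def squared_loss(score, y):
    """l2 loss used for linear regression: (f(x) - y)^2."""
    return (score - y) ** 2

def logistic_loss(score, y):
    """Cross-entropy for logistic regression with y in {+1, -1}:
    -log sigmoid(y * f(x)) = log(1 + exp(-y * f(x)))."""
    return np.log1p(np.exp(-y * score))

def hinge_loss(score, y, margin=1.0):
    """max(0, margin - y * f(x)); margin=0 gives the Perceptron criterion."""
    return np.maximum(0.0, margin - y * score)

score = 0.8  # a hypothetical f(x) = w^T x
print(squared_loss(score, 1.0), logistic_loss(score, +1), hinge_loss(score, +1))
```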

  6. Optimization
     • Linear regression: closed form solution
     • Logistic regression: gradient descent
     • Perceptron: stochastic gradient descent (SGD)
     • General principle: local improvement
     • SGD: Perceptron; can also be applied to linear regression/logistic regression
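As a concrete instance of the "local improvement" principle, a minimal SGD loop for logistic regression (a sketch assuming numpy arrays, not the lecture's code); the per-example gradient of $\log(1 + e^{-y\, w^T x})$ with respect to $w$ is $-y\,x / (1 + e^{y\, w^T x})$.

```python
import numpy as np

def sgd_logistic_regression(X, y, lr=0.1, epochs=20, seed=0):
    """Stochastic gradient descent on the logistic loss.
    X: (n, d) feature matrix, y: (n,) labels in {+1, -1}."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):          # one pass in random order
            margin = y[i] * (w @ X[i])
            grad = -y[i] * X[i] / (1.0 + np.exp(margin))
            w -= lr * grad                    # local improvement step
    return w
```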

  7. Principle for hypothesis class?
     • Yes, there exists a general principle (at least philosophically)
     • Different names/faces/connections:
       • Occam's razor
       • VC dimension theory
       • Minimum description length
       • Tradeoff between bias and variance; uniform convergence
       • The curse of dimensionality
     • Running example: Support Vector Machine (SVM)

  8. Motivation

  9. Linear classification
     (Figure: separating hyperplane $(w^*)^T x = 0$ with normal vector $w^*$; $(w^*)^T x > 0$ on the Class +1 side, $(w^*)^T x < 0$ on the Class -1 side)
     • Assume perfect separation between the two classes

  10. Attempt
     • Given training data $(x_i, y_i)$, $1 \le i \le n$, i.i.d. from distribution $D$
     • Hypothesis $y = \text{sign}(f_w(x)) = \text{sign}(w^T x)$
       • $y = +1$ if $w^T x > 0$
       • $y = -1$ if $w^T x < 0$
     • Let's assume that we can optimize to find $w$
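The hypothesis is just a linear threshold function; in numpy (a sketch, with ties at exactly $w^T x = 0$ arbitrarily mapped to -1):

```python
import numpy as np

def predict(w, X):
    """Hypothesis y = sign(w^T x), returning +1 or -1 per row of X."""
    return np.where(X @ w > 0, 1, -1)
```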

  11. Multiple optimal solutions?
     (Figure: three separators $w_1$, $w_2$, $w_3$ between Class +1 and Class -1)
     • Same on empirical loss; different on test/expected loss

  12. What about $w_1$?
     (Figure: new test data near $w_1$, Class +1 vs. Class -1)

  13. What about $w_3$?
     (Figure: new test data near $w_3$, Class +1 vs. Class -1)

  14. Most confident: $w_2$
     (Figure: new test data, Class +1 vs. Class -1)

  15. Intuition: margin
     (Figure: $w_2$ separates Class +1 and Class -1 with a large margin)

  16. Margin

  17. Margin
     • Lemma 1: $x$ has distance $\frac{|f_w(x)|}{\|w\|}$ to the hyperplane $f_w(x) = w^T x = 0$
     Proof:
     • $w$ is orthogonal to the hyperplane
     • The unit direction is $\frac{w}{\|w\|}$
     • The projection of $x$ onto it is $\left(\frac{w}{\|w\|}\right)^T x = \frac{f_w(x)}{\|w\|}$

  18. Margin: with bias
     • Claim 1: $w$ is orthogonal to the hyperplane $f_{w,b}(x) = w^T x + b = 0$
     Proof:
     • Pick any $x_1$ and $x_2$ on the hyperplane
     • $w^T x_1 + b = 0$
     • $w^T x_2 + b = 0$
     • So $w^T (x_1 - x_2) = 0$

  19. Margin: with bias
     • Claim 2: the origin $0$ has (signed) distance $\frac{-b}{\|w\|}$ to the hyperplane $w^T x + b = 0$
     Proof:
     • Pick any $x_1$ on the hyperplane
     • Project $x_1$ onto the unit direction $\frac{w}{\|w\|}$ to get the distance
     • $\left(\frac{w}{\|w\|}\right)^T x_1 = \frac{-b}{\|w\|}$ since $w^T x_1 + b = 0$

  20. Margin: with bias
     • Lemma 2: $x$ has distance $\frac{|f_{w,b}(x)|}{\|w\|}$ to the hyperplane $f_{w,b}(x) = w^T x + b = 0$
     Proof:
     • Let $x = x_\perp + r \frac{w}{\|w\|}$ with $x_\perp$ on the hyperplane; then $|r|$ is the distance
     • Multiply both sides by $w^T$ and add $b$
     • Left hand side: $w^T x + b = f_{w,b}(x)$
     • Right hand side: $w^T x_\perp + r \frac{w^T w}{\|w\|} + b = 0 + r\|w\|$
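A quick numerical sanity check of Lemma 2 with made-up numbers (a sketch, assuming numpy): the formula $|f_{w,b}(x)|/\|w\|$ matches the distance from $x$ to its orthogonal projection onto the hyperplane.

```python
import numpy as np

w = np.array([3.0, 4.0])   # hypothetical normal vector
b = -2.0                   # hypothetical bias
x = np.array([1.0, 5.0])   # hypothetical query point

# Distance via Lemma 2: |w^T x + b| / ||w||
dist_lemma = abs(w @ x + b) / np.linalg.norm(w)

# Distance via the explicit orthogonal projection onto the hyperplane:
# x_perp = x - ((w^T x + b) / ||w||^2) * w satisfies w^T x_perp + b = 0.
x_perp = x - ((w @ x + b) / (w @ w)) * w
dist_proj = np.linalg.norm(x - x_perp)

print(dist_lemma, dist_proj)   # both 4.2
```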

  21. The notation here is $y(x) = w^T x + w_0$. Figure from Pattern Recognition and Machine Learning, Bishop.

  22. Support Vector Machine (SVM)

  23. SVM: objective
     • Margin over all training data points: $\gamma = \min_i \frac{|f_{w,b}(x_i)|}{\|w\|}$
     • Since we only want a correct $f_{w,b}$, and recall $y_i \in \{+1, -1\}$, we have $\gamma = \min_i \frac{y_i f_{w,b}(x_i)}{\|w\|}$
     • If $f_{w,b}$ is incorrect on some $x_i$, the margin is negative
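The margin $\gamma$ of a candidate $(w, b)$ is easy to compute directly; a minimal sketch with made-up separable data (assuming numpy):

```python
import numpy as np

def margin(w, b, X, y):
    """Signed margin min_i y_i (w^T x_i + b) / ||w||.
    Negative if (w, b) misclassifies some training point."""
    return np.min(y * (X @ w + b)) / np.linalg.norm(w)

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([+1, +1, -1, -1])
print(margin(np.array([1.0, 1.0]), 0.0, X, y))   # positive: all points correct
```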

  24. SVM: objective
     • Maximize margin over all training data points:
       $\max_{w,b} \gamma = \max_{w,b} \min_i \frac{y_i f_{w,b}(x_i)}{\|w\|} = \max_{w,b} \min_i \frac{y_i (w^T x_i + b)}{\|w\|}$
     • A bit complicated ...

  25. SVM: simplified objective
     • Observation: when $(w, b)$ is scaled by a factor $c$, the margin is unchanged:
       $\frac{y_i(c\,w^T x_i + c\,b)}{\|c\,w\|} = \frac{y_i(w^T x_i + b)}{\|w\|}$
     • Let's consider a fixed scale such that $y_{i^*}(w^T x_{i^*} + b) = 1$, where $x_{i^*}$ is the point closest to the hyperplane
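Writing out the one intermediate step behind the observation, for a positive scale $c$:

$$\frac{y_i\,(c\,w^T x_i + c\,b)}{\|c\,w\|} \;=\; \frac{c\,\big(y_i\,(w^T x_i + b)\big)}{c\,\|w\|} \;=\; \frac{y_i\,(w^T x_i + b)}{\|w\|}.$$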

  26. SVM: simplified objective
     • Let's consider a fixed scale such that $y_{i^*}(w^T x_{i^*} + b) = 1$, where $x_{i^*}$ is the point closest to the hyperplane
     • Now we have $y_i(w^T x_i + b) \ge 1$ for all data, and the equality holds for at least one $i$
     • Then the margin is $\frac{1}{\|w\|}$

  27. SVM: simplified objective
     • Optimization simplified to
       $\min_{w,b} \frac{1}{2}\|w\|^2$
       s.t. $y_i(w^T x_i + b) \ge 1, \; \forall i$
     • How to find the optimum $\hat{w}^*$?
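One direct way to compute $\hat{w}^*$ numerically is to hand the quadratic program to a convex solver; below is a sketch assuming the cvxpy package and linearly separable data (the lecture instead reasons about the optimum via the thought experiment and the Lagrange multiplier method on the following slides).

```python
import cvxpy as cp
import numpy as np

def hard_margin_svm(X, y):
    """Solve min 1/2 ||w||^2  s.t.  y_i (w^T x_i + b) >= 1 for all i.
    X: (n, d) array, y: (n,) array with entries in {+1, -1}.
    Assumes the data are linearly separable, else the QP is infeasible."""
    n, d = X.shape
    w = cp.Variable(d)
    b = cp.Variable()
    constraints = [cp.multiply(y, X @ w + b) >= 1]
    problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints)
    problem.solve()
    return w.value, b.value

# Hypothetical separable toy data
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w_hat, b_hat = hard_margin_svm(X, y)
print(w_hat, b_hat, 1.0 / np.linalg.norm(w_hat))   # last value = margin
```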

  28. SVM: principle for hypothesis class

  29. Thought experiment
     • Suppose we pick an $R$, and suppose we can decide whether there exists a $w$ satisfying
       $\frac{1}{2}\|w\|^2 \le R$, $\quad y_i(w^T x_i + b) \ge 1, \; \forall i$
     • Decrease $R$ until we cannot find a $w$ satisfying the inequalities
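The thought experiment can be simulated literally: a feasibility check for a given $R$, then shrink $R$ until the constraints become unsatisfiable. A rough sketch (my own, assuming cvxpy and a bisection search; the slide only describes the idea):

```python
import cvxpy as cp
import numpy as np

def feasible(R, X, y):
    """Is there (w, b) with 1/2 ||w||^2 <= R and y_i (w^T x_i + b) >= 1 for all i?"""
    n, d = X.shape
    w, b = cp.Variable(d), cp.Variable()
    constraints = [0.5 * cp.sum_squares(w) <= R,
                   cp.multiply(y, X @ w + b) >= 1]
    prob = cp.Problem(cp.Minimize(0), constraints)   # pure feasibility problem
    prob.solve()
    return prob.status == cp.OPTIMAL

def smallest_R(X, y, R_hi=1e3, tol=1e-4):
    """Bisection on R. Assumes R_hi is large enough to be feasible
    (i.e., the data are separable). The limit equals 1/2 ||w_hat*||^2."""
    R_lo = 0.0
    while R_hi - R_lo > tol:
        R_mid = 0.5 * (R_lo + R_hi)
        if feasible(R_mid, X, y):
            R_hi = R_mid
        else:
            R_lo = R_mid
    return R_hi
```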

  30-34. Thought experiment
     • $\hat{w}^*$ is the best weight (i.e., the one satisfying the smallest $R$)
     (Slides 30-34 repeat this point; the accompanying figures are not reproduced here)

  35. Thought experiment
     • To handle the difference between empirical and expected losses →
     • choose a large-margin hypothesis (high confidence) →
     • choose a small hypothesis class
     • $\hat{w}^*$ corresponds to the hypothesis class

  36. Thought experiment
     • Principle: use the smallest hypothesis class that still contains a correct/good hypothesis
     • Also true beyond SVM
     • Also true for the case without perfect separation between the two classes
     • Math formulation: VC-dimension theory, etc.
     • $\hat{w}^*$ corresponds to the hypothesis class

  37. Thought experiment
     • Principle: use the smallest hypothesis class that still contains a correct/good hypothesis
     • Whatever you know about the ground truth, add it as a constraint/regularizer
     • $\hat{w}^*$ corresponds to the hypothesis class

  38. SVM: optimization
     • Optimization (Quadratic Programming):
       $\min_{w,b} \frac{1}{2}\|w\|^2$
       s.t. $y_i(w^T x_i + b) \ge 1, \; \forall i$
     • Solved by the Lagrange multiplier method:
       $\mathcal{L}(w, b, \boldsymbol{\alpha}) = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i \left[ y_i(w^T x_i + b) - 1 \right]$
       where $\boldsymbol{\alpha}$ is the vector of Lagrange multipliers
     • Details in next lecture
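A standard first step with this Lagrangian (covered in detail next lecture; stated here only as a preview, not taken from this slide): setting the gradients of $\mathcal{L}$ with respect to $w$ and $b$ to zero gives

$$\frac{\partial \mathcal{L}}{\partial w} = w - \sum_i \alpha_i y_i x_i = 0 \;\Rightarrow\; w = \sum_i \alpha_i y_i x_i, \qquad \frac{\partial \mathcal{L}}{\partial b} = -\sum_i \alpha_i y_i = 0,$$

which is the starting point for the dual formulation.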

  39. Reading
     • Review the Lagrange multiplier method
     • E.g., Section 5 in Andrew Ng's note on SVM
     • Posted on the course website: http://www.cs.princeton.edu/courses/archive/spring16/cos495/
