Lecture 5: SVM II. Princeton University COS 495. Instructor: Yingyu Liang.



SLIDE 1

Machine Learning Basics Lecture 5: SVM II

Princeton University COS 495 Instructor: Yingyu Liang

SLIDE 2

Review: SVM objective

SLIDE 3

SVM: objective

  • Let $z_j \in \{+1, -1\}$ and $g_{x,c}(y) = x^T y + c$. Margin:

$$\delta = \min_j \frac{z_j \, g_{x,c}(y_j)}{\|x\|}$$

  • Support Vector Machine:

$$\max_{x,c} \delta = \max_{x,c} \min_j \frac{z_j \, g_{x,c}(y_j)}{\|x\|}$$

SLIDE 4

SVM: optimization

  • Optimization (Quadratic Programming):

$$\min_{x,c} \; \frac{1}{2}\|x\|^2 \quad \text{s.t.} \quad z_j (x^T y_j + c) \ge 1, \;\; \forall j$$

  • Solved by the Lagrange multiplier method:

$$\mathcal{L}(x, c, \boldsymbol{\beta}) = \frac{1}{2}\|x\|^2 - \sum_j \beta_j \left[ z_j (x^T y_j + c) - 1 \right]$$

where $\boldsymbol{\beta}$ is the vector of Lagrange multipliers

SLIDE 5

Lagrange multiplier

SLIDE 6

Lagrangian

  • Consider the optimization problem:

$$\min_x \; g(x) \quad \text{s.t.} \quad h_j(x) = 0, \;\; \forall\, 1 \le j \le m$$

  • Lagrangian:

$$\mathcal{L}(x, \boldsymbol{\gamma}) = g(x) + \sum_j \gamma_j h_j(x)$$

where the $\gamma_j$'s are called Lagrange multipliers

SLIDE 7

Lagrangian

  • Consider the optimization problem:

$$\min_x \; g(x) \quad \text{s.t.} \quad h_j(x) = 0, \;\; \forall\, 1 \le j \le m$$

  • Solved by setting the derivatives of the Lagrangian to 0:

$$\frac{\partial \mathcal{L}}{\partial x_j} = 0; \quad \frac{\partial \mathcal{L}}{\partial \gamma_j} = 0$$

SLIDE 8

Generalized Lagrangian

  • Consider the optimization problem:

$$\min_x \; g(x) \quad \text{s.t.} \quad q_j(x) \le 0, \;\; \forall\, 1 \le j \le l, \qquad h_k(x) = 0, \;\; \forall\, 1 \le k \le m$$

  • Generalized Lagrangian:

$$\mathcal{L}(x, \boldsymbol{\beta}, \boldsymbol{\gamma}) = g(x) + \sum_j \beta_j q_j(x) + \sum_k \gamma_k h_k(x)$$

where the $\beta_j$'s and $\gamma_k$'s are called Lagrange multipliers

SLIDE 9

Generalized Lagrangian

  • Consider the quantity:

$$\iota_Q(x) := \max_{\boldsymbol{\beta},\boldsymbol{\gamma}:\, \beta_j \ge 0} \mathcal{L}(x, \boldsymbol{\beta}, \boldsymbol{\gamma})$$

  • Why?

$$\iota_Q(x) = \begin{cases} g(x), & \text{if } x \text{ satisfies all the constraints} \\ +\infty, & \text{if } x \text{ does not satisfy the constraints} \end{cases}$$

  • So minimizing $g(x)$ is the same as minimizing $\iota_Q(x)$:

$$\min_x g(x) = \min_x \iota_Q(x) = \min_x \max_{\boldsymbol{\beta},\boldsymbol{\gamma}:\, \beta_j \ge 0} \mathcal{L}(x, \boldsymbol{\beta}, \boldsymbol{\gamma})$$

SLIDE 10

Lagrange duality

  • The primal problem:

$$q^* := \min_x g(x) = \min_x \max_{\boldsymbol{\beta},\boldsymbol{\gamma}:\, \beta_j \ge 0} \mathcal{L}(x, \boldsymbol{\beta}, \boldsymbol{\gamma})$$

  • The dual problem:

$$u^* := \max_{\boldsymbol{\beta},\boldsymbol{\gamma}:\, \beta_j \ge 0} \min_x \mathcal{L}(x, \boldsymbol{\beta}, \boldsymbol{\gamma})$$

  • Always true:

$$u^* \le q^*$$

SLIDE 11

Lagrange duality

  • The primal problem:

$$q^* := \min_x g(x) = \min_x \max_{\boldsymbol{\beta},\boldsymbol{\gamma}:\, \beta_j \ge 0} \mathcal{L}(x, \boldsymbol{\beta}, \boldsymbol{\gamma})$$

  • The dual problem:

$$u^* := \max_{\boldsymbol{\beta},\boldsymbol{\gamma}:\, \beta_j \ge 0} \min_x \mathcal{L}(x, \boldsymbol{\beta}, \boldsymbol{\gamma})$$

  • Interesting case: when do we have $u^* = q^*$?

SLIDE 12

Lagrange duality

  • Theorem: under proper conditions, there exist $x^*, \boldsymbol{\beta}^*, \boldsymbol{\gamma}^*$ such that

$$u^* = \mathcal{L}(x^*, \boldsymbol{\beta}^*, \boldsymbol{\gamma}^*) = q^*$$

Moreover, $x^*, \boldsymbol{\beta}^*, \boldsymbol{\gamma}^*$ satisfy the Karush-Kuhn-Tucker (KKT) conditions:

$$\frac{\partial \mathcal{L}}{\partial x_j} = 0, \quad \beta_j q_j(x) = 0, \quad q_j(x) \le 0, \quad h_k(x) = 0, \quad \beta_j \ge 0$$

SLIDE 13

Lagrange duality

  • Theorem: under proper conditions, there exist $x^*, \boldsymbol{\beta}^*, \boldsymbol{\gamma}^*$ such that

$$u^* = \mathcal{L}(x^*, \boldsymbol{\beta}^*, \boldsymbol{\gamma}^*) = q^*$$

Moreover, $x^*, \boldsymbol{\beta}^*, \boldsymbol{\gamma}^*$ satisfy the Karush-Kuhn-Tucker (KKT) conditions:

$$\frac{\partial \mathcal{L}}{\partial x_j} = 0, \quad \underbrace{\beta_j q_j(x) = 0}_{\text{dual complementarity}}, \quad q_j(x) \le 0, \quad h_k(x) = 0, \quad \beta_j \ge 0$$

SLIDE 14

Lagrange duality

  • Theorem: under proper conditions, there exist $x^*, \boldsymbol{\beta}^*, \boldsymbol{\gamma}^*$ such that

$$u^* = \mathcal{L}(x^*, \boldsymbol{\beta}^*, \boldsymbol{\gamma}^*) = q^*$$

  • Moreover, $x^*, \boldsymbol{\beta}^*, \boldsymbol{\gamma}^*$ satisfy the Karush-Kuhn-Tucker (KKT) conditions:

$$\frac{\partial \mathcal{L}}{\partial x_j} = 0, \quad \beta_j q_j(x) = 0, \quad \underbrace{q_j(x) \le 0, \;\; h_k(x) = 0}_{\text{primal constraints}}, \quad \underbrace{\beta_j \ge 0}_{\text{dual constraints}}$$

SLIDE 15

Lagrange duality

  • What are the proper conditions?
  • One set of conditions (the Slater conditions):
    • $g$ and the $q_j$ are convex, the $h_k$ are affine
    • there exists an $x$ satisfying all $q_j(x) < 0$ strictly
  • There exist other sets of sufficient conditions
  • Search for the Karush-Kuhn-Tucker conditions on Wikipedia
SLIDE 16

SVM: optimization

SLIDE 17

SVM: optimization

  • Optimization (Quadratic Programming):

$$\min_{x,c} \; \frac{1}{2}\|x\|^2 \quad \text{s.t.} \quad z_j (x^T y_j + c) \ge 1, \;\; \forall j$$

  • Generalized Lagrangian:

$$\mathcal{L}(x, c, \boldsymbol{\beta}) = \frac{1}{2}\|x\|^2 - \sum_j \beta_j \left[ z_j (x^T y_j + c) - 1 \right]$$

where $\boldsymbol{\beta}$ is the vector of Lagrange multipliers (a small numerical sketch of this QP follows below)
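A minimal sketch of this quadratic program, assuming the cvxpy package is available and reusing the toy X, z from the margin example earlier; the data and variable names are illustrative only.

```python
import cvxpy as cp
import numpy as np

# Toy separable data: rows of X are points y_j, z holds labels in {+1, -1}.
X = np.array([[2.0, 2.0], [1.5, 2.5], [-1.0, -1.0], [-2.0, -0.5]])
z = np.array([+1, +1, -1, -1])

# Hard-margin SVM primal: min (1/2)||x||^2  s.t.  z_j (x^T y_j + c) >= 1.
x = cp.Variable(2)
c = cp.Variable()
objective = cp.Minimize(0.5 * cp.sum_squares(x))
constraints = [cp.multiply(z, X @ x + c) >= 1]
cp.Problem(objective, constraints).solve()

print("x =", x.value, " c =", c.value)
```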

SLIDE 18

SVM: optimization

  • KKT conditions:

$$\frac{\partial \mathcal{L}}{\partial x} = 0 \;\Rightarrow\; x = \sum_j \beta_j z_j y_j \quad (1)$$

$$\frac{\partial \mathcal{L}}{\partial c} = 0 \;\Rightarrow\; 0 = \sum_j \beta_j z_j \quad (2)$$

  • Plug into $\mathcal{L}$:

$$\mathcal{L}(x, c, \boldsymbol{\beta}) = \sum_j \beta_j - \frac{1}{2}\sum_{j,k} \beta_j \beta_k z_j z_k \, y_j^T y_k \quad (3)$$

combined with $0 = \sum_j \beta_j z_j$, $\beta_j \ge 0$

SLIDE 19

SVM: optimization

  • Reduces to the dual problem:

$$\max_{\boldsymbol{\beta}} \;\; \sum_j \beta_j - \frac{1}{2}\sum_{j,k} \beta_j \beta_k z_j z_k \, y_j^T y_k \quad \text{s.t.} \quad \sum_j \beta_j z_j = 0, \;\; \beta_j \ge 0$$

  • Since $x = \sum_j \beta_j z_j y_j$, we have $x^T y + c = \sum_j \beta_j z_j \, y_j^T y + c$

  • Both the dual objective and the resulting classifier depend on the data only through inner products
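A hedged sketch with scikit-learn's SVC (assuming scikit-learn is available; the toy data is mine, and a very large C is used to approximate the hard margin). Its dual_coef_ attribute stores $\beta_j z_j$ for the support vectors, so $x$ can be recovered from the dual solution, matching equation (1).

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [1.5, 2.5], [-1.0, -1.0], [-2.0, -0.5]])
z = np.array([+1, +1, -1, -1])

# Linear-kernel SVM; a very large C approximates the hard-margin problem.
clf = SVC(kernel="linear", C=1e6).fit(X, z)

# dual_coef_ holds beta_j * z_j for the support vectors, so
# x = sum_j beta_j z_j y_j can be recovered from the dual solution.
x = clf.dual_coef_ @ clf.support_vectors_
print("x from dual:", x.ravel(), "  c:", clf.intercept_)
print("matches coef_:", np.allclose(x, clf.coef_))
```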

SLIDE 20

Kernel methods

SLIDE 21

Features

[Figure: extract features from an input $y$, e.g. a color histogram over the red, green, and blue channels, giving a feature vector $\varphi(y)$]

SLIDE 22

Features

SLIDE 23

Features

  • A proper feature mapping can turn a non-linear problem into a linear one
  • Using SVM on the feature space $\{\varphi(y_j)\}$: only the inner products $\varphi(y_j)^T \varphi(y_k)$ are needed
  • Conclusion: no need to design $\varphi(\cdot)$ explicitly, only need to design the kernel

$$l(y_j, y_k) = \varphi(y_j)^T \varphi(y_k)$$
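A short sketch of this idea with scikit-learn (assumed available; data and kernel choice are my own illustration): the classifier is trained from a kernel (Gram) matrix alone, never from explicit features $\varphi(y)$.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [1.5, 2.5], [-1.0, -1.0], [-2.0, -0.5]])
z = np.array([+1, +1, -1, -1])

def kernel(A, B):
    # Example kernel l(y, y') = (y^T y' + 1)^2; only inner products are used.
    return (A @ B.T + 1.0) ** 2

# Train from the Gram matrix K[j, k] = l(y_j, y_k), without explicit features.
clf = SVC(kernel="precomputed").fit(kernel(X, X), z)

# Predicting on new points also needs only kernel values against training points.
X_new = np.array([[1.0, 1.0], [-1.5, -1.0]])
print(clf.predict(kernel(X_new, X)))
```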

SLIDE 24

Polynomial kernels

  • Fix a degree $e$ and a constant $d$:

$$l(y, y') = (y^T y' + d)^e$$

  • What is $\varphi(y)$?
  • Expand the expression to read off $\varphi(y)$ (a quick check follows below)
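A quick numpy check (my own illustration) for degree $e = 2$ in two dimensions: expanding $(y^T y' + d)^2$ gives the explicit feature map $\varphi(y) = (y_1^2,\, y_2^2,\, \sqrt{2}\,y_1 y_2,\, \sqrt{2d}\,y_1,\, \sqrt{2d}\,y_2,\, d)$, and the two ways of computing the kernel agree.

```python
import numpy as np

d = 1.0  # the constant in the polynomial kernel

def phi(y):
    # Explicit feature map for the degree-2 polynomial kernel in 2D.
    y1, y2 = y
    return np.array([y1**2, y2**2, np.sqrt(2) * y1 * y2,
                     np.sqrt(2 * d) * y1, np.sqrt(2 * d) * y2, d])

rng = np.random.default_rng(0)
y, y_prime = rng.normal(size=2), rng.normal(size=2)

kernel_value = (y @ y_prime + d) ** 2
explicit_value = phi(y) @ phi(y_prime)
print(kernel_value, explicit_value)        # the two numbers coincide
assert np.isclose(kernel_value, explicit_value)
```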
SLIDE 25

Polynomial kernels

Figure from Foundations of Machine Learning, by M. Mohri, A. Rostamizadeh, and A. Talwalkar

SLIDE 26

Figure from Foundations of Machine Learning, by M. Mohri, A. Rostamizadeh, and A. Talwalkar

SLIDE 27

Gaussian kernels

  • Fix a bandwidth $\tau$:

$$l(y, y') = \exp\!\left(-\|y - y'\|^2 / 2\tau^2\right)$$

  • Also called radial basis function (RBF) kernels
  • What is $\varphi(y)$? Consider the un-normalized version

$$l'(y, y') = \exp\!\left(y^T y' / \tau^2\right)$$

  • Power series expansion:

$$l'(y, y') = \sum_{j=0}^{+\infty} \frac{(y^T y')^j}{\tau^{2j} \, j!}$$

SLIDE 28

Mercer’s condition for kernels

  • Theorem: $l(y, y')$ has an expansion

$$l(y, y') = \sum_{j=1}^{+\infty} c_j \, \varphi_j(y) \, \varphi_j(y')$$

if and only if for any function $d(y)$,

$$\int\!\!\int d(y)\, d(y')\, l(y, y')\, \mathrm{d}y\, \mathrm{d}y' \ge 0$$

(omitting some technical conditions on $l$ and $d$)

SLIDE 29

Constructing new kernels

  • Kernels are closed under positive scaling, sum, product, pointwise limit, and composition with a power series $\sum_{j=0}^{+\infty} b_j \, l(y, y')^j$ (with $b_j \ge 0$)
  • Example: if $l_1(y, y')$ and $l_2(y, y')$ are kernels, then so is

$$l(y, y') = 2\, l_1(y, y') + 3\, l_2(y, y')$$

  • Example: if $l_1(y, y')$ is a kernel, then so is

$$l(y, y') = \exp(l_1(y, y'))$$
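A small numpy sanity check of the two examples (my own illustration): Gram matrices built from $2 l_1 + 3 l_2$ and from $\exp(l_1)$ stay positive semidefinite on random data.

```python
import numpy as np

rng = np.random.default_rng(3)
Y = rng.normal(size=(15, 4))

K1 = Y @ Y.T                   # linear kernel l1(y, y') = y^T y'
K2 = (Y @ Y.T + 1.0) ** 2      # polynomial kernel l2(y, y') = (y^T y' + 1)^2

def is_psd(K):
    # Positive semidefinite up to numerical tolerance.
    return np.linalg.eigvalsh(K).min() >= -1e-8

print(is_psd(2 * K1 + 3 * K2))   # positive combination of kernels
print(is_psd(np.exp(K1)))        # elementwise exp, a power series in l1
```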

SLIDE 30

Kernels vs. neural networks

SLIDE 31

Features

[Figure: extract features (e.g. a color histogram over the red, green, and blue channels) from an input $y$, then build the hypothesis $z = x^T \varphi(y)$]

SLIDE 32

Features: part of the model

𝑧 = π‘₯π‘ˆπœš 𝑦

build hypothesis

Linear model Nonlinear model

SLIDE 33

Polynomial kernels

Figure from Foundations of Machine Learning, by M. Mohri, A. Rostamizadeh, and A. Talwalkar

SLIDE 34

Polynomial kernel SVM as a two-layer neural network

[Figure: network with inputs $y_1, y_2$, a fixed first layer computing $\varphi(y) = (y_1^2,\; y_2^2,\; \sqrt{2}\, y_1 y_2,\; \sqrt{2d}\, y_1,\; \sqrt{2d}\, y_2,\; d)$, and output $z = \mathrm{sign}(x^T \varphi(y) + c)$]

The first layer is fixed. If the first layer is also learned, this becomes a two-layer neural network.
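A minimal numpy sketch of this view (my own illustration): the fixed first layer is the explicit degree-2 feature map, and the second layer is the linear classifier $(x, c)$; the weights below are illustrative placeholders, not trained values.

```python
import numpy as np

d = 1.0  # constant of the degree-2 polynomial kernel

def first_layer(y):
    # Fixed first layer: the explicit feature map of the degree-2 kernel.
    y1, y2 = y
    return np.array([y1**2, y2**2, np.sqrt(2) * y1 * y2,
                     np.sqrt(2 * d) * y1, np.sqrt(2 * d) * y2, d])

def predict(y, x, c):
    # Second layer: linear classifier on top, z = sign(x^T phi(y) + c).
    return np.sign(x @ first_layer(y) + c)

# Illustrative second-layer weights; in an SVM these come from training,
# while a two-layer neural network would also learn the first layer.
x = np.array([1.0, 1.0, -0.5, 0.2, 0.2, -0.3])
c = 0.1
print(predict(np.array([1.0, -2.0]), x, c))
```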