Support Vector Machines
Here we approach the two-class classification problem in a direct way: we try to find a hyperplane that separates the classes in feature space. If we cannot, we get creative in two ways:
• We soften what we mean by "separates", and
• We enrich and enlarge the feature space so that separation is possible.
What is a Hyperplane?
• A hyperplane in p dimensions is a flat affine subspace of dimension p − 1.
• In general the equation for a hyperplane has the form β_0 + β_1 X_1 + β_2 X_2 + … + β_p X_p = 0.
• In p = 2 dimensions a hyperplane is a line.
• If β_0 = 0, the hyperplane goes through the origin; otherwise not.
• The vector β = (β_1, β_2, …, β_p) is called the normal vector: it points in a direction orthogonal to the hyperplane.
Hyperplane in 2 Dimensions
[Figure: the hyperplane β_1 X_1 + β_2 X_2 − 6 = 0 in the (X_1, X_2) plane, with β_1 = 0.8 and β_2 = 0.6. The normal vector β = (β_1, β_2) is orthogonal to the line; points off the hyperplane give nonzero values, e.g. β_1 X_1 + β_2 X_2 − 6 = 1.6 on one side and −4 on the other.]
Separating Hyperplanes
[Figure: two panels showing separating hyperplanes (lines) for two classes of points in the (X_1, X_2) plane.]
• If f(X) = β_0 + β_1 X_1 + … + β_p X_p, then f(X) > 0 for points on one side of the hyperplane, and f(X) < 0 for points on the other.
• If we code the colored points as Y_i = +1 for blue, say, and Y_i = −1 for mauve, then Y_i · f(X_i) > 0 for all i means that f(X) = 0 defines a separating hyperplane.
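As a quick illustration of the separation condition (not part of the original slides), here is a small R check using made-up coefficients and points:

```r
# Check whether a given hyperplane separates two labelled classes.
# Hypothetical coefficients and toy data, for illustration only.
beta0 <- -6
beta  <- c(0.8, 0.6)

X <- rbind(c(8, 4), c(9, 6), c(2, 1), c(3, 2))  # rows are observations
y <- c(+1, +1, -1, -1)                          # +1 = blue, -1 = mauve

f <- beta0 + X %*% beta        # f(x_i) for each observation
separates <- all(y * f > 0)    # TRUE if y_i * f(x_i) > 0 for all i
separates
```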
Maximal Margin Classifier
Among all separating hyperplanes, find the one that makes the biggest gap or margin between the two classes.
[Figure: two classes in the (X_1, X_2) plane with the maximal margin hyperplane and its margin boundaries.]
Constrained optimization problem:
maximize_{β_0, β_1, …, β_p} M
subject to Σ_{j=1}^p β_j² = 1,
y_i(β_0 + β_1 x_{i1} + … + β_p x_{ip}) ≥ M for all i = 1, …, N.
This can be rephrased as a convex quadratic program and solved efficiently. The function svm() in package e1071 solves this problem.
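A minimal R sketch of fitting (approximately) the maximal margin classifier with e1071::svm(); the data are made up, and a very large cost is used to mimic a hard margin:

```r
library(e1071)

# Toy separable data (made up for illustration).
set.seed(1)
x <- matrix(rnorm(40), ncol = 2)
y <- rep(c(-1, 1), each = 10)
x[y == 1, ] <- x[y == 1, ] + 3          # shift one class so the data separate
dat <- data.frame(x = x, y = as.factor(y))

# A very large cost approximates the maximal margin (hard-margin) classifier.
fit <- svm(y ~ ., data = dat, kernel = "linear", cost = 1e5, scale = FALSE)
summary(fit)
plot(fit, dat)        # decision boundary and support vectors
```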
Non-separable Data
[Figure: two overlapping classes in the (X_1, X_2) plane.]
The data on the left are not separable by a linear boundary. This is often the case, unless N < p.
Noisy Data
[Figure: two panels of separable but noisy data; adding a single observation dramatically shifts the maximal margin hyperplane.]
Sometimes the data are separable, but noisy. This can lead to a poor solution for the maximal-margin classifier.
The support vector classifier maximizes a soft margin.
Support Vector Classifier
[Figure: two panels showing the soft-margin classifier, with numbered observations, including some inside or on the wrong side of the margin.]
maximize_{β_0, β_1, …, β_p, ε_1, …, ε_n} M
subject to Σ_{j=1}^p β_j² = 1,
y_i(β_0 + β_1 x_{i1} + β_2 x_{i2} + … + β_p x_{ip}) ≥ M(1 − ε_i),
ε_i ≥ 0, Σ_{i=1}^n ε_i ≤ C.
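A hedged sketch of the soft-margin classifier in e1071. Note that svm()'s cost argument penalizes margin violations, so it behaves roughly as the inverse of the budget C in the formulation above; the data are made up, with the classes pushed closer together so that some slack is needed:

```r
library(e1071)

# Toy overlapping data (made up, for illustration).
set.seed(1)
x <- matrix(rnorm(40), ncol = 2)
y <- rep(c(-1, 1), each = 10)
x[y == 1, ] <- x[y == 1, ] + 1
dat <- data.frame(x = x, y = as.factor(y))

# e1071's `cost` penalizes margin violations, so roughly:
# large cost ~ small budget C (narrow margin), small cost ~ large budget C (wide margin).
fit_wide   <- svm(y ~ ., data = dat, kernel = "linear", cost = 0.1)
fit_narrow <- svm(y ~ ., data = dat, kernel = "linear", cost = 100)

length(fit_wide$index)    # indices of support vectors; more with a wide margin
length(fit_narrow$index)  # fewer support vectors with a narrow margin
```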
C is a regularization parameter
[Figure: four panels fitting the support vector classifier to the same data with different values of the budget C; as C decreases the margin narrows and fewer observations violate it.]
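In practice the regularization parameter is chosen by cross-validation. A sketch using e1071::tune() (10-fold CV by default), reusing the hypothetical dat from the previous sketch:

```r
library(e1071)

set.seed(1)
cv <- tune(svm, y ~ ., data = dat, kernel = "linear",
           ranges = list(cost = c(0.001, 0.01, 0.1, 1, 10, 100)))
summary(cv)           # CV error for each value of cost
best <- cv$best.model # refit at the cost with lowest CV error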
Linear boundary can fail
[Figure: two classes arranged so that no linear boundary separates them well.]
Sometimes a linear boundary simply won't work, no matter what value of C.
The example on the left is such a case.
What to do?
Feature Expansion
• Enlarge the space of features by including transformations; e.g. X_1², X_1³, X_1 X_2, X_1 X_2², …. Hence go from a p-dimensional space to an M > p dimensional space.
• Fit a support-vector classifier in the enlarged space.
• This results in non-linear decision boundaries in the original space.
Example: Suppose we use (X_1, X_2, X_1², X_2², X_1 X_2) instead of just (X_1, X_2). Then the decision boundary would be of the form
β_0 + β_1 X_1 + β_2 X_2 + β_3 X_1² + β_4 X_2² + β_5 X_1 X_2 = 0.
This leads to nonlinear decision boundaries in the original space (quadratic conic sections).
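A sketch of doing this quadratic expansion by hand in R (the column names x.1 and x.2 come from the made-up dat used in the earlier sketches); the following slides show that kernels accomplish the same thing without constructing these columns explicitly:

```r
library(e1071)

# Add the quadratic and interaction features to the original two predictors.
dat2 <- transform(dat,
                  x.1.sq = x.1^2,
                  x.2.sq = x.2^2,
                  x.1x.2 = x.1 * x.2)

# A linear support vector classifier in the enlarged 5-dimensional space
# gives a quadratic (conic-section) boundary in the original (X1, X2) space.
fit_quad <- svm(y ~ ., data = dat2, kernel = "linear", cost = 1)
summary(fit_quad)
```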
Cubic Polynomials
[Figure: the non-separable data from the previous slide, with the nonlinear decision boundary obtained from a cubic basis expansion.]
Here we use a basis expansion of cubic polynomials: from 2 variables to 9.
The support-vector classifier in the enlarged space solves the problem in the lower-dimensional space:
β_0 + β_1 X_1 + β_2 X_2 + β_3 X_1² + β_4 X_2² + β_5 X_1 X_2 + β_6 X_1³ + β_7 X_2³ + β_8 X_1 X_2² + β_9 X_1² X_2 = 0.
Nonlinearities and Kernels
• Polynomials (especially high-dimensional ones) get wild rather fast.
• There is a more elegant and controlled way to introduce nonlinearities in support-vector classifiers: through the use of kernels.
• Before we discuss these, we must understand the role of inner products in support-vector classifiers.
Inner products and support vectors
⟨x_i, x_{i′}⟩ = Σ_{j=1}^p x_{ij} x_{i′j} (inner product between vectors)
• The linear support vector classifier can be represented as
f(x) = β_0 + Σ_{i=1}^n α_i ⟨x, x_i⟩ (n parameters).
• To estimate the parameters α_1, …, α_n and β_0, all we need are the inner products ⟨x_i, x_{i′}⟩ between all n(n − 1)/2 pairs of training observations.
• It turns out that most of the α̂_i can be zero:
f(x) = β_0 + Σ_{i∈S} α̂_i ⟨x, x_i⟩,
where S is the support set of indices i such that α̂_i > 0. [see slide 8]
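To make the support-set representation concrete, here is a hedged sketch that rebuilds the decision value of a fitted linear SVM from its support vectors alone, using the fields e1071 (libsvm) exposes ($SV, $coefs, $rho). It assumes the earlier `fit` trained with kernel = "linear" and scale = FALSE; the sign of the result can be flipped depending on libsvm's class ordering:

```r
library(e1071)

sv    <- fit$SV       # the support vectors x_i, i in S
coefs <- fit$coefs    # the corresponding y_i * alpha_i
rho   <- fit$rho      # libsvm stores the negative of the intercept as rho

x_new <- c(1.5, 0.5)                         # a new observation (made up)
f_hat <- sum(coefs * (sv %*% x_new)) - rho   # beta_0 + sum over S of alpha_i <x, x_i>

# Should match (up to sign conventions) the decision value from predict():
attr(predict(fit, newdata = data.frame(x.1 = 1.5, x.2 = 0.5),
             decision.values = TRUE), "decision.values")
```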
Kernels and Support Vector Machines
• If we can compute inner products between observations, we can fit a SV classifier. Can be quite abstract!
• Some special kernel functions can do this for us. E.g. the polynomial kernel
K(x_i, x_{i′}) = (1 + Σ_{j=1}^p x_{ij} x_{i′j})^d
computes the inner products needed for d-dimensional polynomials: (p + d choose d) basis functions! Try it for p = 2 and d = 2 (a numeric check follows below).
• The solution has the form
f(x) = β_0 + Σ_{i∈S} α̂_i K(x, x_i).
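A numeric check of the p = 2, d = 2 case suggested above: the polynomial kernel (1 + ⟨a, b⟩)² equals an ordinary inner product in a 6-dimensional feature space, i.e. (2 + 2 choose 2) = 6 basis functions. The vectors below are arbitrary made-up examples:

```r
# Polynomial kernel vs explicit feature map for p = 2, d = 2.
# Feature map: (1, sqrt(2) a1, sqrt(2) a2, a1^2, a2^2, sqrt(2) a1 a2).
poly_kernel <- function(a, b, d = 2) (1 + sum(a * b))^d
phi <- function(a) c(1, sqrt(2) * a[1], sqrt(2) * a[2],
                     a[1]^2, a[2]^2, sqrt(2) * a[1] * a[2])

a <- c(0.3, -1.2)
b <- c(2.0,  1.0)

poly_kernel(a, b)         # kernel evaluation
sum(phi(a) * phi(b))      # inner product in the enlarged space: same value
```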
Radial Kernel
K(x_i, x_{i′}) = exp(−γ Σ_{j=1}^p (x_{ij} − x_{i′j})²).
f(x) = β_0 + Σ_{i∈S} α̂_i K(x, x_i)
[Figure: the non-separable data fit with a radial-kernel SVM; the nonlinear decision boundary wraps around one class.]
Implicit feature space; very high dimensional.
Controls variance by squashing down most dimensions severely.
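A sketch of fitting a radial-kernel SVM with e1071 on the made-up dat from earlier; the gamma and cost values are illustrative and would normally be chosen by cross-validation:

```r
library(e1071)

fit_rbf <- svm(y ~ ., data = dat, kernel = "radial", gamma = 1, cost = 1)
plot(fit_rbf, dat)    # nonlinear decision boundary in the original (X1, X2) space

# gamma and cost can be tuned jointly by cross-validation:
set.seed(1)
cv_rbf <- tune(svm, y ~ ., data = dat, kernel = "radial",
               ranges = list(cost = c(0.1, 1, 10, 100),
                             gamma = c(0.5, 1, 2, 4)))
summary(cv_rbf)
```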
Example: Heart Data
[Figure: two panels of training-data ROC curves. Left: support vector classifier versus LDA. Right: support vector classifier versus radial-kernel SVMs with γ = 10⁻³, 10⁻², 10⁻¹.]
The ROC curve is obtained by changing the threshold 0 to threshold t in f̂(X) > t, and recording false positive and true positive rates as t varies. Here we see ROC curves on training data.
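A sketch of how such ROC curves can be produced in R with the ROCR package: extract the decision values f̂(x) from a fitted svm object and sweep the threshold. The objects fit_rbf and dat are the hypothetical ones from earlier sketches; the slide itself uses the Heart data.

```r
library(e1071)
library(ROCR)   # for ROC curves

# Decision values f(x) for the training observations.
fitted_dv <- attr(predict(fit_rbf, dat, decision.values = TRUE),
                  "decision.values")

pred <- prediction(fitted_dv, dat$y)
perf <- performance(pred, "tpr", "fpr")
plot(perf)   # true positive vs false positive rate as the threshold varies
# Depending on libsvm's class ordering the decision values may need a sign
# flip (use -fitted_dv if the curve comes out below the diagonal).
```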
Example continued: Heart Test Data
[Figure: the same comparisons as the previous slide, but with ROC curves computed on test data. Left: support vector classifier versus LDA. Right: support vector classifier versus radial-kernel SVMs with γ = 10⁻³, 10⁻², 10⁻¹.]
SVMs: more than 2 classes?
The SVM as defined works for K = 2 classes. What do we do if we have K > 2 classes?
OVA One versus All. Fit K different 2-class SVM classifiers f̂_k(x), k = 1, …, K; each class versus the rest. Classify x* to the class for which f̂_k(x*) is largest.
OVO One versus One. Fit all (K choose 2) pairwise classifiers f̂_{kℓ}(x). Classify x* to the class that wins the most pairwise competitions.
Which to choose? If K is not too large, use OVO.
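For reference, e1071::svm() already implements the one-versus-one approach when the response factor has K > 2 levels: it fits all pairwise classifiers and classifies by voting. A small sketch on the built-in iris data (K = 3):

```r
library(e1071)

fit_multi <- svm(Species ~ ., data = iris, kernel = "radial", cost = 1)
table(predicted = predict(fit_multi, iris), truth = iris$Species)
```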
Support Vector versus Logistic Regression?
With f(X) = β_0 + β_1 X_1 + … + β_p X_p, we can rephrase the support-vector classifier optimization as
minimize_{β_0, β_1, …, β_p} Σ_{i=1}^n max[0, 1 − y_i f(x_i)] + λ Σ_{j=1}^p β_j².
[Figure: SVM (hinge) loss and logistic regression loss plotted against y_i(β_0 + β_1 x_{i1} + … + β_p x_{ip}).]
This has the form loss plus penalty.
The loss is known as the hinge loss.
Very similar to the "loss" in logistic regression (negative log-likelihood).
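A small R sketch reproducing the loss comparison; note the logistic loss shown here is the unscaled negative log-likelihood, whereas such figures sometimes rescale it to match the hinge loss asymptotically:

```r
# Hinge loss vs logistic (negative log-likelihood) loss, as functions of y*f(x).
t <- seq(-6, 2, length.out = 400)         # t = y_i * f(x_i)
hinge    <- pmax(0, 1 - t)                # SVM (hinge) loss
logistic <- log(1 + exp(-t))              # logistic regression loss

plot(t, hinge, type = "l", lwd = 2, xlab = "y_i * f(x_i)", ylab = "Loss")
lines(t, logistic, lwd = 2, lty = 2)
legend("topright", c("SVM (hinge) loss", "Logistic regression loss"),
       lwd = 2, lty = c(1, 2))
```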
Which to use: SVM or Logistic Regression?
• When classes are (nearly) separable, SVM does better than LR. So does LDA.
• When not, LR (with ridge penalty) and SVM are very similar.
• If you wish to estimate probabilities, LR is the choice.
• For nonlinear boundaries, kernel SVMs are popular. Can use kernels with LR and LDA as well, but computations are more expensive.