Machine Learning - MT 2017
13. Support Vector Machines II
Christoph Haase
University of Oxford
November 6, 2017
Last Time

◮ Primal formulation of SVM
◮ Slack variables for linearly non-separable data
SVM Formulation: Non-Separable Case

minimise: $\frac{1}{2}\|w\|_2^2 + C \sum_{i=1}^{N} \zeta_i$

subject to: $y_i (w \cdot x_i + w_0) \ge 1 - \zeta_i$ and $\zeta_i \ge 0$ for $i = 1, \dots, N$

Here $y_i \in \{-1, 1\}$
SVM Formulation: Loss Function

minimise: $\underbrace{\tfrac{1}{2}\|w\|_2^2}_{\text{Regularizer}} + \underbrace{C \sum_{i=1}^{N} \zeta_i}_{\text{Loss Function}}$

subject to: $y_i (w \cdot x_i + w_0) \ge 1 - \zeta_i$ and $\zeta_i \ge 0$ for $i = 1, \dots, N$

Here $y_i \in \{-1, 1\}$

[Figure: hinge loss plotted against $y (w \cdot x + w_0)$]

Note that for the optimal solution, $\zeta_i = \max\{0, 1 - y_i (w \cdot x_i + w_0)\}$.

Thus, the SVM can be viewed as minimizing the hinge loss with regularization.
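Not part of the original slides: a minimal NumPy sketch of the hinge loss and the resulting regularized primal objective, using the same symbols $w$, $w_0$, $C$ as above (the function names are my own).

```python
import numpy as np

def hinge_loss(w, w0, X, y):
    """Per-example hinge loss max{0, 1 - y_i (w . x_i + w_0)}, with y_i in {-1, +1}."""
    margins = y * (X @ w + w0)
    return np.maximum(0.0, 1.0 - margins)

def svm_objective(w, w0, X, y, C=1.0):
    """Primal SVM objective: 0.5 ||w||^2 + C * sum of hinge losses."""
    return 0.5 * np.dot(w, w) + C * hinge_loss(w, w0, X, y).sum()
```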
Logistic Regression: Loss Function

Here $y_i \in \{0, 1\}$, so to compare with the SVM, let $z_i = 2y_i - 1$:
◮ $z_i = 1$ if $y_i = 1$
◮ $z_i = -1$ if $y_i = 0$

$\mathrm{NLL}(y_i; w, x_i) = -\left[ y_i \log\left(\frac{1}{1 + e^{-w \cdot x_i}}\right) + (1 - y_i) \log\left(\frac{1}{1 + e^{w \cdot x_i}}\right) \right] = \log\left(1 + e^{-z_i (w \cdot x_i)}\right) = \log\left(1 + e^{-(2y_i - 1)(w \cdot x_i)}\right)$

[Figure: logistic loss plotted against $(2y - 1)(w \cdot x + w_0)$]
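Again only as an illustration, not from the slides: the logistic loss in the $z = 2y - 1$ parameterisation, including the bias $w_0$ to match the plot's axis; logistic_loss is a hypothetical helper.

```python
import numpy as np

def logistic_loss(w, w0, X, y01):
    """Per-example NLL with labels y in {0, 1}, mapped to z = 2y - 1 as above."""
    z = 2 * y01 - 1
    margins = z * (X @ w + w0)
    # log(1 + exp(-m)) computed stably as logaddexp(0, -m)
    return np.logaddexp(0.0, -margins)
```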
Loss Functions

[Figure: comparison of the loss functions]
Outline

◮ Dual Formulation of SVM
◮ Kernels
SVM Formulation: Non-Separable Case

What if your data looks like this?

[Figure: data that is not linearly separable]
SVM Formulation: Constrained Minimisation

minimise: $\frac{1}{2}\|w\|_2^2 + C \sum_{i=1}^{N} \zeta_i$

subject to: $y_i (w \cdot x_i + w_0) - (1 - \zeta_i) \ge 0$ and $\zeta_i \ge 0$ for $i = 1, \dots, N$

Here $y_i \in \{-1, 1\}$
Constrained Optimisation with Inequalities

Primal Form

minimise $F(z)$
subject to $g_i(z) \ge 0$ for $i = 1, \dots, m$
and $h_j(z) = 0$ for $j = 1, \dots, l$

Lagrange Function

$\Lambda(z; \alpha, \mu) = F(z) - \sum_{i=1}^{m} \alpha_i g_i(z) - \sum_{j=1}^{l} \mu_j h_j(z)$

For convex problems (as defined before), the Karush-Kuhn-Tucker (KKT) conditions provide necessary and sufficient conditions for a critical point of $\Lambda$ to be the minimum of the original constrained optimisation problem. For non-convex problems, they are necessary but not sufficient.
KKT Conditions

Lagrange Function

$\Lambda(z; \alpha, \mu) = F(z) - \sum_{i=1}^{m} \alpha_i g_i(z) - \sum_{j=1}^{l} \mu_j h_j(z)$

For convex problems, the Karush-Kuhn-Tucker (KKT) conditions give necessary and sufficient conditions for a solution (critical point of $\Lambda$) to be optimal:

Dual feasibility: $\alpha_i \ge 0$ for $i = 1, \dots, m$
Primal feasibility: $g_i(z) \ge 0$ for $i = 1, \dots, m$ and $h_j(z) = 0$ for $j = 1, \dots, l$
Complementary slackness: $\alpha_i g_i(z) = 0$ for $i = 1, \dots, m$
SVM Formulation

minimise: $\frac{1}{2}\|w\|_2^2 + C \sum_{i=1}^{N} \zeta_i$

subject to: $y_i (w \cdot x_i + w_0) - (1 - \zeta_i) \ge 0$ and $\zeta_i \ge 0$ for $i = 1, \dots, N$

Here $y_i \in \{-1, 1\}$

Lagrange Function

$\Lambda(w, w_0, \zeta; \alpha, \mu) = \frac{1}{2}\|w\|_2^2 + C \sum_{i=1}^{N} \zeta_i - \sum_{i=1}^{N} \alpha_i \bigl( y_i (w \cdot x_i + w_0) - (1 - \zeta_i) \bigr) - \sum_{i=1}^{N} \mu_i \zeta_i$
SVM Dual Formulation

Lagrange Function

$\Lambda(w, w_0, \zeta; \alpha, \mu) = \frac{1}{2}\|w\|_2^2 + C \sum_{i=1}^{N} \zeta_i - \sum_{i=1}^{N} \alpha_i \bigl( y_i (w \cdot x_i + w_0) - (1 - \zeta_i) \bigr) - \sum_{i=1}^{N} \mu_i \zeta_i$

We take derivatives with respect to $w$, $w_0$ and $\zeta_i$:

$\nabla_w \Lambda = w - \sum_{i=1}^{N} \alpha_i y_i x_i$

$\frac{\partial \Lambda}{\partial w_0} = -\sum_{i=1}^{N} \alpha_i y_i$

$\frac{\partial \Lambda}{\partial \zeta_i} = C - \alpha_i - \mu_i$

For the (KKT) dual feasibility constraints, we require $\alpha_i \ge 0$ and $\mu_i \ge 0$.
SVM Dual Formulation

Setting the derivatives to 0 and substituting the resulting expressions into $\Lambda$ (and simplifying), we get a function $g(\alpha)$ and some constraints:

$g(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j$

Constraints

$0 \le \alpha_i \le C$ for $i = 1, \dots, N$
$\sum_{i=1}^{N} \alpha_i y_i = 0$

Finding critical points of $\Lambda$ satisfying the KKT conditions corresponds to finding the maximum of $g(\alpha)$ subject to the above constraints.
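To make the dual concrete, here is a small sketch (not from the lecture) that maximises $g(\alpha)$ under the box and equality constraints with SciPy's SLSQP solver; solve_svm_dual is a hypothetical name, and a dedicated QP or SMO solver would be used in practice.

```python
import numpy as np
from scipy.optimize import minimize

def solve_svm_dual(X, y, C=1.0):
    """Maximise g(alpha) subject to 0 <= alpha_i <= C and sum_i alpha_i y_i = 0."""
    N = X.shape[0]
    Q = (y[:, None] * X) @ (y[:, None] * X).T   # Q_ij = y_i y_j x_i . x_j

    def neg_g(alpha):                           # minimise -g(alpha)
        return 0.5 * alpha @ Q @ alpha - alpha.sum()

    def neg_g_grad(alpha):
        return Q @ alpha - np.ones(N)

    constraints = {"type": "eq", "fun": lambda a: a @ y, "jac": lambda a: y}
    bounds = [(0.0, C)] * N
    result = minimize(neg_g, np.zeros(N), jac=neg_g_grad,
                      bounds=bounds, constraints=constraints, method="SLSQP")
    return result.x
```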
SVM: Primal and Dual Formulations

Primal Form

minimise: $\frac{1}{2}\|w\|_2^2 + C \sum_{i=1}^{N} \zeta_i$
subject to: $y_i (w \cdot x_i + w_0) \ge 1 - \zeta_i$ and $\zeta_i \ge 0$ for $i = 1, \dots, N$

Dual Form

maximise: $\sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j$
subject to: $\sum_{i=1}^{N} \alpha_i y_i = 0$ and $0 \le \alpha_i \le C$ for $i = 1, \dots, N$
KKT Complementary Slackness Conditions

◮ For all $i$, $\alpha_i \bigl( y_i (w \cdot x_i + w_0) - (1 - \zeta_i) \bigr) = 0$
◮ If $\alpha_i > 0$, then $y_i (w \cdot x_i + w_0) = 1 - \zeta_i$
◮ Recall the form of the solution: $w = \sum_{i=1}^{N} \alpha_i y_i x_i$
◮ Thus, only those datapoints $x_i$ for which $\alpha_i > 0$ determine the solution
◮ This is why they are called support vectors
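A sketch (not on the slides) of how complementary slackness is typically used to recover $w$ and $w_0$ from the dual variables: $w = \sum_i \alpha_i y_i x_i$, and $w_0$ from points with $0 < \alpha_i < C$, for which $\zeta_i = 0$ and hence $y_i (w \cdot x_i + w_0) = 1$; recover_primal and the tolerance are my own choices.

```python
import numpy as np

def recover_primal(alpha, X, y, C=1.0, tol=1e-6):
    """Recover (w, w_0) from the dual solution."""
    w = (alpha * y) @ X                            # w = sum_i alpha_i y_i x_i
    on_margin = (alpha > tol) & (alpha < C - tol)  # 0 < alpha_i < C  =>  zeta_i = 0
    # For these points y_i (w . x_i + w_0) = 1, so w_0 = y_i - w . x_i (since y_i in {-1, +1});
    # average over all of them for numerical stability.
    w0 = np.mean(y[on_margin] - X[on_margin] @ w)
    return w, w0
```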
Support Vectors

[Figure: illustration of support vectors]
SVM Dual Formulation

maximise: $\sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \, x_i^T x_j$

subject to: $0 \le \alpha_i \le C$ for $i = 1, \dots, N$ and $\sum_{i=1}^{N} \alpha_i y_i = 0$

◮ The objective depends on the training inputs only through their dot products
◮ The dual formulation is particularly useful if the inputs are high-dimensional
◮ The dual constraints are much simpler than the primal ones
◮ To make a new prediction, we only need dot products with the support vectors
◮ The solution is of the form $w = \sum_{i=1}^{N} \alpha_i y_i x_i$
◮ And so $w \cdot x_{\mathrm{new}} = \sum_{i=1}^{N} \alpha_i y_i \, x_i \cdot x_{\mathrm{new}}$
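As an illustration of the last two bullets (not from the slides), a prediction routine that touches only the support vectors, i.e. the points with $\alpha_i$ above a small tolerance; the function name and tolerance are my own.

```python
import numpy as np

def dual_predict(alpha, X, y, w0, X_new, tol=1e-8):
    """sign( sum_{i in SV} alpha_i y_i (x_i . x_new) + w_0 ) for each row of X_new."""
    sv = alpha > tol                                     # support vectors only
    scores = (alpha[sv] * y[sv]) @ (X[sv] @ X_new.T) + w0
    return np.sign(scores)
```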
Outline

◮ Dual Formulation of SVM
◮ Kernels
Gram Matrix

If we put the inputs in a matrix $X$, where the $i$-th row of $X$ is $x_i^T$:

$K = XX^T = \begin{bmatrix} x_1^T x_1 & x_1^T x_2 & \cdots & x_1^T x_N \\ x_2^T x_1 & x_2^T x_2 & \cdots & x_2^T x_N \\ \vdots & \vdots & \ddots & \vdots \\ x_N^T x_1 & x_N^T x_2 & \cdots & x_N^T x_N \end{bmatrix}$

◮ The matrix $K$ is positive definite if $D > N$ and the $x_i$ are linearly independent
◮ If we perform a basis expansion $\phi : \mathbb{R}^D \to \mathbb{R}^M$, we replace the entries by $\phi(x_i)^T \phi(x_j)$
◮ We only need the ability to compute inner products to use the SVM
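A small sketch (not on the slides) of the Gram matrix, with and without an explicit basis expansion $\phi$; both helper names are my own.

```python
import numpy as np

def gram_matrix(X):
    """K = X X^T, with entries K_ij = x_i . x_j."""
    return X @ X.T

def gram_matrix_phi(X, phi):
    """Gram matrix after a basis expansion phi: R^D -> R^M, K_ij = phi(x_i) . phi(x_j)."""
    Phi = np.array([phi(x) for x in X])
    return Phi @ Phi.T
```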
Kernel Trick

Suppose $x \in \mathbb{R}^2$ and we perform a degree-2 polynomial expansion. We could use the map

$\psi(x) = \bigl[ 1, x_1, x_2, x_1^2, x_2^2, x_1 x_2 \bigr]^T$

But we could also use the map

$\phi(x) = \bigl[ 1, \sqrt{2}\, x_1, \sqrt{2}\, x_2, x_1^2, x_2^2, \sqrt{2}\, x_1 x_2 \bigr]^T$

If $x = [x_1, x_2]^T$ and $x' = [x'_1, x'_2]^T$, then

$\phi(x)^T \phi(x') = 1 + 2 x_1 x'_1 + 2 x_2 x'_2 + x_1^2 (x'_1)^2 + x_2^2 (x'_2)^2 + 2 x_1 x_2 x'_1 x'_2 = (1 + x_1 x'_1 + x_2 x'_2)^2 = (1 + x \cdot x')^2$

Instead of spending $\approx D^d$ time to compute inner products after a degree-$d$ polynomial basis expansion, we only need $O(D)$ time.
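A quick numerical check of the identity above (not part of the slides), using the feature map $\phi$ exactly as written; the test points are arbitrary.

```python
import numpy as np

def phi(x):
    """Explicit feature map for x in R^2, as on the slide."""
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, x2**2, np.sqrt(2) * x1 * x2])

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
assert np.isclose(phi(x) @ phi(xp), (1.0 + x @ xp) ** 2)  # same value, no expansion needed
```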
Kernel Trick

We can use any symmetric positive semi-definite kernel (a Mercer kernel):

$K = \begin{bmatrix} \kappa(x_1, x_1) & \kappa(x_1, x_2) & \cdots & \kappa(x_1, x_N) \\ \kappa(x_2, x_1) & \kappa(x_2, x_2) & \cdots & \kappa(x_2, x_N) \\ \vdots & \vdots & \ddots & \vdots \\ \kappa(x_N, x_1) & \kappa(x_N, x_2) & \cdots & \kappa(x_N, x_N) \end{bmatrix}$

Here $\kappa(x, x')$ is some measure of similarity between $x$ and $x'$.

The dual program becomes

maximise: $\sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j K_{i,j}$
subject to: $0 \le \alpha_i \le C$ and $\sum_{i=1}^{N} \alpha_i y_i = 0$

To make a prediction on a new $x_{\mathrm{new}}$, we only need to compute $\kappa(x_i, x_{\mathrm{new}})$ for the support vectors $x_i$ (those with $\alpha_i > 0$).
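If a library is preferred, scikit-learn's SVC accepts a precomputed kernel matrix; the sketch below (not from the lecture) shows the intended shapes, and fit_with_kernel is a hypothetical wrapper.

```python
from sklearn.svm import SVC

def fit_with_kernel(K_train, y_train, C=1.0):
    """Fit an SVM from a precomputed N x N kernel matrix K_ij = kappa(x_i, x_j)."""
    clf = SVC(C=C, kernel="precomputed")
    clf.fit(K_train, y_train)
    return clf

# Prediction needs the kernel between new and *training* points:
# clf.predict(K_new_vs_train), where K_new_vs_train has shape (M, N).
```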
Polynomial Kernels

Rather than performing the basis expansion explicitly, use

$\kappa(x, x') = (1 + x \cdot x')^d$

This gives all terms of degree up to $d$.

If we use $\kappa(x, x') = (x \cdot x')^d$, we get only the degree-$d$ terms.

Linear Kernel: $\kappa(x, x') = x \cdot x'$

All of these satisfy the Mercer (positive-definite) condition.
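Minimal sketches (not from the slides) of the polynomial and linear kernels on row-stacked inputs; the function names are my own.

```python
import numpy as np

def poly_kernel(X1, X2, d=2, c=1.0):
    """(c + x . x')^d: c = 1 gives all terms up to degree d, c = 0 only the degree-d terms."""
    return (c + X1 @ X2.T) ** d

def linear_kernel(X1, X2):
    """kappa(x, x') = x . x'."""
    return X1 @ X2.T
```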
Gaussian or RBF Kernel

Radial Basis Function (RBF) or Gaussian kernel:

$\kappa(x, x') = \exp\left( -\frac{\|x - x'\|^2}{2\sigma^2} \right)$

$\sigma^2$ is known as the bandwidth.

We used this with $\gamma = \frac{1}{2\sigma^2}$ when we studied kernel basis expansion for regression.

Can be generalised to more general covariance matrices.

Results in a Mercer kernel.
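A vectorised sketch of the RBF kernel (not from the slides), using $\|x - x'\|^2 = \|x\|^2 + \|x'\|^2 - 2\, x \cdot x'$; rbf_kernel is a hypothetical name.

```python
import numpy as np

def rbf_kernel(X1, X2, sigma=1.0):
    """kappa(x, x') = exp(-||x - x'||^2 / (2 sigma^2)) for all pairs of rows."""
    sq_dists = (np.sum(X1**2, axis=1)[:, None]
                + np.sum(X2**2, axis=1)[None, :]
                - 2.0 * X1 @ X2.T)
    sq_dists = np.maximum(sq_dists, 0.0)          # guard against tiny negative round-off
    return np.exp(-sq_dists / (2.0 * sigma**2))
```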
Kernels on Discrete Data: Cosine Kernel

For text documents: let $x$ denote the bag-of-words vector.

Cosine Similarity

$\kappa(x, x') = \frac{x \cdot x'}{\|x\|_2 \, \|x'\|_2}$

Term frequency: $\mathrm{tf}(c) = \log(1 + c)$, where $c$ is the count of some word $w$

Inverse document frequency: $\mathrm{idf}(w) = \log\left( \frac{N}{1 + N_w} \right)$, where $N$ is the number of documents and $N_w$ the number of documents containing $w$

$\text{tf-idf}(x)_w = \mathrm{tf}(x_w) \, \mathrm{idf}(w)$
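A sketch (not on the slides) of the tf-idf weighting and cosine kernel defined above, assuming a document-by-vocabulary count matrix; the function names are my own.

```python
import numpy as np

def tf_idf(counts):
    """counts: N x V word-count matrix; returns tf-idf features tf(x_w) * idf(w)."""
    N = counts.shape[0]
    tf = np.log1p(counts)                 # tf(c) = log(1 + c)
    Nw = (counts > 0).sum(axis=0)         # number of documents containing each word
    idf = np.log(N / (1.0 + Nw))          # idf(w) = log(N / (1 + N_w))
    return tf * idf

def cosine_kernel(X1, X2, eps=1e-12):
    """kappa(x, x') = (x . x') / (||x||_2 ||x'||_2)."""
    X1n = X1 / (np.linalg.norm(X1, axis=1, keepdims=True) + eps)
    X2n = X2 / (np.linalg.norm(X2, axis=1, keepdims=True) + eps)
    return X1n @ X2n.T
```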
Kernels on Discrete Data: String Kernel

Let $x$ and $x'$ be strings over some alphabet $\mathcal{A}$, e.g.

$\mathcal{A} = \{A, R, N, D, C, E, Q, G, H, I, L, K, M, F, P, S, T, W, Y, V\}$

$\kappa(x, x') = \sum_s w_s \, \phi_s(x) \, \phi_s(x')$

$\phi_s(x)$ is the number of times $s$ appears in $x$ as a substring

$w_s$ is the weight associated with substring $s$
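A naive sketch (not from the slides) of the substring kernel above; the cutoff on substring length and the unit default weights are my own assumptions, and practical string kernels use suffix-tree or dynamic-programming methods instead.

```python
from collections import Counter

def substring_counts(x, max_len=3):
    """phi_s(x): counts of all substrings s of x up to length max_len."""
    counts = Counter()
    for length in range(1, max_len + 1):
        for start in range(len(x) - length + 1):
            counts[x[start:start + length]] += 1
    return counts

def string_kernel(x, x_prime, weights=None, max_len=3):
    """kappa(x, x') = sum_s w_s phi_s(x) phi_s(x'); w_s defaults to 1."""
    cx, cxp = substring_counts(x, max_len), substring_counts(x_prime, max_len)
    return sum((weights.get(s, 1.0) if weights else 1.0) * cx[s] * cxp[s]
               for s in set(cx) & set(cxp))
```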
How to choose a good kernel?

It is not always easy to tell whether a kernel function is a Mercer kernel.

Mercer Condition: for any finite set of points, the kernel matrix should be positive semi-definite.

If the following hold:
◮ $\kappa_1, \kappa_2$ are Mercer kernels for points in $\mathbb{R}^D$
◮ $f : \mathbb{R}^D \to \mathbb{R}$
◮ $\phi : \mathbb{R}^D \to \mathbb{R}^M$
◮ $\kappa_3$ is a Mercer kernel on $\mathbb{R}^M$

then the following are Mercer kernels:
◮ $\kappa_1 + \kappa_2$, $\kappa_1 \cdot \kappa_2$, and $\alpha \kappa_1$ for $\alpha \ge 0$
◮ $\kappa(x, x') = f(x) f(x')$
◮ $\kappa_3(\phi(x), \phi(x'))$
◮ $\kappa(x, x') = x^T A x'$ for $A$ positive definite
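An empirical sanity check (not from the slides) of the Mercer condition on a finite sample: symmetrise the kernel matrix and verify that its eigenvalues are non-negative up to numerical tolerance. This cannot prove a function is a Mercer kernel, only refute it on the given points.

```python
import numpy as np

def looks_psd(K, tol=1e-8):
    """True if the kernel matrix K is (numerically) positive semi-definite."""
    K_sym = 0.5 * (K + K.T)                     # symmetrise against round-off
    return bool(np.all(np.linalg.eigvalsh(K_sym) >= -tol))
```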