Machine Learning - MT 2017
13. Support Vector Machines II
Christoph Haase
University of Oxford
November 6, 2017
Last Time

◮ Primal formulation of SVM
◮ Slack variables for linearly non-separable data
SVM Formulation: Non-Separable Case

minimise: $\frac{1}{2}\|w\|_2^2 + C \sum_{i=1}^{N} \zeta_i$

subject to: $y_i (w \cdot x_i + w_0) \ge 1 - \zeta_i$ and $\zeta_i \ge 0$ for $i = 1, \dots, N$

Here $y_i \in \{-1, 1\}$
SVM Formulation: Loss Function

minimise: $\underbrace{\tfrac{1}{2}\|w\|_2^2}_{\text{Regularizer}} + \underbrace{C \sum_{i=1}^{N} \zeta_i}_{\text{Loss Function}}$

subject to: $y_i (w \cdot x_i + w_0) \ge 1 - \zeta_i$ and $\zeta_i \ge 0$ for $i = 1, \dots, N$

Here $y_i \in \{-1, 1\}$

[Figure: hinge loss plotted against $y (w \cdot x + w_0)$]

Note that for the optimal solution, $\zeta_i = \max\{0, 1 - y_i (w \cdot x_i + w_0)\}$.

Thus, the SVM can be viewed as minimizing the hinge loss with regularization.
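Not part of the original slides: a minimal NumPy sketch of the hinge loss and the resulting regularized primal objective, using the same symbols $w$, $w_0$, $C$ as above (the function names are my own).

```python
import numpy as np

def hinge_loss(w, w0, X, y):
    """Per-example hinge loss max{0, 1 - y_i (w . x_i + w_0)}, with y_i in {-1, +1}."""
    margins = y * (X @ w + w0)
    return np.maximum(0.0, 1.0 - margins)

def svm_objective(w, w0, X, y, C=1.0):
    """Primal SVM objective: 0.5 ||w||^2 + C * sum of hinge losses."""
    return 0.5 * np.dot(w, w) + C * hinge_loss(w, w0, X, y).sum()
```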
Logistic Regression: Loss Function

Here $y_i \in \{0, 1\}$, so to compare with the SVM, let $z_i = 2y_i - 1$:
◮ $z_i = 1$ if $y_i = 1$
◮ $z_i = -1$ if $y_i = 0$

$\mathrm{NLL}(y_i; w, x_i) = -\left[ y_i \log\left(\frac{1}{1 + e^{-w \cdot x_i}}\right) + (1 - y_i) \log\left(\frac{1}{1 + e^{w \cdot x_i}}\right) \right] = \log\left(1 + e^{-z_i (w \cdot x_i)}\right) = \log\left(1 + e^{-(2y_i - 1)(w \cdot x_i)}\right)$

[Figure: logistic loss plotted against $(2y - 1)(w \cdot x + w_0)$]
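Again only as an illustration, not from the slides: the logistic loss in the $z = 2y - 1$ parameterisation, including the bias $w_0$ to match the plot's axis; logistic_loss is a hypothetical helper.

```python
import numpy as np

def logistic_loss(w, w0, X, y01):
    """Per-example NLL with labels y in {0, 1}, mapped to z = 2y - 1 as above."""
    z = 2 * y01 - 1
    margins = z * (X @ w + w0)
    # log(1 + exp(-m)) computed stably as logaddexp(0, -m)
    return np.logaddexp(0.0, -margins)
```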
Loss Functions

[Figure: comparison of the loss functions]
Outline

◮ Dual Formulation of SVM
◮ Kernels
SVM Formulation: Non-Separable Case

What if your data looks like this?

[Figure: data that is not linearly separable]
SVM Formulation: Constrained Minimisation

minimise: $\frac{1}{2}\|w\|_2^2 + C \sum_{i=1}^{N} \zeta_i$

subject to: $y_i (w \cdot x_i + w_0) - (1 - \zeta_i) \ge 0$ and $\zeta_i \ge 0$ for $i = 1, \dots, N$

Here $y_i \in \{-1, 1\}$
Constrained Optimisation with Inequalities

Primal Form

minimise $F(z)$
subject to $g_i(z) \ge 0$ for $i = 1, \dots, m$
and $h_j(z) = 0$ for $j = 1, \dots, l$

Lagrange Function

$\Lambda(z; \alpha, \mu) = F(z) - \sum_{i=1}^{m} \alpha_i g_i(z) - \sum_{j=1}^{l} \mu_j h_j(z)$

For convex problems (as defined before), the Karush-Kuhn-Tucker (KKT) conditions provide necessary and sufficient conditions for a critical point of $\Lambda$ to be the minimum of the original constrained optimisation problem. For non-convex problems, they are necessary but not sufficient.
KKT Conditions

Lagrange Function

$\Lambda(z; \alpha, \mu) = F(z) - \sum_{i=1}^{m} \alpha_i g_i(z) - \sum_{j=1}^{l} \mu_j h_j(z)$

For convex problems, the Karush-Kuhn-Tucker (KKT) conditions give necessary and sufficient conditions for a solution (critical point of $\Lambda$) to be optimal:

Dual feasibility: $\alpha_i \ge 0$ for $i = 1, \dots, m$
Primal feasibility: $g_i(z) \ge 0$ for $i = 1, \dots, m$ and $h_j(z) = 0$ for $j = 1, \dots, l$
Complementary slackness: $\alpha_i g_i(z) = 0$ for $i = 1, \dots, m$
SVM Formulation

minimise: $\frac{1}{2}\|w\|_2^2 + C \sum_{i=1}^{N} \zeta_i$

subject to: $y_i (w \cdot x_i + w_0) - (1 - \zeta_i) \ge 0$ and $\zeta_i \ge 0$ for $i = 1, \dots, N$

Here $y_i \in \{-1, 1\}$

Lagrange Function

$\Lambda(w, w_0, \zeta; \alpha, \mu) = \frac{1}{2}\|w\|_2^2 + C \sum_{i=1}^{N} \zeta_i - \sum_{i=1}^{N} \alpha_i \bigl( y_i (w \cdot x_i + w_0) - (1 - \zeta_i) \bigr) - \sum_{i=1}^{N} \mu_i \zeta_i$
SVM Dual Formulation

Lagrange Function

$\Lambda(w, w_0, \zeta; \alpha, \mu) = \frac{1}{2}\|w\|_2^2 + C \sum_{i=1}^{N} \zeta_i - \sum_{i=1}^{N} \alpha_i \bigl( y_i (w \cdot x_i + w_0) - (1 - \zeta_i) \bigr) - \sum_{i=1}^{N} \mu_i \zeta_i$

We take derivatives with respect to $w$, $w_0$ and $\zeta_i$:

$\nabla_w \Lambda = w - \sum_{i=1}^{N} \alpha_i y_i x_i$

$\frac{\partial \Lambda}{\partial w_0} = -\sum_{i=1}^{N} \alpha_i y_i$

$\frac{\partial \Lambda}{\partial \zeta_i} = C - \alpha_i - \mu_i$

For the (KKT) dual feasibility constraints, we require $\alpha_i \ge 0$ and $\mu_i \ge 0$.
SVM Dual Formulation

Setting the derivatives to 0 and substituting the resulting expressions into $\Lambda$ (and simplifying), we get a function $g(\alpha)$ and some constraints:

$g(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j$

Constraints

$0 \le \alpha_i \le C$ for $i = 1, \dots, N$
$\sum_{i=1}^{N} \alpha_i y_i = 0$

Finding critical points of $\Lambda$ satisfying the KKT conditions corresponds to finding the maximum of $g(\alpha)$ subject to the above constraints.
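To make the dual concrete, here is a small sketch (not from the lecture) that maximises $g(\alpha)$ under the box and equality constraints with SciPy's SLSQP solver; solve_svm_dual is a hypothetical name, and a dedicated QP or SMO solver would be used in practice.

```python
import numpy as np
from scipy.optimize import minimize

def solve_svm_dual(X, y, C=1.0):
    """Maximise g(alpha) subject to 0 <= alpha_i <= C and sum_i alpha_i y_i = 0."""
    N = X.shape[0]
    Q = (y[:, None] * X) @ (y[:, None] * X).T   # Q_ij = y_i y_j x_i . x_j

    def neg_g(alpha):                           # minimise -g(alpha)
        return 0.5 * alpha @ Q @ alpha - alpha.sum()

    def neg_g_grad(alpha):
        return Q @ alpha - np.ones(N)

    constraints = {"type": "eq", "fun": lambda a: a @ y, "jac": lambda a: y}
    bounds = [(0.0, C)] * N
    result = minimize(neg_g, np.zeros(N), jac=neg_g_grad,
                      bounds=bounds, constraints=constraints, method="SLSQP")
    return result.x
```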
SVM: Primal and Dual Formulations

Primal Form

minimise: $\frac{1}{2}\|w\|_2^2 + C \sum_{i=1}^{N} \zeta_i$
subject to: $y_i (w \cdot x_i + w_0) \ge 1 - \zeta_i$ and $\zeta_i \ge 0$ for $i = 1, \dots, N$

Dual Form

maximise: $\sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j$
subject to: $\sum_{i=1}^{N} \alpha_i y_i = 0$ and $0 \le \alpha_i \le C$ for $i = 1, \dots, N$
KKT Complementary Slackness Conditions

◮ For all $i$, $\alpha_i \bigl( y_i (w \cdot x_i + w_0) - (1 - \zeta_i) \bigr) = 0$
◮ If $\alpha_i > 0$, then $y_i (w \cdot x_i + w_0) = 1 - \zeta_i$
◮ Recall the form of the solution: $w = \sum_{i=1}^{N} \alpha_i y_i x_i$
◮ Thus, only those datapoints $x_i$ for which $\alpha_i > 0$ determine the solution
◮ This is why they are called support vectors
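A sketch (not on the slides) of how complementary slackness is typically used to recover $w$ and $w_0$ from the dual variables: $w = \sum_i \alpha_i y_i x_i$, and $w_0$ from points with $0 < \alpha_i < C$, for which $\zeta_i = 0$ and hence $y_i (w \cdot x_i + w_0) = 1$; recover_primal and the tolerance are my own choices.

```python
import numpy as np

def recover_primal(alpha, X, y, C=1.0, tol=1e-6):
    """Recover (w, w_0) from the dual solution."""
    w = (alpha * y) @ X                            # w = sum_i alpha_i y_i x_i
    on_margin = (alpha > tol) & (alpha < C - tol)  # 0 < alpha_i < C  =>  zeta_i = 0
    # For these points y_i (w . x_i + w_0) = 1, so w_0 = y_i - w . x_i (since y_i in {-1, +1});
    # average over all of them for numerical stability.
    w0 = np.mean(y[on_margin] - X[on_margin] @ w)
    return w, w0
```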
Support Vectors

[Figure: illustration of support vectors]
SVM Dual Formulation

maximise: $\sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \, x_i^T x_j$

subject to: $0 \le \alpha_i \le C$ for $i = 1, \dots, N$ and $\sum_{i=1}^{N} \alpha_i y_i = 0$

◮ The objective depends on the training inputs only through their dot products
◮ The dual formulation is particularly useful if the inputs are high-dimensional
◮ The dual constraints are much simpler than the primal ones
◮ To make a new prediction, we only need dot products with the support vectors
◮ The solution is of the form $w = \sum_{i=1}^{N} \alpha_i y_i x_i$
◮ And so $w \cdot x_{\mathrm{new}} = \sum_{i=1}^{N} \alpha_i y_i \, x_i \cdot x_{\mathrm{new}}$
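As an illustration of the last two bullets (not from the slides), a prediction routine that touches only the support vectors, i.e. the points with $\alpha_i$ above a small tolerance; the function name and tolerance are my own.

```python
import numpy as np

def dual_predict(alpha, X, y, w0, X_new, tol=1e-8):
    """sign( sum_{i in SV} alpha_i y_i (x_i . x_new) + w_0 ) for each row of X_new."""
    sv = alpha > tol                                     # support vectors only
    scores = (alpha[sv] * y[sv]) @ (X[sv] @ X_new.T) + w0
    return np.sign(scores)
```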
Outline

◮ Dual Formulation of SVM
◮ Kernels
Gram Matrix

If we put the inputs in a matrix $X$, where the $i$-th row of $X$ is $x_i^T$:

$K = XX^T = \begin{bmatrix} x_1^T x_1 & x_1^T x_2 & \cdots & x_1^T x_N \\ x_2^T x_1 & x_2^T x_2 & \cdots & x_2^T x_N \\ \vdots & \vdots & \ddots & \vdots \\ x_N^T x_1 & x_N^T x_2 & \cdots & x_N^T x_N \end{bmatrix}$

◮ The matrix $K$ is positive definite if $D > N$ and the $x_i$ are linearly independent
◮ If we perform a basis expansion $\phi : \mathbb{R}^D \to \mathbb{R}^M$, we replace the entries by $\phi(x_i)^T \phi(x_j)$
◮ We only need the ability to compute inner products to use the SVM
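A small sketch (not on the slides) of the Gram matrix, with and without an explicit basis expansion $\phi$; both helper names are my own.

```python
import numpy as np

def gram_matrix(X):
    """K = X X^T, with entries K_ij = x_i . x_j."""
    return X @ X.T

def gram_matrix_phi(X, phi):
    """Gram matrix after a basis expansion phi: R^D -> R^M, K_ij = phi(x_i) . phi(x_j)."""
    Phi = np.array([phi(x) for x in X])
    return Phi @ Phi.T
```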
Kernel Trick

Suppose $x \in \mathbb{R}^2$ and we perform a degree-2 polynomial expansion. We could use the map

$\psi(x) = \bigl[ 1, x_1, x_2, x_1^2, x_2^2, x_1 x_2 \bigr]^T$

But we could also use the map

$\phi(x) = \bigl[ 1, \sqrt{2}\, x_1, \sqrt{2}\, x_2, x_1^2, x_2^2, \sqrt{2}\, x_1 x_2 \bigr]^T$

If $x = [x_1, x_2]^T$ and $x' = [x'_1, x'_2]^T$, then

$\phi(x)^T \phi(x') = 1 + 2 x_1 x'_1 + 2 x_2 x'_2 + x_1^2 (x'_1)^2 + x_2^2 (x'_2)^2 + 2 x_1 x_2 x'_1 x'_2 = (1 + x_1 x'_1 + x_2 x'_2)^2 = (1 + x \cdot x')^2$

Instead of spending $\approx D^d$ time to compute inner products after a degree-$d$ polynomial basis expansion, we only need $O(D)$ time.
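A quick numerical check of the identity above (not part of the slides), using the feature map $\phi$ exactly as written; the test points are arbitrary.

```python
import numpy as np

def phi(x):
    """Explicit feature map for x in R^2, as on the slide."""
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, x2**2, np.sqrt(2) * x1 * x2])

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
assert np.isclose(phi(x) @ phi(xp), (1.0 + x @ xp) ** 2)  # same value, no expansion needed
```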
Kernel Trick

We can use any symmetric positive semi-definite kernel (a Mercer kernel):

$K = \begin{bmatrix} \kappa(x_1, x_1) & \kappa(x_1, x_2) & \cdots & \kappa(x_1, x_N) \\ \kappa(x_2, x_1) & \kappa(x_2, x_2) & \cdots & \kappa(x_2, x_N) \\ \vdots & \vdots & \ddots & \vdots \\ \kappa(x_N, x_1) & \kappa(x_N, x_2) & \cdots & \kappa(x_N, x_N) \end{bmatrix}$

Here $\kappa(x, x')$ is some measure of similarity between $x$ and $x'$.

The dual program becomes

maximise: $\sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j K_{i,j}$
subject to: $0 \le \alpha_i \le C$ and $\sum_{i=1}^{N} \alpha_i y_i = 0$

To make a prediction on a new $x_{\mathrm{new}}$, we only need to compute $\kappa(x_i, x_{\mathrm{new}})$ for the support vectors $x_i$ (those with $\alpha_i > 0$).
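If a library is preferred, scikit-learn's SVC accepts a precomputed kernel matrix; the sketch below (not from the lecture) shows the intended shapes, and fit_with_kernel is a hypothetical wrapper.

```python
from sklearn.svm import SVC

def fit_with_kernel(K_train, y_train, C=1.0):
    """Fit an SVM from a precomputed N x N kernel matrix K_ij = kappa(x_i, x_j)."""
    clf = SVC(C=C, kernel="precomputed")
    clf.fit(K_train, y_train)
    return clf

# Prediction needs the kernel between new and *training* points:
# clf.predict(K_new_vs_train), where K_new_vs_train has shape (M, N).
```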
Polynomial Kernels

Rather than performing the basis expansion explicitly, use

$\kappa(x, x') = (1 + x \cdot x')^d$

This gives all terms of degree up to $d$.

If we use $\kappa(x, x') = (x \cdot x')^d$, we get only the degree-$d$ terms.

Linear Kernel: $\kappa(x, x') = x \cdot x'$

All of these satisfy the Mercer (positive-definite) condition.
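Minimal sketches (not from the slides) of the polynomial and linear kernels on row-stacked inputs; the function names are my own.

```python
import numpy as np

def poly_kernel(X1, X2, d=2, c=1.0):
    """(c + x . x')^d: c = 1 gives all terms up to degree d, c = 0 only the degree-d terms."""
    return (c + X1 @ X2.T) ** d

def linear_kernel(X1, X2):
    """kappa(x, x') = x . x'."""
    return X1 @ X2.T
```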
Gaussian or RBF Kernel

Radial Basis Function (RBF) or Gaussian kernel:

$\kappa(x, x') = \exp\left( -\frac{\|x - x'\|^2}{2\sigma^2} \right)$

$\sigma^2$ is known as the bandwidth.

We used this with $\gamma = \frac{1}{2\sigma^2}$ when we studied kernel basis expansion for regression.

Can be generalised to more general covariance matrices.

Results in a Mercer kernel.
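A vectorised sketch of the RBF kernel (not from the slides), using $\|x - x'\|^2 = \|x\|^2 + \|x'\|^2 - 2\, x \cdot x'$; rbf_kernel is a hypothetical name.

```python
import numpy as np

def rbf_kernel(X1, X2, sigma=1.0):
    """kappa(x, x') = exp(-||x - x'||^2 / (2 sigma^2)) for all pairs of rows."""
    sq_dists = (np.sum(X1**2, axis=1)[:, None]
                + np.sum(X2**2, axis=1)[None, :]
                - 2.0 * X1 @ X2.T)
    sq_dists = np.maximum(sq_dists, 0.0)          # guard against tiny negative round-off
    return np.exp(-sq_dists / (2.0 * sigma**2))
```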
Kernels on Discrete Data: Cosine Kernel

For text documents: let $x$ denote the bag-of-words vector.

Cosine Similarity

$\kappa(x, x') = \frac{x \cdot x'}{\|x\|_2 \, \|x'\|_2}$

Term frequency: $\mathrm{tf}(c) = \log(1 + c)$, where $c$ is the count of some word $w$

Inverse document frequency: $\mathrm{idf}(w) = \log\left( \frac{N}{1 + N_w} \right)$, where $N$ is the number of documents and $N_w$ the number of documents containing $w$

$\text{tf-idf}(x)_w = \mathrm{tf}(x_w) \, \mathrm{idf}(w)$
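A sketch (not on the slides) of the tf-idf weighting and cosine kernel defined above, assuming a document-by-vocabulary count matrix; the function names are my own.

```python
import numpy as np

def tf_idf(counts):
    """counts: N x V word-count matrix; returns tf-idf features tf(x_w) * idf(w)."""
    N = counts.shape[0]
    tf = np.log1p(counts)                 # tf(c) = log(1 + c)
    Nw = (counts > 0).sum(axis=0)         # number of documents containing each word
    idf = np.log(N / (1.0 + Nw))          # idf(w) = log(N / (1 + N_w))
    return tf * idf

def cosine_kernel(X1, X2, eps=1e-12):
    """kappa(x, x') = (x . x') / (||x||_2 ||x'||_2)."""
    X1n = X1 / (np.linalg.norm(X1, axis=1, keepdims=True) + eps)
    X2n = X2 / (np.linalg.norm(X2, axis=1, keepdims=True) + eps)
    return X1n @ X2n.T
```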
Kernels on Discrete Data: String Kernel

Let $x$ and $x'$ be strings over some alphabet $\mathcal{A}$, e.g.

$\mathcal{A} = \{A, R, N, D, C, E, Q, G, H, I, L, K, M, F, P, S, T, W, Y, V\}$

$\kappa(x, x') = \sum_s w_s \, \phi_s(x) \, \phi_s(x')$

$\phi_s(x)$ is the number of times $s$ appears in $x$ as a substring

$w_s$ is the weight associated with substring $s$
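A naive sketch (not from the slides) of the substring kernel above; the cutoff on substring length and the unit default weights are my own assumptions, and practical string kernels use suffix-tree or dynamic-programming methods instead.

```python
from collections import Counter

def substring_counts(x, max_len=3):
    """phi_s(x): counts of all substrings s of x up to length max_len."""
    counts = Counter()
    for length in range(1, max_len + 1):
        for start in range(len(x) - length + 1):
            counts[x[start:start + length]] += 1
    return counts

def string_kernel(x, x_prime, weights=None, max_len=3):
    """kappa(x, x') = sum_s w_s phi_s(x) phi_s(x'); w_s defaults to 1."""
    cx, cxp = substring_counts(x, max_len), substring_counts(x_prime, max_len)
    return sum((weights.get(s, 1.0) if weights else 1.0) * cx[s] * cxp[s]
               for s in set(cx) & set(cxp))
```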
How to choose a good kernel?

It is not always easy to tell whether a kernel function is a Mercer kernel.

Mercer Condition: for any finite set of points, the kernel matrix should be positive semi-definite.

If the following hold:
◮ $\kappa_1, \kappa_2$ are Mercer kernels for points in $\mathbb{R}^D$
◮ $f : \mathbb{R}^D \to \mathbb{R}$
◮ $\phi : \mathbb{R}^D \to \mathbb{R}^M$
◮ $\kappa_3$ is a Mercer kernel on $\mathbb{R}^M$

then the following are Mercer kernels:
◮ $\kappa_1 + \kappa_2$, $\kappa_1 \cdot \kappa_2$, and $\alpha \kappa_1$ for $\alpha \ge 0$
◮ $\kappa(x, x') = f(x) f(x')$
◮ $\kappa_3(\phi(x), \phi(x'))$
◮ $\kappa(x, x') = x^T A x'$ for $A$ positive definite
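An empirical sanity check (not from the slides) of the Mercer condition on a finite sample: symmetrise the kernel matrix and verify that its eigenvalues are non-negative up to numerical tolerance. This cannot prove a function is a Mercer kernel, only refute it on the given points.

```python
import numpy as np

def looks_psd(K, tol=1e-8):
    """True if the kernel matrix K is (numerically) positive semi-definite."""
    K_sym = 0.5 * (K + K.T)                     # symmetrise against round-off
    return bool(np.all(np.linalg.eigvalsh(K_sym) >= -tol))
```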