COMP24111: Machine Learning and Optimisation Chapter 4: Support Vector Machines Dr. Tingting Mu Email: tingting.mu@manchester.ac.uk
Outline
• Geometry concepts: hyperplane, distance, parallel hyperplanes, margin.
• Basic idea of the support vector machine (SVM).
• Hard-margin SVM.
• Soft-margin SVM.
• Support vectors.
• Nonlinear classification:
– Kernel trick
– Linear basis function model
History and Information
• Vapnik and Lerner (1963) introduced the generalised portrait algorithm. The algorithm implemented by SVMs is a nonlinear generalisation of the generalised portrait algorithm.
• The support vector machine was first introduced in 1992:
– Boser et al. A training algorithm for optimal margin classifiers. Proceedings of the 5th Annual Workshop on Computational Learning Theory, 144-152, Pittsburgh, 1992.
• More on SVM history: http://www.svms.org/history.html
• Centralised website: http://www.kernel-machines.org
• Popular textbook:
– N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, 2000. http://www.support-vector.net
• Popular libraries: LIBSVM, MATLAB SVM, scikit-learn (machine learning in Python).
Hyperplane and Distance
• The equation $w_1 x_1 + w_2 x_2 + \dots + w_d x_d + b = 0 \Leftrightarrow \mathbf{w}^T \mathbf{x} + b = 0$ defines a hyperplane.
• In 2D space ($w_1 x_1 + w_2 x_2 + b = 0$) it is a straight line; in 3D space ($w_1 x_1 + w_2 x_2 + w_3 x_3 + b = 0$) it is a plane. The vector $\mathbf{w}$ gives the hyperplane direction (it is normal to the hyperplane).
• The signed distance from an arbitrary point $\mathbf{x}$ to the hyperplane is
$$r = \frac{\mathbf{w}^T \mathbf{x} + b}{\|\mathbf{w}\|_2} = \frac{\mathbf{w}^T \mathbf{x} + b}{\sqrt{\sum_{i=1}^{d} w_i^2}}.$$
• Whether $r$ is positive or negative depends on which side of the hyperplane $\mathbf{x}$ lies. Setting $\mathbf{x} = \mathbf{0}$ gives the distance from the origin to the hyperplane, $b / \|\mathbf{w}\|_2$.
(Figure: a hyperplane $\mathbf{w}^T\mathbf{x} + b = 0$ in the $(x_1, x_2)$ plane, with its normal direction $\mathbf{w}$.)
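A minimal numerical check of the distance formula (a sketch, not from the slides; the hyperplane and point values are invented for illustration):

```python
import numpy as np

# Hyperplane w^T x + b = 0 in 2D: here w = (3, 4), b = -12 (illustrative values).
w = np.array([3.0, 4.0])
b = -12.0

def signed_distance(x, w, b):
    """Signed distance r = (w^T x + b) / ||w||_2 from point x to the hyperplane."""
    return (w @ x + b) / np.linalg.norm(w)

x = np.array([5.0, 1.0])
print(signed_distance(x, w, b))            # positive: x lies on the side w points towards
print(signed_distance(np.zeros(2), w, b))  # b / ||w||_2: signed distance from the origin
```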
Parallel Hyperplanes
• We focus on two parallel hyperplanes:
$$\mathbf{w}^T \mathbf{x} + b = 1 \quad \text{and} \quad \mathbf{w}^T \mathbf{x} + b = -1.$$
• Geometrically, the distance between these two hyperplanes is $\frac{2}{\|\mathbf{w}\|_2}$.
• Each of them lies at distance $\rho = \frac{1}{\|\mathbf{w}\|_2}$ from the middle hyperplane $\mathbf{w}^T \mathbf{x} + b = 0$: for any point $\mathbf{z}$ with $\mathbf{w}^T \mathbf{z} + b = 1$, its distance to the middle hyperplane is $r = \frac{\mathbf{w}^T \mathbf{z} + b}{\|\mathbf{w}\|_2} = \frac{1}{\|\mathbf{w}\|_2} = \rho$.
(Figure: the three parallel hyperplanes $\mathbf{w}^T\mathbf{x} + b = 1, 0, -1$, with each neighbouring pair separated by $\rho = 1/\|\mathbf{w}\|_2$.)
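A quick numerical sanity check of the $2/\|\mathbf{w}\|_2$ gap (a sketch with made-up values for w and b):

```python
import numpy as np

w = np.array([3.0, 4.0])   # illustrative hyperplane normal
b = -2.0

# Take a point on w^T x + b = +1 and a point on w^T x + b = -1
# along the normal direction, and measure their separation.
x0 = -b * w / (w @ w)        # a point on w^T x + b = 0
x_plus = x0 + w / (w @ w)    # shifted so that w^T x_plus + b = +1
x_minus = x0 - w / (w @ w)   # shifted so that w^T x_minus + b = -1

print(w @ x_plus + b, w @ x_minus + b)   # 1.0 and -1.0
print(np.linalg.norm(x_plus - x_minus))  # equals 2 / ||w||_2
print(2 / np.linalg.norm(w))
```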
We start from an ideal classification case: the linearly separable case. We focus on the binary classification problem in this lecture.
(Figure: two linearly separable classes of points in the $(x_1, x_2)$ plane.)
Separation Margin
• Given the two parallel hyperplanes above, we separate two classes of data points by preventing the data points from falling into the margin:
$$\mathbf{w}^T \mathbf{x} + b \ge 1 \ \text{if } y = 1, \qquad \mathbf{w}^T \mathbf{x} + b \le -1 \ \text{if } y = -1,$$
or, as an equivalent expression, $y(\mathbf{w}^T \mathbf{x} + b) \ge 1$.
• The region bounded by these two hyperplanes is called the separation "margin", whose width is
$$\rho = \frac{2}{\|\mathbf{w}\|_2} = \frac{2}{\sqrt{\mathbf{w}^T \mathbf{w}}}.$$
(Figure: the hyperplanes $\mathbf{w}^T\mathbf{x} + b = 1, 0, -1$, with the margin of width $2/\|\mathbf{w}\|_2$ between the outer two.)
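A small sketch (with an invented hyperplane and labelled points) showing how the single condition $y(\mathbf{w}^T\mathbf{x} + b) \ge 1$ encodes both per-class constraints:

```python
import numpy as np

w, b = np.array([1.0, 1.0]), -3.0   # illustrative separating hyperplane

X = np.array([[4.0, 1.0],   # class +1, outside the margin
              [2.5, 1.0],   # class +1, but falls inside the margin
              [0.5, 0.5]])  # class -1, outside the margin
y = np.array([1, 1, -1])

scores = X @ w + b    # w^T x_i + b for each point
margins = y * scores  # y_i (w^T x_i + b)
print(margins)        # [2.0, 0.5, 2.0]
print(margins >= 1)   # [True, False, True]: the middle point violates the margin
```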
Support Vector Machine (SVM)
• The aim of SVM is simply to find an optimal hyperplane that separates the two classes of data points with the widest margin.
(Figure: the linearly separable data with several candidate separating hyperplanes. Which is better?)
Support Vector Machine (SVM)
• Finding the widest-margin separating hyperplane can be formulated as a constrained optimisation problem. Maximising the margin $\frac{2}{\sqrt{\mathbf{w}^T\mathbf{w}}}$ is equivalent to minimising $\frac{1}{2}\mathbf{w}^T\mathbf{w}$:
$$\min_{\mathbf{w}, b} \ \frac{1}{2}\mathbf{w}^T\mathbf{w} \quad \text{s.t.} \quad y_i(\mathbf{w}^T\mathbf{x}_i + b) \ge 1, \ \forall i \in \{1, \dots, N\}.$$
• The constraints stop training samples from falling into the margin.
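As a sketch of this formulation in practice, scikit-learn's SVC with a linear kernel and a very large C behaves approximately like a hard-margin SVM on linearly separable data (the toy data below is invented for illustration):

```python
import numpy as np
from sklearn.svm import SVC

# Two small, linearly separable clusters (toy data).
rng = np.random.default_rng(0)
X_pos = rng.normal(loc=[2.0, 2.0], scale=0.3, size=(20, 2))
X_neg = rng.normal(loc=[-2.0, -2.0], scale=0.3, size=(20, 2))
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 20 + [-1] * 20)

# A very large C leaves (almost) no room for margin violations,
# approximating the hard-margin objective min (1/2) w^T w.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, "b =", b)
print("margin width =", 2 / np.linalg.norm(w))
print("min y_i (w^T x_i + b) =", (y * (X @ w + b)).min())  # should be close to 1
```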
Support Vectors
• Support vectors: training points that satisfy $y_i(\mathbf{w}^T\mathbf{x}_i + b) = 1$.
• These points are the most difficult to classify and are very important for the location of the optimal hyperplane.
(Figure: the optimal hyperplane $\mathbf{w}^T\mathbf{x} + b = 0$ between the upper hyperplane $\mathbf{w}^T\mathbf{x} + b = +1$ and the lower hyperplane $\mathbf{w}^T\mathbf{x} + b = -1$, a distance $2/\|\mathbf{w}\|_2$ apart; the support vectors are the points lying on these two hyperplanes.)
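Continuing the sketch above, the support vectors can be read off either from the condition $y_i(\mathbf{w}^T\mathbf{x}_i + b) \approx 1$ or directly from the fitted model (the attribute names are scikit-learn's):

```python
# Continuing from the previous snippet (clf, X, y, w, b already defined).
import numpy as np

margins = y * (X @ w + b)
on_margin = np.isclose(margins, 1.0, atol=1e-3)  # points with y_i (w^T x_i + b) = 1
print("points on the margin:", np.where(on_margin)[0])

# scikit-learn stores the same information after fitting:
print("support vector indices:", clf.support_)
print("support vectors:\n", clf.support_vectors_)
```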
SVM Training
• SVM training is the process of solving the following constrained optimisation problem:
$$\min_{\mathbf{w}, b} \ \frac{1}{2}\mathbf{w}^T\mathbf{w} \quad \text{s.t.} \quad y_i(\mathbf{w}^T\mathbf{x}_i + b) \ge 1, \ \forall i \in \{1, \dots, N\}.$$
(How to derive the dual form can be found in the notes as optional reading material.)
• The above problem is solved through a dual problem, built from the Lagrangian function
$$L(\boldsymbol{\lambda}) = \sum_{i=1}^{N} \lambda_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \lambda_i \lambda_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j.$$
• The new variables $\{\lambda_i\}_{i=1}^{N}$ are called Lagrange multipliers. They must be non-negative.
• A fixed relationship exists between $\mathbf{w}$, $b$ and $\{\lambda_i\}_{i=1}^{N}$ (in particular, $\mathbf{w} = \sum_{i=1}^{N} \lambda_i y_i \mathbf{x}_i$).
SVM Training
• The dual problem is a quadratic programming (QP) problem in optimisation:
$$\max_{\boldsymbol{\lambda} \in \mathbb{R}^N} \ \sum_{i=1}^{N} \lambda_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \lambda_i \lambda_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j \quad \text{s.t.} \quad \sum_{i=1}^{N} \lambda_i y_i = 0, \quad \lambda_i \ge 0.$$
• The SVM we have learned so far is the hard-margin SVM.
• One way to solve the QP problem for SVM can be found in the notes as optional reading material.
• There are many QP solvers available: https://en.wikipedia.org/wiki/Quadratic_programming
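A minimal sketch of solving this dual with a general-purpose solver rather than a dedicated QP library (SciPy's SLSQP here; the toy data is invented, and in practice LIBSVM or scikit-learn would be used instead):

```python
import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable toy set.
X = np.array([[2.0, 2.0], [2.5, 1.5], [-2.0, -1.5], [-1.5, -2.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
N = len(y)

Q = (y[:, None] * y[None, :]) * (X @ X.T)   # Q_ij = y_i y_j x_i^T x_j

def neg_dual(lam):
    # Negate because we maximise the dual but scipy minimises.
    return -(lam.sum() - 0.5 * lam @ Q @ lam)

res = minimize(neg_dual, x0=np.zeros(N), method="SLSQP",
               bounds=[(0, None)] * N,                                    # lambda_i >= 0
               constraints=[{"type": "eq", "fun": lambda lam: lam @ y}])  # sum_i lambda_i y_i = 0

lam = res.x
w = (lam * y) @ X              # w = sum_i lambda_i y_i x_i
sv = lam > 1e-6                # support vectors have lambda_i > 0
b = np.mean(y[sv] - X[sv] @ w) # b from any support vector: y_i - w^T x_i
print("lambda =", lam.round(4), "\nw =", w, " b =", round(b, 4))
```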
So far, we have worked on simple cases with separable data patterns. What if the data points show non-separable patterns? In practice, no datasets are ideally linearly separable. This means that some data points are bound to be misclassified by a linear hyperplane.
(Figure: a linearly separable dataset next to a non-separable one in which the two classes overlap.)
Non-separable Patterns
• We use slack variables $\xi_i \ge 0$ ($i = 1, 2, \dots, N$), each of which measures the deviation of the $i$-th point from the ideal situation, to relax the previous constraints as:
$$\mathbf{w}^T\mathbf{x}_i + b \ge 1 - \xi_i \ \text{if } y_i = 1, \qquad \mathbf{w}^T\mathbf{x}_i + b \le -(1 - \xi_i) \ \text{if } y_i = -1,$$
or, as an equivalent expression, $y_i(\mathbf{w}^T\mathbf{x}_i + b) \ge 1 - \xi_i$.
• We no longer push all the points to stay outside the margin.
(Figure: a point within the region of separation but still on the right side of the decision boundary has $0 < \xi_i \le 1$; a point on the wrong side of the decision boundary has $\xi_i > 1$.)
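A small sketch of how the slack values can be computed for a given hyperplane (hyperplane and points invented for illustration); $\xi_i = \max(0, 1 - y_i(\mathbf{w}^T\mathbf{x}_i + b))$ is the smallest slack satisfying the relaxed constraint:

```python
import numpy as np

w, b = np.array([1.0, 1.0]), -3.0   # illustrative hyperplane

X = np.array([[4.0, 1.0],    # correctly classified, outside the margin
              [2.5, 1.0],    # inside the margin, right side
              [1.0, 1.0]])   # wrong side of the decision boundary
y = np.array([1, 1, 1])

xi = np.maximum(0.0, 1.0 - y * (X @ w + b))
print(xi)   # [0.0, 0.5, 2.0]: xi = 0 outside margin, 0 < xi <= 1 inside, xi > 1 misclassified
```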
Modified Optimisation
• In addition to maximising the margin as before, we need to keep all slacks $\xi_i$ as small as possible to minimise the classification errors. The modified SVM optimisation problem becomes:
$$\min_{(\mathbf{w}, b) \in \mathbb{R}^{d+1}, \ \{\xi_i\}_{i=1}^{N}} \ \frac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_{i=1}^{N}\xi_i \quad \text{s.t.} \quad y_i(\mathbf{w}^T\mathbf{x}_i + b) \ge 1 - \xi_i, \ \xi_i \ge 0, \ \forall i \in \{1, \dots, N\},$$
where $C \ge 0$ is a user-defined parameter which controls the regularisation: the trade-off between complexity and non-separable patterns.
• The above constrained optimisation problem can be converted to a QP problem, the soft-margin SVM dual:
$$\max_{\boldsymbol{\lambda} \in \mathbb{R}^N} \ \sum_{i=1}^{N} \lambda_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \lambda_i \lambda_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j \quad \text{s.t.} \quad \sum_{i=1}^{N} \lambda_i y_i = 0, \quad 0 \le \lambda_i \le C.$$
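This soft-margin formulation is essentially what scikit-learn's linear-kernel SVC optimises; a quick sketch (invented overlapping data) of how C trades margin width against margin violations:

```python
import numpy as np
from sklearn.svm import SVC

# Two overlapping Gaussian clusters (toy, non-separable data).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([1.5, 1.5], 1.0, (50, 2)),
               rng.normal([-1.5, -1.5], 1.0, (50, 2))])
y = np.array([1] * 50 + [-1] * 50)

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_[0]
    slacks = np.maximum(0.0, 1.0 - y * clf.decision_function(X))
    print(f"C={C:>6}: margin width = {2/np.linalg.norm(w):.3f}, "
          f"total slack = {slacks.sum():.2f}, #SV = {len(clf.support_)}")
```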
Support Vectors
• Support vectors: training points that satisfy $y_i(\mathbf{w}^T\mathbf{x}_i + b) = 1 - \xi_i$ with $\xi_i \ge 0$.
• These points either (1) lie on one of the two parallel hyperplanes, (2) fall within the margin, or (3) stay on the wrong side of the separating hyperplane.
• Support vectors represent points that are difficult to classify and are important for deciding the location of the separating hyperplane.
(Figure: support vectors of a soft-margin SVM: type (1) on the margin hyperplanes, type (2) inside the margin, and type (3) on the wrong side of the separating hyperplane.)
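Continuing the soft-margin sketch above, the fitted model exposes its support vectors directly, and the slack values separate the three cases (attribute names are scikit-learn's; the tolerance thresholds are illustrative):

```python
# Continuing from the previous snippet (clf fitted with some C, X and y defined).
import numpy as np

sv_idx = clf.support_   # indices of the support vectors
slacks = np.maximum(0.0, 1.0 - y * clf.decision_function(X))

on_margin  = sv_idx[np.isclose(slacks[sv_idx], 0.0, atol=1e-3)]         # type (1)
in_margin  = sv_idx[(slacks[sv_idx] > 1e-3) & (slacks[sv_idx] <= 1.0)]  # type (2)
wrong_side = sv_idx[slacks[sv_idx] > 1.0]                               # type (3)

print(len(on_margin), "on the margin,", len(in_margin),
      "inside the margin,", len(wrong_side), "on the wrong side")
```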
So far, we can handle linear data patterns. What if the data points show non-linear patterns?
(Figure: a linearly separable dataset next to one whose classes can only be separated by a non-linear boundary.)
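The next topic (the kernel trick) handles such cases; as a preview sketch, a non-linear kernel lets the same SVM machinery separate data that no straight line can (synthetic data from scikit-learn):

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: impossible to separate with a linear hyperplane in 2D.
X, y = make_circles(n_samples=200, factor=0.4, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)
rbf_svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)

print("linear kernel accuracy:", linear_svm.score(X, y))  # close to chance (~0.5)
print("RBF kernel accuracy:   ", rbf_svm.score(X, y))     # close to 1.0
```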