COMP24111: Machine Learning and Optimisation Chapter 4: Support Vector Machines Dr. Tingting Mu Email: tingting.mu@manchester.ac.uk
Outline
• Geometry concepts: hyperplane, distance, parallel hyperplanes, margin.
• Basic idea of the support vector machine (SVM).
• Hard-margin SVM.
• Soft-margin SVM.
• Support vectors.
• Nonlinear classification:
– Kernel trick
– Linear basis function model
History and Information
• Vapnik and Lerner (1963) introduced the generalised portrait algorithm. The algorithm implemented by SVMs is a nonlinear generalisation of the generalised portrait algorithm.
• The support vector machine was first introduced in 1992:
– Boser et al. A training algorithm for optimal margin classifiers. Proceedings of the 5th Annual Workshop on Computational Learning Theory, 144-152, Pittsburgh, 1992.
• More on SVM history: http://www.svms.org/history.html
• Centralised website: http://www.kernel-machines.org
• Popular textbook:
– N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, 2000. http://www.support-vector.net
• Popular libraries: LIBSVM, MATLAB SVM, scikit-learn (machine learning in Python).
Hyperplane and Distance
• The equation $w_1 x_1 + w_2 x_2 + \dots + w_d x_d + b = 0 \Leftrightarrow \mathbf{w}^T \mathbf{x} + b = 0$ defines a hyperplane.
• In 2D space ($w_1 x_1 + w_2 x_2 + b = 0$) it is a straight line; in 3D space ($w_1 x_1 + w_2 x_2 + w_3 x_3 + b = 0$) it is a plane. The vector $\mathbf{w}$ gives the hyperplane direction (it is normal to the hyperplane).
• The signed distance from an arbitrary point $\mathbf{x}$ to the hyperplane is
$$r = \frac{\mathbf{w}^T \mathbf{x} + b}{\|\mathbf{w}\|_2} = \frac{\mathbf{w}^T \mathbf{x} + b}{\sqrt{\sum_{i=1}^{d} w_i^2}}.$$
• Whether $r$ is positive or negative depends on which side of the hyperplane $\mathbf{x}$ lies. Setting $\mathbf{x} = \mathbf{0}$ gives the distance from the origin to the hyperplane, $b / \|\mathbf{w}\|_2$.
(Figure: a hyperplane $\mathbf{w}^T\mathbf{x} + b = 0$ in the $(x_1, x_2)$ plane, with its normal direction $\mathbf{w}$.)
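A minimal numerical check of the distance formula (a sketch, not from the slides; the hyperplane and point values are invented for illustration):

```python
import numpy as np

# Hyperplane w^T x + b = 0 in 2D: here w = (3, 4), b = -12 (illustrative values).
w = np.array([3.0, 4.0])
b = -12.0

def signed_distance(x, w, b):
    """Signed distance r = (w^T x + b) / ||w||_2 from point x to the hyperplane."""
    return (w @ x + b) / np.linalg.norm(w)

x = np.array([5.0, 1.0])
print(signed_distance(x, w, b))            # positive: x lies on the side w points towards
print(signed_distance(np.zeros(2), w, b))  # b / ||w||_2: signed distance from the origin
```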
Parallel Hyperplanes
• We focus on two parallel hyperplanes:
$$\mathbf{w}^T \mathbf{x} + b = 1 \quad \text{and} \quad \mathbf{w}^T \mathbf{x} + b = -1.$$
• Geometrically, the distance between these two hyperplanes is $\frac{2}{\|\mathbf{w}\|_2}$.
• Each of them lies at distance $\rho = \frac{1}{\|\mathbf{w}\|_2}$ from the middle hyperplane $\mathbf{w}^T \mathbf{x} + b = 0$: for any point $\mathbf{z}$ with $\mathbf{w}^T \mathbf{z} + b = 1$, its distance to the middle hyperplane is $r = \frac{\mathbf{w}^T \mathbf{z} + b}{\|\mathbf{w}\|_2} = \frac{1}{\|\mathbf{w}\|_2} = \rho$.
(Figure: the three parallel hyperplanes $\mathbf{w}^T\mathbf{x} + b = 1, 0, -1$, with each neighbouring pair separated by $\rho = 1/\|\mathbf{w}\|_2$.)
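A quick numerical sanity check of the $2/\|\mathbf{w}\|_2$ gap (a sketch with made-up values for w and b):

```python
import numpy as np

w = np.array([3.0, 4.0])   # illustrative hyperplane normal
b = -2.0

# Take a point on w^T x + b = +1 and a point on w^T x + b = -1
# along the normal direction, and measure their separation.
x0 = -b * w / (w @ w)        # a point on w^T x + b = 0
x_plus = x0 + w / (w @ w)    # shifted so that w^T x_plus + b = +1
x_minus = x0 - w / (w @ w)   # shifted so that w^T x_minus + b = -1

print(w @ x_plus + b, w @ x_minus + b)   # 1.0 and -1.0
print(np.linalg.norm(x_plus - x_minus))  # equals 2 / ||w||_2
print(2 / np.linalg.norm(w))
```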
We start from an ideal classification case: the linearly separable case. We focus on the binary classification problem in this lecture.
(Figure: two linearly separable classes of points in the $(x_1, x_2)$ plane.)
Separation Margin
• Given the two parallel hyperplanes above, we separate two classes of data points by preventing the data points from falling into the margin:
$$\mathbf{w}^T \mathbf{x} + b \ge 1 \ \text{if } y = 1, \qquad \mathbf{w}^T \mathbf{x} + b \le -1 \ \text{if } y = -1,$$
or, as an equivalent expression, $y(\mathbf{w}^T \mathbf{x} + b) \ge 1$.
• The region bounded by these two hyperplanes is called the separation "margin", whose width is
$$\rho = \frac{2}{\|\mathbf{w}\|_2} = \frac{2}{\sqrt{\mathbf{w}^T \mathbf{w}}}.$$
(Figure: the hyperplanes $\mathbf{w}^T\mathbf{x} + b = 1, 0, -1$, with the margin of width $2/\|\mathbf{w}\|_2$ between the outer two.)
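A small sketch (with an invented hyperplane and labelled points) showing how the single condition $y(\mathbf{w}^T\mathbf{x} + b) \ge 1$ encodes both per-class constraints:

```python
import numpy as np

w, b = np.array([1.0, 1.0]), -3.0   # illustrative separating hyperplane

X = np.array([[4.0, 1.0],   # class +1, outside the margin
              [2.5, 1.0],   # class +1, but falls inside the margin
              [0.5, 0.5]])  # class -1, outside the margin
y = np.array([1, 1, -1])

scores = X @ w + b    # w^T x_i + b for each point
margins = y * scores  # y_i (w^T x_i + b)
print(margins)        # [2.0, 0.5, 2.0]
print(margins >= 1)   # [True, False, True]: the middle point violates the margin
```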
Support Vector Machine (SVM)
• The aim of SVM is simply to find an optimal hyperplane that separates the two classes of data points with the widest margin.
(Figure: the linearly separable data with several candidate separating hyperplanes. Which is better?)
Support Vector Machine (SVM)
• Finding the widest-margin separating hyperplane can be formulated as a constrained optimisation problem. Maximising the margin $\frac{2}{\sqrt{\mathbf{w}^T\mathbf{w}}}$ is equivalent to minimising $\frac{1}{2}\mathbf{w}^T\mathbf{w}$:
$$\min_{\mathbf{w}, b} \ \frac{1}{2}\mathbf{w}^T\mathbf{w} \quad \text{s.t.} \quad y_i(\mathbf{w}^T\mathbf{x}_i + b) \ge 1, \ \forall i \in \{1, \dots, N\}.$$
• The constraints stop training samples from falling into the margin.
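As a sketch of this formulation in practice, scikit-learn's SVC with a linear kernel and a very large C behaves approximately like a hard-margin SVM on linearly separable data (the toy data below is invented for illustration):

```python
import numpy as np
from sklearn.svm import SVC

# Two small, linearly separable clusters (toy data).
rng = np.random.default_rng(0)
X_pos = rng.normal(loc=[2.0, 2.0], scale=0.3, size=(20, 2))
X_neg = rng.normal(loc=[-2.0, -2.0], scale=0.3, size=(20, 2))
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 20 + [-1] * 20)

# A very large C leaves (almost) no room for margin violations,
# approximating the hard-margin objective min (1/2) w^T w.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, "b =", b)
print("margin width =", 2 / np.linalg.norm(w))
print("min y_i (w^T x_i + b) =", (y * (X @ w + b)).min())  # should be close to 1
```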
Support Vectors
• Support vectors: training points that satisfy $y_i(\mathbf{w}^T\mathbf{x}_i + b) = 1$.
• These points are the most difficult to classify and are very important for the location of the optimal hyperplane.
(Figure: the optimal hyperplane $\mathbf{w}^T\mathbf{x} + b = 0$ between the upper hyperplane $\mathbf{w}^T\mathbf{x} + b = +1$ and the lower hyperplane $\mathbf{w}^T\mathbf{x} + b = -1$, a distance $2/\|\mathbf{w}\|_2$ apart; the support vectors are the points lying on these two hyperplanes.)
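Continuing the sketch above, the support vectors can be read off either from the condition $y_i(\mathbf{w}^T\mathbf{x}_i + b) \approx 1$ or directly from the fitted model (the attribute names are scikit-learn's):

```python
# Continuing from the previous snippet (clf, X, y, w, b already defined).
import numpy as np

margins = y * (X @ w + b)
on_margin = np.isclose(margins, 1.0, atol=1e-3)  # points with y_i (w^T x_i + b) = 1
print("points on the margin:", np.where(on_margin)[0])

# scikit-learn stores the same information after fitting:
print("support vector indices:", clf.support_)
print("support vectors:\n", clf.support_vectors_)
```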
SVM Training
• SVM training is the process of solving the following constrained optimisation problem:
$$\min_{\mathbf{w}, b} \ \frac{1}{2}\mathbf{w}^T\mathbf{w} \quad \text{s.t.} \quad y_i(\mathbf{w}^T\mathbf{x}_i + b) \ge 1, \ \forall i \in \{1, \dots, N\}.$$
(How to derive the dual form can be found in the notes as optional reading material.)
• The above problem is solved through a dual problem, built from the Lagrangian function
$$L(\boldsymbol{\lambda}) = \sum_{i=1}^{N} \lambda_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \lambda_i \lambda_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j.$$
• The new variables $\{\lambda_i\}_{i=1}^{N}$ are called Lagrange multipliers. They must be non-negative.
• A fixed relationship exists between $\mathbf{w}$, $b$ and $\{\lambda_i\}_{i=1}^{N}$ (in particular, $\mathbf{w} = \sum_{i=1}^{N} \lambda_i y_i \mathbf{x}_i$).
SVM Training
• The dual problem is a quadratic programming (QP) problem in optimisation:
$$\max_{\boldsymbol{\lambda} \in \mathbb{R}^N} \ \sum_{i=1}^{N} \lambda_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \lambda_i \lambda_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j \quad \text{s.t.} \quad \sum_{i=1}^{N} \lambda_i y_i = 0, \quad \lambda_i \ge 0.$$
• The SVM we have learned so far is the hard-margin SVM.
• One way to solve the QP problem for SVM can be found in the notes as optional reading material.
• There are many QP solvers available: https://en.wikipedia.org/wiki/Quadratic_programming
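A minimal sketch of solving this dual with a general-purpose solver rather than a dedicated QP library (SciPy's SLSQP here; the toy data is invented, and in practice LIBSVM or scikit-learn would be used instead):

```python
import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable toy set.
X = np.array([[2.0, 2.0], [2.5, 1.5], [-2.0, -1.5], [-1.5, -2.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
N = len(y)

Q = (y[:, None] * y[None, :]) * (X @ X.T)   # Q_ij = y_i y_j x_i^T x_j

def neg_dual(lam):
    # Negate because we maximise the dual but scipy minimises.
    return -(lam.sum() - 0.5 * lam @ Q @ lam)

res = minimize(neg_dual, x0=np.zeros(N), method="SLSQP",
               bounds=[(0, None)] * N,                                    # lambda_i >= 0
               constraints=[{"type": "eq", "fun": lambda lam: lam @ y}])  # sum_i lambda_i y_i = 0

lam = res.x
w = (lam * y) @ X              # w = sum_i lambda_i y_i x_i
sv = lam > 1e-6                # support vectors have lambda_i > 0
b = np.mean(y[sv] - X[sv] @ w) # b from any support vector: y_i - w^T x_i
print("lambda =", lam.round(4), "\nw =", w, " b =", round(b, 4))
```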
So far, we have worked on simple cases with separable data patterns. What if the data points show non-separable patterns? In practice, no datasets are ideally linearly separable. This means that some data points are bound to be misclassified by a linear hyperplane.
(Figure: a linearly separable dataset next to a non-separable one in which the two classes overlap.)
Non-separable Patterns
• We use slack variables $\xi_i \ge 0$ ($i = 1, 2, \dots, N$), each of which measures the deviation of the $i$-th point from the ideal situation, to relax the previous constraints as:
$$\mathbf{w}^T\mathbf{x}_i + b \ge 1 - \xi_i \ \text{if } y_i = 1, \qquad \mathbf{w}^T\mathbf{x}_i + b \le -(1 - \xi_i) \ \text{if } y_i = -1,$$
or, as an equivalent expression, $y_i(\mathbf{w}^T\mathbf{x}_i + b) \ge 1 - \xi_i$.
• We no longer push all the points to stay outside the margin.
(Figure: a point within the region of separation but still on the right side of the decision boundary has $0 < \xi_i \le 1$; a point on the wrong side of the decision boundary has $\xi_i > 1$.)
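A small sketch of how the slack values can be computed for a given hyperplane (hyperplane and points invented for illustration); $\xi_i = \max(0, 1 - y_i(\mathbf{w}^T\mathbf{x}_i + b))$ is the smallest slack satisfying the relaxed constraint:

```python
import numpy as np

w, b = np.array([1.0, 1.0]), -3.0   # illustrative hyperplane

X = np.array([[4.0, 1.0],    # correctly classified, outside the margin
              [2.5, 1.0],    # inside the margin, right side
              [1.0, 1.0]])   # wrong side of the decision boundary
y = np.array([1, 1, 1])

xi = np.maximum(0.0, 1.0 - y * (X @ w + b))
print(xi)   # [0.0, 0.5, 2.0]: xi = 0 outside margin, 0 < xi <= 1 inside, xi > 1 misclassified
```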
Modified Optimisation
• In addition to maximising the margin as before, we need to keep all slacks $\xi_i$ as small as possible to minimise the classification errors. The modified SVM optimisation problem becomes:
$$\min_{(\mathbf{w}, b) \in \mathbb{R}^{d+1}, \ \{\xi_i\}_{i=1}^{N}} \ \frac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_{i=1}^{N}\xi_i \quad \text{s.t.} \quad y_i(\mathbf{w}^T\mathbf{x}_i + b) \ge 1 - \xi_i, \ \xi_i \ge 0, \ \forall i \in \{1, \dots, N\},$$
where $C \ge 0$ is a user-defined parameter which controls the regularisation: the trade-off between complexity and non-separable patterns.
• The above constrained optimisation problem can be converted to a QP problem, the soft-margin SVM dual:
$$\max_{\boldsymbol{\lambda} \in \mathbb{R}^N} \ \sum_{i=1}^{N} \lambda_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \lambda_i \lambda_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j \quad \text{s.t.} \quad \sum_{i=1}^{N} \lambda_i y_i = 0, \quad 0 \le \lambda_i \le C.$$
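This soft-margin formulation is essentially what scikit-learn's linear-kernel SVC optimises; a quick sketch (invented overlapping data) of how C trades margin width against margin violations:

```python
import numpy as np
from sklearn.svm import SVC

# Two overlapping Gaussian clusters (toy, non-separable data).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([1.5, 1.5], 1.0, (50, 2)),
               rng.normal([-1.5, -1.5], 1.0, (50, 2))])
y = np.array([1] * 50 + [-1] * 50)

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_[0]
    slacks = np.maximum(0.0, 1.0 - y * clf.decision_function(X))
    print(f"C={C:>6}: margin width = {2/np.linalg.norm(w):.3f}, "
          f"total slack = {slacks.sum():.2f}, #SV = {len(clf.support_)}")
```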
Support Vectors
• Support vectors: training points that satisfy $y_i(\mathbf{w}^T\mathbf{x}_i + b) = 1 - \xi_i$ with $\xi_i \ge 0$.
• These points either (1) lie on one of the two parallel hyperplanes, (2) fall within the margin, or (3) stay on the wrong side of the separating hyperplane.
• Support vectors represent points that are difficult to classify and are important for deciding the location of the separating hyperplane.
(Figure: support vectors of a soft-margin SVM: type (1) on the margin hyperplanes, type (2) inside the margin, and type (3) on the wrong side of the separating hyperplane.)
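Continuing the soft-margin sketch above, the fitted model exposes its support vectors directly, and the slack values separate the three cases (attribute names are scikit-learn's; the tolerance thresholds are illustrative):

```python
# Continuing from the previous snippet (clf fitted with some C, X and y defined).
import numpy as np

sv_idx = clf.support_   # indices of the support vectors
slacks = np.maximum(0.0, 1.0 - y * clf.decision_function(X))

on_margin  = sv_idx[np.isclose(slacks[sv_idx], 0.0, atol=1e-3)]         # type (1)
in_margin  = sv_idx[(slacks[sv_idx] > 1e-3) & (slacks[sv_idx] <= 1.0)]  # type (2)
wrong_side = sv_idx[slacks[sv_idx] > 1.0]                               # type (3)

print(len(on_margin), "on the margin,", len(in_margin),
      "inside the margin,", len(wrong_side), "on the wrong side")
```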
So far, we can handle linear data patterns. What if the data points show non-linear patterns?
(Figure: a linearly separable dataset next to one whose classes can only be separated by a non-linear boundary.)
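The next topic (the kernel trick) handles such cases; as a preview sketch, a non-linear kernel lets the same SVM machinery separate data that no straight line can (synthetic data from scikit-learn):

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: impossible to separate with a linear hyperplane in 2D.
X, y = make_circles(n_samples=200, factor=0.4, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)
rbf_svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)

print("linear kernel accuracy:", linear_svm.score(X, y))  # close to chance (~0.5)
print("RBF kernel accuracy:   ", rbf_svm.score(X, y))     # close to 1.0
```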