Machine Learning - MT 2016, Lectures 9 & 10: Support Vector Machines. Varun Kanade, University of Oxford. November 7 & 9, 2016
Announcements ◮ Problem Sheet 3 due this Friday by noon ◮ Practical 2 next week ◮ (Optional) Reading a paper
Outline This week we’ll discuss classification using support vector machines. ◮ No clear probabilistic interpretation ◮ Maximum Margin Formulation ◮ Optimisation problem using Hinge Loss ◮ Dual Formulation ◮ Kernel Methods for non-linear classification
Binary Classification. Goal: find a linear separator. Data is linearly separable if there exists a linear separator that classifies all points correctly. Which separator should be picked?
Maximum Margin Principle. Maximise the distance of the closest point from the decision boundary. The points that are closest to the decision boundary are called support vectors.
Geometry Review. Given a hyperplane H ≡ w · x + w_0 = 0 and a point x ∈ R^D, how far is x from H?
Geometry Review ◮ Consider the hyperplane H ≡ w · x + w_0 = 0 ◮ The distance of a point x from H is given by |w · x + w_0| / ‖w‖_2 ◮ All points on one side of the hyperplane satisfy w · x + w_0 > 0 and points on the other side satisfy w · x + w_0 < 0
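As a concrete check of this formula, here is a minimal NumPy sketch (not from the slides); the example hyperplane and point are made up for illustration.

```python
import numpy as np

def distance_to_hyperplane(x, w, w0):
    """Distance of point x from the hyperplane w . x + w0 = 0."""
    return abs(np.dot(w, x) + w0) / np.linalg.norm(w)

# Example: the line x1 + x2 - 1 = 0 in R^2
w, w0 = np.array([1.0, 1.0]), -1.0
print(distance_to_hyperplane(np.array([1.0, 1.0]), w, w0))  # |1 + 1 - 1| / sqrt(2) ≈ 0.707
```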
SVM Formulation: Separable Case. Let D = ⟨(x_i, y_i)⟩_{i=1}^N with y_i ∈ {−1, 1}. Ignoring the max-margin requirement for now, find w, w_0 such that y_i(w · x_i + w_0) ≥ 1 for i = 1, ..., N. This is simply a linear program! For any w, w_0 satisfying the above, the smallest margin is at least 1/‖w‖_2. In order to obtain a maximum-margin solution, we minimise ‖w‖_2^2 subject to the above constraints. This results in a quadratic program!
SVM Formulation: Separable Case
minimise: (1/2)‖w‖_2^2
subject to: y_i(w · x_i + w_0) ≥ 1 for i = 1, ..., N
Here y_i ∈ {−1, 1}. If the data is separable, then we find a classifier with no classification error on the training set.
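Since this is a standard quadratic program, it can be handed to a generic solver. Below is a sketch using the cvxpy library on a made-up, linearly separable toy dataset; the slides do not prescribe any particular solver, so this is purely illustrative.

```python
import cvxpy as cp
import numpy as np

# Toy separable data; labels in {-1, +1}
X = np.array([[2.0, 2.0], [2.0, 3.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
N, D = X.shape

w = cp.Variable(D)
w0 = cp.Variable()

# minimise (1/2)||w||_2^2  subject to  y_i (w . x_i + w0) >= 1
constraints = [cp.multiply(y, X @ w + w0) >= 1]
prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints)
prob.solve()

print(w.value, w0.value)   # maximum-margin separator for the toy data
```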
Non-separable Data ◮ The quadratic program on the previous slide has no feasible solution ◮ Which linear separator should we try to find? ◮ Minimising the number of misclassifications is NP-hard
SVM Formulation: Non-Separable Case
minimise: (1/2)‖w‖_2^2 + C Σ_{i=1}^N ζ_i
subject to: y_i(w · x_i + w_0) ≥ 1 − ζ_i, ζ_i ≥ 0, for i = 1, ..., N
Here y_i ∈ {−1, 1}. The ζ_i are slack variables that allow constraints to be violated; C is the penalty on the slack terms.
SVM Formulation: Loss Function
minimise: (1/2)‖w‖_2^2 + C Σ_{i=1}^N ζ_i (the first term acts as a regularizer, the second as the loss function)
subject to: y_i(w · x_i + w_0) ≥ 1 − ζ_i, ζ_i ≥ 0, for i = 1, ..., N
Here y_i ∈ {−1, 1}. [Figure: hinge loss plotted against y(w · x + w_0).]
Note that for the optimal solution, ζ_i = max{0, 1 − y_i(w · x_i + w_0)}. Thus, SVM can be viewed as minimizing the hinge loss with regularization.
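Because ζ_i = max{0, 1 − y_i(w · x_i + w_0)} at the optimum, the whole objective can be evaluated without explicit slack variables. A minimal NumPy sketch (the function name and toy data are my own, for illustration only):

```python
import numpy as np

def svm_objective(w, w0, X, y, C):
    """Soft-margin SVM objective: 0.5*||w||^2 + C * sum_i max(0, 1 - y_i (w . x_i + w0)).
    Labels y are in {-1, +1}."""
    margins = y * (X @ w + w0)
    hinge = np.maximum(0.0, 1.0 - margins)   # the optimal slack ζ_i
    return 0.5 * np.dot(w, w) + C * hinge.sum()

# Example
X = np.array([[1.0, 2.0], [-1.0, -1.5]])
y = np.array([1.0, -1.0])
print(svm_objective(np.array([0.5, 0.5]), 0.0, X, y, C=1.0))  # 0.25 (both points beyond the margin)
```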
Logistic Regression: Loss Function
Here y_i ∈ {0, 1}, so to compare effectively to SVM, let z_i = 2y_i − 1:
◮ z_i = 1 if y_i = 1
◮ z_i = −1 if y_i = 0
NLL(y_i; w, x_i) = −[ y_i log(1 / (1 + e^{−w · x_i})) + (1 − y_i) log(1 / (1 + e^{w · x_i})) ]
= log(1 + e^{−z_i (w · x_i)}) = log(1 + e^{−(2y_i − 1)(w · x_i)})
[Figure: logistic loss plotted against (2y − 1)(w · x + w_0).]
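For comparison with the hinge-loss sketch above, the logistic (negative log-likelihood) loss in the z_i = 2y_i − 1 form can be computed as follows; this is an illustrative sketch, using np.logaddexp for numerical stability.

```python
import numpy as np

def logistic_loss(w, X, y01):
    """Per-example NLL of logistic regression, written as log(1 + exp(-z_i * (w . x_i)))
    with z_i = 2*y_i - 1.  np.logaddexp(0, a) computes log(1 + e^a) stably."""
    z = 2 * y01 - 1                      # map {0, 1} labels to {-1, +1}
    return np.logaddexp(0.0, -z * (X @ w))
```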
Loss Functions
Outline ◮ Support Vector Machines ◮ Multiclass Classification ◮ Measuring Performance ◮ Dual Formulation of SVM ◮ Kernels
Multiclass Classification with SVMs (and beyond)
It is possible to have a mathematical formulation of the max-margin principle when there are more than two classes. In practice, one of the following approaches is far more common.
One-vs-One:
◮ Train (K choose 2) different classifiers, one for each pair of classes
◮ At test time, choose the most commonly occurring label
One-vs-Rest:
◮ Train K different classifiers, each separating one class from the remaining K − 1
◮ At test time, ties may be broken by the value of w · x_new + w_0
Multiclass Classification with SVMs (and beyond)
One-vs-One:
◮ Trains roughly K^2 / 2 classifiers
◮ Each training procedure only uses on average a 2/K portion of the training data
◮ Resulting learning problems are more likely to be "natural"
One-vs-Rest:
◮ Trains only K classifiers
◮ Each training procedure uses all the training data
◮ Resulting learning problems are less likely to be "natural"
For a more efficient method, read the paper posted on the website: Reducing Multiclass to Binary. E. Allwein, R. Schapire, Y. Singer
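As an illustration (not part of the slides), scikit-learn exposes both reductions as meta-estimators. The sketch below assumes scikit-learn is installed and uses the Iris dataset, which has K = 3 classes.

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

# For K classes: one-vs-one fits K(K-1)/2 binary SVMs, one-vs-rest fits K.
ovo = OneVsOneClassifier(LinearSVC()).fit(X, y)
ovr = OneVsRestClassifier(LinearSVC()).fit(X, y)

print(len(ovo.estimators_), len(ovr.estimators_))  # 3 and 3 here, since K = 3
```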
Outline Support Vector Machines Multiclass Classification Measuring Performance Dual Formulation of SVM Kernels
Measuring Performance
We’ve encountered a few different loss functions used by learning algorithms at training time. For regression problems, it made sense to use the same loss function to measure performance (though this is not always necessary). For classification problems, the natural measure of performance is the classification error, i.e., the number of misclassified datapoints. However, not all mistakes are equally problematic:
◮ Mistakenly blocking a legitimate comment vs failing to mark abuse on online message boards
◮ Failing to detect medical risk is worse than inaccurately predicting a chance of risk
Measuring Performance
For binary classification, we have:

                     Actual: yes        Actual: no
  Prediction: yes    true positive      false positive
  Prediction: no     false negative     true negative

For multi-class classification, it is common to write a confusion matrix, whose entry N_jk counts the points predicted as class j whose actual label is k:

                 Actual label
  Prediction     1       2       · · ·   K
  1              N_11    N_12    · · ·   N_1K
  2              N_21    N_22    · · ·   N_2K
  ...
  K              N_K1    N_K2    · · ·   N_KK
Measuring Performance
For binary classification (see the table above), false positive errors are also called Type I errors; false negative errors are called Type II errors.
◮ True Positive Rate: TPR = TP / (TP + FN), a.k.a. sensitivity or recall
◮ False Positive Rate: FPR = FP / (FP + TN)
◮ Precision: P = TP / (TP + FP)
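These quantities are straightforward to compute from predicted and actual labels; a minimal NumPy sketch (assuming {0, 1}-valued labels and non-empty denominators):

```python
import numpy as np

def binary_rates(y_true, y_pred):
    """Compute TPR (recall), FPR and precision for {0, 1}-valued labels and predictions."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    tpr = tp / (tp + fn)          # sensitivity / recall
    fpr = fp / (fp + tn)
    precision = tp / (tp + fp)
    return tpr, fpr, precision
```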
Receiver Operating Characteristic
[Figure: four classifiers A, B, C, D plotted as points in the (FPR, TPR) plane.]
Which classifier would you pick?
Receiver Operating Characteristic
[Figure: ROC curve, TPR against FPR.]
◮ For many classifiers, it is possible to trade off the FPR vs the TPR
◮ This is often summarised by the area under the curve (AUC)
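For instance (illustrative, assuming scikit-learn is available), the ROC curve and AUC can be computed from a classifier's real-valued scores; the toy labels and scores below are made up.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Scores could be SVM decision values or predicted probabilities.
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

fpr, tpr, thresholds = roc_curve(y_true, scores)
print(roc_auc_score(y_true, scores))  # area under the ROC curve (0.75 for this toy example)
```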
Precision Recall Curves
[Figure: precision plotted against recall (TPR).]
◮ For many classifiers, we can trade off precision vs recall (TPR)
◮ More useful than ROC curves when the negative examples vastly outnumber the positives
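A corresponding sketch for precision-recall curves, under the same assumptions and toy data as the ROC example above:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, average_precision_score

y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

precision, recall, thresholds = precision_recall_curve(y_true, scores)
print(average_precision_score(y_true, scores))  # single-number summary of the PR curve
```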
How to tune classifiers to satisfy these criteria?
◮ Some classifiers, like logistic regression, output the probability of a label being 1, i.e., p(y | x, w)
◮ In generative models, the actual prediction is based on the ratio of conditional probabilities, p(y = 1 | x, θ) / p(y = 0 | x, θ)
◮ We can choose a threshold other than 1/2 (for logistic regression) or 1 (for generative models) to prefer one type of error over the other
◮ For classifiers like SVM, it is harder (though possible) to have a probabilistic interpretation
◮ It is possible to reweight the training data to prefer one type of error over the other
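A sketch of the thresholding idea, assuming scikit-learn and a synthetic dataset; the threshold value 0.2 is arbitrary and only illustrates preferring one type of error over the other.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)
clf = LogisticRegression().fit(X, y)

p = clf.predict_proba(X)[:, 1]         # p(y = 1 | x, w)
threshold = 0.2                        # below 1/2: favour recall, accept more false positives
y_pred = (p >= threshold).astype(int)

# Alternatively, reweight the training data, e.g.
# LogisticRegression(class_weight={0: 1, 1: 5}), to penalise one error type more heavily.
```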
Outline ◮ Support Vector Machines ◮ Multiclass Classification ◮ Measuring Performance ◮ Dual Formulation of SVM ◮ Kernels
SVM Formulation: Non-Separable Case. What if your data looks like this?
SVM Formulation: Constrained Minimisation
minimise: (1/2)‖w‖_2^2 + C Σ_{i=1}^N ζ_i
subject to: y_i(w · x_i + w_0) − (1 − ζ_i) ≥ 0, ζ_i ≥ 0, for i = 1, ..., N
Here y_i ∈ {−1, 1}.
Constrained Optimisation with Inequalities
Primal Form:
minimise F(z)
subject to g_i(z) ≥ 0, i = 1, ..., m and h_j(z) = 0, j = 1, ..., l
Lagrange Function:
Λ(z; α, µ) = F(z) − Σ_{i=1}^m α_i g_i(z) − Σ_{j=1}^l µ_j h_j(z)
For convex problems, i.e., F is convex, all g_i are concave (so each constraint set {z : g_i(z) ≥ 0} is convex) and the h_j are affine, necessary and sufficient conditions for a critical point of Λ to be the minimum of the original constrained optimisation problem are given by the Karush-Kuhn-Tucker (or KKT) conditions. For non-convex problems, they are necessary but not sufficient.
KKT Conditions
Lagrange Function:
Λ(z; α, µ) = F(z) − Σ_{i=1}^m α_i g_i(z) − Σ_{j=1}^l µ_j h_j(z)
For convex problems, the Karush-Kuhn-Tucker (KKT) conditions give necessary and sufficient conditions for a solution (critical point of Λ) to be optimal:
Dual feasibility: α_i ≥ 0 for i = 1, ..., m
Primal feasibility: g_i(z) ≥ 0 for i = 1, ..., m and h_j(z) = 0 for j = 1, ..., l
Complementary slackness: α_i g_i(z) = 0 for i = 1, ..., m
SVM Formulation
minimise: (1/2)‖w‖_2^2 + C Σ_{i=1}^N ζ_i
subject to: y_i(w · x_i + w_0) − (1 − ζ_i) ≥ 0, ζ_i ≥ 0, for i = 1, ..., N
Here y_i ∈ {−1, 1}.
Lagrange Function:
Λ(w, w_0, ζ; α, µ) = (1/2)‖w‖_2^2 + C Σ_{i=1}^N ζ_i − Σ_{i=1}^N α_i (y_i(w · x_i + w_0) − (1 − ζ_i)) − Σ_{i=1}^N µ_i ζ_i
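For reference, a sketch of the standard derivation the dual formulation relies on (not reproduced verbatim from the slides): setting the partial derivatives of Λ with respect to w, w_0 and ζ_i to zero and using dual feasibility gives

```latex
\begin{align*}
\frac{\partial \Lambda}{\partial \mathbf{w}} = \mathbf{w} - \sum_{i=1}^{N} \alpha_i y_i \mathbf{x}_i = 0
  &\;\Longrightarrow\; \mathbf{w} = \sum_{i=1}^{N} \alpha_i y_i \mathbf{x}_i \\
\frac{\partial \Lambda}{\partial w_0} = -\sum_{i=1}^{N} \alpha_i y_i = 0
  &\;\Longrightarrow\; \sum_{i=1}^{N} \alpha_i y_i = 0 \\
\frac{\partial \Lambda}{\partial \zeta_i} = C - \alpha_i - \mu_i = 0,\;\; \alpha_i, \mu_i \ge 0
  &\;\Longrightarrow\; 0 \le \alpha_i \le C
\end{align*}
```

Substituting these back into Λ eliminates w, w_0 and ζ and yields the dual problem:

```latex
\begin{align*}
\text{maximise:}\quad & \sum_{i=1}^{N} \alpha_i
  - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j\, (\mathbf{x}_i \cdot \mathbf{x}_j) \\
\text{subject to:}\quad & \sum_{i=1}^{N} \alpha_i y_i = 0, \qquad 0 \le \alpha_i \le C .
\end{align*}
```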