Introduction to Machine Learning, CMU-10701: Support Vector Machines. Barnabás Póczos & Aarti Singh, Spring 2014
http://barnabas-cmu-10701.appspot.com/
Linear classifiers: which decision boundary is better?
Pick the one with the largest margin! The linear classifier w ∙ x + b = 0 labels points with w ∙ x + b > 0 as Class 1 and points with w ∙ x + b < 0 as Class 2. Data: {(x_i, y_i)}, i = 1, …, n, with labels y_i ∈ {+1, −1}. The margin is the width of the band around the decision boundary that contains no training points.
Scaling. Plus-plane: w ∙ x + b = +1; classifier boundary: w ∙ x + b = 0; minus-plane: w ∙ x + b = −1. Classification rule: classify as +1 if w ∙ x + b ≥ 1, as −1 if w ∙ x + b ≤ −1 (and the universe explodes if −1 < w ∙ x + b < 1). How large is the margin of this classifier? Goal: find the maximum-margin classifier.
Computing the margin width. M = margin width = 2 / ‖w‖. Let x+ and x− be points on the plus- and minus-planes, w ∙ x+ + b = +1 and w ∙ x− + b = −1, with x+ = x− + λw for some λ. Subtracting the two equations gives w ∙ (λw) = 2, so λ = 2 / (w ∙ w), and M = |x+ − x−| = λ‖w‖ = 2 / ‖w‖. Maximizing M is therefore the same as minimizing w ∙ w!
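As a quick numeric check of M = 2 / ‖w‖ (a sketch with made-up numbers, not from the slides):

```python
import numpy as np

w = np.array([3.0, 4.0])           # ||w|| = 5
b = -2.0

# pick a point on the minus-plane w.x + b = -1, then walk along w to the plus-plane
x_minus = np.array([0.2, 0.1])     # w . x_minus + b = 0.6 + 0.4 - 2 = -1
lam = 2.0 / np.dot(w, w)           # from w . (x_minus + lam * w) + b = +1
x_plus = x_minus + lam * w

print(np.dot(w, x_plus) + b)               # 1.0: x_plus sits on the plus-plane
print(np.linalg.norm(x_plus - x_minus))    # 0.4 = 2 / ||w||: the margin width M
```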
Observations. We can assume b = 0 (the bias can be absorbed by appending a constant feature to x). The classification rule (classify as +1 if w ∙ x + b ≥ 1, as −1 if w ∙ x + b ≤ −1, universe explodes if −1 < w ∙ x + b < 1) is the same as requiring y_i (w ∙ x_i + b) ≥ 1 for every training point (x_i, y_i).
The Primal Hard SVM: minimize (1/2) w ∙ w over (w, b) subject to y_i (w ∙ x_i + b) ≥ 1 for i = 1, …, n. This is a QP problem (m-dimensional, where m is the number of features): quadratic cost function, linear constraints.
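A minimal sketch of this QP in code, using cvxpy as the solver (the toy data and the choice of cvxpy are my own, not part of the lecture):

```python
import cvxpy as cp
import numpy as np

# small linearly separable toy set
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()

# quadratic cost, linear constraints: min (1/2) w.w  s.t.  y_i (w . x_i + b) >= 1
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)),
                     [cp.multiply(y, X @ w + b) >= 1])
problem.solve()

print(w.value, b.value)                 # the maximum-margin hyperplane
print(2 / np.linalg.norm(w.value))      # its margin width M = 2 / ||w||
```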
Quadratic Programming: find u minimizing a quadratic objective (1/2) uᵀRu + dᵀu + c, subject to linear inequality constraints A u ≤ a and linear equality constraints B u = b. Efficient algorithms exist for QP; they often solve the dual problem instead of the primal.
Constrained Optimization
Lagrange Multiplier. Move the constraint into the objective function. For the problem min_x f(x) subject to g(x) ≥ 0, the Lagrangian is L(x, α) = f(x) − α g(x) with α ≥ 0. Solve ∂L/∂x = 0. The constraint is active when α > 0.
Lagrange Multiplier – Dual Variables. Solving the dual: max_{α ≥ 0} min_x L(x, α). When α > 0, the constraint is tight (g(x) = 0); when the constraint is slack (g(x) > 0), α = 0.
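A concrete worked example (my own, not on the slide): minimize x² subject to x ≥ 1. Lagrangian: L(x, α) = x² − α(x − 1) with α ≥ 0. Solving ∂L/∂x = 2x − α = 0 gives x = α/2, so the dual function is g(α) = min_x L(x, α) = −α²/4 + α. Maximizing over α ≥ 0 gives α* = 2 > 0, hence x* = α*/2 = 1: the multiplier is positive and the constraint x ≥ 1 is indeed tight at the solution.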
From Primal to Dual. Primal problem: min_{w,b} (1/2) w ∙ w subject to y_i (w ∙ x_i + b) ≥ 1 for all i. Lagrange function: L(w, b, α) = (1/2) w ∙ w − Σ_i α_i [ y_i (w ∙ x_i + b) − 1 ], with α_i ≥ 0.
The Lagrange Problem: min_{w,b} max_{α ≥ 0} L(w, b, α). The inner maximization is +∞ whenever some constraint y_i (w ∙ x_i + b) ≥ 1 is violated and equals (1/2) w ∙ w otherwise, so the Lagrange problem has the same solution as the primal.
The Dual Problem: max_{α ≥ 0} min_{w,b} L(w, b, α). Setting ∂L/∂w = 0 gives w = Σ_i α_i y_i x_i, and ∂L/∂b = 0 gives Σ_i α_i y_i = 0; substituting these back into L eliminates w and b.
The Dual Hard SVM: maximize Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j y_i y_j (x_i ∙ x_j) subject to α_i ≥ 0 and Σ_i α_i y_i = 0. This is Quadratic Programming (n-dimensional, one variable per training point). Lemma: the optimal weight vector is w = Σ_i α_i y_i x_i.
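The same toy problem can be solved in its dual form; the sketch below (again cvxpy, with data and solver chosen by me) also recovers w and b via the lemma:

```python
import cvxpy as cp
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

alpha = cp.Variable(n)
# maximize sum_i a_i - 1/2 sum_ij a_i a_j y_i y_j (x_i . x_j);
# the double sum equals || sum_i a_i y_i x_i ||^2, which keeps the objective concave
objective = cp.Maximize(cp.sum(alpha)
                        - 0.5 * cp.sum_squares(X.T @ cp.multiply(alpha, y)))
problem = cp.Problem(objective, [alpha >= 0, y @ alpha == 0])
problem.solve()

a = alpha.value
w = (a * y) @ X                 # Lemma: w = sum_i a_i y_i x_i
j = int(np.argmax(a))           # any support vector (a_j > 0)
b = y[j] - X[j] @ w             # from y_j (w . x_j + b) = 1
print(a.round(3), w, b)
```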
The Problem with Hard SVM: it assumes samples are linearly separable... What can we do if the data is not linearly separable?
Hard 1-dimensional Dataset. If the data set is not linearly separable, then by adding new features (mapping the data into a larger feature space) the data might become linearly separable.
Hard 1-dimensional Dataset. Make up a new feature! Sort of... computed from the original feature(s): z_k = (x_k, x_k²). Separable! MAGIC! Now drop this "augmented" data into our linear SVM.
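A tiny sketch of this 1-D trick (the data below is my own toy example in the spirit of the slide): x alone cannot be split by a threshold, but z_k = (x_k, x_k²) can be split by a line.

```python
import numpy as np

x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
y = np.array([ 1,    1,   -1,  -1,  -1,   1,   1 ])   # inner points -1, outer points +1

Z = np.c_[x, x**2]                 # the "augmented" data z_k = (x_k, x_k^2)

# in the (x, x^2) plane the horizontal line x^2 = 1.5 separates the two classes
w, b = np.array([0.0, 1.0]), -1.5
print(np.sign(Z @ w + b) == y)     # all True: separable after the feature map
```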
Feature mapping. In general, n points (in general position) in an (n−1)-dimensional space are always linearly separable by a hyperplane! ⇒ it is good to map the data to high-dimensional spaces. Having n training points, is it always good enough to map the data into a feature space of dimension n−1? Nope... we have to think about the test data as well, even if we don't know how many test points we will get or what they are... We might want to map our data to a huge (∞-dimensional) feature space. Overfitting? Generalization error? ... We don't care about these for now...
How to do feature mapping? Use features of features of features of features....
The Problem with Hard SVM: it assumes samples are linearly separable... Solutions: 1. Use a feature transformation into a larger space ⇒ the training samples become linearly separable in the feature space ⇒ Hard SVM can be applied ⇒ but this may overfit... 2. Use the soft-margin SVM instead of the Hard SVM. We will discuss this now.
Hard SVM. The Hard SVM problem can be rewritten as an unconstrained problem: min_{w,b} (1/2) w ∙ w + ∞ ∙ Σ_i ℓ(y_i (w ∙ x_i + b)), where ℓ(z) = 1 if z < 1 (misclassification or margin violation) and ℓ(z) = 0 if z ≥ 1 (correct classification with margin).
From Hard to Soft constraints. Instead of using hard constraints (insisting that the points are linearly separable), we can try to solve the soft version: min_{w,b} (1/2) w ∙ w + C Σ_i ℓ(y_i (w ∙ x_i + b)). Your loss is only 1 instead of ∞ if you misclassify an instance, where ℓ(z) = 1 for misclassification (z < 1) and ℓ(z) = 0 for correct classification (z ≥ 1).
Problems with the ℓ_0-1 loss: it is not convex in y f(x) ⇒ it is not convex in w, either... and we only like convex functions... Let us approximate it with convex functions!
Approximation of the Heaviside step function (figure taken from R. Herbrich).
Approximations of the ℓ_0-1 loss: • piecewise linear approximation (the hinge loss, ℓ_lin) • quadratic approximation (ℓ_quad)
The hinge loss approximation of ℓ_0-1: ℓ_hinge(z) = max(0, 1 − z), where z = y (w ∙ x + b). The hinge loss upper bounds the 0-1 loss.
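A small numeric sketch (my own toy values of z = y f(x)) showing the hinge loss sitting above the 0-1 misclassification loss:

```python
import numpy as np

def loss_01(z):
    return (z <= 0).astype(float)       # 1 if misclassified, 0 otherwise

def loss_hinge(z):
    return np.maximum(0.0, 1.0 - z)     # max(0, 1 - y f(x))

z = np.array([-2.0, -1.0, 0.0, 0.5, 1.0, 2.0])
print(loss_01(z))       # [1.  1.  1.  0.  0.  0. ]
print(loss_hinge(z))    # [3.  2.  1.  0.5 0.  0. ]  >= the 0-1 loss everywhere
```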
Geometric interpretation: Slack Variables. The margin is still M = 2 / ‖w‖; a point that violates the margin gets a slack variable ξ_i = max(0, 1 − y_i (w ∙ x_i + b)) ≥ 0 measuring how far it lies on the wrong side of its margin plane.
The Primal Soft SVM problem: min_{w,b} (1/2) w ∙ w + C Σ_i ℓ_hinge(y_i (w ∙ x_i + b)), where ℓ_hinge(z) = max(0, 1 − z). Equivalently, it can be written with explicit slack variables.
The Primal Soft SVM problem. Equivalently, we can use this form, too: min_{w,b,ξ} (1/2) w ∙ w + C Σ_i ξ_i subject to y_i (w ∙ x_i + b) ≥ 1 − ξ_i and ξ_i ≥ 0 for all i (as sketched below). What is the dual form of the primal soft SVM?
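A minimal sketch of this slack-variable form as a QP (cvxpy again; the overlapping toy data and C = 1 are my own choices):

```python
import cvxpy as cp
import numpy as np

X = np.array([[ 2.0,  2.0], [ 3.0, 3.0], [-1.5, -1.0],
              [-1.0, -1.0], [-2.0, 0.0], [ 1.0,  1.5]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])   # one "outlier" per class: not separable
n, C = len(y), 1.0

w, b, xi = cp.Variable(2), cp.Variable(), cp.Variable(n)
problem = cp.Problem(
    cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi)),
    [cp.multiply(y, X @ w + b) >= 1 - xi, xi >= 0])
problem.solve()

print(w.value, b.value)
print(xi.value.round(3))    # nonzero slacks mark points inside the margin or misclassified
```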
The Dual Soft SVM (using hinge loss): maximize Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j y_i y_j (x_i ∙ x_j) subject to 0 ≤ α_i ≤ C and Σ_i α_i y_i = 0, where w = Σ_i α_i y_i x_i. The only change from the hard-margin dual is the upper bound C on each α_i (the "box constraint").
SVM classification in the dual space. Solve the dual problem for α; then w = Σ_i α_i y_i x_i, b can be recovered from any margin support vector (y_j (w ∙ x_j + b) = 1 when 0 < α_j < C), and a new point x is classified as sign( Σ_i α_i y_i (x_i ∙ x) + b ).
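In code the dual-space classifier is just a weighted sum of dot products with the training points (alpha and b are assumed to come from a dual solve such as the cvxpy sketch earlier):

```python
import numpy as np

def predict_dual(x_new, X, y, alpha, b):
    # sign( sum_i alpha_i y_i (x_i . x_new) + b ); only the support vectors
    # (alpha_i > 0) actually contribute to the sum
    return np.sign((alpha * y) @ (X @ x_new) + b)
```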
Why is it called Support Vector Machine? The KKT conditions include complementary slackness: α_i [ y_i (w ∙ x_i + b) − 1 ] = 0 for every i, so α_i can be non-zero only for points with y_i (w ∙ x_i + b) = 1, i.e. the points sitting on the margin: the support vectors.
Why is it called Support Vector Machine? The regions w ∙ x + b > 0 and w ∙ x + b < 0 are separated by a linear hyperplane defined by the "support vectors" (Hard SVM case). Moving the other points a little doesn't affect the decision boundary ⇒ we only need to store the support vectors to predict labels of new points.
Support vectors in Soft SVM
Support vectors in Soft SVM: margin support vectors (on the margin, 0 < α_i < C) and non-margin support vectors (inside the margin or misclassified, α_i = C).
Dual SVM Interpretation: Sparsity. Only a few α_j's can be non-zero: those where the constraint is tight, i.e. (⟨w, x_j⟩ + b) y_j = 1. For all other training points α_j = 0.
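The sparsity is easy to see with an off-the-shelf solver; a sketch with scikit-learn's SVC (the random blobs and C = 1 are arbitrary choices of mine):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.r_[rng.normal(loc=[ 2,  2], scale=0.5, size=(50, 2)),
          rng.normal(loc=[-2, -2], scale=0.5, size=(50, 2))]
y = np.r_[np.ones(50), -np.ones(50)]

clf = SVC(kernel="linear", C=1.0).fit(X, y)
print(clf.n_support_)    # number of support vectors per class: only a handful out of 100
print(clf.support_)      # indices j with alpha_j > 0, i.e. where the constraint is tight
print(clf.dual_coef_)    # the corresponding nonzero alpha_j * y_j values
```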
What about multiple classes?
One against all. Learn 3 classifiers separately, class k vs. rest, giving (w_k, b_k) for k = 1, 2, 3, and predict y = argmax_k (w_k ∙ x + b_k). But the w_k's may not be on the same scale: note that (a w) ∙ x + (a b) is also a solution for any a > 0, so the scores of independently trained classifiers are not directly comparable (see the sketch below).
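A sketch of the one-against-all prediction rule (W and B are placeholders for the K already-trained weight vectors and biases; the names are mine):

```python
import numpy as np

def one_vs_all_predict(x, W, B):
    """W: (K, d) weight vectors, B: (K,) biases of the class-k-vs-rest classifiers."""
    scores = W @ x + B              # one score per class
    return int(np.argmax(scores))   # caveat from the slide: scores may not share a scale
```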
Learn 1 classifier: Multi-class SVM. Simultaneously learn 3 sets of weights, with constraints enforcing a margin, i.e. a gap between the score of the correct class and that of the nearest other class: w^(y_i) ∙ x_i + b^(y_i) ≥ w^(k) ∙ x_i + b^(k) + 1 for all k ≠ y_i. Predict y = argmax_k ( w^(k) ∙ x + b^(k) ).
Learn 1 classifier: Multi-class SVM. Simultaneously learn the 3 sets of weights and predict y = argmax_k ( w^(k) ∙ x + b^(k) ). Because of the joint optimization, the w^(k)'s are on the same scale.
What you need to know • Maximizing the margin • Derivation of the SVM formulation • Slack variables and the hinge loss • Relationship between SVMs and logistic regression (0/1 loss, hinge loss, log loss) • Tackling multiple classes: one against all, multiclass SVMs
SVM vs. Logistic Regression. SVM: hinge loss ℓ(y f(x)) = max(0, 1 − y f(x)). Logistic regression: log loss (negative log conditional likelihood) ℓ(y f(x)) = log(1 + exp(−y f(x))). The figure compares the log loss, the hinge loss, and the 0-1 loss as functions of y f(x).
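A quick numeric comparison of the three losses over a few margin values z = y f(x) (illustrative values only):

```python
import numpy as np

z = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
loss_01    = (z <= 0).astype(float)       # 0-1 loss
loss_hinge = np.maximum(0.0, 1.0 - z)     # SVM: hinge loss
loss_log   = np.log(1.0 + np.exp(-z))     # logistic regression: log loss (natural log;
                                          # dividing by log(2) makes it >= the 0-1 loss)
print(loss_01)      # [1.   1.   1.   0.   0.  ]
print(loss_hinge)   # [3.   2.   1.   0.   0.  ]
print(loss_log)     # [2.13 1.31 0.69 0.31 0.13]
```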
SVM for Regression
SVM classification in the dual space. "Without b": f(x) = sign( Σ_i α_i y_i (x_i ∙ x) ). "With b": f(x) = sign( Σ_i α_i y_i (x_i ∙ x) + b ).
So why solve the dual SVM? • Some quadratic programming algorithms can solve the dual faster than the primal, especially in high dimensions (m >> n). • But, more importantly, the "kernel trick"!!!
What if the data is not linearly separable? Use features of features of features of features.... For example, polynomial features: Φ(x) = (x_1, x_2, x_3, …, x_1², x_2², x_3², …, x_1³, x_2³, x_3³, …).
Dot Product of Polynomials. For d = 1, Φ(x) = (x_1, x_2) and Φ(u) ∙ Φ(v) = u ∙ v. For d = 2, Φ(x) = (x_1², x_2², √2 x_1 x_2) and Φ(u) ∙ Φ(v) = (u ∙ v)².
Dot Product of Polynomials, general degree d: Φ(u) ∙ Φ(v) = (u ∙ v)^d.
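A check (my own small example, for 2-D inputs) that the kernel (u ∙ v)² equals the dot product of the explicit degree-2 feature maps:

```python
import numpy as np

def phi2(x):
    # explicit degree-2 feature map for 2-D x: (x1^2, x2^2, sqrt(2) x1 x2)
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

u = np.array([1.0, 2.0])
v = np.array([3.0, -1.0])

print(phi2(u) @ phi2(v))   # 1.0
print((u @ v) ** 2)        # 1.0: same number, without ever forming Phi explicitly
```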
Higher Order Polynomials. The feature space becomes really large very quickly! With m input features and a polynomial of degree d, the number of monomial terms grows as C(m + d − 1, d): for d = 6 and m = 100 that is about 1.6 billion terms.
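The count can be reproduced directly (the binomial coefficient is the standard count of degree-d monomials in m variables):

```python
from math import comb

m, d = 100, 6
print(comb(m + d - 1, d))   # 1609344100, i.e. about 1.6 billion terms
```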
The dual formulation only depends on dot products, not on w! Φ(x) lives in a high-dimensional feature space, but we never need it explicitly as long as we can compute the dot product K(x, x') = Φ(x) ∙ Φ(x') fast using some kernel K.
Common Kernels (see the sketch below) • polynomials of degree d: K(u, v) = (u ∙ v)^d • polynomials of degree up to d: K(u, v) = (u ∙ v + 1)^d • Gaussian/radial kernels (polynomials of all orders; recall the series expansion of exp): K(u, v) = exp(−‖u − v‖² / (2σ²)) • sigmoid: K(u, v) = tanh(η u ∙ v + ν). Which functions can be used as kernels? ... and why are they called kernels?
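The listed kernels written out as plain functions (a sketch; the parameter names sigma, eta, nu are my own):

```python
import numpy as np

def k_poly_exact_d(u, v, d):        # polynomials of degree d
    return (u @ v) ** d

def k_poly_up_to_d(u, v, d):        # polynomials of degree up to d
    return (u @ v + 1.0) ** d

def k_gaussian(u, v, sigma=1.0):    # Gaussian / radial basis function
    return np.exp(-np.sum((u - v) ** 2) / (2.0 * sigma ** 2))

def k_sigmoid(u, v, eta=1.0, nu=0.0):
    return np.tanh(eta * (u @ v) + nu)
```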
Overfitting • Huge feature space with kernels: what about overfitting? • Maximizing the margin leads to a sparse set of support vectors. • Some interesting theory says that SVMs search for simple hypotheses with a large margin. • Often robust to overfitting.
What about classification time? • For a new input x, if we needed to represent Φ(x) explicitly, we would be in trouble! • Recall the classifier: sign(w ∙ Φ(x) + b). • Using kernels we are cool: sign( Σ_i α_i y_i K(x_i, x) + b ), as sketched below.
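Test-time classification with a kernel, as a sketch (the support-vector arrays and b are assumed to come from the dual training step; any of the kernel functions above can be plugged in):

```python
import numpy as np

def predict_kernel(x_new, X_sv, y_sv, alpha_sv, b, kernel):
    # sign( sum over support vectors of alpha_i y_i K(x_i, x_new) + b );
    # Phi(x_new) is never computed explicitly
    k_vals = np.array([kernel(x_i, x_new) for x_i in X_sv])
    return np.sign((alpha_sv * y_sv) @ k_vals + b)
```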
A few results
Steve Gunn’s SVM toolbox. Results, Iris 2vs13, linear kernel.
Results, Iris 1vs23, 2nd-order kernel. The 2nd-order decision boundary is a conic section (parabola, hyperbola, or ellipse).
Results, Iris 1vs23, 2nd-order kernel.
Results, Iris 1vs23, 13th-order kernel.
Results, Iris 1vs23, RBF kernel.
Chessboard dataset. Results, Chessboard, polynomial kernel.
Results, Chessboard, RBF kernel.