

  1. Introduction to Machine Learning, CMU-10701: Support Vector Machines. Barnabás Póczos & Aarti Singh, Spring 2014.

  2. http://barnabas-cmu-10701.appspot.com/

  3. Linear classifiers: which line is better? Which decision boundary is better?

  4. Pick the one with the largest margin! Class 1: w · x + b > 0; Class 2: w · x + b < 0. The margin is the distance from the decision boundary to the nearest data point.

  5. Scaling. Plus-plane: w · x + b = +1; classifier boundary: w · x + b = 0; minus-plane: w · x + b = –1. Classification rule: classify as +1 if w · x + b ≥ 1, as –1 if w · x + b ≤ –1, and "the universe explodes" if –1 < w · x + b < 1. How large is the margin of this classifier? Goal: find the maximum-margin classifier.

  6. Computing the margin width. Let x⁺ and x⁻ be points on the plus-plane and minus-plane with x⁺ = x⁻ + λw, i.e. w · x⁺ + b = +1 and w · x⁻ + b = –1. Then M = margin width = |x⁺ – x⁻| = 2 / √(w · w). Maximizing M is equivalent to minimizing w · w!
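
A compact version of the derivation the slide sketches, written out in LaTeX (standard SVM geometry, not verbatim from the deck):

```latex
% Margin width from the plus- and minus-plane constraints
\begin{aligned}
w \cdot x^{+} + b &= +1, \qquad w \cdot x^{-} + b = -1, \qquad x^{+} = x^{-} + \lambda w \\
\Rightarrow\; w \cdot (x^{-} + \lambda w) + b = +1
  \;&\Rightarrow\; \lambda\,(w \cdot w) = 2
  \;\Rightarrow\; \lambda = \frac{2}{w \cdot w} \\
M = \lVert x^{+} - x^{-} \rVert = \lambda \lVert w \rVert
  &= \frac{2}{w \cdot w}\,\sqrt{w \cdot w}
  = \frac{2}{\sqrt{w \cdot w}}
\end{aligned}
```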

  7. Observations. We can assume b = 0. Classify as +1 if w · x + b ≥ 1, as –1 if w · x + b ≤ –1, and "the universe explodes" if –1 < w · x + b < 1. This is the same as requiring y(w · x + b) ≥ 1 for every training point.

  8. The Primal Hard SVM: minimize ½ w · w subject to yᵢ(w · xᵢ + b) ≥ 1 for all i. This is an m-dimensional QP problem (quadratic cost function, linear constraints).
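
A minimal sketch (not from the slides) of this primal hard-margin QP using the cvxpy modelling library; the toy data X, y are made-up placeholders:

```python
import cvxpy as cp
import numpy as np

# Toy, linearly separable data: n samples, m features, labels in {-1, +1}.
X = np.array([[2.0, 2.0], [2.0, 3.0], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n, m = X.shape

w = cp.Variable(m)
b = cp.Variable()

# minimize (1/2) w.w  subject to  y_i (w.x_i + b) >= 1 for all i
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print(w.value, b.value)  # the maximum-margin hyperplane for this toy data
```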

  9. Quadratic Programming: find the minimizer of a quadratic cost subject to linear inequality and equality constraints (the general form is sketched below). Efficient algorithms exist for QP; they often solve the dual problem instead of the primal.
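
The slide's formulas were lost in extraction; a generic QP in a standard form (an assumed reconstruction) looks like this:

```latex
\begin{aligned}
\min_{u \in \mathbb{R}^{m}} \quad & c + d^{\top} u + \tfrac{1}{2}\, u^{\top} R\, u \\
\text{subject to} \quad & A u \le b \quad \text{(inequality constraints)} \\
                        & E u = f \quad \text{(equality constraints)}
\end{aligned}
```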

  10. Constrained Optimization

  11. Lagrange Multipliers: move the constraint into the objective function. Form the Lagrangian and solve the resulting problem; the constraint is active when α > 0.

  12. Lagrange Multipliers – Dual Variables. Solving for the dual variables: when α > 0, the constraint is tight.
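
In symbols, a standard single-constraint version of what these two slides describe (minimize f(w) subject to g(w) ≤ 0):

```latex
\begin{aligned}
L(w, \alpha) &= f(w) + \alpha\, g(w), \qquad \alpha \ge 0 \\
\min_{w}\; \max_{\alpha \ge 0}\; L(w, \alpha)
  \quad &\Longrightarrow \quad \nabla_{w} L = 0, \qquad \alpha\, g(w) = 0 \\
\alpha > 0 \;&\Longrightarrow\; g(w) = 0 \quad \text{(the constraint is active / tight)}
\end{aligned}
```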

  13. From Primal to Dual: the primal problem and its Lagrange function.

  14. The Lagrange Problem (proof continued).

  15. The Dual Problem (proof continued).

  16. The Dual Hard SVM: again a quadratic programming problem, now n-dimensional (one dual variable per training sample), together with a lemma used in the derivation.
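
For reference, the standard dual hard-margin SVM the slide refers to (a reconstruction, since the formulas were lost in extraction):

```latex
\begin{aligned}
\max_{\alpha \in \mathbb{R}^{n}} \quad & \sum_{i=1}^{n} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j\, y_i y_j\, (x_i \cdot x_j) \\
\text{subject to} \quad & \alpha_i \ge 0 \;\; \forall i, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0, \\
\text{with} \quad & w = \sum_{i=1}^{n} \alpha_i y_i x_i \quad \text{(recovering } w \text{ from the dual solution)}
\end{aligned}
```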

  17. The Problem with Hard SVM: it assumes the samples are linearly separable... What can we do if the data are not linearly separable?

  18. A hard 1-dimensional dataset. If the dataset is not linearly separable, then adding new features (mapping the data into a larger feature space) may make it linearly separable.

  19. A hard 1-dimensional dataset: make up a new feature! Sort of... it is computed from the original feature(s): z_k = (x_k, x_k²). Separable! MAGIC! Now drop this "augmented" data into our linear SVM.
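
A toy illustration of this trick (not from the slides; the particular dataset is made up), assuming scikit-learn is available:

```python
import numpy as np
from sklearn.svm import SVC

# A 1-D dataset that is not linearly separable on the line: +1 outside [-1, 1], -1 inside.
x = np.array([-3.0, -2.0, -1.5, -0.5, 0.0, 0.5, 1.5, 2.0, 3.0])
y = np.where(np.abs(x) > 1.0, 1, -1)

# The augmented feature z_k = (x_k, x_k^2) makes the classes linearly separable.
Z = np.column_stack([x, x ** 2])

clf = SVC(kernel="linear", C=1e6)   # very large C approximates a hard margin
clf.fit(Z, y)
print(clf.predict(Z))               # should reproduce y on this separable toy set
```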

  20. Feature mapping. In general, n generic points in an (n−1)-dimensional space are always linearly separable by a hyperplane! ⇒ It is good to map the data to high-dimensional spaces. Having n training points, is it always good enough to map the data into a feature space of dimension n−1? Nope... we have to think about the test data as well, even if we don't know how many test points there will be or what they are. So we might want to map our data into a huge (∞-dimensional) feature space. Overfitting? Generalization error? We don't care for now...

  21. How to do feature mapping? Use features of features of features of features... all the way to ∞.

  22. The Problem with Hard SVM: it assumes the samples are linearly separable. Solutions: 1. Use a feature transformation into a larger space ⇒ the training samples become linearly separable in the feature space ⇒ Hard SVM can be applied ☺, but this risks overfitting. 2. Use a soft-margin SVM instead of the hard-margin SVM; we will discuss this now.

  23. Hard SVM. The Hard SVM problem can be rewritten as a penalized objective, where the per-sample penalty is ∞ for a misclassification (violated constraint) and 0 for a correct classification.

  24. From hard to soft constraints. Instead of using hard constraints (which require the points to be linearly separable), we can try to solve a soft version of the problem: your loss is only 1 instead of ∞ if you misclassify an instance, and 0 for a correct classification (the 0-1 loss).

  25. Problems with the ℓ0-1 loss: it is not convex in y f(x), so it is not convex in w either... and we like only convex functions... Let us approximate it with convex functions!

  26. Approximation of the Heaviside step function (picture taken from R. Herbrich).

  27. Approximations of the ℓ0-1 loss: piecewise linear approximation (the hinge loss, ℓlin) and quadratic approximation (ℓquad).

  28. The hinge loss approximation of ℓ0-1: ℓhinge(y f(x)) = max(0, 1 − y f(x)), where f(x) = w · x + b. The hinge loss upper bounds the 0-1 loss.
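
The standard definitions behind the slide's upper-bound claim, written in terms of the margin value z = y f(x):

```latex
\ell_{0\text{-}1}(z) = \mathbb{1}[z \le 0]
\;\le\;
\ell_{\mathrm{hinge}}(z) = \max(0,\, 1 - z)
\qquad \text{for all } z \in \mathbb{R}.
```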

  29. Geometric interpretation: slack variables. The margin is still M = 2 / √(w · w); points that violate the margin get slack values ξᵢ (e.g. ξ1, ξ2, ξ7 in the figure).

  30. The Primal Soft SVM problem, and an equivalent reformulation.

  31. The Primal Soft SVM problem, equivalently: we can use this form, too. What is the dual form of the primal soft SVM?
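
The standard slack-variable form of the primal soft SVM (a reconstruction, since the slide's formulas were lost):

```latex
\begin{aligned}
\min_{w,\, b,\, \xi} \quad & \frac{1}{2}\, w \cdot w \;+\; C \sum_{i=1}^{n} \xi_i \\
\text{subject to} \quad & y_i\,(w \cdot x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0 \quad \forall i,
\end{aligned}
```

which is equivalent to minimizing ½ w · w + C Σᵢ max(0, 1 − yᵢ(w · xᵢ + b)), the hinge-loss objective.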

  32. The Dual Soft SVM (using hinge loss).

  33. The Dual Soft SVM (using hinge loss), continued.

  34. The Dual Soft SVM (using hinge loss), continued.
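
The standard dual of the soft-margin (hinge-loss) SVM, for reference; it is identical to the hard-margin dual except that the box constraint 0 ≤ αᵢ ≤ C replaces αᵢ ≥ 0:

```latex
\begin{aligned}
\max_{\alpha \in \mathbb{R}^{n}} \quad & \sum_{i=1}^{n} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j\, y_i y_j\, (x_i \cdot x_j) \\
\text{subject to} \quad & 0 \le \alpha_i \le C \;\; \forall i, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0
\end{aligned}
```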

  35. SVM classification in the dual space: solve the dual problem; the resulting decision rule is sketched below.
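
The standard dual-space decision rule (not copied from the slide, whose formula was lost):

```latex
\hat{y}(x) = \operatorname{sign}\!\left( \sum_{i=1}^{n} \alpha_i\, y_i\, (x_i \cdot x) + b \right),
\qquad \text{where } \alpha_i > 0 \text{ only for the support vectors.}
```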

  36. Why is it called Support Vector Machine? KKT conditions

  37. Why is it called a Support Vector Machine? In the hard SVM, the separating hyperplane (w · x + b > 0 on one side, w · x + b < 0 on the other) is defined by the "support vectors". Moving the other points a little doesn't affect the decision boundary, and we only need to store the support vectors to predict the labels of new points.

  38. Support vectors in Soft SVM

  39. Support vectors in Soft SVM: margin support vectors and non-margin support vectors.

  40. Dual SVM interpretation: sparsity. Only a few of the αⱼ can be non-zero: αⱼ > 0 only where the constraint is tight, i.e. (⟨w, xⱼ⟩ + b) yⱼ = 1; for all other points αⱼ = 0.

  41. What about multiple classes?

  42. One against all. Learn 3 classifiers separately, class k vs. the rest, giving (w_k, b_k) for k = 1, 2, 3, and predict y = arg max_k (w_k · x + b_k). But the w_k may not be on the same scale. Note: (a w) · x + (a b) is also a solution.
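
A hedged sketch of one-against-all with linear SVMs (not the slides' code; the helper names and scikit-learn usage are my own choices, and the scheme is written for an arbitrary number of classes rather than exactly 3):

```python
import numpy as np
from sklearn.svm import SVC

def train_one_vs_rest(X, y, classes, C=1.0):
    """Train one linear SVM per class: class k vs. the rest."""
    models = {}
    for k in classes:
        y_bin = np.where(y == k, 1, -1)
        clf = SVC(kernel="linear", C=C)
        clf.fit(X, y_bin)
        models[k] = clf
    return models

def predict_one_vs_rest(models, X):
    """Predict y = argmax_k (w_k . x + b_k) via each model's decision function."""
    classes = list(models.keys())
    scores = np.column_stack([models[k].decision_function(X) for k in classes])
    return np.array(classes)[np.argmax(scores, axis=1)]
```

As the slide notes, the scores of independently trained classifiers may not share a common scale, which is the motivation for the joint multi-class SVM on the next slides.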

  43. Learn 1 classifier: multi-class SVM. Simultaneously learn 3 sets of weights, with constraints enforcing a margin (a gap between the correct class and the nearest other class), and predict y = arg max_k (w^(k) · x + b^(k)).

  44. Learn 1 classifier: multi-class SVM. Simultaneously learn 3 sets of weights and predict y = arg max_k (w^(k) · x + b^(k)). Because of the joint optimization, the w_k have the same scale.

  45. What you need to know: • maximizing the margin • derivation of the SVM formulation • slack variables and hinge loss • the relationship between SVMs and logistic regression (0/1 loss, hinge loss, log loss) • tackling multiple classes (one against all, multiclass SVMs).

  46. SVM vs. Logistic Regression. SVM: hinge loss. Logistic regression: log loss (from the log conditional likelihood). (Plot: log loss, hinge loss, and 0-1 loss as functions of y f(x).)
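
The standard forms of the three losses compared on the slide, as functions of z = y f(x) (up to the constant scaling sometimes applied to the log loss):

```latex
\ell_{0\text{-}1}(z) = \mathbb{1}[z \le 0], \qquad
\ell_{\mathrm{hinge}}(z) = \max(0,\, 1 - z), \qquad
\ell_{\mathrm{log}}(z) = \log\!\left(1 + e^{-z}\right).
```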

  47. SVM for Regression

  48. SVM classification in the dual space: the cases "without b" and "with b".

  49. So why solve the dual SVM? Some quadratic programming algorithms can solve the dual faster than the primal, especially in high dimensions (m >> n). But, more importantly, the "kernel trick"!!!

  50. What if the data is not linearly separable? Use features of features of features of features.... For example, polynomials: Φ(x) = (x₁³, x₂³, x₃³, x₁²x₂x₃, …).

  51. Dot product of polynomials: the cases d = 1 and d = 2.
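
A standard worked example of what this slide computes, assuming 2-dimensional inputs (the slides' exact feature ordering may differ):

```latex
\begin{aligned}
d=1:\quad & \Phi(x) = (x_1,\, x_2), &
  \Phi(x) \cdot \Phi(x') &= x \cdot x' \\
d=2:\quad & \Phi(x) = \big(x_1^2,\; \sqrt{2}\,x_1 x_2,\; x_2^2\big), &
  \Phi(x) \cdot \Phi(x') &= (x_1 x_1')^2 + 2\, x_1 x_2 x_1' x_2' + (x_2 x_2')^2 = (x \cdot x')^2
\end{aligned}
```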

  52. Dot product of polynomials: the general degree d.

  53. Higher-order polynomials. The feature space becomes really large very quickly! With m input features and a polynomial of degree d, the number of terms grows fast: for d = 6 and m = 100, about 1.6 billion terms.
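
A quick check of the slide's count, assuming it refers to the number of degree-d monomials in m variables, which is C(m + d − 1, d):

```python
import math

m, d = 100, 6
# Number of distinct monomials of degree exactly d in m variables.
print(math.comb(m + d - 1, d))   # 1609344100, i.e. about 1.6 billion
```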

  54. The dual formulation only depends on dot products, not on w! Φ(x) lives in a high-dimensional feature space, but we never need it explicitly as long as we can compute the dot product Φ(x) · Φ(x') fast using some kernel K.

  55. Finally: the kernel trick! Never represent the features explicitly; compute the dot products in closed form. This gives constant-time high-dimensional dot products for many classes of features.

  56. Common kernels: polynomials of degree d, polynomials of degree up to d, Gaussian/radial kernels (polynomials of all orders; recall the series expansion), and the sigmoid kernel. Which functions can be used as kernels, and why are they called kernels?
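
A hedged NumPy sketch of these common kernels (not from the slides; the parameter names c, gamma, kappa, theta are my own conventions):

```python
import numpy as np

def poly_kernel(x, z, d, c=0.0):
    """Polynomial kernel: c = 0 gives degree exactly d, c = 1 gives degrees up to d."""
    return (x @ z + c) ** d

def rbf_kernel(x, z, gamma=1.0):
    """Gaussian/RBF kernel; its series expansion contains polynomials of all orders."""
    diff = x - z
    return np.exp(-gamma * (diff @ diff))

def sigmoid_kernel(x, z, kappa=1.0, theta=0.0):
    """Sigmoid kernel (not positive semidefinite for all parameter settings)."""
    return np.tanh(kappa * (x @ z) + theta)
```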

  57. Overfitting. With kernels the feature space is huge, so what about overfitting? Maximizing the margin leads to a sparse set of support vectors; some interesting theory says that SVMs search for a simple hypothesis with a large margin, so they are often robust to overfitting.

  58. What about classification time? For a new input x, if we need to represent Φ(x), we are in trouble! Recall the classifier: sign(w · Φ(x) + b). Using kernels we are cool!

  59. Kernels in logistic regression: define the weights in terms of the features, then derive a simple gradient descent rule on the αᵢ.
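
The kernelized representation the slide refers to, in its standard form (the slide's specific gradient update was lost and is not reproduced here):

```latex
w = \sum_{i=1}^{n} \alpha_i\, \Phi(x_i)
\quad \Longrightarrow \quad
w \cdot \Phi(x) + b = \sum_{i=1}^{n} \alpha_i\, K(x_i, x) + b,
```

so the log loss can be minimized by gradient descent directly on the coefficients αᵢ, never forming Φ(x) explicitly.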

  60. A few results.

  61. Steve Gunn's SVM toolbox: results, Iris 2vs13, linear kernel.

  62. Results, Iris 1vs23, 2nd-order kernel. 2nd-order decision boundary: parabola, hyperbola, or ellipse.

  63. Results, Iris 1vs23, 2nd-order kernel.

  64. Results, Iris 1vs23, 13th-order kernel.

  65. Results, Iris 1vs23, RBF kernel.

  66. Results, Iris 1vs23, RBF kernel.

  67. Chessboard dataset: results, chessboard, polynomial kernel.

  68. Results, chessboard, polynomial kernel.
