Machine Learning Basics Lecture 4: SVM I Princeton University COS 495 Instructor: Yingyu Liang
Review: machine learning basics
Math formulation • Given training data (x_i, y_i): 1 ≤ i ≤ n i.i.d. from distribution D • Find y = f(x) ∈ H that minimizes the empirical loss L̂(f) = (1/n) Σ_{i=1}^n l(f, x_i, y_i) • s.t. the expected loss L(f) = E_{(x,y)~D}[l(f, x, y)] is small
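As a concrete reading of this formulation, here is a minimal sketch (all names are illustrative, not from the lecture): the empirical loss averages a per-example loss l over the n training points; the expected loss is the same average taken under the data distribution D.
import numpy as np

def empirical_loss(f, loss, X, y):
    # \hat{L}(f) = (1/n) sum_i l(f, x_i, y_i)
    return np.mean([loss(f(x_i), y_i) for x_i, y_i in zip(X, y)])

# example with a fixed linear predictor and squared loss
f = lambda x: x @ np.array([1.0, -1.0])
loss = lambda pred, target: (pred - target) ** 2
X, y = np.array([[1.0, 0.0], [0.0, 1.0]]), np.array([1.0, -1.0])
print(empirical_loss(f, loss, X, y))   # average of per-example losses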
Machine learning 1-2-3 • Collect data and extract features • Build model: choose hypothesis class H and loss function l • Optimization: minimize the empirical loss
Loss function • l_2 loss: linear regression • Cross-entropy: logistic regression • Hinge loss: Perceptron • General principle: maximum likelihood estimation (MLE) • l_2 loss: corresponds to Normal distribution • Logistic regression: corresponds to sigmoid conditional distribution
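To make the three losses concrete, a minimal NumPy sketch (function names are illustrative): it evaluates the squared loss, the logistic (cross-entropy) loss, and the hinge loss for a single example with label y in {+1, -1} and score f = w^T x.
import numpy as np

def squared_loss(f, y):
    # l_2 loss used for linear regression: (f - y)^2
    return (f - y) ** 2

def logistic_loss(f, y):
    # cross-entropy loss for labels y in {+1, -1}: log(1 + exp(-y f))
    return np.log1p(np.exp(-y * f))

def hinge_loss(f, y):
    # hinge loss: max(0, 1 - y f)
    return max(0.0, 1.0 - y * f)

# example: score of a linear model on one data point
w, x, y = np.array([1.0, -2.0]), np.array([0.5, 0.3]), +1
f = w @ x
print(squared_loss(f, y), logistic_loss(f, y), hinge_loss(f, y))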
Optimization • Linear regression: closed-form solution • Logistic regression: gradient descent • Perceptron: stochastic gradient descent • General principle: local improvement • SGD: Perceptron; can also be applied to linear regression/logistic regression
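As a sketch of the "local improvement" principle, the snippet below (hypothetical variable names, synthetic data) runs plain SGD on the logistic loss; the same loop applies to linear regression by swapping in the squared-loss gradient.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                    # synthetic features
y = np.sign(X @ np.array([2.0, -1.0]) + 0.1)     # synthetic labels in {+1, -1}

w, lr = np.zeros(2), 0.1
for epoch in range(20):
    for i in rng.permutation(len(X)):
        # gradient of log(1 + exp(-y_i w^T x_i)) with respect to w
        margin = y[i] * (w @ X[i])
        grad = -y[i] * X[i] / (1.0 + np.exp(margin))
        w -= lr * grad                           # stochastic gradient step
print("learned w:", w)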
Principle for hypothesis class? • Yes, there exists a general principle (at least philosophically) • Different names/faces/connections • Occam's razor • VC dimension theory • Minimum description length • Tradeoff between bias and variance; uniform convergence • The curse of dimensionality • Running example: Support Vector Machine (SVM)
Motivation
Linear classification • Hyperplane (w*)^T x = 0; (w*)^T x > 0 for Class +1, (w*)^T x < 0 for Class -1 • Assume perfect separation between the two classes
Attempt • Given training data (x_i, y_i): 1 ≤ i ≤ n i.i.d. from distribution D • Hypothesis y = sign(f_w(x)) = sign(w^T x) • y = +1 if w^T x > 0 • y = -1 if w^T x < 0 • Let's assume that we can optimize to find w
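The hypothesis written as code, a one-line sketch with illustrative names:
import numpy as np

def predict(w, x):
    # y = sign(w^T x): +1 if w^T x > 0, -1 if w^T x < 0
    return 1 if w @ x > 0 else -1

print(predict(np.array([1.0, -1.0]), np.array([2.0, 0.5])))   # +1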
Multiple optimal solutions? • Figure: three separators w_1, w_2, w_3 between Class +1 and Class -1 • Same on empirical loss; different on test/expected loss
What about w_1? • Figure: new test data, Class +1 vs. Class -1
What about w_3? • Figure: new test data, Class +1 vs. Class -1
Most confident: w_2 • Figure: new test data, Class +1 vs. Class -1
Intuition: margin • Figure: w_2 separates Class +1 and Class -1 with a large margin
Margin
Margin • Lemma 1: x has distance |f_w(x)| / ‖w‖ to the hyperplane f_w(x) = w^T x = 0 • Proof: • w is orthogonal to the hyperplane • The unit direction is w / ‖w‖ • The projection of x onto this direction is (w/‖w‖)^T x = w^T x / ‖w‖, whose absolute value is the distance
Margin: with bias • Claim 1: w is orthogonal to the hyperplane f_{w,b}(x) = w^T x + b = 0 • Proof: • Pick any x_1 and x_2 on the hyperplane • w^T x_1 + b = 0 • w^T x_2 + b = 0 • So w^T (x_1 − x_2) = 0
Margin: with bias • Claim 2: 0 has distance |b| / ‖w‖ to the hyperplane w^T x + b = 0 • Proof: • Pick any x_1 on the hyperplane • Project x_1 onto the unit direction w / ‖w‖ to get the distance • (w/‖w‖)^T x_1 = −b / ‖w‖ since w^T x_1 + b = 0
Margin: with bias • Lemma 2: x has distance |f_{w,b}(x)| / ‖w‖ to the hyperplane f_{w,b}(x) = w^T x + b = 0 • Proof: • Write x = x_⊥ + r w/‖w‖, where x_⊥ lies on the hyperplane; then |r| is the distance • Multiply both sides by w^T and add b • Left-hand side: w^T x + b = f_{w,b}(x) • Right-hand side: w^T x_⊥ + b + r w^T w/‖w‖ = 0 + r‖w‖
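A quick numerical check of Lemma 2 (names and numbers are illustrative): the distance from x to the hyperplane w^T x + b = 0 is |w^T x + b| / ‖w‖.
import numpy as np

def distance_to_hyperplane(x, w, b):
    # Lemma 2: distance = |w^T x + b| / ||w||
    return abs(w @ x + b) / np.linalg.norm(w)

w, b = np.array([3.0, 4.0]), -5.0
x = np.array([2.0, 1.0])
print(distance_to_hyperplane(x, w, b))   # |3*2 + 4*1 - 5| / 5 = 1.0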
The notation here is: y(x) = w^T x + w_0. Figure from Pattern Recognition and Machine Learning, Bishop
Support Vector Machine (SVM)
SVM: objective • Margin over all training data points: γ = min_i |f_{w,b}(x_i)| / ‖w‖ • Since we only want f_{w,b} to be correct, and recall y_i ∈ {+1, −1}, we have γ = min_i y_i f_{w,b}(x_i) / ‖w‖ • If f_{w,b} is incorrect on some x_i, the margin is negative
SVM: objective • Maximize the margin over all training data points: max_{w,b} γ = max_{w,b} min_i y_i f_{w,b}(x_i) / ‖w‖ = max_{w,b} min_i y_i (w^T x_i + b) / ‖w‖ • A bit complicated …
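A minimal sketch (hypothetical toy data) of the quantity being maximized: the signed margin γ = min_i y_i (w^T x_i + b) / ‖w‖, which is negative whenever some point is misclassified.
import numpy as np

def margin(X, y, w, b):
    # gamma = min_i y_i (w^T x_i + b) / ||w||
    return np.min(y * (X @ w + b)) / np.linalg.norm(w)

X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0]])
y = np.array([+1, +1, -1])
print(margin(X, y, np.array([1.0, 1.0]), 0.0))   # smallest signed distance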
SVM: simplified objective • Observation: when (w, b) is scaled by a factor c, the margin is unchanged: y_i (c w^T x_i + c b) / ‖c w‖ = y_i (w^T x_i + b) / ‖w‖ • Let's consider a fixed scale such that y_{i*} (w^T x_{i*} + b) = 1, where x_{i*} is the point closest to the hyperplane
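A quick numeric check of the observation (toy numbers only): scaling (w, b) by any factor c > 0 leaves y_i (w^T x_i + b) / ‖w‖ unchanged.
import numpy as np

w, b, c = np.array([1.0, 2.0]), -0.5, 10.0
x, y = np.array([3.0, 1.0]), +1

original = y * (w @ x + b) / np.linalg.norm(w)
scaled = y * ((c * w) @ x + c * b) / np.linalg.norm(c * w)
print(original, scaled)   # identical up to floating-point error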
SVM: simplified objective • Let's consider a fixed scale such that y_{i*} (w^T x_{i*} + b) = 1, where x_{i*} is the point closest to the hyperplane • Now we have y_i (w^T x_i + b) ≥ 1 for all data points, and the equality holds for at least one i • Then the margin is 1 / ‖w‖
SVM: simplified objective • Optimization simplified to: min_{w,b} (1/2) ‖w‖^2 s.t. y_i (w^T x_i + b) ≥ 1, ∀i • How to find the optimum ŵ*?
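One way to see the optimum numerically (a sketch, not the method developed in the lecture): hand the quadratic program to a generic constrained solver such as scipy.optimize.minimize with SLSQP; the data and variable names below are made up for illustration.
import numpy as np
from scipy.optimize import minimize

# toy separable data
X = np.array([[2.0, 2.0], [2.0, 3.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([+1, +1, -1, -1])

def objective(v):                 # v = (w_1, w_2, b)
    return 0.5 * np.dot(v[:2], v[:2])

constraints = [{"type": "ineq",   # y_i (w^T x_i + b) - 1 >= 0
                "fun": lambda v, i=i: y[i] * (X[i] @ v[:2] + v[2]) - 1.0}
               for i in range(len(X))]

res = minimize(objective, x0=np.zeros(3), method="SLSQP", constraints=constraints)
w, b = res.x[:2], res.x[2]
print("w:", w, "b:", b, "margin:", 1.0 / np.linalg.norm(w))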
SVM: principle for hypothesis class
Thought experiment • Suppose we pick an r, and suppose we can decide whether there exists a w satisfying (1/2) ‖w‖^2 ≤ r and y_i (w^T x_i + b) ≥ 1, ∀i • Decrease r until we can no longer find a w satisfying the inequalities
Thought experiment • ŵ* is the best weight (i.e., the one satisfying the smallest r)
Thought experiment • To handle the difference between empirical and expected losses → choose a large-margin hypothesis (high confidence) → choose a small hypothesis class • Figure: ŵ*; the constrained set of weights corresponds to the hypothesis class
Thought experiment • Principle: use the smallest hypothesis class that still contains a correct/good hypothesis • Also true beyond SVM • Also true for the case without perfect separation between the two classes • Math formulation: VC-dimension theory, etc. • Figure: ŵ*; the constrained set of weights corresponds to the hypothesis class
Thought experiment • Principle: use the smallest hypothesis class that still contains a correct/good hypothesis • Whatever you know about the ground truth, add it as a constraint/regularizer • Figure: ŵ*; the constrained set of weights corresponds to the hypothesis class
SVM: optimization • Optimization (Quadratic Programming): min_{w,b} (1/2) ‖w‖^2 s.t. y_i (w^T x_i + b) ≥ 1, ∀i • Solved by the Lagrange multiplier method: L(w, b, α) = (1/2) ‖w‖^2 − Σ_i α_i [y_i (w^T x_i + b) − 1], where the α_i are the Lagrange multipliers • Details in the next lecture
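A literal transcription of the Lagrangian as code (a sketch with illustrative names and toy data), useful for checking the expression before the dual derivation in the next lecture:
import numpy as np

def lagrangian(w, b, alpha, X, y):
    # L(w, b, alpha) = 1/2 ||w||^2 - sum_i alpha_i [y_i (w^T x_i + b) - 1]
    return 0.5 * (w @ w) - np.sum(alpha * (y * (X @ w + b) - 1.0))

X = np.array([[2.0, 2.0], [-1.0, -1.0]])
y = np.array([+1, -1])
print(lagrangian(np.array([0.5, 0.5]), 0.0, np.ones(2), X, y))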
Reading • Review the Lagrange multiplier method • E.g., Section 5 in Andrew Ng's note on SVM, posted on the course website: http://www.cs.princeton.edu/courses/archive/spring16/cos495/