CS446 Introduction to Machine Learning (Spring 2015)
University of Illinois at Urbana-Champaign
http://courses.engr.illinois.edu/cs446

Lecture 10: Large Margin Classifiers
Prof. Julia Hockenmaier (juliahmr@illinois.edu)
Today's class

Large margin classifiers:
– Why do we care about the margin?
– Perceptron with margin
– Support Vector Machines

Dealing with outliers:
– Soft margins
Large margin classifiers
What's the best separating hyperplane?

[Figure: a linearly separable set of positive (+) and negative (−) training points, shown with several candidate separating hyperplanes; one hyperplane is highlighted together with its margin m.]
Maximum margin classifiers

These decision boundaries are very close to some items in the training data. They have small margins. Minor changes in the data could lead to different decision boundaries.

This decision boundary is as far away from any training items as possible. It has a large margin. Minor changes in the data result in (roughly) the same decision boundary.
Maximum margin classifier

Margin = the distance of the decision boundary to the closest items in the training data.

We want to find a classifier whose decision boundary is furthest away from the nearest data points. (This classifier has the largest margin.)

This additional requirement (a bias) reduces the variance (i.e. reduces overfitting).
Margins
Margins

Absolute distance of point x to the hyperplane wx + b = 0:
|wx + b| / ‖w‖

Distance of the hyperplane wx + b = 0 to the origin:
−b / ‖w‖

Decision boundary: the hyperplane with f(x) = 0, i.e. wx + b = 0
Margin

If the data are linearly separable, y^(i) (wx^(i) + b) > 0 for all i.

Euclidean distance of x^(i) to the decision boundary:
y^(i) f(x^(i)) / ‖w‖ = y^(i) (wx^(i) + b) / ‖w‖
Functional vs. Geometric margin

Geometric margin (Euclidean distance) of the hyperplane wx + b = 0 to point x^(i):
y^(i) f(x^(i)) / ‖w‖ = y^(i) (wx^(i) + b) / ‖w‖

Functional margin of the hyperplane wx + b = 0 to point x^(i):
γ = y^(i) f(x^(i)), i.e. γ = y^(i) (wx^(i) + b)
Rescaling w and b

Rescaling w and b by a factor k to kw and kb does not change the geometric margin (Euclidean distance):

Geometric margin of x^(i) to wx + b = 0:
  y^(i) (wx^(i) + b) / ‖w‖
  = y^(i) (Σ_n w_n x_n^(i) + b) / √(Σ_n w_n w_n)           …spell out wx and ‖w‖…
  = k y^(i) (Σ_n w_n x_n^(i) + b) / (k √(Σ_n w_n w_n))      …multiply by k/k…
  = y^(i) (Σ_n k w_n x_n^(i) + kb) / √(Σ_n (k w_n)(k w_n))  …move k inside…
  = y^(i) (kwx^(i) + kb) / ‖kw‖
which is the geometric margin of x^(i) to kwx + kb = 0.
Rescaling w and b

Rescaling w and b by a factor k does change the functional margin γ by a factor k:
γ = y^(i) (wx^(i) + b)
kγ = y^(i) (kwx^(i) + kb)

The point that is closest to the decision boundary has functional margin γ_min.
– w and b can be rescaled so that γ_min = 1.
– When learning w and b, we can set γ_min = 1 (and still get the same decision boundary).
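A small numerical sketch (not from the slides) can make the distinction concrete. The toy point, hyperplane, and function names below are my own; the code computes both margins with NumPy and checks that scaling (w, b) by k scales the functional margin but leaves the geometric margin unchanged.

```python
import numpy as np

def functional_margin(w, b, x, y):
    # gamma = y * (w . x + b)
    return y * (np.dot(w, x) + b)

def geometric_margin(w, b, x, y):
    # signed Euclidean distance of x to the hyperplane wx + b = 0
    return functional_margin(w, b, x, y) / np.linalg.norm(w)

# toy point and hyperplane (made-up values)
w, b = np.array([2.0, -1.0]), 0.5
x, y = np.array([1.0, 3.0]), -1

k = 10.0
print(functional_margin(w, b, x, y), functional_margin(k * w, k * b, x, y))  # scales by k
print(geometric_margin(w, b, x, y), geometric_margin(k * w, k * b, x, y))    # unchanged
```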
The maximum margin decision boundary

[Figure: decision boundary wx = 0 with margin m; a parallel hyperplane wx_i = +1 passes through the closest positive points, and wx_j = −1 passes through the closest negative points.]
Hinge loss

L(y, f(x)) = max(0, 1 − yf(x))

Case 1: yf(x) > 1 (x outside of the margin): hinge loss = 0
Case 2: 0 < yf(x) < 1 (x inside the margin): hinge loss = 1 − yf(x)
Case 3: yf(x) < 0 (x misclassified): hinge loss = 1 − yf(x)

[Plot: hinge loss as a function of y·f(x)]
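A short sketch of these three cases in code (not part of the slides; the function name and sample values are my own):

```python
import numpy as np

def hinge_loss(y, fx):
    """Hinge loss L(y, f(x)) = max(0, 1 - y*f(x)) for labels y in {-1, +1}."""
    return np.maximum(0.0, 1.0 - y * fx)

# Case 1: y*f(x) > 1     -> loss 0            (outside the margin)
# Case 2: 0 < y*f(x) < 1 -> loss 1 - y*f(x)   (inside the margin)
# Case 3: y*f(x) < 0     -> loss 1 - y*f(x) > 1 (misclassified)
print(hinge_loss(+1, 2.0), hinge_loss(+1, 0.4), hinge_loss(+1, -0.5))  # 0.0, 0.6, 1.5
```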
Perceptron with margin
Perceptron with Margin

Standard Perceptron update:
– Update w if y_m · w·x_m < 0

Perceptron with Margin update:
– Define a functional margin γ > 0
– Update w if y_m · w·x_m < γ
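A minimal training-loop sketch of this update rule (my own illustration, assuming labels in {-1, +1}, a fixed learning rate, and the bias folded into w via a dummy feature; the dataset is made up):

```python
import numpy as np

def perceptron_with_margin(X, y, gamma=1.0, lr=1.0, epochs=10):
    """Perceptron with a functional-margin threshold gamma (gamma=0 gives the standard update)."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(epochs):
        for x_m, y_m in zip(X, y):
            # update whenever the functional margin falls below gamma
            if y_m * np.dot(w, x_m) < gamma:
                w += lr * y_m * x_m
    return w

# toy usage with a made-up separable dataset
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([+1, +1, -1, -1])
w = perceptron_with_margin(X, y, gamma=1.0)
print(w, np.sign(X @ w))
```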
Support Vector Machines
The maximum margin decision boundary

[Figure repeated: decision boundary wx = 0 with margin m, and the parallel hyperplanes wx_i = +1 and wx_j = −1 through the closest positive and negative points.]
The maximum margin decision boundary…

…is defined by two parallel hyperplanes:
– one that goes through the positive data points (y_j = +1) that are closest to the decision boundary, and
– one that goes through the negative data points (y_j = −1) that are closest to the decision boundary.
Support vectors

We can express the separating hyperplane in terms of the data points x_j closest to the decision boundary. These data points are called the support vectors.
Support vectors

[Figure repeated: the support vectors are the points lying exactly on the margin hyperplanes wx_i = +1 and wx_j = −1.]
Perceptrons and SVMs: Differences in notation

Perceptrons:
– Weight vector has a bias term w_0 (x_0 = dummy value 1)
– Decision boundary: wx = 0

SVMs/Large Margin classifiers:
– Explicit bias term b; weight vector w = (w_1 … w_n)
– Decision boundary: wx + b = 0
Support Vector Machines

The functional margin of the data for (w, b) is determined by the points closest to the hyperplane:
γ_min = min_n [ y^(n) (wx^(n) + b) ]

Distance of x^(n) to the hyperplane wx + b = 0:
(wx^(n) + b) / ‖w‖

Learning w in an SVM = maximizing the margin:
argmax_{w,b} { (1/‖w‖) min_n [ y^(n) (wx^(n) + b) ] }
Support Vector Machines

Learning w in an SVM = maximizing the margin:
argmax_{w,b} { (1/‖w‖) min_n [ y^(n) (wx^(n) + b) ] }

This is difficult to optimize. Let's convert it to an equivalent problem that is easier.
Support Vector Machines

Learning w in an SVM = maximizing the margin:
argmax_{w,b} { (1/‖w‖) min_n [ y^(n) (wx^(n) + b) ] }

Easier equivalent problem:
– We can always rescale w and b without affecting Euclidean distances.
– This allows us to set the functional margin to 1: min_n y^(n) (wx^(n) + b) = 1
Support Vector Machines

Learning w in an SVM = maximizing the margin:
argmax_{w,b} { (1/‖w‖) min_n [ y^(n) (wx^(n) + b) ] }

Easier equivalent problem: a quadratic program
– Setting min_n y^(n) (wx^(n) + b) = 1 implies y^(n) (wx^(n) + b) ≥ 1 for all n
– argmax (1/‖w‖) = argmin ‖w‖ = argmin (½ w·w)

argmin_w ½ w·w
subject to y_i (w·x_i + b) ≥ 1 ∀i
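This quadratic program can be handed to an off-the-shelf solver. Below is a minimal hard-margin sketch using the CVXPY library (not part of the slides; the toy data and all variable names are my own, and it assumes the data really are linearly separable, otherwise the problem is infeasible):

```python
import numpy as np
import cvxpy as cp

# toy linearly separable data (made up for illustration), labels in {-1, +1}
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

n_features = X.shape[1]
w = cp.Variable(n_features)
b = cp.Variable()

# minimize (1/2) w.w  subject to  y_i (w.x_i + b) >= 1 for all i
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print("w =", w.value, "b =", b.value)
print("margin =", 1.0 / np.linalg.norm(w.value))  # geometric margin 1/||w||
```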
Support Vector Machines The name “Support Vector Machine” stems from the fact that w * is supported by (i.e. is the linear span of) the examples that are exactly at a distance 1/|| w *|| from the separating hyperplane. These vectors are therefore called support vectors . Theorem: Let w * be the minimizer of the SVM optimization problem for S = {( x i , y i )}. Let I= {i: y i ( w * x i + b) = 1}. Then there exist coefficients α i > 0 such that: w * = ∑ i ∈ ¡I α i y i x i ¡ 29
The primal representation

The data items x = (x_1 … x_n) have n features.
The weight vector w = (w_1 … w_n) has n elements.

Learning: find a weight w_j for each feature x_j
Classification: evaluate wx
The dual representation

w = Σ_j α_j x_j

Learning: find a weight α_j (≥ 0) for each data point x_j.
This requires computing the inner product x_i x_j between all pairs of data items x_i and x_j.

Support vectors = the set of data points x_j with non-zero weights α_j
Classifying test data with SVMs

In the primal: compute the inner product between the weight vector and the test item:
wx = ⟨w, x⟩

In the dual: compute the inner products between the support vectors and the test item:
wx = ⟨w, x⟩ = ⟨Σ_j α_j x_j, x⟩ = Σ_j α_j ⟨x_j, x⟩
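In practice, a trained SVM stores exactly these pieces. As a sketch (not from the slides), scikit-learn's SVC with a linear kernel exposes the support vectors and their dual coefficients, so the primal score ⟨w, x⟩ and the dual sum Σ_j α_j y_j ⟨x_j, x⟩ can be compared directly; the toy data and test point are made up:

```python
import numpy as np
from sklearn.svm import SVC

# toy data (made up for illustration), labels in {-1, +1}
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6)  # large C approximates a hard margin
clf.fit(X, y)

x_test = np.array([1.0, 0.5])

# primal: <w, x>, using the explicit weight vector
primal_score = (clf.coef_ @ x_test).item()

# dual: sum_j alpha_j y_j <x_j, x>, summing only over the support vectors
dual_score = (clf.dual_coef_ @ (clf.support_vectors_ @ x_test)).item()

print(primal_score, dual_score)  # the two scores agree (up to numerical precision)
```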
Dealing with outliers: Soft margins
Dealing with outliers: Slack variables ξ_i

ξ_i measures by how much example (x_i, y_i) fails to achieve the margin δ.
Dealing with outliers: Slack variables ξ_i

If x_i is on the correct side of the margin: ξ_i = 0; otherwise ξ_i = |y_i − wx_i|
– If ξ_i = 1: x_i is on the decision boundary wx_i = 0
– If ξ_i > 1: x_i is misclassified

Replace y^(n) (wx^(n) + b) ≥ 1 (hard margin)
with y^(n) (wx^(n) + b) ≥ 1 − ξ^(n) (soft margin)
Soft margins

argmin_w ½ w·w + C Σ_{i=1}^{n} ξ_i
subject to ξ_i ≥ 0 ∀i
           y_i (w·x_i + b) ≥ 1 − ξ_i ∀i

ξ_i (slack): how far off is x_i from the margin?
C (cost): how much do we have to pay for misclassifying x_i?

We want to minimize C Σ_i ξ_i and maximize the margin.
C controls the tradeoff between margin and training error.
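Extending the earlier hard-margin sketch with explicit slack variables gives the soft-margin quadratic program. Again this is only an illustration with CVXPY; the data, the choice C = 1.0, and the variable names are assumptions of mine:

```python
import numpy as np
import cvxpy as cp

# toy data with one outlier (made up), labels in {-1, +1}
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -3.0], [2.5, 2.5]])
y = np.array([1.0, 1.0, -1.0, -1.0, -1.0])  # the last point is an outlier

n_samples, n_features = X.shape
w = cp.Variable(n_features)
b = cp.Variable()
xi = cp.Variable(n_samples)  # slack variables, one per training item
C = 1.0                      # cost of slack; controls the margin/error tradeoff

objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
constraints = [xi >= 0,
               cp.multiply(y, X @ w + b) >= 1 - xi]
cp.Problem(objective, constraints).solve()

print("w =", w.value, "b =", b.value)
print("slacks =", np.round(xi.value, 3))  # only the outlier gets a large slack
```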
Soft SVMs

Now the optimization problem becomes
min_w ½ ‖w‖² + C Σ_{(x,y) ∈ S} max(0, 1 − y·wx)

where the parameter C controls the tradeoff between choosing a large margin (small ‖w‖) and choosing a small hinge loss.
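This unconstrained hinge-loss form is essentially what standard linear SVM implementations optimize. As a sketch (not part of the slides), scikit-learn's LinearSVC with the hinge loss solves this kind of objective, with C playing the same tradeoff role; the data below are made up, and small toy sets may trigger convergence warnings:

```python
import numpy as np
from sklearn.svm import LinearSVC

# toy data (made up); labels in {-1, +1}
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -3.0], [2.5, 2.5]])
y = np.array([1, 1, -1, -1, -1])

# small C -> prioritize a large margin (tolerate some hinge loss);
# large C -> prioritize low training error (smaller margin)
for C in (0.01, 1.0, 100.0):
    clf = LinearSVC(C=C, loss="hinge")
    clf.fit(X, y)
    print(C, np.linalg.norm(clf.coef_), clf.score(X, y))
```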