  1. CS446 Introduction to Machine Learning (Spring 2015), University of Illinois at Urbana-Champaign, http://courses.engr.illinois.edu/cs446 — Lecture 10: Large Margin Classifiers. Prof. Julia Hockenmaier, juliahmr@illinois.edu

  2. Today’s class. Large margin classifiers: – Why do we care about the margin? – Perceptron with margin – Support Vector Machines. Dealing with outliers: – Soft margins

  3. Large margin classifiers

  4. What’s the best separating hyperplane? (figure: linearly separable positive and negative points)

  5. What’s the best separating hyperplane? (figure: the same points with candidate separating hyperplanes)

  6. What’s the best separating hyperplane? (figure: a separating hyperplane with margin m)

  7. Maximum margin classifiers. These decision boundaries are very close to some items in the training data: they have small margins, and minor changes in the data could lead to different decision boundaries. This decision boundary is as far away from any training items as possible: it has a large margin, and minor changes in the data result in (roughly) the same decision boundary.

  8. Maximum margin classifier. Margin = the distance of the decision boundary to the closest items in the training data. We want to find a classifier whose decision boundary is furthest away from the nearest data points (this classifier has the largest margin). This additional requirement (a bias) reduces the variance, i.e. reduces overfitting.

  9. Margins

  10. Margins. Decision boundary: the hyperplane with f(x) = 0, i.e. w·x + b = 0. Absolute distance of a point x to the hyperplane w·x + b = 0: |w·x + b| / ||w||. Distance of the hyperplane w·x + b = 0 to the origin: −b / ||w||.

  11. Margin. If the data are linearly separable, y^(i)(w·x^(i) + b) > 0 for all i. Euclidean distance of x^(i) to the decision boundary: y^(i) f(x^(i)) / ||w|| = y^(i)(w·x^(i) + b) / ||w||.

  12. Functional vs. geometric margin. Geometric margin (Euclidean distance) of the hyperplane w·x + b = 0 to a point x^(i): y^(i) f(x^(i)) / ||w|| = y^(i)(w·x^(i) + b) / ||w||. Functional margin of the hyperplane w·x + b = 0 to a point x^(i): γ = y^(i) f(x^(i)), i.e. γ = y^(i)(w·x^(i) + b).
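
A small numeric sketch of the two margins (NumPy; the weight vector, bias, and labeled point below are made up, not from the slides):

import numpy as np

# Hypothetical weight vector, bias, and one labeled point.
w = np.array([3.0, 4.0])    # ||w|| = 5
b = -2.0
x = np.array([2.0, 1.0])
y = +1                      # true label in {-1, +1}

f_x = w @ x + b                                            # f(x) = w.x + b
functional_margin = y * f_x                                # gamma = y * f(x)
geometric_margin = functional_margin / np.linalg.norm(w)   # y * f(x) / ||w||

print(functional_margin)    # 8.0
print(geometric_margin)     # 1.6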

  13. Rescaling w and b. Rescaling w and b by a factor k to kw and kb does not change the geometric margin (Euclidean distance). Geometric margin of x^(i) to w·x + b = 0: y^(i)(w·x^(i) + b) / ||w||. Spelling out w·x^(i) and ||w||: y^(i)(Σ_n w_n x_n^(i) + b) / √(Σ_n w_n w_n). Multiplying by k/k: k·y^(i)(Σ_n w_n x_n^(i) + b) / (k·√(Σ_n w_n w_n)). Moving k inside: y^(i)(Σ_n k·w_n x_n^(i) + k·b) / √(Σ_n (k·w_n)(k·w_n)) = y^(i)(kw·x^(i) + kb) / ||kw||, which is the geometric margin of x^(i) to kw·x + kb = 0.

  14. Rescaling w and b. Rescaling w and b by a factor k does change the functional margin γ by a factor k: γ = y^(i)(w·x^(i) + b), kγ = y^(i)(kw·x^(i) + kb). The point that is closest to the decision boundary has functional margin γ_min. – w and b can be rescaled so that γ_min = 1 – When learning w and b, we can set γ_min = 1 (and still get the same decision boundary).
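
A small sketch checking the effect of rescaling on made-up data (the hyperplane, points, and k below are assumptions): the geometric margin is unchanged, the functional margin scales by k, and dividing (w, b) by γ_min sets the minimum functional margin to 1.

import numpy as np

# Hypothetical separating hyperplane and a few labeled points.
w, b = np.array([2.0, -1.0]), 0.5
X = np.array([[1.0, -1.0], [2.0, 0.5], [-1.0, 1.0]])
y = np.array([+1, +1, -1])

def functional_margins(w, b):
    return y * (X @ w + b)

def geometric_margins(w, b):
    return functional_margins(w, b) / np.linalg.norm(w)

k = 3.0
print(functional_margins(k * w, k * b) / functional_margins(w, b))              # all entries = k
print(np.allclose(geometric_margins(k * w, k * b), geometric_margins(w, b)))    # True

# Rescale so that the closest point has functional margin exactly 1.
gamma_min = functional_margins(w, b).min()
w1, b1 = w / gamma_min, b / gamma_min
print(functional_margins(w1, b1).min())   # 1.0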

  15. The maximum margin decision boundary (figure: decision boundary w·x = 0 with margin m; the closest positive points lie on w·x_i = +1 = y_i and w·x_k = +1 = y_k, the closest negative points on w·x_j = −1 = y_j)

  16. Hinge loss. L(y, f(x)) = max(0, 1 − y·f(x)). Case 1: y·f(x) ≥ 1: x outside of the margin, hinge loss = 0. Case 2: 0 < y·f(x) < 1: x inside the margin, hinge loss = 1 − y·f(x). Case 3: y·f(x) < 0: x misclassified, hinge loss = 1 − y·f(x). (plot: hinge loss as a function of y·f(x))
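
A minimal sketch of the hinge loss and its three cases (NumPy; the example scores below are made up):

import numpy as np

def hinge_loss(y, f_x):
    """Hinge loss L(y, f(x)) = max(0, 1 - y*f(x)) for labels y in {-1, +1}."""
    return np.maximum(0.0, 1.0 - y * f_x)

y = np.array([+1, +1, +1])
f_x = np.array([2.0,    # case 1: y*f(x) >= 1, outside the margin -> loss 0
                0.4,    # case 2: 0 < y*f(x) < 1, inside the margin -> loss 0.6
               -0.5])   # case 3: y*f(x) < 0, misclassified        -> loss 1.5
print(hinge_loss(y, f_x))   # [0.  0.6 1.5]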

  17. Perceptron with margin

  18. Perceptron with margin. Standard perceptron update: update w if y_m · (w·x_m) < 0. Perceptron with margin update: define a functional margin γ > 0 and update w if y_m · (w·x_m) < γ.
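
A sketch of the perceptron-with-margin update on toy data (the learning rate, the margin value γ = 1, and the constant feature used as the bias term are assumptions, not from the slides):

import numpy as np

def perceptron_with_margin(X, y, gamma=1.0, lr=1.0, epochs=10):
    """Perceptron that also updates on correctly classified points
    whose functional margin y * (w.x) is below gamma."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_m, y_m in zip(X, y):
            if y_m * (w @ x_m) < gamma:     # the standard perceptron uses "< 0" here
                w += lr * y_m * x_m
    return w

# Toy linearly separable data (bias folded in as a constant 1 feature, as in the perceptron notation).
X = np.array([[2.0, 1.0, 1.0], [1.5, 2.0, 1.0], [-1.0, -1.5, 1.0], [-2.0, -0.5, 1.0]])
y = np.array([+1, +1, -1, -1])
w = perceptron_with_margin(X, y)
print(np.sign(X @ w))   # matches y on this toy set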

  19. Support Vector Machines

  20. The maximum margin decision boundary (figure: decision boundary w·x = 0 with margin m, the positive margin hyperplane through w·x_i = +1 = y_i and the negative margin hyperplane through w·x_j = −1 = y_j)

  21. The maximum margin decision boundary… is defined by two parallel hyperplanes: – one that goes through the positive data points (y_j = +1) that are closest to the decision boundary, and – one that goes through the negative data points (y_j = −1) that are closest to the decision boundary.

  22. Support vectors. We can express the separating hyperplane in terms of the data points x_j closest to the decision boundary. These data points are called the support vectors.

  23. Support vectors (figure: the points lying on the margin hyperplanes w·x_i = +1 = y_i and w·x_j = −1 = y_j are the support vectors)

  24. Perceptrons and SVMs: differences in notation. Perceptrons: – the weight vector has a bias term w_0 (x_0 = dummy value 1) – decision boundary: w·x = 0. SVMs/large margin classifiers: – explicit bias term b; weight vector w = (w_1 … w_n) – decision boundary: w·x + b = 0.

  25. Support Vector Machines. The functional margin of the data for (w, b) is determined by the points closest to the hyperplane: γ_min = min_n [ y^(n)(w·x^(n) + b) ]. Distance of x^(n) to the hyperplane w·x + b = 0: (w·x^(n) + b) / ||w||. Learning w in an SVM = maximizing the margin: argmax_{w,b} { (1/||w||) · min_n [ y^(n)(w·x^(n) + b) ] }.

  26. Support Vector Machines. Learning w in an SVM = maximizing the margin: argmax_{w,b} { (1/||w||) · min_n [ y^(n)(w·x^(n) + b) ] }. This is difficult to optimize directly. Let’s convert it to an equivalent problem that is easier.

  27. Support Vector Machines. Learning w in an SVM = maximizing the margin: argmax_{w,b} { (1/||w||) · min_n [ y^(n)(w·x^(n) + b) ] }. Easier equivalent problem: – we can always rescale w and b without affecting Euclidean distances – this allows us to set the functional margin to 1: min_n y^(n)(w·x^(n) + b) = 1.

  28. Support Vector Machines. Learning w in an SVM = maximizing the margin: argmax_{w,b} { (1/||w||) · min_n [ y^(n)(w·x^(n) + b) ] }. Easier equivalent problem: a quadratic program. – Setting min_n y^(n)(w·x^(n) + b) = 1 implies y^(n)(w·x^(n) + b) ≥ 1 for all n – argmax 1/(w·w) = argmin (w·w) = argmin ½ w·w. So: argmin_w ½ w·w subject to y_i(w·x_i + b) ≥ 1 for all i.
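
As a sketch, this quadratic program can be handed to an off-the-shelf solver; a minimal version below uses the cvxpy library on made-up toy data (the solver choice and the data are assumptions, not part of the lecture):

import cvxpy as cp
import numpy as np

# Toy linearly separable data.
X = np.array([[2.0, 2.0], [2.5, 1.0], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()

# argmin 1/2 w.w  subject to  y_i (w.x_i + b) >= 1 for all i
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print(w.value, b.value)
print(y * (X @ w.value + b.value))   # all >= 1 (up to solver tolerance); = 1 at the support vectors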

  29. Support Vector Machines. The name “Support Vector Machine” stems from the fact that w* is supported by (i.e. lies in the linear span of) the examples that are exactly at a distance 1/||w*|| from the separating hyperplane. These vectors are therefore called support vectors. Theorem: let w* be the minimizer of the SVM optimization problem for S = {(x_i, y_i)}, and let I = {i : y_i(w*·x_i + b) = 1}. Then there exist coefficients α_i > 0 such that w* = Σ_{i∈I} α_i y_i x_i.

  30. The primal representation. The data items x = (x_1 … x_n) have n features; the weight vector w = (w_1 … w_n) has n elements. Learning: find a weight w_j for each feature x_j. Classification: evaluate w·x.

  31. The dual representation: w = Σ_j α_j x_j. Learning: find a weight α_j (≥ 0) for each data point x_j. This requires computing the inner product x_i·x_j between all pairs of data items x_i and x_j. Support vectors = the set of data points x_j with non-zero weights α_j.

  32. Classifying test data with an SVM. In the primal: compute the inner product between the weight vector and the test item: w·x = ⟨w, x⟩. In the dual: compute inner products between the support vectors and the test item: w·x = ⟨w, x⟩ = ⟨Σ_j α_j x_j, x⟩ = Σ_j α_j ⟨x_j, x⟩.
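
A sketch of this primal/dual equivalence using scikit-learn’s linear SVC (the library, the toy data, and the large C used to approximate a hard margin are assumptions; in scikit-learn, dual_coef_ stores α_j·y_j for the support vectors):

import numpy as np
from sklearn.svm import SVC

# Toy data; a large C approximates a hard margin.
X = np.array([[2.0, 2.0], [2.5, 1.0], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
clf = SVC(kernel="linear", C=1e6).fit(X, y)

x_test = np.array([1.0, 0.5])

# Primal: w.x + b, using the explicit weight vector.
primal_score = clf.coef_[0] @ x_test + clf.intercept_[0]

# Dual: sum_j (alpha_j * y_j) <x_j, x> + b over the support vectors only.
dual_score = clf.dual_coef_[0] @ (clf.support_vectors_ @ x_test) + clf.intercept_[0]

print(np.isclose(primal_score, dual_score))   # True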

  33. Dealing with outliers: Soft margins

  34. Dealing with outliers: slack variables ξ_i. ξ_i measures by how much example (x_i, y_i) fails to achieve the margin δ.

  35. Dealing with outliers: slack variables ξ_i. If x_i is on the correct side of the margin: ξ_i = 0; otherwise ξ_i = |y_i − w·x_i|. If ξ_i = 1: x_i is on the decision boundary w·x_i = 0. If ξ_i > 1: x_i is misclassified. Replace y^(n)(w·x^(n) + b) ≥ 1 (hard margin) with y^(n)(w·x^(n) + b) ≥ 1 − ξ^(n) (soft margin).

  36. Soft margins. argmin_w ½ w·w + C Σ_{i=1}^n ξ_i, subject to ξ_i ≥ 0 and y_i(w·x_i + b) ≥ 1 − ξ_i for all i. ξ_i (slack): how far off is x_i from the margin? C (cost): how much do we have to pay for misclassifying x_i? We want to minimize C Σ_i ξ_i and maximize the margin; C controls the tradeoff between margin and training error.
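
A corresponding sketch of the soft-margin program with explicit slack variables, again assuming cvxpy and made-up data containing one outlier (the value C = 1 is an arbitrary choice):

import cvxpy as cp
import numpy as np

# Toy data with one outlier (the last positive point sits among the negatives).
X = np.array([[2.0, 2.0], [2.5, 1.0], [-1.0, -1.5], [-2.0, -1.0], [-1.5, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0, 1.0])
C = 1.0

w, b = cp.Variable(2), cp.Variable()
xi = cp.Variable(len(y))                  # one slack variable per training item

# argmin 1/2 w.w + C * sum_i xi_i  subject to  xi_i >= 0,  y_i (w.x_i + b) >= 1 - xi_i
objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
constraints = [xi >= 0,
               cp.multiply(y, X @ w + b) >= 1 - xi]
cp.Problem(objective, constraints).solve()

print(xi.value.round(3))   # the outlier gets the largest slack (> 1)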

  37. Soft SVMs. The optimization problem becomes: min_w ½ ||w||² + C Σ_{(x,y)∈S} max(0, 1 − y·w·x), where the parameter C controls the tradeoff between choosing a large margin (small ||w||) and choosing a small hinge loss.
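
Equivalently, this unconstrained hinge-loss form can be minimized directly by (sub)gradient descent; a minimal sketch on toy data follows (the step size, iteration count, value of C, and the constant feature used as a bias are assumptions, not from the slides):

import numpy as np

def soft_svm_subgradient(X, y, C=1.0, lr=0.01, steps=2000):
    """Minimize 1/2 ||w||^2 + C * sum_i max(0, 1 - y_i * w.x_i) by subgradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        margins = y * (X @ w)
        active = margins < 1                       # points with non-zero hinge loss
        grad = w - C * (y[active, None] * X[active]).sum(axis=0)
        w -= lr * grad
    return w

# Toy data with a constant 1 feature acting as the bias term.
X = np.array([[2.0, 2.0, 1.0], [2.5, 1.0, 1.0], [-1.0, -1.5, 1.0], [-2.0, -1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = soft_svm_subgradient(X, y)
print(np.sign(X @ w))   # matches y on this separable toy set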
