Support Vector Machines


  1. Tufts COMP 135: Introduction to Machine Learning (https://www.cs.tufts.edu/comp/135/2019s/). Support Vector Machines. Prof. Mike Hughes. Many ideas/slides attributable to: Dan Sheldon (U.Mass.); James, Witten, Hastie, Tibshirani (ISL/ESL books).

  2. SVM Unit Objectives. Big ideas for the Support Vector Machine: • Why maximize the margin? • What is a support vector? • What is hinge loss? • Advantages over logistic regression: less sensitive to outliers; speed advantages from sparsity when using kernels. • Disadvantages: not probabilistic; less elegant for multi-class problems.

  3. What will we learn? [Course overview figure: supervised learning, where training data of (feature, label) pairs $\{x_n, y_n\}_{n=1}^N$ are used to train a model whose predictions are scored by a performance measure during evaluation; unsupervised learning and reinforcement learning appear alongside.]

  4. Recall: Kernelized Regression

  5. Linear Regression

  6. Kernel Function: $k(x_i, x_j) = \phi(x_i)^T \phi(x_j)$. Interpretation: a similarity function for $x_i$ and $x_j$. Properties: input is any two vectors; output is a scalar real number; larger values mean more similar; symmetric.
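A minimal sketch of this definition in Python; the explicit quadratic feature map phi here is a hypothetical choice for illustration, not one from the slides:

```python
import numpy as np

def phi(x):
    # Hypothetical explicit feature map: original features plus their squares.
    return np.concatenate([x, x ** 2])

def kernel(x_i, x_j):
    # k(x_i, x_j) = phi(x_i)^T phi(x_j): a scalar similarity score.
    return phi(x_i) @ phi(x_j)

x_i, x_j = np.array([1.0, 2.0]), np.array([0.5, -1.0])
assert np.isclose(kernel(x_i, x_j), kernel(x_j, x_i))  # symmetric, as required
```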

  7. Common Kernels
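The slide's figure is not reproduced in this extraction; as a sketch, the kernels most often meant by "common" are the linear, polynomial, and radial basis function (RBF) kernels. The degree, offset, and gamma defaults below are illustrative choices:

```python
import numpy as np

def linear_kernel(x_i, x_j):
    # Plain inner product of the raw feature vectors.
    return x_i @ x_j

def polynomial_kernel(x_i, x_j, degree=3, c=1.0):
    # Inner product shifted by c, raised to a fixed degree.
    return (x_i @ x_j + c) ** degree

def rbf_kernel(x_i, x_j, gamma=1.0):
    # Decays with squared Euclidean distance; always in (0, 1].
    return np.exp(-gamma * np.sum((x_i - x_j) ** 2))
```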

  8. Kernel Matrices. • K_train: an N x N symmetric matrix,

$$K = \begin{bmatrix} k(x_1, x_1) & k(x_1, x_2) & \cdots & k(x_1, x_N) \\ k(x_2, x_1) & k(x_2, x_2) & \cdots & k(x_2, x_N) \\ \vdots & \vdots & \ddots & \vdots \\ k(x_N, x_1) & k(x_N, x_2) & \cdots & k(x_N, x_N) \end{bmatrix}$$

• K_test: a T x N matrix comparing each of the T test feature vectors to the N training feature vectors.
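A sketch of building both matrices with NumPy; the array names and the use of the rbf_kernel from the sketch above are assumptions for illustration:

```python
import numpy as np

def kernel_matrix(X_query, X_train, kernel):
    # Entry [i, j] holds kernel(X_query[i], X_train[j]).
    return np.array([[kernel(q, t) for t in X_train] for q in X_query])

# Assuming x_train has shape (N, D) and x_test has shape (T, D):
# K_train = kernel_matrix(x_train, x_train, rbf_kernel)  # N x N, symmetric
# K_test  = kernel_matrix(x_test,  x_train, rbf_kernel)  # T x N, one row per test example
```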

  9. Linear Kernel Regression

  10. Radial basis function (RBF) kernel, a.k.a. Gaussian or squared exponential kernel
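The slide's own formula is not reproduced in this extraction; the standard form, with length-scale (bandwidth) parameter $\sigma$, is:

$$k(x_i, x_j) = \exp\!\left( -\frac{\lVert x_i - x_j \rVert^2}{2\sigma^2} \right)$$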

  11. Compare: Linear Regression. Prediction: a linear transform of G-dimensional features, $\hat{y}(x_i, \theta) = \theta^T \phi(x_i) = \sum_{g=1}^{G} \theta_g \, \phi(x_i)_g$. Training: solve a G-dimensional optimization problem, $\min_\theta \sum_{n=1}^{N} (y_n - \hat{y}(x_n, \theta))^2$, plus an optional L2 penalty.

  12. Kernelized Linear Regression. • Prediction: $\hat{y}(x_i, \alpha, \{x_n\}_{n=1}^N) = \sum_{n=1}^{N} \alpha_n \, k(x_n, x_i)$. • Training: solve an N-dimensional optimization problem, $\min_\alpha \sum_{n=1}^{N} (y_n - \hat{y}(x_n, \alpha, X))^2$. All needed operations can be done with access only to the kernel (no feature vectors are ever created in memory).
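A minimal kernel ridge regression sketch of this recipe; the closed-form dual solve and the penalty strength lam are standard choices, not taken from the slide:

```python
import numpy as np

def fit_alpha(K_train, y_train, lam=1e-3):
    # Closed-form solution of the N-dimensional (dual) problem with an
    # L2 penalty: alpha = (K + lam * I)^{-1} y.
    N = K_train.shape[0]
    return np.linalg.solve(K_train + lam * np.eye(N), y_train)

def predict(K_test, alpha):
    # y_hat(x_i) = sum_n alpha_n * k(x_n, x_i), computed for all test rows at once.
    return K_test @ alpha
```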

  13. Math on board. • Goal: kernelized linear prediction reweights each training example.

  14. Can kernelize any linear model. Regression prediction: $\hat{y}(x_i, \alpha, \{x_n\}_{n=1}^N) = \sum_{n=1}^{N} \alpha_n \, k(x_n, x_i)$. Logistic regression prediction: $p(Y_i = 1 \mid x_i) = \sigma(\hat{y}(x_i, \alpha, X))$.
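As a sketch, the kernelized logistic regression prediction is just the kernelized score passed through the sigmoid:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(K_test, alpha):
    # p(Y_i = 1 | x_i) = sigma( sum_n alpha_n * k(x_n, x_i) )
    return sigmoid(K_test @ alpha)
```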

  15. Training: Regression vs. Logistic Regression. Regression: $\min_\alpha \sum_{n=1}^{N} (y_n - \hat{y}(x_n, \alpha, X))^2$. Logistic regression: $\min_\alpha \sum_{n=1}^{N} \text{log loss}(y_n, \sigma(\hat{y}(x_n, \alpha, X)))$.

  16. Downsides of LR: with log loss, any misclassified example pays a steep price, so logistic regression is fairly sensitive to outliers.

  17. Stepping back. Which do we prefer? Why?

  18. Idea: Define Regions Separated by a Wide Margin

  19. The vector w is perpendicular to the boundary.

  20. Examples that define the margin are called support vectors. [Figure labels: nearest positive example, nearest negative example.]

  21. Observation: non-support training examples do not influence the margin at all; we could perturb these examples without impacting the boundary.

  22. How wide is the margin?

  23. Small margin. Decision rule: $\hat{y} = +1$ if $w^T x + b \ge 0$, and $\hat{y} = -1$ if $w^T x + b < 0$. [Figure: positive examples in the region $w^T x + b > 0$, negative examples in the region $w^T x + b < 0$; the margin is the distance to the boundary.]

  24. Large margin. Same decision rule: $\hat{y} = +1$ if $w^T x + b \ge 0$, and $\hat{y} = -1$ if $w^T x + b < 0$. [Figure: the same setup with a wider margin between the classes.]

  25. How wide is the margin? Distance from the nearest positive example to the nearest negative example along the vector w.

  26. How wide is the margin? Distance from the nearest positive example to the nearest negative example along the vector w. By construction, we assume the score $w^T x + b$ equals $+1$ at the nearest positive example and $-1$ at the nearest negative example.
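Under that convention (the standard derivation, filled in here since the slide's equations are not reproduced): subtracting $w^T x_- + b = -1$ from $w^T x_+ + b = +1$ and projecting onto the unit vector $w / \lVert w \rVert$ gives the margin width,

$$w^T (x_+ - x_-) = 2 \quad\Rightarrow\quad \frac{w^T}{\lVert w \rVert}(x_+ - x_-) = \frac{2}{\lVert w \rVert}.$$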

  27. SVM Training Problem, Version 1: Hard margin. One constraint for each n = 1, 2, …, N: requires all training examples to be correctly classified. This is a constrained quadratic optimization problem; there are standard optimization methods as well as methods specially designed for SVMs.

  28. SVM Training Problem, Version 1: Hard margin. Minimizing this objective is equivalent to maximizing the margin width in feature space. One constraint for each n = 1, 2, …, N: requires all training examples to be correctly classified. This is a constrained quadratic optimization problem; there are standard optimization methods as well as methods specially designed for SVMs.
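The objective and constraints are not reproduced in this extraction; the standard hard-margin formulation is:

$$\min_{w,\, b} \; \frac{1}{2}\lVert w \rVert^2 \quad \text{subject to} \quad y_n (w^T x_n + b) \ge 1 \quad \text{for each } n = 1, \dots, N.$$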

  29. Soft margin: allow some misclassifications. The slack at example i measures the distance by which that example falls on the wrong side of the margin.

  30. Hard vs. soft constraints. HARD: all positive examples must satisfy the margin constraint exactly. SOFT: we want each positive example to satisfy the constraint up to a slack, with the slack kept as small as possible (minimize its absolute value).

  31. Hinge loss for a positive example. Assumes the current example has positive label y = +1. [Plot: hinge loss as a function of the score, zero at scores beyond +1 and increasing linearly below that.]
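A sketch of hinge loss for a general label $y \in \{-1, +1\}$; the slide's plot shows only the $y = +1$ case:

```python
import numpy as np

def hinge_loss(y, score):
    # Zero when the example is beyond the margin (y * score >= 1);
    # grows linearly with distance on the wrong side of the margin.
    return np.maximum(0.0, 1.0 - y * score)

# For y = +1: loss is 0 whenever score >= 1, and 1 - score below that.
```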

  32. SVM Training Problem, Version 2: Soft margin. The tradeoff parameter C controls model complexity. Smaller C: simpler model; encourages a large margin even if we make lots of mistakes. Bigger C: avoids mistakes, even at the cost of a smaller margin.
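The formula is not reproduced in this extraction; the standard soft-margin problem, with slack variables $\xi_n$, is:

$$\min_{w,\, b,\, \xi} \; \frac{1}{2}\lVert w \rVert^2 + C \sum_{n=1}^{N} \xi_n \quad \text{s.t.} \quad y_n (w^T x_n + b) \ge 1 - \xi_n, \;\; \xi_n \ge 0.$$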

  33. SVM vs. LR: compare training. Both require tuning a complexity hyperparameter C to avoid overfitting.

  34. Loss functions: SVM vs. Logistic Regression
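A sketch for reproducing this comparison plot; the grid of scores and the matplotlib styling are my choices, not the slide's:

```python
import numpy as np
import matplotlib.pyplot as plt

scores = np.linspace(-3, 3, 200)           # margin score y * (w^T x + b), taking y = +1
hinge = np.maximum(0.0, 1.0 - scores)      # SVM loss
log_loss = np.log(1.0 + np.exp(-scores))   # logistic (cross entropy) loss

plt.plot(scores, hinge, label="hinge (SVM)")
plt.plot(scores, log_loss, label="log loss (logistic regression)")
plt.xlabel("margin score y * (w^T x + b)")
plt.ylabel("loss")
plt.legend()
plt.show()
```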

  35. SVMs: Prediction. Make a binary prediction via a hard threshold: $\hat{y}(x_i) = \operatorname{sign}(w^T x_i + b)$.

  36. SVMs and Kernels: Prediction. Make a binary prediction via a hard threshold on the kernelized score. Efficient training algorithms using modern quadratic programming solve the dual optimization problem of the soft-margin SVM.
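A sketch of the kernelized hard-threshold prediction, assuming dual weights alpha (nonzero only at support vectors), training labels y_train, and intercept b obtained from a trained SVM:

```python
import numpy as np

def predict_labels(K_test, alpha, y_train, b):
    # y_hat(x_i) = sign( sum_n alpha_n * y_n * k(x_n, x_i) + b )
    scores = K_test @ (alpha * y_train) + b
    return np.where(scores >= 0, +1, -1)
```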

  37. Support vectors are often a small fraction of all examples. [Figure labels: nearest positive example, nearest negative example.]

  38. Support vectors are defined by non-zero alpha values in the kernel view.
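This is easy to check in scikit-learn: after fitting, SVC exposes the support vectors and their signed dual coefficients. A minimal sketch with synthetic data of my own making:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) + 2, rng.randn(50, 2) - 2])  # two separated blobs
y = np.hstack([np.ones(50), -np.ones(50)])

svc = SVC(kernel="rbf", C=1.0).fit(X, y)
print(svc.support_vectors_.shape)  # often a small fraction of all 100 examples
print(svc.dual_coef_)              # nonzero alpha_n * y_n, one per support vector
```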

  39. SVM + Squared Exponential Kernel

  40. SVM vs. Logistic Regression: • Loss: hinge vs. cross entropy (log loss). • Sensitive to outliers: less vs. more. • Probabilistic? No vs. yes. • Kernelizable? Yes, with speed benefits from sparsity, vs. yes. • Multi-class? Only via a separate one-vs-all classifier for each class, vs. easy (use softmax).

  41. Activity • KernelRegressionDemo.ipynb • Scroll down to SVM vs. logistic regression • Can you visualize behavior with different C? • Try different kernels? • Examine the alpha vector?
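The notebook itself is not shown here; a minimal sketch of the suggested C/kernel sweep, using toy data as a stand-in for whatever the notebook loads:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)             # toy stand-in for the notebook's data
X = np.vstack([rng.randn(50, 2) + 1, rng.randn(50, 2) - 1])
y = np.hstack([np.ones(50), -np.ones(50)])

for C in [0.01, 0.1, 1.0, 10.0, 100.0]:
    for kern in ["linear", "rbf"]:
        svc = SVC(kernel=kern, C=C).fit(X, y)
        print(f"{kern:6s} C={C:6.2f}  support vectors: {len(svc.support_)}")
```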
