
Support Vector Machines (Machine Learning)



  1. Support Vector Machines (Machine Learning)

  2. Big picture: Linear models

  3. Big picture: Linear models. How good is a learning algorithm?

  4. Big picture: Linear models. Perceptron and Winnow (online learning). How good is a learning algorithm?

  5. Big picture: Linear models. Perceptron and Winnow (online learning); PAC and agnostic learning. How good is a learning algorithm?

  6. Big picture: Linear models. Perceptron and Winnow (online learning); Support Vector Machines (PAC and agnostic learning). How good is a learning algorithm?

  7. Big picture: Linear models. Perceptron and Winnow (online learning); Support Vector Machines (PAC and agnostic learning); and more. How good is a learning algorithm?

  8. This lecture: Support vector machines • Training by maximizing margin • The SVM objective • Solving the SVM optimization problem • Support vectors, duals and kernels

  9. This lecture: Support vector machines • Training by maximizing margin • The SVM objective • Solving the SVM optimization problem • Support vectors, duals and kernels

  10. VC dimensions and linear classifiers. What we know so far:
      1. If we have n examples, then with probability 1 − δ, the true error of a hypothesis h with training error err_S(h) is bounded by
         $$\mathrm{err}_D(h) \;\le\; \mathrm{err}_S(h) \;+\; \sqrt{\frac{VC(H)\left(\ln\frac{2n}{VC(H)} + 1\right) + \ln\frac{4}{\delta}}{n}}$$
         The left-hand side is the generalization error and err_S(h) is the training error; the square-root term is a function of the VC dimension, and a low VC dimension gives a tighter bound.

  11. VC dimensions and linear classifiers. (Same bound as above.) The square-root term is a function of the VC dimension: a low VC dimension gives a tighter bound.

  12. VC dimensions and linear classifiers. What we know so far:
      1. The generalization bound above: with probability 1 − δ, err_D(h) ≤ err_S(h) plus a term that grows with VC(H).
      2. The VC dimension of a linear classifier in d dimensions is d + 1.

  13. VC dimensions and linear classifiers. What we know so far: the generalization bound above, and the fact that the VC dimension of a linear classifier in d dimensions is d + 1. But are all linear classifiers the same?
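
A quick numerical sketch of the bound from the slides above; the helper name vc_bound and the example numbers are illustrative, not from the lecture:

```python
import math

def vc_bound(train_error, n, vc_dim, delta=0.05):
    """Training error plus the VC complexity term from the bound above."""
    complexity = math.sqrt(
        (vc_dim * (math.log(2 * n / vc_dim) + 1) + math.log(4 / delta)) / n
    )
    return train_error + complexity

# A linear classifier in d = 20 dimensions has VC dimension d + 1 = 21.
# The bound tightens as the number of examples n grows.
for n in (1_000, 10_000, 100_000):
    print(n, round(vc_bound(train_error=0.05, n=n, vc_dim=21), 3))
```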

  14. Recall: Margin. The margin of a hyperplane for a dataset is the distance between the hyperplane and the data point nearest to it. (Figure: positive and negative points on either side of a separating hyperplane.)

  15. Recall: Margin. The margin of a hyperplane for a dataset is the distance between the hyperplane and the data point nearest to it. (Figure: the same plot, with the margin with respect to this hyperplane marked.)
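
A minimal sketch of this margin computation, assuming the hyperplane is given by a weight vector w and bias b; the toy points are made up for illustration:

```python
import numpy as np

def margin(w, b, X):
    """Distance from the hyperplane w.x + b = 0 to the nearest point in X."""
    distances = np.abs(X @ w + b) / np.linalg.norm(w)
    return distances.min()

# Toy data around the line x1 + x2 - 1 = 0; the closest point sets the margin.
X = np.array([[2.0, 2.0], [1.5, 0.0], [-1.0, -1.0], [0.0, -0.5]])
print(margin(np.array([1.0, 1.0]), -1.0, X))  # ~0.354
```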

  16. Which line is a better choice? Why? (Figure: two hyperplanes, h1 and h2, that both separate the same data.)

  17. Which line is a better choice? Why? (Figure: the same two hyperplanes, h1 and h2.) A new example that is not from the training set might be misclassified if the margin is smaller.

  18. Data-dependent VC dimension
      • Intuitively, larger margins are better.
      • Suppose we only consider linear separators with margins γ1 and γ2:
        – H1 = linear separators that have a margin γ1
        – H2 = linear separators that have a margin γ2
        – and γ1 > γ2
      • The entire set of functions H1 is "better".

  19. Data-dependent VC dimension
      Theorem (Vapnik):
      – Let H be the set of linear classifiers that separate the training set by a margin of at least γ.
      – Then VC(H) ≤ min(R²/γ², d) + 1.
      – R is the radius of the smallest sphere containing the data.

  20. Data-dependent VC dimension
      Theorem (Vapnik):
      – Let H be the set of linear classifiers that separate the training set by a margin of at least γ.
      – Then VC(H) ≤ min(R²/γ², d) + 1, where R is the radius of the smallest sphere containing the data.
      Larger margin ⇒ lower VC dimension. Lower VC dimension ⇒ better generalization bound.
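
One way to read the theorem numerically; in this sketch R is approximated by the largest distance from the data centroid (the exact smallest enclosing sphere would need a dedicated solver), and the margin value is made up:

```python
import numpy as np

def margin_vc_bound(X, gamma, d):
    """Vapnik's data-dependent bound: VC(H) <= min(R^2 / gamma^2, d) + 1."""
    # R approximated as the largest distance from the centroid, not the exact
    # radius of the smallest sphere containing the data.
    R = np.linalg.norm(X - X.mean(axis=0), axis=1).max()
    return min(R**2 / gamma**2, d) + 1

X = np.random.default_rng(0).normal(size=(200, 50))
print(margin_vc_bound(X, gamma=2.0, d=50))  # a large margin pulls the bound below d + 1
```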

  21. Learning strategy: Find the linear separator that maximizes the margin.

  22. This lecture: Support vector machines • Training by maximizing margin • The SVM objective • Solving the SVM optimization problem • Support vectors, duals and kernels

  23. Support Vector Machines: so far
      • Lower VC dimension → better generalization.
      • Vapnik: for linear separators, the VC dimension depends inversely on the margin. That is, larger margin → better generalization.
      • For the separable case: among all linear classifiers that separate the data, find the one that maximizes the margin. Maximize the margin by minimizing 𝐰ᵀ𝐰 subject to y_i 𝐰ᵀ𝐱_i ≥ 1 for all examples i.
      • General case: introduce slack variables, one ξ_i for each example. Slack variables allow the margin constraint above to be violated.
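
A hedged sketch of both cases using scikit-learn's linear SVM; the blob data and the C values are illustrative, with a very large C approximating the hard-margin separable case and a small C allowing slack:

```python
import numpy as np
from sklearn.svm import SVC

# Two roughly separable Gaussian blobs as toy data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=-2.0, size=(50, 2)), rng.normal(loc=2.0, size=(50, 2))])
y = np.array([-1] * 50 + [1] * 50)

hard = SVC(kernel="linear", C=1e6).fit(X, y)   # almost no room for slack
soft = SVC(kernel="linear", C=0.1).fit(X, y)   # slack variables xi_i may be nonzero

for name, clf in [("hard", hard), ("soft", soft)]:
    w = clf.coef_[0]
    print(name, "margin:", 1 / np.linalg.norm(w), "support vectors:", len(clf.support_))
```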

  24. Recall: The geometry of a linear classifier
      Prediction = sgn(b + w1 x1 + w2 x2); the hyperplane is b + w1 x1 + w2 x2 = 0.
      The distance of a correctly classified point from the hyperplane is
      $$\frac{|w_1 x_1 + w_2 x_2 + b|}{\|\mathbf{w}\|} = \frac{y\,(w_1 x_1 + w_2 x_2 + b)}{\|\mathbf{w}\|}, \qquad \|\mathbf{w}\| = \sqrt{w_1^2 + w_2^2}$$
      (Figure: positive and negative points on either side of the hyperplane.)

  25. Recall: The geometry of a linear classifier (same as above), with the note: for the prediction we only care about the sign of b + w1 x1 + w2 x2, not the magnitude.

  26. Recall: The geometry of a linear classifier. We only care about the sign, not the magnitude: the hyperplanes
      b/2 + (w1/2) x1 + (w2/2) x2 = 0 and 1000b + 1000 w1 x1 + 1000 w2 x2 = 0
      are all equivalent to b + w1 x1 + w2 x2 = 0. We could multiply or divide the coefficients by any positive number and the sign of the prediction will not change.
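
A two-line check of this equivalence, with made-up numbers:

```python
import numpy as np

w, b = np.array([1.5, -2.0]), 0.5
X = np.random.default_rng(1).normal(size=(5, 2))

for scale in (0.5, 1.0, 1000.0):
    # Positive rescaling of (w, b) never changes the sign, i.e. the prediction.
    print(scale, np.sign(X @ (scale * w) + scale * b))
```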

  27. Maximizing margin
      • Margin of a hyperplane = distance of the closest point from the hyperplane:
        $$\gamma_{\mathbf{w},b} = \min_i \frac{y_i\,(\mathbf{w}^T \mathbf{x}_i + b)}{\|\mathbf{w}\|}$$
      Some people call this the geometric margin; the numerator alone is called the functional margin.

  28. Maximizing margin
      • Margin of a hyperplane = distance of the closest point from the hyperplane (the geometric margin γ_{𝐰,b} above); the numerator alone is called the functional margin.
      • We want to maximize this margin: max_{𝐰,b} γ_{𝐰,b}.
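
A short sketch contrasting the two quantities on made-up data: the functional margin scales with (w, b), while the geometric margin does not:

```python
import numpy as np

def functional_margin(w, b, X, y):
    """min_i y_i (w.x_i + b): changes when (w, b) is rescaled."""
    return np.min(y * (X @ w + b))

def geometric_margin(w, b, X, y):
    """min_i y_i (w.x_i + b) / ||w||: invariant to positive rescaling of (w, b)."""
    return functional_margin(w, b, X, y) / np.linalg.norm(w)

X = np.array([[2.0, 2.0], [1.5, 0.0], [-1.0, -1.0], [0.0, -0.5]])
y = np.array([1, 1, -1, -1])
w, b = np.array([1.0, 1.0]), -1.0

print(geometric_margin(w, b, X, y), geometric_margin(10 * w, 10 * b, X, y))    # equal
print(functional_margin(w, b, X, y), functional_margin(10 * w, 10 * b, X, y))  # differs by 10x
```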

  29. Recall: The geometry of a linear classifier. Prediction = sgn(b + w1 x1 + w2 x2); the hyperplane is b + w1 x1 + w2 x2 = 0, and the distance of a point from it is |w1 x1 + w2 x2 + b| / ‖𝐰‖. We only care about the sign, not the magnitude.

  30. Towards maximizing the margin
      The hyperplane b + w1 x1 + w2 x2 = 0 can be rescaled: dividing 𝐰 and b by any positive constant c leaves the distance unchanged,
      $$\frac{\left|\frac{w_1}{c} x_1 + \frac{w_2}{c} x_2 + \frac{b}{c}\right|}{\sqrt{\left(\frac{w_1}{c}\right)^2 + \left(\frac{w_2}{c}\right)^2}} = \frac{|w_1 x_1 + w_2 x_2 + b|}{\|\mathbf{w}\|}$$
      We only care about the sign, not the magnitude, so we can scale the weights to make the optimization easier.

  31. Towards maximizing the margin
      Same picture as the previous slide. Key observation: we can choose the scaling c so that the numerator is 1 for the points that define the margin.
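
A small sketch of that rescaling, reusing the made-up data from the earlier margin example; after it, the margin-defining points have functional margin exactly 1, so (as a standard consequence) the geometric margin is simply 1 / ||w||:

```python
import numpy as np

X = np.array([[2.0, 2.0], [1.5, 0.0], [-1.0, -1.0], [0.0, -0.5]])
y = np.array([1, 1, -1, -1])
w, b = np.array([1.0, 1.0]), -1.0

c = np.min(y * (X @ w + b))     # current functional margin of the closest points
w_hat, b_hat = w / c, b / c     # rescale so the closest points satisfy y_i (w.x_i + b) = 1

print(np.min(y * (X @ w_hat + b_hat)))  # 1.0 by construction
print(1 / np.linalg.norm(w_hat))        # the geometric margin, now 1 / ||w||
```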
