Support Vector Machines

Machine Learning
Big picture

[Diagram, built up over several slides: linear models (Perceptron, Winnow, Support Vector Machines, ...) studied under different learning settings (online learning, PAC/agnostic learning, ...), all connected to the question "How good is a learning algorithm?"]
This lecture: Support vector machines
• Training by maximizing margin
• The SVM objective
• Solving the SVM optimization problem
• Support vectors, duals and kernels
VC dimensions and linear classifiers

What we know so far:

1. If we have n examples, then with probability 1 − ε, the true error of a hypothesis h with training error err_S(h) is bounded by

$$\text{err}_D(h) \;\le\; \text{err}_S(h) \;+\; \sqrt{\frac{VC(H)\left(\ln\frac{2n}{VC(H)} + 1\right) + \ln\frac{4}{\epsilon}}{n}}$$

The left-hand side is the generalization error, the first term on the right is the training error, and the square-root term is a function of the VC dimension: a low VC dimension gives a tighter bound.

2. The VC dimension of a linear classifier in d dimensions is d + 1.

But are all linear classifiers the same?
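To make the bound in point 1 concrete, here is a minimal sketch (not from the slides; the function name and the example values of n, d, and ε are illustrative) that evaluates the square-root term for a linear classifier, using VC(H) = d + 1:

```python
import math

def vc_bound_gap(n, vc_dim, eps):
    """Square-root term of the bound: sqrt((VC(H)(ln(2n/VC(H)) + 1) + ln(4/eps)) / n)."""
    return math.sqrt((vc_dim * (math.log(2 * n / vc_dim) + 1) + math.log(4 / eps)) / n)

# VC dimension of a linear classifier in d dimensions is d + 1
n, d, eps = 10000, 20, 0.05
gap = vc_bound_gap(n, vc_dim=d + 1, eps=eps)
print(f"With probability {1 - eps:.2f}: true error <= training error + {gap:.3f}")
```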
Recall: Margin

The margin of a hyperplane for a dataset is the distance between the hyperplane and the data point nearest to it.

[Figure: positive and negative points separated by a hyperplane; the margin is measured with respect to this hyperplane, from the hyperplane to the closest point.]
Which line is a better choice? Why?

[Figure: the same data separated by two different hyperplanes, h1 and h2, with different margins.]

A new example that is not from the training set might be misclassified if the margin is smaller.
Data dependent VC dimension

• Intuitively, larger margins are better
• Suppose we only consider linear separators with margins γ1 and γ2
  – H1 = linear separators that have a margin γ1
  – H2 = linear separators that have a margin γ2
  – And γ1 > γ2
• The entire set of functions H1 is “better”
Data dependent VC dimension

Theorem (Vapnik):
– Let H be the set of linear classifiers that separate the training set by a margin of at least γ
– Then $VC(H) \le \min\left(\frac{R^2}{\gamma^2},\, d\right) + 1$
– R is the radius of the smallest sphere containing the data

Larger margin ⇒ lower VC dimension
Lower VC dimension ⇒ better generalization bound
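As a rough numerical illustration of this bound (a sketch under assumptions, not from the slides): the code below approximates R by the largest distance from the data centroid, which upper-bounds the radius of the smallest enclosing sphere, so the computed quantity is still a valid upper bound; the margin γ and the data are made up.

```python
import numpy as np

def data_dependent_vc_bound(X, gamma):
    """Upper bound on the VC dimension of gamma-margin separators: min(R^2/gamma^2, d) + 1.

    R is approximated by the largest distance from the centroid, which is an
    upper bound on the radius of the smallest enclosing sphere.
    """
    d = X.shape[1]
    center = X.mean(axis=0)
    R = np.linalg.norm(X - center, axis=1).max()
    return min(R**2 / gamma**2, d) + 1

# Example: 200 points on the unit sphere in 50 dimensions
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
X /= np.linalg.norm(X, axis=1, keepdims=True)
print(data_dependent_vc_bound(X, gamma=0.25))  # roughly 17, much smaller than d + 1 = 51
```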
Learning strategy

Find the linear separator that maximizes the margin
This lecture: Support vector machines
• Training by maximizing margin
• The SVM objective
• Solving the SVM optimization problem
• Support vectors, duals and kernels
Support Vector Machines

So far:
• Lower VC dimension → better generalization
• Vapnik: For linear separators, the VC dimension depends inversely on the margin
  – That is, larger margin → better generalization
• For the separable case:
  – Among all linear classifiers that separate the data, find the one that maximizes the margin
  – Maximize the margin by minimizing $\mathbf{w}^T\mathbf{w}$ subject to $y_i\,\mathbf{w}^T\mathbf{x}_i \ge 1$ for all examples
• General case:
  – Introduce slack variables, one ξi for each example (see the sketch after this slide)
  – Slack variables allow the margin constraint above to be violated
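A minimal sketch of the resulting soft-margin objective, assuming the usual formulation with a trade-off parameter C and an explicit bias b (both are my additions here; the function name and toy data are illustrative). Each slack ξ_i is the amount by which example i violates the margin constraint.

```python
import numpy as np

def svm_objective(w, b, X, y, C=1.0):
    """Soft-margin SVM objective: 0.5 * w^T w + C * sum of slacks.

    The slack xi_i = max(0, 1 - y_i (w^T x_i + b)) measures how much
    example i violates the margin constraint y_i (w^T x_i + b) >= 1.
    """
    slacks = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return 0.5 * w @ w + C * slacks.sum()

# A separable toy example: all slacks are zero, so the objective is just 0.5 * ||w||^2
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1, 1, -1, -1])
w, b = np.array([0.5, 0.5]), 0.0
print(svm_objective(w, b, X, y))  # 0.25: every margin is >= 1, so no slack
```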
Recall: The geometry of a linear classifier

Prediction = $\mathrm{sgn}(b + w_1x_1 + w_2x_2)$; the decision boundary is the hyperplane $b + w_1x_1 + w_2x_2 = 0$.

[Figure: positive and negative points in two dimensions separated by this hyperplane.]

The distance of a point with label y from the hyperplane is

$$\frac{|w_1x_1 + w_2x_2 + b|}{\sqrt{w_1^2 + w_2^2}} = \frac{y\,(w_1x_1 + w_2x_2 + b)}{\|\mathbf{w}\|}$$

(the second equality holds when the point is correctly classified).

We only care about the sign of the prediction, not the magnitude: $2b + 2w_1x_1 + 2w_2x_2 = 0$ and $1000b + 1000w_1x_1 + 1000w_2x_2 = 0$ describe the same boundary. We could multiply or divide the coefficients by any positive number and the sign of the prediction will not change.
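A quick numeric check of this scale invariance (the weights and points below are made up): multiplying w and b by any positive constant changes the magnitude of wᵀx + b but never its sign.

```python
import numpy as np

w, b = np.array([1.0, 2.0]), -0.5
X = np.array([[3.0, -1.0], [-2.0, 0.5], [0.3, 0.2]])

preds = np.sign(X @ w + b)
for c in (2.0, 1000.0, 0.001):
    scaled_preds = np.sign(X @ (c * w) + c * b)
    assert np.array_equal(preds, scaled_preds)  # same sign, different magnitude
print(preds)  # [ 1. -1.  1.]
```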
Maximizing margin

• Margin of a hyperplane = distance of the closest point from the hyperplane:

$$\gamma_{\mathbf{w},b} = \min_i \frac{y_i(\mathbf{w}^T\mathbf{x}_i + b)}{\|\mathbf{w}\|}$$

• We want to maximize this margin: $\max_{\mathbf{w},b}\, \gamma_{\mathbf{w},b}$
• Sometimes $\gamma_{\mathbf{w},b}$ is called the geometric margin; the numerator alone, $y_i(\mathbf{w}^T\mathbf{x}_i + b)$, is called the functional margin.
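A short sketch of both quantities for a fixed hyperplane (the function name and toy data are illustrative): the functional margin of each point is y_i(wᵀx_i + b), and the geometric margin of the hyperplane is the smallest functional margin divided by ||w||.

```python
import numpy as np

def margins(w, b, X, y):
    """Return the functional margins y_i (w^T x_i + b) and the geometric margin
    of the hyperplane, i.e. the minimum functional margin divided by ||w||."""
    functional = y * (X @ w + b)
    geometric = functional.min() / np.linalg.norm(w)
    return functional, geometric

X = np.array([[2.0, 1.0], [4.0, 3.0], [-1.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = np.array([1.0, 1.0]), 0.0

functional, geometric = margins(w, b, X, y)
print(functional)  # [3. 7. 3. 4.]
print(geometric)   # 3 / sqrt(2) ~= 2.121: distance of the closest point
```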
Towards maximizing the margin

Recall that the distance of a point from the hyperplane $b + w_1x_1 + w_2x_2 = 0$ is $\frac{|w_1x_1 + w_2x_2 + b|}{\sqrt{w_1^2 + w_2^2}}$, and that we only care about the sign of the prediction, not the magnitude.

[Figure: the separating hyperplane with the distance of a point to it marked.]

We can scale the weights to make the optimization easier: dividing both $\mathbf{w}$ and $b$ by any constant $c > 0$ leaves the distance (and the classifier) unchanged:

$$\frac{\left|\frac{w_1}{c}x_1 + \frac{w_2}{c}x_2 + \frac{b}{c}\right|}{\sqrt{\left(\frac{w_1}{c}\right)^2 + \left(\frac{w_2}{c}\right)^2}} = \frac{|w_1x_1 + w_2x_2 + b|}{\sqrt{w_1^2 + w_2^2}}$$

Key observation: We can scale $\mathbf{w}$ and $b$ so that the numerator is 1 for the points that define the margin.
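To see the rescaling concretely (reusing the same illustrative toy data as in the previous sketch): dividing w and b by the smallest functional margin makes the closest points satisfy y_i(wᵀx_i + b) = 1, and the geometric margin becomes 1/||w||.

```python
import numpy as np

X = np.array([[2.0, 1.0], [4.0, 3.0], [-1.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = np.array([1.0, 1.0]), 0.0

scale = (y * (X @ w + b)).min()      # smallest functional margin (3 here)
w_hat, b_hat = w / scale, b / scale  # same hyperplane, rescaled coefficients

print((y * (X @ w_hat + b_hat)).min())  # 1.0: margin-defining points now sit at 1
print(1 / np.linalg.norm(w_hat))        # geometric margin = 1/||w_hat|| ~= 2.121
```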