Statistics and Learning
Support Vector Machines

Sébastien Gadat
Toulouse School of Economics
February 2017
Linearly separable data

Intuition: how would you separate the white points from the black points?
Separation hyperplane

[Figure: two linearly separable classes, a separating hyperplane with normal vector $\beta$, and margins $M_+$ and $M_-$.]

Any separating hyperplane can be written as a pair $(\beta, \beta_0)$ such that

$$\forall i = 1..N, \quad \beta^T x_i + \beta_0 \ge 0 \ \text{ if } y_i = +1$$
$$\forall i = 1..N, \quad \beta^T x_i + \beta_0 \le 0 \ \text{ if } y_i = -1$$

This can be written compactly as

$$\forall i = 1..N, \quad y_i\left(\beta^T x_i + \beta_0\right) \ge 0.$$
But... when $\|\beta\| = 1$, $y_i\left(\beta^T x_i + \beta_0\right)$ is the signed distance between point $x_i$ and the hyperplane $(\beta, \beta_0)$.

Margin of a separating hyperplane: $\displaystyle \min_i \; y_i\left(\beta^T x_i + \beta_0\right)$?
Optimal separating hyperplane: maximize the margin between the hyperplane and the data.

$$\max_{\beta, \beta_0} \; M \quad \text{such that} \quad \forall i = 1..N, \; y_i\left(\beta^T x_i + \beta_0\right) \ge M \ \text{ and } \ \|\beta\| = 1$$
Let's get rid of the constraint $\|\beta\| = 1$:

$$\forall i = 1..N, \quad \frac{1}{\|\beta\|}\, y_i\left(\beta^T x_i + \beta_0\right) \ge M
\;\Rightarrow\;
\forall i = 1..N, \quad y_i\left(\beta^T x_i + \beta_0\right) \ge M \|\beta\|$$
$$\forall i = 1..N, \quad y_i\left(\beta^T x_i + \beta_0\right) \ge M \|\beta\|$$

If $(\beta, \beta_0)$ satisfies this constraint, then for all $\alpha > 0$, $(\alpha\beta, \alpha\beta_0)$ does too.

Let's choose the scaling so that $\forall i = 1..N, \; y_i\left(\beta^T x_i + \beta_0\right) \ge 1$; then we need to set $\|\beta\| = \frac{1}{M}$.
Now $M = \frac{1}{\|\beta\|}$. Geometrical interpretation? So

$$\max_{\beta, \beta_0} M \;\Leftrightarrow\; \min_{\beta, \beta_0} \|\beta\| \;\Leftrightarrow\; \min_{\beta, \beta_0} \frac{1}{2}\|\beta\|^2$$
Optimal separating hyperplane (continued)

$$\min_{\beta, \beta_0} \frac{1}{2}\|\beta\|^2 \quad \text{such that} \quad \forall i = 1..N, \; y_i\left(\beta^T x_i + \beta_0\right) \ge 1$$

This maximizes the margin $M = \frac{1}{\|\beta\|}$ between the hyperplane and the data.
Optimal separating hyperplane

$$\min_{\beta, \beta_0} \frac{1}{2}\|\beta\|^2 \quad \text{such that} \quad \forall i = 1..N, \; y_i\left(\beta^T x_i + \beta_0\right) \ge 1$$

It's a QP (quadratic programming) problem!
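To make this concrete before turning to the Lagrangian and dual derivation below, here is a minimal sketch (not part of the original slides) that hands the primal QP to the generic convex solver cvxpy on synthetic, linearly separable toy data; the data, variable names, and choice of solver are illustrative assumptions.

```python
# Hard-margin SVM primal as a QP, solved with cvxpy (illustrative sketch).
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
X_pos = rng.normal(loc=+2.0, size=(20, 2))      # class +1 toy points
X_neg = rng.normal(loc=-2.0, size=(20, 2))      # class -1 toy points
X = np.vstack([X_pos, X_neg])
y = np.hstack([np.ones(20), -np.ones(20)])

beta = cp.Variable(2)
beta0 = cp.Variable()

# min (1/2)||beta||^2  subject to  y_i (beta^T x_i + beta_0) >= 1
objective = cp.Minimize(0.5 * cp.sum_squares(beta))
constraints = [cp.multiply(y, X @ beta + beta0) >= 1]
cp.Problem(objective, constraints).solve()

print("beta  =", beta.value)
print("beta0 =", beta0.value)
print("margin M = 1/||beta|| =", 1.0 / np.linalg.norm(beta.value))
```

The recovered margin $1/\|\beta\|$ can be checked against the distance of the closest training points to the fitted hyperplane.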
The Lagrangian of the primal problem is

$$L_P(\beta, \beta_0, \alpha) = \frac{1}{2}\|\beta\|^2 - \sum_{i=1}^N \alpha_i \left[ y_i\left(\beta^T x_i + \beta_0\right) - 1 \right]$$
KKT conditions:

$$\frac{\partial L_P}{\partial \beta} = 0 \;\Rightarrow\; \beta = \sum_{i=1}^N \alpha_i y_i x_i$$
$$\frac{\partial L_P}{\partial \beta_0} = 0 \;\Rightarrow\; 0 = \sum_{i=1}^N \alpha_i y_i$$
$$\forall i = 1..N, \quad \alpha_i \left[ y_i\left(\beta^T x_i + \beta_0\right) - 1 \right] = 0$$
$$\forall i = 1..N, \quad \alpha_i \ge 0$$
From the condition $\alpha_i \left[ y_i\left(\beta^T x_i + \beta_0\right) - 1 \right] = 0$, there are two possibilities:

- $\alpha_i > 0$: then $y_i\left(\beta^T x_i + \beta_0\right) = 1$, so $x_i$ lies exactly on the margin's boundary.
- $\alpha_i = 0$: then $x_i$ can be anywhere on the boundary or further away... but it does not participate in $\beta$:

$$\beta = \sum_{i=1}^N \alpha_i y_i x_i$$

The $x_i$ for which $\alpha_i > 0$ are called support vectors.
Dual problem:

$$\max_{\alpha \in \mathbb{R}_+^N} \; L_D(\alpha) = \sum_{i=1}^N \alpha_i - \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j y_i y_j x_i^T x_j
\quad \text{such that} \quad \sum_{i=1}^N \alpha_i y_i = 0$$

Solving the dual problem is a maximization in $\mathbb{R}^N$, rather than a (constrained) minimization in $\mathbb{R}^n$. Usual algorithm: SMO (Sequential Minimal Optimization).
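SMO is the standard special-purpose solver. Purely as an illustration (not the algorithm named on the slide, and only sensible for small $N$), the dual above can also be handed to a generic constrained optimizer such as SciPy's SLSQP; the function and variable names below are mine.

```python
# Hard-margin SVM dual solved with a generic constrained optimizer (sketch).
import numpy as np
from scipy.optimize import minimize

def solve_dual(X, y):
    N = len(y)
    Z = y[:, None] * X
    G = Z @ Z.T                                     # G_ij = y_i y_j x_i^T x_j

    def neg_LD(a):                                  # maximize L_D <=> minimize -L_D
        return -(a.sum() - 0.5 * a @ G @ a)

    cons = {"type": "eq", "fun": lambda a: a @ y}   # sum_i alpha_i y_i = 0
    bounds = [(0.0, None)] * N                      # alpha_i >= 0
    res = minimize(neg_LD, np.zeros(N), method="SLSQP",
                   bounds=bounds, constraints=cons)
    return res.x                                    # the alpha_i
```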
And $\beta_0$? Solve $\alpha_i \left[ y_i\left(\beta^T x_i + \beta_0\right) - 1 \right] = 0$ for any $i$ such that $\alpha_i > 0$.
Overall:

$$\beta = \sum_{i=1}^N \alpha_i y_i x_i, \quad \text{with } \alpha_i > 0 \text{ only for the support vectors } x_i.$$

Prediction:

$$f(x) = \operatorname{sign}\left(\beta^T x + \beta_0\right) = \operatorname{sign}\left(\sum_{i=1}^N \alpha_i y_i x_i^T x + \beta_0\right)$$
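Continuing the sketches above (the toy `X`, `y` and `solve_dual` are assumptions introduced there), $\beta$, $\beta_0$ and the decision rule can be recovered from the dual solution exactly as described on this slide.

```python
# Recover (beta, beta0) and the decision rule from the dual solution (sketch).
alpha = solve_dual(X, y)
sv = alpha > 1e-6                          # support vectors: alpha_i > 0
beta = (alpha[sv] * y[sv]) @ X[sv]         # beta = sum_i alpha_i y_i x_i
beta0 = np.mean(y[sv] - X[sv] @ beta)      # from y_i (beta^T x_i + beta0) = 1, averaged over SVs

def predict(x):
    return np.sign(x @ beta + beta0)       # f(x) = sign(beta^T x + beta0)
```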
Non-linearly separable data?

Introduce slack variables $\xi = (\xi_1, \ldots, \xi_N)$:

$$y_i\left(\beta^T x_i + \beta_0\right) \ge M - \xi_i \quad \text{or} \quad y_i\left(\beta^T x_i + \beta_0\right) \ge M(1 - \xi_i),
\quad \text{with } \xi_i \ge 0 \ \text{ and } \ \sum_{i=1}^N \xi_i \le K$$
$$y_i\left(\beta^T x_i + \beta_0\right) \ge M(1 - \xi_i) \;\Rightarrow\; \text{misclassification if } \xi_i > 1$$
$$\sum_{i=1}^N \xi_i \le K \;\Rightarrow\; \text{at most } K \text{ misclassifications}$$
Optimal separating hyperplane

$$\min_{\beta, \beta_0} \|\beta\|
\quad \text{such that} \quad \forall i = 1..N, \;
y_i\left(\beta^T x_i + \beta_0\right) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad \sum_{i=1}^N \xi_i \le K$$
An equivalent formulation replaces the budget $K$ on the slacks with a penalty weight $C$:

$$\min_{\beta, \beta_0} \frac{1}{2}\|\beta\|^2 + C \sum_{i=1}^N \xi_i
\quad \text{such that} \quad \forall i = 1..N, \;
y_i\left(\beta^T x_i + \beta_0\right) \ge 1 - \xi_i, \quad \xi_i \ge 0$$
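In practice this soft-margin problem is rarely coded by hand; as an illustrative example (not from the slides, reusing the toy `X`, `y` introduced earlier), scikit-learn's SVC exposes the same trade-off constant $C$.

```python
# Soft-margin linear SVM with scikit-learn (illustrative sketch).
from sklearn.svm import SVC

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)                        # X: training points, y: labels in {-1, +1}

print(clf.support_vectors_)          # the x_i with alpha_i > 0
print(clf.dual_coef_)                # alpha_i * y_i for the support vectors
print(clf.coef_, clf.intercept_)     # beta and beta_0 (linear kernel only)
```

A large $C$ penalizes slack heavily (few margin violations, narrower margin); a small $C$ tolerates more violations in exchange for a wider margin.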
Again a QP problem. The Lagrangian is

$$L_P = \frac{1}{2}\|\beta\|^2 + C \sum_{i=1}^N \xi_i
- \sum_{i=1}^N \alpha_i \left[ y_i\left(\beta^T x_i + \beta_0\right) - (1 - \xi_i) \right]
- \sum_{i=1}^N \mu_i \xi_i$$

KKT conditions:

$$\frac{\partial L_P}{\partial \beta} = 0 \;\Rightarrow\; \beta = \sum_{i=1}^N \alpha_i y_i x_i$$
$$\frac{\partial L_P}{\partial \beta_0} = 0 \;\Rightarrow\; 0 = \sum_{i=1}^N \alpha_i y_i$$
$$\frac{\partial L_P}{\partial \xi_i} = 0 \;\Rightarrow\; \alpha_i = C - \mu_i$$
$$\forall i = 1..N, \quad \alpha_i \left[ y_i\left(\beta^T x_i + \beta_0\right) - (1 - \xi_i) \right] = 0$$
$$\forall i = 1..N, \quad \mu_i \xi_i = 0$$
$$\forall i = 1..N, \quad \alpha_i \ge 0, \; \mu_i \ge 0$$
Dual problem:

$$\max_{\alpha} \; L_D(\alpha) = \sum_{i=1}^N \alpha_i - \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j y_i y_j x_i^T x_j
\quad \text{such that} \quad \sum_{i=1}^N \alpha_i y_i = 0 \ \text{ and } \ 0 \le \alpha_i \le C$$
From

$$\alpha_i \left[ y_i\left(\beta^T x_i + \beta_0\right) - (1 - \xi_i) \right] = 0
\quad \text{and} \quad \beta = \sum_{i=1}^N \alpha_i y_i x_i,$$

again:

- $\alpha_i > 0$: then $y_i\left(\beta^T x_i + \beta_0\right) = 1 - \xi_i$, so $x_i$ is a support vector. Among these:
  - $\xi_i = 0$: then $0 < \alpha_i \le C$ and $x_i$ lies exactly on the margin's boundary;
  - $\xi_i > 0$: then $\alpha_i = C$ (because $\mu_i = 0$, which follows from $\mu_i \xi_i = 0$).
- $\alpha_i = 0$: then $x_i$ does not participate in $\beta$.
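With the scikit-learn model fitted above (an assumption carried over from the earlier sketch), the two kinds of support vectors can be told apart from the magnitude of the dual coefficients, since $|{\tt dual\_coef\_}| = \alpha_i$ for each support vector.

```python
# Separating margin support vectors (xi_i = 0) from bound ones (alpha_i = C).
import numpy as np

alphas = np.abs(clf.dual_coef_).ravel()    # alpha_i for the support vectors
on_margin = alphas < clf.C - 1e-8          # xi_i = 0: exactly on the margin boundary
at_bound = ~on_margin                      # alpha_i = C: inside the margin or misclassified
print(on_margin.sum(), "margin SVs,", at_bound.sum(), "bound SVs")
```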
Overall:

$$\beta = \sum_{i=1}^N \alpha_i y_i x_i, \quad \text{with } \alpha_i > 0 \text{ only for the support vectors } x_i.$$

Prediction:

$$f(x) = \operatorname{sign}\left(\beta^T x + \beta_0\right) = \operatorname{sign}\left(\sum_{i=1}^N \alpha_i y_i x_i^T x + \beta_0\right)$$
Non-linear SVMs?

Key remark: let

$$h : \mathcal{X} \to \mathcal{H}, \quad x \mapsto h(x)$$

be a mapping to a $p$-dimensional Euclidean space ($p \gg n$, possibly infinite). Writing $x' = h(x)$ and $x'_i = h(x_i)$, the SVM classifier in $\mathcal{H}$ is

$$f(x') = \operatorname{sign}\left(\sum_{i=1}^N \alpha_i y_i \langle x'_i, x' \rangle + \beta_0\right).$$

Suppose $K(x, y) = \langle h(x), h(y) \rangle$. Then:

$$f(x) = \operatorname{sign}\left(\sum_{i=1}^N \alpha_i y_i K(x_i, x) + \beta_0\right).$$
Kernels

The function $K(x, y) = \langle h(x), h(y) \rangle$ is called a kernel function.
Example: $\mathcal{X} = \mathbb{R}^2$, $\mathcal{H} = \mathbb{R}^3$,

$$h(x) = \left( x_1^2, \; \sqrt{2}\, x_1 x_2, \; x_2^2 \right), \qquad
K(x, y) = h(x)^T h(y) = \left( x^T y \right)^2.$$
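A quick numerical check of this example (my own illustration, not from the slides): the explicit feature map $h$ and the closed form $(x^T y)^2$ give the same value.

```python
# Verify that <h(x), h(y)> = (x^T y)^2 for the quadratic feature map.
import numpy as np

def h(x):
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])
print(h(x) @ h(y))      # 1.0
print((x @ y) ** 2)     # 1.0 -> same value, no need to build h explicitly
```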
What if we knew that $K(\cdot, \cdot)$ is a kernel, without explicitly building $h$?

The SVM would be a linear classifier in $\mathcal{H}$, but we would never have to compute $h(x)$ for training or prediction! This is called the kernel trick.
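As a hedged illustration of the trick with scikit-learn (not from the slides): with `kernel="precomputed"` the classifier only ever sees the Gram matrix of kernel values, never the features $h(x)$. The quadratic kernel and the toy `X`, `y` from the earlier sketches are assumptions.

```python
# Kernel trick in practice: train and predict from kernel values only (sketch).
import numpy as np
from sklearn.svm import SVC

K_train = (X @ X.T) ** 2                   # quadratic kernel on training data
clf_k = SVC(kernel="precomputed", C=1.0).fit(K_train, y)

X_new = np.array([[0.5, -1.0]])
K_new = (X_new @ X.T) ** 2                 # kernel between new point and training points
print(clf_k.predict(K_new))
```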
Under what conditions is $K(\cdot, \cdot)$ an acceptable kernel? Answer: if it can be written as an inner product on a (separable) Hilbert space. More precisely, we are interested in positive definite kernels:

Positive definite kernels: $K(\cdot, \cdot)$ is a positive definite kernel on $\mathcal{X}$ if

$$\forall n \in \mathbb{N}, \; \forall x \in \mathcal{X}^n, \; \forall c \in \mathbb{R}^n, \quad \sum_{i,j=1}^n c_i c_j K(x_i, x_j) \ge 0.$$
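A small sanity check (illustrative, not from the slides): for a valid kernel such as the Gaussian (RBF) kernel, any Gram matrix built from it is positive semi-definite, so its eigenvalues are non-negative up to numerical error.

```python
# Positive definiteness check: eigenvalues of an RBF Gram matrix (sketch).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dists / 2.0)                # K(x, y) = exp(-||x - y||^2 / 2)

eigvals = np.linalg.eigvalsh(K)
print(eigvals.min() >= -1e-10)             # True: the Gram matrix is PSD
```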