Kernel Methods
CE-717: Machine Learning, Sharif University of Technology
Fall 2019, Soleymani
Not linearly separable data
} Noisy data or overlapping classes (we discussed this earlier: soft margin)
} Near linearly separable
} Non-linear decision surface
} Transform to a new feature space
[Figures: data plotted on $x_1$, $x_2$ axes illustrating each case.]
Nonlinear SVM
} Assume a transformation $\phi: \mathbb{R}^d \to \mathbb{R}^m$ on the feature space, $\mathbf{x} \to \phi(\mathbf{x})$, where $\phi(\mathbf{x}) = [\phi_1(\mathbf{x}), \ldots, \phi_m(\mathbf{x})]^T$
} $\{\phi_1(\mathbf{x}), \ldots, \phi_m(\mathbf{x})\}$: a set of basis functions (or features), $\phi_i: \mathbb{R}^d \to \mathbb{R}$
} Find a hyperplane in the transformed feature space: $\mathbf{w}^T \phi(\mathbf{x}) + w_0 = 0$
[Figure: data in the $(x_1, x_2)$ space mapped to the $(\phi_1(\mathbf{x}), \phi_2(\mathbf{x}))$ space.]
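To make the transform concrete before the SVM machinery arrives, here is a minimal sketch (assuming NumPy; the toy data and the quadratic map $\phi(x) = [x, x^2]$ are invented for illustration) of 1-D data that only becomes linearly separable after mapping:

```python
import numpy as np

# Toy 1-D data: negatives sit between positives, so no threshold on x
# separates them; after phi(x) = [x, x^2] a line does.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = np.array([+1, -1, -1, -1, +1])
Phi = np.column_stack([x, x ** 2])          # mapped data in R^2
w, w0 = np.array([0.0, 1.0]), -2.0          # hyperplane x^2 - 2 = 0
print(np.sign(Phi @ w + w0))                # reproduces y exactly
```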
Soft-margin SVM in a transformed space: Primal problem
} Primal problem:
$$\min_{\mathbf{w},\, w_0,\, \boldsymbol{\xi}} \quad \frac{1}{2}\lVert\mathbf{w}\rVert^2 + C \sum_{i=1}^{N} \xi_i$$
$$\text{s.t.} \quad y^{(i)}\big(\mathbf{w}^T \phi(\mathbf{x}^{(i)}) + w_0\big) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, N$$
} $\mathbf{w} \in \mathbb{R}^m$: the weights that must be found
} If $m \gg d$ (a very high dimensional feature space), there are many more parameters to learn
Soft-margin SVM in a transformed space: Dual problem
} Optimization problem:
$$\max_{\boldsymbol{\alpha}} \quad \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j\, y^{(i)} y^{(j)}\, \phi(\mathbf{x}^{(i)})^T \phi(\mathbf{x}^{(j)})$$
$$\text{subject to} \quad \sum_{i=1}^{N} \alpha_i y^{(i)} = 0, \qquad 0 \le \alpha_i \le C, \quad i = 1, \ldots, N$$
} If we have the inner products $\phi(\mathbf{x}^{(i)})^T \phi(\mathbf{x}^{(j)})$, only $\boldsymbol{\alpha} = [\alpha_1, \ldots, \alpha_N]$ needs to be learned
} It is not necessary to learn $m$ parameters, as opposed to the primal problem
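Since the dual is a box-constrained quadratic program, a general-purpose solver suffices for small problems. A minimal sketch assuming NumPy/SciPy; `solve_svm_dual` is a name chosen here for illustration, and production implementations use specialized solvers such as SMO (e.g., LIBSVM):

```python
import numpy as np
from scipy.optimize import minimize

def solve_svm_dual(K, y, C):
    """Solve the soft-margin dual for alpha given a precomputed Gram matrix
    K[i, j] = phi(x_i)^T phi(x_j) and labels y in {-1, +1}."""
    N = len(y)
    Q = (y[:, None] * y[None, :]) * K           # Q_ij = y_i y_j K_ij

    def neg_dual(a):                            # minimize the negated dual
        return 0.5 * a @ Q @ a - a.sum()

    res = minimize(neg_dual, x0=np.zeros(N),
                   jac=lambda a: Q @ a - np.ones(N),
                   bounds=[(0.0, C)] * N,
                   constraints=[{"type": "eq", "fun": lambda a: a @ y}],
                   method="SLSQP")
    return res.x                                # alpha_i > 0: support vectors
```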
Classifying a new data point
$$\hat{y} = \operatorname{sign}\big(w_0 + \mathbf{w}^T \phi(\mathbf{x})\big)$$
where $\mathbf{w} = \sum_{\alpha_i \ne 0} \alpha_i y^{(i)} \phi(\mathbf{x}^{(i)})$ and $w_0 = y^{(s)} - \mathbf{w}^T \phi(\mathbf{x}^{(s)})$ for a support vector $\mathbf{x}^{(s)}$ on the margin.
Kernel SVM
} Learns a linear decision boundary in a high-dimensional space without explicitly working on the mapped data
} Let $\phi(\mathbf{x})^T \phi(\mathbf{x}') = K(\mathbf{x}, \mathbf{x}')$ (the kernel)
} Example: $\mathbf{x} = [x_1, x_2]^T$ and a second-order $\phi$:
$$\phi(\mathbf{x}) = [1, x_1, x_2, x_1^2, x_2^2, x_1 x_2]$$
$$K(\mathbf{x}, \mathbf{x}') = 1 + x_1 x_1' + x_2 x_2' + x_1^2 x_1'^2 + x_2^2 x_2'^2 + x_1 x_2\, x_1' x_2'$$
Kernel trick
} Compute $K(\mathbf{x}, \mathbf{x}')$ without transforming $\mathbf{x}$ and $\mathbf{x}'$
} Example: consider $K(\mathbf{x}, \mathbf{x}') = (1 + \mathbf{x}^T \mathbf{x}')^2$:
$$(1 + \mathbf{x}^T \mathbf{x}')^2 = 1 + 2 x_1 x_1' + 2 x_2 x_2' + x_1^2 x_1'^2 + x_2^2 x_2'^2 + 2 x_1 x_2\, x_1' x_2'$$
} This is an inner product in:
$$\phi(\mathbf{x}) = [1, \sqrt{2}\,x_1, \sqrt{2}\,x_2, x_1^2, x_2^2, \sqrt{2}\,x_1 x_2]$$
$$\phi(\mathbf{x}') = [1, \sqrt{2}\,x_1', \sqrt{2}\,x_2', x_1'^2, x_2'^2, \sqrt{2}\,x_1' x_2']$$
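A quick numeric check of this identity, assuming NumPy (`phi` and `k_poly2` are illustrative names):

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map from the slide (2-D input)."""
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

def k_poly2(x, xp):
    """Kernel-trick evaluation: no explicit transform needed."""
    return (1.0 + x @ xp) ** 2

x, xp = np.array([1.0, 2.0]), np.array([-0.5, 3.0])
assert np.isclose(phi(x) @ phi(xp), k_poly2(x, xp))  # identical values
```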
Polynomial kernel: Degree two
} We instead use $K(\mathbf{x}, \mathbf{x}') = (\mathbf{x}^T \mathbf{x}' + 1)^2$, which for $d$-dimensional input $\mathbf{x} = [x_1, \ldots, x_d]^T$ corresponds to the feature map:
$$\phi(\mathbf{x}) = \big[1,\; \sqrt{2}\,x_1, \ldots, \sqrt{2}\,x_d,\; x_1^2, \ldots, x_d^2,\; \sqrt{2}\,x_1 x_2, \ldots, \sqrt{2}\,x_1 x_d,\; \sqrt{2}\,x_2 x_3, \ldots, \sqrt{2}\,x_{d-1} x_d\big]$$
Polynomial kernel
} This generalizes similarly to $d$-dimensional $\mathbf{x}$ and $\phi$'s that are polynomials of order $M$:
$$K(\mathbf{x}, \mathbf{x}') = (1 + \mathbf{x}^T \mathbf{x}')^M = (1 + x_1 x_1' + x_2 x_2' + \cdots + x_d x_d')^M$$
} Example: SVM boundary for a polynomial kernel
} $w_0 + \mathbf{w}^T \phi(\mathbf{x}) = 0$
} $\Rightarrow\; w_0 + \sum_{\alpha_i \ne 0} \alpha_i y^{(i)} \phi(\mathbf{x}^{(i)})^T \phi(\mathbf{x}) = 0$
} $\Rightarrow\; w_0 + \sum_{\alpha_i \ne 0} \alpha_i y^{(i)} K(\mathbf{x}^{(i)}, \mathbf{x}) = 0$
} $\Rightarrow\; w_0 + \sum_{\alpha_i \ne 0} \alpha_i y^{(i)} \big(1 + \mathbf{x}^{(i)T} \mathbf{x}\big)^M = 0$: the boundary is a polynomial of order $M$
Why kernel?
} Kernel functions $K$ can indeed be computed efficiently, with a cost proportional to $d$ (the dimensionality of the input) instead of $m$ (the dimensionality of the feature space)
} Example: consider the second-order polynomial transform:
$$\phi(\mathbf{x}) = [1, x_1, \ldots, x_d, x_1^2, x_1 x_2, \ldots, x_d x_d]^T \qquad (m = 1 + d + d^2)$$
$$\phi(\mathbf{x})^T \phi(\mathbf{x}') = 1 + \sum_{i=1}^{d} x_i x_i' + \sum_{i=1}^{d} \sum_{j=1}^{d} x_i x_j\, x_i' x_j'$$
} Since $\sum_{i}\sum_{j} x_i x_j\, x_i' x_j' = \big(\sum_{i} x_i x_i'\big)\big(\sum_{j} x_j x_j'\big)$:
$$\phi(\mathbf{x})^T \phi(\mathbf{x}') = 1 + \mathbf{x}^T \mathbf{x}' + (\mathbf{x}^T \mathbf{x}')^2$$
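The cost gap can be checked directly. A sketch assuming NumPy: the explicit transform builds a vector of length $1 + d + d^2$, while the kernel evaluation stays $O(d)$:

```python
import numpy as np

def phi2(x):
    """Explicit second-order transform: 1, all x_i, and all products x_i * x_j.
    Output dimension is m = 1 + d + d^2."""
    return np.concatenate(([1.0], x, np.outer(x, x).ravel()))

def k2(x, xp):
    """Same inner product via the identity above, in O(d) time."""
    s = x @ xp
    return 1.0 + s + s ** 2

rng = np.random.default_rng(0)
x, xp = rng.normal(size=200), rng.normal(size=200)
assert np.isclose(phi2(x) @ phi2(xp), k2(x, xp))
```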
Gaussian or RBF kernel
} If $K(\mathbf{x}, \mathbf{x}')$ is an inner product in some transformed space of $\mathbf{x}$, it is a valid kernel
} $K(\mathbf{x}, \mathbf{x}') = \exp\left(-\dfrac{\lVert\mathbf{x} - \mathbf{x}'\rVert^2}{2\sigma^2}\right)$
} Take the one-dimensional case, with the scale chosen so that $2\sigma^2 = 1$:
$$K(x, x') = \exp\big(-(x - x')^2\big) = \exp(-x^2)\exp(-x'^2)\exp(2xx') = \exp(-x^2)\exp(-x'^2)\sum_{k=0}^{\infty}\frac{2^k x^k x'^k}{k!}$$
} The series is infinite, so the corresponding feature space is infinite dimensional
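The infinite expansion can be sanity-checked numerically by truncating the series. A sketch assuming NumPy (`n_terms` is chosen arbitrarily):

```python
import numpy as np
from math import factorial

def rbf_1d(x, xp):
    """Exact 1-D Gaussian kernel with the slide's scale (2*sigma^2 = 1)."""
    return np.exp(-(x - xp) ** 2)

def rbf_1d_truncated(x, xp, n_terms=25):
    """Approximate the kernel by truncating the infinite expansion
    exp(2xx') = sum_k (2 x x')^k / k! after n_terms terms."""
    series = sum((2.0 * x * xp) ** k / factorial(k) for k in range(n_terms))
    return np.exp(-x ** 2) * np.exp(-xp ** 2) * series

x, xp = 0.7, -0.3
print(rbf_1d(x, xp), rbf_1d_truncated(x, xp))  # values agree to many digits
```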
Some common kernel functions
} Linear: $K(\mathbf{x}, \mathbf{x}') = \mathbf{x}^T \mathbf{x}'$
} Polynomial: $K(\mathbf{x}, \mathbf{x}') = (\mathbf{x}^T \mathbf{x}' + 1)^M$
} Gaussian: $K(\mathbf{x}, \mathbf{x}') = \exp\left(-\dfrac{\lVert\mathbf{x} - \mathbf{x}'\rVert^2}{2\sigma^2}\right)$
} Sigmoid: $K(\mathbf{x}, \mathbf{x}') = \tanh(a\,\mathbf{x}^T \mathbf{x}' + b)$
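Direct implementations of these four kernels, as a sketch assuming NumPy; the hyperparameters $M$, $\sigma$, $a$, $b$ take illustrative defaults:

```python
import numpy as np

def k_linear(x, xp):
    return x @ xp

def k_poly(x, xp, M=2):
    return (x @ xp + 1.0) ** M

def k_gauss(x, xp, sigma=1.0):
    return np.exp(-np.sum((x - xp) ** 2) / (2.0 * sigma ** 2))

def k_sigmoid(x, xp, a=1.0, b=0.0):
    # Note: the sigmoid kernel is not positive semi-definite for all (a, b).
    return np.tanh(a * (x @ xp) + b)
```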
Kernel formulation of SVM
} Optimization problem:
$$\max_{\boldsymbol{\alpha}} \quad \sum_{i=1}^{N} \alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i \alpha_j\, y^{(i)} y^{(j)}\, K(\mathbf{x}^{(i)}, \mathbf{x}^{(j)})$$
$$\text{subject to} \quad \sum_{i=1}^{N}\alpha_i y^{(i)} = 0, \qquad 0 \le \alpha_i \le C, \quad i=1,\ldots,N$$
} In matrix form, the quadratic term uses:
$$\begin{bmatrix} y^{(1)}y^{(1)}K(\mathbf{x}^{(1)}, \mathbf{x}^{(1)}) & \cdots & y^{(1)}y^{(N)}K(\mathbf{x}^{(1)}, \mathbf{x}^{(N)}) \\ \vdots & \ddots & \vdots \\ y^{(N)}y^{(1)}K(\mathbf{x}^{(N)}, \mathbf{x}^{(1)}) & \cdots & y^{(N)}y^{(N)}K(\mathbf{x}^{(N)}, \mathbf{x}^{(N)}) \end{bmatrix}$$
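A sketch, assuming NumPy, of building this matrix from data and any kernel function; it is exactly the `Q` consumed by the dual-solver sketch given earlier:

```python
import numpy as np

def gram_matrix(X, y, kernel):
    """Q[i, j] = y_i * y_j * K(x_i, x_j), the matrix of the dual's quadratic term.
    X: (N, d) data, y: (N,) labels in {-1, +1}, kernel: function of two vectors."""
    N = X.shape[0]
    K = np.array([[kernel(X[i], X[j]) for j in range(N)] for i in range(N)])
    return (y[:, None] * y[None, :]) * K
```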
Classifying a new data point
$$\hat{y} = \operatorname{sign}\big(w_0 + \mathbf{w}^T \phi(\mathbf{x})\big), \quad \text{where } \mathbf{w} = \sum_{\alpha_i \ne 0} \alpha_i y^{(i)} \phi(\mathbf{x}^{(i)}) \text{ and } w_0 = y^{(s)} - \mathbf{w}^T \phi(\mathbf{x}^{(s)})$$
Substituting $\mathbf{w}$ and writing inner products as kernels:
$$\hat{y} = \operatorname{sign}\Big(w_0 + \sum_{\alpha_i \ne 0} \alpha_i y^{(i)} K(\mathbf{x}^{(i)}, \mathbf{x})\Big)$$
$$w_0 = y^{(s)} - \sum_{\alpha_i \ne 0} \alpha_i y^{(i)} K(\mathbf{x}^{(i)}, \mathbf{x}^{(s)})$$
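A sketch of kernelized prediction, assuming NumPy; `svm_predict` and `svm_bias` are illustrative names, and the bias computation assumes index `s` points at a margin support vector:

```python
import numpy as np

def svm_bias(X_sv, y_sv, alpha_sv, kernel, s=0):
    """w0 from a margin support vector x^(s) (assumes 0 < alpha_s < C)."""
    return y_sv[s] - sum(a * yi * kernel(xi, X_sv[s])
                         for a, yi, xi in zip(alpha_sv, y_sv, X_sv))

def svm_predict(x_new, X_sv, y_sv, alpha_sv, w0, kernel):
    """Kernelized prediction: only support vectors (alpha_i != 0) enter the sum."""
    score = w0 + sum(a * yi * kernel(xi, x_new)
                     for a, yi, xi in zip(alpha_sv, y_sv, X_sv))
    return np.sign(score)
```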
Gaussian kernel
} Example: SVM boundary for a Gaussian kernel
} Considers a Gaussian function around each data point:
$$w_0 + \sum_{\alpha_i \ne 0} \alpha_i y^{(i)} \exp\left(-\frac{\lVert\mathbf{x} - \mathbf{x}^{(i)}\rVert^2}{2\sigma^2}\right) = 0$$
} SVM + Gaussian kernel can classify any arbitrary training set
} Training error is zero as $\sigma \to 0$
} All samples become support vectors (likely overfitting)
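This regime is easy to reproduce. A sketch assuming scikit-learn, whose RBF parameterization uses $\gamma = 1/(2\sigma^2)$, so small $\sigma$ corresponds to large `gamma`; the random labels below are pure noise, yet a narrow kernel still memorizes them:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = rng.choice([-1, 1], size=100)  # pure noise: nothing real to learn

for gamma in [0.1, 1000.0]:
    clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)
    print(f"gamma={gamma}: train acc={clf.score(X, y):.2f}, "
          f"#SV={len(clf.support_)}")
# Expect roughly chance-level accuracy for gamma=0.1, and perfect (memorized)
# training accuracy with nearly all points as support vectors for gamma=1000.
```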
Hard-margin example
} For a narrow Gaussian (small $\sigma$), even the protection of a large margin cannot suppress overfitting.
[Figure from Y. Abu-Mostafa et al., 2012]
SVM Gaussian kernel: Example
$$f(\mathbf{x}) = w_0 + \sum_{\alpha_i \ne 0} \alpha_i y^{(i)} \exp\left(-\frac{\lVert\mathbf{x} - \mathbf{x}^{(i)}\rVert^2}{2\sigma^2}\right)$$
This example has been adapted from Zisserman's slides.
SVM Gaussian kernel: Example (continued)
[Figure sequence adapted from Zisserman's slides.]
Kernel trick: Idea
} Kernel trick: extension of many well-known algorithms to kernel-based ones
} By substituting the dot product with the kernel function: $K(\mathbf{x}, \mathbf{x}') = \phi(\mathbf{x})^T \phi(\mathbf{x}')$
} $K(\mathbf{x}, \mathbf{x}')$ gives the dot product of $\mathbf{x}$ and $\mathbf{x}'$ in the transformed space
} Idea: when the input vectors appear only in the form of dot products, we can use the kernel trick
} Solve the problem without explicitly mapping the data
} Explicit mapping is expensive if $\phi(\mathbf{x})$ is very high dimensional
Kernel trick: Idea (Cont'd)
} Instead of using a mapping $\phi: \mathcal{X} \to \mathcal{F}$ to represent $\mathbf{x} \in \mathcal{X}$ by $\phi(\mathbf{x}) \in \mathcal{F}$, a kernel function $K: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is used
} We specify only an inner-product function between points in the transformed space (not their coordinates)
} In many cases, the inner product in the embedding space can be computed efficiently
๏ฟฝ Constructing kernels } Construct kernel functions directly } Ensure that it is a valid kernel } Corresponds to an inner product in some feature space. } Example: ๐(๐, ๐ f ) = ๐ 2 ๐ f / / 2 for ๐ = / , } Corresponding mapping: ๐ ๐ = ๐ฆ , 2 ๐ฆ , ๐ฆ / , ๐ฆ / ๐ฆ , , ๐ฆ / 2 } We need a way to test whether a kernel is valid without having to construct ๐ ๐ 27
Construct valid kernels
Given valid kernels $k_1(\mathbf{x}, \mathbf{x}')$ and $k_2(\mathbf{x}, \mathbf{x}')$, the following are also valid kernels:
} $k(\mathbf{x}, \mathbf{x}') = c\,k_1(\mathbf{x}, \mathbf{x}')$, where $c > 0$
} $k(\mathbf{x}, \mathbf{x}') = f(\mathbf{x})\,k_1(\mathbf{x}, \mathbf{x}')\,f(\mathbf{x}')$, where $f(\cdot)$ is any function
} $k(\mathbf{x}, \mathbf{x}') = q\big(k_1(\mathbf{x}, \mathbf{x}')\big)$, where $q(\cdot)$ is a polynomial with coefficients $\ge 0$
} $k(\mathbf{x}, \mathbf{x}') = \exp\big(k_1(\mathbf{x}, \mathbf{x}')\big)$
} $k(\mathbf{x}, \mathbf{x}') = k_1(\mathbf{x}, \mathbf{x}') + k_2(\mathbf{x}, \mathbf{x}')$
} $k(\mathbf{x}, \mathbf{x}') = k_1(\mathbf{x}, \mathbf{x}')\,k_2(\mathbf{x}, \mathbf{x}')$
} $k(\mathbf{x}, \mathbf{x}') = k_3\big(\phi(\mathbf{x}), \phi(\mathbf{x}')\big)$, where $\phi(\mathbf{x})$ is a function from $\mathbf{x}$ to $\mathbb{R}^l$ and $k_3(\cdot, \cdot)$ is a valid kernel in $\mathbb{R}^l$
} $k(\mathbf{x}, \mathbf{x}') = \mathbf{x}^T \mathbf{A}\,\mathbf{x}'$, where $\mathbf{A}$ is a symmetric positive semi-definite matrix
} $k(\mathbf{x}, \mathbf{x}') = k_a(\mathbf{x}_a, \mathbf{x}_a') + k_b(\mathbf{x}_b, \mathbf{x}_b')$ and $k(\mathbf{x}, \mathbf{x}') = k_a(\mathbf{x}_a, \mathbf{x}_a')\,k_b(\mathbf{x}_b, \mathbf{x}_b')$, where $\mathbf{x}_a$ and $\mathbf{x}_b$ are variables (not necessarily disjoint) with $\mathbf{x} = (\mathbf{x}_a, \mathbf{x}_b)$, and $k_a$ and $k_b$ are valid kernel functions over their respective spaces
[Bishop]