  1. Kernel Methods — CE-717: Machine Learning, Sharif University of Technology, Fall 2019, Soleymani

  2. Not linearly separable data
- Noisy data or overlapping classes (we discussed this case: soft margin)
- Near linearly separable data
- Non-linear decision surface: transform to a new feature space

  3. Nonlinear SVM
- Assume a transformation $\phi: \mathbb{R}^d \to \mathbb{R}^m$ on the feature space: $\mathbf{x} \to \phi(\mathbf{x}) = [\phi_1(\mathbf{x}), \ldots, \phi_m(\mathbf{x})]$
- $\{\phi_1(\mathbf{x}), \ldots, \phi_m(\mathbf{x})\}$: set of basis functions (or features), with $\phi_i(\mathbf{x}): \mathbb{R}^d \to \mathbb{R}$
- Find a hyper-plane in the transformed feature space: $\mathbf{w}^T \phi(\mathbf{x}) + w_0 = 0$

  4. Soft-margin SVM in a transformed space: Primal problem
- Primal problem:
  $\min_{\mathbf{w}, w_0} \ \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{n=1}^{N} \xi_n$
  s.t. $y^{(n)}\left(\mathbf{w}^T \phi(\mathbf{x}^{(n)}) + w_0\right) \geq 1 - \xi_n, \quad \xi_n \geq 0, \quad n = 1, \ldots, N$
- $\mathbf{w} \in \mathbb{R}^m$: the weights that must be found
- If $m \gg d$ (a very high dimensional feature space), there are many more parameters to learn

  5. Soft-margin SVM in a transformed space: Dual problem
- Optimization problem:
  $\max_{\boldsymbol{\alpha}} \ \sum_{n=1}^{N} \alpha_n - \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} \alpha_n \alpha_m y^{(n)} y^{(m)} \phi(\mathbf{x}^{(n)})^T \phi(\mathbf{x}^{(m)})$
  subject to $\sum_{n=1}^{N} \alpha_n y^{(n)} = 0$ and $0 \leq \alpha_n \leq C, \ n = 1, \ldots, N$
- If we have the inner products $\phi(\mathbf{x}^{(i)})^T \phi(\mathbf{x}^{(j)})$, only $\boldsymbol{\alpha} = [\alpha_1, \ldots, \alpha_N]$ needs to be learnt
- It is not necessary to learn $m$ parameters, as opposed to the primal problem
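
A minimal numerical sketch of this dual (not from the slides): a general-purpose SLSQP solver stands in for the specialized QP/SMO solvers that real SVM packages use, and `kernel` is assumed to be any function computing $\phi(\mathbf{x})^T\phi(\mathbf{x}')$.

```python
import numpy as np
from scipy.optimize import minimize

def solve_dual(X, y, kernel, C=1.0):
    """Numerically solve the soft-margin dual above; returns the multipliers alpha.
    (Sketch only: real implementations use dedicated QP/SMO solvers.)"""
    N = X.shape[0]
    K = np.array([[kernel(a, b) for b in X] for a in X])    # Gram matrix
    Q = np.outer(y, y) * K                                   # Q[n, m] = y_n y_m K(x_n, x_m)

    negative_dual = lambda a: 0.5 * a @ Q @ a - a.sum()      # minimize -(dual objective)
    res = minimize(negative_dual, np.zeros(N), method="SLSQP",
                   bounds=[(0.0, C)] * N,                    # 0 <= alpha_n <= C
                   constraints=[{"type": "eq", "fun": lambda a: a @ y}])  # sum alpha_n y_n = 0
    return res.x
```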

  6. Classifying a new data point
  $\hat{y} = \operatorname{sign}\left(w_0 + \mathbf{w}^T \phi(\mathbf{x})\right)$
  where $\mathbf{w} = \sum_{\alpha_n > 0} \alpha_n y^{(n)} \phi(\mathbf{x}^{(n)})$ and $w_0 = y^{(s)} - \mathbf{w}^T \phi(\mathbf{x}^{(s)})$ for a support vector $\mathbf{x}^{(s)}$ on the margin

  7. Kernel SVM
- Learns a linear decision boundary in a high-dimensional space without explicitly working on the mapped data
- Let $\phi(\mathbf{x})^T \phi(\mathbf{x}') = K(\mathbf{x}, \mathbf{x}')$ (the kernel)
- Example: $\mathbf{x} = [x_1, x_2]$ and a second-order $\phi$:
  $\phi(\mathbf{x}) = [1, x_1, x_2, x_1^2, x_2^2, x_1 x_2]$
  $K(\mathbf{x}, \mathbf{x}') = 1 + x_1 x_1' + x_2 x_2' + x_1^2 x_1'^2 + x_2^2 x_2'^2 + x_1 x_2 x_1' x_2'$

  8. Kernel trick
- Compute $K(\mathbf{x}, \mathbf{x}')$ without transforming $\mathbf{x}$ and $\mathbf{x}'$
- Example: consider $K(\mathbf{x}, \mathbf{x}') = (1 + \mathbf{x}^T \mathbf{x}')^2$
  $= 1 + 2 x_1 x_1' + 2 x_2 x_2' + x_1^2 x_1'^2 + x_2^2 x_2'^2 + 2 x_1 x_2 x_1' x_2'$
- This is an inner product in:
  $\phi(\mathbf{x}) = [1, \sqrt{2}\, x_1, \sqrt{2}\, x_2, x_1^2, x_2^2, \sqrt{2}\, x_1 x_2]$
  $\phi(\mathbf{x}') = [1, \sqrt{2}\, x_1', \sqrt{2}\, x_2', x_1'^2, x_2'^2, \sqrt{2}\, x_1' x_2']$
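
A quick numerical check of this identity (illustrative only): the explicit degree-2 map and the kernel shortcut give the same inner product.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for a 2-D input x = (x1, x2)."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

def k_poly2(x, xp):
    """Degree-2 polynomial kernel K(x, x') = (1 + x^T x')^2."""
    return (1.0 + x @ xp) ** 2

x, xp = np.array([0.3, -1.2]), np.array([2.0, 0.5])
print(k_poly2(x, xp), phi(x) @ phi(xp))          # identical values
assert np.isclose(k_poly2(x, xp), phi(x) @ phi(xp))
```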

  9. Polynomial kernel: Degree two
- We instead use $K(\mathbf{x}, \mathbf{x}') = (\mathbf{x}^T \mathbf{x}' + 1)^2$, which for a $d$-dimensional $\mathbf{x} = [x_1, \ldots, x_d]^T$ corresponds to the feature space:
  $\phi(\mathbf{x})^T = [1, \sqrt{2}\, x_1, \ldots, \sqrt{2}\, x_d, x_1^2, \ldots, x_d^2, \sqrt{2}\, x_1 x_2, \ldots, \sqrt{2}\, x_1 x_d, \sqrt{2}\, x_2 x_3, \ldots, \sqrt{2}\, x_{d-1} x_d]$

  10. Polynomial kernel
- This can similarly be generalized to $d$-dimensional $\mathbf{x}$ and $\phi$'s that are polynomials of order $M$:
  $K(\mathbf{x}, \mathbf{x}') = (1 + \mathbf{x}^T \mathbf{x}')^M = (1 + x_1 x_1' + x_2 x_2' + \cdots + x_d x_d')^M$
- Example: SVM boundary for a polynomial kernel
  $w_0 + \mathbf{w}^T \phi(\mathbf{x}) = 0$
  $\Rightarrow w_0 + \sum_{\alpha_i > 0} \alpha_i y^{(i)} \phi(\mathbf{x}^{(i)})^T \phi(\mathbf{x}) = 0$
  $\Rightarrow w_0 + \sum_{\alpha_i > 0} \alpha_i y^{(i)} k(\mathbf{x}^{(i)}, \mathbf{x}) = 0$
  $\Rightarrow w_0 + \sum_{\alpha_i > 0} \alpha_i y^{(i)} \left(1 + \mathbf{x}^{(i)T} \mathbf{x}\right)^M = 0$, so the boundary is a polynomial of order $M$

  11. Why kernel?
- Kernel functions $K$ can indeed be computed efficiently, with a cost proportional to $d$ (the dimensionality of the input) instead of $m$.
- Example: consider the second-order polynomial transform
  $\phi(\mathbf{x})^T = [1, x_1, \ldots, x_d, x_1^2, x_1 x_2, \ldots, x_d x_d]$, so $m = 1 + d + d^2$
  $\phi(\mathbf{x})^T \phi(\mathbf{x}') = 1 + \sum_{i=1}^{d} x_i x_i' + \sum_{i=1}^{d} \sum_{j=1}^{d} x_i x_j\, x_i' x_j'$ — cost $O(m)$
  $\phi(\mathbf{x})^T \phi(\mathbf{x}') = 1 + \mathbf{x}^T \mathbf{x}' + (\mathbf{x}^T \mathbf{x}')^2$ — cost $O(d)$
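
The same point as a small numerical sketch (illustrative; `phi2` is an assumed name for the explicit second-order transform): the explicit inner product costs $O(d^2)$ operations, while the kernel shortcut costs $O(d)$.

```python
import numpy as np

d = 5
rng = np.random.default_rng(0)
x, xp = rng.normal(size=d), rng.normal(size=d)

def phi2(x):
    """Explicit second-order transform [1, x_i, x_i x_j]: m = 1 + d + d^2 entries."""
    return np.concatenate(([1.0], x, np.outer(x, x).ravel()))

lhs = phi2(x) @ phi2(xp)               # O(m) = O(d^2) work in the feature space
rhs = 1.0 + x @ xp + (x @ xp) ** 2     # O(d) work with the kernel shortcut
assert np.isclose(lhs, rhs)
```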

  12. Gaussian or RBF kernel
- If $K(\mathbf{x}, \mathbf{x}')$ is an inner product in some transformed space of $\mathbf{x}$, it is a valid kernel
- $K(\mathbf{x}, \mathbf{x}') = \exp\left(-\frac{\|\mathbf{x} - \mathbf{x}'\|^2}{\sigma^2}\right)$
- Take the one-dimensional case with $\sigma = 1$:
  $K(x, x') = \exp\left(-(x - x')^2\right) = \exp(-x^2)\, \exp(-x'^2)\, \exp(2xx')$
  $= \exp(-x^2)\, \exp(-x'^2) \sum_{k=0}^{\infty} \frac{2^k x^k x'^k}{k!}$
  so the corresponding feature space is infinite dimensional
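
A sanity check of the one-dimensional expansion (illustrative): truncating the series after a few dozen terms reproduces the kernel value.

```python
import numpy as np
from math import factorial

def k_rbf_1d(x, xp):
    """1-D Gaussian kernel with sigma = 1: exp(-(x - x')^2)."""
    return np.exp(-(x - xp) ** 2)

def k_rbf_series(x, xp, terms=30):
    """Truncated expansion exp(-x^2) exp(-x'^2) * sum_k (2 x x')^k / k!."""
    s = sum((2 * x * xp) ** k / factorial(k) for k in range(terms))
    return np.exp(-x ** 2) * np.exp(-xp ** 2) * s

x, xp = 0.8, -0.3
assert np.isclose(k_rbf_1d(x, xp), k_rbf_series(x, xp))
```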

  13. Some common kernel functions
- Linear: $k(\mathbf{x}, \mathbf{x}') = \mathbf{x}^T \mathbf{x}'$
- Polynomial: $k(\mathbf{x}, \mathbf{x}') = (\mathbf{x}^T \mathbf{x}' + 1)^M$
- Gaussian: $k(\mathbf{x}, \mathbf{x}') = \exp\left(-\frac{\|\mathbf{x} - \mathbf{x}'\|^2}{\sigma^2}\right)$
- Sigmoid: $k(\mathbf{x}, \mathbf{x}') = \tanh(a\, \mathbf{x}^T \mathbf{x}' + b)$
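
The same four kernels written out as functions (a sketch; $\sigma$, $M$, $a$, $b$ are hyperparameters chosen by the user):

```python
import numpy as np

def linear(x, xp):
    return x @ xp

def polynomial(x, xp, M=3):
    return (x @ xp + 1.0) ** M

def gaussian(x, xp, sigma=1.0):
    return np.exp(-np.sum((x - xp) ** 2) / sigma ** 2)

def sigmoid(x, xp, a=1.0, b=0.0):
    return np.tanh(a * (x @ xp) + b)
```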

  14. Kernel formulation of SVM
- Optimization problem:
  $\max_{\boldsymbol{\alpha}} \ \sum_{n=1}^{N} \alpha_n - \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} \alpha_n \alpha_m y^{(n)} y^{(m)} k(\mathbf{x}^{(n)}, \mathbf{x}^{(m)})$
  subject to $\sum_{n=1}^{N} \alpha_n y^{(n)} = 0$ and $0 \leq \alpha_n \leq C, \ n = 1, \ldots, N$
- In matrix form, the quadratic term uses
  $\mathbf{R} = \begin{bmatrix} y^{(1)} y^{(1)} K(\mathbf{x}^{(1)}, \mathbf{x}^{(1)}) & \cdots & y^{(1)} y^{(N)} K(\mathbf{x}^{(1)}, \mathbf{x}^{(N)}) \\ \vdots & \ddots & \vdots \\ y^{(N)} y^{(1)} K(\mathbf{x}^{(N)}, \mathbf{x}^{(1)}) & \cdots & y^{(N)} y^{(N)} K(\mathbf{x}^{(N)}, \mathbf{x}^{(N)}) \end{bmatrix}$
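
Building the matrix $\mathbf{R}$ above from data (a sketch; `kernel` can be any of the functions listed earlier, and `X`, `y` are hypothetical training data):

```python
import numpy as np

def dual_matrix(X, y, kernel):
    """R[n, m] = y_n y_m K(x_n, x_m), the matrix in the quadratic term of the dual."""
    N = X.shape[0]
    K = np.array([[kernel(X[n], X[m]) for m in range(N)] for n in range(N)])
    return np.outer(y, y) * K

X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
y = np.array([1.0, 1.0, -1.0])
rbf = lambda a, b: np.exp(-np.sum((a - b) ** 2))
print(dual_matrix(X, y, rbf))     # 3x3 symmetric matrix
```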

  15. Classifying a new data point
  $\hat{y} = \operatorname{sign}\left(w_0 + \mathbf{w}^T \phi(\mathbf{x})\right)$, where $\mathbf{w} = \sum_{\alpha_n > 0} \alpha_n y^{(n)} \phi(\mathbf{x}^{(n)})$ and $w_0 = y^{(s)} - \mathbf{w}^T \phi(\mathbf{x}^{(s)})$
- Expressed through the kernel:
  $\hat{y} = \operatorname{sign}\left(w_0 + \sum_{\alpha_n > 0} \alpha_n y^{(n)} k(\mathbf{x}^{(n)}, \mathbf{x})\right)$
  $w_0 = y^{(s)} - \sum_{\alpha_n > 0} \alpha_n y^{(n)} k(\mathbf{x}^{(n)}, \mathbf{x}^{(s)})$
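
The prediction rule in code (a sketch; the multipliers `alpha_sv` are assumed to come from a dual solver such as the one sketched after slide 5, and `s_idx` indexes a margin support vector):

```python
import numpy as np

def svm_bias(X_sv, y_sv, alpha_sv, kernel, s_idx=0):
    """w0 = y_s - sum_n alpha_n y_n K(x_n, x_s) for a margin support vector x_s."""
    x_s = X_sv[s_idx]
    return y_sv[s_idx] - sum(a * yn * kernel(xn, x_s)
                             for a, yn, xn in zip(alpha_sv, y_sv, X_sv))

def svm_predict(x_new, X_sv, y_sv, alpha_sv, w0, kernel):
    """sign( w0 + sum_n alpha_n y_n K(x_n, x_new) ), summed over the support vectors."""
    s = sum(a * yn * kernel(xn, x_new) for a, yn, xn in zip(alpha_sv, y_sv, X_sv))
    return np.sign(w0 + s)
```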

  16. Gaussian kernel
- Example: SVM boundary for a Gaussian kernel
- Considers a Gaussian function around each data point:
  $w_0 + \sum_{\alpha_i > 0} \alpha_i y^{(i)} \exp\left(-\frac{\|\mathbf{x} - \mathbf{x}^{(i)}\|^2}{\sigma^2}\right) = 0$
- SVM + Gaussian kernel can classify any arbitrary training set
  - Training error is zero when $\sigma \to 0$: all samples become support vectors (likely overfitting)
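
This effect is easy to reproduce with an off-the-shelf implementation (illustrative; scikit-learn's `gamma` plays the role of $1/\sigma^2$, so a very large `gamma` means a very narrow Gaussian):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = rng.choice([-1, 1], size=40)            # random labels: no real structure to learn

clf = SVC(kernel="rbf", C=1e3, gamma=1e3).fit(X, y)
print(clf.score(X, y))                      # training accuracy ~ 1.0 even on noise
print(clf.n_support_.sum(), "of", len(X), "samples are support vectors")
```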

  17. Hard-margin example
- For a narrow Gaussian (small $\sigma$), even the protection of a large margin cannot suppress overfitting. [Y. Abu-Mostafa et al., 2012]

  18. SVM Gaussian kernel: Example
  $f(\mathbf{x}) = w_0 + \sum_{\alpha_i > 0} \alpha_i y^{(i)} \exp\left(-\frac{\|\mathbf{x} - \mathbf{x}^{(i)}\|^2}{2\sigma^2}\right)$
  (This example has been adapted from Zisserman's slides.)

  19–24. SVM Gaussian kernel: Example (figure-only slides; adapted from Zisserman's slides)

  25. Kernel trick: Idea
- Kernel trick → extension of many well-known algorithms to kernel-based ones, by substituting the dot product with the kernel function
  $k(\mathbf{x}, \mathbf{x}') = \phi(\mathbf{x})^T \phi(\mathbf{x}')$
- $k(\mathbf{x}, \mathbf{x}')$ is the dot product of $\mathbf{x}$ and $\mathbf{x}'$ in the transformed space
- Idea: when the input vectors appear only in the form of dot products, we can use the kernel trick
  - Solving the problem without explicitly mapping the data
  - Explicit mapping is expensive if $\phi(\mathbf{x})$ is very high dimensional
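
As an illustration of the trick beyond SVMs, here is a minimal kernelized perceptron (not from the slides): the algorithm touches the inputs only through dot products, so each one can be replaced by $k(\cdot,\cdot)$, and the implicit weight vector $\mathbf{w} = \sum_n \alpha_n y^{(n)} \phi(\mathbf{x}^{(n)})$ is never formed.

```python
import numpy as np

def kernel_perceptron(X, y, kernel, epochs=10):
    """Perceptron in the feature space, expressed purely through the kernel."""
    N = X.shape[0]
    alpha = np.zeros(N)                                      # mistake counts
    K = np.array([[kernel(a, b) for b in X] for a in X])     # Gram matrix
    for _ in range(epochs):
        for i in range(N):
            f_i = np.sum(alpha * y * K[:, i])                # implicit w^T phi(x_i)
            if y[i] * f_i <= 0:                              # mistake -> update alpha_i
                alpha[i] += 1.0
    return alpha

# XOR-like data, not linearly separable in the input space.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([-1., 1., 1., -1.])
rbf = lambda a, b: np.exp(-np.sum((a - b) ** 2))
alpha = kernel_perceptron(X, y, rbf)
preds = [np.sign(np.sum(alpha * y * np.array([rbf(xn, x) for xn in X]))) for x in X]
print(preds)    # matches y
```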

  26. Kernel trick: Idea (cont'd)
- Instead of using a mapping $\phi: \mathcal{X} \to \mathcal{F}$ to represent $\mathbf{x} \in \mathcal{X}$ by $\phi(\mathbf{x}) \in \mathcal{F}$, a kernel function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is used
- We specify only an inner-product function between points in the transformed space (not their coordinates)
- In many cases, the inner product in the embedding space can be computed efficiently

  27. Constructing kernels
- Construct kernel functions directly
  - Ensure that the result is a valid kernel, i.e., corresponds to an inner product in some feature space
- Example: $k(\mathbf{x}, \mathbf{x}') = (\mathbf{x}^T \mathbf{x}')^2$
  - Corresponding mapping: $\phi(\mathbf{x}) = [x_1^2, \sqrt{2}\, x_1 x_2, x_2^2]^T$ for $\mathbf{x} = [x_1, x_2]^T$
- We need a way to test whether a kernel is valid without having to construct $\phi(\mathbf{x})$
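
One practical necessary check (a sketch): for any finite set of points, the Gram matrix of a valid kernel must be symmetric positive semi-definite, so negative eigenvalues reveal an invalid kernel.

```python
import numpy as np

def psd_on_sample(kernel, X, tol=1e-9):
    """Check that the Gram matrix K[i, j] = k(x_i, x_j) is PSD on the sample X."""
    K = np.array([[kernel(a, b) for b in X] for a in X])
    return np.linalg.eigvalsh((K + K.T) / 2).min() >= -tol   # symmetrize vs. rounding

X = np.random.default_rng(1).normal(size=(20, 2))
k_good = lambda a, b: (a @ b) ** 2                # valid: (x^T x')^2
k_bad  = lambda a, b: -np.sum((a - b) ** 2)       # not a valid kernel
print(psd_on_sample(k_good, X))    # True
print(psd_on_sample(k_bad, X))     # False
```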

  28. Constructing valid kernels
Given valid kernels $k_1(\mathbf{x}, \mathbf{x}')$ and $k_2(\mathbf{x}, \mathbf{x}')$, the following are also valid kernels:
- $k(\mathbf{x}, \mathbf{x}') = c\, k_1(\mathbf{x}, \mathbf{x}')$, with $c > 0$
- $k(\mathbf{x}, \mathbf{x}') = f(\mathbf{x})\, k_1(\mathbf{x}, \mathbf{x}')\, f(\mathbf{x}')$, with $f(\cdot)$ any function
- $k(\mathbf{x}, \mathbf{x}') = q\left(k_1(\mathbf{x}, \mathbf{x}')\right)$, with $q(\cdot)$ a polynomial with coefficients $\geq 0$
- $k(\mathbf{x}, \mathbf{x}') = \exp\left(k_1(\mathbf{x}, \mathbf{x}')\right)$
- $k(\mathbf{x}, \mathbf{x}') = k_1(\mathbf{x}, \mathbf{x}') + k_2(\mathbf{x}, \mathbf{x}')$
- $k(\mathbf{x}, \mathbf{x}') = k_1(\mathbf{x}, \mathbf{x}')\, k_2(\mathbf{x}, \mathbf{x}')$
- $k(\mathbf{x}, \mathbf{x}') = k_3\left(\phi(\mathbf{x}), \phi(\mathbf{x}')\right)$, with $\phi(\mathbf{x})$ a function from $\mathbf{x}$ to $\mathbb{R}^M$ and $k_3(\cdot, \cdot)$ a valid kernel in $\mathbb{R}^M$
- $k(\mathbf{x}, \mathbf{x}') = \mathbf{x}^T \mathbf{A} \mathbf{x}'$, with $\mathbf{A}$ a symmetric positive semi-definite matrix
- $k(\mathbf{x}, \mathbf{x}') = k_a(\mathbf{x}_a, \mathbf{x}_a') + k_b(\mathbf{x}_b, \mathbf{x}_b')$ and $k(\mathbf{x}, \mathbf{x}') = k_a(\mathbf{x}_a, \mathbf{x}_a')\, k_b(\mathbf{x}_b, \mathbf{x}_b')$, where $\mathbf{x}_a$ and $\mathbf{x}_b$ are variables (not necessarily disjoint) with $\mathbf{x} = (\mathbf{x}_a, \mathbf{x}_b)$, and $k_a$ and $k_b$ are valid kernel functions over their respective spaces
[Bishop]
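
A small illustration of composing kernels with these rules (illustrative; the scaling, product, and sum rules are applied to the polynomial and Gaussian kernels from slide 13, and the finite-sample PSD check from the previous snippet confirms validity on random points):

```python
import numpy as np

k_poly = lambda a, b: (a @ b + 1.0) ** 2
k_rbf  = lambda a, b: np.exp(-np.sum((a - b) ** 2))
# c*k1 + k1*k2 with c > 0: valid by the scaling, product, and sum rules.
k_new  = lambda a, b: 0.5 * k_poly(a, b) + k_poly(a, b) * k_rbf(a, b)

X = np.random.default_rng(2).normal(size=(15, 3))
K = np.array([[k_new(a, b) for b in X] for a in X])
print(np.linalg.eigvalsh((K + K.T) / 2).min() >= -1e-9)    # True: PSD on this sample
```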
