

  1. Machine Learning Kernel Methods Hamid R. Rabiee Mohammad H. Rohban Spring 2015 http://ce.sharif.edu/courses/93-94/2/ce717-1

  2. Agenda
   Motivations
   Kernel Definition
   Mercer's Theorem
   Kernel Matrix
   Kernel Construction
  Sharif University of Technology, Computer Engineering Department, Machine Learning Course

  3. Motivations
   Learning linear classifiers can be done efficiently (SVM, Perceptron, …).
   How can we generalize existing efficient linear classifiers to non-linear ones?
   It may be hard to classify data points in the original feature space.
   Use an appropriate high-dimensional non-linear map to change the feature space.

  4. Kernel Definition
   Consider data x lying in R^n.
   Use a high-dimensional mapping Φ : R^n → R^N, with N > n.
   Define the kernel function K(x, x') = Φ(x)^T Φ(x').
   That is, the kernel function is the dot product in the new feature space.
   The dot product measures the similarity of two data points, so K(x, x') shows the similarity of x and x'.
   It is efficient to use K instead of Φ when the dimensionality of Φ is high (why?).

  5. Kernel Definition
   A simple example: consider x = (x_1, x_2) in the 2-dimensional plane and Φ : R^2 → R^3 defined by
      Φ(x_1, x_2) = (z_1, z_2, z_3) = (x_1^2, √2 x_1 x_2, x_2^2)
   A linear classifier in the new space becomes (w' is a vector in the new space):
      g(x) = w'^T x' = w'_0 + w'_1 x_1^2 + √2 w'_2 x_1 x_2 + w'_3 x_2^2
   What will be the shape of the separating curve in the original space?
      w'_1 x_1^2 + √2 w'_2 x_1 x_2 + w'_3 x_2^2 + w'_0 = 0

  6. Kernel Definition
   What is the kernel function in the previous example?
      K(u, v) = Φ(u)^T Φ(v) = (u_1^2, √2 u_1 u_2, u_2^2) · (v_1^2, √2 v_1 v_2, v_2^2)
              = u_1^2 v_1^2 + 2 u_1 v_1 u_2 v_2 + u_2^2 v_2^2
              = (u_1 v_1 + u_2 v_2)^2 = (u^T v)^2
   The dot product in the new space is the square of the dot product in the original space.
   Can we construct an arbitrary conic section in the original feature space? Why? We instead use K(u, v) = (u^T v + 1)^2.
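The identity on this slide is easy to check numerically. The sketch below (my own illustration, not course code) evaluates both sides of K(u, v) = Φ(u)^T Φ(v) = (u^T v)^2 for the explicit degree-2 feature map:

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for 2-D input: (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def kernel(x, xp):
    """Kernel trick: the same inner product without constructing phi."""
    return np.dot(x, xp) ** 2

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
# Both routes give the same similarity value.
assert np.isclose(np.dot(phi(x), phi(xp)), kernel(x, xp))
```

The kernel route needs only the 2-D dot product; the explicit route builds a 3-D vector first, and the gap widens quickly for higher-degree maps.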

  7. Kernel Definition
   Some typical kernels include:
   Linear: K(u, v) = u^T v
   Polynomial: K(u, v) = (u^T v + c)^d, c ≥ 0
   Sigmoid: K(u, v) = tanh(κ u^T v + θ)
   Gaussian RBF: K(u, v) = exp(-||u - v||^2 / 2σ^2)
   Can any function K(u, v) be a valid kernel function? That is, does there exist a function Φ with K(u, v) = Φ(u)^T Φ(v)?
   If K satisfies Mercer's condition, it is a valid kernel function.
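A minimal sketch of the four typical kernels above, with default parameter values chosen only for illustration:

```python
import numpy as np

def linear(u, v):
    return np.dot(u, v)

def polynomial(u, v, c=1.0, d=2):
    return (np.dot(u, v) + c) ** d

def sigmoid(u, v, kappa=1.0, theta=0.0):
    return np.tanh(kappa * np.dot(u, v) + theta)

def gaussian_rbf(u, v, sigma=1.0):
    return np.exp(-np.linalg.norm(u - v) ** 2 / (2 * sigma ** 2))

u = np.array([1.0, 0.0])
# The RBF similarity of a point with itself is always 1 (distance zero).
assert np.isclose(gaussian_rbf(u, u), 1.0)
```

Note that unlike the other three, the sigmoid kernel is not positive semi-definite for all parameter choices, which is one motivation for the Mercer-condition question on this slide.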

  8. Mercer's Theorem
   If for any square-integrable function f(·) we have
      ∫∫_{R^n × R^n} K(x, x') f(x) f(x') dx dx' ≥ 0,
   then the function K(x, x') is a valid kernel function.
   In this case the components of the corresponding map Φ are proportional to the eigenfunctions of K(x, x'); that is,
      Φ(x) = (√λ_1 φ_1(x), √λ_2 φ_2(x), …),  where  ∫_{R^n} K(u, v) φ_i(v) dv = λ_i φ_i(u)
   In fact, Mercer's theorem checks that K(x, x') is positive semi-definite, and hence all λ_i ≥ 0.
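Mercer's condition can be probed numerically by discretizing the double integral. The sketch below (an illustration, assuming the Gaussian kernel with σ = 1 on the interval [-3, 3]) approximates ∫∫ K(x, x') f(x) f(x') dx dx' by a Riemann sum for several random choices of f:

```python
import numpy as np

# Discretize the interval and build the kernel values on the grid.
x = np.linspace(-3, 3, 200)
dx = x[1] - x[0]
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 2)

rng = np.random.default_rng(0)
for _ in range(5):
    f = rng.normal(size=x.size)            # an arbitrary (discretized) f
    # Riemann-sum approximation of the Mercer double integral.
    integral = f @ K @ f * dx * dx
    assert integral >= 0                   # Mercer's condition holds
```

The quadratic form f @ K @ f being nonnegative for every f is exactly the positive semi-definiteness that the theorem identifies.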

  9. Kernel Matrix
   Restricting the kernel function to a set of points {x_1, …, x_k}, the kernel function can be represented by a matrix:
      K = [ K(x_1, x_1)  K(x_1, x_2)  …  K(x_1, x_k) ]
          [ K(x_2, x_1)      …        …       …      ]
          [ K(x_k, x_1)      …        …  K(x_k, x_k) ]
   A matrix K is a valid kernel matrix if it is positive semi-definite, that is, all its eigenvalues are greater than or equal to zero.
   The eigenvectors multiplied by the square roots of the eigenvalues are the restrictions of the φ_i to the set {x_1, …, x_k}.
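The last two bullets can be verified directly. This sketch (my own example, using a Gaussian kernel with σ = 1 on six random points) builds the kernel matrix, checks that its eigenvalues are nonnegative, and recovers the restricted feature map from the eigendecomposition:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))           # six sample points in R^2

# Kernel matrix K_ij = K(x_i, x_j) for the Gaussian kernel.
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / 2.0)

eigvals, eigvecs = np.linalg.eigh(K)
assert np.all(eigvals > -1e-10)       # valid kernel matrix: PSD

# Eigenvectors scaled by sqrt(eigenvalue) give the restriction of the
# feature map to {x_1, ..., x_6}; their inner products reproduce K.
Phi = eigvecs * np.sqrt(np.clip(eigvals, 0, None))
assert np.allclose(Phi @ Phi.T, K)
```

This is the finite-sample analogue of the Mercer expansion on the previous slide: rows of Phi play the role of Φ(x_i).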

  10. Polynomial Kernel
   2nd-degree polynomial:
      K(u, v) = (u^T v)^2 = (u_1 v_1 + u_2 v_2)^2 = (u_1^2, √2 u_1 u_2, u_2^2) · (v_1^2, √2 v_1 v_2, v_2^2)
   Up to 2nd-degree polynomial:
      K(u, v) = (u^T v + 1)^2
              = (u_1^2, √2 u_1 u_2, u_2^2, √2 u_1, √2 u_2, 1) · (v_1^2, √2 v_1 v_2, v_2^2, √2 v_1, √2 v_2, 1)
   This can construct any 2nd-order function in the original feature space.
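The six-dimensional feature map for (u^T v + 1)^2 can be checked numerically; the ordering of the components below is one valid choice:

```python
import numpy as np

def phi(u):
    """Explicit feature map for K(u, v) = (u.v + 1)^2 on 2-D inputs."""
    u1, u2 = u
    return np.array([u1**2, np.sqrt(2) * u1 * u2, u2**2,
                     np.sqrt(2) * u1, np.sqrt(2) * u2, 1.0])

u, v = np.array([2.0, 1.0]), np.array([0.5, -1.0])
# Inner product in the 6-D space equals the kernel value in the 2-D space.
assert np.isclose(np.dot(phi(u), phi(v)), (np.dot(u, v) + 1) ** 2)
```

The linear and constant components (√2 u_1, √2 u_2, 1) are what let this kernel represent *any* conic section, not just those centered at the origin.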

  11. RBF Kernel
   An example: the input space -5 < u < 5 is mapped to a curve using only 2 dimensions of Φ.
   [Figure: the curve (φ_1(u), φ_2(u)) traced in the (φ_1, φ_2) plane as u varies.]

  12. RBF Kernel
   An example (cont.): consider the Gaussian kernel K(u, v) = exp(-(u - v)^2 / 2σ^2), where u lies in a subset of R, -5 < u < 5.
   The eigenfunctions of K are illustrated; Φ = (φ_1, …, φ_10, …).
   [Figure: the first eigenfunctions of the Gaussian kernel on (-5, 5).]

  13. RBF Kernel
   An example (cont.): consider a linear classifier in the new space.
   The corresponding classifier in the u space is clearly non-linear in the original space.
   [Figure: a linear boundary in the (φ_1, φ_2) plane separating classes C_1 and C_2, inducing a non-linear partition of the u axis.]

  14. RBF Kernel
   The RBF kernel places a Gaussian around each data point.
   A linear discriminant function cuts through the resulting surface in the embedding space.
   Therefore any arbitrary set of points can be classified by RBF kernels.
   The training error goes to zero as σ → 0.
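The σ → 0 claim can be seen in the kernel matrix itself. In this sketch (my own illustration on three distinct 1-D points), a very small bandwidth makes every point similar only to itself, so the kernel matrix approaches the identity and any labelling becomes linearly separable in the embedding space:

```python
import numpy as np

X = np.array([[0.0], [1.0], [2.5]])   # three distinct 1-D points

def rbf_matrix(X, sigma):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

# As sigma -> 0 the kernel matrix approaches the identity matrix.
K = rbf_matrix(X, sigma=0.01)
assert np.allclose(K, np.eye(3), atol=1e-6)
```

Zero training error in this limit is of course a warning sign: a kernel that memorizes the training points generalizes poorly.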

  15. Kernel Construction
   How can we build valid kernels from existing kernels?
   According to Mercer's theorem, if c > 0, k_1 and k_2 are valid kernels, and ψ is an arbitrary function, then the following are also valid kernels:
   K(u, v) = c k_1(u, v)
   K(u, v) = k_1(u, v) + k_2(u, v)
   K(u, v) = k_1(u, v) k_2(u, v)
   K(u, v) = k_1(ψ(u), ψ(v))
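The first three closure rules can be spot-checked on kernel matrices: each combination should remain positive semi-definite. A sketch with a linear and a polynomial base kernel on random points (my own example):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))

K1 = X @ X.T                    # linear kernel matrix (valid)
K2 = (X @ X.T + 1.0) ** 2       # elementwise: polynomial kernel matrix (valid)

def is_psd(K):
    return np.all(np.linalg.eigvalsh(K) > -1e-8)

# The closure rules preserve positive semi-definiteness:
assert is_psd(3.0 * K1)         # c * k1
assert is_psd(K1 + K2)          # k1 + k2
assert is_psd(K1 * K2)          # elementwise product (Schur product theorem)
```

The product rule is the least obvious of the three; it is the Schur product theorem, and it is also what makes the polynomial kernel (u^T v + c)^d valid, since it is a sum of products of valid kernels.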

  16. Kernel Construction
   Construct kernels from probabilistic generative models (class-conditional probabilities, HMMs, …) and then use the kernel in a discriminative model (such as an SVM or a linear discriminant function).
   K(x, x') = p(x) p(x') is clearly a valid kernel; it states that x and x' are similar if they both have high probability (why is it valid?).
   A better kernel can be constructed in the same way:
      K(u, v) = Σ_{i=1}^{n} p(u | c_i) p(v | c_i) p(c_i)
   That is, u and v are similar if they have high probability under the same classes.
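A minimal sketch of this generative kernel, assuming two hypothetical Gaussian class-conditionals on R with means -2 and 2 and equal priors (all of these numbers are illustrative, not from the slides):

```python
import numpy as np
from math import pi

def p_given_c(x, mean, var=1.0):
    """Gaussian class-conditional density p(x | c)."""
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * pi * var)

means, priors = [-2.0, 2.0], [0.5, 0.5]

def K(u, v):
    """K(u, v) = sum_i p(u|c_i) p(v|c_i) p(c_i)."""
    return sum(p_given_c(u, m) * p_given_c(v, m) * pr
               for m, pr in zip(means, priors))

# Points probable under the same class are more similar than
# points probable under different classes.
assert K(-2.0, -1.9) > K(-2.0, 1.9)
```

Validity follows from the closure rules on the previous slide: each term p(u|c_i) p(v|c_i) is a product kernel of the form f(u) f(v), and the sum is a nonnegative combination of valid kernels.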

  17. Kernel Construction
   State-of-the-art methods try to learn the kernel from (possibly many) training points.
   The simplest approach is multiple kernel learning.
   Consider {k_1, …, k_n} as n valid kernels. Find an appropriate kernel K(u, v) from the training data:
      K(u, v) = Σ_{i=1}^{n} c_i k_i(u, v),  c_i ≥ 0
   Minimize the training loss (MSE) over the c_i while simultaneously minimizing the trace of the kernel matrix on the training data, to avoid overfitting.
   Many variations of this algorithm have been developed.
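A sketch of the combination step only, with the weights fixed by hand rather than learned (learning the c_i by minimizing loss plus trace is the actual algorithm and is omitted here):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(4, 2))

# Two base kernel matrices on the same points.
k_lin = X @ X.T
k_rbf = np.exp(-((X[:, None] - X[None, :]) ** 2).sum(-1) / 2)

# Combine with hand-picked nonnegative weights c_i.
c = np.array([0.7, 0.3])
K = c[0] * k_lin + c[1] * k_rbf

# A nonnegative combination of valid kernels stays positive semi-definite.
assert np.all(np.linalg.eigvalsh(K) > -1e-8)
```

The c_i ≥ 0 constraint is what guarantees the learned combination is itself a valid kernel, by the scaling and sum closure rules from slide 15.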

  18. Example 1. Solution.

  19. Example 2. Solution.
