Machine Learning: Kernel Methods
Hamid R. Rabiee, Mohammad H. Rohban
Spring 2015
http://ce.sharif.edu/courses/93-94/2/ce717-1
Agenda
- Motivations
- Kernel Definition
- Mercer's Theorem
- Kernel Matrix
- Kernel Construction
Motivations
Learning linear classifiers can be done efficiently (SVM, Perceptron, ...), so we would like to generalize these efficient linear classifiers to non-linear ones. It may be hard to classify data points in the original feature space; instead, use an appropriate high-dimensional non-linear map to change the feature space.
Kernel Definition
Consider data $x$ lying in $\mathbb{R}^n$. Use a high-dimensional mapping $\Phi: \mathbb{R}^n \to \mathbb{R}^N$ with $N > n$, and define the kernel function $K(x, x') = \Phi(x)^T \Phi(x')$. That is, the kernel function is the dot product in the new feature space. Since the dot product measures the similarity of two data points, $K(x, x')$ measures the similarity of $x$ and $x'$. It is efficient to use $K$ instead of $\Phi$ when the dimensionality of $\Phi$ is high, because $K$ can often be evaluated directly in the original space without ever forming the $N$-dimensional vectors $\Phi(x)$.
Kernel Definition (cont.)
A simple example: consider $x = (x_1, x_2)$ lying in the 2-dimensional plane, and $\Phi: \mathbb{R}^2 \to \mathbb{R}^3$ defined by
$$\Phi(x_1, x_2) = (z_1, z_2, z_3) = (x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2).$$
A linear classifier in the new space becomes ($w'$ is a vector in the new space):
$$g(x) = w'^T x' + w'_0 = w'^T \Phi(x) + w'_0 = w'_1 x_1^2 + \sqrt{2}\,w'_2 x_1 x_2 + w'_3 x_2^2 + w'_0.$$
What is the shape of the separating curve in the original space? The boundary
$$w'_1 x_1^2 + \sqrt{2}\,w'_2 x_1 x_2 + w'_3 x_2^2 + w'_0 = 0$$
is a conic section in the original $(x_1, x_2)$ space.
Kernel Definition (cont.)
What is the kernel function in the previous example?
$$K(u, v) = \Phi(u)^T \Phi(v) = \begin{pmatrix} u_1^2 \\ \sqrt{2}\,u_1 u_2 \\ u_2^2 \end{pmatrix}^T \begin{pmatrix} v_1^2 \\ \sqrt{2}\,v_1 v_2 \\ v_2^2 \end{pmatrix} = u_1^2 v_1^2 + 2 u_1 v_1 u_2 v_2 + u_2^2 v_2^2 = (u_1 v_1 + u_2 v_2)^2 = (u^T v)^2.$$
The dot product in the new space is the square of the dot product in the original space. Can we construct an arbitrary conic section in the original feature space this way? Not quite: this map contains only the pure second-order terms, so the linear terms of a general conic are missing. We instead use $K(u, v) = (u^T v + 1)^2$.
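A minimal numerical check of this identity, written as a Python/NumPy sketch (the function names are my own):

```python
import numpy as np

def phi(x):
    """Explicit quadratic feature map R^2 -> R^3 from the example above."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def kernel(u, v):
    """The same quantity computed in the original space, without forming phi."""
    return float(u @ v) ** 2

rng = np.random.default_rng(0)
u, v = rng.normal(size=2), rng.normal(size=2)
# The dot product in the 3-D feature space equals the squared dot product in R^2.
assert np.isclose(phi(u) @ phi(v), kernel(u, v))
```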
Kernel Definition (cont.)
Some typical kernels include:
- Linear: $K(u, v) = u^T v$
- Polynomial: $K(u, v) = (u^T v + c)^d$, $c \ge 0$
- Sigmoid: $K(u, v) = \tanh(a\,u^T v + b)$
- Gaussian RBF: $K(u, v) = \exp\left(-\|u - v\|^2 / 2\sigma^2\right)$
Can any function $K(u, v)$ be a valid kernel function? That is, does there exist a function $\Phi$ with $K(u, v) = \Phi(u)^T \Phi(v)$? If $K$ satisfies Mercer's condition, it is a valid kernel function.
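Straightforward NumPy implementations of the kernels listed above, as a sketch (the parameter names and defaults c, d, a, b, sigma are illustrative choices, not values fixed by the lecture):

```python
import numpy as np

def linear(u, v):
    return u @ v

def polynomial(u, v, c=1.0, d=2):
    return (u @ v + c) ** d

def sigmoid(u, v, a=1.0, b=0.0):
    # Note: the sigmoid kernel is not positive semi-definite for all a, b,
    # so it is not always a valid Mercer kernel.
    return np.tanh(a * (u @ v) + b)

def gaussian_rbf(u, v, sigma=1.0):
    return np.exp(-np.sum((u - v) ** 2) / (2 * sigma ** 2))
```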
Mercer's Theorem
If for every square-integrable function $f(\cdot)$ we have
$$\int_{\mathbb{R}^n} \int_{\mathbb{R}^n} K(x, x')\, f(x)\, f(x')\, dx\, dx' \ge 0,$$
then $K(x, x')$ is a valid kernel function. In this case the components of the corresponding map $\Phi$ are proportional to the eigenfunctions of $K$, defined by
$$\int_{\mathbb{R}^n} K(u, v)\, \phi_i(v)\, dv = \lambda_i\, \phi_i(u), \qquad \Phi(x) = \left(\sqrt{\lambda_1}\,\phi_1(x),\ \sqrt{\lambda_2}\,\phi_2(x),\ \ldots\right).$$
In effect, Mercer's condition checks that $K(x, x')$ is positive semi-definite, and hence that all $\lambda_i \ge 0$.
Kernel Matrix
Restricting the kernel function to a set of points $\{x_1, \ldots, x_k\}$, the kernel function can be represented by a matrix:
$$K = \begin{pmatrix} K(x_1, x_1) & K(x_1, x_2) & \cdots & K(x_1, x_k) \\ K(x_2, x_1) & & & \vdots \\ \vdots & & \ddots & \\ K(x_k, x_1) & \cdots & & K(x_k, x_k) \end{pmatrix}.$$
A matrix $K$ is a valid kernel matrix if it is positive semi-definite, that is, if all its eigenvalues are greater than or equal to zero. The eigenvectors, multiplied by the square roots of the corresponding eigenvalues, give the restrictions of the $\phi_i$ to the set $\{x_1, \ldots, x_k\}$.
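A short sketch of this slide's claims (the kernel choice and point set are mine): build a Gram matrix, check that it is positive semi-definite, and recover the sampled feature vectors from its eigendecomposition.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))                          # 20 sample points in R^2

sq_dists = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / 2.0)                           # Gaussian RBF Gram matrix

eigvals, eigvecs = np.linalg.eigh(K)                  # symmetric eigendecomposition
assert eigvals.min() > -1e-10                         # PSD up to round-off error

# Rows of Phi are the sampled feature vectors: Phi[j, i] = sqrt(lambda_i) * v_i[j].
Phi = eigvecs * np.sqrt(np.clip(eigvals, 0.0, None))
assert np.allclose(Phi @ Phi.T, K)                    # Phi reproduces the kernel
```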
Polynomial Kernel
2nd-degree polynomial:
$$K(u, v) = (u^T v)^2 = (u_1 v_1 + u_2 v_2)^2 = \begin{pmatrix} u_1^2 \\ \sqrt{2}\,u_1 u_2 \\ u_2^2 \end{pmatrix}^T \begin{pmatrix} v_1^2 \\ \sqrt{2}\,v_1 v_2 \\ v_2^2 \end{pmatrix}.$$
Up to 2nd-degree polynomial:
$$K(u, v) = (u^T v + 1)^2 = \begin{pmatrix} u_1^2 \\ \sqrt{2}\,u_1 u_2 \\ u_2^2 \\ \sqrt{2}\,u_1 \\ \sqrt{2}\,u_2 \\ 1 \end{pmatrix}^T \begin{pmatrix} v_1^2 \\ \sqrt{2}\,v_1 v_2 \\ v_2^2 \\ \sqrt{2}\,v_1 \\ \sqrt{2}\,v_2 \\ 1 \end{pmatrix}.$$
Because the feature map now includes the linear terms and a constant, this kernel can construct any 2nd-order function in the original feature space.
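The same kind of numerical check as before, now for the six-dimensional map (a sketch, assuming NumPy):

```python
import numpy as np

def phi6(x):
    """Six-dimensional map whose dot product equals (u.v + 1)^2."""
    x1, x2 = x
    s2 = np.sqrt(2)
    return np.array([x1 ** 2, s2 * x1 * x2, x2 ** 2, s2 * x1, s2 * x2, 1.0])

rng = np.random.default_rng(1)
u, v = rng.normal(size=2), rng.normal(size=2)
assert np.isclose(phi6(u) @ phi6(v), (u @ v + 1) ** 2)
```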
RBF Kernel
An example: the input space $-5 < u < 5$ is mapped to a curve, visualized using only two dimensions $(\phi_1, \phi_2)$ of $\Phi$. (Figure: the curve $(\phi_1(u), \phi_2(u))$ traced in the $\phi_1$/$\phi_2$ plane as $u$ varies.)
RBF Kernel (cont.)
An example (cont.): consider the Gaussian kernel
$$K(u, v) = \exp\left(-\|u - v\|^2 / 2\right),$$
where $u$ lies in a subset of $\mathbb{R}$, $-5 < u < 5$. The eigenfunctions of $K$ are illustrated, with $\Phi = (\phi_1, \ldots, \phi_{10}, \ldots)$. (Figure: the first eigenfunctions $\phi_i$ of $K$ on $-5 < u < 5$.)
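One way to reproduce such eigenfunction plots numerically (my own illustration, not the lecture's code): eigendecompose the kernel matrix on a fine grid, a simple Nystrom-style discretization of the integral operator.

```python
import numpy as np

u = np.linspace(-5, 5, 500)                        # fine grid on -5 < u < 5
K = np.exp(-0.5 * (u[:, None] - u[None, :]) ** 2)  # K(u, v) = exp(-(u - v)^2 / 2)

eigvals, eigvecs = np.linalg.eigh(K)
order = np.argsort(eigvals)[::-1]                  # sort by decreasing eigenvalue
phis = eigvecs[:, order[:10]]                      # sampled phi_1, ..., phi_10
# Plotting each column of phis against u gives smooth, increasingly
# oscillatory curves, matching the eigenfunction figure on the slide.
```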
RBF Kernel (cont.)
An example (cont.): consider a linear classifier in the new space. The corresponding classifier in the $u$ space is clearly non-linear in the original space: a single linear boundary in the $(\phi_1, \phi_2)$ plane separates intervals of the $u$ axis that alternate between classes $C_1$ and $C_2$. (Figure: a linear decision boundary in the $\phi_1$/$\phi_2$ plane and the resulting labeling $C_1, C_2, C_2, C_2, C_1$ along the $u$ axis.)
RBF Kernel (cont.)
The RBF kernel places a Gaussian around each data point, and a linear discriminant function cuts through the resulting surface in the embedding space. Therefore any arbitrary set of points can be classified by RBF kernels; the training error goes to zero as $\sigma \to 0$.
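One way to see the $\sigma \to 0$ claim numerically (a sketch, with an arbitrary random point set): as $\sigma$ shrinks, the RBF Gram matrix approaches the identity, so each point is similar only to itself and a kernel machine can memorize any labeling of the training set.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 2))                      # 30 distinct random points

def rbf_gram(X, sigma):
    d2 = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * sigma ** 2))

for sigma in [1.0, 0.3, 0.05]:
    K = rbf_gram(X, sigma)
    # Largest off-diagonal similarity shrinks toward 0 as sigma -> 0,
    # i.e. K -> I, so the Gram matrix stays full-rank on distinct points.
    print(sigma, np.abs(K - np.eye(len(X))).max())
```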
Kernel Construction
How can we build valid kernels from existing kernels? By Mercer's theorem, if $c > 0$, $k_1$ and $k_2$ are valid kernels, and $\psi$ is an arbitrary function, then the following functions are also valid kernels (a numerical spot-check follows the list):
- $K(u, v) = c\,k_1(u, v)$
- $K(u, v) = k_1(u, v) + k_2(u, v)$
- $K(u, v) = k_1(u, v)\,k_2(u, v)$
- $K(u, v) = k_1(\psi(u), \psi(v))$
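A sketch that spot-checks these closure rules on random points (the base kernels and the map psi are my own choices): each combined Gram matrix should remain positive semi-definite.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(15, 2))

def gram(kernel):
    return np.array([[kernel(u, v) for v in X] for u in X])

def is_psd(K):
    return np.linalg.eigvalsh(K).min() > -1e-10

k1 = lambda u, v: (u @ v + 1) ** 2                      # polynomial kernel
k2 = lambda u, v: np.exp(-np.sum((u - v) ** 2) / 2.0)   # Gaussian RBF kernel
psi = np.tanh                                           # an arbitrary map

for K in (3.0 * gram(k1),                               # c * k1
          gram(k1) + gram(k2),                          # k1 + k2
          gram(k1) * gram(k2),                          # elementwise: k1 * k2
          gram(lambda u, v: k1(psi(u), psi(v)))):       # k1(psi(u), psi(v))
    assert is_psd(K)
```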
Kernel Construction (cont.)
We can also construct kernels from probabilistic generative models (class-conditional probabilities, HMMs, ...) and then use the kernel in a discriminative model (such as an SVM or a linear discriminant function). $K(x, x') = p(x)\,p(x')$ is clearly a valid kernel, since it is the inner product for the one-dimensional feature map $\Phi(x) = p(x)$; it states that $x$ and $x'$ are similar if they both have high probability. A better kernel can be constructed in the same way:
$$K(u, v) = \sum_{i=1}^{n} p(u \mid c_i)\, p(v \mid c_i)\, p(c_i).$$
That is, $u$ and $v$ are similar if they have high probabilities under the same classes.
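A sketch of this generative kernel with hypothetical 1-D Gaussian class-conditionals (the means, standard deviations, and priors below are made up for illustration):

```python
import numpy as np

# Hypothetical class-conditionals p(x | c_i) and priors p(c_i) for 3 classes.
means = np.array([-2.0, 1.0, 3.0])
stds = np.array([1.0, 0.5, 1.5])
priors = np.array([0.5, 0.3, 0.2])

def gauss_pdf(x, mu, sd):
    return np.exp(-(x - mu) ** 2 / (2 * sd ** 2)) / (sd * np.sqrt(2 * np.pi))

def generative_kernel(u, v):
    """K(u, v) = sum_i p(u | c_i) p(v | c_i) p(c_i)."""
    pu = gauss_pdf(u, means, stds)                # vector of p(u | c_i)
    pv = gauss_pdf(v, means, stds)
    return float(np.sum(pu * pv * priors))

# Validity: with Phi(x) = (sqrt(p(c_i)) * p(x | c_i))_i the kernel is an
# ordinary dot product, hence positive semi-definite.
print(generative_kernel(0.8, 1.1))    # high: both points likely under class 2
print(generative_kernel(-2.0, 3.0))   # low: likely under different classes
```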
Kernel Construction (cont.)
State-of-the-art methods try to learn the kernel from (possibly many) training points. The simplest is multiple kernel learning: consider $\{k_1, \ldots, k_n\}$ as $n$ valid kernels, and find an appropriate kernel
$$K(u, v) = \sum_{i=1}^{n} c_i\, k_i(u, v), \qquad c_i \ge 0,$$
from the training data. Minimize the training loss (MSE) over the $c_i$, and simultaneously minimize the trace of the kernel matrix on the training data to avoid overfitting. Many variations of this algorithm have been developed.
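A crude alternating-minimization sketch of this idea (my own simplification, not the lecture's algorithm; the base kernels, ridge term, penalty weight, and step size are all illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(40, 2))
y = np.sign(X[:, 0] * X[:, 1])                          # a nonlinear target

d2 = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
grams = [X @ X.T, (X @ X.T + 1) ** 2, np.exp(-d2 / 2)]  # fixed base Gram matrices

c = np.ones(len(grams)) / len(grams)                    # kernel weights, c_i >= 0
lam, lr = 1e-3, 1e-4                                    # trace penalty, step size
for _ in range(100):
    Kc = sum(ci * G for ci, G in zip(c, grams))
    alpha = np.linalg.solve(Kc + 1e-3 * np.eye(len(X)), y)  # ridge fit, c fixed
    resid = Kc @ alpha - y
    # Gradient of MSE + lam * trace(Kc) with respect to each weight c_i:
    grad = np.array([2 * (G @ alpha) @ resid + lam * np.trace(G) for G in grams])
    c = np.clip(c - lr * grad, 0.0, None)               # projected gradient step
```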
Example 1 (solution)
Example 2 (solution)