Kernels & Kernelization Ken Kreutz-Delgado (Nuno Vasconcelos) Winter 2012 — UCSD — ECE 174A
Inner Product Matrix & PCA
Given the centered data matrix X_c:
• 1) Construct the inner product matrix K_c = X_c^T X_c
• 2) Compute its eigendecomposition (Π², M), i.e. K_c = M Π² M^T
PCA: for the covariance matrix Σ = Φ Λ Φ^T:
• Principal components are given by Φ = X_c M Π^{-1}
• Principal values are given by Λ = (1/n) Π²
• Projection of the centered data onto the principal components is given by
  Φ^T X_c = Π^{-1} M^T X_c^T X_c = Π^{-1} M^T K_c
This allows the computation of the eigenvalues and PCA coefficients when we only have access to the dot-product (inner product) matrix K_c.
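A minimal numpy sketch of this computation, assuming a centered data matrix with one sample per column; all function and variable names here are illustrative, not from the slides:

```python
import numpy as np

def pca_from_gram(Xc):
    """PCA computed from the inner product (Gram) matrix K_c = Xc^T Xc.

    Xc: centered data matrix of shape (d, n), one sample per column.
    Returns principal components Phi (d x r), principal values lam (r,),
    and the projections Phi^T Xc of the data onto the components.
    """
    n = Xc.shape[1]
    Kc = Xc.T @ Xc                      # n x n inner product matrix
    evals, M = np.linalg.eigh(Kc)       # Kc = M diag(evals) M^T
    # sort eigenvalues in decreasing order and drop numerically-zero ones
    order = np.argsort(evals)[::-1]
    evals, M = evals[order], M[:, order]
    keep = evals > 1e-12
    evals, M = evals[keep], M[:, keep]
    pi = np.sqrt(evals)                 # evals play the role of Pi^2
    Phi = Xc @ M / pi                   # principal components Phi = Xc M Pi^-1
    lam = evals / n                     # principal values Lambda = (1/n) Pi^2
    projections = Phi.T @ Xc            # = Pi^-1 M^T Kc
    return Phi, lam, projections
```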
The Inner Product Form
This turns out to be the case for many learning algorithms. If you manipulate the expressions a little bit, you can often write them in “dot product form”.
Definition: a learning algorithm is in inner product form if, given a training data set D = {(x_1, y_1), ..., (x_n, y_n)}, it depends on the points x_i only through their inner products ⟨x_i, x_j⟩ = x_i^T x_j.
For example, let’s look at k-means.
K-means Clustering
We saw that the k-means algorithm iterates between
• 1) (re-)Classification: i*(x) = argmin_i ||x − μ_i||²
• 2) (re-)Estimation: μ_i = (1/n_i) Σ_j x_j^{(i)}
Note that:
||x − μ_i||² = (x − μ_i)^T (x − μ_i) = x^T x − 2 x^T μ_i + μ_i^T μ_i
K-means Clustering
Combining this expansion with the sample mean formula
μ_i = (1/n_i) Σ_j x_j^{(i)}
allows us to write the distance between a data sample x_k and the class center μ_i as a function of the inner products ⟨x_i, x_j⟩ = x_i^T x_j:
||x_k − μ_i||² = x_k^T x_k − (2/n_i) Σ_j x_k^T x_j^{(i)} + (1/n_i²) Σ_{j,l} (x_j^{(i)})^T x_l^{(i)}
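As a sketch of how this expression is used, the distance to a cluster center can be evaluated from a precomputed Gram matrix alone; the function below is an illustrative implementation, not code from the lecture:

```python
import numpy as np

def kernel_distance_to_center(K, k, members):
    """Squared distance ||x_k - mu_i||^2 computed only from inner products.

    K:       n x n Gram matrix with K[a, b] = <x_a, x_b>
    k:       index of the query sample x_k
    members: indices of the samples currently assigned to cluster i
    """
    n_i = len(members)
    term1 = K[k, k]                                      # x_k^T x_k
    term2 = (2.0 / n_i) * K[k, members].sum()            # (2/n_i) sum_j x_k^T x_j^(i)
    term3 = K[np.ix_(members, members)].sum() / n_i**2   # (1/n_i^2) sum_{j,l} x_j^(i)T x_l^(i)
    return term1 - term2 + term3
```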
“The Kernel Trick”
Why is this interesting? Consider the following transformation of the feature space:
• Introduce a mapping to a “better” (i.e., linearly separable) feature space Φ: X → Z where, generally, dim(Z) > dim(X).
• If a classification algorithm only depends on the data through inner products then, in the transformed space, it depends on
  ⟨Φ(x_i), Φ(x_j)⟩ = Φ(x_i)^T Φ(x_j)
[Figure: data that is not linearly separable in the original space X becomes linearly separable after the mapping Φ into Z.]
The Inner Product Implementation
In the transformed space, the learning algorithm only requires the inner products
⟨Φ(x_i), Φ(x_j)⟩ = Φ(x_j)^T Φ(x_i)
Note that we do not need to store the Φ(x_j), but only the n² (scalar) component values of the inner product matrix.
Interestingly, this holds even if Φ(x) takes its values in an infinite-dimensional space.
• We get a reduction from infinity to n²!
• There is, however, still one problem: when Φ(x_j) is infinite dimensional, the computation of the inner product ⟨Φ(x_i), Φ(x_j)⟩ looks impossible.
“The Kernel Trick”
“Instead of defining Φ(x), then computing Φ(x_i) for each i, and then computing ⟨Φ(x_i), Φ(x_j)⟩ for each pair (i,j), simply define a kernel function
K(x, z) ≜ ⟨Φ(x), Φ(z)⟩
and work with it directly.”
K(x,z) is called an inner product or dot-product kernel.
Since we only use the kernel, why bother to define Φ(x)? Just define the kernel K(x,z) directly! Then we never have to deal with the complexity of Φ(x). This is usually called “the kernel trick”.
Important Questions
How do I know that, if I pick a bivariate function K(x,z), it is actually equivalent to an inner product?
• Answer: In fact, in general it is not. (More about this later.)
If it is, how do I know what Φ(x) is?
• Answer: you may never know. E.g. the Gaussian kernel
  K(x, z) = e^{−||x − z||² / σ²} = ⟨Φ(x), Φ(z)⟩
  is a very popular choice, but it is not obvious what Φ(x) is. However, on the positive side, we do not need to know how to choose Φ(x). Choosing an admissible kernel K(x,z) is sufficient.
Why is it that using K(x,z) is easier/better?
• Answer: Complexity management. Let’s look at an example.
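A minimal sketch of working with such a kernel directly: the Gram matrix of pairwise kernel values is all an inner-product-form algorithm needs, even though the Gaussian kernel's Φ(x) is infinite dimensional. The function names and the choice of σ below are illustrative assumptions:

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    """Gaussian (RBF) kernel K(x, z) = exp(-||x - z||^2 / sigma^2)."""
    diff = x - z
    return np.exp(-np.dot(diff, diff) / sigma**2)

def gram_matrix(X, kernel):
    """Gram matrix K with K[i, j] = kernel(x_i, x_j); X has one sample per row."""
    n = X.shape[0]
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kernel(X[i], X[j])
    return K

# Usage: K replaces all inner products x_i^T x_j in the learning algorithm.
X = np.random.randn(5, 3)
K = gram_matrix(X, gaussian_kernel)
```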
Polynomial Kernels
In ℝ^d, consider the square of the inner product between two vectors:
(x^T z)² = (Σ_{i=1}^d x_i z_i)² = (Σ_{i=1}^d x_i z_i)(Σ_{j=1}^d x_j z_j) = Σ_{i=1}^d Σ_{j=1}^d x_i x_j z_i z_j
i.e. the sum of all d² terms
x_1 x_1 z_1 z_1 + x_1 x_2 z_1 z_2 + ... + x_1 x_d z_1 z_d
+ x_2 x_1 z_2 z_1 + x_2 x_2 z_2 z_2 + ... + x_2 x_d z_2 z_d
+ ...
+ x_d x_1 z_d z_1 + x_d x_2 z_d z_2 + ... + x_d x_d z_d z_d
Polynomial Kernels
This can be written as
K(x, z) = (x^T z)² = Φ(x)^T Φ(z)
with Φ: ℝ^d → ℝ^{d²},
Φ(x) = (x_1 x_1, x_1 x_2, ..., x_1 x_d, ..., x_d x_1, x_d x_2, ..., x_d x_d)^T
Hence, we have
(x^T z)² = (x_1 x_1, x_1 x_2, ..., x_1 x_d, ..., x_d x_1, x_d x_2, ..., x_d x_d) (z_1 z_1, z_1 z_2, ..., z_1 z_d, ..., z_d z_1, z_d z_2, ..., z_d z_d)^T = Φ(x)^T Φ(z)
Polynomial Kernels
The point is that:
• The computation of Φ(x)^T Φ(z) has complexity O(d²)
• The direct computation of K(x,z) = (x^T z)² has complexity O(d)
Direct evaluation is more efficient by a factor of d. As d grows, this is what keeps the implementation feasible.
BTW, you just met another kernel family:
• This implements polynomials of second order
• In general, the family of polynomial kernels is defined as
  K(x, z) = (1 + x^T z)^k,  k = 1, 2, ...
• I don’t even want to think about writing down Φ(x)!
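A quick numerical check of the equivalence above for the homogeneous second-order kernel K(x,z) = (x^T z)²; the explicit map `phi` below is only for illustration, since in practice one never forms it:

```python
import numpy as np

def phi(x):
    """Explicit second-order feature map: all d^2 products x_i x_j (O(d^2) to form)."""
    return np.outer(x, x).ravel()

def poly2_kernel(x, z):
    """Same quantity computed directly as (x^T z)^2 (O(d))."""
    return np.dot(x, z) ** 2

x = np.random.randn(6)
z = np.random.randn(6)
assert np.isclose(phi(x) @ phi(z), poly2_kernel(x, z))   # identical up to round-off
```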
Kernel Summary
1. D is not easy to deal with in X, so apply a feature transformation Φ: X → Z such that dim(Z) >> dim(X).
2. Constructing and computing Φ(x) directly is too expensive:
• Write your learning algorithm in inner product form.
• Then, instead of Φ(x), we only need ⟨Φ(x_i), Φ(x_j)⟩ for all i and j, which we can compute by defining an “inner product kernel”
  K(x, z) = ⟨Φ(x), Φ(z)⟩
  and computing K(x_i, x_j) ∀ i,j directly.
• Note: the matrix K = [K(x_i, x_j)] is called the “Kernel matrix” or Gram matrix.
3. Moral: Forget about Φ(x) and instead use K(x,z) from the start!
Question?
What is a good inner product kernel?
• This is a difficult question (see Prof. Lenckriet’s work)
In practice, the usual recipe is: pick a kernel from a library of known kernels. We have already met
• the linear kernel K(x, z) = x^T z
• the Gaussian family K(x, z) = e^{−||x − z||² / σ²}
• the polynomial family K(x, z) = (1 + x^T z)^k,  k = 1, 2, ...
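A sketch of such a small library, written as plain functions; the parameter names (sigma, k) match the families above, and the default values are arbitrary illustrative choices:

```python
import numpy as np

def linear_kernel(x, z):
    return np.dot(x, z)

def gaussian_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / sigma ** 2)

def polynomial_kernel(x, z, k=2):
    return (1.0 + np.dot(x, z)) ** k

# Any of these can be plugged into an inner-product-form algorithm,
# e.g. via the gram_matrix() helper sketched earlier.
```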
Inner Product Kernel Families
Why introduce simple, known kernel families?
• Obtain the benefits of a high-dimensional space without paying a price in complexity (avoid the “curse of dimensionality”).
• The kernel simply adds a few parameters (e.g., σ or k), whereas learning it would imply introducing many parameters (up to n²).
How does one check whether K(x,z) is a kernel?
Definition: a mapping
k: X × X → ℝ, (x, y) ↦ k(x, y)
is an inner product kernel if and only if
k(x, y) = ⟨Φ(x), Φ(y)⟩
where Φ: X → H, H is a vector space, and ⟨·,·⟩ is an inner product in H.
[Figure: the mapping Φ from the original space X into the feature space H, as in the earlier kernel-trick picture.]
Positive Definite Matrices
Recall that (e.g. Linear Algebra and Its Applications, Strang):
Definition: each of the following is a necessary and sufficient condition for a real symmetric matrix A to be (strictly) positive definite:
i) x^T A x > 0, ∀ x ≠ 0
ii) All (real) eigenvalues of A satisfy λ_i > 0
iii) All upper-left submatrices A_k have strictly positive determinant
iv) There is a matrix R with independent columns such that A = R^T R
Upper-left submatrices:
A_1 = [a_{1,1}],  A_2 = [ a_{1,1} a_{1,2} ; a_{2,1} a_{2,2} ],  A_3 = [ a_{1,1} a_{1,2} a_{1,3} ; a_{2,1} a_{2,2} a_{2,3} ; a_{3,1} a_{3,2} a_{3,3} ]
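A small numpy sketch that checks conditions (ii)-(iv) numerically for a real symmetric matrix; the tolerance and the test matrix are arbitrary illustrative choices:

```python
import numpy as np

def is_positive_definite(A, tol=1e-10):
    """Check positive definiteness of a real symmetric matrix A."""
    # (ii) all eigenvalues strictly positive
    eigenvalues_ok = np.all(np.linalg.eigvalsh(A) > tol)
    # (iii) all leading principal minors strictly positive
    minors_ok = all(np.linalg.det(A[:k, :k]) > tol for k in range(1, A.shape[0] + 1))
    # (iv) a factorization A = R^T R exists (Cholesky succeeds iff A is PD)
    try:
        np.linalg.cholesky(A)
        factorization_ok = True
    except np.linalg.LinAlgError:
        factorization_ok = False
    return eigenvalues_ok and minors_ok and factorization_ok

A = np.array([[2.0, -1.0], [-1.0, 2.0]])
print(is_positive_definite(A))   # True
```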
Positive Definite Matrices
Property (iv) is particularly interesting:
• In ℝ^d, ⟨x, y⟩ = x^T A y is an inner product kernel if and only if A is positive definite (from the definition of an inner product).
• From (iv), this holds iff there is a full-column-rank matrix R such that A = R^T R.
• Hence ⟨x, y⟩ = x^T A y = (Rx)^T (Ry) = Φ(x)^T Φ(y) with
  Φ: ℝ^d → ℝ^d, x ↦ Rx
I.e. the inner product kernel k(x,z) = x^T A z (A symmetric and positive definite) is the standard inner product in the range space of the mapping Φ(x) = Rx.
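A quick numerical illustration of this identity, using a Cholesky factor as the R in A = R^T R; the matrix choice is illustrative:

```python
import numpy as np

A = np.array([[2.0, -1.0],
              [-1.0, 2.0]])          # symmetric positive definite
R = np.linalg.cholesky(A).T          # A = R^T R, R has full column rank

x = np.random.randn(2)
y = np.random.randn(2)

# k(x, y) = x^T A y equals the standard inner product of Phi(x) = Rx and Phi(y) = Ry
assert np.isclose(x @ A @ y, (R @ x) @ (R @ y))
```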