  1. Kernels & Kernelization Ken Kreutz-Delgado (Nuno Vasconcelos) Winter 2012 — UCSD — ECE 174A

  2. Inner Product Matrix & PCA Given the centered data matrix X_c:
• 1) Construct the inner product matrix K_c = X_c^T X_c
• 2) Compute its eigendecomposition (Σ², M)
PCA: For the covariance matrix Σ_x = Γ Λ Γ^T:
• Principal components are given by Γ = X_c M Σ^{-1}
• Principal values are given by Λ^{1/2} = (1/√n) Σ
• Projection of the centered data onto the principal components is given by Γ^T X_c = Σ^{-1} M^T X_c^T X_c = Σ^{-1} M^T K_c
This allows the computation of the eigenvalues and PCA coefficients when we only have access to the dot-product (inner product) matrix K_c
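A minimal numerical sketch of this construction (my own, assuming NumPy; the variable names and the toy data are illustrative, not from the slides):

```python
import numpy as np

# Toy data: n samples of dimension d
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))              # rows = samples
Xc = (X - X.mean(axis=0)).T              # centered data matrix, one sample per column
n = Xc.shape[1]

# 1) Inner product (Gram) matrix
Kc = Xc.T @ Xc

# 2) Eigendecomposition Kc = M diag(sigma^2) M^T
sigma2, M = np.linalg.eigh(Kc)
sigma2, M = sigma2[::-1], M[:, ::-1]     # sort in decreasing order
keep = sigma2 > 1e-10                    # drop null directions
sigma, M = np.sqrt(sigma2[keep]), M[:, keep]

# Principal components and principal values, as on the slide
Gamma = Xc @ M / sigma                   # Gamma = Xc M Sigma^{-1}
lam_sqrt = sigma / np.sqrt(n)            # Lambda^{1/2} = (1/sqrt(n)) Sigma

# Projection onto the principal components uses only Kc
proj = np.diag(1.0 / sigma) @ M.T @ Kc   # Gamma^T Xc = Sigma^{-1} M^T Kc
assert np.allclose(proj, Gamma.T @ Xc)
```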

  3. The Inner Product Form This turns out to be the case for many learning algorithms. If you manipulate expressions a little bit, you can often write them in "dot product form". Definition: a learning algorithm is in inner product form if, given a training data set D = {(x_1, y_1), ..., (x_n, y_n)}, it only depends on the points x_i through their inner products ⟨x_i, x_j⟩ = x_i^T x_j. For example, let's look at k-means.

  4. K-means Clustering We saw that the k-means algorithm iterates between
• 1) (re-)Classification: i*(x) = argmin_i ||x − μ_i||²
• 2) (re-)Estimation: μ_i^new = (1/n_i) Σ_j x_j^(i)
Note that:
||x − μ_i||² = (x − μ_i)^T (x − μ_i) = x^T x − 2 x^T μ_i + μ_i^T μ_i

  5. K-means Clustering Combining this expansion with the sample mean formula, μ_i = (1/n_i) Σ_j x_j^(i), allows us to write the distance between a data sample x_k and the class center μ_i as a function of the inner products ⟨x_i, x_j⟩ = x_i^T x_j:
||x_k − μ_i||² = x_k^T x_k − (2/n_i) Σ_j x_k^T x_j^(i) + (1/n_i²) Σ_{j,l} x_j^(i)T x_l^(i)
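As an illustration, a small sketch (my own, assuming NumPy) of this inner-product-only distance, written in terms of a precomputed Gram matrix:

```python
import numpy as np

def dist2_to_center(K, k_idx, cluster_idx):
    """Squared distance ||x_k - mu_i||^2 using only the Gram matrix K,
    where K[a, b] = <x_a, x_b> and cluster_idx lists the points in cluster i."""
    n_i = len(cluster_idx)
    term1 = K[k_idx, k_idx]                                      # x_k^T x_k
    term2 = 2.0 / n_i * K[k_idx, cluster_idx].sum()              # (2/n_i) sum_j x_k^T x_j
    term3 = K[np.ix_(cluster_idx, cluster_idx)].sum() / n_i**2   # (1/n_i^2) sum_{j,l} x_j^T x_l
    return term1 - term2 + term3

# Sanity check against the direct computation in the original feature space
rng = np.random.default_rng(1)
X = rng.normal(size=(6, 2))
K = X @ X.T                              # linear kernel: K[a, b] = x_a^T x_b
cluster = [0, 2, 4]
mu = X[cluster].mean(axis=0)
assert np.isclose(dist2_to_center(K, 5, cluster), np.sum((X[5] - mu) ** 2))
```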

  6. "The Kernel Trick" Why is this interesting? Consider the following transformation of the feature space:
• Introduce a mapping to a "better" (i.e., linearly separable) feature space Φ: X → Z where, generally, dim(Z) > dim(X).
• If a classification algorithm only depends on the data through inner products then, in the transformed space, it depends on ⟨Φ(x_i), Φ(x_j)⟩ = Φ(x_i)^T Φ(x_j)
[Figure: a data set that is not linearly separable in the original space (x_1, x_2) becomes linearly separable after the mapping to (x_1, x_2, x_3).]

  7. The Inner Product Implementation In the transformed space, the learning algorithm only requires the inner products ⟨Φ(x_i), Φ(x_j)⟩ = Φ(x_j)^T Φ(x_i). Note that we do not need to store the Φ(x_j), but only the n² (scalar) component values of the inner product matrix. Interestingly, this holds even if Φ(x) takes its value in an infinite-dimensional space.
• We get a reduction from infinity to n²!
• There is, however, still one problem: when Φ(x_j) is infinite dimensional, the computation of the inner product ⟨Φ(x_i), Φ(x_j)⟩ looks impossible.

  8. "The Kernel Trick" "Instead of defining Φ(x), then computing Φ(x_i) for each i, and then computing ⟨Φ(x_i), Φ(x_j)⟩ for each pair (i,j), simply define a kernel function
K(x,z) ≜ ⟨Φ(x), Φ(z)⟩
and work with it directly." K(x,z) is called an inner product or dot-product kernel. Since we only use the kernel, why bother to define Φ(x)? Just define the kernel K(x,z) directly! Then we never have to deal with the complexity of Φ(x). This is usually called "the kernel trick".

  9. Important Questions How do I know that if I pick a bivariate function K(x,z), it is actually equivalent to an inner product?
• Answer: In fact, in general it is not. (More about this later.)
If it is, how do I know what Φ(x) is?
• Answer: you may never know. E.g. the Gaussian kernel
K(x,z) = e^{−||x−z||²/σ²} = ⟨Φ(x), Φ(z)⟩
is a very popular choice. But it is not obvious what Φ(x) is. However, on the positive side, we do not need to know how to choose Φ(x). Choosing an admissible kernel K(x,z) is sufficient.
Why is it that using K(x,z) is easier/better?
• Answer: Complexity management. Let's look at an example.
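For concreteness, a hedged sketch (my own, assuming NumPy; the σ and k defaults are arbitrary) of the kernels mentioned here and on the later slides:

```python
import numpy as np

def linear_kernel(x, z):
    """K(x, z) = x^T z."""
    return np.dot(x, z)

def gaussian_kernel(x, z, sigma=1.0):
    """K(x, z) = exp(-||x - z||^2 / sigma^2)."""
    return np.exp(-np.sum((x - z) ** 2) / sigma ** 2)

def polynomial_kernel(x, z, k=2):
    """K(x, z) = (1 + x^T z)^k."""
    return (1.0 + np.dot(x, z)) ** k
```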

  10. Polynomial Kernels In ℝ^d, consider the square of the inner product between two vectors:
(x^T z)² = (Σ_{i=1}^d x_i z_i)² = (Σ_{i=1}^d x_i z_i)(Σ_{j=1}^d x_j z_j) = Σ_{i=1}^d Σ_{j=1}^d x_i x_j z_i z_j
= x_1 x_1 z_1 z_1 + x_1 x_2 z_1 z_2 + ... + x_1 x_d z_1 z_d
+ x_2 x_1 z_2 z_1 + x_2 x_2 z_2 z_2 + ... + x_2 x_d z_2 z_d
+ ...
+ x_d x_1 z_d z_1 + x_d x_2 z_d z_2 + ... + x_d x_d z_d z_d

  11. Polynomial Kernels This can be written as
K(x,z) = (x^T z)² = Φ(x)^T Φ(z)
with Φ: ℝ^d → ℝ^{d²},
Φ(x) = (x_1 x_1, x_1 x_2, ..., x_1 x_d, ..., x_2 x_1, x_2 x_2, ..., x_d x_d)^T
Hence, we have
(x^T z)² = (x_1 x_1, x_1 x_2, ..., x_1 x_d, ..., x_2 x_1, ..., x_d x_d) (z_1 z_1, z_1 z_2, ..., z_1 z_d, ..., z_2 z_1, ..., z_d z_d)^T = Φ(x)^T Φ(z)
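A quick numerical check of this identity (my own sketch, assuming NumPy):

```python
import numpy as np

def phi(x):
    """Explicit second-order feature map: all d^2 products x_i * x_j."""
    return np.outer(x, x).ravel()

rng = np.random.default_rng(2)
x, z = rng.normal(size=4), rng.normal(size=4)

# Kernel evaluation, O(d), versus the explicit feature map, O(d^2)
assert np.isclose(np.dot(x, z) ** 2, np.dot(phi(x), phi(z)))
```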

  12. Polynomial Kernels The point is that:
• The computation of Φ(x)^T Φ(z) has complexity O(d²)
• The direct computation of K(x,z) = (x^T z)² has complexity O(d)
Direct evaluation is more efficient by a factor of d. As d goes to infinity, this allows a feasible implementation.
BTW, you just met another kernel family:
• This implements polynomials of second order
• In general, the family of polynomial kernels is defined as
K(x,z) = (1 + x^T z)^k, k = 1, 2, ...
• I don't even want to think about writing down Φ(x)!

  13. Kernel Summary
1. D is not easy to deal with in X: apply a feature transformation Φ: X → Z, such that dim(Z) >> dim(X)
2. Constructing and computing Φ(x) directly is too expensive:
• Write your learning algorithm in inner product form
• Then, instead of Φ(x), we only need ⟨Φ(x_i), Φ(x_j)⟩ for all i and j, which we can compute by defining an "inner product kernel" K(x,z) = ⟨Φ(x), Φ(z)⟩ and computing K(x_i, x_j) ∀ i,j directly
• Note: the matrix K = [K(x_i, x_j)] is called the "kernel matrix" or Gram matrix
3. Moral: Forget about Φ(x) and instead use K(x,z) from the start!
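A minimal sketch (my own, assuming NumPy; gram_matrix is a hypothetical helper name) of building the kernel/Gram matrix for a data set:

```python
import numpy as np

def gram_matrix(X, kernel):
    """Kernel (Gram) matrix: K[i, j] = kernel(x_i, x_j) for rows x_i of X."""
    n = X.shape[0]
    K = np.empty((n, n))
    for i in range(n):
        for j in range(i, n):
            K[i, j] = K[j, i] = kernel(X[i], X[j])   # exploit symmetry
    return K

rng = np.random.default_rng(3)
X = rng.normal(size=(10, 2))
K = gram_matrix(X, lambda x, z: np.exp(-np.sum((x - z) ** 2)))  # Gaussian kernel, sigma = 1
```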

  14. Question? What is a good inner product kernel?
• This is a difficult question (see Prof. Lanckriet's work)
In practice, the usual recipe is:
• Pick a kernel from a library of known kernels; we have already met
• the linear kernel K(x,z) = x^T z
• the Gaussian family K(x,z) = e^{−||x−z||²/σ²}
• the polynomial family K(x,z) = (1 + x^T z)^k, k = 1, 2, ...

  15. Inner Product Kernel Families Why introduce simple, known kernel families?
• Obtain the benefits of a high-dimensional space without paying a price in complexity (avoid the "curse of dimensionality").
• The kernel simply adds a few parameters (e.g., σ or k), whereas learning it would imply introducing many parameters (up to n²)
How does one check whether K(x,z) is a kernel?
Definition: a mapping
k: X × X → ℝ
(x,y) ↦ k(x,y)
is an inner product kernel if and only if
k(x,y) = ⟨Φ(x), Φ(y)⟩
where Φ: X → H, H is a vector space, and ⟨·,·⟩ is an inner product in H.
[Figure: the mapping Φ takes the data from the original space X, where it is not linearly separable, to the space H, where it is.]

  16. Positive Definite Matrices Recall that (e.g. Linear Algebra and Its Applications, Strang):
Definition: each of the following is a necessary and sufficient condition for a real symmetric matrix A to be (strictly) positive definite:
i) x^T A x > 0, ∀ x ≠ 0
ii) All (real) eigenvalues of A satisfy λ_i > 0
iii) All upper-left submatrices A_k have strictly positive determinant
iv) There is a matrix R with independent columns such that A = R^T R
Upper-left submatrices:
A_1 = [a_11],  A_2 = [a_11 a_12; a_21 a_22],  A_3 = [a_11 a_12 a_13; a_21 a_22 a_23; a_31 a_32 a_33]
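A small sketch (my own, assuming NumPy) checking these equivalent conditions numerically on an example matrix:

```python
import numpy as np

A = np.array([[2.0, -1.0, 0.0],
              [-1.0, 2.0, -1.0],
              [0.0, -1.0, 2.0]])          # a symmetric matrix to test

# ii) all eigenvalues strictly positive
eig_ok = np.all(np.linalg.eigvalsh(A) > 0)

# iii) all upper-left submatrices have strictly positive determinant
minors_ok = all(np.linalg.det(A[:k, :k]) > 0 for k in range(1, A.shape[0] + 1))

# iv) A = R^T R for some R with independent columns (Cholesky succeeds iff A is PD)
try:
    R = np.linalg.cholesky(A).T
    factor_ok = np.allclose(A, R.T @ R)
except np.linalg.LinAlgError:
    factor_ok = False

print(eig_ok, minors_ok, factor_ok)       # all True for this A
```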

  17. Positive definite matrices Property (iv) is particularly interesting:
• In ℝ^d, ⟨x,y⟩ = x^T A y is an inner product kernel if and only if A is positive definite (from the definition of inner product).
• From (iv), this holds iff there is a full column rank R such that A = R^T R
• Hence ⟨x,y⟩ = x^T A y = (Rx)^T (Ry) = Φ(x)^T Φ(y) with Φ: ℝ^d → ℝ^d, x ↦ Rx
I.e., the inner product kernel k(x,z) = x^T A z (A symmetric & positive definite) is the standard inner product in the range space of the mapping Φ(x) = Rx.
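As an illustration of this last point, a sketch (my own, assuming NumPy) using the Cholesky factor as the R above:

```python
import numpy as np

A = np.array([[2.0, -1.0, 0.0],
              [-1.0, 2.0, -1.0],
              [0.0, -1.0, 2.0]])   # symmetric positive definite

R = np.linalg.cholesky(A).T        # A = R^T R, with R of full column rank

rng = np.random.default_rng(4)
x, y = rng.normal(size=3), rng.normal(size=3)

# x^T A y is the standard inner product of Phi(x) = Rx and Phi(y) = Ry
assert np.isclose(x @ A @ y, (R @ x) @ (R @ y))
```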
