

  1. REGULARIZED LEAST SQUARES AND SUPPORT VECTOR MACHINES. Francesca Odone and Lorenzo Rosasco, odone@disi.unige.it / lrosasco@mit.edu. BISS 2012, March 14, 2012. Regularization Methods for High Dimensional Learning.

  2. ABOUT THIS CLASS. Goal: to introduce two main examples of Tikhonov regularization, regularized least squares (RLS) and support vector machines (SVM), deriving and comparing their computational properties.

  3. BASICS: DATA. Training set: S = {(x_1, y_1), ..., (x_n, y_n)}. Inputs: X = {x_1, ..., x_n}. Labels: Y = {y_1, ..., y_n}.

  4. BASICS: RKHS, KERNEL. RKHS H with a positive semidefinite kernel function K:
       linear: K(x_i, x_j) = x_i^T x_j
       polynomial: K(x_i, x_j) = (x_i^T x_j + 1)^d
       gaussian: K(x_i, x_j) = exp(-||x_i - x_j||^2 / σ^2)
     Define the kernel matrix K to satisfy K_ij = K(x_i, x_j). The kernel function with one argument fixed is K_x = K(x, ·). Given an arbitrary input x*, K_{x*} is the vector whose i-th entry is K(x_i, x*).
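
As a concrete illustration, here is a minimal Octave/MATLAB sketch that builds the kernel matrix K for the three kernels above. The function name buildKernel, the parameter names, and the use of implicit expansion (MATLAB R2016b+ or Octave) are choices made here, not part of the slides.

  function K = buildKernel(X, kind, param)
    % Build the n x n kernel matrix for the data matrix X (n x d, rows x_i').
    % 'param' is the degree d for the polynomial kernel and sigma for the
    % Gaussian kernel; it is ignored for the linear kernel.
    switch kind
      case 'linear'       % K(x_i, x_j) = x_i' * x_j
        K = X * X';
      case 'polynomial'   % K(x_i, x_j) = (x_i' * x_j + 1)^d
        K = (X * X' + 1) .^ param;
      case 'gaussian'     % K(x_i, x_j) = exp(-||x_i - x_j||^2 / sigma^2)
        sq = sum(X.^2, 2);                        % n x 1 squared norms
        K  = exp(-(sq + sq' - 2 * (X * X')) / param^2);
    end
  end

A typical call would be K = buildKernel(X, 'gaussian', 1.0);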

  5. TIKHONOV REGULARIZATION. We are interested in studying Tikhonov regularization:
       argmin_{f ∈ H} { Σ_{i=1}^n V(y_i, f(x_i)) + λ ||f||_H^2 }.

  6. REPRESENTER THEOREM. The representer theorem guarantees that the solution can be written as
       f = Σ_{j=1}^n c_j K_{x_j}
     for some c = (c_1, ..., c_n) ∈ R^n. So Kc is the vector whose i-th element is f(x_i):
       f(x_i) = Σ_{j=1}^n c_j K_{x_i}(x_j) = Σ_{j=1}^n c_j K_ij,
     and ||f||_H^2 = c^T K c.
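
A small Octave/MATLAB sketch of these two identities, assuming the kernel matrix K (n x n) and a coefficient vector c (n x 1) are already available:

  fvals    = K * c;        % fvals(i) = sum_j c(j) * K(i,j) = f(x_i)
  normf_sq = c' * K * c;   % ||f||_H^2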

  7. RKHS NORM AND REPRESENTER THEOREM. Since f = Σ_{j=1}^n c_j K_{x_j}, then
       ||f||_H^2 = <f, f>_H
                 = < Σ_{i=1}^n c_i K_{x_i}, Σ_{j=1}^n c_j K_{x_j} >_H
                 = Σ_{i=1}^n Σ_{j=1}^n c_i c_j <K_{x_i}, K_{x_j}>_H
                 = Σ_{i=1}^n Σ_{j=1}^n c_i c_j K(x_i, x_j) = c^T K c.

  8. PLAN. RLS: dual problem, regularization path, linear case. SVM: dual problem, linear case, historical derivation.

  9. THE RLS PROBLEM. Goal: find the function f ∈ H that minimizes the weighted sum of the square loss and the RKHS norm:
       argmin_{f ∈ H} { (1/2) Σ_{i=1}^n (f(x_i) − y_i)^2 + (λ/2) ||f||_H^2 }.

  10. RLS AND REPRESENTER THEOREM. Using the representer theorem, the RLS problem becomes
       argmin_{c ∈ R^n} (1/2) ||Y − K c||^2 + (λ/2) c^T K c.
     The above functional is differentiable; we can find the minimum by setting the gradient w.r.t. c to 0.

  11. RLS AND REPRESENTER THEOREM. Setting the gradient of the above functional w.r.t. c to 0:
       −K(Y − Kc) + λ K c = 0
       (K + λ I) c = Y
       c = (K + λ I)^{-1} Y.
     We find c by solving a system of linear equations.

  12. SOLVING RLS FOR FIXED PARAMETERS. (K + λ I) c = Y. The matrix K + λ I is symmetric and positive definite, so the appropriate algorithm is Cholesky factorization. In MATLAB, the "slash" operator seems to use Cholesky, so you can just write c = (K + l*I) \ Y; but to be safe (or in Octave), I suggest
       R = chol(K + l*I);
       c = R \ (R' \ Y);
     The above algorithm has complexity O(n^3).

  13. THE RLS SOLUTION, COMMENTS. c = (K + λ I)^{-1} Y. The prediction at a new input x* is
       f(x*) = Σ_{j=1}^n c_j K_{x_j}(x*) = K_{x*} c = K_{x*} G^{-1} Y,
     where G = K + λ I. Note that the above operation is O(n^2).
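
A minimal Octave/MATLAB sketch of the prediction step, shown here for the Gaussian kernel; the variable names and the value of sigma are illustrative, and implicit expansion (MATLAB R2016b+ or Octave) is assumed:

  % Prediction at a new input xstar (1 x d row vector), given training
  % inputs X (n x d) and precomputed coefficients c (n x 1).
  sigma  = 1.0;                                   % illustrative value
  Kxstar = exp(-sum((X - xstar).^2, 2) / sigma^2);  % Kxstar(i) = K(x_i, xstar)
  fstar  = Kxstar' * c;                           % f(xstar) = K_{x*} c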

  14. RLS REGULARIZATION PATH. Typically we have to choose λ, and hence compute the solutions corresponding to different values of λ. Is there a more efficient method than solving c(λ) = (K + λ I)^{-1} Y anew for each λ?

  15. RLS REGULARIZATION PATH. Yes: form the eigendecomposition K = Q Λ Q^T, where Λ is diagonal with Λ_ii ≥ 0 and Q Q^T = I. Then
       G = K + λ I = Q Λ Q^T + λ I = Q (Λ + λ I) Q^T,
     which implies that G^{-1} = Q (Λ + λ I)^{-1} Q^T.

  16. RLS REGULARIZATION PATH CONT'D. O(n^3) time to solve one (dense) linear system, or to compute the eigendecomposition (the constant is maybe 4x worse). Given Q and Λ, we can find c(λ) in O(n^2) time:
       c(λ) = Q (Λ + λ I)^{-1} Q^T Y,
     noting that (Λ + λ I) is diagonal. Finding c(λ) for many λ's is (essentially) free!
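
A possible Octave/MATLAB sketch of this regularization path, assuming the kernel matrix K and label vector Y are given; the λ grid is an illustrative choice:

  [Q, L] = eig((K + K') / 2);        % K = Q*L*Q', O(n^3), done once
  ev     = diag(L);                  % eigenvalues, ev(i) >= 0
  lambdas = logspace(-6, 2, 50);     % illustrative grid of lambda values
  QtY = Q' * Y;                      % reused for every lambda
  C = zeros(size(K, 1), numel(lambdas));
  for k = 1:numel(lambdas)
    C(:, k) = Q * (QtY ./ (ev + lambdas(k)));   % c(lambda), O(n^2) each
  end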

  17. PARAMETER CHOICE. Idea: try different values of λ and see which one performs best. How to try them? A simple choice is to use a validation set of data: if we have "enough" training data we may sample out a training and a validation set. Otherwise a common practice is K-fold cross validation (KCV):
       1. Divide the data into K sets of equal size, S_1, ..., S_K.
       2. For each i, train on the other K − 1 sets and test on the i-th set.
     If K = n we get the leave-one-out strategy (LOO).
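
A rough Octave/MATLAB sketch of K-fold cross validation for RLS over a grid of λ values; the number of folds, the grid, and the squared-error score are illustrative choices, not prescribed by the slides:

  % K is the n x n kernel matrix, Y the n x 1 label vector.
  n = size(K, 1);  Kfold = 5;  lambdas = logspace(-6, 2, 20);
  folds = mod(randperm(n), Kfold) + 1;          % random fold assignment in 1..Kfold
  cvErr = zeros(size(lambdas));
  for k = 1:numel(lambdas)
    for i = 1:Kfold
      tr = find(folds ~= i);  va = find(folds == i);
      c  = (K(tr, tr) + lambdas(k) * eye(numel(tr))) \ Y(tr);   % train on K-1 folds
      fv = K(va, tr) * c;                       % predict on the held-out fold
      cvErr(k) = cvErr(k) + mean((fv - Y(va)).^2);
    end
  end
  [~, best] = min(cvErr);  bestLambda = lambdas(best);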

  18. PARAMETER CHOICE. Notice that some data should always be kept aside to be used as a test set, to assess the generalization performance of the system after parameter tuning has taken place. [Figure: the entire set of data is split into training, validation, and test portions.]

  19. THE LINEAR CASE. The linear kernel is K(x_i, x_j) = x_i^T x_j. The linear kernel offers many advantages for computation. Key idea: we get a decomposition of the kernel matrix for free, K = X X^T, where X = [x_1^T; ...; x_n^T] is the n × d data matrix. In the linear case, we will see that we have two different computational options.

  20. LINEAR KERNEL, LINEAR FUNCTION. With a linear kernel, the function we are learning is linear as well:
       f(x*) = K_{x*} c = x*^T X^T c = x*^T w,
     where we define w = X^T c.

  21. LINEAR KERNEL CONT. For the linear kernel,
       min_{c ∈ R^n} (1/2) ||Y − K c||^2 + (λ/2) c^T K c
         = min_{c ∈ R^n} (1/2) ||Y − X X^T c||^2 + (λ/2) c^T X X^T c
         = min_{w ∈ R^d} (1/2) ||Y − X w||^2 + (λ/2) ||w||^2.
     Taking the gradient with respect to w and setting it to zero,
       X^T X w − X^T Y + λ w = 0,
     we get w = (X^T X + λ I)^{-1} X^T Y.

  22. SOLUTION FOR FIXED PARAMETER. w = (X^T X + λ I)^{-1} X^T Y. Cholesky decomposition allows us to solve the above problem in O(d^3) for any fixed λ: we can work with the covariance matrix X^T X ∈ R^{d×d}. The algorithm is identical to solving a general RLS problem, replacing the kernel matrix by X^T X and the label vector by X^T Y. We can classify new points in O(d) time using w, rather than having to compute a weighted sum of n kernel products (which would usually cost O(nd) time).
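
A minimal Octave/MATLAB sketch of the linear-case solution, assuming X is the n x d data matrix and Y the n x 1 label vector; the value of λ and the new point xstar are illustrative:

  lambda = 0.1;                           % illustrative value
  d = size(X, 2);
  R = chol(X' * X + lambda * eye(d));     % O(d^3) once X'X is formed
  w = R \ (R' \ (X' * Y));                % w = (X'X + lambda*I)^{-1} X'Y
  ystar = xstar * w;                      % O(d) prediction, xstar a 1 x d row vector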

  23. REGULARIZATION PATH VIA SVD. To compute solutions corresponding to multiple values of λ we can again consider an eigendecomposition/SVD. We need O(nd) memory to store the data in the first place. The SVD also requires O(nd) memory and O(nd^2) time. Compared to the nonlinear case, we have replaced an O(n) with an O(d), in both time and memory. If n >> d, this can represent a huge savings.
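
One possible Octave/MATLAB sketch of the SVD-based path, assuming X and Y as above; the λ grid is illustrative. With X = U*S*V', we have w(λ) = V (S^2 + λI)^{-1} S U^T Y:

  [U, S, V] = svd(X, 'econ');          % thin SVD, O(n d^2), done once
  s   = diag(S);                       % singular values
  UtY = U' * Y;                        % reused for every lambda
  lambdas = logspace(-6, 2, 50);       % illustrative grid
  W = zeros(size(X, 2), numel(lambdas));
  for k = 1:numel(lambdas)
    W(:, k) = V * ((s .* UtY) ./ (s.^2 + lambdas(k)));   % w(lambda)
  end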

  24. SUMMARY SO FAR. When can we solve one RLS problem? (I.e., what are the bottlenecks?)

  25. SUMMARY SO FAR. We need to form K, which takes O(n^2 d) time and O(n^2) memory. We need to perform a Cholesky factorization or an eigendecomposition of K, which takes O(n^3) time. In the linear case we have replaced an O(n) with an O(d), in both time and memory. If n >> d, this can represent a huge savings. Usually, we run out of memory before we run out of time. The practical limit on today's workstations is (more or less) 10,000 points (using MATLAB).

  26. PLAN. RLS: dual problem, regularization path, linear case. SVM: dual problem, linear case, historical derivation.
