  1. Regularized Least Squares. Charlie Frogner, MIT, 2012. Slides mostly stolen from Ryan Rifkin (Google).

  2. Summary. In RLS, the Tikhonov minimization problem boils down to solving a linear system (and this is good). We can compute the solution for each of a bunch of λ's by using the eigendecomposition of the kernel matrix. We can compute the leave-one-out error over the whole training set about as cheaply as solving the minimization problem once. The linear kernel allows us to do all of this when n ≫ d.

  3. Basics: Data. Training set: $S = \{(x_1, y_1), \dots, (x_n, y_n)\}$. Inputs: $X = \{x_1, \dots, x_n\}$. Labels: $Y = \{y_1, \dots, y_n\}$.
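
  To make the later slides concrete, here is a hypothetical toy data set in Octave/Matlab that the sketches below reuse; the names (X, Y, n, d) and the linear-plus-noise generating process are illustrative assumptions, not part of the slides.

    % Hypothetical toy data reused by the sketches that follow.
    n = 100; d = 5;
    X = randn(n, d);                       % inputs, one point per row
    w_true = randn(d, 1);                  % "true" weights, for simulation only
    Y = X * w_true + 0.1 * randn(n, 1);    % noisy labels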

  4. Basics: RKHS, Kernel. RKHS $\mathcal{H}$ with a positive semidefinite kernel function $K$:

    linear:      $K(x_i, x_j) = x_i^T x_j$
    polynomial:  $K(x_i, x_j) = (x_i^T x_j + 1)^d$
    Gaussian:    $K(x_i, x_j) = \exp\left(\frac{-\|x_i - x_j\|^2}{\sigma^2}\right)$

  Define the kernel matrix $K$ to satisfy $K_{ij} = K(x_i, x_j)$. The kernel function with one argument fixed is $K_x = K(x, \cdot)$. Given an arbitrary input $x_*$, $K_{x_*}$ is a vector whose $i$th entry is $K(x_i, x_*)$. (So the training set $X$ is assumed.)
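
  A minimal Octave/Matlab sketch of building the kernel matrix for these three kernels, assuming X is the n-by-d input matrix from the toy setup above. The degree and bandwidth values are illustrative, and the pairwise-distance line relies on implicit broadcasting (Octave, or Matlab R2016b and later).

    % Kernel matrices: K_ij = K(x_i, x_j), one point per row of X.
    Klin  = X * X';                          % linear kernel
    deg   = 3;
    Kpoly = (X * X' + 1) .^ deg;             % polynomial kernel of degree deg
    sq    = sum(X .^ 2, 2);                  % squared norms, n-by-1
    D2    = sq + sq' - 2 * (X * X');         % pairwise squared distances
    sigma = 1;
    K     = exp(-D2 / sigma^2);              % Gaussian kernel, reused below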

  5. The RLS Setup. Goal: find the function $f \in \mathcal{H}$ that minimizes the weighted sum of the square loss and the RKHS norm,

    $\operatorname{argmin}_{f \in \mathcal{H}} \; \frac{1}{2} \sum_{i=1}^n (f(x_i) - y_i)^2 + \frac{\lambda}{2} \|f\|_{\mathcal{H}}^2$.   (1)

  This loss function makes sense for regression. We can also use it for binary classification, where it is less immediately intuitive but works great. Also called "ridge regression."

  6. Applying the Representer Theorem. Claim: we can rewrite (1) as

    $\operatorname{argmin}_{c \in \mathbb{R}^n} \; \frac{1}{2} \|Y - Kc\|_2^2 + \frac{\lambda}{2} \|f\|_{\mathcal{H}}^2$.

  Proof: the representer theorem guarantees that the solution to (1) can be written as

    $f(\cdot) = \sum_{j=1}^n c_j K_{x_j}(\cdot)$

  for some $c \in \mathbb{R}^n$. So $Kc$ gives a vector whose $i$th element is $f(x_i)$:

    $f(x_i) = \sum_{j=1}^n c_j K_{x_i}(x_j) = \sum_{j=1}^n c_j K_{ij} = (K_{i,\cdot}) \, c$.

  7. Applying the Representer Theorem, Part II. Claim: $\|f\|_{\mathcal{H}}^2 = c^T K c$. Proof:

    $f(\cdot) = \sum_{j=1}^n c_j K_{x_j}(\cdot)$,

  so

    $\|f\|_{\mathcal{H}}^2 = \langle f, f \rangle_{\mathcal{H}}$
      $= \left\langle \sum_{i=1}^n c_i K_{x_i}, \sum_{j=1}^n c_j K_{x_j} \right\rangle_{\mathcal{H}}$
      $= \sum_{i=1}^n \sum_{j=1}^n c_i c_j \langle K_{x_i}, K_{x_j} \rangle_{\mathcal{H}}$
      $= \sum_{i=1}^n \sum_{j=1}^n c_i c_j K(x_i, x_j) = c^T K c$.

  8. The RLS Solution. Putting it all together, the RLS problem is

    $\operatorname{argmin}_{c \in \mathbb{R}^n} \; \frac{1}{2} \|Y - Kc\|_2^2 + \frac{\lambda}{2} c^T K c$.

  This is convex in $c$ (why?), so we can find its minimum by setting the gradient w.r.t. $c$ to 0:

    $-K(Y - Kc) + \lambda K c = 0$
    $(K + \lambda I) c = Y$
    $c = (K + \lambda I)^{-1} Y$

  We find $c$ by solving a system of linear equations.
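
  A short sketch, using the Gaussian K and toy Y from the earlier sketches with an illustrative λ, that solves this linear system and checks the stationarity condition numerically.

    % Solve (K + lambda*I) c = Y and verify that the gradient vanishes.
    lambda = 0.1;                            % illustrative choice
    c = (K + lambda * eye(n)) \ Y;
    grad = -K * (Y - K * c) + lambda * K * c;
    norm(grad)                               % ~0 up to round-off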

  9. Solving RLS, Parameters Fixed. We just need to solve a single linear system $(K + \lambda I) c = Y$. The matrix $K + \lambda I$ is symmetric positive definite, so the appropriate algorithm is Cholesky factorization. In Matlab, the "slash" operator seems to be using Cholesky, so you can just write c = (K + l*I) \ Y, but to be safe (or in Octave), I suggest R = chol(K + l*I); c = (R \ (R' \ Y));.
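
  The slide's Cholesky route, written out as a runnable Octave/Matlab snippet with comments; it reuses K, Y, and the λ chosen above.

    % Cholesky factorization: K + lambda*I = R' * R, with R upper triangular.
    R = chol(K + lambda * eye(n));
    c = R \ (R' \ Y);                        % two triangular solves, no inverse formed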

  10. The RLS Solution, Comments. Define $G(\lambda) = K + \lambda I$. (Often $\lambda$ is clear from context and we write $G$.) The prediction at a new test input $x_*$ is

    $f(x_*) = \sum_{j=1}^n c_j K_{x_j}(x_*) = K_{x_*}^T c = K_{x_*}^T G^{-1} Y$.

  The use of $G^{-1}$ (or other inverses) is formal only. We do not recommend taking matrix inverses.
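
  A sketch of the prediction at a hypothetical new input xstar, forming the vector $K_{x_*}$ for the Gaussian kernel from the earlier setup (the row-minus-matrix line again assumes broadcasting).

    % Predict at a new point: f(xstar) = Kxstar' * c, where Kxstar(i) = K(x_i, xstar).
    xstar  = randn(1, d);
    Kxstar = exp(-sum((X - xstar) .^ 2, 2) / sigma^2);   % n-by-1
    fstar  = Kxstar' * c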

  11. Solving RLS, Varying λ. Situation: we don't know what λ to use. Is there a more efficient method than solving $c(\lambda) = (K + \lambda I)^{-1} Y$ anew for each λ? Form the eigendecomposition $K = Q \Lambda Q^T$, where $\Lambda$ is diagonal with $\Lambda_{ii} \ge 0$ and $Q Q^T = I$. Then

    $G = K + \lambda I = Q \Lambda Q^T + \lambda I = Q (\Lambda + \lambda I) Q^T$,

  which implies that $G^{-1} = Q (\Lambda + \lambda I)^{-1} Q^T$.

  13. Solving RLS, Varying λ, Cont'd. It takes $O(n^3)$ time to solve one (dense) linear system, or to compute the eigendecomposition (the constant is maybe 4x worse). Given $Q$ and $\Lambda$, we can find $c(\lambda)$ in $O(n^2)$ time:

    $c(\lambda) = Q (\Lambda + \lambda I)^{-1} Q^T Y$,

  noting that $(\Lambda + \lambda I)$ is diagonal. Finding $c(\lambda)$ for many λ's is (essentially) free!
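
  A sketch of this route in Octave/Matlab: one $O(n^3)$ eigendecomposition, then each $c(\lambda)$ on an illustrative grid of λ's costs $O(n^2)$. The grid and variable names (evals, lambdas, C) are assumptions reused by later sketches.

    % One eigendecomposition of K, then cheap solves for every lambda.
    [Q, L] = eig((K + K') / 2);              % symmetrize for safety; K = Q*L*Q'
    evals  = diag(L);
    QtY    = Q' * Y;                         % computed once
    lambdas = logspace(-6, 2, 25);           % illustrative grid of lambdas
    C = zeros(n, numel(lambdas));
    for t = 1:numel(lambdas)
        C(:, t) = Q * (QtY ./ (evals + lambdas(t)));   % c(lambda_t), O(n^2)
    end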

  14. Validation. We showed how to find $c(\lambda)$ quickly as we vary λ. But how do we decide if a given λ is "good"? Simplest idea: use the training set error. Problem: this invariably overfits. Don't do this! Other methods are possible, but today we consider validation. Validation means checking our function's behavior on points other than the training set.

  16. Types of Validation. If we have a huge amount of data, we could hold back some percentage of our data (30% is typical) and use this development set to choose hyperparameters. More common is k-fold cross-validation, which means a couple of different things: (a) divide your data into k equal sets $S_1, \dots, S_k$ and, for each i, train on the other k − 1 sets and test on the ith set; or (b) a total of k times, randomly split your data into a training and a test set. The limit of (the first kind of) k-fold validation is leave-one-out cross-validation.
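
  A minimal sketch of the first kind of k-fold cross-validation, scoring each λ on the grid from the previous sketch; k = 5, the hand-rolled fold indices, and the squared-error score are illustrative choices.

    % k-fold CV: for each lambda, train on k-1 folds and score on the held-out fold.
    k = 5;
    perm  = randperm(n);
    folds = mod(0:n-1, k) + 1;               % fold labels 1..k, applied to perm
    cv_err = zeros(numel(lambdas), 1);
    for t = 1:numel(lambdas)
        for i = 1:k
            te = perm(folds == i);           % held-out indices
            tr = perm(folds ~= i);           % training indices
            ci = (K(tr, tr) + lambdas(t) * eye(numel(tr))) \ Y(tr);
            cv_err(t) = cv_err(t) + sum((Y(te) - K(te, tr) * ci) .^ 2);
        end
    end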

  17. Leave-One-Out Cross-Validation. For each data point $x_i$, build a classifier using the remaining n − 1 data points, and measure the error at $x_i$. Empirically, this seems to be the method of choice when n is small. Problem: we have to build n different predictors, on data sets of size n − 1. We will now proceed to show that for RLS, obtaining the LOO error is (essentially) free!
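
  For comparison with what follows, a brute-force LOO sketch that retrains n times, using the Gaussian K and the fixed λ from the earlier sketches; the fast formulas derived next should reproduce these errors.

    % Naive leave-one-out: n separate solves, each on n-1 points.
    loo_err_naive = zeros(n, 1);
    for i = 1:n
        keep = [1:i-1, i+1:n];
        ci = (K(keep, keep) + lambda * eye(n - 1)) \ Y(keep);
        loo_err_naive(i) = Y(i) - K(i, keep) * ci;    % y_i - f_{S^i}(x_i)
    end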

  18. Leave-One-Out CV: Notation. Define $S^i$ to be the data set with the ith point removed:

    $S^i = \{(x_1, y_1), \dots, (x_{i-1}, y_{i-1}), \;\text{*poof*}, \;(x_{i+1}, y_{i+1}), \dots, (x_n, y_n)\}$.

  The ith leave-one-out value is $f_{S^i}(x_i)$. The ith leave-one-out error is $y_i - f_{S^i}(x_i)$. Define $L_V$ and $L_E$ to be the vectors of leave-one-out values and errors over the training set. $\|L_E\|_2^2$ is considered a good empirical proxy for the error on future points, and we often want to choose parameters by minimizing this quantity.

  19. $L_E$ derivation, I. Imagine that we already know $f_{S^i}(x_i)$. Define the vector $Y^i$ via

    $y^i_j = \begin{cases} y_j & j \ne i \\ f_{S^i}(x_i) & j = i \end{cases}$

  20. $L_E$ derivation, II. Claim: solving RLS using $Y^i$ gives us $f_{S^i}$, i.e.

    $f_{S^i} = \operatorname{argmin}_{f \in \mathcal{H}} \; \frac{1}{2} \sum_{j=1}^n (y^i_j - f(x_j))^2 + \frac{\lambda}{2} \|f\|_{\mathcal{H}}^2$.   (*)

  Proof: the term $\frac{1}{2} (y^i_i - f(x_i))^2 \ge 0$ for all $f$; call it (1). Since $(y^i_i - f_{S^i}(x_i))^2 = (f_{S^i}(x_i) - f_{S^i}(x_i))^2 = 0$, $f_{S^i}$ minimizes (1). $f_{S^i}$ also minimizes

    $(2) = \frac{1}{2} \sum_{j \ne i} (y^i_j - f(x_j))^2 + \frac{\lambda}{2} \|f\|_{\mathcal{H}}^2$,

  because (2) is exactly the RLS objective on the reduced training set $S^i$. Therefore $f_{S^i}$ minimizes $(*) = (1) + (2)$.

  21. $L_E$ derivation, III. Therefore,

    $c^i = G^{-1} Y^i$ and $f_{S^i}(x_i) = (K G^{-1} Y^i)_i$.

  This is circular reasoning so far, because we need to know $f_{S^i}(x_i)$ to form $Y^i$ in the first place. However, assuming we have already solved RLS for the whole training set, and we have computed $f_S(X) = K G^{-1} Y$, we can do something nice...

  22. $L_E$ derivation, IV. Since $Y^i$ and $Y$ differ only in their ith entry,

    $f_{S^i}(x_i) - f_S(x_i) = \sum_j (K G^{-1})_{ij} (y^i_j - y_j) = (K G^{-1})_{ii} (f_{S^i}(x_i) - y_i)$,

  and solving for $f_{S^i}(x_i)$ gives

    $f_{S^i}(x_i) = \frac{f_S(x_i) - (K G^{-1})_{ii} y_i}{1 - (K G^{-1})_{ii}} = \frac{(K G^{-1} Y)_i - (K G^{-1})_{ii} y_i}{1 - (K G^{-1})_{ii}}$.
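
  A sketch of this closed form in Octave/Matlab, writing $H = K G^{-1}$ for the matrix that maps $Y$ to the fitted values; the leave-one-out values come from the full-data fit alone, and comparing against the brute-force loop above should agree up to round-off. (H and loo_vals are illustrative names.)

    % Closed-form leave-one-out values from the full-data fit.
    G  = K + lambda * eye(n);
    H  = K / G;                              % K * G^{-1}, without an explicit inverse
    fS = H * Y;                              % fitted values f_S(x_i)
    loo_vals = (fS - diag(H) .* Y) ./ (1 - diag(H));      % f_{S^i}(x_i) for each i
    max(abs((Y - loo_vals) - loo_err_naive)) % ~0, matches the naive loop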

  23. $L_E$ derivation, V. Writing $\operatorname{diag}_m(A)$ for the diagonal matrix with the same diagonal as $A$, $\operatorname{diag}_v(A)$ for the vector of diagonal entries of $A$, and taking the divisions elementwise:

    $L_V = \frac{K G^{-1} Y - \operatorname{diag}_m(K G^{-1}) Y}{\operatorname{diag}_v(I - K G^{-1})}$,

    $L_E = Y - L_V$
      $= Y + \frac{\operatorname{diag}_m(K G^{-1}) Y - K G^{-1} Y}{\operatorname{diag}_v(I - K G^{-1})}$
      $= \frac{\operatorname{diag}_m(I - K G^{-1}) Y}{\operatorname{diag}_v(I - K G^{-1})} + \frac{\operatorname{diag}_m(K G^{-1}) Y - K G^{-1} Y}{\operatorname{diag}_v(I - K G^{-1})}$
      $= \frac{Y - K G^{-1} Y}{\operatorname{diag}_v(I - K G^{-1})}$.

  24. $L_E$ derivation, VI. We can simplify our expressions in a way that leads to better computational and numerical properties by noting that

    $K G^{-1} = Q \Lambda Q^T Q (\Lambda + \lambda I)^{-1} Q^T = Q \Lambda (\Lambda + \lambda I)^{-1} Q^T = Q (\Lambda + \lambda I - \lambda I)(\Lambda + \lambda I)^{-1} Q^T = I - \lambda G^{-1}$.

  In particular $I - K G^{-1} = \lambda G^{-1}$, so the leave-one-out errors become $L_E = \frac{\lambda G^{-1} Y}{\lambda \operatorname{diag}_v(G^{-1})} = \frac{c(\lambda)}{\operatorname{diag}_v(G^{-1})}$, with everything on the right cheap to compute from the eigendecomposition.
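
  Putting the pieces together, a sketch that computes the LOO error vector for every λ on the earlier grid and picks the λ minimizing $\|L_E\|_2^2$; it reuses Q, evals, C, and lambdas from the varying-λ sketch.

    % LOO errors for every lambda: L_E = c(lambda) ./ diag(G^{-1}(lambda)).
    LOO = zeros(n, numel(lambdas));
    for t = 1:numel(lambdas)
        dGinv = (Q .^ 2) * (1 ./ (evals + lambdas(t)));   % diagonal of Q*(L+lambda*I)^{-1}*Q'
        LOO(:, t) = C(:, t) ./ dGinv;
    end
    [~, best] = min(sum(LOO .^ 2, 1));       % lambda with smallest ||L_E||_2^2
    best_lambda = lambdas(best)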
