Regularized Least Squares
Charlie Frogner, MIT, 2011
Slides mostly stolen from Ryan Rifkin (Google).
Summary
In RLS, the Tikhonov minimization problem boils down to solving a linear system (and this is good).
We can compute the solution for each of a bunch of λ's by using the eigendecomposition of the kernel matrix.
We can compute the leave-one-out error over the whole training set about as cheaply as solving the minimization problem once.
The linear kernel allows us to do all of this when $n \gg d$.
Basics: Data
Training set: $S = \{(x_1, y_1), \ldots, (x_n, y_n)\}$.
Inputs: $X = \{x_1, \ldots, x_n\}$.
Labels: $Y = \{y_1, \ldots, y_n\}$.
Basics: RKHS, Kernel
RKHS $\mathcal{H}$ with a positive semidefinite kernel function $K$:
  linear: $K(x_i, x_j) = x_i^T x_j$
  polynomial: $K(x_i, x_j) = (x_i^T x_j + 1)^d$
  gaussian: $K(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{\sigma^2}\right)$
Define the kernel matrix $K$ to satisfy $K_{ij} = K(x_i, x_j)$.
The kernel function with one argument fixed is $K_x = K(x, \cdot)$.
Given an arbitrary input $x^*$, $K_{x^*}$ is a vector whose $i$th entry is $K(x_i, x^*)$. (So the training set $X$ is assumed.)
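A minimal sketch in Octave/Matlab (the language used later in these slides) of building the three kernel matrices above; the data X, degree deg, and bandwidth sig are illustrative placeholders, not values fixed by the slides:

    n = 100;  dd = 5;                               % illustrative sizes
    X = randn(n, dd);  deg = 2;  sig = 1;           % illustrative data and kernel parameters
    D2 = sum(X.^2, 2) + sum(X.^2, 2)' - 2*(X*X');   % squared pairwise distances
    Klin   = X*X';                                  % linear kernel
    Kpoly  = (X*X' + 1).^deg;                       % polynomial kernel of degree deg
    Kgauss = exp(-D2 / sig^2);                      % gaussian kernel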
The RLS Setup
Goal: Find the function $f \in \mathcal{H}$ that minimizes the weighted sum of the square loss and the RKHS norm:
$$\operatorname*{argmin}_{f \in \mathcal{H}} \; \frac{1}{2} \sum_{i=1}^{n} (f(x_i) - y_i)^2 + \frac{\lambda}{2} \|f\|_{\mathcal{H}}^2. \qquad (1)$$
This loss function makes sense for regression. We can also use it for binary classification, where it is less immediately intuitive but works great.
Also called "ridge regression."
Applying the Representer Theorem
Claim: We can rewrite (1) as
$$\operatorname*{argmin}_{c \in \mathbb{R}^n} \; \frac{1}{2}\|Y - Kc\|_2^2 + \frac{\lambda}{2}\|f\|_{\mathcal{H}}^2.$$
Proof: The representer theorem guarantees that the solution to (1) can be written as
$$f(\cdot) = \sum_{j=1}^{n} c_j K_{x_j}(\cdot)$$
for some $c \in \mathbb{R}^n$. So $Kc$ gives a vector whose $i$th element is $f(x_i)$:
$$f(x_i) = \sum_{j=1}^{n} c_j K_{x_i}(x_j) = \sum_{j=1}^{n} c_j K_{ij} = (K_{i,\cdot})\, c.$$
Applying the Representer Theorem, Part II
Claim: $\|f\|_{\mathcal{H}}^2 = c^T K c$.
Proof:
$$f(\cdot) = \sum_{j=1}^{n} c_j K_{x_j}(\cdot),$$
so
$$\|f\|_{\mathcal{H}}^2 = \langle f, f \rangle_{\mathcal{H}} = \left\langle \sum_{i=1}^{n} c_i K_{x_i}, \sum_{j=1}^{n} c_j K_{x_j} \right\rangle_{\mathcal{H}} = \sum_{i=1}^{n} \sum_{j=1}^{n} c_i c_j \langle K_{x_i}, K_{x_j} \rangle_{\mathcal{H}} = \sum_{i=1}^{n} \sum_{j=1}^{n} c_i c_j K(x_i, x_j) = c^T K c.$$
The RLS Solution
Putting it all together, the RLS problem is:
$$\operatorname*{argmin}_{c \in \mathbb{R}^n} \; \frac{1}{2}\|Y - Kc\|_2^2 + \frac{\lambda}{2}\, c^T K c.$$
This is convex in $c$ (why?), so we can find its minimum by setting the gradient w.r.t. $c$ to $0$:
$$-K(Y - Kc) + \lambda K c = 0$$
$$(K + \lambda I)\, c = Y$$
$$c = (K + \lambda I)^{-1} Y$$
We find $c$ by solving a system of linear equations.
The RLS Solution, Comments
The solution exists and is unique (for $\lambda > 0$).
Define $G(\lambda) = K + \lambda I$. (Often $\lambda$ is clear from context and we write $G$.)
The prediction at a new test input $x^*$ is:
$$f(x^*) = \sum_{j=1}^{n} c_j K_{x_j}(x^*) = K_{x^*}\, c = K_{x^*}\, G^{-1} Y.$$
The use of $G^{-1}$ (or other inverses) is formal only. We do not recommend taking matrix inverses.
Solving RLS, Parameters Fixed
Situation: All hyperparameters are fixed.
We just need to solve a single linear system $(K + \lambda I)\, c = Y$.
The matrix $K + \lambda I$ is symmetric positive definite, so the appropriate algorithm is Cholesky factorization.
In Matlab, the "slash" operator seems to use Cholesky, so you can just write c = (K+l*I) \ Y, but to be safe (or in Octave), I suggest R = chol(K+l*I); c = (R \ (R' \ Y));
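A minimal end-to-end sketch in Octave/Matlab, putting the Cholesky recipe above together with the prediction formula from the previous slide; the synthetic data, the gaussian kernel, and the names (Xtr, Ytr, Xte, lam, sig) are illustrative assumptions, not part of the slides:

    n = 200;  dd = 3;
    Xtr = randn(n, dd);  Ytr = sin(Xtr(:,1)) + 0.1*randn(n, 1);   % synthetic training data
    sig = 1;  lam = 1e-2;                                         % illustrative hyperparameters
    sqdist = @(A, B) sum(A.^2, 2) + sum(B.^2, 2)' - 2*(A*B');
    K = exp(-sqdist(Xtr, Xtr) / sig^2);          % kernel matrix on the training set
    R = chol(K + lam*eye(n));                    % K + lam*I is symmetric positive definite
    c = R \ (R' \ Ytr);                          % two triangular solves, no explicit inverse
    Xte = randn(50, dd);                         % new test inputs
    Ktest = exp(-sqdist(Xte, Xtr) / sig^2);      % row i is K_{x*} for the ith test point
    Ypred = Ktest * c;                           % f(x*) = K_{x*} c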
Solving RLS, Varying λ
Situation: We don't know what λ to use; all other hyperparameters are fixed.
Is there a more efficient method than solving $c(\lambda) = (K + \lambda I)^{-1} Y$ anew for each $\lambda$?
Form the eigendecomposition $K = Q \Lambda Q^T$, where $\Lambda$ is diagonal with $\Lambda_{ii} \geq 0$ and $Q Q^T = I$. Then
$$G = K + \lambda I = Q \Lambda Q^T + \lambda I = Q (\Lambda + \lambda I)\, Q^T,$$
which implies that $G^{-1} = Q (\Lambda + \lambda I)^{-1} Q^T$.
Solving RLS, Varying λ, Cont'd
$O(n^3)$ time to solve one (dense) linear system, or to compute the eigendecomposition (the constant is maybe 4x worse).
Given $Q$ and $\Lambda$, we can find $c(\lambda)$ in $O(n^2)$ time:
$$c(\lambda) = Q (\Lambda + \lambda I)^{-1} Q^T Y,$$
noting that $(\Lambda + \lambda I)$ is diagonal.
Finding $c(\lambda)$ for many $\lambda$'s is (essentially) free!
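A minimal sketch in Octave/Matlab of this reuse-the-eigendecomposition trick; the synthetic K, Y, and the lambda grid are illustrative stand-ins for a real kernel matrix, labels, and search range:

    n = 200;
    A = randn(n, n);  K = (A*A')/n;              % any symmetric PSD matrix stands in for K
    Y = randn(n, 1);
    [Q, L] = eig((K + K')/2);                    % O(n^3), done once
    evals = diag(L);  QtY = Q' * Y;              % O(n^2), done once
    lambdas = logspace(-6, 2, 25);
    C = zeros(n, numel(lambdas));
    for k = 1:numel(lambdas)
        C(:, k) = Q * (QtY ./ (evals + lambdas(k)));   % c(lambda), O(n^2) per lambda
    end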
Validation
We showed how to find $c(\lambda)$ quickly as we vary $\lambda$.
But how do we decide if a given $\lambda$ is "good"?
Simplest idea: Use the training set error. Problem: This invariably overfits. Don't do this!
Other methods are possible, but today we consider validation.
Validation means checking our function's behavior on points other than the training set.
Types of Validation
If we have a huge amount of data, we could hold back some percentage of our data (30% is typical), and use this development set to choose hyperparameters.
More common is k-fold cross-validation, which means a couple of different things:
- Divide your data into $k$ equal sets $S_1, \ldots, S_k$. For each $i$, train on the other $k-1$ sets and test on the $i$th set.
- A total of $k$ times, randomly split your data into a training and test set.
The limit of (the first kind of) k-fold validation is leave-one-out cross-validation.
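A minimal sketch in Octave/Matlab of the first kind of k-fold split described above; n, the number of folds, and the placeholder training/testing step are illustrative:

    n = 200;  kfold = 5;
    perm = randperm(n);                          % shuffle the indices once
    foldid = mod(0:n-1, kfold) + 1;              % assign each shuffled index to a fold
    for i = 1:kfold
        testidx  = perm(foldid == i);            % the ith held-out set S_i
        trainidx = perm(foldid ~= i);            % the other k-1 sets
        % ... train on trainidx, evaluate on testidx, accumulate validation error ...
    end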
Leave-One-Out Cross-Validation
For each data point $x_i$, build a classifier using the remaining $n-1$ data points, and measure the error at $x_i$.
Empirically, this seems to be the method of choice when $n$ is small.
Problem: We have to build $n$ different predictors, on data sets of size $n-1$.
We will now proceed to show that for RLS, obtaining the LOO error is (essentially) free!
Leave-One-Out CV: Notation
Define $S^i$ to be the data set with the $i$th point removed:
$$S^i = \{(x_1, y_1), \ldots, (x_{i-1}, y_{i-1}), \text{*poof*}, (x_{i+1}, y_{i+1}), \ldots, (x_n, y_n)\}$$
The $i$th leave-one-out value is $f_{S^i}(x_i)$.
The $i$th leave-one-out error is $y_i - f_{S^i}(x_i)$.
Define $L_V$ and $L_E$ to be the vectors of leave-one-out values and errors over the training set.
$\|L_E\|_2^2$ is considered a good empirical proxy for the error on future points, and we often want to choose parameters by minimizing this quantity.
$L_E$ derivation, I
Imagine that we already know $f_{S^i}(x_i)$. Define the vector $Y^i$ via
$$y^i_j = \begin{cases} y_j & j \neq i \\ f_{S^i}(x_i) & j = i \end{cases}$$
$L_E$ derivation, II
Claim: Solving RLS using $Y^i$ gives us $f_{S^i}$, i.e.
$$f_{S^i} = \operatorname*{argmin}_{f \in \mathcal{H}} \; \frac{1}{2}\sum_{j=1}^{n} (y^i_j - f(x_j))^2 + \frac{\lambda}{2}\|f\|_{\mathcal{H}}^2 = (*).$$
Proof: Split the objective $(*)$ into the $j = i$ term and the rest:
$$(1) = \frac{1}{2}(y^i_i - f(x_i))^2 \geq 0 \;\; \forall f, \qquad (2) = \frac{1}{2}\sum_{j \neq i} (y^i_j - f(x_j))^2 + \frac{\lambda}{2}\|f\|_{\mathcal{H}}^2.$$
Since $(y^i_i - f_{S^i}(x_i))^2 = (f_{S^i}(x_i) - f_{S^i}(x_i))^2 = 0$, $f_{S^i}$ minimizes (1).
$f_{S^i}$ also minimizes (2), because $y^i_j = y_j$ for $j \neq i$, so (2) is exactly the RLS objective on $S^i$.
$\Rightarrow$ $f_{S^i}$ minimizes $(*) = (1) + (2)$.
$L_E$ derivation, III
Therefore,
$$c^i = G^{-1} Y^i, \qquad f_{S^i}(x_i) = (K G^{-1} Y^i)_i.$$
This is circular reasoning so far, because we need to know $f_{S^i}(x_i)$ to form $Y^i$ in the first place.
However, assuming we have already solved RLS for the whole training set and computed $f_S(X) = K G^{-1} Y$, we can do something nice . . .
$L_E$ derivation, IV
$$f_{S^i}(x_i) - f_S(x_i) = \sum_j (KG^{-1})_{ij}\,(y^i_j - y_j) = (KG^{-1})_{ii}\,(f_{S^i}(x_i) - y_i)$$
$$f_{S^i}(x_i) = \frac{f_S(x_i) - (KG^{-1})_{ii}\, y_i}{1 - (KG^{-1})_{ii}} = \frac{(KG^{-1}Y)_i - (KG^{-1})_{ii}\, y_i}{1 - (KG^{-1})_{ii}}.$$
$L_E$ derivation, V
$$L_V = \frac{KG^{-1}Y - \operatorname{diag}_m(KG^{-1})\,Y}{\operatorname{diag}_v(I - KG^{-1})},$$
$$L_E = Y - L_V = Y + \frac{\operatorname{diag}_m(KG^{-1})\,Y - KG^{-1}Y}{\operatorname{diag}_v(I - KG^{-1})} = \frac{\operatorname{diag}_m(I - KG^{-1})\,Y + \operatorname{diag}_m(KG^{-1})\,Y - KG^{-1}Y}{\operatorname{diag}_v(I - KG^{-1})} = \frac{Y - KG^{-1}Y}{\operatorname{diag}_v(I - KG^{-1})}.$$
Here $\operatorname{diag}_m(A)$ is the diagonal matrix formed from the diagonal of $A$, $\operatorname{diag}_v(A)$ is the vector of its diagonal entries, and division by a vector is elementwise.
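A minimal sketch in Octave/Matlab of these closed-form LOO values and errors; the synthetic K, Y, and lam are illustrative, and K/G is used so that $K G^{-1}$ comes from a linear solve rather than an explicit inverse:

    n = 200;
    A = randn(n, n);  K = (A*A')/n;  Y = randn(n, 1);   % illustrative kernel matrix and labels
    lam = 1e-2;
    G = K + lam*eye(n);
    KGinv = K / G;                               % K * inv(G) via a solve (G is symmetric)
    d = diag(KGinv);                             % the entries (K G^{-1})_{ii}
    LV = (KGinv*Y - d.*Y) ./ (1 - d);            % leave-one-out values L_V
    LE = (Y - KGinv*Y) ./ (1 - d);               % leave-one-out errors L_E = Y - L_V
    loo_err = sum(LE.^2);                        % ||L_E||^2, the quantity to minimize over lambda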