Regularized Least Squares
Charlie Frogner, MIT, 2011
Slides mostly stolen from Ryan Rifkin (Google).
Summary
In RLS, the Tikhonov minimization problem boils down to solving a linear system (and this is good).
We can compute the solution for each of a bunch of λ's by using the eigendecomposition of the kernel matrix.
We can compute the leave-one-out error over the whole training set about as cheaply as solving the minimization problem once.
The linear kernel allows us to do all of this when $n \gg d$.
Basics: Data
Training set: $S = \{(x_1, y_1), \ldots, (x_n, y_n)\}$.
Inputs: $X = \{x_1, \ldots, x_n\}$.
Labels: $Y = \{y_1, \ldots, y_n\}$.
Basics: RKHS, Kernel
RKHS $\mathcal{H}$ with a positive semidefinite kernel function $K$:
  linear: $K(x_i, x_j) = x_i^T x_j$
  polynomial: $K(x_i, x_j) = (x_i^T x_j + 1)^d$
  gaussian: $K(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{\sigma^2}\right)$
Define the kernel matrix $K$ to satisfy $K_{ij} = K(x_i, x_j)$.
The kernel function with one argument fixed is $K_x = K(x, \cdot)$.
Given an arbitrary input $x^*$, $K_{x^*}$ is a vector whose $i$th entry is $K(x_i, x^*)$. (So the training set $X$ is assumed.)
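A minimal sketch in Octave/Matlab (the language used later in these slides) of building the three kernel matrices above; the data X, degree deg, and bandwidth sig are illustrative placeholders, not values fixed by the slides:

    n = 100;  dd = 5;                               % illustrative sizes
    X = randn(n, dd);  deg = 2;  sig = 1;           % illustrative data and kernel parameters
    D2 = sum(X.^2, 2) + sum(X.^2, 2)' - 2*(X*X');   % squared pairwise distances
    Klin   = X*X';                                  % linear kernel
    Kpoly  = (X*X' + 1).^deg;                       % polynomial kernel of degree deg
    Kgauss = exp(-D2 / sig^2);                      % gaussian kernel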
The RLS Setup
Goal: Find the function $f \in \mathcal{H}$ that minimizes the weighted sum of the square loss and the RKHS norm:
$$\operatorname*{argmin}_{f \in \mathcal{H}} \; \frac{1}{2} \sum_{i=1}^{n} (f(x_i) - y_i)^2 + \frac{\lambda}{2} \|f\|_{\mathcal{H}}^2. \qquad (1)$$
This loss function makes sense for regression. We can also use it for binary classification, where it is less immediately intuitive but works great.
Also called "ridge regression."
Applying the Representer Theorem
Claim: We can rewrite (1) as
$$\operatorname*{argmin}_{c \in \mathbb{R}^n} \; \frac{1}{2}\|Y - Kc\|_2^2 + \frac{\lambda}{2}\|f\|_{\mathcal{H}}^2.$$
Proof: The representer theorem guarantees that the solution to (1) can be written as
$$f(\cdot) = \sum_{j=1}^{n} c_j K_{x_j}(\cdot)$$
for some $c \in \mathbb{R}^n$. So $Kc$ gives a vector whose $i$th element is $f(x_i)$:
$$f(x_i) = \sum_{j=1}^{n} c_j K_{x_i}(x_j) = \sum_{j=1}^{n} c_j K_{ij} = (K_{i,\cdot})\, c.$$
Applying the Representer Theorem, Part II
Claim: $\|f\|_{\mathcal{H}}^2 = c^T K c$.
Proof:
$$f(\cdot) = \sum_{j=1}^{n} c_j K_{x_j}(\cdot),$$
so
$$\|f\|_{\mathcal{H}}^2 = \langle f, f \rangle_{\mathcal{H}} = \left\langle \sum_{i=1}^{n} c_i K_{x_i}, \sum_{j=1}^{n} c_j K_{x_j} \right\rangle_{\mathcal{H}} = \sum_{i=1}^{n} \sum_{j=1}^{n} c_i c_j \langle K_{x_i}, K_{x_j} \rangle_{\mathcal{H}} = \sum_{i=1}^{n} \sum_{j=1}^{n} c_i c_j K(x_i, x_j) = c^T K c.$$
The RLS Solution
Putting it all together, the RLS problem is:
$$\operatorname*{argmin}_{c \in \mathbb{R}^n} \; \frac{1}{2}\|Y - Kc\|_2^2 + \frac{\lambda}{2}\, c^T K c.$$
This is convex in $c$ (why?), so we can find its minimum by setting the gradient w.r.t. $c$ to $0$:
$$-K(Y - Kc) + \lambda K c = 0$$
$$(K + \lambda I)\, c = Y$$
$$c = (K + \lambda I)^{-1} Y$$
We find $c$ by solving a system of linear equations.
The RLS Solution, Comments
The solution exists and is unique (for $\lambda > 0$).
Define $G(\lambda) = K + \lambda I$. (Often $\lambda$ is clear from context and we write $G$.)
The prediction at a new test input $x^*$ is:
$$f(x^*) = \sum_{j=1}^{n} c_j K_{x_j}(x^*) = K_{x^*}\, c = K_{x^*}\, G^{-1} Y.$$
The use of $G^{-1}$ (or other inverses) is formal only. We do not recommend taking matrix inverses.
Solving RLS, Parameters Fixed
Situation: All hyperparameters are fixed.
We just need to solve a single linear system $(K + \lambda I)\, c = Y$.
The matrix $K + \lambda I$ is symmetric positive definite, so the appropriate algorithm is Cholesky factorization.
In Matlab, the "slash" operator seems to use Cholesky, so you can just write c = (K+l*I) \ Y, but to be safe (or in Octave), I suggest R = chol(K+l*I); c = (R \ (R' \ Y));
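A minimal end-to-end sketch in Octave/Matlab, putting the Cholesky recipe above together with the prediction formula from the previous slide; the synthetic data, the gaussian kernel, and the names (Xtr, Ytr, Xte, lam, sig) are illustrative assumptions, not part of the slides:

    n = 200;  dd = 3;
    Xtr = randn(n, dd);  Ytr = sin(Xtr(:,1)) + 0.1*randn(n, 1);   % synthetic training data
    sig = 1;  lam = 1e-2;                                         % illustrative hyperparameters
    sqdist = @(A, B) sum(A.^2, 2) + sum(B.^2, 2)' - 2*(A*B');
    K = exp(-sqdist(Xtr, Xtr) / sig^2);          % kernel matrix on the training set
    R = chol(K + lam*eye(n));                    % K + lam*I is symmetric positive definite
    c = R \ (R' \ Ytr);                          % two triangular solves, no explicit inverse
    Xte = randn(50, dd);                         % new test inputs
    Ktest = exp(-sqdist(Xte, Xtr) / sig^2);      % row i is K_{x*} for the ith test point
    Ypred = Ktest * c;                           % f(x*) = K_{x*} c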
Solving RLS, Varying λ
Situation: We don't know what λ to use; all other hyperparameters are fixed.
Is there a more efficient method than solving $c(\lambda) = (K + \lambda I)^{-1} Y$ anew for each $\lambda$?
Form the eigendecomposition $K = Q \Lambda Q^T$, where $\Lambda$ is diagonal with $\Lambda_{ii} \geq 0$ and $Q Q^T = I$. Then
$$G = K + \lambda I = Q \Lambda Q^T + \lambda I = Q (\Lambda + \lambda I)\, Q^T,$$
which implies that $G^{-1} = Q (\Lambda + \lambda I)^{-1} Q^T$.
Solving RLS, Varying λ, Cont'd
$O(n^3)$ time to solve one (dense) linear system, or to compute the eigendecomposition (the constant is maybe 4x worse).
Given $Q$ and $\Lambda$, we can find $c(\lambda)$ in $O(n^2)$ time:
$$c(\lambda) = Q (\Lambda + \lambda I)^{-1} Q^T Y,$$
noting that $(\Lambda + \lambda I)$ is diagonal.
Finding $c(\lambda)$ for many $\lambda$'s is (essentially) free!
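A minimal sketch in Octave/Matlab of this reuse-the-eigendecomposition trick; the synthetic K, Y, and the lambda grid are illustrative stand-ins for a real kernel matrix, labels, and search range:

    n = 200;
    A = randn(n, n);  K = (A*A')/n;              % any symmetric PSD matrix stands in for K
    Y = randn(n, 1);
    [Q, L] = eig((K + K')/2);                    % O(n^3), done once
    evals = diag(L);  QtY = Q' * Y;              % O(n^2), done once
    lambdas = logspace(-6, 2, 25);
    C = zeros(n, numel(lambdas));
    for k = 1:numel(lambdas)
        C(:, k) = Q * (QtY ./ (evals + lambdas(k)));   % c(lambda), O(n^2) per lambda
    end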
Validation
We showed how to find $c(\lambda)$ quickly as we vary $\lambda$.
But how do we decide if a given $\lambda$ is "good"?
Simplest idea: Use the training set error. Problem: This invariably overfits. Don't do this!
Other methods are possible, but today we consider validation.
Validation means checking our function's behavior on points other than the training set.
Types of Validation
If we have a huge amount of data, we could hold back some percentage of our data (30% is typical), and use this development set to choose hyperparameters.
More common is k-fold cross-validation, which means a couple of different things:
- Divide your data into $k$ equal sets $S_1, \ldots, S_k$. For each $i$, train on the other $k-1$ sets and test on the $i$th set.
- A total of $k$ times, randomly split your data into a training and test set.
The limit of (the first kind of) k-fold validation is leave-one-out cross-validation.
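A minimal sketch in Octave/Matlab of the first kind of k-fold split described above; n, the number of folds, and the placeholder training/testing step are illustrative:

    n = 200;  kfold = 5;
    perm = randperm(n);                          % shuffle the indices once
    foldid = mod(0:n-1, kfold) + 1;              % assign each shuffled index to a fold
    for i = 1:kfold
        testidx  = perm(foldid == i);            % the ith held-out set S_i
        trainidx = perm(foldid ~= i);            % the other k-1 sets
        % ... train on trainidx, evaluate on testidx, accumulate validation error ...
    end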
Leave-One-Out Cross-Validation
For each data point $x_i$, build a classifier using the remaining $n-1$ data points, and measure the error at $x_i$.
Empirically, this seems to be the method of choice when $n$ is small.
Problem: We have to build $n$ different predictors, on data sets of size $n-1$.
We will now proceed to show that for RLS, obtaining the LOO error is (essentially) free!
Leave-One-Out CV: Notation
Define $S^i$ to be the data set with the $i$th point removed:
$$S^i = \{(x_1, y_1), \ldots, (x_{i-1}, y_{i-1}), \text{*poof*}, (x_{i+1}, y_{i+1}), \ldots, (x_n, y_n)\}$$
The $i$th leave-one-out value is $f_{S^i}(x_i)$.
The $i$th leave-one-out error is $y_i - f_{S^i}(x_i)$.
Define $L_V$ and $L_E$ to be the vectors of leave-one-out values and errors over the training set.
$\|L_E\|_2^2$ is considered a good empirical proxy for the error on future points, and we often want to choose parameters by minimizing this quantity.
$L_E$ derivation, I
Imagine that we already know $f_{S^i}(x_i)$. Define the vector $Y^i$ via
$$y^i_j = \begin{cases} y_j & j \neq i \\ f_{S^i}(x_i) & j = i \end{cases}$$
$L_E$ derivation, II
Claim: Solving RLS using $Y^i$ gives us $f_{S^i}$, i.e.
$$f_{S^i} = \operatorname*{argmin}_{f \in \mathcal{H}} \; \frac{1}{2}\sum_{j=1}^{n} (y^i_j - f(x_j))^2 + \frac{\lambda}{2}\|f\|_{\mathcal{H}}^2 = (*).$$
Proof: Split the objective $(*)$ into the $j = i$ term and the rest:
$$(1) = \frac{1}{2}(y^i_i - f(x_i))^2 \geq 0 \;\; \forall f, \qquad (2) = \frac{1}{2}\sum_{j \neq i} (y^i_j - f(x_j))^2 + \frac{\lambda}{2}\|f\|_{\mathcal{H}}^2.$$
Since $(y^i_i - f_{S^i}(x_i))^2 = (f_{S^i}(x_i) - f_{S^i}(x_i))^2 = 0$, $f_{S^i}$ minimizes (1).
$f_{S^i}$ also minimizes (2), because $y^i_j = y_j$ for $j \neq i$, so (2) is exactly the RLS objective on $S^i$.
$\Rightarrow$ $f_{S^i}$ minimizes $(*) = (1) + (2)$.
$L_E$ derivation, III
Therefore,
$$c^i = G^{-1} Y^i, \qquad f_{S^i}(x_i) = (K G^{-1} Y^i)_i.$$
This is circular reasoning so far, because we need to know $f_{S^i}(x_i)$ to form $Y^i$ in the first place.
However, assuming we have already solved RLS for the whole training set and computed $f_S(X) = K G^{-1} Y$, we can do something nice . . .
$L_E$ derivation, IV
$$f_{S^i}(x_i) - f_S(x_i) = \sum_j (KG^{-1})_{ij}\,(y^i_j - y_j) = (KG^{-1})_{ii}\,(f_{S^i}(x_i) - y_i)$$
$$f_{S^i}(x_i) = \frac{f_S(x_i) - (KG^{-1})_{ii}\, y_i}{1 - (KG^{-1})_{ii}} = \frac{(KG^{-1}Y)_i - (KG^{-1})_{ii}\, y_i}{1 - (KG^{-1})_{ii}}.$$
$L_E$ derivation, V
$$L_V = \frac{KG^{-1}Y - \operatorname{diag}_m(KG^{-1})\,Y}{\operatorname{diag}_v(I - KG^{-1})},$$
$$L_E = Y - L_V = Y + \frac{\operatorname{diag}_m(KG^{-1})\,Y - KG^{-1}Y}{\operatorname{diag}_v(I - KG^{-1})} = \frac{\operatorname{diag}_m(I - KG^{-1})\,Y + \operatorname{diag}_m(KG^{-1})\,Y - KG^{-1}Y}{\operatorname{diag}_v(I - KG^{-1})} = \frac{Y - KG^{-1}Y}{\operatorname{diag}_v(I - KG^{-1})}.$$
Here $\operatorname{diag}_m(A)$ is the diagonal matrix formed from the diagonal of $A$, $\operatorname{diag}_v(A)$ is the vector of its diagonal entries, and division by a vector is elementwise.
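A minimal sketch in Octave/Matlab of these closed-form LOO values and errors; the synthetic K, Y, and lam are illustrative, and K/G is used so that $K G^{-1}$ comes from a linear solve rather than an explicit inverse:

    n = 200;
    A = randn(n, n);  K = (A*A')/n;  Y = randn(n, 1);   % illustrative kernel matrix and labels
    lam = 1e-2;
    G = K + lam*eye(n);
    KGinv = K / G;                               % K * inv(G) via a solve (G is symmetric)
    d = diag(KGinv);                             % the entries (K G^{-1})_{ii}
    LV = (KGinv*Y - d.*Y) ./ (1 - d);            % leave-one-out values L_V
    LE = (Y - KGinv*Y) ./ (1 - d);               % leave-one-out errors L_E = Y - L_V
    loo_err = sum(LE.^2);                        % ||L_E||^2, the quantity to minimize over lambda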