Condition estimation of linear algebraic equations and its application to feature selection

Joab Winkler
Department of Computer Science, The University of Sheffield, Sheffield, United Kingdom

Institute of High Performance Computing, A*STAR (Agency for Science, Technology and Research), Singapore
January 2019
Outline

1. Introduction
2. Mathematical background
   Regression
3. Condition numbers and regularisation
   The effective condition number
   The discrete Picard condition
   Tikhonov regularisation
   Componentwise condition numbers
   Feature selection
4. Feature selection and condition estimation
5. Summary
Introduction

Several problems require the prediction of the output of a physical system for which the sample size n is much smaller than the dimension of the data p:
- Chemometrics
- Brain imaging
- Genomics
- Gene selection from microarray data
- Text analysis

The condition n < p implies that there are many models that satisfy the given data, and important issues therefore arise:
- Which model from this infinite set of models should be chosen?
- What is the criterion that should be used for this selection?
- Can the selection be generic, that is, not problem dependent, such that prior information is not required?
Mathematical background

These problems yield an equation of the form
$$ Ax = b + \varepsilon, \qquad A \in \mathbb{R}^{m \times n}, \quad b \in \mathbb{R}^{m}, \quad x \in \mathbb{R}^{n}, $$
where $m < n$, $\operatorname{rank} A = m$ and $\varepsilon$ is the noise. The least squares minimisation of $\|\varepsilon\|$ leads to the normal equation $A^{T} A x = A^{T} b$, whose solution is
$$ x_{\mathrm{soln}} = A^{\dagger} b = V S^{\dagger} U^{T} b, $$
where the superscript $\dagger$ denotes the pseudo-inverse. The general solution is
$$ x_{\mathrm{soln}} = x_{\mathrm{ln}} + x_{0}, \qquad x_{\mathrm{ln}} = V \begin{bmatrix} S_{1}^{-1} U^{T} b \\ 0_{n-m} \end{bmatrix}, \qquad x_{0} = V \begin{bmatrix} 0_{m} \\ r \end{bmatrix}, $$
where $x_{\mathrm{ln}}$ is the minimum norm solution, $x_{0}$ lies in the null space of $A$, $r \in \mathbb{R}^{n-m}$ is arbitrary, and
$$ S = \begin{bmatrix} S_{1} & 0_{m, n-m} \end{bmatrix}. $$
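The construction of $x_{\mathrm{ln}}$ and $x_{0}$ from the SVD can be checked numerically. The following minimal sketch (NumPy; the sizes and random data are purely illustrative) computes the minimum norm solution and verifies that $x_{0}$ lies in the null space of $A$:

```python
import numpy as np

# Minimal sketch: minimum norm solution of an underdetermined system
# A x = b with m < n and rank(A) = m, computed via the SVD.
rng = np.random.default_rng(0)
m, n = 10, 30
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

U, s, Vt = np.linalg.svd(A, full_matrices=True)   # A = U S V^T, S is m x n
c = U.T @ b                                       # c = U^T b
x_ln = Vt.T[:, :m] @ (c / s)                      # V [ S_1^{-1} c ; 0_{n-m} ]

# The same solution is given by the pseudo-inverse A^dagger b
assert np.allclose(x_ln, np.linalg.pinv(A) @ b)

# Any x_0 = V [ 0_m ; r ] lies in the null space of A, so A (x_ln + x_0) = A x_ln
r = rng.standard_normal(n - m)
x0 = Vt.T[:, m:] @ r
assert np.allclose(A @ (x_ln + x0), A @ x_ln)
```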
The solution $x_{\mathrm{ln}}$ is unsatisfactory for two reasons:
- Prediction accuracy: This solution may have low bias and high variance. Prediction accuracy can sometimes be improved by reducing, or setting to zero, some coefficients of $x_{\mathrm{ln}}$.
- Interpretation: It is usually desirable to choose the most important components of $x_{\mathrm{ln}}$ that characterise the physical system being considered.

Methods that are used to overcome these problems:
- Ridge regression: The magnitude of the components of $x_{\mathrm{ln}}$ is reduced continuously.
  It is more stable than subset selection.
  It does not set any components to zero, and thus it does not yield a sparse model that can be easily interpreted.
- Subset selection: Components of $x_{\mathrm{ln}}$ are deleted in discrete steps.
  The models are strongly dependent on the components that are deleted because the elimination procedure is discrete.
  A small change in the data can cause a large change in the selected model, which reduces the prediction accuracy.
Ridge regression (Tikhonov regularisation)

The sensitivity of the solution $x_{\mathrm{ln}}$ to perturbations in $b$ can be reduced by a constraint on the magnitude of the solution $x_{\mathrm{reg}}$:
$$ x_{\mathrm{reg}} = \arg\min_{x} \left\{ (Ax - b)^{T} (Ax - b) + \lambda \|x\|^{2} \right\}, \qquad \lambda > 0, $$
and thus
$$ \left( A^{T} A + \lambda I \right) x_{\mathrm{reg}} = A^{T} b, \qquad \lambda > 0. $$

The lasso ('least absolute shrinkage and selection operator')

The lasso retains the advantages of ridge regression (stability) and subset selection (sparsity):
$$ x_{\mathrm{lasso}} = \arg\min_{x} \; (Ax - b)^{T} (Ax - b) \quad \text{subject to} \quad \|x\|_{1} \le t, $$
which can also be written as
$$ x_{\mathrm{lasso}} = \arg\min_{x} \left\{ (Ax - b)^{T} (Ax - b) + \lambda \|x\|_{1} \right\}, \qquad \lambda > 0. $$
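The regularised normal equation has a closed form solution. A small sketch (NumPy; the value of $\lambda$ and the data are illustrative) that solves $(A^{T}A + \lambda I)\,x_{\mathrm{reg}} = A^{T}b$ directly:

```python
import numpy as np

# Minimal sketch of Tikhonov (ridge) regularisation: solve the regularised
# normal equation (A^T A + lambda I) x_reg = A^T b for an illustrative lambda.
rng = np.random.default_rng(1)
m, n = 10, 30
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
lam = 1e-2                                    # illustrative value of lambda

x_reg = np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ b)

# As lambda -> 0, x_reg approaches the minimum norm solution x_ln
x_ln = np.linalg.pinv(A) @ b
print(np.linalg.norm(x_reg - x_ln), np.linalg.norm(x_reg))
```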
The elastic net

This method is an improvement on the lasso and it combines $L_{1}$ and $L_{2}$ regularisation:
$$ x_{\mathrm{elastic}} = \arg\min_{x} \left\{ (Ax - b)^{T} (Ax - b) + \lambda_{1} \|x\|_{1} + \lambda_{2} \|x\|^{2} \right\}, \qquad \lambda_{1}, \lambda_{2} > 0. $$

The solutions from Tikhonov regularisation, the lasso and the elastic net reduce the sensitivity of the least norm solution $x_{\mathrm{ln}}$ to perturbations in $b$, but there are differences between these forms of regularisation.
Compare Tikhonov regularisation
- Tikhonov regularisation imposes a Gaussian prior on the parameters of the model.
- Tikhonov regularisation does not impose sparsity on $x_{\mathrm{reg}}$.
- The solution $x_{\mathrm{reg}}$ has a closed form expression.

with the lasso
- The lasso imposes a Laplacian prior on the parameters of the model.
- The lasso favours sparse solutions because some coefficients of $x_{\mathrm{lasso}}$ are set to zero. The sparsity of $x_{\mathrm{lasso}}$ increases as $\lambda$ increases.
- The solution $x_{\mathrm{lasso}}$ does not have a closed form expression and quadratic programming is required for its computation.

and the elastic net
- The sparsity of $x_{\mathrm{elastic}}$ is similar to the sparsity of $x_{\mathrm{lasso}}$.
- The solution $x_{\mathrm{elastic}}$ favours a model in which strongly correlated predictors are usually either all included, or all excluded.
- The solution $x_{\mathrm{elastic}}$ is much better than $x_{\mathrm{lasso}}$ for some problems.
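The sparsity of the lasso and elastic net solutions can be illustrated with a library solver. The sketch below assumes scikit-learn is available; note that scikit-learn scales the penalty by the sample size and parameterises the elastic net by (alpha, l1_ratio) rather than $(\lambda_{1}, \lambda_{2})$, so the values used are only illustrative, not a reproduction of the formulations above.

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

# Illustrative comparison of the sparsity of the lasso and elastic net
# solutions for an underdetermined problem with a sparse underlying model.
rng = np.random.default_rng(2)
m, n = 50, 200
A = rng.standard_normal((m, n))
x_true = np.zeros(n)
x_true[:5] = [3.0, -2.0, 1.5, -1.0, 0.5]          # sparse underlying model
b = A @ x_true + 1e-3 * rng.standard_normal(m)

x_lasso = Lasso(alpha=0.05, fit_intercept=False).fit(A, b).coef_
x_elastic = ElasticNet(alpha=0.05, l1_ratio=0.5, fit_intercept=False).fit(A, b).coef_

print("non-zero coefficients, lasso:      ", np.count_nonzero(x_lasso))
print("non-zero coefficients, elastic net:", np.count_nonzero(x_elastic))
```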
A regularised solution (Tikhonov, the lasso and the elastic net) is stable with respect to perturbations in $b$, but several points arise:
- Is regularisation always required when the data $b$ are corrupted by noise?
- Must specific conditions on $A$ and $b$ be satisfied in order that regularisation is imposed only when it is required?
- What are the consequences of applying regularisation when it is not required?
- If regularisation is required, then
$$ r_{\mathrm{method}} = x_{\mathrm{ln}} - x_{\mathrm{method}} \neq 0, \qquad \mathrm{method} \in \{ \mathrm{reg}, \mathrm{lasso}, \mathrm{elastic} \}. $$
  Can bounds be imposed on $\|r_{\mathrm{reg}}\|$, $\|r_{\mathrm{lasso}}\|$ and $\|r_{\mathrm{elastic}}\|$, such that these errors induced by regularisation are quantified?

The answers to these questions are most easily obtained if Tikhonov regularisation is considered because the constraint in the 2-norm lends itself naturally to the SVD.
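For Tikhonov regularisation the error $r_{\mathrm{reg}}$ is easy to evaluate numerically. A small sketch (NumPy; illustrative data) showing how $\|r_{\mathrm{reg}}\|$ grows with $\lambda$:

```python
import numpy as np

# Illustrative computation of the regularisation error r_reg = x_ln - x_reg
# and its dependence on lambda.
rng = np.random.default_rng(5)
m, n = 10, 30
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
x_ln = np.linalg.pinv(A) @ b

for lam in (1e-8, 1e-4, 1e-1, 1.0):
    x_reg = np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ b)
    r_reg = x_ln - x_reg
    print(f"lambda = {lam:8.1e}   ||r_reg|| / ||x_ln|| = "
          f"{np.linalg.norm(r_reg) / np.linalg.norm(x_ln):.3e}")
```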
Regression

The use of regularisation is usually justified for three reasons:
- It reduces or eliminates over-fitting in regression.
- It reduces the sensitivity of the regression curve to noise in the data.
- It imposes a unique solution in feature selection, $Ax = b$, where $A \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^{m}$, $x \in \mathbb{R}^{n}$, $m < n$ and $\operatorname{rank} A = m$.

But there are well-defined problems for which regularisation must not be used because it causes a large degradation in the solution, and thus: Can a quantitative test be established such that regularisation is used only when it is required?

Regression provides a good example of the correct use, and the incorrect use, of regularisation.
Example 1

Consider the points $(x_{i}, y_{i})$, $i = 1, \ldots, 100$, where the independent variables $x_{i}$ are not uniformly distributed in the interval $I = [1, 20]$, and the dependent variables $y_{i}$ are given by
$$ y_{i} = \sum_{k=1}^{33} a_{k} \exp\left( \frac{-(x_{i} - d_{k})^{2}}{2 \sigma_{d}^{2}} \right), \qquad i = 1, \ldots, 100, $$
where the centres $d_{k}$ of the 33 basis functions are uniformly distributed in $I$ and $\sigma_{d} = 1.35$.

Consider two sets of data points, $y = y_{1}$ and $y = y_{2}$, and the perturbations $\delta y_{1}$ and $\delta y_{2}$,
$$ \delta y_{1}, \delta y_{2} \sim \mathcal{N}\left( \mu = 0, \; \sigma^{2} = 25 \times 10^{-8} \right), $$
and
$$ \frac{\|\delta y_{1}\|}{\|y_{1}\|} = 3.41 \times 10^{-6} \qquad \text{and} \qquad \frac{\|\delta y_{2}\|}{\|y_{2}\|} = 8.27 \times 10^{-4}. $$
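The set-up of this example can be reproduced approximately with the sketch below (NumPy). The exact non-uniform sampling of the $x_{i}$ and the data vectors $y_{1}$, $y_{2}$ are not specified here, so the choices in the code are illustrative only.

```python
import numpy as np

# Illustrative set-up of Example 1: Gaussian basis functions on [1, 20].
rng = np.random.default_rng(3)
m, n, sigma_d = 100, 33, 1.35
x = np.sort(1.0 + 19.0 * rng.random(m) ** 2)      # non-uniform samples in [1, 20]
d = np.linspace(1.0, 20.0, n)                     # 33 uniformly spaced centres

# Coefficient matrix of the Gaussian basis functions, A in R^{100 x 33}
A = np.exp(-(x[:, None] - d[None, :]) ** 2 / (2.0 * sigma_d ** 2))

# Normwise condition number kappa(A) = s_1 / s_n
s = np.linalg.svd(A, compute_uv=False)
print("kappa(A) =", s[0] / s[-1])

# Coefficients a for synthetic exact data y and perturbed data y + delta_y
y = A @ rng.standard_normal(n)                    # a synthetic data vector
delta_y = rng.normal(0.0, np.sqrt(25e-8), m)      # perturbation, sigma^2 = 25e-8
a_exact = np.linalg.lstsq(A, y, rcond=None)[0]
a_pert = np.linalg.lstsq(A, y + delta_y, rcond=None)[0]
print("relative change in a:",
      np.linalg.norm(a_pert - a_exact) / np.linalg.norm(a_exact))
```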
[Figure: The exact curve (top), and the coefficients $a_{i}$ (bottom) for the exact data $y = y_{1}$ and the perturbed data $y = y_{1} + \delta y_{1}$.]
[Figure: The exact curve (top), and the coefficients $\log_{10} |a_{i}|$ (bottom) for the exact data $y = y_{2}$ and the perturbed data $y = y_{2} + \delta y_{2}$.]
- The interpolated curve is unstable for the data set $y = y_{1}$.
- The interpolated curve is stable for the data set $y = y_{2}$.
- The coefficient matrix $A \in \mathbb{R}^{100 \times 33}$ is the same for $y = y_{1}$ and $y = y_{2}$, and its condition number is $\kappa(A) = 5.11 \times 10^{8}$.

Thus
- The presence of noise in the vector $b$, where $Ax = b$, does not imply that $x$ is sensitive to changes in $b$.
- The condition $\kappa(A) \gg 1$ does not imply that the equation $Ax = b$ is ill-conditioned.
- Tikhonov regularisation yields a very good result for $y = y_{1}$ (numerically stable, with a small error between the theoretically exact solution and the regularised solution), but an unsatisfactory result for $y = y_{2}$ (a very large error between the theoretically exact solution and the regularised solution).
Condition numbers and regularisation

The 2-norm condition number of $A \in \mathbb{R}^{m \times n}$ is
$$ \kappa(A) = \frac{s_{1}}{s_{p}}, \qquad p = \min(m, n), $$
where $s_{i}$, $i = 1, \ldots, p$, are the singular values of $A$ and $\operatorname{rank} A = p$.

The condition number $\kappa(A)$ cannot be a measure of the stability of $Ax = b$ because it is independent of $b$. It is necessary to develop a measure of stability that is a function of $A$ and $b$. This leads to:
- A refined normwise condition number, the effective condition number, which is a function of $A$ and $b$.
- Componentwise condition numbers: one condition number for each component of $x$.
The effective condition number

The effective condition number $\eta(A, b)$ of $A^{T} A x = A^{T} b$, $A \in \mathbb{R}^{m \times n}$, $m \ge n$, is a refined normwise condition number.

Theorem 1
Let the relative errors $\Delta x$ and $\Delta b$ be
$$ \Delta x = \frac{\|\delta x\|}{\|x\|} \qquad \text{and} \qquad \Delta b = \frac{\|\delta b\|}{\|b\|}. $$
The effective condition number $\eta(A, b)$ of $A^{T} A x = A^{T} b$ is equal to the maximum value of the ratio of $\Delta x$ to $\Delta b$ with respect to all perturbations $\delta b \in \mathbb{R}^{m}$,
$$ \eta(A, b) = \max_{\delta b \in \mathbb{R}^{m}} \frac{\Delta x}{\Delta b} = \frac{1}{s_{n}} \frac{\|c\|}{\|S^{\dagger} c\|}, $$
where $A = U S V^{T}$ is the SVD of $A$ and $c = U^{T} b$.
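A minimal numerical sketch of Theorem 1 (NumPy; the matrix and right-hand sides are illustrative) computes $\eta(A, b)$ and shows that it ranges from approximately 1, when $b$ lies along the left singular vector $u_{n}$, up to $\kappa(A)$, when $b$ lies along $u_{1}$:

```python
import numpy as np

# Effective condition number eta(A, b) = (1 / s_n) * ||b|| / ||S^dagger c||,
# with c = U^T b, compared with the normwise condition number kappa(A).
rng = np.random.default_rng(4)
m, n = 100, 33
A = rng.standard_normal((m, n)) @ np.diag(np.logspace(0, -6, n))  # ill-conditioned A
U, s, Vt = np.linalg.svd(A, full_matrices=False)

def eta(b):
    c = U.T @ b                       # components of U^T b in the range of A
    return (np.linalg.norm(b) / np.linalg.norm(c / s)) / s[-1]

kappa = s[0] / s[-1]
b_worst = U[:, 0]                     # b aligned with the first left singular vector
b_benign = U[:, -1]                   # b aligned with the last left singular vector
print("kappa(A)            =", kappa)
print("eta(A, b), b = u_1  =", eta(b_worst), " (close to kappa)")
print("eta(A, b), b = u_n  =", eta(b_benign), "(close to 1)")
```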