

  1. Adaptive Lasso for correlated predictors
     Keith Knight
     Department of Statistics, University of Toronto
     e-mail: keith@utstat.toronto.edu
     This research was supported by NSERC of Canada.

  2. OUTLINE
     1. Introduction
     2. The Lasso under collinearity
     3. Projection pursuit with the Lasso
     4. Example: Diabetes data

  3. 1. INTRODUCTION
     • Assume a linear model for $\{(x_i, Y_i) : i = 1, \dots, n\}$:
       $$Y_i = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi} + \varepsilon_i = x_i^T \beta + \varepsilon_i \quad (i = 1, \dots, n).$$
     • Assume that the predictors are centred and scaled to have mean 0 and variance 1.
       – We can estimate $\beta_0$ by $\bar{Y}$, its least squares estimator.
       – Thus we can assume that the $\{Y_i\}$ are centred to have mean 0.
     • In many applications, $p$ can be much greater than $n$.
     • In this talk, we will assume implicitly that $p < n$.
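As a concrete illustration of this setup, here is a minimal sketch of the standardization step (predictors centred and scaled to mean 0 and variance 1, response centred), assuming NumPy; the helper name is illustrative.

    import numpy as np

    def standardize(X, y):
        """Centre and scale predictors to mean 0, variance 1; centre the response."""
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        Xs = (X - X.mean(axis=0)) / X.std(axis=0)   # each column: mean 0, variance 1
        yc = y - y.mean()                           # intercept estimated by the mean of Y
        return Xs, yc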

  4. Shrinkage estimation
     • Bridge regression: minimize
       $$\sum_{i=1}^n (Y_i - x_i^T \beta)^2 + \lambda \sum_{j=1}^p |\beta_j|^\gamma$$
       for some $\gamma > 0$.
     • Includes the Lasso (Tibshirani, 1996) and ridge regression as special cases with $\gamma = 1$ and $\gamma = 2$ respectively.
       – For $\gamma \le 1$, it is possible to obtain exact 0 parameter estimates.
       – Many other variations of the Lasso: elastic nets (Zou & Hastie, 2005), the fused lasso (Tibshirani et al., 2005), among others.
       – The Dantzig selector of Candès & Tao (2007) is similar in spirit to the Lasso.
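A minimal sketch of the two special cases on simulated data, assuming scikit-learn; note that its Lasso rescales the squared-error term by 1/(2n), so its alpha corresponds to λ only up to that factor.

    import numpy as np
    from sklearn.linear_model import Lasso, Ridge

    rng = np.random.default_rng(0)
    X = rng.standard_normal((100, 10))
    y = X[:, 0] - 2 * X[:, 1] + rng.standard_normal(100)

    # gamma = 1: the L1 penalty produces exact zeros
    # (sklearn minimizes (1/2n)||y - Xb||^2 + alpha ||b||_1)
    lasso = Lasso(alpha=0.5).fit(X, y)
    print("lasso zeros:", np.sum(lasso.coef_ == 0))

    # gamma = 2: the ridge penalty shrinks coefficients but does not zero them out
    ridge = Ridge(alpha=0.5).fit(X, y)
    print("ridge zeros:", np.sum(ridge.coef_ == 0))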

  5. Stagewise fitting
     • Stagewise fitting: given $\hat{\beta}^{(k)}$, minimize
       $$\sum_{i=1}^n (Y_i - x_i^T \hat{\beta}^{(k)} - x_i^T \phi)^2$$
       over $\phi$ with all but 1 (or a small number) of its elements equal to 0. Then define
       $$\hat{\beta}^{(k+1)} = \hat{\beta}^{(k)} + \epsilon \hat{\phi} \quad (0 < \epsilon \le 1)$$
       and repeat until "convergence" (a sketch follows this slide).
       – This is a special case of boosting (Schapire, 1990).
       – Also related to LARS (Efron et al., 2004), which in turn is related to the Lasso.
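A minimal sketch of this stagewise update, assuming centred data and NumPy; the step size eps and the fixed number of steps stand in for the talk's informal "repeat until convergence".

    import numpy as np

    def forward_stagewise(X, y, eps=0.1, n_steps=500):
        """At each step, fit the residual on the single best predictor and
        take a damped step of size eps in that coordinate."""
        n, p = X.shape
        beta = np.zeros(p)
        r = y.copy()
        col_ss = (X ** 2).sum(axis=0)                      # ||x_j||^2 for each column
        for _ in range(n_steps):
            corr = X.T @ r
            j = np.argmax(np.abs(corr) / np.sqrt(col_ss))  # coordinate giving the largest RSS drop
            phi_j = corr[j] / col_ss[j]                    # one-variable least-squares coefficient
            beta[j] += eps * phi_j                         # beta^(k+1) = beta^(k) + eps * phi_hat
            r -= eps * phi_j * X[:, j]                     # update the residual
        return beta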

  6. 2. THE LASSO UNDER COLLINEARITY
     • For given $\lambda$, the Lasso estimator $\hat{\beta}(\lambda)$ can be defined in a number of equivalent ways:
       1. $\hat{\beta}(\lambda)$ minimizes
          $$\sum_{i=1}^n (Y_i - x_i^T \beta)^2 \quad \text{subject to} \quad \sum_{j=1}^p |\beta_j| \le t(\lambda);$$
       2. $\hat{\beta}(\lambda)$ minimizes
          $$\sum_{i=1}^n (x_i^T \beta)^2 \quad \text{subject to} \quad \left| \sum_{i=1}^n (Y_i - x_i^T \beta)\, x_{ij} \right| \le \lambda \quad \text{for } j = 1, \dots, p.$$
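A minimal numerical check of the correlation bound in the second definition, assuming scikit-learn (whose alpha equals λ/n under its 1/(2n) scaling of the squared-error term); the data are simulated for illustration.

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(1)
    X = rng.standard_normal((200, 5))
    y = X @ np.array([1.0, -1.0, 0.0, 0.0, 0.5]) + rng.standard_normal(200)

    alpha = 0.1
    fit = Lasso(alpha=alpha, fit_intercept=False).fit(X, y)
    resid = y - X @ fit.coef_

    # sklearn's Lasso minimizes (1/2n)||y - Xb||^2 + alpha||b||_1, so the bound
    # |sum_i (Y_i - x_i'b) x_ij| <= lambda holds with lambda = n * alpha
    lam = len(y) * alpha
    print(np.abs(X.T @ resid) <= lam + 1e-8)   # True for every j; equality on the active set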

  7. • The advantage of the Lasso is that it produces exact 0 estimates while $\hat{\beta}(\lambda)$ is a smooth function of $\lambda$.
       – This is very useful when $p \gg n$ for producing "sparse" models.
     • However, when the predictors $\{x_i\}$ are highly correlated, $\hat{\beta}(\lambda)$ may contain too many zeroes.
     • This is not necessarily undesirable, but some important effects may be missed as a result.
       – How does one interpret a "sparse" model under high collinearity?
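A small simulation, assuming scikit-learn, illustrating how the Lasso can zero out one of two nearly identical predictors even though both carry the same effect.

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(2)
    n = 200
    z = rng.standard_normal(n)
    x1 = z + 0.05 * rng.standard_normal(n)    # x1 and x2 are almost perfectly correlated
    x2 = z + 0.05 * rng.standard_normal(n)
    X = np.column_stack([x1, x2])
    y = x1 + x2 + rng.standard_normal(n)      # both predictors matter equally

    fit = Lasso(alpha=0.5, fit_intercept=False).fit(X, y)
    print(fit.coef_)    # typically one coefficient absorbs the effect and the other is set to 0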

  8. Question: Why does this happen?
     Answer: Redundancy in the constraints
     $$\left| \sum_{i=1}^n (Y_i - x_i^T \beta)\, x_{ij} \right| \le \lambda \quad \text{for } j = 1, \dots, p$$
     due to collinearity; that is, we do not have $p$ independent constraints.
     • The Dantzig selector minimizes $\sum_j |\beta_j|$ subject to similar constraints on the correlations, and thus will tend to behave similarly.

  9. • For LS estimation ($\lambda = 0$), we have
       $$\sum_{i=1}^n (Y_i - x_i^T \hat{\beta})\, x_i^T a = 0$$
       for any $a$.
     • Similarly, we could try to consider estimates $\hat{\beta}$ such that
       $$\left| \sum_{i=1}^n (Y_i - x_i^T \hat{\beta})\, x_i^T a_\ell \right| \le \lambda$$
       for some set of vectors (projections) $\{a_\ell : \ell \in \mathcal{L}\}$.
     • If the set $\mathcal{L}$ is finite, we can incorporate the predictors $\{a_\ell^T x\}$ into the Lasso.

  10. Example: Principal components regression ($|\mathcal{L}| = p$), where $a_1, \dots, a_p$ are the eigenvectors of
      $$C = \sum_{i=1}^n x_i x_i^T.$$
      However ...
      • Projections obtained via principal components are based solely on information in the design.
      • Moreover, they need not be particularly easy to interpret.
      • More generally, there is no problem in taking $|\mathcal{L}| \gg p$.
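A minimal sketch of running the Lasso on principal-component projections, assuming NumPy and scikit-learn; the simulated data and penalty level are illustrative.

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(3)
    X = rng.standard_normal((150, 6))
    y = X[:, 0] + X[:, 1] + rng.standard_normal(150)

    # Eigenvectors of C = sum_i x_i x_i^T give the projections a_1, ..., a_p
    C = X.T @ X
    _, A = np.linalg.eigh(C)        # columns of A are the projections a_l
    T = X @ A                       # new predictors t_il = a_l' x_i

    fit = Lasso(alpha=0.1, fit_intercept=False).fit(T, y)
    print(fit.coef_)                # Lasso fit on the projected predictors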

  11. 3. PROJECTION PURSUIT WITH THE LASSO
      • For collinear predictors, it is often desirable to consider projections of the original predictors.
      • Given predictors $x_1, \dots, x_p$ and projections $\{a_\ell : \ell \in \mathcal{L}\}$, we want to identify "interesting" (data-driven) projections $a_{\ell_1}, \dots, a_{\ell_p}$ and define new predictors $a_{\ell_1}^T x, \dots, a_{\ell_p}^T x$.
      • We can take $\mathcal{L}$ to be very large, but the projections we consider should be easily interpretable:
        – coordinate projections (i.e. the original predictors);
        – sums and differences of two or more predictors (a sketch of this construction follows below).
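A minimal sketch of building such an interpretable projection set (coordinate projections plus all pairwise sums and differences), assuming NumPy; with p = 10 this yields the 100 projections used in the example later.

    import numpy as np
    from itertools import combinations

    def projection_dictionary(p):
        """Coordinate projections plus pairwise sums and differences of predictors."""
        A = list(np.eye(p))                            # coordinate projections
        for j, k in combinations(range(p), 2):
            s = np.zeros(p); s[j], s[k] = 1.0, 1.0     # x_j + x_k
            d = np.zeros(p); d[j], d[k] = 1.0, -1.0    # x_j - x_k
            A.extend([s, d])
        return np.column_stack(A)                      # p x |L| matrix of projections a_l

    A = projection_dictionary(10)
    print(A.shape)                                     # (10, 100): 10 + 2 * C(10, 2) projections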

  12. Question: How do we do this?
      Answer: Two possibilities:
      • Use the Lasso on the projections.
        – But we need to worry about the choice of $\lambda$.
        – The "active" projections will depend on $\lambda$.
      • Look at the Lasso solution as $\lambda \downarrow 0$.
        – This identifies a set of $p$ projections.
        – These projections can be used in the Lasso.

  13. Question: What happens to the Lasso solution as $\lambda \to 0$?
      • Suppose that $\hat{\beta}(\lambda)$ minimizes
        $$\sum_{i=1}^n (Y_i - x_i^T \beta)^2 + \lambda \sum_{j=1}^p |\beta_j|$$
        and that
        $$C = \sum_{i=1}^n x_i x_i^T$$
        is singular.
      • Define
        $$\mathcal{D} = \left\{ \phi : \sum_{i=1}^n (Y_i - x_i^T \phi)^2 = \min_\beta \sum_{i=1}^n (Y_i - x_i^T \beta)^2 \right\}.$$

  14. Proposition: For the Lasso estimate $\hat{\beta}(\lambda)$, we have
      $$\lim_{\lambda \downarrow 0} \hat{\beta}(\lambda) = \operatorname{argmin} \left\{ \sum_{j=1}^p |\phi_j| : \phi \in \mathcal{D} \right\}.$$
      "Proof". Assume (for simplicity) that the minimum RSS is 0. Then $\hat{\beta}(\lambda)$ minimizes
      $$Z_\lambda(\beta) = \frac{1}{\lambda} \sum_{i=1}^n (Y_i - x_i^T \beta)^2 + \sum_{j=1}^p |\beta_j|.$$
      As $\lambda \downarrow 0$, the first term of $Z_\lambda$ blows up for $\beta \notin \mathcal{D}$ and is exactly 0 for $\beta \in \mathcal{D}$. The conclusion follows using convexity of $Z_\lambda$.
      Corollary: The Dantzig selector estimator has the same limit as $\lambda \downarrow 0$.

  15. • In our problem, define $t_{i\ell}$ to be a scaled version of $a_\ell^T x_i$.
      • The model now becomes
        $$Y_i = \sum_{\ell \in \mathcal{L}} \phi_\ell t_{i\ell} + \varepsilon_i = t_i^T \phi + \varepsilon_i \quad (i = 1, \dots, n).$$
      • We estimate $\phi$ by minimizing
        $$\sum_{\ell \in \mathcal{L}} |\phi_\ell| \quad \text{subject to} \quad \sum_{i=1}^n (Y_i - t_i^T \phi)\, t_i = 0.$$
      • This can be solved using linear programming methods (a sketch follows below).
        – Software for the Lasso tends to be unstable as $\lambda \downarrow 0$.
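A minimal sketch of this linear program using SciPy's linprog and the usual split $\phi = u - v$ with $u, v \ge 0$; this is a generic LP formulation, not necessarily the implementation used in the talk.

    import numpy as np
    from scipy.optimize import linprog

    def min_l1_solution(T, y):
        """Minimize sum_l |phi_l| subject to T'(y - T phi) = 0."""
        n, r = T.shape
        G = T.T @ T                        # r x r, typically singular when r > rank(T)
        b = T.T @ y
        c = np.ones(2 * r)                 # objective: sum(u) + sum(v) = sum |phi|
        A_eq = np.hstack([G, -G])          # constraint: G (u - v) = b
        res = linprog(c, A_eq=A_eq, b_eq=b,
                      bounds=[(0, None)] * (2 * r), method="highs")
        u, v = res.x[:r], res.x[r:]
        return u - v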

  16. Asymptotics:
      • Assume $p < r = |\mathcal{L}|$ are fixed and $n \to \infty$.
      • Define matrices
        $$C = \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^n x_i x_i^T, \qquad D = \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^n t_i t_i^T,$$
        where $C$ is non-singular and $D$ is singular with rank $p$.
      • Then $\hat{\phi}_n \stackrel{p}{\longrightarrow}$ some $\phi_0$.
      • We also have $\sqrt{n}\,(\hat{\phi}_n - \phi_0) \stackrel{d}{\longrightarrow} V$, where the distribution of $V$ is concentrated on the orthogonal complement of the null space of $D$.

  17. 4. EXAMPLE
      Diabetes data (Efron et al., 2004)
      • Response: measure of disease progression.
      • Predictors: age, sex, BMI, blood pressure (BP), and 6 blood serum measurements (TC, LDL, HDL, TCH, LTG, GLU).
        – Some predictors are quite highly correlated.
      • Analysis indicates that the most important variables are LTG, BMI, BP, TC, and sex.
      • Look at coordinate-wise projections as well as pairwise sums and differences (100 projections in total).
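A minimal end-to-end sketch on scikit-learn's copy of the Efron et al. (2004) diabetes data, reusing the projection construction sketched earlier; the penalty level is illustrative, and the Lasso is used here in place of the λ ↓ 0 linear program.

    import numpy as np
    from itertools import combinations
    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import Lasso

    X, y = load_diabetes(return_X_y=True)       # 442 patients, 10 predictors
    X = (X - X.mean(axis=0)) / X.std(axis=0)    # centre and scale the predictors
    y = y - y.mean()                            # centre the response

    # 10 coordinate projections plus all pairwise sums and differences: 100 projections
    p = X.shape[1]
    A = list(np.eye(p))
    for j, k in combinations(range(p), 2):
        s = np.zeros(p); s[j], s[k] = 1.0, 1.0
        d = np.zeros(p); d[j], d[k] = 1.0, -1.0
        A.extend([s, d])
    A = np.column_stack(A)

    T = X @ A
    T = T / T.std(axis=0)                       # t_il: scaled versions of a_l' x_i

    fit = Lasso(alpha=1.0).fit(T, y)
    print(np.flatnonzero(fit.coef_))            # indices of the "active" projections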

  18. [Figure] Lasso plot for the original predictors: coefficient paths against the proportional bound, with curves labelled LTG, BMI, LDL, BP, TCH, HDL, GLU, AGE, SEX, and TC.

  19. Results: Estimated projections

      Projection    Estimate
      BMI + LTG        29.86
      LTG − TC         14.79
      LDL − TC         10.32
      BP − SEX          9.61
      BMI + BP          6.64
      BMI + GLU         5.36
      BP + LTG          5.33
      TCH − SEX         4.18
      HDL + TCH         3.48
      BP − AGE          0.55

  20. [Figure] Lasso plot for the 10 identified projections: coefficient paths against the proportional bound, with curves labelled BMI+LTG, LTG−TC, LDL−TC, BP−SEX, BMI+BP, BP+LTG, BMI+GLU, TCH−SEX, HDL+TCH, and BP−AGE.

  21. [Figure] Lasso trajectories for the original predictors using the projections: coefficient paths against the proportional bound, with curves labelled LTG, BMI, LDL, BP, TCH, HDL, GLU, AGE, SEX, and TC.
