High-dimensional regression with unknown variance


  1. High-dimensional regression with unknown variance
     Christophe Giraud, Ecole Polytechnique, March 2012

  2. Setting
     Gaussian regression with unknown variance:
     ◮ Y_i = f_i + ε_i, with ε_i i.i.d. ∼ N(0, σ²)
     ◮ f = (f_1, ..., f_n)* and σ² are unknown
     ◮ we want to estimate f
     Ex 1: sparse linear regression
     ◮ f = Xβ with β "sparse" in some sense and X ∈ R^{n×p}, possibly with p > n (see the simulation sketch below)
     Ex 2: non-parametric regression
     ◮ f_i = F(x_i) with F : X → R
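
A minimal simulation sketch of Ex 1; the sizes (n, p, k) and the noise level σ are illustrative assumptions, not values from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (not from the talk): p > n, k-sparse signal.
n, p, k, sigma = 100, 500, 5, 1.5

X = rng.standard_normal((n, p))
X /= np.linalg.norm(X, axis=0)           # columns normalized to 1, as on slide 15

beta0 = np.zeros(p)
beta0[rng.choice(p, size=k, replace=False)] = rng.uniform(1, 3, size=k)

f = X @ beta0                            # unknown mean vector f = X beta0
Y = f + sigma * rng.standard_normal(n)   # Y_i = f_i + eps_i, eps_i ~ N(0, sigma^2)
```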

  3. A plethora of estimators
     Sparse linear regression
     ◮ Coordinate sparsity: Lasso, Dantzig, Elastic-Net, Exponential-Weighting, projection on subspaces { V_λ : λ ∈ Λ } given by PCA, Random Forest, etc.
     ◮ Structured sparsity: Group-Lasso, Fused-Lasso, Bayesian estimators, etc.
     Non-parametric regression
     ◮ Spline smoothing, Nadaraya kernel smoothing, kernel ridge estimators, nearest neighbors, L²-basis projection, Sparse Additive Models, etc.

  4. Important practical issues
     Which estimator should be used?
     ◮ Sparse regression: Lasso? Random Forest? Exponential-Weighting?
     ◮ Non-parametric regression: kernel regression (which kernel?), spline smoothing?
     Which "tuning" parameter?
     ◮ which penalty level for the Lasso?
     ◮ which bandwidth for kernel regression?
     ◮ etc.

  5. The objective
     Difficulties
     ◮ No procedure is universally better than the others.
     ◮ A sensible choice of the tuning parameters depends on
       ◮ some unknown characteristics of f (sparsity, smoothness, etc.),
       ◮ the unknown variance σ².
     Ideal objective
     ◮ Select the "best" estimator among a collection { f̂_λ, λ ∈ Λ }.
       (Alternative objective: combine the estimators as well as possible.)

  6. Impact of not knowing the variance

  7. Impact of the unknown variance? Case of coordinate-sparse linear regression.
     [Figure: minimax prediction risk over k-sparse signals, as a function of k, when σ or k is known versus when both σ and k are unknown; the ultra-high-dimensional regime is 2k log(p/k) ≥ n.]

  8. Ultra-high dimensional phenomenon
     Theorem (N. Verzelen, EJS 2012)
     When σ² is unknown, there exist designs X of size n × p such that, for any estimator β̂, we have either
       sup_{σ² > 0} E[ ‖X(β̂ − 0_p)‖² ] > C₁ n σ²,
     or
       sup_{β₀ k-sparse, σ² > 0} E[ ‖X(β̂ − β₀)‖² ] > C₂ k log(p/k) exp( C₃ (k/n) log(p/k) ) σ².
     Consequence
     When σ² is unknown, the best we can expect is
       E[ ‖X(β̂ − β₀)‖² ] ≤ C inf_{β ≠ 0} [ ‖X(β − β₀)‖² + ‖β‖₀ log(p) σ² ]
     for any σ² > 0 and any β₀ fulfilling 1 ≤ ‖β₀‖₀ ≤ C′ n / log(p).
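
The ultra-high-dimensional condition 2k log(p/k) ≥ n of slide 7 is easy to evaluate numerically; a tiny check with illustrative sizes (not from the talk).

```python
import numpy as np

def ultra_high_dim(n: int, p: int, k: int) -> bool:
    """Check the ultra-high-dimensional condition 2 k log(p/k) >= n of slide 7."""
    return 2 * k * np.log(p / k) >= n

# With n = 100 and p = 500 (illustrative), the regime is entered between k = 10 and k = 20.
for k in (5, 10, 20, 30, 40):
    print(k, ultra_high_dim(100, 500, k))
```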

  9. Some generic selection schemes

  10. Cross-Validation
      ◮ Hold-out
      ◮ V-fold CV (see the sketch below)
      ◮ Leave-q-out
      Penalized empirical loss
      ◮ Penalized log-likelihood (AIC, BIC, etc.)
      ◮ Plug-in criteria (with Mallows' C_p, etc.)
      ◮ Slope heuristic
      Approximation versus complexity penalization
      ◮ LinSelect
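
For concreteness, a minimal sketch of V-fold CV used to tune the Lasso penalty; the use of scikit-learn and the helper name `v_fold_cv_lasso` are our choices (the talk's experiments use the R packages cited on slide 20).

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold

def v_fold_cv_lasso(X, Y, alphas, V=10, seed=0):
    """Pick the Lasso penalty minimizing the V-fold cross-validated prediction error.

    A generic selection scheme: no knowledge of sigma^2 is required.
    """
    folds = KFold(n_splits=V, shuffle=True, random_state=seed)
    cv_err = np.zeros(len(alphas))
    for train, test in folds.split(X):
        for j, a in enumerate(alphas):
            model = Lasso(alpha=a, fit_intercept=False, max_iter=10_000)
            model.fit(X[train], Y[train])
            cv_err[j] += np.mean((Y[test] - model.predict(X[test])) ** 2)
    return alphas[int(np.argmin(cv_err))]
```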

  11. LinSelect (Y. Baraud, C. G. & S. Huet)
      Ingredients
      ◮ a collection S of linear spaces (for approximation)
      ◮ a weight function ∆ : S → R₊ (measure of complexity)
      Criterion (residuals + approximation + complexity), see the sketch below:
        Crit(f̂_λ) = inf_{S ∈ Ŝ} [ ‖Y − Π_S f̂_λ‖² + (1/2) ‖f̂_λ − Π_S f̂_λ‖² + pen_∆(S) σ̂²_S ]
      where
      ◮ Ŝ ⊂ S, possibly data-dependent,
      ◮ Π_S is the orthogonal projector onto S,
      ◮ pen_∆(S) ≍ dim(S) ∨ 2∆(S) when dim(S) ∨ 2∆(S) ≤ 2n/3,
      ◮ σ̂²_S = ‖Y − Π_S Y‖² / (n − dim(S)).
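
A minimal sketch of this criterion as reconstructed above; representing each space by an orthonormal basis and taking pen_∆(S) = dim(S) ∨ 2∆(S) (the constant is unspecified on the slide) are our assumptions, not a reference implementation.

```python
import numpy as np

def linselect_crit(Y, f_hat, spaces, weights):
    """Evaluate the (reconstructed) LinSelect criterion of slide 11 for one candidate f_hat.

    `spaces` is a list of orthonormal bases (n x d arrays) spanning the spaces of S_hat,
    `weights` the corresponding Delta(S).
    """
    n = len(Y)
    best = np.inf
    for B, dlt in zip(spaces, weights):
        d = B.shape[1]
        P = B @ B.T                                        # orthogonal projector Pi_S
        sigma2_S = np.sum((Y - P @ Y) ** 2) / (n - d)      # hat sigma^2_S
        crit = (np.sum((Y - P @ f_hat) ** 2)               # residuals
                + 0.5 * np.sum((f_hat - P @ f_hat) ** 2)   # approximation
                + max(d, 2 * dlt) * sigma2_S)              # complexity penalty
        best = min(best, crit)
    return best

# One then selects the candidate f_hat_lambda with the smallest criterion value.
```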

  12. Non-asymptotic risk bound
      Assumptions
      1. 1 ≤ dim(S) ∨ 2∆(S) ≤ 2n/3 for all S ∈ S,
      2. Σ_{S ∈ S} e^{−∆(S)} ≤ 1.
      Theorem (Y. Baraud, C. G., S. Huet)
        E[ ‖f − f̂_λ̂‖² ] ≤ C E[ inf_{λ ∈ Λ} ( ‖f − f̂_λ‖² + inf_{S ∈ Ŝ} ( ‖f̂_λ − Π_S f̂_λ‖² + [dim(S) ∨ ∆(S)] σ² ) ) ]
      The bound also holds in deviation.

  13. Sparse linear regression

  14. Instantiation of LinSelect
      Estimators
      ◮ linear regressors f̂_λ = X β̂_λ, λ ∈ Λ (e.g. Lasso, Exponential-Weighting, etc.)
      Approximation and complexity
      ◮ S = { range(X_J) : J ⊂ {1, ..., p}, 1 ≤ |J| ≤ n/(3 log p) }
      ◮ ∆(S) = log binom(p, dim(S)) + log(dim(S)) ≈ dim(S) log(p)
      Subcollection Ŝ (see the sketch below)
      ◮ we set Ŝ_λ = range( X_supp(β̂_λ) ) and define Ŝ = { Ŝ_λ, λ ∈ Λ̂ }, where Λ̂ = { λ ∈ Λ : Ŝ_λ ∈ S }.
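
A sketch of this instantiation along a Lasso path: it builds the sub-collection Ŝ and the weights ∆(S), which can then be fed to the criterion sketch given after slide 11. The use of scikit-learn's `lasso_path` and the helper names are our choices.

```python
import numpy as np
from scipy.special import gammaln
from sklearn.linear_model import lasso_path

def log_binom(p, d):
    """Logarithm of the binomial coefficient C(p, d)."""
    return gammaln(p + 1) - gammaln(d + 1) - gammaln(p - d + 1)

def lasso_subcollection(X, Y, n_alphas=50):
    """Sketch of slide 14: one candidate space S_lambda = range(X_supp(beta_lambda))
    per Lasso path point, kept only if its dimension stays below n / (3 log p)."""
    n, p = X.shape
    dmax = n / (3 * np.log(p))
    _, coefs, _ = lasso_path(X, Y, n_alphas=n_alphas)    # coefs has shape (p, n_alphas)
    spaces, weights = [], []
    for j in range(coefs.shape[1]):
        J = np.flatnonzero(coefs[:, j])
        if 1 <= len(J) <= dmax:
            Q, _ = np.linalg.qr(X[:, J])                 # orthonormal basis of range(X_J)
            spaces.append(Q)
            weights.append(log_binom(p, len(J)) + np.log(len(J)))   # Delta(S)
    return spaces, weights
```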

  15. Case of the Lasso estimators
      Lasso estimators
        β̂_λ = argmin_β [ ‖Y − Xβ‖² + 2λ ‖β‖₁ ],   λ > 0
      Parameter tuning: theory
      ◮ for X with columns normalized to 1, λ ≍ σ √(2 log(p)) (see the sketch below)
      Parameter tuning: practice
      ◮ V-fold CV
      ◮ BIC criterion
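
A sketch of the theoretical tuning above with scikit-learn (our choice of solver); note the rescaling between the slide's criterion ‖Y − Xβ‖² + 2λ‖β‖₁ and scikit-learn's (1/2n)‖Y − Xβ‖² + α‖β‖₁, and note that this rule needs σ, which is precisely the difficulty the talk addresses.

```python
import numpy as np
from sklearn.linear_model import Lasso

def lasso_theoretical_tuning(X, Y, sigma):
    """Fit the Lasso with the theoretical penalty level of slide 15.

    Slide 15: lambda ≍ sigma * sqrt(2 log p) for ||Y - X b||^2 + 2 lambda ||b||_1
    with columns of X normalized to 1. scikit-learn minimizes
    (1/2n)||Y - X b||^2 + alpha ||b||_1, hence alpha = lambda / n.
    """
    n, p = X.shape
    lam = sigma * np.sqrt(2 * np.log(p))
    return Lasso(alpha=lam / n, fit_intercept=False, max_iter=10_000).fit(X, Y)
```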

  16. Recent criteria pivotal with respect to the variance
      ◮ ℓ₁-penalized log-likelihood (Städler, Bühlmann, van de Geer):
          (β̂_λ^LL, σ̂_λ^LL) := argmin_{β ∈ R^p, σ′ > 0} [ n log(σ′) + ‖Y − Xβ‖² / (2σ′²) + λ ‖β‖₁ / σ′ ]
      ◮ ℓ₁-penalized Huber loss (Belloni et al., Antoniadis), see the sketch below:
          (β̂_λ^SR, σ̂_λ^SR) := argmin_{β ∈ R^p, σ′ > 0} [ n σ′ / 2 + ‖Y − Xβ‖² / (2σ′) + λ ‖β‖₁ ]
        Equivalent to the Square-Root Lasso (introduced earlier):
          β̂_λ^SR = argmin_{β ∈ R^p} [ ‖Y − Xβ‖₂ + (λ/√n) ‖β‖₁ ]
      ◮ Sun & Zhang: optimization with a single LARS call.
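
The Huber-type objective is jointly convex in (β, σ′) and can be minimized by alternating a Lasso step in β with the closed-form update σ′ = ‖Y − Xβ‖₂/√n, in the spirit of Sun & Zhang's scaled Lasso. The sketch below uses scikit-learn and illustrative defaults, not the single-LARS-call implementation mentioned on the slide.

```python
import numpy as np
from sklearn.linear_model import Lasso

def scaled_lasso(X, Y, lam=None, n_iter=20):
    """Sketch of the variance-pivotal criterion of slide 16 by alternating minimization.

    For fixed sigma', the beta-step is a Lasso with penalty lam * sigma';
    for fixed beta, the sigma'-step is sigma' = ||Y - X beta||_2 / sqrt(n).
    Solver, initialization and stopping rule are our choices.
    """
    n, p = X.shape
    if lam is None:
        lam = np.sqrt(2 * np.log(p))       # a sigma-free penalty level, as in the theory
    sigma = np.std(Y)                       # crude initialization of sigma'
    beta = np.zeros(p)
    for _ in range(n_iter):
        # beta-step: ||Y - X b||^2 + 2 lam sigma ||b||_1  <=>  sklearn alpha = lam*sigma/n
        model = Lasso(alpha=lam * sigma / n, fit_intercept=False, max_iter=10_000)
        model.fit(X, Y)
        beta = model.coef_
        sigma_new = np.linalg.norm(Y - X @ beta) / np.sqrt(n)
        if abs(sigma_new - sigma) < 1e-6 * sigma:
            break
        sigma = sigma_new
    return beta, sigma
```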

  17. The compatibility constant
        κ[ξ, T] = min_{u ∈ C(ξ,T)} { |T|^{1/2} ‖Xu‖₂ / ‖u_T‖₁ },
      where C(ξ, T) = { u : ‖u_{T^c}‖₁ < ξ ‖u_T‖₁ }.
      Restricted eigenvalue (illustrated in the sketch below)
      For k* = n/(3 log(p)) we set φ* = sup { ‖Xu‖₂ / ‖u‖₂ : u k*-sparse }.
      Theorem for the Square-Root Lasso (Sun & Zhang)
      For λ = 2√(2 log(p)), if we assume that
      ◮ ‖β₀‖₀ ≤ C₁ κ²[4, supp(β₀)] × n / log(p),
      then, with high probability,
        ‖X(β̂ − β₀)‖₂² ≤ inf_{β ≠ 0} [ ‖X(β₀ − β)‖₂² + C₂ (‖β‖₀ log(p) / κ²[4, supp(β)]) σ² ].
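
Both κ[ξ, T] and φ* are combinatorial quantities. As a small illustration of the definition of φ*, here is a brute-force computation, feasible only for tiny designs; the sizes and names are ours.

```python
import numpy as np
from itertools import combinations

def sparse_operator_norm(X, k):
    """Brute-force phi* = sup{ ||Xu||_2 / ||u||_2 : u k-sparse }.

    Equals the largest singular value of X restricted to the best subset of k
    columns; this only illustrates the definition on slide 17, it is not practical.
    """
    p = X.shape[1]
    return max(np.linalg.norm(X[:, list(T)], ord=2)
               for T in combinations(range(p), k))

# Example on a tiny random design (illustrative sizes).
X_small = np.random.default_rng(1).standard_normal((30, 12))
X_small /= np.linalg.norm(X_small, axis=0)
print(sparse_operator_norm(X_small, k=3))
```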

  18. The compatibility constant
        κ[ξ, T] = min_{u ∈ C(ξ,T)} { |T|^{1/2} ‖Xu‖₂ / ‖u_T‖₁ },
      where C(ξ, T) = { u : ‖u_{T^c}‖₁ < ξ ‖u_T‖₁ }.
      Restricted eigenvalue
      For k* = n/(3 log(p)) we set φ* = sup { ‖Xu‖₂ / ‖u‖₂ : u k*-sparse }.
      Theorem for the LinSelect Lasso
      If we assume that
      ◮ ‖β₀‖₀ ≤ C₁ κ²[4, supp(β₀)] × n / (φ* log(p)),
      then, with high probability,
        ‖X(β̂ − β₀)‖₂² ≤ C inf_{β ≠ 0} [ ‖X(β₀ − β)‖₂² + C₂ φ* (‖β‖₀ log(p) / κ²[4, supp(β)]) σ² ].

  19. Numerical experiments (1/2): tuning the Lasso
      ◮ 165 examples extracted from the literature
      ◮ each example e is evaluated on the basis of 400 runs
      Comparison to the oracle β̂_λ*
      procedure (quantiles)  |   0% |  50% |  75% |  90% |  95%
      Lasso 10-fold CV       | 1.03 | 1.11 | 1.15 | 1.19 | 1.24
      Lasso LinSelect        | 0.97 | 1.03 | 1.06 | 1.19 | 2.52
      Square-Root Lasso      | 1.32 | 2.61 | 3.37 | 11.2 |   17
      For each procedure ℓ, quantiles of R(β̂_{λ̂_ℓ}; β₀) / R(β̂_{λ*}; β₀) over the examples e = 1, ..., 165.

  20. Numerical experiments (2/2): computation time
        n  |   p  | 10-fold CV | LinSelect | Square-Root
       100 |  100 |     4 s    |  0.21 s   |   0.18 s
       100 |  500 |    4.8 s   |  0.43 s   |   0.4 s
       500 |  500 |    300 s   |   11 s    |   6.3 s
      Packages:
      ◮ enet for 10-fold CV and LinSelect
      ◮ lars for Square-Root Lasso (procedure of Sun & Zhang)

  21. Non-parametric regression

  22. An important class of estimators
      Linear estimators: f̂_λ = A_λ Y with A_λ ∈ R^{n×n} (see the sketch below)
      ◮ spline smoothing or kernel ridge estimators with smoothing parameter λ ∈ R₊
      ◮ Nadaraya estimators A_λ with smoothing parameter λ ∈ R₊
      ◮ λ-nearest neighbors, λ ∈ {1, ..., k}
      ◮ L²-basis projection (on the λ first elements)
      ◮ etc.
      Selection criteria (with σ² unknown)
      ◮ cross-validation schemes (including GCV)
      ◮ Mallows' C_L + plug-in / slope heuristic
      ◮ LinSelect
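
As a concrete instance of a linear estimator f̂_λ = A_λ Y, here is a minimal sketch of the smoother matrix of kernel ridge regression with a Gaussian kernel; the kernel choice, bandwidth and λ are illustrative assumptions, not prescribed by the talk.

```python
import numpy as np

def kernel_ridge_smoother(x, bandwidth, lam):
    """Build the n x n smoother matrix A_lambda = K (K + lam I)^{-1} of kernel
    ridge regression with a Gaussian kernel on 1-D design points x.

    bandwidth and lam are the tuning parameters one would like to select
    without knowing sigma^2.
    """
    K = np.exp(-0.5 * ((x[:, None] - x[None, :]) / bandwidth) ** 2)
    n = len(x)
    return K @ np.linalg.solve(K + lam * np.eye(n), np.eye(n))

# Usage: f_hat = kernel_ridge_smoother(x, bandwidth=0.3, lam=1.0) @ Y
```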

  24. Slope heuristic (Arlot & Bach)
      Procedure for f̂_λ = A_λ Y (see the sketch below)
      1. compute λ̂₀(σ′) = argmin_λ [ ‖Y − f̂_λ‖² + σ′ Tr(2A_λ − A_λ* A_λ) ]
      2. select σ̂ such that Tr(A_{λ̂₀(σ̂)}) ∈ [n/10, n/3]
      3. select λ̂ = argmin_λ [ ‖Y − f̂_λ‖² + 2 σ̂² Tr(A_λ) ].
      Main assumptions
      ◮ A_λ ≈ shrinkage or "averaging" matrix (covers all classics)
      ◮ bias assumption: ∃ λ₁, Tr(A_{λ₁}) ≤ √n and ‖(I − A_{λ₁}) f‖² ≤ σ² √(n log(n))
      Theorem (Arlot & Bach)
      With high probability:  ‖f̂_λ̂ − f‖² ≤ (1 + ε) inf_λ ‖f̂_λ − f‖² + C ε⁻¹ log(n) σ².
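
A minimal sketch of the three-step procedure above for a finite family of smoother matrices. The dictionary layout, the interpretation of σ′ as a candidate variance level, and the grid search replacing the jump detection in step 2 are our simplifications, not Arlot & Bach's implementation.

```python
import numpy as np

def slope_heuristic_select(Y, smoothers, sigma2_grid):
    """Select lambda for linear estimators f_lam = A_lam Y by the slope heuristic of slide 24.

    `smoothers` is a dict {lam: A_lam} of n x n matrices and `sigma2_grid` an
    increasing grid of candidate variance levels (playing the role of sigma').
    """
    n = len(Y)
    fits = {lam: A @ Y for lam, A in smoothers.items()}

    def lam0(s2):
        # step 1: minimal-penalty criterion ||Y - f_lam||^2 + s2 * Tr(2 A - A* A)
        return min(smoothers, key=lambda lam:
                   np.sum((Y - fits[lam]) ** 2)
                   + s2 * (2 * np.trace(smoothers[lam])
                           - np.trace(smoothers[lam].T @ smoothers[lam])))

    # step 2: pick sigma2_hat such that Tr(A_{lam0(sigma2_hat)}) falls in [n/10, n/3]
    sigma2_hat = next(s2 for s2 in sigma2_grid
                      if n / 10 <= np.trace(smoothers[lam0(s2)]) <= n / 3)

    # step 3: Mallows-type criterion ||Y - f_lam||^2 + 2 * sigma2_hat * Tr(A_lam)
    return min(smoothers, key=lambda lam:
               np.sum((Y - fits[lam]) ** 2)
               + 2 * sigma2_hat * np.trace(smoothers[lam]))
```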
