High-dimensional regression with unknown variance


  1. High-dimensional regression with unknown variance
     Christophe Giraud, Ecole Polytechnique, March 2012

  2. Setting
     Gaussian regression with unknown variance:
     ◮ Y_i = f_i + ε_i, with ε_i i.i.d. ∼ N(0, σ²)
     ◮ f = (f_1, ..., f_n)* and σ² are unknown
     ◮ we want to estimate f
     Ex 1: sparse linear regression
     ◮ f = Xβ with β "sparse" in some sense and X ∈ R^{n×p}, possibly with p > n (see the simulation sketch below)
     Ex 2: non-parametric regression
     ◮ f_i = F(x_i) with F : X → R
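
A minimal simulation sketch of Ex 1; the sizes (n, p, k) and the noise level σ are illustrative assumptions, not values from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (not from the talk): p > n, k-sparse signal.
n, p, k, sigma = 100, 500, 5, 1.5

X = rng.standard_normal((n, p))
X /= np.linalg.norm(X, axis=0)           # columns normalized to 1, as on slide 15

beta0 = np.zeros(p)
beta0[rng.choice(p, size=k, replace=False)] = rng.uniform(1, 3, size=k)

f = X @ beta0                            # unknown mean vector f = X beta0
Y = f + sigma * rng.standard_normal(n)   # Y_i = f_i + eps_i, eps_i ~ N(0, sigma^2)
```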

  3. A plethora of estimators
     Sparse linear regression
     ◮ Coordinate sparsity: Lasso, Dantzig, Elastic-Net, Exponential-Weighting, projection on subspaces { V_λ : λ ∈ Λ } given by PCA, Random Forest, etc.
     ◮ Structured sparsity: Group-Lasso, Fused-Lasso, Bayesian estimators, etc.
     Non-parametric regression
     ◮ Spline smoothing, Nadaraya kernel smoothing, kernel ridge estimators, nearest neighbors, L²-basis projection, Sparse Additive Models, etc.

  4. Important practical issues
     Which estimator should be used?
     ◮ Sparse regression: Lasso? Random Forest? Exponential-Weighting?
     ◮ Non-parametric regression: kernel regression (which kernel?), spline smoothing?
     Which "tuning" parameter?
     ◮ which penalty level for the Lasso?
     ◮ which bandwidth for kernel regression?
     ◮ etc.

  5. The objective
     Difficulties
     ◮ No procedure is universally better than the others.
     ◮ A sensible choice of the tuning parameters depends on
       ◮ some unknown characteristics of f (sparsity, smoothness, etc.),
       ◮ the unknown variance σ².
     Ideal objective
     ◮ Select the "best" estimator among a collection { f̂_λ, λ ∈ Λ }.
       (Alternative objective: combine the estimators as well as possible.)

  6. Impact of not knowing the variance

  7. Impact of the unknown variance? Case of coordinate-sparse linear regression.
     [Figure: minimax prediction risk over k-sparse signals, as a function of k, when σ or k is known versus when both σ and k are unknown; the ultra-high-dimensional regime is 2k log(p/k) ≥ n.]

  8. Ultra-high dimensional phenomenon
     Theorem (N. Verzelen, EJS 2012)
     When σ² is unknown, there exist designs X of size n × p such that, for any estimator β̂, we have either
       sup_{σ² > 0} E[ ‖X(β̂ − 0_p)‖² ] > C₁ n σ²,
     or
       sup_{β₀ k-sparse, σ² > 0} E[ ‖X(β̂ − β₀)‖² ] > C₂ k log(p/k) exp( C₃ (k/n) log(p/k) ) σ².
     Consequence
     When σ² is unknown, the best we can expect is
       E[ ‖X(β̂ − β₀)‖² ] ≤ C inf_{β ≠ 0} [ ‖X(β − β₀)‖² + ‖β‖₀ log(p) σ² ]
     for any σ² > 0 and any β₀ fulfilling 1 ≤ ‖β₀‖₀ ≤ C′ n / log(p).
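
The ultra-high-dimensional condition 2k log(p/k) ≥ n of slide 7 is easy to evaluate numerically; a tiny check with illustrative sizes (not from the talk).

```python
import numpy as np

def ultra_high_dim(n: int, p: int, k: int) -> bool:
    """Check the ultra-high-dimensional condition 2 k log(p/k) >= n of slide 7."""
    return 2 * k * np.log(p / k) >= n

# With n = 100 and p = 500 (illustrative), the regime is entered between k = 10 and k = 20.
for k in (5, 10, 20, 30, 40):
    print(k, ultra_high_dim(100, 500, k))
```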

  9. Some generic selection schemes

  10. Cross-Validation
      ◮ Hold-out
      ◮ V-fold CV (see the sketch below)
      ◮ Leave-q-out
      Penalized empirical loss
      ◮ Penalized log-likelihood (AIC, BIC, etc.)
      ◮ Plug-in criteria (with Mallows' C_p, etc.)
      ◮ Slope heuristic
      Approximation versus complexity penalization
      ◮ LinSelect
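
For concreteness, a minimal sketch of V-fold CV used to tune the Lasso penalty; the use of scikit-learn and the helper name `v_fold_cv_lasso` are our choices (the talk's experiments use the R packages cited on slide 20).

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold

def v_fold_cv_lasso(X, Y, alphas, V=10, seed=0):
    """Pick the Lasso penalty minimizing the V-fold cross-validated prediction error.

    A generic selection scheme: no knowledge of sigma^2 is required.
    """
    folds = KFold(n_splits=V, shuffle=True, random_state=seed)
    cv_err = np.zeros(len(alphas))
    for train, test in folds.split(X):
        for j, a in enumerate(alphas):
            model = Lasso(alpha=a, fit_intercept=False, max_iter=10_000)
            model.fit(X[train], Y[train])
            cv_err[j] += np.mean((Y[test] - model.predict(X[test])) ** 2)
    return alphas[int(np.argmin(cv_err))]
```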

  11. LinSelect (Y. Baraud, C. G. & S. Huet)
      Ingredients
      ◮ a collection S of linear spaces (for approximation)
      ◮ a weight function ∆ : S → R₊ (measure of complexity)
      Criterion (residuals + approximation + complexity), see the sketch below:
        Crit(f̂_λ) = inf_{S ∈ Ŝ} [ ‖Y − Π_S f̂_λ‖² + (1/2) ‖f̂_λ − Π_S f̂_λ‖² + pen_∆(S) σ̂²_S ]
      where
      ◮ Ŝ ⊂ S, possibly data-dependent,
      ◮ Π_S is the orthogonal projector onto S,
      ◮ pen_∆(S) ≍ dim(S) ∨ 2∆(S) when dim(S) ∨ 2∆(S) ≤ 2n/3,
      ◮ σ̂²_S = ‖Y − Π_S Y‖² / (n − dim(S)).
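
A minimal sketch of this criterion as reconstructed above; representing each space by an orthonormal basis and taking pen_∆(S) = dim(S) ∨ 2∆(S) (the constant is unspecified on the slide) are our assumptions, not a reference implementation.

```python
import numpy as np

def linselect_crit(Y, f_hat, spaces, weights):
    """Evaluate the (reconstructed) LinSelect criterion of slide 11 for one candidate f_hat.

    `spaces` is a list of orthonormal bases (n x d arrays) spanning the spaces of S_hat,
    `weights` the corresponding Delta(S).
    """
    n = len(Y)
    best = np.inf
    for B, dlt in zip(spaces, weights):
        d = B.shape[1]
        P = B @ B.T                                        # orthogonal projector Pi_S
        sigma2_S = np.sum((Y - P @ Y) ** 2) / (n - d)      # hat sigma^2_S
        crit = (np.sum((Y - P @ f_hat) ** 2)               # residuals
                + 0.5 * np.sum((f_hat - P @ f_hat) ** 2)   # approximation
                + max(d, 2 * dlt) * sigma2_S)              # complexity penalty
        best = min(best, crit)
    return best

# One then selects the candidate f_hat_lambda with the smallest criterion value.
```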

  12. Non-asymptotic risk bound
      Assumptions
      1. 1 ≤ dim(S) ∨ 2∆(S) ≤ 2n/3 for all S ∈ S,
      2. Σ_{S ∈ S} e^{−∆(S)} ≤ 1.
      Theorem (Y. Baraud, C. G., S. Huet)
        E[ ‖f − f̂_λ̂‖² ] ≤ C E[ inf_{λ ∈ Λ} ( ‖f − f̂_λ‖² + inf_{S ∈ Ŝ} ( ‖f̂_λ − Π_S f̂_λ‖² + [dim(S) ∨ ∆(S)] σ² ) ) ]
      The bound also holds in deviation.

  13. Sparse linear regression

  14. Instantiation of LinSelect
      Estimators
      ◮ linear regressors f̂_λ = X β̂_λ, λ ∈ Λ (e.g. Lasso, Exponential-Weighting, etc.)
      Approximation and complexity
      ◮ S = { range(X_J) : J ⊂ {1, ..., p}, 1 ≤ |J| ≤ n/(3 log p) }
      ◮ ∆(S) = log binom(p, dim(S)) + log(dim(S)) ≈ dim(S) log(p)
      Subcollection Ŝ (see the sketch below)
      ◮ we set Ŝ_λ = range( X_supp(β̂_λ) ) and define Ŝ = { Ŝ_λ, λ ∈ Λ̂ }, where Λ̂ = { λ ∈ Λ : Ŝ_λ ∈ S }.
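
A sketch of this instantiation along a Lasso path: it builds the sub-collection Ŝ and the weights ∆(S), which can then be fed to the criterion sketch given after slide 11. The use of scikit-learn's `lasso_path` and the helper names are our choices.

```python
import numpy as np
from scipy.special import gammaln
from sklearn.linear_model import lasso_path

def log_binom(p, d):
    """Logarithm of the binomial coefficient C(p, d)."""
    return gammaln(p + 1) - gammaln(d + 1) - gammaln(p - d + 1)

def lasso_subcollection(X, Y, n_alphas=50):
    """Sketch of slide 14: one candidate space S_lambda = range(X_supp(beta_lambda))
    per Lasso path point, kept only if its dimension stays below n / (3 log p)."""
    n, p = X.shape
    dmax = n / (3 * np.log(p))
    _, coefs, _ = lasso_path(X, Y, n_alphas=n_alphas)    # coefs has shape (p, n_alphas)
    spaces, weights = [], []
    for j in range(coefs.shape[1]):
        J = np.flatnonzero(coefs[:, j])
        if 1 <= len(J) <= dmax:
            Q, _ = np.linalg.qr(X[:, J])                 # orthonormal basis of range(X_J)
            spaces.append(Q)
            weights.append(log_binom(p, len(J)) + np.log(len(J)))   # Delta(S)
    return spaces, weights
```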

  15. Case of the Lasso estimators
      Lasso estimators
        β̂_λ = argmin_β [ ‖Y − Xβ‖² + 2λ ‖β‖₁ ],   λ > 0
      Parameter tuning: theory
      ◮ for X with columns normalized to 1, λ ≍ σ √(2 log(p)) (see the sketch below)
      Parameter tuning: practice
      ◮ V-fold CV
      ◮ BIC criterion
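
A sketch of the theoretical tuning above with scikit-learn (our choice of solver); note the rescaling between the slide's criterion ‖Y − Xβ‖² + 2λ‖β‖₁ and scikit-learn's (1/2n)‖Y − Xβ‖² + α‖β‖₁, and note that this rule needs σ, which is precisely the difficulty the talk addresses.

```python
import numpy as np
from sklearn.linear_model import Lasso

def lasso_theoretical_tuning(X, Y, sigma):
    """Fit the Lasso with the theoretical penalty level of slide 15.

    Slide 15: lambda ≍ sigma * sqrt(2 log p) for ||Y - X b||^2 + 2 lambda ||b||_1
    with columns of X normalized to 1. scikit-learn minimizes
    (1/2n)||Y - X b||^2 + alpha ||b||_1, hence alpha = lambda / n.
    """
    n, p = X.shape
    lam = sigma * np.sqrt(2 * np.log(p))
    return Lasso(alpha=lam / n, fit_intercept=False, max_iter=10_000).fit(X, Y)
```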

  16. Recent criteria pivotal with respect to the variance
      ◮ ℓ₁-penalized log-likelihood (Städler, Bühlmann, van de Geer):
          (β̂_λ^LL, σ̂_λ^LL) := argmin_{β ∈ R^p, σ′ > 0} [ n log(σ′) + ‖Y − Xβ‖² / (2σ′²) + λ ‖β‖₁ / σ′ ]
      ◮ ℓ₁-penalized Huber loss (Belloni et al., Antoniadis), see the sketch below:
          (β̂_λ^SR, σ̂_λ^SR) := argmin_{β ∈ R^p, σ′ > 0} [ n σ′ / 2 + ‖Y − Xβ‖² / (2σ′) + λ ‖β‖₁ ]
        Equivalent to the Square-Root Lasso (introduced earlier):
          β̂_λ^SR = argmin_{β ∈ R^p} [ ‖Y − Xβ‖₂ + (λ/√n) ‖β‖₁ ]
      ◮ Sun & Zhang: optimization with a single LARS call.
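
The Huber-type objective is jointly convex in (β, σ′) and can be minimized by alternating a Lasso step in β with the closed-form update σ′ = ‖Y − Xβ‖₂/√n, in the spirit of Sun & Zhang's scaled Lasso. The sketch below uses scikit-learn and illustrative defaults, not the single-LARS-call implementation mentioned on the slide.

```python
import numpy as np
from sklearn.linear_model import Lasso

def scaled_lasso(X, Y, lam=None, n_iter=20):
    """Sketch of the variance-pivotal criterion of slide 16 by alternating minimization.

    For fixed sigma', the beta-step is a Lasso with penalty lam * sigma';
    for fixed beta, the sigma'-step is sigma' = ||Y - X beta||_2 / sqrt(n).
    Solver, initialization and stopping rule are our choices.
    """
    n, p = X.shape
    if lam is None:
        lam = np.sqrt(2 * np.log(p))       # a sigma-free penalty level, as in the theory
    sigma = np.std(Y)                       # crude initialization of sigma'
    beta = np.zeros(p)
    for _ in range(n_iter):
        # beta-step: ||Y - X b||^2 + 2 lam sigma ||b||_1  <=>  sklearn alpha = lam*sigma/n
        model = Lasso(alpha=lam * sigma / n, fit_intercept=False, max_iter=10_000)
        model.fit(X, Y)
        beta = model.coef_
        sigma_new = np.linalg.norm(Y - X @ beta) / np.sqrt(n)
        if abs(sigma_new - sigma) < 1e-6 * sigma:
            break
        sigma = sigma_new
    return beta, sigma
```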

  17. The compatibility constant
        κ[ξ, T] = min_{u ∈ C(ξ,T)} { |T|^{1/2} ‖Xu‖₂ / ‖u_T‖₁ },
      where C(ξ, T) = { u : ‖u_{T^c}‖₁ < ξ ‖u_T‖₁ }.
      Restricted eigenvalue (illustrated in the sketch below)
      For k* = n/(3 log(p)) we set φ* = sup { ‖Xu‖₂ / ‖u‖₂ : u k*-sparse }.
      Theorem for the Square-Root Lasso (Sun & Zhang)
      For λ = 2√(2 log(p)), if we assume that
      ◮ ‖β₀‖₀ ≤ C₁ κ²[4, supp(β₀)] × n / log(p),
      then, with high probability,
        ‖X(β̂ − β₀)‖₂² ≤ inf_{β ≠ 0} [ ‖X(β₀ − β)‖₂² + C₂ (‖β‖₀ log(p) / κ²[4, supp(β)]) σ² ].
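
Both κ[ξ, T] and φ* are combinatorial quantities. As a small illustration of the definition of φ*, here is a brute-force computation, feasible only for tiny designs; the sizes and names are ours.

```python
import numpy as np
from itertools import combinations

def sparse_operator_norm(X, k):
    """Brute-force phi* = sup{ ||Xu||_2 / ||u||_2 : u k-sparse }.

    Equals the largest singular value of X restricted to the best subset of k
    columns; this only illustrates the definition on slide 17, it is not practical.
    """
    p = X.shape[1]
    return max(np.linalg.norm(X[:, list(T)], ord=2)
               for T in combinations(range(p), k))

# Example on a tiny random design (illustrative sizes).
X_small = np.random.default_rng(1).standard_normal((30, 12))
X_small /= np.linalg.norm(X_small, axis=0)
print(sparse_operator_norm(X_small, k=3))
```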

  18. The compatibility constant
        κ[ξ, T] = min_{u ∈ C(ξ,T)} { |T|^{1/2} ‖Xu‖₂ / ‖u_T‖₁ },
      where C(ξ, T) = { u : ‖u_{T^c}‖₁ < ξ ‖u_T‖₁ }.
      Restricted eigenvalue
      For k* = n/(3 log(p)) we set φ* = sup { ‖Xu‖₂ / ‖u‖₂ : u k*-sparse }.
      Theorem for the LinSelect Lasso
      If we assume that
      ◮ ‖β₀‖₀ ≤ C₁ κ²[4, supp(β₀)] × n / (φ* log(p)),
      then, with high probability,
        ‖X(β̂ − β₀)‖₂² ≤ C inf_{β ≠ 0} [ ‖X(β₀ − β)‖₂² + C₂ φ* (‖β‖₀ log(p) / κ²[4, supp(β)]) σ² ].

  19. Numerical experiments (1/2): tuning the Lasso
      ◮ 165 examples extracted from the literature
      ◮ each example e is evaluated on the basis of 400 runs
      Comparison to the oracle β̂_λ*
      procedure (quantiles)  |   0% |  50% |  75% |  90% |  95%
      Lasso 10-fold CV       | 1.03 | 1.11 | 1.15 | 1.19 | 1.24
      Lasso LinSelect        | 0.97 | 1.03 | 1.06 | 1.19 | 2.52
      Square-Root Lasso      | 1.32 | 2.61 | 3.37 | 11.2 |   17
      For each procedure ℓ, quantiles of R(β̂_{λ̂_ℓ}; β₀) / R(β̂_{λ*}; β₀) over the examples e = 1, ..., 165.

  20. Numerical experiments (2/2): computation time
        n  |   p  | 10-fold CV | LinSelect | Square-Root
       100 |  100 |     4 s    |  0.21 s   |   0.18 s
       100 |  500 |    4.8 s   |  0.43 s   |   0.4 s
       500 |  500 |    300 s   |   11 s    |   6.3 s
      Packages:
      ◮ enet for 10-fold CV and LinSelect
      ◮ lars for Square-Root Lasso (procedure of Sun & Zhang)

  21. Non-parametric regression

  22. An important class of estimators
      Linear estimators: f̂_λ = A_λ Y with A_λ ∈ R^{n×n} (see the sketch below)
      ◮ spline smoothing or kernel ridge estimators with smoothing parameter λ ∈ R₊
      ◮ Nadaraya estimators A_λ with smoothing parameter λ ∈ R₊
      ◮ λ-nearest neighbors, λ ∈ {1, ..., k}
      ◮ L²-basis projection (on the λ first elements)
      ◮ etc.
      Selection criteria (with σ² unknown)
      ◮ cross-validation schemes (including GCV)
      ◮ Mallows' C_L + plug-in / slope heuristic
      ◮ LinSelect
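
As a concrete instance of a linear estimator f̂_λ = A_λ Y, here is a minimal sketch of the smoother matrix of kernel ridge regression with a Gaussian kernel; the kernel choice, bandwidth and λ are illustrative assumptions, not prescribed by the talk.

```python
import numpy as np

def kernel_ridge_smoother(x, bandwidth, lam):
    """Build the n x n smoother matrix A_lambda = K (K + lam I)^{-1} of kernel
    ridge regression with a Gaussian kernel on 1-D design points x.

    bandwidth and lam are the tuning parameters one would like to select
    without knowing sigma^2.
    """
    K = np.exp(-0.5 * ((x[:, None] - x[None, :]) / bandwidth) ** 2)
    n = len(x)
    return K @ np.linalg.solve(K + lam * np.eye(n), np.eye(n))

# Usage: f_hat = kernel_ridge_smoother(x, bandwidth=0.3, lam=1.0) @ Y
```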

  24. Slope heuristic (Arlot & Bach)
      Procedure for f̂_λ = A_λ Y (see the sketch below)
      1. compute λ̂₀(σ′) = argmin_λ [ ‖Y − f̂_λ‖² + σ′ Tr(2A_λ − A_λ* A_λ) ]
      2. select σ̂ such that Tr(A_{λ̂₀(σ̂)}) ∈ [n/10, n/3]
      3. select λ̂ = argmin_λ [ ‖Y − f̂_λ‖² + 2 σ̂² Tr(A_λ) ].
      Main assumptions
      ◮ A_λ ≈ shrinkage or "averaging" matrix (covers all classics)
      ◮ bias assumption: ∃ λ₁, Tr(A_{λ₁}) ≤ √n and ‖(I − A_{λ₁}) f‖² ≤ σ² √(n log(n))
      Theorem (Arlot & Bach)
      With high probability:  ‖f̂_λ̂ − f‖² ≤ (1 + ε) inf_λ ‖f̂_λ − f‖² + C ε⁻¹ log(n) σ².
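
A minimal sketch of the three-step procedure above for a finite family of smoother matrices. The dictionary layout, the interpretation of σ′ as a candidate variance level, and the grid search replacing the jump detection in step 2 are our simplifications, not Arlot & Bach's implementation.

```python
import numpy as np

def slope_heuristic_select(Y, smoothers, sigma2_grid):
    """Select lambda for linear estimators f_lam = A_lam Y by the slope heuristic of slide 24.

    `smoothers` is a dict {lam: A_lam} of n x n matrices and `sigma2_grid` an
    increasing grid of candidate variance levels (playing the role of sigma').
    """
    n = len(Y)
    fits = {lam: A @ Y for lam, A in smoothers.items()}

    def lam0(s2):
        # step 1: minimal-penalty criterion ||Y - f_lam||^2 + s2 * Tr(2 A - A* A)
        return min(smoothers, key=lambda lam:
                   np.sum((Y - fits[lam]) ** 2)
                   + s2 * (2 * np.trace(smoothers[lam])
                           - np.trace(smoothers[lam].T @ smoothers[lam])))

    # step 2: pick sigma2_hat such that Tr(A_{lam0(sigma2_hat)}) falls in [n/10, n/3]
    sigma2_hat = next(s2 for s2 in sigma2_grid
                      if n / 10 <= np.trace(smoothers[lam0(s2)]) <= n / 3)

    # step 3: Mallows-type criterion ||Y - f_lam||^2 + 2 * sigma2_hat * Tr(A_lam)
    return min(smoothers, key=lambda lam:
               np.sum((Y - fits[lam]) ** 2)
               + 2 * sigma2_hat * np.trace(smoothers[lam]))
```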
