Statistics for high-dimensional data: p-values and confidence intervals
Peter Bühlmann, Seminar für Statistik, ETH Zürich, June 2014
High-dimensional data
Behavioral economics and genetics (with Ernst Fehr, U. Zurich)
◮ n = 1'525 persons
◮ genetic information (SNPs): p ≈ 10^6
◮ 79 response variables, measuring "behavior"
p ≫ n
goal: find significant associations between behavioral responses and genetic markers
[Figure: number of significant target SNPs per phenotype, plotted against the phenotype index (1-79)]
In high-dimensional statistics, a lot of progress has been achieved over the last 8-10 years for
◮ point estimation
◮ rates of convergence
but very little work on assigning measures of uncertainty: p-values, confidence intervals
we need uncertainty quantification! (the core of statistics)
goal (regarding the title of the talk): p-values/confidence intervals for a high-dimensional linear model (and we can then generalize to other models)
Motif regression and variable selection
for finding HIF1α transcription factor binding sites in DNA sequences (Müller, Meier, PB & Ricci)
for coarse DNA segments i = 1, ..., n:
◮ predictor $X_i = (X_i^{(1)}, \ldots, X_i^{(p)}) \in \mathbb{R}^p$: abundance scores of candidate motifs j = 1, ..., p in DNA segment i (using sequence data and computational biology algorithms, e.g. MDSCAN)
◮ univariate response $Y_i \in \mathbb{R}$: binding intensity of HIF1α to coarse DNA segment i (from ChIP-chip experiments)
question: relation between the binding intensity Y and the abundance of short candidate motifs?
❀ a linear model is often reasonable: "motif regression" (Conlon, X.S. Liu, Lieb & J.S. Liu, 2003)
$$Y_i = \sum_{j=1}^p \beta_j^0 X_i^{(j)} + \varepsilon_i, \quad i = 1, \ldots, n = 143, \; p = 195$$
goal: variable selection and significance of variables
❀ find the relevant motifs among the p = 195 candidates
Lasso for variable selection:
$\hat{S}(\lambda) = \{j;\ \hat{\beta}_j(\lambda) \neq 0\}$, an estimate for $S_0 = \{j;\ \beta_j^0 \neq 0\}$
no significance testing involved; it's convex optimization only! and it's very popular (Meinshausen & PB, 2006; Zhao & Yu, 2006; Wainwright, 2009; ...)
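A minimal sketch (not from the talk) of this selection step on simulated stand-in data, using scikit-learn's LassoCV; the data-generating choices below are purely illustrative.

```python
import numpy as np
from sklearn.linear_model import LassoCV

# simulated stand-in for the motif-regression setting (n = 143, p = 195)
rng = np.random.default_rng(0)
n, p = 143, 195
X = rng.standard_normal((n, p))
beta0 = np.zeros(p)
beta0[:5] = 1.0                      # a few truly relevant "motifs" (illustrative)
y = X @ beta0 + rng.standard_normal(n)

# Lasso with lambda chosen by cross-validation (lambda_hat_CV)
lasso = LassoCV(cv=10).fit(X, y)

# estimated active set S_hat(lambda) = {j : beta_hat_j(lambda) != 0}
S_hat = np.flatnonzero(lasso.coef_ != 0)
print("selected variables:", S_hat)
print("R^2 on the data:", lasso.score(X, y))
```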
for motif regression (finding HIF1α transcription factor binding sites), n = 143, p = 195:
❀ the Lasso selects 26 covariates when choosing $\lambda = \hat{\lambda}_{CV}$ via cross-validation, with resulting $R^2 \approx 50\%$
i.e. 26 interesting candidate motifs
how significant are the findings?
[Figure: estimated coefficients $\hat{\beta}(\hat{\lambda}_{CV})$ on the original data, plotted against the variable index]
p-values for $H_{0,j}: \beta_j^0 = 0$?
P-values for high-dimensional linear models $Y = X\beta^0 + \varepsilon$
goal: statistical hypothesis testing of $H_{0,j}: \beta_j^0 = 0$, or $H_{0,G}: \beta_j^0 = 0$ for all $j \in G \subseteq \{1, \ldots, p\}$
background: if we could handle the asymptotic distribution of the Lasso $\hat{\beta}(\lambda)$ under the null hypothesis ❀ we could construct p-values
this is very difficult! the asymptotic distribution of $\hat{\beta}$ has some point mass at zero, ... (Knight and Fu, 2000, for p < ∞ and n → ∞)
❀ standard bootstrapping and subsampling cannot be used either, but there are recent proposals using adaptations of standard resampling methods (Chatterjee & Lahiri, 2013; Liu & Yu, 2013)
❀ non-uniformity/super-efficiency issues remain ...
Low-dimensional projections and bias correction
or: de-sparsifying the Lasso estimator (related work by Zhang and Zhang, 2011)
motivation: $\hat{\beta}_{OLS,j}$ = projection of Y onto the residuals $(X_j - X_{-j}\hat{\gamma}_{OLS}^{(j)})$
the projection is not well defined if p > n ❀ use "regularized" residuals from the Lasso of $X_j$ on the other X-variables:
$$Z_j = X_j - X_{-j}\hat{\gamma}_{Lasso}^{(j)}, \quad \hat{\gamma}^{(j)} = \operatorname{argmin}_{\gamma}\ \|X_j - X_{-j}\gamma\|_2^2/n + \lambda_j\|\gamma\|_1$$
using $Y = X\beta^0 + \varepsilon$:
$$Z_j^T Y = Z_j^T X_j\,\beta_j^0 + \sum_{k \neq j} Z_j^T X_k\,\beta_k^0 + Z_j^T\varepsilon$$
and hence
$$\frac{Z_j^T Y}{Z_j^T X_j} = \beta_j^0 + \underbrace{\sum_{k \neq j}\frac{Z_j^T X_k}{Z_j^T X_j}\,\beta_k^0}_{\text{bias}} + \underbrace{\frac{Z_j^T\varepsilon}{Z_j^T X_j}}_{\text{noise component}}$$
❀
$$\hat{b}_j = \frac{Z_j^T Y}{Z_j^T X_j} - \underbrace{\sum_{k \neq j}\frac{Z_j^T X_k}{Z_j^T X_j}\,\hat{\beta}_{Lasso;k}}_{\text{Lasso-estim. bias corr.}}$$
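A minimal sketch of these two steps for a single coordinate j, assuming scikit-learn's Lasso for the nodewise regression (whose penalty parametrization differs from the slide's by constant factors) and a hypothetical fixed λ_j:

```python
import numpy as np
from sklearn.linear_model import Lasso, LassoCV

def desparsified_coordinate(X, y, j, lam_j=0.1):
    """Bias-corrected (de-sparsified) Lasso estimate of beta_j^0.

    Sketch only: lam_j for the nodewise Lasso is a hypothetical fixed value;
    in practice it would be chosen e.g. by cross-validation.
    """
    n, p = X.shape
    others = [k for k in range(p) if k != j]

    # Z_j: "regularized" residuals from the Lasso of X_j on X_{-j}
    nodewise = Lasso(alpha=lam_j).fit(X[:, others], X[:, j])
    Z_j = X[:, j] - nodewise.predict(X[:, others])

    # initial Lasso fit of Y on X (lambda via cross-validation)
    beta_hat = LassoCV(cv=10).fit(X, y).coef_

    # projection estimate plus Lasso-estimated bias correction
    denom = Z_j @ X[:, j]
    proj = (Z_j @ y) / denom
    bias = sum((Z_j @ X[:, k]) / denom * beta_hat[k] for k in others)
    return proj - bias
```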
$\hat{b}_j$ is not sparse! ... and this is crucial to obtain the Gaussian limit; nevertheless: it is "optimal" (see later)
◮ target: the low-dimensional component $\beta_j^0$
◮ $\eta := \{\beta_k^0;\ k \neq j\}$ is a high-dimensional nuisance parameter, sparsely estimated (e.g. with the Lasso)
❀ exactly as in semiparametric modeling!
=⇒ let's turn to the blackboard!
A general principle: de-sparsifying via "inversion" of the KKT conditions
KKT conditions (sub-differential of the objective function $\|Y - X\beta\|_2^2/n + \lambda\|\beta\|_1$):
$$-X^T(Y - X\hat{\beta})/n + \lambda\hat{\tau} = 0, \quad Y = X\beta^0 + \varepsilon$$
with $\|\hat{\tau}\|_\infty \leq 1$ and $\hat{\tau}_j = \operatorname{sign}(\hat{\beta}_j)$ if $\hat{\beta}_j \neq 0$.
with $\hat{\Sigma} = X^T X/n$ ❀
$$\hat{\Sigma}(\hat{\beta} - \beta^0) + \lambda\hat{\tau} = X^T\varepsilon/n$$
using a "regularized inverse" of $\hat{\Sigma}$, denoted by $\hat{\Theta}$ (not e.g. GLasso):
$$\hat{\beta} - \beta^0 + \hat{\Theta}\lambda\hat{\tau} = \hat{\Theta}X^T\varepsilon/n - \Delta, \quad \Delta = (\hat{\Theta}\hat{\Sigma} - I)(\hat{\beta} - \beta^0)$$
new estimator:
$$\hat{b} = \hat{\beta} + \hat{\Theta}\lambda\hat{\tau} = \hat{\beta} + \hat{\Theta}X^T(Y - X\hat{\beta})/n$$
❀ $\hat{b}$ is exactly the same estimator as before (based on low-dimensional projections using the residual vectors $Z_j$) ... when taking $\hat{\Theta}$ (the "regularized inverse" of $\hat{\Sigma}$) with rows built from the ("nodewise") Lasso-estimated coefficients of $X_j$ versus $X_{-j}$:
$$\hat{\gamma}_j = \operatorname{argmin}_{\gamma \in \mathbb{R}^{p-1}}\ \|X_j - X_{-j}\gamma\|_2^2/n + 2\lambda_j\|\gamma\|_1$$
Denote by
$$\hat{C} = \begin{pmatrix} 1 & -\hat{\gamma}_{1,2} & \cdots & -\hat{\gamma}_{1,p} \\ -\hat{\gamma}_{2,1} & 1 & \cdots & -\hat{\gamma}_{2,p} \\ \vdots & \vdots & \ddots & \vdots \\ -\hat{\gamma}_{p,1} & -\hat{\gamma}_{p,2} & \cdots & 1 \end{pmatrix}$$
and let $\hat{T}^2 = \operatorname{diag}(\hat{\tau}_1^2, \ldots, \hat{\tau}_p^2)$ with $\hat{\tau}_j^2 = \|X_j - X_{-j}\hat{\gamma}_j\|_2^2/n + \lambda_j\|\hat{\gamma}_j\|_1$. Then
$$\hat{\Theta}_{Lasso} = \hat{T}^{-2}\hat{C}$$
not symmetric ... !
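A sketch of the matrix version: $\hat{\Theta}_{Lasso} = \hat{T}^{-2}\hat{C}$ is assembled by nodewise Lasso regressions and plugged into $\hat{b} = \hat{\beta} + \hat{\Theta}X^T(Y - X\hat{\beta})/n$ from the previous slide. A single hypothetical penalty lam_node is used for all nodes, and scikit-learn's penalty scaling again differs from the slide's 2λ_j by a constant factor.

```python
import numpy as np
from sklearn.linear_model import Lasso, LassoCV

def desparsified_lasso(X, y, lam_node=0.1):
    """De-sparsified Lasso: b_hat = beta_hat + Theta_hat X^T (y - X beta_hat)/n.

    Sketch: Theta_hat is built by nodewise Lasso regressions as on the slide;
    lam_node is a single hypothetical penalty used for every node j.
    """
    n, p = X.shape
    C = np.eye(p)
    tau2 = np.zeros(p)

    for j in range(p):
        others = [k for k in range(p) if k != j]
        fit = Lasso(alpha=lam_node).fit(X[:, others], X[:, j])
        gamma_j = fit.coef_
        C[j, others] = -gamma_j
        resid = X[:, j] - X[:, others] @ gamma_j
        tau2[j] = resid @ resid / n + lam_node * np.abs(gamma_j).sum()

    Theta = C / tau2[:, None]            # T^{-2} C  (row-wise scaling)

    beta_hat = LassoCV(cv=10).fit(X, y).coef_
    b_hat = beta_hat + Theta @ X.T @ (y - X @ beta_hat) / n
    return b_hat, Theta, tau2
```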
“inverting” the KKT conditions is a very general principle ❀ the principle can be used for GLMs and many other models
Asymptotic pivot and optimality
Theorem (van de Geer, PB & Ritov, 2013):
$$\sqrt{n}(\hat{b}_j - \beta_j^0) \Rightarrow \mathcal{N}(0, \sigma_\varepsilon^2\,\Omega_{jj}) \quad (j = 1, \ldots, p)$$
$\Omega_{jj}$ has an explicit expression, $\Omega_{jj} \sim (\Sigma^{-1})_{jj}$: optimal! reaching the semiparametric information bound
❀ asymptotically optimal p-values and confidence intervals
if we assume:
◮ sub-Gaussian design (i.i.d. rows of X sub-Gaussian) with population covariance Cov(X) = Σ having minimal eigenvalue ≥ M > 0
◮ sparsity for the regression of Y vs. X: $s_0 = o(\sqrt{n}/\log(p))$ "quite sparse"
◮ sparsity of the design: $\Sigma^{-1}$ sparse, i.e. sparse regressions of $X_j$ vs. $X_{-j}$: $s_j = o(n/\log(p))$ "maybe restrictive"
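A sketch of how the pivot is turned into p-values and two-sided confidence intervals, assuming Ω is estimated by the plug-in $\hat{\Theta}\hat{\Sigma}\hat{\Theta}^T$ and that a noise-standard-deviation estimate (e.g. from the scaled Lasso) is supplied externally; function and argument names are illustrative.

```python
import numpy as np
from scipy import stats

def desparsified_inference(b_hat, Theta, X, sigma_eps, level=0.05):
    """P-values and CIs from sqrt(n)(b_hat_j - beta_j^0) => N(0, sigma^2 Omega_jj).

    Sketch: Omega is estimated by the plug-in Theta Sigma_hat Theta^T;
    sigma_eps is an external noise-standard-deviation estimate, taken as given.
    """
    n = X.shape[0]
    Sigma_hat = X.T @ X / n
    Omega_diag = np.einsum('ij,jk,ik->i', Theta, Sigma_hat, Theta)

    se = sigma_eps * np.sqrt(Omega_diag / n)        # standard error of b_hat_j
    z = b_hat / se                                  # test of H_{0,j}: beta_j^0 = 0
    pvals = 2 * stats.norm.sf(np.abs(z))

    q = stats.norm.ppf(1 - level / 2)
    ci = np.column_stack([b_hat - q * se, b_hat + q * se])
    return pvals, ci
```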
It is optimal! (Cramér-Rao)
for designs with $\Sigma^{-1}$ non-sparse:
◮ Ridge projection (PB, 2013): good type I error control but not optimal in terms of power
◮ a convex program instead of the Lasso for $Z_j$ (Javanmard & Montanari, 2013; MSc thesis of Dezeure, 2013); Javanmard & Montanari prove optimality
so far: no convincing empirical evidence that we can deal well with such scenarios ($\Sigma^{-1}$ non-sparse)
Uniform convergence:
$$\sqrt{n}(\hat{b}_j - \beta_j^0) \Rightarrow \mathcal{N}(0, \sigma_\varepsilon^2\,\Omega_{jj}) \quad (j = 1, \ldots, p)$$
the convergence is uniform over $B(s_0) = \{\beta;\ \|\beta\|_0 \leq s_0\}$
❀ honest tests and confidence regions! and we can avoid post-model-selection inference (cf. Pötscher and Leeb)
Simultaneous inference over all components:
$$\sqrt{n}(\hat{b} - \beta^0) \approx (W_1, \ldots, W_p) \sim \mathcal{N}_p(0, \sigma_\varepsilon^2\,\Omega)$$
❀ since the covariance structure Ω is known, we can construct p-values for $H_{0,G}$ with any G, using the test statistic $\max_{j \in G}|\hat{b}_j|$, and we can easily do efficient multiple testing adjustment!
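A sketch of a simulation-based p-value for $H_{0,G}$ based on the max-statistic, exploiting that Ω (here: an estimate passed in by the caller) is known; the names and the Monte Carlo approximation are illustrative, not the talk's specific adjustment.

```python
import numpy as np

def group_max_test(b_hat, Omega, sigma_eps, n, G, n_sim=10000, seed=0):
    """P-value for H_{0,G}: beta_j^0 = 0 for all j in G, via max_{j in G} |b_hat_j|.

    Sketch: the null distribution of the max-statistic is approximated by
    simulating W ~ N(0, sigma^2 Omega) restricted to G, which is feasible
    because the covariance structure is known.
    """
    rng = np.random.default_rng(seed)
    G = np.asarray(G)

    observed = np.max(np.abs(np.sqrt(n) * b_hat[G]))

    cov_G = sigma_eps**2 * Omega[np.ix_(G, G)]
    W = rng.multivariate_normal(np.zeros(len(G)), cov_G, size=n_sim)
    null_max = np.max(np.abs(W), axis=1)

    return np.mean(null_max >= observed)
```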