Statistics for High-Dimensional Data: Selected Topics
Peter Bühlmann, Seminar für Statistik, ETH Zürich
June 2014
High-dimensional linear model: basic concepts in methodology, theory and computation
High-dimensional data
Riboflavin production with Bacillus Subtilis (in collaboration with DSM (Switzerland))
goal: improve the riboflavin production rate of Bacillus Subtilis using clever genetic engineering
response variable Y ∈ R: riboflavin (log-)production rate
covariates X ∈ R^p: expressions from p = 4088 genes
sample size n = 115, p ≫ n
[Figure: Y versus 9 "reasonable" genes (gene expression data) — 3 x 3 grid of scatter plots of y against selected gene expressions xsel]
general framework: Z_1, …, Z_n (with some "i.i.d. components"), dim(Z_i) ≫ n
for example:
Z_i = (X_i, Y_i), X_i ∈ X ⊆ R^p, Y_i ∈ Y ⊆ R: regression with p ≫ n
Z_i = (X_i, Y_i), X_i ∈ X ⊆ R^p, Y_i ∈ {0, 1}: classification for p ≫ n
numerous applications: biology, imaging, economy, environmental sciences, ...
High-dimensional linear models
Y_i = Σ_{j=1}^p β^0_j X_i^(j) + ε_i, i = 1, …, n, p ≫ n
in short: Y = Xβ + ε
goals:
◮ prediction, e.g. w.r.t. squared prediction error
◮ estimation of β^0, e.g. w.r.t. ‖β̂ − β^0‖_q (q = 1, 2)
◮ variable selection, i.e. estimating the active set with the effective variables (having corresponding coefficient ≠ 0)
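As a concrete companion to this model, here is a minimal simulation sketch in Python; the sizes n, p, s_0 and the noise level are hypothetical choices, not the riboflavin data.

# Minimal sketch (hypothetical sizes, not the riboflavin data): simulate a
# sparse high-dimensional linear model Y = X beta^0 + eps with p >> n.
import numpy as np

rng = np.random.default_rng(0)
n, p, s0 = 100, 500, 5                  # sample size, dimension, sparsity (assumed values)
X = rng.standard_normal((n, p))         # design matrix
beta0 = np.zeros(p)
beta0[:s0] = 1.0                        # s0 non-zero coefficients: the "active set"
eps = 0.5 * rng.standard_normal(n)      # noise
Y = X @ beta0 + eps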
we need to regularize... and there are many proposals ◮ Bayesian methods for regularization ◮ greedy algorithms: aka forward selection or boosting ◮ preliminary dimension reduction ◮ ... e.g. 4’160’000 entries on Google Scholar for “high dimensional linear model” ...
Penalty-based methods
if the true β^0 is sparse w.r.t.
◮ ‖β^0‖_0^0 = number of non-zero coefficients
❀ regularize with the ‖·‖_0-penalty: argmin_β ( n^{-1}‖Y − Xβ‖_2^2 + λ‖β‖_0^0 ), e.g. AIC, BIC
❀ computationally infeasible if p is large (2^p sub-models)
◮ ‖β^0‖_1 = Σ_{j=1}^p |β^0_j|
❀ penalize with the ‖·‖_1-norm, i.e. Lasso: argmin_β ( n^{-1}‖Y − Xβ‖_2^2 + λ‖β‖_1 )
❀ convex optimization: computationally feasible and very fast for large p
as we will see: regularization with the ℓ_1-norm is good ("nearly optimal") even if the truth is sparse w.r.t. ‖β^0‖_0^0
❀ can exploit the computational advantage of ‖·‖_1-norm regularization even if the problem has a different sparsity structure
The Lasso (Tibshirani, 1996)
Lasso for linear models:
β̂(λ) = argmin_β ( n^{-1}‖Y − Xβ‖_2^2 + λ‖β‖_1 ), λ ≥ 0, ‖β‖_1 = Σ_{j=1}^p |β_j|
❀ convex optimization problem
◮ Lasso does variable selection: some of the β̂_j(λ) = 0 (because of the "ℓ_1-geometry")
◮ β̂(λ) is a shrunken LS-estimate
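A minimal sketch of fitting the Lasso on the simulated data from the earlier sketch, using scikit-learn (an assumed tool, not part of the slides); scikit-learn's objective is (2n)^{-1}‖Y − Xβ‖_2^2 + α‖β‖_1, so its α corresponds to λ/2 in the notation used here.

# Continues the simulation sketch above (X, Y, np already defined).
# scikit-learn's Lasso minimizes (1/(2n))||Y - Xb||^2 + alpha*||b||_1,
# i.e. alpha corresponds to lambda/2 in the slides' parametrization.
from sklearn.linear_model import Lasso

lam = 0.2                                               # hypothetical penalty parameter lambda
fit = Lasso(alpha=lam / 2, fit_intercept=False).fit(X, Y)
beta_hat = fit.coef_
print("selected variables:", np.flatnonzero(beta_hat))  # variable selection: many exact zeros
print("shrinkage: max |beta_hat| =", np.abs(beta_hat).max(), "(true non-zero value is 1.0)")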
more about the "ℓ_1-geometry"
equivalence to the primal problem:
β̂_primal(R) = argmin_{β: ‖β‖_1 ≤ R} ‖Y − Xβ‖_2^2 / n,
with a correspondence between λ and R which depends on the data (X_1, Y_1), …, (X_n, Y_n)
[such an equivalence holds since
◮ ‖Y − Xβ‖_2^2 / n is convex in β
◮ the constraint ‖β‖_1 ≤ R is convex
see e.g. Bertsekas (1995)]
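One direction of this equivalence can be written out in a few lines (a standard convex argument, sketched here rather than quoted from the slides): the penalized solution also solves the constrained problem with the data-dependent radius R = ‖β̂(λ)‖_1.

% Sketch: if \hat\beta(\lambda) minimizes n^{-1}\|Y-X\beta\|_2^2 + \lambda\|\beta\|_1,
% set R := \|\hat\beta(\lambda)\|_1. Then for every \beta with \|\beta\|_1 \le R,
\begin{align*}
n^{-1}\|Y - X\hat\beta(\lambda)\|_2^2
  &\le n^{-1}\|Y - X\beta\|_2^2
     + \lambda\bigl(\|\beta\|_1 - \|\hat\beta(\lambda)\|_1\bigr) \\
  &\le n^{-1}\|Y - X\beta\|_2^2 ,
\end{align*}
% so \hat\beta(\lambda) also minimizes the constrained (primal) problem with this R.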
p = 2; left: the ℓ_1-"world"
[Figure: contour lines of the residual sum of squares and the ℓ_1-ball in p = 2]
the residual sum of squares reaches a minimal value (for certain constellations of the data) if its contour lines hit the ℓ_1-ball in its corner ❀ β̂_1 = 0
the ℓ_2-"world" is different
Ridge regression:
β̂_Ridge(λ) = argmin_β ( ‖Y − Xβ‖_2^2 / n + λ‖β‖_2^2 ),
equivalent primal solution:
β̂_Ridge;primal(R) = argmin_{β: ‖β‖_2 ≤ R} ‖Y − Xβ‖_2^2 / n,
with a one-to-one correspondence between λ and R
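To contrast the two geometries numerically, a short continuation of the earlier sketch (scikit-learn again assumed; its Ridge objective is ‖Y − Xβ‖_2^2 + α‖β‖_2^2, so α corresponds to nλ in this parametrization):

# Continues the sketch above (X, Y, beta_hat, np available).
# Ridge shrinks coefficients but essentially never sets them exactly to zero.
from sklearn.linear_model import Ridge

ridge = Ridge(alpha=1.0, fit_intercept=False).fit(X, Y)             # hypothetical penalty
print("Ridge coefficients exactly 0:", np.sum(ridge.coef_ == 0))    # typically 0
print("Lasso coefficients exactly 0:", np.sum(beta_hat == 0))       # typically most of the p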
A note on the Bayesian approach
model: β_1, …, β_p i.i.d. ∼ p(β) dβ; given β: Y ∼ N_n(Xβ, σ^2 I_n) with density f(y | β, σ^2)
posterior density:
p(β | Y, σ^2) = f(Y | β, σ^2) p(β) / ∫ f(Y | β, σ^2) p(β) dβ ∝ f(Y | β, σ^2) p(β)
and hence for the MAP (Maximum A-Posteriori) estimator:
β̂_MAP = argmax_β p(β | Y, σ^2) = argmin_β ( −log f(Y | β, σ^2) − log p(β) )
       = argmin_β ( (2σ^2)^{-1} ‖Y − Xβ‖_2^2 − Σ_{j=1}^p log p(β_j) )
examples:
1. Double-Exponential prior DExp(τ): p(β) = (τ/2) exp(−τ|β|)
❀ β̂_MAP equals the Lasso with penalty parameter λ = 2σ^2 τ / n
2. Gaussian prior N(0, τ^2): p(β) = (1/(√(2π) τ)) exp(−β^2/(2τ^2))
❀ β̂_MAP equals the Ridge estimator with penalty parameter λ = σ^2/(n τ^2)
but we will argue that the Lasso (i.e., the MAP estimator for the Double-Exponential prior) is also good if the truth is sparse with respect to ‖β^0‖_0^0, e.g. if the prior is (much) more spiky around zero than the Double-Exponential distribution
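The claimed value of λ in example 1 follows by plugging the Double-Exponential prior into the MAP criterion above (a short worked step; notation as on the previous slide):

\begin{align*}
\hat\beta_{\mathrm{MAP}}
 &= \operatorname*{argmin}_{\beta}\Bigl\{\tfrac{1}{2\sigma^2}\|Y-X\beta\|_2^2
      - \sum_{j=1}^{p}\log p(\beta_j)\Bigr\}
  = \operatorname*{argmin}_{\beta}\Bigl\{\tfrac{1}{2\sigma^2}\|Y-X\beta\|_2^2
      + \tau\|\beta\|_1\Bigr\} \\
 &= \operatorname*{argmin}_{\beta}\Bigl\{ n^{-1}\|Y-X\beta\|_2^2
      + \tfrac{2\sigma^2\tau}{n}\|\beta\|_1\Bigr\},
 \qquad\text{i.e. the Lasso with } \lambda = \tfrac{2\sigma^2\tau}{n}.
\end{align*}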
Orthonormal design
Y = Xβ + ε, n^{-1} X^T X = I
Lasso = soft-thresholding estimator:
β̂_j(λ) = sign(Z_j)(|Z_j| − λ/2)_+, with Z_j = (n^{-1} X^T Y)_j the OLS estimate,
i.e. β̂_j(λ) = g_soft(Z_j) [Exercise]
[Figure: threshold functions g(z) vs. z — Adaptive Lasso, Hard-thresholding, Soft-thresholding]
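A small numerical sketch of this closed form (the orthonormal design is constructed artificially; scikit-learn is used only to cross-check, with its α = λ/2 convention):

# Sketch: soft-thresholding as the Lasso solution under an orthonormal design.
import numpy as np
from sklearn.linear_model import Lasso

def soft_threshold(z, lam):
    # g_soft(z) = sign(z) * (|z| - lam/2)_+   (slides' parametrization)
    return np.sign(z) * np.maximum(np.abs(z) - lam / 2.0, 0.0)

rng = np.random.default_rng(1)
n, p, lam = 200, 20, 0.5                           # hypothetical sizes and penalty
Q, _ = np.linalg.qr(rng.standard_normal((n, p)))   # columns of Q are orthonormal
Xo = np.sqrt(n) * Q                                # so that n^{-1} Xo^T Xo = I_p
Yo = Xo @ np.r_[np.ones(3), np.zeros(p - 3)] + rng.standard_normal(n)

Z = Xo.T @ Yo / n                                  # OLS estimate under orthonormal design
beta_soft = soft_threshold(Z, lam)
beta_lasso = Lasso(alpha=lam / 2, fit_intercept=False).fit(Xo, Yo).coef_
print(np.max(np.abs(beta_soft - beta_lasso)))      # ~0 up to the solver's tolerance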
Prediction
goal: predict a new observation Y_new
consider the expected (w.r.t. new data, and random X) squared error loss:
E_{X_new, Y_new}[(Y_new − X_new β̂)^2] = σ^2 + E_{X_new}[(X_new(β^0 − β̂))^2]
  = σ^2 + (β̂ − β^0)^T Σ (β̂ − β^0), where Σ = Cov(X)
❀ terminology "prediction error":
for random design X: (β̂ − β^0)^T Σ (β̂ − β^0) = E_{X_new}[(X_new(β̂ − β^0))^2]
for fixed design X: (β̂ − β^0)^T Σ̂ (β̂ − β^0) = ‖X(β̂ − β^0)‖_2^2 / n, with Σ̂ = n^{-1} X^T X
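Continuing the simulated example, the fixed-design prediction error from the last display can be computed directly (a sanity-check sketch, not part of the slides):

# Continues the earlier sketch (X, beta0, beta_hat, n, np available).
pred_err = np.sum((X @ (beta_hat - beta0)) ** 2) / n   # ||X(beta_hat - beta0)||_2^2 / n
noise_var = 0.5 ** 2                                   # sigma^2 used in the simulation
print("fixed-design prediction error:", pred_err, " (noise variance:", noise_var, ")")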
binary lymph node classification using gene expressions: a high-noise problem
n = 49 samples, p = 7130 gene expressions
despite being a classification problem: P[Y = 1 | X = x] = E[Y | X = x], so estimate p̂(x) via a linear model and then classify
❀ cross-validated misclassification error (2/3 training; 1/3 test):

method   Lasso    L_2 Boosting   FPLR     Pelora   1-NN     DLDA     SVM
error    21.1%    17.7%          35.25%   27.8%    43.25%   36.12%   36.88%

1-NN, DLDA, SVM: with variable selection using the best 200 genes (Wilcoxon test); Lasso, L_2 Boosting: no additional variable selection
Lasso selected on CV-average 13.12 out of p = 7130 genes
from a practical perspective: if you trust in cross-validation, you can validate how good the prediction is, i.e. prediction may be a black box, but we can evaluate it!
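A minimal sketch of the kind of cross-validated evaluation described above, on synthetic binary data (not the lymph-node dataset; split sizes and penalty are assumed): fit a linear model to the 0/1 response with the Lasso and classify by thresholding p̂(x) at 1/2.

# Sketch on synthetic data: linear-model-based classification, evaluated by
# repeated 2/3-training / 1/3-test splits (cross-validated misclassification).
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import ShuffleSplit

rng = np.random.default_rng(2)
n, p = 60, 1000                                                     # hypothetical sizes
Xb = rng.standard_normal((n, p))
yb = (Xb[:, 0] + 0.5 * rng.standard_normal(n) > 0).astype(float)    # binary response

errs = []
for tr, te in ShuffleSplit(n_splits=20, test_size=1 / 3, random_state=0).split(Xb):
    phat = Lasso(alpha=0.05).fit(Xb[tr], yb[tr]).predict(Xb[te])    # \hat p(x) via linear model
    errs.append(np.mean((phat > 0.5).astype(float) != yb[te]))      # classify at threshold 1/2
print("cross-validated misclassification error:", np.mean(errs))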
and in fact: we will hear that ◮ Lasso is consistent for prediction assuming “essentially nothing” ◮ Lasso is optimal for prediction assuming the “compatibility condition” for X
Estimation of regression coefficients
Y = Xβ^0 + ε, p ≫ n, with fixed (deterministic) design X
problem of identifiability: for p > n, Xβ^0 = Xθ for any θ = β^0 + ξ with ξ in the null-space of X
❀ cannot say anything about ‖β̂ − β^0‖ without further assumptions!
❀ we will work with the compatibility assumption (see later)
and we will explain: under the compatibility condition,
‖β̂ − β^0‖_1 ≤ C (s_0/φ_0^2) √(log(p)/n), where s_0 = |supp(β^0)| = |{j; β^0_j ≠ 0}|
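As a rough illustration of the ℓ_1-bound (constants and φ_0 ignored), continuing the simulated example:

# Continues the earlier sketch (beta_hat, beta0, s0, p, n, np available).
l1_err = np.sum(np.abs(beta_hat - beta0))
rate = s0 * np.sqrt(np.log(p) / n)                 # s0 * sqrt(log(p)/n), constants dropped
print(f"||beta_hat - beta0||_1 = {l1_err:.3f},  s0*sqrt(log(p)/n) = {rate:.3f}")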
=⇒ let's turn to the blackboard!
various conditions and their relations (van de Geer & PB, 2009)
[Figure: diagram of implications among design conditions leading to oracle inequalities for prediction and estimation — RIP and weak (S,2s)-RIP, adaptive (S,2s)- and (S,s)-restricted regression, (S,2s)- and (S,s)-restricted eigenvalue, S-compatibility, coherence, (S,2s)-, weak (S,2s)- and (S,s)-uniform irrepresentable conditions, and the variable-selection statements |Ŝ∖S| ≤ s, |Ŝ∖S| = 0, Ŝ = S]