Statistics for High-Dimensional Data: Selected Topics
Peter Bühlmann, Seminar für Statistik, ETH Zürich
June 2014
High-dimensional linear model: basic concepts in methodology, theory and computation
High-dimensional data
Riboflavin production with Bacillus Subtilis (in collaboration with DSM (Switzerland))
goal: improve the riboflavin production rate of Bacillus Subtilis using clever genetic engineering
response variable Y ∈ R: riboflavin (log-)production rate
covariates X ∈ R^p: expressions from p = 4088 genes
sample size n = 115, p ≫ n
[Figure: Y versus 9 "reasonable" genes (gene expression data) — 3 x 3 grid of scatter plots of y against selected gene expressions xsel]
general framework: Z_1, …, Z_n (with some "i.i.d. components"), dim(Z_i) ≫ n
for example:
Z_i = (X_i, Y_i), X_i ∈ X ⊆ R^p, Y_i ∈ Y ⊆ R: regression with p ≫ n
Z_i = (X_i, Y_i), X_i ∈ X ⊆ R^p, Y_i ∈ {0, 1}: classification for p ≫ n
numerous applications: biology, imaging, economy, environmental sciences, ...
High-dimensional linear models
Y_i = Σ_{j=1}^p β^0_j X_i^(j) + ε_i, i = 1, …, n, p ≫ n
in short: Y = Xβ + ε
goals:
◮ prediction, e.g. w.r.t. squared prediction error
◮ estimation of β^0, e.g. w.r.t. ‖β̂ − β^0‖_q (q = 1, 2)
◮ variable selection, i.e. estimating the active set with the effective variables (having corresponding coefficient ≠ 0)
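As a concrete companion to this model, here is a minimal simulation sketch in Python; the sizes n, p, s_0 and the noise level are hypothetical choices, not the riboflavin data.

# Minimal sketch (hypothetical sizes, not the riboflavin data): simulate a
# sparse high-dimensional linear model Y = X beta^0 + eps with p >> n.
import numpy as np

rng = np.random.default_rng(0)
n, p, s0 = 100, 500, 5                  # sample size, dimension, sparsity (assumed values)
X = rng.standard_normal((n, p))         # design matrix
beta0 = np.zeros(p)
beta0[:s0] = 1.0                        # s0 non-zero coefficients: the "active set"
eps = 0.5 * rng.standard_normal(n)      # noise
Y = X @ beta0 + eps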
we need to regularize... and there are many proposals ◮ Bayesian methods for regularization ◮ greedy algorithms: aka forward selection or boosting ◮ preliminary dimension reduction ◮ ... e.g. 4’160’000 entries on Google Scholar for “high dimensional linear model” ...
Penalty-based methods
if the true β^0 is sparse w.r.t.
◮ ‖β^0‖_0^0 = number of non-zero coefficients
❀ regularize with the ‖·‖_0-penalty: argmin_β ( n^{-1}‖Y − Xβ‖_2^2 + λ‖β‖_0^0 ), e.g. AIC, BIC
❀ computationally infeasible if p is large (2^p sub-models)
◮ ‖β^0‖_1 = Σ_{j=1}^p |β^0_j|
❀ penalize with the ‖·‖_1-norm, i.e. Lasso: argmin_β ( n^{-1}‖Y − Xβ‖_2^2 + λ‖β‖_1 )
❀ convex optimization: computationally feasible and very fast for large p
as we will see: regularization with the ℓ_1-norm is good ("nearly optimal") even if the truth is sparse w.r.t. ‖β^0‖_0^0
❀ can exploit the computational advantage of ‖·‖_1-norm regularization even if the problem has a different sparsity structure
The Lasso (Tibshirani, 1996)
Lasso for linear models:
β̂(λ) = argmin_β ( n^{-1}‖Y − Xβ‖_2^2 + λ‖β‖_1 ), λ ≥ 0, ‖β‖_1 = Σ_{j=1}^p |β_j|
❀ convex optimization problem
◮ Lasso does variable selection: some of the β̂_j(λ) = 0 (because of the "ℓ_1-geometry")
◮ β̂(λ) is a shrunken LS-estimate
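A minimal sketch of fitting the Lasso on the simulated data from the earlier sketch, using scikit-learn (an assumed tool, not part of the slides); scikit-learn's objective is (2n)^{-1}‖Y − Xβ‖_2^2 + α‖β‖_1, so its α corresponds to λ/2 in the notation used here.

# Continues the simulation sketch above (X, Y, np already defined).
# scikit-learn's Lasso minimizes (1/(2n))||Y - Xb||^2 + alpha*||b||_1,
# i.e. alpha corresponds to lambda/2 in the slides' parametrization.
from sklearn.linear_model import Lasso

lam = 0.2                                               # hypothetical penalty parameter lambda
fit = Lasso(alpha=lam / 2, fit_intercept=False).fit(X, Y)
beta_hat = fit.coef_
print("selected variables:", np.flatnonzero(beta_hat))  # variable selection: many exact zeros
print("shrinkage: max |beta_hat| =", np.abs(beta_hat).max(), "(true non-zero value is 1.0)")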
more about the "ℓ_1-geometry"
equivalence to the primal problem:
β̂_primal(R) = argmin_{β: ‖β‖_1 ≤ R} ‖Y − Xβ‖_2^2 / n,
with a correspondence between λ and R which depends on the data (X_1, Y_1), …, (X_n, Y_n)
[such an equivalence holds since
◮ ‖Y − Xβ‖_2^2 / n is convex in β
◮ the constraint ‖β‖_1 ≤ R is convex
see e.g. Bertsekas (1995)]
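One direction of this equivalence can be written out in a few lines (a standard convex argument, sketched here rather than quoted from the slides): the penalized solution also solves the constrained problem with the data-dependent radius R = ‖β̂(λ)‖_1.

% Sketch: if \hat\beta(\lambda) minimizes n^{-1}\|Y-X\beta\|_2^2 + \lambda\|\beta\|_1,
% set R := \|\hat\beta(\lambda)\|_1. Then for every \beta with \|\beta\|_1 \le R,
\begin{align*}
n^{-1}\|Y - X\hat\beta(\lambda)\|_2^2
  &\le n^{-1}\|Y - X\beta\|_2^2
     + \lambda\bigl(\|\beta\|_1 - \|\hat\beta(\lambda)\|_1\bigr) \\
  &\le n^{-1}\|Y - X\beta\|_2^2 ,
\end{align*}
% so \hat\beta(\lambda) also minimizes the constrained (primal) problem with this R.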
p = 2; left: the ℓ_1-"world"
[Figure: contour lines of the residual sum of squares and the ℓ_1-ball in p = 2]
the residual sum of squares reaches a minimal value (for certain constellations of the data) if its contour lines hit the ℓ_1-ball in its corner ❀ β̂_1 = 0
the ℓ_2-"world" is different
Ridge regression:
β̂_Ridge(λ) = argmin_β ( ‖Y − Xβ‖_2^2 / n + λ‖β‖_2^2 ),
equivalent primal solution:
β̂_Ridge;primal(R) = argmin_{β: ‖β‖_2 ≤ R} ‖Y − Xβ‖_2^2 / n,
with a one-to-one correspondence between λ and R
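To contrast the two geometries numerically, a short continuation of the earlier sketch (scikit-learn again assumed; its Ridge objective is ‖Y − Xβ‖_2^2 + α‖β‖_2^2, so α corresponds to nλ in this parametrization):

# Continues the sketch above (X, Y, beta_hat, np available).
# Ridge shrinks coefficients but essentially never sets them exactly to zero.
from sklearn.linear_model import Ridge

ridge = Ridge(alpha=1.0, fit_intercept=False).fit(X, Y)             # hypothetical penalty
print("Ridge coefficients exactly 0:", np.sum(ridge.coef_ == 0))    # typically 0
print("Lasso coefficients exactly 0:", np.sum(beta_hat == 0))       # typically most of the p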
A note on the Bayesian approach
model: β_1, …, β_p i.i.d. ∼ p(β) dβ; given β: Y ∼ N_n(Xβ, σ^2 I_n) with density f(y | β, σ^2)
posterior density:
p(β | Y, σ^2) = f(Y | β, σ^2) p(β) / ∫ f(Y | β, σ^2) p(β) dβ ∝ f(Y | β, σ^2) p(β)
and hence for the MAP (Maximum A-Posteriori) estimator:
β̂_MAP = argmax_β p(β | Y, σ^2) = argmin_β ( −log f(Y | β, σ^2) − log p(β) )
       = argmin_β ( (2σ^2)^{-1} ‖Y − Xβ‖_2^2 − Σ_{j=1}^p log p(β_j) )
examples:
1. Double-Exponential prior DExp(τ): p(β) = (τ/2) exp(−τ|β|)
❀ β̂_MAP equals the Lasso with penalty parameter λ = 2σ^2 τ / n
2. Gaussian prior N(0, τ^2): p(β) = (1/(√(2π) τ)) exp(−β^2/(2τ^2))
❀ β̂_MAP equals the Ridge estimator with penalty parameter λ = σ^2/(n τ^2)
but we will argue that the Lasso (i.e., the MAP estimator for the Double-Exponential prior) is also good if the truth is sparse with respect to ‖β^0‖_0^0, e.g. if the prior is (much) more spiky around zero than the Double-Exponential distribution
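The claimed value of λ in example 1 follows by plugging the Double-Exponential prior into the MAP criterion above (a short worked step; notation as on the previous slide):

\begin{align*}
\hat\beta_{\mathrm{MAP}}
 &= \operatorname*{argmin}_{\beta}\Bigl\{\tfrac{1}{2\sigma^2}\|Y-X\beta\|_2^2
      - \sum_{j=1}^{p}\log p(\beta_j)\Bigr\}
  = \operatorname*{argmin}_{\beta}\Bigl\{\tfrac{1}{2\sigma^2}\|Y-X\beta\|_2^2
      + \tau\|\beta\|_1\Bigr\} \\
 &= \operatorname*{argmin}_{\beta}\Bigl\{ n^{-1}\|Y-X\beta\|_2^2
      + \tfrac{2\sigma^2\tau}{n}\|\beta\|_1\Bigr\},
 \qquad\text{i.e. the Lasso with } \lambda = \tfrac{2\sigma^2\tau}{n}.
\end{align*}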
Orthonormal design
Y = Xβ + ε, n^{-1} X^T X = I
Lasso = soft-thresholding estimator:
β̂_j(λ) = sign(Z_j)(|Z_j| − λ/2)_+, with Z_j = (n^{-1} X^T Y)_j the OLS estimate,
i.e. β̂_j(λ) = g_soft(Z_j) [Exercise]
[Figure: threshold functions g(z) vs. z — Adaptive Lasso, Hard-thresholding, Soft-thresholding]
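A small numerical sketch of this closed form (the orthonormal design is constructed artificially; scikit-learn is used only to cross-check, with its α = λ/2 convention):

# Sketch: soft-thresholding as the Lasso solution under an orthonormal design.
import numpy as np
from sklearn.linear_model import Lasso

def soft_threshold(z, lam):
    # g_soft(z) = sign(z) * (|z| - lam/2)_+   (slides' parametrization)
    return np.sign(z) * np.maximum(np.abs(z) - lam / 2.0, 0.0)

rng = np.random.default_rng(1)
n, p, lam = 200, 20, 0.5                           # hypothetical sizes and penalty
Q, _ = np.linalg.qr(rng.standard_normal((n, p)))   # columns of Q are orthonormal
Xo = np.sqrt(n) * Q                                # so that n^{-1} Xo^T Xo = I_p
Yo = Xo @ np.r_[np.ones(3), np.zeros(p - 3)] + rng.standard_normal(n)

Z = Xo.T @ Yo / n                                  # OLS estimate under orthonormal design
beta_soft = soft_threshold(Z, lam)
beta_lasso = Lasso(alpha=lam / 2, fit_intercept=False).fit(Xo, Yo).coef_
print(np.max(np.abs(beta_soft - beta_lasso)))      # ~0 up to the solver's tolerance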
Prediction
goal: predict a new observation Y_new
consider the expected (w.r.t. new data, and random X) squared error loss:
E_{X_new, Y_new}[(Y_new − X_new β̂)^2] = σ^2 + E_{X_new}[(X_new(β^0 − β̂))^2]
  = σ^2 + (β̂ − β^0)^T Σ (β̂ − β^0), where Σ = Cov(X)
❀ terminology "prediction error":
for random design X: (β̂ − β^0)^T Σ (β̂ − β^0) = E_{X_new}[(X_new(β̂ − β^0))^2]
for fixed design X: (β̂ − β^0)^T Σ̂ (β̂ − β^0) = ‖X(β̂ − β^0)‖_2^2 / n, with Σ̂ = n^{-1} X^T X
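Continuing the simulated example, the fixed-design prediction error from the last display can be computed directly (a sanity-check sketch, not part of the slides):

# Continues the earlier sketch (X, beta0, beta_hat, n, np available).
pred_err = np.sum((X @ (beta_hat - beta0)) ** 2) / n   # ||X(beta_hat - beta0)||_2^2 / n
noise_var = 0.5 ** 2                                   # sigma^2 used in the simulation
print("fixed-design prediction error:", pred_err, " (noise variance:", noise_var, ")")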
binary lymph node classification using gene expressions: a high-noise problem
n = 49 samples, p = 7130 gene expressions
despite being a classification problem: P[Y = 1 | X = x] = E[Y | X = x], so estimate p̂(x) via a linear model and then classify
❀ cross-validated misclassification error (2/3 training; 1/3 test):

method   Lasso    L_2 Boosting   FPLR     Pelora   1-NN     DLDA     SVM
error    21.1%    17.7%          35.25%   27.8%    43.25%   36.12%   36.88%

1-NN, DLDA, SVM: with variable selection using the best 200 genes (Wilcoxon test); Lasso, L_2 Boosting: no additional variable selection
Lasso selected on CV-average 13.12 out of p = 7130 genes
from a practical perspective: if you trust in cross-validation, you can validate how good the prediction is, i.e. prediction may be a black box, but we can evaluate it!
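A minimal sketch of the kind of cross-validated evaluation described above, on synthetic binary data (not the lymph-node dataset; split sizes and penalty are assumed): fit a linear model to the 0/1 response with the Lasso and classify by thresholding p̂(x) at 1/2.

# Sketch on synthetic data: linear-model-based classification, evaluated by
# repeated 2/3-training / 1/3-test splits (cross-validated misclassification).
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import ShuffleSplit

rng = np.random.default_rng(2)
n, p = 60, 1000                                                     # hypothetical sizes
Xb = rng.standard_normal((n, p))
yb = (Xb[:, 0] + 0.5 * rng.standard_normal(n) > 0).astype(float)    # binary response

errs = []
for tr, te in ShuffleSplit(n_splits=20, test_size=1 / 3, random_state=0).split(Xb):
    phat = Lasso(alpha=0.05).fit(Xb[tr], yb[tr]).predict(Xb[te])    # \hat p(x) via linear model
    errs.append(np.mean((phat > 0.5).astype(float) != yb[te]))      # classify at threshold 1/2
print("cross-validated misclassification error:", np.mean(errs))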
and in fact: we will hear that ◮ Lasso is consistent for prediction assuming “essentially nothing” ◮ Lasso is optimal for prediction assuming the “compatibility condition” for X
Estimation of regression coefficients
Y = Xβ^0 + ε, p ≫ n, with fixed (deterministic) design X
problem of identifiability: for p > n, Xβ^0 = Xθ for any θ = β^0 + ξ with ξ in the null-space of X
❀ cannot say anything about ‖β̂ − β^0‖ without further assumptions!
❀ we will work with the compatibility assumption (see later)
and we will explain: under the compatibility condition,
‖β̂ − β^0‖_1 ≤ C (s_0/φ_0^2) √(log(p)/n), where s_0 = |supp(β^0)| = |{j; β^0_j ≠ 0}|
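As a rough illustration of the ℓ_1-bound (constants and φ_0 ignored), continuing the simulated example:

# Continues the earlier sketch (beta_hat, beta0, s0, p, n, np available).
l1_err = np.sum(np.abs(beta_hat - beta0))
rate = s0 * np.sqrt(np.log(p) / n)                 # s0 * sqrt(log(p)/n), constants dropped
print(f"||beta_hat - beta0||_1 = {l1_err:.3f},  s0*sqrt(log(p)/n) = {rate:.3f}")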
=⇒ let's turn to the blackboard!
various conditions and their relations (van de Geer & PB, 2009)
[Figure: diagram of implications among design conditions leading to oracle inequalities for prediction and estimation — RIP and weak (S,2s)-RIP, adaptive (S,2s)- and (S,s)-restricted regression, (S,2s)- and (S,s)-restricted eigenvalue, S-compatibility, coherence, (S,2s)-, weak (S,2s)- and (S,s)-uniform irrepresentable conditions, and the variable-selection statements |Ŝ∖S| ≤ s, |Ŝ∖S| = 0, Ŝ = S]