High-Dimensional Multivariate Bayesian Linear Regression with Shrinkage Priors

Ray Bai
Department of Statistics, University of Florida
Joint work with Dr. Malay Ghosh

March 20, 2018
Overview

1. Overview of High-Dimensional Multivariate Linear Regression
2. Multivariate Bayesian Model with Shrinkage Priors (MBSP)
3. Posterior Consistency of MBSP
   - Low-Dimensional Case
   - Ultrahigh-Dimensional Case
4. Implementation of the MBSP Model
5. Simulation Study
6. Yeast Cell Cycle Data Analysis
Simultaneous Prediction and Estimation

There are many scenarios where we would want to simultaneously predict q continuous response variables $y_1, \ldots, y_q$:

- Longitudinal data: the q response variables represent measurements at q consecutive time points.
  - mRNA levels at different time points
  - children's heights at different ages of development
  - CD4 cell counts over time for HIV/AIDS patients
- The data have a group structure: the q response variables represent a "group." In genomics, genes within the same pathway often act together in regulating a biological system.
Multivariate Linear Regression

Consider the multivariate linear regression model,
$$Y = XB + E,$$
where $Y = (y_1, \ldots, y_q)$ is an $n \times q$ response matrix of n samples and q response variables, $X$ is an $n \times p$ matrix of n samples and p covariates, $B \in \mathbb{R}^{p \times q}$ is the coefficient matrix, and $E = (\varepsilon_1, \ldots, \varepsilon_n)^T$ is an $n \times q$ noise matrix, where $\varepsilon_i \overset{i.i.d.}{\sim} N_q(0, \Sigma)$, $i = 1, \ldots, n$.

Throughout, we assume that X is centered, so there is no intercept term.
Multivariate Linear Regression

For the multivariate linear regression model,
$$Y_{n \times q} = X_{n \times p} B_{p \times q} + E_{n \times q},$$
where $E = (\varepsilon_1, \ldots, \varepsilon_n)^T$ and $\varepsilon_i \overset{i.i.d.}{\sim} N_q(0, \Sigma)$, $i = 1, \ldots, n$, the matrix $\Sigma$ represents the covariance structure of the q response variables.

- We wish to estimate the coefficient matrix B.
- Model selection from the p covariates is also often desired. This can be done using multivariate generalizations of AIC, BIC, or Mallows' $C_p$.
Multivariate Linear Regression

For the multivariate linear regression model, the usual maximum likelihood estimator (MLE) is the ordinary least squares estimator,
$$\widehat{B} = (X^T X)^{-1} X^T Y.$$

- The MLE is only unique if $p \leq n$.
- It is well-known that the MLE is an inconsistent estimator of B if $p/n \to c$, $c > 0$.
- Variable selection using AIC, BIC, and Mallows' $C_p$ is infeasible for large p, since it requires searching over a model space of $2^p$ models.
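As a quick numerical illustration (a minimal sketch with simulated data; the dimensions and variable names are ours, not from the slides), the OLS estimator can be computed directly:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 100, 5, 3          # low-dimensional case: p <= n, so X^T X is invertible
X = rng.standard_normal((n, p))
B_true = rng.standard_normal((p, q))
Y = X @ B_true + rng.standard_normal((n, q))

# MLE / OLS: B_hat = (X^T X)^{-1} X^T Y, computed via a linear solve
# rather than an explicit matrix inverse for numerical stability.
B_hat = np.linalg.solve(X.T @ X, X.T @ Y)
print(np.linalg.norm(B_hat - B_true))  # estimation error in Frobenius norm
```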
High-Dimensional Multivariate Linear Regression

To handle cases where p is large (including the p > n regime), frequentists typically use penalized regression (e.g., Li et al. (2015), Vincent and Hansen (2014), Wilms and Croux (2017)):
$$\min_B \; \|Y - XB\|_2^2 + \lambda \sum_{i=1}^{p} \|b_i\|_2,$$
where $b_i$ represents the i-th row of B and $\lambda > 0$ is a tuning parameter.

- The group lasso penalty, $\|\cdot\|_2$, shrinks entire rows of B to exactly 0, leading to a sparse estimate of B and facilitating variable selection from the p covariates.
- We can use an adaptive group lasso penalty to avoid overshrinkage of $b_i$, $i = 1, \ldots, p$. A frequentist baseline is sketched below.
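For a concrete baseline, scikit-learn's MultiTaskLasso implements exactly this row-wise $\ell_2$ penalty (an $\ell_{2,1}$ norm on B); the sketch below uses simulated data and an arbitrary regularization level, both of which are our choices rather than anything from the slides:

```python
import numpy as np
from sklearn.linear_model import MultiTaskLasso

rng = np.random.default_rng(1)
n, p, q = 60, 100, 4                      # p > n regime
X = rng.standard_normal((n, p))
B_true = np.zeros((p, q))
B_true[:5] = rng.standard_normal((5, q))  # only the first 5 rows are nonzero
Y = X @ B_true + 0.5 * rng.standard_normal((n, q))

# MultiTaskLasso penalizes the l2 norm of each row of the coefficient
# matrix, shrinking entire rows to exactly zero (a row-sparse estimate).
fit = MultiTaskLasso(alpha=0.1).fit(X, Y)
B_hat = fit.coef_.T                       # sklearn stores coefficients as (q, p)
print("selected rows:", np.flatnonzero(np.linalg.norm(B_hat, axis=1) > 0))
```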
Bayesian High-Dimensional Multivariate Linear Regression

The Bayesian approach is to put a prior distribution on B, $\pi(B)$. That is, given the model $Y = XB + E$ and data $(X, Y)$, we have
$$\pi(B \mid Y) \propto f(Y \mid X, B)\, \pi(B).$$
Inference can be conducted through the posterior, $\pi(B \mid Y)$.
Bayesian High-Dimensional Multivariate Linear Regression

To achieve sparsity and variable selection, a common approach is to place spike-and-slab priors on the rows of B (e.g., Brown et al. (1998), Liquet et al. (2017)):
$$b_i^T \overset{i.i.d.}{\sim} (1 - p)\, \delta_{\{0\}} + p\, N_q(0, \tau^2 V), \quad i = 1, \ldots, p.$$

- $\delta_{\{0\}}$ represents a point mass at $0 \in \mathbb{R}^q$, and V is a $q \times q$ symmetric positive definite matrix.
- $\tau^2$ can be treated as a tuning parameter, or a prior can be placed on $\tau^2$.
- A prior can also be placed on p (the mixing weight) so that the model adapts to the underlying sparsity. Usually, we put a Beta prior on p.
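To make the prior concrete, here is a minimal sketch of drawing B from this spike-and-slab prior (the inclusion probability, $\tau^2$, and V below are illustrative choices of ours):

```python
import numpy as np

rng = np.random.default_rng(2)
p, q = 50, 3
prob, tau2 = 0.1, 1.0        # slab inclusion probability and slab scale
V = np.eye(q)                # slab covariance (q x q positive definite)

# Each row of B is 0 with probability 1 - prob, and N_q(0, tau2 * V) otherwise.
include = rng.random(p) < prob
L = np.linalg.cholesky(tau2 * V)
B = np.where(include[:, None], rng.standard_normal((p, q)) @ L.T, 0.0)
print("nonzero rows:", include.sum())
```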
Bayesian High-Dimensional Multivariate Linear Regression

For the spike-and-slab approach,
$$b_i^T \overset{i.i.d.}{\sim} (1 - p)\, \delta_{\{0\}} + p\, N_q(0, \tau^2 V), \quad i = 1, \ldots, p,$$
$$\tau^2 \sim \mu(\tau^2), \quad p \sim \mathcal{B}(a, b),$$
taking the posterior median will give a point estimate of B with some rows equal to $0^T$, thus recovering a sparse estimate of B and facilitating variable selection.

Due to the point mass at 0, posterior computation under this model can be very slow for large p.
Bayesian High-Dimensional Multivariate Linear Regression

Due to the computational inefficiency of discontinuous priors, it is often desirable to put a continuous prior on the parameters of interest.

For the multivariate linear regression model, $Y = XB + E$, our aim is to estimate B. This requires putting a prior density on a $p \times q$ matrix. A popular continuous prior to place on B is the matrix-normal prior.
The Matrix-Normal Prior

Definition. A random matrix X is said to have the matrix-normal density if X has the density function (on the space $\mathbb{R}^{a \times b}$):
$$f(X) = \frac{|U|^{-b/2}\, |V|^{-a/2}}{(2\pi)^{ab/2}} \exp\left( -\tfrac{1}{2} \mathrm{tr}\left[ U^{-1} (X - M) V^{-1} (X - M)^T \right] \right),$$
where $M \in \mathbb{R}^{a \times b}$, and U and V are positive definite matrices of dimension $a \times a$ and $b \times b$, respectively.

If X is distributed as a matrix-normal distribution with the pdf above, we write $X \sim \mathcal{MN}_{a \times b}(M, U, V)$.
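Sampling from this distribution is straightforward via the representation $X = M + A Z B^T$ with $U = AA^T$, $V = BB^T$, and Z a matrix of iid N(0, 1) entries; below is a minimal sketch (our own helper, assuming U and V are positive definite):

```python
import numpy as np

def rmatnorm(M, U, V, rng):
    """Draw one sample from MN(M, U, V) via the affine representation
    X = M + A Z B^T, where U = A A^T, V = B B^T, and Z has iid N(0,1) entries."""
    A = np.linalg.cholesky(U)    # requires U positive definite
    Bc = np.linalg.cholesky(V)   # requires V positive definite
    Z = rng.standard_normal(M.shape)
    return M + A @ Z @ Bc.T

rng = np.random.default_rng(3)
a, b = 4, 2
M = np.zeros((a, b))
U, V = np.eye(a), np.array([[1.0, 0.5], [0.5, 1.0]])
X = rmatnorm(M, U, V, rng)   # one draw from MN_{4x2}(M, U, V)
```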
Multivariate Bayesian Model with Shrinkage Priors (MBSP)

By adding an additional layer in the Bayesian hierarchy, we can obtain a row-sparse estimate of B. This row-sparse estimate also facilitates variable selection from the p variables. Our model is specified as follows:
$$Y \mid X, B, \Sigma \sim \mathcal{MN}_{n \times q}(XB, I_n, \Sigma),$$
$$B \mid \xi_1, \ldots, \xi_p, \Sigma \sim \mathcal{MN}_{p \times q}(O, \tau\, \mathrm{diag}(\xi_1, \ldots, \xi_p), \Sigma),$$
$$\xi_i \overset{ind}{\sim} \pi(\xi_i), \quad i = 1, \ldots, p,$$
where $\tau > 0$ is a tuning parameter, and $\pi(\xi_i)$ is a polynomial-tailed prior density of the form
$$\pi(\xi_i) = K (\xi_i)^{-a-1} L(\xi_i),$$
where $K > 0$ is the constant of proportionality, a is a positive real number, and L is a positive, measurable, non-constant, slowly varying function over $(0, \infty)$.
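As an illustration, here is a minimal sketch of simulating B from the MBSP prior, using horseshoe-type local scales ($\xi_i = \lambda_i^2$ with $\lambda_i$ half-Cauchy) as one concrete choice of $\pi(\xi_i)$; $\tau$ and the dimensions are arbitrary choices of ours:

```python
import numpy as np

rng = np.random.default_rng(4)
p, q, tau = 20, 3, 0.01
Sigma = np.eye(q)

# Horseshoe-type local scales: xi_i = lambda_i^2, lambda_i ~ half-Cauchy(0, 1).
lam = np.abs(rng.standard_cauchy(p))
xi = lam ** 2

# Row i of B | xi, Sigma ~ N_q(0, tau * xi_i * Sigma), which is equivalent to
# B ~ MN(O, tau * diag(xi_1, ..., xi_p), Sigma).
L = np.linalg.cholesky(Sigma)
B = np.sqrt(tau * xi)[:, None] * (rng.standard_normal((p, q)) @ L.T)
```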
Examples of Polynomial-Tailed Priors

| Prior       | $\pi(\xi_i)/C$ | $L(\xi_i)$ |
|-------------|----------------|------------|
| Student's t | $\xi_i^{-a-1} \exp(-a/\xi_i)$ | $\exp(-a/\xi_i)$ |
| Horseshoe   | $\xi_i^{-1/2} (1+\xi_i)^{-1}$ | $\xi_i^{a}/(1+\xi_i)$ |
| Horseshoe+  | $\xi_i^{-1/2} (\xi_i - 1)^{-1} \log(\xi_i)$ | $\xi_i^{a} (\xi_i - 1)^{-1} \log(\xi_i)$ |
| NEG         | $(1+\xi_i)^{-1-a}$ | $\{\xi_i/(1+\xi_i)\}^{a+1}$ |
| TPBN        | $\xi_i^{u-1} (1+\xi_i)^{-a-u}$ | $\{\xi_i/(1+\xi_i)\}^{a+u}$ |
| GDP         | $\int_0^\infty \frac{\lambda^2}{2} \exp(-\lambda^2 \xi_i / 2)\, \lambda^{2a-1} \exp(-\eta\lambda)\, d\lambda$ | $\int_0^\infty \frac{t^a}{2} \exp\left(-t - \eta\sqrt{2t/\xi_i}\right) dt$ |
| HIB         | $\xi_i^{u-1} (1+\xi_i)^{-(a+u)} \exp\left(-\frac{s}{1+\xi_i}\right) \left\{\phi^2 + \frac{1-\phi^2}{1+\xi_i}\right\}^{-1}$ | $\{\xi_i/(1+\xi_i)\}^{a+u} \exp\left(-\frac{s}{1+\xi_i}\right) \left\{\phi^2 + \frac{1-\phi^2}{1+\xi_i}\right\}^{-1}$ |

Table: Polynomial-tailed priors, their respective prior densities $\pi(\xi_i)$ up to a normalizing constant C, and the slowly-varying component $L(\xi_i)$.
Sparse Estimation of B: Examples

If $\xi_j \overset{ind}{\sim} \text{Inverse-Gamma}(\alpha_j, \gamma_j/2)$, then the marginal density for B, $\pi(B)$, under the MBSP model is proportional to
$$\prod_{j=1}^{p} \left( \|b_j (\tau\Sigma)^{-1/2}\|_2^2 + \gamma_j \right)^{-(\alpha_j + q/2)},$$
which corresponds to a multivariate t-distribution. Here $b_j$ denotes the j-th row of B.
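This scale-mixture fact can be sanity-checked numerically. The sketch below is a univariate (q = 1) special case with $\alpha = \nu/2$ and $\gamma/2 = \nu/2$, under which the inverse-gamma mixture of normals should match a Student's t with $\nu$ degrees of freedom:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
nu = 3.0                          # degrees of freedom of the target t-distribution

# If xi ~ Inverse-Gamma(nu/2, nu/2) and b | xi ~ N(0, xi),
# then marginally b ~ t_nu.
xi = stats.invgamma(a=nu / 2, scale=nu / 2).rvs(200_000, random_state=rng)
b = rng.standard_normal(xi.size) * np.sqrt(xi)

# Kolmogorov-Smirnov test against t_nu (a large p-value is expected).
print(stats.kstest(b, stats.t(df=nu).cdf))
```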
Sparse Estimation of B: Examples

If $\pi(\xi_j) \propto \xi_j^{q/2 - 1} (1 + \xi_j)^{-1}$, then the joint density $\pi(B, \xi_1, \ldots, \xi_p)$ under the MBSP model is proportional to
$$\prod_{j=1}^{p} \xi_j^{-1} (1 + \xi_j)^{-1} \exp\left( -\frac{\|b_j (\tau\Sigma)^{-1/2}\|_2^2}{2 \xi_j} \right),$$
and integrating out the $\xi_j$'s gives a multivariate horseshoe density function.
Notation

For any two sequences of positive real numbers $\{a_n\}$ and $\{b_n\}$ with $b_n \neq 0$:

- $a_n = O(b_n)$ if $|a_n / b_n| \leq M$ for all n, for some positive real number M independent of n.
- $a_n = o(b_n)$ if $\lim_{n \to \infty} a_n / b_n = 0$. Therefore, $a_n = o(1)$ if $\lim_{n \to \infty} a_n = 0$.
- For a vector $v \in \mathbb{R}^n$, $\|v\|_2 := \sqrt{\sum_{i=1}^{n} v_i^2}$ denotes the $\ell_2$ norm.
- For a matrix $A \in \mathbb{R}^{a \times b}$ with entries $a_{ij}$, $\|A\|_F := \sqrt{\mathrm{tr}(A^T A)} = \sqrt{\sum_{i=1}^{a} \sum_{j=1}^{b} a_{ij}^2}$ denotes the Frobenius norm of A.
- For a symmetric matrix A, we denote its minimum and maximum eigenvalues by $\lambda_{\min}(A)$ and $\lambda_{\max}(A)$, respectively.