Insights and algorithms for the multivariate square-root lasso
Aaron J. Molstad
Department of Statistics and Genetics Institute, University of Florida
Statistical Learning Seminar, June 12, 2020
Outline of the talk
1. Multivariate response linear regression
2. Considerations in high-dimensional settings
3. The multivariate square-root lasso
   ◮ Motivation/interpretation
   ◮ Theoretical tuning
   ◮ Computation
   ◮ Simulation studies
   ◮ Genomic data example
Multivariate response linear regression model

The multivariate response linear regression model assumes the measured response for the i-th subject, y_i ∈ R^q, is a realization of the random vector

    Y_i = β'x_i + ε_i,   (i = 1, ..., n),

where
◮ x_i ∈ R^p is the p-variate predictor for the i-th subject,
◮ β ∈ R^{p×q} is the unknown regression coefficient matrix,
◮ ε_i ∈ R^q are iid random vectors with mean zero and covariance Σ ≡ Ω^{-1} ∈ S_+^q.

Let the observed data be organized into
◮ Y = (y_1, ..., y_n)' ∈ R^{n×q},   X = (x_1, ..., x_n)' ∈ R^{n×p}.
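A minimal numpy sketch of this generative model (the sizes n, p, q, the error covariance Σ, and the seed are illustrative choices of mine, not values from the talk):

import numpy as np

rng = np.random.default_rng(0)
n, p, q = 50, 10, 4                       # illustrative sizes
X = rng.standard_normal((n, p))           # rows are the predictors x_i'
beta = rng.standard_normal((p, q))        # true coefficient matrix
A = rng.standard_normal((q, q))
Sigma = A @ A.T + np.eye(q)               # error covariance Sigma = Omega^{-1}
L = np.linalg.cholesky(Sigma)
E = rng.standard_normal((n, q)) @ L.T     # rows are errors eps_i ~ N(0, Sigma)
Y = X @ beta + E                          # observed responses, rows are y_i'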
Multivariate response linear regression model

The most natural estimator when n > p is the least-squares estimator (i.e., minimizing the squared Frobenius norm):

    β̂_OLS = argmin_{β ∈ R^{p×q}} ‖Y − Xβ‖_F²,   where ‖A‖_F² = tr(A'A) = Σ_{i,j} A_{i,j}².

Setting the gradient to zero,

    X'X β̂_OLS − X'Y = 0   ⟹   β̂_OLS = (X'X)^{-1} X'Y

⟹ the same estimator we would get if we performed q separate least-squares regressions.
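A quick numerical check (a sketch reusing the simulated X and Y above) that the matrix least-squares solution coincides with q separate univariate least-squares fits:

# multivariate least squares: beta_ols = (X'X)^{-1} X'Y
beta_ols = np.linalg.solve(X.T @ X, X.T @ Y)

# q separate least-squares regressions, one response column at a time
beta_sep = np.column_stack(
    [np.linalg.lstsq(X, Y[:, k], rcond=None)[0] for k in range(q)]
)
print(np.allclose(beta_ols, beta_sep))    # True: same estimator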
Multivariate response linear regression model

If we assume the errors are multivariate normal, then the maximum likelihood estimator is

    argmin_{β ∈ R^{p×q}, Ω ∈ S_+^q} { tr( n^{-1} (Y − Xβ) Ω (Y − Xβ)' ) − log det(Ω) }.

◮ Equivalent to least squares only if Ω ∝ I_q is known and fixed.

However, the first-order optimality conditions for β are

    X'X β̂_MLE Ω − X'Y Ω = 0,

which implies β̂_MLE = (X'X)^{-1} X'Y = β̂_OLS.

Intuitive...? What about the errors?
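A small numerical check (my illustration, continuing the sketch above) that the least-squares fit satisfies the first-order condition X'X β Ω − X'Y Ω = 0 for any fixed positive definite Ω, so the normal MLE of β coincides with OLS:

Omega = np.linalg.inv(Sigma)              # pretend the true precision were known
# gradient of tr(n^{-1}(Y - X beta) Omega (Y - X beta)') in beta,
# up to a factor of 2/n, evaluated at the least-squares solution
grad = X.T @ X @ beta_ols @ Omega - X.T @ Y @ Omega
print(np.allclose(grad, 0))               # True: OLS solves the score equation for any Omega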
When n < p, β̂_OLS is non-unique, so we may want to apply some type of shrinkage/regularization, or impose some type of parsimonious parametric restriction.
Estimating β in high dimensions

When p and q are large, one way to estimate β is to minimize some loss plus a penalty:

    argmin_{β ∈ R^{p×q}, θ ∈ Θ} { ℓ(β, θ) + λ P(β) },

where the choice of penalty depends on assumptions about β:
◮ Elementwise sparse: P(β) = Σ_{j,k} |β_{j,k}| (Tibshirani, 1996)
◮ Row-wise sparse: P(β) = Σ_{j=1}^p ‖β_{j,·}‖_2 (Yuan and Lin, 2007; Obozinski et al., 2011)
◮ “Bi-level” sparse: P(β) = α Σ_{j,k} |β_{j,k}| + (1 − α) Σ_{j=1}^p ‖β_{j,·}‖_2 (Peng et al., 2012; Simon et al., 2013)
◮ Low-rank: P(β) = ‖β‖_* = Σ_{j=1}^{min(p,q)} φ_j(β) or P(β) = rank(β) (Yuan et al., 2007; Bunea et al., 2011; Chen et al., 2013)
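Each of these penalties has a simple proximal operator, the main ingredient in proximal-gradient algorithms for such penalized problems. A minimal numpy sketch (my own illustration, not code from the talk):

def prox_elementwise(B, t):
    # elementwise soft-thresholding: prox of t * sum_{j,k} |B_{j,k}|
    return np.sign(B) * np.maximum(np.abs(B) - t, 0.0)

def prox_rowwise(B, t):
    # row-wise group soft-thresholding: prox of t * sum_j ||B_{j,.}||_2
    norms = np.linalg.norm(B, axis=1, keepdims=True)
    return np.maximum(1.0 - t / np.maximum(norms, 1e-12), 0.0) * B

def prox_nuclear(B, t):
    # singular-value soft-thresholding: prox of t * ||B||_*
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    return U @ np.diag(np.maximum(s - t, 0.0)) @ Vt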
High-dimensional maximum likelihood

The penalized normal maximum likelihood estimator with Ω known is

    β̂_P ∈ argmin_{β ∈ R^{p×q}} { tr( n^{-1} (Y − Xβ) Ω (Y − Xβ)' ) + λ P(β) }.

Then, the first-order optimality conditions are

    X'X β̂_P Ω − X'Y Ω + λ ∂P(β̂_P) ∋ 0,

where ∂P(β̂_P) is the subgradient of P evaluated at β̂_P.

⟹ β̂_P, the shrinkage estimator, depends on Ω.
Equivalent to penalized least squares if Ω ∝ I_q.
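To make the dependence on Ω concrete, here is a minimal proximal-gradient sketch for the elementwise-ℓ1 version of this problem with Ω treated as known (the step size, λ, and iteration count are arbitrary illustrative choices; prox_elementwise is defined in the sketch above):

lam = 0.1
# gradient of tr(n^{-1}(Y - XB) Omega (Y - XB)') is (2/n)(X'X B Omega - X'Y Omega),
# with Lipschitz constant (2/n) ||X||_2^2 ||Omega||_2; use step = 1/L
step = n / (2.0 * np.linalg.norm(X, 2) ** 2 * np.linalg.norm(Omega, 2))
B = np.zeros((p, q))
for _ in range(500):
    grad = (2.0 / n) * (X.T @ X @ B @ Omega - X.T @ Y @ Omega)
    B = prox_elementwise(B - step * grad, step * lam)
# rerunning with a different Omega (e.g., np.eye(q)) yields a different B:
# unlike OLS, the penalized estimator depends on the error precision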
Can we ignore the error covariance in these high-dimensional settings? No! But of course, Ω is unknown in practice.
Penalized normal maximum likelihood

When Ω is unknown and the ε_i are normally distributed, the (doubly) penalized maximum likelihood estimator is

    argmin_{β ∈ R^{p×q}, Ω ∈ S_+^q} { F(β, Ω) + λ_β P_β(β) + λ_Ω P_Ω(Ω) },

where

    F(β, Ω) = tr( n^{-1} (Y − Xβ) Ω (Y − Xβ)' ) − log det(Ω).

◮ Rothman et al. (2010) and Yin and Li (2011) use ℓ1-penalties on both β and Ω.
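A hedged sketch of the alternating (blockwise) minimization such estimators use: update β with Ω fixed, then Ω with β fixed. The β step reuses the proximal-gradient pass above; for the Ω step, Rothman et al. (2010) and Yin and Li (2011) solve an ℓ1-penalized precision (graphical-lasso-type) problem, which I replace here with a simple ridge-stabilized inverse of the residual covariance just to keep the sketch self-contained:

def update_beta(Y, X, Omega, B, lam, n_steps=200):
    # proximal-gradient pass for beta with Omega held fixed
    n = Y.shape[0]
    step = n / (2.0 * np.linalg.norm(X, 2) ** 2 * np.linalg.norm(Omega, 2))
    for _ in range(n_steps):
        grad = (2.0 / n) * (X.T @ X @ B @ Omega - X.T @ Y @ Omega)
        B = prox_elementwise(B - step * grad, step * lam)
    return B

def update_omega(Y, X, B, ridge=1e-2):
    # placeholder precision update: inverse of the ridge-stabilized residual
    # covariance; the cited papers use an l1-penalized estimate here instead
    n, q = Y.shape
    R = Y - X @ B
    return np.linalg.inv(R.T @ R / n + ridge * np.eye(q))

B_hat, Omega_hat = np.zeros((p, q)), np.eye(q)
for _ in range(10):                       # alternate a few times
    B_hat = update_beta(Y, X, Omega_hat, B_hat, lam=0.1)
    Omega_hat = update_omega(Y, X, B_hat)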