De-biasing arbitrary convex regularizers and asymptotic normality
Pierre C. Bellec, Rutgers University
Mathematical Methods of Modern Statistics 2, June 2020
Joint work with Cun-Hui Zhang (Rutgers).
◮ Second order Poincaré inequalities and de-biasing arbitrary convex regularizers, arXiv:1912.11943
◮ De-biasing the Lasso with degrees-of-freedom adjustment, arXiv:1902.08885
High-dimensional statistics
◮ n data points (x_i, Y_i), i = 1, ..., n
◮ p covariates, x_i ∈ R^p, in regimes where p is large relative to n: p ≥ n^α, p ≥ n, p ≥ cn
For instance, the linear model Y_i = x_i^⊤ β + ε_i for unknown β.
M-estimators and regularization

    \hat\beta = \arg\min_{b \in \mathbb{R}^p} \Big\{ \frac{1}{n} \sum_{i=1}^n \ell(x_i^\top b, Y_i) + \mathrm{regularizer}(b) \Big\}

for some loss ℓ(·, ·) and regularization penalty. Typically in the linear model, with the least-squares loss,

    \hat\beta = \arg\min_{b \in \mathbb{R}^p} \big\{ \|y - Xb\|^2/(2n) + g(b) \big\}

with g convex. Examples
◮ Lasso, Elastic-Net
◮ Bridge g(b) = \sum_{j=1}^p |b_j|^c
◮ Group-Lasso
◮ Nuclear norm penalty
◮ Sorted L1 penalty (SLOPE)
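To make the setting concrete, here is a minimal sketch (not from the talk) of how such a penalized least-squares estimator can be computed for a convex g with cvxpy; the Lasso penalty g(b) = λ‖b‖₁ is used as an example, and the data and the tuning parameter lam are placeholders.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, p = 100, 50
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:5] = 1.0                      # sparse ground truth (illustrative)
y = X @ beta_true + rng.standard_normal(n)

lam = 0.1
b = cp.Variable(p)
# Objective ||y - Xb||^2 / (2n) + g(b); here g(b) = lam * ||b||_1 (Lasso),
# but any convex g expressible in cvxpy (group norms, nuclear norm, ...) works.
objective = cp.sum_squares(y - X @ b) / (2 * n) + lam * cp.norm1(b)
cp.Problem(cp.Minimize(objective)).solve()
beta_hat = b.value
```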
Different goals, different scales

    \hat\beta = \arg\min_{b \in \mathbb{R}^p} \big\{ \|y - Xb\|^2/(2n) + g(b) \big\}, \qquad g \text{ convex}

1. Design of the regularizer g with intuition about complexity, structure
   ◮ convex relaxation of unknown structure (sparsity, low-rank)
   ◮ ℓ1 balls are spiky at sparse vectors
2. Upper and lower bounds on the risk of \hat\beta: c r_n ≤ ‖\hat\beta − β‖² ≤ C r_n.
3. Characterization of the risk: ‖\hat\beta − β‖² = r_n (1 + o_P(1)) under some asymptotics, e.g., p/n → γ or s log(p/s)/n → 0.
4. Asymptotic distribution in a fixed direction a_0 ∈ R^p (resp. a_0 = e_j) and confidence interval for a_0^⊤ β (resp. β_j):

    \sqrt{n}\, a_0^\top(\hat\beta - \beta) \to^? N(0, V_0), \qquad \sqrt{n}(\hat\beta_j - \beta_j) \to^? N(0, V_j).
Focus of today: confidence intervals in the linear model based on convex regularized estimators of the form

    \hat\beta = \arg\min_{b \in \mathbb{R}^p} \big\{ \|y - Xb\|^2/(2n) + g(b) \big\}, \qquad g \text{ convex},

    \sqrt{n}(\hat\beta_j - \beta_j) \Rightarrow N(0, V_j), \qquad \beta_j \text{ the unknown parameter of interest}.
Confidence interval in the linear model
Design X with iid N(0, Σ) rows, known Σ, noise ε ∼ N(0, σ² I_n), observations y = Xβ + ε, and a given initial estimator \hat\beta.
Goal: inference for θ = a_0^⊤ β, the projection in direction a_0.
Examples:
◮ a_0 = e_j, interested in inference on the j-th coefficient β_j
◮ a_0 = x_new where x_new is the characteristics of a new patient, inference for x_new^⊤ β.
De-biasing, confidence intervals for the Lasso
Confidence interval in the linear model
Design X with iid N(0, Σ) rows, known Σ, noise ε ∼ N(0, σ² I_n), observations y = Xβ + ε, and a given initial estimator \hat\beta.
Goal: inference for θ = a_0^⊤ β, the projection in direction a_0.
Examples:
◮ a_0 = e_j, interested in inference on the j-th coefficient β_j
◮ a_0 = x_new where x_new is the characteristics of a new patient, inference for x_new^⊤ β.
De-biasing: construct an unbiased estimate in the direction a_0, i.e., find a correction such that [a_0^⊤ \hat\beta − correction] is an unbiased estimator of a_0^⊤ β.
Existing results
Lasso
◮ Zhang and Zhang (2014) (s log(p/s)/n → 0)
◮ Javanmard and Montanari (2014a); Javanmard and Montanari (2014b); Javanmard and Montanari (2018) (s log(p/s)/n → 0)
◮ Van de Geer et al. (2014) (s log(p/s)/n → 0)
◮ Bayati and Montanari (2012); Miolane and Montanari (2018) (p/n → γ)
Beyond the Lasso?
◮ Robust M-estimators: El Karoui et al. (2013); Lei, Bickel, and El Karoui (2018); Donoho and Montanari (2016) (p/n → γ)
◮ Celentano and Montanari (2019): symmetric convex penalty and (Σ = I_p, p/n → γ), using Approximate Message Passing ideas from statistical physics
◮ Logistic regression: Sur and Candès (2018) (Σ = I_p, p/n → γ)
Focus today: general theory for confidence intervals based on any convex regularized estimator of the form

    \hat\beta = \arg\min_{b \in \mathbb{R}^p} \big\{ \|y - Xb\|^2/(2n) + g(b) \big\}, \qquad g \text{ convex},

with little or no constraint on the convex regularizer g.
Degrees-of-freedom of the estimator

    \hat\beta = \arg\min_{b \in \mathbb{R}^p} \big\{ \|y - Xb\|^2/(2n) + g(b) \big\}

◮ the map y ↦ X\hat\beta, for fixed X, is 1-Lipschitz
◮ the Jacobian of y ↦ X\hat\beta exists almost everywhere (Rademacher's theorem)

    \hat{df} = \mathrm{trace}\big[ \nabla (y \mapsto X\hat\beta) \big] = \mathrm{trace}\Big[ X \frac{\partial \hat\beta(X, y)}{\partial y} \Big],

used for instance in Stein's Unbiased Risk Estimate (SURE). The Jacobian matrix

    \hat{H} = X \frac{\partial \hat\beta(X, y)}{\partial y} \in \mathbb{R}^{n \times n}

is also useful; \hat{H} is always symmetric.¹

¹ P.C.B. and C.-H. Zhang (2019), Second order Poincaré inequalities and de-biasing arbitrary convex regularizers when p/n → γ.
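As a sanity check of this definition (not part of the slides), \hat{df} and \hat{H} can be approximated by finite differences in y for any black-box solver; `solver` is a hypothetical function returning \hat\beta(X, y). For the Lasso, \hat{df} reduces to the number of nonzero coefficients, as recalled on the QQ-plot slide below.

```python
import numpy as np

def df_and_H(solver, X, y, eps=1e-4):
    """Finite-difference approximation of H = X d(beta_hat)/dy and df = trace(H).

    `solver(X, y)` is assumed to return the penalized estimate beta_hat(X, y).
    This costs n extra solver calls, so it is only a small-n sanity check.
    """
    n = len(y)
    Xb0 = X @ solver(X, y)
    H = np.empty((n, n))
    for i in range(n):
        y_pert = y.copy()
        y_pert[i] += eps
        H[:, i] = (X @ solver(X, y_pert) - Xb0) / eps   # i-th column of the Jacobian
    return np.trace(H), H

# For the Lasso, a closed form is available: df_hat = np.count_nonzero(beta_hat).
```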
Isotropic design, any g, p/n → γ (B. and Zhang, 2019)
Assumptions
◮ Sequence of linear regression problems y = Xβ + ε
◮ with n, p → +∞ and p/n → γ ∈ (0, ∞),
◮ g : R^p → R coercive convex penalty, strongly convex if γ ≥ 1,
◮ rows of X are iid N(0, I_p), and
◮ noise ε ∼ N(0, σ² I_n) is independent of X.
Isotropic design, any penalty g, p/n → γ
Theorem (B. and Zhang, 2019)

    \hat\beta = \arg\min_{b \in \mathbb{R}^p} \big\{ \|y - Xb\|^2/(2n) + g(b) \big\}

◮ β_j = ⟨e_j, β⟩ parameter of interest,
◮ \hat{H} = X (∂/∂y) \hat\beta, \hat{df} = trace \hat{H},
◮ \hat{V}(β_j) = ‖y − X\hat\beta‖² + trace[(\hat{H} − I_n)²] (\hat\beta_j − β_j)².
Then there exists a subset J_p ⊂ [p] of size at least (p − log log p) such that

    \sup_{j \in J_p} \sup_{t \in \mathbb{R}} \left| P\left( \frac{(n - \hat{df})(\hat\beta_j - \beta_j) + e_j^\top X^\top (y - X\hat\beta)}{\hat{V}(\beta_j)^{1/2}} \le t \right) - \Phi(t) \right| \to 0.
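In a simulation where β is known, the pivot in the theorem can be assembled directly. The sketch below (my own, with hypothetical inputs beta_hat, df_hat, H_hat computed as in the earlier finite-difference snippet) evaluates it for one coordinate j; over repeated draws of (X, ε) it should look approximately N(0, 1).

```python
import numpy as np

def pivot_isotropic(X, y, beta_hat, df_hat, H_hat, beta_true, j):
    """Pivot of the theorem for coordinate j, evaluated with the true beta_j
    (only possible in simulations); approximately N(0, 1) over repeated draws."""
    n = len(y)
    resid = y - X @ beta_hat
    delta = beta_hat[j] - beta_true[j]
    A = H_hat - np.eye(n)
    V_hat = resid @ resid + np.trace(A @ A) * delta ** 2
    numerator = (n - df_hat) * delta + X[:, j] @ resid
    return numerator / np.sqrt(V_hat)
```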
Correlated design, any g, p/n → γ
Assumptions
◮ Sequence of linear regression problems y = Xβ + ε
◮ with n, p → +∞ and p/n → γ ∈ (0, ∞),
◮ g : R^p → R coercive convex penalty, strongly convex if γ ≥ 1,
◮ rows of X are iid N(0, Σ), and
◮ noise ε ∼ N(0, σ² I_n) is independent of X.
Correlated design, any penalty g, p/n → γ
Theorem (B. and Zhang, 2019)

    \hat\beta = \arg\min_{b \in \mathbb{R}^p} \big\{ \|y - Xb\|^2/(2n) + g(b) \big\}

◮ θ = ⟨a_0, β⟩ parameter of interest,
◮ \hat{H} = X (∂/∂y) \hat\beta, \hat{df} = trace \hat{H},
◮ \hat{V}(θ) = ‖y − X\hat\beta‖² + trace[(\hat{H} − I_n)²] (⟨a_0, \hat\beta⟩ − θ)²,
◮ assume a_0^⊤ Σ^{-1} a_0 = 1 and set z_0 = X Σ^{-1} a_0.
Then there exists a subset S ⊂ S^{p−1} with relative volume |S|/|S^{p−1}| ≥ 1 − 2 e^{−p^{0.99}} such that

    \sup_{a_0 \in \Sigma^{1/2} S} \sup_{t \in \mathbb{R}} \left| P\left( \frac{(n - \hat{df})(\langle\hat\beta, a_0\rangle - \theta) + \langle z_0, y - X\hat\beta\rangle}{\hat{V}(\theta)^{1/2}} \le t \right) - \Phi(t) \right| \to 0.

This applies to at least (p − φ_cond(Σ) log log p) indices j ∈ [p].
Resulting 0.95 confidence interval

    \hat{CI} = \Big\{ \theta \in \mathbb{R} : \frac{\big| (n - \hat{df})(\langle\hat\beta, a_0\rangle - \theta) + \langle z_0, y - X\hat\beta\rangle \big|}{\hat{V}(\theta)^{1/2}} \le 1.96 \Big\}

Variance approximation
Typically, \hat{V}(θ) ≈ ‖y − X\hat\beta‖², and the length of the interval is then 2 · 1.96 · ‖y − X\hat\beta‖ / (n − \hat{df}):

    \hat{CI}_{\mathrm{approx}} = \Big\{ \theta \in \mathbb{R} : \frac{\big| (n - \hat{df})(\langle\hat\beta, a_0\rangle - \theta) + \langle z_0, y - X\hat\beta\rangle \big|}{\|y - X\hat\beta\|} \le 1.96 \Big\}.
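A sketch of the resulting approximate interval (my own code, not from the talk), using the reading z_0 = XΣ^{-1}a_0 with a_0^⊤ Σ^{-1} a_0 = 1 and the variance proxy ‖y − X\hat\beta‖; beta_hat and df_hat are assumed to be computed beforehand.

```python
import numpy as np

def debiased_ci(X, y, Sigma, a0, beta_hat, df_hat, z=1.96):
    """Approximate 95% CI for theta = <a0, beta>.

    Uses the score direction z0 = X Sigma^{-1} a0 (with a0 normalized so that
    a0' Sigma^{-1} a0 = 1) and the variance proxy ||y - X beta_hat||."""
    n = len(y)
    resid = y - X @ beta_hat
    z0 = X @ np.linalg.solve(Sigma, a0)
    theta_debiased = a0 @ beta_hat + z0 @ resid / (n - df_hat)
    half_width = z * np.linalg.norm(resid) / (n - df_hat)
    return theta_debiased - half_width, theta_debiased + half_width
```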
Simulations using the approximation \hat{V}(θ) ≈ ‖y − X\hat\beta‖²
n = 750, p = 500, correlated Σ. β is the vectorization of a row-sparse matrix of size 25 × 20. a_0 is a direction that leads to large initial bias.
Estimators: 7 different penalty functions:
◮ Group Lasso with tuning parameters µ_1, µ_2
◮ Lasso with tuning parameters λ_1, ..., λ_4
◮ Nuclear norm penalty
Boxplots of the initial errors √n a_0^⊤(\hat\beta − β) (biased!)
Simulations using the approximation \hat{V}(θ) ≈ ‖y − X\hat\beta‖²
n = 750, p = 500, correlated Σ. β is the vectorization of a row-sparse matrix of size 25 × 20.
Estimators: 7 different penalty functions:
◮ Group Lasso with tuning parameters µ_1, µ_2
◮ Lasso with tuning parameters λ_1, ..., λ_4
◮ Nuclear norm penalty
Boxplots of the de-biased errors √n [a_0^⊤(\hat\beta − β) + z_0^⊤(y − X\hat\beta)]
Before/after bias correction
QQ-plots, Lasso, λ_1, λ_2, λ_3, λ_4. For the Lasso, \hat{df} = |{j = 1, ..., p : \hat\beta_j ≠ 0}|.
Pivotal quantity computed using ‖y − X\hat\beta‖² instead of \hat{V}(θ) for the variance.
◮ The visible discrepancy in the last plot is fixed when using \hat{V}(θ) instead.
QQ-plots, Group Lasso, µ_1, µ_2. An explicit formula for \hat{df} is available.
QQ-plots, Nuclear norm penalty. No explicit formula for \hat{df} is available, although numerical approximations can be computed.
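One way to obtain such a numerical approximation (a sketch under my own choices, not the authors' recipe) is a randomized finite-difference estimate of the divergence trace[∂(X\hat\beta)/∂y], in the spirit of Hutchinson's trace estimator; `solver` is again a hypothetical black-box returning \hat\beta(X, y).

```python
import numpy as np

def df_monte_carlo(solver, X, y, eps=1e-3, n_probes=20, seed=0):
    """Randomized finite-difference estimate of df = trace( d(X beta_hat)/dy ),
    usable when no closed-form df is known (e.g., nuclear norm penalty)."""
    rng = np.random.default_rng(seed)
    Xb0 = X @ solver(X, y)
    estimates = []
    for _ in range(n_probes):
        u = rng.standard_normal(len(y))                  # random probe direction
        Xb_pert = X @ solver(X, y + eps * u)
        estimates.append(u @ (Xb_pert - Xb0) / eps)      # directional divergence
    return float(np.mean(estimates))
```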
Summary of the main result²
Asymptotic normality result, and valid 1 − α confidence intervals, obtained by de-biasing any convex regularized M-estimator.
◮ Asymptotics p/n → γ
◮ Gaussian design, known covariance matrix Σ
◮ Strong convexity of the penalty required if γ ≥ 1; otherwise any convex penalty is allowed.

² P.C.B. and C.-H. Zhang (2019), Second order Poincaré inequalities and de-biasing arbitrary convex regularizers when p/n → γ.
Time permitting
1. Necessity of the degrees-of-freedom adjustment
2. Central Limit Theorems and Second Order Poincaré inequalities
3. Unknown Σ