De-biasing the Lasso: Optimal Sample Size for Gaussian Designs


  1. De-biasing the Lasso: Optimal Sample Size for Gaussian Designs. Adel Javanmard, USC Marshall School of Business, Data Science and Operations Department. Based on joint work with Andrea Montanari. October 2015.

  2. An example. Kaggle challenge: identify patients diagnosed with type-2 diabetes.

  3. Statistical model. Data $(Y_1, X_1), \ldots, (Y_n, X_n)$: $Y_i \in \{0, 1\}$ indicates whether patient $i$ gets type-2 diabetes, and $X_i \in \mathbb{R}^p$ collects the features of patient $i$. We model $Y_i \sim f_{\theta_0}(\cdot \mid X_i)$ with parameter vector $\theta_0 \in \mathbb{R}^p$, so that $\theta_{0,j}$ is the contribution of feature $j$.


  6. Regularized estimator: $\hat{\theta} \equiv \arg\min_{\theta \in \mathbb{R}^p} \big\{ \underbrace{\mathcal{L}(\theta)}_{\text{logistic loss}} + \underbrace{\lambda \|\theta\|_1}_{\text{regularizer}} \big\}$. This is a convex optimization problem, and it performs variable selection.
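
As a concrete illustration, here is a minimal sketch of this estimator, assuming scikit-learn; the data below are synthetic stand-ins (the dimensions mirror the Kaggle example on the next slide, but the data themselves are simulated):

```python
# A minimal sketch of l1-regularized logistic regression on simulated data.
# Assumes scikit-learn; the data are synthetic stand-ins, not the Kaggle set.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p = 500, 805                         # dimensions mirror the example below
X = rng.standard_normal((n, p))
theta0 = np.zeros(p)
theta0[:10] = 1.0                       # ten truly relevant features
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ theta0)))

# scikit-learn uses the inverse penalty C (roughly C ~ 1 / (n * lambda)):
# smaller C means stronger l1 shrinkage and fewer selected features.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
clf.fit(X, y)
print("selected features:", int(np.sum(clf.coef_ != 0)))
```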

  7. Practice Fusion data set (Kaggle). Database: $n = 500$ patients, $p = 805$ features of medical information (medications, lab results, diagnoses, ...).

  8. [Figure: estimated coefficients $\hat{\theta}$ plotted against feature index (0 to 800); blood pressure, bilirubin, and globulin receive large positive coefficients, while HDL cholesterol and year of birth receive large negative ones.] Regularized logistic regression selects 62 features ($\lambda$ chosen via cross-validation; resulting AUC = 0.75). Shall we trust our findings?
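
The cross-validated choice of $\lambda$ can be mimicked as follows; a hedged sketch using scikit-learn's `LogisticRegressionCV` on the synthetic data from the previous snippet, not the (non-public) Practice Fusion set:

```python
# Sketch: pick the penalty by cross-validation, scored by AUC, as on the
# slide. Reuses X, y from the previous snippet; assumes scikit-learn.
from sklearn.linear_model import LogisticRegressionCV

cv_clf = LogisticRegressionCV(
    Cs=10,                  # grid of ten candidate inverse penalties
    penalty="l1",
    solver="liblinear",
    scoring="roc_auc",      # the real-data slide reports AUC = 0.75
    cv=5,
)
cv_clf.fit(X, y)
print("chosen C:", cv_clf.C_[0])
print("selected features:", int(np.sum(cv_clf.coef_ != 0)))
```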


  10. In summary. We will focus on the linear model and the Lasso, and compute confidence intervals / p-values.

  11. Outline: (1) problem definition; (2) debiasing approach; (3) hypothesis testing under nearly optimal sample size.

  12. Problem definition.

  13. Linear model. We focus on linear models: $Y = X\theta_0 + W$, with $Y \in \mathbb{R}^n$ (response), $X \in \mathbb{R}^{n \times p}$ (design matrix), and $\theta_0 \in \mathbb{R}^p$ (parameters). The noise vector $W$ has independent entries with $\mathbb{E}(W_i) = 0$, $\mathbb{E}(W_i^2) = \sigma^2$, and $\mathbb{E}(|W_i|^{2+\kappa}) < \infty$ for some $\kappa > 0$.
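
A minimal simulation of this model, assuming NumPy; the scaled Student-t noise below is one concrete choice satisfying the moment conditions:

```python
# A minimal simulation of the linear model Y = X theta_0 + W. The noise is
# scaled Student-t with 5 degrees of freedom: independent, mean zero,
# variance sigma^2, and E|W_i|^{2+kappa} < infinity for any kappa < 3.
import numpy as np

rng = np.random.default_rng(1)
n, p, s0, sigma = 300, 600, 10, 1.0
X = rng.standard_normal((n, p))          # Gaussian design (here Sigma = I)
theta0 = np.zeros(p)
theta0[:s0] = 1.5                        # s0-sparse parameter vector
df = 5
W = sigma * np.sqrt((df - 2) / df) * rng.standard_t(df, size=n)  # Var = sigma^2
Y = X @ theta0 + W
```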

  14. Problem. Confidence intervals: for each $i \in \{1, \ldots, p\}$, find $\underline{\theta}_i, \overline{\theta}_i \in \mathbb{R}$ such that $\mathbb{P}\big(\theta_{0,i} \in [\underline{\theta}_i, \overline{\theta}_i]\big) \ge 1 - \alpha$; we would like $|\overline{\theta}_i - \underline{\theta}_i|$ to be as small as possible. Hypothesis testing: $H_{0,i}: \theta_{0,i} = 0$ versus $H_{A,i}: \theta_{0,i} \neq 0$.
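
Both targets reduce to the same computation once a coordinate estimate is approximately Gaussian; the hypothetical helper below (assuming SciPy) makes this explicit:

```python
# Sketch: a hypothetical helper turning an (approximately) Gaussian estimate
# with standard error `se` into the interval and two-sided p-value above.
from scipy.stats import norm

def ci_and_pvalue(est, se, alpha=0.05):
    """Interval with coverage ~ 1 - alpha, and p-value for H_0: theta_0i = 0."""
    z = norm.ppf(1 - alpha / 2)
    pval = 2 * norm.sf(abs(est) / se)
    return (est - z * se, est + z * se), pval
```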

  15. LASSO. $\hat{\theta} \equiv \arg\min_{\theta \in \mathbb{R}^p} \big\{ \frac{1}{2n} \|y - X\theta\|_2^2 + \lambda \|\theta\|_1 \big\}$ [Tibshirani 1996; Chen, Donoho 1996]. What is the distribution of $\hat{\theta}$? Debiasing approach (the LASSO is biased towards small $\ell_1$ norm): map $\hat{\theta} \xrightarrow{\text{debiasing}} \hat{\theta}^{\mathrm{d}}$. We characterize the distribution of $\hat{\theta}^{\mathrm{d}}$.
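
Conveniently, scikit-learn's `Lasso` minimizes exactly the $\frac{1}{2n}$-scaled objective above, with its `alpha` playing the role of $\lambda$. A sketch, reusing the simulated data from the linear-model snippet; the particular $\lambda$ is a common theoretical choice, not prescribed by the slide:

```python
# Sketch: scikit-learn's Lasso minimizes exactly the objective above,
#   ||y - X theta||_2^2 / (2 n) + alpha * ||theta||_1,
# so `alpha` plays the role of lambda. Reuses X, Y, sigma from the
# linear-model snippet.
from sklearn.linear_model import Lasso

lam = 2 * sigma * np.sqrt(np.log(p) / n)
theta_hat = Lasso(alpha=lam, fit_intercept=False).fit(X, Y).coef_
print("nonzero coordinates:", int(np.sum(theta_hat != 0)))
```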

  18. Debiasing approach.

  19. Classical setting ($n \gg p$). We know everything about the least-squares estimator $\hat{\theta}^{\mathrm{LS}} = \frac{1}{n} \hat{\Sigma}^{-1} X^T Y$, where $\hat{\Sigma} \equiv (X^T X)/n$ is the empirical covariance. Confidence intervals: $[\underline{\theta}_i, \overline{\theta}_i] = [\hat{\theta}^{\mathrm{LS}}_i - c_\alpha \Delta_i, \, \hat{\theta}^{\mathrm{LS}}_i + c_\alpha \Delta_i]$ with $\Delta_i \equiv \sigma \sqrt{(\hat{\Sigma}^{-1})_{ii} / n}$.
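
A quick simulated check of these classical intervals, assuming NumPy and SciPy; variable names are ad hoc:

```python
# Sketch: classical least squares with per-coordinate intervals
# Delta_i = sigma * sqrt((Sigma_hat^{-1})_{ii} / n); c_alpha is the normal
# quantile. Self-contained simulation with n >> p.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
n_ls, p_ls, sigma_ls = 2000, 20, 1.0
X_ls = rng.standard_normal((n_ls, p_ls))
theta_ls_true = rng.standard_normal(p_ls)
Y_ls = X_ls @ theta_ls_true + sigma_ls * rng.standard_normal(n_ls)

Sigma_hat_ls = X_ls.T @ X_ls / n_ls
Sigma_inv_ls = np.linalg.inv(Sigma_hat_ls)
theta_LS = Sigma_inv_ls @ X_ls.T @ Y_ls / n_ls

c_alpha = norm.ppf(0.975)                      # alpha = 0.05
delta = sigma_ls * np.sqrt(np.diag(Sigma_inv_ls) / n_ls)
covered = (theta_ls_true >= theta_LS - c_alpha * delta) & \
          (theta_ls_true <= theta_LS + c_alpha * delta)
print("empirical coverage:", covered.mean())   # should be close to 0.95
```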

  21. High-dimensional setting ($n < p$). For $\hat{\theta}^{\mathrm{LS}} = \frac{1}{n} \hat{\Sigma}^{-1} X^T Y$, the problem in high dimension is that $\hat{\Sigma}$ is not invertible! Take your favorite $M \in \mathbb{R}^{p \times p}$: $\hat{\theta}^* = \frac{1}{n} M X^T Y = \frac{1}{n} M X^T X \theta_0 + \frac{1}{n} M X^T W = \theta_0 + \underbrace{(M\hat{\Sigma} - I)\theta_0}_{\text{bias}} + \underbrace{\tfrac{1}{n} M X^T W}_{\text{Gaussian error}}$.
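
The decomposition holds exactly for any $M$, which a few lines of NumPy can verify; the ridge-type $M$ below is an arbitrary illustrative choice, not the one developed later:

```python
# Sketch: with n < p the empirical covariance is singular, yet the exact
# decomposition theta_star = theta0 + (M Sigma_hat - I) theta0 + M X^T W / n
# holds for *any* M. Reuses X, Y, W, theta0 from the linear-model snippet;
# the ridge-type M here is an arbitrary illustrative choice.
Sigma_hat = X.T @ X / n
print("rank(Sigma_hat) =", np.linalg.matrix_rank(Sigma_hat), "< p =", p)

M = np.linalg.inv(Sigma_hat + 0.1 * np.eye(p))     # hypothetical M
theta_star = M @ X.T @ Y / n
decomp = theta0 + (M @ Sigma_hat - np.eye(p)) @ theta0 + M @ X.T @ W / n
print("max |theta_star - decomposition| =", np.max(np.abs(theta_star - decomp)))
```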

  23. Debiased estimator. Starting from $\hat{\theta}^* = \theta_0 + \underbrace{(M\hat{\Sigma} - I)\theta_0}_{\text{bias}} + \underbrace{\tfrac{1}{n} M X^T W}_{\text{Gaussian error}}$, let us (try to) subtract the bias: $\hat{\theta}^{\mathrm{d}} = \hat{\theta}^* - (M\hat{\Sigma} - I)\hat{\theta}^{\mathrm{Lasso}}$. This gives the debiased estimator (with $\hat{\theta} = \hat{\theta}^{\mathrm{Lasso}}$): $\hat{\theta}^{\mathrm{d}} \equiv \hat{\theta} + \frac{1}{n} M X^T (Y - X\hat{\theta})$.
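
The estimator is a one-line correction of the Lasso; a sketch reusing the Lasso fit and the illustrative $M$ from the previous snippets:

```python
# Sketch: the debiased estimator as a one-line correction of the Lasso.
def debias(theta_hat, X, Y, M):
    """theta_d = theta_hat + M X^T (Y - X theta_hat) / n."""
    return theta_hat + M @ X.T @ (Y - X @ theta_hat) / X.shape[0]

theta_d = debias(theta_hat, X, Y, M)   # Lasso fit and illustrative M above
```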

  26. Debiased estimator: choosing $M$? $\hat{\theta}^{\mathrm{d}} \equiv \hat{\theta} + \frac{1}{n} M X^T (y - X\hat{\theta})$. Gaussian design ($x_i \sim \mathsf{N}(0, \Sigma)$): assume $\Sigma$ is known (relevant in semi-supervised learning) and set $M = \Sigma^{-1}$ [Javanmard, Montanari 2012]. Does this remind you of anything? $\hat{\theta}^{\mathrm{d}} \equiv \hat{\theta} + \Sigma^{-1} \cdot \frac{1}{n} X^T (y - X\hat{\theta})$ is a (pseudo-)Newton step. Alternatively, build an approximate inverse of $\hat{\Sigma}$ via nodewise LASSO on $X$ (under a row-sparsity assumption on $\Sigma^{-1}$) [van de Geer, Bühlmann, Ritov, Dezeure 2014].
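
A sketch of the known-covariance prescription $M = \Sigma^{-1}$ on a simulated AR(1) Gaussian design; the design and all parameters below are assumptions for illustration:

```python
# Sketch: known-covariance choice M = Sigma^{-1} on a simulated AR(1)
# Gaussian design. All parameters are assumptions for illustration.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
n, p, s0, sigma, rho = 400, 800, 10, 1.0, 0.5
Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
X = rng.standard_normal((n, p)) @ np.linalg.cholesky(Sigma).T  # x_i ~ N(0, Sigma)
theta0 = np.zeros(p)
theta0[:s0] = 1.0
Y = X @ theta0 + sigma * rng.standard_normal(n)

lam = 2 * sigma * np.sqrt(np.log(p) / n)
theta_hat = Lasso(alpha=lam, fit_intercept=False).fit(X, Y).coef_

M = np.linalg.inv(Sigma)                       # the slide's choice
theta_d = theta_hat + M @ X.T @ (Y - X @ theta_hat) / n

# Debiasing removes the l1 shrinkage bias on the true support:
print("mean error on support, Lasso:   ", np.mean(theta_hat[:s0] - theta0[:s0]))
print("mean error on support, debiased:", np.mean(theta_d[:s0] - theta0[:s0]))
```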

  30. Debiased estimator: choosing $M$? Our approach: optimize the two objectives, the bias and the variance of $\hat{\theta}^{\mathrm{d}}$ [Javanmard, Montanari 2014]. Write $\sqrt{n}(\hat{\theta}^{\mathrm{d}} - \theta_0) = \underbrace{\sqrt{n}(M\hat{\Sigma} - I)(\theta_0 - \hat{\theta})}_{\text{bias}} + Z$, where $Z \mid X \sim \mathsf{N}(0, \sigma^2 \underbrace{M\hat{\Sigma}M^T}_{\text{noise covariance}})$ and $\hat{\Sigma} = \frac{1}{n} X^T X$. Find $M$ by solving an optimization problem: minimize $\max_{1 \le i \le p} (M\hat{\Sigma}M^T)_{i,i}$ subject to $|M\hat{\Sigma} - I|_\infty \le \xi$. Equivalently, row by row: minimize $m_i^T \hat{\Sigma} m_i$ subject to $\|\hat{\Sigma} m_i - e_i\|_\infty \le \xi$. The optimization decouples across rows and can be solved in parallel.
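
One row of this program can be handed to a generic convex solver; the sketch below uses cvxpy (an assumption: the slide does not specify a solver), reusing the design from the previous snippet, with a typical choice of the constraint level $\xi$:

```python
# Sketch: one row of the program above, handed to a generic convex solver
# (cvxpy; the slide does not specify a solver). Reuses X, n, p from the
# previous snippet. Rows are independent, so they can be solved in parallel.
import cvxpy as cp

Sigma_hat = X.T @ X / n
xi = np.sqrt(np.log(p) / n)          # constraint level; a typical scaling
i = 0
e_i = np.zeros(p)
e_i[i] = 1.0

m = cp.Variable(p)
prob = cp.Problem(
    cp.Minimize(cp.sum_squares(X @ m) / n),            # = m^T Sigma_hat m
    [cp.norm(Sigma_hat @ m - e_i, "inf") <= xi],
)
prob.solve()
m_i = m.value                        # i-th row of M
```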

  33. What does it look like? [Figure: empirical density of the debiased coordinates $\hat{\theta}^{\mathrm{d}}_i$, which is approximately Gaussian.] We can also estimate $\sigma$. 'Ground truth' computed from $n_{\mathrm{tot}} = 10{,}000$ records.
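
Given the preceding snippets, plug-in intervals and p-values follow from the Gaussian limit $\hat{\theta}^{\mathrm{d}}_i \approx \mathsf{N}\big(\theta_{0,i}, \sigma^2 (M\hat{\Sigma}M^T)_{ii}/n\big)$; $\sigma$ is treated as known here, though (as the slide notes) it can be estimated:

```python
# Sketch: plug-in intervals and p-values from the Gaussian limit of the
# debiased estimator. Reuses theta_d, M, Sigma_hat, sigma, n, s0 from the
# snippets above; sigma is treated as known, though it can be estimated.
from scipy.stats import norm

se = sigma * np.sqrt(np.diag(M @ Sigma_hat @ M.T) / n)  # per-coordinate s.e.
pvals = 2 * norm.sf(np.abs(theta_d) / se)               # tests of H_{0,i}
c = norm.ppf(0.975)                                     # alpha = 0.05
ci_lower, ci_upper = theta_d - c * se, theta_d + c * se
# Under the nulls (coordinates past s0) about 5% of p-values fall below 0.05:
print("false positive rate:", np.mean(pvals[s0:] < 0.05))
```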
