
High Dimensional M-Estimation & Inference from Observational Data with Incomplete Responses: A Semi-Parametric Doubly Robust Framework
Abhishek Chakrabortty, Department of Statistics, University of Pennsylvania
Group Meeting, April 24, 2019


The Two Standard (Fundamental) Assumptions

1. Ignorability assumption: T ⊥⊥ Y | X. A.k.a. 'missing at random' (MAR) in the missing data literature; a.k.a. 'no unmeasured confounding' (NUC) in causal inference. Special case: T ⊥⊥ (Y, X), a.k.a. missing completely at random (MCAR) in the missing data literature, and complete randomization (e.g. randomized trials) in the causal inference (CI) literature.
2. Positivity assumption (a.k.a. 'sufficient overlap' in the CI literature): let π(X) := P(T = 1 | X) be the propensity score (PS), and let π_0 := P(T = 1). Then π(·) is uniformly bounded away from 0: 1 ≥ π(x) ≥ δ_π > 0 for all x ∈ X, for some constant δ_π > 0.

Abhishek Chakrabortty — High-Dim. M-Estimation with Missing Responses: A Semi-Parametric Framework — 6/50

Relevance in Biomedical Studies: EHR Data

Rich resources of data for discovery research; fast growing literature.
Detailed clinical and phenotypic data collected electronically for large patient cohorts, as part of routine health care delivery.
Structured data: ICD codes, medications, lab tests, demographics, etc. Unstructured text data (extracted from clinician notes via NLP): signs and symptoms, family history, social history, radiology reports, etc.

EHR Data: The Promises and the Challenges

Information on a variety of phenotypes (unlike usual cohort studies). Opens up unique opportunities for novel integrative analyses.
EHR + bio-repositories → genome-phenome association networks, PheWAS studies, and genomic risk prediction of diseases.
The key challenges and bottlenecks for EHR-driven research: logistic difficulty in obtaining validated phenotype (Y) information. Often time/labor/cost intensive (and the ICD codes are imprecise).

EHR Data and Incompleteness: Various Examples

Some examples of missing Y in EHRs, and the reasons for missingness:
1. Y → some (binary) disease phenotype (e.g. Rheumatoid Arthritis). Requires manual chart review by physicians (logistic constraints).
2. Y → some biomarker (e.g. anti-CCP, an important RA biomarker). Requires lab tests (cost constraints). Similarly, any Y requiring genomic measurements may also have cost/logistics constraints.
Verified phenotypes/treatment responses/biomarkers/genomic variables (Y) are available only for a subset; clinical features (X) are available for all.
Further issues: selection bias/treatment by indication/preferential labeling (e.g. sicker patients get labeled/treated/tested more often).
Causal inference problems (treatment effects estimation): EHRs also facilitate comparative effectiveness research on a large scale, with many treatments/medications (and responses) being observed. All other clinical features (X) serve as potential confounders.

Another Example: eQTL Studies (Integrative Genomics)

Association studies for gene expression (Y) vs. genetic variants (X).
Popular tools in integrative genomics (genetic association studies + gene expression profiling) for understanding gene regulatory networks.
Missing data issue: gene expression data often missing (loss of power), while genetic variant data are often available for a much larger group.
Causal inference: estimate the causal effect of any one variant (the 'treatment') on Y, while all other variants are potential confounders.

High Dimensional M-Estimation: The Parameter(s) of Interest

Goal for M-estimation: estimation and inference, based on D_n, of θ_0 ∈ R^d (possibly high dimensional), defined as the risk minimizer:

    θ_0 ≡ θ_0(P) := argmin_{θ ∈ R^d} R(θ), where R(θ) := E{L(Y, X, θ)},

and L(·) ∈ R_+ is any 'loss' function that is convex and differentiable in θ. Existence of θ_0 is implicitly assumed (guaranteed for most usual problems).
d can diverge with n (including d ≫ n). Also, θ_0(P) is 'model free' (no restrictions on P); in particular, no model assumptions on Y | X.
The key challenges: the missingness via T (if not accounted for, the estimator will be inconsistent!) and the high dimensional setting. Suitable methods are needed, involving estimation of nuisance functions and careful analyses (due to error terms with complex dependencies).
Special (but low-d) case: θ_0 = E(Y) with L(Y, X, θ) = (Y − θ)^2. This leads to the average treatment effect (ATE) estimation problem in CI.

M-Estimation and Missing Data/Causal Inference Problems: A Review

The framework includes a broad class of M/Z-estimation problems.
M-estimation for fully observed data: well studied, with a rich literature. Classical settings: Van der Vaart (2000); high dimensional settings: Negahban et al. (2012), Loh and Wainwright (2012, 2015), etc.
Missing data/causal inference problems: semi-parametric inference. Classical settings: vast literature (typically for mean estimation). Tsiatis (2007); Bang and Robins (2005); Robins et al. (1994), etc.
High dimensional settings (but low dimensional parameters): much recent attention on mean (or ATE) estimation. Belloni et al. (2014, 2017); Farrell (2015); Chernozhukov et al. (2018).
Much less attention when the parameter itself is high dimensional.
This work contributes to both literatures above: M-estimation + missing data + high dimensional setting and parameter. (It also has applications in heterogeneous treatment effects estimation in CI.)

HD M-Estimation: A Few (Classes of) Applications

1. All standard high dimensional (HD) regression problems with: (a) missing outcomes and (b) potentially misspecified (working) models.
E.g. squared loss: L(Y, X, θ) := (Y − X′θ)^2 → linear regression; logistic loss: L(Y, X, θ) := log{1 + exp(X′θ)} − Y(X′θ) → logistic regression (for binary Y); exponential loss (Poisson regression), and so on.
Note: throughout, regardless of any motivating 'working model' being true or not, the definition of θ_0 is completely 'model free'.
2. Series estimation problems (model free) with missing Y and HD basis functions (instead of X in Example 1 above), e.g. spline bases.
Use the same choices of L(·) as in Example 1 above, with X replaced by any set of d (possibly HD) basis functions Ψ(X) := {ψ_j(X)}_{j=1}^d.
E.g. polynomial bases: Ψ(X) := {1, x_j^k : 1 ≤ j ≤ p, 1 ≤ k ≤ d_0}. (d_0 = 1 → linear bases as in Example 1; d_0 = 3 → cubic splines.)
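As a concrete illustration of the series estimation example, here is a minimal sketch (in Python; `poly_basis` is a hypothetical helper, not from the talk) of constructing the polynomial basis Ψ(X) = {1, x_j^k : 1 ≤ j ≤ p, 1 ≤ k ≤ d_0}, which yields d = 1 + p·d_0 basis functions:

```python
import numpy as np

def poly_basis(X, d0):
    """Polynomial basis Psi(X): an intercept plus {x_j^k : 1 <= j <= p, 1 <= k <= d0}.

    X is an (n, p) design; the returned matrix is (n, 1 + p*d0).
    d0 = 1 recovers the plain linear design of Example 1.
    """
    n, p = X.shape
    # columns: [1, x_1..x_p, x_1^2..x_p^2, ..., x_1^d0..x_p^d0]
    return np.column_stack([np.ones((n, 1))] + [X ** k for k in range(1, d0 + 1)])

X = np.random.default_rng(0).standard_normal((100, 3))
Psi_lin = poly_basis(X, d0=1)   # (100, 4): intercept + linear terms
Psi_cub = poly_basis(X, d0=3)   # (100, 10): intercept + powers up to 3
```

Everything downstream (the loss, the DDR construction) then simply treats Ψ(X) as the covariate vector in place of X.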

Another Application: HD Single Index Models (SIMs)

Signal recovery in high dimensional single index models (SIMs) with an elliptically symmetric design distribution (e.g. X is Gaussian).
Let Y = f(β_0′X, ε) with f: R^2 → Y unknown (i.e. β_0 identifiable only up to scalar multiples) and ε ⊥⊥ X (i.e., Y ⊥⊥ X | β_0′X).
Consider any of the regression problems introduced in Example 1. Let θ_0 := argmin_{θ ∈ R^p} E{L(Y, X′θ)} for any loss function L(·): R^2 → R that is convex in the second argument. Then θ_0 ∝ β_0! A remarkable result due to Li and Duan (1989).
A classic example of a misspecified parametric model defining θ_0, yet θ_0 directly relates to an actual (interpretable) semi-parametric model! The proportionality result also preserves any sparsity assumptions.
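The Li–Duan proportionality result is easy to check by Monte Carlo. Below is a sketch under assumed choices not from the talk: a Gaussian design, a cubic link f(u, ε) = u^3 + ε, a hypothetical sparse index vector β_0, and least squares as the convex loss:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200_000, 4
beta0 = np.array([1.0, 0.0, -0.5, 0.0])        # hypothetical sparse index vector
X = rng.standard_normal((n, p))                # Gaussian (elliptically symmetric) design
Y = (X @ beta0) ** 3 + rng.standard_normal(n)  # unknown nonlinear link: f(u, eps) = u^3 + eps

# Deliberately misspecified linear working model, fit by least squares
Xd = np.column_stack([np.ones(n), X])
theta = np.linalg.lstsq(Xd, Y, rcond=None)[0][1:]   # drop the intercept

# theta should be (approximately) proportional to beta0, sparsity included
cos = theta @ beta0 / (np.linalg.norm(theta) * np.linalg.norm(beta0))
```

Despite the wrong model, `cos` should come out close to 1, and the coordinates where β_0 is zero are estimated near zero, so the sparsity structure of β_0 carries over to θ_0.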

Applications in Causal Inference (Treatment Effects Estimation)

Applications of all these problems in causal inference (estimation of treatment effects, with useful applications in precision medicine):
1. Linear heterogeneous treatment effects estimation: an application of the linear regression example (twice). Write {Y(0), Y(1)} linearly as:

    Y(j) = X′β(j) + ε(j), E{ε(j) X} = 0, ∀ j = 0, 1, so that
    Y(1) − Y(0) = X′β* + ε*, where β* := β(1) − β(0) and ε* := ε(1) − ε(0).

β* denotes the (model free) linear projection of Y(1) − Y(0) | X. Of interest in HD settings when E{Y(1) − Y(0) | X} is difficult to model (Chernozhukov et al., 2017; Chernozhukov and Semenova, 2017).
2. Average conditional treatment effects (ACTE) estimation via series estimators: an application of the series estimation example (twice).
3. Causal inference via SIMs (signal recovery, ACTE estimation and ATE estimation): an application of the SIM example (twice).

Before Getting Started: A Few Facts and Considerations

Some notation: m(X) := E(Y | X) and φ(X, θ) := E{L(Y, X, θ) | X}.
It is generally necessary to 'account' for the missingness in Y: the 'complete case' estimator of θ_0 will in general be inconsistent!
That estimator may be consistent only if: (1) ∇φ(X, θ_0) = 0 a.s. for every X (for regression problems, this indicates the 'correct model' case), and/or (2) T ⊥⊥ (Y, X) (i.e. the MCAR case).
Illustration of (1) for squared loss: ∇φ(X, θ_0) = E{X(Y − X′θ_0) | X} = 0. Hence, E(Y | X) = X′θ_0 (i.e. a 'linear model' holds for Y | X).
With θ_0 (and X) high dimensional (compared to n), we need further structural constraints on θ_0 in order to estimate it using D_n. We assume θ_0 is s-sparse: ‖θ_0‖_0 := s and s ≤ min(n, d). Note: the sparsity requirement has an attractive (and fairly intuitive) geometric justification for all the examples given here.
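The complete-case inconsistency is easy to exhibit in a toy simulation (a sketch with assumed choices, not from the talk: m(X) = X + X^2 makes the linear working model wrong, and a covariate-dependent propensity makes the missingness MAR but not MCAR, so neither condition (1) nor (2) holds):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
X = rng.standard_normal(n)
Y = X + X**2 + rng.standard_normal(n)      # E(Y | X) nonlinear: linear model misspecified
pi = 0.1 + 0.8 / (1 + np.exp(-2 * X))      # MAR propensity, bounded in (0.1, 0.9)
T = rng.binomial(1, pi)                    # Y treated as observed only when T = 1

def wls_slope(x, y, w):
    # weighted least-squares slope of y on (1, x)
    xb, yb = np.average(x, weights=w), np.average(y, weights=w)
    return np.average((x - xb) * (y - yb), weights=w) / np.average((x - xb) ** 2, weights=w)

slope_full = wls_slope(X, Y, np.ones(n))                   # slope component of theta_0
obs = T == 1
slope_cc = wls_slope(X[obs], Y[obs], np.ones(obs.sum()))   # complete case: inconsistent
slope_ipw = wls_slope(X[obs], Y[obs], 1.0 / pi[obs])       # IPW: accounts for missingness
```

The complete-case slope lands visibly away from the full-data slope, while the inverse-probability-weighted fit recovers it, which is exactly the 'accounting for missingness' the slide calls for.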

Estimation of θ_0: Getting Identifiable Representation(s) of R(θ)

Under the MAR assumption, R(θ) := E{L(Y, X, θ)} ≡ E_X{φ(X, θ)} admits the following debiased and doubly robust (DDR) representation:

    R(θ) = E_X{φ(X, θ)} + E[ (T/π(X)) {L(Y, X, θ) − φ(X, θ)} ].    (1)

A purely non-parametric identification based on the observable Z and the nuisance functions π(X) and φ(X, θ) (unknown but estimable).
The 2nd term is simply 0, and can be seen as a 'debiasing' term (of sorts). It plays a crucial role in analyzing the empirical version of (1), ensuring first order insensitivity to any estimation errors of π(·) and φ(·).
Double robustness (DR) aspect: replace {φ(X, θ), π(X)} by any {φ*(X, θ), π*(X)}, and (1) continues to hold as long as one, but not necessarily both, of φ*(·) = φ(·) or π*(·) = π(·) holds.
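Representation (1) and its double robustness can be sanity-checked by Monte Carlo for the squared loss at a fixed θ. This is a sketch with assumed choices (m(X) = X + X^2, a logistic propensity); note that L enters the second term only through T·L, so only observed outcomes are used there:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000
X = rng.standard_normal(n)
Y = X + X**2 + rng.standard_normal(n)             # m(X) = X + X^2, unit noise variance
pi = np.clip(1 / (1 + np.exp(-X)), 0.1, 0.9)      # true propensity (MAR + positivity)
T = rng.binomial(1, pi)
theta = 0.5

L = (Y - theta * X) ** 2                          # squared loss at the fixed theta
phi = (X + X**2 - theta * X) ** 2 + 1.0           # phi(X, theta) = E{L | X}

def ddr_risk(phi_w, pi_w):
    # empirical version of (1) with 'working' nuisances; L appears only via T * L
    return np.mean(phi_w + (T / pi_w) * (L - phi_w))

R_full = L.mean()                                 # infeasible full-data risk (ground truth)
debias_term = np.mean((T / pi) * (L - phi))       # the 2nd term: ~0 with true nuisances
R_phi_wrong = ddr_risk(np.zeros(n), pi)           # phi* wrong, pi correct: still recovers R
R_pi_wrong = ddr_risk(phi, np.full(n, 0.5))       # pi* wrong, phi correct: still recovers R
```

Either misspecified variant still matches the full-data risk, while the debiasing term itself averages to (essentially) zero, as the slide claims.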

The DDR Estimator of θ_0

Given any estimators {π̂(·), φ̂(·)} of the nuisance functions {π(·), φ(·)}, we define our L_1-penalized DDR estimator θ̂_DDR of θ_0 as:

    θ̂_DDR ≡ θ̂_DDR(λ_n) := argmin_{θ ∈ R^d} { L_n^DDR(θ) + λ_n ‖θ‖_1 }, where
    L_n^DDR(θ) := (1/n) Σ_{i=1}^n [ φ̂(X_i, θ) + (T_i/π̂(X_i)) {L(Y_i, X_i, θ) − φ̂(X_i, θ)} ],

λ_n ≥ 0 is the tuning parameter, and {π̂(·), φ̂(·)} are arbitrary except for satisfying two basic conditions regarding their construction:
π̂(·) is obtained from the data T_n := {T_i, X_i}_{i=1}^n only; {φ̂(X_i, θ)}_{i=1}^n are obtained in a 'cross-fitted' manner (via sample splitting).
Assume (temporarily) that {π̂(·), φ̂(·)} are both 'correct'. DR properties (consistency) of θ̂_DDR under their misspecifications are discussed later.

Simplifying Assumptions and User Friendly Implementation Algorithm

For simplicity, assume that the gradient ∇L(Y, X, θ) of L(·) has a 'separable form' as follows: for some h(X) ∈ R^d and g(X, θ) ∈ R,

    ∇L(Y, X, θ) = h(X){Y − g(X, θ)}, and hence,
    ∇φ̂(X, θ) = h(X){m̂(X) − g(X, θ)},

where m̂(X) denotes the corresponding (cross-fitted) estimator of m(X).
This simplifying assumption holds for all the examples given before. Under the assumed form, we only need to obtain m̂(X_i), and not φ̂(X_i, θ).
Implementation algorithm: θ̂_DDR can be obtained simply as

    θ̂_DDR ≡ θ̂_DDR(λ_n) := argmin_{θ ∈ R^d} { (1/n) Σ_{i=1}^n L(Ŷ_i, X_i, θ) + λ_n ‖θ‖_1 },

where Ŷ_i := m̂(X_i) + (T_i/π̂(X_i)){Y_i − m̂(X_i)}, ∀ i, is a 'pseudo' outcome.
Can use 'glmnet' in R: pretend to have a 'full' data set {Ŷ_i, X_i}_{i=1}^n.
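The pseudo-outcome algorithm above can be sketched end to end in Python on simulated data. This is a toy illustration with assumptions not from the talk: a plain coordinate-descent lasso stands in for 'glmnet', the true propensity π is plugged in for brevity (in practice π̂ is also estimated), and the cross-fitting of m̂ is omitted:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 2000, 50
X = rng.standard_normal((n, d))
theta0 = np.zeros(d); theta0[:3] = [2.0, -1.5, 1.0]    # hypothetical sparse signal
Y_all = X @ theta0 + rng.standard_normal(n)
pi = np.clip(1 / (1 + np.exp(-X[:, 0])), 0.1, 0.9)     # MAR propensity, positivity enforced
T = rng.binomial(1, pi)
Y = np.where(T == 1, Y_all, np.nan)                    # Y observed only when T = 1

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/2n)||y - X t||^2 + lam * ||t||_1."""
    n, d = X.shape
    t, r = np.zeros(d), y.copy()                       # r tracks the residual y - X t
    col_ss = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(d):
            rho = X[:, j] @ r / n + col_ss[j] * t[j]   # partial residual correlation
            t_new = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_ss[j]
            r += X[:, j] * (t[j] - t_new)
            t[j] = t_new
    return t

# Outcome nuisance m-hat from the labeled subset (cross-fitting omitted for brevity)
m_hat = X @ lasso_cd(X[T == 1], Y[T == 1], lam=0.05)

# Pseudo outcome, then ONE penalized fit pretending {Y_pseudo, X} is a full data set
Y_pseudo = m_hat + (T / pi) * (np.nan_to_num(Y) - m_hat)
theta_ddr = lasso_cd(X, Y_pseudo, lam=0.05)
```

The point of the construction is that, once the pseudo outcomes are formed, any off-the-shelf L_1-penalized solver can be applied unchanged.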

  50. Properties of � θ DDR : Deterministic Deviation Bounds Assume L ( · ) is convex and differentiable in θ and L DDR ( θ ) satisfies n the Restricted Strong Convexity (RSC) condition (Negahban et al., 2012) at θ = θ 0 . Then, for any choice of λ n ≥ 2 � ∇ L DDR ( θ 0 ) � ∞ , n � � � � √ s , and � � � � �� �� θ DDR ( λ n ) − θ 0 2 � λ n θ DDR ( λ n ) − θ 0 1 � λ n s . � � where s := � θ 0 � 0 . This is a deterministic deviation bound. Holds for any choices of { � π ( · ) , � m ( · ) } and for any realization of D n . The RSC (or ‘cone’) condition for L DDR ( θ ) is exactly the same as n the usual RSC condition required under a fully observed data! The fully observed data RSC condition’s validity is well studied. Key quantity of interest: the random lower bound � ∇ L DDR ( θ 0 ) � ∞ for n λ n . Need probabilistic bounds to determine convergence rate of � θ DDR . Abhishek Chakrabortty High-Dim. M -Estimation with Missing Responses: A Semi-Parametric Framework 20/50

  51. The Main Goal from Hereon: Probabilistic Bounds for ‖∇L_n^DDR(θ_0)‖_∞ Bounds on ‖∇L_n^DDR(θ_0)‖_∞ determine the rate of choice of λ_n and hence the convergence rate of θ̂_DDR (using the deviation bound). Probabilistic bounds for ‖∇L_n^DDR(θ_0)‖_∞: the basic decomposition ‖∇L_n^DDR(θ_0)‖_∞ ≤ ‖T_{0,n}‖_∞ + ‖T_{π,n}‖_∞ + ‖T_{m,n}‖_∞ + ‖R_{π,m,n}‖_∞, where T_{0,n} is the 'main' term (a centered iid average), T_{π,n} is the 'π-error' term involving π̂(·) − π(·), T_{m,n} is the 'm-error' term involving m̂(·) − m(·), while R_{π,m,n} is the '(π, m)-error' term (usually of lower order) involving the product of π̂(·) − π(·) and m̂(·) − m(·). Control each term separately. The analyses are all non-asymptotic and nuanced, especially in order to get sharp rates for T_{π,n} and T_{m,n}. We show: ‖∇L_n^DDR(θ_0)‖_∞ ≲ √{(log d)/n} with high probability, and hence ‖θ̂_DDR − θ_0‖_2 ≲ √{s (log d)/n}. So, clearly it is rate optimal.

  54. Convergence Rates and Bounds for ‖∇L_n^DDR(θ_0)‖_∞ (and θ̂_DDR) Basic (high-level) consistency conditions on {π̂(·), m̂(·)}. Let {π̂(·), m̂(·)} be any general and 'correct' estimators of {π(·), m(·)}, and assume they satisfy the following pointwise convergence rates: |π̂(x) − π(x)| ≲_P δ_{n,π} and |m̂(x) − m(x)| ≲_P ξ_{n,m} ∀ x ∈ X, (2) for some sequences δ_{n,π}, ξ_{n,m} ≥ 0 such that (δ_{n,π} + ξ_{n,m}) √{log(nd)} = o(1) and the product δ_{n,π} ξ_{n,m} (log n) = o(√{(log d)/n}). Under condition (2), along with some further 'suitable' tail assumptions (sub-Gaussian tails etc.), we have, with high probability: ‖T_{0,n}‖_∞ ≲ √{(log d)/n}, ‖T_{π,n}‖_∞ ≲ δ_{n,π} √{log(nd)} · √{(log d)/n}, and ‖T_{m,n}‖_∞ ≲ ξ_{n,m} √{log(nd)} · √{(log d)/n}, ‖R_{π,m,n}‖_∞ ≲ δ_{n,π} ξ_{n,m} (log n). Hence, ‖∇L_n^DDR(θ_0)‖_∞ ≲ √{(log d)/n} {1 + o(1)} with high probability.
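The √{(log d)/n} rate for the main term ‖T_{0,n}‖_∞ — a max over d coordinates of centered iid averages — can be sanity-checked numerically. A small Monte Carlo sketch in Python, with Gaussian coordinates standing in for the sub-Gaussian assumption (n, d, reps are arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, reps = 500, 200, 50
maxima = np.empty(reps)
for r in range(reps):
    Z = rng.standard_normal((n, d))            # iid mean-zero coordinates
    maxima[r] = np.abs(Z.mean(axis=0)).max()   # analogue of ‖T_{0,n}‖_∞
rate = np.sqrt(2 * np.log(d) / n)              # the √{(log d)/n} benchmark
print(maxima.mean(), rate)                     # the two are of the same order
```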

  59. HD Inference for θ̂_DDR: Desparsification and Asymptotic Linear Expansion Consider θ̂_DDR for the squared loss: L(Y, X, θ) := {Y − Ψ(X)′θ}², where Ψ(X) ∈ R^d denotes any HD vector of basis functions of X. Define Σ := E{Ψ(X)Ψ(X)′}, Ω := Σ^{−1}, and let Ω̂ be any reasonable estimator of Ω (and assume Ω is sparse if required). We then define the desparsified DDR estimator θ̌_DDR as follows: θ̌_DDR := θ̂_DDR + Ω̂ (1/n) Σ_{i=1}^n {Ỹ_i − Ψ(X_i)′θ̂_DDR} Ψ(X_i), where the second term is the desparsification/debiasing term and Ỹ_i := m̂(X_i) + {T_i/π̂(X_i)}{Y_i − m̂(X_i)} are the pseudo outcomes. The debiasing is similar (in spirit) to van de Geer et al. (2014), except it is the 'right' one for this problem (using pseudo outcomes in the full data).
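The debiasing step itself is a one-line linear-algebra update. A minimal sketch in Python (numpy), assuming the caller supplies the basis evaluations Ψ(X_i) as rows of `Psi`, the pseudo outcomes Ỹ_i, and some precision-matrix estimate Ω̂ (e.g. from a node-wise Lasso):

```python
import numpy as np

def desparsify(theta_hat, Psi, Y_tilde, Omega_hat):
    # One-step debiasing: θ̂_DDR + Ω̂ · (1/n) Σ_i Ψ(X_i){Ỹ_i − Ψ(X_i)'θ̂_DDR}
    n = Psi.shape[0]
    resid = Y_tilde - Psi @ theta_hat            # pseudo-outcome residuals
    correction = Omega_hat @ (Psi.T @ resid) / n # debiasing term
    return theta_hat + correction
```

A useful correctness check: when Ω̂ is the exact inverse of the sample Gram matrix Ψ′Ψ/n, the update reproduces the ordinary least-squares fit of Ỹ on Ψ regardless of the initial θ̂.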

  62. The Desparsified DDR Estimator: Asymptotic Linear Expansion Assume: the basic convergence conditions (2) for {π̂(·), m̂(·)}, that ΩX is sub-Gaussian, and that ‖Ω̂ − Ω‖_1 = O_P(a_n) and ‖I − Ω̂Σ̂‖_max = O_P(b_n), with a_n √(log d) = o(1) and b_n s √(log d) = o(1), where s := ‖θ_0‖_0. Then the desparsified estimator θ̌_DDR satisfies the asymptotic linear expansion (ALE): θ̌_DDR − θ_0 = (1/n) Σ_{i=1}^n Ω ψ_0(Z_i) + ∆_n, where ‖∆_n‖_∞ = o_P(n^{−1/2}) and ψ_0(Z) := [{m(X) − Ψ(X)′θ_0} + {T/π(X)}{Y − m(X)}] Ψ(X), with E{ψ_0(Z)} = 0. The ALE facilitates inference (e.g. confidence intervals etc.) for any low-dimensional component of θ_0 via Gaussian approximation. Further, the ALE is also 'optimal'. The function Ω ψ_0(Z) =: Ψ_eff(Z) is the 'efficient' influence function for θ_0 (Robins et al., 1994). Thus, in classical settings, θ̌_DDR achieves the semi-parametric efficiency bound.

  65. The Desparsified Estimator: Asymptotic Normality and Some Final Remarks Coordinate-wise asymptotic normality of the desparsified estimator θ̌_DDR: ∀ 1 ≤ j ≤ d, √n (θ̌_DDR − θ_0)_j →_d N(0, σ²_{0,j}), where σ²_{0,j} := Var{Ω′_{j·} ψ_0(Z)}. Further, max_{1≤j≤d} |σ̂_{0,j} − σ_{0,j}| = o_P(1), where σ̂_{0,j} is the plug-in estimator obtained by plugging Ω̂, π̂(·) and m̂(·) into Var{Ω′_{j·} ψ_0(Z)}. Can choose Ω̂ to be any standard (sparse) precision matrix estimator, e.g. the node-wise Lasso estimator. Here, a_n = s_Ω √{(log d)/n} and b_n = √{(log d)/n} under suitable conditions, with s_Ω := max_{1≤j≤d} ‖Ω_{j·}‖_0. The error ∆_n can be decomposed as: ∆_n = ∆_{n,1} + ∆_{n,2} + ∆_{n,3}, where ∆_{n,1} := (Ω̂ − Ω) (1/n) Σ_{i=1}^n ψ_0(Z_i), ∆_{n,2} := (I_d − Ω̂Σ̂)(θ̂_DDR − θ_0) and ∆_{n,3} := Ω̂ (T_{π,n} + T_{m,n} + R_{π,m,n}), with ‖∆_{n,3}‖_∞ = o_P(n^{−1/2}) and ‖∆_{n,1}‖_∞ ≲ a_n √{(log d)/n} and ‖∆_{n,2}‖_∞ ≲ b_n s √{(log d)/n}.
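A hedged sketch of how the coordinate-wise normality yields confidence intervals, in Python. Here `psi_hat` (an n×d matrix whose i-th row estimates the influence function ψ_0(Z_i)) and `Omega_hat` are assumed precomputed plug-ins, and the standard deviation of the projected influence values serves as the plug-in σ̂_{0,j}:

```python
import numpy as np

def coord_ci(theta_dn, Omega_hat, psi_hat, j):
    # 95% CI for θ_{0,j} based on √n(θ̌_DDR − θ0)_j →d N(0, σ²_{0,j}),
    # with σ²_{0,j} = Var{Ω'_{j·} ψ0(Z)} estimated by a sample variance.
    n = psi_hat.shape[0]
    proj = psi_hat @ Omega_hat[j]           # Ω̂'_{j·} ψ̂0(Z_i), i = 1..n
    sigma_j = proj.std()                    # plug-in estimate of σ_{0,j}
    half = 1.959964 * sigma_j / np.sqrt(n)  # z_{0.975} ≈ 1.96
    return theta_dn[j] - half, theta_dn[j] + half
```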

  67. The DR Aspect: General Convergence Rates (under Misspecification) Finally, let {π̂(·), m̂(·)} → {π*(·), m*(·)}, with either π*(·) = π(·) or m*(·) = m(·), but not necessarily both. Assume the same pointwise convergence conditions and rates (δ_{n,π}, ξ_{n,m}) for {π̂(·), m̂(·)} as in (2), but now with {π(·), m(·)} therein replaced by {π*(·), m*(·)}. Under some 'suitable' assumptions, we have, with high probability: ‖T_{0,n}‖_∞ + ‖T_{π,n}‖_∞ + ‖T_{m,n}‖_∞ ≲ {1 + 1[(π*, m*) ≠ (π, m)]} √{(log d)/n} and ‖R_{π,m,n}‖_∞ ≲ δ_{n,π} 1(m* ≠ m) + ξ_{n,m} 1(π* ≠ π) + δ_{n,π} ξ_{n,m} (log n). The 2nd and/or 3rd terms now also contribute to the rate √{(log d)/n}. The 4th term is o(1) but no longer ignorable (and may be slower). Regardless, this establishes general convergence rates and the DR property of θ̂_DDR under possible misspecification of {π̂(·), m̂(·)}. For the 4th term, sharper rates need a case-by-case analysis.

  72. Choices of the Nuisance Component Estimators π̂(·) and m̂(·) Note: our theory holds generally for any choices of π̂(·) and m̂(·) under mild conditions (provided they are both 'correct' estimators). Under misspecification, consistency and general (non-sharp) rates are also established. Sharp rates need case-by-case analyses: even for the mean (or ATE) estimation problem, this can be quite tricky in HD settings. See Smucler et al. (2019) for a detailed analysis. Below we provide only some choices of π̂(·) and m̂(·) that may be used to implement our theory and methods for θ̂_DDR. In general, one can use any reasonable method (including black-box ML methods). Choices of π̂(·) and m̂(·): we consider estimators from two families. Parametric and 'extended' parametric families (series estimators). Semi-parametric single index families.

  75. Choices of π̂(·): 'Extended' Parametric Families (Series Estimators) If π(·) is known, we set π̂(·) := π(·). Otherwise, we estimate π(·) via two (classes of) choices of π̂(·) (each assumed to be 'correct'). 'Extended' parametric family: π(x) = g{α′Ψ(X)}, where g(·) ∈ [0, 1] is a known function [e.g. g_expit(u) := exp(u)/{1 + exp(u)}], Ψ(X) := {ψ_k(X)}_{k=1}^K is any set of K basis functions (with K ≫ n possibly), and α ∈ R^K is an unknown (sparse) parameter vector. Example: Ψ(X) may correspond to the polynomial bases of X up to any fixed degree k. Note: the special case of linear bases (k = 1) includes all standard parametric regression models. Further, the case of π(·) = constant (but unknown), i.e. MCAR, is also included. Estimator: we set π̂(X) = g{α̂′Ψ(X)}, where α̂ denotes any suitable estimator (possibly penalized) of α based on T_n := {T_i, X_i}_{i=1}^n. Example of α̂: when g(·) = g_expit(·), α̂ may be obtained from a standard L1-penalized logistic regression of {T_i vs. Ψ(X_i)}_{i=1}^n.
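For the g_expit case, α̂ can be obtained with any L1-penalized logistic solver. A sketch using scikit-learn (the solver and the penalty level `C` are illustrative choices, not prescribed by the slides):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_pi(Psi, T, C=1.0):
    # L1-penalized logistic regression of T on the basis Ψ(X):
    # π̂(x) = g_expit{α̂'Ψ(x)} with α̂ sparse.
    lr = LogisticRegression(penalty="l1", solver="liblinear", C=C,
                            fit_intercept=False)
    lr.fit(Psi, T)
    return lambda Psi_new: lr.predict_proba(Psi_new)[:, 1]
```

In practice the fitted π̂ would also be truncated away from 0 so that the positivity assumption holds for the pseudo outcomes.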

  80. Choices of π̂(·): Semi-Parametric Single Index Families Semi-parametric single index family: π(X) = g(α′X), where g(·) ∈ (0, 1) is unknown and α ∈ R^p is a (sparse) unknown parameter (identifiable only up to scalar multiples, hence set ‖α‖_2 = 1 wlog). Given an estimator α̂ of α, we estimate π(X) ≡ E(T | α′X) as: π̂(x) ≡ π̂(α̂, x) := [ (1/nh) Σ_{i=1}^n T_i K{α̂′(X_i − x)/h} ] / [ (1/nh) Σ_{i=1}^n K{α̂′(X_i − x)/h} ], where K(·) denotes any standard (2nd order) kernel function and h = h_n > 0 denotes the bandwidth sequence with h = o(1). Obtaining α̂: in general, any approach (if available) from the (high dimensional) single index model literature can be used. But if X is elliptically symmetric, then α̂ may be obtained as simply as a standard L1-penalized logistic regression of {T_i vs. X_i}_{i=1}^n.
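A minimal Python sketch of this kernel-smoothing step, with a Gaussian kernel standing in for K(·); the index estimate `alpha_hat` and bandwidth `h` are assumed supplied by the caller:

```python
import numpy as np

def nw_pi(alpha_hat, X, T, h):
    # Nadaraya–Watson estimator of π(x) = E(T | α'X) along the index α̂'X,
    # using a Gaussian kernel (a standard 2nd-order kernel choice).
    idx = X @ alpha_hat
    def pi_hat(x_new):
        u = (idx - x_new @ alpha_hat) / h
        w = np.exp(-0.5 * u ** 2)                # kernel weights
        return float(np.sum(w * T) / np.sum(w))  # weighted average of T_i
    return pi_hat
```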

  83. Choices of m̂(·): 'Extended' Parametric Families (Series Estimators) 'Extended' parametric family: m(x) = g{γ′Ψ(X)}, where g(·) is a known 'link' function [e.g. 'canonical' links: identity, expit or exp], Ψ(X) := {ψ_k(X)}_{k=1}^K is any set of K basis functions (with K ≫ n possibly), and γ ∈ R^K is an unknown (sparse) parameter vector. Example: Ψ(X) may correspond to the polynomial bases of X up to any fixed degree k. Note: the special case of linear bases (k = 1) includes all standard parametric regression models. Estimator: we set m̂(X) = g{γ̂′Ψ(X)}, where γ̂ denotes any suitable estimator (possibly penalized) of γ based on the data subset of 'complete cases': D_n^(c) := {(Y_i, X_i) | T_i = 1}_{i=1}^n. Example of γ̂: when g(·) is any 'canonical' link function, γ̂ may be obtained from the corresponding L1-penalized 'canonical'-link regression (e.g. linear, logistic or Poisson) of {Y_i vs. Ψ(X_i) | T_i = 1}_{i=1}^n from the complete-case data D_n^(c).
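For the identity link, γ̂ is just an L1-penalized linear regression restricted to the complete cases. A scikit-learn sketch (the penalty level `lam` is an illustrative choice):

```python
import numpy as np
from sklearn.linear_model import Lasso

def fit_m_complete_cases(Psi, Y, T, lam):
    # Fit m̂ on the complete cases only (T_i = 1); identity link shown,
    # i.e. an L1-penalized linear regression of Y on Ψ(X). Logistic or
    # Poisson links would use the corresponding penalized GLM instead.
    cc = T == 1
    fit = Lasso(alpha=lam, fit_intercept=False)
    fit.fit(Psi[cc], Y[cc])
    return lambda Psi_new: Psi_new @ fit.coef_
```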

  87. Choices of m̂(·): Semi-Parametric Single Index Families Semi-parametric single index family: m(X) = g(γ′X), where g(·) is an unknown 'link' and γ ∈ R^p is a (sparse) unknown parameter (identifiable only up to scalar multiples, hence set ‖γ‖_2 = 1 wlog).
