Bootstrapping Sensitivity Analysis

Qingyuan Zhao
Department of Statistics, The Wharton School, University of Pennsylvania

June 2, 2019 @ OSU Bayesian Causal Inference Workshop
(Joint work with Bhaswar B. Bhattacharya and Dylan S. Small)
Why sensitivity analysis?

◮ Unless we have a perfectly executed randomized experiment, causal inference is based on some unverifiable assumptions.
◮ In observational studies, the most commonly used assumption is ignorability or no unmeasured confounding:
      A ⊥⊥ Y(0), Y(1) | X.
  We can only say this assumption is "plausible".
◮ Sensitivity analysis asks: what if this assumption does not hold? Does our qualitative conclusion still hold?
◮ This question appears in many settings:
  1. Confounded observational studies.
  2. Survey sampling with missing not at random (MNAR).
  3. Longitudinal studies with non-ignorable dropout.
◮ In general, this means that the target parameter (e.g. the average treatment effect) is only partially identified.
Overview: Bootstrapping sensitivity analysis

Point-identified parameter (Efron's bootstrap):
      Point estimator  ==(Bootstrap)==>  Confidence interval

Partially identified parameter (an analogy):
      Extrema estimator  ==(Optimization + Percentile bootstrap + Minimax inequality)==>  Confidence interval

Rest of the talk: apply this idea to IPW estimators in a marginal sensitivity model.
Some existing sensitivity models

Generally, we need to specify how unconfoundedness is violated.
1. Y models: consider a specific difference between the conditional distributions Y(a) | X, A and Y(a) | X.
   ◮ Commonly called "pattern mixture models".
   ◮ Robins (1999, 2002); Birmingham et al. (2003); Vansteelandt et al. (2006); Daniels and Hogan (2008).
2. A models: consider a specific difference between the conditional distributions A | X, Y(a) and A | X.
   ◮ Commonly called "selection models".
   ◮ Scharfstein et al. (1999); Gilbert et al. (2003).
3. Simultaneous models: consider a range of A models and/or Y models and report the "worst case" result.
   ◮ Cornfield et al. (1959); Rosenbaum (2002); Ding and VanderWeele (2016).

Our sensitivity model: a hybrid of the 2nd and 3rd, similar to Rosenbaum's.
Rosenbaum's sensitivity model

◮ Imagine there is an unobserved confounder U that "summarizes" all confounding, so A ⊥⊥ Y(0), Y(1) | X, U.
◮ Let e₀(x, u) = P₀(A = 1 | X = x, U = u).

Rosenbaum's sensitivity model
      R(Γ) = { e(x, u) : 1/Γ ≤ OR(e(x, u₁), e(x, u₂)) ≤ Γ, ∀ x ∈ X, u₁, u₂ },
where OR(p₁, p₂) := [p₁/(1 − p₁)] / [p₂/(1 − p₂)] is the odds ratio.

◮ Rosenbaum's question: can we reject the sharp null hypothesis Y(0) ≡ Y(1) for every e₀(x, u) ∈ R(Γ)?
◮ Robins (2002): we don't need to assume the existence of U. Let U = Y(1) when the goal is to estimate E[Y(1)].
Our sensitivity model

◮ Let e₀(x) = P₀(A = 1 | X = x) be the propensity score.

Marginal sensitivity models
      M(Γ) = { e(x, y) : 1/Γ ≤ OR(e(x, y), e₀(x)) ≤ Γ, ∀ x ∈ X, y }.

◮ Compare this to Rosenbaum's model:
      R(Γ) = { e(x, u) : 1/Γ ≤ OR(e(x, u₁), e(x, u₂)) ≤ Γ, ∀ x ∈ X, u₁, u₂ }.
◮ Tan (2006) first considered this model, but he did not consider statistical inference in finite samples.
◮ Relationship between the two models: M(√Γ) ⊆ R(Γ) ⊆ M(Γ).¹
◮ For observational studies, we assume both P₀(A = 1 | X, Y(1)), P₀(A = 1 | X, Y(0)) ∈ M(Γ).

¹ The second part needs "compatibility": e(x, y) marginalizes to e₀(x).
Parametric extension

◮ In practice, the propensity score e₀(X) = P₀(A = 1 | X) is often estimated by a parametric model.

Definition (Parametric marginal sensitivity models)
      M_{β₀}(Γ) = { e(x, y) : 1/Γ ≤ OR(e(x, y), e_{β₀}(x)) ≤ Γ, ∀ x ∈ X, y },
where e_{β₀}(x) is the best parametric approximation of e₀(x).

This sensitivity model covers both
1. Model misspecification, that is, e_{β₀}(x) ≠ e₀(x); and
2. Missing not at random, that is, e₀(x) ≠ e₀(x, y).
Logistic representations

1. Rosenbaum's sensitivity model:
      logit(e(x, u)) = g(x) + γu,
   where 0 ≤ U ≤ 1 and γ = log Γ.
2. Marginal sensitivity model:
      logit(e^(h)(x, y)) = logit(e₀(x)) + h(x, y),
   where ‖h‖_∞ = sup |h(x, y)| ≤ γ.
   Due to this representation, we also call it a marginal L∞-sensitivity model.
3. Parametric marginal sensitivity model:
      logit(e^(h)(x, y)) = logit(e_{β₀}(x)) + h(x, y),
   where ‖h‖_∞ = sup |h(x, y)| ≤ γ.
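Why the marginal sensitivity model admits this representation (a one-line check using the definitions above, not spelled out on the slide): since logit p = log[p/(1 − p)],
      h(x, y) = logit(e^(h)(x, y)) − logit(e₀(x)) = log OR(e^(h)(x, y), e₀(x)),
so the constraint 1/Γ ≤ OR(e^(h)(x, y), e₀(x)) ≤ Γ for all x, y holds exactly when ‖h‖_∞ ≤ log Γ = γ.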
Confidence interval I

◮ For simplicity, consider the "missing data" problem where Y = Y(1) is only observed if A = 1.
◮ Observe i.i.d. samples (A_i, X_i, A_i Y_i), i = 1, ..., n.
◮ The estimand is μ₀ = E₀[Y]; however, it is only partially identified under a simultaneous sensitivity model.

Goal 1 (Coverage of true parameter)
Construct a data-dependent interval [L, U] such that
      P₀( μ₀ ∈ [L, U] ) ≥ 1 − α
whenever e₀(X, Y) = P₀(A = 1 | X, Y) ∈ M(Γ).
Confidence interval II

◮ The inverse probability weighting (IPW) identity:
      E₀[Y] = E₀[ AY / e₀(X, Y) ], which under MAR equals E₀[ AY / e₀(X) ].
◮ Define
      μ^(h) = E₀[ AY / e^(h)(X, Y) ].
◮ Partially identified region: { μ^(h) : e^(h) ∈ M(Γ) }.

Goal 2 (Coverage of partially identified region)
Construct a data-dependent interval [L, U] such that
      P₀( { μ^(h) : e^(h) ∈ M(Γ) } ⊆ [L, U] ) ≥ 1 − α.

◮ Imbens and Manski (2004) have discussed the difference between these two goals.
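A short derivation of the IPW identity (not on the original slide), using the tower property and e₀(X, Y) = P₀(A = 1 | X, Y):
      E₀[ AY / e₀(X, Y) ] = E₀[ E₀[A | X, Y] · Y / e₀(X, Y) ] = E₀[Y],
and under MAR, e₀(X, Y) = e₀(X), which gives the second expression. Replacing e₀(X, Y) by a candidate e^(h) ∈ M(Γ) defines μ^(h), which recovers μ₀ only when e^(h) is the true selection probability.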
An intuitive idea: "The Union Method"

◮ Suppose for any h, we have a confidence interval [L^(h), U^(h)] such that
      liminf_{n→∞} P₀( μ^(h) ∈ [L^(h), U^(h)] ) ≥ 1 − α.
◮ Let L = inf_{‖h‖_∞ ≤ γ} L^(h) and U = sup_{‖h‖_∞ ≤ γ} U^(h), so [L, U] is the union interval.

Theorem
1. [L, U] satisfies Goal 1 asymptotically.
2. Furthermore, if the intervals are "congruent": ∃ α′ < α such that
      limsup_{n→∞} P₀( μ^(h) < L^(h) ) ≤ α′,   limsup_{n→∞} P₀( μ^(h) > U^(h) ) ≤ α − α′,
   then [L, U] satisfies Goal 2 asymptotically.
Practical challenge: How to take the union?

◮ Suppose ĝ(x) is an estimate of logit(e₀(x)).
◮ For a specific difference h, we can estimate e^(h)(x, y) by
      ê^(h)(x, y) = 1 / (1 + e^{h(x, y) − ĝ(x)}).
◮ This leads to a (stabilized) IPW estimate of μ^(h):
      μ̂^(h) = [ (1/n) Σ_{i=1}^n A_i / ê^(h)(X_i, Y_i) ]^{−1} · [ (1/n) Σ_{i=1}^n A_i Y_i / ê^(h)(X_i, Y_i) ].
◮ Under regularity conditions, Z-estimation theory tells us
      √n ( μ̂^(h) − μ^(h) ) →_d N(0, (σ^(h))²).
◮ Therefore we can use [L^(h), U^(h)] = μ̂^(h) ∓ z_{α/2} · σ̂^(h) / √n.
◮ However, computing the union interval requires solving a complicated optimization problem.
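As a concrete illustration, here is a minimal sketch (hypothetical names, not the authors' code) of the stabilized IPW estimate μ̂^(h) for one fixed shift function h, where g_hat holds the fitted logits ĝ(X_i) and h_vals holds the values h(X_i, Y_i):

import numpy as np

def sipw_estimate(A, Y, g_hat, h_vals):
    """Stabilized IPW estimate of mu^(h) for a fixed shift h.

    A      : 0/1 array of response indicators
    Y      : outcomes (only used where A == 1)
    g_hat  : fitted logits logit(e_hat(X_i)), one per unit
    h_vals : values h(X_i, Y_i) of the shift function, one per unit
    """
    treated = A == 1
    # shifted propensity: e^(h)(x, y) = 1 / (1 + exp(h(x, y) - g_hat(x)))
    e_h = 1.0 / (1.0 + np.exp(h_vals[treated] - g_hat[treated]))
    w = 1.0 / e_h                      # inverse probability weights
    return np.sum(w * Y[treated]) / np.sum(w)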
Bootstrapping sensitivity analysis

Point-identified parameter (Efron's bootstrap):
      Point estimator  ==(Bootstrap)==>  Confidence interval

Partially identified parameter (an analogy):
      Extrema estimator  ==(Optimization + Percentile bootstrap + Minimax inequality)==>  Confidence interval

A simple procedure for simultaneous sensitivity analysis
1. Generate B random resamples of the data. For each resample, compute the extrema of the IPW estimates under M_{β₀}(Γ).
2. Construct the confidence interval using L = Q_{α/2} of the B minima and U = Q_{1−α/2} of the B maxima.

Theorem
[L, U] achieves Goal 2 for M_{β₀}(Γ) asymptotically.
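A minimal sketch of this procedure (not the authors' code; names are hypothetical). It assumes a helper ipw_extrema(A, Y, g_hat, gamma) that returns the minimum and maximum of the stabilized IPW estimate over the sensitivity model — a sketch of such a helper follows the computation slide below — and, as one reasonable choice, it refits a logistic propensity model on each resample:

import numpy as np
from sklearn.linear_model import LogisticRegression

def bootstrap_sensitivity_ci(X, A, Y, gamma, B=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap interval [L, U] for the partially identified region."""
    rng = np.random.default_rng(seed)
    n = len(A)
    minima, maxima = np.empty(B), np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=n)            # resample with replacement
        Xb, Ab, Yb = X[idx], A[idx], Y[idx]
        # refit the parametric propensity model on the resample
        model = LogisticRegression().fit(Xb, Ab)
        g_hat = model.decision_function(Xb)          # fitted logits
        minima[b], maxima[b] = ipw_extrema(Ab, Yb, g_hat, gamma)
    L = np.quantile(minima, alpha / 2)
    U = np.quantile(maxima, 1 - alpha / 2)
    return L, U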
Proof of the Theorem

Partially identified parameter — three ideas:
      Extrema estimator  ==(Optimization + 1. Percentile bootstrap + 2. Minimax inequality)==>  Confidence interval

1. The sampling variability of μ̂^(h) can be captured by the bootstrap. The percentile bootstrap CI is given by
      [ Q_{α/2}( μ̂_b^(h) ), Q_{1−α/2}( μ̂_b^(h) ) ].
2. Generalized minimax inequality:
      Q_{α/2}( inf_h μ̂_b^(h) )  ≤  inf_h Q_{α/2}( μ̂_b^(h) )  ≤  sup_h Q_{1−α/2}( μ̂_b^(h) )  ≤  Q_{1−α/2}( sup_h μ̂_b^(h) ),
   that is, the percentile bootstrap CI of the extrema (the outer pair) contains the union CI (the inner pair).
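Why the left-most inequality holds (a brief argument not spelled out on the slide): for every fixed h′ we have inf_h μ̂_b^(h) ≤ μ̂_b^(h′), and quantiles are monotone, so
      Q_{α/2}( inf_h μ̂_b^(h) ) ≤ Q_{α/2}( μ̂_b^(h′) )   for every h′;
taking the infimum over h′ gives the left-most inequality, and the right-most one follows symmetrically. Hence the bootstrap interval built from the extrema always contains the union interval.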
Computation

Partially identified parameter — the third idea:
      Extrema estimator  ==(3. Optimization + Percentile bootstrap + Minimax inequality)==>  Confidence interval

3. Computing the extrema of μ̂^(h) is a linear fractional programming problem: letting z_i = e^{h(X_i, Y_i)}, we just need to solve
      max or min   [ Σ_{i=1}^n A_i Y_i (1 + z_i e^{−ĝ(X_i)}) ] / [ Σ_{i=1}^n A_i (1 + z_i e^{−ĝ(X_i)}) ]
      subject to   z_i ∈ [Γ^{−1}, Γ], i = 1, ..., n.
◮ This can be converted to a linear programming problem.
◮ Moreover, the solution z must have the same/opposite order as Y, so the time complexity can be reduced to O(n) (optimal).

The role of the bootstrap
Compared to the union method, the workflow is greatly simplified:
1. No need to derive σ^(h) analytically (though we could).
2. No need to optimize σ^(h) (which is very challenging).
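A minimal sketch of this computation (hypothetical names, not the authors' code): because the objective is a weighted average of the treated outcomes and each weight increases in z_i, the maximum is attained by a threshold rule that sets z_i = Γ for the largest outcomes and z_i = Γ^{−1} for the rest (and symmetrically for the minimum), so after sorting it suffices to sweep the possible thresholds with cumulative sums.

import numpy as np

def _max_weighted_mean(y, c, gamma):
    # Maximize sum((1 + z*c) * y) / sum(1 + z*c) over z_i in [1/gamma, gamma].
    # The maximizer is a threshold rule: z_i = gamma for the k largest y_i and
    # z_i = 1/gamma for the rest; sweep all k using prefix/suffix sums.
    order = np.argsort(-y)                       # sort outcomes in decreasing order
    y, c = y[order], c[order]
    w_hi, w_lo = 1.0 + gamma * c, 1.0 + c / gamma
    num_hi = np.concatenate(([0.0], np.cumsum(w_hi * y)))                 # prefix sums
    den_hi = np.concatenate(([0.0], np.cumsum(w_hi)))
    num_lo = np.concatenate(([0.0], np.cumsum((w_lo * y)[::-1])))[::-1]   # suffix sums
    den_lo = np.concatenate(([0.0], np.cumsum(w_lo[::-1])))[::-1]
    return np.max((num_hi + num_lo) / (den_hi + den_lo))

def ipw_extrema(A, Y, g_hat, gamma):
    # Minimum and maximum of the stabilized IPW estimate over the sensitivity model.
    treated = A == 1
    y = Y[treated]
    c = np.exp(-g_hat[treated])                 # exp(-g_hat) = (1 - e_hat) / e_hat
    upper = _max_weighted_mean(y, c, gamma)
    lower = -_max_weighted_mean(-y, c, gamma)   # minimizing = maximizing with y -> -y
    return lower, upper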