Controlling for confounders through approximate sufficiency
Rina Foygel Barber (joint with Lucas Janson)
http://www.stat.uchicago.edu/~rina/
Collaborator: Lucas Janson (Harvard U.)
Intro: testing conditional independence

Setting: confounders Z, response Y, features X. Question: is X associated with Y, after accounting for Z?

Classical (parametric) approach:
• Assume a parametric model such as Y | X, Z ∼ f(·; α⊤X + β⊤Z)
• Use parametric inference to test H0: α = 0

Model-X approach, a.k.a. the Conditional Randomization Test (Candès et al. 2018):
• The distribution of X | Z is known (the distribution of Y is unknown)
• Choose a function T(X; Y, Z) that measures association
• Resample copies X̃(1), …, X̃(M) i.i.d. ∼ (distribution of X | Z), and compute
      pval = (1 + Σ_m 1{T(X̃(m); Y, Z) ≥ T(X; Y, Z)}) / (1 + M)
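The resampling recipe above can be sketched in a few lines. This is a toy illustration, not from the talk: it assumes X | Z ∼ N(Zγ, 1) with γ known, uses absolute correlation as T, and generates Y depending on Z only (so H0 holds); the function and variable names are invented for the sketch.

```python
import numpy as np

def crt_pvalue(x, y, z, sample_x_given_z, T, M=200, rng=None):
    """Conditional randomization test: resample X from its known
    conditional distribution given Z, and rank the observed statistic."""
    rng = np.random.default_rng(rng)
    t_obs = T(x, y, z)
    count = sum(T(sample_x_given_z(z, rng), y, z) >= t_obs for _ in range(M))
    return (1 + count) / (1 + M)

# Assumed toy setting: X | Z ~ N(Z @ gamma, 1), with gamma known.
rng = np.random.default_rng(0)
n, k = 200, 3
gamma = np.ones(k)
z = rng.standard_normal((n, k))
x = z @ gamma + rng.standard_normal(n)
y = z.sum(axis=1) + rng.standard_normal(n)   # Y depends on Z only: H0 true

T = lambda x, y, z: abs(np.corrcoef(x, y)[0, 1])
sampler = lambda z, rng: z @ gamma + rng.standard_normal(len(z))
pval = crt_pvalue(x, y, z, sampler, T, M=200, rng=1)
```

Note the +1 in numerator and denominator: it makes the p-value exactly valid at finite M, since the observed statistic is treated as one more exchangeable draw.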
Intro: testing conditional independence

Model-X approach via sufficient statistics (Huang & Janson 2019):
• The distribution of X | Z is only partially known
• By conditioning on a sufficient statistic S(X, Z), can resample copies
  X̃(1), …, X̃(M) i.i.d. ∼ (distribution of X | S(X, Z))
  and compute a p-value for the test statistic T as before
• Example: canonical GLMs
  — X_i ∼ exp{X_i · Z_i⊤θ − a(Z_i⊤θ)}, i = 1, …, n, with θ unknown
  — S(X, Z) = Σ_i X_i Z_i is a sufficient statistic for X = (X_1, …, X_n)
Intro: testing goodness-of-fit (GoF)

More generally...

Goodness-of-fit test: testing H0: X ∼ P_θ for some θ ∈ Θ,
where {P_θ : θ ∈ Θ} is a parametric family.

Conditional independence testing can be a special case:
• Assume X | Z ∼ P_θ(· | Z) for some θ ∈ Θ
• Null hypothesis H0: X ⊥⊥ Y | Z
• Equivalently... H0: X | Y, Z ∼ P_θ(· | Z) for some θ ∈ Θ
• Note: we condition on Y and Z (i.e., treat them as fixed)
Intro: testing goodness-of-fit (GoF)

A general framework:
• Choose any test statistic T : X → R
• Draw copies X̃(1), …, X̃(M)
• Compute the rank-based p-value
      pval = (1 + Σ_m 1{T(X̃(m)) ≥ T(X)}) / (1 + M)
• If X, X̃(1), …, X̃(M) are exchangeable under H0, then the p-value is valid
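The exchangeability-implies-validity claim can be checked numerically. The helper below is a sketch (names invented): it computes the rank-based p-value, and the simulation draws the observed statistic and the copies from the same distribution, so the (M+1)-tuple is exchangeable and P{pval ≤ 0.05} should come out near 0.05.

```python
import numpy as np

def rank_pvalue(t_obs, t_copies):
    """Rank-based p-value: valid whenever the observed statistic and the
    copies' statistics are exchangeable under the null."""
    t_copies = np.asarray(t_copies)
    return (1 + np.sum(t_copies >= t_obs)) / (1 + len(t_copies))

# Sanity check under exchangeability, with M = 99 copies per trial.
rng = np.random.default_rng(0)
pvals = np.array([rank_pvalue(rng.standard_normal(),
                              rng.standard_normal(99))
                  for _ in range(2000)])
rejection_rate = np.mean(pvals <= 0.05)   # should be close to 0.05
```

With M = 99 the p-value takes values k/100, so the 0.05 level is hit exactly in expectation; for general M the test is slightly conservative at levels that are not multiples of 1/(M+1).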
Co-sufficient sampling (CSS)

Co-sufficient sampling: sample copies X̃(m) ∼ (distribution of X | S(X)),
where S(X) is a sufficient statistic for the family {P_θ : θ ∈ Θ}.

Can be applied to:
1. Test goodness-of-fit (GoF)
   (Engen & Lillegård 1997, Lockhart et al. 2007, Stephens 2012, Hazra 2013, ...)
2. Test conditional independence (special case of GoF)
   (Rosenbaum 1984, Kolassa 2003, Huang & Janson 2019)
3. Construct confidence intervals for a parameter of interest
   (by inverting GoF tests)
Co-sufficient sampling (CSS)

Permutation tests are an example of CSS:
• H0: X_1, …, X_n i.i.d. ∼ D, for D ∈ (some set)
• The order statistics X_(1) ≤ ··· ≤ X_(n) are sufficient under the null
• Permutation test ⇔ resampling X conditional on the order statistics
• Application: testing X ⊥⊥ Y, i.e.,
  H0: conditional on Y_1, …, Y_n, it holds that X_1, …, X_n are i.i.d.
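The permutation-test instance of CSS is easy to make concrete: resampling X conditional on its order statistics is exactly a uniformly random permutation of the observed X. A minimal sketch (function names invented; the dependent-data example is ours, chosen so the test should reject):

```python
import numpy as np

def permutation_pvalue(x, y, T, M=499, rng=None):
    """Permutation test as co-sufficient sampling: under H0 the X_i are
    i.i.d. given Y, so sampling X given its order statistics amounts to
    permuting the observed values uniformly at random."""
    rng = np.random.default_rng(rng)
    t_obs = T(x, y)
    count = sum(T(rng.permutation(x), y) >= t_obs for _ in range(M))
    return (1 + count) / (1 + M)

rng = np.random.default_rng(0)
x = rng.standard_normal(100)
y = x + 0.5 * rng.standard_normal(100)        # strong dependence: H0 false
T = lambda x, y: abs(np.corrcoef(x, y)[0, 1])
pval = permutation_pvalue(x, y, T, M=499, rng=1)   # should be small here
```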
Co-sufficient sampling (CSS)

Limitation of co-sufficient sampling: no power in many settings!

Example, logistic model:
• X = (X_1, …, X_n) ∈ {0, 1}^n, Z = (Z_1, …, Z_n) ∈ (R^k)^n
• If the Z_i's are in general position, then the sufficient statistic
  Σ_i X_i Z_i ∈ R^k uniquely determines X
  (so if we resample, we will have X̃(1) = ··· = X̃(M) = X, and hence zero power)
Co-sufficient sampling (CSS)

For many other models, the minimal sufficient statistic S(X) is essentially
the data itself, e.g.,
• Mixtures of Gaussians or mixtures of GLMs
• Non-canonical GLMs
• Heavy-tailed distributions (e.g., multivariate t)
• Models with missing or corrupted data
Approximate sufficiency

For a family {P_θ : θ ∈ Θ}, a function S(X) is a sufficient statistic if
    (distribution of X | S(X), under X ∼ P_θ)
        = (distribution of X | S(X), under X ∼ P_θ′)   ∀ θ, θ′.

Asymptotic sufficiency (Le Cam, Wald, ...), informally:
    (distribution of X | S(X), under X ∼ P_θ)
        ≈ (distribution of X | S(X), under X ∼ P_θ′)   ∀ θ, θ′.

• Under regularity conditions, S(X) = θ̂_MLE(X) is asymptotically sufficient
Approximate co-sufficient sampling (aCSS)

Main idea:
• Let θ̂ ∈ Θ be an approximate MLE given the data X
• Let p_θ(· | θ̂) = (distribution of X | θ̂, if marginally X ∼ P_θ)
  ⇒ under the null, X | θ̂ ∼ p_θ0(· | θ̂) for the unknown true θ0
• Sample copies X̃(1), …, X̃(M) from p_θ̂(· | θ̂), which ≈ p_θ0(· | θ̂)
  by approximate sufficiency
⇒ X, X̃(1), …, X̃(M) are ≈ exchangeable under H0 ⇒ the p-value is ≈ valid
Approximate co-sufficient sampling (aCSS)

Distance to exchangeability:
    d_exch(X, X̃(1), …, X̃(M))
        = inf over exchangeable distributions D on X^(M+1)
          of d_TV( (X, X̃(1), …, X̃(M)), D )

For any test statistic T(X), the p-value
    pval = (1 + Σ_m 1{T(X̃(m)) ≥ T(X)}) / (1 + M)
satisfies
    P{pval ≤ α} ≤ α + d_exch(X, X̃(1), …, X̃(M)).
aCSS algorithm

• Step 1: choose a test statistic T : X → R
• Step 2: observe data X, and compute an approximate MLE θ̂
• Step 3: sample copies X̃(1), …, X̃(M) from ≈ the distribution of X | θ̂
• Step 4: compute a rank-based p-value to test H0:
      pval = (1 + Σ_m 1{T(X̃(m)) ≥ T(X)}) / (1 + M)
aCSS algorithm

• Step 2: observe data X, and compute an approximate MLE θ̂

Ideally, would like to minimize the perturbed loss
    L(θ; X, W) = L(θ; X) + σ · W⊤θ,
where L(θ; X) = −log f(X; θ) + R(θ) is the penalized negative log-likelihood,
and σ · W⊤θ is a random perturbation with W ∼ N(0, (1/d)·I_d)
(choose σ ≪ n^(1/2)).
(See also Tian & Taylor 2018: random perturbation for selective inference.)

But... what if the loss is nonconvex? What if there is no global minimum?
— Use a function θ̂ : X × R^d → Θ, which returns a candidate θ̂(X, W).
— If θ̂(X, W) is a strict second-order stationary point (SOSP) of L(θ; X, W),
  proceed to the next step.
— Otherwise, return X̃(1) = ··· = X̃(M) = X, so that pval = 1.
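The perturbed-MLE step has a closed form in the simplest case. A minimal sketch, assuming a Gaussian-mean model (this example is ours, not from the talk): with L(θ; X) = Σ_i (x_i − θ)²/2 and d = 1, the perturbed objective L(θ; X) + σ·W·θ is convex and is minimized at θ̂ = x̄ − σW/n.

```python
import numpy as np

# Assumed toy model: X_1, ..., X_n ~ N(theta0, 1), no penalty R.
# Perturbed loss: sum_i (x_i - theta)^2 / 2 + sigma * W * theta,
# minimized in closed form at theta_hat = xbar - sigma * W / n.
rng = np.random.default_rng(0)
n, sigma = 500, 1.0                      # sigma << sqrt(n), as required
x = rng.standard_normal(n) + 2.0         # true theta0 = 2
W = rng.standard_normal()                # d = 1, so W ~ N(0, 1)
theta_hat = x.mean() - sigma * W / n     # strict SOSP of the perturbed loss

# Check stationarity: gradient of the perturbed loss at theta_hat
grad = -(x - theta_hat).sum() + sigma * W
```

Because σ/n is tiny relative to the sampling noise of x̄, the perturbation barely moves the estimate, yet it is what gives X | θ̂ a density (next slide) rather than a degenerate conditional distribution.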
aCSS algorithm

• Step 3: sample copies X̃(1), …, X̃(M) from ≈ the distribution of X | θ̂

Density of X | θ̂, conditional on the event that θ̂(X, W) is a strict SOSP:
    ∝ f(x; θ0) · exp{ −‖∇_θ L(θ̂; x)‖² / (2σ²/d) } · det(∇²_θ L(θ̂; x)) · 1{x ∈ X_θ̂}
(where X_θ̂ denotes the support of X | θ̂).

θ0 is unknown ⇒ use θ̂ as a plug-in estimate:
    ∝ f(x; θ̂) · exp{ −‖∇_θ L(θ̂; x)‖² / (2σ²/d) } · det(∇²_θ L(θ̂; x)) · 1{x ∈ X_θ̂}