Instrumental Variables, DeepIV, and Forbidden Regressions
Aaron Mishkin
UBC MLRG 2019W2
Introduction

Goal: Counterfactual reasoning in the presence of unknown confounders.

[Figure: participant flow diagram from the CONSORT 2010 statement [Schulz et al., 2010]; https://commons.wikimedia.org/w/index.php?curid=9841081]
Introduction: Motivation

Can we draw causal conclusions from observational data?
• Medical Trials: Is the new sunscreen I'm using effective?
  - Confounder: I live in my laboratory!
• Pricing: Should airlines increase ticket prices next December?
  - Confounder: NeurIPS 2019 was in Vancouver.
• Policy: Will unemployment continue to drop if the Federal Reserve keeps interest rates low?
  - Confounder: US shale oil production increases.

We cannot control for confounders in observational data!
Introduction: Graphical Model

[Graphical model: features X, confounder ε, policy P, response Y.]

We will use graphical models to represent our learning problem.
• X: observed features associated with a trial.
• ε: unobserved (possibly unknown) confounders.
• P: the policy variable we want to control.
• Y: the response we want to predict.
Introduction: Answering Causal Questions

[Graphical model: features X, confounder ε, policy P, response Y.]

• Causal Statements: Y is caused by P.
• Action Sentences: Y will happen if we do P.
• Counterfactuals: Given that (x, p, y) happened, how would Y have changed had we set P to a different value?
Introduction: Berkeley Gender Bias Study

S: Gender causes admission to UC Berkeley [Bickel et al., 1975].
A: Estimate a mapping g(p) from 1973 admissions records.

[Graphical model: Gender G → Admission A, via the unknown mapping g(G)?]

           Applications   Admitted
  Men          8442         44%
  Women        4321         35%
Introduction: Berkeley with a Controlled Trial

[Graphical models: in a controlled experiment, gender G and department D independently affect admission A; in the observational data, G also affects the choice of D.]

Simpson's Paradox: Controlling for the effects of D shows a "small but statistically significant bias in favor of women" [Bickel et al., 1975].
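To see how the paradox can arise mechanically, here is a minimal sketch with hypothetical department-level numbers (these are not the actual Berkeley figures): each department admits women at a higher rate, yet the aggregate rate favors men, because women disproportionately apply to the more selective department.

```python
# Hypothetical admissions data chosen to illustrate Simpson's paradox;
# these are NOT the actual Berkeley figures.
#                 (applicants, admitted)
dept_a = {"men": (800, 480), "women": (100, 70)}   # 60% vs 70% admitted
dept_b = {"men": (200, 20),  "women": (600, 120)}  # 10% vs 20% admitted

for gender in ("men", "women"):
    apps = dept_a[gender][0] + dept_b[gender][0]
    admits = dept_a[gender][1] + dept_b[gender][1]
    print(f"{gender}: dept A {dept_a[gender][1] / dept_a[gender][0]:.0%}, "
          f"dept B {dept_b[gender][1] / dept_b[gender][0]:.0%}, "
          f"aggregate {admits / apps:.0%}")

# men:   dept A 60%, dept B 10%, aggregate 50%
# women: dept A 70%, dept B 20%, aggregate 27%
```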
Part 1: "Intervention Graphs"
Intervention Graphs

The do(·) operator formalizes this transformation [Pearl, 2009].

[Graphical models: under observation, X and ε feed into P; under the intervention do(P = p_0), the arrows into P are cut and P is fixed to p_0.]

Intuition: the effect of forcing P = p_0 vs. its "natural" occurrence.
Intervention Graphs: Supervised vs Causal Learning

Setup:
• ε, η ∼ N(0, 1).
• P = p + 2ε.
• g_0(P) = max(P/5, P).
• Y = g_0(P) − 2ε + η.

[Graphical model: ε → P → g_0(P) → Y, with ε and η also feeding into Y.]

Can supervised learning recover g_0(P = p_0) from observations?

Synthetic example introduced by Bennett et al. [2019].
Intervention Graphs: Supervised Failure

[Plot: the true g_0, the estimate from a neural network trained by supervised learning, and the observed data; the supervised estimate is badly biased away from g_0.]

Supervised learning fails because it assumes P ⊥⊥ ε!

Taken from https://arxiv.org/abs/1905.12495
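A minimal simulation of this failure. The slides do not fully specify the data-generating process, so this sketch assumes the exogenous component p of the policy is N(0, 2²)-distributed and reads g_0(P) = max(P/5, P) off the slide above; both are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

def g0(t):
    return np.maximum(t / 5.0, t)  # assumed form of g_0 from the slide

p_exo = rng.normal(0.0, 2.0, n)    # assumed exogenous component of P
eps = rng.normal(0.0, 1.0, n)      # unobserved confounder
eta = rng.normal(0.0, 1.0, n)      # independent noise

P = p_exo + 2 * eps
Y = g0(P) - 2 * eps + eta

# Supervised learning targets E[Y | P]; approximate it at p_0 = -4 by a
# local average. Since E[eps | P] = P/4 in this setup, it converges to
# g_0(p_0) - 2 * E[eps | P = p_0] = -0.8 + 2 = 1.2, not g_0(p_0) = -0.8.
p0 = -4.0
window = np.abs(P - p0) < 0.1
print("supervised E[Y | P = p_0]: ", Y[window].mean())  # ~ 1.2
print("causal E[Y | do(P = p_0)]: ", g0(p0))            # = -0.8
```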
Intervention Graphs: Supervised vs Causal Learning

[Graphical models: under observation, ε feeds into both P and Y; under do(P), the arrow from ε to P is cut.]

Given a dataset D = {(p_i, y_i)}_{i=1}^n:
• Supervised learning estimates the conditional
  E[Y | P] = g_0(P) − 2 E[ε | P].
• Causal learning estimates the interventional conditional
  E[Y | do(P)] = g_0(P) − 2 E[ε] = g_0(P), since E[ε] = 0.
Intervention Graphs: Known Confounders

[Graphical models: observations (X_i, P_i, Y_i), i ∈ [n], sharing a single confounder ε; intervention graph with do(P = p_0).]

What if
1. all confounders are known and collected in ε;
2. ε persists across observations;
3. the mapping Y = f(X, P, ε) is known and persists?
Intervention Graphs: Inference

[Graphical models: same setup as the previous slide.]

Steps to inference:
1. Abduction: compute the posterior P(ε | {(x_i, p_i, y_i)}_{i=1}^n).
2. Action: form the subgraph corresponding to do(P = p_0).
3. Prediction: compute P(Y | do(P = p_0), {(x_i, p_i, y_i)}_{i=1}^n).
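A toy rendering of these three steps with a single global confounder. The mechanism f, the noise scale, and all constants below are hypothetical, chosen only to make the abduction-action-prediction loop concrete.

```python
import numpy as np

rng = np.random.default_rng(1)

def f(p, eps):
    return p + eps  # assumed known structural mechanism (illustrative)

true_eps = 1.5  # shared, unobserved confounder
P_obs = rng.normal(size=20)
Y_obs = f(P_obs, true_eps) + 0.1 * rng.normal(size=20)

# 1. Abduction: grid posterior over eps given all observations.
grid = np.linspace(-5.0, 5.0, 1001)
log_lik = -0.5 * ((Y_obs[:, None] - f(P_obs[:, None], grid)) / 0.1) ** 2
log_post = log_lik.sum(axis=0)
post = np.exp(log_post - log_post.max())
post /= post.sum()

# 2. Action: sever the arrows into P and fix do(P = p_0).
p0 = 2.0

# 3. Prediction: E[Y | do(P = p_0), data] = sum_eps f(p_0, eps) post(eps).
print("E[Y | do(P = p_0)]:", (f(p0, grid) * post).sum())  # ~ 3.5
```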
Intervention Graphs: Limitations

These assumptions are restrictive:
• identifying all confounders is hard;
• assuming all confounders are "global" is unrealistic;
• characterizing Y = f(X, P, ε) requires expert knowledge.

What we really want is to
• allow any number and kind of confounders;
• allow confounders to be "local";
• learn f(X, P, ε) from data!
Part 2: Instrumental Variables
Instrumental Variables

". . . the drawing of inferences from studies in which subjects have the final choice of program; the randomization is confined to an indirect instrument (or assignment) that merely encourages or discourages participation in the various programs."
— Pearl [2009]
IV: Expanded Model

[Graphical model: instrument Z → policy P; features X and confounder ε feed into both P and the response Y; P → Y.]

We augment our model with an instrumental variable Z that
• affects the distribution of P;
• only affects Y through P;
• is conditionally independent of ε.
IV: Air Travel Example

[Graphical model: fuel price F is an instrument for ticket price P; conference attendance C and income I are confounders.]

Intuition: "[F is] as good as randomization for the purposes of causal inference" — Hartford et al. [2017].
IV: Formally

Goal: counterfactual predictions of the form
  E[Y | X, do(P = p_0)] − E[Y | X, do(P = p_1)].

Let's make the following assumptions:
1. the additive-noise model Y = g(P, X) + ε;
2. the following conditions on the IV:
   2.1 Relevance: p(P | X, Z) is not constant in Z.
   2.2 Exclusion: Z ⊥⊥ Y | P, X, ε.
   2.3 Unconfounded Instrument: Z ⊥⊥ ε | X.
IV: Model Learning Part 1

[Graphical model: intervention graph with Y = g(P, X) + ε.]

Under the do operator:
  E[Y | X, do(P = p_0)] − E[Y | X, do(P = p_1)]
    = g(p_0, X) − g(p_1, X) + (E[ε | X] − E[ε | X])
    = g(p_0, X) − g(p_1, X).

The E[ε | X] terms cancel in differences, so we only need to estimate h(P, X) = g(P, X) + E[ε | X]!
IV: Model Learning Part 2

Want: h(P, X) = g(P, X) + E[ε | X].
Approach: marginalize out the confounded policy P:

  E[Y | X, Z] = ∫ (g(p, X) + E[ε | p, X]) dp(p | X, Z)
              = ∫ (g(p, X) + E[ε | X]) dp(p | X, Z)
              = ∫ h(p, X) dp(p | X, Z).

Key Trick: marginalizing E[ε | P, X] over p(P | X, Z) gives E[ε | X, Z] = E[ε | X], since the instrument is unconfounded.
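A quick Monte Carlo check of this identity on the earlier synthetic example, assuming (the slides do not specify this) that the exogenous component of P acts as an instrument Z, so that P = Z + 2ε; with no features X, we have h = g_0 because E[ε] = 0.

```python
import numpy as np

rng = np.random.default_rng(2)

def g0(t):
    return np.maximum(t / 5.0, t)  # assumed g_0 from the earlier slide

z = 1.0                      # condition on a fixed instrument value
eps = rng.normal(size=1_000_000)
eta = rng.normal(size=1_000_000)
P = z + 2 * eps              # assumed first-stage relation P = Z + 2*eps
Y = g0(P) - 2 * eps + eta

# Both sides of E[Y | X, Z] = int h(p, X) dp(p | X, Z), with h = g_0 here;
# the two Monte Carlo averages should agree up to sampling noise.
print("E[Y | Z = z]:       ", Y.mean())
print("int h(p) dp(p | z): ", g0(P).mean())
```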
IV: Two-Stage Methods

Objective: (1/n) Σ_{i=1}^n L(y_i, ∫ h(p, x_i) dp(p | x_i, z_i)).

Two-stage methods:
1. Estimate Density: learn p̂(P | X, Z) from D = {(p_i, x_i, z_i)}_{i=1}^n.
2. Estimate Function: learn ĥ(P, X) from D̄ = {(y_i, x_i, z_i)}_{i=1}^n.
3. Evaluate: counterfactual reasoning via ĥ(p_0, x) − ĥ(p_1, x).
IV: Two-Stage Least-Squares

Classic Approach: two-stage least-squares (2SLS). Assume linear models
  Y = w_0^T P + w_1^T X + ε,
  P = A_0 X + A_1 Z + v, with E[v | X, Z] = 0, so E[P | X, Z] = A_0 X + A_1 Z.

Then we have the following:
  E[Y | X, Z] = ∫ h(p, X) dp(p | X, Z)
              = ∫ (w_0^T p + w_1^T X) dp(p | X, Z)
              = w_1^T X + w_0^T ∫ p dp(p | X, Z)
              = w_1^T X + w_0^T (A_0 X + A_1 Z).

No need for density estimation! See Angrist and Pischke [2008].
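A self-contained numerical sketch of 2SLS on synthetic data (no features X, and all coefficients below are made up for illustration): ordinary least squares is biased by the confounder, while the two-stage estimate recovers the true causal coefficient.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10_000

Z = rng.normal(size=n)                           # instrument
eps = rng.normal(size=n)                         # confounder
P = 0.8 * Z + eps                                # policy, confounded with eps
Y = 2.0 * P - 3.0 * eps + rng.normal(size=n)     # true causal effect: 2.0

ones = np.ones(n)

# Naive OLS of Y on P is biased: ~ 2 - 3 * Cov(P, eps) / Var(P) ~ 0.17.
ols = np.linalg.lstsq(np.c_[P, ones], Y, rcond=None)[0]

# Stage 1: regress P on Z; keep the fitted values P_hat.
A = np.linalg.lstsq(np.c_[Z, ones], P, rcond=None)[0]
P_hat = np.c_[Z, ones] @ A

# Stage 2: regress Y on P_hat; recovers the causal coefficient ~ 2.0.
w = np.linalg.lstsq(np.c_[P_hat, ones], Y, rcond=None)[0]

print("OLS estimate: ", ols[0])
print("2SLS estimate:", w[0])
```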
Part 3: Deep IV
Deep IV: Problems with 2SLS

Problem: Linear models aren't very expressive.
• What if we want to do causal inference with time series?

[Figure: economic time series from FRED.]
Federal Reserve Economic Data, Federal Reserve Bank of St. Louis. https://fred.stlouisfed.org/
Deep IV: Problems with 2SLS

Problem: Linear models aren't very expressive.
• How about complex image data?

[Figure: scene-understanding example from the linked blog post.]
https://alexgkendall.com/computer_vision/bayesian_deep_learning_for_safe_ai/
Deep IV: Approach

Remember our objective function:
  Objective: (1/n) Σ_{i=1}^n L(y_i, ∫ h(p, x_i) dp(p | x_i, z_i)).

Deep IV: a two-stage method using deep neural networks.
1. Treatment Network: estimate p̂(P | φ(X, Z)).
   - Categorical P: softmax with your favourite architecture.
   - Continuous P: autoregressive models (MADE, RNADE, etc.), normalizing flows (MAF, IAF, etc.), and so on.
2. Outcome Network: fit your favourite architecture ĥ_θ(P, X) ≈ h(P, X).
A sketch of the two networks follows below.

Autoregressive models: [Germain et al., 2015, Uria et al., 2013]; normalizing flows: [Rezende and Mohamed, 2015, Papamakarios et al., 2017, Kingma et al., 2016].
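A minimal sketch of the two networks (not the authors' implementation), for a categorical policy with K levels; the architecture sizes and all names here are arbitrary choices.

```python
import torch
import torch.nn as nn

K, d_x, d_z = 10, 5, 3  # hypothetical policy levels and feature dimensions

# Treatment network: models p(P | phi(X, Z)) with a softmax over K levels.
treatment = nn.Sequential(
    nn.Linear(d_x + d_z, 64), nn.ReLU(),
    nn.Linear(64, K),
)

# Outcome network: models h_theta(P, X); P is fed in one-hot encoded.
outcome = nn.Sequential(
    nn.Linear(K + d_x, 64), nn.ReLU(),
    nn.Linear(64, 1),
)

x, z = torch.randn(32, d_x), torch.randn(32, d_z)
logits = treatment(torch.cat([x, z], dim=-1))        # stage-1 distribution
p = torch.distributions.Categorical(logits=logits).sample()
p_onehot = nn.functional.one_hot(p, K).float()
y_hat = outcome(torch.cat([p_onehot, x], dim=-1))    # stage-2 prediction
```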
Deep IV: Training Deep IV Models

1. Treatment Network: "easy" via maximum likelihood:
   φ* = arg max_φ Σ_{i=1}^n log p̂(p_i | φ(x_i, z_i)).

2. Outcome Network: Monte Carlo approximation of the loss:
   L(θ) = (1/n) Σ_{i=1}^n L(y_i, ∫ ĥ_θ(p, x_i) dp̂(p | φ(x_i, z_i)))
        ≈ (1/n) Σ_{i=1}^n L(y_i, (1/m) Σ_{j=1}^m ĥ_θ(p_j, x_i)) =: L̂(θ),
   where p_j ∼ p̂(P | φ(x_i, z_i)).
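Continuing the toy networks from the previous sketch, a hedged rendering of the Monte Carlo outcome loss with squared error; the sample count m and all names are arbitrary.

```python
# Monte Carlo outcome loss L_hat(theta), reusing `treatment`, `outcome`,
# and K from the sketch above. Squared-error loss, m samples per example.
def outcome_loss(x, z, y, m=8):
    logits = treatment(torch.cat([x, z], dim=-1)).detach()  # stage 1 frozen
    dist = torch.distributions.Categorical(logits=logits)
    h_bar = 0.0
    for _ in range(m):  # (1/m) sum_j h_theta(p_j, x), p_j ~ p_hat(P | x, z)
        p_j = nn.functional.one_hot(dist.sample(), K).float()
        h_bar = h_bar + outcome(torch.cat([p_j, x], dim=-1)) / m
    # NB: gradients of this plug-in estimate are biased (next slide).
    return ((y - h_bar) ** 2).mean()

# Example usage with the batch from the previous sketch:
loss = outcome_loss(x, z, torch.randn(32, 1))
loss.backward()
```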
Deep IV: Biased and Unbiased Gradients

When L(y, ŷ) = (y − ŷ)²:
  L(θ) = (1/n) Σ_{i=1}^n (y_i − ∫ ĥ_θ(p, x_i) dp̂(p | z_i))².

If we use a single set of samples to estimate E_p̂[ĥ_θ(P, x_i)], we are really differentiating
  (1/n) Σ_{i=1}^n E_p̂[(y_i − ĥ_θ(P, x_i))²] ≥ (1/n) Σ_{i=1}^n (y_i − E_p̂[ĥ_θ(P, x_i)])² = L(θ),
an upper bound by Jensen's inequality. The resulting gradient estimate,
  −(2/n) Σ_{i=1}^n E_p̂[(y_i − ĥ_θ(P, x_i)) ∇_θ ĥ_θ(P, x_i)],
is therefore biased for
  ∇_θ L(θ) = −(2/n) Σ_{i=1}^n (y_i − E_p̂[ĥ_θ(P, x_i)]) ∇_θ E_p̂[ĥ_θ(P, x_i)].
Fix: estimate the two expectations with independent sets of samples.
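A sketch of the independent-samples fix, again reusing the toy networks above. The returned value is a surrogate: its gradient, not its value, matches −2(y − E[ĥ]) ∇_θ E[ĥ] in expectation.

```python
# Surrogate loss whose gradient is an unbiased estimate of grad L(theta).
def outcome_loss_unbiased(x, z, y, m=8):
    logits = treatment(torch.cat([x, z], dim=-1)).detach()
    dist = torch.distributions.Categorical(logits=logits)

    def h_bar():  # one Monte Carlo estimate of E[h_theta(P, x)]
        s = 0.0
        for _ in range(m):
            p_j = nn.functional.one_hot(dist.sample(), K).float()
            s = s + outcome(torch.cat([p_j, x], dim=-1)) / m
        return s

    # Two independent estimates: one (detached) for the residual factor,
    # one for the gradient factor, so their product is unbiased.
    return (-2.0 * (y - h_bar().detach()) * h_bar()).mean()
```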
Part 4: Experimental Results and Forbidden Techniques