
Gumbel-Max Structural Causal Models, by Michael Oberst and David Sontag



  1. Counterfactual Off-Policy Evaluation with Gumbel-Max Structural Causal Models. Michael Oberst (MIT, @MichaelOberst) and David Sontag (MIT).

  2. Motivation: Building trust in RL policies. ► Goal: Apply reinforcement learning in high-risk settings (e.g., healthcare). ► Problem: How to safely evaluate a policy? There is no simulator, and off-policy evaluation can fail due to confounding, small sample sizes, and poorly specified rewards. ► We could try to interpret the policy directly, but if that is not possible, what can we do?

  3. Motivation: Building trust in RL policies. Suppose we are given: • a Markov Decision Process (MDP), $P(S', R \mid S, A)$, where $S$ is the current state, $A$ the action, $R$ the reward, and $S'$ the next state; • a policy $\pi(A \mid S)$ (e.g., learned using the MDP). [Diagram: observational data, the MDP, and the policy, with a "?" annotation.]

  4. Using counterfactuals to “sanity check”. [Figure: observed patient trajectory along a time axis. States $S$: patient has infection; drug reaction; significant agitation. Actions $A$: $A_1$ antibiotics; $A_2$ mechanical ventilation; $A_3$ sedation.]

  5. Using counterfactuals to “sanity check”. If the new policy had been applied to this patient… [Figure: observed patient trajectory along a time axis. States $S$: patient has infection; drug reaction; significant agitation. Actions $A$: $A_1$ antibiotics; $A_2$ mechanical ventilation; $A_3$ sedation.]

  6. Using counterfactuals to “sanity check”. If the new policy had been applied to this patient… [Figure: observed trajectory (states: patient has infection; drug reaction; significant agitation; actions: $A_1$ antibiotics, $A_2$ mechanical ventilation, $A_3$ sedation) alongside the start of a counterfactual trajectory: $S_0$ patient has infection; $A_1$ antibiotics.]

  7. Using counterfactuals to “sanity check”. If the new policy had been applied to this patient… [Figure: observed trajectory (states: patient has infection; drug reaction; significant agitation; actions: $A_1$ antibiotics, $A_2$ mechanical ventilation, $A_3$ sedation) alongside a counterfactual trajectory: $S_0$ patient has infection; $A_1$ antibiotics; $S_1$ infection cleared.]

  8. Using counterfactuals to “sanity check”. If the new policy had been applied to this patient… Model-based rollout: not a fair comparison. [Figure: observed trajectory (states: patient has infection; drug reaction; significant agitation; actions: $A_1$ antibiotics, $A_2$ mechanical ventilation, $A_3$ sedation) alongside a model-based rollout: $S_0$ patient has infection; $A_1$ antibiotics; $S_1$ infection cleared.]

  9. Using counterfactuals to “sanity check”. If the new policy had been applied to this patient… [Figure: observed trajectory (states: patient has infection; drug reaction; significant agitation; actions: $A_1$ antibiotics, $A_2$ mechanical ventilation, $A_3$ sedation) alongside a counterfactual trajectory: $S_0$ patient has infection; $A_1$ antibiotics; $S_1$ left blank.]

  10. Using counterfactuals to “sanity check”. If the new policy had been applied to this patient… Counterfactual influenced by actual outcome. [Figure: observed trajectory (states: patient has infection; drug reaction; significant agitation; actions: $A_1$ antibiotics, $A_2$ mechanical ventilation, $A_3$ sedation) alongside a counterfactual trajectory: $S_0$ patient has infection; $A_1$ antibiotics; $S_1$ drug reaction.]

  11. Using counterfactuals to “sanity check”. If the new policy had been applied to this patient… [Figure: observed trajectory (states: patient has infection; drug reaction; significant agitation; actions: $A_1$ antibiotics, $A_2$ mechanical ventilation, $A_3$ sedation) alongside a counterfactual trajectory. Counterfactual states: $S_0$ patient has infection; $S_1$ drug reaction; $S_2$ patient recovers. Counterfactual actions: $A_1$ antibiotics; $A_2$ no action; $A_3$ discharge.]

  12. Using counterfactuals to “sanity check”. If the new policy had been applied to this patient… [Figure: observed trajectory (states: patient has infection; drug reaction; significant agitation; actions: $A_1$ antibiotics, $A_2$ mechanical ventilation, $A_3$ sedation) alongside a counterfactual trajectory. Counterfactual states: $S_0$ patient has infection; $S_1$ drug reaction; $S_2$ patient recovers. Counterfactual actions: $A_1$ antibiotics; $A_2$ no action; $A_3$ discharge.] Idea: If the counterfactual trajectory is unreasonable given the full context of the patient, the model / policy may be flawed.

  13. Using counterfactuals to “sanity check”. Approach: (1) Decomposition of reward over real episodes, to identify interesting cases. See paper / poster for a synthetic case study motivated by sepsis management.

  14. Using counterfactuals to “sanity check”. Approach: (1) Decomposition of reward over real episodes, to identify interesting cases. [Figure: Example.] See paper / poster for a synthetic case study motivated by sepsis management.

  15. Using counterfactuals to “sanity check”. Approach: (1) Decomposition of reward over real episodes, to identify interesting cases. (2) Examine counterfactual trajectories under the new policy. (3) Validate and/or criticize conclusions, using full patient information (e.g., chart review). [Figure: Example.] See paper / poster for a synthetic case study motivated by sepsis management.

  16. Simulating counterfactual trajectories. What we need: (1) observed trajectories; (2) a policy to evaluate, $\pi(A \mid S)$; (3) a model of discrete dynamics, e.g., a Markov Decision Process. [Diagram: nodes $S$ (current state), $A$ (action), $S'$ (next state).]

  17. Simulating counterfactual trajectories. What we need: (1) observed trajectories; (2) a policy to evaluate, $\pi(A \mid S)$; (3) a model of discrete dynamics, e.g., a Markov Decision Process. Structural Causal Model (SCM): $S' = f(S, A, U_{s'})$, with $U_{s'} \sim P(U_{s'})$. [Diagrams: MDP graph with nodes $S$ (current state), $A$ (action), $S'$ (next state); SCM graph with the additional noise node $U_{s'}$.]

  18. Simulating counterfactual trajectories. What we need: (1) observed trajectories; (2) a policy to evaluate, $\pi(A \mid S)$; (3) a model of discrete dynamics, e.g., a Markov Decision Process. Structural Causal Model (SCM): $S' = f(S, A, U_{s'})$, with $U_{s'} \sim P(U_{s'})$. Problem: The choice of SCM is not identifiable from data! [Diagrams: MDP graph with nodes $S$ (current state), $A$ (action), $S'$ (next state); SCM graph with the additional noise node $U_{s'}$.]
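The SCM form above, $S' = f(S, A, U_{s'})$, can be sketched for a tabular MDP as a deterministic function of an exogenous noise draw. The sketch below uses inverse-CDF sampling as one possible choice of $f$. That choice, the transition table `P`, and all function names are illustrative assumptions, not taken from the talk.

```python
import numpy as np

# Hypothetical tabular dynamics: P[s, a, s_next] = P(S' = s_next | S = s, A = a).
P = np.array([
    [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]],
    [[0.3, 0.3, 0.4], [0.5, 0.25, 0.25]],
])

def f_inverse_cdf(s, a, u):
    """One possible mechanism f(S, A, U): invert the CDF of P(. | s, a) at u in [0, 1)."""
    cdf = np.cumsum(P[s, a])
    # The min(...) guards against floating-point round-off in the last CDF entry.
    return int(min(np.searchsorted(cdf, u, side="right"), len(cdf) - 1))

def sample_next_state(s, a, rng):
    """Draw the exogenous noise U ~ Uniform(0, 1), then apply the deterministic mechanism."""
    u = rng.uniform()
    return f_inverse_cdf(s, a, u)

rng = np.random.default_rng(0)
draws = [sample_next_state(0, 1, rng) for _ in range(20000)]
freq = np.bincount(draws, minlength=3) / len(draws)
print(freq)  # empirical frequencies, to compare against P[0, 1]
```

Any mechanism that reproduces `P` in this way is consistent with the observed dynamics, which is why the data alone cannot pick out a single SCM.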

  19. So, what should we use for the structural causal model (SCM)? Key challenge: Non-identifiability. There are multiple SCMs consistent with $P(S' \mid S, A)$ but with different counterfactual distributions. For binary variables, assuming the property of monotonicity (Pearl, 2000) is sufficient to identify the counterfactual distribution. But most real-world MDPs have non-binary states!
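A small binary example can make the non-identifiability point concrete. The two mechanisms below are hypothetical and chosen only for illustration: both give the same distribution over the next state under either action, yet they disagree completely about the counterfactual.

```python
# Exogenous noise U ~ Bernoulli(0.5), shared by both hypothetical SCMs.
P_U = {0: 0.5, 1: 0.5}

def scm_ignore(a, u):
    """Hypothetical SCM 1: the next state copies the noise and ignores the action."""
    return u

def scm_xor(a, u):
    """Hypothetical SCM 2: the next state is the noise XOR the action."""
    return u ^ a

def interventional(f, a):
    """P(S' = 1 | do(A = a)), computed by summing over the noise."""
    return sum(p for u, p in P_U.items() if f(a, u) == 1)

def counterfactual(f, a_obs, s_obs, a_cf):
    """P(S'_{a_cf} = 1 | A = a_obs, S' = s_obs): condition the noise, then change the action."""
    posterior = {u: p for u, p in P_U.items() if f(a_obs, u) == s_obs}
    z = sum(posterior.values())
    return sum(p for u, p in posterior.items() if f(a_cf, u) == 1) / z

for f in (scm_ignore, scm_xor):
    print(f.__name__,
          [interventional(f, a) for a in (0, 1)],
          counterfactual(f, a_obs=0, s_obs=1, a_cf=1))
```

Both mechanisms report an interventional probability of 0.5 for each action, but after observing $A = 0$ and $S' = 1$, the first says the outcome under $A = 1$ would still be 1 and the second says it would be 0.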

  20. So, what should we use for the structural causal model (SCM)? Key challenge: Non-identifiability. There are multiple SCMs consistent with $P(S' \mid S, A)$ but with different counterfactual distributions. For binary variables, assuming the property of monotonicity (Pearl, 2000) is sufficient to identify the counterfactual distribution. But most real-world MDPs have non-binary states! Theorem 1 (informal): The (newly defined) property of counterfactual stability generalizes monotonicity to categorical variables.

  21. So, what should we use for the structural causal model (SCM)? Key challenge: Non-identifiability. There are multiple SCMs consistent with $P(S' \mid S, A)$ but with different counterfactual distributions. For binary variables, assuming the property of monotonicity (Pearl, 2000) is sufficient to identify the counterfactual distribution. But most real-world MDPs have non-binary states! Theorem 1 (informal): The (newly defined) property of counterfactual stability generalizes monotonicity to categorical variables. Gumbel-Max SCM: Use the Gumbel-Max trick to sample from a categorical distribution with $k$ categories: $g_j \sim \mathrm{Gumbel}$, $S' = \arg\max_j \{ \log P(S' = j \mid S, A) + g_j \}$. Theorem 2: The Gumbel-Max SCM satisfies the counterfactual stability condition.
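The Gumbel-Max mechanism above can be sketched in a few lines of NumPy. The transition probabilities are made up for illustration, and the posterior over the Gumbel noise is drawn by simple rejection sampling, which is an assumption made here for clarity rather than a description of the authors' implementation.

```python
import numpy as np

def gumbel_max_sample(probs, g):
    """S' = argmax_j { log P(S' = j | S, A) + g_j } for a given Gumbel noise vector g."""
    return int(np.argmax(np.log(probs) + g))

def posterior_gumbels(probs_obs, s_next_obs, rng, max_tries=10_000):
    """Draw standard Gumbel noise conditioned on the observed outcome (rejection sampling)."""
    for _ in range(max_tries):
        g = rng.gumbel(size=len(probs_obs))
        if gumbel_max_sample(probs_obs, g) == s_next_obs:
            return g
    raise RuntimeError("observed outcome too unlikely for rejection sampling")

rng = np.random.default_rng(0)

# Hypothetical transition probabilities for one state under two different actions.
p_observed_action = np.array([0.2, 0.5, 0.3])
p_counterfactual_action = np.array([0.1, 0.7, 0.2])

# Forward sampling: fresh noise each time reproduces the categorical distribution.
draws = [gumbel_max_sample(p_observed_action, rng.gumbel(size=3)) for _ in range(20000)]
forward_freq = np.bincount(draws, minlength=3) / len(draws)

# Counterfactual: keep noise consistent with the observed S' = 1, then swap the action.
s_next_obs = 1
cf_draws = [
    gumbel_max_sample(
        p_counterfactual_action,
        posterior_gumbels(p_observed_action, s_next_obs, rng),
    )
    for _ in range(5000)
]
cf_freq = np.bincount(cf_draws, minlength=3) / len(cf_draws)
print(forward_freq, cf_freq)
```

In this example the counterfactual action raises the probability of the observed outcome by a larger factor than it does for any other outcome, so every counterfactual draw keeps $S' = 1$. That is the kind of behavior the counterfactual stability condition asks for.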

  22. Thank you! Come to our poster for more details: Pacific Ballroom #72
