Counterfactual Off-Policy Evaluation with Gumbel-Max Structural Causal Models
Michael Oberst (MIT), David Sontag (MIT)
@MichaelOberst
Motivation: Building trust in RL policies
► Goal: Apply reinforcement learning in high-risk settings (e.g., healthcare)
► Problem: How to safely evaluate a policy? There is no simulator, and off-policy evaluation can fail due to:
  ► Confounding
  ► Small sample sizes
  ► Poorly specified rewards
► Could try to interpret the policy directly, but if that is not possible, what can we do?
Motivation: Building trust in RL policies
Suppose we are given:
• Markov Decision Process (MDP): $P(S', R \mid S, A)$
• Policy (e.g., learned using MDP): $\pi(A \mid S)$
Notation: $S$: Current State, $A$: Action, $R$: Reward, $S'$: Next State
[Diagram labels: Observational Data, Markov Decision Process (MDP), Policy, "?"]
Using counterfactuals to “sanity check”
Observed trajectory over time ($S$: State, $A$: Action):
► States: patient has infection, then drug reaction, then significant agitation
► Actions: $A_1$ = Antibiotics, $A_2$ = Mechanical Ventilation, $A_3$ = Sedation
If the new policy had been applied to this patient…
► Model-based rollout (not a fair comparison): $S_0$ = patient has infection, $A_1$ = Antibiotics, $S_1$ = infection cleared
► Counterfactual (influenced by actual outcome): $S_0$ = patient has infection, $A_1$ = Antibiotics, $S_1$ = drug reaction, $A_2$ = No action, $S_2$ = patient recovers, $A_3$ = Discharge
Idea: If the counterfactual trajectory is unreasonable given the full context of the patient, the model / policy may be flawed
Using counterfactuals to “sanity check”
Approach
1. Decomposition of reward over real episodes, to identify interesting cases
2. Examine counterfactual trajectories under new policy
3. Validate and/or criticize conclusions, using full patient information (e.g., chart review)
See paper / poster for synthetic case study motivated by sepsis management
Simulating counterfactual trajectories
What we need
1. Observed trajectories
2. Policy to evaluate: $\pi(A \mid S)$
3. Model of discrete dynamics, e.g., Markov Decision Process ($S$: Current State, $A$: Action, $S'$: Next State)
Structural Causal Model (SCM): the MDP plus an exogenous noise variable $U_{s'}$
$S' = f(S, A, U_{s'})$
$U_{s'} \sim P(U_{s'})$
Problem: Choice of SCM is not identifiable from data!
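The non-identifiability problem can be made concrete with a toy example. The sketch below is not from the talk: it assumes a single fixed state, a binary action, a binary next state, and a fair-coin exogenous variable, and the function names (`f_copy`, `f_xor`, `interventional`, `counterfactual`) are hypothetical. Both mechanisms give the same distribution over the next state for every action, yet they disagree completely on the counterfactual.

```python
from fractions import Fraction

# Assumed toy setup: exogenous noise U is a fair coin, the state S is held
# fixed (and omitted), action A and next state S' are binary.
U_VALUES = [0, 1]
P_U = {0: Fraction(1, 2), 1: Fraction(1, 2)}


def f_copy(a, u):
    """SCM 1 (hypothetical): next state ignores the action, S' = U."""
    return u


def f_xor(a, u):
    """SCM 2 (hypothetical): the action flips the noise, S' = U xor A."""
    return u ^ a


def interventional(f, a):
    """P(S' = 1 | do(A = a)), computed by enumerating the noise U."""
    return sum(P_U[u] for u in U_VALUES if f(a, u) == 1)


def counterfactual(f, a_obs, s_obs, a_cf):
    """P(S'_cf = 1 | observed A = a_obs, S' = s_obs), had A been a_cf.

    Abduction: keep only noise values consistent with the observation.
    Action + prediction: push those noise values through f with a_cf.
    """
    consistent = [u for u in U_VALUES if f(a_obs, u) == s_obs]
    total = sum(P_U[u] for u in consistent)
    return sum(P_U[u] for u in consistent if f(a_cf, u) == 1) / total


if __name__ == "__main__":
    for a in (0, 1):
        # Identical interventional distributions under both mechanisms.
        print(a, interventional(f_copy, a), interventional(f_xor, a))
    # Observed A = 0 and S' = 1; what if A had been 1?
    print(counterfactual(f_copy, 0, 1, 1), counterfactual(f_xor, 0, 1, 1))
```

Because data only reveal the interventional distribution, nothing observed can distinguish the two mechanisms, which is why an extra assumption on the SCM is needed.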
So, what should we use for the structural causal model (SCM)?
Key challenge: Non-identifiability
► There are multiple SCMs consistent with $P(S' \mid S, A)$ but with different counterfactual distributions
► For binary variables, assuming the property of monotonicity (Pearl, 2000) is sufficient to identify the counterfactual distribution
► But most real-world MDPs have non-binary states!
Theorem 1 (informal): The (newly defined) property of counterfactual stability generalizes monotonicity to categorical variables
Gumbel-Max SCM: Use the Gumbel-Max trick to sample from a categorical distribution with $k$ categories:
$g_j \sim \mathrm{Gumbel}$
$S' = \arg\max_j \{ \log P(S' = j \mid S, A) + g_j \}$
Theorem 2: The Gumbel-Max SCM satisfies the counterfactual stability condition
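A minimal sketch of how the Gumbel-Max mechanism can be used for a single counterfactual step, using only the Python standard library. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the Gumbel noise is taken to be standard Gumbel(0, 1), and the posterior over the noise given the observed outcome is drawn by plain rejection sampling, chosen here for simplicity.

```python
import math
import random


def sample_gumbel(rng):
    """Draw one standard Gumbel(0, 1) variate via the inverse CDF."""
    u = rng.random()
    while u == 0.0:  # avoid log(0); random() returns values in [0, 1)
        u = rng.random()
    return -math.log(-math.log(u))


def gumbel_max_sample(probs, gumbels):
    """Return argmax_j { log p_j + g_j }; zero-probability categories never win."""
    scores = [
        math.log(p) + g if p > 0 else -math.inf
        for p, g in zip(probs, gumbels)
    ]
    return max(range(len(probs)), key=lambda j: scores[j])


def posterior_gumbels(probs, observed, rng, max_tries=100_000):
    """Rejection-sample noise consistent with the observed category (assumed scheme)."""
    for _ in range(max_tries):
        gumbels = [sample_gumbel(rng) for _ in probs]
        if gumbel_max_sample(probs, gumbels) == observed:
            return gumbels
    raise RuntimeError("no consistent noise found; is probs[observed] > 0?")


def counterfactual_sample(probs_obs, observed, probs_cf, rng):
    """Sample the counterfactual category under probs_cf, reusing posterior noise."""
    gumbels = posterior_gumbels(probs_obs, observed, rng)
    return gumbel_max_sample(probs_cf, gumbels)


if __name__ == "__main__":
    rng = random.Random(0)
    p_obs = [0.2, 0.5, 0.3]  # P(S' = j | S, A) under the observed action
    p_cf = [0.6, 0.1, 0.3]   # P(S' = j | S, A') under the counterfactual action
    print(counterfactual_sample(p_obs, 1, p_cf, rng))
```

Two consequences follow directly from the argmax form. If the counterfactual distribution equals the observed one, the counterfactual outcome always equals the observed outcome. More generally, if the observed category's probability ratio `probs_cf[i] / probs_obs[i]` is at least as large as that of every other category, the counterfactual outcome cannot switch away from it. For long trajectories, rejection sampling may be slow when the observed outcome is unlikely.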
Thank you! Come to our poster for more details: Pacific Ballroom #72