Imbalance in representation space
‣ We do not want treatment groups to be identical
‣ Treatment-group imbalance: the control ($t=0$) and treated ($t=1$) covariate distributions induce different distributions $p_\Phi^{t=0}$ and $p_\Phi^{t=1}$ in the representation space $\Phi(x)$
[Figure: distributions of the control ($t=0$) and treated ($t=1$) groups before and after the representation $\Phi$]
Integral probability metric penalty
‣ Regularizer to improve counterfactual estimation
‣ Penalize the distributional distance between treatment groups in representation space: the network maps $x \mapsto \Phi(x)$, predicts $y(1)$ with $h_1(\Phi)$ when $t=1$ and $y(0)$ with $h_0(\Phi)$ when $t=0$, and adds the penalty $\mathrm{IPM}\bigl(\hat{p}_\Phi^{\,t=0}, \hat{p}_\Phi^{\,t=1}\bigr)$
‣ Integral Probability Metrics (IPMs) such as the Wasserstein distance and MMD. With $G$ a function family:
  $\mathrm{IPM}_G(p, q) = \sup_{f \in G} \left| \int f(t)\,\bigl(p(t) - q(t)\bigr)\, dt \right|$
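For concreteness, here is a minimal NumPy sketch (my own illustration, not the talk's code) of one member of this family, the kernel MMD, computed between control and treated representations; the Gaussian kernel, bandwidth, and toy data are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """Gaussian kernel matrix between rows of A and rows of B."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def mmd2(phi0, phi1, sigma=1.0):
    """Biased estimate of the squared MMD between two groups of representations."""
    return (rbf_kernel(phi0, phi0, sigma).mean()
            + rbf_kernel(phi1, phi1, sigma).mean()
            - 2 * rbf_kernel(phi0, phi1, sigma).mean())

# toy usage: representations Phi(x) of control (t=0) and treated (t=1) units
rng = np.random.default_rng(0)
phi0 = rng.normal(0.0, 1.0, size=(200, 16))   # control representations
phi1 = rng.normal(0.5, 1.0, size=(150, 16))   # treated representations, shifted mean
print(mmd2(phi0, phi1))                       # larger value = more imbalance
```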
Individual-level treatment effect generalization bound
‣ Factual per-treatment-group prediction error:
  $\epsilon_F^{t=0}(\Phi,h) = \int \ell_{h,\Phi}(x, 0)\, p^{t=0}(x)\, dx$, and analogously $\epsilon_F^{t=1}(\Phi,h)$
‣ Precision in Estimation of Heterogeneous Effects¹:
  $\hat{\tau}_{\Phi,h}(x) = h(\Phi(x), 1) - h(\Phi(x), 0)$, and
  $\epsilon_{\mathrm{PEHE}}(\Phi,h) = \int \bigl(\hat{\tau}_{\Phi,h}(x) - \mathrm{CATE}(x)\bigr)^2 p(x)\, dx$
‣ Theorem 1:
  $\epsilon_{\mathrm{PEHE}}(\Phi,h) \le 2\Bigl(\epsilon_F^{t=0}(\Phi,h) + \epsilon_F^{t=1}(\Phi,h) + C_\Phi\, \mathrm{IPM}_G\bigl(p_\Phi^{t=1}, p_\Phi^{t=0}\bigr)\Bigr)$
  (effect error ≤ prediction error + treatment-group distance)
¹ Hill, Journal of Computational and Graphical Statistics, 2011
‣ Theorem 1:
  $\epsilon_{\mathrm{CATE}} \le 2\Bigl(\epsilon_F^{t=0}(\Phi,h) + \epsilon_F^{t=1}(\Phi,h) + C_\Phi\, \mathrm{IPM}_G\bigl(p_\Phi^{t=1}, p_\Phi^{t=0}\bigr)\Bigr)$
  (effect error ≤ prediction error + treatment-group distance)
• Problem with Theorem 1: too loose when we have overlap and infinite samples
• We should be able to achieve the prediction error itself on either group
Trading off accuracy for balance
‣ Our full architecture learns a representation $\Phi(x)$, a re-weighting $w_t(x)$, and hypotheses $h_t(\Phi)$ to trade off the re-weighted loss $w\ell$ against the imbalance between re-weighted representations
[Architecture: context $x$ → DNN representation $\Phi$ → hypotheses $h_0, h_1$ selected by treatment $t$ → weighted loss $w\ell$; a weighting head produces $w$, and the imbalance penalty is $\mathrm{IPM}\bigl(w^{t=0} p_\Phi^{t=0},\; w^{t=1} p_\Phi^{t=1}\bigr)$]
Individual-treatment effect generalization bound
‣ Theorem 2* (representation learning):
  $\epsilon_{\mathrm{CATE}} \le 2 \sum_{t \in \{0,1\}} \epsilon_{w}^{t}(\Phi,h) + C_\Phi\, \mathrm{IPM}_G\bigl(w^{t=0} p_\Phi^{t=0},\; w^{t=1} p_\Phi^{t=1}\bigr)$
  (effect risk ≤ re-weighted factual loss + imbalance of re-weighted representations)
‣ Letting $\Phi(x) = x$ and $w_t(x)$ be inverse propensity weights, we recover the classic result
‣ Minimizing a weighted loss and the IPM converges to the representation and hypothesis that minimize the CATE error
* Extension to finite samples available
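A minimal sketch of the resulting training objective, assuming the MMD is used as the IPM and a squared factual loss; the function names, the trade-off weight `lam`, and the weighted-MMD form are my own illustrative choices, not the paper's code.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """Gaussian kernel matrix between rows of A and rows of B."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def weighted_mmd2(phi0, w0, phi1, w1, sigma=1.0):
    """Squared MMD between the re-weighted control and treated representation distributions."""
    w0, w1 = w0 / w0.sum(), w1 / w1.sum()
    return (w0 @ rbf_kernel(phi0, phi0, sigma) @ w0
            + w1 @ rbf_kernel(phi1, phi1, sigma) @ w1
            - 2 * w0 @ rbf_kernel(phi0, phi1, sigma) @ w1)

def objective(phi, y, t, w, h0, h1, lam=1.0, sigma=1.0):
    """Re-weighted factual loss + lam * imbalance of re-weighted representations."""
    pred = np.where(t == 1, h1(phi), h0(phi))   # factual prediction for each unit
    factual = np.mean(w * (pred - y) ** 2)      # weighted squared error
    imbalance = weighted_mmd2(phi[t == 0], w[t == 0], phi[t == 1], w[t == 1], sigma)
    return factual + lam * imbalance
```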
Evaluating Individual Treatment Effect (CATE) Estimates
‣ No ground truth, similar to off-policy evaluation in reinforcement learning
‣ Requires either:
  ‣ Knowledge of the true outcome (synthetic)
  ‣ Knowledge of the treatment assignment policy (e.g. a randomized controlled trial)
‣ Our framework has proven effective in both settings
IHDP Benchmark¹
‣ The Infant Health and Development Program (IHDP)
‣ Studied the effects of home visits and other interventions
‣ Real covariates and treatment, synthesized outcome
‣ Overlap is not satisfied (by design)
‣ Used to evaluate MSE in CATE prediction
¹ Hill, JCGS, 2011
Empirical results
‣ BART (Bayesian Additive Regression Trees) is a state-of-the-art baseline
‣ Standard neural networks are competitive
‣ Shared representation learning with ERM halves the MSE on IHDP²
‣ Minimizing upper bounds on the risk further reduces the MSE

Method                                     CATE MSE
BART¹                                      2.3 ± 0.1
Neural net                                 2.0 ± 0.0
Shared rep.²                               ≈ half the neural-net MSE
Shared rep. + invariance²                  lower
Shared rep. + invariance + weighting³      lowest

¹ Hill, JCGS, 2011   ² S., Johansson, Sontag, ICML 2017   ³ Johansson, Kallus, S., Sontag, arXiv 2018
Intermediate conclusions
‣ ML is well understood when test data ≈ training data
‣ Learning individualized policies from observational data requires going beyond test ≈ train
‣ Fewer/worse guarantees when assumptions are violated
Outline
• ML for causal inference
• Causal inference for ML
• Off-policy evaluation in a partially observable Markov decision process
• Robust learning for unsupervised covariate shift
"Off-Policy Evaluation in Partially Observable Environments", Tennenholtz, Mannor, S., AAAI 2020
Healthcare with time-varying decisions
• Physicians make ongoing decisions: treat, see a change in the patient's state, modify treatment, and so on
[Figure: doctor and patient]
Healthcare with time-varying decisions
• Maps very well to the reinforcement learning paradigm
(Figure: Shweta Bhatt)
Reinforcement learning (RL) and causal inference
From the causal inference perspective:
• RL usually assumes we can intervene directly
• ⇒ mostly about how to experiment optimally in a dynamic environment
From the RL perspective:
• Causal inference usually deals with cases where we cannot intervene directly
• Causal inference usually focuses on single point-in-time actions
• ⇒ mostly about off-policy evaluation of a simple policy such as "treat everyone"
A meeting point of RL and causal inference
• When performing off-policy evaluation of data from
  i. a dynamic environment with ongoing actions,
  ii. while we possibly do not have access to the same data as the agent
• Example: learning from records of physicians treating patients in an intensive care unit (ICU)
• Mistakes were made: applying RL to observational intensive care unit data without considering hidden confounders or overlap (common support / positivity) (see "Guidelines for Reinforcement Learning in Healthcare", Gottesman et al. 2019)
• In RL nomenclature, hidden confounding can be described by a Partially Observable Markov Decision Process (POMDP)
Partially Observable Markov Decision Process (POMDP): some formalism
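As a reference point, the standard POMDP tuple this section relies on can be written as follows (a summary under the usual conventions, with notation matched to the causal-graph slides below: hidden state $U_t$, action $A_t$, observation $Z_t$, reward $R_t$; not the slide's own notation):

```latex
% Standard POMDP tuple (usual conventions):
\[
\mathcal{M} = \langle \mathcal{U}, \mathcal{A}, \mathcal{Z}, T, O, R, \gamma \rangle, \qquad
T(u' \mid u, a) = P(U_{t+1} = u' \mid U_t = u, A_t = a),
\]
\[
O(z \mid u) = P(Z_t = z \mid U_t = u), \qquad
R_t = R(U_t, A_t), \qquad \gamma \in (0, 1].
\]
```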
POMDP causal graph

Variable   Causal name                        RL name                          Example
U_t        confounder (possibly "hidden")     state (possibly "unobserved")    information available to the doctor
A_t        treatment                          action                           medications, procedures…
R_t        outcome                            reward                           mortality
π_b        treatment assignment process       behavioral policy                the way doctors treat patients
Z_t        proxy variable                     observation                      electronic health record
POMDP causal graph
• We observe data from the behavioral policy $\pi_b$, with $U_t$ unobserved
Our goal: evaluate a new policy π_e given data from π_b
• Observe data from $\pi_b$, with $U_t$ unobserved…
• Evaluate a proposed policy $\pi_e(Z_t)$ in terms of policy value (discounted over a finite horizon)
• Why a function of $Z_t$? Because $U_t$ is unobserved
• How do we evaluate $\pi_e(Z_t)$ given only observations from $\pi_b$, with $U_t$ unobserved?
• This is a problem anyone trying to create optimal dynamic treatment policies from observational data must address
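The policy value referred to above can be written, under the usual conventions (the horizon $H$ and discount $\gamma$ are generic symbols, not ones introduced on the slide), as:

```latex
\[
v(\pi_e) \;=\; \mathbb{E}_{\pi_e}\!\left[\sum_{t=0}^{H} \gamma^{\,t} R_t\right],
\qquad \text{with actions drawn as } A_t \sim \pi_e(\cdot \mid Z_t).
\]
```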
Our goal: evaluate a new policy π_e given data from π_b
• Observe data from $\pi_b$, with $U_t$ unobserved…
• Evaluate a proposed policy $\pi_e(Z_t)$ in terms of policy value (discounted over a finite horizon)
• Denote by $P_{\pi_b}(\cdot \mid \cdot)$ the probabilities of observation–action–reward sequences under the observed behavioral policy: we can sample from this distribution
• Denote by $P_{\pi_e}(\cdot \mid \cdot)$ the corresponding probabilities under the targeted evaluation policy: we cannot sample from this distribution
Our goal: evaluate a new policy π_e given data from π_b
• Observing data from $\pi_b$, with $U_t$ unobserved, evaluate a proposed policy $\pi_e(Z_t)$ in terms of policy value (discounted over a finite horizon)
• Without further assumptions: IMPOSSIBLE
• Example: ICU doctors treating sicker patients more aggressively
• Impossible even when conditioning on the entire observable history $Z_1, A_1, R_1, \ldots, Z_t, A_t, R_t$
• Due to hidden confounding by $U_t$
• But much harder: confounder ↔ action dynamics
Proxies and negative controls
• Miao, Geng, & Tchetgen Tchetgen. "Identifying causal effects with proxy variables of an unmeasured confounder." Biometrika (2018)
• Only $U$ is unobserved
• Goal: identify the causal effect of $X$ on $Y$
• $Z \perp W \mid U$
• In general: impossible
• New identification condition: the matrices $M^{ij}(x) = P(W = i \mid Z = j, X = x)$ are invertible for all $x$
• Requires $W$ and $Z$ to be discrete, with as many categories as the discrete $U$
[Causal graph over the unmeasured confounder $U$, proxies $Z$ and $W$, treatment $X$, and outcome $Y$]
Our goal: evaluate a new policy π_e given data from π_b
• Assume the observations $Z_t$ are discrete, with at least as many categories as $U_t$
• Let $W^{ij}_t(a) = P_{\pi_b}(Z_t = i \mid Z_{t-1} = j, A_t = a)$
• Theorem: if the matrices $W_t(a)$ are all invertible, then we can evaluate the value of a proposed policy $\pi_e(Z_t)$ given observational data gathered under $\pi_b$, without observing $U_t$
• Invertibility example (untestable from data): if the $z_t$ are binary, a sufficient condition for invertibility of $W_t(a)$ is
  $P(z_t = 1 \mid z_{t-1} = 1, a) \ne P(z_t = 1 \mid z_{t-1} = 0, a)$
• Future and past observations $Z_t$ serve as conditionally independent proxies for the unobserved $U_t$
Assumptions
1. The observations $Z_t$ are discrete, with at least as many categories as $U_t$
2. The matrices $W^{ij}_t(a) = P_{\pi_b}(Z_t = i \mid Z_{t-1} = j, A_t = a)$ are invertible for all $a$ and $t$
• These allow off-policy evaluation for this class of POMDPs
• No need to measure, or even know, what $U_t$ is
• As usual in causal inference, some of the assumptions are unverifiable from data
Assumptions
1. The observations $Z_t$ are discrete, with at least as many categories as $U_t$
2. The matrices $W_t(a)$ are invertible for all $a$ and $t$
Under these assumptions, the value of $\pi_e$ can be computed from quantities estimable under $\pi_b$:
• Consider an observed sequence $c = (z_1, a_1, \ldots, z_k, a_k)$
• From data gathered under $\pi_b$, estimate per-step matrices of observable conditional probabilities (such as $W_t(a)$ above) and invert them; let $\Omega(c)$ denote the resulting product of matrices and inverse matrices along the sequence
• Let $\Pi_e(c) = \prod_{t=1}^{k} \pi_e\bigl(a_t \mid z_1, a_1, \ldots, z_{t-1}, a_{t-1}, z_t\bigr)$ be the probability the evaluation policy assigns to the observed actions
• Then the reward distribution under the evaluation policy is recovered as
  $P_{\pi_e}(r_t) = \sum_{c} \Pi_e(c)\, P_{\pi_b}\bigl(r_t, z_t \mid z_{t-1}, a_{t-1}\bigr)\, \Omega(c)$
Off-policy POMDP evaluation
• The evaluation above requires estimating the inverses of many conditional probability tables
• This scales poorly statistically
• We introduce another causal model, the decoupled POMDP
• Similar causal graph
• Significantly reduces the dimensions and improves the condition number of the estimated inverse matrices
Decoupled POMDP
Off-policy POMDP evaluation
• Current challenge: scaling to realistic health data
Outline
• ML for causal inference
• Causal inference for ML
• Off-policy evaluation in a partially observable Markov decision process
• Robust learning for unsupervised covariate shift
"Robust learning with the Hilbert-Schmidt independence criterion", Greenfeld & S., arXiv:1910.00270
Classic non-causal tasks in machine learning: many success stories
• Classification
  • ImageNet
  • MNIST
  • TIMIT (sound)
  • Sentiment analysis
• Prediction
  • Which patients will die?
  • Which users will click?
  • (under current practice)
Failures of ML classification models
• Test set ≠ train set, but we know humans succeed here
How to learn models that are robust to a-priori unknown changes in the test distribution?
• Source distribution $P_S(X, Y)$
• Learn a model that works well on unknown target distributions $P_T(X, Y) \in \mathcal{T}$
[Figure: the source distribution $P_S$ and the set $\mathcal{T}$ of possible targets]
How to learn models that are robust to a-priori unknown changes in the test distribution?
• Source distribution $P_S(X, Y)$
• Learn a model that works well on all target distributions $P_T(X, Y) \in \mathcal{T}$
• What is $\mathcal{T}$?
• We assume covariate shift: for all $P_T(X, Y) \in \mathcal{T}$, $P_T(Y \mid X) = P_S(Y \mid X)$
• Further restrictions on $\mathcal{T}$ to follow
• Covariate shift is easy if learning $P_S(Y \mid X)$ is easy
• We focus on tasks where it's hard
Unsupervised covariate shift
• A model that works well even when the underlying distribution of instances changes
• Works as long as $P(Y \mid X)$ is stable
• When does this happen?
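A small simulation sketch of this setting (my own illustration, not from the talk): the mechanism $P(Y \mid X)$ is held fixed while the input distribution shifts between source and target; the distributions, the polynomial model, and all constants are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def mechanism(x):
    """Stable mechanism: P(Y | X) is identical in source and target."""
    return np.sin(2 * x) + 0.3 * rng.normal(size=x.shape)

# source: X concentrated around 0; target: X shifted to the right
x_src = rng.normal(0.0, 1.0, size=2000)
x_tgt = rng.normal(2.0, 1.0, size=2000)
y_src, y_tgt = mechanism(x_src), mechanism(x_tgt)

# fit a polynomial to the source data only (ordinary ERM)
coefs = np.polyfit(x_src, y_src, deg=5)
mse_src = np.mean((np.polyval(coefs, x_src) - y_src) ** 2)
mse_tgt = np.mean((np.polyval(coefs, x_tgt) - y_tgt) ** 2)
print(f"source MSE {mse_src:.3f}  vs  target MSE {mse_tgt:.3f}")  # target error is typically much larger
```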
Causal mechanisms are stable
Learning with an independence criterion
• $X$ causes $Y$; structural causal model: $Y = f(X) + E$, with $E \perp X$
• $f(x)$ is the mechanism tying $X$ to $Y$
• $E$ is independent additive noise
• Therefore, $Y - f(X) \perp X$
• Mooij, Janzing, Peters & Schölkopf (2009): learn the structure of causal models by learning functions $g$ such that $Y - g(X)$ is approximately independent of $X$
• Need a non-parametric measure of independence
• The Hilbert-Schmidt independence criterion, HSIC
Hilbert-Schmidt independence criterion: HSIC
• Let $\mathcal{X}$, $\mathcal{Y}$ be two metric spaces with a joint distribution $P(X, Y)$
• $\mathcal{H}_\mathcal{X}$ and $\mathcal{H}_\mathcal{Y}$ are reproducing kernel Hilbert spaces on $\mathcal{X}$ and $\mathcal{Y}$, induced by kernels $k(\cdot,\cdot)$ and $l(\cdot,\cdot)$ respectively
• $\mathrm{HSIC}(X, Y)$ measures the degree of dependence between $X$ and $Y$
• Empirical version: sample $(x_1, y_1), \ldots, (x_m, y_m)$; denote (with some abuse of notation) by $K$ the $m \times m$ kernel matrix on $X$ and by $L$ the $m \times m$ kernel matrix on $Y$:
  $\widehat{\mathrm{HSIC}}(X, Y; \mathcal{H}_\mathcal{X}, \mathcal{H}_\mathcal{Y}) = \frac{1}{m^2}\,\mathrm{tr}(K H L H)$,
  where $H$ is the centering matrix, $H_{ij} = \delta_{ij} - \frac{1}{m}$
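A minimal NumPy sketch of this empirical estimator; the Gaussian kernels, bandwidths, and toy data are my own illustrative choices.

```python
import numpy as np

def gram(x, sigma=1.0):
    """Gaussian kernel Gram matrix for a sample given as an (m, d) array."""
    sq = np.sum(x**2, 1)[:, None] + np.sum(x**2, 1)[None, :] - 2 * x @ x.T
    return np.exp(-sq / (2 * sigma**2))

def hsic(x, y, sigma_x=1.0, sigma_y=1.0):
    """Empirical HSIC: (1/m^2) tr(K H L H), with H the centering matrix."""
    m = x.shape[0]
    K, L = gram(x, sigma_x), gram(y, sigma_y)
    H = np.eye(m) - np.ones((m, m)) / m
    return np.trace(K @ H @ L @ H) / m**2

# independent vs. dependent toy samples
rng = np.random.default_rng(0)
x = rng.normal(size=(500, 1))
print(hsic(x, rng.normal(size=(500, 1))))                 # near zero: independent
print(hsic(x, x**2 + 0.1 * rng.normal(size=(500, 1))))    # clearly larger: dependent
```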
Learning with HSIC
• Hypothesis class $\mathcal{H}$
• Classic learning for a loss $\ell$, e.g. squared loss: $\min_{h \in \mathcal{H}} \mathbb{E}\bigl[\ell(Y, h(X))\bigr]$
• Learning with HSIC (Mooij et al., 2009): $\min_{h \in \mathcal{H}} \mathrm{HSIC}\bigl(X, Y - h(X); \mathcal{H}_\mathcal{X}, \mathcal{H}_\mathcal{Y}\bigr)$
Learning with HSIC
• Learning with HSIC (Mooij et al., 2009): $\min_{h \in \mathcal{H}} \mathrm{HSIC}\bigl(X, Y - h(X); \mathcal{H}_\mathcal{X}, \mathcal{H}_\mathcal{Y}\bigr)$
• Recall: $Y - f(X) \perp X$
• If the objective equals 0, then $h(x) = f(x) + c$ for some constant $c$
• We can learn $f$ up to an additive bias term
Learning with HSIC
• Learning with HSIC (Mooij et al., 2009): $\min_{h \in \mathcal{H}} \mathrm{HSIC}\bigl(X, Y - h(X); \mathcal{H}_\mathcal{X}, \mathcal{H}_\mathcal{Y}\bigr)$
• Differentiable with respect to $h(X)$
• We optimize with SGD, using mini-batches to approximate HSIC
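A minimal PyTorch sketch of this mini-batch procedure (my own illustration, not the paper's code); the network architecture, synthetic data, learning rate, and kernel bandwidth are arbitrary assumptions.

```python
import torch
import torch.nn as nn

def hsic_batch(x, res, sigma=1.0):
    """Differentiable mini-batch HSIC between inputs x and residuals res."""
    def gram(a):
        d = torch.cdist(a, a) ** 2
        return torch.exp(-d / (2 * sigma**2))
    m = x.shape[0]
    H = torch.eye(m) - torch.ones(m, m) / m
    return torch.trace(gram(x) @ H @ gram(res) @ H) / m**2

model = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(2000):                                  # toy training loop
    x = torch.randn(128, 1)                               # mini-batch of inputs
    y = torch.sin(2 * x) + 0.3 * torch.randn(128, 1)      # stable mechanism + noise
    residual = y - model(x)
    loss = hsic_batch(x, residual)                        # HSIC-loss instead of squared error
    opt.zero_grad()
    loss.backward()
    opt.step()
# note: HSIC learns the mechanism only up to an additive constant (see the slide above)
```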
Theoretical results
• Learnability: minimizing the HSIC-loss over a sample leads to generalization
• Robustness: minimizing the HSIC-loss leads to tightly-bounded error under unsupervised covariate shift, provided the density ratio $p_{\mathrm{target}}(x) / p_{\mathrm{source}}(x)$ is "nice" in the sense of having low RKHS norm
Experiments - rotated MNIST (Heinze-Deml & Meinshausen 2017)
• Train on ordinary MNIST
• Test on MNIST rotated uniformly at random in [-45°, 45°]
Experiments - rotated MNIST (Heinze-Deml & Meinshausen 2017)
[Figure: test accuracy (60-100%) on the source (ordinary) and target (rotated) MNIST for a CNN and for MLPs of various widths and depths (2 or 4 layers, 256-1024 units), each trained with the HSIC loss vs. with cross-entropy]
Outline
• ML for causal inference
• Causal inference for ML
• Off-policy evaluation in a partially observable Markov decision process
• Robust learning for unsupervised covariate shift
Summary
• Machine learning for causal inference:
  • Individual-level treatment effects from observational data, with robustness to the treatment assignment process
• Using the recently proposed "negative controls" to create the first off-policy evaluation scheme for POMDPs, with past and future observations in the role of the controls
• Learning models robust against unknown covariate shift
Thank you to all my collaborators!
• Fredrik Johansson (Chalmers)
• David Sontag (MIT)
• Nathan Kallus (Cornell Tech)
• Guy Tennenholtz (Technion)
• Shie Mannor (Technion)
• Daniel Greenfeld (Technion)
Even estimating average effects from observational data is hard! Do we believe we can estimate individual-level effects?
• Causal identification assumptions:
  • No hidden confounding: no unmeasured factors that affect both treatment and outcome
  • Common support: the $T=1$ and $T=0$ populations should be similar
  • Accurate effect estimates: we must be able to approximate $\mathbb{E}[Y \mid x, T = t]$
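These assumptions are commonly formalized with potential outcomes $Y(0), Y(1)$ as follows (a standard formalization, not copied from the slide):

```latex
% Ignorability (no hidden confounding), overlap (common support), and the estimand:
\[
\bigl(Y(0), Y(1)\bigr) \;\perp\!\!\!\perp\; T \mid X, \qquad
0 < P(T = 1 \mid X = x) < 1 \ \ \text{for all } x,
\]
\[
\mathrm{CATE}(x) \;=\; \mathbb{E}\bigl[Y \mid X = x, T = 1\bigr] \;-\; \mathbb{E}\bigl[Y \mid X = x, T = 0\bigr].
\]
```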
Even estimating average effects from observational data is hard! Do we believe we can estimate individual-level effects?
• Causal identification assumptions:
  • No hidden confounding
  • Common support
  • Accurate effect estimates
• We focus on tasks where we hope we can address all three concerns
  • And still be useful
• Designing for causal identification
You have condition A. Treatment options are T=0, T=1
Obviously, give T=0. No need for algorithmic decision support.
Obviously, give T=0. Obviously, give T=1. No need for algorithmic decision support.
Obviously, give T=0. Obviously, give T=1. I'm not so sure…