
Machine learning and causal inference: a two-way road (Uri Shalit)

Machine learning and causal inference: a two-way road. Uri Shalit, Technion, Israel Institute of Technology. DATAIA Seminar, Paris, January 2020. What is causality? A big question! Extremely short intro to causality (in the context of


  1. Imbalance in representation space
  ► We do not want treatment groups to be identical
  ► Treatment group imbalance in representation space: p_Φ^{t=0}(Φ(x)) ≠ p_Φ^{t=1}(Φ(x))
  ► [Diagram: covariates x from the control (t = 0) and treated (t = 1) groups mapped through Φ into representation space]

  2. Integral probability metric penalty
  ► Regularizer to improve counterfactual estimation
  ► Penalize the distributional distance between treatment groups in representation space
  ► [Architecture: x → Φ → h_1(Φ) with loss L(h_1(Φ), y(1)) if t = 1; h_0(Φ) with loss L(h_0(Φ), y(0)) if t = 0; plus the penalty IPM_G(p̂_Φ^{t=0}, p̂_Φ^{t=1})]
  ► Integral Probability Metrics (IPM) such as the Wasserstein distance and MMD
  ► With G a function family: IPM_G(p_0, p_1) = sup_{g∈G} ∫_S g(s) (p_0(s) − p_1(s)) ds

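As a concrete illustration of this penalty, here is a minimal sketch assuming PyTorch and a linear-kernel MMD as the IPM; the function name and the kernel choice are mine, not the talk's implementation:

```python
import torch

def linear_mmd_penalty(phi: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Squared linear-kernel MMD between treated (t = 1) and control (t = 0)
    representations phi: one simple member of the IPM family. A Wasserstein
    distance or a characteristic-kernel MMD could be used instead."""
    mean_treated = phi[t == 1].mean(dim=0)
    mean_control = phi[t == 0].mean(dim=0)
    return ((mean_treated - mean_control) ** 2).sum()
```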

  4. Individual-level treatment effect generalization bound
  ► Factual per-treatment-group prediction error:
    ε_F^{t=0}(Φ, h) = ∫ (h(Φ(x), 0) − y_0(x))² p^{t=0}(x) dx,  ε_F^{t=1}(Φ, h) = ∫ (h(Φ(x), 1) − y_1(x))² p^{t=1}(x) dx
  ► Precision in Estimation of Heterogeneous Effects¹ (PEHE), with τ̂_{Φ,h}(x) = h(Φ(x), 1) − h(Φ(x), 0):
    ε_PEHE(Φ, h) = ∫ (τ̂_{Φ,h}(x) − CATE(x))² p(x) dx
  ► Theorem 1: ε_PEHE(Φ, h) ≤ 2 [ ε_F^{t=0}(Φ, h) + ε_F^{t=1}(Φ, h) + C_Φ · IPM_G(p_Φ^{t=1}, p_Φ^{t=0}) ]
    Effect error ≤ Prediction error + Treatment group distance
  ¹ Hill, Journal of Computational and Graphical Statistics, 2011

  5. Individual-level treatment effect generalization bound
  ► Theorem 1: ε_PEHE(Φ, h) ≤ 2 [ ε_F^{t=0}(Φ, h) + ε_F^{t=1}(Φ, h) + C_Φ · IPM_G(p_Φ^{t=1}, p_Φ^{t=0}) ]
    Effect error ≤ Prediction error + Treatment group distance
  • Problem with Theorem 1: too loose when we have overlap and infinite samples
  • We should be able to achieve the prediction error itself on either group
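Since ε_PEHE is the quantity being bounded here and reported on benchmarks later, here is a short sketch of how it is computed on synthetic data where the true expected potential outcomes are known; the variable names are illustrative:

```python
import numpy as np

def pehe(mu1, mu0, cate_hat):
    """Precision in Estimation of Heterogeneous Effects (Hill, 2011):
    mean squared error between estimated and true CATE (often reported
    as its square root).
    mu1, mu0:  true expected potential outcomes E[Y(1)|x], E[Y(0)|x]
    cate_hat:  estimated CATE, e.g. h(Phi(x), 1) - h(Phi(x), 0)."""
    cate_true = np.asarray(mu1) - np.asarray(mu0)
    return float(np.mean((np.asarray(cate_hat) - cate_true) ** 2))
```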

  6. Trading off accuracy for balance
  ► Our full architecture learns a representation Φ(x), a re-weighting w_t(x), and hypotheses h_t(Φ) to trade off between the re-weighted loss w·ℓ and the imbalance between re-weighted representations
  ► [Architecture: context x → representation Φ (DNN) → hypotheses h_0, h_1; treatment t and weighting w_t feed the weighted loss w·ℓ; imbalance term IPM(w_0 · p̂_Φ^{t=0}, w_1 · p̂_Φ^{t=1})]

  7. Individual-level treatment effect generalization bound
  ► Theorem 2* (representation learning):
    ε_CATE ≤ 2 [ Σ_{t∈{0,1}} ε_w^t(Φ, h) + C_Φ · IPM_G(w_0(x) · p_Φ^{t=0}(x), w_1(x) · p_Φ^{t=1}(x)) ]
    Effect risk ≤ Re-weighted factual loss + Imbalance of re-weighted representations
  ► Letting Φ(x) = x and w_t(x) be inverse propensity weights, we recover the classic result
  ► Minimizing a weighted loss and an IPM converges to the representation and hypothesis that minimize the CATE error
  * Extension to finite samples available
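A minimal sketch of the kind of objective Theorem 2 suggests: a sample-re-weighted factual loss plus an IPM term between re-weighted representations. The layer sizes, the softplus weighting head, and the linear-MMD stand-in for the IPM are assumptions for illustration, not the exact architecture from the talk.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedCFR(nn.Module):
    """Sketch: shared representation Phi, per-treatment heads h0/h1, learned weights w."""
    def __init__(self, d_in: int, d_rep: int = 64):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(d_in, d_rep), nn.ReLU())
        self.h0 = nn.Linear(d_rep, 1)
        self.h1 = nn.Linear(d_rep, 1)
        self.w = nn.Linear(d_rep, 1)   # re-weighting head: w_t(x) = softplus(...)

    def forward(self, x):
        rep = self.phi(x)
        return (rep,
                self.h0(rep).squeeze(-1),
                self.h1(rep).squeeze(-1),
                F.softplus(self.w(rep)).squeeze(-1))

def objective(model, x, t, y, alpha: float = 1.0):
    rep, y0_hat, y1_hat, w = model(x)
    y_hat = torch.where(t == 1, y1_hat, y0_hat)
    weighted_loss = (w * (y_hat - y) ** 2).mean()              # re-weighted factual loss
    m1 = (w[t == 1].unsqueeze(1) * rep[t == 1]).mean(dim=0)    # weighted representation means
    m0 = (w[t == 0].unsqueeze(1) * rep[t == 0]).mean(dim=0)
    imbalance = ((m1 - m0) ** 2).sum()                         # linear-MMD stand-in for the IPM
    return weighted_loss + alpha * imbalance
```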


  10. Evaluating Individual Treatment Effect (CATE) Estimates
  ► No ground truth; similar to off-policy evaluation in reinforcement learning
  ► Requires either:
    ► Knowledge of the true outcome (synthetic)
    ► Knowledge of the treatment assignment policy (e.g., a randomized controlled trial)
  ► Our framework has proven effective in both settings

  11. IHDP Benchmark¹
  ► The Infant Health and Development Program (IHDP)
  ► Studied the effects of home visits and other interventions
  ► Real covariates and treatment, synthesized outcome
  ► Overlap is not satisfied (by design)
  ► Used to evaluate MSE in CATE prediction
  ¹ Hill, JCGS, 2011

  12. Empirical results
  ► Methods compared on IHDP (CATE MSE): BART¹ 2.3 ± 0.1; standard neural network 2.0 ± 0.0; shared representation²; shared representation + invariance²; shared representation + invariance + weighting³
  ► BART, Bayesian Additive Regression Trees, is a state-of-the-art baseline
  ► Standard neural networks are competitive
  ► Shared representation learning with ERM halves the MSE on IHDP²
  ► Minimizing upper bounds on risk, including the imbalance (IPM) term, further reduces the MSE
  ¹ Hill, JCGS, 2011; ² Shalit, Johansson, Sontag, ICML 2017; ³ Johansson, Kallus, Shalit, Sontag, arXiv 2018

  13. Intermediate conclusions
  ► ML is well understood when test data ≈ training data
  ► Learning individualized policies from observational data requires going beyond test ≈ train
  ► Fewer/worse guarantees when assumptions are violated


  15. Outline
  • ML for causal inference
  • Causal inference for ML
    • Off-policy evaluation in a partially observable Markov decision process
    • Robust learning for unsupervised covariate shift
  "Off-Policy Evaluation in Partially Observable Environments", Tennenholtz, Mannor, Shalit, AAAI 2020

  16. Healthcare with time-varying decisions
  • Physicians make ongoing decisions: treat, observe the change in the patient's state, modify treatment, and so on (doctor ⇄ patient loop)

  17. Healthcare with time-varying decisions
  • Maps very well to the reinforcement learning paradigm (figure: Shweta Bhatt)


  22. Reinforcement learning (RL) and causal inference
  From the causal inference perspective:
  • RL usually assumes we can intervene directly
  • → mostly about how to experiment optimally in a dynamic environment
  From the RL perspective:
  • Causal inference usually deals with cases where we cannot intervene directly
  • Causal inference usually focuses on single point-in-time actions
  • → mostly about off-policy evaluation of a simple policy such as "treat everyone"

  23. A meeting point of RL and causal inference
  • When performing off-policy evaluation of data from (i) a dynamic environment with ongoing actions, (ii) while we possibly do not have access to the same data as the agent
  • Example: learning from records of physicians treating patients in an intensive care unit (ICU)
  • Mistakes were made: applying RL to observational intensive care unit data without considering hidden confounders or overlap (common support / positivity); see "Guidelines for Reinforcement Learning in Healthcare", Gottesman et al., 2019
  • In RL nomenclature, hidden confounding can be described by a Partially Observable Markov Decision Process (POMDP)

  24. Partially Observable Markov Decision Process (POMDP): some formalism

  25. POMDP causal graph
  Causal name / RL name / symbol / example:
  • confounder (possibly "hidden") / state (possibly "unobserved") / u_t / information available to the doctor
  • treatment / action / a_t / medications, procedures, …
  • outcome / reward / r_t / mortality
  • treatment assignment process / behavioral policy / π_b / the way doctors treat patients
  • proxy variable / observation / z_t / electronic health record


  30. POMDP causal graph
  • Observe data from π_b, with u_t unobserved

  31. Our goal: evaluate a new policy π_e given data from π_b
  • Observe data from π_b, with u_t unobserved…
  • Evaluate a proposed policy π_e(z_t) in terms of its policy value (discounted over a finite horizon)
  • Why a function of z_t? Because u_t is unobserved
  • How to evaluate π_e(z_t) given only observations from π_b, with u_t unobserved?
  • This is a problem anyone trying to create optimal dynamic treatment policies from observational data must address
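For concreteness, the policy value referred to here can be written as an expected discounted return over a finite horizon; this is a standard definition, with the horizon T and discount factor γ being notation I am supplying:

```latex
V(\pi_e) \;=\; \mathbb{E}_{\pi_e}\!\left[\sum_{t=0}^{T} \gamma^{t}\, r_t\right],
\qquad \gamma \in (0, 1],\; T < \infty .
```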

  32. Our goal: evaluate a new policy π_e given data from π_b
  • Observe data from π_b, with u_t unobserved…
  • Evaluate a proposed policy π_e(z_t) in terms of its policy value (discounted over a finite horizon)
  • Denote by p^{π_b}(a, b, c, … | d, e, f, …) probabilities under the observed behavioral policy; we can sample from this distribution
  • Denote by p^{π_e}(a, b, c, … | d, e, f, …) probabilities under the target evaluation policy; we cannot sample from this distribution

  33. Our goal: evaluate a new policy π_e given data from π_b
  • Observing data from π_b with u_t unobserved, evaluate a proposed policy π_e(z_t) in terms of its policy value (discounted over a finite horizon)
  • Without further assumptions: IMPOSSIBLE
  • Example: ICU doctors treating sicker patients more aggressively
  • Impossible even when conditioning on the entire observable history z_0, a_0, r_0, …, z_t, a_t, r_t
  • Due to hidden confounding by u_t
  • But much harder: confounder ↔ action dynamics

  34. Proxies and negative controls
  • Miao, Geng, & Tchetgen Tchetgen, "Identifying causal effects with proxy variables of an unmeasured confounder", Biometrika (2018)
  • Only u is unobserved; z and w are proxies of u
  • Goal: identify the causal effect of a on y
  • z ⫫ w | u
  • In general: impossible
  • New identification condition: the matrices W(a), with entries W_{ij}(a) = p(w = i | z = j, a), are invertible for all a
  • Requires w and z to be discrete with at least as many categories as the discrete u
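To make the invertibility condition concrete, here is a small sketch of my own (not code from the paper) that estimates the matrices W(a) with entries p(w = i | z = j, a) from discrete samples and reports their condition numbers:

```python
import numpy as np

def proxy_matrices(z, w, a, n_categories: int):
    """Estimate W(a)[i, j] = p(w = i | z = j, a) for each observed action a,
    returning the matrix and its condition number (finite and small means
    it is safely invertible in practice)."""
    z, w, a = (np.asarray(v, dtype=int) for v in (z, w, a))
    result = {}
    for action in np.unique(a):
        counts = np.zeros((n_categories, n_categories))
        mask = a == action
        np.add.at(counts, (w[mask], z[mask]), 1.0)        # joint counts of (w, z) given a
        col_totals = counts.sum(axis=0, keepdims=True)
        W = counts / np.clip(col_totals, 1.0, None)        # column j holds p(w | z = j, a)
        result[int(action)] = (W, np.linalg.cond(W))
    return result
```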

  35. Our goal: evaluate a new policy π_e given data from π_b
  • Assume the observations z_t are discrete with at least as many categories as u_t
  • Let W_t(a) have entries [W_t(a)]_{ij} = p^{π_b}(Z_t = i | Z_{t−1} = j, A_t = a)
  • Invertibility example (untestable from data): if u_t is binary, a sufficient condition for invertibility of W_t(a) is p(z_t = 1 | u_t = 1, a) ≠ p(z_t = 1 | u_t = 0, a)
  • Theorem: if the W_t(a) are all invertible, then we can evaluate the value of a proposed policy π_e(z_t) given observational data gathered under π_b, without observing u_t
  • Future and past observations z_t serve as conditionally independent proxies for the unobserved u_t

  36. Assumptions
  1. The observations z_t are discrete with at least as many categories as u_t
  2. The matrices W_t(a), with entries [W_t(a)]_{ij} = p^{π_b}(Z_t = i | Z_{t−1} = j, A_t = a), are invertible for all a and t
  • These assumptions allow off-policy evaluation for a class of POMDPs
  • No need to measure, or even know, what u_t is
  • As usual in causal inference, some of the assumptions are unverifiable from data

  37. Assumptions
  1. The observations z_t are discrete with at least as many categories as u_t
  2. The matrices W_t(a), with entries [W_t(a)]_{ij} = p^{π_b}(Z_t = i | Z_{t−1} = j, A_t = a), are invertible for all a and t
  • Observed sequence τ = (z_0, a_0, …, z_t, a_t) ∈ Z^t
  • Define observable matrices O_t(a) with entries [O_t(a)]_{ij} = p^{π_b}(Z_t = i, Z_{t−1} = z_{t−1} | Z_{t−2} = j, A_{t−1} = a)
  • Each factor X_t(τ) combines these observable matrices with the inverses W_t(a_t)^{−1}, and Ω(τ) = ∏_t X_t(τ) (times an initial term built from W^{−1} and p^{π_b}(Z_0))
  • Λ_{π_e}(τ) = ∏_t π_e(a_t | z_0, a_0, …, z_{t−1}, a_{t−1}, z_t)
  • Then: p^{π_e}(r_t) = Σ_{τ∈Z^t} Λ_{π_e}(τ) · p^{π_b}(r_t, z_t | a_t, z_{t−1}) · Ω(τ)

  38. Off-policy POMDP evaluation
  • The above evaluation requires estimating the inverses of many conditional probability tables
  • Scales poorly statistically
  • We introduce another causal model called the decoupled POMDP
  • Similar causal graph
  • Significantly reduces the dimensions and improves the condition number of the estimated inverse matrices

  39. Decoupled POMDP

  40. Off-policy POMDP evaluation
  • The above evaluation requires estimating the inverses of many conditional probability tables
  • Scales poorly statistically
  • We introduce another causal model called the decoupled POMDP
  • Similar causal graph
  • Significantly reduces the dimensions and improves the condition number of the estimated inverse matrices
  • Current challenge: scaling to realistic health data


  42. Outline
  • ML for causal inference
  • Causal inference for ML
    • Off-policy evaluation in a partially observable Markov decision process
    • Robust learning for unsupervised covariate shift
  "Robust learning with the Hilbert-Schmidt independence criterion", Greenfeld & Shalit, arXiv:1910.00270

  43. Classic non-causal tasks in machine learning: many success stories
  • Classification
    • ImageNet
    • MNIST
    • TIMIT (sound)
    • Sentiment analysis
  • Prediction
    • Which patients will die?
    • Which users will click?
    • (under current practice)

  44. Failures of ML Classification models

  45. Failures of ML classification models
  • test set ≠ train set, but we know humans succeed here

  46. How to learn models which are robust to a-priori unknown changes in the test distribution?
  • Source distribution P_S(X, Y)
  • Learn a model that works well on unknown target distributions P_T(X, Y) ∈ 𝒯
  • [Diagram: the source P_S inside the set 𝒯 of possible targets]

  47. How to learn models which are robust to a-priori unknown changes in the test distribution?
  • Source distribution P_S(X, Y)
  • Learn a model that works well on all target distributions P_T(X, Y) ∈ 𝒯
  • What is 𝒯?
  • We assume covariate shift: for all P_T(X, Y) ∈ 𝒯, P_T(Y | X) = P_S(Y | X)
  • Further restrictions on 𝒯 to follow
  • Covariate shift is easy if learning P_S(Y | X) is easy
  • We focus on tasks where it's hard
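One way to write the learning goal on this slide formally, with ℓ denoting the loss (my phrasing of the slide, not a quote):

```latex
\min_{h \in \mathcal{H}} \;\; \sup_{P_T(X, Y) \in \mathcal{T}} \;
\mathbb{E}_{(X, Y) \sim P_T}\big[\ell(Y, h(X))\big],
\qquad \text{where } P_T(Y \mid X) = P_S(Y \mid X) \ \text{for all } P_T \in \mathcal{T}.
```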

  48. Unsupervised covariate shift
  • A model that works well even when the underlying distribution of instances changes
  • Works as long as P(Y | X) is stable
  • When does this happen?

  49. Causal mechanisms are stable

  50. Learning with an independence criterion
  • X causes Y; structural causal model: Y = g*(X) + ε, ε ⫫ X
  • g*(x) is the mechanism tying X to Y
  • ε is independent additive noise
  • Therefore, Y − g*(X) ⫫ X
  • Mooij, Janzing, Peters & Schölkopf (2009): learn the structure of causal models by learning functions g such that Y − g(X) is approximately independent of X
  • Need a non-parametric measure of independence
  • The Hilbert-Schmidt independence criterion, HSIC

  51. Hilbert-Schmidt independence criterion: HSIC
  • Let X, Y be two metric spaces with a joint distribution P(X, Y)
  • F_X and F_Y are reproducing kernel Hilbert spaces on X and Y, induced by kernels k(·,·) and l(·,·) respectively
  • HSIC(X, Y) measures the degree of dependence between X and Y
  • Empirical version: sample (x_1, y_1), …, (x_n, y_n); denote (with some abuse of notation) by K the n × n kernel matrix on X and by L the n × n kernel matrix on Y
  • Empirical HSIC(X, Y; F_X, F_Y) = (1/(n−1)²) · tr(K H L H), where H is a centering matrix, H_ij = δ_ij − 1/n
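A compact sketch of the empirical estimator above; the Gaussian kernels and unit bandwidths are illustrative choices of mine, not prescribed by the slide:

```python
import numpy as np

def gaussian_gram(u: np.ndarray, sigma: float = 1.0) -> np.ndarray:
    """n x n Gaussian kernel matrix over the rows of u."""
    sq_dists = np.sum((u[:, None, :] - u[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def hsic(x, y, sigma_x: float = 1.0, sigma_y: float = 1.0) -> float:
    """Biased empirical HSIC: tr(K H L H) / (n - 1)^2 (Gretton et al., 2005)."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    y = np.asarray(y, dtype=float).reshape(len(y), -1)
    n = x.shape[0]
    K = gaussian_gram(x, sigma_x)
    L = gaussian_gram(y, sigma_y)
    H = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    return float(np.trace(K @ H @ L @ H)) / (n - 1) ** 2
```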

  52. Learning with HSIC
  • Hypothesis class ℋ
  • Classic learning for a loss ℓ, e.g. the squared loss: min_{h∈ℋ} 𝔼[ℓ(Y, h(X))]
  • Learning with HSIC (Mooij et al., 2009): min_{h∈ℋ} HSIC(X, Y − h(X); F_X, F_Y)

  53. Learning with HSIC
  • Learning with HSIC (Mooij et al., 2009): min_{h∈ℋ} HSIC(X, Y − h(X); F_X, F_Y)
  • Recall: Y − g*(X) ⫫ X
  • If the objective equals 0, then h*(X) = g*(X) + c for some constant c
  • Can learn up to an additive bias term

  54. Learning with HSIC
  • Learning with HSIC (Mooij et al., 2009): min_{h∈ℋ} HSIC(X, Y − h(X); F_X, F_Y)
  • Differentiable with respect to h(X)
  • We optimize with SGD, using mini-batches to approximate HSIC
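A minimal sketch of that training scheme: mini-batch SGD on a differentiable HSIC between the inputs X and the residuals Y − h(X). The Gaussian kernel, the network, the batch size, and the placeholder data are illustrative assumptions, not the setup from the paper.

```python
import torch

def rbf_gram(u: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    return torch.exp(-torch.cdist(u, u) ** 2 / (2.0 * sigma ** 2))

def hsic_loss(x: torch.Tensor, resid: torch.Tensor) -> torch.Tensor:
    """Differentiable biased HSIC estimate on a mini-batch."""
    n = x.shape[0]
    H = torch.eye(n) - torch.ones(n, n) / n
    return torch.trace(rbf_gram(x) @ H @ rbf_gram(resid) @ H) / (n - 1) ** 2

# Toy usage: fit h so that Y - h(X) becomes (approximately) independent of X.
model = torch.nn.Sequential(torch.nn.Linear(5, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
X, Y = torch.randn(1024, 5), torch.randn(1024, 1)        # placeholder data
for _ in range(200):
    idx = torch.randint(0, X.shape[0], (128,))           # mini-batch approximates HSIC
    xb, yb = X[idx], Y[idx]
    loss = hsic_loss(xb, yb - model(xb))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```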

  55. Theoretical results
  • Learnability: minimizing the HSIC loss over a sample leads to generalization
  • Robustness: minimizing the HSIC loss leads to tightly bounded error under unsupervised covariate shift, if the density ratio p_target(x) / p_source(x) is "nice" in the sense of having low RKHS norm

  56. Experiments – rotated MNIST (Heinze-Deml & Meinshausen, 2017)
  • Train on ordinary MNIST
  • Test on MNIST rotated uniformly at random in [−45°, 45°]
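This test-time shift is easy to reproduce; for instance with torchvision (an illustrative setup of mine, not the paper's exact evaluation code):

```python
from torchvision import datasets, transforms

# Train on ordinary MNIST; evaluate on digits rotated uniformly in [-45, 45] degrees.
train_set = datasets.MNIST("data", train=True, download=True,
                           transform=transforms.ToTensor())
rotated_test_set = datasets.MNIST("data", train=False, download=True,
                                  transform=transforms.Compose([
                                      transforms.RandomRotation(45),   # uniform in [-45, 45]
                                      transforms.ToTensor(),
                                  ]))
```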

  57. Experiments – rotated MNIST (Heinze-Deml & Meinshausen, 2017): results
  [Figure: test accuracy (60–100%) of HSIC training vs. cross-entropy training for a CNN and MLPs of varying size (2×256 up to 4×1024), evaluated on both the source (unrotated) and target (rotated) distributions]


  59. Outline
  • ML for causal inference
  • Causal inference for ML
    • Off-policy evaluation in a partially observable Markov decision process
    • Robust learning for unsupervised covariate shift

  60. Summary
  • Machine learning for causal inference:
    • Individual-level treatment effects from observational data, with robustness to the treatment assignment process
  • Using the recently proposed "negative controls" to create the first off-policy evaluation scheme for POMDPs, with past and future observations in the role of the controls
  • Learning models robust against unknown covariate shift

  61. Thank you to all my collaborators! โ€ข Fredrik Johansson (Chalmers) โ€ข David Sontag (MIT) โ€ข Nathan Kallus (Cornell-Tech) โ€ข Guy Tennenholtz (Technion) โ€ข Shie Mannor (Technion) โ€ข Daniel Greenfeld (Technion)

  62. Even estimating average effects from observational data is hard! Do we believe we can estimate individual-level effects?
  • Causal identification assumptions:
    • Hidden confounding: no unmeasured factors that affect both treatment and outcome
    • Common support: the T = 1 and T = 0 populations should be similar
    • Accurate effect estimates: be able to approximate 𝔼[Y | x, T = t]
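One standard formalization of these three assumptions, in potential-outcome notation Y(0), Y(1) (my notation, consistent with the slide):

```latex
\text{no hidden confounding:}\quad (Y(0), Y(1)) \,\perp\, T \mid x \\
\text{common support:}\quad 0 < p(T = 1 \mid x) < 1 \ \ \text{for all } x \\
\text{estimation:}\quad \widehat{\mathbb{E}}[Y \mid x, T = t] \approx \mathbb{E}[Y \mid x, T = t]
```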

  63. Even estimating average effects from observational data is hard! Do we believe we can estimate individual-level effects?
  • Causal identification assumptions:
    • Hidden confounding
    • Common support
    • Accurate effect estimates
  • We focus on tasks where we hope we can address all three concerns
  • And still be useful
  • Designing for causal identification

  64. You have condition A. Treatment options are T=0, T=1

  65. Obviously, give T=0. No need for algorithmic decision support.

  66. Obviously, give T=0. Obviously, give T=1. No need for algorithmic decision support.

  67. Obviously, give T=0. Obviously, give T=1. I'm not so sure…
