Foundations of Causal Discovery
Frederick Eberhardt, KDD Causality Workshop 2016

Causal discovery: from a data sample (observations of variables x, y, z, w) to causal structure, under assumptions such as the causal Markov condition, causal faithfulness, and functional-form assumptions.


  1. The constraint set {¬(x ⊥⊥ z), ¬(y ⊥⊥ z), x ⊥⊥ y} is sufficient to determine the equivalence class; in this case, a unique causal graph, the collider x → z ← y. [Figure: the candidate causal graphs over x, y, z that the constraints successively rule out.] For linear Gaussian and for multinomial causal relations, an algorithm that identifies the Markov equivalence class of the true model is complete. (Pearl & Geiger 1988, Meek 1995)
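
  A minimal sketch (not from the slides) of how such constraints are read off data: it simulates the collider x → z ← y and checks the constraint pattern with simple (partial) correlation tests. The coefficients and the 0.05 decision threshold are arbitrary illustrations, not calibrated statistical tests.

  ```python
  import numpy as np

  rng = np.random.default_rng(0)
  n = 20_000

  # Ground truth: a collider x -> z <- y (x and y are independent causes of z).
  x = rng.normal(size=n)
  y = rng.normal(size=n)
  z = 0.8 * x + 0.8 * y + rng.normal(size=n)

  def corr(a, b):
      return float(np.corrcoef(a, b)[0, 1])

  def partial_corr(a, b, c):
      # Correlation of a and b after linearly regressing out c (adequate in the linear Gaussian case).
      ra = a - np.polyval(np.polyfit(c, a, 1), c)
      rb = b - np.polyval(np.polyfit(c, b, 1), c)
      return corr(ra, rb)

  print("x,z dependent:    ", abs(corr(x, z)) > 0.05)             # True
  print("y,z dependent:    ", abs(corr(y, z)) > 0.05)             # True
  print("x,y independent:  ", abs(corr(x, y)) < 0.05)             # True (marginally)
  print("x,y | z dependent:", abs(partial_corr(x, y, z)) > 0.05)  # True: conditioning on the collider induces dependence
  ```

  The three marginal constraints already single out the collider; the conditional check is included only to show the tell-tale dependence that conditioning on a common effect creates.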

  2.-5. Staying in business
  • Weaken the assumptions (and increase the equivalence class): allow for unmeasured common causes, allow for cycles, weaken faithfulness (Zhalama talk)
  • Exclude the limitations (and reduce the equivalence class): restrict to non-Gaussian error distributions (Tank talk), restrict to non-linear causal relations, restrict to specific discrete parameterizations
  • Include more general data collection set-ups (and see how assumptions can be adjusted and what equivalence class results): experimental evidence, multiple (overlapping) data sets, relational data

  6. Limitations. For linear Gaussian and for multinomial causal relations, an algorithm that identifies the Markov equivalence class of the true model is complete. (Pearl & Geiger 1988, Meek 1995)

  7.-8. Linear non-Gaussian method (LiNGaM)
  • Linear causal relations: x_j = Σ_{x_i ∈ Pa(x_j)} b_ji x_i + ε_j
  • Assumptions: causal Markov, causal sufficiency, acyclicity
  ‣ If the errors ε_j are non-Gaussian, then the true graph is uniquely identifiable from the joint distribution. [Shimizu et al., 2006]
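
  To make the model class concrete, here is a small generative sketch (not from the slides; the graph x → y → w with x → w and the coefficients are made-up examples): each variable is a linear function of its parents plus an independent non-Gaussian error, sampled in topological order.

  ```python
  import numpy as np

  rng = np.random.default_rng(1)
  n = 10_000

  # Hypothetical DAG x -> y -> w, x -> w. Non-Gaussian (here uniform) errors
  # are what make this model uniquely identifiable.
  eps_x = rng.uniform(-1, 1, n)
  eps_y = rng.uniform(-1, 1, n)
  eps_w = rng.uniform(-1, 1, n)

  x = eps_x                          # Pa(x) = {}
  y = 0.9 * x + eps_y                # Pa(y) = {x}
  w = 0.5 * x - 0.7 * y + eps_w      # Pa(w) = {x, y}

  data = np.column_stack([x, y, w])  # one row per sample, as in the data-sample slide
  ```

  An ICA-based LiNGaM estimator run on such data would, in the large-sample limit, recover both the causal order and the coefficient matrix; the sketch only shows what the structural equations and assumptions describe.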

  9.-14. Two variable case
  True model: y = βx + ε_y with x ⊥⊥ ε_y (graph x → y, errors ε_x, ε_y).
  Backwards model: x = θy + ε̃_x, which would require y ⊥⊥ ε̃_x.
  But ε̃_x = x − θy = x − θ(βx + ε_y) = (1 − θβ)x − θε_y.

  15.-16. Why Normals are unusual
  Forwards model: y = βx + ε_y. Backwards model residual: ε̃_x = (1 − θβ)x − θε_y. Can y and ε̃_x be independent?
  Theorem (Darmois-Skitovich): Let X_1, ..., X_n be independent, non-degenerate random variables. If two linear combinations l_1 = a_1 X_1 + ... + a_n X_n and l_2 = b_1 X_1 + ... + b_n X_n, with all a_i ≠ 0 and all b_i ≠ 0, are independent, then each X_i is normally distributed.
  Since y = βx + ε_y and ε̃_x = (1 − θβ)x − θε_y are both linear combinations of the independent variables x and ε_y (with non-zero coefficients whenever β, θ ≠ 0 and θβ ≠ 1), they can only be independent if x and ε_y are Gaussian; otherwise the backwards model is ruled out.
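
  The asymmetry can be seen numerically with a minimal sketch (this is not the actual LiNGaM procedure, which estimates the model via ICA): fit ordinary least squares in both directions and score the dependence between regressor and residual with a crude fourth-moment statistic. The coefficients, the uniform noise, and the dependence score are illustrative choices.

  ```python
  import numpy as np

  rng = np.random.default_rng(0)
  n = 100_000

  # True model: x -> y, linear with *non-Gaussian* (uniform) noise.
  x = rng.uniform(-1, 1, n)
  y = 2.0 * x + rng.uniform(-1, 1, n)

  def residual_dependence(cause, effect):
      """OLS-fit effect ~ cause and return a crude dependence score between the
      regressor and the residual (correlation of their squares). The score is
      close to zero when regressor and residual are independent."""
      slope, intercept = np.polyfit(cause, effect, 1)
      resid = effect - (slope * cause + intercept)
      return abs(np.corrcoef(cause**2, resid**2)[0, 1])

  print("forward  x -> y:", residual_dependence(x, y))   # close to 0: residuals independent of x
  print("backward y -> x:", residual_dependence(y, x))   # clearly > 0: residuals depend on y
  ```

  When regressor and residual are truly independent the score tends to zero; for the uniform noise used here the backwards fit cannot achieve that, while with Gaussian noise both directions would score near zero, as the next slides show.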

  17.-19. Algorithms and their assumptions
  • PC / GES: Markov ✓, faithfulness ✓, causal sufficiency ✓, acyclicity ✓, parametric assumption: none; output: Markov equivalence class
  • FCI: Markov ✓, faithfulness ✓, causal sufficiency ✗, acyclicity ✓, parametric assumption: none; output: PAG
  • CCD: Markov ✓, faithfulness ✓, causal sufficiency ✓, acyclicity ✗, parametric assumption: none; output: PAG
  • LiNGaM: Markov ✓, faithfulness ✗, causal sufficiency ✓, acyclicity ✓, parametric assumption: linear non-Gaussian; output: unique DAG
  • lvLiNGaM: Markov ✓, faithfulness ✓, causal sufficiency ✗, acyclicity ✓, parametric assumption: linear non-Gaussian; output: set of DAGs
  • cyclic LiNGaM: Markov ✓, faithfulness ~, causal sufficiency ✓, acyclicity ✗, parametric assumption: linear non-Gaussian; output: set of graphs

  20.-21. Limitations. For linear Gaussian and for multinomial causal relations, an algorithm that identifies the Markov equivalence class of the true model is complete. (Pearl & Geiger 1988, Meek 1995)

  22.-23. Bivariate linear Gaussian case
  True model: x = ε_x, y = x + ε_y, with ε_x, ε_y ∼ independent Gaussian.
  [Figure (from Hoyer et al. 2009): scatter of y against x with the forward conditional p(y|x) and the backward conditional p(x|y); both a forwards and a backwards model fit the data.]
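
  The non-identifiability can be checked with the same crude score as in the earlier sketch (again an illustration under arbitrary parameters, not a formal test): with Gaussian noise both regression directions leave residuals that look independent of the regressor.

  ```python
  import numpy as np

  rng = np.random.default_rng(0)
  n = 100_000

  # Linear *Gaussian* case from the slide: x = eps_x, y = x + eps_y.
  x = rng.normal(size=n)
  y = x + rng.normal(size=n)

  def residual_dependence(cause, effect):
      # Same crude score as before: correlation between squared regressor and squared OLS residual.
      slope, intercept = np.polyfit(cause, effect, 1)
      resid = effect - (slope * cause + intercept)
      return abs(np.corrcoef(cause**2, resid**2)[0, 1])

  print("forward  x -> y:", residual_dependence(x, y))   # ~0
  print("backward y -> x:", residual_dependence(y, x))   # also ~0: both directions admit an additive model
  ```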

  24.-27. Continuous additive noise models
  x_j = f_j(pa(x_j)) + ε_j
  • If f_j(·) is linear, then non-Gaussian errors are required for identifiability.
  ➡ What if the errors are Gaussian, but f_j(·) is non-linear?
  ➡ More generally, under what circumstances is the causal structure represented by this class of models identifiable?

  28.-31. Bivariate non-linear Gaussian additive noise model
  True model: x = ε_x, y = x + x³ + ε_y, with ε_x, ε_y ∼ independent Gaussian.
  [Figure (from Hoyer et al. 2009): forward conditional p(y|x) and backward conditional p(x|y); any backwards model x = g(y) + ε̃_x leaves residuals that are not independent of y.]
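
  A rough numerical sketch of this asymmetry (illustrative only; Hoyer et al. use Gaussian-process regression and an HSIC independence test rather than the polynomial fit and rank correlation used here): regress each variable on the other with a flexible polynomial and check whether the residual magnitude varies with the regressor.

  ```python
  import numpy as np
  from scipy.stats import spearmanr

  rng = np.random.default_rng(0)
  n = 5_000

  # True model from the slide: x -> y with a non-linear mechanism and Gaussian noise.
  x = rng.normal(size=n)
  y = x + x**3 + rng.normal(size=n)

  def residual_dependence(cause, effect, degree=7):
      """Fit a polynomial regression effect ~ cause and measure how strongly the
      residual magnitude varies with the regressor (near zero if independent)."""
      c = (cause - cause.mean()) / cause.std()   # standardize to keep the fit well-conditioned
      coefs = np.polyfit(c, effect, degree)
      resid = effect - np.polyval(coefs, c)
      rho, _ = spearmanr(np.abs(c), np.abs(resid))
      return abs(rho)

  print("forward  x -> y:", residual_dependence(x, y))   # ~0: residuals look like the independent Gaussian noise
  print("backward y -> x:", residual_dependence(y, x))   # clearly > 0: residual spread shrinks for large |y|
  ```

  The backwards residuals shrink where the cubic is steep (large |y|), so their magnitude is clearly related to y, whereas the forward residuals are just the independent Gaussian noise.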

  32.-37. General non-linear additive noise models
  Hoyer Condition (HC): a technical condition on the relation between the function, the noise distribution and the parent distribution that, if satisfied, permits a backward model.
  • If the error terms are Gaussian, then the only functional form that satisfies HC is linearity; otherwise the model is identifiable.
  • If the errors are non-Gaussian, then there are (rather contrived) functions that satisfy HC, but in general identifiability is guaranteed.
  - This generalizes to multiple variables (assuming minimality*).
  - There is an extension to discrete additive noise models.
  • If the function is linear but the error terms are non-Gaussian, then one cannot fit a linear backwards model (LiNGaM), but there are cases where one can fit a non-linear backwards model.

  38.-39. Algorithms and their assumptions, extended with additive noise models
  • PC / GES: Markov ✓, faithfulness ✓, causal sufficiency ✓, acyclicity ✓, parametric assumption: none; output: Markov equivalence class
  • FCI: Markov ✓, faithfulness ✓, causal sufficiency ✗, acyclicity ✓, parametric assumption: none; output: PAG
  • CCD: Markov ✓, faithfulness ✓, causal sufficiency ✓, acyclicity ✗, parametric assumption: none; output: PAG
  • LiNGaM: Markov ✓, faithfulness ✗, causal sufficiency ✓, acyclicity ✓, parametric assumption: linear non-Gaussian; output: unique DAG
  • lvLiNGaM: Markov ✓, faithfulness ✓, causal sufficiency ✗, acyclicity ✓, parametric assumption: linear non-Gaussian; output: set of DAGs
  • cyclic LiNGaM: Markov ✓, faithfulness ~, causal sufficiency ✓, acyclicity ✗, parametric assumption: linear non-Gaussian; output: set of graphs
  • non-linear additive noise: Markov ✓, faithfulness replaced by minimality, causal sufficiency ✓, acyclicity ✓, parametric assumption: non-linear additive noise; output: unique DAG

  40.-51. Experiments, Background Knowledge and all the other Jazz
  • How to integrate data from experiments? [Figure: a causal graph over x, y, z with intervention variables l_1, l_2.]
  • How to include background knowledge? E.g. known pathways (such as a pathway from x to w), tier orderings (e.g. z < y), and "priors" over edges.
  • Specific search space restrictions, e.g. biological settings and subsampled time series (Tank talk).

  52.-59. High-Level
  • Data: one or more (possibly overlapping) samples, e.g. one over x, y, z, w and another over w, x, y.
  • Assumptions, e.g. causal Markov, causal faithfulness, etc.
  • Background knowledge, e.g. pathways, tier orderings, "priors", etc.
  • Setting, e.g. time series, internal latent structures, etc.
  • From the data, estimate (in)dependence constraints, e.g. ¬(x ⊥⊥ y | C || J), i.e. dependence given conditioning set C in experimental setting J.
  ➡ Encode all of these as logical constraints on the underlying graph structure and feed them to a (max) SAT-solver.

  60.-64. SAT-based Causal Discovery
  • Formulate the independence constraints in propositional logic, e.g. x ⊥⊥ y ⇔ ¬A ∧ ¬B ∧ ..., where A = 'x → y is present'.
  • Encode the constraints into one formula: ¬A ∧ ¬B ∧ ¬(C ∧ D) ∧ ¬...
  • Find satisfying assignments using a SAT-solver: A = false, B = false, ... ⇔ a candidate graph over x, y, z.
  ➡ Very general setting (allows for cycles and latents) and trivially complete.
  ➡ BUT: erroneous test results induce conflicting constraints: UNsatisfiable.
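
  A toy version of the same idea (this replaces the SAT-solver with brute-force enumeration, which only works at this tiny scale, and hard-codes d-separation for three variables; the variable names and constraints follow the collider example from the beginning of the deck): enumerate all DAGs over x, y, z and keep those whose d-separation relations match the given (in)dependence constraints.

  ```python
  from itertools import product

  nodes = ("x", "y", "z")
  pairs = [("x", "y"), ("x", "z"), ("y", "z")]

  def all_dags():
      """Every DAG over three nodes: each pair carries no edge, a forward edge,
      or a backward edge; the two possible directed 3-cycles are discarded."""
      for states in product((None, ">", "<"), repeat=3):
          edges = set()
          for (a, b), s in zip(pairs, states):
              if s == ">":
                  edges.add((a, b))
              elif s == "<":
                  edges.add((b, a))
          if {("x", "y"), ("y", "z"), ("z", "x")} <= edges:
              continue
          if {("x", "z"), ("z", "y"), ("y", "x")} <= edges:
              continue
          yield frozenset(edges)

  def d_separated(g, a, b, cond):
      """d-separation, hard-coded for the 3-node case (conditioning set of size 0 or 1)."""
      if (a, b) in g or (b, a) in g:
          return False                  # adjacent nodes are never d-separated
      (c,) = set(nodes) - {a, b}        # the only possible intermediate node
      a_c = (a, c) in g or (c, a) in g
      b_c = (b, c) in g or (c, b) in g
      if not (a_c and b_c):
          return True                   # no path between a and b at all
      collider = (a, c) in g and (b, c) in g
      path_active = (c in cond) if collider else (c not in cond)
      return not path_active

  # (In)dependence constraints as estimated from data: (a, b, conditioning set, independent?)
  constraints = [
      ("x", "z", frozenset(), False),   # x and z dependent
      ("y", "z", frozenset(), False),   # y and z dependent
      ("x", "y", frozenset(), True),    # x and y independent
  ]

  consistent = [g for g in all_dags()
                if all(d_separated(g, a, b, cond) == indep for a, b, cond, indep in constraints)]
  print(consistent)   # one graph remains: edges x -> z and y -> z, i.e. the collider
  ```

  Requiring d-separation to match independence exactly encodes both the causal Markov and faithfulness assumptions from the earlier slides. If tests err and the constraint list contains, say, both x ⊥⊥ y | z and its negation, the consistent set comes back empty; that is the unsatisfiability problem on the next slide, and the reason for moving to a weighted (MaxSAT-style) formulation that keeps the graphs violating the least total constraint weight.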

  65. Conflicts and Errors
  • Statistical independence tests produce errors.
  ➡ Conflict: no graph can produce the resulting set of constraints, e.g. ¬(x ⊥⊥ y), ¬(x ⊥⊥ z), ¬(y ⊥⊥ z), x ⊥⊥ y | z together with ¬(x ⊥⊥ y | z).
