
Causal Models for Scientific Discovery: Research Challenges and Opportunities



  1. Causal Models for Scientific Discovery: Research Challenges and Opportunities
 David Jensen
 College of Information and Computer Sciences, Computational Social Science Institute, Center for Data Science
 University of Massachusetts Amherst
 Symposium on Accelerating Science, 18 November 2016

  2. Sources: The Guardian, July 2005; Wallace Kirkland, for Time

  3. Sources: Wikipedia (pile); Argonne National Laboratory (Fermi)

  4. Main points
 • Representing and reasoning about causality is central to science and scientific discovery.
 • Understanding of causal inference has advanced tremendously in the past 25 years through the work of several disparate research communities.
 • Several emerging opportunities and challenges exist:
   • Expressiveness: Combining data and knowledge from multiple sources to understand complex phenomena
   • Critique: Inferring errors in modeling assumptions or problem construction
   • Empirical evaluation: Providing realistic empirical tests of methods for causal modeling

  5. Causality is central to science

  6. Explanation ⇒ Causality
 • Explanation is a central activity in science. Effective theories explain previously unexplained phenomena.
 • Effective explanations generally take the form of a counterfactual (“What would have happened if conditions had been different?”).
 • “…explanatory relationships are relationships that are potentially exploitable for purposes of manipulation and control.”

  7. Control & design ⇒ Causality Sources: Wikipedia (pile)

  8. Models
 • Because explanation and control both depend on causality, “models” in most scientific fields have causal implications: they are used to infer how a system would behave under intervention.
 • In contrast, most “models” in machine learning and statistics have been defined as having only associational semantics.
 • This leads to substantial confusion among researchers from other fields when they first encounter machine learning methods.

  9. Progress in causal modeling
 • An explicit theory of causal inference has been worked out over the past 20 years by a small group of computer scientists, philosophers, and statisticians.
 • The theory uses directed graphical models to represent causal dependence among variables.
 • That theory provides a formal correspondence between causal models and their observable statistical implications. This correspondence has been exploited to produce a number of algorithms for reasoning with causal graphical models (CGMs).
 (Pearl 2000, 2009; Spirtes, Glymour, and Scheines 1993, 2001)

  10. Key concepts
 • Only statistical dependence is directly observable in data. Causal dependence is not observable.
 • Statistical dependence underdetermines causal dependence (“correlation is not causation”).
 • The observable statistical consequences of a given causal model can be inferred from structure (d-separation).
 • Multiple causal structures produce the same observed statistical dependencies (Markov equivalence).
 • However, some combinations of conditional independence and known causal dependence imply constraints on the space of causal structures, and some uniquely identify causal structures.
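
 The key concepts on this slide can be illustrated with a small simulation. The sketch below is not from the talk: it assumes linear-Gaussian mechanisms with unit coefficients, and the helper names (`simulate`, `partial_corr`, `summarize`) are hypothetical. It generates data from a chain (X → Y → Z), a fork (X ← Y → Z), and a collider (X → Y ← Z), then compares the marginal correlation of X and Z with their partial correlation given Y. Under these assumptions the chain and the fork should show the same independence pattern (Markov equivalence), while the collider should show the opposite pattern, which is what lets it be distinguished from the other two.

```python
import numpy as np


def simulate(structure, n=50_000, seed=0):
    """Draw n samples of (x, y, z) from a three-variable linear-Gaussian model."""
    rng = np.random.default_rng(seed)
    e1, e2, e3 = rng.normal(size=(3, n))
    if structure == "chain":        # X -> Y -> Z
        x = e1
        y = x + e2
        z = y + e3
    elif structure == "fork":       # X <- Y -> Z
        y = e1
        x = y + e2
        z = y + e3
    elif structure == "collider":   # X -> Y <- Z
        x = e1
        z = e3
        y = x + z + e2
    else:
        raise ValueError(f"unknown structure: {structure}")
    return x, y, z


def partial_corr(x, z, y):
    """Correlation of x and z after linearly regressing y out of both."""
    rx = x - np.polyval(np.polyfit(y, x, 1), y)
    rz = z - np.polyval(np.polyfit(y, z, 1), y)
    return float(np.corrcoef(rx, rz)[0, 1])


def summarize(structure):
    """Return (marginal corr of X and Z, partial corr of X and Z given Y)."""
    x, y, z = simulate(structure)
    return float(np.corrcoef(x, z)[0, 1]), partial_corr(x, z, y)


if __name__ == "__main__":
    for name in ("chain", "fork", "collider"):
        marginal, partial = summarize(name)
        print(f"{name:9s} corr(X,Z)={marginal:+.3f}  corr(X,Z|Y)={partial:+.3f}")
```

 Partial correlation is used here only because the simulated mechanisms are linear and Gaussian; for other functional forms a different conditional independence test would be needed.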

  11. Main points
 • Representing and reasoning about causality is central to science and scientific discovery.
 • Understanding of causal inference has advanced tremendously in the past 25 years through the work of several disparate research communities.
 • Several emerging opportunities and challenges exist:
   • Expressiveness: Combining data and knowledge from multiple sources to understand complex phenomena
   • Critique: Inferring errors in modeling assumptions or problem construction
   • Empirical evaluation: Providing realistic empirical tests of methods for causal modeling

  12. Expressiveness

  13. Source: Honavar, Hill, & Yelick (2016), Accelerating Science: A Computing Research Agenda

  14. Source: Honavar, Hill, & Yelick (2016), Accelerating Science: A Computing Research Agenda

  15. [Figure: diagram relating Machine Learning, Manual Scientific Practice, Causal Analysis, Automated Discovery, Causal Discovery, and Relational, Temporal and Spatial Models. Annotations: “Rarely searches large spaces of formally represented models”; “Rarely analyzes causal dependence”; “Rarely discovers relational, temporal, or spatial models”]

  16. Causal models of independent outcomes
 [Figure: a Causal Process paired with Outcome Variables A, B, ..., Z]

  17. Causal models of independent outcomes
 [Figure: graph over variables A through J]

  18. Key assumption of simple CGMs
 [Figure: a Causal Process paired with Outcome Variables A, B, ..., Z]

  19. Key assumption of simple CGMs
 [Figure: a Causal Process paired with Multiple Dependent Outcomes, marked “?” and “x”]

  20. Causal models of independent outcomes
 [Figure: graph over variables A through J]

  21. Causal models of dependent outcomes
 [Figure: graph over variables labeled A through T]
 (Friedman, Getoor, Koller, & Pfeffer 1999; Heckerman, Meek, & Koller 2007; Maier, Marazopoulou, and Jensen 2013)

  22. (Maier, Marazopoulou, and Jensen 2013)

  23. (Maier, Marazopoulou, and Jensen 2013)

  24. (Maier, Marazopoulou, and Jensen 2013)

  25. Causal models of general processes
 bool c1, c2;
 int count = 0;
 c1 = Bernoulli(0.5);
 if (c1==true) then
     count = count + 1;
 c2 = Bernoulli(0.5);
 if (c2==true) then
     count = count + 1;
 observe(c1==true||c2==true);
 return(count);
 [Figure labels: Causal Process, Probabilistic Program]
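
 The probabilistic program on this slide can be evaluated by hand or by brute force. The sketch below is an illustration rather than part of the talk: it assumes that `observe` conditions on its argument by discarding executions in which the argument is false, and the function name `posterior_of_count` is hypothetical. It enumerates the four joint outcomes of the two fair coin flips and renormalizes over the outcomes that survive the observation.

```python
from fractions import Fraction
from itertools import product


def posterior_of_count():
    """Enumerate every execution of the program and condition on the observe."""
    half = Fraction(1, 2)  # each Bernoulli(0.5) outcome has probability 1/2
    weights = {}
    for c1, c2 in product([False, True], repeat=2):
        # observe(c1 || c2): executions that violate the condition are dropped
        if not (c1 or c2):
            continue
        count = int(c1) + int(c2)
        weights[count] = weights.get(count, 0) + half * half
    total = sum(weights.values())  # probability mass that passes the observe
    return {count: w / total for count, w in weights.items()}


if __name__ == "__main__":
    for count, prob in sorted(posterior_of_count().items()):
        print(f"P(count = {count} | c1 or c2) = {prob}")
```

 Exact enumeration works here only because the program has two binary random choices; larger programs generally need sampling or other approximate inference.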

  26. Critique

  27. “[To support science, we would expect] that two different kinds of inferential process would be required to put it into effect. The first, used in estimating parameters from data conditional on the truth of some tentative model, is appropriately called Estimation. The second, used in checking whether, in the light of the data, any model of the kind proposed is plausible, has been aptly named… Criticism.” (George Box, emphasis added)

  28. Example assumptions
 • Faithfulness
 • Causal Markov assumption
 • Definitions of variables, entities, relationships, etc.
 • Measurement process
 • Temporal granularity of measurement
 • Latent variables, entities, relationships, etc.
 • Structural form of causal dependence
 • Functional form of probabilistic dependence
 • Compositional form
 • Closed world (or form of open world)
 • …and many others
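
 To make one of these assumptions concrete, the sketch below shows a case where faithfulness fails. It is an illustration, not an example from the talk: it assumes a linear-Gaussian model in which X affects Y directly (coefficient +1) and through a mediator M (X → M with coefficient +1, M → Y with coefficient -1), so the two paths cancel. The function names are hypothetical. Under these assumptions X and Y should be nearly uncorrelated even though X is a direct cause of Y, and a regression of Y on X and M should recover the direct effect.

```python
import numpy as np


def simulate_cancelling_paths(n=100_000, seed=0):
    """Sample a model whose direct and indirect effects of X on Y cancel."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    m = x + rng.normal(size=n)        # X -> M, coefficient +1
    y = x - m + rng.normal(size=n)    # X -> Y (+1) and M -> Y (-1)
    return x, m, y


def marginal_corr(x, y):
    """Pearson correlation between x and y."""
    return float(np.corrcoef(x, y)[0, 1])


def direct_effect_of_x(x, m, y):
    """Least-squares coefficient on x when y is regressed on x and m."""
    design = np.column_stack([x, m, np.ones_like(x)])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return float(coef[0])


if __name__ == "__main__":
    x, m, y = simulate_cancelling_paths()
    print(f"corr(X, Y)              = {marginal_corr(x, y):+.3f}")
    print(f"coef of X in Y ~ X + M  = {direct_effect_of_x(x, m, y):+.3f}")
```

 A method that reads the absence of correlation as the absence of causal dependence would drop the X → Y edge here, which is the kind of modeling error that critique is meant to surface.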

  29. Empirical evaluation

  30. Goals for Empirical Evaluation Approaches
 • Empirical: A pre-existing system created by someone other than the researchers.
 • Stochastic: Produces non-deterministic experimental results.
 • Identifiable: Amenable to direct experimental investigation to estimate interventional distributions.
 • Recoverable: Lacks memory or irreversible effects, which enables complete state recovery during experiments.
 • Efficient: Generates large amounts of data with relatively few resources.
 • Reproducible: Fairly easy to recreate nearly identical data sets without access to one-of-a-kind hardware or software.

  31. Simple example: Database configuration

  32. ML for database configuration (setup)
 • Assume a fixed database and DB server hardware
 • Questions
   • For a given query, what is the expected performance under each set of configuration parameters?
   • For a given query, which configuration will give me the best performance?
 • Data
   • 11,252 queries that were actually run against the Stack Exchange Data Explorer
   • Each query was run using one of many different joint values of the configuration parameters, using Postgres 9.2.2
 (Garant & Jensen 2016)

  33. CGM for database configuration
 [Figure: causal graphical model. Node labels as recoverable from the slide: Retrieved Row Count, Indexing, Page Cost, Memory Level, Join Count, Table Count, Length, Block Hits in Cache, Block Writes to RAM, Group-by Count, Total Row Count, Block Reads from Disk, Block Reads from RAM, Year Created, Total Queries by User, Runtime]

  34. CGM for database configuration
 [Figure: build of the graph from slide 33 with the same node labels, adding the group labels Query, Database, User, and Processing]

  35. CGM for database configuration
 [Figure: further build of the graph from slide 33 with the same node labels and the group labels Query, Database, User, and Processing]
