Causal Models for Scientific Discovery: Research Challenges and Opportunities. David Jensen, College of Information and Computer Sciences, Computational Social Science Institute, Center for Data Science, University of Massachusetts Amherst. Symposium on Accelerating Science, 18 November 2016
Sources: The Guardian, July 2005; Wallace Kirkland, for Time
Sources: Wikipedia (pile); Argonne National Laboratory (Fermi)
Main points • Representing and reasoning about causality is central to science and scientific discovery. • Understanding of causal inference has advanced tremendously in the past 25 years through the work of several disparate research communities. • Several emerging opportunities and challenges exist: • Expressiveness — Combining data and knowledge from multiple sources to understand complex phenomena • Critique — Inferring errors in modeling assumptions or problem construction • Empirical evaluation — Providing realistic empirical tests of methods for causal modeling
Causality is central to science
Explanation ⇒ Causality • Explanation is a central activity in science. Effective theories explain previously unexplained phenomena • Effective explanations generally take the form of a counterfactual (“What would have happened if conditions had been different?”). • “…explanatory relationships are relationships that are potentially exploitable for purposes of manipulation and control.”
Control & design ⇒ Causality Sources: Wikipedia (pile)
Models • Because of this, “models” in most scientific fields have causal implications: they let us infer how a system would behave under intervention. • In contrast, most “models” in machine learning and statistics have been defined as having only associational semantics. • This leads to substantial confusion among researchers from other fields when they first encounter machine learning methods.
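The gap between associational and causal semantics can be illustrated with a small simulation. This is a minimal sketch, not taken from the talk: the toy model (a hidden common cause Z of X and Y, with no edge from X to Y), the probabilities, and the function names `simulate` and `p_y_given_x` are all assumptions chosen for illustration. In this model the analytic associational gap P(Y|X=1) − P(Y|X=0) is 0.48, while the interventional gap is 0.

```python
import random

def simulate(n, do_x=None, seed=0):
    """Sample (x, y) pairs from a toy model Z -> X, Z -> Y (no X -> Y edge).

    If do_x is given, X is set by intervention and no longer depends on Z.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        z = rng.random() < 0.5                      # hidden common cause
        if do_x is None:
            x = rng.random() < (0.9 if z else 0.1)  # observational: X depends on Z
        else:
            x = do_x                                # intervention removes Z -> X
        y = rng.random() < (0.8 if z else 0.2)      # Y depends on Z only
        samples.append((x, y))
    return samples

def p_y_given_x(samples, x_value):
    """Empirical P(Y = true | X = x_value) in the given samples."""
    ys = [y for x, y in samples if x == x_value]
    return sum(ys) / len(ys)

# Associational question: how does Y differ between units observed with X=1 and X=0?
obs = simulate(100_000)
assoc_gap = p_y_given_x(obs, True) - p_y_given_x(obs, False)

# Causal question: how does Y differ when X is set to 1 versus set to 0?
causal_gap = (p_y_given_x(simulate(100_000, do_x=True, seed=1), True)
              - p_y_given_x(simulate(100_000, do_x=False, seed=2), False))

print(f"associational gap: {assoc_gap:.2f}")
print(f"interventional gap: {causal_gap:.2f}")
```

A purely associational model fit to the observational samples would predict that changing X changes Y; a model with causal semantics would not.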
Progress in causal modeling • An explicit theory of causal inference has been worked out over the past 20 years by a small group of computer scientists, philosophers, and statisticians. • The theory uses directed graphical models to represent causal dependence among variables. • That theory provides a formal correspondence between causal models and their observable statistical implications. This correspondence has been exploited to produce a number of algorithms for reasoning with causal graphical models (CGMs). (Pearl 2000, 2009; Spirtes, Glymour, and Scheines 1993, 2001)
Key concepts • Only statistical dependence is directly observable in data. Causal dependence is not observable. • Statistical dependence underdetermines causal dependence (“correlation is not causation”). • The observable statistical consequences of a given causal model can be inferred from its structure (d-separation). • Multiple causal structures produce the same observed statistical dependencies (Markov equivalence). • However, some combinations of conditional independence and known causal dependence imply constraints on the space of causal structures, and some uniquely identify causal structures.
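These concepts can be checked numerically on three-variable structures. The sketch below is an illustration under assumptions not in the talk: linear-Gaussian mechanisms with unit noise, and the hypothetical helper names `pearson`, `partial_corr`, `chain`, `fork`, and `collider`. The chain A → B → C and the fork A ← B → C are Markov equivalent (A and C are dependent, but independent given B), while the collider A → B ← C has the opposite pattern and is therefore distinguishable from data.

```python
import random

def pearson(a, b):
    """Sample Pearson correlation of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

def partial_corr(a, c, b):
    """Correlation of a and c after linearly controlling for b."""
    r_ac, r_ab, r_cb = pearson(a, c), pearson(a, b), pearson(c, b)
    return (r_ac - r_ab * r_cb) / ((1 - r_ab ** 2) * (1 - r_cb ** 2)) ** 0.5

def chain(n, rng):      # A -> B -> C
    a = [rng.gauss(0, 1) for _ in range(n)]
    b = [x + rng.gauss(0, 1) for x in a]
    c = [x + rng.gauss(0, 1) for x in b]
    return a, b, c

def fork(n, rng):       # A <- B -> C
    b = [rng.gauss(0, 1) for _ in range(n)]
    a = [x + rng.gauss(0, 1) for x in b]
    c = [x + rng.gauss(0, 1) for x in b]
    return a, b, c

def collider(n, rng):   # A -> B <- C
    a = [rng.gauss(0, 1) for _ in range(n)]
    c = [rng.gauss(0, 1) for _ in range(n)]
    b = [x + y + rng.gauss(0, 1) for x, y in zip(a, c)]
    return a, b, c

rng = random.Random(0)
results = {}
for name, gen in [("chain", chain), ("fork", fork), ("collider", collider)]:
    a, b, c = gen(20_000, rng)
    results[name] = (pearson(a, c), partial_corr(a, c, b))
    marginal, conditional = results[name]
    print(f"{name:9s} corr(A,C)={marginal:+.2f}  corr(A,C|B)={conditional:+.2f}")
```

For linear-Gaussian data a zero partial correlation corresponds to conditional independence, which is why partial correlation serves as the independence test here; other data types would need a different test.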
Expressiveness
Source: Honavar, Hill, & Yelick (2016), Accelerating Science: A Computing Research Agenda
[Figure: comparison of three approaches. Machine learning: rarely analyzes causal dependence. Manual scientific practice: rarely searches large spaces of formally represented models. Causal discovery: rarely discovers relational, temporal, or spatial models. Other labels: Relational, Temporal and Spatial Models; Causal Analysis; Automated Discovery.]
Causal models of independent outcomes [Figure: a causal process generating outcome variables A, B, ..., Z]
Causal models of independent outcomes [Figure: causal graph over variables A through J]
Key assumption of simple CGMs [Figure: a causal process generating outcome variables A, B, ..., Z]
Key assumption of simple CGMs [Figure: a causal process generating multiple dependent outcomes, marked “?” and “x”]
Causal models of dependent outcomes [Figure: causal graph over interdependent variables labeled A through T] (Friedman, Getoor, Koller, & Pfeffer 1999; Heckerman, Meek, & Koller 2007; Maier, Marazopoulou, and Jensen 2013)
(Maier, Marazopoulou, and Jensen 2013)
Causal models of general processes [Figure: a causal process represented as a probabilistic program]
  bool c1, c2;
  int count = 0;
  c1 = Bernoulli(0.5);
  if (c1==true) then
    count = count + 1;
  c2 = Bernoulli(0.5);
  if (c2==true) then
    count = count + 1;
  observe(c1==true||c2==true);
  return(count);
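The probabilistic program on this slide can be run in ordinary Python by treating `observe` as a rejection step. This is a minimal sketch, assuming rejection sampling as the inference method (the slide does not specify one); the function names `run_once` and `posterior_of_count` are hypothetical. Analytically, conditioning on c1 or c2 being true gives count = 1 with probability 2/3 and count = 2 with probability 1/3.

```python
import random
from collections import Counter

def run_once(rng):
    """One forward run of the program; returns (count, observation_holds)."""
    count = 0
    c1 = rng.random() < 0.5      # c1 = Bernoulli(0.5)
    if c1:
        count += 1
    c2 = rng.random() < 0.5      # c2 = Bernoulli(0.5)
    if c2:
        count += 1
    return count, (c1 or c2)     # observe(c1 || c2) is checked by the caller

def posterior_of_count(n_runs=100_000, seed=0):
    """Estimate P(count | c1 or c2) by discarding runs that violate the observation."""
    rng = random.Random(seed)
    accepted = Counter()
    for _ in range(n_runs):
        count, ok = run_once(rng)
        if ok:
            accepted[count] += 1
    total = sum(accepted.values())
    return {k: v / total for k, v in sorted(accepted.items())}

posterior = posterior_of_count()
print(posterior)
```

Rejection sampling is the simplest way to give `observe` a meaning, but it wastes every run that fails the observation; practical probabilistic programming systems use more efficient inference.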
Critique
“[To support science, we would expect] that two different kinds of inferential process would be required to put it into effect. The first, used in estimating parameters from data conditional on the truth of some tentative model, is appropriately called Estimation. The second, used in checking whether, in the light of the data, any model of the kind proposed is plausible, has been aptly named… Criticism.” — George Box (emphasis added)
Example assumptions • Faithfulness • Causal Markov assumption • Definitions of variables, entities, relationships, etc. • Measurement process • Temporal granularity of measurement • Latent variables, entities, relationships, etc. • Structural form of causal dependence • Functional form of probabilistic dependence • Compositional form • Closed world (or form of open world) • …and many others
Empirical evaluation
Goals for Empirical Evaluation Approaches • Empirical — A pre-existing system created by someone other than the researchers. • Stochastic — Produces non-deterministic experimental results. • Identifiable — Amenable to direct experimental investigation to estimate interventional distributions • Recoverable — Lacks memory or irreversible effects, which enables complete state recovery during experiments. • Efficient — Generates large amounts of data with relatively few resources. • Reproducible — Fairly easy to recreate nearly identical data sets without access to one-of-a-kind hardware or software.
Simple example: Database configuration
ML for database configuration (setup) • Assume a fixed database and DB server hardware. • Questions • For a given query, what is the expected performance under each set of configuration parameters? • For a given query, which configuration will give the best performance? • Data • 11,252 queries that were actually run against the Stack Exchange Data Explorer • Each query was run under one of many different joint settings of the configuration parameters, using Postgres 9.2.2 (Garant & Jensen 2016)
CGM for database configuration [Figure: causal graphical model over the variables Retrieved Row Count, Indexing, Page Cost, Memory Level, Join Count, Table Count, Length, Block Hits in Cache, Block Writes to RAM, Group-by Count, Total Row Count, Block Reads from Disk, Block Reads from RAM, Year Created, Total Queries by User, and Runtime]
CGM for database configuration [Figure: the same model with its variables grouped under the labels Query, Database, User, and Processing]