Causal Models for Scientific Discovery Research Challenges and - PowerPoint PPT Presentation

Causal Models for Scientific Discovery: Research Challenges and Opportunities
David Jensen
College of Information and Computer Sciences
Computational Social Science Institute
Center for Data Science
University of Massachusetts Amherst
Symposium on Accelerating Science
18 November 2016

Sources: The Guardian, July 2005; Wallace Kirkland, for Time

Sources: Wikipedia (pile); Argonne National Laboratory (Fermi)

Main points
• Representing and reasoning about causality is central to science and scientific discovery.
• Understanding of causal inference has advanced tremendously in the past 25 years through the work of several disparate research communities.
• Several emerging opportunities and challenges exist:
  • Expressiveness: combining data and knowledge from multiple sources to understand complex phenomena
  • Critique: inferring errors in modeling assumptions or problem construction
  • Empirical evaluation: providing realistic empirical tests of methods for causal modeling

Causality is central to science

Explanation ⇒ Causality
• Explanation is a central activity in science. Effective theories explain previously unexplained phenomena.
• Effective explanations generally take the form of a counterfactual (“What would have happened if conditions had been different?”).
• “…explanatory relationships are relationships that are potentially exploitable for purposes of manipulation and control.”

Control & design ⇒ Causality
Sources: Wikipedia (pile)

Models
• Because of this, “models” in most scientific fields have causal implications: they let us infer how a system would behave under intervention.
• In contrast, most “models” in machine learning and statistics have been defined as having only associational semantics.
• This leads to substantial confusion among researchers from other fields when they first encounter machine learning methods.

Progress in causal modeling
• An explicit theory of causal inference has been worked out over the past 20 years by a small group of computer scientists, philosophers, and statisticians.
• The theory uses directed graphical models to represent causal dependence among variables.
• That theory provides a formal correspondence between causal models and their observable statistical implications.
• This correspondence has been exploited to produce a number of algorithms for reasoning with causal graphical models (CGMs).
(Pearl 2000, 2009; Spirtes, Glymour, and Scheines 1993, 2001)

Key concepts
• Only statistical dependence is directly observable in data. Causal dependence is not observable.
• Statistical dependence underdetermines causal dependence (“correlation is not causation”).
• The observable statistical consequences of a given causal model can be inferred from structure (d-separation).
• Multiple causal structures produce the same observed statistical dependencies (Markov equivalence).
• However, some combinations of conditional independence and known causal dependence imply constraints on the space of causal structures, and some uniquely identify causal structures.
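The d-separation and Markov equivalence points above can be illustrated with a minimal sketch. It is not taken from the talk: the function names (`ancestors`, `d_separated`) and the three toy graphs are assumptions chosen for illustration, and the test uses the standard moralized ancestral graph criterion. The chain and the fork imply the same independence (A and C are independent given B), so data alone cannot tell them apart, while the collider implies a different pattern.

```python
from itertools import combinations

def ancestors(dag, nodes):
    """Return the given nodes plus all of their ancestors.

    `dag` maps each node to the set of its parents (hypothetical encoding).
    """
    seen, stack = set(), list(nodes)
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(dag[node])
    return seen

def d_separated(dag, x, y, given=()):
    """True if x and y are d-separated by `given` in the DAG."""
    given = set(given)
    keep = ancestors(dag, {x, y} | given)
    # Moralize the ancestral subgraph: link each parent to its child
    # and link every pair of parents that share a child.
    adj = {node: set() for node in keep}
    for child in keep:
        parents = dag[child]
        for p in parents:
            adj[p].add(child)
            adj[child].add(p)
        for p, q in combinations(parents, 2):
            adj[p].add(q)
            adj[q].add(p)
    # x and y are d-separated iff no path connects them once the
    # conditioning nodes are removed from the moral graph.
    stack, seen = [x], {x}
    while stack:
        node = stack.pop()
        if node == y:
            return False
        for nxt in adj[node] - given - seen:
            seen.add(nxt)
            stack.append(nxt)
    return True

chain = {"A": set(), "B": {"A"}, "C": {"B"}}        # A -> B -> C
fork = {"A": {"B"}, "B": set(), "C": {"B"}}         # A <- B -> C
collider = {"A": set(), "B": {"A", "C"}, "C": set()}  # A -> B <- C

for name, dag in [("chain", chain), ("fork", fork), ("collider", collider)]:
    print(name, d_separated(dag, "A", "C"), d_separated(dag, "A", "C", {"B"}))
```

In this toy setting the chain and fork are Markov equivalent, whereas the collider is uniquely identified by its independence pattern, which is the kind of constraint the last bullet refers to.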


Expressiveness

Source: Honavar, Hill, & Yelick (2016), Accelerating Science: A Computing Research Agenda

[Figure: diagram relating Machine Learning, Manual Scientific Practice, and Causal Discovery. Region labels: Automated Analysis; Causal Discovery; Relational, Temporal and Spatial Models. Annotations: "Rarely searches large spaces of formally represented models"; "Rarely analyzes causal dependence"; "Rarely discovers relational, temporal, or spatial models".]

Causal models of independent outcomes
[Figure: a causal process generating outcome variables A, B, ..., Z]

Causal models of independent outcomes
[Figure: causal graph over variables A through J]

Key assumption of simple CGMs
[Figure: a causal process generating outcome variables A, B, ..., Z]

Key assumption of simple CGMs
[Figure: a causal process generating multiple dependent outcomes, marked "?" and "x"]

Causal models of independent outcomes
[Figure: causal graph over variables A through J]

Causal models of dependent outcomes
[Figure: causal graph over variables A through T, with K appearing in several places]
(Friedman, Getoor, Koller, & Pfeffer 1999; Heckerman, Meek, & Koller 2007; Maier, Marazopoulou, and Jensen 2013)

(Maier, Marazopoulou, and Jensen 2013)

Causal models of general processes
  bool c1, c2;
  int count = 0;
  c1 = Bernoulli(0.5);
  if (c1==true) then
    count = count + 1;
  c2 = Bernoulli(0.5);
  if (c2==true) then
    count = count + 1;
  observe(c1==true||c2==true);
  return(count);
[Figure labels: Causal Process; Probabilistic Program]
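The probabilistic program on this slide can be mirrored in plain Python as a rough sketch. The slide does not say how the program is executed, so both functions below (`exact_posterior` and `rejection_sample`, names chosen here for illustration) are assumptions: the first enumerates the two coin flips and conditions on the `observe` statement, the second approximates the same distribution by discarding runs that violate it.

```python
from fractions import Fraction
from itertools import product
import random

def exact_posterior():
    """Distribution of `count` given observe(c1 or c2), by enumeration."""
    weights = {}
    for c1, c2 in product([False, True], repeat=2):
        if not (c1 or c2):
            continue  # the observe statement rules this run out
        count = int(c1) + int(c2)
        # Each pair of fair coin flips has prior probability 1/4.
        weights[count] = weights.get(count, Fraction(0)) + Fraction(1, 4)
    total = sum(weights.values())
    return {count: w / total for count, w in weights.items()}

def rejection_sample(n, seed=0):
    """Draw n accepted values of `count` by rerunning the program."""
    rng = random.Random(seed)
    samples = []
    while len(samples) < n:
        c1 = rng.random() < 0.5
        c2 = rng.random() < 0.5
        if c1 or c2:  # keep only runs consistent with the observation
            samples.append(int(c1) + int(c2))
    return samples

posterior = exact_posterior()
samples = rejection_sample(20000)
print(posterior)
print(sum(samples) / len(samples))
```

Under these assumptions the exact posterior puts probability 2/3 on a count of 1 and 1/3 on a count of 2, and the sample mean should land close to 4/3.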

Critique

“[To support science, we would expect] that two different kinds of inferential process would be required to put it into effect. The first, used in estimating parameters from data conditional on the truth of some tentative model, is appropriately called Estimation. The second, used in checking whether, in the light of the data, any model of the kind proposed is plausible, has been aptly named… Criticism.”
(George Box; emphasis added)

Example assumptions
• Faithfulness
• Causal Markov assumption
• Definitions of variables, entities, relationships, etc.
• Measurement process
• Temporal granularity of measurement
• Latent variables, entities, relationships, etc.
• Structural form of causal dependence
• Functional form of probabilistic dependence
• Compositional form
• Closed world (or form of open world)
• …and many others

Empirical evaluation

Goals for Empirical Evaluation Approaches
• Empirical: a pre-existing system created by someone other than the researchers.
• Stochastic: produces non-deterministic experimental results.
• Identifiable: amenable to direct experimental investigation to estimate interventional distributions.
• Recoverable: lacks memory or irreversible effects, which enables complete state recovery during experiments.
• Efficient: generates large amounts of data with relatively few resources.
• Reproducible: fairly easy to recreate nearly identical data sets without access to one-of-a-kind hardware or software.

Simple example: Database configuration

ML for database configuration (setup)
• Assume a fixed database and DB server hardware
• Questions
  • For a given query, what is the expected performance under each set of configuration parameters?
  • For a given query, which configuration will give the best performance?
• Data
  • 11,252 queries actually run against the Stack Exchange Data Explorer
  • Each query run using one of many different joint values of the configuration parameters, using Postgres 9.2.2
(Garant & Jensen 2016)
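The two questions above are interventional: they ask what would happen if a configuration were set, not what was seen when it happened to be set. The toy sketch below shows why that matters. Everything in it is hypothetical and not from the study: the `runtime` mechanism, the four-query workload, and the rule that indexing was only enabled for heavy queries are invented for illustration.

```python
from statistics import mean

def runtime(join_count, indexing):
    """Hypothetical mechanism: indexing always saves 3 time units."""
    return 10 + 5 * join_count - 3 * indexing

queries = [0, 1, 2, 3]  # hypothetical workload, one query per join count

# Observational regime: indexing was enabled only for heavy queries,
# so the configuration is confounded with query complexity.
observed = [(j, int(j >= 2)) for j in queries]
with_idx = mean(runtime(j, i) for j, i in observed if i == 1)
without_idx = mean(runtime(j, i) for j, i in observed if i == 0)
naive_difference = with_idx - without_idx

# Interventional regime: every query is run under both settings.
interventional_difference = mean(runtime(j, 1) - runtime(j, 0) for j in queries)

def best_configuration(join_count, settings=(0, 1)):
    """Pick the setting with the lowest runtime for one query."""
    return min(settings, key=lambda s: runtime(join_count, s))

print(naive_difference, interventional_difference, best_configuration(2))
```

In this toy, the associational comparison makes indexing look harmful, while rerunning each query under both settings recovers the benefit built into the mechanism. A system that can be rerun under chosen configurations allows this kind of direct check.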

CGM for database configuration
[Figure: causal graphical model over the following variables, as far as the labels can be recovered: Retrieved Row Count, Indexing, Page Cost, Memory Level, Join Count, Table Count, Length, Block Hits in Cache, Block Writes to RAM, Group-by Count, Total Row Count, Block Reads from Disk, Block Reads from RAM, Year Created, Total Queries by User, Runtime. Later builds of the slide add the group labels Query, Database, User, and Processing.]
