
Single World Intervention Graphs (SWIGs): Unifying the Counterfactual and Graphical Approaches to Causality Thomas Richardson Department of Statistics University of Washington Joint work with James Robins (Harvard School of Public Health)


SLIDE 1

Single World Intervention Graphs (SWIGs):

Unifying the Counterfactual and Graphical Approaches to Causality

Thomas Richardson Department of Statistics University of Washington

Joint work with James Robins (Harvard School of Public Health)

Therme Vals Causal Workshop 5 Aug 2013

SLIDE 2

Outline

Brief review of counterfactuals
A new unification of graphs and counterfactuals via node-splitting

◮ Factorization and Modularity
◮ Contrast with Twin Network approach
◮ Some Examples and Extensions
◮ Sequentially Randomized Experiments / Time Dependent Confounding
◮ Dynamic Regimes

Experimental Testability and Independence of Errors in NPSEMs

Thomas Richardson Therme Vals Workshop Slide 1

SLIDE 3

Counterfactuals aka Potential Outcomes

SLIDE 4

The potential outcomes framework: philosophy

Hume (1748) An Enquiry Concerning Human Understanding:

We may define a cause to be an object followed by another, and where all the objects, similar to the first, are followed by objects similar to the second, . . . where, if the first object had not been, the second never had existed.

SLIDE 5

The potential outcomes framework: crop trials

Jerzy Neyman (1923):

To compare v varieties [on m plots] we will consider numbers:

\[
\begin{array}{ccc}
U_{11} & \cdots & U_{1m} \\
\vdots &        & \vdots \\
U_{v1} & \cdots & U_{vm}
\end{array}
\]

Here Uij is the crop yield that would be observed if variety i were planted in plot j. Physical constraints only allow one variety to be planted in a given plot in any given growing season.

Popularized by Rubin (1974); sometimes called the ‘Rubin causal model’.

SLIDE 6

Potential outcomes with binary treatment

For binary treatment X and response Y, we define two potential outcome variables:

Y(x = 0): the value of Y that would be observed for a given unit if assigned X = 0;
Y(x = 1): the value of Y that would be observed for a given unit if assigned X = 1.

We will also write these as Y(x0) and Y(x1).
Implicit here is the assumption that these outcomes are well-defined. Specifically:

◮ Only one version of treatment X = x
◮ No interference between units (SUTVA).

SLIDE 7

Potential Outcomes

Unit    Potential Outcomes      Observed
        Y(x = 0)   Y(x = 1)     X    Y
1       0          1
2       0          1
3       0          0
4       1          1
5       1          0

SLIDE 8

Drug Response ‘Types’:

In the simplest case where Y is a binary outcome we have the following 4 types:

Y(x0)   Y(x1)   Name
0       0       Never Recover
0       1       Helped
1       0       Hurt
1       1       Always Recover

SLIDE 9

Assignment to Treatments

Unit    Potential Outcomes      Observed
        Y(x = 0)   Y(x = 1)     X    Y
1       0          1            1
2       0          1            0
3       0          0            1
4       1          1            1
5       1          0            0

SLIDE 10

Observed Outcomes from Potential Outcomes

Unit    Potential Outcomes      Observed
        Y(x = 0)   Y(x = 1)     X    Y
1       0          1            1    1
2       0          1            0    0
3       0          0            1    0
4       1          1            1    1
5       1          0            0    1

SLIDE 11

Potential Outcomes and Missing Data

Unit    Potential Outcomes      Observed
        Y(x = 0)   Y(x = 1)     X    Y
1       ?          1            1    1
2       0          ?            0    0
3       ?          0            1    0
4       ?          1            1    1
5       1          ?            0    1

SLIDE 12

Average Causal Effect (ACE) of X on Y

ACE(X → Y) ≡ E[Y(x1) − Y(x0)] = p(Helped) − p(Hurt) ∈ [−1, 1]

Thus ACE(X → Y) is the difference in % recovery if everyone were treated (X = 1) vs. if no one were treated (X = 0).
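The identity E[Y(x1) − Y(x0)] = p(Helped) − p(Hurt) can be checked on a small population. The sketch below is illustrative only: the five units and the helper `response_type` are assumptions made for the example, not part of the talk.

```python
from fractions import Fraction

# Hypothetical population: one (Y(x0), Y(x1)) pair per unit (illustrative values).
units = [(0, 1), (0, 1), (0, 0), (1, 1), (1, 0)]

TYPE_NAMES = {(0, 0): "Never Recover", (0, 1): "Helped",
              (1, 0): "Hurt", (1, 1): "Always Recover"}


def response_type(y0, y1):
    """Name the response type of a unit with binary potential outcomes."""
    return TYPE_NAMES[(y0, y1)]


n = len(units)
# ACE as the average unit-level difference Y(x1) - Y(x0).
ace_direct = Fraction(sum(y1 - y0 for y0, y1 in units), n)

# ACE as p(Helped) - p(Hurt).
types = [response_type(y0, y1) for y0, y1 in units]
p_helped = Fraction(types.count("Helped"), n)
p_hurt = Fraction(types.count("Hurt"), n)
ace_types = p_helped - p_hurt
```

Both routes give the same number because 'Never Recover' and 'Always Recover' units contribute zero to the unit-level difference.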

SLIDE 13

Identification of the ACE under randomization

If X is assigned randomly then

X ⊥⊥ Y(x0) and X ⊥⊥ Y(x1)    (1)

hence

E[Y(x1) − Y(x0)] = E[Y(x1)] − E[Y(x0)]
                 = E[Y(x1) | X = 1] − E[Y(x0) | X = 0]
                 = E[Y | X = 1] − E[Y | X = 0].

Thus if (1) holds then ACE(X → Y) is identified from P(X, Y).
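A minimal exact check of this identification argument, assuming a hypothetical distribution over response types and a treatment probability chosen purely for illustration. Treatment is assigned independently of the type, which gives (1), and the observed joint P(X, Y) is built using consistency, Y = Y(X).

```python
from fractions import Fraction

# Hypothetical distribution over response types (Y(x0), Y(x1)); values are assumptions.
p_type = {(0, 0): Fraction(3, 10), (0, 1): Fraction(4, 10),
          (1, 0): Fraction(1, 10), (1, 1): Fraction(2, 10)}
p_treat = Fraction(1, 3)  # P(X = 1), assigned independently of the type

# Observed joint P(X = x, Y = y) under randomization and consistency.
joint = {(x, y): Fraction(0) for x in (0, 1) for y in (0, 1)}
for (y0, y1), p in p_type.items():
    joint[(0, y0)] += p * (1 - p_treat)
    joint[(1, y1)] += p * p_treat


def mean_y_given_x(x):
    """E[Y | X = x] computed from the observed joint distribution only."""
    return joint[(x, 1)] / (joint[(x, 0)] + joint[(x, 1)])


ace_true = sum(p * (y1 - y0) for (y0, y1), p in p_type.items())
ace_observed = mean_y_given_x(1) - mean_y_given_x(0)
```

The observed contrast matches the counterfactual one whatever value `p_treat` takes in (0, 1), since only independence of X from the type is used.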

SLIDE 14

Inference for the ACE without randomization

Suppose that we do not know that X ⊥⊥ Y(x0) and X ⊥⊥ Y(x1).

What can be inferred?

         X = 0      X = 1
         Placebo    Drug
Y = 0    200        600
Y = 1    800        400

What is:
The largest number of people who could be Helped? 400 + 200
The smallest number of people who could be Hurt? 0

⇒ Max value of ACE: (200 + 400)/2000 − 0 = 0.3

Similar logic:

⇒ Min value of ACE: 0 − (600 + 800)/2000 = −0.7

SLIDE 15

Inference for the ACE without randomization

Suppose that we do not know that X ⊥⊥ Y(x0) and X ⊥⊥ Y(x1).

General case:

−(P(x=0, y=1) + P(x=1, y=0)) ≤ ACE(X → Y) ≤ P(x=0, y=0) + P(x=1, y=1)

⇒ Bounds will always cross zero.
⇒ X ⊥⊥ Y(x0) and X ⊥⊥ Y(x1) essential for drawing non-trivial causal inferences.
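The general bounds can be evaluated on the placebo/drug counts from the preceding slide. This is a small sketch; the function name `ace_bounds` and the dictionary layout are illustrative choices.

```python
def ace_bounds(counts):
    """Assumption-free bounds on ACE(X -> Y) from counts[(x, y)]."""
    n = sum(counts.values())
    lower = -(counts[(0, 1)] + counts[(1, 0)]) / n
    upper = (counts[(0, 0)] + counts[(1, 1)]) / n
    return lower, upper


# Counts keyed by (x, y): X = 0 is placebo, X = 1 is drug.
counts = {(0, 0): 200, (0, 1): 800, (1, 0): 600, (1, 1): 400}
lower, upper = ace_bounds(counts)
width = upper - lower
```

The four cell proportions sum to one, so the interval always has width 1 and therefore always contains zero, which is the point made on the slide.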

SLIDE 16

Summary of Counterfactual Approach

In our observed data, for each unit one outcome will be ‘actual’; the others will be ‘counterfactual’.
The potential outcome framework allows Causation to be ‘reduced’ to Missing Data

⇒ Conceptual progress!

The ACE is identified if X ⊥⊥ Y(xi) for all values xi.
Randomization of treatment assignment implies X ⊥⊥ Y(xi).
Ideas are central to Fisher’s Exact Test; also many parts of experimental design.
The framework is the basis of many practical causal data analyses published in Biostatistics, Econometrics and Epidemiology.

SLIDE 17

Relating Counterfactuals and Structural Equations

Potential outcomes can be seen as a different notation for Non-Parametric Structural Equation Models (NPSEMs):
Example: X → Y.

NPSEM formulation: Y = f(X, εY)
Potential outcome formulation: Y(x) = f(x, εY)

Two important caveats:
NPSEMs typically assume all variables are seen as being subject to well-defined interventions (not so with potential outcomes)
Pearl associates NPSEMs with Independent Errors (NPSEM-IEs) with DAGs (more on this later).

SLIDE 18

Relating Counterfactuals and ‘do’ notation

Expressions in terms of ‘do’ can be expressed in terms of counterfactuals:

P(Y(x) = y) ≡ P(Y = y | do(X = x))

but counterfactual notation is more general. Ex. Distribution of outcomes that would arise among those who took treatment (X = 1) had they, counter to fact, not received treatment:

P(Y(x = 0) = y | X = 1)

If treatment is randomized, so X ⊥⊥ Y(x = 0), then this equals P(Y(x = 0) = y), but in an observational study these may be different.

SLIDE 19

Graphs

SLIDE 20

Recap: Graphical Approach to Causality

[Figure: two causal graphs. No Confounding: X → Y. Confounding: X → Y with X ← H → Y, where H is unobserved.]

Graph intended to represent direct causal relations.
Convention that confounding variables (e.g. H) are always included on the graph.
Approach originates in the path diagrams introduced by Sewall Wright in the 1920s.
If X → Y then X is said to be a parent of Y; Y is a child of X.

SLIDE 21

Edges are directed, but are they causal?

X → Y (No Confounding): P(X, Y) = P(X)P(Y | X)
X ← Y (No Confounding): P(X, Y) = P(Y)P(X | Y)

Neither factorization places any restriction on P(X, Y).

SLIDE 22

Linking the two approaches

[Figure: X → Y (no confounding)]    X ⊥⊥ Y(x0) & X ⊥⊥ Y(x1)
[Figure: X → Y with X ← H → Y, H unobserved]    X ⊥̸⊥ Y(x0) & X ⊥̸⊥ Y(x1)

Elephant in the room: The variables Y(x0) and Y(x1) do not appear on these graphs!!

SLIDE 23

Node splitting: Setting X to 0

[Figure: DAG X → Y]    P(X = x̃, Y = ỹ) = P(X = x̃) P(Y = ỹ | X = x̃)

[Figure: SWIG with nodes X, x = 0, Y(x = 0)]

Can now ‘read’ the independence: X ⊥⊥ Y(x=0).

Also associate a new factorization:

P(X = x̃, Y(x=0) = ỹ) = P(X = x̃) P(Y(x=0) = ỹ)

where:

P(Y(x=0) = ỹ) = P(Y = ỹ | X = 0).

This last equation links a term in the original factorization to the new factorization. We term this the ‘modularity assumption’.

SLIDE 24

Node splitting: Setting X to 1

[Figure: DAG X → Y]    P(X = x̃, Y = ỹ) = P(X = x̃) P(Y = ỹ | X = x̃)

[Figure: SWIG with nodes X, x = 1, Y(x = 1)]

Can now ‘read’ the independence: X ⊥⊥ Y(x=1).

Also associate a new factorization:

P(X = x̃, Y(x=1) = ỹ) = P(X = x̃) P(Y(x=1) = ỹ)

where:

P(Y(x=1) = ỹ) = P(Y = ỹ | X = 1).

SLIDE 25

Crucial point: Y(x=0) and Y(x=1) are never on the same graph.
Although we have:

X ⊥⊥ Y(x=0) and X ⊥⊥ Y(x=1)

we do not have

X ⊥⊥ Y(x=0), Y(x=1)

Had we tried to construct a single graph containing both Y(x=0) and Y(x=1) this would have been impossible. (Why?)

⇒ Single-World Intervention Graphs (SWIGs).

SLIDE 26

Representing both graphs via a ‘template’

[Figure: DAG G with nodes X, Y; template G(x) with nodes X, x, Y(x)]

Represent both graphs via a template:
Formally this is a ‘graph valued function’:
Takes as input a specific value x∗
Returns as output a SWIG G(x∗).
Each instantiation of the template is a SWIG G(x∗) that represents a different margin: P(X, Y(x∗)), with red nodes x∗ becoming constants.

SLIDE 27

Intuition behind node splitting:

(Robins, VanderWeele, Richardson 2007)

Q: How could we identify whether someone would choose to take treatment, i.e. have X = 1, and at the same time find out what happens to such a person if they don’t take treatment, i.e. Y(x = 0)?

A: Consider an experiment in which, whenever a patient is observed to swallow the drug, i.e. to have X = 1, we instantly intervene by administering a safe ‘emetic’ that causes the pill to be regurgitated before any drug can enter the bloodstream. Since we assume the emetic has no side effects, the patient’s recorded outcome is then Y(x = 0).

SLIDE 28

Harder Inferential problem

[Figure: causal DAG on X0, Z, H, X1, Y]

Query: does this causal graph imply

Y(x0, x1) ⊥⊥ X1(x0) | Z(x0), X0 ?

SLIDE 29

Simple solution

[Figure: the causal DAG on X0, Z, H, X1, Y (left) and its SWIG on X0, x0, Z(x0), H, X1(x0), x1, Y(x0, x1) (right)]

Query: does this graph imply

Y(x0, x1) ⊥⊥ X1(x0) | Z(x0), X0 ?

Answer: Yes. Applying d-separation to the SWIG on the right we see that there is no d-connecting path from Y(x0, x1) to X1(x0) given Z(x0) and X0. More on this shortly...

SLIDE 30

Single World Intervention Template Construction (1)

Given a graph G, a subset of vertices A = {A1, . . . , Ak} to be intervened on, we form G(a) in two steps:

(1) (Node splitting): For every A ∈ A split the node into a random node A and a fixed node a:

[Figure: Schematic Illustrating the Splitting of Node A into a random half A and a fixed half a]

The random half inherits all edges directed into A in G;
The fixed half inherits all edges directed out of A in G.

SLIDE 31

Single World Intervention Template Construction (2)

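(2) Relabel descendants of fixed nodes:

[Figure: schematic of a graph on A, B, C, D, E, F, X, T, Y, Z before and after relabeling. After splitting A into A and a, the descendants of the fixed node a become B(a, . . .), C(a, . . .), D(a, . . .), E(a, . . .), F(a, . . .); the remaining vertices are written A(. . .), X(. . .), T(. . .), Y(. . .), Z(. . .).]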
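The two construction steps can be sketched in a few lines. This is an illustrative implementation, not code from the talk: the function name `swig_template`, the edge-list representation, and the convention of writing a fixed node as the lowercase vertex name are all assumptions. The example graph is the mediation DAG Z → X, X → Y, Z → Y with interventions on Z and X.

```python
def swig_template(edges, intervened):
    """Return (swig_edges, labels) for a DAG given as (parent, child) pairs.

    Assumes vertex names are uppercase strings, so that a fixed node can be
    written as the lowercase name (an illustrative convention only).
    """
    nodes = {v for edge in edges for v in edge}
    children = {v: [c for p, c in edges if p == v] for v in nodes}
    fixed = {a: a.lower() for a in intervened}

    # Step 2 bookkeeping: for each vertex, the fixed nodes it descends from.
    upstream = {v: set() for v in nodes}
    for a in intervened:
        stack = list(children[a])
        while stack:
            v = stack.pop()
            if fixed[a] in upstream[v]:
                continue
            upstream[v].add(fixed[a])
            if v not in fixed:  # the random half of a split node has no children
                stack.extend(children[v])

    def label(v):
        args = ", ".join(sorted(upstream[v]))
        return f"{v}({args})" if args else v

    # Step 1: edges out of an intervened vertex leave its fixed half instead.
    swig_edges = [(fixed[p] if p in fixed else label(p), label(c))
                  for p, c in edges]
    return swig_edges, {v: label(v) for v in nodes}


edges = [("Z", "X"), ("X", "Y"), ("Z", "Y")]
swig_edges, labels = swig_template(edges, ["Z", "X"])
```

Here `labels` maps X to `X(z)` and Y to `Y(x, z)`; dropping the edge Z → Y gives `Y(x)` instead, because the random half X(z) passes no label on to its former children.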

SLIDE 32

Single World Intervention Graph

A Single World Intervention Graph (SWIG) G(a∗) is obtained from the Template G(a) by simply substituting specific values a∗ for the variables a in G(a);
For example, we replace G(x) with G(x=0).
Changing the value of a fixed variable corresponds to constructing a new graph and considering a different population, e.g. P(X, Y(x=0)) vs. P(X, Y(x=1)).
It is only the instantiated graph G(x̃) that represents P(V(x̃)), not the template G(x).

SLIDE 33

Factorization and Modularity

Factorization: P(V(ã)) over the counterfactual variables in G(ã) factorizes with respect to G(ã) (ignoring fixed nodes):

\[
P\bigl(V(\tilde a)\bigr) = \prod_{Y(\tilde a) \in V(\tilde a)} P\Bigl( Y(\tilde a) \Bigm| \mathrm{pa}_{G(\tilde a)}\bigl(Y(\tilde a)\bigr) \setminus \tilde a \Bigr).
\]

Modularity: P(V(ã)) and P(V) are linked as follows:

\[
P\Bigl( Y(\tilde a) = y \Bigm| \mathrm{pa}_{G(\tilde a)}\bigl(Y(\tilde a)\bigr) \setminus \tilde a = q \Bigr)
= P\Bigl( Y = y \Bigm| \mathrm{pa}_{G}(Y) \setminus A = q,\ \mathrm{pa}_{G}(Y) \cap A = \tilde a_{\mathrm{pa}_{G}(Y) \cap A} \Bigr).
\]

So the conditional density associated with Y(ãY) in G(ã) is just the conditional density associated with Y in G after substituting ãi for any Ai ∈ A that is a parent of Y.

SLIDE 34

Applying d-separation to the graph G(a)

Counterfactual conditional independence relations may be obtained from the transformed graph by applying d-separation after adding fixed nodes to the conditioning set:

Given disjoint subsets B(ã), C(ã) and D(ã) of random vertices (where D(ã) may be empty),

if B(ã) is d-separated from C(ã) given D(ã) ∪ ã in G(ã)    (2)

then B(ã) ⊥⊥ C(ã) | D(ã) [P(V(ã))].

In words, if in G(ã) two subsets B(ã) and C(ã) of random nodes are d-separated by D(ã) in conjunction with the fixed nodes ã, then B(ã) and C(ã) are conditionally independent given D(ã) in the associated distribution P(V(ã)).

SLIDE 35

Conditioning on fixed variables ã

Intuitive since these are fixed constants in the SWIG.
Since vertices in ã have no parents, no new paths d-connect due to also conditioning on ã.
⇒ If a d-separation holds in G(ã) without conditioning on the fixed nodes, then it will continue to hold if we also condition on fixed nodes.
An alternative is simply to restrict attention to paths that do not contain fixed vertices, e.g. remove fixed nodes from the graph before checking d-separation.
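A compact way to check statement (2) mechanically is the standard moralized-ancestral-graph test for d-separation. The sketch below is an assumption-laden illustration: the function `d_separated` is written for this example, and the SWIG edge list (L → X, L → Y(x), x → Y(x)) is an assumed encoding of a template with a single observed confounder L.

```python
def d_separated(edges, xs, ys, given):
    """True if xs and ys are d-separated given `given` in the DAG `edges`."""
    xs, ys, given = set(xs), set(ys), set(given)

    # 1. Keep only the vertices involved and their ancestors.
    keep, frontier = set(), list(xs | ys | given)
    while frontier:
        v = frontier.pop()
        if v not in keep:
            keep.add(v)
            frontier.extend(p for p, c in edges if c == v)
    sub = [(p, c) for p, c in edges if p in keep and c in keep]

    # 2. Moralize: link parents sharing a child, then drop directions.
    adj = {v: set() for v in keep}
    for p, c in sub:
        adj[p].add(c)
        adj[c].add(p)
    for v in keep:
        parents = [p for p, c in sub if c == v]
        for i, p in enumerate(parents):
            for q in parents[i + 1:]:
                adj[p].add(q)
                adj[q].add(p)

    # 3. Delete the conditioning set and test reachability from xs.
    seen, frontier = set(), list(xs)
    while frontier:
        v = frontier.pop()
        if v in seen or v in given:
            continue
        seen.add(v)
        frontier.extend(adj[v])
    return not (seen & ys)


# Assumed SWIG: L -> X, L -> Y(x), and the fixed node x -> Y(x).
swig = [("L", "X"), ("L", "Y(x)"), ("x", "Y(x)")]
with_fixed = d_separated(swig, {"X"}, {"Y(x)"}, {"L", "x"})
without_l = d_separated(swig, {"X"}, {"Y(x)"}, {"x"})
# The alternative from the slide: delete fixed nodes before checking.
no_fixed_nodes = [(p, c) for p, c in swig if p != "x"]
alternative = d_separated(no_fixed_nodes, {"X"}, {"Y(x)"}, {"L"})
```

In this example the check with the fixed node in the conditioning set and the check on the graph with fixed nodes removed agree, as the slide indicates they should.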

SLIDE 36

Mediation graph (I)

Intervention on Z alone.

[Figure: DAG with Z → X, X → Y and Z → Y; SWIG with nodes Z, z̃, X(z̃), Y(z̃)]

factorization:

P(Z, X(z̃), Y(z̃)) = P(Z) P(X(z̃)) P(Y(z̃) | X(z̃))

modularity:

P(X(z̃)=x) = P(X=x | Z=z̃),
P(Y(z̃)=y | X(z̃)=x) = P(Y=y | X=x, Z=z̃).

d-separation gives:

Z ⊥⊥ X(z̃), Y(z̃).

SLIDE 37

Mediation graph (II)

Intervention on Z and X:

[Figure: DAG with Z → X, X → Y and Z → Y; SWIG with nodes Z, z̃, X(z̃), x̃, Y(x̃, z̃)]

factorization:

P(Z, X(z̃), Y(x̃, z̃)) = P(Z) P(X(z̃)) P(Y(x̃, z̃))

modularity:

P(X(z̃)=x) = P(X=x | Z=z̃),
P(Y(x̃, z̃)=y) = P(Y=y | X=x̃, Z=z̃).

d-separation gives:

Z ⊥⊥ X(z̃) ⊥⊥ Y(x̃, z̃)

SLIDE 38

No direct effect graph

[Figure: DAG with Z → X and X → Y (no edge Z → Y); SWIG with nodes Z, z̃, X(z̃), x̃, Y(x̃)]

factorization:

P(Z, X(z̃), Y(x̃)) = P(Z) P(X(z̃)) P(Y(x̃))

modularity:

P(X(z̃)=x) = P(X=x | Z=z̃),
P(Y(x̃)=y) = P(Y=y | X=x̃).

d-separation gives:

Z ⊥⊥ X(z̃) ⊥⊥ Y(x̃)

SLIDE 39

Inferential Problem (II)

[Figure: the causal DAG on X0, Z, H, X1, Y and its SWIG on X0, x0, Z(x0), H, X1(x0), x1, Y(x0, x1)]

Pearl (2009), Ex. 11.3.3, claims the causal DAG above does not imply:

Y(x0, x1) ⊥⊥ X1 | Z, X0 = x0.    (3)

The SWIG shows that (3) does hold; Pearl is incorrect. Specifically, we see from the SWIG:

Y(x0, x1) ⊥⊥ X1(x0) | Z(x0), X0,    (4)
⇒ Y(x0, x1) ⊥⊥ X1(x0) | Z(x0), X0 = x0.    (5)

This last condition is then equivalent to (3) via consistency.
(Pearl infers a claim of Robins is false since if true then (3) would hold.)

SLIDE 40

Pearl’s twin network for the same problem

[Figure: DAG on X0, Z, H, X1, Y; twin network linking X0, Z, H, X1, Y to x0, Z(x0, x1), H(x0, x1), x1, Y(x0, x1) through the error terms UZ, UH, UY]

The twin network fails to reveal that Y(x0, x1) ⊥⊥ X1 | Z, X0 = x0.
This ‘extra’ independence holds in spite of d-connection because (by consistency) when X0 = x0, then Z = Z(x0) = Z(x0, x1).
Note that Y(x0, x1) ⊥⊥ X1 | Z, X0 = x0.
Shpitser & Pearl (2008) introduce a pre-processing step to address this.

SLIDE 41

Confounding Revisited

[Figure: SWIG template on X, x, Y(x) with unobserved H]

Here we can read directly from the template that X ⊥̸⊥ Y(x), since there is a path:

X ← H → Y(x).

SLIDE 42

Adjusting for confounding

[Figure: DAG on X, Y, L; SWIG template on X, x, Y(x), L]

Here we can read directly from the template that

X ⊥⊥ Y(x) | L.

It follows that:

P(Y(x̃)=y) = Σ_l P(Y=y | L=l, X=x̃) P(L=l).    (6)
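Formula (6) is straightforward to evaluate from an observed joint distribution. The sketch below uses a hypothetical joint over binary (L, X, Y); the numbers and the function names are assumptions chosen only so that the adjusted and unadjusted contrasts differ.

```python
from fractions import Fraction

# Hypothetical observed joint P(L = l, X = x, Y = y); the numbers are assumptions.
joint = {
    (0, 0, 0): Fraction(24, 100), (0, 0, 1): Fraction(16, 100),
    (0, 1, 0): Fraction(3, 100), (0, 1, 1): Fraction(7, 100),
    (1, 0, 0): Fraction(4, 100), (1, 0, 1): Fraction(6, 100),
    (1, 1, 0): Fraction(8, 100), (1, 1, 1): Fraction(32, 100),
}


def p_l(l):
    """Marginal P(L = l)."""
    return sum(p for (l_, _, _), p in joint.items() if l_ == l)


def p_y_given_lx(y, l, x):
    """Conditional P(Y = y | L = l, X = x)."""
    return joint[(l, x, y)] / (joint[(l, x, 0)] + joint[(l, x, 1)])


def p_counterfactual(y, x):
    """Right-hand side of (6): sum over l of P(Y=y | L=l, X=x) P(L=l)."""
    return sum(p_y_given_lx(y, l, x) * p_l(l) for l in (0, 1))


def p_y_given_x(y, x):
    """Unadjusted conditional P(Y = y | X = x)."""
    num = sum(joint[(l, x, y)] for l in (0, 1))
    den = sum(joint[(l, x, yy)] for l in (0, 1) for yy in (0, 1))
    return num / den


ace_adjusted = p_counterfactual(1, 1) - p_counterfactual(1, 0)
ace_naive = p_y_given_x(1, 1) - p_y_given_x(1, 0)
```

With these numbers the adjusted contrast is 1/4 while the unadjusted contrast is 17/50. Whether the adjusted value equals the ACE depends on X ⊥⊥ Y(x) | L actually holding; the code only evaluates the formula.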

SLIDE 43

Contrast with approach via removing edges

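[Figure: three graphs: the DAG on X, Y, L; the SWIG template on X, x, Y(x), L; and the graph on X, Y, L obtained by removing edges]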

This ‘explains’ why L is sufficient to control confounding under the null (where X has no effect on Y) but not under the alternative.

SLIDE 44

Adjusting for confounding

[Figure: DAG on X, Y, L; SWIG template on X, x, Y(x), L]

X ⊥⊥ Y(x) | L.

Proof of identification:

P[Y(x̃) = y] = Σ_l P[Y(x̃) = y | L = l] P(L = l)
            = Σ_l P[Y(x̃) = y | L = l, X = x̃] P(L = l)    (indep)
            = Σ_l P[Y = y | L = l, X = x̃] P(L = l)       (modularity)

SLIDE 45

More Examples (I)

[Figure (a-i): DAG on X, Y, L, H. Figure (a-ii): SWIG template on X, x, Y(x), L, H.]

Here we can read directly from the template that

X ⊥⊥ Y(x) | L.

SLIDE 46

More Examples (II)

[Figure (b-i): DAG on X, Y, L, H. Figure (b-ii): SWIG template on X, x, Y(x), L, H.]

Here we can read directly from the template that

X ⊥⊥ Y(x) | L.

SLIDE 47

Sequentially randomized experiment (I)

[Figure: DAG on A, B, C, D with unobserved H]

A and C are treatments;
H is unobserved;
B is a time varying confounder;
D is the final response;
Treatment C is assigned randomly conditional on the observed history, A and B;
Want to know P(D(ã, c̃)).

SLIDE 48

Sequentially randomized experiment (I)

[Figure: DAG on A, B, C, D with unobserved H]

If the following holds:

A ⊥⊥ D(ã, c̃)
C(ã) ⊥⊥ D(ã, c̃) | B(ã), A

General result of Robins (1986) then implies:

P(D(ã, c̃)=d) = Σ_b P(B=b | A=ã) P(D=d | A=ã, B=b, C=c̃).

Does it??

SLIDE 49

Sequentially randomized experiment (II)

[Figure: SWIG on A, ã, B(ã), C(ã), c̃, D(ã, c̃) with unobserved H]

d-separation:

A ⊥⊥ D(ã, c̃)
C(ã) ⊥⊥ D(ã, c̃) | B(ã), A

General result of Robins (1986) then implies:

P(D(ã, c̃)=d) = Σ_b P(B=b | A=ã) P(D=d | A=ã, B=b, C=c̃).
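The right-hand side of this formula uses only the observed joint over (A, B, C, D). A minimal sketch follows, assuming hypothetical conditional probabilities for binary variables; every number and helper name here is made up for illustration.

```python
from fractions import Fraction
from itertools import product

# Hypothetical conditionals for binary A, B, C, D (all numbers are assumptions).
p_a1 = Fraction(1, 2)                                    # P(A = 1)
p_b1 = {0: Fraction(1, 4), 1: Fraction(3, 4)}            # P(B = 1 | A = a)
p_c1 = {(a, b): Fraction(1 + a + b, 4)                   # P(C = 1 | A = a, B = b)
        for a in (0, 1) for b in (0, 1)}
p_d1 = {(a, b, c): Fraction(1 + a + 2 * b + 4 * c, 10)   # P(D = 1 | a, b, c)
        for a in (0, 1) for b in (0, 1) for c in (0, 1)}


def bern(p1, v):
    """Probability that a binary variable with P(1) = p1 takes value v."""
    return p1 if v == 1 else 1 - p1


# Observed joint P(A, B, C, D) implied by the conditionals above.
joint = {(a, b, c, d): bern(p_a1, a) * bern(p_b1[a], b)
         * bern(p_c1[(a, b)], c) * bern(p_d1[(a, b, c)], d)
         for a, b, c, d in product((0, 1), repeat=4)}

NAMES = ("a", "b", "c", "d")


def prob(**event):
    """Marginal probability of an event such as prob(a=1, b=0)."""
    return sum(p for key, p in joint.items()
               if all(key[NAMES.index(k)] == v for k, v in event.items()))


def g_formula(a, c, d):
    """Sum over b of P(B=b | A=a) P(D=d | A=a, B=b, C=c), from the joint alone."""
    total = Fraction(0)
    for b in (0, 1):
        p_b_given_a = prob(a=a, b=b) / prob(a=a)
        p_d_given_abc = prob(a=a, b=b, c=c, d=d) / prob(a=a, b=b, c=c)
        total += p_b_given_a * p_d_given_abc
    return total
```

For example, `g_formula(1, 1, 1)` evaluates the formula for ã = 1, c̃ = 1, d = 1. The value equals P(D(ã, c̃)=d) only when the two independences above hold; the code computes the formula and does not test them.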

SLIDE 50

Multi-network approach

[Figure: multi-network on A, B, C, D, H with error terms UH, UB, UC, UD and the counterfactual copies a, B(a), C(a), D(a), H(a) and a, B(a, c), c, D(a, c), H(a, c)]

SLIDE 51

Another example

[Figure: DAG on A, B, C, D with unobserved H1 and H2]

A ⊥⊥ D(ã, c̃)
C(ã) ⊥⊥ D(ã, c̃) | B(ã), A

General result of Robins (1986) then implies:

P(D(ã, c̃)=d) = Σ_b P(B=b | A=ã) P(D=d | A=ã, B=b, C=c̃).

Does it??

SLIDE 52

Another example

[Figure: DAG on A, B, C, D with unobserved H1 and H2; SWIG on A, ã, B(ã), C(ã), c̃, D(ã, c̃), H1, H2]

A ⊥⊥ D(ã, c̃)
C(ã) ⊥⊥ D(ã, c̃) | B(ã), A

General result of Robins (1986) then implies:

P(D(ã, c̃)=d) = Σ_b P(B=b | A=ã) P(D=d | A=ã, B=b, C=c̃).

SLIDE 53

General result (Robins, 1986)

Observed data:

O ≡ L1, A1, . . . , LK, AK, Y.

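If the following holds for k = 1, . . . , K

\[
Y(a^\dagger) \perp\!\!\!\perp A_k(a^\dagger) \mid \bar{L}_k(a^\dagger), \bar{A}_{k-1}(a^\dagger); \tag{7}
\]

then (under positivity):

\[
P\bigl(Y(a^\dagger) = y \bigm| \bar{L}_m(a^\dagger) = \bar{l}_m,\ \bar{A}_{m-1}(a^\dagger) = \bar{a}^\dagger_{m-1}\bigr)
= \sum_{l_{m+1},\ldots,l_K} p\bigl(y \mid \bar{l}_K, \bar{a}^\dagger_K\bigr) \prod_{j=m+1}^{K} p\bigl(l_j \mid \bar{l}_{j-1}, \bar{a}^\dagger_{j-1}\bigr). \tag{8}
\]

Here \bar{A}_{j−1}(a†) ≡ A1, . . . , A_{j−1}(a†_{j−2}), similarly for \bar{L}_{j−1}(a†).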

The RHS of (8) is referred to as the ‘g-formula’.

SLIDE 54

Dynamic regimes

A dynamic regime g is a policy that assigns treatment (usually at multiple time points) on the basis of past history;
Including conditional on the ‘natural’ value of treatment in the absence of an intervention;
Example: Exercise for as long as you would have done without intervention or twenty minutes, whichever is more.
See Young et al. (2012) for additional analysis.

SLIDE 55

Dynamic regimes

[Figure: SWIG template on A1, a1, L(a1), A2(a1), a2, Y(a1, a2), H1, H2, and the dynamic-regime graph on A1, A1+(g), L(g), A2(g), A2+(g), Y(g), H1, H2]

P(Y(g)) is identified.

SLIDE 56

Dynamic regimes

[Figure: SWIG template on A1, a1, L(a1), A2(a1), a2, Y(a1, a2), H1, H2, and the dynamic-regime graph on A1, A1+(g), L(g), A2(g), A2+(g), Y(g), H1, H2]

P(Y(g)) is not identified.

SLIDE 57

Joint Independence

We saw earlier that the causal DAG X → Y implied:

X ⊥⊥ Y(x0) and X ⊥⊥ Y(x1)

However, joint independence relations such as:

X ⊥⊥ Y(x0), Y(x1)

never follow from our SWIG transformation: There is no way via node-splitting to construct a graph with both Y(x0) and Y(x1).

This has important consequences for the identification of direct effects.

SLIDE 58

Assuming Independent Errors and Cross-World Independence

SLIDE 59

Mediation graph

Intervention on X and M:

[Figure: DAG on X, M, Y; SWIG with nodes X, x̃, M(x̃), m̃, Y(x̃, m̃)]

d-separation gives:

X ⊥⊥ M(x̃) ⊥⊥ Y(x̃, m̃)

Pearl associates additional independence relations with this graph

Y(x1, m) ⊥⊥ M(x0), X
Y(x0, m) ⊥⊥ M(x1), X

equivalent to assuming independent errors, εX ⊥⊥ εM ⊥⊥ εY.

SLIDE 60

Pure Direct Effect

Pure (aka Natural) Direct Effect (PDE): Change in Y had X been different, but M fixed at the value it would have taken had X not been changed:

PDE ≡ Y(x1, M(x0)) − Y(x0, M(x0)).

Legal motivation [from Pearl (2000)]:
“The central question in any employment-discrimination case is whether the employer would have taken the same action had the employee been of a different race (age, sex, religion, national origin etc.) and everything else had been the same.” (Carson versus Bethlehem Steel Corp., 70 FEP Cases 921, 7th Cir. (1996)).

SLIDE 61

Decomposition

PDE also allows non-parametric decomposition of Total Effect (ACE) into direct (PDE) and indirect (TIE) pieces.

PDE ≡ E[Y(1, M(0))] − E[Y(0)]
TIE ≡ E[Y(1, M(1)) − Y(1, M(0))]
TIE + PDE ≡ E[Y(1)] − E[Y(0)] ≡ ACE(X → Y)
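The decomposition is a telescoping sum once Y(x) is read as Y(x, M(x)). The sketch below checks it on four hypothetical units whose full set of potential outcomes M(x) and Y(x, m) is written out; the units and helper names are assumptions for illustration only, since such 'cross-world' tables are never observed.

```python
from fractions import Fraction

# Hypothetical units: M[x] is M(x) and Y[(x, m)] is Y(x, m). Values are assumptions.
units = [
    {"M": {0: 0, 1: 1}, "Y": {(0, 0): 0, (0, 1): 0, (1, 0): 1, (1, 1): 1}},
    {"M": {0: 0, 1: 1}, "Y": {(0, 0): 0, (0, 1): 1, (1, 0): 0, (1, 1): 1}},
    {"M": {0: 1, 1: 1}, "Y": {(0, 0): 1, (0, 1): 0, (1, 0): 1, (1, 1): 0}},
    {"M": {0: 0, 1: 0}, "Y": {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 1}},
]


def mean(values):
    """Exact average of an iterable of integers."""
    values = list(values)
    return Fraction(sum(values), len(values))


def nested(unit, x, x_for_m):
    """Y(x, M(x_for_m)) for one unit."""
    return unit["Y"][(x, unit["M"][x_for_m])]


pde = mean(nested(u, 1, 0) - nested(u, 0, 0) for u in units)
tie = mean(nested(u, 1, 1) - nested(u, 1, 0) for u in units)
ace = mean(nested(u, 1, 1) - nested(u, 0, 0) for u in units)
```

The identity `pde + tie == ace` holds for any such table, because the term Y(1, M(0)) cancels unit by unit.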

SLIDE 62

Pearl’s identification claim

Pearl and others claim that under “no confounding” the PDE is identified by the following mediation formula:

PDE_med = Σ_m [E[Y | x1, m] − E[Y | x0, m]] P(m | x0)

SLIDE 63

Critique of PDE: Hypothetical Case Study

Observational data on three variables:

X treatment: cigarette cessation
M intermediate: blood pressure at 1 year, high or low
Y outcome: say CHD by 2 years

Observed data (X, M, Y) on each of n subjects.
All binary.
X randomly assigned.

SLIDE 64

Hypothetical Study (I): X randomized

                 Y = 0    Y = 1    Total    P̂(Y=1 | m, x)
X = 0   M = 0    1500     500      2000     0.25
        M = 1    1200     800      2000     0.40
X = 1   M = 0    948      252      1200     0.21
        M = 1    1568     1232     2800     0.44

A researcher, Prof H, wishes to apply the mediation formula to estimate the PDE.
Prof H believes that there is no confounding, so that Pearl’s NPSEM-IE holds, but his post-doc, Dr L, is skeptical.

SLIDE 65

Hypothetical Study (II): X and M Randomized

To try to address Dr L’s concerns, Prof H carries out animal intervention studies.

                 Y = 0    Y = 1    Total    P̂(Y(m, x)=1)
X = 0   M = 0    750      250      1000     0.25
        M = 1    600      400      1000     0.40
X = 1   M = 0    790      210      1000     0.21
        M = 1    560      440      1000     0.44

As we see: P̂(Y(m, x)=1) = P̂(Y=1 | m, x);
Prof H is now convinced: ‘What other experiment could I do?’
He applies the mediation formula, yielding PDE_med = 0.
Conclusion: No direct effect of X on Y.
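Prof H's calculation can be reproduced from the counts of Hypothetical Study (I). This is a short sketch; the dictionary layout and function names are illustrative.

```python
from fractions import Fraction

# Counts from Hypothetical Study (I): counts[(x, m)] = (n with Y = 0, n with Y = 1).
counts = {(0, 0): (1500, 500), (0, 1): (1200, 800),
          (1, 0): (948, 252), (1, 1): (1568, 1232)}


def p_y1(x, m):
    """Estimate of P(Y = 1 | m, x), equal to E[Y | x, m] for binary Y."""
    n_y0, n_y1 = counts[(x, m)]
    return Fraction(n_y1, n_y0 + n_y1)


def p_m_given_x(m, x):
    """Estimate of P(M = m | X = x)."""
    total = sum(sum(counts[(x, mm)]) for mm in (0, 1))
    return Fraction(sum(counts[(x, m)]), total)


# Mediation formula: sum over m of (E[Y|x1,m] - E[Y|x0,m]) P(m|x0).
pde_med = sum((p_y1(1, m) - p_y1(0, m)) * p_m_given_x(m, 0) for m in (0, 1))
```

The stratum contrasts are −0.04 for M = 0 and +0.04 for M = 1, each weighted by P(m | x0) = 1/2, so they cancel exactly.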

SLIDE 66

Failure of the mediation formula

Under the true generating process, the true value of the PDE is:

PDE = 0.153 ≠ PDE_med = 0

Prof H’s conclusion was completely wrong!

SLIDE 67

Why did the mediation formula go wrong?

Dr L was right – there was a confounder:

[Figure: graph on X, M, Y with confounder H]

but... it had a special structure so that:

Y ⊥⊥ H | M, X = 0 and M ⊥⊥ H | X = 1

SLIDE 68

Why did the mediation formula go wrong?

Dr L was right – there was a confounder:

[Figure: graph on X, M, Y with confounder H]

but... it had a special structure so that:

Y ⊥⊥ H | M, X = 0 and M ⊥⊥ H | X = 1

[Figure: the graphs on M, Y, H within the strata X = 0 and X = 1]

The confounding is undetectable by any intervention on X and/or M.
Pearl: Onus is on the researcher to be sure there is no confounding. Causation should precede intervention.

SLIDE 69

PDE identification cannot be checked via experiment

If our only interventions are on the variables X and M then we cannot do an experiment to learn the PDE.
We could learn E[Y{x = 1, M(x = 0)}] by intervention if we could

◮ intervene and set X to 0 and observe M(0),
◮ then return each subject to their pre-intervention state,
◮ finally intervene to set X to 1 and M to M(0) and observe Y(1, M(0)).

Such an intervention strategy will usually not exist because returning subjects to their pre-intervention state is not possible in a real-world intervention (e.g., suppose the outcome Y were death).
Because we cannot observe the same subject under both X = 1 and X = 0 (i.e. “across worlds”), no intervention will allow us to learn the distribution of mixed counterfactuals such as Y{x = 1, M(x = 0)}.
(In the story Dr L had to introduce a new node on the graph in order to check the value of the PDE via an experiment.)

SLIDE 70

Summary of critique of Independent Error Assumption

The independent error assumption cannot be checked by any randomized experiment on the variables in the graph.
⇒ Connection between experimental interventions and potential outcomes, established by Neyman, has been severed;
⇒ Theories in Social and Medical sciences are not detailed enough to support the independent error assumption.

What about faithfulness and causal discovery procedures?
Such inferences are explicit that they rely on faithfulness, and are designed to guide hypothesis formation;

◮ Contrast: In Pearl’s NPSEM-IE approach the simple act of using a DAG is viewed as automatically committing you to making this untestable hypothesis.

Predictions (possibly derived assuming faithfulness) regarding intervention distributions P(Y(x)) = P(Y | do(x)) can be tested by randomized experiments.

SLIDE 71

How many experimentally untestable assumptions?

Assumption of independent errors implies super-exponentially many ‘cross-world’ counterfactual independence assumptions:

No. Actual Vars.                                 2    3     4       K
Dim. P(V)                                        3    7     15      2^K − 1
No. Counterfactual Vars.                         3    7     15      2^K − 1
Dim. Counterfactual Dist.                        7    127   32767   2^(2^K − 1) − 1
Dim. SWIG                                        5    113   32697   (2^(2^K − 1) − 1) − Σ_{j=1}^{K−1} (4^j − 2^j)
Dim. NPSEM-IE                                    4    19    274     Σ_{j=0}^{K−1} (2^(2^j) − 1)
No. untestable indep. constraints in NPSEM-IE    1    94    32423   O(2^(2^K − 2))

Table: Dimensions of counterfactual models associated with complete graphs with binary variables.
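The closed forms in the last column can be checked against the tabulated values for K = 2, 3, 4. In the sketch below, `dimensions` is an illustrative helper. Computing the number of untestable constraints as the difference between the SWIG and NPSEM-IE dimensions is an assumption that happens to match the three tabulated values, not a formula stated on the slide.

```python
def dimensions(k):
    """Model dimensions for a complete DAG on k binary variables."""
    dim_counterfactual = 2 ** (2 ** k - 1) - 1
    dim_swig = dim_counterfactual - sum(4 ** j - 2 ** j for j in range(1, k))
    dim_npsem_ie = sum(2 ** (2 ** j) - 1 for j in range(k))
    return {
        "dim_p_v": 2 ** k - 1,
        "n_counterfactual_vars": 2 ** k - 1,
        "dim_counterfactual": dim_counterfactual,
        "dim_swig": dim_swig,
        "dim_npsem_ie": dim_npsem_ie,
        # Assumed to be the gap between the two model dimensions.
        "untestable": dim_swig - dim_npsem_ie,
    }


table = {k: dimensions(k) for k in (2, 3, 4)}
```

The gap grows doubly exponentially in K, which is the sense in which the number of untestable assumptions is 'super-exponential'.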

SLIDE 72

SWIG Completeness Conjecture

In an NPSEM we define a counterfactual independence to be logical if it holds regardless of the distribution over counterfactuals (equivalently error terms), e.g. for binary X

Y(x0) ⊥⊥ Y(x1) | X, Y

Completeness Conjecture: There exists a distribution over counterfactuals that is experimentally indistinguishable from the NPSEM that assumes independent errors but in which the only non-logical independencies are those that may be derived from the SWIG.

SLIDE 73

Summary

A simple approach to unifying graphs and counterfactuals via node-splitting.
The approach works via linking the factorizations associated with the two graphs.
The approach provides a language that allows counterfactual and graphical people to communicate.
The approach leads to many fewer untestable independence assumptions than in the NPSEM-IE approach of Pearl.
The approach also provides a way to combine information on the absence of individual and population level direct effects.

SLIDE 74

Thank You!

SLIDE 75

References

Pearl, J. Causality (Second ed.). Cambridge, UK: Cambridge University Press, 2009.
Richardson, TS, Robins, JM. Single World Intervention Graphs. CSSS Technical Report No. 128, http://www.csss.washington.edu/Papers/wp128.pdf, 2013.
Robins, JM. A new approach to causal inference in mortality studies with sustained exposure periods: applications to control of the healthy worker survivor effect. Mathematical Modeling 7, 1393–1512, 1986.
Robins, JM, VanderWeele, TJ, Richardson, TS. Discussion of “Causal effects in the presence of non compliance: a latent variable interpretation” by Forcina, A. Metron LXIV (3), 288–298, 2007.
Shpitser, I, Pearl, J. What counterfactuals can be tested. Journal of Machine Learning Research 9, 1941–1979, 2008.
Spirtes, P, Glymour, C, Scheines, R. Causation, Prediction and Search. Lecture Notes in Statistics 81, Springer-Verlag.

SLIDE 76

Details on Pearl’s Error

Pearl correctly states that using his Twin Network method (next slide) it may be shown that Y(x0, x1) is not independent of X1, given Z and X0.
However, he then goes on to say (incorrectly):

In the twin network model there is a d-connected path from X1 to Y(x0, x1). . . Therefore, [(3)] is not satisfied for Y(x0, x1) and X1. [Ex. 11.3.3, p.353]

This is actually incorrect in two ways:
Y(x0, x1) ⊥̸⊥ X1 | Z, X0 does not imply Y(x0, x1) ⊥̸⊥ X1 | Z, X0 = x0;
d-separation is not complete for Twin Networks, so the presence of a d-connected path does not imply that the corresponding independence fails to hold.

SLIDE 77

[Figure: DAG on T, X, Z, Y; twin network with error terms UT, UX, UZ, UY and counterfactual nodes T(z), X(z), z, Y(z); SWIG on T, X, Z, z, Y(z)]

T and Y(z) are d-connected given X in the twin-network, but in spite of this T ⊥⊥ Y(z) | X under the associated NPSEM-IE because X(z) = X, and T and Y(z) are d-separated given X(z) in the twin-network.

SLIDE 80

Importance of fixed nodes

Compare:

[Figure: DAG with Z → X, X → Y and Z → Y; SWIG with nodes Z, z̃, X(z̃), Y(z̃)]

P(Z, X(z̃), Y(z̃)) = P(Z) P(X(z̃)) P(Y(z̃) | X(z̃)),
P(Y(z̃)=y | X(z̃)=x) = P(Y=y | X=x, Z=z̃).

versus

[Figure: DAG with Z → X and X → Y only; SWIG with nodes Z, z̃, X(z̃), Y(z̃)]

P(Z, X(z̃), Y(z̃)) = P(Z) P(X(z̃)) P(Y(z̃) | X(z̃)),
P(Y(z̃)=y | X(z̃)=x) = P(Y=y | X=x).

SLIDE 81

Importance of fixed nodes: leaving them out causes problems!

[Figure: DAG with Z → X, X → Y and Z → Y; transformed graph with nodes Z, X(z̃), Y(z̃) and no fixed node]

P(Z, X(z̃), Y(z̃)) = P(Z) P(X(z̃)) P(Y(z̃) | X(z̃)),
P(Y(z̃)=y | X(z̃)=x) = P(Y=y | X=x, Z=z̃).

versus

[Figure: DAG with Z → X and X → Y only; transformed graph with nodes Z, X(z̃), Y(z̃) and no fixed node]

P(Z, X(z̃), Y(z̃)) = P(Z) P(X(z̃)) P(Y(z̃) | X(z̃)),
P(Y(z̃)=y | X(z̃)=x) = P(Y=y | X=x).

Red nodes are needed in order to read off the modularity property from G(ã).

SLIDE 82

No direct effect graph (I)

[Figure: DAG with Z → X and X → Y (no edge Z → Y); SWIG with nodes Z, z̃, X(z̃), Y(z̃)]

factorization:

P(Z, X(z̃), Y(z̃)) = P(Z) P(X(z̃)) P(Y(z̃) | X(z̃))

modularity:

P(X(z̃)=x) = P(X=x | Z=z̃),
P(Y(z̃)=y | X(z̃)=x) = P(Y=y | X=x).

d-separation gives:

Z ⊥⊥ X(z̃), Y(z̃)
