What does ‘strong causal influence’ mean?

Joint work with David Balduzzi, Moritz Grosse-Wentrup and Bernhard Schölkopf

Dominik Janzing
Max Planck Institute for Intelligent Systems, Tübingen, Germany

Quantifying the strength of an arrow

Given:
• a causally sufficient set of variables X_1, ..., X_n
• the causal DAG G
• all causal conditionals P(x_j | pa_j), even for values pa_j with probability zero (this is more than just knowing P(X_1, ..., X_n))

[Figure: DAG on the nodes X_1, X_2, X_3, X_4]

Goal: quantify the strength of X_i → X_j

Motivation

[Figure: two DAGs on the nodes X, Y, Z, W]

Maybe the true causal DAG is always complete once we also account for weak interactions. Which arrows are so weak that we can neglect them?

Strength of a set of arrows

Idea:
• the strength of an arrow measures its relevance for understanding the behavior of the system under interventions
• the strength of a set of arrows measures their relevance for understanding the behavior of the system under interventions
• even if each arrow in S is irrelevant, S as a whole can still be relevant

Note

[Figure: two DAGs on the nodes X, Y, Z, W]

This picture is misleading because, for a set S of arrows,
• each element may have negligible strength
• but jointly they are not negligible

Our causal strength will not be subadditive over the edges!

Information theoretic approach

Advantages of information theory:
• variables may have different domains
• quantities are invariant under rescaling
• related to thermodynamics
• better for non-statistical generalizations

We do not consider approaches that involve expectations, variances, etc. (ANOVA, ACE, ...)

Some related work

• Avin, Shpitser, Pearl: Identifiability of path-specific effects, 2005.
• Pearl: Direct and indirect effects, 2001.
• Robins, Greenland: Identifiability and exchangeability for direct and indirect effects, 1992.
• Holland: Causal inference, path analysis, and recursive structural equation models, 1988.

These do not achieve our goal because:
• they measure the impact on Y of switching X from x to x' for one particular pair (x, x') when other paths are blocked
• we want an overall score for the strength of X → Y without referring to particular pairs

Axiomatic approach

Let S be a set of arrows.
• Let C_S denote its strength.
• Postulate desired properties of C_S.

Postulate 0: Causal Markov condition

If C_S = 0, then P is also Markov w.r.t. G_S (the DAG obtained from G by removing all arrows in S).

[Figure: the DAG G and the DAG G_S on the nodes X, Y, Z]

Postulate 1: Mutual information

[Figure: DAG X → Y]

For this simple DAG we postulate C_{X→Y} = I(X; Y).
(All the dependences are due to the influence of X on Y, hence the strength of the dependences can serve as a measure of the strength of the influence.)

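As a minimal sketch of Postulate 1, the mutual information of a two-node DAG X → Y can be computed directly from a joint probability table. The function name `mutual_information` and the example numbers (a fair bit X that Y reproduces with probability 0.9) are illustrative assumptions, not taken from the slides.

```python
from math import log2

def mutual_information(joint):
    """I(X;Y) in bits for a joint distribution given as {(x, y): probability}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

# Hypothetical example: X is a fair bit, Y equals X with probability 0.9.
joint = {(0, 0): 0.45, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.45}
strength = mutual_information(joint)  # C_{X->Y} for the DAG X -> Y under Postulate 1
```
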
Alternative option

[Figure: DAG X → Y]

C_{X→Y} := capacity of the information channel P(Y | do(X)) = P(Y | X), defined by maximizing I(X; Y) over all possible input distributions Q(X)
• requires knowing P(Y | x) also for x-values that never or seldom occur
• quantifies the potential influence rather than the actual one
• nevertheless an interesting option

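The gap between actual and potential influence can be illustrated numerically. The sketch below assumes a binary symmetric channel with flip probability 0.1 and an observed input distribution with P(X = 1) = 0.01; both numbers are made up for illustration. Capacity is approximated by a plain grid search over input distributions, a crude stand-in for a dedicated algorithm such as Blahut-Arimoto.

```python
from math import log2

def mutual_information(q1, channel):
    """I(X;Y) in bits for binary X with P(X=1) = q1 and channel[x][y] = P(y | x)."""
    px = [1.0 - q1, q1]
    py = [sum(px[x] * channel[x][y] for x in (0, 1)) for y in (0, 1)]
    total = 0.0
    for x in (0, 1):
        for y in (0, 1):
            pxy = px[x] * channel[x][y]
            if pxy > 0:
                total += pxy * log2(channel[x][y] / py[y])
    return total

# Assumed channel: Y is a copy of X that is flipped with probability 0.1.
channel = [[0.9, 0.1], [0.1, 0.9]]

# Actual influence: mutual information under the (assumed) observed input distribution.
actual = mutual_information(0.01, channel)

# Potential influence: maximum of I(X;Y) over a grid of input distributions Q(X).
capacity = max(mutual_information(i / 1000, channel) for i in range(1001))
```

With these assumed numbers the capacity is far larger than the mutual information under the observed input distribution, which is the distinction the slide draws.
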
Potential strength vs. actual strength

Assume a medical study shows that
• changing cholesterol within the range of values occurring in humans has no impact on life expectancy
• increasing it to 10 times the highest observed value has a strong impact

Which statement would you prefer?
• “cholesterol has a strong impact on life expectancy”
• “cholesterol would have a strong impact on life expectancy if it were much higher than it is”

Postulate 2: Locality

C_{X→Y} is determined by P(Y | PA_Y) and P(PA_Y).

[Figure: two DAGs on the nodes X, Y, Z]

Z is irrelevant in both cases.

Postulate 3: Quantitative causal Markov condition

C_{X→Y} ≥ I(X; Y | PA_Y^X), where PA_Y^X denotes the parents of Y without X.

[Figure: DAGs with the nodes X and Y and the remaining parents PA_Y^X of Y]

Idea: removing X → Y would imply I(X; Y | PA_Y^X) = 0. No other arrow can generate a non-zero dependence I(X; Y | PA_Y^X).

Postulate 4: Heredity

If T ⊃ S, then C_T = 0 ⇒ C_S = 0
(subsets of irrelevant sets of arrows are irrelevant)

Apart from the postulates...

Consider a simple communication scenario for which we might agree on how C should read...

Toy model with partial copy operations

• each variable X_j consists of k_j bits
• some of the bits are set uniformly at random
• the remaining ones are copied from the parents

i.e. a structural equation model X_j = f_j(PA_j, U_j), where
• every X_j and U_j is a vector of bits
• every f_j is a restriction map

Example with X → Y

1. X sets all its bits randomly
2. Y copies some of them
3. Y sets the remaining ones randomly

[Figure: bit strings of X and Y after each step]

Do we agree that...

...C_{X→Y} should be the number of bits that Y takes from X?
(for the simple DAG X → Y this number equals I(X; Y))

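The claim that the number of copied bits equals I(X; Y) can be checked by exact enumeration. The sketch below assumes a concrete instance of the toy model that is not on the slides: X consists of 3 random bits, and Y consists of the first 2 bits of X followed by 1 fresh random bit.

```python
from itertools import product
from math import log2

def mutual_information(joint):
    """I(X;Y) in bits for a joint distribution given as {(x, y): probability}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

copied_bits = 2  # assumed: Y copies the first two bits of X

joint = {}
for x in product((0, 1), repeat=3):      # step 1: X sets all its bits randomly
    for u in (0, 1):                     # step 3: Y's own random bit
        y = x[:copied_bits] + (u,)       # step 2: Y copies some bits of X
        joint[(x, y)] = joint.get((x, y), 0.0) + 1 / 16

bits_shared = mutual_information(joint)
```
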
Why I(X; Y) is an inappropriate measure for general DAGs

[Figure: DAGs a) and b) on the nodes X, Y, Z]

It does not account for the fact that part of the dependences are due to
a) the confounder Z
b) the indirect influence via Z

First guess: I(X; Y | Z)

[Figure: DAGs a) and b) on the nodes X, Y, Z]

• qualitatively, it behaves correctly: it screens off the path involving Z
• quantitatively, it is wrong because...

Fails even for a simple copy scenario

[Figure: copy scenario on the nodes X, Y, Z in four steps 1) to 4)]

• I(X; Y | Z) = 0 because X and Y are constants when conditioned on Z
• we would like to have C_{X→Y} = 1

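A small numerical check of this failure, under the assumption that Z is a fair bit, X is a copy of Z, and Y is a copy of X (the slide's figure is not recoverable in detail, so this concrete setup is a guess). The helper `conditional_mutual_information` is an illustrative name.

```python
from math import log2

def conditional_mutual_information(joint):
    """I(X;Y|Z) in bits for a joint distribution given as {(x, y, z): probability}."""
    pz, pxz, pyz = {}, {}, {}
    for (x, y, z), p in joint.items():
        pz[z] = pz.get(z, 0.0) + p
        pxz[(x, z)] = pxz.get((x, z), 0.0) + p
        pyz[(y, z)] = pyz.get((y, z), 0.0) + p
    return sum(p * log2(p * pz[z] / (pxz[(x, z)] * pyz[(y, z)]))
               for (x, y, z), p in joint.items() if p > 0)

# Assumed copy scenario: Z fair bit, X = Z, Y = X.
joint = {(0, 0, 0): 0.5, (1, 1, 1): 0.5}
i_given_z = conditional_mutual_information(joint)

# Conditioning on a constant recovers the unconditional I(X;Y).
i_unconditional = conditional_mutual_information(
    {(x, y, None): p for (x, y, _), p in joint.items()})
```

Here one bit is copied from X to Y, yet conditioning on Z removes all of the dependence.
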
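Why I(X; Y | Z) is inappropriate

[Figure: DAGs a) and b) on the nodes X, Y, Z]

Weakening Z → Y converts a) into b), where C_{X→Y} = I(X; Y). Hence, for a weak arrow Z → Y, the strength of X → Y should be close to I(X; Y), which I(X; Y | Z) does not guarantee.
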
Idea

Measure the strength of X on Y by the impact of interventions on X (while adjusting other variables)
• formalized by Ay & Polani (2006) in terms of Pearl’s do-calculus
• they defined a family of information theoretic quantities called “Information Flow”

This does not solve our problem

• Ay and Polani’s Information Flow measures an interesting quantity (something related to causality)
• we don’t consider it a good measure for the strength of an arrow
• arguments follow

First attempt

[Figure: DAG a) on the nodes X, Y, Z]

The strength of X → Y is the mutual information I(X; Y) in a scenario where
• X is subjected to a randomized intervention

Fails because...

[Figure: DAG on the nodes X, Y, Z]

• X, Y, Z binary
• P(Z) uniform
• Y = X ⊕ Z

X and Y are independent both with respect to the
• observed distribution
• distribution obtained by randomizing X

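The XOR counterexample can be verified by enumeration. The sketch assumes that X is randomized independently of Z with P(X = 1) = 0.3, an arbitrary illustrative value. It also evaluates I(X; Y | Z) to show that the dependence reappears once Z is conditioned on.

```python
from math import log2

def conditional_mutual_information(joint):
    """I(X;Y|Z) in bits for a joint distribution given as {(x, y, z): probability}."""
    pz, pxz, pyz = {}, {}, {}
    for (x, y, z), p in joint.items():
        pz[z] = pz.get(z, 0.0) + p
        pxz[(x, z)] = pxz.get((x, z), 0.0) + p
        pyz[(y, z)] = pyz.get((y, z), 0.0) + p
    return sum(p * log2(p * pz[z] / (pxz[(x, z)] * pyz[(y, z)]))
               for (x, y, z), p in joint.items() if p > 0)

q = 0.3  # assumed P(X = 1) under the randomized intervention
joint = {}
for x, p_x in ((0, 1 - q), (1, q)):
    for z in (0, 1):                       # P(Z) uniform
        joint[(x, x ^ z, z)] = p_x * 0.5   # Y = X xor Z

# Marginalize Z out by mapping every z to the same constant.
joint_xy = {}
for (x, y, _), p in joint.items():
    joint_xy[(x, y, None)] = joint_xy.get((x, y, None), 0.0) + p

i_xy = conditional_mutual_information(joint_xy)        # I(X;Y)
i_xy_given_z = conditional_mutual_information(joint)   # I(X;Y|Z)
```
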
Second attempt

[Figure: DAG a) on the nodes X, Y, Z]

The strength of X → Y is the conditional mutual information I(X; Y | Z) in a scenario where
• X is subjected to a randomized intervention

Question: according to which distribution is X randomized?

Second attempt, Version I

[Figure: DAG a) on the nodes X, Y, Z]

The strength of X → Y is the conditional mutual information I(X; Y | Z) in a scenario where
• X is subjected to a randomized intervention
• X is distributed according to P(X | Z)

Fails because...

[Figure: DAG with the nodes Z = X and Y]

If X is a copy of Z,
• given Z, X is a constant
• I(X; Y | Z) = 0 also for the post-interventional distribution

Second attempt, Version II

[Figure: DAG a) on the nodes X, Y, Z]

The strength of X → Y is the conditional mutual information I(X; Y | Z) in a scenario where
• X is subjected to a randomized intervention
• X is distributed according to P(X)

Violates Postulate 3

[Figure: DAG a) on the nodes X, Y, Z]

There is a contrived example where the strength of X → Y would be smaller than I(X; Y | Z).

Violates Postulate 3

[Figure: DAG on the nodes X, Y, Z]

• Z: random bit
• X and Y: k bits each
• for Z = 1: X is randomized and Y is copied from X
• for Z = 0: X and Y are both constant (one is set to 1, the other to zero)

I(X; Y | Z) = k/2 because k bits are copied in half of the cases.
If X is randomized according to P(X), so that X and Z are independent, copying occurs only in 1/4 of the cases.

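The violation can be reproduced for a small k by exact enumeration. The sketch below assumes k = 4, and assumes that for Z = 0 it is X that is set to all zeros and Y that is set to all ones; the slide does not make this assignment recoverable, and the result does not depend on which constant is which.

```python
from itertools import product
from math import log2

def conditional_mutual_information(joint):
    """I(X;Y|Z) in bits for a joint distribution given as {(x, y, z): probability}."""
    pz, pxz, pyz = {}, {}, {}
    for (x, y, z), p in joint.items():
        pz[z] = pz.get(z, 0.0) + p
        pxz[(x, z)] = pxz.get((x, z), 0.0) + p
        pyz[(y, z)] = pyz.get((y, z), 0.0) + p
    return sum(p * log2(p * pz[z] / (pxz[(x, z)] * pyz[(y, z)]))
               for (x, y, z), p in joint.items() if p > 0)

k = 4                                   # assumed number of bits
strings = list(product((0, 1), repeat=k))
zeros, ones = (0,) * k, (1,) * k        # assumed constants used when Z = 0

def y_of(x, z):
    """Mechanism of Y: copy X if Z = 1, otherwise output the constant."""
    return x if z == 1 else ones

# Observed distribution: Z = 0 gives constants, Z = 1 gives a random X copied to Y.
observed = {(zeros, ones, 0): 0.5}
for x in strings:
    observed[(x, x, 1)] = 0.5 / len(strings)

# Marginal P(X), used as the randomization distribution in Version II.
p_x = {}
for (x, _, _), p in observed.items():
    p_x[x] = p_x.get(x, 0.0) + p

# Post-interventional distribution: X ~ P(X), independent of the fair bit Z.
randomized = {}
for x, p in p_x.items():
    for z in (0, 1):
        key = (x, y_of(x, z), z)
        randomized[key] = randomized.get(key, 0.0) + 0.5 * p

i_observed = conditional_mutual_information(observed)
i_randomized = conditional_mutual_information(randomized)
```

Under these assumptions `i_observed` equals k/2, while `i_randomized` comes out smaller, so a strength defined as in Version II would fall below I(X; Y | Z).
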
Hence...

• defining the strength of an arrow by interventions on nodes seems difficult
• we now define the strength by interventions on edges

Our approach: measure the impact of ‘deleting arrows’

To define the strength of S, cut every edge in S and feed the open end with an independent copy.

[Figure: DAG on the nodes X, Y, Z; the arrows in S into Y are cut and fed with inputs drawn from P(X) and P(Z)]

This defines a new distribution
P_S(x, y, z) := P(x, z) Σ_{x', z'} P(y | x', z') P(x') P(z')
and the strength is the relative entropy
C_S := D(P ‖ P_S)

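A minimal sketch of this definition for S = {X → Y, Z → Y}, following the formula for P_S term by term. The function name `strength_of_both_arrows`, the dictionary layout, and the example (independent fair bits X and Z with Y = X ⊕ Z) are illustrative assumptions.

```python
from math import log2

def strength_of_both_arrows(p_xz, p_y_given_xz, y_values):
    """C_S = D(P || P_S) in bits for S = {X -> Y, Z -> Y}.

    p_xz:          {(x, z): P(x, z)}
    p_y_given_xz:  {(x, z): {y: P(y | x, z)}}, defined for every pair (x, z)
    """
    px, pz = {}, {}
    for (x, z), p in p_xz.items():
        px[x] = px.get(x, 0.0) + p
        pz[z] = pz.get(z, 0.0) + p
    # After cutting both arrows, Y is fed with independent copies x' ~ P(X), z' ~ P(Z).
    cut = {y: sum(p_y_given_xz[(x2, z2)][y] * px[x2] * pz[z2]
                  for x2 in px for z2 in pz)
           for y in y_values}
    total = 0.0
    for (x, z), pxz in p_xz.items():
        for y in y_values:
            cond = p_y_given_xz[(x, z)][y]
            if pxz * cond > 0:
                # P(x,y,z) / P_S(x,y,z) reduces to P(y | x, z) / cut[y]
                total += pxz * cond * log2(cond / cut[y])
    return total

# Assumed example: X and Z are independent fair bits and Y = X xor Z.
p_xz = {(x, z): 0.25 for x in (0, 1) for z in (0, 1)}
p_y_given_xz = {(x, z): {y: 1.0 if y == (x ^ z) else 0.0 for y in (0, 1)}
                for x in (0, 1) for z in (0, 1)}
c_s = strength_of_both_arrows(p_xz, p_y_given_xz, (0, 1))
```

Note that the conditional P(y | x', z') is needed for every pair (x', z') with P(x') P(z') > 0, including pairs that have probability zero under P(x, z).
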
Idea of ‘edge deletion’

[Figure: DAG on the nodes X, Y, Z with cut arrows fed by P(X) and P(Z)]

• edges are electrical wires
• an attacker cuts some wires
• and feeds the open ends with random input
• the input distribution is chosen to be the observed marginal distribution
• this is the only distribution that is locally accessible

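Why the product distribution?

[Figure: two DAGs on the nodes X, Y, Z; one feeds the cut arrows with P(X)P(Z), the other with P(X, Z)]

• our edge deletion: input drawn from P(X)P(Z)
• ‘source exclusion’ by Ay & Krakauer (2006): input drawn from P(X, Z)
  • not accessible to a local attacker
  • Postulate 4 fails
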
Applying our measure to our toy model

[Figure: toy model on the nodes X, Y, Z with cut arrows S and example bit values]

D(P ‖ P_S) = number of corrupted bits
(in agreement with what we expect)

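For a single arrow S = {X → Y}, only the X-input of Y is replaced by an independent copy x' ~ P(X), so that P_S(y | x, z) = Σ_{x'} P(y | x', z) P(x'). The sketch below applies this to the copy scenario discussed earlier, assuming Z is a fair bit, X = Z and Y = X; the function name and the callable interface for the conditional are illustrative choices.

```python
from math import log2

def strength_x_to_y(p_xz, p_y_given_xz, y_values):
    """C_{X->Y} = D(P || P_S) in bits for S = {X -> Y}, where Y has parents X and Z.

    p_xz:          {(x, z): P(x, z)}
    p_y_given_xz:  function (y, x, z) -> P(y | x, z), defined for all x, z
    """
    px = {}
    for (x, _), p in p_xz.items():
        px[x] = px.get(x, 0.0) + p
    total = 0.0
    for (x, z), pxz in p_xz.items():
        for y in y_values:
            cond = p_y_given_xz(y, x, z)
            if pxz * cond > 0:
                # Cut X -> Y: feed Y with an independent copy x' drawn from P(X).
                cut = sum(p_y_given_xz(y, x2, z) * px[x2] for x2 in px)
                total += pxz * cond * log2(cond / cut)
    return total

# Assumed copy scenario: Z fair bit, X = Z, Y copies X and ignores Z.
p_xz = {(0, 0): 0.5, (1, 1): 0.5}

def copy_x(y, x, z):
    return 1.0 if y == x else 0.0

c_x_to_y = strength_x_to_y(p_xz, copy_x, (0, 1))
```

In this assumed setup one bit is corrupted by the cut, and the measure returns one bit, whereas I(X; Y | Z) is zero for the same distribution.
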
Quantifying the impact of a vaccine

[Figure: DAG on the nodes ‘Age’, ‘vaccinated or not’, ‘infected or not’]

P_S corresponds to an experiment where
• the vaccine is randomly redistributed regardless of Age (keeping the fraction of treated subjects)
• the random variable ‘vaccinated’ is reinterpreted as ‘intention to get vaccinated’