On Testing Marginal versus Conditional Independence
Richard Guo (ricguo@uw.edu)
Department of Statistics, University of Washington, Seattle
November 2019
Introduction
Motivation

Inferring causal structure usually involves model selection among directed acyclic graphs (DAGs). While learning undirected graphical models is relatively well developed (e.g., graphical lasso, neighborhood selection), model selection for DAGs is less well understood. This poses a challenge to maintaining error guarantees in causal inference, even in large samples. In this talk, I will analyze the simplest example where such a challenge arises.
Marginal vs. conditional independence

Consider $(X_1, X_2, X_3)^\top \sim N(0, \Sigma)$ on $\mathbb{R}^3$, with covariance $\Sigma \in \mathbb{S}^3$, the set of $3 \times 3$ real positive definite matrices. We want to test between

$$M_0: X_1 \perp\!\!\!\perp X_2 \quad (X_1 \to X_3 \leftarrow X_2), \qquad M_1: X_1 \perp\!\!\!\perp X_2 \mid X_3 \quad (X_1 - X_3 - X_2),$$

assuming that at least one of them is true. Here $X_1 - X_3 - X_2$ includes the following Markov-equivalent DAGs:

$$X_1 \leftarrow X_3 \leftarrow X_2, \qquad X_1 \to X_3 \to X_2, \qquad X_1 \leftarrow X_3 \to X_2.$$
Marginal vs. conditional independence

Testing $M_0: X_1 \perp\!\!\!\perp X_2$ versus $M_1: X_1 \perp\!\!\!\perp X_2 \mid X_3$ is a non-nested model selection problem. The two models correspond to equality/algebraic constraints on $\Sigma = \{\sigma_{ij}\}$:

$$M_0: \sigma_{12} = 0, \qquad M_1: \sigma_{12\cdot 3} = \sigma_{12} - \sigma_{13}\sigma_{33}^{-1}\sigma_{23} = 0 \;\Leftrightarrow\; \sigma_{12}\sigma_{33} = \sigma_{13}\sigma_{23}.$$

$M_0$ and $M_1$ intersect at the two axes

$$M_0 \cap M_1 = \{\sigma_{12} = \sigma_{13} = 0\} \cup \{\sigma_{12} = \sigma_{23} = 0\}.$$
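As a quick numerical illustration of these constraints (a minimal sketch with a made-up covariance matrix, not from the talk), one can check whether a given $\Sigma$ satisfies $M_0$ or $M_1$:

```python
import numpy as np

# A hypothetical covariance matrix (values chosen for illustration) that
# satisfies the M1 constraint sigma_12 * sigma_33 = sigma_13 * sigma_23.
Sigma = np.array([
    [1.00, 0.15, 0.50],
    [0.15, 1.00, 0.30],
    [0.50, 0.30, 1.00],
])

# M0: marginal independence of X1 and X2, i.e. sigma_12 = 0.
m0_holds = np.isclose(Sigma[0, 1], 0.0)

# M1: conditional independence of X1 and X2 given X3, i.e. the partial
# covariance sigma_{12.3} = sigma_12 - sigma_13 * sigma_23 / sigma_33 = 0.
sigma_12_given_3 = Sigma[0, 1] - Sigma[0, 2] * Sigma[1, 2] / Sigma[2, 2]
m1_holds = np.isclose(sigma_12_given_3, 0.0)

print(m0_holds, m1_holds)  # False True for this Sigma
```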
Geometry

We visualize the parameter space in correlation coordinates:

$$M_0: \rho_{12} = 0, \qquad M_1: \rho_{12} = \rho_{13}\rho_{23}.$$
Singularity

The two axes further intersect at the origin

$$M_{\mathrm{sing}} = \{\sigma_{12} = \sigma_{13} = \sigma_{23} = 0\},$$

which is a singularity; $M_{\mathrm{sing}}$ corresponds to diagonal $\Sigma$.

• $M_0 \cap M_1$ vs. $\mathbb{S}^3$: the likelihood-ratio test (LRT) was studied by Drton (2006, 2009) and Drton and Sullivant (2007).
• The LRT has a non-standard asymptotic distribution at $M_{\mathrm{sing}}$.
• $M_0$ vs. $M_1$: at $M_{\mathrm{sing}}$, the tangent cones of the two models coincide.
• The models are called "1-equivalent" by Evans (2018), meaning that the linear approximations to the two parameter spaces are the same.
• Within a Euclidean ball of radius $m^{-1/2}$ around $M_{\mathrm{sing}}$, on the order of $m^2$ samples are required to distinguish $M_0$ from $M_1$.
Difficulty

Model selection for DAGs is usually conducted by one of the following approaches (Drton and Maathuis, 2017).

• Score-based: pick the model with the highest penalized likelihood score (e.g., AIC, BIC). Since $\dim(M_0) = \dim(M_1)$, both AIC and BIC simply pick the model with the higher likelihood.
• Constraint-based: test $M_0: X_1 \perp\!\!\!\perp X_2$ versus $M_1: X_1 \perp\!\!\!\perp X_2 \mid X_3$. This is the approach adopted by the PC algorithm. For Gaussian data, Fisher's $z$-transformation of the (partial) correlation is used as the test statistic.
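For concreteness, here is a minimal sketch (not the speaker's code; function names and defaults are my own) of the Fisher-$z$ test used in constraint-based selection for Gaussian data:

```python
import numpy as np
from scipy import stats

def fisher_z_test(r, n, n_cond, alpha=0.05):
    """Test H0: the (partial) correlation is zero via Fisher's z-transformation.

    r      : sample (partial) correlation
    n      : sample size
    n_cond : size of the conditioning set (0 for marginal independence)
    """
    z = 0.5 * np.log((1 + r) / (1 - r))            # Fisher's z-transform
    stat = np.sqrt(n - n_cond - 3) * abs(z)        # approximately N(0, 1) under H0
    return stat > stats.norm.ppf(1 - alpha / 2)    # True = reject independence

def partial_corr_12_given_3(R):
    """Partial correlation of X1 and X2 given X3, from a 3x3 correlation matrix R."""
    r12, r13, r23 = R[0, 1], R[0, 2], R[1, 2]
    return (r12 - r13 * r23) / np.sqrt((1 - r13**2) * (1 - r23**2))
```

A constraint-based selector would apply this test to $\hat\rho_{12}$ (with `n_cond=0`) and to $\hat\rho_{12\cdot 3}$ (with `n_cond=1`) and favor the model whose independence constraint is not rejected.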
Difficulty

Simulation with $n = 1{,}000$, $\rho = 0.3$, unit variances, and level $\alpha = 0.05$.

[Figure: empirical size (0 to 1) of BIC/AIC and PC model selection as a function of $|\gamma|$ (0 to 10), with one panel for the truth in $M_0 \setminus M_1$ and one for the truth in $M_1 \setminus M_0$; in each case one edge into $X_3$ has strength $\gamma n^{-1/2}$ and the other has strength $\rho$.]
Method
Likelihood ratio test for nested models

Consider a parametric family $\{P_\theta : \theta \in \Theta\}$, where $\Theta$ is an open subset of $\mathbb{R}^d$. For $\Theta_0 \subseteq \Theta$, suppose we want to test $H_0: \theta \in \Theta_0$ vs. $H_1: \theta \in \Theta$. Under regularity conditions, the likelihood ratio test (LRT) statistic satisfies

$$\lambda_n = 2\left(\sup_{\theta \in \Theta} \ell_n(\theta) - \sup_{\theta \in \Theta_0} \ell_n(\theta)\right) \;\Rightarrow\; \chi^2_c, \qquad c = d - \dim(\Theta_0),$$

where $\ell_n(\cdot)$ is the log-likelihood under sample size $n$.

For example, in the linear regression $y \sim \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3$, we use $\chi^2_2$ for testing $H_0: \beta_0 = \beta_1 = 0$ vs. $H_1: \beta \in \mathbb{R}^4$.
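The regression example can be reproduced in a few lines; in this sketch (simulated data, illustrative coefficients, not part of the talk) the error variance is profiled out, so the LRT statistic reduces to $n\log(\mathrm{RSS}_0/\mathrm{RSS}_1)$ and is compared to $\chi^2_2$:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 3))
# Simulate under H0: beta_0 = beta_1 = 0 (coefficients are illustrative).
y = 0.5 * X[:, 1] - 0.3 * X[:, 2] + rng.normal(size=n)

def rss(design, response):
    """Residual sum of squares of an OLS fit."""
    beta, *_ = np.linalg.lstsq(design, response, rcond=None)
    resid = response - design @ beta
    return resid @ resid

full = np.column_stack([np.ones(n), X])   # intercept + X1 + X2 + X3
reduced = X[:, 1:]                        # H0 drops the intercept and X1

# With sigma^2 profiled out, 2 * (sup log-lik full - sup log-lik reduced)
# equals n * log(RSS_reduced / RSS_full).
lrt = n * np.log(rss(reduced, y) / rss(full, y))
p_value = stats.chi2.sf(lrt, df=2)        # c = 4 - 2 = 2 degrees of freedom
```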
Likelihood ratio test

Similarly, we define the log-likelihood ratio of $M_0$ versus $M_1$ as

$$\lambda_n^{(0:1)} := 2\left(\sup_{\Sigma \in M_0} \ell_n(\Sigma) - \sup_{\Sigma \in M_1} \ell_n(\Sigma)\right) = 2\left(\ell_n(\hat\Sigma_n^{(0)}) - \ell_n(\hat\Sigma_n^{(1)})\right),$$

where $\hat\Sigma_n^{(0)}$ and $\hat\Sigma_n^{(1)}$ are the MLEs within $M_0$ and $M_1$, respectively, and $\ell_n(\cdot)$ is the Gaussian log-likelihood function

$$\ell_n(\Sigma) = \frac{n}{2}\left(-\log|\Sigma| - \operatorname{Tr}(S_n \Sigma^{-1})\right).$$
Likelihood ratio test

The Gaussian MLEs for DAG models take a closed form (Drton and Richardson, 2008), which yields the following expression for the LRT:

$$\lambda_n^{(0:1)} = n \log\!\left(\frac{(s_{13}^2 - s_{11}s_{33})(s_{23}^2 - s_{22}s_{33})}{s_{33}}\right) - n \log\!\left(s_{11}s_{22}\left(s_{33} + \frac{s_{22}s_{13}^2 - 2 s_{12}s_{23}s_{13} + s_{11}s_{23}^2}{s_{12}^2 - s_{11}s_{22}}\right)\right),$$

where $S = \{s_{ij}\}$ is the sample covariance taken with respect to mean zero.
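A direct transcription of this formula (a minimal sketch; it assumes the data are stored as an $n \times 3$ array and takes the sample covariance about mean zero, as above):

```python
import numpy as np

def lrt_marginal_vs_conditional(X):
    """Closed-form LRT statistic lambda_n^{(0:1)} for M0 (X1 _||_ X2) versus
    M1 (X1 _||_ X2 | X3), translated from the determinant formula above.

    X : (n, 3) array; the sample covariance S is taken about mean zero.
    """
    n = X.shape[0]
    S = X.T @ X / n
    s11, s22, s33 = S[0, 0], S[1, 1], S[2, 2]
    s12, s13, s23 = S[0, 1], S[0, 2], S[1, 2]

    # Determinant of the MLE within M1 (X1 - X3 - X2).
    det1 = (s13**2 - s11 * s33) * (s23**2 - s22 * s33) / s33
    # Determinant of the MLE within M0 (X1 -> X3 <- X2, sigma_12 = 0).
    det0 = s11 * s22 * (
        s33 + (s22 * s13**2 - 2 * s12 * s23 * s13 + s11 * s23**2)
        / (s12**2 - s11 * s22)
    )
    return n * np.log(det1) - n * np.log(det0)
```

Positive values of the statistic indicate that $M_0$ attains the higher maximized likelihood (since the two constrained MLEs have equal trace terms, the statistic reduces to the log-ratio of determinants).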
Our plan

1. An information-theoretic analysis of how well the two models can be distinguished (by any means).
2. Identify the regimes of "effect size" relative to $n$ in which the optimal error lies strictly between 0 and 1.
   • This yields a stable, non-degenerate asymptotic distribution for the LRT.
   • We will be doing large-$n$, small-effect asymptotics!
3. Derive the asymptotic distributions.
   • Are they uniform?
4. Develop a model selection procedure with error guarantees.
Optimal error

We study the minimax rate of distinguishing two sequences of distributions, one within $M_0$ and the other within $M_1$, as they approach $M_0 \cap M_1$.

Lemma (testing two simple hypotheses). For testing $H_0: X \sim P$ versus $H_1: X \sim Q$, the minimum sum of type-I and type-II errors is $1 - d_{\mathrm{TV}}(P, Q)$, where the total variation distance is

$$d_{\mathrm{TV}}(P, Q) = \sup_A |P(A) - Q(A)| = \frac{1}{2}\int |p - q|\, d\mu.$$
Optimal error

Consider a sequence $\Sigma_n^{(0)} \in M_0 \setminus M_1$ with $\Sigma_n^{(0)} \to \Sigma^* \in M_0 \cap M_1$, and let $P_n = P_{\Sigma_n^{(0)}}$. Correspondingly, let $Q_n = P_{\Sigma_n^{(1)}}$ with $\Sigma_n^{(1)}$ from $M_1 \setminus M_0$ chosen as

$$\Sigma_n^{(1)} = \arg\min_{\Sigma \in M_1 \setminus M_0} D_{\mathrm{KL}}\!\left(P_{\Sigma_n^{(0)}} \,\|\, P_\Sigma\right),$$

i.e., the member of $M_1 \setminus M_0$ that is most difficult to distinguish from $P_n$. We then compute the total variation distance between the product measures $P_n^n$ and $Q_n^n$ ($n$ iid samples). The limiting optimal error can be sandwiched by the Hellinger distance $H(P, Q) := \left(\frac{1}{2}\int (\sqrt{p} - \sqrt{q})^2 \, d\mu\right)^{1/2}$:

$$H^2(P_n^n, Q_n^n) \le d_{\mathrm{TV}}(P_n^n, Q_n^n) \le H(P_n^n, Q_n^n)\sqrt{2 - H^2(P_n^n, Q_n^n)}.$$
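For zero-mean Gaussians, the Hellinger distance is available in closed form through the Bhattacharyya coefficient, and the affinity of an $n$-fold product measure is the $n$-th power of the single-sample affinity. A sketch of the sandwich bound (helper names are my own):

```python
import numpy as np

def hellinger_sq(S1, S2):
    """Squared Hellinger distance between N(0, S1) and N(0, S2) (closed form)."""
    avg = (S1 + S2) / 2
    bc = (np.linalg.det(S1) ** 0.25 * np.linalg.det(S2) ** 0.25
          / np.linalg.det(avg) ** 0.5)                 # Bhattacharyya coefficient
    return 1.0 - bc

def tv_sandwich(S1, S2, n):
    """Lower and upper bounds on d_TV(P^n, Q^n) via the Hellinger distance,
    using 1 - H^2(P^n, Q^n) = (1 - H^2(P, Q))^n."""
    h2 = 1.0 - (1.0 - hellinger_sq(S1, S2)) ** n
    return h2, np.sqrt(h2 * (2.0 - h2))
```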
Optimal error

With some algebra, we have

$$1 - d_{\mathrm{TV}}(P_n^n, Q_n^n) \to \begin{cases} 0, & H(P_n, Q_n) = \omega(n^{-1/2}), \\ 1, & H(P_n, Q_n) = o(n^{-1/2}), \end{cases}$$

and when $H(P_n, Q_n) \asymp n^{-1/2}$,

$$0 < \liminf_n \{1 - d_{\mathrm{TV}}(P_n^n, Q_n^n)\} \le \limsup_n \{1 - d_{\mathrm{TV}}(P_n^n, Q_n^n)\} < 1.$$

Effect size: $H(P_n, Q_n) \asymp \rho_{13,n}\,\rho_{23,n}$, where $\rho_{ij} = \sigma_{ij}/\sqrt{\sigma_{ii}\sigma_{jj}}$ is the correlation coefficient.
Optimal error

Comparing $H(P_n, Q_n)$ to $n^{-1/2}$, there are two regimes that stabilize the asymptotic error, i.e., that give $\{1 - d_{\mathrm{TV}}(P_n^n, Q_n^n)\} \to c \in (0, 1)$:

• "weak-strong": $\rho_{13,n} \asymp \gamma n^{-1/2}$ and $\rho_{23,n} \to \rho_{23} \ne 0$ (or, symmetrically, $\rho_{23,n} \asymp \gamma n^{-1/2}$ and $\rho_{13,n} \to \rho_{13} \ne 0$);
• "weak-weak": $\rho_{13,n}\,\rho_{23,n} \asymp \delta n^{-1/2}$ with $\rho_{13,n}, \rho_{23,n} \to 0$.
Asymptotics: weak-strong regime

We study the (local) asymptotic distribution of $\lambda_n^{(0:1)}$. For $r = \gamma\sqrt{\sigma_{11}\sigma_{33}}$, we set

$$\Sigma_n^{(0)} = \begin{pmatrix} \sigma_{11} & 0 & r/\sqrt{n} \\ 0 & \sigma_{22} & \sigma_{23} \\ r/\sqrt{n} & \sigma_{23} & \sigma_{33} \end{pmatrix}, \qquad \Sigma_n^{(1)} = \begin{pmatrix} \sigma_{11} & (r/\sqrt{n})\,\sigma_{23}/\sigma_{33} & r/\sqrt{n} \\ (r/\sqrt{n})\,\sigma_{23}/\sigma_{33} & \sigma_{22} & \sigma_{23} \\ r/\sqrt{n} & \sigma_{23} & \sigma_{33} \end{pmatrix},$$

both of which converge to

$$\Sigma^* = \begin{pmatrix} \sigma_{11} & 0 & 0 \\ 0 & \sigma_{22} & \sigma_{23} \\ 0 & \sigma_{23} & \sigma_{33} \end{pmatrix}.$$
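A sketch of this construction (illustrative default variances and $\sigma_{23}$; the function name is my own), which could be paired with the LRT statistic defined earlier to simulate its distribution in this regime:

```python
import numpy as np

def weak_strong_sequences(n, gamma, s11=1.0, s22=1.0, s33=1.0, s23=0.3):
    """Local sequences Sigma_n^{(0)} in M0 \\ M1 and Sigma_n^{(1)} in M1 \\ M0
    for the weak-strong regime; both converge to the same Sigma* in M0 & M1."""
    r = gamma * np.sqrt(s11 * s33)
    s13 = r / np.sqrt(n)
    Sigma0 = np.array([[s11, 0.0, s13],
                       [0.0, s22, s23],
                       [s13, s23, s33]])
    s12 = s13 * s23 / s33   # enforces sigma_12 * sigma_33 = sigma_13 * sigma_23
    Sigma1 = np.array([[s11, s12, s13],
                       [s12, s22, s23],
                       [s13, s23, s33]])
    return Sigma0, Sigma1

# Example: draw n samples from Sigma_n^{(0)} and evaluate the LRT statistic
# lrt_marginal_vs_conditional defined in the earlier sketch.
rng = np.random.default_rng(1)
Sigma0, Sigma1 = weak_strong_sequences(n=1000, gamma=2.0)
X = rng.multivariate_normal(np.zeros(3), Sigma0, size=1000)
```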
Asymptotics: weak-strong regime

[Figure: three DAG diagrams over $X_1, X_2, X_3$, illustrating $\Sigma_n^{(0)} \in M_0 \setminus M_1$, $\Sigma_n^{(1)} \in M_1 \setminus M_0$, and the limit $\Sigma^* \in M_0 \cap M_1$, with edge strengths $\gamma n^{-1/2}$ and $\rho$.]