On Testing Marginal versus Conditional Independence
Richard Guo (ricguo@uw.edu)
Department of Statistics, University of Washington, Seattle
November 2019
Introduction
Motivation

Inferring causal structure usually involves model selection among directed acyclic graphs (DAGs). While learning undirected graphical models is relatively well developed (e.g., graphical lasso, neighborhood selection), model selection for DAGs is less well understood. This poses a challenge to maintaining error guarantees in causal inference, even in large samples. In this talk, I will analyze the simplest example where such a challenge arises.
Marginal vs. conditional independence

Consider $(X_1, X_2, X_3)^\top \sim N(0, \Sigma)$ on $\mathbb{R}^3$, with covariance $\Sigma \in \mathbb{S}^3$, the set of $3 \times 3$ real positive definite matrices. We want to test between

$$M_0: X_1 \perp\!\!\!\perp X_2 \quad (X_1 \to X_3 \leftarrow X_2), \qquad M_1: X_1 \perp\!\!\!\perp X_2 \mid X_3 \quad (X_1 - X_3 - X_2),$$

assuming that at least one of them is true. Here $X_1 - X_3 - X_2$ includes the following Markov-equivalent DAGs:

$$X_1 \leftarrow X_3 \leftarrow X_2, \qquad X_1 \to X_3 \to X_2, \qquad X_1 \leftarrow X_3 \to X_2.$$
Marginal vs. conditional independence

Testing $M_0: X_1 \perp\!\!\!\perp X_2$ versus $M_1: X_1 \perp\!\!\!\perp X_2 \mid X_3$ is a non-nested model selection problem. The two models correspond to equality/algebraic constraints on $\Sigma = \{\sigma_{ij}\}$:

$$M_0: \sigma_{12} = 0, \qquad M_1: \sigma_{12\cdot 3} = \sigma_{12} - \sigma_{13}\sigma_{33}^{-1}\sigma_{23} = 0 \;\Leftrightarrow\; \sigma_{12}\sigma_{33} = \sigma_{13}\sigma_{23}.$$

$M_0$ and $M_1$ intersect at the two axes

$$M_0 \cap M_1 = \{\sigma_{12} = \sigma_{13} = 0\} \cup \{\sigma_{12} = \sigma_{23} = 0\}.$$
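As a quick numerical illustration of these constraints (a minimal sketch with a made-up covariance matrix, not from the talk), one can check whether a given $\Sigma$ satisfies $M_0$ or $M_1$:

```python
import numpy as np

# A hypothetical covariance matrix (values chosen for illustration) that
# satisfies the M1 constraint sigma_12 * sigma_33 = sigma_13 * sigma_23.
Sigma = np.array([
    [1.00, 0.15, 0.50],
    [0.15, 1.00, 0.30],
    [0.50, 0.30, 1.00],
])

# M0: marginal independence of X1 and X2, i.e. sigma_12 = 0.
m0_holds = np.isclose(Sigma[0, 1], 0.0)

# M1: conditional independence of X1 and X2 given X3, i.e. the partial
# covariance sigma_{12.3} = sigma_12 - sigma_13 * sigma_23 / sigma_33 = 0.
sigma_12_given_3 = Sigma[0, 1] - Sigma[0, 2] * Sigma[1, 2] / Sigma[2, 2]
m1_holds = np.isclose(sigma_12_given_3, 0.0)

print(m0_holds, m1_holds)  # False True for this Sigma
```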
Geometry

We visualize the parameter space in correlation coordinates:

$$M_0: \rho_{12} = 0, \qquad M_1: \rho_{12} = \rho_{13}\rho_{23}.$$
Singularity

The two axes further intersect at the origin

$$M_{\mathrm{sing}} = \{\sigma_{12} = \sigma_{13} = \sigma_{23} = 0\},$$

which is a singularity; $M_{\mathrm{sing}}$ corresponds to diagonal $\Sigma$.

• $M_0 \cap M_1$ vs. $\mathbb{S}^3$: the likelihood-ratio test (LRT) was studied by Drton (2006, 2009) and Drton and Sullivant (2007).
• The LRT has a non-standard asymptotic distribution at $M_{\mathrm{sing}}$.
• $M_0$ vs. $M_1$: at $M_{\mathrm{sing}}$, the tangent cones of the two models coincide.
• The models are called "1-equivalent" by Evans (2018), meaning that the linear approximations to the two parameter spaces are the same.
• Within a Euclidean ball of radius $m^{-1/2}$ around $M_{\mathrm{sing}}$, on the order of $m^2$ samples are required to distinguish $M_0$ from $M_1$.
Difficulty

Model selection for DAGs is usually conducted by one of the following approaches (Drton and Maathuis, 2017).

• Score-based: pick the model with the highest penalized likelihood score (e.g., AIC, BIC). Since $\dim(M_0) = \dim(M_1)$, both AIC and BIC simply pick the model with the higher likelihood.
• Constraint-based: test $M_0: X_1 \perp\!\!\!\perp X_2$ versus $M_1: X_1 \perp\!\!\!\perp X_2 \mid X_3$. This is the approach adopted by the PC algorithm. For Gaussian data, Fisher's $z$-transformation of the (partial) correlation is used as the test statistic.
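For concreteness, here is a minimal sketch (not the speaker's code; function names and defaults are my own) of the Fisher-$z$ test used in constraint-based selection for Gaussian data:

```python
import numpy as np
from scipy import stats

def fisher_z_test(r, n, n_cond, alpha=0.05):
    """Test H0: the (partial) correlation is zero via Fisher's z-transformation.

    r      : sample (partial) correlation
    n      : sample size
    n_cond : size of the conditioning set (0 for marginal independence)
    """
    z = 0.5 * np.log((1 + r) / (1 - r))            # Fisher's z-transform
    stat = np.sqrt(n - n_cond - 3) * abs(z)        # approximately N(0, 1) under H0
    return stat > stats.norm.ppf(1 - alpha / 2)    # True = reject independence

def partial_corr_12_given_3(R):
    """Partial correlation of X1 and X2 given X3, from a 3x3 correlation matrix R."""
    r12, r13, r23 = R[0, 1], R[0, 2], R[1, 2]
    return (r12 - r13 * r23) / np.sqrt((1 - r13**2) * (1 - r23**2))
```

A constraint-based selector would apply this test to $\hat\rho_{12}$ (with `n_cond=0`) and to $\hat\rho_{12\cdot 3}$ (with `n_cond=1`) and favor the model whose independence constraint is not rejected.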
Difficulty

Simulation with $n = 1{,}000$, $\rho = 0.3$, unit variances, and level $\alpha = 0.05$.

[Figure: empirical size (0 to 1) of BIC/AIC and PC model selection as a function of $|\gamma|$ (0 to 10), with one panel for the truth in $M_0 \setminus M_1$ and one for the truth in $M_1 \setminus M_0$; in each case one edge into $X_3$ has strength $\gamma n^{-1/2}$ and the other has strength $\rho$.]
Method
Likelihood ratio test for nested models

Consider a parametric family $\{P_\theta : \theta \in \Theta\}$, where $\Theta$ is an open subset of $\mathbb{R}^d$. For $\Theta_0 \subseteq \Theta$, suppose we want to test $H_0: \theta \in \Theta_0$ vs. $H_1: \theta \in \Theta$. Under regularity conditions, the likelihood ratio test (LRT) statistic satisfies

$$\lambda_n = 2\left(\sup_{\theta \in \Theta} \ell_n(\theta) - \sup_{\theta \in \Theta_0} \ell_n(\theta)\right) \;\Rightarrow\; \chi^2_c, \qquad c = d - \dim(\Theta_0),$$

where $\ell_n(\cdot)$ is the log-likelihood under sample size $n$.

For example, in the linear regression $y \sim \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3$, we use $\chi^2_2$ for testing $H_0: \beta_0 = \beta_1 = 0$ vs. $H_1: \beta \in \mathbb{R}^4$.
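The regression example can be reproduced in a few lines; in this sketch (simulated data, illustrative coefficients, not part of the talk) the error variance is profiled out, so the LRT statistic reduces to $n\log(\mathrm{RSS}_0/\mathrm{RSS}_1)$ and is compared to $\chi^2_2$:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 3))
# Simulate under H0: beta_0 = beta_1 = 0 (coefficients are illustrative).
y = 0.5 * X[:, 1] - 0.3 * X[:, 2] + rng.normal(size=n)

def rss(design, response):
    """Residual sum of squares of an OLS fit."""
    beta, *_ = np.linalg.lstsq(design, response, rcond=None)
    resid = response - design @ beta
    return resid @ resid

full = np.column_stack([np.ones(n), X])   # intercept + X1 + X2 + X3
reduced = X[:, 1:]                        # H0 drops the intercept and X1

# With sigma^2 profiled out, 2 * (sup log-lik full - sup log-lik reduced)
# equals n * log(RSS_reduced / RSS_full).
lrt = n * np.log(rss(reduced, y) / rss(full, y))
p_value = stats.chi2.sf(lrt, df=2)        # c = 4 - 2 = 2 degrees of freedom
```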
Likelihood ratio test

Similarly, we define the log-likelihood ratio of $M_0$ versus $M_1$ as

$$\lambda_n^{(0:1)} := 2\left(\sup_{\Sigma \in M_0} \ell_n(\Sigma) - \sup_{\Sigma \in M_1} \ell_n(\Sigma)\right) = 2\left(\ell_n(\hat\Sigma_n^{(0)}) - \ell_n(\hat\Sigma_n^{(1)})\right),$$

where $\hat\Sigma_n^{(0)}$ and $\hat\Sigma_n^{(1)}$ are the MLEs within $M_0$ and $M_1$, respectively, and $\ell_n(\cdot)$ is the Gaussian log-likelihood function

$$\ell_n(\Sigma) = \frac{n}{2}\left(-\log|\Sigma| - \operatorname{Tr}(S_n \Sigma^{-1})\right).$$
Likelihood ratio test

The Gaussian MLEs for DAG models take a closed form (Drton and Richardson, 2008), which yields the following expression for the LRT:

$$\lambda_n^{(0:1)} = n \log\!\left(\frac{(s_{13}^2 - s_{11}s_{33})(s_{23}^2 - s_{22}s_{33})}{s_{33}}\right) - n \log\!\left(s_{11}s_{22}\left(s_{33} + \frac{s_{22}s_{13}^2 - 2 s_{12}s_{23}s_{13} + s_{11}s_{23}^2}{s_{12}^2 - s_{11}s_{22}}\right)\right),$$

where $S = \{s_{ij}\}$ is the sample covariance taken with respect to mean zero.
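A direct transcription of this formula (a minimal sketch; it assumes the data are stored as an $n \times 3$ array and takes the sample covariance about mean zero, as above):

```python
import numpy as np

def lrt_marginal_vs_conditional(X):
    """Closed-form LRT statistic lambda_n^{(0:1)} for M0 (X1 _||_ X2) versus
    M1 (X1 _||_ X2 | X3), translated from the determinant formula above.

    X : (n, 3) array; the sample covariance S is taken about mean zero.
    """
    n = X.shape[0]
    S = X.T @ X / n
    s11, s22, s33 = S[0, 0], S[1, 1], S[2, 2]
    s12, s13, s23 = S[0, 1], S[0, 2], S[1, 2]

    # Determinant of the MLE within M1 (X1 - X3 - X2).
    det1 = (s13**2 - s11 * s33) * (s23**2 - s22 * s33) / s33
    # Determinant of the MLE within M0 (X1 -> X3 <- X2, sigma_12 = 0).
    det0 = s11 * s22 * (
        s33 + (s22 * s13**2 - 2 * s12 * s23 * s13 + s11 * s23**2)
        / (s12**2 - s11 * s22)
    )
    return n * np.log(det1) - n * np.log(det0)
```

Positive values of the statistic indicate that $M_0$ attains the higher maximized likelihood (since the two constrained MLEs have equal trace terms, the statistic reduces to the log-ratio of determinants).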
Our plan

1. An information-theoretic analysis of how well the two models can be distinguished (by any means).
2. Identify the regimes of "effect size" relative to $n$ in which the optimal error lies strictly between 0 and 1.
   • This yields a stable, non-degenerate asymptotic distribution for the LRT.
   • We will be doing large-$n$, small-effect asymptotics!
3. Derive the asymptotic distributions.
   • Are they uniform?
4. Develop a model selection procedure with error guarantees.
Optimal error

We study the minimax rate of distinguishing two sequences of distributions, one within $M_0$ and the other within $M_1$, as they approach $M_0 \cap M_1$.

Lemma (testing two simple hypotheses). For testing $H_0: X \sim P$ versus $H_1: X \sim Q$, the minimum sum of type-I and type-II errors is $1 - d_{\mathrm{TV}}(P, Q)$, where the total variation distance is

$$d_{\mathrm{TV}}(P, Q) = \sup_A |P(A) - Q(A)| = \frac{1}{2}\int |p - q|\, d\mu.$$
Optimal error

Consider a sequence $\Sigma_n^{(0)} \in M_0 \setminus M_1$ with $\Sigma_n^{(0)} \to \Sigma^* \in M_0 \cap M_1$, and let $P_n = P_{\Sigma_n^{(0)}}$. Correspondingly, let $Q_n = P_{\Sigma_n^{(1)}}$ with $\Sigma_n^{(1)}$ from $M_1 \setminus M_0$ chosen as

$$\Sigma_n^{(1)} = \arg\min_{\Sigma \in M_1 \setminus M_0} D_{\mathrm{KL}}\!\left(P_{\Sigma_n^{(0)}} \,\|\, P_\Sigma\right),$$

i.e., the member of $M_1 \setminus M_0$ that is most difficult to distinguish from $P_n$. We then compute the total variation distance between the product measures $P_n^n$ and $Q_n^n$ ($n$ iid samples). The limiting optimal error can be sandwiched by the Hellinger distance $H(P, Q) := \left(\frac{1}{2}\int (\sqrt{p} - \sqrt{q})^2 \, d\mu\right)^{1/2}$:

$$H^2(P_n^n, Q_n^n) \le d_{\mathrm{TV}}(P_n^n, Q_n^n) \le H(P_n^n, Q_n^n)\sqrt{2 - H^2(P_n^n, Q_n^n)}.$$
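For zero-mean Gaussians, the Hellinger distance is available in closed form through the Bhattacharyya coefficient, and the affinity of an $n$-fold product measure is the $n$-th power of the single-sample affinity. A sketch of the sandwich bound (helper names are my own):

```python
import numpy as np

def hellinger_sq(S1, S2):
    """Squared Hellinger distance between N(0, S1) and N(0, S2) (closed form)."""
    avg = (S1 + S2) / 2
    bc = (np.linalg.det(S1) ** 0.25 * np.linalg.det(S2) ** 0.25
          / np.linalg.det(avg) ** 0.5)                 # Bhattacharyya coefficient
    return 1.0 - bc

def tv_sandwich(S1, S2, n):
    """Lower and upper bounds on d_TV(P^n, Q^n) via the Hellinger distance,
    using 1 - H^2(P^n, Q^n) = (1 - H^2(P, Q))^n."""
    h2 = 1.0 - (1.0 - hellinger_sq(S1, S2)) ** n
    return h2, np.sqrt(h2 * (2.0 - h2))
```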
Optimal error

With some algebra, we have

$$1 - d_{\mathrm{TV}}(P_n^n, Q_n^n) \to \begin{cases} 0, & H(P_n, Q_n) = \omega(n^{-1/2}), \\ 1, & H(P_n, Q_n) = o(n^{-1/2}), \end{cases}$$

and when $H(P_n, Q_n) \asymp n^{-1/2}$,

$$0 < \liminf_n \{1 - d_{\mathrm{TV}}(P_n^n, Q_n^n)\} \le \limsup_n \{1 - d_{\mathrm{TV}}(P_n^n, Q_n^n)\} < 1.$$

Effect size: $H(P_n, Q_n) \asymp \rho_{13,n}\,\rho_{23,n}$, where $\rho_{ij} = \sigma_{ij}/\sqrt{\sigma_{ii}\sigma_{jj}}$ is the correlation coefficient.
Optimal error

Comparing $H(P_n, Q_n)$ to $n^{-1/2}$, there are two regimes that stabilize the asymptotic error, i.e., that give $\{1 - d_{\mathrm{TV}}(P_n^n, Q_n^n)\} \to c \in (0, 1)$:

• "weak-strong": $\rho_{13,n} \asymp \gamma n^{-1/2}$ and $\rho_{23,n} \to \rho_{23} \ne 0$ (or, symmetrically, $\rho_{23,n} \asymp \gamma n^{-1/2}$ and $\rho_{13,n} \to \rho_{13} \ne 0$);
• "weak-weak": $\rho_{13,n}\,\rho_{23,n} \asymp \delta n^{-1/2}$ with $\rho_{13,n}, \rho_{23,n} \to 0$.
Asymptotics: weak-strong regime

We study the (local) asymptotic distribution of $\lambda_n^{(0:1)}$. For $r = \gamma\sqrt{\sigma_{11}\sigma_{33}}$, we set

$$\Sigma_n^{(0)} = \begin{pmatrix} \sigma_{11} & 0 & r/\sqrt{n} \\ 0 & \sigma_{22} & \sigma_{23} \\ r/\sqrt{n} & \sigma_{23} & \sigma_{33} \end{pmatrix}, \qquad \Sigma_n^{(1)} = \begin{pmatrix} \sigma_{11} & (r/\sqrt{n})\,\sigma_{23}/\sigma_{33} & r/\sqrt{n} \\ (r/\sqrt{n})\,\sigma_{23}/\sigma_{33} & \sigma_{22} & \sigma_{23} \\ r/\sqrt{n} & \sigma_{23} & \sigma_{33} \end{pmatrix},$$

both of which converge to

$$\Sigma^* = \begin{pmatrix} \sigma_{11} & 0 & 0 \\ 0 & \sigma_{22} & \sigma_{23} \\ 0 & \sigma_{23} & \sigma_{33} \end{pmatrix}.$$
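A sketch of this construction (illustrative default variances and $\sigma_{23}$; the function name is my own), which could be paired with the LRT statistic defined earlier to simulate its distribution in this regime:

```python
import numpy as np

def weak_strong_sequences(n, gamma, s11=1.0, s22=1.0, s33=1.0, s23=0.3):
    """Local sequences Sigma_n^{(0)} in M0 \\ M1 and Sigma_n^{(1)} in M1 \\ M0
    for the weak-strong regime; both converge to the same Sigma* in M0 & M1."""
    r = gamma * np.sqrt(s11 * s33)
    s13 = r / np.sqrt(n)
    Sigma0 = np.array([[s11, 0.0, s13],
                       [0.0, s22, s23],
                       [s13, s23, s33]])
    s12 = s13 * s23 / s33   # enforces sigma_12 * sigma_33 = sigma_13 * sigma_23
    Sigma1 = np.array([[s11, s12, s13],
                       [s12, s22, s23],
                       [s13, s23, s33]])
    return Sigma0, Sigma1

# Example: draw n samples from Sigma_n^{(0)} and evaluate the LRT statistic
# lrt_marginal_vs_conditional defined in the earlier sketch.
rng = np.random.default_rng(1)
Sigma0, Sigma1 = weak_strong_sequences(n=1000, gamma=2.0)
X = rng.multivariate_normal(np.zeros(3), Sigma0, size=1000)
```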
Asymptotics: weak-strong regime

[Figure: three DAG diagrams over $X_1, X_2, X_3$, illustrating $\Sigma_n^{(0)} \in M_0 \setminus M_1$, $\Sigma_n^{(1)} \in M_1 \setminus M_0$, and the limit $\Sigma^* \in M_0 \cap M_1$, with edge strengths $\gamma n^{-1/2}$ and $\rho$.]