Identifiability and Consistency of Bayesian Network Structure Learning from Incomplete Data

Tjebbe Bodewes¹ (tjebbe.bodewes@linacre.ox.ac.uk) and Marco Scutari² (scutari@idsia.ch)

¹ Zivver & Department of Statistics, University of Oxford
² Dalle Molle Institute for Artificial Intelligence (IDSIA)

September 24, 2020
Introduction

Learning a Bayesian network $\mathcal{B} = (\mathcal{G}, \Theta)$ from a data set $\mathcal{D}$ involves:
$$\underbrace{P(\mathcal{B} \mid \mathcal{D})}_{\text{learning}} = \underbrace{P(\mathcal{G} \mid \mathcal{D})}_{\text{structure learning}} \cdot \underbrace{P(\Theta \mid \mathcal{G}, \mathcal{D})}_{\text{parameter learning}}.$$
Structure learning rests on
$$P(\mathcal{G} \mid \mathcal{D}) \propto P(\mathcal{G})\, P(\mathcal{D} \mid \mathcal{G}) = P(\mathcal{G}) \int P(\mathcal{D} \mid \mathcal{G}, \Theta)\, P(\Theta \mid \mathcal{G})\, d\Theta,$$
where $P(\mathcal{G})$ is the prior over the space of DAGs and $P(\mathcal{D} \mid \mathcal{G})$ is the marginal likelihood (ML) of the data. Assuming complete data, we can decompose $P(\mathcal{D} \mid \mathcal{G})$ into
$$P(\mathcal{D} \mid \mathcal{G}) = \prod_{j=1}^{N} \left[ \int P(Y_j \mid \Pi_{Y_j}, \Theta_{Y_j})\, P(\Theta_{Y_j} \mid \Pi_{Y_j})\, d\Theta_{Y_j} \right],$$
where $\Pi_{Y_j}$ are the parents of $Y_j$ in $\mathcal{G}$. BIC [9] is often used to approximate the marginal likelihood. Denote these scores by $T_{\mathrm{ML}}(\mathcal{G} \mid \mathcal{D})$ and $T_{\mathrm{BIC}}(\mathcal{G} \mid \mathcal{D})$, respectively.
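The decomposition above is what makes complete-data structure learning tractable: the network score is a sum of per-node terms, so local search only needs to re-score the nodes whose parent sets change. A minimal sketch of that idea follows; `local_score` is a hypothetical complete-data local score (for example a per-node log-ML or BIC term), not code from the poster.

```python
# Sketch only: `local_score(data, node, parents)` is a hypothetical stand-in for
# any decomposable complete-data local score (e.g. a per-node log-ML or BIC term).

def network_score(data, dag, local_score):
    """Score of a DAG as a sum of local terms; `dag` maps each node to its parents."""
    return sum(local_score(data, node, parents) for node, parents in dag.items())

def rescore_arc_addition(data, dag, cached, parent, child, local_score):
    """Only the child of the new arc parent -> child needs to be re-scored."""
    updated = dict(cached)  # cached per-node scores for the current DAG
    updated[child] = local_score(data, child, dag[child] + [parent])
    return updated
```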
Learning a Bayesian Network from Incomplete Data

When the data are incomplete, $T_{\mathrm{ML}}(\mathcal{G} \mid \mathcal{D})$ and $T_{\mathrm{BIC}}(\mathcal{G} \mid \mathcal{D})$ are no longer decomposable because we must integrate out the missing values. We can use Expectation-Maximisation (EM) [4]:

• in the E-step, we compute the expected sufficient statistics conditional on the observed data using belief propagation [7, 8, 10];
• in the M-step, we use complete-data learning methods with the expected sufficient statistics.

There are two ways of applying EM to structure learning:

• We can apply EM separately to each candidate DAG to be scored, as in the variational-Bayes EM [2].
• We can embed structure learning in the M-step, estimating the expected sufficient statistics using the current best DAG. This approach is called Structural EM [5, 6] (sketched below).

The latter is computationally feasible for medium and large problems, but still computationally demanding.
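A high-level sketch of the Structural EM loop, assuming hypothetical helpers `impute` (posterior completion of the missing cells, e.g. via belief propagation), `hill_climb` (complete-data structure search) and `fit_parameters` (complete-data parameter estimation); it is an outline of the scheme above, not the authors' implementation.

```python
def structural_em(data, initial_dag, n_iterations=10):
    """Outline of Structural EM; `impute`, `hill_climb` and `fit_parameters`
    are hypothetical helpers standing in for the steps described above."""
    dag = initial_dag
    completed = impute(data, dag, params=None)   # crude initial completion
    params = fit_parameters(dag, completed)
    for _ in range(n_iterations):
        # E-step: complete the data (or the sufficient statistics) using the
        # current network, e.g. by belief propagation over (dag, params).
        completed = impute(data, dag, params)
        # M-step: re-learn structure and parameters with ordinary
        # complete-data methods on the completed data.
        dag = hill_climb(completed, start=dag)
        params = fit_parameters(dag, completed)
    return dag, params
```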
The Node-Averaged Likelihood

Balov [1] proposed a more scalable approach for discrete BNs called the Node-Averaged Likelihood (NAL). NAL computes each term using the locally-complete data $\mathcal{D}^{(j)} \subseteq \mathcal{D}$ for which $Y_j, \Pi_{Y_j}$ are observed:
$$\bar\ell(Y_j \mid \Pi_{Y_j}, \hat\Theta_{Y_j}) = \frac{1}{|\mathcal{D}^{(j)}|} \sum_{\mathcal{D}^{(j)}} \log P(Y_j \mid \Pi_{Y_j}, \hat\Theta_{Y_j}) \to \mathrm{E}\!\left[\ell(Y_j \mid \Pi_{Y_j})\right],$$
which Balov used to define
$$T_{\mathrm{PL}}(\mathcal{G} \mid \mathcal{D}) = \bar\ell(\mathcal{G}, \Theta \mid \mathcal{D}) - \mu_n h(\mathcal{G}), \qquad \mu_n \in \mathbb{R}^+,\ h : \mathbb{G} \to \mathbb{R}^+,$$
and structure learning as $\hat{\mathcal{G}} = \operatorname{argmax}_{\mathcal{G} \in \mathbb{G}} T_{\mathrm{PL}}(\mathcal{G} \mid \mathcal{D})$.

Balov proved both identifiability and consistency of structure learning when using $T_{\mathrm{PL}}(\mathcal{G} \mid \mathcal{D})$ for discrete BNs. We will now prove that both properties hold more generally, and in particular that they hold for conditional Gaussian BNs (CGBNs).
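A minimal sketch (ours, assuming discrete data in a pandas DataFrame with missing cells stored as NaN) of the per-node NAL term above: the maximum-likelihood conditional probabilities and the averaged log-likelihood are computed only on the locally-complete rows. Whether the node terms are then summed or averaged over nodes to obtain $\bar\ell(\mathcal{G}, \Theta \mid \mathcal{D})$ is a normalisation choice; the sketch sums them.

```python
import numpy as np
import pandas as pd

def node_averaged_loglik(data: pd.DataFrame, node: str, parents: list) -> float:
    """NAL term of one discrete node, using only its locally-complete rows."""
    local = data.dropna(subset=[node] + parents)  # rows where node and parents are observed
    if local.empty:
        return float("-inf")
    if parents:
        joint = local.groupby(parents + [node]).size()
        parent = local.groupby(parents).size()
        probs = joint / parent.loc[joint.index.droplevel(node)].to_numpy()
    else:
        joint = local[node].value_counts()
        probs = joint / len(local)
    # average (not sum) of the log-likelihood over the locally-complete sample
    return float((joint * np.log(probs)).sum() / len(local))

def penalised_nal(data, dag, mu_n, h):
    """T_PL(G | D): summed node terms minus the penalty mu_n * h(G)."""
    nal = sum(node_averaged_loglik(data, node, parents) for node, parents in dag.items())
    return nal - mu_n * h(dag)
```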
Identifiability (General)

Denote the true DAG as $\mathcal{G}_0$ and the equivalence class it belongs to as $[\mathcal{G}_0]$. Under MCAR, we have:

1. $\max_{\mathcal{G} \in \mathbb{G}} \bar\ell(\mathcal{G}, \Theta) = \bar\ell(\mathcal{G}_0, \Theta_0)$.
2. If $\bar\ell(\mathcal{G}, \Theta) = \bar\ell(\mathcal{G}_0, \Theta_0)$, then $P(\mathbf{X}) = P_0(\mathbf{X})$.
3. If $\mathcal{G}_0 \subseteq \mathcal{G}$, then $\bar\ell(\mathcal{G}, \Theta) = \bar\ell(\mathcal{G}_0, \Theta_0)$.

Identifiability follows from the above: $[\mathcal{G}_0]$ is identifiable under MCAR, that is,
$$\mathcal{G}_0 \cong \min\left\{ \mathcal{G}^* \in \mathbb{G} : \bar\ell(\mathcal{G}^*, \Theta^*) = \max_{\mathcal{G} \in \mathbb{G}} \bar\ell(\mathcal{G}, \Theta) \right\}.$$
Consistency (for CGBNs)

From [1], the sufficient conditions for consistency are:

1. If $\mathcal{G}_0 \subseteq \mathcal{G}_1$ and $\mathcal{G}_0 \nsubseteq \mathcal{G}_2$, then $\lim_{n \to \infty} P\!\left(T_{\mathrm{PL}}(\mathcal{G}_1 \mid \mathcal{D}) > T_{\mathrm{PL}}(\mathcal{G}_2 \mid \mathcal{D})\right) = 1$.
2. If $\mathcal{G}_0 \subseteq \mathcal{G}_1$ and $\mathcal{G}_1 \subset \mathcal{G}_2$, then $\lim_{n \to \infty} P\!\left(T_{\mathrm{PL}}(\mathcal{G}_1 \mid \mathcal{D}) > T_{\mathrm{PL}}(\mathcal{G}_2 \mid \mathcal{D})\right) = 1$.
3. $\exists\, \mathcal{G}$ such that $\Pi^{(\mathcal{G}_0)}_{Y_j} \subset \Pi^{(\mathcal{G})}_{Y_j}$, $\Pi^{(\mathcal{G})}_{Y_k} = \Pi^{(\mathcal{G}_0)}_{Y_k}$ for all $k \neq j$, and the variables in $\Pi^{(\mathcal{G})}_{Y_j} \setminus \Pi^{(\mathcal{G}_0)}_{Y_j}$ are neither always observed nor never observed (hence $\mathcal{G}_0$ must not be a maximal DAG).

Under some regularity conditions, we show when they hold for CGBNs. Let $\mathcal{G}_0$ be identifiable, $\mu_n \to 0$ as $n \to \infty$, and assume the MLEs and NAL's Hessian exist and are finite. Then as $n \to \infty$:

1. If $n\mu_n \to \infty$, $\hat{\mathcal{G}}$ is consistent.
2. Under MCAR and $\mathrm{VAR}(\mathrm{NAL}) < \infty$, if $\sqrt{n}\,\mu_n \to \infty$, $\hat{\mathcal{G}}$ is consistent.
3. Under the above and condition 3, if $\liminf_{n \to \infty} \sqrt{n}\,\mu_n < \infty$, then $\hat{\mathcal{G}}$ is not consistent.
Conclusions

• In $T_{\mathrm{BIC}}(\mathcal{G} \mid \mathcal{D})$, $n\mu_n = \log(n)/2 \to \infty$ and $\sqrt{n}\,\mu_n = \log(n)/(2\sqrt{n}) \to 0$, so BIC satisfies the first condition but not the second in the main result. Hence BIC is consistent for complete data but not for incomplete data.
• The equivalent $T_{\mathrm{AIC}}(\mathcal{G} \mid \mathcal{D})$ does not satisfy either condition, which confirms and extends the results in [3]. Hence AIC is not consistent for either complete or incomplete data.
• How to choose $\mu_n$ is an open problem.
• Proving these results is complicated because:
  • $T_{\mathrm{PL}}(\mathcal{G} \mid \mathcal{D})$ is fitted on different subsets of $\mathcal{D}$ for different $\mathcal{G}$, so models are not nested;
  • variables have heterogeneous distributions;
  • DAGs that may represent misspecified models [11] are not representable in terms of $\mathcal{G}_0$, so minimising Kullback–Leibler distances to obtain MLEs does not necessarily make them vanish as $n \to \infty$.

A numeric illustration of the BIC and AIC penalty rates appears below.
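A small numeric check (ours) of the two penalty rates quoted above, writing $\mu_n$ as the per-sample, per-parameter penalty coefficient implied by BIC and AIC; treating $h(\mathcal{G})$ as the number of parameters is an assumption about how the penalty is normalised.

```python
import numpy as np

# mu_n implied by BIC and AIC when the penalised score is written as
#   T_PL(G | D) = NAL(G | D) - mu_n * h(G),  with h(G) = number of parameters
mu_bic = lambda n: np.log(n) / (2 * n)
mu_aic = lambda n: 1.0 / n

for n in [1e2, 1e4, 1e6, 1e8]:
    print(f"n = {n:.0e}: "
          f"BIC n*mu_n = {n * mu_bic(n):7.2f}, sqrt(n)*mu_n = {np.sqrt(n) * mu_bic(n):.4f} | "
          f"AIC n*mu_n = {n * mu_aic(n):.2f}, sqrt(n)*mu_n = {np.sqrt(n) * mu_aic(n):.4f}")

# For BIC, n*mu_n diverges (first condition holds) but sqrt(n)*mu_n -> 0
# (second condition fails); for AIC, both quantities stay bounded, so neither holds.
```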
Thanks! Any questions?
References I

[1] N. Balov. Consistent Model Selection of Discrete Bayesian Networks from Incomplete Data. Electronic Journal of Statistics, 7:1047–1077, 2013.
[2] M. Beal and Z. Ghahramani. The Variational Bayesian EM Algorithm for Incomplete Data: with Application to Scoring Graphical Model Structures. Bayesian Statistics, 7:453–464, 2003.
[3] H. Bozdogan. Model Selection and Akaike's Information Criterion (AIC): The General Theory and its Analytical Extensions. Psychometrika, 52(3):345–370, 1987.
[4] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Series B, pages 1–38, 1977.
[5] N. Friedman. Learning Belief Networks in the Presence of Missing Values and Hidden Variables. In ICML, pages 125–133, 1997.
[6] N. Friedman. The Bayesian Structural EM Algorithm. In UAI, pages 129–138, 1998.
References II

[7] S. L. Lauritzen. The EM Algorithm for Graphical Association Models with Missing Data. Computational Statistics & Data Analysis, 19(2):191–201, 1995.
[8] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers Inc., 1988.
[9] G. Schwarz. Estimating the Dimension of a Model. The Annals of Statistics, 6(2):461–464, 1978.
[10] G. Shafer and P. P. Shenoy. Probability Propagation. Annals of Mathematics and Artificial Intelligence, 2(1-4):327–351, 1990.
[11] H. White. Maximum Likelihood Estimation of Misspecified Models. Econometrica, 50(1):1–25, 1982.