Dirichlet Bayesian Network Scores and the Maximum Entropy Principle

Marco Scutari (scutari@stats.ox.ac.uk)
Department of Statistics, University of Oxford

September 21, 2017
Bayesian Network Structure Learning

Learning a BN B = (G, Θ) from a data set D is performed in two steps:

\[
\underbrace{\mathrm{P}(B \mid D)}_{\text{learning}} = \mathrm{P}(G, \Theta \mid D)
  = \underbrace{\mathrm{P}(G \mid D)}_{\text{structure learning}} \cdot \underbrace{\mathrm{P}(\Theta \mid G, D)}_{\text{parameter learning}} .
\]

In a Bayesian setting, structure learning consists in finding the DAG with the best P(G | D) (BIC [6] is a common alternative) with some heuristic search algorithm. We can decompose P(G | D) into

\[
\mathrm{P}(G \mid D) \propto \mathrm{P}(G) \, \mathrm{P}(D \mid G)
  = \mathrm{P}(G) \int \mathrm{P}(D \mid G, \Theta) \, \mathrm{P}(\Theta \mid G) \, d\Theta
\]

where P(G) is the prior distribution over the space of the DAGs and P(D | G) is the marginal likelihood of the data given G, averaged over all possible parameter sets Θ; and then

\[
\mathrm{P}(D \mid G) = \prod_{i=1}^{N} \left[ \int \mathrm{P}(X_i \mid \Pi_{X_i}, \Theta_{X_i}) \, \mathrm{P}(\Theta_{X_i} \mid \Pi_{X_i}) \, d\Theta_{X_i} \right]
\]

where Π_{X_i} are the parents of X_i in G.
The Bayesian Dirichlet Marginal Likelihood

If D contains no missing values and assuming:
• a conjugate Dirichlet prior (X_i | Π_{X_i} ~ Mult(Θ_{X_i | Π_{X_i}}) and Θ_{X_i | Π_{X_i}} ~ Dir(α_{ijk}), with Σ_{jk} α_{ijk} = α_i the imaginary sample size);
• positivity (all conditional probabilities π_{ijk} > 0);
• parameter independence (the π_{ijk} for different parent configurations are independent) and modularity (the π_{ijk} in different nodes are independent);

Heckerman et al. [4] derived a closed-form expression for P(D | G):

\[
\mathrm{BD}(G, D; \alpha) = \prod_{i=1}^{N} \mathrm{BD}(X_i, \Pi_{X_i}; \alpha_i)
  = \prod_{i=1}^{N} \prod_{j=1}^{q_i} \left[ \frac{\Gamma(\alpha_{ij})}{\Gamma(\alpha_{ij} + n_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(\alpha_{ijk} + n_{ijk})}{\Gamma(\alpha_{ijk})} \right]
\]

where r_i is the number of states of X_i; q_i is the number of configurations of Π_{X_i}; n_{ij} = Σ_k n_{ijk}; and α_{ij} = Σ_k α_{ijk}.
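As a concrete illustration, here is a minimal sketch (my own code, not from the talk) of the per-node BD term above, computed in log space with scipy's log-Gamma; the helper names log_bd_node and log_bdeu_node, and the table layout, are assumptions on my part.

```python
# Minimal sketch of the per-node BD term above (assumed implementation, not the
# author's code). n_ijk is the (q_i x r_i) contingency table of X_i against the
# configurations of its parents; alpha_ijk is the matching table of Dirichlet
# hyperparameters. Everything is computed in log space via log-Gamma.
import numpy as np
from scipy.special import gammaln

def log_bd_node(n_ijk, alpha_ijk):
    n_ij = n_ijk.sum(axis=1)           # n_ij  = sum_k n_ijk
    alpha_ij = alpha_ijk.sum(axis=1)   # a_ij  = sum_k a_ijk
    return np.sum(gammaln(alpha_ij) - gammaln(alpha_ij + n_ij)
                  + np.sum(gammaln(alpha_ijk + n_ijk) - gammaln(alpha_ijk), axis=1))

def log_bdeu_node(n_ijk, alpha=1.0):
    # BDeu special case (next slide): alpha_ijk = alpha / (r_i * q_i) in every cell.
    q_i, r_i = n_ijk.shape
    return log_bd_node(n_ijk, np.full((q_i, r_i), alpha / (r_i * q_i)))
```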
Bayesian Dirichlet Equivalent Uniform (BDeu)

The most common implementation of BD assumes α_{ijk} = α/(r_i q_i) with α_i = α for all nodes, and is known from [4] as the Bayesian Dirichlet equivalent uniform (BDeu) marginal likelihood. However, there is evidence that assuming a flat prior over the parameters can be problematic:
• The prior is actually not uninformative [5].
• MAP DAGs selected using BDeu are highly sensitive to the choice of α and can have markedly different numbers of arcs even for reasonable α [8].
• In the limits α → 0 and α → ∞ it is possible to obtain both very simple and very complex DAGs, and model comparison may be inconsistent for small D and small α [8, 10].
• The sparseness of the MAP network is determined by a complex interaction between α and D [10, 12].
• There are formal proofs of all this in [11, 12].
Exhibits A and B

[Figure: two DAGs over the variables W, X, Y, Z — the structures G− and G+ compared in the two exhibits, which differ by the arc Y → X.]
Exhibit A

The sample frequencies (n_ijk) for X | Π_X are:

        Z, W:   0,0   1,0   0,1   1,1
  X = 0          2     1     1     2
  X = 1          1     2     2     1

and those for X | Π_X ∪ Y are as follows.

     Z, W, Y:  0,0,0  1,0,0  0,1,0  1,1,0  0,0,1  1,0,1  0,1,1  1,1,1
  X = 0          2      1      1      0      0      0      0      2
  X = 1          1      2      2      0      0      0      0      1

Even though X | Π_X and X | Π_X ∪ Y have the same empirical entropy,

\[
\mathrm{H}(X \mid \Pi_X) = \mathrm{H}(X \mid \Pi_X \cup Y)
  = 4 \left( -\tfrac{1}{3} \log \tfrac{1}{3} - \tfrac{2}{3} \log \tfrac{2}{3} \right) = 2.546\ldots
\]
Exhibit A

... G− has a higher entropy than G+ a posteriori with α = 1 ...

\[
\mathrm{H}(X \mid \Pi_X; \alpha)
  = 4 \left( -\frac{1 + 1/8}{3 + 1/4} \log \frac{1 + 1/8}{3 + 1/4}
             -\frac{2 + 1/8}{3 + 1/4} \log \frac{2 + 1/8}{3 + 1/4} \right) = 2.580,
\]
\[
\mathrm{H}(X \mid \Pi_X \cup Y; \alpha)
  = 4 \left( -\frac{1 + 1/16}{3 + 1/8} \log \frac{1 + 1/16}{3 + 1/8}
             -\frac{2 + 1/16}{3 + 1/8} \log \frac{2 + 1/16}{3 + 1/8} \right) = 2.564\ldots
\]

... and BDeu with α = 1 chooses accordingly, so things fortunately work out:

\[
\mathrm{BDeu}(X \mid \Pi_X)
  = \left[ \frac{\Gamma(1/4)}{\Gamma(1/4 + 3)} \cdot \frac{\Gamma(1/8 + 2)}{\Gamma(1/8)} \cdot \frac{\Gamma(1/8 + 1)}{\Gamma(1/8)} \right]^4 = 3.906 \times 10^{-7},
\]
\[
\mathrm{BDeu}(X \mid \Pi_X \cup Y)
  = \left[ \frac{\Gamma(1/8)}{\Gamma(1/8 + 3)} \cdot \frac{\Gamma(1/16 + 2)}{\Gamma(1/16)} \cdot \frac{\Gamma(1/16 + 1)}{\Gamma(1/16)} \right]^4 = 3.721 \times 10^{-8}.
\]
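The figures above can be checked numerically; this is my own sketch (function names are mine, and the posterior entropy is summed over the observed parent configurations only, which is what reproduces the slide's values).

```python
# Numerical check of the Exhibit A figures (my sketch, not the talk's code).
# The posterior entropy below sums -p log p over the *observed* parent
# configurations only, which is what reproduces the 2.580 / 2.564 values.
import numpy as np
from scipy.special import gammaln

def posterior_entropy(n_ijk, alpha, q_i):
    r_i = n_ijk.shape[1]
    a_ijk = alpha / (r_i * q_i)
    p = (n_ijk + a_ijk) / (n_ijk.sum(axis=1, keepdims=True) + r_i * a_ijk)
    return -np.sum(p * np.log(p))

def log_bdeu(n_ijk, alpha, q_i):
    r_i = n_ijk.shape[1]
    a_ijk = alpha / (r_i * q_i)
    n_ij = n_ijk.sum(axis=1)
    return np.sum(gammaln(r_i * a_ijk) - gammaln(r_i * a_ijk + n_ij)
                  + np.sum(gammaln(a_ijk + n_ijk) - gammaln(a_ijk), axis=1))

# Exhibit A: the four observed parent configurations carry the same counts under
# both parent sets; under Pi_X u {Y} they are 4 out of q_i = 8 configurations.
nA = np.array([[2, 1], [1, 2], [1, 2], [2, 1]])

print(posterior_entropy(nA, alpha=1, q_i=4))   # ~2.580  (G-)
print(posterior_entropy(nA, alpha=1, q_i=8))   # ~2.564  (G+)
print(np.exp(log_bdeu(nA, alpha=1, q_i=4)))    # ~3.906e-07
print(np.exp(log_bdeu(nA, alpha=1, q_i=8)))    # ~3.721e-08
```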
Exhibit B

The sample frequencies for X | Π_X are:

        Z, W:   0,0   1,0   0,1   1,1
  X = 0          3     0     0     3
  X = 1          0     3     3     0

and those for X | Π_X ∪ Y are as follows.

     Z, W, Y:  0,0,0  1,0,0  0,1,0  1,1,0  0,0,1  1,0,1  0,1,1  1,1,1
  X = 0          3      0      0      0      0      0      0      3
  X = 1          0      3      3      0      0      0      0      0

The empirical entropy of X is equal to zero for both G+ and G−, since the value of X is completely determined by the configuration of its parents in both cases.
Exhibit B

Again, the posterior entropies for G+ and G− differ:

\[
\mathrm{H}(X \mid \Pi_X; \alpha)
  = 4 \left( -\frac{0 + 1/8}{3 + 1/4} \log \frac{0 + 1/8}{3 + 1/4}
             -\frac{3 + 1/8}{3 + 1/4} \log \frac{3 + 1/8}{3 + 1/4} \right) = 0.652,
\]
\[
\mathrm{H}(X \mid \Pi_X \cup Y; \alpha)
  = 4 \left( -\frac{0 + 1/16}{3 + 1/8} \log \frac{0 + 1/16}{3 + 1/8}
             -\frac{3 + 1/16}{3 + 1/8} \log \frac{3 + 1/16}{3 + 1/8} \right) = 0.392.
\]

However, BDeu with α = 1 yields

\[
\mathrm{BDeu}(X \mid \Pi_X)
  = \left[ \frac{\Gamma(1/4)}{\Gamma(1/4 + 3)} \cdot \frac{\Gamma(1/8 + 3)}{\Gamma(1/8)} \cdot \frac{\Gamma(1/8)}{\Gamma(1/8)} \right]^4 = 0.032,
\]
\[
\mathrm{BDeu}(X \mid \Pi_X \cup Y)
  = \left[ \frac{\Gamma(1/8)}{\Gamma(1/8 + 3)} \cdot \frac{\Gamma(1/16 + 3)}{\Gamma(1/16)} \cdot \frac{\Gamma(1/16)}{\Gamma(1/16)} \right]^4 = 0.044,
\]

preferring G+ over G− even though the additional arc Y → X does not provide any additional information on the distribution of X, and even though 4 out of 8 conditional distributions in X | Π_X ∪ Y are not observed at all in the data.
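The same kind of check (again my own sketch) reproduces the Exhibit B scores and the reversal: posterior entropy prefers G−, BDeu prefers G+.

```python
# Exhibit B check (my sketch): BDeu with alpha = 1 scores G+ above G- even
# though the observed counts are identical under both parent sets.
import numpy as np
from scipy.special import gammaln

def log_bdeu(n_ijk, alpha, q_i):
    r_i = n_ijk.shape[1]
    a_ijk = alpha / (r_i * q_i)
    n_ij = n_ijk.sum(axis=1)
    return np.sum(gammaln(r_i * a_ijk) - gammaln(r_i * a_ijk + n_ij)
                  + np.sum(gammaln(a_ijk + n_ijk) - gammaln(a_ijk), axis=1))

nB = np.array([[3, 0], [0, 3], [0, 3], [3, 0]])  # the four observed configurations

print(np.exp(log_bdeu(nB, alpha=1, q_i=4)))   # ~0.032  (G-)
print(np.exp(log_bdeu(nB, alpha=1, q_i=8)))   # ~0.044  (G+)
```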
Better Than BDeu: Bayesian Dirichlet Sparse (BDs)

If the positivity assumption is violated or the sample size n is small, there may be configurations of some Π_{X_i} that are not observed in D. Then

\[
\mathrm{BDeu}(X_i, \Pi_{X_i}; \alpha)
  = \underbrace{\prod_{j: n_{ij} = 0} \left[ \frac{\Gamma(\alpha_{ij})}{\Gamma(\alpha_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(\alpha_{ijk})}{\Gamma(\alpha_{ijk})} \right]}_{=\,1}
    \; \prod_{j: n_{ij} > 0} \left[ \frac{\Gamma(\alpha_{ij})}{\Gamma(\alpha_{ij} + n_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(\alpha_{ijk} + n_{ijk})}{\Gamma(\alpha_{ijk})} \right],
\]

so the effective imaginary sample size decreases as the number of unobserved parent configurations increases. We can prevent that by replacing α_{ijk} with

\[
\tilde{\alpha}_{ijk} = \begin{cases} \alpha / (r_i \tilde{q}_i) & \text{if } n_{ij} > 0 \\ 0 & \text{otherwise,} \end{cases}
\qquad
\tilde{q}_i = \{\text{number of } \Pi_{X_i} \text{ configurations such that } n_{ij} > 0\},
\]

and plugging it into BD instead of α_{ijk} = α/(r_i q_i) to obtain BDs. Then

\[
\mathrm{BDs}(X_i, \Pi_{X_i}; \alpha) = \mathrm{BDeu}(X_i, \Pi_{X_i}; \alpha q_i / \tilde{q}_i).
\]
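A minimal sketch of BDs along these lines (an assumed implementation of my own, not the reference one in the author's bnlearn package; the function name and table layout are mine):

```python
# Sketch of the BDs score for one node (assumed implementation). The imaginary
# sample size alpha is spread over the q~_i parent configurations that are
# actually observed, so unobserved configurations get alpha~_ijk = 0 and drop
# out of the product.
import numpy as np
from scipy.special import gammaln

def log_bds_node(n_ijk, alpha=1.0):
    # n_ijk: (q_i x r_i) table *including* rows for unobserved parent configurations.
    q_i, r_i = n_ijk.shape
    observed = n_ijk.sum(axis=1) > 0
    q_tilde = observed.sum()
    a_ijk = alpha / (r_i * q_tilde)       # alpha~_ijk for the observed configurations
    n_obs = n_ijk[observed]
    n_ij = n_obs.sum(axis=1)
    return np.sum(gammaln(r_i * a_ijk) - gammaln(r_i * a_ijk + n_ij)
                  + np.sum(gammaln(a_ijk + n_obs) - gammaln(a_ijk), axis=1))

# Exhibit B: q~_i = 4 for both parent sets, so both scores come out ~0.032 and
# the extra arc Y -> X is no longer rewarded.
nB_small = np.array([[3, 0], [0, 3], [0, 3], [3, 0]])              # X | {Z, W}
nB_large = np.vstack([nB_small, np.zeros((4, 2), dtype=int)])      # X | {Z, W, Y}
print(np.exp(log_bds_node(nB_small)), np.exp(log_bds_node(nB_large)))
```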
BDeu and BDs Compared

[Figure: cells corresponding to (X_i, Π_{X_i}) combinations that are not observed in the data are shown in red, observed combinations in green.]
Exhibits A and B, Once More

BDs does not suffer from the bias arising from \tilde{q}_i < q_i, and it assigns the same score to G− and G+ in both examples:

  Exhibit A:  BDs(X | Π_X) = BDs(X | Π_X ∪ Y) = 3.9 × 10^{-7},
  Exhibit B:  BDs(X | Π_X) = BDs(X | Π_X ∪ Y) = 0.032.

It also avoids giving wildly different Bayes factors depending on the value of α.

[Figure: two panels plotting the Bayes factor against log10(α) for BDeu and BDs.]
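One quick way to see the flattening of the Bayes factor (my own sketch — the slide's figure presumably comes from a similar computation, but the exact setup below, using the Exhibit B counts and the identity BDs(X, Π; α) = BDeu(X, Π; α q_i/q̃_i), is mine):

```python
# Bayes factor BF = score(G+) / score(G-) as a function of alpha (my sketch,
# using the Exhibit B counts). BDs's factor is exactly 1 for every alpha because
# rescaling alpha by q_i / q~_i undoes the dilution over unobserved configurations.
import numpy as np
from scipy.special import gammaln

def log_bdeu(n_ijk, alpha, q_i):
    r_i = n_ijk.shape[1]
    a_ijk = alpha / (r_i * q_i)
    n_ij = n_ijk.sum(axis=1)
    return np.sum(gammaln(r_i * a_ijk) - gammaln(r_i * a_ijk + n_ij)
                  + np.sum(gammaln(a_ijk + n_ijk) - gammaln(a_ijk), axis=1))

nB = np.array([[3, 0], [0, 3], [0, 3], [3, 0]])  # observed counts, same for G- and G+

for log10_alpha in (-4, -2, 0, 2, 4):
    alpha = 10.0 ** log10_alpha
    bf_bdeu = np.exp(log_bdeu(nB, alpha, q_i=8) - log_bdeu(nB, alpha, q_i=4))
    # BDs(G+) = BDeu(G+; alpha * 8 / 4); BDs(G-) = BDeu(G-; alpha) since q~ = q there.
    bf_bds = np.exp(log_bdeu(nB, alpha * 8 / 4, q_i=8) - log_bdeu(nB, alpha, q_i=4))
    print(f"log10(alpha) = {log10_alpha:+d}  BDeu BF = {bf_bdeu:.3f}  BDs BF = {bf_bds:.3f}")
```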
This Left Me with a Few Questions...

The obvious one being:

1. The behaviour of BDeu is certainly undesirable, but is it wrong?

Followed by:

2. Posterior entropy and BDeu rank G− and G+ in the same order for Exhibit A, but they do not for Exhibit B. Why is that?

And the reason why I found that surprising is that:

3. Maximum (relative) entropy [7, 9, 1] represents a very general approach that includes Bayesian posterior estimation as a particular case [3]; it can also be seen as a particular case of MDL [2]. Hence, unless something is wrong with BDeu, I would expect the two to agree. Especially because we can use MDL (using BIC), MAP (using BDeu/BDs), ...
Bayesian Statistics and Information Theory (I)

The derivation of the Bayesian posterior as a particular case of maximum (relative) entropy is made clear in Giffin and Caticha [3]. The selected joint posterior P(X, Θ) is that which maximises the relative entropy

\[
S(\mathrm{P}, \mathrm{P_{old}}) = - \int \mathrm{P}(X, \Theta) \log \frac{\mathrm{P}(X, \Theta)}{\mathrm{P_{old}}(X, \Theta)} \, dX \, d\Theta .
\]

The family of posteriors that reflects the fact that X is now known to take the value x′ is such that

\[
\mathrm{P}(X) = \int \mathrm{P}(X, \Theta) \, d\Theta = \delta(X - x') ,
\]

which amounts to a (possibly infinite) number of constraints on P(X, Θ): for each possible value of X there is one constraint.
Bayesian Statistics and Information Theory (II)

Maximising S(P, P_old) subject to those constraints using Lagrange multipliers means solving

\[
\max_{\mathrm{P}} \; \Big\{ S(\mathrm{P}, \mathrm{P_{old}})
  + \underbrace{\lambda_0 \Big( \int \mathrm{P} \, dX \, d\Theta - 1 \Big)}_{\text{normalising constraint}}
  + \underbrace{\int \lambda(x) \Big( \int \mathrm{P}(X, \Theta) \, d\Theta - \delta(X - x') \Big) dX}_{\text{constraint for each value of } X} \Big\}
\]

and yields the familiar Bayesian update rule:

\[
\mathrm{P_{new}}(X, \Theta) = \frac{\mathrm{P_{old}}(X, \Theta) \, \delta(X - x')}{\mathrm{P_{old}}(X)}
  = \mathrm{P_{old}}(\Theta \mid X) \, \delta(X - x') .
\]
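For completeness, a sketch of the intermediate step (my reconstruction of the argument in [3], not taken from the slides): setting the functional derivative of the Lagrangian to zero gives an exponential tilt of P_old, and the constraints then fix the multipliers.

```latex
% Sketch of the variational step (my reconstruction, following Giffin and Caticha [3]).
\frac{\delta}{\delta \mathrm{P}(X, \Theta)} \left[ S(\mathrm{P}, \mathrm{P_{old}})
   + \lambda_0 \left( \int \mathrm{P} \, dX \, d\Theta - 1 \right)
   + \int \lambda(x) \left( \int \mathrm{P}(X, \Theta) \, d\Theta - \delta(X - x') \right) dX \right] = 0
\;\Longrightarrow\;
\mathrm{P}(X, \Theta) = \mathrm{P_{old}}(X, \Theta) \, e^{\lambda_0 - 1 + \lambda(X)} .
```

Integrating the right-hand side over Θ and imposing ∫ P(X, Θ) dΘ = δ(X − x′) fixes e^{λ_0 − 1 + λ(X)} = δ(X − x′)/P_old(X), which is exactly the update rule above.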