  1. Beyond Uniform Priors in Bayesian Network Structure Learning (for Discrete Bayesian Networks) Marco Scutari scutari@stats.ox.ac.uk Department of Statistics University of Oxford April 5, 2017

  2. Bayesian Network Structure Learning

Learning a BN $\mathcal{B} = (\mathcal{G}, \Theta)$ from a data set $\mathcal{D}$ is performed in two steps:

$$\mathrm{P}(\mathcal{B} \mid \mathcal{D}) = \mathrm{P}(\mathcal{G}, \Theta \mid \mathcal{D}) = \underbrace{\mathrm{P}(\mathcal{G} \mid \mathcal{D})}_{\text{structure learning}} \cdot \underbrace{\mathrm{P}(\Theta \mid \mathcal{G}, \mathcal{D})}_{\text{parameter learning}}.$$

In a Bayesian setting structure learning consists in finding the DAG with the best $\mathrm{P}(\mathcal{G} \mid \mathcal{D})$ (BIC [5] is a common alternative) with some heuristic search algorithm. We can decompose $\mathrm{P}(\mathcal{G} \mid \mathcal{D})$ into

$$\mathrm{P}(\mathcal{G} \mid \mathcal{D}) \propto \mathrm{P}(\mathcal{G}) \, \mathrm{P}(\mathcal{D} \mid \mathcal{G}) = \mathrm{P}(\mathcal{G}) \int \mathrm{P}(\mathcal{D} \mid \mathcal{G}, \Theta) \, \mathrm{P}(\Theta \mid \mathcal{G}) \, d\Theta$$

where $\mathrm{P}(\mathcal{G})$ is the prior distribution over the space of the DAGs and $\mathrm{P}(\mathcal{D} \mid \mathcal{G})$ is the marginal likelihood of the data given $\mathcal{G}$ averaged over all possible parameter sets $\Theta$; and then

$$\mathrm{P}(\mathcal{D} \mid \mathcal{G}) = \prod_{i=1}^{N} \left[ \int \mathrm{P}(X_i \mid \Pi_{X_i}, \Theta_{X_i}) \, \mathrm{P}(\Theta_{X_i} \mid \Pi_{X_i}) \, d\Theta_{X_i} \right]$$

where $\Pi_{X_i}$ are the parents of $X_i$ in $\mathcal{G}$.

  3. The Bayesian Dirichlet Marginal Likelihood

If $\mathcal{D}$ contains no missing values and assuming:

• a Dirichlet conjugate prior ($X_i \mid \Pi_{X_i} \sim \mathrm{Multinomial}(\Theta_{X_i} \mid \Pi_{X_i})$ and $\Theta_{X_i} \mid \Pi_{X_i} \sim \mathrm{Dirichlet}(\alpha_{ijk})$, with $\sum_{jk} \alpha_{ijk} = \alpha_i$ the imaginary sample size);
• positivity (all conditional probabilities $\pi_{ijk} > 0$);
• parameter independence ($\pi_{ijk}$ for different parent configurations are independent) and modularity ($\pi_{ijk}$ in different nodes are independent);

Heckerman et al. [2] derived a closed form expression for $\mathrm{P}(\mathcal{D} \mid \mathcal{G})$:

$$\mathrm{BD}(\mathcal{G}, \mathcal{D}; \alpha) = \prod_{i=1}^{N} \mathrm{BD}(X_i, \Pi_{X_i}; \alpha_i) = \prod_{i=1}^{N} \prod_{j=1}^{q_i} \left[ \frac{\Gamma(\alpha_{ij})}{\Gamma(\alpha_{ij} + n_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(\alpha_{ijk} + n_{ijk})}{\Gamma(\alpha_{ijk})} \right]$$

where $r_i$ is the number of states of $X_i$; $q_i$ is the number of configurations of $\Pi_{X_i}$; $n_{ij} = \sum_k n_{ijk}$; and $\alpha_{ij} = \sum_k \alpha_{ijk}$.
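
To make the closed form concrete, here is a minimal numerical sketch (not from the slides; the helper `log_bd_node` and its argument names are illustrative, not from any particular library) that evaluates $\log \mathrm{BD}(X_i, \Pi_{X_i}; \alpha_i)$ for a single node from its table of counts:

```python
# A minimal sketch of the local BD marginal likelihood, assuming the counts
# n_ijk for one node are stored as an (r_i x q_i) array; names are illustrative.
import numpy as np
from scipy.special import gammaln  # log Gamma, to avoid overflow

def log_bd_node(n_ijk, alpha_ijk):
    """log BD(X_i, Pi_Xi; alpha_i) for one node.

    n_ijk     : (r_i, q_i) array, counts of X_i = k for parent configuration j
    alpha_ijk : (r_i, q_i) array of Dirichlet hyperparameters
    """
    n_ij = n_ijk.sum(axis=0)      # column totals n_ij
    a_ij = alpha_ijk.sum(axis=0)  # column totals alpha_ij
    # log [ Gamma(alpha_ij) / Gamma(alpha_ij + n_ij) ], one term per parent configuration
    term_j = gammaln(a_ij) - gammaln(a_ij + n_ij)
    # log [ Gamma(alpha_ijk + n_ijk) / Gamma(alpha_ijk) ], one term per cell
    term_jk = gammaln(alpha_ijk + n_ijk) - gammaln(alpha_ijk)
    return term_j.sum() + term_jk.sum()
```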

  4. Bayesian Dirichlet Equivalent Uniform (BDeu)

The most common implementation of BD assumes $\alpha_{ijk} = \alpha/(r_i q_i)$, $\alpha_i = \alpha$, and is known from [2] as the Bayesian Dirichlet equivalent uniform (BDeu) marginal likelihood (a small code sketch of this choice follows after the list below). The uniform prior over the parameters was justified by the lack of prior knowledge and widely assumed to be non-informative. However, there is ample evidence that this is a problematic choice:

• The prior is actually not uninformative.
• MAP DAGs selected using BDeu are highly sensitive to the choice of $\alpha$ and can have markedly different numbers of arcs even for reasonable $\alpha$ [8].
• In the limits $\alpha \to 0$ and $\alpha \to \infty$ it is possible to obtain both very simple and very complex DAGs, and model comparison may be inconsistent for small $\mathcal{D}$ and small $\alpha$ [8, 10].
• The sparseness of the MAP network is determined by a complex interaction between $\alpha$ and $\mathcal{D}$ [10, 13].
• There are formal proofs of all this in [12, 13].
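
As a small illustration of the uniform choice $\alpha_{ijk} = \alpha/(r_i q_i)$, BDeu can be written as a thin wrapper around the `log_bd_node` sketch above (again, hypothetical helper names, shown only to fix notation):

```python
# BDeu is BD with the uniform choice alpha_ijk = alpha / (r_i * q_i);
# a thin wrapper around the log_bd_node() sketch above.
def log_bdeu_node(n_ijk, alpha=1.0):
    r_i, q_i = n_ijk.shape
    alpha_ijk = np.full((r_i, q_i), alpha / (r_i * q_i))
    return log_bd_node(n_ijk, alpha_ijk)
```

Evaluating `log_bdeu_node` on the same counts for, say, $\alpha = 0.1, 1, 10$ already shows how strongly the local score, and hence the selected MAP DAG, depends on the imaginary sample size.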

  5. Exhibits A and B

[Figure: the two DAGs over the variables W, X, Y, Z compared in Exhibits A and B — G−, in which X has parents Z and W, and G+, which adds the arc Y → X.]

  6. Exhibit A

The sample frequencies ($n_{ijk}$) for $X \mid \Pi_X$ are:

      Z, W:   0,0   1,0   0,1   1,1
  X = 0:        2     1     1     2
  X = 1:        1     2     2     1

and those for $X \mid \Pi_X \cup Y$ are as follows.

   Z, W, Y:  0,0,0  1,0,0  0,1,0  1,1,0  0,0,1  1,0,1  0,1,1  1,1,1
  X = 0:         2      1      1      0      0      0      0      2
  X = 1:         1      2      2      0      0      0      0      1

Even though $X \mid \Pi_X$ and $X \mid \Pi_X \cup Y$ have the same entropy,

$$\mathrm{H}(X \mid \Pi_X) = \mathrm{H}(X \mid \Pi_X \cup Y) = 4 \left[ -\tfrac{1}{3} \log \tfrac{1}{3} - \tfrac{2}{3} \log \tfrac{2}{3} \right] = 2.546\ldots$$

  7. Exhibit A

... $\mathcal{G}^-$ has a higher entropy than $\mathcal{G}^+$ a posteriori ...

$$\mathrm{H}(X \mid \Pi_X; \alpha) = 4 \left[ -\frac{1 + 1/8}{3 + 1/4} \log \frac{1 + 1/8}{3 + 1/4} - \frac{2 + 1/8}{3 + 1/4} \log \frac{2 + 1/8}{3 + 1/4} \right] = 2.580$$

$$\mathrm{H}(X \mid \Pi_X \cup Y; \alpha) = 4 \left[ -\frac{1 + 1/16}{3 + 1/8} \log \frac{1 + 1/16}{3 + 1/8} - \frac{2 + 1/16}{3 + 1/8} \log \frac{2 + 1/16}{3 + 1/8} \right] = 2.564$$

... and BDeu with $\alpha = 1$ chooses accordingly, and things fortunately work out:

$$\mathrm{BDeu}(X \mid \Pi_X) = \left[ \frac{\Gamma(1/4)}{\Gamma(1/4 + 3)} \cdot \frac{\Gamma(1/8 + 2)}{\Gamma(1/8)} \cdot \frac{\Gamma(1/8 + 1)}{\Gamma(1/8)} \right]^4 = 3.906 \times 10^{-7},$$

$$\mathrm{BDeu}(X \mid \Pi_X \cup Y) = \left[ \frac{\Gamma(1/8)}{\Gamma(1/8 + 3)} \cdot \frac{\Gamma(1/16 + 2)}{\Gamma(1/16)} \cdot \frac{\Gamma(1/16 + 1)}{\Gamma(1/16)} \right]^4 = 3.721 \times 10^{-8}.$$
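
The BDeu values above can be reproduced with the sketches from the earlier slides (assumed helpers `log_bd_node` / `log_bdeu_node`; the column order of the counts follows the Exhibit A tables):

```python
# Exhibit A: counts for X given {Z, W} (G-) and given {Z, W, Y} (G+).
n_minus_a = np.array([[2, 1, 1, 2],
                      [1, 2, 2, 1]])
n_plus_a = np.array([[2, 1, 1, 0, 0, 0, 0, 2],
                     [1, 2, 2, 0, 0, 0, 0, 1]])

print(np.exp(log_bdeu_node(n_minus_a, alpha=1)))  # ~3.906e-07
print(np.exp(log_bdeu_node(n_plus_a, alpha=1)))   # ~3.72e-08
```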

  8. Exhibit B

The sample frequencies for $X \mid \Pi_X$ are:

      Z, W:   0,0   1,0   0,1   1,1
  X = 0:        3     0     0     3
  X = 1:        0     3     3     0

and those for $X \mid \Pi_X \cup Y$ are as follows.

   Z, W, Y:  0,0,0  1,0,0  0,1,0  1,1,0  0,0,1  1,0,1  0,1,1  1,1,1
  X = 0:         3      0      0      0      0      0      0      3
  X = 1:         0      3      3      0      0      0      0      0

The conditional entropy of $X$ is equal to zero for both $\mathcal{G}^+$ and $\mathcal{G}^-$, since the value of $X$ is completely determined by the configurations of its parents in both cases.

  9. Exhibit B

Again, the posterior entropies for $\mathcal{G}^+$ and $\mathcal{G}^-$ differ:

$$\mathrm{H}(X \mid \Pi_X; \alpha) = 4 \left[ -\frac{0 + 1/8}{3 + 1/4} \log \frac{0 + 1/8}{3 + 1/4} - \frac{3 + 1/8}{3 + 1/4} \log \frac{3 + 1/8}{3 + 1/4} \right] = 0.652,$$

$$\mathrm{H}(X \mid \Pi_X \cup Y; \alpha) = 4 \left[ -\frac{0 + 1/16}{3 + 1/8} \log \frac{0 + 1/16}{3 + 1/8} - \frac{3 + 1/16}{3 + 1/8} \log \frac{3 + 1/16}{3 + 1/8} \right] = 0.392.$$

However, BDeu with $\alpha = 1$ yields

$$\mathrm{BDeu}(X \mid \Pi_X) = \left[ \frac{\Gamma(1/4)}{\Gamma(1/4 + 3)} \cdot \frac{\Gamma(1/8 + 3)}{\Gamma(1/8)} \right]^4 = 0.032,$$

$$\mathrm{BDeu}(X \mid \Pi_X \cup Y) = \left[ \frac{\Gamma(1/8)}{\Gamma(1/8 + 3)} \cdot \frac{\Gamma(1/16 + 3)}{\Gamma(1/16)} \right]^4 = 0.044,$$

preferring $\mathcal{G}^+$ over $\mathcal{G}^-$ even though the additional arc $Y \to X$ does not provide any additional information on the distribution of $X$, and even though 4 out of 8 conditional distributions in $X \mid \Pi_X \cup Y$ are not observed at all in the data.
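
The same numerical check for Exhibit B (same assumed helpers as before) reproduces the preference for $\mathcal{G}^+$:

```python
# Exhibit B: BDeu now prefers G+ even though Y -> X adds no information about X.
n_minus_b = np.array([[3, 0, 0, 3],
                      [0, 3, 3, 0]])
n_plus_b = np.array([[3, 0, 0, 0, 0, 0, 0, 3],
                     [0, 3, 3, 0, 0, 0, 0, 0]])

print(np.exp(log_bdeu_node(n_minus_b, alpha=1)))  # ~0.0326
print(np.exp(log_bdeu_node(n_plus_b, alpha=1)))   # ~0.0441
```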

  10. Better Than BDeu: Bayesian Dirichlet Sparse (BDs)

If the positivity assumption is violated or the sample size $n$ is small, there may be configurations of some $\Pi_{X_i}$ that are not observed in $\mathcal{D}$. Writing $\alpha^* = \alpha/(r_i q_i)$,

$$\mathrm{BDeu}(X_i, \Pi_{X_i}; \alpha) = \prod_{j: n_{ij} = 0} \left[ \frac{\Gamma(r_i \alpha^*)}{\Gamma(r_i \alpha^*)} \prod_{k=1}^{r_i} \frac{\Gamma(\alpha^*)}{\Gamma(\alpha^*)} \right] \prod_{j: n_{ij} > 0} \left[ \frac{\Gamma(r_i \alpha^*)}{\Gamma(r_i \alpha^* + n_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(\alpha^* + n_{ijk})}{\Gamma(\alpha^*)} \right],$$

where the terms for the unobserved configurations (the first product) are identically equal to one, so the effective imaginary sample size decreases as the number of unobserved parents configurations increases. We can prevent that by replacing $\alpha_{ijk}$ with

$$\tilde{\alpha}_{ijk} = \begin{cases} \alpha/(r_i \tilde{q}_i) & \text{if } n_{ij} > 0 \\ 0 & \text{otherwise,} \end{cases} \qquad \tilde{q}_i = \{\text{number of } \Pi_{X_i} \text{ configurations such that } n_{ij} > 0\}$$

and we plug it in BD instead of $\alpha_{ijk} = \alpha/(r_i q_i)$ to obtain BDs. Then $\mathrm{BDs}(X_i, \Pi_{X_i}; \alpha) = \mathrm{BDeu}(X_i, \Pi_{X_i}; \alpha q_i / \tilde{q}_i)$.
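
A corresponding sketch of the local BDs score, under the same assumptions as the earlier snippets (illustrative names, not library code): only the observed parent configurations are kept, and $\alpha$ is spread over $r_i \tilde{q}_i$ cells instead of $r_i q_i$.

```python
# BDs: spread alpha only over the q_i_tilde parent configurations that are
# actually observed; unobserved configurations contribute a factor of 1 to BD
# and can simply be dropped.
def log_bds_node(n_ijk, alpha=1.0):
    r_i, q_i = n_ijk.shape
    n_ij = n_ijk.sum(axis=0)
    observed = n_ij > 0
    q_tilde = int(observed.sum())                    # q_i_tilde
    alpha_ijk = np.full((r_i, q_tilde), alpha / (r_i * q_tilde))
    return log_bd_node(n_ijk[:, observed], alpha_ijk)
```

By construction this matches the identity on the slide: `log_bds_node(n, alpha)` equals `log_bdeu_node(n, alpha * q_i / q_tilde)`, because the unobserved columns contribute nothing to the BDeu product.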

  11. BDeu and BDs Compared

[Figure comparing BDeu and BDs. Cells that correspond to $(X_i, \Pi_{X_i})$ combinations that are not observed in the data are in red, observed combinations are in green.]

  12. Exhibits A and B, Once More

BDs does not suffer from the bias arising from $\tilde{q}_i < q_i$ and it correctly assigns the same score to $\mathcal{G}^-$ and $\mathcal{G}^+$ in both examples,

$$\mathrm{BDs}(X \mid \Pi_X) = \mathrm{BDs}(X \mid \Pi_X \cup Y) = 3.906 \times 10^{-7} \quad \text{(Exhibit A)},$$
$$\mathrm{BDs}(X \mid \Pi_X) = \mathrm{BDs}(X \mid \Pi_X \cup Y) = 0.03262 \quad \text{(Exhibit B)},$$

following the maximum entropy principle.

[Figure: two panels plotting the Bayes factor against $\log_{10}(\alpha)$ for BDeu and BDs, one panel per exhibit.]
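
Using the hypothetical `log_bds_node` sketch from slide 10 and the counts defined earlier, the equalities above check out numerically:

```python
# BDs assigns the same score to G- and G+ in both exhibits.
for n_minus, n_plus in [(n_minus_a, n_plus_a), (n_minus_b, n_plus_b)]:
    print(np.exp(log_bds_node(n_minus, alpha=1)),
          np.exp(log_bds_node(n_plus, alpha=1)))
# Exhibit A: ~3.906e-07 and ~3.906e-07; Exhibit B: ~0.0326 and ~0.0326.
```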

  13. Entropy and BDeu

In a Bayesian setting, the conditional entropy $\mathrm{H}(\cdot)$ of $X \mid \Pi_X$ given a uniform Dirichlet prior with imaginary sample size $\alpha$ over the cell probabilities is

$$\mathrm{H}(X \mid \Pi_X; \alpha) = -\sum_{j: n_{ij} > 0} \sum_{k=1}^{r_i} p^{(\alpha_i^*)}_{ijk} \log p^{(\alpha_i^*)}_{ijk} \qquad \text{with} \qquad p^{(\alpha_i^*)}_{ijk} = \frac{\alpha_i^* + n_{ijk}}{r_i \alpha_i^* + n_{ij}},$$

and $\mathrm{H}(X \mid \Pi_X; \alpha) > \mathrm{H}(X \mid \Pi_X; \beta)$ if $\alpha > \beta$ and $X \mid \Pi_X$ is not a uniform distribution. Let $\alpha/(r_i q_i) \to 0$ and let $\alpha > \beta > 0$. Then

$$\mathrm{BDeu}(X \mid \Pi_X; \alpha) > \mathrm{BDeu}(X \mid \Pi_X; \beta) > 0 \qquad \text{if } d_{\mathrm{EP}}(X_i, \mathcal{G}) > 0,$$
$$\mathrm{BDeu}(X \mid \Pi_X; \alpha) = \left(\frac{1}{r_i}\right)^{\tilde{q}_i} \qquad \text{if } d_{\mathrm{EP}}(X_i, \mathcal{G}) = 0.$$
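
A short sketch of this posterior conditional entropy (same assumptions and naming conventions as the earlier snippets), which reproduces the Exhibit A values from slide 7:

```python
# Posterior conditional entropy H(X | Pi_X; alpha) with the BDeu-style prior
# alpha* = alpha / (r_i * q_i), summed over the observed parent configurations.
def posterior_entropy(n_ijk, alpha=1.0):
    r_i, q_i = n_ijk.shape
    a_star = alpha / (r_i * q_i)
    n_ij = n_ijk.sum(axis=0)
    observed = n_ij > 0
    p = (a_star + n_ijk[:, observed]) / (r_i * a_star + n_ij[observed])
    return float(-(p * np.log(p)).sum())

print(posterior_entropy(n_minus_a))  # ~2.58  (Exhibit A, G-)
print(posterior_entropy(n_plus_a))   # ~2.56  (Exhibit A, G+)
```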

  14. To Sum It Up in a Theorem

Let $\mathcal{G}^+$ and $\mathcal{G}^-$ be two DAGs differing by a single arc $X_j \to X_i$, and let $\alpha/(r_i q_i) \to 0$. Then the Bayes factor computed using BDs corresponds to the Bayes factor computed using BDeu weighted by the following implicit prior ratio:

$$\frac{\mathrm{P}(\mathcal{G}^+)}{\mathrm{P}(\mathcal{G}^-)} = \frac{(q_i / \tilde{q}_i)^{d_{\mathrm{EP}}(X_i, \mathcal{G}^+)}}{(q'_i / \tilde{q}'_i)^{d_{\mathrm{EP}}(X_i, \mathcal{G}^-)}}$$

and can be written as

$$\frac{\mathrm{BDs}(X_i, \Pi_{X_i} \cup X_j; \alpha)}{\mathrm{BDs}(X_i, \Pi_{X_i}; \alpha)} = \frac{(q_i / \tilde{q}_i)^{d_{\mathrm{EP}}(X_i, \mathcal{G}^+)} \, \alpha^{d_{\mathrm{EP}}(\mathcal{G}^+)}}{(q'_i / \tilde{q}'_i)^{d_{\mathrm{EP}}(X_i, \mathcal{G}^-)} \, \alpha^{d_{\mathrm{EP}}(\mathcal{G}^-)}} \to \begin{cases} 0 & \text{if } d_{\mathrm{EDF}} > -\log_\alpha\!\left(\mathrm{P}(\mathcal{G}^+) / \mathrm{P}(\mathcal{G}^-)\right) \\ +\infty & \text{if } d_{\mathrm{EDF}} < -\log_\alpha\!\left(\mathrm{P}(\mathcal{G}^+) / \mathrm{P}(\mathcal{G}^-)\right). \end{cases}$$
