Dirichlet Bayesian Network Scores and the Maximum Entropy Principle

Marco Scutari (scutari@stats.ox.ac.uk)
Department of Statistics, University of Oxford

September 21, 2017
Bayesian Network Structure Learning

Learning a BN B = (G, Θ) from a data set D is performed in two steps:

\[
\underbrace{\mathrm{P}(B \mid D)}_{\text{learning}} = \mathrm{P}(G, \Theta \mid D)
  = \underbrace{\mathrm{P}(G \mid D)}_{\text{structure learning}} \cdot \underbrace{\mathrm{P}(\Theta \mid G, D)}_{\text{parameter learning}} .
\]

In a Bayesian setting, structure learning consists in finding the DAG with the best P(G | D) (BIC [6] is a common alternative) with some heuristic search algorithm. We can decompose P(G | D) into

\[
\mathrm{P}(G \mid D) \propto \mathrm{P}(G) \, \mathrm{P}(D \mid G)
  = \mathrm{P}(G) \int \mathrm{P}(D \mid G, \Theta) \, \mathrm{P}(\Theta \mid G) \, d\Theta
\]

where P(G) is the prior distribution over the space of the DAGs and P(D | G) is the marginal likelihood of the data given G, averaged over all possible parameter sets Θ; and then

\[
\mathrm{P}(D \mid G) = \prod_{i=1}^{N} \left[ \int \mathrm{P}(X_i \mid \Pi_{X_i}, \Theta_{X_i}) \, \mathrm{P}(\Theta_{X_i} \mid \Pi_{X_i}) \, d\Theta_{X_i} \right]
\]

where Π_{X_i} are the parents of X_i in G.
The Bayesian Dirichlet Marginal Likelihood

If D contains no missing values and assuming:
• a conjugate Dirichlet prior (X_i | Π_{X_i} ~ Mult(Θ_{X_i | Π_{X_i}}) and Θ_{X_i | Π_{X_i}} ~ Dir(α_{ijk}), with Σ_{jk} α_{ijk} = α_i the imaginary sample size);
• positivity (all conditional probabilities π_{ijk} > 0);
• parameter independence (the π_{ijk} for different parent configurations are independent) and modularity (the π_{ijk} in different nodes are independent);

Heckerman et al. [4] derived a closed-form expression for P(D | G):

\[
\mathrm{BD}(G, D; \alpha) = \prod_{i=1}^{N} \mathrm{BD}(X_i, \Pi_{X_i}; \alpha_i)
  = \prod_{i=1}^{N} \prod_{j=1}^{q_i} \left[ \frac{\Gamma(\alpha_{ij})}{\Gamma(\alpha_{ij} + n_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(\alpha_{ijk} + n_{ijk})}{\Gamma(\alpha_{ijk})} \right]
\]

where r_i is the number of states of X_i; q_i is the number of configurations of Π_{X_i}; n_{ij} = Σ_k n_{ijk}; and α_{ij} = Σ_k α_{ijk}.
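As a concrete illustration, here is a minimal sketch (my own code, not from the talk) of the per-node BD term above, computed in log space with scipy's log-Gamma; the helper names log_bd_node and log_bdeu_node, and the table layout, are assumptions on my part.

```python
# Minimal sketch of the per-node BD term above (assumed implementation, not the
# author's code). n_ijk is the (q_i x r_i) contingency table of X_i against the
# configurations of its parents; alpha_ijk is the matching table of Dirichlet
# hyperparameters. Everything is computed in log space via log-Gamma.
import numpy as np
from scipy.special import gammaln

def log_bd_node(n_ijk, alpha_ijk):
    n_ij = n_ijk.sum(axis=1)           # n_ij  = sum_k n_ijk
    alpha_ij = alpha_ijk.sum(axis=1)   # a_ij  = sum_k a_ijk
    return np.sum(gammaln(alpha_ij) - gammaln(alpha_ij + n_ij)
                  + np.sum(gammaln(alpha_ijk + n_ijk) - gammaln(alpha_ijk), axis=1))

def log_bdeu_node(n_ijk, alpha=1.0):
    # BDeu special case (next slide): alpha_ijk = alpha / (r_i * q_i) in every cell.
    q_i, r_i = n_ijk.shape
    return log_bd_node(n_ijk, np.full((q_i, r_i), alpha / (r_i * q_i)))
```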
Bayesian Dirichlet Equivalent Uniform (BDeu)

The most common implementation of BD assumes α_{ijk} = α/(r_i q_i) with α_i = α for all nodes, and is known from [4] as the Bayesian Dirichlet equivalent uniform (BDeu) marginal likelihood. However, there is evidence that assuming a flat prior over the parameters can be problematic:
• The prior is actually not uninformative [5].
• MAP DAGs selected using BDeu are highly sensitive to the choice of α and can have markedly different numbers of arcs even for reasonable α [8].
• In the limits α → 0 and α → ∞ it is possible to obtain both very simple and very complex DAGs, and model comparison may be inconsistent for small D and small α [8, 10].
• The sparseness of the MAP network is determined by a complex interaction between α and D [10, 12].
• There are formal proofs of all this in [11, 12].
Exhibits A and B

[Figure: two DAGs over the variables W, X, Y, Z — the structures G− and G+ compared in the two exhibits, which differ by the arc Y → X.]
Exhibit A

The sample frequencies (n_ijk) for X | Π_X are:

        Z, W:   0,0   1,0   0,1   1,1
  X = 0          2     1     1     2
  X = 1          1     2     2     1

and those for X | Π_X ∪ Y are as follows.

     Z, W, Y:  0,0,0  1,0,0  0,1,0  1,1,0  0,0,1  1,0,1  0,1,1  1,1,1
  X = 0          2      1      1      0      0      0      0      2
  X = 1          1      2      2      0      0      0      0      1

Even though X | Π_X and X | Π_X ∪ Y have the same empirical entropy,

\[
\mathrm{H}(X \mid \Pi_X) = \mathrm{H}(X \mid \Pi_X \cup Y)
  = 4 \left( -\tfrac{1}{3} \log \tfrac{1}{3} - \tfrac{2}{3} \log \tfrac{2}{3} \right) = 2.546\ldots
\]
Exhibit A

... G− has a higher entropy than G+ a posteriori with α = 1 ...

\[
\mathrm{H}(X \mid \Pi_X; \alpha)
  = 4 \left( -\frac{1 + 1/8}{3 + 1/4} \log \frac{1 + 1/8}{3 + 1/4}
             -\frac{2 + 1/8}{3 + 1/4} \log \frac{2 + 1/8}{3 + 1/4} \right) = 2.580,
\]
\[
\mathrm{H}(X \mid \Pi_X \cup Y; \alpha)
  = 4 \left( -\frac{1 + 1/16}{3 + 1/8} \log \frac{1 + 1/16}{3 + 1/8}
             -\frac{2 + 1/16}{3 + 1/8} \log \frac{2 + 1/16}{3 + 1/8} \right) = 2.564\ldots
\]

... and BDeu with α = 1 chooses accordingly, so things fortunately work out:

\[
\mathrm{BDeu}(X \mid \Pi_X)
  = \left[ \frac{\Gamma(1/4)}{\Gamma(1/4 + 3)} \cdot \frac{\Gamma(1/8 + 2)}{\Gamma(1/8)} \cdot \frac{\Gamma(1/8 + 1)}{\Gamma(1/8)} \right]^4 = 3.906 \times 10^{-7},
\]
\[
\mathrm{BDeu}(X \mid \Pi_X \cup Y)
  = \left[ \frac{\Gamma(1/8)}{\Gamma(1/8 + 3)} \cdot \frac{\Gamma(1/16 + 2)}{\Gamma(1/16)} \cdot \frac{\Gamma(1/16 + 1)}{\Gamma(1/16)} \right]^4 = 3.721 \times 10^{-8}.
\]
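The figures above can be checked numerically; this is my own sketch (function names are mine, and the posterior entropy is summed over the observed parent configurations only, which is what reproduces the slide's values).

```python
# Numerical check of the Exhibit A figures (my sketch, not the talk's code).
# The posterior entropy below sums -p log p over the *observed* parent
# configurations only, which is what reproduces the 2.580 / 2.564 values.
import numpy as np
from scipy.special import gammaln

def posterior_entropy(n_ijk, alpha, q_i):
    r_i = n_ijk.shape[1]
    a_ijk = alpha / (r_i * q_i)
    p = (n_ijk + a_ijk) / (n_ijk.sum(axis=1, keepdims=True) + r_i * a_ijk)
    return -np.sum(p * np.log(p))

def log_bdeu(n_ijk, alpha, q_i):
    r_i = n_ijk.shape[1]
    a_ijk = alpha / (r_i * q_i)
    n_ij = n_ijk.sum(axis=1)
    return np.sum(gammaln(r_i * a_ijk) - gammaln(r_i * a_ijk + n_ij)
                  + np.sum(gammaln(a_ijk + n_ijk) - gammaln(a_ijk), axis=1))

# Exhibit A: the four observed parent configurations carry the same counts under
# both parent sets; under Pi_X u {Y} they are 4 out of q_i = 8 configurations.
nA = np.array([[2, 1], [1, 2], [1, 2], [2, 1]])

print(posterior_entropy(nA, alpha=1, q_i=4))   # ~2.580  (G-)
print(posterior_entropy(nA, alpha=1, q_i=8))   # ~2.564  (G+)
print(np.exp(log_bdeu(nA, alpha=1, q_i=4)))    # ~3.906e-07
print(np.exp(log_bdeu(nA, alpha=1, q_i=8)))    # ~3.721e-08
```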
Exhibit B

The sample frequencies for X | Π_X are:

        Z, W:   0,0   1,0   0,1   1,1
  X = 0          3     0     0     3
  X = 1          0     3     3     0

and those for X | Π_X ∪ Y are as follows.

     Z, W, Y:  0,0,0  1,0,0  0,1,0  1,1,0  0,0,1  1,0,1  0,1,1  1,1,1
  X = 0          3      0      0      0      0      0      0      3
  X = 1          0      3      3      0      0      0      0      0

The empirical entropy of X is equal to zero for both G+ and G−, since the value of X is completely determined by the configuration of its parents in both cases.
Exhibit B

Again, the posterior entropies for G+ and G− differ:

\[
\mathrm{H}(X \mid \Pi_X; \alpha)
  = 4 \left( -\frac{0 + 1/8}{3 + 1/4} \log \frac{0 + 1/8}{3 + 1/4}
             -\frac{3 + 1/8}{3 + 1/4} \log \frac{3 + 1/8}{3 + 1/4} \right) = 0.652,
\]
\[
\mathrm{H}(X \mid \Pi_X \cup Y; \alpha)
  = 4 \left( -\frac{0 + 1/16}{3 + 1/8} \log \frac{0 + 1/16}{3 + 1/8}
             -\frac{3 + 1/16}{3 + 1/8} \log \frac{3 + 1/16}{3 + 1/8} \right) = 0.392.
\]

However, BDeu with α = 1 yields

\[
\mathrm{BDeu}(X \mid \Pi_X)
  = \left[ \frac{\Gamma(1/4)}{\Gamma(1/4 + 3)} \cdot \frac{\Gamma(1/8 + 3)}{\Gamma(1/8)} \cdot \frac{\Gamma(1/8)}{\Gamma(1/8)} \right]^4 = 0.032,
\]
\[
\mathrm{BDeu}(X \mid \Pi_X \cup Y)
  = \left[ \frac{\Gamma(1/8)}{\Gamma(1/8 + 3)} \cdot \frac{\Gamma(1/16 + 3)}{\Gamma(1/16)} \cdot \frac{\Gamma(1/16)}{\Gamma(1/16)} \right]^4 = 0.044,
\]

preferring G+ over G− even though the additional arc Y → X does not provide any additional information on the distribution of X, and even though 4 out of 8 conditional distributions in X | Π_X ∪ Y are not observed at all in the data.
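The same kind of check (again my own sketch) reproduces the Exhibit B scores and the reversal: posterior entropy prefers G−, BDeu prefers G+.

```python
# Exhibit B check (my sketch): BDeu with alpha = 1 scores G+ above G- even
# though the observed counts are identical under both parent sets.
import numpy as np
from scipy.special import gammaln

def log_bdeu(n_ijk, alpha, q_i):
    r_i = n_ijk.shape[1]
    a_ijk = alpha / (r_i * q_i)
    n_ij = n_ijk.sum(axis=1)
    return np.sum(gammaln(r_i * a_ijk) - gammaln(r_i * a_ijk + n_ij)
                  + np.sum(gammaln(a_ijk + n_ijk) - gammaln(a_ijk), axis=1))

nB = np.array([[3, 0], [0, 3], [0, 3], [3, 0]])  # the four observed configurations

print(np.exp(log_bdeu(nB, alpha=1, q_i=4)))   # ~0.032  (G-)
print(np.exp(log_bdeu(nB, alpha=1, q_i=8)))   # ~0.044  (G+)
```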
Better Than BDeu: Bayesian Dirichlet Sparse (BDs)

If the positivity assumption is violated or the sample size n is small, there may be configurations of some Π_{X_i} that are not observed in D. Then

\[
\mathrm{BDeu}(X_i, \Pi_{X_i}; \alpha)
  = \underbrace{\prod_{j: n_{ij} = 0} \left[ \frac{\Gamma(\alpha_{ij})}{\Gamma(\alpha_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(\alpha_{ijk})}{\Gamma(\alpha_{ijk})} \right]}_{=\,1}
    \; \prod_{j: n_{ij} > 0} \left[ \frac{\Gamma(\alpha_{ij})}{\Gamma(\alpha_{ij} + n_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(\alpha_{ijk} + n_{ijk})}{\Gamma(\alpha_{ijk})} \right],
\]

so the effective imaginary sample size decreases as the number of unobserved parent configurations increases. We can prevent that by replacing α_{ijk} with

\[
\tilde{\alpha}_{ijk} = \begin{cases} \alpha / (r_i \tilde{q}_i) & \text{if } n_{ij} > 0 \\ 0 & \text{otherwise,} \end{cases}
\qquad
\tilde{q}_i = \{\text{number of } \Pi_{X_i} \text{ configurations such that } n_{ij} > 0\},
\]

and plugging it into BD instead of α_{ijk} = α/(r_i q_i) to obtain BDs. Then

\[
\mathrm{BDs}(X_i, \Pi_{X_i}; \alpha) = \mathrm{BDeu}(X_i, \Pi_{X_i}; \alpha q_i / \tilde{q}_i).
\]
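A minimal sketch of BDs along these lines (an assumed implementation of my own, not the reference one in the author's bnlearn package; the function name and table layout are mine):

```python
# Sketch of the BDs score for one node (assumed implementation). The imaginary
# sample size alpha is spread over the q~_i parent configurations that are
# actually observed, so unobserved configurations get alpha~_ijk = 0 and drop
# out of the product.
import numpy as np
from scipy.special import gammaln

def log_bds_node(n_ijk, alpha=1.0):
    # n_ijk: (q_i x r_i) table *including* rows for unobserved parent configurations.
    q_i, r_i = n_ijk.shape
    observed = n_ijk.sum(axis=1) > 0
    q_tilde = observed.sum()
    a_ijk = alpha / (r_i * q_tilde)       # alpha~_ijk for the observed configurations
    n_obs = n_ijk[observed]
    n_ij = n_obs.sum(axis=1)
    return np.sum(gammaln(r_i * a_ijk) - gammaln(r_i * a_ijk + n_ij)
                  + np.sum(gammaln(a_ijk + n_obs) - gammaln(a_ijk), axis=1))

# Exhibit B: q~_i = 4 for both parent sets, so both scores come out ~0.032 and
# the extra arc Y -> X is no longer rewarded.
nB_small = np.array([[3, 0], [0, 3], [0, 3], [3, 0]])              # X | {Z, W}
nB_large = np.vstack([nB_small, np.zeros((4, 2), dtype=int)])      # X | {Z, W, Y}
print(np.exp(log_bds_node(nB_small)), np.exp(log_bds_node(nB_large)))
```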
BDeu and BDs Compared

[Figure: cells corresponding to (X_i, Π_{X_i}) combinations that are not observed in the data are shown in red, observed combinations in green.]
Exhibits A and B, Once More

BDs does not suffer from the bias arising from \tilde{q}_i < q_i, and it assigns the same score to G− and G+ in both examples:

  Exhibit A:  BDs(X | Π_X) = BDs(X | Π_X ∪ Y) = 3.9 × 10^{-7},
  Exhibit B:  BDs(X | Π_X) = BDs(X | Π_X ∪ Y) = 0.032.

It also avoids giving wildly different Bayes factors depending on the value of α.

[Figure: two panels plotting the Bayes factor against log10(α) for BDeu and BDs.]
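One quick way to see the flattening of the Bayes factor (my own sketch — the slide's figure presumably comes from a similar computation, but the exact setup below, using the Exhibit B counts and the identity BDs(X, Π; α) = BDeu(X, Π; α q_i/q̃_i), is mine):

```python
# Bayes factor BF = score(G+) / score(G-) as a function of alpha (my sketch,
# using the Exhibit B counts). BDs's factor is exactly 1 for every alpha because
# rescaling alpha by q_i / q~_i undoes the dilution over unobserved configurations.
import numpy as np
from scipy.special import gammaln

def log_bdeu(n_ijk, alpha, q_i):
    r_i = n_ijk.shape[1]
    a_ijk = alpha / (r_i * q_i)
    n_ij = n_ijk.sum(axis=1)
    return np.sum(gammaln(r_i * a_ijk) - gammaln(r_i * a_ijk + n_ij)
                  + np.sum(gammaln(a_ijk + n_ijk) - gammaln(a_ijk), axis=1))

nB = np.array([[3, 0], [0, 3], [0, 3], [3, 0]])  # observed counts, same for G- and G+

for log10_alpha in (-4, -2, 0, 2, 4):
    alpha = 10.0 ** log10_alpha
    bf_bdeu = np.exp(log_bdeu(nB, alpha, q_i=8) - log_bdeu(nB, alpha, q_i=4))
    # BDs(G+) = BDeu(G+; alpha * 8 / 4); BDs(G-) = BDeu(G-; alpha) since q~ = q there.
    bf_bds = np.exp(log_bdeu(nB, alpha * 8 / 4, q_i=8) - log_bdeu(nB, alpha, q_i=4))
    print(f"log10(alpha) = {log10_alpha:+d}  BDeu BF = {bf_bdeu:.3f}  BDs BF = {bf_bds:.3f}")
```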
This Left Me with a Few Questions...

The obvious one being:

1. The behaviour of BDeu is certainly undesirable, but is it wrong?

Followed by:

2. Posterior entropy and BDeu rank G− and G+ in the same order for Exhibit A, but they do not for Exhibit B. Why is that?

And the reason why I found that surprising is that:

3. Maximum (relative) entropy [7, 9, 1] represents a very general approach that includes Bayesian posterior estimation as a particular case [3]; it can also be seen as a particular case of MDL [2]. Hence, unless something is wrong with BDeu, I would expect the two to agree. Especially because we can use MDL (using BIC), MAP (using BDeu/BDs), ...
Bayesian Statistics and Information Theory (I)

The derivation of the Bayesian posterior as a particular case of maximum (relative) entropy is made clear in Giffin and Caticha [3]. The selected joint posterior P(X, Θ) is that which maximises the relative entropy

\[
S(\mathrm{P}, \mathrm{P_{old}}) = - \int \mathrm{P}(X, \Theta) \log \frac{\mathrm{P}(X, \Theta)}{\mathrm{P_{old}}(X, \Theta)} \, dX \, d\Theta .
\]

The family of posteriors that reflects the fact that X is now known to take the value x′ is such that

\[
\mathrm{P}(X) = \int \mathrm{P}(X, \Theta) \, d\Theta = \delta(X - x') ,
\]

which amounts to a (possibly infinite) number of constraints on P(X, Θ): for each possible value of X there is one constraint.
Bayesian Statistics and Information Theory (II)

Maximising S(P, P_old) subject to those constraints using Lagrange multipliers means solving

\[
\max_{\mathrm{P}} \; \Big\{ S(\mathrm{P}, \mathrm{P_{old}})
  + \underbrace{\lambda_0 \Big( \int \mathrm{P} \, dX \, d\Theta - 1 \Big)}_{\text{normalising constraint}}
  + \underbrace{\int \lambda(x) \Big( \int \mathrm{P}(X, \Theta) \, d\Theta - \delta(X - x') \Big) dX}_{\text{constraint for each value of } X} \Big\}
\]

and yields the familiar Bayesian update rule:

\[
\mathrm{P_{new}}(X, \Theta) = \frac{\mathrm{P_{old}}(X, \Theta) \, \delta(X - x')}{\mathrm{P_{old}}(X)}
  = \mathrm{P_{old}}(\Theta \mid X) \, \delta(X - x') .
\]
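For completeness, a sketch of the intermediate step (my reconstruction of the argument in [3], not taken from the slides): setting the functional derivative of the Lagrangian to zero gives an exponential tilt of P_old, and the constraints then fix the multipliers.

```latex
% Sketch of the variational step (my reconstruction, following Giffin and Caticha [3]).
\frac{\delta}{\delta \mathrm{P}(X, \Theta)} \left[ S(\mathrm{P}, \mathrm{P_{old}})
   + \lambda_0 \left( \int \mathrm{P} \, dX \, d\Theta - 1 \right)
   + \int \lambda(x) \left( \int \mathrm{P}(X, \Theta) \, d\Theta - \delta(X - x') \right) dX \right] = 0
\;\Longrightarrow\;
\mathrm{P}(X, \Theta) = \mathrm{P_{old}}(X, \Theta) \, e^{\lambda_0 - 1 + \lambda(X)} .
```

Integrating the right-hand side over Θ and imposing ∫ P(X, Θ) dΘ = δ(X − x′) fixes e^{λ_0 − 1 + λ(X)} = δ(X − x′)/P_old(X), which is exactly the update rule above.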