Comparing Bayesian Networks and Structure Learning Algorithms (and other graphical models)

Marco Scutari <marco.scutari@stat.unipd.it>
Department of Statistical Sciences, University of Padova

October 20, 2009
Introduction
Graphical models

Graphical models are defined by the combination of:
• a network structure, either an undirected graph (Markov networks [2], gene association networks, correlation networks, etc.) or a directed graph (Bayesian networks [7]). Each node corresponds to a random variable.
• a global probability distribution, which can be factorized into a set of local probability distributions (one for each node) according to the topology of the graph.

This allows a compact representation of the joint distribution of large numbers of random variables and simplifies inference on their parameters.
A simple Bayesian network: Watson's lawn

[Figure: the DAG with arcs RAIN → SPRINKLER, RAIN → GRASS WET and SPRINKLER → GRASS WET, annotated with the conditional probability tables below.]

RAIN:
                                     TRUE    FALSE
                                     0.2     0.8

SPRINKLER | RAIN:
                                     TRUE    FALSE
    RAIN = FALSE                     0.4     0.6
    RAIN = TRUE                      0.01    0.99

GRASS WET | SPRINKLER, RAIN:
                                     TRUE    FALSE
    SPRINKLER = FALSE, RAIN = FALSE  0.0     1.0
    SPRINKLER = FALSE, RAIN = TRUE   0.8     0.2
    SPRINKLER = TRUE,  RAIN = FALSE  0.9     0.1
    SPRINKLER = TRUE,  RAIN = TRUE   0.99    0.01
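To make the factorization concrete, here is a minimal Python sketch (not from the original slides) that encodes these three tables and computes the marginal probability of wet grass by summing the factorized joint P(R) P(S | R) P(G | S, R) over all configurations:

    from itertools import product

    # Conditional probability tables, indexed by parent configuration.
    p_rain = {True: 0.2, False: 0.8}
    p_sprinkler = {  # P(SPRINKLER | RAIN)
        False: {True: 0.4, False: 0.6},
        True:  {True: 0.01, False: 0.99},
    }
    p_grass = {  # P(GRASS WET | SPRINKLER, RAIN)
        (False, False): {True: 0.0,  False: 1.0},
        (False, True):  {True: 0.8,  False: 0.2},
        (True,  False): {True: 0.9,  False: 0.1},
        (True,  True):  {True: 0.99, False: 0.01},
    }

    def joint(r, s, g):
        # The factorization encoded by the DAG.
        return p_rain[r] * p_sprinkler[r][s] * p_grass[(s, r)][g]

    # Marginal probability that the grass is wet.
    p_wet = sum(joint(r, s, True) for r, s in product([True, False], repeat=2))
    print(f"P(GRASS WET = TRUE) = {p_wet:.4f}")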
The problem

Almost all the literature on graphical models focuses on the study of the parameters of the local probability distributions (such as conditional probabilities or partial linear correlations):
• this makes comparing models learned with different algorithms difficult, because they maximize different scores, use different estimators for the parameters, work under different sets of hypotheses, etc.
• unless the true global probability distribution is known, it is difficult to assess the quality of the estimated models.
• the few measures of structural difference are completely descriptive in nature (e.g. the Hamming distance [6] or the SHD [10]) and have no easy interpretation.
Modeling undirected network structures
Edges and univariate Bernoulli random variables

Each edge $e_i$ in an undirected graph $U = (V, E)$ has only two possible states,

$$e_i = \begin{cases} 1 & \text{if } e_i \in E \\ 0 & \text{otherwise.} \end{cases}$$

Therefore it can be modeled as a Bernoulli random variable $E_i$:

$$e_i \sim E_i = \begin{cases} 1, \; e_i \in E & \text{with probability } p_i \\ 0, \; e_i \notin E & \text{with probability } 1 - p_i \end{cases}$$

where $p_i$ is the probability that the edge $e_i$ belongs to the graph. Let's denote it as $e_i \sim \mathrm{Ber}(p_i)$.
Edge sets as multivariate Bernoulli

The natural extension of this approach is to model any set $W$ of edges (such as $E$ or $\{V \times V\}$) as a multivariate Bernoulli random variable $W \sim \mathrm{Ber}_k(\mathbf{p})$. It is uniquely identified by the parameter set

$$\mathbf{p} = \{ p_w : w \subseteq W, w \neq \emptyset \},$$

which represents the dependence structure [8] among the marginal distributions $W_i \sim \mathrm{Ber}(p_i)$, $i = 1, \ldots, k$ of the edges.
Estimation of the parameters of W

The parameter set $\mathbf{p}$ of $W$ can be estimated via bootstrap [3], as in Friedman et al. [4] or Imoto et al. [5]:

1. For $b = 1, 2, \ldots, m$:
   1.1 re-sample a new data set $D^*_b$ from the original data $D$ using either parametric or nonparametric bootstrap;
   1.2 learn a graphical model $U_b = (V, E_b)$ from $D^*_b$.
2. Estimate the probability of each subset $w$ of $W$ as

$$\hat{p}_w = \frac{1}{m} \sum_{b=1}^{m} I_{\{w \subseteq E_b\}}(U_b).$$
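A minimal Python sketch of this procedure (hypothetical, not from the slides), where learn_structure stands in for any structure learning algorithm that returns the learned edge set as a set of frozensets of column indices:

    import numpy as np

    def bootstrap_edge_probabilities(data, learn_structure, m=200, rng=None):
        """Nonparametric bootstrap estimate of the edge probabilities p_i.

        data:            (n, d) array of observations
        learn_structure: callable mapping a data set to a set of edges,
                         each edge a frozenset of two column indices
        """
        rng = np.random.default_rng(rng)
        n = data.shape[0]
        counts = {}  # edge -> number of bootstrap networks containing it
        for _ in range(m):
            resampled = data[rng.integers(0, n, size=n)]  # rows, with replacement
            for e in learn_structure(resampled):
                counts[e] = counts.get(e, 0) + 1
        # p_hat for each edge ever observed; unseen edges have p_hat = 0.
        return {e: c / m for e, c in counts.items()}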
Properties of the multivariate Bernoulli distribution
Moments

The first two moments of a multivariate Bernoulli variable $W = [W_1, W_2, \ldots, W_k]$ are

$$P = [\,E(W_1), \ldots, E(W_k)\,]^T \qquad \Sigma = [\sigma_{ij}] = [\,\mathrm{COV}(W_i, W_j)\,]$$

where

$$E(W_i) = p_i$$
$$\mathrm{COV}(W_i, W_j) = E(W_i W_j) - E(W_i)E(W_j) = p_{ij} - p_i p_j$$
$$\mathrm{VAR}(W_i) = \mathrm{COV}(W_i, W_i) = p_i - p_i^2$$

and can be estimated using

$$\hat{p}_i = \frac{1}{m} \sum_{b=1}^{m} I_{\{e_i \in E_b\}}(U_b) \qquad \text{and} \qquad \hat{p}_{ij} = \frac{1}{m} \sum_{b=1}^{m} I_{\{e_i \in E_b,\, e_j \in E_b\}}(U_b).$$
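These estimators are straightforward to compute once the bootstrapped edge sets are encoded as a binary indicator matrix; a sketch (hypothetical names, assuming the bootstrap output is already in this form) follows:

    import numpy as np

    def bernoulli_moments(indicators):
        """First two moments of the multivariate Bernoulli from bootstrap output.

        indicators: (m, k) binary array; entry (b, i) is 1 iff edge i is
                    present in the network learned from the b-th bootstrap sample.
        Returns (p_hat, sigma_hat): the mean vector and covariance matrix.
        """
        m = indicators.shape[0]
        p_hat = indicators.mean(axis=0)                # p_i
        p_pair = (indicators.T @ indicators) / m       # p_ij = E(W_i W_j)
        sigma_hat = p_pair - np.outer(p_hat, p_hat)    # p_ij - p_i p_j
        return p_hat, sigma_hat

Since the indicators are binary, $W_i^2 = W_i$, so the diagonal of sigma_hat automatically reduces to $\hat{p}_i - \hat{p}_i^2$.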
Uncorrelation and independence

Theorem. Let $B_i$ and $B_j$ be two Bernoulli random variables. Then $B_i$ and $B_j$ are independent if and only if their covariance is zero:

$$B_i \perp\!\!\!\perp B_j \iff \mathrm{COV}(B_i, B_j) = 0.$$

Theorem. Let $\mathbf{B} = [B_1, B_2, \ldots, B_k]^T$ and $\mathbf{C} = [C_1, C_2, \ldots, C_l]^T$, $k, l \in \mathbb{N}$, be two multivariate Bernoulli random variables. Then $\mathbf{B}$ and $\mathbf{C}$ are independent if and only if

$$\mathbf{B} \perp\!\!\!\perp \mathbf{C} \iff \mathrm{COV}(\mathbf{B}, \mathbf{C}) = O$$

where $O$ is the zero matrix.
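For the univariate case, the "if" direction, which fails for general random variables, follows directly from the $2 \times 2$ joint distribution; a short derivation (added here for completeness, not in the original slides):

$$\mathrm{COV}(B_i, B_j) = 0 \;\Rightarrow\; P(B_i = 1, B_j = 1) = p_{ij} = p_i p_j,$$

and the three remaining cell probabilities are then forced to factorize as well, e.g.

$$P(B_i = 1, B_j = 0) = P(B_i = 1) - P(B_i = 1, B_j = 1) = p_i(1 - p_j),$$

so the joint distribution is the product of the marginals, i.e. $B_i \perp\!\!\!\perp B_j$.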
Uncorrelation and independence (an example)

Let $\mathbf{B} = [B_1\ B_2\ B_3]^T = \mathbf{B}_1 + \mathbf{B}_2$, with $\mathbf{B}_1 = [0\ B_2\ 0]^T$ and $\mathbf{B}_2 = [B_1\ 0\ B_3]^T$; then we have

$$\mathrm{COV}(\mathbf{B}_1, \mathbf{B}_2) = E(\mathbf{B}_1 \mathbf{B}_2^T) - E(\mathbf{B}_1)E(\mathbf{B}_2)^T = E\begin{bmatrix} 0 & 0 & 0 \\ B_1 B_2 & 0 & B_2 B_3 \\ 0 & 0 & 0 \end{bmatrix} - \begin{bmatrix} 0 & 0 & 0 \\ p_1 p_2 & 0 & p_2 p_3 \\ 0 & 0 & 0 \end{bmatrix}$$

$$= \begin{bmatrix} 0 & 0 & 0 \\ p_{12} & 0 & p_{23} \\ 0 & 0 & 0 \end{bmatrix} - \begin{bmatrix} 0 & 0 & 0 \\ p_1 p_2 & 0 & p_2 p_3 \\ 0 & 0 & 0 \end{bmatrix} = \begin{bmatrix} 0 & 0 & 0 \\ p_{12} - p_1 p_2 & 0 & p_{23} - p_2 p_3 \\ 0 & 0 & 0 \end{bmatrix} = O \iff \mathbf{B}_1 \perp\!\!\!\perp \mathbf{B}_2.$$
Constraints on the covariance matrix Σ

The marginal variances of the edges are bounded, because

$$p_i \in [0, 1] \implies \sigma_{ii} = p_i - p_i^2 \in \left[0, \tfrac{1}{4}\right].$$

The maximum is attained for $p_i = \frac{1}{2}$, and the minimum for both $p_i = 0$ and $p_i = 1$. By the Cauchy-Schwarz inequality [1] the covariances are bounded too:

$$0 \leqslant \sigma_{ij}^2 \leqslant \sigma_{ii}\sigma_{jj} \leqslant \tfrac{1}{16} \implies |\sigma_{ij}| \in \left[0, \tfrac{1}{4}\right].$$

These result in similar bounds on the eigenvalues $\lambda_1, \ldots, \lambda_k$ of $\Sigma$:

$$0 \leqslant \lambda_i \leqslant \frac{k}{4} \qquad \text{and} \qquad 0 \leqslant \sum_{i=1}^{k} \lambda_i \leqslant \frac{k}{4}.$$
Constraints on Σ: a graphical representation

$$\Sigma_1 = \frac{1}{25}\begin{bmatrix} 6 & 1 \\ 1 & 6 \end{bmatrix} = \begin{bmatrix} 0.24 & 0.04 \\ 0.04 & 0.24 \end{bmatrix}$$

$$\Sigma_2 = \frac{1}{625}\begin{bmatrix} 66 & -21 \\ -21 & 126 \end{bmatrix} = \begin{bmatrix} 0.1056 & -0.0336 \\ -0.0336 & 0.2016 \end{bmatrix}$$

$$\Sigma_3 = \frac{1}{625}\begin{bmatrix} 66 & 91 \\ 91 & 126 \end{bmatrix} = \begin{bmatrix} 0.1056 & 0.1456 \\ 0.1456 & 0.2016 \end{bmatrix}$$

[Figure: graphical representation of the constraints on Σ for these three covariance matrices.]
Measures of Structure Variability
Entropy of the bootstrapped models

Let's consider the graphical models $U_1, \ldots, U_m$ learned from the bootstrap samples.

• minimum entropy: all the models learned from the bootstrap samples have the same structure. In this case

$$p_i = \begin{cases} 1 & \text{if } e_i \in E \\ 0 & \text{otherwise} \end{cases} \qquad \text{and} \qquad \Sigma = O.$$

• intermediate entropy: several models are observed with different frequencies $m_b$, $\sum m_b = m$, so

$$\hat{p}_i = \frac{1}{m} \sum_{b \,:\, e_i \in E_b} m_b \qquad \text{and} \qquad \hat{p}_{ij} = \frac{1}{m} \sum_{b \,:\, e_i \in E_b,\, e_j \in E_b} m_b.$$

• maximum entropy: all possible models appear with the same frequency, which results in

$$p_i = \frac{1}{2} \qquad \text{and} \qquad \Sigma = \frac{1}{4} I_k.$$
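A quick numerical check of the two extreme cases (a sketch added here, not from the slides; maximum entropy over structures corresponds to each edge being present independently with probability 1/2):

    import numpy as np

    def moments(ind):
        # Mean vector and covariance matrix of binary indicators (rows = samples).
        p = ind.mean(axis=0)
        return p, (ind.T @ ind) / ind.shape[0] - np.outer(p, p)

    rng = np.random.default_rng(42)
    m, k = 10000, 5

    # Minimum entropy: every bootstrap sample yields the same edge set.
    same = np.tile(rng.integers(0, 2, size=k), (m, 1))
    print(np.allclose(moments(same)[1], 0))    # True: Sigma = O

    # Maximum entropy: each edge present independently with probability 1/2.
    uniform = rng.integers(0, 2, size=(m, k))
    print(np.round(moments(uniform)[1], 2))    # approximately (1/4) I_k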
Entropy of the bootstrapped models

[Figure: bootstrapped network structures under maximum entropy and minimum entropy.]
Univariate measures of variability

• the generalized variance

$$\mathrm{VAR}_G(\Sigma) = \det(\Sigma) = \prod_{i=1}^{k} \lambda_i \in \left[0, \frac{1}{4^k}\right]$$

• the total variance

$$\mathrm{VAR}_T(\Sigma) = \mathrm{tr}(\Sigma) = \sum_{i=1}^{k} \lambda_i \in \left[0, \frac{k}{4}\right]$$

• the squared Frobenius matrix norm

$$\mathrm{VAR}_N(\Sigma) = \left|\left|\left|\Sigma - \frac{k}{4} I_k\right|\right|\right|_F^2 = \sum_{i=1}^{k} \left(\lambda_i - \frac{k}{4}\right)^2 \in \left[\frac{k(k-1)^2}{16}, \frac{k^3}{16}\right]$$
Measures of structure variability

$$\overline{\mathrm{VAR}}_T(\Sigma) = \frac{\mathrm{VAR}_T(\Sigma)}{\max_\Sigma \mathrm{VAR}_T(\Sigma)} = \frac{4\,\mathrm{VAR}_T(\Sigma)}{k}$$

$$\overline{\mathrm{VAR}}_G(\Sigma) = \frac{\mathrm{VAR}_G(\Sigma)}{\max_\Sigma \mathrm{VAR}_G(\Sigma)} = 4^k\,\mathrm{VAR}_G(\Sigma)$$

$$\overline{\mathrm{VAR}}_N(\Sigma) = \frac{\max_\Sigma \mathrm{VAR}_N(\Sigma) - \mathrm{VAR}_N(\Sigma)}{\max_\Sigma \mathrm{VAR}_N(\Sigma) - \min_\Sigma \mathrm{VAR}_N(\Sigma)} = \frac{k^3 - 16\,\mathrm{VAR}_N(\Sigma)}{k(2k - 1)}$$

All of them vary in the $[0, 1]$ interval and associate high values to networks whose structure displays a high entropy in the bootstrap samples.
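A small Python sketch of the three normalized measures (hypothetical function names), taking the estimated covariance matrix of the edge indicators as input; at the two extremes they return 0 (minimum entropy) and 1 (maximum entropy), as the sanity check shows:

    import numpy as np

    def var_total(sigma):
        """Normalized total variance: 4 tr(Sigma) / k."""
        k = sigma.shape[0]
        return 4 * np.trace(sigma) / k

    def var_generalized(sigma):
        """Normalized generalized variance: 4^k det(Sigma)."""
        k = sigma.shape[0]
        return 4.0**k * np.linalg.det(sigma)

    def var_frobenius(sigma):
        """Normalized Frobenius measure: (k^3 - 16 VAR_N) / (k (2k - 1))."""
        k = sigma.shape[0]
        var_n = np.linalg.norm(sigma - (k / 4) * np.eye(k), "fro") ** 2
        return (k**3 - 16 * var_n) / (k * (2 * k - 1))

    # Sanity check at the two extremes:
    k = 6
    print(var_total(np.zeros((k, k))), var_frobenius(np.zeros((k, k))))  # 0, 0
    print(var_total(np.eye(k) / 4), var_frobenius(np.eye(k) / 4))        # 1, 1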
Structure variability (total variance)

[Figure: structure variability measured by the total variance, for the maximum entropy and minimum entropy cases.]
Structure variability (Frobenius norm)

[Figure: structure variability measured by the Frobenius norm, for the maximum entropy and minimum entropy cases.]
Applications

• compare the performance of different combinations of learning algorithms and network scores/independence tests on the same data.
• study the performance of an algorithm at different sample sizes by changing the size of the bootstrap samples. The simplest way is to test the hypothesis

$$H_0 : \Sigma = \frac{1}{4} I_k \qquad \text{vs.} \qquad H_1 : \Sigma \neq \frac{1}{4} I_k$$

using either parametric tests or parametric bootstrap (see the sketch after this list).
• apply many techniques from classical multivariate statistics (such as principal components), graph theory (path analysis) and linear algebra (matrix decompositions).
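One way to carry out such a test with parametric bootstrap is a Monte Carlo comparison of an observed statistic against its distribution under H0; the sketch below (all names hypothetical, not from the slides) uses the Frobenius distance from (1/4) I_k as the test statistic:

    import numpy as np

    def test_max_entropy(indicators, n_rep=1000, rng=None):
        """Monte Carlo test of H0: Sigma = (1/4) I_k via parametric bootstrap.

        indicators: (m, k) binary matrix of edge indicators from the m
                    networks learned on the bootstrap samples.
        Returns an approximate p-value for the Frobenius distance statistic.
        """
        rng = np.random.default_rng(rng)
        m, k = indicators.shape

        def statistic(ind):
            p = ind.mean(axis=0)
            sigma = (ind.T @ ind) / ind.shape[0] - np.outer(p, p)
            return np.linalg.norm(sigma - np.eye(k) / 4, "fro")

        observed = statistic(indicators)
        # Under H0 every edge is present independently with probability 1/2.
        null = [statistic(rng.integers(0, 2, size=(m, k))) for _ in range(n_rep)]
        return np.mean([s >= observed for s in null])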