 
Directed Graphical Models
Michael Gutmann
Probabilistic Modelling and Reasoning (INFR11134)
School of Informatics, University of Edinburgh
Spring semester 2018
Recap
◮ We talked about reasonably weak assumptions that facilitate the efficient representation of a probabilistic model.
◮ Independence assumptions reduce the number of interacting variables.
◮ Parametric assumptions restrict the way the variables may interact.
◮ (Conditional) independence assumptions lead to a factorisation of the pdf/pmf, e.g.
$$p(x, y, z) = p(x)\, p(y)\, p(z)$$
$$p(x_1, \ldots, x_d) = p(x_d \mid x_{d-3}, x_{d-2}, x_{d-1})\, p(x_1, \ldots, x_{d-1})$$
Program
1. Equivalence of factorisation and ordered Markov property
2. Understanding models from their factorisation
3. Definition of directed graphical models
4. Independencies in directed graphical models
Program
1. Equivalence of factorisation and ordered Markov property
   ◮ Chain rule
   ◮ Ordered Markov property implies factorisation
   ◮ Factorisation implies ordered Markov property
2. Understanding models from their factorisation
3. Definition of directed graphical models
4. Independencies in directed graphical models
Chain rule
Iteratively applying the product rule allows us to factorise any joint pdf (pmf) $p(\mathbf{x}) = p(x_1, x_2, \ldots, x_d)$ into a product of conditional pdfs (pmfs).
$$
\begin{aligned}
p(\mathbf{x}) &= p(x_1)\, p(x_2, \ldots, x_d \mid x_1) \\
&= p(x_1)\, p(x_2 \mid x_1)\, p(x_3, \ldots, x_d \mid x_1, x_2) \\
&= p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1, x_2)\, p(x_4, \ldots, x_d \mid x_1, x_2, x_3) \\
&\;\;\vdots \\
&= p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1, x_2) \cdots p(x_d \mid x_1, \ldots, x_{d-1}) \\
&= p(x_1) \prod_{i=2}^{d} p(x_i \mid x_1, \ldots, x_{i-1}) \\
&= \prod_{i=1}^{d} p(x_i \mid \mathrm{pre}_i)
\end{aligned}
$$
with $\mathrm{pre}_i = \mathrm{pre}(x_i) = \{x_1, \ldots, x_{i-1}\}$, $\mathrm{pre}_1 = \varnothing$ and $p(x_1 \mid \varnothing) = p(x_1)$.
The chain rule can be applied to any ordering $x_{k_1}, \ldots, x_{k_d}$. Different orderings give different factorisations.
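As a numerical sanity check of the chain rule, the following sketch builds an arbitrary joint pmf over three binary variables and confirms that the product of the conditionals reproduces the joint. The random table and the names `joint` and `chain_rule_product` are illustrative assumptions, not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary joint pmf p(x1, x2, x3) over three binary variables (illustrative).
joint = rng.random((2, 2, 2))
joint /= joint.sum()

# Marginals needed to form the conditionals.
p1 = joint.sum(axis=(1, 2))   # p(x1)
p12 = joint.sum(axis=2)       # p(x1, x2)

# Conditionals obtained from the product rule.
p2_given_1 = p12 / p1[:, None]           # p(x2 | x1), indexed [x1, x2]
p3_given_12 = joint / p12[:, :, None]    # p(x3 | x1, x2), indexed [x1, x2, x3]

# Chain rule: p(x1) p(x2 | x1) p(x3 | x1, x2).
chain_rule_product = p1[:, None, None] * p2_given_1[:, :, None] * p3_given_12

print(np.allclose(chain_rule_product, joint))
```

The same check works for any other ordering of the variables after permuting the axes of `joint`.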
From (conditional) independence to factorisation
$p(\mathbf{x}) = \prod_{i=1}^{d} p(x_i \mid \mathrm{pre}_i)$ for the ordering $x_1, \ldots, x_d$
◮ For each $x_i$, we condition on all previous variables in the ordering.
◮ Assume that, for each $i$, there is a minimal subset of variables $\pi_i \subseteq \mathrm{pre}_i$ such that $p(\mathbf{x})$ satisfies
$$x_i \perp\!\!\!\perp (\mathrm{pre}_i \setminus \pi_i) \mid \pi_i \quad \text{for all } i.$$
The distribution is then said to satisfy the ordered Markov property.
◮ By definition of conditional independence:
$$p(x_i \mid x_1, \ldots, x_{i-1}) = p(x_i \mid \mathrm{pre}_i) = p(x_i \mid \pi_i)$$
◮ With the convention $\pi_1 = \varnothing$, we obtain the factorisation
$$p(x_1, \ldots, x_d) = \prod_{i=1}^{d} p(x_i \mid \pi_i)$$
◮ As we will see later, the $\pi_i$ correspond to the parents of $x_i$ in graphs.
From (conditional) independence to factorisation
◮ Assume the variables are ordered as $x_1, \ldots, x_d$, let $\mathrm{pre}_i = \{x_1, \ldots, x_{i-1}\}$ and $\pi_i \subseteq \mathrm{pre}_i$.
◮ We have seen that
$$\text{if } x_i \perp\!\!\!\perp \mathrm{pre}_i \setminus \pi_i \mid \pi_i \text{ for all } i \quad \text{then} \quad p(x_1, \ldots, x_d) = \prod_{i=1}^{d} p(x_i \mid \pi_i)$$
◮ The chain rule corresponds to the case where $\pi_i = \mathrm{pre}_i$.
◮ Do we also have the reverse?
$$\text{if } p(x_1, \ldots, x_d) = \prod_{i=1}^{d} p(x_i \mid \pi_i) \text{ with } \pi_i \subseteq \mathrm{pre}_i \quad \text{then} \quad x_i \perp\!\!\!\perp \mathrm{pre}_i \setminus \pi_i \mid \pi_i \text{ for all } i\,?$$
From factorisation to (conditional) independence
◮ Let us first check whether $x_d \perp\!\!\!\perp \mathrm{pre}_d \setminus \pi_d \mid \pi_d$ holds.
◮ We do that by checking whether
$$p(x_d \mid \underbrace{x_1, \ldots, x_{d-1}}_{\mathrm{pre}_d}) = p(x_d \mid \pi_d)$$
holds.
◮ Since
$$p(x_d \mid x_1, \ldots, x_{d-1}) = \frac{p(x_1, \ldots, x_d)}{p(x_1, \ldots, x_{d-1})}$$
we start by computing $p(x_1, \ldots, x_{d-1})$.
From factorisation to (conditional) independence
Assume that the $x_i$ are ordered as $x_1, \ldots, x_d$ and that $p(x_1, \ldots, x_d) = \prod_{i=1}^{d} p(x_i \mid \pi_i)$ with $\pi_i \subseteq \mathrm{pre}_i$.
We compute $p(x_1, \ldots, x_{d-1})$ using the sum rule:
$$
\begin{aligned}
p(x_1, \ldots, x_{d-1}) &= \int p(x_1, \ldots, x_d)\, \mathrm{d}x_d \\
&= \int \prod_{i=1}^{d} p(x_i \mid \pi_i)\, \mathrm{d}x_d \\
&= \int \prod_{i=1}^{d-1} p(x_i \mid \pi_i)\, p(x_d \mid \pi_d)\, \mathrm{d}x_d \qquad (x_d \notin \pi_i,\ i < d) \\
&= \prod_{i=1}^{d-1} p(x_i \mid \pi_i) \int p(x_d \mid \pi_d)\, \mathrm{d}x_d \\
&= \prod_{i=1}^{d-1} p(x_i \mid \pi_i)
\end{aligned}
$$
The last step uses that the conditional $p(x_d \mid \pi_d)$ integrates to one over $x_d$.
From factorisation to (conditional) independence
Hence:
$$
\begin{aligned}
p(x_d \mid x_1, \ldots, x_{d-1}) &= \frac{p(x_1, \ldots, x_d)}{p(x_1, \ldots, x_{d-1})} \\
&= \frac{\prod_{i=1}^{d} p(x_i \mid \pi_i)}{\prod_{i=1}^{d-1} p(x_i \mid \pi_i)} \\
&= p(x_d \mid \pi_d)
\end{aligned}
$$
And $p(x_d \mid x_1, \ldots, x_{d-1}) = p(x_d \mid \pi_d)$ means that $x_d \perp\!\!\!\perp \mathrm{pre}_d \setminus \pi_d \mid \pi_d$, as desired.
$p(x_1, \ldots, x_{d-1})$ has the same form as $p(x_1, \ldots, x_d)$: apply the same procedure to all $p(x_1, \ldots, x_k)$, for smaller and smaller $k \le d-1$.
This proves that
(1) $p(x_1, \ldots, x_k) = \prod_{i=1}^{k} p(x_i \mid \pi_i)$, and that
(2) factorisation implies $x_i \perp\!\!\!\perp \mathrm{pre}_i \setminus \pi_i \mid \pi_i$ for all $i$.
Brief summary
◮ Let $\mathbf{x} = (x_1, \ldots, x_d)$ be a $d$-dimensional random vector with pdf/pmf $p(\mathbf{x})$.
◮ Denote the predecessors of $x_i$ in the ordering by $\mathrm{pre}(x_i) = \mathrm{pre}_i = \{x_1, \ldots, x_{i-1}\}$, and let $\pi_i \subseteq \mathrm{pre}_i$.
$$p(\mathbf{x}) = \prod_{i=1}^{d} p(x_i \mid \pi_i) \iff x_i \perp\!\!\!\perp \mathrm{pre}_i \setminus \pi_i \mid \pi_i \text{ for all } i$$
◮ This is the equivalence of factorisation and ordered Markov property of the pdf/pmf.
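The direction "factorisation implies independence" can be checked numerically on a small case. The sketch below assumes a hypothetical three-variable pmf that factorises as $p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_2)$, so that $\pi_3 = \{x_2\}$, and verifies that $p(x_3 \mid x_1, x_2)$ computed from the joint does not depend on $x_1$. The random tables and the helper `random_cpt` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def random_cpt(*shape):
    """Random conditional table whose last axis sums to one (illustrative)."""
    table = rng.random(shape)
    return table / table.sum(axis=-1, keepdims=True)

p1 = random_cpt(2)              # p(x1)
p2_given_1 = random_cpt(2, 2)   # p(x2 | x1), indexed [x1, x2]
p3_given_2 = random_cpt(2, 2)   # p(x3 | x2), indexed [x2, x3]

# Joint built from the factorisation with pi_3 = {x2}, indexed [x1, x2, x3].
joint = p1[:, None, None] * p2_given_1[:, :, None] * p3_given_2[None, :, :]

# p(x3 | x1, x2) recovered from the joint via the product rule.
p12 = joint.sum(axis=2, keepdims=True)
p3_given_12 = joint / p12

# Ordered Markov property for x3: the conditional equals p(x3 | x2) for every x1.
independent = np.allclose(p3_given_12, np.broadcast_to(p3_given_2, (2, 2, 2)))
print(independent)
```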
Why does it matter?
◮ Denote the predecessors of $x_i$ in the ordering by $\mathrm{pre}_i = \{x_1, \ldots, x_{i-1}\}$, and let $\pi_i \subseteq \mathrm{pre}_i$.
$$p(\mathbf{x}) = \prod_{i=1}^{d} p(x_i \mid \pi_i) \iff x_i \perp\!\!\!\perp \mathrm{pre}_i \setminus \pi_i \mid \pi_i \text{ for all } i$$
◮ Why does it matter?
  ◮ Relatively strong result: it holds for sets of pdfs/pmfs and not only single instances.
  ◮ For all members of the set: fewer numbers are needed for their representation.
  ◮ Given the independencies, we know what form $p(\mathbf{x})$ must have.
  ◮ Increased understanding of the properties of the model (independencies and data generation mechanism).
  ◮ Visualisation as a graph.
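To make "fewer numbers are needed" concrete, here is a small counting sketch for discrete variables. It assumes every variable takes `num_states` values, so that each conditional $p(x_i \mid \pi_i)$ needs `num_states - 1` free numbers per joint configuration of $\pi_i$. The function name and the parent-set sizes in the last example are hypothetical.

```python
def num_free_parameters(parent_set_sizes, num_states=2):
    """Free parameters of prod_i p(x_i | pi_i) for discrete variables.

    Each conditional needs (num_states - 1) numbers for every joint
    configuration of its conditioning set.
    """
    return sum((num_states - 1) * num_states ** size for size in parent_set_sizes)

d = 5
full_joint = 2 ** d - 1                         # unconstrained pmf over 5 binary variables
chain_rule = num_free_parameters(range(d))      # pi_i = pre_i, so |pi_i| = i - 1
sparse = num_free_parameters([0, 0, 2, 1, 1])   # hypothetical smaller conditioning sets

print(full_joint, chain_rule, sparse)  # 31 31 10
```

The chain rule alone gives no saving (31 numbers either way); the reduction comes entirely from the independence assumptions that shrink the $\pi_i$.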
Program
1. Equivalence of factorisation and ordered Markov property
2. Understanding models from their factorisation
   ◮ Ancestral sampling
   ◮ Visualisation as a directed graph
   ◮ Description of directed graphs and topological orderings
3. Definition of directed graphical models
4. Independencies in directed graphical models
Ancestral sampling
◮ The factorisation provides a recipe for data generation / sampling from $p(\mathbf{x})$.
◮ Example:
$$p(x_1, x_2, x_3, x_4, x_5) = p(x_1)\, p(x_2)\, p(x_3 \mid x_1, x_2)\, p(x_4 \mid x_3)\, p(x_5 \mid x_2)$$
◮ We can generate samples from the joint distribution $p(x_1, x_2, x_3, x_4, x_5)$ by sampling
  1. $x_1 \sim p(x_1)$
  2. $x_2 \sim p(x_2)$
  3. $x_3 \sim p(x_3 \mid x_1, x_2)$
  4. $x_4 \sim p(x_4 \mid x_3)$
  5. $x_5 \sim p(x_5 \mid x_2)$
◮ Note: this helps in modelling and in understanding the properties of $p(\mathbf{x})$, but may not reflect causal relationships.
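The five sampling steps above can be sketched in code as follows, assuming binary variables. The conditional probability tables are made-up numbers chosen only for illustration, and `ancestral_sample` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tables giving P(x_i = 1 | conditioning variables).
p_x1 = 0.3
p_x2 = 0.6
p_x3 = np.array([[0.1, 0.5], [0.7, 0.9]])  # indexed [x1, x2]
p_x4 = np.array([0.2, 0.8])                # indexed [x3]
p_x5 = np.array([0.4, 0.9])                # indexed [x2]

def ancestral_sample(rng):
    """Draw one joint sample by sampling each variable after its conditioning set."""
    x1 = int(rng.random() < p_x1)
    x2 = int(rng.random() < p_x2)
    x3 = int(rng.random() < p_x3[x1, x2])
    x4 = int(rng.random() < p_x4[x3])
    x5 = int(rng.random() < p_x5[x2])
    return x1, x2, x3, x4, x5

samples = np.array([ancestral_sample(rng) for _ in range(10000)])
print(samples.mean(axis=0))  # empirical estimates of P(x_i = 1)
```

Each variable is sampled only after all variables it is conditioned on, which is exactly what the ordering of the factorisation guarantees.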
Visualisation as a directed graph
If $p(\mathbf{x}) = \prod_{i=1}^{d} p(x_i \mid \pi_i)$ with $\pi_i \subseteq \mathrm{pre}_i$, we can visualise the model as a graph with the random variables $x_i$ as nodes, and directed edges that point from the $x_j \in \pi_i$ to the $x_i$. This results in a directed acyclic graph (DAG).
Example:
$$p(x_1, x_2, x_3, x_4, x_5) = p(x_1)\, p(x_2)\, p(x_3 \mid x_1, x_2)\, p(x_4 \mid x_3)\, p(x_5 \mid x_2)$$
[Figure: DAG with edges x1 → x3, x2 → x3, x3 → x4, x2 → x5]
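The construction of the graph from the factorisation can be sketched with plain Python data structures. The dictionary `parents` below encodes the sets $\pi_i$ of the example. The representation and variable names are illustrative choices, not a prescribed format.

```python
# pi_i for each variable of the example, listed in the ordering x1, ..., x5.
parents = {
    "x1": [],
    "x2": [],
    "x3": ["x1", "x2"],
    "x4": ["x3"],
    "x5": ["x2"],
}

# One directed edge from every x_j in pi_i to x_i.
edges = [(pa, child) for child, pas in parents.items() for pa in pas]

# Because pi_i is a subset of pre_i, every edge points forward in the
# ordering, which rules out directed cycles.
position = {name: index for index, name in enumerate(parents)}
respects_ordering = all(position[pa] < position[child] for pa, child in edges)

print(edges)
print(respects_ordering)
```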
Visualisation as a directed graph
Example:
$$p(x_1, x_2, x_3, x_4) = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1, x_2)\, p(x_4 \mid x_1, x_2, x_3)$$
[Figure: DAG over x1, x2, x3, x4 with an edge x_i → x_j for every i < j]
Factorisation obtained by the chain rule ≡ fully connected directed acyclic graph.
Graph concepts
◮ Directed graph: graph where all edges are directed.
◮ Directed acyclic graph (DAG): by following the direction of the arrows you will never visit a node more than once.
◮ $x_i$ is a parent of $x_j$ if there is a (directed) edge from $x_i$ to $x_j$. The set of parents of $x_i$ in the graph is denoted by $\mathrm{pa}(x_i) = \mathrm{pa}_i$, e.g. $\mathrm{pa}(x_3) = \mathrm{pa}_3 = \{x_1, x_2\}$.
◮ $x_j$ is a child of $x_i$ if $x_i \in \mathrm{pa}(x_j)$, e.g. $x_3$ and $x_5$ are children of $x_2$.
[Figure: DAG with edges x1 → x3, x2 → x3, x3 → x4, x2 → x5]
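The parent and child relations can be read off an edge list directly. The sketch below assumes the example DAG with edges x1 → x3, x2 → x3, x3 → x4 and x2 → x5. The function names `pa` and `children` mirror the notation of the slide but are otherwise hypothetical.

```python
# Directed edges (source, target) of the example DAG.
edges = [("x1", "x3"), ("x2", "x3"), ("x3", "x4"), ("x2", "x5")]

def pa(node):
    """Parents of `node`: all sources of edges that end at `node`."""
    return {src for src, dst in edges if dst == node}

def children(node):
    """Children of `node`: all targets of edges that start at `node`."""
    return {dst for src, dst in edges if src == node}

print(pa("x3"))        # the set containing x1 and x2
print(children("x2"))  # the set containing x3 and x5
```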
Graph concepts
◮ A path or trail from $x_i$ to $x_j$ is a sequence of distinct connected nodes starting at $x_i$ and ending at $x_j$. The direction of the arrows does not matter. For example: $x_5, x_2, x_3, x_1$ is a trail.
◮ A directed path is a sequence of connected nodes where we follow the direction of the arrows. For example: $x_1, x_3, x_4$ is a directed path. But $x_5, x_2, x_3, x_1$ is not a directed path.
[Figure: DAG with edges x1 → x3, x2 → x3, x3 → x4, x2 → x5]
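The difference between a trail and a directed path can be checked mechanically. The following sketch again assumes the example DAG with edges x1 → x3, x2 → x3, x3 → x4 and x2 → x5. `is_trail` and `is_directed_path` are hypothetical helper names.

```python
# Directed edges (source, target) of the example DAG.
edges = {("x1", "x3"), ("x2", "x3"), ("x3", "x4"), ("x2", "x5")}

def is_directed_path(nodes):
    """Consecutive nodes must be joined by an edge in the arrow direction."""
    return all((a, b) in edges for a, b in zip(nodes, nodes[1:]))

def is_trail(nodes):
    """Nodes must be distinct and consecutive nodes joined by an edge in either direction."""
    distinct = len(set(nodes)) == len(nodes)
    connected = all(
        (a, b) in edges or (b, a) in edges for a, b in zip(nodes, nodes[1:])
    )
    return distinct and connected

print(is_trail(["x5", "x2", "x3", "x1"]))          # True
print(is_directed_path(["x1", "x3", "x4"]))        # True
print(is_directed_path(["x5", "x2", "x3", "x1"]))  # False
```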