
Probabilistic & Unsupervised Learning: Graphical Models

Maneesh Sahani (maneesh@gatsby.ucl.ac.uk)
Gatsby Computational Neuroscience Unit, and MSc ML/CSML, Dept Computer Science, University College London
Term 1, Autumn 2017


Factor graphs: neighbourhoods and Markov boundaries

[Figures (a) and (b): two factor graphs over the variables A, B, C, D, E.]

◮ Variables are neighbours if they share a common factor; the neighbourhood ne(X) is the set of all neighbours of X.
◮ Each variable X is conditionally independent of all non-neighbours given its neighbours:
  X ⊥⊥ Y | ne(X)  for all Y ∉ {X} ∪ ne(X)
  ⇒ ne(X) is a Markov blanket for X.
◮ In fact, the neighbourhood is the minimal such set: the Markov boundary.

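The neighbourhood definition above is easy to turn into code. Below is a minimal sketch (not course material): a factor graph is represented only by its factor scopes, and ne(X), which is also the Markov boundary of X, is read off directly. The particular factors are invented for illustration.

```python
# Minimal sketch: neighbourhoods / Markov boundaries in a factor graph.
# The factor graph is represented only by its factor scopes (sets of variables);
# the example scopes below are hypothetical, chosen for illustration.

def neighbours(factors, x):
    """ne(x): all variables sharing at least one factor with x."""
    ne = set()
    for scope in factors:
        if x in scope:
            ne |= scope
    ne.discard(x)
    return ne

# Hypothetical factor graph over A..E with factors f1(A,B,C), f2(C,D), f3(D,E).
factors = [{"A", "B", "C"}, {"C", "D"}, {"D", "E"}]

for var in "ABCDE":
    # The neighbourhood is the Markov boundary: the minimal set that renders
    # var conditionally independent of all remaining variables.
    print(var, "has Markov boundary", sorted(neighbours(factors, var)))
```
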
Undirected graphical models: Markov networks

[Figure: an undirected graph over the variables A, B, C, D, E.]

An undirected graphical model is a direct representation of conditional independence structure: nodes are connected iff they are conditionally dependent given all others.
⇒ neighbours (connected nodes) in a Markov net share a factor.
⇒ non-neighbours (disconnected nodes) in a Markov net cannot share a factor.
⇒ the joint probability factors over the maximal cliques C_j of the graph:
  P(X) = (1/Z) ∏_j f_j(X_Cj)
It may also factor more finely (as we will see in a moment).
[Cliques are fully connected subgraphs; maximal cliques are cliques not contained in other cliques.]

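To make the clique factorisation concrete, here is a small illustrative sketch (my own toy example, not from the slides): a distribution defined by potentials on the maximal cliques of a hypothetical four-node graph, with the normaliser Z computed by brute-force enumeration over binary assignments.

```python
# Minimal sketch of P(X) = (1/Z) prod_j f_j(X_{C_j}) over the maximal cliques of an
# undirected graph.  The graph, cliques and potentials are hypothetical toy choices.
from itertools import product

variables = ["A", "B", "C", "D"]

# Maximal cliques of the hypothetical graph with edges A-B, B-C, A-C, C-D.
def f_abc(a, b, c):                  # non-negative potential on clique {A, B, C}
    return 1.0 + 2.0 * (a == b) + 0.5 * c

def f_cd(c, d):                      # non-negative potential on clique {C, D}
    return 2.0 if c == d else 1.0

def unnorm(x):                       # product of clique potentials for an assignment dict
    return f_abc(x["A"], x["B"], x["C"]) * f_cd(x["C"], x["D"])

Z = sum(unnorm(dict(zip(variables, vals)))        # brute-force normaliser
        for vals in product([0, 1], repeat=len(variables)))

x = {"A": 1, "B": 1, "C": 0, "D": 0}
print("Z =", Z, "  P(x) =", unnorm(x) / Z)
```
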
Undirected graphs: Markov boundaries

[Figure: an undirected graph over the variables A, B, C, D, E.]

◮ X ⊥⊥ Y | 𝒱 if every path between X and Y contains some node V ∈ 𝒱.
◮ Each variable X is conditionally independent of all non-neighbours given its neighbours:
  X ⊥⊥ Y | ne(X)  for all Y ∉ {X} ∪ ne(X)
◮ 𝒱 is a Markov blanket for X iff X ⊥⊥ Y | 𝒱 for all Y ∉ {X} ∪ 𝒱.
◮ Markov boundary: the minimal Markov blanket. For undirected graphs (as for factor graphs) this is the set of neighbours of X.

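The path-based criterion above is just a graph-separation test: X ⊥⊥ Y | 𝒱 exactly when deleting the nodes in 𝒱 disconnects X from Y. A minimal sketch, assuming a small hypothetical adjacency list (not the graph drawn on the slide):

```python
# Minimal sketch: testing X ⊥⊥ Y given V in an undirected graph by deleting the
# conditioning set and checking reachability.  The adjacency list is a made-up example.
from collections import deque

adj = {"A": {"B", "C"}, "B": {"A", "C", "D"}, "C": {"A", "B", "E"},
       "D": {"B", "E"}, "E": {"C", "D"}}

def separated(x, y, cond):
    """True iff every path from x to y passes through a node in cond."""
    if x in cond or y in cond:
        return True                        # conditioning on an endpoint: treat as separated
    seen, queue = {x}, deque([x])
    while queue:
        node = queue.popleft()
        for nxt in adj[node] - set(cond):  # never step onto conditioning nodes
            if nxt == y:
                return False               # found an unblocked path
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return True

print(separated("A", "D", {"B", "C"}))   # True: {B, C} blocks every A-D path
print(separated("A", "D", {"B"}))        # False: the path A-C-E-D remains open
```
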
Undirected graphs and factor graphs

[Figures (a), (b), (c): an undirected graph (a) and two factor graphs (b), (c) over the variables A, B, C, D, E.]

◮ Each node has the same neighbours in each graph, so (a), (b) and (c) represent exactly the same conditional independence relationships.
◮ The implied maximal factorisations differ: (b) has two three-way factors; (c) has only pairwise factors; (a) cannot distinguish between these (so we have to adopt factorisation (b) to be safe).
◮ Suppose all variables are discrete and can take on K possible values. Then the functions in (a) and (b) are tables with O(K³) cells, whereas in (c) they are O(K²).
◮ Factor graphs have richer expressive power than undirected graphical models.
◮ Factors cannot be determined solely by testing for conditional independence.

Some examples of undirected graphical models

◮ Markov random fields (used in computer vision)
◮ Maximum entropy language models (used in speech and language modelling):
  P(X) = (1/Z) p₀(X) exp( ∑_j λ_j g_j(X) )
◮ Conditional random fields are undirected graphical models (conditioned on the input variables).
◮ Boltzmann machines (a kind of neural network/Ising model)

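As a concrete (invented) instance of the maximum-entropy form above, the sketch below evaluates P(X) = (1/Z) p₀(X) exp(∑_j λ_j g_j(X)) on a tiny discrete space; the base measure, features and weights are arbitrary choices for illustration.

```python
# Minimal sketch of a maximum-entropy model over a toy discrete space.
# Base measure, features and weights are invented.
import math
from itertools import product

space = list(product([0, 1], repeat=3))        # X = (x1, x2, x3), each binary

def p0(x):                                     # uniform base measure
    return 1.0 / len(space)

features = [lambda x: x[0] * x[1],             # g_1: x1 and x2 both on
            lambda x: x[2]]                    # g_2: x3 on
lambdas  = [1.5, -0.7]                         # feature weights

def score(x):                                  # unnormalised probability
    return p0(x) * math.exp(sum(lam * g(x) for lam, g in zip(lambdas, features)))

Z = sum(score(x) for x in space)
for x in space:
    print(x, round(score(x) / Z, 4))
```
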
Limitations of undirected and factor graphs

Undirected and factor graphs fail to capture some useful independencies: a pair of variables may be connected merely because some other variable depends on them. The classic example (due to Pearl):

[Figures: two graphs over Rain, Sprinkler and Ground wet.]

◮ Most sprinklers switch on come rain or shine; and certainly the weather pays no heed to the state of the sprinklers.
◮ Explaining away: damp ground suggests that it has rained; but if we also see a running sprinkler, this explains away the damp, returning our belief about rain to the prior.
◮ R ⊥⊥ S | ∅ but R ⊥̸⊥ S | G. This highlights the difference between marginal and conditional independence.

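A quick numerical check of explaining away (all probabilities below are invented; only the qualitative effect matters): conditioning on wet ground raises the probability of rain, and additionally conditioning on a running sprinkler pushes that belief back towards the prior.

```python
# Minimal sketch: explaining away in the Rain / Sprinkler / Ground-wet example.
# Joint: P(R, S, G) = P(R) P(S) P(G | R, S).  All numbers are invented.
from itertools import product

P_R = {1: 0.2, 0: 0.8}                 # prior probability of rain
P_S = {1: 0.3, 0: 0.7}                 # prior probability of the sprinkler being on
P_G_given = {                          # P(G = 1 | R, S)
    (0, 0): 0.05, (0, 1): 0.80, (1, 0): 0.90, (1, 1): 0.99}

def joint(r, s, g):
    pg1 = P_G_given[(r, s)]
    return P_R[r] * P_S[s] * (pg1 if g == 1 else 1.0 - pg1)

def prob_rain(**evidence):
    """P(R = 1 | evidence) by brute-force enumeration."""
    num = den = 0.0
    for r, s, g in product([0, 1], repeat=3):
        if any(val != {"R": r, "S": s, "G": g}[k] for k, val in evidence.items()):
            continue
        p = joint(r, s, g)
        den += p
        num += p if r == 1 else 0.0
    return num / den

print("P(R=1)            =", round(prob_rain(), 3))          # the prior
print("P(R=1 | G=1)      =", round(prob_rain(G=1), 3))       # damp ground: rain more likely
print("P(R=1 | G=1, S=1) =", round(prob_rain(G=1, S=1), 3))  # sprinkler explains it away
```
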
Directed acyclic graphical models

[Figure: a DAG over the variables A, B, C, D, E.]

A directed acyclic graphical (DAG) model represents a factorisation of the joint probability distribution in terms of conditionals:
  P(A, B, C, D, E) = P(A) P(B) P(C | A, B) P(D | B, C) P(E | C, D)
In general:
  P(X_1, ..., X_n) = ∏_{i=1}^{n} P(X_i | X_pa(i))
where pa(i) are the parents of node i. DAG models are also known as Bayesian networks or Bayes nets.

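One standard use of this factorisation, not spelled out on the slide, is ancestral sampling: draw each variable from its conditional given already-sampled parents, in topological order. A minimal sketch for the five-variable factorisation above, with invented CPT numbers:

```python
# Minimal sketch: ancestral sampling from the DAG factorisation
# P(A,B,C,D,E) = P(A) P(B) P(C|A,B) P(D|B,C) P(E|C,D).
# All CPT numbers are invented; variables are binary.
import random

def bern(p):                           # draw 0/1 with P(1) = p
    return 1 if random.random() < p else 0

def sample():
    a = bern(0.3)                                # P(A = 1)
    b = bern(0.6)                                # P(B = 1)
    c = bern([[0.1, 0.7], [0.5, 0.9]][a][b])     # P(C = 1 | A = a, B = b)
    d = bern([[0.2, 0.4], [0.6, 0.8]][b][c])     # P(D = 1 | B = b, C = c)
    e = bern([[0.1, 0.3], [0.7, 0.95]][c][d])    # P(E = 1 | C = c, D = d)
    return a, b, c, d, e

random.seed(0)
samples = [sample() for _ in range(10000)]
print("empirical P(E=1) ≈", sum(s[4] for s in samples) / len(samples))
```
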
Conditional independence in DAGs

[Figure: the same DAG over A, B, C, D, E.]

Reading conditional independence from DAGs is more complicated than in undirected graphs.
• A ⊥⊥ E | {B, C}: conditioning nodes block paths
• A ⊥⊥ B | ∅: other nodes block reflected paths
• A ⊥̸⊥ B | C: a conditioning node creates a reflected path by explaining away
• A ⊥̸⊥ E | C: the created path extends to E via D
• A ⊥⊥ E | {C, D}: but is blocked by observing D
So conditioning on (i.e. observing) nodes can both create and remove dependencies.

The Bayes-ball algorithm

[Figure: the same DAG over A, B, C, D, E.]

Game: can you get a ball from X to Y without being blocked by 𝒱? If so, X ⊥̸⊥ Y | 𝒱.
Rules: balls follow edges, and are passed on or bounced back from nodes according to:
◮ Nodes V ∉ 𝒱 pass balls down or up chains: → V → or ← V ←.
◮ Nodes V ∉ 𝒱 bounce balls from children to children.
◮ Nodes V ∈ 𝒱 bounce balls from parents to parents (including returning the ball whence it came).
Otherwise the ball is blocked. (So V ∈ 𝒱 blocks all balls arriving from children, and stops balls arriving from parents from reaching children.)

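The rules above translate into a reachability search over (node, direction-of-arrival) states. The sketch below is my own rendering of the Bayes-ball rules, not course code; the example DAG is the one implied by the earlier factorisation P(A)P(B)P(C|A,B)P(D|B,C)P(E|C,D).

```python
# Minimal sketch of the Bayes-ball reachability test on a DAG.
# Rules (as on the slide): unobserved nodes pass balls along chains and bounce
# them from children to children; observed nodes bounce balls from parents to
# parents and block everything else.
from collections import deque

def d_connected(parents, x, y, observed):
    """True iff a ball can travel from x to y given `observed`,
    i.e. x and y are NOT d-separated by `observed`."""
    children = {v: set() for v in parents}
    for v, ps in parents.items():
        for p in ps:
            children[p].add(v)

    # A state is (node, came_from) with came_from in {"child", "parent"}.
    start = [(x, "child"), (x, "parent")]        # launch balls in both directions
    seen, queue = set(start), deque(start)
    while queue:
        node, came_from = queue.popleft()
        if node == y:
            return True
        moves = []
        if node not in observed:
            if came_from == "child":
                # chain <-V<- : pass up to parents; fork <-V-> : bounce to other children
                moves += [(p, "child") for p in parents[node]]
                moves += [(c, "parent") for c in children[node]]
            else:
                # chain ->V-> : pass down to children only
                moves += [(c, "parent") for c in children[node]]
        else:
            if came_from == "parent":
                # collider ->V<- with V observed: bounce back to parents
                moves += [(p, "child") for p in parents[node]]
            # an observed node blocks balls arriving from children
        for m in moves:
            if m not in seen:
                seen.add(m)
                queue.append(m)
    return False

# DAG from the earlier factorisation (parents listed per node).
pa = {"A": set(), "B": set(), "C": {"A", "B"}, "D": {"B", "C"}, "E": {"C", "D"}}

print(d_connected(pa, "A", "B", set()))        # False: A ⊥⊥ B marginally
print(d_connected(pa, "A", "B", {"C"}))        # True: conditioning on the collider C couples A and B
print(d_connected(pa, "A", "E", {"B", "C"}))   # False: A ⊥⊥ E | {B, C}
```

These agree with the statements on the conditional-independence slide above. Conditioning on a descendant of the collider (D or E) also renders A and B dependent; the search captures this because the ball bounces back up from the observed descendant.
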
D-separation

[Figure: the same DAG over A, B, C, D, E.]

So when is X ⊥⊥ Y | 𝒱? Consider every undirected path (i.e. ignoring arrows) between X and Y. The path is blocked by 𝒱 if there is a node V on the path such that either:
◮ V has convergent arrows on the path (→ V ←, i.e. V is a "collider" node) and neither V nor its descendants are in 𝒱; or
◮ V does not have convergent arrows on the path (→ V →, ← V ←, or ← V →) and V ∈ 𝒱. This is similar to the undirected graph semantics.
If all paths are blocked, we say 𝒱 d-separates X from Y (d for directed), and X ⊥⊥ Y | 𝒱.
Markov boundary for X: pa(X) ∪ ch(X) ∪ pa(ch(X)).

Expressive power of directed and undirected graphs

[Figure: an undirected graph over A, B, D, E.]
No DAG can represent these and only these independencies: no matter how we direct the arrows there will always be two non-adjacent parents sharing a common child ⇒ dependence in the DAG but independence in the undirected graph.

[Figure: a directed graph over A, B, C.]
No undirected or factor graph can represent these and only these independencies: a single three-way factor is possible, but it does not encode the marginal independence.

Graphs, conditional independencies, and families of distributions

Each graph G implies a set of conditional independence statements C(G) = { X_i ⊥⊥ Y_i | 𝒱_i }.
Each such set C defines a family of distributions that satisfy all the statements in C:
  P_C(G) = { P(X) : P(X_i, Y_i | 𝒱_i) = P(X_i | 𝒱_i) P(Y_i | 𝒱_i) for all X_i ⊥⊥ Y_i | 𝒱_i in C }
G may also encode a family of distributions by their functional form, e.g. for a factor graph
  P_G = { P(X) : P(X) = (1/Z) ∏_j f_j(X_Cj), for some non-negative functions f_j }

◮ For directed graphs, P_G = P_C(G).
◮ For undirected graphs, P_G = P_C(G) if all distributions are positive, i.e. P(X) > 0 for all values of X (Hammersley-Clifford theorem).
◮ There are factor graphs for which P_G ≠ P_C(G).
◮ Factor graphs are more expressive than undirected graphs: for every undirected graph G_1 there is a factor graph G_2 with P_G1 = P_G2, but not vice versa.
◮ Adding edges to a graph ⇒ removing conditional independence statements ⇒ enlarging the family of distributions (the converse holds for removing edges).

Graphs, conditional independencies, and families of distributions

[Diagram relating: the conditional independence statements { X_i ⊥⊥ Y_i | 𝒱_i }; the family of distributions satisfying them, { p(X) : p(X_i, Y_i | 𝒱_i) = p(X_i | 𝒱_i) p(Y_i | 𝒱_i) }; and the family defined by the factorisation { p(X) = ∏_i p(X_i | X_pa(i)) }.]

Tree-structured graphical models

[Figures: four graphs over the variables A-G: a rooted directed tree, a directed polytree, an undirected tree, and a tree-structured factor graph.]

These are all tree-structured or "singly-connected" graphs.

Polytrees to tree-structured factor graphs

[Figure: a directed polytree over A-G and the equivalent tree-structured factor graph.]

Polytrees are tree-structured DAGs that may have more than one root.
  P(X) = ∏_i P(X_i | X_pa(i)) = ∏_i f_i(X_Ci)
where C_i = {i} ∪ pa(i) and f_i(X_Ci) = P(X_i | X_pa(i)). The marginal distribution on each root, P(X_r), is absorbed into an adjacent factor.

Undirected trees and factor graphs

[Figure: an undirected tree over A-G and the equivalent tree-structured factor graph.]

In an undirected tree all maximal cliques are of size 2, and so the equivalent factor graph has only pairwise factors:
  P(X) = (1/Z) ∏_{edges (ij)} f_(ij)(X_i, X_j)

Rooted directed trees to undirected trees

[Figure: a rooted directed tree over A-G and the equivalent undirected tree.]

The distribution for a single-rooted directed tree can be written as a product of pairwise factors ⇒ an undirected tree:
  P(X) = P(X_r) ∏_{i ≠ r} P(X_i | X_pa(i)) = ∏_{edges (ij)} f_(ij)(X_i, X_j)

Undirected trees to rooted directed trees

[Figure: an undirected tree over A-G and the equivalent rooted directed tree.]

This direction is slightly trickier:
◮ Choose an arbitrary node X_r to be the root and point all the arrows away from it.
◮ Compute the marginal distributions on single nodes P(X_i) and on edges P(X_i, X_j) implied by the undirected graph.
◮ Compute the conditionals in the DAG:
  P(X) = P(X_r) ∏_{i ≠ r} P(X_i | X_pa(i)) = P(X_r) ∏_{i ≠ r} P(X_i, X_pa(i)) / P(X_pa(i))
       = ∏_{edges (ij)} P(X_i, X_j) / ∏_{nodes i} P(X_i)^(deg(i) − 1)

How do we compute P(X_i) and P(X_i, X_j)? ⇒ Belief propagation.

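A brute-force check of the final identity (on a made-up three-node chain A - B - C): the joint distribution of an undirected tree equals the product of edge marginals divided by node marginals raised to the power degree minus one.

```python
# Minimal sketch: check P(X) = prod_edges P(Xi, Xj) / prod_nodes P(Xi)^(deg(i)-1)
# on a tiny undirected chain A - B - C with invented pairwise potentials.
from itertools import product

def f_ab(a, b):
    return 1.0 + 2.0 * (a == b)          # arbitrary non-negative potential on edge A-B

def f_bc(b, c):
    return 0.5 + 1.5 * (b != c)          # arbitrary non-negative potential on edge B-C

states = list(product([0, 1], repeat=3))                     # assignments (a, b, c)
Z = sum(f_ab(a, b) * f_bc(b, c) for a, b, c in states)
P = {(a, b, c): f_ab(a, b) * f_bc(b, c) / Z for a, b, c in states}   # exact joint

def marg(keep):
    """Exact marginal over the variables at positions `keep` (0=A, 1=B, 2=C)."""
    out = {}
    for s, p in P.items():
        key = tuple(s[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out

pB, pAB, pBC = marg([1]), marg([0, 1]), marg([1, 2])

# deg(A) = deg(C) = 1 and deg(B) = 2, so the identity reads P(a,b,c) = P(a,b) P(b,c) / P(b).
for (a, b, c), p in P.items():
    assert abs(p - pAB[(a, b)] * pBC[(b, c)] / pB[(b,)]) < 1e-9
print("identity verified on the toy chain")
```
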
Finding marginals in undirected trees

[Figure: an undirected tree; X_j is a neighbour of X_i, and T_{j→i} denotes the subtree containing X_j that hangs off X_i.]

Undirected tree ⇒ pairwise factored joint distribution:
  P(X) = (1/Z) ∏_{(ij)∈E_T} f_(ij)(X_i, X_j)

Each neighbour X_j of X_i defines a disjoint subtree T_{j→i}. So we can split up the product:
  P(X_i) = ∑_{X\{X_i}} P(X) ∝ ∑_{X\{X_i}} ∏_{(ij)∈E_T} f_(ij)(X_i, X_j)
         = ∑_{X\{X_i}} ∏_{X_j∈ne(X_i)} [ f_(ij)(X_i, X_j) ∏_{(i'j')∈E_{T_{j→i}}} f_(i'j')(X_i', X_j') ]
         = ∏_{X_j∈ne(X_i)} [ ∑_{X_{T_{j→i}}} f_(ij)(X_i, X_j) ∏_{(i'j')∈E_{T_{j→i}}} f_(i'j')(X_i', X_j') ]
         = ∏_{X_j∈ne(X_i)} M_{j→i}(X_i)

where the bracketed sum defines the message M_{j→i}(X_i).

Message recursion: Belief Propagation (BP)

[Figure: the subtree T_{j→i} containing X_j, attached to X_i.]

  M_{j→i}(X_i) = ∑_{X_{T_{j→i}}} f_(ij)(X_i, X_j) ∏_{(i'j')∈E_{T_{j→i}}} f_(i'j')(X_i', X_j')
               = ∑_{X_j} f_(ij)(X_i, X_j) [ ∑_{X_{T_{j→i}}\X_j} ∏_{(i'j')∈E_{T_{j→i}}} f_(i'j')(X_i', X_j') ]

The bracketed sum is ∝ P_{T_{j→i}}(X_j), and equals ∏_{X_k∈ne(X_j)\X_i} M_{k→j}(X_j), giving the recursion
  M_{j→i}(X_i) = ∑_{X_j} f_(ij)(X_i, X_j) ∏_{X_k∈ne(X_j)\X_i} M_{k→j}(X_j)

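The recursion translates directly into code. The following is a minimal sketch (my own illustration, not course code) of sum-product BP for node marginals in a small undirected tree with pairwise factors, checked against brute-force enumeration; the tree and potential tables are invented.

```python
# Minimal sketch: belief propagation (sum-product) for node marginals in an
# undirected tree with pairwise factors over binary variables.
from itertools import product

K = 2                                            # states per variable
nodes = ["A", "B", "C", "D"]
edges = [("A", "B"), ("B", "C"), ("B", "D")]     # a small tree

pot = {("A", "B"): [[2.0, 1.0], [1.0, 2.0]],     # invented pairwise potential tables
       ("B", "C"): [[1.0, 3.0], [2.0, 1.0]],
       ("B", "D"): [[1.0, 1.0], [1.0, 4.0]]}

def f(i, j, xi, xj):                             # pairwise factor, looked up either way round
    return pot[(i, j)][xi][xj] if (i, j) in pot else pot[(j, i)][xj][xi]

ne = {v: set() for v in nodes}
for a, b in edges:
    ne[a].add(b)
    ne[b].add(a)

def message(j, i, cache):
    """M_{j->i}(x_i): sum over x_j of f(x_i, x_j) times the product of
    M_{k->j}(x_j) over neighbours k of j other than i."""
    if (j, i) in cache:
        return cache[(j, i)]
    m = []
    for xi in range(K):
        total = 0.0
        for xj in range(K):
            incoming = 1.0
            for k in ne[j] - {i}:                # messages from the rest of the subtree
                incoming *= message(k, j, cache)[xj]
            total += f(i, j, xi, xj) * incoming
        m.append(total)
    cache[(j, i)] = m
    return m

def marginal(i):
    cache = {}
    unnorm = [1.0] * K
    for xi in range(K):
        for j in ne[i]:
            unnorm[xi] *= message(j, i, cache)[xi]   # P(x_i) proportional to the message product
    Z = sum(unnorm)
    return [u / Z for u in unnorm]

def brute(i):                                        # exact marginal by enumeration, for checking
    idx = {v: n for n, v in enumerate(nodes)}
    probs = [0.0] * K
    for x in product(range(K), repeat=len(nodes)):
        w = 1.0
        for a, b in edges:
            w *= f(a, b, x[idx[a]], x[idx[b]])
        probs[x[idx[i]]] += w
    Z = sum(probs)
    return [p / Z for p in probs]

for v in nodes:
    print(v, "BP:", [round(p, 4) for p in marginal(v)],
          "brute force:", [round(p, 4) for p in brute(v)])
```

In the same sketch, an observed node could be handled by restricting the inner sum over x_j to its observed value, which gives the conditioned messages discussed on the slides that follow.
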
BP for pairwise marginals in undirected trees

[Figure: neighbouring nodes X_i and X_j in an undirected tree.]

  P(X_i, X_j) = ∑_{X\{X_i,X_j}} P(X) ∝ ∑_{X\{X_i,X_j}} ∏_{(ij)∈E_T} f_(ij)(X_i, X_j)
              = ∑_{X\{X_i,X_j}} f_(ij)(X_i, X_j) ∏_{(i'j')∈E_{T_{j→i}}} f_(i'j')(X_i', X_j') ∏_{(i'j')∈E_{T_{i→j}}} f_(i'j')(X_i', X_j')
              = f_(ij)(X_i, X_j) [ ∑_{X_{T_{j→i}}\X_j} ∏_{(i'j')∈E_{T_{j→i}}} f_(i'j')(X_i', X_j') ] [ ∑_{X_{T_{i→j}}\X_i} ∏_{(i'j')∈E_{T_{i→j}}} f_(i'j')(X_i', X_j') ]
              = f_(ij)(X_i, X_j) ∏_{X_k∈ne(X_j)\X_i} M_{k→j}(X_j) ∏_{X_k∈ne(X_i)\X_j} M_{k→i}(X_i)

BP for inference

[Figure: a tree in which X_a is an observed leaf attached to X_i, and X_b is an observed internal node with neighbours X_j and X_k.]

Messages from observed leaf nodes are conditioned rather than marginalised:
  To compute P(X_i):            M_{a→i}(X_i) = ∑_{X_a} f_ai(X_a, X_i)
  To compute P(X_i | X_a = a):  M_{a→i}(X_i) = f_ai(X_a = a, X_i)

Observed internal nodes partition the graph, and so messages propagate independently:
  M_{b→j}(X_j) = f_bj(X_b = b, X_j)      M_{b→k}(X_k) = f_bk(X_b = b, X_k)

Messages M_{i→j} are proportional to the likelihood based on any observed variables O within the message's subtree T_{i→j}, possibly scaled by a prior factor (depending on the factorisation):
  M_{i→j}(X_j) ∝ P(X_{T_{i→j} ∩ O} | X_j) P(X_j)

BP for latent chain models

[Figure: a latent chain model with hidden states s_1, s_2, s_3, ..., s_T and observations x_1, x_2, x_3, ..., x_T.]

A latent chain model is a rooted directed tree ⇒ an undirected tree. The forward-backward algorithm is just BP on this graph:
  α_t(i) ⇔ M_{s_{t−1}→s_t}(s_t = i) ∝ P(x_{1:t}, s_t = i)
  β_t(i) ⇔ M_{s_{t+1}→s_t}(s_t = i) ∝ P(x_{t+1:T} | s_t = i)
  α_t(i) β_t(i) = ∏_{j∈ne(s_t)} M_{j→s_t}(s_t = i) ∝ P(s_t = i | O)

Algorithms like BP extend the power of graphical models beyond just the encoding of independence and factorisation. A single derivation serves for a wide array of models.

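To make the correspondence concrete, here is a small sketch of forward-backward on a toy two-state HMM (all parameters invented): the α and β recursions are exactly the chain messages, and their elementwise product is proportional to the posterior over each state.

```python
# Minimal sketch: forward-backward on a toy 2-state HMM, viewed as BP messages
# along the latent chain.  Transition, emission and initial probabilities are invented.
import numpy as np

trans = np.array([[0.9, 0.1],        # trans[i, j] = P(s_t = j | s_{t-1} = i)
                  [0.3, 0.7]])
emit = np.array([[0.8, 0.2],         # emit[i, k] = P(x_t = k | s_t = i)
                 [0.1, 0.9]])
init = np.array([0.6, 0.4])          # P(s_1)
x = [0, 0, 1, 1, 0]                  # observed symbol sequence
T = len(x)

alpha = np.zeros((T, 2))             # alpha_t(i) proportional to P(x_{1:t}, s_t = i)
beta = np.zeros((T, 2))              # beta_t(i)  proportional to P(x_{t+1:T} | s_t = i)

alpha[0] = init * emit[:, x[0]]
for t in range(1, T):                # forward recursion: message s_{t-1} -> s_t
    alpha[t] = (alpha[t - 1] @ trans) * emit[:, x[t]]

beta[T - 1] = 1.0
for t in range(T - 2, -1, -1):       # backward recursion: message s_{t+1} -> s_t
    beta[t] = trans @ (emit[:, x[t + 1]] * beta[t + 1])

posterior = alpha * beta                              # proportional to P(s_t = i | x_{1:T})
posterior /= posterior.sum(axis=1, keepdims=True)
print(np.round(posterior, 3))
```
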
BP in non-trees?

[Figure: a loopy graph over A, B, C, D, E.]

Can we find P(D) easily?
◮ Neighbours do not belong to disjoint subtrees, so the influence of other nodes cannot be separated into messages.
◮ Observed nodes may break loops and make subtrees independent, but may not resolve all loops.

Possible strategies:
