Graphical Models
Léon Bottou
COS 424 – 4/15/2010
Introduction

People like drawings better than equations:
– A graphical model is a diagram representing certain aspects of the algebraic structure of a probabilistic model.

Purposes
– Visualize the structure of a model.
– Investigate conditional independence properties.
– Some computations are more easily expressed on a graph than written as equations with complicated subscripts.
Summary

I. Directed graphical models
II. Undirected graphical models
III. Inference in graphical models

More
– David Blei runs a complete course on graphical models.
I. Directed graphical models

"Bayesian Networks" (Pearl 1988)
A pattern for independence assumptions

Probability distribution P(x1, x2, x3, x4).

Bayesian chain theorem:
P(x1, x2, x3, x4) = P(x1) P(x2|x1) P(x3|x1,x2) P(x4|x1,x2,x3)

Independence assumptions (here x3 ⊥⊥ x2 | x1 and x4 ⊥⊥ x3 | x1, x2):
P(x1, x2, x3, x4) = P(x1) P(x2|x1) P(x3|x1,x2) P(x4|x1,x2,x3)
                  = P(x1) P(x2|x1) P(x3|x1) P(x4|x1,x2)
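To make the factorization concrete, here is a minimal Python sketch that assembles the reduced joint from its conditional tables; all the numeric probabilities are made up for illustration.

```python
# Sketch: assemble the factored joint P(x1) P(x2|x1) P(x3|x1) P(x4|x1,x2)
# for binary variables, using made-up conditional probability tables.
import itertools

p1 = {0: 0.6, 1: 0.4}                          # P(x1)
p2 = {(0, 0): 0.7, (0, 1): 0.3,                # P(x2|x1): key = (x1, x2)
      (1, 0): 0.2, (1, 1): 0.8}
p3 = {(0, 0): 0.5, (0, 1): 0.5,                # P(x3|x1): key = (x1, x3)
      (1, 0): 0.9, (1, 1): 0.1}
p4 = {}                                        # P(x4|x1,x2): key = (x1, x2, x4)
for a, b in itertools.product([0, 1], repeat=2):
    p4[(a, b, 0)] = 0.3 + 0.1 * (a + b)
    p4[(a, b, 1)] = 1.0 - p4[(a, b, 0)]

def joint(x1, x2, x3, x4):
    return p1[x1] * p2[(x1, x2)] * p3[(x1, x3)] * p4[(x1, x2, x4)]

# The reduced factorization needs 1 + 2 + 2 + 4 = 9 numbers instead of the
# 15 free parameters of an unrestricted joint, yet still sums to one:
total = sum(joint(*x) for x in itertools.product([0, 1], repeat=4))
print(total)  # 1.0 up to rounding
```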
Graphical representation

Bayesian chain theorem:
P(x1, x2, x3, x4) = P(x1) P(x2|x1) P(x3|x1,x2) P(x4|x1,x2,x3)

[Figure: fully connected directed acyclic graph on nodes x1, x2, x3, x4, with an arrow into each node from every lower-numbered node.]

Arrows do not represent causality!
Graphical representation

Independence assumptions:
P(x1, x2, x3, x4) = P(x1) P(x2|x1) P(x3|x1,x2) P(x4|x1,x2,x3)
                  = P(x1) P(x2|x1) P(x3|x1) P(x4|x1,x2)

[Figure: the same directed acyclic graph with the edges x2→x3 and x3→x4 removed.]

Missing links represent independence assumptions.
A more complicated example

P(x1) P(x2) P(x3) P(x4|x1,x2) P(x5|x1,x2,x3) P(x6|x4) P(x7|x4,x5)

[Figure: the corresponding directed acyclic graph on nodes x1 ... x7.]

Parametrization
The graph says nothing about the parametric form of the probabilities:
– Discrete distributions
– Continuous distributions
Discrete distributions

Input x = (x1, x2, ..., xd) ∈ {0,1}^d. Class y ∈ {A1, ..., Ak}.

General generative model:
P(x, y) = P(y) P(x|y)
– k parameters for P(y)
– k 2^d parameters for P(x|y)

Naïve Bayes model:
P(x, y) = P(y) P(x1|y) ... P(xd|y)
– k parameters for P(y)
– k d parameters for P(x|y)
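As a sketch of how the naïve Bayes factorization is used in practice, the snippet below fits the per-class tables by counting and classifies with argmax_y P(x, y); the tiny dataset and the Laplace smoothing constant are illustrative choices, not part of the slide.

```python
# Sketch: counting-based maximum likelihood fit of the naïve Bayes model
# P(x, y) = P(y) Π_i P(x_i | y) on binary features, with made-up toy data.
import numpy as np

X = np.array([[1, 0, 1], [1, 1, 1], [0, 0, 1], [0, 1, 0]])  # n x d binary
y = np.array([0, 0, 1, 1])                                  # labels in {0, ..., k-1}
k, d = 2, X.shape[1]

prior = np.bincount(y, minlength=k) / len(y)                # k numbers for P(y)
cond = np.zeros((k, d))                                     # k*d numbers for P(x_i=1 | y)
for c in range(k):
    cond[c] = (X[y == c].sum(axis=0) + 1) / ((y == c).sum() + 2)  # Laplace smoothing

def log_joint(x, c):
    # log P(x, y=c) = log P(c) + Σ_i log P(x_i | c)
    return np.log(prior[c]) + np.sum(x * np.log(cond[c]) + (1 - x) * np.log(1 - cond[c]))

x_new = np.array([1, 0, 0])
print(max(range(k), key=lambda c: log_joint(x_new, c)))     # ŷ(x) = argmax_y P(x, y)
```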
Discrete distributions

Naïve Bayes model:
P(x, y) = P(y) P(x1|y) ... P(xd|y)
ŷ(x) = argmax_y P(x, y)
– k parameters for P(y).
– k d parameters for P(x|y).
Fails when the xi are correlated!

Linear discriminant model:
P(x, y) = P(x) P(y|x)
ŷ(x) = argmax_y P(x, y) = argmax_y P(y|x)
– k(d+1) parameters for P(y|x).
– 2^d unused parameters for P(x).
Works when the xi are correlated!
Continuous distributions

Linear regression
– Input x = (x1, x2, ..., xd) ∈ R^d.
– Output y ∈ R.

P(x, y) = P(y|x) P(x)
P(y|x) ∝ exp( −(y − wᵀx)² / 2σ² )

No need to model P(x).
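The following sketch illustrates the point numerically: maximizing this conditional Gaussian likelihood in w reduces to ordinary least squares, and P(x) never has to be modeled. The synthetic data and noise level are assumptions.

```python
# Sketch: with P(y|x) ∝ exp(-(y - wᵀx)² / 2σ²), maximizing the conditional
# log-likelihood over w is exactly least squares; the toy data is made up.
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 100, 3, 0.5
w_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(n, d))
y = X @ w_true + sigma * rng.normal(size=n)

# argmax_w Σ_i log P(y_i | x_i, w)  =  argmin_w Σ_i (y_i - wᵀx_i)²
w_hat = np.linalg.lstsq(X, y, rcond=None)[0]
print(w_hat)  # close to w_true; P(x) never entered the computation
```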
Bayesian regression

Consider a dataset D = {(x1, y1), ..., (xn, yn)}.

P(D, w) = P(w) P(D|w) = P(w) Π_{i=1..n} P(yi|xi, w) P(xi)

[Figure: plate diagram with node w outside a plate that contains xi → yi, repeated n times.]

Plates represent repeated subgraphs. Although the parameter w is explicit, other details about the distributions are not.
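The slide leaves the distributions unspecified. As one concrete and standard choice (an assumption here, not something the slide states), a Gaussian prior P(w) = N(0, τ²I) combined with the Gaussian likelihood of the previous slide gives a closed-form posterior over w:

```python
# Sketch: closed-form posterior P(w | D) under a N(0, τ²I) prior and the
# Gaussian likelihood above; σ², τ², and the synthetic data are made up.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.5 * rng.normal(size=100)
sigma2, tau2 = 0.25, 10.0

# Posterior covariance and mean: S = (XᵀX/σ² + I/τ²)⁻¹, m = S Xᵀ y / σ²
S = np.linalg.inv(X.T @ X / sigma2 + np.eye(3) / tau2)
m = S @ X.T @ y / sigma2
print(m)   # posterior mean, shrunk toward the prior mean 0
```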
Hidden Markov Models

P(x1...xT, s1...sT) = P(s1) P(x1|s1) P(s2|s1) P(x2|s2) ... P(sT|sT−1) P(xT|sT)

[Figure: chain of hidden states s1 → s2 → ... → sT with an emission arrow st → xt at each step, shown next to a compact recurrent diagram of the same model.]

What is the relation between this graph and that graph?
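Evaluating the factorization for one concrete state sequence is just a product of transition and emission terms; a sketch with hypothetical tables π, A, B:

```python
# Sketch: evaluating the HMM factorization for one concrete state and
# observation sequence; the transition/emission tables are made up.
import numpy as np

pi = np.array([0.6, 0.4])                 # P(s1)
A = np.array([[0.7, 0.3], [0.4, 0.6]])    # A[i, j] = P(s_{t+1}=j | s_t=i)
B = np.array([[0.9, 0.1], [0.2, 0.8]])    # B[i, o] = P(x_t=o | s_t=i)

def joint(states, obs):
    p = pi[states[0]] * B[states[0], obs[0]]
    for t in range(1, len(states)):
        p *= A[states[t - 1], states[t]] * B[states[t], obs[t]]
    return p

print(joint([0, 0, 1], [0, 1, 1]))  # P(x1..x3, s1..s3) for one path
```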
Conditional independence patterns (1)

Tail-to-tail: a ← c → b

P(a, b, c) = P(a|c) P(b|c) P(c)

Marginalizing over c:
P(a, b) = Σ_c P(a|c) P(b|c) P(c) ≠ P(a) P(b) in general
⇒ a ⊥⊥ b | ∅ fails.

Conditioning on c:
P(a, b | c) = P(a, b, c) / P(c) = P(a|c) P(b|c)
⇒ a ⊥⊥ b | c holds.
Conditional independence patterns (2)

Head-to-tail: a → c → b

P(a, b, c) = P(a) P(c|a) P(b|c) = P(a, c) P(b|c)

Marginalizing over c:
P(a, b) = Σ_c P(a) P(c|a) P(b|c) = P(a) Σ_c P(b, c|a) = P(a) P(b|a) ≠ P(a) P(b) in general
⇒ a ⊥⊥ b | ∅ fails.

Conditioning on c:
P(a, b | c) = P(a, b, c) / P(c) = P(a|c) P(b|c)
⇒ a ⊥⊥ b | c holds.
Conditional independence patterns (3)

Head-to-head: a → c ← b

P(a, b, c) = P(a) P(b) P(c|a, b)

Marginalizing over c:
P(a, b) = Σ_c P(a) P(b) P(c|a, b) = P(a) P(b)
⇒ a ⊥⊥ b | ∅ holds.

Conditioning on c:
P(a, b | c) ≠ P(a|c) P(b|c) in general
⇒ a ⊥⊥ b | c fails.

Example ("explaining away"):
c = "the house is shaking"
a = "there is an earthquake"
b = "a truck hits the house"
Observing c makes a and b dependent: once we know a truck hit the house, the shaking no longer suggests an earthquake.
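The explaining-away effect can be checked numerically. In the sketch below all probabilities are invented, with the shaking probability high whenever either cause is present:

```python
# Sketch: numeric check of "explaining away" in the head-to-head graph
# a → c ← b, with made-up probabilities for the earthquake/truck example.
import itertools

Pa = {0: 0.99, 1: 0.01}                  # P(earthquake)
Pb = {0: 0.99, 1: 0.01}                  # P(truck)
def Pc(c, a, b):                         # P(shaking | a, b): toy noisy-OR-ish table
    p = 0.001 + 0.9 * max(a, b)
    return p if c == 1 else 1 - p

def joint(a, b, c):
    return Pa[a] * Pb[b] * Pc(c, a, b)

# Marginally, a and b are independent:
Pab = sum(joint(1, 1, c) for c in (0, 1))
print(Pab, Pa[1] * Pb[1])                # equal

# Given c=1 (the house shakes), they become dependent:
Z = sum(joint(a, b, 1) for a, b in itertools.product((0, 1), repeat=2))
P_a_given_c = sum(joint(1, b, 1) for b in (0, 1)) / Z
P_a_given_cb = joint(1, 1, 1) / sum(joint(a, 1, 1) for a in (0, 1))
print(P_a_given_c, P_a_given_cb)         # learning b=1 lowers belief in a=1
```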
D-separation

Problem
– Consider three disjoint sets of nodes: A, B, C.
– When do we have A ⊥⊥ B | C?

Definition
A and B are d-separated by C if all paths from a ∈ A to b ∈ B
– contain a head-to-tail or tail-to-tail node c ∈ C, or
– contain a head-to-head node c such that neither c nor any of its descendants belongs to C.

Theorem
A and B are d-separated by C ⇐⇒ A ⊥⊥ B | C
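For small graphs the definition can be applied directly by enumerating undirected paths. The sketch below does exactly that on a hypothetical four-node DAG; this brute force is exponential and is only for illustration (practical implementations use a reachability algorithm such as Bayes ball):

```python
# Sketch: brute-force d-separation test on a tiny DAG by enumerating all
# simple undirected paths and applying the definition above.
import itertools

edges = {("x1", "x3"), ("x2", "x3"), ("x3", "x4")}   # hypothetical DAG: x1→x3←x2, x3→x4

def children(v):
    return {w for (u, w) in edges if u == v}

def descendants(v):
    out, stack = set(), [v]
    while stack:
        for w in children(stack.pop()):
            if w not in out:
                out.add(w); stack.append(w)
    return out

def paths(a, b, nodes):
    def walk(path):
        if path[-1] == b:
            yield path; return
        for v in nodes:
            if v not in path and ((path[-1], v) in edges or (v, path[-1]) in edges):
                yield from walk(path + [v])
    yield from walk([a])

def blocked(path, C):
    for i in range(1, len(path) - 1):
        prev, c, nxt = path[i - 1], path[i], path[i + 1]
        if (prev, c) in edges and (nxt, c) in edges:         # head-to-head node
            if c not in C and not (descendants(c) & C):
                return True
        elif c in C:                                         # head-to-tail or tail-to-tail
            return True
    return False

def d_separated(a, b, C, nodes):
    return all(blocked(p, C) for p in paths(a, b, nodes))

nodes = {"x1", "x2", "x3", "x4"}
print(d_separated("x1", "x2", set(), nodes))     # True: collider x3 unobserved
print(d_separated("x1", "x2", {"x4"}, nodes))    # False: a descendant of x3 is observed
```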
II. Undirected graphical models

"Markov Random Fields"
Another independence assumption pattern

Boltzmann distribution
P(x) = (1/Z) exp(−E(x))   with   Z = Σ_x exp(−E(x))
– The function E(x) is called the energy function.
– The quantity Z is called the partition function.

Markov Random Field
– Let {xC} be a family of subsets of the variables x.
– The distribution P(x) is a Markov Random Field with cliques {xC} if there are functions EC(xC) such that E(x) = Σ_C EC(xC).

Equivalently, P(x) = (1/Z) Π_C ΨC(xC) with ΨC(xC) = exp(−EC(xC)) > 0.
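For a handful of binary variables the partition function can be computed by brute-force enumeration; a sketch with toy clique potentials (the energies are arbitrary inventions):

```python
# Sketch: brute-force partition function for a tiny binary MRF with made-up
# clique potentials Ψ_C = exp(-E_C); only feasible for very few variables.
import itertools
import math

def psi1(x1, x2):      return math.exp(-0.5 * (x1 != x2))    # E_1 penalizes disagreement
def psi2(x2, x3):      return math.exp(-0.5 * (x2 != x3))
def psi3(x3, x4, x5):  return math.exp(-(x3 + x4 + x5 == 1)) # arbitrary toy energy

def unnormalized(x):
    x1, x2, x3, x4, x5 = x
    return psi1(x1, x2) * psi2(x2, x3) * psi3(x3, x4, x5)

Z = sum(unnormalized(x) for x in itertools.product((0, 1), repeat=5))
def P(x): return unnormalized(x) / Z
print(Z, P((1, 1, 1, 0, 0)))
```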
Graphical representation

P(x1, x2, x3, x4, x5) = (1/Z) Ψ1(x1, x2) Ψ2(x2, x3) Ψ3(x3, x4, x5)

[Figure: undirected graph with edges x1–x2, x2–x3, and the triangle x3–x4–x5.]

– Completely connect the nodes belonging to each xC.
– Each subset xC forms a clique of the graph.
Markov Blanket

Definition
– The Markov blanket of a variable xi is the minimal subset Bi of the remaining variables x \ {xi} such that P(xi | x \ {xi}) = P(xi | Bi).

Example
P(x3 | x1, x2, x4, x5)
  = Ψ1(x1,x2) Ψ2(x2,x3) Ψ3(x3,x4,x5) / Σ_{x3'} Ψ1(x1,x2) Ψ2(x2,x3') Ψ3(x3',x4,x5)
  = Ψ2(x2,x3) Ψ3(x3,x4,x5) / Σ_{x3'} Ψ2(x2,x3') Ψ3(x3',x4,x5)
  = P(x3 | x2, x4, x5)
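The cancellation of Ψ1 above can be checked numerically: in the sketch below (which redefines the toy potentials of the earlier MRF sketch so it runs on its own), changing x1 leaves the conditional of x3 untouched.

```python
# Sketch: numeric check that x3's conditional depends only on its Markov
# blanket {x2, x4, x5}; the potentials are the same made-up toys as before.
import math

def psi1(x1, x2):      return math.exp(-0.5 * (x1 != x2))
def psi2(x2, x3):      return math.exp(-0.5 * (x2 != x3))
def psi3(x3, x4, x5):  return math.exp(-(x3 + x4 + x5 == 1))

def cond_x3(x1, x2, x4, x5):
    num = {t: psi1(x1, x2) * psi2(x2, t) * psi3(t, x4, x5) for t in (0, 1)}
    return num[1] / sum(num.values())    # P(x3=1 | x1, x2, x4, x5)

# Varying x1 (outside the blanket) leaves the conditional unchanged:
print(cond_x3(0, 1, 0, 1), cond_x3(1, 1, 0, 1))
```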
Graph and Markov blanket

The Markov blanket of a MRF variable is the set of its neighbors.
P(x3 | x1, x2, x4, x5) = P(x3 | x2, x4, x5)

[Figure: the same undirected graph, with the neighbors of x3 highlighted.]

Consequence
– Consider three disjoint sets of nodes: A, B, C.
A ⊥⊥ B | C ⇐⇒ any path between a ∈ A and b ∈ B passes through a node c ∈ C.

Conversely (Hammersley–Clifford theorem)
– Any strictly positive distribution that satisfies such properties with respect to an undirected graph is a Markov Random Field.
Directed vs. undirected graphs

Consider a directed graph:
P(x) = P(x1) P(x2) P(x3|x1,x2) P(x4|x2)
     = Ψ1(x1) Ψ2(x2) Ψ3(x1,x2,x3) Ψ4(x2,x4)   (with Z = 1)

[Figure: the DAG and the corresponding undirected graph; the undirected version adds an edge x1–x2.]

The opposite inclusion is not true because the undirected graph marries the parents of x3 with a moralization link.
Directed and undirected graphs represent different sets of distributions. Neither set is included in the other one.
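Moralization itself is mechanical; a short sketch on the slide's four-node example (the dictionary encoding of the DAG is an implementation choice):

```python
# Sketch: moralization — build the undirected graph for a DAG by marrying
# co-parents and dropping edge directions; the DAG is the slide's example.
import itertools

dag = {"x1": [], "x2": [], "x3": ["x1", "x2"], "x4": ["x2"]}  # node -> parents

def moralize(dag):
    undirected = set()
    for child, pars in dag.items():
        for p in pars:
            undirected.add(frozenset((p, child)))      # keep each edge, undirected
        for p, q in itertools.combinations(pars, 2):
            undirected.add(frozenset((p, q)))          # marry co-parents
    return undirected

for e in sorted(moralize(dag), key=sorted):
    print(sorted(e))   # includes the moralization link ['x1', 'x2']
```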
Example: image denoising

Noise model: randomly flipping a small proportion of the pixels.
Image model: pixel distribution given its four neighbors.

[Figure: example binary images and the grid MRF connecting each latent pixel to its four neighbors and to its observed noisy pixel.]

Inference problem
– Given the observed noisy pixels, reconstruct the true pixel distributions.
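One simple inference scheme for this model (not detailed on the slide) is iterated conditional modes: repeatedly set each latent pixel to its locally most probable value under an Ising-style energy. A sketch with made-up coupling and noise weights:

```python
# Sketch: iterated conditional modes (ICM) for the binary denoising MRF,
# with energy E = -β Σ_neighbors x_i x_j - η Σ_i x_i y_i on ±1 pixels.
import numpy as np

rng = np.random.default_rng(2)
clean = np.ones((32, 32), dtype=int)
clean[8:24, 8:24] = -1                              # a square on a background
noisy = clean * rng.choice([1, -1], clean.shape, p=[0.9, 0.1])

beta, eta = 1.0, 2.0                                # made-up neighbor/data weights
x = noisy.copy()                                    # initialize latent pixels
for _ in range(10):
    for i in range(32):
        for j in range(32):
            # pick the value of x[i,j] that minimizes the local energy
            nb = sum(x[a, b] for a, b in ((i-1, j), (i+1, j), (i, j-1), (i, j+1))
                     if 0 <= a < 32 and 0 <= b < 32)
            x[i, j] = 1 if beta * nb + eta * noisy[i, j] > 0 else -1

print((x != clean).mean())  # fraction of pixels still wrong after cleanup
```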