10-418 / 10-618 Machine Learning for Structured Data Machine Learning Department School of Computer Science Carnegie Mellon University Directed Graphical Models + Undirected Graphical Models Matt Gormley Lecture 7 Sep. 18, 2019 1
Q&A Q: How will I earn the 5% participation points? A: Very gradually. There will be a few aspects of the course (polls, surveys, meetings with the course staff) that we will attach participation points to. That said, we might not actually use the whole 5% that is being held out. 2
Q&A Q: When should I prefer a directed graphical model to an undirected graphical model? A: As we’ll see today, the primary differences between them are: 1. the conditional independence assumptions they define 2. the normalization assumptions they make (Bayes Nets are locally normalized) (That said, we’ll also tie them together via a single framework: factor graphs.) There are also some practical differences (e.g. ease of learning) that result from the locally vs. globally normalized difference. 3
Reminders • Homework 1: DAgger for seq2seq – Out: Thu, Sep. 12 – Due: Thu, Sep. 26 at 11:59pm 4
SUPERVISED LEARNING FOR BAYES NETS 5
Recipe for Closed-form MLE
1. Assume data was generated i.i.d. from some model (i.e. write the generative story): x^(i) ~ p(x | θ)
2. Write the log-likelihood: ℓ(θ) = log p(x^(1) | θ) + … + log p(x^(N) | θ)
3. Compute partial derivatives (i.e. the gradient): ∂ℓ(θ)/∂θ_1 = …, ∂ℓ(θ)/∂θ_2 = …, …, ∂ℓ(θ)/∂θ_M = …
4. Set the derivatives to zero and solve for θ: ∂ℓ(θ)/∂θ_m = 0 for all m ∈ {1, …, M}; θ_MLE = solution to the system of M equations in M variables
5. Compute the second derivative and check that ℓ(θ) is concave down at θ_MLE
6
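To make the recipe concrete, here is a minimal sketch (not from the slides) that walks steps 1-5 for a Bernoulli model, where the closed-form answer is simply the empirical mean; the data values are made up.

```python
import numpy as np

# Step 1: assume x^(i) ~ Bernoulli(theta), i.i.d. (illustrative data)
x = np.array([1, 0, 1, 1, 0, 1, 1, 1])
N, n1 = len(x), x.sum()

# Step 2: log-likelihood  l(theta) = n1*log(theta) + (N - n1)*log(1 - theta)
def log_likelihood(theta):
    return n1 * np.log(theta) + (N - n1) * np.log(1 - theta)

# Steps 3-4: dl/dtheta = n1/theta - (N - n1)/(1 - theta) = 0  =>  theta = n1/N
theta_mle = n1 / N

# Step 5: the second derivative, -n1/theta^2 - (N - n1)/(1 - theta)^2, is negative,
# so l(theta) is concave and theta_mle is a maximum.
print(theta_mle, log_likelihood(theta_mle))
```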
Machine Learning [diagram with components: Domain Knowledge, Mathematical Modeling, ML, Inference, Learning, Optimization, Combinatorial Optimization] The data inspires the structures we want to predict. Our model defines a score for each structure; it also tells us what to optimize. Inference finds { best structure, marginals, partition function } for a new observation. Learning tunes the parameters of the model. (Inference is usually called as a subroutine in learning.) 7
Machine Learning [diagram: Data (the sentence "time flies like an arrow"), Model (a directed graph over X1-X5), Objective, Inference, Learning] (Inference is usually called as a subroutine in learning.) 8
Learning Fully Observed BNs [graph over X1-X5] p(X1, X2, X3, X4, X5) = p(X5 | X3) p(X4 | X2, X3) p(X3) p(X2 | X1) p(X1) 9
Learning Fully Observed BNs [graph over X1-X5] p(X1, X2, X3, X4, X5) = p(X5 | X3) p(X4 | X2, X3) p(X3) p(X2 | X1) p(X1) How do we learn these conditional and marginal distributions for a Bayes Net? 11
Learning Fully Observed BNs Learning this fully observed Bayesian Network is equivalent to learning five (small / simple) independent networks from the same data: p(X1, X2, X3, X4, X5) = p(X5 | X3) p(X4 | X2, X3) p(X3) p(X2 | X1) p(X1) [diagrams of the five sub-networks] 12
Learning Fully Observed BNs How do we learn these conditional and marginal distributions for a Bayes Net?
θ* = argmax_θ log p(X1, X2, X3, X4, X5)
   = argmax_θ log p(X5 | X3, θ5) + log p(X4 | X2, X3, θ4) + log p(X3 | θ3) + log p(X2 | X1, θ2) + log p(X1 | θ1)
θ1* = argmax_θ1 log p(X1 | θ1)
θ2* = argmax_θ2 log p(X2 | X1, θ2)
θ3* = argmax_θ3 log p(X3 | θ3)
θ4* = argmax_θ4 log p(X4 | X2, X3, θ4)
θ5* = argmax_θ5 log p(X5 | X3, θ5)
13
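Because the log-likelihood decomposes as above, the MLE for a fully observed discrete Bayes net reduces to estimating each CPT separately from counts. Below is a minimal sketch for the five-variable network; the dataset and function names are illustrative assumptions, not course code.

```python
from collections import Counter

# Parent sets matching the factorization on the previous slide.
parents = {"X1": [], "X2": ["X1"], "X3": [], "X4": ["X2", "X3"], "X5": ["X3"]}

# Illustrative fully observed data: each row assigns a value to every variable.
data = [
    {"X1": 0, "X2": 1, "X3": 0, "X4": 1, "X5": 0},
    {"X1": 1, "X2": 1, "X3": 1, "X4": 0, "X5": 1},
    {"X1": 0, "X2": 0, "X3": 1, "X4": 1, "X5": 1},
    {"X1": 0, "X2": 1, "X3": 0, "X4": 0, "X5": 0},
]

def mle_cpt(var, pa, data):
    """MLE of p(var | pa): normalize counts within each parent configuration."""
    joint, parent = Counter(), Counter()
    for row in data:
        key = tuple(row[p] for p in pa)
        joint[(key, row[var])] += 1
        parent[key] += 1
    return {(key, val): cnt / parent[key] for (key, val), cnt in joint.items()}

cpts = {v: mle_cpt(v, pa, data) for v, pa in parents.items()}
print(cpts["X2"])  # e.g. p(X2=1 | X1=0) = 2/3 for this toy dataset
```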
Learning Fully Observed BNs 14
INFERENCE FOR BAYESIAN NETWORKS 16
A Few Problems for Bayes Nets Suppose we already have the parameters of a Bayesian Network…
1. How do we compute the probability of a specific assignment to the variables? P(T=t, H=h, A=a, C=c)
2. How do we draw a sample from the joint distribution? t, h, a, c ∼ P(T, H, A, C)
3. How do we compute marginal probabilities? P(A) = …
4. How do we draw samples from a conditional distribution? t, h, a ∼ P(T, H, A | C=c)
5. How do we compute conditional marginal probabilities? P(H | C=c) = …
17
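Problems 1 and 2 have direct solutions once the CPTs are known: for a full assignment, multiply the relevant CPT entries (the factorization from the earlier slides); for a joint sample, draw each variable after its parents (ancestral sampling). Below is a minimal sketch reusing the X1-X5 network with made-up binary CPTs; it is an illustration, not code from the course.

```python
import random

# Assumed binary CPTs for the X1-X5 network from earlier slides (numbers are made up).
p_x1 = {1: 0.6, 0: 0.4}
p_x3 = {1: 0.3, 0: 0.7}
p_x2_given_x1 = {0: {1: 0.2, 0: 0.8}, 1: {1: 0.9, 0: 0.1}}   # p(X2 | X1)
p_x5_given_x3 = {0: {1: 0.5, 0: 0.5}, 1: {1: 0.7, 0: 0.3}}   # p(X5 | X3)
p_x4_given_x2_x3 = {                                          # p(X4 | X2, X3)
    (0, 0): {1: 0.1, 0: 0.9}, (0, 1): {1: 0.4, 0: 0.6},
    (1, 0): {1: 0.5, 0: 0.5}, (1, 1): {1: 0.8, 0: 0.2},
}

# Problem 1: probability of a full assignment = product of CPT entries (the factorization).
def joint(x1, x2, x3, x4, x5):
    return (p_x1[x1] * p_x2_given_x1[x1][x2] * p_x3[x3]
            * p_x4_given_x2_x3[(x2, x3)][x4] * p_x5_given_x3[x3][x5])

# Problem 2: ancestral sampling -- sample each variable after its parents.
def bernoulli(p1):
    return 1 if random.random() < p1 else 0

def sample():
    x1 = bernoulli(p_x1[1])
    x2 = bernoulli(p_x2_given_x1[x1][1])
    x3 = bernoulli(p_x3[1])
    x4 = bernoulli(p_x4_given_x2_x3[(x2, x3)][1])
    x5 = bernoulli(p_x5_given_x3[x3][1])
    return x1, x2, x3, x4, x5

print(joint(1, 1, 0, 1, 0), sample())
```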
GRAPHICAL MODELS: DETERMINING CONDITIONAL INDEPENDENCIES
What Independencies does a Bayes Net Model? • In order for a Bayesian network to model a probability distribution, the following must be true: each variable is conditionally independent of all its non-descendants in the graph given the value of all its parents. • This follows from P(X1, …, Xn) = ∏_{i=1}^{n} P(Xi | parents(Xi)) = ∏_{i=1}^{n} P(Xi | X1, …, X_{i−1}). • But what else does it imply? Slide from William Cohen
What Independencies does a Bayes Net Model? Three cases of interest… Cascade: X → Y → Z. Common Parent: X ← Y → Z. V-Structure: X → Y ← Z. 20
What Independencies does a Bayes Net Model? Three cases of interest… Cascade (X → Y → Z): X ⊥ Z | Y. Common Parent (X ← Y → Z): X ⊥ Z | Y. In both cases, knowing Y decouples X and Z. V-Structure (X → Y ← Z): X ⊥̸ Z | Y; knowing Y couples X and Z. 21
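The v-structure case ("explaining away") is easy to verify numerically: with any CPTs for X → Y ← Z, P(X, Z) factorizes but P(X, Z | Y) does not. The sketch below uses made-up binary CPTs (an OR-like gate for Y) purely as an illustration.

```python
# Assumed CPTs for the v-structure X -> Y <- Z (numbers are made up).
p_x = {1: 0.5, 0: 0.5}
p_z = {1: 0.5, 0: 0.5}
p_y_given_xz = {
    (0, 0): {1: 0.1, 0: 0.9}, (0, 1): {1: 0.9, 0: 0.1},
    (1, 0): {1: 0.9, 0: 0.1}, (1, 1): {1: 0.9, 0: 0.1},
}

def joint(x, y, z):
    return p_x[x] * p_z[z] * p_y_given_xz[(x, z)][y]

# Marginally, X and Z are independent: P(X=1, Z=1) == P(X=1) * P(Z=1).
p_x1z1 = sum(joint(1, y, 1) for y in (0, 1))
print(p_x1z1, p_x[1] * p_z[1])  # both 0.25

# Conditioned on Y=1, they are coupled: P(X=1, Z=1 | Y=1) != P(X=1 | Y=1) * P(Z=1 | Y=1).
p_y1 = sum(joint(x, 1, z) for x in (0, 1) for z in (0, 1))
p_x1_given_y1 = sum(joint(1, 1, z) for z in (0, 1)) / p_y1
p_z1_given_y1 = sum(joint(x, 1, 1) for x in (0, 1)) / p_y1
p_x1z1_given_y1 = joint(1, 1, 1) / p_y1
print(p_x1z1_given_y1, p_x1_given_y1 * p_z1_given_y1)  # ~0.32 vs ~0.41 => coupled
```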
Whiteboard: Proof of conditional independence for the Common Parent case (X ← Y → Z): X ⊥ Z | Y. (The other two cases can be shown just as easily.) 22
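For reference, a sketch of the whiteboard argument for the common parent case:

```latex
% Common parent: X <- Y -> Z, so the joint factorizes as p(X, Y, Z) = p(Y) p(X | Y) p(Z | Y).
\begin{aligned}
p(X, Z \mid Y) &= \frac{p(X, Y, Z)}{p(Y)}
                = \frac{p(Y)\, p(X \mid Y)\, p(Z \mid Y)}{p(Y)} \\
               &= p(X \mid Y)\, p(Z \mid Y),
\end{aligned}
% which is exactly the definition of X \perp Z \mid Y.
```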
The "Burglar Alarm" example [graph: Burglar → Alarm ← Earthquake, Alarm → Phone Call] • Your house has a twitchy burglar alarm that is also sometimes triggered by earthquakes. • Earth arguably doesn't care whether your house is currently being burgled. • While you are on vacation, one of your neighbors calls and tells you your home's burglar alarm is ringing. Uh oh! Quiz: True or False? Burglar ⊥ Earthquake | PhoneCall Slide from William Cohen
Markov Blanket (Directed) Def: the co-parents of a node are the parents of its children. Def: the Markov Blanket of a node in a directed graphical model is the set containing the node's parents, children, and co-parents. [example graph over X1-X13] 25
Markov Blanket (Directed) Def: the co-parents of a node are the parents of its children. Def: the Markov Blanket of a node in a directed graphical model is the set containing the node's parents, children, and co-parents. Example: the Markov Blanket of X6 is {X3, X4, X5, X8, X9, X10}, i.e. its parents, children, and co-parents in the graph shown. 26
Markov Blanket (Directed) Def: the co-parents of a node are the parents of its children. Def: the Markov Blanket of a node in a directed graphical model is the set containing the node's parents, children, and co-parents. Example: the Markov Blanket of X6 is {X3, X4, X5, X8, X9, X10}. Theorem: a node is conditionally independent of every other node in the graph given its Markov blanket. 27
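Given parent lists, the Markov blanket follows directly from the definition: parents, children, and co-parents (the other parents of the node's children). A minimal sketch with an assumed edge list (illustrative, not a transcription of the figure):

```python
# Assumed parent lists for a small directed graph (illustrative, not the figure's exact edges).
parents = {
    "X1": [], "X2": ["X1"], "X3": ["X1"], "X4": ["X1"],
    "X5": ["X2"], "X6": ["X3", "X4"], "X7": ["X4"],
    "X8": ["X5", "X6"], "X9": ["X6", "X10"], "X10": [],
}

def markov_blanket(node, parents):
    """Parents, children, and co-parents of `node` (the definition on the slide)."""
    children = [v for v, pa in parents.items() if node in pa]
    co_parents = {p for c in children for p in parents[c] if p != node}
    return set(parents[node]) | set(children) | co_parents

print(markov_blanket("X6", parents))
# -> {'X3', 'X4', 'X5', 'X8', 'X9', 'X10'} for this assumed graph,
#    matching the slide's example blanket of X6.
```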
D-Separation If variables X and Z are d-separated given a set of variables E, then X and Z are conditionally independent given the set E. Definition #1: Variables X and Z are d-separated given a set of evidence variables E iff every path from X to Z is "blocked". A path is "blocked" whenever:
1. ∃ Y on the path s.t. Y ∈ E and Y is a "common parent" (X … ← Y → … Z), or
2. ∃ Y on the path s.t. Y ∈ E and Y is in a "cascade" (X … → Y → … Z), or
3. ∃ Y on the path s.t. neither Y nor any descendant of Y is in E and Y is in a "v-structure" (X … → Y ← … Z).
28
D-Separation If variables X and Z are d-separated given a set of variables E, then X and Z are conditionally independent given the set E. Definition #2: Variables X and Z are d-separated given a set of evidence variables E iff there does not exist a path between X and Z in the undirected ancestral moral graph with E removed.
1. Ancestral graph: keep only X, Z, E and their ancestors
2. Moral graph: add an undirected edge between all pairs of each node's parents
3. Undirected graph: convert all directed edges to undirected
4. Givens removed: delete any nodes in E
Example query: A ⫫ B | {D, E}. [Figure: the original graph and its ancestral, moral, undirected, and givens-removed versions] ⇒ A and B are still connected ⇒ not d-separated.
29
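Definition #2 translates directly into an algorithm: build the ancestral graph, moralize, drop edge directions, delete the evidence nodes, and test connectivity. A minimal sketch in plain Python; the graph in the usage example is a hypothetical one shaped like the slide's, and the code is my own illustration rather than the course's reference implementation.

```python
def d_separated(x, z, evidence, parents):
    """True iff x and z are d-separated given `evidence` (Definition #2).
    `parents` maps each node to a list of its parents."""
    # 1. Ancestral graph: keep only x, z, the evidence, and their ancestors.
    keep, stack = set(), [x, z, *evidence]
    while stack:
        n = stack.pop()
        if n not in keep:
            keep.add(n)
            stack.extend(parents[n])
    # 2 & 3. Moralize (connect co-parents) and drop edge directions.
    adj = {n: set() for n in keep}
    for n in keep:
        pas = [p for p in parents[n] if p in keep]
        for p in pas:
            adj[n].add(p); adj[p].add(n)
        for i, p in enumerate(pas):          # "marry" the parents
            for q in pas[i + 1:]:
                adj[p].add(q); adj[q].add(p)
    # 4. Remove the evidence nodes, then check whether x can still reach z.
    blocked = set(evidence)
    seen, stack = set(), [x]
    while stack:
        n = stack.pop()
        if n == z:
            return False                     # connected => not d-separated
        if n in seen or n in blocked:
            continue
        seen.add(n)
        stack.extend(adj[n])
    return True

# Usage on a hypothetical graph shaped like the slide's example (A and B are C's parents, etc.).
parents = {"A": [], "B": [], "C": ["A", "B"], "D": ["C"], "E": ["C"], "F": ["D"]}
print(d_separated("A", "B", {"D", "E"}, parents))  # False: moralization connects A and B
```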
Learning Objectives Bayesian Networks You should be able to…
1. Identify the conditional independence assumptions given by a generative story or a specification of a joint distribution
2. Draw a Bayesian network given a set of conditional independence assumptions
3. Define the joint distribution specified by a Bayesian network
4. Use domain knowledge to construct a (simple) Bayesian network for a real-world modeling problem
5. Depict familiar models as Bayesian networks
6. Use d-separation to prove the existence of conditional independencies in a Bayesian network
7. Employ a Markov blanket to identify conditional independence assumptions of a graphical model
8. Develop a supervised learning algorithm for a Bayesian network
30
TYPES OF GRAPHICAL MODELS 31
Three Types of Graphical Models: Directed Graphical Model, Undirected Graphical Model, Factor Graph. [Diagram: the same set of variables drawn as each of the three types of graph] 32