Introduction to Machine Learning Probabilistic graphical models Yifeng Tao School of Computer Science Carnegie Mellon University Slides adapted from Eric Xing, Matt Gormley Yifeng Tao Carnegie Mellon University 1
Recap of Basic Probability Concepts o Representation: how do we represent the joint probability distribution on multiple binary variables? o 2^8 state configurations in total o Do they all need to be represented? o Do we get any scientific/medical insight? o Learning: where do we get all these probabilities? o Maximum-likelihood estimation? o Inference: if not all variables are observable, how do we compute the conditional distribution of latent variables given evidence? o Computing p(H | A) would require summing over all 2^6 configurations of the unobserved variables [Slide from Eric Xing.] Yifeng Tao Carnegie Mellon University 2
Graphical Model: Structure Simplifies Representation o Dependencies among variables [Slide from Eric Xing.] Yifeng Tao Carnegie Mellon University 3
Probabilistic Graphical Models o If the X_i's are conditionally independent (as described by a PGM ), the joint can be factored into a product of simpler terms, e.g., o Why might we favor a PGM? o Incorporation of domain knowledge and causal (logical) structures o 2+2+4+4+4+8+4+8 = 36, an 8-fold reduction from 2^8 in representation cost! [Slide from Eric Xing.] Yifeng Tao Carnegie Mellon University 4
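The counting behind this reduction can be made concrete with a small sketch. The 8-node DAG below is an assumed structure whose conditional-probability-table sizes match the counts quoted on the slide (2+2+4+4+4+8+4+8 = 36); all variables are binary.

```python
# Assumed 8-node DAG for illustration; CPT entry counts match 2+2+4+4+4+8+4+8.
parents = {
    "X1": [], "X2": [],
    "X3": ["X1"], "X4": ["X2"], "X5": ["X2"],
    "X6": ["X3", "X4"], "X7": ["X6"], "X8": ["X5", "X6"],
}

def table_entries(node):
    # A binary node with k binary parents needs a table with 2**(k+1) entries.
    return 2 ** (len(parents[node]) + 1)

factored = sum(table_entries(x) for x in parents)   # 36
full_joint = 2 ** len(parents)                      # 256
print(factored, full_joint)                         # 36 vs. 256
```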
Two types of GMs o Directed edges encode causal relationships ( Bayesian Network or Directed Graphical Model ): o Undirected edges simply encode correlations between variables ( Markov Random Field or Undirected Graphical Model ): [Slide from Eric Xing.] Yifeng Tao Carnegie Mellon University 5
Bayesian Network o Definition: o It consists of a graph G and the conditional probabilities P o These two parts fully specify the distribution: o Qualitative Specification: G o Quantitative Specification: P [Slide from Eric Xing.] Yifeng Tao Carnegie Mellon University 6
Where does the qualitative specification come from? o Prior knowledge of causal relationships o Learning from data (i.e. structure learning) o We simply prefer a certain architecture (e.g. a layered graph) o … [Slide from Matt Gormley.] Yifeng Tao Carnegie Mellon University 7
Quantitative Specification o Example: Conditional probability tables (CPTs) for discrete random variables [Slide from Eric Xing.] Yifeng Tao Carnegie Mellon University 8
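A minimal sketch of how a CPT might be stored in code; the variable names and probabilities below are made up for illustration, not taken from the slide.

```python
# Illustrative CPT for a binary node B with one binary parent A.
cpt_B_given_A = {
    # P(B=1 | A=a); P(B=0 | A=a) is the complement.
    0: 0.2,
    1: 0.7,
}

def p_B(b, a):
    p1 = cpt_B_given_A[a]
    return p1 if b == 1 else 1.0 - p1

print(p_B(1, 0), p_B(0, 1))  # 0.2, 0.3 (each row of the table sums to 1)
```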
Quantitative Specification o Example: Conditional probability density functions (CPDs) for continuous random variables [Slide from Eric Xing.] Yifeng Tao Carnegie Mellon University 9
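For continuous variables, one common CPD choice is a linear-Gaussian form; the coefficients below are assumed values used only to make the sketch runnable.

```python
import math, random

# Illustrative linear-Gaussian CPD: p(x | y) = N(x; a*y + b, sigma^2).
a, b, sigma = 1.5, -0.3, 0.5

def cpd_density(x, y):
    mu = a * y + b
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def cpd_sample(y):
    return random.gauss(a * y + b, sigma)

print(cpd_density(1.0, 0.8), cpd_sample(0.8))
```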
Observed Variables o In a graphical model, shaded nodes are “ observed ”, i.e. their values are given [Slide from Matt Gormley.] Yifeng Tao Carnegie Mellon University 10
GMs are your old friends o Density estimation o Parametric and nonparametric methods o Regression o Linear, conditional mixture, nonparametric o Classification o Generative and discriminative approach o Clustering [Slide from Eric Xing.] Yifeng Tao Carnegie Mellon University 11
What Independencies does a Bayes Net Model? o Independence of X and Z given Y? P(X, Z | Y) = P(X | Y) P(Z | Y) o Three cases of interest... o Proof? [Slide from Matt Gormley.] Yifeng Tao Carnegie Mellon University 12
The “Burglar Alarm” example o Your house has a twitchy burglar alarm that is also sometimes triggered by earthquakes. o Earth arguably doesn’t care whether your house is currently being burgled. o While you are on vacation, one of your neighbors calls and tells you your home’s burglar alarm is ringing. [Slide from Matt Gormley.] Yifeng Tao Carnegie Mellon University 13
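A sketch of inference by enumeration in this kind of network: Burglary (B) and Earthquake (E) are parents of Alarm (A), and the neighbor's Call (C) depends on A. The probabilities are assumed textbook-style numbers, used only to make the example runnable.

```python
from itertools import product

# Assumed CPTs for the burglar-alarm sketch.
pB, pE = 0.001, 0.002
pA = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}  # P(A=1 | B, E)
pC = {1: 0.90, 0: 0.05}                                          # P(C=1 | A)

def bern(p, v):            # P(V=v) for a Bernoulli with success probability p
    return p if v == 1 else 1.0 - p

def joint(b, e, a, c):
    return bern(pB, b) * bern(pE, e) * bern(pA[(b, e)], a) * bern(pC[a], c)

# P(B=1 | C=1): sum the joint over the unobserved variables E and A.
num = sum(joint(1, e, a, 1) for e, a in product((0, 1), repeat=2))
den = sum(joint(b, e, a, 1) for b, e, a in product((0, 1), repeat=3))
print(num / den)
```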
Markov Blanket o Def: the co-parents of a node are the parents of its children o Def: the Markov Blanket of a node is the set containing the node's parents, children, and co-parents. o Thm: a node is conditionally independent of every other node in the graph given its Markov blanket o Example: The Markov Blanket of X6 is {X3, X4, X5, X8, X9, X10} [Slide from Matt Gormley.] Yifeng Tao Carnegie Mellon University 14
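The definition translates directly into code. The small DAG below is made up for illustration (it is not the 10-node graph from the slide); the function collects parents, children, and co-parents from a parent map.

```python
# Assumed toy DAG: A and B are roots, C depends on A, D on A and B, E on C and D.
parents = {
    "A": [], "B": [],
    "C": ["A"], "D": ["A", "B"],
    "E": ["C", "D"],
}

def markov_blanket(node):
    pa = set(parents[node])
    children = {x for x, ps in parents.items() if node in ps}
    co_parents = {p for c in children for p in parents[c]} - {node}
    return pa | children | co_parents

print(markov_blanket("C"))  # {'A', 'E', 'D'}: parent A, child E, co-parent D
```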
D-Separation o Thm: If variables X and Z are d-separated given a set of variables E, then X and Z are conditionally independent given the set E. o Definition: o Variables X and Z are d-separated given a set of evidence variables E iff every path from X to Z is “blocked”. [Slide from Matt Gormley.] Yifeng Tao Carnegie Mellon University 16
D-Separation o Variables X and Z are d-separated given a set of evidence variables E iff every path from X to Z is “blocked”. [Slide from Eric Xing.] Yifeng Tao Carnegie Mellon University 17
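A sketch of the per-triple blocking rules usually used to check whether a path is blocked; the example triples and evidence sets below are assumed for illustration.

```python
# A path X - ... - Z is blocked iff some consecutive triple on it is blocked:
#   chain    a -> m -> b : blocked iff m is observed
#   fork     a <- m -> b : blocked iff m is observed
#   collider a -> m <- b : blocked iff neither m nor any descendant of m is observed
def triple_blocked(kind, middle, descendants_of_middle, evidence):
    if kind in ("chain", "fork"):
        return middle in evidence
    if kind == "collider":
        return middle not in evidence and not (descendants_of_middle & evidence)
    raise ValueError(kind)

# In X -> Y -> Z, observing Y blocks the path, so X is independent of Z given Y.
print(triple_blocked("chain", "Y", set(), {"Y"}))      # True
# In X -> Y <- Z (a collider), observing Y *unblocks* the path.
print(triple_blocked("collider", "Y", set(), {"Y"}))   # False
```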
Machine Learning [Slide from Matt Gormley.] Yifeng Tao Carnegie Mellon University 18
Recipe for Closed-form MLE [Slide from Matt Gormley.] Yifeng Tao Carnegie Mellon University 19
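A sketch of the usual closed-form MLE recipe (write the log-likelihood, differentiate, set the derivative to zero, solve) on a Bernoulli example with made-up data.

```python
import math

# For x_1..x_N in {0,1}: l(phi) = k*log(phi) + (N-k)*log(1-phi), with k = sum x_i.
# Setting dl/dphi = 0 gives the closed form phi_hat = k / N.
data = [1, 0, 1, 1, 0, 1, 1, 0]
k, N = sum(data), len(data)
phi_hat = k / N

def loglik(phi):
    return k * math.log(phi) + (N - k) * math.log(1 - phi)

# Quick numeric check that the closed form beats nearby values.
assert all(loglik(phi_hat) >= loglik(p) for p in (0.1, 0.3, 0.5, 0.7, 0.9))
print(phi_hat)  # 0.625
```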
Learning Fully Observed BNs o How do we learn these conditional and marginal distributions for a Bayes Net? [Slide from Matt Gormley.] Yifeng Tao Carnegie Mellon University 20
Learning Fully Observed BNs o Learning this fully observed Bayesian Network is equivalent to learning five (small / simple) independent networks from the same data [Slide from Matt Gormley.] Yifeng Tao Carnegie Mellon University 21
Learning Fully Observed BNs [Slide from Matt Gormley.] Yifeng Tao Carnegie Mellon University 22
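Because the fully observed likelihood factorizes across CPTs, the MLE for each table is just a ratio of counts. This sketch estimates P(A) and P(B | A) from made-up binary data for a two-node network A -> B.

```python
from collections import Counter

data = [(0, 0), (0, 1), (1, 1), (1, 1), (0, 0), (1, 0), (1, 1), (0, 1)]  # (a, b) pairs

n_a = Counter(a for a, _ in data)
n_ab = Counter(data)

p_A = {a: n_a[a] / len(data) for a in (0, 1)}
p_B_given_A = {(b, a): n_ab[(a, b)] / n_a[a] for a in (0, 1) for b in (0, 1)}

print(p_A)           # {0: 0.5, 1: 0.5}
print(p_B_given_A)   # e.g. P(B=1 | A=1) = 3/4
```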
Learning Partially Observed BNs o Partially Observed Bayesian Network: o Maximum likelihood estimation → incomplete log-likelihood o The log-likelihood contains unobserved latent variables o Solve with the EM algorithm o Example: Gaussian Mixture Models (GMMs) [Slide from Eric Xing.] Yifeng Tao Carnegie Mellon University 23
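A minimal EM sketch for a one-dimensional, two-component Gaussian mixture; the data and initial parameters are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 300)])

pi, mu, sigma = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

for _ in range(50):
    # E-step: responsibilities r[n, k] = P(z_n = k | x_n, current parameters)
    dens = np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    r = pi * dens
    r /= r.sum(axis=1, keepdims=True)
    # M-step: re-estimate mixing weights, means, and standard deviations
    nk = r.sum(axis=0)
    pi = nk / len(x)
    mu = (r * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)

print(pi, mu, sigma)  # should roughly recover the two generating components
```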
Inference of BNs o Suppose we already have the parameters of a Bayesian Network... [Slide from Matt Gormley.] Yifeng Tao Carnegie Mellon University 24
Approaches to inference o Exact inference algorithms o The elimination algorithm → message passing o Belief propagation o The junction tree algorithms o Approximate inference techniques o Variational algorithms o Stochastic simulation / sampling methods o Markov chain Monte Carlo methods [Slide from Eric Xing.] Yifeng Tao Carnegie Mellon University 25
Marginalization and Elimination [Slide from Eric Xing.] Yifeng Tao Carnegie Mellon University 26
Marginalization and Elimination [Slide from Eric Xing.] Yifeng Tao Carnegie Mellon University 27
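A sketch of variable elimination on an assumed binary chain A -> B -> C -> D: summing out one variable at a time keeps every intermediate factor small instead of building the full 2^4 joint. All tables below are made up.

```python
import numpy as np

pA = np.array([0.6, 0.4])                                  # P(A)
pB_A = np.array([[0.9, 0.1], [0.2, 0.8]])                  # P(B | A), rows index A
pC_B = np.array([[0.7, 0.3], [0.3, 0.7]])                  # P(C | B)
pD_C = np.array([[0.5, 0.5], [0.1, 0.9]])                  # P(D | C)

mB = pA @ pB_A          # sum_a P(a) P(b | a)   -> message over B
mC = mB @ pC_B          # sum_b m(b) P(c | b)   -> message over C
pD = mC @ pD_C          # sum_c m(c) P(d | c)   -> P(D)
print(pD, pD.sum())     # marginal over D, sums to 1

# The same answer from the brute-force joint, for comparison:
joint = pA[:, None, None, None] * pB_A[:, :, None, None] \
        * pC_B[None, :, :, None] * pD_C[None, None, :, :]
print(joint.sum(axis=(0, 1, 2)))
```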
Marginalization and Elimination o Step 8: Wrap-up [Slide from Eric Xing.] Yifeng Tao Carnegie Mellon University 29
Elimination algorithm o Elimination on trees is equivalent to message passing on branches o Message-passing is consistent in trees o Application: HMM [Slide from Eric Xing.] Yifeng Tao Carnegie Mellon University 30
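For the HMM application, the messages are the familiar forward quantities alpha_t(z) = P(x_1..x_t, z_t = z). The transition and emission tables below are assumed for illustration.

```python
import numpy as np

pi = np.array([0.6, 0.4])                     # initial state distribution
A = np.array([[0.7, 0.3], [0.4, 0.6]])        # A[i, j] = P(z_t = j | z_{t-1} = i)
B = np.array([[0.9, 0.1], [0.2, 0.8]])        # B[i, k] = P(x_t = k | z_t = i)
obs = [0, 1, 1, 0]                            # an observed symbol sequence

alpha = pi * B[:, obs[0]]
for x in obs[1:]:
    alpha = (alpha @ A) * B[:, x]             # pass the message forward, absorb evidence

print(alpha.sum())                            # P(x_1, ..., x_T)
```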
Gibbs Sampling [Slide from Matt Gormley.] Yifeng Tao Carnegie Mellon University 31
Gibbs Sampling o Full conditionals only need to condition on the Markov Blanket o Must be “easy” to sample from conditionals o Many conditionals are log-concave and are amenable to adaptive rejection sampling [Slide from Matt Gormley.] Yifeng Tao Carnegie Mellon University 34
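A Gibbs-sampling sketch for a tiny v-structure X -> Z <- Y with Z observed: each full conditional only needs the variable's Markov blanket (here, the blanket of X is {Y, Z}). All probabilities are made up for illustration.

```python
import random

pX, pY = 0.3, 0.5
pZ = {(0, 0): 0.05, (0, 1): 0.5, (1, 0): 0.6, (1, 1): 0.95}   # P(Z=1 | X, Y)

def bern(p, v):
    return p if v == 1 else 1.0 - p

def cond_X(y, z):   # P(X=1 | Y=y, Z=z), proportional to P(X) P(Z=z | X, Y=y)
    w1 = pX * bern(pZ[(1, y)], z)
    w0 = (1 - pX) * bern(pZ[(0, y)], z)
    return w1 / (w1 + w0)

def cond_Y(x, z):   # P(Y=1 | X=x, Z=z)
    w1 = pY * bern(pZ[(x, 1)], z)
    w0 = (1 - pY) * bern(pZ[(x, 0)], z)
    return w1 / (w1 + w0)

x, y, z = 0, 0, 1   # condition on the evidence Z = 1
samples = []
for t in range(5000):
    x = int(random.random() < cond_X(y, z))
    y = int(random.random() < cond_Y(x, z))
    samples.append((x, y))

burn = samples[500:]   # drop burn-in, then estimate P(X=1 | Z=1)
print(sum(s[0] for s in burn) / len(burn))
```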
Take home message o Graphical models portray the sparse dependencies among variables o Two types of graphical models: Bayesian network and Markov random field o Conditional independence, Markov blanket, and d-separation o Learning fully observed and partially observed Bayesian networks o Exact inference and approximate inference of Bayesian networks Yifeng Tao Carnegie Mellon University 35
References o Eric Xing, Ziv Bar-Joseph. 10701 Introduction to Machine Learning: http://www.cs.cmu.edu/~epxing/Class/10701/ o Matt Gormley. 10601 Introduction to Machine Learning: http://www.cs.cmu.edu/~mgormley/courses/10601/index.html Yifeng Tao Carnegie Mellon University 36