15-780 Graduate Artificial Intelligence: Probabilistic inference
J. Zico Kolter (this lecture) and Nihar Shah
Carnegie Mellon University, Spring 2020
Outline
Probabilistic graphical models
Probabilistic inference
Exact inference
Sample-based inference
A brief look at deep generative models
Probabilistic graphical models
Probabilistic graphical models are all about representing distributions p(X), where X represents some large set of random variables.
Example: suppose X ∈ {0,1}^n (an n-dimensional binary random variable); it would take 2^n − 1 parameters to describe the full joint distribution.
Graphical models offer a way to represent these same distributions more compactly, by exploiting conditional independencies in the distribution.
Note: I'm going to use "probabilistic graphical model" and "Bayesian network" interchangeably, even though there are differences.
Bayesian networks
A Bayesian network is defined by:
1. A directed acyclic graph G = (V = {X_1, ..., X_n}, E)
2. A set of conditional distributions p(X_i | Parents(X_i))
It defines the joint probability distribution
p(X) = ∏_{i=1}^n p(X_i | Parents(X_i))
Equivalently: each node is conditionally independent of all non-descendants given its parents.
Example Bayesian network
(Figure: chain X_1 → X_2 → X_3 → X_4.)
Conditional independencies let us simplify the joint distribution:
p(X_1, X_2, X_3, X_4) = p(X_1) p(X_2 | X_1) p(X_3 | X_1, X_2) p(X_4 | X_1, X_2, X_3)
                      = p(X_1) p(X_2 | X_1) p(X_3 | X_2) p(X_4 | X_3)
The full joint distribution needs 2^4 − 1 = 15 parameters (assuming binary variables); the factored form needs only 1 + 2 + 2 + 2 = 7 parameters (1 for p(X_1) and 2 for each conditional). A code sketch of this factorization follows below.
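A minimal sketch (not from the lecture) of the factorization in code: each conditional probability table of the chain is stored explicitly, and the joint is just their product. All numerical values are made-up illustrative parameters.

    # Hypothetical CPTs for the chain X1 -> X2 -> X3 -> X4 (binary variables).
    p1 = {1: 0.6, 0: 0.4}                                 # p(X1)
    p2 = {1: {1: 0.7, 0: 0.3}, 0: {1: 0.2, 0: 0.8}}       # p2[x1][x2] = p(X2 = x2 | X1 = x1)
    p3 = {1: {1: 0.9, 0: 0.1}, 0: {1: 0.4, 0: 0.6}}       # p(X3 | X2)
    p4 = {1: {1: 0.5, 0: 0.5}, 0: {1: 0.1, 0: 0.9}}       # p(X4 | X3)

    def joint(x1, x2, x3, x4):
        # p(x1, x2, x3, x4) = p(x1) p(x2 | x1) p(x3 | x2) p(x4 | x3)
        return p1[x1] * p2[x1][x2] * p3[x2][x3] * p4[x3][x4]

    print(joint(1, 1, 0, 1))

Note that the tables store both p and 1 − p for convenience; only 7 of these numbers are free parameters.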
Poll: Simple Bayesian network
What conditional independencies exist in the following Bayesian network? (Figure: network over X_1, X_2, X_3, X_4.)
1. X_1 and X_2 are marginally independent
2. X_4 is conditionally independent of X_1 given X_3
3. X_1 is conditionally independent of X_4 given X_3
4. X_1 is conditionally independent of X_2 given X_3
Generative model
Can also describe the probability distribution as a sequential "story"; this is called a generative model. For the chain X_1 → X_2 → X_3 → X_4:
X_1 ~ Bernoulli(φ^1)
X_2 | X_1 = x_1 ~ Bernoulli(φ^2_{x_1})
X_3 | X_2 = x_2 ~ Bernoulli(φ^3_{x_2})
X_4 | X_3 = x_3 ~ Bernoulli(φ^4_{x_3})
"First sample X_1 from a Bernoulli distribution with parameter φ^1, then sample X_2 from a Bernoulli distribution with parameter φ^2_{x_1}, where x_1 is the value we sampled for X_1, then sample X_3 from a Bernoulli …"
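A minimal sketch (not from the lecture) of following this story forward, i.e. ancestral sampling; the φ values are again made-up illustrative numbers.

    import random

    phi1 = 0.6
    phi2 = {0: 0.2, 1: 0.7}      # phi^2_{x1}
    phi3 = {0: 0.4, 1: 0.9}      # phi^3_{x2}
    phi4 = {0: 0.1, 1: 0.5}      # phi^4_{x3}

    def bern(p):
        return 1 if random.random() < p else 0

    def sample_chain():
        # Sample each variable in order, conditioning on its parent's sampled value.
        x1 = bern(phi1)
        x2 = bern(phi2[x1])
        x3 = bern(phi3[x2])
        x4 = bern(phi4[x3])
        return x1, x2, x3, x4

    print(sample_chain())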
More general generative models
This notion of a "sequential story" (generative model) is extremely powerful for describing very general distributions.
Naive Bayes:
Y ~ Bernoulli(φ)
X_i | Y = y ~ Categorical(θ_y)
Gaussian mixture model:
Z ~ Categorical(φ)
X | Z = z ~ N(μ_z, Σ_z)
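A minimal sketch (not from the lecture) of the Gaussian mixture story: sample the component, then sample the observation from that component's Gaussian. The weights, means, and standard deviations are illustrative.

    import random

    weights = [0.3, 0.7]     # Categorical parameters for Z
    means   = [-2.0, 3.0]    # mu_z for each component (1-D case)
    stds    = [1.0, 0.5]

    def sample_gmm():
        z = random.choices([0, 1], weights=weights)[0]   # Z ~ Categorical(weights)
        x = random.gauss(means[z], stds[z])              # X | Z = z ~ N(mu_z, sigma_z^2)
        return z, x

    print(sample_gmm())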
More general generative models
Linear regression:
Y | X = x ~ N(θ^T x, σ^2)
Changepoint model:
T ~ Uniform(0, 1)
Y | X = x ~ N(μ_1, σ^2) if x < t, N(μ_2, σ^2) if x ≥ t
Latent Dirichlet Allocation: M documents, K topics, N_i words per document:
θ_i ~ Dirichlet(α) (topic distribution per document)
φ_k ~ Dirichlet(β) (word distribution per topic)
z_{i,j} ~ Categorical(θ_i) (topic of the j-th word in document i)
w_{i,j} ~ Categorical(φ_{z_{i,j}}) (j-th word in document i)
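A minimal sketch (not from the lecture) of the LDA generative story for a tiny corpus; the vocabulary size, document lengths, and hyperparameters are made up.

    import numpy as np

    rng = np.random.default_rng(0)
    M, K, V = 3, 2, 5                 # documents, topics, vocabulary size
    N = [4, 6, 5]                     # words per document
    alpha, beta = 0.5, 0.1            # Dirichlet hyperparameters

    phi = rng.dirichlet(beta * np.ones(V), size=K)      # word distribution per topic
    docs = []
    for i in range(M):
        theta_i = rng.dirichlet(alpha * np.ones(K))     # topic distribution for document i
        words = []
        for j in range(N[i]):
            z_ij = rng.choice(K, p=theta_i)             # topic of the j-th word
            w_ij = rng.choice(V, p=phi[z_ij])           # word drawn from that topic's distribution
            words.append(int(w_ij))
        docs.append(words)
    print(docs)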
Outline
Probabilistic graphical models
Probabilistic inference
Exact inference
Sample-based inference
A brief look at deep generative models
The inference problem
Given observations (i.e., knowing the value of some of the variables in a model), what is the distribution over the other (hidden) variables?
A relatively "easy" problem if we observe variables at the "beginning" of chains in a Bayesian network:
• If we observe the value of X_1, then X_2, X_3, X_4 have the same distribution as before, just with X_1 "fixed"
• But if we observe X_4, what is the distribution over X_1, X_2, X_3?
(Figure: chain X_1 → X_2 → X_3 → X_4, shown once with X_1 observed and once with X_4 observed.)
Many types of inference problems
Marginal inference: given a generative distribution p(X) over X = {X_1, ..., X_n}, determine p(X_A) for A ⊆ {1, ..., n}
MAP inference: determine the assignment with the maximum probability
Conditional variants: solve either of the two variants conditioned on some observed variables, e.g. p(X_A | X_B = x_B)
Approaches to inference
There are three categories of common approaches to inference (more exist, but these are the most common):
1. Exact methods: Bayes' rule or variable elimination methods
2. Sampling approaches: draw samples from the distribution over hidden variables, without constructing it explicitly
3. Approximate variational approaches: approximate the distributions over hidden variables using "simple" distributions, minimizing the difference between these distributions and the true distributions
Outline
Probabilistic graphical models
Probabilistic inference
Exact inference
Sample-based inference
A brief look at deep generative models
Exact inference example
Mixture of Gaussians model:
Z ~ Categorical(φ)
X | Z = z ~ N(μ_z, Σ_z)
Task: compute p(Z | x). (Figure: Z → X.)
In this case, we can solve inference exactly with Bayes' rule:
p(z | x) = p(x | z) p(z) / Σ_{z'} p(x | z') p(z')
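A minimal sketch (not from the lecture) of this computation for a 1-D, two-component mixture, with the same illustrative parameter values as the sampling sketch above.

    import math

    weights = [0.3, 0.7]
    means   = [-2.0, 3.0]
    stds    = [1.0, 0.5]

    def normal_pdf(x, mu, sigma):
        return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

    def posterior_z(x):
        # Bayes' rule: p(z | x) is proportional to p(x | z) p(z), normalized over z.
        unnorm = [normal_pdf(x, means[z], stds[z]) * weights[z] for z in range(2)]
        total = sum(unnorm)
        return [u / total for u in unnorm]

    print(posterior_z(0.5))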
Exact inference in graphical models
In some cases, it's possible to exploit the structure of the graphical model to develop efficient exact inference methods.
Example: how can I compute p(X_4) in the chain X_1 → X_2 → X_3 → X_4?
p(X_4) = Σ_{x_1, x_2, x_3} p(x_1) p(x_2 | x_1) p(x_3 | x_2) p(X_4 | x_3)
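A minimal sketch (not from the lecture) of exploiting the chain structure: rather than summing the joint over all 2^3 hidden assignments, push each sum in as far as it will go and eliminate one variable at a time with small matrix-vector products. The conditional probability tables are illustrative.

    import numpy as np

    p_x1 = np.array([0.4, 0.6])                    # [p(X1 = 0), p(X1 = 1)]
    P21 = np.array([[0.8, 0.2], [0.3, 0.7]])       # P21[x1, x2] = p(X2 = x2 | X1 = x1)
    P32 = np.array([[0.6, 0.4], [0.1, 0.9]])       # p(X3 | X2)
    P43 = np.array([[0.9, 0.1], [0.5, 0.5]])       # p(X4 | X3)

    # Eliminate X1, then X2, then X3; cost grows linearly with the chain length
    # instead of exponentially in the number of variables.
    p_x2 = p_x1 @ P21        # sum_x1 p(x1) p(x2 | x1)
    p_x3 = p_x2 @ P32        # sum_x2 p(x2) p(x3 | x2)
    p_x4 = p_x3 @ P43        # sum_x3 p(x3) p(x4 | x3)
    print(p_x4)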
Need for approximate inference
In most cases, the exact distribution over hidden variables cannot be computed; it would require representing an exponentially large distribution over the hidden variables (or an infinite one, in the continuous case).
Z_i ~ Bernoulli(φ_i), i = 1, ..., n
X | Z = z ~ N(θ^T z, σ^2)
(Figure: Z_1, Z_2, ..., Z_n all pointing to X.)
The distribution p(Z | x) is a full distribution over n binary random variables.
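A minimal sketch (not from the lecture) that makes the blow-up concrete: representing p(Z | x) exactly means a table with 2^n entries, one per binary configuration. The parameter values are illustrative, and n = 10 is already near the limit of comfortable enumeration.

    import math
    from itertools import product

    n = 10
    phi = [0.3] * n                 # Bernoulli parameters for each Z_i
    theta = [0.5] * n               # weights in the mean of X | Z
    sigma, x_obs = 1.0, 2.0

    def unnorm_posterior(z):
        # p(z) * p(x_obs | z): proportional to p(z | x_obs)
        prior = math.prod(phi[i] if z[i] else 1 - phi[i] for i in range(n))
        mean = sum(theta[i] * z[i] for i in range(n))
        return prior * math.exp(-0.5 * ((x_obs - mean) / sigma) ** 2)

    table = {z: unnorm_posterior(z) for z in product((0, 1), repeat=n)}
    total = sum(table.values())
    posterior = {z: v / total for z, v in table.items()}
    print(len(posterior))           # 2**n = 1024 entries, doubling with every extra Z_i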
Outline
Probabilistic graphical models
Probabilistic inference
Exact inference
Sample-based inference
A brief look at deep generative models
Sample-based inference
If we can draw samples from a posterior distribution, then we can approximate arbitrary probabilistic queries about that distribution.
A naive strategy (rejection sampling): draw samples from the generative model and keep only those that match the observed data; the retained values of the hidden variables are then samples from their distribution given the observed variables.
As we get more complex models, and more observed variables, the probability that a sample matches our exact observations goes to zero.
(Figure: chain X_1 → X_2 → X_3 → X_4.)
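A minimal sketch (not from the lecture) of rejection sampling for p(X_1, X_2, X_3 | X_4 = 1) in the chain model, with the same made-up parameter values as the earlier sampling sketch.

    import random

    phi1 = 0.6
    phi2 = {0: 0.2, 1: 0.7}
    phi3 = {0: 0.4, 1: 0.9}
    phi4 = {0: 0.1, 1: 0.5}

    def bern(p):
        return 1 if random.random() < p else 0

    def rejection_sample(x4_obs=1):
        # Run the generative story until the sampled X4 matches the observation.
        while True:
            x1 = bern(phi1); x2 = bern(phi2[x1]); x3 = bern(phi3[x2]); x4 = bern(phi4[x3])
            if x4 == x4_obs:
                return x1, x2, x3

    samples = [rejection_sample() for _ in range(1000)]
    print(sum(s[0] for s in samples) / len(samples))    # estimate of p(X1 = 1 | X4 = 1)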
Markov Chain Monte Carlo
Let's consider a generic technique for generating samples from a distribution p(X) (suppose the distribution is complex, so that we cannot directly compute it or sample from it).
Our strategy is going to be to generate samples x^t via some conditional distribution q(X^{t+1} | X^t), constructed to guarantee that p(X^t) → p(X).
Metropolis-Hastings Algorithm
One of the workhorses of modern probabilistic methods:
1. Pick some x^0 (e.g., completely randomly)
2. For t = 1, 2, …
   Sample a proposal: x̃^{t+1} ~ q(X' | X = x^t)
   Set:
   x^{t+1} = x̃^{t+1} with probability min(1, [p(x̃^{t+1}) q(x^t | x̃^{t+1})] / [p(x^t) q(x̃^{t+1} | x^t)])
   x^{t+1} = x^t otherwise
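A minimal sketch (not from the lecture) of this loop for a 1-D continuous target with a Gaussian random-walk proposal; because the proposal is symmetric, the q terms cancel in the acceptance ratio. The target density and step size are made up.

    import math
    import random

    def p_tilde(x):
        # Unnormalized target density: two bumps (the normalizer is never needed).
        return math.exp(-0.5 * (x + 2) ** 2) + 0.5 * math.exp(-0.5 * (x - 3) ** 2)

    def metropolis_hastings(n_steps=10000, step=1.0, x0=0.0):
        x, samples = x0, []
        for _ in range(n_steps):
            x_prop = random.gauss(x, step)                    # proposal q(x' | x)
            if random.random() < min(1.0, p_tilde(x_prop) / p_tilde(x)):
                x = x_prop                                    # accept, otherwise keep x
            samples.append(x)
        return samples

    samples = metropolis_hastings()
    print(sum(samples) / len(samples))                        # rough mean under the target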
Notes on MH
We choose q(X' | X) so that we can easily sample from it; e.g., for continuous distributions it's common to choose q(X' | X = x) = N(x'; x, λ), a Gaussian centered at the current point.
Note that even if we cannot compute the probabilities p(x^t) and p(x̃^{t+1}), we can often compute their ratio p(x̃^{t+1}) / p(x^t) (this requires only being able to compute the unnormalized probabilities). E.g., consider conditioning on X_4 in the chain X_1 → X_2 → X_3 → X_4: the posterior p(x_1, x_2, x_3 | x_4) is known only up to the normalizer p(x_4), but the unnormalized value p(x_1) p(x_2 | x_1) p(x_3 | x_2) p(x_4 | x_3) is easy to evaluate.
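A minimal sketch (not from the lecture) of that idea: MH over the hidden variables (x_1, x_2, x_3) of the chain given X_4 = 1, using a symmetric "flip one coordinate" proposal and only the unnormalized posterior. The parameter values are the same illustrative ones as in the rejection-sampling sketch.

    import random

    phi1 = 0.6
    phi2 = {0: 0.2, 1: 0.7}
    phi3 = {0: 0.4, 1: 0.9}
    phi4 = {0: 0.1, 1: 0.5}
    x4_obs = 1

    def bern_pmf(x, p):
        return p if x == 1 else 1 - p

    def unnorm(z):
        # p(x1, x2, x3, x4_obs): proportional to p(x1, x2, x3 | x4_obs)
        x1, x2, x3 = z
        return (bern_pmf(x1, phi1) * bern_pmf(x2, phi2[x1])
                * bern_pmf(x3, phi3[x2]) * bern_pmf(x4_obs, phi4[x3]))

    def mh(n_steps=20000):
        z, samples = (0, 0, 0), []
        for _ in range(n_steps):
            i = random.randrange(3)                                      # flip one random coordinate
            z_prop = tuple(1 - v if j == i else v for j, v in enumerate(z))
            if random.random() < min(1.0, unnorm(z_prop) / unnorm(z)):   # q terms cancel (symmetric)
                z = z_prop
            samples.append(z)
        return samples

    samples = mh()
    print(sum(s[0] for s in samples) / len(samples))   # estimate of p(X1 = 1 | X4 = 1)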