
15-780 Graduate Artificial Intelligence: Probabilistic inference



  1. 15-780 – Graduate Artificial Intelligence: Probabilistic inference. J. Zico Kolter (this lecture) and Nihar Shah, Carnegie Mellon University, Spring 2020

  2. Outline: Probabilistic graphical models; Probabilistic inference; Exact inference; Sample-based inference; A brief look at deep generative models

  3. Outline: Probabilistic graphical models; Probabilistic inference; Exact inference; Sample-based inference; A brief look at deep generative models

  4. Probabilistic graphical models. Probabilistic graphical models are all about representing distributions p(X), where X represents some large set of random variables. Example: suppose X ∈ {0,1}^n (an n-dimensional binary random variable); it would take 2^n − 1 parameters to describe the full joint distribution. Graphical models offer a way to represent these same distributions more compactly, by exploiting conditional independencies in the distribution. Note: I'm going to use "probabilistic graphical model" and "Bayesian network" interchangeably, even though there are differences.

  5. Bayesian networks. A Bayesian network is defined by 1. a directed acyclic graph G = (V = {X_1, …, X_n}, E), and 2. a set of conditional distributions p(X_i | Parents(X_i)). Together these define the joint probability distribution p(X) = ∏_{i=1}^n p(X_i | Parents(X_i)). Equivalently: each node is conditionally independent of all non-descendants given its parents.

  6. Example Bayesian network. (Chain: X_1 → X_2 → X_3 → X_4.) Conditional independencies let us simplify the joint distribution: p(X_1, X_2, X_3, X_4) = p(X_1) p(X_2 | X_1) p(X_3 | X_1, X_2) p(X_4 | X_1, X_2, X_3) = p(X_1) p(X_2 | X_1) p(X_3 | X_2) p(X_4 | X_3). The full joint distribution would need 2^4 − 1 = 15 parameters (assuming binary variables).

  7. Example Bayesian network. (Chain: X_1 → X_2 → X_3 → X_4.) Conditional independencies let us simplify the joint distribution: p(X_1, X_2, X_3, X_4) = p(X_1) p(X_2 | X_1) p(X_3 | X_1, X_2) p(X_4 | X_1, X_2, X_3) = p(X_1) p(X_2 | X_1) p(X_3 | X_2) p(X_4 | X_3), versus 2^4 − 1 = 15 parameters for the full joint (assuming binary variables). In the factored form, p(X_1) needs 1 parameter and p(X_2 | X_1) needs 2 parameters.

  8. Example Bayesian network. (Chain: X_1 → X_2 → X_3 → X_4.) Conditional independencies let us simplify the joint distribution: p(X_1, X_2, X_3, X_4) = p(X_1) p(X_2 | X_1) p(X_3 | X_1, X_2) p(X_4 | X_1, X_2, X_3) = p(X_1) p(X_2 | X_1) p(X_3 | X_2) p(X_4 | X_3). The factored form needs only 7 parameters in total, versus 2^4 − 1 = 15 for the full joint (assuming binary variables).
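To make the parameter counting concrete, here is a minimal Python sketch (not from the slides; the CPT values are made up) that stores the 7 parameters of the factored chain and evaluates the joint probability of any assignment:

```python
from itertools import product

# The chain X1 -> X2 -> X3 -> X4 over binary variables needs only
# 1 + 2 + 2 + 2 = 7 parameters. The values below are hypothetical.
p_x1 = 0.6                          # p(X1 = 1)
p_x2_given_x1 = {0: 0.3, 1: 0.8}    # p(X2 = 1 | X1)
p_x3_given_x2 = {0: 0.5, 1: 0.9}    # p(X3 = 1 | X2)
p_x4_given_x3 = {0: 0.2, 1: 0.7}    # p(X4 = 1 | X3)

def bern(p, value):
    """Probability that a Bernoulli(p) variable takes the given 0/1 value."""
    return p if value == 1 else 1 - p

def joint(x1, x2, x3, x4):
    """p(x1, x2, x3, x4) = p(x1) p(x2|x1) p(x3|x2) p(x4|x3)."""
    return (bern(p_x1, x1)
            * bern(p_x2_given_x1[x1], x2)
            * bern(p_x3_given_x2[x2], x3)
            * bern(p_x4_given_x3[x3], x4))

# Sanity check: the joint sums to 1 over all 2^4 assignments.
assert abs(sum(joint(*xs) for xs in product([0, 1], repeat=4)) - 1) < 1e-12
```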

  9. Poll: Simple Bayesian network. What conditional independencies exist in the following Bayesian network (diagram over X_1, X_2, X_3, X_4 on the slide)? 1. X_1 and X_2 are marginally independent. 2. X_4 is conditionally independent of X_1 given X_3. 3. X_1 is conditionally independent of X_4 given X_3. 4. X_1 is conditionally independent of X_2 given X_3.

  10. Generative model. We can also describe the probability distribution as a sequential "story"; this is called a generative model. (Chain: X_1 → X_2 → X_3 → X_4.) X_1 ∼ Bernoulli(φ^1); X_2 | X_1 = x_1 ∼ Bernoulli(φ^2_{x_1}); X_3 | X_2 = x_2 ∼ Bernoulli(φ^3_{x_2}); X_4 | X_3 = x_3 ∼ Bernoulli(φ^4_{x_3}). "First sample X_1 from a Bernoulli distribution with parameter φ^1, then sample X_2 from a Bernoulli distribution with parameter φ^2_{x_1}, where x_1 is the value we sampled for X_1, then sample X_3 from a Bernoulli …"
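A minimal Python sketch of this "sequential story" (ancestral sampling) for the chain, again with made-up Bernoulli parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters; phi2/phi3/phi4 are indexed by the sampled parent value.
phi1 = 0.6
phi2 = {0: 0.3, 1: 0.8}
phi3 = {0: 0.5, 1: 0.9}
phi4 = {0: 0.2, 1: 0.7}

def sample_chain():
    """Follow the story in topological order: each variable is drawn given its parent."""
    x1 = int(rng.random() < phi1)
    x2 = int(rng.random() < phi2[x1])
    x3 = int(rng.random() < phi3[x2])
    x4 = int(rng.random() < phi4[x3])
    return x1, x2, x3, x4

samples = [sample_chain() for _ in range(5)]
```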

  11. More general generative models. This notion of a "sequential story" (generative model) is extremely powerful for describing very general distributions. Naive Bayes: Y ∼ Bernoulli(φ); X_i | Y = y ∼ Categorical(φ^i_y). Gaussian mixture model: Z ∼ Categorical(φ); X | Z = z ∼ N(μ_z, Σ_z).
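As an illustration, a short sketch of sampling from the Gaussian mixture model's generative story; the two-component parameters (phi, mu, Sigma) are hypothetical values chosen only for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-component mixture in 2 dimensions.
phi = np.array([0.3, 0.7])                       # mixing weights p(Z = k)
mu = np.array([[0.0, 0.0], [3.0, 3.0]])          # component means
Sigma = np.array([np.eye(2), 0.5 * np.eye(2)])   # component covariances

def sample_gmm():
    """Draw z from the categorical prior, then x | z from the corresponding Gaussian."""
    z = rng.choice(len(phi), p=phi)
    x = rng.multivariate_normal(mu[z], Sigma[z])
    return z, x

z, x = sample_gmm()
```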

  12. More general generative models. Linear regression: Y | X = x ∼ N(θ^T x, σ^2). Changepoint model: X ∼ Uniform(0,1); Y | X = x ∼ N(μ_1, σ^2) if x < t, N(μ_2, σ^2) if x ≥ t. Latent Dirichlet allocation: M documents, K topics, N_i words per document; θ_i ∼ Dirichlet(α) (topic distribution for document i); φ_k ∼ Dirichlet(β) (word distribution for topic k); z_{i,j} ∼ Categorical(θ_i) (topic of the jth word in document i); w_{i,j} ∼ Categorical(φ_{z_{i,j}}) (jth word in document i).
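The LDA story can be sampled the same way. A hedged sketch with hypothetical sizes and hyperparameters (M, K, V, N, alpha, beta are chosen only for illustration, not taken from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: M documents, K topics, vocabulary of V words, N words per document.
M, K, V, N = 4, 3, 10, 20
alpha, beta = 0.5, 0.1

theta = rng.dirichlet(alpha * np.ones(K), size=M)   # topic distribution per document
phi = rng.dirichlet(beta * np.ones(V), size=K)      # word distribution per topic

docs = []
for i in range(M):
    words = []
    for _ in range(N):
        z = rng.choice(K, p=theta[i])               # topic of this word
        w = rng.choice(V, p=phi[z])                 # word drawn from that topic's distribution
        words.append(w)
    docs.append(words)
```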

  13. Outline: Probabilistic graphical models; Probabilistic inference; Exact inference; Sample-based inference; A brief look at deep generative models

  14. The inference problem. Given observations (i.e., knowing the value of some of the variables in a model), what is the distribution over the other (hidden) variables? This is a relatively "easy" problem if we observe variables at the "beginning" of chains in a Bayesian network: if we observe the value of X_1 in the chain X_1 → X_2 → X_3 → X_4, then X_2, X_3, X_4 have the same distribution as before, just with X_1 "fixed". But if we observe X_4, what is the distribution over X_1, X_2, X_3?

  15. Many types of inference problems. Marginal inference: given a generative distribution p(X) over X = {X_1, …, X_n}, determine p(X_I) for a subset I ⊆ {1, …, n}. MAP inference: determine the assignment with the maximum probability. Conditional variants: solve either of the two variants conditioned on some observed variables, e.g., p(X_I | X_E = x_E).

  16. Approaches to inference. There are three categories of common approaches to inference (more exist, but these are the most common): 1. Exact methods: Bayes' rule or variable elimination. 2. Sampling approaches: draw samples from the distribution over hidden variables, without constructing it explicitly. 3. Approximate variational approaches: approximate the distributions over hidden variables using "simple" distributions, minimizing the difference between these approximations and the true distributions.

  17. Outline: Probabilistic graphical models; Probabilistic inference; Exact inference; Sample-based inference; A brief look at deep generative models

  18. Exact inference example. Mixture of Gaussians model: Z ∼ Categorical(φ); X | Z = z ∼ N(μ_z, Σ_z). Task: compute p(Z | x). (Graph: Z → X.) In this case, we can solve inference exactly with Bayes' rule: p(z | x) = p(x | z) p(z) / Σ_{z'} p(x | z') p(z').
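For a concrete instance of this Bayes' rule computation, a small sketch using scipy for the Gaussian density; the mixture parameters are the same made-up values as in the earlier sampling sketch:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical mixture parameters.
phi = np.array([0.3, 0.7])
mu = np.array([[0.0, 0.0], [3.0, 3.0]])
Sigma = np.array([np.eye(2), 0.5 * np.eye(2)])

def posterior_z(x):
    """p(z | x) = p(x | z) p(z) / sum_z' p(x | z') p(z')."""
    likelihood = np.array([multivariate_normal.pdf(x, mean=mu[k], cov=Sigma[k])
                           for k in range(len(phi))])
    unnormalized = likelihood * phi
    return unnormalized / unnormalized.sum()

print(posterior_z(np.array([2.5, 2.5])))   # posterior over the two components
```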

  19. Exact inference in graphical models. In some cases, it's possible to exploit the structure of the graphical model to develop efficient exact inference methods. Example: how can I compute p(X_4) in the chain X_1 → X_2 → X_3 → X_4? p(X_4) = Σ_{x_1, x_2, x_3} p(x_1) p(x_2 | x_1) p(x_3 | x_2) p(X_4 | x_3).
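The sketch below (reusing the hypothetical chain CPTs from earlier) computes p(X_4) two ways: by brute-force summation over all 2^3 hidden assignments, and by pushing the sums inward, variable-elimination style, so each intermediate message involves only one variable at a time:

```python
from itertools import product

# Hypothetical chain CPTs, as before.
p_x1 = 0.6
p_x2_given_x1 = {0: 0.3, 1: 0.8}
p_x3_given_x2 = {0: 0.5, 1: 0.9}
p_x4_given_x3 = {0: 0.2, 1: 0.7}

def bern(p, v):
    return p if v == 1 else 1 - p

# Brute force: sum the joint over x1, x2, x3 (2^3 terms per value of x4).
p_x4 = {x4: sum(bern(p_x1, x1) * bern(p_x2_given_x1[x1], x2)
                * bern(p_x3_given_x2[x2], x3) * bern(p_x4_given_x3[x3], x4)
                for x1, x2, x3 in product([0, 1], repeat=3))
        for x4 in [0, 1]}

# Variable elimination: sum out x1, then x2, then x3, keeping a small message each time,
# so the cost grows linearly in the number of variables instead of exponentially.
m1 = {x2: sum(bern(p_x1, x1) * bern(p_x2_given_x1[x1], x2) for x1 in [0, 1]) for x2 in [0, 1]}
m2 = {x3: sum(m1[x2] * bern(p_x3_given_x2[x2], x3) for x2 in [0, 1]) for x3 in [0, 1]}
p_x4_ve = {x4: sum(m2[x3] * bern(p_x4_given_x3[x3], x4) for x3 in [0, 1]) for x4 in [0, 1]}

assert all(abs(p_x4[v] - p_x4_ve[v]) < 1e-12 for v in [0, 1])
```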

  20. Need for approximate inference. In most cases, the exact distribution over hidden variables cannot be computed; doing so would require representing an exponentially large distribution over the hidden variables (or an infinite one, in the continuous case). Example: Z_i ∼ Bernoulli(φ_i), i = 1, …, n; X | Z = z ∼ N(θ^T z, σ^2). (Graph: Z_1, Z_2, …, Z_n are all parents of X.) The distribution p(Z | x) is a full distribution over n binary random variables.

  21. Outline: Probabilistic graphical models; Probabilistic inference; Exact inference; Sample-based inference; A brief look at deep generative models

  22. Sample-based inference. If we can draw samples from a posterior distribution, then we can approximate arbitrary probabilistic queries about that distribution. A naive strategy (rejection sampling): draw samples from the generative model until we find one that matches the observed data; the retained samples of the hidden variables are then distributed according to the distribution of the hidden variables given the observed variables. As we get more complex models, and more observed variables, the probability that we see our exact observations goes to zero. (Example chain: X_1 → X_2 → X_3 → X_4.)
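A minimal rejection-sampling sketch for this chain, conditioning on a hypothetical observation X_4 = 1 (the CPT values are made up, as before):

```python
import numpy as np

rng = np.random.default_rng(0)

phi1 = 0.6
phi2 = {0: 0.3, 1: 0.8}
phi3 = {0: 0.5, 1: 0.9}
phi4 = {0: 0.2, 1: 0.7}

def sample_chain():
    """Ancestral sampling from the full generative model."""
    x1 = int(rng.random() < phi1)
    x2 = int(rng.random() < phi2[x1])
    x3 = int(rng.random() < phi3[x2])
    x4 = int(rng.random() < phi4[x3])
    return x1, x2, x3, x4

# Keep only samples that match the observation X4 = 1; the kept (x1, x2, x3) are
# approximately distributed as p(X1, X2, X3 | X4 = 1).
accepted = [s[:3] for s in (sample_chain() for _ in range(10000)) if s[3] == 1]
p_x1_given_x4 = sum(s[0] for s in accepted) / len(accepted)   # estimate of p(X1 = 1 | X4 = 1)
```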

  23. Markov Chain Monte Carlo. Let's consider a generic technique for generating samples from a distribution p(X) (suppose the distribution is complex enough that we cannot directly compute it or sample from it). Our strategy is going to be to generate samples x^t via some conditional distribution p(x^{t+1} | x^t), constructed to guarantee that p(x^t) → p(X).

  24. Metropolis-Hastings Algorithm. One of the workhorses of modern probabilistic methods. 1. Pick some x^0 (e.g., completely randomly). 2. For t = 1, 2, …: sample x̃^{t+1} ∼ q(X' | X = x^t); then set x^{t+1} := x̃^{t+1} with probability min(1, [p(x̃^{t+1}) q(x^t | x̃^{t+1})] / [p(x^t) q(x̃^{t+1} | x^t)]), and x^{t+1} := x^t otherwise.
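A generic sketch of this loop in Python (the function and argument names are my own: p_tilde can be an unnormalized target density, propose draws from q(X' | X = x), and q_ratio supplies q(x^t | x̃^{t+1}) / q(x̃^{t+1} | x^t)):

```python
import numpy as np

rng = np.random.default_rng(0)

def metropolis_hastings(p_tilde, propose, q_ratio, x0, steps):
    """Metropolis-Hastings sampler.

    p_tilde: unnormalized target density p~(x)
    propose: function x -> x~ drawn from the proposal q(X' | X = x)
    q_ratio: function (x, x_new) -> q(x | x_new) / q(x_new | x)
    """
    x = x0
    samples = []
    for _ in range(steps):
        x_new = propose(x)
        accept = min(1.0, (p_tilde(x_new) / p_tilde(x)) * q_ratio(x, x_new))
        if rng.random() < accept:
            x = x_new            # accept the proposal
        samples.append(x)        # otherwise keep the current state
    return np.array(samples)
```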

  25. Notes on MH. We choose q(X' | X) so that we can easily sample from it; e.g., for continuous distributions, it's common to choose q(X' | X = x) = N(x'; x, I). Note that even if we cannot compute the probabilities p(x^t) and p(x̃^{t+1}), we can often compute their ratio p(x̃^{t+1}) / p(x^t), which requires only being able to compute the unnormalized probabilities; e.g., consider the case of the chain X_1 → X_2 → X_3 → X_4 with some variables observed.
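Continuing the sketch above: with the symmetric random-walk proposal q(X' | X = x) = N(x'; x, I), the proposal ratio cancels to 1, and the target only needs to be known up to a constant. The target below (a standard Gaussian scaled by an arbitrary factor) is purely illustrative:

```python
# Unnormalized target: a standard 2-D Gaussian times an arbitrary constant.
p_tilde = lambda x: 7.3 * np.exp(-0.5 * np.sum(x ** 2))

# Symmetric Gaussian random-walk proposal; its forward/backward ratio is 1.
propose = lambda x: x + rng.normal(size=x.shape)

samples = metropolis_hastings(p_tilde, propose, lambda x, x_new: 1.0, np.zeros(2), 5000)
# After discarding burn-in, the sample mean should be close to the true mean [0, 0].
print(samples[1000:].mean(axis=0))
```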
