
Probabilistic Graphical Models, Lecture 13: Loopy Belief Propagation



  1. Probabilistic Graphical Models
     Lecture 13: Loopy Belief Propagation
     CS/CNS/EE 155, Andreas Krause

  2. Announcements
     - Homework 3 is out: a lighter problem set, to allow more time for the project.
     - Next Monday: guest lecture by Dr. Baback Moghaddam from the JPL Machine Learning Group.
     - PLEASE fill out the feedback forms. This is a new course, and your feedback can have a major impact on future offerings!

  3. HMMs / Kalman Filters
     - The most famous graphical models: the Naïve Bayes model, the Hidden Markov model, and the Kalman filter.
     - Hidden Markov models: speech recognition, sequence analysis in computational biology.
     - Kalman filters: control (e.g., cruise control in cars), GPS navigation devices, tracking missiles, ...
     - Very simple models, but very powerful!

  4. HMMs / Kalman Filters
     [Figure: chain-structured model with hidden states X_1, ..., X_6 and observations Y_1, ..., Y_6]
     - X_1, ..., X_T: unobserved (hidden) variables
     - Y_1, ..., Y_T: observations
     - HMMs: X_i multinomial, Y_i arbitrary
     - Kalman filters: X_i, Y_i Gaussian distributions
     - Non-linear KF: X_i Gaussian, Y_i arbitrary

  5. Hidden Markov Models: Inference
     - In principle, we can use variable elimination, junction trees, etc.
     - But new variables X_t, Y_t appear at each time step, so we would need to rerun inference from scratch.
     - Bayesian filtering: suppose we have already computed P(X_t | y_{1:t}); we want to efficiently compute P(X_{t+1} | y_{1:t+1}).

  6. Bayesian filtering
     - Start with P(X_1).
     - At time t, assume we have P(X_t | y_{1:t-1}).
     - Condition on y_t: P(X_t | y_{1:t}).
     - Prediction: P(X_{t+1}, X_t | y_{1:t}).
     - Marginalization: P(X_{t+1} | y_{1:t}).
     A minimal sketch of this recursion for a discrete HMM follows below.
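
To make the condition / predict / marginalize recursion concrete, here is a minimal NumPy sketch for a discrete HMM. The function name and array conventions (a row-stochastic transition matrix, an emission matrix indexed by state and then symbol) are my own choices, not from the lecture.

```python
import numpy as np

def hmm_filter(prior, transition, emission, observations):
    """Bayesian filtering in a discrete HMM (illustrative sketch).

    prior:        P(X_1), shape (S,)
    transition:   P(X_{t+1} | X_t), shape (S, S), rows indexed by X_t
    emission:     P(Y_t | X_t), shape (S, O)
    observations: observed symbols y_1, ..., y_T (integers in [0, O))
    Returns the filtered marginals P(X_t | y_{1:t}) for t = 1, ..., T.
    """
    belief = prior.copy()          # P(X_1 | y_{1:0}) = P(X_1)
    filtered = []
    for y in observations:
        # Condition: P(X_t | y_{1:t}) is proportional to P(y_t | X_t) * P(X_t | y_{1:t-1})
        belief = belief * emission[:, y]
        belief /= belief.sum()
        filtered.append(belief.copy())
        # Predict + marginalize: P(X_{t+1} | y_{1:t}) = sum_{x_t} P(X_{t+1} | x_t) P(x_t | y_{1:t})
        belief = transition.T @ belief
    return filtered
```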

  7. Kalman Filters (Gaussian HMMs)
     - X_1, ..., X_T: location of the object being tracked
     - Y_1, ..., Y_T: observations
     - P(X_1): prior belief about the location at time 1
     - P(X_{t+1} | X_t): "motion model". How do I expect my target to move in the environment? Represented as a conditional linear Gaussian (CLG): X_{t+1} = A X_t + Gaussian noise
     - P(Y_t | X_t): "sensor model". What do I observe if the target is at location X_t? Represented as a CLG: Y_t = H X_t + Gaussian noise
     Both models are written out below.
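
Written out, the two conditional linear Gaussians are (the noise-covariance names Q and R are my labels, not from the slides):

```latex
X_{t+1} = A X_t + \varepsilon_t, \quad \varepsilon_t \sim \mathcal{N}(0, Q)
\qquad\text{and}\qquad
Y_t = H X_t + \delta_t, \quad \delta_t \sim \mathcal{N}(0, R),
```

equivalently, P(X_{t+1} | X_t = x) = N(A x, Q) and P(Y_t | X_t = x) = N(H x, R).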

  8. Bayesian Filtering for KFs
     - Can use Gaussian elimination to perform inference in the "unrolled" model.
     - Start with the prior belief P(X_1).
     - At every time step, maintain the belief P(X_t | y_{1:t-1}).
     - Condition on the observation: P(X_t | y_{1:t}).
     - Predict (multiply in the motion model): P(X_{t+1}, X_t | y_{1:t}).
     - "Roll up" (marginalize the previous time step): P(X_{t+1} | y_{1:t}).
     A sketch of these steps in their standard closed form appears below.
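
Under the linear-Gaussian model of the previous slide, the condition / predict / roll-up steps have a well-known closed form. A minimal NumPy sketch, using Q and R for the motion- and sensor-noise covariances (my labels):

```python
import numpy as np

def kalman_step(mu, Sigma, y, A, Q, H, R):
    """One condition / predict / roll-up cycle of the Kalman filter (sketch).

    (mu, Sigma): parameters of the predictive belief P(X_t | y_{1:t-1})
    y:           observation y_t
    Returns the parameters of P(X_{t+1} | y_{1:t}).
    """
    # Condition on y_t: standard Kalman update of the Gaussian belief.
    S = H @ Sigma @ H.T + R                  # innovation covariance
    K = Sigma @ H.T @ np.linalg.inv(S)       # Kalman gain
    mu_cond = mu + K @ (y - H @ mu)          # mean of P(X_t | y_{1:t})
    Sigma_cond = Sigma - K @ H @ Sigma       # covariance of P(X_t | y_{1:t})

    # Predict + roll-up: push the belief through the motion model and
    # marginalize out X_t, giving P(X_{t+1} | y_{1:t}).
    mu_pred = A @ mu_cond
    Sigma_pred = A @ Sigma_cond @ A.T + Q
    return mu_pred, Sigma_pred
```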

  9. What if observations are not "linear"?
     - Linear observations: Y_t = H X_t + noise
     - Nonlinear observations: Y_t = h(X_t) + noise, for some nonlinear function h

  10. Incorporating non-Gaussian observations
     - A nonlinear observation means P(Y_t | X_t) is not Gaussian, so: make it Gaussian!
     - First approach: approximate P(Y_t | X_t) as a CLG by linearizing it around the current estimate E[X_t | y_{1:t-1}]. Known as the Extended Kalman Filter (EKF). Can perform poorly if P(Y_t | X_t) is highly nonlinear.
     - Second approach: approximate the joint P(Y_t, X_t) as Gaussian. This takes the correlation with X_t into account. After obtaining the approximation, condition on Y_t = y_t (now a "linear" observation).
     The linearization used by the EKF is sketched below.
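
Concretely, writing the observation model as Y_t = h(X_t) + noise (h is my notation for the nonlinear sensor function), the EKF replaces h by its first-order Taylor expansion around the predicted mean mu_t = E[X_t | y_{1:t-1}]:

```latex
h(x) \approx h(\mu_t) + H_t\,(x - \mu_t),
\qquad
H_t = \left.\frac{\partial h}{\partial x}\right|_{x = \mu_t}
```

This turns P(Y_t | X_t) back into a conditional linear Gaussian, so the standard Kalman update can be applied.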

  11. Factored dynamical models
     - So far: HMMs and Kalman filters.
     - What if we have more than one variable at each time step? E.g., temperature at different locations, or road conditions in a road network?
     - This leads to spatio-temporal models.

  12. Dynamic Bayesian Networks
     - At every time step we have a Bayesian network (e.g., over variables A_t, B_t, C_t, D_t, E_t).
     - The variables at each time step t are called a "slice" S_t.
     - "Temporal" edges connect slice S_{t+1} with slice S_t.

  13. Flow of influence in DBNs
     [Figure: a DBN with chains for acceleration (A_t), speed (S_t), and location (L_t)]
     - Can we do efficient filtering in DBNs?

  14. Efficient inference in DBNs?
     [Figure: two slices of a DBN over variables A, B, C, D]

  15. Approximate inference in DBNs?
     [Figure: a two-slice DBN alongside the marginals at time 2]
     - How can we find principled approximations that still allow efficient inference?

  16. Assumed Density Filtering
     [Figure: the true marginal over A_t, B_t, C_t, D_t is fully connected; the approximate marginal is simpler]
     - The true marginal P(X_t) is fully connected.
     - We want to find a "simpler" distribution Q(X_t) such that P(X_t) ≈ Q(X_t).
     - Optimize over the parameters of Q to make Q as "close" to P as possible.
     - Similar to incorporating nonlinear observations in the KF!
     - More details later (variational inference)! One common formalization is sketched below.
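
One common way to make "as close as possible" precise, assuming the standard assumed-density-filtering setup (the specific objective is not spelled out on this slide), is to project onto a tractable family Q by minimizing a KL divergence:

```latex
Q^\star = \operatorname*{arg\,min}_{Q \in \mathcal{Q}} \; \mathrm{KL}\!\left( P(X_t \mid y_{1:t}) \,\Big\|\, Q(X_t) \right)
```

For Q an exponential family (e.g., fully factorized distributions), this projection amounts to matching the expected sufficient statistics (moment matching).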

  17. Big picture summary
     [Figure: states of the world, sensor measurements, ... are represented by a graphical model]
     We want to choose a model that ...
     - represents the relevant statistical dependencies between variables,
     - we can use to make inferences (make predictions, etc.),
     - we can learn from training data.

  18. What you have learned so far
     - Representation: Bayesian networks and Markov networks; conditional independence is key.
     - Inference: variable elimination and junction tree inference; exact inference is possible if the graph has low treewidth.
     - Learning: Parameters: can do MLE and Bayesian learning in Bayes nets and Markov nets if the data are fully observed. Structure: can find the optimal tree.

  19. Representation
     - Conditional independence = factorization.
     - Represent factorization/independence as a graph: directed graphs are Bayesian networks, undirected graphs are Markov networks.
     - Typically, we assume the factors are in the exponential family (e.g., multinomial, Gaussian, ...).
     - So far, we assumed all variables in the model are known. In practice, the existence of variables can depend on the data, the number of variables can grow over time, and we might have hidden (unobserved) variables!

  20. Inference
     - Key idea: exploit factorization (distributivity).
     - The complexity of inference depends on the treewidth of the underlying model; junction tree inference is "only" exponential in the treewidth.
     - In practice, models often have high treewidth, and DBNs always do, so we need approximate inference.

  21. Learning
     - Maximum likelihood estimation: in BNs, independent optimization of each CPT (decomposable score); in MNs, the partition function couples the parameters, but we can do gradient ascent (no local optima!).
     - Bayesian parameter estimation: conjugate priors are convenient to work with.
     - Structure learning: NP-hard in general, but we can find the optimal tree (Chow-Liu); a sketch follows below.
     - So far we assumed all variables are observed; in practice we often have missing data.
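
As a reminder of how the optimal tree is found, here is a minimal Chow-Liu sketch for discrete data; the function names and the use of SciPy's spanning-tree routine are my own choices, not from the lecture.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def empirical_mutual_information(x, y):
    """Mutual information (in nats) between two columns of discrete data
    (values assumed to be non-negative integers)."""
    joint = np.zeros((x.max() + 1, y.max() + 1))
    for xi, yi in zip(x, y):
        joint[xi, yi] += 1
    joint /= joint.sum()
    px, py = joint.sum(axis=1), joint.sum(axis=0)
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / np.outer(px, py)[nz])).sum())

def chow_liu_tree(data):
    """Chow-Liu structure learning sketch: data is an (n_samples, n_vars)
    array of discrete values. Returns the tree edges as (i, j) pairs."""
    n_vars = data.shape[1]
    mi = np.zeros((n_vars, n_vars))
    for i in range(n_vars):
        for j in range(i + 1, n_vars):
            mi[i, j] = mi[j, i] = empirical_mutual_information(data[:, i], data[:, j])
    # Maximum-weight spanning tree = minimum spanning tree of the negated weights.
    mst = minimum_spanning_tree(-mi)
    return list(zip(*mst.nonzero()))
```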

  22. The "light" side
     - We assumed everything is fully observable, the treewidth is low, and there are no hidden variables.
     - Then everything is nice: efficient exact inference in large models, optimal parameter estimation without local minima, and we can even solve some structure learning tasks exactly.

  23. The "dark" side
     [Figure: states of the world, sensor measurements, ... are represented by a graphical model]
     - In the real world, these assumptions are often violated.
     - We still want to use graphical models to solve interesting problems.

  24. Remaining Challenges
     - Representation: dealing with hidden variables.
     - Approximate inference for high-treewidth models.
     - Dealing with missing data.
     - This will be the focus of the remaining part of the course!

  25. Recall: Hardness of inference
     - Computing conditional distributions: exact solution is #P-complete; approximate solution is NP-hard.
     - Maximization: MPE is NP-complete; MAP is NP^PP-complete.

  26. Inference
     - We can exploit structure (conditional independence) to efficiently perform exact inference in many practical situations: whenever the graph has low treewidth, whenever there is context-specific independence, and in several other special cases.
     - For BNs where exact inference is not possible, we can use algorithms for approximate inference. Coming up now!

  27. Approximate inference
     Three major classes of general-purpose approaches:
     - Message passing, e.g., loopy belief propagation (today!)
     - Inference as optimization: approximate the posterior distribution by a simple distribution (mean field / structured mean field)
     - Sampling-based inference: importance sampling, particle filtering, Gibbs sampling, MCMC
     Many other alternatives exist (often for special cases).

  28. Recall: Message passing in junction trees
     - Messages are passed between clusters.
     [Figure: junction tree with clusters 1: CD, 2: DIG, 3: GIS, 4: GJSL, 5: HGJ, 6: JSL over the variables C, D, I, G, S, L, J, H]

  29. BP on tree pairwise Markov nets
     - Suppose the graph is given as a tree-structured pairwise Markov net (over variables such as C, D, I, G, S, L, J, H).
     - We don't need a junction tree: the graph is already a tree!
     - Example message and the general message update: see the sketch below.
     - Theorem: for trees, BP gives the correct answer!
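
A standard way to write the BP messages for a pairwise Markov net, with node potentials phi_i and edge potentials phi_{ij} (my notation, since the slide's equations are not reproduced here), is:

```latex
m_{i \to j}(x_j) \;=\; \sum_{x_i} \phi_i(x_i)\,\phi_{ij}(x_i, x_j) \prod_{k \in N(i) \setminus \{j\}} m_{k \to i}(x_i),
\qquad
b_i(x_i) \;\propto\; \phi_i(x_i) \prod_{k \in N(i)} m_{k \to i}(x_i)
```

On a tree, passing messages from the leaves to a root and back yields b_i(x_i) = P(x_i) exactly, which is the theorem stated above.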

  30. Loopy BP on arbitrary pairwise MNs
     - What if we apply BP to a graph with loops? Apply BP and hope for the best...
     - It will not generally converge.
     - If it converges, it will not necessarily give the correct marginals.
     - However, in practice, the answers are often still useful! A minimal implementation sketch follows below.
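
A minimal sketch of loopy BP on a discrete pairwise Markov net, using the message update above; the data-structure conventions (dicts of node and edge potentials) are my own. It also includes the renormalization and stopping criterion discussed on the next slides.

```python
import numpy as np

def loopy_bp(node_pot, edge_pot, max_iters=100, tol=1e-6):
    """Loopy belief propagation on a discrete pairwise Markov net (sketch).

    node_pot: dict  i -> array of node potentials phi_i(x_i)
    edge_pot: dict (i, j) -> matrix phi_ij(x_i, x_j), one entry per undirected edge
    Returns approximate marginals (beliefs) for every node.
    """
    # Neighbor lists and initial (uniform) messages m_{i->j}.
    neighbors = {i: [] for i in node_pot}
    for (i, j) in edge_pot:
        neighbors[i].append(j)
        neighbors[j].append(i)
    msgs = {(i, j): np.ones(len(node_pot[j]))
            for (a, b) in edge_pot for (i, j) in [(a, b), (b, a)]}

    for _ in range(max_iters):
        max_change = 0.0
        for (i, j) in list(msgs):
            # Orient the edge potential so rows index x_i and columns index x_j.
            pot = edge_pot[(i, j)] if (i, j) in edge_pot else edge_pot[(j, i)].T
            # Product of incoming messages from all neighbors of i except j.
            incoming = np.ones(len(node_pot[i]))
            for k in neighbors[i]:
                if k != j:
                    incoming *= msgs[(k, i)]
            # m_{i->j}(x_j) = sum_{x_i} phi_i(x_i) phi_ij(x_i, x_j) * incoming(x_i)
            new_msg = pot.T @ (node_pot[i] * incoming)
            new_msg /= new_msg.sum()        # renormalize to avoid underflow (slide 31)
            max_change = max(max_change, np.abs(new_msg - msgs[(i, j)]).max())
            msgs[(i, j)] = new_msg
        if max_change < tol:                # stop when messages stop changing (slide 33)
            break

    beliefs = {}
    for i in node_pot:
        b = node_pot[i] * np.prod([msgs[(k, i)] for k in neighbors[i]], axis=0)
        beliefs[i] = b / b.sum()
    return beliefs
```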

  31. Practical aspects of Loopy BP
     - Messages are products of numbers ≤ 1. On loopy graphs we repeatedly multiply the same factors, so the products converge to 0 (numerical problems).
     - Solution: renormalize! This does not affect the outcome (see below).
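
Written out (in the notation of the message update above, since the slide's own equation is not reproduced), renormalization rescales each message to sum to one:

```latex
\tilde{m}_{i \to j}(x_j) \;=\; \frac{m_{i \to j}(x_j)}{\sum_{x_j'} m_{i \to j}(x_j')}
```

Because messages enter the beliefs only multiplicatively and each belief is normalized at the end, rescaling messages by positive constants leaves the resulting marginals unchanged.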

  32. Behavior of BP
     [Figure: plot of the BP estimate of P(X_1 = 1) vs. iteration number, compared with the true posterior of X_1]
     - Loopy BP multiplies the same potentials multiple times, so BP is often overconfident.

  33. When do we stop?
     - A common criterion: stop when the messages stop changing, e.g., when the maximum change between successive iterations falls below a small threshold.

  34. Does Loopy BP always converge?
     - No! It can oscillate!
     - Typically, the oscillation is more severe the more "deterministic" the potentials are.
     - (Graphs from K. Murphy, UAI '99)

  35. What can we do to make BP converge?
