

  1. Divergence measures and message passing Tom Minka Microsoft Research Cambridge, UK with thanks to the Machine Learning and Perception Group 1

  2. Message-Passing Algorithms
  • MF [Peterson,Anderson 87]: Mean-field
  • BP [Frey,MacKay 97]: Loopy belief propagation
  • EP [Minka 01]: Expectation propagation
  • TRW [Wainwright,Jaakkola,Willsky 03]: Tree-reweighted message passing
  • FBP [Wiegerinck,Heskes 02]: Fractional belief propagation
  • PEP [Minka 04]: Power EP 2

  3. Outline • Example of message passing • Interpreting message passing • Divergence measures • Message passing from a divergence measure • Big picture 3

  4. Outline • Example of message passing • Interpreting message passing • Divergence measures • Message passing from a divergence measure • Big picture 4

  5. Estimation Problem [Figure: a factor graph over variables x, y, z with factors a, b, c, d, e, f] 5

  6. Estimation Problem [Figure: the same factor graph with some variables observed as 0 or 1 and the rest marked ?] 6

  7. Estimation Problem [Figure: the remaining factor graph over the unknowns x, y, z] 7

  8. Estimation Problem Queries: marginals, the normalizing constant, and the argmax. Want to compute these quickly. 8

  9. Belief Propagation [Figure: messages passed on the graph over x, y, z] 9

  10. Belief Propagation [Figure: final beliefs for x, y, z] 10

  11. Belief Propagation Marginals: (Exact) vs. (BP) [values shown in figure]. Normalizing constant: 0.45 (Exact) vs. 0.44 (BP). Argmax: (0,0,0) (Exact) vs. (0,0,0) (BP). 11

  12. Outline • Example of message passing • Interpreting message passing • Divergence measures • Message passing from a divergence measure • Big picture 12

  13. Message Passing = Distributed Optimization
  • Messages represent a simpler distribution q(x) that approximates p(x)
   – a distributed representation
  • Message passing = optimizing q to fit p
   – q stands in for p when answering queries
  • Parameters:
   – what type of distribution to construct (approximating family)
   – what cost to minimize (divergence measure) 13

  14. How to make a message-passing algorithm 1. Pick an approximating family • fully-factorized, Gaussian, etc. 2. Pick a divergence measure 3. Construct an optimizer for that measure • usually fixed-point iteration 4. Distribute the optimization across factors 14

  15. Outline • Example of message passing • Interpreting message passing • Divergence measures • Message passing from a divergence measure • Big picture 15

  16. Let p, q be unnormalized distributions.
  Kullback-Leibler (KL) divergence:
   KL(p||q) = ∫ p(x) log(p(x)/q(x)) dx + ∫ (q(x) − p(x)) dx
  Alpha-divergence (α is any real number):
   D_α(p||q) = ∫ [α p(x) + (1−α) q(x) − p(x)^α q(x)^(1−α)] dx / (α(1−α))
  Both are asymmetric and convex. D_α(p||q) → KL(q||p) as α → 0 and → KL(p||q) as α → 1. 16
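These definitions translate directly into code. A minimal numerical sketch (my own construction, not from the talk), implementing both formulas for unnormalized distributions tabulated on a grid, with the α → 0 and α → 1 limits handled explicitly:

```python
# Sketch only: KL and alpha-divergence for unnormalized distributions
# represented as nonnegative arrays on a grid with spacing dx.
import numpy as np

def kl(p, q, dx=1.0):
    """KL(p||q) = integral of p*log(p/q) + (q - p)."""
    return np.sum(p * np.log(p / q) + q - p) * dx

def alpha_div(p, q, alpha, dx=1.0):
    """D_alpha(p||q); tends to KL(q||p) as alpha -> 0, KL(p||q) as alpha -> 1."""
    if np.isclose(alpha, 0.0):
        return kl(q, p, dx)
    if np.isclose(alpha, 1.0):
        return kl(p, q, dx)
    integrand = alpha * p + (1 - alpha) * q - p**alpha * q**(1 - alpha)
    return np.sum(integrand) * dx / (alpha * (1 - alpha))

# Quick check on two unnormalized Gaussians: values are nonnegative,
# zero iff p = q, and asymmetric (swapping p and q changes the value).
x = np.linspace(-8, 8, 1601)
dx = x[1] - x[0]
p = 2.0 * np.exp(-0.5 * x**2)
q = 1.5 * np.exp(-0.5 * (x - 1.0)**2 / 1.5**2)
for a in (0.0, 0.5, 1.0, 2.0):
    print(f"D_{a}(p||q) = {alpha_div(p, q, a, dx):.4f}, "
          f"D_{a}(q||p) = {alpha_div(q, p, a, dx):.4f}")
```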

  17. Examples of alpha-divergence [Figure: special cases of D_α, including KL(q||p) as α → 0, the Hellinger distance at α = 0.5, KL(p||q) as α → 1, and χ²-type divergences at α = −1 and α = 2] 17

  18. Minimum alpha-divergence: q is Gaussian, minimizes D_α(p||q). [Figure: α = −∞] 18

  19. Minimum alpha-divergence: q is Gaussian, minimizes D_α(p||q). [Figure: α = 0] 19

  20. Minimum alpha-divergence: q is Gaussian, minimizes D_α(p||q). [Figure: α = 0.5] 20

  21. Minimum alpha-divergence: q is Gaussian, minimizes D_α(p||q). [Figure: α = 1] 21

  22. Minimum alpha-divergence: q is Gaussian, minimizes D_α(p||q). [Figure: α = ∞] 22
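The fits in these plots can be reproduced numerically. A hedged sketch (my own construction, not the talk's code): minimize D_α(p||q) over unnormalized Gaussians q for a bimodal p on a grid. Small α locks onto a single mode; α = 1 matches the overall mean and variance of p (the moment-matching property stated on slide 25); a large α stands in for α → ∞ and stretches q to cover everything.

```python
# Sketch only: numerically minimize D_alpha(p||q) over unnormalized
# Gaussians q(x) = z * N(x; m, s^2) for a bimodal target p.
import numpy as np
from scipy.optimize import minimize

x = np.linspace(-12.0, 12.0, 2401)
dx = x[1] - x[0]

def gauss(x, m, s):
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

p = 0.7 * gauss(x, -2.0, 1.0) + 0.3 * gauss(x, 3.0, 0.8)  # bimodal target

def alpha_div(p, q, alpha):
    # D_alpha for unnormalized distributions; KL limits at alpha = 0, 1.
    if np.isclose(alpha, 1.0):
        return np.sum(p * np.log(p / q) + q - p) * dx
    if np.isclose(alpha, 0.0):
        return np.sum(q * np.log(q / p) + p - q) * dx
    geo = np.exp(alpha * np.log(p) + (1.0 - alpha) * np.log(q))  # p^a q^(1-a)
    return np.sum(alpha * p + (1.0 - alpha) * q - geo) * dx / (alpha * (1.0 - alpha))

def fit(alpha):
    def cost(t):  # t = (mean, log std, log scale)
        return alpha_div(p, np.exp(t[2]) * gauss(x, t[0], np.exp(t[1])), alpha)
    t = minimize(cost, x0=[0.0, 1.0, 0.0], method="Nelder-Mead").x
    return t[0], np.exp(t[1])  # fitted mean and standard deviation

mean = np.sum(x * p) * dx / np.sum(p * dx)   # overall moments of p
std = np.sqrt(np.sum((x - mean) ** 2 * p) * dx / np.sum(p * dx))
print("target mean/std:", mean, std)
for a in [0.0, 0.5, 1.0, 4.0]:   # 4.0 as a stand-in for alpha -> infinity
    print(f"alpha={a}: fitted mean/std = {fit(a)}")
```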

  23. Properties of alpha-divergence
  • α ≤ 0 seeks the mode with the largest mass (not the tallest)
   – zero-forcing: p(x) = 0 forces q(x) = 0
   – underestimates the support of p
  • α ≥ 1 stretches to cover everything
   – inclusive: p(x) > 0 forces q(x) > 0
   – overestimates the support of p [Frey,Patrascu,Jaakkola,Moran 00] 23

  24. Structure of alpha space [Figure: the real α axis; α ≤ 0 is zero-forcing (MF at α → 0), α ≥ 1 is inclusive, i.e. zero-avoiding (BP and EP at α = 1, TRW at α > 1); FBP and PEP cover arbitrary α] 24

  25. Other properties
  • If q is an exact minimum of alpha-divergence, the normalizing constant of q estimates that of p
  • If α = 1: a Gaussian q matches the mean and variance of p
   – a fully factorized q matches the marginals of p 25

  26. Two-node example [Figure: two connected nodes x and y]
  • q is fully factorized, minimizes α-divergence to p
  • q has correct marginals only for α = 1 (BP); see the numerical sketch after slide 28 26

  27. Two-node example: bimodal distribution
  • α = 1 (BP): Good: marginals, mass. Bad: zeros, peak heights.
  • α = 0 (MF), α ≤ 0.5: Good: zeros, one peak. Bad: marginals, mass. 27

  28. Two-node example: bimodal distribution
  • α = ∞: Good: peak heights. Bad: zeros, marginals. 28
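The claims on slides 26-28 are easy to check numerically. A hedged sketch (my own construction, not the talk's code): fit a fully factorized q to an asymmetric bimodal joint over two binary variables; only α = 1 recovers the true marginals, while smaller α locks onto a single mode and skews them.

```python
# Sketch only: fit q(x,y) = qx(x) * qy(y) to a bimodal joint p over two
# binary variables by minimizing alpha-divergence (unnormalized form).
import numpy as np
from scipy.optimize import minimize

p = np.array([[0.40, 0.05],
              [0.05, 0.50]])   # bimodal: mass on (0,0) and (1,1)

def alpha_div(p, q, alpha):
    if np.isclose(alpha, 1.0):
        return np.sum(p * np.log(p / q) + q - p)   # KL(p||q)
    if np.isclose(alpha, 0.0):
        return np.sum(q * np.log(q / p) + p - q)   # KL(q||p)
    return np.sum(alpha * p + (1 - alpha) * q
                  - p**alpha * q**(1 - alpha)) / (alpha * (1 - alpha))

def fit(alpha):
    # Parametrize qx, qy by logs so they stay positive.
    def cost(t):
        return alpha_div(p, np.outer(np.exp(t[:2]), np.exp(t[2:])), alpha)
    t = minimize(cost, x0=[0.1, -0.1, 0.1, -0.1], method="Nelder-Mead",
                 options={"fatol": 1e-12, "xatol": 1e-10}).x
    q = np.outer(np.exp(t[:2]), np.exp(t[2:]))
    return q / q.sum()

print("true marginal of x:", p.sum(axis=1) / p.sum())
for a in [0.0, 0.5, 1.0]:
    print(f"alpha={a}: q marginal of x = {fit(a).sum(axis=1)}")
```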

  29. Lessons • Neither method is inherently superior – depends on what you care about • A factorized approx does not imply matching marginals (only for α =1) • Adding y to the problem can change the estimated marginal for x (though true marginal is unchanged) 29

  30. Outline • Example of message passing • Interpreting message passing • Divergence measures • Message passing from a divergence measure • Big picture 30

  31. Distributed divergence minimization 31

  32. Distributed divergence minimization
  • Write p as a product of factors: p(x) = ∏_a f_a(x)
  • Approximate the factors one by one: f_a(x) ≈ f̃_a(x)
  • Multiply to get the approximation: q(x) = ∏_a f̃_a(x) 32

  33. Global divergence to local divergence
  • Global divergence: min_q D(p(x) || q(x))
  • Local divergence: min_{f̃_a} D(f_a(x) q^{\a}(x) || f̃_a(x) q^{\a}(x)), where q^{\a}(x) = ∏_{b≠a} f̃_b(x) is the product of the other approximate factors 33

  34. Message passing
  • Messages are passed between factors
  • Messages are factor approximations: f̃_a
  • Factor a receives q^{\a} from the other factors
   – Minimize the local divergence to get f̃_a
   – Send f̃_a to the other factors
   – Repeat until convergence
  • Produces all 6 algorithms 34
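A hedged sketch (my own construction, not from the talk) of one step of this recipe with α = 1, i.e. BP, on the tiny model p(x,y) = f_a(x,y) f_b(x) f_c(y): within a fully factorized family, minimizing the local KL reduces to matching the marginals of the tilted distribution f_a(x,y) q^{\a}(x,y), which is exactly a BP message update. Since this toy graph is a tree, one pass already yields the exact marginals.

```python
# Sketch only: one local-divergence update with alpha = 1 (a BP step).
import numpy as np

f_a = np.array([[1.0, 0.2],
                [0.2, 1.0]])        # pairwise coupling factor over (x, y)
f_b = np.array([0.7, 0.3])          # singleton factor on x
f_c = np.array([0.4, 0.6])          # singleton factor on y

# f_b and f_c are already factorized, so only f_a needs an approximation
# f~_a(x,y) = m_ax(x) * m_ay(y).  q^{\a} is the product of the others:
q_not_a = np.outer(f_b, f_c)

# alpha = 1: minimizing KL(f_a q^{\a} || f~_a q^{\a}) over the factorized
# family = matching the marginals of the tilted distribution f_a * q^{\a}.
tilted = f_a * q_not_a
m_ax = tilted.sum(axis=1) / f_b     # message from factor a to x
m_ay = tilted.sum(axis=0) / f_c     # message from factor a to y

q = np.outer(f_b * m_ax, f_c * m_ay)
q /= q.sum()
exact = tilted / tilted.sum()
print("q marginals    :", q.sum(axis=1), q.sum(axis=0))
print("exact marginals:", exact.sum(axis=1), exact.sum(axis=0))
```

For other α, the same loop applies with the moment-matching step replaced by minimizing the local D_α, which is how the FBP and Power EP updates arise.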

  35. Global divergence vs. local divergence [Figure: the α axis; at MF (α → 0), local = global, so there is no loss from message passing; elsewhere local ≠ global]
  • In general, local ≠ global, but the results are similar
  • BP doesn’t minimize the global KL, but comes close 35

  36. Experiment
  • Which message-passing algorithm is best at minimizing the global D_α(p||q)?
  • Procedure:
   1. Run FBP with various α_L
   2. Compute the global divergence for various α_G
   3. Find the best α_L (best algorithm) for each α_G 36

  37. Results
  • Average over 20 graphs with random singleton and pairwise potentials
  • Mixed potentials (θ ~ U(−1,1)):
   – best α_L = α_G (local should match global)
   – FBP with the same α is best at minimizing D_α
   – BP is best at minimizing KL 37

  38. Outline • Example of message passing • Interpreting message passing • Divergence measures • Message passing from a divergence measure • Big picture 38

  39. Hierarchy of algorithms [Diagram, most general at top]
  • Power EP: exp family, D_α(p||q)
   – Structured MF: exp family, KL(q||p)
   – FBP: fully factorized, D_α(p||q)
   – EP: exp family, KL(p||q)
    • MF: fully factorized, KL(q||p)
    • TRW: fully factorized, D_α(p||q), α > 1
    • BP: fully factorized, KL(p||q) 39

  40. Matrix of algorithms (rows: divergence measure; columns: approximating family)

  divergence \ family | fully factorized | exp family    | other families? (mixtures)
  KL(q||p)            | MF               | Structured MF |
  D_α(p||q), α > 1    | TRW              |               |
  KL(p||q)            | BP               | EP            |
  D_α(p||q)           | FBP              | Power EP      |
  other divergences?  |                  |               | 40

  41. Other Message Passing Algorithms Do they correspond to divergence measures? • Generalized belief propagation [Yedidia,Freeman,Weiss 00] • Iterated conditional modes [Besag 86] • Max-product belief revision • TRW-max-product [Wainwright,Jaakkola,Willsky 02] • Laplace propagation [Smola,Vishwanathan,Eskin 03] • Penniless propagation [Cano,Moral,Salmerón 00] • Bound propagation [Leisink,Kappen 03] 41

  42. Future work • Understand existing message passing algorithms • Understand local vs. global divergence • New message passing algorithms: – Specialized divergence measures – Richer approximating families • Other ways to minimize divergence 42
