Divergence measures and message passing Tom Minka Microsoft Research Cambridge, UK with thanks to the Machine Learning and Perception Group 1
Message-Passing Algorithms
MF   Mean-field                       [Peterson, Anderson 87]
BP   Loopy belief propagation         [Frey, MacKay 97]
EP   Expectation propagation          [Minka 01]
TRW  Tree-reweighted message passing  [Wainwright, Jaakkola, Willsky 03]
FBP  Fractional belief propagation    [Wiegerinck, Heskes 02]
PEP  Power EP                         [Minka 04] 2
Outline • Example of message passing • Interpreting message passing • Divergence measures • Message passing from a divergence measure • Big picture 3
Outline • Example of message passing • Interpreting message passing • Divergence measures • Message passing from a divergence measure • Big picture 4
Estimation Problem [Figures, slides 5–7: a factor graph over binary variables x, y, z with factors a–f; some variables are observed as 0 or 1, the others are unknown (?); the last slide shows the reduced graph over x, y, z]
Estimation Problem • Queries: marginals, normalizing constant, argmax • Want to answer these quickly 8
Belief Propagation [Figures, slides 9–10: messages passed along the graph over x, y, z, then the final beliefs]
Belief Propagation • Marginals: (Exact) vs. (BP) [values shown in figure] • Normalizing constant: 0.45 (Exact), 0.44 (BP) • Argmax: (0,0,0) (Exact), (0,0,0) (BP) 11
Outline • Example of message passing • Interpreting message passing • Divergence measures • Message passing from a divergence measure • Big picture 12
Message Passing = Distributed Optimization • Messages represent a simpler distribution q(x) that approximates p(x) – A distributed representation • Message passing = optimizing q to fit p – q stands in for p when answering queries • Parameters: – What type of distribution to construct (approximating family) – What cost to minimize (divergence measure) 13
How to make a message-passing algorithm 1. Pick an approximating family • fully-factorized, Gaussian, etc. 2. Pick a divergence measure 3. Construct an optimizer for that measure • usually fixed-point iteration 4. Distribute the optimization across factors 14
Outline • Example of message passing • Interpreting message passing • Divergence measures • Message passing from a divergence measure • Big picture 15
Let p, q be unnormalized distributions. Kullback-Leibler (KL) divergence: KL(p||q) = ∫ p(x) log(p(x)/q(x)) dx + ∫ (q(x) − p(x)) dx. Alpha-divergence (α is any real number): D_α(p||q) = ∫ [α p(x) + (1−α) q(x) − p(x)^α q(x)^(1−α)] dx / (α(1−α)). Asymmetric, convex. 16
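A minimal numerical sketch of these two definitions for unnormalized discrete distributions (my own code, not from the slides; the α → 0 and α → 1 limits are handled as the KL special cases):

```python
import numpy as np

def kl(p, q):
    """KL(p||q) for unnormalized discrete distributions:
    sum p*log(p/q) + sum(q - p)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask])) + np.sum(q - p)

def alpha_div(p, q, alpha):
    """D_alpha(p||q) for unnormalized discrete distributions.
    The limits alpha -> 0 and alpha -> 1 recover KL(q||p) and KL(p||q)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    if np.isclose(alpha, 0.0):
        return kl(q, p)
    if np.isclose(alpha, 1.0):
        return kl(p, q)
    return np.sum(alpha * p + (1 - alpha) * q
                  - p**alpha * q**(1 - alpha)) / (alpha * (1 - alpha))
```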
Examples of alpha-divergence: α = −1 gives (1/2)∫ (q(x) − p(x))²/p(x) dx; α → 0 gives KL(q||p); α = 0.5 gives the Hellinger-type distance 2∫ (√p(x) − √q(x))² dx; α → 1 gives KL(p||q); α = 2 gives the χ² divergence (1/2)∫ (p(x) − q(x))²/q(x) dx. 17
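A quick numerical check of these special cases (toy numbers of my own; reuses kl and alpha_div from the sketch above):

```python
import numpy as np

p = np.array([0.6, 0.3, 0.1])
q = np.array([0.3, 0.5, 0.2])

print(alpha_div(p, q, -1.0), 0.5 * np.sum((q - p)**2 / p))               # chi-squared in p
print(alpha_div(p, q, 0.5), 2.0 * np.sum((np.sqrt(p) - np.sqrt(q))**2))  # Hellinger-type
print(alpha_div(p, q, 2.0), 0.5 * np.sum((p - q)**2 / q))                # chi-squared in q
print(alpha_div(p, q, 1e-6), kl(q, p))      # alpha -> 0 approaches KL(q||p)
print(alpha_div(p, q, 1 - 1e-6), kl(p, q))  # alpha -> 1 approaches KL(p||q)
```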
Minimum alpha-divergence: q is Gaussian, minimizing D_α(p||q). [Figures, slides 18–22: fits to a bimodal p for α = −∞, 0, 0.5, 1, ∞; for small α, q matches a single mode, and as α grows, q stretches to cover the whole distribution]
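A sketch of how such fits can be reproduced, under stated assumptions: the bimodal p below is my own stand-in (the slides do not give the actual distribution), and the unnormalized Gaussian q is fit on a grid by directly minimizing D_α over its mean, variance, and scale.

```python
import numpy as np
from scipy.optimize import minimize

x = np.linspace(-10.0, 10.0, 2001)
dx = x[1] - x[0]
# Stand-in bimodal target (unnormalized mixture of two Gaussian bumps).
p = 0.7 * np.exp(-0.5 * (x - 2.0)**2) + 0.3 * np.exp(-(x + 3.0)**2)

def d_alpha(p, q, alpha):
    # Grid approximation of the alpha-divergence integral (alpha not in {0, 1}).
    return np.sum(alpha*p + (1 - alpha)*q
                  - p**alpha * q**(1 - alpha)) * dx / (alpha * (1 - alpha))

def best_gaussian(alpha):
    def objective(theta):
        mean, log_var, log_scale = theta
        q = np.exp(log_scale - 0.5 * (x - mean)**2 / np.exp(log_var))
        return d_alpha(p, q, alpha)
    return minimize(objective, x0=[0.0, 1.0, 0.0], method="Nelder-Mead").x

for a in [-1.0, 0.5, 2.0]:  # stand-ins for the alpha values in the figures
    mean, log_var, _ = best_gaussian(a)
    print(f"alpha={a:5.1f}: mean={mean:6.2f}  std={np.exp(0.5*log_var):5.2f}")
# Small alpha locks onto the dominant mode; larger alpha widens q toward covering both.
```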
Properties of alpha-divergence • α ≤ 0 seeks the mode with the largest mass (not the tallest) – zero-forcing: p(x)=0 forces q(x)=0 – underestimates the support of p • α ≥ 1 stretches to cover everything – inclusive: p(x)>0 forces q(x)>0 – overestimates the support of p [Frey,Patrascu,Jaakkola,Moran 00] 23
Structure of alpha space [Figure: the real α axis; α ≤ 0 is the zero-forcing regime, with MF at α = 0; α ≥ 1 is the inclusive (zero-avoiding) regime, with BP and EP at α = 1 and TRW at α > 1; FBP and PEP run at any α] 24
Other properties • If q is an exact minimum of alpha-divergence, the normalizing constants match: ∫ q(x) dx = ∫ p(x)^α q(x)^(1−α) dx • If α = 1: a Gaussian q matches the mean and variance of p – a fully factorized q matches the marginals of p 25
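A small check of the α = 1 case for a fully factorized q (toy numbers of my own): the KL(p||q)-optimal factorized approximation is the scaled product of marginals, so the marginals and the normalizing constant match exactly.

```python
import numpy as np

# Toy unnormalized joint over two binary variables (hypothetical numbers).
p = np.array([[4.0, 1.0],
              [1.0, 4.0]])
Z = p.sum()

# For alpha = 1, the optimal fully factorized q is Z * (marginal of x) * (marginal of y).
qx = p.sum(axis=1) / Z
qy = p.sum(axis=0) / Z
q = Z * np.outer(qx, qy)

print(q.sum(), Z)                      # normalizing constants agree
print(q.sum(axis=1), p.sum(axis=1))    # x-marginals agree
print(q.sum(axis=0), p.sum(axis=0))    # y-marginals agree
```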
Two-node example [Figure: two connected nodes x and y] • q is fully factorized, q(x,y) = q_x(x) q_y(y), and minimizes α-divergence to p • q has correct marginals only for α = 1 (BP) 26
Two-node example, bimodal distribution:
α = 1 (BP) — Good: marginals, mass; Bad: zeros, peak heights
α = 0 (MF) and α ≤ 0.5 — Good: zeros, one peak; Bad: marginals, mass 27
Two-node example, bimodal distribution (continued):
α = ∞ — Good: peak heights; Bad: zeros, marginals 28
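A sketch of this example in code (my own toy table; direct minimization over the factorized family stands in for any particular message-passing schedule): sweep α and inspect the resulting marginals and total mass.

```python
import numpy as np
from scipy.optimize import minimize

# "Bimodal" two-node distribution: mass concentrated at (0,0) and (1,1).
p = np.array([[0.45, 0.05],
              [0.05, 0.45]])

def d_alpha(p, q, alpha):
    if np.isclose(alpha, 1.0):
        return np.sum(p * np.log(p / q)) + np.sum(q - p)
    if np.isclose(alpha, 0.0):
        return np.sum(q * np.log(q / p)) + np.sum(p - q)
    return np.sum(alpha*p + (1 - alpha)*q
                  - p**alpha * q**(1 - alpha)) / (alpha * (1 - alpha))

def best_factorized(alpha):
    # q(x,y) = qx(x) * qy(y), parameterized by logs to stay positive.
    unpack = lambda t: np.outer(np.exp(t[:2]), np.exp(t[2:]))
    t = minimize(lambda t: d_alpha(p, unpack(t), alpha),
                 x0=[0.1, -0.1, 0.1, -0.1], method="Nelder-Mead").x
    return unpack(t)

for a in [0.0, 0.5, 1.0, 4.0]:
    q = best_factorized(a)
    print(f"alpha={a:3.1f}: x-marginal of q = {(q.sum(axis=1)/q.sum()).round(3)}, "
          f"mass = {q.sum():.3f}")
print("true x-marginal:", (p.sum(axis=1)/p.sum()).round(3), " true mass:", p.sum())
```

At α = 1 the marginals come out uniform (correct) and the mass matches; at α ≤ 0.5 the fit locks onto one mode and underestimates the mass.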
Lessons • Neither method is inherently superior – depends on what you care about • A factorized approx does not imply matching marginals (only for α =1) • Adding y to the problem can change the estimated marginal for x (though true marginal is unchanged) 29
Outline • Example of message passing • Interpreting message passing • Divergence measures • Message passing from a divergence measure • Big picture 30
Distributed divergence minimization 31
Distributed divergence minimization • Write p as a product of factors: p(x) = ∏_a f_a(x) • Approximate the factors one by one: f_a(x) → f̃_a(x) • Multiply to get the approximation: q(x) = ∏_a f̃_a(x) 32
Global divergence to local divergence • Global divergence: D(∏_a f_a(x) || ∏_a f̃_a(x)) • Local divergence: D(f_a(x) q^\a(x) || f̃_a(x) q^\a(x)), where q^\a(x) = ∏_{b≠a} f̃_b(x) collects the other approximated factors 33
Message passing • Messages are passed between factors • Messages are factor approximations: the f̃_a • Factor a receives q^\a from the other factors – Minimize the local divergence to get a new f̃_a – Send f̃_a to the other factors – Repeat until convergence • Produces all 6 algorithms (see the sketch below) 34
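A runnable sketch of this recipe for the fully factorized family with local KL(p||q), i.e. α = 1, which reproduces loopy BP. The three-variable chain and its potentials are my own toy example; each factor's approximation f̃_a is a product of per-variable messages, and refining a factor means tilting it by the incoming messages, matching marginals, and dividing the cavity back out.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: binary chain x - y - z with two pairwise factors (random tables).
f_xy = rng.uniform(0.5, 2.0, size=(2, 2))
f_yz = rng.uniform(0.5, 2.0, size=(2, 2))

# Factor approximations: f_xy ~ m_xy_x(x)*m_xy_y(y), f_yz ~ m_yz_y(y)*m_yz_z(z).
m_xy_x, m_xy_y = np.ones(2), np.ones(2)
m_yz_y, m_yz_z = np.ones(2), np.ones(2)

for _ in range(20):
    # Refine f_xy: tilt by the cavity (only m_yz_y touches y), match marginals.
    tilted = f_xy * m_yz_y[None, :]
    m_xy_x = tilted.sum(axis=1)            # the cavity on x is 1
    m_xy_y = tilted.sum(axis=0) / m_yz_y   # divide the cavity back out
    # Refine f_yz symmetrically.
    tilted = f_yz * m_xy_y[:, None]
    m_yz_y = tilted.sum(axis=1) / m_xy_y
    m_yz_z = tilted.sum(axis=0)            # the cavity on z is 1

# q's marginal for y vs. the exact one (the chain is a tree, so they agree).
q_y = m_xy_y * m_yz_y
p = np.einsum("xy,yz->xyz", f_xy, f_yz)
print(q_y / q_y.sum(), p.sum(axis=(0, 2)) / p.sum())
```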
Global divergence vs. local divergence [Figure: the α axis; at α = 0 (MF), local = global, so there is no loss from message passing; elsewhere local ≠ global] • In general, local ≠ global – but results are similar • BP doesn't minimize global KL, but comes close 35
Experiment • Which message-passing algorithm is best at minimizing the global divergence D_{α_G}(p||q)? • Procedure: 1. Run FBP with various α_L 2. Compute the global divergence for various α_G 3. Find the best α_L (best algorithm) for each α_G 36
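A toy stand-in for this procedure (an assumption on my part: on these tiny problems, direct minimization of D_{α_L} over the factorized family replaces an actual FBP run, which is what FBP aims for through its local updates):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)

def d_alpha(p, q, alpha):
    if np.isclose(alpha, 1.0):
        return np.sum(p * np.log(p / q)) + np.sum(q - p)
    if np.isclose(alpha, 0.0):
        return np.sum(q * np.log(q / p)) + np.sum(p - q)
    return np.sum(alpha*p + (1 - alpha)*q
                  - p**alpha * q**(1 - alpha)) / (alpha * (1 - alpha))

def fit_factorized(p, alpha):
    # Best factorized q under D_alpha, by direct optimization.
    unpack = lambda t: np.outer(np.exp(t[:2]), np.exp(t[2:]))
    t = minimize(lambda t: d_alpha(p, unpack(t), alpha),
                 x0=np.zeros(4), method="Nelder-Mead").x
    return unpack(t)

alphas = [0.0, 0.5, 1.0, 2.0]
wins = np.zeros((len(alphas), len(alphas)))  # rows: alpha_L, cols: alpha_G
for trial in range(20):
    p = rng.uniform(0.1, 1.0, size=(2, 2))
    qs = [fit_factorized(p, aL) for aL in alphas]
    for j, aG in enumerate(alphas):
        scores = [d_alpha(p, q, aG) for q in qs]
        wins[np.argmin(scores), j] += 1
print(wins)  # the diagonal should dominate: best alpha_L tracks alpha_G
```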
Results • Average over 20 graphs with random singleton and pairwise potentials • Mixed potentials (parameters drawn uniformly from (−1,1)): – best α_L = α_G (local should match global) – FBP with the same α is best at minimizing D_α • BP is best at minimizing KL 37
Outline • Example of message passing • Interpreting message passing • Divergence measures • Message passing from a divergence measure • Big picture 38
Hierarchy of algorithms [Diagram, from most to least general]:
Power EP — exp family, D_α(p||q)
  Structured MF — exp family, KL(q||p)
  FBP — fully factorized, D_α(p||q)
  EP — exp family, KL(p||q)
    MF — fully factorized, KL(q||p)
    TRW — fully factorized, D_α(p||q), α > 1
    BP — fully factorized, KL(p||q) 39
Matrix of algorithms (rows: divergence measure; columns: approximating family)
                    fully factorized   exp family      other families? (e.g. mixtures)
KL(q||p)            MF                 Structured MF
KL(p||q)            BP                 EP
D_α(p||q), α > 1    TRW
D_α(p||q)           FBP                Power EP
Other divergences? 40
Other Message Passing Algorithms Do they correspond to divergence measures? • Generalized belief propagation [Yedidia,Freeman,Weiss 00] • Iterated conditional modes [Besag 86] • Max-product belief revision • TRW-max-product [Wainwright,Jaakkola,Willsky 02] • Laplace propagation [Smola,Vishwanathan,Eskin 03] • Penniless propagation [Cano,Moral,Salmerón 00] • Bound propagation [Leisink,Kappen 03] 41
Future work • Understand existing message passing algorithms • Understand local vs. global divergence • New message passing algorithms: – Specialized divergence measures – Richer approximating families • Other ways to minimize divergence 42