CPCS Networks – Medical Diagnosis (noisy-OR model)

Test case: no evidence.

[Figure: anytime-mpe(0.0001), upper/lower bound ratio (U/L error) vs. time and parameter i (i = 1 to 21), for cpcs360b and cpcs422b; y-axis: Upper/Lower ratio, x-axis: time (sec), 1 to 1000.]

Time (sec):
  Algorithm                   cpcs360    cpcs422
  elim-mpe                    115.8      1697.6
  anytime-mpe(eps = 0.0001)   70.3       505.2
  anytime-mpe(eps = 0.1)      70.3       110.5
Outline
• Mini-bucket elimination
• Weighted Mini-bucket
• Mini-clustering
• Re-parameterization, cost-shifting
• Iterative Belief propagation
• Iterative-join-graph propagation
Decomposition for Sum
• Generalize the mini-bucket technique to summation via Hölder's inequality.
• Define the weighted (or powered) sum: $\sum_x^{w} f(x) = \left(\sum_x f(x)^{1/w}\right)^{w}$
• The weight acts as a "temperature" interpolating between sum and max: $w = 1$ gives the ordinary sum, and $\sum_x^{w} f(x) \to \max_x f(x)$ as $w \to 0^+$.
• Different weights do not commute: in general $\sum_x^{w_1} \sum_y^{w_2} f \ne \sum_y^{w_2} \sum_x^{w_1} f$.
The Power Sum and Hölder Inequality

Power sum: $\sum_x^{w} f(x) = \left(\sum_x f(x)^{1/w}\right)^{w}$. Hölder's inequality bounds the sum of a product by a product of power sums: for $w_1 + w_2 = 1$, $w_1, w_2 > 0$,
$\sum_x f_1(x)\,f_2(x) \le \left(\sum_x^{w_1} f_1(x)\right)\left(\sum_x^{w_2} f_2(x)\right)$.
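As a concrete check of the power sum's limiting behavior, here is a minimal NumPy sketch (the helper name `power_sum` is ours, not from the slides); it evaluates the operator in log space for numerical stability:

```python
import numpy as np

def power_sum(f, w):
    """Weighted ("power") sum (sum_x f(x)^(1/w))^w, computed in log space."""
    a = np.log(f) / w
    m = a.max()
    return np.exp(w * (m + np.log(np.exp(a - m).sum())))

f = np.array([0.5, 2.0, 3.0, 1.5])
print(power_sum(f, 1.0))    # w = 1: ordinary sum -> 7.0
print(power_sum(f, 1e-6))   # w -> 0+: approaches max_x f(x) -> 3.0
print(f.sum(), f.max())     # reference values
```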
Working Example
• Model: Markov network over variables A, B, C
• Task: compute the partition function Z
(Qiang Liu slides)
Mini-Bucket (Basic Principles)
• Upper bound: replace the sum over a split bucket by sum times max, e.g. $\sum_x f(x)\,g(x) \le \left(\sum_x f(x)\right)\left(\max_x g(x)\right)$
• Lower bound: replace max by min, e.g. $\sum_x f(x)\,g(x) \ge \left(\sum_x f(x)\right)\left(\min_x g(x)\right)$
(Qiang Liu slides)
Hölder Inequality
$\sum_x f_1(x)\,f_2(x) \le \left(\sum_x f_1(x)^{1/w_1}\right)^{w_1} \left(\sum_x f_2(x)^{1/w_2}\right)^{w_2}$
• where $w_1 + w_2 = 1$ and $w_1, w_2 > 0$.
• When $f_1^{1/w_1} \propto f_2^{1/w_2}$, equality is achieved.
(Qiang Liu slides)
G. H. Hardy, J. E. Littlewood and G. Pólya, Inequalities, Cambridge Univ. Press, London and New York, 1934.
Reverse Hölder Inequality
• If instead one weight is negative (e.g., $w_1 > 0$, $w_2 < 0$ with $w_1 + w_2 = 1$), the direction of the inequality reverses:
$\sum_x f_1(x)\,f_2(x) \ge \left(\sum_x f_1(x)^{1/w_1}\right)^{w_1} \left(\sum_x f_2(x)^{1/w_2}\right)^{w_2}$.
(Qiang Liu slides)
G. H. Hardy, J. E. Littlewood and G. Pólya, Inequalities, Cambridge Univ. Press, London and New York, 1934.
Weighted Mini-Bucket (for summation)

Exact bucket elimination in bucket C:
$\mu_C(b,d,e,f) = \sum_c g(b,c)\,g(c,d)\,g(c,e)\,g(c,f)$

Mini-buckets (split bucket C and apply the power sum):
$\sum_c^{w} g(b,c)\,g(c,d)\,g(c,e)\,g(c,f) \;\le\; \left(\sum_c^{w_1} g(b,c)\,g(c,d)\right)\left(\sum_c^{w_2} g(c,e)\,g(c,f)\right) = \mu_{C\to D}(b,d)\cdot\mu_{C\to E}(e,f)$

where $\sum_y^{w} g(y) = \left(\sum_y g(y)^{1/w}\right)^{w}$ is the weighted or "power" sum operator, and
$\sum_y^{w} g_1(y)\,g_2(y) \le \left(\sum_y^{w_1} g_1(y)\right)\left(\sum_y^{w_2} g_2(y)\right)$ where $w_1 + w_2 = w$ and $w_1, w_2 > 0$.

Processing the remaining buckets (D, E, F, B, A) the same way yields U = upper bound (a lower bound if $w_1 > 0$, $w_2 < 0$). [Liu and Ihler, 2011]
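The two bounds can be checked numerically on a single bucket with two factors over the eliminated variable c. A small sketch with assumed random tables, reusing the `power_sum` helper from above:

```python
import numpy as np

def power_sum(f, w):
    a = np.log(f) / w
    m = a.max()
    return np.exp(w * (m + np.log(np.exp(a - m).sum())))

rng = np.random.default_rng(0)
f = rng.uniform(0.1, 2.0, 5)   # mini-bucket 1: factor over c
g = rng.uniform(0.1, 2.0, 5)   # mini-bucket 2: factor over c

exact = (f * g).sum()                            # exact bucket: sum_c f(c) g(c)
upper = power_sum(f, 0.5) * power_sum(g, 0.5)    # w1 + w2 = 1, both positive
lower = power_sum(f, 2.0) * power_sum(g, -1.0)   # w1 > 0, w2 < 0 (reverse Holder)
print(lower, "<=", exact, "<=", upper)
```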
Weighted Mini-Bucket for Marginal MAP
Bucket Elimination for MMAP

Constrained elimination order: the SUM variables (B, C, D, E) are eliminated first, and the MAX variable (A) last; each bucket is processed with its own operator (SUM for buckets B–E, MAX for bucket A). MAP* is the marginal MAP value.
MB and WMB for Marginal MAP

Marginal MAP over the network A–F, with a constrained order (sum variables first, max variables last):

bucket C (sum, split into mini-buckets with weights $w_1 + w_2 = 1$):
  $\mu_{C\to D}(b,d) = \sum_c^{w_1} g(b,c)\,g(c,d)$
  $\mu_{C\to E}(e,f) = \sum_c^{w_2} g(c,e)\,g(c,f)$
buckets D and E (max): produce $\mu_{D\to F}(b,f)$ and $\mu_{E\to F}(b,f)$
bucket F (max): $\mu_{F\to B}(b) = \max_f \mu_{D\to F}(b,f)\,\mu_{E\to F}(b,f)$
final bucket (max): $U = \max_b g(b)\,\mu_{F\to B}(b)$, an upper bound on the marginal MAP value V

Can optimize over cost-shifting and weights (single-pass "MM" or iterative message passing). [Liu and Ihler, 2011; 2013] [Dechter and Rish, 2003]
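A tiny numeric illustration of the WMB upper bound for marginal MAP, on an assumed two-variable toy model (max over b, sum over c), splitting the sum over c with weights 1/2 + 1/2 = 1:

```python
import numpy as np

def power_sum(F, w, axis):
    """Power sum along one axis, in log space."""
    a = np.log(F) / w
    m = a.max(axis=axis, keepdims=True)
    s = w * (m + np.log(np.exp(a - m).sum(axis=axis, keepdims=True)))
    return np.exp(s).squeeze(axis)

rng = np.random.default_rng(1)
f = rng.uniform(0.1, 1.0, (4, 5))   # f(b, c)
g = rng.uniform(0.1, 1.0, (4, 5))   # g(b, c)

V = (f * g).sum(axis=1).max()       # exact marginal MAP: max_b sum_c f g
U = (power_sum(f, 0.5, 1) * power_sum(g, 0.5, 1)).max()   # WMB upper bound
print(V, "<=", U)
```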
MBE-MAP: process the MAX buckets with max mini-buckets, and the SUM buckets with weighted mini-buckets.
Initial partitioning
Complexity and Tractability of MBE(i,m)
Outline
• Mini-bucket elimination
• Weighted Mini-bucket
• Mini-clustering
• Re-parameterization, cost-shifting
• Iterative Belief propagation
• Iterative-join-graph propagation
Join-Tree Clustering (Cluster-Tree Elimination)

Clusters 1:ABC, 2:BCDF, 3:BEF, 4:EFG, connected in a chain with separators BC, BF, EF. Messages:
$h_{(1,2)}(b,c) = \sum_a p(a)\,p(b|a)\,p(c|a,b)$
$h_{(2,1)}(b,c) = \sum_{d,f} p(d|b)\,p(f|c,d)\,h_{(3,2)}(b,f)$
$h_{(2,3)}(b,f) = \sum_{c,d} p(d|b)\,p(f|c,d)\,h_{(1,2)}(b,c)$
$h_{(3,2)}(b,f) = \sum_e p(e|b,f)\,h_{(4,3)}(e,f)$
$h_{(3,4)}(e,f) = \sum_b p(e|b,f)\,h_{(2,3)}(b,f)$
$h_{(4,3)}(e,f) = p(G = g_e \mid e,f)$

EXACT algorithm. Time and space: exp(cluster size) = exp(treewidth + 1).
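The message schedule above can be run directly with einsum. A self-contained sketch with random CPTs (our own toy parameterization; evidence G = 1 plays the role of g_e), verifying that CTE is exact by computing P(evidence) from two different clusters:

```python
import numpy as np

rng = np.random.default_rng(0)
def cpt(*shape):                       # random CPT; last axis is the child
    t = rng.uniform(0.1, 1.0, shape)
    return t / t.sum(axis=-1, keepdims=True)

pa    = cpt(2)          # p(a)
pb_a  = cpt(2, 2)       # p(b|a):   [a, b]
pc_ab = cpt(2, 2, 2)    # p(c|a,b): [a, b, c]
pd_b  = cpt(2, 2)       # p(d|b):   [b, d]
pf_cd = cpt(2, 2, 2)    # p(f|c,d): [c, d, f]
pe_bf = cpt(2, 2, 2)    # p(e|b,f): [b, f, e]
pg_ef = cpt(2, 2, 2)    # p(g|e,f): [e, f, g]

g_e = 1                                                   # evidence G = g_e
h12 = np.einsum('a,ab,abc->bc', pa, pb_a, pc_ab)          # cluster 1 -> 2
h43 = pg_ef[:, :, g_e]                                    # cluster 4 -> 3
h32 = np.einsum('bfe,ef->bf', pe_bf, h43)                 # cluster 3 -> 2
h21 = np.einsum('bd,cdf,bf->bc', pd_b, pf_cd, h32)        # cluster 2 -> 1
h23 = np.einsum('bd,cdf,bc->bf', pd_b, pf_cd, h12)        # cluster 2 -> 3
h34 = np.einsum('bfe,bf->ef', pe_bf, h23)                 # cluster 3 -> 4

Pe1 = np.einsum('a,ab,abc,bc->', pa, pb_a, pc_ab, h21)    # P(G=g_e) at cluster 1
Pe4 = np.einsum('ef,ef->', h43, h34)                      # P(G=g_e) at cluster 4
print(Pe1, Pe4)                                           # identical: CTE is exact
```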
We can replace the sum with the power sum, using weights that sum to 1 across the mini-buckets of each bucket.
Mini-Clustering, i-bound = 3

Cluster 1 (ABC): p(a), p(b|a), p(c|a,b)
  $h_{(1,2)}(b,c) = \sum_a p(a)\,p(b|a)\,p(c|a,b)$
Cluster 2 (BCDF), split into mini-clusters {p(d|b), h_{(1,2)}(b,c)} and {p(f|c,d)}:
  $h^1_{(2,3)}(b) = \sum_{c,d} p(d|b)\,h_{(1,2)}(b,c)$
  $h^2_{(2,3)}(f) = \max_{c,d} p(f|c,d)$
Cluster 3 (BEF): p(e|b,f), $h^1_{(2,3)}(b)$, $h^2_{(2,3)}(f)$
Cluster 4 (EFG): p(g|e,f)

APPROXIMATE algorithm. Time and space: exp(i-bound), where the i-bound caps the number of variables in a mini-cluster.
Mini-Clustering - Example

$H_{(1,2)}$:  $h^1_{(1,2)}(b,c) = \sum_a p(a)\,p(b|a)\,p(c|a,b)$
$H_{(2,1)}$:  $h^1_{(2,1)}(b) = \sum_{d,f} p(d|b)\,h^1_{(3,2)}(b,f)$,   $h^2_{(2,1)}(c) = \max_{d,f} p(f|c,d)$
$H_{(2,3)}$:  $h^1_{(2,3)}(b) = \sum_{c,d} p(d|b)\,h^1_{(1,2)}(b,c)$,   $h^2_{(2,3)}(f) = \max_{c,d} p(f|c,d)$
$H_{(3,2)}$:  $h^1_{(3,2)}(b,f) = \sum_e p(e|b,f)\,h^1_{(4,3)}(e,f)$
$H_{(3,4)}$:  $h^1_{(3,4)}(e,f) = \sum_b p(e|b,f)\,h^1_{(2,3)}(b)\,h^2_{(2,3)}(f)$
$H_{(4,3)}$:  $h^1_{(4,3)}(e,f) = p(G = g_e \mid e,f)$
Cluster Tree Elimination vs. Mini-Clustering (clusters 1:ABC, 2:BCDF, 3:BEF, 4:EFG):

  CTE messages:        MC messages:
  h(1,2)(b,c)          H(1,2) = { h1(b,c) }
  h(2,1)(b,c)          H(2,1) = { h1(b), h2(c) }
  h(2,3)(b,f)          H(2,3) = { h1(b), h2(f) }
  h(3,2)(b,f)          H(3,2) = { h1(b,f) }
  h(3,4)(e,f)          H(3,4) = { h1(e,f) }
  h(4,3)(e,f)          H(4,3) = { h1(e,f) }
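The mini-clustering approximation of cluster 2's outgoing message can be checked directly: the product of the two partial messages upper-bounds the exact CTE message elementwise. A sketch with assumed random tables:

```python
import numpy as np

rng = np.random.default_rng(0)
def cpt(*shape):
    t = rng.uniform(0.1, 1.0, shape)
    return t / t.sum(axis=-1, keepdims=True)

pd_b  = cpt(2, 2)                        # p(d|b):   [b, d]
pf_cd = cpt(2, 2, 2)                     # p(f|c,d): [c, d, f]
h12   = rng.uniform(0.1, 1.0, (2, 2))    # stand-in incoming message h(1,2)(b,c)

# exact CTE message from cluster 2 to cluster 3, scope {B, F}:
h23 = np.einsum('bd,cdf,bc->bf', pd_b, pf_cd, h12)

# mini-clustering, i-bound 3: split {p(d|b), h(1,2)} from {p(f|c,d)};
# sum in the first mini-cluster, max in the second
h1 = np.einsum('bd,bc->b', pd_b, h12)    # h1(2,3)(b)
h2 = pf_cd.max(axis=(0, 1))              # h2(2,3)(f)
print(np.all(h23 <= np.outer(h1, h2)))   # True: MC yields an upper bound
```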
Heuristics for Partitioning (Dechter and Rish, 2003; Rollon and Dechter, 2010)
• Scope-based Partitioning Heuristic (SCP): aims to minimize the number of mini-buckets in the partition by placing in each mini-bucket as many functions as possible while respecting the i-bound.
• Alternatively, use a greedy heuristic derived from a distance function to decide which functions go into the same mini-bucket.
Greedy Scope-based Partitioning
Heuristic for Partitioning

Scope-based Partitioning Heuristic (SCP). The scope-based partition heuristic aims to minimize the number of mini-buckets in the partition by including in each mini-bucket as many functions as possible, as long as the i-bound is satisfied. First, single-function mini-buckets are ordered by decreasing arity from left to right. Then each mini-bucket is absorbed into the left-most mini-bucket with which it can be merged. The time complexity of Partition(B, i), where B is the bucket to be partitioned and |B| the number of functions in the bucket, is O(|B| log|B| + |B|²) using the SCP heuristic. The scope-based heuristic is quite fast; its shortcoming is that it does not consider the actual information in the functions.
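A short sketch of the SCP procedure as described above (the data representation, scopes as Python sets, is our assumption):

```python
def scp_partition(bucket, i_bound):
    """Scope-based partitioning: bucket is a list of function scopes (sets of
    variables); returns mini-buckets whose union scopes respect the i-bound."""
    # single-function mini-buckets, ordered by decreasing arity
    minibuckets = [[s] for s in sorted(bucket, key=len, reverse=True)]
    result = []
    for mb in minibuckets:
        for target in result:                      # try left-most first
            if len(set().union(*target, *mb)) <= i_bound:
                target.extend(mb)                  # absorb the mini-bucket
                break
        else:
            result.append(mb)                      # no merge possible
    return result

# bucket C of the running example, i-bound 3: splits into two mini-buckets
print(scp_partition([{'b','c'}, {'c','d'}, {'c','e'}, {'c','f'}], 3))
```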
Greedy Partition as a function of a distance function h
Comparing Mini-Clustering against Belief Propagation. What is belief propagation?
Iterative Belief Propagation
• Belief propagation is exact for poly-trees
• IBP: applying BP iteratively to cyclic networks
[Diagram: one update step for node X1 with parents U1, U2, U3 and children X1, X2, combining π messages from the parents and λ messages from the children to compute BEL(U1)]
• No guarantees for convergence
• Works well for many coding networks
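For concreteness, a minimal sum-product loopy BP sketch on an assumed toy pairwise model (a 3-cycle), not the slides' coding networks; as the slide notes, there is no convergence guarantee in general:

```python
import numpy as np

rng = np.random.default_rng(0)
edges = [(0, 1), (1, 2), (0, 2)]                           # 3-cycle, binary vars
psi = {e: rng.uniform(0.5, 2.0, (2, 2)) for e in edges}    # psi[(i,j)][x_i, x_j]

# directed messages m[(i, j)](x_j), initialized uniform
m = {d: np.ones(2) / 2 for (i, j) in edges for d in [(i, j), (j, i)]}

for _ in range(50):                                    # iterate updates
    new = {}
    for (i, j) in m:
        P = psi[(i, j)] if (i, j) in psi else psi[(j, i)].T
        incoming = np.ones(2)                          # from neighbors k != j
        for (k, l) in m:
            if l == i and k != j:
                incoming = incoming * m[(k, l)]
        msg = P.T @ incoming                           # sum out x_i
        new[(i, j)] = msg / msg.sum()
    m = new

bel = np.ones(2)                                       # belief at variable 0
for (k, l) in m:
    if l == 0:
        bel = bel * m[(k, l)]
print(bel / bel.sum())                                 # approximate marginal
```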
Linear Block Codes

[Diagram: input bits A–H; parity bits p1–p6, each the XOR (+) of a subset of the input bits; transmission through a Gaussian channel with noise σ yields received bits a–h and received parity bits.]
Probabilistic Decoding
• Error-correcting linear block code
• State of the art: an approximate algorithm, iterative belief propagation (IBP), i.e., Pearl's poly-tree algorithm applied to loopy networks
MBE-mpe vs. IBP
• MBE-mpe is better on low-w* codes
• IBP (or BP) is better on randomly generated (high-w*) codes
• Measured: bit error rate (BER) as a function of channel noise (σ)
Grid 15x15 - 10 evidence

[Four plots vs. i-bound comparing MC and IBP on Grid 15x15, evid=10, w*=22, 10 instances: NHD, absolute error, relative error, and time (seconds).]
Outline
• Mini-bucket elimination
• Weighted Mini-bucket
• Mini-clustering
• Iterative Belief propagation
• Iterative-join-graph propagation
• Re-parameterization, cost-shifting
Iterative Belief Propagation
• Belief propagation is exact for poly-trees
• IBP: applying BP iteratively to cyclic networks
• No guarantees for convergence
• Works well for many coding networks
• Let's combine the iterative nature of IBP with anytime behavior: IJGP
Iterative Join Graph Propagation
• Loopy Belief Propagation
  – cyclic graphs, iterative
  – converges fast in practice (no guarantees though)
  – very good approximations (e.g., turbo decoding, LDPC codes, SAT survey propagation)
• Mini-Clustering(i)
  – tree decompositions; only two sets of messages (inward, outward)
  – anytime behavior: can improve with more time by increasing the i-bound
• We want to combine:
  – the iterative virtues of loopy BP
  – the anytime behavior of Mini-Clustering(i)
IJGP - The basic idea
• Apply Cluster Tree Elimination to any join-graph
• Commit to graphs that are I-maps
• Avoid cycles as long as I-mapness is not violated
• Result: use minimal arc-labeled join-graphs
Tree Decomposition for Belief Updating

[Belief network over A–G with functions p(a), p(b|a), p(c|a,b), p(d|b), p(f|c,d), p(e|b,f), p(g|e,f)]
Tree Decomposition for Belief Updating

Clusters and their functions, with separators BC, BF, EF:
  ABC:  p(a), p(b|a), p(c|a,b)
  BCDF: p(d|b), p(f|c,d)
  BEF:  p(e|b,f)
  EFG:  p(g|e,f)
CTE: Cluster Tree Elimination

$h_{(1,2)}(b,c) = \sum_a p(a)\,p(b|a)\,p(c|a,b)$
$h_{(2,1)}(b,c) = \sum_{d,f} p(d|b)\,p(f|c,d)\,h_{(3,2)}(b,f)$
$h_{(2,3)}(b,f) = \sum_{c,d} p(d|b)\,p(f|c,d)\,h_{(1,2)}(b,c)$
$h_{(3,2)}(b,f) = \sum_e p(e|b,f)\,h_{(4,3)}(e,f)$
$h_{(3,4)}(e,f) = \sum_b p(e|b,f)\,h_{(2,3)}(b,f)$
$h_{(4,3)}(e,f) = p(G = g_e \mid e,f)$

Time: O(exp(w+1)). Space: O(exp(sep)). For each cluster, P(X|e) is computed, and also P(e).
Tree Decomposition: Definition and Example

A tree decomposition for a belief network BN = ⟨X, D, G, P⟩ is a triple ⟨T, χ, ψ⟩, where T = (V, E) is a tree and χ and ψ are labeling functions associating with each vertex v ∈ V two sets, χ(v) ⊆ X and ψ(v) ⊆ P, satisfying:
1. For each function p_i ∈ P there is exactly one vertex v ∈ V such that p_i ∈ ψ(v), and scope(p_i) ⊆ χ(v).
2. For each variable X_i, the set {v ∈ V | X_i ∈ χ(v)} forms a connected subtree (running intersection property).

Example: the belief network over A–G and its tree decomposition with clusters ABC {p(a), p(b|a), p(c|a,b)}, BCDF {p(d|b), p(f|c,d)}, BEF {p(e|b,f)}, EFG {p(g|e,f)} and separators BC, BF, EF.
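The two conditions translate into a direct checker. A sketch using the example decomposition above (clusters as sets and the tree as an edge list are our encoding choices):

```python
chi = {1: {'A','B','C'}, 2: {'B','C','D','F'}, 3: {'B','E','F'}, 4: {'E','F','G'}}
psi = {1: [{'A'}, {'A','B'}, {'A','B','C'}],   # scopes of p(a), p(b|a), p(c|a,b)
       2: [{'B','D'}, {'C','D','F'}],          # p(d|b), p(f|c,d)
       3: [{'B','E','F'}],                     # p(e|b,f)
       4: [{'E','F','G'}]}                     # p(g|e,f)
tree = [(1, 2), (2, 3), (3, 4)]

# Condition 1: each function sits in one vertex whose chi covers its scope
ok1 = all(scope <= chi[v] for v, scopes in psi.items() for scope in scopes)

def connected(vs):                             # BFS over the induced subgraph
    seen, frontier = {min(vs)}, [min(vs)]
    while frontier:
        u = frontier.pop()
        for a, b in tree:
            for w in ((b,) if a == u else (a,) if b == u else ()):
                if w in vs and w not in seen:
                    seen.add(w)
                    frontier.append(w)
    return seen == vs

# Condition 2: running intersection property for every variable
variables = set().union(*chi.values())
ok2 = all(connected({v for v in chi if x in chi[v]}) for x in variables)
print(ok1, ok2)                                # True True
```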
IJGP - The basic idea
• Apply Cluster Tree Elimination to any join-graph
• Commit to graphs that are I-maps
• Avoid cycles as long as I-mapness is not violated
• Result: use minimal arc-labeled join-graphs
Minimal Arc-Labeled Decomposition

a) Fragment of an arc-labeled join-graph: clusters ABCDE, BCE, CDEF with arc labels BC, CDE, CE.
b) Shrinking the labels makes it a minimal arc-labeled join-graph: labels BC, DE, CE.
• Use a DFS algorithm to eliminate cycles relative to each variable.
Minimal arc-labeled join-graph
Message Propagation

Cluster 1 = ABCDE with functions p(a), p(c), p(b|a,c), p(d|a,b,e), p(e|b,c) and incoming message h_(3,1)(b,c); cluster 2 = CDEF.

Minimal arc-labeled, sep(1,2) = {D,E}, elim(1,2) = {A,B,C}:
$h_{(1,2)}(d,e) = \sum_{a,b,c} p(a)\,p(c)\,p(b|a,c)\,p(d|a,b,e)\,p(e|b,c)\,h_{(3,1)}(b,c)$

Non-minimal arc-labeled, sep(1,2) = {C,D,E}, elim(1,2) = {A,B}:
$h_{(1,2)}(c,d,e) = \sum_{a,b} p(a)\,p(c)\,p(b|a,c)\,p(d|a,b,e)\,p(e|b,c)\,h_{(3,1)}(b,c)$
IJGP - Example

[Belief network over A–J and the corresponding loopy BP (dual) graph with clusters ABC, ABDE, BCE, CDEF, FGH, FGI, GHIJ]
Arc-Minimal Join-Graph

Arcs labeled with any single variable should form a TREE.
[Join-graph with clusters ABC, ABDE, BCE, CDEF, FGH, FGI, GHIJ and arc labels such as AB, BC, BE, C, DE, CE, F, FG, GH, H, GI]
Collapsing Clusters

[Merging clusters, e.g., ABC and ABDE into ABCDE, and FGH and FGI into FGHI, produces a join-graph with fewer, larger clusters]
Join-Graphs

[A spectrum of join-graphs over the same network, from the loopy BP dual graph with many small clusters to a join-tree with large clusters: more accuracy in one direction, less complexity in the other]
Bounded Decompositions
• We want arc-labeled decompositions such that the cluster size (internal width) is bounded by i (the accuracy parameter)
• Possible approaches to build such decompositions:
  – partition-based algorithms, inspired by the mini-bucket decomposition
  – grouping-based algorithms
Constructing Join-Graphs

a) Schematic mini-bucket(i), i = 3:
  G: (GFE)            P(G|F,E)
  E: (EBF) (EF)       P(E|B,F)
  F: (FCD) (BF)       P(F|C,D)
  D: (DB) (CD)        P(D|B)
  C: (CAB) (CB)       P(C|A,B)
  B: (BA) (AB) (B)    P(B|A)
  A: (A)              P(A)

b) Arc-labeled join-graph decomposition: clusters GFE, EBF, FCD, CDB, CAB, BA, A with separators EF, BF, CD, CB, BA, A.
IJGP properties
• IJGP(i) applies BP to a minimal arc-labeled join-graph whose cluster size is bounded by i
• On join-trees, IJGP finds exact beliefs
• IJGP is a Generalized Belief Propagation algorithm (Yedidia, Freeman, Weiss 2001)
• Complexity of one iteration: time O(deg · (n+N) · d^(i+1)), space O(N · d^i)
Empirical evaluation
• Algorithms: Exact, IBP, MC, IJGP
• Measures: absolute error, relative error, Kullback-Leibler (KL) distance, bit error rate, time
• Networks (all variables are binary): random networks, grid networks (MxM), CPCS 54, 360, 422, coding networks
Coding Networks – Bit Error Rate

[Four plots of BER vs. i-bound for IBP, MC, IJGP on coding networks, N=400, 500–1000 instances, 30 iterations, w*=43, at noise levels σ = 0.22, 0.32, 0.51, 0.65]
CPCS 422 – KL Distance

[Two plots of KL distance vs. i-bound for IJGP (30 iterations, at convergence), MC, and IBP (10 iterations, at convergence); CPCS 422, w*=23, one instance; left: evidence=0, right: evidence=30]
CPCS 422 – KL Distance vs. Iterations

[Two plots of KL distance vs. number of iterations for IJGP(3), IJGP(10), and IBP; CPCS 422, w*=23, one instance; left: evidence=0, right: evidence=30]
Coding Networks - Time

[Plot of time (seconds) vs. i-bound for IJGP (30 iterations), MC, and IBP (30 iterations); coding networks, N=400, 500 instances, w*=43]
More on the Power of Belief Propagation
• BP as local minima of the KL distance (read Darwiche)
• BP's power from a constraint propagation perspective
λ is the grounding for evidence e.
Theorem: Yedidia, Freeman and Weiss, 2005
Summary of IJGP so far
Outline
• Mini-bucket elimination
• Weighted Mini-bucket
• Mini-clustering
• Iterative Belief propagation
• Iterative-join-graph propagation
• Re-parameterization, cost-shifting
Cost-Shifting (Reparameterization)

Shift a function λ(B) from f(B,C) to f(A,B): with λ(b) = 3 and λ(g) = -1,

  A B | f(A,B) | f(A,B)+λ(B)      B C | f(B,C) | f(B,C)-λ(B)
  b b |   6    |   9               b b |   6    |   3
  b g |   0    |  -1               b g |   0    |  -3
  g b |   0    |   3               g b |   0    |   1
  g g |   6    |   5               g g |   6    |   7

  A B C | f(A,B,C) = f(A,B) + f(B,C)
  b b b | 12        g b b | 6
  b b g | 6         g b g | 0
  b g b | 0         g g b | 6
  b g g | 6         g g g | 12

Modify the individual functions, but keep the sum of functions the same.
Tightening the Bound
• Reparameterization (or "cost shifting")
• Decrease the bound without changing the overall function

F(A,B,C) = f1(A,B) + f2(B,C):

  A B | f1      B C | f2      A B C | F        A B C | F
  0 0 | 2.0     0 0 | 1.0     0 0 0 | 3.0      1 0 0 | 4.5
  0 1 | 1.0     0 1 | 0.0     0 0 1 | 2.0      1 0 1 | 3.5
  1 0 | 3.5     1 0 | 1.0     0 1 0 | 2.0      1 1 0 | 4.0
  1 1 | 3.0     1 1 | 3.0     0 1 1 | 4.0      1 1 1 | 6.0

Shifting λ(B) between f1 and f2 (adding λ(B) to f1 and subtracting it from f2) leaves F unchanged; the adjusting functions cancel each other. With λ(B=0) = -1 and λ(B=1) = +1, the decomposition bound max f1 + max f2 drops from 6.5 to 6.0 = max F, i.e., the decomposition bound becomes exact.
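The example can be replayed in a few lines; the λ values below are inferred from the slide's "+1 / -1" annotation, and the bound shown is the max-decomposition upper bound:

```python
import numpy as np

f1 = np.array([[2.0, 1.0], [3.5, 3.0]])   # f1[a, b]
f2 = np.array([[1.0, 0.0], [1.0, 3.0]])   # f2[b, c]
F  = f1[:, :, None] + f2[None, :, :]      # F(a,b,c) = f1(a,b) + f2(b,c)

lam = np.array([-1.0, 1.0])               # lambda(B=0) = -1, lambda(B=1) = +1
f1s = f1 + lam[None, :]                   # f1 + lambda(B)
f2s = f2 - lam[:, None]                   # f2 - lambda(B): shifts cancel
Fs  = f1s[:, :, None] + f2s[None, :, :]
print(np.allclose(F, Fs))                 # True: overall function unchanged

print(F.max())                            # 6.0: exact maximum of F
print(f1.max() + f2.max())                # 6.5: decomposition bound before shift
print(f1s.max() + f2s.max())              # 6.0: bound after shift is exact
```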
Dual Decomposition

Pairwise factors $g_{12}(y_1,y_2)$, $g_{13}(y_1,y_3)$, $g_{23}(y_2,y_3)$ on a cycle:
$G^* = \min_y \sum_\beta g_\beta(y) \;\ge\; \sum_\beta \min_y g_\beta(y)$
• Bound the solution using decomposed optimization
• Solve each factor independently: optimistic bound
Dual Decomposition (tightening)

Reparameterization via messages $\mu_{k\to\beta}(y_k)$ with $\sum_{\beta \ni k} \mu_{k\to\beta}(y_k) = 0$ for all $k$:
$G^* = \min_y \sum_\beta g_\beta(y) \;\ge\; \max_\mu \sum_\beta \min_y \Big[ g_\beta(y) + \sum_{j\in\beta} \mu_{j\to\beta}(y_j) \Big]$
• Bound the solution using decomposed optimization
• Solve independently: optimistic bound
• Tighten the bound by reparameterization: enforce the lost equality constraints via Lagrange multipliers
Dual Decomposition

Many names for the same class of bounds:
• Dual decomposition [Komodakis et al. 2007]
• TRW, MPLP [Wainwright et al. 2005; Globerson & Jaakkola 2007]
• Soft arc consistency [Cooper & Schiex 2004]
• Max-sum diffusion [Werner 2007]
Dual Decomposition

Many ways to optimize the bound:
• Sub-gradient descent [Komodakis et al. 2007; Jojic et al. 2010]
• Coordinate descent [Werner 2007; Globerson & Jaakkola 2007; Sontag et al. 2009; Ihler et al. 2012]
• Proximal optimization [Ravikumar et al. 2010]
• ADMM [Meshi & Globerson 2011; Martins et al. 2011; Forouzan & Ihler 2013]
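As one instance of the first bullet, a projected sub-gradient sketch for the triangle example (step sizes, iteration count, and the random tables are illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(3)
g = {b: rng.integers(0, 10, (2, 2)).astype(float)      # g_beta[y_i, y_j]
     for b in [(0, 1), (0, 2), (1, 2)]}

G_star = min(g[(0, 1)][y0, y1] + g[(0, 2)][y0, y2] + g[(1, 2)][y1, y2]
             for y0 in (0, 1) for y1 in (0, 1) for y2 in (0, 1))

mu = {(k, b): np.zeros(2) for b in g for k in b}       # sum_b mu[k, b] = 0

def reparam(b):                                        # g_beta plus its messages
    return g[b] + mu[(b[0], b)][:, None] + mu[(b[1], b)][None, :]

for t in range(1, 201):
    argmins = {b: np.unravel_index(np.argmin(reparam(b)), (2, 2)) for b in g}
    for b in g:                                        # supergradient ascent step
        for pos, k in enumerate(b):
            grad = np.zeros(2)
            grad[argmins[b][pos]] = 1.0
            mu[(k, b)] += grad / t
    for k in (0, 1, 2):                                # project onto sum-to-zero
        bs = [b for b in g if k in b]
        mean = sum(mu[(k, b)] for b in bs) / len(bs)
        for b in bs:
            mu[(k, b)] -= mean

bound = sum(reparam(b).min() for b in g)               # dual (lower) bound
print(bound, "<=", G_star)
```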