Introduction Dual Decomposition Experimental Results Conclusions Dual Decomposition for Marginal Inference Justin Domke Rochester Institute of Technology AAAI 2011
Outline
• Introduction: Graphical Models, Motivation, Wait a Second
• Dual Decomposition: Dual Decomposition in General, Dual Decomposition for Marginal Inference
• Experimental Results: Experiments
• Conclusions: Conclusions
Graphical Models
• Markov Random Field / Factor Graph: $p(x) \propto \prod_c \psi(x_c)$
Graphical Models
• Example cliques: $c_1 = \{1, 2, 3\}$, $c_2 = \{3, 4\}$, $c_3 = \{4, 5, 6\}$
• $p(x) \propto \prod_c \psi(x_c) = \psi(x_1, x_2, x_3)\, \psi(x_3, x_4)\, \psi(x_4, x_5, x_6)$
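The factorization in the example above can be sketched in a few lines. This is only an illustration: the binary state space, the random factor tables, and the names `cliques`, `factors`, and `unnormalized_p` are assumptions made here, not part of the talk.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# 0-based versions of c1 = {1,2,3}, c2 = {3,4}, c3 = {4,5,6}.
cliques = [(0, 1, 2), (2, 3), (3, 4, 5)]

# Hypothetical positive factor tables over binary variables (assumed values).
factors = [rng.uniform(0.5, 2.0, size=(2,) * len(c)) for c in cliques]

def unnormalized_p(x):
    """Product of psi(x_c) over all cliques c for a full assignment x."""
    value = 1.0
    for c, psi in zip(cliques, factors):
        value *= psi[tuple(x[i] for i in c)]
    return value

# Normalizing constant by explicit enumeration of all 2^6 assignments.
Z = sum(unnormalized_p(x) for x in itertools.product(range(2), repeat=6))

x = (0, 1, 1, 0, 1, 0)
print(unnormalized_p(x) / Z)  # p(x) for one assignment
```

Enumerating all assignments is only feasible because the example has six binary variables; the cost grows exponentially with the number of variables.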
Marginal Inference
• Want to recover $p(X_i = x_i)$.
• Define $\hat{p}(x) = \prod_c \psi(x_c)$.
• Brute-force sum: $P(X_i = x_i) = \frac{1}{Z} \sum_{x_1} \cdots \sum_{x_{i-1}} \sum_{x_{i+1}} \cdots \sum_{x_M} \hat{p}(x)$, where $Z = \sum_{x_1} \cdots \sum_{x_M} \hat{p}(x)$.
• On trees, can do the sums quickly by dynamic programming: sum-product algorithm / belief propagation.
• On general graphs: #P-hard.
• Approximate: tree-reweighted belief propagation (TRW).
• This paper: same approximation as TRW, different algorithm.
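To make the contrast between the brute-force sum and dynamic programming concrete, here is a minimal sketch on a chain (the simplest tree). The chain structure, the pairwise-only factors, and the function names are assumptions chosen for illustration; this is exact sum-product on a chain, not TRW and not the method of the talk.

```python
import itertools
import numpy as np

def brute_force_marginals(pair_factors, K):
    """Marginals by summing p_hat(x) over all K^M assignments."""
    M = len(pair_factors) + 1
    marg = np.zeros((M, K))
    for x in itertools.product(range(K), repeat=M):
        p_hat = 1.0
        for i, psi in enumerate(pair_factors):
            p_hat *= psi[x[i], x[i + 1]]
        for i in range(M):
            marg[i, x[i]] += p_hat
    # Each row sums to Z, so dividing by the row sum normalizes.
    return marg / marg.sum(axis=1, keepdims=True)

def chain_sum_product(pair_factors, K):
    """Same marginals via forward/backward messages, O(M K^2) work."""
    M = len(pair_factors) + 1
    fwd = [np.ones(K) for _ in range(M)]
    bwd = [np.ones(K) for _ in range(M)]
    for i in range(1, M):
        fwd[i] = fwd[i - 1] @ pair_factors[i - 1]  # sums out x_{i-1}
        fwd[i] /= fwd[i].sum()                     # rescale for stability
    for i in range(M - 2, -1, -1):
        bwd[i] = pair_factors[i] @ bwd[i + 1]      # sums out x_{i+1}
        bwd[i] /= bwd[i].sum()
    marg = np.array([f * b for f, b in zip(fwd, bwd)])
    return marg / marg.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
chain = [rng.uniform(0.5, 2.0, size=(2, 2)) for _ in range(5)]
print(np.abs(brute_force_marginals(chain, 2) - chain_sum_product(chain, 2)).max())
```

The two routines agree up to floating-point error, but only the second remains tractable as the chain grows.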
Motivation
• TRW convergence rates can be very slow.
• If lucky, TRW = block coordinate ascent on the dual.
• TRW may fail to converge.
• Damping converges in practice, but more slowly.
• Recent alternatives guarantee convergence [Hazan & Shashua 2009, Globerson & Jaakkola 2007b].
• Not claimed to be faster than TRW. TRW-S [Meltzer et al. 2009] is an exception.
• This paper: use a quasi-Newton method on the dual.
• Line searches guarantee convergence.
• Hopefully, faster convergence.
Ising Model
• $x_i \in \{-1, +1\}$
• $p(x) \propto \prod_{ij} \exp\big(\theta(x_i, x_j)\big) \prod_i \exp\big(\theta(x_i)\big)$
• $\theta(x_i) = \alpha_F x_i$, $\alpha_F \in [-1, +1]$
• $\theta(x_i, x_j) = \alpha_I x_i x_j$, $\alpha_I \in [0, T]$ for various $T$
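A test problem of this form could be generated roughly as follows. The slide only gives the parameter ranges, so the grid topology, the uniform sampling, and the names `random_ising_grid` and `log_potential` are assumptions made for this sketch.

```python
import numpy as np

def random_ising_grid(n, T, seed=0):
    """Random Ising parameters on an n x n grid (assumed topology)."""
    rng = np.random.default_rng(seed)
    # alpha_F in [-1, +1], one per node (uniform draw is an assumption).
    field = rng.uniform(-1.0, 1.0, size=(n, n))
    edges = []
    for r in range(n):
        for c in range(n):
            if c + 1 < n:
                edges.append(((r, c), (r, c + 1)))
            if r + 1 < n:
                edges.append(((r, c), (r + 1, c)))
    # alpha_I in [0, T], one per edge (uniform draw is an assumption).
    coupling = {e: rng.uniform(0.0, T) for e in edges}
    return field, coupling

def log_potential(x, field, coupling):
    """sum_i theta(x_i) + sum_ij theta(x_i, x_j) for x in {-1,+1}^(n x n)."""
    total = float(np.sum(field * x))
    for (a, b), alpha in coupling.items():
        total += alpha * x[a] * x[b]
    return total

field, coupling = random_ising_grid(n=4, T=3.0)
x = np.ones((4, 4))
print(log_potential(x, field, coupling))
```

Larger `T` allows stronger couplings between neighbouring variables.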
$\theta(x_i, x_j) = \alpha_I x_i x_j$, $\alpha_I \in [0, 1]$
[Figure: convergence of TRW and dual decomposition. Vertical axis: $|\mu - \mu^*|_\infty$, log scale from $10^{-6}$ to $10^{0}$. Horizontal axis: iterations, 0 to 100.]
$\theta(x_i, x_j) = \alpha_I x_i x_j$, $\alpha_I \in [0, 3]$
[Figure: convergence of TRW and dual decomposition. Vertical axis: $|\mu - \mu^*|_\infty$, log scale from $10^{-6}$ to $10^{0}$. Horizontal axis: iterations, 0 to 10000.]
$\theta(x_i, x_j) = \alpha_I x_i x_j$, $\alpha_I \in [0, 5]$
[Figure: convergence of TRW and dual decomposition. Vertical axis: $|\mu - \mu^*|_\infty$, log scale from $10^{-6}$ to $10^{0}$. Horizontal axis: iterations, 0 to 10000.]
Wait a Second
Question: Why should I care about very accurately computing approximate marginals?
Answer: You might not.
One reason to care:
• The number of iterations TRW needs for reasonable results is not easy to predict.
Why I Care
Want to fit a CRF with some loss $L(\theta) = M(\mu(\theta))$.
Algorithm (Domke, 2010):
1. Get $\mu$ by running TRW with parameters $\theta$.
2. Compute $\frac{dM(\mu)}{d\mu}$.
3. Get $\mu^+$ by running TRW with parameters $\theta + r \frac{dM}{d\mu}$.
4. $\frac{dL}{d\theta} \approx \frac{1}{r} \left( \mu^+ - \mu \right)$
Strong convergence is needed for the difference $\mu^+ - \mu$ to be meaningful.
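The four steps above can be sketched on a toy problem. Note the substitution: this sketch replaces TRW with exact inference by enumeration on three $\pm 1$ variables, so it only illustrates the perturbation idea. The feature layout, the squared-error choice of $M$, and all names are assumptions for illustration.

```python
import itertools
import numpy as np

# All 8 states of three +/-1 variables, with singleton and pairwise features.
STATES = np.array(list(itertools.product([-1, 1], repeat=3)), dtype=float)
PAIRS = [(0, 1), (1, 2), (0, 2)]
FEATURES = np.hstack(
    [STATES, np.stack([STATES[:, i] * STATES[:, j] for i, j in PAIRS], axis=1)]
)  # shape (8, 6)

def marginals(theta):
    """Exact mean parameters mu(theta) = E[features]; stands in for TRW."""
    scores = FEATURES @ theta
    p = np.exp(scores - scores.max())
    p /= p.sum()
    return FEATURES.T @ p

def loss_of_mu(mu, target):
    """Hypothetical loss M(mu): squared error to target mean parameters."""
    return 0.5 * np.sum((mu - target) ** 2)

def perturbation_gradient(theta, target, r=1e-5):
    mu = marginals(theta)                     # step 1
    dM_dmu = mu - target                      # step 2 (gradient of M)
    mu_plus = marginals(theta + r * dM_dmu)   # step 3
    return (mu_plus - mu) / r                 # step 4

rng = np.random.default_rng(0)
theta = rng.normal(size=6)
target = marginals(rng.normal(size=6))
print(perturbation_gradient(theta, target))
```

Because the estimate divides a small difference $\mu^+ - \mu$ by a small $r$, any inference error that does not cancel between the two runs is amplified by $1/r$, which is why loosely converged marginals are not enough here.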