Marginal Inference in MRFs using Frank-Wolfe
David Belanger, Daniel Sheldon, Andrew McCallum
School of Computer Science, University of Massachusetts, Amherst
{belanger, sheldon, mccallum}@cs.umass.edu
December 10, 2013
Table of Contents
1. Markov Random Fields
2. Frank-Wolfe for Marginal Inference
3. Optimality Guarantees and Convergence Rate
4. Beyond MRFs
5. Fancier FW
Markov Random Fields

\Phi_\theta(x) = \sum_{c \in \mathcal{C}} \theta_c(x_c)

P(x) = \exp\big(\Phi_\theta(x) - \log Z\big)

Overcomplete representation: x \to \mu, so that \Phi_\theta(x) \to \langle \theta, \mu \rangle.
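To make the overcomplete representation concrete, here is a minimal sketch; the chain structure, binary variables, and all names are illustrative assumptions, not from the slides:

```python
import numpy as np

# A 3-node binary chain MRF with one potential table per edge clique.
theta = [np.random.randn(2, 2) for _ in range(2)]

def score(x, theta):
    """Phi_theta(x) = sum_{c in C} theta_c(x_c), summing over edge cliques."""
    return sum(t[x[i], x[i + 1]] for i, t in enumerate(theta))

def overcomplete(x, theta):
    """mu(x): one indicator per clique configuration, so that
    Phi_theta(x) = <theta, mu(x)>."""
    mu = [np.zeros_like(t) for t in theta]
    for i, m in enumerate(mu):
        m[x[i], x[i + 1]] = 1.0
    return mu

x = (0, 1, 1)
mu = overcomplete(x, theta)
assert np.isclose(score(x, theta), sum((t * m).sum() for t, m in zip(theta, mu)))
```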
Marginal Inference

\mu_{\text{MARG}} = \mathbb{E}_{P_\theta}[\mu]

Exact variational form: \mu_{\text{MARG}} = \arg\max_{\mu \in \mathcal{M}} \langle \mu, \theta \rangle + H_{\mathcal{M}}(\mu)

Tractable approximation over the local polytope: \mu_{\text{approx}} = \arg\max_{\mu \in \mathcal{L}} \langle \mu, \theta \rangle + H_B(\mu), where

H_B(\mu) = \sum_{c \in \mathcal{C}} W_c H(\mu_c)
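A sketch of evaluating H_B from clique marginals; the function names, counting numbers, and toy marginals below are assumptions for illustration:

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(p) = -sum_i p_i log p_i, with 0 log 0 = 0."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def weighted_clique_entropy(clique_marginals, weights):
    """H_B(mu) = sum_{c in C} W_c H(mu_c): a weighted sum of clique entropies.
    The counting numbers W_c depend on the approximation (e.g., Bethe on a
    pairwise graph: 1 per edge, 1 - degree(i) per node)."""
    return sum(w * entropy(m.ravel()) for m, w in zip(clique_marginals, weights))

# Hypothetical example: one edge marginal plus its two node marginals.
mu_edge = np.array([[0.3, 0.2], [0.1, 0.4]])
mu_nodes = [mu_edge.sum(axis=1), mu_edge.sum(axis=0)]
H_B = weighted_clique_entropy([mu_edge] + mu_nodes, weights=[1.0, 0.0, 0.0])
```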
MAP Inference

\mu_{\text{MAP}} = \arg\max_{\mu \in \mathcal{M}} \langle \mu, \theta \rangle

[Diagram: a black-box MAP solver maps \theta to \mu_{\text{MAP}}; a gray-box MAP solver does the same.]
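A toy stand-in for the black-box MAP oracle; this brute-force sketch is an assumption for illustration, and a real solver (ILP, graph cuts, max-product) would take its place:

```python
import itertools
import numpy as np

def map_oracle_bruteforce(theta, n_nodes):
    """Enumerate all assignments of a binary chain and return the best one.
    Exponential in n_nodes, so usable only on toy problems."""
    best_x, best_score = None, -np.inf
    for x in itertools.product([0, 1], repeat=n_nodes):
        s = sum(t[x[i], x[i + 1]] for i, t in enumerate(theta))
        if s > best_score:
            best_x, best_score = x, s
    return best_x
```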
Marginal → MAP Reductions

Hazan and Jaakkola [2012]; Ermon et al. [2013]
Frank-Wolfe for Marginal Inference
Generic FW with Line Search

y_t = \arg\min_{x \in \mathcal{X}} \langle x, -\nabla f(x_{t-1}) \rangle

\gamma_t = \arg\max_{\gamma \in [0,1]} f\big((1 - \gamma)\, x_{t-1} + \gamma\, y_t\big), \qquad x_t = (1 - \gamma_t)\, x_{t-1} + \gamma_t\, y_t
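A minimal sketch of generic FW with line search; the grid-based line search and the toy simplex problem below are assumptions for illustration:

```python
import numpy as np

def frank_wolfe(x0, f, grad_f, lmo, n_iters=100):
    """Generic FW for maximizing a concave f over a polytope, given an
    oracle lmo(g) returning argmax_{x in X} <x, g> (equivalently,
    argmin_{x in X} <x, -g>)."""
    x = x0
    gammas = np.linspace(0.0, 1.0, 101)
    for _ in range(n_iters):
        y = lmo(grad_f(x))  # linear oracle: best vertex for the current gradient
        vals = [f((1 - g) * x + g * y) for g in gammas]
        g_best = gammas[int(np.argmax(vals))]
        x = (1 - g_best) * x + g_best * y
    return x

# Toy usage: maximize <x, theta> + H(x) over the probability simplex.
theta = np.array([1.0, 0.5, -0.2])
safe = lambda x: np.clip(x, 1e-12, 1.0)
f = lambda x: x @ theta - np.sum(x * np.log(safe(x)))
grad_f = lambda x: theta - (1.0 + np.log(safe(x)))
lmo = lambda g: np.eye(3)[int(np.argmax(g))]  # vertices of the simplex
x_star = frank_wolfe(np.ones(3) / 3, f, grad_f, lmo)
```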
[Diagram: the FW loop: compute the gradient, pass -\nabla f(x_{t-1}) to the linear minimization oracle to obtain y_t, then line search to obtain x_t.]
FW for Marginal Inference

[Diagram: compute the gradient \tilde\theta = \nabla F(\mu_t) = \theta + \nabla H(\mu_t), call a MAP inference oracle on \tilde\theta to obtain \tilde\mu_{\text{MAP}}, then line search to obtain \mu_{t+1}.]
Subproblem Parametrization

F(\mu) = \langle \mu, \theta \rangle + \sum_{c \in \mathcal{C}} W_c H(\mu_c)

\tilde\theta = \nabla F(\mu_t) = \theta + \sum_{c \in \mathcal{C}} W_c \nabla H(\mu_c)
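A sketch of how the gradient yields perturbed potentials for the MAP oracle, using \nabla H(\mu) = -(1 + \log \mu) elementwise for H(\mu) = -\sum \mu \log \mu; names and shapes are assumptions:

```python
import numpy as np

def perturbed_potentials(theta, clique_marginals, weights):
    """theta_tilde = theta + sum_c W_c * grad H(mu_c), clique table by clique
    table. The resulting theta_tilde is what gets handed to the MAP oracle."""
    return [t + w * (-(1.0 + np.log(np.clip(m, 1e-12, 1.0))))
            for t, m, w in zip(theta, clique_marginals, weights)]
```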
Line Search

[Figure: line search along the segment from \mu_t to \tilde\mu_{\text{MAP}}, producing \mu_{t+1}.]

Computing the line search objective can scale with:
Bad: # possible values in cliques.
Good: # cliques in graph. (See paper.)
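A naive sketch of the decomposable line-search objective; note that this version still touches every value in every clique table (the "bad" scaling above), whereas the paper's method avoids that cost. All names are assumptions:

```python
import numpy as np

def line_search_objective(gamma, mu_t, mu_map, theta, weights):
    """F((1 - gamma) mu_t + gamma mu_map), evaluated clique by clique, since
    F(mu) = <mu, theta> + sum_c W_c H(mu_c) decomposes over cliques."""
    total = 0.0
    for t, m0, m1, w in zip(theta, mu_t, mu_map, weights):
        m = (1 - gamma) * m0 + gamma * m1
        total += np.sum(m * t)
        p = m[m > 0]
        total -= w * np.sum(p * np.log(p))
    return total

def best_gamma(mu_t, mu_map, theta, weights, n=101):
    """Crude grid search over gamma; a real implementation might instead use
    golden-section or Newton steps on this one-dimensional problem."""
    gammas = np.linspace(0.0, 1.0, n)
    vals = [line_search_objective(g, mu_t, mu_map, theta, weights) for g in gammas]
    return gammas[int(np.argmax(vals))]
```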
Experiment #1

[Results figure not recovered in extraction.]
Optimality Guarantees and Convergence Rate
Convergence Rate

Convergence rate of Frank-Wolfe [Jaggi, 2013]:

F(\mu^*) - F(\mu_t) \le \frac{2 C_F}{t + 2} (1 + \delta)

\delta: the MAP suboptimality at iteration t; exact MAP is NP-hard. C_F: the curvature constant of F (defined on the next slide).

How to deal with MAP hardness?
Use a MAP solver and hope for the best [Hazan and Jaakkola, 2012].
Relax to the local polytope.
Curvature + Convergence Rate

C_f = \sup_{x, s \in \mathcal{D};\ \gamma \in [0,1];\ y = x + \gamma (s - x)} \frac{2}{\gamma^2} \big( f(y) - f(x) - \langle y - x, \nabla f(x) \rangle \big)

[Plot: binary entropy as a function of P(x = 1), with \mu_t, \mu_{t+1}, and \tilde\mu_{\text{MAP}} marked.]
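To see why the entropy term is problematic, one can evaluate the quantity inside this sup numerically as the iterate approaches the boundary of the simplex, where the entropy gradient blows up. A sketch with illustrative numbers:

```python
import numpy as np

def curvature_term(f, grad_f, x, s, gamma):
    """One term inside the sup defining C_f:
    (2 / gamma^2) (f(y) - f(x) - <y - x, grad f(x)>), with y = x + gamma (s - x)."""
    y = x + gamma * (s - x)
    return (2.0 / gamma**2) * (f(y) - f(x) - (y - x) @ grad_f(x))

# Negative entropy on the binary simplex; its gradient is unbounded at the boundary.
f = lambda p: np.sum(p * np.log(p))
grad_f = lambda p: 1.0 + np.log(p)
s = np.array([1.0, 0.0])  # a vertex, i.e. a MAP-like FW direction
for eps in [1e-1, 1e-3, 1e-6]:
    x = np.array([eps, 1.0 - eps])  # iterate approaching the boundary
    print(eps, curvature_term(f, grad_f, x, s, gamma=0.5))
# The printed values grow without bound as eps -> 0, so the sup, C_f, is infinite.
```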
Experiment #2

[Results figure not recovered in extraction.]
Beyond MRFs
Beyond MRFs

Question: Are MRFs the right Gibbs distribution on which to use Frank-Wolfe?

Problem Family                      MAP Algorithm         Marginal Algorithm
tree-structured graphical models    Viterbi               Forward-Backward
loopy graphical models              Max-Product BP        Sum-Product BP
Directed Spanning Tree              Chu-Liu-Edmonds       Matrix Tree Theorem
Bipartite Matching                  Hungarian Algorithm   ×
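As a concrete instance of the marginal-algorithm column, here is a sketch of edge marginals for the spanning tree family via the Matrix Tree Theorem, written for the undirected case for simplicity (the table above lists the directed variant) using the standard effective-resistance identity P(e \in T) = w_e R_e; all names are illustrative:

```python
import numpy as np

def spanning_tree_edge_marginals(W):
    """For an undirected graph with symmetric weight matrix W, the marginal
    P(edge (i, j) in a random weighted spanning tree T) equals
    w_ij * R_ij, where R_ij is the effective resistance computed from the
    pseudoinverse of the graph Laplacian."""
    L = np.diag(W.sum(axis=1)) - W
    Lp = np.linalg.pinv(L)
    n = W.shape[0]
    P = np.zeros_like(W)
    for i in range(n):
        for j in range(n):
            if W[i, j] > 0:
                P[i, j] = W[i, j] * (Lp[i, i] + Lp[j, j] - 2 * Lp[i, j])
    return P

# Sanity check on a triangle with unit weights: each edge marginal is 2/3.
W = np.ones((3, 3)) - np.eye(3)
print(spanning_tree_edge_marginals(W))
```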
Fancier FW
Norm-Regularized Marginal Inference

\mu_{\text{MARG}} = \arg\max_{\mu \in \mathcal{M}} \langle \mu, \theta \rangle + H_{\mathcal{M}}(\mu) + \lambda R(\mu)

Harchaoui et al. [2013].

Local linear oracle for MRFs?

\tilde\mu_t = \arg\max_{\mu \in \mathcal{M} \cap B_r(\mu_t)} \langle \mu, \theta \rangle

Garber and Hazan [2013].
Conclusion

We need to figure out how to handle the entropy gradient.

There are plenty of extensions to other Gibbs distributions and regularizers.
Further Reading

Stefano Ermon, Carla Gomes, Ashish Sabharwal, and Bart Selman. Taming the curse of dimensionality: Discrete integration by hashing and optimization. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 334–342, 2013.

D. Garber and E. Hazan. A linearly convergent conditional gradient algorithm with applications to online and stochastic optimization. arXiv preprint, January 2013.

Zaid Harchaoui, Anatoli Juditsky, and Arkadi Nemirovski. Conditional gradient algorithms for norm-regularized smooth convex optimization. arXiv preprint arXiv:1302.2325, 2013.

Tamir Hazan and Tommi S. Jaakkola. On the partition function and random maximum a-posteriori perturbations. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pages 991–998, 2012.

Bert Huang and Tony Jebara. Approximating the permanent with belief propagation. arXiv preprint arXiv:0908.1769, 2009.

Mark Huber. Exact sampling from perfect matchings of dense regular bipartite graphs. Algorithmica, 44(3):183–193, 2006.

Martin Jaggi. Revisiting Frank-Wolfe: projection-free sparse convex optimization. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 427–435, 2013.

James Petterson, Tiberio Caetano, Julian McAuley, and Jin Yu. Exponential family graph matching and ranking. 2009.

Tim Roughgarden and Michael Kearns. Marginals-to-models reducibility. In Advances in Neural Information Processing Systems, pages 1043–1051, 2013.

Maksims Volkovs and Richard S. Zemel. Efficient sampling for bipartite matching problems. In Advances in Neural Information Processing Systems, pages 1322–1330, 2012.

Pascal O. Vontobel. The Bethe permanent of a non-negative matrix. In 48th Annual Allerton Conference on Communication, Control, and Computing, pages 341–346. IEEE, 2010.
Finding the Marginal Matching

Sampling: expensive, but doable [Huber, 2006; Volkovs and Zemel, 2012].