A whole family of non-exact examples
(Figure: four-node cycle 1-2-3-4, with parameter α at nodes 1, 4 and β at nodes 2, 3.)
Node potentials: θ_s(x_s) = α x_s if s = 1 or s = 4; θ_s(x_s) = β x_s if s = 2 or s = 3.
Edge potentials: θ_st(x_s, x_t) = −γ if x_s ≠ x_t, and 0 otherwise.
• for γ sufficiently large, the optimal solution is always either 1⁴ = (1, 1, 1, 1) or (−1)⁴ = (−1, −1, −1, −1)
• first-order LP relaxation is always exact for this problem
• max-product and LP relaxation give different decision boundaries:
Optimal/LP boundary: x̂ = 1⁴ if 0.25 α + 0.25 β ≥ 0, and x̂ = (−1)⁴ otherwise.
Max-product boundary: x̂ = 1⁴ if 0.2393 α + 0.2607 β ≥ 0, and x̂ = (−1)⁴ otherwise.
§ 3. A more general class of algorithms
• by introducing weights on edges, obtain a more general family of reweighted max-product algorithms
• with suitable edge weights, connected to linear programming relaxations
• many variants of these algorithms:
◮ tree-reweighted max-product (W., Jaakkola & Willsky, 2002, 2005)
◮ sequential TRMP (Kolmogorov, 2005)
◮ convex message-passing (Weiss et al., 2007)
◮ dual updating schemes (e.g., Globerson & Jaakkola, 2007)
Tree-reweighted max-product algorithms (Wainwright, Jaakkola & Willsky, 2002)
Message update from node t to node s:
M_ts(x_s) ← κ max_{x'_t ∈ X_t} exp( θ_st(x_s, x'_t)/ρ_st + θ_t(x'_t) ) · [ ∏_{v ∈ N(t)\s} M_vt(x'_t)^{ρ_vt} ] / [ M_st(x'_t)^{(1 − ρ_ts)} ]
(reweighted edge potential; reweighted incoming messages in the numerator; opposite-direction message in the denominator)
Properties:
1. Modified updates remain distributed and purely local over the graph.
2. Key differences:
• Messages are reweighted with ρ_st ∈ [0, 1].
• Potential on edge (s, t) is rescaled by ρ_st ∈ [0, 1].
• Update involves the reverse-direction message M_st.
3. The choice ρ_st = 1 for all edges (s, t) recovers the standard update.
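To make the update concrete, here is a minimal sketch of reweighted max-product in Python. The 3-cycle, the random potentials, and the uniform weights ρ_st = 2/3 (the edge appearance probabilities of a single cycle, computed on the next slide) are illustrative assumptions, not part of the algorithm's specification.

```python
import numpy as np

def edge_theta(theta_edge, s, t):
    """Return the edge potential as a matrix indexed [x_s, x_t]."""
    return theta_edge[(s, t)] if (s, t) in theta_edge else theta_edge[(t, s)].T

def trw_max_product_update(messages, theta_node, theta_edge, rho, neighbors, t, s):
    """One reweighted max-product update for the directed message t -> s."""
    th_st = edge_theta(theta_edge, s, t)                  # th_st[x_s, x_t]
    new_msg = np.zeros(len(theta_node[s]))
    for xs in range(len(theta_node[s])):
        # rescaled edge potential plus node potential, as a function of x_t
        vals = np.exp(th_st[xs, :] / rho[(t, s)] + theta_node[t])
        for v in neighbors[t]:
            if v != s:                                    # reweighted incoming messages
                vals *= messages[(v, t)] ** rho[(v, t)]
        vals /= messages[(s, t)] ** (1.0 - rho[(t, s)])   # opposite-direction message
        new_msg[xs] = vals.max()
    return new_msg / new_msg.sum()                        # kappa: any normalization

# tiny demo on a 3-cycle with binary states (all values illustrative)
nodes, m = [0, 1, 2], 2
edges = [(0, 1), (1, 2), (0, 2)]
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
rng = np.random.default_rng(0)
theta_node = {v: rng.standard_normal(m) for v in nodes}
theta_edge = {e: rng.standard_normal((m, m)) for e in edges}
rho = {(s, t): 2.0 / 3.0 for (s, t) in edges}
rho.update({(t, s): r for (s, t), r in list(rho.items())})
messages = {(s, t): np.ones(m) / m for (s, t) in rho}

for _ in range(50):                                       # parallel message schedule
    messages = {(t, s): trw_max_product_update(messages, theta_node, theta_edge,
                                               rho, neighbors, t, s)
                for (t, s) in messages}
print(messages)
```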
Edge appearance probabilities
Experiment: What is the probability ρ_e that a given edge e ∈ E belongs to a tree T drawn randomly under ρ?
(Figure: (a) original single-cycle graph with edges b, e, f; (b)-(d) its three spanning trees T_1, T_2, T_3, each with ρ(T_i) = 1/3.)
In this example: ρ_b = 1; ρ_e = 2/3; ρ_f = 1/3.
The vector ρ = {ρ_e | e ∈ E} must belong to the spanning tree polytope. (Edmonds, 1971)
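For a small graph, appearance probabilities under the uniform distribution over spanning trees (one valid choice of ρ) can be computed by brute-force enumeration. The sketch below does so for an assumed triangle-plus-bridge graph, recovering ρ = 1 for the bridge and ρ = 2/3 for each triangle edge.

```python
from itertools import combinations

def spanning_trees(nodes, edges):
    """Yield all spanning trees, checking every (|V|-1)-subset of edges."""
    for subset in combinations(edges, len(nodes) - 1):
        parent = {v: v for v in nodes}
        def find(v):
            while parent[v] != v:
                parent[v] = parent[parent[v]]   # path halving
                v = parent[v]
            return v
        acyclic = True
        for (u, v) in subset:
            ru, rv = find(u), find(v)
            if ru == rv:                        # this subset contains a cycle
                acyclic = False
                break
            parent[ru] = rv
        if acyclic:
            yield subset

nodes = [1, 2, 3, 4]
edges = [(1, 2), (2, 3), (3, 1), (1, 4)]        # triangle plus a bridge (illustrative)
trees = list(spanning_trees(nodes, edges))
for e in edges:
    rho_e = sum(e in t for t in trees) / len(trees)
    print(e, rho_e)                             # bridge: 1.0, triangle edges: 2/3
```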
§ 4. Reweighted max-product and linear programming
MAP as integer program:
f* = max_{x ∈ X^N} { Σ_{s ∈ V} θ_s(x_s) + Σ_{(s,t) ∈ E} θ_st(x_s, x_t) }
define local marginal distributions (e.g., for m = 3 states):
μ_s(x_s) = [ μ_s(0), μ_s(1), μ_s(2) ]
μ_st(x_s, x_t) = [ μ_st(0,0) μ_st(0,1) μ_st(0,2) ; μ_st(1,0) μ_st(1,1) μ_st(1,2) ; μ_st(2,0) μ_st(2,1) μ_st(2,2) ]
alternative formulation of MAP as linear program?
g* = max_{(μ_s, μ_st) ∈ M(G)} { Σ_{s ∈ V} E_{μ_s}[θ_s(x_s)] + Σ_{(s,t) ∈ E} E_{μ_st}[θ_st(x_s, x_t)] }
Local expectations: E_{μ_s}[θ_s(x_s)] := Σ_{x_s} μ_s(x_s) θ_s(x_s).
Key question: What constraints must local marginals {μ_s, μ_st} satisfy?
Marginal polytopes for general undirected models
M(G) ≡ set of all globally realizable marginals {μ_s, μ_st}:
M(G) = { μ ∈ R^d | μ_s(x_s) = Σ_{x_t, t ≠ s} p_μ(x) and μ_st(x_s, x_t) = Σ_{x_u, u ≠ s,t} p_μ(x) for some p_μ(·) over (X_1, ..., X_N) ∈ {0, 1, ..., m−1}^N }
(Figure: the polytope M(G), with a supporting half-space a_i^T μ ≤ b_i.)
• polytope in d = m|V| + m²|E| dimensions (m per vertex, m² per edge)
• with m^N vertices
• number of facets?
Marginal polytope for trees
M(T) ≡ special case of marginal polytope for tree T
local marginal distributions on nodes/edges as on the previous slide (e.g., m = 3: the vector μ_s and the 3 × 3 matrix μ_st)
Deep fact about tree-structured models: If {μ_s, μ_st} are non-negative and locally consistent:
Normalization: Σ_{x_s} μ_s(x_s) = 1
Marginalization: Σ_{x'_t} μ_st(x_s, x'_t) = μ_s(x_s),
then on any tree-structured graph T, they are globally consistent.
Follows from junction tree theorem (Lauritzen & Spiegelhalter, 1988).
Max-product on trees: Linear program solver
MAP problem as a simple linear program:
f(x̂) = max_{μ ∈ M(T)} { Σ_{s ∈ V} E_{μ_s}[θ_s(x_s)] + Σ_{(s,t) ∈ E} E_{μ_st}[θ_st(x_s, x_t)] }
subject to μ in the tree marginal polytope:
M(T) = { μ ≥ 0 | Σ_{x_s} μ_s(x_s) = 1, Σ_{x'_t} μ_st(x_s, x'_t) = μ_s(x_s) }.
Max-product and LP solving: on tree-structured graphs, max-product is a dual algorithm for solving the tree LP. (Wai. & Jordan, 2003)
max-product message M_ts(x_s) ≡ Lagrange multiplier for enforcing the constraint Σ_{x'_t} μ_st(x_s, x'_t) = μ_s(x_s).
Tree-based relaxation for graphs with cycles
Set of locally consistent pseudomarginals for a general graph G:
L(G) = { τ ∈ R^d | τ ≥ 0, Σ_{x_s} τ_s(x_s) = 1, Σ_{x'_t} τ_st(x_s, x'_t) = τ_s(x_s) }.
(Figure: a polytope whose integral vertices belong to M(G), with additional fractional vertices introduced by L(G).)
Key: For a general graph, L(G) is an outer bound on M(G), and yields a linear-programming relaxation of the MAP problem:
f(x̂) = max_{μ ∈ M(G)} θ^T μ ≤ max_{τ ∈ L(G)} θ^T τ.
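Since L(G) is cut out by finitely many linear constraints, the right-hand side is an ordinary linear program. Below is a sketch for a binary model on a 3-cycle, with scipy as an assumed solver and random potentials standing in for a real problem.

```python
import numpy as np
from scipy.optimize import linprog

m = 2
nodes = [0, 1, 2]
edges = [(0, 1), (1, 2), (0, 2)]
rng = np.random.default_rng(0)
theta_node = {s: rng.standard_normal(m) for s in nodes}
theta_edge = {e: rng.standard_normal((m, m)) for e in edges}

# variable layout: tau_s(x_s) per node, then tau_st(x_s, x_t) per edge
node_off = {s: i * m for i, s in enumerate(nodes)}
edge_off = {e: len(nodes) * m + i * m * m for i, e in enumerate(edges)}
d = len(nodes) * m + len(edges) * m * m

c = np.zeros(d)                                          # linprog minimizes, so negate
for s in nodes:
    c[node_off[s]:node_off[s] + m] = -theta_node[s]
for e in edges:
    c[edge_off[e]:edge_off[e] + m * m] = -theta_edge[e].ravel()

A_eq, b_eq = [], []
for s in nodes:                                          # normalization constraints
    row = np.zeros(d); row[node_off[s]:node_off[s] + m] = 1
    A_eq.append(row); b_eq.append(1.0)
for (s, t) in edges:                                     # marginalization, both directions
    for xs in range(m):
        row = np.zeros(d); row[node_off[s] + xs] = -1
        for xt in range(m):
            row[edge_off[(s, t)] + xs * m + xt] = 1      # sum over x_t
        A_eq.append(row); b_eq.append(0.0)
    for xt in range(m):
        row = np.zeros(d); row[node_off[t] + xt] = -1
        for xs in range(m):
            row[edge_off[(s, t)] + xs * m + xt] = 1      # sum over x_s
        A_eq.append(row); b_eq.append(0.0)

res = linprog(c, A_eq=np.array(A_eq), b_eq=np.array(b_eq), bounds=(0, 1))
print("LP relaxation value:", -res.fun)
print("integral solution:", np.allclose(res.x, res.x.round()))
```

An integral optimum certifies a MAP solution; fractional coordinates indicate that the relaxation is loose, as in the example on the next slide.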
Looseness of L(G) with graphs with cycles
(Figure: locally consistent (pseudo)marginals on a 3-cycle over nodes 1, 2, 3; every node has singleton pseudomarginal [0.5, 0.5]; two edges carry the pairwise table [[0.4, 0.1], [0.1, 0.4]] and the third carries [[0.1, 0.4], [0.4, 0.1]].)
Pseudomarginals satisfy the "obvious" local constraints:
Normalization: Σ_{x'_s} τ_s(x'_s) = 1 for all s ∈ V.
Marginalization: Σ_{x'_s} τ_st(x'_s, x_t) = τ_t(x_t) for all edges (s, t).
TRW max-product and LP relaxation
First-order (tree-based) LP relaxation:
f(x̂) ≤ max_{τ ∈ L(G)} { Σ_{s ∈ V} E_{τ_s}[θ_s(x_s)] + Σ_{(s,t) ∈ E} E_{τ_st}[θ_st(x_s, x_t)] }
Results (Wainwright et al., 2005; Kolmogorov & Wainwright, 2005):
(a) Strong tree agreement: Any TRW fixed point that satisfies the strong tree agreement condition specifies an optimal LP solution.
(b) LP solving: For any binary pairwise problem, TRW max-product solves the first-order LP relaxation.
(c) Persistence for binary problems: Let S ⊆ V be the subset of vertices for which there exists a single point x*_s ∈ arg max_{x_s} ν*_s(x_s). Then for any optimal solution y, it holds that y_s = x*_s for all s ∈ S.
On-going work on LPs and conic relaxations
• tree-reweighted max-product solves first-order LP for any binary pairwise problem (Kolmogorov & Wainwright, 2005)
• convergent dual ascent scheme; LP-optimal for binary pairwise problems (Globerson & Jaakkola, 2007)
• convex free energies and zero-temperature limits (Wainwright et al., 2005; Weiss et al., 2006; Johnson et al., 2007)
• coding problems: adaptive cutting-plane methods (Taghavi & Siegel, 2006; Dimakis et al., 2006)
• dual decomposition and sub-gradient methods (Feldman et al., 2003; Komodakis et al., 2007; Duchi et al., 2007)
• solving higher-order relaxations; rounding schemes (e.g., Sontag et al., 2008; Ravikumar et al., 2008)
Hierarchies of conic programming relaxations
• tree-based LP relaxation using L(G): first in a hierarchy of hypertree-based relaxations (Wainwright & Jordan, 2004)
• hierarchies of SDP relaxations for polynomial programming (Lasserre, 2001; Parrilo, 2002)
• intermediate between LP and SDP: second-order cone programming (SOCP) relaxations (Ravikumar & Lafferty, 2006; Kumar et al., 2008)
• all relaxations: particular outer bounds on the marginal polytope
Key questions:
• when are particular relaxations tight?
• when does more computation (e.g., LP → SOCP → SDP) yield performance gains?
Stereo computation: Middlebury stereo benchmark set
• standard set of benchmarked examples for stereo algorithms (Scharstein & Szeliski, 2002)
• Tsukuba data set: image sizes 384 × 288 × 16 (W × H × D)
(Figure: (a) original image; (b) ground-truth disparity.)
Comparison of different methods
(Figure panels: (a) scanline dynamic programming, (b) graph cuts, (c) ordinary belief propagation, (d) tree-reweighted max-product.)
(a), (b): Scharstein & Szeliski, 2002; (c): Sun et al., 2002; (d): Weiss et al., 2005.
Ordinary belief propagation
Tree-reweighted max-product
Ground truth
Graphical models and message-passing
Part II: Marginals and likelihoods
Martin Wainwright, UC Berkeley, Departments of Statistics and EECS
Tutorial materials (slides, monograph, lecture notes) available at: www.eecs.berkeley.edu/~wainwrig/kyoto12
September 3, 2012
Graphs and factorization
(Figure: a seven-node graph with compatibility functions attached to cliques, e.g. ψ_7, ψ_47, and ψ_456.)
• clique C is a fully connected subset of vertices
• compatibility function ψ_C defined on variables x_C = {x_s, s ∈ C}
• factorization over all cliques:
p(x_1, ..., x_N) = (1/Z) ∏_{C ∈ C} ψ_C(x_C).
Core computational challenges
Given an undirected graphical model (Markov random field):
p(x_1, x_2, ..., x_N) = (1/Z) ∏_{C ∈ C} ψ_C(x_C)
How to efficiently compute?
• most probable configuration (MAP estimate):
Maximize: x̂ = arg max_{x ∈ X^N} p(x_1, ..., x_N) = arg max_{x ∈ X^N} ∏_{C ∈ C} ψ_C(x_C).
• the data likelihood or normalization constant:
Sum/integrate: Z = Σ_{x ∈ X^N} ∏_{C ∈ C} ψ_C(x_C)
• marginal distributions at single sites, or subsets:
Sum/integrate: p(X_s = x_s) = (1/Z) Σ_{x_t, t ≠ s} ∏_{C ∈ C} ψ_C(x_C)
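All three quantities can be computed by exhaustive enumeration in time m^N, which is useful as a correctness baseline for message-passing on small instances. A sketch for an assumed pairwise chain model with random potentials:

```python
import numpy as np
from itertools import product

N, m = 4, 2
rng = np.random.default_rng(0)
theta_node = rng.standard_normal((N, m))
edges = [(0, 1), (1, 2), (2, 3)]
theta_edge = {e: rng.standard_normal((m, m)) for e in edges}

def score(x):
    """Log of the unnormalized probability of configuration x."""
    return (sum(theta_node[s, x[s]] for s in range(N))
            + sum(theta_edge[(s, t)][x[s], x[t]] for (s, t) in edges))

configs = list(product(range(m), repeat=N))
weights = np.array([np.exp(score(x)) for x in configs])

Z = weights.sum()                                     # normalization constant
x_map = configs[int(weights.argmax())]                # MAP configuration
mask = np.array([x[0] == 0 for x in configs])
p0 = np.array([weights[mask].sum(), weights[~mask].sum()]) / Z   # marginal at node 0
print("Z =", Z, "MAP =", x_map, "p(X_0) =", p0)
```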
§ 1. Sum-product message-passing on trees
Goal: Compute the marginal distribution at a node of a tree, where
p(x) ∝ ∏_{s ∈ V} exp(θ_s(x_s)) ∏_{(s,t) ∈ E} exp(θ_st(x_s, x_t)).
(Figure: chain 1 - 2 - 3, with messages M_12 and M_32 flowing into node 2.)
For the chain, distributing sums over products gives the marginal at node 2:
Σ_{x_1, x_3} p(x) ∝ exp(θ_2(x_2)) ∏_{t ∈ {1,3}} [ Σ_{x_t} exp(θ_t(x_t) + θ_2t(x_2, x_t)) ] = exp(θ_2(x_2)) M_12(x_2) M_32(x_2).
Putting together the pieces
Sum-product is an exact algorithm for any tree.
(Figure: node t with neighbors u, v, w, subtrees T_u, T_v, T_w, and message M_ts flowing from t to s.)
M_ts ≡ message from node t to s
N(t) ≡ neighbors of node t
Update: M_ts(x_s) ← Σ_{x'_t ∈ X_t} exp( θ_st(x_s, x'_t) + θ_t(x'_t) ) ∏_{v ∈ N(t)\s} M_vt(x'_t)
Sum-marginals: p_s(x_s; θ) ∝ exp{θ_s(x_s)} ∏_{t ∈ N(s)} M_ts(x_s).
Summary: sum-product on trees
• converges in at most graph diameter # of iterations
• updating a single message is an O(m²) operation
• overall algorithm requires O(Nm²) operations
• upon convergence, yields the exact node and edge marginals:
p_s(x_s) ∝ e^{θ_s(x_s)} ∏_{u ∈ N(s)} M_us(x_s)
p_st(x_s, x_t) ∝ e^{θ_s(x_s) + θ_t(x_t) + θ_st(x_s, x_t)} ∏_{u ∈ N(s)\t} M_us(x_s) ∏_{u ∈ N(t)\s} M_ut(x_t)
• messages can also be used to compute the partition function:
Z = Σ_{x_1,...,x_N} ∏_{s ∈ V} e^{θ_s(x_s)} ∏_{(s,t) ∈ E} e^{θ_st(x_s, x_t)}.
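A compact implementation of these updates on a chain, computing every node marginal and log Z with one forward and one backward sweep. The chain topology and random potentials are illustrative; messages are normalized at each step for numerical stability, and the normalizers are accumulated to recover log Z (on small instances the output matches the brute-force baseline above).

```python
import numpy as np

N, m = 5, 3
rng = np.random.default_rng(1)
th_n = rng.standard_normal((N, m))
th_e = rng.standard_normal((N - 1, m, m))          # th_e[s][x_s, x_{s+1}]

def pass_messages(th_n, th_e, forward=True):
    """Return normalized messages into each node and the accumulated log-normalizer."""
    N, m = th_n.shape
    msgs = np.ones((N, m))                         # msgs[s] = message arriving at node s
    log_kappa = 0.0
    order = range(N - 1) if forward else range(N - 1, 0, -1)
    for s in order:
        t = s + 1 if forward else s - 1
        edge = th_e[s] if forward else th_e[t].T   # matrix indexed [x_s, x_t]
        new = (np.exp(th_n[s])[:, None] * np.exp(edge) * msgs[s][:, None]).sum(0)
        log_kappa += np.log(new.sum())
        msgs[t] = new / new.sum()
    return msgs, log_kappa

fwd, lk_f = pass_messages(th_n, th_e, forward=True)
bwd, _ = pass_messages(th_n, th_e, forward=False)

marginals = np.exp(th_n) * fwd * bwd               # p_s from both incoming messages
marginals /= marginals.sum(axis=1, keepdims=True)

# log Z: unnormalized mass at the last node, times the accumulated normalizers
logZ = lk_f + np.log((np.exp(th_n[-1]) * fwd[-1]).sum())
print(marginals, logZ)
```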
§ 2. Sum-product on graph with cycles
• as with max-product, a widely used heuristic with a long history:
◮ error-control coding: Gallager, 1963
◮ artificial intelligence: Pearl, 1988
◮ turbo decoding: Berrou et al., 1993
◮ etc.
• some concerns with sum-product with cycles:
◮ no convergence guarantees
◮ can have multiple fixed points
◮ final estimate of Z is not a lower/upper bound
• as before, can consider a broader class of reweighted sum-product algorithms
Tree-reweighted sum-product algorithms
Message update from node t to node s:
M_ts(x_s) ← κ Σ_{x'_t ∈ X_t} exp( θ_st(x_s, x'_t)/ρ_st + θ_t(x'_t) ) · [ ∏_{v ∈ N(t)\s} M_vt(x'_t)^{ρ_vt} ] / [ M_st(x'_t)^{(1 − ρ_ts)} ]
(reweighted edge potential; reweighted incoming messages in the numerator; opposite-direction message in the denominator)
Properties:
1. Modified updates remain distributed and purely local over the graph.
2. Key differences:
• Messages are reweighted with ρ_st ∈ [0, 1].
• Potential on edge (s, t) is rescaled by ρ_st ∈ [0, 1].
• Update involves the reverse-direction message M_st.
3. The choice ρ_st = 1 for all edges (s, t) recovers the standard update.
Bethe entropy approximation
define local marginal distributions as before (e.g., m = 3: the vector μ_s and the 3 × 3 matrix μ_st)
define node-based entropy and edge-based mutual information:
Node-based entropy: H_s(μ_s) = −Σ_{x_s} μ_s(x_s) log μ_s(x_s)
Mutual information: I_st(μ_st) = Σ_{x_s, x_t} μ_st(x_s, x_t) log [ μ_st(x_s, x_t) / (μ_s(x_s) μ_t(x_t)) ].
ρ-reweighted Bethe entropy:
H_Bethe(μ) = Σ_{s ∈ V} H_s(μ_s) − Σ_{(s,t) ∈ E} ρ_st I_st(μ_st),
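A direct transcription of these definitions in code; the two-node pseudomarginals below are illustrative, and ρ_st = 1 recovers the standard (unreweighted) Bethe entropy.

```python
import numpy as np

def node_entropy(mu_s):
    return -np.sum(mu_s * np.log(mu_s))

def mutual_info(mu_st):
    mu_s, mu_t = mu_st.sum(1), mu_st.sum(0)        # marginals of the edge table
    return np.sum(mu_st * np.log(mu_st / np.outer(mu_s, mu_t)))

def bethe_entropy(node_marg, edge_marg, rho):
    """node_marg: {s: mu_s}; edge_marg: {(s, t): mu_st}; rho: {(s, t): rho_st}."""
    H = sum(node_entropy(mu) for mu in node_marg.values())
    H -= sum(rho[e] * mutual_info(mu) for e, mu in edge_marg.items())
    return H

node_marg = {0: np.array([0.5, 0.5]), 1: np.array([0.5, 0.5])}
edge_marg = {(0, 1): np.array([[0.4, 0.1], [0.1, 0.4]])}
print(bethe_entropy(node_marg, edge_marg, {(0, 1): 1.0}))
```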
Bethe entropy is exact for trees
exact for trees, using the factorization:
p(x; θ) = ∏_{s ∈ V} μ_s(x_s) ∏_{(s,t) ∈ E} [ μ_st(x_s, x_t) / (μ_s(x_s) μ_t(x_t)) ]
Reweighted sum-product and Bethe variational principle
Define the local constraint set
L(G) = { τ_s, τ_st | τ ≥ 0, Σ_{x_s} τ_s(x_s) = 1, Σ_{x_t} τ_st(x_s, x_t) = τ_s(x_s) }
Theorem: For any choice of positive edge weights ρ_st > 0:
(a) Fixed points of reweighted sum-product are stationary points of the Lagrangian associated with
A_Bethe(θ; ρ) := max_{τ ∈ L(G)} { Σ_{s ∈ V} ⟨τ_s, θ_s⟩ + Σ_{(s,t) ∈ E} ⟨τ_st, θ_st⟩ + H_Bethe(τ; ρ) }.
(b) For valid choices of edge weights {ρ_st}, the fixed points are unique and moreover log Z(θ) ≤ A_Bethe(θ; ρ). In addition, reweighted sum-product converges with appropriate scheduling.
Lagrangian derivation of ordinary sum-product
• let's try to solve this problem by a (partial) Lagrangian formulation
• assign a Lagrange multiplier λ_ts(x_s) for each constraint C_ts(x_s) := τ_s(x_s) − Σ_{x_t} τ_st(x_s, x_t) = 0
• will enforce the normalization (Σ_{x_s} τ_s(x_s) = 1) and non-negativity constraints explicitly
• the Lagrangian takes the form:
L(τ; λ) = ⟨θ, τ⟩ + Σ_{s ∈ V} H_s(τ_s) − Σ_{(s,t) ∈ E(G)} I_st(τ_st) + Σ_{(s,t) ∈ E} [ Σ_{x_t} λ_st(x_t) C_st(x_t) + Σ_{x_s} λ_ts(x_s) C_ts(x_s) ]
Lagrangian derivation (part II)
• taking derivatives of the Lagrangian w.r.t. τ_s and τ_st yields
∂L/∂τ_s(x_s) = θ_s(x_s) − log τ_s(x_s) + Σ_{t ∈ N(s)} λ_ts(x_s) + C
∂L/∂τ_st(x_s, x_t) = θ_st(x_s, x_t) − log [ τ_st(x_s, x_t) / (τ_s(x_s) τ_t(x_t)) ] − λ_ts(x_s) − λ_st(x_t) + C′
• setting these partial derivatives to zero and simplifying:
τ_s(x_s) ∝ exp{θ_s(x_s)} ∏_{t ∈ N(s)} exp{λ_ts(x_s)}
τ_st(x_s, x_t) ∝ exp{θ_s(x_s) + θ_t(x_t) + θ_st(x_s, x_t)} × ∏_{u ∈ N(s)\t} exp{λ_us(x_s)} ∏_{v ∈ N(t)\s} exp{λ_vt(x_t)}
• enforcing the constraint C_ts(x_s) = 0 on these representations yields the familiar update rule for the messages M_ts(x_s) = exp(λ_ts(x_s)):
M_ts(x_s) ← Σ_{x_t} exp{θ_t(x_t) + θ_st(x_s, x_t)} ∏_{u ∈ N(t)\s} M_ut(x_t)
Convex combinations of trees
Idea: Upper bound A(θ) := log Z(θ) with a convex combination of tree-structured problems.
θ = ρ(T_1) θ(T_1) + ρ(T_2) θ(T_2) + ρ(T_3) θ(T_3)
A(θ) ≤ ρ(T_1) A(θ(T_1)) + ρ(T_2) A(θ(T_2)) + ρ(T_3) A(θ(T_3))
ρ = {ρ(T)} ≡ probability distribution over spanning trees
θ(T) ≡ tree-structured parameter vector
Finding the tightest upper bound
Observation: For each fixed distribution ρ over spanning trees, there are many such upper bounds.
Goal: Find the tightest such upper bound over all trees.
Challenge: Number of spanning trees grows rapidly in graph size.
Example: On the 2-D lattice:
Grid size | # trees
9 | 192
16 | 100352
36 | 3.26 × 10^13
100 | 5.69 × 10^42
Finding the tightest upper bound
By a suitable dual reformulation, this combinatorial problem can be avoided:
Key duality relation:
min_{Σ_T ρ(T) θ(T) = θ} Σ_T ρ(T) A(θ(T)) = max_{μ ∈ L(G)} { ⟨μ, θ⟩ + H_Bethe(μ; ρ_st) }.
Edge appearance probabilities
Experiment: What is the probability ρ_e that a given edge e ∈ E belongs to a tree T drawn randomly under ρ?
(Figure: (a) original single-cycle graph with edges b, e, f; (b)-(d) its three spanning trees T_1, T_2, T_3, each with ρ(T_i) = 1/3.)
In this example: ρ_b = 1; ρ_e = 2/3; ρ_f = 1/3.
The vector ρ = {ρ_e | e ∈ E} must belong to the spanning tree polytope. (Edmonds, 1971)
Why does entropy arise in the duality?
Due to a deep correspondence between two problems:
Maximum entropy density estimation:
Maximize entropy H(p) = −Σ_x p(x_1, ..., x_N) log p(x_1, ..., x_N)
subject to expectation constraints of the form Σ_x p(x) φ_α(x) = μ̂_α.
Maximum likelihood in exponential family:
Maximize likelihood of parameterized densities
p(x_1, ..., x_N; θ) = exp{ Σ_α θ_α φ_α(x) − A(θ) }.
Conjugate dual functions
• conjugate duality is a fertile source of variational representations
• any function f can be used to define another function f* as follows:
f*(v) := sup_{u ∈ R^n} { ⟨v, u⟩ − f(u) }.
• easy to show that f* is always a convex function
• how about taking the "dual of the dual"? I.e., what is (f*)*?
• when f is well-behaved (convex and lower semi-continuous), we have (f*)* = f, or alternatively stated:
f(u) = sup_{v ∈ R^n} { ⟨u, v⟩ − f*(v) }
Geometric view: Supporting hyperplanes
Question: Given all hyperplanes in R^n × R with normal (v, −1), what is the intercept of the one that supports epi(f)?
Epigraph of f: epi(f) := { (u, β) ∈ R^{n+1} | f(u) ≤ β }.
(Figure: the graph of f with two candidate hyperplanes ⟨v, u⟩ − c_a and ⟨v, u⟩ − c_b; only the one with the right intercept supports epi(f).)
Analytically, we require the smallest c ∈ R such that:
⟨v, u⟩ − c ≤ f(u) for all u ∈ R^n
By re-arranging, we find that this optimal c* is the dual value:
c* = sup_{u ∈ R^n} { ⟨v, u⟩ − f(u) }.
Example: Single Bernoulli
Random variable X ∈ {0, 1} yields exponential family of the form:
p(x; θ) ∝ exp{θ x} with A(θ) = log(1 + exp(θ)).
Let's compute the dual A*(μ) := sup_{θ ∈ R} { μθ − log[1 + exp(θ)] }.
(Possible) stationary point: μ = exp(θ)/[1 + exp(θ)].
(Figure: (a) for μ ∈ (0, 1), the epigraph of A is supported by a line of slope μ; (b) otherwise the epigraph cannot be supported.)
We find that:
A*(μ) = μ log μ + (1 − μ) log(1 − μ) if μ ∈ [0, 1], and +∞ otherwise.
Leads to the variational representation: A(θ) = max_{μ ∈ [0,1]} { μ·θ − A*(μ) }.
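The variational representation is easy to verify numerically; the grid search below is a sketch that recovers both A(θ) and the maximizing μ, which is the mean parameter exp(θ)/(1 + exp(θ)).

```python
import numpy as np

def A(theta):
    """Log-partition function of the Bernoulli family."""
    return np.log1p(np.exp(theta))

def A_star(mu):
    """Conjugate dual: negative Bernoulli entropy on [0, 1]."""
    return mu * np.log(mu) + (1 - mu) * np.log(1 - mu)

theta = 1.3
mu_grid = np.linspace(1e-6, 1 - 1e-6, 100001)
objective = mu_grid * theta - A_star(mu_grid)
print(A(theta), objective.max())                      # agree to grid resolution
mu_hat = mu_grid[objective.argmax()]
print(mu_hat, np.exp(theta) / (1 + np.exp(theta)))    # maximizer equals the mean
```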
Geometry of Bethe variational problem
(Figure: L(G) as a polyhedral outer approximation of M(G), with an integral vertex μ_int and a fractional vertex μ_frac.)
• belief propagation uses a polyhedral outer approximation to M(G):
◮ for any graph, L(G) ⊇ M(G).
◮ equality holds ⇐⇒ G is a tree.
• Natural question: Do BP fixed points ever fall outside of the marginal polytope M(G)?
Illustration: Globally inconsistent BP fixed points
Consider the following assignment of pseudomarginals τ_s, τ_st:
(Figure: locally consistent (pseudo)marginals on a 3-cycle over nodes 1, 2, 3; every node has singleton pseudomarginal [0.5, 0.5]; two edges carry the pairwise table [[0.4, 0.1], [0.1, 0.4]] and the third carries [[0.1, 0.4], [0.4, 0.1]].)
• can verify that τ ∈ L(G), and that τ is a fixed point of belief propagation (with all constant messages)
• however, τ is globally inconsistent
Note: More generally, for any τ in the interior of L(G), one can construct a distribution with τ as a BP fixed point.
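Global inconsistency follows from the disagreement triangle inequality P(X_1 ≠ X_3) ≤ P(X_1 ≠ X_2) + P(X_2 ≠ X_3), which these numbers violate: 0.8 ≰ 0.2 + 0.2. The sketch below checks the local constraints for this example; assigning the anticorrelated table to edge (1, 3) is a reading of the figure, not essential.

```python
import numpy as np

tau_node = {s: np.array([0.5, 0.5]) for s in (1, 2, 3)}
tau_edge = {
    (1, 2): np.array([[0.4, 0.1], [0.1, 0.4]]),
    (2, 3): np.array([[0.4, 0.1], [0.1, 0.4]]),
    (1, 3): np.array([[0.1, 0.4], [0.4, 0.1]]),   # the anticorrelated edge
}

for s, tau in tau_node.items():
    assert np.isclose(tau.sum(), 1.0)                  # normalization
for (s, t), tau in tau_edge.items():
    assert np.allclose(tau.sum(axis=1), tau_node[s])   # marginalize out x_t
    assert np.allclose(tau.sum(axis=0), tau_node[t])   # marginalize out x_s
print("tau lies in L(G), yet no global distribution realizes it")
```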
High-level perspective: A broad class of methods
• message-passing algorithms (e.g., mean field, belief propagation) are solving approximate versions of exact variational principle in exponential families
• there are two distinct components to approximations:
(a) can use either inner or outer bounds to M
(b) various approximations to entropy function −A*(μ)
Refining one or both components yields better approximations:
• BP: polyhedral outer bound and non-convex Bethe approximation
• Kikuchi and variants: tighter polyhedral outer bounds and better entropy approximations (e.g., Yedidia et al., 2002)
• Expectation-propagation: better outer bounds and Bethe-like entropy approximations (Minka, 2002)
Graphical models and message-passing
Part III: Learning graphs from data
Martin Wainwright, UC Berkeley, Departments of Statistics and EECS
Introduction
• previous lectures on "forward problems": given a graphical model, perform some type of computation
◮ Part I: compute most probable (MAP) assignment
◮ Part II: compute marginals and likelihoods
• inverse problems concern learning the parameters and structure of graphs from data
• many instances of such graph learning problems:
◮ fitting graphs to politicians' voting behavior
◮ modeling diseases with epidemiological networks
◮ traffic flow modeling
◮ interactions between different genes
◮ and so on...
Example: US Senate network (2004–2006 voting) (Banerjee et al., 2008; Ravikumar, W. & Lafferty, 2010)
Example: Biological networks
• gene networks during Drosophila life cycle (Ahmed & Xing, PNAS, 2009)
• many other examples:
◮ protein networks
◮ phylogenetic trees
Learning for pairwise models
drawn n samples from
Q(x_1, ..., x_p; Θ) = (1/Z(Θ)) exp{ Σ_{s ∈ V} θ_s x_s² + Σ_{(s,t) ∈ E} θ_st x_s x_t }
• graph G and matrix [Θ]_st = θ_st of edge weights are unknown
• data matrix:
◮ Ising model (binary variables): X^n_1 ∈ {0, 1}^{n×p}
◮ Gaussian model: X^n_1 ∈ R^{n×p}
• estimator X^n_1 ↦ Θ̂
• various loss functions are possible:
◮ graph selection: supp[Θ̂] = supp[Θ]?
◮ bounds on Kullback-Leibler divergence D(Q_Θ̂ ∥ Q_Θ)
◮ bounds on |||Θ̂ − Θ|||_op.
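To generate such data for experiments, a standard route is Gibbs sampling from Q. Below is a sketch for the binary case; the chain graph, edge weight, and sweep counts are illustrative assumptions. Each site update uses the conditional log-odds θ_s + Σ_{t ≠ s} θ_st x_t implied by the model above (for x_s ∈ {0, 1}, x_s² = x_s).

```python
import numpy as np

def gibbs_ising(Theta, n_samples, n_burn=1000, thin=10, seed=0):
    """Theta: symmetric (p, p) matrix; the diagonal holds the node parameters."""
    rng = np.random.default_rng(seed)
    p = Theta.shape[0]
    x = rng.integers(0, 2, size=p)
    out = []
    for it in range(n_burn + n_samples * thin):
        for s in range(p):
            # conditional log-odds of x_s = 1 given its Markov blanket
            field = Theta[s, s] + Theta[s].dot(x) - Theta[s, s] * x[s]
            x[s] = rng.random() < 1.0 / (1.0 + np.exp(-field))
        if it >= n_burn and (it - n_burn) % thin == 0:
            out.append(x.copy())
    return np.array(out)

p = 10
Theta = np.zeros((p, p))
for s in range(p - 1):                    # chain graph with edge weight 0.5
    Theta[s, s + 1] = Theta[s + 1, s] = 0.5
X = gibbs_ising(Theta, n_samples=500)
print(X.shape, X.mean(axis=0))
```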
Challenges in graph selection
For pairwise models, the negative log-likelihood takes the form:
ℓ(Θ; X^n_1) := −(1/n) Σ_{i=1}^n log Q(x_{i1}, ..., x_{ip}; Θ) = log Z(Θ) − { Σ_{s ∈ V} θ_s μ̂_s + Σ_{(s,t)} θ_st μ̂_st }
• maximizing likelihood involves computing log Z(Θ) or its derivatives (marginals)
• for Gaussian graphical models, this is a log-determinant program
• for discrete graphical models, various work-arounds are possible:
◮ Markov chain Monte Carlo and stochastic gradient
◮ variational approximations to likelihood
◮ pseudo-likelihoods
Methods for graph selection
• for Gaussian graphical models:
◮ ℓ1-regularized neighborhood regression for Gaussian MRFs (e.g., Meinshausen & Buhlmann, 2005; Wainwright, 2006; Zhao & Yu, 2006)
◮ ℓ1-regularized log-determinant (e.g., Yuan & Lin, 2006; d'Aspremont et al., 2007; Friedman, 2008; Rothman et al., 2008; Ravikumar et al., 2008)
• methods for discrete MRFs:
◮ exact solution for trees (Chow & Liu, 1967)
◮ local testing (e.g., Spirtes et al., 2000; Kalisch & Buhlmann, 2008)
◮ various other methods:
⋆ distribution fits by KL-divergence (Abeel et al., 2005)
⋆ ℓ1-regularized logistic regression (Ravikumar, W. & Lafferty, 2008, 2010)
⋆ approximate max. entropy approach and thinned graphical models (Johnson et al., 2007)
⋆ neighborhood-based thresholding method (Bresler, Mossel & Sly, 2008)
• information-theoretic analysis:
◮ pseudolikelihood and BIC criterion (Csiszar & Talata, 2006)
◮ information-theoretic limitations (Santhanam & W., 2008, 2012)
Graphs and random variables
• associate to each node s ∈ V a random variable X_s
• for each subset A ⊆ V, random vector X_A := {X_s, s ∈ A}.
(Figure: seven-node graph with maximal cliques (123), (345), (456), (47), and a vertex cutset S separating subsets A and B.)
• a clique C ⊆ V is a subset of vertices all joined by edges
• a vertex cutset is a subset S ⊂ V whose removal breaks the graph into two or more pieces
Factorization and Markov properties
The graph G can be used to impose constraints on the random vector X = X_V (or on the distribution Q) in different ways.
Markov property: X is Markov w.r.t. G if X_A and X_B are conditionally independent given X_S whenever S separates A and B.
Factorization: The distribution Q factorizes according to G if it can be expressed as a product over cliques:
Q(x_1, x_2, ..., x_p) = (1/Z) ∏_{C ∈ C} ψ_C(x_C)
where Z is the normalization and ψ_C is a compatibility function on clique C.
Theorem (Hammersley & Clifford, 1973): For strictly positive Q(·), the Markov property and the Factorization property are equivalent.
Markov property and neighborhood structure
Markov properties encode neighborhood structure:
(X_s | X_{V\s}) =_d (X_s | X_{N(s)})
where the left side conditions on the full graph and the right side conditions only on the Markov blanket N(s).
(Figure: node X_s with Markov blanket N(s) = {t, u, v, w}.)
• basis of pseudolikelihood method (Besag, 1974)
• basis of many graph learning algorithms (Friedman et al., 1999; Csiszar & Talata, 2005; Abeel et al., 2006; Meinshausen & Buhlmann, 2006)
Graph selection via neighborhood regression
(Figure: binary data matrix; the column X_s is predicted from the remaining columns X_{\s}.)
Predict X_s based on X_{\s} := {X_t, t ≠ s}.
1. For each node s ∈ V, compute the (regularized) max. likelihood estimate:
θ̂[s] := arg min_{θ ∈ R^{p−1}} { −(1/n) Σ_{i=1}^n L(θ; X_{i,\s}) + λ_n ‖θ‖_1 }
where L is the local log-likelihood and λ_n ‖θ‖_1 the regularization.
2. Estimate the local neighborhood N̂(s) as the support of the regression vector θ̂[s] ∈ R^{p−1}.
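A minimal sketch of this two-step procedure using scikit-learn's ℓ1-penalized logistic regression. The independent-bits data matrix is only a placeholder (in practice X^n_1 would be sampled from the Ising model, e.g. with the Gibbs sampler sketched earlier), and the regularization level C is an illustrative stand-in for the theoretically calibrated λ_n.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

n, p = 500, 10
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(n, p))       # placeholder data; use Ising samples in practice

C = 0.5                                   # inverse regularization strength (illustrative)
neighborhoods = {}
for s in range(p):
    y, Z = X[:, s], np.delete(X, s, axis=1)
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=C).fit(Z, y)
    others = [t for t in range(p) if t != s]
    # step 2: read the neighborhood off the support of the fitted coefficients
    neighborhoods[s] = [others[j] for j in np.flatnonzero(clf.coef_[0])]
print(neighborhoods)
```

Because θ̂[s] and θ̂[t] are fit separately, the two estimates of each edge (s, t) are typically reconciled with an AND or OR rule on the supports.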
High-dimensional analysis
• classical analysis: graph size p fixed, sample size n → +∞
• high-dimensional analysis: allow dimension p, sample size n, and maximum degree d to increase at arbitrary rates
• take n i.i.d. samples from MRF defined by G_{p,d}
• study probability of success as a function of three parameters:
Success(n, p, d) = Q[Method recovers graph G_{p,d} from n samples]
• theory is non-asymptotic: explicit probabilities for finite (n, p, d)
Empirical behavior: Unrescaled plots
(Figure: star graph with a linear fraction of neighbors; probability of success versus number of samples n, with curves for p = 64, 100, 225.)
Empirical behavior: Appropriately rescaled
(Figure: same star-graph ensemble; the curves for p = 64, 100, 225 align once replotted against the control parameter.)
Plots of success probability versus control parameter γ(n, p, d).
Rescaled plots (2-D lattice graphs)
(Figure: 4-nearest-neighbor grid with attractive couplings; probability of success for p = 64, 100, 225.)
Plots of success probability versus control parameter γ(n, p, d) = n/(d³ log p).
Sufficient conditions for consistent Ising selection
• graph sequences G_{p,d} = (V, E) with p vertices, and maximum degree d
• edge weights |θ_st| ≥ θ_min for all (s, t) ∈ E
• draw n i.i.d. samples, and analyze prob. success indexed by (n, p, d)
Theorem (Ravikumar, W. & Lafferty, 2006, 2010): Under incoherence conditions, for a rescaled sample size
γ_LR(n, p, d) := n/(d³ log p) > γ_crit
and regularization parameter λ_n ≥ c₁ √(log p / n), then with probability greater than 1 − 2 exp(−c₂ λ_n² n):
(a) Correct exclusion: The estimated sign neighborhood N̂(s) correctly excludes all edges not in the true neighborhood.
(b) Correct inclusion: For θ_min ≥ c₃ λ_n, the method selects the correct signed neighborhood.
Some related work
• thresholding estimator (poly-time for bounded degree) works with n ≳ 2^d log p samples (Bresler et al., 2008)
• information-theoretic lower bound over family G_{p,d}: any method requires at least n = Ω(d² log p) samples (Santhanam & W., 2008)
• ℓ1-based method: sharper achievable rates, also failure for θ large enough to violate incoherence (Bento & Montanari, 2009)
• empirical study: ℓ1-based method can succeed beyond phase transition on Ising model (Aurell & Ekeberg, 2011)
§ 3. Info. theory: Graph selection as channel coding
graphical model selection is an unorthodox channel coding problem.