Expectation Consistent Approximate Inference

Ole Winther
Informatics and Mathematical Modelling
Technical University of Denmark
DK-2800 Lyngby, Denmark
owi@imm.dtu.dk

In collaboration with Manfred Opper
ISIS, School of Electronics and Computer Science
University of Southampton
SO17 1BJ, United Kingdom
mo@ecs.soton.ac.uk
Motivation

• Contemporary machine learning uses complex, flexible probabilistic models.
• Bayesian inference is typically intractable.
• Approximate polynomial-complexity methods needed.
• VB, Bethe, EP and EC: use a tractable factorization of the original model.
• EC: Expectation Consistency between two distributions, e.g. discrete and Gaussian.
Exact Inference in Tree Graphs

Bethe – tree factorization, e.g. the chain

    p(x) = \frac{1}{Z} f_{12}(x_1, x_2)\, f_{23}(x_2, x_3)\, f_1(x_1)\, f_2(x_2)\, f_3(x_3)

Write p(x) in terms of marginals q_i(x_i) and q_{ij}(x_i, x_j):

    p(x) = q(x) = \frac{q_{12}(x_1, x_2)\, q_{23}(x_2, x_3)}{q_2(x_2)}, \qquad Z = \frac{Z_{12} Z_{23}}{Z_2}

Message-passing: effective inference for p(x) discrete or Gaussian.
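A quick way to convince oneself of this factorization (a sketch of mine, not part of the slides): build an arbitrary chain-structured distribution over three ±1 variables by brute force and check p(x) = q_{12}(x_1,x_2) q_{23}(x_2,x_3) / q_2(x_2) for every configuration. The factors below are random and purely illustrative.

```python
import itertools
import numpy as np

# Toy chain 1-2-3 with binary variables x_i in {-1,+1} and arbitrary positive factors.
rng = np.random.default_rng(0)
J12, J23 = 0.7, -0.4
theta = rng.normal(size=3)

def f12(x1, x2): return np.exp(J12 * x1 * x2)
def f23(x2, x3): return np.exp(J23 * x2 * x3)
def f(i, xi):    return np.exp(theta[i] * xi)

states = list(itertools.product([-1, 1], repeat=3))

# Exact joint p(x) = f12 f23 f1 f2 f3 / Z by enumeration.
w = {x: f12(x[0], x[1]) * f23(x[1], x[2]) *
        f(0, x[0]) * f(1, x[1]) * f(2, x[2]) for x in states}
Z = sum(w.values())
p = {x: w[x] / Z for x in states}

# Marginals q2, q12, q23 obtained from the exact joint.
def marg(idx):
    m = {}
    for x in states:
        key = tuple(x[i] for i in idx)
        m[key] = m.get(key, 0.0) + p[x]
    return m

q2, q12, q23 = marg([1]), marg([0, 1]), marg([1, 2])

# Tree factorization: p(x) = q12(x1,x2) q23(x2,x3) / q2(x2) for every configuration.
for x in states:
    recon = q12[(x[0], x[1])] * q23[(x[1], x[2])] / q2[(x[1],)]
    assert np.isclose(recon, p[x])
print("tree factorization holds for every configuration")
```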
Bethe Approximation

Bethe approximation – treat p(x), e.g.

    p(x) = \frac{1}{Z} f_{12}\, f_{23}\, f_{13}\, f_1\, f_2\, f_3,

as if it were a tree graph:

    q(x) = \frac{q_{12}(x_1, x_2)\, q_{23}(x_2, x_3)\, q_{13}(x_1, x_3)}{q_1(x_1)\, q_2(x_2)\, q_3(x_3)}.

Works extremely well in "sparse systems" – e.g. low-density decoding.
Disadvantage: over-counting – q(x) is not a density.
Variational Bayes (VB)

Minimize the KL-divergence in a restricted tractable family q(x) = \prod_i q_i(x_i):

    q_i(x_i) = \arg\min KL[q(x) \| p(x)] \propto \exp \langle \ln p(x) \rangle_{q \setminus q_i(x_i)}

Example, Gaussian:

    q(x) = N(x; m_q, C_q), \quad p(x) = N(x; m, C) \quad \rightarrow \quad m_q = m \ \text{ and } \ C_{q,ij} = \delta_{ij}\, \frac{1}{(C^{-1})_{ii}}

In general, (factorized) VB is reliable for the mean but under-estimates the width of the
distribution (see e.g. MacKay, 2003; Opper & Winther, 2004). Important for parameter
estimation (see e.g. Minka & Lafferty).
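To make the Gaussian example concrete, here is a minimal numerical sketch (not from the slides). It applies the closed-form result above, m_q = m and variance 1/(C^{-1})_{ii}, to an arbitrary correlated covariance and confirms that the factorized approximation never overestimates the true marginal variances.

```python
import numpy as np

# Factorized (mean-field) VB for a correlated Gaussian target p(x) = N(x; m, C).
rng = np.random.default_rng(0)
N = 4
A = rng.normal(size=(N, N))
C = A @ A.T + N * np.eye(N)      # an arbitrary correlated covariance (illustration only)
m = rng.normal(size=N)

# Closed-form result quoted above: q_i(x_i) = N(x_i; m_i, 1 / (C^{-1})_{ii}).
vb_var = 1.0 / np.diag(np.linalg.inv(C))

print("true marginal variances :", np.round(np.diag(C), 3))
print("factorized-VB variances :", np.round(vb_var, 3))
# (C^{-1})_{ii} >= 1 / C_{ii}, so VB never overestimates the marginal width.
assert np.all(vb_var <= np.diag(C) + 1e-12)
```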
Motivating EC and Overview

We are looking for a tractable approximation that
• can handle "dense graphs" (better than Bethe),
• can estimate correlations (better than VB).

Outline:
• Free energy
• Why it works – central limit theorem
• Algorithmics and connection to EP
• Simulations, conclusions and outlook
Expectation Consistent (EC) free energy

Calculate the partition function

    Z = \int dx\, f(x) = \int dx\, f_q(x)\, f_r(x)

Problem: Z intractable – integral not analytical and/or summation exponential in the
number of variables N.

Introduce a tractable distribution

    q(x) = \frac{1}{Z_q(\lambda_q)} f_q(x) \exp(\lambda_q^T g(x))

Z_q can be calculated in polynomial time.

    Z = Z_q\, \frac{Z}{Z_q}
      = Z_q\, \frac{\int dx\, f_r(x)\, f_q(x) \exp(\lambda_q^T g(x)) \exp(-\lambda_q^T g(x))}{\int dx\, f_q(x) \exp(\lambda_q^T g(x))}
      = Z_q \left\langle f_r(x) \exp\left(-\lambda_q^T g(x)\right) \right\rangle_q
Free energy

The free energy is exact:

    -\ln Z = -\ln Z_q - \ln \left\langle f_r(x) \exp\left(-\lambda_q^T g(x)\right) \right\rangle_q

The variational approximation uses Jensen's inequality, \ln \langle f(x) \rangle \ge \langle \ln f(x) \rangle:

    -\ln Z \le -\ln Z_q - \langle \ln f_r(x) \rangle_q + \lambda_q^T \langle g(x) \rangle_q

Find \lambda_q by minimizing the upper bound.

Better: average f_r(x) \exp(-\lambda_q^T g(x)) approximately – more of the averaging is
retained that way.
Expectation consistent approximation

Define g(x) such that both

    q(x) = \frac{1}{Z_q(\lambda_q)} f_q(x) \exp(\lambda_q^T g(x))
    r(x) = \frac{1}{Z_r(\lambda_r)} f_r(x) \exp(\lambda_r^T g(x))

are tractable.

This excludes some models that are tractable in the variational approach (without further
approximations).
Example I – the Ising model

Binary variables – spins – x_i = ±1 with pairwise interactions:

    f_q(x) = \prod_i \Psi_i(x_i), \qquad \Psi_i(x_i) = \left[\delta(x_i + 1) + \delta(x_i - 1)\right] e^{\theta_i x_i}

    f_r(x) = \exp\left(\tfrac{1}{2} x^T J x\right) = \exp\left(\sum_{i>j} x_i J_{ij} x_j\right)

E.g. set g(x) to first and second order:

    g(x) = \left(x_1, -\tfrac{x_1^2}{2}, x_2, -\tfrac{x_2^2}{2}, \ldots, x_N, -\tfrac{x_N^2}{2}\right)

q(x) – a factorized binary distribution; r(x) – a multivariate Gaussian.

Interpretation of g(x) will be clear shortly.
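As a concrete sanity check (my own sketch, not from the slides), the snippet below builds a tiny random Ising model with the factors f_q, f_r and g(x) defined above, computes Z by brute-force enumeration, and verifies both the exact identity Z = Z_q ⟨f_r(x) exp(−λ_q^T g(x))⟩_q and the variational upper bound from the previous slides for an arbitrary test value of λ_q. The couplings and fields are random and purely illustrative.

```python
import itertools
import numpy as np

# Tiny Ising model: f_q(x) = prod_i exp(theta_i x_i) for x_i in {-1,+1},
# f_r(x) = exp(x^T J x / 2), g(x) = (x_1, -x_1^2/2, ..., x_N, -x_N^2/2).
rng = np.random.default_rng(0)
N = 4
J = rng.normal(scale=0.3, size=(N, N)); J = (J + J.T) / 2; np.fill_diagonal(J, 0.0)
theta = rng.normal(scale=0.5, size=N)

states = np.array(list(itertools.product([-1.0, 1.0], repeat=N)))
G = np.stack([np.stack([x, -x**2 / 2], axis=-1).ravel() for x in states])  # g(x) per state
f_q = np.exp(states @ theta)
f_r = np.exp(0.5 * np.einsum('si,ij,sj->s', states, J, states))

Z_exact = np.sum(f_q * f_r)

# Exact identity Z = Z_q <f_r(x) exp(-lambda_q^T g(x))>_q holds for ANY lambda_q.
lam_q = rng.normal(scale=0.1, size=2 * N)     # arbitrary test value
w_q = f_q * np.exp(G @ lam_q)
Z_q = w_q.sum()
q = w_q / Z_q
assert np.isclose(Z_exact, Z_q * np.sum(q * f_r * np.exp(-G @ lam_q)))

# Variational (Jensen) upper bound: -ln Z <= -ln Z_q - <ln f_r>_q + lambda_q^T <g>_q.
bound = -np.log(Z_q) - q @ np.log(f_r) + lam_q @ (q @ G)
assert -np.log(Z_exact) <= bound + 1e-12
print("-ln Z =", -np.log(Z_exact), "  variational bound =", bound)
```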
Bethe and EC factorization

Bethe:

    Z_{Bethe} = \frac{Z_{12} Z_{23} Z_{13}}{Z_1 Z_2 Z_3}

EC will be similar in spirit:

    Z_{EC} = \frac{Z_q Z_r}{Z_{s(eparator)}}
Example II – Gaussian processes

Supervised learning: inputs x_1, ..., x_N and targets t_1, ..., t_N.

Gaussian process prior over functions y = (y(x_1), ..., y(x_N)):

    p(y) = \frac{1}{\sqrt{(2\pi)^N \det C}} \exp\left(-\tfrac{1}{2} y^T C^{-1} y\right)

Likelihood, observation model: p(t | y(x)), e.g. noise-free classification
p(t | y(x)) = \Theta(t\, y(x)).

    Z = \int dy\, \prod_i p(t_i | y(x_i))\, p(y)

Same structure as Example I – factorized and multivariate Gaussian
(Opper & Winther, 2000; Minka, 2001).
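For intuition, the following sketch (not from the slides) writes out this evidence for noise-free classification with a squared-exponential kernel on a handful of made-up 1-D inputs and estimates Z = E_{y~N(0,C)}[∏_i Θ(t_i y_i)] by naive Monte Carlo. This is only feasible because N is tiny; EC/EP replaces the intractable average with a Gaussian approximation.

```python
import numpy as np

# Noise-free GP classification evidence Z = E_{y ~ N(0, C)}[ prod_i Theta(t_i y_i) ],
# estimated by naive Monte Carlo over the prior.
rng = np.random.default_rng(0)
X = np.array([-1.5, -0.5, 0.3, 1.2])     # made-up 1-D inputs
t = np.array([-1.0, -1.0, 1.0, 1.0])     # made-up +/-1 targets

# Squared-exponential covariance C_ij = exp(-(x_i - x_j)^2 / 2) plus jitter.
C = np.exp(-0.5 * (X[:, None] - X[None, :])**2) + 1e-9 * np.eye(len(X))

L = np.linalg.cholesky(C)
y = rng.normal(size=(200_000, len(X))) @ L.T      # samples y ~ N(0, C)
Z_mc = np.mean(np.all(t * y > 0, axis=1))
print("Monte Carlo evidence estimate:", Z_mc)
```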
Expectation Consistent (Helmholtz) Free Energy

Exchange the average w.r.t. q(x) with one over a simpler distribution s(x):

    s(x) = \frac{1}{Z_s(\lambda_s)} \exp(\lambda_s^T g(x))

Approximation:

    \left\langle f_r(x) \exp\left(-\lambda_q^T g(x)\right) \right\rangle_q \approx \left\langle f_r(x) \exp\left(-\lambda_q^T g(x)\right) \right\rangle_s

Parameters \lambda_q, \lambda_s to be optimized in a suitable way:

    -\ln Z \approx -\ln Z_q - \ln \left\langle f_r(x) \exp\left(-\lambda_q^T g(x)\right) \right\rangle_s
           = -\ln \int dx\, f_q(x) \exp(\lambda_q^T g(x))
             - \ln \int dx\, f_r(x) \exp\left((\lambda_s - \lambda_q)^T g(x)\right)
             + \ln \int dx\, \exp(\lambda_s^T g(x))
Determining the Parameters

Expectation consistency:

    \frac{\partial \ln Z_{EC}}{\partial \lambda_q} = 0: \quad \langle g(x) \rangle_q = \langle g(x) \rangle_r
    \frac{\partial \ln Z_{EC}}{\partial \lambda_s} = 0: \quad \langle g(x) \rangle_r = \langle g(x) \rangle_s

where

    q(x) = \frac{1}{Z_q(\lambda_q)} f_q(x) \exp(\lambda_q^T g(x))
    r(x) = \frac{1}{Z_r(\lambda_r)} f_r(x) \exp(\lambda_r^T g(x)) \quad \text{with} \quad \lambda_r = \lambda_s - \lambda_q
    s(x) = \frac{1}{Z_s(\lambda_s)} \exp(\lambda_s^T g(x))

    Z \approx \frac{Z_q Z_r}{Z_s}

The approximation is symmetric in q(x) and r(x). s(x) is the "separator".
Why it Works

Neither q nor r is a good approximation to p. But marginal distributions and moments can
be precise!

With g(x) = (x_1, -x_1^2/2, \ldots, x_N, -x_N^2/2) and \lambda = (\gamma_1, \Lambda_1, \ldots, \gamma_N, \Lambda_N):

    q(x) = \prod_i q_i(x_i), \qquad q_i(x_i) \propto \Psi_i(x_i) \exp\left(\gamma_{q,i} x_i - \Lambda_{q,i} x_i^2 / 2\right).

The central limit theorem saves us: the details of the distribution of the marginalized
variables are not important, only the first and second moments. Cavity method
(Onsager, 1936; Mézard, Parisi & Virasoro, 1987).

Exact under some conditions: "dense models", many variables, no dominating interactions and
not too strong interactions. Other complications such as non-ergodicity (RSB).
Non-trivial estimates in EC

• Marginal distributions q(x_i) (factorized moments):

      q(x) \propto \prod_i \Psi_i(x_i) \exp(\gamma_q^T x - x^T \Lambda_q x / 2)
      q(x_i) \propto \Psi_i(x_i) \exp(\gamma_{q,i} x_i - x_i^2 \Lambda_{q,i} / 2).

• Correlations: r(x) is a global Gaussian approximation,

      r(x) \propto \exp(\gamma_r^T x - x^T (\Lambda_r - J) x / 2)

  Covariance C(x_i, x_j) = \langle x_i x_j \rangle_{r(x)} - \langle x_i \rangle_{r(x)} \langle x_j \rangle_{r(x)} = \left[(\Lambda_r - J)^{-1}\right]_{ij}.

• The free energy: -\ln Z_{EC} \approx -\ln Z. Z is the marginal likelihood (or evidence) of the model.

• Supervised learning: predictive distribution and leave-one-out (Opper & Winther, 2000).
Non-Convex Optimization

The partition function Z(\lambda) = \int dx\, f(x) \exp(\lambda^T g(x)) is log-convex in \lambda:

    H = \frac{\partial^2 \ln Z}{\partial \lambda\, \partial \lambda^T}
      = \langle g(x) g(x)^T \rangle - \langle g(x) \rangle \langle g(x) \rangle^T.

EC is a non-convex optimization – like Bethe and variational:

    -\ln Z_{EC}(\lambda_q, \lambda_s) = -\ln Z_q(\lambda_q) - \ln Z_r(\lambda_s - \lambda_q) + \ln Z_s(\lambda_s)
        = -\ln \int dx\, f_q(x) \exp(\lambda_q^T g(x))
          - \ln \int dx\, f_r(x) \exp\left((\lambda_s - \lambda_q)^T g(x)\right)
          + \ln \int dx\, \exp(\lambda_s^T g(x))

Optimize with a single loop (no guarantee) or a double loop (slow).
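As a small numerical illustration (not from the slides), the check below confirms on a tiny binary model that the Hessian of ln Z(λ) equals the covariance of g(x), so each individual ln Z term is indeed convex in λ. The model and the test point λ are arbitrary.

```python
import itertools
import numpy as np

# Check H = d^2 ln Z / d lambda d lambda^T = Cov_lambda[g(x)] on a tiny binary model.
rng = np.random.default_rng(0)
N = 3
states = np.array(list(itertools.product([-1.0, 1.0], repeat=N)))
f = np.exp(rng.normal(scale=0.3, size=len(states)))      # arbitrary positive factor f(x)
G = np.stack([np.stack([x, -x**2 / 2], axis=-1).ravel() for x in states])

def lnZ(lam):
    return np.log(np.sum(f * np.exp(G @ lam)))

lam = rng.normal(scale=0.2, size=2 * N)

# Covariance of g(x) under p(x) proportional to f(x) exp(lam^T g(x)).
w = f * np.exp(G @ lam); p = w / w.sum()
mean_g = p @ G
cov_g = (G * p[:, None]).T @ G - np.outer(mean_g, mean_g)

# Finite-difference Hessian of ln Z.
eps = 1e-4
H = np.zeros((2 * N, 2 * N))
for i in range(2 * N):
    for j in range(2 * N):
        e_i, e_j = np.eye(2 * N)[i] * eps, np.eye(2 * N)[j] * eps
        H[i, j] = (lnZ(lam + e_i + e_j) - lnZ(lam + e_i - e_j)
                   - lnZ(lam - e_i + e_j) + lnZ(lam - e_i - e_j)) / (4 * eps**2)

print("max |H - Cov[g]| =", np.abs(H - cov_g).max())   # small (finite-difference error)
```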
Single Loop – Objective

Expectation consistency:

    \langle g(x) \rangle_q = \langle g(x) \rangle_r = \langle g(x) \rangle_s

with

    q(x) = \frac{1}{Z_q(\lambda_q)} f_q(x) \exp(\lambda_q^T g(x))
    r(x) = \frac{1}{Z_r(\lambda_r)} f_r(x) \exp(\lambda_r^T g(x)) \quad \text{with} \quad \lambda_r = \lambda_s - \lambda_q
    s(x) = \frac{1}{Z_s(\lambda_s)} \exp(\lambda_s^T g(x))

Send messages r → q → r → ... and make s consistent.
Single Loop – Propagation Algorithms

1. Send messages from r to q
   • Calculate the separator s(x). Solve for \lambda_s:
         \langle g(x) \rangle_s = \mu_r(t) \equiv \langle g(x) \rangle_{r(x;t)}
   • Update q(x): \lambda_q(t+1) := \lambda_s - \lambda_r(t)

2. Send messages from q to r
   • Calculate the separator s(x). Solve for \lambda_s:
         \langle g(x) \rangle_s = \mu_q(t+1) \equiv \langle g(x) \rangle_{q(x;t+1)}
   • Update r(x): \lambda_r(t+1) := \lambda_s - \lambda_q(t+1)

Expectation Propagation (EP): sequential factor-by-factor update.
Single Loop Details

q(x) non-Gaussian, factorized or on a spanning tree, and r(x) multivariate Gaussian.
Complexity O(N^3).

Factorized moments, g(x) = (x_1, -x_1^2/2, x_2, -x_2^2/2, \ldots, x_N, -x_N^2/2):
Gaussian separator s(x) = \prod_i s_i(x_i) with s_i(x_i) \propto \exp\left(\gamma_{s,i} x_i - \Lambda_{s,i} x_i^2 / 2\right).
Moment matching to the mean and variance of q and r: \gamma_{s,i} := m_i / v_i and \Lambda_{s,i} := 1 / v_i.

All second moments on a spanning tree: q(x) moments can be inferred by (exact) message
passing. s(x) is a multivariate Gaussian on a spanning tree; solve using a tree
decomposition of Z.
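Putting the last two slides together, here is a minimal end-to-end sketch of the parallel single-loop scheme for the Ising example (my own illustration, not the authors' code). It assumes weak random couplings and uses damping for stability; exact marginal means from brute-force enumeration are printed for comparison.

```python
import itertools
import numpy as np

# Single-loop EC for a small Ising model.
# q(x): factorized binary, r(x): multivariate Gaussian, s(x): factorized Gaussian separator.
rng = np.random.default_rng(1)
N = 6
J = rng.normal(scale=0.4 / np.sqrt(N), size=(N, N))
J = (J + J.T) / 2.0
np.fill_diagonal(J, 0.0)
theta = rng.normal(scale=0.3, size=N)

# Natural parameters (gamma, Lambda) per variable; start r as a broad Gaussian so that
# Lambda_r - J is positive definite.
gam_r = np.zeros(N)
Lam_r = np.full(N, np.max(np.linalg.eigvalsh(J)) + 1.0)
gam_q = np.zeros(N)
Lam_q = np.zeros(N)
damp = 0.5                      # damping helps the parallel updates converge

for _ in range(200):
    # r -> q: separator matches the marginal mean/variance of the Gaussian r(x).
    cov = np.linalg.inv(np.diag(Lam_r) - J)
    m_r, v_r = cov @ gam_r, np.diag(cov)
    gam_s, Lam_s = m_r / v_r, 1.0 / v_r
    gam_q = (1 - damp) * gam_q + damp * (gam_s - gam_r)
    Lam_q = (1 - damp) * Lam_q + damp * (Lam_s - Lam_r)

    # q -> r: separator matches the marginal mean/variance of the binary q(x).
    m_q = np.tanh(theta + gam_q)            # Lambda_q drops out since x_i^2 = 1
    v_q = 1.0 - m_q**2
    gam_s, Lam_s = m_q / v_q, 1.0 / v_q
    gam_r = (1 - damp) * gam_r + damp * (gam_s - gam_q)
    Lam_r = (1 - damp) * Lam_r + damp * (Lam_s - Lam_q)

# Exact marginal means by brute-force enumeration, for comparison.
states = np.array(list(itertools.product([-1.0, 1.0], repeat=N)))
logw = states @ theta + 0.5 * np.einsum('si,ij,sj->s', states, J, states)
p = np.exp(logw - logw.max())
p /= p.sum()

print("EC marginal means   :", np.round(np.tanh(theta + gam_q), 3))
print("exact marginal means:", np.round(p @ states, 3))
```

At the fixed point the three distributions share the same first and second moments, which is exactly the expectation-consistency condition of the earlier slides.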