Probabilistic & Unsupervised Learning

Belief Propagation

Maneesh Sahani
maneesh@gatsby.ucl.ac.uk
Gatsby Computational Neuroscience Unit, and MSc ML/CSML, Dept Computer Science
University College London

Term 1, Autumn 2016
Recall: Belief Propagation on undirected trees

Joint distribution of an undirected tree:
$$p(\mathcal{X}) = \frac{1}{Z} \prod_{\text{nodes } i} f_i(X_i) \prod_{\text{edges } (ij)} f_{ij}(X_i, X_j)$$

Messages computed recursively:
$$M_{j \to i}(X_i) := \sum_{X_j} f_{ij}(X_i, X_j)\, f_j(X_j) \prod_{l \in \mathrm{ne}(j) \setminus i} M_{l \to j}(X_j)$$

Marginal distributions:
$$p(X_i) \propto f_i(X_i) \prod_{k \in \mathrm{ne}(i)} M_{k \to i}(X_i)$$
$$p(X_i, X_j) \propto f_{ij}(X_i, X_j)\, f_i(X_i)\, f_j(X_j) \prod_{k \in \mathrm{ne}(i) \setminus j} M_{k \to i}(X_i) \prod_{l \in \mathrm{ne}(j) \setminus i} M_{l \to j}(X_j)$$
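As a concrete illustration of these recursions, here is a minimal sketch of exact BP on a small binary tree, with the resulting marginals checked against brute-force enumeration. The tree, potentials, and variable names are illustrative assumptions, not taken from the slides.

```python
import numpy as np
from itertools import product

# Toy undirected tree: 0 - 1 - 2, with an extra leaf 3 attached to node 1.
K = 2
nodes = [0, 1, 2, 3]
edges = [(0, 1), (1, 2), (1, 3)]

rng = np.random.default_rng(0)
f_node = {i: rng.random(K) + 0.5 for i in nodes}           # f_i(X_i)
f_edge = {e: rng.random((K, K)) + 0.5 for e in edges}      # f_ij(X_i, X_j)

def pot(i, j):
    """f_ij indexed as (X_i, X_j), whichever way the edge is stored."""
    return f_edge[(i, j)] if (i, j) in f_edge else f_edge[(j, i)].T

ne = {i: [j for (a, b) in edges for j in (a, b) if i in (a, b) and j != i] for i in nodes}

def message(j, i):
    # M_{j->i}(X_i) = sum_{X_j} f_ij(X_i,X_j) f_j(X_j) prod_{l in ne(j)\i} M_{l->j}(X_j)
    incoming = np.ones(K)
    for l in ne[j]:
        if l != i:
            incoming *= message(l, j)          # recursion terminates on a tree
    return pot(j, i).T @ (f_node[j] * incoming)

def marginal(i):
    # p(X_i) ∝ f_i(X_i) prod_{k in ne(i)} M_{k->i}(X_i)
    b = f_node[i].copy()
    for k in ne[i]:
        b *= message(k, i)
    return b / b.sum()

# Brute-force check by enumerating all joint configurations.
def joint(x):
    p = np.prod([f_node[i][x[i]] for i in nodes])
    for (a, b) in edges:
        p *= f_edge[(a, b)][x[a], x[b]]
    return p

Z = sum(joint(x) for x in product(range(K), repeat=len(nodes)))
for i in nodes:
    brute = np.array([sum(joint(x) for x in product(range(K), repeat=len(nodes)) if x[i] == s)
                      for s in range(K)]) / Z
    print(i, marginal(i), brute)               # the two agree exactly on a tree
```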
Loopy Belief Propagation

Joint distribution of an undirected graph:
$$p(\mathcal{X}) = \frac{1}{Z} \prod_{\text{nodes } i} f_i(X_i) \prod_{\text{edges } (ij)} f_{ij}(X_i, X_j)$$

Messages computed recursively (with few guarantees of convergence):
$$M_{j \to i}(X_i) := \sum_{X_j} f_{ij}(X_i, X_j)\, f_j(X_j) \prod_{l \in \mathrm{ne}(j) \setminus i} M_{l \to j}(X_j)$$

Marginal distributions are approximate in general:
$$p(X_i) \approx b_i(X_i) \propto f_i(X_i) \prod_{k \in \mathrm{ne}(i)} M_{k \to i}(X_i)$$
$$p(X_i, X_j) \approx b_{ij}(X_i, X_j) \propto f_{ij}(X_i, X_j)\, f_i(X_i)\, f_j(X_j) \prod_{k \in \mathrm{ne}(i) \setminus j} M_{k \to i}(X_i) \prod_{l \in \mathrm{ne}(j) \setminus i} M_{l \to j}(X_j)$$
Dealing with loops

◮ Accuracy: BP posterior marginals are approximate on all non-trees because evidence is over-counted, but converged approximations are frequently found to be good (particularly in their means).
◮ Convergence: no general guarantee, but BP does converge in some cases:
  ◮ Trees.
  ◮ Graphs with a single loop.
  ◮ Distributions with sufficiently weak interactions.
  ◮ Graphs with long (and weak) loops.
  ◮ Gaussian networks: means correct, variances may also converge.
◮ Damping: a common approach to encourage convergence (cf. EP; a code sketch follows after this slide):
$$M^{\text{new}}_{i \to j}(X_j) := (1 - \alpha)\, M^{\text{old}}_{i \to j}(X_j) + \alpha \sum_{X_i} f_{ij}(X_i, X_j)\, f_i(X_i) \prod_{k \in \mathrm{ne}(i) \setminus j} M_{k \to i}(X_i)$$
◮ Grouping variables: variables can be grouped into cliques to improve accuracy.
  ◮ Region graph approximations.
  ◮ Cluster variational method.
  ◮ Junction graph.
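The damped update can be written in a few lines. Below is a minimal sketch of damped loopy BP on a three-node loop with binary variables; the graph, potentials, damping weight α = 0.5, and sweep count are illustrative assumptions, not values from the slides.

```python
import numpy as np

K = 2                                     # states per variable
nodes = [0, 1, 2]
edges = [(0, 1), (1, 2), (2, 0)]          # a single loop

rng = np.random.default_rng(0)
f_node = {i: rng.random(K) + 0.5 for i in nodes}          # f_i(X_i)
f_edge = {e: rng.random((K, K)) + 0.5 for e in edges}     # f_ij(X_i, X_j)

def pot(i, j):
    """f_ij indexed as (X_i, X_j), whichever way the edge is stored."""
    return f_edge[(i, j)] if (i, j) in f_edge else f_edge[(j, i)].T

ne = {i: [j for (a, b) in edges for j in (a, b) if i in (a, b) and j != i] for i in nodes}

# messages M[(j, i)] = M_{j->i}(X_i), initialised uniform
M = {(j, i): np.ones(K) / K for i in nodes for j in ne[i]}

alpha = 0.5                               # damping weight on the new message
for sweep in range(100):
    M_old = {k: v.copy() for k, v in M.items()}
    for (j, i) in M:
        # undamped BP update: sum_{X_j} f_ij(X_i,X_j) f_j(X_j) prod_{l in ne(j)\i} M_{l->j}(X_j)
        incoming = np.ones(K)
        for l in ne[j]:
            if l != i:
                incoming *= M_old[(l, j)]
        new = pot(j, i).T @ (f_node[j] * incoming)
        new /= new.sum()                  # normalise for numerical stability
        M[(j, i)] = (1 - alpha) * M_old[(j, i)] + alpha * new

# approximate marginals: b_i(X_i) ∝ f_i(X_i) prod_{k in ne(i)} M_{k->i}(X_i)
for i in nodes:
    b = f_node[i].copy()
    for k in ne[i]:
        b *= M[(k, i)]
    print(i, b / b.sum())
```

Setting alpha = 1 recovers the undamped loopy BP updates; smaller alpha trades convergence speed for stability.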
Different Interpretations of Loopy Belief Propagation

Loopy BP can be interpreted as a fixed-point algorithm from a few different perspectives:
◮ Expectation propagation.
◮ Tree-based reparametrization.
◮ Bethe free energy.
Loopy BP as message-based Expectation Propagation

Approximate the pairwise factors f_ij by a product of messages:
$$f_{ij}(X_i, X_j) \approx \tilde{f}_{ij}(X_i, X_j) = M_{i \to j}(X_j)\, M_{j \to i}(X_i)$$

Thus, the full joint is approximated by a factorised distribution:
$$p(\mathcal{X}) \approx \frac{1}{Z} \prod_{\text{nodes } i} f_i(X_i) \prod_{\text{edges } (ij)} \tilde{f}_{ij}(X_i, X_j) = \frac{1}{Z} \prod_{\text{nodes } i} \Big( f_i(X_i) \prod_{j \in \mathrm{ne}(i)} M_{j \to i}(X_i) \Big) = \prod_{\text{nodes } i} b_i(X_i)$$

but with multiple factors for most X_i.
Loopy BP as message-based EP

Then the EP updates to the messages are:

◮ Deletion:
$$q^{\neg ij}(\mathcal{X}) = f_i(X_i)\, f_j(X_j) \prod_{k \in \mathrm{ne}(i) \setminus j} M_{k \to i}(X_i) \prod_{l \in \mathrm{ne}(j) \setminus i} M_{l \to j}(X_j) \prod_{s \neq i,j} f_s(X_s) \prod_{t \in \mathrm{ne}(s)} M_{t \to s}(X_s)$$

◮ Projection:
$$\{M^{\text{new}}_{i \to j}, M^{\text{new}}_{j \to i}\} = \operatorname*{argmin}\; \mathrm{KL}\big[\, f_{ij}(X_i, X_j)\, q^{\neg ij}(X_i, X_j) \,\big\|\, M_{j \to i}(X_i)\, M_{i \to j}(X_j)\, q^{\neg ij}(X_i, X_j) \,\big]$$

Now, $q^{\neg ij}(\cdot)$ factors ⇒ the rhs factors ⇒ the minimum is achieved by the marginals of $f_{ij}(\cdot)\, q^{\neg ij}(\cdot)$:
$$M^{\text{new}}_{j \to i}(X_i)\, q^{\neg ij}(X_i) = \sum_{X_j} f_{ij}(X_i, X_j) \Big( f_j(X_j) \prod_{l \in \mathrm{ne}(j) \setminus i} M_{l \to j}(X_j) \Big) \Big( f_i(X_i) \prod_{k \in \mathrm{ne}(i) \setminus j} M_{k \to i}(X_i) \Big)$$
and, since $q^{\neg ij}(X_i) \propto f_i(X_i) \prod_{k \in \mathrm{ne}(i) \setminus j} M_{k \to i}(X_i)$,
$$\Rightarrow\quad M^{\text{new}}_{j \to i}(X_i) = \sum_{X_j} f_{ij}(X_i, X_j)\, f_j(X_j) \prod_{l \in \mathrm{ne}(j) \setminus i} M_{l \to j}(X_j)$$
recovering the standard loopy BP message update.
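To make the last step concrete, here is a small numerical sketch showing that, for discrete variables, dividing the X_i marginal of the tilted distribution f_ij·q^{¬ij} by the cavity marginal q^{¬ij}(X_i) reproduces the standard BP message. The state counts, potentials, and cavity messages are random illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
Ki, Kj = 3, 4                                  # state counts for X_i and X_j

f_ij = rng.random((Ki, Kj)) + 0.1              # pairwise factor f_ij(X_i, X_j)
f_j = rng.random(Kj) + 0.1                     # node factor f_j(X_j)
prod_Mlj = rng.random(Kj) + 0.1                # prod_{l in ne(j)\i} M_{l->j}(X_j)
f_i = rng.random(Ki) + 0.1                     # node factor f_i(X_i)
prod_Mki = rng.random(Ki) + 0.1                # prod_{k in ne(i)\j} M_{k->i}(X_i)

# cavity marginal on (X_i, X_j): q^{~ij} ∝ [f_i prod M_{k->i}] [f_j prod M_{l->j}]
q_cav = np.outer(f_i * prod_Mki, f_j * prod_Mlj)

# tilted distribution f_ij * q^{~ij} and its X_i marginal
tilted = f_ij * q_cav
tilted_i = tilted.sum(axis=1)

# EP update: M_new_{j->i}(X_i) = (X_i marginal of tilted) / q^{~ij}(X_i)
M_ep = tilted_i / (f_i * prod_Mki)

# standard loopy BP message: sum_{X_j} f_ij(X_i,X_j) f_j(X_j) prod M_{l->j}(X_j)
M_bp = f_ij @ (f_j * prod_Mlj)

print(np.allclose(M_ep, M_bp))                 # -> True (exact, since nothing was normalised)
```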
Message-based EP

◮ Thus message-based EP in a loopy graph need not be seen as two separate approximations, one to the sites and one to the cavity (as we had in the EP lecture).
◮ Instead, we can see it as a more severe constraint on the approximate sites: not just to an ExpFam factor, but to a product of ExpFam messages.
◮ On a tree-structured graph the message-factored version of EP finds the same marginals as standard EP.
◮ Messages are calculated in exactly the same way as before (cf. NLSSM).
◮ Pairwise marginals can be found after convergence by computing $\tilde{P}(y_{i-1}, y_i)$ as required (cf. forward-backward for HMMs).