

  1. Approximate inference on graphical models: variational methods. Alexandre Bouchard-Côté

  2. Exact inference in general graphs... • Recall: we now have an exact, general and “efficient” inference algorithm: the Junction-Tree algorithm • Why should you care about approximate inference?

  3. Exact inference in general graphs is hard • Running time of JT is exponential in the max clique size of the JT • Don’t have to look very far to find graphs where JT is arbitrarily slow

  4. We need approximate inference • A very hot topic in the Machine Learning community • Lots of important, open problems • Next lectures: Markov Chain Monte Carlo (MCMC) algorithms • Today: a completely different approach...

  5. Variational methods for approximate inference • Framework: – Cast the inference problem into a variational (optimization) problem – Relax (simplify) the variational problem

  6. Variational vs. sampling approaches
  Sampling: + Converges to the true answer; + Large toolbox and literature; − Mixing can be slow; − Assessing convergence is difficult
  Variational: + Generally very fast; + Deterministic algorithms; − Approximation can be poor; − Approximation can fail

  7. Program • Specific examples of variational methods • Outline of the unifying theory • Examples revisited

  8. Examples

  9. Examples will be on the good old Ising model
  • An undirected graphical model structured as a lattice in R^d
  • Sufficient statistics φ(x_s, x_t) = x_s x_t, with x_u ∈ {−1, +1}, encourage agreement of neighbors:
    P_\theta(X = x) = \exp\Big( \sum_{(s,t) \in E} \theta_{s,t}\, x_s x_t - A(\theta) \Big)
  • We will actually use x_u ∈ {0, 1} in the derivations, to slightly simplify them
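To make the object concrete, here is a minimal sketch that enumerates every configuration of a tiny {0,1} Ising grid and computes the log-partition function A(θ) by brute force. The 3×3 grid and the constant coupling θ_{s,t} = 0.5 are illustrative assumptions, and the point is that this enumeration is exactly what stops being feasible on larger grids.

```python
import itertools
import math

# A tiny {0,1} Ising model on a 3x3 grid (size and coupling value are
# illustrative choices, not taken from the slides).
n = 3
nodes = [(i, j) for i in range(n) for j in range(n)]
edges = [((i, j), (i, j + 1)) for i in range(n) for j in range(n - 1)] + \
        [((i, j), (i + 1, j)) for i in range(n - 1) for j in range(n)]
theta = {e: 0.5 for e in edges}  # theta_{s,t}, here constant on every edge

def log_potential(x):
    """<theta, phi(x)> = sum_{(s,t) in E} theta_{s,t} x_s x_t."""
    return sum(theta[(s, t)] * x[s] * x[t] for (s, t) in edges)

# Brute-force log-partition function A(theta) = log sum_x exp <theta, phi(x)>.
# Feasible only because there are 2^9 = 512 configurations; this is the
# computation that becomes intractable on large grids.
configs = [dict(zip(nodes, bits))
           for bits in itertools.product([0, 1], repeat=len(nodes))]
A = math.log(sum(math.exp(log_potential(x)) for x in configs))

# Exact edge mean parameters mu_{s,t} = E[x_s x_t], handy for later comparisons.
Z = math.exp(A)
mu = {e: sum(x[e[0]] * x[e[1]] * math.exp(log_potential(x)) for x in configs) / Z
      for e in edges}
print("A(theta) =", A)
```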

  10. Importance and physical interpretation • For grids in dimension > 2, encapsulates the full hardness of inference • Originates from statistical physics: a model for a crystal structure – vertices represent spins of particles – edges represent bonds • Demonstration...

  11. Example 1: Loopy Belief Propagation
  • Run max-product, even if you are not supposed to...
    M_{t \to s}(x_s) \;\propto\; \sum_{x_t \in \{0,1\}} \phi_{s,t}(x_s, x_t)\, \phi_t(x_t) \prod_{u \in N(t) \setminus \{s\}} M_{u \to t}(x_t)
  • t sends a message to s once it has received the messages from all its other neighbors: with this protocol, the updates make sense only on trees

  12. Example 1: Loopy Belief Propagation
  • t sends a message to s once it has received the messages from all its other neighbors: with this protocol, the updates make sense only on trees
  • On trees, the following protocol is equivalent: initialize the messages to one; then, at every iteration, all nodes send messages using what they received at the previous iteration
  • Makes sense on arbitrary graphs!
  • Does it work?
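The parallel schedule in the last bullets is straightforward to implement. Below is a minimal sum-product sketch of loopy BP on a small {0,1} Ising grid; the grid size, the constant coupling θ_{s,t} = 0.5, and the fixed 50 iterations are illustrative assumptions, and whether the resulting beliefs are close to the true marginals is precisely the "does it work?" question.

```python
import math

# Same illustrative {0,1} Ising grid as in the earlier sketch.
n = 3
nodes = [(i, j) for i in range(n) for j in range(n)]
edges = [((i, j), (i, j + 1)) for i in range(n) for j in range(n - 1)] + \
        [((i, j), (i + 1, j)) for i in range(n - 1) for j in range(n)]
theta = {e: 0.5 for e in edges}

def neighbors(u):
    return [t for (s, t) in edges if s == u] + [s for (s, t) in edges if t == u]

def edge_theta(s, t):
    return theta.get((s, t), theta.get((t, s)))

def pair_potential(s, t, xs, xt):
    # phi_{s,t}(x_s, x_t) = exp(theta_{s,t} * x_s * x_t); no singleton potentials.
    return math.exp(edge_theta(s, t) * xs * xt)

# Messages M_{t->s}(x_s), one per directed edge, all initialized to one.
msgs = {}
for (s, t) in edges:
    msgs[(s, t)] = {0: 1.0, 1: 1.0}
    msgs[(t, s)] = {0: 1.0, 1: 1.0}

# Parallel ("flooding") schedule: every directed edge is updated at each
# iteration from the previous iteration's messages.  50 iterations is an
# arbitrary cutoff; on loopy graphs convergence is not guaranteed.
for _ in range(50):
    new = {}
    for (t, s) in msgs:                      # message from t to s
        vals = {}
        for xs in (0, 1):
            total = 0.0
            for xt in (0, 1):
                incoming = 1.0
                for u in neighbors(t):
                    if u != s:
                        incoming *= msgs[(u, t)][xt]
                total += pair_potential(s, t, xs, xt) * incoming
            vals[xs] = total
        z = vals[0] + vals[1]                # normalize for numerical stability
        new[(t, s)] = {0: vals[0] / z, 1: vals[1] / z}
    msgs = new

# Beliefs (approximate marginals): b_s(x_s) proportional to the product of
# incoming messages, since there are no singleton potentials.
beliefs = {}
for v in nodes:
    b = {x: math.prod(msgs[(u, v)][x] for u in neighbors(v)) for x in (0, 1)}
    z = b[0] + b[1]
    beliefs[v] = {x: b[x] / z for x in (0, 1)}
print(beliefs[(1, 1)])
```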

  13. Example 2: Naive mean field
  • A simpler coordinate ascent algorithm:
    \mu_u \;\leftarrow\; \frac{1}{1 + \exp\big( -2 \sum_{s \in N(u) \setminus \{u\}} \theta_{s,u}\, \mu_s \big)}
  • Our goal is to make sense out of these algorithms
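For comparison with the loopy BP sketch, here is a minimal coordinate-ascent sketch on the same illustrative grid. The update used here, µ_u ← σ(Σ_{s∈N(u)} θ_{s,u} µ_s), is the standard naive mean-field fixed point for the {0,1} Ising model with no singleton potentials; the initialization and the number of sweeps are arbitrary choices.

```python
import math

# Same illustrative {0,1} Ising grid as in the earlier sketches.
n = 3
nodes = [(i, j) for i in range(n) for j in range(n)]
edges = [((i, j), (i, j + 1)) for i in range(n) for j in range(n - 1)] + \
        [((i, j), (i + 1, j)) for i in range(n - 1) for j in range(n)]
theta = {e: 0.5 for e in edges}

def neighbors(u):
    return [t for (s, t) in edges if s == u] + [s for (s, t) in edges if t == u]

def edge_theta(s, t):
    return theta.get((s, t), theta.get((t, s)))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# mu[u] approximates P(X_u = 1).  Coordinate ascent: sweep the nodes,
# updating one mu_u at a time while holding the others fixed.
mu = {u: 0.5 for u in nodes}
for sweep in range(100):
    for u in nodes:
        mu[u] = sigmoid(sum(edge_theta(s, u) * mu[s] for s in neighbors(u)))

print({u: round(mu[u], 3) for u in nodes})
```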

  14. Unifying theory

  15. The plan • Focus on computing A(θ) and µ(θ) = E_θ φ(X) • Construct an optimization problem s.t. – A(θ) is its maximum value – µ(θ) is the maximizing argument • Relax/simplify this optimization problem

  16. How to construct a variational formulation for A? • Key concept: convex duality (recall A is convex...) • Two equivalent ways to specify convex functions

  17. Convex Duality
  • The convex conjugate of f : \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}, denoted f^*, makes this equivalence explicit:
    f^*(y) := \sup_{x \in \mathbb{R}^d} \big( \langle y, x \rangle - f(x) \big)
  • set f^*(y) = +\infty when the supremum is unbounded, so that f^* : \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}
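As a quick sanity check on the definition, the sketch below approximates f*(y) = sup_x (yx − f(x)) by a grid search for f(x) = x², whose conjugate is known in closed form to be f*(y) = y²/4; the grid range and resolution are arbitrary choices.

```python
# Numerically approximate the convex conjugate f*(y) = sup_x (y*x - f(x))
# for f(x) = x^2, and compare with the closed form f*(y) = y^2 / 4.
# The grid over x is a crude stand-in for the sup; range and step are arbitrary.
def f(x):
    return x * x

xs = [i / 1000.0 for i in range(-10000, 10001)]  # x in [-10, 10]

def conjugate(y):
    return max(y * x - f(x) for x in xs)

for y in (-2.0, 0.0, 1.0, 3.0):
    print(y, conjugate(y), y * y / 4.0)
```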

  18. Geometric picture • Warning: for pedagogical reasons, assume for now that f is univariate, twice differentiable and strictly convex (can be made more general!!) • “f acts on points, f^* acts on tangents”

  19. Connection with our problem
  • We will show that for convex f: f^{**} := (f^*)^* = f
  • Using this with f = A and expanding the definition of convex conjugacy:
    A(\theta) = A^{**}(\theta) = \sup_{x} \big( \langle \theta, x \rangle - A^*(x) \big),
  a variational formulation.

  20. f** = f (1)
  • First, let us fix some y_0 \in \mathbb{R} and find a “closed form” for f^*(y_0):
    f^*(y_0) := \sup_{x \in \mathbb{R}^d} \big( x y_0 - f(x) \big)
  • Use differentiability and convexity to apply the derivative test:
    \frac{d}{dx}\big( x y_0 - f(x) \big) = y_0 - f'(x) = 0 \;\Rightarrow\; f'(x_{\max}) = y_0
  • Observation: f strictly convex ⇒ f' strictly increasing ⇒ the function x ↦ f'(x) is invertible

  21. f** = f (1)
  • First, let us fix some y_0 \in \mathbb{R} and find a “closed form” for f^*(y_0):
    f^*(y_0) := \sup_{x \in \mathbb{R}^d} \big( x y_0 - f(x) \big)
  • Use differentiability and convexity to apply the derivative test:
    \frac{d}{dx}\big( x y_0 - f(x) \big) = 0 \;\Rightarrow\; f'(x_{\max}) = y_0 \;\Rightarrow\; x_{\max} = f'^{-1}(y_0)
  • Plug this into x y_0 - f(x) to get f^*:
    f^*(y) = y\, x_{\max} - f(x_{\max}) = y\, f'^{-1}(y) - f\big(f'^{-1}(y)\big)

  22. f** = f (2)
  • We can repeat the same process on
    f^{**}(y_0) := \sup_{x \in \mathbb{R}^d} \big( x y_0 - f^*(x) \big),
  using the expression f^*(x) = x f'^{-1}(x) - f(f'^{-1}(x)) we found on the previous slide.
  • f^{*\prime} looks bad at first, but things cancel out:
    f^{*\prime}(x) = \frac{d}{dx}\Big[ x\, f'^{-1}(x) - f\big(f'^{-1}(x)\big) \Big]
                  = f'^{-1}(x) + x\,(f'^{-1})'(x) - \underbrace{f'\big(f'^{-1}(x)\big)}_{=\,x}\,(f'^{-1})'(x)
                  = f'^{-1}(x)

  23. f** = f (3)
  • This actually yields an alternate, implicit characterization of the convex conjugate (in the context of our restricted assumptions):
    f^{*\prime}(x) = f'^{-1}(x)
  • One can already see from this equation, applied twice, that f** = f up to a constant
  • Applying the derivative test as in part (1) of the derivation and using this result, we get f** = f (check).
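One way to carry out the suggested check, staying within the same restricted assumptions (univariate, strictly convex, twice differentiable f, with everything evaluated where the inverses are defined):

```latex
% Carrying out the check: apply the plug-in formula of slide 21 to f^* itself,
% using f^{*\prime} = f'^{-1} (so that (f^{*\prime})^{-1} = f').
\begin{align*}
f^{**}(y) &= y\,(f^{*\prime})^{-1}(y) - f^*\!\big((f^{*\prime})^{-1}(y)\big)\\
          &= y\,f'(y) - f^*\!\big(f'(y)\big)\\
          &= y\,f'(y) - \Big[\, f'(y)\,f'^{-1}\big(f'(y)\big) - f\big(f'^{-1}(f'(y))\big) \Big]\\
          &= y\,f'(y) - y\,f'(y) + f(y) \;=\; f(y).
\end{align*}
```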

  24. Caveat • There was a problem with this derivation: f'(f'^{-1}(x)) = x is only defined for x in the image of f', denoted ℑ(f') • Set f^*(x) = +∞ otherwise • In the setting where f = A, this will correspond to constraints on the realizable mean parameters

  25. Example: Bernoulli random variable
  • P(X = x) ∝ exp(θx) for x ∈ {0, 1}, θ ∈ \mathbb{R}
  • A(θ) = log(1 + exp(θ))
  • Let’s compute A^* using the formula we derived:
    A^*(\mu) = \begin{cases} \mu\, A'^{-1}(\mu) - A\big(A'^{-1}(\mu)\big) & \text{for } \mu \in \Im(A'), \\ +\infty & \text{otherwise.} \end{cases}
  • By the way, recall that E_θ φ(X) = A'(θ); here φ(x) = x, which explains the notation

  26. Example: Bernoulli random variable
  • A(θ) = log(1 + exp(θ))
  • A'(\theta) = \frac{\exp(\theta)}{1 + \exp(\theta)}
  • A'^{-1}(\mu) = \log\frac{\mu}{1 - \mu}, for µ ∈ ℑ(A') = (0, 1)

  27. Plug these into A^*(\mu) = \mu A'^{-1}(\mu) - A(A'^{-1}(\mu))
  • A(θ) = log(1 + exp(θ))
  • A'^{-1}(\mu) = \log\frac{\mu}{1 - \mu}, for µ ∈ ℑ(A') = (0, 1)
  • We get, for µ ∈ ℑ(A') = (0, 1):
    A^*(\mu) = \mu\, A'^{-1}(\mu) - A\big(A'^{-1}(\mu)\big)
             = \mu \log\frac{\mu}{1-\mu} - \log\Big( 1 + \exp\Big( \log\frac{\mu}{1-\mu} \Big) \Big)
             = \mu \log \mu + (1 - \mu) \log(1 - \mu)
  • Does that look familiar?
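As a quick numerical check of this closed form, the sketch below approximates A*(µ) = sup_θ (µθ − A(θ)) by a grid search over θ and compares it with µ log µ + (1 − µ) log(1 − µ); the grid range over θ is an arbitrary choice.

```python
import math

def A(theta):
    # Log-partition function of the Bernoulli family: A(theta) = log(1 + exp(theta)).
    return math.log(1.0 + math.exp(theta))

thetas = [t / 100.0 for t in range(-2000, 2001)]  # theta in [-20, 20], arbitrary grid

def A_star_numeric(mu):
    # Direct grid approximation of sup_theta ( mu * theta - A(theta) ).
    return max(mu * t - A(t) for t in thetas)

def A_star_closed_form(mu):
    return mu * math.log(mu) + (1.0 - mu) * math.log(1.0 - mu)

for mu in (0.1, 0.3, 0.5, 0.9):
    print(mu, round(A_star_numeric(mu), 6), round(A_star_closed_form(mu), 6))
```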

  28. General expression for A^*
  • This is the negative entropy!
  • This actually holds in general: let H_µ := −E_µ log p_µ(X) be the entropy of the r.v. characterized by the moment parameters µ; then
    A^*(\mu) = \begin{cases} -H_\mu & \text{if } \mu \in M, \\ +\infty & \text{otherwise.} \end{cases}
  • Here, M := ℑ(A') is the set of realizable mean parameters

  29. Negative entropy interpretation
  • General derivation: for µ ∈ M:
    A^*(\mu) = \langle \theta(\mu), \mu \rangle - A(\theta(\mu))
             = \langle \theta(\mu), E_{\theta(\mu)} \phi(X) \rangle - \log Z(\theta(\mu))
             = E_{\theta(\mu)} \log \exp\langle \theta(\mu), \phi(X) \rangle - E_{\theta(\mu)} \log Z(\theta(\mu))
             = E_{\theta(\mu)} \log \frac{\exp\langle \theta(\mu), \phi(X) \rangle}{Z(\theta(\mu))}
             = E_{\theta(\mu)} \log p_\mu(X)
             = -H_\mu

  30. Finally, a variational formulation
  • Given θ_0, the following optimization problem:
    \sup_{\mu \in M} \big( \langle \theta_0, \mu \rangle + H_\mu \big)
    – has optimal value A(θ_0)
  • Moreover, it is maximized by µ = E_{θ_0} φ(X):
    – the value µ_max achieving the sup satisfies θ_0 = A^{*\prime}(\mu_{\max})
    – using A^{*\prime}(x) = A'^{-1}(x), we get \mu_{\max} = A'(\theta_0)
    – hence \mu_{\max} = E_{\theta_0} \phi(X)

  31. Finally, a variational formulation
  • Given θ_0, the following optimization problem:
    \sup_{\mu \in M} \big( \langle \theta_0, \mu \rangle + H_\mu \big)
    – has optimal value A(θ_0)
    – is maximized by µ = E_{θ_0} φ(X)
  • How can we relax this optimization problem?
    – restrict M to a subset \tilde{M} \subset M on which H_µ is easy to compute
    – approximate H_µ
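To see the formulation in action on the Bernoulli example, the sketch below maximizes ⟨θ_0, µ⟩ + H_µ over µ ∈ (0, 1) by grid search and compares the optimum with A(θ_0) = log(1 + exp(θ_0)) and the maximizer with E_{θ_0}φ(X) = σ(θ_0); the value θ_0 = 1.3 and the grid resolution are arbitrary choices.

```python
import math

theta0 = 1.3  # arbitrary choice of natural parameter

def entropy(mu):
    # H_mu for a Bernoulli(mu) variable.
    return -(mu * math.log(mu) + (1.0 - mu) * math.log(1.0 - mu))

# Grid search over the mean parameter space M = (0, 1).
grid = [i / 100000.0 for i in range(1, 100000)]
best_mu = max(grid, key=lambda mu: theta0 * mu + entropy(mu))
best_value = theta0 * best_mu + entropy(best_mu)

print("variational value:", best_value)
print("A(theta0)        :", math.log(1.0 + math.exp(theta0)))
print("maximizer        :", best_mu)
print("sigmoid(theta0)  :", 1.0 / (1.0 + math.exp(-theta0)))
```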

  32. Examples, revisited

  33. Mean field: a relaxation on M
  • Mean field as a relaxation on M
  • Recall: removing edges in a graphical model ⇒ adding independence constraints ⇒ smaller set of realizable mean parameters
  • Formally: \tilde{M} := \{ \mu : \mu_{s,t} = \mu_s\, \mu_t \}
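To connect this back to Example 2: under the {0,1} convention, and assuming a fully factorized µ so that H_µ decomposes into a sum of Bernoulli entropies, the relaxed problem and its coordinate-ascent update can be written out as below. This is a sketch of how a sigmoid update of the neighbor sum arises, not a claim about the exact constants on the Example 2 slide.

```latex
% Mean-field relaxation for the {0,1} Ising model with a fully factorized mu:
% the entropy splits into node-wise Bernoulli entropies, so the relaxed
% problem sup_{mu in M-tilde} ( <theta, mu> + H_mu ) becomes
\begin{align*}
\sup_{\mu \in [0,1]^V} \;
  \sum_{(s,t) \in E} \theta_{s,t}\,\mu_s \mu_t
  \;-\; \sum_{u \in V} \Big[ \mu_u \log \mu_u + (1-\mu_u)\log(1-\mu_u) \Big].
\end{align*}
% Coordinate ascent: fix every coordinate except mu_u and set the derivative
% to zero,
\begin{align*}
\sum_{s \in N(u)} \theta_{s,u}\,\mu_s + \log\frac{1-\mu_u}{\mu_u} = 0
\quad\Longrightarrow\quad
\mu_u = \frac{1}{1 + \exp\!\big(-\sum_{s \in N(u)} \theta_{s,u}\,\mu_s\big)},
\end{align*}
% i.e. a sigmoid update of the neighbor sum, as in Example 2.
```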
