

  1. Approximate inference on graphical models: variational methods. Alexandre Bouchard-Côté

  2. Exact inference in general graphs... • Recall: we now have an exact, general and “efficient” inference algorithm: the Junction-Tree algorithm • Why should you care about approximate inference?

  3. Exact inference in general graphs is hard • Running time of JT is exponential in the max clique size of the JT • Don’t have to look very far to find graphs where JT is arbitrarily slow

  4. We need approximate inference • A very hot topic in the Machine Learning community • Lots of important, open problems • Next lectures: Markov Chain Monte Carlo (MCMC) algorithms • Today: a completely different approach...

  5. Variational methods for approximate inference • Framework: – Cast the inference problem into a variational (optimization) problem – Relax (simplify) the variational problem

  6. Variational vs. sampling approaches
  Sampling: + Converges to the true answer; + Large toolbox and literature; − Mixing can be slow; − Assessing convergence is difficult
  Variational: + Generally very fast; + Deterministic algorithms; − Approximation can be poor; − Approximation can fail

  7. Program • Specific examples of variational methods • Outline of the unifying theory • Examples revisited

  8. Examples

  9. Examples will be on the good old Ising model
  • An undirected graphical model structured as a lattice in R^d
  • Sufficient statistics φ(x_s, x_t) = x_s x_t, with x_u ∈ {−1, +1}, encourage agreement of neighbors:
    P_\theta(X = x) = \exp\Big( \sum_{(s,t) \in E} \theta_{s,t}\, x_s x_t - A(\theta) \Big)
  • We will actually use x_u ∈ {0, 1} in the derivations, to slightly simplify them
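To make the object concrete, here is a minimal sketch that enumerates every configuration of a tiny {0,1} Ising grid and computes the log-partition function A(θ) by brute force. The 3×3 grid and the constant coupling θ_{s,t} = 0.5 are illustrative assumptions, and the point is that this enumeration is exactly what stops being feasible on larger grids.

```python
import itertools
import math

# A tiny {0,1} Ising model on a 3x3 grid (size and coupling value are
# illustrative choices, not taken from the slides).
n = 3
nodes = [(i, j) for i in range(n) for j in range(n)]
edges = [((i, j), (i, j + 1)) for i in range(n) for j in range(n - 1)] + \
        [((i, j), (i + 1, j)) for i in range(n - 1) for j in range(n)]
theta = {e: 0.5 for e in edges}  # theta_{s,t}, here constant on every edge

def log_potential(x):
    """<theta, phi(x)> = sum_{(s,t) in E} theta_{s,t} x_s x_t."""
    return sum(theta[(s, t)] * x[s] * x[t] for (s, t) in edges)

# Brute-force log-partition function A(theta) = log sum_x exp <theta, phi(x)>.
# Feasible only because there are 2^9 = 512 configurations; this is the
# computation that becomes intractable on large grids.
configs = [dict(zip(nodes, bits))
           for bits in itertools.product([0, 1], repeat=len(nodes))]
A = math.log(sum(math.exp(log_potential(x)) for x in configs))

# Exact edge mean parameters mu_{s,t} = E[x_s x_t], handy for later comparisons.
Z = math.exp(A)
mu = {e: sum(x[e[0]] * x[e[1]] * math.exp(log_potential(x)) for x in configs) / Z
      for e in edges}
print("A(theta) =", A)
```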

  10. Importance and physical interpretation • For grids in dimension > 2, encapsulates the full hardness of inference • Originates from statistical physics: a model for a crystal structure – vertices represent spins of particles – edges represent bonds • Demonstration...

  11. Example 1: Loopy Belief Propagation
  • Run max-product, even if you are not supposed to...
    M_{t \to s}(x_s) \;\propto\; \sum_{x_t \in \{0,1\}} \phi_{s,t}(x_s, x_t)\, \phi_t(x_t) \prod_{u \in N(t) \setminus \{s\}} M_{u \to t}(x_t)
  • t sends a message to s once it has received the messages from all its other neighbors: with this protocol, the updates make sense only on trees

  12. Example 1: Loopy Belief Propagation
  • t sends a message to s once it has received the messages from all its other neighbors: with this protocol, the updates make sense only on trees
  • On trees, the following protocol is equivalent: initialize the messages to one; then, at every iteration, all nodes send messages using what they received at the previous iteration
  • Makes sense on arbitrary graphs!
  • Does it work?
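The parallel schedule in the last bullets is straightforward to implement. Below is a minimal sum-product sketch of loopy BP on a small {0,1} Ising grid; the grid size, the constant coupling θ_{s,t} = 0.5, and the fixed 50 iterations are illustrative assumptions, and whether the resulting beliefs are close to the true marginals is precisely the "does it work?" question.

```python
import math

# Same illustrative {0,1} Ising grid as in the earlier sketch.
n = 3
nodes = [(i, j) for i in range(n) for j in range(n)]
edges = [((i, j), (i, j + 1)) for i in range(n) for j in range(n - 1)] + \
        [((i, j), (i + 1, j)) for i in range(n - 1) for j in range(n)]
theta = {e: 0.5 for e in edges}

def neighbors(u):
    return [t for (s, t) in edges if s == u] + [s for (s, t) in edges if t == u]

def edge_theta(s, t):
    return theta.get((s, t), theta.get((t, s)))

def pair_potential(s, t, xs, xt):
    # phi_{s,t}(x_s, x_t) = exp(theta_{s,t} * x_s * x_t); no singleton potentials.
    return math.exp(edge_theta(s, t) * xs * xt)

# Messages M_{t->s}(x_s), one per directed edge, all initialized to one.
msgs = {}
for (s, t) in edges:
    msgs[(s, t)] = {0: 1.0, 1: 1.0}
    msgs[(t, s)] = {0: 1.0, 1: 1.0}

# Parallel ("flooding") schedule: every directed edge is updated at each
# iteration from the previous iteration's messages.  50 iterations is an
# arbitrary cutoff; on loopy graphs convergence is not guaranteed.
for _ in range(50):
    new = {}
    for (t, s) in msgs:                      # message from t to s
        vals = {}
        for xs in (0, 1):
            total = 0.0
            for xt in (0, 1):
                incoming = 1.0
                for u in neighbors(t):
                    if u != s:
                        incoming *= msgs[(u, t)][xt]
                total += pair_potential(s, t, xs, xt) * incoming
            vals[xs] = total
        z = vals[0] + vals[1]                # normalize for numerical stability
        new[(t, s)] = {0: vals[0] / z, 1: vals[1] / z}
    msgs = new

# Beliefs (approximate marginals): b_s(x_s) proportional to the product of
# incoming messages, since there are no singleton potentials.
beliefs = {}
for v in nodes:
    b = {x: math.prod(msgs[(u, v)][x] for u in neighbors(v)) for x in (0, 1)}
    z = b[0] + b[1]
    beliefs[v] = {x: b[x] / z for x in (0, 1)}
print(beliefs[(1, 1)])
```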

  13. Example 2: Naive mean field
  • A simpler coordinate ascent algorithm:
    \mu_u \;\leftarrow\; \frac{1}{1 + \exp\big( -2 \sum_{s \in N(u) \setminus \{u\}} \theta_{s,u}\, \mu_s \big)}
  • Our goal is to make sense out of these algorithms
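For comparison with the loopy BP sketch, here is a minimal coordinate-ascent sketch on the same illustrative grid. The update used here, µ_u ← σ(Σ_{s∈N(u)} θ_{s,u} µ_s), is the standard naive mean-field fixed point for the {0,1} Ising model with no singleton potentials; the initialization and the number of sweeps are arbitrary choices.

```python
import math

# Same illustrative {0,1} Ising grid as in the earlier sketches.
n = 3
nodes = [(i, j) for i in range(n) for j in range(n)]
edges = [((i, j), (i, j + 1)) for i in range(n) for j in range(n - 1)] + \
        [((i, j), (i + 1, j)) for i in range(n - 1) for j in range(n)]
theta = {e: 0.5 for e in edges}

def neighbors(u):
    return [t for (s, t) in edges if s == u] + [s for (s, t) in edges if t == u]

def edge_theta(s, t):
    return theta.get((s, t), theta.get((t, s)))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# mu[u] approximates P(X_u = 1).  Coordinate ascent: sweep the nodes,
# updating one mu_u at a time while holding the others fixed.
mu = {u: 0.5 for u in nodes}
for sweep in range(100):
    for u in nodes:
        mu[u] = sigmoid(sum(edge_theta(s, u) * mu[s] for s in neighbors(u)))

print({u: round(mu[u], 3) for u in nodes})
```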

  14. Unifying theory

  15. The plan • Focus on computing A(θ) and µ(θ) = E_θ φ(X) • Construct an optimization problem s.t. – A(θ) is its maximum value – µ(θ) is the maximizing argument • Relax/simplify this optimization problem

  16. How to construct a variational formulation for A? • Key concept: convex duality (recall A is convex...) • Two equivalent ways to specify convex functions

  17. Convex Duality
  • The convex conjugate of f : \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}, denoted f^*, makes this equivalence explicit:
    f^*(y) := \sup_{x \in \mathbb{R}^d} \big( \langle y, x \rangle - f(x) \big)
  • set f^*(y) = +\infty when the supremum is unbounded, so that f^* : \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}
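As a quick sanity check on the definition, the sketch below approximates f*(y) = sup_x (yx − f(x)) by a grid search for f(x) = x², whose conjugate is known in closed form to be f*(y) = y²/4; the grid range and resolution are arbitrary choices.

```python
# Numerically approximate the convex conjugate f*(y) = sup_x (y*x - f(x))
# for f(x) = x^2, and compare with the closed form f*(y) = y^2 / 4.
# The grid over x is a crude stand-in for the sup; range and step are arbitrary.
def f(x):
    return x * x

xs = [i / 1000.0 for i in range(-10000, 10001)]  # x in [-10, 10]

def conjugate(y):
    return max(y * x - f(x) for x in xs)

for y in (-2.0, 0.0, 1.0, 3.0):
    print(y, conjugate(y), y * y / 4.0)
```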

  18. Geometric picture • Warning: for pedagogical reasons, assume for now that f is univariate, twice differentiable and strictly convex (can be made more general!!) • “f acts on points, f^* acts on tangents”

  19. Connection with our problem
  • We will show that for convex f: f^{**} := (f^*)^* = f
  • Using this with f = A and expanding the definition of convex conjugacy:
    A(\theta) = A^{**}(\theta) = \sup_{x} \big( \langle \theta, x \rangle - A^*(x) \big),
  a variational formulation.

  20. f** = f (1)
  • First, let us fix some y_0 \in \mathbb{R} and find a “closed form” for f^*(y_0):
    f^*(y_0) := \sup_{x \in \mathbb{R}^d} \big( x y_0 - f(x) \big)
  • Use differentiability and convexity to apply the derivative test:
    \frac{d}{dx}\big( x y_0 - f(x) \big) = y_0 - f'(x) = 0 \;\Rightarrow\; f'(x_{\max}) = y_0
  • Observation: f strictly convex ⇒ f' strictly increasing ⇒ the function x ↦ f'(x) is invertible

  21. f** = f (1)
  • First, let us fix some y_0 \in \mathbb{R} and find a “closed form” for f^*(y_0):
    f^*(y_0) := \sup_{x \in \mathbb{R}^d} \big( x y_0 - f(x) \big)
  • Use differentiability and convexity to apply the derivative test:
    \frac{d}{dx}\big( x y_0 - f(x) \big) = 0 \;\Rightarrow\; f'(x_{\max}) = y_0 \;\Rightarrow\; x_{\max} = f'^{-1}(y_0)
  • Plug this into x y_0 - f(x) to get f^*:
    f^*(y) = y\, x_{\max} - f(x_{\max}) = y\, f'^{-1}(y) - f\big(f'^{-1}(y)\big)

  22. f** = f (2)
  • We can repeat the same process on
    f^{**}(y_0) := \sup_{x \in \mathbb{R}^d} \big( x y_0 - f^*(x) \big),
  using the expression f^*(x) = x f'^{-1}(x) - f(f'^{-1}(x)) we found on the previous slide.
  • f^{*\prime} looks bad at first, but things cancel out:
    f^{*\prime}(x) = \frac{d}{dx}\Big[ x\, f'^{-1}(x) - f\big(f'^{-1}(x)\big) \Big]
                  = f'^{-1}(x) + x\,(f'^{-1})'(x) - \underbrace{f'\big(f'^{-1}(x)\big)}_{=\,x}\,(f'^{-1})'(x)
                  = f'^{-1}(x)

  23. f** = f (3)
  • This actually yields an alternate, implicit characterization of the convex conjugate (in the context of our restricted assumptions):
    f^{*\prime}(x) = f'^{-1}(x)
  • One can already see from this equation, applied twice, that f** = f up to a constant
  • Applying the derivative test as in part (1) of the derivation and using this result, we get f** = f (check).
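One way to carry out the suggested check, staying within the same restricted assumptions (univariate, strictly convex, twice differentiable f, with everything evaluated where the inverses are defined):

```latex
% Carrying out the check: apply the plug-in formula of slide 21 to f^* itself,
% using f^{*\prime} = f'^{-1} (so that (f^{*\prime})^{-1} = f').
\begin{align*}
f^{**}(y) &= y\,(f^{*\prime})^{-1}(y) - f^*\!\big((f^{*\prime})^{-1}(y)\big)\\
          &= y\,f'(y) - f^*\!\big(f'(y)\big)\\
          &= y\,f'(y) - \Big[\, f'(y)\,f'^{-1}\big(f'(y)\big) - f\big(f'^{-1}(f'(y))\big) \Big]\\
          &= y\,f'(y) - y\,f'(y) + f(y) \;=\; f(y).
\end{align*}
```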

  24. Caveat • There was a problem with this derivation: f'(f'^{-1}(x)) = x is only defined for x in the image of f', denoted ℑ(f') • Set f^*(x) = +∞ otherwise • In the setting where f = A, this will correspond to constraints on the realizable mean parameters

  25. Example: Bernoulli random variable
  • P(X = x) ∝ exp(θx) for x ∈ {0, 1}, θ ∈ \mathbb{R}
  • A(θ) = log(1 + exp(θ))
  • Let’s compute A^* using the formula we derived:
    A^*(\mu) = \begin{cases} \mu\, A'^{-1}(\mu) - A\big(A'^{-1}(\mu)\big) & \text{for } \mu \in \Im(A'), \\ +\infty & \text{otherwise.} \end{cases}
  • By the way, recall that E_θ φ(X) = A'(θ); here φ(x) = x, which explains the notation

  26. Example: Bernoulli random variable
  • A(θ) = log(1 + exp(θ))
  • A'(\theta) = \frac{\exp(\theta)}{1 + \exp(\theta)}
  • A'^{-1}(\mu) = \log\frac{\mu}{1 - \mu}, for µ ∈ ℑ(A') = (0, 1)

  27. Plug these into A^*(\mu) = \mu A'^{-1}(\mu) - A(A'^{-1}(\mu))
  • A(θ) = log(1 + exp(θ))
  • A'^{-1}(\mu) = \log\frac{\mu}{1 - \mu}, for µ ∈ ℑ(A') = (0, 1)
  • We get, for µ ∈ ℑ(A') = (0, 1):
    A^*(\mu) = \mu\, A'^{-1}(\mu) - A\big(A'^{-1}(\mu)\big)
             = \mu \log\frac{\mu}{1-\mu} - \log\Big( 1 + \exp\Big( \log\frac{\mu}{1-\mu} \Big) \Big)
             = \mu \log \mu + (1 - \mu) \log(1 - \mu)
  • Does that look familiar?
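As a quick numerical check of this closed form, the sketch below approximates A*(µ) = sup_θ (µθ − A(θ)) by a grid search over θ and compares it with µ log µ + (1 − µ) log(1 − µ); the grid range over θ is an arbitrary choice.

```python
import math

def A(theta):
    # Log-partition function of the Bernoulli family: A(theta) = log(1 + exp(theta)).
    return math.log(1.0 + math.exp(theta))

thetas = [t / 100.0 for t in range(-2000, 2001)]  # theta in [-20, 20], arbitrary grid

def A_star_numeric(mu):
    # Direct grid approximation of sup_theta ( mu * theta - A(theta) ).
    return max(mu * t - A(t) for t in thetas)

def A_star_closed_form(mu):
    return mu * math.log(mu) + (1.0 - mu) * math.log(1.0 - mu)

for mu in (0.1, 0.3, 0.5, 0.9):
    print(mu, round(A_star_numeric(mu), 6), round(A_star_closed_form(mu), 6))
```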

  28. General expression for A^*
  • This is the negative entropy!
  • This actually holds in general: let H_µ := −E_µ log p_µ(X) be the entropy of the r.v. characterized by the moment parameters µ; then
    A^*(\mu) = \begin{cases} -H_\mu & \text{if } \mu \in M, \\ +\infty & \text{otherwise.} \end{cases}
  • Here, M := ℑ(A') is the set of realizable mean parameters

  29. Negative entropy interpretation
  • General derivation: for µ ∈ M:
    A^*(\mu) = \langle \theta(\mu), \mu \rangle - A(\theta(\mu))
             = \langle \theta(\mu), E_{\theta(\mu)} \phi(X) \rangle - \log Z(\theta(\mu))
             = E_{\theta(\mu)} \log \exp\langle \theta(\mu), \phi(X) \rangle - E_{\theta(\mu)} \log Z(\theta(\mu))
             = E_{\theta(\mu)} \log \frac{\exp\langle \theta(\mu), \phi(X) \rangle}{Z(\theta(\mu))}
             = E_{\theta(\mu)} \log p_\mu(X)
             = -H_\mu

  30. Finally, a variational formulation
  • Given θ_0, the following optimization problem:
    \sup_{\mu \in M} \big( \langle \theta_0, \mu \rangle + H_\mu \big)
    – has optimal value A(θ_0)
  • Moreover, it is maximized by µ = E_{θ_0} φ(X):
    – the value µ_max achieving the sup satisfies θ_0 = A^{*\prime}(\mu_{\max})
    – using A^{*\prime}(x) = A'^{-1}(x), we get \mu_{\max} = A'(\theta_0)
    – hence \mu_{\max} = E_{\theta_0} \phi(X)

  31. Finally, a variational formulation
  • Given θ_0, the following optimization problem:
    \sup_{\mu \in M} \big( \langle \theta_0, \mu \rangle + H_\mu \big)
    – has optimal value A(θ_0)
    – is maximized by µ = E_{θ_0} φ(X)
  • How can we relax this optimization problem?
    – restrict M to a subset \tilde{M} \subset M on which H_µ is easy to compute
    – approximate H_µ
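To see the formulation in action on the Bernoulli example, the sketch below maximizes ⟨θ_0, µ⟩ + H_µ over µ ∈ (0, 1) by grid search and compares the optimum with A(θ_0) = log(1 + exp(θ_0)) and the maximizer with E_{θ_0}φ(X) = σ(θ_0); the value θ_0 = 1.3 and the grid resolution are arbitrary choices.

```python
import math

theta0 = 1.3  # arbitrary choice of natural parameter

def entropy(mu):
    # H_mu for a Bernoulli(mu) variable.
    return -(mu * math.log(mu) + (1.0 - mu) * math.log(1.0 - mu))

# Grid search over the mean parameter space M = (0, 1).
grid = [i / 100000.0 for i in range(1, 100000)]
best_mu = max(grid, key=lambda mu: theta0 * mu + entropy(mu))
best_value = theta0 * best_mu + entropy(best_mu)

print("variational value:", best_value)
print("A(theta0)        :", math.log(1.0 + math.exp(theta0)))
print("maximizer        :", best_mu)
print("sigmoid(theta0)  :", 1.0 / (1.0 + math.exp(-theta0)))
```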

  32. Examples, revisited

  33. Mean field: a relaxation on M
  • Mean field as a relaxation on M
  • Recall: removing edges in a graphical model ⇒ adding independence constraints ⇒ smaller set of realizable mean parameters
  • Formally: \tilde{M} := \{ \mu : \mu_{s,t} = \mu_s\, \mu_t \}
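To connect this back to Example 2: under the {0,1} convention, and assuming a fully factorized µ so that H_µ decomposes into a sum of Bernoulli entropies, the relaxed problem and its coordinate-ascent update can be written out as below. This is a sketch of how a sigmoid update of the neighbor sum arises, not a claim about the exact constants on the Example 2 slide.

```latex
% Mean-field relaxation for the {0,1} Ising model with a fully factorized mu:
% the entropy splits into node-wise Bernoulli entropies, so the relaxed
% problem sup_{mu in M-tilde} ( <theta, mu> + H_mu ) becomes
\begin{align*}
\sup_{\mu \in [0,1]^V} \;
  \sum_{(s,t) \in E} \theta_{s,t}\,\mu_s \mu_t
  \;-\; \sum_{u \in V} \Big[ \mu_u \log \mu_u + (1-\mu_u)\log(1-\mu_u) \Big].
\end{align*}
% Coordinate ascent: fix every coordinate except mu_u and set the derivative
% to zero,
\begin{align*}
\sum_{s \in N(u)} \theta_{s,u}\,\mu_s + \log\frac{1-\mu_u}{\mu_u} = 0
\quad\Longrightarrow\quad
\mu_u = \frac{1}{1 + \exp\!\big(-\sum_{s \in N(u)} \theta_{s,u}\,\mu_s\big)},
\end{align*}
% i.e. a sigmoid update of the neighbor sum, as in Example 2.
```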
