A-NICE-MC: Adversarial Training for MCMC
Jiaming Song, Shengjia Zhao, Stefano Ermon
Stanford University
March 7, 2018
Table of contents
1. Motivation
2. Notations and Problem Setup
3. Adversarial Training for Markov Chains
4. Adversarial Training for MCMC
5. Experiments
Motivation
Bayesian Inference
Parameters θ, observations D:
Input: prior p(θ) and likelihood p(D|θ)
Output: posterior p(θ|D) through Bayes' rule:

p(θ|D) = p(θ) p(D|θ) / p(D)

Problem: the marginal p(D) = ∫ p(θ) p(D|θ) dθ is intractable!
Solutions: Variational Inference and Markov chain Monte Carlo
Bayesian Inference
Variational Inference: approximate the posterior with some tractable model and minimize its distance to the posterior.
Examples: mean field approximation [2]
Advantages: optimization is efficient
Drawbacks: performance is limited by the choice of model
Bayesian Inference
Markov chain Monte Carlo: approximate the posterior with particles sampled from a Markov chain with the desired stationary distribution.
Method: proposal for the next particle + Metropolis-Hastings
Examples: Gibbs sampling [4], Hamiltonian Monte Carlo [8]
Advantages: reaches the true posterior asymptotically
Drawbacks: needs many samples to obtain good estimates
Deep Bayesian Learning
Variational Inference ← Deep Learning: ✓
• stochastic gradient descent as the optimization algorithm
• expressive function approximations to represent the model
Markov chain Monte Carlo ← Deep Learning: ✗
• cannot apply expressive function approximations directly
• proposals are hand-designed in general
• hard to evaluate / optimize metrics
Outline
Markov chain Monte Carlo + Deep Learning: ✓
We introduce A-NICE-MC, a new method for training flexible MCMC kernels:
• proposals are parameterized using (deep) neural networks
• adversarial methods are used to train a Markov chain that
  • matches the target stationary distribution quickly (burn-in)
  • achieves low autocorrelation between samples (mixing)
• learned proposals are much more efficient than traditional ones
Notations and Problem Setup
Notations
A sequence of continuous random variables {x_t}_{t=0}^∞ is drawn through the following Markov chain:

x_0 ∼ π_0
x_{t+1} ∼ T_θ(x_{t+1} | x_t)

• T_θ(·|x): a stochastic transition kernel parametrized by θ
• π_0: some initial distribution for x_0
• π_θ^t: the state distribution at time t
T_θ is defined through an implicit generative model f_θ(·|x, v), where v ∼ p(v) is an auxiliary random variable.
Problem Setup
Let p_d(x) be a target distribution over x ∈ R^n, e.g.:
• an (intractable) posterior distribution
• a data distribution (which we can sample from)
Our objective is to find a T_θ such that:
1. Low bias: the stationary distribution is close to the target distribution (minimize |π_θ − p_d|).
2. Efficiency: {π_θ^t}_{t=0}^∞ converges quickly (minimize t such that |π_θ^t − p_d| < δ).
3. Low variance: samples {x_t}_{t=0}^∞ from one chain should be as uncorrelated as possible (minimize the autocorrelation of {x_t}_{t=0}^∞).
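To make objective 3 concrete, here is a minimal numpy sketch of the lag-k sample autocorrelation of a scalar chain. This is our own illustration, not code from the paper:

```python
import numpy as np

def autocorrelation(xs, lag):
    """Lag-`lag` sample autocorrelation of a scalar chain {x_t}."""
    x = xs - xs.mean()
    return float((x[:-lag] * x[lag:]).sum() / (x * x).sum())

# Example: an AR(1) chain with coefficient 0.9 has lag-1 autocorrelation
# around 0.9, so estimates computed from it need far more samples than
# i.i.d. draws would.
rng = np.random.default_rng(0)
xs = np.zeros(1000)
for t in range(1, 1000):
    xs[t] = 0.9 * xs[t - 1] + rng.standard_normal()
print(autocorrelation(xs, lag=1))
```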
Settings
We consider two settings for specifying the target distribution:
Input: a target distribution p_d(x)
Output: a transition kernel T_θ(·|x)
• p_d(x) is a data distribution (samples, no analytic expression)
• p_d(x) is an analytic expression (up to a normalization constant, no samples)
Adversarial Training for Markov Chains
Parametrized Markov Chains
Assume we have direct access to samples from p_d(x), and that the transition kernel T_θ(x_{t+1}|x_t) is the following implicit generative model:

v ∼ p(v)
x_{t+1} = f_θ(x_t, v)    (1)

for which the stationary distribution π_θ(x) exists.
Goal: find θ such that π_θ(x) is close to p_d.
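A minimal numpy sketch of such a chain, assuming a toy affine-plus-tanh f_θ purely for illustration (in A-NICE-MC, f_θ is a neural network):

```python
import numpy as np

def f_theta(x, v, W, b):
    # Placeholder transition function: an affine map of (x, v) with a tanh
    # nonlinearity, standing in for the learned network.
    return np.tanh(W @ np.concatenate([x, v]) + b)

def sample_chain(x0, steps, W, b, rng):
    """Unroll x_{t+1} = f_theta(x_t, v) with auxiliary noise v ~ N(0, I)."""
    xs = [x0]
    for _ in range(steps):
        v = rng.standard_normal(x0.shape[0])
        xs.append(f_theta(xs[-1], v, W, b))
    return np.stack(xs)

rng = np.random.default_rng(0)
n = 2
W = 0.1 * rng.standard_normal((n, 2 * n))
chain = sample_chain(np.zeros(n), steps=50, W=W, b=np.zeros(n), rng=rng)
```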
Training Markov Chains
Likelihood-based approaches:
• the value of π_θ(x) is typically intractable to compute (integration over all possible paths)
• the marginal distribution π_θ^t(x) at time t is also intractable
Likelihood-free approaches:
• sampling is easy for Markov chains!
• likelihood-free methods only require samples
• Example: Generative Adversarial Networks [5]
Generative Adversarial Networks
Generator G(z): generates samples by transforming a noise variable z ∼ p(z) into G(z).
Discriminator D(x): trained to distinguish between samples from the generator and samples from p_d.
This describes the following objective [1]:

min_G max_D V(D, G) = E_{x∼p_d}[D(x)] − E_{z∼p(z)}[D(G(z))]    (2)
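A minimal torch sketch of this minimax objective; G and D are assumed to be torch modules, and regularization such as a Lipschitz constraint on D is omitted:

```python
import torch

def gan_losses(G, D, x_real, z):
    """Losses for one update of objective (2): D maximizes V(D, G),
    G minimizes it. x_fake is detached for the D update so D's step
    does not touch G's parameters."""
    x_fake = G(z)
    d_loss = -(D(x_real).mean() - D(x_fake.detach()).mean())
    g_loss = -D(x_fake).mean()  # maximizing E[D(G(z))] minimizes V
    return d_loss, g_loss
```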
Likelihood-free Training for Markov Chains
In our settings:
• p_d(x) is the empirical distribution of the samples ✓
• G_θ(z) is the stationary distribution? Approximate it with the state after t steps? ✗
It is hard to sample from the stationary distribution or to optimize through a long chain!
Conditions for the Stationary Distribution
We consider two necessary conditions for p_d to be a stationary distribution:
• p_d should be close to π_θ^b for some time step b
• p_d should be a fixed point of the transition operator
We can construct an objective that can be optimized efficiently through these two conditions.
Markov GAN
Markov GAN (MGAN) objective:

min_θ max_D E_{x∼p_d}[D(x)] − λ E_{x̄∼π_θ^b}[D(x̄)] − (1−λ) E_{x_d∼p_d, x̄∼T_θ^m(x̄|x_d)}[D(x̄)]    (3)

• λ ∈ (0, 1), b ∈ N^+, m ∈ N^+ are hyperparameters
• x̄ denotes "fake" samples from the generator
• T_θ^m(x̄|x_d) denotes the distribution of x̄ when the transition kernel is applied m times, starting from some "real" sample x_d
Markov GAN
The two generator terms in (3) play different roles: the π_θ^b term drives the chain to converge to p_d, and the T_θ^m term enforces a fixed point at p_d. Accordingly, we use two types of samples from the generator for training:
1. Samples after b transitions, starting from x_0 ∼ π_0.
2. Samples after m transitions, starting from x_d ∼ p_d.
A sketch of this objective in code follows.
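A torch sketch of the MGAN value, with both sample types unrolled explicitly; f_theta and the tensor shapes are assumptions for illustration:

```python
import torch

def mgan_value(D, f_theta, x_real, x0, b, m, lam):
    """Sketch of objective (3): D ascends this value, while the chain
    parameters theta (inside f_theta) descend it."""
    x_bar = x0                                  # x_0 ~ pi_0
    for _ in range(b):                          # b steps: x_bar ~ pi_theta^b
        x_bar = f_theta(x_bar, torch.randn_like(x_bar))  # v ~ p(v)
    x_fix = x_real                              # x_d ~ p_d
    for _ in range(m):                          # m steps: x_fix ~ T_theta^m(.|x_d)
        x_fix = f_theta(x_fix, torch.randn_like(x_fix))
    return (D(x_real).mean()
            - lam * D(x_bar).mean()
            - (1.0 - lam) * D(x_fix).mean())
```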
Justifications
Proposition. Consider a sequence of ergodic Markov chains over state space S. Define π_n as the stationary distribution for the n-th Markov chain, and π_n^t as the probability distribution at time step t for the n-th chain. If the following two conditions hold:
1. ∃ b > 0 such that the sequence {π_n^b}_{n=1}^∞ converges to p_d in total variation;
2. ∃ ϵ > 0, ρ < 1 such that ∃ M > 0, ∀ m > M: if ∥π_m^t − p_d∥_TV < ϵ, then ∥π_m^{t+1} − p_d∥_TV < ρ ∥π_m^t − p_d∥_TV;
then the sequence of stationary distributions {π_n}_{n=1}^∞ converges to p_d in total variation.
Sketch of Proof
Proof. The goal is to show that ∀ δ > 0, ∃ N > 0, T > 0 such that ∀ n > N, t > T: ∥π_n^t − p_d∥_TV < δ.
• ∃ N > 0 such that ∀ n > N, ∥π_n^b − p_d∥_TV < ϵ (Assumption 1).
• ∀ n > max(N, M), ∀ δ > 0, ∃ T = b + max(0, ⌈log_ρ δ − log_ρ ϵ⌉) + 1 such that ∀ t > T, ∥π_n^t − p_d∥_TV < δ (Assumption 2).
Hence the sequence {π_n}_{n=1}^∞ converges to p_d in total variation.
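To make the burn-in bound concrete, here is a worked instance with illustrative values (ϵ = 0.1, ρ = 0.5, δ = 0.01; these are not values from the paper):

```latex
% T = b + max(0, ceil(log_rho(delta) - log_rho(epsilon))) + 1
\[
T = b + \max\bigl(0,\, \lceil \log_{0.5} 0.01 - \log_{0.5} 0.1 \rceil\bigr) + 1
  = b + \lceil 6.64 - 3.32 \rceil + 1
  = b + 5.
\]
% Each step past burn-in contracts the TV distance by rho = 0.5, so four
% extra steps shrink it by 2^4 = 16 > 10, taking it from below 0.1 to
% below 0.01; the "+1" is slack from the statement above.
```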
Example: Generative Model for Images
We experiment with a distribution p_d over images, such as digits (MNIST) and faces (CelebA), where x_{t+1} = f_θ(x_t, v) is defined as

z = encoder_θ(x_t)
z′ = ReLU(z + β v)
x_{t+1} = decoder_θ(z′)    (5)

where β is a hyperparameter we set to 0.1.
Figure 1: Visualizing samples from π^1 to π^50 (each row) for a model trained on the MNIST dataset. Consecutive samples can be related in label (red box), inclination (green box) or width (blue box).
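A minimal torch sketch of this transition; the single linear encoder/decoder layers are placeholders for the deeper networks used in the actual experiments:

```python
import torch
import torch.nn as nn

class NoiseTransition(nn.Module):
    """x_{t+1} = decoder(ReLU(encoder(x_t) + beta * v)), as in (5)."""
    def __init__(self, x_dim=784, z_dim=64, beta=0.1):
        super().__init__()
        self.encoder = nn.Linear(x_dim, z_dim)
        self.decoder = nn.Sequential(nn.Linear(z_dim, x_dim), nn.Sigmoid())
        self.beta = beta

    def forward(self, x):
        z = self.encoder(x)
        v = torch.randn_like(z)                # auxiliary noise v ~ N(0, I)
        z_prime = torch.relu(z + self.beta * v)
        return self.decoder(z_prime)

x_next = NoiseTransition()(torch.rand(8, 784))  # one step for a batch of 8
```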
Transition Probabilities on MNIST
We use a classifier to label the generated images and evaluate the class transition probabilities T_θ(y_{t+1} | y_t).
Figure 2: The transition is not symmetric!
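One way to compute such a matrix, sketched in numpy; the classifier is assumed to have already produced a label sequence for consecutive chain samples:

```python
import numpy as np

def class_transition_matrix(labels, num_classes=10):
    """Estimate T(y_{t+1} | y_t) from predicted labels of consecutive
    samples along one chain. Rows index y_t, columns index y_{t+1}."""
    counts = np.zeros((num_classes, num_classes))
    for y_t, y_next in zip(labels[:-1], labels[1:]):
        counts[y_t, y_next] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    return counts / np.maximum(row_sums, 1)    # normalize each row
```

Asymmetry can then be checked directly, e.g. by comparing T[i, j] with T[j, i].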
Adversarial Training for MCMC
Analytical Target
Now consider the setting where the target distribution p_d is specified by an analytical expression:

p_d(x) ∝ exp(−U(x))    (6)

where
• U(x) is a known energy function
• the normalization constant for U(x) is not available
There are two additional challenges:
• We want the stationary distribution to be exactly p_d
• We do not have direct access to samples from p_d
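For instance, a toy energy for an unnormalized 1-D mixture of two unit Gaussians (purely illustrative; any computable energy function works here):

```python
import numpy as np

def U(x):
    """Energy of an unnormalized mixture of two unit Gaussians at +/-2."""
    return -np.log(np.exp(-0.5 * (x - 2.0) ** 2)
                   + np.exp(-0.5 * (x + 2.0) ** 2))
```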
Metropolis-Hastings
We use ideas from the Markov chain Monte Carlo (MCMC) literature to address the first challenge.
Detailed balance: p_d(x) T_θ(x′|x) = p_d(x′) T_θ(x|x′) for all x and x′.
Metropolis-Hastings:
• a sample x′ is first obtained from a proposal distribution g_θ(x′|x)
• x′ is accepted with probability

A_θ(x′|x) = min(1, exp(U(x) − U(x′)) g_θ(x|x′) / g_θ(x′|x))    (7)

Let T_θ(x′|x) = g_θ(x′|x) A_θ(x′|x); then the Markov chain has stationary distribution p_d [6].
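A minimal numpy sketch of one Metropolis-Hastings step with acceptance rule (7); `propose` and `log_g` are illustrative placeholders, and the usage below reuses the toy U from the previous sketch:

```python
import numpy as np

def mh_step(x, propose, log_g, U, rng):
    """One MH step: propose(x) draws x' ~ g_theta(.|x), and
    log_g(a, b) evaluates log g_theta(a | b)."""
    x_new = propose(x)
    log_alpha = (U(x) - U(x_new)) + log_g(x, x_new) - log_g(x_new, x)
    if np.log(rng.uniform()) < min(0.0, log_alpha):
        return x_new  # accept the proposal
    return x          # reject: the chain stays at the current state

# Symmetric Gaussian random-walk proposal: the log_g terms cancel, and
# the rule reduces to the classic Metropolis acceptance.
rng = np.random.default_rng(0)
x = 0.0
for _ in range(1000):
    x = mh_step(x, propose=lambda s: s + 0.5 * rng.standard_normal(),
                log_g=lambda a, b: 0.0, U=U, rng=rng)
```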