

  1. A-NICE-MC: Adversarial Training for MCMC
     Jiaming Song, Shengjia Zhao, Stefano Ermon
     Stanford University, March 7, 2018

  2. Table of contents
     1. Motivation
     2. Notations and Problem Setup
     3. Adversarial Training for Markov Chains
     4. Adversarial Training for MCMC
     5. Experiments

  3. Motivation

  4. Bayesian Inference
     Parameters θ, observations D: input a prior p(θ) and a likelihood p(D|θ);
     output the posterior p(θ|D) through Bayes' rule:
         p(θ|D) = p(θ) p(D|θ) / p(D)
     Problem: the marginal p(D) is intractable!
     Solutions: Variational Inference and Markov chain Monte Carlo.
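For intuition, here is a minimal sketch of Bayes' rule in a setting where the marginal is tractable: a one-dimensional parameter on a grid, so p(D) is just a sum. The coin-flip model and all names are illustrative assumptions, not from the slides.

```python
import numpy as np

# Minimal sketch: Bayes' rule for a coin's bias theta on a 1-D grid,
# where the marginal p(D) is tractable as a plain sum. Illustrative only.
theta = np.linspace(0.001, 0.999, 999)          # grid over the parameter
prior = np.full_like(theta, 1.0 / theta.size)   # uniform prior p(theta)

D = [1, 1, 0, 1]                                # observed flips (1 = heads)
likelihood = theta**sum(D) * (1 - theta)**(len(D) - sum(D))  # p(D|theta)

unnormalized = prior * likelihood               # p(theta) p(D|theta)
evidence = unnormalized.sum()                   # p(D): the hard part in general
posterior = unnormalized / evidence             # p(theta|D) via Bayes' rule
```

In high-dimensional models this sum becomes an intractable integral, which is what motivates the two families of solutions above.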

  5. Bayesian Inference
     Variational Inference: approximate the posterior with some tractable
     model and minimize its distance to the posterior.
     Examples: mean field approximation [2]
     Advantages: optimization is efficient
     Drawbacks: performance is limited by the choice of model

  6. Bayesian Inference
     Markov chain Monte Carlo: approximate the posterior with particles
     sampled from a Markov chain with the desired stationary distribution.
     Method: a proposal for the next particle + Metropolis-Hastings
     Examples: Gibbs sampling [4], Hamiltonian Monte Carlo [8]
     Advantages: reaches the true posterior asymptotically
     Drawbacks: needs many samples to obtain good estimates

  7. Deep Bayesian Learning
     Variational Inference <- Deep Learning: ✓
     • stochastic gradient descent as the optimization algorithm
     • expressive function approximations to represent the model
     Markov chain Monte Carlo <- Deep Learning: ✗
     • cannot apply expressive function approximations directly
     • proposals are generally hand-designed
     • metrics are hard to evaluate / optimize

  8. Outline
     Markov chain Monte Carlo + Deep Learning: ✓
     We introduce A-NICE-MC, a new method for training flexible MCMC kernels:
     • proposals are parameterized using (deep) neural networks
     • adversarial methods are used to train a Markov chain that
       • matches a target stationary distribution quickly (burn-in)
       • achieves low autocorrelation between samples (mixing)
     • learned proposals are much more efficient than traditional ones

  9. Notations and Problem Setup

  10. Notations
      A sequence of continuous random variables {x_t}_{t=0}^∞ is drawn
      through the following Markov chain:
          x_0 ∼ π^0
          x_{t+1} ∼ T_θ(x_{t+1} | x_t)
      • T_θ(·|x): a stochastic transition kernel parametrized by θ
      • π^0: some initial distribution for x_0
      • π_θ^t: the state distribution at time t
      T_θ is defined through an implicit generative model f_θ(·|x, v),
      where v ∼ p(v) is an auxiliary random variable.
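As a concrete reading of this notation, the sketch below simulates such a chain with an arbitrary hand-picked f_θ; the specific map and noise distribution are illustrative assumptions, not the learned kernel from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def f_theta(x, v, theta=0.9):
    # An arbitrary illustrative implicit kernel: T_theta is never written
    # down; it is defined by pushing noise v through this deterministic map.
    return theta * x + v

x = rng.normal(size=2)            # x_0 ~ pi^0 (standard normal here)
samples = [x]
for t in range(1000):
    v = rng.normal(size=2)        # auxiliary noise v ~ p(v)
    x = f_theta(x, v)             # x_{t+1} = f_theta(x_t, v)
    samples.append(x)
```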

  11. Problem Setup
      Let p_d(x) be a target distribution over x ∈ R^n, e.g.:
      • an (intractable) posterior distribution
      • a data distribution (which we can sample from)
      Our objective is to find a T_θ such that:
      1. Low bias: the stationary distribution is close to the target
         distribution (minimize |π_θ − p_d|).
      2. Efficiency: {π_θ^t}_{t=0}^∞ converges quickly (minimize t such that
         |π_θ^t − p_d| < δ).
      3. Low variance: samples from one chain {x_t}_{t=0}^∞ should be as
         uncorrelated as possible (minimize the autocorrelation of {x_t};
         a sketch of estimating this follows below).
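Criterion 3 can be estimated empirically from a simulated chain. A minimal sketch, assuming a scalar chain xs (e.g. one coordinate of the samples above):

```python
import numpy as np

def autocorrelation(xs, max_lag=50):
    # Empirical lag-k autocorrelation of a scalar chain; values near zero
    # at small lags indicate good mixing (criterion 3).
    xs = np.asarray(xs, dtype=float) - np.mean(xs)
    var = xs.var()
    return np.array([np.mean(xs[: len(xs) - k] * xs[k:]) / var
                     for k in range(1, max_lag + 1)])
```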

  12. Settings
      We consider two settings for specifying the target distribution.
      Input: a target distribution p_d(x)
      Output: a transition kernel T_θ(·|x)
      • p_d(x) is a data distribution (samples, no analytic expression)
      • p_d(x) is an analytic expression (up to a normalization constant,
        no samples)

  13. Adversarial Training for Markov Chains

  14. Parametrized Markov Chains
      Assume we have direct access to samples from p_d(x), and that the
      transition kernel T_θ(x_{t+1} | x_t) is the following implicit
      generative model:
          v ∼ p(v)
          x_{t+1} = f_θ(x_t, v)                                          (1)
      for which the stationary distribution π_θ(x) exists.
      Goal: find θ such that π_θ(x) is close to p_d.

  15. Training Markov Chains
      Likelihood-based approaches:
      • the value of π_θ(x) is typically intractable to compute
      • the marginal distribution π_θ^t(x) at time t is also intractable
        (integration over all possible paths)
      Likelihood-free approaches:
      • sampling is easy for Markov chains!
      • likelihood-free methods only require samples
      • Example: Generative Adversarial Networks [5]

  16. Generative Adversarial Networks
      Generator G(z): generates samples by transforming a noise variable
      z ∼ p(z) into G(z).
      Discriminator D(x): trained to distinguish between samples from the
      generator and samples from p_d.
      This describes the following objective [1]:
          min_G max_D V(D, G) = E_{x∼p_d}[D(x)] − E_{z∼p(z)}[D(G(z))]    (2)
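A minimal sketch of this objective in code, written in the Wasserstein style that the slide's V(D, G) suggests; the tiny networks, batch sizes, and the omission of a Lipschitz constraint on D are simplifying assumptions:

```python
import torch

G = torch.nn.Sequential(torch.nn.Linear(8, 64), torch.nn.ReLU(),
                        torch.nn.Linear(64, 2))   # generator G(z)
D = torch.nn.Sequential(torch.nn.Linear(2, 64), torch.nn.ReLU(),
                        torch.nn.Linear(64, 1))   # discriminator D(x)

x_real = torch.randn(128, 2)     # stand-in batch for samples from p_d
z = torch.randn(128, 8)          # noise z ~ p(z)
x_fake = G(z)

# D ascends V = E_pd[D(x)] - E_pz[D(G(z))]; G descends it. (A Lipschitz
# constraint on D, e.g. weight clipping, is omitted here for brevity.)
loss_D = -(D(x_real).mean() - D(x_fake).mean())
loss_G = -D(x_fake).mean()
```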

  17. Likelihood-free Training for Markov Chains
      In our setting:
      • p_d(x) is the empirical distribution of the samples ✓
      • is G_θ(z) the stationary distribution? Do we approximate it with
        the state after t steps? ✗
      It is hard to sample from the stationary distribution or to optimize
      through a long chain!

  18. Conditions for Stationary Distribution
      We consider two necessary conditions for p_d to be a stationary
      distribution:
      • p_d should be close to π_θ^b for some time step b
      • p_d is a fixed point of the transition operator
      We can construct an objective that can be optimized efficiently
      through these two conditions.

  19. Markov GAN
      Markov GAN (MGAN) objective:
          min_θ max_D  E_{x∼p_d}[D(x)] − λ E_{x̄∼π_θ^b}[D(x̄)]
                       − (1 − λ) E_{x_d∼p_d, x̄∼T_θ^m(x̄|x_d)}[D(x̄)]     (3)
      where
      • λ ∈ (0, 1), b ∈ N+, m ∈ N+ are hyperparameters
      • x̄ denotes "fake" samples from the generator
      • T_θ^m(x̄ | x_d) denotes the distribution of x̄ when the transition
        kernel is applied m times, starting from some "real" sample x_d

  20. Markov GAN
      Markov GAN (MGAN) objective:
          min_θ max_D  E_{x∼p_d}[D(x)] − λ E_{x̄∼π_θ^b}[D(x̄)]
                       − (1 − λ) E_{x_d∼p_d, x̄∼T_θ^m(x̄|x_d)}[D(x̄)]     (4)
      We use two types of samples from the generator for training:
      1. Samples after b transitions, starting from x_0 ∼ π^0
         (the second term: converge to p_d).
      2. Samples after m transitions, starting from x_d ∼ p_d
         (the third term: fixed point at p_d).
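Putting the two sample types together, here is a hedged sketch of computing the MGAN losses for one batch; f_theta and D are the networks from the previous sketches, and the hyperparameter values are arbitrary:

```python
import torch

def mgan_losses(f_theta, D, x_real, x0, b=4, m=3, lam=0.5, noise_dim=8):
    xb = x0                                 # x_0 ~ pi^0
    for _ in range(b):                      # b transitions: xb ~ pi_theta^b
        xb = f_theta(xb, torch.randn(xb.shape[0], noise_dim))
    xm = x_real                             # x_d ~ p_d
    for _ in range(m):                      # m transitions: xm ~ T_theta^m(.|x_d)
        xm = f_theta(xm, torch.randn(xm.shape[0], noise_dim))
    v = (D(x_real).mean()
         - lam * D(xb).mean()
         - (1 - lam) * D(xm).mean())        # the value in Eq. (3)/(4)
    loss_D = -v                             # the discriminator maximizes v
    loss_G = -(lam * D(xb).mean()
               + (1 - lam) * D(xm).mean())  # theta minimizes v via the fake terms
    return loss_D, loss_G
```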

  21. Justifications
      Proposition. Consider a sequence of ergodic Markov chains over state
      space S. Define π_n as the stationary distribution of the n-th Markov
      chain, and π_n^t as the probability distribution at time step t for
      the n-th chain. If the following two conditions hold:
      1. ∃ b > 0 such that the sequence {π_n^b}_{n=1}^∞ converges to p_d in
         total variation;
      2. ∃ ϵ > 0, ρ < 1 such that ∃ M > 0, ∀ m > M: if ∥π_m^t − p_d∥_TV < ϵ,
         then ∥π_m^{t+1} − p_d∥_TV < ρ ∥π_m^t − p_d∥_TV;
      then the sequence of stationary distributions {π_n}_{n=1}^∞ converges
      to p_d in total variation.

  22. Sketch of Proof
      Proof. The goal is to show that ∀ δ > 0, ∃ N > 0, T > 0 such that
      ∀ n > N, t > T: ∥π_n^t − p_d∥_TV < δ.
      • ∃ N > 0 such that ∀ n > N, ∥π_n^b − p_d∥_TV < ϵ (Assumption 1).
      • ∀ n > max(N, M), ∀ δ > 0, ∃ T = b + max(0, ⌈log_ρ δ − log_ρ ϵ⌉) + 1
        such that ∀ t > T, ∥π_n^t − p_d∥_TV < δ (Assumption 2).
      Hence the sequence {π_n}_{n=1}^∞ converges to p_d in total variation.

  23. Example: Generative Model for Images
      We experiment with a distribution p_d over images, such as digits
      (MNIST) and faces (CelebA), where x_{t+1} = f_θ(x_t, v) is defined as
          z = encoder_θ(x_t)
          z′ = ReLU(z + β v)                                             (5)
          x_{t+1} = decoder_θ(z′)
      where β is a hyperparameter we set to 0.1.
      Figure 1: Visualizing samples of π^1 to π^50 (each row) from a model
      trained on the MNIST dataset. Consecutive samples can be related in
      label (red box), inclination (green box), or width (blue box).
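A sketch of this transition in code; the encoder/decoder architectures below are illustrative placeholders (flattened 28×28 MNIST images, a 32-dimensional latent), not the trained networks from the paper, with β = 0.1 as on the slide:

```python
import torch

latent_dim, beta = 32, 0.1
encoder = torch.nn.Sequential(torch.nn.Flatten(),
                              torch.nn.Linear(784, latent_dim))
decoder = torch.nn.Sequential(torch.nn.Linear(latent_dim, 784),
                              torch.nn.Sigmoid())

def transition(x_t):
    z = encoder(x_t)                     # z = encoder_theta(x_t)
    v = torch.randn_like(z)              # auxiliary noise v
    z_next = torch.relu(z + beta * v)    # z' = ReLU(z + beta * v)
    return decoder(z_next)               # x_{t+1} = decoder_theta(z'), flat
```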

  24. Transition Probabilities on MNIST
      We use a classifier to classify the generated images and evaluate the
      class transition probabilities T_θ(y_{t+1} | y_t).
      Figure 2: The transition is not symmetric!
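One plausible way to compute such a table, assuming a pretrained classifier `classify` that maps an image to a digit label (this interface is an assumption; the slides do not specify it):

```python
import numpy as np

def transition_matrix(chain_images, classify, n_classes=10):
    # Classify consecutive chain samples and count label transitions to
    # estimate the empirical T(y_{t+1} | y_t).
    labels = [classify(x) for x in chain_images]
    counts = np.zeros((n_classes, n_classes))
    for y_t, y_next in zip(labels[:-1], labels[1:]):
        counts[y_t, y_next] += 1
    row_sums = np.maximum(counts.sum(axis=1, keepdims=True), 1)
    return counts / row_sums             # rows: y_t, columns: y_{t+1}
```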

  25. Adversarial Training for MCMC

  26. Analytical Target
      Now consider the setting where the target distribution p_d is
      specified by an analytical expression:
          p_d(x) ∝ exp(−U(x))                                            (6)
      where
      • U(x) is a known energy function
      • the normalization constant for U(x) is not available
      There are two additional challenges:
      • we want the stationary distribution to be exactly p_d
      • we do not have direct access to samples from p_d

  27. Metropolis-Hastings
      We use ideas from the Markov chain Monte Carlo (MCMC) literature to
      address the first challenge.
      Detailed balance: p_d(x) T_θ(x′|x) = p_d(x′) T_θ(x|x′) for all x, x′.
      Metropolis-Hastings:
      • a sample x′ is first obtained from a proposal distribution g_θ(x′|x)
      • x′ is accepted with the following probability:
          A_θ(x′|x) = min(1, exp(U(x) − U(x′)) g_θ(x|x′) / g_θ(x′|x))    (7)
      Let T_θ(x′|x) = g_θ(x′|x) A_θ(x′|x); then the Markov chain has
      stationary distribution p_d [6].
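A minimal sketch of one Metropolis-Hastings step. For simplicity it uses a symmetric Gaussian proposal, for which the g_θ ratio in (7) cancels; the learned NICE proposal of A-NICE-MC would replace it. U here is an illustrative standard-Gaussian energy, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

def U(x):
    return 0.5 * float(np.sum(x**2))     # energy of a standard Gaussian

def mh_step(x, step=0.5):
    x_prop = x + step * rng.normal(size=x.shape)   # symmetric proposal
    # A(x'|x) = min(1, exp(U(x) - U(x')) * g(x|x')/g(x'|x)); the g-ratio
    # is 1 for this symmetric proposal.
    if rng.random() < min(1.0, np.exp(U(x) - U(x_prop))):
        return x_prop                    # accept x'
    return x                             # reject: stay at x
```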
