Neural Variational Inference and Learning
Andriy Mnih, Karol Gregor
22 June 2014
Introduction
◮ Training directed latent variable models is difficult because inference in them is intractable.
◮ Both MCMC and traditional variational methods involve iterative procedures for each datapoint.
◮ A promising new way to train directed latent variable models:
  ◮ Use a feedforward approximation to inference to implement efficient sampling from the variational posterior.
◮ We propose a general version of this approach that
  1. Can handle both discrete and continuous latent variables.
  2. Does not require any model-specific derivations beyond computing gradients w.r.t. parameters.
High-level overview
◮ A general approach to variational inference based on three ideas:
  1. Approximating the posterior using highly expressive feed-forward inference networks (e.g. neural nets).
     ◮ These have to be efficient to evaluate and sample from.
  2. Using gradient-based updates to improve the variational bound.
  3. Computing the gradients using samples from the inference net.
◮ Key: The inference net implements efficient sampling from the approximate posterior.
Variational inference (I)
◮ Given a directed latent variable model that naturally factorizes as
    P_θ(x, h) = P_θ(x|h) P_θ(h),
◮ We can lower-bound the contribution of x to the log-likelihood as follows:
    log P_θ(x) ≥ E_Q[ log P_θ(x, h) − log Q_φ(h|x) ] = L_{θ,φ}(x),
  where Q_φ(h|x) is an arbitrary distribution.
◮ In the context of variational inference, Q_φ(h|x) is called the variational posterior.
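For completeness, the bound follows from a single application of Jensen's inequality; a minimal derivation in the notation above, written for discrete h (sums become integrals in the continuous case):

```latex
\log P_\theta(x)
  = \log \sum_h Q_\phi(h \mid x)\,\frac{P_\theta(x, h)}{Q_\phi(h \mid x)}
  \;\ge\; \sum_h Q_\phi(h \mid x)\,\log \frac{P_\theta(x, h)}{Q_\phi(h \mid x)}
  = \mathbb{E}_Q\!\left[\log P_\theta(x, h) - \log Q_\phi(h \mid x)\right]
  = \mathcal{L}_{\theta,\phi}(x).
```

Equality holds exactly when Q_φ(h|x) = P_θ(h|x), which is why a more expressive family of variational posteriors can only tighten the achievable bound.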
Variational inference (II)
◮ Variational learning involves alternating between maximizing the lower bound L_{θ,φ}(x) w.r.t. the variational distribution Q_φ(h|x) and the model parameters θ.
◮ Typically variational inference requires:
  ◮ Variational distributions Q with simple factored form and no parameter sharing between distributions for different x.
  ◮ Simple models P_θ(x, h) yielding tractable expectations.
  ◮ Iterative optimization to compute Q for each x.
◮ We would like to avoid iterative inference, while allowing expressive, potentially multimodal, posteriors, and highly expressive models.
Neural variational inference and learning (NVIL)
◮ We achieve these goals by using a feed-forward model for Q_φ(h|x), making the dependence of the approximate posterior on the input x parametric.
◮ This allows us to sample from Q_φ(h|x) very efficiently.
◮ We will refer to Q as the inference network because it implements approximate inference for the model being trained.
◮ We train the model by (locally) maximizing the variational bound L_{θ,φ}(x) w.r.t. θ and φ.
◮ We compute all the required expectations using samples from Q.
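To make the idea concrete, here is a minimal sketch (not the authors' code) of such an inference network for binary latent variables, written in PyTorch; the layer sizes, class name, and the choice of a factorial Bernoulli posterior are illustrative assumptions.

```python
import torch
import torch.nn as nn

class InferenceNet(nn.Module):
    """Feed-forward approximation Q_phi(h|x) for binary latent variables.

    A single forward pass maps an observation x to the parameters of a
    factorial Bernoulli distribution over h, so sampling h ~ Q_phi(h|x)
    costs one network evaluation -- no per-datapoint iterative inference.
    """

    def __init__(self, x_dim=784, h_dim=200):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim, 200),
            nn.Tanh(),
            nn.Linear(200, h_dim),   # Bernoulli logits, one per latent unit
        )

    def forward(self, x):
        return torch.distributions.Bernoulli(logits=self.net(x))

# Usage: one forward pass gives both a sample and its log-probability.
q = InferenceNet()
x = torch.rand(64, 784).bernoulli()          # dummy batch of binary "images"
posterior = q(x)
h = posterior.sample()                       # h ~ Q_phi(h|x), shape (64, 200)
log_q = posterior.log_prob(h).sum(dim=1)     # log Q_phi(h|x) per example
```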
Gradients of the variational bound
◮ The gradients of the bound w.r.t. the model and inference net parameters are:
    ∂/∂θ L_{θ,φ}(x) = E_Q[ ∂/∂θ log P_θ(x, h) ],
    ∂/∂φ L_{θ,φ}(x) = E_Q[ (log P_θ(x, h) − log Q_φ(h|x)) ∂/∂φ log Q_φ(h|x) ].
◮ Note that the learning signal for the inference net is l_φ(x, h) = log P_θ(x, h) − log Q_φ(h|x).
◮ This signal is effectively the same as log P_θ(h|x) − log Q_φ(h|x) (up to a constant w.r.t. h), but is tractable to compute.
◮ The price to pay for tractability is the high variance of the resulting estimates.
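The "up to a constant" remark follows directly from Bayes' rule:

```latex
\log P_\theta(x, h) - \log Q_\phi(h \mid x)
  = \log P_\theta(h \mid x) - \log Q_\phi(h \mid x) + \log P_\theta(x),
```

and log P_θ(x) does not depend on h, so it only shifts the learning signal by a constant without changing which values of h are favoured; computing that constant exactly would, however, require the intractable marginal likelihood.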
Parameter updates
◮ Given an observation x, we can estimate the gradients using Monte Carlo:
  1. Sample h ∼ Q_φ(h|x).
  2. Compute
       ∂/∂θ L_{θ,φ}(x) ≈ ∂/∂θ log P_θ(x, h),
       ∂/∂φ L_{θ,φ}(x) ≈ (log P_θ(x, h) − log Q_φ(h|x)) ∂/∂φ log Q_φ(h|x).
◮ Problem: The resulting estimator of the inference network gradient is too high-variance to be useful in practice.
◮ It can be made practical, however, using several simple model-independent variance reduction techniques.
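A minimal single-sample update step might look like the sketch below (PyTorch); `model.log_joint` and `inference_net` are placeholder interfaces assumed for the example, not an API from the talk.

```python
import torch

def nvil_step(x, model, inference_net, opt_theta, opt_phi):
    """One Monte Carlo update of the model parameters (theta) and the
    inference net parameters (phi).

    Assumed (illustrative) interfaces:
      inference_net(x)      -> a torch Distribution over h, i.e. Q_phi(h|x)
      model.log_joint(x, h) -> log P_theta(x, h), one value per example
    """
    # 1. Sample h ~ Q_phi(h|x).  The sample is treated as fixed (no
    #    reparameterisation), which is what permits discrete latents.
    q = inference_net(x)
    h = q.sample()

    log_joint = model.log_joint(x, h)        # log P_theta(x, h)
    log_q = q.log_prob(h).sum(dim=-1)        # log Q_phi(h|x)
    signal = (log_joint - log_q).detach()    # learning signal l_phi(x, h)

    # 2. Surrogate objective whose gradients match the two estimators above:
    #    d/dtheta: d/dtheta log P_theta(x, h)
    #    d/dphi:   l_phi(x, h) * d/dphi log Q_phi(h|x)
    surrogate = -(log_joint + signal * log_q).mean()

    opt_theta.zero_grad(); opt_phi.zero_grad()
    surrogate.backward()
    opt_theta.step(); opt_phi.step()
    return signal.mean().item()              # bound estimate, for monitoring
```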
Reducing variance (I)
◮ Key observation: if h is sampled from Q_φ(h|x),
    (log P_θ(x, h) − log Q_φ(h|x) − b) ∂/∂φ log Q_φ(h|x)
  is an unbiased estimator of ∂/∂φ L_{θ,φ}(x) for any b independent of h.
◮ However, the variance of the estimator does depend on b, which allows us to obtain lower-variance estimators by choosing b carefully.
◮ Our strategy is to choose b so that the resulting learning signal log P_θ(x, h) − log Q_φ(h|x) − b is close to zero.
◮ Borrowing terminology from reinforcement learning, we call b a baseline.
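The baseline leaves the estimator unbiased because the score function has zero expectation (shown here for discrete h; integrals replace sums in the continuous case):

```latex
\mathbb{E}_Q\!\left[\, b \,\frac{\partial}{\partial\phi} \log Q_\phi(h \mid x) \right]
  = b \sum_h Q_\phi(h \mid x)\,\frac{\partial}{\partial\phi} \log Q_\phi(h \mid x)
  = b \,\frac{\partial}{\partial\phi} \sum_h Q_\phi(h \mid x)
  = b \,\frac{\partial}{\partial\phi} 1
  = 0.
```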
Reducing variance (II)
Techniques for reducing estimator variance:
1. Constant baseline: b = a running estimate of the mean of l_φ(x, h) = log P_θ(x, h) − log Q_φ(h|x).
   ◮ Makes the learning signal zero-mean.
   ◮ Enough to obtain reasonable models on MNIST.
2. Input-dependent baseline: b_ψ(x).
   ◮ Can be seen as capturing log P_θ(x).
   ◮ An MLP with a single real-valued output.
   ◮ Makes learning considerably faster and leads to better results.
3. Variance normalization: scale the learning signal to unit variance.
   ◮ Can be seen as simple global learning rate adaptation.
   ◮ Makes learning faster and more robust.
4. Local learning signals:
   ◮ Take advantage of the Markov properties of the models.
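A rough sketch of how the first three techniques could be wired together (PyTorch); the decay constant, layer sizes, and the clamp on the standard deviation are illustrative assumptions rather than values quoted in the talk.

```python
import torch
import torch.nn as nn

class BaselineMLP(nn.Module):
    """Input-dependent baseline b_psi(x): an MLP with a single real-valued output."""
    def __init__(self, x_dim=784, hidden=100):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(x_dim, hidden), nn.Tanh(), nn.Linear(hidden, 1))

    def forward(self, x):
        return self.net(x).squeeze(-1)

class SignalNormalizer:
    """Running-mean (constant) baseline plus variance normalization.

    Keeps exponential moving averages of the learning signal's mean and
    variance, then returns a centred, roughly unit-variance signal.
    """
    def __init__(self, decay=0.8):
        self.decay = decay
        self.mean = 0.0
        self.var = 1.0

    def __call__(self, signal):
        # signal: per-example learning signals, already reduced by b_psi(x)
        self.mean = self.decay * self.mean + (1 - self.decay) * signal.mean().item()
        self.var = self.decay * self.var + (1 - self.decay) * signal.var().item()
        centred = signal - self.mean                  # constant (running-mean) baseline
        return centred / max(self.var ** 0.5, 1.0)    # variance normalization

# How the pieces would combine inside an update (schematically):
#   signal = (log_joint - log_q - baseline_mlp(x)).detach()
#   signal = normalizer(signal)
#   the baseline MLP itself can be trained to regress the raw signal,
#   e.g. by minimizing a squared-error loss.
```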
Effects of variance reduction
Sigmoid belief network with two hidden layers of 200 units (SBN 200-200) on MNIST.
[Figure: validation set bound vs. number of parameter updates, comparing five configurations: Baseline, IDB, & VN; Baseline & VN; Baseline only; VN only; No baselines & no VN.]
Document modelling results
◮ Task: model the joint distribution of word counts in bags of words describing documents.
◮ Models: SBN and fDARN models with one hidden layer.
◮ Datasets:
  ◮ 20 Newsgroups: 11K documents, 2K vocabulary
  ◮ Reuters RCV1: 800K documents, 10K vocabulary
◮ Performance metric: perplexity

  Model        Dim    20 News    Reuters
  SBN           50       909        784
  fDARN         50       917        724
  fDARN        200        —         598
  LDA           50      1091       1437
  LDA          200      1058       1142
  RepSoftmax    50       953        988
  DocNADE       50       896        742
Conclusions
◮ NVIL is a simple and general training method for directed latent variable models.
◮ Can handle both continuous and discrete latent variables.
◮ Easy to apply, requiring no model-specific derivations beyond gradient computation.
◮ Promising document modelling results with DARN and SBN models.
Thank you!