
  1. Variational Autoencoders

  2. Recap: Story so far
  • A classification MLP actually comprises two components
    • A "feature extraction network" that converts the inputs into linearly separable features
      • Or nearly linearly separable features
    • A final linear classifier that operates on the linearly separable features
  • Neural networks can be used to perform linear or non-linear PCA
    • "Autoencoders"
    • Can also be used to compose constructive dictionaries for data
    • Which, in turn, can be used to model data distributions

  3. Recap: The penultimate layer
  [Figure: network mapping inputs (x1, x2) through penultimate-layer features (z1, z2) to outputs (y1, y2)]
  • The network up to the output layer may be viewed as a transformation that transforms data from non-linear classes to linearly separable features
  • We can now attach any linear classifier above it for perfect classification
    • It need not be a perceptron
    • In fact, slapping an SVM on top of the features may generalize better!

  4. Recap: The behavior of the layers

  5. Recap: Auto-encoders and PCA
  • Training: learn the weight $w$ by minimizing the L2 divergence
  $\hat{x} = w w^T x$
  $\mathrm{div}(\hat{x}, x) = \|x - \hat{x}\|^2 = \|x - w w^T x\|^2$
  $\hat{w} = \underset{w}{\operatorname{argmin}}\; E\left[\mathrm{div}(\hat{x}, x)\right] = \underset{w}{\operatorname{argmin}}\; E\left[\|x - w w^T x\|^2\right]$
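
As an aside (not from the slides), a minimal NumPy sketch of this recap: a single-unit linear autoencoder with tied weights, trained by gradient descent on the L2 objective above, converges to the first principal direction of the data. The synthetic data, learning rate, and iteration count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2)) @ np.diag([3.0, 0.5])   # one dominant direction of variance
X -= X.mean(axis=0)                                    # zero-mean data

w = 0.1 * rng.normal(size=2)        # tied encoder/decoder weight vector
lr = 0.005
for _ in range(3000):
    a = X @ w                       # hidden code for each sample: w^T x
    R = X - np.outer(a, w)          # reconstruction error x - w w^T x
    grad = -2 * (X.T @ (R @ w) + R.T @ a) / len(X)   # gradient of mean ||x - w w^T x||^2
    w -= lr * grad

u = np.linalg.svd(X, full_matrices=False)[2][0]      # first principal direction via SVD
print("cosine similarity to PCA direction:", abs(w @ u) / np.linalg.norm(w))
```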

  6. Recap: Auto-encoders and PCA
  • The autoencoder finds the direction of maximum energy
    • Variance, if the input is a zero-mean RV
  • All input vectors are mapped onto a point on the principal axis

  7. Recap: Auto-encoders and PCA
  • Varying the hidden layer value only generates data along the learned manifold
    • Which may be poorly learned
  • Any input will result in an output along the learned manifold

  8. Recap: Learning a data manifold
  [Figure: a decoder trained on saxophone data, acting as a "sax dictionary"]
  • The decoder represents a source-specific generative dictionary
  • Exciting it will produce typical data from the source!

  9. Overview
  • Just as autoencoders can be viewed as performing a non-linear PCA, variational autoencoders can be viewed as performing a non-linear Factor Analysis (FA)
  • Variational autoencoders (VAEs) get their name from variational inference, a technique that can be used for parameter estimation
  • We will introduce Factor Analysis, variational inference and expectation maximization, and finally VAEs

  10. Why Generative Models? Training data
  • Unsupervised/semi-supervised learning: much more training data is available when labels are not required
    • E.g., all of the videos on YouTube

  11. Why generative models? Many right answers
  • Caption -> Image, e.g., "A man in an orange jacket with sunglasses and a hat skis down a hill" (https://openreview.net/pdf?id=Hyvw0L9el)
  • Outline -> Image (https://arxiv.org/abs/1611.07004)

  12. Why generative models? Intrinsic to task
  • Example: super-resolution (https://arxiv.org/abs/1609.04802)

  13. Why generative models? Insight
  • What kind of structure can we find in complex observations such as MEG recordings of brain activity or gene-expression networks? (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-12-327)
  • Is there a low-dimensional manifold underlying these complex observations?
  • What can we learn about the brain, cellular function, etc. if we know more about these manifolds?

  14. Factor Analysis
  • Generative model: assumes that data are generated from real-valued latent variables
  (Figure: Bishop, Pattern Recognition and Machine Learning)

  15. Factor Analysis model
  • Factor analysis assumes a generative model in which the $j$-th observation $\mathbf{x}_j \in \mathbb{R}^D$ is conditioned on a vector of real-valued latent variables $\mathbf{z}_j \in \mathbb{R}^L$
  • Here we assume the prior distribution is Gaussian:
  $p(\mathbf{z}_j) = \mathcal{N}(\mathbf{z}_j \mid \boldsymbol{\mu}_0, \boldsymbol{\Sigma}_0)$
  • We also use a Gaussian for the data likelihood:
  $p(\mathbf{x}_j \mid \mathbf{z}_j, \mathbf{W}, \boldsymbol{\mu}, \boldsymbol{\Psi}) = \mathcal{N}(\mathbf{W}\mathbf{z}_j + \boldsymbol{\mu}, \boldsymbol{\Psi})$
  where $\mathbf{W} \in \mathbb{R}^{D \times L}$, $\boldsymbol{\Psi} \in \mathbb{R}^{D \times D}$, and $\boldsymbol{\Psi}$ is diagonal
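
To make the generative model concrete, here is a small sketch of ancestral sampling from it. The dimensions and parameter values are made up for illustration, and the prior is taken as $\mathcal{N}(\mathbf{0}, \mathbf{I})$, which the next slide shows loses no generality.

```python
import numpy as np

rng = np.random.default_rng(0)
D, L, N = 5, 2, 10000            # observed dim, latent dim, number of samples (assumed)

W = rng.normal(size=(D, L))      # factor loading matrix W
mu = rng.normal(size=D)          # mean offset mu
psi = rng.uniform(0.1, 0.5, D)   # diagonal of Psi (per-feature noise variance)

z = rng.normal(size=(N, L))                      # z_j ~ N(0, I)
eps = rng.normal(size=(N, D)) * np.sqrt(psi)     # eps_j ~ N(0, Psi), Psi diagonal
x = z @ W.T + mu + eps                           # x_j | z_j ~ N(W z_j + mu, Psi)
print(x.shape)                                   # (10000, 5)
```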

  16. Marginal distribution of the observed $\mathbf{x}_j$
  $p(\mathbf{x}_j \mid \mathbf{W}, \boldsymbol{\mu}, \boldsymbol{\Psi}) = \int \mathcal{N}(\mathbf{W}\mathbf{z}_j + \boldsymbol{\mu}, \boldsymbol{\Psi})\,\mathcal{N}(\mathbf{z}_j \mid \boldsymbol{\mu}_0, \boldsymbol{\Sigma}_0)\,d\mathbf{z}_j = \mathcal{N}(\mathbf{x}_j \mid \mathbf{W}\boldsymbol{\mu}_0 + \boldsymbol{\mu},\; \boldsymbol{\Psi} + \mathbf{W}\boldsymbol{\Sigma}_0\mathbf{W}^T)$
  Note that we can rewrite this as:
  $p(\mathbf{x}_j \mid \hat{\mathbf{W}}, \hat{\boldsymbol{\mu}}, \boldsymbol{\Psi}) = \mathcal{N}(\mathbf{x}_j \mid \hat{\boldsymbol{\mu}},\; \boldsymbol{\Psi} + \hat{\mathbf{W}}\hat{\mathbf{W}}^T)$
  where $\hat{\boldsymbol{\mu}} = \mathbf{W}\boldsymbol{\mu}_0 + \boldsymbol{\mu}$ and $\hat{\mathbf{W}} = \mathbf{W}\boldsymbol{\Sigma}_0^{1/2}$.
  Thus, without loss of generality (since $\boldsymbol{\mu}_0$ and $\boldsymbol{\Sigma}_0$ are absorbed into the learnable parameters), we let:
  $p(\mathbf{z}_j) = \mathcal{N}(\mathbf{z}_j \mid \mathbf{0}, \mathbf{I})$
  and find:
  $p(\mathbf{x}_j \mid \mathbf{W}, \boldsymbol{\mu}, \boldsymbol{\Psi}) = \mathcal{N}(\mathbf{x}_j \mid \boldsymbol{\mu},\; \boldsymbol{\Psi} + \mathbf{W}\mathbf{W}^T)$
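
A quick way to sanity-check this marginal is by simulation: sample $\mathbf{x}$ as in the previous sketch and compare the empirical mean and covariance with $\boldsymbol{\mu}$ and $\boldsymbol{\Psi} + \mathbf{W}\mathbf{W}^T$. The parameter values below are again arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D, L, N = 5, 2, 200_000
W = rng.normal(size=(D, L))
mu = rng.normal(size=D)
psi = rng.uniform(0.1, 0.5, D)

# Ancestral samples with p(z) = N(0, I)
z = rng.normal(size=(N, L))
x = z @ W.T + mu + rng.normal(size=(N, D)) * np.sqrt(psi)

cov_model = np.diag(psi) + W @ W.T                 # closed-form marginal covariance Psi + W W^T
print(np.abs(x.mean(axis=0) - mu).max())           # ~0: empirical mean matches mu
print(np.abs(np.cov(x, rowvar=False) - cov_model).max())   # ~0: empirical cov matches Psi + W W^T
```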

  17. Marginal distribution interpretation
  • We can see from $p(\mathbf{x}_j \mid \mathbf{W}, \boldsymbol{\mu}, \boldsymbol{\Psi}) = \mathcal{N}(\mathbf{x}_j \mid \boldsymbol{\mu},\; \boldsymbol{\Psi} + \mathbf{W}\mathbf{W}^T)$ that the covariance matrix of the data distribution is broken into two terms
    • A diagonal part $\boldsymbol{\Psi}$: variance not shared between variables
    • A low-rank matrix $\mathbf{W}\mathbf{W}^T$: shared variance due to latent factors

  18. Special Case: Probabilistic PCA (PPCA)
  • Probabilistic PCA is a special case of Factor Analysis
  • We further restrict $\boldsymbol{\Psi} = \sigma^2 \mathbf{I}$ (i.e., assume isotropic, independent variance)
  • It is possible to show that when the data are centered ($\boldsymbol{\mu} = 0$), the limiting case $\sigma \to 0$ gives back the same solution for $\mathbf{W}$ as PCA
  • Factor analysis is a generalization of PCA that models non-shared variance (which can be thought of as noise in some situations, or individual variation in others)
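
A hedged illustration of this relationship (using scikit-learn and SciPy, which the slides do not mention): FactorAnalysis estimates a per-feature noise variance, the diagonal of $\boldsymbol{\Psi}$, in addition to the loadings, and when the noise is small and isotropic its latent subspace essentially coincides with the PCA subspace.

```python
import numpy as np
from scipy.linalg import subspace_angles
from sklearn.decomposition import PCA, FactorAnalysis

rng = np.random.default_rng(0)
N, D, L = 5000, 6, 2
W_true = rng.normal(size=(D, L))
# Low-rank signal plus small isotropic noise (the PPCA-like regime)
x = rng.normal(size=(N, L)) @ W_true.T + 0.05 * rng.normal(size=(N, D))

fa = FactorAnalysis(n_components=L).fit(x)
pca = PCA(n_components=L).fit(x)

print("FA per-feature noise variances:", np.round(fa.noise_variance_, 4))
# Principal angles between the two L-dimensional subspaces (near 0 => same subspace)
print("principal angles:", subspace_angles(fa.components_.T, pca.components_.T))
```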

  19. Inference in FA
  • To find the parameters of the FA model, we use the Expectation Maximization (EM) algorithm
  • EM is very similar to variational inference
  • We'll derive EM by first finding a lower bound on the log-likelihood we want to maximize, and then maximizing this lower bound

  20. Evidence Lower Bound decomposition
  • For any distributions $q(z)$ and $p(z)$ we have:
  $\mathrm{KL}\left(q(z)\,\|\,p(z)\right) \triangleq \int q(z) \log \frac{q(z)}{p(z)}\,dz$
  • Consider the KL divergence of an arbitrary weighting distribution $q(z)$ from a conditional distribution $p(z \mid x, \theta)$:
  $\mathrm{KL}\left(q(z)\,\|\,p(z \mid x, \theta)\right) \triangleq \int q(z) \log \frac{q(z)}{p(z \mid x, \theta)}\,dz = \int q(z)\left[\log q(z) - \log p(z \mid x, \theta)\right]dz$
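
As a tiny numerical aside (toy discrete distributions, not from the slides), the KL divergence defined above can be computed directly; note that it is non-negative and asymmetric.

```python
import numpy as np

q = np.array([0.5, 0.3, 0.2])       # an arbitrary weighting distribution q(z)
p = np.array([0.4, 0.4, 0.2])       # another distribution p(z)

kl_qp = np.sum(q * np.log(q / p))   # KL(q || p)
kl_pq = np.sum(p * np.log(p / q))   # KL(p || q)
print(kl_qp, kl_pq)                 # both >= 0, and generally not equal (KL is asymmetric)
```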

  21. Applying Bayes' rule
  $\log p(z \mid x, \theta) = \log \frac{p(x \mid z, \theta)\,p(z \mid \theta)}{p(x \mid \theta)} = \log p(x \mid z, \theta) + \log p(z \mid \theta) - \log p(x \mid \theta)$
  Then:
  $\mathrm{KL}\left(q(z)\,\|\,p(z \mid x, \theta)\right) = \int q(z)\left[\log q(z) - \log p(z \mid x, \theta)\right]dz = \int q(z)\left[\log q(z) - \log p(x \mid z, \theta) - \log p(z \mid \theta) + \log p(x \mid \theta)\right]dz$

  22. Rewriting the divergence
  • Since the last term does not depend on $z$, and we know $\int q(z)\,dz = 1$, we can pull it out of the integral:
  $\int q(z)\left[\log q(z) - \log p(x \mid z, \theta) - \log p(z \mid \theta) + \log p(x \mid \theta)\right]dz$
  $= \int q(z)\left[\log q(z) - \log p(x \mid z, \theta) - \log p(z \mid \theta)\right]dz + \log p(x \mid \theta)$
  $= \int q(z) \log \frac{q(z)}{p(x \mid z, \theta)\,p(z \mid \theta)}\,dz + \log p(x \mid \theta)$
  $= \int q(z) \log \frac{q(z)}{p(x, z \mid \theta)}\,dz + \log p(x \mid \theta)$
  Then we have:
  $\mathrm{KL}\left(q(z)\,\|\,p(z \mid x, \theta)\right) = \mathrm{KL}\left(q(z)\,\|\,p(x, z \mid \theta)\right) + \log p(x \mid \theta)$

  23. Evidence Lower Bound
  • From basic probability we have:
  $\mathrm{KL}\left(q(z)\,\|\,p(z \mid x, \theta)\right) = \mathrm{KL}\left(q(z)\,\|\,p(x, z \mid \theta)\right) + \log p(x \mid \theta)$
  • We can rearrange the terms to get the following decomposition:
  $\log p(x \mid \theta) = \mathrm{KL}\left(q(z)\,\|\,p(z \mid x, \theta)\right) - \mathrm{KL}\left(q(z)\,\|\,p(x, z \mid \theta)\right)$
  • We define the evidence lower bound (ELBO) as:
  $\mathcal{L}(q, \theta) \triangleq -\mathrm{KL}\left(q(z)\,\|\,p(x, z \mid \theta)\right)$
  Then:
  $\log p(x \mid \theta) = \mathrm{KL}\left(q(z)\,\|\,p(z \mid x, \theta)\right) + \mathcal{L}(q, \theta)$
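
This decomposition can be checked numerically on a toy discrete model (made-up numbers, not from the slides): for an arbitrary $q(z)$, $\mathcal{L}(q, \theta) = -\mathrm{KL}(q \,\|\, p(x, z \mid \theta))$ plus $\mathrm{KL}(q \,\|\, p(z \mid x, \theta))$ recovers $\log p(x \mid \theta)$ exactly.

```python
import numpy as np

# Joint p(x, z | theta) for one observed x and a latent z with 3 possible values
p_xz = np.array([0.10, 0.25, 0.05])
p_x = p_xz.sum()                    # the evidence p(x | theta)
p_z_given_x = p_xz / p_x            # the posterior p(z | x, theta)

q = np.array([0.2, 0.5, 0.3])       # an arbitrary weighting distribution q(z)

kl_post = np.sum(q * np.log(q / p_z_given_x))   # KL(q || p(z | x, theta))
elbo = -np.sum(q * np.log(q / p_xz))            # L(q, theta) = -KL(q || p(x, z | theta))
print(np.log(p_x), kl_post + elbo)              # identical: log p(x|theta) = KL + ELBO
```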

  24. Why the name "evidence lower bound"?
  • Rearranging the decomposition $\log p(x \mid \theta) = \mathrm{KL}\left(q(z)\,\|\,p(z \mid x, \theta)\right) + \mathcal{L}(q, \theta)$, we have:
  $\mathcal{L}(q, \theta) = \log p(x \mid \theta) - \mathrm{KL}\left(q(z)\,\|\,p(z \mid x, \theta)\right)$
  • Since $\mathrm{KL}\left(q(z)\,\|\,p(z \mid x, \theta)\right) \geq 0$, $\mathcal{L}(q, \theta)$ is a lower bound on the log-likelihood we want to maximize
  • $p(x \mid \theta)$ is sometimes called the evidence
  • When is this bound tight? When $q(z) = p(z \mid x, \theta)$
  • The ELBO is also sometimes called the variational bound

  25. Visualizing the ELBO decomposition
  (Figure: Bishop, Pattern Recognition and Machine Learning)
  • Note: all we have done so far is decompose the log probability of the data; we still have exact equality
  • This holds for any distribution $q$

  26. Expectation Maximization
  • Expectation Maximization alternately optimizes the ELBO $\mathcal{L}(q, \theta)$ with respect to $q$ (the E step) and $\theta$ (the M step)
  • Initialize $\theta^{(0)}$
  • At each iteration $t = 1, \dots$
    • E step: Hold $\theta^{(t-1)}$ fixed, find the $q^{(t)}$ which maximizes $\mathcal{L}(q, \theta^{(t-1)})$
    • M step: Hold $q^{(t)}$ fixed, find the $\theta^{(t)}$ which maximizes $\mathcal{L}(q^{(t)}, \theta)$

  27. The E step
  (Figure: Bishop, Pattern Recognition and Machine Learning)
  • Suppose we are at iteration $t$ of our algorithm. How do we maximize $\mathcal{L}(q, \theta^{(t-1)})$ with respect to $q$? We know that:
  $\operatorname{argmax}_q \mathcal{L}(q, \theta^{(t-1)}) = \operatorname{argmax}_q \left[\log p(x \mid \theta^{(t-1)}) - \mathrm{KL}\left(q(z)\,\|\,p(z \mid x, \theta^{(t-1)})\right)\right]$
  • Since the first term does not depend on $q$, the ELBO is maximized by minimizing the KL term, i.e., by setting $q^{(t)}(z) = p(z \mid x, \theta^{(t-1)})$
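
To connect EM back to Factor Analysis, the sketch below implements this E/M alternation for the FA model using the standard closed-form updates (these follow, e.g., Bishop Sec. 12.2.4; they are not derived on the slides shown here, so treat the exact update equations as an assumption of this sketch). Synthetic data and the iteration count are also assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, L = 5000, 6, 2

# Synthetic data drawn from a "true" FA model
W_true = rng.normal(size=(D, L))
psi_true = rng.uniform(0.1, 0.5, D)
X = rng.normal(size=(N, L)) @ W_true.T + rng.normal(size=(N, D)) * np.sqrt(psi_true)

mu = X.mean(axis=0)            # ML estimate of mu is the sample mean
Xc = X - mu
S = Xc.T @ Xc / N              # sample covariance

W = rng.normal(size=(D, L))    # initialize the parameters theta = (W, Psi)
psi = np.ones(D)

for _ in range(200):
    # E step: q(z_n) = p(z_n | x_n, theta) = N(E[z_n], G)
    G = np.linalg.inv(np.eye(L) + W.T @ (W / psi[:, None]))   # (I + W^T Psi^-1 W)^-1
    Ez = Xc @ (W / psi[:, None]) @ G                          # E[z_n] for every sample, (N, L)
    Ezz = N * G + Ez.T @ Ez                                   # sum_n E[z_n z_n^T]

    # M step: maximize the ELBO over W and Psi with q fixed
    W = (Xc.T @ Ez) @ np.linalg.inv(Ezz)
    psi = np.diag(S - W @ (Ez.T @ Xc) / N)                    # keep only the diagonal

print(np.round(psi, 3))        # estimated per-feature noise variances
print(np.round(psi_true, 3))   # should be close (up to estimation error)
```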
