
  1. Variational Autoencoders

  2. Recap: Story so far
  • A classification MLP actually comprises two components
    • A "feature extraction network" that converts the inputs into linearly separable features
      • Or nearly linearly separable features
    • A final linear classifier that operates on the linearly separable features
  • Neural networks can be used to perform linear or non-linear PCA
    • "Autoencoders"
    • Can also be used to compose constructive dictionaries for data
    • Which, in turn, can be used to model data distributions

  3. Recap: The penultimate layer
  [Figure: network mapping inputs (x1, x2) through penultimate-layer features (z1, z2) to outputs (y1, y2)]
  • The network up to the output layer may be viewed as a transformation that transforms data from non-linear classes to linearly separable features
  • We can now attach any linear classifier above it for perfect classification
    • It need not be a perceptron
    • In fact, slapping an SVM on top of the features may generalize better!

  4. Recap: The behavior of the layers

  5. Recap: Auto-encoders and PCA
  • Training: learn the weight $w$ by minimizing the L2 divergence
  $\hat{x} = w w^T x$
  $\mathrm{div}(\hat{x}, x) = \|x - \hat{x}\|^2 = \|x - w w^T x\|^2$
  $\hat{w} = \underset{w}{\operatorname{argmin}}\; E\left[\mathrm{div}(\hat{x}, x)\right] = \underset{w}{\operatorname{argmin}}\; E\left[\|x - w w^T x\|^2\right]$
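
As an aside (not from the slides), a minimal NumPy sketch of this recap: a single-unit linear autoencoder with tied weights, trained by gradient descent on the L2 objective above, converges to the first principal direction of the data. The synthetic data, learning rate, and iteration count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2)) @ np.diag([3.0, 0.5])   # one dominant direction of variance
X -= X.mean(axis=0)                                    # zero-mean data

w = 0.1 * rng.normal(size=2)        # tied encoder/decoder weight vector
lr = 0.005
for _ in range(3000):
    a = X @ w                       # hidden code for each sample: w^T x
    R = X - np.outer(a, w)          # reconstruction error x - w w^T x
    grad = -2 * (X.T @ (R @ w) + R.T @ a) / len(X)   # gradient of mean ||x - w w^T x||^2
    w -= lr * grad

u = np.linalg.svd(X, full_matrices=False)[2][0]      # first principal direction via SVD
print("cosine similarity to PCA direction:", abs(w @ u) / np.linalg.norm(w))
```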

  6. Recap: Auto-encoders and PCA
  • The autoencoder finds the direction of maximum energy
    • Variance, if the input is a zero-mean RV
  • All input vectors are mapped onto a point on the principal axis

  7. Recap: Auto-encoders and PCA
  • Varying the hidden layer value only generates data along the learned manifold
    • Which may be poorly learned
  • Any input will result in an output along the learned manifold

  8. Recap: Learning a data manifold
  [Figure: a decoder trained on saxophone data, acting as a "sax dictionary"]
  • The decoder represents a source-specific generative dictionary
  • Exciting it will produce typical data from the source!

  9. Overview
  • Just as autoencoders can be viewed as performing a non-linear PCA, variational autoencoders can be viewed as performing a non-linear Factor Analysis (FA)
  • Variational autoencoders (VAEs) get their name from variational inference, a technique that can be used for parameter estimation
  • We will introduce Factor Analysis, variational inference and expectation maximization, and finally VAEs

  10. Why Generative Models? Training data
  • Unsupervised/semi-supervised learning: much more training data is available when labels are not required
    • E.g., all of the videos on YouTube

  11. Why generative models? Many right answers
  • Caption -> Image, e.g., "A man in an orange jacket with sunglasses and a hat skis down a hill" (https://openreview.net/pdf?id=Hyvw0L9el)
  • Outline -> Image (https://arxiv.org/abs/1611.07004)

  12. Why generative models? Intrinsic to task
  • Example: super-resolution (https://arxiv.org/abs/1609.04802)

  13. Why generative models? Insight
  • What kind of structure can we find in complex observations such as MEG recordings of brain activity or gene-expression networks? (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-12-327)
  • Is there a low-dimensional manifold underlying these complex observations?
  • What can we learn about the brain, cellular function, etc. if we know more about these manifolds?

  14. Factor Analysis
  • Generative model: assumes that data are generated from real-valued latent variables
  (Figure: Bishop, Pattern Recognition and Machine Learning)

  15. Factor Analysis model
  • Factor analysis assumes a generative model in which the $j$-th observation $\mathbf{x}_j \in \mathbb{R}^D$ is conditioned on a vector of real-valued latent variables $\mathbf{z}_j \in \mathbb{R}^L$
  • Here we assume the prior distribution is Gaussian:
  $p(\mathbf{z}_j) = \mathcal{N}(\mathbf{z}_j \mid \boldsymbol{\mu}_0, \boldsymbol{\Sigma}_0)$
  • We also use a Gaussian for the data likelihood:
  $p(\mathbf{x}_j \mid \mathbf{z}_j, \mathbf{W}, \boldsymbol{\mu}, \boldsymbol{\Psi}) = \mathcal{N}(\mathbf{W}\mathbf{z}_j + \boldsymbol{\mu}, \boldsymbol{\Psi})$
  where $\mathbf{W} \in \mathbb{R}^{D \times L}$, $\boldsymbol{\Psi} \in \mathbb{R}^{D \times D}$, and $\boldsymbol{\Psi}$ is diagonal
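
To make the generative model concrete, here is a small sketch of ancestral sampling from it. The dimensions and parameter values are made up for illustration, and the prior is taken as $\mathcal{N}(\mathbf{0}, \mathbf{I})$, which the next slide shows loses no generality.

```python
import numpy as np

rng = np.random.default_rng(0)
D, L, N = 5, 2, 10000            # observed dim, latent dim, number of samples (assumed)

W = rng.normal(size=(D, L))      # factor loading matrix W
mu = rng.normal(size=D)          # mean offset mu
psi = rng.uniform(0.1, 0.5, D)   # diagonal of Psi (per-feature noise variance)

z = rng.normal(size=(N, L))                      # z_j ~ N(0, I)
eps = rng.normal(size=(N, D)) * np.sqrt(psi)     # eps_j ~ N(0, Psi), Psi diagonal
x = z @ W.T + mu + eps                           # x_j | z_j ~ N(W z_j + mu, Psi)
print(x.shape)                                   # (10000, 5)
```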

  16. Marginal distribution of the observed $\mathbf{x}_j$
  $p(\mathbf{x}_j \mid \mathbf{W}, \boldsymbol{\mu}, \boldsymbol{\Psi}) = \int \mathcal{N}(\mathbf{W}\mathbf{z}_j + \boldsymbol{\mu}, \boldsymbol{\Psi})\,\mathcal{N}(\mathbf{z}_j \mid \boldsymbol{\mu}_0, \boldsymbol{\Sigma}_0)\,d\mathbf{z}_j = \mathcal{N}(\mathbf{x}_j \mid \mathbf{W}\boldsymbol{\mu}_0 + \boldsymbol{\mu},\; \boldsymbol{\Psi} + \mathbf{W}\boldsymbol{\Sigma}_0\mathbf{W}^T)$
  Note that we can rewrite this as:
  $p(\mathbf{x}_j \mid \hat{\mathbf{W}}, \hat{\boldsymbol{\mu}}, \boldsymbol{\Psi}) = \mathcal{N}(\mathbf{x}_j \mid \hat{\boldsymbol{\mu}},\; \boldsymbol{\Psi} + \hat{\mathbf{W}}\hat{\mathbf{W}}^T)$
  where $\hat{\boldsymbol{\mu}} = \mathbf{W}\boldsymbol{\mu}_0 + \boldsymbol{\mu}$ and $\hat{\mathbf{W}} = \mathbf{W}\boldsymbol{\Sigma}_0^{1/2}$.
  Thus, without loss of generality (since $\boldsymbol{\mu}_0$ and $\boldsymbol{\Sigma}_0$ are absorbed into the learnable parameters), we let:
  $p(\mathbf{z}_j) = \mathcal{N}(\mathbf{z}_j \mid \mathbf{0}, \mathbf{I})$
  and find:
  $p(\mathbf{x}_j \mid \mathbf{W}, \boldsymbol{\mu}, \boldsymbol{\Psi}) = \mathcal{N}(\mathbf{x}_j \mid \boldsymbol{\mu},\; \boldsymbol{\Psi} + \mathbf{W}\mathbf{W}^T)$
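
A quick way to sanity-check this marginal is by simulation: sample $\mathbf{x}$ as in the previous sketch and compare the empirical mean and covariance with $\boldsymbol{\mu}$ and $\boldsymbol{\Psi} + \mathbf{W}\mathbf{W}^T$. The parameter values below are again arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D, L, N = 5, 2, 200_000
W = rng.normal(size=(D, L))
mu = rng.normal(size=D)
psi = rng.uniform(0.1, 0.5, D)

# Ancestral samples with p(z) = N(0, I)
z = rng.normal(size=(N, L))
x = z @ W.T + mu + rng.normal(size=(N, D)) * np.sqrt(psi)

cov_model = np.diag(psi) + W @ W.T                 # closed-form marginal covariance Psi + W W^T
print(np.abs(x.mean(axis=0) - mu).max())           # ~0: empirical mean matches mu
print(np.abs(np.cov(x, rowvar=False) - cov_model).max())   # ~0: empirical cov matches Psi + W W^T
```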

  17. Marginal distribution interpretation
  • We can see from $p(\mathbf{x}_j \mid \mathbf{W}, \boldsymbol{\mu}, \boldsymbol{\Psi}) = \mathcal{N}(\mathbf{x}_j \mid \boldsymbol{\mu},\; \boldsymbol{\Psi} + \mathbf{W}\mathbf{W}^T)$ that the covariance matrix of the data distribution is broken into two terms
    • A diagonal part $\boldsymbol{\Psi}$: variance not shared between variables
    • A low-rank matrix $\mathbf{W}\mathbf{W}^T$: shared variance due to latent factors

  18. Special Case: Probabilistic PCA (PPCA)
  • Probabilistic PCA is a special case of Factor Analysis
  • We further restrict $\boldsymbol{\Psi} = \sigma^2 \mathbf{I}$ (i.e., assume isotropic, independent variance)
  • It is possible to show that when the data are centered ($\boldsymbol{\mu} = 0$), the limiting case $\sigma \to 0$ gives back the same solution for $\mathbf{W}$ as PCA
  • Factor analysis is a generalization of PCA that models non-shared variance (which can be thought of as noise in some situations, or individual variation in others)
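
A hedged illustration of this relationship (using scikit-learn and SciPy, which the slides do not mention): FactorAnalysis estimates a per-feature noise variance, the diagonal of $\boldsymbol{\Psi}$, in addition to the loadings, and when the noise is small and isotropic its latent subspace essentially coincides with the PCA subspace.

```python
import numpy as np
from scipy.linalg import subspace_angles
from sklearn.decomposition import PCA, FactorAnalysis

rng = np.random.default_rng(0)
N, D, L = 5000, 6, 2
W_true = rng.normal(size=(D, L))
# Low-rank signal plus small isotropic noise (the PPCA-like regime)
x = rng.normal(size=(N, L)) @ W_true.T + 0.05 * rng.normal(size=(N, D))

fa = FactorAnalysis(n_components=L).fit(x)
pca = PCA(n_components=L).fit(x)

print("FA per-feature noise variances:", np.round(fa.noise_variance_, 4))
# Principal angles between the two L-dimensional subspaces (near 0 => same subspace)
print("principal angles:", subspace_angles(fa.components_.T, pca.components_.T))
```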

  19. Inference in FA
  • To find the parameters of the FA model, we use the Expectation Maximization (EM) algorithm
  • EM is very similar to variational inference
  • We'll derive EM by first finding a lower bound on the log-likelihood we want to maximize, and then maximizing this lower bound

  20. Evidence Lower Bound decomposition
  • For any distributions $q(z)$ and $p(z)$ we have:
  $\mathrm{KL}\left(q(z)\,\|\,p(z)\right) \triangleq \int q(z) \log \frac{q(z)}{p(z)}\,dz$
  • Consider the KL divergence of an arbitrary weighting distribution $q(z)$ from a conditional distribution $p(z \mid x, \theta)$:
  $\mathrm{KL}\left(q(z)\,\|\,p(z \mid x, \theta)\right) \triangleq \int q(z) \log \frac{q(z)}{p(z \mid x, \theta)}\,dz = \int q(z)\left[\log q(z) - \log p(z \mid x, \theta)\right]dz$
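
As a tiny numerical aside (toy discrete distributions, not from the slides), the KL divergence defined above can be computed directly; note that it is non-negative and asymmetric.

```python
import numpy as np

q = np.array([0.5, 0.3, 0.2])       # an arbitrary weighting distribution q(z)
p = np.array([0.4, 0.4, 0.2])       # another distribution p(z)

kl_qp = np.sum(q * np.log(q / p))   # KL(q || p)
kl_pq = np.sum(p * np.log(p / q))   # KL(p || q)
print(kl_qp, kl_pq)                 # both >= 0, and generally not equal (KL is asymmetric)
```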

  21. Applying Bayes' rule
  $\log p(z \mid x, \theta) = \log \frac{p(x \mid z, \theta)\,p(z \mid \theta)}{p(x \mid \theta)} = \log p(x \mid z, \theta) + \log p(z \mid \theta) - \log p(x \mid \theta)$
  Then:
  $\mathrm{KL}\left(q(z)\,\|\,p(z \mid x, \theta)\right) = \int q(z)\left[\log q(z) - \log p(z \mid x, \theta)\right]dz = \int q(z)\left[\log q(z) - \log p(x \mid z, \theta) - \log p(z \mid \theta) + \log p(x \mid \theta)\right]dz$

  22. Rewriting the divergence
  • Since the last term does not depend on $z$, and we know $\int q(z)\,dz = 1$, we can pull it out of the integral:
  $\int q(z)\left[\log q(z) - \log p(x \mid z, \theta) - \log p(z \mid \theta) + \log p(x \mid \theta)\right]dz$
  $= \int q(z)\left[\log q(z) - \log p(x \mid z, \theta) - \log p(z \mid \theta)\right]dz + \log p(x \mid \theta)$
  $= \int q(z) \log \frac{q(z)}{p(x \mid z, \theta)\,p(z \mid \theta)}\,dz + \log p(x \mid \theta)$
  $= \int q(z) \log \frac{q(z)}{p(x, z \mid \theta)}\,dz + \log p(x \mid \theta)$
  Then we have:
  $\mathrm{KL}\left(q(z)\,\|\,p(z \mid x, \theta)\right) = \mathrm{KL}\left(q(z)\,\|\,p(x, z \mid \theta)\right) + \log p(x \mid \theta)$

  23. Evidence Lower Bound
  • From basic probability we have:
  $\mathrm{KL}\left(q(z)\,\|\,p(z \mid x, \theta)\right) = \mathrm{KL}\left(q(z)\,\|\,p(x, z \mid \theta)\right) + \log p(x \mid \theta)$
  • We can rearrange the terms to get the following decomposition:
  $\log p(x \mid \theta) = \mathrm{KL}\left(q(z)\,\|\,p(z \mid x, \theta)\right) - \mathrm{KL}\left(q(z)\,\|\,p(x, z \mid \theta)\right)$
  • We define the evidence lower bound (ELBO) as:
  $\mathcal{L}(q, \theta) \triangleq -\mathrm{KL}\left(q(z)\,\|\,p(x, z \mid \theta)\right)$
  Then:
  $\log p(x \mid \theta) = \mathrm{KL}\left(q(z)\,\|\,p(z \mid x, \theta)\right) + \mathcal{L}(q, \theta)$
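
This decomposition can be checked numerically on a toy discrete model (made-up numbers, not from the slides): for an arbitrary $q(z)$, $\mathcal{L}(q, \theta) = -\mathrm{KL}(q \,\|\, p(x, z \mid \theta))$ plus $\mathrm{KL}(q \,\|\, p(z \mid x, \theta))$ recovers $\log p(x \mid \theta)$ exactly.

```python
import numpy as np

# Joint p(x, z | theta) for one observed x and a latent z with 3 possible values
p_xz = np.array([0.10, 0.25, 0.05])
p_x = p_xz.sum()                    # the evidence p(x | theta)
p_z_given_x = p_xz / p_x            # the posterior p(z | x, theta)

q = np.array([0.2, 0.5, 0.3])       # an arbitrary weighting distribution q(z)

kl_post = np.sum(q * np.log(q / p_z_given_x))   # KL(q || p(z | x, theta))
elbo = -np.sum(q * np.log(q / p_xz))            # L(q, theta) = -KL(q || p(x, z | theta))
print(np.log(p_x), kl_post + elbo)              # identical: log p(x|theta) = KL + ELBO
```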

  24. Why the name "evidence lower bound"?
  • Rearranging the decomposition $\log p(x \mid \theta) = \mathrm{KL}\left(q(z)\,\|\,p(z \mid x, \theta)\right) + \mathcal{L}(q, \theta)$, we have:
  $\mathcal{L}(q, \theta) = \log p(x \mid \theta) - \mathrm{KL}\left(q(z)\,\|\,p(z \mid x, \theta)\right)$
  • Since $\mathrm{KL}\left(q(z)\,\|\,p(z \mid x, \theta)\right) \geq 0$, $\mathcal{L}(q, \theta)$ is a lower bound on the log-likelihood we want to maximize
  • $p(x \mid \theta)$ is sometimes called the evidence
  • When is this bound tight? When $q(z) = p(z \mid x, \theta)$
  • The ELBO is also sometimes called the variational bound

  25. Visualizing the ELBO decomposition
  (Figure: Bishop, Pattern Recognition and Machine Learning)
  • Note: all we have done so far is decompose the log probability of the data; we still have exact equality
  • This holds for any distribution $q$

  26. Expectation Maximization
  • Expectation Maximization alternately optimizes the ELBO $\mathcal{L}(q, \theta)$ with respect to $q$ (the E step) and $\theta$ (the M step)
  • Initialize $\theta^{(0)}$
  • At each iteration $t = 1, \dots$
    • E step: Hold $\theta^{(t-1)}$ fixed, find the $q^{(t)}$ which maximizes $\mathcal{L}(q, \theta^{(t-1)})$
    • M step: Hold $q^{(t)}$ fixed, find the $\theta^{(t)}$ which maximizes $\mathcal{L}(q^{(t)}, \theta)$

  27. The E step
  (Figure: Bishop, Pattern Recognition and Machine Learning)
  • Suppose we are at iteration $t$ of our algorithm. How do we maximize $\mathcal{L}(q, \theta^{(t-1)})$ with respect to $q$? We know that:
  $\operatorname{argmax}_q \mathcal{L}(q, \theta^{(t-1)}) = \operatorname{argmax}_q \left[\log p(x \mid \theta^{(t-1)}) - \mathrm{KL}\left(q(z)\,\|\,p(z \mid x, \theta^{(t-1)})\right)\right]$
  • Since the first term does not depend on $q$, the ELBO is maximized by minimizing the KL term, i.e., by setting $q^{(t)}(z) = p(z \mid x, \theta^{(t-1)})$
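
To connect EM back to Factor Analysis, the sketch below implements this E/M alternation for the FA model using the standard closed-form updates (these follow, e.g., Bishop Sec. 12.2.4; they are not derived on the slides shown here, so treat the exact update equations as an assumption of this sketch). Synthetic data and the iteration count are also assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, L = 5000, 6, 2

# Synthetic data drawn from a "true" FA model
W_true = rng.normal(size=(D, L))
psi_true = rng.uniform(0.1, 0.5, D)
X = rng.normal(size=(N, L)) @ W_true.T + rng.normal(size=(N, D)) * np.sqrt(psi_true)

mu = X.mean(axis=0)            # ML estimate of mu is the sample mean
Xc = X - mu
S = Xc.T @ Xc / N              # sample covariance

W = rng.normal(size=(D, L))    # initialize the parameters theta = (W, Psi)
psi = np.ones(D)

for _ in range(200):
    # E step: q(z_n) = p(z_n | x_n, theta) = N(E[z_n], G)
    G = np.linalg.inv(np.eye(L) + W.T @ (W / psi[:, None]))   # (I + W^T Psi^-1 W)^-1
    Ez = Xc @ (W / psi[:, None]) @ G                          # E[z_n] for every sample, (N, L)
    Ezz = N * G + Ez.T @ Ez                                   # sum_n E[z_n z_n^T]

    # M step: maximize the ELBO over W and Psi with q fixed
    W = (Xc.T @ Ez) @ np.linalg.inv(Ezz)
    psi = np.diag(S - W @ (Ez.T @ Xc) / N)                    # keep only the diagonal

print(np.round(psi, 3))        # estimated per-feature noise variances
print(np.round(psi_true, 3))   # should be close (up to estimation error)
```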
