
A Tutorial on Deep Probabilistic Generative Models
Ryan P. Adams, Princeton University
Machine Learning Summer School, Buenos Aires, Argentina, June 2018
lips.cs.princeton.edu / @ryan_p_adams

Tutorial Outline
▶ What is generative modeling?
▶ Recipes for flexible generative models
▶ Algorithms for learning generative models from data
▶ Variational autoencoder
▶ Combining graphical models and neural networks


  1–7. Recipe 3: Specify a log density directly

Markov chain Monte Carlo (MCMC):
▶ Random walk that converges to $g_\theta(x) = \frac{1}{Z_\theta} \exp\{f_\theta(x)\}$.
▶ Uses a stochastic operator $T(x' \leftarrow x)$.
▶ Ergodic and leaves $g_\theta(x)$ invariant: $g_\theta(x) = \int g_\theta(x')\, T(x \leftarrow x')\, dx'$.
▶ Several common recipes: Metropolis–Hastings, Gibbs sampling, slice sampling, Hamiltonian Monte Carlo.

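To make the recipe concrete, here is a minimal random-walk Metropolis sketch in Python/NumPy (not from the slides; the target density and step size are illustrative). It only needs the unnormalized log density $f_\theta(x)$, because $Z_\theta$ cancels in the acceptance ratio.

    import numpy as np

    def metropolis(log_f, x0, n_steps, step_size=0.5, seed=0):
        """Random-walk Metropolis targeting g(x) proportional to exp{f(x)}.
        log_f need only be the *unnormalized* log density f_theta."""
        rng = np.random.default_rng(seed)
        x = np.atleast_1d(np.asarray(x0, dtype=float))
        log_fx = log_f(x)
        samples = np.empty((n_steps, x.size))
        for t in range(n_steps):
            # Symmetric Gaussian proposal, so the Hastings correction is 1.
            x_prop = x + step_size * rng.standard_normal(x.size)
            log_fprop = log_f(x_prop)
            # Accept with probability min(1, exp{f(x') - f(x)}).
            if np.log(rng.uniform()) < log_fprop - log_fx:
                x, log_fx = x_prop, log_fprop
            samples[t] = x
        return samples

    # Example: a bimodal 1-D target with f(x) = -x^4 + 3x^2.
    samples = metropolis(lambda x: float(np.sum(-x**4 + 3 * x**2)),
                         x0=0.0, n_steps=5000)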

  8. Recipe 3: Specify a log density directly

Example: Ising model
▶ Classic model of ferromagnetism with binary "spins".
▶ Influential in computer vision.
▶ Unary and pairwise potentials in the energy:
$E(x) = -f_\theta(x) = -\sum_{ij} \theta_{ij}\, x_i x_j - \sum_i \theta_i\, x_i$
(Figure credit: Kai Zhang, Columbia)
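As a sanity check on the notation, the Ising energy is a one-liner; a sketch with made-up parameter arrays:

    import numpy as np

    def ising_energy(x, theta_pair, theta_unary):
        """E(x) = -sum_ij theta_ij x_i x_j - sum_i theta_i x_i.
        x: (D,) spin vector; theta_pair: (D, D) pairwise couplings;
        theta_unary: (D,) unary potentials."""
        return -x @ theta_pair @ x - theta_unary @ x

    # Log density up to the (intractable) partition function:
    # ln g_theta(x) = -ising_energy(x, ...) - ln Z_theta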

  9. Recipe 3: Specify a log density directly

Example: Restricted Boltzmann Machine (Smolensky, 1986; Freund and Haussler, 1992)
▶ Special case of the Ising model.
▶ Bipartite: hidden and visible layers.
▶ Fully connected between layers.
▶ Typically trained with contrastive divergence.
(Figure: hidden units above visible units. Credit: Tieleman, 2008)

  10. Recipe 3: Specify a log density directly

Example: Deep Boltzmann Machines (Salakhutdinov and Hinton, 2009)
▶ Special case of the Ising model.
▶ $k$-partite: hidden and visible layers.
▶ Fully connected between layers.
(Figure credit: Salakhutdinov and Hinton, 2009)

  11. Tutorial Outline

▶ What is generative modeling?
▶ Recipes for flexible generative models
▶ Algorithms for learning generative models from data
▶ Variational autoencoder
▶ Combining graphical models and neural networks

  12. Inductive principles for flexible generative models

We observe $N$ data $\{x_n\}_{n=1}^N$; how do we fit the parameters $\theta$?
▶ Penalized maximum likelihood
▶ Computing a Bayesian posterior
▶ Score matching (Hyvärinen, 2005)
▶ Moment matching (e.g., Li et al., 2015)
▶ Maximum mean discrepancy (Gretton et al., 2012; Dziugaite et al., 2015)
▶ Pseudo-likelihood

  13. MLE for invertible transformations

When $f_\theta(\cdot)$ is bijective, things are easy to reason about:
$\ln P(\{x_n\}_{n=1}^N \mid \theta) = \sum_{n=1}^N \ln \pi(f_\theta^{-1}(x_n)) + \ln \big|J[f_\theta^{-1}(x_n)]\big|$
▶ Just use automatic differentiation to get gradients (see the sketch below).
▶ Note: we need to differentiate through the log-determinant of the Jacobian.
▶ The matrix $J[f_\theta^{-1}(x_n)]$ may become nearly singular during training, causing numerical issues; see Rippel and Adams (2013) for a discussion.
▶ Real NVP (Dinh et al., 2016) parameterizes the transformation so that the Jacobian determinant is easy to compute.
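The bijective case can be written down end to end. Here is a minimal sketch using JAX for the automatic differentiation mentioned above, with a hypothetical affine bijection standing in for $f_\theta$ and a standard normal base density $\pi$:

    import jax
    import jax.numpy as jnp

    def flow_log_prob(f_inv, x):
        """ln p(x) = ln pi(f_inv(x)) + ln |det J[f_inv](x)|,
        the change-of-variables formula for a bijective f."""
        z = f_inv(x)
        log_pi = -0.5 * (z @ z + z.size * jnp.log(2.0 * jnp.pi))  # standard normal
        J = jax.jacfwd(f_inv)(x)               # Jacobian of the inverse map
        _, logdet = jnp.linalg.slogdet(J)      # numerically stable log |det J|
        return log_pi + logdet

    # Hypothetical invertible map f(z) = A z + b, so f_inv(x) = A^{-1}(x - b).
    A = jnp.array([[2.0, 0.3], [0.0, 1.5]])
    b = jnp.array([1.0, -1.0])
    f_inv = lambda x: jnp.linalg.solve(A, x - b)
    logp = flow_log_prob(f_inv, jnp.ones(2))
    # Gradients with respect to parameters then come for free from jax.grad.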

  14. MLE for non-invertible transformations

$f_\theta(\cdot)$ non-surjective: some data have zero probability, i.e., infinite log loss. $f_\theta(\cdot)$ non-injective: data have multiple latent values. In general, you have to sum over all the ways you could have gotten each $x_n$:
$\ln P(\{x_n\}_{n=1}^N \mid \theta) = \sum_{n=1}^N \ln \int_{\{z : f_\theta(z) = x_n\}} \frac{\pi(z)}{|J[f_\theta(z)]|}\, dz$
Non-surjective $f_\theta(\cdot)$ means that the pre-image of $x_n$ could be empty, i.e., $\{z : f_\theta(z) = x_n\} = \emptyset$.


  15–22. Statistical tests for non-invertible transformations

Sometimes you don't care about the density or the latent representation; you just want fantasy data that has the same statistical properties as the real data.
1. Cook up a function $h$ that takes an $x$ and produces a scalar.
2. Transform some real data with $h$ and get the empirical distribution.
3. Transform some fantasy data with $h$ and get the empirical distribution.
4. Use your favorite two-sample test to compare the distributions.
5. Search for an $f_\theta(\cdot)$ that passes the test for many $h$ in a big set $\mathcal{H}$.
A nice kernel formalism for constructing tests and $\mathcal{H}$ is maximum mean discrepancy (Gretton et al., 2012); see also Dziugaite et al. (2015) and Huszár (2015). You could also parameterize and learn the test with a generative adversarial network (Goodfellow et al., 2014). David Warde-Farley will talk about GANs next week.
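As one concrete instance of such a test, here is a sketch of the (biased, V-statistic) squared-MMD estimate with an RBF kernel; the bandwidth default is made up:

    import numpy as np

    def mmd2_rbf(X, Y, bandwidth=1.0):
        """Biased V-statistic estimate of squared MMD between samples
        X (real data, shape (N, D)) and Y (fantasy data, shape (M, D))."""
        def k(A, B):
            d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
            return np.exp(-d2 / (2.0 * bandwidth ** 2))
        return k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()

    # A small MMD means the fantasy data passed this family of tests.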

  23. MLE for latent variable models

Like the non-injective transformation case, the mixing case requires integrating over latent hypotheses:
$\ln P(\{x_n\}_{n=1}^N \mid \theta) = \sum_{n=1}^N \ln \int P(x_n, z_n \mid \theta)\, dz_n = \sum_{n=1}^N \ln \int P(x_n \mid z_n, \theta)\, P(z_n \mid \theta)\, dz_n$
Generally, four ways to do this kind of integral in ML (see the Monte Carlo sketch below):
▶ Summation: easy to do expectation maximization with discrete latent variables.
▶ Quadrature: good rates in low dimensions, but bad in high dimensions.
▶ Monte Carlo: approximate the integral with a sample mean.
▶ Variational methods: approximate pieces with more tractable distributions.
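For the Monte Carlo route in the list above, a naive (often high-variance) sketch: draw $z$ from the prior and average the likelihood, working in log space for numerical stability. The function arguments are hypothetical stand-ins for the model.

    import numpy as np

    def log_marginal_mc(x, log_lik, sample_prior, M=1000):
        """ln P(x | theta) approx ln [ (1/M) sum_m P(x | z_m, theta) ],
        with z_m drawn from the prior P(z | theta)."""
        logp = np.array([log_lik(x, z) for z in sample_prior(M)])
        a = logp.max()                       # log-sum-exp trick
        return a + np.log(np.mean(np.exp(logp - a)))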

  24. MLE with latent variables: expectation maximization

Initialize $\theta^{(0)}$ to a reasonable starting point, then iterate:
▶ E-step: compute the expected complete-data log likelihood under $\theta^{(t)}$:
$Q(\theta \mid \theta^{(t)}) = \sum_{n=1}^N \mathbb{E}_{z_n \mid x_n, \theta^{(t)}}\big[\ln P(x_n, z_n \mid \theta)\big]$
▶ M-step: maximize this expected log likelihood with respect to $\theta$:
$\theta^{(t+1)} = \arg\max_\theta Q(\theta \mid \theta^{(t)})$
That expectation may be just as hard as the marginal likelihood, however.

  25. MLE for latent variable models: Monte Carlo EM

One approach to the integral is to use Monte Carlo. Recall:
$\int \pi(z) f(z)\, dz = \mathbb{E}[f(z)] \approx \frac{1}{M}\sum_{m=1}^M f(z^{(m)}), \quad \text{where } z^{(m)} \sim \pi$
Initialize $\theta^{(0)}$ to a reasonable starting point, then iterate:
▶ E-step: compute the expected complete-data log likelihood under $\theta^{(t)}$, using $M$ samples from the conditional on $z_n$:
$Q(\theta \mid \theta^{(t)}) = \frac{1}{M} \sum_{n=1}^N \sum_{m=1}^M \ln P(x_n, z_n^{(m)} \mid \theta)$
▶ M-step: maximize this expected log likelihood with respect to $\theta$:
$\theta^{(t+1)} = \arg\max_\theta Q(\theta \mid \theta^{(t)})$

  26. MLE for latent variable models: Variational EM

Introduce a tractable (typically factored) distribution family on the $\{z_n\}_{n=1}^N$:
$q_\gamma(\{z_n\}_{n=1}^N) = \prod_{n=1}^N q_{\gamma_n}(z_n)$
Jensen's inequality lets us lower-bound the marginal likelihood:
$\ln \int q_{\gamma_n}(z_n)\, \frac{P(x_n, z_n \mid \theta)}{q_{\gamma_n}(z_n)}\, dz_n \ge \int q_{\gamma_n}(z_n) \ln \frac{P(x_n, z_n \mid \theta)}{q_{\gamma_n}(z_n)}\, dz_n$
Alternate between maximizing with respect to $\gamma$ and $\theta$. If the $q_{\gamma_n}(z_n)$ family contains $P(z_n \mid x_n, \theta)$, then it's just regular EM. If not, then it provides a coherent way to approximate the difficult expectation. More on this later when we discuss variational autoencoders in detail.

  27. MLE for energy models

"Energy models" specify the density directly via its log:
$g_\theta(x) = \frac{1}{Z_\theta} \exp\{f_\theta(x)\}, \quad Z_\theta = \int \exp\{f_\theta(x)\}\, dx$
We generally can't compute the partition function $Z_\theta$:
$\ln P(\{x_n\}_{n=1}^N \mid \theta) = \left[\sum_{n=1}^N f_\theta(x_n)\right] - N \ln Z_\theta$
You really do have to account for the partition function in learning: $Z_\theta$ prevents the model from assigning high probability everywhere!


  28–32. MLE for energy models: contrastive divergence

$\frac{\partial}{\partial\theta} \ln P(\{x_n\}_{n=1}^N \mid \theta) = \left[\sum_{n=1}^N \frac{\partial}{\partial\theta} f_\theta(x_n)\right] - N \frac{\partial}{\partial\theta} \ln \int \exp\{f_\theta(x)\}\, dx$
$= \left[\sum_{n=1}^N \frac{\partial}{\partial\theta} f_\theta(x_n)\right] - N \left(\int \exp\{f_\theta(x)\}\, dx\right)^{-1} \frac{\partial}{\partial\theta} \int \exp\{f_\theta(x)\}\, dx$
$= \left[\sum_{n=1}^N \frac{\partial}{\partial\theta} f_\theta(x_n)\right] - N \frac{1}{Z_\theta} \frac{\partial}{\partial\theta} \int \exp\{f_\theta(x)\}\, dx$
$= \left[\sum_{n=1}^N \frac{\partial}{\partial\theta} f_\theta(x_n)\right] - N \int \frac{\exp\{f_\theta(x)\}}{Z_\theta}\, \frac{\partial}{\partial\theta} f_\theta(x)\, dx$
$= N \left( \mathbb{E}_{\text{data}}\!\left[\frac{\partial}{\partial\theta} f_\theta(x)\right] - \mathbb{E}_{\text{model}}\!\left[\frac{\partial}{\partial\theta} f_\theta(x)\right] \right)$


  33–37. MLE for energy models: contrastive divergence

The gradient is the difference between two expectations:
$\frac{1}{N} \frac{\partial}{\partial\theta} \ln P(\{x_n\}_{n=1}^N \mid \theta) = \mathbb{E}_{\text{data}}\!\left[\frac{\partial}{\partial\theta} f_\theta(x)\right] - \mathbb{E}_{\text{model}}\!\left[\frac{\partial}{\partial\theta} f_\theta(x)\right]$
▶ Use Monte Carlo for the second term by generating fantasy data?
▶ Bad news: generating data is hard; we have to use Markov chain Monte Carlo.
▶ Contrastive divergence: start at one of the data points and run $K$ steps of MCMC (Hinton, 2002). For RBMs: good features, bad densities.
▶ Persistent contrastive divergence: don't restart the Markov chain between updates (Tieleman, 2008); often does better.

  38. Training a binary RBM with CD

▶ Binary data $x \in \{0,1\}^D$.
▶ Binary hidden units $h \in \{0,1\}^J$.
▶ Parameters: weight matrix $W \in \mathbb{R}^{D \times J}$, biases $b_{\text{vis}} \in \mathbb{R}^D$ and $b_{\text{hid}} \in \mathbb{R}^J$.
▶ Energy function:
$E(x, h; W, b_{\text{vis}}, b_{\text{hid}}) = -x^\top W h - x^\top b_{\text{vis}} - h^\top b_{\text{hid}}$
▶ Hidden given visible (independent Bernoullis):
$P(h_j = 1 \mid x, W, b_{\text{hid}}) = \frac{1}{1 + \exp\{-(W^\top x + b_{\text{hid}})_j\}}, \quad j = 1, \dots, J$
▶ Visible given hidden:
$P(x_d = 1 \mid h, W, b_{\text{vis}}) = \frac{1}{1 + \exp\{-(W h + b_{\text{vis}})_d\}}, \quad d = 1, \dots, D$

  39. Training a binary RBM with CD

(Figure: bipartite graph of hidden and visible units.)
The bipartite structure of the RBM makes Gibbs sampling easy.

  40. Training a binary RBM with CD

(Figure: alternating Gibbs updates between hidden and visible layers.)
Contrastive divergence: start at the data and Gibbs sample $K$ times.

  41. Training a binary RBM with CD

1: Input: parameters $W$, $b_{\text{vis}}$, $b_{\text{hid}}$; datum $x \in \{0,1\}^D$; learning rate $\alpha > 0$
2: Output: updated parameters $W'$, $b'_{\text{vis}}$, $b'_{\text{hid}}$
3: $h_{\text{pos}} \sim h \mid x, W, b_{\text{hid}}$  ▷ Sample hiddens given visibles.
4: $h_{\text{neg}} \leftarrow h_{\text{pos}}$  ▷ Initialize negative hiddens.
5: for $t = 1 \dots K$ do
6:   $x_{\text{neg}} \sim x \mid h_{\text{neg}}, W, b_{\text{vis}}$  ▷ Sample fantasy data.
7:   $h_{\text{neg}} \sim h \mid x_{\text{neg}}, W, b_{\text{hid}}$  ▷ Sample hiddens for fantasy data.
8: end for
9: $W' \leftarrow W + \alpha\, (x\, h_{\text{pos}}^\top - x_{\text{neg}}\, h_{\text{neg}}^\top)$  ▷ Approximate stochastic gradient update.
10: $b'_{\text{vis}} \leftarrow b_{\text{vis}} + \alpha\, (x - x_{\text{neg}})$
11: $b'_{\text{hid}} \leftarrow b_{\text{hid}} + \alpha\, (h_{\text{pos}} - h_{\text{neg}})$
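A direct NumPy translation of the pseudocode above, as a sketch (assuming a single datum per update; in practice one averages the update over a mini-batch):

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def cd_k_update(x, W, b_vis, b_hid, K=1, alpha=0.01, rng=None):
        """One CD-K update for a binary RBM.
        x: (D,) binary datum; W: (D, J); b_vis: (D,); b_hid: (J,)."""
        rng = np.random.default_rng() if rng is None else rng
        # Lines 3-4: positive-phase hiddens; initialize the negative chain.
        h_pos = (rng.uniform(size=b_hid.shape) < sigmoid(W.T @ x + b_hid)).astype(float)
        h_neg = h_pos.copy()
        # Lines 5-8: K steps of alternating Gibbs sampling.
        for _ in range(K):
            x_neg = (rng.uniform(size=b_vis.shape) < sigmoid(W @ h_neg + b_vis)).astype(float)
            h_neg = (rng.uniform(size=b_hid.shape) < sigmoid(W.T @ x_neg + b_hid)).astype(float)
        # Lines 9-11: approximate stochastic gradient step.
        W = W + alpha * (np.outer(x, h_pos) - np.outer(x_neg, h_neg))
        b_vis = b_vis + alpha * (x - x_neg)
        b_hid = b_hid + alpha * (h_pos - h_neg)
        return W, b_vis, b_hid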

  42–43. Score matching for energy models

Hyvärinen (2005) proposed an alternative way to avoid the partition function. The score function is the gradient of the log likelihood with respect to the data:
$\psi(x; \theta) = \frac{\partial}{\partial x} \ln P(x \mid \theta) = \frac{\partial}{\partial x}\big(f_\theta(x) - \ln Z_\theta\big) = \frac{\partial}{\partial x} f_\theta(x)$
Fitting a score function:
▶ Given observed data $\{x_n\}_{n=1}^N$, construct a density estimate $p_{\text{data}}(x)$.
▶ Denote the "empirical score function" of this density estimate as $\psi_{\text{data}}(x)$.
▶ The model and empirical score functions should be similar:
$J(\theta) = \frac{1}{2} \int p_{\text{data}}(x)\, \|\psi(x;\theta) - \psi_{\text{data}}(x)\|^2\, dx$

  44–45. Score matching for energy models

Hyvärinen (2005) showed that this objective can be simplified:
$J(\theta) = \frac{1}{2} \int p_{\text{data}}(x)\, \|\psi(x;\theta) - \psi_{\text{data}}(x)\|^2\, dx = \int p_{\text{data}}(x) \left[ \nabla^\top \psi(x;\theta) + \tfrac{1}{2} \|\psi(x;\theta)\|^2 \right] dx + \text{const}$
We don't actually need $\psi_{\text{data}}(x)$ and can use the raw empirical $p_{\text{data}}(x)$:
$\tilde{J}(\theta) = \frac{1}{N} \sum_{n=1}^N \left[ \nabla^\top \psi(x_n;\theta) + \tfrac{1}{2} \|\psi(x_n;\theta)\|^2 \right]$
If the model is identifiable, $\hat{\theta} = \arg\min_\theta \tilde{J}(\theta)$ is a consistent estimator.
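The simplified objective is convenient with automatic differentiation: $\psi$ is a gradient, and $\nabla^\top \psi$ is the trace of its Jacobian. A minimal JAX sketch, assuming $f$ maps a single input vector to a scalar:

    import jax
    import jax.numpy as jnp

    def score_matching_loss(f, xs):
        """(1/N) sum_n [ div psi(x_n) + 0.5 ||psi(x_n)||^2 ],
        where psi(x) = grad_x f(x) is the model score."""
        psi = jax.grad(f)
        def per_example(x):
            div = jnp.trace(jax.jacfwd(psi)(x))   # divergence of the score
            return div + 0.5 * jnp.sum(psi(x) ** 2)
        return jnp.mean(jax.vmap(per_example)(xs))

    # Example: f(x) = -0.5 ||x||^2, a standard normal up to constants.
    loss = score_matching_loss(lambda x: -0.5 * jnp.sum(x ** 2),
                               jnp.ones((10, 3)))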

  46. Tutorial Outline

▶ What is generative modeling?
▶ Recipes for flexible generative models
▶ Algorithms for learning generative models from data
▶ Variational autoencoder
▶ Combining graphical models and neural networks

  47. A closer look at the variational autoencoder

Consider a latent variable model that combines Recipes 1 and 2.
Basic VAE generative model (Kingma and Welling, 2014):
▶ Spherical Gaussian latent variable: $z \sim \mathcal{N}(0, I)$
▶ Transform with a neural network to parameterize another Gaussian: $x \mid z, \theta \sim \mathcal{N}(\mu_\theta(z), \Sigma_\theta(z))$
Given some data $\{x_n\}_{n=1}^N$, maximize the likelihood with respect to $\theta$:
$\theta^\star = \arg\max_\theta \sum_{n=1}^N \ln \int \mathcal{N}(x_n \mid \mu_\theta(z_n), \Sigma_\theta(z_n))\, \mathcal{N}(z_n \mid 0, I)\, dz_n$
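Sampling from this generative model is two lines; a sketch with hypothetical decoder networks returning a mean and a diagonal standard deviation:

    import numpy as np

    def sample_vae(decoder_mu, decoder_sigma, latent_dim, rng=None):
        """z ~ N(0, I); x | z ~ N(mu_theta(z), diag(sigma_theta(z)^2))."""
        rng = np.random.default_rng() if rng is None else rng
        z = rng.standard_normal(latent_dim)
        mu, sigma = decoder_mu(z), decoder_sigma(z)   # hypothetical networks
        return mu + sigma * rng.standard_normal(mu.shape)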

  48. Variational autoencoder

$z \sim \mathcal{N}(0, I), \quad x \mid z, \theta \sim \mathcal{N}(\mu_\theta(z), \Sigma_\theta(z))$
(Figure credit: OpenAI blog post on generative models)

  49. Learning the VAE model with mean-field

We want to solve this:
$\theta^\star = \arg\max_\theta \sum_{n=1}^N \ln \int \mathcal{N}(x_n \mid \mu_\theta(z_n), \Sigma_\theta(z_n))\, \mathcal{N}(z_n \mid 0, I)\, dz_n$
▶ Have to estimate the $z_n$ associated with each $x_n$.
▶ Can't use vanilla EM because $P(z_n \mid x_n, \theta)$ is complicated.
▶ Approximate $P(z_n \mid x_n, \theta)$ with $\mathcal{N}(z_n \mid m_n, V_n)$.
▶ Compute the evidence lower bound using Jensen's inequality:
$\ln \int \mathcal{N}(x_n \mid \mu_\theta(z_n), \Sigma_\theta(z_n))\, \mathcal{N}(z_n \mid 0, I)\, dz_n \ge \int \mathcal{N}(z_n \mid m_n, V_n) \ln \frac{\mathcal{N}(x_n \mid \mu_\theta(z_n), \Sigma_\theta(z_n))\, \mathcal{N}(z_n \mid 0, I)}{\mathcal{N}(z_n \mid m_n, V_n)}\, dz_n$

  50. Maximize the VAE mean-field objective directly?

We could try to maximize this objective directly:
$\mathcal{L}(\theta, \{m_n, V_n\}_{n=1}^N) = \sum_{n=1}^N \int \mathcal{N}(z_n \mid m_n, V_n) \ln \frac{\mathcal{N}(x_n \mid \mu_\theta(z_n), \Sigma_\theta(z_n))\, \mathcal{N}(z_n \mid 0, I)}{\mathcal{N}(z_n \mid m_n, V_n)}\, dz_n$
$= \sum_{n=1}^N \left[ \int \mathcal{N}(z_n \mid m_n, V_n) \ln \mathcal{N}(x_n \mid \mu_\theta(z_n), \Sigma_\theta(z_n))\, dz_n + \int \mathcal{N}(z_n \mid m_n, V_n) \ln \frac{\mathcal{N}(z_n \mid 0, I)}{\mathcal{N}(z_n \mid m_n, V_n)}\, dz_n \right]$
$= \sum_{n=1}^N \Big( \mathbb{E}_{z_n \mid m_n, V_n}\big[\ln \mathcal{N}(x_n \mid \mu_\theta(z_n), \Sigma_\theta(z_n))\big] - \text{KL}\big[\mathcal{N}(z_n \mid m_n, V_n)\, \|\, \mathcal{N}(z_n \mid 0, I)\big] \Big)$

  51. Maximize the VAE mean-field objective directly?

We could try to maximize this objective directly:
$\sum_{n=1}^N \underbrace{\mathbb{E}_{z_n \mid m_n, V_n}\big[\ln \mathcal{N}(x_n \mid \mu_\theta(z_n), \Sigma_\theta(z_n))\big]}_{\text{expected complete-data log likelihood}} - \underbrace{\text{KL}\big[\mathcal{N}(z_n \mid m_n, V_n)\, \|\, \mathcal{N}(z_n \mid 0, I)\big]}_{\text{difference between approximation and prior (easy)}}$
Annoying because:
▶ The number of optimized dimensions scales with $N$.
▶ We have to perform an optimization to make an out-of-sample inference.
▶ Computing the expected complete-data log likelihood looks hard.

  52. Maximize the VAE mean-field objective directly?

Zooming in on the expected complete-data log likelihood:
$\mathbb{E}_{z_n \mid m_n, V_n}\big[\ln \mathcal{N}(x_n \mid \mu_\theta(z_n), \Sigma_\theta(z_n))\big] = \int \mathcal{N}(z_n \mid m_n, V_n) \ln \mathcal{N}(x_n \mid \mu_\theta(z_n), \Sigma_\theta(z_n))\, dz_n$
Can we just draw $z_n^{(m)} \sim \mathcal{N}(z_n \mid m_n, V_n)$ and use Monte Carlo?
$\mathbb{E}_{z_n \mid m_n, V_n}\big[\ln \mathcal{N}(x_n \mid \mu_\theta(z_n), \Sigma_\theta(z_n))\big] \approx \frac{1}{M} \sum_{m=1}^M \ln \mathcal{N}(x_n \mid \mu_\theta(z_n^{(m)}), \Sigma_\theta(z_n^{(m)}))$
▶ Gradient with respect to $\theta$? No problem.
▶ Gradient with respect to $m_n$ and $V_n$? Where did they go?!
Kingma and Welling (2014) suggested a clever trick.

  53. The Reparameterization Trick

The reparameterization trick addresses the following general situation:
$\nabla_\alpha\, \mathbb{E}_{z \sim \pi_\alpha}[f(z)] = \nabla_\alpha \int \pi_\alpha(z) f(z)\, dz$
Here the parameter $\alpha$ governs the distribution under which the expectation is being taken. If we sample $z_m \sim \pi_\alpha$, we get something non-differentiable in $\alpha$:
$\nabla_\alpha \left[ \frac{1}{M} \sum_{m=1}^M f(z_m) \right]$

  54. The Reparameterization Trick

We can simulate from many "standard" parametric distributions via a differentiable parametric transformation of a fixed distribution.³ Examples:
▶ univariate Gaussian: $w \sim \mathcal{N}(0, 1) \Rightarrow a w + b \sim \mathcal{N}(b, a^2)$
▶ multivariate Gaussian: $w \sim \mathcal{N}(0, I) \Rightarrow A w + b \sim \mathcal{N}(b, A A^\top)$
▶ exponential: $w \sim \mathcal{U}(0, 1) \Rightarrow -\ln(w)/\lambda \sim \text{Exp}(\lambda)$
▶ gamma: $w \sim \text{Gamma}(k, 1) \Rightarrow a w \sim \text{Gamma}(k, a)$
Reparameterize the integral using the simple fixed distribution $\rho(w)$ and an $\alpha$-parameterized transformation:
$\nabla_\alpha\, \mathbb{E}_{z \sim \pi_\alpha}[f(z)] = \nabla_\alpha \int \pi_\alpha(z) f(z)\, dz = \nabla_\alpha \int \rho(w)\, f(t_\alpha(w))\, dw$
³ Essentially anything with a reasonable quantile function.

  55. The Reparameterization Trick

Reparameterize the integral using the simple fixed distribution $\rho(w)$ and an $\alpha$-parameterized transformation:
$\nabla_\alpha\, \mathbb{E}_{z \sim \pi_\alpha}[f(z)] = \nabla_\alpha \int \pi_\alpha(z) f(z)\, dz = \nabla_\alpha \int \rho(w)\, f(t_\alpha(w))\, dw$
Draw $w_m \sim \rho(w)$, and now Monte Carlo plays nicely with differentiation, since the gradient can move inside the $\alpha$-free expectation:
$\nabla_\alpha \int \rho(w)\, f(t_\alpha(w))\, dw = \int \rho(w)\, \nabla_\alpha f(t_\alpha(w))\, dw \approx \frac{1}{M} \sum_{m=1}^M \nabla_\alpha f(t_\alpha(w_m))$
Shakir Mohamed has a very nice blog post discussing this trick (Mohamed, 2015).
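A minimal sketch of the trick in JAX for a Gaussian $\pi_\alpha$ with $\alpha = (\mu, \sigma)$; the randomness $w$ is drawn once, outside the function being differentiated:

    import jax
    import jax.numpy as jnp

    def reparam_grad(f, alpha, key, M=1000):
        """Monte Carlo gradient of E_{z ~ N(mu, sigma^2)}[f(z)]
        via z = t_alpha(w) = mu + sigma * w, with w ~ N(0, 1)."""
        w = jax.random.normal(key, (M,))
        def objective(a):
            mu, sigma = a[0], a[1]
            z = mu + sigma * w            # differentiable in alpha
            return jnp.mean(jax.vmap(f)(z))
        return jax.grad(objective)(alpha)

    # Check: for f(z) = z^2, E[z^2] = mu^2 + sigma^2, so the gradient
    # should be approximately (2 mu, 2 sigma).
    g = reparam_grad(lambda z: z ** 2, jnp.array([1.0, 2.0]),
                     jax.random.PRNGKey(0))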

  56. Reparameterization and the VAE

Draw a set of $\epsilon_n^{(m)} \sim \mathcal{N}(0, I)$ and parameterize via $W_n$ such that $W_n W_n^\top = V_n$:
$\mathbb{E}_{z_n \mid m_n, V_n}\big[\ln \mathcal{N}(x_n \mid \mu_\theta(z_n), \Sigma_\theta(z_n))\big] = \int \mathcal{N}(\epsilon_n \mid 0, I) \ln \mathcal{N}(x_n \mid \mu_\theta(W_n \epsilon_n + m_n), \Sigma_\theta(W_n \epsilon_n + m_n))\, d\epsilon_n$
$\approx \frac{1}{M} \sum_{m=1}^M \ln \mathcal{N}(x_n \mid \mu_\theta(W_n \epsilon_n^{(m)} + m_n), \Sigma_\theta(W_n \epsilon_n^{(m)} + m_n))$
Now it is possible to differentiate with respect to $m_n$ and $W_n$.


  57–61. Amortizing Inference in the VAE

Recall that there were several annoying things about mean-field VI in our model:
▶ The number of optimized dimensions scales with $N$.
▶ We have to perform an optimization to make an out-of-sample inference.
▶ Computing the expected complete-data log likelihood looks hard.
Can we just look at a datum and guess its variational parameters? Anybody have any good function approximators lying around?

  62. Amortizing Inference in the VAE

Throw away all of the per-datum variational parameters $\{m_n, V_n\}_{n=1}^N$. Replace them with parametric functions that see the input: $m_\gamma(x)$ and $V_\gamma(x)$. Rederive the lower bound with $\gamma$ instead of $\{m_n, V_n\}_{n=1}^N$:
$\mathcal{L}(\theta, \gamma) = \sum_{n=1}^N \mathbb{E}_{z_n \mid x_n, \gamma}\big[\ln \mathcal{N}(x_n \mid \mu_\theta(z_n), \Sigma_\theta(z_n))\big] - \text{KL}\big[\mathcal{N}(z_n \mid m_\gamma(x_n), V_\gamma(x_n))\, \|\, \mathcal{N}(z_n \mid 0, I)\big]$
We can now do mini-batch stochastic optimization without local variables. Amortized: pay up front and then use it cheaply (Gershman and Goodman, 2014).
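Putting the pieces together, a single-datum sketch of $\mathcal{L}(\theta, \gamma)$ with a diagonal-Gaussian encoder; `encode`, `decode_mu`, and `decode_logvar` are hypothetical network functions standing in for the slides' $m_\gamma$, $V_\gamma$, $\mu_\theta$, and $\Sigma_\theta$:

    import jax
    import jax.numpy as jnp

    def elbo(theta, gamma, x, key, encode, decode_mu, decode_logvar, M=1):
        """Amortized ELBO for one datum: reparameterized expected
        log-likelihood minus an analytic Gaussian KL term."""
        m, logv = encode(gamma, x)             # m_gamma(x), ln diag V_gamma(x)
        eps = jax.random.normal(key, (M, m.size))
        z = m + jnp.exp(0.5 * logv) * eps      # reparameterized samples
        def log_lik(z_m):
            mu, lv = decode_mu(theta, z_m), decode_logvar(theta, z_m)
            return -0.5 * jnp.sum(lv + (x - mu) ** 2 / jnp.exp(lv)
                                  + jnp.log(2.0 * jnp.pi))
        # Analytic KL between N(m, diag(exp(logv))) and N(0, I).
        kl = 0.5 * jnp.sum(jnp.exp(logv) + m ** 2 - 1.0 - logv)
        return jnp.mean(jax.vmap(log_lik)(z)) - kl

    # jax.grad(elbo, argnums=(0, 1)) then trains theta and gamma jointly.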

  63–64. What does this have to do with autoencoders?

▶ encoder = "recognition network" = amortized inference: takes data and maps it to (a distribution over) a latent representation.
▶ decoder = likelihood = generative model: takes a latent representation and produces data.
(Figure: encoder network feeding a latent code into a decoder network.)

  65. Importance Weighted Autoencoder (Burda et al., 2016)

$\ln P(x \mid \theta) = \ln \int P(x, z \mid \theta)\, dz = \ln \int q(z)\, \frac{P(x, z \mid \theta)}{q(z)}\, dz \ge \int q(z) \ln \frac{P(x, z \mid \theta)}{q(z)}\, dz$
Rather than using a single $z$, compute the ELBO with multiple $z$:
$\ln P(x \mid \theta) = \ln \int q(z^{(1)})\, q(z^{(2)}) \left[ \frac{P(x, z^{(1)} \mid \theta)}{2\, q(z^{(1)})} + \frac{P(x, z^{(2)} \mid \theta)}{2\, q(z^{(2)})} \right] dz^{(1)}\, dz^{(2)} \ge \int q(z^{(1)})\, q(z^{(2)}) \ln \left[ \frac{P(x, z^{(1)} \mid \theta)}{2\, q(z^{(1)})} + \frac{P(x, z^{(2)} \mid \theta)}{2\, q(z^{(2)})} \right] dz^{(1)}\, dz^{(2)}$
More generally, allow for $K$ "importance samples":
$\ln P(x \mid \theta) \ge \mathbb{E}_{z^{(1)}, \dots, z^{(K)} \sim q(z)} \left[ \ln \frac{1}{K} \sum_{k=1}^K \frac{P(x, z^{(k)} \mid \theta)}{q(z^{(k)})} \right]$
All else being equal, bigger $K$ leads to a tighter bound.
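A sketch of the $K$-sample bound as a single Monte Carlo estimate (a log-mean-exp of importance weights); `log_joint`, `log_q`, and `sample_q` are hypothetical stand-ins for $P(x, z \mid \theta)$ and $q$:

    import numpy as np

    def iwae_bound(x, log_joint, log_q, sample_q, K=50):
        """One draw of ln (1/K) sum_k P(x, z_k | theta) / q(z_k), z_k ~ q."""
        logw = np.array([log_joint(x, z) - log_q(z) for z in sample_q(K)])
        a = logw.max()
        return a + np.log(np.mean(np.exp(logw - a)))   # log-mean-exp

    # K = 1 recovers the standard ELBO estimator; larger K tightens the bound.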

  66. Tutorial Outline

▶ What is generative modeling?
▶ Recipes for flexible generative models
▶ Algorithms for learning generative models from data
▶ Variational autoencoder
▶ Combining graphical models and neural networks
