
A Tutorial on Deep Probabilistic Generative Models
Ryan P. Adams, Princeton University
Machine Learning Summer School, Buenos Aires, Argentina, June 2018
lips.cs.princeton.edu / @ryan_p_adams

Tutorial Outline
▶ What is generative modeling?
▶ Recipes for flexible generative models
▶ Algorithms for learning generative models from data
▶ Variational autoencoder
▶ Combining graphical models and neural networks


  1–7. Recipe 3: Specify a log density directly

Markov chain Monte Carlo (MCMC):
▶ Random walk that converges to $g_\theta(x) = \frac{1}{Z_\theta} \exp\{f_\theta(x)\}$.
▶ Uses a stochastic operator $T(x' \leftarrow x)$.
▶ Ergodic and leaves $g_\theta(x)$ invariant: $g_\theta(x) = \int g_\theta(x')\, T(x \leftarrow x')\, dx'$.
▶ Several common recipes: Metropolis–Hastings, Gibbs sampling, slice sampling, Hamiltonian Monte Carlo.

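To make the recipe concrete, here is a minimal random-walk Metropolis sketch in Python/NumPy (not from the slides; the target density and step size are illustrative). It only needs the unnormalized log density $f_\theta(x)$, because $Z_\theta$ cancels in the acceptance ratio.

    import numpy as np

    def metropolis(log_f, x0, n_steps, step_size=0.5, seed=0):
        """Random-walk Metropolis targeting g(x) proportional to exp{f(x)}.
        log_f need only be the *unnormalized* log density f_theta."""
        rng = np.random.default_rng(seed)
        x = np.atleast_1d(np.asarray(x0, dtype=float))
        log_fx = log_f(x)
        samples = np.empty((n_steps, x.size))
        for t in range(n_steps):
            # Symmetric Gaussian proposal, so the Hastings correction is 1.
            x_prop = x + step_size * rng.standard_normal(x.size)
            log_fprop = log_f(x_prop)
            # Accept with probability min(1, exp{f(x') - f(x)}).
            if np.log(rng.uniform()) < log_fprop - log_fx:
                x, log_fx = x_prop, log_fprop
            samples[t] = x
        return samples

    # Example: a bimodal 1-D target with f(x) = -x^4 + 3x^2.
    samples = metropolis(lambda x: float(np.sum(-x**4 + 3 * x**2)),
                         x0=0.0, n_steps=5000)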

  8. Recipe 3: Specify a log density directly

Example: Ising model
▶ Classic model of ferromagnetism with binary "spins".
▶ Influential in computer vision.
▶ Unary and pairwise potentials in the energy:
$E(x) = -f_\theta(x) = -\sum_{ij} \theta_{ij}\, x_i x_j - \sum_i \theta_i\, x_i$
(Figure credit: Kai Zhang, Columbia)
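As a sanity check on the notation, the Ising energy is a one-liner; a sketch with made-up parameter arrays:

    import numpy as np

    def ising_energy(x, theta_pair, theta_unary):
        """E(x) = -sum_ij theta_ij x_i x_j - sum_i theta_i x_i.
        x: (D,) spin vector; theta_pair: (D, D) pairwise couplings;
        theta_unary: (D,) unary potentials."""
        return -x @ theta_pair @ x - theta_unary @ x

    # Log density up to the (intractable) partition function:
    # ln g_theta(x) = -ising_energy(x, ...) - ln Z_theta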

  9. Recipe 3: Specify a log density directly

Example: Restricted Boltzmann Machine (Smolensky, 1986; Freund and Haussler, 1992)
▶ Special case of the Ising model.
▶ Bipartite: hidden and visible layers.
▶ Fully connected between layers.
▶ Typically trained with contrastive divergence.
(Figure: hidden units above visible units. Credit: Tieleman, 2008)

  10. Recipe 3: Specify a log density directly

Example: Deep Boltzmann Machines (Salakhutdinov and Hinton, 2009)
▶ Special case of the Ising model.
▶ $k$-partite: hidden and visible layers.
▶ Fully connected between layers.
(Figure credit: Salakhutdinov and Hinton, 2009)

  11. Tutorial Outline

▶ What is generative modeling?
▶ Recipes for flexible generative models
▶ Algorithms for learning generative models from data
▶ Variational autoencoder
▶ Combining graphical models and neural networks

  12. Inductive principles for flexible generative models

We observe $N$ data $\{x_n\}_{n=1}^N$; how do we fit the parameters $\theta$?
▶ Penalized maximum likelihood
▶ Computing a Bayesian posterior
▶ Score matching (Hyvärinen, 2005)
▶ Moment matching (e.g., Li et al., 2015)
▶ Maximum mean discrepancy (Gretton et al., 2012; Dziugaite et al., 2015)
▶ Pseudo-likelihood

  13. MLE for invertible transformations

When $f_\theta(\cdot)$ is bijective, things are easy to reason about:
$\ln P(\{x_n\}_{n=1}^N \mid \theta) = \sum_{n=1}^N \ln \pi(f_\theta^{-1}(x_n)) + \ln \big|J[f_\theta^{-1}(x_n)]\big|$
▶ Just use automatic differentiation to get gradients (see the sketch below).
▶ Note: we need to differentiate through the log-determinant of the Jacobian.
▶ The matrix $J[f_\theta^{-1}(x_n)]$ may become nearly singular during training, causing numerical issues; see Rippel and Adams (2013) for a discussion.
▶ Real NVP (Dinh et al., 2016) parameterizes the transformation so that the Jacobian determinant is easy to compute.
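The bijective case can be written down end to end. Here is a minimal sketch using JAX for the automatic differentiation mentioned above, with a hypothetical affine bijection standing in for $f_\theta$ and a standard normal base density $\pi$:

    import jax
    import jax.numpy as jnp

    def flow_log_prob(f_inv, x):
        """ln p(x) = ln pi(f_inv(x)) + ln |det J[f_inv](x)|,
        the change-of-variables formula for a bijective f."""
        z = f_inv(x)
        log_pi = -0.5 * (z @ z + z.size * jnp.log(2.0 * jnp.pi))  # standard normal
        J = jax.jacfwd(f_inv)(x)               # Jacobian of the inverse map
        _, logdet = jnp.linalg.slogdet(J)      # numerically stable log |det J|
        return log_pi + logdet

    # Hypothetical invertible map f(z) = A z + b, so f_inv(x) = A^{-1}(x - b).
    A = jnp.array([[2.0, 0.3], [0.0, 1.5]])
    b = jnp.array([1.0, -1.0])
    f_inv = lambda x: jnp.linalg.solve(A, x - b)
    logp = flow_log_prob(f_inv, jnp.ones(2))
    # Gradients with respect to parameters then come for free from jax.grad.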

  14. MLE for non-invertible transformations

$f_\theta(\cdot)$ non-surjective: some data have zero probability, i.e., infinite log loss. $f_\theta(\cdot)$ non-injective: data have multiple latent values. In general, you have to sum over all the ways you could have gotten each $x_n$:
$\ln P(\{x_n\}_{n=1}^N \mid \theta) = \sum_{n=1}^N \ln \int_{\{z : f_\theta(z) = x_n\}} \frac{\pi(z)}{|J[f_\theta(z)]|}\, dz$
Non-surjective $f_\theta(\cdot)$ means that the pre-image of $x_n$ could be empty, i.e., $\{z : f_\theta(z) = x_n\} = \emptyset$.


  15–22. Statistical tests for non-invertible transformations

Sometimes you don't care about the density or the latent representation; you just want fantasy data that has the same statistical properties as the real data.
1. Cook up a function $h$ that takes an $x$ and produces a scalar.
2. Transform some real data with $h$ and get the empirical distribution.
3. Transform some fantasy data with $h$ and get the empirical distribution.
4. Use your favorite two-sample test to compare the distributions.
5. Search for an $f_\theta(\cdot)$ that passes the test for many $h$ in a big set $\mathcal{H}$.
A nice kernel formalism for constructing tests and $\mathcal{H}$ is maximum mean discrepancy (Gretton et al., 2012); see also Dziugaite et al. (2015) and Huszár (2015). You could also parameterize and learn the test with a generative adversarial network (Goodfellow et al., 2014). David Warde-Farley will talk about GANs next week.
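As one concrete instance of such a test, here is a sketch of the (biased, V-statistic) squared-MMD estimate with an RBF kernel; the bandwidth default is made up:

    import numpy as np

    def mmd2_rbf(X, Y, bandwidth=1.0):
        """Biased V-statistic estimate of squared MMD between samples
        X (real data, shape (N, D)) and Y (fantasy data, shape (M, D))."""
        def k(A, B):
            d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
            return np.exp(-d2 / (2.0 * bandwidth ** 2))
        return k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()

    # A small MMD means the fantasy data passed this family of tests.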

  23. MLE for latent variable models

Like the non-injective transformation case, the mixing case requires integrating over latent hypotheses:
$\ln P(\{x_n\}_{n=1}^N \mid \theta) = \sum_{n=1}^N \ln \int P(x_n, z_n \mid \theta)\, dz_n = \sum_{n=1}^N \ln \int P(x_n \mid z_n, \theta)\, P(z_n \mid \theta)\, dz_n$
Generally, four ways to do this kind of integral in ML (see the Monte Carlo sketch below):
▶ Summation: easy to do expectation maximization with discrete latent variables.
▶ Quadrature: good rates in low dimensions, but bad in high dimensions.
▶ Monte Carlo: approximate the integral with a sample mean.
▶ Variational methods: approximate pieces with more tractable distributions.
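For the Monte Carlo route in the list above, a naive (often high-variance) sketch: draw $z$ from the prior and average the likelihood, working in log space for numerical stability. The function arguments are hypothetical stand-ins for the model.

    import numpy as np

    def log_marginal_mc(x, log_lik, sample_prior, M=1000):
        """ln P(x | theta) approx ln [ (1/M) sum_m P(x | z_m, theta) ],
        with z_m drawn from the prior P(z | theta)."""
        logp = np.array([log_lik(x, z) for z in sample_prior(M)])
        a = logp.max()                       # log-sum-exp trick
        return a + np.log(np.mean(np.exp(logp - a)))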

  24. MLE with latent variables: expectation maximization

Initialize $\theta^{(0)}$ to a reasonable starting point, then iterate:
▶ E-step: compute the expected complete-data log likelihood under $\theta^{(t)}$:
$Q(\theta \mid \theta^{(t)}) = \sum_{n=1}^N \mathbb{E}_{z_n \mid x_n, \theta^{(t)}}\big[\ln P(x_n, z_n \mid \theta)\big]$
▶ M-step: maximize this expected log likelihood with respect to $\theta$:
$\theta^{(t+1)} = \arg\max_\theta Q(\theta \mid \theta^{(t)})$
That expectation may be just as hard as the marginal likelihood, however.

  25. MLE for latent variable models: Monte Carlo EM

One approach to the integral is to use Monte Carlo. Recall:
$\int \pi(z) f(z)\, dz = \mathbb{E}[f(z)] \approx \frac{1}{M}\sum_{m=1}^M f(z^{(m)}), \quad \text{where } z^{(m)} \sim \pi$
Initialize $\theta^{(0)}$ to a reasonable starting point, then iterate:
▶ E-step: compute the expected complete-data log likelihood under $\theta^{(t)}$, using $M$ samples from the conditional on $z_n$:
$Q(\theta \mid \theta^{(t)}) = \frac{1}{M} \sum_{n=1}^N \sum_{m=1}^M \ln P(x_n, z_n^{(m)} \mid \theta)$
▶ M-step: maximize this expected log likelihood with respect to $\theta$:
$\theta^{(t+1)} = \arg\max_\theta Q(\theta \mid \theta^{(t)})$

  26. MLE for latent variable models: Variational EM

Introduce a tractable (typically factored) distribution family on the $\{z_n\}_{n=1}^N$:
$q_\gamma(\{z_n\}_{n=1}^N) = \prod_{n=1}^N q_{\gamma_n}(z_n)$
Jensen's inequality lets us lower-bound the marginal likelihood:
$\ln \int q_{\gamma_n}(z_n)\, \frac{P(x_n, z_n \mid \theta)}{q_{\gamma_n}(z_n)}\, dz_n \ge \int q_{\gamma_n}(z_n) \ln \frac{P(x_n, z_n \mid \theta)}{q_{\gamma_n}(z_n)}\, dz_n$
Alternate between maximizing with respect to $\gamma$ and $\theta$. If the $q_{\gamma_n}(z_n)$ family contains $P(z_n \mid x_n, \theta)$, then it's just regular EM. If not, then it provides a coherent way to approximate the difficult expectation. More on this later when we discuss variational autoencoders in detail.

  27. MLE for energy models

"Energy models" specify the density directly via its log:
$g_\theta(x) = \frac{1}{Z_\theta} \exp\{f_\theta(x)\}, \quad Z_\theta = \int \exp\{f_\theta(x)\}\, dx$
We generally can't compute the partition function $Z_\theta$:
$\ln P(\{x_n\}_{n=1}^N \mid \theta) = \left[\sum_{n=1}^N f_\theta(x_n)\right] - N \ln Z_\theta$
You really do have to account for the partition function in learning: $Z_\theta$ prevents the model from assigning high probability everywhere!


  28–32. MLE for energy models: contrastive divergence

$\frac{\partial}{\partial\theta} \ln P(\{x_n\}_{n=1}^N \mid \theta) = \left[\sum_{n=1}^N \frac{\partial}{\partial\theta} f_\theta(x_n)\right] - N \frac{\partial}{\partial\theta} \ln \int \exp\{f_\theta(x)\}\, dx$
$= \left[\sum_{n=1}^N \frac{\partial}{\partial\theta} f_\theta(x_n)\right] - N \left(\int \exp\{f_\theta(x)\}\, dx\right)^{-1} \frac{\partial}{\partial\theta} \int \exp\{f_\theta(x)\}\, dx$
$= \left[\sum_{n=1}^N \frac{\partial}{\partial\theta} f_\theta(x_n)\right] - N \frac{1}{Z_\theta} \frac{\partial}{\partial\theta} \int \exp\{f_\theta(x)\}\, dx$
$= \left[\sum_{n=1}^N \frac{\partial}{\partial\theta} f_\theta(x_n)\right] - N \int \frac{\exp\{f_\theta(x)\}}{Z_\theta}\, \frac{\partial}{\partial\theta} f_\theta(x)\, dx$
$= N \left( \mathbb{E}_{\text{data}}\!\left[\frac{\partial}{\partial\theta} f_\theta(x)\right] - \mathbb{E}_{\text{model}}\!\left[\frac{\partial}{\partial\theta} f_\theta(x)\right] \right)$


  33–37. MLE for energy models: contrastive divergence

The gradient is the difference between two expectations:
$\frac{1}{N} \frac{\partial}{\partial\theta} \ln P(\{x_n\}_{n=1}^N \mid \theta) = \mathbb{E}_{\text{data}}\!\left[\frac{\partial}{\partial\theta} f_\theta(x)\right] - \mathbb{E}_{\text{model}}\!\left[\frac{\partial}{\partial\theta} f_\theta(x)\right]$
▶ Use Monte Carlo for the second term by generating fantasy data?
▶ Bad news: generating data is hard; we have to use Markov chain Monte Carlo.
▶ Contrastive divergence: start at one of the data points and run $K$ steps of MCMC (Hinton, 2002). For RBMs: good features, bad densities.
▶ Persistent contrastive divergence: don't restart the Markov chain between updates (Tieleman, 2008); often does better.

  38. Training a binary RBM with CD

▶ Binary data $x \in \{0,1\}^D$.
▶ Binary hidden units $h \in \{0,1\}^J$.
▶ Parameters: weight matrix $W \in \mathbb{R}^{D \times J}$, biases $b_{\text{vis}} \in \mathbb{R}^D$ and $b_{\text{hid}} \in \mathbb{R}^J$.
▶ Energy function:
$E(x, h; W, b_{\text{vis}}, b_{\text{hid}}) = -x^\top W h - x^\top b_{\text{vis}} - h^\top b_{\text{hid}}$
▶ Hidden given visible (independent Bernoullis):
$P(h_j = 1 \mid x, W, b_{\text{hid}}) = \frac{1}{1 + \exp\{-(W^\top x + b_{\text{hid}})_j\}}, \quad j = 1, \dots, J$
▶ Visible given hidden:
$P(x_d = 1 \mid h, W, b_{\text{vis}}) = \frac{1}{1 + \exp\{-(W h + b_{\text{vis}})_d\}}, \quad d = 1, \dots, D$

  39. Training a binary RBM with CD

(Figure: bipartite graph of hidden and visible units.)
The bipartite structure of the RBM makes Gibbs sampling easy.

  40. Training a binary RBM with CD

(Figure: alternating Gibbs updates between hidden and visible layers.)
Contrastive divergence: start at the data and Gibbs sample $K$ times.

  41. Training a binary RBM with CD

1: Input: parameters $W$, $b_{\text{vis}}$, $b_{\text{hid}}$; datum $x \in \{0,1\}^D$; learning rate $\alpha > 0$
2: Output: updated parameters $W'$, $b'_{\text{vis}}$, $b'_{\text{hid}}$
3: $h_{\text{pos}} \sim h \mid x, W, b_{\text{hid}}$  ▷ Sample hiddens given visibles.
4: $h_{\text{neg}} \leftarrow h_{\text{pos}}$  ▷ Initialize negative hiddens.
5: for $t = 1 \dots K$ do
6:   $x_{\text{neg}} \sim x \mid h_{\text{neg}}, W, b_{\text{vis}}$  ▷ Sample fantasy data.
7:   $h_{\text{neg}} \sim h \mid x_{\text{neg}}, W, b_{\text{hid}}$  ▷ Sample hiddens for fantasy data.
8: end for
9: $W' \leftarrow W + \alpha\, (x\, h_{\text{pos}}^\top - x_{\text{neg}}\, h_{\text{neg}}^\top)$  ▷ Approximate stochastic gradient update.
10: $b'_{\text{vis}} \leftarrow b_{\text{vis}} + \alpha\, (x - x_{\text{neg}})$
11: $b'_{\text{hid}} \leftarrow b_{\text{hid}} + \alpha\, (h_{\text{pos}} - h_{\text{neg}})$
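A direct NumPy translation of the pseudocode above, as a sketch (assuming a single datum per update; in practice one averages the update over a mini-batch):

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def cd_k_update(x, W, b_vis, b_hid, K=1, alpha=0.01, rng=None):
        """One CD-K update for a binary RBM.
        x: (D,) binary datum; W: (D, J); b_vis: (D,); b_hid: (J,)."""
        rng = np.random.default_rng() if rng is None else rng
        # Lines 3-4: positive-phase hiddens; initialize the negative chain.
        h_pos = (rng.uniform(size=b_hid.shape) < sigmoid(W.T @ x + b_hid)).astype(float)
        h_neg = h_pos.copy()
        # Lines 5-8: K steps of alternating Gibbs sampling.
        for _ in range(K):
            x_neg = (rng.uniform(size=b_vis.shape) < sigmoid(W @ h_neg + b_vis)).astype(float)
            h_neg = (rng.uniform(size=b_hid.shape) < sigmoid(W.T @ x_neg + b_hid)).astype(float)
        # Lines 9-11: approximate stochastic gradient step.
        W = W + alpha * (np.outer(x, h_pos) - np.outer(x_neg, h_neg))
        b_vis = b_vis + alpha * (x - x_neg)
        b_hid = b_hid + alpha * (h_pos - h_neg)
        return W, b_vis, b_hid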

  42–43. Score matching for energy models

Hyvärinen (2005) proposed an alternative way to avoid the partition function. The score function is the gradient of the log likelihood with respect to the data:
$\psi(x; \theta) = \frac{\partial}{\partial x} \ln P(x \mid \theta) = \frac{\partial}{\partial x}\big(f_\theta(x) - \ln Z_\theta\big) = \frac{\partial}{\partial x} f_\theta(x)$
Fitting a score function:
▶ Given observed data $\{x_n\}_{n=1}^N$, construct a density estimate $p_{\text{data}}(x)$.
▶ Denote the "empirical score function" of this density estimate as $\psi_{\text{data}}(x)$.
▶ The model and empirical score functions should be similar:
$J(\theta) = \frac{1}{2} \int p_{\text{data}}(x)\, \|\psi(x;\theta) - \psi_{\text{data}}(x)\|^2\, dx$

  44–45. Score matching for energy models

Hyvärinen (2005) showed that this objective can be simplified:
$J(\theta) = \frac{1}{2} \int p_{\text{data}}(x)\, \|\psi(x;\theta) - \psi_{\text{data}}(x)\|^2\, dx = \int p_{\text{data}}(x) \left[ \nabla^\top \psi(x;\theta) + \tfrac{1}{2} \|\psi(x;\theta)\|^2 \right] dx + \text{const}$
We don't actually need $\psi_{\text{data}}(x)$ and can use the raw empirical $p_{\text{data}}(x)$:
$\tilde{J}(\theta) = \frac{1}{N} \sum_{n=1}^N \left[ \nabla^\top \psi(x_n;\theta) + \tfrac{1}{2} \|\psi(x_n;\theta)\|^2 \right]$
If the model is identifiable, $\hat{\theta} = \arg\min_\theta \tilde{J}(\theta)$ is a consistent estimator.
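The simplified objective is convenient with automatic differentiation: $\psi$ is a gradient, and $\nabla^\top \psi$ is the trace of its Jacobian. A minimal JAX sketch, assuming $f$ maps a single input vector to a scalar:

    import jax
    import jax.numpy as jnp

    def score_matching_loss(f, xs):
        """(1/N) sum_n [ div psi(x_n) + 0.5 ||psi(x_n)||^2 ],
        where psi(x) = grad_x f(x) is the model score."""
        psi = jax.grad(f)
        def per_example(x):
            div = jnp.trace(jax.jacfwd(psi)(x))   # divergence of the score
            return div + 0.5 * jnp.sum(psi(x) ** 2)
        return jnp.mean(jax.vmap(per_example)(xs))

    # Example: f(x) = -0.5 ||x||^2, a standard normal up to constants.
    loss = score_matching_loss(lambda x: -0.5 * jnp.sum(x ** 2),
                               jnp.ones((10, 3)))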

  46. Tutorial Outline

▶ What is generative modeling?
▶ Recipes for flexible generative models
▶ Algorithms for learning generative models from data
▶ Variational autoencoder
▶ Combining graphical models and neural networks

  47. A closer look at the variational autoencoder

Consider a latent variable model that combines Recipes 1 and 2.
Basic VAE generative model (Kingma and Welling, 2014):
▶ Spherical Gaussian latent variable: $z \sim \mathcal{N}(0, I)$
▶ Transform with a neural network to parameterize another Gaussian: $x \mid z, \theta \sim \mathcal{N}(\mu_\theta(z), \Sigma_\theta(z))$
Given some data $\{x_n\}_{n=1}^N$, maximize the likelihood with respect to $\theta$:
$\theta^\star = \arg\max_\theta \sum_{n=1}^N \ln \int \mathcal{N}(x_n \mid \mu_\theta(z_n), \Sigma_\theta(z_n))\, \mathcal{N}(z_n \mid 0, I)\, dz_n$
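Sampling from this generative model is two lines; a sketch with hypothetical decoder networks returning a mean and a diagonal standard deviation:

    import numpy as np

    def sample_vae(decoder_mu, decoder_sigma, latent_dim, rng=None):
        """z ~ N(0, I); x | z ~ N(mu_theta(z), diag(sigma_theta(z)^2))."""
        rng = np.random.default_rng() if rng is None else rng
        z = rng.standard_normal(latent_dim)
        mu, sigma = decoder_mu(z), decoder_sigma(z)   # hypothetical networks
        return mu + sigma * rng.standard_normal(mu.shape)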

  48. Variational autoencoder

$z \sim \mathcal{N}(0, I), \quad x \mid z, \theta \sim \mathcal{N}(\mu_\theta(z), \Sigma_\theta(z))$
(Figure credit: OpenAI blog post on generative models)

  49. Learning the VAE model with mean-field

We want to solve this:
$\theta^\star = \arg\max_\theta \sum_{n=1}^N \ln \int \mathcal{N}(x_n \mid \mu_\theta(z_n), \Sigma_\theta(z_n))\, \mathcal{N}(z_n \mid 0, I)\, dz_n$
▶ Have to estimate the $z_n$ associated with each $x_n$.
▶ Can't use vanilla EM because $P(z_n \mid x_n, \theta)$ is complicated.
▶ Approximate $P(z_n \mid x_n, \theta)$ with $\mathcal{N}(z_n \mid m_n, V_n)$.
▶ Compute the evidence lower bound using Jensen's inequality:
$\ln \int \mathcal{N}(x_n \mid \mu_\theta(z_n), \Sigma_\theta(z_n))\, \mathcal{N}(z_n \mid 0, I)\, dz_n \ge \int \mathcal{N}(z_n \mid m_n, V_n) \ln \frac{\mathcal{N}(x_n \mid \mu_\theta(z_n), \Sigma_\theta(z_n))\, \mathcal{N}(z_n \mid 0, I)}{\mathcal{N}(z_n \mid m_n, V_n)}\, dz_n$

  50. Maximize the VAE mean-field objective directly?

We could try to maximize this objective directly:
$\mathcal{L}(\theta, \{m_n, V_n\}_{n=1}^N) = \sum_{n=1}^N \int \mathcal{N}(z_n \mid m_n, V_n) \ln \frac{\mathcal{N}(x_n \mid \mu_\theta(z_n), \Sigma_\theta(z_n))\, \mathcal{N}(z_n \mid 0, I)}{\mathcal{N}(z_n \mid m_n, V_n)}\, dz_n$
$= \sum_{n=1}^N \left[ \int \mathcal{N}(z_n \mid m_n, V_n) \ln \mathcal{N}(x_n \mid \mu_\theta(z_n), \Sigma_\theta(z_n))\, dz_n + \int \mathcal{N}(z_n \mid m_n, V_n) \ln \frac{\mathcal{N}(z_n \mid 0, I)}{\mathcal{N}(z_n \mid m_n, V_n)}\, dz_n \right]$
$= \sum_{n=1}^N \Big( \mathbb{E}_{z_n \mid m_n, V_n}\big[\ln \mathcal{N}(x_n \mid \mu_\theta(z_n), \Sigma_\theta(z_n))\big] - \text{KL}\big[\mathcal{N}(z_n \mid m_n, V_n)\, \|\, \mathcal{N}(z_n \mid 0, I)\big] \Big)$

  51. Maximize the VAE mean-field objective directly?

We could try to maximize this objective directly:
$\sum_{n=1}^N \underbrace{\mathbb{E}_{z_n \mid m_n, V_n}\big[\ln \mathcal{N}(x_n \mid \mu_\theta(z_n), \Sigma_\theta(z_n))\big]}_{\text{expected complete-data log likelihood}} - \underbrace{\text{KL}\big[\mathcal{N}(z_n \mid m_n, V_n)\, \|\, \mathcal{N}(z_n \mid 0, I)\big]}_{\text{difference between approximation and prior (easy)}}$
Annoying because:
▶ The number of optimized dimensions scales with $N$.
▶ We have to perform an optimization to make an out-of-sample inference.
▶ Computing the expected complete-data log likelihood looks hard.

  52. Maximize the VAE mean-field objective directly?

Zooming in on the expected complete-data log likelihood:
$\mathbb{E}_{z_n \mid m_n, V_n}\big[\ln \mathcal{N}(x_n \mid \mu_\theta(z_n), \Sigma_\theta(z_n))\big] = \int \mathcal{N}(z_n \mid m_n, V_n) \ln \mathcal{N}(x_n \mid \mu_\theta(z_n), \Sigma_\theta(z_n))\, dz_n$
Can we just draw $z_n^{(m)} \sim \mathcal{N}(z_n \mid m_n, V_n)$ and use Monte Carlo?
$\mathbb{E}_{z_n \mid m_n, V_n}\big[\ln \mathcal{N}(x_n \mid \mu_\theta(z_n), \Sigma_\theta(z_n))\big] \approx \frac{1}{M} \sum_{m=1}^M \ln \mathcal{N}(x_n \mid \mu_\theta(z_n^{(m)}), \Sigma_\theta(z_n^{(m)}))$
▶ Gradient with respect to $\theta$? No problem.
▶ Gradient with respect to $m_n$ and $V_n$? Where did they go?!
Kingma and Welling (2014) suggested a clever trick.

  53. The Reparameterization Trick

The reparameterization trick addresses the following general situation:
$\nabla_\alpha\, \mathbb{E}_{z \sim \pi_\alpha}[f(z)] = \nabla_\alpha \int \pi_\alpha(z) f(z)\, dz$
Here the parameter $\alpha$ governs the distribution under which the expectation is being taken. If we sample $z_m \sim \pi_\alpha$, we get something non-differentiable in $\alpha$:
$\nabla_\alpha \left[ \frac{1}{M} \sum_{m=1}^M f(z_m) \right]$

  54. The Reparameterization Trick

We can simulate from many "standard" parametric distributions via a differentiable parametric transformation of a fixed distribution.³ Examples:
▶ univariate Gaussian: $w \sim \mathcal{N}(0, 1) \Rightarrow a w + b \sim \mathcal{N}(b, a^2)$
▶ multivariate Gaussian: $w \sim \mathcal{N}(0, I) \Rightarrow A w + b \sim \mathcal{N}(b, A A^\top)$
▶ exponential: $w \sim \mathcal{U}(0, 1) \Rightarrow -\ln(w)/\lambda \sim \text{Exp}(\lambda)$
▶ gamma: $w \sim \text{Gamma}(k, 1) \Rightarrow a w \sim \text{Gamma}(k, a)$
Reparameterize the integral using the simple fixed distribution $\rho(w)$ and an $\alpha$-parameterized transformation:
$\nabla_\alpha\, \mathbb{E}_{z \sim \pi_\alpha}[f(z)] = \nabla_\alpha \int \pi_\alpha(z) f(z)\, dz = \nabla_\alpha \int \rho(w)\, f(t_\alpha(w))\, dw$
³ Essentially anything with a reasonable quantile function.

  55. The Reparameterization Trick

Reparameterize the integral using the simple fixed distribution $\rho(w)$ and an $\alpha$-parameterized transformation:
$\nabla_\alpha\, \mathbb{E}_{z \sim \pi_\alpha}[f(z)] = \nabla_\alpha \int \pi_\alpha(z) f(z)\, dz = \nabla_\alpha \int \rho(w)\, f(t_\alpha(w))\, dw$
Draw $w_m \sim \rho(w)$, and now Monte Carlo plays nicely with differentiation, since the gradient can move inside the $\alpha$-free expectation:
$\nabla_\alpha \int \rho(w)\, f(t_\alpha(w))\, dw = \int \rho(w)\, \nabla_\alpha f(t_\alpha(w))\, dw \approx \frac{1}{M} \sum_{m=1}^M \nabla_\alpha f(t_\alpha(w_m))$
Shakir Mohamed has a very nice blog post discussing this trick (Mohamed, 2015).
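A minimal sketch of the trick in JAX for a Gaussian $\pi_\alpha$ with $\alpha = (\mu, \sigma)$; the randomness $w$ is drawn once, outside the function being differentiated:

    import jax
    import jax.numpy as jnp

    def reparam_grad(f, alpha, key, M=1000):
        """Monte Carlo gradient of E_{z ~ N(mu, sigma^2)}[f(z)]
        via z = t_alpha(w) = mu + sigma * w, with w ~ N(0, 1)."""
        w = jax.random.normal(key, (M,))
        def objective(a):
            mu, sigma = a[0], a[1]
            z = mu + sigma * w            # differentiable in alpha
            return jnp.mean(jax.vmap(f)(z))
        return jax.grad(objective)(alpha)

    # Check: for f(z) = z^2, E[z^2] = mu^2 + sigma^2, so the gradient
    # should be approximately (2 mu, 2 sigma).
    g = reparam_grad(lambda z: z ** 2, jnp.array([1.0, 2.0]),
                     jax.random.PRNGKey(0))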

  56. Reparameterization and the VAE

Draw a set of $\epsilon_n^{(m)} \sim \mathcal{N}(0, I)$ and parameterize via $W_n$ such that $W_n W_n^\top = V_n$:
$\mathbb{E}_{z_n \mid m_n, V_n}\big[\ln \mathcal{N}(x_n \mid \mu_\theta(z_n), \Sigma_\theta(z_n))\big] = \int \mathcal{N}(\epsilon_n \mid 0, I) \ln \mathcal{N}(x_n \mid \mu_\theta(W_n \epsilon_n + m_n), \Sigma_\theta(W_n \epsilon_n + m_n))\, d\epsilon_n$
$\approx \frac{1}{M} \sum_{m=1}^M \ln \mathcal{N}(x_n \mid \mu_\theta(W_n \epsilon_n^{(m)} + m_n), \Sigma_\theta(W_n \epsilon_n^{(m)} + m_n))$
Now it is possible to differentiate with respect to $m_n$ and $W_n$.


  57–61. Amortizing Inference in the VAE

Recall that there were several annoying things about mean-field VI in our model:
▶ The number of optimized dimensions scales with $N$.
▶ We have to perform an optimization to make an out-of-sample inference.
▶ Computing the expected complete-data log likelihood looks hard.
Can we just look at a datum and guess its variational parameters? Anybody have any good function approximators lying around?

  62. Amortizing Inference in the VAE

Throw away all of the per-datum variational parameters $\{m_n, V_n\}_{n=1}^N$. Replace them with parametric functions that see the input: $m_\gamma(x)$ and $V_\gamma(x)$. Rederive the lower bound with $\gamma$ instead of $\{m_n, V_n\}_{n=1}^N$:
$\mathcal{L}(\theta, \gamma) = \sum_{n=1}^N \mathbb{E}_{z_n \mid x_n, \gamma}\big[\ln \mathcal{N}(x_n \mid \mu_\theta(z_n), \Sigma_\theta(z_n))\big] - \text{KL}\big[\mathcal{N}(z_n \mid m_\gamma(x_n), V_\gamma(x_n))\, \|\, \mathcal{N}(z_n \mid 0, I)\big]$
We can now do mini-batch stochastic optimization without local variables. Amortized: pay up front and then use it cheaply (Gershman and Goodman, 2014).
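Putting the pieces together, a single-datum sketch of $\mathcal{L}(\theta, \gamma)$ with a diagonal-Gaussian encoder; `encode`, `decode_mu`, and `decode_logvar` are hypothetical network functions standing in for the slides' $m_\gamma$, $V_\gamma$, $\mu_\theta$, and $\Sigma_\theta$:

    import jax
    import jax.numpy as jnp

    def elbo(theta, gamma, x, key, encode, decode_mu, decode_logvar, M=1):
        """Amortized ELBO for one datum: reparameterized expected
        log-likelihood minus an analytic Gaussian KL term."""
        m, logv = encode(gamma, x)             # m_gamma(x), ln diag V_gamma(x)
        eps = jax.random.normal(key, (M, m.size))
        z = m + jnp.exp(0.5 * logv) * eps      # reparameterized samples
        def log_lik(z_m):
            mu, lv = decode_mu(theta, z_m), decode_logvar(theta, z_m)
            return -0.5 * jnp.sum(lv + (x - mu) ** 2 / jnp.exp(lv)
                                  + jnp.log(2.0 * jnp.pi))
        # Analytic KL between N(m, diag(exp(logv))) and N(0, I).
        kl = 0.5 * jnp.sum(jnp.exp(logv) + m ** 2 - 1.0 - logv)
        return jnp.mean(jax.vmap(log_lik)(z)) - kl

    # jax.grad(elbo, argnums=(0, 1)) then trains theta and gamma jointly.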

  63–64. What does this have to do with autoencoders?

▶ encoder = "recognition network" = amortized inference: takes data and maps it to (a distribution over) a latent representation.
▶ decoder = likelihood = generative model: takes a latent representation and produces data.
(Figure: encoder network feeding a latent code into a decoder network.)

  65. Importance Weighted Autoencoder (Burda et al., 2016)

$\ln P(x \mid \theta) = \ln \int P(x, z \mid \theta)\, dz = \ln \int q(z)\, \frac{P(x, z \mid \theta)}{q(z)}\, dz \ge \int q(z) \ln \frac{P(x, z \mid \theta)}{q(z)}\, dz$
Rather than using a single $z$, compute the ELBO with multiple $z$:
$\ln P(x \mid \theta) = \ln \int q(z^{(1)})\, q(z^{(2)}) \left[ \frac{P(x, z^{(1)} \mid \theta)}{2\, q(z^{(1)})} + \frac{P(x, z^{(2)} \mid \theta)}{2\, q(z^{(2)})} \right] dz^{(1)}\, dz^{(2)} \ge \int q(z^{(1)})\, q(z^{(2)}) \ln \left[ \frac{P(x, z^{(1)} \mid \theta)}{2\, q(z^{(1)})} + \frac{P(x, z^{(2)} \mid \theta)}{2\, q(z^{(2)})} \right] dz^{(1)}\, dz^{(2)}$
More generally, allow for $K$ "importance samples":
$\ln P(x \mid \theta) \ge \mathbb{E}_{z^{(1)}, \dots, z^{(K)} \sim q(z)} \left[ \ln \frac{1}{K} \sum_{k=1}^K \frac{P(x, z^{(k)} \mid \theta)}{q(z^{(k)})} \right]$
All else being equal, bigger $K$ leads to a tighter bound.
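A sketch of the $K$-sample bound as a single Monte Carlo estimate (a log-mean-exp of importance weights); `log_joint`, `log_q`, and `sample_q` are hypothetical stand-ins for $P(x, z \mid \theta)$ and $q$:

    import numpy as np

    def iwae_bound(x, log_joint, log_q, sample_q, K=50):
        """One draw of ln (1/K) sum_k P(x, z_k | theta) / q(z_k), z_k ~ q."""
        logw = np.array([log_joint(x, z) - log_q(z) for z in sample_q(K)])
        a = logw.max()
        return a + np.log(np.mean(np.exp(logw - a)))   # log-mean-exp

    # K = 1 recovers the standard ELBO estimator; larger K tightens the bound.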

  66. Tutorial Outline

▶ What is generative modeling?
▶ Recipes for flexible generative models
▶ Algorithms for learning generative models from data
▶ Variational autoencoder
▶ Combining graphical models and neural networks
