Probabilistic Graphical Models
Inference & Learning in DL
Zhiting Hu
Lecture 19, March 29, 2017
Deep Generative Models
- Explicit probabilistic models: provide an explicit parametric specification of the distribution of x
- Tractable likelihood function p_θ(x)
- E.g., p(x, z | β) = p(x | z) p(z | β)
Deep Generative Models
- Explicit probabilistic models: provide an explicit parametric specification of the distribution of x
- Tractable likelihood function p_θ(x)
- E.g., Sigmoid Belief Nets: layers of binary units z^(k) ∈ {0, 1}, each unit turned on with a sigmoid probability of its parents in the layer above:
  p( z_i^(k) = 1 | z^(k+1) ) = σ( Σ_j w_ij z_j^(k+1) + b_i )
Deep Generative Models
- Explicit probabilistic models: provide an explicit parametric specification of the distribution of x
- Tractable likelihood function p_θ(x)
- E.g., deep generative models parameterized with NNs (e.g., VAEs):
  p_θ(x | z) = N( x; μ_θ(z), σ²I ),  p(z) = N( z; 0, I )
Deep Generative Models
- Implicit probabilistic models: define a stochastic process to simulate data x
- Do not require a tractable likelihood function
- Data simulator
- A natural approach for problems in population genetics, weather, ecology, etc.
- E.g., generate data from a deterministic equation given parameters and random noise (e.g., GANs; see the sketch below):
  x_n = f(z_n; θ),  z_n ∼ N(0, I)
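To make the explicit/implicit contrast concrete, a minimal, hypothetical sketch follows: an implicit model lets us push noise through a deterministic network to draw samples x = f(z; θ), but defines no density p_θ(x) to evaluate. The toy generator, layer sizes, and names here are illustrative assumptions, not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical generator parameters (a one-hidden-layer net); sizes are illustrative.
latent_dim, hidden_dim, data_dim = 2, 16, 4
W1 = rng.normal(size=(hidden_dim, latent_dim))
b1 = np.zeros(hidden_dim)
W2 = rng.normal(size=(data_dim, hidden_dim))
b2 = np.zeros(data_dim)

def generator(z):
    """Deterministic map x = f(z; theta): all stochasticity comes from z."""
    h = np.tanh(W1 @ z + b1)
    return W2 @ h + b2

# Sampling is easy: push Gaussian noise through f ...
z = rng.normal(size=latent_dim)   # z ~ N(0, I)
x = generator(z)                  # x = f(z; theta)
print(x)

# ... but the model only defines a sampler, not a tractable density: there is
# no p_theta(x) we can evaluate here, which is what "implicit" means.
```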
Recap: Variational Inference
- Consider a probabilistic model p_θ(x, z)
- Assume a variational distribution q_φ(z | x)
- Lower bound for the log-likelihood (derivation below):
  log p(x) = KL( q_φ(z|x) || p_θ(z|x) ) + ∫ q_φ(z|x) log [ p_θ(x, z) / q_φ(z|x) ] dz
           ≥ ∫ q_φ(z|x) log [ p_θ(x, z) / q_φ(z|x) ] dz  ≜  L(θ, φ; x)
- Free energy: F(θ, φ; x) = -log p(x) + KL( q_φ(z|x) || p_θ(z|x) )
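For completeness, a short derivation of the decomposition above (standard variational-inference algebra, stated here rather than transcribed from the slide):

```latex
\begin{align*}
\log p(x)
&= \int q_\phi(z \mid x) \log p(x)\, dz
 = \int q_\phi(z \mid x) \log \frac{p_\theta(x, z)}{p_\theta(z \mid x)}\, dz \\
&= \int q_\phi(z \mid x) \log \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\, dz
 + \int q_\phi(z \mid x) \log \frac{q_\phi(z \mid x)}{p_\theta(z \mid x)}\, dz \\
&= \mathcal{L}(\theta, \phi; x) + \mathrm{KL}\big( q_\phi(z \mid x) \,\|\, p_\theta(z \mid x) \big)
 \;\ge\; \mathcal{L}(\theta, \phi; x).
\end{align*}
```

Since the KL term is nonnegative, L(θ, φ; x) is a lower bound on log p(x), and the free energy is just its negative: F(θ, φ; x) = -L(θ, φ; x). Minimizing the free energy is therefore the same as maximizing the bound.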
Wake-Sleep Algorithm
- Consider a generative model p_θ(x, z), e.g., sigmoid belief nets
- Variational bound: log p(x) ≥ ∫ q_φ(z|x) log [ p_θ(x, z) / q_φ(z|x) ] dz ≜ L(θ, φ; x)
- Use an inference network q_φ(z | x)
- Maximize the bound w.r.t. θ → Wake phase: max_θ E_{q_φ(z|x)} [ log p_θ(x, z) ]
  - Get samples of z from q_φ(z|x) through a bottom-up pass
  - Use the samples as targets for updating the generator
Wake-Sleep Algorithm
- [Hinton et al., Science 1995]
- Generally applicable to a wide range of generative models by training a separate inference network
- Consider a generative model p_θ(x | z) with prior p(z), e.g., multi-layer belief nets
- Free energy: F(θ, φ; x) = -log p(x) + KL( q_φ(z|x) || p_θ(z|x) )
- Inference network q_φ(z | x), a.k.a. recognition network
Wake-Sleep Algorithm
[Figure: bottom-up recognition passes (R1, R2) over data x; courtesy of Maei's slides]
- Free energy: F(θ, φ; x) = -log p(x) + KL( q_φ(z|x) || p_θ(z|x) )
- Minimize the free energy w.r.t. θ → Wake phase: max_θ E_{q_φ(z|x)} [ log p_θ(x, z) ]
  - Get samples of z from q_φ(z|x) through a bottom-up pass on the training data
  - Use the samples as targets for updating the generator
Wake-Sleep Algorithm
[Figure: generative weights (G1, G2) and recognition weights (R1, R2); courtesy of Maei's slides]
- Free energy: F(θ, φ; x) = -log p(x) + KL( q_φ(z|x) || p_θ(z|x) )
- Minimizing the free energy w.r.t. φ directly is computationally expensive / high variance
- Instead, minimize a reversed-KL objective w.r.t. φ → Sleep phase:
  F′(θ, φ; x) = -log p(x) + KL( p_θ(z|x) || q_φ(z|x) ),  i.e.,  max_φ E_{p_θ(x, z)} [ log q_φ(z|x) ]
  - "Dream" up samples (x, z) from the generative model through a top-down pass
  - Use the samples as targets for updating the recognition network
Wake-Sleep Algorithm
- Wake phase (sketched in code below):
  - Use the recognition network to perform a bottom-up pass, creating samples for the layers above (from data)
  - Train the generative network using the samples obtained from the recognition model
- Sleep phase:
  - Use the generative weights to generate ("dream") data by performing a top-down pass
  - Train the recognition weights using the samples obtained from the generative model
- The two KL directions are not symmetric, so wake-sleep does not optimize a well-defined objective function and is not guaranteed to converge
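A minimal sketch of the two phases for a toy one-hidden-layer sigmoid belief net with factorized Bernoulli units. The layer sizes, learning rate, and simple delta-rule gradients are simplifying assumptions for illustration, not the lecture's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

# Toy one-hidden-layer sigmoid belief net (sizes are illustrative assumptions).
n_x, n_z, lr = 8, 3, 0.05
b = np.zeros(n_z)                    # generative prior:     p(z_j = 1) = sigmoid(b_j)
G = rng.normal(0, 0.1, (n_x, n_z))   # generative weights:   p(x_i = 1 | z) = sigmoid(G z + c)
c = np.zeros(n_x)
R = rng.normal(0, 0.1, (n_z, n_x))   # recognition weights:  q(z_j = 1 | x) = sigmoid(R x + d)
d = np.zeros(n_z)

def wake_step(x):
    """Wake phase: bottom-up sample z ~ q(z|x), then increase log p(x, z) w.r.t. theta."""
    global G, c, b
    q = sigmoid(R @ x + d)
    z = (rng.random(n_z) < q).astype(float)      # sample from the recognition net
    m = sigmoid(G @ z + c)                        # generative mean of x given z
    G += lr * np.outer(x - m, z)                  # delta rule = gradient of log p(x|z)
    c += lr * (x - m)
    b += lr * (z - sigmoid(b))                    # gradient of log p(z)

def sleep_step():
    """Sleep phase: top-down 'dream' (x, z) ~ p_theta, then increase log q(z|x) w.r.t. phi."""
    global R, d
    z = (rng.random(n_z) < sigmoid(b)).astype(float)
    x = (rng.random(n_x) < sigmoid(G @ z + c)).astype(float)
    r = sigmoid(R @ x + d)
    R += lr * np.outer(z - r, x)                  # delta rule = gradient of log q(z|x)
    d += lr * (z - r)

# Alternate the two phases over some toy binary data.
data = (rng.random((100, n_x)) < 0.5).astype(float)
for x in data:
    wake_step(x)
    sleep_step()
```

Each phase is a simple supervised update: the wake phase fits the generator to samples produced by the recognition net, and the sleep phase fits the recognition net to "dreamed" samples from the generator.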
Variational Auto-encoders (VAEs)
- [Kingma & Welling, 2014]
- Enjoy similar applicability to the wake-sleep algorithm
  - But not applicable to discrete latent variables
- Optimize a variational lower bound on the log-likelihood
- Reduce variance through reparameterization of the recognition distribution
  - Alternative: use control variates as in reinforcement learning [Mnih & Gregor, 2014]
Variational Auto-encoders (VAEs)
- Generative model p_θ(x | z) with prior p(z), a.k.a. decoder
- Inference network q_φ(z | x), a.k.a. encoder or recognition network
- Variational lower bound:
  log p(x) ≥ E_{q_φ(z|x)} [ log p_θ(x | z) ] - KL( q_φ(z|x) || p(z) )  ≜  L(θ, φ; x)
Variational Auto-encoders (VAEs)
- Variational lower bound: L(θ, φ; x) = E_{q_φ(z|x)} [ log p_θ(x | z) ] - KL( q_φ(z|x) || p(z) )
- Optimizing L(θ, φ; x) w.r.t. θ (the parameters of p_θ(x | z)) is the same as in the wake phase
- Optimizing L(θ, φ; x) w.r.t. φ (the parameters of q_φ(z | x)):
  - Directly computing the gradient with MC estimation gives a REINFORCE-like update rule, which suffers from high variance [Mnih & Gregor 2014] (next lecture for more on REINFORCE); the two estimators are contrasted below
  - VAEs use a reparameterization trick to reduce the variance
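For reference, the two gradient estimators being contrasted, written for a generic integrand f(z); these are standard identities rather than formulas transcribed from the slide:

```latex
% Score-function (REINFORCE-like) estimator: unbiased, but typically high variance
\nabla_\phi \, \mathbb{E}_{q_\phi(z \mid x)}\!\left[ f(z) \right]
  = \mathbb{E}_{q_\phi(z \mid x)}\!\left[ f(z)\, \nabla_\phi \log q_\phi(z \mid x) \right]

% Reparameterized (pathwise) estimator: needs z = g_\phi(\epsilon, x) with \epsilon \sim p(\epsilon)
% and a differentiable f, but usually has much lower variance
\nabla_\phi \, \mathbb{E}_{q_\phi(z \mid x)}\!\left[ f(z) \right]
  = \mathbb{E}_{\epsilon \sim p(\epsilon)}\!\left[ \nabla_\phi f\big( g_\phi(\epsilon, x) \big) \right]
```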
VAEs: Reparameterization Trick
- Gaussian recognition distribution: q_φ( z^(i) | x^(i) ) = N( z^(i); μ^(i), σ^2(i) I )
- Reparameterize: z = g_φ(x, ε) = μ + σ ⊙ ε with ε ∼ N(0, I), so z is a deterministic mapping of ε
[Figure courtesy: Chang's slides]
VAEs: Reparameterization Trick
- Variational lower bound: L(θ, φ; x) = E_{q_φ(z|x)} [ log p_θ(x | z) ] - KL( q_φ(z|x) || p(z) )
- With z = g_φ(x, ε):  E_{q_φ(z|x)} [ log p_θ(x | z) ] = E_{ε∼N(0,I)} [ log p_θ( x | g_φ(x, ε) ) ]
- Optimize w.r.t. φ:  ∇_φ E_{q_φ(z|x)} [ log p_θ(x | z) ] = E_{ε∼N(0,I)} [ ∇_φ log p_θ( x | g_φ(x, ε) ) ]
  - Uses the gradients w.r.t. the latent variables
- For Gaussian distributions, KL( q_φ(z|x) || p(z) ) can be computed and differentiated analytically (closed form below)
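The closed form referred to above, for q_φ(z|x) = N(μ, diag(σ²)) and prior p(z) = N(0, I), is the standard Gaussian KL:

```latex
\mathrm{KL}\big( \mathcal{N}(\mu, \operatorname{diag}(\sigma^2)) \,\|\, \mathcal{N}(0, I) \big)
  = \frac{1}{2} \sum_{j=1}^{d} \left( \mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1 \right)
```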
VAEs: Training
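A minimal, hypothetical sketch of VAE training with the reparameterization trick and the analytic Gaussian KL. The PyTorch framing, the Bernoulli decoder, and all layer sizes are illustrative assumptions rather than the lecture's implementation.

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    """Minimal Gaussian-encoder / Bernoulli-decoder VAE (sizes are illustrative)."""
    def __init__(self, x_dim=784, h_dim=256, z_dim=20):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.enc_mu = nn.Linear(h_dim, z_dim)
        self.enc_logvar = nn.Linear(h_dim, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))   # logits of Bernoulli p_theta(x|z)

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        eps = torch.randn_like(mu)                    # eps ~ N(0, I)
        z = mu + torch.exp(0.5 * logvar) * eps        # reparameterization: z = g_phi(x, eps)
        logits = self.dec(z)
        # Reconstruction term: single-sample MC estimate of E_q[log p_theta(x|z)]
        rec = -nn.functional.binary_cross_entropy_with_logits(logits, x, reduction='sum')
        # Analytic KL( N(mu, sigma^2 I) || N(0, I) )
        kl = 0.5 * torch.sum(mu.pow(2) + logvar.exp() - logvar - 1.0)
        return rec - kl                               # the lower bound L(theta, phi; x)

model = VAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Toy binary data standing in for, e.g., binarized MNIST.
data = (torch.rand(512, 784) > 0.5).float()
for epoch in range(5):
    for i in range(0, len(data), 64):
        x = data[i:i + 64]
        loss = -model(x) / x.size(0)                  # maximize the bound = minimize its negative
        opt.zero_grad()
        loss.backward()
        opt.step()
```

In practice a single sample of ε per data point is typically enough, since the reparameterized gradient already has low variance.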
VAEs: Results
[Figure: generated MNIST images, from Gregor et al., 2015]
VAEs: Limitations and variants
- Element-wise reconstruction error
  - For image generation, every pixel has to be reconstructed
  - Sensitive to irrelevant variation, e.g., translations
- Variant: feature-wise (perceptual-level) reconstruction [Dosovitskiy et al., 2016] (sketched below)
  - Use a pre-trained neural network to extract features of the data
  - Generated images are required to have feature vectors similar to those of the data
- Variant: combining VAEs with GANs [Larsen et al., 2016] (more later)
[Figure: reconstruction results under the different losses]
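A minimal sketch of the feature-wise idea, assuming a frozen feature extractor stands in for a pre-trained network; the small convolutional stack and the function names are hypothetical, not taken from Dosovitskiy et al.

```python
import torch
import torch.nn as nn

# Hypothetical frozen feature extractor standing in for a pre-trained network
# (e.g., convolutional features of an ImageNet classifier); purely illustrative.
feature_extractor = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
)
for p in feature_extractor.parameters():
    p.requires_grad_(False)          # the features are fixed; only the VAE is trained

def pixel_loss(x_rec, x):
    """Element-wise reconstruction: penalizes every pixel, sensitive to small shifts."""
    return nn.functional.mse_loss(x_rec, x, reduction='sum')

def perceptual_loss(x_rec, x):
    """Feature-wise reconstruction: compare feature vectors instead of raw pixels."""
    return nn.functional.mse_loss(feature_extractor(x_rec), feature_extractor(x), reduction='sum')

# Usage: swap perceptual_loss for pixel_loss as the reconstruction term of the VAE bound.
x = torch.rand(8, 1, 28, 28)
x_rec = torch.rand(8, 1, 28, 28, requires_grad=True)
print(pixel_loss(x_rec, x).item(), perceptual_loss(x_rec, x).item())
```

Because the comparison happens in feature space, small translations of the image change the loss far less than a pixel-wise error would.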
VAEs: Limitations and variants
- Not applicable to discrete latent variables
  - Differentiable reparameterization does not apply to discrete variables
  - The wake-sleep algorithm / GANs allow discrete latents
- Variant: marginalize out the discrete latents [Kingma et al., 2014]
  - Expensive when the discrete space is large
- Variant: use continuous approximations
  - Gumbel-softmax [Jang et al., 2017] for approximating multinomial variables (sketched below)
- Variant: combine VAEs with the wake-sleep algorithm [Hu et al., 2017]
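A minimal sketch of the Gumbel-softmax relaxation of a categorical sample; the formulation follows Jang et al. (2017), while the temperature value and tensor shapes are illustrative choices.

```python
import torch

def gumbel_softmax_sample(logits, temperature=0.5):
    """Draw a differentiable, approximately one-hot sample from a categorical
    distribution with the given (unnormalized) logits."""
    # Gumbel(0, 1) noise: g = -log(-log(u)), u ~ Uniform(0, 1)
    u = torch.rand_like(logits)
    g = -torch.log(-torch.log(u + 1e-20) + 1e-20)
    # Softmax of the perturbed logits; temperature -> 0 recovers hard one-hot samples
    return torch.softmax((logits + g) / temperature, dim=-1)

logits = torch.zeros(4, 3, requires_grad=True)   # batch of 4 categorical variables with 3 classes
y = gumbel_softmax_sample(logits)                # relaxed one-hot samples
y.sum().backward()                               # works: the sample is differentiable w.r.t. logits
print(y)
```

As the temperature is annealed toward zero the samples approach one-hot vectors, at the cost of higher-variance gradients.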
VAEs: Limitations and variants
- Usually use a fixed standard normal distribution as the prior: p(z) = N(z; 0, I)
  - For ease of inference and learning
  - Limited flexibility: the data distribution must be mapped onto a fixed, single-mode prior distribution
- Variant: use hierarchical nonparametric priors [Goyal et al., 2017]
  - E.g., Dirichlet process, nested Chinese restaurant process (more later)
  - Learn the structure of the priors jointly with the model
Deep Generative Models
- Implicit probabilistic models: define a stochastic process to simulate data x
- Do not require a tractable likelihood function
- Data simulator
- A natural approach for problems in population genetics, weather, ecology, etc.
- E.g., generate data from a deterministic equation given parameters and random noise (e.g., GANs):
  x_n = f(z_n; θ),  z_n ∼ N(0, I)
Generative Adversarial Nets (GANs)
- [Goodfellow et al., 2014]
- Assume an implicit generative model
- Learn the cost function jointly
- Interpreted as a mini-max game between a generator and a discriminator (objective below)
- Generate sharp, high-fidelity samples
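For reference, the mini-max objective from Goodfellow et al. (2014) that this game interpretation refers to:

```latex
\min_{G} \max_{D} \; V(D, G)
  = \mathbb{E}_{x \sim p_{\text{data}}(x)}\!\left[ \log D(x) \right]
  + \mathbb{E}_{z \sim p(z)}\!\left[ \log\big( 1 - D(G(z)) \big) \right]
```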