
Probabilistic Graphical Models: Inference & Learning in DL


  1. Probabilistic Graphical Models: Inference & Learning in DL
     Zhiting Hu
     Lecture 19, March 29, 2017
     Reading:

  2. Deep Generative Models
     - Explicit probabilistic models: provide an explicit parametric specification of the distribution of x
     - Tractable likelihood function p_θ(x)
     - E.g., p(x, z | θ) = p(x | z) p(z | θ)

  3. Deep Generative Models
     - Explicit probabilistic models: provide an explicit parametric specification of the distribution of x
     - Tractable likelihood function p_θ(x)
     - E.g., Sigmoid Belief Nets: layers of binary units h^(3), h^(2), h^(1) ∈ {0, 1}^{K_l}, where each unit turns on with a sigmoid probability given the layer above:
       p(h_i^(2) = 1 | h^(3)) = σ(w_i^(3)ᵀ h^(3) + c_i^(2)),   p(h_j^(1) = 1 | h^(2)) = σ(w_j^(2)ᵀ h^(2) + c_j^(1))
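Ancestral sampling in such a network simply follows the sigmoid conditionals layer by layer. A minimal sketch of this, assuming NumPy and arbitrary randomly initialized weights (the layer sizes and parameters below are hypothetical, not from the slides):

import numpy as np

rng = np.random.default_rng(0)
n_h2, n_h1, n_v = 8, 16, 32                      # hypothetical layer sizes
W2, c = 0.1 * rng.normal(size=(n_h1, n_h2)), np.zeros(n_h1)
W1, b = 0.1 * rng.normal(size=(n_v, n_h1)), np.zeros(n_v)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

h2 = rng.binomial(1, 0.5, size=n_h2)             # top layer: independent Bernoulli(0.5) units
h1 = rng.binomial(1, sigmoid(W2 @ h2 + c))       # p(h1_i = 1 | h2) = sigmoid(w_i . h2 + c_i)
v = rng.binomial(1, sigmoid(W1 @ h1 + b))        # p(v_j = 1 | h1) = sigmoid(w_j . h1 + b_j)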

  4. Deep Generative Models
     - Explicit probabilistic models: provide an explicit parametric specification of the distribution of x
     - Tractable likelihood function p_θ(x)
     - E.g., deep generative models parameterized with neural networks (e.g., VAEs):
       p_θ(x | z) = N(x; μ_θ(z), σ² I),   p(z) = N(z; 0, I)
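To make the parameterization concrete, here is a minimal sketch of such a decoder, assuming PyTorch (which the slides do not use) and made-up layer sizes: μ_θ(z) is a small MLP and the output standard deviation is a learned per-dimension parameter.

import torch
import torch.nn as nn

class Decoder(nn.Module):
    def __init__(self, z_dim=2, x_dim=784, hidden=256):      # hypothetical sizes
        super().__init__()
        self.net = nn.Sequential(nn.Linear(z_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, x_dim))
        self.log_sigma = nn.Parameter(torch.zeros(x_dim))     # learned per-dimension std (log scale)

    def forward(self, z):
        return self.net(z), self.log_sigma.exp()              # mu_theta(z), sigma

decoder = Decoder()
z = torch.randn(5, 2)                                         # z ~ p(z) = N(0, I)
mu, sigma = decoder(z)
x = mu + sigma * torch.randn_like(mu)                         # sample x ~ p_theta(x | z)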

  5. Deep Generative Models
     - Implicit probabilistic models: define a stochastic process that simulates the data x
     - Do not require a tractable likelihood function
     - Data simulator: a natural approach for problems in population genetics, weather, ecology, etc.
     - E.g., generate data by passing random noise through a deterministic equation with parameters θ (e.g., GANs):
       x_n = g(z_n; θ),   z_n ∼ N(0, I)
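The defining property is that we can only sample from the model; the likelihood is never written down. A minimal NumPy sketch of such a simulator, with a made-up transformation g and parameters θ (purely illustrative, not from the slides):

import numpy as np

rng = np.random.default_rng(0)
theta = {"W": rng.normal(size=(2, 2)), "b": np.array([1.0, -1.0])}   # hypothetical parameters

def g(z, theta):
    return np.tanh(z @ theta["W"].T) + theta["b"]    # deterministic transform of the noise

z = rng.standard_normal((1000, 2))                   # z_n ~ N(0, I)
x = g(z, theta)                                      # simulated data; p(x) is never evaluated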

  6. Recap: Variational Inference
     - Consider a probabilistic model p_θ(x, z)
     - Assume a variational distribution q_φ(z | x)
     - Lower bound on the log-likelihood:
       log p(x) = ∫ q_φ(z | x) log [ p_θ(x, z) / q_φ(z | x) ] dz + KL(q_φ(z | x) || p_θ(z | x))
                ≥ ∫ q_φ(z | x) log [ p_θ(x, z) / q_φ(z | x) ] dz =: ℒ(θ, φ; x)
     - Free energy: F(θ, φ; x) = -log p(x) + KL(q_φ(z | x) || p_θ(z | x))
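For reference, the bound follows from the standard decomposition of the log-likelihood (a reconstruction of the usual derivation, stated here in LaTeX; it is not copied verbatim from the slides):

\begin{aligned}
\log p_\theta(x)
  &= \mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x)\big]
   = \mathbb{E}_{q_\phi(z|x)}\!\left[\log \frac{p_\theta(x,z)}{p_\theta(z|x)}\right] \\
  &= \underbrace{\mathbb{E}_{q_\phi(z|x)}\!\left[\log \frac{p_\theta(x,z)}{q_\phi(z|x)}\right]}_{\mathcal{L}(\theta,\phi;x)}
   + \underbrace{\mathbb{E}_{q_\phi(z|x)}\!\left[\log \frac{q_\phi(z|x)}{p_\theta(z|x)}\right]}_{\mathrm{KL}(q_\phi(z|x)\,\|\,p_\theta(z|x))\;\ge\;0}
\end{aligned}

Since the KL term is nonnegative, log p(x) ≥ ℒ(θ, φ; x), and the free energy is just the negative bound: F(θ, φ; x) = -ℒ(θ, φ; x).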

  7. Wake-Sleep Algorithm
     - Consider a generative model p_θ(x | z), e.g., sigmoid belief nets
     - Variational bound: log p(x) ≥ ∫ q_φ(z | x) log [ p_θ(x, z) / q_φ(z | x) ] dz =: ℒ(θ, φ; x)
     - Use an inference network q_φ(z | x)
     - Maximize the bound w.r.t. p_θ → Wake phase: max_θ E_{q_φ(z|x)}[ log p_θ(x | z) ]
        - Get samples from q_φ(z | x) through a bottom-up pass
        - Use the samples as targets for updating the generator

  8. Wake-Sleep Algorithm
     - [Hinton et al., Science 1995]
     - Generally applicable to a wide range of generative models by training a separate inference network
     - Consider a generative model p_θ(x | z) with prior p(z), e.g., multi-layer belief nets
     - Free energy: F(θ, φ; x) = -log p(x) + KL(q_φ(z | x) || p_θ(z | x))
     - Inference network q_φ(z | x), a.k.a. recognition network

  9. Wake-Sleep Algorithm
     [Figure: recognition weights R1, R2 performing a bottom-up pass from data x; courtesy of Maei's slides]
     - Free energy: F(θ, φ; x) = -log p(x) + KL(q_φ(z | x) || p_θ(z | x))
     - Minimize the free energy w.r.t. p_θ → Wake phase: max_θ E_{q_φ(z|x)}[ log p_θ(x | z) ]
        - Get samples from q_φ(z | x) through a bottom-up pass on training data
        - Use the samples as targets for updating the generator

  10. Wake-Sleep Algorithm
     [Figure: generative weights G1, G2 (top-down) and recognition weights R1, R2 (bottom-up) over data x]
     - Free energy: F(θ, φ; x) = -log p(x) + KL(q_φ(z | x) || p_θ(z | x))
     - Minimizing the free energy w.r.t. q_φ(z | x) is computationally expensive / high variance
     - Instead, minimize w.r.t. φ the reversed-KL free energy → Sleep phase:
       F'(θ, φ; x) = -log p(x) + KL(p_θ(z | x) || q_φ(z | x)),   max_φ E_{p_θ(x, z)}[ log q_φ(z | x) ]
        - "Dream" up samples from p_θ through a top-down pass
        - Use the samples as targets for updating the recognition network

  11. Wake-Sleep Algorithm
     - Wake phase:
        - Use the recognition network to perform a bottom-up pass, creating samples for the layers above (from data)
        - Train the generative network using samples obtained from the recognition model
     - Sleep phase:
        - Use the generative weights to reconstruct data by performing a top-down pass
        - Train the recognition weights using samples obtained from the generative model
     - KL is not symmetric
        - Doesn't optimize a well-defined objective function
        - Not guaranteed to converge
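Putting the two phases together, here is a minimal sketch of one wake-sleep iteration for a one-layer Bernoulli model, assuming PyTorch and made-up sizes (none of this code comes from the slides or the original paper):

import torch
import torch.nn as nn
import torch.nn.functional as F

x_dim, z_dim = 784, 64                                   # hypothetical sizes
gen = nn.Linear(z_dim, x_dim)                            # logits of p_theta(x|z), Bernoulli x
rec = nn.Linear(x_dim, z_dim)                            # logits of q_phi(z|x), Bernoulli z
opt_theta = torch.optim.SGD(gen.parameters(), lr=1e-3)
opt_phi = torch.optim.SGD(rec.parameters(), lr=1e-3)

x = (torch.rand(32, x_dim) > 0.5).float()                # stand-in for a data minibatch

# Wake phase: sample z ~ q_phi(z|x) bottom-up, update theta so p_theta reconstructs x.
with torch.no_grad():
    z = torch.bernoulli(torch.sigmoid(rec(x)))
loss_theta = F.binary_cross_entropy_with_logits(gen(z), x)          # -E_q[log p_theta(x|z)]
opt_theta.zero_grad(); loss_theta.backward(); opt_theta.step()

# Sleep phase: "dream" (z, x) ~ p_theta top-down, update phi so q_phi recovers z from x.
with torch.no_grad():
    z_dream = torch.bernoulli(torch.full((32, z_dim), 0.5))          # z ~ p(z), uniform Bernoulli prior
    x_dream = torch.bernoulli(torch.sigmoid(gen(z_dream)))
loss_phi = F.binary_cross_entropy_with_logits(rec(x_dream), z_dream)  # -E_p[log q_phi(z|x)]
opt_phi.zero_grad(); loss_phi.backward(); opt_phi.step()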

  12. Variational Auto-encoders (VAEs)
     - [Kingma & Welling, 2014]
     - Enjoy similar applicability to the wake-sleep algorithm
        - Not applicable to discrete latent variables
     - Optimize a variational lower bound on the log-likelihood
     - Reduce variance through reparameterization of the recognition distribution
        - Alternative: use control variates as in reinforcement learning [Mnih & Gregor, 2014]

  13. Variational Auto-encoders (VAEs)
     - Generative model p_θ(x | z) with prior p(z), a.k.a. decoder
     - Inference network q_φ(z | x), a.k.a. encoder or recognition network
     - Variational lower bound:
       log p(x) ≥ E_{q_φ(z|x)}[ log p_θ(x | z) ] - KL(q_φ(z | x) || p(z)) =: ℒ(θ, φ; x)

  14. Variational Auto-encoders (VAEs)
     - Variational lower bound: ℒ(θ, φ; x) = E_{q_φ(z|x)}[ log p_θ(x | z) ] - KL(q_φ(z | x) || p(z))
     - Optimizing ℒ(θ, φ; x) w.r.t. p_θ(x | z) is the same as the wake phase
     - Optimizing ℒ(θ, φ; x) w.r.t. q_φ(z | x):
        - Directly computing the gradient with MC estimation gives a REINFORCE-like update rule, which suffers from high variance [Mnih & Gregor, 2014] (more on REINFORCE next lecture)
        - VAEs use a reparameterization trick to reduce variance (contrasted in the sketch below)
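To see the contrast, here is a minimal single-sample sketch of the two gradient estimators of ∇_φ E_{q_φ(z)}[f(z)] for a one-dimensional Gaussian q_φ, assuming PyTorch; the quadratic f below merely stands in for log p_θ(x, z) and is not from the slides:

import torch
from torch.distributions import Normal

mu = torch.tensor(0.5, requires_grad=True)
log_sigma = torch.tensor(0.0, requires_grad=True)
f = lambda z: (z - 2.0) ** 2                                   # stand-in for log p_theta(x, z)

# (a) Score-function / REINFORCE-like estimator: also works for discrete z, but high variance.
z = (mu + log_sigma.exp() * torch.randn(())).detach()          # sample treated as a constant
log_q = Normal(mu, log_sigma.exp()).log_prob(z)
grad_sf = torch.autograd.grad(f(z) * log_q, [mu, log_sigma])   # f(z) * grad of log q_phi(z)

# (b) Reparameterized estimator: z = mu + sigma * eps, gradients flow through z itself.
eps = torch.randn(())
z = mu + log_sigma.exp() * eps
grad_rp = torch.autograd.grad(f(z), [mu, log_sigma])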

  15. VAEs: Reparameterization Trick

  16. VAEs: Reparameterization Trick
     - q_φ(z^(i) | x^(i)) = N(z^(i); μ^(i), σ²(i) I)
     - z = z_φ(ε) is a deterministic mapping of the noise ε ∼ N(0, I)
     [Figure courtesy: Chang's slides]
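In code the mapping is a single line; a minimal sketch assuming PyTorch and made-up shapes (not from the slides):

import torch

mu, log_var = torch.zeros(32, 20), torch.zeros(32, 20)   # hypothetical encoder outputs per example
eps = torch.randn_like(mu)                               # eps ~ N(0, I)
z = mu + (0.5 * log_var).exp() * eps                     # z = z_phi(eps) = mu + sigma * eps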

  17. VAEs: Reparameterization Trick
     - Variational lower bound: ℒ(θ, φ; x) = E_{q_φ(z|x)}[ log p_θ(x | z) ] - KL(q_φ(z | x) || p(z)),
       where E_{q_φ(z|x)}[ log p_θ(x | z) ] = E_{ε ∼ N(0,I)}[ log p_θ(x | z_φ(ε)) ]
     - Optimizing ℒ(θ, φ; x) w.r.t. q_φ(z | x):
       ∇_φ E_{q_φ(z|x)}[ log p_θ(x | z) ] = E_{ε ∼ N(0,I)}[ ∇_φ log p_θ(x | z_φ(ε)) ]
        - Uses the gradients w.r.t. the latent variables
     - For Gaussian distributions, KL(q_φ(z | x) || p(z)) can be computed and differentiated analytically
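For completeness, the analytic KL term referred to here, for q_φ(z | x) = N(μ, diag(σ²)) and prior p(z) = N(0, I), takes the standard closed form (in LaTeX):

\mathrm{KL}\big(\mathcal{N}(\mu, \operatorname{diag}(\sigma^2)) \,\|\, \mathcal{N}(0, I)\big)
  = \frac{1}{2} \sum_{d=1}^{D} \left( \mu_d^2 + \sigma_d^2 - \log \sigma_d^2 - 1 \right)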

  18. VAEs: Training
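As a stand-in for the training figure, here is a minimal sketch of one VAE training step, assuming PyTorch, a Bernoulli decoder on MNIST-sized inputs, and made-up layer sizes (none of which come from the slides):

import torch
import torch.nn as nn
import torch.nn.functional as F

x_dim, z_dim, h = 784, 20, 400                              # hypothetical sizes

enc = nn.Sequential(nn.Linear(x_dim, h), nn.ReLU(), nn.Linear(h, 2 * z_dim))  # q_phi(z|x)
dec = nn.Sequential(nn.Linear(z_dim, h), nn.ReLU(), nn.Linear(h, x_dim))      # p_theta(x|z) logits
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)

x = torch.rand(32, x_dim)                                   # stand-in for a data minibatch in [0, 1]

mu, log_var = enc(x).chunk(2, dim=-1)                       # parameters of q_phi(z|x)
z = mu + (0.5 * log_var).exp() * torch.randn_like(mu)       # reparameterized sample
recon = F.binary_cross_entropy_with_logits(dec(z), x, reduction="none").sum(-1)  # -log p_theta(x|z)
kl = 0.5 * (mu.pow(2) + log_var.exp() - log_var - 1).sum(-1)                     # KL(q_phi || N(0, I))
loss = (recon + kl).mean()                                  # negative ELBO, averaged over the batch
opt.zero_grad(); loss.backward(); opt.step()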

  19. VAEs: Results

  20. VAEs: Results
     - Generated MNIST images [Gregor et al., 2015]

  21. VAEs: Limitations and Variants
     - Element-wise reconstruction error
        - For image generation, every pixel must be reconstructed
        - Sensitive to irrelevant variation, e.g., translations
     - Variant: feature-wise (perceptual-level) reconstruction [Dosovitskiy et al., 2016]
        - Use a pre-trained neural network to extract features of the data
        - Generated images are required to have feature vectors similar to those of the data
     - Variant: combining VAEs with GANs [Larsen et al., 2016] (more later)
     [Figure: reconstruction results with different losses]

  22. VAEs: Limitations and Variants
     - Not applicable to discrete latent variables
        - Differentiable reparameterization does not apply to discrete variables
        - The wake-sleep algorithm and GANs allow discrete latents
     - Variant: marginalize out discrete latents [Kingma et al., 2014]
        - Expensive when the discrete space is large
     - Variant: use continuous approximations
        - Gumbel-softmax [Jang et al., 2017] for approximating multinomial variables (see the sketch below)
     - Variant: combine VAEs with the wake-sleep algorithm [Hu et al., 2017]
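To illustrate the continuous-approximation idea, a minimal Gumbel-softmax sampling sketch, assuming PyTorch and made-up logits (not from the slides; recent PyTorch versions also provide torch.nn.functional.gumbel_softmax, which implements the same relaxation):

import torch
import torch.nn.functional as F

logits = torch.zeros(32, 10, requires_grad=True)           # hypothetical categorical logits (K = 10)
tau = 0.5                                                   # temperature; smaller -> closer to one-hot

u = torch.rand_like(logits).clamp_min(1e-9)
gumbel = -torch.log(-torch.log(u))                          # Gumbel(0, 1) noise
y = F.softmax((logits + gumbel) / tau, dim=-1)              # relaxed, differentiable "one-hot" sample
y.sum().backward()                                          # gradients reach the logits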

  23. VAEs: Limitations and Variants
     - Usually use a fixed standard normal distribution as the prior: p(z) = N(z; 0, I)
        - For ease of inference and learning
        - Limited flexibility: the data distribution has to be mapped onto a fixed, single-mode prior
     - Variant: use hierarchical nonparametric priors [Goyal et al., 2017]
        - E.g., Dirichlet process, nested Chinese restaurant process (more later)
        - Learn the structure of the priors jointly with the model


  25. Deep Generative Models
     - Implicit probabilistic models: define a stochastic process that simulates the data x
     - Do not require a tractable likelihood function
     - Data simulator: a natural approach for problems in population genetics, weather, ecology, etc.
     - E.g., generate data by passing random noise through a deterministic equation with parameters θ (e.g., GANs):
       x_n = g(z_n; θ),   z_n ∼ N(0, I)

  26. Generative Adversarial Nets (GANs)
     - [Goodfellow et al., 2014]
     - Assume an implicit generative model
     - Learn the cost function jointly
     - Interpreted as a mini-max game between a generator and a discriminator
     - Generate sharp, high-fidelity samples
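A minimal sketch of that mini-max game, assuming PyTorch and made-up network sizes (not from the slides); the generator step below uses the common non-saturating heuristic rather than the literal minimax loss:

import torch
import torch.nn as nn

x_dim, z_dim = 784, 64                                        # hypothetical sizes
G = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(), nn.Linear(256, x_dim))
D = nn.Sequential(nn.Linear(x_dim, 256), nn.ReLU(), nn.Linear(256, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

x_real = torch.rand(32, x_dim)                                # stand-in for a data minibatch
z = torch.randn(32, z_dim)                                    # z ~ N(0, I)

# Discriminator step: push D(x_real) -> 1 and D(G(z)) -> 0.
d_loss = -(torch.log(D(x_real) + 1e-8).mean() + torch.log(1 - D(G(z).detach()) + 1e-8).mean())
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step (non-saturating heuristic): push D(G(z)) -> 1.
g_loss = -torch.log(D(G(z)) + 1e-8).mean()
opt_g.zero_grad(); g_loss.backward(); opt_g.step()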
