

  1. Discrete Latent Variable Models
     Stefano Ermon, Aditya Grover
     Stanford University
     Lecture 15

  2. Summary
     Major themes in the course:
     Representation: latent variable vs. fully observed.
     Objective function and optimization algorithm: many divergences and distances, optimized via likelihood-free (two-sample test) or likelihood-based methods.
     Evaluation of generative models.
     Combining different models and variants.
     Plan for today: discrete latent variable modeling.

  3. Why should we care about discreteness?
     Discreteness is all around us!
     Decision making: should I attend the CS 236 lecture or not?
     Structure learning.

  4. Why should we care about discreteness?
     Many data modalities are inherently discrete:
     Graphs.
     Text, DNA sequences, program source code, molecules, and lots more.

  5. Stochastic Optimization
     Consider the following optimization problem: $\max_\phi E_{q_\phi(z)}[f(z)]$.
     Recap example: think of $q(\cdot)$ as the inference distribution for a VAE, $\max_{\theta,\phi} E_{q_\phi(z|x)}\left[\log \frac{p_\theta(x,z)}{q_\phi(z|x)}\right]$.
     Gradients w.r.t. $\theta$ can be derived via linearity of expectation:
     $\nabla_\theta E_{q(z;\phi)}[\log p(z,x;\theta) - \log q(z;\phi)] = E_{q(z;\phi)}[\nabla_\theta \log p(z,x;\theta)] \approx \frac{1}{K}\sum_k \nabla_\theta \log p(z^k,x;\theta)$
     If $z$ is continuous, $q(\cdot)$ is reparameterizable, and $f(\cdot)$ is differentiable in $\phi$, then we can use reparameterization to compute gradients w.r.t. $\phi$.
     What if any of the above assumptions fail?
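
For concreteness, here is a minimal PyTorch sketch of the reparameterization trick for a Gaussian $q_\phi(z)$, i.e., the case where all three assumptions hold. The function `f` and the parameter names `mu`, `log_sigma` are illustrative choices, not from the slides.

```python
import torch

# Minimal sketch: reparameterized gradient of E_{q_phi(z)}[f(z)] for a Gaussian q.
def f(z):
    return (z - 2.0) ** 2  # any differentiable function of z (illustrative)

mu = torch.zeros(1, requires_grad=True)
log_sigma = torch.zeros(1, requires_grad=True)

K = 1000
eps = torch.randn(K, 1)                 # samples from the fixed base distribution N(0, I)
z = mu + torch.exp(log_sigma) * eps     # z = g_phi(eps), differentiable in phi
loss = f(z).mean()                      # Monte Carlo estimate of E_q[f(z)]
loss.backward()                         # gradients flow through z back to mu, log_sigma
print(mu.grad, log_sigma.grad)
```

If $z$ were discrete, the line `z = mu + exp(log_sigma) * eps` would have no analogue, which is exactly the failure mode addressed next.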

  6. Stochastic Optimization with REINFORCE
     Consider the following optimization problem: $\max_\phi E_{q_\phi(z)}[f(z)]$.
     In many problem scenarios the reparameterization trick is inapplicable.
     Scenario 1: $f(\cdot)$ is non-differentiable in $\phi$, e.g., optimizing a black-box reward function in reinforcement learning.
     Scenario 2: $q_\phi(z)$ cannot be reparameterized as a differentiable function of $\phi$ with respect to a fixed base distribution, e.g., discrete distributions.
     REINFORCE is a general-purpose solution to both these scenarios.
     We will first analyze it in the context of reinforcement learning and then extend it to latent variable models with discrete latent variables.

  7. REINFORCE for reinforcement learning
     Example: pulling arms of slot machines. Which arm to pull?
     Set $\mathcal{A}$ of possible actions, e.g., pull arm 1, arm 2, etc.
     Each action $z \in \mathcal{A}$ has a reward $f(z)$.
     Randomized policy for choosing actions $q_\phi(z)$, parameterized by $\phi$. For example, $\phi$ could be the parameters of a multinomial distribution.
     Goal: learn the parameters $\phi$ that maximize our earnings (in expectation): $\max_\phi E_{q_\phi(z)}[f(z)]$.

  8. Policy Gradients
     Want to compute a gradient with respect to $\phi$ of the expected reward $E_{q_\phi(z)}[f(z)] = \sum_z q_\phi(z) f(z)$:
     $\frac{\partial}{\partial \phi_i} E_{q_\phi(z)}[f(z)] = \sum_z \frac{\partial q_\phi(z)}{\partial \phi_i} f(z) = \sum_z q_\phi(z) \frac{1}{q_\phi(z)} \frac{\partial q_\phi(z)}{\partial \phi_i} f(z) = \sum_z q_\phi(z) \frac{\partial \log q_\phi(z)}{\partial \phi_i} f(z) = E_{q_\phi(z)}\left[\frac{\partial \log q_\phi(z)}{\partial \phi_i} f(z)\right]$
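
The identity above can be checked numerically. Below is a small NumPy sketch, assuming a 3-way categorical $q_\phi$ with softmax logits $\phi$ and an arbitrary reward vector `f` (both made up for illustration); it compares a finite-difference gradient of $E_{q_\phi}[f]$ with the exact expectation of $f(z)\,\nabla_\phi \log q_\phi(z)$.

```python
import numpy as np

rng = np.random.default_rng(0)
phi = rng.normal(size=3)            # unnormalized logits (illustrative)
f = np.array([1.0, 5.0, -2.0])      # reward for each discrete outcome z (illustrative)

def q(phi):
    e = np.exp(phi - phi.max())
    return e / e.sum()

def expected_f(phi):
    return (q(phi) * f).sum()

# Left-hand side: numerical gradient of E_q[f] w.r.t. phi (central differences)
eps = 1e-6
lhs = np.array([(expected_f(phi + eps * np.eye(3)[i]) -
                 expected_f(phi - eps * np.eye(3)[i])) / (2 * eps) for i in range(3)])

# Right-hand side: exact expectation of f(z) * grad_phi log q_phi(z).
# For a softmax policy, grad_phi log q(z) = one_hot(z) - q(phi).
p = q(phi)
rhs = sum(p[z] * f[z] * (np.eye(3)[z] - p) for z in range(3))

print(np.allclose(lhs, rhs, atol=1e-5))  # True: the log-derivative identity holds
```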

  9. REINFORCE Gradient Estimation
     Want to compute a gradient with respect to $\phi$ of $E_{q_\phi(z)}[f(z)] = \sum_z q_\phi(z) f(z)$.
     The REINFORCE rule is $\nabla_\phi E_{q_\phi(z)}[f(z)] = E_{q_\phi(z)}[f(z) \nabla_\phi \log q_\phi(z)]$.
     We can now construct a Monte Carlo estimate: sample $z^1, \ldots, z^K$ from $q_\phi(z)$ and estimate
     $\nabla_\phi E_{q_\phi(z)}[f(z)] \approx \frac{1}{K}\sum_k f(z^k) \nabla_\phi \log q_\phi(z^k)$
     Assumption: the distribution $q(\cdot)$ is easy to sample from and to evaluate probabilities under.
     Works for both discrete and continuous distributions.
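
Putting slides 7-9 together, here is a minimal NumPy sketch that trains a softmax policy over three slot-machine arms by gradient ascent on the Monte Carlo REINFORCE estimate. The reward values, learning rate, and sample sizes are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
true_rewards = np.array([1.0, 3.0, 2.0])   # hypothetical expected reward per arm
phi = np.zeros(3)                          # policy logits
lr, K = 0.1, 50                            # learning rate, samples per gradient estimate

def policy(phi):
    e = np.exp(phi - phi.max())
    return e / e.sum()

for step in range(200):
    p = policy(phi)
    z = rng.choice(3, size=K, p=p)                         # sample actions z^1, ..., z^K
    f = true_rewards[z] + rng.normal(scale=0.5, size=K)    # noisy observed rewards f(z^k)
    score = np.eye(3)[z] - p                               # grad_phi log q_phi(z^k) for softmax
    grad = (f[:, None] * score).mean(axis=0)               # REINFORCE Monte Carlo estimate
    phi += lr * grad                                       # gradient ascent on expected reward

print(policy(phi))  # most of the probability mass should end up on the best arm (index 1)
```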

  10. Variational Learning of Latent Variable Models
     To learn the variational approximation we need to compute the gradient with respect to $\phi$ of
     $\mathcal{L}(x;\theta,\phi) = \sum_z q_\phi(z|x) \log p(z,x;\theta) + H(q_\phi(z|x)) = E_{q_\phi(z|x)}[\log p(z,x;\theta) - \log q_\phi(z|x)]$
     The function inside the brackets also depends on $\phi$ (and $\theta$, $x$). Want to compute a gradient with respect to $\phi$ of
     $E_{q_\phi(z|x)}[f(\phi,\theta,z,x)] = \sum_z q_\phi(z|x) f(\phi,\theta,z,x)$
     The REINFORCE rule is
     $\nabla_\phi E_{q_\phi(z|x)}[f(\phi,\theta,z,x)] = E_{q_\phi(z|x)}[f(\phi,\theta,z,x) \nabla_\phi \log q_\phi(z|x) + \nabla_\phi f(\phi,\theta,z,x)]$
     We can now construct a Monte Carlo estimate of $\nabla_\phi \mathcal{L}(x;\theta,\phi)$.

  11. REINFORCE Gradient Estimates have High Variance
     Want to compute a gradient with respect to $\phi$ of $E_{q_\phi(z)}[f(z)] = \sum_z q_\phi(z) f(z)$.
     The REINFORCE rule is $\nabla_\phi E_{q_\phi(z)}[f(z)] = E_{q_\phi(z)}[f(z) \nabla_\phi \log q_\phi(z)]$.
     Monte Carlo estimate: sample $z^1, \ldots, z^K$ from $q_\phi(z)$ and set
     $\nabla_\phi E_{q_\phi(z)}[f(z)] \approx \frac{1}{K}\sum_k f(z^k) \nabla_\phi \log q_\phi(z^k) := f^{MC}(z^1, \ldots, z^K)$
     Monte Carlo estimates of gradients are unbiased:
     $E_{z^1, \ldots, z^K \sim q_\phi(z)}\left[f^{MC}(z^1, \ldots, z^K)\right] = \nabla_\phi E_{q_\phi(z)}[f(z)]$
     Almost never used in practice because of high variance.
     Variance can be reduced via carefully designed control variates.
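
Both properties (unbiased but noisy) are easy to see empirically. The NumPy sketch below reuses the 3-way softmax policy from earlier (logits and rewards are made up); it draws many independent K-sample REINFORCE estimates and compares their mean and spread to the analytic gradient.

```python
import numpy as np

rng = np.random.default_rng(1)
phi = np.array([0.2, -0.1, 0.5])
f = np.array([1.0, 5.0, -2.0])

def softmax(phi):
    e = np.exp(phi - phi.max())
    return e / e.sum()

p = softmax(phi)
exact_grad = p * f - (p * f).sum() * p     # analytic gradient of E_q[f] w.r.t. the logits

K, trials = 10, 5000
estimates = np.empty((trials, 3))
for t in range(trials):
    z = rng.choice(3, size=K, p=p)
    score = np.eye(3)[z] - p               # grad_phi log q_phi(z) for a softmax policy
    estimates[t] = (f[z][:, None] * score).mean(axis=0)

print(exact_grad)               # true gradient
print(estimates.mean(axis=0))   # close to the true gradient: the estimator is unbiased
print(estimates.std(axis=0))    # but each individual K-sample estimate fluctuates a lot
```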

  12. Control Variates
     The REINFORCE rule is $\nabla_\phi E_{q_\phi(z)}[f(z)] = E_{q_\phi(z)}[f(z) \nabla_\phi \log q_\phi(z)]$.
     Given any constant $B$ (a control variate),
     $\nabla_\phi E_{q_\phi(z)}[f(z)] = E_{q_\phi(z)}[(f(z) - B) \nabla_\phi \log q_\phi(z)]$
     To see why:
     $E_{q_\phi(z)}[B \nabla_\phi \log q_\phi(z)] = B \sum_z q_\phi(z) \nabla_\phi \log q_\phi(z) = B \sum_z \nabla_\phi q_\phi(z) = B \nabla_\phi \sum_z q_\phi(z) = B \nabla_\phi 1 = 0$
     Monte Carlo gradient estimates of both $f(z)$ and $f(z) - B$ have the same expectation.
     These estimates can, however, have different variances.
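
A short continuation of the previous sketch, again with made-up numbers: subtracting a constant baseline $B$ (here chosen near $E_{q_\phi}[f]$, an illustrative choice) leaves the mean of the REINFORCE estimate unchanged while, for this particular setup, shrinking its total variance.

```python
import numpy as np

rng = np.random.default_rng(2)
phi = np.array([0.2, -0.1, 0.5])
f = np.array([1.0, 5.0, -2.0])

def softmax(phi):
    e = np.exp(phi - phi.max())
    return e / e.sum()

p = softmax(phi)
B = (p * f).sum()          # constant baseline, roughly E_q[f(z)]

K, trials = 10, 5000
plain, baselined = np.empty((trials, 3)), np.empty((trials, 3))
for t in range(trials):
    z = rng.choice(3, size=K, p=p)
    score = np.eye(3)[z] - p
    plain[t] = (f[z][:, None] * score).mean(axis=0)
    baselined[t] = ((f[z] - B)[:, None] * score).mean(axis=0)

print(plain.mean(axis=0), baselined.mean(axis=0))            # same expectation
print(plain.var(axis=0).sum(), baselined.var(axis=0).sum())  # total variance is lower here
```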

  13. Control variates
     Suppose we want to compute $E_{q_\phi(z)}[f(z)] = \sum_z q_\phi(z) f(z)$.
     Define $\tilde{f}(z) = f(z) + a\left(h(z) - E_{q_\phi(z)}[h(z)]\right)$.
     $h(z)$ is referred to as a control variate.
     Assumption: $E_{q_\phi(z)}[h(z)]$ is known.
     Monte Carlo gradient estimates of $f(z)$ and $\tilde{f}(z)$ have the same expectation,
     $E_{z^1, \ldots, z^K \sim q_\phi(z)}[\tilde{f}^{MC}(z^1, \ldots, z^K)] = E_{z^1, \ldots, z^K \sim q_\phi(z)}[f^{MC}(z^1, \ldots, z^K)]$,
     but different variances.
     Can try to learn and update the control variate during training.

  14. Control variates
     Can derive an alternate Monte Carlo estimate for REINFORCE gradients based on control variates.
     Sample $z^1, \ldots, z^K$ from $q_\phi(z)$:
     $\nabla_\phi E_{q_\phi(z)}[f(z)] = \nabla_\phi E_{q_\phi(z)}\left[f(z) + a\left(h(z) - E_{q_\phi(z)}[h(z)]\right)\right]$
     $\approx \frac{1}{K}\sum_{k=1}^K f(z^k) \nabla_\phi \log q_\phi(z^k) + a\left(\frac{1}{K}\sum_{k=1}^K h(z^k) - E_{q_\phi(z)}[h(z)]\right)$
     $:= f^{MC}(z^1, \ldots, z^K) + a\left(h^{MC}(z^1, \ldots, z^K) - E_{q_\phi(z)}[h(z)]\right) := \tilde{f}^{MC}(z^1, \ldots, z^K)$
     What is $\mathrm{Var}(\tilde{f}^{MC})$ vs. $\mathrm{Var}(f^{MC})$?

  15. Control variates
     Comparing $\mathrm{Var}(\tilde{f}^{MC})$ vs. $\mathrm{Var}(f^{MC})$:
     $\mathrm{Var}(\tilde{f}^{MC}) = \mathrm{Var}\!\left(f^{MC} + a\left(h^{MC} - E_{q_\phi(z)}[h(z)]\right)\right) = \mathrm{Var}(f^{MC} + a h^{MC}) = \mathrm{Var}(f^{MC}) + a^2 \mathrm{Var}(h^{MC}) + 2a\,\mathrm{Cov}(f^{MC}, h^{MC})$
     To get the optimal coefficient $a^*$ that minimizes the variance, take the derivative w.r.t. $a$ and set it to 0:
     $a^* = -\frac{\mathrm{Cov}(f^{MC}, h^{MC})}{\mathrm{Var}(h^{MC})}$

  16. Control variates
     Comparing $\mathrm{Var}(\tilde{f}^{MC})$ vs. $\mathrm{Var}(f^{MC})$:
     $\mathrm{Var}(\tilde{f}^{MC}) = \mathrm{Var}(f^{MC}) + a^2 \mathrm{Var}(h^{MC}) + 2a\,\mathrm{Cov}(f^{MC}, h^{MC})$
     Setting the coefficient $a = a^* = -\frac{\mathrm{Cov}(f^{MC}, h^{MC})}{\mathrm{Var}(h^{MC})}$:
     $\mathrm{Var}(\tilde{f}^{MC}) = \mathrm{Var}(f^{MC}) - \frac{\mathrm{Cov}(f^{MC}, h^{MC})^2}{\mathrm{Var}(h^{MC})} = \left(1 - \frac{\mathrm{Cov}(f^{MC}, h^{MC})^2}{\mathrm{Var}(h^{MC})\,\mathrm{Var}(f^{MC})}\right)\mathrm{Var}(f^{MC}) = \left(1 - \rho(f^{MC}, h^{MC})^2\right)\mathrm{Var}(f^{MC})$
     The correlation coefficient $\rho(f^{MC}, h^{MC})$ is between $-1$ and $1$. For maximum variance reduction, we want $f^{MC}$ and $h^{MC}$ to be highly correlated.
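
The same algebra applies to any pair of Monte Carlo estimators, not just REINFORCE gradients. The NumPy sketch below uses a simple scalar example (quantities and sample sizes are illustrative): it estimates $a^*$ from samples and checks that the reduced variance is roughly $(1 - \rho^2)\,\mathrm{Var}(f^{MC})$.

```python
import numpy as np

rng = np.random.default_rng(3)

def estimates(trials=20000, K=10):
    f_mc, h_mc = np.empty(trials), np.empty(trials)
    for t in range(trials):
        z = rng.normal(size=K)           # any sampling distribution works for this check
        f_mc[t] = np.mean(z ** 2 + z)    # quantity whose expectation we want
        h_mc[t] = np.mean(z)             # control variate with known mean E[h] = 0
    return f_mc, h_mc

f_mc, h_mc = estimates()
cov = np.cov(f_mc, h_mc)                 # 2x2 sample covariance matrix
a_star = -cov[0, 1] / cov[1, 1]          # a* = -Cov(f_MC, h_MC) / Var(h_MC)
rho2 = cov[0, 1] ** 2 / (cov[0, 0] * cov[1, 1])

f_tilde = f_mc + a_star * (h_mc - 0.0)   # E[h] = 0 is known exactly here
print(np.var(f_mc), np.var(f_tilde))     # variance drops...
print((1 - rho2) * np.var(f_mc))         # ...by roughly the factor (1 - rho^2)
```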

  17. Neural Variational Inference and Learning (NVIL)
     Latent variable models with discrete latent variables are often referred to as belief networks.
     The variational learning objective is the same as the ELBO:
     $\mathcal{L}(x;\theta,\phi) = \sum_z q_\phi(z|x) \log p(z,x;\theta) + H(q_\phi(z|x)) = E_{q_\phi(z|x)}[\log p(z,x;\theta) - \log q_\phi(z|x)] := E_{q_\phi(z|x)}[f(\phi,\theta,z,x)]$
     Here, $z$ is discrete and hence we cannot use reparameterization.
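
To make this concrete, here is a minimal PyTorch sketch of the plain REINFORCE gradient of the ELBO (the rule from slide 10) for a belief network with a vector of Bernoulli latents. This is not the full NVIL algorithm, which additionally uses learned baselines; all architecture sizes, names, the fake data, and the surrogate-loss trick are illustrative assumptions.

```python
import torch
import torch.nn as nn

x_dim, z_dim = 784, 20
encoder = nn.Sequential(nn.Linear(x_dim, 200), nn.Tanh(), nn.Linear(200, z_dim))  # logits of q_phi(z|x)
decoder = nn.Sequential(nn.Linear(z_dim, 200), nn.Tanh(), nn.Linear(200, x_dim))  # logits of p_theta(x|z)
prior = torch.distributions.Bernoulli(probs=0.5 * torch.ones(z_dim))

def elbo_surrogate(x):
    q = torch.distributions.Bernoulli(logits=encoder(x))
    z = q.sample()                                          # discrete sample: no reparameterization
    log_q = q.log_prob(z).sum(-1)
    log_px_z = torch.distributions.Bernoulli(logits=decoder(z)).log_prob(x).sum(-1)
    log_pz = prior.log_prob(z).sum(-1)
    f = log_px_z + log_pz - log_q                           # f(phi, theta, z, x), the ELBO integrand
    # Surrogate whose gradient is the REINFORCE estimate:
    # f.detach() * log_q yields the f * grad_phi log q term; f itself yields grad_phi f and grad_theta f.
    return (f.detach() * log_q + f).mean(), f.mean()

x = torch.bernoulli(0.5 * torch.ones(64, x_dim))            # fake binary data for illustration
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
surrogate, elbo = elbo_surrogate(x)
(-surrogate).backward()                                     # ascend the ELBO
opt.step()
print(float(elbo))
```

In practice this single-sample estimator is too noisy on its own; the baselines and control variates from slides 12-16 are what make training of such models workable.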
