Lecture 2: Gradient Estimators (CSC 2547 Spring 2018, David Duvenaud)
  1. Lecture 2: Gradient Estimators. CSC 2547 Spring 2018, David Duvenaud. Based mainly on slides by Will Grathwohl, Dami Choi, Yuhuai Wu and Geoff Roeder

  2. Where do we see this guy?
    L(θ) = E_{p(b|θ)}[f(b)]
• Just about everywhere!
• Variational Inference
• Reinforcement Learning
• Hard Attention
• And so many more!

  3. Gradient-based optimization
• Gradient-based optimization is the standard method used today to optimize expectations
• Necessary if models are neural-net based
• Very rarely can this gradient be computed analytically

  4. Otherwise, we estimate…
• A number of approaches exist to estimate this gradient
• They make varying levels of assumptions about the distribution and function being optimized
• Most popular methods either make strong assumptions or suffer from high variance

  5. REINFORCE (Williams, 1992)
    ĝ_REINFORCE[f] = f(b) ∂/∂θ log p(b|θ),  b ∼ p(b|θ)
• Unbiased
• Suffers from high variance
• Has few requirements
• Easy to compute
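
To make this concrete, here is a minimal single-sample sketch in PyTorch (my own illustration, not code from the lecture), for a Bernoulli with logit θ and the toy objective f(b) = (b − t)² used later in the deck:

```python
import torch

def f(b):
    # Black-box objective: REINFORCE only needs f's values, never its gradients.
    return (b - 0.45) ** 2

theta = torch.tensor(0.3, requires_grad=True)              # logit of p(b = 1 | theta)
dist = torch.distributions.Bernoulli(logits=theta)

b = dist.sample()                                          # b ~ p(b | theta)
score = torch.autograd.grad(dist.log_prob(b), theta)[0]    # d/dtheta log p(b | theta)
g_reinforce = f(b) * score                                 # unbiased, single sample
print(g_reinforce.item())
```

Averaging many such samples reduces the variance, but each single-sample estimate is already unbiased.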

  6. Reparameterization (Kingma & Welling, 2014)
    ĝ_reparam[f] = (∂f/∂b)(∂b/∂θ),  b = T(θ, ε),  ε ∼ p(ε)
• Makes stronger assumptions
• Lower variance empirically
• Unbiased
• Requires f(b) is known and differentiable
• Requires p(b|θ) is reparameterizable
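
For comparison, a minimal sketch of the reparameterization estimator, assuming a Gaussian p(b|θ) with parameters θ = (μ, log σ) so that b = T(θ, ε) = μ + σε; the setup and names are illustrative:

```python
import torch

def f(b):
    return (b - 0.45) ** 2              # f must be known and differentiable in b

mu = torch.tensor(0.0, requires_grad=True)
log_sigma = torch.tensor(0.0, requires_grad=True)

eps = torch.randn(())                   # eps ~ p(eps), independent of the parameters
b = mu + torch.exp(log_sigma) * eps     # b = T(theta, eps): the sample is a
                                        # differentiable function of the parameters
g_mu, g_log_sigma = torch.autograd.grad(f(b), (mu, log_sigma))
print(g_mu.item(), g_log_sigma.item())  # pathwise gradients, typically low variance
```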

  7. Concrete (Maddison et al., 2016)
    ĝ_concrete[f] = (∂f/∂σ(z/t))(∂σ(z/t)/∂θ),  z = T(θ, ε),  ε ∼ p(ε)
• Biased
• Works well in practice
• Low variance from reparameterization
• Adds temperature hyper-parameter t
• Requires f(b) is known and differentiable
• Requires p(z|θ) is reparameterizable
• Requires f(b) behaves predictably outside of its domain
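
A minimal sketch of the Concrete-style relaxation for a single Bernoulli with logit θ: sample the logistic relaxation z, replace the hard threshold 1[z > 0] with σ(z/t), and differentiate through the soft sample. The temperature and objective are illustrative choices:

```python
import torch

def f(b):
    # Evaluated on the relaxed sample sigma(z / t), so f has to behave
    # sensibly for values strictly between 0 and 1.
    return (b - 0.45) ** 2

theta = torch.tensor(0.3, requires_grad=True)   # logit of p(b = 1 | theta)
t = 0.5                                         # temperature hyper-parameter

u = torch.rand(())
z = theta + torch.log(u) - torch.log1p(-u)      # z = T(theta, eps): logistic noise
b_relaxed = torch.sigmoid(z / t)                # soft sample in place of the hard threshold
g_concrete = torch.autograd.grad(f(b_relaxed), theta)[0]   # biased, low variance
print(g_concrete.item())
```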

  8. Control Variates
• Allow us to reduce the variance of a Monte Carlo estimator:
    ĝ_new(b) = ĝ(b) − c(b) + E_{p(b)}[c(b)]
• Variance is reduced if corr(g, c) > 0
• Does not change the bias
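
A quick numerical illustration (my own, not from the slides): estimate E[exp(b)] for b ∼ N(0, 1), using c(b) = 1 + b, whose expectation under p is known to be 1, as a control variate:

```python
import torch

torch.manual_seed(0)
b = torch.randn(100_000)              # b ~ p(b) = N(0, 1)
g = torch.exp(b)                      # samples of the quantity whose mean we want
c = 1.0 + b                           # control variate with known E_p[c] = 1

g_new = g - c + 1.0                   # same expectation, smaller variance
print(g.mean().item(), g_new.mean().item())   # both near exp(0.5) ~ 1.65
print(g.var().item(), g_new.var().item())     # variance drops: g and c are correlated
```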

  9. Putting it all together
• We would like a general gradient estimator that is
  • unbiased
  • low variance
  • usable when f(b) is unknown
  • usable when p(b|θ) is discrete

  10.–12. Backpropagation Through the Void
Will Grathwohl, Dami Choi, Yuhuai Wu, Geoff Roeder, David Duvenaud

  13. Our Approach
    ĝ_LAX = ĝ_REINFORCE[f] − ĝ_REINFORCE[c_φ] + ĝ_reparam[c_φ]

  14. Our Approach
    ĝ_LAX = ĝ_REINFORCE[f] − ĝ_REINFORCE[c_φ] + ĝ_reparam[c_φ]
          = [f(b) − c_φ(b)] ∂/∂θ log p(b|θ) + ∂/∂θ c_φ(b)
• Start with the REINFORCE estimator for f(b)
• We introduce a new function c_φ(b)
• We subtract the REINFORCE estimator of its gradient and add the reparameterization estimator
• Can be thought of as using the REINFORCE estimator of c_φ(b) as a control variate
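
A minimal single-sample sketch of this estimator for a continuous case, assuming p(b|θ) = N(θ, 1) so that b can be reparameterized as b = θ + ε; the MLP surrogate c_φ and all names are illustrative:

```python
import torch

def f(b):
    return (b - 0.45) ** 2

theta = torch.tensor(0.0, requires_grad=True)        # p(b | theta) = N(theta, 1)
c_phi = torch.nn.Sequential(torch.nn.Linear(1, 16), torch.nn.ReLU(),
                            torch.nn.Linear(16, 1))  # free-form surrogate

eps = torch.randn(())
b = theta + eps                                       # b = T(theta, eps)

# REINFORCE term applied to the residual f(b) - c_phi(b)
log_prob = torch.distributions.Normal(theta, 1.0).log_prob(b.detach())
score = torch.autograd.grad(log_prob, theta)[0]
residual = f(b.detach()) - c_phi(b.detach().view(1, 1)).squeeze().detach()

# Reparameterization term applied to c_phi(b), differentiating through b = T(theta, eps)
reparam = torch.autograd.grad(c_phi(b.view(1, 1)).squeeze(), theta)[0]

g_lax = residual * score + reparam                    # unbiased for any c_phi
print(g_lax.item())
```

If c_φ equals f (and f is differentiable), the two REINFORCE pieces cancel and ĝ_LAX reduces to the reparameterization estimator; if c_φ ≡ 0, it reduces to plain REINFORCE.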

  15. Optimizing the Control Variate
    ∂/∂φ Variance(ĝ) = E[∂ĝ²/∂φ]
• For any unbiased estimator ĝ we can get Monte Carlo estimates of the gradient of the variance of ĝ
• Use this to optimize c_φ
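
Continuing the same toy Gaussian setup (again my own sketch), the identity on this slide turns into a simple training loop: because ĝ is unbiased for every φ, single-sample gradients of ĝ² are unbiased estimates of ∂/∂φ Var(ĝ) and can be used to fit c_φ:

```python
import torch

def f(b):
    return (b - 0.45) ** 2

theta = torch.tensor(0.0, requires_grad=True)
c_phi = torch.nn.Sequential(torch.nn.Linear(1, 16), torch.nn.ReLU(),
                            torch.nn.Linear(16, 1))
opt_phi = torch.optim.Adam(c_phi.parameters(), lr=1e-2)

def lax_sample():
    eps = torch.randn(())
    b = theta + eps
    log_prob = torch.distributions.Normal(theta, 1.0).log_prob(b.detach())
    score = torch.autograd.grad(log_prob, theta)[0]
    residual = f(b.detach()) - c_phi(b.detach().view(1, 1)).squeeze()
    reparam = torch.autograd.grad(c_phi(b.view(1, 1)).squeeze(), theta,
                                  create_graph=True)[0]
    return residual * score + reparam       # single-sample g_hat, differentiable in phi

for step in range(500):
    g_hat = lax_sample()
    opt_phi.zero_grad()
    (g_hat ** 2).backward()                 # Monte Carlo estimate of d/dphi Var(g_hat)
    opt_phi.step()
    theta.grad = None                       # theta itself would be updated with g_hat elsewhere
```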

  16. What about discrete b?

  17. Extension to discrete p(b|θ)
    ĝ_RELAX = [f(b) − c_φ(z̃)] ∂/∂θ log p(b|θ) + ∂/∂θ c_φ(z) − ∂/∂θ c_φ(z̃)
    b = H(z),  z ∼ p(z|θ),  z̃ ∼ p(z|b, θ)
• When b is discrete, we introduce a relaxed distribution p(z|θ) and a function H where H(z) = b ∼ p(b|θ)
• We use the conditioning scheme introduced in REBAR (Tucker et al., 2017)
• Unbiased for all c_φ
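
A minimal sketch of RELAX for a single Bernoulli with logit θ. The conditional sample z̃ ∼ p(z|b, θ) uses the standard logistic construction (condition the uniform noise on the observed b, then reparameterize); the surrogate and other details are my own illustration, not the paper's code:

```python
import torch

def f(b):
    return (b - 0.499) ** 2                        # toy objective from the next slide

theta = torch.tensor(0.0, requires_grad=True)      # logit of p(b = 1 | theta)
c_phi = torch.nn.Sequential(torch.nn.Linear(1, 16), torch.nn.ReLU(),
                            torch.nn.Linear(16, 1))

p = torch.sigmoid(theta)
u, v = torch.rand(()), torch.rand(())

z = theta + torch.log(u) - torch.log1p(-u)         # z ~ p(z | theta), reparameterized
b = (z > 0).float()                                # b = H(z), so b ~ p(b | theta)

# z_tilde ~ p(z | b, theta): resample the uniform noise consistently with observed b
u_cond = b * ((1 - p) + v * p) + (1 - b) * v * (1 - p)
z_tilde = theta + torch.log(u_cond) - torch.log1p(-u_cond)

score = torch.autograd.grad(
    torch.distributions.Bernoulli(logits=theta).log_prob(b), theta)[0]
c_z = c_phi(z.view(1, 1)).squeeze()
c_zt = c_phi(z_tilde.view(1, 1)).squeeze()
d_c_z = torch.autograd.grad(c_z, theta)[0]         # reparameterization gradients
d_c_zt = torch.autograd.grad(c_zt, theta)[0]

g_relax = (f(b) - c_zt.detach()) * score + d_c_z - d_c_zt
print(g_relax.item())                              # unbiased for any c_phi
```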

  18. A Simple Example
    E_{p(b|θ)}[(t − b)²]
• Used to validate REBAR (used t = 0.45)
• We use t = 0.499
• REBAR and REINFORCE fail due to noise outweighing signal
• Can RELAX improve?
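
To see why the signal is so small here, a quick check (my own, assuming θ is the Bernoulli probability p(b = 1)):

```latex
\mathbb{E}_{p(b \mid \theta)}[(t - b)^2] = (1 - \theta)\,t^2 + \theta\,(1 - t)^2
\quad\Longrightarrow\quad
\frac{\partial}{\partial\theta}\,\mathbb{E}_{p(b \mid \theta)}[(t - b)^2] = (1 - t)^2 - t^2 = 1 - 2t
```

At t = 0.499 the true gradient is only 0.002, versus 0.1 at t = 0.45, so single-sample estimator noise easily dominates the signal.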

  19.
• RELAX outperforms baselines
• Considerably reduced variance!
• RELAX learns a reasonable surrogate

  20. Analyzing the Surrogate
• REBAR's fixed surrogate cannot produce consistent and correct gradients
• RELAX learns to balance REINFORCE variance and reparameterization variance

  21. A More Interesting Application
    log p(x) ≥ L(θ) = E_{q(b|x)}[log p(x|b) + log p(b) − log q(b|x)]
• Discrete VAE
• Latent state is 200 Bernoulli variables
• Discrete sampling makes the reparameterization estimator unusable
• Surrogate: c_φ(z) = f(σ_λ(z)) + r_ρ(z)
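
One possible way to wire up that surrogate, with shapes and module names that are my own guesses rather than the paper's code: a tempered sigmoid σ_λ of the relaxed sample is fed through the (differentiable) ELBO term f, plus a small free-form residual network r_ρ:

```python
import torch
import torch.nn as nn

class RelaxedSurrogate(nn.Module):
    """c_phi(z) = f(sigma_lambda(z)) + r_rho(z) for a vector of Bernoulli latents."""

    def __init__(self, f, n_latent=200):
        super().__init__()
        self.f = f                                       # differentiable ELBO term in b
        self.log_temperature = nn.Parameter(torch.zeros(()))          # lambda
        self.residual = nn.Sequential(                   # r_rho: free-form correction
            nn.Linear(n_latent, 200), nn.ReLU(), nn.Linear(200, 1))

    def forward(self, z):
        soft_b = torch.sigmoid(z / torch.exp(self.log_temperature))   # sigma_lambda(z)
        return self.f(soft_b) + self.residual(z).squeeze(-1)
```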

  22. Results

  23. Reinforcement Learning
• Policy gradient methods are very popular today (A2C, A3C, ACKTR)
• Seeks to find argmax_θ E_{τ∼π(τ|θ)}[R(τ)]
• Does this by estimating ∂/∂θ E_{τ∼π(τ|θ)}[R(τ)]
• R is not known, so many popular estimators cannot be used

  24. Actor Critic
    ĝ_AC = Σ_{t=1}^{T} ∂ log π(a_t|s_t, θ)/∂θ ( Σ_{t'=t}^{T} r_{t'} − c_φ(s_t) )
• c_φ is an estimate of the value function
• This is exactly the REINFORCE estimator using an estimate of the value function as a control variate
• Why not use the action in the control variate?
• Dependence on the action would add bias
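
A minimal sketch of this estimator over one collected episode, assuming undiscounted returns-to-go as written on the slide; tensor names and shapes are mine:

```python
import torch

def actor_critic_surrogate(log_probs, rewards, values):
    """log_probs: [T] log pi(a_t | s_t, theta), still connected to theta's graph
    rewards:   [T] observed rewards r_t
    values:    [T] baseline c_phi(s_t), detached value estimates"""
    # Return-to-go: sum_{t' = t}^{T} r_{t'}
    returns = torch.flip(torch.cumsum(torch.flip(rewards, [0]), dim=0), [0])
    advantage = (returns - values).detach()      # state-only control variate
    # Calling .backward() on this scalar yields g_AC with respect to theta.
    return (log_probs * advantage).sum()
```

Because c_φ(s_t) does not depend on the sampled action, subtracting it leaves the expected gradient unchanged.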

  25. LAX for RL
    ĝ_LAX = Σ_{t=1}^{T} [ ∂ log π(a_t|s_t, θ)/∂θ ( Σ_{t'=t}^{T} r_{t'} − c_φ(s_t, a_t) ) + ∂/∂θ c_φ(s_t, a_t) ]
• Allows for action dependence in the control variate
• Remains unbiased
• A similar extension is available for discrete action spaces
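
A matching sketch for the LAX version, assuming a reparameterizable (e.g. Gaussian) policy so that the last term can be computed pathwise through a_t = T(θ, ε_t); again, everything here is illustrative:

```python
import torch

def lax_rl_surrogate(log_probs, rewards, c_sa, c_sa_reparam):
    """log_probs:    [T] log pi(a_t | s_t, theta), connected to theta
    rewards:      [T] observed rewards r_t
    c_sa:         [T] c_phi(s_t, a_t) with a_t treated as a constant
    c_sa_reparam: [T] c_phi(s_t, a_t) with a_t = T(theta, eps_t), connected to theta"""
    returns = torch.flip(torch.cumsum(torch.flip(rewards, [0]), dim=0), [0])
    advantage = (returns - c_sa).detach()
    # The extra reparameterization term corrects for the action-dependent baseline,
    # keeping the estimator unbiased; .backward() on this scalar yields g_LAX.
    return (log_probs * advantage).sum() + c_sa_reparam.sum()
```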

  26. Results
• Improved performance
• Lower-variance gradient estimates

  27. Future Work
• What does the optimal surrogate look like?
• Many possible variations of LAX and RELAX
  • Which provides the best tradeoff between variance, ease of implementation, scope of application, and performance?
• RL
  • Incorporate other variance reduction techniques (GAE, reward bootstrapping, trust regions)
  • Ways to train the surrogate off-policy
• Applications
  • Inference of graph structure (coming soon)
  • Inference of discrete neural network architecture components (coming soon)

  28. Directions
• The surrogate can take any form
  • It can rely on global information even if the forward pass only uses local info
  • It can depend on order even if the forward pass is invariant
• Reparameterization can take many forms; there is ongoing work on reparameterizing through rejection sampling, or distributions on permutations

  29. Reparameterizing the Birkhoff Polytope for Variational Permutation Inference

  30. Learning Latent Permutations with Gumbel-Sinkhorn Networks

  31. Why are we optimizing policies anyways? • Next week: Variational optimization
