Adaptive Antithetic Sampling for Variance Reduction
Hongyu Ren*, Shengjia Zhao*, Stefano Ermon (*equal contribution)
Goal: Estimation of μ = E_{p(x)}[f(x)] is ubiquitous in machine learning problems. Examples:
Variational Autoencoder: E_{p(x)} E_{q(z|x)}[log p(x, z) − log q(z|x)]
Generative Adversarial Nets: E_{p(x)}[log D(x)] + E_{p(z)}[log(1 − D(G(z)))]
Reinforcement Learning: E_{π(τ)}[Σ_t r(s_t, a_t)] (agent and environment interacting through states, actions, and rewards)
Goal: Estimation of μ = E_{p(x)}[f(x)] is ubiquitous in machine learning problems.
i.i.d. Monte Carlo estimation: μ ≈ (f(x_1) + f(x_2))/2, with x_1, x_2 ∼ p(x).
MC is unbiased: E[(f(x_1) + f(x_2))/2] = μ.
High variance: the estimate can be far off with a small sample size.
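As a concrete illustration (not part of the slides), a minimal sketch of plain i.i.d. Monte Carlo estimation, assuming p = N(0, 1) and f(x) = x^2 so that the true μ equals 1:

```python
# Minimal sketch: i.i.d. Monte Carlo estimation of mu = E_{p(x)}[f(x)]
# with the assumed choices p = N(0, 1) and f(x) = x**2 (true mu = 1).
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x ** 2

def mc_estimate(n_samples=2):
    x = rng.standard_normal(n_samples)   # x_i ~ p(x), i.i.d.
    return f(x).mean()                   # (f(x_1) + ... + f(x_n)) / n

estimates = np.array([mc_estimate(2) for _ in range(100_000)])
print("mean of 2-sample estimates (≈ mu = 1):", estimates.mean())  # unbiased
print("variance of the 2-sample estimator:", estimates.var())      # large for small n
```

Averaged over many repetitions the estimator is centered on μ, but any single 2-sample estimate can be far from it, which is the high-variance problem the talk addresses.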
Goal: Estimation of μ = E_{p(x)}[f(x)] is ubiquitous in machine learning problems.
i.i.d. Monte Carlo estimation: μ ≈ (f(x_1) + f(x_2))/2, with x_1, x_2 ∼ p(x).
Trivial solution: use more samples!
Better solution: a better sampling strategy than i.i.d.
Antithetic Sampling
Don't sample i.i.d.: x_1, x_2 ∼ p(x_1)p(x_2).
Instead, sample from a correlated distribution: x_1, x_2 ∼ q(x_1, x_2).
Unbiased if the marginals match: q(x_1) = p(x_1) and q(x_2) = p(x_2).
Goal: minimize Var_{q(x_1, x_2)}[(f(x_1) + f(x_2))/2].
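A standard variance identity makes the goal concrete:
Var[(f(x_1) + f(x_2))/2] = (Var[f(x_1)] + Var[f(x_2)] + 2 Cov(f(x_1), f(x_2))) / 4.
Because both marginals are fixed to p, the two variance terms are the same as under i.i.d. sampling; i.i.d. sampling makes the covariance zero, while a good q makes Cov(f(x_1), f(x_2)) negative and hence the estimator variance smaller.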
Example: Negative Sampling
q(x_1, x_2) is defined by:
1. Sample x_1 ∼ p(x).
2. Set x_2 = −x_1.
[Figure: the joint q(x_1, x_2) is supported on the line x_2 = −x_1; its marginal on x_1 is p(x_1) and its marginal on x_2 is p(x_2), assuming p is symmetric about 0, so the estimator stays unbiased.]
Example: Negative Sampling — Best Case
q(x_1, x_2) defined by: 1. Sample x_1 ∼ p(x). 2. Set x_2 = −x_1.
For f(x) = x^3 (an odd function): (f(x_1) + f(x_2))/2 = 0, which matches μ = E_{p(x)}[f(x)] = 0.
Var_{q(x_1, x_2)}[(f(x_1) + f(x_2))/2] = 0: no error for a sample size of 2!
Example: Negative Sampling — Worst Case
q(x_1, x_2) defined by: 1. Sample x_1 ∼ p(x). 2. Set x_2 = −x_1.
For f(x) = x^2 (an even function): f(x_1) = f(x_2), so x_2 is redundant.
Var_{q(x_1, x_2)}[(f(x_1) + f(x_2))/2] doubles relative to i.i.d. sampling!
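A small numerical check of the best and worst cases (not from the talk), assuming p = N(0, 1):

```python
# Compare the variance of the 2-sample estimator (f(x1) + f(x2)) / 2
# under i.i.d. sampling vs. negative sampling (x2 = -x1), with p = N(0, 1).
# f(x) = x**3 is the best case, f(x) = x**2 the worst case.
import numpy as np

rng = np.random.default_rng(0)
n_trials = 200_000

def estimator_variance(f, antithetic):
    x1 = rng.standard_normal(n_trials)
    x2 = -x1 if antithetic else rng.standard_normal(n_trials)
    return ((f(x1) + f(x2)) / 2).var()

for name, f in [("f(x) = x^3", lambda x: x ** 3), ("f(x) = x^2", lambda x: x ** 2)]:
    print(name,
          "| i.i.d.:", round(estimator_variance(f, antithetic=False), 3),
          "| x2 = -x1:", round(estimator_variance(f, antithetic=True), 3))
# Expected: for x^3 the antithetic variance is ~0; for x^2 it is ~2x the i.i.d. variance.
```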
General Result
Question: is there an antithetic distribution that always works better than i.i.d.?
Yes: sampling without replacement is always a tiny bit better.
No Free Lunch (Theorem 1): no antithetic distribution works better than sampling without replacement for every function f.
Valid Distribution Set
Q_valid: the set of distributions q(x_1, x_2) that satisfy q(x_1) = p(x_1) and q(x_2) = p(x_2).
[Figure: contour plots of example joint distributions over (x_1, x_2) in Q_valid.]
Variance of Example Functions
For f_1(x) = x^3: pick this distribution. [Figure: of two joint distributions in Q_valid, one gives low variance and the other gives high variance for f_1.]
Q_valid: set of distributions q(x_1, x_2) that satisfy q(x_1) = p(x_1), q(x_2) = p(x_2).
Variance of Example Functions
For a different function f_2: pick this (different) distribution. [Figure: a member of Q_valid gives low variance for f_2, while other choices, including the one that was good for f_1, give high variance.]
Q_valid: set of distributions q(x_1, x_2) that satisfy q(x_1) = p(x_1), q(x_2) = p(x_2).
Pick a Good Distribution for a Class of Functions
F = {f_1, f_2, ...}
[Figure: one joint distribution in Q_valid gives low variance on average over F; another gives high variance on average over F.]
Q_valid: set of distributions q(x_1, x_2) that satisfy q(x_1) = p(x_1), q(x_2) = p(x_2).
Pick a Good Distribution for a Class of Functions
Training: pick a good q for several functions.
Generalization: low variance for similar functions.
[Figure: the chosen joint distribution gives low variance on average; other choices give high variance on average.]
Q_valid: set of distributions q(x_1, x_2) that satisfy q(x_1) = p(x_1), q(x_2) = p(x_2).
Training Objective
min_q E_{f∼F}[ Var_{q(x_1, x_2)}( (f(x_1) + f(x_2))/2 ) ]
s.t. q(x_1, x_2) ∈ Q_valid
Practical Training Algorithm
We design:
1. A parameterization of Q_valid via copulas.
2. A surrogate objective to optimize the variance.
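To make the copula idea concrete, here is a minimal sketch, not the paper's parameterization or surrogate objective: a single-parameter Gaussian copula couples x_1 and x_2 while leaving both marginals equal to p (taken here to be N(0, 1)), and the correlation rho is chosen by a simple grid search over the empirical estimator variance averaged across an assumed function class F.

```python
# Sketch of a copula-based member of Q_valid (assumptions: p = N(0, 1), a
# hypothetical function class F, and a grid search instead of the paper's
# learned parameterization and surrogate objective).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
F = [lambda x: x ** 3, lambda x: np.sin(x) + x ** 3]   # assumed function class

def sample_pair(rho, n):
    # Gaussian copula: correlated standard normals -> uniforms -> inverse CDF of p.
    z1 = rng.standard_normal(n)
    z2 = rho * z1 + np.sqrt(1.0 - rho ** 2) * rng.standard_normal(n)
    u1, u2 = norm.cdf(z1), norm.cdf(z2)
    return norm.ppf(u1), norm.ppf(u2)     # both marginals are exactly p = N(0, 1)

def avg_variance(rho, n=100_000):
    # Empirical variance of (f(x1) + f(x2)) / 2, averaged over the function class.
    x1, x2 = sample_pair(rho, n)
    return float(np.mean([((f(x1) + f(x2)) / 2.0).var() for f in F]))

rhos = np.linspace(-0.99, 0.99, 21)
best_rho = min(rhos, key=avg_variance)
print("best rho:", round(best_rho, 2),
      "| avg variance:", round(avg_variance(best_rho), 3),
      "| i.i.d. (rho = 0):", round(avg_variance(0.0), 3))
```

The sketch only illustrates why any copula-based q keeps the marginals fixed (and hence stays in Q_valid); the paper's method instead learns a parameterized copula by optimizing its surrogate objective for the task at hand.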
Wasserstein GAN with gradient penalty
[Figures: variance of the gradient and inception score, plotted against wall-clock time, batch size, and iteration.]
Gulrajani, Ishaan, et al. "Improved Training of Wasserstein GANs." Advances in Neural Information Processing Systems, 2017.
Importance Weighted Autoencoder
[Figures: our method vs. negative sampling and our method vs. i.i.d. sampling — probability of improvement and log-likelihood improvement (higher is better).]
Burda, Yuri, Roger Grosse, and Ruslan Salakhutdinov. "Importance Weighted Autoencoders." arXiv preprint arXiv:1509.00519, 2015.
Conclusion
• Define a general family of (parameterized) unbiased antithetic distributions.
• Propose an optimization framework to learn the antithetic distribution based on the task at hand.
• Sampling from the resulting joint distribution reduces variance at negligible computational cost.
Welcome to our poster session for further discussion! Thursday 6:30-9pm @ Pacific Ballroom #205