Low-Variance and Zero-Variance Baselines in Extensive-Form Games




  1. Low-Variance and Zero-Variance Baselines in Extensive-Form Games. Trevor Davis (2,*), Martin Schmid (1), Michael Bowling (1,2). *Work done during internship at DeepMind.

  2. Monte Carlo game solving. Extensive-form games (EFGs).


  4. Baseline functions - evaluating unsampled actions

  5. Our Contribution. Compared with VR-MCCFR (Schmid et al., AAAI 2019), this work achieves lower variance and faster convergence, with provably zero-variance samples.

  8. Monte Carlo evaluation. Unbiased updates at h; unsampled actions receive an estimate of zero.
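The update formulas on this slide were images and did not survive extraction. As a hedged reconstruction (notation mine, not copied from the deck), the standard outcome-sampling estimate that such an update presumably uses is:

```latex
% Sampled estimate of the value of history h under strategy \sigma,
% when action \hat{a} is drawn from sampling distribution q(\cdot \mid h).
% \hat{v}(h \cdot a) denotes the recursively computed child estimate.
\hat{u}(h) = \sum_{a} \sigma(a \mid h)\, \hat{u}(h \cdot a),
\qquad
\hat{u}(h \cdot a) =
\begin{cases}
  \hat{v}(h \cdot a) / q(a \mid h) & \text{if } a = \hat{a} \text{ (sampled)},\\[2pt]
  0 & \text{otherwise (unsampled)}.
\end{cases}
```

Taking the expectation over the sampled action recovers the true value, since the sampled branch is reweighted by 1/q(a | h): E[û(h·a)] = q(a | h) · v(h·a) / q(a | h) = v(h·a).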

  9. Baseline functions

  12. Evaluation with baseline. Without a baseline, the sampled value is used directly; with a baseline, a correction term is added as a control variate.
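The baseline-corrected formula was likewise an image on this slide. A hedged sketch of the control-variate construction it describes (same notation as above, with b(h, a) the baseline):

```latex
% Baseline-corrected child estimate: the baseline is always added, and
% subtracted back (importance-weighted) on the sampled branch only.
\hat{u}^{b}(h \cdot a) = b(h, a)
  + \frac{\mathbb{1}[a = \hat{a}]}{q(a \mid h)}
    \left( \hat{v}(h \cdot a) - b(h, a) \right)
```

In expectation the correction term contributes v(h·a) − b(h, a), so the estimate stays unbiased for any choice of b; the closer b(h, a) is to the true expected value, the smaller the variance.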

  13. Theoretical results. Theorem 1: baseline-corrected values are unbiased. Theorem 2: each baseline-corrected value has variance bounded by a sum of squared prediction errors in the subtree rooted at a.

  15. Baseline function selection. We want the baseline to match the expected sampled value. Learned history baseline: set b to the average of previous samples. Note: the expectation depends on the strategies, which change over time, so it is not stationary; the running average is therefore not an unbiased estimate of the current expectation, but the baseline-corrected values remain unbiased regardless.
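As a minimal sketch of the learned history baseline described above (class and method names are mine, not the paper's): keep, for every (history, action) pair, a running average of the sampled values seen so far, and use that average as b(h, a):

```python
# Illustrative sketch of a learned history baseline: b(h, a) is the
# running average of previously sampled values at (h, a).
from collections import defaultdict

class LearnedHistoryBaseline:
    def __init__(self):
        self.totals = defaultdict(float)  # sum of sampled values per (h, a)
        self.counts = defaultdict(int)    # number of samples per (h, a)

    def value(self, history, action):
        """Current baseline b(h, a): average of past samples (0 if unseen)."""
        key = (history, action)
        if self.counts[key] == 0:
            return 0.0
        return self.totals[key] / self.counts[key]

    def update(self, history, action, sampled_value):
        """Fold one newly sampled estimate into the running average."""
        key = (history, action)
        self.totals[key] += sampled_value
        self.counts[key] += 1

b = LearnedHistoryBaseline()
for v in [1.0, 3.0]:
    b.update("h", "a", v)
print(b.value("h", "a"))  # -> 2.0
```

Unvisited pairs fall back to a baseline of zero, which recovers the plain (uncorrected) estimator at those histories.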

  16. Baseline convergence evaluation. Leduc poker, Monte Carlo Counterfactual Regret Minimization (MCCFR+). Compared: no baseline; VR-MCCFR (Schmid et al.); learned history baseline.

  18. Predictive baseline. With the learned history baseline, the optimal baseline depends on the strategy update; so use the updated strategy to update the baseline, setting it recursively along the sampled trajectory.
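A hedged sketch of the recursion that "recursively set" presumably refers to (my notation; the slide's formula was lost): after the regret update produces the new strategy σ^{t+1}, propagate baselines bottom-up along the sampled trajectory:

```latex
% Predictive baseline: set b to the expected baseline-corrected value
% of the child under the strategy that will actually be played next.
b^{t+1}(h, a) \leftarrow
  \sum_{a'} \sigma^{t+1}(a' \mid h \cdot a)\, \hat{u}^{b}(h \cdot a \cdot a')
```

This targets the variance-minimizing choice b(h, a) = E[v̂(h·a)] under the upcoming strategy, rather than an average over past, now-stale strategies.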

  19. Zero-variance updates. Theorem: if we use the predictive baseline, sample public outcomes, and sample every outcome at least once, then the baseline-corrected values have zero variance.
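A toy numeric check of why a perfect baseline removes the variance (an illustrative one-node example, not the paper's full public-outcome construction; all names are mine): when b(h, a) exactly equals the true action value, the baseline-corrected node estimate is identical no matter which action the sampler draws:

```python
# One decision node: with a baseline that matches the true action values,
# every possible sample yields the same corrected estimate (zero variance).

def corrected_value(strategy, u, b, q, sampled):
    """Baseline-corrected node estimate when action `sampled` is drawn.

    Sampled action a: b[a] + (u[a] - b[a]) / q[a]; unsampled actions: b[a].
    The node estimate mixes these with the current strategy.
    """
    est = 0.0
    for a, prob in strategy.items():
        if a == sampled:
            est += prob * (b[a] + (u[a] - b[a]) / q[a])
        else:
            est += prob * b[a]
    return est

strategy = {"left": 0.3, "right": 0.7}  # current policy sigma(a|h)
u = {"left": 1.0, "right": -2.0}        # true action values
q = {"left": 0.5, "right": 0.5}         # sampling distribution
b = dict(u)                             # perfect baseline: b(h,a) = u(h,a)

true_value = sum(strategy[a] * u[a] for a in strategy)
estimates = [corrected_value(strategy, u, b, q, a) for a in strategy]
print(all(abs(e - true_value) < 1e-12 for e in estimates))  # -> True
```

With an imperfect baseline the two estimates would differ; the paper's theorem extends this node-level observation to whole trees under the slide's three conditions.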

  20. Baseline variance evaluation. Leduc poker, Monte Carlo Counterfactual Regret Minimization (MCCFR+). Compared: no baseline; VR-MCCFR (Schmid et al.); learned history baseline; predictive baseline.

  21. Conclusion. Lower variance, faster convergence; provable zero-variance samples.
