
Efficient Nash Equilibrium Approximation through Monte Carlo Counterfactual Regret Minimization - PowerPoint PPT Presentation

Michael Johanson, Nolan Bard, Marc Lanctot, Richard Gibson, Michael Bowling. AAMAS 2012 - June 6, 2012.


  1. Efficient Nash Equilibrium Approximation through Monte Carlo Counterfactual Regret Minimization. AAMAS 2012 - June 6, 2012. Michael Johanson, Nolan Bard, Marc Lanctot, Richard Gibson, Michael Bowling. University of Alberta Computer Poker Research Group. Wednesday, November 14, 2012

  2. Motivation. Tackling the practical challenge of Nash equilibrium computation in large games. A Nash equilibrium strategy is guaranteed not to lose in expectation (in 2-player, zero-sum games) - a very useful property in practice. It is the dominant approach in the Annual Computer Poker Competition, and in 2008 an equilibrium approximation beat human professionals at 2-player limit Texas hold'em poker.

  3. Motivation: Size of Game Solved. The poker community is now solving games with 10^9 decisions (information sets). LPs don't scale to games of this size, but we've made great progress on efficient approximation algorithms (CFR, EGT). [Chart: number of information sets solved, 10^5 up to 10^10, by Computer Poker Competition year, 2006-2011.]

  4. Counterfactual Regret Minimization (CFR), NIPS 2007. CFR is the competition's most popular algorithm. Iterative, resembling self-play, with a reinforcement learning flavour. Memory efficient (2 doubles per infoset-action). Converges quickly (O(1/ε^2) iterations). Programmer friendly: easy to implement and optimize, with linear speedup across many cores. This paper: a new CFR variant that converges more quickly in imperfect information games. [Chart: CFR convergence - best response (mbb/game) vs. computation time (seconds).]

  5. Counterfactual Regret Minimization (CFR). Basic idea: start with two uniform random strategies and play them against each other. Put a regret-minimizing agent at every decision, and let it independently learn its part of the strategy. Run many iterations: walk the game tree while the agents update their parts of the strategy. The average strategy profile converges to a Nash equilibrium. Example from the slide: Regret(I) = (-2, 1, 4) gives σ(I) = (0, 0.2, 0.8).
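The per-decision update the slide illustrates is regret matching: each action's probability is proportional to its positive cumulative regret. A minimal sketch (function name is mine, not from the deck):

```python
def regret_matching(regrets):
    """Build a strategy at one information set from cumulative regrets:
    each action's probability is proportional to its positive regret.
    If no action has positive regret, play uniformly at random."""
    positives = [max(r, 0.0) for r in regrets]
    total = sum(positives)
    if total > 0.0:
        return [p / total for p in positives]
    return [1.0 / len(regrets)] * len(regrets)

# The slide's example: Regret(I) = (-2, 1, 4) yields sigma(I) = (0, 0.2, 0.8).
print(regret_matching([-2.0, 1.0, 4.0]))  # [0.0, 0.2, 0.8]
```

This reproduces the slide's numbers: only the actions with regrets 1 and 4 get probability mass, split 1/5 and 4/5.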

  6. Counterfactual Regret Minimization (CFR). To update a decision, we need: (a) the probability of the other players taking their series of actions, e.g. π_-i(I) = 0.2, and (b) the expected value (or an unbiased estimate) of each action's utility given the opponent's strategy, e.g. V(I) = (-2, 2, 6). Recursively walk the tree: push the opponent's action probabilities forwards, and return the expected value of each terminal node or subtree.
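The recursive walk can be sketched as follows. This toy (the tree encoding and function name are my own, not the paper's) shows only the value pass: opponent reach probabilities flow down, expected values flow up; full CFR would additionally weight regret updates at my decisions by the opponent reach.

```python
def expected_value(node, opp_reach=1.0):
    """Recursive game-tree walk: push the opponent's reach probability
    forwards, return expected values upwards. A node is either a
    terminal payoff (a number) or a tuple
    ('me' | 'opp', action_probabilities, children)."""
    if isinstance(node, (int, float)):
        return float(node)
    player, probs, children = node
    values = []
    for p, child in zip(probs, children):
        # Only the opponent's own action probabilities are pushed into
        # the reach probability handed down the tree.
        down = opp_reach * p if player == 'opp' else opp_reach
        values.append(expected_value(child, down))
    return sum(p * v for p, v in zip(probs, values))

# Toy tree: I mix 50/50 between a sure payoff of 1 and an opponent
# node that mixes 50/50 between payoffs 0 and 2.
tree = ('me', [0.5, 0.5], [1.0, ('opp', [0.5, 0.5], [0.0, 2.0])])
print(expected_value(tree))  # 1.0
```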

  7. Chance-Sampled CFR. In practice, a sampling variant of CFR is used. Chance sampling: on each iteration, randomly sample one set of chance events (public chance, my private chance, opponent private chance) and update only that part of the tree. Recursion: PASS one scalar (the opponent's reach probability); RETURN one scalar (the value of the subgame). Terminal nodes: get an unbiased estimate of my state's value in O(1) time.
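The key property that makes chance sampling sound is that the sampled value is an unbiased estimate of the exact expectation over chance outcomes. A small illustration (the outcomes and payoffs are invented for the example):

```python
import random

def exact_value(outcomes, probs, value):
    """Exact expectation over chance outcomes: sum of p * value(o)."""
    return sum(p * value(o) for p, o in zip(probs, outcomes))

def sampled_value(outcomes, probs, value):
    """Chance sampling: draw one outcome o with probability p and
    return value(o). The estimate is unbiased, so averaging over many
    iterations recovers the exact expectation."""
    o = random.choices(outcomes, weights=probs)[0]
    return value(o)

random.seed(0)
outcomes, probs = ['low', 'mid', 'high'], [0.2, 0.5, 0.3]
payoff = {'low': -1.0, 'mid': 0.0, 'high': 2.0}.get
estimate = sum(sampled_value(outcomes, probs, payoff)
               for _ in range(100000)) / 100000
print(exact_value(outcomes, probs, payoff))  # 0.4
print(estimate)  # close to 0.4
```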

  8. New CFR Sampling Variants.
Chance Sampling (CS) - sample: public chance, my private chance, opponent private chance.
Opponent-Public Chance Sampling (OPCS) - sample: public chance, opponent private chance; expand: my private chance.
Self-Public Chance Sampling (SPCS) - sample: public chance, my private chance; expand: opponent private chance.
Public Chance Sampling (PCS) - sample: public chance; expand: my private chance, opponent private chance.

  9. Opponent-Public Chance Sampling (OPCS). Sample one public chance event and one opponent private chance event; enumerate all of my possible private chance events (45 choose 2 of them). KEY OBSERVATION: the opponent can't observe my chance event, so their strategy is the same for every one of them - I can efficiently update all of these decisions in the same recursive pass! Recursion: PASS one scalar (the opponent's reach probability); RETURN a vector (the values of the subgames). Terminal nodes: n states to evaluate, taking O(n) time.
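At an OPCS terminal node this "return a vector" step is just one scalar reach probability fanned out over all of my enumerated private states. A sketch with an invented showdown payoff (the function names and the win/lose/tie model are mine, not the paper's):

```python
def opcs_terminal_values(my_states, opp_state, opp_reach, payoff):
    """Terminal evaluation under OPCS: the opponent's private chance is
    sampled (one state), mine are enumerated, so one pass returns a
    vector of values -- one entry per private state of mine, O(n)
    total work for n states."""
    return [opp_reach * payoff(mine, opp_state) for mine in my_states]

# Toy showdown payoff: higher card wins +1, lower loses -1, tie 0.
def showdown(mine, theirs):
    return float((mine > theirs) - (mine < theirs))

print(opcs_terminal_values([2, 7, 9], opp_state=7, opp_reach=0.2,
                           payoff=showdown))  # [-0.2, 0.0, 0.2]
```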

  10. New CFR Sampling Variants.
Chance Sampling (CS) - sample: public chance, my private chance, opponent private chance.
Opponent-Public Chance Sampling (OPCS) - sample: public chance, opponent private chance; expand: my private chance. Slower, but many updates per iteration.
Self-Public Chance Sampling (SPCS) - sample: public chance, my private chance; expand: opponent private chance.
Public Chance Sampling (PCS) - sample: public chance; expand: my private chance, opponent private chance.


  12. Self-Public Chance Sampling (SPCS). Sample one public chance event and one of my private chance events; enumerate all of the opponent's possible private chance events (45 choose 2 of them). This gives a much more precise estimate of my value, since I compare my state against all of theirs! Recursion: PASS one vector (the opponent's reach probabilities); RETURN one scalar (the value of the subgame). Terminal nodes: n states to evaluate. RESULT: slow but very precise updates.
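The SPCS terminal evaluation is the mirror image of OPCS: the vector of opponent reach probabilities collapses into one low-variance scalar. A sketch, reusing the same invented win/lose/tie payoff as before (names are mine, not the paper's):

```python
def spcs_terminal_value(my_state, opp_states, opp_reach, payoff):
    """Terminal evaluation under SPCS: my private chance is sampled,
    the opponent's n states are enumerated, and the returned scalar
    weights the payoff against every opponent state by that state's
    reach probability -- far less noisy than comparing against a
    single sampled opponent state."""
    return sum(r * payoff(my_state, opp)
               for opp, r in zip(opp_states, opp_reach))

# Toy showdown payoff: higher card wins +1, lower loses -1, tie 0.
def showdown(mine, theirs):
    return float((mine > theirs) - (mine < theirs))

# I hold 7; the opponent holds 2, 7, or 9 with reach 0.5, 0.3, 0.2.
print(spcs_terminal_value(7, [2, 7, 9], [0.5, 0.3, 0.2], showdown))
```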

  13. New CFR Sampling Variants.
Chance Sampling (CS) - sample: public chance, my private chance, opponent private chance.
Opponent-Public Chance Sampling (OPCS) - sample: public chance, opponent private chance; expand: my private chance. Slower, but many updates per iteration.
Self-Public Chance Sampling (SPCS) - sample: public chance, my private chance; expand: opponent private chance. Slower, but very precise updates.
Public Chance Sampling (PCS) - sample: public chance; expand: my private chance, opponent private chance.


  15. Public Chance Sampling (PCS). Sample one public chance event; enumerate all of my private chance events and all of the opponent's possible private chance events (47 choose 2 each). Terminal nodes: n states to evaluate against n states - this looks like O(n^2) work, but depending on the game's structure, O(n) is often possible, making it as fast as OPCS or SPCS! Recursion: PASS one vector (the opponent's reach probabilities); RETURN one vector (the values of the subgames). RESULT: slower, but many precise updates on each iteration.
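One way the n-vs-n evaluation can avoid quadratic work: for a win/lose/tie showdown payoff, sort the opponent's enumerated hands once and use prefix sums of reach probability, so each of my hands needs only two binary searches instead of n comparisons. A toy sketch (my own simplification: it ignores card removal, and the sort makes it O(n log n) rather than the O(n) achievable with presorted hands):

```python
import bisect

def pcs_showdown_values(my_ranks, opp_ranks, opp_reach):
    """All-pairs terminal evaluation under PCS without O(n^2) work,
    for a +1 / -1 / 0 win/lose/tie payoff: sort the opponent's hands
    by rank, accumulate prefix sums of their reach probabilities, and
    answer each of my hands with two binary searches."""
    order = sorted(range(len(opp_ranks)), key=lambda j: opp_ranks[j])
    ranks = [opp_ranks[j] for j in order]
    prefix = [0.0]
    for j in order:
        prefix.append(prefix[-1] + opp_reach[j])
    total = prefix[-1]
    values = []
    for r in my_ranks:
        lo = bisect.bisect_left(ranks, r)    # reach mass of hands I beat
        hi = bisect.bisect_right(ranks, r)   # ties contribute zero
        values.append(prefix[lo] - (total - prefix[hi]))
    return values

def naive(my_ranks, opp_ranks, opp_reach):
    """Reference O(n^2) evaluation, for checking the fast version."""
    return [sum(w * float((m > o) - (m < o))
                for o, w in zip(opp_ranks, opp_reach))
            for m in my_ranks]

my_r, opp_r, w = [3, 5, 5, 9], [1, 5, 8], [0.2, 0.5, 0.3]
print(pcs_showdown_values(my_r, opp_r, w))
print(naive(my_r, opp_r, w))
```

The two printed vectors agree up to floating-point rounding; the first is computed without comparing every pair.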

  16. New CFR Sampling Variants.
Chance Sampling (CS) - sample: public chance, my private chance, opponent private chance.
Opponent-Public Chance Sampling (OPCS) - sample: public chance, opponent private chance; expand: my private chance. Slower, but more updates per iteration.
Self-Public Chance Sampling (SPCS) - sample: public chance, my private chance; expand: opponent private chance. Slower, but very precise updates.
Public Chance Sampling (PCS) - sample: public chance; expand: my private chance, opponent private chance. Same speed, with many precise updates per iteration.

  17. Results: 2-round, 4-bet Poker. 94 million decision points (information sets). [Chart: best response (mbb/g) vs. time (seconds) on log-log axes, comparing CS, OPCS, SPCS, and PCS.]

  18. Abstracted Limit Texas Hold'em Poker. Larger abstractions are better in practice, but take longer to solve. Abstraction maps the real poker game (3x10^14 decisions/infosets) down to an abstract poker game (10^9 decisions/infosets). We can evaluate a strategy by measuring its exploitability in the abstract game.

  19. Results: Abstracted Limit Texas Hold'em Poker. [Charts: abstract best response (mbb/g) vs. time (seconds), CS vs. PCS, for four abstraction sizes: 5 buckets (3.6m decisions), 8 buckets (23.6m decisions), 10 buckets (57.3m decisions), and 12 buckets (118.6m decisions).]

  20. Alternate domain: Bluff, an imperfect information dice game. [Chart: best response vs. time (seconds), CS vs. PCS.]
