Monte-Carlo tree search for multi-player, no-limit Texas hold'em poker

Guy Van den Broeck. Should I bluff? (deceptive play) Is he bluffing? (opponent model)


  1. MCTS for games with uncertainty?
     - Expected reward distributions (ERD)
     - Sample selection using ERD
     - Backpropagation of ERD [VandenBroeck09]
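The slides show no code, so the following is a minimal sketch of one way to track an expected reward distribution per tree node (class and field names are illustrative, not from the talk): a running mean and variance of the backpropagated rewards, plus a standard error of the value estimate that shrinks as more samples arrive.

```python
from dataclasses import dataclass
import math


@dataclass
class ERD:
    """Expected reward distribution of a tree node, estimated from samples (sketch)."""
    n: int = 0          # number of rewards backpropagated through this node
    mean: float = 0.0   # running mean reward
    m2: float = 0.0     # running sum of squared deviations (Welford's method)

    def add(self, reward: float) -> None:
        """Fold one backpropagated reward into the distribution estimate."""
        self.n += 1
        delta = reward - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (reward - self.mean)

    @property
    def variance(self) -> float:
        return self.m2 / (self.n - 1) if self.n > 1 else float("inf")

    @property
    def stderr(self) -> float:
        """Uncertainty of the mean estimate; shrinks as more samples arrive."""
        return math.sqrt(self.variance / self.n) if self.n > 1 else float("inf")
```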

  2–17. Expected reward distribution (figure build)
     - Estimating the MiniMax and ExpectiMax/MixiMax value T(P) of a node from 10, 100, and ∞ samples
     - The variance of the estimate shrinks as more samples are drawn
     - Under MiniMax the spread of the estimate is due to sampling only; under ExpectiMax/MixiMax it combines the game's own uncertainty with sampling

  18–20. ERD selection strategy
     - Objective: find the action with the maximum expected reward
     - Sample more in subtrees with (1) a high expected reward and (2) an uncertain estimate
     - UCT does (1) but not really (2)
     - CrazyStone does (1) and (2), but for deterministic games (Go)
     - UCT+ selection: (1) "expected value under perfect play" + (2) "measure of uncertainty due to sampling"
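A hedged sketch of this UCT+-style rule: pick the child maximizing the estimated expected value plus a constant times its sampling uncertainty, taken here as the standard error of the child's ERD (matching the "UCT+ (stddev)" label on the experiments slide). The constant C and the node layout are assumptions of the sketch, not details from the talk.

```python
def uct_plus_select(children, C=2.0):
    """Pick the child with the highest value estimate plus C times its sampling uncertainty.

    Each child is assumed to carry an `erd` attribute as in the ERD sketch above;
    unvisited children have infinite stderr and are therefore explored first.
    """
    return max(children, key=lambda child: child.erd.mean + C * child.erd.stderr)
```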

  21–26. ERD max-distribution backpropagation
     - A max node has children A (estimated value 3) and B (estimated value 4)
     - Sample-weighted backpropagation yields 3.5; plain max backpropagation yields 4
     - Motivation: "When the game reaches P, we'll have more time to find the real maximum"
     - Max-distribution backpropagation yields 4.5
     - Example with P(A<4) = 0.8, P(A>4) = 0.2, P(B<4) = 0.5, P(B>4) = 0.5 (A and B independent):

                  A<4        A>4
         B<4   0.8*0.5    0.2*0.5
         B>4   0.8*0.5    0.2*0.5

       P(max(A,B) > 4) = 1 - P(A<4)*P(B<4) = 0.6 > 0.5
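The max-distribution step can be made concrete with a small sketch: if each child's ERD is approximated by an independent Gaussian (an assumption of this sketch, not a claim about the talk's implementation), the probability that the best child exceeds a threshold is one minus the product of the children's CDFs, reproducing the 0.6 > 0.5 example above.

```python
import math


def normal_cdf(x: float, mean: float, std: float) -> float:
    """CDF of a Gaussian with the given mean and standard deviation."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def prob_max_exceeds(threshold: float, children) -> float:
    """P(max over children > threshold) for independent, roughly Gaussian child ERDs.

    With P(A > 4) = 0.2 and P(B > 4) = 0.5 this gives 1 - 0.8 * 0.5 = 0.6,
    matching the table on the slide.
    """
    p_all_below = 1.0
    for child in children:
        p_all_below *= normal_cdf(threshold, child.erd.mean, child.erd.stderr)
    return 1.0 - p_all_below
```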

  27. Experiments: 2 MCTS players against each other
     - Backpropagation: max-distribution vs. sample-weighted
     - Selection: UCT+ (stddev) vs. UCT

  28. Outline
     - Overview of the approach
       - The poker game tree
       - Opponent model
       - Monte-Carlo tree search
     - Research challenges
       - Search: uncertainty in MCTS, continuous action spaces
       - Opponent model: online learning, concept drift
     - Conclusion

  29. Dealing with continuous actions
     - Sample a discrete set of (relative) actions
     - Progressive bet-size unpruning [Chaslot08] (ignores smoothness of the EV function)
     - ...
     - Tree learning search (work in progress)

  30. Tree learning search
     - Based on regression tree induction from data streams
       - training examples arrive quickly
       - nodes split when there is a significant reduction in stddev
       - training examples are immediately forgotten
     - Edges in the TLS tree are not single actions but sets of actions, e.g., (raise in [2,40]), (fold or call)
     - MCTS provides a stream of (action, EV) examples
     - Split action sets to reduce the stddev of EV (when significant), as in the sketch below
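A rough sketch of the splitting step. For simplicity it assumes the node keeps its recent (action, EV) samples, whereas the streaming version described on the slide forgets examples immediately, keeps only sufficient statistics, and applies a significance test (e.g. a Hoeffding-style bound) before committing to a split.

```python
import statistics


def best_split(samples, min_samples=30):
    """Find the action-space split that most reduces the stddev of the EV samples.

    `samples` is a list of (action, ev) pairs from one continuous action interval.
    Returns (split_point, stddev_reduction), or None if there is too little data.
    """
    if len(samples) < min_samples:
        return None
    samples = sorted(samples)                      # order by action value
    values = [ev for _, ev in samples]
    parent_sd = statistics.pstdev(values)
    best = None
    for i in range(1, len(samples)):               # candidate split between i-1 and i
        left, right = values[:i], values[i:]
        pooled = (len(left) * statistics.pstdev(left)
                  + len(right) * statistics.pstdev(right)) / len(values)
        reduction = parent_sd - pooled
        if best is None or reduction > best[1]:
            split_point = (samples[i - 1][0] + samples[i][0]) / 2.0
            best = (split_point, reduction)
    return best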

  31–34. Tree learning search: example
     - A max node whose children are the action sets "Bet in [0,10]" and "{Fold, Call}", each with an unknown EV
     - Samples of the EV as a function of the bet size reveal an optimal split at 4
     - The "Bet in [0,10]" node is split into children for "Bet in [0,4]" and "Bet in [4,10]"
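As a toy usage of the sketch above on the slides' example (synthetic data, purely for illustration): bets below 4 are given a higher EV than bets above 4, and the proposed split lands near 4.

```python
import random

random.seed(0)
stream = []
for _ in range(200):
    bet = random.uniform(0, 10)                         # action in [0, 10]
    ev = (1.0 if bet < 4 else -1.0) + random.gauss(0, 0.3)
    stream.append((bet, ev))

print(best_split(stream))   # proposes a split point close to 4
```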

  35. Tree learning search: tree structure
     - Each level of the tree covers one action of P1, then one action of P2

  36. Selection phase
     - Each node has an EV estimate, which generalizes over the actions in its set
     - (figure: a sampled value of 2.4 at a P1 node)

  37–38. Expansion
     - A new node is expanded below the selected node (P1, P2)
     - The expanded node represents any action of P3

  39–40. Backpropagation
     - A new sample arrives and the split becomes significant

  41–42. Outline (repeated; see slide 28)

  43. Online learning of the opponent model
     - Start from a (safe) model of the general opponent
     - Exploit weaknesses of the specific opponent
     - Start learning a model of the specific opponent (exploration of opponent behavior)
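One simple way to realize the "start from a safe general model, then specialize" idea is sketched below; the mixing scheme and the pseudo-count are assumptions for illustration, not the talk's method.

```python
def mixed_action_probs(general_probs, specific_probs, hands_seen, prior_strength=50):
    """Blend a general opponent model with one learned online for a specific opponent.

    With few observed hands the (safe) general model dominates; as `hands_seen` grows,
    the specific model takes over. `prior_strength` acts as a pseudo-count and is an
    arbitrary choice in this sketch.
    """
    w = hands_seen / (hands_seen + prior_strength)
    return {action: (1 - w) * general_probs[action] + w * specific_probs.get(action, 0.0)
            for action in general_probs}
```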
