
Monte Carlo Continual Resolving for Online Strategy Computation in Imperfect Information Games
Michal Sustr, Faculty of Electrical Engineering, Czech Technical University
michal.sustr@aic.fel.cvut.cz
December 6, 2018


  1–2. Public state. A public partition is any partition $\mathcal{S}$ of $H \setminus Z$ whose elements are closed under $\sim$ (the "telling apart" relation). An element $S$ of any such $\mathcal{S}$ is called a public state. A public partition induces a public tree: a tree that we can traverse to compute a strategy online.

  3. IIGS-3: beginning of the game. [Figure: Round 1 and the beginning of Round 2 of the IIGS-3 game tree; branch labels are the cards 1, 2, 3 played.]

  4. IIGS-3: augmented infosets $\mathcal{I}^{\mathrm{aug}}_1$. [Figure: the same tree fragment with player 1's augmented information sets marked.]

  5. IIGS-3: augmented infosets $\mathcal{I}^{\mathrm{aug}}_2$. [Figure: the same tree fragment with player 2's augmented information sets marked.]

  6. IIGS-3: public tree. [Figure: the public tree over Rounds 1–3; every public state is labelled "play".]

  7–8. IIGS-3: public tree induced by domain-specific $\mathcal{I}^{\mathrm{aug}}$. We'd like something more refined when possible. [Figure: the refined public tree over Rounds 1–3, with public states labelled win, lose, draw, or play.]

  9–10. Solution concepts.

Approximate Nash equilibrium: a profile $\sigma$ (where $\sigma_i \in \Sigma_i$ is a behavioural strategy of player $i$) is an $\epsilon$-NE if
$$(\forall i \in \{1, 2\}): \quad u_i(\sigma) \ge \max_{\sigma'_i \in \Sigma_i} u_i(\sigma'_i, \sigma_{\mathrm{opp}(i)}) - \epsilon.$$

Exploitability of a strategy profile:
$$\mathrm{expl}_i(\sigma) := u_i(\sigma^*) - \min_{\sigma'_{\mathrm{opp}(i)} \in \Sigma_{\mathrm{opp}(i)}} u_i(\sigma_i, \sigma'_{\mathrm{opp}(i)}), \qquad \mathrm{expl}(\sigma) := \tfrac{1}{2}\left[\mathrm{expl}_1(\sigma) + \mathrm{expl}_2(\sigma)\right].$$
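A minimal sketch of these definitions in the simplest possible setting, a zero-sum normal-form game (the slides define them for extensive-form strategies, but the idea is identical); all names here are illustrative, not from the talk:

```python
# Exploitability against best responses in rock-paper-scissors.
import numpy as np

# Payoff matrix for player 1; player 2 receives -A.
A = np.array([[0, -1, 1],
              [1, 0, -1],
              [-1, 1, 0]], dtype=float)

def exploitability(p1, p2, A):
    """expl(sigma) = 1/2 [expl_1(sigma) + expl_2(sigma)].

    The equilibrium value u_i(sigma*) is 0 here by symmetry; in general it is
    the game value for player i."""
    br_value_vs_p1 = np.min(p1 @ A)   # u_1 when P2 best-responds to p1
    br_value_vs_p2 = np.max(A @ p2)   # u_1 when P1 best-responds to p2
    game_value = 0.0
    expl1 = game_value - br_value_vs_p1
    expl2 = br_value_vs_p2 - game_value
    return 0.5 * (expl1 + expl2)

uniform = np.ones(3) / 3
biased = np.array([0.5, 0.3, 0.2])
print(exploitability(uniform, uniform, A))  # 0.0: uniform is the NE
print(exploitability(biased, biased, A))    # 0.3: biased profile is exploitable
```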

  11–13. Complexity?
- Finding a NE in general-sum games is PPAD-hard (not NP-complete, because we know the answer must exist).
- Zero-sum EFGs with imperfect recall are NP-hard.
- Zero-sum EFGs with perfect recall can be formulated as a linear program. The number of constraints is equal to the number of pure strategies of the other player, which can be exponential in the size of the game. But this optimization problem has polynomial-time "separation oracles": there is a polynomial-time algorithm that tests whether a given point satisfies all inequalities and, if not, finds a violated one. The ellipsoid method can then be applied to solve the LP in polynomial time. (A toy sketch of the "NE as an LP" idea follows below.)
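The talk refers to the sequence-form LP for EFGs; as a toy illustration of the LP idea, here is the normal-form LP for a zero-sum matrix game. Using scipy.optimize.linprog is an assumption about tooling, not part of the talk:

```python
# Solve a zero-sum matrix game as an LP: maximize the value v subject to
# x^T A >= v for every opponent pure strategy, x in the simplex.
import numpy as np
from scipy.optimize import linprog

A = np.array([[0, -1, 1],
              [1, 0, -1],
              [-1, 1, 0]], dtype=float)  # rock-paper-scissors payoffs for P1

n_rows, n_cols = A.shape
# Variables: x (row strategy, n_rows entries) and v (game value).
c = np.zeros(n_rows + 1)
c[-1] = -1.0                                     # linprog minimizes, so use -v
A_ub = np.hstack([-A.T, np.ones((n_cols, 1))])   # v - (x^T A)_j <= 0 for all j
b_ub = np.zeros(n_cols)
A_eq = np.hstack([np.ones((1, n_rows)), np.zeros((1, 1))])  # sum(x) = 1
b_eq = np.array([1.0])
bounds = [(0, None)] * n_rows + [(None, None)]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
print(res.x[:-1])   # equilibrium strategy, ~[1/3, 1/3, 1/3]
print(res.x[-1])    # game value, ~0
```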

  14. Complexity? But even simple games are huge.

Game          |S|     |I|      |H|       |Z|       |Ω|
IIGS(5)       363     9948     41331     14400     13
LD(1,1,6)     4098    24576    147456    147420    396
GP(3,3,2,2)   2671    7920     23760     44883     45

IIGS-13 has $|H| \approx 10^{22}$; no-limit Texas hold 'em has $|H| \approx 10^{160}$.

  15–20. Offline algorithms.
- Sequence-form LP
- Double Oracle
- EGT (Excessive Gap Technique)
- CFR (Counterfactual Regret Minimization)
- CFR+
- ... probably more, but not much.

  21–22. CFR: counterfactuals. The counterfactual value (CFV) of player $i$ with strategy profile $\sigma$ is
$$v_i^\sigma(h) := \pi_{-i}^\sigma(h)\, u_i^\sigma(h),$$
and the counterfactual regret at information set $I$ for playing action $a$ is
$$r_i^\sigma(I, a) := v_i^\sigma(I, a) - v_i^\sigma(I).$$
Immediate counterfactual regret:
$$\bar R_{i,\mathrm{imm}}^T(I) := \max_{a \in \mathcal{A}(I)} \bar R_{i,\mathrm{imm}}^T(I, a) := \max_{a \in \mathcal{A}(I)} \frac{1}{T} \sum_{t=1}^{T} r_i^{\sigma^t}(I, a).$$

  23–24. CFR. Regret matching update rule:
$$\sigma^{t+1}(I)(a) := \frac{\bar R_{i,\mathrm{imm}}^{t,+}(I, a)}{\sum_{a' \in \mathcal{A}(I)} \bar R_{i,\mathrm{imm}}^{t,+}(I, a')}.$$
Average strategy (for $I \in \mathcal{I}_i$):
$$\bar\sigma^T(I)(a) := \frac{\sum_{t=1}^{T} \pi_i^{\sigma^t}(I)\, \sigma^t(I, a)}{\sum_{t=1}^{T} \pi_i^{\sigma^t}(I)}.$$
A self-play sketch on rock-paper-scissors follows below.
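A minimal sketch of these two update rules in self-play on rock-paper-scissors, where the game has a single infoset per player so CFR reduces to regret matching. It qualitatively reproduces the behaviour shown on slide 25: current strategies oscillate while averages converge to uniform. Names are illustrative:

```python
import numpy as np

A = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]], dtype=float)

def regret_matching(cum_regret):
    # Normalize positive cumulative regrets; fall back to uniform.
    positive = np.maximum(cum_regret, 0.0)
    total = positive.sum()
    return positive / total if total > 0 else np.ones(3) / 3

# Break the symmetric fixed point so the dynamics are visible.
cum_regret = [np.array([1.0, 0.0, 0.0]), np.zeros(3)]
cum_strategy = [np.zeros(3), np.zeros(3)]

for t in range(200):
    s1, s2 = regret_matching(cum_regret[0]), regret_matching(cum_regret[1])
    # Value of each pure action against the opponent's current strategy.
    v1_actions, v1 = A @ s2, s1 @ A @ s2
    v2_actions, v2 = -(s1 @ A), -(s1 @ A @ s2)
    cum_regret[0] += v1_actions - v1   # counterfactual regrets, player 1
    cum_regret[1] += v2_actions - v2   # counterfactual regrets, player 2
    cum_strategy[0] += s1              # reach probability is 1 at the root
    cum_strategy[1] += s2

avg = [cs / cs.sum() for cs in cum_strategy]
print(avg[0], avg[1])   # both drift towards [1/3, 1/3, 1/3]
```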

  25. CFR on rock-paper-scissors. [Figure: per-action probabilities $P(a)$ over 200 iterations for both players; the current regret-matching (RM) strategies for Rock, Paper and Scissors oscillate, while the average (Avg) strategies converge towards 1/3 each.]

  26–27. MCCFR. A Monte Carlo variant of CFR: we sample one terminal history at a time (this is called Outcome Sampling).
Sampling distribution: it must have positive probability of sampling any leaf (even in unreachable parts of the tree). We use the sampling strategy
$$\sigma^{t,\epsilon} := (1 - \epsilon)\,\sigma^t + \epsilon \cdot \mathrm{rnd}, \qquad \mathrm{rnd}(I)(a) := \frac{1}{|\mathcal{A}(I)|},$$
where $\epsilon \in (0, 1]$ controls the exploration. This is also called "epsilon-on-policy exploration".
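A minimal sketch of the epsilon-on-policy mixture, assuming the strategy at an infoset is a NumPy probability vector; names are illustrative:

```python
import numpy as np

def sampling_distribution(sigma_t, eps=0.6):
    """sigma^{t,eps} = (1-eps) * sigma^t + eps * uniform: mixing with uniform
    keeps every action (and hence every leaf) at positive sampling
    probability."""
    uniform = np.ones_like(sigma_t) / len(sigma_t)
    return (1.0 - eps) * sigma_t + eps * uniform

rng = np.random.default_rng(0)
sigma = np.array([0.9, 0.1, 0.0])      # a nearly pure current strategy
probs = sampling_distribution(sigma)   # every entry is now > 0
a = rng.choice(len(probs), p=probs)    # sampled action index
```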

  28. MCCFR: sampled regrets. Regrets are now sampled:
$$\tilde r_i^{\sigma^t}(I, a) := \begin{cases} w_I \cdot \left(\pi^{\sigma^t}(z \mid ha) - \pi^{\sigma^t}(z \mid h)\right) & \text{if } ha \sqsubset z, \\ w_I \cdot \left(0 - \pi^{\sigma^t}(z \mid h)\right) & \text{otherwise,} \end{cases}$$
where $h$ denotes the prefix of $z$ which is in $I$ and $w_I$ stands for
$$w_I := \frac{\pi_{-i}^{\sigma^t}(h)\, u_i(z)}{\pi^{\sigma^{t,\epsilon}}(z)}.$$
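A minimal sketch of this computation for one sampled terminal $z$, assuming the reach probabilities have already been accumulated during the tree walk; all argument names are illustrative:

```python
def sampled_regrets(actions, sampled_action, u_z, reach_opp_h, reach_eps_z,
                    tail_h, tail_ha):
    """tail_h    = pi^{sigma^t}(z | h): on-policy probability of reaching z from h.
    tail_ha      = pi^{sigma^t}(z | ha) for the action taken at h on z's path.
    reach_opp_h  = pi^{sigma^t}_{-i}(h), reach_eps_z = pi^{sigma^{t,eps}}(z)."""
    w = reach_opp_h * u_z / reach_eps_z   # the importance weight w_I
    regrets = {}
    for a in actions:
        if a == sampled_action:           # ha is a prefix of z
            regrets[a] = w * (tail_ha - tail_h)
        else:                             # z is unreachable after playing a
            regrets[a] = w * (0.0 - tail_h)
    return regrets
```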

  29–30. MCCFR average strategy (1/2). The current strategy is not guaranteed to converge to an equilibrium, so we need to compute the average strategy. Recall that
$$\bar\sigma^T(I)(a) := \frac{\sum_{t=1}^{T} \pi_i^{\sigma^t}(I)\, \sigma^t(I, a)}{\sum_{t=1}^{T} \pi_i^{\sigma^t}(I)} \qquad (I \in \mathcal{I}_i),$$
which can be rewritten as
$$\bar\sigma^T(I)(a) := \frac{\mathrm{acc}^T(I, a)}{\sum_{a' \in \mathcal{A}(I)} \mathrm{acc}^T(I, a')},$$
where acc denotes the cumulative sum
$$\mathrm{acc}^T(I, a) = \sum_{t=1}^{T} \pi_i^{\sigma^t}(I)\, \sigma^t(I, a).$$

  31–33. MCCFR average strategy (2/2). There are multiple ways to compute the average strategy with MCCFR. We use stochastically-weighted averaging, updating only the infosets on the trajectory of the sampled terminal $z$:
$$\mathrm{acc}^t(I)(a) := \mathrm{acc}^{t-1}(I)(a) + \begin{cases} \dfrac{\pi_i^{\sigma^t}(h)}{\pi^{\sigma^{t,\epsilon}}(h)}\, \sigma^t(I, a) & \text{if } ha \sqsubset z, \\ 0 & \text{otherwise.} \end{cases}$$
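A minimal sketch of this accumulator update and the final normalization, with illustrative names; `a_sampled` is the action taken at $h$ on the way to $z$:

```python
def update_average(acc, I, a_sampled, sigma_t_I, reach_own_h, reach_eps_h):
    """Adds (pi_i^{sigma^t}(h) / pi^{sigma^{t,eps}}(h)) * sigma^t(I, a) for the
    action on the sampled trajectory; everything off the trajectory is left
    untouched, which is what makes the weighting stochastic but unbiased."""
    acc[I][a_sampled] += (reach_own_h / reach_eps_h) * sigma_t_I[a_sampled]

def average_strategy(acc_I):
    """Normalize the accumulator at one infoset into bar-sigma^T(I)."""
    total = sum(acc_I)
    return [x / total for x in acc_I] if total > 0 else None
```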

  34. Online algorithms.

  35–36. Online algorithms without guarantees.
- Information-Set Monte Carlo Tree Search: samples as in a perfect-information game, but computes statistics for the whole infoset; various selection functions exist (we use UCT and RM).
- Unsafe resolving: does not deal with what happens outside of the subgame; it summarizes what happened so far with a chance node with $\sigma_c(\emptyset, a) = \pi^\sigma(h) / \pi^\sigma(S)$ (see the sketch below).
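A minimal sketch of that unsafe-resolving root chance node, assuming histories and their reach probabilities are plain Python dicts; names are illustrative:

```python
def unsafe_root_distribution(histories_in_S, reach):
    """reach[h] = pi^sigma(h); the root chance node deals history h with
    probability pi^sigma(h) / pi^sigma(S), i.e. proportional to its reach
    under the profile played so far."""
    total = sum(reach[h] for h in histories_in_S)   # pi^sigma(S)
    return {h: reach[h] / total for h in histories_in_S}
```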

  37–38. Online algorithms with guarantees.
- OOS: updates the sampling distribution to send more samples into the current play position; builds the tree incrementally.
- Continual Resolving: uses safe resolving (a gadget game) repeatedly in public states. We need to get CFVs somehow!

  39. Continual resolving.

  40–41. Gadget game. For continual resolving, we will need to construct a "resolving gadget game". [Figure: the gadget construction; at the root, each state offers a terminate (T) or follow (F) choice, and after F the play continues as in the original game.]

  42. Continual resolving. Main idea: repeatedly construct a gadget game in each public state $S$ encountered during play, and solve that gadget game. We need to store, for every $I \in S$:
- reach probabilities of infosets $\pi_{\mathrm{opp}(i)}^{\sigma}(I)$,
- CFVs of infosets $v_{-i}(I)$.

  43. Function Play of Continual Resolving.

Input: an information set $I \in \mathcal{I}_1$. Output: an action $a \in \mathcal{A}(I)$.

 1: $S \leftarrow$ the public state which contains $I$
 2: if $S \notin KPS$ then
 3:     $\tilde G(S) \leftarrow$ BuildResolvingGame($S$, $D(S)$)
 4:     $KPS \leftarrow KPS \cup \{S\}$
 5:     $NPS \leftarrow$ all $S' \in \mathcal{S}$ where CR acts for the first time after leaving $KPS$
 6:     $\tilde\rho, \tilde D \leftarrow$ Resolve($\tilde G(S)$, $NPS$)
 7:     $\sigma_1|_{S'} \leftarrow \tilde\rho|_{S'}$ for each $S' \in NPS$
 8:     $D \leftarrow$ calculate data for $NPS$ based on $D$, $\sigma_1$ and $\tilde D$
 9: end
10: return $a \sim \sigma_1(I)$
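A Python-flavoured skeleton of the function above, with the slide's symbols spelled out. Every helper passed in (public_state_of through propagate_data) is a hypothetical name standing in for a component of the actual implementation:

```python
import random

def play(I, sigma1, data, kps, public_state_of, build_resolving_game,
         first_action_frontier, resolve, propagate_data):
    S = public_state_of(I)                        # line 1
    if S not in kps:                              # line 2
        gadget = build_resolving_game(S, data[S])     # line 3: ~G(S) from D(S)
        kps.add(S)                                # line 4: KPS <- KPS u {S}
        nps = first_action_frontier(kps)          # line 5: NPS
        rho, resolved_data = resolve(gadget, nps)     # line 6
        for S_next in nps:
            sigma1[S_next] = rho[S_next]          # line 7: sigma_1|S' <- rho|S'
            data[S_next] = propagate_data(data, sigma1, resolved_data, S_next)  # line 8
    dist = sigma1[I]                              # line 10: a ~ sigma_1(I)
    return random.choices(list(dist.keys()), weights=list(dist.values()))[0]
```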

  44. MCCR.

  45–50. Description of the algorithm. We implement the "Resolve" function using MCCFR:
- MCCFR finds the strategy in the current information set.
- While sampling, we store the expected values $\tilde u^{\bar\sigma^T}(h)$.
- The CFVs are then simply obtained as $\tilde v^{\bar\sigma^T}(h) = \tilde\pi_{-i}^{\bar\sigma^T}(h)\, \tilde u^{\bar\sigma^T}(h)$.
- We know how we arrived at $h$, so we can easily store $\tilde\pi^{\bar\sigma^T}(h)$, but we need to estimate $\tilde u^{\bar\sigma^T}(h)$ well.
- This is difficult, because we use $\sigma^t$ to sample, not $\bar\sigma^t$! (A sketch of one possible estimator follows below.)
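A minimal sketch of a running value estimator for this purpose. The incremental mean and the final multiplication by the opponents' reach are assumptions about one reasonable implementation, not the thesis' exact estimator:

```python
class CFVEstimate:
    """Hypothetical per-history accumulator for u(h) under the average
    strategy; a real implementation must additionally correct for the fact
    that samples are drawn with sigma^t rather than bar-sigma."""

    def __init__(self):
        self.visits = 0
        self.u_avg = 0.0                 # running estimate of u^{bar sigma}(h)

    def update(self, sampled_value):
        # Incremental mean over the values observed when h was sampled.
        self.visits += 1
        self.u_avg += (sampled_value - self.u_avg) / self.visits

    def cfv(self, reach_opp_avg):
        # v(h) = pi_{-i}^{bar sigma}(h) * u^{bar sigma}(h), as on the slide.
        return reach_opp_avg * self.u_avg
```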

  51. Bounds. CR bound: suppose that CR uses $D = (r_1, \tilde v)$ and $\tilde G(S, \sigma_1, \tilde v)$. Then the exploitability of its strategy is bounded by
$$\mathrm{expl}_1(\sigma_1) \le \epsilon_{\tilde v}^0 + \epsilon_R^1 + \epsilon_{\tilde v}^1 + \cdots + \epsilon_{\tilde v}^{N-1} + \epsilon_R^N + \epsilon_{\tilde v}^N,$$
where $N$ is the number of resolving steps and
$$\epsilon_R^n := \mathrm{expl}_1(\tilde\rho^n), \qquad \epsilon_{\tilde v}^n := \sum_{J \in \hat{\mathcal{S}}^{n+1}_2} \left| \tilde v(J) - v^{(\sigma_1^{*n},\, \mathrm{CBR})}(J) \right|$$
are the exploitability (in $\tilde G(S^n)$) and the value-estimation error made by the $n$-th resolver (resp. by the initialization for $n = 0$).

  52. Bounds. MCCR bound: with probability at least $(1 - p)^{N+1}$, the exploitability of the strategy $\sigma$ computed by MCCR satisfies
$$\mathrm{expl}_i(\sigma) \le \left(\frac{\sqrt{2}}{\sqrt{p}} + 1\right) |\mathcal{I}_i|\, \Delta_{u,i} \sqrt{A_i}\, \frac{1}{\delta} \left(\frac{2}{\sqrt{T_0}} + \frac{2N - 1}{\sqrt{T_R}}\right),$$
where $T_0$ and $T_R$ are the iteration budgets of the initialization and of each resolving step.

  53–57. Evaluation of online algorithms. Evaluation of online algorithms is hard. The correct "brute-force" way: simulate all the possible game trajectories and resolve accordingly. Drawback: this costs $O(t |\mathcal{S}|)$, and for reliability we should use multiple seeds! Averaging over seeds produces what we call the $\bar{\bar\sigma}$ (double bar) strategy. This strategy is no worse than the individual seed strategies:
$$\mathrm{expl}(\bar{\bar\sigma}^T) \le \frac{1}{T} \sum_{t=1}^{T} \mathrm{expl}(\bar\sigma^t)$$
($t$ indexes seeds in this context; a sketch of the seed-averaging follows below).
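A minimal sketch of the seed-averaging, assuming each seed's average strategy is a dict from infosets to action distributions. Mixing behavioural strategies needs the same reach-weighting as the CFR average-strategy formula earlier; plain per-infoset averaging would bias the result:

```python
def double_bar(seed_strategies, seed_reach):
    """seed_strategies[s][I] -> action distribution (list of probabilities),
    seed_reach[s][I] -> pi_i^{sigma_s}(I), the player's own reach under seed s."""
    n = len(seed_strategies)
    out = {}
    for I in seed_strategies[0]:
        total = sum(seed_reach[s][I] for s in range(n))
        if total == 0:                    # infoset never reached: any mix works
            out[I] = seed_strategies[0][I]
            continue
        k = len(seed_strategies[0][I])
        out[I] = [sum(seed_reach[s][I] * seed_strategies[s][I][a]
                      for s in range(n)) / total
                  for a in range(k)]
    return out
```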

  58. Experimental results.

  59. Results: CFVs. [Figure: exploitability $\mathrm{expl}(\sigma^t)$, averages of CFV differences, and variances of CFV differences as a function of the number of samples ($10^1$ to $10^7$), for B-RPS, IIGS-5, LD-116, GP-3322, PTTT, IIGS-13, LD-226 and GP-4644.]

  60. Results: convergence, "reset" variant. [Figure: four surface plots of exploitability over the number of iterations in the root ($10^2$ to $10^6$) and the number of iterations per gadget game ($10^2$ to $10^7$).]

  61. Results: convergence, "keep" variant. [Figure: the same four surface plots for the "keep" variant.]

  62. Results: convergence, slices. [Figure: $\mathrm{expl}_2(\sigma)$ as a function of $T_R$ with $T_0 = 10^7$ (top) and as a function of $T_0$ with $T_R = 10^6$ (bottom), for GP-3322, IIGS-5 and LD-116 in both the reset and keep variants.]

  63. Results: exploitability given time budget. [Figure: $\mathrm{expl}_2(\sigma)$ versus time budget per move ($10^0$ to $10^3$ ms) for MCCR (reset), MCCR (keep), OOS (PST), MCCFR and RND on IIGS-5, LD-116 and GP-3322.]

  64. Results: exploration parameter. [Figure: $\mathrm{expl}_1(\sigma)$ versus the number of samples in the gadget ($10^2$ to $10^6$) for exploration $\epsilon \in \{0.2, 0.4, 0.6, 0.8\}$ on II-GS(5), LD(1,1,6) and GP(3,3,2,2).]

  65. Results: selected domains. Head-to-head results (value of the row algorithm against the column algorithm):

IIGS-13        MCCR (reset)  MCCFR         OOS (PST)     OOS (IST)     RM            UCT           RND
MCCR (keep)    -23.8 ± 7.8     8.0 ± 8.0    11.2 ± 8.0    -7.6 ± 8.0   -54.1 ± 6.8   -70.4 ± 5.7    43.3 ± 7.3
MCCR (reset)                  22.0 ± 7.9    20.8 ± 7.9     0.8 ± 8.1   -34.6 ± 7.6   -61.5 ± 6.4    52.9 ± 6.9
MCCFR                                       -1.4 ± 8.0   -18.6 ± 7.9   -58.5 ± 6.6   -76.8 ± 5.2    37.2 ± 7.5
OOS (PST)                                                 -19.9 ± 7.9  -58.4 ± 6.6   -76.1 ± 5.2    34.2 ± 7.6
OOS (IST)                                                              -40.3 ± 7.4   -60.0 ± 6.5    54.2 ± 6.7
RM                                                                                   -22.4 ± 7.9    81.3 ± 4.7
UCT                                                                                                 92.0 ± 3.1

LD-226         MCCR (reset)  MCCFR         OOS (PST)     OOS (IST)     RM            UCT           RND
MCCR (keep)      1.0 ± 8.1   46.0 ± 7.2    45.2 ± 7.3   -23.6 ± 7.9   -34.0 ± 7.7   -34.4 ± 7.7    75.0 ± 5.4
MCCR (reset)                 39.8 ± 7.5    45.0 ± 7.3   -32.0 ± 7.7   -42.0 ± 7.4   -46.0 ± 7.2    81.8 ± 4.7
MCCFR                                       1.6 ± 8.1   -51.8 ± 7.0   -48.0 ± 7.1   -43.6 ± 7.3    52.2 ± 7.0
OOS (PST)                                                -58.2 ± 6.6  -53.2 ± 6.9   -47.2 ± 7.2    42.6 ± 7.4
OOS (IST)                                                             -10.0 ± 8.1   -19.0 ± 8.0    83.6 ± 4.5
RM                                                                                  -13.2 ± 8.1    80.6 ± 4.8
UCT                                                                                                75.6 ± 5.3

GP-4644        MCCR (reset)  MCCFR         OOS (PST)     OOS (IST)     RM            UCT           RND
MCCR (keep)     -0.0 ± 4.2   10.3 ± 5.8    13.2 ± 5.6    -0.2 ± 5.3    -1.0 ± 4.3    -4.1 ± 3.6    18.7 ± 5.7
MCCR (reset)                  9.7 ± 4.8    11.5 ± 5.0     1.1 ± 4.2    -3.4 ± 3.7    -2.2 ± 3.1    15.5 ± 5.1
MCCFR                                      -5.4 ± 6.3   -11.5 ± 5.5   -12.2 ± 4.9    -8.6 ± 4.0    11.6 ± 6.1
OOS (PST)                                                -12.2 ± 5.5  -10.8 ± 4.9    -6.6 ± 4.1    11.2 ± 6.1
OOS (IST)                                                              -0.4 ± 4.2    -0.0 ± 3.4    18.0 ± 5.6
RM                                                                                    0.2 ± 2.3    19.7 ± 4.5
UCT                                                                                               18.5 ± 4.2

  66–70. Future work.
- Find a good domain to show dominance.
- Variance reduction.
- Using OOS as the resolver in the parent public state.
- Heuristics for resolving (neural networks) on general games.
- Find $\epsilon$-NE CFVs without having the strategy.
