Multi-agent learning: Comparing algorithms empirically

Gerard Vreeswijk, Intelligent Software Systems, Computer Science Department, Faculty of Sciences, Utrecht University, The Netherlands.

Sunday 21st June, 2020


Round robin tournament

Given a pool of games G to test on, all approaches have in common that they build a grand head-to-head score table (the tournament loop is sketched in code below):

             A1     A2    ...    A12    avg
    A1      3.2    5.1    ...    4.7    4.1
    A2      2.4    1.2    ...    2.2    1.3
    ...      .      .     ...     .      .
    A12     3.1    6.1    ...    3.8    4.2

■ Entries are performance measures for the protagonist (row); performance is almost always average payoff (alternatives: no-regret, . . . ).
■ Often each entry is computed multiple times to even out randomness in the algorithms (which are implementations of response rules).
■ Sometimes there is a settling-in phase (a.k.a. burn-in phase) in which payoffs are not yet recorded.
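
The tournament loop behind such a table is straightforward. Below is a minimal sketch, not taken from the slides: it assumes each algorithm is given as a factory producing objects with hypothetical act() and observe() methods, and that each game exposes a payoffs() method returning the (row, column) payoff pair.

```python
from itertools import product

def play_match(alg_row, alg_col, game, rounds=200, burn_in=0):
    """Play one repeated game; return the row player's average payoff,
    ignoring the first `burn_in` (settling-in) rounds."""
    total, counted = 0.0, 0
    for t in range(rounds):
        a_row, a_col = alg_row.act(), alg_col.act()
        p_row, p_col = game.payoffs(a_row, a_col)
        alg_row.observe(a_row, a_col, p_row)   # both learners update on the round
        alg_col.observe(a_col, a_row, p_col)
        if t >= burn_in:
            total, counted = total + p_row, counted + 1
    return total / counted

def grand_table(factories, games, repeats=5, rounds=200, burn_in=0):
    """Grand head-to-head table: average payoff of the row algorithm against
    every column algorithm, averaged over all games and `repeats` restarts."""
    names = list(factories)                    # factories: name -> callable
    table = {r: {} for r in names}
    for r, c in product(names, names):
        scores = [play_match(factories[r](), factories[c](), g, rounds, burn_in)
                  for _ in range(repeats)      # repeats even out randomness
                  for g in games]
        table[r][c] = sum(scores) / len(scores)
    return table
```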

Work of Axelrod (1980, 1984)

Axelrod receiving the National Medal of Science (2014)

Axelrod: tournament for the repeated prisoner's dilemma

■ One game to test: the prisoner's dilemma. [Axelrod, Robert. "Effective choice in the prisoner's dilemma." Journal of Conflict Resolution 24.1 (1980): 3-25.]
■ Contestants: 14 constructed algorithms + 1 random = 15: Tit-for-tat, Shubik, Nydegger, Joss, . . . , Random. Response rules (algorithms) were mostly reactive. One could hardly speak of learning.
■ Grand table: all pairs play 200 rounds (a minimal match sketch follows below). This was repeated 5 times to even out randomness.
■ Winner: Tit-for-tat.
■ Second tournament: 64 contestants. All contestants were informed about the results of the first tournament. Winner: Tit-for-tat.
■ In 2012, Alexander Stewart and Joshua Plotkin ran a variant of Axelrod's tournament with 19 strategies to test the effectiveness of the then newly discovered Zero-Determinant strategies.
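
For concreteness, a minimal sketch (not from the slides) of one tournament match: Tit-for-tat against a random player in the iterated prisoner's dilemma, using the payoffs Axelrod used (T = 5, R = 3, P = 1, S = 0).

```python
import random

# Payoff pairs (row, column) for the prisoner's dilemma with T=5, R=3, P=1, S=0.
PAYOFF = {('C', 'C'): (3, 3), ('C', 'D'): (0, 5),
          ('D', 'C'): (5, 0), ('D', 'D'): (1, 1)}

def tit_for_tat(my_history, opp_history):
    """Cooperate on the first round, then copy the opponent's previous move."""
    return 'C' if not opp_history else opp_history[-1]

def random_player(my_history, opp_history):
    return random.choice(['C', 'D'])

def play(strategy_a, strategy_b, rounds=200):
    """Return the cumulative payoffs of both strategies over `rounds` rounds."""
    hist_a, hist_b, score_a, score_b = [], [], 0, 0
    for _ in range(rounds):
        a = strategy_a(hist_a, hist_b)
        b = strategy_b(hist_b, hist_a)
        pa, pb = PAYOFF[(a, b)]
        score_a, score_b = score_a + pa, score_b + pb
        hist_a.append(a)
        hist_b.append(b)
    return score_a, score_b

print(play(tit_for_tat, random_player))   # both score roughly 450 on average
```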

Work of Zawadzki et al. (2014)

Zawadzki et al.

■ Contestants: FP, Determinate, Awesome, Meta, WoLF-IGA, GSA, RVS, QL, Minmax-Q, Minmax-Q-IDR, Random. A motivation for this set of 11 algorithms, other than "state-of-the-art", wasn't given.
■ Games: a suite of 13 interesting families, D = D_1, . . . , D_13: D_1 = games with normal covariant random payoffs; D_2 = Bertrand oligopoly; D_3 = Cournot duopoly; D_4 = dispersion games; D_5 = grab-the-dollar type games; D_6 = guess two thirds of the average games; . . .
■ Game pool: 600 games: 100 games for each of the sizes 2 × 2, 4 × 4, 6 × 6, 8 × 8, 10 × 10, randomly selected from D, and 100 games of dimension 2 × 2 from Rapoport's catalogue.
■ Grand table: each algorithm pair plays all 600 games for 10^4 rounds.
■ Evaluation: through non-parametric tests and squared heat plots.
■ Conclusion: Q-learning is the overall winner.

Zawadzki et al. Mean reward over all opponents and games. Mean regret over all opponents and games.

Zawadzki et al. Mean reward against different game suites. Mean reward against different opponents.

Parametric test: paired t-test

             A1              A2              A3         . . .
         G1   G2   G3    G1   G2   G3    G1   G2   G3
   A1   2.1  3.1  4.7   5.1  1.1  1.2   3.5  4.2  3.8   . . .
   A2   2.7  3.5  4.1   4.9  0.9  1.9   3.7  4.7  4.5   . . .

Paired t-test:
■ Compute the average difference X̄_D and the standard deviation of the differences s_D over all n pairs (we see nine pairs here).
■ If the two series are generated by the same random process, the test statistic t = X̄_D / (s_D / √n) should follow the Student's t-distribution with mean 0 and n − 1 degrees of freedom (see the sketch below).
■ If t is too eccentric, then we'll have to reject that possibility, since eccentric values of t are unlikely ("have a low p-value").
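
As a sketch (not part of the slides), the test can be run directly on the two rows of per-game scores; scipy's ttest_rel implements exactly this paired comparison.

```python
from scipy import stats

# The nine paired scores of A1 and A2 from the table above
# (same opponents, same games).
a1 = [2.1, 3.1, 4.7, 5.1, 1.1, 1.2, 3.5, 4.2, 3.8]
a2 = [2.7, 3.5, 4.1, 4.9, 0.9, 1.9, 3.7, 4.7, 4.5]

# Manual computation of the paired t statistic.
n = len(a1)
diffs = [x - y for x, y in zip(a1, a2)]
mean_d = sum(diffs) / n
s_d = (sum((d - mean_d) ** 2 for d in diffs) / (n - 1)) ** 0.5
t = mean_d / (s_d / n ** 0.5)

# Same test via scipy; the p-value is two-sided.
t_scipy, p_value = stats.ttest_rel(a1, a2)
print(t, t_scipy, p_value)   # t and t_scipy agree
```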

Parametric test: paired t-test.

Non-parametric test: the Kolmogorov-Smirnov test

■ Test whether two distributions are generated by the same random process. H_0: yes. H_1: no.
■ The test statistic is the maximum distance between the empirical cumulative distribution functions of the two samples (see the sketch below).
■ The p-value is the probability of seeing a test statistic (i.e., max distance) as high as the one observed, under the assumption that both samples were drawn from the same distribution.
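
A minimal sketch (not from the slides) of the two-sample KS statistic, both by hand and via scipy.stats.ks_2samp; the normal samples are only illustrative.

```python
import numpy as np
from scipy import stats

def ks_statistic(sample_a, sample_b):
    """Maximum distance between the empirical CDFs of two samples."""
    xs = np.sort(np.concatenate([sample_a, sample_b]))
    cdf_a = np.searchsorted(np.sort(sample_a), xs, side='right') / len(sample_a)
    cdf_b = np.searchsorted(np.sort(sample_b), xs, side='right') / len(sample_b)
    return np.max(np.abs(cdf_a - cdf_b))

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=500)    # two hypothetical payoff samples
b = rng.normal(0.2, 1.0, size=500)

d_manual = ks_statistic(a, b)
d, p_value = stats.ks_2samp(a, b)     # scipy's two-sample KS test
print(d_manual, d, p_value)           # the two D statistics agree
```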

Zawadzki et al. Variations / extensions. Investigate:

■ The relation between game sizes and rewards. Outcome: no relation.
■ The correlation between regret and average reward.
■ The correlation between distance to nearest Nash and average reward.
■ Which algorithms probabilistically dominate which other algorithms. (Cf. the article for a definition of this concept.) Outcome: Q-learning is the only algorithm that is not probabilistically dominated by other algorithms.
■ The difference between average reward and maxmin value (enforceable payoff; its computation is sketched below). Outcome: Q-learning attained an enforceable payoff more frequently than any other algorithm.
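
The "enforceable payoff" above is the maxmin (security) value of a matrix game. A sketch of how it can be computed as a small linear programme, assuming only numpy/scipy; the matching-pennies example is illustrative.

```python
import numpy as np
from scipy.optimize import linprog

def maxmin_value(payoff):
    """Security level (maxmin value, 'enforceable payoff') of the row player:
    the highest payoff she can guarantee with a mixed strategy, whatever the
    column player does."""
    payoff = np.asarray(payoff, dtype=float)
    m, n = payoff.shape
    # Variables: m mixing probabilities x_1..x_m, plus the value v.
    c = np.zeros(m + 1)
    c[-1] = -1.0                                   # maximise v  ==  minimise -v
    # For every column j:  v - sum_i x_i * payoff[i, j] <= 0
    A_ub = np.hstack([-payoff.T, np.ones((n, 1))])
    b_ub = np.zeros(n)
    A_eq = np.append(np.ones(m), 0.0).reshape(1, -1)   # probabilities sum to 1
    b_eq = [1.0]
    bounds = [(0, 1)] * m + [(None, None)]
    res = linprog(c, A_ub, b_ub, A_eq, b_eq, bounds=bounds)
    return res.x[-1], res.x[:-1]                   # value, maxmin strategy

# Matching pennies for the row player: the guaranteed value is 0.
value, strategy = maxmin_value([[1, -1], [-1, 1]])
print(value, strategy)                             # ~0.0, [0.5, 0.5]
```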

Work of Bouzy et al. (2010)

Bouzy et al. (2010)

■ Contestants: Minimax, FP, QL, JR, Sat, M3, UCB, Exp3, HMC, Bully, Optimistic, Random (12 algorithms).
■ Games: random 2-player, 3 × 3-action games, with payoffs in Z ∩ [−9, 9] (if I understand correctly; otherwise it's [−9, 9]).
■ Grand table: each pair plays 3 × 10^6 rounds (!) on a random game. Restart 100 times to even out randomness.
■ Final ranking: UCB, M3, Sat, JR, . . .
■ Evaluation: plot with the x-axis = log rounds and the y-axis = the rank of the algorithm w.r.t. performance (see the sketch below).
■ Bouzy et al. are familiar with the work of Airiau et al. and Zawadzki et al. Contrary to Airiau et al. and Zawadzki et al., the ranking still fluctuates after 3 × 10^6 rounds . . . ?!
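
The rank-evolution plot can be produced directly from a payoff history. A minimal sketch, not the authors' code, which assumes ranking by running average payoff.

```python
import numpy as np

def rank_evolution(payoff_history):
    """payoff_history: array of shape (n_algorithms, n_rounds) holding the
    payoff each algorithm earned in each round (averaged over opponents and
    restarts).  Returns an equally shaped integer array with each algorithm's
    rank (1 = best) by running average payoff after every round."""
    payoff_history = np.asarray(payoff_history, dtype=float)
    rounds = np.arange(1, payoff_history.shape[1] + 1)
    running_avg = np.cumsum(payoff_history, axis=1) / rounds
    order = np.argsort(-running_avg, axis=0)            # best algorithm first
    ranks = np.empty_like(order)
    np.put_along_axis(ranks, order,
                      np.arange(1, payoff_history.shape[0] + 1)[:, None], axis=0)
    return ranks   # e.g. plt.semilogx(rounds, ranks.T) gives the log-scale plot
```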

Bouzy et al. (2010)

Variation: eliminate by rank.
■ Algorithm: Repeat:
  ● Rank, eliminate the worst, and subtract all payoffs earned against that player from the revenues of all survivors (sketched below).
■ Final ranking: M3, Sat, UCB, JR, . . .

Variation: eliminate by lag.
■ Algorithm: Repeat:
  ● Organise a tournament. If the difference between the cumulative returns of the two worst performers is larger than 600/√n_T (n_T the number of tournaments performed), then the laggard is removed. The laggard is also removed if the global ranking has not changed during 100 n_p² (n_p − 1)² tournaments since the last elimination (n_p is the current number of players).
■ Final ranking: M3, Sat, UCB, FP, . . .
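
A sketch of the eliminate-by-rank loop described above; not the authors' code, and `head_to_head` is an assumed precomputed grand table of average payoffs.

```python
def eliminate_by_rank(head_to_head):
    """head_to_head: dict mapping (row, col) algorithm names to the average
    payoff the row algorithm earned against the column algorithm.
    Repeatedly drop the worst performer, discarding payoffs earned against it,
    and return the final ranking from best to worst."""
    players = {name for name, _ in head_to_head}
    eliminated = []
    while len(players) > 1:
        # Current score: total payoff against the surviving opponents only,
        # i.e. payoffs earned against eliminated players no longer count.
        scores = {p: sum(head_to_head[(p, q)] for q in players if q != p)
                  for p in players}
        worst = min(scores, key=scores.get)
        eliminated.append(worst)
        players.remove(worst)
    return list(players) + eliminated[::-1]   # best first
```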

Bouzy et al. Ranking evolution according to the number of steps played in games (logscale). The key is ordered according to the final ranking.

Bouzy et al. Ranking based on eliminations (logscale). The key is ordered according to the final ranking.

Bouzy et al. (2010)

Variation: select sub-classes of games.
■ Only cooperative games (shared payoffs): Exp3, M3, Bully, JR, . . .
■ Only competitive games (zero-sum payoffs): Exp3, M3, Minimax, JR, . . .
■ Specific matrix games: penalty game, climbing game, coordination game, . . .
■ Different numbers of actions (n × n games).

Conclusion: M3, Sat, and UCB perform best. They do not maintain plain averages but geometric (decaying) averages of payoffs (see the sketch below). Another interesting direction is exploring why Exp3 is the best MAL player on both cooperative games and competitive games, but not on general-sum games, and to exploit this fact to design a new MAL algorithm.
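
The "geometric (decaying) average" is an exponentially weighted running mean. A minimal sketch of the update rule; the decay rate 0.5 in the example is only illustrative.

```python
def decaying_average(payoffs, decay=0.99):
    """Exponentially weighted ('geometric') running average of a payoff stream:
    each new payoff is blended in with weight (1 - decay), so recent payoffs
    count more and old payoffs are gradually forgotten."""
    avg = None
    for p in payoffs:
        avg = p if avg is None else decay * avg + (1.0 - decay) * p
    return avg

print(decaying_average([1, 1, 1, 5, 5, 5], decay=0.5))   # leans toward the recent 5s
```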

Work of Airiau et al. (2007)

Airiau et al. (fitness-proportionate selection)
