Round robin tournament

Given a pool of games G to test on, all approaches have in common that they produce a grand head-to-head table of results:

                 A1    A2   ...   A12   avg
  algorithm A1   3.2   5.1  ...   4.7   4.1
            A2   2.4   1.2  ...   2.2   1.3
            ...  ...   ...  ...   ...   ...
            A12  3.1   6.1  ...   3.8   4.2

■ Entries are performance measures for the protagonist (row), which almost always is average payoff (alternatives: no-regret, . . . ).
■ Often each entry is computed multiple times to even out randomness in the algorithms (which are implementations of response rules).
■ Sometimes there is a settling-in phase (a.k.a. burn-in phase) in which payoffs are not yet recorded.
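A minimal sketch of how such a grand table is typically filled in. The `play_game` helper and the algorithm interface are hypothetical placeholders, not part of any of the cited studies; the repeat, round and burn-in counts are example values only.

```python
# Sketch: fill a grand head-to-head table by round-robin play.
# play_game(game, a, b, rounds) is assumed to return the per-round payoffs of the row player a.
import itertools
import statistics

def grand_table(algorithms, games, play_game, repeats=5, rounds=200, burn_in=50):
    """Return table[a][b] = average payoff of algorithm a (row) against b (column)."""
    table = {a: {} for a in algorithms}
    for a, b in itertools.product(algorithms, repeat=2):
        scores = []
        for game in games:
            for _ in range(repeats):                              # even out randomness
                payoffs = play_game(game, a, b, rounds)           # row player's payoffs
                scores.append(statistics.mean(payoffs[burn_in:])) # skip the settling-in phase
        table[a][b] = statistics.mean(scores)
    return table
```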
Work of Axelrod (1980, 1984)
Axelrod receiving the National Medal of Science (2014)
Axelrod: tournament for the repeated prisoner's dilemma

■ One game to test: the prisoner's dilemma.
■ Contestants: 14 constructed algorithms + 1 random = 15: Tit-for-tat, Shubik, Nydegger, Joss, . . . , Random. Response rules (algorithms) were mostly reactive. One could hardly speak of learning.
■ Grand table: all pairs play 200 rounds. This was repeated 5 times to even out randomness.
■ Winner: Tit-for-tat.
■ Second tournament: 64 contestants. All contestants were informed about the results of the first tournament. Winner: Tit-for-tat.
■ In 2012, Alexander Stewart and Joshua Plotkin ran a variant of Axelrod's tournament with 19 strategies to test the effectiveness of the then newly discovered Zero-Determinant strategies.

Reference: Axelrod, Robert. "Effective Choice in the Prisoner's Dilemma." Journal of Conflict Resolution 24.1 (1980): 3–25.
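An illustrative sketch of a single tournament match (not Axelrod's original code): Tit-for-tat against a random player in the 200-round repeated prisoner's dilemma, using the usual payoff convention T=5, R=3, P=1, S=0.

```python
# Sketch: one match of the repeated prisoner's dilemma between two strategies.
import random

PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def tit_for_tat(my_history, opp_history):
    return "C" if not opp_history else opp_history[-1]   # cooperate first, then mirror

def random_player(my_history, opp_history):
    return random.choice(["C", "D"])

def match(strategy_a, strategy_b, rounds=200):
    hist_a, hist_b, score_a, score_b = [], [], 0, 0
    for _ in range(rounds):
        move_a = strategy_a(hist_a, hist_b)
        move_b = strategy_b(hist_b, hist_a)
        pay_a, pay_b = PAYOFF[(move_a, move_b)]
        hist_a.append(move_a); hist_b.append(move_b)
        score_a += pay_a; score_b += pay_b
    return score_a, score_b

print(match(tit_for_tat, random_player))   # cumulative scores of the two players
```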
Work of Zawadzki et al. (2014)
Zawadzki et al.

■ Contestants: FP, Determinate, Awesome, Meta, WoLF-IGA, GSA, RVS, QL, Minmax-Q, Minmax-Q-IDR, Random. A motivation for this set of 11 algorithms, other than "state-of-the-art", wasn't given.
■ Games: a suite of 13 interesting families, D = D1, . . . , D13: D1 = games with normal covariant random payoffs; D2 = Bertrand oligopoly; D3 = Cournot duopoly; D4 = dispersion games; D5 = grab-the-dollar type games; D6 = guess two thirds of the average games; . . .
■ Game pool: 600 games: 100 games for each size 2×2, 4×4, 6×6, 8×8, 10×10, randomly selected from D, and 100 games of dimension 2×2 from Rapoport's catalogue.
■ Grand table: each algorithm pair plays all 600 games for 10^4 rounds.
■ Evaluation: through non-parametric tests and squared heat plots.
■ Conclusion: Q-learning is the overall winner.
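A hedged sketch of how one of the game families (D1, "games with normal covariant random payoffs") might be generated: both players' payoffs for each action profile are drawn from a bivariate normal with correlation rho. The use of NumPy and the parameter rho are my assumptions; the exact generator used by Zawadzki et al. is not detailed on this slide.

```python
# Sketch: a covariant random game with correlation rho between the two players' payoffs.
import numpy as np

def covariant_random_game(n_actions, rho=0.0, seed=0):
    """Return (payoff_row, payoff_col), each an n_actions x n_actions payoff matrix."""
    rng = np.random.default_rng(seed)
    cov = [[1.0, rho], [rho, 1.0]]
    draws = rng.multivariate_normal([0.0, 0.0], cov, size=(n_actions, n_actions))
    return draws[..., 0], draws[..., 1]

# e.g. one 4x4 game with mildly aligned interests (rho = 0.5)
A, B = covariant_random_game(4, rho=0.5)
```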
Zawadzki et al.

[Figure: mean reward over all opponents and games.]
[Figure: mean regret over all opponents and games.]
Zawadzki et al.

[Figure: mean reward against different game suites.]
[Figure: mean reward against different opponents.]
Parametric test: paired t-test

            A1              A2              A3          . . .
        G1   G2   G3    G1   G2   G3    G1   G2   G3
  A1   2.1  3.1  4.7   5.1  1.1  1.2   3.5  4.2  3.8    . . .
  A2   2.7  3.5  4.1   4.9  0.9  1.9   3.7  4.7  4.5    . . .

Paired t-test:
■ Compute the average difference X̄_D, and the standard deviation of the differences s̄_D, over all n pairs (we see nine here).
■ If the two series are generated by the same random process, the test statistic t = X̄_D / (s̄_D / √n) should follow Student's t-distribution with mean 0 and n − 1 degrees of freedom.
■ If t is too eccentric, then we'll have to reject that possibility, since eccentric values of t are unlikely ("have a low p-value").
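A worked example of the paired t-test with SciPy, using the nine per-game averages of A1 and A2 shown in the table above.

```python
# Paired t-test on the two rows of the table: same opponent/game pair for each entry.
from scipy import stats

scores_A1 = [2.1, 3.1, 4.7, 5.1, 1.1, 1.2, 3.5, 4.2, 3.8]   # A1 vs opponents A1..A3, games G1..G3
scores_A2 = [2.7, 3.5, 4.1, 4.9, 0.9, 1.9, 3.7, 4.7, 4.5]   # A2 on the same opponent/game pairs

t, p = stats.ttest_rel(scores_A1, scores_A2)
print(f"t = {t:.3f}, p = {p:.3f}")   # a small p-value would let us reject H0 (same random process)
```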
Non-parametric test: the Kolmogorov-Smirnov test

■ Test whether two distributions are generated by the same random process. H0: yes. H1: no.
■ The test statistic is the maximum distance between the empirical cumulative distribution functions of the two samples.
■ The p-value is the probability of seeing a test statistic (i.e., max distance) as high as the one observed, under the assumption that both samples were drawn from the same distribution.
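A minimal two-sample Kolmogorov-Smirnov test with SciPy on two made-up payoff samples; `ks_2samp` returns the maximum distance D between the two empirical CDFs and the corresponding p-value.

```python
# Two-sample KS test on hypothetical per-game payoff samples of two algorithms.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
payoffs_a = rng.normal(loc=3.0, scale=1.0, size=200)   # hypothetical payoffs of algorithm A
payoffs_b = rng.normal(loc=3.2, scale=1.0, size=200)   # hypothetical payoffs of algorithm B

d, p = stats.ks_2samp(payoffs_a, payoffs_b)
print(f"D = {d:.3f}, p = {p:.3f}")   # small p: reject H0 that both samples come from the same process
```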
Zawadzki et al.

Variations / extensions. Investigate:
■ The relation between game sizes and rewards. Outcome: no relation.
■ The correlation between regret and average reward.
■ The correlation between distance to nearest Nash and average reward.
■ Which algorithms probabilistically dominate which other algorithms. (Cf. the article for a definition of this concept.) Outcome: Q-learning is the only algorithm that is not probabilistically dominated by other algorithms.
■ The difference between average reward and maxmin value (enforceable payoff). Outcome: Q-learning attained an enforceable payoff more frequently than any other algorithm.
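A sketch (not from the paper) of computing the maxmin value of the row player in a matrix game, i.e. the enforceable payoff level against which average reward is compared above, via linear programming with SciPy.

```python
# Maxmin ("security") value of the row player of payoff matrix A, by LP.
import numpy as np
from scipy.optimize import linprog

def maxmin_value(A):
    """Return the maxmin value and a maxmin mixed strategy for the row player."""
    A = np.asarray(A, dtype=float)
    m, n = A.shape
    # Variables z = (x_1, ..., x_m, v); maximise v  <=>  minimise -v.
    c = np.zeros(m + 1); c[-1] = -1.0
    # For every column j:  v - sum_i x_i * A[i, j] <= 0
    A_ub = np.hstack([-A.T, np.ones((n, 1))])
    b_ub = np.zeros(n)
    # Probabilities sum to one; v is unconstrained.
    A_eq = np.ones((1, m + 1)); A_eq[0, -1] = 0.0
    b_eq = np.array([1.0])
    bounds = [(0, None)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1], res.x[:m]

# Example: matching pennies -> maxmin value 0 with strategy (1/2, 1/2).
value, strategy = maxmin_value([[1, -1], [-1, 1]])
print(value, strategy)
```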
Work of Bouzy et al. (2010)
Bouzy et al. (2010)

■ Contestants: Minimax, FP, QL, JR, Sat, M3, UCB, Exp3, HMC, Bully, Optimistic, Random (12 algorithms).
■ Games: random 2-player, 3×3-action games, with payoffs in Z ∩ [−9, 9] (if I understand correctly; else it's [−9, 9]).
■ Grand table: each pair plays 3 × 10^6 rounds (!) on a random game. Restart 100 times to even out randomness.
■ Final ranking: UCB, M3, Sat, JR, . . .
■ Evaluation: plot with x-axis = log rounds and y-axis = the rank of that algorithm w.r.t. performance.
■ Bouzy et al. are familiar with the work of Airiau et al. and Zawadzki et al. Contrary to Airiau et al. and Zawadzki et al., the ranking still fluctuates after 3 × 10^6 rounds . . . ?!
Bouzy et al. (2010)

Variation: eliminate by rank (a sketch of this procedure follows below).
■ Algorithm: Repeat:
  ● Rank, eliminate the worst, and subtract all payoffs earned against that player from the revenues of all survivors.
■ Final ranking: M3, Sat, UCB, JR, . . .

Variation: eliminate by lag.
■ Algorithm: Repeat:
  ● Organise a tournament. If the difference between the cumulative returns of the two worst performers is larger than 600/√n_T (n_T the number of tournaments performed), then the laggard is removed. The laggard is also removed if the global ranking has not changed during 100·n_p²·(n_p − 1)² tournaments since the last elimination (n_p is the current number of players).
■ Final ranking: M3, Sat, UCB, FP, . . .
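A minimal sketch of the eliminate-by-rank procedure as I read it from this slide (not Bouzy et al.'s code). Here `payoff[i][j]` is assumed to hold the cumulative payoff player i earned against player j over the tournament.

```python
# Sketch: repeatedly drop the worst-ranked player; payoffs earned against it no longer count.
def eliminate_by_rank(players, payoff):
    """Return the players ordered from best to worst (last eliminated = best)."""
    alive = list(players)
    order = []
    while len(alive) > 1:
        # Revenue of each survivor: payoffs earned against the other survivors only,
        # so payoffs against eliminated players drop out automatically.
        revenue = {p: sum(payoff[p][q] for q in alive if q != p) for p in alive}
        worst = min(alive, key=revenue.get)
        alive.remove(worst)
        order.append(worst)
    order.append(alive[0])
    return list(reversed(order))

# Tiny usage example with made-up cumulative payoffs.
payoff = {"A": {"A": 0, "B": 3, "C": 5},
          "B": {"A": 2, "B": 0, "C": 4},
          "C": {"A": 1, "B": 1, "C": 0}}
print(eliminate_by_rank(["A", "B", "C"], payoff))   # -> ['A', 'B', 'C']
```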
Bouzy et al.

[Figure: ranking evolution according to the number of steps played in games (log scale). The key is ordered according to the final ranking.]

[Figure: ranking based on eliminations (log scale). The key is ordered according to the final ranking.]
Bouzy et al. (2010)

Variation: select sub-classes of games.
■ Only cooperative games (shared payoffs): Exp3, M3, Bully, JR, . . .
■ Only competitive games (zero-sum payoffs): Exp3, M3, Minimax, JR, . . .
■ Specific matrix games: penalty game, climbing game, coordination game, . . .
■ Different numbers of actions (n × n games).

Conclusion: M3, Sat, and UCB perform best. Do not maintain plain averages but geometric (decaying) averages of payoffs (see the sketch below). Another interesting direction is exploring why Exp3 is the best MAL player on both cooperative games and competitive games, but not on general-sum games, and to exploit this fact to design a new MAL algorithm.
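An illustrative sketch of the "geometric (decaying) average" recommendation: an exponentially weighted average weighs recent payoffs more heavily than the plain running mean. The decay parameter `gamma` is an assumption for illustration, not a value taken from the paper.

```python
# Decaying (geometric) average of a payoff stream versus the plain average.
def decaying_average(payoffs, gamma=0.9):
    avg, weight = 0.0, 0.0
    for r in payoffs:
        avg = gamma * avg + (1.0 - gamma) * r       # exponential moving average
        weight = gamma * weight + (1.0 - gamma)     # normaliser for the early rounds
    return avg / weight if weight > 0 else 0.0

payoffs = [1, 1, 1, 0, 0, 0, 5, 5, 5]               # payoffs shift over time
print(sum(payoffs) / len(payoffs))                  # plain average: 2.0
print(decaying_average(payoffs, gamma=0.5))         # ~4.4: tracks the recent payoffs of 5
```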
Work of Airiau et al. (2007)
Airiau et al.