Round robin tournament

Given a pool of games G to test on, all approaches have in common that they produce a grand head-to-head table of results:

                 A1    A2   ...   A12   avg
  algorithm A1   3.2   5.1  ...   4.7   4.1
            A2   2.4   1.2  ...   2.2   1.3
            ...  ...   ...  ...   ...   ...
            A12  3.1   6.1  ...   3.8   4.2

■ Entries are performance measures for the protagonist (row), which almost always is average payoff (alternatives: no-regret, . . . ).
■ Often each entry is computed multiple times to even out randomness in the algorithms (which are implementations of response rules).
■ Sometimes there is a settling-in phase (a.k.a. burn-in phase) in which payoffs are not yet recorded.
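A minimal sketch of how such a grand table is typically filled in. The `play_game` helper and the algorithm interface are hypothetical placeholders, not part of any of the cited studies; the repeat, round and burn-in counts are example values only.

```python
# Sketch: fill a grand head-to-head table by round-robin play.
# play_game(game, a, b, rounds) is assumed to return the per-round payoffs of the row player a.
import itertools
import statistics

def grand_table(algorithms, games, play_game, repeats=5, rounds=200, burn_in=50):
    """Return table[a][b] = average payoff of algorithm a (row) against b (column)."""
    table = {a: {} for a in algorithms}
    for a, b in itertools.product(algorithms, repeat=2):
        scores = []
        for game in games:
            for _ in range(repeats):                              # even out randomness
                payoffs = play_game(game, a, b, rounds)           # row player's payoffs
                scores.append(statistics.mean(payoffs[burn_in:])) # skip the settling-in phase
        table[a][b] = statistics.mean(scores)
    return table
```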
Work of Axelrod (1980, 1984)
Axelrod receiving the National Medal of Science (2014)
Axelrod: tournament for the repeated prisoner's dilemma

■ One game to test: the prisoner's dilemma.
■ Contestants: 14 constructed algorithms + 1 random = 15: Tit-for-tat, Shubik, Nydegger, Joss, . . . , Random. Response rules (algorithms) were mostly reactive. One could hardly speak of learning.
■ Grand table: all pairs play 200 rounds. This was repeated 5 times to even out randomness.
■ Winner: Tit-for-tat.
■ Second tournament: 64 contestants. All contestants were informed about the results of the first tournament. Winner: Tit-for-tat.
■ In 2012, Alexander Stewart and Joshua Plotkin ran a variant of Axelrod's tournament with 19 strategies to test the effectiveness of the then newly discovered Zero-Determinant strategies.

Reference: Axelrod, Robert. "Effective Choice in the Prisoner's Dilemma." Journal of Conflict Resolution 24.1 (1980): 3–25.
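An illustrative sketch of a single tournament match (not Axelrod's original code): Tit-for-tat against a random player in the 200-round repeated prisoner's dilemma, using the usual payoff convention T=5, R=3, P=1, S=0.

```python
# Sketch: one match of the repeated prisoner's dilemma between two strategies.
import random

PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def tit_for_tat(my_history, opp_history):
    return "C" if not opp_history else opp_history[-1]   # cooperate first, then mirror

def random_player(my_history, opp_history):
    return random.choice(["C", "D"])

def match(strategy_a, strategy_b, rounds=200):
    hist_a, hist_b, score_a, score_b = [], [], 0, 0
    for _ in range(rounds):
        move_a = strategy_a(hist_a, hist_b)
        move_b = strategy_b(hist_b, hist_a)
        pay_a, pay_b = PAYOFF[(move_a, move_b)]
        hist_a.append(move_a); hist_b.append(move_b)
        score_a += pay_a; score_b += pay_b
    return score_a, score_b

print(match(tit_for_tat, random_player))   # cumulative scores of the two players
```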
Work of Zawadzki et al. (2014)
Zawadzki et al.

■ Contestants: FP, Determinate, Awesome, Meta, WoLF-IGA, GSA, RVS, QL, Minmax-Q, Minmax-Q-IDR, Random. A motivation for this set of 11 algorithms, other than "state-of-the-art", wasn't given.
■ Games: a suite of 13 interesting families, D = D1, . . . , D13: D1 = games with normal covariant random payoffs; D2 = Bertrand oligopoly; D3 = Cournot duopoly; D4 = dispersion games; D5 = grab-the-dollar type games; D6 = guess two thirds of the average games; . . .
■ Game pool: 600 games: 100 games for each size 2×2, 4×4, 6×6, 8×8, 10×10, randomly selected from D, and 100 games of dimension 2×2 from Rapoport's catalogue.
■ Grand table: each algorithm pair plays all 600 games for 10^4 rounds.
■ Evaluation: through non-parametric tests and squared heat plots.
■ Conclusion: Q-learning is the overall winner.
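A hedged sketch of how one of the game families (D1, "games with normal covariant random payoffs") might be generated: both players' payoffs for each action profile are drawn from a bivariate normal with correlation rho. The use of NumPy and the parameter rho are my assumptions; the exact generator used by Zawadzki et al. is not detailed on this slide.

```python
# Sketch: a covariant random game with correlation rho between the two players' payoffs.
import numpy as np

def covariant_random_game(n_actions, rho=0.0, seed=0):
    """Return (payoff_row, payoff_col), each an n_actions x n_actions payoff matrix."""
    rng = np.random.default_rng(seed)
    cov = [[1.0, rho], [rho, 1.0]]
    draws = rng.multivariate_normal([0.0, 0.0], cov, size=(n_actions, n_actions))
    return draws[..., 0], draws[..., 1]

# e.g. one 4x4 game with mildly aligned interests (rho = 0.5)
A, B = covariant_random_game(4, rho=0.5)
```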
Zawadzki et al.

[Figure: mean reward over all opponents and games.]
[Figure: mean regret over all opponents and games.]
Zawadzki et al.

[Figure: mean reward against different game suites.]
[Figure: mean reward against different opponents.]
Parametric test: paired t-test

            A1              A2              A3          . . .
        G1   G2   G3    G1   G2   G3    G1   G2   G3
  A1   2.1  3.1  4.7   5.1  1.1  1.2   3.5  4.2  3.8    . . .
  A2   2.7  3.5  4.1   4.9  0.9  1.9   3.7  4.7  4.5    . . .

Paired t-test:
■ Compute the average difference X̄_D, and the standard deviation of the differences s̄_D, over all n pairs (we see nine here).
■ If the two series are generated by the same random process, the test statistic t = X̄_D / (s̄_D / √n) should follow Student's t-distribution with mean 0 and n − 1 degrees of freedom.
■ If t is too eccentric, then we'll have to reject that possibility, since eccentric values of t are unlikely ("have a low p-value").
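A worked example of the paired t-test with SciPy, using the nine per-game averages of A1 and A2 shown in the table above.

```python
# Paired t-test on the two rows of the table: same opponent/game pair for each entry.
from scipy import stats

scores_A1 = [2.1, 3.1, 4.7, 5.1, 1.1, 1.2, 3.5, 4.2, 3.8]   # A1 vs opponents A1..A3, games G1..G3
scores_A2 = [2.7, 3.5, 4.1, 4.9, 0.9, 1.9, 3.7, 4.7, 4.5]   # A2 on the same opponent/game pairs

t, p = stats.ttest_rel(scores_A1, scores_A2)
print(f"t = {t:.3f}, p = {p:.3f}")   # a small p-value would let us reject H0 (same random process)
```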
Non-parametric test: the Kolmogorov-Smirnov test

■ Test whether two distributions are generated by the same random process. H0: yes. H1: no.
■ The test statistic is the maximum distance between the empirical cumulative distribution functions of the two samples.
■ The p-value is the probability of seeing a test statistic (i.e., max distance) as high as the one observed, under the assumption that both samples were drawn from the same distribution.
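A minimal two-sample Kolmogorov-Smirnov test with SciPy on two made-up payoff samples; `ks_2samp` returns the maximum distance D between the two empirical CDFs and the corresponding p-value.

```python
# Two-sample KS test on hypothetical per-game payoff samples of two algorithms.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
payoffs_a = rng.normal(loc=3.0, scale=1.0, size=200)   # hypothetical payoffs of algorithm A
payoffs_b = rng.normal(loc=3.2, scale=1.0, size=200)   # hypothetical payoffs of algorithm B

d, p = stats.ks_2samp(payoffs_a, payoffs_b)
print(f"D = {d:.3f}, p = {p:.3f}")   # small p: reject H0 that both samples come from the same process
```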
Zawadzki et al.

Variations / extensions. Investigate:
■ The relation between game sizes and rewards. Outcome: no relation.
■ The correlation between regret and average reward.
■ The correlation between distance to nearest Nash and average reward.
■ Which algorithms probabilistically dominate which other algorithms. (Cf. the article for a definition of this concept.) Outcome: Q-learning is the only algorithm that is not probabilistically dominated by other algorithms.
■ The difference between average reward and maxmin value (enforceable payoff). Outcome: Q-learning attained an enforceable payoff more frequently than any other algorithm.
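A sketch (not from the paper) of computing the maxmin value of the row player in a matrix game, i.e. the enforceable payoff level against which average reward is compared above, via linear programming with SciPy.

```python
# Maxmin ("security") value of the row player of payoff matrix A, by LP.
import numpy as np
from scipy.optimize import linprog

def maxmin_value(A):
    """Return the maxmin value and a maxmin mixed strategy for the row player."""
    A = np.asarray(A, dtype=float)
    m, n = A.shape
    # Variables z = (x_1, ..., x_m, v); maximise v  <=>  minimise -v.
    c = np.zeros(m + 1); c[-1] = -1.0
    # For every column j:  v - sum_i x_i * A[i, j] <= 0
    A_ub = np.hstack([-A.T, np.ones((n, 1))])
    b_ub = np.zeros(n)
    # Probabilities sum to one; v is unconstrained.
    A_eq = np.ones((1, m + 1)); A_eq[0, -1] = 0.0
    b_eq = np.array([1.0])
    bounds = [(0, None)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1], res.x[:m]

# Example: matching pennies -> maxmin value 0 with strategy (1/2, 1/2).
value, strategy = maxmin_value([[1, -1], [-1, 1]])
print(value, strategy)
```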
Work of Bouzy et al. (2010)
Bouzy et al. (2010)

■ Contestants: Minimax, FP, QL, JR, Sat, M3, UCB, Exp3, HMC, Bully, Optimistic, Random (12 algorithms).
■ Games: random 2-player, 3×3-action games, with payoffs in Z ∩ [−9, 9] (if I understand correctly; else it's [−9, 9]).
■ Grand table: each pair plays 3 × 10^6 rounds (!) on a random game. Restart 100 times to even out randomness.
■ Final ranking: UCB, M3, Sat, JR, . . .
■ Evaluation: plot with x-axis = log rounds and y-axis = the rank of that algorithm w.r.t. performance.
■ Bouzy et al. are familiar with the work of Airiau et al. and Zawadzki et al. Contrary to Airiau et al. and Zawadzki et al., the ranking still fluctuates after 3 × 10^6 rounds . . . ?!
Bouzy et al. (2010)

Variation: eliminate by rank (a sketch of this procedure follows below).
■ Algorithm: Repeat:
  ● Rank, eliminate the worst, and subtract all payoffs earned against that player from the revenues of all survivors.
■ Final ranking: M3, Sat, UCB, JR, . . .

Variation: eliminate by lag.
■ Algorithm: Repeat:
  ● Organise a tournament. If the difference between the cumulative returns of the two worst performers is larger than 600/√n_T (n_T the number of tournaments performed), then the laggard is removed. The laggard is also removed if the global ranking has not changed during 100·n_p²·(n_p − 1)² tournaments since the last elimination (n_p is the current number of players).
■ Final ranking: M3, Sat, UCB, FP, . . .
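A minimal sketch of the eliminate-by-rank procedure as I read it from this slide (not Bouzy et al.'s code). Here `payoff[i][j]` is assumed to hold the cumulative payoff player i earned against player j over the tournament.

```python
# Sketch: repeatedly drop the worst-ranked player; payoffs earned against it no longer count.
def eliminate_by_rank(players, payoff):
    """Return the players ordered from best to worst (last eliminated = best)."""
    alive = list(players)
    order = []
    while len(alive) > 1:
        # Revenue of each survivor: payoffs earned against the other survivors only,
        # so payoffs against eliminated players drop out automatically.
        revenue = {p: sum(payoff[p][q] for q in alive if q != p) for p in alive}
        worst = min(alive, key=revenue.get)
        alive.remove(worst)
        order.append(worst)
    order.append(alive[0])
    return list(reversed(order))

# Tiny usage example with made-up cumulative payoffs.
payoff = {"A": {"A": 0, "B": 3, "C": 5},
          "B": {"A": 2, "B": 0, "C": 4},
          "C": {"A": 1, "B": 1, "C": 0}}
print(eliminate_by_rank(["A", "B", "C"], payoff))   # -> ['A', 'B', 'C']
```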
Bouzy et al.

[Figure: ranking evolution according to the number of steps played in games (log scale). The key is ordered according to the final ranking.]

[Figure: ranking based on eliminations (log scale). The key is ordered according to the final ranking.]
Bouzy et al. (2010)

Variation: select sub-classes of games.
■ Only cooperative games (shared payoffs): Exp3, M3, Bully, JR, . . .
■ Only competitive games (zero-sum payoffs): Exp3, M3, Minimax, JR, . . .
■ Specific matrix games: penalty game, climbing game, coordination game, . . .
■ Different numbers of actions (n × n games).

Conclusion: M3, Sat, and UCB perform best. Do not maintain plain averages but geometric (decaying) averages of payoffs (see the sketch below). Another interesting direction is exploring why Exp3 is the best MAL player on both cooperative games and competitive games, but not on general-sum games, and to exploit this fact to design a new MAL algorithm.
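An illustrative sketch of the "geometric (decaying) average" recommendation: an exponentially weighted average weighs recent payoffs more heavily than the plain running mean. The decay parameter `gamma` is an assumption for illustration, not a value taken from the paper.

```python
# Decaying (geometric) average of a payoff stream versus the plain average.
def decaying_average(payoffs, gamma=0.9):
    avg, weight = 0.0, 0.0
    for r in payoffs:
        avg = gamma * avg + (1.0 - gamma) * r       # exponential moving average
        weight = gamma * weight + (1.0 - gamma)     # normaliser for the early rounds
    return avg / weight if weight > 0 else 0.0

payoffs = [1, 1, 1, 0, 0, 0, 5, 5, 5]               # payoffs shift over time
print(sum(payoffs) / len(payoffs))                  # plain average: 2.0
print(decaying_average(payoffs, gamma=0.5))         # ~4.4: tracks the recent payoffs of 5
```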
Work of Airiau et al. (2007)
Airiau et al.