
Multi-agent learning: Satisficing strategies (PowerPoint presentation, Ronald Chu and Geertjan van Vliet)



1. Multi-agent learning: Satisficing strategies
   Ronald Chu, Geertjan van Vliet, Technical Artificial Intelligence, Universiteit Utrecht.
   Last update: Thursday 24th March, 2011 at 12:53h.

2. Outline
   • What is satisficing?
   • Satisficing in the repeated prisoner's dilemma (RPD)
   • Satisficing in the multi-agent social dilemma (MASD)
   References: Stimpson et al. (2001), "Satisficing and Learning Cooperation in the Prisoner's Dilemma"; Stimpson et al. (2003), "Learning to Cooperate in a Social Dilemma: A Satisficing Approach to Bargaining".

3. Satisficing (1)
   Optimize: choose the best available option.
   Satisfice: choose an option that meets a certain aspiration level; it does not have to be unique or in any way the best.
   Why satisficing?
   • No information is needed except the available actions and the payoff of the last action
   • The aspiration level is adaptive

4. Satisficing (2)
   At time t, player A has a strategy pair (A_t, α_t):
   • action A_t ∈ {C, D}
   • aspiration level α_t
   • payoff R_t
   The strategy is updated each round:
   • A_{t+1} = A_t iff R_t ≥ α_t, otherwise A_{t+1} ≠ A_t
   • α_{t+1} = λ·α_t + (1 − λ)·R_t, where 0 ≤ λ ≤ 1
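A minimal Python sketch of this update rule for a single player (the function name and structure are ours, not from the slides):

    import random

    def satisficing_update(action, aspiration, payoff, lam, actions=("C", "D")):
        """One round of the satisficing rule: keep the action if the last payoff
        met the aspiration level, otherwise switch; then move the aspiration
        towards the received payoff with learning rate lam (0 <= lam <= 1)."""
        if payoff >= aspiration:
            next_action = action                                   # satisfied: repeat
        else:
            next_action = random.choice([a for a in actions if a != action])
        next_aspiration = lam * aspiration + (1 - lam) * payoff    # aspiration update
        return next_action, next_aspiration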

5. Satisficing (3)
   Satisficing strategy (with λ = 0.5) against a tit-for-tat strategy.
   Columns: round t, tit-for-tat's action (TfT), the satisficing player's action A_t, its payoff R_t and its aspiration α_t.

    t  TfT  A_t  R_t   α_t  |   t  TfT  A_t  R_t   α_t
    0   C    C    3   4.00  |  10   C    D    4   1.72
    1   C    D    4   3.50  |  11   D    D    2   2.86
    2   D    D    2   3.75  |  12   D    C    1   2.43
    3   D    C    1   2.88  |  13   C    D    4   1.71
    4   C    D    4   1.94  |  14   D    D    2   2.86
    5   D    D    2   2.97  |  15   D    C    1   2.43
    6   D    C    1   2.48  |  16   C    D    4   1.71
    7   C    D    4   1.74  |  17   D    D    2   2.86
    8   D    D    2   2.87  |  18   D    C    1   2.43
    9   D    C    1   2.44  |  19   C    D    4   1.71
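This trace can be reproduced with a short, self-contained Python sketch (our reconstruction; the concrete payoff values 4/3/2/1 are read off the table rather than stated on the slide):

    # Payoffs as read off the table: 4 = defect against C, 3 = mutual cooperation,
    # 2 = mutual defection, 1 = cooperate against D (inferred, not stated).
    PAYOFF = {("C", "C"): 3, ("D", "C"): 4, ("D", "D"): 2, ("C", "D"): 1}

    def satisficing_vs_tit_for_tat(rounds=20, lam=0.5, aspiration=4.0):
        """Replay the table: a satisficing player (initial action C, initial
        aspiration 4.0, lambda = 0.5) against tit-for-tat."""
        a, tft = "C", "C"                      # both start by cooperating
        for t in range(rounds):
            payoff = PAYOFF[(a, tft)]          # satisficing player's payoff R_t
            print(f"{t:2d}  TfT={tft}  A={a}  R={payoff}  alpha={aspiration:.2f}")
            next_a = a if payoff >= aspiration else ("D" if a == "C" else "C")
            aspiration = lam * aspiration + (1 - lam) * payoff
            a, tft = next_a, a                 # tit-for-tat copies the opponent's last move

    satisficing_vs_tit_for_tat()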

6. Repeated Prisoner's Dilemma (1)
   A two-player, two-action social dilemma.
   • Initial research focuses on the Nash equilibrium
   • Mutual cooperation is rational in the repeated prisoner's dilemma (Axelrod, 1984)
   • It is usually assumed that the agent knows:
     – the structure of the game
     – the decisions of the opponent(s)
     – the payoffs of the opponent(s)
     – that the opponents' actions affect the outcomes

7. Repeated Prisoner's Dilemma (2)
   Extend the notation to a two-player game:
   • the second player has strategy pair (B_t, β_t)
   • both players have the same learning rate λ
   The payoff matrix of the PD is generalized (row player's payoff first):

              C         D
       C   (σ, σ)    (0, 1)
       D   (1, 0)    (δ, δ)

   • σ is the payoff for mutual cooperation, δ the payoff for mutual defection
   • 0 < δ < σ < 1 and σ > 0.5
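A direct transcription of this payoff matrix into a small Python helper (the function and the example values are ours):

    def pd_payoffs(a, b, sigma, delta):
        """Payoffs (row player, column player) in the generalized PD,
        with 0 < delta < sigma < 1 and sigma > 0.5."""
        table = {
            ("C", "C"): (sigma, sigma),
            ("C", "D"): (0.0, 1.0),
            ("D", "C"): (1.0, 0.0),
            ("D", "D"): (delta, delta),
        }
        return table[(a, b)]

    print(pd_payoffs("D", "C", sigma=0.8, delta=0.3))   # (1.0, 0.0)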

8. RPD Experiment
   Several possible outcomes:
   • convergence to a fixed strategy
   • convergence to some action cycle
   • no convergence
   Stimpson et al. ran 5,000 runs of the repeated PD, with uniformly distributed, bounded random parameter values.
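The experiment can be sketched as a small Monte Carlo loop (our reconstruction: the exact parameter ranges and the convergence test are assumptions, not taken from the slides):

    import random

    def run_rpd(sigma, delta, lam, aspirations, actions, rounds=1000):
        """Two satisficing players in the generalized repeated PD.  Returns True
        if the pair converges to mutual cooperation (crude test: the last
        100 joint actions are all (C, C)).  Both players share lambda."""
        def payoff(me, other):
            return {("C", "C"): sigma, ("C", "D"): 0.0,
                    ("D", "C"): 1.0, ("D", "D"): delta}[(me, other)]
        (a, b), (alpha, beta) = actions, aspirations
        history = []
        for _ in range(rounds):
            ra, rb = payoff(a, b), payoff(b, a)
            history.append((a, b))
            a = a if ra >= alpha else ("D" if a == "C" else "C")
            b = b if rb >= beta else ("D" if b == "C" else "C")
            alpha = lam * alpha + (1 - lam) * ra
            beta = lam * beta + (1 - lam) * rb
        return all(pair == ("C", "C") for pair in history[-100:])

    cc = 0
    for _ in range(5000):
        delta = random.uniform(0.0, 0.5)            # ranges below are our guess
        sigma = random.uniform(0.5, 1.0)
        lam = random.uniform(0.0, 1.0)
        aspirations = (random.uniform(0.0, 1.0), random.uniform(0.0, 1.0))
        actions = (random.choice("CD"), random.choice("CD"))
        cc += run_rpd(sigma, delta, lam, aspirations, actions)
    print(f"mutual cooperation in {100 * cc / 5000:.1f}% of runs")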

9. RPD Experiment: results
   • 74%: CC
   • 25%: DD-DC-DD-CD (cycle)
   • 1%: DD
   • 0%: DD-CC-DC
   Several parameters influence convergence:
   • payoffs
   • initial aspirations
   • initial actions
   • learning rate

10. RPD Experiment: payoffs

11. RPD Experiment: initial aspirations

12. RPD Experiment: initial actions
    Convergence to mutual cooperation by initial actions:
    • CC: 81.6%
    • DD: 81.6%
    • random: 73.7%
    • DC or CD: 66.7%

13. RPD Experiment: learning rate

14. RPD Experiment: conclusion (1)
    The satisficing strategy converges to mutual cooperation in the RPD given:
    1. a big difference between the mutual-cooperation and mutual-defection payoffs
    2. high initial aspirations
    3. similar initial behavior
    4. a slow learning rate
    Stimpson et al. ran 5,000 runs with these parameters and observed 100% convergence to mutual cooperation.

15. RPD Experiment: conclusion (2)
    In our 5,000 runs there was 94.1% convergence to mutual cooperation (94.8% with a maximum of 100,000 rounds).

16. Multi-agent social dilemma
    • Introduction
    • Satisficing algorithm

17. Multi-agent social dilemma (1)
    Basic characteristics:
    • a choice between a selfish goal and a group goal
    • benefits from both the group goal and the selfish goal
    • multi-action, multi-agent (more than 2x2)
    • repeated game
    • individual defection is the best option as long as the other agents contribute

18. Multi-agent social dilemma (2)
    Game structure:
    • there are M + 1 actions and N agents (N = |A|)
    • each agent i ∈ A contributes c_i ∈ {0, ..., M} units to the group
    • the reward received is
         R_i(c) = k_g · (∑_{j ∈ A} c_j) + k_s · (M − c_i)
    • the dynamics depend on the weight of the group goal, k_g, versus the selfish goal, k_s, which is assumed constant and the same for all agents
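As a sketch, the reward can be written directly in Python (the helper name and example numbers are ours; the normalized weights used in the example come from the later reward-function slides):

    def masd_reward(contributions, i, k_g, k_s, M):
        """Reward for agent i: k_g * (total group contribution)
        + k_s * (units agent i keeps for itself)."""
        return k_g * sum(contributions) + k_s * (M - contributions[i])

    # illustrative example: N = 3 agents, M = 10, contributions c = (10, 4, 0),
    # with the normalized weights k_g = 1/(N*M) and k_s = k/M for k = 0.6
    print(masd_reward((10, 4, 0), i=0, k_g=1 / 30, k_s=0.06, M=10))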

19. MASD Satisficing algorithm (1)
    • Ideally converges to (M, ..., M)
    • All agents need to be satisficed to converge to a joint action
    • One agent will give up playing M if another changes its strategy
    • Works best with:
      – initial aspirations higher than the best possible reward
      – a slow learning rate

20. MASD Satisficing algorithm (2)
    At time t, each player i ∈ A has a strategy pair (A_i^t, α_i^t):
    • action A_i^t ∈ {0, ..., M}
    • aspiration level α_i^t
    • payoff R_i^t resulting from its strategy at t − 1
    The strategy is updated each round:
    • A_i^{t+1} = A_i^t iff R_i^t ≥ α_i^t, otherwise a new action is chosen at random
    • α_i^{t+1} = λ·α_i^t + (1 − λ)·R_i^t
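A minimal Python version of this per-agent update (our sketch; we assume, by analogy with the RPD rule, that the randomly chosen new action differs from the current one):

    import random

    def masd_satisficing_update(action, aspiration, payoff, lam, M):
        """Keep the contribution level if the payoff met the aspiration,
        otherwise pick a different level uniformly from {0, ..., M};
        then move the aspiration towards the received payoff."""
        if payoff >= aspiration:
            next_action = action
        else:
            next_action = random.choice([a for a in range(M + 1) if a != action])
        next_aspiration = lam * aspiration + (1 - lam) * payoff
        return next_action, next_aspiration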

21. MASD Satisficing algorithm: example
    Example with M = 10, k = 0.6 and λ = 0.99.
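The example can be approximated with the following self-contained simulation sketch; the number of agents N = 4, the initial aspirations and the number of rounds are our assumptions (the slide only gives M, k and λ), and the normalized weights k_g = 1/(NM), k_s = k/M are taken from the later reward-function slides:

    import random

    def simulate_masd(N=4, M=10, k=0.6, lam=0.99, rounds=5000, seed=0):
        """Satisficing agents in the MASD with normalized reward weights.
        Aspirations start above the best possible reward, as recommended."""
        rng = random.Random(seed)
        k_g, k_s = 1.0 / (N * M), k / M
        actions = [rng.randint(0, M) for _ in range(N)]
        upper_bound = k_g * N * M + k_s * M        # unreachable upper bound on the reward
        aspirations = [1.1 * upper_bound] * N
        for _ in range(rounds):
            total = sum(actions)
            rewards = [k_g * total + k_s * (M - a) for a in actions]
            for i in range(N):
                if rewards[i] < aspirations[i]:    # not satisficed: try something else
                    actions[i] = rng.choice([a for a in range(M + 1) if a != actions[i]])
                aspirations[i] = lam * aspirations[i] + (1 - lam) * rewards[i]
        return actions

    print(simulate_masd())   # with slow learning this tends towards (M, ..., M)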

22. Break

23. MASD Reward function: k_g and k_s
    • R_i(c) = k_g · (∑_{j ∈ A} c_j) + k_s · (M − c_i)
    • To make the reward range independent of N and M, take
         k_g = 1/(NM),   k_s = 1/M
    • Introduce a weight factor k for the selfish goal; then k_s = k/M.

24. MASD Reward function: interesting values for k
    With k_g = 1/(NM) and k_s = k/M:
    • the goals are equally important when k_g = k_s ⇔ k = 1/N
    • when k > 1, the selfish goal is always preferred by any agent (exercise)
    • the interesting range is therefore 1/N < k < 1, which means that k_s > k_g
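A quick numeric check of these regimes (N, M and the sample values of k are illustrative choices of ours): at k = 1/N the two weights coincide, and for k > 1 even full group cooperation pays less than full defection, which is one way to read the last two bullets.

    N, M = 4, 10

    def reward(c, i, k):
        """Normalized MASD reward with k_g = 1/(N*M) and k_s = k/M."""
        return sum(c) / (N * M) + (k / M) * (M - c[i])

    for k in (1 / N, 0.6, 1.5):
        all_coop = reward((M,) * N, 0, k)     # everyone contributes M
        all_defect = reward((0,) * N, 0, k)   # nobody contributes anything
        print(f"k={k:.2f}  k_g={1 / (N * M):.4f}  k_s={k / M:.4f}  "
              f"all-C reward={all_coop:.2f}  all-D reward={all_defect:.2f}")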

25. MASD Reward function: the final reward function
    • Inserting the new constants and writing c_{-i} = ∑_{j ∈ A \ {i}} c_j for the contribution of the other agents:
         R_i(c) = (c_i + c_{-i}) / (NM) + (k/M) · (M − c_i)          (1)
    • Dividing by (1 − k) and dropping constants:
         R_i(c) = ((1 − kN) · c_i + c_{-i}) / (NM · (1 − k))          (2)
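Since both steps (subtracting the constant k and dividing by 1 − k, with k < 1) are positive affine transformations, equation (2) induces exactly the same preferences as equation (1). A small numeric check with illustrative values (our own) confirms the relation:

    from itertools import product

    N, M, k = 3, 4, 0.6                       # small illustrative values
    k_s = k / M

    def r1(c, i):                             # equation (1)
        return sum(c) / (N * M) + k_s * (M - c[i])

    def r2(c, i):                             # equation (2)
        c_minus_i = sum(c) - c[i]
        return ((1 - k * N) * c[i] + c_minus_i) / (N * M * (1 - k))

    # (2) should equal ((1) - k) / (1 - k) for every joint contribution profile
    for c in product(range(M + 1), repeat=N):
        for i in range(N):
            assert abs(r2(c, i) - (r1(c, i) - k) / (1 - k)) < 1e-12
    print("equation (2) is a positive affine transformation of equation (1)")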
