
Multi-agent learning: Multi-agent reinforcement learning. Gerard Vreeswijk, Intelligent Systems Group, Utrecht University.



  1. Multi-agent learning: Multi-agent reinforcement learning. Gerard Vreeswijk, Intelligent Systems Group, Computer Science Department, Faculty of Sciences, Utrecht University, The Netherlands. Last modified on April 3rd, 2014 at 13:17.

  2. Agents that attempt to learn
  • Independent Learners (IL): agents that attempt to learn i. the values of single actions (single-action RL).
  • Joint Action Learners (JAL): agents that attempt to learn both i. the values of joint actions (multi-action RL) and ii. the behaviour employed by other agents (fictitious play).
  Research questions:
  1. Are there differences between (a) independent learners and (b) joint action learners?
  2. Are RL algorithms guaranteed to converge in multi-agent settings? If so, do they converge to equilibria? Are these equilibria optimal?
  3. How are rates of convergence and limit points influenced by the system structure and action selection strategies?
  Claus and Boutilier address some of these questions in a limited setting, namely a repeated cooperative two-player multiple-action game in strategic form.

  3. Cited work
  • Claus and Boutilier (1998). "The Dynamics of Reinforcement Learning in Cooperative Multiagent Systems". In: Proc. of the Fifteenth National Conf. on Artificial Intelligence, pp. 746-752. The paper on which this presentation is mostly based.
  • Watkins and Dayan (1992). "Q-learning". Machine Learning, Vol. 8, pp. 279-292. Mainly the result that Q-learning converges to the optimum action-values with probability one, as long as all actions are repeatedly sampled in all states and the action-values are represented discretely.
  • Fudenberg, D. and D. Kreps (1993). "Learning Mixed Equilibria". Games and Economic Behavior, Vol. 5, pp. 320-367. Mainly Proposition 6.1 and its proof, pp. 342-344.

  4. Q-learning
  • Single-state reinforcement learning rule:
    Q_new(a) = (1 − λ)·Q_old(a) + λ·r
  • Two sufficient conditions for convergence in Q-learning (Watkins and Dayan, 1992):
    1. The parameter λ decreases through time such that Σ_t λ_t is divergent and Σ_t λ_t² is convergent.
    2. All actions are sampled infinitely often.
  • The general version of Q-learning is multi-state and amounts to continuously updating the various Q(s, a) with
    r(s, a, s′) + γ · max_a Q(s′, a)    (1)
  • In the present setting there is only one state (namely, the stage game G), so that (1) reduces to r(s, a, s), which may be abbreviated to r(a), or even r.
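  As a concrete illustration of the single-state rule and the two Watkins-Dayan conditions, here is a minimal Python sketch; the per-action learning rate λ = 1/n(a) makes Σ_t λ_t divergent and Σ_t λ_t² convergent, and the action names and reward function are invented for illustration, not taken from the slides.

    import random

    def single_state_q_learning(reward_fn, actions, steps=10000):
        """Single-state Q-learning: Q_new(a) = (1 - lam) * Q_old(a) + lam * r."""
        Q = {a: 0.0 for a in actions}
        counts = {a: 0 for a in actions}
        for _ in range(steps):
            a = random.choice(actions)     # every action keeps being sampled
            counts[a] += 1
            lam = 1.0 / counts[a]          # decaying rate: sum lam diverges, sum lam^2 converges
            Q[a] = (1 - lam) * Q[a] + lam * reward_fn(a)
        return Q

    # Hypothetical two-armed example with noisy rewards around 1.0 and 2.0.
    means = {'x': 1.0, 'y': 2.0}
    print(single_state_q_learning(lambda a: random.gauss(means[a], 0.5), ['x', 'y']))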

  5. Exploitive vs. non-exploitive exploration
  • Non-exploitive exploration. Convergence of Q-learning does not depend on the exploration strategy used (it is just that all actions must be sampled infinitely often). This is like what happens in the ε-part of ε-greedy learning.
  • Exploitive exploration. Even during exploration, there is a probabilistic bias towards exploring optimal actions.
    Example: Boltzmann exploration (a.k.a. softmax, mixed logit, or quantal response function): action a is chosen with probability
    e^{Q(a)/T} / Σ_{a′} e^{Q(a′)/T},   with T > 0.
    Letting T → 0 establishes convergence conditions 1 and 2 mentioned above (Watkins and Dayan, 1992).
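  A small sketch of Boltzmann (softmax) action selection over the current Q-values; the Q-values and the temperatures tried below are illustrative assumptions.

    import math
    import random

    def boltzmann_action(Q, T):
        """Pick action a with probability exp(Q(a)/T) / sum_a' exp(Q(a')/T), T > 0."""
        actions = list(Q)
        weights = [math.exp(Q[a] / T) for a in actions]
        return random.choices(actions, weights=weights)[0]

    # As T decreases, the choice concentrates on the currently best action.
    Q = {'L': 0.3, 'R': 1.2}
    for T in (5.0, 1.0, 0.1):
        picks = [boltzmann_action(Q, T) for _ in range(1000)]
        print(T, picks.count('R') / 1000)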

  6. Independent Learning (IL)
  • A MARL algorithm is an independent learner (IL) algorithm if the agents learn Q-values for their individual actions.
  • Experiences for agent i take the form ⟨a_i, r(a_i)⟩, where a_i is the action performed by i and r(a_i) is a reward for action a_i.
  • Learning is based on
    Q_new(a) = (1 − λ)·Q_old(a) + λ·r(a)
  • ILs perform their actions, obtain a reward and update their Q-values without regard to the actions performed by other agents.
  • Typical conditions for Independent Learning:
    – An agent is unaware of the existence of other agents.
    – It cannot identify other agents' actions, or has no reason to believe that other agents are acting strategically.
  • Of course, even if an agent can learn through joint actions, it may still choose to ignore information about the other agents' behaviour.
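  A minimal sketch of one independent learner's bookkeeping: the update sees only the agent's own action and the reward, never the joint action. The constant learning rate and the action labels are illustrative assumptions.

    def il_update(Q, action, reward, lam=0.1):
        """Independent learner update: Q_new(a) = (1 - lam) * Q_old(a) + lam * r(a).
        The experience is just the pair (own action, reward); other agents are invisible."""
        Q[action] = (1 - lam) * Q[action] + lam * reward
        return Q

    # Illustrative: Row only keeps Q-values for its own actions T and B.
    Q_row = {'T': 0.0, 'B': 0.0}
    il_update(Q_row, 'T', 10)   # reward 10 observed, regardless of what Column played
    print(Q_row)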

  7. Joint-Action Learning (JAL)
  • Joint Q-values are estimated: rewards for joint actions. For a 2 × 2 game an agent would have to maintain Q(T, L), Q(T, R), Q(B, L) and Q(B, R).
  • Row can only influence T, B, but not the opponent's actions L, R.
  • Opponent's actions can be estimated through forecasting by, e.g., fictitious play:
    f_i(a_−i) =def Π_{j ≠ i} φ_j(a_−i),
    where φ_j(a_−i) is i's empirical distribution of j's actions in a_−i.
  • Let a_i be an action of player i. A complementary joint action profile is a set of joint actions a_−i such that a = a_i ∪ a_−i is a complete joint action profile.
  • The expected value of an individual action is the sum of the joint Q-values, weighed by the estimated probability of the associated complementary joint action profiles:
    EV(a_i) = Σ_{a_−i ∈ A_−i} Q(a_i ∪ a_−i) · f_i(a_−i)
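  A sketch of this JAL bookkeeping for a 2 × 2 game: an empirical distribution over the single opponent's past actions plays the role of the fictitious-play forecast f_i (the product over j ≠ i collapses to one factor), and EV(a_i) weighs the joint Q-values by that forecast. The joint Q-values and the opponent history below are invented for illustration.

    from collections import Counter

    def fictitious_play_belief(opponent_history):
        """Empirical distribution of the (single) opponent's past actions."""
        counts = Counter(opponent_history)
        total = sum(counts.values())
        return {a: n / total for a, n in counts.items()}

    def expected_value(my_action, joint_Q, belief):
        """EV(a_i) = sum over a_-i of Q(a_i, a_-i) * f_i(a_-i)."""
        return sum(joint_Q[(my_action, a_opp)] * p for a_opp, p in belief.items())

    # Illustrative joint Q-values for the 2 x 2 coordination game and an assumed history.
    joint_Q = {('T', 'L'): 10, ('T', 'R'): 0, ('B', 'L'): 0, ('B', 'R'): 10}
    belief = fictitious_play_belief(['L', 'L', 'R', 'L'])        # f(L) = 0.75, f(R) = 0.25
    print(expected_value('T', joint_Q, belief), expected_value('B', joint_Q, belief))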

  8. Comparing Independent and Joint-Action Learners. Case 1: the coordination game

         L    R
    T   10    0
    B    0   10

  • A JAL maintains beliefs f(a_−i) about the strategy being played by the other agents through fictitious play, and plays a softmax best response.
  • A JAL computes single Q-values by means of explicit belief distributions on joint Q-values. Thus,
    EV(a_i) = Σ_{a_−i ∈ A_−i} Q(a_i ∪ a_−i) · f_i(a_−i)
    is more or less the same as the Q-values learned by ILs.
  • A JAL is able to distinguish Q-values of different joint actions a = a_i ∪ a_−i.
  • However, its ability to use this information is circumscribed by the limited freedom of its own actions a_i ∈ A_i.
  • Thus, even though a JAL may be fairly sure of the relative Q-values of its joint actions, it seems it cannot really benefit from this.
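  To connect this comparison to the convergence plot on the next slide, here is a rough simulation of two independent learners on the coordination game with Boltzmann exploration; the learning rate, cooling schedule and episode counts are illustrative choices, not taken from Claus and Boutilier.

    import math
    import random

    PAYOFF = {('T', 'L'): 10, ('T', 'R'): 0, ('B', 'L'): 0, ('B', 'R'): 10}

    def softmax_pick(Q, temp):
        acts = list(Q)
        weights = [math.exp(Q[a] / temp) for a in acts]
        return random.choices(acts, weights=weights)[0]

    def run_trial(episodes=500, lam=0.1):
        Q_row = {'T': 0.0, 'B': 0.0}
        Q_col = {'L': 0.0, 'R': 0.0}
        for t in range(episodes):
            temp = max(0.1, 5.0 * 0.99 ** t)           # simple cooling schedule
            a_row, a_col = softmax_pick(Q_row, temp), softmax_pick(Q_col, temp)
            r = PAYOFF[(a_row, a_col)]                 # common reward for both agents
            Q_row[a_row] = (1 - lam) * Q_row[a_row] + lam * r
            Q_col[a_col] = (1 - lam) * Q_col[a_col] + lam * r
        greedy = (max(Q_row, key=Q_row.get), max(Q_col, key=Q_col.get))
        return PAYOFF[greedy] == 10                    # did they coordinate?

    trials = 100
    print(sum(run_trial() for _ in range(trials)) / trials)   # fraction of coordinated runs

  In a JAL variant, softmax_pick would be fed the expected values EV(a_i) computed from joint Q-values and beliefs (slide 7) instead of Q_row; by the argument on this slide, that produces much the same behaviour.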

  9. Figure 1: Convergence of coordination for ILs and JALs (averaged over 100 trials).


  11. Case 2: the penalty game

         L    M    R
    T   10    0    k
    C    0    2    0
    B    k    0   10

  Suppose penalty k = −100. The following stories are entirely symmetrical for Row and Column.
  IL:
  1. Initially, Column explores.
  2. Therefore, Row will find T and B on average very unattractive, and will converge to C.
  3. Therefore, Column will find L and R slightly less attractive, and will converge to M as well.
  JAL:
  1. Initially, Column explores.
  2. Therefore, Row gives low EV to T and B, and plays C the most.
  3. Convergence to (C, M).
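  A quick numerical check of the JAL story: with a uniform forecast over Column's actions (Column still exploring), Row's expected values in the penalty game with k = −100 already make C by far the most attractive choice. The snippet below is an illustrative sketch reusing the EV idea from slide 7.

    k = -100
    # Row's payoffs in the penalty game, indexed by (row action, column action).
    penalty_Q = {
        ('T', 'L'): 10, ('T', 'M'): 0, ('T', 'R'): k,
        ('C', 'L'): 0,  ('C', 'M'): 2, ('C', 'R'): 0,
        ('B', 'L'): k,  ('B', 'M'): 0, ('B', 'R'): 10,
    }
    uniform = {'L': 1 / 3, 'M': 1 / 3, 'R': 1 / 3}        # Column still exploring
    for a in ('T', 'C', 'B'):
        ev = sum(penalty_Q[(a, b)] * p for b, p in uniform.items())
        print(a, round(ev, 2))     # T: -30.0, C: 0.67, B: -30.0  ->  Row plays C the most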

  12. Figure 2: Likelihood of convergence to the optimal equilibrium as a function of penalty k (100 trials).
