Multi-agent learning: Multi-agent reinforcement learning

Gerard Vreeswijk, Intelligent Systems Group, Computer Science Department, Faculty of Sciences, Utrecht University, The Netherlands.

Slides last processed on Thursday 25th March, 2010 at 17:32h.
Research questions

1. Are there differences between
   (a) Independent Learners (IL): agents that attempt to learn
       i. the values of single actions (single-action RL);
   (b) Joint Action Learners (JAL): agents that attempt to learn both
       i. the values of joint actions (multi-action RL), and
       ii. the behaviour employed by other agents (fictitious play)?
2. Are RL algorithms guaranteed to converge in multi-agent settings? If so, do they converge to equilibria? Are these equilibria optimal?
3. How are rates of convergence and limit points influenced by the system structure and action selection strategies?

Claus and Boutilier address some of these questions in a limited setting, namely, a repeated cooperative two-player multiple-action game in strategic form.
Cited work

Claus and Boutilier (1998). "The Dynamics of Reinforcement Learning in Cooperative Multiagent Systems". In: Proc. of the Fifteenth National Conf. on Artificial Intelligence, pp. 746-752. The paper on which this presentation is mostly based.

Watkins and Dayan (1992). "Q-learning". Machine Learning, Vol. 8, pp. 279-292. Mainly the result that Q-learning converges to the optimum action-values with probability one as long as all actions are repeatedly sampled in all states and the action-values are represented discretely.

Fudenberg and Kreps (1993). "Learning Mixed Equilibria". Games and Economic Behavior, Vol. 5, pp. 320-367. Mainly Proposition 6.1 and its proof, pp. 342-344.
Q-learning

• The general version of Q-learning is multi-state and amounts to continuously updating the various Q(s, a) with

    Q_new(s, a) = (1 − λ)·Q_old(s, a) + λ·[ r(s, a, s′) + γ·max_a Q(s′, a) ]   (1)

• In the present setting there is only one state (namely, the stage game G), so that the target in (1) reduces to r(s, a, s), which may be abbreviated to r(a), or even r.

• The single-state reinforcement learning rule therefore becomes

    Q_new(a) = (1 − λ)·Q_old(a) + λ·r

• Two sufficient conditions for convergence of Q-learning (Watkins and Dayan, 1992):
  1. The learning rate λ decreases through time such that ∑_t λ_t is divergent and ∑_t λ_t² is convergent.
  2. All actions are sampled infinitely often.
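As a quick illustration (not from the slides), here is a minimal Python sketch of the single-state update rule with a per-step learning rate λ_t = 1/t, which satisfies the two Watkins–Dayan conditions; the noisy reward model is an invented placeholder.

```python
import random

def q_update(q_old, reward, lam):
    """Single-state Q-learning update: Q_new = (1 - lam) * Q_old + lam * r."""
    return (1 - lam) * q_old + lam * reward

# Toy usage: estimate the value of one action whose reward is noisy around 10.
q, count = 0.0, 0
for t in range(1000):
    r = 10 + random.gauss(0, 1)   # hypothetical noisy reward signal
    count += 1
    lam = 1.0 / count             # sum of lam diverges, sum of lam^2 converges
    q = q_update(q, r, lam)
print(round(q, 2))                # close to 10
```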
Exploitive vs. non-exploitive exploration

Convergence of Q-learning does not depend on the exploration strategy used. (It is just that all actions must be sampled infinitely often.)

Non-exploitive exploration. This is like what happens in the ε-part of ε-greedy learning.

Exploitive exploration. Even during exploration, there is a probabilistic bias towards exploring optimal actions. Example: Boltzmann exploration (a.k.a. softmax, mixed logit, or quantal response function), which selects action a with probability

    e^{Q(a)/T} / ∑_{a′} e^{Q(a′)/T},    with temperature T > 0.

Letting T → 0 sufficiently slowly keeps convergence conditions 1 and 2 above satisfied (Watkins and Dayan, 1992).
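A small Python sketch of Boltzmann (softmax) action selection, assuming the Q-values are stored in a dict; the numerical-stability shift and the example Q-values are my own additions.

```python
import math
import random

def boltzmann_action(q_values, temperature):
    """Sample an action with probability proportional to exp(Q(a)/T)."""
    # Subtract the max Q-value for numerical stability; the distribution is unchanged.
    m = max(q_values.values())
    weights = {a: math.exp((q - m) / temperature) for a, q in q_values.items()}
    total = sum(weights.values())
    r, cum = random.random(), 0.0
    for a, w in weights.items():
        cum += w / total
        if r <= cum:
            return a
    return a  # fallback for floating-point edge cases

# Usage: a high temperature gives near-uniform exploration, a low temperature
# concentrates the choice on the greedy action.
q = {"L": 4.0, "R": 6.0}
print(boltzmann_action(q, temperature=5.0), boltzmann_action(q, temperature=0.1))
```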
Independent Learning (IL)

• A MARL algorithm is an independent learner (IL) algorithm if the agents learn Q-values for their individual actions.
• Experiences for agent i take the form ⟨a_i, r(a_i)⟩, where a_i is the action performed by i and r(a_i) is the reward for action a_i.
• Learning is based on

    Q_new(a) = (1 − λ)·Q_old(a) + λ·r(a)

  ILs perform their actions, obtain a reward and update their Q-values without regard to the actions performed by other agents.
• Typical conditions for Independent Learning:
  – An agent is unaware of the existence of other agents.
  – It cannot identify other agents' actions, or has no reason to believe that other agents are acting strategically.

Of course, even if an agent can learn through joint actions, it may still choose to ignore information about the other agents' behaviour.
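A minimal sketch of an independent learner, assuming a fixed learning rate and a dict of own-action Q-values; the class and method names are invented for illustration.

```python
class IndependentLearner:
    """Keeps one Q-value per individual action; ignores what other agents do."""

    def __init__(self, actions, lam=0.1):
        self.q = {a: 0.0 for a in actions}
        self.lam = lam  # fixed learning rate, for simplicity

    def update(self, action, reward):
        # Q_new(a) = (1 - lam) * Q_old(a) + lam * r(a)
        self.q[action] = (1 - self.lam) * self.q[action] + self.lam * reward

# Usage: the agent only ever sees (own action, reward) pairs.
agent = IndependentLearner(["T", "B"])
agent.update("T", 10)
agent.update("B", 0)
print(agent.q)
```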
Joint-Action Learning (JAL)

• Joint Q-values are estimated: rewards for joint actions. For a 2 × 2 game an agent would have to maintain Q(T, L), Q(T, R), Q(B, L), and Q(B, R).
• Row can only influence T, B but not the opponent's actions L, R.
• Let a_i be an action of player i. A complementary joint action profile is a profile of the other players' actions a_{−i} such that a = a_i ∪ a_{−i} is a complete joint action profile.
• The opponents' actions can be forecast by, e.g., fictitious play:

    f_i(a_{−i}) =_Def Π_{j ≠ i} φ_j(a_{−i})

  where φ_j(a_{−i}) is i's empirical distribution over j's actions, evaluated at j's action in a_{−i}.
• The expected value of an individual action is the sum of joint Q-values, weighted by the estimated probability of the associated complementary joint action profiles:

    EV(a_i) = ∑_{a_{−i} ∈ A_{−i}} Q(a_i ∪ a_{−i}) · f_i(a_{−i})
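A two-player Python sketch of the JAL's bookkeeping: joint Q-values plus fictitious-play counts over the opponent's actions, combined into EV(a_i). The class name, the fixed learning rate, and the method interface are assumptions made for illustration.

```python
from collections import defaultdict

class JointActionLearner:
    """Two-player sketch: joint Q-values plus fictitious-play beliefs
    about the opponent (empirical frequencies of its past actions)."""

    def __init__(self, my_actions, opp_actions, lam=0.1):
        self.q = {(a, b): 0.0 for a in my_actions for b in opp_actions}
        self.opp_counts = defaultdict(int)   # empirical distribution phi
        self.opp_actions = list(opp_actions)
        self.lam = lam

    def update(self, my_action, opp_action, reward):
        key = (my_action, opp_action)
        self.q[key] = (1 - self.lam) * self.q[key] + self.lam * reward
        self.opp_counts[opp_action] += 1

    def expected_value(self, my_action):
        # EV(a_i) = sum over a_-i of Q(a_i, a_-i) * f_i(a_-i)
        total = sum(self.opp_counts.values())
        if total == 0:
            return 0.0
        return sum(self.q[(my_action, b)] * self.opp_counts[b] / total
                   for b in self.opp_actions)

# Usage: after seeing the opponent play L once, EV(T) reflects Q(T, L) only.
jal = JointActionLearner(["T", "B"], ["L", "R"])
jal.update("T", "L", 10)
print(jal.expected_value("T"), jal.expected_value("B"))
```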
Comparing Independent and Joint-Action Learners

Case 1: the coordination game

         L    R
    T   10    0
    B    0   10

• A JAL is able to distinguish the Q-values of different joint actions a = a_i ∪ a_{−i}.
• However, its ability to use this information is circumscribed by the limited freedom of its own actions a_i ∈ A_i.
• A JAL maintains beliefs f_i(a_{−i}) about the strategy being played by the other agents through fictitious play, and plays a softmax best response.
• A JAL computes single Q-values by means of explicit belief distributions over joint Q-values. Thus,

    EV(a_i) = ∑_{a_{−i} ∈ A_{−i}} Q(a_i ∪ a_{−i}) · f_i(a_{−i})

  is more or less the same as the Q-values learned by ILs.
• Thus, even though a JAL may be fairly sure of the relative Q-values of its joint actions, it seems it cannot really benefit from this.
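A tiny worked example of this point, with an invented belief and idealised joint Q-values: the JAL's expected values collapse to exactly the weighted averages an IL's Q-values drift toward.

```python
# Coordination game joint Q-values (assuming they have been learned exactly).
q = {("T", "L"): 10.0, ("T", "R"): 0.0, ("B", "L"): 0.0, ("B", "R"): 10.0}
belief = {"L": 0.6, "R": 0.4}   # hypothetical fictitious-play belief about Column

ev = {a: sum(q[(a, b)] * p for b, p in belief.items()) for a in ("T", "B")}
print(ev)   # {'T': 6.0, 'B': 4.0}: the same weighted averages an IL's Q-values
            # converge to when Column plays L about 60% of the time
```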
[Figure 1: Convergence of coordination for ILs and JALs (averaged over 100 trials).]
Case 2: the penalty game

         L    M    R
    T   10    0    k
    C    0    2    0
    B    k    0   10

Suppose penalty k = −100. The following stories are entirely symmetrical for Row and Column.

IL   1. Initially, Column explores.
     2. Therefore, Row will find T and B on average very unattractive, and will converge to C.
     3. Therefore, Column will find T and B slightly less attractive, and will converge to C as well.

JAL  1. Initially, Column explores.
     2. Therefore, Row gives low EV to T and B, and plays C the most.
     3. Convergence to (C, M).
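For concreteness, a toy simulation of the IL story above, with the usual caveats: it uses ε-greedy exploration with a decaying ε rather than the Boltzmann scheme used by Claus and Boutilier, all constants are arbitrary, and it is only meant to show the drift towards the safe but suboptimal (C, M) pair.

```python
import random

# Penalty game: rows are Row's actions, columns are Column's. Common payoff.
K = -100
PAYOFF = {("T", "L"): 10, ("T", "M"): 0, ("T", "R"): K,
          ("C", "L"): 0,  ("C", "M"): 2, ("C", "R"): 0,
          ("B", "L"): K,  ("B", "M"): 0, ("B", "R"): 10}
ROW_ACTIONS, COL_ACTIONS = ("T", "C", "B"), ("L", "M", "R")

def eps_greedy(q, eps):
    """Explore uniformly with probability eps, otherwise pick the greedy action."""
    if random.random() < eps:
        return random.choice(list(q))
    return max(q, key=q.get)

def run(steps=2000, lam=0.1):
    q_row = {a: 0.0 for a in ROW_ACTIONS}
    q_col = {a: 0.0 for a in COL_ACTIONS}
    for t in range(steps):
        eps = max(0.05, 0.998 ** t)          # heavy exploration early on
        a_r, a_c = eps_greedy(q_row, eps), eps_greedy(q_col, eps)
        r = PAYOFF[(a_r, a_c)]               # both agents receive the same reward
        q_row[a_r] = (1 - lam) * q_row[a_r] + lam * r
        q_col[a_c] = (1 - lam) * q_col[a_c] + lam * r
    return max(q_row, key=q_row.get), max(q_col, key=q_col.get)

# Early exploration makes T and B (and L and R) look very unattractive, so the
# greedy pair usually ends up at (C, M), though individual runs can vary.
print(run())
```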
[Figure 2: Likelihood of convergence to the optimal equilibrium as a function of penalty k (100 trials).]