The Size of Message Set Needed for the Optimal Communication Policy
Tatsuya Kasai, Hayato Kobayashi, and Ayumi Shinohara
Graduate School of Information Sciences, Tohoku University, Japan
The 7th European Workshop on Multi-Agent Systems (EUMAS 2009), Ayia Napa, Cyprus, Dec 17-18, 2009
Background
Multi-agent coordination with communication.
Main objective: to find the optimal action policy $\delta^A$ and communication policy $\delta^M$. We are interested in an approach based on autonomous learning.
Definition of the policies for agent $i$ in our proposed methods:
- Signal Learning (SL) [Kasai+ 08]: action policy $\delta_i^A : \Omega_i \times M_i^{\mathrm{receive}} \to A_i$; communication policy $\delta_i^M : \Omega_i \to M_i$.
- Signal Learning with Messages (SLM) [Kasai+ AAMAS09]: action policy $\delta_i^A : \Omega_i \times M_i^{\mathrm{receive}} \to A_i$; communication policy $\delta_i^M : \Omega_i \times M_i^{\mathrm{receive}} \to M_i$.
Here $\Omega_i$ is the set of observations, $A_i$ the set of actions, $M_i^{\mathrm{receive}}$ the set of received messages, and $M_i$ the set of messages to send to the other agents. (SL and SLM are based on the multi-agent reinforcement learning framework.)
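As a quick aside, the two policy shapes can be written down as type signatures; a minimal sketch in Python (the alias names are ours, not from the paper):

```python
from typing import Callable, TypeVar

Obs = TypeVar("Obs")        # an element of Omega_i (agent i's observations)
Act = TypeVar("Act")        # an element of A_i (agent i's actions)
MsgIn = TypeVar("MsgIn")    # an element of M_i^receive (received messages)
MsgOut = TypeVar("MsgOut")  # an element of M_i (messages to send)

# Both methods condition the action on the received message:
#   delta_i^A : Omega_i x M_i^receive -> A_i
ActionPolicy = Callable[[Obs, MsgIn], Act]

# SL chooses the outgoing message from the observation alone:
#   delta_i^M : Omega_i -> M_i
SLCommPolicy = Callable[[Obs], MsgOut]

# SLM also conditions the outgoing message on the received message:
#   delta_i^M : Omega_i x M_i^receive -> M_i
SLMCommPolicy = Callable[[Obs, MsgIn], MsgOut]
```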
Motivation
Actual learning results of SL and SLM [Kasai+ AAMAS09]: the performance of cooperation improves as the size of $M_i$ increases, for both forms of policies (SL, with $\delta_i^A : \Omega_i \times M_i^{\mathrm{receive}} \to A_i$ and $\delta_i^M : \Omega_i \to M_i$, and SLM, with $\delta_i^M : \Omega_i \times M_i^{\mathrm{receive}} \to M_i$).
[Figure: performance of cooperation (bad to good) plotted against the size of $M_i$.]
This raises the question: how large must $M_i$ be to construct the optimal policy?
Scheme of talk
We show the minimum required sizes $|M_i|$ for achieving the optimal policy for:
- Signal Learning on jointly fully observable Dec-POMDP-Com
- Signal Learning with Messages on deterministic Dec-POMDP-Com
[Figure: Venn diagram of Dec-POMDP-Com with its jointly fully observable and deterministic subclasses.]
Outline
- Background
- Scheme of talk
- Review: Dec-POMDP-Com [Goldman+ 04]
- Constrained models
  - Jointly fully observable Dec-POMDP-Com [Goldman+ 04]
  - Deterministic Dec-POMDP-Com (which we define)
- Theoretical analysis
- Conclusion
Dec-POMDP-Com [Goldman+ 04] (1/3)
Dec-POMDP-Com (Decentralized Partially Observable Markov Decision Process with Communication): a decentralized multi-agent system in which agents can communicate with each other and observe only restricted information.
Example: two agents must get a treasure cooperatively. The treasure is locked, and both agents must reach it at the same time to open the lock.
Dec-POMDP-Com [Goldman+ 04] (2/3)
Formulation: Dec-POMDP-Com := $\langle I, S, \Omega, A, M, C, P, O, R, T \rangle$.
One step for agent $i$ in a Dec-POMDP-Com:
1. Receive an observation $o_i$ from the environment.
2. Send a message $m_i$ to the other agents.
3. Perform an action $a_i$ in the environment.
This repeats until both agents arrive at the treasure.
[Figure: the treasure field with restricted sights $O_1, O_2$, messages $m_1, m_2$, and actions $a_1$ = move right, $a_2$ = move up.]
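To make the interaction protocol concrete, here is a minimal simulation-loop sketch in Python; `env`, `agents`, and their methods are illustrative assumptions, not an API from the paper:

```python
def run_episode(env, agents, horizon):
    """One episode of the Dec-POMDP-Com protocol: each step, every
    agent receives an observation, sends a message, then acts."""
    observations = env.reset()  # one observation o_i per agent
    for t in range(horizon):    # the time horizon T bounds the episode
        # Step 2: each agent i sends a message m_i to the others.
        messages = [agent.send(o) for agent, o in zip(agents, observations)]
        # Step 3: each agent i performs an action a_i, chosen from its
        # observation o_i and the messages received from the other agents.
        actions = [
            agent.act(o, [m for j, m in enumerate(messages) if j != i])
            for i, (agent, o) in enumerate(zip(agents, observations))
        ]
        # Step 1 (of the next step): the environment transitions and
        # emits fresh observations.
        observations, reward, done = env.step(actions)
        if done:  # e.g., both agents reached the treasure simultaneously
            break
```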
Dec-POMDP-Com [Goldman+ 04] (3/3)
Formulation: Dec-POMDP-Com := $\langle I, S, \Omega, A, M, C, P, O, R, T \rangle$, where:
- $I$: a set of agents' indices, e.g., $I = \{1, 2\}$.
- $S$: a set of global states, e.g., $s = (\text{position of agent 1}, \text{position of agent 2}, \text{position of treasure}) \in S$.
- $\Omega$: a set of joint observations, $\Omega = \Omega_1 \times \Omega_2$, where $\Omega_i$ is the set of observations for agent $i$.
- $A$: a set of joint actions, $A = A_1 \times A_2$.
- $M$: a set of joint messages, $M = M_1 \times M_2$.
- $C : M \to \mathbb{R}$: a cost function; $C(m)$ represents the total cost of transmitting the messages sent by all agents.
- $P$: a transition probability function, $P(s, a, s')$.
- $O$: an observation probability function, $O(s, a, s', o)$.
- $R$: a reward function, e.g., rewarding the agents for obtaining the treasure.
- $T$: a time horizon.
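As a summary of the tuple, a sketch of the model as a plain container in Python; the concrete types, and in particular the signature assumed for $R$, are illustrative assumptions, since the slides define the components abstractly:

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

State = Tuple     # a global state s in S
JointObs = Tuple  # a joint observation o in Omega = Omega_1 x Omega_2
JointAct = Tuple  # a joint action a in A = A_1 x A_2
JointMsg = Tuple  # a joint message m in M = M_1 x M_2

@dataclass
class DecPOMDPCom:
    I: Sequence[int]                              # agents' indices
    S: Sequence[State]                            # global states
    Omega: Sequence[JointObs]                     # joint observations
    A: Sequence[JointAct]                         # joint actions
    M: Sequence[JointMsg]                         # joint messages
    C: Callable[[JointMsg], float]                # message cost C(m)
    P: Callable[[State, JointAct, State], float]  # transition prob. P(s, a, s')
    O: Callable[[State, JointAct, State, JointObs], float]  # obs. prob. O(s, a, s', o)
    R: Callable[[State, JointAct, State], float]  # reward function (signature assumed)
    T: int                                        # time horizon
```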
Outline
- Background
- Scheme of talk
- Review: Dec-POMDP-Com [Goldman+ 04]
- Constrained models
  - Jointly fully observable Dec-POMDP-Com [Goldman+ 04]
  - Deterministic Dec-POMDP-Com (which we define)
- Theoretical analysis
- Conclusion
Jointly fully observable Dec-POMDP-Com [Goldman+ 04]
A Dec-POMDP-Com in which the combination of the agents' observations determines the global state: in the treasure field, $o_1 + o_2 =$ the global state (that is, the system is jointly fully observable).
[Figure: the field split into the two agents' sights $O_1$ and $O_2$.]
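One standard way to state the condition formally (our phrasing, following the Dec-POMDP literature):

```latex
% Joint full observability: whenever a joint observation o can occur
% on arriving at state s', that observation determines s' uniquely.
\forall s, s' \in S,\ \forall a \in A,\ \forall o \in \Omega:\quad
  O(s, a, s', o) > 0 \;\Longrightarrow\; \Pr(s' \mid o) = 1 .
% Equivalently, there is a mapping J : \Omega \to S with J(o) = s'
% whenever o can be observed on arriving at s'.
```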
Deterministic Dec-POMDP-Com
The model in which $P$ and $O$ in the definition Dec-POMDP-Com := $\langle I, S, \Omega, A, M, C, P, O, R, T \rangle$ are constrained.
Deterministic Dec-POMDP-Com
Restriction 1 (deterministic transitions): for any state $s \in S$ and any joint action $a \in A$, there exists a state $s' \in S$ such that $P(s, a, s') = 1$.
Whereas $P$ is in general a transition probability function (e.g., $P(s, a, s_1') = 0.1, \ldots, P(s, a, s_n') = 0.4$ over $n = |S|$ possible successors), under Restriction 1 the next global state is decided uniquely.
Deterministic Dec-POMDP-Com
Restriction 2 (deterministic observations): for any states $s, s' \in S$ and any joint action $a \in A$, there exists a joint observation $o \in \Omega$ such that $O(s, a, s', o) = 1$.
Whereas $O$ is in general an observation probability function (e.g., $O(s, a, s', o_1) = 0.1, \ldots, O(s, a, s', o_n) = 0.3$ over $n = |\Omega|$ possible observations), under Restriction 2 the current observation is decided uniquely.
Deterministic Dec-POMDP-Com
When a Dec-POMDP-Com satisfies both Restriction 1 (the next global state is decided uniquely) and Restriction 2 (the current observation is decided uniquely), we call it a deterministic Dec-POMDP-Com.
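For an explicit, finite model the two restrictions can be checked directly; a sketch assuming a `DecPOMDPCom`-like object with enumerable `S`, `A`, and `Omega` (names as in the container sketch above, not an API from the paper):

```python
def is_deterministic(model, eps=1e-9):
    """Check Restrictions 1 and 2 on an explicitly enumerated model."""
    # Restriction 1: for every (s, a), some s' has P(s, a, s') = 1.
    for s in model.S:
        for a in model.A:
            if not any(abs(model.P(s, a, s2) - 1.0) < eps for s2 in model.S):
                return False
    # Restriction 2: for every (s, a, s'), some o has O(s, a, s', o) = 1.
    for s in model.S:
        for a in model.A:
            for s2 in model.S:
                if not any(abs(model.O(s, a, s2, o) - 1.0) < eps
                           for o in model.Omega):
                    return False
    return True
```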
Outline
- Background
- Scheme of talk
- Review: Dec-POMDP-Com [Goldman+ 04]
- Constrained models
  - Jointly fully observable Dec-POMDP-Com [Goldman+ 04]
  - Deterministic Dec-POMDP-Com (which we define)
- Theoretical analysis
- Conclusion
Main results
- Corollary 1: the minimum required size $|M_i|$ for Signal Learning on jointly fully observable Dec-POMDP-Com.
- Theorem 2: the minimum required size $|M_i|$ for Signal Learning with Messages on deterministic Dec-POMDP-Com.
[Figure: Venn diagram of Dec-POMDP-Com with its jointly fully observable and deterministic subclasses.]
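The quantity being bounded can be paraphrased as follows (our notation; the slides state the results informally and we do not reproduce the concrete bounds here):

```latex
% Minimum message-set size needed for optimality: the smallest |M_i|
% such that some pair of policies over M = M_1 x ... x M_n attains
% the optimal expected reward.
|M_i|_{\min} \;=\; \min\bigl\{\, |M_i| \;:\; \exists\, (\delta^A, \delta^M)
  \text{ over } M = M_1 \times \cdots \times M_n
  \text{ achieving the optimal value} \,\bigr\}
```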