D++: Structural Credit Assignment in Tightly Coupled Multiagent Domains
Aida Rahmattalabi, Jen Jen Chung, Kagan Tumer
Autonomous Agents and Distributed Intelligence Lab, OSU Robotics
Problem Definition
[Diagram: a team of agents and the resulting team performance]
Loosely Coupled vs. Tightly Coupled Agents
Loose coupling:
• The task consists of many single-robot tasks
• Each robot uses/requires little knowledge of the other robots to accomplish the task
Tight coupling:
• Multiple robots are required to achieve the task
• The robots are mutually dependent on each other's performance
• The objective function is inherently non-smooth
Learning is Challenging in Tightly Coupled Tasks:
The probability of SUFFICIENT agents, picking the RIGHT ACTION, at the RIGHT TIME is LOW.
How can we devise agent-specific evaluation functions to reward the stepping-stone actions?
Difference Evaluation Function (Agogino and Tumer, 2004)
– Measures an individual agent's contribution to the global team performance
– Removes an agent and replaces it with a "counterfactual" agent
D_i(z) = G(z) - G(z_{-i})
where G(z) is the global system performance ("the world with me") and G(z_{-i}) is the global system performance excluding the effects of agent i ("the world without me")
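A minimal sketch of how this evaluation could be computed, assuming the environment exposes a scalar team evaluation global_reward(joint_state) and that the counterfactual is simply the agent's removal from the joint state (the function name and the dict-based joint state are illustrative assumptions, not from the slides):

```python
def difference_reward(global_reward, joint_state, agent_id):
    """Difference evaluation D_i = G(z) - G(z_-i).

    global_reward: callable mapping a joint state (dict: agent_id -> state)
                   to the scalar team performance G(z).
    joint_state:   joint state of the full team ("the world with me").
    agent_id:      agent whose contribution is being isolated.
    """
    # G(z): the world with agent i
    g_with = global_reward(joint_state)

    # G(z_-i): the world without agent i; here the counterfactual is removal,
    # but a default/null agent could be substituted instead
    without_i = {a: s for a, s in joint_state.items() if a != agent_id}
    g_without = global_reward(without_i)

    return g_with - g_without
```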
D++: An Extension to the Difference Reward (D)
– The reward function evaluates the performance of a "super agent"
– It introduces n additional "counterfactual" copies of agent i
D++_i(z, n) = [G(z_{+n_i}) - G(z)] / n
where G(z_{+n_i}) is the global system performance where "multiple copies of me" are present, and G(z) is the global system performance
– Provides agents with a stronger feedback signal
– Rewards the stepping stones that lead to achieving the system objective
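Under the same illustrative conventions, a sketch of the D++ evaluation for a given number of counterfactual copies n (the copy-naming scheme is a placeholder, not the authors' implementation):

```python
def dpp_reward(global_reward, joint_state, agent_id, n):
    """D++_i(n) = (G(z with n extra copies of agent i) - G(z)) / n."""
    g_base = global_reward(joint_state)

    # Add n counterfactual copies of agent i ("multiple copies of me")
    with_copies = dict(joint_state)
    for c in range(n):
        with_copies[f"{agent_id}_copy{c}"] = joint_state[agent_id]
    g_super = global_reward(with_copies)

    # Normalize by the number of counterfactual agents added
    return (g_super - g_base) / n
```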
Example:
D++: An Extension to the Difference Reward (D)
• How many counterfactual agents should be added? Search over different numbers of counterfactual agents until a non-zero reward is reached.
• What if a sufficient number of agents is already available? Is D++ enough? Calculate both D and D++ and choose the higher of the two.
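One plausible reading of this procedure, as a sketch rather than the authors' exact pseudocode, combines the two helpers sketched above: compute D first, add counterfactual copies one at a time until D++ becomes non-zero, then keep the larger of the two values. max_copies is an assumed cap (e.g. the remaining team size):

```python
def dpp_credit(global_reward, joint_state, agent_id, max_copies):
    """Assign credit to agent_id using D and D++ as described on the slides."""
    d = difference_reward(global_reward, joint_state, agent_id)

    # Search over increasing numbers of counterfactual copies until the
    # "super agent" produces a non-zero change in the global reward.
    dpp = 0.0
    for n in range(1, max_copies + 1):
        dpp = dpp_reward(global_reward, joint_state, agent_id, n)
        if dpp != 0.0:
            break

    # When enough agents are already present, plain D may carry the stronger
    # signal, so the agent is rewarded with the higher of the two values.
    return max(d, dpp)
```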
Cooperative Coevolutionary Algorithm (CCEA)
• Train NN policy weights via a cooperative coevolutionary algorithm (CCEA):
1. Initialize M populations of k NNs
2. Mutate each to create M populations of 2k NNs
3. Randomly select one NN from each population to create team T_i
4. Assess team performance and assign fitness to team members (credit assignment)
5. Retain the k best-performing NNs of each population and repeat from step 2
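A compact sketch of one generation of this loop, assuming M agents, k weight vectors per population, Gaussian weight mutation, and a team_fitness callable that returns one fitness value per team member (e.g. G, D, or D++); all names and the mutation scheme are illustrative assumptions:

```python
import random

def ccea_generation(populations, team_fitness, k, noise=0.1):
    """One generation of the cooperative coevolutionary loop.

    populations:  list of M lists, each holding k weight vectors (one per agent).
    team_fitness: callable(team) -> list of M fitness values (credit assignment).
    """
    # Mutate each population of k NNs to create populations of 2k NNs.
    for pop in populations:
        for net in list(pop):
            pop.append([w + random.gauss(0.0, noise) for w in net])

    # Form 2k random teams, one NN from each population per team.
    for pop in populations:
        random.shuffle(pop)
    fitness = [[0.0] * (2 * k) for _ in populations]
    for t in range(2 * k):
        team = [pop[t] for pop in populations]
        scores = team_fitness(team)          # assess team, assign credit
        for m, s in enumerate(scores):
            fitness[m][t] = s

    # Retain the k best-performing NNs of each population.
    for m, pop in enumerate(populations):
        ranked = sorted(range(2 * k), key=lambda t: fitness[m][t], reverse=True)
        populations[m] = [pop[t] for t in ranked[:k]]
    return populations
```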
Domain: Multi-robot Exploration
• Neural-network controllers
– NN state vector [s_1, s_2], computed per quadrant q around robot i:
s_{1,q,i} = \sum_{j \in I_q} V_j / d(L_j, L_i)   (value-weighted density of POIs in quadrant q)
s_{2,q,i} = \sum_{i' \in N_q} 1 / d(L_{i'}, L_i)   (density of other robots in quadrant q)
– Control actions: [dx, dy]
• Team observation reward:
G = \sum_i \sum_j \sum_k N_{i,j} N_{i,k} V_i / ((1/2)(d_{i,j} + d_{i,k}))
where N_{i,j} indicates that robot j observes POI i, V_i is the value of POI i, and d_{i,j} is the distance from robot j to POI i
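Read concretely, the team observation reward can be sketched as below; the observation radius and the use of unordered robot pairs are assumptions, and the tighter coupling levels used in the experiments (3 or 6 simultaneous observations) would add a minimum-observer requirement on top of this:

```python
import math

def team_observation_reward(poi_positions, poi_values, robot_positions,
                            obs_radius=4.0):
    """Sketch of G = sum_i sum_{j,k} N_ij * N_ik * V_i / (0.5 * (d_ij + d_ik)).

    A POI i contributes through every pair of distinct robots (j, k) that both
    lie within obs_radius of it (N_ij = N_ik = 1), weighted by the POI value
    V_i and the pair's average distance. obs_radius is an assumed parameter.
    """
    g = 0.0
    for (px, py), value in zip(poi_positions, poi_values):
        # Distances from POI i to every robot; keep only those close enough
        dists = [math.hypot(px - rx, py - ry) for rx, ry in robot_positions]
        observers = [d for d in dists if d <= obs_radius]  # N_ij = 1

        # For tighter coupling, this contribution would additionally be gated
        # on len(observers) >= required_observations (e.g. 3 or 6).
        for j in range(len(observers)):
            for k in range(j + 1, len(observers)):
                g += value / (0.5 * (observers[j] + observers[k]))
    return g
```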
Experiments:
Number of robots | Number of POIs | Type          | Required observations
12               | 10             | Homogeneous   | 3
12               | 10             | Homogeneous   | 6
9                | 15             | Heterogeneous | [1,1,1]
9                | 15             | Heterogeneous | [3,1,1]
Homogeneous Agents: Number of observations = 3
Homogeneous Agents: Learned Policies of D++ learners
Homogeneous Agents: Number of observations = 6
Heterogeneous Agents: Number of observations = [1, 1, 1]
[Plot: team performance G(z) vs. calls to G for G, D, and D++ learners]
Heterogeneous Agents: Learned Policies of D++ learners
[Plot: learned robot trajectories in the X-Y plane]
Heterogeneous Agents: Number of observations = [3, 1, 1]
[Plot: team performance G(z) vs. calls to G for G, D, and D++ learners]
Conclusion
• D++ is a new reward structure for tightly coupled multiagent domains
• D++ outperforms both G and D by rewarding the stepping-stone actions required for long-term success
• Robot heterogeneity and tighter coupling challenge G and D learners, while D++ learners can still learn high-reward policies