Multi-Agent Adversarial Inverse Reinforcement Learning
Lantao Yu, Jiaming Song, Stefano Ermon
Department of Computer Science, Stanford University
Contact: lantaoyu@cs.stanford.edu
<latexit sha1_base64="AKws76aOIQ/vlCa0QNvQLXdwHRA=">ACL3icbVBNSyNBFOzxe6OrUY+5NIYFhSXMxIMeVgiI4lHBGCEzDm86PUmT7pmh+81iGHLx93jx4g/xEsRFvPov7CQe/NiChuq9+iuijIpDLruozMzOze/sLj0o7S8nN1rby+cWHSXDPeZKlM9WUEhkuR8CYKlPwy0xUJHkr6h+O/dZfro1Ik3McZDxQ0E1ELBiglcLysa/gOvQzQS3BXhQVR8PpVfIY27JVjgTe8Oqd+F5SCK6R624T4m0KIO74W3R4GYbnq1twJ6HfivZNqo96qPN+M7k/D8oPfSVmueIJMgjFtz80wKECjYJIPS35ueAasD13etjQBxU1QTPIO6S+rdGicansSpBP140YBypiBiuzkOJT56o3F/3ntHOP9oBJliNP2PShOJcUzouj3aE5gzlwBJgWti/UtYDQxtxSVbgvc18ndyUa95u7X6mVdt/CFTLJEK2SLbxCN7pEFOyClpEkZuyQN5Iv+cO2fkPDsv09EZ531nk3yC8/oGl/mtEw=</latexit> Motivation • By definition, the performance of RL agents heavily relies on the quality of reward functions. hP T i t =1 γ t r ( s t , a t ) max π E π Computer Games Multi-Agent System Dialogue • In many real-world scenarios, especially in multi-agent settings, hand-tuning informative reward functions can be very challenging. • Solution: learning from expert demonstrations!
<latexit sha1_base64="mSOGy5SbNzZRqB+RiQ+OplYS+yA=">AB7XicbVBNSwMxEJ2tX7V+VT16CRahgpTdVrDHghePFewHtEvJptk2NpsSVYoS/+DFw+KePX/ePfmLZ70NYHA4/3ZpiZF8ScaeO6305uY3Nreye/W9jbPzg8Kh6ftLVMFKEtIrlU3QBrypmgLcMp91YURwFnHaCye3c7zxRpZkUD2YaUz/CI8FCRrCxUluV9RW+HBRLbsVdAK0TLyMlyNAcFL/6Q0mSiApDONa657mx8VOsDCOczgr9RNMYkwke0Z6lAkdU+ni2hm6sMoQhVLZEgYt1N8TKY60nkaB7YywGetVby7+5/USE9b9lIk4MVSQ5aIw4chINH8dDZmixPCpJZgoZm9FZIwVJsYGVLAheKsvr5N2teLVKtX761KjnsWRhzM4hzJ4cAMNuIMmtIDAIzDK7w50nlx3p2PZWvOyWZO4Q+czx+X0o5z</latexit> <latexit sha1_base64="oLJVDvbL3g/9lhUVNX+FRAp9xk0=">ACJXicbVDLSsNAFJ34rPVdelmsAjVRUmqoIsKBSm4rGAf0MQwmU7boZNJmJmIJeZn3PgrblxYRHDlrzhJu9DWAwNnzrmXe+/xQkalMs0vY2l5ZXVtPbeR39za3tkt7O23ZBAJTJo4YIHoeEgSRjlpKqoY6YSCIN9jpO2NrlO/UCEpAG/U+OQOD4acNqnGCktuYWqHdL7U3gFbR89urH+QZtyaDdokpq6HlxPckMt50bRYMoOYl9CRPHLdQNMtmBrhIrBkpghkabmFi9wIc+YQrzJCUXcsMlRMjoShmJMnbkSQhwiM0IF1NOfKJdOLsygQea6UH+4HQjyuYqb87YuRLOfY9XZkuLue9VPzP60aqf+nElIeRIhxPB/UjBlUA08hgjwqCFRtrgrCgeleIh0grHSweR2CNX/yImlVytZuXJ7XqxVZ3HkwCE4AiVgQtQAzegAZoAg2fwCt7BxHgx3owP43NaumTMeg7AHxjfP2AHpJQ=</latexit> <latexit sha1_base64="lADpjtdH09HXKxCgQmHRrYn7+4=">AB73icbVDLSgNBEOz1GeMr6tHLYBDiJexGwRwDXjxGMA9IltA7mSRDZmfXmVkhrPkJLx4U8ervePNvnCR70MSChqKqm+6uIBZcG9f9dtbWNza3tnM7+d29/YPDwtFxU0eJoqxBIxGpdoCaCS5Zw3AjWDtWDMNAsFYwvpn5rUemNI/kvZnEzA9xKPmAUzRWandjXsInfdErFN2yOwdZJV5GipCh3it8dfsRTUImDRWodcdzY+OnqAyngk3z3USzGOkYh6xjqcSQaT+d3zsl51bpk0GkbElD5urviRDrSdhYDtDNCO97M3E/7xOYgZVP+UyTgyTdLFokAhiIjJ7nvS5YtSIiSVIFbe3EjpChdTYiPI2BG/5VXSrJS9y3Ll7qpYq2Zx5OAUzqAEHlxDW6hDg2gIOAZXuHNeXBenHfnY9G65mQzJ/AHzucPiA2Pmg=</latexit> <latexit sha1_base64="dJnK71NR3o5vXieC5lAfNvyDvE=">AB8XicbVBNSwMxEJ2tX7V+VT16CRahXspuFeyxILHCvYD26Vk02wbmk2WJCuUtf/CiwdFvPpvPlvTNs9aOuDgcd7M8zMC2LOtHdbye3tr6xuZXfLuzs7u0fFA+PWlomitAmkVyqToA15UzQpmG06sKI4CTtvB+Hrmtx+p0kyKezOJqR/hoWAhI9hY6aEXs/5NGT/p836x5FbcOdAq8TJSgyNfvGrN5AkiagwhGOtu54bGz/FyjDC6bTQSzSNMRnjIe1aKnBEtZ/OL56iM6sMUCiVLWHQXP09keJI60kU2M4Im5Fe9mbif143MWHNT5mIE0MFWSwKE46MRLP30YApSgyfWIKJYvZWREZYWJsSAUbgrf8ipVSveRaV6d1mq17I48nACp1AGD6gDrfQgCYQEPAMr/DmaOfFeXc+Fq05J5s5hj9wPn8A0I+QUg=</latexit> Motivation • Imitation learning does not recover reward functions. • Behavior Cloning π ∗ = max π ∈ Π E π E [log π ( a | s )] • Generative Adversarial Imitation Learning [Ho & Ermon, 2016] IRL RL π ( a | s ) π E ( a | s ) r ( s, a )
<latexit sha1_base64="lADpjtdH09HXKxCgQmHRrYn7+4=">AB73icbVDLSgNBEOz1GeMr6tHLYBDiJexGwRwDXjxGMA9IltA7mSRDZmfXmVkhrPkJLx4U8ervePNvnCR70MSChqKqm+6uIBZcG9f9dtbWNza3tnM7+d29/YPDwtFxU0eJoqxBIxGpdoCaCS5Zw3AjWDtWDMNAsFYwvpn5rUemNI/kvZnEzA9xKPmAUzRWandjXsInfdErFN2yOwdZJV5GipCh3it8dfsRTUImDRWodcdzY+OnqAyngk3z3USzGOkYh6xjqcSQaT+d3zsl51bpk0GkbElD5urviRDrSdhYDtDNCO97M3E/7xOYgZVP+UyTgyTdLFokAhiIjJ7nvS5YtSIiSVIFbe3EjpChdTYiPI2BG/5VXSrJS9y3Ll7qpYq2Zx5OAUzqAEHlxDW6hDg2gIOAZXuHNeXBenHfnY9G65mQzJ/AHzucPiA2Pmg=</latexit> <latexit sha1_base64="oLJVDvbL3g/9lhUVNX+FRAp9xk0=">ACJXicbVDLSsNAFJ34rPVdelmsAjVRUmqoIsKBSm4rGAf0MQwmU7boZNJmJmIJeZn3PgrblxYRHDlrzhJu9DWAwNnzrmXe+/xQkalMs0vY2l5ZXVtPbeR39za3tkt7O23ZBAJTJo4YIHoeEgSRjlpKqoY6YSCIN9jpO2NrlO/UCEpAG/U+OQOD4acNqnGCktuYWqHdL7U3gFbR89urH+QZtyaDdokpq6HlxPckMt50bRYMoOYl9CRPHLdQNMtmBrhIrBkpghkabmFi9wIc+YQrzJCUXcsMlRMjoShmJMnbkSQhwiM0IF1NOfKJdOLsygQea6UH+4HQjyuYqb87YuRLOfY9XZkuLue9VPzP60aqf+nElIeRIhxPB/UjBlUA08hgjwqCFRtrgrCgeleIh0grHSweR2CNX/yImlVytZuXJ7XqxVZ3HkwCE4AiVgQtQAzegAZoAg2fwCt7BxHgx3owP43NaumTMeg7AHxjfP2AHpJQ=</latexit> <latexit sha1_base64="dJnK71NR3o5vXieC5lAfNvyDvE=">AB8XicbVBNSwMxEJ2tX7V+VT16CRahXspuFeyxILHCvYD26Vk02wbmk2WJCuUtf/CiwdFvPpvPlvTNs9aOuDgcd7M8zMC2LOtHdbye3tr6xuZXfLuzs7u0fFA+PWlomitAmkVyqToA15UzQpmG06sKI4CTtvB+Hrmtx+p0kyKezOJqR/hoWAhI9hY6aEXs/5NGT/p836x5FbcOdAq8TJSgyNfvGrN5AkiagwhGOtu54bGz/FyjDC6bTQSzSNMRnjIe1aKnBEtZ/OL56iM6sMUCiVLWHQXP09keJI60kU2M4Im5Fe9mbif143MWHNT5mIE0MFWSwKE46MRLP30YApSgyfWIKJYvZWREZYWJsSAUbgrf8ipVSveRaV6d1mq17I48nACp1AGD6gDrfQgCYQEPAMr/DmaOfFeXc+Fq05J5s5hj9wPn8A0I+QUg=</latexit> <latexit sha1_base64="3q69riKgMJ7dCLeUS7TDplOABc=">AB7XicbVBNSwMxEJ2tX7V+VT16CRahgpTdVrDHghePFewHtEvJptk2NpsSVYoS/+DFw+KePX/ePfmLZ70NYHA4/3ZpiZF8ScaeO6305uY3Nreye/W9jbPzg8Kh6ftLVMFKEtIrlU3QBrypmgLcMp91YURwFnHaCye3c7zxRpZkUD2YaUz/CI8FCRrCxUjsu6yt8OSiW3Iq7AFonXkZKkKE5KH71h5IkERWGcKx1z3Nj46dYGUY4nRX6iaYxJhM8oj1LBY6o9tPFtTN0YZUhCqWyJQxaqL8nUhxpPY0C2xlhM9ar3lz8z+slJqz7KRNxYqgy0VhwpGRaP46GjJFieFTSzBRzN6KyBgrTIwNqGBD8FZfXiftasWrVar316VGPYsjD2dwDmXw4AYacAdNaAGBR3iGV3hzpPivDsfy9ack82cwh84nz+UwI5x</latexit> Motivation • Imitation learning does not recover reward functions. • Behavior Cloning π ∗ = max π ∈ Π E π E [log π ( a | s )] • Generative Adversarial Imitation Learning [Ho & Ermon, 2016] π ( a | s ) π E ( a | s ) Matching with GAN p ( s, a )
<latexit sha1_base64="yr7Og+EdZ1l80MCEkGrNsiJiA=">ACGHicbZDLSgMxFIYzXmu9V26CRahCtaZKtiNUHDjsoK9QC9DJs20sZnJkJwRy9DHcOruHGhiNvufBvTy0Jbfwj8fOcTs7vRYJrsO1va2l5ZXVtPbWR3tza3tnN7O1XtYwVZRUqhVR1j2gmeMgqwEGweqQYCTzBal7/ZlyvPTKluQzvYRCxVkC6Ifc5JWCQmzlX7VN8jXNYE+QSO+BUWi6kdRDfIansCuJmKTdsHNZO28PRFeNM7MZNFMZTczanYkjQMWAhVE64ZjR9BKiAJOBRum7FmEaF90mUNY0MSMN1KJocN8bEhHexLZV4IeEJ/TyQk0HoQeKYzINDT87Ux/K/WiMEvthIeRjGwkE4X+bHAIPE4JdzhygQhBsYQqrj5K6Y9ogFk2XahODMn7xoqoW8c5Ev3F1mS8VZHCl0iI5QDjnoCpXQLSqjCqLoGb2id/RhvVhv1qf1NW1dsmYzB+iPrNEP2UGfpg=</latexit> <latexit sha1_base64="iMdxnKiMTtlEX/9UuknRpqdtNU=">ACGHicbVC7TsMwFHXKq5RXgJHFokIqDCUpCBTEQtjEfQhNaFyXLe16jiR7SBVUT6DhV9hYQAh1m78DU4b8SgcydLxOfq3nu8kFGpLOvDyM3NLywu5ZcLK6tr6xvm5lZDBpHApI4DFoiWhyRhlJO6oqRVigI8j1Gmt7wMvWb90RIGvBbNQqJ6M+pz2KkdJSxzx0Qnp3cA4dH6kBRiy+SaCjgu9/LSl98Ytkv2MWrbI1AfxL7IwUQYZaxw73QBHPuEKMyRl27ZC5cZIKIoZSQpOJEmI8BD1SVtTjnwi3XhyWAL3tNKFvUDoxWcqD87YuRLOfI9XZnuKGe9VPzPa0eqd+bGlIeRIhxPB/UiBvXlaUqwSwXBio0QVhQvSvEAyQVjrLg7Bnj35L2lUyvZRuXJ9XKyeZHkwQ7YBSVg1NQBVegBuoAgwfwBF7Aq/FoPBtvxvu0NGdkPdvgF4zxJyEOn84=</latexit> Motivation • Why should we care reward learning? • Scientific inquiry: human and animal behavioral study, inferring intentions, etc. • Presupposition: reward function is considered to be the most succinct, robust and transferable description of the task. [Abbeel & Ng, 2014] r ∗ = (object pos − goal pos) 2 VS. π ∗ : S → P ( A ) • Re-optimizing policies in new environments, debugging and analyzing imitation learning algorithms, etc. • These properties are even more desirable in the multi-agent settings.
<latexit sha1_base64="hVwuR8Ted1t3O6ZK0IXne5ndZM=">AB73icbVDLSgNBEJz1GeMr6tHLYBA8hd0omGPAi8cI5oHJEmYnvcmQeawzs0JY8hNePCji1d/x5t84SfagiQUNRVU3V1Rwpmxv/tra1vbG5tF3aKu3v7B4elo+OWUam0KSK92JiAHOJDQtsxw6iQYiIg7taHwz89tPoA1T8t5OEgFGUoWM0qskzoP/Z4SMCT9Utmv+HPgVRLkpIxyNPqlr95A0VSAtJQTY7qBn9gwI9oymFa7KUGEkLHZAhdRyURYMJsfu8UnztlgGOlXUmL5+rviYwIYyYicp2C2JFZ9mbif143tXEtzJhMUguSLhbFKcdW4dnzeMA0UMsnjhCqmbsV0xHRhFoXUdGFECy/vEpa1UpwWaneXZXrtTyOAjpFZ+gCBega1dEtaqAmoijZ/SK3rxH78V79z4WrWtePnOC/sD7/AH1fo/i</latexit> <latexit sha1_base64="BmMsp83qz7lP2nj6Yhkur/PhWg0=">ADHnicbVJdixMxFM2MX2v92K4+hIsLrOopbMW9GVhQRZ8rNDuLjbtkEnTadhkJiR3ZMs4v8QX/4ovPigi+KT/xkw7Ld2uF0IO596Te26SWEthodP56/k3bt6fWfnbuPe/QcPd5t7j05tlhvGByTmTmPqeVSpHwAiQ/14ZTFUt+Fl+8rfJnH7mxIkv7MNd8pGiSiqlgFBwV7Xndfawjkime0IAzQ8w0SbTkGEi+RSGhAMN7Dhc8pOogKOwHPdxz5EFPA/LT3YML+gYDogRyQxGmPBLvdAGmNhcrRWmboODSoErCV5q3E4a2MU+JoperuochlkcFydlVBAtopOytiSzZMv0qvfRVZFLOQ9C4U35pquiX659BatB8Oq0l3jR6sPaUJqluYq5iZqtTruzCHwdhDVoTp6UfM3mWQsVzwFJqm1w7CjYVRQA4JXjZIbrm7ImfOhgShW3o2LxvCV+5pgJnmbGrRTwgt1UFRZO1exq6yGt9u5ivxfbpjD9M2oEKnOgads2WiaS+wev/oreCIMZyDnDlBmhPOK2YwaysD9qIa7hHB75Ovg9LAdvmofvu+2jrv1deygJ+gpClCIXqNj9A710Ax7P31fvu/fC/+N/8n/6vZanv1ZrH6Er4f/4BEzr8cQ=</latexit> Preliminaries • Single-Agent Inverse RL • Basic principle: find a reward function that explains the expert behaviors. (ill-defined) • Maximum Entropy Inverse RL (MaxEnt IRL) provides a general probabilistic framework to solve the ambiguity. • Maximum Entropy Inverse RL (MaxEnt IRL) provides a general probabilistic framework to solve the ambiguity. " T # T ! Y X η ( s 1 ) P ( s t +1 | s t , a t ) r ω ( s t , a t ) p ω ( τ ) ∝ exp t =1 t =1 " T # X r ω ( s t , a t ) max E π E [log p ω ( τ )] = E τ ∼ π E − log Z ω ω t =1 where is the partition function. Z ω