Linguistic scaffolds for policy learning (what can language do for RL?) Jacob Andreas Berkeley → Microsoft Semantic Machines → MIT
An NLPer’s view of RL: ([image], R) → memorize 1 reward fn
An NLPer’s view of RL: ([image], R_1), ([image], R_2) → memorize k reward fns [e.g. Taylor & Stone 09]
An NLPer’s view of RL: ([image], R_1) with goal (-2, 3), ([image], R_1) with goal (-2, -2) → learn to accomplish new goals! [e.g. Schaul et al. 15]
An NLPer’s view of RL: ([image], R_1) with "run northwest" for (-2, 3), ([image], R_1) with "go southwest" for (-2, -2) → learn to follow instructions!
Instructions as observations: ([image], R_1) with "run northwest" for (-2, 3); ([image], R_1) with "go southwest" for (-2, -2)
Beyond observations (1): Instructions are moves in a game, not observations of an environment. [Figure: ([image], R_1) with "run northwest" for (-2, 3); ([image], R_1) with "go southwest" for (-2, -2)]
Beyond goals (2): There’s more to language learning than instruction following! [Figure: ([image], R_1) with "run northwest", ???; ([image], R_1) with "not so fast", "go southwest"]
Language use as gameplay
Generation & understanding Turn right and walk through the kitchen. Go right into the living room and stop by the rug. [Anderson et al. 18]
A reference game [Frank & Goodman 12]
“glasses” [Frank & Goodman 12]
The rational speech acts model [Frank & Goodman 12, Degen 13]
Referents, left to right: face with glasses only, face with hat and glasses.
Literal listener: L_0(· | glasses) = (1/2, 1/2); L_0(· | hat) = (0, 1)
Speaker, S_1(glasses | ·) ∝ L_0(· | glasses): S_1(glasses | ·) = (1, 1/3); S_1(hat | ·) = (0, 2/3)
Pragmatic listener, L_1(· | glasses) ∝ S_1(glasses | ·): L_1(· | glasses) = (3/4, 1/4); L_1(· | hat) = (0, 1)
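The values on this slide can be reproduced with a few lines of Python. The following is a minimal sketch, not code from the talk: it assumes a uniform prior over the two referents, no utterance costs, and a rationality parameter of 1, and the names `LEXICON`, `normalize`, `literal_listener`, `speaker`, and `pragmatic_listener` are chosen here for illustration.

```python
from fractions import Fraction

UTTERANCES = ["glasses", "hat"]
REFERENTS = ["glasses_only", "hat_and_glasses"]

# Assumed literal semantics: 1 if the word is true of the referent, else 0.
LEXICON = {"glasses": [1, 1], "hat": [0, 1]}


def normalize(weights):
    """Scale non-negative weights so they sum to 1 (all zeros stay zero)."""
    total = sum(weights)
    return [Fraction(w) / total if total else Fraction(0) for w in weights]


def literal_listener():
    """L0(referent | utterance): uniform over referents the word is true of."""
    return {u: normalize(LEXICON[u]) for u in UTTERANCES}


def speaker(l0):
    """S1(utterance | referent), proportional to L0(referent | utterance)."""
    s1 = {u: [] for u in UTTERANCES}
    for j in range(len(REFERENTS)):
        # Normalize over utterances for one fixed referent.
        column = normalize([l0[u][j] for u in UTTERANCES])
        for u, p in zip(UTTERANCES, column):
            s1[u].append(p)
    return s1


def pragmatic_listener(s1):
    """L1(referent | utterance), proportional to S1(utterance | referent)."""
    return {u: normalize(s1[u]) for u in UTTERANCES}


L0 = literal_listener()
S1 = speaker(L0)
L1 = pragmatic_listener(S1)
print(L1["glasses"])
```

Under these assumptions, `L1["glasses"]` comes out as (3/4, 1/4): hearing "glasses", the pragmatic listener favors the face without a hat, because a speaker describing the other face would more likely have said "hat".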
Pragmatics. Q: Do you know what time it is? A: Yes. "I find his cooking very interesting." [Grice 70]
RSA game tree: the speaker moves first, choosing an utterance (hat or glasses).
RSA game tree, as speaker: speaker chooses hat or glasses, then the listener chooses a referent; each leaf pays +1 or -1.
RSA game tree, as listener: the utterance "glasses" is observed; the speaker's move and the payoffs are unknown (?).
A recipe for pragmatic language understanding
1. Train a base speaker model [Figure: faces with example descriptions: "hat & glasses", "guy with glasses", "hat man", "smiley", "glasses", "plain", "hat & glasses", "man glasses"]
2. Solve this POMDP: [Figure: game tree over utterances hat and glasses with payoffs +1 / -1]
Ronghang Hu, Volkan Cirik, Daniel Fried. Speaker-follower models for vision-and-language navigation. NeurIPS 18.
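One way to read the two-step recipe is: fit a speaker model to human descriptions, then have the listener choose the interpretation under which that speaker would most plausibly have produced the observed utterance. The sketch below illustrates this on the faces example. It is an assumption-laden toy, not the method of the cited paper: the training pairs are invented, the speaker is a simple count-based table, and `train_base_speaker` and `pragmatic_listener` are hypothetical names.

```python
from collections import Counter, defaultdict

# Hypothetical (referent, description) pairs standing in for human annotations.
TRAINING_PAIRS = [
    ("glasses_only", "glasses"),
    ("glasses_only", "glasses"),
    ("glasses_only", "man glasses"),
    ("hat_and_glasses", "hat"),
    ("hat_and_glasses", "hat"),
    ("hat_and_glasses", "glasses"),
    ("plain", "plain"),
    ("plain", "smiley"),
]


def train_base_speaker(pairs):
    """Step 1: estimate S(description | referent) from relative frequencies."""
    counts = defaultdict(Counter)
    for referent, description in pairs:
        counts[referent][description] += 1
    return {
        referent: {d: n / sum(c.values()) for d, n in c.items()}
        for referent, c in counts.items()
    }


def pragmatic_listener(speaker, utterance, candidates):
    """Step 2: posterior over referents, proportional to S(utterance | referent)."""
    scores = {r: speaker[r].get(utterance, 0.0) for r in candidates}
    total = sum(scores.values())
    if total == 0:
        # The speaker never produced this utterance: fall back to uniform.
        return {r: 1.0 / len(candidates) for r in candidates}
    return {r: s / total for r, s in scores.items()}


speaker = train_base_speaker(TRAINING_PAIRS)
posterior = pragmatic_listener(speaker, "glasses", list(speaker))
best = max(posterior, key=posterior.get)
print(best, posterior)
```

In a navigation setting the candidates would be trajectories proposed by a follower model and the speaker would be a learned sequence model, but the selection step has the same shape.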
Application: instruction following
human instruction: Go through the door on the right and continue straight. Stop in the next room in front of the bed.
(a) orange: trajectory without pragmatic inference (baseline policy)
(b) green: trajectory with pragmatic inference (reasoning)
[Figure: top-down overview of trajectories]
Application: instruction generation
seq2seq: Walk past the dining room table and chairs and wait there.
reasoning: Walk past the dining room table and chairs and take a right into the living room. Stop once you are on the rug.
human: Turn right and walk through the kitchen. Go right into the living room and stop by the rug.
Lesson Utterances are chosen to facilitate correct interpretation in context. (This makes the learning problem easier!)
Language as a scaffold for learning
What else is an instruction follower good for? Language learning, reinforcement learning. [Example instruction: "go east of the heart"] Learning with latent language. Andreas, Klein & Levine. NAACL 18.
Pretraining via language learning: f(· ; η, ) [Figure: policy π, instruction "go east of the heart", action NORTH] [Branavan et al., 09]
(Standard) reinforcement learning: L(f(· ; η, ), ·) [Figure: policy π, instruction ???, reward R]
Concept learning: L(f(· ; η, ), ·) [Figure: policy π, instruction "find the horse", actions NORTH,…, reward R, value -0.52]
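The concept-learning slide can be read as a search over instruction strings: keep the pretrained instruction follower fixed, try candidate instructions, and keep the one whose induced behavior earns the most reward. The sketch below shows that loop on a toy grid. Everything in it is assumed for illustration: the follower is a hand-written lookup table rather than a learned model, the reward is negative Manhattan distance to a hidden goal, and `FOLLOWER`, `episode_return`, and `concept_learning` are hypothetical names.

```python
ACTIONS = {"NORTH": (0, 1), "SOUTH": (0, -1), "EAST": (1, 0), "WEST": (-1, 0)}

# Stand-in for a pretrained instruction follower: instruction -> action.
# A learned follower would also condition on the current state.
FOLLOWER = {
    "go north": "NORTH",
    "go south": "SOUTH",
    "go east": "EAST",
    "go west": "WEST",
}


def follow(instruction, state):
    """Action chosen by the (toy) follower for this instruction and state."""
    return FOLLOWER[instruction]


def episode_return(instruction, goal, start=(0, 0), steps=3):
    """Roll out the follower and sum rewards (negative distance to the goal)."""
    x, y = start
    total = 0
    for _ in range(steps):
        dx, dy = ACTIONS[follow(instruction, (x, y))]
        x, y = x + dx, y + dy
        total -= abs(goal[0] - x) + abs(goal[1] - y)
    return total


def concept_learning(candidates, goal):
    """Pick the instruction whose induced policy earns the highest return."""
    returns = {w: episode_return(w, goal) for w in candidates}
    best = max(returns, key=returns.get)
    return best, returns


best_instruction, returns = concept_learning(list(FOLLOWER), goal=(0, 3))
print(best_instruction, returns)
```

The design point is that the search happens over a small set of strings rather than over policy parameters, so each candidate costs only a few rollouts to evaluate.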