Linguistic scaffolds for policy learning

Linguistic scaffolds for policy learning, Jacob Andreas, Berkeley - PowerPoint PPT Presentation


  1. Linguistic scaffolds for policy learning. Jacob Andreas, Berkeley → Microsoft Semantic Machines → MIT

  2. Linguistic scaffolds for policy learning (what can language do for RL?) Jacob Andreas, Berkeley → Microsoft Semantic Machines → MIT

  3. An NLPer’s view of RL ( , R )

  4. An NLPer’s view of RL ( , R ) memorize 1 reward fn

  5. An NLPer’s view of RL ( , R 1 ) ( , R ) ( , R 2 ) memorize k reward fns [e.g. Taylor & Stone 09]

  6. An NLPer’s view of RL ( , R 1 ) ( , R ) ( , R 2 ) Learn to accomplish new goals! ( , R 1 ) (-2, 3) ( , R 1 ) (-2, -2) [e.g. Schaul et al. 15]

  7. An NLPer’s view of RL ( , R 1 ) ( , R ) ( , R 2 ) Learn to follow instructions! ( , R 1 ) ( , R 1 ) (-2, 3) run northwest ( , R 1 ) ( , R 1 ) (-2, -2) go southwest

  8. Instructions as observations ( , R 1 ) ( , R ) ( , R 2 ) ( , R 1 ) ( , R 1 ) (-2, 3) run northwest ( , R 1 ) ( , R 1 ) (-2, -2) go southwest

  9. Instructions as observations ( , R 1 ) ( , R ) ( , R 2 ) ( , R 1 ) ( , R 1 ) (-2, 3) run northwest ( , R 1 ) ( , R 1 ) (-2, -2) go southwest
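Treating instructions as observations can be sketched as appending an encoding of the instruction to the state before it reaches the policy. The snippet below is an illustration under assumptions, not the model from the talk: the vocabulary `VOCAB`, the bag-of-words encoder, the two actions, and the hand-set `WEIGHTS` are all hypothetical.

```python
# Hypothetical vocabulary and action set, chosen to match the slide's examples.
VOCAB = ["go", "run", "northwest", "southwest"]
ACTIONS = ["NORTHWEST", "SOUTHWEST"]


def encode_instruction(text):
    """Bag-of-words count vector over VOCAB (unknown words are ignored)."""
    tokens = text.lower().split()
    return [float(tokens.count(word)) for word in VOCAB]


def augment_observation(state, instruction):
    """Append the instruction encoding to the raw state features."""
    return list(state) + encode_instruction(instruction)


def linear_policy(observation, weights):
    """Score each action with a linear map and return the best action name."""
    scores = [sum(w * x for w, x in zip(row, observation)) for row in weights]
    best = max(range(len(scores)), key=scores.__getitem__)
    return ACTIONS[best]


# Hand-set weights (an assumption, not learned): each action attends only
# to the direction word in the instruction part of the observation.
WEIGHTS = [
    [0.0, 0.0, 0.0, 0.0, 1.0, 0.0],  # NORTHWEST
    [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],  # SOUTHWEST
]

obs_nw = augment_observation([0.0, 0.0], "run northwest")
obs_sw = augment_observation([0.0, 0.0], "go southwest")
action_nw = linear_policy(obs_nw, WEIGHTS)
action_sw = linear_policy(obs_sw, WEIGHTS)
```

Under this framing the same policy function serves both tasks; only the instruction part of the input changes.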

  10. Beyond observations (1) Instructions are moves in a game, not observations of an environment. ( , R 1 ) ( , R 1 ) (-2, 3) run northwest ( , R 1 ) ( , R 1 ) (-2, -2) go southwest

  11. Beyond goals (2) There’s more to language learning than instruction following! ( , R 1 ) ( , R 1 ) run northwest ??? ( , R 1 ) ( , R 1 ) not so fast go southwest

  12. Language use as gameplay

  13. Generation & understanding Turn right and walk through the kitchen. Go right into the living room and stop by the rug. [Anderson et al. 18]

  14. A reference game [Frank & Goodman 12]

  15. “glasses” [Frank & Goodman 12]

  16. “glasses” [Frank & Goodman 12]

  17. “glasses” [Frank & Goodman 12]

  18. “glasses” [Frank & Goodman 12]

  19. The rational speech acts model. Literal listener, with one entry per candidate referent: L_0(· | glasses) = (1/2, 1/2), L_0(· | hat) = (0, 1). [Frank & Goodman 12, Degen 13]

  20. The rational speech acts model. Literal listener: L_0(· | glasses) = (1/2, 1/2), L_0(· | hat) = (0, 1). Speaker: S_1(glasses | ·) ∝ L_0(· | glasses), giving S_1(glasses | ·) = (1, 1/3) and S_1(hat | ·) = (0, 2/3). [Frank & Goodman 12, Degen 13]

  21. The rational speech acts model. Pragmatic listener: L_1(· | glasses) ∝ S_1(glasses | ·), giving L_1(· | glasses) = (3/4, 1/4) and L_1(· | hat) = (0, 1). Speaker: S_1(glasses | ·) ∝ L_0(· | glasses), with S_1(glasses | ·) = (1, 1/3) and S_1(hat | ·) = (0, 2/3). [Frank & Goodman 12, Degen 13]
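The numbers on the three rational speech acts slides can be reproduced with a few lines of exact arithmetic. This is a minimal sketch, assuming two candidate referents (one wearing only glasses, one wearing glasses and a hat), a uniform prior, and no utterance cost; the referent names and the `TRUTH` table are assumptions made for illustration.

```python
from fractions import Fraction

UTTERANCES = ["glasses", "hat"]
REFERENTS = ["glasses_only", "glasses_and_hat"]  # assumed referent names

# TRUTH[u][r] is 1 when utterance u is literally true of referent r.
TRUTH = {
    "glasses": {"glasses_only": 1, "glasses_and_hat": 1},
    "hat": {"glasses_only": 0, "glasses_and_hat": 1},
}


def normalize(scores):
    """Rescale non-negative scores so they sum to one."""
    total = sum(scores.values())
    return {key: Fraction(value) / total for key, value in scores.items()}


def literal_listener():
    """L0(r | u): uniform over the referents that u is true of."""
    return {u: normalize(TRUTH[u]) for u in UTTERANCES}


def speaker(listener):
    """S1(u | r) proportional to L0(r | u), normalized over utterances."""
    return {
        r: normalize({u: listener[u][r] for u in UTTERANCES})
        for r in REFERENTS
    }


def pragmatic_listener(spk):
    """L1(r | u) proportional to S1(u | r), normalized over referents."""
    return {
        u: normalize({r: spk[r][u] for r in REFERENTS})
        for u in UTTERANCES
    }


L0 = literal_listener()
S1 = speaker(L0)
L1 = pragmatic_listener(S1)
```

Hearing "glasses", the pragmatic listener shifts from an even split to 3/4 on the glasses-only referent, because a speaker describing the other referent would more often have said "hat".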

  22. Pragmatics Q: Do you know what time it is?

  23. Pragmatics Q: Do you know what time it is? A: Yes

  24. Pragmatics Q: Do you know what time it is? A: Yes. “I find his cooking very interesting.” [Grice 70]

  25. RSA game tree speaker hat glasses

  26. RSA game tree: as speaker speaker listener +1 hat hat -1 +1 glasses glasses -1

  27. RSA game tree: as speaker speaker listener +1 hat hat -1 +1 glasses glasses -1

  28. RSA game tree: as listener speaker listener ? ? glasses glasses ?

  29. A recipe for pragmatic language understanding: 1. Train a base speaker model. [Figure: example speaker descriptions; recovered label fragments: hat & glasses, guy with glasses, hat, man, smiley, glasses, plain, hat & glasses, man, glasses]

  30. A recipe for pragmatic language understanding: 1. Train a base speaker model. 2. Solve this POMDP. [Figure: game tree over the utterances "hat" and "glasses" with payoffs +1 and -1] Ronghang Hu, Volkan Cirik, Daniel Fried. Speaker-follower models for vision-and-language navigation. NeurIPS 18.
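For the one-step reference game, step 2 of the recipe reduces to a best response against the base speaker. The sketch below is a toy illustration under assumptions: the `BASE_SPEAKER` table stands in for a trained model and reuses the speaker probabilities from the rational speech acts slides, the referent names and the uniform prior are hypothetical, and the payoffs are the +1 / -1 from the game tree. The navigation application in the cited paper is a much larger, sequential problem.

```python
from fractions import Fraction

# Stand-in for a trained base speaker: BASE_SPEAKER[r][u] = S(u | r).
# The values are taken from the RSA slides; the referent names are assumed.
BASE_SPEAKER = {
    "glasses_only": {"glasses": Fraction(1), "hat": Fraction(0)},
    "glasses_and_hat": {"glasses": Fraction(1, 3), "hat": Fraction(2, 3)},
}


def posterior(utterance, prior=None):
    """Belief over referents after hearing the utterance (Bayes' rule)."""
    referents = list(BASE_SPEAKER)
    if prior is None:
        # Assumption: every referent is equally likely a priori.
        prior = {r: Fraction(1, len(referents)) for r in referents}
    joint = {r: prior[r] * BASE_SPEAKER[r][utterance] for r in referents}
    total = sum(joint.values())
    return {r: p / total for r, p in joint.items()}


def best_response(utterance):
    """Pick the guess with the highest expected payoff (+1 right, -1 wrong)."""
    belief = posterior(utterance)
    # Expected payoff of guessing r: (+1) * P(r) + (-1) * (1 - P(r)).
    values = {r: 2 * p - 1 for r, p in belief.items()}
    choice = max(values, key=values.get)
    return choice, values[choice]


guess_glasses, value_glasses = best_response("glasses")
guess_hat, value_hat = best_response("hat")
```

With these numbers the listener guesses the glasses-only referent on hearing "glasses" and expects a payoff of 1/2, while "hat" identifies the other referent with certainty.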

  31. Application: instruction following. Human instruction: "Go through the door on the right and continue straight. Stop in the next room in front of the bed." [Figure: top-down overview of trajectories. (a) orange: trajectory of the baseline policy without pragmatic inference; (b) green: trajectory with pragmatic inference]

  32. Application: instruction generation. seq2seq: "Walk past the dining room table and chairs and wait there." reasoning: "Walk past the dining room table and chairs and take a right into the living room. Stop once you are on the rug." human: "Turn right and walk through the kitchen. Go right into the living room and stop by the rug."

  33. Lesson: Utterances are chosen to facilitate correct interpretation in context. (This makes the learning problem easier!)

  34. Language as a scaffold for learning

  35. What else is an instruction follower good for? Language learning, Reinforcement learning. "go east of the heart". Learning with latent language. A, Klein & Levine. NAACL 18.

  36. Pretraining via language learning. f(· ; η, ) NORTH π "go east of the heart" [Branavan et al., 09]

  37. (Standard) reinforcement learning. L(f(· ; η, ), ·) ??? R π

  38. Concept learning. L(f(· ; η, ), ·) NORTH,… R π "find the horse"

  39. Concept learning. L(f(· ; η, ), ·) NORTH,… R π -0.52 "find the horse"
