

  1. Task-agnostic priors for reinforcement learning Karthik Narasimhan Princeton Collaborators: Yilun Du (MIT), Regina Barzilay (MIT), Tommi Jaakkola (MIT)

  2. State of RL: ~1100 PF/s-days; ~800 PF/s-days, or 45,000 years (source: OpenAI). Challenges: little to no transfer of knowledge, sample efficiency, generalizability.

  3. Current approaches • Multi-task policy learning (Parisotto et al., 2015; Rusu et al., 2016; Andreas et al., 2017; Espeholt et al., 2018, …) • Meta-learning (Duan et al., 2016; Wang et al., 2016; Finn et al., 2017; Al-Shedivat et al., 2018; Nagabandi et al., 2018, …) • Bayesian RL (Ghavamzadeh et al., 2015; … ) • Successor representations (Dayan, 1993; Kulkarni et al., 2016, Barreto et al., 2017; …) • Policies are closely coupled with tasks & not well suited for transfer to new domains • Model-based RL?

  4. Bootstrapping model learning with task-agnostic priors • A model of the environment is more transferable than a policy • Expensive to learn a model of the environment from scratch - use priors! • Task-agnostic priors for models: + generalizable, easier to acquire; - may be sub-optimal for a specific task [Figure: a shared prior initializes separate models and policies for Task 1 and Task 2]

  5. ‘Universal’ priors Language Physics [NBJ18] [DN19]

  6. Task-agnostic dynamics priors for RL Key questions: • Can we learn physics in a task-agnostic fashion? • Does it help sample efficiency of RL? • Does it help transfer of learned representations to new tasks? (Yilun Du and Karthik Narasimhan, ICML 2019)

  7. Dynamics model for RL • Frame prediction: Oh et al. (2015), Finn et al. (2016), Weber et al. (2017), … • Lack generalization since the model is learned for a specific task (i.e. action-conditioned) • Parameterized physics models: Cutler et al. (2014), Scholz et al. (2014), Zhu et al. (2018), … • Require manual specification • Our work: learn a prior from task-independent data, decouple model and policy

  8. Overall approach • Pre-train a frame predictor (SpatialNet) on physics videos • Initialize the dynamics model with it and learn a policy that makes use of its future-state predictions • Simultaneously fine-tune the dynamics model on the target environment (see the sketch below)
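
  For concreteness, a minimal sketch of this three-step recipe, assuming PyTorch; the module names (DynamicsPrior, PolicyWithPrior) and all layer sizes are illustrative placeholders, not the released implementation:

```python
import torch
import torch.nn as nn

class DynamicsPrior(nn.Module):
    """Action-free frame predictor: next frame from the current frame."""
    def __init__(self, channels=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, channels, 3, padding=1),
        )

    def forward(self, frame):
        return self.net(frame)

class PolicyWithPrior(nn.Module):
    """Policy that sees both the current frame and the predicted next frame."""
    def __init__(self, dynamics, channels=3, n_actions=4):
        super().__init__()
        self.dynamics = dynamics
        self.head = nn.Sequential(
            nn.Conv2d(2 * channels, 32, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, n_actions),
        )

    def forward(self, frame):
        pred = self.dynamics(frame)                 # use the future-state prediction
        return self.head(torch.cat([frame, pred], dim=1))

# Step 1: pre-train DynamicsPrior on physics videos with a next-frame loss.
# Step 3: while the policy trains on the target task, keep updating the
# dynamics model on observed transitions (fine-tuning).
policy = PolicyWithPrior(DynamicsPrior())
logits = policy(torch.zeros(1, 3, 64, 64))          # dummy 64x64 RGB frame
```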

  9. SpatialNet • Two key operations: isolation of the dynamics of each entity, and accurate modeling of the local space around each entity • Spatial memory: use convolutions and residual connections to better capture local dynamics (instead of the additive updates in ConvLSTM (Xingjian et al., 2015)) • No action-conditioning: the prior predicts the next frame from the current frame alone, unlike a standard action-conditioned model T̂(s' | s, a)
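
  A minimal sketch of the spatial-memory idea: a convolutional memory updated through residual connections rather than ConvLSTM-style additive/gated updates, with no action input. Assumes PyTorch; SpatialMemoryCell and its layer sizes are illustrative simplifications, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class SpatialMemoryCell(nn.Module):
    def __init__(self, in_channels=3, hidden=32):
        super().__init__()
        self.encode = nn.Conv2d(in_channels, hidden, 3, padding=1)
        self.update = nn.Sequential(                  # local, per-location dynamics
            nn.Conv2d(2 * hidden, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1),
        )
        self.decode = nn.Conv2d(hidden, in_channels, 3, padding=1)

    def forward(self, frame, memory):
        x = self.encode(frame)
        # Residual update: the memory keeps per-location (entity-local) state.
        memory = memory + self.update(torch.cat([x, memory], dim=1))
        return self.decode(memory), memory            # predicted next frame, new memory

cell = SpatialMemoryCell()
frame = torch.zeros(1, 3, 64, 64)
memory = torch.zeros(1, 32, 64, 64)
pred, memory = cell(frame, memory)                    # note: no action input anywhere
```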

  10. Experimental setup • PhysVideos: 625k frames of video containing moving objects of various shapes and sizes (rendered with a physics engine) • PhysWorld: collection of 2D physics-centric games - navigation, object gathering and shooting tasks • Atari: stochastic version with sticky actions • RL agent: use predicted future frames as input to the policy • Same pre-trained dynamics prior used in all RL experiments [Figure: PhysVideos → pre-trained dynamics model → agent network, evaluated on PhysWorld and Atari]

  11. Frame predictions [Figure: predicted frames on PhysShooter; pixel prediction accuracy]

  12. Predicting physical parameters [Figure: PhysShooter] Predicted frames are indicative of the physical parameters of the environment

  13. Policy learning: PhysWorld

  14. Policy learning: Atari Same pre-trained model as before (from PhysVideos)

  15. Transfer learning (target env: PhysShooter; reward) • No transfer: 35.42 • Pre-trained dynamics predictor: 42.27 • Policy transfer (PhysForage): 40.40 • Model transfer (PhysForage): 53.66 • Model transfer > policy transfer > no transfer

  16. Beyond physics • Not all environments are physical • How do we encode knowledge of the environment?

  17. [Figure: Environment 1 with states s1-s5 and Environment 2 with states u1-u6, each with its own transition probabilities] • Knowledge of the environment: transitions and rewards • Need some anchor to re-use acquired information • Incorrect mapping will lead to negative transfer

  18. [Figure: the same two environments, with analogies drawn between their states] s1 is similar to u1, s2 is similar to u3, …

  19. Grounding language for transfer in RL "Scorpions chase you and kill you on touch" / "Spiders are chasers and can be destroyed by an explosion" • Text descriptions associated with objects/entities

  20. Language as a bridge • Language as a task-invariant and accessible medium • Traditional approaches: direct transfer of a policy (e.g. instruction following) • This work: transfer a 'model' of the environment using text descriptions (Narasimhan, Barzilay, Jaakkola, JAIR 2018)

  21. Model-based reinforcement learning: transition distribution T(s' | s, a) and reward function R(s, a) [Figure: from state s and action a to possible next states s'_1, …, s'_n with a reward, e.g. +10]

  22. Model-based reinforcement learning: text-conditioned transition distribution T(s' | s, a, z) [Figure: state s, action a, and text z ("Scorpions chase you and kill you on touch") map to possible next states s'_1, …, s'_n]
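
  A minimal sketch of such a text-conditioned transition model, assuming PyTorch; TextConditionedTransition, the LSTM text encoder, and all dimensions are illustrative assumptions rather than the paper's exact parameterization:

```python
import torch
import torch.nn as nn

class TextConditionedTransition(nn.Module):
    """Predicts next-state features from (state, action, text description z)."""
    def __init__(self, state_dim=16, n_actions=4, vocab=100, embed=32, hidden=64):
        super().__init__()
        self.text_embed = nn.Embedding(vocab, embed)
        self.text_enc = nn.LSTM(embed, hidden, batch_first=True)
        self.action_embed = nn.Embedding(n_actions, hidden)
        self.state_enc = nn.Linear(state_dim, hidden)
        self.head = nn.Linear(3 * hidden, state_dim)   # next-state feature prediction

    def forward(self, state, action, text_tokens):
        _, (h, _) = self.text_enc(self.text_embed(text_tokens))
        z = h[-1]                                      # representation of the description
        fused = torch.cat([self.state_enc(state), self.action_embed(action), z], dim=-1)
        return self.head(fused)                        # s-hat' given (s, a, z)

model = TextConditionedTransition()
s = torch.zeros(1, 16)                                 # toy state features
a = torch.tensor([2])                                  # toy action index
z_tokens = torch.randint(0, 100, (1, 7))               # "Scorpions chase you ..." as token ids
next_state_pred = model(s, a, z_tokens)
```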

  23. Bootstrap learning through text: transfer from T̂_1(s' | s, a, z_1), R̂_1(s, a, z_1) with z_1 = "Scorpions chase you and kill you on touch" to T̂_2(u' | u, a, z_2), R̂_2(u, a, z_2) with z_2 = "Spiders are chasers and can be destroyed" • Appropriate representation to incorporate language • Partial text descriptions

  24. Differentiable value iteration: Q(s, a) = R(s, a) + γ Σ_{s'} T(s' | s, a) V(s'),  V(s) = max_a Q(s, a) [Figure: observations + descriptions feed a convolutional neural network (CNN) that produces the reward R and transition T; a k-step recurrence alternating Q and V (via max) yields the value φ(s)] (Value Iteration Network, Tamar et al., 2016)
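
  A minimal tabular sketch of the k-step value-iteration recurrence above; a VIN implements the same backup with convolutions over a spatial state map and learns R and T from the observations and descriptions. The toy MDP numbers here are made up purely for illustration:

```python
import torch

def value_iteration(T, R, gamma=0.9, k=20):
    """T: (A, S, S) transition probabilities, R: (S, A) rewards -> V: (S,), Q: (S, A)."""
    n_states, n_actions = R.shape
    V = torch.zeros(n_states)
    Q = torch.zeros(n_states, n_actions)
    for _ in range(k):                                  # k-step recurrence
        # Q(s, a) = R(s, a) + gamma * sum_{s'} T(s' | s, a) V(s')
        Q = R + gamma * torch.einsum("aij,j->ia", T, V)
        V = Q.max(dim=1).values                         # V(s) = max_a Q(s, a)
    return V, Q

# Toy 2-state, 2-action MDP (hypothetical numbers, purely illustrative).
T = torch.tensor([[[0.9, 0.1], [0.2, 0.8]],             # transitions under action 0
                  [[0.5, 0.5], [0.6, 0.4]]])            # transitions under action 1
R = torch.tensor([[0.0, 1.0], [1.0, 0.0]])
V, Q = value_iteration(T, R)
```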

  25. Experiments • 2-D game environments from the GVGAI framework (each with different layouts, different entity sets, etc.) • Text descriptions from Amazon Mechanical Turk • Transfer setup: train on multiple source tasks, and use the learned parameters to initialize for target tasks [Table: environment statistics; source and target game instances for transfer] • Baselines: DQN (Mnih et al., 2015), text-DQN, Actor-Mimic (Parisotto et al., 2016) • Evaluation: jumpstart, average and asymptotic reward

  26. Average reward [Bar chart: transfer from F&E-1 to Freeway, comparing No transfer, DQN, Actor-Mimic, text-DQN, and text-VIN; bar values shown are 0.08, 0.21, 0.22, 0.33, and 0.73]

  27. Transfer results

  28. Conclusions • Model-based RL is sample efficient but learning a model is expensive • Task-agnostic priors over models provide a solution for both sample efficiency and generalization • Two common priors applicable to a variety of tasks: classical mechanics and language. Questions?

  29. Challenger: Joshua Zhanson <jzhanson@andrew.cmu.edu> ● After the success of deep learning, we are now seeing a push into middle-level intelligence, such as ● cross-domain reasoning, e.g., visual question-answering or language grounding, and ● using knowledge from different tasks and domains to aid learning, e.g., learning skills from video or demonstration, or learning to learn in general. ● What do you see as the end goal of such mid-level intelligence, especially since the space of mid-level tasks is so much more complex and varied? ● What are the greatest obstacles on the path to mid-level intelligence?
