Carnegie Mellon School of Computer Science
Deep Reinforcement Learning and Control
Model-Based Reinforcement Learning
Katerina Fragkiadaki
Model
A model is anything the agent can use to predict how the environment will respond to its actions: concretely, the state transition T(s′ | s, a) and the reward R(s, a).
[Diagram: the model maps (s, a) to s′ and r.]
Model learning
We will be learning the model using experience tuples: a supervised learning problem.
[Diagram: a learning machine (random forest, deep neural network, linear or other shallow predictor) maps (s, a) to (s′, r).]
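A minimal sketch of this supervised problem, assuming a PyTorch setup and a buffer of (s, a, s′, r) tuples; the network shape, names, and training step are illustrative, not the lecture's exact model:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicsModel(nn.Module):
    """Predicts the next state and the reward from the current state and action."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.next_state_head = nn.Linear(hidden, state_dim)
        self.reward_head = nn.Linear(hidden, 1)

    def forward(self, s, a):
        h = self.trunk(torch.cat([s, a], dim=-1))
        return self.next_state_head(h), self.reward_head(h).squeeze(-1)

def train_step(model, optimizer, s, a, s_next, r):
    """One supervised regression step on a batch of experience tuples (s, a, s', r)."""
    pred_s_next, pred_r = model(s, a)
    loss = F.mse_loss(pred_s_next, s_next) + F.mse_loss(pred_r, r)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```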
Learning Dynamics
General parametric form (no prior from physics knowledge) vs. Newtonian physics equations (system identification).
- Neural networks: lots of unknown parameters; very flexible, but very hard to get to generalize.
- System identification: we assume the dynamics equations are given and only a few parameters are unknown; much easier to learn, but suffers from under-modeling (bad models).
Observation prediction
Our model tries to predict the observations. Why? Because MANY different rewards can be computed once I have access to the future visual observation, e.g., make Mario jump, make Mario move to the right or to the left, lie down, make Mario jump onto the wall and then jump back down again, etc. If I were just predicting rewards, then I could only plan towards that specific goal, e.g., winning the game, just as in the model-free case.
Unroll the model by feeding the prediction back as input!
[Diagram: a learning machine (random forest, deep neural network, linear or other shallow predictor) maps (o, s, a) to (o′, r).]
Prediction in a latent space
Our model tries to predict a (potentially latent) embedding, from which rewards can be computed, e.g., by matching the embedding of my desired goal image to the prediction:
r = exp(−∥h′ − h_g∥)
[Diagram: a learning machine (random forest, deep neural network, linear or other shallow predictor) maps (o, s, a) to the predicted embedding h′, which is compared to the goal embedding h_g to produce r.]
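A one-line sketch of the reward on the slide (tensor names are illustrative):

```python
import torch

def latent_matching_reward(h_pred, h_goal):
    """r = exp(-||h' - h_g||): reward from matching the predicted embedding
    to the embedding of the desired goal image."""
    return torch.exp(-torch.norm(h_pred - h_goal, dim=-1))
```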
Prediction in a latent space
Our model tries to predict a (potentially latent) embedding, from which rewards can be computed, e.g., by matching the embedding of my desired goal image to the prediction. One such feature encoding we have seen is the one that keeps from the observation ONLY whatever is controllable by the agent:
min_{θ, ψ}  ∥T(h(s), a; θ) − h(s′)∥ + ∥Inv(h(s), h(s′); ψ) − a∥
[Diagram: the encoder h maps s and s′ to h(s) and h(s′); the forward model T(h, a; θ) predicts h(s′), and the inverse model Inv(h(s), h(s′); ψ) predicts the action a.]
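A hedged sketch of this combined forward + inverse objective, assuming an encoder, a forward model, and an inverse model as small networks, and using a squared-error surrogate for the norms; all names are illustrative:

```python
import torch
import torch.nn.functional as F

def latent_dynamics_loss(encoder, forward_model, inverse_model, s, a, s_next):
    """Forward term: T(h(s), a) should match h(s').
    Inverse term: Inv(h(s), h(s')) should recover a, so the encoding keeps
    only what the agent can control."""
    h, h_next = encoder(s), encoder(s_next)
    forward_loss = F.mse_loss(forward_model(torch.cat([h, a], dim=-1)), h_next)
    inverse_loss = F.mse_loss(inverse_model(torch.cat([h, h_next], dim=-1)), a)
    return forward_loss + inverse_loss
```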
Prediction in a latent space
Our model tries to predict a (potentially latent) embedding, from which rewards can be computed, e.g., by matching the embedding of my desired goal image to the prediction, r = exp(−∥h′ − h_g∥).
Unroll the model by feeding the prediction back as input!
[Diagram: the predicted embedding h′ is fed back into the learning machine together with the next action.]
Avoid or minimize unrolling
Unrolling quickly causes errors to accumulate. We can instead consider coarse models, where we input a long sequence of actions and predict the final embedding in one shot, without unrolling, and compute the reward as before, r = exp(−∥h′ − h_g∥).
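A minimal sketch of such a coarse, one-shot model, assuming the whole action sequence is simply flattened into the input; architecture and names are illustrative:

```python
import torch
import torch.nn as nn

class CoarseModel(nn.Module):
    """One-shot model: map the current latent and a whole action sequence
    directly to the final latent, with no step-by-step unrolling."""
    def __init__(self, latent_dim, action_dim, horizon, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + horizon * action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, latent_dim),
        )

    def forward(self, h0, action_seq):
        # action_seq: (batch, horizon, action_dim), flattened into a single input
        return self.net(torch.cat([h0, action_seq.flatten(1)], dim=-1))
```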
Why model learning?
• Online planning at test time (model-predictive control)
• Model-based RL: training policies using simulated experience
• Efficient exploration
Why model learning?
• Online planning at test time (model-predictive control)
• Model-based RL: training policies using simulated experience
• Efficient exploration
Given a state, I unroll my model forward and seek the action that results in the highest reward. How do I select this action?
1. I discretize my action space and perform tree search (a simple sampling-based sketch follows below).
2. I use continuous gradient descent to optimize over actions.
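A hedged sketch of the simplest sampling-based version of option 1 (random shooting rather than a full tree search), assuming the learned model returns a next state and a scalar reward; all names are illustrative:

```python
def plan_first_action(model, s, sample_action, horizon=10, n_candidates=1000):
    """Sample candidate action sequences, roll the learned model forward,
    and return the first action of the sequence with the best predicted return.
    Replanning at every step gives simple model-predictive control."""
    best_return, best_action = -float("inf"), None
    for _ in range(n_candidates):
        actions = [sample_action() for _ in range(horizon)]
        state, total = s, 0.0
        for a in actions:
            state, r = model(state, a)   # learned transition and reward
            total += float(r)
        if total > best_return:
            best_return, best_action = total, actions[0]
    return best_action
```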
Backpropagate to actions
Legend: deterministic node: the value is a deterministic function of its input; stochastic node: the value is sampled based on its input (which parametrizes the distribution to sample from); deterministic computation node.
[Computation graph: s_0 → a_0 = π_θ(s_0) → r_0 = ρ(s_0, a_0); s_1 = T(s_0, a_0) → a_1 = π_θ(s_1) → r_1 = ρ(s_1, a_1); ... with parameters θ shared across steps.]
Reward and dynamics are known.
Backpropagate to actions
Given a state, I unroll my model forward and seek the action that results in the highest reward. How do I select this action?
1. I discretize my action space and perform tree search.
2. I use continuous gradient descent to optimize over actions.
[Computation graph: s_0, a_0 → s_1 = T(s_0, a_0; θ) → ... → s_T, with a reward r computed along the way; the actions a_0, a_1, ... are free optimization variables.]
No policy is learned: actions are selected directly by backpropagating through the dynamics, the continuous analog of online planning. The dynamics are frozen; we backpropagate to the actions directly (see the sketch below).
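A minimal sketch of option 2, assuming a differentiable learned dynamics model that returns the next state and a known or learned reward function; names and hyperparameters are illustrative:

```python
import torch

def optimize_actions(dynamics, reward_fn, s0, action_dim, horizon=10, steps=100, lr=0.1):
    """Hold the learned dynamics fixed and run gradient descent directly on
    the action sequence: the continuous analog of online planning."""
    actions = torch.zeros(horizon, action_dim, requires_grad=True)
    optimizer = torch.optim.Adam([actions], lr=lr)
    for _ in range(steps):
        s, total_reward = s0, 0.0
        for t in range(horizon):
            s = dynamics(s, actions[t])              # differentiable frozen model T
            total_reward = total_reward + reward_fn(s, actions[t])
        loss = -total_reward                         # maximize predicted return
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return actions.detach()
```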
Why model learning?
• Online planning at test time (model-predictive control)
• Model-based RL: training policies using simulated experience
• Efficient exploration
Remember: Stochastic Value Gradients, SVG(0)
Sample z ∼ N(0, 1) and set a = μ(s; θ) + z σ(s; θ), so the action is a differentiable function of the policy parameters.
[Diagram: a policy DNN (θ_μ) maps s and z to a, and a critic DNN (θ_Q) maps (s, a) to Q(s, a).]
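A short sketch of this reparameterization step (the networks themselves are assumed to exist; names are illustrative):

```python
import torch

def reparameterized_action(mu_net, sigma_net, s):
    """a = mu(s; theta) + z * sigma(s; theta), z ~ N(0, 1): the action is a
    deterministic function of (s, z), so dQ/da can flow into the policy."""
    mu, sigma = mu_net(s), sigma_net(s)
    z = torch.randn_like(mu)
    return mu + z * sigma
```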
Backpropagate to the policy
Legend: deterministic node: the value is a deterministic function of its input; stochastic node: the value is sampled based on its input (which parametrizes the distribution to sample from); deterministic computation node.
[Computation graph: s_0 → a_0 = π_θ(s_0) → r_0 = ρ(s_0, a_0); s_1 = T(s_0, a_0) → a_1 = π_θ(s_1) → r_1 = ρ(s_1, a_1); ... with policy parameters θ shared across steps.]
Reward and dynamics are known.
Backpropagate to the policy
[Computation graph: s_0 → a_0 = π_θ(s_0) → Q(s_0, a_0); s_1 = T(s_0, a_0; θ) → a_1 = π_θ(s_1) → Q(s_1, a_1); ... with the learned dynamics frozen and the policy parameters θ shared across steps.]
The dynamics are frozen; we backpropagate to the policy directly by maximizing Q within a time horizon (see the sketch below).
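A hedged sketch of one such policy update, assuming the policy, frozen dynamics, and critic are differentiable modules and only the policy's parameters are registered in the optimizer; all names are illustrative:

```python
def update_policy_through_model(policy, dynamics, q_fn, s0, horizon, optimizer):
    """Unroll the frozen learned dynamics with the current policy and maximize
    Q along the rollout; gradients flow through the whole chain, but only the
    policy's parameters (the ones in the optimizer) get updated."""
    s, total_value = s0, 0.0
    for t in range(horizon):
        a = policy(s)
        total_value = total_value + q_fn(s, a)
        s = dynamics(s, a)          # frozen model: differentiable, not updated
    loss = -total_value
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```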
Why model learning?
• Online planning at test time (model-predictive control)
• Model-based RL: training policies using simulated experience
• Efficient exploration
Challenges
• Errors accumulate during unrolling.
• A policy learnt on top of an inaccurate model is upper-bounded by the accuracy of the model.
• Policies exploit model errors by being overly optimistic.
• With lots of experience, model-free methods would always do better.
Answers:
• Use the model to pre-train your policy, then fine-tune model-free.
• Use the model to explore fast, but always try actions not suggested by the model so you do not suffer from its biases.
• Build a model on top of a latent space which is succinct and easily predictable.
• Abandon global models and train local linear models, which do not generalize but help you solve your problem fast, then distill the knowledge of the actions into a general neural network policy (next week).
Model Learning
Three questions always in mind:
• What shall we be predicting?
• What is the architecture of the model, and what structural biases should we add to get it to generalize?
• What is the action representation?
[Diagram for each question: the model maps (o, s, a) through an encoding h to the prediction h′.]
How do we learn to play Billiards?
• First, we transfer all the knowledge about how objects move that we have accumulated so far.
• Second, we watch other people play and practice ourselves, to fine-tune such model knowledge.
Learning Action-Conditioned Billiard Dynamics
Predictive Visual Models of Physics for Playing Billiards, K. Fragkiadaki et al., ICLR 2016
Learning Action-Conditioned Billiard Dynamics
[Diagram: a CNN takes the frame and the applied force field as input.]
Q: will our model be able to generalize across a different number of balls present?
Learning Action-Conditioned Billiard Dynamics
[Diagram: world-centric prediction vs. object-centric prediction, each conditioned on the applied force F.]
Q: will our model be able to generalize across a different number of balls present?
Object-centric Billiard Dynamics
[Diagram: an object-centric CNN takes a glimpse around a ball and the force F and predicts the ball's displacement dx.]
The object-centric CNN is shared across all objects in the scene. We apply it one object at a time to predict each object's future displacement. We then copy-paste the ball at the predicted location and feed the result back as input.
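A hedged pseudocode-level sketch of this per-object rollout; the crop, rendering, and model interfaces (crop_fn, render_fn, the force handling) are illustrative assumptions, not the paper's exact pipeline:

```python
import torch

def rollout_object_centric(model, crop_fn, render_fn, frame, positions, force, horizon):
    """Shared object-centric CNN applied one ball at a time: crop a glimpse
    around each ball, predict that ball's displacement, move the ball to the
    predicted location, re-render, and feed the new frame back as input."""
    trajectory = [list(positions)]
    for t in range(horizon):
        step_force = force if t == 0 else torch.zeros_like(force)  # push only at t = 0
        new_positions = []
        for p in positions:
            glimpse = crop_fn(frame, p)        # object-centric view around one ball
            dx = model(glimpse, step_force)    # predicted displacement of that ball
            new_positions.append(p + dx)       # "copy-paste" ball at predicted location
        positions = new_positions
        frame = render_fn(positions)           # hypothetical re-rendering of the table
        trajectory.append(list(positions))
    return trajectory
```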
Playing Billiards
How should I push the red ball so that it collides with the green one?
CEM for searching in the force space.
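A minimal sketch of a cross-entropy method (CEM) search over the applied force, assuming a score function that unrolls the learned model and rates whether the red ball reaches the green one; names and hyperparameters are illustrative:

```python
import numpy as np

def cem_force_search(score_fn, force_dim=2, iters=10, pop=100, elite_frac=0.1):
    """Sample candidate forces from a Gaussian, score each one with the learned
    forward model, and refit the Gaussian to the best candidates."""
    mean, std = np.zeros(force_dim), np.ones(force_dim)
    n_elite = max(1, int(pop * elite_frac))
    for _ in range(iters):
        candidates = mean + std * np.random.randn(pop, force_dim)
        scores = np.array([score_fn(f) for f in candidates])
        elite = candidates[np.argsort(scores)[-n_elite:]]
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean
```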
Learning Dynamics
Two good ideas so far:
1) Object graphs instead of images. Such an encoding allows us to generalize across a different number of entities in the scene.
2) Predict motion instead of appearance. Since appearance does not change, predicting motion suffices. Let's predict only the dynamic properties and keep the static ones fixed.