A Case Against Generative Models for Reinforcement Learning? - PowerPoint PPT Presentation



  1. A Case Against Generative Models for Reinforcement Learning? Shakir Mohamed (shakir@deepmind.com, @shakir_za). Generative Models for RL workshop, DALI 2018. [Title-slide figure: agent-environment graphical model with actions a_{t-1}, a_t, a_{t+1}; states s_{t-1}, s_t, s_{t+1}; memory m_{t-1}, m_t, m_{t+1}; external-environment data x_{t-1}, x_t, x_{t+1}; and an internal environment containing a state representation, critic, option KB, option, and planner.]

  2. Perception and Action. [Figure: the brain interacting with its environment through observation/sensation and action; labelled regions include the primary motor cortex, premotor cortex, prefrontal cortex, primary sensory cortex, sensory association cortex, and posterior association cortex.]

  3. Perception and Action. [Figure: the same brain-environment loop alongside an agent interacting with an external environment through observation/sensation and action; the agent's internal environment contains a state representation, state embedding, option KB, critic, option, and planner.]

  4. Generative RL. What makes something a generative model? We don't know how to usefully learn hierarchical models, so any hierarchical reasoning will be hard to do. Much emphasis on generative models of images: a false sense of progress? Inference is hard: are we attempting to solve a more difficult problem first, instead of solving the task directly? Is anything ever truly unsupervised? There is always a context, and contextual models conditioned on anything other than labels are hard.

  5. Generative Processes. The environment is the generative process: an unknown likelihood, not known analytically, whose outcomes we can only observe. Action prior p(a): a prior over actions. Environment or model p(R(s,a)): the long-term reward. All the key inferential questions can now be asked in this simple framework (sketched in equations below).
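
A minimal sketch of this setup in equations, assuming the usual planning-as-inference reading of the diagram; the notation p(a) and p(R(s,a)) is from the slide, while the explicit joint and posterior are my interpretation:

\[
a \sim p(a), \qquad R \sim p(R \mid s, a), \qquad
p(a \mid R, s) \;=\; \frac{p(R \mid s, a)\, p(a)}{\int p(R \mid s, a')\, p(a')\, \mathrm{d}a'}.
\]

Because the likelihood is only available through interaction with the environment, this posterior over actions has to be approximated, which is where the next slides pick up.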

  6. Planning-as-Inference. The simplest question: what is the posterior distribution over actions? Maximise the probability of the return, log p(R), by variational inference in the hierarchical model (the bound is sketched below). This recovers policy search methods: a uniform prior over distributions, continuous policy parameters, and an environment that can be evaluated but not differentiated.
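
A sketch of the variational bound being appealed to, under the assumption that a parametric distribution q_\theta(a) plays the role of the policy:

\[
\log p(R \mid s) \;=\; \log \int p(R \mid s, a)\, p(a)\, \mathrm{d}a
\;\ge\; \mathbb{E}_{q_\theta(a)}\!\left[\log p(R \mid s, a)\right] - \mathrm{KL}\!\left[q_\theta(a) \,\|\, p(a)\right] \;=\; \mathcal{F}(\theta).
\]

Maximising the free energy \mathcal{F}(\theta) over \theta is the policy search; with a uniform prior p(a), the KL term reduces to an entropy bonus on q_\theta, which is the entropy penalty mentioned on the next slide.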

  7. Planning-as-Inference: Free Energy. Policy gradients follow from the score-function gradient of the bound (a sketch follows below). The appearance of the entropy penalty is natural, and alternative priors are easy to consider. Prior knowledge of the action space can easily be incorporated, any of the available tools of probabilistic inference can be used, and both stochastic and deterministic policies are handled easily.
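
A minimal, illustrative sketch of the score-function (REINFORCE-style) gradient in Python/NumPy; the Gaussian policy, the toy reward, and every name here are assumptions for illustration rather than the speaker's method:

import numpy as np

rng = np.random.default_rng(0)

def reward(a):
    # Stand-in for the environment p(R(s, a)): we can evaluate it,
    # but we never differentiate through it.
    return -np.sum((a - 1.5) ** 2, axis=-1)

# Gaussian policy q(a) = N(mu, sigma^2 I) with a learnable mean.
mu = np.zeros(2)
sigma = 0.5
lr = 0.05

for step in range(200):
    a = mu + sigma * rng.standard_normal((64, 2))     # sample actions from the policy
    R = reward(a)                                      # evaluate returns by interaction
    score = (a - mu) / sigma**2                        # grad_mu log q(a)
    grad = ((R - R.mean())[:, None] * score).mean(0)   # score-function estimator with baseline
    mu += lr * grad                                    # ascend the expected return

print(mu)  # ends up close to the optimum [1.5, 1.5]

Subtracting the mean return as a baseline keeps the estimator unbiased while reducing its variance.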

  8. Planning-as-Inference. With a more realistic expansion as a graphical model: Bellman's equation can be derived as a different writing of message passing (one standard construction is sketched below); application of the EM algorithm for policy search becomes possible; other variational methods, like EP, are easy to consider; and both model-free and model-based methods emerge. [Figure: the agent-environment graphical model from the title slide.]
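
One common way to make the message-passing claim concrete, following the RL-as-inference literature; this particular construction is an assumption about which derivation the slide intends. Introduce optimality variables \mathcal{O}_t with p(\mathcal{O}_t = 1 \mid s_t, a_t) \propto \exp r(s_t, a_t). The backward messages then satisfy

\[
\beta_t(s_t, a_t) \;=\; p(\mathcal{O}_{t:T} \mid s_t, a_t)
\;\propto\; e^{r(s_t, a_t)}\; \mathbb{E}_{p(s_{t+1} \mid s_t, a_t)\, p(a_{t+1} \mid s_{t+1})}\!\left[\beta_{t+1}(s_{t+1}, a_{t+1})\right],
\]

and writing Q_t = \log \beta_t and V_t(s) = \log \mathbb{E}_{a \sim p(a \mid s)}[e^{Q_t(s, a)}] turns this recursion into a soft Bellman equation, Q_t(s, a) = r(s, a) + \log \mathbb{E}_{s'}[e^{V_{t+1}(s')}], up to normalising constants.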

  9. Generative RL. Inference is already hard: do we gain additional benefit? Quantification of uncertainty helps drive natural exploration, but uncertainty is often not used and is easy to obtain in other ways. Inference can be computationally more demanding, and parameter inference is already difficult in non-RL settings. Any hyperparameters can be learnt, but simpler and competitive methods exist.

  10. Model-based RL

  11. Model-based RL. Learn a model of the environment and use it in all planning: an internal simulator that limits interactions with the environment, for safety and for planning. Long-term predictions allow for better planning. Data-efficient learning, especially when experience is expensive to obtain. Prior knowledge of the environment can be easily incorporated. Chiappa et al. (2017). (A minimal loop is sketched below.) [Figure: graphical model with actions a_{t-1}, a_t, a_{t+1}, states s_{t-1}, s_t, s_{t+1}, memory m_{t-1}, m_t, m_{t+1}, and data x_{t-1}, x_t, x_{t+1}.]
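
A minimal sketch of such a loop, under illustrative assumptions (a scalar toy environment, a least-squares one-step model, random-shooting planning, and a known reward); none of this is the speaker's or Chiappa et al.'s setup:

import numpy as np

rng = np.random.default_rng(0)

def env_step(s, a):
    # The real environment: queried only to collect experience.
    s_next = 0.9 * s + 0.1 * a
    return s_next, -s_next ** 2

# 1. Collect a small batch of real transitions with random actions.
data, s = [], 1.0
for _ in range(200):
    a = rng.uniform(-1, 1)
    s_next, r = env_step(s, a)
    data.append((s, a, s_next))
    s = s_next

# 2. Fit a one-step model s' ~ w_s * s + w_a * a by least squares.
X = np.array([[si, ai] for si, ai, _ in data])
y = np.array([sn for _, _, sn in data])
w, *_ = np.linalg.lstsq(X, y, rcond=None)

def model_step(s, a):
    return w[0] * s + w[1] * a

# 3. Plan using the model only: random shooting over simulated rollouts.
def plan(s0, horizon=10, candidates=256):
    plans = rng.uniform(-1, 1, size=(candidates, horizon))
    returns = np.zeros(candidates)
    for i, actions in enumerate(plans):
        s = s0
        for a in actions:
            s = model_step(s, a)
            returns[i] += -s ** 2   # assumes the reward function is known
    return plans[np.argmax(returns), 0]  # execute only the first action

print(plan(1.0))

In practice the reward usually has to be learnt as well, which is one of the difficulties raised on slide 13.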

  12. Model-based RL: exploration, and physical and temporal consistency.

  13. Generative RL. The agent is only as good as the model that is learnt. Arguments in favour rely on linear models or low dimensions, and are for the most part limited to small domains, limited complexity, and limited consistency. There are two sources of error: the model and any value estimate. Computationally more expensive than model-free methods. It is hard to specify the model that best captures the data, and even harder to learn models in partially observed scenarios. Highly flexible models are needed to account for different regimes, and a general-purpose model-learning approach is difficult to develop. Finding the best solution in an environment requires continuous exploration, continuous data collection, and continuous model updating. Models must be learnt from data gathered under changing policies. The reward model also needs to be learnt, which is very hard.

  14. Data-efficient Learning. The trend is to use lots of computation, coupled with environments that can be parallelised: OpenAI evolution strategies (2017). Model error propagates: learning robust models and reducing uncertainty requires a lot of data, which works against the data-efficiency argument (a toy illustration follows below). When model learning does succeed, we often rely on model-free methods to train initially and provide good data.
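
A toy illustration of the error-propagation point, under purely illustrative assumptions: a "learnt" one-step model with a small constant bias is rolled out alongside the true dynamics, and the open-loop prediction error grows well beyond the one-step bias:

import numpy as np

def true_step(s):
    return s + 0.1 * np.sin(s)

def model_step(s):
    # A learnt one-step model with a small constant bias (illustrative assumption).
    return s + 0.1 * np.sin(s) + 0.01

s_true = s_model = 1.0
for t in range(1, 26):
    s_true = true_step(s_true)
    s_model = model_step(s_model)
    if t in (1, 5, 10, 25):
        print(f"horizon {t:2d}: rollout error {abs(s_model - s_true):.3f}")
# The multi-step error compounds to many times the 0.01 one-step bias.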

  15. Intrinsic Motivation. Generative models to drive behaviour in the absence of rewards: complex probabilistic quantities, such as information gain or mutual information. Mohamed and Rezende (2015). (Two examples are written out below.)
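
For concreteness, two of the quantities alluded to, written out; the empowerment form follows Mohamed and Rezende (2015), while the parameter information gain is a generic example rather than necessarily the one intended:

\[
\mathcal{E}(s) \;=\; \max_{\omega(a \mid s)} I(a;\, s' \mid s)
\;=\; \max_{\omega}\; \mathbb{E}_{\omega(a \mid s)\, p(s' \mid s, a)}\!\left[\log p(a \mid s', s) - \log \omega(a \mid s)\right],
\qquad
r^{\mathrm{IG}}_t \;=\; \mathrm{KL}\!\left[\,p(\theta \mid \mathcal{D}_t \cup \{\tau_t\}) \,\big\|\, p(\theta \mid \mathcal{D}_t)\,\right],
\]

where \omega(a \mid s) is an exploration policy, \theta parameterises a learnt environment model, \mathcal{D}_t is the data so far, and \tau_t is the newly observed transition. Both quantities require learning and evaluating generative models and are typically approximated variationally.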

  16. Generative RL. Using generative models to drive behaviour in the absence of rewards: computing the complex probabilistic quantities involves approximations that impact policy learning, data efficiency, and safety; it adds to the burden of data needed to learn the reward structure; and it requires learning environment models themselves, with all the difficulty that entails. Applications of these approaches are simplistic at present.

  17. Valid Critique? The arguments rely on the difficulty of using generative models and learning complex probabilistic quantities with current tools. There is stronger support for model-free methods, since they side-step many of the challenges of model learning. Serious challenges remain in learning reliable, rapidly adapting, data-efficient, and general-purpose models for use in practice. Uncertainty is used in limited ways, but adds a great deal of complexity. Integrated systems. The types of environments and problems that are being addressed matter.

  18. It is not possible to argue against probabilistic approaches to RL. Our challenge is to show that the principles we develop have a rich theory, apply in practice, and can be deployed in flexible ways. [Figure: the agent-environment graphical model from the title slide.]

  19. A Case Against Generative Models for Reinforcement Learning? Shakir Mohamed (shakir@deepmind.com, @shakir_za). Generative Models for RL workshop, DALI 2018.
