
Discrete and Continuous Reinforcement Learning (not part of exam material)



  1. ADVANCED MACHINE LEARNING: Brief Overview of Discrete and Continuous Reinforcement Learning (not part of exam material)

  2. ADVANCED MACHINE LEARNING: Forms of Learning
     • Supervised learning – the algorithm learns a function or model that best maps a set of inputs to a set of desired outputs.
     • Reinforcement learning – the algorithm learns a policy or a model of the transitions across a discrete set of input-output states (a Markovian world) in order to maximize a reward value (external reinforcement).
     • Unsupervised learning – the algorithm learns a model that best represents a set of inputs without any feedback (no desired outputs, no external reinforcement).

  3. ADVANCED MACHINE LEARNING: Example of RL – Learning how to stand up. Morimoto and Doya, Robotics and Autonomous Systems, 2001.

  4. ADVANCED MACHINE LEARNING: Reinforcement Learning – Sequential Decision Problem
     Problem: search for a mapping f from state to action, a_t = f(s_t).
     Task: get rock samples. Feedback: success or failure.
     It is up to the robot to figure out the best solution! What are the rewards? Let's try everything.
     Exploration: the agent has to try and explore multiple solutions to find the best one! (A minimal sketch of this decision loop follows below.)
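A minimal sketch of the sequential decision loop described on this slide. The env and policy objects are hypothetical stand-ins (not from the slides): env.step is assumed to return the next state, a success/failure feedback, and a termination flag.

```python
def run_episode(env, policy, max_steps=100):
    """Run one episode: observe s_t, act with a_t = policy(s_t), record feedback."""
    s = env.reset()                      # initial state s_0
    trajectory = []
    for t in range(max_steps):
        a = policy(s)                    # a_t = f(s_t)
        s_next, feedback, done = env.step(a)
        trajectory.append((s, a, feedback))
        s = s_next
        if done:                         # e.g. rock sample collected, or failure
            break
    return trajectory
```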

  5. ADVANCED MACHINE LEARNING: Supervised / Semi-Supervised Learning
     Problem: search for a mapping f from state to action, a_t = f(s_t).
     In supervised learning, pairs are provided at each time step: (s_1, a_1), (s_2, a_2), (s_3, a_3), ..., (s_T, a_T).
     In semi-supervised learning, only partial supervision is provided: (s_1, a_1), (s_2, ?), (s_3, ?), (s_4, a_4), ..., (s_T, a_T).
     The set of state-action pairs provided for training is optimal (expert teacher). (See the small data sketch below.)
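A small sketch of what these two kinds of training data could look like; the numeric states and actions are arbitrary values chosen only for illustration.

```python
# Supervised: every state comes with its optimal action from the expert teacher.
supervised = [(0.0, 1), (0.5, 1), (1.0, 0), (1.5, 0)]            # (s_t, a_t) pairs

# Semi-supervised: some actions are missing (None stands in for the "?" labels).
semi_supervised = [(0.0, 1), (0.5, None), (1.0, None), (1.5, 0)]
```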

  6. ADVANCED MACHINE LEARNING: Example of RL bootstrapped with supervised learning – Learning how to swing up a pendulum. Atkeson & Schaal, Intern. Conf. on Machine Learning (ICML), 1997; Schaal, NIPS, 1997.

  7. ADVANCED MACHINE LEARNING: Reinforcement Learning & Supervised Learning
     The expert provides some sequences of optimal state-action pairs (roll-outs) together with the associated rewards: (s_1, a_1, r_1), (s_2, a_2, r_2), (s_3, a_3, r_3), (s_4, a_4, r_4), ..., (s_T, a_T, r_T).
     The agent then searches for the solution by generating new state-action pairs in a neighbourhood around the expert's demonstrations. These solutions are not necessarily optimal. The expert provides a reward for these new roll-outs as well. (A sketch of this perturbation idea follows below.)
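A hedged sketch of generating new roll-outs by perturbing the expert's actions in a small neighbourhood around the demonstration; the (state, action, reward) tuple format, numeric actions, and the noise scale are assumptions for illustration.

```python
import random

def perturb_demonstration(expert_rollout, noise=0.1):
    """expert_rollout: list of (state, action, reward) tuples with numeric actions."""
    new_rollout = []
    for (s, a, r) in expert_rollout:
        a_new = a + random.gauss(0.0, noise)   # explore close to the expert's action
        # In practice the perturbed action is executed and its reward is then
        # evaluated; here the reward is left as unknown.
        new_rollout.append((s, a_new, None))
    return new_rollout
```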

  8. ADVANCED MACHINE LEARNING: The Reward
     The reward shapes the learning; choosing it well is crucial for the success of the learning. Imagine that you want to train a robot to learn to walk.
     • What reward would you choose for training a robot to stand up?
     • What is the dimension of the state-action space?
     • How long would it take to learn through trial and error?
     • How does the reward help reduce this number?
     (Image: UC Berkeley, Darwin robot)

  9. ADVANCED MACHINE LEARNING: The Reward
     One could choose a more complex (more informative) reward:
     Reward = penalty for deviation of the center of mass from the equilibrium point + reward for cyclic motion of the left and right legs + reward for in-phase motion of the upper and lower leg, etc.
     This reduces the search over the state-action space by looking for phase relationships between the joints: the unconstrained search over joint angles becomes a constrained search over torso motion and relative leg motion. (A sketch of such a composite reward follows below.)
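A hedged sketch of one way such a composite reward could be written down. The state fields (center-of-mass position, leg phase angles) and the weights w1, w2, w3 are assumptions made for illustration, not values from the slides.

```python
import numpy as np

def composite_reward(com, com_target, phase_left, phase_right,
                     phase_upper, phase_lower, w1=1.0, w2=0.5, w3=0.5):
    # Penalty for deviation of the center of mass from the equilibrium point
    com_penalty = -w1 * np.linalg.norm(np.asarray(com) - np.asarray(com_target)) ** 2
    # Reward for cyclic (anti-phase) motion of the left and right legs
    cyclic_reward = w2 * np.cos(phase_left - phase_right - np.pi)
    # Reward for in-phase motion of the upper and lower leg
    in_phase_reward = w3 * np.cos(phase_upper - phase_lower)
    return com_penalty + cyclic_reward + in_phase_reward
```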

  10. ADVANCED MACHINE LEARNING: RL and Optimality
     Reinforcement learning with discrete state-action spaces and a finite horizon can be solved optimally. We will see next how this can be done. This is no longer true for generic continuous state-action spaces. However, the same principles can be extended to continuous worlds, albeit with a loss of the optimality guarantee. (Note that optimality can also be guaranteed in continuous state and action spaces, but only under additional assumptions, e.g. Gaussian noise and a linear control policy.)

  11. ADVANCED MACHINE LEARNING: RL – Discrete States
     (Grid-world figure showing the agent, a goal, a fire pit, and the reward.)
     The set of possible states of the world (environment + agent): 225 states in this example (not all shown).

  12. ADVANCED MACHINE LEARNING: RL – Discrete States
     (Grid-world figure showing the agent's set of possible actions.)
     A policy π(s, a) is used to choose an action a from any state s. RL learns an optimal policy π*.

  13. ADVANCED MACHINE LEARNING: RL – Discrete Actions
     (Grid-world figure illustrating a policy π(s, a).)
     Stochastic environment: transitions across states are not deterministic, p(s_{t+1} | s_t, a_t). Rewards may also be stochastic. Knowing p requires a model of the world; it can be learned while learning the policy. (A toy transition model is sketched below.)
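A toy sketch of a stochastic transition model p(s' | s, a) for a small grid world. The grid size, the four-action set, and the 0.8/0.1/0.1 slip probabilities are assumptions chosen for illustration, not values from the slides.

```python
ACTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}

def transition_probs(state, action, size=5):
    """Return {next_state: probability} for taking `action` in `state`."""
    def move(s, a):
        x, y = s
        dx, dy = ACTIONS[a]
        # Clamp to the grid: bumping into a wall leaves the state unchanged.
        return (min(max(x + dx, 0), size - 1), min(max(y + dy, 0), size - 1))

    # Intended move with prob. 0.8; slip to each perpendicular move with prob. 0.1.
    perpendicular = {"up": ("left", "right"), "down": ("left", "right"),
                     "left": ("up", "down"), "right": ("up", "down")}
    probs = {}
    for a, p in [(action, 0.8)] + [(a, 0.1) for a in perpendicular[action]]:
        s_next = move(state, a)
        probs[s_next] = probs.get(s_next, 0.0) + p
    return probs
```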

  14. ADVANCED MACHINE LEARNING: RL – The Effect of the Environment
     (Figure: two panels comparing a deterministic and a stochastic environment.)
     RL takes into account the stochasticity of the environment.

  15. ADVANCED MACHINE LEARNING: RL – The Effect of the Environment
     RL assumes that the world is first-order Markov: p(s_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, ..., s_0, a_0) = p(s_{t+1} | s_t, a_t).
     In words: the probability of a transition to a new state (and a new reward) depends only on the current state and action, not on the history of previous states and actions. If the state and action sets are finite, this is a finite MDP. This assumption drastically reduces computation: there is no need to propagate probabilities over the whole history.
     Adapted from R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction.

  16. ADVANCED MACHINE LEARNING: RL – The Policy
     At each time step the agent, being in a state s_t, chooses an action a_t by drawing from π(s, a):
     • if π(s, a) is equiprobable for all actions a, pick an action at random;
     • otherwise, pick the best action a = argmax_a π(s, a) (greedy policy).
     (Figure: example of a greedy policy ending up in a limit cycle when the policy is poor.)
     The agent must be able to measure how well it is doing and use this to update its policy. A good policy maximizes the expected reward. (A small sketch of this action-selection rule follows below.)
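A small sketch of the action-selection rule above, assuming the policy is stored as a dict mapping each state to a dict of action probabilities (an assumption for illustration).

```python
import random

def select_action(policy, s):
    prefs = policy[s]                              # {action: probability} in state s
    values = list(prefs.values())
    if max(values) - min(values) < 1e-12:          # equiprobable: pick at random
        return random.choice(list(prefs.keys()))
    return max(prefs, key=prefs.get)               # greedy: argmax_a pi(s, a)
```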

  17. ADVANCED MACHINE LEARNING: RL – Exercise I

  18. ADVANCED MACHINE LEARNING: RL – Value Function
     (Figure: reward and value function over the grid world.)
     The state-value function gives, for each state, an estimate of the expected reward starting from that state: V^π(s) = E_π[ Σ_{k=0}^∞ r_{t+k+1} | s_t = s ]. It depends on the agent's policy π. (A simple Monte Carlo estimate is sketched below.)
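A sketch of estimating V^π(s) by averaging returns over sampled episodes, i.e. a simple Monte Carlo estimate of the expectation above. The env and policy objects are hypothetical stand-ins; env.step is assumed to return (next state, reward, done).

```python
def estimate_value(env, policy, start_state, n_episodes=1000, max_steps=100):
    total = 0.0
    for _ in range(n_episodes):
        s = env.reset(start_state)
        episode_return = 0.0
        for t in range(max_steps):
            a = policy(s)
            s, r, done = env.step(a)
            episode_return += r          # undiscounted sum of rewards r_{t+k+1}
            if done:
                break
        total += episode_return
    return total / n_episodes            # empirical mean approximates V^pi(start_state)
```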

  19. ADVANCED MACHINE LEARNING: RL – Value Function
     (Figure: the value function, the corresponding policy, and the greedy policy over the grid world.)

  20. ADVANCED MACHINE LEARNING: RL – Value Function
     Discount future rewards: V^π(s) = E_π[ Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t = s ], with 0 ≤ γ ≤ 1.
     γ close to 0: shortsighted; γ close to 1: farsighted.
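A short sketch of computing the discounted return for one sequence of rewards, i.e. the quantity inside the expectation above; the reward sequence is an arbitrary example.

```python
def discounted_return(rewards, gamma=0.9):
    """Compute sum_k gamma^k * r_{t+k+1} for a finite reward sequence."""
    ret = 0.0
    for k, r in enumerate(rewards):
        ret += (gamma ** k) * r
    return ret

rewards = [0, 0, 0, 1]                          # reward only at the end
print(discounted_return(rewards, gamma=0.1))    # ~0.001: shortsighted
print(discounted_return(rewards, gamma=0.99))   # ~0.970: farsighted
```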

  21. ADVANCED MACHINE LEARNING: RL – Markov Decision Process (MDP)
     (Grid-world MDP figure with the agent, the goal, and the fire pit.)
     How do we find the best possible policy? Find the optimal value function: it gives the optimal policy. (A sketch of extracting a greedy policy from a value function follows below.)
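A hedged sketch of how a value function yields a policy: act greedily with respect to V by picking, in each state, the action with the best expected one-step backup. The nested-dict structures P[s][a] = {s': prob} and R[s][a][s'] = reward are assumptions for illustration.

```python
def greedy_policy_from_value(states, actions, P, R, V, gamma=0.9):
    """Return the policy that is greedy with respect to the value function V."""
    policy = {}
    for s in states:
        policy[s] = max(actions,
                        key=lambda a: sum(p * (R[s][a][s2] + gamma * V[s2])
                                          for s2, p in P[s][a].items()))
    return policy
```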

  22. ADVANCED MACHINE LEARNING: RL – How to Find an Optimal Policy?
     Find the value function. The policy is needed to compute the expectation. Exploit a recursive property: the Bellman equation.

  23. ADVANCED MACHINE LEARNING: RL – Bellman Equation
     The Bellman equation is a recursive equation describing MDPs:
     R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + γ³ r_{t+4} + ... = r_{t+1} + γ (r_{t+2} + γ r_{t+3} + γ² r_{t+4} + ...) = r_{t+1} + γ R_{t+1}
     So: V^π(s) = E_π[ R_t | s_t = s ] = E_π[ r_{t+1} + γ V^π(s_{t+1}) | s_t = s ]  (Bellman equation).
     Or, without the expectation operator (assuming an MDP):
     V^π(s) = Σ_a π(s, a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V^π(s') ]
     Adapted from R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction.

  24. ADVANCED MACHINE LEARNING: RL – Bellman Policy Evaluation
     (Backup diagram: from state s, the policy π(s, a) selects an action a; the transition probabilities P^a_{ss'} and rewards R^a_{ss'} lead to successor states s' with values V^π(s').)
     V^π(s) = Σ_a π(s, a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V^π(s') ]
     Adapted from R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. (An iterative policy-evaluation sketch follows below.)
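A hedged sketch of iterative policy evaluation using the Bellman equation above: V(s) is repeatedly replaced by Σ_a π(s, a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V(s') ] until the values stop changing. The nested-dict data structures are assumptions for illustration.

```python
def policy_evaluation(states, actions, P, R, pi, gamma=0.9, tol=1e-6):
    """P[s][a] = {s': prob}, R[s][a][s'] = reward, pi[s] = {a: prob}."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = sum(pi[s][a] * sum(p * (R[s][a][s2] + gamma * V[s2])
                                       for s2, p in P[s][a].items())
                        for a in actions)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:                  # converged to V^pi
            return V
```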
