

  1. Skill discovery from unstructured demonstrations
     Pravesh Ranchod, School of Computer Science, University of the Witwatersrand
     pravesh.ranchod@wits.ac.za

  2. Initial objective
     ● We want agents that can feasibly learn to do things autonomously
     ● Minimize the burden on an expert
       – Specify what, not how

  3. Reinforcement Learning
     ● Reinforcement Learning
       – Learn behaviour from experience
       – MDP = (S, A, T, R)
       – [Diagram: states s1, s2, s3 linked by actions a1, a2 through the transition function T, with a reward received at each step]
       – Take actions that maximise long-term reward (written out below)
       – Expert burden is reduced to specifying a reward function
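The long-term-reward objective on this slide can be stated compactly in standard notation (the discount factor gamma is my addition, not something written on the slide): find the policy

    \pi^{*} = \arg\max_{\pi}\; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t)\right],
    \qquad s_{t+1} \sim T(\cdot \mid s_t, a_t),\quad a_t \sim \pi(\cdot \mid s_t).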

  4. Reinforcement Learning
     ● Reinforcement Learning process
       – We specify the transition dynamics and reward function and get a policy
       – [Diagram: system dynamics + reward function → reinforcement learning algorithm → policy]

  5. Reinforcement Learning
     ● SARSA / Q-Learning
       – Observe the state, take an action, receive a reward, observe the new state
       – Keep track of the value of an action in a particular state
       – Estimate the value of a state as the immediate reward received plus the value of the new state
       – Update estimates by moving the estimate in the direction of the observation (sketched below)
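A minimal tabular SARSA sketch of the update rule described on this slide. The environment interface (env.reset, env.step, env.actions) and the constants alpha, gamma, epsilon are illustrative assumptions rather than code from the talk.

import random
from collections import defaultdict

def sarsa_episode(env, Q=None, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Run one episode, nudging Q(s, a) toward each observed target."""
    Q = Q if Q is not None else defaultdict(float)

    def choose(state):
        # epsilon-greedy selection over the environment's action set
        if random.random() < epsilon:
            return random.choice(env.actions)
        return max(env.actions, key=lambda a: Q[(state, a)])

    state = env.reset()
    action = choose(state)
    done = False
    while not done:
        next_state, reward, done = env.step(action)
        next_action = choose(next_state)
        # target = immediate reward + discounted value of the new state-action
        target = reward + (0.0 if done else gamma * Q[(next_state, next_action)])
        # move the estimate a step (alpha) in the direction of the observation
        Q[(state, action)] += alpha * (target - Q[(state, action)])
        state, action = next_state, next_action
    return Q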

  6. Skills
     ● Problem: too many states and actions
       – Actions could be too low level (e.g. robot walking)
     ● Potential solution: use the options framework to introduce high-level actions (see the sketch after this slide)
       – Each option is an RL task of its own
       – We can then invoke an entire option as an action
       – Analogous to skills
       – Requires the expert to specify MANY RL tasks, hence many reward functions
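A minimal sketch of what an option looks like in the options framework referenced above: an initiation set, an internal policy, and a termination condition, invoked as a single high-level action. The class layout and environment interface are my illustration, not the presenter's code.

from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Option:
    can_start: Callable[[Any], bool]    # initiation set: where the skill may begin
    policy: Callable[[Any], Any]        # the option's own policy: state -> action
    term_prob: Callable[[Any], float]   # termination probability beta(state)

def run_option(env, option, state, rng):
    """Invoke an entire option as if it were a single high-level action."""
    total_reward, done = 0.0, False
    while not done and rng.random() >= option.term_prob(state):
        state, reward, done = env.step(option.policy(state))
        total_reward += reward
    return state, total_reward, done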

  7. Updated objective
     ● We want agents that can feasibly learn to do things autonomously
     ● Minimize the burden on an expert when many tasks are to be learned
       – Specify what, not how
       – Demonstrate what, not how

  8. Inverse Reinforcement Learning
     ● Reinforcement learning can produce action selections (a policy) from a reward function
     ● Inverse reinforcement learning produces a reward function by observing action selections
     ● Iteratively proposes and evaluates reward functions, attempting to match the expert observations (sketch below)
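A generic sketch of the propose-and-evaluate loop just described. The helpers propose_reward, solve_mdp and behaviour_mismatch are hypothetical placeholders for whatever concrete IRL machinery is used.

def inverse_rl(expert_trajectories, dynamics, propose_reward, solve_mdp,
               behaviour_mismatch, n_iters=100):
    """Return the candidate reward whose optimal policy best matches the expert."""
    best_reward, best_score = None, float("inf")
    for _ in range(n_iters):
        reward_fn = propose_reward()                 # propose a candidate reward
        policy = solve_mdp(dynamics, reward_fn)      # RL step: reward -> policy
        score = behaviour_mismatch(policy, expert_trajectories)
        if score < best_score:                       # keep the closest match so far
            best_reward, best_score = reward_fn, score
    return best_reward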

  9. Inverse Reinforcement Learning
     ● Inverse reinforcement learning process
       – We provide trajectories and dynamics and get a reward function (which, if optimized, would match the expert behaviour)
       – [Diagram: system dynamics + expert behaviour → inverse reinforcement learning algorithm → reward function]

  10. Inverse Reinforcement Learning
      ● Well, how pointless was that?
        – Surprisingly pointful
        – Captures the goal of the demonstrator rather than just the actions
        – Allows action selection in situations the expert did not encounter
        – Allows robustness to changing environments and capabilities

  11. Learning from demonstration
      ● Must provide many demonstrations to learn many reward functions for many small tasks (options)
        – The demonstrator could demonstrate small tasks repetitively (annoying and time consuming)
        – Annotations could be provided indicating when each task begins and ends (still annoying, and difficult)

  12. Objective
      ● We want agents that can feasibly learn to do things autonomously
      ● Minimize the burden on an expert when many tasks are to be learned
        – Specify what, not how
        – Demonstrate what, not how
        – Unstructured demonstrations

  13. NPBRS
      ● We introduce a technique called Nonparametric Bayesian Reward Segmentation
        – Takes unstructured demonstrations and produces many reward functions along with the policies that optimise them
        – Does this by segmenting trajectories into more likely pieces
        – [Diagram: a full unsegmented trajectory split into segments labelled with skills A, B, C, A]

  14. Segmentation
      ● What information do we have to segment on?
      ● Reward-based segmentation
        – Performs IRL on each segment
        – Evaluates the quality of the IRL
        – A bad segmentation will lead to bad IRL
        – [Diagram: fitting one reward function across the whole trajectory is lousy; fitting three reward functions to the segments A, B, C, A is great]

  15. Our model
      ● Assume separate skill sets per trajectory, generated from a Beta process
        – Allows for an infinitely sized skill set
        – Encourages shared skills across trajectories
        – Allows skill dynamics to change depending on the skill set
      ● Within each skill set, model the skill transition dynamics as a sticky Hidden Markov Model
      ● The skill sequence is drawn from the skill transition distribution
      ● Within each skill, the observations are generated from a skill-specific MDP, where every skill shares transition dynamics but has a specific reward function (simplified sketch below)
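A heavily simplified sketch of the generative story on this slide. The Beta process draw is approximated here by independent coin flips over a finite skill pool, and "stickiness" by a self-transition bonus kappa; both simplifications, and all parameter values, are my assumptions for illustration only.

import numpy as np

def sample_skill_sequence(n_skills=10, T=200, kappa=5.0, seed=0):
    rng = np.random.default_rng(seed)
    # Per-trajectory skill set (finite stand-in for a Beta process feature draw)
    active = np.flatnonzero(rng.random(n_skills) < 0.5)
    if active.size == 0:
        active = np.array([0])
    k = active.size
    # Sticky HMM transition matrix over the active skills:
    # Dirichlet rows plus a self-transition bonus, then renormalise
    trans = rng.dirichlet(np.ones(k), size=k) + kappa * np.eye(k)
    trans /= trans.sum(axis=1, keepdims=True)
    # Draw the skill sequence from the skill transition distribution
    z = [rng.integers(k)]
    for _ in range(T - 1):
        z.append(rng.choice(k, p=trans[z[-1]]))
    # Observations would then be generated by executing, at each step, the
    # optimal policy of the MDP whose reward function belongs to skill z_t
    return active[np.array(z)]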

  16. Our model
      ● Perform inference on this model using a Markov chain Monte Carlo sampler
        – Sample based on the model likelihood, i.e. the probability of the data given the model
        – The observation log likelihood is the sum of the log likelihood of each transition
        – The likelihood of each transition is the probability of that action selection under the optimal policy for the reward function generated from IRL on all segments assigned to that skill (sketch below)
        – [Diagram: trajectory segmented into skills A, B, C, A]
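A sketch of the likelihood computation described above: pool the segments assigned to each skill, run IRL on the pooled data, and score every observed transition under the resulting (stochastic) optimal policy. The irl and solve_mdp helpers and the segment format are assumptions.

import math
from collections import defaultdict

def segmentation_log_likelihood(segments, dynamics, irl, solve_mdp):
    """segments: list of (skill_id, [(state, action), ...]) pairs."""
    # Pool all segments assigned to the same skill before running IRL
    by_skill = defaultdict(list)
    for skill_id, transitions in segments:
        by_skill[skill_id].extend(transitions)

    log_lik = 0.0
    for skill_id, transitions in by_skill.items():
        reward_fn = irl(transitions, dynamics)      # reward from this skill's pooled data
        policy = solve_mdp(dynamics, reward_fn)     # pi(a | s), e.g. softmax-optimal
        for state, action in transitions:
            # likelihood of each transition = probability of the chosen action
            log_lik += math.log(policy(state, action))
    return log_lik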

  17. Does it work?
      ● In the car domain
        – Skill A: hit every other car
        – Skill B: stay in the left lane but switch to avoid collisions
        – Skill C: stay in the right lane but switch to avoid collisions
      ● Data generated by randomly switching between policies with probability 0.01 (sketch below)
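A short sketch of the data-generation scheme on this slide: run the car domain while switching to a randomly chosen policy with probability 0.01 at every step, recording no segment labels. The environment and the three skill policies are assumed to exist and are not defined here.

import random

def generate_demonstration(env, policies, switch_prob=0.01, horizon=1000):
    trajectory, state = [], env.reset()
    current = random.choice(policies)
    for _ in range(horizon):
        if random.random() < switch_prob:
            current = random.choice(policies)   # unannounced skill switch
        action = current(state)
        trajectory.append((state, action))      # states and actions only, no labels
        state, _, done = env.step(action)
        if done:
            break
    return trajectory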

  18. Does it work?
