Simultaneous Acquisition of Task and Feedback Models
  1. Simultaneous Acquisition of Task and Feedback Models
     Manuel Lopes, Thomas Cederborg and Pierre-Yves Oudeyer
     INRIA Bordeaux Sud-Ouest
     manuel.lopes@inria.fr | flowers.inria.fr/mlopes

  2. Outline
     • Interactive Learning
       – Ambiguous Protocols
       – Ambiguous Signals
       – Active Learning

  3. Learning from Demonstration
     Pros
     • Natural/intuitive (is it?)
     • Facilitates social acceptance
     Cons
     • Requires an expert with knowledge about the task and the learning system
     • Long and costly demonstrations
     • No feedback on the learning process (in most methods)

  4. What is the best strategy to learn/teach?
     Consider teaching someone how to play tennis. Information provided:
     • Rules of the game: R(x)
     • Strategies or verbal instructions on how to behave: V(x) > V(y)
     • Demonstrations (of a particular hit): π(x) = a

  5. How to improve learning from demonstration?
     • Combine:
       – demonstrations to initialize
       – self-experimentation to correct modeling errors
     • Feedback corrections
     • Instructions
     • More data
     • …

  6. How to improve learning/teaching?
     Learner
     – Active Learning
     – Combine with Self-Experimentation
     Teacher
     – Better Strategies
     – Extra Cues

  7. How are demonstrations provided?
     • Remote control (direct control)
       – Exoskeleton, joystick, Wiimote, …
     • Unobtrusive
       – Acquired with vision or 3D cameras from someone's execution
     • Remote instruction (indirect control)
       – Verbal commands, gestures, …

  8. Behavior of Humans
     • People want to direct the agent's attention to guide exploration.
     • People have a positive bias in their rewarding behavior, suggesting both instrumental and motivational intents in their communication channel.
     • People adapt their teaching strategy as they develop a mental model of how the agent learns.
     • People are not optimal, even when they try to be.
     (Cakmak, Thomaz)

  9. Interactive Learning Approaches
     Active Learner
     • Decide what to ask (Lopes, Cohn, Judah)
     • Ask when uncertain / at risk (Chernova, Roy, …)
     • Decide when to ask (Cakmak)
     • …
     Improved Teacher
     • Dogged Learning (Grollman)
     • User Preferences (Mason)
     • Extra Cues (Thomaz, Knox, Judah)
     • User Queries the Learner (Cakmak)
     • Tactile Guidance (Billard)
     • …

  10. Learning under a weakly specified protocol
     • People do not follow protocols rigidly.
     • Some of the provided cues depart from their mathematical meaning, e.g. extra utterances, gestures, guidance, motivation.
     • Can we exploit those extra cues?
     • If robots adapt to the user, will training be easier?

  11. Different Feedback Structures
     User can provide direct feedback:
     • Reward
       – Quantitative evaluation
     • Corrections
       – Yes/No classifications of behavior
     • Actions
     User can provide extra signals:
     • Reward of exploratory actions
     • Reward for getting closer to the target

  12. Unknown/Ambiguous Feedback
     Unknown feedback signals:
     • Gestures
     • Prosody
     • Word synonyms
     • …

  13. Goal / Contribution
     Learn simultaneously:
     – Task: the reward function
     – Interaction protocol: what information the user is providing
     – Meaning of extra signals: what novel signals mean, e.g. prosody, unknown words, …
     Simultaneous Acquisition of Task and Feedback Models, Manuel Lopes, Thomas Cederborg and Pierre-Yves Oudeyer, ICDL, 2011.

  14. Markov Decision Process
     Sets of possible states of the world and actions:
     X = {1, ..., |X|},  A = {1, ..., |A|}
     • The state evolves according to P[X_{t+1} = y | X_t = x, A_t = a] = P_a(x, y)
     • The reward r defines the task of the agent
     • A policy defines how to choose actions: P[A_t = a | X_t = x] = π(x, a)
     • Determine the policy that maximizes the total (expected) reward:
       V(x) = E_π[ Σ_t γ^t r_t | X_0 = x ]
     • The optimal policy can be computed using DP:
       V*(x) = r(x) + γ max_a E_a[V*(y)]
       Q*(x, a) = r(x) + γ E_a[V*(y)]
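As a concrete illustration of the DP equations on this slide, here is a minimal value-iteration sketch in NumPy; the transition tensor P, the reward r, and the small random 2-action, 5-state MDP at the end are illustrative placeholders, not the setup used in the talk.

```python
import numpy as np

def value_iteration(P, r, gamma=0.95, tol=1e-6):
    """P: (|A|, |X|, |X|) transition probabilities P_a(x, y); r: (|X|,) reward."""
    V = np.zeros(P.shape[1])
    while True:
        # Q(x, a) = r(x) + gamma * E_a[V(y)]  (each row of P @ V is E_a[V(y)])
        Q = r[None, :] + gamma * (P @ V)        # shape (|A|, |X|)
        V_new = Q.max(axis=0)                   # V(x) = max_a Q(x, a)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q
        V = V_new

# Illustrative 2-action, 5-state MDP with the last state as goal.
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(5), size=(2, 5))      # each row P_a(x, .) sums to 1
r = np.array([0.0, 0.0, 0.0, 0.0, 1.0])
V_star, Q_star = value_iteration(P, r)
pi_star = Q_star.argmax(axis=0)                 # greedy optimal policy
```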

  15. Inverse Reinforcement Learning
     The goal of the task is unknown.
     • RL: from the world model and the reward, find the optimal policy.
     • IRL: from samples of the policy and the world model, estimate the reward.
     Ng et al., ICML 2000; Abbeel et al., ICML 2004; Neu et al., UAI 2007; Ramachandran et al., IJCAI 2007; Lopes et al., IROS 2007

  16. Probabilistic View of IRL
     • Prior distribution P[r]
     • Suppose now that the agent is given a demonstration:
       D = {(x_1, a_1), ..., (x_n, a_n)}
     • The teacher is not perfect (sometimes makes mistakes):
       π'(x, a) = e^{η Q*(x, a)} / Σ_b e^{η Q*(x, b)}
     • Likelihood of the observed demo: L(D) = Π_i π'(x_i, a_i)
     • Posterior over rewards: P[r | D] ∝ P[r] P[D | r]
     • MC-based methods to sample P[r | D]
     (Ramachandran)
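A small sketch of this noisy-teacher model, assuming a Q matrix of shape (|A|, |X|) and a hypothetical three-step demonstration; it computes the Boltzmann policy π' and the log-likelihood of D under it.

```python
import numpy as np

def boltzmann_policy(Q, eta=5.0):
    """pi'(x, a) = exp(eta * Q(x, a)) / sum_b exp(eta * Q(x, b)); Q has shape (|A|, |X|)."""
    z = eta * Q
    z = z - z.max(axis=0, keepdims=True)       # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=0, keepdims=True)

def demo_log_likelihood(Q, demo, eta=5.0):
    """demo: iterable of (state, action) pairs; returns log L(D) = sum_i log pi'(x_i, a_i)."""
    pi = boltzmann_policy(Q, eta)
    return float(sum(np.log(pi[a, x]) for x, a in demo))

# Hypothetical Q-values for 2 actions in 3 states, and a 3-step demonstration.
Q = np.array([[0.2, 0.9, 0.1],
              [0.8, 0.1, 0.7]])
D = [(0, 1), (1, 0), (2, 1)]
print(demo_log_likelihood(Q, D))
```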

  17. Bayesian Inverse Reinforcement Learning

  18. Gradient-Based IRL
     • Idea: compute the maximum-likelihood estimate for r given the demonstration D.
     • We use a gradient ascent algorithm:
       r_{t+1} = r_t + ∇_r L(D)
     • Upon convergence, the obtained reward maximizes the likelihood of the demonstration.
     Policy Loss (Neu et al.), Maximum Likelihood (Lopes et al.)
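A hedged sketch of the update above. The log_likelihood callable (e.g. the demo log-likelihood of slide 16, recomputed after re-solving the MDP for each candidate r), the step size alpha, and the finite-difference gradient are assumptions added for illustration; the cited works derive this gradient analytically.

```python
import numpy as np

def irl_gradient_ascent(log_likelihood, r0, alpha=0.1, steps=100, eps=1e-4):
    """Maximize log_likelihood(r) of the demonstration by gradient ascent on r."""
    r = np.array(r0, dtype=float)
    for _ in range(steps):
        grad = np.zeros_like(r)
        for i in range(r.size):                 # numerical gradient of L(D) w.r.t. r
            dr = np.zeros_like(r)
            dr[i] = eps
            grad[i] = (log_likelihood(r + dr) - log_likelihood(r - dr)) / (2 * eps)
        r = r + alpha * grad                    # r_{t+1} = r_t + alpha * grad_r L(D)
    return r
```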

  19. The Selection Criterion
     • The distribution P[r | D] induces a distribution on Π.
     • Use MC to approximate P[r | D].
     • For each (x, a), P[r | D] induces a distribution on π(x, a):
       µ_xa(p) = P[π(x, a) = p | D]
     • Compute the entropy H(µ_xa).
     • Compute the per-state average entropy: H(x) = 1/|A| Σ_a H(µ_xa)
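A sketch of this criterion, assuming K policies π_k have already been sampled from P[π | D] by MC: each µ_xa is approximated with a histogram (the bin count is an arbitrary choice) and the per-state average entropy H(x) is returned.

```python
import numpy as np

def per_state_entropy(policy_samples, bins=10):
    """policy_samples: (K, |X|, |A|) array with policy_samples[k, x, a] = pi_k(x, a)."""
    K, n_states, n_actions = policy_samples.shape
    H = np.zeros(n_states)
    for x in range(n_states):
        for a in range(n_actions):
            counts, _ = np.histogram(policy_samples[:, x, a], bins=bins, range=(0.0, 1.0))
            p = counts / K
            p = p[p > 0]
            H[x] += -(p * np.log(p)).sum()      # entropy of the histogram estimate of mu_xa
    return H / n_actions                         # H(x) = (1/|A|) * sum_a H(mu_xa)

# The query would then be x_star = per_state_entropy(samples).argmax()
```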

  20. Active IRL
     Require: Initial demonstration D
     1. Estimate P[π | D] using MC (maybe only around the ML estimate)
     2. For all x ∈ X: compute H(x)
     3. Query the action for x* = argmax_x H(x)
     4. Add the new sample to D
     Active Learning for Reward Estimation in Inverse Reinforcement Learning, Manuel Lopes, Francisco Melo and Luis Montesano, ECML/PKDD, 2009.
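The loop on this slide written out as a sketch; sample_posterior_policies, per_state_entropy, and ask_teacher are hypothetical helpers standing in for the MC posterior estimate, the criterion of the previous slide, and the human teacher.

```python
def active_irl(D, n_queries, sample_posterior_policies, per_state_entropy, ask_teacher):
    """D: list of (state, action) pairs from the initial demonstration."""
    for _ in range(n_queries):
        samples = sample_posterior_policies(D)   # 1. estimate P[pi | D] with MC
        H = per_state_entropy(samples)           # 2. compute H(x) for all states
        x_star = int(H.argmax())                 # 3. most ambiguous state
        a_star = ask_teacher(x_star)             #    query the teacher's action there
        D = D + [(x_star, a_star)]               # 4. add the new sample to D
    return D
```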

  21. Results
     • General grid world (M × M grid), >200 states
     • Four actions available (N, S, E, W)
     • Parameterized reward (goal state)

  22. Active IRL, sample trajectories
     Require: Initial demonstration D
     1. Estimate P[π | D] using MC
     2. For all x ∈ X: compute H(x)
     3. Solve the MDP with R = H(x)
     4. Query a trajectory following the resulting optimal policy
     5. Add the new trajectory to D
     [Plot: learning curves comparing random and active trajectory queries]
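The trajectory variant differs from the state-query version only in the last two steps; a sketch under the same assumptions, with solve_mdp and ask_trajectory as additional hypothetical helpers:

```python
def active_irl_trajectories(D, n_queries, sample_posterior_policies,
                            per_state_entropy, solve_mdp, ask_trajectory):
    """Same loop as above, but queries whole trajectories instead of single state-actions."""
    for _ in range(n_queries):
        samples = sample_posterior_policies(D)       # estimate P[pi | D] with MC
        H = per_state_entropy(samples)               # per-state uncertainty H(x)
        explore_policy = solve_mdp(reward=H)         # solve the MDP with R = H(x)
        D = D + ask_trajectory(explore_policy)       # teacher demonstrates along that policy
    return D
```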

  23. Unknown/Ambiguous Feedback
     Unknown feedback protocol: the information provided by the demonstration does not have a predefined semantics.
     Meanings of the user signals:
     • Binary reward
     • Action

  24. Feedback Profiles
     [Figure: feedback profiles for demonstration, binary reward, and ambiguous feedback]

  25. Combination of Profiles

  26. Acquisition of Task and Feedback Model

  27. Unknown/Ambiguous Feedback
     Unknown feedback signals:
     • Gestures
     • Prosody
     • Word synonyms
     • …

  28. Feedback: meaning of user signals
     The user might use different words to provide feedback:
     • Ok, correct, good, nice, …
     • Wrong, error, no, …
     • Up, Go, Forward
     An intuitive interface should allow the interaction to be as free as possible.
     Even if the user does not follow a strict vocabulary, can the robot still make use of such extra signals?
     Learn the meaning of new vocabulary.

  29. [Table: example interaction log with columns Init State, Action, Next State, Feedback, F1 (_/A), F2 (A/_); states TR, TO, TT, RT, OT; actions Grasp1, Grasp2, RelOnObj; feedback entries are +/- patterns plus the unknown word "AgarraVer"]
     Assuming (F1, OT), "AgarraVer" means Grasp1.
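One way to read this slide as code (a simplified sketch, not the paper's exact inference): score each candidate meaning of an unknown word such as "AgarraVer" by how well it explains the logged interactions under the current task estimate, then normalize into a posterior over meanings. The candidate labels and the likelihood callable are assumptions.

```python
import numpy as np

def meaning_posterior(log, candidate_meanings, likelihood, prior=None):
    """
    log: interaction records in which the unknown word occurred
    candidate_meanings: e.g. ["Grasp1", "Grasp2"] (hypothetical labels)
    likelihood(record, meaning): P(record | meaning, current task estimate)
    """
    if prior is None:
        prior = np.full(len(candidate_meanings), 1.0 / len(candidate_meanings))
    scores = np.array([
        prior[i] * np.prod([likelihood(rec, m) for rec in log])
        for i, m in enumerate(candidate_meanings)
    ])
    return scores / scores.sum()    # posterior over candidate meanings
```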

  30. Scenario
     • Actions: Up, Down, Left, Right, Pick, Release
     • The task consists in finding what object to pick and where to take it.
     • The robot tries an action (possibly none); the user provides feedback.
     • 8 known symbols, 8 unknown ones.
     • The robot must learn the task goal, how the user provides feedback, and some unknown signs.
