  1. Structured Representations for Knowledge Transfer in Reinforcement Learning • Benjamin Rosman • Mobile Intelligent Autonomous Systems, Council for Scientific and Industrial Research & School of Computer Science and Applied Maths, University of the Witwatersrand, South Africa

  2. Robots solving complex tasks • Large, high-dimensional action and state spaces • Many different task instances

  3. Behaviour learning • Reinforcement learning (RL) • [Diagram: agent-environment loop with state s, action a, reward r]

  4. Markov decision process (MDP) • M = ⟨S, A, T, R⟩ • Learn optimal policy: π* : S → A • [Diagram: example MDP with states s0, s1, s2, actions a0, a1, and transition probabilities/rewards]

  5. Looking into the future • Can't just rely on immediate rewards • Define value functions: • V^π(s) = E_π[ R_t | s_t = s ] • Q^π(s, a) = E_π[ R_t | s_t = s, a_t = a ] • V* (Q*) is a proxy for π*

  6. Value functions example • [Figure: value function under a random policy vs the optimal value function]

  7. RL algorithms • So: solve a large system of nonlinear value function equations (the Bellman equations) • An optimal control problem • But: the transitions P and rewards R aren't known! • RL is trial-and-error learning of an optimal policy from experience • Exploration vs exploitation
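For reference, one standard form of the Bellman optimality equations referred to here, written with the reward function R and transition probabilities P above and the discount factor γ used later in the Q-learning update (the explicit equations are added here, not taken verbatim from the slides):

    V^{*}(s) = \max_{a}\Big[ R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V^{*}(s') \Big]
    Q^{*}(s,a) = R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, \max_{a'} Q^{*}(s',a')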

  8. Exploring • [Figure: gridworld exploration example]

  9. Learned value function • [Figure: the learned value function]

  10. An algorithm: Q-learning
  • Initialise Q(s, a) arbitrarily
  • Repeat (for each episode):
    • Initialise s
    • Repeat (for each step of the episode):
      1. Choose a from s (ε-greedy policy from Q): a ← argmax_a Q(s, a) with probability 1 − ε (exploit), a random action with probability ε (explore)
      2. Take action a, observe r, s'
      3. Update the estimate of Q (learn): Q(s, a) ← Q(s, a) + α [ r + γ max_{a'} Q(s', a') − Q(s, a) ], where r is the immediate reward and γ max_{a'} Q(s', a') is the estimated future reward
      • s ← s'
    • Until s is terminal
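A minimal tabular sketch of this loop in Python. The environment interface (reset(), step(a) returning the next state, reward and done flag, and a finite env.actions list) is an assumption for illustration, not part of the talk:

    import random
    from collections import defaultdict

    def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
        """Tabular Q-learning with an epsilon-greedy behaviour policy."""
        Q = defaultdict(float)                         # Q[(s, a)] -> value estimate

        for _ in range(episodes):
            s = env.reset()
            done = False
            while not done:
                # epsilon-greedy: explore with probability epsilon, else exploit
                if random.random() < epsilon:
                    a = random.choice(env.actions)
                else:
                    a = max(env.actions, key=lambda act: Q[(s, act)])
                s_next, r, done = env.step(a)
                # TD target: immediate reward + discounted estimate of future value
                best_next = 0.0 if done else max(Q[(s_next, act)] for act in env.actions)
                Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
                s = s_next
        return Q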

  11. Solving tasks

  12. Generalising solutions? • How does this help us solve other problems?

  13. Hierarchical RL • Sub-behaviours: options o = ⟨I_o, π_o, β_o⟩ • A policy plus initiation and termination conditions: π_o : S → A, β_o : S → [0, 1], I_o ⊆ S • Abstract away low-level actions • Does not affect the state space
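A sketch of how such an option could be represented in code; the class and field names are illustrative, not from the talk, and the same env.step interface as above is assumed:

    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class Option:
        """An option o = <I_o, pi_o, beta_o> from the options framework."""
        initiation: Callable[[object], bool]    # I_o: may the option start in state s?
        policy: Callable[[object], object]      # pi_o : S -> A, the option's internal policy
        termination: Callable[[object], float]  # beta_o : S -> [0, 1], probability of stopping

    def run_option(env, s, option, rng):
        """Execute an option as a temporally extended action until it terminates."""
        assert option.initiation(s), "option not applicable in this state"
        done = False
        while not done:
            s, _, done = env.step(option.policy(s))
            if rng.random() < option.termination(s):
                break
        return s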

  14. Abstracting states • Aim: learn an abstract representation of the environment • Use with task-level planners • Based on agent behaviours (skills / options) • General: they don't need to be relearned for every new task • Steven James (in collaboration with George Konidaris) • S. James, B. Rosman, G. Konidaris. Learning to Plan with Portable Symbols. ICML/IJCAI/AAMAS 2018 Workshop on Planning and Learning, July 2018. • S. James, B. Rosman, G. Konidaris. Learning Portable Abstract Representations for High-Level Planning. Under review.

  15. Requirements: planning with skills • Learn the preconditions • A classification problem: P(can execute skill? | current_state) ("SYMBOLS") • Learn the effects • Density estimation: P(next_state | current_state, skill) • Possible if the options are subgoal options, i.e. P(next_state | current_state, skill) = P(next_state | skill)

  16. Subgoal options • P(next_state | current_state, skill) = P(next_state | skill) • Partition skills to ensure the property holds • e.g. "walk to nearest door"

  17. Generating symbols from skills [Konidaris, 2018] • Results in an abstract MDP / propositional PPDDL • But P(s ∈ I_o) and P(s' | o) are distributions/symbols over a state space particular to the current task • e.g. grounded in a specific set of xy-coordinates

  18. Towards portability • Need a representation that facilitates transfer • Assume the agent has sensors which provide it with (lossy) observations • Augment the state space with action-centric observations • Agent space • e.g. a robot navigating a building • State space: xy-coordinates • Agent space: video camera

  19. Portable symbols • Learning symbols in agent space • Portable! • But: non-Markov and insufficient for planning • Add the subgoal partition labels to the rules • General abstract symbols + grounding → portable rules

  20. Grounding symbols • Learn abstract symbols • Learn linking functions: mapping partition numbers from options to their effects • This gives us a factored MDP or a PPDDL representation • Provably sufficient for planning

  21. Learning grounded symbols • [Figure: symbols learned using agent-space data vs using state-space data]

  22. The treasure game

  23. Agent and problem space • State space: xy-position of the agent, key and treasure, angle of the levers and state of the lock • Agent space: the 9 adjacent cells about the agent
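To make the contrast concrete, a toy sketch of the two observation types; the field names and values are purely illustrative:

    # Problem/state space: global, task-specific coordinates and object states
    state_obs = {
        "agent_xy": (3, 4),
        "key_xy": (7, 1),
        "treasure_xy": (9, 6),
        "lever_angles": [0.0, 1.57],
        "lock_open": False,
    }

    # Agent space: a local egocentric window of cells around the agent,
    # which looks the same across different task instances
    agent_obs = [
        ["wall",  "empty", "empty"],
        ["empty", "agent", "ladder"],
        ["empty", "empty", "empty"],
    ]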

  24. Skills • Options: GoLeft, GoRight • JumpLeft, JumpRight • DownRight, DownLeft • Interact • ClimbLadder, DescendLadder

  25. Learning portable rules • Cluster to create subgoal agent-space options • Use an SVM and KDE to estimate preconditions and effects • Learned rules can be transferred between tasks • [Figure: the learned Interact1 and DescendLadder rules]
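A rough sketch of that estimation step using scikit-learn's SVC (precondition classifier) and KernelDensity (effect distribution); the data arrays, shapes and bandwidth here are assumptions for illustration, not details from the talk:

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.neighbors import KernelDensity

    def learn_rule(start_obs, executed_ok, effect_obs):
        """Estimate one portable rule for a (partitioned) subgoal option.

        start_obs   : agent-space observations where the option was attempted
        executed_ok : 1 if the option could run from that observation, else 0
        effect_obs  : agent-space observations seen when the option terminated
        """
        # Precondition: classify the observations from which the option can run
        precondition = SVC(kernel="rbf", probability=True)
        precondition.fit(start_obs, executed_ok)

        # Effect: a density over the agent-space observations after execution
        effect = KernelDensity(kernel="gaussian", bandwidth=0.1)
        effect.fit(effect_obs)
        return precondition, effect

    # Usage with made-up data: 9-dimensional agent-space observations
    pre, eff = learn_rule(np.random.rand(50, 9),
                          np.random.randint(0, 2, 50),
                          np.random.rand(20, 9))
    print(pre.predict(np.random.rand(1, 9)))     # can the skill run here?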

  26. Grounding rules • Partition the options in state space to get partition numbers • Learn grounded rule instances (linking) • [Figure: combining agent-space rules with state-space partitions]

  27. Partitioned rules • [Figure: precondition, negative effect and positive effect for the Interact1 and Interact3 rule instances]

  28. Experiments • Require fewer samples in subsequent tasks

  29. Portable rules • Learn abstract rules and their groundings • Transfer between domain instances just by learning linking functions • But what if there is additional structure? • In particular, what if there are many rule instances (objects of interest)? • Ofir Marom • Ofir Marom and Benjamin Rosman. Zero-Shot Transfer with Deictic Object-Oriented Representation in Reinforcement Learning. NIPS 2018.

  30. Example: Sokoban

  31. Sokoban (legal move)

  32. Sokoban (legal move)

  33. Sokoban (illegal move)

  34. Sokoban (goal)

  35. Representations • s = (agent_x = 3, agent_y = 4, box1_x = 4, box1_y = 4, box2_x = 3, box2_y = 2) • Poor scalability: 100s of boxes? • Transferability? • Effects of actions depend on interactions further away, complicating a mapping to agent space

  36. Object-oriented representations • Consider objects explicitly • Object classes have attributes • Relationships between objects are expressed in formal logic

  37. Propositional OO-MDPs [Diuk, 2010] • Describe transition rules using schemas • Propositional Object-Oriented MDPs • Provably efficient to learn (KWIK bounds) • Example: East ∧ TouchEast(Person, Wall) ⇒ Person.x ← Person.x + 0
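A toy rendering of such a schema in code, simplified from the formalism; the predicate and attribute names follow the example above, and the complementary unblocked rule is added here purely for illustration:

    from dataclasses import dataclass
    from typing import Callable, Dict

    State = Dict[str, object]

    @dataclass
    class Schema:
        """A propositional transition rule: under this action, if the condition
        holds, the effect gives the attribute's next value."""
        action: str
        condition: Callable[[State], bool]   # propositional precondition
        attribute: str                       # the attribute this rule predicts
        effect: Callable[[State], int]       # next value of that attribute

    # East ∧ TouchEast(Person, Wall) ⇒ Person.x ← Person.x + 0
    blocked_east = Schema("East",
                          lambda s: s["touch_east_person_wall"],
                          "person_x",
                          lambda s: s["person_x"] + 0)    # blocked: x unchanged

    # The complementary rule when the way east is clear
    free_east = Schema("East",
                       lambda s: not s["touch_east_person_wall"],
                       "person_x",
                       lambda s: s["person_x"] + 1)

    def predict(schemas, state, action):
        """Return the predicted (attribute, next value) from the first matching rule."""
        for rule in schemas:
            if rule.action == action and rule.condition(state):
                return rule.attribute, rule.effect(state)
        return None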

  38. Benefits • Propositional OO-MDPs • Compact representation • Efficient learning of rules

  39. Limitations • Propositional OO-MDPs are efficient, but restrictive • East ∧ TouchWest(Box, Person) ∧ TouchEast(Box, Wall) ⇒ Box.x ← ?

  40. Limitations • Propositional OO-MDPs are efficient, but restrictive • The restriction is that preconditions are propositional • Can't refer to the same box • East ∧ TouchWest(Box, Person) ∧ TouchEast(Box, Wall) ⇒ Box.x ← ?

  41. Limitations • Propositional OO-MDPs are efficient, but restrictive • The restriction is that preconditions are propositional • Can't refer to the same box • East ∧ TouchWest(Box, Person) ∧ TouchEast(Box, Wall) ⇒ Box.x ← ? • Ground the instances! But then the dynamics must be relearned for box1, box2, etc.

  42. Deictic OO-MDPs • Deictic predicates instead of propositions • Grounded only with respect to a central deictic object ("me" or "this") • Relates to other non-grounded objects • The transition dynamics of Box.x depend on the grounded box object • East ∧ TouchWest(box, Person) ∧ TouchEast(box, Wall) ⇒ box.x ← box.x + 0 • Also provably efficient
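A sketch of the same rule in deictic form: the precondition is evaluated relative to one grounded box ("this box") and applies uniformly to every box instance. Again a simplified illustration with assumed grid coordinates, not the paper's exact formalism:

    def touch_west(box, person):
        """True if the person occupies the cell immediately west of the box."""
        return (person["x"], person["y"]) == (box["x"] - 1, box["y"])

    def touch_east_wall(box, walls):
        """True if a wall occupies the cell immediately east of the box."""
        return (box["x"] + 1, box["y"]) in walls

    def box_x_after_east(this_box, person, walls):
        """Deictic rule for action East, grounded only on `this_box`, so the same
        dynamics apply to any box without relearning per instance."""
        pushed = touch_west(this_box, person)
        blocked = touch_east_wall(this_box, walls)
        if pushed and blocked:
            return this_box["x"] + 0      # East ∧ TouchWest(box, Person) ∧ TouchEast(box, Wall)
        if pushed and not blocked:
            return this_box["x"] + 1      # pushed into free space
        return this_box["x"]              # not being pushed: the box stays put

    # Usage: the same function works for any box instance
    box, person, walls = {"x": 4, "y": 4}, {"x": 3, "y": 4}, {(5, 4)}
    print(box_x_after_east(box, person, walls))   # 4: pushed but blocked by the wall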

  43. Learning the dynamics • Learning from experience: for each action, how do the attributes change? • KWIK framework • Propositional OO-MDPs: the DOORMAX algorithm • The transition dynamics for each attribute and action must be representable as a binary tree • Effects at the leaf nodes • Each possible effect can occur in at most one leaf, except for a failure condition (globally, nothing changes)
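To illustrate the tree form this assumes (not the DOORMAX learning procedure itself), a tiny hand-built prediction tree for one attribute under one action; the predicate and effects are placeholders:

    from dataclasses import dataclass
    from typing import Callable, Optional

    @dataclass
    class Node:
        """Internal node: tests a predicate; leaf: predicts an effect on the attribute."""
        predicate: Optional[Callable[[dict], bool]] = None
        if_true: Optional["Node"] = None
        if_false: Optional["Node"] = None
        effect: Optional[Callable[[int], int]] = None   # set only at leaves

    def predict_next(node, state, value):
        """Walk the tree and return the predicted next value of the attribute."""
        if node.effect is not None:
            return node.effect(value)
        branch = node.if_true if node.predicate(state) else node.if_false
        return predict_next(branch, state, value)

    # Person.x under action East: blocked ⇒ +0, otherwise ⇒ +1 (each effect in one leaf)
    east_x_tree = Node(
        predicate=lambda s: s["touch_east_person_wall"],
        if_true=Node(effect=lambda x: x + 0),
        if_false=Node(effect=lambda x: x + 1),
    )

    print(predict_next(east_x_tree, {"touch_east_person_wall": True}, 3))   # 3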

  44. Learning the dynamics • [Figure: a binary effect tree over predicates, with effects at the leaves]
