Structured Representations for Knowledge Transfer in Reinforcement Learning
Benjamin Rosman
Mobile Intelligent Autonomous Systems, Council for Scientific and Industrial Research & School of Computer Science and Applied Maths, University of the Witwatersrand, South Africa
Robots solving complex tasks
• Large, high-dimensional action and state spaces
• Many different task instances
Behaviour learning
• Reinforcement learning (RL): the agent-environment loop of action a, reward r, state s
Markov decision process (MDP)
• $M = \langle S, A, P, R \rangle$
• Learn optimal policy: $\pi^* : S \to A$
[Figure: example MDP with states, actions, transition probabilities and rewards]
Looking into the future
• Can't just rely on immediate rewards
• Define value functions:
• $V^\pi(s) = \mathbb{E}_\pi\!\left[\sum_{k \ge 0} \gamma^k r_{t+k} \mid s_t = s\right]$
• $Q^\pi(s, a) = \mathbb{E}_\pi\!\left[\sum_{k \ge 0} \gamma^k r_{t+k} \mid s_t = s, a_t = a\right]$
• $V^*$ ($Q^*$) is a proxy for $\pi^*$
Value functions example
• Random policy [figure: value function under a random policy]
• Optimal [figure: value function under the optimal policy]
RL algorithms
• So: solve a large system of nonlinear value function equations (Bellman equations)
• An optimal control problem
• But: transitions P and rewards R aren't known!
• RL is trial-and-error learning of an optimal policy from experience
• Exploration vs exploitation
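For reference, the Bellman optimality equations the bullet above refers to, in standard textbook form for the MDP $\langle S, A, P, R \rangle$ defined earlier (not copied from the slides):

```latex
% Bellman optimality equations (standard form; notation follows the MDP <S, A, P, R> above)
V^*(s)    = \max_{a \in A} \Big[ R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) \, V^*(s') \Big]
Q^*(s, a) = R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) \, \max_{a' \in A} Q^*(s', a')
```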
Exploring
[Figure: agent exploring a gridworld; step counts 96–100 shown along the trajectory]
Learned value function
[Figure: value function learned after exploration]
An algorithm: Q-learning
• Initialise $Q(s, a)$ arbitrarily
• Repeat (for each episode):
  • Initialise $s$
  • Repeat (for each step of the episode):
    1. Choose $a$ from $s$ ($\epsilon$-greedy policy from $Q$):
       $a \leftarrow \arg\max_a Q(s, a)$ w.p. $1 - \epsilon$ (exploit); a random action w.p. $\epsilon$ (explore)
    2. Take action $a$, observe $r$, $s'$
    3. Update estimate of $Q$ (learn):
       $Q(s, a) \leftarrow Q(s, a) + \alpha \big[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \big]$
       (estimated reward = immediate reward + estimated future reward)
    • $s \leftarrow s'$
  • Until $s$ is terminal
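As a concrete illustration, a minimal tabular Q-learning sketch in Python (my own sketch, not code from the talk; it assumes a small discrete environment whose reset() returns a state index and whose step(a) returns (next_state, reward, done)):

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning as on the slide: epsilon-greedy action
    selection plus the TD update Q <- Q + alpha * (target - Q)."""
    Q = np.zeros((n_states, n_actions))            # initialise Q(s, a) arbitrarily
    for _ in range(episodes):
        s = env.reset()                            # initialise s
        done = False
        while not done:                            # until s is terminal
            if np.random.rand() < epsilon:         # explore w.p. epsilon
                a = np.random.randint(n_actions)
            else:                                  # exploit: greedy in Q
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)          # take a, observe r, s'
            target = r + gamma * np.max(Q[s_next]) * (not done)
            Q[s, a] += alpha * (target - Q[s, a])  # learn
            s = s_next
    return Q
```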
Solving tasks
Generalising solutions?
• How does this help us solve other problems?
Hierarchical RL
• Sub-behaviours: options $o = \langle I_o, \pi_o, \beta_o \rangle$
• Policy + initiation and termination conditions:
  $\pi_o : S \to A$, $\beta_o : S \to [0, 1]$, $I_o \subseteq S$
• Abstract away low-level actions
• Does not affect the state space
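A minimal data-structure sketch of an option in Python (class and field names are mine, mirroring the tuple ⟨I_o, π_o, β_o⟩; not code from the talk):

```python
from dataclasses import dataclass
from typing import Any, Callable

State, Action = Any, Any  # placeholders for a task-specific state/action type

@dataclass
class Option:
    """An option o = <I_o, pi_o, beta_o> from the options framework."""
    initiation: Callable[[State], bool]    # I_o: can the option be started in s?
    policy: Callable[[State], Action]      # pi_o: S -> A, the option's behaviour
    termination: Callable[[State], float]  # beta_o: S -> [0, 1], prob. of stopping in s

    def can_start(self, s: State) -> bool:
        return self.initiation(s)
```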
Abstracting states
• Aim: learn an abstract representation of the environment
• Use with task-level planners
• Based on agent behaviours (skills / options)
• General: don't need to be relearned for every new task

Steven James (in collaboration with George Konidaris)
S. James, B. Rosman, G. Konidaris. Learning to Plan with Portable Symbols. ICML/IJCAI/AAMAS 2018 Workshop on Planning and Learning, July 2018.
S. James, B. Rosman, G. Konidaris. Learning Portable Abstract Representations for High-Level Planning. Under review.
Requirements: planning with skills
• Learn the preconditions ("symbols")
  • Classification problem: $P(\text{can execute skill} \mid \text{current\_state})$
• Learn the effects
  • Density estimation: $P(\text{next\_state} \mid \text{current\_state}, \text{skill})$
• Possible if options are subgoal, i.e. $P(\text{next\_state} \mid \text{current\_state}, \text{skill}) = P(\text{next\_state} \mid \text{skill})$
Subgoal options
• $P(\text{next\_state} \mid \text{current\_state}, \text{skill}) = P(\text{next\_state} \mid \text{skill})$
• Partition skills to ensure this property holds
• e.g. "walk to nearest door"
Generating symbols from skills [Konidaris, 2018]
• Results in an abstract MDP / propositional PPDDL
• But $P(s \in I_o)$ and $P(s' \mid o)$ are distributions/symbols over a state space particular to the current task
• e.g. grounded in a specific set of xy-coordinates
Towards portability
• Need a representation that facilitates transfer
• Assume the agent has sensors which provide it with (lossy) observations
• Augment the state space with action-centric observations: the agent space
• e.g. a robot navigating a building:
  • State space: xy-coordinates
  • Agent space: video camera
Portable symbols
• Learn symbols in agent space
• Portable!
• But: non-Markov and insufficient for planning
• Add the subgoal partition labels to the rules
• General abstract symbols + grounding → portable rules
Grounding symbols
• Learn abstract symbols
• Learn linking functions:
  • Mapping partition numbers from options to their effects
• This gives us a factored MDP or a PPDDL representation
• Provably sufficient for planning
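As a toy illustration of the linking idea (my own sketch under assumed interfaces, not the authors' implementation): a portable agent-space rule, keyed by its partition label, is paired per task with a learned state-space grounding.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Tuple

Observation, State = Any, Any  # agent-space observation and task-specific state

@dataclass
class PortableRule:
    """Agent-space rule for one (option, partition): learned once, reused across tasks."""
    option: str
    partition: int
    precondition: Callable[[Observation], bool]   # agent-space precondition classifier
    effect: Callable[[], Observation]             # sample from the agent-space effect model

# Linking function: learned per task, maps (option, partition) to its grounding
# in the current task's state space (e.g. which xy-region that partition refers to).
Linking = Dict[Tuple[str, int], Callable[[State], bool]]

def grounded_applicable(rule: PortableRule, linking: Linking,
                        obs: Observation, state: State) -> bool:
    """A rule fires if its portable precondition holds in agent space
    and its learned grounding holds in the current task's state space."""
    grounding = linking[(rule.option, rule.partition)]
    return rule.precondition(obs) and grounding(state)
```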
Learning grounded symbols
[Figures: symbols learned using agent-space data vs. using state-space data]
The treasure game
Agent and problem space
• State space: xy-position of the agent, key and treasure, angle of the levers and state of the lock
• Agent space: the 9 adjacent cells about the agent
Skills
• Options:
  • GoLeft, GoRight
  • JumpLeft, JumpRight
  • DownRight, DownLeft
  • Interact
  • ClimbLadder, DescendLadder
Learning portable rules
• Cluster to create subgoal agent-space options
• Use an SVM and KDE to estimate preconditions and effects
• Learned rules can be transferred between tasks
[Figures: learned Interact1 rule and DescendLadder rule]
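A minimal scikit-learn sketch of the SVM + KDE step (illustrative only; the feature encoding and hyperparameters are assumptions, not the settings used in the experiments):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.neighbors import KernelDensity

def learn_rule(obs_before, executed_ok, obs_after_success):
    """Estimate one option's precondition and effect in agent space.

    obs_before:        (N, d) agent-space observations where the option was tried
    executed_ok:       (N,) booleans, whether the option could run from there
    obs_after_success: (M, d) agent-space observations after successful executions
    """
    # Precondition: classifier for P(option can execute | observation)
    precondition = SVC(kernel="rbf", probability=True)
    precondition.fit(obs_before, executed_ok)

    # Effect: density estimate of P(next observation | option); valid because the
    # (partitioned) option is subgoal, so the outcome does not depend on the start state
    effect = KernelDensity(kernel="gaussian", bandwidth=0.5)
    effect.fit(obs_after_success)

    return precondition, effect

# Usage: can_run = precondition.predict_proba(obs[None])[0, 1] > 0.5
#        samples = effect.sample(10)   # imagined outcomes for planning
```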
Grounding rules
• Partition options in state space to get partition numbers
• Learn grounded rule instances: linking
[Figure: grounding a portable rule by linking its state-space partitions]
Partitioned rules
[Figure: preconditions and positive/negative effects for the Interact1 and Interact3 rules]
Experiments
• Require fewer samples in subsequent tasks
Portable rules
• Learn abstract rules and their groundings
• Transfer between domain instances
  • Just by learning linking functions
• But what if there is additional structure?
  • In particular, what if there are many rule instances (objects of interest)?

Ofir Marom
O. Marom, B. Rosman. Zero-Shot Transfer with Deictic Object-Oriented Representation in Reinforcement Learning. NIPS, 2018.
Example: Sokoban
Sokoban (legal move)
Sokoban (legal move)
Sokoban (illegal move)
Sokoban (goal)
Representations
$s = (\text{agent}.x = 3,\ \text{agent}.y = 4,\ \text{box}_1.x = 4,\ \text{box}_1.y = 4,\ \text{box}_2.x = 3,\ \text{box}_2.y = 2)$
• Poor scalability
  • 100s of boxes?
• Transferability?
  • Effects of actions depend on interactions further away, complicating a mapping to agent space
Object-oriented representations
• Consider objects explicitly
• Object classes have attributes
• Relationships based on formal logic
Propositional OO-MDPs [Diuk, 2010]
• Describe transition rules using schemas
• Propositional Object-Oriented MDPs
• Provably efficient to learn (KWIK bounds)
$\text{East} \wedge \text{touchEast}(\text{agent}, \text{wall}) \Rightarrow \text{agent}.x \leftarrow \text{agent}.x + 0$
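To make the schema idea concrete, a toy Python rendering of a propositional rule of this kind (object and predicate names follow the Sokoban example in these slides; this is my illustration, not the paper's code):

```python
def touch_east(obj, other_class, objects):
    """True if some object of class `other_class` sits in the cell immediately east of `obj`."""
    return any(o["class"] == other_class and
               o["x"] == objects[obj]["x"] + 1 and
               o["y"] == objects[obj]["y"]
               for name, o in objects.items() if name != obj)

def east_schema(objects, action):
    """Propositional schema: East & touchEast(agent, wall) => agent.x unchanged;
    otherwise the agent moves one cell east."""
    if action != "East":
        return objects
    blocked = touch_east("agent", "wall", objects)
    if not blocked:
        objects["agent"]["x"] += 1   # effect: agent.x <- agent.x + 1
    return objects                    # blocked: agent.x <- agent.x + 0
```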
Benefits
• Propositional OO-MDPs:
  • Compact representation
  • Efficient learning of rules
Limitations
• Propositional OO-MDPs are efficient, but restrictive
• Restriction that preconditions are propositional
• Can't refer to the same box:
$\text{East} \wedge \text{touchWest}(\text{Box}, \text{agent}) \wedge \text{touchEast}(\text{Box}, \text{wall}) \Rightarrow \text{Box}.x \leftarrow\ ?$
• Ground instances! But then relearn dynamics for box1, box2, etc.
Deictic OO-MDPs
• Deictic predicates instead of propositions
• Grounded only with respect to a central deictic object ("me" or "this")
• Relates to other non-grounded objects
• Transition dynamics of Box.x depend on the grounded box object:
$\text{East} \wedge \text{touchWest}(box, \text{agent}) \wedge \text{touchEast}(box, \text{wall}) \Rightarrow box.x \leftarrow box.x + 0$
• Also provably efficient
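And a toy deictic counterpart to the earlier propositional sketch: the schema is grounded on one specific box ("this" box), so the same rule covers box1, box2, ... without relearning (again an illustrative sketch, not the paper's implementation):

```python
def touch(this, direction, other_class, objects):
    """True if an object of `other_class` is adjacent to the grounded object `this`
    in the given direction (deictic: `this` is grounded, the other object is not)."""
    dx, dy = {"east": (1, 0), "west": (-1, 0), "north": (0, 1), "south": (0, -1)}[direction]
    tx, ty = objects[this]["x"] + dx, objects[this]["y"] + dy
    return any(o["class"] == other_class and o["x"] == tx and o["y"] == ty
               for name, o in objects.items() if name != this)

def deictic_east_schema(this_box, objects, action):
    """Deictic schema for the x attribute of one grounded box:
    East & touchWest(box, agent) & touchEast(box, wall)  => box.x unchanged;
    East & touchWest(box, agent) & not touchEast(box, wall) => box pushed one cell east."""
    if action == "East" and touch(this_box, "west", "agent", objects):
        if not touch(this_box, "east", "wall", objects):
            objects[this_box]["x"] += 1
    return objects

# The same schema applies to every box instance:
# for b in ["box1", "box2"]: deictic_east_schema(b, objects, "East")
```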
Learning the dynamics
• Learning from experience:
  • For each action, how do the attributes change?
• KWIK framework
• Propositional OO-MDPs: the DOORMAX algorithm
  • Transition dynamics for each attribute and action must be representable as a binary tree
  • Effects at the leaf nodes
  • Each possible effect can occur at most at one leaf, except for a failure condition (globally nothing changes)
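A much-simplified sketch of the learning loop described here (deliberately not the full DOORMAX algorithm): for each action and attribute, record which observed condition, i.e. vector of predicate truth values, produced which effect, and answer "don't know" for unseen conditions, in the KWIK spirit.

```python
from typing import Dict, Optional, Tuple

# Truth values of the schema's predicates in a state, e.g. (touchEast(agent, wall), ...)
Condition = Tuple[bool, ...]

class EffectLearner:
    """Per (action, attribute) memorisation of condition -> effect.
    A KWIK-style simplification: only answer once the condition has been seen."""
    def __init__(self):
        self.table: Dict[Tuple[str, str, Condition], int] = {}

    def observe(self, action: str, attribute: str, condition: Condition, effect: int):
        # e.g. effect = +1, 0 or -1 on the attribute's value
        self.table[(action, attribute, condition)] = effect

    def predict(self, action: str, attribute: str, condition: Condition) -> Optional[int]:
        # None means "I don't know yet" (KWIK), prompting further exploration
        return self.table.get((action, attribute, condition))
```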
Learning the dynamics
[Figure: binary tree over predicates with effects at the leaf nodes]