  1. PAC-MDP Learning with Knowledge-based Admissible Models. Marek Grześ and Daniel Kudenko, Department of Computer Science, University of York, United Kingdom. AAMAS 2010

  2. Reinforcement Learning ◮ The loop of interaction: ◮ Agent can see the current state of the environment ◮ Agent chooses an action ◮ State of the environment changes, agent receives reward or punishment ◮ The goal of learning: quickly learn the policy that maximises the long-term expected reward
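A minimal sketch of this interaction loop in Python, assuming a toy environment with a reset/step interface; the Env class, its corridor dynamics, and the random action choice are illustrative stand-ins, not from the presentation:

```python
# A minimal, illustrative interaction loop. The Env class, its corridor dynamics,
# and the random action choice are hypothetical, not from the presentation.
import random

class Env:
    """Toy 1-D corridor: move left/right, reward +1 at the right end."""
    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):  # action: 0 = left, 1 = right
        self.state = max(0, min(self.length - 1, self.state + (1 if action == 1 else -1)))
        reward = 1.0 if self.state == self.length - 1 else 0.0
        done = reward > 0.0
        return self.state, reward, done  # next state, reward (or punishment), episode end

env = Env()
state = env.reset()
done = False
while not done:                      # the loop of interaction
    action = random.choice([0, 1])   # a learning agent would pick from its policy instead
    state, reward, done = env.step(action)
```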

  3. Exploration-Exploitation Trade-off ◮ We have found a reward of 100. Is it the best reward that can be achieved? ◮ Exploitation: should I stick with the best reward found so far? But there may still be a higher reward undiscovered. ◮ Exploration: should I try more new actions to find a region with a higher reward? But a lot of negative reward may be collected while exploring unknown actions.

  4. PAC-MDP Learning ◮ While learning the policy, also learn a model of the environment ◮ Assume that all unknown actions lead to a state with the highest possible reward ◮ This approach has been proven to be PAC, i.e., the number of suboptimal decisions is bounded polynomially in the relevant parameters
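A sketch of the optimistic model behind this idea (R-max style), assuming a finite MDP with a known reward bound; RMAX, KNOWN_THRESHOLD, the fictitious state name, and the dictionary layout are illustrative assumptions, only m and n(s, a) come from the slides:

```python
# Sketch of the optimistic model behind R-max-style PAC-MDP learning.
# RMAX, KNOWN_THRESHOLD, S_MAX and the dictionary layout are illustrative assumptions.
RMAX = 1.0            # assumed upper bound on the one-step reward
KNOWN_THRESHOLD = 20  # m: visits after which (s, a) counts as "known"
S_MAX = "s_max"       # fictitious absorbing state that pays RMAX forever

def model_for_planning(s, a, counts, empirical_model):
    """Return the (transition distribution, reward) the planner should assume for (s, a)."""
    if counts.get((s, a), 0) < KNOWN_THRESHOLD:
        # Unknown action: pretend it jumps to the maximally rewarding state,
        # which makes unexplored actions look attractive and drives exploration.
        return {S_MAX: 1.0}, RMAX
    # Known action: trust the empirical transition probabilities and reward estimate.
    return empirical_model[(s, a)]
```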

  5. Problem Formulation ◮ PAC-MDP learning vs. heuristic search ◮ Default R-max ‘is like’ best-first search (i.e., A*) with a trivial heuristic h(s)=0 ◮ Heuristic search is efficient when used with good informative heuristics ◮ It is useful and desirable to transfer this idea to reinforcement learning

  6. Problem Formulation (continued) ◮ Existing literature shows how admissible heuristics can improve PAC-MDP learning via reward shaping (Asmuth, Littman & Zinkov 2008) ◮ In this work, we are looking for alternative ways of incorporating knowledge (heuristics) into reinforcement learning algorithms ◮ Different knowledge (global admissible heuristics may not be available) ◮ Different ways of using knowledge (more efficient than reward shaping) ◮ We want to guarantee that the algorithm remains PAC-MDP

  7. Determinisation in Symbolic Planning ◮ Action representation: Probabilistic Planning Domain Description Language (PPDDL): (a  p_1 e_1 ... p_n e_n) ◮ Determinisation (probabilities known but ignored), e.g., FF-Replan, P-Graphplan ◮ In reinforcement learning, the probabilities are not known anyway
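An illustrative Python encoding of a PPDDL-style probabilistic action (a  p_1 e_1 ... p_n e_n) as an action name plus a list of (probability, effect) outcomes; the dataclass and the example move-east action are assumptions for illustration, not taken from the paper:

```python
# Illustrative encoding of a probabilistic action (a  p_1 e_1 ... p_n e_n).
# The types and the example "move-east" action are hypothetical.
from dataclasses import dataclass
from typing import Callable, List, Tuple

State = Tuple[int, int]             # e.g., grid coordinates
Effect = Callable[[State], State]   # an effect maps a state to a successor state

@dataclass
class ProbabilisticAction:
    name: str
    outcomes: List[Tuple[float, Effect]]  # [(p_1, e_1), ..., (p_n, e_n)], probabilities sum to 1

# Example: moving east succeeds with probability 0.8, otherwise the agent stays put.
move_east = ProbabilisticAction(
    name="move-east",
    outcomes=[
        (0.8, lambda s: (s[0] + 1, s[1])),  # intended effect: one step east
        (0.2, lambda s: s),                 # slip: no movement
    ],
)
```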

  8. All-outcomes (AO) Determinisation ◮ Available knowledge: all outcomes e_i of each action a: (a  p_1 e_1 ... p_n e_n) ◮ Create a new MDP M̂ in which there is a deterministic action a_d for each possible effect e_i of a given action a ◮ The value function of the new MDP M̂ is admissible, i.e., V̂(s) ≥ V*(s)
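A sketch of AO determinisation over the ProbabilisticAction encoding from the previous sketch; the DeterministicAction type and the outcome-naming scheme are illustrative assumptions:

```python
# Sketch of all-outcomes determinisation: every outcome e_i of an action becomes
# its own deterministic action in the new MDP. Builds on ProbabilisticAction and
# move_east from the previous sketch; DeterministicAction and the naming are assumptions.
from dataclasses import dataclass
from typing import Callable

@dataclass
class DeterministicAction:
    name: str
    effect: Callable  # state -> state, same Effect signature as in the previous sketch

def all_outcomes_determinise(actions):
    deterministic = []
    for a in actions:
        for i, (_prob, effect) in enumerate(a.outcomes):
            # Probabilities are dropped: each effect is treated as if it always happens,
            # which makes the value function of the determinised MDP optimistic (admissible).
            deterministic.append(DeterministicAction(name=f"{a.name}#outcome{i}", effect=effect))
    return deterministic

# e.g. all_outcomes_determinise([move_east]) yields "move-east#outcome0" (always step
# east) and "move-east#outcome1" (always stay put).
```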

  9. Free Space Assumption (FSA) ◮ Available knowledge: the intended outcome e_i of each action a, where the intended outcome is either the most probable one or completely blocked. If the intended outcome is blocked, then each remaining outcome of the action is the most probable outcome of some other action. (a  p_1 e_1 ... p_n e_n) ◮ Create a new MDP M̂ in which each action a is replaced by its intended outcome. ◮ The value function of the new MDP M̂ is admissible, i.e., V̂(s) ≥ V*(s)
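A sketch of FSA determinisation under the assumption that the intended outcome can be identified as the most probable listed outcome; that convention and the function name are illustrative, and the action types come from the earlier sketches:

```python
# Sketch of free-space-assumption (FSA) determinisation: each action is replaced by
# its intended outcome only. Reuses ProbabilisticAction / DeterministicAction from the
# earlier sketches; taking the highest-probability outcome as "intended" is an assumption.
def free_space_determinise(actions):
    deterministic = []
    for a in actions:
        # Assumption: the intended outcome is the listed outcome with the highest probability.
        _, intended_effect = max(a.outcomes, key=lambda pe: pe[0])
        deterministic.append(DeterministicAction(name=a.name, effect=intended_effect))
    return deterministic

# e.g. free_space_determinise([move_east]) keeps a single action "move-east" that always
# moves east, as if the space around the agent were free of obstacles.
```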

  10. PAC-MDP Learning with Admissible Models ◮ Rmax ◮ If (s,a) is not known (i.e., n(s,a) < m): use the optimistic Rmax model ◮ If (s,a) is known (i.e., n(s,a) ≥ m): use the estimated model

  11. PAC-MDP Learning with Admissible Models ◮ Rmax ◮ If (s,a) is not known (i.e., n(s,a) < m): use the optimistic Rmax model ◮ If (s,a) is known (i.e., n(s,a) ≥ m): use the estimated model ◮ Our approach ◮ If (s,a) is not known (i.e., n(s,a) < m): use the knowledge-based admissible model ◮ If (s,a) is known (i.e., n(s,a) ≥ m): use the estimated model
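A sketch of this substitution, assuming models stored as dictionaries keyed by (s, a); the names n_counts, admissible_model and estimated_model are illustrative stand-ins, and only n(s, a) and m come from the slides:

```python
# Sketch of the model selection above. Only n(s, a) and the threshold m come from
# the slides; the dictionary layout and names are illustrative assumptions.
def planning_model(s, a, n_counts, m, admissible_model, estimated_model):
    """Return the (transitions, reward) entry the planner should use for (s, a)."""
    if n_counts.get((s, a), 0) < m:
        # Not yet known: plan with the knowledge-based admissible model (e.g., the AO or
        # FSA determinisation). It still upper-bounds V*, so the PAC-MDP guarantee holds,
        # but the optimism is informed by domain knowledge rather than a flat Rmax bonus.
        return admissible_model[(s, a)]
    # Known: enough samples collected, switch to the empirical estimates.
    return estimated_model[(s, a)]
```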

  12. Results ◮ Figure (plot omitted): Results on a 25×25 maze domain, AO knowledge. Axes: average cumulative reward / 10³ vs. number of episodes / 10². Curves: Rmax-AO, RS(Manhattan)-AO, RS(Manhattan), RS(Line), Rmax.

  13. Results ◮ Figure (plot omitted): Results on a 25×25 maze domain, FSA knowledge. Axes: average cumulative reward / 10³ vs. number of episodes / 10². Curves: Rmax-FSA, RS(Manhattan)-FSA, RS(Manhattan), RS(Line), Rmax.

  14. Comparing with the Bayesian Exploration Bonus Algorithm ◮ Bayesian Exploration Bonus (BEB) approximates Bayesian exploration (Kolter & Ng 2009). ◮ (+) It can use action knowledge (AO and FSA) via informative priors. ◮ (-) It is not PAC-MDP. ◮ Our approach shows how to use this knowledge with PAC-MDP algorithms. ◮ A comparison of BEB with informative priors against our approach with knowledge-based models is given in the paper.

  15. Conclusion ◮ The use of knowledge in RL is important. ◮ It was shown how to use partial knowledge about actions with PAC-MDP algorithms in a theoretically correct way. ◮ Global admissible heuristics, as required by reward shaping, may not be available (e.g., in PPDDL domains). ◮ Knowledge-based admissible models turned out to be more efficient than reward shaping with equivalent knowledge: in our case the knowledge is used while actions are still 'unknown', whereas reward shaping helps only with known actions. ◮ BEB can use AO and FSA knowledge via informative priors; it was shown how to use this knowledge in the PAC-MDP framework (BEB is not PAC-MDP).
May 9, 2010
