1. Online Exploration in Least-Squares Policy Iteration. Lihong Li, Michael L. Littman, and Christopher R. Mansley. Rutgers Laboratory for Real-Life Reinforcement Learning (RL³). AAMAS, Budapest, 5/14/2009.

2. Contributions. Two long-standing challenges in reinforcement learning:
   • Challenge I: the exploration/exploitation tradeoff, addressed by Rmax [Brafman & Tennenholtz 02] (provably efficient, but limited to finite MDPs)
   • Challenge II: value-function approximation, addressed by LSPI [Lagoudakis & Parr 03] (handles continuous spaces, but offline)
   This work combines the two into LSPI-Rmax.

3. Outline
   • Introduction
     – LSPI
     – Rmax
   • LSPI-Rmax
   • Experiments
   • Conclusions

4. Basic Terminology
   • Markov decision process
     – States: S
     – Actions: A
     – Reward function: -1 ≤ R(s,a) ≤ 1
     – Transition probabilities: T(s'|s,a)
     – Discount factor: 0 < γ < 1
   • Optimal value function: Q*(s,a) = R(s,a) + γ Σ_s' T(s'|s,a) max_a' Q*(s',a')
   • Optimal policy: π*(s) = argmax_a Q*(s,a)
   • Goal: approximate Q* when S is too large (or continuous) for exact methods
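
To make the terminology concrete, here is a small sketch, not from the slides, that computes Q* by value iteration on a tiny made-up MDP (the two states, rewards, and transitions below are invented purely for illustration):

```python
import numpy as np

# Tiny hypothetical MDP (2 states, 2 actions), invented purely for illustration.
n_states, n_actions, gamma = 2, 2, 0.95
R = np.array([[0.0, -1.0],                     # R(s, a), bounded in [-1, 1]
              [0.1,  1.0]])
T = np.zeros((n_states, n_actions, n_states))  # T(s' | s, a)
T[0, 0] = [0.9, 0.1]; T[0, 1] = [0.2, 0.8]
T[1, 0] = [1.0, 0.0]; T[1, 1] = [0.0, 1.0]

# Value iteration: Q*(s,a) = R(s,a) + gamma * sum_s' T(s'|s,a) * max_a' Q*(s',a')
Q = np.zeros((n_states, n_actions))
for _ in range(1000):
    Q = R + gamma * T @ Q.max(axis=1)          # batched expectation over s'

pi_star = Q.argmax(axis=1)                     # pi*(s) = argmax_a Q*(s,a)
print("Q*:", Q, "pi*:", pi_star)
```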

5. Linear Function Approximation
   • Features: φ(s,a) = (φ_1(s,a), …, φ_k(s,a))
     – A.k.a. "basis functions"; predefined
   • Weights: w = (w_1, …, w_k)
     – w_i measures the contribution of φ_i to approximating Q*
   • Approximation: Q̂(s,a) = w·φ(s,a) = Σ_i w_i φ_i(s,a)
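
A minimal sketch of the linear architecture, assuming a made-up polynomial feature map and three actions (the paper's actual basis functions are domain-specific and not shown in this transcript):

```python
import numpy as np

# Hypothetical feature map: k = 4 polynomial basis functions of a scalar state s,
# replicated in one block per action (3 actions assumed for illustration).
N_ACTIONS, K = 3, 4

def phi(s, a):
    feats = np.zeros(N_ACTIONS * K)
    feats[a * K:(a + 1) * K] = [1.0, s, s ** 2, s ** 3]
    return feats

w = np.zeros(N_ACTIONS * K)          # weights w_i, one per basis function

def q_hat(s, a):
    return w @ phi(s, a)             # Q_hat(s, a) = w . phi(s, a)

def greedy_action(s):
    return max(range(N_ACTIONS), key=lambda a: q_hat(s, a))
```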

6. LSPI [Lagoudakis & Parr 03]
   • Initialize π
   • Repeat:
     – Evaluate π: compute w_π from a set of samples D (LSTDQ)
     – Improve π: π'(s) = argmax_a w_π·φ(s,a); set π ← π'
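
For reference, a compact sketch of the LSTDQ evaluation step from Lagoudakis & Parr that LSPI uses to compute w_π from the sample set D; the ridge term `reg` is an implementation convenience, not part of the slide:

```python
import numpy as np

def lstdq(D, phi, policy, k, gamma=0.95, reg=1e-6):
    """Evaluate `policy` from samples D = [(s, a, r, s_next), ...] via LSTDQ.

    A = sum phi(s,a) (phi(s,a) - gamma * phi(s', pi(s')))^T,  b = sum phi(s,a) * r,
    and w_pi = A^{-1} b.  `reg` is a small ridge term so A stays invertible.
    """
    A = reg * np.eye(k)
    b = np.zeros(k)
    for s, a, r, s_next in D:
        f = phi(s, a)
        f_next = phi(s_next, policy(s_next))
        A += np.outer(f, f - gamma * f_next)
        b += r * f
    return np.linalg.solve(A, b)
```

LSPI alternates this evaluation step with the greedy improvement step until the weights stop changing.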

7. LSPI [Lagoudakis & Parr 03] (continued)
   • The same loop, but LSPI does not specify how to collect the samples D.
   • An agent only collects samples in the states it visits, so deciding how to gather D is a fundamental challenge in online reinforcement learning.

8. Exploration/Exploitation Tradeoff
   [Figure: a 100-state chain (states 1, 2, 3, …, 98, 99, 100) with reward 0 along the chain except a small reward of 0.001 and a large reward of 1000; accompanying plot of total reward vs. time contrasting efficient exploration (reaches the optimal policy) with inefficient exploration.]

9. Rmax [Brafman & Tennenholtz 02]
   • Rmax is for finite-state, finite-action MDPs
   • Learns T and R by counting/averaging
   • Partitions S × A into "known" and "unknown" state-actions
   • In s_t, takes the optimal action of an optimistic model: "optimism in the face of uncertainty"
     – Either: explore the "unknown" region
     – Or: exploit the "known" region
   • Thm: Rmax is provably efficient
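
A heavily simplified sketch of the Rmax bookkeeping on a finite MDP; the threshold m, the dictionary counters, and the class layout are illustration choices, not from the slide:

```python
from collections import defaultdict

class RmaxModel:
    """Tabular Rmax bookkeeping: count visits, average rewards/transitions, and
    treat under-sampled (s, a) pairs optimistically."""

    def __init__(self, m=5, rmax=1.0, gamma=0.95):
        self.m, self.rmax, self.gamma = m, rmax, gamma
        self.counts = defaultdict(int)        # n(s, a)
        self.reward_sum = defaultdict(float)  # running sum of observed rewards
        self.next_counts = defaultdict(int)   # n(s, a, s')

    def known(self, s, a):
        # (s, a) becomes "known" after m visits; below that it is "unknown".
        return self.counts[(s, a)] >= self.m

    def update(self, s, a, r, s_next):
        self.counts[(s, a)] += 1
        self.reward_sum[(s, a)] += r
        self.next_counts[(s, a, s_next)] += 1

    def optimistic_value(self):
        # Unknown (s, a) pairs are assumed to achieve the maximum possible return.
        return self.rmax / (1.0 - self.gamma)
```

Planning in the resulting optimistic model (omitted here) is what drives the agent toward unknown state-actions: either it explores them, or it exploits the known region.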

10. LSPI-Rmax
   • Similar to LSPI
   • But distinguishes known vs. unknown (s,a), based on the samples in D (like Rmax's partition of S × A)
   • Unknown state-actions: treat their Q-value as Q_max
   • Requires modifications of LSTDQ (next slide)

11. LSTDQ-Rmax (the modified evaluation equations are presented on the slide)
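
The modified equations themselves are not reproduced in this transcript. The sketch below shows one plausible reading of the slide-10 idea (treat unknown (s,a) optimistically inside LSTDQ): when the successor state-action is unknown, bootstrap from the constant Q_max instead of the linear estimate. It illustrates the idea rather than the paper's exact update.

```python
import numpy as np

def lstdq_rmax(D, phi, policy, known, k, gamma=0.95, qmax=None, reg=1e-6):
    """Optimistic LSTDQ sketch: an illustrative reading of slides 10-11, not the
    paper's exact equations.

    `known(s, a)` tests whether (s, a) has enough supporting samples in D.
    When the successor pair (s', pi(s')) is unknown, the backup bootstraps from
    the constant optimistic value qmax instead of w . phi(s', pi(s')).
    """
    if qmax is None:
        qmax = 1.0 / (1.0 - gamma)      # Rmax / (1 - gamma), with Rmax = 1
    A = reg * np.eye(k)
    b = np.zeros(k)
    for s, a, r, s_next in D:
        f = phi(s, a)
        a_next = policy(s_next)
        if known(s_next, a_next):
            A += np.outer(f, f - gamma * phi(s_next, a_next))
            b += r * f
        else:
            A += np.outer(f, f)          # no bootstrap term from the unknown pair
            b += (r + gamma * qmax) * f  # constant optimistic target
    return np.linalg.solve(A, b)
```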

12. LSPI-Rmax for Online RL
   • D = empty set
   • Initialize w
   • for t = 1, 2, 3, …
     – Take greedy action: a_t = argmax_a w·φ(s_t, a)
     – D = D ∪ {(s_t, a_t, r_t, s_{t+1})}
     – Run LSPI using LSTDQ-Rmax
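
A runnable sketch of the online loop above, assuming a hypothetical Gym-style environment (`reset`/`step`) and the `phi`, `known`, and `lstdq_rmax` helpers sketched earlier; re-running LSPI after every step is shown literally, as on the slide, though an implementation might batch it:

```python
import numpy as np

def lspi_rmax_online(env, phi, known, lstdq_rmax, k, n_actions,
                     n_steps=10_000, gamma=0.95, lspi_iters=5):
    """Online loop from slide 12: act greedily, store the sample, re-run LSPI.

    `env` is a hypothetical Gym-style environment; `phi`, `known`, and
    `lstdq_rmax` are the feature map, known-ness test, and evaluation routine
    sketched earlier.
    """
    D, w = [], np.zeros(k)
    s, _ = env.reset()
    for t in range(n_steps):
        # Greedy action w.r.t. the current weights; exploration comes from optimism.
        a = max(range(n_actions), key=lambda a_: w @ phi(s, a_))
        s_next, r, terminated, truncated, _ = env.step(a)
        D.append((s, a, r, s_next))
        # Approximate policy iteration on the growing sample set D.
        for _ in range(lspi_iters):
            policy = lambda st: max(range(n_actions), key=lambda a_: w @ phi(st, a_))
            w = lstdq_rmax(D, phi, policy, known, k, gamma=gamma)
        s = s_next
        if terminated or truncated:
            s, _ = env.reset()
    return w
```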

13. Experiments
   • Problems:
     – MountainCar
     – Bicycle
     – Continuous Combination Lock
     – ExpressWorld (a variant of PuddleWorld): four actions; stochastic transitions; reward of -1 per step, -0.5 per step in the "express lane", and a penalty for stepping into puddles; random start states

14. Various Exploration Rules with LSPI
   [Learning-curve plot comparing exploration rules with LSPI; slide annotation: "Converges to better policies".]

15. A Closer Look
   [Plots of the states visited in the first 3 episodes, contrasting efficient vs. inefficient exploration; efficient exploration helps discovery of the goal and the express lane.]

16. More Experiments
   [Results shown as plots on the slide.]

17. Effect of Rmax Threshold
   [Results shown as plots on the slide.]

18. Conclusions
   • We proposed LSPI-Rmax
     – LSPI + Rmax
     – encourages active exploration
     – with linear function approximation
   • Future directions
     – Similar idea applied to Gaussian-process RL
     – Comparison to model-based RL


20. Where are features from?
   • Hand-crafted features
     – expert knowledge required
     – expensive and error-prone
   • Generic features
     – RBF, CMAC, polynomial, etc.
     – may not always work well
   • Automatic feature selection using
     – Bellman error [Parr et al. 07]
     – spectral graph analysis [Mahadevan & Maggioni 07]
     – TD approximation [Li, Williams & Balakrishnan 09]
     – L1 regularization for LSPI [Kolter & Ng 09]
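
As an illustration of the "generic features" bullet, a minimal radial-basis-function feature map over a one-dimensional state; the centers, width, and number of actions are arbitrary choices for the sketch:

```python
import numpy as np

N_ACTIONS = 3
CENTERS = np.linspace(0.0, 1.0, 5)   # arbitrary RBF centers over a 1-D state
WIDTH = 0.25                          # arbitrary bandwidth

def rbf_features(s, a):
    """Generic RBF features: Gaussian bumps over the state plus a bias term,
    replicated in one block per action."""
    bumps = np.exp(-((s - CENTERS) ** 2) / (2.0 * WIDTH ** 2))
    block = len(CENTERS) + 1
    feats = np.zeros(N_ACTIONS * block)
    feats[a * block] = 1.0                            # per-action bias
    feats[a * block + 1:(a + 1) * block] = bumps
    return feats
```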

21. LSPI-Rmax vs. MBRL
   • Model-based RL (e.g., Rmax)
     – Learns an MDP model
     – Computes a policy with the approximate model
     – Can use function approximation in model learning
       • Rmax with many compact representations [Li 09]
   • LSPI-Rmax is model-free RL
     – Avoids the expensive "planning" step
     – Has weaker theoretical guarantees
