Sample-Optimal Parametric Q-Learning Using Linearly Additive Features
Lin F. Yang, Mengdi Wang
A Basic RL Model: Markov Decision Process
• States: 𝒮; Actions: 𝒜
• Reward: r(s, a)
• State transition: P(· | s, a)
• Policy: π (possibly random)
• Effective horizon: 1/(1 − γ)
• Optimal policy & value: π*, V*
• ε-optimal policy π: V^π(s) ≥ V*(s) − ε for all s
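For reference, the Bellman optimality equation behind the last two bullets (a textbook identity, not shown on the slide):

V^*(s) \;=\; \max_{a \in \mathcal{A}} \Big[ r(s,a) \;+\; \gamma \sum_{s' \in \mathcal{S}} P(s' \mid s, a)\, V^*(s') \Big],
\qquad
\text{and } \pi \text{ is } \varepsilon\text{-optimal if } V^{\pi}(s) \ge V^*(s) - \varepsilon \text{ for all } s.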
Curse of Dimensionality
• Optimal sample complexity: Θ̃( |𝒮||𝒜| / (ε²(1 − γ)³) )
• Too many states for most cases: Go has |𝒮| = 3^361; Atari has |𝒮| ≥ 256^(256×240)
• How to optimally reduce dimensions? Exploiting structures!
Parametric Q-Learning on Feature-Based MDP
• Transition is decomposable: P(s′ | s, a) = Σ_{k=1}^{K} φ_k(s, a) ψ_k(s′)
• Equivalently, the transition matrix factors as P = Φ Ψ with P ∈ ℝ^{(𝒮×𝒜)×𝒮}
• The features Φ are known; the factors Ψ are unknown
(Figure: a table of transition probabilities factored into known Φ and unknown Ψ.)
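A minimal sketch of such a decomposable transition model. The toy dimensions, the Dirichlet construction, and the helper name sample_next_state are hypothetical illustration, not the paper's setup.

import numpy as np

# Toy feature-based MDP: P(s'|s,a) = sum_k phi_k(s,a) * psi_k(s').
# Phi (known) has shape (S, A, K); Psi (unknown to the learner) has shape (K, S).
rng = np.random.default_rng(0)
S, A, K = 6, 3, 2

# Make each phi(s,a) a probability vector over k, and each psi_k a distribution
# over next states, so that every P(.|s,a) is a valid distribution.
Phi = rng.dirichlet(np.ones(K), size=(S, A))   # (S, A, K)
Psi = rng.dirichlet(np.ones(S), size=K)        # (K, S)

P = np.einsum("sak,kt->sat", Phi, Psi)         # full transition tensor (S, A, S)
assert np.allclose(P.sum(axis=-1), 1.0)

def sample_next_state(s, a):
    # Generative model: draw s' ~ P(.|s,a).
    return rng.choice(S, p=P[s, a])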
A Simple Regression-Based Algorithm
• Generative model: we are able to sample s′ ~ P(· | s, a) for any (s, a)
• Represent the Q-function with parameter θ ∈ ℝ^K:
  Q_θ(s, a) ≔ r(s, a) + γ φ(s, a)^⊤ θ
  V_θ(s) ≔ max_{a∈𝒜} Q_θ(s, a)
  π_θ(s) ≔ argmax_{a∈𝒜} Q_θ(s, a)
• Learn θ with modified Q-learning
• Sample complexity (K: feature dimension): Õ( K / (ε²(1 − γ)⁷) )
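Continuing the toy sketch above, here is one way to realize the regression idea: estimate E[V_θ(s′)] at each (s, a) from the generative model and regress the targets onto the features. This illustrates plain value iteration with regression, not the paper's modified Q-learning; greedy_value, parametric_q_iteration, and all iteration/sample counts are hypothetical.

def greedy_value(theta, r, Phi, gamma):
    # V_theta(s) = max_a [ r(s,a) + gamma * phi(s,a)^T theta ]
    return (r + gamma * Phi @ theta).max(axis=1)

def parametric_q_iteration(r, Phi, gamma, n_iters=200, n_samples=50):
    # Fit theta so that phi(s,a)^T theta ~ E_{s' ~ P(.|s,a)}[ V_theta(s') ].
    S, A, K = Phi.shape
    F = Phi.reshape(S * A, K)                  # design matrix over all (s,a)
    theta = np.zeros(K)
    for _ in range(n_iters):
        V = greedy_value(theta, r, Phi, gamma)
        # Monte-Carlo regression targets from the generative model.
        y = np.array([np.mean([V[sample_next_state(s, a)] for _ in range(n_samples)])
                      for s in range(S) for a in range(A)])
        theta, *_ = np.linalg.lstsq(F, y, rcond=None)
    return theta

r = rng.random((S, A))                         # toy rewards
theta = parametric_q_iteration(r, Phi, gamma=0.9)
pi = (r + 0.9 * Phi @ theta).argmax(axis=1)    # greedy policy pi_theta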
Sample Optimality?
• Anchor condition: every transition distribution P(· | s, a) lies in the convex hull of the anchor distributions P(· | s₁, a₁), …, P(· | s_K, a_K)
(Figure: P(· | s, a) inside the convex hull of P(· | s₁, a₁) through P(· | s₆, a₆).)
• Sample complexity: Θ̃( K / (ε²(1 − γ)³) )
ArXiv: 1902.04779. Poster: 117
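One standard way to formalize the convex-hull picture (my reconstruction; see arXiv:1902.04779 for the precise statement): there exist K anchor pairs (s_k, a_k) such that

P(\cdot \mid s, a) \;=\; \sum_{k=1}^{K} \varphi_k(s,a)\, P(\cdot \mid s_k, a_k),
\qquad \varphi_k(s,a) \ge 0, \quad \sum_{k=1}^{K} \varphi_k(s,a) = 1,

i.e., the known features act as convex-combination weights over the anchors' transition distributions.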