Off-Policy Deep Reinforcement Learning without Exploration
Scott Fujimoto, David Meger, Doina Precup (Mila, McGill University)


  1. Off-Policy Deep Reinforcement Learning without Exploration. Scott Fujimoto, David Meger, Doina Precup. Mila, McGill University.

  2. Surprise! Agent orange and agent blue are trained with…
     1. The same off-policy algorithm (DDPG).
     2. The same dataset.

  3. The Difference?
     1. Agent orange: interacted with the environment.
        • Standard RL loop: collect data, store it in the buffer, train, repeat.
     2. Agent blue: never interacted with the environment.
        • Trained concurrently on the data collected by agent orange.

  4. 1. Trained with the same off-policy algorithm.
     2. Trained with the same dataset.
     3. One interacts with the environment. One doesn't.

  5. Off-policy deep RL fails when truly off-policy.

  6. Value Predictions

  7. Extrapolation Error Q(s, a) ← r + γ Q(s′, a′)

  8. Extrapolation Error Q(s, a) ← r + γ Q(s′, a′). The reward r and next state s′ are GIVEN (read from the dataset); the value Q(s′, a′) is GENERATED (estimated by the learned network).

  9. Extrapolation Error Q(s, a) ← r + γ Q(s′, a′)
     1. (s, a, r, s′) ~ Dataset
     2. a′ ~ π(s′)

  10. Extrapolation Error Q(s, a) ← r + γ Q(s′, a′): (s′, a′) ∉ Dataset → Q(s′, a′) = bad → Q(s, a) = bad


  13. Extrapolation Error Attempting to evaluate π without (sufficient) access to the (s, a) pairs π visits.
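
  To make the failure mode concrete, here is a minimal sketch of the target computation above on a fixed batch, assuming toy PyTorch networks and random stand-in data (q_net, pi_net, and the tensors below are hypothetical, not the authors' code):

  ```python
  # A toy illustration, not the paper's code: q_net, pi_net, and the random
  # "batch" below are stand-ins for a learned critic, actor, and fixed dataset.
  import torch
  import torch.nn as nn

  gamma = 0.99
  q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 1))   # Q(s, a): 3-dim state + 1-dim action
  pi_net = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 1))  # pi(s)

  # 1. (s, a, r, s') ~ Dataset: GIVEN, read from the fixed batch.
  s, a, r, s2 = torch.randn(32, 3), torch.randn(32, 1), torch.randn(32, 1), torch.randn(32, 3)

  # 2. a' ~ pi(s'): GENERATED by the current policy, not by the behavior policy.
  a2 = pi_net(s2)

  # If (s', a') never appears in the batch, Q(s', a') is pure extrapolation,
  # and whatever error it carries is copied into the target for Q(s, a).
  target = r + gamma * q_net(torch.cat([s2, a2], dim=1)).detach()
  loss = ((q_net(torch.cat([s, a], dim=1)) - target) ** 2).mean()
  ```

  Nothing in this update checks whether (s′, a′) resembles anything in the batch. With environment interaction the agent can visit such pairs and correct the estimate; from a fixed batch, the error persists and propagates through bootstrapping.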

  14. Batch-Constrained Reinforcement Learning Only choose π such that we have access to the (s, a) pairs π visits.

  15. Batch-Constrained Reinforcement Learning (a sketch follows this list)
     1. a ~ π(s) such that (s, a) ∈ Dataset.
     2. a ~ π(s) such that (s′, π(s′)) ∈ Dataset.
     3. a ~ π(s) such that Q(s, a) is maximized.
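
  Condition 1 is easiest to see with discrete actions. Below is a minimal tabular sketch under made-up data (the transitions, Q-table, and helper names are hypothetical; condition 2, only visiting familiar next states, is omitted for brevity): the argmax is restricted to actions the batch actually contains at each state.

  ```python
  from collections import defaultdict

  # Made-up transitions (s, a, r, s') and a made-up Q-table; the pair (0, 2)
  # never appears in the batch, so its Q-value is pure extrapolation.
  dataset = [(0, 1, 1.0, 1), (0, 0, 0.0, 0), (1, 1, 1.0, 1)]
  Q = {(0, 0): 0.2, (0, 1): 0.9, (0, 2): 5.0, (1, 1): 1.0}

  seen_actions = defaultdict(set)  # state -> actions the batch contains there
  for (s, a, r, s2) in dataset:
      seen_actions[s].add(a)

  def batch_constrained_action(s):
      # Condition 1: consider only (s, a) pairs that appear in the batch.
      # Condition 3: among those, take the value-maximizing action.
      return max(seen_actions[s], key=lambda a: Q[(s, a)])

  print(batch_constrained_action(0))  # -> 1, ignoring the inflated but unseen Q[(0, 2)]
  ```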

  16. Batch-Constrained Deep Q-Learning (BCQ)
     First imitate the dataset via a generative model: G(a|s) ≈ P_Dataset(a|s).
     π(s) = argmax_{a_i} Q(s, a_i), where a_i ~ G(·|s)
     (I.e., select the best action that is likely under the dataset.)
     (+ some additional deep RL magic)
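
  A minimal sketch of that selection rule for continuous actions, assuming a 3-dim state and 1-dim action: gen_net plus Gaussian noise is a crude stand-in for the paper's conditional VAE G(a|s), and the "additional magic" (a perturbation network and clipped twin critics) is left out.

  ```python
  import torch
  import torch.nn as nn

  gen_net = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 1))  # mean of G(a|s), stand-in for the VAE
  q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 1))    # Q(s, a)

  def bcq_select_action(s, n_samples=10):
      s_rep = s.unsqueeze(0).repeat(n_samples, 1)             # copy the state once per candidate
      a_i = gen_net(s_rep) + 0.1 * torch.randn(n_samples, 1)  # a_i ~ G(.|s): actions likely under the batch
      q = q_net(torch.cat([s_rep, a_i], dim=1))               # score each candidate with the critic
      return a_i[q.argmax()]                                  # best action among the plausible ones

  action = bcq_select_action(torch.randn(3))
  ```

  Sampling only from G keeps the argmax inside the batch's support, which is exactly the batch-constraint from the previous slide applied to a deep function approximator.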

  17. [Results figure; legend: ∎ BCQ, ∎ DDPG]

  18. [Results figure; legend: ∎ BCQ, ∎ DDPG]

  19. Come say Hi @ Pacific Ballroom #38 (6:30 Tonight) https://github.com/sfujim/BCQ (Artist’s rendition of poster session)
