Offline Reinforcement Learning
CS 285
Instructor: Aviral Kumar, UC Berkeley
What have we covered so far?
• Exploration:
  - Strategies to discover high-reward states, diverse skills, etc.
  - How hard is exploration? Even in the "best" case, the number of samples needed to learn an optimal Q-function is super large:
$$\#\text{Samples} \;\geq\; \Omega\!\left(\frac{|S||A|}{(1-\gamma)^3}\,\log\frac{|S||A|}{\delta}\right)$$
• Even if we are ready to collect so many samples, it may be dangerous in practice: imagine a random policy on an autonomous car or a robot!
Azar, Munos, Kappen. On the Sample Complexity of RL with a Generative Model. ICML 2012, and many others…
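To get a sense of scale, here is an illustrative plug-in with made-up numbers (the specific values are assumptions, not from the slide):

```latex
% Illustrative only: hypothetical values |S| = 10^6, |A| = 10, gamma = 0.99, delta = 0.1
\frac{|S||A|}{(1-\gamma)^3}\log\frac{|S||A|}{\delta}
  = \frac{10^6 \cdot 10}{(0.01)^3}\,\log\frac{10^7}{0.1}
  \approx 10^{13}\cdot 18.4
  \approx 1.8\times 10^{14}\ \text{samples.}
```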
Can we apply standard RL in the real world?
• RL is fundamentally an "active" learning paradigm: the agent needs to collect its own dataset to learn meaningful policies.
• This can be unsafe or expensive in real-world problems!
• Iterated data collection can also cause poor generalization!
Gottesman, Johansson, Komorowski, Faisal, Sontag, Doshi-Velez. Guidelines for RL in Healthcare. Nature Medicine, 2019.
Kumar, Gupta, Levine. DisCor: Corrective Feedback in RL via Distribution Correction, NeurIPS 2020.
Offline (Batch) Reinforcement Learning
Learn from a previously collected static dataset.
Why is offline RL promising?
• Large static datasets of meaningful behaviours already exist.
• Large datasets are at the core of successes in vision and NLP.
Lange, Gabel, Riedmiller. Batch Reinforcement Learning. 2012.
Levine, Kumar, Tucker, Fu. Offline RL Tutorial and Perspectives on Open Problems. arXiv 2020.
Applications of Offline RL
Kalashnikov et al. QT-Opt: Scalable Deep RL for Vision-Based Robotic Manipulation. CoRL 2018.
Jaques et al. Way Off-Policy Batch Reinforcement Learning for Dialog. EMNLP 2020.
Guez et al. Adaptive Treatment of Epilepsy via Batch-Mode Reinforcement Learning. AAAI 2008.
Kendall et al. Learning to Drive in a Day. ICRA 2019.
Levine, Kumar, Tucker, Fu. Offline RL Tutorial and Perspectives on Open Problems. arXiv 2020.
How well can offline RL perform?
• Supervised learning (e.g., dog vs. cat image classification): can do as well as the dataset!
• Offline reinforcement learning: can do better than the dataset, by "stitching" together the good parts of different trajectories. One can show that Q-learning recovers the optimal policy even from random data.
Fu, Kumar, Nachum, Tucker, Levine. D4RL: Datasets for Deep Data-Driven RL. arXiv 2020.
Formalism and Notation
• Dataset construction (the reward is assumed known). The dataset consists of several trajectories:
$$\tau_i = \{s_t^i, a_t^i, r_t^i\}_{t=1}^{H}, \qquad \mathcal{D} = \{\tau_1, \cdots, \tau_N\}$$
• Approximate "distribution" of states in the dataset: $\mathcal{D}(s)$
• Approximate distribution of actions at a given state in the dataset: $\mathcal{D}(a|s)$
• We will use $\pi_\beta(a|s) = \mathcal{D}(a|s)$ to denote the behavior policy (see the counting sketch below).
• Standard RL notation from before: $Q^\pi(s,a)$, $V^\pi(s)$, $d^\pi(s)$, etc.
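A minimal sketch (tabular case; all names are illustrative, not from the lecture's code) of how $\mathcal{D}(s)$ and $\pi_\beta(a|s) = \mathcal{D}(a|s)$ can be estimated by simple counting over the dataset:

```python
# Empirical state distribution D(s) and behavior policy pi_beta(a|s) by counting.
from collections import defaultdict

def empirical_distributions(trajectories):
    """trajectories: list of lists of (s, a, r) tuples with hashable s and a."""
    state_counts = defaultdict(int)
    state_action_counts = defaultdict(lambda: defaultdict(int))
    total = 0
    for tau in trajectories:
        for (s, a, r) in tau:
            state_counts[s] += 1
            state_action_counts[s][a] += 1
            total += 1
    D_s = {s: c / total for s, c in state_counts.items()}
    pi_beta = {s: {a: c / state_counts[s] for a, c in a_counts.items()}
               for s, a_counts in state_action_counts.items()}
    return D_s, pi_beta

# Toy two-trajectory dataset:
D = [[(0, 1, 0.0), (1, 0, 1.0)], [(0, 0, 0.0), (1, 0, 1.0)]]
D_s, pi_beta = empirical_distributions(D)
print(pi_beta[0])   # empirical pi_beta(a | s=0): {1: 0.5, 0: 0.5}
```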
Part 1: Classic Algorithms and Challenges With Offline RL
Part 2: Deep RL Algorithms to Address These Challenges
Part 3: Related Problems, Evaluation Protocols, Applications
Part 1: Classic Algorithms and Challenges With Offline RL
A Generic Off-Policy RL Algorithm
DQN and actor-critic algorithms both follow a similar skeleton, but with different design choices (see the sketch after this list):
1. Collect data using the current policy.
2. Store this data in a replay buffer.
3. Use the replay buffer to make updates on the policy and the Q-function.
4. Continue from step 1.
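A schematic sketch of this four-step loop; `env`, `policy`, and `qf` are placeholders with assumed methods, not a real API. The offline setting simply deletes steps 1-2 and reuses a fixed buffer.

```python
# Generic off-policy RL skeleton (schematic; the object interfaces are assumed).
import random

def off_policy_rl(env, policy, qf, num_iters, steps_per_iter, batch_size):
    replay_buffer = []
    for _ in range(num_iters):
        # 1. Collect data using the current policy (absent in offline RL).
        s = env.reset()
        for _ in range(steps_per_iter):
            a = policy.act(s)
            s_next, r, done = env.step(a)
            # 2. Store this data in a replay buffer.
            replay_buffer.append((s, a, r, s_next, done))
            s = env.reset() if done else s_next
        # 3. Use the replay buffer to update the Q-function and the policy.
        batch = random.sample(replay_buffer, min(batch_size, len(replay_buffer)))
        qf.update(batch)           # e.g., minimize TD error
        policy.update(batch, qf)   # e.g., maximize Q under the policy
        # 4. Continue from step 1; offline RL keeps the buffer fixed and only
        #    repeats steps 3-4.
    return policy, qf
```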
Can such off-policy RL algorithms be used?
Off-policy RL algorithms can be applied, in principle: instead of an "off-policy" buffer filled by past policies (as in online RL), we learn from an "off-policy" buffer collected by some unknown policies.
We will discuss some classical algorithms based on this idea next.
Lagoudakis, Parr. Least Squares Policy Iteration. JMLR 2003.
Ernst et al. Tree-Based Batch Mode Reinforcement Learning. JMLR 2005.
Gordon, G. J. Stable Function Approximation in Dynamic Programming. ICML 1995, and many more…
Classic Batch Q-Learning Algorithms
1. Compute target values using the current Q-function.
2. Train the Q-function by minimizing the TD error with respect to the target values from step 1.
Linear Q-functions: $Q(s, a) = w^T \phi(s, a)$
Least-Squares Temporal Difference Q-Learning (LSTD-Q):
$$w^T \phi(s, a) \;\approx\; r + \gamma \max_{a'} w^T \phi(s', a')$$
This can be solved in many ways: (1) find the fixed point of the above equation, or (2) minimize the gap between its two sides (a sketch of this second option follows below).
Lagoudakis, Parr. Least Squares Policy Iteration. JMLR 2003.
Ernst et al. Tree-Based Batch Mode Reinforcement Learning. JMLR 2005.
Riedmiller. Neural Fitted Q-Iteration. ECML 2005.
Gordon, G. J. Stable Function Approximation in Dynamic Programming. ICML 1995.
Antos, Szepesvari, Munos. Fitted Q-Iteration in Continuous Action-Space MDPs. NeurIPS 2007.
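A minimal sketch of option (2): iteratively re-fitting $w$ by least squares so that $w^T\phi(s,a)$ matches the bootstrapped targets, in the style of fitted Q-iteration with linear features. The function names, the discrete action set, and the omission of terminal-state handling are simplifying assumptions.

```python
# Linear fitted Q-iteration on a fixed batch of (s, a, r, s') transitions.
import numpy as np

def linear_fitted_q_iteration(data, phi, actions, gamma=0.99, num_iters=50):
    """phi(s, a) -> 1-D feature vector; actions: small discrete action set."""
    dim = phi(data[0][0], data[0][1]).shape[0]
    w = np.zeros(dim)
    # The regression design matrix is fixed; only the targets change per iteration.
    Phi = np.stack([phi(s, a) for (s, a, r, s_next) in data])
    for _ in range(num_iters):
        # 1. Compute target values using the current Q-function w^T phi.
        targets = np.array([
            r + gamma * max(w @ phi(s_next, a_next) for a_next in actions)
            for (s, a, r, s_next) in data
        ])
        # 2. Minimize the gap between the two sides by least squares.
        w, *_ = np.linalg.lstsq(Phi, targets, rcond=None)
    return w
```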
Classic Batch RL Algorithms Based on IS (Importance Sampling)
• High-confidence bounds on the return estimate
• Doubly-robust estimators
• Variance reduction techniques
(A sketch of the basic IS estimator these methods build on follows below.)
Precup. Eligibility Traces for Off-Policy Policy Evaluation. CSD Faculty Publication Series, 2000.
Precup, Sutton, Dasgupta. Off-Policy TD Learning with Function Approximation. ICML 2001.
Peshkin and Shelton. Learning from Scarce Experience. 2002.
Thomas, Theocharous, Ghavamzadeh. High Confidence Off-Policy Evaluation. AAAI 2015.
Thomas, Theocharous, Ghavamzadeh. High Confidence Off-Policy Improvement. ICML 2015.
Thomas, Brunskill. Magical Policy Search: Data Efficient RL with Guarantees of Global Optimality. EWRL 2016.
Jiang and Li. Doubly-Robust Off-Policy Value Estimation for Reinforcement Learning. ICML 2016.
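For context, a minimal sketch of the basic per-trajectory (ordinary) importance sampling return estimator that the cited high-confidence, doubly-robust, and variance-reduction methods build on; the function and argument names are illustrative.

```python
# Ordinary importance sampling estimate of an evaluation policy's return
# from trajectories collected by a (known) behavior policy.
import numpy as np

def importance_sampling_return(trajectories, pi_e, pi_b, gamma=0.99):
    """trajectories: list of lists of (s, a, r); pi_e(a, s), pi_b(a, s): probabilities."""
    estimates = []
    for tau in trajectories:
        weight, ret = 1.0, 0.0
        for t, (s, a, r) in enumerate(tau):
            weight *= pi_e(a, s) / pi_b(a, s)   # cumulative importance ratio
            ret += (gamma ** t) * r             # discounted return of the trajectory
        estimates.append(weight * ret)
    return float(np.mean(estimates))            # unbiased but potentially high-variance
```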
Modern Offline RL: A Simple Experiment
Collect expert data and run actor-critic algorithms on this data.
• Performance doesn't improve with more data.
• Learning diverges: "policy unlearning".
• Not a classical overfitting issue!
[Plots: actual return ("how well it does") vs. the learned Q-values ("how well it thinks it does").]
Kumar, Fu, Tucker, Levine. Stabilizing Off-Policy RL via Bootstrapping Error Reduction, NeurIPS 2019.
Levine, Kumar, Tucker, Fu. Offline RL Tutorial and Perspectives on Open Problems. arXiv 2020.
So, why do RL algorithms fail, even though imitation learning would work in this setting (e.g., in Lecture 2)?
Let's see how the Q-function is updated:
$$Q(s, a) \;\leftarrow\; r(s, a) + \gamma \max_{a'} Q(s', a')$$
$$\mathbb{E}_{s, a, s' \sim \mathcal{D}}\!\left[\left(Q(s, a) - \left(r(s, a) + \gamma \max_{a'} Q(s', a')\right)\right)^2\right]$$
• Which actions does the Q-function train on? The dataset actions, $s, a \sim \mathcal{D}$: only Q-values on the data are updated.
• Where does the action $a'$ for the target value come from? From $\max_{a'} Q(s', a')$: Q-values at other, possibly unseen, actions.
Q-learning queries Q-values at unseen actions to form its targets, but those values are never trained (a tabular sketch follows below).
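The following tabular sketch makes this concrete (toy numbers; random initialization stands in for function-approximation error, both assumptions, not from the slide): only the $(s, a)$ pairs in the dataset are ever updated, while the max in the target freely reads values of actions that never appear in the data.

```python
# A tabular sketch of the backup above: the target's max ranges over ALL
# actions, including ones never seen at s' in the dataset D.
import numpy as np

n_states, n_actions, gamma = 3, 4, 0.99
rng = np.random.default_rng(0)
Q = rng.normal(size=(n_states, n_actions))  # random init mimics approximation error

# Toy dataset: only action 0 is ever taken, so actions 1-3 are out-of-distribution.
dataset = [(0, 0, 0.0, 1), (1, 0, 0.0, 2), (2, 0, 1.0, 0)]  # (s, a, r, s_next)

for _ in range(100):
    for (s, a, r, s_next) in dataset:
        target = r + gamma * Q[s_next].max()  # queries Q at unseen actions too
        Q[s, a] += 0.5 * (target - Q[s, a])   # ...but only Q[s, a] with (s, a) in D is trained

print(Q)  # columns 1-3 keep their erroneous initial values; if any is too large,
          # it inflates every target and nothing in the data can correct it
```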
Why are erroneous backups a big deal?
• This phenomenon also happens in online RL settings, where the Q-function is erroneously optimistic.
• But Boltzmann or epsilon-greedy exploration with this overoptimistic Q-function (generally) leads to "error correction": $\pi_{\text{explore}}(a|s) \propto \exp(Q(s, a))$ (see the small sketch below). Error correction is not necessarily guaranteed with online data collection when using deep neural nets, but it mostly works fine in practice (tricks: use replay buffers, perform distribution correction, etc.).
• But the primary mechanism for error correction, i.e., exploration, is impossible in offline RL, since there is no access to the environment…
Kumar, Fu, Tucker, Levine. Stabilizing Off-Policy RL via Bootstrapping Error Reduction, NeurIPS 2019.
Levine, Kumar, Tucker, Fu. Offline RL Tutorial and Perspectives on Open Problems. arXiv 2020.
Kumar, Gupta, Levine. DisCor: Corrective Feedback in RL via Distribution Correction. NeurIPS 2020.
Kumar, Gupta. Does On-Policy Data Collection Fix Errors in Off-Policy Reinforcement Learning?, BAIR blog.
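A small sketch of the Boltzmann exploration rule above (the helper name and temperature parameter are illustrative additions). Online, sampling from this policy visits the overoptimistic $(s, a)$ pairs, observes their true rewards, and corrects the Q-values; offline, that corrective data never arrives.

```python
# pi_explore(a|s) proportional to exp(Q(s, a)); temperature added for generality.
import numpy as np

def boltzmann_policy(q_values, temperature=1.0):
    logits = q_values / temperature
    logits -= logits.max()          # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

q_s = np.array([1.0, 5.0, 2.0])     # Q(s, .) with an (over)optimistic action 1
print(boltzmann_policy(q_s))        # action 1 gets most of the probability mass
```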
Distributional Shift in Offline RL
• Distribution shift between the behavior policy (the policy that collected the data), $\pi_\beta(a|s)$, and the policy during learning, $\pi(a|s) \neq \pi_\beta(a|s)$.
• The backup queries actions from the learned policy:
$$Q(s, a) \leftarrow r(s, a) + \gamma \max_{a'} Q(s', a') \qquad \text{or} \qquad Q(s, a) \leftarrow r(s, a) + \gamma\, \mathbb{E}_{a' \sim \pi(a'|s')}\!\left[Q(s', a')\right]$$
• But training only covers actions from the behavior policy, $a \sim \pi_\beta(a|s)$:
$$\text{Training: } \mathbb{E}_{s, a \sim d^{\pi_\beta}(s, a)}\!\left[\left(Q(s, a) - \mathcal{B}\bar{Q}(s, a)\right)^2\right]$$
Offline Q-learning algorithms can overestimate the value of unseen actions and can thus be falsely optimistic (a small numerical illustration follows below).
Kumar, Fu, Tucker, Levine. Stabilizing Off-Policy RL via Bootstrapping Error Reduction, NeurIPS 2019.
Levine, Kumar, Tucker, Fu. Offline RL Tutorial and Perspectives on Open Problems. arXiv 2020.
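A small numerical illustration (not from the slides; the Gaussian noise model is an assumption) of why backing up through many noisily-estimated, out-of-distribution actions is biased upward: $\mathbb{E}[\max_a \hat{Q}(s,a)] \geq \max_a \mathbb{E}[\hat{Q}(s,a)]$.

```python
# The true Q-value of every action is 0, but the max over noisy estimates is ~1.5.
import numpy as np

rng = np.random.default_rng(0)
true_q = np.zeros(10)                            # true Q(s, a) = 0 for 10 actions
q_hat = true_q + rng.normal(size=(10000, 10))    # noisy estimates, 10000 trials

print(q_hat.max(axis=1).mean())   # ~1.54 >> 0: the backed-up target is overoptimistic
print(q_hat.mean(axis=1).mean())  # ~0.0: each individual estimate is unbiased
```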
Error Compounds in RL (Additional Slide)
[Figure: typical cartoon showing "error compounding" in RL.]
Error compounding over the horizon magnifies a small error into a big one (see the informal bound below). Recent work has also shown counterexamples indicating that we can't do better.
Janner, Fu, Zhang, Levine. When to Trust Your Model: Model-Based Policy Optimization. NeurIPS 2019.
Ross, Gordon, Bagnell. A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning. AISTATS 2011.
Levine, Kumar, Tucker, Fu. Offline RL Tutorial and Perspectives on Open Problems. arXiv 2020.
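An informal statement of the standard compounding argument, in the spirit of Ross, Gordon, Bagnell (2011); this is a sketch, not the slide's exact claim:

```latex
% If the learned policy makes a mistake with probability at most \epsilon at
% each step, a mistake at step t can put the agent in states the data never
% covers for the remaining steps, so the performance gap can grow quadratically
% in the horizon H:
J(\pi_\beta) - J(\pi) \;\le\; \sum_{t=1}^{H} \epsilon\,(H - t + 1) \;=\; O(\epsilon H^2).
```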
Part 2: Deep RL Algorithms to Address Distribution Shift
Addressing Distribution Shift via Pessimism
$$Q(s, a) \;\leftarrow\; r(s, a) + \gamma\, \mathbb{E}_{a' \sim \pi_\phi(a'|s')}\!\left[Q(s', a')\right]$$
"Policy constraint":
$$\pi_\phi := \arg\max_{\phi}\; \mathbb{E}_{a \sim \pi_\phi(a|s)}\!\left[Q(s, a)\right] \quad \text{s.t.} \quad D\!\left(\pi_\phi(a|s),\, \pi_\beta(a|s)\right) \leq \varepsilon$$
• Out-of-distribution action values are no longer used for the backup.
• Hence, all values used during training, $\mathbb{E}_{a' \sim \pi_\phi(a'|s')}[Q(s', a')]$, are also trained, leading to better learning (see the sketch below).
Levine, Kumar, Tucker, Fu. Offline RL Tutorial and Perspectives on Open Problems. arXiv 2020.
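One common way to instantiate such a constraint (an illustrative design choice, not necessarily the method on this slide) is to relax the hard divergence constraint into a KL penalty on the actor objective. A minimal sketch, assuming differentiable `policy`, `behavior_policy`, and `qf` callables:

```python
# Constrained policy update sketch: maximize Q under pi_phi while penalizing
# divergence from pi_beta. All names and the fixed penalty weight are illustrative.
import torch

def constrained_actor_loss(policy, behavior_policy, qf, states, alpha=1.0):
    dist = policy(states)                     # pi_phi(. | s)
    actions = dist.rsample()                  # reparameterized sample
    q_values = qf(states, actions)
    # Sample-based estimate of D(pi_phi || pi_beta) at the sampled actions.
    kl = dist.log_prob(actions) - behavior_policy(states).log_prob(actions)
    # Minimizing this loss ascends Q and descends the divergence penalty.
    return (-q_values + alpha * kl).mean()

# Tiny runnable example with 1-D Gaussian policies and a toy Q-function:
states = torch.zeros(4, 3)
policy = lambda s: torch.distributions.Normal(torch.zeros(4), torch.ones(4))
behavior_policy = lambda s: torch.distributions.Normal(torch.zeros(4), 2 * torch.ones(4))
qf = lambda s, a: -(a ** 2)
print(constrained_actor_loss(policy, behavior_policy, qf, states))
```

In practice, the penalty weight can be tuned or treated as a Lagrange multiplier so that $D(\pi_\phi, \pi_\beta) \leq \varepsilon$ is enforced more directly.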