An Optimistic Perspective on Offline Reinforcement Learning
(How I Learned To Stop Worrying And Love Offline RL)
Rishabh Agarwal, Dale Schuurmans, Mohammad Norouzi
What makes Deep Learning Successful?
● Expressive function approximators
● Powerful learning algorithms
● Large and diverse datasets
How to make Deep RL similarly successful?
● Expressive function approximators
● Good learning algorithms, e.g., actor-critic, approximate DP
● Large and diverse datasets? Standard deep RL instead relies on interactive environments and active data collection.
RL for the Real World: RL with Large Datasets
● Robotics, e.g., RoboNet [1]
● Self-driving cars, e.g., BDD100K [2]
● Recommender systems
[1] Dasari, Ebert, Tian, Nair, Bucher, Schmeckpeper, ..., Finn. RoboNet: Large-Scale Multi-Robot Learning.
[2] Yu, Xian, Chen, Liu, Liao, Madhavan, Darrell. BDD100K: A Large-scale Diverse Driving Video Database.
Offline RL: A Data-Driven RL Paradigm
Offline RL can help:
● Pretrain agents on existing logged data.
● Evaluate RL algorithms on the basis of exploitation alone, on common datasets.
● Deliver real-world impact.
Image source: Data-Driven Deep Reinforcement Learning, BAIR Blog. https://bair.berkeley.edu/blog/2019/12/05/bear/
But ... Offline RL is Hard!
● No new corrective feedback!
● Requires counterfactual generalization.
● Fully off-policy learning, combined with function approximation and bootstrapping (learning a guess from a guess).
Standard RL fails in the offline setting ...
Can standard off-policy RL succeed in the offline setting?
Offline RL on Atari 2600
200 million frames (standard protocol). Train 5 DQN (Nature) agents on each Atari game using sticky actions (for stochasticity).
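Sticky actions make the environment stochastic by sometimes repeating the agent's previous action. A minimal wrapper sketch, assuming a Gym-style env.step interface; the class itself is illustrative, and 0.25 is the standard repeat probability from Machado et al. (2018):

```python
import numpy as np

class StickyActionEnv:
    """With probability p, ignore the chosen action and repeat the last one."""

    def __init__(self, env, p=0.25, seed=0):
        self.env, self.p = env, p
        self.rng = np.random.default_rng(seed)
        self.last_action = 0

    def step(self, action):
        if self.rng.random() < self.p:
            action = self.last_action  # the "sticky" repeat
        self.last_action = action
        return self.env.step(action)
```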
Offline RL on Atari 2600
Save all of the (observation, action, next observation, reward) tuples encountered during training to the DQN-replay dataset(s).
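A rough sketch of that logging step; the class name and on-disk format here are illustrative, not the released DQN-replay format:

```python
import numpy as np

class TransitionLogger:
    """Append every transition an online DQN agent encounters."""

    def __init__(self):
        self.transitions = []

    def log(self, obs, action, next_obs, reward):
        self.transitions.append((obs, action, next_obs, reward))

    def save(self, path):
        # One dataset per training run; data from 5 runs is saved per game.
        np.save(path, np.array(self.transitions, dtype=object),
                allow_pickle=True)
```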
Offline RL on Atari 2600
Train off-policy agents using the DQN-replay dataset(s) without any further environment interaction.
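The training loop itself is ordinary Q-learning; the only change is that updates use minibatches from the fixed dataset and the environment is never queried. A runnable toy version with a tabular Q-function and synthetic transitions standing in for the real data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy fixed dataset of transitions for a 5-state, 2-action MDP
# (a stand-in for the DQN-replay dataset).
N, S, A = 1000, 5, 2
s, a = rng.integers(0, S, N), rng.integers(0, A, N)
r, s2 = rng.normal(size=N), rng.integers(0, S, N)

Q = np.zeros((S, A))
gamma, lr = 0.99, 0.1
for _ in range(10_000):
    i = rng.integers(0, N)                  # sample from fixed data only
    target = r[i] + gamma * Q[s2[i]].max()  # bootstrapped TD target
    Q[s[i], a[i]] += lr * (target - Q[s[i], a[i]])  # no env.step() anywhere
```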
Does Offline DQN work?
Let's try recent off-policy algorithms!
Distributional RL uses Z(s, a), a distribution over returns, instead of the Q-function.
[Diagram: QR-DQN: a shared neural network maps a state to K return quantiles Z(1/K), Z(2/K), ..., Z(K/K) per action.]
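For reference, a sketch of the quantile-regression Huber loss that QR-DQN minimizes; shapes and the final averaging convention are simplified here:

```python
import numpy as np

def quantile_huber_loss(pred_quantiles, target_samples, kappa=1.0):
    """Quantile regression Huber loss for one (s, a) pair.

    pred_quantiles: (K,) predicted quantiles theta_i of Z(s, a)
    target_samples: (M,) samples of the TD target r + gamma * Z(s', a*)
    """
    K = len(pred_quantiles)
    taus = (np.arange(K) + 0.5) / K  # quantile midpoints tau_i
    u = target_samples[None, :] - pred_quantiles[:, None]  # (K, M) TD errors
    huber = np.where(np.abs(u) <= kappa,
                     0.5 * u ** 2,
                     kappa * (np.abs(u) - 0.5 * kappa))
    # Asymmetric weighting pulls theta_i toward the tau_i-th target quantile.
    weight = np.abs(taus[:, None] - (u < 0))
    return (weight * huber / kappa).mean()

# Q-values are recovered as the quantile mean: Q(s, a) = mean_i theta_i.
print(quantile_huber_loss(np.linspace(-1, 1, 5), np.array([0.2, 0.4, 0.1])))
```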
Does Offline QR-DQN work?
Offline DQN (Nature) vs Offline C51
Average online scores of C51 and DQN (Nature) agents trained offline on the DQN-replay dataset for the same number of gradient steps as online DQN. The horizontal line shows the performance of fully-trained DQN.
Developing Robust Offline RL Algorithms
➢ Emphasis on generalization
○ Given a fixed dataset, generalize to unseen states during evaluation.
➢ Ensemble of Q-estimates
○ Ensembling and dropout are widely used for improving generalization.
Ensemble-DQN
Train multiple (linear) Q-heads with different random initializations.
[Diagram: Ensemble-DQN: a shared neural network feeds K linear heads Q_1, Q_2, ..., Q_K, each predicting action-values.]
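A minimal sketch of the idea, with random features standing in for the shared network's output; sizes and names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
K, D, A = 4, 16, 6  # heads, feature dim, actions (illustrative sizes)

# K linear Q-heads on a shared representation, each with its own random
# initialization; all heads are trained on the same TD loss.
heads = [rng.normal(scale=0.1, size=(D, A)) for _ in range(K)]

def ensemble_q(features):
    # Average the per-head estimates for acting and evaluation.
    return np.mean([features @ W for W in heads], axis=0)

phi = rng.normal(size=D)  # stand-in for the shared network's features
print(ensemble_q(phi))    # (A,) averaged Q-values
```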
Does Offline Ensemble-DQN work?
Developing Robust Offline RL Algorithms
➢ Q-learning as constraint satisfaction
Random Ensemble Mixture (REM)
Minimize the TD error on a random (per minibatch) convex combination ∑_i α_i Q_i of multiple Q-estimates.
[Diagram: REM: a shared neural network feeds K Q-heads Q_1, Q_2, ..., Q_K, mixed with random weights α_i into ∑_i α_i Q_i.]
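A runnable sketch of the REM update on one minibatch, assuming per-head Q-values have already been computed by the online and target networks; the paper uses a Huber loss and terminal masking, both simplified to a squared error here:

```python
import numpy as np

rng = np.random.default_rng(0)

def rem_loss(q_heads, next_q_heads, actions, rewards, gamma=0.99):
    """TD loss on a random convex combination of K Q-heads.

    q_heads:      (K, B, A) online-network Q_i(s, a) for the minibatch
    next_q_heads: (K, B, A) target-network Q_i(s', a)
    """
    alpha = rng.random(q_heads.shape[0])
    alpha /= alpha.sum()  # random point on the simplex, drawn per minibatch
    q = np.tensordot(alpha, q_heads, axes=1)            # (B, A)
    q_next = np.tensordot(alpha, next_q_heads, axes=1)  # same alpha in target
    chosen = q[np.arange(q.shape[0]), actions]
    target = rewards + gamma * q_next.max(axis=1)
    return np.mean((chosen - target) ** 2)  # squared TD error on the mixture

# Random numbers standing in for network outputs:
K, B, A = 4, 32, 6
print(rem_loss(rng.normal(size=(K, B, A)), rng.normal(size=(K, B, A)),
               rng.integers(0, A, B), rng.normal(size=B)))
```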
REM vs QR-DQN
[Diagram: QR-DQN's shared network outputs K return quantiles Z(1/K), ..., Z(K/K) per action; REM's shared network outputs K Q-heads Q_1, ..., Q_K, mixed into ∑_i α_i Q_i.]
Offline Stochastic Atari Results
Scores averaged over 5 runs of offline agents trained on the DQN-replay dataset across 60 Atari games, for 5x the gradient steps of online DQN. Offline REM surpasses the gains from online C51 and offline QR-DQN.
Offline REM vs. Baselines
Reviewers asked: Does Online REM work?
Average normalized scores of online agents trained for 200 million game frames. Multi-network REM with 4 Q-functions performs comparably to QR-DQN.
Key Factor in Success: Offline Dataset Size
Randomly subsample N% of the frames from the 200 million frames for offline training. With only 1% of the data, prolonged training diverges!
Key Factor in Success: Offline Dataset Composition
Subsample the first 10% of total frames (20 million) for offline training: much lower quality data.