An Optimistic Perspective on Offline Reinforcement Learning
(How I Learned To Stop Worrying And Love Offline RL)
Rishabh Agarwal, Dale Schuurmans, Mohammad Norouzi
What makes Deep Learning Successful?
● Expressive function approximators
● Powerful learning algorithms
● Large and diverse datasets
How to make Deep RL similarly successful?
● Expressive function approximators
● Good learning algorithms, e.g., actor-critic, approximate DP
● Large and diverse datasets? Standard deep RL instead relies on interactive environments and active data collection.
RL for the Real World: RL with Large Datasets
● Robotics, e.g., RoboNet [1]
● Self-driving cars, e.g., BDD100K [2]
● Recommender systems
[1] Dasari, Ebert, Tian, Nair, Bucher, Schmeckpeper, ..., Finn. RoboNet: Large-Scale Multi-Robot Learning.
[2] Yu, Xian, Chen, Liu, Liao, Madhavan, Darrell. BDD100K: A Large-scale Diverse Driving Video Database.
Offline RL: A Data-Driven RL Paradigm
Offline RL can help:
● Pretrain agents on existing logged data.
● Evaluate RL algorithms on the basis of exploitation alone, on common datasets.
● Deliver real-world impact.
Image source: Data-Driven Deep Reinforcement Learning, BAIR Blog. https://bair.berkeley.edu/blog/2019/12/05/bear/
But ... Offline RL is Hard!
● No new corrective feedback!
● Requires counterfactual generalization.
● Fully off-policy learning, combined with function approximation and bootstrapping (learning a guess from a guess).
Standard RL fails in the offline setting ...
Can standard off-policy RL succeed in the offline setting?
Offline RL on Atari 2600
200 million frames (standard protocol). Train 5 DQN (Nature) agents on each Atari game using sticky actions (for stochasticity).
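Sticky actions make the environment stochastic by sometimes repeating the agent's previous action. A minimal wrapper sketch, assuming a Gym-style env.step interface; the class itself is illustrative, and 0.25 is the standard repeat probability from Machado et al. (2018):

```python
import numpy as np

class StickyActionEnv:
    """With probability p, ignore the chosen action and repeat the last one."""

    def __init__(self, env, p=0.25, seed=0):
        self.env, self.p = env, p
        self.rng = np.random.default_rng(seed)
        self.last_action = 0

    def step(self, action):
        if self.rng.random() < self.p:
            action = self.last_action  # the "sticky" repeat
        self.last_action = action
        return self.env.step(action)
```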
Offline RL on Atari 2600
Save all of the (observation, action, next observation, reward) tuples encountered during training to the DQN-replay dataset(s).
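A rough sketch of that logging step; the class name and on-disk format here are illustrative, not the released DQN-replay format:

```python
import numpy as np

class TransitionLogger:
    """Append every transition an online DQN agent encounters."""

    def __init__(self):
        self.transitions = []

    def log(self, obs, action, next_obs, reward):
        self.transitions.append((obs, action, next_obs, reward))

    def save(self, path):
        # One dataset per training run; data from 5 runs is saved per game.
        np.save(path, np.array(self.transitions, dtype=object),
                allow_pickle=True)
```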
Offline RL on Atari 2600
Train off-policy agents using the DQN-replay dataset(s) without any further environment interaction.
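The training loop itself is ordinary Q-learning; the only change is that updates use minibatches from the fixed dataset and the environment is never queried. A runnable toy version with a tabular Q-function and synthetic transitions standing in for the real data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy fixed dataset of transitions for a 5-state, 2-action MDP
# (a stand-in for the DQN-replay dataset).
N, S, A = 1000, 5, 2
s, a = rng.integers(0, S, N), rng.integers(0, A, N)
r, s2 = rng.normal(size=N), rng.integers(0, S, N)

Q = np.zeros((S, A))
gamma, lr = 0.99, 0.1
for _ in range(10_000):
    i = rng.integers(0, N)                  # sample from fixed data only
    target = r[i] + gamma * Q[s2[i]].max()  # bootstrapped TD target
    Q[s[i], a[i]] += lr * (target - Q[s[i], a[i]])  # no env.step() anywhere
```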
Does Offline DQN work?
Let's try recent off-policy algorithms!
Distributional RL uses Z(s, a), a distribution over returns, instead of the Q-function.
[Diagram: QR-DQN: a shared neural network maps a state to K return quantiles Z(1/K), Z(2/K), ..., Z(K/K) per action.]
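For reference, a sketch of the quantile-regression Huber loss that QR-DQN minimizes; shapes and the final averaging convention are simplified here:

```python
import numpy as np

def quantile_huber_loss(pred_quantiles, target_samples, kappa=1.0):
    """Quantile regression Huber loss for one (s, a) pair.

    pred_quantiles: (K,) predicted quantiles theta_i of Z(s, a)
    target_samples: (M,) samples of the TD target r + gamma * Z(s', a*)
    """
    K = len(pred_quantiles)
    taus = (np.arange(K) + 0.5) / K  # quantile midpoints tau_i
    u = target_samples[None, :] - pred_quantiles[:, None]  # (K, M) TD errors
    huber = np.where(np.abs(u) <= kappa,
                     0.5 * u ** 2,
                     kappa * (np.abs(u) - 0.5 * kappa))
    # Asymmetric weighting pulls theta_i toward the tau_i-th target quantile.
    weight = np.abs(taus[:, None] - (u < 0))
    return (weight * huber / kappa).mean()

# Q-values are recovered as the quantile mean: Q(s, a) = mean_i theta_i.
print(quantile_huber_loss(np.linspace(-1, 1, 5), np.array([0.2, 0.4, 0.1])))
```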
Does Offline QR-DQN work?
Offline DQN (Nature) vs Offline C51
Average online scores of C51 and DQN (Nature) agents trained offline on the DQN-replay dataset for the same number of gradient steps as online DQN. The horizontal line shows the performance of fully-trained DQN.
Developing Robust Offline RL Algorithms
➢ Emphasis on generalization
○ Given a fixed dataset, generalize to unseen states during evaluation.
➢ Ensemble of Q-estimates
○ Ensembling and dropout are widely used for improving generalization.
Ensemble-DQN
Train multiple (linear) Q-heads with different random initializations.
[Diagram: Ensemble-DQN: a shared neural network feeds K linear heads Q_1, Q_2, ..., Q_K, each predicting action-values.]
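A minimal sketch of the idea, with random features standing in for the shared network's output; sizes and names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
K, D, A = 4, 16, 6  # heads, feature dim, actions (illustrative sizes)

# K linear Q-heads on a shared representation, each with its own random
# initialization; all heads are trained on the same TD loss.
heads = [rng.normal(scale=0.1, size=(D, A)) for _ in range(K)]

def ensemble_q(features):
    # Average the per-head estimates for acting and evaluation.
    return np.mean([features @ W for W in heads], axis=0)

phi = rng.normal(size=D)  # stand-in for the shared network's features
print(ensemble_q(phi))    # (A,) averaged Q-values
```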
Does Offline Ensemble-DQN work?
Developing Robust Offline RL Algorithms
➢ Q-learning as constraint satisfaction
Random Ensemble Mixture (REM)
Minimize the TD error on a random (per minibatch) convex combination ∑_i α_i Q_i of multiple Q-estimates.
[Diagram: REM: a shared neural network feeds K Q-heads Q_1, Q_2, ..., Q_K, mixed with random weights α_i into ∑_i α_i Q_i.]
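A runnable sketch of the REM update on one minibatch, assuming per-head Q-values have already been computed by the online and target networks; the paper uses a Huber loss and terminal masking, both simplified to a squared error here:

```python
import numpy as np

rng = np.random.default_rng(0)

def rem_loss(q_heads, next_q_heads, actions, rewards, gamma=0.99):
    """TD loss on a random convex combination of K Q-heads.

    q_heads:      (K, B, A) online-network Q_i(s, a) for the minibatch
    next_q_heads: (K, B, A) target-network Q_i(s', a)
    """
    alpha = rng.random(q_heads.shape[0])
    alpha /= alpha.sum()  # random point on the simplex, drawn per minibatch
    q = np.tensordot(alpha, q_heads, axes=1)            # (B, A)
    q_next = np.tensordot(alpha, next_q_heads, axes=1)  # same alpha in target
    chosen = q[np.arange(q.shape[0]), actions]
    target = rewards + gamma * q_next.max(axis=1)
    return np.mean((chosen - target) ** 2)  # squared TD error on the mixture

# Random numbers standing in for network outputs:
K, B, A = 4, 32, 6
print(rem_loss(rng.normal(size=(K, B, A)), rng.normal(size=(K, B, A)),
               rng.integers(0, A, B), rng.normal(size=B)))
```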
REM vs QR-DQN
[Diagram: QR-DQN's shared network outputs K return quantiles Z(1/K), ..., Z(K/K) per action; REM's shared network outputs K Q-heads Q_1, ..., Q_K, mixed into ∑_i α_i Q_i.]
Offline Stochastic Atari Results
Scores averaged over 5 runs of offline agents trained on the DQN-replay dataset across 60 Atari games, for 5x the gradient steps of online DQN. Offline REM surpasses the gains from online C51 and offline QR-DQN.
Offline REM vs. Baselines
Reviewers asked: Does Online REM work?
Average normalized scores of online agents trained for 200 million game frames. Multi-network REM with 4 Q-functions performs comparably to QR-DQN.
Key Factor in Success: Offline Dataset Size
Randomly subsample N% of the frames from the 200 million frames for offline training. With only 1% of the data, prolonged training diverges!
Key Factor in Success: Offline Dataset Composition
Subsample the first 10% of total frames (20 million) for offline training: much lower quality data.