Off-Policy Evaluation via Off-Policy Classification


  1. Off-Policy Evaluation via Off-Policy Classification. Alex Irpan, Kanishka Rao, Konstantinos Bousmalis, Chris Harris, Julian Ibarz, Sergey Levine. Topic: Imitation / Inverse RL. Presenter: Ning (Angela) Ye

  2. Overview • Motivation • Contributions • Background • Method • Results • Limitations

  3. Overview • Motivation • Contributions • Background • Method • Results • Limitations

  4. Motivation • Typically, the performance of deep RL algorithms is evaluated via on-policy interactions • But comparing models in a real-world environment is costly • This work examines off-policy evaluation (OPE) for value-based methods

  5. Motivation (cont.) • Existing OPE metrics rely either on a model of the environment or on importance sampling (IS) • OPE is most useful in the off-policy RL setting, where we expect to use real-world data as a “validation set” • IS is hard to apply there, since it requires a known, stochastic behavior policy • For high-dimensional observations, models of the environment can be difficult to fit

  6. Overview • Motivation • Contributions • Background • Method • Results • Limitations

  7. Contributions • Framed OPE as a positive-unlabeled (PU) classification problem and developed two scores: OPC and SoftOPC • The scores rely on neither IS nor model learning • They correlate well with actual performance on both simulated and real-world tasks • They can be used with complex, high-dimensional data to evaluate the expected performance of off-policy RL methods • The proposed metrics outperform a variety of baselines, including in a simulation-to-reality transfer scenario

  8. Overview • Motivation • Contributions • Background • Method • Results • Limitations

  9. General Background (MDP) • Focus on finite-horizon Markov decision processes (MDPs) • Assume a binary reward MDP, which satisfies: • γ = 1 • Reward r_t = 0 at all intermediate steps • Final reward r_T ∈ {0, 1} • Learn Q-functions Q(s, a) to evaluate policies π(s) = argmax_a Q(s, a)
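
As a minimal sketch (not from the slides), the greedy policy induced by a learned Q-function over a finite action set could look like this; `q_fn` and `candidate_actions` are illustrative names:

```python
import numpy as np

def greedy_action(q_fn, state, candidate_actions):
    """Greedy policy pi(s) = argmax_a Q(s, a) over a finite candidate action set."""
    q_values = np.array([q_fn(state, a) for a in candidate_actions])
    return candidate_actions[int(np.argmax(q_values))]
```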

  10. General Background (Positive-Unlabeled Learning) • Positive-unlabeled (PU) learning learns binary classification from partially labeled data • It is sufficient to learn a binary classifier if the positive class prior p(y = 1) is known • The loss over negatives can be indirectly estimated from p(y = 1)

  11. General Background (Positive-Unlabeled Learning) • Want to evaluate the loss ℓ(g(x), y) over negative examples (x, y = 0) • p(x) = p(x | y = 1) p(y = 1) + p(x | y = 0) p(y = 0) • Using E_X[f(x)] = ∫ p(x) f(x) dx : E_X[f(x)] = p(y = 1) E_{X|Y=1}[f(x)] + p(y = 0) E_{X|Y=0}[f(x)] • Letting f(x) = ℓ(g(x), 0) : p(y = 0) E_{X|Y=0}[ℓ(g(x), 0)] = E_X[ℓ(g(x), 0)] − p(y = 1) E_{X|Y=1}[ℓ(g(x), 0)]
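
A small numeric sketch of the identity above, assuming samples of the loss ℓ(g(x), 0) on unlabeled data and on known positives, plus the prior p(y = 1) (the argument names are illustrative, not from the slides):

```python
import numpy as np

def negative_class_loss(loss_unlabeled, loss_positive, p_pos):
    """PU estimate of p(y=0) * E_{X|Y=0}[l(g(x), 0)].

    Uses the identity
        p(y=0) E_{X|Y=0}[l] = E_X[l] - p(y=1) E_{X|Y=1}[l],
    where E_X is estimated from the unlabeled sample and
    E_{X|Y=1} from the labeled positives.
    """
    return np.mean(loss_unlabeled) - p_pos * np.mean(loss_positive)
```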

  12. General Background (Definitions) • In a binary reward MDP, (s_t, a_t) is feasible if an optimal π* has non-zero probability of achieving success after taking a_t in s_t • (s_t, a_t) is catastrophic if even an optimal π* has zero probability of succeeding after a_t is taken • Therefore, the return of a trajectory τ is 1 only if all (s_t, a_t) in τ are feasible

  13. Overview • Motivation • Contributions • Background • Method • Results • Limitations

  14. OPE Method (Theorem) • Theorem: R(π) ≥ 1 − T(ε̄ + c) • ε̄ = (1/T) Σ_{t=1}^{T} ε_t is the average error over all (s_t, a_t), with ε_t = E_{ρ⁺_{t,π}}[ Σ_{a ∈ 𝒜₋(s_t)} π(a | s_t) ] • 𝒜₋(s): the set of catastrophic actions at state s • ρ⁺_{t,π}: the state distribution at time t, given that π was followed, all its previous actions were feasible, and s_t is feasible • c(s_t, a_t): the probability that the stochastic dynamics bring a feasible (s_t, a_t) to a catastrophic s_{t+1}, with c = max_{s,a} c(s, a)
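
For intuition only, a worked instance of the bound with made-up numbers (not from the paper):

```latex
% Horizon T = 10, average classification error \bar{\varepsilon} = 0.02,
% worst-case dynamics error c = 0.01:
R(\pi) \;\ge\; 1 - T\,(\bar{\varepsilon} + c) \;=\; 1 - 10 \cdot (0.02 + 0.01) \;=\; 0.7
```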

  15. OPE Method (Missing negative labels) • Estimate ε, the probability that π takes a catastrophic action, i.e. that (s, π(s)) is a false positive: ε = p(y = 0) E_{X|Y=0}[ℓ(g(x), 0)] • Recall p(y = 0) E_{X|Y=0}[ℓ(g(x), 0)] = E_{X,Y}[ℓ(g(x), 0)] − p(y = 1) E_{X|Y=1}[ℓ(g(x), 0)] • We obtain ε = E_{(s,a)}[ℓ(Q(s, a), 0)] − p(y = 1) E_{(s,a), y=1}[ℓ(Q(s, a), 0)]

  16. OPE Method (Off-policy classification) • Off-policy classification (OPC) score: the negative of this loss when ℓ is the 0-1 loss • SoftOPC: the negative of this loss when ℓ is the soft loss ℓ(Q(s, a), y) = (1 − 2y) Q(s, a)
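
A minimal sketch of both scores over a validation set of (s, a) pairs, assuming `q_all` holds Q(s, a) for all validation transitions, `q_pos` holds Q(s, a) for the subset from successful (feasible) episodes, `p_pos` is the positive class prior, and `b` is a chosen classification threshold for the 0-1 loss (these names and the default threshold are assumptions of this sketch):

```python
import numpy as np

def opc_score(q_all, q_pos, p_pos, b=0.5):
    """OPC: negative PU-estimated error when l is the 0-1 loss with threshold b."""
    # l(Q, 0) under the 0-1 loss is 1 when Q(s, a) is classified positive, i.e. Q > b
    eps_01 = np.mean(q_all > b) - p_pos * np.mean(q_pos > b)
    return -eps_01

def soft_opc_score(q_all, q_pos, p_pos):
    """SoftOPC: negative PU-estimated error when l(Q, y) = (1 - 2y) Q(s, a)."""
    # l(Q, 0) under the soft loss is simply Q(s, a)
    eps_soft = np.mean(q_all) - p_pos * np.mean(q_pos)
    return -eps_soft
```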

  17. OPE Method (Evaluating OPE metrics) • Standard method: report MSE against the true episode return • Our metrics do not estimate episode return directly • Instead, train many Q-functions with different learning algorithms • Evaluate the true return of the corresponding argmax policy for each Q-function • Compare the correlation of the metric with the true return • Report the coefficient of determination R² of the line of best fit and the Spearman rank correlation ρ
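
A sketch of this protocol, assuming `metric_scores` and `true_returns` are arrays gathered over the trained Q-functions (SciPy is used here for convenience):

```python
from scipy.stats import linregress, spearmanr

def correlation_with_return(metric_scores, true_returns):
    """Correlate an OPE metric with true return across many trained Q-functions."""
    fit = linregress(metric_scores, true_returns)    # line of best fit
    r_squared = fit.rvalue ** 2                      # coefficient of determination R^2
    rho, _ = spearmanr(metric_scores, true_returns)  # Spearman rank correlation
    return r_squared, rho
```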

  18. Baseline Metrics • Temporal-difference (TD) error • The standard Q-learning training loss (see the sketch below) • Discounted sum of advantages Σ_t γ^t A^π(s_t, a_t) • Relates V^{π_b}(s) − V^π(s) to the sum of advantages over data from the behavior policy π_b • Monte Carlo corrected (MCC) error • Rearranges the discounted sum of advantages into a squared error
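
A minimal sketch of the TD-error baseline on held-out transitions, for the binary-reward, γ = 1 setting described earlier (the array names are assumptions of this sketch):

```python
import numpy as np

def td_error_baseline(q_sa, q_next_max, rewards, dones, gamma=1.0):
    """Average squared TD error over validation transitions.

    q_sa:       Q(s_t, a_t) for each transition
    q_next_max: max_a' Q(s_{t+1}, a') for each transition
    rewards:    observed rewards (0 at intermediate steps, 0/1 at the final step)
    dones:      1.0 if the transition ends the episode, else 0.0
    """
    targets = rewards + gamma * (1.0 - dones) * q_next_max
    return float(np.mean((targets - q_sa) ** 2))
```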

  19. Overview • Motivation • Contributions • Background • Method • Results • Limitations

  20. Experimental Results (Simple Environments) • Performance against stochastic dynamics

  21. Experimental Results (Vision-Based Robotic Grasping) • Performance on simulated and real versions of a vision-based grasping task

  22. Discussion of results • OPC and SoftOPC consistently outperformed the baselines • SoftOPC ranks policies by real-world performance more reliably than the baselines • SoftOPC performs slightly better than OPC

  23. Overview • Motivation • Contributions • Background • Method • Results • Limitations

  24. Limitations • Key limitation: restricted task domain • Assumes an agent either succeeds or fails • Complicated tasks with long time horizons are difficult to model this way • Could not compare against many OPE baselines that use IS or model-learning techniques • High correlation on the real-world robotic grasping task, but only comparable to the sum of discounted advantages in simulation

  25. Contributions (Recap) • Evaluating performance in real-world environments is difficult and expensive • Many off-policy RL methods are value-based and do not require any knowledge of the policy that generated the real-world training data • Such methods are hard to evaluate with IS, which makes model selection difficult • Treated evaluation as a classification problem and proposed OPC and SoftOPC, negative classification losses to be used with off-policy Q-learning algorithms • Can predict the relative performance of different policies in generalization scenarios • The proposed OPE metrics outperform a variety of baselines, including in a simulation-to-reality transfer scenario

  26. Take Home Questions • What conditions must be met for the MDP to perform OPE via OPC? • What is a natural choice for the decision function? • How are the classification scores determined? Which losses are used? • Which two correlations are used to evaluate the metrics?
