GradientDICE: Rethinking Generalized Offline Estimation of Stationary Values
Shangtong Zhang¹, Bo Liu², Shimon Whiteson¹
¹University of Oxford  ²Auburn University
Preview
• Off-policy evaluation with density ratio learning
• Use the Perron-Frobenius theorem to reduce the constraints from 3 to 2, removing the positiveness constraint and making the problem convex in both tabular and linear settings
• A special weighted $L_2$ norm
• Improvements over DualDICE and GenDICE in tabular, linear, and neural network settings
Off-policy evaluation is to estimate the performance of a policy with off-policy data
• The target policy $\pi$
• A data set $\{s_i, a_i, r_i, s'_i\}_{i=1,\dots,N}$
• $(s_i, a_i) \sim d_\mu(s, a)$, $r_i = r(s_i, a_i)$, $s'_i \sim p(\cdot \mid s_i, a_i)$
• The performance metric $\rho_\gamma(\pi) \doteq \sum_{s,a} d_\gamma(s, a)\, r(s, a)$
• $d_\gamma(s, a) \doteq (1 - \gamma) \sum_{t=0}^{\infty} \gamma^t \Pr(S_t = s, A_t = a \mid \pi, p)$  ($\gamma < 1$)
• $d_\gamma(s, a) \doteq \lim_{t \to \infty} \Pr(S_t = s, A_t = a \mid \pi, p)$  ($\gamma = 1$)
Density ratio learning is promising for off-policy evaluation (Liu et al., 2018)
• Learn $\tau^*(s, a) \doteq \frac{d_\gamma(s, a)}{d_\mu(s, a)}$ with function approximation
• $\rho_\gamma(\pi) = \sum_{s,a} d_\mu(s, a)\, \tau^*(s, a)\, r(s, a) \approx \frac{1}{N} \sum_{i=1}^{N} \tau^*(s_i, a_i)\, r_i$
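A minimal sketch of this ratio-weighted estimator, assuming the ratios have already been learned; the synthetic data and all names here are illustrative, not from the paper's codebase.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000
rewards = rng.normal(size=N)          # r_i from the off-policy data set
tau_hat = rng.uniform(0.5, 2.0, N)    # stand-in for learned tau*(s_i, a_i)

# rho_gamma(pi) ~= (1/N) sum_i tau*(s_i, a_i) r_i
rho_hat = np.mean(tau_hat * rewards)
print(rho_hat)
```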
Density ratio satisfies a Bellman-like equation (Zhang et al., 2020)
• $D \tau^* = (1 - \gamma)\, \mu_0 + \gamma P_\pi^\top D \tau^*$
• $D \in \mathbb{R}^{N_{sa} \times N_{sa}}$, $D \doteq \mathrm{diag}(d_\mu)$
• $\tau^* \in \mathbb{R}^{N_{sa}}$
• $\mu_0 \in \mathbb{R}^{N_{sa}}$, $\mu_0(s, a) \doteq \mu_0(s)\, \pi(a \mid s)$
• $P_\pi \in \mathbb{R}^{N_{sa} \times N_{sa}}$, $P_\pi((s, a), (s', a')) \doteq p(s' \mid s, a)\, \pi(a' \mid s')$
$\gamma < 1$ is easy as it implies a unique solution (see the tabular sketch below)
• $D \tau = (1 - \gamma)\, \mu_0 + \gamma P_\pi^\top D \tau$
• $(I - \gamma P_\pi^\top)^{-1}$ exists
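A minimal tabular sketch of these definitions and the $\gamma < 1$ solve, on a randomly generated toy MDP (the sizes, seed, and all names are illustrative):

```python
import numpy as np

n_s, n_a = 2, 2
n_sa = n_s * n_a
rng = np.random.default_rng(0)

# Random toy MDP: p(s'|s,a), target policy pi(a|s), initial states mu0(s)
p = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))   # shape (s, a, s')
pi = rng.dirichlet(np.ones(n_a), size=n_s)         # shape (s, a)
mu0_s = rng.dirichlet(np.ones(n_s))

# P_pi((s,a),(s',a')) = p(s'|s,a) * pi(a'|s'), flattened index s*n_a + a
P_pi = np.zeros((n_sa, n_sa))
for s in range(n_s):
    for a in range(n_a):
        for s2 in range(n_s):
            for a2 in range(n_a):
                P_pi[s * n_a + a, s2 * n_a + a2] = p[s, a, s2] * pi[s2, a2]

mu0 = (mu0_s[:, None] * pi).reshape(n_sa)   # mu_0(s,a) = mu_0(s) pi(a|s)
d_mu = rng.dirichlet(np.ones(n_sa))         # behavior distribution
D = np.diag(d_mu)

# gamma < 1: (I - gamma P_pi^T) is invertible, so D tau* is unique
gamma = 0.9
d_gamma = (1 - gamma) * np.linalg.solve(np.eye(n_sa) - gamma * P_pi.T, mu0)
tau_star = d_gamma / d_mu

# Verify the Bellman-like equation: D tau* = (1-gamma) mu_0 + gamma P_pi^T D tau*
assert np.allclose(D @ tau_star, (1 - gamma) * mu0 + gamma * P_pi.T @ (D @ tau_star))
```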
Previous work requires three constraints for $\gamma = 1$
1. $D \tau = P_\pi^\top D \tau$
2. $D \tau \succ 0$
3. $\mathbf{1}^\top D \tau = 1$
• GenDICE (Zhang et al., 2020) considers 1 & 3 explicitly: $L(\tau) \doteq \mathrm{divergence}(D \tau, P_\pi^\top D \tau) + (1 - \mathbf{1}^\top D \tau)^2$, and implements 2 with positive function approximation (e.g., $\tau^2$, $e^\tau$), projected SGD, or stochastic mirror descent
• Mousavi et al. (2020) implements 3 with self-normalization over all state-action pairs
Previous work requires three constraints for $\gamma = 1$
1. $D \tau = P_\pi^\top D \tau$
2. $D \tau \succ 0$
3. $\mathbf{1}^\top D \tau = 1$
• The objective becomes non-convex with positive function approximation or self-normalization, even in tabular or linear settings (see the toy demonstration below)
• Projected SGD is computationally infeasible
• Stochastic mirror descent significantly reduces the capacity of the (linear) function class
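A toy demonstration of the non-convexity: with a single tabular entry, $d_\mu = 1$, and the positive parameterization $\tau = w^2$, the normalization penalty alone already violates midpoint convexity (a hypothetical one-dimensional example, not from the paper):

```python
import numpy as np

def penalty(w):
    tau = w ** 2                 # positive function approximation tau = w^2
    return (1.0 - tau) ** 2      # (1 - 1^T D tau)^2 with one entry, d_mu = 1

w1, w2 = -1.0, 1.0
mid = 0.5 * (w1 + w2)
# Convexity would require penalty(mid) <= (penalty(w1) + penalty(w2)) / 2,
# but here the left side is 1.0 and the right side is 0.0
print(penalty(mid), 0.5 * (penalty(w1) + penalty(w2)))
```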
We actually need only two constraints!
1. $D \tau = P_\pi^\top D \tau$
2. $D \tau \succ 0$
3. $\mathbf{1}^\top D \tau = 1$
• Perron-Frobenius theorem: the solution space of 1 is one-dimensional
• Either 2 or 3 is enough to guarantee a unique solution
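A minimal sketch of the Perron-Frobenius argument, reusing the toy `P_pi` and `d_mu` from the sketch above: for an irreducible chain, the eigenvalue-1 eigenspace of $P_\pi^\top$ (i.e., the solution space of constraint 1 in $x = D\tau$) is one-dimensional, and normalization then yields the unique, automatically positive solution.

```python
# Reuses P_pi and d_mu from the toy MDP above
eigvals, eigvecs = np.linalg.eig(P_pi.T)
one_eig = np.isclose(eigvals, 1.0)
assert one_eig.sum() == 1        # solution space of constraint 1 is one-dimensional

x = np.real(eigvecs[:, one_eig][:, 0])   # a basis vector for {x : x = P_pi^T x}
x = x / x.sum()                  # constraint 3 pins down the unique solution
assert np.all(x > 0)             # ... and constraint 2 then holds automatically
tau_star_avg = x / d_mu          # tau* = d_pi / d_mu for the gamma = 1 case
```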
GradientDICE considers a special weighted $L_2$ norm for the loss
• GenDICE: $L(\tau) \doteq \mathrm{divergence}((1 - \gamma)\, \mu_0 + \gamma P_\pi^\top D \tau, D \tau) + (1 - \mathbf{1}^\top D \tau)^2$ subject to $D \tau \succ 0$
• GradientDICE: $L(\tau) \doteq \|(1 - \gamma)\, \mu_0 + \gamma P_\pi^\top D \tau - D \tau\|_{D^{-1}}^2 + (1 - \mathbf{1}^\top D \tau)^2$
• Compare: the GradientTD loss uses $\|\cdot\|_D$
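The step from this primal loss to the saddle point on the next slide uses the standard Fenchel conjugate of a squared norm; the following is a sketch of that reconstruction, with the factors $\tfrac{1}{2}$ and the penalty weight $\lambda$ made explicit (the slide omits them):

```latex
\begin{align*}
\tfrac{1}{2}\,\|\delta\|_{D^{-1}}^2
  &= \max_{f \in \mathbb{R}^{N_{sa}}} \; \delta^\top f - \tfrac{1}{2}\, f^\top D f,
  \qquad \delta \doteq (1-\gamma)\,\mu_0 + \gamma P_\pi^\top D\tau - D\tau, \\
\tfrac{\lambda}{2}\,(\mathbf{1}^\top D\tau - 1)^2
  &= \max_{\eta \in \mathbb{R}} \; \lambda \bigl( \eta\,(\mathbf{1}^\top D\tau - 1) - \tfrac{\eta^2}{2} \bigr).
\end{align*}
```

Writing $\delta^\top f$ and $f^\top D f$ as expectations under $\mu_0$, $p$, and $d_\mu$ gives the sampled saddle-point objective on the next slide, where every term can be estimated from off-policy transitions.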
GradientDICE considers a special weighted $L_2$ norm for the loss
• $L(\tau) \doteq \|(1 - \gamma)\, \mu_0 + \gamma P_\pi^\top D \tau - D \tau\|_{D^{-1}}^2 + (1 - \mathbf{1}^\top D \tau)^2$
• $\min_{\tau \in \mathbb{R}^{N_{sa}}} \max_{f \in \mathbb{R}^{N_{sa}},\, \eta \in \mathbb{R}} L(\tau, \eta, f) \doteq (1 - \gamma)\, \mathbb{E}_{\mu_0}[f(s, a)] + \gamma\, \mathbb{E}_p[\tau(s, a)\, f(s', a')] - \mathbb{E}_{d_\mu}[\tau(s, a)\, f(s, a)] - \frac{1}{2} \mathbb{E}_{d_\mu}[f(s, a)^2] + \lambda \left( \mathbb{E}_{d_\mu}[\eta\, \tau(s, a) - \eta] - \frac{\eta^2}{2} \right)$
• Convergence in both tabular and linear settings with $\gamma \in [0, 1]$
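A minimal tabular sketch of single-sample gradient descent-ascent on this saddle point (descent on $\tau$, ascent on $f$ and $\eta$), reusing `P_pi`, `mu0`, `d_mu`, `p`, `pi`, and `tau_star` from the earlier toy-MDP sketch; the step size, $\lambda$, and iteration count are illustrative, and this is not the paper's reference implementation:

```python
rng = np.random.default_rng(1)
gamma, lam, lr = 0.9, 1.0, 0.05
tau = np.zeros(n_sa)   # density-ratio estimate
f = np.zeros(n_sa)     # dual variable for the D^{-1}-weighted norm
eta = 0.0              # dual variable for the normalization penalty

for _ in range(100_000):
    sa0 = rng.choice(n_sa, p=mu0)                    # (s0, a0) ~ mu_0(s) pi(a|s)
    sa = rng.choice(n_sa, p=d_mu)                    # (s, a)  ~ d_mu
    s2 = rng.choice(n_s, p=p[sa // n_a, sa % n_a])   # s' ~ p(.|s, a)
    sa2 = s2 * n_a + rng.choice(n_a, p=pi[s2])       # a' ~ pi(.|s')

    # Single-sample gradients of L(tau, eta, f), evaluated before any update
    g_f0 = 1 - gamma                     # from (1-gamma) E_{mu_0}[f]
    g_f2 = gamma * tau[sa]               # from gamma E_p[tau(s,a) f(s',a')]
    g_f = -tau[sa] - f[sa]               # from -E[tau f] - (1/2) E[f^2]
    g_eta = lam * (tau[sa] - 1 - eta)    # from lambda(E[eta tau - eta] - eta^2/2)
    g_tau = gamma * f[sa2] - f[sa] + lam * eta

    f[sa0] += lr * g_f0                  # ascent on f
    f[sa2] += lr * g_f2
    f[sa] += lr * g_f
    eta += lr * g_eta                    # ascent on eta
    tau[sa] -= lr * g_tau                # descent on tau

print(np.max(np.abs(tau - tau_star)))    # should shrink toward 0, up to SGD noise
```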
GradientDICE outperforms baselines in Boyan’s Chain (Tabular)
• 30 runs (mean + standard errors)
• Grid search for hyperparameters, e.g., learning rates from $\{4^{-6}, 4^{-5}, \dots, 4^{-1}\}$
• Tuned to minimize final prediction error
GradientDICE outperforms baselines in Boyan’s Chain (Linear)
• 30 runs (mean + standard errors)
• Grid search for hyperparameters, e.g., learning rates from $\{4^{-6}, 4^{-5}, \dots, 4^{-1}\}$
• Tuned to minimize final prediction error
GradientDICE outperforms baselines in Reacher-v2 (Neural Network)
• 30 runs (mean + standard errors)
• Grid search for hyperparameters, e.g., learning rates from $\{0.01, 0.005, 0.001\}$, penalty coefficient $\lambda$ from $\{0.1, 1\}$
• Tuned to minimize final prediction error
Thanks
• Code and Dockerfile are available at https://github.com/ShangtongZhang/DeepRL