DualDICE: Behavior-Agnostic Estimation of Discounted Stationary Distribution Corrections
Ofir Nachum,* Yinlam Chow,* Bo Dai, Lihong Li
Google Research
*Equal contribution
Reinforcement Learning
● A policy acts on an environment: an initial state s_0 is drawn from the initial state distribution β; at each step t, the policy samples an action a_t ~ π(·|s_t), the environment returns a reward r_t ~ R(·|s_t, a_t) and transitions to the next state s_{t+1} ~ T(·|s_t, a_t).
● Question: What is the value (average reward) of the policy? (See the rollout sketch below.)
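If we could interact with the environment directly, this question has a simple Monte-Carlo answer: roll out the policy and average discounted rewards. A minimal sketch follows; the interfaces env_reset, env_step, and policy are hypothetical placeholders, not anything from the paper, and the rest of the talk is precisely about the setting where such rollouts are not available.

```python
import numpy as np

def on_policy_value_estimate(env_reset, env_step, policy, gamma=0.99,
                             num_episodes=100, horizon=200):
    """Monte-Carlo estimate of the discounted per-step value
    rho(pi) = (1 - gamma) * E[sum_t gamma^t r_t],
    assuming we CAN roll out the policy in the environment (the on-policy case).

    Assumed (hypothetical) interfaces:
      env_reset()    -> s0 sampled from the initial distribution beta
      env_step(s, a) -> (s_next, r) sampled from T(.|s, a) and R(.|s, a)
      policy(s)      -> a sampled from pi(.|s)
    """
    returns = []
    for _ in range(num_episodes):
        s, ret = env_reset(), 0.0
        for t in range(horizon):  # truncate the infinite sum at a finite horizon
            a = policy(s)
            s, r = env_step(s, a)
            ret += (gamma ** t) * r
        returns.append((1.0 - gamma) * ret)
    return float(np.mean(returns))
```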
Off-policy Policy Estimation
● Want to estimate the average discounted per-step reward of a target policy π:
  ρ(π) = (1 − γ) · E[ Σ_{t≥0} γ^t r_t ],   with s_0 ~ β, a_t ~ π(·|s_t), r_t ~ R(·|s_t, a_t), s_{t+1} ~ T(·|s_t, a_t).
● Only have access to a finite experience dataset of transitions (s, a, r, s′), . . . , drawn from some unknown distribution d^D (see the data-layout sketch below).
● Don't even know the behavior policy!
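Concretely, the only input is a bag of logged transitions. A minimal sketch of that data layout (the names here are illustrative, not the paper's); note that no behavior-policy action probabilities are stored, which is exactly what "behavior-agnostic" means.

```python
from typing import List, NamedTuple
import numpy as np

class Transition(NamedTuple):
    """One logged step (s, a, r, s'). There is deliberately no field for the
    behavior policy's action probability -- DualDICE never needs pi_b(a|s)."""
    s: np.ndarray       # state
    a: np.ndarray       # action
    r: float            # reward
    s_next: np.ndarray  # next state

Dataset = List[Transition]  # a finite bag of samples from an unknown distribution d^D
```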
Reduction of OPE to Density Ratio Estimation
● Can write ρ(π) = E_{(s,a)~d^π}[ r(s,a) ], where d^π is the discounted on-policy distribution, d^π(s,a) = (1 − γ) Σ_{t≥0} γ^t Pr(s_t = s, a_t = a | s_0 ~ β, π).
● Using the importance weighting trick, we have ρ(π) = E_{(s,a)~d^D}[ w_{π/D}(s,a) · r(s,a) ], with w_{π/D}(s,a) := d^π(s,a) / d^D(s,a).
● Given a finite dataset, this corresponds to the weighted average ρ̂(π) = (1/N) Σ_{i=1}^N w_{π/D}(s_i, a_i) · r_i (see the sketch below).
● Problem reduces to estimating the weights (density ratios) w_{π/D}.
● Difficult because we don't have access to the environment and we don't have explicit knowledge of d^D(s,a), only samples.
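Once the ratios are in hand, the estimator itself is just a weighted average over the dataset. A minimal sketch, assuming the per-transition ratios have already been estimated somehow (function and variable names are illustrative); the self-normalized variant is a common practical extra rather than something the slide claims.

```python
import numpy as np

def weighted_ope_estimate(rewards, ratios):
    """Importance-weighted off-policy estimate of rho(pi).

    rewards[i] -- reward r_i of the i-th logged transition (s_i, a_i, r_i, s'_i)
    ratios[i]  -- estimated density ratio w(s_i, a_i) = d^pi(s_i, a_i) / d^D(s_i, a_i)
    """
    rewards = np.asarray(rewards, dtype=float)
    ratios = np.asarray(ratios, dtype=float)
    plain = np.mean(ratios * rewards)                       # (1/N) * sum_i w_i * r_i
    normalized = np.sum(ratios * rewards) / np.sum(ratios)  # self-normalized variant
    return plain, normalized
```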
The DualDICE Objective
● Define the zero-reward Bellman operator as B^π ν(s,a) := γ E_{s′~T(·|s,a), a′~π(·|s′)}[ ν(s′, a′) ].
● DualDICE optimizes
  min_ν  (1/2) E_{(s,a)~d^D}[ ((ν − B^π ν)(s,a))^2 ]        ← minimize squared Bellman error
         − (1 − γ) E_{s_0~β, a_0~π(·|s_0)}[ ν(s_0, a_0) ]    ← maximize initial "nu-values"
  and the Bellman residual of the optimizer recovers the ratios: (ν* − B^π ν*)(s,a) = d^π(s,a) / d^D(s,a).
● Nice: the objective is based on expectations over d^D, β, and π, all of which we have access to.
● Extension 1: Can remove the appearance of the Bellman operator from both the objective and the solution by applying the Fenchel conjugate (a sketch of the resulting saddle-point form follows below)!
● Extension 2: Can generalize this result to any convex function (not just the square)!
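To make the objective concrete, here is a minimal training-step sketch of the saddle-point form obtained via Extension 1 (applying the Fenchel conjugate of the square), in which a second function ζ replaces the inner Bellman expectation and converges to the desired ratio d^π/d^D. Everything here is an illustrative assumption rather than the paper's reference code: the batch layout, the pi_sample helper, the network sizes, and the alternating-update schedule.

```python
import torch
import torch.nn as nn

def make_net(in_dim, hidden=64):
    """Small feedforward network for nu or zeta (sizes are illustrative)."""
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

def dualdice_step(nu, zeta, nu_opt, zeta_opt, batch, pi_sample, gamma=0.99):
    """One alternating update of the saddle-point objective
       min_nu max_zeta  E_dD[(nu(s,a) - gamma*nu(s',a')) * zeta(s,a) - zeta(s,a)^2 / 2]
                        - (1 - gamma) * E_{s0~beta, a0~pi}[nu(s0, a0)].
    At the optimum, zeta(s, a) approximates the ratio d^pi(s,a) / d^D(s,a).

    Assumed (hypothetical) inputs:
      batch       -- dict of float tensors "s", "a", "s_next", "s0"
      pi_sample(s) -> actions sampled from the target policy pi(.|s), as float tensors
    """
    s, a, s_next, s0 = batch["s"], batch["a"], batch["s_next"], batch["s0"]
    a_next = pi_sample(s_next)  # a' ~ pi(.|s')
    a0 = pi_sample(s0)          # a0 ~ pi(.|s0)

    def objective():
        nu_sa = nu(torch.cat([s, a], dim=-1)).squeeze(-1)
        nu_next = nu(torch.cat([s_next, a_next], dim=-1)).squeeze(-1)
        nu_0 = nu(torch.cat([s0, a0], dim=-1)).squeeze(-1)
        z = zeta(torch.cat([s, a], dim=-1)).squeeze(-1)
        residual = nu_sa - gamma * nu_next  # single-sample Bellman residual of nu
        return (residual * z - 0.5 * z ** 2).mean() - (1.0 - gamma) * nu_0.mean()

    # Descent step on nu (minimize), then ascent step on zeta (maximize).
    nu_opt.zero_grad(); objective().backward(); nu_opt.step()
    zeta_opt.zero_grad(); (-objective()).backward(); zeta_opt.step()

# Usage sketch (dimensions ds, da and the data loader are assumed):
# nu, zeta = make_net(ds + da), make_net(ds + da)
# nu_opt = torch.optim.Adam(nu.parameters(), lr=1e-3)
# zeta_opt = torch.optim.Adam(zeta.parameters(), lr=1e-3)
# for batch in data_loader:
#     dualdice_step(nu, zeta, nu_opt, zeta_opt, batch, pi_sample)
```

After training, evaluating zeta on the logged transitions supplies the weights for the weighted-average estimator from the previous slide.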
DualDICE Results
● DualDICE accuracy during training compared to existing methods (plots in the poster).
● Poster #205, East Exhibition Hall B+C.