DualDICE: Behavior-Agnostic Estimation of Discounted Stationary Distribution Corrections
Ofir Nachum,* Yinlam Chow,* Bo Dai, Lihong Li
Google Research
*Equal contribution
Reinforcement Learning
● A policy acts on an environment: an initial state s_0 is drawn from the initial state distribution β; at each step t, the policy samples an action a_t ~ π(·|s_t), the environment returns a reward r_t ~ R(·|s_t, a_t) and transitions to the next state s_{t+1} ~ T(·|s_t, a_t).
● Question: What is the value (average reward) of the policy? (See the rollout sketch below.)
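If we could interact with the environment directly, this question has a simple Monte-Carlo answer: roll out the policy and average discounted rewards. A minimal sketch follows; the interfaces env_reset, env_step, and policy are hypothetical placeholders, not anything from the paper, and the rest of the talk is precisely about the setting where such rollouts are not available.

```python
import numpy as np

def on_policy_value_estimate(env_reset, env_step, policy, gamma=0.99,
                             num_episodes=100, horizon=200):
    """Monte-Carlo estimate of the discounted per-step value
    rho(pi) = (1 - gamma) * E[sum_t gamma^t r_t],
    assuming we CAN roll out the policy in the environment (the on-policy case).

    Assumed (hypothetical) interfaces:
      env_reset()    -> s0 sampled from the initial distribution beta
      env_step(s, a) -> (s_next, r) sampled from T(.|s, a) and R(.|s, a)
      policy(s)      -> a sampled from pi(.|s)
    """
    returns = []
    for _ in range(num_episodes):
        s, ret = env_reset(), 0.0
        for t in range(horizon):  # truncate the infinite sum at a finite horizon
            a = policy(s)
            s, r = env_step(s, a)
            ret += (gamma ** t) * r
        returns.append((1.0 - gamma) * ret)
    return float(np.mean(returns))
```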
Off-policy Policy Estimation
● Want to estimate the average discounted per-step reward of a target policy π:
  ρ(π) = (1 − γ) · E[ Σ_{t≥0} γ^t r_t ],   with s_0 ~ β, a_t ~ π(·|s_t), r_t ~ R(·|s_t, a_t), s_{t+1} ~ T(·|s_t, a_t).
● Only have access to a finite experience dataset of transitions (s, a, r, s′), . . . , drawn from some unknown distribution d^D (see the data-layout sketch below).
● Don't even know the behavior policy!
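Concretely, the only input is a bag of logged transitions. A minimal sketch of that data layout (the names here are illustrative, not the paper's); note that no behavior-policy action probabilities are stored, which is exactly what "behavior-agnostic" means.

```python
from typing import List, NamedTuple
import numpy as np

class Transition(NamedTuple):
    """One logged step (s, a, r, s'). There is deliberately no field for the
    behavior policy's action probability -- DualDICE never needs pi_b(a|s)."""
    s: np.ndarray       # state
    a: np.ndarray       # action
    r: float            # reward
    s_next: np.ndarray  # next state

Dataset = List[Transition]  # a finite bag of samples from an unknown distribution d^D
```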
Reduction of OPE to Density Ratio Estimation
● Can write ρ(π) = E_{(s,a)~d^π}[ r(s,a) ], where d^π is the discounted on-policy distribution, d^π(s,a) = (1 − γ) Σ_{t≥0} γ^t Pr(s_t = s, a_t = a | s_0 ~ β, π).
● Using the importance weighting trick, we have ρ(π) = E_{(s,a)~d^D}[ w_{π/D}(s,a) · r(s,a) ], with w_{π/D}(s,a) := d^π(s,a) / d^D(s,a).
● Given a finite dataset, this corresponds to the weighted average ρ̂(π) = (1/N) Σ_{i=1}^N w_{π/D}(s_i, a_i) · r_i (see the sketch below).
● Problem reduces to estimating the weights (density ratios) w_{π/D}.
● Difficult because we don't have access to the environment and we don't have explicit knowledge of d^D(s,a), only samples.
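Once the ratios are in hand, the estimator itself is just a weighted average over the dataset. A minimal sketch, assuming the per-transition ratios have already been estimated somehow (function and variable names are illustrative); the self-normalized variant is a common practical extra rather than something the slide claims.

```python
import numpy as np

def weighted_ope_estimate(rewards, ratios):
    """Importance-weighted off-policy estimate of rho(pi).

    rewards[i] -- reward r_i of the i-th logged transition (s_i, a_i, r_i, s'_i)
    ratios[i]  -- estimated density ratio w(s_i, a_i) = d^pi(s_i, a_i) / d^D(s_i, a_i)
    """
    rewards = np.asarray(rewards, dtype=float)
    ratios = np.asarray(ratios, dtype=float)
    plain = np.mean(ratios * rewards)                       # (1/N) * sum_i w_i * r_i
    normalized = np.sum(ratios * rewards) / np.sum(ratios)  # self-normalized variant
    return plain, normalized
```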
The DualDICE Objective
● Define the zero-reward Bellman operator as B^π ν(s,a) := γ E_{s′~T(·|s,a), a′~π(·|s′)}[ ν(s′, a′) ].
● DualDICE optimizes
  min_ν  (1/2) E_{(s,a)~d^D}[ ((ν − B^π ν)(s,a))^2 ]        ← minimize squared Bellman error
         − (1 − γ) E_{s_0~β, a_0~π(·|s_0)}[ ν(s_0, a_0) ]    ← maximize initial "nu-values"
  and the Bellman residual of the optimizer recovers the ratios: (ν* − B^π ν*)(s,a) = d^π(s,a) / d^D(s,a).
● Nice: the objective is based on expectations over d^D, β, and π, all of which we have access to.
● Extension 1: Can remove the appearance of the Bellman operator from both the objective and the solution by applying the Fenchel conjugate (a sketch of the resulting saddle-point form follows below)!
● Extension 2: Can generalize this result to any convex function (not just the square)!
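To make the objective concrete, here is a minimal training-step sketch of the saddle-point form obtained via Extension 1 (applying the Fenchel conjugate of the square), in which a second function ζ replaces the inner Bellman expectation and converges to the desired ratio d^π/d^D. Everything here is an illustrative assumption rather than the paper's reference code: the batch layout, the pi_sample helper, the network sizes, and the alternating-update schedule.

```python
import torch
import torch.nn as nn

def make_net(in_dim, hidden=64):
    """Small feedforward network for nu or zeta (sizes are illustrative)."""
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

def dualdice_step(nu, zeta, nu_opt, zeta_opt, batch, pi_sample, gamma=0.99):
    """One alternating update of the saddle-point objective
       min_nu max_zeta  E_dD[(nu(s,a) - gamma*nu(s',a')) * zeta(s,a) - zeta(s,a)^2 / 2]
                        - (1 - gamma) * E_{s0~beta, a0~pi}[nu(s0, a0)].
    At the optimum, zeta(s, a) approximates the ratio d^pi(s,a) / d^D(s,a).

    Assumed (hypothetical) inputs:
      batch       -- dict of float tensors "s", "a", "s_next", "s0"
      pi_sample(s) -> actions sampled from the target policy pi(.|s), as float tensors
    """
    s, a, s_next, s0 = batch["s"], batch["a"], batch["s_next"], batch["s0"]
    a_next = pi_sample(s_next)  # a' ~ pi(.|s')
    a0 = pi_sample(s0)          # a0 ~ pi(.|s0)

    def objective():
        nu_sa = nu(torch.cat([s, a], dim=-1)).squeeze(-1)
        nu_next = nu(torch.cat([s_next, a_next], dim=-1)).squeeze(-1)
        nu_0 = nu(torch.cat([s0, a0], dim=-1)).squeeze(-1)
        z = zeta(torch.cat([s, a], dim=-1)).squeeze(-1)
        residual = nu_sa - gamma * nu_next  # single-sample Bellman residual of nu
        return (residual * z - 0.5 * z ** 2).mean() - (1.0 - gamma) * nu_0.mean()

    # Descent step on nu (minimize), then ascent step on zeta (maximize).
    nu_opt.zero_grad(); objective().backward(); nu_opt.step()
    zeta_opt.zero_grad(); (-objective()).backward(); zeta_opt.step()

# Usage sketch (dimensions ds, da and the data loader are assumed):
# nu, zeta = make_net(ds + da), make_net(ds + da)
# nu_opt = torch.optim.Adam(nu.parameters(), lr=1e-3)
# zeta_opt = torch.optim.Adam(zeta.parameters(), lr=1e-3)
# for batch in data_loader:
#     dualdice_step(nu, zeta, nu_opt, zeta_opt, batch, pi_sample)
```

After training, evaluating zeta on the logged transitions supplies the weights for the weighted-average estimator from the previous slide.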
DualDICE Results
● DualDICE accuracy during training compared to existing methods (plots in the poster).
● Poster #205, East Exhibition Hall B+C.