  1. DualDICE: Behavior-Agnostic Estimation of Discounted Stationary Distribution Corrections. Ofir Nachum,* Yinlam Chow,* Bo Dai, Lihong Li. Google Research. *Equal contribution

  2-10. Reinforcement Learning ● A policy π acts on an environment: starting from an initial state drawn from the initial state distribution β, each step samples an action from the policy, a reward from R, and a next state from T. [Diagram: the interaction loop unrolled over steps 0, 1, 2: states s_0, s_1, s_2; actions a_0, a_1, a_2; rewards r_0, r_1, r_2.] ● Question: What is the value (average reward) of the policy?
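Written out, the sampling process the diagram depicts (our rendering of the slide labels, not new material from the talk) is

    s_0 \sim \beta, \qquad a_t \sim \pi(\cdot \mid s_t), \qquad r_t \sim R(\cdot \mid s_t, a_t), \qquad s_{t+1} \sim T(\cdot \mid s_t, a_t), \qquad t = 0, 1, 2, \ldots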

  11-14. Off-policy Policy Estimation ● Want to estimate the average discounted per-step reward ρ(π) of a target policy π. ● Only have access to a finite experience dataset of transitions (s, a, r, s'), (s, a, r, s'), ... drawn from some unknown distribution d^D. ● Don't even know the behavior policy!
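The target quantity appeared on the slides only as an equation image; in the notation above it is (our reconstruction) the normalized discounted return

    \rho(\pi) \;=\; (1 - \gamma)\, \mathbb{E}\Big[\textstyle\sum_{t=0}^{\infty} \gamma^{t} r_t\Big],

with the expectation taken over trajectories generated by β, π, R, and T as in the sampling process above.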

  15-21. Reduction of OPE to Density Ratio Estimation ● Can write ρ(π) as an expectation of rewards under d^π, the discounted on-policy distribution. ● Using the importance weighting trick, this becomes an expectation under d^D weighted by the density ratio d^π/d^D. ● Given a finite dataset, this corresponds to a weighted average of the observed rewards. ● Problem reduces to estimating the weights (density ratios). ● Difficult because we don't have access to the environment and we don't have explicit knowledge of d^D(s, a), only samples.
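The equations on these slides were images; the following is a reconstruction of the three steps in standard notation (ours, so worth checking against the paper):

    d^{\pi}(s, a) \;=\; (1 - \gamma) \sum_{t=0}^{\infty} \gamma^{t}\, \Pr\big(s_t = s,\, a_t = a \mid s_0 \sim \beta,\ \pi\big)

    \rho(\pi) \;=\; \mathbb{E}_{(s, a) \sim d^{\pi}}\big[r(s, a)\big] \;=\; \mathbb{E}_{(s, a) \sim d^{D}}\Big[\frac{d^{\pi}(s, a)}{d^{D}(s, a)}\, r(s, a)\Big]

    \hat{\rho}(\pi) \;=\; \frac{1}{N} \sum_{i=1}^{N} \frac{d^{\pi}(s_i, a_i)}{d^{D}(s_i, a_i)}\, r_i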

  22-29. The DualDICE Objective ● Define the zero-reward Bellman operator (the usual Bellman backup for π, but with all rewards set to zero). ● Learn "nu-values" ν(s, a) by minimizing the squared Bellman error of ν under the dataset distribution d^D while maximizing the initial nu-values at s_0 ~ β, a_0 ~ π(·|s_0). [Diagram: the two terms of the objective illustrated on sampled states s_0 through s_4.] ● Nice: the objective is based only on expectations over d^D, β, and π, which we have access to.
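The operator and objective appeared as equation images; below is our reconstruction in the paper's standard notation, to be checked against the paper:

    \mathcal{B}^{\pi}\nu(s, a) \;:=\; \gamma\, \mathbb{E}_{s' \sim T(\cdot \mid s, a),\; a' \sim \pi(\cdot \mid s')}\big[\nu(s', a')\big]

    \min_{\nu}\; J(\nu) \;=\; \tfrac{1}{2}\, \mathbb{E}_{(s, a) \sim d^{D}}\Big[\big(\nu(s, a) - \mathcal{B}^{\pi}\nu(s, a)\big)^{2}\Big] \;-\; (1 - \gamma)\, \mathbb{E}_{s_0 \sim \beta,\; a_0 \sim \pi(\cdot \mid s_0)}\big[\nu(s_0, a_0)\big]

The minimizer's Bellman residual is exactly the desired correction:

    \nu^{*}(s, a) - \mathcal{B}^{\pi}\nu^{*}(s, a) \;=\; \frac{d^{\pi}(s, a)}{d^{D}(s, a)}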

  30-31. The DualDICE Objective (extensions) ● Extension 1: Can remove the appearance of the Bellman operator from both the objective and the solution by applying the Fenchel conjugate! ● Extension 2: Can generalize this result to any convex function (not just the square)!
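No code accompanies the talk; the snippet below is only our own minimal tabular sanity check of the quadratic objective from slides 22-29. It assumes a small, fully known toy MDP (so it is not behavior-agnostic), invents the helper discounted_dist, and solves the objective's normal equations in closed form to confirm that the optimal Bellman residual recovers the ratios d^π/d^D.

import numpy as np

rng = np.random.default_rng(0)

# Toy MDP used only for checking the identity; the sizes, dynamics, and the
# helper name discounted_dist are our own assumptions, not part of the talk.
S, A, gamma = 5, 2, 0.95
T = rng.dirichlet(np.ones(S), size=(S, A))    # T[s, a] = next-state distribution
beta = rng.dirichlet(np.ones(S))              # initial state distribution
pi = rng.dirichlet(np.ones(A), size=S)        # target policy pi[s] over actions
mu = rng.dirichlet(np.ones(A), size=S)        # behavior policy (builds d^D below)

def discounted_dist(policy):
    """Exact discounted (s, a) occupancy for a known tabular MDP."""
    # P[(s, a) -> (s', a')] = T[s, a, s'] * policy[s', a']
    P = np.einsum('sap,pb->sapb', T, policy).reshape(S * A, S * A)
    d0 = (beta[:, None] * policy).reshape(S * A)   # initial (s, a) distribution
    d = (1 - gamma) * np.linalg.solve(np.eye(S * A) - gamma * P.T, d0)
    return d, P

d_pi, P_pi = discounted_dist(pi)   # unknown in the real problem; known here for checking
d_D, _ = discounted_dist(mu)       # stands in for the (unknown) dataset distribution

# Tabular form of the quadratic objective:
#   J(nu) = 1/2 nu' M' diag(d_D) M nu - (1 - gamma) mu0' nu,
# where M nu = nu - B^pi nu is the zero-reward Bellman residual.
residual_op = np.eye(S * A) - gamma * P_pi
mu0 = (beta[:, None] * pi).reshape(S * A)
nu_star = np.linalg.solve(residual_op.T @ np.diag(d_D) @ residual_op, (1 - gamma) * mu0)

# The optimal residual equals the density ratio d^pi / d^D, so re-weighting the
# dataset distribution by it recovers the discounted on-policy distribution.
w_hat = residual_op @ nu_star
print(np.allclose(w_hat * d_D, d_pi))   # expected: True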

  32-33. DualDICE Results ● DualDICE accuracy during training compared to existing methods. ● Poster #205, East Exhibition Hall B+C.
