GradientDICE: Rethinking Generalized Offline Estimation of Stationary Values



  1. GradientDICE: Rethinking Generalized Offline Estimation of Stationary Values
  Shangtong Zhang¹, Bo Liu², Shimon Whiteson¹
  ¹University of Oxford, ²Auburn University

  2. Preview
  • Off-policy evaluation with density ratio learning
  • Use the Perron-Frobenius theorem to reduce the constraints from 3 to 2, removing the positivity constraint and making the problem convex in both the tabular and linear settings
  • A special weighted $L_2$ norm
  • Improvements over DualDICE and GenDICE in tabular, linear, and neural network settings

  3. Off-policy evaluation estimates the performance of a target policy from off-policy data
  • The target policy $\pi$
  • A data set $\{s_i, a_i, r_i, s'_i\}_{i=1,\dots,N}$
  • $(s_i, a_i) \sim d_\mu(s, a)$, $r_i = r(s_i, a_i)$, $s'_i \sim p(\cdot \mid s_i, a_i)$
  • The performance metric $\rho_\gamma(\pi) \doteq \sum_{s,a} d_\gamma(s, a)\, r(s, a)$, where
    $d_\gamma(s, a) \doteq (1 - \gamma) \sum_{t=0}^{\infty} \gamma^t \Pr(S_t = s, A_t = a \mid \pi, p)$ for $\gamma < 1$, and
    $d_\gamma(s, a) \doteq \lim_{t \to \infty} \Pr(S_t = s, A_t = a \mid \pi, p)$ for $\gamma = 1$
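As a sanity check of these definitions, here is a minimal numpy sketch that computes $d_\gamma$ and $\rho_\gamma(\pi)$ exactly on a toy tabular MDP. All the transition numbers are made-up assumptions; the closed form $d_\gamma = (1-\gamma)(I - \gamma P_\pi^\top)^{-1}\mu_0$ follows from summing the geometric series in the definition:

```python
import numpy as np

# Hypothetical toy MDP with 3 state-action pairs, flattened so that
# P_pi[(s, a), (s', a')] is the transition matrix under the target policy.
n = 3
P_pi = np.array([[0.1, 0.6, 0.3],
                 [0.4, 0.4, 0.2],
                 [0.5, 0.2, 0.3]])
mu0 = np.array([0.5, 0.3, 0.2])   # initial state-action distribution
r = np.array([1.0, 0.0, 2.0])     # reward per state-action pair
gamma = 0.9

# d_gamma = (1 - gamma) * sum_t gamma^t (P_pi^T)^t mu0
#         = (1 - gamma) * (I - gamma P_pi^T)^{-1} mu0
d_gamma = (1 - gamma) * np.linalg.solve(np.eye(n) - gamma * P_pi.T, mu0)

rho = d_gamma @ r                 # the performance metric rho_gamma(pi)
print(d_gamma, rho)
```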

  4. Density ratio learning is promising for off-policy evaluation (Liu et al., 2018)
  • Learn $\tau_*(s, a) \doteq \frac{d_\gamma(s, a)}{d_\mu(s, a)}$ with function approximation
  • $\rho_\gamma(\pi) = \sum_{s,a} d_\mu(s, a)\, \tau_*(s, a)\, r(s, a) \approx \frac{1}{N} \sum_{i=1}^N \tau_*(s_i, a_i)\, r_i$
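Given any learned ratio, the estimator on this slide is a one-liner. A minimal sketch, where `tau` is assumed to be a learned approximation of $d_\gamma / d_\mu$ and `transitions` is the off-policy data set (both names are hypothetical):

```python
import numpy as np

def ope_estimate(tau, transitions):
    """Density-ratio OPE: average tau(s, a) * r over the behavior data.

    `tau(s, a)` is any learned approximation of d_gamma / d_mu;
    `transitions` is an iterable of (s, a, r, s') tuples drawn off-policy.
    """
    return np.mean([tau(s, a) * r for (s, a, r, s_next) in transitions])
```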

  5. The density ratio satisfies a Bellman-like equation (Zhang et al., 2020)
  • $D \tau_* = (1 - \gamma) \mu_0 + \gamma P_\pi^\top D \tau_*$
  • $D \in \mathbb{R}^{N_{sa} \times N_{sa}}$, $D \doteq \mathrm{diag}(d_\mu)$
  • $\tau_* \in \mathbb{R}^{N_{sa}}$
  • $\mu_0 \in \mathbb{R}^{N_{sa}}$, $\mu_0(s, a) \doteq \mu_0(s)\, \pi(a \mid s)$
  • $P_\pi \in \mathbb{R}^{N_{sa} \times N_{sa}}$, $P_\pi((s, a), (s', a')) \doteq p(s' \mid s, a)\, \pi(a' \mid s')$
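A minimal sketch verifying this equation numerically on the same toy MDP as above (the behavior distribution `d_mu` is another made-up assumption). Since $D\tau_* = d_\gamma$, the check reduces to $d_\gamma = (1-\gamma)\mu_0 + \gamma P_\pi^\top d_\gamma$:

```python
import numpy as np

# Same toy MDP as the earlier sketch (all numbers are hypothetical).
gamma = 0.9
P_pi = np.array([[0.1, 0.6, 0.3], [0.4, 0.4, 0.2], [0.5, 0.2, 0.3]])
mu0 = np.array([0.5, 0.3, 0.2])
d_mu = np.array([0.3, 0.3, 0.4])   # behavior state-action distribution
D = np.diag(d_mu)

# Ground-truth discounted distribution and density ratio.
d_gamma = (1 - gamma) * np.linalg.solve(np.eye(3) - gamma * P_pi.T, mu0)
tau_star = d_gamma / d_mu

# Verify D tau_* = (1 - gamma) mu0 + gamma P_pi^T D tau_*.
lhs = D @ tau_star
rhs = (1 - gamma) * mu0 + gamma * P_pi.T @ D @ tau_star
assert np.allclose(lhs, rhs)
```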

  6. $\gamma < 1$ is easy, as it implies a unique solution
  • $D \tau = (1 - \gamma) \mu_0 + \gamma P_\pi^\top D \tau$
  • $(I - \gamma P_\pi^\top)^{-1}$ exists, so the solution is unique

  7. Previous work requires three constraints for $\gamma = 1$
  1. $D \tau = P_\pi^\top D \tau$
  2. $D \tau \succ 0$
  3. $\mathbf{1}^\top D \tau = 1$
  • GenDICE (Zhang et al., 2020) considers 1 & 3 explicitly:
    $L(\tau) \doteq \mathrm{divergence}(D \tau, P_\pi^\top D \tau) + (1 - \mathbf{1}^\top D \tau)^2$
    and implements 2 with positive function approximation (e.g., $\tau^2$, $e^\tau$), projected SGD, or stochastic mirror descent
  • Mousavi et al. (2020) implement 3 with self-normalization over all state-action pairs

  8. Previous work requires three constraints for $\gamma = 1$
  1. $D \tau = P_\pi^\top D \tau$
  2. $D \tau \succ 0$
  3. $\mathbf{1}^\top D \tau = 1$
  • The objective becomes non-convex with positive function approximation or self-normalization, even in the tabular and linear settings
  • Projected SGD is computationally infeasible
  • Stochastic mirror descent significantly reduces the capacity of the (linear) function class

  9. We actually need only two constraints!
  1. $D \tau = P_\pi^\top D \tau$
  2. $D \tau \succ 0$
  3. $\mathbf{1}^\top D \tau = 1$
  • Perron-Frobenius theorem: the solution space of 1 is one-dimensional
  • Either 2 or 3 is then enough to guarantee a unique solution
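A sketch of the Perron-Frobenius argument on the toy chain from above: for an irreducible $P_\pi$, the eigenvalue 1 of $P_\pi^\top$ is simple, so constraint 1 fixes $D\tau$ up to scale, and the normalization constraint 3 then selects a unique element:

```python
import numpy as np

# Toy chain (hypothetical numbers); P_pi is irreducible, so by
# Perron-Frobenius the eigenvalue 1 of P_pi^T is simple.
P_pi = np.array([[0.1, 0.6, 0.3], [0.4, 0.4, 0.2], [0.5, 0.2, 0.3]])
vals, vecs = np.linalg.eig(P_pi.T)
ones = np.isclose(vals, 1.0)
assert ones.sum() == 1               # the solution space of 1 is 1-dimensional

# Constraint 3 (1^T D tau = 1) then selects a unique element of that space:
d = np.real(vecs[:, ones][:, 0])
d = d / d.sum()                      # the stationary d_gamma for gamma = 1
print(d)
```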

  10. GradientDICE considers a special weighted $L_2$ norm for the loss
  • GenDICE:
    $L(\tau) \doteq \mathrm{divergence}((1 - \gamma) \mu_0 + \gamma P_\pi^\top D \tau, D \tau) + (1 - \mathbf{1}^\top D \tau)^2$ subject to $D \tau \succ 0$
  • GradientDICE:
    $L(\tau) \doteq \|(1 - \gamma) \mu_0 + \gamma P_\pi^\top D \tau - D \tau\|_{D^{-1}}^2 + (1 - \mathbf{1}^\top D \tau)^2$
  • cf. the GradientTD loss, which uses $\|\cdot\|_D^2$
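To make the new objective concrete, a sketch that evaluates the GradientDICE loss in closed form on a tabular problem. The penalty weight `lam` and all inputs are assumptions for illustration; this only computes the objective, it does not train anything:

```python
import numpy as np

def gradient_dice_loss(tau, P_pi, d_mu, mu0, gamma, lam):
    """Closed-form tabular GradientDICE objective (illustrative sketch)."""
    D = np.diag(d_mu)
    delta = (1 - gamma) * mu0 + gamma * P_pi.T @ D @ tau - D @ tau
    norm_sq = delta @ (delta / d_mu)        # ||delta||^2 in the D^{-1} norm
    penalty = (1 - np.sum(D @ tau)) ** 2    # (1 - 1^T D tau)^2
    return norm_sq + lam * penalty
```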

  11. GradientDICE considers a special weighted $L_2$ norm for the loss
  • $L(\tau) \doteq \|(1 - \gamma) \mu_0 + \gamma P_\pi^\top D \tau - D \tau\|_{D^{-1}}^2 + \lambda (1 - \mathbf{1}^\top D \tau)^2$
  • Fenchel duality turns this into a saddle-point problem amenable to sampling:
    $\min_{\tau \in \mathbb{R}^{N_{sa}}} \max_{f \in \mathbb{R}^{N_{sa}},\, \eta \in \mathbb{R}} L(\tau, \eta, f) \doteq (1 - \gamma)\, \mathbb{E}_{\mu_0}[f(s, a)] + \gamma\, \mathbb{E}_p[\tau(s, a) f(s', a')] - \mathbb{E}_{d_\mu}[\tau(s, a) f(s, a)] - \tfrac{1}{2} \mathbb{E}_{d_\mu}[f(s, a)^2] + \lambda \left( \mathbb{E}_{d_\mu}[\eta \tau(s, a) - \eta] - \tfrac{\eta^2}{2} \right)$
  • Convergent in both the tabular and linear settings for $\gamma \in [0, 1]$
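A minimal tabular sketch of the resulting stochastic primal-dual updates: ascent on $(f, \eta)$, descent on $\tau$, with per-sample gradients read off the saddle-point objective above. The sampling interface and hyperparameters are assumptions, not the authors' released code:

```python
import numpy as np

def gradient_dice(sample_init, sample_transition, n_sa,
                  gamma=0.99, lam=1.0, lr=0.01, steps=100_000):
    """Tabular GradientDICE sketch.

    `sample_init()` returns the index of (s0, a0) ~ mu_0 x pi;
    `sample_transition()` returns indices of (s, a) ~ d_mu and (s', a')
    with s' ~ p(.|s, a), a' ~ pi(.|s') (hypothetical interface).
    """
    tau = np.ones(n_sa)   # density-ratio estimates
    f = np.zeros(n_sa)    # dual function
    eta = 0.0             # dual scalar for the normalization constraint

    for _ in range(steps):
        i0 = sample_init()
        i, i_next = sample_transition()

        # Ascent on (f, eta): stochastic gradients of L(tau, eta, f).
        f[i0] += lr * (1 - gamma)
        f[i_next] += lr * gamma * tau[i]
        f[i] += lr * (-tau[i] - f[i])
        eta += lr * lam * (tau[i] - 1 - eta)

        # Descent on tau.
        tau[i] -= lr * (gamma * f[i_next] - f[i] + lam * eta)

    return tau
```

With one-hot features this is the linear case covered by the convergence claim; the paper's general setting replaces the tables with linear features or neural networks.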

  12. GradientDICE outperforms baselines in Boyan's Chain (tabular)
  • 30 runs (mean + standard errors)
  • Grid search for hyperparameters, e.g., learning rates from $\{4^{-6}, 4^{-5}, \dots, 4^{-1}\}$
  • Tuned to minimize final prediction error
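The tuning protocol on this and the following slides amounts to a small grid search. A sketch, where `run_trial` is a hypothetical callable that trains once and returns the final prediction error:

```python
import itertools

def grid_search(run_trial, seeds=range(30)):
    """Pick the config minimizing mean final prediction error over seeds.

    `run_trial(lr, lam, seed)` is a hypothetical callable; the grids
    mirror the hyperparameter ranges reported on the slides.
    """
    learning_rates = [4.0 ** -k for k in range(6, 0, -1)]  # 4^-6 ... 4^-1
    lambdas = [0.1, 1.0]                                   # penalty weights
    return min(
        itertools.product(learning_rates, lambdas),
        key=lambda cfg: sum(run_trial(cfg[0], cfg[1], s) for s in seeds)
                        / len(seeds),
    )
```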

  13. GradientDICE outperforms baselines in Boyan's Chain (linear)
  • 30 runs (mean + standard errors)
  • Grid search for hyperparameters, e.g., learning rates from $\{4^{-6}, 4^{-5}, \dots, 4^{-1}\}$
  • Tuned to minimize final prediction error

  14. GradientDICE outperforms baselines in Reacher-v2 (neural network)
  • 30 runs (mean + standard errors)
  • Grid search for hyperparameters, e.g., learning rates from $\{0.01, 0.005, 0.001\}$ and the penalty coefficient from $\{0.1, 1\}$
  • Tuned to minimize final prediction error

  15. Thanks
  • Code and Dockerfile are available at https://github.com/ShangtongZhang/DeepRL
