GradientDICE: Rethinking Generalized Offline Estimation of Stationary Values
Shangtong Zhang¹, Bo Liu², Shimon Whiteson¹
¹University of Oxford  ²Auburn University
Preview
• Off-policy evaluation with density ratio learning
• Use the Perron-Frobenius theorem to reduce the constraints from 3 to 2, removing the positiveness constraint and making the problem convex in both tabular and linear settings
• A special weighted $L_2$ norm
• Improvements over DualDICE and GenDICE in tabular, linear, and neural network settings
Off-policy evaluation is to estimate the performance of a policy with off-policy data
• The target policy $\pi$
• A data set $\{s_i, a_i, r_i, s'_i\}_{i=1,\dots,N}$
• $(s_i, a_i) \sim d_\mu(s, a)$, $r_i = r(s_i, a_i)$, $s'_i \sim p(\cdot \mid s_i, a_i)$
• The performance metric $\rho_\gamma(\pi) \doteq \sum_{s,a} d_\gamma(s, a)\, r(s, a)$
• $d_\gamma(s, a) \doteq (1 - \gamma) \sum_{t=0}^{\infty} \gamma^t \Pr(S_t = s, A_t = a \mid \pi, p)$  ($\gamma < 1$)
• $d_\gamma(s, a) \doteq \lim_{t \to \infty} \Pr(S_t = s, A_t = a \mid \pi, p)$  ($\gamma = 1$)
Density ratio learning is promising for off-policy evaluation (Liu et al., 2018)
• Learn $\tau^*(s, a) \doteq \frac{d_\gamma(s, a)}{d_\mu(s, a)}$ with function approximation
• $\rho_\gamma(\pi) = \sum_{s,a} d_\mu(s, a)\, \tau^*(s, a)\, r(s, a) \approx \frac{1}{N} \sum_{i=1}^{N} \tau^*(s_i, a_i)\, r_i$
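A minimal sketch of this ratio-weighted estimator, assuming the ratios have already been learned; the synthetic data and all names here are illustrative, not from the paper's codebase.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000
rewards = rng.normal(size=N)          # r_i from the off-policy data set
tau_hat = rng.uniform(0.5, 2.0, N)    # stand-in for learned tau*(s_i, a_i)

# rho_gamma(pi) ~= (1/N) sum_i tau*(s_i, a_i) r_i
rho_hat = np.mean(tau_hat * rewards)
print(rho_hat)
```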
Density ratio satisfies a Bellman-like equation (Zhang et al., 2020)
• $D \tau^* = (1 - \gamma)\, \mu_0 + \gamma P_\pi^\top D \tau^*$
• $D \in \mathbb{R}^{N_{sa} \times N_{sa}}$, $D \doteq \mathrm{diag}(d_\mu)$
• $\tau^* \in \mathbb{R}^{N_{sa}}$
• $\mu_0 \in \mathbb{R}^{N_{sa}}$, $\mu_0(s, a) \doteq \mu_0(s)\, \pi(a \mid s)$
• $P_\pi \in \mathbb{R}^{N_{sa} \times N_{sa}}$, $P_\pi((s, a), (s', a')) \doteq p(s' \mid s, a)\, \pi(a' \mid s')$
$\gamma < 1$ is easy as it implies a unique solution (see the tabular sketch below)
• $D \tau = (1 - \gamma)\, \mu_0 + \gamma P_\pi^\top D \tau$
• $(I - \gamma P_\pi^\top)^{-1}$ exists
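A minimal tabular sketch of these definitions and the $\gamma < 1$ solve, on a randomly generated toy MDP (the sizes, seed, and all names are illustrative):

```python
import numpy as np

n_s, n_a = 2, 2
n_sa = n_s * n_a
rng = np.random.default_rng(0)

# Random toy MDP: p(s'|s,a), target policy pi(a|s), initial states mu0(s)
p = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))   # shape (s, a, s')
pi = rng.dirichlet(np.ones(n_a), size=n_s)         # shape (s, a)
mu0_s = rng.dirichlet(np.ones(n_s))

# P_pi((s,a),(s',a')) = p(s'|s,a) * pi(a'|s'), flattened index s*n_a + a
P_pi = np.zeros((n_sa, n_sa))
for s in range(n_s):
    for a in range(n_a):
        for s2 in range(n_s):
            for a2 in range(n_a):
                P_pi[s * n_a + a, s2 * n_a + a2] = p[s, a, s2] * pi[s2, a2]

mu0 = (mu0_s[:, None] * pi).reshape(n_sa)   # mu_0(s,a) = mu_0(s) pi(a|s)
d_mu = rng.dirichlet(np.ones(n_sa))         # behavior distribution
D = np.diag(d_mu)

# gamma < 1: (I - gamma P_pi^T) is invertible, so D tau* is unique
gamma = 0.9
d_gamma = (1 - gamma) * np.linalg.solve(np.eye(n_sa) - gamma * P_pi.T, mu0)
tau_star = d_gamma / d_mu

# Verify the Bellman-like equation: D tau* = (1-gamma) mu_0 + gamma P_pi^T D tau*
assert np.allclose(D @ tau_star, (1 - gamma) * mu0 + gamma * P_pi.T @ (D @ tau_star))
```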
Previous work requires three constraints for $\gamma = 1$
1. $D \tau = P_\pi^\top D \tau$
2. $D \tau \succ 0$
3. $\mathbf{1}^\top D \tau = 1$
• GenDICE (Zhang et al., 2020) considers 1 & 3 explicitly: $L(\tau) \doteq \mathrm{divergence}(D \tau, P_\pi^\top D \tau) + (1 - \mathbf{1}^\top D \tau)^2$, and implements 2 with positive function approximation (e.g., $\tau^2$, $e^\tau$), projected SGD, or stochastic mirror descent
• Mousavi et al. (2020) implements 3 with self-normalization over all state-action pairs
Previous work requires three constraints for $\gamma = 1$
1. $D \tau = P_\pi^\top D \tau$
2. $D \tau \succ 0$
3. $\mathbf{1}^\top D \tau = 1$
• The objective becomes non-convex with positive function approximation or self-normalization, even in tabular or linear settings (see the toy demonstration below)
• Projected SGD is computationally infeasible
• Stochastic mirror descent significantly reduces the capacity of the (linear) function class
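A toy demonstration of the non-convexity: with a single tabular entry, $d_\mu = 1$, and the positive parameterization $\tau = w^2$, the normalization penalty alone already violates midpoint convexity (a hypothetical one-dimensional example, not from the paper):

```python
import numpy as np

def penalty(w):
    tau = w ** 2                 # positive function approximation tau = w^2
    return (1.0 - tau) ** 2      # (1 - 1^T D tau)^2 with one entry, d_mu = 1

w1, w2 = -1.0, 1.0
mid = 0.5 * (w1 + w2)
# Convexity would require penalty(mid) <= (penalty(w1) + penalty(w2)) / 2,
# but here the left side is 1.0 and the right side is 0.0
print(penalty(mid), 0.5 * (penalty(w1) + penalty(w2)))
```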
We actually need only two constraints!
1. $D \tau = P_\pi^\top D \tau$
2. $D \tau \succ 0$
3. $\mathbf{1}^\top D \tau = 1$
• Perron-Frobenius theorem: the solution space of 1 is one-dimensional
• Either 2 or 3 is enough to guarantee a unique solution
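A minimal sketch of the Perron-Frobenius argument, reusing the toy `P_pi` and `d_mu` from the sketch above: for an irreducible chain, the eigenvalue-1 eigenspace of $P_\pi^\top$ (i.e., the solution space of constraint 1 in $x = D\tau$) is one-dimensional, and normalization then yields the unique, automatically positive solution.

```python
# Reuses P_pi and d_mu from the toy MDP above
eigvals, eigvecs = np.linalg.eig(P_pi.T)
one_eig = np.isclose(eigvals, 1.0)
assert one_eig.sum() == 1        # solution space of constraint 1 is one-dimensional

x = np.real(eigvecs[:, one_eig][:, 0])   # a basis vector for {x : x = P_pi^T x}
x = x / x.sum()                  # constraint 3 pins down the unique solution
assert np.all(x > 0)             # ... and constraint 2 then holds automatically
tau_star_avg = x / d_mu          # tau* = d_pi / d_mu for the gamma = 1 case
```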
GradientDICE considers a special weighted $L_2$ norm for the loss
• GenDICE: $L(\tau) \doteq \mathrm{divergence}((1 - \gamma)\, \mu_0 + \gamma P_\pi^\top D \tau, D \tau) + (1 - \mathbf{1}^\top D \tau)^2$ subject to $D \tau \succ 0$
• GradientDICE: $L(\tau) \doteq \|(1 - \gamma)\, \mu_0 + \gamma P_\pi^\top D \tau - D \tau\|_{D^{-1}}^2 + (1 - \mathbf{1}^\top D \tau)^2$
• Compare: the GradientTD loss uses $\|\cdot\|_D$
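The step from this primal loss to the saddle point on the next slide uses the standard Fenchel conjugate of a squared norm; the following is a sketch of that reconstruction, with the factors $\tfrac{1}{2}$ and the penalty weight $\lambda$ made explicit (the slide omits them):

```latex
\begin{align*}
\tfrac{1}{2}\,\|\delta\|_{D^{-1}}^2
  &= \max_{f \in \mathbb{R}^{N_{sa}}} \; \delta^\top f - \tfrac{1}{2}\, f^\top D f,
  \qquad \delta \doteq (1-\gamma)\,\mu_0 + \gamma P_\pi^\top D\tau - D\tau, \\
\tfrac{\lambda}{2}\,(\mathbf{1}^\top D\tau - 1)^2
  &= \max_{\eta \in \mathbb{R}} \; \lambda \bigl( \eta\,(\mathbf{1}^\top D\tau - 1) - \tfrac{\eta^2}{2} \bigr).
\end{align*}
```

Writing $\delta^\top f$ and $f^\top D f$ as expectations under $\mu_0$, $p$, and $d_\mu$ gives the sampled saddle-point objective on the next slide, where every term can be estimated from off-policy transitions.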
GradientDICE considers a special weighted $L_2$ norm for the loss
• $L(\tau) \doteq \|(1 - \gamma)\, \mu_0 + \gamma P_\pi^\top D \tau - D \tau\|_{D^{-1}}^2 + (1 - \mathbf{1}^\top D \tau)^2$
• $\min_{\tau \in \mathbb{R}^{N_{sa}}} \max_{f \in \mathbb{R}^{N_{sa}},\, \eta \in \mathbb{R}} L(\tau, \eta, f) \doteq (1 - \gamma)\, \mathbb{E}_{\mu_0}[f(s, a)] + \gamma\, \mathbb{E}_p[\tau(s, a)\, f(s', a')] - \mathbb{E}_{d_\mu}[\tau(s, a)\, f(s, a)] - \frac{1}{2} \mathbb{E}_{d_\mu}[f(s, a)^2] + \lambda \left( \mathbb{E}_{d_\mu}[\eta\, \tau(s, a) - \eta] - \frac{\eta^2}{2} \right)$
• Convergence in both tabular and linear settings with $\gamma \in [0, 1]$
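A minimal tabular sketch of single-sample gradient descent-ascent on this saddle point (descent on $\tau$, ascent on $f$ and $\eta$), reusing `P_pi`, `mu0`, `d_mu`, `p`, `pi`, and `tau_star` from the earlier toy-MDP sketch; the step size, $\lambda$, and iteration count are illustrative, and this is not the paper's reference implementation:

```python
rng = np.random.default_rng(1)
gamma, lam, lr = 0.9, 1.0, 0.05
tau = np.zeros(n_sa)   # density-ratio estimate
f = np.zeros(n_sa)     # dual variable for the D^{-1}-weighted norm
eta = 0.0              # dual variable for the normalization penalty

for _ in range(100_000):
    sa0 = rng.choice(n_sa, p=mu0)                    # (s0, a0) ~ mu_0(s) pi(a|s)
    sa = rng.choice(n_sa, p=d_mu)                    # (s, a)  ~ d_mu
    s2 = rng.choice(n_s, p=p[sa // n_a, sa % n_a])   # s' ~ p(.|s, a)
    sa2 = s2 * n_a + rng.choice(n_a, p=pi[s2])       # a' ~ pi(.|s')

    # Single-sample gradients of L(tau, eta, f), evaluated before any update
    g_f0 = 1 - gamma                     # from (1-gamma) E_{mu_0}[f]
    g_f2 = gamma * tau[sa]               # from gamma E_p[tau(s,a) f(s',a')]
    g_f = -tau[sa] - f[sa]               # from -E[tau f] - (1/2) E[f^2]
    g_eta = lam * (tau[sa] - 1 - eta)    # from lambda(E[eta tau - eta] - eta^2/2)
    g_tau = gamma * f[sa2] - f[sa] + lam * eta

    f[sa0] += lr * g_f0                  # ascent on f
    f[sa2] += lr * g_f2
    f[sa] += lr * g_f
    eta += lr * g_eta                    # ascent on eta
    tau[sa] -= lr * g_tau                # descent on tau

print(np.max(np.abs(tau - tau_star)))    # should shrink toward 0, up to SGD noise
```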
GradientDICE outperforms baselines in Boyan’s Chain (Tabular)
• 30 runs (mean + standard errors)
• Grid search for hyperparameters, e.g., learning rates from $\{4^{-6}, 4^{-5}, \dots, 4^{-1}\}$
• Tuned to minimize final prediction error
GradientDICE outperforms baselines in Boyan’s Chain (Linear)
• 30 runs (mean + standard errors)
• Grid search for hyperparameters, e.g., learning rates from $\{4^{-6}, 4^{-5}, \dots, 4^{-1}\}$
• Tuned to minimize final prediction error
GradientDICE outperforms baselines in Reacher-v2 (Neural Network)
• 30 runs (mean + standard errors)
• Grid search for hyperparameters, e.g., learning rates from $\{0.01, 0.005, 0.001\}$, penalty coefficient $\lambda$ from $\{0.1, 1\}$
• Tuned to minimize final prediction error
Thanks
• Code and Dockerfile are available at https://github.com/ShangtongZhang/DeepRL