Provably Convergent Two-Timescale Off-Policy Actor-Critic with Function Approximation
Shangtong Zhang¹, Bo Liu², Hengshuai Yao³, Shimon Whiteson¹
¹University of Oxford  ²Auburn University  ³Huawei
Preview
• Off-policy control under the excursion objective $\sum_s d_\mu(s) v_\pi(s)$
• The first provably convergent two-timescale off-policy actor-critic algorithm with function approximation
• A new perspective on Emphatic TD (Sutton et al., 2016)
• Convergence of regularized GTD-style algorithms under a changing target policy
The excursion objective is commonly used for off-policy control
• $J(\pi) = \sum_s d_\mu(s)\, i(s)\, v_\pi(s)$
• $d_\mu$: stationary distribution of the behaviour policy $\mu$
• $v_\pi$: value function of the target policy $\pi$
• $i: \mathcal{S} \to [0, \infty)$: the interest function (Sutton et al., 2016)
The off-policy policy gradient theorem gives the exact gradient (Imani et al., 2018)
• $\nabla J(\pi) = \sum_s \bar{m}(s) \sum_a q_\pi(s, a)\, \nabla \pi(a \mid s)$
• $\bar{m} \doteq (I - \gamma P_\pi^\top)^{-1} D i \in \mathbb{R}^{N_s}$, where $D = \mathrm{diag}(d_\mu)$
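As a sanity check, $J(\pi)$ and the state weighting $\bar{m}$ can be computed exactly in a small tabular problem. A minimal numpy sketch, where the 3-state transition matrix `P_pi`, behaviour distribution `d_mu`, rewards `r_pi`, and interest are made-up numbers for illustration (the exact gradient then weights each state's policy-gradient term by $\bar{m}(s)$):

```python
import numpy as np

# Hypothetical 3-state MDP under the target policy pi.
n_states, gamma = 3, 0.9
P_pi = np.array([[0.1, 0.6, 0.3],          # state-to-state transitions under pi
                 [0.4, 0.4, 0.2],
                 [0.3, 0.3, 0.4]])
d_mu = np.array([0.5, 0.3, 0.2])           # stationary distribution of the behaviour policy
interest = np.ones(n_states)               # interest function i(s) (uniform here)
r_pi = np.array([1.0, 0.0, 2.0])           # expected reward under pi
D = np.diag(d_mu)

# v_pi = (I - gamma P_pi)^{-1} r_pi
v_pi = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
# Excursion objective J(pi) = sum_s d_mu(s) i(s) v_pi(s)
J = d_mu @ (interest * v_pi)
# m_bar = (I - gamma P_pi^T)^{-1} D i
m_bar = np.linalg.solve(np.eye(n_states) - gamma * P_pi.T, D @ interest)
print(J, m_bar)
```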
Rewriting the gradient gives a taxonomy of previous algorithms
• $\nabla_\theta J(\pi) = \mathbb{E}_{s \sim d_\mu,\, a \sim \mu(\cdot \mid s)}[m_\pi(s)\, \rho_\pi(s, a)\, q_\pi(s, a)\, \nabla_\theta \log \pi(a \mid s)]$
• $m_\pi \doteq D^{-1}(I - \gamma P_\pi^\top)^{-1} D i$ (the emphasis)
1. Ignore $m_\pi(s)$ (Degris et al., 2012)
2. Use the followon trace to approximate $m_\pi(s)$ (Imani et al., 2018)
3. Learn $m_\pi(s)$ with function approximation (Ours)
Ignoring the emphasis is theoretically justified only in the tabular setting
• Gradient estimator (Degris et al., 2012): $\rho_\pi(S_t, A_t)\, q_\pi(S_t, A_t)\, \nabla_\theta \log \pi(A_t \mid S_t)$
• Off-Policy Actor-Critic (Off-PAC) extensions: off-policy DPG, DDPG, ACER, off-policy EPG, TD3, IMPALA
• Off-PAC is biased even with linear function approximation (Degris et al., 2012; Imani et al., 2018; Maei et al., 2018; Liu et al., 2019)
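For concreteness, here is a sketch of one Off-PAC actor step for a tabular softmax policy. `pi_probs`, `mu_probs`, and `q_hat` are hypothetical stand-ins for the two policies and a critic; the point is that the emphasis $m_\pi(S_t)$ is simply dropped, which is the source of the bias:

```python
import numpy as np

# Sketch of the Off-PAC gradient estimator (Degris et al., 2012) for a
# tabular softmax policy with logits theta of shape (n_states, n_actions).
def off_pac_update(theta, s_t, a_t, pi_probs, mu_probs, q_hat, lr=0.01):
    rho = pi_probs[s_t, a_t] / mu_probs[s_t, a_t]   # importance sampling ratio
    grad_log_pi = np.zeros_like(theta)              # grad of log pi(a_t | s_t) wrt theta
    grad_log_pi[s_t] = -pi_probs[s_t]
    grad_log_pi[s_t, a_t] += 1.0
    # Note: the emphasis m_pi(s_t) is ignored here, which biases the update
    # outside the tabular setting.
    return theta + lr * rho * q_hat[s_t, a_t] * grad_log_pi
```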
The followon trace is unbiased only in a limiting sense
• Gradient estimator (Imani et al., 2018): $M_t\, \rho_\pi(S_t, A_t)\, q_\pi(S_t, A_t)\, \nabla_\theta \log \pi(A_t \mid S_t)$
• Followon trace: $M_t \doteq i(S_t) + \gamma \rho_{t-1} M_{t-1}$
• Assuming $\pi$ is FIXED, $\lim_{t \to \infty} \mathbb{E}_\mu[M_t \mid S_t = s] = m_\pi(s)$
• $M_t$ is a scalar, but $m_\pi$ is a vector!
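The recursion itself is just one scalar carried along the trajectory; a minimal sketch:

```python
# Sketch of the followon trace recursion used by ACE (Imani et al., 2018).
# rho_prev is rho_pi(S_{t-1}, A_{t-1}); interest_t is i(S_t).
def followon_trace(M_prev, rho_prev, interest_t, gamma):
    # M_t = i(S_t) + gamma * rho_{t-1} * M_{t-1}
    return interest_t + gamma * rho_prev * M_prev
```

M_t is a single scalar; it only matches the vector $m_\pi$ in expectation, in the limit $t \to \infty$, and only for a fixed target policy.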
The emphasis is the fixed point of a Bellman-like operator
• $\hat{\mathbb{U}} y \doteq i + \gamma D^{-1} P_\pi^\top D y$
• $\hat{\mathbb{U}}$ is a contraction mapping w.r.t. some weighted maximum norm (for any $\gamma < 1$)
• The emphasis $m_\pi$ is its fixed point
We propose to learn the emphasis based on $\hat{\mathbb{U}}$
• A semi-gradient update based on $\hat{\mathbb{U}}$
• Gradient Temporal Difference Learning (GTD), MSPBE: $L(\nu) \doteq \|\Pi \mathbb{U} v - v\|_D^2$ ($v = X\nu$)
• Gradient Emphasis Learning (GEM): $L(w) \doteq \|\Pi \hat{\mathbb{U}} m - m\|_D^2$ ($m = Xw$)
• $\nabla_\theta J(\pi) = \mathbb{E}_{s \sim d_\mu,\, a \sim \mu(\cdot \mid s)}[m_\pi(s)\, \rho_\pi(s, a)\, q_\pi(s, a)\, \nabla_\theta \log \pi(a \mid s)]$
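Since $(\hat{\mathbb{U}} y)(s) = i(s) + \gamma\, \mathbb{E}_\mu[\rho_{t-1}\, y(S_{t-1}) \mid S_t = s]$, the GEM objective can be attacked with a GTD2-style two-timescale update on samples from the behaviour policy. Below is a rough sketch with linear features $m(s) \approx x(s)^\top w$; the paper's exact GEM update and constants may differ, and `eta` is a hypothetical ridge coefficient corresponding to the $\|w\|^2$ term on the next slide:

```python
import numpy as np

# GTD2-style sketch of a Gradient Emphasis Learning (GEM) update.
# w: emphasis weights, kappa: auxiliary weights, x_prev/x_t: feature vectors
# of S_{t-1}/S_t, interest_t = i(S_t), rho_prev = rho_pi(S_{t-1}, A_{t-1}).
def gem_update(w, kappa, x_prev, x_t, interest_t, rho_prev, gamma,
               alpha=0.01, beta=0.1, eta=0.0):
    # Sampled "emphasis TD error": target is i(S_t) + gamma * rho_{t-1} * m(S_{t-1})
    delta = interest_t + gamma * rho_prev * (x_prev @ w) - x_t @ w
    # Auxiliary weights track E[x x^T]^{-1} E[delta x]
    kappa = kappa + beta * (delta - x_t @ kappa) * x_t
    # Main weights follow the MSPBE gradient estimate, plus an optional ridge term
    w = w + alpha * ((x_t - gamma * rho_prev * x_prev) * (x_t @ kappa) - eta * w)
    return w, kappa
```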
Regularized GTD-style algorithms converge under a changing target policy
• TD converges under a changing policy (Konda's thesis), but those arguments can NOT be used to show the convergence of GTD
• Regularization has to be used for GTD-style algorithms
• GEM: $L(w) \doteq \|\Pi \hat{\mathbb{U}} Xw - Xw\|_D^2 + \|w\|^2$
• GTD: $L(\nu) \doteq \|\Pi \mathbb{U} X\nu - X\nu\|_D^2 + \|\nu\|^2$
• Regularization in GTD:
  • Optimization perspective under a fixed $\pi$: Mahadevan et al. (2014), Liu et al. (2015), Macua et al. (2015), Yu (2017), Du et al. (2017)
  • Stochastic approximation perspective under a changing $\pi$
The Convergent Off-Policy Actor-Critic (COF-PAC) algorithm
• $\nabla_\theta J(\pi) = \mathbb{E}_{s \sim d_\mu,\, a \sim \mu(\cdot \mid s)}[m_\pi(s)\, \rho_\pi(s, a)\, q_\pi(s, a)\, \nabla_\theta \log \pi(a \mid s)]$
• $L(\nu) \doteq \|\Pi \mathbb{U} X\nu - X\nu\|_D^2 + \|\nu\|^2$ and $L(w) \doteq \|\Pi \hat{\mathbb{U}} Xw - Xw\|_D^2 + \|w\|^2$
• Two-timescale instead of bi-level optimization like SBEED
• COF-PAC visits a neighbourhood of a stationary point of $J(\pi)$ infinitely many times
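A schematic single step illustrating only the two-timescale structure (critics fast, actor slow). This is a simplified sketch, not the paper's exact algorithm: the emphasis update below is a plain semi-gradient stand-in for GEM, `x_t @ nu` is a crude stand-in for a proper critic estimate of $q_\pi(S_t, A_t)$, and `pi_probs` is a hypothetical tabular target policy:

```python
import numpy as np

# Schematic COF-PAC-style step: critic weights (w for emphasis, nu for values)
# are updated with a larger step size than the actor logits theta.
def cof_pac_step(theta, w, nu, x_prev, x_t, x_next, s_t, a_t, r_next,
                 interest_t, rho_prev, rho_t, pi_probs, gamma,
                 alpha_critic=0.05, alpha_actor=0.001):
    # Fast timescale 1: emphasis weights, m_pi(s) ~= x(s)^T w
    delta_m = interest_t + gamma * rho_prev * (x_prev @ w) - x_t @ w
    w = w + alpha_critic * delta_m * x_t
    # Fast timescale 2: value weights, v_pi(s) ~= x(s)^T nu (off-policy TD(0) stand-in)
    delta_v = r_next + gamma * (x_next @ nu) - x_t @ nu
    nu = nu + alpha_critic * rho_t * delta_v * x_t
    # Slow timescale: actor follows the emphasis-weighted off-policy gradient
    grad_log_pi = np.zeros_like(theta)
    grad_log_pi[s_t] = -pi_probs[s_t]
    grad_log_pi[s_t, a_t] += 1.0
    theta = theta + alpha_actor * (x_t @ w) * rho_t * (x_t @ nu) * grad_log_pi
    return theta, w, nu
```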
GEM approximates the emphasis better than the followon trace in Baird's counterexample
• Averaged over 30 runs, mean ± std
GEM-ETD does better policy evaluation than ETD in Baird's counterexample
• ETD: $\nu_{t+1} \leftarrow \nu_t + \alpha\, M_t\, \rho_t\, (R_{t+1} + \gamma x_{t+1}^\top \nu_t - x_t^\top \nu_t)\, x_t$
• GEM-ETD: $\nu_{t+1} \leftarrow \nu_t + \alpha_2\, (w_t^\top x_t)\, \rho_t\, (R_{t+1} + \gamma x_{t+1}^\top \nu_t - x_t^\top \nu_t)\, x_t$
• Averaged over 30 runs, mean ± std
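The two value updates differ only in how the emphasis is estimated: ETD uses the scalar followon trace $M_t$, GEM-ETD the learned emphasis $w_t^\top x_t$. A side-by-side sketch with linear values $v(s) \approx x(s)^\top \nu$:

```python
import numpy as np

# ETD update: weight the off-policy TD(0) step by the followon trace M_t.
def etd_update(nu, x_t, x_next, r_next, rho_t, M_t, gamma, alpha):
    delta = r_next + gamma * (x_next @ nu) - x_t @ nu
    return nu + alpha * M_t * rho_t * delta * x_t

# GEM-ETD update: replace M_t with the learned emphasis x_t^T w.
def gem_etd_update(nu, w, x_t, x_next, r_next, rho_t, gamma, alpha2):
    delta = r_next + gamma * (x_next @ nu) - x_t @ nu
    return nu + alpha2 * (x_t @ w) * rho_t * delta * x_t
```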
COF-PAC does better control than ACE in Reacher
• Averaged over 30 runs, mean ± std
Thanks
• Code and Dockerfile are available at https://github.com/ShangtongZhang/DeepRL