Chapter 11: Off-policy Methods with Approximation
Recall that off-policy learning involves two policies:
• One policy π whose value function we are learning (the target policy)
• Another policy µ that is used to select actions (the behavior policy)
Off-policy learning is much harder with function approximation:
• Even with linear FA
• Even for prediction (two fixed policies π and µ)
• Even for dynamic programming
• The deadly triad: FA, TD, off-policy
  • Any two are OK, but not all three
  • With all three, we may get instability (elements of θ may increase to ±∞)
There are really two off-policy problems: one we know how to solve and one we are not sure about; one concerns the future, the other the present.
• The easy problem is that of off-policy targets (future)
  • We have been correcting for that since Chapters 5 and 6, using importance sampling in the target (a minimal sketch follows below)
• The hard problem is that of the distribution of states being updated (present): we are no longer updating according to the on-policy distribution
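As a reminder of the easy correction, here is a minimal sketch (not from the slide) of the per-step importance-sampling correction in a linear semi-gradient TD(0) update; the function name, arguments, and step sizes are illustrative.

```python
import numpy as np

def off_policy_td0_update(theta, phi, r, phi_next, pi_prob, b_prob,
                          gamma=0.99, alpha=0.01):
    """One importance-sampling-corrected semi-gradient TD(0) update (linear FA).

    pi_prob and b_prob are the target- and behavior-policy probabilities of
    the action actually taken on this transition.
    """
    rho = pi_prob / b_prob                              # importance sampling ratio
    delta = r + gamma * theta @ phi_next - theta @ phi  # TD error
    return theta + alpha * rho * delta * phi            # rho corrects the target only

# Example call with made-up features:
theta = off_policy_td0_update(np.zeros(3), np.array([1., 0., 0.]), 1.0,
                              np.array([0., 1., 0.]), pi_prob=1.0, b_prob=0.5)
```

Note that ρ corrects which actions' outcomes enter the target; the states being updated still follow the behavior policy's distribution, which is the hard problem.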
Baird's counterexample illustrates the instability.
[Figure: a seven-state MDP with an eight-component parameter vector θ. The values of the six upper states are approximated as 2θ_i + θ_8 (i = 1, ..., 6) and the value of the seventh state as θ_7 + 2θ_8. The target policy always takes the solid action, π(solid|·) = 1, while the behavior policy takes the dashed action with probability µ(dashed|·) = 6/7 and the solid action with probability µ(solid|·) = 1/7. The plot shows the components of θ at the end of each episode diverging under semi-gradient off-policy TD(0); the behavior is similar for DP.]
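For concreteness, a small simulation sketch of the counterexample. The transition structure, γ = 0.99, α = 0.01, and the initial θ = (1, 1, 1, 1, 1, 1, 10, 1) follow the standard presentation of Baird's example and are assumptions beyond what the slide shows.

```python
import numpy as np

np.random.seed(0)
Phi = np.zeros((7, 8))
for i in range(6):                      # upper states: value = 2*theta_i + theta_8
    Phi[i, i], Phi[i, 7] = 2.0, 1.0
Phi[6, 6], Phi[6, 7] = 1.0, 2.0         # seventh state: value = theta_7 + 2*theta_8

gamma, alpha = 0.99, 0.01
theta = np.array([1., 1., 1., 1., 1., 1., 10., 1.])
s = 0
for t in range(1, 5001):
    # Behavior policy: dashed with prob 6/7 (to a random upper state),
    # solid with prob 1/7 (to the seventh state). Target policy: always solid.
    if np.random.rand() < 6 / 7:
        a, s_next = 'dashed', np.random.randint(6)
    else:
        a, s_next = 'solid', 6
    rho = 7.0 if a == 'solid' else 0.0                           # pi(a|s) / mu(a|s)
    delta = 0.0 + gamma * Phi[s_next] @ theta - Phi[s] @ theta   # all rewards are zero
    theta = theta + alpha * rho * delta * Phi[s]                 # semi-gradient off-policy TD(0)
    s = s_next
    if t % 1000 == 0:
        print(t, round(float(np.abs(theta).max()), 1))           # components grow without bound
```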
What causes the instability?
• It has nothing to do with learning or sampling
  • Even dynamic programming suffers from divergence with FA
• It has nothing to do with exploration, greedification, or control
  • Even prediction alone can diverge
• It has nothing to do with local minima or complex non-linear approximators
  • Even simple linear approximators can produce instability
The deadly triad: the risk of divergence arises whenever we combine three things:
1. Function approximation
  • significantly generalizing from large numbers of examples
2. Bootstrapping
  • learning value estimates from other value estimates, as in dynamic programming and temporal-difference learning
3. Off-policy learning (Why is dynamic programming off-policy?)
  • learning about a policy from data not due to that policy, as in Q-learning, where we learn about the greedy policy from data generated by a necessarily more exploratory policy
Any two of the three are OK; instability arises only when all three are combined.
TD(0) can diverge: a simple example
• Consider a single transition from a state with feature φ = 1 and estimated value θ to a state with feature φ' = 2 and estimated value 2θ, with reward 0 (take γ = 1 for simplicity)
• TD error: δ = r + γθ⊤φ' − θ⊤φ = 0 + 2θ − θ = θ
• TD update: Δθ = αδφ = αθ
• If this transition is updated over and over (as can happen off-policy), θ moves away from zero: it diverges!
• Yet the TD fixpoint is θ* = 0
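To make the divergence explicit (a small worked step with γ left general; only this one transition is assumed to be updated, as can happen off-policy):

$$\theta_{t+1} \;=\; \theta_t + \alpha\bigl(0 + \gamma\,2\theta_t - \theta_t\bigr)\cdot 1 \;=\; \bigl(1 + \alpha(2\gamma - 1)\bigr)\,\theta_t,$$

so for any γ > 1/2 each update multiplies θ by a constant greater than one, and |θ_t| grows geometrically unless θ_0 = 0 (the TD fixpoint).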
Geometric intuition
• Think of the value function v_θ ≐ v̂(·, θ) as a giant vector in R^|S|
• The Bellman operator for policy π:
  (B_π v)(s) ≐ Σ_{a∈A} π(s, a) [ r(s, a) + γ Σ_{s'∈S} p(s'|s, a) v(s') ]
[Figure: the space of all value functions, containing the subspace of value functions representable as v_θ (coordinates θ_1, θ_2). The true value function v_π generally lies outside the subspace; the Value Error (VE) measures the distance from v_θ to v_π, and its minimum over the subspace is the projection Πv_π. Applying B_π to v_θ takes it out of the subspace; the Bellman Error (BE) is the distance from v_θ to B_π v_θ, and the Projected Bellman Error (PBE) is the distance from v_θ to ΠB_π v_θ. The TD fixpoint is where PBE = 0, which is generally a different point from the minimum of the BE.]
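For reference, the three error objectives in the picture can be written as follows (my reconstruction of the standard definitions, with µ the state weighting used in the norm):

$$\overline{\mathrm{VE}}(\theta) = \lVert v_\theta - v_\pi \rVert_\mu^{2}, \qquad
\overline{\mathrm{BE}}(\theta) = \lVert B_\pi v_\theta - v_\theta \rVert_\mu^{2}, \qquad
\overline{\mathrm{PBE}}(\theta) = \lVert \Pi\,(B_\pi v_\theta - v_\theta) \rVert_\mu^{2},$$

where $\lVert v \rVert_\mu^{2} = \sum_s \mu(s)\,v(s)^2$. The projection $\Pi v_\pi$ minimizes $\overline{\mathrm{VE}}$, and the TD fixpoint is the point of the subspace at which $\overline{\mathrm{PBE}} = 0$.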
Can we do without bootstrapping?
• Bootstrapping is critical to the computational efficiency of DP
• Bootstrapping is critical to the data efficiency of TD methods
• On the other hand, bootstrapping introduces bias, which harms the asymptotic performance of approximate methods
• The degree of bootstrapping can be finely controlled via the λ parameter, from λ = 0 (full bootstrapping) to λ = 1 (no bootstrapping), as written out below
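The interpolation referred to in the last bullet is the λ-return (standard definition, not spelled out on the slide):

$$G_t^{\lambda} \;=\; (1-\lambda)\sum_{n=1}^{\infty} \lambda^{\,n-1}\, G_{t:t+n},$$

so λ = 0 uses only the one-step bootstrapped target $G_{t:t+1}$, while λ = 1 recovers the full Monte Carlo return $G_t$ (no bootstrapping).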
Four examples of the effect of bootstrapping suggest that λ = 1 (no bootstrapping) is a very poor choice.
[Figure: four performance-versus-λ curves; in all cases lower is better. The red points mark the no-bootstrapping cases (λ = 1), which are much worse than the pure-bootstrapping end (λ = 0).]
We need bootstrapping!
Desiderata: we want a TD algorithm that
• Bootstraps (genuine TD)
• Works with linear function approximation (stable, reliably convergent)
• Is simple, like linear TD (O(n) computation)
• Learns fast, like linear TD
• Can learn off-policy
• Learns from online causal trajectories (no repeat sampling from the same state)
Four easy steps to stochastic gradient descent:
1. Pick an objective function J(θ), a parameterized function to be minimized
2. Use calculus to analytically compute its gradient ∇_θ J(θ)
3. Find a "sample gradient" ∇_θ J_t(θ) that you can sample on every time step and whose expected value equals the gradient
4. Take small steps in θ opposite to the sample gradient: θ ← θ − α∇_θ J_t(θ)
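A minimal sketch (not from the slide) of the four steps for one familiar choice of objective, the value error J(θ) = E[(G_t − θ⊤φ(S_t))²] under linear FA; the names and step size are illustrative.

```python
import numpy as np

def sgd_value_step(theta, phi_t, G_t, alpha=0.01):
    """Step 4: move theta a small step along the negative sample gradient.

    The sample gradient of (G_t - theta^T phi_t)^2 is -2 (G_t - theta^T phi_t) phi_t
    (step 3); its expectation is the true gradient of J (step 2).
    """
    error = G_t - theta @ phi_t           # sampled prediction error
    return theta + alpha * error * phi_t  # theta <- theta - alpha * sample gradient
```

Substituting a bootstrapped target for G_t is exactly what breaks the gradient property, as the next slide shows.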
Conventional TD is not the gradient of anything
• TD(0) algorithm: Δθ = αδφ, with δ = r + γθ⊤φ' − θ⊤φ
• Assume there is a J such that ∂J/∂θ_i = δφ_i
• Then look at the second derivatives:
    ∂²J/∂θ_j∂θ_i = ∂(δφ_i)/∂θ_j = (γφ'_j − φ_j)φ_i
    ∂²J/∂θ_i∂θ_j = ∂(δφ_j)/∂θ_i = (γφ'_i − φ_i)φ_j
• In general these are not equal, but real second derivatives must be symmetric. Contradiction!
• (Etienne Barnard, 1993)
A-split example (Dayan 1992)
[Figure: state A transitions with probability 50% to state B and with probability 50% directly to termination with reward 0; from B the episode terminates (100%) with reward 1.]
• Clearly, the true values are V(A) = 0.5 and V(B) = 1
• But if you minimize the naive objective function J(θ) = E[δ²], then you get the solution V(A) = 1/3, V(B) = 2/3
• This happens even in the tabular case (no FA)
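A worked check of the claimed solution, assuming the structure suggested by the figure (from A, 50% to B and 50% straight to termination with reward 0; from B, termination with reward 1; γ = 1). The three transition types then occur equally often, so with tabular values a = V(A), b = V(B):

$$J \;\propto\; (b - a)^2 + (1 - b)^2 + (0 - a)^2, \qquad
\frac{\partial J}{\partial a} = 0 \;\Rightarrow\; b = 2a, \qquad
\frac{\partial J}{\partial b} = 0 \;\Rightarrow\; 2b - a = 1,$$

which gives a = V(A) = 1/3 and b = V(B) = 2/3, not the true values.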
Indistinguishable pairs of MDPs
[Figure: two pairs of MDPs; within each pair the two MDPs generate exactly the same distribution of observable data.]
• The first pair have different Value Errors but the same Return Errors, and both errors have the same minima:
    J_RE(θ)² = J_VE(θ)² + E[ (v_π(S_t) − G_t)² | A_{t:∞} ∼ π ]
• The second pair have different Bellman Errors but the same Projected Bellman Errors, and here the errors have different minima
Not all objectives can be estimated from data; not all minima can be found by learning
[Figure: two MDPs, MDP 1 and MDP 2, that induce the same data distribution P_µ(ξ). Objectives that are functions of the data distribution alone (the TDE, PBE, and RE) are identical for the two MDPs and share a single minimizing θ*. The VE and BE can differ between the two MDPs (VE_1 ≠ VE_2, BE_1 ≠ BE_2), and the two BEs have different minima.]
No learning algorithm can find the minimum of the Bellman Error.
The Gradient-TD family of algorithms
• True gradient-descent algorithms in the Projected Bellman Error
• GTD(λ) and GQ(λ), for learning V and Q
• They solve two open problems:
  • convergent linear-complexity off-policy TD learning
  • convergent non-linear TD
• Extended to control-variate and proximal forms by Mahadevan et al.
First relate the geometry to the iid statistics
• Let Φ be the matrix whose rows are the feature vectors of all states, and D the diagonal matrix of the state distribution; the projection operator is Π = Φ(Φ⊤DΦ)⁻¹Φ⊤D
• The Mean Squared Projected Bellman Error is MSPBE(θ) = ‖V_θ − ΠTV_θ‖²_D = ‖Π(V_θ − TV_θ)‖²_D
• Using Φ⊤D(TV_θ − V_θ) = E[δφ] and Φ⊤DΦ = E[φφ⊤]:
    ‖Π(V_θ − TV_θ)‖²_D
      = (Π(V_θ − TV_θ))⊤ D (Π(V_θ − TV_θ))
      = (V_θ − TV_θ)⊤ Π⊤DΠ (V_θ − TV_θ)
      = (V_θ − TV_θ)⊤ DΦ (Φ⊤DΦ)⁻¹ Φ⊤D (V_θ − TV_θ)
      = (Φ⊤D(TV_θ − V_θ))⊤ (Φ⊤DΦ)⁻¹ Φ⊤D (TV_θ − V_θ)
      = E[δφ]⊤ E[φφ⊤]⁻¹ E[δφ]
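A quick numeric sketch of the identity on a small made-up MRP (all quantities below are random test data, not from the slide), confirming that the geometric and expectation forms agree:

```python
import numpy as np

np.random.seed(0)
n_states, n_feat, gamma = 5, 3, 0.9
P = np.random.rand(n_states, n_states); P /= P.sum(1, keepdims=True)  # transitions under pi
r = np.random.rand(n_states)                                          # expected rewards
Phi = np.random.rand(n_states, n_feat)                                # feature matrix
d = np.random.rand(n_states); d /= d.sum()                            # state distribution
D = np.diag(d)
theta = np.random.rand(n_feat)

V = Phi @ theta
TV = r + gamma * P @ V                                   # Bellman operator applied to V_theta
Pi = Phi @ np.linalg.inv(Phi.T @ D @ Phi) @ Phi.T @ D    # projection onto span(Phi)

geometric = (Pi @ (TV - V)) @ D @ (Pi @ (TV - V))        # ||Pi(TV_theta - V_theta)||^2_D
E_dphi = Phi.T @ D @ (TV - V)                            # = E[delta * phi]
E_phiphi = Phi.T @ D @ Phi                               # = E[phi phi^T]
statistical = E_dphi @ np.linalg.inv(E_phiphi) @ E_dphi  # E[d phi]^T E[phi phi^T]^-1 E[d phi]
print(np.isclose(geometric, statistical))                # True
```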
Derivation of the TDC algorithm
On a transition from s to s' with reward r and features φ = φ(s), φ' = φ(s'):
  Δθ = −½ α ∇_θ ‖V_θ − ΠTV_θ‖²_D = −½ α ∇_θ J(θ)
     = −½ α ∇_θ ( E[δφ]⊤ E[φφ⊤]⁻¹ E[δφ] )
     = −α (∇_θ E[δφ⊤]) E[φφ⊤]⁻¹ E[δφ]
     = −α E[(γφ' − φ)φ⊤] E[φφ⊤]⁻¹ E[δφ]          (since δ = r + γφ'⊤θ − φ⊤θ, so ∇_θ δ = γφ' − φ)
     = α ( E[φφ⊤] − γ E[φ'φ⊤] ) E[φφ⊤]⁻¹ E[δφ]
     = α E[δφ] − αγ E[φ'φ⊤] E[φφ⊤]⁻¹ E[δφ]
     ≈ α E[δφ] − αγ E[φ'φ⊤] w                      (this is the trick: w is a second set of weights estimating E[φφ⊤]⁻¹E[δφ])
     ≈ αδφ − αγφ'(φ⊤w)                             (sampling)
TD with gradient correction (TDC) algorithm, aka GTD(0)
• On each transition from s to s' with reward r and features φ, φ',
• update two parameters:
    θ ← θ + αδφ − αγφ'(φ⊤w)      (TD(0) with gradient correction)
    w ← w + β(δ − φ⊤w)φ          (φ⊤w is an estimate of the TD error δ for the current state φ)
• where, as usual, δ = r + γθ⊤φ' − θ⊤φ
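A minimal sketch of one TDC update as given above; step sizes and names are placeholders, and in the fully off-policy case the updates would involve the importance sampling ratio ρ as well.

```python
import numpy as np

def tdc_update(theta, w, phi, r, phi_next, gamma=0.99, alpha=0.005, beta=0.05):
    """One TDC / GTD(0) update for linear value prediction."""
    delta = r + gamma * theta @ phi_next - theta @ phi   # TD error
    theta = (theta + alpha * delta * phi
             - alpha * gamma * phi_next * (phi @ w))     # TD(0) plus gradient correction
    w = w + beta * (delta - phi @ w) * phi               # w tracks E[phi phi^T]^-1 E[delta phi]
    return theta, w
```

The secondary weights w are typically given the larger step size (the faster time scale), consistent with the convergence conditions on the next slide.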
Convergence theorems
• All these algorithms converge w.p.1 to the TD fixpoint: E[δφ] → 0
• GTD and GTD-2 converge on a single time scale: α = β → 0
• TDC converges in a two-time-scale sense: α, β → 0 with α/β → 0
Off-policy result: Baird's counterexample
[Figure: learning curves on Baird's counterexample, plotted over sweeps, for TD and the gradient-TD algorithms.]
Gradient algorithms converge. TD diverges.
Computer Go experiment
• Learn a linear value function (probability of winning) for 9x9 Go from self-play
• One million features, each corresponding to a template on a part of the Go board
• An established experimental testbed
[Figure: ‖E[Δθ_TD]‖ after training, plotted against the step size (from 10⁻⁶ to 10⁻³), for TD, GTD, GTD2, and TDC.]