Separating value functions across time-scales
Joshua Romoff*¹,², Peter Henderson*³, Ahmed Touati²,⁴, Emma Brunskill³, Joelle Pineau¹,², Yann Ollivier²
¹MILA, McGill University  ²Facebook AI Research  ³Stanford University  ⁴MILA, Université de Montréal
*Equal contribution
RL Background
• Monte-Carlo return / target: $G_t := \sum_{j=0}^{\infty} \gamma^j r_{t+j}$
• Value function: $V(s) := \mathbb{E}[G_t \mid s_t = s, \pi]$
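As a quick illustration (ours, not from the talk), the Monte-Carlo return of a finite episode can be computed with a single backward pass:

```python
def mc_return(rewards, gamma):
    """Monte-Carlo return G_0 = sum_j gamma^j * r_j for one finished episode."""
    G = 0.0
    for r in reversed(rewards):  # accumulate backwards: G <- r + gamma * G
        G = r + gamma * G
    return G

# mc_return([1.0, 0.0, 2.0], gamma=0.9) == 1.0 + 0.9**2 * 2.0 == 2.62
```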
RL Background: Bootstrapping
• Multi-step returns: $G_t^k := \sum_{j=0}^{k-1} \gamma^j r_{t+j} + \gamma^k V(s_{t+k})$
• $\lambda$-returns: $G_t^\lambda := (1-\lambda) \sum_{k=1}^{\infty} \lambda^{k-1} G_t^k$
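To make these targets concrete, here is a minimal sketch (our own, assuming rewards and bootstrap values are stored in lists, and truncating the $\lambda$-return at `max_k` steps as an approximation):

```python
def k_step_return(rewards, values, t, k, gamma):
    """k-step return: sum_{j<k} gamma^j * r_{t+j} + gamma^k * V(s_{t+k})."""
    G = sum(gamma**j * rewards[t + j] for j in range(k))
    return G + gamma**k * values[t + k]

def lambda_return(rewards, values, t, gamma, lam, max_k):
    """Truncated lambda-return: (1 - lam) * sum_k lam^(k-1) * G_t^k."""
    return (1 - lam) * sum(
        lam**(k - 1) * k_step_return(rewards, values, t, k, gamma)
        for k in range(1, max_k + 1)
    )
```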
TD Learning: Problems with Large γ
• γ is part of the problem formulation: $G_t := \sum_{j=0}^{\infty} \gamma^j r_{t+j}$
• As $\gamma \to 1$, training $V_\gamma$ becomes difficult.
Our Solution: TD(Δ)
• Define a sequence of γ's: $\Delta := (\gamma_0, \gamma_1, \ldots, \gamma_n)$ with $\gamma_j \le \gamma_{j+1}\ \forall j$
• Learn $W_0 := V_{\gamma_0}$ and $W_j := V_{\gamma_j} - V_{\gamma_{j-1}}$
• Recompose: $\sum_{j=0}^{n} W_j = V_{\gamma_n}$
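A minimal sketch of the recomposition (our illustration, assuming each $W_j$ is a callable estimator):

```python
def recompose_value(W, s):
    """V_{gamma_n}(s) = sum_{j=0..n} W_j(s), where W_0 estimates V_{gamma_0}
    and each W_j (j > 0) estimates the delta V_{gamma_j} - V_{gamma_{j-1}}."""
    return sum(w(s) for w in W)
```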
TD(Δ): Bellman Equations
• We can use Bellman equations, with targets:
• $W_0$: $r_t + \gamma_0 W_0(s_{t+1})$
• $W_{j>0}$: $(\gamma_j - \gamma_{j-1})\, V_{\gamma_{j-1}}(s_{t+1}) + \gamma_j W_j(s_{t+1})$
• We extend these to multi-step TD and TD(λ).
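A hedged sketch of these one-step targets (not the authors' code; $V_{\gamma_{j-1}}$ is recomposed from the lower-indexed $W$'s):

```python
def td_delta_targets(r_t, s_next, W, gammas):
    """One-step TD(Delta) targets, one per estimator W_j.

    W[j] is a callable; W[0] approximates V_{gamma_0}, and W[j] for j > 0
    approximates the delta V_{gamma_j} - V_{gamma_{j-1}}.
    """
    targets = [r_t + gammas[0] * W[0](s_next)]        # target for W_0
    for j in range(1, len(gammas)):
        v_prev = sum(W[i](s_next) for i in range(j))  # V_{gamma_{j-1}}(s')
        targets.append((gammas[j] - gammas[j - 1]) * v_prev
                       + gammas[j] * W[j](s_next))    # target for W_j
    return targets
```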
TD(Δ): Equivalence Results
• Equivalence to standard TD(λ). We did it! Wait…
TD(Δ): Equivalence Conditions
• Linear function approximation
• The same learning rate for each W
• The same k-step / λ for each W
TD(Δ) Benefits: More Tuning
• We don't have to be equivalent!
• Change the learning rates
• Change the k-step / λ returns
TD(Δ) Benefits: More Tuning
• What will this get us?
• Let's turn to a slightly different setting to get more insight.
TD(Δ) Benefits: More Tuning
• Consider "phased" updates for standard TD (as in Kearns & Singh, 2000).
TD(Δ) Benefits: More Tuning
We get an error bound via a large-deviation analysis, with bias and variance components that depend on the number of steps k, the discount γ, and the number of samples.
Small note: Kearns & Singh report a slightly different constant in the variance term; the proof was omitted from the 2000 paper, so we instead used Hoeffding's inequality to reach our constant (see our supplementary material).
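Schematically (our paraphrase of the shape of a Kearns & Singh-style bound; as noted above, the exact constants differ), the phased TD(k) error recursion looks like:

```latex
% Shape only, not exact constants: after a phase of m sampled trajectories,
% with probability at least 1 - delta,
\[
  \Delta_t \;\le\;
  \underbrace{\frac{1-\gamma^{k}}{1-\gamma}\,\epsilon}_{\text{variance}}
  \;+\;
  \underbrace{\gamma^{k}\,\Delta_{t-1}}_{\text{bias}},
  \qquad
  \epsilon = O\!\left(\sqrt{\frac{\log(1/\delta)}{m}}\right).
\]
% Larger k shrinks the bias term but inflates the variance prefactor.
```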
TD(Δ) Benefits: More Tuning
Applying the same analysis to our method yields a bias-variance trade-off for each time-scale.
TD(Δ): Little Tuning Required
1. Let an adaptive optimizer handle the learning rates.
2. Set $\lambda_j = \min(\gamma_n \lambda_n / \gamma_j,\, 1)$ or $k_j = \frac{1}{1-\gamma_j}$.
3. Set each $\gamma_j$ so that its horizon $\frac{1}{1-\gamma_j}$ doubles that of $\gamma_{j-1}$ (see the sketch below).
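A sketch of rules 2 and 3 (hypothetical helpers of ours, not the paper's exact schedule; assumes all discounts are positive):

```python
def gamma_schedule(gamma_max, n):
    """Rule 3: n+1 discounts whose horizons 1/(1 - gamma_j) double each
    step, ending at gamma_max."""
    h_max = 1.0 / (1.0 - gamma_max)
    return [1.0 - 2.0**(n - j) / h_max for j in range(n + 1)]

def lambda_schedule(gammas, lam_n):
    """Rule 2: lambda_j = min(gamma_n * lambda_n / gamma_j, 1)."""
    g_n = gammas[-1]
    return [min(g_n * lam_n / g, 1.0) for g in gammas]

# Example: gamma_schedule(0.99, 3) gives horizons 12.5, 25, 50, 100,
# i.e. gammas 0.92, 0.96, 0.98, 0.99.
```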
TD(Δ) for Actor-Critic Algorithms
1. Train the W's as described.
2. Use the sum of the W's in place of V in the policy update (see the sketch below).
We apply this to PPO and test it on Atari.
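As a hedged sketch of step 2 (our simplification; PPO normally uses GAE, which we elide here), the policy update simply sees the recomposed value wherever it would see V:

```python
def one_step_advantage(rewards, states, W, gammas, t):
    """One-step advantage with the recomposed critic V = sum_j W_j."""
    v = lambda s: sum(w(s) for w in W)  # recomposed V_{gamma_n}
    return rewards[t] + gammas[-1] * v(states[t + 1]) - v(states[t])
```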
Atari Experiments [figure slides]
What Does It Learn? (Atari) [figure slides]
TD(Δ): Benefits
• More knobs to tune the bias-variance trade-off! :)
• More insight into the value of the policy at different time-scales.
• A Bellman update for learning separated value functions, which allows for some theoretical insights.
• A natural split for distributed computation.
TD(Δ): Downsides
• More knobs to tune the bias-variance trade-off! :(
• Somewhat more compute-intensive.
TD(Δ) Meets Reward Estimation
We previously demonstrated a simple property: by using a learned estimate of the reward, we can reduce variance in learning, especially in noisy environments.
Joshua Romoff*, Peter Henderson*, Alexandre Piché, Vincent François-Lavet, and Joelle Pineau. "Reward Estimation for Variance Reduction in Deep Reinforcement Learning." In Conference on Robot Learning, pp. 674-699. 2018.
TD(Δ) Meets Reward Estimation
Here we use many estimators and face a similar bias-variance trade-off. An interesting future direction is whether separating value functions across many estimators brings similar natural benefits under noisy rewards, as in our reward estimation work.
Other Extensions
• Adding more W's to move the discount factor toward 1
• A Q-learning extension
• Using the natural time-scale split for distributed updates
Thanks! More Questions?