Discount Factor as a Regularizer in RL, Ron Amit, Ron Meir - PowerPoint PPT Presentation

  1. ICML 2020. Discount Factor as a Regularizer in RL. Ron Amit, Ron Meir (Technion), Kamil Ciosek (MSR). Microsoft Research, Cambridge, UK

  2. RL problem objectives • The expected γ_E-discounted return (value function), V_{γ_E}^π(s) = E[ Σ_t γ_E^t r_t | s_0 = s ], where γ_E is the evaluation discount factor • Policy evaluation • Policy optimization • How can we improve performance in the limited-data regime?
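
To make the objective concrete, here is a minimal sketch (not from the talk; the tiny Markov reward process and numbers are illustrative) of computing the γ_E-discounted value function of a fixed policy by solving the Bellman equation directly:

```python
# Minimal sketch: the gamma_E-discounted value of a fixed policy on a tiny
# 3-state Markov reward process, from V = r + gamma_E * P @ V.
import numpy as np

gamma_E = 0.99                                   # evaluation discount factor
P = np.array([[0.9, 0.1, 0.0],                   # P[s, s'] under the evaluated policy
              [0.0, 0.8, 0.2],
              [0.3, 0.0, 0.7]])
r = np.array([0.0, 1.0, -0.5])                   # expected reward per state

V = np.linalg.solve(np.eye(3) - gamma_E * P, r)  # V_{gamma_E} = (I - gamma_E P)^{-1} r
print(V)
```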

  3. Discount regularization • Discount regularization: learn with a smaller "guidance discount factor" γ < γ_E (Jiang '15), treated as an algorithm hyperparameter • Theoretical analysis: Petrik and Scherrer '09 (approximate DP), Jiang '15 (model-based) – better performance with limited data • Regularization effect: ↑ bias ||V_γ − V_{γ_E}||, ↓ variance ||V̂ − V_γ|| • Our work: in TD learning, discount regularization is equivalent to an explicitly added regularizer; when is discount regularization effective?
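
One way to read the bias/variance split above is the following standard error decomposition (my phrasing, not taken verbatim from the slides): the total error of an estimate V̂ fitted with the guidance discount γ splits into an estimation (variance) part and a discount-mismatch (bias) part.

```latex
% Error decomposition behind the bias/variance reading of discount regularization
% (triangle inequality; \hat{V} is the estimate fitted with guidance discount \gamma).
\[
  \lVert \hat{V} - V_{\gamma_E} \rVert
  \;\le\;
  \underbrace{\lVert \hat{V} - V_{\gamma} \rVert}_{\text{variance / estimation error}}
  \;+\;
  \underbrace{\lVert V_{\gamma} - V_{\gamma_E} \rVert}_{\text{bias from } \gamma < \gamma_E}
\]
```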

  4. Temporal Difference (TD) learning • Policy evaluation with a value-function model V̂_w • Batch TD(0), with the discount factor γ as an algorithm hyperparameter: w ← w + α (r + γ V̂_w(s') − V̂_w(s)) ∇_w V̂_w(s) • Aim: minimize the error of V̂_w with respect to V_{γ_E}
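
A hedged sketch of batch TD(0) for policy evaluation with a linear value model V_w(s) = w·φ(s); the data format, step size, and sweep loop are assumptions, not the talk's exact setup:

```python
# Batch TD(0) sketch for policy evaluation with a linear value model
# V_w(s) = w @ phi(s).  Data format and hyperparameters are assumptions.
import numpy as np

def batch_td0(transitions, phi, n_features, gamma, lr=0.05, n_sweeps=200):
    """transitions: list of (s, r, s_next) tuples collected under the evaluated policy."""
    w = np.zeros(n_features)
    for _ in range(n_sweeps):
        for s, r, s_next in transitions:
            # semi-gradient TD(0) step with discount hyperparameter gamma
            td_error = r + gamma * w @ phi(s_next) - w @ phi(s)
            w += lr * td_error * phi(s)
    return w
```

Setting gamma below γ_E in this update is exactly the discount regularization discussed on the next slides.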

  5. Equivalent form • Equivalent update steps: discount regularization (using γ < γ_E) is equivalent to using γ_E plus a regularization term, since r + γ V̂(s') − V̂(s) = (r + γ_E V̂(s') − V̂(s)) − (γ_E − γ) V̂(s'); the regularization-term gradient is (γ_E − γ) V̂(s') ∇_w V̂(s) • A similar equivalence holds for (expected) SARSA and LSTD • The added term acts as an activation regularization
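
The equivalence for a single TD(0) step can be checked numerically; in this sketch the feature vectors, weights, and reward are illustrative, not values from the talk:

```python
# Numerical check: the TD(0) update with guidance discount gamma equals the
# update with gamma_E minus the extra term (gamma_E - gamma) * V(s') * grad_w V(s).
import numpy as np

gamma_E, gamma = 0.99, 0.8
w = np.array([0.3, -0.1])
phi_s, phi_s_next, r = np.array([1.0, 0.0]), np.array([0.2, 1.0]), 0.5

update_discount_reg = (r + gamma * w @ phi_s_next - w @ phi_s) * phi_s
update_gamma_E = (r + gamma_E * w @ phi_s_next - w @ phi_s) * phi_s
reg_term_grad = (gamma_E - gamma) * (w @ phi_s_next) * phi_s

assert np.allclose(update_discount_reg, update_gamma_E - reg_term_grad)
```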

  6. The equivalent regularizer • Activation regularization vs. L2 regularization • Tabular case: discount regularization is sensitive to the empirical state distribution
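
A hedged tabular illustration of the sensitivity claim (the batch and the per-entry aggregation below are my assumptions): the extra term implied by discount regularization accumulates per state visit, whereas plain L2 regularization penalizes every table entry equally.

```python
# Tabular contrast: discount regularization's implicit per-entry term vs. a
# uniform L2 penalty gradient.  Batch and values are illustrative.
import numpy as np

n_states = 4
gamma_E, gamma = 0.99, 0.8
V = np.array([1.0, 0.5, -0.2, 0.3])      # current tabular value estimates

# sampled transitions (s, s'); state 0 dominates the empirical distribution
batch = [(0, 1), (0, 2), (0, 1), (0, 3), (1, 0), (2, 0)]

extra = np.zeros(n_states)               # per-entry extra term implied by discount regularization
for s, s_next in batch:
    extra[s] += (gamma_E - gamma) * V[s_next]

l2_grad = 2 * 0.01 * V                   # plain L2 penalty gradient with lambda = 0.01

print(extra)    # concentrated on frequently visited states
print(l2_grad)  # uniform in the visit counts
```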

  7. Tabular experiments: 4x4 GridWorld • Policy evaluation with a uniform policy π(a|s) (γ_E = 0.99) • Goal: find an estimate V̂ of V_{γ_E} • In each MDP instance: draw the expected rewards E[R(s)] and the transition kernel P(·|s, a) • Loss measures: L2 loss ||V̂^π − V^π_{γ_E}||² = Σ_{s∈S} (V̂^π(s) − V^π_{γ_E}(s))², and ranking loss −Kendall's tau(V̂^π, V^π_{γ_E}) (~ the number of order switches between state ranks) • Results averaged over 1000 MDP instances • Data: trajectories of 50 time-steps
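
A sketch of the two evaluation measures (scipy.stats.kendalltau is used here for the ranking loss; treating the L2 loss as an unweighted sum over states is my reading of the slide):

```python
# The two loss measures on this slide: squared error against V_{gamma_E} over
# all states, and a ranking loss based on Kendall's tau over state values.
import numpy as np
from scipy.stats import kendalltau

def l2_loss(V_hat, V_true):
    return np.sum((V_hat - V_true) ** 2)

def ranking_loss(V_hat, V_true):
    tau, _ = kendalltau(V_hat, V_true)   # tau drops as state-rank order switches increase
    return -tau                          # lower is better, matching the slide's convention
```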

  8. TD(0) results [Figure: discount regularization vs. L2 regularization; panels show the L2 loss and the ranking loss, γ_E = 0.99]

  9. Effect of the empirical distribution • The equivalent regularizer depends on the empirical data distribution • Tuple (s, s', r) generation: s ~ g(s), s' ~ P^π(s'|s), r ~ R^π(s) • For each MDP, the state distribution g(s) is drawn at a given total-variation distance d_TV from uniform • [Figure: L2 regularization vs. discount regularization, uniform vs. non-uniform state distribution, γ_E = 0.99]
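
A sketch of the tuple-generation scheme described on this slide; drawing the non-uniform state distribution from a Dirichlet is my assumption (the slide only says it is drawn at some distance from uniform):

```python
# Generate (s, s', r) tuples with s ~ g, s' ~ P_pi(.|s), r ~ R_pi(s).
import numpy as np

rng = np.random.default_rng(0)
n_states = 16

def draw_tuples(P_pi, r_mean, g, n_samples):
    s = rng.choice(n_states, size=n_samples, p=g)
    s_next = np.array([rng.choice(n_states, p=P_pi[si]) for si in s])
    r = r_mean[s] + rng.normal(scale=0.1, size=n_samples)   # noisy rewards around R_pi(s)
    return list(zip(s, s_next, r))

g_uniform = np.full(n_states, 1.0 / n_states)
g_nonuniform = rng.dirichlet(0.3 * np.ones(n_states))        # skewed state distribution (assumption)
```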

  10. Effect of the mixing time • Slower mixing (lower mixing rate) → higher estimation variance → more regularization is needed • [Figure: L2 regularization vs. discount regularization, slow-mixing vs. fast-mixing chains; γ_E = 0.99, LSTD, 2 trajectories]
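
The talk does not specify how mixing speed is quantified; one common proxy (an assumption here) is the second-largest eigenvalue modulus of the policy's transition matrix, with values near 1 indicating slow mixing:

```python
# Second-largest eigenvalue modulus as a slow-mixing indicator (illustrative chains).
import numpy as np

def second_eigenvalue_modulus(P):
    eigvals = np.linalg.eigvals(P)
    return np.sort(np.abs(eigvals))[-2]   # the largest modulus is 1 for a stochastic matrix

P_fast = np.full((4, 4), 0.25)                              # mixes in a single step
P_slow = 0.95 * np.eye(4) + 0.05 * np.full((4, 4), 0.25)    # "sticky", slowly mixing chain
print(second_eigenvalue_modulus(P_fast), second_eigenvalue_modulus(P_slow))
```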

  11. Policy optimization • Goal: min_π ||V^{π*}_{γ_E} − V^π_{γ_E}||_1 • Policy iteration – for each episode: get data; Q̂ ← policy evaluation (e.g., SARSA); improvement step (e.g., ε-greedy) • The activation-regularization term carries over from the policy-evaluation case
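
A hedged sketch of the policy-iteration scheme on this slide: evaluate Q with expected SARSA on freshly collected data, then apply an ε-greedy improvement step. Here collect_episodes is a hypothetical data-collection callable and the hyperparameters are illustrative, not the talk's code.

```python
# Approximate policy iteration: expected-SARSA evaluation + epsilon-greedy improvement.
import numpy as np

def epsilon_greedy(Q, epsilon):
    n_states, n_actions = Q.shape
    policy = np.full((n_states, n_actions), epsilon / n_actions)
    policy[np.arange(n_states), Q.argmax(axis=1)] += 1.0 - epsilon
    return policy

def policy_iteration(collect_episodes, n_states, n_actions, gamma,
                     epsilon=0.1, lr=0.1, iterations=100):
    Q = np.zeros((n_states, n_actions))
    policy = np.full((n_states, n_actions), 1.0 / n_actions)
    for _ in range(iterations):
        for s, a, r, s_next in collect_episodes(policy):      # (s, a, r, s') tuples
            # expected SARSA evaluation step under the current policy
            target = r + gamma * policy[s_next] @ Q[s_next]
            Q[s, a] += lr * (target - Q[s, a])
        policy = epsilon_greedy(Q, epsilon)                    # improvement step
    return Q, policy
```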

  12. Deep RL experiments • Actor-critic algorithms: DDPG (Lillicrap '15), TD3 (Fujimoto '18) • MuJoCo continuous control (Todorov '12) • Goal: undiscounted sum of rewards (γ_E = 1) • Limited number of time-steps (2e5 or fewer) • Tested cases: discount regularization (with no L2), and L2 regularization (with γ = 0.999)
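
A hedged sketch of the two tested configurations, written as critic-optimizer settings in PyTorch; the toy critic network and all hyperparameter values are illustrative assumptions, not the paper's code.

```python
# Two configurations: (a) lower critic discount, no L2; (b) high discount + weight decay.
import torch
import torch.nn as nn

state_dim, action_dim = 17, 6   # illustrative sizes for a MuJoCo-style task
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 256),
                       nn.ReLU(), nn.Linear(256, 1))

# (a) Discount regularization: lower the critic's discount, no explicit L2 term.
gamma_a = 0.99    # below the evaluation objective gamma_E = 1 (undiscounted return)
optimizer_a = torch.optim.Adam(critic.parameters(), lr=1e-3, weight_decay=0.0)

# (b) Explicit L2 regularization: keep the discount high and add weight decay.
gamma_b = 0.999
optimizer_b = torch.optim.Adam(critic.parameters(), lr=1e-3, weight_decay=1e-4)
```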

  13. Deep RL results [Figure: discount regularization vs. L2 regularization on HalfCheetah-v2, Ant-v2, and Hopper-v2, at 2e5 training steps and with fewer steps (e.g., 1e5 for Ant-v2, 5e4 for Hopper-v2); the best-performing discount factors, e.g., γ = 0.99 and γ = 0.8, are marked per panel]

  14. Conclusions • Discount regularization in TD is equivalent to adding a regularizer term • Regularization effectiveness is closely related to the data distribution and mixing rate • Generalization in deep RL is strongly affected by regularization • Future work: more theory needed • Thanks for listening
