ICML 2020: Discount Factor as a Regularizer in RL
Ron Amit, Ron Meir (Technion), Kamil Ciosek (Microsoft Research, Cambridge, UK)
RL Problem Objectives
• Objective: the expected γ_e-discounted return (value function), where γ_e is the evaluation discount factor
• Policy evaluation
• Policy optimization
• How can we improve performance in the limited-data regime?
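For reference, a standard way to write the evaluation objective (my notation, not copied from the slide):

```latex
% Value of policy \pi under the evaluation discount factor \gamma_e
V^{\pi}_{\gamma_e}(s) \;=\; \mathbb{E}\left[\,\sum_{t=0}^{\infty} \gamma_e^{\,t}\, r_t \;\Big|\; s_0 = s,\ \pi \right]
```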
Discount Regularization
• Discount regularization: use a guidance discount factor γ < γ_e as an algorithm hyperparameter (Jiang '15)
• Theoretical analysis:
  • Petrik and Scherrer '09: approximate DP
  • Jiang '15: model-based; better performance for limited data
• Regularization effect (bias bound sketched below):
  • ↑ Bias: V^π_γ − V^π_{γ_e}
  • ↓ Variance: V̂ − V^π_γ
• Our work:
  • In TD learning, discount regularization == an explicit added regularizer
  • When is discount regularization effective?
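As a reminder of why lowering the discount trades variance for bias, a standard bound on the bias term (assuming rewards bounded by R_max; this derivation is mine, not copied from the slide):

```latex
\left\| V^{\pi}_{\gamma_e} - V^{\pi}_{\gamma} \right\|_{\infty}
\;\le\; R_{\max} \sum_{t=0}^{\infty} \left( \gamma_e^{\,t} - \gamma^{\,t} \right)
\;=\; \frac{R_{\max}\,(\gamma_e - \gamma)}{(1-\gamma_e)(1-\gamma)}
```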
Temporal Difference (TD) Learning
• Policy evaluation with a value-function model
• Batch TD(0), with the discount factor as an algorithm hyperparameter (a minimal sketch follows below)
• Aim: minimize the error of the estimate V̂ with respect to V^π_{γ_e}
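A minimal tabular batch TD(0) sketch, assuming transition tuples (s, r, s') have already been collected; the function name, step size, and sweep count are illustrative, not the authors' code:

```python
import numpy as np

def batch_td0(tuples, n_states, gamma, alpha=0.1, n_sweeps=200):
    """Tabular batch TD(0): repeatedly sweep a fixed batch of (s, r, s')
    tuples, moving V(s) towards the discounted TD target."""
    V = np.zeros(n_states)
    for _ in range(n_sweeps):
        for s, r, s_next in tuples:
            td_error = r + gamma * V[s_next] - V[s]  # gamma is the hyperparameter
            V[s] += alpha * td_error
    return V
```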
Equivalent Form
• Equivalent update steps: TD(0) with discount regularization (using γ < γ_e) ⇔ TD(0) using γ_e plus a regularization-term gradient (numerical check below)
• A similar equivalence holds for (expected) SARSA and for LSTD (activation regularization)
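A quick numerical check of the tabular identity behind this equivalence, r + γ V(s') − V(s) = r + γ_e V(s') − V(s) − (γ_e − γ) V(s'); the extra term acts like a penalty on the next-state activation (my illustration, assuming a single tabular update):

```python
import numpy as np

rng = np.random.default_rng(0)
V = rng.normal(size=5)            # current value estimates
s, s_next, r = 1, 3, 0.7          # one observed transition
gamma_e, gamma, alpha = 0.99, 0.8, 0.1

# Update with the lower guidance discount gamma
delta_low = alpha * (r + gamma * V[s_next] - V[s])

# Update with gamma_e plus an explicit penalty on the next-state value
delta_reg = alpha * (r + gamma_e * V[s_next] - V[s]) \
            - alpha * (gamma_e - gamma) * V[s_next]

assert np.isclose(delta_low, delta_reg)   # the two updates are identical
```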
The Equivalent Regularizer
• The equivalent term is an activation regularization, closely related to L2 regularization
• Tabular case: discount regularization is sensitive to the empirical state distribution
Tabular Experiments: 4x4 GridWorld
• Policy evaluation with a uniform policy π(·|s); evaluation discount γ_e = 0.99
• Goal: find V̂ that estimates V^π_{γ_e}
• In each MDP instance: draw R^π(s) and draw P(·|s, π)
• Loss measures (code sketch below):
  • L2 loss: Σ_{s∈S} (V̂(s) − V^π_{γ_e}(s))²
  • Ranking loss: −Kendall's tau(V̂, V^π_{γ_e}) (roughly, the number of order switches between state ranks)
• Results averaged over 1000 MDP instances
• Data: trajectories of 50 time-steps
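A sketch of the two loss measures, assuming SciPy is available; function and variable names are mine:

```python
import numpy as np
from scipy.stats import kendalltau

def l2_loss(V_hat, V_true):
    """Squared error between the estimate and the gamma_e-discounted values."""
    return np.sum((V_hat - V_true) ** 2)

def ranking_loss(V_hat, V_true):
    """Negative Kendall's tau between state rankings: roughly counts
    order switches between the two rankings (lower is better)."""
    tau, _ = kendalltau(V_hat, V_true)
    return -tau
```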
TD(0) Results (γ_e = 0.99)
[Figure: L2 loss and ranking loss under discount regularization vs. L2 regularization]
Effect of the Empirical Distribution
• The equivalent regularizer depends on the empirical state distribution
• Tuple (s, s', r) generation (sampler sketch below): s ~ q(s), s' ~ P^π(s'|s), r ~ R^π(s)
• For each MDP, draw the sampling distribution q(s) at random (non-uniform) vs. using a uniform q(s)
[Figure: L2 regularization vs. discount regularization, non-uniform vs. uniform q(s), γ_e = 0.99]
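A sketch of this tuple-generation scheme, assuming a tabular P^π (row-stochastic matrix) and R^π (expected reward per state), with q a categorical distribution over states; names are mine, and reward noise is omitted:

```python
import numpy as np

def sample_tuples(P_pi, R_pi, q, n_tuples, rng):
    """Generate (s, r, s') tuples: s ~ q, s' ~ P_pi[s], r = R_pi[s]."""
    n_states = P_pi.shape[0]
    tuples = []
    for _ in range(n_tuples):
        s = rng.choice(n_states, p=q)             # state from the sampling distribution
        s_next = rng.choice(n_states, p=P_pi[s])  # next state from the policy's transition row
        r = R_pi[s]                               # expected reward of the current state
        tuples.append((s, r, s_next))
    return tuples
```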
Effect of the Mixing Time
• Slower mixing (longer mixing time) → higher estimation variance → more regularization is needed
[Figure: L2 regularization vs. discount regularization, slow vs. fast mixing, γ_e = 0.99; LSTD, 2 trajectories]
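One simple way to control the mixing rate in such an experiment (my illustration, not necessarily the construction used in the paper) is to mix a fast chain with the identity matrix; a higher self-transition probability slows mixing and inflates the variance of trajectory-based estimates:

```python
import numpy as np

def slow_down(P, stay_prob):
    """Interpolate a transition matrix with the identity: higher stay_prob
    means slower mixing (larger second eigenvalue modulus)."""
    n = P.shape[0]
    return stay_prob * np.eye(n) + (1.0 - stay_prob) * P

P = np.full((4, 4), 0.25)   # fast-mixing base chain (mixes in one step)
for stay_prob in (0.0, 0.5, 0.9):
    eigvals = np.sort(np.abs(np.linalg.eigvals(slow_down(P, stay_prob))))
    print(stay_prob, eigvals[-2])   # second eigenvalue modulus: 0.0, 0.5, 0.9
```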
Policy Optimization
• Goal: find a policy π that maximizes V^π_{γ_e}(s_1)
• Policy iteration (a sketch follows below); for each episode:
  • Get data
  • Q̂ ← policy evaluation (e.g., SARSA)
  • Improvement step (e.g., ε-greedy)
• The equivalent activation regularization term carries over to the evaluation step
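A minimal sketch of this loop, assuming a simple tabular environment whose reset() returns a state and whose step(a) returns (next_state, reward, done); the SARSA evaluation and ε-greedy improvement follow the bullets above, and all names and constants are illustrative:

```python
import numpy as np

def epsilon_greedy(Q, s, eps, rng):
    """Improvement step: mostly greedy with respect to the current Q estimate."""
    if rng.random() < eps:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))

def policy_iteration(env, n_states, n_actions, gamma, n_episodes=200,
                     alpha=0.1, eps=0.1, seed=0):
    """Approximate policy iteration: SARSA evaluation + epsilon-greedy improvement.
    The guidance discount gamma is the regularization hyperparameter."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_episodes):
        s = env.reset()
        a = epsilon_greedy(Q, s, eps, rng)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = epsilon_greedy(Q, s_next, eps, rng)
            target = r + (0.0 if done else gamma * Q[s_next, a_next])
            Q[s, a] += alpha * (target - Q[s, a])   # SARSA evaluation update
            s, a = s_next, a_next
    return Q
```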
Deep RL Experiments
• Actor-critic algorithms: DDPG (Lillicrap '15), TD3 (Fujimoto '18)
• MuJoCo continuous control (Todorov '12)
• Goal: undiscounted sum of rewards (γ_e = 1)
• Limited number of time-steps (2e5 or less)
• Tested cases (see the sketch below for where each knob enters):
  • Discount regularization (and no L2)
  • L2 regularization (and γ = 0.999)
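A PyTorch-style sketch of where the two knobs enter a critic update; this is my illustration of the two tested configurations, not the authors' training code, and the toy network, dimensions, and fake batch are placeholders:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy critic Q(s, a) for a continuous-control task (state_dim=3, action_dim=1)
def make_critic():
    return nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 1))

critic, critic_target = make_critic(), make_critic()
critic_target.load_state_dict(critic.state_dict())

# Option A: discount regularization -- lower discount, no weight decay
gamma, weight_decay = 0.8, 0.0
# Option B: L2 regularization -- discount near 1, non-zero weight decay
# gamma, weight_decay = 0.999, 1e-2

opt = torch.optim.Adam(critic.parameters(), lr=3e-4, weight_decay=weight_decay)

# One critic update on a fake batch of transitions
s, a = torch.randn(32, 3), torch.randn(32, 1)
r, s_next, a_next = torch.randn(32, 1), torch.randn(32, 3), torch.randn(32, 1)
done = torch.zeros(32, 1)

with torch.no_grad():
    target = r + gamma * (1.0 - done) * critic_target(torch.cat([s_next, a_next], dim=1))
loss = F.mse_loss(critic(torch.cat([s, a], dim=1)), target)
opt.zero_grad()
loss.backward()
opt.step()
```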
Deep RL Results
[Figure: learning curves for HalfCheetah-v2, Ant-v2, and Hopper-v2, comparing discount regularization and L2 regularization with the full 2e5 steps and with fewer steps (2.5e2, 1e5, 5e4); γ = 0.99 and γ = 0.8 shown]
Conclusions
• Discount regularization in TD learning is equivalent to adding an explicit regularization term
• Regularization effectiveness is closely related to the data distribution and the mixing rate
• Generalization in deep RL is strongly affected by regularization
• Future work: theory is needed
Thanks for listening!