ICML 2020: Discount Factor as a Regularizer in RL
Ron Amit, Ron Meir (Technion), Kamil Ciosek (Microsoft Research, Cambridge, UK)
RL Problem Objectives
• Objective: the expected γ_e-discounted return (value function), where γ_e is the evaluation discount factor
• Policy evaluation
• Policy optimization
• How can we improve performance in the limited-data regime?
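For reference, a standard way to write the evaluation objective (my notation, not copied from the slide):

```latex
% Value of policy \pi under the evaluation discount factor \gamma_e
V^{\pi}_{\gamma_e}(s) \;=\; \mathbb{E}\left[\,\sum_{t=0}^{\infty} \gamma_e^{\,t}\, r_t \;\Big|\; s_0 = s,\ \pi \right]
```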
Discount Regularization
• Discount regularization: use a guidance discount factor γ < γ_e as an algorithm hyperparameter (Jiang '15)
• Theoretical analysis:
  • Petrik and Scherrer '09: approximate DP
  • Jiang '15: model-based; better performance for limited data
• Regularization effect (bias bound sketched below):
  • ↑ Bias: V^π_γ − V^π_{γ_e}
  • ↓ Variance: V̂ − V^π_γ
• Our work:
  • In TD learning, discount regularization == an explicit added regularizer
  • When is discount regularization effective?
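As a reminder of why lowering the discount trades variance for bias, a standard bound on the bias term (assuming rewards bounded by R_max; this derivation is mine, not copied from the slide):

```latex
\left\| V^{\pi}_{\gamma_e} - V^{\pi}_{\gamma} \right\|_{\infty}
\;\le\; R_{\max} \sum_{t=0}^{\infty} \left( \gamma_e^{\,t} - \gamma^{\,t} \right)
\;=\; \frac{R_{\max}\,(\gamma_e - \gamma)}{(1-\gamma_e)(1-\gamma)}
```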
Temporal Difference (TD) Learning
• Policy evaluation with a value-function model
• Batch TD(0), with the discount factor as an algorithm hyperparameter (a minimal sketch follows below)
• Aim: minimize the error of the estimate V̂ with respect to V^π_{γ_e}
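A minimal tabular batch TD(0) sketch, assuming transition tuples (s, r, s') have already been collected; the function name, step size, and sweep count are illustrative, not the authors' code:

```python
import numpy as np

def batch_td0(tuples, n_states, gamma, alpha=0.1, n_sweeps=200):
    """Tabular batch TD(0): repeatedly sweep a fixed batch of (s, r, s')
    tuples, moving V(s) towards the discounted TD target."""
    V = np.zeros(n_states)
    for _ in range(n_sweeps):
        for s, r, s_next in tuples:
            td_error = r + gamma * V[s_next] - V[s]  # gamma is the hyperparameter
            V[s] += alpha * td_error
    return V
```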
Equivalent Form
• Equivalent update steps: TD(0) with discount regularization (using γ < γ_e) ⇔ TD(0) using γ_e plus a regularization-term gradient (numerical check below)
• A similar equivalence holds for (expected) SARSA and for LSTD (activation regularization)
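A quick numerical check of the tabular identity behind this equivalence, r + γ V(s') − V(s) = r + γ_e V(s') − V(s) − (γ_e − γ) V(s'); the extra term acts like a penalty on the next-state activation (my illustration, assuming a single tabular update):

```python
import numpy as np

rng = np.random.default_rng(0)
V = rng.normal(size=5)            # current value estimates
s, s_next, r = 1, 3, 0.7          # one observed transition
gamma_e, gamma, alpha = 0.99, 0.8, 0.1

# Update with the lower guidance discount gamma
delta_low = alpha * (r + gamma * V[s_next] - V[s])

# Update with gamma_e plus an explicit penalty on the next-state value
delta_reg = alpha * (r + gamma_e * V[s_next] - V[s]) \
            - alpha * (gamma_e - gamma) * V[s_next]

assert np.isclose(delta_low, delta_reg)   # the two updates are identical
```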
The Equivalent Regularizer
• The equivalent term is an activation regularization, closely related to L2 regularization
• Tabular case: discount regularization is sensitive to the empirical state distribution
Tabular Experiments: 4x4 GridWorld
• Policy evaluation with a uniform policy π(·|s); evaluation discount γ_e = 0.99
• Goal: find V̂ that estimates V^π_{γ_e}
• In each MDP instance: draw R^π(s) and draw P(·|s, π)
• Loss measures (code sketch below):
  • L2 loss: Σ_{s∈S} (V̂(s) − V^π_{γ_e}(s))²
  • Ranking loss: −Kendall's tau(V̂, V^π_{γ_e}) (roughly, the number of order switches between state ranks)
• Results averaged over 1000 MDP instances
• Data: trajectories of 50 time-steps
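A sketch of the two loss measures, assuming SciPy is available; function and variable names are mine:

```python
import numpy as np
from scipy.stats import kendalltau

def l2_loss(V_hat, V_true):
    """Squared error between the estimate and the gamma_e-discounted values."""
    return np.sum((V_hat - V_true) ** 2)

def ranking_loss(V_hat, V_true):
    """Negative Kendall's tau between state rankings: roughly counts
    order switches between the two rankings (lower is better)."""
    tau, _ = kendalltau(V_hat, V_true)
    return -tau
```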
TD(0) Results (γ_e = 0.99)
[Figure: L2 loss and ranking loss under discount regularization vs. L2 regularization]
Effect of the Empirical Distribution
• The equivalent regularizer depends on the empirical state distribution
• Tuple (s, s', r) generation (sampler sketch below): s ~ q(s), s' ~ P^π(s'|s), r ~ R^π(s)
• For each MDP, draw the sampling distribution q(s) at random (non-uniform) vs. using a uniform q(s)
[Figure: L2 regularization vs. discount regularization, non-uniform vs. uniform q(s), γ_e = 0.99]
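A sketch of this tuple-generation scheme, assuming a tabular P^π (row-stochastic matrix) and R^π (expected reward per state), with q a categorical distribution over states; names are mine, and reward noise is omitted:

```python
import numpy as np

def sample_tuples(P_pi, R_pi, q, n_tuples, rng):
    """Generate (s, r, s') tuples: s ~ q, s' ~ P_pi[s], r = R_pi[s]."""
    n_states = P_pi.shape[0]
    tuples = []
    for _ in range(n_tuples):
        s = rng.choice(n_states, p=q)             # state from the sampling distribution
        s_next = rng.choice(n_states, p=P_pi[s])  # next state from the policy's transition row
        r = R_pi[s]                               # expected reward of the current state
        tuples.append((s, r, s_next))
    return tuples
```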
Effect of the Mixing Time
• Slower mixing (longer mixing time) → higher estimation variance → more regularization is needed
[Figure: L2 regularization vs. discount regularization, slow vs. fast mixing, γ_e = 0.99; LSTD, 2 trajectories]
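One simple way to control the mixing rate in such an experiment (my illustration, not necessarily the construction used in the paper) is to mix a fast chain with the identity matrix; a higher self-transition probability slows mixing and inflates the variance of trajectory-based estimates:

```python
import numpy as np

def slow_down(P, stay_prob):
    """Interpolate a transition matrix with the identity: higher stay_prob
    means slower mixing (larger second eigenvalue modulus)."""
    n = P.shape[0]
    return stay_prob * np.eye(n) + (1.0 - stay_prob) * P

P = np.full((4, 4), 0.25)   # fast-mixing base chain (mixes in one step)
for stay_prob in (0.0, 0.5, 0.9):
    eigvals = np.sort(np.abs(np.linalg.eigvals(slow_down(P, stay_prob))))
    print(stay_prob, eigvals[-2])   # second eigenvalue modulus: 0.0, 0.5, 0.9
```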
Policy Optimization
• Goal: find a policy π that maximizes V^π_{γ_e}(s_1)
• Policy iteration (a sketch follows below); for each episode:
  • Get data
  • Q̂ ← policy evaluation (e.g., SARSA)
  • Improvement step (e.g., ε-greedy)
• The equivalent activation regularization term carries over to the evaluation step
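A minimal sketch of this loop, assuming a simple tabular environment whose reset() returns a state and whose step(a) returns (next_state, reward, done); the SARSA evaluation and ε-greedy improvement follow the bullets above, and all names and constants are illustrative:

```python
import numpy as np

def epsilon_greedy(Q, s, eps, rng):
    """Improvement step: mostly greedy with respect to the current Q estimate."""
    if rng.random() < eps:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))

def policy_iteration(env, n_states, n_actions, gamma, n_episodes=200,
                     alpha=0.1, eps=0.1, seed=0):
    """Approximate policy iteration: SARSA evaluation + epsilon-greedy improvement.
    The guidance discount gamma is the regularization hyperparameter."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_episodes):
        s = env.reset()
        a = epsilon_greedy(Q, s, eps, rng)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = epsilon_greedy(Q, s_next, eps, rng)
            target = r + (0.0 if done else gamma * Q[s_next, a_next])
            Q[s, a] += alpha * (target - Q[s, a])   # SARSA evaluation update
            s, a = s_next, a_next
    return Q
```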
Deep RL Experiments
• Actor-critic algorithms: DDPG (Lillicrap '15), TD3 (Fujimoto '18)
• MuJoCo continuous control (Todorov '12)
• Goal: undiscounted sum of rewards (γ_e = 1)
• Limited number of time-steps (2e5 or less)
• Tested cases (see the sketch below for where each knob enters):
  • Discount regularization (and no L2)
  • L2 regularization (and γ = 0.999)
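A PyTorch-style sketch of where the two knobs enter a critic update; this is my illustration of the two tested configurations, not the authors' training code, and the toy network, dimensions, and fake batch are placeholders:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy critic Q(s, a) for a continuous-control task (state_dim=3, action_dim=1)
def make_critic():
    return nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 1))

critic, critic_target = make_critic(), make_critic()
critic_target.load_state_dict(critic.state_dict())

# Option A: discount regularization -- lower discount, no weight decay
gamma, weight_decay = 0.8, 0.0
# Option B: L2 regularization -- discount near 1, non-zero weight decay
# gamma, weight_decay = 0.999, 1e-2

opt = torch.optim.Adam(critic.parameters(), lr=3e-4, weight_decay=weight_decay)

# One critic update on a fake batch of transitions
s, a = torch.randn(32, 3), torch.randn(32, 1)
r, s_next, a_next = torch.randn(32, 1), torch.randn(32, 3), torch.randn(32, 1)
done = torch.zeros(32, 1)

with torch.no_grad():
    target = r + gamma * (1.0 - done) * critic_target(torch.cat([s_next, a_next], dim=1))
loss = F.mse_loss(critic(torch.cat([s, a], dim=1)), target)
opt.zero_grad()
loss.backward()
opt.step()
```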
Deep RL Results
[Figure: learning curves for HalfCheetah-v2, Ant-v2, and Hopper-v2, comparing discount regularization and L2 regularization with the full 2e5 steps and with fewer steps (2.5e2, 1e5, 5e4); γ = 0.99 and γ = 0.8 shown]
Conclusions
• Discount regularization in TD learning is equivalent to adding an explicit regularization term
• Regularization effectiveness is closely related to the data distribution and the mixing rate
• Generalization in deep RL is strongly affected by regularization
• Future work: theory is needed
Thanks for listening!