Lipschitz Continuity in Model-based Reinforcement Learning
Kavosh Asadi*, Dipendra Misra*, Michael L. Littman
* denotes equal contribution
Model-based RL
[diagram: the model-based RL loop — acting produces experience, model learning fits a model, planning with the model produces a value/policy]
model learning: $T(s' \mid s, a) \approx \hat{T}(s' \mid s, a)$ and $R(s, a) \approx \hat{R}(s, a)$
planning: roll out the learned model $\hat{T}$ from $s_0$ to get predicted states $\hat{s}_1, \hat{s}_2, \hat{s}_3, \ldots$, compared against the true states $s_1, s_2, s_3, \ldots$
Compounding Error [Talvitie 2014, Venkatraman et al. 2015]
‣ occurs when the model is imperfect, which is almost always the case
‣ sources include estimation error and partial observability
‣ we work in the agnostic setting
[video: true dynamics ("truth") versus model rollouts ("model"); credit to Matt Cooper, github.com/dyelax]
Main Takeaway
Lipschitz continuity plays a key role in compounding errors and, more generally, in the theory of model-based RL.
Given two metric spaces $(M_1, d_1)$ and $(M_2, d_2)$, a function $f : M_1 \mapsto M_2$ is Lipschitz if the Lipschitz constant defined below is finite:
$$K_{d_1, d_2}(f) := \sup_{s_1 \in M_1,\, s_2 \in M_1} \frac{d_2\big(f(s_1), f(s_2)\big)}{d_1(s_1, s_2)}$$
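As a quick illustration of the definition (not part of the original slides), here is a minimal sketch that estimates a lower bound on the Lipschitz constant of a function by taking the largest difference quotient over sampled pairs; the example function and the Euclidean metrics are assumptions chosen for the illustration.

```python
import numpy as np

def empirical_lipschitz_lower_bound(f, samples):
    """Lower-bound K(f) = sup d2(f(s1), f(s2)) / d1(s1, s2)
    using Euclidean metrics over a finite set of sampled points."""
    bound = 0.0
    for i in range(len(samples)):
        for j in range(i + 1, len(samples)):
            s1, s2 = samples[i], samples[j]
            d1 = np.linalg.norm(s1 - s2)
            if d1 > 0:
                d2 = np.linalg.norm(f(s1) - f(s2))
                bound = max(bound, d2 / d1)
    return bound

# Example: f(s) = sin(3s) has Lipschitz constant 3; the estimate approaches it from below.
samples = np.random.uniform(-2, 2, size=(200, 1))
print(empirical_lipschitz_lower_bound(lambda s: np.sin(3 * s), samples))
```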
Wasserstein Metric [Villani, 2008]
in stochastic domains, we need to quantify the difference between two distributions $\mu_1$ and $\mu_2$
$$W(\mu_1, \mu_2) := \inf_{j \in \Lambda} \int\!\!\int j(s_1, s_2)\, d(s_1, s_2)\, ds_2\, ds_1$$
where $\Lambda$ is the set of joint distributions (couplings) with marginals $\mu_1$ and $\mu_2$
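For one-dimensional empirical distributions, the Wasserstein-1 distance can be computed directly; the sketch below uses SciPy's `wasserstein_distance` on sampled data as an illustration (the two Gaussians are assumptions, not from the slides).

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
mu1_samples = rng.normal(loc=0.0, scale=1.0, size=5000)  # samples from mu_1
mu2_samples = rng.normal(loc=0.5, scale=1.0, size=5000)  # samples from mu_2

# For these two Gaussians, W_1 is exactly 0.5 (the shift in the mean);
# the empirical estimate should be close to that value.
print(wasserstein_distance(mu1_samples, mu2_samples))
```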
Three Theorems
‣ multi-step prediction error
‣ value function estimation error
‣ Lipschitz continuity of value function
Multi-step Prediction Error
assume a $\Delta$-accurate model: $W\big(\hat{T}(\cdot \mid s, a),\, T(\cdot \mid s, a)\big) \le \Delta \quad \forall s\, \forall a$
given a $\Delta$-accurate model with Lipschitz constant $K(\hat{T})$, a true model with Lipschitz constant $K(T)$, and a state distribution $\mu(s)$:
$$\delta(n) := W\big(\hat{T}^n(\cdot \mid \mu),\, T^n(\cdot \mid \mu)\big) \le \Delta \sum_{i=0}^{n-1} k^{\,i}$$
$\delta$: error, $n$: prediction horizon, $k := \min\big(K(T), K(\hat{T})\big)$
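To see what the bound implies, the short sketch below evaluates $\Delta \sum_{i=0}^{n-1} k^i$ for a contractive model ($k < 1$) and an expansive one ($k > 1$); the constants are assumptions chosen for illustration, not results from the paper.

```python
def prediction_error_bound(delta, k, n):
    """Upper bound on the n-step Wasserstein prediction error:
    delta * sum_{i=0}^{n-1} k**i."""
    return delta * sum(k**i for i in range(n))

delta = 0.1
for k in (0.9, 1.1):  # contractive vs. expansive transition model
    print(k, [round(prediction_error_bound(delta, k, n), 3) for n in (1, 5, 10, 20)])
# With k < 1 the bound saturates near delta / (1 - k); with k > 1 it grows geometrically.
```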
Value Function Estimation Error
[diagram: the model-based RL loop, highlighting the planning step]
$K(R)$: Lipschitz constant of the reward
how inaccurate can the value function be?
$$\big| V_T(s) - V_{\hat{T}}(s) \big| \le \frac{\gamma\, K(R)\, \Delta}{(1 - \gamma)(1 - \gamma k)} \quad \forall s$$
$k := \min\big(K(T), K(\hat{T})\big)$
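A minimal numerical reading of this bound (the constants below are assumptions for illustration): the error blows up as $\gamma k \to 1$, which is one reason to control the model's Lipschitz constant.

```python
def value_error_bound(gamma, K_R, delta, k):
    """Upper bound on |V_T(s) - V_That(s)| for all s, assuming gamma * k < 1."""
    assert gamma * k < 1, "bound requires gamma * k < 1"
    return gamma * K_R * delta / ((1 - gamma) * (1 - gamma * k))

gamma, K_R, delta = 0.95, 1.0, 0.05
for k in (0.5, 0.9, 1.0, 1.04):
    print(k, round(value_error_bound(gamma, K_R, delta, k), 2))
# The bound grows sharply as gamma * k approaches 1.
```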
Lipschitz Continuity of Value Function
‣ Generalized VI [Littman and Szepesvári, 96]: a Lipschitz operator
‣ repeat until convergence:
$$Q(s, a) \leftarrow R(s, a) + \gamma \int T(s' \mid s, a)\, f\big(Q(s', \cdot)\big)\, ds'$$
‣ the value function is Lipschitz at every iteration (including the fixed point):
$$K(Q) \le \frac{K(R)}{1 - \gamma K(T)}$$
‣ one implication: value-aware model learning [Farahmand et al, 2017] is equivalent to Wasserstein (will appear in the PGMRL workshop later in the conference)
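To make the operator concrete, here is a minimal tabular sketch of Generalized VI on a made-up 2-state, 2-action MDP; the transition matrix, rewards, and the choice of `f = max` (which recovers standard value iteration) are assumptions for illustration, not from the paper.

```python
import numpy as np

n_states, n_actions, gamma = 2, 2, 0.9
# T[s, a, s'] = probability of s' given (s, a); R[s, a] = reward (made-up MDP).
T = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.9, 0.1]]])
R = np.array([[1.0, 0.0],
              [0.0, 1.0]])

def generalized_vi(T, R, gamma, f=lambda q: q.max(axis=-1), iters=500):
    """Q(s,a) <- R(s,a) + gamma * sum_s' T(s'|s,a) * f(Q(s', .));
    with f = max this is standard value iteration."""
    Q = np.zeros_like(R)
    for _ in range(iters):
        Q = R + gamma * T @ f(Q)
    return Q

print(generalized_vi(T, R, gamma))
```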
Controlling the Lipschitz Constant with Neural Nets
for each layer, ensure the weights lie in a desired norm ball: the Lipschitz constant of the entire net is bounded by the product of the Lipschitz constants of its layers
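One minimal way to realize this, sketched in plain NumPy under the assumption of fully connected layers with 1-Lipschitz activations (e.g., ReLU): after each update, project every weight matrix so its spectral norm stays below a chosen constant `c`; the product of the per-layer constants then bounds the constant of the whole network. This is an illustration, not the paper's exact training procedure.

```python
import numpy as np

def project_to_norm_ball(W, c):
    """Rescale W so its spectral norm (largest singular value) is at most c."""
    spec = np.linalg.norm(W, 2)  # spectral norm of a 2-D array
    return W if spec <= c else W * (c / spec)

rng = np.random.default_rng(0)
c = 1.2
layers = [rng.normal(size=(32, 16)), rng.normal(size=(16, 16)), rng.normal(size=(1, 16))]

# Apply the projection after each (hypothetical) gradient step on every layer.
layers = [project_to_norm_ball(W, c) for W in layers]

# With 1-Lipschitz activations, the network's Lipschitz constant is at most
# the product of the layers' spectral norms.
bound = np.prod([np.linalg.norm(W, 2) for W in layers])
print(bound)  # <= c ** len(layers)
```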
Is Controlling the Lipschitz Constant of Transition Models Useful?
‣ Cartpole (left) and Pendulum (right)
‣ learn a model offline using random samples
‣ perform policy gradient using the model
‣ test the policy in the environment
‣ reward (higher is better) is improved by an intermediate Lipschitz value
[plots: average return per episode versus the Lipschitz constraint, for Cartpole and Pendulum]
more experiments (including on stochastic domains) in the paper
Contributions:
‣ key role of the Lipschitz constant in model-based RL:
  ‣ compounding error
  ‣ value function estimation error
  ‣ Lipschitz continuity of the value function
‣ learning stochastic models using EM (skipped, details in the paper)
‣ quantifying the Lipschitz constant of neural nets (skipped, details in the paper)
‣ model regularization by controlling the Lipschitz constant
‣ usefulness of Wasserstein for model-based RL (skipped, details in the paper)
Questions?
References:
‣ Littman and Szepesvári, "A Generalized Reinforcement-Learning Model: Convergence and Applications", 1996
‣ Villani, "Optimal Transport: Old and New", 2008
‣ Talvitie, "Model Regularization for Stable Sample Rollouts", 2014
‣ Venkatraman, Hebert, and Bagnell, "Improving Multi-Step Prediction of Learned Time Series Models", 2015
‣ Farahmand, Barreto, and Nikovski, "Value-Aware Loss Function for Model-based Reinforcement Learning", 2017