
Control Regularization for Reduced Variance Reinforcement Learning - PowerPoint PPT Presentation



  1. Control Regularization for Reduced Variance Reinforcement Learning
Richard Cheng, Abhinav Verma, Gabor Orosz, Swarat Chaudhuri, Yisong Yue, Joel W. Burdick

  2. Reinforcement Learning
Reinforcement learning (RL) studies how to use data from interactions with the environment to learn an optimal policy.
Policy: $\pi_\theta(a|s): \mathcal{S} \times \mathcal{A} \to [0, 1]$
Reward: $r(s_t, a_t)$
Optimization: $\max_\theta J(\theta) = \max_\theta \mathbb{E}_{\tau \sim \pi_\theta}\big[\sum_t \gamma^t r(s_t, a_t)\big]$, where $\tau: (s_t, a_t, \ldots, s_{t+N}, a_{t+N})$ is a sampled trajectory.
Policy gradient-based optimization with no prior information.
[Figure from Sergey Levine]
Williams, 1992; Sutton et al., 1999; Baxter and Bartlett, 2000; Greensmith et al., 2004
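As a concrete illustration of the policy-gradient setup on this slide, here is a minimal REINFORCE-style sketch for a Gaussian policy on a toy one-dimensional system. It is only a sketch under assumed choices: the toy dynamics, reward, and hyperparameters are illustrative, not the experiments in this work.

```python
# Minimal REINFORCE-style policy gradient sketch (illustrative; the toy dynamics,
# reward, and hyperparameters are assumptions, not the paper's setup).
import numpy as np

def rollout(theta, sigma, dynamics, reward, s0, horizon, rng):
    """Sample one trajectory tau = (s_t, a_t, ..., s_{t+N}, a_{t+N}) from pi_theta."""
    s, states, actions, rewards = s0, [], [], []
    for _ in range(horizon):
        a = theta @ s + sigma * rng.standard_normal()  # Gaussian exploration noise
        states.append(s); actions.append(a); rewards.append(reward(s, a))
        s = dynamics(s, a)
    return states, actions, rewards

def reinforce_step(theta, sigma, dynamics, reward, s0, horizon, gamma, lr, rng):
    """One policy gradient step: grad J(theta) ~ sum_t grad log pi(a_t|s_t) * return_t."""
    states, actions, rewards = rollout(theta, sigma, dynamics, reward, s0, horizon, rng)
    grad, ret = np.zeros_like(theta), 0.0
    for s, a, r in zip(reversed(states), reversed(actions), reversed(rewards)):
        ret = r + gamma * ret                           # discounted return from step t
        grad += ((a - theta @ s) / sigma**2) * s * ret  # d/dtheta of log N(a; theta^T s, sigma^2)
    grad = np.clip(grad, -50.0, 50.0)                   # crude clipping to keep the toy run tame
    return theta + lr * grad

# Toy unstable 1-D system standing in for the inverted pendulum: drive the state to zero.
rng = np.random.default_rng(0)
dyn = lambda s, a: np.array([1.05 * s[0] + 0.1 * a])
rew = lambda s, a: -(s[0] ** 2 + 0.01 * a ** 2)
theta = np.zeros(1)
for _ in range(200):
    theta = reinforce_step(theta, 0.5, dyn, rew, np.array([1.0]), 30, 0.99, 1e-3, rng)
print("learned feedback gain:", theta)
```

The estimator uses only sampled trajectories, which is what makes it model-free and also what drives the high variance discussed on the next slides.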

  3. Variance in Reinforcement Learning
Policy gradients allow us to optimize the policy with no prior information (only sampled trajectories from interactions). However, RL methods suffer from high variance in learning (Islam et al., 2017; Henderson et al., 2018).
[Figure: inverted pendulum, 10 random seeds; figure from Alex Irpan]
Greensmith et al., 2004; Zhao et al., 2012; Zhao et al., 2015; Thodoroff et al., 2018

  4. Variance in Reinforcement Learning
RL methods suffer from high variance in learning (Islam et al., 2017; Henderson et al., 2018). However, is this variance necessary or even desirable?
Cartpole: the dynamics are approximately $s_{t+1} \approx f(s_t) + g(s_t) a_t$. A nominal LQR controller $a = u_{\mathrm{prior}}(s)$ is stable, but it is based on:
• an error-prone model
• linearized dynamics
[Figure: inverted pendulum learning curves from Alex Irpan; cartpole figure from Kris Hauser]
Greensmith et al., 2004; Zhao et al., 2012; Zhao et al., 2015; Thodoroff et al., 2018
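The nominal LQR controller on this slide is the kind of control prior used later; below is a sketch of how such a prior might be built from a linearized model. The A, B, Q, R values are rough, assumed placeholders for illustration, not the model used in this work.

```python
# Sketch of an LQR control prior computed from *linearized*, possibly error-prone
# cartpole dynamics. All numerical values below are assumed for illustration.
import numpy as np
from scipy.linalg import solve_discrete_are

# Linearized discrete-time dynamics s_{t+1} = A s_t + B a_t around the upright equilibrium,
# with state s = [cart position, cart velocity, pole angle, pole angular velocity].
dt = 0.02
A = np.array([[1.0, dt,  0.0,      0.0],
              [0.0, 1.0, -0.01,    0.0],
              [0.0, 0.0, 1.0,      dt],
              [0.0, 0.0, 0.3 * dt, 1.0]])
B = np.array([[0.0], [0.1], [0.0], [-0.2]])
Q = np.diag([1.0, 0.1, 10.0, 0.1])  # state cost
R = np.array([[0.1]])               # actuation cost

# Discrete-time LQR: solve the Riccati equation, then K = (R + B'PB)^{-1} B'PA.
P = solve_discrete_are(A, B, Q, R)
K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)

def u_prior(s):
    """Stabilizing control prior a = u_prior(s) = -K s from the linearized model."""
    return -(K @ s)

print(u_prior(np.array([0.0, 0.0, 0.1, 0.0])))  # small pole angle -> corrective action
```

The point of the slide is that K comes from an error-prone, linearized model, so u_prior is stabilizing but generally suboptimal for the true task.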

  5. Regularization with a Control Prior
Combine the control prior, $u_{\mathrm{prior}}(s)$, with the learned controller, $u_{\theta_k}(s)$, sampled from $\pi_{\theta_k}(a|s)$. Here $\lambda$ is a regularization parameter weighting the prior vs. the learned controller.
$\pi_{\theta_k}$ is learned in the same manner as before, with samples drawn from the new (mixed) distribution.
Under the assumption of Gaussian exploration noise (i.e., $\pi_\theta(a|s)$ has a Gaussian distribution), the regularized policy can be equivalently expressed as a constrained optimization problem.
Johannink et al., 2018; Silver et al., 2019
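A minimal sketch of the action-space mixing described here, assuming the mixed policy takes the form $u_k(s) = \frac{1}{1+\lambda} u_{\theta_k}(s) + \frac{\lambda}{1+\lambda} u_{\mathrm{prior}}(s)$; the function and variable names are illustrative, not taken from the CORE-RL code.

```python
import numpy as np

def mixed_action(u_learned, u_prior, lam):
    """Control-regularized action: u_k = (u_learned + lam * u_prior) / (1 + lam).
    lam = 0 recovers pure RL; large lam keeps the executed action close to the prior."""
    return (np.asarray(u_learned, dtype=float) + lam * np.asarray(u_prior, dtype=float)) / (1.0 + lam)

# The learned policy proposes a large exploratory action, the prior a small stabilizing one;
# with lam = 4 the executed action is dominated by the prior.
print(mixed_action(u_learned=2.0, u_prior=-0.3, lam=4.0))  # (2.0 + 4 * -0.3) / 5 = 0.16
```

The trajectories used to update $\pi_{\theta_k}$ are the ones generated while executing this mixed action, which is the new (mixed) distribution mentioned above.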

  6. Interpretation of the Prior
Theorem 1. Using the mixed policy above, the variance from each policy gradient step is reduced by a factor of $1/(1+\lambda)^2$. However, this may introduce bias into the policy, where $D_{TV}$ denotes the total variation distance between two policies.

  7. Interpretation of the Prior
Theorem 1. Using the mixed policy above, the variance from each policy gradient step is reduced by a factor of $1/(1+\lambda)^2$. However, this may introduce bias into the policy, where $D_{TV}$ denotes the total variation distance between two policies.
Strong regularization: the control prior heavily constrains exploration; we stabilize to the red trajectory, but may miss the green one.
Weak regularization: greater room for exploration, but we may not stabilize around the red trajectory.
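A one-step sketch of where the $1/(1+\lambda)^2$ factor comes from, under the assumption that the control prior is deterministic and all exploration noise enters through the learned policy:

```latex
u_k(s) = \tfrac{1}{1+\lambda}\, u_{\theta_k}(s) + \tfrac{\lambda}{1+\lambda}\, u_{\mathrm{prior}}(s)
\quad\Longrightarrow\quad
\operatorname{Var}\!\big[u_k(s)\big] = \tfrac{1}{(1+\lambda)^2}\, \operatorname{Var}\!\big[u_{\theta_k}(s)\big].
```

The same $\lambda/(1+\lambda)$ weight that shrinks the variance also pulls the policy toward $u_{\mathrm{prior}}$, which is the bias that the total variation term quantifies; strong and weak regularization are the two extremes of this trade-off.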

  8. Stability Properties from the Prior
Regularization allows us to “capture” stability properties from a robust control prior.
Theorem 2. Assume a stabilizing $\mathcal{H}_\infty$ control prior within the set $\mathcal{C}$ for the dynamical system (14). Then asymptotic stability and forward invariance of the set $\mathcal{S}_{st} \subseteq \mathcal{C}$ is guaranteed under the regularized policy for all $s \in \mathcal{C}$.
With a robust control prior, the regularized controller always remains near the equilibrium point, even during learning.
[Figure: cartpole]

  9. Results
• Car following: data gathered from a chain of cars following each other; the goal is to optimize the fuel efficiency of the middle car.
• Racing: the goal is to minimize the lap time of a simulated racecar.
Control regularization helps by providing:
• reduced variance
• higher rewards
• faster learning
• potential safety guarantees
However, high regularization also leads to potential bias.
See the poster (number 42) for similar results on the CartPole domain.
Code at: https://github.com/rcheng805/CORE-RL
