Maximum Entropy-Regularized Multi-Goal Reinforcement Learning Rui Zhao*, Xudong Sun, Volker Tresp Siemens AG & Ludwig Maximilian University of Munich | June 2019 | ICML 2019
Introduction
In Multi-Goal Reinforcement Learning, an agent learns to achieve multiple goals with a goal-conditioned policy. During learning, the agent first collects trajectories into a replay buffer; later, these trajectories are sampled randomly for replay.
[Figure: OpenAI Gym robotic simulations]
Motivation - We observed that the achieved goals in the replay buffer are often biased towards the behavior policies. - From a Bayesian perspective (Murphy, 2012), when there is no prior knowledge of the target goal distribution, the agent should learn uniformly from diverse achieved goals. - We want to encourage the agent to achieve a diverse set of goals while maximizing the expected return.
Contributions
- First, we propose a novel multi-goal RL objective based on weighted entropy, which is essentially a reward-weighted entropy objective.
- Second, we derive a safe surrogate objective, i.e., a lower bound of the original objective, to achieve stable optimization.
- Third, we develop a Maximum Entropy-based Prioritization (MEP) framework to optimize the derived surrogate objective.
- Finally, we evaluate the proposed method in the OpenAI Gym robotic simulations.
<latexit sha1_base64="OdjnM+N0ldtMSmejwmrhjkZqpM=">AB6HicbVBNS8NAEJ34WetX1aOXxSJ4KkV7LHgxWML9gPaUDbSbt2swm7G6GE/gIvHhTx6k/y5r9x2+agrQ8GHu/NMDMvSATXxnW/nY3Nre2d3cJecf/g8Oi4dHLa1nGqGLZYLGLVDahGwSW2DcCu4lCGgUCO8Hkbu53nlBpHsHM03Qj+hI8pAzaqzUTAalsltxFyDrxMtJGXI0BqWv/jBmaYTSMEG17nluYvyMKsOZwFmxn2pMKJvQEfYslTRC7WeLQ2fk0ipDEsbKljRkof6eyGik9TQKbGdEzVivenPxP6+XmrDmZ1wmqUHJlovCVBATk/nXZMgVMiOmlCmuL2VsDFVlBmbTdG4K2+vE7a1Yp3Xak2b8r1Wh5HAc7hAq7Ag1uowz0oAUMEJ7hFd6cR+fFeXc+lq0bTj5zBn/gfP4A2GWM7g=</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="w0HyHQT+vKR/Yh4DcsftmNhoY0=">AB6HicbVDLTgJBEOzF+IL9ehlIjHxRHbRI4kXjxCIo8ENmR26IWR2dnNzKyGEL7AiweN8eonefNvHGAPClbSaWqO91dQSK4Nq7eQ2Nre2d/K7hb39g8Oj4vFJS8epYthksYhVJ6AaBZfYNwI7CQKaRQIbAfj27nfkSleSzvzSRBP6JDyUPOqLFS46lfLldwGyTryMlCBDvV/86g1ilkYoDRNU67nJsafUmU4Ezgr9FKNCWVjOsSupZJGqP3p4tAZubDKgISxsiUNWai/J6Y0noSBbYzomakV725+J/XTU1Y9adcJqlByZaLwlQE5P512TAFTIjJpZQpri9lbARVZQZm03BhuCtvrxOWpWyd1WuNK5LtWoWRx7O4BwuwYMbqMEd1KEJDBCe4RXenAfnxXl3PpatOSebOYU/cD5/AOMBjPU=</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> A Novel Multi-Goal RL Objective Based on Weighted Entropy Guiacsu [1971] proposed weighted entropy, which is an extension of Shannon entropy. The definition of weighted entropy is given by K X H w p = − w k p k log p k k =1 where is the weight of the event and is the probability of the event. p w " # T 1 X η H ( θ ) = H w p ( T g ) = E p r ( S t , G e ) | θ log p ( τ g ) t =1 This objective encourages the agent to maximize the expected return as well as to achieve more diverse goals. τ g = ( g s 0 , ..., g s * We use to denote all the achieved goals in the trajectory , i.e., . T ) τ g τ
<latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> A Safe Surrogate Objective The surrogate is a lower bound of the objective function, i.e., , η L ( θ ) η L ( θ ) < η H ( θ ) where " T # 1 X η H ( θ ) = H w p ( T g ) = E p r ( S t , G e ) | θ log p ( τ g ) t =1 " T # X η L ( θ ) = Z · E q r ( S t , G e ) | θ t =1 q ( τ g ) = 1 Z p ( τ g ) (1 − p ( τ g )) is the normalization factor for . q ( τ g ) Z is the weighted entropy (Guiacsu, 1971; Kelbert et al., 2017), where the p ( T g ) H w weight is the accumulated reward , in our case. Σ T t =1 r ( S t , G e )
Maximum Entropy-based Prioritization (MEP) MEP Algorithm: We update the density model to construct a higher entropy distribution of achieved goals and update the agent with the more diversified training distribution.
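To make the loop concrete, here is a high-level sketch of one MEP iteration. The objects agent, density_model, and buffer are hypothetical placeholders with duck-typed methods (the paper's experiments build on DDPG with HER); this illustrates the structure of the algorithm rather than the released implementation.

```python
import numpy as np

def mep_iteration(agent, density_model, buffer, num_updates, batch_size):
    # 1. Collect fresh trajectories with the current goal-conditioned policy.
    buffer.add(agent.rollout())

    # 2. Fit the density model on the achieved goals in the buffer and derive
    #    the prioritization distribution q ∝ p(1 - p) over trajectories.
    p = np.clip(density_model.fit_and_score(buffer.achieved_goals()), 1e-12, 1 - 1e-12)
    q = p * (1.0 - p)
    q /= q.sum()

    # 3. Replay trajectories drawn from the higher-entropy distribution q and
    #    update the agent on the resulting (hindsight-relabeled) minibatches.
    for _ in range(num_updates):
        idx = np.random.choice(len(q), size=batch_size, p=q)
        agent.update(buffer.get(idx))
```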
Mean success rate and training time
Entropy of achieved goals versus training epoch (mean ± std):
No MEP:   5.13 ± 0.33   5.73 ± 0.33   5.78 ± 0.21
With MEP: 5.59 ± 0.34   5.81 ± 0.30   5.81 ± 0.18
Summary and Take-home Message
- Our approach improves performance by nine percentage points and sample-efficiency by a factor of two, while keeping computational time under control.
- Training the agent with many different kinds of goals, i.e., a higher-entropy goal distribution, helps the agent to learn.
- The code is available on GitHub: https://github.com/ruizhaogit/mep
- Poster: 06:30 -- 09:00 PM @ Pacific Ballroom #32
Thank you!