Maximum Entropy-Regularized Multi-Goal Reinforcement Learning Rui Zhao*, Xudong Sun, Volker Tresp Siemens AG & Ludwig Maximilian University of Munich | June 2019 | ICML 2019
Introduction
In Multi-Goal Reinforcement Learning, an agent learns to achieve multiple goals with a goal-conditioned policy. During learning, the agent first collects trajectories into a replay buffer; later, these trajectories are sampled randomly for replay.
[Figure: OpenAI Gym robotic simulations]
Motivation - We observed that the achieved goals in the replay buffer are often biased towards the behavior policies. - From a Bayesian perspective (Murphy, 2012), when there is no prior knowledge of the target goal distribution, the agent should learn uniformly from diverse achieved goals. - We want to encourage the agent to achieve a diverse set of goals while maximizing the expected return.
Contributions
- First, we propose a novel multi-goal RL objective based on weighted entropy, which is essentially a reward-weighted entropy objective.
- Second, we derive a safe surrogate objective, i.e., a lower bound of the original objective, to achieve stable optimization.
- Third, we develop a Maximum Entropy-based Prioritization (MEP) framework to optimize the derived surrogate objective.
- Finally, we evaluate the proposed method in the OpenAI Gym robotic simulations.
<latexit sha1_base64="OdjnM+N0ldtMSmejwmrhjkZqpM=">AB6HicbVBNS8NAEJ34WetX1aOXxSJ4KkV7LHgxWML9gPaUDbSbt2swm7G6GE/gIvHhTx6k/y5r9x2+agrQ8GHu/NMDMvSATXxnW/nY3Nre2d3cJecf/g8Oi4dHLa1nGqGLZYLGLVDahGwSW2DcCu4lCGgUCO8Hkbu53nlBpHsHM03Qj+hI8pAzaqzUTAalsltxFyDrxMtJGXI0BqWv/jBmaYTSMEG17nluYvyMKsOZwFmxn2pMKJvQEfYslTRC7WeLQ2fk0ipDEsbKljRkof6eyGik9TQKbGdEzVivenPxP6+XmrDmZ1wmqUHJlovCVBATk/nXZMgVMiOmlCmuL2VsDFVlBmbTdG4K2+vE7a1Yp3Xak2b8r1Wh5HAc7hAq7Ag1uowz0oAUMEJ7hFd6cR+fFeXc+lq0bTj5zBn/gfP4A2GWM7g=</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="w0HyHQT+vKR/Yh4DcsftmNhoY0=">AB6HicbVDLTgJBEOzF+IL9ehlIjHxRHbRI4kXjxCIo8ENmR26IWR2dnNzKyGEL7AiweN8eonefNvHGAPClbSaWqO91dQSK4Nq7eQ2Nre2d/K7hb39g8Oj4vFJS8epYthksYhVJ6AaBZfYNwI7CQKaRQIbAfj27nfkSleSzvzSRBP6JDyUPOqLFS46lfLldwGyTryMlCBDvV/86g1ilkYoDRNU67nJsafUmU4Ezgr9FKNCWVjOsSupZJGqP3p4tAZubDKgISxsiUNWai/J6Y0noSBbYzomakV725+J/XTU1Y9adcJqlByZaLwlQE5P512TAFTIjJpZQpri9lbARVZQZm03BhuCtvrxOWpWyd1WuNK5LtWoWRx7O4BwuwYMbqMEd1KEJDBCe4RXenAfnxXl3PpatOSebOYU/cD5/AOMBjPU=</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> A Novel Multi-Goal RL Objective Based on Weighted Entropy Guiacsu [1971] proposed weighted entropy, which is an extension of Shannon entropy. The definition of weighted entropy is given by K X H w p = − w k p k log p k k =1 where is the weight of the event and is the probability of the event. p w " # T 1 X η H ( θ ) = H w p ( T g ) = E p r ( S t , G e ) | θ log p ( τ g ) t =1 This objective encourages the agent to maximize the expected return as well as to achieve more diverse goals. τ g = ( g s 0 , ..., g s * We use to denote all the achieved goals in the trajectory , i.e., . T ) τ g τ
<latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> <latexit sha1_base64="(nul)">(nul)</latexit> A Safe Surrogate Objective The surrogate is a lower bound of the objective function, i.e., , η L ( θ ) η L ( θ ) < η H ( θ ) where " T # 1 X η H ( θ ) = H w p ( T g ) = E p r ( S t , G e ) | θ log p ( τ g ) t =1 " T # X η L ( θ ) = Z · E q r ( S t , G e ) | θ t =1 q ( τ g ) = 1 Z p ( τ g ) (1 − p ( τ g )) is the normalization factor for . q ( τ g ) Z is the weighted entropy (Guiacsu, 1971; Kelbert et al., 2017), where the p ( T g ) H w weight is the accumulated reward , in our case. Σ T t =1 r ( S t , G e )
Maximum Entropy-based Prioritization (MEP) MEP Algorithm: We update the density model to construct a higher entropy distribution of achieved goals and update the agent with the more diversified training distribution.
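To make the loop concrete, here is a high-level sketch of one MEP iteration. The objects agent, density_model, and buffer are hypothetical placeholders with duck-typed methods (the paper's experiments build on DDPG with HER); this illustrates the structure of the algorithm rather than the released implementation.

```python
import numpy as np

def mep_iteration(agent, density_model, buffer, num_updates, batch_size):
    # 1. Collect fresh trajectories with the current goal-conditioned policy.
    buffer.add(agent.rollout())

    # 2. Fit the density model on the achieved goals in the buffer and derive
    #    the prioritization distribution q ∝ p(1 - p) over trajectories.
    p = np.clip(density_model.fit_and_score(buffer.achieved_goals()), 1e-12, 1 - 1e-12)
    q = p * (1.0 - p)
    q /= q.sum()

    # 3. Replay trajectories drawn from the higher-entropy distribution q and
    #    update the agent on the resulting (hindsight-relabeled) minibatches.
    for _ in range(num_updates):
        idx = np.random.choice(len(q), size=batch_size, p=q)
        agent.update(buffer.get(idx))
```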
Mean success rate and training time
Entropy of achieved goals versus training epoch (mean ± std):
No MEP:   5.13 ± 0.33   5.73 ± 0.33   5.78 ± 0.21
With MEP: 5.59 ± 0.34   5.81 ± 0.30   5.81 ± 0.18
Summary and Take-home Message
- Our approach improves performance by nine percentage points and sample-efficiency by a factor of two, while keeping computational time under control.
- Training the agent with many different kinds of goals, i.e., a higher-entropy goal distribution, helps the agent to learn.
- The code is available on GitHub: https://github.com/ruizhaogit/mep
- Poster: 06:30 -- 09:00 PM @ Pacific Ballroom #32
Thank you!