Bidirectional Model-based Policy Optimization
Hang Lai, Jian Shen, Weinan Zhang, Yong Yu
Shanghai Jiao Tong University
Contents 1. Background & Motivation 2. Method 3. Theoretical Analysis 4. Experiment Results
Background
Model-based reinforcement learning (MBRL):
• Build a model of the environment dynamics
• Use the model to help decision making
Challenge: compounding model error. As model rollouts get longer, the predicted trajectory drifts further and further from the real trajectory.
Motivation
Humans in the real world:
• Predict future consequences forward
• Imagine traces leading to a goal backward
Existing methods:
• Learn a forward model to plan ahead.
This paper:
• Additionally learns a backward model to reduce reliance on the accuracy of the forward model.
Contents 1. Background & Motivation 2. Method 3. Theoretical Analysis 4. Experiment Results
Method
Bidirectional models + the MBPO [1] framework = Bidirectional Model-based Policy Optimization (BMPO).
Other components:
• State sampling strategy
• Incorporating model predictive control (MPC)
Preliminary: MBPO
• Interact with the environment using the current policy.
• Train a forward model ensemble on the real data.
• Generate short branched rollouts in the model with the current policy.
• Improve the policy with real & generated data.
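As a rough illustration, the loop above might look like the following minimal Python sketch. It is not the authors' implementation; env, policy, and model_ensemble and their methods are hypothetical placeholders.

import random

def mbpo_loop(env, policy, model_ensemble, num_epochs=100,
              steps_per_epoch=1000, rollout_length=1, num_rollouts=400):
    """Minimal sketch of an MBPO-style outer loop (placeholder interfaces)."""
    real_buffer, model_buffer = [], []
    state = env.reset()
    for _ in range(num_epochs):
        # 1. Interact with the environment using the current policy.
        for _ in range(steps_per_epoch):
            action = policy.sample(state)
            next_state, reward, done = env.step(action)
            real_buffer.append((state, action, reward, next_state, done))
            state = env.reset() if done else next_state

        # 2. Train the forward model ensemble on real data.
        model_ensemble.train(real_buffer)

        # 3. Generate short branched rollouts starting from real states.
        starts = random.sample(real_buffer, min(num_rollouts, len(real_buffer)))
        for s, *_ in starts:
            for _ in range(rollout_length):
                a = policy.sample(s)
                s_next, r = model_ensemble.predict(s, a)
                model_buffer.append((s, a, r, s_next, False))
                s = s_next

        # 4. Improve the policy with real & generated data (e.g., SAC updates).
        policy.update(real_buffer + model_buffer)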
Model Learning
• Use an ensemble of probabilistic networks for both the forward model (predicting $s_{t+1}$ from $(s_t, a_t)$) and the backward model (predicting $s_t$ from $(s_{t+1}, a_t)$).
• Each network outputs a Gaussian and is trained with the negative log-likelihood loss
$\mathcal{L}(\theta) = \sum_{n=1}^{N} \big[\mu_\theta(s_n, a_n) - s_{n+1}\big]^{\top} \Sigma_\theta^{-1}(s_n, a_n) \big[\mu_\theta(s_n, a_n) - s_{n+1}\big] + \log \det \Sigma_\theta(s_n, a_n)$,
and analogously for the backward model with inputs $(s_{n+1}, a_n)$ and target $s_n$.
$\mu_\theta$ and $\Sigma_\theta$: predicted mean and covariance; $N$: number of real transitions.
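For concreteness, here is a minimal PyTorch-style sketch of this Gaussian negative log-likelihood for a single ensemble member. The network architecture and diagonal-covariance parameterization are illustrative assumptions, not the authors' exact design.

import torch
import torch.nn as nn

class ProbabilisticDynamics(nn.Module):
    """One ensemble member: predicts a Gaussian over the next (or previous) state."""
    def __init__(self, state_dim, action_dim, hidden=200):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
        )
        self.mu_head = nn.Linear(hidden, state_dim)       # predicted mean
        self.logvar_head = nn.Linear(hidden, state_dim)   # diagonal log-variance

    def forward(self, state, action):
        h = self.net(torch.cat([state, action], dim=-1))
        return self.mu_head(h), self.logvar_head(h)

def gaussian_nll(model, state, action, target_state):
    """Negative log-likelihood of the target under the predicted Gaussian (up to constants)."""
    mu, logvar = model(state, action)
    inv_var = torch.exp(-logvar)
    return (((mu - target_state) ** 2) * inv_var + logvar).sum(dim=-1).mean()

The forward model is trained with target_state = s_{t+1}; the backward model takes (s_{t+1}, a_t) as input with target s_t.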
Backward Policy
Backward policy $\tilde{\pi}_\phi(a_t \mid s_{t+1})$: takes actions given the next state; used together with the backward model to generate backward rollouts. Two ways to train it:
• By maximum likelihood estimation: maximize $\sum_{n=1}^{N} \log \tilde{\pi}_\phi(a_n \mid s_{n+1})$ over real transitions.
• By conditional GAN: a discriminator is trained to distinguish real $(s_{n+1}, a_n)$ pairs from pairs whose action is sampled from $\tilde{\pi}_\phi$, and $\tilde{\pi}_\phi$ is trained to fool it.
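A minimal sketch of the MLE variant, assuming a Gaussian backward policy (the architecture below is an illustrative assumption, not the authors' exact design):

import torch
import torch.nn as nn

class GaussianBackwardPolicy(nn.Module):
    """Backward policy: a Gaussian over the action that leads into a given next state."""
    def __init__(self, state_dim, action_dim, hidden=200):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.SiLU(),
                                 nn.Linear(hidden, hidden), nn.SiLU())
        self.mu = nn.Linear(hidden, action_dim)
        self.log_std = nn.Linear(hidden, action_dim)

    def log_prob(self, next_state, action):
        h = self.net(next_state)
        dist = torch.distributions.Normal(self.mu(h), self.log_std(h).exp())
        return dist.log_prob(action).sum(dim=-1)

def backward_policy_mle_loss(backward_policy, next_state, action):
    # Maximize the likelihood of real actions given the observed next states.
    return -backward_policy.log_prob(next_state, action).mean()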
State Sampling Strategy
MBPO: randomly selects states from the environment data buffer to begin model rollouts.
BMPO (ours): samples high-value states to begin rollouts; the probability of selecting a state increases with its estimated value (a softmax over the estimated values of states in the buffer).
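One plausible implementation of such value-weighted sampling is a temperature-scaled softmax, sketched below; the temperature knob is an assumed hyper-parameter, not necessarily the exact formulation used in the paper.

import numpy as np

def sample_start_states(states, value_estimates, num_samples, temperature=1.0):
    """Sample rollout start states with probability increasing in their estimated value."""
    v = np.asarray(value_estimates, dtype=np.float64)
    logits = (v - v.max()) / temperature            # subtract max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()   # softmax over buffer states
    idx = np.random.choice(len(states), size=num_samples, p=probs)
    return [states[i] for i in idx]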
Environment Interaction
MBPO: directly uses the current policy.
BMPO: uses MPC:
• Generate candidate action sequences from the current policy.
• Simulate the corresponding trajectories in the model.
• Execute the first action of the sequence that yields the highest return.
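A minimal sketch of this MPC-style selection, assuming hypothetical policy.sample and model.predict interfaces (the latter returning a next state and reward); not the authors' exact procedure.

import numpy as np

def mpc_action(state, policy, model, num_candidates=50, horizon=5):
    """Return the first action of the candidate sequence with the highest simulated return."""
    best_return, best_first_action = -np.inf, None
    for _ in range(num_candidates):
        s, total, first_action = state, 0.0, None
        for t in range(horizon):
            a = policy.sample(s)           # candidate actions come from the current policy
            if t == 0:
                first_action = a
            s, r = model.predict(s, a)     # simulate one step in the learned model
            total += r
        if total > best_return:
            best_return, best_first_action = total, first_action
    return best_first_action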
Overall algorithm
MBPO vs. BMPO (ours): [side-by-side algorithm pseudocode]
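To make the difference from MBPO concrete, the sketch below shows how a BMPO-style branched rollout could extend both backward and forward from a sampled start state. All names (forward_model, backward_model, backward_policy, policy) are hypothetical placeholders rather than the authors' code.

def bidirectional_rollout(start_state, policy, backward_policy,
                          forward_model, backward_model,
                          backward_steps, forward_steps):
    """Generate one branched rollout extending both backward and forward from start_state."""
    transitions = []

    # Backward branch: imagine a trace that could have led to the start state.
    s_next = start_state
    for _ in range(backward_steps):
        a = backward_policy.sample(s_next)             # which action could have led here?
        s_prev, r = backward_model.predict(s_next, a)  # predicted previous state and reward
        transitions.append((s_prev, a, r, s_next))
        s_next = s_prev

    # Forward branch: predict future consequences of the current policy.
    s = start_state
    for _ in range(forward_steps):
        a = policy.sample(s)
        s_forward, r = forward_model.predict(s, a)
        transitions.append((s, a, r, s_forward))
        s = s_forward

    return transitions

The generated transitions are added to the model buffer and used for policy improvement, as in MBPO.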
Contents 1. Background & Motivation 2. Method 3. Theoretical Analysis 4. Experiment Results
Theoretical Analysis
For the bidirectional model with backward rollout length $k_1$ and forward rollout length $k_2$: the expected return in the real environment is lower-bounded by the expected return in the branched rollout minus a discrepancy term that depends on the policy shift and the model error.
For the forward-only model with rollout length $k_1 + k_2$: the corresponding discrepancy term is larger, since model error compounds over the full rollout rather than over the shorter backward and forward segments; this justifies generating rollouts bidirectionally.
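Schematically, in MBPO-style notation (this is only a sketch of the bound's form; the exact constants are given in the paper):

\eta[\pi] \;\ge\; \eta^{\mathrm{branch}}[\pi] \;-\; C(\epsilon_m, \epsilon_\pi, k_1, k_2)

where $\eta[\pi]$ is the expected return in the environment, $\eta^{\mathrm{branch}}[\pi]$ the expected return in the branched rollout, $\epsilon_\pi$ the policy shift, and $\epsilon_m$ the model error. The forward-only bound takes the same form with a larger discrepancy term $C'(\epsilon_m, \epsilon_\pi, k_1 + k_2)$.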
Contents 1. Background & Motivation 2. Method 3. Theoretical Analysis 4. Experiment Results
Settings
• Environments: Pendulum, Hopper, Walker, Ant
• Baselines: MBPO [1], SLBO [2], PETS [3], SAC [4]
Comparison Result
Model Error
Model validation loss (single-step error)
Compounding Model Error
Assume a real trajectory of length $h$ is $(s_0, a_0, s_1, a_1, \dots, s_h)$. Compounding error is measured by rolling the learned model along this trajectory and comparing its multi-step predictions with the real states.
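One plausible way to measure this, sketched under an assumed predict_state interface (the model returns its mean next-state prediction): roll the model along the real action sequence and record the per-step deviation from the real states.

import numpy as np

def compounding_error(model, states, actions):
    """Multi-step prediction error of a forward model along one real trajectory.

    states:  real states [s_0, ..., s_h]; actions: real actions [a_0, ..., a_{h-1}].
    Returns the L2 error between the model's t-step prediction and the real s_t.
    """
    errors = []
    s_pred = states[0]
    for t, a in enumerate(actions):
        s_pred = model.predict_state(s_pred, a)   # feed the model's own prediction back in
        errors.append(np.linalg.norm(s_pred - states[t + 1]))
    return errors

For the backward model, the analogous procedure starts from $s_h$ and rolls backward along the reversed action sequence.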
Backward Policy Choice
Figure 1: Comparison of two heuristic design choices for the backward policy loss: MLE loss and GAN loss.
Ablation study
Figure 2: Ablation study of three crucial components: forward model, backward model, and MPC.
Hyperparameter study
Figure 3: Sensitivity of our algorithm to the hyper-parameter.
Hyperparameter study
Figure 4: Average return with different backward rollout lengths and a fixed forward rollout length.
References
[1] Janner, Michael, et al. "When to trust your model: Model-based policy optimization." Advances in Neural Information Processing Systems. 2019.
[2] Luo, Yuping, et al. "Algorithmic framework for model-based deep reinforcement learning with theoretical guarantees." arXiv preprint arXiv:1807.03858 (2018).
[3] Chua, Kurtland, et al. "Deep reinforcement learning in a handful of trials using probabilistic dynamics models." Advances in Neural Information Processing Systems. 2018.
[4] Haarnoja, Tuomas, et al. "Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor." arXiv preprint arXiv:1801.01290 (2018).
Thanks for your interest! Please feel free to contact me at laihang@apex.sjtu.edu.cn