

  1. Stochastic Latent Actor-Critic: Deep Reinforcement Learning with a Latent Variable Model
  CS330 Student Presentation

  2. Table of Contents
  ● Motivation & problem
  ● Method overview
  ● Experiments
  ● Takeaways
  ● Discussion (strengths & weaknesses/limitations)

  3. Motivation
  ● We would like to use reinforcement learning algorithms to solve tasks using only low-level observations, such as learning robotic control from unstructured raw image data
  ● The standard approach relies on sensors to obtain information that would be helpful for learning
  ● Learning from image data alone is hard because the RL algorithm must learn both a useful representation of the data and the task itself
  ● This is called the representation learning problem

  4. Approach
  The paper takes a two-fold approach:
  1. Learn a predictive stochastic latent variable model for the given high-dimensional data (i.e., images)
  2. Perform reinforcement learning in the latent space of that model

  5. The Stochastic Latent Variable Model
  ● We would like our latent variable model to represent a partially observable Markov decision process (POMDP)
  ● The authors choose a graphical model for the latent variable model
  ● Previous work has used mixed deterministic-stochastic models, but SLAC's model is purely stochastic
  ● The graphical model is trained using amortized variational inference (a sketch of the factorization follows)
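  As a sketch of what "purely stochastic" means here, the model factorizes into a stochastic prior/transition, an image decoder, and an amortized inference network. One latent layer is shown for brevity; the notation (z_t latent, x_t image, a_t action) follows the paper, but the exact conditioning is an assumption:

    \begin{aligned}
    \text{prior / transition:}\quad & p(z_1), \qquad p(z_{t+1} \mid z_t, a_t) \\
    \text{observation decoder:}\quad & p(x_t \mid z_t) \\
    \text{inference network:}\quad & q(z_{t+1} \mid x_{t+1}, z_t, a_t)
    \end{aligned}

  Every factor is a learned stochastic (Gaussian) distribution; there is no deterministic recurrent path, which is what distinguishes SLAC's model from the mixed models of prior work.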

  6. Graphical Model Representation of the POMDP
  ● Since we can only observe part of the true state, we need past information to infer the next latent state
  ● We can derive an evidence lower bound (ELBO) for the POMDP, reconstructed below
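  The ELBO itself appeared as an image on the slide; a hedged reconstruction, following the standard sequential VAE form (the indexing convention is an assumption):

    \log p(x_{1:\tau+1} \mid a_{1:\tau}) \;\ge\;
    \mathbb{E}_{z_{1:\tau+1} \sim q}\!\left[ \sum_{t=0}^{\tau} \log p(x_{t+1} \mid z_{t+1})
    \;-\; D_{\mathrm{KL}}\!\left( q(z_{t+1} \mid x_{t+1}, z_t, a_t) \,\middle\|\, p(z_{t+1} \mid z_t, a_t) \right) \right]

  In words: reconstruct each observation from its latent while keeping the inferred posterior close to the learned prior dynamics.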

  7. Learning in the Latent Space
  ● The SLAC algorithm can be viewed as an extension of the Soft Actor-Critic (SAC) algorithm
  ● Learning is done in the maximum entropy setting, where we seek to maximize the entropy along with the expected reward (objective sketched below)
  ● The entropy term encourages exploration
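  The objective on the slide was an image; the standard maximum entropy RL objective it refers to is

    J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}
    \big[ r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \big]

  where α is a temperature parameter trading off reward maximization against policy entropy.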

  8. Soft Actor-Critic (SAC)
  ● As an actor-critic method, SAC learns both value function approximators (the critic) and a policy (the actor)
  ● SAC is trained by alternating policy evaluation and policy improvement
  ● Training is done in the latent space (i.e., in the state space z)

  9. Soft Actor-Critic (SAC), cont'd
  ● SAC learns two Q-networks, a V-network, and a policy network
  ● Two Q-networks are used to mitigate overestimation bias
  ● A V-network is used to stabilize training
  ● Gradients are taken through the expectations using the reparametrization trick (see the sketch below)
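  As a concrete illustration of these bullets, here is a minimal PyTorch sketch (not the authors' code) of the double-Q critic loss and the reparametrized actor loss; the network shapes and hyperparameters are placeholder assumptions:

    import math
    import torch
    import torch.nn as nn

    def mlp(in_dim, out_dim, hidden=256):
        return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                             nn.Linear(hidden, out_dim))

    z_dim, a_dim, alpha, gamma = 32, 4, 0.2, 0.99          # placeholder sizes/constants
    q1, q2 = mlp(z_dim + a_dim, 1), mlp(z_dim + a_dim, 1)  # two Q-networks
    v_target = mlp(z_dim, 1)                               # (target) V-network
    policy = mlp(z_dim, 2 * a_dim)                         # outputs mean and log-std

    def sample_action(z):
        # Reparametrized sample: a = tanh(mu + sigma * eps), eps ~ N(0, I),
        # so gradients flow through the expectation into the policy parameters.
        mu, log_std = policy(z).chunk(2, dim=-1)
        log_std = log_std.clamp(-5, 2)
        std = log_std.exp()
        pre_tanh = mu + std * torch.randn_like(mu)
        a = torch.tanh(pre_tanh)
        # Gaussian log-density plus the tanh change-of-variables correction.
        log_prob = (-0.5 * ((pre_tanh - mu) / std) ** 2
                    - log_std - 0.5 * math.log(2 * math.pi)).sum(-1)
        log_prob = log_prob - torch.log(1 - a ** 2 + 1e-6).sum(-1)
        return a, log_prob

    def losses(z, a, r, z_next):
        # Critic: both Q-networks regress to the same bootstrapped target.
        target = (r + gamma * v_target(z_next).squeeze(-1)).detach()
        q_in = torch.cat([z, a], dim=-1)
        critic_loss = ((q1(q_in).squeeze(-1) - target) ** 2
                       + (q2(q_in).squeeze(-1) - target) ** 2).mean()
        # Actor: the minimum over the two Q-networks mitigates overestimation bias.
        # (The V-network itself regresses to E[min Q - alpha * log pi]; omitted here.)
        a_new, log_prob = sample_action(z)
        q_new = torch.min(q1(torch.cat([z, a_new], dim=-1)),
                          q2(torch.cat([z, a_new], dim=-1))).squeeze(-1)
        actor_loss = (alpha * log_prob - q_new).mean()
        return critic_loss, actor_loss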

  10. Putting It All Together
  ● Finally, the latent variable model and the agent are trained together
  ● The full SLAC model has two layers of latent variables (sketched below)
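  A hedged sketch of the two-layer factorization (the exact conditioning is quoted from memory and should be checked against the paper's appendix): the latent is split as z_t = (z_t^1, z_t^2), with

    \begin{aligned}
    z_{t+1}^1 &\sim p(z_{t+1}^1 \mid z_t^2, a_t) \\
    z_{t+1}^2 &\sim p(z_{t+1}^2 \mid z_{t+1}^1, z_t^2, a_t) \\
    x_t &\sim p(x_t \mid z_t^1, z_t^2)
    \end{aligned}

  Training then interleaves gradient steps on the model's ELBO with SAC updates for the actor and critics, with the critic operating on latents sampled from the inference model.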

  11. Image-Based Continuous Control Tasks
  ● Four tasks from the DeepMind Control Suite: Cheetah run, Walker walk, Ball-in-cup catch, Finger spin
  ● Four tasks from OpenAI Gym: Cheetah, Walker, Hopper, Ant

  12. Comparison with Other Models
  ● SAC
  ○ Off-policy actor-critic algorithm, learning directly from images or true states
  ● D4PG
  ○ Off-policy actor-critic algorithm, learning directly from images
  ● PlaNet
  ○ Model-based RL method for learning directly from images
  ○ Mixed deterministic/stochastic sequential latent variable model
  ○ No explicit policy learning; uses model predictive control (MPC) instead
  ● DVRL
  ○ On-policy model-free RL algorithm
  ○ Mixed deterministic/stochastic latent-variable POMDP model

  13. Results on DeepMind Control Suite (4 tasks)
  ● Sample efficiency of SLAC is comparable to or better than that of both the model-based and model-free baselines
  ● Outperforms DVRL
  ○ The efficient off-policy RL algorithm takes advantage of the learned representation

  14. Results on OpenAI Gym (4 tasks)
  - Tasks are more challenging than the DeepMind Control Suite tasks
  - Rewards are not shaped and not bounded between 0 and 1
  - More complex dynamics
  - Episodes terminate on failure
  - PlaNet is unable to solve the last three tasks and obtains only a sub-optimal policy on Cheetah

  15. Robotic Manipulation Tasks
  - 9-DoF 3-fingered DClaw robot
  - Tasks: push a door, close a drawer, reach out and pick up an object
  *Note: the SLAC algorithm achieves all of the above behaviors

  16. Robotic Manipulation Tasks (continued)
  - 9-DoF 3-fingered DClaw robot
  - Goal: rotate a valve from various starting positions to various desired goal locations
  - Three different settings:
  1. Fixed goal position
  2. Random goal from 3 options
  3. Random goal

  17. Results
  Goal: turning a valve to a desired location
  Takeaways:
  - In the fixed goal setting, all methods perform similarly
  - In the three-random-goal setting, SLAC and SAC from raw images perform well
  - In the random goal setting, SLAC performs better than SAC from raw images and comparably to SAC from true states

  18. Latent Variable Models
  ● Six different models:
  ○ Non-sequential VAE
  ○ PlaNet (mixed deterministic/stochastic model)
  ○ Simple filtering (without factoring the model)
  ○ Fully deterministic
  ○ Mixed deterministic/stochastic model
  ○ Fully stochastic
  ● All compared under the fixed RL framework of SLAC
  Takeaway: the fully stochastic model outperforms the others

  19. SLAC Paper Summary
  ● Proposes the SLAC algorithm for RL from high-dimensional image inputs
  ● Combines off-policy model-free RL with representation learning via a sequential stochastic state space model
  ● SLAC's fully stochastic model outperforms other latent variable models
  ● Achieves improved sample efficiency and final task performance
  ○ Four DeepMind Control Suite tasks and four OpenAI Gym tasks
  ○ Simulated robotic manipulation tasks (9-DoF 3-fingered DClaw robot on four tasks)

  20. Limitations
  ● For fairness, performance evaluations of the other methods seem necessary
  ○ Compare the different latent variable models outside of SLAC's RL framework, not just within it
  ● The paper states the benefits of using two layers of latent variables
  ○ Insufficient explanation of why this brings a good balance
  ● The choice of reward functions for the simulated robotics tasks is not well explained
  ● Insufficient explanation of the weak performance of SAC from true states in the three-random-goal setting (see the previous slide)
  ● Performance on other image-based continuous control tasks remains untested

  21. Appendix A (reward functions)

  22. Appendix B (SLAC algorithm)

  23. The log-likelihood of the observations can be bounded from below by the ELBO (see the reconstruction under slide 6)
