This lecture will be recorded!
Welcome to DS595/CS525 Reinforcement Learning
Prof. Yanhua Li
Time: 6:00pm – 8:50pm, Thursday (Zoom Lecture), Fall 2020
No Quiz Today
Project 3 due today
Next Thursday: no class. Happy Thanksgiving!
Project 4 is available. Starts Thursday 10/29.
v https://github.com/yingxue-zhang/DS595CS525-RL-Projects/tree/master/Project4
v Important Dates:
v Project Proposal: Thursday 11/12/2020
v Progress report: Thursday 11/26/2020
v Final Project:
§ Tuesday 12/8/2020: team project report due
§ Thursday 12/10/2020: Virtual Poster Session
This Lecture
v Actor-Critic methods
§ A2C
§ A3C
§ Pathwise Derivative Policy Gradient
v Advanced RL Techniques
§ Advanced techniques for DQN
• Multi-step DQN, Noisy Net DQN
• Distributional DQN
• DQN for continuous action space
§ Sparse Reward
• Reward shaping, Curiosity module
• Curriculum learning, Hierarchical RL
Course Map: Reinforcement Learning vs. Inverse Reinforcement Learning

Single Agent
Reinforcement Learning:
v Tabular representation of reward
§ Model-based control
§ Model-free control (MC, SARSA, Q-Learning)
v Function representation of reward
1. Linear value function approx. (MC, SARSA, Q-Learning)
2. Value function approximation (Deep Q-Learning, Double DQN, Prioritized DQN, Dueling DQN)
3. Policy function approximation (Policy Gradient, PPO, TRPO)
4. Actor-Critic methods (A2C, A3C, Pathwise Derivative PG)
§ Advanced topics in RL (Sparse Rewards)
v Review of Deep Learning: as a basis for non-linear function approximation (used in 2-4)

Inverse Reinforcement Learning:
v Linear reward function learning
§ Imitation learning
§ Apprenticeship learning
§ Inverse reinforcement learning (MaxEnt IRL, MaxCausalEnt IRL, MaxRelEnt IRL)
v Non-linear reward function learning
§ Generative adversarial imitation learning (GAIL)
§ Adversarial inverse reinforcement learning (AIRL)
v Review of Generative Adversarial Nets: as a basis for non-linear IRL

Multiple Agents
Reinforcement Learning:
v Multi-Agent Reinforcement Learning (Multi-agent Actor-Critic, etc.)
v Applications

Inverse Reinforcement Learning:
v Multi-Agent Inverse Reinforcement Learning (MA-GAIL, MA-AIRL, AMA-GAIL)
This Lecture (outline as above), plus:
v Project #4 progress update
Model-free RL Algorithms
v Value-based (learns a value function)
v Policy-based (learns a policy function)
v Actor-Critic (learns both a value function and a policy function)
Model-free RL Algorithms
v Value-based (learns a value function)
§ Deep Q-Learning (DQN)
§ Double DQN
§ Dueling DQN
§ Prioritized DQN
v Policy-based (learns a policy function)
§ Basic Policy Gradient Algorithm
§ REINFORCE
§ Vanilla, PPO, TRPO, PPO2
v Actor-Critic (learns both a value function and a policy function)
§ A2C
§ A3C
§ Pathwise Derivative Policy Gradient
Basic Policy Gradient Algorithm: alternate between data collection (run the current policy to collect trajectories) and a model update (one gradient step on the policy). Each batch of collected data is used only once, and the resulting gradient estimate is an unbiased estimator of the policy gradient.
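A minimal sketch of this loop in PyTorch, assuming a small discrete-action policy network; the sizes, the `update` helper, and the use of full-trajectory returns are illustrative choices, not the course's reference implementation:

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

# Hypothetical sizes; in practice these come from the environment.
obs_dim, n_actions, gamma, lr = 4, 2, 0.99, 1e-3

policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(policy.parameters(), lr=lr)

def update(states, actions, rewards):
    """One policy-gradient update; the collected batch is used only once."""
    # Discounted return-to-go G_t for every step of the trajectory.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()

    states = torch.as_tensor(states, dtype=torch.float32)
    actions = torch.as_tensor(actions)
    returns = torch.as_tensor(returns, dtype=torch.float32)

    log_probs = Categorical(logits=policy(states)).log_prob(actions)
    loss = -(log_probs * returns).mean()   # unbiased REINFORCE estimator

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```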
Exploration: Epsilon Greedy vs. Boltzmann Exploration. Epsilon-greedy takes a random action with probability ε and the greedy action otherwise; Boltzmann exploration samples actions with probabilities proportional to exp(Q(s,a)), so better actions are tried more often but no action is ever ruled out.
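For concreteness, the two exploration rules could be implemented as follows; `q_values` is assumed to be a 1-D tensor of Q(s, a) values for a single state:

```python
import torch

def epsilon_greedy(q_values, epsilon=0.1):
    # With probability epsilon pick a uniformly random action,
    # otherwise pick the greedy one.
    if torch.rand(1).item() < epsilon:
        return torch.randint(len(q_values), (1,)).item()
    return torch.argmax(q_values).item()

def boltzmann(q_values, temperature=1.0):
    # Sample an action with probability proportional to exp(Q(s,a)/T);
    # higher-valued actions are tried more often, but none is ruled out.
    probs = torch.softmax(q_values / temperature, dim=0)
    return torch.multinomial(probs, 1).item()
```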
Advantage Actor-Critic (A2C): combine value function approximation (the critic) with the policy gradient (the actor). The critic's value estimate replaces the Monte-Carlo return in the policy gradient, weighting each action by its advantage.
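A sketch of one A2C update under these assumptions: separate `actor` and `critic` networks, their optimizers, and a batch of one-step transitions; all names are illustrative:

```python
import torch
import torch.nn.functional as F
from torch.distributions import Categorical

def a2c_update(actor, critic, actor_opt, critic_opt,
               states, actions, rewards, next_states, dones, gamma=0.99):
    states = torch.as_tensor(states, dtype=torch.float32)
    next_states = torch.as_tensor(next_states, dtype=torch.float32)
    actions = torch.as_tensor(actions)
    rewards = torch.as_tensor(rewards, dtype=torch.float32)
    dones = torch.as_tensor(dones, dtype=torch.float32)

    # Critic: regress V(s) toward the one-step TD target r + gamma * V(s').
    values = critic(states).squeeze(-1)
    with torch.no_grad():
        targets = rewards + gamma * (1 - dones) * critic(next_states).squeeze(-1)
    critic_loss = F.mse_loss(values, targets)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: policy gradient weighted by the advantage A = target - V(s).
    advantages = (targets - values).detach()
    log_probs = Categorical(logits=actor(states)).log_prob(actions)
    actor_loss = -(log_probs * advantages).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```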
Asynchronous Advantage Actor-Critic (A3C): from A2C to A3C. Multiple workers interact with their own copies of the environment in parallel and asynchronously update a shared set of actor and critic parameters.
Pathwise Derivative Policy Gradient
David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, Martin Riedmiller, "Deterministic Policy Gradient Algorithms", ICML, 2014.
Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, Daan Wierstra, "Continuous Control with Deep Reinforcement Learning", ICLR, 2016.
Idea: replace the ε-greedy action selection (the arg max over Q) with a policy network π that directly outputs the action; the actor is then trained so that its output maximizes the critic's Q value.
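A hedged sketch of this idea (DDPG-style updates), assuming an `actor(s)` that outputs a continuous action, a `critic(s, a)`, and batch tensors that are already prepared; the target networks and replay buffer of the full algorithm are omitted here:

```python
import torch
import torch.nn.functional as F

def pathwise_update(actor, critic, actor_opt, critic_opt,
                    states, actions, rewards, next_states, dones, gamma=0.99):
    # Critic: ordinary Q-learning target, but the "max over actions" is
    # replaced by the action proposed by the policy network pi(s').
    with torch.no_grad():
        next_actions = actor(next_states)
        targets = rewards + gamma * (1 - dones) * critic(next_states, next_actions).squeeze(-1)
    q = critic(states, actions).squeeze(-1)
    critic_loss = F.mse_loss(q, targets)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: gradient ascent on Q(s, pi(s)); the gradient flows through the
    # critic into the actor (the critic's own gradients from this step are
    # simply discarded by the next zero_grad call).
    actor_loss = -critic(states, actor(states)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```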
Recap: Model-free RL Algorithms (value-based, policy-based, and actor-critic; see the taxonomy above).
This Lecture (continued)
v Advanced RL Techniques: advanced techniques for DQN
§ Multi-step DQN, Noisy Net DQN
§ Distributional DQN
§ DQN for continuous action space
Multi-step DQN: balance between MC and TD. Store N-step transitions in the experience buffer; the target becomes the sum of N discounted rewards plus the bootstrapped Q value at the state N steps later.
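A small sketch of the N-step target computation, assuming the buffer gives us the rewards and done flags of the next N steps plus the target network's bootstrap value:

```python
def n_step_target(rewards, done_flags, bootstrap_q, gamma=0.99):
    """N-step return: sum of N discounted rewards, then bootstrap with
    max_a Q_target(s_{t+N}, a).  `rewards` and `done_flags` cover steps
    t .. t+N-1; `bootstrap_q` is the target network's value at s_{t+N}."""
    target, discount = 0.0, 1.0
    for r, done in zip(rewards, done_flags):
        target += discount * r
        if done:               # episode ended before N steps; no bootstrap
            return target
        discount *= gamma
    return target + discount * bootstrap_q
```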
Noisy Net for DQN (https://arxiv.org/abs/1706.01905, https://arxiv.org/abs/1706.10295)
v Noise on actions (epsilon greedy): add randomness when choosing the action.
v Noise on parameters: inject noise into the parameters of the Q-function at the beginning of each episode; the noise does NOT change within an episode.
Noisy Net: noise on actions gives random exploration (the same state can lead to different actions), while noise on parameters gives systematic, state-dependent exploration (within an episode, the same state always leads to the same action, so the agent explores in a consistent way).
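A minimal sketch of a noisy linear layer (independent Gaussian noise on every weight, resampled once per episode); the initialization constants and the non-factorized noise are simplifications relative to the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyLinear(nn.Module):
    """y = (mu_w + sigma_w * eps_w) x + (mu_b + sigma_b * eps_b).
    Call reset_noise() once at the start of each episode; the noise is then
    held fixed for the whole episode (state-dependent exploration)."""
    def __init__(self, in_features, out_features, sigma0=0.5):
        super().__init__()
        self.mu_w = nn.Parameter(torch.randn(out_features, in_features) * 0.01)
        self.mu_b = nn.Parameter(torch.zeros(out_features))
        self.sigma_w = nn.Parameter(torch.full((out_features, in_features),
                                               sigma0 / in_features ** 0.5))
        self.sigma_b = nn.Parameter(torch.full((out_features,),
                                               sigma0 / in_features ** 0.5))
        self.register_buffer("eps_w", torch.zeros(out_features, in_features))
        self.register_buffer("eps_b", torch.zeros(out_features))

    def reset_noise(self):
        # Resample the noise (e.g., at the start of each episode).
        self.eps_w.normal_()
        self.eps_b.normal_()

    def forward(self, x):
        weight = self.mu_w + self.sigma_w * self.eps_w
        bias = self.mu_b + self.sigma_b * self.eps_b
        return F.linear(x, weight, bias)
```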
Demo: https://blog.openai.com/better-exploration-with-parameter-noise/ (which one is action noise vs. parameter noise?)
Distributional Q-function: Q(s,a) is only the expected return; different return distributions (e.g., over the range -10 to 10) can have the same expected value, so modeling the full distribution keeps information that the expectation throws away.
Distributional Q-function: for a state s with 3 actions, instead of a network with 3 outputs (one expected Q value per action), use a network with 15 outputs: each action gets a 5-bin distribution over its return.
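A sketch of such a head for 3 actions with 5 bins each (C51-style, simplified); the bin values in `support` are an assumed range, matching the -10 to 10 axis in the slide's figure:

```python
import torch
import torch.nn as nn

n_actions, n_bins = 3, 5
support = torch.linspace(-10.0, 10.0, n_bins)   # return value of each bin

class DistributionalHead(nn.Module):
    def __init__(self, feature_dim=128):
        super().__init__()
        self.out = nn.Linear(feature_dim, n_actions * n_bins)  # 15 outputs

    def forward(self, features):
        logits = self.out(features).view(-1, n_actions, n_bins)
        probs = torch.softmax(logits, dim=-1)       # a distribution per action
        q_values = (probs * support).sum(dim=-1)    # expected return per action
        return probs, q_values
```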
Demo https://youtu.be/yFBwyPuO2Vg
Rainbow (combining the DQN improvements above into one agent): https://arxiv.org/abs/1710.02298
Continuous Actions: Q-learning needs a = arg max_a Q(s, a), which is hard when the action is continuous.
Solution 1: sample a set of actions and see which sampled action obtains the largest Q value. (Pros and cons?)
Solution 2: use gradient ascent on the action to solve the optimization problem. (Pros and cons?)
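Both solutions, sketched under the assumption of a critic `q_net(states, actions)` over a 1-D continuous action and a single 1-D `state` tensor; the sample count, bounds, and step sizes are illustrative:

```python
import torch

def argmax_by_sampling(q_net, state, n_samples=64, low=-1.0, high=1.0):
    # Solution 1: sample candidate actions and keep the one with the largest Q.
    actions = torch.rand(n_samples, 1) * (high - low) + low
    states = state.expand(n_samples, -1)
    q = q_net(states, actions).squeeze(-1)
    return actions[q.argmax()]

def argmax_by_gradient_ascent(q_net, state, steps=20, lr=0.1):
    # Solution 2: treat the action as a variable and run gradient ascent on Q.
    a = torch.zeros(1, 1, requires_grad=True)
    opt = torch.optim.Adam([a], lr=lr)
    for _ in range(steps):
        loss = -q_net(state.unsqueeze(0), a).sum()   # minimize -Q == maximize Q
        opt.zero_grad(); loss.backward(); opt.step()
    return a.detach()
```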
Continuous Actions, Solution 3: design a network so the maximization is easy. The network takes s and outputs a vector μ(s), a matrix Σ(s), and a scalar V(s), with Q(s,a) = -(a - μ(s))ᵀ Σ(s) (a - μ(s)) + V(s); the maximizing action is then simply a = μ(s).
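A simplified sketch of this construction (close to the NAF idea); restricting Σ(s) to a positive diagonal matrix is my simplification to keep the code short:

```python
import torch
import torch.nn as nn

class EasyToMaximizeQ(nn.Module):
    """Q(s,a) = -(a - mu(s))^T Sigma(s) (a - mu(s)) + V(s).
    The quadratic term is never positive, so the maximizing action is
    simply a = mu(s); no inner optimization is needed."""
    def __init__(self, obs_dim, act_dim, hidden=128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, act_dim)        # vector mu(s)
        self.log_diag = nn.Linear(hidden, act_dim)  # diagonal of matrix Sigma(s)
        self.v = nn.Linear(hidden, 1)               # scalar V(s)

    def forward(self, s, a):
        h = self.body(s)
        mu, v = self.mu(h), self.v(h).squeeze(-1)
        diag = torch.exp(self.log_diag(h))          # keeps Sigma(s) positive
        quad = (diag * (a - mu) ** 2).sum(dim=-1)
        return -quad + v

    def best_action(self, s):
        return self.mu(self.body(s))                # argmax_a Q(s, a) = mu(s)
```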
Continuous Actions, Solution 4: don't use Q-learning. Move from a purely value-based method (learning a critic) to a policy-based method (learning an actor) or an actor-critic method (learning an actor together with a critic). https://www.youtube.com/watch?v=ZhsEKTo7V04
This Lecture (continued)
v Advanced RL Techniques: Sparse Reward
§ Reward shaping, Curiosity module
§ Curriculum learning, Hierarchical RL
v Project #4 progress update
Sparse Reward: Reward Shaping
Reward Shaping: when the environment reward is sparse, add extra, hand-designed rewards to guide the agent toward behavior we know is useful.
Reward Shaping example: VizDoom. https://openreview.net/forum?id=Hk3mPK5gg&noteId=Hk3mPK5gg
Reward Shaping (VizDoom example): give the agent a reward whenever it gets closer to its target. Designing such rewards needs domain knowledge. https://openreview.net/pdf?id=Hk3mPK5gg
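One principled way to do this is potential-based shaping, r' = r + γΦ(s') - Φ(s), which leaves the optimal policy unchanged; the distance-to-goal potential below is an illustrative piece of domain knowledge, not the paper's design:

```python
import numpy as np

def shaped_reward(r, s, s_next, goal, gamma=0.99):
    """Potential-based shaping: extra reward for moving closer to the goal.
    Phi(s) = -distance(s, goal) is hand-designed domain knowledge."""
    phi = lambda state: -np.linalg.norm(np.asarray(state) - np.asarray(goal))
    return r + gamma * phi(s_next) - phi(s)
```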
Curiosity (https://arxiv.org/abs/1705.05363): at every step, in addition to the reward from the environment, an intrinsic curiosity module (ICM) produces an extra reward; the actor is updated to maximize the sum of both, which keeps it exploring even when the environment reward is sparse.
Intrinsic Curiosity Module: Network 1 takes the current state and action and predicts the next state; the larger the difference between the predicted and the actual next state, the larger the intrinsic reward, which encourages exploration of hard-to-predict states. Problem: some states are hard to predict but not important (trivial events).
Intrinsic Curiosity Module (full design): a feature extractor first maps states to features. Network 1 predicts the next state's features from the current features and the action (its prediction error is the curiosity reward). Network 2 predicts the action from the features of the current and next states, so the feature extractor learns to keep only the parts of the state that the agent's actions can influence and to filter out trivial events.
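A minimal sketch of that two-network design; layer sizes are illustrative and the loss weighting of the real ICM paper is omitted. `a` is assumed to be a LongTensor of discrete action indices:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ICM(nn.Module):
    def __init__(self, obs_dim, n_actions, feat_dim=32):
        super().__init__()
        # Shared feature extractor phi(s).
        self.feat = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                  nn.Linear(64, feat_dim))
        # Network 1 (forward model): predict phi(s') from phi(s) and a.
        self.forward_model = nn.Linear(feat_dim + n_actions, feat_dim)
        # Network 2 (inverse model): predict a from phi(s) and phi(s').
        self.inverse_model = nn.Linear(2 * feat_dim, n_actions)
        self.n_actions = n_actions

    def forward(self, s, a, s_next):
        phi, phi_next = self.feat(s), self.feat(s_next)
        a_onehot = F.one_hot(a, self.n_actions).float()

        # Curiosity reward: forward-prediction error in feature space
        # (features are detached so this loss does not shape the extractor).
        pred = self.forward_model(torch.cat([phi.detach(), a_onehot], dim=-1))
        curiosity_reward = (pred - phi_next.detach()).pow(2).mean(dim=-1)

        # Inverse loss: forces the features to keep only what the action
        # affects, so unpredictable-but-irrelevant details are ignored.
        a_logits = self.inverse_model(torch.cat([phi, phi_next], dim=-1))
        inverse_loss = F.cross_entropy(a_logits, a)

        return curiosity_reward.detach(), curiosity_reward.mean() + inverse_loss
```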
Sparse Reward: Curriculum Learning
Curriculum Learning
v Start from simple training examples, then make the tasks harder and harder (example: VizDoom).
Sparse Reward: Hierarchical Reinforcement Learning
Hierarchical Reinforcement Learning: a high-level agent proposes subgoals, and lower-level agents are rewarded for reaching them, which turns one sparse-reward task into several easier ones. https://arxiv.org/abs/1805.08180
Course map recap: see the Reinforcement Learning vs. Inverse Reinforcement Learning overview near the beginning of this lecture.
Questions?