DS595/CS525 Reinforcement Learning


  1. This lecture will be recorded!!! Welcome to DS595/CS525 Reinforcement Learning. Prof. Yanhua Li. Time: 6:00pm - 8:50pm R, Zoom lecture, Fall 2020.

  2. No Quiz Today

  3. Project 3 is due today.

  4. Next Thursday: no class. Happy Thanksgiving!

  5. Project 4 is available (starts Thursday 10/29): https://github.com/yingxue-zhang/DS595CS525-RL-Projects/tree/master/Project4
     - Important dates:
       - Project proposal: Thursday 11/12/2020
       - Progress report: Thursday 11/26/2020
       - Final project: team project report due Tuesday 12/8/2020; virtual poster session Thursday 12/10/2020

  6. This Lecture
     - Actor-Critic methods: A2C, A3C, Pathwise Derivative Policy Gradient
     - Advanced RL techniques:
       - Advanced techniques for DQN: multi-step DQN, noisy net DQN, distributional DQN, DQN for continuous action space
       - Sparse reward: reward shaping, curiosity module, curriculum learning, hierarchical RL

  7. Course overview:
     - Single agent
       - Reinforcement Learning
         - Tabular representation of reward: model-based control; model-free control (MC, SARSA, Q-Learning)
         - Function representation of reward: (1) linear value function approximation (MC, SARSA, Q-Learning); (2) value function approximation (Deep Q-Learning, Double DQN, Prioritized DQN, Dueling DQN); (3) policy function approximation (policy gradient, PPO, TRPO); (4) actor-critic methods (A2C, A3C, Pathwise Derivative PG)
         - Advanced topics in RL (sparse rewards)
         - Review of deep learning, as a basis for non-linear function approximation (used in 2-4)
       - Inverse Reinforcement Learning: imitation learning, apprenticeship learning, inverse reinforcement learning
         - Linear reward function learning: MaxEnt IRL, MaxCausalEnt IRL, MaxRelEnt IRL
         - Non-linear reward function learning: generative adversarial imitation learning (GAIL), adversarial inverse reinforcement learning (AIRL)
         - Review of generative adversarial nets, as a basis for non-linear IRL
     - Multiple agents
       - Multi-Agent Reinforcement Learning: multi-agent actor-critic, applications
       - Multi-Agent Inverse Reinforcement Learning: MA-GAIL, MA-AIRL, AMA-GAIL, etc.

  8. This Lecture
     - Actor-Critic methods: A2C, A3C, Pathwise Derivative Policy Gradient
     - Advanced RL techniques:
       - Advanced techniques for DQN: multi-step DQN, noisy net DQN, distributional DQN, DQN for continuous action space
       - Sparse reward: reward shaping, curiosity module, curriculum learning, hierarchical RL
     - Project #4 progress update

  9. Model-free RL Algorithms
     - Value-based (learned value function)
     - Policy-based (learned policy function)
     - Actor-Critic (learns both value and policy functions)

  10. Model-free RL Algorithms
      - Value-based (learned value function): Deep Q-Learning (DQN), Double DQN, Dueling DQN, Prioritized DQN
      - Policy-based (learned policy function): basic policy gradient algorithm, REINFORCE, Vanilla, PPO, TRPO, PPO2
      - Actor-Critic (learned both value and policy functions): A2C, A3C, Pathwise Derivative Policy Gradient

  11. Basic Policy Gradient Algorithm: alternate between data collection (rolling out the current policy) and a model update; each batch of collected data is only used once, and the resulting policy gradient is an unbiased estimator (a minimal sketch follows).
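
     The following is a minimal REINFORCE-style sketch of that loop, assuming a small discrete-action policy network and a Gym-style environment; the network sizes, hyperparameters, and helper names are illustrative, not from the slides.

```python
import torch
import torch.nn as nn

# Hypothetical policy network for a 4-dim observation and 2 discrete actions.
policy_net = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-3)

def collect_episode(env, gamma=0.99):
    """Roll out the current policy once; return log-probabilities and discounted returns."""
    log_probs, rewards = [], []
    obs, _ = env.reset()
    done = False
    while not done:
        logits = policy_net(torch.as_tensor(obs, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        obs, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(reward)
        done = terminated or truncated
    returns, g = [], 0.0
    for r in reversed(rewards):              # discounted return G_t for every step
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return torch.stack(log_probs), torch.as_tensor(returns, dtype=torch.float32)

def policy_gradient_update(env):
    """One update; the freshly collected episode is used exactly once and then discarded."""
    log_probs, returns = collect_episode(env)
    loss = -(log_probs * returns).mean()     # gradient of this loss is an unbiased PG estimate
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```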

  12. Exploration strategies: epsilon greedy and Boltzmann exploration (sketched below).
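
     A small sketch of the two exploration rules named on the slide, assuming the Q-values for one state are given as an array; the function names and the temperature value are illustrative.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon take a uniformly random action, otherwise the greedy one."""
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))
    return int(np.argmax(q_values))

def boltzmann(q_values, temperature=1.0):
    """Sample an action with probability proportional to exp(Q / temperature)."""
    logits = np.asarray(q_values, dtype=float) / temperature
    probs = np.exp(logits - logits.max())    # subtract the max for numerical stability
    probs /= probs.sum()
    return int(np.random.choice(len(q_values), p=probs))
```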

  13. Advantage Actor-Critic (A2C): combine value function approximation (the critic) with the policy gradient (the actor); a sketch of the update follows.
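
     A minimal single-transition A2C sketch, assuming separate actor and critic networks over a 4-dim observation and 2 discrete actions; the architecture, learning rate, and the use of the one-step TD error as the advantage are illustrative choices, not prescribed by the slides.

```python
import torch
import torch.nn as nn

actor = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))    # policy network
critic = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))   # value network
opt = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=7e-4)

def a2c_update(obs, action, reward, next_obs, done, gamma=0.99):
    """One A2C update from a single transition (batching omitted for clarity)."""
    obs = torch.as_tensor(obs, dtype=torch.float32)
    next_obs = torch.as_tensor(next_obs, dtype=torch.float32)
    value = critic(obs).squeeze(-1)
    with torch.no_grad():
        target = reward + gamma * (1.0 - float(done)) * critic(next_obs).squeeze(-1)
    advantage = target - value                       # TD error as the advantage estimate
    critic_loss = advantage.pow(2)                   # critic regresses toward the TD target
    dist = torch.distributions.Categorical(logits=actor(obs))
    actor_loss = -dist.log_prob(torch.as_tensor(action)) * advantage.detach()
    loss = actor_loss + critic_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
```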

  14. Asynchronous Advantage Actor-Critic (A3C): from A2C to A3C, multiple parallel workers collect experience and asynchronously update shared actor and critic parameters.

  15. Pathwise Derivative Policy Gradient. References: David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, Martin Riedmiller, "Deterministic Policy Gradient Algorithms", ICML 2014; Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, Daan Wierstra, "Continuous Control with Deep Reinforcement Learning", ICLR 2016.

  16. Replace the ε-greedy policy with a π (actor) network: the actor is trained to output the action that maximizes the learned Q-function (a sketch follows).
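
     A sketch of the pathwise derivative (DDPG-style) actor update, assuming a deterministic actor and a Q-network for a 3-dim state and a 1-dim continuous action; the shapes and learning rate are illustrative, and target networks and the replay buffer are omitted.

```python
import torch
import torch.nn as nn

actor = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 1), nn.Tanh())  # a = pi(s)
q_net = nn.Sequential(nn.Linear(3 + 1, 64), nn.ReLU(), nn.Linear(64, 1))         # Q(s, a)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)

def pathwise_actor_update(obs_batch):
    """Push the actor's output action toward higher Q(s, a) by backpropagating through Q."""
    obs = torch.as_tensor(obs_batch, dtype=torch.float32)
    actions = actor(obs)                                   # differentiable, replaces epsilon-greedy
    q_values = q_net(torch.cat([obs, actions], dim=-1))    # gradient flows through a into the actor
    actor_loss = -q_values.mean()                          # gradient ascent on Q
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```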

  17. Model-free RL Algorithms
      - Value-based (learned value function): Deep Q-Learning (DQN), Double DQN, Dueling DQN, Prioritized DQN
      - Policy-based (learned policy function): basic policy gradient algorithm, REINFORCE, Vanilla, PPO, TRPO, PPO2
      - Actor-Critic (learned both value and policy functions): A2C, A3C, Pathwise Derivative Policy Gradient

  18. This Lecture
      - Actor-Critic methods: A2C, A3C, Pathwise Derivative Policy Gradient
      - Advanced RL techniques:
        - Advanced techniques for DQN: multi-step DQN, noisy net DQN, distributional DQN, DQN for continuous action space
        - Sparse reward: reward shaping, curiosity module, curriculum learning, hierarchical RL

  19. Multi-step DQN: store multi-step transitions in the experience buffer and bootstrap only after N steps, balancing between MC and TD (the N-step target is sketched below).
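
     A sketch of the N-step target this balances on, assuming the buffer already returns the N consecutive rewards and the target network's max Q-value at the N-th next state; all names here are illustrative.

```python
def n_step_target(rewards, q_next_max, done, gamma=0.99):
    """N-step DQN target: N discounted rewards, then a bootstrap with max_a Q_target(s_{t+N}, a).

    rewards    : the N rewards r_t, ..., r_{t+N-1} stored in the experience buffer
    q_next_max : max_a Q_target(s_{t+N}, a) from the target network
    done       : True if the episode ended within these N steps (no bootstrap then)
    """
    target = 0.0
    for k, r in enumerate(rewards):
        target += (gamma ** k) * r
    if not done:
        target += (gamma ** len(rewards)) * q_next_max
    return target

# Example: 3-step target with rewards [1, 0, 1] and a bootstrapped value of 2.5.
print(n_step_target([1.0, 0.0, 1.0], q_next_max=2.5, done=False))
```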

  20. Noisy Net for DQN (https://arxiv.org/abs/1706.01905, https://arxiv.org/abs/1706.10295)
      - Noise on the action (epsilon greedy)
      - Noise on the parameters: inject noise into the parameters of the Q-function at the beginning of each episode; the noise does NOT change within an episode (sketched below).
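
      A simplified sketch of the parameter-noise idea, assuming a plain Q-network that is copied and perturbed with Gaussian noise of a fixed scale once per episode; the cited noisy-net papers learn the noise parameters instead, so treat the fixed `sigma` and all names as illustrative.

```python
import copy
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))   # hypothetical Q-network

def perturb_parameters(net, sigma=0.1):
    """Copy the network and add Gaussian noise to every parameter.

    Called once at the beginning of an episode; the same perturbed copy is used for
    every action in that episode, so the noise does not change within the episode.
    """
    noisy = copy.deepcopy(net)
    with torch.no_grad():
        for p in noisy.parameters():
            p.add_(sigma * torch.randn_like(p))
    return noisy

noisy_q = perturb_parameters(q_net)          # start of episode
obs = torch.zeros(4)                         # placeholder observation
action = int(noisy_q(obs).argmax())          # act greedily w.r.t. the perturbed Q-function
```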

  21. Noisy Net: noise on the action gives random exploration, while noise on the parameters gives systematic exploration (within an episode, the same state always leads to the same action under the perturbed Q-function).

  22. Demo: https://blog.openai.com/better-exploration-with-parameter-noise/ (which one is action noise vs. parameter noise?)

  23. Distributional Q-function: a Q-value is only the expectation of the return; different return distributions (the slide plots two over the range -10 to 10) can have the same expected value.

  24. Distributional Q-function: instead of a network with 3 outputs (one Q-value per action of state s), use a network with 15 outputs, i.e. a distribution over 5 bins for each of the 3 actions (sketched below).
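
      A sketch of the 15-output head from the slide's example, assuming 3 actions with 5 return bins each over a fixed support; the support range and network sizes are illustrative.

```python
import torch
import torch.nn as nn

N_ACTIONS, N_BINS = 3, 5                           # sizes from the slide's example

net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS * N_BINS))
support = torch.linspace(-10.0, 10.0, N_BINS)      # fixed return bins (range is illustrative)

def q_distribution(obs):
    """Per-action probabilities over the return bins, shape (N_ACTIONS, N_BINS)."""
    logits = net(torch.as_tensor(obs, dtype=torch.float32)).view(N_ACTIONS, N_BINS)
    return torch.softmax(logits, dim=-1)

def act(obs):
    """Pick the action whose predicted return distribution has the largest expectation."""
    probs = q_distribution(obs)
    q_values = (probs * support).sum(dim=-1)       # expected return per action
    return int(q_values.argmax())
```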

  25. Demo https://youtu.be/yFBwyPuO2Vg

  26. Rainbow https://arxiv.org/abs/1710.02298

  27. Continuous Actions: pros and cons of Q-learning here? Solution 1: sample a set of actions and see which action obtains the largest Q value. Solution 2: use gradient ascent over the action to solve the optimization problem (sketched below).
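
      A sketch of Solution 2 for a 1-dim continuous action, assuming a trained Q-network over the concatenated (state, action); the step count, learning rate, and network are illustrative, and a real agent would also clip the action to its valid range.

```python
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(3 + 1, 64), nn.ReLU(), nn.Linear(64, 1))   # hypothetical Q(s, a)

def argmax_action_by_gradient_ascent(obs, steps=50, lr=0.1):
    """Solution 2: treat the action as a free variable and run gradient ascent on Q(s, a)."""
    obs = torch.as_tensor(obs, dtype=torch.float32)
    action = torch.zeros(1, requires_grad=True)           # initial guess for the action
    for _ in range(steps):
        q = q_net(torch.cat([obs, action])).squeeze()     # scalar Q(s, a)
        grad, = torch.autograd.grad(q, action)            # dQ/da, leaving q_net's grads untouched
        with torch.no_grad():
            action += lr * grad                           # ascend on Q
    return action.detach()
```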

  28. Continuous Actions, Solution 3: design a network that makes the optimization easy. From state s the network outputs a vector μ(s), a matrix Σ(s), and a scalar V(s), so that Q(s, a) = -(a - μ(s))^T Σ(s) (a - μ(s)) + V(s) and the maximizing action is simply a = μ(s) (sketched below).
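
      A sketch of such a network under the quadratic form above; the sizes, the way the positive-definite matrix is built, and all names are illustrative assumptions, not the slide's exact architecture.

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 3, 2                              # illustrative sizes

class EasyToMaximizeQ(nn.Module):
    """From state s: a vector mu(s), a matrix Sigma(s), and a scalar V(s)."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU())
        self.mu_head = nn.Linear(64, ACTION_DIM)                  # vector
        self.l_head = nn.Linear(64, ACTION_DIM * ACTION_DIM)      # raw entries of the matrix
        self.v_head = nn.Linear(64, 1)                            # scalar

    def forward(self, s, a):
        h = self.body(s)
        mu = self.mu_head(h)
        l = self.l_head(h).view(ACTION_DIM, ACTION_DIM)
        sigma = l @ l.t() + 1e-3 * torch.eye(ACTION_DIM)          # positive definite Sigma(s)
        v = self.v_head(h).squeeze(-1)
        diff = a - mu
        # Q(s, a) = -(a - mu)^T Sigma (a - mu) + V(s): maximized exactly at a = mu(s).
        return -(diff @ sigma @ diff) + v, mu

q = EasyToMaximizeQ()
s = torch.zeros(STATE_DIM)
q_value, best_action = q(s, torch.zeros(ACTION_DIM))              # argmax_a Q(s, a) is mu(s)
```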

  29. Continuous Actions, Solution 4: don't use Q-learning. The slide's diagram contrasts policy-based methods (learning an actor) with value-based methods (learning a critic), with actor + critic methods combining the two. Video: https://www.youtube.com/watch?v=ZhsEKTo7V04

  30. This Lecture
      - Actor-Critic methods: A2C, A3C, Pathwise Derivative Policy Gradient
      - Advanced RL techniques:
        - Advanced techniques for DQN: multi-step DQN, noisy net DQN, distributional DQN, DQN for continuous action space
        - Sparse reward: reward shaping, curiosity module, curriculum learning, hierarchical RL
      - Project #4 progress update

  31. Sparse Reward: Reward Shaping

  32. Reward Shaping

  33. Reward Shaping (VizDoom): https://openreview.net/forum?id=Hk3mPK5gg&noteId=Hk3mPK5gg

  34. Reward Shaping: give the agent a reward whenever it gets closer to the goal; designing such shaped rewards needs domain knowledge (sketched below). https://openreview.net/pdf?id=Hk3mPK5gg
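
      A toy sketch of this kind of shaping, assuming the agent's position and the goal are known coordinates; the bonus form and its weight are exactly the kind of domain knowledge the slide says you must supply.

```python
import numpy as np

def shaped_reward(env_reward, position, prev_position, goal, weight=0.1):
    """Add a hand-crafted bonus when the agent moves closer to the goal."""
    prev_dist = np.linalg.norm(np.asarray(goal) - np.asarray(prev_position))
    dist = np.linalg.norm(np.asarray(goal) - np.asarray(position))
    bonus = weight * (prev_dist - dist)      # positive only when the agent got closer
    return env_reward + bonus

# Example: the sparse environment reward is 0, but the agent moved from distance 5 to 4.
print(shaped_reward(0.0, position=[1.0, 0.0], prev_position=[0.0, 0.0], goal=[5.0, 0.0]))
```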

  35. Curiosity (https://arxiv.org/abs/1705.05363): in the rollout loop (env → actor → env → actor → …), an ICM (intrinsic curiosity module) produces an intrinsic reward at every step in addition to the environment reward, and the actor is updated to maximize both.

  36. Intrinsic Curiosity Module: Network 1 takes the current state and action and predicts the next state; the diff between the predicted and the actual next state is the intrinsic reward, which encourages exploration of hard-to-predict states.

  37. Intrinsic Curiosity Module (cont.): the catch is that some states are hard to predict but not important (trivial events), so raw prediction error alone would over-reward them.

  38. Intrinsic Curiosity Module: add a learned feature extractor. Network 1 predicts the next state's features from the current features and the action, and the diff in feature space is the intrinsic reward; Network 2 predicts the action from the features of two consecutive states, which pushes the feature extractor to keep only what the agent can influence and to drop trivial events (sketched below).
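
      A simplified sketch of this two-network module (after https://arxiv.org/abs/1705.05363), assuming small fully connected networks, a one-hot action, and MSE / cross-entropy losses; every size and name is illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_DIM, N_ACTIONS, FEAT_DIM = 4, 2, 16                 # illustrative sizes

feature_ext = nn.Sequential(nn.Linear(STATE_DIM, FEAT_DIM), nn.ReLU())   # Feature Ext
network_1 = nn.Linear(FEAT_DIM + N_ACTIONS, FEAT_DIM)     # forward model: predicts next features
network_2 = nn.Linear(2 * FEAT_DIM, N_ACTIONS)            # inverse model: predicts the action

def icm_step(s, a_onehot, s_next):
    """Return (intrinsic_reward, icm_loss) for one transition."""
    phi, phi_next = feature_ext(s), feature_ext(s_next)
    # Network 1: the prediction error ("diff") in feature space is the intrinsic reward.
    phi_pred = network_1(torch.cat([phi, a_onehot], dim=-1))
    intrinsic_reward = F.mse_loss(phi_pred, phi_next.detach())
    # Network 2: predicting the action from consecutive features forces the feature
    # extractor to keep only action-relevant information and to drop trivial events.
    action_logits = network_2(torch.cat([phi, phi_next], dim=-1))
    inverse_loss = F.cross_entropy(action_logits.unsqueeze(0), a_onehot.argmax().unsqueeze(0))
    return intrinsic_reward.detach(), intrinsic_reward + inverse_loss
```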

  39. Sparse Reward: Curriculum Learning

  40. Curriculum Learning: start from simple training examples and make them harder and harder (example: VizDoom).

  41. Sparse Reward: Hierarchical Reinforcement Learning

  42. https://arxiv.org/abs/1805.08180

  43. Course overview recap (same map as slide 7): single-agent RL (tabular and function-approximation methods, actor-critic, advanced topics with sparse rewards), inverse RL (linear reward learning: MaxEnt, MaxCausalEnt, MaxRelEnt IRL; non-linear reward learning: GAIL, AIRL), and the multi-agent extensions (multi-agent actor-critic, MA-GAIL, MA-AIRL, AMA-GAIL).

  44. Questions?
