Using State Predictions for Value Regularization in Curiosity Driven Deep Reinforcement Learning
Oliver Richter, Manuel Fritsche, Gino Brunner, Roger Wattenhofer
ETH Zurich, Distributed Computing, www.disco.ethz.ch
Base actions on predictions
Reinforcement learning: Agent ↔ Environment
How to choose the action?
Return value
Value function
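For reference, the standard definitions behind these two slides (not taken from the deck, common RL notation assumed): the return is the discounted sum of future rewards, and the value function is its expectation under the policy.

```latex
% Discounted return from time step t (gamma = discount factor)
G_t = \sum_{k=0}^{\infty} \gamma^{k} \, r_{t+k+1}

% Value function: expected return when starting in state s and following policy pi
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[ G_t \mid s_t = s \right]
```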
Sparse reward settings: Agent ↔ Environment, but the reward signal is rarely observed
Reward the exploration of novel states
How to find novel states? Make predictions, get surprised when they are wrong.
Curiosity: prediction vs. reality
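A minimal sketch (not from the slides) of how this "surprise" is typically turned into an intrinsic reward: the agent is paid in proportion to its forward-model prediction error. The function name and the scale factor are illustrative.

```python
import numpy as np

def intrinsic_reward(predicted_next_features, next_features, scale=0.5):
    """Curiosity bonus: reward the agent in proportion to how badly its
    forward model predicted the features of the state it actually reached."""
    prediction_error = np.sum((predicted_next_features - next_features) ** 2)
    return scale * prediction_error

# Example: a poorly predicted transition yields a larger exploration bonus.
bonus = intrinsic_reward(np.zeros(4), np.array([0.1, 0.0, 0.9, 0.2]))
```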
Asynchronous Advantage Actor-Critic (A3C) architecture: Feature Extractor → A3C Network
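Not on the slide, but for completeness, the standard losses optimized by the A3C Network box (in the style of Mnih et al., 2016; notation assumed):

```latex
% Advantage estimate from an n-step bootstrapped return R_t
A_t = R_t - V(s_t; \theta_v)

% Policy (actor) loss with an entropy bonus H to encourage exploration
L_{\pi} = -\log \pi(a_t \mid s_t; \theta)\, A_t \;-\; \beta\, H\big(\pi(\cdot \mid s_t; \theta)\big)

% Value (critic) loss
L_V = \tfrac{1}{2}\big(R_t - V(s_t; \theta_v)\big)^2
```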
Adding curiosity: Feature Extractor 1 → A3C Network; Feature Extractor 2 → Forward Model
Learning good features: Feature Extractor 1 → A3C Network; Feature Extractor 2 → Forward Model and Inverse Model (Pathak et al., ICML 2017: A3C + ICM)
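The slide names the ICM components but not their objectives. Roughly, following Pathak et al. (ICML 2017), the two models are trained as below (the weighting between the terms is omitted here):

```latex
% Forward model f: predict next-state features from current features and action
L_{F} = \tfrac{1}{2}\,\big\| \hat{\phi}(s_{t+1}) - \phi(s_{t+1}) \big\|_2^2,
\qquad \hat{\phi}(s_{t+1}) = f\big(\phi(s_t), a_t\big)

% Inverse model g: predict the taken action from consecutive feature encodings,
% which pushes the features phi to encode only what the agent can influence
L_{I} = \mathrm{CE}\big(\hat{a}_t, a_t\big),
\qquad \hat{a}_t = g\big(\phi(s_t), \phi(s_{t+1})\big)
```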
Good features for all: one Feature Extractor → A3C Network, Forward Model and Inverse Model (A3C + Pred)
Adding value prediction: Feature Extractor → A3C Network; Forward Model output → A3C Network (predicted value); Inverse Model (A3C + Pred + VPC)
Value Prediction Consistency
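The slides do not spell out the consistency loss. One plausible reading, given the architecture slide above (forward-model output fed back into the A3C network), is a regularizer that ties the value of the current state to a one-step backup through the value of the predicted next state. A hypothetical PyTorch sketch under that assumption, with all names invented for illustration and no claim to match the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def value_prediction_consistency_loss(value_net, phi_t, phi_next_pred,
                                      reward, gamma=0.99):
    """Hypothetical VPC regularizer (sketch, not the paper's exact loss):
    the value of the current state should be consistent with a one-step
    backup through the value of the *predicted* next-state features.

    value_net:      critic head mapping features to a scalar value (assumed)
    phi_t:          features of the current state s_t (assumed)
    phi_next_pred:  forward-model prediction of the next state's features
    reward:         observed extrinsic + intrinsic reward for the transition
    """
    v_t = value_net(phi_t)                  # V(phi(s_t))
    v_next_pred = value_net(phi_next_pred)  # V(forward_model(phi(s_t), a_t))
    target = reward + gamma * v_next_pred.detach()
    return F.mse_loss(v_t, target)
```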
Let’s see how it works in practice
[Plot: Rewards per episode]
Thinking bigger
[Plot: Rewards per episode]
Doom environment
Doom Setup
[Plot: Rewards per episode]
Questions & Answers?