FeUdal Networks for Hierarchical Reinforcement Learning
Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, Koray Kavukcuoglu: DeepMind The 34th International Conference on Machine Learning (ICML 2017)
Reward Hiding: sub-managers are rewarded for satisfying their managers' commands, not through an external reward
Information Hiding: information is hidden between levels of the hierarchy; each level only observes the world at the granularity of its own decisions
[Diagram: a feudal hierarchy of agents; each agent passes rewards and commands to the agent below, and the lowest agent takes actions in the environment]
Dayan, Peter and Hinton, Geoffrey E., "Feudal Reinforcement Learning", NIPS, 1993.
[Diagram: FuN architecture. The Manager sends goals and intrinsic rewards to the Worker; the Worker takes actions in the environment; rewards flow from the environment back to the Manager and Worker]
Shared Dense Embedding
▪ A shared convolutional network embeds each observation; both the Manager and the Worker use this embedding to produce goal and action (sketch below)
○ 16 8x8 filters ○ 32 4x4 filters ○ 256 fully connected ○ ReLU
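A minimal PyTorch sketch of this shared perception module. The strides (4 and 2) and the LazyLinear sizing are assumptions; the slide specifies only the filter counts, kernel sizes, and the 256-unit layer:

```python
import torch
import torch.nn as nn

class Perception(nn.Module):
    """Shared dense embedding: one convnet feeds both Manager and Worker."""
    def __init__(self, in_channels=4, d=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=8, stride=4),  # 16 8x8 filters
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2),           # 32 4x4 filters
            nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(d),                                     # 256 fully connected
            nn.ReLU(),
        )

    def forward(self, obs):
        return self.net(obs)  # z_t: used to produce both goal and action
```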
Manager: Goal embedding
▪ The Manager maps the shared embedding to a latent state s_t and outputs a goal g_t, normalized to unit length (sketch below)
▪ The Worker conditions on goals summed over the last 10 time steps (goals vary smoothly)
▪ Goals are directions in the Manager's latent state space, not states of the environment
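A sketch of the Manager's goal pathway. Modeling f_mspace as a single ReLU layer is an assumption, and a plain LSTMCell stands in for the dilated LSTM described later:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Manager(nn.Module):
    """Sketch: emit a unit-norm goal g_t and pool the last c goals for the Worker."""
    def __init__(self, d=256, c=10):
        super().__init__()
        self.f_mspace = nn.Linear(d, d)   # z_t -> latent state s_t
        self.rnn = nn.LSTMCell(d, d)      # stand-in for the dilated LSTM
        self.c = c

    def forward(self, z, hidden, goal_history):
        s = F.relu(self.f_mspace(z))          # latent state s_t
        h, cx = self.rnn(s, hidden)
        g = F.normalize(h, dim=-1)            # unit-length goal direction g_t
        goal_history.append(g.detach())       # detach: the Worker's gradients
        pooled = torch.stack(goal_history[-self.c:]).sum(0)  # never reach the goals
        return g, pooled, (h, cx), s
```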
Worker: Action Embedding
▪ The Worker's recurrent network outputs an action-embedding matrix U
○ Rows: actions [a] ○ Columns: embedding dimension [k]
Goal embedding: Worker
▪ The pooled goals are mapped to a k-dimensional embedding w using a linear transformation φ with no biases
○ φ can never produce a constant non-zero vector ○ It can't ignore the Manager's input, so the Manager's goal will influence the final policy
Action: Worker
▪ The policy is the product of the action-embedding matrix U with the goal embedding w
▪ π = SoftMax(Uw) gives a probability distribution over actions (sketch below)
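Putting the Worker slides together, a hedged sketch. k = 16 follows the paper's reported setting; num_actions = 18 is a placeholder for the Atari action set:

```python
import torch
import torch.nn as nn

class Worker(nn.Module):
    """Sketch: turn pooled goals into an action distribution pi = SoftMax(U w)."""
    def __init__(self, d=256, k=16, num_actions=18):
        super().__init__()
        self.rnn = nn.LSTMCell(d, num_actions * k)  # produces U, flattened
        self.phi = nn.Linear(d, k, bias=False)      # no biases: can't emit a
        self.num_actions, self.k = num_actions, k   # constant non-zero vector

    def forward(self, z, pooled_goals, hidden):
        h, c = self.rnn(z, hidden)
        U = h.view(-1, self.num_actions, self.k)    # rows: actions, cols: k
        w = self.phi(pooled_goals)                  # goal embedding w_t
        logits = torch.einsum('bak,bk->ba', U, w)   # U w
        return torch.softmax(logits, dim=-1), (h, c)
```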
Directional Goal
▪ The goal $g_t$ specifies a direction in the Manager's latent state space
▪ Intrinsic reward: the Worker is rewarded for moving in that direction (sketch below):
$r^I_t = \frac{1}{c}\sum_{i=1}^{c} d_{\cos}\!\left(s_t - s_{t-i},\, g_{t-i}\right)$, where $d_{\cos}(\alpha, \beta) = \alpha^{\top}\beta / (|\alpha|\,|\beta|)$ is the cosine similarity
▪ Not reward-hiding! The Worker is trained on the environment reward plus the intrinsic reward
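A direct transcription of the intrinsic reward, assuming `states` and `goals` are time-indexed lists of [batch, d] tensors and t ≥ c:

```python
import torch
import torch.nn.functional as F

def intrinsic_reward(states, goals, t, c=10):
    """r^I_t = (1/c) * sum_{i=1..c} d_cos(s_t - s_{t-i}, g_{t-i})."""
    total = torch.zeros(states[t].shape[0])
    for i in range(1, c + 1):
        # cosine similarity between the realised displacement and the old goal
        total += F.cosine_similarity(states[t] - states[t - i], goals[t - i], dim=-1)
    return total / c
```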
Actor-Critic:
▪ Value functions come from an internal critic
▪ The Worker's policy gradient uses a mixed advantage:
$\nabla \pi_t = A^{D}_{t}\, \nabla_{\theta} \log \pi(a_t \mid x_t; \theta)$, with $A^{D}_{t} = R_t + \alpha R^{I}_{t} - V^{D}_{t}(x_t; \theta)$
▪ The Manager is trained to emit an advantageous direction (transition policy gradient; sketch of both updates below):
$\nabla g_t = A^{M}_{t}\, \nabla_{\theta}\, d_{\cos}\!\left(s_{t+c} - s_t,\, g_t(\theta)\right)$, with $A^{M}_{t} = R_t - V^{M}_{t}(x_t; \theta)$
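A sketch of both updates, assuming the returns R_t, R^I_t and the critics' value estimates are computed elsewhere (e.g. with n-step returns); the default α is a placeholder hyperparameter:

```python
import torch
import torch.nn.functional as F

def worker_loss(log_pi_a, R, r_int, v_worker, alpha=0.5):
    """Worker: A^D_t = R_t + alpha*R^I_t - V^D_t, then a REINFORCE-style update.
    The advantage is detached so gradients flow only through log pi."""
    adv = (R + alpha * r_int - v_worker).detach()
    return -(adv * log_pi_a).mean()

def manager_loss(s_t, s_tc, g, R, v_manager):
    """Manager transition policy gradient:
    grad g_t = A^M_t * grad d_cos(s_{t+c} - s_t, g_t).
    The critics themselves are trained separately with value regression."""
    adv = (R - v_manager).detach()
    cos = F.cosine_similarity((s_tc - s_t).detach(), g, dim=-1)
    return -(adv * cos).mean()
```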
Dilated RNN [Chang et al. 2017]:
▪ The Manager's LSTM is dilated: its state is split into r groups and only one group is updated per time step, so it preserves memories over long horizons while still seeing every input (sketch below)
▪ This low temporal resolution helps the Manager learn from long-delayed rewards
▪ Analysis: the goals the Manager emits line up with distinct sub-policies at that time
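A sketch of the dilation trick, assuming r = 10 groups (the paper's dilation radius); mean-pooling over group outputs is an assumption:

```python
import torch
import torch.nn as nn

class DilatedLSTM(nn.Module):
    """Sketch: r independent state groups; at step t only group t % r is
    advanced, so each group sees one input every r steps."""
    def __init__(self, input_size, hidden_size, r=10):
        super().__init__()
        self.cell = nn.LSTMCell(input_size, hidden_size)
        self.r = r

    def init_state(self, batch):
        h = self.cell.hidden_size
        return [(torch.zeros(batch, h), torch.zeros(batch, h)) for _ in range(self.r)]

    def forward(self, x, states, t):
        idx = t % self.r                        # pick this step's group
        states[idx] = self.cell(x, states[idx])
        # pool over all groups so recent and old memories both contribute
        out = torch.stack([h for h, _ in states]).mean(0)
        return out, states
```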
Results:
▪ FuN triples the baseline's score on Zaxxon and gets more than a 20x improvement on Asterix
▪ Ablation: a Manager with a standard (non-dilated) LSTM is significantly worse