FeUdal Networks for Hierarchical Reinforcement Learning
Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, Koray Kavukcuoglu (DeepMind)
The 34th International Conference on Machine Learning (ICML 2017)
• Brief review of FeUdal Networks
• Structures
• Detailed features
• More on FeUdal Networks for HRL
• Training
• Experimental results
Feudal RL (1993)
Reward Hiding
● Managers reward sub-managers for satisfying their commands, not through an external reward
● Managers have absolute control
Information Hiding
● Each level observes the world at its own resolution
● Managers don't know what happens at other levels of the hierarchy
[Figure: a hierarchy of agents; rewards flow between levels and only the lowest level sends actions to the environment]
Dayan, Peter and Hinton, Geoffrey E., "Feudal Reinforcement Learning", NIPS, 1993.
FeUdal Networks (2017)
Manager
● Sets directional goals for the worker
● Rewarded by the environment
● Does not act directly in the environment
Worker
● Operates at a higher temporal resolution
● Rewarded for achieving the manager's goals
● Produces primitive actions in the environment
[Figure: the environment rewards the manager; the manager passes goals and rewards to the worker, which sends actions to the environment]
FeUdal Network:
FeUdal Network: Details
Shared Dense Embedding
● Embedding of the input state
● Used by both the worker and the manager to produce the goal and the action
● CNN:
○ 16 8x8 filters
○ 32 4x4 filters
○ 256-unit fully connected layer
○ ReLU activations
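A minimal sketch of this shared perception module in PyTorch. The strides, the single input channel, and the 84x84 input resolution are assumptions taken from the standard Atari preprocessing, not stated on the slide:

```python
import torch
import torch.nn as nn

class Perception(nn.Module):
    """Shared CNN embedding consumed by both the manager and the worker."""
    def __init__(self, d=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=8, stride=4),   # 16 8x8 filters (stride assumed)
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2),  # 32 4x4 filters (stride assumed)
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, d),                    # 256-dim fully connected layer
            nn.ReLU(),
        )

    def forward(self, x):
        # x: (batch, 1, 84, 84) grayscale frames -> z_t: (batch, 256)
        return self.net(x)
```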
FeUdal Network: Details
Manager: Goal Embedding
● Lower temporal resolution; goals are summed over the last 10 time steps (goals vary smoothly)
● Uses a dilated LSTM
● Goals live in a learned low-dimensional space, not in the environment
● Trained using a transition policy gradient
FeUdal Network: Details
Worker: Action Embedding
● Standard LSTM on the shared embedding
● Embedding matrix U:
○ Rows: actions |a|
○ Columns: embedding dimension k
FeUdal Network: Details
Worker: Goal Embedding
● Compresses the manager's goal to dimension k using a linear transformation φ
● Same dimension as the action embedding
● Linear transformation with no bias
○ Cannot produce a zero vector
○ Cannot ignore the manager's input, so the manager's goal always influences the final policy
FeUdal Network: Details
Worker: Action
● Product of the action embedding matrix U with the goal embedding w
● Produces a distribution over actions
● Action distribution = SoftMax(Uw) (see the sketch below)
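A minimal sketch of the worker head described on the last three slides, assuming d = 256 for the shared embedding, k = 16 for the goal/action embedding, and an 18-action Atari action set; variable names are illustrative:

```python
import torch
import torch.nn as nn

class Worker(nn.Module):
    """Worker head: its LSTM emits the action embedding U, modulated by the manager's goals."""
    def __init__(self, d=256, k=16, num_actions=18):
        super().__init__()
        self.num_actions, self.k = num_actions, k
        self.lstm = nn.LSTMCell(d, num_actions * k)   # hidden state reshaped into U (|a| x k)
        self.phi = nn.Linear(d, k, bias=False)        # goal projection phi, no bias term

    def forward(self, z, pooled_goals, hc):
        # z: shared embedding; pooled_goals: sum of the manager's goals over the last c steps
        h, c = self.lstm(z, hc)
        U = h.view(-1, self.num_actions, self.k)      # action embedding matrix U_t
        w = self.phi(pooled_goals)                    # goal embedding w_t (dim k)
        logits = torch.bmm(U, w.unsqueeze(-1)).squeeze(-1)
        return torch.softmax(logits, dim=-1), (h, c)  # SoftMax(Uw): distribution over actions
```

Because phi has no bias, w is zero only if the pooled goals are zero, so the manager's goal always shapes the worker's policy.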
FeUdal Network: Features
Directional Goals
● The manager sets a direction in latent state space rather than an absolute target state
FeUdal Network: Features
Intrinsic Reward
● Based on the cosine similarity d_cos(α, β) = αᵀβ / (|α||β|) between state changes and the manager's goal direction
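A minimal sketch of the resulting intrinsic reward, assuming a horizon c = 10 and indexable buffers of past latent states and goals (names are illustrative; the per-step formula appears on the worker training slide below):

```python
import torch
import torch.nn.functional as F

def intrinsic_reward(states, goals, t, c=10):
    """r^I_t = (1/c) * sum_{i=1..c} d_cos(s_t - s_{t-i}, g_{t-i})."""
    sims = [
        F.cosine_similarity(states[t] - states[t - i], goals[t - i], dim=-1)
        for i in range(1, c + 1)
    ]
    return torch.stack(sims).mean(dim=0)   # cosine between state change and goal direction
```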
Training
Manager: Transition Policy Gradient
● Actor-critic update: ∇g_t = A^M_t ∇_θ d_cos(s_{t+c} − s_t, g_t(θ))
● Advantage uses the value function from an internal critic: A^M_t = R_t − V^M_t(x_t; θ)
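A minimal sketch of the manager's transition policy gradient written as a loss term, assuming the advantage A^M_t comes from the manager's internal critic and that the state delta is treated as a fixed target (a reading of the paper, not its verbatim implementation):

```python
import torch
import torch.nn.functional as F

def manager_loss(states, goals, manager_advantage, t, c=10):
    """Push the goal g_t toward the direction the latent state actually moved over c steps."""
    direction = (states[t + c] - states[t]).detach()          # s_{t+c} - s_t, treated as constant
    cos = F.cosine_similarity(direction, goals[t], dim=-1)    # d_cos(s_{t+c} - s_t, g_t)
    return -(manager_advantage[t].detach() * cos).mean()      # ascent on A^M_t * d_cos(.)
```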
Training
Worker: Weighted Reward
● Actor-critic update: ∇π_t = A^D_t ∇_θ log π(a_t | x_t; θ)
● Not reward hiding! Use a weighted sum of the intrinsic reward and the environment reward:
A^D_t = (R_t + α R^I_t − V^D_t(x_t; θ))
● The intrinsic reward measures whether the worker follows the direction set by the manager:
r^I_t = (1/c) Σ_{i=1..c} d_cos(s_t − s_{t−i}, g_{t−i})
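A minimal sketch of the worker update as a loss, assuming the environment return, the discounted intrinsic return, and the worker critic's value are computed elsewhere; alpha is the intrinsic-reward weight (a hyperparameter, value not fixed here):

```python
import torch

def worker_loss(log_pi_t, env_return, intrinsic_return, worker_value, alpha):
    """Policy gradient with advantage A^D_t = R_t + alpha * R^I_t - V^D_t."""
    advantage = (env_return + alpha * intrinsic_return - worker_value).detach()
    return -(advantage * log_pi_t).mean()
```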
More Details: Dilated LSTM
● Better able to preserve memories over long periods
● Output is summed over the previous 10 steps
● A specific type of dilated RNN [Chang et al. 2017]
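A minimal sketch of the dilated LSTM, assuming r = 10 LSTM states that share one set of weights, with only one state updated per time step; summing the outputs over the previous 10 steps (as the slide describes) is left to the caller:

```python
import torch
import torch.nn as nn

class DilatedLSTM(nn.Module):
    """r parallel LSTM states sharing one cell; only one state is updated each step."""
    def __init__(self, input_size, hidden_size, r=10):
        super().__init__()
        self.r, self.hidden_size = r, hidden_size
        self.cell = nn.LSTMCell(input_size, hidden_size)

    def init_state(self, batch_size):
        zeros = lambda: torch.zeros(batch_size, self.hidden_size)
        return [(zeros(), zeros()) for _ in range(self.r)]

    def forward(self, x, state, t):
        idx = t % self.r                 # which of the r states is active at step t
        h, c = self.cell(x, state[idx])
        state[idx] = (h, c)
        return h, state                  # caller sums h over the last r steps for the goal
```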
Results: Atari
● Outperforms the LSTM baseline, with the largest gains on games where rewards are delayed
Results: Comparison with Option-Critic
● The option-critic architecture was the only other end-to-end trainable system with sub-policies at the time
● FuN achieves a similar score on Seaquest, doubles option-critic on Ms. Pacman, more than triples it on Zaxxon, and improves by more than 20x on Asterix
Sub-Policy Inspection: Water Maze
● Circular arena with an invisible goal; the agent must find the goal
● In the next episode the agent starts at a random location and must find the goal again
● In the figure, the left two panels show individual episodes; the right panel visualizes the sub-policies
● The agent learns meaningful sub-goals
Ablations: Temporal Resolution
● Removing dilation from the manager's LSTM, or running the manager at the full (worker) temporal resolution, is significantly worse
Ablations: Intrinsic Reward
● Right plot: the worker is trained using only the intrinsic reward
● The environment reward for the worker is not necessary for good performance
Summary
● Directional rather than absolute goals are useful
● The dilated LSTM is crucial for high performance
● Improves long-term credit assignment over baselines
● The manager's goals elicit meaningful low-level behaviors from the worker
Thanks for listening!