  1. FeUdal Networks for Hierarchical Reinforcement Learning. Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, Koray Kavukcuoglu (DeepMind). The 34th International Conference on Machine Learning (ICML 2017)

  2. Outline
     • Brief review of FeUdal Networks
     • Structure
     • Detailed features
     • More on FeUdal Networks for HRL
     • Training
     • Experiment results

  3. Feudal RL (1993)
     Reward Hiding:
     ● Managers reward sub-managers for satisfying their commands, not through an external reward
     ● Managers have absolute control
     Information Hiding:
     ● Agents observe the world at different resolutions
     ● Managers don't know what happens at other levels of the hierarchy
     (Figure: a hierarchy of agents; rewards flow down the hierarchy and the lowest level sends actions to the environment)
     Dayan, Peter and Hinton, Geoffrey E., "Feudal Reinforcement Learning", NIPS, 1993.

  4. FeUdal Networks (2017)
     Manager:
     ● Sets directional goals for the worker
     ● Rewarded by the environment
     ● Does not act directly in the environment
     Worker:
     ● Higher temporal resolution
     ● Rewarded for achieving the manager's goals
     ● Produces primitive actions in the environment
     (Figure: the environment rewards the manager; the manager passes goals and rewards to the worker, which acts in the environment)

  5. FeUdal Network (architecture figure)

  6. FeUdal Network: Details
     Shared Dense Embedding
     ● Embedding of the input state
     ● Used by both worker and manager to produce the goal and the action
     ● CNN:
       ○ 16 8x8 filters
       ○ 32 4x4 filters
       ○ 256-unit fully connected layer
       ○ ReLU activations
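A minimal sketch of this shared perceptual module. The filter counts and sizes follow the slide; the convolution strides (4 and 2) and the 84x84 single-channel input are assumptions taken from the standard Atari setup, which the slide does not specify.

```python
import torch
import torch.nn as nn

class SharedEmbedding(nn.Module):
    """Shared perceptual module z = f(x) used by both manager and worker.

    16 8x8 filters, 32 4x4 filters, 256 FC units and ReLU follow the slide;
    strides and input resolution are assumptions (standard Atari preprocessing).
    """
    def __init__(self, in_channels=1, embedding_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, embedding_dim), nn.ReLU(),  # 9x9 spatial map for 84x84 input
        )

    def forward(self, x):      # x: (batch, in_channels, 84, 84)
        return self.net(x)     # z: (batch, 256)
```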

  7. FeUdal Network: Details
     Manager: Goal embedding
     ● Lower temporal resolution; goals are summed over the last 10 time steps, so they vary smoothly
     ● Uses a dilated LSTM
     ● Goals are set in a low-dimensional latent state space, not in the environment's observation space
     ● Trained with a transition policy gradient
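For reference, a sketch of the manager's forward pass as described in the ICML paper (z_t is the shared embedding, s_t the manager's latent state; the goal is normalized so that it is purely directional):

\[
s_t = f^{\mathrm{Mspace}}(z_t), \qquad
(h^{M}_{t},\, \hat{g}_t) = f^{\mathrm{Mrnn}}(s_t,\, h^{M}_{t-1}), \qquad
g_t = \hat{g}_t \,/\, \lVert \hat{g}_t \rVert
\]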

  8. FeUdal Network: Details
     Worker: Action embedding
     ● Standard LSTM on top of the shared embedding
     ● Produces an embedding matrix U:
       ○ Rows: actions [a]
       ○ Columns: embedding dimension [k]

  9. FeUdal Network: Details
     Worker: Goal embedding
     ● The manager's goal is compressed to dimension k by a linear transformation φ
     ● Same dimension as the action embedding
     ● The linear transformation has no bias:
       ○ It can never produce a constant non-zero vector
       ○ So the worker cannot learn to ignore the manager's input, and the manager's goal always influences the final policy

  10. FeUdal Network: Details
      Worker: Action
      ● Product of the action embedding matrix U with the goal embedding w
      ● Produces a distribution over actions
      ● Policy: π = softmax(U w)
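A sketch of the worker's policy head covering slides 8 to 10. The wiring follows the slides (worker LSTM emits U, a bias-free linear map φ projects the pooled manager goal to dimension k, and the policy is softmax(Uw)); the default sizes d, k and num_actions are illustrative placeholders, not values taken from the slides.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WorkerHead(nn.Module):
    """Worker policy head: pi = softmax(U @ w) (slides 8-10).

    z_dim, d, k, num_actions are hypothetical defaults; sizing the LSTM
    hidden state as num_actions * k is a simplification for this sketch.
    """
    def __init__(self, z_dim=256, d=256, k=16, num_actions=18):
        super().__init__()
        self.k, self.num_actions = k, num_actions
        self.lstm = nn.LSTMCell(z_dim, num_actions * k)  # worker LSTM on the shared embedding
        self.phi = nn.Linear(d, k, bias=False)           # no bias: cannot emit a constant non-zero w

    def forward(self, z, goal_sum, state):
        h, c = self.lstm(z, state)                          # h: (batch, num_actions * k)
        U = h.view(-1, self.num_actions, self.k)            # action embedding matrix, rows = actions
        w = self.phi(goal_sum)                              # (batch, k) projected, pooled manager goal
        logits = torch.bmm(U, w.unsqueeze(-1)).squeeze(-1)  # (batch, num_actions)
        pi = F.softmax(logits, dim=-1)                      # distribution over primitive actions
        return pi, (h, c)
```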

  11. FeUdal Network: Features
      Directional goal

  12. FeUdal Network: Features

  13. FeUdal Network: Features
      ▪ Intrinsic reward: the worker is rewarded for moving in the direction of the manager's goal, measured by cosine similarity
        $d_{\cos}(\alpha, \beta) = \alpha^{\top}\beta \,/\, (\lVert\alpha\rVert\,\lVert\beta\rVert)$

  14. Training
      Manager: Transition Policy Gradient
      ● Actor-critic; the value function is estimated by an internal critic (update rule sketched below)
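For completeness, a sketch of the manager's transition policy gradient as stated in the FuN paper (c is the manager's horizon, and the advantage $A^M_t$ comes from the manager's internal critic):

\[
\nabla g_t = A^M_t \,\nabla_\theta\, d_{\cos}\!\big(s_{t+c} - s_t,\; g_t(\theta)\big),
\qquad
A^M_t = R_t - V^M_t(x_t, \theta)
\]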

  15. Training
      Worker: Weighted reward
      Actor-Critic:
      $\nabla \pi_t = A^D_t \,\nabla_\theta \log \pi(a_t \mid x_t; \theta)$
      Not reward-hiding!
      ● Use a weighted sum of the intrinsic reward and the environment reward:
        $A^D_t = R_t + \alpha R^I_t - V^D_t(x_t; \theta)$
      ● The intrinsic reward is based on whether the worker moves in the direction set by the manager:
        $R^I_t = \frac{1}{c}\sum_{i=1}^{c} d_{\cos}\!\big(s_t - s_{t-i},\; g_{t-i}\big)$
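A small sketch of the intrinsic reward computation above, assuming the last c manager latent states and goals are kept in buffers (variable names are illustrative):

```python
import torch
import torch.nn.functional as F

def intrinsic_reward(s_t, past_states, past_goals):
    """Worker intrinsic reward R^I_t = (1/c) * sum_i d_cos(s_t - s_{t-i}, g_{t-i}).

    s_t:         (d,) current manager latent state
    past_states: list of the last c latent states [s_{t-1}, ..., s_{t-c}]
    past_goals:  list of the corresponding goals  [g_{t-1}, ..., g_{t-c}]
    """
    sims = [
        F.cosine_similarity(s_t - s_prev, g_prev, dim=0)  # cosine between state change and goal
        for s_prev, g_prev in zip(past_states, past_goals)
    ]
    return torch.stack(sims).mean()
```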

  16. More Details: Dilated LSTM
      ● Better able to preserve memories over long time horizons
      ● Output is summed over the previous 10 steps
      ● A specific type of Dilated RNN [Chang et al. 2017]
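A sketch of a dilated LSTM in the spirit of the slide: r separate state groups, only one of which is updated per step, with the output pooled by summation. Sharing one set of LSTM weights across the groups and summing over each group's most recent output (which approximates summing over the previous r steps) are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class DilatedLSTM(nn.Module):
    """Dilated LSTM sketch: r state groups, group t % r updated at step t."""
    def __init__(self, input_size, hidden_size, r=10):
        super().__init__()
        self.r = r
        self.hidden_size = hidden_size
        self.cell = nn.LSTMCell(input_size, hidden_size)  # weights shared across groups (assumption)

    def init_state(self, batch_size):
        h = torch.zeros(self.r, batch_size, self.hidden_size)
        c = torch.zeros(self.r, batch_size, self.hidden_size)
        return h, c

    def forward(self, x, state, t):
        h, c = state
        idx = t % self.r                              # which state group is active this step
        h_new, c_new = self.cell(x, (h[idx], c[idx]))
        h, c = h.clone(), c.clone()
        h[idx], c[idx] = h_new, c_new
        output = h.sum(dim=0)                         # pool (sum) over the groups' latest outputs
        return output, (h, c)
```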

  17. Results: Atari
      ● Outperforms the LSTM baseline, especially on games where rewards are more delayed

  18. Results: Comparison with Option-Critic
      ● The Option-Critic architecture was the only other end-to-end trainable system with sub-policies at the time
      ● FuN reaches a similar score on Seaquest, doubles Option-Critic's score on Ms. Pacman, more than triples it on Zaxxon, and achieves more than a 20x improvement on Asterix

  19. Sub-policy inspection: Water Maze
      ● Circular arena with an invisible goal; the agent must find the goal
      ● At the start of the next episode the agent is placed at a random location and must find the goal again
      ● The two left panels show individual episodes; the right panel visualizes the learned sub-policies
      ● The agent learns meaningful sub-goals

  20. Ablations: Temporal Resolution
      ● Removing the dilation from the LSTM, or running the manager at the full temporal resolution, performs significantly worse

  21. Ablations: Intrinsic Reward
      ● Right: the worker is trained using only the intrinsic reward
      ● The environment reward to the worker is not necessary for good performance

  22. Summary
      ● Directional goals, rather than absolute goals, are useful
      ● The dilated LSTM is crucial for high performance
      ● Improves long-term credit assignment over the baselines
      ● The manager's goals elicit meaningful low-level behaviors from the worker

  23. Thanks for listening!
