SLIDE 1

FeUdal Networks for Hierarchical Reinforcement Learning

Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, Koray Kavukcuoglu (DeepMind). The 34th International Conference on Machine Learning (ICML 2017)

SLIDE 2
  • Brief review of FeUdal Networks
  • Structures
  • Detailed Features
  • More on FeUdal Networks for HRL
  • Training
  • Experiment results
SLIDE 3

Feudal RL (1993)

Reward Hiding:

  • Managers reward sub-managers for satisfying their commands, not through the external reward
  • Managers have absolute control

Information Hiding:

  • Levels observe the world at different resolutions
  • Managers don’t know what happens at other levels of the hierarchy

[Figure: a stack of agents in a hierarchy; rewards flow between levels and from the environment, actions flow to the environment]

Dayan, Peter and Hinton, Geoffrey E., “Feudal Reinforcement Learning”, NIPS, 1993.

SLIDE 4

FeUdal Networks (2017)

Manager

  • Sets directional goals for the worker

  • Rewarded by environment
  • Does not directly act in environment

Worker

  • Higher temporal resolution
  • Reward for achieving manager’s goals
  • Produces primitive actions in environment

[Figure: the manager receives rewards from the environment and sends goals and rewards to the worker; the worker takes actions in the environment]

SLIDE 5

FeUdal Network:

[Figure: overall FeUdal Network architecture]

SLIDE 6

FeUdal Network: Details

Shared Dense Embedding

  • Embedding of input state
  • Used by both worker and manager to produce the goal and the action

  • CNN

○ 16 8x8 filters
○ 32 4x4 filters
○ 256-unit fully connected layer
○ ReLU activations
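
As a concrete reading of this slide, here is a minimal PyTorch sketch of the shared perceptual embedding. The strides (4 and 2) and the 84x84 input resolution are assumptions in the style of standard Atari agents, not stated on the slide:

```python
import torch
import torch.nn as nn

class PerceptionNet(nn.Module):
    """Shared embedding: pixels -> 256-d state used by both manager and worker."""

    def __init__(self, in_channels=3, d=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=8, stride=4), nn.ReLU(),  # 16 8x8 filters
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),           # 32 4x4 filters
        )
        # assumed 84x84 input -> 20x20 after conv1 -> 9x9 after conv2
        self.fc = nn.Sequential(nn.Linear(32 * 9 * 9, d), nn.ReLU())         # 256 fully connected

    def forward(self, x):  # x: (batch, channels, 84, 84)
        return self.fc(self.conv(x).flatten(start_dim=1))

z = PerceptionNet()(torch.zeros(1, 3, 84, 84))  # z.shape == (1, 256)
```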

SLIDE 7

FeUdal Network: Details

Manager: Goal embedding

  • Lower temporal resolution: goals are summed over the last 10 time steps (goals vary smoothly)

  • Uses dilated LSTM
  • Goal is set in a low-dimensional latent space, not in the environment’s observation space

  • Trained using transition policy gradient
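
Following the paper, the emitted goal is normalized to unit length before being handed to the worker; a minimal NumPy sketch of this slide’s goal mechanics, where h_t (the dilated LSTM output at step t) and recent_goals are hypothetical names:

```python
import numpy as np

def manager_goal(h_t, eps=1e-8):
    """The manager emits a unit-length *direction* in latent space."""
    return h_t / (np.linalg.norm(h_t) + eps)

def pooled_goal(recent_goals):
    """The worker conditions on the sum of the manager's last 10 goals,
    so the goal signal varies smoothly from step to step."""
    return np.sum(recent_goals, axis=0)
```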


SLIDE 8

FeUdal Network: Details

Worker: Action Embedding

  • Standard LSTM on the shared embedding

  • Embedding U matrix:

○ Rows: actions [a]
○ Columns: embedding dimension [k]


SLIDE 9

FeUdal Network: Details

Worker: Goal embedding

  • Compress the manager’s goal to dimension k using a linear transformation φ

  • Same dim as action embedding
  • Linear transformation with no bias

○ Can’t produce a 0 vector
○ Can’t ignore the manager’s input, so the manager’s goal will influence the final policy


SLIDE 10

FeUdal Network: Details

Worker: Action

  • Product of the action embedding matrix (U) with the goal embedding (w)
  • Produces a distribution over actions
  • Action = softmax(U*w)
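
Putting Slides 8-10 together, a minimal NumPy sketch of the worker’s head; lstm_out, recent_goals, and W_phi (the φ projection matrix) are hypothetical names, with shapes taken from the slides:

```python
import numpy as np

def worker_policy(lstm_out, recent_goals, W_phi, num_actions, k):
    """pi = softmax(U w): action embedding U combined with goal embedding w."""
    U = lstm_out.reshape(num_actions, k)      # rows: actions [a], columns: dim [k]
    w = W_phi @ np.sum(recent_goals, axis=0)  # linear, no bias: w can't be forced
                                              # to 0, so the goal always matters
    logits = U @ w                            # one logit per action
    e = np.exp(logits - logits.max())         # numerically stable softmax
    return e / e.sum()
```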


SLIDE 11

FeUdal Network: Features

Directional goals: the manager specifies a direction in latent space rather than an absolute target state

SLIDE 12

FeUdal Network: Features

SLIDE 13

FeUdal Network: Features

  • Intrinsic reward, based on cosine similarity:

d_cos(α, β) = αᵀβ / (‖α‖ ‖β‖)
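
A minimal NumPy sketch of this similarity and the intrinsic reward built from it (the averaged form appears on Slide 15); states and goals are assumed to be time-indexed lists of latent vectors:

```python
import numpy as np

def d_cos(alpha, beta, eps=1e-8):
    """Cosine similarity between two latent vectors."""
    return float(alpha @ beta / (np.linalg.norm(alpha) * np.linalg.norm(beta) + eps))

def intrinsic_reward(states, goals, t, c=10):
    """r^I_t = (1/c) * sum_i d_cos(s_t - s_{t-i}, g_{t-i}): did the state move
    in the directions the manager asked for over the last c steps?"""
    return sum(d_cos(states[t] - states[t - i], goals[t - i])
               for i in range(1, c + 1)) / c
```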

SLIDE 14

Training Manager: Transition Policy Gradient

Actor-critic, with the value function coming from the manager’s internal critic:

∇g_t = A^M_t ∇_θ d_cos(s_{t+c} − s_t, g_t(θ)), where A^M_t = R_t − V^M_t(x_t; θ)
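
A loss-style sketch of this update with hypothetical names; in real training only g_t carries gradients, while the latent states s_t and s_{t+c} are treated as constants:

```python
import numpy as np

def d_cos(alpha, beta, eps=1e-8):
    return float(alpha @ beta / (np.linalg.norm(alpha) * np.linalg.norm(beta) + eps))

def manager_loss(s_t, s_tc, g_t, R_t, V_t):
    """Transition policy gradient: push g_t toward the direction the latent
    state actually moved over c steps, weighted by the advantage A^M_t."""
    A_M = R_t - V_t                       # advantage from the internal critic
    return -A_M * d_cos(s_tc - s_t, g_t)  # minimizing this ascends the objective
```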

SLIDE 15

Training Worker: Weighted reward

Not reward-hiding!

  • Use a weighted sum of the intrinsic reward and the environment reward
  • The intrinsic reward is based on whether the worker moves in the direction the manager set

Actor-critic:

∇π_t = A^D_t ∇_θ log π(a_t | x_t; θ)

A^D_t = R_t + α R^I_t − V^D_t(x_t; θ)

r^I_t = (1/c) Σ_{i=1}^{c} d_cos(s_t − s_{t−i}, g_{t−i})

(R^I_t is the return computed from the intrinsic rewards r^I_t; α weights it against the environment return R_t.)
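
The worker’s advantage then reads as a one-liner; a sketch with hypothetical scalar inputs, where alpha is the intrinsic-reward weight from the slide:

```python
def worker_advantage(R_t, RI_t, V_t, alpha):
    """A^D_t = R_t + alpha * R^I_t - V^D_t(x_t): the advantage that scales the
    worker's policy-gradient term grad_theta log pi(a_t | x_t; theta)."""
    return R_t + alpha * RI_t - V_t
```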

SLIDE 16

More Details: Dilated LSTM

  • Better able to preserve memories over long periods
  • Output is summed over previous 10 steps
  • Specific type of Dilated RNN

[Figure: Dilated RNN architecture from Chang et al., 2017]
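
A minimal PyTorch sketch of such a dilated LSTM, assuming one shared cell with c = 10 rotating state groups and an unbatched rollout:

```python
import torch
import torch.nn as nn

class DilatedLSTM(nn.Module):
    """One shared LSTM cell whose state is split into c groups; at step t only
    group t % c is updated, and the output is summed over the last c steps."""

    def __init__(self, input_size, hidden_size, c=10):
        super().__init__()
        self.c = c
        self.cell = nn.LSTMCell(input_size, hidden_size)

    def forward(self, xs):  # xs: (T, input_size), batch size 1 for simplicity
        hidden = [xs.new_zeros(1, self.cell.hidden_size) for _ in range(self.c)]
        state = [xs.new_zeros(1, self.cell.hidden_size) for _ in range(self.c)]
        outs = []
        for t in range(xs.shape[0]):
            i = t % self.c  # only one state group ticks per step
            hidden[i], state[i] = self.cell(xs[t:t + 1], (hidden[i], state[i]))
            outs.append(hidden[i].squeeze(0))
        # each step's output is the sum of the most recent (up to) c outputs
        return torch.stack([torch.stack(outs[max(0, t - self.c + 1):t + 1]).sum(0)
                            for t in range(len(outs))])
```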

SLIDE 17

Results: Atari

  • Outperforms the LSTM baseline, especially on games with delayed rewards

SLIDE 18

Results: Compared with Option-critic

  • Option-critic architecture: the only other end-to-end trainable system with sub-policies at the time
  • FuN scores similarly on Seaquest, doubles Option-critic’s score on Ms. Pacman, more than triples it on Zaxxon, and improves on it by more than 20x on Asterix

SLIDE 19

Sub-policies inspection: Water Maze

  • Circular space with an invisible goal; the agent must find the goal
  • At the start of the next episode the agent is placed at a random location and must find the goal again
  • The two panels on the left show individual episodes; the panel on the right visualizes the learned sub-policies
  • The agent learns meaningful sub-goals
SLIDE 20

Ablations: Temporal Resolution

  • Removing the dilations from the LSTM, or running the manager at the full temporal resolution, performs significantly worse

SLIDE 21

Ablations: Intrinsic Reward

  • The plot on the right uses only the intrinsic reward
  • The environment reward is not necessary for good performance
SLIDE 22

Summary

  • Directional rather than absolute goals are useful
  • Dilated LSTM is crucial for high performance
  • Improves long-term credit assignment over baselines
  • The manager’s goals elicit meaningful low-level behaviors from the worker
SLIDE 23

Thanks for listening!