CS 885 Reinforcement Learning
Neural Map: Structured Memory for Deep Reinforcement Learning
Emilio Parisotto and Ruslan Salakhutdinov, ICLR 2018
Presented by Andreas Stöckel, July 6, 2018
motivation: navigating partially observable environments (i)

[Figure: screenshot of "The Elder Scrolls: Arena", 1994]
motivation: navigating partially observable environments (ii)

Reinforcement learning setup
▶ State: Partially observable s_t
▶ Actions: Discrete a_t
▶ Reward: Sparse reward r, i.e. only when the agent reaches the goal
▶ Goal: Find a policy π(a|s) navigating through arbitrary environments

Observations
▶ A partially observable environment mandates memory
▶ Long short-term memory (LSTM) units tend to forget quickly
▶ External memories such as the Differentiable Neural Computer (DNC) have problems with interference

⇒ Reduce interference by incorporating locality into the DNC
overview

I Background
  ▶ Memory Systems
  ▶ Differentiable Neural Computer
  ▶ Asynchronous Advantage Actor-Critic
II Neural Map Network
III Empirical Evaluation
  ▶ 2D Goal-Search Environment
  ▶ 3D Doom Environment
IV Summary & Conclusion
Part I BACKGROUND
external memories for deep neural networks

EXTERNAL NEURAL NETWORK MEMORIES
▶ No write operator: Memory networks, i.e. a convolutional network over the past M states
▶ Write operator: Differentiable Neural Computer, i.e. an external memory matrix with differentiable access operations
differentiable neural computer (dnc): read and write

▶ External memory matrix M ∈ R^{W×H}
▶ Associative memory:
  ▶ Associate context with value: c_t → v_t
  ▶ Given c̃_t ≈ c_t, retrieve r_t ≈ v_t
▶ Read operation: given a read context vector c_t^R,
    r_t = M_t^T c_t^R
▶ Write operation: given a write context c_t^W, erase vector e_t, and value v_t,
    M_{t+1} = M_t ∘ (1 − c_t^W e_t^T) + c_t^W v_t^T
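A minimal sketch of the associative read and write above, assuming the soft addressing weights c_t^R and c_t^W have already been computed from the context; the shapes and names are illustrative, not taken from the DNC paper:

```python
import torch

W_slots, H_feat = 16, 8                      # memory with W slots of H features each
M = torch.zeros(W_slots, H_feat)             # external memory matrix M_t

def dnc_read(M, c_read):
    # r_t = M_t^T c_t^R: weighted combination of memory rows
    # c_read: soft addressing weights over slots, shape [W_slots]
    return M.t() @ c_read                    # -> [H_feat]

def dnc_write(M, c_write, erase, value):
    # M_{t+1} = M_t ∘ (1 - c_t^W e_t^T) + c_t^W v_t^T
    # c_write: [W_slots], erase/value: [H_feat]
    erase_term = c_write.unsqueeze(1) * erase.unsqueeze(0)   # outer product c^W e^T
    add_term = c_write.unsqueeze(1) * value.unsqueeze(0)     # outer product c^W v^T
    return M * (1.0 - erase_term) + add_term

# Example: write a value to slot 3, then read it back with the same addressing.
c = torch.zeros(W_slots); c[3] = 1.0
M = dnc_write(M, c, erase=torch.ones(H_feat),
              value=torch.arange(H_feat, dtype=torch.float))
r = dnc_read(M, c)                           # recovers the written value
```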
asynchronous advantage actor-critic (a3c) as baseline

▶ REINFORCE policy gradient descent with the value function V^π(s) as baseline (Actor-Critic):
    ∇_π log π(a_t | s_t) (G_t − V^π(s_t)),   where G_t = Σ_{k=0}^∞ γ^k R_{t+k}
▶ Here: π(a|s) is a deep neural network
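A sketch of the advantage-based policy-gradient update in the formula above for a single transition; the return G_t and the critic's value estimate are assumed to come from elsewhere, and the function names are mine:

```python
import torch
import torch.nn.functional as F

def a2c_losses(logits, value_estimate, action, G_t):
    """Policy and value losses for one transition.
    logits: unnormalised action scores from the policy network
    value_estimate: V^pi(s_t) predicted by the critic (tensor)
    action: index of the action a_t that was taken
    G_t: discounted return observed from time t onward (tensor)
    """
    log_prob = F.log_softmax(logits, dim=-1)[action]
    advantage = (G_t - value_estimate).detach()   # G_t - V^pi(s_t), no grad through baseline
    policy_loss = -log_prob * advantage           # minimising this ascends the PG objective
    value_loss = F.mse_loss(value_estimate, G_t)  # regress the critic toward the return
    return policy_loss, value_loss

# Usage with made-up numbers:
logits = torch.tensor([0.1, 0.5, -0.2])
p_loss, v_loss = a2c_losses(logits, value_estimate=torch.tensor(0.3),
                            action=1, G_t=torch.tensor(1.0))
```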
Part II NEURAL MAP NETWORK
neural map: overview (i)

[Architecture diagram: neural map M_t of size C × H × W; a READ operation (f^READ_CNN) and a softmax CONTEXT operation summarise the map, a WRITE operation (f_NN) at location (x, y) produces M_{t+1}; the input s_t and output o_t are passed through f^OUT_NN and a softmax to yield the policy π(a|s)]
neural map: overview (ii)

Variables
▶ Time step: t ∈ Z
▶ Location: (x_t, y_t) ∈ R^2
▶ Input state: s_t ∈ R^n
▶ Neural map: M_t ∈ R^{C×H×W}

Operations
▶ Read:    r_t = read(M_t)
▶ Context: c_t = context(M_t, s_t, r_t)
▶ Write:   w_{t+1}^{(x_t,y_t)} = write(s_t, r_t, c_t, M_t^{(x_t,y_t)})
▶ Update:  M_{t+1} = update(M_t, w_{t+1}^{(x_t,y_t)})
▶ Output:  o_t = [r_t, c_t, w_{t+1}^{(x_t,y_t)}]
▶ Policy:  π(a|s_t) = SoftMax(f^OUT_NN(o_t))
neural map: read

▶ Summarize the memory in a single "read vector"
    r_t = f^READ_CNN(M_t) ∈ R^C
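A sketch of the global read: a small convolutional network summarises the whole C × H × W map into a C-dimensional read vector. The layer sizes and pooling choice are my assumptions, not the paper's exact f^READ_CNN:

```python
import torch
import torch.nn as nn

C, H, W = 32, 15, 15           # memory channels and spatial size (illustrative)

# Hypothetical stand-in for f^READ_CNN: conv layers followed by global pooling.
read_net = nn.Sequential(
    nn.Conv2d(C, C, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(C, C, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),   # collapse H x W to 1 x 1
    nn.Flatten(),              # -> [batch, C]
)

M_t = torch.randn(1, C, H, W)  # neural map with a batch dimension
r_t = read_net(M_t)            # global read vector, shape [1, C]
```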
neural map: context

▶ Decode context c_t based on the input state s_t:
    q_t = W [s_t, r_t] ∈ R^C
    a_t^{(x,y)} = ⟨q_t, M_t^{(x,y)}⟩ ∈ R
    α_t^{(x,y)} = exp(a_t^{(x,y)}) / Σ_{(z,w)} exp(a_t^{(z,w)}) ∈ R
    c_t = Σ_{(x,y)} α_t^{(x,y)} M_t^{(x,y)} ∈ R^C
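A sketch of the context read as soft attention over map positions, following the equations above; modelling the query projection W as a single linear layer is an assumption on my part:

```python
import torch
import torch.nn as nn

C, H, W = 32, 15, 15
n_state = 64                               # dimensionality of the input state s_t

query_proj = nn.Linear(n_state + C, C)     # q_t = W [s_t, r_t]

def context_read(M_t, s_t, r_t):
    # M_t: [C, H, W], s_t: [n_state], r_t: [C]
    q_t = query_proj(torch.cat([s_t, r_t]))   # [C]
    keys = M_t.reshape(C, H * W)              # each column is one M_t^{(x,y)}
    a = keys.t() @ q_t                        # a^{(x,y)} = <q_t, M_t^{(x,y)}>, [H*W]
    alpha = torch.softmax(a, dim=0)           # attention weights over positions
    c_t = keys @ alpha                        # sum_{x,y} alpha^{(x,y)} M_t^{(x,y)}
    return c_t                                # [C]

c_t = context_read(torch.randn(C, H, W), torch.randn(n_state), torch.randn(C))
```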
neural map: write & update

▶ Compute the content of the memory for t + 1 at location (x_t, y_t):
    w_{t+1}^{(x_t,y_t)} = f^WRITE_NN([s_t, r_t, c_t, M_t^{(x_t,y_t)}])
    M_{t+1}^{(a,b)} = w_{t+1}^{(x_t,y_t)}   if (a,b) = (x_t,y_t)
    M_{t+1}^{(a,b)} = M_t^{(a,b)}           if (a,b) ≠ (x_t,y_t)
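A sketch of the write and update steps: a small network computes the new feature vector for the agent's current cell, and only that cell of the map changes. Using an MLP as a stand-in for f^WRITE_NN is an assumption:

```python
import torch
import torch.nn as nn

C, H, W = 32, 15, 15
n_state = 64

# Hypothetical stand-in for f^WRITE_NN.
write_net = nn.Sequential(
    nn.Linear(n_state + C + C + C, C), nn.ReLU(),
    nn.Linear(C, C),
)

def write_and_update(M_t, s_t, r_t, c_t, x_t, y_t):
    old_cell = M_t[:, y_t, x_t]                                # M_t^{(x_t, y_t)}
    w_new = write_net(torch.cat([s_t, r_t, c_t, old_cell]))    # w_{t+1}^{(x_t, y_t)}
    M_next = M_t.clone()                                       # all other cells unchanged
    M_next[:, y_t, x_t] = w_new                                # overwrite the current cell
    return M_next, w_new

M_next, w_new = write_and_update(
    torch.randn(C, H, W), torch.randn(n_state),
    torch.randn(C), torch.randn(C), x_t=7, y_t=7)
```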
neural map: output & policy

▶ Compute the policy π(a|s):
    o_t = [c_t, r_t, M_{t+1}^{(x_t,y_t)}]
    π(a|s) = SoftMax(f^OUT_NN(o_t))
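A sketch of the output step: the context, read, and newly written vectors are concatenated and mapped to action logits. Using a single linear layer for f^OUT_NN is an assumption:

```python
import torch
import torch.nn as nn

C, n_actions = 32, 4

out_net = nn.Linear(3 * C, n_actions)        # hypothetical stand-in for f^OUT_NN

def policy(c_t, r_t, w_new):
    o_t = torch.cat([c_t, r_t, w_new])       # o_t = [c_t, r_t, M_{t+1}^{(x_t, y_t)}]
    return torch.softmax(out_net(o_t), dim=-1)   # pi(a | s)

pi = policy(torch.randn(C), torch.randn(C), torch.randn(C))
```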
neural map: extensions

▶ Write operation: Gated Recurrent Units (GRUs)
  ▶ Use the GRU equations to disable or weaken the memory update
▶ Egocentric navigation
  ▶ Real-world agents do not know their global (x, y) location
  ▶ Always write to the centre of the memory and translate the memory on movement (see the sketch below)
▶ LSTM controller for 3D environments
  ▶ If W, H are too small, the network gets confused (reads what it just wrote)
  ⇒ Use an LSTM to remember read/write operations:
      h_t = LSTM(s_t, r_t, c_{t−1}, h_{t−1})
      c_t = context(M_t, h_t)
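A sketch of the egocentric translation: when the agent moves, the map is shifted so the agent's position stays at the centre cell. Zeroing the cells that scroll in from the border is a simplifying assumption, as is the use of torch.roll for the shift:

```python
import torch

def egocentric_shift(M, dx, dy):
    """Translate the neural map so the agent stays at the centre cell.

    M: [C, H, W] neural map; (dx, dy) is the agent's movement this step.
    Moving the agent by (dx, dy) shifts the map contents by (-dx, -dy).
    """
    C, H, W = M.shape
    shifted = torch.roll(M, shifts=(-dy, -dx), dims=(1, 2))
    # Zero the rows/columns that wrapped around instead of letting them wrap.
    if dy > 0:
        shifted[:, H - dy:, :] = 0
    elif dy < 0:
        shifted[:, :-dy, :] = 0
    if dx > 0:
        shifted[:, :, W - dx:] = 0
    elif dx < 0:
        shifted[:, :, :-dx] = 0
    return shifted

M = torch.randn(8, 15, 15)
M_shifted = egocentric_shift(M, dx=1, dy=0)   # agent moved one cell to the right
```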