CS885 Reinforcement Learning
Lecture 12 (June 8, 2018): Deep Recurrent Q-Networks
[GBC] Chap. 10
University of Waterloo, Spring 2018, Pascal Poupart
Outline
• Recurrent neural networks
  – Long short-term memory (LSTM) networks
• Deep recurrent Q-networks
Partial Observability
• Hidden Markov model
  – Initial state distribution: $\Pr(s_0)$
  – Transition probabilities: $\Pr(s_{t+1} \mid s_t)$
  – Observation probabilities: $\Pr(o_t \mid s_t)$
• Belief monitoring
  – $\Pr(s_t \mid o_{1..t}) \propto \Pr(o_t \mid s_t) \sum_{s_{t-1}} \Pr(s_t \mid s_{t-1}) \Pr(s_{t-1} \mid o_{1..t-1})$
[Figure: HMM graphical model with hidden states $s_0, s_1, s_2, \dots$ emitting observations $o_1, o_2, \dots$]
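As a concrete illustration of the belief-monitoring update above, here is a minimal NumPy sketch; the two-state transition and observation matrices are invented for the example and are not from the slides.

```python
import numpy as np

def belief_update(b_prev, T, O, obs):
    """One step of HMM belief monitoring:
    b_t(s') ∝ Pr(obs | s') * sum_s Pr(s' | s) * b_{t-1}(s)."""
    predicted = b_prev @ T                 # predict: push belief through transitions
    unnormalized = O[:, obs] * predicted   # correct: weight by observation likelihood
    return unnormalized / unnormalized.sum()

# Illustrative 2-state model (numbers made up for the example)
T = np.array([[0.9, 0.1],                  # T[s, s'] = Pr(s' | s)
              [0.2, 0.8]])
O = np.array([[0.7, 0.3],                  # O[s, o] = Pr(o | s)
              [0.1, 0.9]])
b = np.array([0.5, 0.5])                   # initial state distribution Pr(s_0)
for obs in [1, 1, 0]:                      # a short observation sequence
    b = belief_update(b, T, O, obs)
print(b)                                   # posterior Pr(s_t | o_{1..t})
```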
Recurrent Neural Network (RNN)
• In RNNs, outputs can be fed back to the network as inputs, creating a recurrent structure
• HMMs can be simulated and generalized by RNNs
• RNNs can be used for belief monitoring
  – $o_t$: vector of observations
  – $b_t$: belief state
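To make the correspondence concrete, here is a minimal sketch of a recurrent step used for belief monitoring: the new belief vector $b_t$ is a learned, nonlinear function of the previous belief and the current observation vector. All weights and dimensions are arbitrary, for illustration only.

```python
import numpy as np

def rnn_step(b_prev, o_t, W_b, W_o, bias):
    # Generalizes the HMM belief update: a learned map from
    # (previous belief, current observation) to the new belief state
    return np.tanh(W_b @ b_prev + W_o @ o_t + bias)

rng = np.random.default_rng(0)
n_belief, n_obs = 4, 3                     # arbitrary sizes for the example
W_b = rng.normal(size=(n_belief, n_belief))
W_o = rng.normal(size=(n_belief, n_obs))
bias = np.zeros(n_belief)

b = np.zeros(n_belief)                     # initial belief state b_0
for o_t in rng.normal(size=(5, n_obs)):    # a sequence of 5 observation vectors
    b = rnn_step(b, o_t, W_b, W_o, bias)
```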
Training
• Recurrent neural networks are trained by backpropagation on the unrolled network
  – E.g., backpropagation through time
• Weight sharing:
  – Combine the gradients of shared weights into a single gradient (see the sketch below)
• Challenges:
  – Gradient vanishing (and explosion)
  – Long-range memory
  – Prediction drift
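The sketch below (PyTorch, with hypothetical dimensions) shows both points on this slide: the network is unrolled by applying the same cell at every time step, and calling backward() on the unrolled graph performs backpropagation through time, with autograd summing the per-step gradients of the shared weights into a single gradient.

```python
import torch
import torch.nn as nn

cell = nn.Linear(3 + 4, 4)                 # one cell, shared across all time steps

obs_seq = torch.randn(5, 3)                # 5 time steps of 3-dim observations
target = torch.randn(4)
h = torch.zeros(4)                         # initial hidden state

for o in obs_seq:                          # unroll the network through time
    h = torch.tanh(cell(torch.cat([o, h])))

loss = ((h - target) ** 2).sum()
loss.backward()                            # backpropagation through time:
# gradients from every time step are accumulated into cell.weight.grad,
# i.e. the shared weights receive a single combined gradient
print(cell.weight.grad.shape)              # torch.Size([4, 7])
```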
Long Short-Term Memory (LSTM)
• Special gated structure to control memorization and forgetting in RNNs
• Mitigates gradient vanishing
• Facilitates long-term memory
Unrolled Long Short-Term Memory
[Figure: LSTM unrolled over three time steps. At each step $t$, an input gate, a forget gate, and an output gate control how the cell state is updated from the input $\tilde{x}_t$ and the previous hidden state $h_{t-1}$, producing $h_t$.]
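The figure corresponds to the standard LSTM cell equations; a minimal NumPy sketch follows. Stacking the four gate parameter blocks (i, f, o, g) into single matrices is a common convention, not something specified on the slide.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step; W, U, b stack the parameters of the input (i),
    forget (f), output (o) gates and the candidate cell content (g)."""
    H = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b           # shape (4*H,)
    i = sigmoid(z[0:H])                    # input gate: how much to write
    f = sigmoid(z[H:2*H])                  # forget gate: how much memory to keep
    o = sigmoid(z[2*H:3*H])                # output gate: how much to expose
    g = np.tanh(z[3*H:4*H])                # candidate memory content
    c_t = f * c_prev + i * g               # additive update: gradients can flow
                                           # through c without repeated squashing
    h_t = o * np.tanh(c_t)
    return h_t, c_t
```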
Deep Recurrent Q-Network
• Hausknecht and Stone (2015)
  – Atari games
• Transition model
  – LSTM network
• Observation model
  – Convolutional network
[Figure: at each time step, an image frame is fed through the convolutional observation model into the LSTM]
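Below is a sketch of this architecture in PyTorch: a convolutional observation model over single frames feeding an LSTM transition model, topped by a linear Q-value head. The conv layer sizes follow the standard DQN stack for 84x84 frames; treat the exact sizes as assumptions rather than a transcription of the paper.

```python
import torch
import torch.nn as nn

class DRQN(nn.Module):
    def __init__(self, num_actions, hidden=512):
        super().__init__()
        # Observation model: convolutional network over single frames
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        # Transition model: LSTM over the per-frame conv features
        self.lstm = nn.LSTM(64 * 7 * 7, hidden, batch_first=True)
        self.q_head = nn.Linear(hidden, num_actions)

    def forward(self, frames, hidden_state=None):
        # frames: (batch, time, 1, 84, 84) -- one image per time step
        B, T = frames.shape[:2]
        feats = self.conv(frames.reshape(B * T, *frames.shape[2:]))
        feats = feats.reshape(B, T, -1)
        out, hidden_state = self.lstm(feats, hidden_state)
        return self.q_head(out), hidden_state  # Q-values at every time step
```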
Deep Recurrent Q-Network
Initialize weights $w$ and $\bar{w}$ at random in $[-1, 1]$
Observe current state $s$
Loop
  Execute policy for entire episode
  Add episode $(o_1, a_1, o_2, a_2, o_3, a_3, \dots, o_T, a_T)$ to experience buffer
  Sample episode from buffer
  Initialize $h_0$
  For $t = 1$ till the end of the episode do
    $\frac{\partial Err}{\partial w} = \left[ Q_w(o_{1..t}, a_t) - r - \gamma \max_{a_{t+1}} Q_{\bar{w}}(o_{1..t+1}, a_{t+1}) \right] \frac{\partial Q_w(o_{1..t}, a_t)}{\partial w}$
    Update weights: $w \leftarrow w - \alpha \frac{\partial Err}{\partial w}$
  Every $c$ steps, update target: $\bar{w} \leftarrow w$
Deep Recurrent Q-Network
Initialize weights $w$ and $\bar{w}$ at random in $[-1, 1]$
Observe current state $s$
Loop
  Execute policy for entire episode
  Add episode $(o_1, a_1, o_2, a_2, o_3, a_3, \dots, o_T, a_T)$ to experience buffer
  Sample episode from buffer
  Initialize $h_0$
  For $t = 1$ till the end of the episode do
    $\frac{\partial Err}{\partial w} = \left[ Q_w(h_{t-1}, o_t, a_t) - r - \gamma \max_{a_{t+1}} Q_{\bar{w}}(h_t, o_{t+1}, a_{t+1}) \right] \frac{\partial Q_w(h_{t-1}, o_t, a_t)}{\partial w}$
    $h_t \leftarrow LSTM_w(h_{t-1}, o_t)$
    Update weights: $w \leftarrow w - \alpha \frac{\partial Err}{\partial w}$
  Every $c$ steps, update target: $\bar{w} \leftarrow w$
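A hedged sketch of how this update might look in PyTorch, using the DRQN module from the previous slide. For simplicity it batches the slide's per-step gradient updates into one loss over the sampled episode; the episode layout and the names (obs, actions, rewards, drqn_update) are assumptions, not from the slides.

```python
import torch

def drqn_update(q_net, target_net, optimizer, episode, gamma=0.99):
    obs, actions, rewards = episode   # obs: (1, T, 1, 84, 84); actions, rewards: (T,)
    q_all, _ = q_net(obs)             # Q_w, with h_t maintained inside the LSTM
    with torch.no_grad():
        q_next, _ = target_net(obs)   # Q_w-bar for the bootstrap target
    # Q_w(h_{t-1}, o_t, a_t) for the actions actually taken
    q_taken = q_all[0, :-1].gather(1, actions[:-1].unsqueeze(1)).squeeze(1)
    # r + gamma * max_{a_{t+1}} Q_w-bar(h_t, o_{t+1}, a_{t+1})
    target = rewards[:-1] + gamma * q_next[0, 1:].max(dim=1).values
    loss = ((q_taken - target) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()                   # BPTT through the whole episode
    optimizer.step()

# Every c gradient steps, refresh the target network:
# target_net.load_state_dict(q_net.state_dict())
```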
Results
• Flickering games (missing observations)