NPFL122, Lecture 6: Rainbow
Milan Straka, November 19, 2018
Charles University in Prague, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics
Function Approximation

We will approximate the value function $v$ and/or the action-value function $q$, choosing from a family of functions parametrized by a weight vector $w \in \mathbb{R}^d$. We denote the approximations as $\hat{v}(s, w)$ and $\hat{q}(s, a, w)$.

We utilize the Mean Squared Value Error objective, denoted $\overline{VE}$:
$$\overline{VE}(w) \stackrel{\text{def}}{=} \sum_{s \in \mathcal{S}} \mu(s) \big[v_\pi(s) - \hat{v}(s, w)\big]^2,$$
where the state distribution $\mu(s)$ is usually the on-policy distribution.
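As an illustration, a minimal NumPy sketch of the $\overline{VE}$ objective for a finite state space, assuming we are given the true values $v_\pi$, the state distribution $\mu$, and a parametrized approximator (the feature matrix and all names below are hypothetical):

```python
import numpy as np

def msve(mu, v_pi, v_hat, w):
    """Mean Squared Value Error: sum_s mu(s) * (v_pi(s) - v_hat(s, w))**2.

    mu    -- state distribution, shape [num_states]
    v_pi  -- true state values, shape [num_states]
    v_hat -- callable returning the approximation for a state index and weights w
    """
    errors = np.array([v_pi[s] - v_hat(s, w) for s in range(len(mu))])
    return np.sum(mu * errors ** 2)

# Example with a linear approximator v_hat(s, w) = features[s] @ w (illustrative features).
features = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
v_hat = lambda s, w: features[s] @ w
print(msve(mu=np.array([0.5, 0.3, 0.2]),
           v_pi=np.array([1.0, 0.4, -0.2]),
           v_hat=v_hat, w=np.array([0.9, -0.1])))
```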
Gradient and Semi-Gradient Methods

The function approximation (i.e., the weight vector $w$) is usually optimized using gradient methods, for example as
$$w_{t+1} \leftarrow w_t - \tfrac{1}{2} \alpha \nabla \big[v_\pi(S_t) - \hat{v}(S_t, w_t)\big]^2
        = w_t + \alpha \big[v_\pi(S_t) - \hat{v}(S_t, w_t)\big] \nabla \hat{v}(S_t, w_t).$$

As usual, $v_\pi(S_t)$ is estimated by a suitable sample. For example, in Monte Carlo methods we use the episodic return $G_t$, and in temporal difference methods we employ bootstrapping and use $R_{t+1} + \gamma \hat{v}(S_{t+1}, w)$.
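A minimal sketch of one semi-gradient TD(0) update for a linear approximator $\hat{v}(s, w) = x(s)^\top w$, where $\nabla \hat{v}(s, w) = x(s)$; the feature vectors, step size and discount below are illustrative assumptions:

```python
import numpy as np

def semi_gradient_td0_update(w, x_t, r_tp1, x_tp1, alpha=0.1, gamma=0.99, terminal=False):
    """One semi-gradient TD(0) step for a linear v_hat(s, w) = x(s) @ w.

    The bootstrapped target R_{t+1} + gamma * v_hat(S_{t+1}, w) replaces the unknown
    v_pi(S_t); the gradient is taken only through v_hat(S_t, w), hence "semi-gradient".
    """
    v_t = x_t @ w
    target = r_tp1 + (0.0 if terminal else gamma * (x_tp1 @ w))
    return w + alpha * (target - v_t) * x_t  # gradient of v_hat(S_t, w) w.r.t. w is x_t

w = np.zeros(4)
w = semi_gradient_td0_update(w, x_t=np.array([1., 0., 0., 1.]), r_tp1=1.0,
                             x_tp1=np.array([0., 1., 0., 1.]))
```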
Deep Q Network

An off-policy Q-learning algorithm with a convolutional neural network function approximation of the action-value function.

Training can be extremely brittle (and can even diverge, as shown earlier).

[Figure: schematic of the network (convolution, convolution, fully connected, fully connected); Figure 1 of the paper "Human-level control through deep reinforcement learning" by Volodymyr Mnih et al.]
Deep Q Networks

Preprocessing: $210 \times 160$ 128-color images are converted to grayscale and then resized to $84 \times 84$.

Frame skipping technique is used, i.e., only every 4th frame (out of 60 per second) is considered, and the selected action is repeated on the other frames.

Input to the network are the last 4 frames (considering only the frames kept by frame skipping), i.e., an image with 4 channels.

The network is fairly standard, performing
- 32 filters of size $8 \times 8$ with stride 4 and ReLU,
- 64 filters of size $4 \times 4$ with stride 2 and ReLU,
- 64 filters of size $3 \times 3$ with stride 1 and ReLU,
- a fully connected layer with 512 units and ReLU,
- an output layer with 18 output units (one for each action).
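A sketch of this architecture in PyTorch (the flattened size $64 \cdot 7 \cdot 7$ follows from $84 \times 84$ inputs with the given strides; the pixel scaling and the assumption of 18 actions, the maximum across games, are illustrative choices, not part of the original description):

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    def __init__(self, num_actions=18):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),   # 84x84 -> 20x20
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),  # 20x20 -> 9x9
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),  # 9x9 -> 7x7
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, num_actions),                            # one Q-value per action
        )

    def forward(self, x):          # x: [batch, 4, 84, 84], pixel values scaled to [0, 1]
        return self.head(self.convs(x))

q_values = DQN()(torch.zeros(1, 4, 84, 84))  # -> shape [1, 18]
```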
Deep Q Networks

The network is trained with RMSProp to minimize the loss
$$L \stackrel{\text{def}}{=} \mathbb{E}_{(s, a, r, s') \sim \text{data}} \Big[\big(r + \gamma \max_{a'} Q(s', a'; \bar\theta) - Q(s, a; \theta)\big)^2\Big].$$

An $\varepsilon$-greedy behavior policy is utilized.

Important improvements:
- experience replay: the generated episodes are stored in a buffer as $(s, a, r, s')$ quadruples, and for training a transition is sampled uniformly;
- separate target network $\bar\theta$: to prevent instabilities, a separate target network is used to estimate the action values in the target. Its weights are not trained, but copied from the trained network once in a while;
- reward clipping: the error $\big(r + \gamma \max_{a'} Q(s', a'; \bar\theta) - Q(s, a; \theta)\big)$ is clipped to $[-1, 1]$.
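A sketch of this loss on a sampled minibatch, with a separate target network; the tensor names are illustrative, and the clipping of the error to $[-1, 1]$ is realized here via the Huber (smooth L1) loss, a common implementation choice rather than the verbatim original procedure:

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """Loss E[(r + gamma * max_a' Q(s', a'; theta_bar) - Q(s, a; theta))^2] on a minibatch.

    batch: s [B,4,84,84] float, a [B] long, r [B] float, s_next [B,4,84,84] float, done [B] float
    """
    s, a, r, s_next, done = batch
    q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)          # Q(s, a; theta)
    with torch.no_grad():                                      # theta_bar is not trained
        target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    return F.smooth_l1_loss(q, target)

# Once in a while, the target network weights are copied from the trained network:
# target_net.load_state_dict(q_net.state_dict())
```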
Deep Q Networks Hyperparameters

Hyperparameter                                Value
minibatch size                                32
replay buffer size                            1M
target network update frequency               10k
discount factor                               0.99
training frames                               50M
RMSProp learning rate and momentum            0.00025, 0.95
initial ε, final ε and frame of final ε       1.0, 0.1, 1M
replay start size                             50k
no-op max                                     30
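For reference, a plain transcription of the table into a Python dictionary (the key names are illustrative):

```python
dqn_hyperparameters = {
    "minibatch_size": 32,
    "replay_buffer_size": 1_000_000,
    "target_network_update_frequency": 10_000,
    "discount_factor": 0.99,
    "training_frames": 50_000_000,
    "rmsprop_learning_rate": 0.00025,
    "rmsprop_momentum": 0.95,
    "epsilon_initial": 1.0,
    "epsilon_final": 0.1,
    "epsilon_final_frame": 1_000_000,
    "replay_start_size": 50_000,
    "no_op_max": 30,
}
```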
Rainbow

There have been many suggested improvements to the DQN architecture. At the end of 2017, the Rainbow: Combining Improvements in Deep Reinforcement Learning paper combines 7 of them into a single architecture they call Rainbow.

Figure 1 of the paper "Rainbow: Combining Improvements in Deep Reinforcement Learning" by Matteo Hessel et al.
Rainbow DQN Extensions
Double Q-learning

Similarly to double Q-learning, instead of
$$r + \gamma \max_{a'} Q(s', a'; \bar\theta) - Q(s, a; \theta),$$
we minimize
$$r + \gamma Q\big(s', \arg\max_{a'} Q(s', a'; \theta); \bar\theta\big) - Q(s, a; \theta).$$

[Figure: the orange bars show the bias in a single Q-learning update when the action values are $Q(s, a) = V_*(s) + \epsilon_a$ and the errors $\{\epsilon_a\}_{a=1}^m$ are independent standard normal random variables; the second set of action values $Q'$, used for the blue bars, was generated identically and independently; all bars are the average of 100 repetitions. Figure 1 of the paper "Deep Reinforcement Learning with Double Q-learning" by Hado van Hasselt et al.]
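A sketch contrasting the two targets, with the online network selecting the argmax action and the target network evaluating it (tensor names are illustrative; in practice both functions would be called under torch.no_grad()):

```python
import torch

def dqn_target(target_net, r, s_next, done, gamma=0.99):
    # r + gamma * max_a' Q(s', a'; theta_bar)
    return r + gamma * (1 - done) * target_net(s_next).max(dim=1).values

def double_dqn_target(q_net, target_net, r, s_next, done, gamma=0.99):
    # r + gamma * Q(s', argmax_a' Q(s', a'; theta); theta_bar)
    best_a = q_net(s_next).argmax(dim=1, keepdim=True)   # action chosen by the online network
    return r + gamma * (1 - done) * target_net(s_next).gather(1, best_a).squeeze(1)
```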
Rainbow DQN Extensions
Double Q-learning

[Figure: panels showing the true value $Q_*(s, a)$ and an estimate $Q_t(s, a)$, all estimates and their maximum $\max_a Q_t(s, a)$, the bias $\max_a Q_t(s, a) - \max_a Q_*(s, a)$ as a function of state, and the average error of the Double-Q estimate. Figure 2 of the paper "Deep Reinforcement Learning with Double Q-learning" by Hado van Hasselt et al.]
Rainbow DQN Extensions
Double Q-learning

[Figure: value estimates of DQN and Double DQN during training on Alien, Space Invaders, Time Pilot and Zaxxon, value estimates (log scale) and scores on Wizard of Wor and Asterix, compared with the true values. Figure 3 of the paper "Deep Reinforcement Learning with Double Q-learning" by Hado van Hasselt et al.]
Rainbow DQN Extensions
Double Q-learning

Table 1 of the paper "Deep Reinforcement Learning with Double Q-learning" by Hado van Hasselt et al.
Table 2 of the paper "Deep Reinforcement Learning with Double Q-learning" by Hado van Hasselt et al.
Rainbow DQN Extensions
Prioritized Replay

Instead of sampling the transitions uniformly from the replay buffer, we prefer those with a large TD error. Therefore, we sample transitions according to their probability
$$p_t \propto \Big| r + \gamma \max_{a'} Q(s', a'; \bar\theta) - Q(s, a; \theta) \Big|^\omega,$$
where $\omega$ controls the shape of the distribution (which is uniform for $\omega = 0$ and corresponds to the TD error for $\omega = 1$).

New transitions are inserted into the replay buffer with maximum probability to support exploration of all encountered transitions.
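A minimal sketch of proportional prioritized sampling; this simple array-based version is for illustration only (the paper uses a sum-tree for efficiency), and the small constant added to the priorities is an implementation detail keeping every probability nonzero:

```python
import numpy as np

class PrioritizedReplay:
    def __init__(self, capacity, omega=0.6):
        self.capacity, self.omega = capacity, omega
        self.buffer, self.priorities = [], np.zeros(capacity, dtype=np.float64)
        self.position = 0

    def add(self, transition):
        # New transitions get the maximum priority, so each is replayed at least once.
        max_priority = self.priorities[:len(self.buffer)].max() if self.buffer else 1.0
        if len(self.buffer) < self.capacity:
            self.buffer.append(transition)
        else:
            self.buffer[self.position] = transition
        self.priorities[self.position] = max_priority
        self.position = (self.position + 1) % self.capacity

    def sample(self, batch_size):
        scaled = self.priorities[:len(self.buffer)] ** self.omega
        probs = scaled / scaled.sum()                       # p_t proportional to |TD error|^omega
        indices = np.random.choice(len(self.buffer), batch_size, p=probs)
        return indices, [self.buffer[i] for i in indices], probs[indices]

    def update_priorities(self, indices, td_errors, eps=1e-6):
        self.priorities[indices] = np.abs(td_errors) + eps  # refresh with the latest TD errors
```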
Rainbow DQN Extensions
Prioritized Replay

Because we now sample transitions according to $p_t$ instead of uniformly, the on-policy distribution and the sampling distribution differ. To compensate, we therefore utilize importance sampling with ratio
$$\rho_t = \left(\frac{1}{N \cdot p_t}\right)^\beta.$$

The authors in fact utilize $\rho_t / \max_i \rho_i$ "for stability reasons".
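The corresponding importance-sampling correction, normalized by the maximum weight as described above (a continuation of the previous sketch; the value of $\beta$ and the loss weighting shown in the comment are illustrative):

```python
import numpy as np

def importance_weights(sample_probs, buffer_size, beta=0.4):
    # rho_t = (1 / (N * p_t))^beta, normalized by max_i rho_i "for stability reasons"
    rho = (1.0 / (buffer_size * sample_probs)) ** beta
    return rho / rho.max()

# The weights then scale the per-transition losses in the minibatch, e.g.:
# loss = (importance_weights(probs, len(replay.buffer)) * td_errors ** 2).mean()
```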