NPFL122, Lecture 5

Function Approximation, Deep Q Network

Milan Straka, November 12, 2018

Charles University in Prague
Faculty of Mathematics and Physics
Institute of Formal and Applied Linguistics
n-step Methods

The full return is
$$G_t = \sum_{k=t}^{\infty} \gamma^{k-t} R_{k+1},$$
while the one-step return is
$$G_{t:t+1} = R_{t+1} + \gamma V(S_{t+1}).$$

We can generalize both into n-step returns:
$$G_{t:t+n} \stackrel{\text{def}}{=} \sum_{k=t}^{t+n-1} \gamma^{k-t} R_{k+1} + \gamma^n V(S_{t+n}),$$
with $G_{t:t+n} \stackrel{\text{def}}{=} G_t$ if $t+n \ge T$.

[Figure 7.1 of "Reinforcement Learning: An Introduction, Second Edition": backup diagrams of 1-step TD (TD(0)), 2-step TD, 3-step TD, n-step TD, up to ∞-step TD (Monte Carlo).]
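To make the definition concrete, the following minimal Python sketch (not from the original slides) computes $G_{t:t+n}$ from a recorded episode; the list layout, where `rewards[k]` holds $R_{k+1}$ and `values[k]` holds $V(S_k)$, is an assumption of the sketch.

```python
def n_step_return(rewards, values, t, n, gamma):
    """Compute the n-step return G_{t:t+n}.

    rewards[k] is R_{k+1}, values[k] is V(S_k); T = len(rewards) is the
    episode length.  If t + n >= T, this is just the full return G_t.
    """
    T = len(rewards)
    G = sum(gamma ** (k - t) * rewards[k] for k in range(t, min(t + n, T)))
    if t + n < T:
        G += gamma ** n * values[t + n]  # bootstrap with V(S_{t+n})
    return G
```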
n-step Sarsa

Defining the n-step return to utilize the action-value function as
$$G_{t:t+n} \stackrel{\text{def}}{=} \sum_{k=t}^{t+n-1} \gamma^{k-t} R_{k+1} + \gamma^n Q(S_{t+n}, A_{t+n}),$$
with $G_{t:t+n} \stackrel{\text{def}}{=} G_t$ if $t+n \ge T$, we get the following straightforward update rule:
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \big[G_{t:t+n} - Q(S_t, A_t)\big].$$

[Figure 7.1 of "Reinforcement Learning: An Introduction, Second Edition".]
[Figure 7.4 of "Reinforcement Learning: An Introduction, Second Edition": a gridworld showing the path taken, the action values increased by one-step Sarsa, and the action values increased by 10-step Sarsa.]
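A tabular version of this update can be sketched as follows (a hypothetical helper, not the course template; `Q` is assumed to be a NumPy array or dict indexed by state–action pairs, and `rewards[k]` again holds $R_{k+1}$):

```python
def n_step_sarsa_update(Q, states, actions, rewards, t, n, alpha, gamma):
    """Perform Q(S_t, A_t) <- Q(S_t, A_t) + alpha [G_{t:t+n} - Q(S_t, A_t)]."""
    T = len(rewards)
    G = sum(gamma ** (k - t) * rewards[k] for k in range(t, min(t + n, T)))
    if t + n < T:
        G += gamma ** n * Q[states[t + n], actions[t + n]]  # bootstrap with Q(S_{t+n}, A_{t+n})
    Q[states[t], actions[t]] += alpha * (G - Q[states[t], actions[t]])
```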
Off-policy n-step Sarsa

Recall the relative probability of a trajectory under the target and behaviour policies, which we now generalize as
$$\rho_{t:t+n} \stackrel{\text{def}}{=} \prod_{k=t}^{\min(t+n,\,T-1)} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}.$$

Then a simple off-policy n-step TD can be computed as
$$V(S_t) \leftarrow V(S_t) + \alpha \rho_{t:t+n-1} \big[G_{t:t+n} - V(S_t)\big].$$

Similarly, n-step Sarsa becomes
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \rho_{t+1:t+n} \big[G_{t:t+n} - Q(S_t, A_t)\big].$$
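A sketch of the importance-sampling ratio, assuming `pi` and `b` are arrays of action probabilities of shape `[num_states, num_actions]` (these names and shapes are illustrative assumptions, not the course API):

```python
def importance_ratio(pi, b, states, actions, t, n, T):
    """rho_{t:t+n} = prod_{k=t}^{min(t+n, T-1)} pi(A_k|S_k) / b(A_k|S_k)."""
    rho = 1.0
    for k in range(t, min(t + n, T - 1) + 1):
        rho *= pi[states[k], actions[k]] / b[states[k], actions[k]]
    return rho

# Off-policy n-step TD then weights the TD error by rho_{t:t+n-1}:
#   V[states[t]] += alpha * importance_ratio(pi, b, states, actions, t, n - 1, T) \
#                   * (G - V[states[t]])
```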
Off-policy n-step Without Importance Sampling

We now derive the n-step return of the tree-backup algorithm, starting from the one-step case:
$$G_{t:t+1} \stackrel{\text{def}}{=} R_{t+1} + \gamma \sum_{a} \pi(a \mid S_{t+1}) Q(S_{t+1}, a).$$

For two steps, we get:
$$G_{t:t+2} \stackrel{\text{def}}{=} R_{t+1} + \gamma \sum_{a \neq A_{t+1}} \pi(a \mid S_{t+1}) Q(S_{t+1}, a) + \gamma \pi(A_{t+1} \mid S_{t+1}) G_{t+1:t+2}.$$

Therefore, we can generalize to:
$$G_{t:t+n} \stackrel{\text{def}}{=} R_{t+1} + \gamma \sum_{a \neq A_{t+1}} \pi(a \mid S_{t+1}) Q(S_{t+1}, a) + \gamma \pi(A_{t+1} \mid S_{t+1}) G_{t+1:t+n}.$$

[Example of the 3-step tree-backup update, from Section 7.5 of "Reinforcement Learning: An Introduction, Second Edition".]
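The recursion above translates directly into code. A minimal sketch, assuming `Q` and `pi` are NumPy arrays of shape `[num_states, num_actions]` and `rewards[k]` holds $R_{k+1}$:

```python
import numpy as np

def tree_backup_return(Q, pi, states, actions, rewards, t, n, gamma):
    """Recursive n-step tree-backup return G_{t:t+n} (no importance sampling)."""
    T = len(rewards)
    if t + 1 >= T:           # S_{t+1} is terminal, only the final reward remains
        return rewards[t]
    s_next = states[t + 1]
    if n == 1:               # one-step case: full expectation over pi at S_{t+1}
        return rewards[t] + gamma * float(pi[s_next] @ Q[s_next])
    a_next = actions[t + 1]
    expected_others = sum(pi[s_next, a] * Q[s_next, a]
                          for a in range(Q.shape[1]) if a != a_next)
    return (rewards[t]
            + gamma * expected_others
            + gamma * pi[s_next, a_next]
              * tree_backup_return(Q, pi, states, actions, rewards, t + 1, n - 1, gamma))
```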
Function Approximation

We will approximate the value function $v$ and/or the action-value function $q$, choosing from a family of functions parametrized by a weight vector $w \in \mathbb{R}^d$. We denote the approximations as
$$\hat{v}(s, w), \quad \hat{q}(s, a, w).$$

We utilize the Mean Squared Value Error objective, denoted $\overline{VE}$:
$$\overline{VE}(w) \stackrel{\text{def}}{=} \sum_{s \in \mathcal{S}} \mu(s) \big[v_\pi(s) - \hat{v}(s, w)\big]^2,$$
where the state distribution $\mu(s)$ is usually the on-policy distribution.
Gradient and Semi-Gradient Methods

The functional approximation (i.e., the weight vector $w$) is usually optimized using gradient methods, for example as
$$w_{t+1} \leftarrow w_t - \tfrac{1}{2} \alpha \nabla \big[v_\pi(S_t) - \hat{v}(S_t, w_t)\big]^2$$
$$w_{t+1} \leftarrow w_t + \alpha \big[v_\pi(S_t) - \hat{v}(S_t, w_t)\big] \nabla \hat{v}(S_t, w_t).$$

As usual, the $v_\pi(S_t)$ is estimated by a suitable sample. For example, in Monte Carlo methods we use the episodic return $G_t$, and in temporal difference methods we employ bootstrapping and use $R_{t+1} + \gamma \hat{v}(S_{t+1}, w)$.
Monte Carlo Gradient Policy Evaluation

Gradient Monte Carlo Algorithm for Estimating v̂ ≈ v_π:

  Input: the policy π to be evaluated
  Input: a differentiable function v̂ : S × R^d → R
  Algorithm parameter: step size α > 0
  Initialize value-function weights w ∈ R^d arbitrarily (e.g., w = 0)

  Loop forever (for each episode):
    Generate an episode S_0, A_0, R_1, S_1, A_1, …, R_T, S_T using π
    Loop for each step of episode, t = 0, 1, …, T−1:
      w ← w + α [G_t − v̂(S_t, w)] ∇v̂(S_t, w)

Algorithm 9.3 of "Reinforcement Learning: An Introduction, Second Edition".
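As a concrete instance, here is a minimal NumPy sketch of this algorithm for a linear approximator (so $\nabla \hat{v}(s, w) = x(s)$); the episode format and the `features` callback are assumptions of the sketch, not the course API:

```python
import numpy as np

def gradient_mc_evaluation(episodes, features, d, alpha, gamma):
    """Gradient Monte Carlo policy evaluation with linear v-hat(s, w) = x(s)^T w.

    episodes: list of episodes, each a list of (state, reward) pairs, where
    the reward is the one received after leaving that state.
    features(state) -> np.ndarray of shape [d].
    """
    w = np.zeros(d)
    for episode in episodes:
        G = 0.0
        for state, reward in reversed(episode):  # accumulate the return backwards
            G = reward + gamma * G
            x = features(state)
            w += alpha * (G - x @ w) * x         # gradient of x^T w is x
    return w
```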
Linear Methods

A simple special case of function approximation are linear methods, where
$$\hat{v}(x(s), w) \stackrel{\text{def}}{=} x(s)^T w = \sum_i x(s)_i w_i.$$

The $x(s)$ is a representation of the state $s$, which is a vector of the same size as $w$. It is sometimes called a feature vector.

The SGD update rule then becomes
$$w_{t+1} \leftarrow w_t + \alpha \big[v_\pi(S_t) - \hat{v}(x(S_t), w_t)\big] x(S_t).$$
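Because the gradient of a linear approximation is just the feature vector, a single update is one line of NumPy. A minimal sketch (the sample `v_target` standing for whatever estimate of $v_\pi(S_t)$ is used):

```python
import numpy as np

def linear_sgd_step(w, x, v_target, alpha):
    """One step of  w <- w + alpha [v_target - x^T w] x  for a linear v-hat."""
    return w + alpha * (v_target - x @ w) * x

# Illustrative usage: with one-hot features, the update reduces to the familiar
# tabular update V(s) <- V(s) + alpha [v_target - V(s)].
w = linear_sgd_step(np.zeros(4), np.array([0.0, 1.0, 0.0, 0.0]), v_target=1.0, alpha=0.1)
```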
Feature Construction for Linear Methods

Many methods have been developed in the past:
- state aggregation,
- polynomials,
- Fourier basis,
- tile coding,
- radial basis functions.

But of course, nowadays we use deep neural networks, which construct a suitable feature vector automatically as a latent variable (the last hidden layer).
State Aggregation

A simple way of generating a feature vector is state aggregation, where several neighboring states are grouped together.

For example, consider a 1000-state random walk, where transitions go uniformly randomly to any of the 100 neighboring states on the left or on the right. Using state aggregation, we can partition the 1000 states into 10 groups of 100 states. Monte Carlo policy evaluation then computes the following:

[Figure 9.1 of "Reinforcement Learning: An Introduction, Second Edition": the true value v_π, the approximate MC value v̂ (a step function over the 10 groups), and the state distribution μ over the 1000 states.]
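A state-aggregation feature function for this example takes only a few lines (states assumed to be numbered 1 to 1000); it could be plugged directly into the gradient Monte Carlo sketch above:

```python
import numpy as np

def aggregation_features(state, num_states=1000, num_groups=10):
    """One-hot vector selecting the group of `state`, for states 1..num_states."""
    x = np.zeros(num_groups)
    x[(state - 1) * num_groups // num_states] = 1.0
    return x
```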
Tile Coding

[Figure 9.9 of "Reinforcement Learning: An Introduction, Second Edition": a continuous 2D state space covered by four overlapping tilings; the four active tiles (one per tiling) overlapping a point are used to represent it.]

If $t$ overlapping tilings are used, the learning rate is usually normalized as $\alpha/t$.
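A minimal sketch of tile coding for a continuous 2D state in $[\text{low}, \text{high}]^2$; practical implementations usually use hashing and asymmetrical offsets, so this uniformly offset version is only illustrative:

```python
import numpy as np

def tile_coding_features(state, num_tilings=4, tiles_per_dim=8, low=0.0, high=1.0):
    """Binary features with exactly one active tile per tiling for a 2D state."""
    state = np.asarray(state, dtype=float)
    tile_width = (high - low) / tiles_per_dim
    features = np.zeros(num_tilings * tiles_per_dim ** 2)
    for tiling in range(num_tilings):
        offset = tiling / num_tilings * tile_width      # uniform offset per tiling
        coords = np.floor((state - low + offset) / tile_width).astype(int)
        coords = np.clip(coords, 0, tiles_per_dim - 1)
        features[(tiling * tiles_per_dim + coords[0]) * tiles_per_dim + coords[1]] = 1.0
    return features
```

With `num_tilings` active features per state, the learning rate would then be scaled to α / num_tilings, as noted above.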
Tile Coding

For example, on the 1000-state random walk example, the performance of tile coding surpasses state aggregation:

[Figure 9.10 of "Reinforcement Learning: An Introduction, Second Edition": √VE averaged over 30 runs on the 1000-state random walk; tile coding with 50 tilings outperforms state aggregation (a single tiling) over 5000 episodes.]
Asymmetrical Tile Coding

In higher dimensions, the tiles should have asymmetrical offsets, with a sequence of $(1, 3, 5, \ldots, 2d-1)$ being a good choice.

[Figure 9.11 of "Reinforcement Learning: An Introduction, Second Edition": possible generalizations for uniformly offset tilings versus asymmetrically offset tilings.]
Temporal Difference Semi-Gradient Policy Evaluation

In TD methods, we again use bootstrapping to estimate $v_\pi(S_t)$ as $R_{t+1} + \gamma \hat{v}(S_{t+1}, w)$.

Semi-gradient TD(0) for Estimating v̂ ≈ v_π:

  Input: the policy π to be evaluated
  Input: a differentiable function v̂ : S⁺ × R^d → R such that v̂(terminal, ·) = 0
  Algorithm parameter: step size α > 0
  Initialize value-function weights w ∈ R^d arbitrarily (e.g., w = 0)

  Loop for each episode:
    Initialize S
    Loop for each step of episode:
      Choose A ∼ π(·|S)
      Take action A, observe R, S'
      w ← w + α [R + γ v̂(S', w) − v̂(S, w)] ∇v̂(S, w)
      S ← S'
    until S is terminal

Algorithm 9.3 of "Reinforcement Learning: An Introduction, Second Edition".

Note that such an algorithm is called semi-gradient, because it does not backpropagate through $\hat{v}(S', w)$.
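A NumPy sketch of semi-gradient TD(0) with a linear approximator; the Gym-like `env.reset()`/`env.step()` interface and the `policy` and `features` callbacks are assumptions of the sketch:

```python
import numpy as np

def semi_gradient_td0(env, policy, features, d, alpha, gamma, num_episodes):
    """Semi-gradient TD(0) with linear v-hat(s, w) = x(s)^T w."""
    w = np.zeros(d)
    for _ in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            x = features(state)
            # The bootstrap target is treated as a constant (no gradient flows
            # through v-hat(S', w)), hence "semi-gradient".
            target = reward + (0.0 if done else gamma * (features(next_state) @ w))
            w += alpha * (target - x @ w) * x
            state = next_state
    return w
```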
Temporal Difference Semi-Gradient Policy Evaluation

An important fact is that linear semi-gradient TD methods do not converge to the minimum of $\overline{VE}$. Instead, they converge to a different TD fixed point $w_{TD}$. It can be proven that
$$\overline{VE}(w_{TD}) \le \frac{1}{1-\gamma} \min_w \overline{VE}(w).$$

However, when $\gamma$ is close to one, the multiplication factor in the above bound is quite large.
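For illustration (arithmetic not in the original slide), with a typical discount factor the bound already permits a large degradation:
$$\overline{VE}(w_{TD}) \le \frac{1}{1 - 0.99} \min_w \overline{VE}(w) = 100 \min_w \overline{VE}(w) \quad\text{for } \gamma = 0.99,$$
i.e., the asymptotic error of the TD fixed point may be up to a hundred times the best error achievable by the linear family.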