
Deterministic Policy Gradient, Advanced RL Algorithms (Milan Straka) - PowerPoint PPT Presentation



  1. NPFL122, Lecture 9: Deterministic Policy Gradient, Advanced RL Algorithms. Milan Straka, December 10, 2018. Charles University in Prague, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics.

  2. REINFORCE with Baseline. The returns can be arbitrary: better-than-average and worse-than-average returns cannot be recognized from the absolute value of the return. Hopefully, we can generalize the policy gradient theorem using a baseline $b(s)$ to

$\nabla_\theta J(\theta) \propto \sum_{s \in S} \mu(s) \sum_{a \in A} \big( q_\pi(s, a) - b(s) \big) \nabla_\theta \pi(a \mid s; \theta).$

A good choice for $b(s)$ is $v_\pi(s)$, which can be shown to minimize the variance of the estimator. Given that $\mathbb{E}_{a \sim \pi} q_\pi(s, a) = v_\pi(s)$, such a baseline resembles centering of returns: better-than-average returns are positive and worse-than-average returns are negative.

The resulting value $a_\pi(s, a) \stackrel{\text{def}}{=} q_\pi(s, a) - v_\pi(s)$ is also called the advantage function.

Of course, the baseline $v_\pi(s)$ can only be approximated. If neural networks are used to estimate $\pi(a \mid s; \theta)$, then some part of the network is usually shared between the policy and the value function estimation, which is trained using the mean squared error of the predicted and observed return.
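As an illustration (not part of the original slides), the shared policy and baseline heads and the corresponding losses might look as follows in the TensorFlow 1.x style used later in these slides; `hidden_layer`, `self.actions`, `self.returns`, and `num_actions` are assumed to be defined analogously to the slide code.

    import tensorflow as tf  # TensorFlow 1.x, as in the slide code

    # Sketch only: REINFORCE with a learned baseline for discrete actions.
    logits = tf.layers.dense(hidden_layer, num_actions)    # policy head
    baseline = tf.layers.dense(hidden_layer, 1)[:, 0]      # value head approximating v(s)

    action_dist = tf.distributions.Categorical(logits=logits)
    advantage = self.returns - tf.stop_gradient(baseline)  # returns centered by the baseline

    policy_loss = -tf.reduce_mean(action_dist.log_prob(self.actions) * advantage)
    baseline_loss = tf.losses.mean_squared_error(self.returns, baseline)
    loss = policy_loss + baseline_loss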

  3. Parallel Advantage Actor Critic. An alternative to independent workers is to train in a synchronous and centralized way by having the workers only generate episodes. Such an approach was described in May 2017 by Clemente et al., who named their agent parallel advantage actor-critic (PAAC).

[Figure 1 of the paper "Efficient Parallel Methods for Deep Reinforcement Learning" by Alfredo V. Clemente et al.: a master holding a single DNN sends actions to workers 0 to n_w, which execute them in environments 0 to n_e; the states and rewards flow back to the master, which computes targets and learns.]
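A schematic (and hypothetical) version of one synchronous PAAC step, with the master choosing actions for all environments in a single batched forward pass, could look like the sketch below; `envs`, `policy`, and the environments' step() interface are assumptions, not from the paper.

    import numpy as np

    # Sketch only: one synchronous step over n_e environments.
    def paac_step(envs, states, policy):
        # The master evaluates the single shared network on all current states at once.
        probs = policy(np.stack(states))                   # shape (n_e, num_actions)
        actions = [np.random.choice(len(p), p=p) for p in probs]
        # The workers only execute the chosen actions; env.step(a) is assumed
        # to return (next_state, reward, done).
        next_states, rewards, dones = zip(*(env.step(a) for env, a in zip(envs, actions)))
        return actions, list(next_states), list(rewards), list(dones)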

  4. Continuous Action Space. Until now, the actions were discrete. However, many environments naturally accept actions from a continuous space. We now consider actions which come from a range $[a, b]$ for $a, b \in \mathbb{R}$, or more generally from a Cartesian product of several such ranges $\prod_i [a_i, b_i]$.

A simple way to parametrize the action distribution is to choose it to be a normal distribution. Given a mean $\mu$ and a variance $\sigma^2$, the probability density function of $\mathcal{N}(\mu, \sigma^2)$ is

$p(x) \stackrel{\text{def}}{=} \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(x - \mu)^2}{2 \sigma^2}}.$

[Figure from Section 13.7 of "Reinforcement Learning: An Introduction, Second Edition": normal densities for $\mu = 0, \sigma^2 = 0.2$; $\mu = 0, \sigma^2 = 1.0$; $\mu = 0, \sigma^2 = 5.0$; and $\mu = -2, \sigma^2 = 0.5$.]
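The density above is straightforward to evaluate directly; a minimal NumPy sketch (illustration only, not from the slides):

    import numpy as np

    # Sketch only: the normal probability density function defined above.
    def normal_pdf(x, mu, sigma2):
        return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

    normal_pdf(0.0, mu=0.0, sigma2=1.0)   # ~0.3989, the peak of the standard normal density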

  5. Continuous Action Space in Gradient Methods. Utilizing continuous action spaces in gradient-based methods is straightforward. Instead of the softmax distribution, we suitably parametrize the action value, usually using the normal distribution. Considering only one real-valued action, we therefore have

$\pi(a \mid s; \theta) \stackrel{\text{def}}{=} P\big(a \sim \mathcal{N}(\mu(s; \theta), \sigma(s; \theta)^2)\big),$

where $\mu(s; \theta)$ and $\sigma(s; \theta)$ are function approximations of the mean and the standard deviation of the action distribution.

The mean and the standard deviation are usually computed from a shared representation, with the mean being computed as a regular regression (i.e., one output neuron without activation), and the standard deviation (which must be positive) being computed again as a regression, followed most commonly by either $\exp$ or $\operatorname{softplus}$, where $\operatorname{softplus}(x) \stackrel{\text{def}}{=} \log(1 + e^x)$.

  6. Continuous Action Space in Gradient Methods. During training, we compute $\mu(s; \theta)$ and $\sigma(s; \theta)$ and then sample the action value (clipping it to $[a, b]$ if required). To compute the loss, we utilize the probability density function of the normal distribution (and usually also add the entropy penalty).

    mu = tf.layers.dense(hidden_layer, 1)[:, 0]
    raw_sd = tf.layers.dense(hidden_layer, 1)[:, 0]
    sd = tf.exp(raw_sd)  # or sd = tf.nn.softplus(raw_sd)
    normal_dist = tf.distributions.Normal(mu, sd)

    # Loss computed as -log π(a|s) · return - entropy_regularization · entropy
    loss = - normal_dist.log_prob(self.actions) * self.returns \
           - args.entropy_regularization * normal_dist.entropy()
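At inference time (not shown on the slide), the sampled action can be clipped to the allowed range; a one-line sketch assuming the `normal_dist` above and hypothetical bounds `action_low`, `action_high`:

    # Sketch only: sample an action and clip it to [a, b].
    sampled_action = tf.clip_by_value(normal_dist.sample(), action_low, action_high)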

  7. Deterministic Policy Gradient Theorem. Combining continuous actions and Deep Q Networks is not straightforward. In order to do so, we need a different variant of the policy gradient theorem. Recall that in the policy gradient theorem,

$\nabla_\theta J(\theta) \propto \sum_{s \in S} \mu(s) \sum_{a \in A} q_\pi(s, a) \nabla_\theta \pi(a \mid s; \theta).$

Deterministic Policy Gradient Theorem: Assume that the policy $\pi(s; \theta)$ is deterministic and computes an action $a \in \mathbb{R}$. Then, under several continuity assumptions, the following holds:

$\nabla_\theta J(\theta) \propto \mathbb{E}_{s \sim \mu(s)} \Big[ \nabla_\theta \pi(s; \theta) \, \nabla_a q_\pi(s, a) \big|_{a = \pi(s; \theta)} \Big].$

The theorem was first proven in the paper Deterministic Policy Gradient Algorithms by David Silver et al.
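In an automatic differentiation framework, the chain-rule product $\nabla_\theta \pi(s; \theta) \nabla_a q_\pi(s, a)$ does not have to be formed explicitly; maximizing $q(s, \pi(s; \theta))$ with respect to $\theta$ yields it automatically. A TensorFlow 1.x style sketch (illustration only; `actor`, `critic`, `states`, and `actor_vars` are assumed to be defined elsewhere):

    import tensorflow as tf  # TensorFlow 1.x, as in the slide code

    # Sketch only: the deterministic policy gradient via the chain rule.
    actions = actor(states)                                 # a = π(s; θ)
    actor_loss = -tf.reduce_mean(critic(states, actions))   # ascend q(s, π(s; θ))
    # Differentiating actor_loss w.r.t. the actor parameters produces
    # -∇_θ π(s; θ) ∇_a q(s, a)|_{a = π(s; θ)} automatically.
    actor_gradients = tf.gradients(actor_loss, actor_vars)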

  8. Deterministic Policy Gradient Theorem – Proof. The proof is very similar to the original (stochastic) policy gradient theorem. We assume that $p(s' \mid s, a)$, $\nabla_a p(s' \mid s, a)$, $r(s, a)$, $\nabla_a r(s, a)$, $\pi(s; \theta)$ and $\nabla_\theta \pi(s; \theta)$ are continuous in all parameters.

$\begin{aligned}
\nabla_\theta v_\pi(s) &= \nabla_\theta q_\pi(s, \pi(s; \theta)) \\
&= \nabla_\theta \Big( r\big(s, \pi(s; \theta)\big) + \gamma \int_{s'} p\big(s' \mid s, \pi(s; \theta)\big) v_\pi(s') \,\mathrm{d}s' \Big) \\
&= \nabla_\theta \pi(s; \theta) \nabla_a r(s, a) \big|_{a = \pi(s; \theta)} + \gamma \nabla_\theta \int_{s'} p\big(s' \mid s, \pi(s; \theta)\big) v_\pi(s') \,\mathrm{d}s' \\
&= \nabla_\theta \pi(s; \theta) \nabla_a \Big( r(s, a) + \gamma \int_{s'} p(s' \mid s, a) v_\pi(s') \,\mathrm{d}s' \Big) \Big|_{a = \pi(s; \theta)} + \gamma \int_{s'} p\big(s' \mid s, \pi(s; \theta)\big) \nabla_\theta v_\pi(s') \,\mathrm{d}s' \\
&= \nabla_\theta \pi(s; \theta) \nabla_a q_\pi(s, a) \big|_{a = \pi(s; \theta)} + \gamma \int_{s'} p\big(s' \mid s, \pi(s; \theta)\big) \nabla_\theta v_\pi(s') \,\mathrm{d}s'
\end{aligned}$

Similarly to the stochastic policy gradient theorem, we finish the proof by continually expanding $\nabla_\theta v_\pi(s')$.

  9. Deep Deterministic Policy Gradients. Note that the formulation of the deterministic policy gradient theorem allows an off-policy algorithm, because the loss functions no longer depend on actions (similarly to how expected Sarsa is also an off-policy algorithm).

We therefore train function approximations for both $\pi(s; \theta)$ and $q(s, a; \theta)$, training $q(s, a; \theta)$ using a deterministic variant of the Bellman equation

$q(S_t, A_t; \theta) = \mathbb{E}_{R_{t+1}, S_{t+1}} \big[ R_{t+1} + \gamma q\big(S_{t+1}, \pi(S_{t+1}; \theta)\big) \big],$

and $\pi(s; \theta)$ according to the deterministic policy gradient theorem.

The algorithm was first described in the paper Continuous Control with Deep Reinforcement Learning by Timothy P. Lillicrap et al. (2015). The authors utilize a replay buffer, a target network (updated by an exponential moving average with $\tau = 0.001$), batch normalization for CNNs, and perform exploration by adding normally distributed noise to the predicted actions. Training is performed by Adam with learning rates of 1e-4 and 1e-3 for the policy and the critic network, respectively.
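A minimal sketch of these two updates (an illustration, not the paper's code), assuming hypothetical `actor`, `critic`, `target_actor`, and `target_critic` networks, a discount `gamma`, and a minibatch of `states`, `actions`, `rewards`, `next_states` sampled from the replay buffer; terminal-state handling is omitted:

    import tensorflow as tf  # TensorFlow 1.x, as in the slide code

    # Sketch only: critic target from the deterministic Bellman equation,
    # plus the actor loss from the deterministic policy gradient theorem.
    target_q = rewards + gamma * target_critic(next_states, target_actor(next_states))
    critic_loss = tf.losses.mean_squared_error(tf.stop_gradient(target_q),
                                               critic(states, actions))
    actor_loss = -tf.reduce_mean(critic(states, actor(states)))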

  10. Deep Deterministic Policy Gradients. Algorithm 1 (DDPG):

    Randomly initialize the critic network Q(s, a | θ^Q) and the actor μ(s | θ^μ) with weights θ^Q and θ^μ.
    Initialize the target networks Q′ and μ′ with weights θ^{Q′} ← θ^Q, θ^{μ′} ← θ^μ.
    Initialize the replay buffer R.
    for episode = 1, M do
      Initialize a random process 𝒩 for action exploration.
      Receive initial observation state s_1.
      for t = 1, T do
        Select action a_t = μ(s_t | θ^μ) + 𝒩_t according to the current policy and exploration noise.
        Execute action a_t and observe reward r_t and new state s_{t+1}.
        Store transition (s_t, a_t, r_t, s_{t+1}) in R.
        Sample a random minibatch of N transitions (s_i, a_i, r_i, s_{i+1}) from R.
        Set y_i = r_i + γ Q′(s_{i+1}, μ′(s_{i+1} | θ^{μ′}) | θ^{Q′}).
        Update the critic by minimizing the loss L = (1/N) ∑_i (y_i − Q(s_i, a_i | θ^Q))².
        Update the actor policy using the sampled policy gradient:
          ∇_{θ^μ} J ≈ (1/N) ∑_i ∇_a Q(s, a | θ^Q)|_{s=s_i, a=μ(s_i)} ∇_{θ^μ} μ(s | θ^μ)|_{s_i}.
        Update the target networks:
          θ^{Q′} ← τ θ^Q + (1 − τ) θ^{Q′},
          θ^{μ′} ← τ θ^μ + (1 − τ) θ^{μ′}.
      end for
    end for

Algorithm 1 of the paper "Continuous Control with Deep Reinforcement Learning" by Timothy P. Lillicrap et al.
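The soft target-network update in the last step can be expressed directly as an exponential moving average of the variables; a TensorFlow 1.x style sketch (illustration only; the variable lists `actor_vars`, `critic_vars`, `target_actor_vars`, `target_critic_vars` are assumed to be collected elsewhere):

    import tensorflow as tf  # TensorFlow 1.x, as in the slide code

    # Sketch only: θ′ ← τ θ + (1 − τ) θ′ for both the actor and the critic.
    tau = 0.001
    update_targets = tf.group(*[
        target_var.assign(tau * var + (1.0 - tau) * target_var)
        for var, target_var in zip(critic_vars + actor_vars,
                                   target_critic_vars + target_actor_vars)
    ])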
