Policy Gradients CS60077: Reinforcement Learning Abir Das IIT Kharagpur Nov 09, 10, 2020
Agenda
§ Get started with policy gradient methods.
§ Get familiar with the naive REINFORCE algorithm and its advantages and disadvantages.
§ Get familiar with different variance reduction techniques.
§ Actor-critic methods.
Resources
§ Deep Reinforcement Learning by Sergey Levine [Link]
§ OpenAI Spinning Up [Link]
Reinforcement Learning Setting
[Figure. Credit: SB]
[Figure. Credit: Sergey Levine, UC Berkeley]
Reinforcement Learning Setting
[Figure. Credit: Sergey Levine, UC Berkeley]
§ In the middle of the figure is the 'policy network', which can directly learn a parameterized policy $\pi_\theta(a \mid s)$ (sometimes denoted as $\pi(a \mid s; \theta)$) and provides the probability distribution over all actions given the state $s$, parameterized by $\theta$.
§ The notation $\theta$ is used to distinguish it from the parameter vector $w$ in the value function approximator $\hat{v}(s; w)$.
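To make the notion of a parameterized policy concrete, here is a minimal sketch of a tabular softmax policy. The slides do not prescribe this parameterization: the table `theta`, the function name `softmax_policy`, and all numeric values are assumptions made for illustration only.

```python
import math


def softmax_policy(theta, state):
    """Return pi_theta(. | state) as a list of action probabilities.

    theta[state][a] is a hypothetical preference (logit) for action a in
    the given state; a neural network would play this role in practice.
    """
    logits = theta[state]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Two states, three actions; the values are illustrative assumptions.
theta = [[0.0, 1.0, 2.0], [0.5, 0.5, 0.5]]
probs = softmax_policy(theta, 0)
```

Changing an entry of `theta` changes the distribution over actions in that state, which is exactly what a policy gradient method exploits.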
Reinforcement Learning Setting
[Figure. Credit: Sergey Levine, UC Berkeley]
§ The goal in an RL problem is to maximize the total reward "in expectation" over the long run.
§ A trajectory $\tau$ is defined as
$$\tau = (s_1, a_1, s_2, a_2, s_3, a_3, \cdots)$$
§ The probability of a trajectory is given by the joint probability of the state-action pairs:
$$p_\theta(s_1, a_1, s_2, a_2, \cdots, s_T, a_T, s_{T+1}) = p(s_1) \prod_{t=1}^{T} p(s_{t+1} \mid s_t, a_t)\, \pi_\theta(a_t \mid s_t) \tag{1}$$
Reinforcement Learning Setting
§ Proof of the above relation:
$$\begin{aligned}
p(s_{T+1}, s_T, a_T, s_{T-1}, a_{T-1}, \cdots, s_1, a_1)
&= p(s_{T+1} \mid s_T, a_T, s_{T-1}, a_{T-1}, \cdots, s_1, a_1)\, p(s_T, a_T, s_{T-1}, a_{T-1}, \cdots, s_1, a_1) \\
&= p(s_{T+1} \mid s_T, a_T)\, p(s_T, a_T, s_{T-1}, a_{T-1}, \cdots, s_1, a_1) \\
&= p(s_{T+1} \mid s_T, a_T)\, p(a_T \mid s_T, s_{T-1}, a_{T-1}, \cdots, s_1, a_1)\, p(s_T, s_{T-1}, a_{T-1}, \cdots, s_1, a_1) \\
&= p(s_{T+1} \mid s_T, a_T)\, \pi_\theta(a_T \mid s_T)\, \boxed{p(s_T, s_{T-1}, a_{T-1}, \cdots, s_1, a_1)}
\end{aligned} \tag{2}$$
§ The second equality uses the Markov property of the dynamics, and the last uses the fact that the policy depends only on the current state.
§ The boxed part of the equation is very similar to the left-hand side. So, applying the same argument repeatedly, we get
$$\begin{aligned}
p(s_{T+1}, s_T, a_T, s_{T-1}, a_{T-1}, \cdots, s_1, a_1)
&= p(s_{T+1} \mid s_T, a_T)\, \pi_\theta(a_T \mid s_T)\, p(s_T \mid s_{T-1}, a_{T-1})\, \pi_\theta(a_{T-1} \mid s_{T-1})\, p(s_{T-1}, s_{T-2}, a_{T-2}, \cdots, s_1, a_1) \\
&= p(s_1) \prod_{t=1}^{T} p(s_{t+1} \mid s_t, a_t)\, \pi_\theta(a_t \mid s_t)
\end{aligned} \tag{3}$$
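The factorization of the trajectory probability can be checked numerically on a toy problem. The sketch below assumes a hypothetical MDP with two states and two actions. The tables `p_init`, `p_trans`, and `policy` are made-up numbers, not taken from the lecture. It evaluates the product form for every trajectory of length T and sums the results, which should give 1 if the product is a valid joint distribution.

```python
import itertools

# Hypothetical 2-state, 2-action MDP; all numbers are illustrative.
p_init = [0.6, 0.4]                    # p(s_1)
# p_trans[s][a][s_next] = p(s_next | s, a)
p_trans = [[[0.9, 0.1], [0.2, 0.8]],
           [[0.5, 0.5], [0.3, 0.7]]]
# policy[s][a] = pi_theta(a | s)
policy = [[0.7, 0.3], [0.4, 0.6]]


def trajectory_prob(states, actions):
    """p(s_1) * prod_t p(s_{t+1} | s_t, a_t) * pi_theta(a_t | s_t).

    `states` has length T + 1 and `actions` has length T.
    """
    prob = p_init[states[0]]
    for t, a in enumerate(actions):
        s, s_next = states[t], states[t + 1]
        prob *= p_trans[s][a][s_next] * policy[s][a]
    return prob


T = 3
# Sum the product form over every possible trajectory of length T.
total = sum(
    trajectory_prob(states, actions)
    for states in itertools.product(range(2), repeat=T + 1)
    for actions in itertools.product(range(2), repeat=T)
)
```

Enumeration is feasible only because the example is tiny. In the model-free setting the transition probabilities are unknown and trajectories can only be sampled.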
The Goal of Reinforcement Learning
[Figure. Credit: Sergey Levine, UC Berkeley]
§ We will sometimes denote the probability as $p_\theta(\tau)$, i.e.,
$$p_\theta(\tau) = p_\theta(s_1, a_1, s_2, a_2, \cdots, s_T, a_T, s_{T+1}) = p(s_1) \prod_{t=1}^{T} p(s_{t+1} \mid s_t, a_t)\, \pi_\theta(a_t \mid s_t)$$
§ The goal can be written as
$$\theta^* = \arg\max_\theta \underbrace{\mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_t r(s_t, a_t)\right]}_{J(\theta)}$$
§ Note that, for the time being, we are not considering discounting. We will come back to that.
The Goal of Reinforcement Learning
§ Goal for the finite-horizon setting:
$$\theta^* = \arg\max_\theta \sum_{t=1}^{T} \mathbb{E}_{(s_t, a_t) \sim p_\theta(s_t, a_t)}\left[r(s_t, a_t)\right]$$
§ The same for the infinite-horizon setting:
$$\theta^* = \arg\max_\theta \mathbb{E}_{(s, a) \sim p_\theta(s, a)}\left[r(s, a)\right]$$
§ We will consider only the finite-horizon case in this topic.
Evaluating the Objective
§ We will see how to optimize this objective: the expected value of the total reward under the trajectory distribution induced by the policy $\pi_\theta$.
§ But before that, let us see how to evaluate the objective in the model-free setting. Sample $N$ trajectories by running the policy and average their total rewards:
$$J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_t r(s_t, a_t)\right] \approx \frac{1}{N} \sum_{i=1}^{N} \sum_t r(s_{i,t}, a_{i,t}) \tag{4}$$
Here $s_{i,t}$ and $a_{i,t}$ are the state and action at time $t$ in the $i$-th sampled trajectory.
[Figure. Credit: Sergey Levine, UC Berkeley]
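The sample-average estimate of the objective can be sketched as follows. Everything in this block is an illustrative assumption: the two-state MDP tables, the reward table `reward`, the horizon `T`, and the function names. The transition table only stands in for an environment that is sampled from. The estimator itself uses nothing but the observed rewards. Because the toy model is known, the exact objective is also computed for comparison.

```python
import random

# Hypothetical 2-state, 2-action MDP; all numbers are illustrative.
p_init = [0.6, 0.4]                    # p(s_1)
p_trans = [[[0.9, 0.1], [0.2, 0.8]],   # p_trans[s][a][s_next]
           [[0.5, 0.5], [0.3, 0.7]]]
policy = [[0.7, 0.3], [0.4, 0.6]]      # policy[s][a] = pi_theta(a | s)
reward = [[1.0, 0.0], [0.0, 0.5]]      # reward[s][a] = r(s, a)
T = 3                                  # horizon


def sample_return(rng):
    """Roll out one trajectory and return sum_t r(s_t, a_t)."""
    s = rng.choices([0, 1], weights=p_init)[0]
    total = 0.0
    for _ in range(T):
        a = rng.choices([0, 1], weights=policy[s])[0]
        total += reward[s][a]
        s = rng.choices([0, 1], weights=p_trans[s][a])[0]
    return total


def estimate_objective(num_trajectories, seed=0):
    """Monte Carlo estimate: (1/N) * sum_i sum_t r(s_{i,t}, a_{i,t})."""
    rng = random.Random(seed)
    returns = [sample_return(rng) for _ in range(num_trajectories)]
    return sum(returns) / num_trajectories


def exact_objective():
    """Exact J(theta) via state marginals; needs the model, so it is
    available here only because the toy MDP is fully specified."""
    d = list(p_init)  # d[s] = probability of being in state s at time t
    total = 0.0
    for _ in range(T):
        total += sum(d[s] * policy[s][a] * reward[s][a]
                     for s in range(2) for a in range(2))
        d = [sum(d[s] * policy[s][a] * p_trans[s][a][s2]
                 for s in range(2) for a in range(2))
             for s2 in range(2)]
    return total


j_hat = estimate_objective(20000)
j_exact = exact_objective()
```

With more sampled trajectories, `j_hat` should approach `j_exact`. The exact computation is included only as a sanity check and is not available when the dynamics are unknown.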
Maximizing the Objective
§ Now that we have seen how to evaluate the objective, the next step is to maximize it.
§ Compute the gradient and take steps in the direction of the gradient.
$$\theta^* = \arg\max_\theta \underbrace{\mathbb{E}_{\tau \sim p_\theta(\tau)}\Big[\overbrace{\sum_t r(s_t, a_t)}^{r(\tau)}\Big]}_{J(\theta)}$$
$$J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[r(\tau)\right] = \int p_\theta(\tau)\, r(\tau)\, d\tau$$
$$\nabla_\theta J(\theta) = \int \nabla_\theta p_\theta(\tau)\, r(\tau)\, d\tau$$
§ How do we compute this complicated-looking gradient? The log-derivative trick comes to our rescue.
Log Derivative Trick
$$\nabla_\theta \log p_\theta(\tau) = \frac{\partial \log p_\theta(\tau)}{\partial p_\theta(\tau)}\, \nabla_\theta p_\theta(\tau) = \frac{1}{p_\theta(\tau)}\, \nabla_\theta p_\theta(\tau) \implies \nabla_\theta p_\theta(\tau) = p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau) \tag{5}$$
§ So, using eqn. (5), we get the gradient of the objective as
$$\nabla_\theta J(\theta) = \int \nabla_\theta p_\theta(\tau)\, r(\tau)\, d\tau = \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, r(\tau)\, d\tau = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\nabla_\theta \log p_\theta(\tau)\, r(\tau)\right] \tag{6}$$
§ Remember that
$$J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[r(\tau)\right] = \int p_\theta(\tau)\, r(\tau)\, d\tau$$
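The identity in eqn. (6) can be checked numerically in the simplest possible case. The sketch below assumes a one-step problem in which a "trajectory" is a single action drawn from a softmax distribution, so the integral becomes a finite sum. The softmax parameterization, the function names, and the values of `theta` and `rewards` are illustrative assumptions. The expectation of the score function times the reward is computed by enumeration and compared against a central finite-difference gradient of J.

```python
import math


def softmax(theta):
    m = max(theta)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in theta]
    total = sum(exps)
    return [e / total for e in exps]


def objective(theta, rewards):
    """J(theta) = sum_a p_theta(a) r(a) for one-step trajectories."""
    return sum(p * r for p, r in zip(softmax(theta), rewards))


def score_function_gradient(theta, rewards):
    """E_{a ~ p_theta}[grad_theta log p_theta(a) * r(a)], by enumeration."""
    probs = softmax(theta)
    n = len(theta)
    grad = [0.0] * n
    for a in range(n):
        for k in range(n):
            # d/d theta_k of log softmax(theta)[a] = 1[a == k] - probs[k]
            grad_log_p = (1.0 if a == k else 0.0) - probs[k]
            grad[k] += probs[a] * grad_log_p * rewards[a]
    return grad


def finite_difference_gradient(theta, rewards, eps=1e-6):
    """Central finite differences of J(theta), used only as a check."""
    grad = []
    for k in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[k] += eps
        down[k] -= eps
        grad.append(
            (objective(up, rewards) - objective(down, rewards)) / (2 * eps)
        )
    return grad


theta = [0.2, -0.5, 1.0]      # illustrative parameters
rewards = [1.0, 3.0, -2.0]    # illustrative r(a)
g_score = score_function_gradient(theta, rewards)
g_fd = finite_difference_gradient(theta, rewards)
```

The two gradients should agree up to finite-difference error. In practice the expectation is not enumerated but estimated from sampled trajectories, since only samples from $p_\theta(\tau)$ are available.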
Log Derivative Trick
§ Till now we have the following:
$$\theta^* = \arg\max_\theta J(\theta); \qquad J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[r(\tau)\right]$$
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\nabla_\theta \log p_\theta(\tau)\, r(\tau)\right]$$