  1. Policy Gradients. CS60077: Reinforcement Learning. Abir Das, IIT Kharagpur. Nov 09–10, 2020.

  2. Agenda
  § Get started with policy gradient methods.
  § Get familiar with the naive REINFORCE algorithm and its advantages and disadvantages.
  § Get familiar with different variance reduction techniques.
  § Actor-Critic methods.

  3. Resources
  § Deep Reinforcement Learning by Sergey Levine [Link]
  § OpenAI Spinning Up [Link]

  4–5. Reinforcement Learning Setting
  [Figure: the reinforcement learning setting. Figure credits: [SB]; Sergey Levine, UC Berkeley]

  6. Reinforcement Learning Setting
  [Figure credit: Sergey Levine, UC Berkeley]
  § In the middle is the ‘policy network’, which directly learns a parameterized policy $\pi_\theta(a \mid s)$ (sometimes written $\pi(a \mid s; \theta)$) and outputs a probability distribution over all actions given the state $s$, parameterized by $\theta$.
  § The notation $\theta$ is used to distinguish the policy parameters from the parameter vector $w$ of a value-function approximator $\hat{v}(s; w)$.
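As a concrete illustration of such a policy network, here is a minimal sketch in PyTorch for a discrete action space. It is not part of the slides: the class name PolicyNetwork, the hidden layer size, and the state/action dimensions in the usage example are illustrative assumptions.

```python
# A minimal sketch (not from the slides) of a parameterized policy pi_theta(a|s)
# for a discrete action space. state_dim, n_actions, and hidden are placeholders.
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.distributions.Categorical:
        # Logits -> probability distribution over actions given the state s,
        # parameterized by theta (the network weights).
        logits = self.net(state)
        return torch.distributions.Categorical(logits=logits)

# Usage: sample an action and keep its log-probability for later gradient steps.
policy = PolicyNetwork(state_dim=4, n_actions=2)
dist = policy(torch.zeros(4))
action = dist.sample()
log_prob = dist.log_prob(action)
```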

  7. Reinforcement Learning Setting
  [Figure credit: Sergey Levine, UC Berkeley]
  § The goal in an RL problem is to maximize the total reward “in expectation” over the long run.
  § A trajectory $\tau$ is defined as $\tau = (s_1, a_1, s_2, a_2, s_3, a_3, \cdots)$.
  § The probability of a trajectory is given by the joint probability of its state–action pairs:
  $$p_\theta(s_1, a_1, s_2, a_2, \cdots, s_T, a_T, s_{T+1}) = p(s_1) \prod_{t=1}^{T} p(s_{t+1} \mid s_t, a_t)\, \pi_\theta(a_t \mid s_t) \quad (1)$$

  8. Reinforcement Learning Setting
  § Proof of the above relation. By the chain rule of probability and the Markov property,
  \begin{align*}
  p(s_{T+1}, s_T, a_T, s_{T-1}, a_{T-1}, \cdots, s_1, a_1)
  &= p(s_{T+1} \mid s_T, a_T, s_{T-1}, a_{T-1}, \cdots, s_1, a_1)\, p(s_T, a_T, s_{T-1}, a_{T-1}, \cdots, s_1, a_1) \\
  &= p(s_{T+1} \mid s_T, a_T)\, p(s_T, a_T, s_{T-1}, a_{T-1}, \cdots, s_1, a_1) \\
  &= p(s_{T+1} \mid s_T, a_T)\, p(a_T \mid s_T, s_{T-1}, a_{T-1}, \cdots, s_1, a_1)\, p(s_T, s_{T-1}, a_{T-1}, \cdots, s_1, a_1) \\
  &= p(s_{T+1} \mid s_T, a_T)\, \pi_\theta(a_T \mid s_T)\, p(s_T, s_{T-1}, a_{T-1}, \cdots, s_1, a_1) \quad (2)
  \end{align*}
  § The last factor in eqn. (2) has the same form as the left-hand side. Applying the same argument repeatedly, we get
  \begin{align*}
  p(s_{T+1}, s_T, a_T, s_{T-1}, a_{T-1}, \cdots, s_1, a_1)
  &= p(s_{T+1} \mid s_T, a_T)\, \pi_\theta(a_T \mid s_T)\, p(s_T \mid s_{T-1}, a_{T-1})\, \pi_\theta(a_{T-1} \mid s_{T-1})\, p(s_{T-1}, s_{T-2}, a_{T-2}, \cdots, s_1, a_1) \\
  &= p(s_1) \prod_{t=1}^{T} p(s_{t+1} \mid s_t, a_t)\, \pi_\theta(a_t \mid s_t) \quad (3)
  \end{align*}
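To make the factorization in eqns. (1)–(3) concrete, the sketch below computes $\log p_\theta(\tau)$ as a sum of per-step log terms. The helpers log_p_s1, log_p_dynamics, and log_pi_theta are hypothetical callables, not anything defined in the lecture; in the model-free setting the dynamics terms are unknown, and REINFORCE turns out not to need them.

```python
# Illustrative sketch of the factorization in eqns. (1) and (3): log p_theta(tau)
# decomposes into the initial-state term plus, for each step, a policy term and a
# dynamics term. log_p_s1, log_p_dynamics, and log_pi_theta are hypothetical
# callables supplied by the caller.

def log_prob_trajectory(states, actions, log_p_s1, log_p_dynamics, log_pi_theta):
    """states = [s_1, ..., s_{T+1}], actions = [a_1, ..., a_T]."""
    logp = log_p_s1(states[0])                                        # log p(s_1)
    for t in range(len(actions)):
        logp += log_pi_theta(actions[t], states[t])                   # log pi_theta(a_t | s_t)
        logp += log_p_dynamics(states[t + 1], states[t], actions[t])  # log p(s_{t+1} | s_t, a_t)
    return logp
```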

  9. The Goal of Reinforcement Learning
  [Figure credit: Sergey Levine, UC Berkeley]
  § We will sometimes denote this probability simply as $p_\theta(\tau)$, i.e.,
  $$p_\theta(\tau) = p_\theta(s_1, a_1, s_2, a_2, \cdots, s_T, a_T, s_{T+1}) = p(s_1) \prod_{t=1}^{T} p(s_{t+1} \mid s_t, a_t)\, \pi_\theta(a_t \mid s_t)$$
  § The goal can be written as
  $$\theta^* = \arg\max_\theta \underbrace{\mathbb{E}_{\tau \sim p_\theta(\tau)}\Big[\sum_t r(s_t, a_t)\Big]}_{J(\theta)}$$
  § Note that, for the time being, we are not considering discounting. We will come back to that.

  10. The Goal of Reinforcement Learning
  § Goal for the finite horizon setting:
  $$\theta^* = \arg\max_\theta \sum_{t=1}^{T} \mathbb{E}_{(s_t, a_t) \sim p_\theta(s_t, a_t)}\big[r(s_t, a_t)\big]$$
  § The same for the infinite horizon setting:
  $$\theta^* = \arg\max_\theta \mathbb{E}_{(s, a) \sim p_\theta(s, a)}\big[r(s, a)\big]$$
  § We will consider only the finite horizon case in this topic.

  11–12. Evaluating the Objective
  § We will see how we can optimize this objective: the expected value of the total reward under the trajectory distribution induced by the policy $\pi_\theta$.
  § But before that, let us see how we can evaluate the objective in the model-free setting: run the policy to sample $N$ trajectories and average their total rewards.
  $$J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\Big[\sum_t r(s_t, a_t)\Big] \approx \frac{1}{N} \sum_{i} \sum_{t} r(s_{i,t}, a_{i,t}) \quad (4)$$
  [Figure credit: Sergey Levine, UC Berkeley]
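A minimal sketch of the sample-based estimate in eqn. (4): roll out $N$ trajectories under the current policy and average the total rewards. The environment is assumed to follow the Gymnasium-style reset()/step() interface, policy is the illustrative network sketched earlier, and N and T are arbitrary choices.

```python
# A sketch of eqn. (4): estimate J(theta) by sampling N trajectories under the
# current policy and averaging the total rewards.
import torch

def estimate_objective(env, policy, N: int = 100, T: int = 200) -> float:
    returns = []
    for _ in range(N):                      # tau_i ~ p_theta(tau)
        state, _ = env.reset()
        total_reward = 0.0
        for _ in range(T):
            dist = policy(torch.as_tensor(state, dtype=torch.float32))
            state, reward, terminated, truncated, _ = env.step(dist.sample().item())
            total_reward += reward          # accumulate sum_t r(s_{i,t}, a_{i,t})
            if terminated or truncated:
                break
        returns.append(total_reward)
    return sum(returns) / N                 # (1/N) sum_i sum_t r(s_{i,t}, a_{i,t})
```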

  13–17. Maximizing the Objective
  § Now that we have seen how to evaluate the objective, the next step is to maximize it.
  § Compute the gradient and take steps in the direction of the gradient (see the update-step sketch after this slide).
  § Writing $r(\tau) = \sum_t r(s_t, a_t)$ for the total reward of a trajectory,
  $$\theta^* = \arg\max_\theta \underbrace{\mathbb{E}_{\tau \sim p_\theta(\tau)}\Big[\sum_t r(s_t, a_t)\Big]}_{J(\theta)}$$
  $$J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\big[r(\tau)\big] = \int p_\theta(\tau)\, r(\tau)\, d\tau$$
  $$\nabla_\theta J(\theta) = \int \nabla_\theta p_\theta(\tau)\, r(\tau)\, d\tau$$
  § How do we compute this complicated-looking gradient? The log-derivative trick comes to our rescue.
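The update step itself is ordinary gradient ascent, $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$. A hedged sketch using a PyTorch optimizer is below; the surrogate loss whose gradient equals $-\nabla_\theta J(\theta)$ is exactly what the log-derivative trick on the next slides lets us build from samples, and the learning rate is an illustrative choice.

```python
# Sketch of the gradient-ascent update theta <- theta + alpha * grad_theta J(theta),
# expressed as minimizing a surrogate loss whose gradient is -grad_theta J(theta).
import torch

optimizer = torch.optim.SGD(policy.parameters(), lr=1e-3)  # alpha = 1e-3 (illustrative)

def ascent_step(surrogate_loss: torch.Tensor) -> None:
    optimizer.zero_grad()
    surrogate_loss.backward()   # gradient of the surrogate = -grad_theta J(theta)
    optimizer.step()            # minimizing the surrogate ascends J(theta)
```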

  18–20. Log-Derivative Trick
  $$\nabla_\theta \log p_\theta(\tau) = \frac{\partial \log p_\theta(\tau)}{\partial p_\theta(\tau)}\, \nabla_\theta p_\theta(\tau) = \frac{1}{p_\theta(\tau)}\, \nabla_\theta p_\theta(\tau) \;\Longrightarrow\; \nabla_\theta p_\theta(\tau) = p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau) \quad (5)$$
  § Remembering that $J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}[r(\tau)] = \int p_\theta(\tau)\, r(\tau)\, d\tau$ and using eqn. (5), we get the gradient of the objective as
  $$\nabla_\theta J(\theta) = \int \nabla_\theta p_\theta(\tau)\, r(\tau)\, d\tau = \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, r(\tau)\, d\tau = \mathbb{E}_{\tau \sim p_\theta(\tau)}\big[\nabla_\theta \log p_\theta(\tau)\, r(\tau)\big] \quad (6)$$
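Eqn. (6) is an expectation, so it can be estimated by sampling trajectories, just like eqn. (4). The sketch below is a minimal sample-based version; it relies on the standard REINFORCE fact that the dynamics terms in $\log p_\theta(\tau)$ do not depend on $\theta$, so $\nabla_\theta \log p_\theta(\tau) = \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)$. The inputs log_probs and returns are assumed to be collected per trajectory as in the earlier rollout sketch.

```python
# Sample-based estimate of eqn. (6). Only the policy's log-probabilities are needed,
# since grad_theta log p_theta(tau) = sum_t grad_theta log pi_theta(a_t | s_t).
# log_probs[i] is the list of per-step log pi_theta(a_t | s_t) tensors for
# trajectory i, and returns[i] is its total reward r(tau_i).
import torch

def reinforce_surrogate(log_probs, returns) -> torch.Tensor:
    # Negative of (1/N) sum_i [sum_t log pi_theta(a_{i,t} | s_{i,t})] * r(tau_i),
    # so that minimizing it performs gradient ascent on J(theta).
    per_traj = [torch.stack(lp).sum() * R for lp, R in zip(log_probs, returns)]
    return -torch.stack(per_traj).mean()
```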

  21. Log-Derivative Trick
  § Till now we have the following:
  $$\theta^* = \arg\max_\theta J(\theta), \qquad J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\big[r(\tau)\big]$$
  $$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\big[\nabla_\theta \log p_\theta(\tau)\, r(\tau)\big]$$
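Putting the pieces together, here is a hedged sketch of one iteration of naive REINFORCE built from the snippets above (the parameterized policy, the Monte Carlo rollouts, the surrogate loss, and the gradient step). The Gymnasium-style environment interface and the values of N and T remain assumptions.

```python
# One iteration of naive REINFORCE assembled from the earlier sketches: sample N
# trajectories, form the surrogate for eqn. (6), and take a gradient step.
import torch

def reinforce_iteration(env, policy, optimizer, N: int = 10, T: int = 200) -> float:
    all_log_probs, all_returns = [], []
    for _ in range(N):                                   # tau_i ~ p_theta(tau)
        state, _ = env.reset()
        log_probs, total_reward = [], 0.0
        for _ in range(T):
            dist = policy(torch.as_tensor(state, dtype=torch.float32))
            action = dist.sample()
            log_probs.append(dist.log_prob(action))      # log pi_theta(a_t | s_t)
            state, reward, terminated, truncated, _ = env.step(action.item())
            total_reward += reward                       # r(tau_i) = sum_t r(s_t, a_t)
            if terminated or truncated:
                break
        all_log_probs.append(log_probs)
        all_returns.append(total_reward)

    loss = reinforce_surrogate(all_log_probs, all_returns)  # surrogate for eqn. (6)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                     # gradient ascent on J(theta)
    return sum(all_returns) / N                          # estimate of J(theta), eqn. (4)
```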
