Standard and Natural Policy Gradients for Discounted Rewards
Aaron Mishkin
August 8, 2020
UBC MLRG 2018W1
Motivating Example: Humanoid Robot Control

Consider learning a control model for a robotic arm that plays table tennis.

[Image source: https://static.independent.co.uk/s3fs-public/thumbnails/image/2014/03/11/15/ping-pongv2.jpg?w968]
Why Policy Gradients?

Policy gradients have several advantages:
• Policy gradients permit explicit policies with complex parameterizations.
• Such policies are easily defined for continuous state and action spaces.
• Policy gradient approaches are guaranteed to converge under standard assumptions, while greedy methods (SARSA, Q-learning, etc.) are not.
Roadmap

• Background and Notation
• The Policy Gradient Theorem
• Natural Policy Gradients
Background and Notation
Markov Decision Processes (MDPs)

A discrete-time MDP is specified by the tuple $\{S, A, d_0, f, r\}$:
• States are $s \in S$; actions are $a \in A$.
• $f$ is the transition distribution. It satisfies the Markov property:
$$f(s_t, a_t, s_{t+1}) = p(s_{t+1} \mid s_0, a_0, \ldots, s_t, a_t) = p(s_{t+1} \mid s_t, a_t)$$
• $d_0(s_0)$ is the initial distribution over states.
• $r(s_t, a_t, s_{t+1})$ is the reward function, which may be deterministic or stochastic.
• Trajectories are sequences of state-action pairs: $\tau_{0:t} = \{(s_0, a_0), \ldots, (s_t, a_t)\}$.
We treat states $s$ as fully observable.
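To make the notation concrete, here is a minimal sketch (not from the slides) of the MDP tuple $\{S, A, d_0, f, r\}$ for a continuous-state, continuous-action problem. The class and method names (LinearGaussianMDP, reset, step) and the linear-Gaussian dynamics are illustrative assumptions only.

```python
import numpy as np

# Minimal sketch of the MDP tuple {S, A, d_0, f, r} with continuous states
# and actions. The linear-Gaussian dynamics are an assumption for illustration.
class LinearGaussianMDP:
    def __init__(self, A_dyn, B_dyn, noise_std=0.1, seed=0):
        self.A_dyn = A_dyn          # state transition matrix
        self.B_dyn = B_dyn          # action effect matrix
        self.noise_std = noise_std  # stochasticity of f(s' | s, a)
        self.rng = np.random.default_rng(seed)

    def reset(self):
        # Sample s_0 ~ d_0: here, a standard normal over states.
        return self.rng.normal(size=self.A_dyn.shape[0])

    def step(self, s, a):
        # Sample s' ~ f(. | s, a): linear dynamics plus Gaussian noise.
        s_next = self.A_dyn @ s + self.B_dyn @ a \
                 + self.noise_std * self.rng.normal(size=s.shape)
        # Deterministic reward r(s, a, s'): penalize distance from the origin
        # and large motor commands.
        reward = -np.sum(s_next ** 2) - 0.01 * np.sum(a ** 2)
        return s_next, reward
```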
Continuous State and Action Spaces

We will consider MDPs with continuous state and action spaces. In the robot control example:
• $s \in S$ is a real vector describing the configuration of the robotic arm's movement system and the state of the environment.
• $a \in A$ is a real vector representing a motor command to the arm.
• Given action $a$ in state $s$, the probability of being in a region of state space $S' \subseteq S$ is:
$$P(s' \in S' \mid s, a) = \int_{S'} p(s' \mid s, a) \, ds'$$
Future states $s'$ are only known probabilistically because our control and physical models are approximations.
Policies

A policy defines how an agent acts in the MDP:
• A policy $\pi : S \times A \to [0, \infty)$ is the conditional density function
$$\pi(a \mid s) := \text{probability density of taking action } a \text{ in state } s.$$
• The policy is deterministic when $\pi(a \mid s)$ is a Dirac delta function.
• Actions are chosen by sampling from the policy: $a \sim \pi(a \mid s)$.
• The quality of a policy is given by an objective function $J(\pi)$.
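For continuous action spaces, a standard explicit parameterization is a Gaussian policy whose mean depends on the state. The sketch below is an illustrative assumption, not the policy class used in the slides; the names GaussianPolicy, sample, and grad_log_prob are hypothetical.

```python
import numpy as np

# Gaussian policy pi_theta(a | s) = N(a; theta @ s, sigma^2 I).
# theta is the parameter matrix; sigma is held fixed for simplicity.
class GaussianPolicy:
    def __init__(self, theta, sigma=0.5, seed=0):
        self.theta = theta              # shape: (action_dim, state_dim)
        self.sigma = sigma
        self.rng = np.random.default_rng(seed)

    def mean(self, s):
        return self.theta @ s

    def sample(self, s):
        # a ~ pi_theta(a | s)
        return self.mean(s) + self.sigma * self.rng.normal(size=self.theta.shape[0])

    def grad_log_prob(self, s, a):
        # grad_theta log pi_theta(a | s) for the Gaussian mean parameterization:
        # (a - theta s) s^T / sigma^2.
        return np.outer(a - self.mean(s), s) / self.sigma ** 2
```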
Bellman Equations

We consider discounted returns with factor $\gamma \in [0, 1]$. The Bellman equations describe the quality of a policy recursively:
$$Q^\pi(s, a) := \int_S f(s' \mid s, a) \left[ r(s, a, s') + \gamma \int_A \pi(a' \mid s') \, Q^\pi(s', a') \, da' \right] ds'$$
$$\begin{aligned}
V^\pi(s) &:= \int_A \pi(a \mid s) \, Q^\pi(s, a) \, da \\
&= \int_A \pi(a \mid s) \int_S f(s' \mid s, a) \left[ r(s, a, s') + \gamma V^\pi(s') \right] ds' \, da \\
&= \int_A \pi(a \mid s) \int_S f(s' \mid s, a) \, r(s, a, s') \, ds' \, da \\
&\quad + \int_A \pi(a \mid s) \int_S f(s' \mid s, a) \, \gamma V^\pi(s') \, ds' \, da
\end{aligned}$$
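The recursion can be checked against a direct Monte Carlo estimate of $Q^\pi(s, a)$: take $a$ in $s$, then follow $\pi$ and average the truncated discounted returns ($V^\pi(s)$ is then the policy-average of these values). The sketch below assumes the hypothetical environment and policy interfaces from the earlier examples.

```python
import numpy as np

def monte_carlo_q(env, policy, s, a, gamma=0.95, horizon=200, n_rollouts=100):
    """Monte Carlo estimate of Q^pi(s, a): take action a in state s, then
    follow pi and average the truncated discounted returns.

    Assumes `env.step(s, a)` returns (s_next, reward) and `policy.sample(s)`
    returns an action, as in the hypothetical sketches above.
    """
    returns = []
    for _ in range(n_rollouts):
        s_t, r = env.step(s, a)            # first transition uses the given action
        total, discount = r, gamma
        for _ in range(horizon - 1):
            a_t = policy.sample(s_t)       # thereafter, a_t ~ pi(a | s_t)
            s_t, r = env.step(s_t, a_t)
            total += discount * r
            discount *= gamma
        returns.append(total)
    return float(np.mean(returns))
```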
Actor-Critic Methods

Three major flavors of reinforcement learning:
1. Critic-only methods: learn an approximation of the state-action reward function, $R(s, a) \approx Q^\pi(s, a)$.
2. Actor-only methods: learn the policy $\pi$ directly from observed rewards. A parametric policy $\pi_\theta$ can be optimized by ascending the policy gradient:
$$\nabla_\theta J(\pi_\theta) = \frac{\partial J(\pi_\theta)}{\partial \pi_\theta} \frac{\partial \pi_\theta}{\partial \theta}$$
3. Actor-critic methods: learn an approximation of the reward function $R(s, a)$ jointly with the policy $\pi(a \mid s)$.
Value of a Policy

We can use the Bellman equations to write the overall quality of the policy:
$$\begin{aligned}
\frac{J(\pi)}{1 - \gamma} &= \int_S d_0(s_0) \, V^\pi(s_0) \, ds_0 \\
&= \sum_{k=0}^{\infty} \int_S \int_A \int_S p(s_k = \bar{s}) \, \pi(a_k \mid \bar{s}) \, f(s_{k+1} \mid \bar{s}, a_k) \, \gamma^k \, r(\bar{s}, a_k, s_{k+1}) \, ds_{k+1} \, da \, d\bar{s} \\
&= \sum_{k=0}^{\infty} \gamma^k \int_S \int_A \int_S p(s_k = \bar{s}) \, \pi(a_k \mid \bar{s}) \, f(s_{k+1} \mid \bar{s}, a_k) \, r(\bar{s}, a_k, s_{k+1}) \, ds_{k+1} \, da \, d\bar{s}
\end{aligned}$$
Define the "discounted state" distribution:
$$d^\pi_\gamma(\bar{s}) := (1 - \gamma) \sum_{k=0}^{\infty} \gamma^k \, p(s_k = \bar{s})$$
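In practice, $J(\pi)$ can be estimated by sampling start states from $d_0$ and rolling out the policy; the $(1-\gamma)\gamma^k$ weighting of step $k$ is exactly how $d^\pi_\gamma$ arises. A sketch under the same hypothetical interfaces as above:

```python
import numpy as np

def estimate_discounted_return(env, policy, gamma=0.95, horizon=200, n_rollouts=100):
    """Monte Carlo estimate of J(pi) = (1 - gamma) * E_{d_0}[V^pi(s_0)].

    The (1 - gamma) factor matches the normalization of d^pi_gamma
    used in the slides.
    """
    total = 0.0
    for _ in range(n_rollouts):
        s = env.reset()                      # s_0 ~ d_0
        discount = 1.0
        for _ in range(horizon):
            a = policy.sample(s)             # a ~ pi(a | s)
            s, r = env.step(s, a)
            total += discount * r
            discount *= gamma
    return (1.0 - gamma) * total / n_rollouts
```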
Value of a Policy: Discounted Return

The final expression for the overall quality of the policy is the discounted return:
$$J(\pi) = \int_S \int_A \int_S d^\pi_\gamma(\bar{s}) \, \pi(a \mid \bar{s}) \, f(s' \mid \bar{s}, a) \, r(\bar{s}, a, s') \, ds' \, da \, d\bar{s}$$
Assuming that the policy is parameterized by $\theta$, how can we compute the policy gradient $\nabla_\theta J(\pi_\theta)$?
The Policy Gradient Theorem
Policy Gradient Theorem: Statement

Theorem 1 - Policy Gradient [5]: The gradient of the discounted return is:
$$\nabla_\theta J(\pi_\theta) = \int_S \int_A d^\pi_\gamma(\bar{s}) \, \nabla_\theta \pi_\theta(a \mid \bar{s}) \, Q^\pi(\bar{s}, a) \, da \, d\bar{s}$$
Proof: The relationship between the discounted return and the state value function gives us our starting place:
$$\begin{aligned}
\nabla_\theta J(\pi_\theta) &= (1 - \gamma) \, \nabla_\theta \int_S d_0(s_0) \, V^\pi(s_0) \, ds_0 \\
&= (1 - \gamma) \int_S d_0(s_0) \, \nabla_\theta V^\pi(s_0) \, ds_0
\end{aligned}$$
Policy Gradient Theorem: Proof

Consider the gradient of the state value function:
$$\begin{aligned}
\nabla_\theta V^\pi(s) &= \nabla_\theta \int_A \pi_\theta(a \mid s) \, Q^\pi(s, a) \, da \\
&= \int_A \nabla_\theta \pi_\theta(a \mid s) \, Q^\pi(s, a) + \pi_\theta(a \mid s) \, \nabla_\theta Q^\pi(s, a) \, da \\
&= \int_A \nabla_\theta \pi_\theta(a \mid s) \, Q^\pi(s, a) + \pi_\theta(a \mid s) \, \nabla_\theta \int_S f(s' \mid s, a) \left[ r(s, a, s') + \gamma V^\pi(s') \right] ds' \, da \\
&= \int_A \nabla_\theta \pi_\theta(a \mid s) \, Q^\pi(s, a) + \pi_\theta(a \mid s) \, \gamma \int_S f(s' \mid s, a) \, \nabla_\theta V^\pi(s') \, ds' \, da
\end{aligned}$$
This is a recursive expression for the gradient that we can unroll!
Policy Gradient Theorem: Proof Continued

Unrolling the expression from $s_0$ gives:
$$\begin{aligned}
\nabla_\theta V^\pi(s_0) &= \int_A \nabla_\theta \pi_\theta(a_0 \mid s_0) \, Q^\pi(s_0, a_0) \, da_0 \\
&\quad + \int_A \pi_\theta(a_0 \mid s_0) \, \gamma \int_S f(s_1 \mid s_0, a_0) \, \nabla_\theta V^\pi(s_1) \, ds_1 \, da_0 \\
&= \sum_{k=0}^{\infty} \gamma^k \int_S \int_A p(s_k = \bar{s} \mid s_0) \, \nabla_\theta \pi_\theta(a \mid \bar{s}) \, Q^\pi(\bar{s}, a) \, da \, d\bar{s}
\end{aligned}$$
So the policy gradient is given by:
$$\begin{aligned}
\frac{\nabla_\theta J(\pi_\theta)}{1 - \gamma} &= \int_S d_0(s_0) \sum_{k=0}^{\infty} \gamma^k \int_S \int_A p(s_k = \bar{s} \mid s_0) \, \nabla_\theta \pi_\theta(a \mid \bar{s}) \, Q^\pi(\bar{s}, a) \, da \, d\bar{s} \, ds_0 \\
&= \frac{1}{1 - \gamma} \int_S \int_A d^\pi_\gamma(\bar{s}) \, \nabla_\theta \pi_\theta(a \mid \bar{s}) \, Q^\pi(\bar{s}, a) \, da \, d\bar{s},
\end{aligned}$$
so $\nabla_\theta J(\pi_\theta) = \int_S \int_A d^\pi_\gamma(\bar{s}) \, \nabla_\theta \pi_\theta(a \mid \bar{s}) \, Q^\pi(\bar{s}, a) \, da \, d\bar{s}$, as claimed.
Policy Gradient Theorem: Introducing Critics

• However, we generally don't know the state-action reward function $Q^\pi(s, a)$.
• The actor-critic framework suggests learning an approximation $R_w(s, a)$ with parameters $w$.
• Given a fixed policy $\pi_\theta$, we want to minimize the expected least-squares error:
$$w = \operatorname*{argmin}_{w} \int_S \int_A d^\pi_\gamma(\bar{s}) \, \pi_\theta(a \mid \bar{s}) \, \tfrac{1}{2} \left[ Q^\pi(\bar{s}, a) - R_w(\bar{s}, a) \right]^2 da \, d\bar{s}$$
• Can we show that the policy gradient theorem holds for a reward function learned this way?
Policy Gradient Theorem: The Way Forward

Let's rewrite the policy gradient theorem to use our approximate reward function:
$$\begin{aligned}
\nabla_\theta J(\pi_\theta) &= \int_S \int_A d^\pi_\gamma(\bar{s}) \, \nabla_\theta \pi_\theta(a \mid \bar{s}) \, Q^\pi(\bar{s}, a) \, da \, d\bar{s} \\
&= \int_S \int_A d^\pi_\gamma(\bar{s}) \, \nabla_\theta \pi_\theta(a \mid \bar{s}) \left[ R_w(\bar{s}, a) + Q^\pi(\bar{s}, a) - R_w(\bar{s}, a) \right] da \, d\bar{s} \\
&= \int_S \int_A d^\pi_\gamma(\bar{s}) \, \nabla_\theta \pi_\theta(a \mid \bar{s}) \, R_w(\bar{s}, a) \, da \, d\bar{s} \\
&\quad + \int_S \int_A d^\pi_\gamma(\bar{s}) \, \nabla_\theta \pi_\theta(a \mid \bar{s}) \left[ Q^\pi(\bar{s}, a) - R_w(\bar{s}, a) \right] da \, d\bar{s}
\end{aligned}$$
Intuition: we can impose technical conditions on $R_w(\bar{s}, a)$ to ensure that the second term is zero, so that the first term alone equals the policy gradient.
Policy Gradient Theorem: Restrictions on the Critic

The sufficient conditions on $R_w$ are:
• $R_w$ is compatible with the parameterization of the policy $\pi_\theta$ in the sense that:
$$\nabla_w R_w(s, a) = \nabla_\theta \log \pi_\theta(a \mid s) = \frac{1}{\pi_\theta(a \mid s)} \nabla_\theta \pi_\theta(a \mid s)$$
• $w$ has converged to a local minimum of the least-squares objective:
$$\begin{aligned}
\nabla_w \int_S \int_A d^\pi_\gamma(\bar{s}) \, \pi_\theta(a \mid \bar{s}) \, \tfrac{1}{2} \left[ Q^\pi(\bar{s}, a) - R_w(\bar{s}, a) \right]^2 da \, d\bar{s} &= 0 \\
\implies \int_S \int_A d^\pi_\gamma(\bar{s}) \, \pi_\theta(a \mid \bar{s}) \, \nabla_w R_w(\bar{s}, a) \left[ Q^\pi(\bar{s}, a) - R_w(\bar{s}, a) \right] da \, d\bar{s} &= 0 \\
\implies \int_S \int_A d^\pi_\gamma(\bar{s}) \, \nabla_\theta \pi_\theta(a \mid \bar{s}) \left[ Q^\pi(\bar{s}, a) - R_w(\bar{s}, a) \right] da \, d\bar{s} &= 0
\end{aligned}$$
The last line is exactly the condition needed to make the second term on the previous slide vanish.
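Concretely, the compatibility condition forces the critic to be linear in the score function, $R_w(s, a) = \nabla_\theta \log \pi_\theta(a \mid s)^\top w$, and the convergence condition is (approximately) satisfied by a least-squares fit of $w$. The sketch below reuses the hypothetical GaussianPolicy from earlier and assumes on-policy samples with Monte Carlo estimates of $Q^\pi$ (e.g., from the monte_carlo_q sketch); weighting samples uniformly rather than exactly by $d^\pi_\gamma \, \pi_\theta$ is a simplifying assumption.

```python
import numpy as np

def fit_compatible_critic(policy, states, actions, q_estimates, ridge=1e-6):
    """Fit w in R_w(s, a) = grad_theta log pi(a | s)^T w by least squares.

    `q_estimates` are (noisy) Monte Carlo estimates of Q^pi(s, a) for the
    sampled state-action pairs; `ridge` is a small regularizer for stability.
    """
    # Compatible features: the flattened score function for each (s, a) pair.
    feats = np.stack([policy.grad_log_prob(s, a).ravel()
                      for s, a in zip(states, actions)])
    targets = np.asarray(q_estimates)
    # Regularized least squares: w = (F^T F + ridge I)^{-1} F^T Q.
    gram = feats.T @ feats + ridge * np.eye(feats.shape[1])
    return np.linalg.solve(gram, feats.T @ targets)

def compatible_critic_value(policy, w, s, a):
    # R_w(s, a) = grad_theta log pi(a | s)^T w
    return policy.grad_log_prob(s, a).ravel() @ w
```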
Policy Gradient Theorem: Function Approximation Version

Theorem 2 - Policy Gradient with Function Approximation [5]: If $R_w(s, a)$ satisfies the conditions on the previous slide, the policy gradient using the learned reward function is:
$$\nabla_\theta J(\pi_\theta) = \int_S \int_A d^\pi_\gamma(\bar{s}) \, \nabla_\theta \pi_\theta(a \mid \bar{s}) \, R_w(\bar{s}, a) \, da \, d\bar{s}.$$
Policy Gradient Theorem: Recap

• We've shown that the gradient of the policy quality w.r.t. the policy parameters has a simple form.
• We've derived sufficient conditions for an actor-critic algorithm to use the policy gradient theorem.
• We've obtained a necessary functional form for $R_w(s, a)$, since the compatibility condition requires
$$R_w(s, a) = \nabla_\theta \log \pi_\theta(a \mid s)^\top w.$$
Policy Gradient Theorem: Actually Computing the Gradient

• We can estimate the policy gradient in practice using the score-function estimator (a.k.a. REINFORCE):
$$\begin{aligned}
\nabla_\theta J(\pi_\theta) &= \int_S \int_A d^\pi_\gamma(\bar{s}) \, \nabla_\theta \pi_\theta(a \mid \bar{s}) \, R_w(\bar{s}, a) \, da \, d\bar{s} \\
&= \int_S \int_A d^\pi_\gamma(\bar{s}) \, \pi_\theta(a \mid \bar{s}) \, \nabla_\theta \log \pi_\theta(a \mid \bar{s}) \, R_w(\bar{s}, a) \, da \, d\bar{s} \\
&= \int_S \int_A d^\pi_\gamma(\bar{s}) \, \pi_\theta(a \mid \bar{s}) \, \nabla_\theta \log \pi_\theta(a \mid \bar{s}) \, \nabla_\theta \log \pi_\theta(a \mid \bar{s})^\top w \, da \, d\bar{s}
\end{aligned}$$
• We can approximate the necessary integrals using multiple trajectories $\tau_{0:t}$ collected under the current policy $\pi_\theta$.
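A minimal sketch of this estimator under the hypothetical interfaces above: states are effectively sampled from $d^\pi_\gamma$ by weighting step $k$ with $(1-\gamma)\gamma^k$, actions come from $\pi_\theta$, and the compatible critic supplies $R_w(s, a)$.

```python
import numpy as np

def policy_gradient_estimate(env, policy, w, gamma=0.95, horizon=200, n_rollouts=50):
    """Score-function (REINFORCE-style) estimate of grad_theta J(pi_theta),
    using the compatible critic R_w(s, a) = grad_theta log pi(a | s)^T w.
    """
    grad = np.zeros_like(policy.theta)
    for _ in range(n_rollouts):
        s = env.reset()                              # s_0 ~ d_0
        weight = 1.0 - gamma                         # (1 - gamma) * gamma^k
        for _ in range(horizon):
            a = policy.sample(s)                     # a ~ pi_theta(a | s)
            score = policy.grad_log_prob(s, a)       # grad_theta log pi(a | s)
            critic = score.ravel() @ w               # R_w(s, a)
            grad += weight * score * critic
            s, _ = env.step(s, a)
            weight *= gamma
    return grad / n_rollouts

# A single gradient-ascent step on the policy parameters might then look like:
# policy.theta += step_size * policy_gradient_estimate(env, policy, w)
```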