Standard and Natural Policy Gradients for Discounted Rewards
Aaron Mishkin
August 8, 2020
UBC MLRG 2018W1
Motivating Example: Humanoid Robot Control

Consider learning a control model for a robotic arm that plays table tennis.

[Image source: https://static.independent.co.uk/s3fs-public/thumbnails/image/2014/03/11/15/ping-pongv2.jpg?w968]
Why Policy Gradients?

Policy gradients have several advantages:
• Policy gradients permit explicit policies with complex parameterizations.
• Such policies are easily defined for continuous state and action spaces.
• Policy gradient approaches are guaranteed to converge under standard assumptions, while greedy methods (SARSA, Q-learning, etc.) are not.
Roadmap

• Background and Notation
• The Policy Gradient Theorem
• Natural Policy Gradients
Background and Notation
Markov Decision Processes (MDPs)

A discrete-time MDP is specified by the tuple $\{S, A, d_0, f, r\}$:
• States are $s \in S$; actions are $a \in A$.
• $f$ is the transition distribution. It satisfies the Markov property:
$$f(s_t, a_t, s_{t+1}) = p(s_{t+1} \mid s_0, a_0, \ldots, s_t, a_t) = p(s_{t+1} \mid s_t, a_t)$$
• $d_0(s_0)$ is the initial distribution over states.
• $r(s_t, a_t, s_{t+1})$ is the reward function, which may be deterministic or stochastic.
• Trajectories are sequences of state-action pairs: $\tau_{0:t} = \{(s_0, a_0), \ldots, (s_t, a_t)\}$.
We treat states $s$ as fully observable.
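To make the notation concrete, here is a minimal sketch (not from the slides) of the MDP tuple $\{S, A, d_0, f, r\}$ for a continuous-state, continuous-action problem. The class and method names (LinearGaussianMDP, reset, step) and the linear-Gaussian dynamics are illustrative assumptions only.

```python
import numpy as np

# Minimal sketch of the MDP tuple {S, A, d_0, f, r} with continuous states
# and actions. The linear-Gaussian dynamics are an assumption for illustration.
class LinearGaussianMDP:
    def __init__(self, A_dyn, B_dyn, noise_std=0.1, seed=0):
        self.A_dyn = A_dyn          # state transition matrix
        self.B_dyn = B_dyn          # action effect matrix
        self.noise_std = noise_std  # stochasticity of f(s' | s, a)
        self.rng = np.random.default_rng(seed)

    def reset(self):
        # Sample s_0 ~ d_0: here, a standard normal over states.
        return self.rng.normal(size=self.A_dyn.shape[0])

    def step(self, s, a):
        # Sample s' ~ f(. | s, a): linear dynamics plus Gaussian noise.
        s_next = self.A_dyn @ s + self.B_dyn @ a \
                 + self.noise_std * self.rng.normal(size=s.shape)
        # Deterministic reward r(s, a, s'): penalize distance from the origin
        # and large motor commands.
        reward = -np.sum(s_next ** 2) - 0.01 * np.sum(a ** 2)
        return s_next, reward
```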
Continuous State and Action Spaces

We will consider MDPs with continuous state and action spaces. In the robot control example:
• $s \in S$ is a real vector describing the configuration of the robotic arm's movement system and the state of the environment.
• $a \in A$ is a real vector representing a motor command to the arm.
• Given action $a$ in state $s$, the probability of being in a region of state space $S' \subseteq S$ is:
$$P(s' \in S' \mid s, a) = \int_{S'} p(s' \mid s, a) \, ds'$$
Future states $s'$ are only known probabilistically because our control and physical models are approximations.
Policies

A policy defines how an agent acts in the MDP:
• A policy $\pi : S \times A \to [0, \infty)$ is the conditional density function
$$\pi(a \mid s) := \text{probability density of taking action } a \text{ in state } s.$$
• The policy is deterministic when $\pi(a \mid s)$ is a Dirac delta function.
• Actions are chosen by sampling from the policy: $a \sim \pi(a \mid s)$.
• The quality of a policy is given by an objective function $J(\pi)$.
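For continuous action spaces, a standard explicit parameterization is a Gaussian policy whose mean depends on the state. The sketch below is an illustrative assumption, not the policy class used in the slides; the names GaussianPolicy, sample, and grad_log_prob are hypothetical.

```python
import numpy as np

# Gaussian policy pi_theta(a | s) = N(a; theta @ s, sigma^2 I).
# theta is the parameter matrix; sigma is held fixed for simplicity.
class GaussianPolicy:
    def __init__(self, theta, sigma=0.5, seed=0):
        self.theta = theta              # shape: (action_dim, state_dim)
        self.sigma = sigma
        self.rng = np.random.default_rng(seed)

    def mean(self, s):
        return self.theta @ s

    def sample(self, s):
        # a ~ pi_theta(a | s)
        return self.mean(s) + self.sigma * self.rng.normal(size=self.theta.shape[0])

    def grad_log_prob(self, s, a):
        # grad_theta log pi_theta(a | s) for the Gaussian mean parameterization:
        # (a - theta s) s^T / sigma^2.
        return np.outer(a - self.mean(s), s) / self.sigma ** 2
```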
Bellman Equations

We consider discounted returns with factor $\gamma \in [0, 1]$. The Bellman equations describe the quality of a policy recursively:
$$Q^\pi(s, a) := \int_S f(s' \mid s, a) \left[ r(s, a, s') + \gamma \int_A \pi(a' \mid s') \, Q^\pi(s', a') \, da' \right] ds'$$
$$\begin{aligned}
V^\pi(s) &:= \int_A \pi(a \mid s) \, Q^\pi(s, a) \, da \\
&= \int_A \pi(a \mid s) \int_S f(s' \mid s, a) \left[ r(s, a, s') + \gamma V^\pi(s') \right] ds' \, da \\
&= \int_A \pi(a \mid s) \int_S f(s' \mid s, a) \, r(s, a, s') \, ds' \, da \\
&\quad + \int_A \pi(a \mid s) \int_S f(s' \mid s, a) \, \gamma V^\pi(s') \, ds' \, da
\end{aligned}$$
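The recursion can be checked against a direct Monte Carlo estimate of $Q^\pi(s, a)$: take $a$ in $s$, then follow $\pi$ and average the truncated discounted returns ($V^\pi(s)$ is then the policy-average of these values). The sketch below assumes the hypothetical environment and policy interfaces from the earlier examples.

```python
import numpy as np

def monte_carlo_q(env, policy, s, a, gamma=0.95, horizon=200, n_rollouts=100):
    """Monte Carlo estimate of Q^pi(s, a): take action a in state s, then
    follow pi and average the truncated discounted returns.

    Assumes `env.step(s, a)` returns (s_next, reward) and `policy.sample(s)`
    returns an action, as in the hypothetical sketches above.
    """
    returns = []
    for _ in range(n_rollouts):
        s_t, r = env.step(s, a)            # first transition uses the given action
        total, discount = r, gamma
        for _ in range(horizon - 1):
            a_t = policy.sample(s_t)       # thereafter, a_t ~ pi(a | s_t)
            s_t, r = env.step(s_t, a_t)
            total += discount * r
            discount *= gamma
        returns.append(total)
    return float(np.mean(returns))
```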
Actor-Critic Methods

Three major flavors of reinforcement learning:
1. Critic-only methods: learn an approximation of the state-action reward function, $R(s, a) \approx Q^\pi(s, a)$.
2. Actor-only methods: learn the policy $\pi$ directly from observed rewards. A parametric policy $\pi_\theta$ can be optimized by ascending the policy gradient:
$$\nabla_\theta J(\pi_\theta) = \frac{\partial J(\pi_\theta)}{\partial \pi_\theta} \frac{\partial \pi_\theta}{\partial \theta}$$
3. Actor-critic methods: learn an approximation of the reward function $R(s, a)$ jointly with the policy $\pi(a \mid s)$.
Value of a Policy

We can use the Bellman equations to write the overall quality of the policy:
$$\begin{aligned}
\frac{J(\pi)}{1 - \gamma} &= \int_S d_0(s_0) \, V^\pi(s_0) \, ds_0 \\
&= \sum_{k=0}^{\infty} \int_S \int_A \int_S p(s_k = \bar{s}) \, \pi(a_k \mid \bar{s}) \, f(s_{k+1} \mid \bar{s}, a_k) \, \gamma^k \, r(\bar{s}, a_k, s_{k+1}) \, ds_{k+1} \, da \, d\bar{s} \\
&= \sum_{k=0}^{\infty} \gamma^k \int_S \int_A \int_S p(s_k = \bar{s}) \, \pi(a_k \mid \bar{s}) \, f(s_{k+1} \mid \bar{s}, a_k) \, r(\bar{s}, a_k, s_{k+1}) \, ds_{k+1} \, da \, d\bar{s}
\end{aligned}$$
Define the "discounted state" distribution:
$$d^\pi_\gamma(\bar{s}) := (1 - \gamma) \sum_{k=0}^{\infty} \gamma^k \, p(s_k = \bar{s})$$
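In practice, $J(\pi)$ can be estimated by sampling start states from $d_0$ and rolling out the policy; the $(1-\gamma)\gamma^k$ weighting of step $k$ is exactly how $d^\pi_\gamma$ arises. A sketch under the same hypothetical interfaces as above:

```python
import numpy as np

def estimate_discounted_return(env, policy, gamma=0.95, horizon=200, n_rollouts=100):
    """Monte Carlo estimate of J(pi) = (1 - gamma) * E_{d_0}[V^pi(s_0)].

    The (1 - gamma) factor matches the normalization of d^pi_gamma
    used in the slides.
    """
    total = 0.0
    for _ in range(n_rollouts):
        s = env.reset()                      # s_0 ~ d_0
        discount = 1.0
        for _ in range(horizon):
            a = policy.sample(s)             # a ~ pi(a | s)
            s, r = env.step(s, a)
            total += discount * r
            discount *= gamma
    return (1.0 - gamma) * total / n_rollouts
```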
Value of a Policy: Discounted Return

The final expression for the overall quality of the policy is the discounted return:
$$J(\pi) = \int_S \int_A \int_S d^\pi_\gamma(\bar{s}) \, \pi(a \mid \bar{s}) \, f(s' \mid \bar{s}, a) \, r(\bar{s}, a, s') \, ds' \, da \, d\bar{s}$$
Assuming that the policy is parameterized by $\theta$, how can we compute the policy gradient $\nabla_\theta J(\pi_\theta)$?
The Policy Gradient Theorem
Policy Gradient Theorem: Statement

Theorem 1 - Policy Gradient [5]: The gradient of the discounted return is:
$$\nabla_\theta J(\pi_\theta) = \int_S \int_A d^\pi_\gamma(\bar{s}) \, \nabla_\theta \pi_\theta(a \mid \bar{s}) \, Q^\pi(\bar{s}, a) \, da \, d\bar{s}$$
Proof: The relationship between the discounted return and the state value function gives us our starting place:
$$\begin{aligned}
\nabla_\theta J(\pi_\theta) &= (1 - \gamma) \, \nabla_\theta \int_S d_0(s_0) \, V^\pi(s_0) \, ds_0 \\
&= (1 - \gamma) \int_S d_0(s_0) \, \nabla_\theta V^\pi(s_0) \, ds_0
\end{aligned}$$
Policy Gradient Theorem: Proof

Consider the gradient of the state value function:
$$\begin{aligned}
\nabla_\theta V^\pi(s) &= \nabla_\theta \int_A \pi_\theta(a \mid s) \, Q^\pi(s, a) \, da \\
&= \int_A \nabla_\theta \pi_\theta(a \mid s) \, Q^\pi(s, a) + \pi_\theta(a \mid s) \, \nabla_\theta Q^\pi(s, a) \, da \\
&= \int_A \nabla_\theta \pi_\theta(a \mid s) \, Q^\pi(s, a) + \pi_\theta(a \mid s) \, \nabla_\theta \int_S f(s' \mid s, a) \left[ r(s, a, s') + \gamma V^\pi(s') \right] ds' \, da \\
&= \int_A \nabla_\theta \pi_\theta(a \mid s) \, Q^\pi(s, a) + \pi_\theta(a \mid s) \, \gamma \int_S f(s' \mid s, a) \, \nabla_\theta V^\pi(s') \, ds' \, da
\end{aligned}$$
This is a recursive expression for the gradient that we can unroll!
Policy Gradient Theorem: Proof Continued

Unrolling the expression from $s_0$ gives:
$$\begin{aligned}
\nabla_\theta V^\pi(s_0) &= \int_A \nabla_\theta \pi_\theta(a_0 \mid s_0) \, Q^\pi(s_0, a_0) \, da_0 \\
&\quad + \int_A \pi_\theta(a_0 \mid s_0) \, \gamma \int_S f(s_1 \mid s_0, a_0) \, \nabla_\theta V^\pi(s_1) \, ds_1 \, da_0 \\
&= \sum_{k=0}^{\infty} \gamma^k \int_S \int_A p(s_k = \bar{s} \mid s_0) \, \nabla_\theta \pi_\theta(a \mid \bar{s}) \, Q^\pi(\bar{s}, a) \, da \, d\bar{s}
\end{aligned}$$
So the policy gradient is given by:
$$\begin{aligned}
\frac{\nabla_\theta J(\pi_\theta)}{1 - \gamma} &= \int_S d_0(s_0) \sum_{k=0}^{\infty} \gamma^k \int_S \int_A p(s_k = \bar{s} \mid s_0) \, \nabla_\theta \pi_\theta(a \mid \bar{s}) \, Q^\pi(\bar{s}, a) \, da \, d\bar{s} \, ds_0 \\
&= \frac{1}{1 - \gamma} \int_S \int_A d^\pi_\gamma(\bar{s}) \, \nabla_\theta \pi_\theta(a \mid \bar{s}) \, Q^\pi(\bar{s}, a) \, da \, d\bar{s},
\end{aligned}$$
so $\nabla_\theta J(\pi_\theta) = \int_S \int_A d^\pi_\gamma(\bar{s}) \, \nabla_\theta \pi_\theta(a \mid \bar{s}) \, Q^\pi(\bar{s}, a) \, da \, d\bar{s}$, as claimed.
Policy Gradient Theorem: Introducing Critics

• However, we generally don't know the state-action reward function $Q^\pi(s, a)$.
• The actor-critic framework suggests learning an approximation $R_w(s, a)$ with parameters $w$.
• Given a fixed policy $\pi_\theta$, we want to minimize the expected least-squares error:
$$w = \operatorname*{argmin}_{w} \int_S \int_A d^\pi_\gamma(\bar{s}) \, \pi_\theta(a \mid \bar{s}) \, \tfrac{1}{2} \left[ Q^\pi(\bar{s}, a) - R_w(\bar{s}, a) \right]^2 da \, d\bar{s}$$
• Can we show that the policy gradient theorem holds for a reward function learned this way?
Policy Gradient Theorem: The Way Forward

Let's rewrite the policy gradient theorem to use our approximate reward function:
$$\begin{aligned}
\nabla_\theta J(\pi_\theta) &= \int_S \int_A d^\pi_\gamma(\bar{s}) \, \nabla_\theta \pi_\theta(a \mid \bar{s}) \, Q^\pi(\bar{s}, a) \, da \, d\bar{s} \\
&= \int_S \int_A d^\pi_\gamma(\bar{s}) \, \nabla_\theta \pi_\theta(a \mid \bar{s}) \left[ R_w(\bar{s}, a) + Q^\pi(\bar{s}, a) - R_w(\bar{s}, a) \right] da \, d\bar{s} \\
&= \int_S \int_A d^\pi_\gamma(\bar{s}) \, \nabla_\theta \pi_\theta(a \mid \bar{s}) \, R_w(\bar{s}, a) \, da \, d\bar{s} \\
&\quad + \int_S \int_A d^\pi_\gamma(\bar{s}) \, \nabla_\theta \pi_\theta(a \mid \bar{s}) \left[ Q^\pi(\bar{s}, a) - R_w(\bar{s}, a) \right] da \, d\bar{s}
\end{aligned}$$
Intuition: we can impose technical conditions on $R_w(\bar{s}, a)$ to ensure that the second term is zero, so that the first term alone equals the policy gradient.
Policy Gradient Theorem: Restrictions on the Critic

The sufficient conditions on $R_w$ are:
• $R_w$ is compatible with the parameterization of the policy $\pi_\theta$ in the sense that:
$$\nabla_w R_w(s, a) = \nabla_\theta \log \pi_\theta(a \mid s) = \frac{1}{\pi_\theta(a \mid s)} \nabla_\theta \pi_\theta(a \mid s)$$
• $w$ has converged to a local minimum of the least-squares objective:
$$\begin{aligned}
\nabla_w \int_S \int_A d^\pi_\gamma(\bar{s}) \, \pi_\theta(a \mid \bar{s}) \, \tfrac{1}{2} \left[ Q^\pi(\bar{s}, a) - R_w(\bar{s}, a) \right]^2 da \, d\bar{s} &= 0 \\
\implies \int_S \int_A d^\pi_\gamma(\bar{s}) \, \pi_\theta(a \mid \bar{s}) \, \nabla_w R_w(\bar{s}, a) \left[ Q^\pi(\bar{s}, a) - R_w(\bar{s}, a) \right] da \, d\bar{s} &= 0 \\
\implies \int_S \int_A d^\pi_\gamma(\bar{s}) \, \nabla_\theta \pi_\theta(a \mid \bar{s}) \left[ Q^\pi(\bar{s}, a) - R_w(\bar{s}, a) \right] da \, d\bar{s} &= 0
\end{aligned}$$
The last line is exactly the condition needed to make the second term on the previous slide vanish.
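Concretely, the compatibility condition forces the critic to be linear in the score function, $R_w(s, a) = \nabla_\theta \log \pi_\theta(a \mid s)^\top w$, and the convergence condition is (approximately) satisfied by a least-squares fit of $w$. The sketch below reuses the hypothetical GaussianPolicy from earlier and assumes on-policy samples with Monte Carlo estimates of $Q^\pi$ (e.g., from the monte_carlo_q sketch); weighting samples uniformly rather than exactly by $d^\pi_\gamma \, \pi_\theta$ is a simplifying assumption.

```python
import numpy as np

def fit_compatible_critic(policy, states, actions, q_estimates, ridge=1e-6):
    """Fit w in R_w(s, a) = grad_theta log pi(a | s)^T w by least squares.

    `q_estimates` are (noisy) Monte Carlo estimates of Q^pi(s, a) for the
    sampled state-action pairs; `ridge` is a small regularizer for stability.
    """
    # Compatible features: the flattened score function for each (s, a) pair.
    feats = np.stack([policy.grad_log_prob(s, a).ravel()
                      for s, a in zip(states, actions)])
    targets = np.asarray(q_estimates)
    # Regularized least squares: w = (F^T F + ridge I)^{-1} F^T Q.
    gram = feats.T @ feats + ridge * np.eye(feats.shape[1])
    return np.linalg.solve(gram, feats.T @ targets)

def compatible_critic_value(policy, w, s, a):
    # R_w(s, a) = grad_theta log pi(a | s)^T w
    return policy.grad_log_prob(s, a).ravel() @ w
```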
Policy Gradient Theorem: Function Approximation Version

Theorem 2 - Policy Gradient with Function Approximation [5]: If $R_w(s, a)$ satisfies the conditions on the previous slide, the policy gradient using the learned reward function is:
$$\nabla_\theta J(\pi_\theta) = \int_S \int_A d^\pi_\gamma(\bar{s}) \, \nabla_\theta \pi_\theta(a \mid \bar{s}) \, R_w(\bar{s}, a) \, da \, d\bar{s}.$$
Policy Gradient Theorem: Recap

• We've shown that the gradient of the policy quality w.r.t. the policy parameters has a simple form.
• We've derived sufficient conditions for an actor-critic algorithm to use the policy gradient theorem.
• We've obtained a necessary functional form for $R_w(s, a)$, since the compatibility condition requires
$$R_w(s, a) = \nabla_\theta \log \pi_\theta(a \mid s)^\top w.$$
Policy Gradient Theorem: Actually Computing the Gradient

• We can estimate the policy gradient in practice using the score-function estimator (a.k.a. REINFORCE):
$$\begin{aligned}
\nabla_\theta J(\pi_\theta) &= \int_S \int_A d^\pi_\gamma(\bar{s}) \, \nabla_\theta \pi_\theta(a \mid \bar{s}) \, R_w(\bar{s}, a) \, da \, d\bar{s} \\
&= \int_S \int_A d^\pi_\gamma(\bar{s}) \, \pi_\theta(a \mid \bar{s}) \, \nabla_\theta \log \pi_\theta(a \mid \bar{s}) \, R_w(\bar{s}, a) \, da \, d\bar{s} \\
&= \int_S \int_A d^\pi_\gamma(\bar{s}) \, \pi_\theta(a \mid \bar{s}) \, \nabla_\theta \log \pi_\theta(a \mid \bar{s}) \, \nabla_\theta \log \pi_\theta(a \mid \bar{s})^\top w \, da \, d\bar{s}
\end{aligned}$$
• We can approximate the necessary integrals using multiple trajectories $\tau_{0:t}$ collected under the current policy $\pi_\theta$.
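A minimal sketch of this estimator under the hypothetical interfaces above: states are effectively sampled from $d^\pi_\gamma$ by weighting step $k$ with $(1-\gamma)\gamma^k$, actions come from $\pi_\theta$, and the compatible critic supplies $R_w(s, a)$.

```python
import numpy as np

def policy_gradient_estimate(env, policy, w, gamma=0.95, horizon=200, n_rollouts=50):
    """Score-function (REINFORCE-style) estimate of grad_theta J(pi_theta),
    using the compatible critic R_w(s, a) = grad_theta log pi(a | s)^T w.
    """
    grad = np.zeros_like(policy.theta)
    for _ in range(n_rollouts):
        s = env.reset()                              # s_0 ~ d_0
        weight = 1.0 - gamma                         # (1 - gamma) * gamma^k
        for _ in range(horizon):
            a = policy.sample(s)                     # a ~ pi_theta(a | s)
            score = policy.grad_log_prob(s, a)       # grad_theta log pi(a | s)
            critic = score.ravel() @ w               # R_w(s, a)
            grad += weight * score * critic
            s, _ = env.step(s, a)
            weight *= gamma
    return grad / n_rollouts

# A single gradient-ascent step on the policy parameters might then look like:
# policy.theta += step_size * policy_gradient_estimate(env, policy, w)
```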