Carnegie Mellon School of Computer Science
Deep Reinforcement Learning and Control
Pathwise Derivatives, DDPG, Multigoal RL
Katerina Fragkiadaki
Part of the slides on pathwise derivatives adapted from John Schulman.
Computing Gradients of Expectations
When the variable w.r.t. which we are differentiating appears in the distribution (likelihood ratio / score function gradient estimator):
$$\nabla_\theta \mathbb{E}_{x \sim p(\cdot|\theta)}[F(x)] = \mathbb{E}_{x \sim p(\cdot|\theta)}[\nabla_\theta \log p(x|\theta)\, F(x)]$$
e.g. $\nabla_\theta \mathbb{E}_{a \sim \pi_\theta}[R(a, s)]$.
When the variable w.r.t. which we are differentiating appears inside the expectation (pathwise derivative):
$$\nabla_\theta \mathbb{E}_{z \sim \mathcal{N}(0,1)}[F(x(\theta), z)] = \mathbb{E}_{z \sim \mathcal{N}(0,1)}[\nabla_\theta F(x(\theta), z)] = \mathbb{E}_{z \sim \mathcal{N}(0,1)}\left[\frac{dF(x(\theta), z)}{dx}\frac{dx}{d\theta}\right]$$
Re-parametrization trick: for some distributions $p(x|\theta)$ we can switch from one gradient estimator to the other. Why would we want to do so?
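To make the contrast concrete, here is a small numerical sketch (not from the slides; the integrand $F(x) = x^2$ and the distribution $x \sim \mathcal{N}(\theta, 1)$ are assumed purely for illustration) that estimates the same gradient, whose true value is $2\theta$, with both estimators. The variance printed at the end is one answer to the question above.

```python
# Hypothetical example: estimate d/dtheta E_{x~N(theta,1)}[x^2] = 2*theta two ways.
import numpy as np

rng = np.random.default_rng(0)
theta, n = 1.5, 100_000
z = rng.standard_normal(n)
x = theta + z                                  # reparametrized samples x ~ N(theta, 1)

# Likelihood ratio / score function: grad_theta log N(x; theta, 1) = (x - theta)
lr_samples = (x - theta) * x**2
# Pathwise: F(x(theta, z)) = (theta + z)^2, so dF/dtheta = 2 * (theta + z) = 2 * x
pw_samples = 2 * x

print(lr_samples.mean(), pw_samples.mean())    # both approach the true gradient 3.0
print(lr_samples.std(), pw_samples.std())      # the pathwise estimator has far lower variance
```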
Known MDP
[Figure: stochastic computation graph of the MDP. The parameters $\theta$ feed the policy at every step; states $s_0, s_1, \ldots$ produce actions $a_0, a_1, \ldots$ through $\pi_\theta(s)$, the transition function $T(s, a)$ produces the next state, and rewards $r_0, r_1, \ldots$ come from $\rho(s, a)$.]
Legend: a deterministic node (deterministic computation node) is a deterministic function of its input; a stochastic node is sampled based on its input (which parametrizes the distribution to sample from).
Reward and dynamics are known.
Known MDP: let's make it simpler
[Figure: single-step graph. $\theta$ and $s_0$ produce the action $a_0 = \pi_\theta(s_0)$, which yields the reward $r_0 = \rho(s_0, a_0)$.]
I want to learn $\theta$ to maximize the reward obtained.
What if the policy is deterministic?
With a deterministic policy $a = \pi_\theta(s)$ and a known reward $\rho(s, a)$, I can compute the gradient with backpropagation:
$$\nabla_\theta \rho(s, a) = \frac{\partial \rho}{\partial a}\frac{\partial \pi_\theta(s)}{\partial \theta}$$
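A minimal sketch of this backpropagation path, assuming a hypothetical linear policy and a hand-made differentiable reward (neither is from the slides); the point is only that autodiff gives $\frac{\partial\rho}{\partial a}\frac{\partial\pi_\theta}{\partial\theta}$ for free.

```python
import torch

theta = torch.randn(4, requires_grad=True)       # policy parameters
s = torch.tensor([0.3, -1.2, 0.7, 0.1])          # a fixed state

def pi(s, theta):                                 # deterministic policy (assumed linear)
    return torch.dot(theta, s)

def rho(s, a):                                    # known, differentiable reward (assumed)
    return -(a - 2.0) ** 2 + s.sum()

a = pi(s, theta)
r = rho(s, a)
r.backward()                                      # chain rule: d rho/d a * d pi_theta/d theta
print(theta.grad)
```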
What if the policy is stochastic?
With a stochastic policy $\pi_\theta(s, a)$, use the likelihood ratio estimator, which works for both continuous and discrete actions:
$$\mathbb{E}_a[\nabla_\theta \log \pi_\theta(s, a)\, \rho(s, a)]$$
I want to learn $\theta$ to maximize the reward obtained.
Policies are parametrized Gaussians
$a \sim \mathcal{N}(\mu(s, \theta), \Sigma(s, \theta))$, where the policy network $\pi_\theta(s)$ outputs $\mu_\theta(s)$ and $\sigma_\theta(s)$.
Likelihood ratio gradient: $\mathbb{E}_a[\nabla_\theta \log \pi_\theta(s, a)\, \rho(s, a)]$
If $\sigma^2$ is constant:
$$\nabla_\theta \log \pi_\theta(s, a) = \frac{a - \mu(s; \theta)}{\sigma^2}\frac{\partial \mu(s; \theta)}{\partial \theta}$$
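A sketch of the likelihood-ratio gradient for this fixed-variance Gaussian case, with an assumed linear mean $\mu(s;\theta) = \theta^\top s$ and a hypothetical reward; note that the reward is only evaluated, never differentiated.

```python
import torch

torch.manual_seed(0)
theta = torch.randn(4)
s = torch.tensor([0.3, -1.2, 0.7, 0.1])
sigma = 0.5
N = 10_000

mu = torch.dot(theta, s)
a = mu + sigma * torch.randn(N)                    # a ~ N(mu, sigma^2)
rewards = -(a - 2.0) ** 2                          # hypothetical reward rho(s, a)

# grad_theta log pi(s, a) = (a - mu) / sigma^2 * d mu/d theta, with d mu/d theta = s
score = ((a - mu) / sigma ** 2).unsqueeze(1) * s   # shape (N, 4)
grad_estimate = (score * rewards.unsqueeze(1)).mean(dim=0)
print(grad_estimate)
```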
Re-parametrization for Gaussian
$$a = \mu(s, \theta) + z \odot \sigma(s, \theta), \quad z \sim \mathcal{N}(0, I)$$
[Figure: the single-step graph from before, with the stochastic action node replaced by a deterministic function of $\mu_\theta(s)$, $\sigma_\theta(s)$ and an external noise node $z$.]
Re-parametrization for Gaussian
$\mathbb{E}(\mu + z\sigma) = \mu$, $\mathrm{Var}(\mu + z\sigma) = \sigma^2$ (isotropic case).
$$a = \mu(s, \theta) + z \odot \sigma(s, \theta), \quad z \sim \mathcal{N}(0, I), \qquad \frac{da}{d\theta} = \frac{d\mu(s, \theta)}{d\theta} + z \odot \frac{d\sigma(s, \theta)}{d\theta}$$
$$\nabla_\theta \mathbb{E}_z[\rho(a(\theta, z), s)] = \mathbb{E}_z\left[\frac{d\rho(a(\theta, z), s)}{da}\frac{da(\theta, z)}{d\theta}\right]$$
Sample estimate:
$$\nabla_\theta \frac{1}{N}\sum_{i=1}^{N}\rho(a(\theta, z_i), s) = \frac{1}{N}\sum_{i=1}^{N}\frac{d\rho(a(\theta, z), s)}{da}\frac{da(\theta, z)}{d\theta}\Big|_{z = z_i}$$
Re-parametrization for Gaussian
For a general covariance $\Sigma = LL^\top$: $a = \mu(s, \theta) + Lz$.
Pathwise derivative: $\mathbb{E}_z\left[\frac{d\rho(a(\theta, z), s)}{da}\frac{da(\theta, z)}{d\theta}\right]$ — the pathwise derivative uses the derivative of the reward w.r.t. the action!
Likelihood ratio gradient estimator: $\mathbb{E}_a[\nabla_\theta \log \pi_\theta(s, a)\, \rho(s, a)]$ — does not.
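A minimal sketch of the sample estimate above in PyTorch (the pieces are stand-ins: an assumed linear mean, a state-independent log-sigma, and a hypothetical quadratic reward); autograd supplies $\frac{d\rho}{da}\frac{da}{d\theta}$ once the action is written as $\mu + z \odot \sigma$.

```python
import torch

torch.manual_seed(0)
W = torch.randn(2, 4, requires_grad=True)          # parameters of mu(s, theta) (assumed linear)
log_sigma = torch.zeros(2, requires_grad=True)     # parameters of sigma(s, theta)
s = torch.tensor([0.3, -1.2, 0.7, 0.1])
target = torch.tensor([1.0, -1.0])                 # used by the hypothetical reward below

N = 1024
z = torch.randn(N, 2)                              # z ~ N(0, I), parameter-free noise
a = W @ s + z * log_sigma.exp()                    # a(theta, z) = mu + z * sigma, shape (N, 2)
rho = -((a - target) ** 2).sum(dim=1)              # differentiable reward rho(s, a)

loss = -rho.mean()                                 # maximize the Monte-Carlo estimate
loss.backward()                                    # gradients flow through a into W, log_sigma
print(W.grad, log_sigma.grad)
```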
Policies are parametrized Categorical distributions
For discrete actions, $\pi_\theta(s, a)$ is a categorical distribution over actions; I still want to learn $\theta$ to maximize the reward obtained.
Likelihood ratio estimator: $\mathbb{E}_a[\nabla_\theta \log \pi_\theta(s, a)\, \rho(s, a)]$
Re-parametrization for categorical distributions
Consider a variable $y$ following the $K$-categorical distribution:
$$y_k \sim \frac{\exp((\log p_k)/\tau)}{\sum_{j=1}^{K}\exp((\log p_j)/\tau)}$$
Categorical Reparametrization with Gumbel-Softmax, Jang et al., 2017.
Re-parametrization trick for categorical distributions
Consider a variable $y$ following the $K$-categorical distribution:
$$y_k \sim \frac{\exp((\log p_k)/\tau)}{\sum_{j=1}^{K}\exp((\log p_j)/\tau)}$$
Re-parametrization (Gumbel-max trick):
$$a = \arg\max_k (\log p_k + \epsilon_k), \quad \epsilon_k = -\log(-\log(u_k)), \quad u_k \sim \mathcal{U}[0, 1]$$
Categorical Reparametrization with Gumbel-Softmax, Jang et al., 2017.
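A quick numerical check of the Gumbel-max re-parametrization above, using assumed class probabilities; the empirical frequencies of the argmax samples should match $p$.

```python
import torch

torch.manual_seed(0)
p = torch.tensor([0.1, 0.6, 0.3])                # assumed class probabilities
u = torch.rand(100_000, 3)
eps = -torch.log(-torch.log(u))                  # Gumbel(0, 1) noise
samples = torch.argmax(torch.log(p) + eps, dim=-1)
print(torch.bincount(samples, minlength=3) / samples.numel())   # approx [0.1, 0.6, 0.3]
```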
Re-parametrization trick for categorical distributions
Re-parametrization: $a = \arg\max_k (\log p_k + \epsilon_k), \quad \epsilon_k = -\log(-\log(u_k)), \quad u_k \sim \mathcal{U}[0, 1]$
In the forward pass you sample from the parametrized distribution: $c \sim G(\log p)$ (a hard, one-hot sample of $a_k$).
In the backward pass you use the soft (Gumbel-Softmax) distribution: $\frac{dc}{d\theta} = \frac{dG}{dp}\frac{dp}{d\theta}$
Categorical Reparametrization with Gumbel-Softmax, Jang et al., 2017.
Back-propagating through discrete variables
For binary neurons: the straight-through (sigmoidal) estimator — the forward pass uses the hard binary sample, the backward pass uses the sigmoid's gradient.
For general categorically distributed neurons: the forward pass uses the hard categorical sample, the backward pass uses the soft Gumbel-Softmax relaxation (previous slide).
[Figures: forward vs. backward pass for both cases.]
http://r2rt.com/binary-stochastic-neurons-in-tensorflow.html
Categorical Reparametrization with Gumbel-Softmax, Jang et al., 2017.
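Below is a sketch of the straight-through Gumbel-Softmax version of this idea (hard one-hot sample in the forward pass, soft gradients in the backward pass); the logits, temperature, and downstream "reward" weights are assumed for illustration. PyTorch ships a similar utility as torch.nn.functional.gumbel_softmax(..., hard=True).

```python
import torch
import torch.nn.functional as F

def gumbel_softmax_st(logits, tau=1.0):
    gumbels = -torch.log(-torch.log(torch.rand_like(logits)))     # Gumbel(0, 1) noise
    y_soft = F.softmax((logits + gumbels) / tau, dim=-1)          # soft sample, differentiable
    index = y_soft.argmax(dim=-1, keepdim=True)
    y_hard = torch.zeros_like(y_soft).scatter_(-1, index, 1.0)    # hard one-hot sample
    # Straight-through: forward value is y_hard, backward gradient is that of y_soft.
    return y_hard + y_soft - y_soft.detach()

logits = torch.randn(3, requires_grad=True)
a = gumbel_softmax_st(logits, tau=0.5)
reward = (a * torch.tensor([1.0, 2.0, 3.0])).sum()   # hypothetical reward per discrete action
reward.backward()
print(a, logits.grad)                                # gradients reach the logits despite the discrete sample
```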
Re-parametrized Policy Gradients
Episodic MDP:
[Figure: computation graph in which $\theta$ feeds the actions $a_1, \ldots, a_T$, the states $s_1, \ldots, s_T$ evolve from them, and the trajectory produces the return $R_T$.]
We want to compute: $\nabla_\theta \mathbb{E}[R_T]$
Re-parametrized Policy Gradients
Episodic MDP: we want to compute $\nabla_\theta \mathbb{E}[R_T]$.
Reparameterize: $a_t = \pi(s_t, z_t; \theta)$, where $z_t$ is noise from a fixed distribution.
[Figure: the same graph with noise nodes $z_1, \ldots, z_T$ feeding the now-deterministic action nodes.]
Re-parametrized Policy Gradients
Reparameterize: $a_t = \pi(s_t, z_t; \theta)$, where $z_t$ is noise from a fixed distribution.
For the pathwise derivative to work, we need the transition dynamics and the reward function to be known.
Re-parametrized Policy Gradients
$$\frac{d}{d\theta}\mathbb{E}[R_T] = \mathbb{E}\left[\sum_{t=1}^{T}\frac{dR_T}{da_t}\frac{da_t}{d\theta}\right] = \mathbb{E}\left[\sum_{t=1}^{T}\frac{d\,\mathbb{E}[R_T \mid a_t]}{da_t}\frac{da_t}{d\theta}\right]$$
For the pathwise derivative to work, we need the transition dynamics and the reward function to be known, or…
Re-parametrized Policy Gradients
$$\frac{d}{d\theta}\mathbb{E}[R_T] = \mathbb{E}\left[\sum_{t=1}^{T}\frac{dR_T}{da_t}\frac{da_t}{d\theta}\right] = \mathbb{E}\left[\sum_{t=1}^{T}\frac{d\,\mathbb{E}[R_T \mid a_t]}{da_t}\frac{da_t}{d\theta}\right]$$
$$= \mathbb{E}\left[\sum_{t=1}^{T}\frac{dQ(s_t, a_t)}{da_t}\frac{da_t}{d\theta}\right] = \mathbb{E}\left[\frac{d}{d\theta}\sum_{t=1}^{T} Q(s_t, \pi(s_t, z_t; \theta))\right]$$
Learn $Q_\phi$ to approximate $Q^{\pi, \gamma}$, and use it to compute gradient estimates.
N. Heess, G. Wayne, D. Silver, et al. "Learning continuous control policies by stochastic value gradients". In: NIPS 2015.
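A minimal sketch of how this gradient is used in practice (SVG(0)/DDPG-style): backpropagate through a learned critic $Q_\phi$ into the policy. The network sizes, the deterministic tanh policy, and the random batch of states are placeholder assumptions; the critic's own TD-learning update is omitted.

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 8, 2
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
pi_opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

states = torch.randn(32, obs_dim)                 # a batch of states (stand-in for replay data)

# Policy update: maximize E[Q_phi(s, pi_theta(s))]. The learned critic replaces the unknown
# reward/dynamics, so gradients flow critic -> action -> policy parameters (dQ/da * da/dtheta).
actions = policy(states)
policy_loss = -critic(torch.cat([states, actions], dim=-1)).mean()
pi_opt.zero_grad()
policy_loss.backward()
pi_opt.step()                                     # only the policy optimizer steps here;
                                                  # the critic is trained separately with a TD loss
```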