  1. Carnegie Mellon School of Computer Science Deep Reinforcement Learning and Control Asynchronous RL CMU 10703 Katerina Fragkiadaki

  2. Non-stationary data problem for Deep RL
  • Stability of training neural networks requires the gradient updates to be decorrelated.
  • This is not the case if data arrives sequentially.
  • Gradient updates computed from one part of the state space can cause the value (Q) function approximator to oscillate.
  • Our solution so far: experience buffers, where experience tuples are mixed and sampled from. The resulting sampled batches are more stationary than the ones encountered online (without a buffer).
  • This limits deep RL to off-policy methods, since data from an older policy are used to update the weights of the value approximator.
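A minimal sketch (not from the slides) of the replay-buffer idea: store transitions as they arrive and sample random minibatches, so consecutive gradient updates are computed from roughly decorrelated data. The class and field names are illustrative.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of (s, a, r, s_next, done) tuples (illustrative sketch)."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # old experience is evicted automatically

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        # Uniform sampling mixes experience from many past policies and states,
        # which decorrelates the minibatch compared to the online stream.
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones
```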

  3. Asynchronous Deep RL
  • Alternative: parallelize the collection of experience and stabilize training without experience buffers!
  • Multiple threads of experience, one per agent, each exploring a different part of the environment and contributing experience tuples.
  • Different exploration strategies (e.g., various ε values) in different threads increase diversity.
  • It can be applied to both on-policy and off-policy methods; it has been applied to SARSA, DQN, and advantage actor-critic.
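A small illustrative sketch (not from the slides) of how each worker thread can get its own exploration rate, so the pool of workers covers different parts of the environment.

```python
import random

def worker_epsilon(worker_id, num_workers, eps_min=0.01, eps_max=0.5):
    """Spread each worker's epsilon evenly between eps_min and eps_max (illustrative choice)."""
    frac = worker_id / max(num_workers - 1, 1)
    return eps_min + frac * (eps_max - eps_min)

def epsilon_greedy(q_values, epsilon):
    """Standard epsilon-greedy action selection over a list of Q-values."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Each of, say, 8 workers explores with a different epsilon:
epsilons = [worker_epsilon(i, 8) for i in range(8)]
```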

  4. Distributed RL

  5. Distributed Asynchronous RL
  • No locking: each worker applies its gradients to the central weights as soon as they are computed.
  • As a result, each worker may have a slightly stale (modified) version of the policy/critic.
  • The actor-critic trained in such an asynchronous way is known as A3C.
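A minimal sketch, assuming a shared parameter vector updated Hogwild-style without locks; the worker, gradient function, and sizes below are placeholders, not the slides' implementation.

```python
import threading
import numpy as np

# Shared actor-critic parameters (placeholder size); no lock protects them.
shared = {"theta": np.zeros(1000)}

def fake_gradient(theta):
    """Stand-in for a worker's actor-critic gradient from its own rollouts."""
    return np.random.randn(*theta.shape) * 0.01

def async_worker(num_updates=100, lr=1e-2):
    for _ in range(num_updates):
        local_theta = shared["theta"].copy()   # may already be slightly stale
        grad = fake_gradient(local_theta)      # gradient w.r.t. the local copy
        shared["theta"] -= lr * grad           # applied lock-free (Hogwild-style)

threads = [threading.Thread(target=async_worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```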

  6. Distributed Synchronous RL
  • All workers have the same actor/critic weights.
  • Gradients of all workers are averaged, and then the central neural net weights are updated.
  • The actor-critic trained in such a synchronous way is known as A2C.
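A minimal sketch of the synchronous variant, assuming the same placeholder gradient function as above: all workers compute gradients from identical weights, the gradients are averaged, and only then are the central weights updated.

```python
import numpy as np

def fake_gradient(theta):
    """Stand-in for one worker's actor-critic gradient from its own rollouts."""
    return np.random.randn(*theta.shape) * 0.01

theta = np.zeros(1000)   # central actor-critic weights (placeholder size)
num_workers, lr = 8, 1e-2

for step in range(100):
    # Every worker starts the step from the SAME weights...
    grads = [fake_gradient(theta) for _ in range(num_workers)]
    # ...the gradients are averaged, and the central weights are updated once.
    theta -= lr * np.mean(grads, axis=0)
```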

  7. A3C
  • Training stabilization without an experience buffer.
  • Use of on-policy methods, e.g., SARSA and policy gradients.
  • Reduction in training time roughly linear in the number of threads.
  What is the approximation used for the advantage? For a rollout segment $s_1, s_2, s_3, s_4$ with rewards $r_1, r_2, r_3$:
  $$R_3 = r_3 + \gamma V(s_4; \theta'_v), \qquad A_3 = R_3 - V(s_3; \theta'_v)$$
  $$R_2 = r_2 + \gamma r_3 + \gamma^2 V(s_4; \theta'_v), \qquad A_2 = R_2 - V(s_2; \theta'_v)$$
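A small sketch of the n-step returns and advantages on this slide, computed by iterating backwards over the segment; the reward and value numbers are made up for illustration.

```python
import numpy as np

def nstep_advantages(rewards, values, bootstrap_value, gamma=0.99):
    """n-step returns R_t = r_t + gamma*r_{t+1} + ... + gamma^{T-t} V(s_{T+1})
    and advantages A_t = R_t - V(s_t), computed backwards over a segment."""
    T = len(rewards)
    returns = np.zeros(T)
    R = bootstrap_value                 # V(s_{T+1}) from the critic
    for t in reversed(range(T)):
        R = rewards[t] + gamma * R      # accumulate the discounted return
        returns[t] = R
    advantages = returns - np.asarray(values[:T])
    return returns, advantages

# Segment from the slide: rewards r_1..r_3, critic values V(s_1)..V(s_3),
# and bootstrap V(s_4).
rewards = [1.0, 0.0, 2.0]
values = [0.5, 0.4, 0.9]
returns, advs = nstep_advantages(rewards, values, bootstrap_value=0.7)
```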

  8. Advantages of Asynchronous (multi-threaded) RL

  9. Carnegie Mellon School of Computer Science Deep Reinforcement Learning and Control Evolutionary Methods CMU 10703 Katerina Fragkiadaki. Part of the slides borrowed from Xi Chen, Pieter Abbeel, John Schulman.

  10. Policy Optimization
  $$\max_\theta J(\theta) = \max_\theta \mathbb{E}\left[ R(\tau) \mid \pi_\theta, \mu_0(s_0) \right], \qquad \tau: \text{a trajectory}$$

  11. Policy Optimization
  $$\max_\theta J(\theta) = \max_\theta \mathbb{E}\left[ R(\tau) \mid \pi_\theta, \mu_0(s_0) \right], \qquad \tau: \text{a trajectory}$$

  12. Policy Optimization and RL
  $$\max_\theta J(\theta) = \max_\theta \mathbb{E}\left[ R(\tau) \mid \pi_\theta, \mu_0(s_0) \right] = \max_\theta \sum_{t=0}^{T} \mathbb{E}\left[ R(s_t) \mid \pi_\theta, \mu_0(s_0) \right]$$

  13.
  $$\max_\theta J(\theta) = \max_\theta \mathbb{E}\left[ R(\tau) \mid \pi_\theta, \mu_0(s_0) \right] = \max_\theta \sum_{t=0}^{T} \mathbb{E}\left[ R(s_t) \mid \pi_\theta, \mu_0(s_0) \right]$$

  14. Evolutionary methods
  $$\max_\theta J(\theta) = \max_\theta \mathbb{E}\left[ R(\tau) \mid \pi_\theta, \mu_0(s_0) \right]$$

  15. Black-box Policy Optimization
  $$\max_\theta J(\theta) = \max_\theta \mathbb{E}\left[ R(\tau) \mid \pi_\theta, \mu_0(s_0) \right]$$
  The objective is treated as a black-box function $\theta \mapsto \mathbb{E}\left[ R(\tau) \right]$: no information regarding the structure of the reward.

  16. Evolutionary methods
  $$\max_\theta J(\theta) = \max_\theta \mathbb{E}\left[ R(\tau) \mid \pi_\theta, \mu_0(s_0) \right]$$
  General algorithm: initialize a population of parameter vectors (genotypes), then
  1. Make random perturbations (mutations) to each parameter vector
  2. Evaluate the perturbed parameter vector (fitness)
  3. Keep the perturbed vector if the result improves (selection)
  4. GOTO 1
  Biologically plausible…
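A tiny sketch of the generic loop above: perturb each member of the population, evaluate, and keep improvements. The fitness function `fitness` is a placeholder for "run the policy and return its total reward"; here it is a toy objective.

```python
import numpy as np

def fitness(theta):
    """Placeholder for 'run the policy with parameters theta, return total reward'."""
    return -np.sum(theta ** 2)  # toy objective: maximizing pushes theta toward 0

def evolve(pop_size=20, dim=10, sigma=0.1, generations=100):
    population = [np.random.randn(dim) for _ in range(pop_size)]  # genotypes
    for _ in range(generations):
        for i, theta in enumerate(population):
            mutant = theta + sigma * np.random.randn(dim)   # mutation
            if fitness(mutant) > fitness(theta):            # fitness evaluation + selection
                population[i] = mutant                      # keep the improvement
    return max(population, key=fitness)

best = evolve()
```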

  17. Cross-entropy method
  Let's consider our parameters to be sampled from a multivariate Gaussian with diagonal covariance. We will evolve this Gaussian towards samples that have the highest fitness.
  CEM:
    Initialize $\mu \in \mathbb{R}^d$, $\sigma \in \mathbb{R}^d_{>0}$
    for iteration = 1, 2, …
      Sample $n$ parameters $\theta_i \sim N(\mu, \mathrm{diag}(\sigma^2))$
      For each $\theta_i$, perform one rollout to get return $R(\tau_i)$
      Select the top $k\%$ of $\theta_i$, fit a new diagonal Gaussian to those samples, and update $\mu, \sigma$
    endfor
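A compact sketch of the CEM pseudocode above. The rollout-return function `evaluate_return` is a placeholder; here it is a toy quadratic.

```python
import numpy as np

def evaluate_return(theta):
    """Placeholder for 'perform one rollout with policy parameters theta'."""
    return -np.sum((theta - 3.0) ** 2)  # toy fitness, maximized at theta = 3

def cem(dim=5, n=50, elite_frac=0.2, iterations=100):
    mu, sigma = np.zeros(dim), np.ones(dim)
    n_elite = int(n * elite_frac)
    for _ in range(iterations):
        thetas = mu + sigma * np.random.randn(n, dim)        # theta_i ~ N(mu, diag(sigma^2))
        returns = np.array([evaluate_return(t) for t in thetas])
        elites = thetas[np.argsort(returns)[-n_elite:]]      # top k% by return
        mu, sigma = elites.mean(axis=0), elites.std(axis=0)  # refit the diagonal Gaussian
    return mu

mu_final = cem()
```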

  18. Covariance Matrix Adaptation
  Let's consider our parameters to be sampled from a multivariate Gaussian $N(\mu_i, C_i)$ with a full covariance matrix. We will evolve this Gaussian towards samples that have the highest fitness.
  • Sample from $N(\mu_i, C_i)$
  • Select elites
  • Update mean
  • Update covariance
  • Iterate

  19.–24. Covariance Matrix Adaptation, illustrated step by step in figures: sample from $N(\mu_i, C_i)$, select elites, update the mean and covariance to obtain $N(\mu_{i+1}, C_{i+1})$, and iterate.
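A simplified sketch of the loop on slides 18–24. This is not the full CMA-ES update (no evolution paths or step-size control); it just refits a full-covariance Gaussian to the elites each generation, using the same placeholder fitness idea as before.

```python
import numpy as np

def evaluate_return(theta):
    """Placeholder rollout return; toy fitness maximized at theta = 3."""
    return -np.sum((theta - 3.0) ** 2)

def simplified_cma(dim=5, n=60, elite_frac=0.25, iterations=100):
    mu = np.zeros(dim)
    C = np.eye(dim)                                  # full covariance, not just a diagonal
    n_elite = int(n * elite_frac)
    for _ in range(iterations):
        thetas = np.random.multivariate_normal(mu, C, size=n)  # sample from N(mu_i, C_i)
        returns = np.array([evaluate_return(t) for t in thetas])
        elites = thetas[np.argsort(returns)[-n_elite:]]        # select elites
        mu = elites.mean(axis=0)                               # update mean
        C = np.cov(elites, rowvar=False) + 1e-6 * np.eye(dim)  # update covariance (regularized)
    return mu, C

mu_final, C_final = simplified_cma()
```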

  25. CMA-ES and CEM work embarrassingly well in low dimensions, e.g., $\mu \in \mathbb{R}^{22}$ [NIPS 2013].

  26. Question • Evolutionary methods work well on relatively low-dim problems • Can they be used to optimize deep network policies?

  27. PG vs. ES: we are sampling in both cases…

  28. Policy Gradients Review
  $$\max_\theta J(\theta) = \mathbb{E}_{\tau \sim P_\theta(\tau)}\left[ R(\tau) \right]$$
  $$\begin{aligned}
  \nabla_\theta J(\theta) &= \nabla_\theta \mathbb{E}_{\tau \sim P_\theta(\tau)}\left[ R(\tau) \right] \\
  &= \nabla_\theta \sum_\tau P_\theta(\tau) R(\tau) \\
  &= \sum_\tau \nabla_\theta P_\theta(\tau) R(\tau) \\
  &= \sum_\tau P_\theta(\tau) \frac{\nabla_\theta P_\theta(\tau)}{P_\theta(\tau)} R(\tau) \\
  &= \sum_\tau P_\theta(\tau) \nabla_\theta \log P_\theta(\tau) R(\tau) \\
  &= \mathbb{E}_{\tau \sim P_\theta(\tau)}\left[ \nabla_\theta \log P_\theta(\tau) R(\tau) \right]
  \end{aligned}$$
  Sample estimate:
  $$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta \log P_\theta(\tau^{(i)}) R(\tau^{(i)})$$
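A hedged sketch of the sample estimate above for a small discrete-action softmax policy, using PyTorch autograd; the environment interaction is stubbed out, and names like `rollout` and all sizes are illustrative.

```python
import torch

# Tiny softmax policy pi_theta(a|s) over 2 actions from a 4-dim state (illustrative sizes).
theta = torch.zeros(4, 2, requires_grad=True)

def rollout(T=10):
    """Stub: returns log P_theta(tau) and a fake trajectory return R(tau)."""
    states = torch.randn(T, 4)                       # stand-in for visited states
    dist = torch.distributions.Categorical(logits=states @ theta)
    actions = dist.sample()
    log_prob_tau = dist.log_prob(actions).sum()      # sum of log pi_theta(a_t|s_t)
    R_tau = torch.randn(()).item()                   # stand-in for the trajectory return
    return log_prob_tau, R_tau

N = 16
# Surrogate whose gradient is (1/N) sum_i grad log P_theta(tau_i) R(tau_i).
surrogate = sum(lp * R for lp, R in (rollout() for _ in range(N))) / N
surrogate.backward()                                 # theta.grad now holds the PG estimate
```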

  29. ES
  Considers a distribution over policy parameters:
  $$\max_\mu U(\mu) = \mathbb{E}_{\theta \sim P_\mu(\theta)}\left[ F(\theta) \right]$$
  $$\begin{aligned}
  \nabla_\mu U(\mu) &= \nabla_\mu \mathbb{E}_{\theta \sim P_\mu(\theta)}\left[ F(\theta) \right] \\
  &= \nabla_\mu \int P_\mu(\theta) F(\theta)\, d\theta \\
  &= \int \nabla_\mu P_\mu(\theta) F(\theta)\, d\theta \\
  &= \int P_\mu(\theta) \frac{\nabla_\mu P_\mu(\theta)}{P_\mu(\theta)} F(\theta)\, d\theta \\
  &= \int P_\mu(\theta) \nabla_\mu \log P_\mu(\theta) F(\theta)\, d\theta \\
  &= \mathbb{E}_{\theta \sim P_\mu(\theta)}\left[ \nabla_\mu \log P_\mu(\theta) F(\theta) \right]
  \end{aligned}$$

  30. ES
  Considers a distribution over policy parameters (same derivation as slide 29):
  $$\nabla_\mu U(\mu) = \mathbb{E}_{\theta \sim P_\mu(\theta)}\left[ \nabla_\mu \log P_\mu(\theta) F(\theta) \right]$$
  Sample estimate:
  $$\nabla_\mu U(\mu) \approx \frac{1}{N} \sum_{i=1}^{N} \nabla_\mu \log P_\mu(\theta^{(i)}) F(\theta^{(i)})$$
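A hedged sketch of the ES sample estimate, assuming $P_\mu(\theta) = N(\mu, \sigma^2 I)$ with fixed $\sigma$, in which case $\nabla_\mu \log P_\mu(\theta) = (\theta - \mu)/\sigma^2$ and the estimator has the familiar "perturb parameters, weight by fitness" form. `F` is a placeholder fitness.

```python
import numpy as np

def F(theta):
    """Placeholder fitness: total return of the policy with parameters theta."""
    return -np.sum((theta - 1.0) ** 2)

def es_gradient(mu, sigma=0.1, N=100):
    """Monte Carlo estimate of grad_mu E_{theta ~ N(mu, sigma^2 I)}[F(theta)]."""
    dim = mu.shape[0]
    eps = np.random.randn(N, dim)               # theta_i = mu + sigma * eps_i
    fitness = np.array([F(mu + sigma * e) for e in eps])
    # grad_mu log N(theta; mu, sigma^2 I) = (theta - mu) / sigma^2 = eps / sigma
    return (fitness[:, None] * eps).mean(axis=0) / sigma

mu = np.zeros(5)
for _ in range(200):                             # simple gradient ascent on U(mu)
    mu += 1e-2 * es_gradient(mu)
```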

  31. PG vs. ES
  PG considers a distribution over actions; ES considers a distribution over policy parameters (derivations as on slides 28–30).
  PG:
  $$\max_\theta J(\theta) = \mathbb{E}_{\tau \sim P_\theta(\tau)}\left[ R(\tau) \right], \qquad \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim P_\theta(\tau)}\left[ \nabla_\theta \log P_\theta(\tau) R(\tau) \right] \approx \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta \log P_\theta(\tau^{(i)}) R(\tau^{(i)})$$
  ES:
  $$\max_\mu U(\mu) = \mathbb{E}_{\theta \sim P_\mu(\theta)}\left[ F(\theta) \right], \qquad \nabla_\mu U(\mu) = \mathbb{E}_{\theta \sim P_\mu(\theta)}\left[ \nabla_\mu \log P_\mu(\theta) F(\theta) \right] \approx \frac{1}{N} \sum_{i=1}^{N} \nabla_\mu \log P_\mu(\theta^{(i)}) F(\theta^{(i)})$$

  32. From trajectories to actions
  $$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta \log P_\theta(\tau^{(i)}) R(\tau^{(i)}) \qquad\longrightarrow\qquad \nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a^{(i)}_t \mid s^{(i)}_t)\, R(s^{(i)}_t, a^{(i)}_t)$$
  $$\begin{aligned}
  \nabla_\theta \log P(\tau^{(i)}; \theta) &= \nabla_\theta \log \prod_{t=0}^{T} \underbrace{P(s^{(i)}_{t+1} \mid s^{(i)}_t, a^{(i)}_t)}_{\text{dynamics}} \cdot \underbrace{\pi_\theta(a^{(i)}_t \mid s^{(i)}_t)}_{\text{policy}} \\
  &= \nabla_\theta \sum_{t=0}^{T} \Big[ \underbrace{\log P(s^{(i)}_{t+1} \mid s^{(i)}_t, a^{(i)}_t)}_{\text{dynamics}} + \underbrace{\log \pi_\theta(a^{(i)}_t \mid s^{(i)}_t)}_{\text{policy}} \Big] \\
  &= \nabla_\theta \sum_{t=0}^{T} \log \pi_\theta(a^{(i)}_t \mid s^{(i)}_t) \\
  &= \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a^{(i)}_t \mid s^{(i)}_t)
  \end{aligned}$$
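A small numerical illustration of the decomposition above: the dynamics terms do not depend on $\theta$, so they contribute nothing to the gradient, and $\nabla_\theta \log P(\tau;\theta)$ reduces to the sum of per-timestep $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$. The linear-softmax policy and finite-difference gradient below are illustrative choices.

```python
import numpy as np

def log_pi(theta, s, a):
    """Log-prob of action a under a linear-softmax policy (illustrative)."""
    logits = s @ theta
    return logits[a] - np.log(np.sum(np.exp(logits)))

def grad_log_pi(theta, s, a, eps=1e-6):
    """Numerical gradient of log pi_theta(a|s) w.r.t. theta (central differences)."""
    g = np.zeros_like(theta)
    for idx in np.ndindex(theta.shape):
        d = np.zeros_like(theta)
        d[idx] = eps
        g[idx] = (log_pi(theta + d, s, a) - log_pi(theta - d, s, a)) / (2 * eps)
    return g

theta = np.random.randn(4, 2)
states = np.random.randn(5, 4)
actions = np.random.randint(0, 2, size=5)

# grad_theta log P(tau; theta) = sum_t grad_theta log pi_theta(a_t | s_t):
# the log P(s_{t+1} | s_t, a_t) dynamics terms are constant in theta and drop out.
grad_traj = sum(grad_log_pi(theta, s, a) for s, a in zip(states, actions))
```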
