Maximum Entropy Reinforcement Learning


  1. Carnegie Mellon School of Computer Science, Deep Reinforcement Learning and Control. Maximum Entropy Reinforcement Learning, CMU 10-403, Katerina Fragkiadaki

  2. RL objective: π* = arg max_π 𝔼_π [ ∑_t R(s_t, a_t) ], or equivalently, over the trajectory distribution induced by the policy, π* = arg max_π 𝔼_{(s_t, a_t) ∼ ρ_π} [ ∑_t R(s_t, a_t) ]

  3. MaxEntRL objective: promoting stochastic policies. π* = arg max_π ∑_{t=1}^T 𝔼_π [ R(s_t, a_t) + α H(π(·|s_t)) ], i.e., reward plus an entropy term. Why? • Better exploration • Learning alternative ways of accomplishing the task • Better generalization, e.g., in the presence of obstacles a stochastic policy may still succeed.

  4. Principle of Maximum Entropy: policies that generate similar rewards should be equally probable; we do not want to commit to one of them. Why? • Better exploration • Learning alternative ways of accomplishing the task • Better generalization, e.g., in the presence of obstacles a stochastic policy may still succeed. Reinforcement Learning with Deep Energy-Based Policies, Haarnoja et al.

  5. We have seen this before: dθ ← dθ + ∇_{θ′} log π(a_i | s_i; θ′) (R − V(s_i; θ′_v)) + β ∇_{θ′} H(π(s_t; θ′)). "We also found that adding the entropy of the policy π to the objective function improved exploration by discouraging premature convergence to suboptimal deterministic policies. This technique was originally proposed by (Williams & Peng, 1991)." Mnih et al., Asynchronous Methods for Deep Reinforcement Learning.
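
Purely as an illustration of this entropy bonus, here is a minimal PyTorch-style sketch of an entropy-regularized policy-gradient loss. The tensor names and the coefficient value are assumptions for the example, not the A3C reference implementation.

```python
import torch

def entropy_regularized_pg_loss(logits, actions, returns, values, beta=0.01):
    """Policy-gradient surrogate loss with an entropy bonus (in the spirit of A3C).

    logits:  (batch, num_actions) unnormalized action scores from the policy network
    actions: (batch,) actions that were taken
    returns: (batch,) empirical returns R
    values:  (batch,) value estimates V(s), used as a baseline
    beta:    weight of the entropy bonus
    """
    dist = torch.distributions.Categorical(logits=logits)
    log_probs = dist.log_prob(actions)              # log pi(a|s)
    advantages = (returns - values).detach()        # R - V(s); no gradient through the baseline
    pg_term = -(log_probs * advantages).mean()      # REINFORCE-style surrogate
    entropy_bonus = dist.entropy().mean()           # H(pi(.|s)), averaged over the batch
    return pg_term - beta * entropy_bonus           # maximizing entropy = subtracting it from the loss
```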

  6. MaxEntRL objective: promoting stochastic policies. π* = arg max_π ∑_{t=1}^T 𝔼_π [ R(s_t, a_t) + α H(π(·|s_t)) ], i.e., reward plus an entropy term. How can we maximize such an objective?

  7. Recall: Back-up Diagrams. q_π(s, a) = r(s, a) + γ ∑_{s′ ∈ 𝒮} T(s′ | s, a) ∑_{a′ ∈ 𝒜} π(a′ | s′) q_π(s′, a′)

  8. Back-up Diagrams for the MaxEnt objective: H(π(·|s′)) = −𝔼_{a′ ∼ π(·|s′)} [ log π(a′ | s′) ]

  9. Back-up Diagrams for the MaxEnt objective: each next-state action value is augmented with the term −log π(a′ | s′), giving q_π(s, a) = r(s, a) + γ ∑_{s′ ∈ 𝒮} T(s′ | s, a) ∑_{a′ ∈ 𝒜} π(a′ | s′) ( q_π(s′, a′) − log π(a′ | s′) )

  10. (Soft) policy evaluation.
Bellman backup equation: q_π(s, a) = r(s, a) + γ ∑_{s′ ∈ 𝒮} T(s′ | s, a) ∑_{a′ ∈ 𝒜} π(a′ | s′) q_π(s′, a′)
Soft Bellman backup equation: q_π(s, a) = r(s, a) + γ ∑_{s′ ∈ 𝒮} T(s′ | s, a) ∑_{a′ ∈ 𝒜} π(a′ | s′) ( q_π(s′, a′) − log π(a′ | s′) )
Bellman backup update operator: Q(s_t, a_t) ← r(s_t, a_t) + γ 𝔼_{s_{t+1}, a_{t+1}} [ Q(s_{t+1}, a_{t+1}) ]
Soft Bellman backup update operator: Q(s_t, a_t) ← r(s_t, a_t) + γ 𝔼_{s_{t+1}, a_{t+1}} [ Q(s_{t+1}, a_{t+1}) − log π(a_{t+1} | s_{t+1}) ]
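
To make the soft backup concrete, here is a minimal NumPy sketch of tabular soft policy evaluation under a known transition model. The array layout (R, T, pi), the temperature alpha, and the fixed iteration count are my assumptions for the example.

```python
import numpy as np

def soft_policy_evaluation(R, T, pi, gamma=0.99, alpha=1.0, iters=1000):
    """Tabular soft policy evaluation: repeatedly apply the soft Bellman backup.

    R:  (S, A) rewards r(s, a)
    T:  (S, A, S) transition probabilities T(s' | s, a)
    pi: (S, A) stochastic policy pi(a | s)
    Returns the soft Q-function of pi, shape (S, A).
    """
    Q = np.zeros_like(R)
    for _ in range(iters):
        # Soft state value: V(s') = E_{a'~pi}[ Q(s', a') - alpha * log pi(a'|s') ]
        V = np.sum(pi * (Q - alpha * np.log(pi + 1e-12)), axis=1)   # shape (S,)
        # Soft Bellman backup: Q(s, a) <- r(s, a) + gamma * E_{s'}[ V(s') ]
        Q = R + gamma * (T @ V)
    return Q
```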

  11. The soft Bellman backup update operator is a contraction.
Q(s_t, a_t) ← r(s_t, a_t) + γ 𝔼_{s_{t+1}, a_{t+1}} [ Q(s_{t+1}, a_{t+1}) − log π(a_{t+1} | s_{t+1}) ]
Q(s_t, a_t) ← r(s_t, a_t) + γ 𝔼_{s_{t+1} ∼ ρ} [ 𝔼_{a_{t+1} ∼ π} [ Q(s_{t+1}, a_{t+1}) − log π(a_{t+1} | s_{t+1}) ] ]
            ← r(s_t, a_t) + γ 𝔼_{s_{t+1} ∼ ρ, a_{t+1} ∼ π} [ Q(s_{t+1}, a_{t+1}) ] + γ 𝔼_{s_{t+1} ∼ ρ} [ 𝔼_{a_{t+1} ∼ π} [ −log π(a_{t+1} | s_{t+1}) ] ]
            ← r(s_t, a_t) + γ 𝔼_{s_{t+1} ∼ ρ, a_{t+1} ∼ π} [ Q(s_{t+1}, a_{t+1}) ] + γ 𝔼_{s_{t+1} ∼ ρ} [ H(π(·|s_{t+1})) ]
Rewrite the reward as r_soft(s_t, a_t) = r(s_t, a_t) + γ 𝔼_{s_{t+1} ∼ ρ} [ H(π(·|s_{t+1})) ]. Then we get the old Bellman operator, which we know is a contraction.

  12. Soft Bellman backup update operator (with temperature α):
Q(s_t, a_t) ← r(s_t, a_t) + γ 𝔼_{s_{t+1}, a_{t+1}} [ Q(s_{t+1}, a_{t+1}) − α log π(a_{t+1} | s_{t+1}) ]
Q(s_t, a_t) ← r(s_t, a_t) + γ 𝔼_{s_{t+1} ∼ ρ} [ 𝔼_{a_{t+1} ∼ π} [ Q(s_{t+1}, a_{t+1}) − α log π(a_{t+1} | s_{t+1}) ] ]
            ← r(s_t, a_t) + γ 𝔼_{s_{t+1} ∼ ρ, a_{t+1} ∼ π} [ Q(s_{t+1}, a_{t+1}) ] + γα 𝔼_{s_{t+1} ∼ ρ} [ 𝔼_{a_{t+1} ∼ π} [ −log π(a_{t+1} | s_{t+1}) ] ]
            ← r(s_t, a_t) + γ 𝔼_{s_{t+1} ∼ ρ, a_{t+1} ∼ π} [ Q(s_{t+1}, a_{t+1}) ] + γα 𝔼_{s_{t+1} ∼ ρ} [ H(π(·|s_{t+1})) ]
We know that Q(s_t, a_t) ← r(s_t, a_t) + γ 𝔼_{s_{t+1} ∼ ρ} [ V(s_{t+1}) ], which means that V(s_t) = 𝔼_{a_t ∼ π} [ Q(s_t, a_t) − α log π(a_t | s_t) ]

  13. Review: Policy Iteration (unknown dynamics). Policy iteration alternates between two steps: 1. Policy evaluation: fix the policy and apply the Bellman backup operator until convergence, q_π(s, a) ← r(s, a) + γ 𝔼_{s′, a′} [ q_π(s′, a′) ]. 2. Policy improvement: update the policy.

  14. Soft Policy Iteration. Soft policy iteration alternates between two steps: 1. Soft policy evaluation: fix the policy and apply the soft Bellman backup operator until convergence, q_π(s, a) ← r(s, a) + γ 𝔼_{s′, a′} [ q_π(s′, a′) − α log π(a′ | s′) ]. This converges to q_π. 2. Soft policy improvement: update the policy, π′ = arg min_{π_k ∈ Π} D_KL( π_k(·|s_t) ∥ exp(Q^π(s_t, ·)) / Z^π(s_t) ). This leads to a sequence of policies with monotonically increasing soft Q values. So far this concerns tabular methods; next we will use function approximation for the policy and the action values. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor.
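
A minimal NumPy sketch of the two steps in the tabular case, reusing the soft_policy_evaluation helper sketched above. When the policy class Π is unrestricted, the KL projection is solved exactly by a softmax over Q/α; that closed form (and the placement of the temperature α) is my assumption for the example.

```python
import numpy as np

def soft_policy_improvement(Q, alpha=1.0):
    """Closed-form soft policy improvement: pi(a|s) proportional to exp(Q(s,a)/alpha)."""
    logits = Q / alpha
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    pi = np.exp(logits)
    return pi / pi.sum(axis=1, keepdims=True)

def soft_policy_iteration(R, T, gamma=0.99, alpha=1.0, outer_iters=50):
    """Alternate soft policy evaluation and soft policy improvement."""
    pi = np.full_like(R, 1.0 / R.shape[1])                # start from the uniform policy
    for _ in range(outer_iters):
        Q = soft_policy_evaluation(R, T, pi, gamma=gamma, alpha=alpha)  # helper from the sketch above
        pi = soft_policy_improvement(Q, alpha=alpha)
    return pi, Q
```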

  15. Review: Policy Improvement theorem for deterministic policies. Let π, π′ be any pair of deterministic policies such that, for all s ∈ 𝒮, q_π(s, π′(s)) ≥ v_π(s). Then π′ must be as good as or better than π, that is, v_π′(s) ≥ v_π(s).

  17. Review: Policy Improvement theorem for deterministic policies. Let π, π′ be any pair of deterministic policies such that, for all s ∈ 𝒮, q_π(s, π′(s)) ≥ v_π(s). Then π′ must be as good as or better than π, that is, v_π′(s) ≥ v_π(s). An analogous guarantee holds for the soft improvement step π′ = arg min_{π_k ∈ Π} D_KL( π_k(·|s_t) ∥ exp(Q^π(s_t, ·)) / Z^π(s_t) ).

  18. SoftMax
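
The "SoftMax" slide title presumably refers to the closed-form solution of the KL projection above. As a hedged reconstruction of the standard MaxEnt RL result (written here with temperature α; not necessarily the exact slide content):

```latex
% Unrestricted soft policy improvement: the KL projection is solved by a softmax over Q.
\pi'(a \mid s) \;=\; \frac{\exp\!\big(\tfrac{1}{\alpha}\, Q^{\pi}(s,a)\big)}{Z^{\pi}(s)},
\qquad
Z^{\pi}(s) \;=\; \int_{\mathcal{A}} \exp\!\big(\tfrac{1}{\alpha}\, Q^{\pi}(s,a)\big)\, da .

% Plugging this policy into V(s) = E_{a ~ pi'}[ Q^{\pi}(s,a) - \alpha \log \pi'(a \mid s) ] gives
V(s) \;=\; \alpha \log \int_{\mathcal{A}} \exp\!\big(\tfrac{1}{\alpha}\, Q^{\pi}(s,a)\big)\, da ,
% i.e. a "soft maximum" (log-sum-exp) of Q over actions, which recovers the hard max as alpha -> 0.
```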

  19. Soft Policy Iteration - Approximation. Use function approximation for the policy and the action-value function: π_ϕ(a_t | s_t), Q_θ(s_t, a_t).

  20. Soft Policy Iteration - Approximation. Use function approximation for the policy and the action-value function: π_ϕ(a_t | s_t), Q_θ(s_t, a_t). 1. Learning the state-action value function: semi-gradient method (see the sketch below).
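
The slide does not spell out the objective here, so as a hedged reconstruction, the soft Q-function loss in Soft Actor-Critic has the following form (this is the variant that regresses Q directly onto a soft target; the lecture may instead use the original formulation with a separately learned value network V_ψ):

```latex
% Squared soft Bellman residual, minimized by (semi-)gradient descent on theta.
% D is the replay buffer, \bar{\theta} denotes target-network parameters.
% The target \hat{Q} is held fixed (no gradient flows through it), hence "semi-gradient".
J_Q(\theta) \;=\; \mathbb{E}_{(s_t, a_t) \sim \mathcal{D}}
\Big[ \tfrac{1}{2} \big( Q_\theta(s_t, a_t) - \hat{Q}(s_t, a_t) \big)^2 \Big],
\qquad
\hat{Q}(s_t, a_t) \;=\; r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1} \sim \rho} \big[ V_{\bar{\theta}}(s_{t+1}) \big],

% with the soft value recovered from the (target) Q-function and the current policy:
V_{\bar{\theta}}(s_{t+1}) \;=\; \mathbb{E}_{a_{t+1} \sim \pi_\phi}
\big[ Q_{\bar{\theta}}(s_{t+1}, a_{t+1}) - \alpha \log \pi_\phi(a_{t+1} \mid s_{t+1}) \big].
```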

  21. Soft Policy Iteration - Approximation. Use function approximation for the policy and the action-value function: π_ϕ(a_t | s_t), Q_θ(s_t, a_t). 2. Learning the policy:
∇_ϕ J_π(ϕ) = ∇_ϕ 𝔼_{s_t ∼ D} 𝔼_{a_t ∼ π_ϕ(·|s_t)} [ log ( π_ϕ(a_t | s_t) / ( exp(Q_θ(s_t, a_t)) / Z_θ(s_t) ) ) ], where Z_θ(s_t) = ∫_𝒜 exp(Q_θ(s_t, a_t)) da_t is independent of ϕ,
which we will rewrite as ∇_ϕ J_π(ϕ) = ∇_ϕ 𝔼_{s_t ∼ D, ϵ ∼ 𝒩(0, I)} [ log ( π_ϕ(a_t | s_t) / exp(Q_θ(s_t, a_t)) ) ].
The difficulty with the first form: the variable with respect to which we take the gradient parametrizes the distribution inside the expectation.

  22. Soft Policy Iteration - Approximation. Use function approximation for the policy and the action-value function: π_ϕ(a_t | s_t), Q_θ(s_t, a_t). 2. Learning the policy:
∇_ϕ J_π(ϕ) = ∇_ϕ 𝔼_{s_t ∼ D} 𝔼_{a_t ∼ π_ϕ(·|s_t)} [ log ( π_ϕ(a_t | s_t) / exp(Q_θ(s_t, a_t)) ) ]
Reparametrization trick: the policy becomes a deterministic function of Gaussian random variables drawn from a fixed Gaussian distribution, a_t = f_ϕ(s_t, ϵ) = μ_ϕ(s_t) + ϵ Σ_ϕ(s_t), ϵ ∼ 𝒩(0, I), so that
∇_ϕ J_π(ϕ) = ∇_ϕ 𝔼_{s_t ∼ D, ϵ ∼ 𝒩(0, I)} [ log ( π_ϕ(a_t | s_t) / exp(Q_θ(s_t, a_t)) ) ]
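
An illustrative PyTorch sketch of this reparametrized policy update. The network architecture, the q_fn callable, and the fixed temperature alpha are placeholders for the example; real SAC actors also squash actions with tanh and correct the log-density, which is omitted here.

```python
import torch
import torch.nn as nn

class GaussianActor(nn.Module):
    """Diagonal-Gaussian policy: maps s_t to the mean and log-std of a_t."""
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * act_dim))

    def forward(self, obs):
        mu, log_std = self.net(obs).chunk(2, dim=-1)
        return torch.distributions.Normal(mu, log_std.clamp(-5, 2).exp())

def policy_loss(actor, q_fn, obs, alpha=0.2):
    """Reparametrized policy objective: E[ alpha * log pi(a|s) - Q(s, a) ].

    rsample() draws a_t = mu(s) + eps * std(s) with eps ~ N(0, I), so the
    gradient flows through the sampled action into the policy parameters phi.
    """
    dist = actor(obs)
    action = dist.rsample()                          # reparametrized sample
    log_prob = dist.log_prob(action).sum(dim=-1)     # log pi(a|s), summed over action dims
    return (alpha * log_prob - q_fn(obs, action)).mean()
```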

  23. Composability of Maximum Entropy Policies. Imagine we want to satisfy two objectives at the same time, e.g., pick an object up while avoiding an obstacle. We would learn a policy that maximizes the average of the corresponding reward functions: r_C(s, a) = (1/C) ∑_{i=1}^C r_i(s, a). MaxEnt policies let us approximate the resulting policy's optimal Q by simply averaging the constituent Qs: Q*_C(s, a) ≈ (1/C) ∑_{i=1}^C Q*_i(s, a). We can theoretically bound the suboptimality of the resulting policy with respect to the policy trained under the combined reward. We cannot do this for deterministic policies. Composable Deep Reinforcement Learning for Robotic Manipulation, Haarnoja et al.
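
A minimal sketch of how such a composition might be used at execution time: average the constituent soft Q estimates and act with the corresponding softmax (Boltzmann) policy over a set of candidate actions. The discretized candidate-action set and the callable q_fns interface are my simplifications for illustration, not the procedure of the paper.

```python
import numpy as np

def composed_softmax_action(q_fns, state, candidate_actions, alpha=1.0, rng=None):
    """Sample an action from pi_C(a|s) proportional to exp( mean_i Q_i(s, a) / alpha ).

    q_fns:             list of callables, each mapping (state, action) -> scalar soft Q estimate
    candidate_actions: sequence of candidate actions to score
    """
    rng = rng or np.random.default_rng()
    q_composed = np.array([np.mean([q(state, a) for q in q_fns])
                           for a in candidate_actions])
    logits = q_composed / alpha
    probs = np.exp(logits - logits.max())            # numerical stability
    probs /= probs.sum()
    idx = rng.choice(len(candidate_actions), p=probs)
    return candidate_actions[idx]
```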

  24. https://www.youtube.com/watch?time_continue=82&v=FmMPHL3TcrE
