  1. Budgeted Reinforcement Learning in Continuous State Space. Nicolas Carrara 1, Edouard Leurent 1,2, Tanguy Urvoy 3, Romain Laroche 4, Odalric Maillard 1, Olivier Pietquin 1,5. Affiliations: 1 Inria SequeL, 2 Renault Group, 3 Orange Labs, 4 Microsoft Montréal, 5 Google Research, Brain Team.

  2. Contents: 01. Motivation and Setting, 02. Budgeted Dynamic Programming, 03. Budgeted Reinforcement Learning, 04. Experiments.

  3. 01 Motivation and Setting

  4. Learning to act. Optimal Decision-Making: which action $a_t$ should we choose in state $s_t$ to maximise a cumulative reward $R$? $\max_\pi \; \mathbb{E}_{a_t \sim \pi(a_t \mid s_t)} \left[ \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \right]$

  5. Learning to act. Optimal Decision-Making: which action $a_t$ should we choose in state $s_t$ to maximise a cumulative reward $R$? $\max_\pi \; \mathbb{E}_{a_t \sim \pi(a_t \mid s_t)} \left[ \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \right]$ ✓ A very general formulation. ✓ Widely used in the industry.

  6. Learning to act. Optimal Decision-Making: which action $a_t$ should we choose in state $s_t$ to maximise a cumulative reward $R$? $\max_\pi \; \mathbb{E}_{a_t \sim \pi(a_t \mid s_t)} \left[ \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \right]$ ✓ A very general formulation. ✗ Not widely used in the industry.

  7. Learning to act. Optimal Decision-Making: which action $a_t$ should we choose in state $s_t$ to maximise a cumulative reward $R$? $\max_\pi \; \mathbb{E}_{a_t \sim \pi(a_t \mid s_t)} \left[ \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \right]$ ✓ A very general formulation. ✗ Not widely used in the industry: > Sample efficiency > Trial and error > Unpredictable behaviour
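The objective above can be made concrete with a short Monte-Carlo sketch. The snippet below is illustrative only: it assumes a Gym-style environment `env` (with `reset`/`step`) and a callable `policy(state)`, neither of which appears in the slides, and estimates the discounted return $\sum_t \gamma^t R(s_t, a_t)$ by averaging rollouts.

```python
import numpy as np

def discounted_return(env, policy, gamma=0.99, horizon=200, episodes=100):
    """Monte-Carlo estimate of E[sum_t gamma^t R(s_t, a_t)] for a given policy.

    `env` is assumed to follow a Gym-style reset()/step() interface and
    `policy(state)` to return an action; both are illustrative assumptions.
    """
    returns = []
    for _ in range(episodes):
        state = env.reset()
        total, discount = 0.0, 1.0
        for _ in range(horizon):
            action = policy(state)
            state, reward, done, _ = env.step(action)
            total += discount * reward
            discount *= gamma
            if done:
                break
        returns.append(total)
    return np.mean(returns)
```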

  8. Limitation of Reinforcement Learning. Reinforcement learning relies on a single reward function $R$.

  9. Limitation of Reinforcement Learning. Reinforcement learning relies on a single reward function $R$. ✓ A convenient formulation, but:

  10. Limitation of Reinforcement Learning. Reinforcement learning relies on a single reward function $R$. ✓ A convenient formulation, but: ✗ $R$ is not always easy to design.

  11. Limitation of Reinforcement Learning. Reinforcement learning relies on a single reward function $R$. ✓ A convenient formulation, but: ✗ $R$ is not always easy to design. Conflicting Objectives: complex tasks involve multiple contradictory aspects, typically Task completion vs Safety.

  12. Limitation of Reinforcement Learning. Reinforcement learning relies on a single reward function $R$. ✓ A convenient formulation, but: ✗ $R$ is not always easy to design. Conflicting Objectives: complex tasks involve multiple contradictory aspects, typically Task completion vs Safety. For example...

  13. Example problems with conflicts. Dialogue systems: a slot-filling problem, where the agent fills a form by asking the user for each slot. It can either: • ask the user to answer using voice (safe/slow); • ask the user to answer with a numeric pad (unsafe/fast).

  14. Example problems with conflicts. Dialogue systems: a slot-filling problem, where the agent fills a form by asking the user for each slot. It can either: • ask the user to answer using voice (safe/slow); • ask the user to answer with a numeric pad (unsafe/fast). Autonomous Driving: the agent is driving on a two-way road with a car in front of it; • it can stay behind (safe/slow); • it can overtake (unsafe/fast).

  15. Limitation of Reinforcement Learning. Reinforcement learning relies on a single reward function $R$. ✓ A convenient formulation, but: ✗ $R$ is not always easy to design. Conflicting Objectives: complex tasks involve multiple contradictory aspects, typically Task completion vs Safety. For example... For a fixed reward function $R$, there is no control over the Task Completion vs Safety trade-off: the optimal policy $\pi^*$ is only guaranteed to lie on a Pareto-optimal curve $\Pi^*$.

  16. The Pareto-optimal curve. [Figure: policies plotted with Task Completion return $G_1 = \sum_t \gamma^t R_1(s_t, a_t)$ on one axis and Safety return $G_2 = \sum_t \gamma^t R_2(s_t, a_t)$ on the other; $\operatorname{argmax}_\pi \sum_t \gamma^t R(s_t, a_t)$ for a fixed combination of $(R_1, R_2)$ lies on the Pareto-optimal curve $\Pi^*$.]

  17. From maximal safety to minimal risk. [Figure: same plot, now with axes Task Completion $G_r$ vs Risk $G_c$; the reward combination becomes $(R_r, -R_c)$, and $\operatorname{argmax}_\pi \sum_t \gamma^t R(s_t, a_t)$ still lies on the Pareto-optimal curve $\Pi^*$.]

  18. The optimal policy can move freely along $\Pi^*$. [Figure: Task Completion $G_r$ vs Risk $G_c$ plot; depending on the chosen combination $(R_r, -R_c)$, the optimal policy $\pi^*$ can land anywhere on the Pareto-optimal curve $\Pi^*$.]

  19. How to choose a desired trade-off. [Figure: Task Completion $G_r$ vs Risk $G_c$ plot; the policy $\pi^* = \operatorname{argmax}_\pi \sum_t \gamma^t R_r(s_t, a_t)$ s.t. $\sum_t \gamma^t R_c(s_t, a_t) \le \beta$ is the Pareto-optimal policy whose risk stays below the budget $\beta$ marked on the risk axis.]

  20. Constrained Reinforcement Learning. Markov Decision Process: an MDP is a tuple $(\mathcal{S}, \mathcal{A}, P, R_r, \gamma)$ with: • Rewards $R_r \in \mathbb{R}^{\mathcal{S} \times \mathcal{A}}$. Objective: maximise rewards, $\max_{\pi \in \mathcal{M}(\mathcal{A})^{\mathcal{S}}} \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R_r(s_t, a_t) \mid s_0 = s\right]$.

  21. Constrained Reinforcement Learning. Constrained Markov Decision Process: a CMDP is a tuple $(\mathcal{S}, \mathcal{A}, P, R_r, R_c, \gamma, \beta)$ with: • Rewards $R_r \in \mathbb{R}^{\mathcal{S} \times \mathcal{A}}$ • Costs $R_c \in \mathbb{R}^{\mathcal{S} \times \mathcal{A}}$ • Budget $\beta$. Objective: maximise rewards while keeping costs under a fixed budget, $\max_{\pi \in \mathcal{M}(\mathcal{A})^{\mathcal{S}}} \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R_r(s_t, a_t) \mid s_0 = s\right]$ s.t. $\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R_c(s_t, a_t) \mid s_0 = s\right] \le \beta$.
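To make the CMDP objective concrete, here is a minimal sketch of how one might check a policy against a fixed budget by Monte-Carlo rollouts. It assumes, purely for illustration, a Gym-style `env` whose `step()` reports a scalar cost in `info["cost"]`; none of these names come from the slides.

```python
import numpy as np

def evaluate_cmdp_policy(env, policy, gamma=0.99, beta=0.1,
                         horizon=200, episodes=100):
    """Estimate the reward return and cost return of a policy in a CMDP,
    and check whether the expected discounted cost stays under the budget beta.

    Assumes (illustratively) that env.step() reports the cost signal R_c
    in the info dict under the key "cost".
    """
    reward_returns, cost_returns = [], []
    for _ in range(episodes):
        state = env.reset()
        g_r, g_c, discount = 0.0, 0.0, 1.0
        for _ in range(horizon):
            action = policy(state)
            state, reward, done, info = env.step(action)
            g_r += discount * reward
            g_c += discount * info["cost"]
            discount *= gamma
            if done:
                break
        reward_returns.append(g_r)
        cost_returns.append(g_c)
    feasible = np.mean(cost_returns) <= beta  # constraint E[sum gamma^t R_c] <= beta
    return np.mean(reward_returns), np.mean(cost_returns), feasible
```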

  22. We want to learn $\Pi^*$ rather than $\pi^*_\beta$. [Figure: same Task Completion $G_r$ vs Risk $G_c$ plot, with the constrained optimum $\pi^* = \operatorname{argmax}_\pi \sum_t \gamma^t R_r(s_t, a_t)$ s.t. $\sum_t \gamma^t R_c(s_t, a_t) \le \beta$ shown at budget $\beta$ on the Pareto-optimal curve $\Pi^*$.]

  24. Budgeted Reinforcement Learning. Budgeted Markov Decision Process: a BMDP is a tuple $(\mathcal{S}, \mathcal{A}, P, R_r, R_c, \gamma, \mathcal{B})$ with: • Rewards $R_r \in \mathbb{R}^{\mathcal{S} \times \mathcal{A}}$ • Costs $R_c \in \mathbb{R}^{\mathcal{S} \times \mathcal{A}}$ • Budget space $\mathcal{B}$. Objective: maximise rewards while keeping costs under an adjustable budget, $\forall \beta \in \mathcal{B}$: $\max_{\pi \in \mathcal{M}(\mathcal{A} \times \mathcal{B})^{\mathcal{S} \times \mathcal{B}}} \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R_r(s_t, a_t) \mid s_0 = s, \beta_0 = \beta\right]$ s.t. $\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R_c(s_t, a_t) \mid s_0 = s, \beta_0 = \beta\right] \le \beta$.

  25. Problem formulation. Budgeted policies $\pi$: • take a budget $\beta$ as an additional input; • output a next budget $\beta'$; • $\pi : \underbrace{(s, \beta)}_{\bar{s}} \mapsto \underbrace{(a, \beta')}_{\bar{a}}$. In other words, augment the state and action spaces with the budget $\beta$.

  26. Augmented Setting. Definition (Augmented spaces): • States $\bar{\mathcal{S}} = \mathcal{S} \times \mathcal{B}$ • Actions $\bar{\mathcal{A}} = \mathcal{A} \times \mathcal{B}$ • Dynamics $\bar{P}$: from state $\bar{s} = (s, \beta)$ and action $\bar{a} = (a, \beta_a)$, the next state is $\bar{s}' = (s', \beta')$ with $s' \sim P(s' \mid s, a)$ and $\beta' = \beta_a$. Definition (Augmented signals): 1. Rewards $\bar{R} = (R_r, R_c)$ 2. Returns $G^\pi = (G^\pi_r, G^\pi_c) \overset{\text{def}}{=} \sum_{t=0}^{\infty} \gamma^t \bar{R}(\bar{s}_t, \bar{a}_t)$ 3. Value $V^\pi(\bar{s}) = (V^\pi_r, V^\pi_c) \overset{\text{def}}{=} \mathbb{E}[G^\pi \mid \bar{s}_0 = \bar{s}]$ 4. Q-Value $Q^\pi(\bar{s}, \bar{a}) = (Q^\pi_r, Q^\pi_c) \overset{\text{def}}{=} \mathbb{E}[G^\pi \mid \bar{s}_0 = \bar{s}, \bar{a}_0 = \bar{a}]$.
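The augmented dynamics can be illustrated with one transition step. The sketch below assumes a Gym-style `env` with the cost reported in `info["cost"]` and a callable `budgeted_policy(state, beta)` returning an augmented action; these names are illustrative placeholders, not part of the slides.

```python
import numpy as np

def augmented_step(env, budgeted_policy, state, beta):
    """One transition in the augmented BMDP described on the slide.

    The budgeted policy maps the augmented state (s, beta) to an augmented
    action (a, beta_a); the budget component of the next augmented state is
    simply beta_a, while s' follows the original dynamics P(s' | s, a).
    """
    action, beta_a = budgeted_policy(state, beta)       # a-bar = (a, beta_a)
    next_state, reward, done, info = env.step(action)   # s' ~ P(. | s, a)
    signal = np.array([reward, info["cost"]])           # R-bar = (R_r, R_c)
    return (next_state, beta_a), signal, done
```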

  27. 02 Budgeted Dynamic Programming

  28. Policy Evaluation. Proposition (Budgeted Bellman Expectation): the Bellman Expectation equations are preserved: $V^\pi(\bar{s}) = \sum_{\bar{a} \in \bar{\mathcal{A}}} \pi(\bar{a} \mid \bar{s}) \, Q^\pi(\bar{s}, \bar{a})$ and $Q^\pi(\bar{s}, \bar{a}) = \bar{R}(\bar{s}, \bar{a}) + \gamma \sum_{\bar{s}' \in \bar{\mathcal{S}}} \bar{P}(\bar{s}' \mid \bar{s}, \bar{a}) \, V^\pi(\bar{s}')$.

  29. Policy Evaluation. Proposition (Budgeted Bellman Expectation): the Bellman Expectation equations are preserved: $V^\pi(\bar{s}) = \sum_{\bar{a} \in \bar{\mathcal{A}}} \pi(\bar{a} \mid \bar{s}) \, Q^\pi(\bar{s}, \bar{a})$ and $Q^\pi(\bar{s}, \bar{a}) = \bar{R}(\bar{s}, \bar{a}) + \gamma \sum_{\bar{s}' \in \bar{\mathcal{S}}} \bar{P}(\bar{s}' \mid \bar{s}, \bar{a}) \, V^\pi(\bar{s}')$. Proposition (Contraction): the Bellman Expectation Operator $\mathcal{T}^\pi$ is a $\gamma$-contraction, where $\mathcal{T}^\pi Q(\bar{s}, \bar{a}) \overset{\text{def}}{=} \bar{R}(\bar{s}, \bar{a}) + \gamma \sum_{\bar{s}' \in \bar{\mathcal{S}}} \sum_{\bar{a}' \in \bar{\mathcal{A}}} \bar{P}(\bar{s}' \mid \bar{s}, \bar{a}) \, \pi(\bar{a}' \mid \bar{s}') \, Q(\bar{s}', \bar{a}')$. ✓ We can evaluate a budgeted policy $\pi$.
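For a finite augmented MDP given in tabular form, the operator $\mathcal{T}^\pi$ above can be applied directly to a vector-valued Q-function and iterated to its fixed point (convergence follows from the $\gamma$-contraction property). The sketch below is a minimal tabular illustration with assumed array shapes, not the continuous-state algorithm of the paper.

```python
import numpy as np

def bellman_expectation_operator(Q, R, P, pi, gamma):
    """One application of T^pi to a vector-valued Q-function on a finite
    (augmented) MDP, following the Budgeted Bellman Expectation proposition.

    Assumed shapes: Q and R are (S, A, 2) arrays holding the reward and cost
    components; P is (S, A, S) with P[s, a, s'] = P(s'|s, a); pi is (S, A)
    with rows summing to 1.
    """
    # V[s', k] = sum_a' pi(a'|s') Q(s', a', k)
    V = np.einsum("sa,sak->sk", pi, Q)
    # T^pi Q(s, a) = R(s, a) + gamma * sum_s' P(s'|s, a) V(s')
    return R + gamma * np.einsum("sap,pk->sak", P, V)

def evaluate_policy(R, P, pi, gamma, iterations=500):
    """Fixed-point iteration: repeated application of the gamma-contraction
    T^pi converges to the Q-function (Q_r, Q_c) of the budgeted policy pi."""
    Q = np.zeros_like(R, dtype=float)
    for _ in range(iterations):
        Q = bellman_expectation_operator(Q, R, P, pi, gamma)
    return Q
```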
