  1. Budgeted Reinforcement Learning in Continuous State Space. Nicolas Carrara 1, Edouard Leurent 1,2, Tanguy Urvoy 3, Romain Laroche 4, Odalric Maillard 1, Olivier Pietquin 1,5. Affiliations: 1 Inria SequeL, 2 Renault Group, 3 Orange Labs, 4 Microsoft Montréal, 5 Google Research, Brain Team.

  2. Contents: 01. Motivation and Setting, 02. Budgeted Dynamic Programming, 03. Budgeted Reinforcement Learning, 04. Experiments.

  3. 01 Motivation and Setting

  4. Learning to act. Optimal Decision-Making: which action $a_t$ should we choose in state $s_t$ to maximise a cumulative reward $R$? $\max_\pi \; \mathbb{E}_{a_t \sim \pi(a_t \mid s_t)} \left[ \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \right]$

  5. Learning to act. Optimal Decision-Making: which action $a_t$ should we choose in state $s_t$ to maximise a cumulative reward $R$? $\max_\pi \; \mathbb{E}_{a_t \sim \pi(a_t \mid s_t)} \left[ \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \right]$ ✓ A very general formulation. ✓ Widely used in the industry.

  6. Learning to act. Optimal Decision-Making: which action $a_t$ should we choose in state $s_t$ to maximise a cumulative reward $R$? $\max_\pi \; \mathbb{E}_{a_t \sim \pi(a_t \mid s_t)} \left[ \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \right]$ ✓ A very general formulation. ✗ Not widely used in the industry.

  7. Learning to act. Optimal Decision-Making: which action $a_t$ should we choose in state $s_t$ to maximise a cumulative reward $R$? $\max_\pi \; \mathbb{E}_{a_t \sim \pi(a_t \mid s_t)} \left[ \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \right]$ ✓ A very general formulation. ✗ Not widely used in the industry: > Sample efficiency > Trial and error > Unpredictable behaviour
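The objective above can be made concrete with a short Monte-Carlo sketch. The snippet below is illustrative only: it assumes a Gym-style environment `env` (with `reset`/`step`) and a callable `policy(state)`, neither of which appears in the slides, and estimates the discounted return $\sum_t \gamma^t R(s_t, a_t)$ by averaging rollouts.

```python
import numpy as np

def discounted_return(env, policy, gamma=0.99, horizon=200, episodes=100):
    """Monte-Carlo estimate of E[sum_t gamma^t R(s_t, a_t)] for a given policy.

    `env` is assumed to follow a Gym-style reset()/step() interface and
    `policy(state)` to return an action; both are illustrative assumptions.
    """
    returns = []
    for _ in range(episodes):
        state = env.reset()
        total, discount = 0.0, 1.0
        for _ in range(horizon):
            action = policy(state)
            state, reward, done, _ = env.step(action)
            total += discount * reward
            discount *= gamma
            if done:
                break
        returns.append(total)
    return np.mean(returns)
```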

  8. Limitation of Reinforcement Learning. Reinforcement learning relies on a single reward function $R$.

  9. Limitation of Reinforcement Learning. Reinforcement learning relies on a single reward function $R$. ✓ A convenient formulation, but:

  10. Limitation of Reinforcement Learning. Reinforcement learning relies on a single reward function $R$. ✓ A convenient formulation, but: ✗ $R$ is not always easy to design.

  11. Limitation of Reinforcement Learning. Reinforcement learning relies on a single reward function $R$. ✓ A convenient formulation, but: ✗ $R$ is not always easy to design. Conflicting Objectives: complex tasks involve multiple contradictory aspects, typically Task completion vs Safety.

  12. Limitation of Reinforcement Learning. Reinforcement learning relies on a single reward function $R$. ✓ A convenient formulation, but: ✗ $R$ is not always easy to design. Conflicting Objectives: complex tasks involve multiple contradictory aspects, typically Task completion vs Safety. For example...

  13. Example problems with conflicts. Dialogue systems: a slot-filling problem, where the agent fills a form by asking the user for each slot. It can either: • ask the user to answer using voice (safe/slow); • ask the user to answer with a numeric pad (unsafe/fast).

  14. Example problems with conflicts. Dialogue systems: a slot-filling problem, where the agent fills a form by asking the user for each slot. It can either: • ask the user to answer using voice (safe/slow); • ask the user to answer with a numeric pad (unsafe/fast). Autonomous Driving: the agent is driving on a two-way road with a car in front of it; • it can stay behind (safe/slow); • it can overtake (unsafe/fast).

  15. Limitation of Reinforcement Learning. Reinforcement learning relies on a single reward function $R$. ✓ A convenient formulation, but: ✗ $R$ is not always easy to design. Conflicting Objectives: complex tasks involve multiple contradictory aspects, typically Task completion vs Safety. For example... For a fixed reward function $R$, there is no control over the Task Completion vs Safety trade-off: the optimal policy $\pi^*$ is only guaranteed to lie on a Pareto-optimal curve $\Pi^*$.

  16. The Pareto-optimal curve. [Figure: policies plotted with Task Completion return $G_1 = \sum_t \gamma^t R_1(s_t, a_t)$ on one axis and Safety return $G_2 = \sum_t \gamma^t R_2(s_t, a_t)$ on the other; $\operatorname{argmax}_\pi \sum_t \gamma^t R(s_t, a_t)$ for a fixed combination of $(R_1, R_2)$ lies on the Pareto-optimal curve $\Pi^*$.]

  17. From maximal safety to minimal risk. [Figure: same plot, now with axes Task Completion $G_r$ vs Risk $G_c$; the reward combination becomes $(R_r, -R_c)$, and $\operatorname{argmax}_\pi \sum_t \gamma^t R(s_t, a_t)$ still lies on the Pareto-optimal curve $\Pi^*$.]

  18. The optimal policy can move freely along $\Pi^*$. [Figure: Task Completion $G_r$ vs Risk $G_c$ plot; depending on the chosen combination $(R_r, -R_c)$, the optimal policy $\pi^*$ can land anywhere on the Pareto-optimal curve $\Pi^*$.]

  19. How to choose a desired trade-off. [Figure: Task Completion $G_r$ vs Risk $G_c$ plot; the policy $\pi^* = \operatorname{argmax}_\pi \sum_t \gamma^t R_r(s_t, a_t)$ s.t. $\sum_t \gamma^t R_c(s_t, a_t) \le \beta$ is the Pareto-optimal policy whose risk stays below the budget $\beta$ marked on the risk axis.]

  20. Constrained Reinforcement Learning. Markov Decision Process: an MDP is a tuple $(\mathcal{S}, \mathcal{A}, P, R_r, \gamma)$ with: • Rewards $R_r \in \mathbb{R}^{\mathcal{S} \times \mathcal{A}}$. Objective: maximise rewards, $\max_{\pi \in \mathcal{M}(\mathcal{A})^{\mathcal{S}}} \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R_r(s_t, a_t) \mid s_0 = s\right]$.

  21. Constrained Reinforcement Learning. Constrained Markov Decision Process: a CMDP is a tuple $(\mathcal{S}, \mathcal{A}, P, R_r, R_c, \gamma, \beta)$ with: • Rewards $R_r \in \mathbb{R}^{\mathcal{S} \times \mathcal{A}}$ • Costs $R_c \in \mathbb{R}^{\mathcal{S} \times \mathcal{A}}$ • Budget $\beta$. Objective: maximise rewards while keeping costs under a fixed budget, $\max_{\pi \in \mathcal{M}(\mathcal{A})^{\mathcal{S}}} \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R_r(s_t, a_t) \mid s_0 = s\right]$ s.t. $\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R_c(s_t, a_t) \mid s_0 = s\right] \le \beta$.
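To make the CMDP objective concrete, here is a minimal sketch of how one might check a policy against a fixed budget by Monte-Carlo rollouts. It assumes, purely for illustration, a Gym-style `env` whose `step()` reports a scalar cost in `info["cost"]`; none of these names come from the slides.

```python
import numpy as np

def evaluate_cmdp_policy(env, policy, gamma=0.99, beta=0.1,
                         horizon=200, episodes=100):
    """Estimate the reward return and cost return of a policy in a CMDP,
    and check whether the expected discounted cost stays under the budget beta.

    Assumes (illustratively) that env.step() reports the cost signal R_c
    in the info dict under the key "cost".
    """
    reward_returns, cost_returns = [], []
    for _ in range(episodes):
        state = env.reset()
        g_r, g_c, discount = 0.0, 0.0, 1.0
        for _ in range(horizon):
            action = policy(state)
            state, reward, done, info = env.step(action)
            g_r += discount * reward
            g_c += discount * info["cost"]
            discount *= gamma
            if done:
                break
        reward_returns.append(g_r)
        cost_returns.append(g_c)
    feasible = np.mean(cost_returns) <= beta  # constraint E[sum gamma^t R_c] <= beta
    return np.mean(reward_returns), np.mean(cost_returns), feasible
```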

  22. We want to learn $\Pi^*$ rather than $\pi^*_\beta$. [Figure: same Task Completion $G_r$ vs Risk $G_c$ plot, with the constrained optimum $\pi^* = \operatorname{argmax}_\pi \sum_t \gamma^t R_r(s_t, a_t)$ s.t. $\sum_t \gamma^t R_c(s_t, a_t) \le \beta$ shown at budget $\beta$ on the Pareto-optimal curve $\Pi^*$.]

  24. Budgeted Reinforcement Learning. Budgeted Markov Decision Process: a BMDP is a tuple $(\mathcal{S}, \mathcal{A}, P, R_r, R_c, \gamma, \mathcal{B})$ with: • Rewards $R_r \in \mathbb{R}^{\mathcal{S} \times \mathcal{A}}$ • Costs $R_c \in \mathbb{R}^{\mathcal{S} \times \mathcal{A}}$ • Budget space $\mathcal{B}$. Objective: maximise rewards while keeping costs under an adjustable budget, $\forall \beta \in \mathcal{B}$: $\max_{\pi \in \mathcal{M}(\mathcal{A} \times \mathcal{B})^{\mathcal{S} \times \mathcal{B}}} \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R_r(s_t, a_t) \mid s_0 = s, \beta_0 = \beta\right]$ s.t. $\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R_c(s_t, a_t) \mid s_0 = s, \beta_0 = \beta\right] \le \beta$.

  25. Problem formulation. Budgeted policies $\pi$: • take a budget $\beta$ as an additional input; • output a next budget $\beta'$; • $\pi : \underbrace{(s, \beta)}_{\bar{s}} \mapsto \underbrace{(a, \beta')}_{\bar{a}}$. In other words, augment the state and action spaces with the budget $\beta$.

  26. Augmented Setting. Definition (Augmented spaces): • States $\bar{\mathcal{S}} = \mathcal{S} \times \mathcal{B}$ • Actions $\bar{\mathcal{A}} = \mathcal{A} \times \mathcal{B}$ • Dynamics $\bar{P}$: from state $\bar{s} = (s, \beta)$ and action $\bar{a} = (a, \beta_a)$, the next state is $\bar{s}' = (s', \beta')$ with $s' \sim P(s' \mid s, a)$ and $\beta' = \beta_a$. Definition (Augmented signals): 1. Rewards $\bar{R} = (R_r, R_c)$ 2. Returns $G^\pi = (G^\pi_r, G^\pi_c) \overset{\text{def}}{=} \sum_{t=0}^{\infty} \gamma^t \bar{R}(\bar{s}_t, \bar{a}_t)$ 3. Value $V^\pi(\bar{s}) = (V^\pi_r, V^\pi_c) \overset{\text{def}}{=} \mathbb{E}[G^\pi \mid \bar{s}_0 = \bar{s}]$ 4. Q-Value $Q^\pi(\bar{s}, \bar{a}) = (Q^\pi_r, Q^\pi_c) \overset{\text{def}}{=} \mathbb{E}[G^\pi \mid \bar{s}_0 = \bar{s}, \bar{a}_0 = \bar{a}]$.
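The augmented dynamics can be illustrated with one transition step. The sketch below assumes a Gym-style `env` with the cost reported in `info["cost"]` and a callable `budgeted_policy(state, beta)` returning an augmented action; these names are illustrative placeholders, not part of the slides.

```python
import numpy as np

def augmented_step(env, budgeted_policy, state, beta):
    """One transition in the augmented BMDP described on the slide.

    The budgeted policy maps the augmented state (s, beta) to an augmented
    action (a, beta_a); the budget component of the next augmented state is
    simply beta_a, while s' follows the original dynamics P(s' | s, a).
    """
    action, beta_a = budgeted_policy(state, beta)       # a-bar = (a, beta_a)
    next_state, reward, done, info = env.step(action)   # s' ~ P(. | s, a)
    signal = np.array([reward, info["cost"]])           # R-bar = (R_r, R_c)
    return (next_state, beta_a), signal, done
```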

  27. 02 Budgeted Dynamic Programming

  28. Policy Evaluation. Proposition (Budgeted Bellman Expectation): the Bellman Expectation equations are preserved: $V^\pi(\bar{s}) = \sum_{\bar{a} \in \bar{\mathcal{A}}} \pi(\bar{a} \mid \bar{s}) \, Q^\pi(\bar{s}, \bar{a})$ and $Q^\pi(\bar{s}, \bar{a}) = \bar{R}(\bar{s}, \bar{a}) + \gamma \sum_{\bar{s}' \in \bar{\mathcal{S}}} \bar{P}(\bar{s}' \mid \bar{s}, \bar{a}) \, V^\pi(\bar{s}')$.

  29. Policy Evaluation. Proposition (Budgeted Bellman Expectation): the Bellman Expectation equations are preserved: $V^\pi(\bar{s}) = \sum_{\bar{a} \in \bar{\mathcal{A}}} \pi(\bar{a} \mid \bar{s}) \, Q^\pi(\bar{s}, \bar{a})$ and $Q^\pi(\bar{s}, \bar{a}) = \bar{R}(\bar{s}, \bar{a}) + \gamma \sum_{\bar{s}' \in \bar{\mathcal{S}}} \bar{P}(\bar{s}' \mid \bar{s}, \bar{a}) \, V^\pi(\bar{s}')$. Proposition (Contraction): the Bellman Expectation Operator $\mathcal{T}^\pi$ is a $\gamma$-contraction, where $\mathcal{T}^\pi Q(\bar{s}, \bar{a}) \overset{\text{def}}{=} \bar{R}(\bar{s}, \bar{a}) + \gamma \sum_{\bar{s}' \in \bar{\mathcal{S}}} \sum_{\bar{a}' \in \bar{\mathcal{A}}} \bar{P}(\bar{s}' \mid \bar{s}, \bar{a}) \, \pi(\bar{a}' \mid \bar{s}') \, Q(\bar{s}', \bar{a}')$. ✓ We can evaluate a budgeted policy $\pi$.
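For a finite augmented MDP given in tabular form, the operator $\mathcal{T}^\pi$ above can be applied directly to a vector-valued Q-function and iterated to its fixed point (convergence follows from the $\gamma$-contraction property). The sketch below is a minimal tabular illustration with assumed array shapes, not the continuous-state algorithm of the paper.

```python
import numpy as np

def bellman_expectation_operator(Q, R, P, pi, gamma):
    """One application of T^pi to a vector-valued Q-function on a finite
    (augmented) MDP, following the Budgeted Bellman Expectation proposition.

    Assumed shapes: Q and R are (S, A, 2) arrays holding the reward and cost
    components; P is (S, A, S) with P[s, a, s'] = P(s'|s, a); pi is (S, A)
    with rows summing to 1.
    """
    # V[s', k] = sum_a' pi(a'|s') Q(s', a', k)
    V = np.einsum("sa,sak->sk", pi, Q)
    # T^pi Q(s, a) = R(s, a) + gamma * sum_s' P(s'|s, a) V(s')
    return R + gamma * np.einsum("sap,pk->sak", P, V)

def evaluate_policy(R, P, pi, gamma, iterations=500):
    """Fixed-point iteration: repeated application of the gamma-contraction
    T^pi converges to the Q-function (Q_r, Q_c) of the budgeted policy pi."""
    Q = np.zeros_like(R, dtype=float)
    for _ in range(iterations):
        Q = bellman_expectation_operator(Q, R, P, pi, gamma)
    return Q
```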
