Limitation of Reinforcement Learning

Reinforcement learning relies on a single reward function R.
✓ A convenient formulation, but
✗ R is not always easy to design.

Conflicting Objectives
Complex tasks involve multiple contradictory aspects, typically: task completion vs. safety.
For example...
Example problems with conflicts

Two-Way Road
The agent is driving on a two-way road with a car in front of it:
• it can stay behind (safe / slow);
• it can overtake (unsafe / fast).
Limitation of Reinforcement Learning

For a fixed reward function R, the optimal policy π* is only guaranteed to lie on a Pareto front Π*:
we have no control over the task-completion vs. safety trade-off.
The Pareto front

(figure: the Pareto-optimal curve Π* in the plane of task completion G_1 = Σ_t γ^t R_1 versus safety G_2 = Σ_t γ^t R_2; a policy argmax_π Σ_t γ^t R obtained from a fixed scalar reward lands somewhere on Π*)
From maximal safety to minimal risk

(figure: the same Pareto plot, now with axes task completion G_r and risk G_c; the signal becomes (R_r, −R_c) and the policy is argmax_π of its discounted sum)
The optimal policy can move freely along Π*

(figure: the optimal policy π* can land anywhere on the Pareto-optimal curve Π* in the task completion G_r vs. risk G_c plane)
How to choose a desired trade-off

(figure: pick π* = argmax_π Σ_t γ^t R_r(s_t, a_t) subject to Σ_t γ^t R_c(s_t, a_t) ≤ β; the budget β on the risk axis selects a point of Π*)
Constrained Reinforcement Learning

Markov Decision Process
An MDP is a tuple (S, A, P, R_r, γ) with:
• rewards R_r ∈ ℝ^{S×A}.

Objective: maximise rewards
    max_{π ∈ M(A)^S}  E[ Σ_{t=0}^∞ γ^t R_r(s_t, a_t) | s_0 = s ]
Constrained Reinforcement Learning

Constrained Markov Decision Process
A CMDP is a tuple (S, A, P, R_r, R_c, γ, β) with:
• rewards R_r ∈ ℝ^{S×A};
• costs R_c ∈ ℝ^{S×A};
• a budget β.

Objective: maximise rewards while keeping costs under a fixed budget
    max_{π ∈ M(A)^S}  E[ Σ_{t=0}^∞ γ^t R_r(s_t, a_t) | s_0 = s ]
    s.t.              E[ Σ_{t=0}^∞ γ^t R_c(s_t, a_t) | s_0 = s ] ≤ β
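To make this objective concrete, here is a minimal sketch (assuming a hypothetical CMDP-style environment interface of my own, where each step returns both a reward and a cost) that estimates the two discounted returns of a policy by Monte Carlo rollouts and checks the budget constraint.

```python
import numpy as np

def evaluate_policy(env, policy, beta, gamma=0.9, episodes=100, horizon=200):
    """Monte Carlo estimate of the discounted reward and cost returns of a policy.

    Assumes a hypothetical interface: env.reset() -> state,
    env.step(action) -> (state, reward, cost, done), policy(state) -> action.
    """
    reward_returns, cost_returns = [], []
    for _ in range(episodes):
        state = env.reset()
        g_r, g_c, discount = 0.0, 0.0, 1.0
        for _ in range(horizon):
            action = policy(state)
            state, reward, cost, done = env.step(action)
            g_r += discount * reward
            g_c += discount * cost
            discount *= gamma
            if done:
                break
        reward_returns.append(g_r)
        cost_returns.append(g_c)
    # CMDP objective: maximise E[G_r] subject to E[G_c] <= beta.
    return np.mean(reward_returns), np.mean(cost_returns), np.mean(cost_returns) <= beta
```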
We want to learn Π* rather than π*_β

(figure: the same constrained objective argmax_π Σ_t γ^t R_r s.t. Σ_t γ^t R_c ≤ β, but we now want the whole Pareto-optimal curve Π* rather than the single policy π*_β selected by one budget β)
Budgeted Reinforcement Learning

Budgeted Markov Decision Process
A BMDP is a tuple (S, A, P, R_r, R_c, γ, B) with:
• rewards R_r ∈ ℝ^{S×A};
• costs R_c ∈ ℝ^{S×A};
• a budget space B.

Objective: maximise rewards while keeping costs under an adjustable budget
    ∀β ∈ B,  max_{π ∈ M(A×B)^{S×B}}  E[ Σ_{t=0}^∞ γ^t R_r(s_t, a_t) | s_0 = s, β_0 = β ]
             s.t.                    E[ Σ_{t=0}^∞ γ^t R_c(s_t, a_t) | s_0 = s, β_0 = β ] ≤ β
Problem formulation

Budgeted policies π
• take a budget β as an additional input;
• output a next budget β′:
    π : (s, β) ↦ (a, β′)
where (s, β) plays the role of an augmented state and (a, β′) of an augmented action: we augment the spaces with the budget β.
Augmented Setting

Definition (Augmented spaces)
• States: S̄ = S × B.
• Actions: Ā = A × B.
• Dynamics P̄: from state (s, β) and action (a, β_a), the next state is (s′, β′) with s′ ∼ P(s′ | s, a) and β′ = β_a.

Definition (Augmented signals)
1. Rewards:  R̄ = (R_r, R_c)
2. Returns:  Ḡ^π = (G^π_r, G^π_c) := Σ_{t=0}^∞ γ^t R̄(s̄_t, ā_t)
3. Value:    V̄^π(s̄) = (V^π_r, V^π_c) := E[ Ḡ^π | s̄_0 = s̄ ]
4. Q-value:  Q̄^π(s̄, ā) = (Q^π_r, Q^π_c) := E[ Ḡ^π | s̄_0 = s̄, ā_0 = ā ]
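To make the budget dynamics concrete, here is a minimal sketch (generic Python, with a hypothetical env.step interface of my own) of one augmented transition: the underlying dynamics act on (state, action) as usual, and the carried budget is simply replaced by the budget part of the action.

```python
def augmented_step(env, state, budget, action, next_budget):
    """One transition of the augmented BMDP.

    The underlying dynamics P only see (state, action); the budget component
    of the augmented state is set by the agent's chosen next_budget,
    i.e. beta' = beta_a, as in the augmented dynamics above.
    """
    next_state, reward, cost, done = env.step(action)   # hypothetical interface
    augmented_state = (next_state, next_budget)          # s̄' = (s', β_a)
    augmented_reward = (reward, cost)                     # R̄ = (R_r, R_c)
    return augmented_state, augmented_reward, done
```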
Budgeted Optimality

Definition (Budgeted optimality). In that order, we want to:
(i)   respect the budget β:   Π_a(s̄) := { π ∈ Π : V^π_c(s, β) ≤ β }
(ii)  maximise the rewards:   V*_r(s̄) := max_{π ∈ Π_a(s̄)} V^π_r(s̄),   Π_r(s̄) := argmax_{π ∈ Π_a(s̄)} V^π_r(s̄)
(iii) minimise the costs:     V*_c(s̄) := min_{π ∈ Π_r(s̄)} V^π_c(s̄),   Π*(s̄) := argmin_{π ∈ Π_r(s̄)} V^π_c(s̄)

We define the budgeted action-value function Q̄* similarly.
Budgeted Optimality

Theorem (Budgeted Bellman Optimality Equation). Q̄* verifies:
    Q̄*(s̄, ā) = T Q̄*(s̄, ā) := R̄(s̄, ā) + γ Σ_{s̄′ ∈ S̄} P̄(s̄′ | s̄, ā) Σ_{ā′ ∈ Ā} π_greedy(ā′ | s̄′; Q̄*) Q̄*(s̄′, ā′)

where the greedy policy π_greedy is defined by:
    π_greedy(ā | s̄; Q) ∈ argmin_{ρ ∈ Π_Q}  E_{ā ∼ ρ} Q_c(s̄, ā)
    where  Π_Q := argmax_{ρ ∈ M(Ā)}  E_{ā ∼ ρ} Q_r(s̄, ā)   s.t.   E_{ā ∼ ρ} Q_c(s̄, ā) ≤ β
The optimal policy

Proposition (Optimality of the policy)
π_greedy(·; Q̄*) is simultaneously optimal in all states s̄ ∈ S̄: π_greedy(·; Q̄*) ∈ Π*(s̄).
In particular, V^{π_greedy(·; Q̄*)} = V̄* and Q^{π_greedy(·; Q̄*)} = Q̄*.

Proposition (Solving the non-linear program)
π_greedy can be computed efficiently, as a mixture π_hull of two points that lie on the convex hull of Q̄: π_greedy = π_hull.
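The following is a minimal sketch of such a hull mixture for a finite set of candidate actions with known Q-values (my own illustration of the construction, not the thesis implementation; the action representation and tie-breaking are arbitrary choices).

```python
def greedy_budgeted_policy(q_values, beta):
    """Sketch of pi_hull: a mixture of two points on the convex hull of Q.

    q_values: list of (action, q_cost, q_reward) for the candidate augmented actions.
    Returns [(action, probability), ...], a mixture of at most two actions whose
    expected cost respects the budget beta while maximising expected reward.
    """
    pts = sorted(q_values, key=lambda x: (x[1], -x[2]))   # by cost, then by reward (desc)
    # Upper convex hull in the (cost, reward) plane (monotone chain).
    hull = []
    for p in pts:
        while len(hull) >= 2:
            (_, c1, r1), (_, c2, r2) = hull[-2], hull[-1]
            _, c3, r3 = p
            # Pop the middle point if it lies on or below the segment (non-right turn).
            if (c2 - c1) * (r3 - r1) >= (c3 - c1) * (r2 - r1):
                hull.pop()
            else:
                break
        hull.append(p)
    # Keep only the ascending (Pareto) part, up to the maximum-reward vertex.
    top = max(range(len(hull)), key=lambda i: hull[i][2])
    hull = hull[:top + 1]
    if beta <= hull[0][1]:
        return [(hull[0][0], 1.0)]        # cheapest vertex: deterministic
    if beta >= hull[-1][1]:
        return [(hull[-1][0], 1.0)]       # most rewarding vertex: deterministic
    # Mix the two consecutive hull vertices whose costs bracket the budget.
    for (a1, c1, _), (a2, c2, _) in zip(hull[:-1], hull[1:]):
        if c1 <= beta <= c2:
            p = (beta - c1) / (c2 - c1) if c2 > c1 else 1.0
            return [(a1, 1.0 - p), (a2, p)]
```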
Convergence analysis

Recall what we have shown so far: the operator T admits Q̄* as a fixed point; π_hull(Q̄*) is tractable; π_hull(Q̄*) = π_greedy(Q̄*); and π_greedy(Q̄*) is optimal.

We're almost there! All that is left is to perform fixed-point iteration to compute Q̄*.

Theorem (Non-contractivity). For any BMDP (S, A, P, R_r, R_c, γ) with |A| ≥ 2, T is not a contraction:
    ∀ε > 0, ∃ Q̄_1, Q̄_2 ∈ (ℝ²)^{S̄×Ā} :  ‖T Q̄_1 − T Q̄_2‖_∞ ≥ (1/ε) ‖Q̄_1 − Q̄_2‖_∞
✗ We cannot guarantee the convergence of T^n(Q̄_0) to Q̄*.
Convergence analysis

Thankfully,

Theorem (Contractivity on smooth Q-functions). T is a contraction when restricted to the subset L_γ of Q-functions such that "Q_r is L-Lipschitz with respect to Q_c", with L < 1/γ − 1:
    L_γ = { Q̄ ∈ (ℝ²)^{S̄×Ā}  s.t.  ∃ L < 1/γ − 1 :  ∀ s̄ ∈ S̄, ā_1, ā_2 ∈ Ā,  |Q_r(s̄, ā_1) − Q_r(s̄, ā_2)| ≤ L |Q_c(s̄, ā_1) − Q_c(s̄, ā_2)| }

✓ We guarantee convergence under some (strong) assumptions.
✓ We observe empirical convergence.
Experiments

Lagrangian Relaxation Baseline
Consider the dual problem, so as to replace the hard constraint by a soft constraint penalised by a Lagrangian multiplier λ:
    max_π  E[ Σ_t γ^t R_r(s_t, a_t) − λ γ^t R_c(s_t, a_t) ]
• Train many policies π_k with penalties λ_k and recover the corresponding cost budgets β_k.
• Very data- and memory-heavy.
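As a sketch of this baseline (with generic train_policy and evaluate helpers of my own, not the actual experimental code): scalarise the reward with each penalty λ_k, train a policy, and record the cost budget that this policy happens to satisfy.

```python
def lagrangian_relaxation_baseline(train_policy, evaluate, lambdas):
    """Train one policy per penalty weight and recover its realised cost budget.

    Assumed helpers: train_policy(reward_fn) -> policy trained on the scalarised
    reward; evaluate(policy) -> (G_r, G_c), Monte Carlo estimates of the
    discounted reward and cost returns.
    """
    policies = {}
    for lam in lambdas:
        # Scalarised reward R_r - lambda * R_c: the soft-constraint surrogate.
        reward_fn = lambda reward, cost, lam=lam: reward - lam * cost
        pi = train_policy(reward_fn)
        g_r, g_c = evaluate(pi)
        policies[lam] = (pi, g_c)   # beta_k := cost return achieved by pi_k
    # One full training run per lambda_k: very data- and memory-heavy.
    return policies
```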
Experiments

(figure: experimental results comparing the reward return G^π_r and the cost return G^π_c of the learned policies)
04
Efficient Model-Based
Principle

Model estimation
Learn a model T̂(s_{t+1} | s_t, a_t) of the dynamics. For instance:
1. Least-squares estimate:       min_{T̂}  Σ_t ‖ s_{t+1} − T̂(s_t, a_t) ‖²_2
2. Maximum-likelihood estimate:  max_{T̂}  Π_t T̂(s_{t+1} | s_t, a_t)

Planning
Leverage T̂ to compute
    max_π  E[ Σ_{t=0}^∞ γ^t r(s_t, a_t)  |  a_t ∼ π(s_t),  s_{t+1} ∼ T̂(s_t, a_t) ]
How?
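As a minimal sketch of the least-squares variant (using a plain linear model and names of my own choosing; the actual model class can be much richer), T̂ is fitted by regression on observed transitions:

```python
import numpy as np

def fit_linear_dynamics(states, actions, next_states):
    """Least-squares estimate of a linear dynamics model s' ≈ A s + B a.

    states, actions, next_states: arrays of shape (N, d_s), (N, d_a), (N, d_s)
    collected from interaction data.
    """
    X = np.hstack([states, actions])                    # regressors (s_t, a_t)
    # Solves min_theta sum_t || s_{t+1} - theta^T (s_t, a_t) ||^2
    theta, *_ = np.linalg.lstsq(X, next_states, rcond=None)
    return theta.T                                       # shape (d_s, d_s + d_a)

def predict(theta, state, action):
    """One-step prediction with the fitted model T̂."""
    return theta @ np.concatenate([state, action])
```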
Online Planning

We can use T̂ as a generative model:
(diagram: the agent sends the current state to a planner; the planner, simulating with T̂, returns a recommendation; the agent executes the action in the environment and observes the next state and reward)
Planning performance

Online planning
• Fixed budget: the model can only be queried n times.
Objective: minimise the expected simple regret E r_n, with r_n = V* − V(a_n), where a_n is the action recommended after the n queries.
An exploration-exploitation problem.
Optimistic Planning

Optimism in the Face of Uncertainty
Given a set of options a ∈ A with uncertain outcomes, try the one with the highest possible outcome.
• Either you performed well;
• or you learned something.

Instances
• Monte-Carlo Tree Search (MCTS) (Coulom, 2006): CrazyStone.
• Reframed in the bandit setting as UCT (Kocsis and Szepesvári, 2006); still very popular (e.g. AlphaGo).
• Proved asymptotically consistent, but with no regret bound.
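For reference, here is a minimal sketch of the UCB-style selection rule at the heart of UCT (a generic illustration with assumed node fields, not the CrazyStone or AlphaGo implementations): at each node, descend into the child maximising an optimistic value estimate.

```python
import math

def uct_select(node, c=math.sqrt(2)):
    """Return the (action, child) pair with the highest upper confidence bound.

    Assumed fields: node.visits, node.children (dict action -> child),
    child.visits and child.total_value (sum of sampled returns).
    """
    def ucb(child):
        if child.visits == 0:
            return float("inf")            # unvisited children are maximally optimistic
        mean = child.total_value / child.visits
        bonus = c * math.sqrt(math.log(node.visits) / child.visits)
        return mean + bonus
    return max(node.children.items(), key=lambda kv: ucb(kv[1]))
```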
Analysis of UCT

It was analysed by Coquelin and Munos (2007): the sample complexity of UCT is lower-bounded by Ω(exp(exp(D))).
Failing cases of UCT

Not just a theoretical counter-example.
Can we get better guarantees?

OPD: Optimistic Planning for Deterministic systems
• Introduced by Hren and Munos (2008).
• Another optimistic algorithm.
• Only for deterministic MDPs.

Theorem (OPD sample complexity)
    E r_n = O( n^{−(log 1/γ) / (log κ)} ),   if κ > 1

OLOP: Open-Loop Optimistic Planning
• Introduced by Bubeck and Munos (2010).
• Extends OPD to the stochastic setting.
• Only considers open-loop policies, i.e. sequences of actions.
The idea behind OLOP

A direct application of Optimism in the Face of Uncertainty:
1. We want max_a V(a).
2. Form upper confidence bounds on the sequence values: V(a) ≤ U_a w.h.p.
3. Sample the sequence with the highest UCB: argmax_a U_a.
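A minimal sketch of this loop (generic names and a simplified recommendation rule of my own; OLOP's exact episode schedule and bound sharpening are detailed below):

```python
def optimistic_planning_loop(model, sequences, n_episodes, compute_ucb):
    """Generic optimism-based open-loop planner.

    sequences: candidate action sequences a = (a_1, ..., a_L), as tuples;
    compute_ucb(stats, a): assumed helper returning an upper confidence bound
    U_a on V(a) from the rollout statistics gathered so far (+inf if unvisited).
    """
    stats = {}                                  # sequence -> list of reward lists
    for _ in range(n_episodes):
        # 1. Be optimistic: pick the sequence with the highest UCB on its value.
        a = max(sequences, key=lambda seq: compute_ucb(stats, seq))
        # 2. Follow it in the stochastic model and record the observed rewards.
        rewards = model.rollout(a)              # hypothetical generative-model call
        stats.setdefault(a, []).append(rewards)
    # 3. Recommend the sequence with the best empirical mean return (simplified).
    return max(stats, key=lambda seq: sum(sum(r) for r in stats[seq]) / len(stats[seq]))
```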
Under the hood

Upper-bounding the value of sequences:
    V(a) = Σ_{t=1}^{h} γ^t μ_{a_{1:t}}  +  Σ_{t ≥ h+1} γ^t μ_{a*_{1:t}}
           (follow the sequence)            (then act optimally)
with μ_{a_{1:t}} ≤ U^μ and μ_{a*_{1:t}} ≤ 1.
Under the hood

OLOP's main tool: the Chernoff-Hoeffding deviation inequality
    U^μ_a(m) := μ̂_a(m) + sqrt( 2 log M / T_a(m) )
    (upper bound = empirical mean + confidence interval)

As in OPD, upper-bound all the future rewards by 1:
    U_a(m) := Σ_{t=1}^{h} γ^t U^μ_{a_{1:t}}(m)  +  γ^{h+1} / (1 − γ)
              (past rewards)                        (future rewards)

Bound sharpening:
    B_a(m) := inf_{1 ≤ t ≤ L} U_{a_{1:t}}(m)
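A minimal sketch of these three quantities (generic Python; M is the number of episodes, and passing the per-prefix statistics explicitly is an arbitrary choice of mine):

```python
import math

def hoeffding_ucb(mean_hat, count, n_episodes):
    """Chernoff-Hoeffding upper bound U^mu on a mean reward in [0, 1]."""
    if count == 0:
        return float("inf")
    return mean_hat + math.sqrt(2.0 * math.log(n_episodes) / count)

def sequence_ucb(prefix_reward_ucbs, gamma):
    """U_a(m): discounted sum of the reward UCBs along the sequence prefix,
    plus gamma^{h+1} / (1 - gamma) for the unobserved future rewards."""
    h = len(prefix_reward_ucbs)
    past = sum(gamma ** t * u for t, u in enumerate(prefix_reward_ucbs, start=1))
    future = gamma ** (h + 1) / (1.0 - gamma)
    return past + future

def sharpened_bound(ucbs_of_all_prefixes):
    """B_a(m): the tightest sequence bound over all prefixes a_{1:t}, t = 1..L."""
    return min(ucbs_of_all_prefixes)
```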
OLOP guarantees

Theorem (OLOP sample complexity). OLOP satisfies:
    E r_n = Õ( n^{−(log 1/γ) / (log κ′)} )   if γ√κ′ > 1,
    E r_n = Õ( n^{−1/2} )                    if γ√κ′ ≤ 1.

"Remarkably, in the case κγ² > 1, we obtain the same rate for the simple regret as Hren and Munos (2008). Thus, in this case, we can say that planning in stochastic environments is not harder than planning in deterministic environments."
Does it work?

Our objective: understand and bridge this gap. Make OLOP practical.
What's wrong with OLOP?

Explanation: inconsistency
• Unintended behaviour happens when U^μ_a(m) > 1 for all a:
      U^μ_a(m) = μ̂_a(m) + sqrt( 2 log M / T_a(m) )
                 (μ̂_a(m) ∈ [0, 1], confidence term > 0)
• Then the sequence (U_{a_{1:t}}(m))_t is increasing:
      U_{a_{1:1}}(m) = γ U^μ_{a_1}(m) + γ²·1 + γ³·1 + ...
      U_{a_{1:2}}(m) = γ U^μ_{a_1}(m) + γ² U^μ_{a_{1:2}}(m) + γ³·1 + ...,   with U^μ_{a_{1:2}}(m) > 1
• Then B_a(m) = U_{a_{1:1}}(m): the bound sharpening degenerates.
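To see the mechanism numerically, here is a tiny illustration (my own, with arbitrary values) of the prefix bounds U_{a_{1:h}}(m) when every reward UCB exceeds 1, versus when it does not:

```python
def prefix_bounds(reward_ucbs, gamma=0.9):
    """U_{a_{1:h}}(m) for h = 1..L, given the reward UCBs along the sequence."""
    bounds = []
    for h in range(1, len(reward_ucbs) + 1):
        past = sum(gamma ** t * reward_ucbs[t - 1] for t in range(1, h + 1))
        future = gamma ** (h + 1) / (1.0 - gamma)
        bounds.append(past + future)
    return bounds

# If every reward UCB exceeds 1 (few samples), the prefix bounds increase with
# depth, so B_a(m) = min over prefixes is always attained at depth 1 and no
# longer discriminates between sequences.
print(prefix_bounds([1.3, 1.3, 1.3]))   # increasing: sharpening is useless
print(prefix_bounds([0.5, 0.5, 0.5]))   # decreasing: sharpening is informative
```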
What's wrong with OLOP?

What we were promised vs. what we actually get (figures): in practice, OLOP behaves as uniform planning!
Our contribution: Kullback-Leibler OLOP

We summon the upper confidence bound from kl-UCB (Cappé et al., 2013):
    U^μ_a(m) := max { q ∈ I :  T_a(m) d(μ̂_a(m), q) ≤ f(m) }

Algorithm      | OLOP      | KL-OLOP
Interval I     | ℝ         | [0, 1]
Divergence d   | d_QUAD    | d_BER
f(m)           | 4 log M   | 2 log M + 2 log log M

with  d_QUAD(p, q) := 2(p − q)²  and  d_BER(p, q) := p log(p/q) + (1 − p) log((1 − p)/(1 − q)).
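A minimal sketch of how such a kl-UCB index can be computed in practice, by bisection on q (generic code of mine, not the thesis implementation):

```python
import math

def bernoulli_kl(p, q, eps=1e-12):
    """d_BER(p, q) = p log(p/q) + (1-p) log((1-p)/(1-q)), clipped for stability."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb(mean_hat, count, threshold, iterations=50):
    """max { q in [0, 1] : count * d_BER(mean_hat, q) <= threshold }.

    threshold plays the role of f(m), e.g. 2 log M + 2 log log M.
    d_BER(mean_hat, .) is increasing on [mean_hat, 1], so bisection applies.
    """
    if count == 0:
        return 1.0                      # no data: the most optimistic value in [0, 1]
    low, high = mean_hat, 1.0
    for _ in range(iterations):
        mid = (low + high) / 2.0
        if count * bernoulli_kl(mean_hat, mid) <= threshold:
            low = mid
        else:
            high = mid
    return low
```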
Our contribution: Kullback-Leibler OLOP

(figure: the divergence q ↦ d_BER(μ̂_a, q) on [0, 1], with the level f(m)/T_a and the resulting confidence bounds L^μ_a and U^μ_a around μ̂_a)

And now,
• U^μ_a(m) ∈ I = [0, 1] for all a;
• the sequence (U_{a_{1:t}}(m))_t is non-increasing;
• B_a(m) = U_a(m): the bound sharpening step is superfluous.
Sample complexity

Theorem (Sample complexity). KL-OLOP enjoys the same regret bounds as OLOP. More precisely, KL-OLOP satisfies:
    E r_n = Õ( n^{−(log 1/γ) / (log κ′)} )   if γ√κ′ > 1,
    E r_n = Õ( n^{−1/2} )                    if γ√κ′ ≤ 1.
Experiments: Expanded Trees
(figures)