Reinforcement Learning for Safe Decision-Making in Autonomous Driving - PowerPoint PPT Presentation

Reinforcement Learning for Safe Decision-Making in Autonomous Driving
Edouard Leurent (1, 2, 3), Odalric-Ambrym Maillard (1), Denis Efimov (2)
(1) Inria SequeL, (2) Inria Valse, (3) Renault Group
01 Motivation and Scope


  1. Limitation of Reinforcement Learning
  Reinforcement learning relies on a single reward function R: a convenient formulation, but R is not always easy to design.
  Conflicting Objectives: complex tasks involve multiple contradictory aspects, typically task completion vs. safety. For example...

  2. Example problems with conflicts
  Two-Way Road: the agent is driving on a two-way road with a car in front of it;
  • it can stay behind (safe/slow);
  • it can overtake (unsafe/fast).

  3. Limitation of Reinforcement Learning (continued)
  For a fixed reward function R, the optimal policy π* is only guaranteed to lie on a Pareto front Π*: there is no control over the trade-off between task completion and safety.

  4. The Pareto front
  [Figure] Each policy π maps to a point in the plane spanned by Task Completion G_1 = Σ_t γ^t R_1(s_t, a_t) and Safety G_2 = Σ_t γ^t R_2(s_t, a_t), with reward vector (R_1, R_2); the policies argmax_π Σ_t γ^t R(s_t, a_t) lie on the Pareto-optimal curve Π*.

  5. From maximal safety to minimal risk
  [Figure] Replacing the Safety axis by a Risk axis: the plane becomes (Task Completion G_r, Risk G_c), the reward vector is (R_r, −R_c), and argmax_π Σ_t γ^t R(s_t, a_t) still selects a point on the Pareto-optimal curve Π*.

  6. The optimal policy can move freely along Π*
  [Figure] The optimal policy π* = argmax_π Σ_t γ^t R(s_t, a_t) can land anywhere on the Pareto-optimal curve Π* in the (Task Completion G_r, Risk G_c) plane.

  7. How to choose a desired trade-off
  [Figure] Impose a budget β on the risk axis and solve the constrained problem: argmax_π Σ_t γ^t R_r(s_t, a_t) s.t. Σ_t γ^t R_c(s_t, a_t) ≤ β; the solution π* lies on Π* within the budget β.

  8. Constrained Reinforcement Learning
  Markov Decision Process: an MDP is a tuple (S, A, P, R_r, γ) with:
  • Rewards R_r ∈ R^(S×A)
  Objective: maximise rewards,
  max_{π ∈ M(A)^S} E[ Σ_{t=0}^∞ γ^t R_r(s_t, a_t) | s_0 = s ]

  9. Constrained Reinforcement Learning
  Constrained Markov Decision Process: a CMDP is a tuple (S, A, P, R_r, R_c, γ, β) with:
  • Rewards R_r ∈ R^(S×A)
  • Costs R_c ∈ R^(S×A)
  • Budget β
  Objective: maximise rewards while keeping costs under a fixed budget,
  max_{π ∈ M(A)^S} E[ Σ_{t=0}^∞ γ^t R_r(s_t, a_t) | s_0 = s ]
  s.t. E[ Σ_{t=0}^∞ γ^t R_c(s_t, a_t) | s_0 = s ] ≤ β
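
As a rough illustration, the reward and cost returns above can be estimated from Monte Carlo rollouts to check whether a given policy respects the budget. The environment and policy interfaces below are hypothetical placeholders, not part of the slides.

```python
import numpy as np

def discounted_returns(env, policy, gamma, n_episodes=100, horizon=200):
    """Monte Carlo estimate of E[sum gamma^t R_r] and E[sum gamma^t R_c].

    env.reset() / env.step(a) are assumed to return a cost alongside the
    usual reward (illustrative interface only).
    """
    rewards, costs = [], []
    for _ in range(n_episodes):
        s = env.reset()
        g_r, g_c, discount = 0.0, 0.0, 1.0
        for _ in range(horizon):
            a = policy(s)
            s, r, c, done = env.step(a)  # reward r, cost c
            g_r += discount * r
            g_c += discount * c
            discount *= gamma
            if done:
                break
        rewards.append(g_r)
        costs.append(g_c)
    return np.mean(rewards), np.mean(costs)

# A policy is feasible for the CMDP if its estimated cost return is <= beta.
```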

  10. We want to learn Π* rather than π*_β
  [Figure] The same constrained problem, argmax_π Σ_t γ^t R_r(s_t, a_t) s.t. Σ_t γ^t R_c(s_t, a_t) ≤ β, but instead of solving it for a single budget β we want to recover the whole Pareto-optimal curve Π*.

  12. Budgeted Reinforcement Learning
  Budgeted Markov Decision Process: a BMDP is a tuple (S, A, P, R_r, R_c, γ, B) with:
  • Rewards R_r ∈ R^(S×A)
  • Costs R_c ∈ R^(S×A)
  • Budget space B
  Objective: maximise rewards while keeping costs under an adjustable budget. For all β ∈ B,
  max_{π ∈ M(A×B)^(S×B)} E[ Σ_{t=0}^∞ γ^t R_r(s_t, a_t) | s_0 = s, β_0 = β ]
  s.t. E[ Σ_{t=0}^∞ γ^t R_c(s_t, a_t) | s_0 = s, β_0 = β ] ≤ β

  13. Problem formulation
  Budgeted policies π:
  • take a budget β as an additional input;
  • output a next budget β′.
  π : (s, β) ↦ (a, β′), i.e. augmented state s̄ = (s, β) and augmented action ā = (a, β′).
  We augment the spaces with the budget β.

  14. Augmented Setting
  Definition (Augmented spaces)
  • States S̄ = S × B
  • Actions Ā = A × B
  • Dynamics P̄: from state s̄ = (s, β) and action ā = (a, β_a), the next state is s̄′ = (s′, β′) with s′ ∼ P(s′ | s, a) and β′ = β_a.
  Definition (Augmented signals)
  1. Rewards R̄ = (R_r, R_c)
  2. Returns Ḡ^π = (G^π_r, G^π_c) := Σ_{t=0}^∞ γ^t R̄(s̄_t, ā_t)
  3. Value V̄^π(s̄) = (V^π_r, V^π_c) := E[ Ḡ^π | s̄_0 = s̄ ]
  4. Q-Value Q̄^π(s̄, ā) = (Q^π_r, Q^π_c) := E[ Ḡ^π | s̄_0 = s̄, ā_0 = ā ]
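
A small sketch of the augmented dynamics above: the budget travels inside the state, the chosen next budget is part of the action, and the reward signal is two-dimensional. The `base_step` function is a hypothetical stand-in for the original MDP dynamics and signals, not code from the talk.

```python
from typing import Callable, Tuple

def augmented_step(
    base_step: Callable[[object, object], Tuple[object, float, float]],
    state: Tuple[object, float],
    action: Tuple[object, float],
) -> Tuple[Tuple[object, float], Tuple[float, float]]:
    """One transition of the augmented BMDP.

    base_step(s, a) -> (s_next, reward, cost) is assumed to implement the
    original dynamics P and the signals R_r, R_c.
    """
    (s, beta), (a, beta_a) = state, action
    s_next, reward, cost = base_step(s, a)
    next_state = (s_next, beta_a)      # the output budget becomes the new budget
    vector_reward = (reward, cost)     # R = (R_r, R_c)
    return next_state, vector_reward
```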

  15. Budgeted Optimality
  Definition (Budgeted Optimality). In that order, we want to:
  (i) Respect the budget β: Π_a(s̄) := { π ∈ Π : V^π_c(s, β) ≤ β }
  (ii) Maximise the rewards: V*_r(s̄) := max_{π ∈ Π_a(s̄)} V^π_r(s̄), Π_r(s̄) := argmax_{π ∈ Π_a(s̄)} V^π_r(s̄)
  (iii) Minimise the costs: V*_c(s̄) := min_{π ∈ Π_r(s̄)} V^π_c(s̄), Π*(s̄) := argmin_{π ∈ Π_r(s̄)} V^π_c(s̄)
  We define the budgeted action-value function Q* similarly.

  16. Budgeted Optimality
  Theorem (Budgeted Bellman Optimality Equation). Q* verifies:
  Q*(s̄, ā) = T Q*(s̄, ā) := R̄(s̄, ā) + γ Σ_{s̄′ ∈ S̄} P̄(s̄′ | s̄, ā) Σ_{ā′ ∈ Ā} π_greedy(ā′ | s̄′; Q*) Q*(s̄′, ā′)
  where the greedy policy π_greedy is defined by:
  π_greedy(· | s̄; Q) ∈ argmin_{ρ ∈ Π^Q_r} E_{ā∼ρ}[ Q_c(s̄, ā) ]
  with Π^Q_r := argmax_{ρ ∈ M(Ā)} E_{ā∼ρ}[ Q_r(s̄, ā) ]  s.t.  E_{ā∼ρ}[ Q_c(s̄, ā) ] ≤ β

  17. The optimal policy
  Proposition (Optimality of the policy). π_greedy(·; Q*) is simultaneously optimal in all states s̄ ∈ S̄: π_greedy(·; Q*) ∈ Π*(s̄). In particular, V^{π_greedy(·; Q*)} = V* and Q^{π_greedy(·; Q*)} = Q*.
  Proposition (Solving the non-linear program). π_greedy can be computed efficiently, as a mixture π_hull of two points that lie on the convex hull of Q: π_greedy = π_hull.
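
The following is a simplified sketch of that two-point mixture idea, assuming the Q-values of each action at the current augmented state are available as arrays. It is an illustration of the convex-hull construction, not the exact routine from the talk.

```python
import numpy as np

def cross(o, a, b):
    """2D cross product (OA x OB); >= 0 means a non-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def greedy_mixture(q_r, q_c, beta):
    """Mix two actions on the top frontier of the convex hull of the points
    {(Q_c(s,a), Q_r(s,a))} so that the expected cost meets the budget beta
    while the expected reward is maximal. Returns (action, probability) pairs.
    """
    points = sorted(zip(q_c, q_r, range(len(q_r))))   # sort by cost
    hull = []                                          # upper (concave) hull
    for p in points:
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    top = int(np.argmax([p[1] for p in hull]))
    frontier = hull[:top + 1]                          # increasing-reward part
    if beta >= frontier[-1][0]:                        # budget large enough
        return [(frontier[-1][2], 1.0)]
    if beta <= frontier[0][0]:                         # budget below cheapest point
        return [(frontier[0][2], 1.0)]
    for (c1, r1, a1), (c2, r2, a2) in zip(frontier[:-1], frontier[1:]):
        if c1 <= beta <= c2:
            if c2 == c1:
                return [(a2, 1.0)]
            p2 = (beta - c1) / (c2 - c1)               # weight of the costlier action
            return [(a1, 1.0 - p2), (a2, p2)]
    return [(frontier[-1][2], 1.0)]
```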

  18-20. Convergence analysis
  Recall what we have shown so far: Q* is a fixed point of T; π_hull(Q*) is tractable; π_hull(Q*) equals π_greedy(Q*), which is optimal.
  We're almost there! All that is left is to perform Fixed-Point Iteration to compute Q*.
  Theorem (Non-Contractivity). For any BMDP (S, A, P, R_r, R_c, γ, B) with |A| ≥ 2, T is not a contraction:
  ∀ ε > 0, ∃ Q1, Q2 ∈ (R²)^(S̄Ā) : ||T Q1 − T Q2||_∞ ≥ (1/ε) ||Q1 − Q2||_∞
  ✗ We cannot guarantee the convergence of T^n(Q_0) to Q*.

  21. Convergence analysis
  Thankfully,
  Theorem (Contractivity on smooth Q-functions). T is a contraction when restricted to the subset L_γ of Q-functions such that Q_r is L-Lipschitz with respect to Q_c, with L < 1/γ − 1:
  L_γ = { Q ∈ (R²)^(S̄Ā) s.t. ∃ L < 1/γ − 1 : ∀ s̄ ∈ S̄, ā_1, ā_2 ∈ Ā, |Q_r(s̄, ā_1) − Q_r(s̄, ā_2)| ≤ L |Q_c(s̄, ā_1) − Q_c(s̄, ā_2)| }
  ✓ We guarantee convergence under some (strong) assumptions.
  ✓ We observe empirical convergence.
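
A minimal sketch of the fixed-point iteration mentioned above, for a generic operator acting on a tabular Q-function; the operator itself is passed in as a function, since building the full budgeted operator is beyond a slide-sized example.

```python
import numpy as np

def fixed_point_iteration(T, q0, tol=1e-6, max_iters=1000):
    """Iterate Q <- T(Q) until the sup-norm change falls below tol.

    T: callable mapping a Q array to a new Q array (e.g. an approximation of
       the budgeted Bellman operator on a discretised state/budget grid).
    Convergence is only guaranteed when T is a contraction (e.g. on the
    smooth class L_gamma above); otherwise the iterates may oscillate.
    """
    q = np.asarray(q0, dtype=float)
    for i in range(max_iters):
        q_next = T(q)
        if np.max(np.abs(q_next - q)) < tol:
            return q_next, i + 1
        q = q_next
    return q, max_iters
```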

  22. Experiments
  Lagrangian Relaxation Baseline: consider the dual problem, so as to replace the hard constraint by a soft constraint penalised by a Lagrangian multiplier λ:
  max_π E[ Σ_t γ^t R_r(s_t, a_t) − λ γ^t R_c(s_t, a_t) ]
  • Train many policies π_k with penalties λ_k and recover the cost budgets β_k.
  • Very data/memory-heavy.
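
A sketch of this baseline under stated assumptions: `make_env`, `train_policy` and `evaluate` are hypothetical helpers standing in for any standard RL training loop and for the Monte Carlo evaluation sketched earlier; none of them come from the talk.

```python
def lagrangian_baseline(make_env, train_policy, evaluate, gamma, lambdas):
    """Train one policy per penalty lambda_k on the shaped reward
    R_r - lambda_k * R_c, then record the cost budget beta_k it reaches.

    Returns a list of (lambda_k, policy_k, reward_return, cost_return).
    """
    results = []
    for lam in lambdas:
        def shaped_reward(r, c, lam=lam):
            # Scalarised objective: task reward minus penalised cost.
            return r - lam * c

        env = make_env(reward_fn=shaped_reward)
        policy = train_policy(env, gamma=gamma)
        g_r, g_c = evaluate(policy, gamma=gamma)   # expected reward / cost returns
        results.append((lam, policy, g_r, g_c))    # beta_k is recovered as g_c
    return results

# Each lambda_k yields a single point (beta_k, reward) of the trade-off curve,
# which is why this baseline is data- and memory-heavy compared to a single
# budgeted policy.
```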

  23. Experiments
  [Figure] Plots of the reward return G^π_r and the cost return G^π_c.

  24. 04 Efficient Model-Based

  25. Principle
  Model estimation: learn a model T̂(s_{t+1} | s_t, a_t) of the dynamics. For instance:
  1. Least-squares estimate: min_{T̂} Σ_t || s_{t+1} − T̂(s_t, a_t) ||²
  2. Maximum likelihood estimate: max_{T̂} Σ_t log T̂(s_{t+1} | s_t, a_t)
  Planning: leverage T̂ to compute
  max_π E[ Σ_{t=0}^∞ γ^t r(s_t, a_t) | a_t ∼ π(s_t), s_{t+1} ∼ T̂(s_t, a_t) ]
  How?
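
A minimal sketch of the least-squares option above, fitting a linear dynamics model s_{t+1} ≈ W [s_t; a_t; 1] on logged transitions; the linear parameterisation is an illustrative assumption, not the model class used in the talk.

```python
import numpy as np

def fit_linear_dynamics(states, actions, next_states):
    """Least-squares fit of s_{t+1} ~ W @ [s_t; a_t; 1].

    states, actions, next_states: arrays of shape (N, ds), (N, da), (N, ds).
    Returns W of shape (ds, ds + da + 1), minimising sum ||s' - W x||^2.
    """
    x = np.hstack([states, actions, np.ones((len(states), 1))])  # features
    w, *_ = np.linalg.lstsq(x, next_states, rcond=None)          # (ds+da+1, ds)
    return w.T

def predict(w, s, a):
    """Use the fitted model as a (deterministic) generative model T_hat."""
    x = np.concatenate([s, a, [1.0]])
    return w @ x
```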

  26-31. Online Planning
  We can use T̂ as a generative model, in a loop between the agent, the planner and the environment:
  • the agent sends the current state to the planner;
  • the planner simulates trajectories with T̂ and returns a recommendation;
  • the agent executes the recommended action in the environment;
  • the environment returns the next state and reward, and the loop repeats.
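
A sketch of that loop under stated assumptions: `plan` stands for any planner (for instance one of the optimistic planners below), and the `env`/`model` interfaces are hypothetical placeholders.

```python
def online_planning_loop(env, model, plan, n_budget, horizon=1000):
    """Receding-horizon control: re-plan from the current state at each step.

    plan(model, state, n_budget) -> recommended action, using the learned
    model as a generative simulator; the environment then provides the real
    transition. All interfaces are illustrative assumptions.
    """
    state = env.reset()
    total_reward = 0.0
    for _ in range(horizon):
        action = plan(model, state, n_budget)     # planner recommendation
        state, reward, done = env.step(action)    # real transition
        total_reward += reward
        if done:
            break
    return total_reward
```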

  32. Planning performance
  Online Planning with a fixed budget: the model can only be queried n times.
  Objective: minimise E[V* − V(n)], the simple regret r_n.
  An exploration-exploitation problem.

  33-36. Optimistic Planning
  Optimism in the Face of Uncertainty: given a set of options a ∈ A with uncertain outcomes, try the one with the highest possible outcome.
  • Either you performed well;
  • or you learned something.
  Instances
  • Monte-Carlo Tree Search (MCTS) (Coulom, 2006): CrazyStone.
  • Reframed in the bandit setting as UCT (Kocsis and Szepesvári, 2006), still very popular (e.g. AlphaGo).
  • Proved asymptotically consistent, but no regret bound.
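
To make the bandit view concrete, here is a small UCB1-style selection rule of the kind UCT applies at each node. This is a generic sketch of the optimism principle, not the exact constants or variant used in the cited works.

```python
import math

def ucb1_select(counts, sum_rewards, c=math.sqrt(2)):
    """Pick the action with the highest optimistic value estimate.

    counts[a]: number of times action a was tried at this node.
    sum_rewards[a]: sum of returns observed after trying a.
    Untried actions are selected first (infinite optimism).
    """
    total = sum(counts)
    best_a, best_ucb = None, -float("inf")
    for a in range(len(counts)):
        if counts[a] == 0:
            return a
        mean = sum_rewards[a] / counts[a]
        bonus = c * math.sqrt(math.log(total) / counts[a])  # exploration bonus
        if mean + bonus > best_ucb:
            best_a, best_ucb = a, mean + bonus
    return best_a
```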

  37. Analysis of UCT
  It was analysed in (Coquelin and Munos, 2007): the sample complexity of UCT is lower-bounded by exp(exp(D)).

  38. Failing cases of UCT
  Not just a theoretical counter-example.

  39-40. Can we get better guarantees?
  OPD: Optimistic Planning for Deterministic systems
  • Introduced by (Hren and Munos, 2008)
  • Another optimistic algorithm
  • Only for deterministic MDPs
  Theorem (OPD sample complexity): E r_n = O( n^(−log(1/γ) / log κ) ), if κ > 1.
  OLOP: Open-Loop Optimistic Planning
  • Introduced by (Bubeck and Munos, 2010)
  • Extends OPD to the stochastic setting
  • Only considers open-loop policies, i.e. sequences of actions
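
A compact sketch of the OPD idea, assuming a deterministic simulator `model(s, a) -> (s_next, reward)` with rewards in [0, 1]: repeatedly expand the leaf with the highest optimistic value, then return the first action of the best sequence found. The interface and the stopping rule are illustrative assumptions, not the exact algorithm of Hren and Munos (2008).

```python
def opd_plan(model, s0, actions, gamma, n_expansions):
    """Optimistic Planning for Deterministic systems (sketch).

    Each leaf's optimistic value (b-value) is the discounted sum of rewards
    along its path plus gamma^depth / (1 - gamma), which upper-bounds all
    future rewards by 1.
    """
    # Each leaf: (state, discounted_return_so_far, depth, first_action)
    leaves = [(s0, 0.0, 0, None)]
    best = (float("-inf"), None)  # (best lower value, its first action)

    def b_value(leaf):
        _, ret, depth, _ = leaf
        return ret + gamma ** depth / (1.0 - gamma)

    for _ in range(n_expansions):
        leaf = max(leaves, key=b_value)        # most optimistic leaf
        leaves.remove(leaf)
        state, ret, depth, first = leaf
        for a in actions:
            s_next, r = model(state, a)
            child_ret = ret + (gamma ** depth) * r
            child_first = a if first is None else first
            leaves.append((s_next, child_ret, depth + 1, child_first))
            if child_ret > best[0]:
                best = (child_ret, child_first)
    return best[1]
```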

  41-43. The idea behind OLOP
  A direct application of Optimism in the Face of Uncertainty:
  1. We want max_a V(a).
  2. Form upper confidence bounds on sequence values: V(a) ≤ U_a w.h.p.
  3. Sample the sequence with the highest UCB: argmax_a U_a.

  44-45. The idea behind OLOP
  [Illustrative figures]

  46-47. Under the hood
  Upper-bounding the value of sequences:
  V(a) = Σ_{t=1}^{h} γ^t μ_{a_{1:t}}  (follow the sequence)  +  Σ_{t ≥ h+1} γ^t μ_{a*_{1:t}}  (act optimally)
  where each mean reward in the first sum is upper-bounded by U^μ, and each reward in the second sum is upper-bounded by 1.

  48-50. Under the hood
  OLOP's main tool is the Chernoff-Hoeffding deviation inequality:
  U^μ_a(m) := μ̂_a(m) + sqrt( 2 log M / T_a(m) )
  (upper bound = empirical mean + confidence interval)
  As in OPD, all the future rewards are upper-bounded by 1:
  U_a(m) := Σ_{t=1}^{h} γ^t U^μ_{a_{1:t}}(m) + γ^{h+1} / (1 − γ)
  (past rewards + future rewards)
  Bounds sharpening: B_a(m) := inf_{1 ≤ t ≤ L} U_{a_{1:t}}(m)
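
A small sketch of these three quantities for a single action sequence. The empirical means and visit counts are assumed to come from the planner's statistics, and the reward range [0, 1] is taken from the slides; the function names are illustrative.

```python
import math

def sequence_upper_bound(mu_hat, counts, gamma, m):
    """U_a(m) for a sequence a_{1:h}: discounted sum of per-prefix UCBs
    plus gamma^{h+1} / (1 - gamma) for the unexplored future.

    mu_hat[t], counts[t]: empirical mean reward and visit count of the
    prefix a_{1:t+1} after m episodes (assumed available from the planner).
    """
    h = len(mu_hat)
    u = 0.0
    for t in range(h):
        ucb = mu_hat[t] + math.sqrt(2.0 * math.log(m) / counts[t])  # Chernoff-Hoeffding
        u += gamma ** (t + 1) * ucb
    return u + gamma ** (h + 1) / (1.0 - gamma)

def sharpened_bound(mu_hat, counts, gamma, m):
    """B_a(m) = min over prefixes 1 <= t <= h of U_{a_{1:t}}(m)."""
    return min(
        sequence_upper_bound(mu_hat[:t], counts[:t], gamma, m)
        for t in range(1, len(mu_hat) + 1)
    )
```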

  51. OLOP guarantees
  Theorem (OLOP sample complexity). OLOP satisfies:
  E r_n = Õ( n^(−log(1/γ) / log κ′) )  if γ√κ′ > 1,
  E r_n = Õ( n^(−1/2) )  if γ√κ′ ≤ 1.
  "Remarkably, in the case κγ² > 1, we obtain the same rate for the simple regret as Hren and Munos (2008). Thus, in this case, we can say that planning in stochastic environments is not harder than planning in deterministic environments."

  52. Does it work?
  Our objective: understand and bridge this gap. Make OLOP practical.

  53-55. What's wrong with OLOP?
  Explanation: inconsistency.
  • Unintended behaviour happens when U^μ_a(m) > 1 for all a:
  U^μ_a(m) = μ̂_a(m) + sqrt( 2 log M / T_a(m) ),  with μ̂_a(m) ∈ [0, 1] and the confidence term > 0.
  • Then the sequence (U_{a_{1:t}}(m))_t is increasing:
  U_{a_{1:1}}(m) = γ U^μ_{a_1}(m) + γ²·1 + γ³·1 + ...
  U_{a_{1:2}}(m) = γ U^μ_{a_1}(m) + γ² U^μ_{a_2}(m) + γ³·1 + ...   (with U^μ_{a_2}(m) > 1)
  • Then B_a(m) = U_{a_{1:1}}(m).

  56-57. What's wrong with OLOP?
  [Figures] What we were promised vs. what we actually get: OLOP behaves as uniform planning!

  58-59. Our contribution: Kullback-Leibler OLOP
  We summon the upper confidence bound from kl-UCB (Cappé et al., 2013):
  U^μ_a(m) := max { q ∈ I : T_a(m) d(μ̂_a(m), q) ≤ f(m) }
  Algorithm | Interval I | Divergence d | f(m)
  OLOP      | R          | d_QUAD       | 4 log M
  KL-OLOP   | [0, 1]     | d_BER        | 2 log M + 2 log log M
  where d_QUAD(p, q) := 2 (p − q)²  and  d_BER(p, q) := p log(p/q) + (1 − p) log((1 − p)/(1 − q)).
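
A sketch of computing this kl-UCB index for the Bernoulli divergence by bisection on q ∈ [μ̂, 1], exploiting that d_BER(p, ·) is increasing on that interval. This is a generic kl-UCB computation, not code from the talk.

```python
import math

def d_ber(p, q, eps=1e-12):
    """Bernoulli KL divergence d_BER(p, q)."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb(mu_hat, t_a, f_m, tol=1e-6):
    """U^mu_a(m) = max{ q in [0, 1] : T_a(m) * d_BER(mu_hat, q) <= f(m) }."""
    if t_a == 0:
        return 1.0  # no observations: fully optimistic
    lo, hi = mu_hat, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if t_a * d_ber(mu_hat, mid) <= f_m:
            lo = mid   # mid still feasible, push the bound up
        else:
            hi = mid
    return lo

# Example: kl_ucb(0.4, t_a=10, f_m=2 * math.log(100) + 2 * math.log(math.log(100)))
```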

  60-63. Our contribution: Kullback-Leibler OLOP
  [Figure] The divergence q ↦ d_BER(μ̂_a, q) on [0, 1], with the level f(m)/T_a defining the lower and upper confidence bounds L^μ_a and U^μ_a around μ̂_a.
  And now,
  • U^μ_a(m) ∈ I = [0, 1] for all a;
  • the sequence (U_{a_{1:t}}(m))_t is non-increasing;
  • B_a(m) = U_a(m): the bound-sharpening step is superfluous.

  64. Sample complexity
  Theorem (Sample complexity). KL-OLOP enjoys the same regret bounds as OLOP. More precisely, KL-OLOP satisfies:
  E r_n = Õ( n^(−log(1/γ) / log κ′) )  if γ√κ′ > 1,
  E r_n = Õ( n^(−1/2) )  if γ√κ′ ≤ 1.

  65-66. Experiments — Expanded Trees
  [Figures of the expanded trees]
