Limitation of Reinforcement Learning

Reinforcement learning relies on a single reward function R.
✓ A convenient formulation, but
✗ R is not always easy to design.

Conflicting Objectives
Complex tasks involve multiple contradictory aspects, typically: task completion vs. safety.
For example...
Example problems with conflicts

Two-Way Road
The agent is driving on a two-way road with a car in front of it:
• it can stay behind (safe / slow);
• it can overtake (unsafe / fast).
Limitation of Reinforcement Learning

For a fixed reward function R, the optimal policy π* is only guaranteed to lie on a Pareto front Π*:
we have no control over the task-completion vs. safety trade-off.
The Pareto front

(figure: the Pareto-optimal curve Π* in the plane of task completion G_1 = Σ_t γ^t R_1 versus safety G_2 = Σ_t γ^t R_2; a policy argmax_π Σ_t γ^t R obtained from a fixed scalar reward lands somewhere on Π*)
From maximal safety to minimal risk

(figure: the same Pareto plot, now with axes task completion G_r and risk G_c; the signal becomes (R_r, −R_c) and the policy is argmax_π of its discounted sum)
The optimal policy can move freely along Π*

(figure: the optimal policy π* can land anywhere on the Pareto-optimal curve Π* in the task completion G_r vs. risk G_c plane)
How to choose a desired trade-off

(figure: pick π* = argmax_π Σ_t γ^t R_r(s_t, a_t) subject to Σ_t γ^t R_c(s_t, a_t) ≤ β; the budget β on the risk axis selects a point of Π*)
Constrained Reinforcement Learning

Markov Decision Process
An MDP is a tuple (S, A, P, R_r, γ) with:
• rewards R_r ∈ ℝ^{S×A}.

Objective: maximise rewards
    max_{π ∈ M(A)^S}  E[ Σ_{t=0}^∞ γ^t R_r(s_t, a_t) | s_0 = s ]
Constrained Reinforcement Learning

Constrained Markov Decision Process
A CMDP is a tuple (S, A, P, R_r, R_c, γ, β) with:
• rewards R_r ∈ ℝ^{S×A};
• costs R_c ∈ ℝ^{S×A};
• a budget β.

Objective: maximise rewards while keeping costs under a fixed budget
    max_{π ∈ M(A)^S}  E[ Σ_{t=0}^∞ γ^t R_r(s_t, a_t) | s_0 = s ]
    s.t.              E[ Σ_{t=0}^∞ γ^t R_c(s_t, a_t) | s_0 = s ] ≤ β
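To make this objective concrete, here is a minimal sketch (assuming a hypothetical CMDP-style environment interface of my own, where each step returns both a reward and a cost) that estimates the two discounted returns of a policy by Monte Carlo rollouts and checks the budget constraint.

```python
import numpy as np

def evaluate_policy(env, policy, beta, gamma=0.9, episodes=100, horizon=200):
    """Monte Carlo estimate of the discounted reward and cost returns of a policy.

    Assumes a hypothetical interface: env.reset() -> state,
    env.step(action) -> (state, reward, cost, done), policy(state) -> action.
    """
    reward_returns, cost_returns = [], []
    for _ in range(episodes):
        state = env.reset()
        g_r, g_c, discount = 0.0, 0.0, 1.0
        for _ in range(horizon):
            action = policy(state)
            state, reward, cost, done = env.step(action)
            g_r += discount * reward
            g_c += discount * cost
            discount *= gamma
            if done:
                break
        reward_returns.append(g_r)
        cost_returns.append(g_c)
    # CMDP objective: maximise E[G_r] subject to E[G_c] <= beta.
    return np.mean(reward_returns), np.mean(cost_returns), np.mean(cost_returns) <= beta
```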
We want to learn Π* rather than π*_β

(figure: the same constrained objective argmax_π Σ_t γ^t R_r s.t. Σ_t γ^t R_c ≤ β, but we now want the whole Pareto-optimal curve Π* rather than the single policy π*_β selected by one budget β)
Budgeted Reinforcement Learning

Budgeted Markov Decision Process
A BMDP is a tuple (S, A, P, R_r, R_c, γ, B) with:
• rewards R_r ∈ ℝ^{S×A};
• costs R_c ∈ ℝ^{S×A};
• a budget space B.

Objective: maximise rewards while keeping costs under an adjustable budget
    ∀β ∈ B,  max_{π ∈ M(A×B)^{S×B}}  E[ Σ_{t=0}^∞ γ^t R_r(s_t, a_t) | s_0 = s, β_0 = β ]
             s.t.                    E[ Σ_{t=0}^∞ γ^t R_c(s_t, a_t) | s_0 = s, β_0 = β ] ≤ β
Problem formulation

Budgeted policies π
• take a budget β as an additional input;
• output a next budget β′:
    π : (s, β) ↦ (a, β′)
where (s, β) plays the role of an augmented state and (a, β′) of an augmented action: we augment the spaces with the budget β.
Augmented Setting

Definition (Augmented spaces)
• States: S̄ = S × B.
• Actions: Ā = A × B.
• Dynamics P̄: from state (s, β) and action (a, β_a), the next state is (s′, β′) with s′ ∼ P(s′ | s, a) and β′ = β_a.

Definition (Augmented signals)
1. Rewards:  R̄ = (R_r, R_c)
2. Returns:  Ḡ^π = (G^π_r, G^π_c) := Σ_{t=0}^∞ γ^t R̄(s̄_t, ā_t)
3. Value:    V̄^π(s̄) = (V^π_r, V^π_c) := E[ Ḡ^π | s̄_0 = s̄ ]
4. Q-value:  Q̄^π(s̄, ā) = (Q^π_r, Q^π_c) := E[ Ḡ^π | s̄_0 = s̄, ā_0 = ā ]
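To make the budget dynamics concrete, here is a minimal sketch (generic Python, with a hypothetical env.step interface of my own) of one augmented transition: the underlying dynamics act on (state, action) as usual, and the carried budget is simply replaced by the budget part of the action.

```python
def augmented_step(env, state, budget, action, next_budget):
    """One transition of the augmented BMDP.

    The underlying dynamics P only see (state, action); the budget component
    of the augmented state is set by the agent's chosen next_budget,
    i.e. beta' = beta_a, as in the augmented dynamics above.
    """
    next_state, reward, cost, done = env.step(action)   # hypothetical interface
    augmented_state = (next_state, next_budget)          # s̄' = (s', β_a)
    augmented_reward = (reward, cost)                     # R̄ = (R_r, R_c)
    return augmented_state, augmented_reward, done
```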
Budgeted Optimality

Definition (Budgeted optimality). In that order, we want to:
(i)   respect the budget β:   Π_a(s̄) := { π ∈ Π : V^π_c(s, β) ≤ β }
(ii)  maximise the rewards:   V*_r(s̄) := max_{π ∈ Π_a(s̄)} V^π_r(s̄),   Π_r(s̄) := argmax_{π ∈ Π_a(s̄)} V^π_r(s̄)
(iii) minimise the costs:     V*_c(s̄) := min_{π ∈ Π_r(s̄)} V^π_c(s̄),   Π*(s̄) := argmin_{π ∈ Π_r(s̄)} V^π_c(s̄)

We define the budgeted action-value function Q̄* similarly.
Budgeted Optimality

Theorem (Budgeted Bellman Optimality Equation). Q̄* verifies:
    Q̄*(s̄, ā) = T Q̄*(s̄, ā) := R̄(s̄, ā) + γ Σ_{s̄′ ∈ S̄} P̄(s̄′ | s̄, ā) Σ_{ā′ ∈ Ā} π_greedy(ā′ | s̄′; Q̄*) Q̄*(s̄′, ā′)

where the greedy policy π_greedy is defined by:
    π_greedy(ā | s̄; Q) ∈ argmin_{ρ ∈ Π_Q}  E_{ā ∼ ρ} Q_c(s̄, ā)
    where  Π_Q := argmax_{ρ ∈ M(Ā)}  E_{ā ∼ ρ} Q_r(s̄, ā)   s.t.   E_{ā ∼ ρ} Q_c(s̄, ā) ≤ β
The optimal policy

Proposition (Optimality of the policy)
π_greedy(·; Q̄*) is simultaneously optimal in all states s̄ ∈ S̄: π_greedy(·; Q̄*) ∈ Π*(s̄).
In particular, V^{π_greedy(·; Q̄*)} = V̄* and Q^{π_greedy(·; Q̄*)} = Q̄*.

Proposition (Solving the non-linear program)
π_greedy can be computed efficiently, as a mixture π_hull of two points that lie on the convex hull of Q̄: π_greedy = π_hull.
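The following is a minimal sketch of such a hull mixture for a finite set of candidate actions with known Q-values (my own illustration of the construction, not the thesis implementation; the action representation and tie-breaking are arbitrary choices).

```python
def greedy_budgeted_policy(q_values, beta):
    """Sketch of pi_hull: a mixture of two points on the convex hull of Q.

    q_values: list of (action, q_cost, q_reward) for the candidate augmented actions.
    Returns [(action, probability), ...], a mixture of at most two actions whose
    expected cost respects the budget beta while maximising expected reward.
    """
    pts = sorted(q_values, key=lambda x: (x[1], -x[2]))   # by cost, then by reward (desc)
    # Upper convex hull in the (cost, reward) plane (monotone chain).
    hull = []
    for p in pts:
        while len(hull) >= 2:
            (_, c1, r1), (_, c2, r2) = hull[-2], hull[-1]
            _, c3, r3 = p
            # Pop the middle point if it lies on or below the segment (non-right turn).
            if (c2 - c1) * (r3 - r1) >= (c3 - c1) * (r2 - r1):
                hull.pop()
            else:
                break
        hull.append(p)
    # Keep only the ascending (Pareto) part, up to the maximum-reward vertex.
    top = max(range(len(hull)), key=lambda i: hull[i][2])
    hull = hull[:top + 1]
    if beta <= hull[0][1]:
        return [(hull[0][0], 1.0)]        # cheapest vertex: deterministic
    if beta >= hull[-1][1]:
        return [(hull[-1][0], 1.0)]       # most rewarding vertex: deterministic
    # Mix the two consecutive hull vertices whose costs bracket the budget.
    for (a1, c1, _), (a2, c2, _) in zip(hull[:-1], hull[1:]):
        if c1 <= beta <= c2:
            p = (beta - c1) / (c2 - c1) if c2 > c1 else 1.0
            return [(a1, 1.0 - p), (a2, p)]
```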
Convergence analysis

Recall what we have shown so far: the operator T admits Q̄* as a fixed point; π_hull(Q̄*) is tractable; π_hull(Q̄*) = π_greedy(Q̄*); and π_greedy(Q̄*) is optimal.

We're almost there! All that is left is to perform fixed-point iteration to compute Q̄*.

Theorem (Non-contractivity). For any BMDP (S, A, P, R_r, R_c, γ) with |A| ≥ 2, T is not a contraction:
    ∀ε > 0, ∃ Q̄_1, Q̄_2 ∈ (ℝ²)^{S̄×Ā} :  ‖T Q̄_1 − T Q̄_2‖_∞ ≥ (1/ε) ‖Q̄_1 − Q̄_2‖_∞
✗ We cannot guarantee the convergence of T^n(Q̄_0) to Q̄*.
Convergence analysis

Thankfully,

Theorem (Contractivity on smooth Q-functions). T is a contraction when restricted to the subset L_γ of Q-functions such that "Q_r is L-Lipschitz with respect to Q_c", with L < 1/γ − 1:
    L_γ = { Q̄ ∈ (ℝ²)^{S̄×Ā}  s.t.  ∃ L < 1/γ − 1 :  ∀ s̄ ∈ S̄, ā_1, ā_2 ∈ Ā,  |Q_r(s̄, ā_1) − Q_r(s̄, ā_2)| ≤ L |Q_c(s̄, ā_1) − Q_c(s̄, ā_2)| }

✓ We guarantee convergence under some (strong) assumptions.
✓ We observe empirical convergence.
Experiments

Lagrangian Relaxation Baseline
Consider the dual problem, so as to replace the hard constraint by a soft constraint penalised by a Lagrangian multiplier λ:
    max_π  E[ Σ_t γ^t R_r(s_t, a_t) − λ γ^t R_c(s_t, a_t) ]
• Train many policies π_k with penalties λ_k and recover the corresponding cost budgets β_k.
• Very data- and memory-heavy.
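As a sketch of this baseline (with generic train_policy and evaluate helpers of my own, not the actual experimental code): scalarise the reward with each penalty λ_k, train a policy, and record the cost budget that this policy happens to satisfy.

```python
def lagrangian_relaxation_baseline(train_policy, evaluate, lambdas):
    """Train one policy per penalty weight and recover its realised cost budget.

    Assumed helpers: train_policy(reward_fn) -> policy trained on the scalarised
    reward; evaluate(policy) -> (G_r, G_c), Monte Carlo estimates of the
    discounted reward and cost returns.
    """
    policies = {}
    for lam in lambdas:
        # Scalarised reward R_r - lambda * R_c: the soft-constraint surrogate.
        reward_fn = lambda reward, cost, lam=lam: reward - lam * cost
        pi = train_policy(reward_fn)
        g_r, g_c = evaluate(pi)
        policies[lam] = (pi, g_c)   # beta_k := cost return achieved by pi_k
    # One full training run per lambda_k: very data- and memory-heavy.
    return policies
```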
Experiments

(figure: experimental results comparing the reward return G^π_r and the cost return G^π_c of the learned policies)
04
Efficient Model-Based
Principle

Model estimation
Learn a model T̂(s_{t+1} | s_t, a_t) of the dynamics. For instance:
1. Least-squares estimate:       min_{T̂}  Σ_t ‖ s_{t+1} − T̂(s_t, a_t) ‖²_2
2. Maximum-likelihood estimate:  max_{T̂}  Π_t T̂(s_{t+1} | s_t, a_t)

Planning
Leverage T̂ to compute
    max_π  E[ Σ_{t=0}^∞ γ^t r(s_t, a_t)  |  a_t ∼ π(s_t),  s_{t+1} ∼ T̂(s_t, a_t) ]
How?
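As a minimal sketch of the least-squares variant (using a plain linear model and names of my own choosing; the actual model class can be much richer), T̂ is fitted by regression on observed transitions:

```python
import numpy as np

def fit_linear_dynamics(states, actions, next_states):
    """Least-squares estimate of a linear dynamics model s' ≈ A s + B a.

    states, actions, next_states: arrays of shape (N, d_s), (N, d_a), (N, d_s)
    collected from interaction data.
    """
    X = np.hstack([states, actions])                    # regressors (s_t, a_t)
    # Solves min_theta sum_t || s_{t+1} - theta^T (s_t, a_t) ||^2
    theta, *_ = np.linalg.lstsq(X, next_states, rcond=None)
    return theta.T                                       # shape (d_s, d_s + d_a)

def predict(theta, state, action):
    """One-step prediction with the fitted model T̂."""
    return theta @ np.concatenate([state, action])
```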
Online Planning

We can use T̂ as a generative model:
(diagram: the agent sends the current state to a planner; the planner, simulating with T̂, returns a recommendation; the agent executes the action in the environment and observes the next state and reward)
Planning performance

Online planning
• Fixed budget: the model can only be queried n times.
Objective: minimise the expected simple regret E r_n, with r_n = V* − V(a_n), where a_n is the action recommended after the n queries.
An exploration-exploitation problem.
Optimistic Planning

Optimism in the Face of Uncertainty
Given a set of options a ∈ A with uncertain outcomes, try the one with the highest possible outcome.
• Either you performed well;
• or you learned something.

Instances
• Monte-Carlo Tree Search (MCTS) (Coulom, 2006): CrazyStone.
• Reframed in the bandit setting as UCT (Kocsis and Szepesvári, 2006); still very popular (e.g. AlphaGo).
• Proved asymptotically consistent, but with no regret bound.
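For reference, here is a minimal sketch of the UCB-style selection rule at the heart of UCT (a generic illustration with assumed node fields, not the CrazyStone or AlphaGo implementations): at each node, descend into the child maximising an optimistic value estimate.

```python
import math

def uct_select(node, c=math.sqrt(2)):
    """Return the (action, child) pair with the highest upper confidence bound.

    Assumed fields: node.visits, node.children (dict action -> child),
    child.visits and child.total_value (sum of sampled returns).
    """
    def ucb(child):
        if child.visits == 0:
            return float("inf")            # unvisited children are maximally optimistic
        mean = child.total_value / child.visits
        bonus = c * math.sqrt(math.log(node.visits) / child.visits)
        return mean + bonus
    return max(node.children.items(), key=lambda kv: ucb(kv[1]))
```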
Analysis of UCT

It was analysed by Coquelin and Munos (2007): the sample complexity of UCT is lower-bounded by Ω(exp(exp(D))).
Failing cases of UCT

Not just a theoretical counter-example.
Can we get better guarantees?

OPD: Optimistic Planning for Deterministic systems
• Introduced by Hren and Munos (2008).
• Another optimistic algorithm.
• Only for deterministic MDPs.

Theorem (OPD sample complexity)
    E r_n = O( n^{−(log 1/γ) / (log κ)} ),   if κ > 1

OLOP: Open-Loop Optimistic Planning
• Introduced by Bubeck and Munos (2010).
• Extends OPD to the stochastic setting.
• Only considers open-loop policies, i.e. sequences of actions.
The idea behind OLOP

A direct application of Optimism in the Face of Uncertainty:
1. We want max_a V(a).
2. Form upper confidence bounds on the sequence values: V(a) ≤ U_a w.h.p.
3. Sample the sequence with the highest UCB: argmax_a U_a.
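A minimal sketch of this loop (generic names and a simplified recommendation rule of my own; OLOP's exact episode schedule and bound sharpening are detailed below):

```python
def optimistic_planning_loop(model, sequences, n_episodes, compute_ucb):
    """Generic optimism-based open-loop planner.

    sequences: candidate action sequences a = (a_1, ..., a_L), as tuples;
    compute_ucb(stats, a): assumed helper returning an upper confidence bound
    U_a on V(a) from the rollout statistics gathered so far (+inf if unvisited).
    """
    stats = {}                                  # sequence -> list of reward lists
    for _ in range(n_episodes):
        # 1. Be optimistic: pick the sequence with the highest UCB on its value.
        a = max(sequences, key=lambda seq: compute_ucb(stats, seq))
        # 2. Follow it in the stochastic model and record the observed rewards.
        rewards = model.rollout(a)              # hypothetical generative-model call
        stats.setdefault(a, []).append(rewards)
    # 3. Recommend the sequence with the best empirical mean return (simplified).
    return max(stats, key=lambda seq: sum(sum(r) for r in stats[seq]) / len(stats[seq]))
```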
Under the hood

Upper-bounding the value of sequences:
    V(a) = Σ_{t=1}^{h} γ^t μ_{a_{1:t}}  +  Σ_{t ≥ h+1} γ^t μ_{a*_{1:t}}
           (follow the sequence)            (then act optimally)
with μ_{a_{1:t}} ≤ U^μ and μ_{a*_{1:t}} ≤ 1.
Under the hood

OLOP's main tool: the Chernoff-Hoeffding deviation inequality
    U^μ_a(m) := μ̂_a(m) + sqrt( 2 log M / T_a(m) )
    (upper bound = empirical mean + confidence interval)

As in OPD, upper-bound all the future rewards by 1:
    U_a(m) := Σ_{t=1}^{h} γ^t U^μ_{a_{1:t}}(m)  +  γ^{h+1} / (1 − γ)
              (past rewards)                        (future rewards)

Bound sharpening:
    B_a(m) := inf_{1 ≤ t ≤ L} U_{a_{1:t}}(m)
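A minimal sketch of these three quantities (generic Python; M is the number of episodes, and passing the per-prefix statistics explicitly is an arbitrary choice of mine):

```python
import math

def hoeffding_ucb(mean_hat, count, n_episodes):
    """Chernoff-Hoeffding upper bound U^mu on a mean reward in [0, 1]."""
    if count == 0:
        return float("inf")
    return mean_hat + math.sqrt(2.0 * math.log(n_episodes) / count)

def sequence_ucb(prefix_reward_ucbs, gamma):
    """U_a(m): discounted sum of the reward UCBs along the sequence prefix,
    plus gamma^{h+1} / (1 - gamma) for the unobserved future rewards."""
    h = len(prefix_reward_ucbs)
    past = sum(gamma ** t * u for t, u in enumerate(prefix_reward_ucbs, start=1))
    future = gamma ** (h + 1) / (1.0 - gamma)
    return past + future

def sharpened_bound(ucbs_of_all_prefixes):
    """B_a(m): the tightest sequence bound over all prefixes a_{1:t}, t = 1..L."""
    return min(ucbs_of_all_prefixes)
```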
OLOP guarantees

Theorem (OLOP sample complexity). OLOP satisfies:
    E r_n = Õ( n^{−(log 1/γ) / (log κ′)} )   if γ√κ′ > 1,
    E r_n = Õ( n^{−1/2} )                    if γ√κ′ ≤ 1.

"Remarkably, in the case κγ² > 1, we obtain the same rate for the simple regret as Hren and Munos (2008). Thus, in this case, we can say that planning in stochastic environments is not harder than planning in deterministic environments."
Does it work?

Our objective: understand and bridge this gap. Make OLOP practical.
What's wrong with OLOP?

Explanation: inconsistency
• Unintended behaviour happens when U^μ_a(m) > 1 for all a:
      U^μ_a(m) = μ̂_a(m) + sqrt( 2 log M / T_a(m) )
                 (μ̂_a(m) ∈ [0, 1], confidence term > 0)
• Then the sequence (U_{a_{1:t}}(m))_t is increasing:
      U_{a_{1:1}}(m) = γ U^μ_{a_1}(m) + γ²·1 + γ³·1 + ...
      U_{a_{1:2}}(m) = γ U^μ_{a_1}(m) + γ² U^μ_{a_{1:2}}(m) + γ³·1 + ...,   with U^μ_{a_{1:2}}(m) > 1
• Then B_a(m) = U_{a_{1:1}}(m): the bound sharpening degenerates.
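To see the mechanism numerically, here is a tiny illustration (my own, with arbitrary values) of the prefix bounds U_{a_{1:h}}(m) when every reward UCB exceeds 1, versus when it does not:

```python
def prefix_bounds(reward_ucbs, gamma=0.9):
    """U_{a_{1:h}}(m) for h = 1..L, given the reward UCBs along the sequence."""
    bounds = []
    for h in range(1, len(reward_ucbs) + 1):
        past = sum(gamma ** t * reward_ucbs[t - 1] for t in range(1, h + 1))
        future = gamma ** (h + 1) / (1.0 - gamma)
        bounds.append(past + future)
    return bounds

# If every reward UCB exceeds 1 (few samples), the prefix bounds increase with
# depth, so B_a(m) = min over prefixes is always attained at depth 1 and no
# longer discriminates between sequences.
print(prefix_bounds([1.3, 1.3, 1.3]))   # increasing: sharpening is useless
print(prefix_bounds([0.5, 0.5, 0.5]))   # decreasing: sharpening is informative
```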
What's wrong with OLOP?

What we were promised vs. what we actually get (figures): in practice, OLOP behaves as uniform planning!
Our contribution: Kullback-Leibler OLOP

We summon the upper confidence bound from kl-UCB (Cappé et al., 2013):
    U^μ_a(m) := max { q ∈ I :  T_a(m) d(μ̂_a(m), q) ≤ f(m) }

Algorithm      | OLOP      | KL-OLOP
Interval I     | ℝ         | [0, 1]
Divergence d   | d_QUAD    | d_BER
f(m)           | 4 log M   | 2 log M + 2 log log M

with  d_QUAD(p, q) := 2(p − q)²  and  d_BER(p, q) := p log(p/q) + (1 − p) log((1 − p)/(1 − q)).
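A minimal sketch of how such a kl-UCB index can be computed in practice, by bisection on q (generic code of mine, not the thesis implementation):

```python
import math

def bernoulli_kl(p, q, eps=1e-12):
    """d_BER(p, q) = p log(p/q) + (1-p) log((1-p)/(1-q)), clipped for stability."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb(mean_hat, count, threshold, iterations=50):
    """max { q in [0, 1] : count * d_BER(mean_hat, q) <= threshold }.

    threshold plays the role of f(m), e.g. 2 log M + 2 log log M.
    d_BER(mean_hat, .) is increasing on [mean_hat, 1], so bisection applies.
    """
    if count == 0:
        return 1.0                      # no data: the most optimistic value in [0, 1]
    low, high = mean_hat, 1.0
    for _ in range(iterations):
        mid = (low + high) / 2.0
        if count * bernoulli_kl(mean_hat, mid) <= threshold:
            low = mid
        else:
            high = mid
    return low
```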
Our contribution: Kullback-Leibler OLOP

(figure: the divergence q ↦ d_BER(μ̂_a, q) on [0, 1], with the level f(m)/T_a and the resulting confidence bounds L^μ_a and U^μ_a around μ̂_a)

And now,
• U^μ_a(m) ∈ I = [0, 1] for all a;
• the sequence (U_{a_{1:t}}(m))_t is non-increasing;
• B_a(m) = U_a(m): the bound sharpening step is superfluous.
Sample complexity

Theorem (Sample complexity). KL-OLOP enjoys the same regret bounds as OLOP. More precisely, KL-OLOP satisfies:
    E r_n = Õ( n^{−(log 1/γ) / (log κ′)} )   if γ√κ′ > 1,
    E r_n = Õ( n^{−1/2} )                    if γ√κ′ ≤ 1.
Experiments: Expanded Trees
(figures)