Safe Reinforcement Learning for Decision-Making in Autonomous Driving
Edouard Leurent, Odalric-Ambrym Maillard, Denis Efimov, Wilfrid Perruquetti, Yann Blanco
SequeL, Inria Lille – Nord Europe · Valse, Inria Lille – Nord Europe · Renault Group
Lille, April 2019
Motivation: Classic Autonomous Driving Pipeline

In practice,
◮ The behavioural layer is a hand-crafted rule-based system (e.g. a finite-state machine).
◮ It won't scale to complex scenes, nor handle negotiation and aggressiveness.
Reinforcement Learning: why?

Search for an optimal policy $\pi(a \mid s)$:

$$\max_\pi \underbrace{\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \,\middle|\, a_t \sim \pi(s_t),\ s_{t+1} \sim T(s_t, a_t)\right]}_{\text{policy return } R^T_\pi}$$

The dynamics $T(s_{t+1} \mid s_t, a_t)$ are unknown. The agent learns by interaction with the environment.

Challenges:
◮ exploration-exploitation
◮ partial observability
◮ credit assignment
◮ safety
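To make the objective above concrete, here is a minimal sketch (not from the talk) that estimates the discounted return of a policy by Monte Carlo rollouts; `env` and `policy` are hypothetical stand-ins for any 2019-era Gym-style environment and any policy callable.

```python
import numpy as np

def estimate_return(env, policy, gamma=0.9, n_episodes=100, horizon=200):
    """Monte Carlo estimate of E[sum_t gamma^t r_t] under `policy`."""
    returns = []
    for _ in range(n_episodes):
        state, total, discount = env.reset(), 0.0, 1.0
        for _ in range(horizon):
            action = policy(state)                      # a_t ~ pi(s_t)
            state, reward, done, _ = env.step(action)   # s_{t+1} ~ T(s_t, a_t)
            total += discount * reward
            discount *= gamma
            if done:
                break
        returns.append(total)
    return np.mean(returns)
```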
Reinforcement Learning: how?

Model-free
1. Directly optimise $\pi(a \mid s)$ through policy evaluation and policy improvement.

Model-based
1. Learn a model of the dynamics $\hat{T}(s_{t+1} \mid s_t, a_t)$,
2. (Planning) Leverage it to compute
$$\max_\pi \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \,\middle|\, a_t \sim \pi(s_t),\ s_{t+1} \sim \hat{T}(s_t, a_t)\right]$$

+ Better sample efficiency, interpretability, priors.
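As one concrete illustration of the planning step, here is a minimal random-shooting sketch (not the talk's planner): sample several action sequences, roll them out through the learned model, and keep the first action of the best sequence. `model_step` and `reward_fn` are hypothetical stand-ins for the learned dynamics $\hat{T}$ and a known reward function, and `actions` is a list of discrete actions.

```python
import numpy as np

def plan_first_action(state, model_step, reward_fn, actions,
                      horizon=10, n_sequences=100, gamma=0.9, rng=np.random):
    """Return the first action of the best random action sequence under the learned model."""
    best_return, best_action = -np.inf, None
    for _ in range(n_sequences):
        seq = rng.choice(actions, size=horizon)      # random candidate action sequence
        s, ret, discount = state, 0.0, 1.0
        for a in seq:
            s_next = model_step(s, a)                # s_{t+1} ~ T_hat(s_t, a_t)
            ret += discount * reward_fn(s, a, s_next)
            discount *= gamma
            s = s_next
        if ret > best_return:
            best_return, best_action = ret, seq[0]
    return best_action
```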
A first benchmark

The highway-env environment
◮ Vehicle kinematics: Kinematic Bicycle Model
◮ Low-level longitudinal and lateral controllers
◮ Behavioural models: IDM and MOBIL
◮ Graphical road network and route planning

A few baseline agents: setup
◮ Model-free: DQN
◮ Model-based (planning): Value Iteration and MCTS
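A minimal usage sketch for this benchmark, assuming the 2019-era Gym API and that the highway-env package registers a `highway-v0` environment (the exact environment id and observation format may differ across versions); the random action is a placeholder for any of the baseline agents.

```python
import gym
import highway_env  # noqa: F401 -- importing registers the highway environments

env = gym.make("highway-v0")
obs = env.reset()
done, total_reward = False, 0.0
while not done:
    action = env.action_space.sample()     # replace with a DQN / Value Iteration / MCTS agent
    obs, reward, done, info = env.step(action)
    total_reward += reward
print("episode return:", total_reward)
```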
A first benchmark: Results

Figure: Histograms of episode rewards and episode lengths (frequency) for the VI, DQN, and MCTS agents.

Videos available on
The safety / performance trade-off

Let us look at the performance of DQN:

Uncertainty and risk
◮ High return variance, many collisions
◮ In RL, we only maximise the return in expectation

Conflicting objectives
◮ Reward: $r_t = \omega_v \cdot \text{velocity} - \omega_c \cdot \text{collision}$
◮ We only control the return $R^T_\pi = \sum_t \gamma^t r_t$.
◮ For any fixed $\omega$, there can be many optimal policies with different velocity/collision ratios → the Pareto-optimal curve
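A minimal sketch of this kind of weighted reward (the weights here are illustrative, not the talk's actual values):

```python
def reward(velocity, crashed, w_velocity=0.4, w_collision=1.0):
    """r_t = w_v * velocity - w_c * collision, with illustrative weights."""
    return w_velocity * velocity - w_collision * float(crashed)

# Training one agent per weight ratio w_velocity / w_collision and measuring its
# average velocity and collision rate traces points along the Pareto-optimal curve.
```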
A first formalisation of risk

Constrained Reinforcement Learning
◮ Augment the MDP with a cost function $c: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$, a cost discount $\gamma_c$, and a budget $\beta$.
◮ Optimise the reward while keeping the cost under the budget:

$$\max_\pi \ \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_t\right] \quad \text{s.t.} \quad \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma_c^t c_t\right] \le \beta$$

Budgeted Reinforcement Learning
Find a single budget-dependent policy $\pi(a \mid s, \beta)$ that solves all the corresponding CMDPs.
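A minimal sketch of how both sides of the constrained objective can be estimated empirically: Monte Carlo rollouts for the discounted return and the discounted cost, with the cost compared to the budget β. The Gym-style environment and the `info["cost"]` field are assumptions for this sketch.

```python
import numpy as np

def evaluate_cmdp(env, policy, beta, gamma=0.9, gamma_c=0.9,
                  n_episodes=100, horizon=200):
    """Estimate E[sum gamma^t r_t] and E[sum gamma_c^t c_t]; check the budget constraint."""
    returns, costs = [], []
    for _ in range(n_episodes):
        state, ret, cost = env.reset(), 0.0, 0.0
        for t in range(horizon):
            state, reward, done, info = env.step(policy(state))
            ret += gamma ** t * reward
            cost += gamma_c ** t * info["cost"]   # hypothetical cost signal
            if done:
                break
        returns.append(ret)
        costs.append(cost)
    feasible = np.mean(costs) <= beta
    return np.mean(returns), np.mean(costs), feasible
```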
A BMDP algorithm

Lagrangian Relaxation
Consider the dual problem and replace the hard constraint by a soft constraint penalised by a Lagrange multiplier $\lambda$:

$$\max_\pi \ \mathbb{E}_\pi\left[\sum_t \gamma^t r_t - \lambda \gamma_c^t c_t\right]$$

◮ Train many policies $\pi_k$ with penalties $\lambda_k$ and recover the corresponding cost budgets $\beta_k$
◮ Very data/memory-heavy
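A minimal sketch of that recipe: one unconstrained agent per penalty value, each implicitly buying a different cost budget. The two helpers below are trivial stand-ins (any RL trainer on the penalised reward and any cost estimator would do), kept as stubs so the sketch runs on its own.

```python
def train_unconstrained(penalty):
    """Stand-in for any RL trainer on the penalised reward r - penalty * c."""
    return lambda state: 0          # placeholder policy

def estimate_discounted_cost(policy):
    """Stand-in for a Monte Carlo estimate of E[sum_t gamma_c^t c_t]."""
    return 0.0                      # placeholder value

lambdas = [0.1, 1.0, 5.0, 15.0, 20.0]
policies, budgets = [], []
for lam in lambdas:
    policy = train_unconstrained(penalty=lam)
    budgets.append(estimate_discounted_cost(policy))
    policies.append(policy)
# Each (policy, budget) pair covers a single cost budget; covering many budgets
# means training and storing many agents, hence the data/memory cost noted above.
```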
Our BMDP algorithm

Budgeted Fitted-Q [Carrara et al. 2019]
A model-free, value-based, fixed-point iteration procedure.

$$Q_r^{n+1}(s_i, a_i, \beta_i) \xleftarrow{\text{regression}} r'_i + \gamma \sum_{a' \in \mathcal{A}} \pi_A^n(s'_i, a', \beta_i)\, Q_r^n\big(s'_i, a', \pi_B^n(s'_i, a', \beta_i)\big)$$

$$Q_c^{n+1}(s_i, a_i, \beta_i) \xleftarrow{\text{regression}} c'_i + \gamma_c \sum_{a' \in \mathcal{A}} \pi_A^n(s'_i, a', \beta_i)\, Q_c^n\big(s'_i, a', \pi_B^n(s'_i, a', \beta_i)\big)$$

$$(\pi_A^n, \pi_B^n) \leftarrow \operatorname*{arg\,max}_{(\pi_A, \pi_B) \in \Psi^n} \sum_{a \in \mathcal{A}} \pi_A(s, a, \beta)\, Q_r^n\big(s, a, \pi_B(s, a, \beta)\big)$$

$$\Psi^n = \left\{ \pi_A \in \mathcal{M}(\mathcal{A})^{\mathcal{S} \times \mathbb{R}},\ \pi_B \in \mathbb{R}^{\mathcal{S} \times \mathcal{A} \times \mathbb{R}} \ \text{such that } \forall s \in \mathcal{S}, \forall \beta \in \mathbb{R}: \ \sum_{a \in \mathcal{A}} \pi_A(s, a, \beta)\, Q_c^n\big(s, a, \pi_B(s, a, \beta)\big) \le \beta \right\}$$
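To make the greedy step concrete, here is a much-simplified sketch (not the paper's exact procedure): it searches over discrete actions and a discretised grid of next budgets for the pair that maximises Q_r while keeping Q_c under the current budget, whereas BFTQ's actual greedy policy is a mixture over actions that can meet the budget exactly.

```python
import numpy as np

def constrained_greedy(q_r, q_c, state, beta, actions, budget_grid):
    """Pick (action, next budget) maximising Q_r subject to Q_c <= beta.

    q_r, q_c: callables (state, action, budget) -> float, e.g. neural networks.
    Deterministic simplification of BFTQ's greedy (mixture) policy.
    """
    best, best_value = None, -np.inf
    for a in actions:
        for beta_next in budget_grid:
            if q_c(state, a, beta_next) <= beta and q_r(state, a, beta_next) > best_value:
                best, best_value = (a, beta_next), q_r(state, a, beta_next)
    if best is None:   # no feasible pair: fall back to the cheapest one
        best = min(((a, b) for a in actions for b in budget_grid),
                   key=lambda ab: q_c(state, ab[0], ab[1]))
    return best
```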
From dynamic programming to RL

Continuous Reinforcement Learning
1. Risk-sensitive exploration
2. Scalable function approximation
3. Parallel computing of the targets and experiences
Risk-sensitive exploration

Algorithm 1: Risk-sensitive exploration
  Initialise an empty batch D.
  for each intermediate batch do
      for each episode in the batch do
          Sample an initial budget β ∼ U(B).
          while episode not done do
              Update ε from the schedule.
              Sample z ∼ U([0, 1]).
              if z > ε then
                  Sample (a, β′) from (π_A, π_B).   // Exploit
              else
                  Sample (a, β′) from U(Δ_AB).      // Explore
              Append the transition (s, β, a, r′, c′, s′) to the batch D.
              Update the episode budget β ← β′.
      (π_A, π_B) ← BFTQ(D).
  return the batch of transitions D
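A minimal Python sketch of this exploration loop, under a Gym-style environment whose cost is read from a hypothetical `info["cost"]` field. `greedy_policy(state, beta)` stands in for the budgeted greedy step (e.g. the simplified one sketched earlier), and the explore branch here samples the action and next budget independently rather than from the joint uniform U(Δ_AB).

```python
import random

def collect_batch(env, greedy_policy, actions, budget_grid,
                  n_episodes=100, epsilon=0.1, horizon=200):
    """greedy_policy(state, beta) -> (action, next_budget)."""
    batch = []
    for _ in range(n_episodes):
        state = env.reset()
        beta = random.uniform(0.0, 1.0)          # sample the initial budget beta ~ U(B)
        for _ in range(horizon):
            if random.random() > epsilon:        # exploit
                action, beta_next = greedy_policy(state, beta)
            else:                                # explore (simplified: independent uniforms)
                action, beta_next = random.choice(actions), random.choice(budget_grid)
            next_state, reward, done, info = env.step(action)
            batch.append((state, beta, action, reward, info["cost"], next_state))
            state, beta = next_state, beta_next  # carry the sampled budget forward
            if done:
                break
    return batch
```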
Function approximation

Figure: Neural network for approximating the Q-functions when the state dimension is 2 and there are 2 actions: the state (s_0, s_1) and the budget β pass through an encoder and two hidden layers, and the network outputs Q_r(a_0), Q_r(a_1), Q_c(a_0), Q_c(a_1).
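A sketch of such an architecture in PyTorch, with illustrative layer sizes; the exact network of the talk (for instance, how the budget is encoded and concatenated) may differ.

```python
import torch
import torch.nn as nn

class BudgetedQNetwork(nn.Module):
    """Joint approximation of Q_r(s, a, beta) and Q_c(s, a, beta) for discrete actions."""

    def __init__(self, state_dim=2, n_actions=2, hidden=64):
        super().__init__()
        self.state_encoder = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.budget_encoder = nn.Sequential(nn.Linear(1, hidden), nn.ReLU())
        self.trunk = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU())
        self.q_r_head = nn.Linear(hidden, n_actions)   # Q_r(s, ., beta)
        self.q_c_head = nn.Linear(hidden, n_actions)   # Q_c(s, ., beta)

    def forward(self, state, beta):
        # state: (batch, state_dim), beta: (batch, 1)
        h = torch.cat([self.state_encoder(state), self.budget_encoder(beta)], dim=-1)
        h = self.trunk(h)
        return self.q_r_head(h), self.q_c_head(h)

# Example: q_r, q_c = BudgetedQNetwork()(torch.randn(32, 2), torch.rand(32, 1))
```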
Parallel computing of the targets

Algorithm 2: BFTQ
  In: D, B̃, γ_c, γ_r, fit_r, fit_c (regression algorithms)
  Out: Q_r, Q_c
  X = {s_i, a_i, β_i}_{i ∈ [0, |D|]}
  Initialise Q_r = Q_c = (s, a, β) → 0
  repeat
      Y_r, Y_c = compute_targets(D, Q_r, Q_c, B̃, γ_c, γ_r)
      Q_r, Q_c = fit_r(X, Y_r), fit_c(X, Y_c)
  until convergence or timeout

Algorithm 3: Compute targets (parallel)
  Q_r, Q_c = Q(D)   // perform a single forward pass
  Split D among workers: D = ∪_{w ∈ W} D_w
  for w ∈ W do   // run this loop on each worker in parallel
      (Y_c^w, Y_r^w) ← compute_targets(D_w, Q_r, Q_c, B̃, γ_c, γ_r)
  Join the results: Y_c = ∪_{w ∈ W} Y_c^w and Y_r = ∪_{w ∈ W} Y_r^w
  return (Y_c, Y_r)
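A minimal sketch of the split/compute/join pattern of Algorithm 3 using Python's multiprocessing. The `compute_targets` body is a placeholder for the real per-transition budgeted Bellman backup, and a real implementation must also make the Q-networks available to the workers (picklable or shared).

```python
from multiprocessing import Pool

def compute_targets(chunk):
    """Placeholder for the per-worker routine of Algorithm 3: a real implementation
    evaluates Q_r and Q_c on the next states and applies the budgeted Bellman backup."""
    y_r = [r for (_s, _beta, _a, r, _c, _s2) in chunk]   # placeholder reward targets
    y_c = [c for (_s, _beta, _a, _r, c, _s2) in chunk]   # placeholder cost targets
    return y_r, y_c

def split(batch, n_workers):
    """Split the batch D into roughly equal chunks D_w, one per worker."""
    size = (len(batch) + n_workers - 1) // n_workers
    return [batch[i:i + size] for i in range(0, len(batch), size)]

def parallel_targets(batch, n_workers=4):
    with Pool(n_workers) as pool:                         # one process per worker
        results = pool.map(compute_targets, split(batch, n_workers))
    y_r = [t for y_r_w, _ in results for t in y_r_w]      # join the results
    y_c = [t for _, y_c_w in results for t in y_c_w]
    return y_r, y_c

# Note: on some platforms the Pool must be created under `if __name__ == "__main__":`.
```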
Experiments

Video available on

Figure: Results for budgeted policies over the budget intervals β ∈ [0.01, 0.09], [0.11, 0.19], [0.21, 0.29], and [0.31, 1.00], compared with Lagrangian-relaxation policies with λ ∈ {15, 20}.
Looking back

Figure: Histograms of episode rewards and episode lengths (frequency) for the VI, DQN, and MCTS agents, from the first benchmark.

Compared to DQN, MCTS was really good in terms of safety; Value Iteration, not so much.
Model bias

Model-free
1. Directly optimise $\pi(a \mid s)$ through policy evaluation and policy improvement.

Model-based
1. Learn a model of the dynamics $\hat{T}(s_{t+1} \mid s_t, a_t)$,
2. (Planning) Leverage it to compute
$$\max_\pi \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \,\middle|\, a_t \sim \pi(s_t),\ s_{t+1} \sim \hat{T}(s_t, a_t)\right]$$

+ Better sample efficiency, interpretability, priors.
- Model bias: $T \neq \hat{T}$, see example at