Safe Reinforcement Learning via Formal Methods
Nathan Fulton and André Platzer
Carnegie Mellon University
Safety-Critical Systems "How can we provide people with cyber-physical systems they can bet their lives on?" - Jeannette Wing
Autonomous Safety-Critical Systems How can we provide people with autonomous cyber-physical systems they can bet their lives on?
Model-Based Verification vs. Reinforcement Learning

Model-Based Verification
Approach: prove that control software (ctrl) achieves a specification (e.g., pos < stopSign) with respect to a model of the physical system.
Benefits:
● Strong safety guarantees
● Automated analysis with computational aids (ATP)
Drawbacks:
● Control policies are typically non-deterministic: answers "what is safe", not "what is useful"
● Assumes an accurate model

Reinforcement Learning (act-observe loop)
Benefits:
● No need for a complete model
● Optimal (effective) policies
Drawbacks:
● No strong safety guarantees
● Proofs of safety are obtained and checked by hand
● Formal proofs = decades-long proof development

Goal: Provably correct reinforcement learning
1. Learn Safety
2. Learn a Safe Policy
3. Justify claims of safety
Model-Based Verification
Accurate, analyzable models often exist!

init → [{
  {?safeAccel; accel ∪ brake ∪ ?safeTurn; turn};   (discrete, non-deterministic control)
  {pos' = vel, vel' = acc}                          (continuous motion)
}*] pos < stopSign

Formal verification gives strong safety guarantees:
● Computer-checked proofs of the safety specification
● Formal proofs mapping the model to runtime monitors
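To make the idea of a runtime monitor concrete, here is a minimal hand-written sketch for the point-mass model pos' = vel, vel' = acc over one control cycle. The state representation, time step, and tolerance are illustrative assumptions; the actual monitors are synthesized from the verified model by the proof toolchain, not written by hand like this.

```python
# Hand-written sketch of a runtime model monitor for the model
#   pos' = vel, vel' = acc
# over one control cycle of length dt. Illustrative only.

def model_monitor(prev, curr, dt, tol=1e-3):
    """Return True iff the observed transition is explained by the model."""
    # Closed-form solution of the ODE under constant acceleration prev["acc"]:
    predicted_vel = prev["vel"] + prev["acc"] * dt
    predicted_pos = prev["pos"] + prev["vel"] * dt + 0.5 * prev["acc"] * dt ** 2
    return (abs(curr["vel"] - predicted_vel) <= tol
            and abs(curr["pos"] - predicted_pos) <= tol)
```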
Model-Based Verification Isn't Enough
Perfect, analyzable models don't exist!

{
  {?safeAccel; accel ∪ brake ∪ ?safeTurn; turn};   (how to implement?)
  {dx' = w*y, dy' = -w*x, ...}                      (only accurate sometimes)
}*
Our Contribution
Justified Speculative Control is an approach toward provably safe reinforcement learning that:
1. learns to resolve non-determinism without sacrificing formal safety results
2. allows and directs speculation whenever model mismatches occur
Learning to Resolve Non-determinism
A standard learner repeatedly picks an action from {accel, brake, turn} (the choices left open by accel ∪ brake ∪ turn), observes the outcome, computes a reward, and distills the experience into a policy. But is that policy safe?

Learning to Safely Resolve Non-determinism
Add a safety monitor φ that filters the actions available to the learner before each choice. The monitor is not a "trust me": we use a theorem prover to prove

  (init → [{{accel ∪ brake}; ODEs}*](safe)) ↔ φ

Main Theorem: If the ODEs are accurate, then our formal proofs transfer from the non-deterministic model to the learned (deterministic) policy via the model monitor.
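As a concrete picture of monitored action selection, here is a minimal ε-greedy sketch in which the learner only resolves the non-determinism the verified model leaves open. The controller_monitor and q_values interfaces, the fallback action, and ε are illustrative assumptions, not the authors' implementation.

```python
import random

# Minimal sketch of safety-monitored action selection: the learner chooses
# only among actions the monitor proves safe in the current state.
ACTIONS = ["accel", "brake", "turn"]

def choose_action(state, controller_monitor, q_values, epsilon=0.1):
    # Keep only the choices the controller monitor proves safe in this state.
    safe_actions = [a for a in ACTIONS if controller_monitor(state, a)]
    if not safe_actions:
        return "brake"  # verified fallback when nothing else is provably safe
    if random.random() < epsilon:
        return random.choice(safe_actions)  # explore, but only among safe actions
    return max(safe_actions, key=lambda a: q_values[(state, a)])  # exploit
```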
What About the Physical Model?
The guarantee above transfers only while {pos' = vel, vel' = acc} matches reality.

While the model is accurate, the learner picks among {brake, accel, turn}, observes, and computes rewards, and the verified model explains everything it sees. When the model is inaccurate (say, an obstacle appears that the model does not account for), expectation and reality diverge: the action the model expects to be safe may in reality end in a crash.

Speculation is Justified
Expectation (safe) and reality (crash!) can disagree, and the model monitor detects exactly this mismatch. Once it fires, the verified model no longer describes the system, so speculation beyond the verified behavior is justified.
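A minimal sketch of that decision rule, assuming a Boolean model monitor like the one sketched earlier; the function names and interfaces are illustrative, not the authors' implementation.

```python
# Sketch of justified speculation: while the model monitor confirms that
# reality matches the verified model, stay inside the provably safe action
# set; once a mismatch is observed, the proof no longer constrains reality,
# so the agent falls back to its learned policy.

def act(state, prev_transition, model_monitor, safe_choice, learned_policy):
    """prev_transition is (prev_state, curr_state, dt), or None on the first step."""
    if prev_transition is None or model_monitor(*prev_transition):
        # The model has explained everything observed so far: act safely.
        return safe_choice(state)
    # The model was violated: speculate using the learned policy.
    return learned_policy(state)
```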
Leveraging Verification Results to Learn Better
Use a real-valued version of the model monitor as a reward signal in the observe-and-compute-reward step.
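A minimal sketch of such reward shaping, reusing the point-mass model from the monitor sketch above. The margin definition and the shaping weight are illustrative assumptions, not the quantitative monitor the authors derive from the proof.

```python
# Sketch of using a real-valued model monitor as a reward signal: instead of
# a Boolean "model held / model violated", a margin measures how far the
# observed transition is from the model's prediction and penalizes the
# learner accordingly.

def model_margin(prev, curr, dt):
    """0 when the observed transition exactly matches the model's prediction,
    increasingly negative as the deviation grows."""
    predicted_vel = prev["vel"] + prev["acc"] * dt
    predicted_pos = prev["pos"] + prev["vel"] * dt + 0.5 * prev["acc"] * dt ** 2
    return -max(abs(curr["vel"] - predicted_vel), abs(curr["pos"] - predicted_pos))

def shaped_reward(task_reward, prev, curr, dt, weight=1.0):
    # Penalize the learner in proportion to how far reality strayed from the
    # verified model; the weight is an illustrative assumption.
    return task_reward + weight * model_margin(prev, curr, dt)
```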