Safe Reinforcement Learning via Formal Methods
Nathan Fulton and André Platzer, Carnegie Mellon University


  1. Safe Reinforcement Learning via Formal Methods. Nathan Fulton and André Platzer, Carnegie Mellon University

  2. Safety-Critical Systems "How can we provide people with cyber-physical systems they can bet their lives on?" - Jeannette Wing

  3. Autonomous Safety-Critical Systems How can we provide people with autonomous cyber-physical systems they can bet their lives on?

  4. Model-Based Verification vs. Reinforcement Learning: safety specification φ

  5. Model-Based Verification vs. Reinforcement Learning: safety specification pos < stopSign

  6. Model-Based Verification vs. Reinforcement Learning: controller ctrl, safety specification pos < stopSign

  7. Model-Based Verification vs. Reinforcement Learning: controller ctrl, safety specification pos < stopSign. Approach: prove that control software achieves a specification with respect to a model of the physical system.

  9. Model-Based Verification vs. Reinforcement Learning: φ
      Model-Based Verification. Benefits: ● Strong safety guarantees ● Automated analysis

  10. Model-Based Verification vs. Reinforcement Learning: φ
      Model-Based Verification. Benefits: ● Strong safety guarantees ● Automated analysis. Drawbacks: ● Control policies are typically non-deterministic: answers “what is safe”, not “what is useful”

  11. Model-Based Verification vs. Reinforcement Learning: φ
      Model-Based Verification. Benefits: ● Strong safety guarantees ● Automated analysis. Drawbacks: ● Control policies are typically non-deterministic: answers “what is safe”, not “what is useful” ● Assumes accurate model

  12. Model-Based Verification vs. Reinforcement Learning (Act / Observe loop): φ
      Model-Based Verification. Benefits: ● Strong safety guarantees ● Automated analysis. Drawbacks: ● Control policies are typically non-deterministic: answers “what is safe”, not “what is useful” ● Assumes accurate model.

  13. Model-Based Verification vs. Reinforcement Learning (Act / Observe loop): φ
      Model-Based Verification. Benefits: ● Strong safety guarantees ● Automated analysis. Drawbacks: ● Control policies are typically non-deterministic: answers “what is safe”, not “what is useful” ● Assumes accurate model.
      Reinforcement Learning. Benefits: ● No need for complete model ● Optimal (effective) policies

  14. Model-Based Verification vs. Reinforcement Learning (Act / Observe loop): φ
      Model-Based Verification. Benefits: ● Strong safety guarantees ● Automated analysis. Drawbacks: ● Control policies are typically non-deterministic: answers “what is safe”, not “what is useful” ● Assumes accurate model.
      Reinforcement Learning. Benefits: ● No need for complete model ● Optimal (effective) policies. Drawbacks: ● No strong safety guarantees ● Proofs are obtained and checked by hand ● Formal proofs = decades-long proof development

  15. Model-Based Verification vs. Reinforcement Learning (Act / Observe loop): φ
      Goal: Provably correct reinforcement learning.
      Model-Based Verification. Benefits: ● Strong safety guarantees ● Computational aids (ATP). Drawbacks: ● Control policies are typically non-deterministic: answers “what is safe”, not “what is useful” ● Assumes accurate model.
      Reinforcement Learning. Benefits: ● No need for complete model ● Optimal (effective) policies. Drawbacks: ● No strong safety guarantees ● Proofs are obtained and checked by hand ● Formal proofs = decades-long proof development

  16. Model-Based Verification vs. Reinforcement Learning (Act / Observe loop): φ
      Goal: Provably correct reinforcement learning.
      1. Learn Safety
      2. Learn a Safe Policy
      3. Justify claims of safety
      Model-Based Verification. Benefits: ● Strong safety guarantees ● Computational aids (ATP). Drawbacks: ● Control policies are typically non-deterministic: answers “what is safe”, not “what is useful” ● Assumes accurate model.
      Reinforcement Learning. Benefits: ● No need for complete model ● Optimal (effective) policies. Drawbacks: ● No strong safety guarantees ● Proofs are obtained and checked by hand ● Formal proofs = decades-long proof development

  17. Model-Based Verification: Accurate, analyzable models often exist!
      { {?safeAccel; accel ∪ brake ∪ ?safeTurn; turn};
        {pos’ = vel, vel’ = acc}
      }*

  18. Model-Based Verification: Accurate, analyzable models often exist!
      { {?safeAccel; accel ∪ brake ∪ ?safeTurn; turn};   (discrete control)
        {pos’ = vel, vel’ = acc}                         (continuous motion)
      }*

  19. Model-Based Verification: Accurate, analyzable models often exist!
      { {?safeAccel; accel ∪ brake ∪ ?safeTurn; turn};   (discrete, non-deterministic control)
        {pos’ = vel, vel’ = acc}                         (continuous motion)
      }*
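As a concrete reading of slides 17-19, here is a minimal Python simulation sketch of this hybrid program. The guard ?safeAccel is instantiated with a standard stopping-distance condition; the constants A (maximum acceleration), B (braking force), T (length of one control cycle) and STOP_SIGN are illustrative assumptions, and the turn branch is omitted for brevity.

    import random

    A, B, T = 2.0, 3.0, 0.5   # assumed max acceleration, braking force, cycle time
    STOP_SIGN = 100.0         # assumed stop sign position

    def safe_accel(pos, vel):
        # ?safeAccel: even after accelerating for one full cycle, the car can
        # still brake to a stop before the stop sign.
        vel_next = vel + A * T
        pos_next = pos + vel * T + 0.5 * A * T ** 2
        return pos_next + vel_next ** 2 / (2 * B) < STOP_SIGN

    def step(pos, vel):
        # Discrete, non-deterministic control: braking is always allowed,
        # accelerating only when the ?safeAccel test passes.
        choices = [-B]
        if safe_accel(pos, vel):
            choices.append(A)
        acc = random.choice(choices)  # resolve the non-determinism arbitrarily
        # Continuous motion for one cycle: pos' = vel, vel' = acc
        # (velocity clamped at 0; the position update ignores the clamp).
        pos += vel * T + 0.5 * acc * T ** 2
        vel = max(0.0, vel + acc * T)
        return pos, vel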

  20. Model-Based Verification: Accurate, analyzable models often exist!
      init → [ { {?safeAccel; accel ∪ brake ∪ ?safeTurn; turn};
                 {pos’ = vel, vel’ = acc}
               }* ] pos < stopSign
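The test ?safeAccel is not spelled out on these slides; a standard stopping-distance instantiation (an assumption here, matching the simulation sketch above) is:

      safeAccel ≡ pos + vel·T + (A/2)·T² + (vel + A·T)²/(2·B) < stopSign

i.e., even after accelerating at rate A for one full control cycle of length T, the car can still brake to a standstill at rate B before the stop sign.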

  21. Model-Based Verification: Accurate, analyzable models often exist! Formal verification gives strong safety guarantees.
      init → [ { {?safeAccel; accel ∪ brake ∪ ?safeTurn; turn};
                 {pos’ = vel, vel’ = acc}
               }* ] pos < stopSign

  22. Model-Based Verification: Accurate, analyzable models often exist! Formal verification gives strong safety guarantees:
      ● Computer-checked proofs of the safety specification

  23. Model-Based Verification: Accurate, analyzable models often exist! Formal verification gives strong safety guarantees:
      ● Computer-checked proofs of the safety specification
      ● Formal proofs mapping the model to runtime monitors
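Slide 23's runtime monitors can be pictured with a minimal sketch of a ModelPlex-style model monitor: a check, synthesized from the proof in the real tool chain but hand-written here, that an observed state transition is explainable by some run of the verified model. The constants A, B, T and the tolerance eps are illustrative assumptions.

    def model_monitor(prev, curr, A=2.0, B=3.0, T=0.5, eps=1e-3):
        # prev and curr are observed (pos, vel) samples one control cycle apart.
        # True iff some modeled control choice (accelerate or brake) reproduces
        # the observed transition within tolerance eps.
        pos0, vel0 = prev
        pos1, vel1 = curr
        for acc in (A, -B):
            pos_pred = pos0 + vel0 * T + 0.5 * acc * T ** 2
            vel_pred = max(0.0, vel0 + acc * T)
            if abs(pos_pred - pos1) <= eps and abs(vel_pred - vel1) <= eps:
                return True
        return False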

  24. Model-Based Verification Isn’t Enough: Perfect, analyzable models don’t exist!

  25. Model-Based Verification Isn’t Enough: Perfect, analyzable models don’t exist!
      { {?safeAccel; accel ∪ brake ∪ ?safeTurn; turn};   (How to implement?)
        {pos’ = vel, vel’ = acc}                         (Only accurate sometimes)
      }*

  26. Model-Based Verification Isn’t Enough: Perfect, analyzable models don’t exist!
      { {?safeAccel; accel ∪ brake ∪ ?safeTurn; turn};   (How to implement?)
        {dx’ = w*y, dy’ = -w*x, ...}                     (Only accurate sometimes)
      }*

  27. Our Contribution: Justified Speculative Control is an approach toward provably safe reinforcement learning that:
      1. learns to resolve non-determinism without sacrificing formal safety results

  28. Our Contribution: Justified Speculative Control is an approach toward provably safe reinforcement learning that:
      1. learns to resolve non-determinism without sacrificing formal safety results
      2. allows and directs speculation whenever model mismatches occur
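Read as pseudocode, the decision rule these two points describe might look like the following hedged sketch. The callbacks controller_monitor and model_monitor stand for the monitors derived from the safety proof, and q_value for the learned action-value function; their names and signatures are assumptions for illustration.

    ACTIONS = ("accel", "brake", "turn")

    def jsc_action(state, prev_state, controller_monitor, model_monitor, q_value):
        # While observations agree with the verified model, take only actions
        # the controller monitor certifies as safe; on a model mismatch,
        # speculation is permitted and the learned values take over.
        if model_monitor(prev_state, state):
            allowed = [a for a in ACTIONS if controller_monitor(state, a)]
        else:
            allowed = list(ACTIONS)  # model mismatch: speculate
        # Assumes braking is always certified, so `allowed` is never empty.
        return max(allowed, key=lambda a: q_value(state, a))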

  29. Learning to Resolve Non-determinism: Act; Observe & compute reward

  30. Learning to Resolve Non-determinism: accel ∪ brake ∪ turn; Observe & compute reward

  31. Learning to Resolve Non-determinism: {accel, brake, turn}; Observe & compute reward

  32. Learning to Resolve Non-determinism: {accel, brake, turn} ⇨ Policy; Observe & compute reward

  33. Learning to Resolve Non-determinism: {accel, brake, turn} (safe?) ⇨ Policy; Observe & compute reward

  34. Learning to Safely Resolve Non-determinism: Safety Monitor (safe?) ⇨ Policy; Observe & compute reward

  35. Learning to Safely Resolve Non-determinism: Safety Monitor (safe?) ⇨ Policy; Observe & compute reward. A safety monitor ≠ “Trust Me”

  36. Learning to Safely Resolve Non-determinism: φ (safe?) ⇨ Policy; Observe & compute reward.
      Use a theorem prover to prove: (init → [{{accel ∪ brake};ODEs}*](safe)) ↔ φ
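One way to picture this monitored learning loop is ordinary tabular Q-learning with every action choice filtered through the safety monitor, as in this sketch; safe_actions is an assumed callback standing in for the proof-derived monitor, and is assumed to always admit at least braking.

    import random
    from collections import defaultdict

    ACTIONS = ("accel", "brake", "turn")
    Q = defaultdict(float)  # tabular action values, default 0.0

    def choose(state, safe_actions, epsilon=0.1):
        # Epsilon-greedy selection restricted to monitor-approved actions.
        allowed = [a for a in ACTIONS if safe_actions(state, a)]
        if random.random() < epsilon:
            return random.choice(allowed)
        return max(allowed, key=lambda a: Q[(state, a)])

    def update(state, action, reward, next_state, safe_actions,
               alpha=0.5, gamma=0.99):
        # Standard Q-learning target, with the max likewise restricted
        # to safe actions.
        best = max(Q[(next_state, a)] for a in ACTIONS
                   if safe_actions(next_state, a))
        Q[(state, action)] += alpha * (reward + gamma * best
                                       - Q[(state, action)])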

  38. Learning to Safely Resolve Non-determinism: φ (safe?) ⇨ Policy; Observe & compute reward.
      Main Theorem: If the ODEs are accurate, then our formal proofs transfer from the non-deterministic model to the learned (deterministic) policy.
      Use a theorem prover to prove: (init → [{{accel ∪ brake};ODEs}*](safe)) ↔ φ

  39. Learning to Safely Resolve Non-determinism: φ (safe?) ⇨ Policy; Observe & compute reward.
      Main Theorem: If the ODEs are accurate, then our formal proofs transfer from the non-deterministic model to the learned (deterministic) policy via the model monitor.
      Use a theorem prover to prove: (init → [{{accel ∪ brake};ODEs}*](safe)) ↔ φ

  40. What About the Physical Model? φ (safe?) ⇨ Policy; Observe & compute reward. {pos’ = vel, vel’ = acc} ≠ reality?
      Use a theorem prover to prove: (init → [{{accel ∪ brake};ODEs}*](safe)) ↔ φ

  41. What About the Physical Model? {brake, accel, turn}; Observe & compute reward

  42. What About the Physical Model? Model is accurate. {brake, accel, turn}; Observe & compute reward

  44. What About the Physical Model? Model is accurate vs. model is inaccurate. {brake, accel, turn}; Observe & compute reward

  45. What About the Physical Model? Model is accurate vs. model is inaccurate: Obstacle! {brake, accel, turn}; Observe & compute reward

  46. What About the Physical Model? Expected vs. Reality. {brake, accel, turn}; Observe & compute reward

  47. Speculation is Justified: Expected (safe) vs. Reality (crash!). {brake, accel, turn}; Observe & compute reward

  48. Leveraging Verification Results to Learn Better: Use a real-valued version of the model monitor as a reward signal. {brake, accel, turn}; Observe & compute reward
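A real-valued version of the model monitor can be obtained by replacing the boolean check with the distance to the nearest model-conformant transition, and then folding that distance into the reward. A hedged sketch, reusing the illustrative constants from the monitor above:

    def monitor_margin(prev, curr, A=2.0, B=3.0, T=0.5):
        # Real-valued analogue of the model monitor: 0.0 means the observed
        # transition is exactly explained by the model; larger values measure
        # how far reality has drifted from the verified model.
        pos0, vel0 = prev
        pos1, vel1 = curr
        best = float("inf")
        for acc in (A, -B):
            pos_pred = pos0 + vel0 * T + 0.5 * acc * T ** 2
            vel_pred = max(0.0, vel0 + acc * T)
            best = min(best, abs(pos_pred - pos1) + abs(vel_pred - vel1))
        return best

    def shaped_reward(task_reward, prev, curr, weight=1.0):
        # Penalize divergence from the model, directing speculation back
        # toward states where the safety proof applies.
        return task_reward - weight * monitor_margin(prev, curr)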
