
  1. Safe Reinforcement Learning via Formal Methods Nathan Fulton and André Platzer Carnegie Mellon University

  2. Safe Reinforcement Learning via Formal Methods Nathan Fulton and André Platzer Carnegie Mellon University

  3. Safety-Critical Systems "How can we provide people with cyber-physical systems they can bet their lives on?" - Jeannette Wing

  4. Autonomous Safety-Critical Systems How can we provide people with autonomous cyber-physical systems they can bet their lives on?

  5. Model-Based Verification Reinforcement Learning φ

  6. Model-Based Verification Reinforcement Learning pos < stopSign

  7. Model-Based Verification Reinforcement Learning ctrl pos < stopSign

  8. Model-Based Verification Reinforcement Learning ctrl pos < stopSign Approach: prove that control software achieves a specification with respect to a model of the physical system.

  9. Model-Based Verification Reinforcement Learning ctrl pos < stopSign Approach: prove that control software achieves a specification with respect to a model of the physical system.

  10. Model-Based Verification Reinforcement Learning φ Benefits: ● Strong safety guarantees ● Automated analysis

  11. Model-Based Verification Reinforcement Learning φ Benefits: ● Strong safety guarantees ● Automated analysis Drawbacks: ● Control policies are typically non-deterministic: answers “what is safe”, not “what is useful”

  12. Model-Based Verification Reinforcement Learning φ Benefits: ● Strong safety guarantees ● Automated analysis Drawbacks: ● Control policies are typically non-deterministic: answers “what is safe”, not “what is useful” ● Assumes accurate model

  13. Model-Based Verification Reinforcement Learning (Act, Observe) φ Benefits: ● Strong safety guarantees ● Automated analysis Drawbacks: ● Control policies are typically non-deterministic: answers “what is safe”, not “what is useful” ● Assumes accurate model.

  14. Model-Based Verification Reinforcement Learning (Act, Observe) φ
      Model-Based Verification benefits: ● Strong safety guarantees ● Automated analysis
      Reinforcement Learning benefits: ● No need for complete model ● Optimal (effective) policies
      Model-Based Verification drawbacks: ● Control policies are typically non-deterministic: answers “what is safe”, not “what is useful” ● Assumes accurate model.

  15. Model-Based Verification Reinforcement Learning (Act, Observe) φ
      Model-Based Verification benefits: ● Strong safety guarantees ● Automated analysis
      Reinforcement Learning benefits: ● No need for complete model ● Optimal (effective) policies
      Model-Based Verification drawbacks: ● Control policies are typically non-deterministic: answers “what is safe”, not “what is useful” ● Assumes accurate model.
      Reinforcement Learning drawbacks: ● No strong safety guarantees ● Proofs are obtained and checked by hand ● Formal proofs = decades-long proof development

  16. Model-Based Verification Reinforcement Learning (Act, Observe) φ
      Goal: Provably correct reinforcement learning
      Model-Based Verification benefits: ● Strong safety guarantees ● Computational aids (ATP)
      Reinforcement Learning benefits: ● No need for complete model ● Optimal (effective) policies
      Model-Based Verification drawbacks: ● Control policies are typically non-deterministic: answers “what is safe”, not “what is useful” ● Assumes accurate model
      Reinforcement Learning drawbacks: ● No strong safety guarantees ● Proofs are obtained and checked by hand ● Formal proofs = decades-long proof development

  17. Model-Based Verification Reinforcement Learning (Act, Observe) φ
      Goal: Provably correct reinforcement learning 1. Learn Safety 2. Learn a Safe Policy 3. Justify claims of safety
      Model-Based Verification benefits: ● Strong safety guarantees ● Computational aids (ATP)
      Reinforcement Learning benefits: ● No need for complete model ● Optimal (effective) policies
      Model-Based Verification drawbacks: ● Control policies are typically non-deterministic: answers “what is safe”, not “what is useful” ● Assumes accurate model
      Reinforcement Learning drawbacks: ● No strong safety guarantees ● Proofs are obtained and checked by hand ● Formal proofs = decades-long proof development

  18. Model-Based Verification Accurate, analyzable models often exist! { {?safeAccel;accel ∪ brake ∪ ?safeTurn; turn}; {pos’ = vel, vel’ = acc} }*

  19. Model-Based Verification Accurate, analyzable models often exist!
      { {?safeAccel;accel ∪ brake ∪ ?safeTurn; turn};   (discrete control)
        {pos’ = vel, vel’ = acc}                          (continuous motion)
      }*

  20. Model-Based Verification Accurate, analyzable models often exist!
      { {?safeAccel;accel ∪ brake ∪ ?safeTurn; turn};   (discrete, non-deterministic control)
        {pos’ = vel, vel’ = acc}                          (continuous motion)
      }*

  21. Model-Based Verification Accurate, analyzable models often exist!
      init → [{ {?safeAccel;accel ∪ brake ∪ ?safeTurn; turn};
                {pos’ = vel, vel’ = acc}
              }*] pos < stopSign

  22. Model-Based Verification Accurate, analyzable models often exist! Formal verification gives strong safety guarantees.
      init → [{ {?safeAccel;accel ∪ brake ∪ ?safeTurn; turn};
                {pos’ = vel, vel’ = acc}
              }*] pos < stopSign
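
To make the specification concrete, here is a minimal Python sketch of the modeled system, with an illustrative stopping-distance instantiation of the ?safeAccel guard. The constants A (maximum acceleration), B (braking rate), T (control period), the stop-sign position, and the guard itself are assumptions made for this sketch, not necessarily the exact model from the talk; the actual guarantee comes from a symbolic proof about the hybrid program (e.g., in a prover such as KeYmaera X), not from simulation.

    import random

    A, B, T = 2.0, 4.0, 0.1       # assumed max acceleration, braking rate, control period
    STOP_SIGN = 100.0             # assumed stop-sign position

    def safe_accel(pos, vel):
        # Illustrative guard: even after accelerating at A for one period T and then
        # braking at rate B, the car still stops before the stop sign.
        return pos + vel**2 / (2*B) + (A/B + 1) * (A*T**2/2 + T*vel) < STOP_SIGN

    def ctrl(pos, vel):
        # Discrete, non-deterministic control: any enabled branch may be taken.
        enabled = [-B]                      # brake is always allowed
        if safe_accel(pos, vel):
            enabled.append(A)               # ?safeAccel; accel
        return random.choice(enabled)

    def plant(pos, vel, acc, dt=0.001):
        # Continuous motion pos' = vel, vel' = acc over one control period (Euler steps).
        t = 0.0
        while t < T and not (vel <= 0.0 and acc < 0.0):
            vel = vel + acc*dt
            pos = pos + max(vel, 0.0)*dt
            t += dt
        return pos, max(vel, 0.0)

    # {ctrl; plant}* : the verified claim is that pos < stopSign holds after every iteration.
    pos, vel = 0.0, 10.0
    for _ in range(2000):
        pos, vel = plant(pos, vel, ctrl(pos, vel))
        assert pos < STOP_SIGN

Note that ctrl is deliberately left non-deterministic here: the proof only says every allowed choice is safe, not which choice is useful.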

  23. Model-Based Verification Accurate, analyzable models often exist! Formal verification gives strong safety guarantees: ● Computer-checked proofs of the safety specification.

  24. Model-Based Verification Accurate, analyzable models often exist! Formal verification gives strong safety guarantees: ● Computer-checked proofs of the safety specification ● Formal proofs mapping model to runtime monitors

  25. Model-Based Verification Isn’t Enough Perfect, analyzable models don’t exist!

  26. Model-Based Verification Isn’t Enough Perfect, analyzable models don’t exist! How to implement?
      { {?safeAccel;accel ∪ brake ∪ ?safeTurn; turn};
        {pos’ = vel, vel’ = acc}   (only accurate sometimes)
      }*

  27. Model-Based Verification Isn’t Enough Perfect, analyzable models don’t exist! How to implement?
      { {?safeAccel;accel ∪ brake ∪ ?safeTurn; turn};
        {dx’=w*y, dy’=-w*x, ...}   (only accurate sometimes)
      }*

  28. Our Contribution Justified Speculative Control is an approach toward provably safe reinforcement learning that: 1. learns to resolve non-determinism without sacrificing formal safety results

  29. Our Contribution Justified Speculative Control is an approach toward provably safe reinforcement learning that: 1. learns to resolve non-determinism without sacrificing formal safety results 2. allows and directs speculation whenever model mismatches occur

  30. Learning to Resolve Non-determinism Act Observe & compute reward

  31. Learning to Resolve Non-determinism accel ∪ brake ∪ turn Observe & compute reward

  32. Learning to Resolve Non-determinism {accel,brake,turn} Observe & compute reward

  33. Learning to Resolve Non-determinism {accel,brake,turn} ⇨ Policy Observe & compute reward

  34. Learning to Resolve Non-determinism {accel,brake,turn} (safe?) ⇨ Policy Observe & compute reward

  35. Learning to Safely Resolve Non-determinism Safety Monitor (safe?) ⇨ Policy Observe & compute reward

  36. Learning to Safely Resolve Non-determinism Safety Monitor (≠ “Trust Me”) (safe?) ⇨ Policy Observe & compute reward

  37. Learning to Safely Resolve Non-determinism φ (safe?) ⇨ Policy Observe & compute reward Use a theorem prover to prove: (init → [{{accel ∪ brake};ODEs}*](safe)) ↔ φ

  38. Learning to Safely Resolve Non-determinism φ (safe?) ⇨ Policy Observe & compute reward Use a theorem prover to prove: (init → [{{accel ∪ brake};ODEs}*](safe)) ↔ φ
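
In the learning loop, the proved condition φ acts as a controller monitor that filters the action set before the learner chooses, so the policy only ever optimizes over actions that are provably safe in the current state. Below is a minimal Q-learning-style sketch of that idea; the SafeQAgent class, its controller_monitor argument, and the hyperparameters are illustrative assumptions (the approach is not tied to Q-learning, and in the actual system the monitor is derived from the formal proof rather than hand-coded).

    import random
    from collections import defaultdict

    class SafeQAgent:
        """Q-learning agent that only ever selects actions the controller monitor accepts."""
        def __init__(self, controller_monitor, alpha=0.5, gamma=0.99, eps=0.1):
            self.Q = defaultdict(float)          # Q[(state, action)]; states assumed hashable/discretized
            self.monitor = controller_monitor    # proved condition phi: (state, action) -> bool
            self.alpha, self.gamma, self.eps = alpha, gamma, eps

        def safe_actions(self, state, actions):
            # The monitor resolves "what is safe"; learning resolves "what is useful".
            return [a for a in actions if self.monitor(state, a)]

        def choose(self, state, actions):
            safe = self.safe_actions(state, actions)   # non-empty if a fallback (e.g. brake) is always safe
            if random.random() < self.eps:
                return random.choice(safe)             # explore, but only among safe actions
            return max(safe, key=lambda a: self.Q[(state, a)])

        def update(self, state, action, reward, next_state, actions):
            next_safe = self.safe_actions(next_state, actions) or list(actions)
            best_next = max(self.Q[(next_state, a)] for a in next_safe)
            self.Q[(state, action)] += self.alpha * (
                reward + self.gamma * best_next - self.Q[(state, action)])

Because φ was proved equivalent to the safety of the non-deterministic model, restricting exploration this way does not amount to "trust me": the restriction is exactly as trustworthy as the computer-checked proof.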

  39. Learning to Safely Resolve Non-determinism φ (safe?) ⇨ Policy Observe & compute reward
      Main Theorem: If the ODEs are accurate, then our formal proofs transfer from the non-deterministic model to the learned (deterministic) policy.
      Use a theorem prover to prove: (init → [{{accel ∪ brake};ODEs}*](safe)) ↔ φ

  40. Learning to Safely Resolve Non-determinism φ (safe?) ⇨ Policy Observe & compute reward
      Main Theorem: If the ODEs are accurate, then our formal proofs transfer from the non-deterministic model to the learned (deterministic) policy via the model monitor.
      Use a theorem prover to prove: (init → [{{accel ∪ brake};ODEs}*](safe)) ↔ φ
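
The premise "if the ODEs are accurate" is itself checked at runtime by a model monitor: a condition that compares the state observed after each control cycle against what the ODE model could have produced. The sketch below hand-codes such a check for the straight-line dynamics pos' = vel, vel' = acc, with an assumed period T and numeric tolerance; a ModelPlex-style monitor is instead derived symbolically from the verified model, so treat this only as an illustration of the interface.

    def model_monitor(prev, acc, observed, T=0.1, tol=1e-2):
        # prev = (pos, vel) before the control cycle, observed = (pos, vel) after it.
        pos0, vel0 = prev
        pos1, vel1 = observed
        # Closed-form solution of pos' = vel, vel' = acc over duration T.
        pred_pos = pos0 + vel0*T + 0.5*acc*T**2
        pred_vel = vel0 + acc*T
        # The transition is consistent with the model if observation matches prediction.
        return abs(pos1 - pred_pos) <= tol and abs(vel1 - pred_vel) <= tol

As long as this check passes on every observed transition, the main theorem says the safety proof for the non-deterministic model transfers to whatever deterministic policy the learner has converged to.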

  41. What about the physical model? φ (safe?) ⇨ Policy Observe & compute reward ({pos’=vel,vel’=acc} ≠ ?) Use a theorem prover to prove: (init → [{{accel ∪ brake};ODEs}*](safe)) ↔ φ

  42. What About the Physical Model? {brake, accel, turn} Observe & compute reward

  43. What About the Physical Model? Model is accurate. {brake, accel, turn} Observe & compute reward

  44. What About the Physical Model? Model is accurate. {brake, accel, turn} Observe & compute reward

  45. What About the Physical Model? Model is accurate. {brake, accel, turn} Model is inaccurate Observe & compute reward

  46. What About the Physical Model? Model is accurate. {brake, accel, turn} Model is inaccurate: Obstacle! Observe & compute reward

  47. What About the Physical Model? Expected {brake, accel, turn} Reality Observe & compute reward

  48. Speculation is Justified Expected (safe) vs. Reality (crash!) {brake, accel, turn} Observe & compute reward
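
Putting the two monitors together gives the overall decision logic of justified speculative control: while the model monitor keeps passing, the agent only takes actions sanctioned by the controller monitor and the transferred safety proof applies; once the model monitor fails (an unexpected obstacle, curved rather than straight-line motion), the model no longer describes reality, no action can be certified, and speculation directed by what has been learned is justified. The schematic sketch below uses hypothetical policy and monitor objects; how speculation is rewarded and otherwise constrained follows the paper, not this code.

    def jsc_choose_action(state, actions, policy, controller_monitor, model_ok):
        # model_ok: result of the model monitor on the most recent observed transition.
        if model_ok:
            # Model matches reality so far: offer only provably safe actions, so the
            # safety proof covers whatever the learned policy picks.
            safe = [a for a in actions if controller_monitor(state, a)]
            return policy.best(state, safe)
        # Model mismatch: formal guarantees no longer apply, so speculation is justified
        # and the learned policy chooses among all available actions.
        return policy.best(state, actions)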
