
A Composable Specification Language for Reinforcement Learning Tasks - PowerPoint PPT Presentation



  1. A Composable Specification Language for Reinforcement Learning Tasks. Kishor Jothimurugan, Rajeev Alur, Osbert Bastani

  2. Control System
     • Continuous states and actions: the controller observes a state s ∈ S and produces a control input a ∈ A
     • The system can be probabilistic
     • Discrete-time system
     • Finite time horizon T
     S = set of system states, A = set of control inputs

  3. Reinforcement Learning
     • The controller is a neural network mapping states s ∈ S to actions a ∈ A
     • Design a reward function R mapping runs to rewards
     • Learn the network parameters optimizing the expected reward over runs
     S = set of system states, A = set of control inputs

  4. Reward Functions
     • Too low-level compared to a logical specification
     • No obvious way to compose rewards:
       R1: reward function for "Reach q"
       R2: reward function for "Reach p"
       Reward function for "Reach q and then reach p"?

  5. Need to generate the reward function from a given logical specification

  6. Need for Memory
     • Specification: reach q, then reach p, then reach r
     • The controller maps states to actions
     • The action taken at p depends on the history of the run
     Solution: add an additional state component indicating whether q has already been visited
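A minimal sketch of this idea (hypothetical helper names, not the SPECTRL implementation): augment the observation with a flag that is set once q is visited, so a memoryless policy over the extended state can behave differently before and after reaching q.

```python
import numpy as np

def make_extended_state(s, reached_q):
    """Extended state: the raw system state plus one memory bit for 'q already visited'."""
    return np.append(s, float(reached_q))

def update_memory(s, reached_q, q, radius=1.0):
    """Set the flag once the run comes within `radius` of q; once set, it stays set."""
    return reached_q or (np.linalg.norm(s - q) < radius)
```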

  7. Need to generate the reward function from a given logical specification
     Need to automatically infer the additional state components from the specification

  8. Our Framework
     • System: an MDP (S, A, P, T, s_0), where P(s, a, s') = Pr(s' | s, a) is given as a black-box forward simulator and T is the time horizon
     • Specification φ given in our task specification language
     Goal: synthesize a control policy π* such that π* ∈ argmax_π Pr[ζ ⊨ φ], where ζ is a run generated by π

  9. Our Framework
     [Block diagram] The specification is compiled into a nondeterministic task monitor (an automaton). The monitor is combined with the system into a product MDP with a reward function, and a reinforcement learning algorithm is run on the product to produce a control policy.

  10. Task Specification Language
      φ ::= achieve b | φ1 ensuring b | φ1; φ2 | φ1 or φ2
      • Example base predicates:
        o Near_q is satisfied if and only if the distance to q is less than 1
        o Away_O is satisfied if and only if the distance to O is positive
      • Specification for the navigation example: achieve Near_q; achieve Near_p ensuring Away_O
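As a rough illustration (hypothetical class names, not the SPECTRL API), the grammar above can be read as a small algebraic datatype over specifications:

```python
from dataclasses import dataclass
from typing import Callable, Union

Predicate = Callable[..., float]  # maps a system state to a real value; > 0 means satisfied

@dataclass
class Achieve:            # achieve b
    b: Predicate

@dataclass
class Ensuring:           # phi ensuring b
    phi: "Spec"
    b: Predicate

@dataclass
class Seq:                # phi1; phi2
    phi1: "Spec"
    phi2: "Spec"

@dataclass
class Choose:             # phi1 or phi2
    phi1: "Spec"
    phi2: "Spec"

Spec = Union[Achieve, Ensuring, Seq, Choose]

# Navigation example from the slide, assuming near_q, near_p, away_o are
# placeholder predicate functions:
#   spec = Seq(Achieve(near_q), Ensuring(Achieve(near_p), away_o))
```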

  11. Quantitative Semantics
      • Assume each base predicate b ∈ P is associated with a quantitative semantics ⟦b⟧: S → ℝ such that s ⊨ b if and only if ⟦b⟧(s) > 0
        o ⟦Near_q⟧(s) = 1 − dist(s, q)
        o ⟦Away_O⟧(s) = dist(s, O)
      • Extend to positive Boolean combinations by
        o ⟦b1 ∨ b2⟧ = max(⟦b1⟧, ⟦b2⟧)
        o ⟦b1 ∧ b2⟧ = min(⟦b1⟧, ⟦b2⟧)
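A small sketch of these semantics in Python (assuming states are 2D NumPy arrays and dist is Euclidean distance; the function names are placeholders):

```python
import numpy as np

def near(q, radius=1.0):
    """[[Near_q]](s) = radius - dist(s, q): positive exactly when s is within `radius` of q."""
    return lambda s: radius - np.linalg.norm(s - q)

def away(obstacle):
    """[[Away_O]](s) = dist(s, O): positive exactly when s is strictly away from point O."""
    return lambda s: np.linalg.norm(s - obstacle)

def p_or(b1, b2):
    """[[b1 or b2]](s) = max([[b1]](s), [[b2]](s))."""
    return lambda s: max(b1(s), b2(s))

def p_and(b1, b2):
    """[[b1 and b2]](s) = min([[b1]](s), [[b2]](s))."""
    return lambda s: min(b1(s), b2(s))

# Example: satisfied (value > 0) iff s is near q and away from O.
# near_q_and_away_o = p_and(near(np.array([5.0, 5.0])), away(np.array([2.0, 2.0])))
```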

  12. Task Monitor
      • Finite state machine
      • Registers that store quantitative information
      • Compilation is similar to the NFA construction from regular expressions
      [Figure: task monitor for φ = achieve b]

  13. Task Monitor (continued)
      [Figure: task monitor for achieve Near_q; achieve Near_p ensuring Away_O, with monitor states, registers v, transition predicates, and register updates labeled]
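Schematically (a hypothetical encoding, not the SPECTRL internals), a task monitor can be represented as a set of states, a set of registers, and guarded transitions that update the registers:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Transition:
    source: int                          # monitor state the transition leaves
    target: int                          # monitor state it enters
    guard: Callable[..., float]          # predicate over (s, v); take the transition when > 0
    update: Callable[..., tuple]         # new register values as a function of (s, v)

@dataclass
class TaskMonitor:
    num_states: int
    num_registers: int
    initial_state: int
    final_states: List[int]
    transitions: List[Transition]
    final_reward: Callable[..., float]   # reward read off (s, v) when ending in a final state
```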

  14. Extended Policy
      • Map each monitor state q to its own neural network
      • The network for state q takes the system state and the register values as input and outputs the system action and the next monitor transition
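A rough sketch of such an extended policy (hypothetical PyTorch-style code; the architecture and the way the monitor transition is selected are assumptions, not the paper's exact design):

```python
import torch
import torch.nn as nn

class ExtendedPolicy(nn.Module):
    """One small network per monitor state; input is (system state, register values)."""

    def __init__(self, num_monitor_states, state_dim, register_dim, action_dim, max_transitions):
        super().__init__()
        in_dim = state_dim + register_dim
        self.nets = nn.ModuleList([
            nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(),
                          nn.Linear(64, action_dim + max_transitions))
            for _ in range(num_monitor_states)
        ])
        self.action_dim = action_dim

    def forward(self, q, s, v):
        """q: monitor state index; s: system state; v: register values."""
        out = self.nets[q](torch.cat([s, v], dim=-1))
        action = out[..., :self.action_dim]                 # system action
        transition_scores = out[..., self.action_dim:]      # scores over outgoing monitor transitions
        next_transition = transition_scores.argmax(dim=-1)  # chosen monitor transition
        return action, next_transition
```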

  15. Assigning Rewards
      Given a sequence of extended system states ζ = (q_0, s_0, v_0) → ⋯ → (q_T, s_T, v_T), what should its reward be?
      • Case 1 (q_T is a final monitor state): the reward is given by the monitor
      • Case 2 (q_T is not a final state): not all tasks have been completed
        o Suggestion 1: R(ζ) = −∞
        o Suggestion 2: find a reward function R' that preserves the ordering of runs w.r.t. R, i.e. R(ζ) > R(ζ') implies R'(ζ) > R'(ζ')

  16. Reward Shaping
      Given ζ = (q_0, s_0, v_0) → ⋯ → (q_T, s_T, v_T) with q_T non-final:
      • Higher reward for monitor states farther from the start:
        R''(q_T)(s, v) = C_l + 2·C_u·d_{q_T} − C_u + max_j ⟦σ_j⟧(s, v)
        where the σ_j are the predicates on transitions out of q_T that make progress (non-self-loops)
      • Prefer runs that get close to satisfying some such predicate:
        R'(ζ) = max over t with q_t = q_T of R''(q_T)(s_t, v_t)
      • d_q: length of the longest path from q_0 to q without using self-loops
      • C_l: lower bound on the possible reward in any final state
      • C_u: upper bound on the last (max) term, for all q
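A sketch of how such a shaped reward could be computed from a rollout (a minimal illustration assuming the monitor representation sketched earlier; the constants c_lower, c_upper and the exact formula follow the slide's structure and are not guaranteed to match the paper verbatim):

```python
def shaped_reward(monitor, rollout, c_lower, c_upper, longest_path):
    """rollout: list of (q, s, v) triples; returns a reward even when q_T is non-final.

    longest_path[q] is the length of the longest self-loop-free path from the
    initial monitor state to q.
    """
    q_last, s_last, v_last = rollout[-1]

    # Case 1: the run ends in a final monitor state; read the reward off the monitor.
    if q_last in monitor.final_states:
        return monitor.final_reward(s_last, v_last)

    # Case 2: shaped reward for a run ending in a non-final state.
    def partial_reward(s, v):
        # Predicates on transitions out of q_last that make progress (no self-loops).
        progress = [t.guard(s, v) for t in monitor.transitions
                    if t.source == q_last and t.target != q_last]
        best = max(progress) if progress else -c_upper
        return c_lower + 2 * c_upper * longest_path[q_last] - c_upper + best

    # Take the best partial reward over all steps spent in q_last.
    return max(partial_reward(s, v) for (q, s, v) in rollout if q == q_last)
```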

  17. Experiments
      • Implemented our approach in a tool called SPECTRL (SPECifying Tasks for RL)
      • Case study in the 2D navigation setting:
        o S = ℝ² and A = ℝ²
        o Transitions given by s_{t+1} = s_t + a_t + ε, where ε is small Gaussian noise
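A minimal sketch of these dynamics (the noise scale and horizon are assumed placeholders, not values from the paper):

```python
import numpy as np

class Navigation2D:
    """Point-mass 2D navigation: s_{t+1} = s_t + a_t + Gaussian noise."""

    def __init__(self, noise_std=0.05, horizon=40, seed=0):
        self.noise_std = noise_std
        self.horizon = horizon
        self.rng = np.random.default_rng(seed)
        self.reset()

    def reset(self):
        self.s = np.zeros(2)
        self.t = 0
        return self.s

    def step(self, a):
        self.s = self.s + a + self.rng.normal(scale=self.noise_std, size=2)
        self.t += 1
        done = self.t >= self.horizon
        return self.s, done
```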

  18. 2D Navigation Tasks
      [Figure: learning curves for different tasks, comparing SPECTRL with the TLTL and CCE baselines]

  19. 2D Navigation Tasks
      [Figure: sample-complexity curve; the y-axis is the number of sample trajectories needed to learn, the x-axis is the number of nested goals]

  20. Cartpole
      [Figure: learning curve for cartpole]
      Spec: go to the right and return to the start position without letting the pole fall

  21. THANK YOU!
