A Composable Specification Language for Reinforcement Learning Tasks
Kishor Jothimurugan, Rajeev Alur, Osbert Bastani
Control System
• Continuous states and actions
• System can be probabilistic
• Discrete-time system
• Finite time horizon T
Controller: maps a state s ∈ S to a control input a ∈ A
S = set of system states
A = set of control inputs
Reinforcement Learning
• Use a neural network (the controller) to map states to actions
• Design a reward function R mapping runs to rewards
• Learn the network parameters that optimize the expected reward
S = set of system states
A = set of control inputs
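As a rough sketch of this setup (not the authors' implementation), the snippet below builds a tiny neural-network policy that maps a continuous state to a continuous action and estimates the learning objective, the expected reward of sampled runs, by Monte Carlo. The `simulate` and `R` arguments are hypothetical placeholders.

```python
import numpy as np

def make_policy(state_dim, action_dim, hidden=32, seed=0):
    """A tiny two-layer network mapping a system state s to a control input a."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.1, size=(hidden, state_dim))
    W2 = rng.normal(scale=0.1, size=(action_dim, hidden))
    return lambda s: W2 @ np.tanh(W1 @ s)

def expected_reward(policy, simulate, R, n_rollouts=100):
    """Monte Carlo estimate of E[R(run)], the quantity the parameters are trained to maximize."""
    return float(np.mean([R(simulate(policy)) for _ in range(n_rollouts)]))
```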
Reward Functions
• Too low-level compared to a logical specification
• No obvious way to compose rewards
R1: reward function for "Reach q"
R2: reward function for "Reach p"
Reward function for "Reach q and then Reach p"?
Need to generate a reward function from a given logical specification
Need for Memory
• Specification: Reach q, then Reach p, then Reach r
• The controller maps states to actions
• The action at p depends on the history of the run
Solution: augment the state with an additional component indicating whether q has already been visited
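One way to realize this extra memory (a sketch under assumed goal locations, not the tool's construction): keep a flag recording whether q has been visited, so a memoryless policy over the augmented state can act differently before and after reaching q.

```python
import numpy as np

# Illustrative goal locations and the radius at which a goal counts as reached.
GOAL_Q, GOAL_P = np.array([5.0, 0.0]), np.array([0.0, 5.0])
REACH_RADIUS = 1.0

def update_memory(state, visited_q):
    """Extra state component: has q already been visited along this run?"""
    return visited_q or np.linalg.norm(state - GOAL_Q) < REACH_RADIUS

def policy(state, visited_q):
    """A memoryless policy over the augmented state behaves differently before/after visiting q."""
    target = GOAL_P if visited_q else GOAL_Q
    return target - state  # move toward the current target
```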
Need to generate a reward function from a given logical specification
Need to automatically infer the additional state components from the specification
Our Framework
• System: MDP (S, A, P, T, s0), where P(s, a, s′) = Pr(s′ | s, a) is given as a black-box forward simulator
• Specification φ given in our task specification language
• Synthesize a control policy π* such that π* ∈ argmax_π Pr[ρ ⊨ φ], where ρ is the run of the system under π
Our Framework
(Pipeline diagram: the specification is compiled into a nondeterministic task monitor, which yields a monitor automaton and a reward function; the monitor automaton is composed with the system into a product MDP, and a reinforcement learning algorithm produces the control policy.)
Task Specification Language
φ ::= achieve b | φ₁ ensuring b | φ₁; φ₂ | φ₁ or φ₂
• Example base predicates:
  o reach_q is satisfied if and only if the distance to q is less than 1
  o avoid_O is satisfied if and only if there is a positive distance to O
• Specification for the navigation example: achieve reach_q; achieve reach_p ensuring avoid_O
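A minimal sketch of how the four constructors of this grammar could be represented as an abstract syntax tree; the class and predicate names are illustrative, not the SPECTRL API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Achieve:          # achieve b
    b: object

@dataclass(frozen=True)
class Ensuring:         # phi ensuring b
    phi: object
    b: object

@dataclass(frozen=True)
class Seq:              # phi1 ; phi2
    phi1: object
    phi2: object

@dataclass(frozen=True)
class Or:               # phi1 or phi2
    phi1: object
    phi2: object

# One possible parse of the navigation spec on this slide:
# achieve reach_q ; (achieve reach_p ensuring avoid_O)
nav_spec = Seq(Achieve("reach_q"), Ensuring(Achieve("reach_p"), "avoid_O"))
```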
Quantitative Semantics
• Assume each base predicate b is associated with a quantitative semantics ⟦b⟧ : S → ℝ such that s ⊨ b if and only if ⟦b⟧(s) > 0
  o ⟦reach_q⟧(s) = 1 − dist(s, q)
  o ⟦avoid_O⟧(s) = dist(s, O)
• Extend to positive Boolean combinations by:
  o ⟦b₁ ∨ b₂⟧ = max(⟦b₁⟧, ⟦b₂⟧)
  o ⟦b₁ ∧ b₂⟧ = min(⟦b₁⟧, ⟦b₂⟧)
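These definitions translate directly into code; here is a small sketch with the two distance-based predicates from this slide (the goal and obstacle positions passed in would be illustrative).

```python
import numpy as np

def reach(q):
    """⟦reach_q⟧(s) = 1 - dist(s, q); positive exactly when s is within distance 1 of q."""
    return lambda s: 1.0 - np.linalg.norm(s - q)

def avoid(O):
    """⟦avoid_O⟧(s) = dist(s, O); positive exactly when s keeps a positive distance from O."""
    return lambda s: np.linalg.norm(s - O)

def sem_or(f, g):   # ⟦b1 ∨ b2⟧ = max(⟦b1⟧, ⟦b2⟧)
    return lambda s: max(f(s), g(s))

def sem_and(f, g):  # ⟦b1 ∧ b2⟧ = min(⟦b1⟧, ⟦b2⟧)
    return lambda s: min(f(s), g(s))

def satisfied(f, s):
    """s ⊨ b iff ⟦b⟧(s) > 0."""
    return f(s) > 0
```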
Task Monitor
• Finite state machine
• Registers that store quantitative information
• Compilation similar to the NFA construction from regular expressions
(Figure: task monitor for φ = achieve b)
Task Monitor
(Figure: task monitor for achieve reach_q; achieve reach_p ensuring avoid_O, annotated with its registers v, transition predicates, and register updates)
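To make the monitor structure concrete, here is a hand-written sketch of such a monitor for the navigation spec: monitor states, one register recording how well avoid_O has been maintained, and transitions guarded by predicates with register updates. It illustrates the shape of the object on these two slides; it is not the output of the tool's compilation procedure, and the goal/obstacle locations are assumed.

```python
import numpy as np

# Assumed goal and obstacle locations for the navigation spec.
Q, P, O = np.array([5.0, 0.0]), np.array([0.0, 5.0]), np.array([2.5, 2.5])
reach_q = lambda s: 1.0 - np.linalg.norm(s - Q)
reach_p = lambda s: 1.0 - np.linalg.norm(s - P)
avoid_O = lambda s: np.linalg.norm(s - O)
always  = lambda s: 1.0

# Transitions: (source, guard predicate, register update, target).
# The single register v tracks the worst (minimum) value of avoid_O seen so far.
TRANSITIONS = [
    ("start", reach_q, lambda s, v: v,                  "got_q"),
    ("start", always,  lambda s, v: min(v, avoid_O(s)), "start"),   # self-loop
    ("got_q", reach_p, lambda s, v: v,                  "done"),
    ("got_q", always,  lambda s, v: min(v, avoid_O(s)), "got_q"),   # self-loop
]
FINAL_STATES = {"done"}

def monitor_step(q, v, s):
    """Take one enabled transition out of monitor state q (first match resolves the nondeterminism)."""
    for src, guard, update, dst in TRANSITIONS:
        if src == q and guard(s) > 0:
            return dst, update(s, v)
    return q, v
```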
Extended Policy
(Figure: the monitor state q selects a neural network; the network for state q maps the system state and register values to a system action and the choice of next monitor transition.)
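A sketch of the extended policy: one small network per monitor state, with the current monitor state selecting which network produces the action (register values could be appended to the input). Shapes and initialization are illustrative.

```python
import numpy as np

class ExtendedPolicy:
    """One network per monitor state q; the monitor state picks which network acts."""
    def __init__(self, monitor_states, state_dim, action_dim, hidden=32, seed=0):
        rng = np.random.default_rng(seed)
        self.nets = {
            q: (rng.normal(scale=0.1, size=(hidden, state_dim)),
                rng.normal(scale=0.1, size=(action_dim, hidden)))
            for q in monitor_states
        }

    def act(self, q, s):
        W1, W2 = self.nets[q]          # network associated with the current monitor state
        return W2 @ np.tanh(W1 @ s)
```

At each step one would call act(q, s) for the system action and then advance the monitor (as in the previous sketch) to obtain the next monitor state and register values.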
Assigning Rewards
Given a sequence of extended system states ρ = (q0, s0, v0) → ⋯ → (qT, sT, vT), what should its reward be?
• Case 1 (qT is a final state): reward is given by the monitor
• Case 2 (qT is not a final state): not all tasks have been completed
  o Suggestion 1: R(ρ) = −∞
  o Suggestion 2: find a reward function R′ that preserves the ordering of runs w.r.t. R: R(ρ) > R(ρ′) implies R′(ρ) > R′(ρ′)
Reward Shaping
Given ρ = (q0, s0, v0) → ⋯ → (qT, sT, vT) with qT non-final:
R″(qT)(s, v) = C_l + 2·C_u·(d_{qT} − D) + max_δ ⟦b_δ⟧(s, v), where δ ranges over the transitions out of qT that make progress (no self-loops)
R′(ρ) = max over t with q_t = qT of R″(qT)(s_t, v_t)
Higher reward for states farther from the start (the d_{qT} term)
Prefer runs that get close to satisfying some predicate on transitions that make progress (the max term)
§ d_q: length of the longest path from q0 to q without using self-loops
§ C_l: lower bound on the possible reward in any final state
§ C_u: upper bound on the third term, for all q
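A sketch of this shaped reward in code. Two readings here are assumptions on my part: the reward of a run ending in a final monitor state is taken to be its final register value, and the offset constant D is taken to be the maximum d_q over monitor states, so that every non-final run scores below any final one.

```python
def shaped_reward(run, d, transitions, final_states, C_l, C_u, D):
    """run = [(q_0, s_0, v_0), ..., (q_T, s_T, v_T)];
    d[q] = length of the longest self-loop-free path from q_0 to q;
    transitions are (source, guard, update, target) tuples as in the monitor sketch."""
    q_T = run[-1][0]
    if q_T in final_states:
        return run[-1][2]              # assumption: final-state reward read off the register

    def r2(s, v):                      # R''(q_T)(s, v)
        progress = max(guard(s) for (src, guard, _, dst) in transitions
                       if src == q_T and dst != q_T)   # non-self-loop transitions out of q_T
        return C_l + 2 * C_u * (d[q_T] - D) + progress

    # Maximize over the steps at which the run sat in its last monitor state q_T.
    return max(r2(s, v) for (q, s, v) in run if q == q_T)
```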
Experiments
• Implemented our approach in a tool called SPECTRL (SPECifying Tasks for RL)
• Case study in the 2D navigation setting:
  o S = ℝ² and A = ℝ²
  o Transitions given by s_{t+1} = s_t + a_t + ε, where ε is a small Gaussian noise
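The dynamics above are a one-liner; a sketch of the corresponding rollout loop for this 2D setting follows (the noise scale and horizon are illustrative).

```python
import numpy as np

def step(s, a, rng, noise_std=0.05):
    """s_{t+1} = s_t + a_t + eps, with eps a small Gaussian noise."""
    return s + a + rng.normal(scale=noise_std, size=s.shape)

def rollout(policy, s0, horizon=100, seed=0):
    """Simulate the system under a policy for a finite horizon; returns the sequence of states."""
    rng = np.random.default_rng(seed)
    states = [np.asarray(s0, dtype=float)]
    for _ in range(horizon):
        states.append(step(states[-1], policy(states[-1]), rng))
    return states
```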
2D Navigation Tasks
(Figure: learning curves for different tasks, comparing SPECTRL, TLTL, and CCE)
2D Navigation Tasks
(Figure: sample complexity curve; y-axis: number of sample trajectories needed to learn, x-axis: number of nested goals)
Cartpole
(Figure: learning curve for cartpole)
Spec: go to the right and return to the start position without letting the pole fall
THANK YOU!