A Composable Specification Language for Reinforcement Learning Tasks
Kishor Jothimurugan, Rajeev Alur, Osbert Bastani
Control System
• Continuous states and actions
• System can be probabilistic
• Discrete-time system
• Finite time horizon T
Controller: maps a state s ∈ S to a control input a ∈ A
S = set of system states
A = set of control inputs
Reinforcement Learning
• Use a neural network (the controller) to map states to actions
• Design a reward function R mapping runs to rewards
• Learn the network parameters that optimize the expected reward
S = set of system states
A = set of control inputs
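As a rough sketch of this setup (not the authors' implementation), the snippet below builds a tiny neural-network policy that maps a continuous state to a continuous action and estimates the learning objective, the expected reward of sampled runs, by Monte Carlo. The `simulate` and `R` arguments are hypothetical placeholders.

```python
import numpy as np

def make_policy(state_dim, action_dim, hidden=32, seed=0):
    """A tiny two-layer network mapping a system state s to a control input a."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.1, size=(hidden, state_dim))
    W2 = rng.normal(scale=0.1, size=(action_dim, hidden))
    return lambda s: W2 @ np.tanh(W1 @ s)

def expected_reward(policy, simulate, R, n_rollouts=100):
    """Monte Carlo estimate of E[R(run)], the quantity the parameters are trained to maximize."""
    return float(np.mean([R(simulate(policy)) for _ in range(n_rollouts)]))
```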
Reward Functions
• Too low-level compared to a logical specification
• No obvious way to compose rewards
R1: reward function for "Reach q"
R2: reward function for "Reach p"
Reward function for "Reach q and then Reach p"?
Need to generate a reward function from a given logical specification
Need for Memory
• Specification: Reach q, then Reach p, then Reach r
• The controller maps states to actions
• The action at p depends on the history of the run
Solution: augment the state with an additional component indicating whether q has already been visited
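One way to realize this extra memory (a sketch under assumed goal locations, not the tool's construction): keep a flag recording whether q has been visited, so a memoryless policy over the augmented state can act differently before and after reaching q.

```python
import numpy as np

# Illustrative goal locations and the radius at which a goal counts as reached.
GOAL_Q, GOAL_P = np.array([5.0, 0.0]), np.array([0.0, 5.0])
REACH_RADIUS = 1.0

def update_memory(state, visited_q):
    """Extra state component: has q already been visited along this run?"""
    return visited_q or np.linalg.norm(state - GOAL_Q) < REACH_RADIUS

def policy(state, visited_q):
    """A memoryless policy over the augmented state behaves differently before/after visiting q."""
    target = GOAL_P if visited_q else GOAL_Q
    return target - state  # move toward the current target
```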
Need to generate a reward function from a given logical specification
Need to automatically infer the additional state components from the specification
Our Framework
• System: MDP (S, A, P, T, s0), where P(s, a, s′) = Pr(s′ | s, a) is given as a black-box forward simulator
• Specification φ given in our task specification language
• Synthesize a control policy π* such that π* ∈ argmax_π Pr[ρ ⊨ φ], where ρ is the run of the system under π
Our Framework
(Pipeline diagram: the specification is compiled into a nondeterministic task monitor, which yields a monitor automaton and a reward function; the monitor automaton is composed with the system into a product MDP, and a reinforcement learning algorithm produces the control policy.)
Task Specification Language
φ ::= achieve b | φ₁ ensuring b | φ₁; φ₂ | φ₁ or φ₂
• Example base predicates:
  o reach_q is satisfied if and only if the distance to q is less than 1
  o avoid_O is satisfied if and only if there is a positive distance to O
• Specification for the navigation example: achieve reach_q; achieve reach_p ensuring avoid_O
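A minimal sketch of how the four constructors of this grammar could be represented as an abstract syntax tree; the class and predicate names are illustrative, not the SPECTRL API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Achieve:          # achieve b
    b: object

@dataclass(frozen=True)
class Ensuring:         # phi ensuring b
    phi: object
    b: object

@dataclass(frozen=True)
class Seq:              # phi1 ; phi2
    phi1: object
    phi2: object

@dataclass(frozen=True)
class Or:               # phi1 or phi2
    phi1: object
    phi2: object

# One possible parse of the navigation spec on this slide:
# achieve reach_q ; (achieve reach_p ensuring avoid_O)
nav_spec = Seq(Achieve("reach_q"), Ensuring(Achieve("reach_p"), "avoid_O"))
```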
Quantitative Semantics
• Assume each base predicate b is associated with a quantitative semantics ⟦b⟧ : S → ℝ such that s ⊨ b if and only if ⟦b⟧(s) > 0
  o ⟦reach_q⟧(s) = 1 − dist(s, q)
  o ⟦avoid_O⟧(s) = dist(s, O)
• Extend to positive Boolean combinations by:
  o ⟦b₁ ∨ b₂⟧ = max(⟦b₁⟧, ⟦b₂⟧)
  o ⟦b₁ ∧ b₂⟧ = min(⟦b₁⟧, ⟦b₂⟧)
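These definitions translate directly into code; here is a small sketch with the two distance-based predicates from this slide (the goal and obstacle positions passed in would be illustrative).

```python
import numpy as np

def reach(q):
    """⟦reach_q⟧(s) = 1 - dist(s, q); positive exactly when s is within distance 1 of q."""
    return lambda s: 1.0 - np.linalg.norm(s - q)

def avoid(O):
    """⟦avoid_O⟧(s) = dist(s, O); positive exactly when s keeps a positive distance from O."""
    return lambda s: np.linalg.norm(s - O)

def sem_or(f, g):   # ⟦b1 ∨ b2⟧ = max(⟦b1⟧, ⟦b2⟧)
    return lambda s: max(f(s), g(s))

def sem_and(f, g):  # ⟦b1 ∧ b2⟧ = min(⟦b1⟧, ⟦b2⟧)
    return lambda s: min(f(s), g(s))

def satisfied(f, s):
    """s ⊨ b iff ⟦b⟧(s) > 0."""
    return f(s) > 0
```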
Task Monitor
• Finite state machine
• Registers that store quantitative information
• Compilation similar to the NFA construction from regular expressions
(Figure: task monitor for φ = achieve b)
Task Monitor
(Figure: task monitor for achieve reach_q; achieve reach_p ensuring avoid_O, annotated with its registers v, transition predicates, and register updates)
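To make the monitor structure concrete, here is a hand-written sketch of such a monitor for the navigation spec: monitor states, one register recording how well avoid_O has been maintained, and transitions guarded by predicates with register updates. It illustrates the shape of the object on these two slides; it is not the output of the tool's compilation procedure, and the goal/obstacle locations are assumed.

```python
import numpy as np

# Assumed goal and obstacle locations for the navigation spec.
Q, P, O = np.array([5.0, 0.0]), np.array([0.0, 5.0]), np.array([2.5, 2.5])
reach_q = lambda s: 1.0 - np.linalg.norm(s - Q)
reach_p = lambda s: 1.0 - np.linalg.norm(s - P)
avoid_O = lambda s: np.linalg.norm(s - O)
always  = lambda s: 1.0

# Transitions: (source, guard predicate, register update, target).
# The single register v tracks the worst (minimum) value of avoid_O seen so far.
TRANSITIONS = [
    ("start", reach_q, lambda s, v: v,                  "got_q"),
    ("start", always,  lambda s, v: min(v, avoid_O(s)), "start"),   # self-loop
    ("got_q", reach_p, lambda s, v: v,                  "done"),
    ("got_q", always,  lambda s, v: min(v, avoid_O(s)), "got_q"),   # self-loop
]
FINAL_STATES = {"done"}

def monitor_step(q, v, s):
    """Take one enabled transition out of monitor state q (first match resolves the nondeterminism)."""
    for src, guard, update, dst in TRANSITIONS:
        if src == q and guard(s) > 0:
            return dst, update(s, v)
    return q, v
```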
Extended Policy
(Figure: the monitor state q selects a neural network; the network for state q maps the system state and register values to a system action and the choice of next monitor transition.)
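A sketch of the extended policy: one small network per monitor state, with the current monitor state selecting which network produces the action (register values could be appended to the input). Shapes and initialization are illustrative.

```python
import numpy as np

class ExtendedPolicy:
    """One network per monitor state q; the monitor state picks which network acts."""
    def __init__(self, monitor_states, state_dim, action_dim, hidden=32, seed=0):
        rng = np.random.default_rng(seed)
        self.nets = {
            q: (rng.normal(scale=0.1, size=(hidden, state_dim)),
                rng.normal(scale=0.1, size=(action_dim, hidden)))
            for q in monitor_states
        }

    def act(self, q, s):
        W1, W2 = self.nets[q]          # network associated with the current monitor state
        return W2 @ np.tanh(W1 @ s)
```

At each step one would call act(q, s) for the system action and then advance the monitor (as in the previous sketch) to obtain the next monitor state and register values.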
Assigning Rewards
Given a sequence of extended system states ρ = (q0, s0, v0) → ⋯ → (qT, sT, vT), what should its reward be?
• Case 1 (qT is a final state): reward is given by the monitor
• Case 2 (qT is not a final state): not all tasks have been completed
  o Suggestion 1: R(ρ) = −∞
  o Suggestion 2: find a reward function R′ that preserves the ordering of runs w.r.t. R: R(ρ) > R(ρ′) implies R′(ρ) > R′(ρ′)
Reward Shaping
Given ρ = (q0, s0, v0) → ⋯ → (qT, sT, vT) with qT non-final:
R″(qT)(s, v) = C_l + 2·C_u·(d_{qT} − D) + max_δ ⟦b_δ⟧(s, v), where δ ranges over the transitions out of qT that make progress (no self-loops)
R′(ρ) = max over t with q_t = qT of R″(qT)(s_t, v_t)
Higher reward for states farther from the start (the d_{qT} term)
Prefer runs that get close to satisfying some predicate on transitions that make progress (the max term)
§ d_q: length of the longest path from q0 to q without using self-loops
§ C_l: lower bound on the possible reward in any final state
§ C_u: upper bound on the third term, for all q
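A sketch of this shaped reward in code. Two readings here are assumptions on my part: the reward of a run ending in a final monitor state is taken to be its final register value, and the offset constant D is taken to be the maximum d_q over monitor states, so that every non-final run scores below any final one.

```python
def shaped_reward(run, d, transitions, final_states, C_l, C_u, D):
    """run = [(q_0, s_0, v_0), ..., (q_T, s_T, v_T)];
    d[q] = length of the longest self-loop-free path from q_0 to q;
    transitions are (source, guard, update, target) tuples as in the monitor sketch."""
    q_T = run[-1][0]
    if q_T in final_states:
        return run[-1][2]              # assumption: final-state reward read off the register

    def r2(s, v):                      # R''(q_T)(s, v)
        progress = max(guard(s) for (src, guard, _, dst) in transitions
                       if src == q_T and dst != q_T)   # non-self-loop transitions out of q_T
        return C_l + 2 * C_u * (d[q_T] - D) + progress

    # Maximize over the steps at which the run sat in its last monitor state q_T.
    return max(r2(s, v) for (q, s, v) in run if q == q_T)
```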
Experiments
• Implemented our approach in a tool called SPECTRL (SPECifying Tasks for RL)
• Case study in the 2D navigation setting:
  o S = ℝ² and A = ℝ²
  o Transitions given by s_{t+1} = s_t + a_t + ε, where ε is a small Gaussian noise
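The dynamics above are a one-liner; a sketch of the corresponding rollout loop for this 2D setting follows (the noise scale and horizon are illustrative).

```python
import numpy as np

def step(s, a, rng, noise_std=0.05):
    """s_{t+1} = s_t + a_t + eps, with eps a small Gaussian noise."""
    return s + a + rng.normal(scale=noise_std, size=s.shape)

def rollout(policy, s0, horizon=100, seed=0):
    """Simulate the system under a policy for a finite horizon; returns the sequence of states."""
    rng = np.random.default_rng(seed)
    states = [np.asarray(s0, dtype=float)]
    for _ in range(horizon):
        states.append(step(states[-1], policy(states[-1]), rng))
    return states
```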
2D Navigation Tasks
(Figure: learning curves for different tasks, comparing SPECTRL, TLTL, and CCE)
2D Navigation Tasks
(Figure: sample complexity curve; y-axis: number of sample trajectories needed to learn, x-axis: number of nested goals)
Cartpole
(Figure: learning curve for cartpole)
Spec: go to the right and return to the start position without letting the pole fall
THANK YOU!