Efficient Reinforcement Learning with Hierarchies of Machines by Leveraging Internal Transitions
Aijun Bai* (UC Berkeley / Microsoft Research), Stuart Russell (UC Berkeley)
Outline
• Hierarchical RL with partial programs
• Deterministic internal transitions
• Results
Hierarchical RL with partial programs
[Parr & Russell, NIPS 97; Andre & Russell, NIPS 00, AAAI 02; Marthi et al., IJCAI 05]
• Agent = partial program + learning algorithm; it sends actions a to the environment and receives state and reward (s, r) in return
• The learning algorithm produces a completion of the partial program
• Hierarchically optimal for all terminating programs
Partial Program – an Example

    repeat forever:
        Choose({a1, a2, …})
Partial Program – an Example

    Navigate(destination):
        while not At(destination, CurrentState()):
            Choose({N, S, E, W})
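A minimal Python sketch of how such a partial program might look against a hypothetical HAM runtime; the `ham` handle and its methods choose, current_state, and execute are illustrative stand-ins for the Choose, CurrentState, and primitive-action primitives above, not the authors' actual API:

    def navigate(ham, destination):
        # `ham.choose` marks a learned choice point; `ham.execute` performs a
        # primitive (physical) action in the environment. Both are assumed
        # interfaces of a hypothetical HAM runtime.
        while not ham.current_state().at(destination):
            action = ham.choose(["N", "S", "E", "W"])
            ham.execute(action)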
Concurrent Partial Programs

    Top():
        for each p in Effectors():
            PlayKeep(p)

    PlayKeep(p):
        s ← CurrentState()
        while not Terminal(s):
            if BallKickable(s) then Choose({Pass(), Hold()})
            else if FastestToBall(s) then Intercept()
            else Choose({Stay(), Move()})

    Pass():
        KickTo(Choose(Effectors() \ {self}), Choose({slow, fast}))
    …
Technical development
• Decisions based on internal state
• Joint state ω = [s, m]: environment state + program state (cf. [Russell & Wefald 1989])
• MDP + partial program = SMDP over choice states in {ω}; learn Q(ω, c) for choices c (see the sketch below)
• Additive decomposition of value functions:
  – by subroutine structure [Dietterich 00; Andre & Russell 02]: Q is a sum of sub-Q functions per subroutine
  – across concurrent threads [Russell & Zimdars 03]: Q is a sum of sub-Q functions per thread, with a decomposed reward signal
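As a concrete reference for the SMDP view, here is a minimal tabular sketch of the Q update applied at choice points, assuming a reward accumulated (with discounting) over the τ environment steps between two consecutive choice points; the class and parameter names are illustrative, not the paper's implementation:

    from collections import defaultdict

    class HAMQ:
        """Tabular SMDP Q-learning over joint choice states omega = (s, m)."""

        def __init__(self, alpha=0.1, gamma=0.99):
            self.q = defaultdict(float)   # Q[(omega, c)]
            self.alpha = alpha            # learning rate
            self.gamma = gamma            # discount factor

        def value(self, omega, choices):
            # V(omega) = max_c Q(omega, c)
            return max(self.q[(omega, c)] for c in choices)

        def update(self, omega, c, reward, tau, omega_next, next_choices):
            # SMDP backup: `reward` is the discounted reward accumulated over
            # the tau environment steps between the two choice points.
            target = reward + (self.gamma ** tau) * self.value(omega_next, next_choices)
            self.q[(omega, c)] += self.alpha * (target - self.q[(omega, c)])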
Internal Transitions
(keepaway partial program repeated from the previous slide)
• Transitions between choice points with no physical action intervening, e.g., choosing Pass() inside PlayKeep(p) and then immediately facing the choice points inside Pass()
• Internal transitions take no (real) time and have zero reward
• Internal transitions are deterministic
Idea 1
• Use internal transitions to short-circuit the computation of Q values recursively whenever applicable
• If (s, m, c) → (s, m′) is an internal transition,
  then Q(s, m, c) = V(s, m′) = max_c′ Q(s, m′, c′)
• Cache internal transitions as ⟨s, m, c, m′⟩ tuples
• No Q-learning is needed for these choices (see the sketch below)
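A short Python sketch of the short-circuiting rule, assuming a cache keyed on exact (s, m, c) tuples, a learned Q table `q`, and a `choices(s, m)` helper that enumerates the available choices; all three are assumed interfaces rather than the paper's code:

    def q_value(s, m, c, q, choices, cache):
        # Return Q(s, m, c), recursing through cached internal transitions
        # (zero time, zero reward, environment state unchanged) instead of
        # relying on learned entries for them.
        if (s, m, c) in cache:
            m_next = cache[(s, m, c)]            # deterministic successor machine state
            return max(q_value(s, m_next, c2, q, choices, cache)
                       for c2 in choices(s, m_next))
        return q[(s, m, c)]                      # otherwise, use the learned table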
Idea 2
• Identify the weakest precondition P(s) for the internal transition to occur (cf. EBL, chunking)
• Cache internal transitions as ⟨P, m, c, m′⟩ tuples
• Cache size is independent of |S| and roughly proportional to the size of the partial program's call graph
The HAMQ-INT Algorithm
• Track the set of predicates evaluated since the last choice point
• When a transition qualifies as internal (τ = 0), save an abstracted internal-transition rule in a dictionary ρ
• Use the saved rules to short-circuit the computation of Q values recursively whenever possible (see the sketch below)
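A rough sketch of the rule bookkeeping, assuming the dictionary ρ is keyed on the values of the predicates evaluated since the last choice point (an approximation of the weakest precondition P) together with (m, c); the data structures and the `evaluate` helper are assumptions, not the authors' exact implementation:

    def record_rule(rho, predicate_values, m, c, m_next, tau):
        # After reaching the next choice point, store an abstracted
        # internal-transition rule if no real time elapsed (tau == 0).
        # predicate_values: frozenset of (predicate, value) pairs tracked
        # since the last choice point.
        if tau == 0:
            rho[(predicate_values, m, c)] = m_next

    def applicable_rule(rho, s, m, c, evaluate):
        # Return m' for the first cached rule whose precondition holds in s,
        # if any; `evaluate(predicate, s)` checks a single predicate in state s.
        for (precond, m_r, c_r), m_next in rho.items():
            if (m_r, c_r) == (m, c) and all(evaluate(p, s) == v for p, v in precond):
                return m_next
        return None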
Experimental Results on Taxi
3 vs. 2 Keepaway Comparisons
• Option (Stone, 2005):
  – Each keeper learns separately
  – Learns a policy over Hold() and Pass(k, v) when the ball is kickable; otherwise follows a fixed policy: Intercept() if fastest to the ball, GetOpen() otherwise
  – GetOpen() is manually programmed for Option
• Concurrent-Option: concurrent version of Option; one global Q function is learned
• Random: randomized version of Option
• Concurrent-HAMQ: learns its own version of GetOpen() by calling Stay() and Move(d, v)
• Concurrent-HAMQ-INT
Experimental Results on Keepaway
Before and After: initial policy vs. converged policy
Summary
• HAMQ-INT algorithm:
  – Automatically discovers internal transitions
  – Takes advantage of internal transitions for efficient learning
  – Significantly outperforms the state of the art on Taxi and RoboCup Keepaway
• Future work:
  – Scale up to the full RoboCup task
  – More general integration of model-based and model-free reinforcement learning
  – More flexible forms of partial programs (e.g., temporal logic)