Eligibility Traces
• The policy evaluation problem: given a (in general stochastic) policy π, estimate
    V^π(i) = E_π{ r_0 + γ r_1 + γ² r_2 + γ³ r_3 + … | s_0 = i }
  from multiple experience trajectories generated by following policy π repeatedly from state i
• A single trajectory:  r_0  r_1  r_2  r_3  …  r_k  r_{k+1}  …
TD(λ)
  r_0  r_1  r_2  r_3  …  r_k  r_{k+1}  …
  0-step return:  e_0 = r_0 + γ V(s_1)
  V_new(s_0) = V_old(s_0) + α [ r_0 + γ V_old(s_1) − V_old(s_0) ]     (the bracketed term is the temporal difference)
  equivalently:  V_new(s_0) = V_old(s_0) + α [ e_0 − V_old(s_0) ]     this is TD(0)
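A minimal tabular sketch of the TD(0) update above; the environment interface (env.reset, env.step returning integer states) and the fixed policy are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def td0_evaluate(env, policy, n_states, episodes=1000, alpha=0.1, gamma=0.9):
    """Tabular TD(0) policy evaluation: V(s) <- V(s) + alpha*[r + gamma*V(s') - V(s)]."""
    V = np.zeros(n_states)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)                      # follow the fixed policy pi
            s_next, r, done = env.step(a)      # one-step transition and reward
            target = r + (0.0 if done else gamma * V[s_next])
            V[s] += alpha * (target - V[s])    # temporal-difference update
            s = s_next
    return V
```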
TD(λ)
  r_0  r_1  r_2  r_3  …  r_k  r_{k+1}  …
  0-step return:  e_0 = r_0 + γ V(s_1)
  1-step return:  e_1 = r_0 + γ r_1 + γ² V(s_2)
  V_new(s_0) = V_old(s_0) + α [ e_1 − V_old(s_0) ]
             = V_old(s_0) + α [ r_0 + γ r_1 + γ² V_old(s_2) − V_old(s_0) ]
TD(λ)
  r_0  r_1  r_2  r_3  …  r_k  r_{k+1}  …
  weight w_0 on  e_0 = r_0 + γ V(s_1)
  weight w_1 on  e_1 = r_0 + γ r_1 + γ² V(s_2)
  weight w_2 on  e_2 = r_0 + γ r_1 + γ² r_2 + γ³ V(s_3)
  …
  weight w_{k−1} on  e_{k−1} = r_0 + γ r_1 + γ² r_2 + … + γ^{k−1} r_{k−1} + γ^k V(s_k)
  weight w_∞ on  e_∞ = r_0 + γ r_1 + γ² r_2 + γ³ r_3 + … + γ^k r_k + γ^{k+1} r_{k+1} + …
  Combined update:  V_new(s_0) = V_old(s_0) + α [ Σ_k w_k e_k − V_old(s_0) ]
TD(λ)
  r_0  r_1  r_2  r_3  …  r_k  r_{k+1}  …
  weight (1−λ) on  e_0 = r_0 + γ V(s_1)
  weight (1−λ)λ on  e_1 = r_0 + γ r_1 + γ² V(s_2)
  weight (1−λ)λ² on  e_2 = r_0 + γ r_1 + γ² r_2 + γ³ V(s_3)
  …
  weight (1−λ)λ^{k−1} on  e_{k−1} = r_0 + γ r_1 + γ² r_2 + … + γ^{k−1} r_{k−1} + γ^k V(s_k)
  V_new(s_0) = V_old(s_0) + α [ Σ_k (1−λ) λ^k e_k − V_old(s_0) ]
  0 ≤ λ ≤ 1 interpolates between 1-step TD (λ = 0) and Monte Carlo (λ = 1)
TD(λ)
  r_0  r_1  r_2  r_3  …  r_k  r_{k+1}  …
  Temporal differences:
    δ_0 = r_0 + γ V(s_1) − V(s_0)
    δ_1 = r_1 + γ V(s_2) − V(s_1)
    δ_2 = r_2 + γ V(s_3) − V(s_2)
    …
    δ_k = r_k + γ V(s_{k+1}) − V(s_k)
  The λ-return update can be rewritten as a sum of discounted temporal differences:
    V_new(s_0) = V_old(s_0) + α Σ_k (γλ)^k δ_k
  The weights (γλ)^k are the eligibility trace; converges w.p. 1 (Jaakkola, Jordan & Singh)
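A tabular TD(λ) sketch using an accumulating eligibility-trace vector, which credits each temporal difference δ to all recently visited states, as on the slide above; the environment interface and the fixed policy are assumptions for illustration.

```python
import numpy as np

def td_lambda_evaluate(env, policy, n_states, episodes=1000,
                       alpha=0.1, gamma=0.9, lam=0.8):
    """Tabular TD(lambda) policy evaluation with accumulating eligibility traces."""
    V = np.zeros(n_states)
    for _ in range(episodes):
        z = np.zeros(n_states)                 # eligibility trace, reset each episode
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            delta = r + (0.0 if done else gamma * V[s_next]) - V[s]   # temporal difference
            z *= gamma * lam                   # decay all traces by (gamma * lambda)
            z[s] += 1.0                        # bump the trace of the current state
            V += alpha * delta * z             # credit delta to all eligible states
            s = s_next
    return V
```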
Bias-Variance Tradeoff
  r_0  r_1  r_2  r_3  …  r_k  r_{k+1}  …
  e_0 = r_0 + γ V(s_1)
  e_1 = r_0 + γ r_1 + γ² V(s_2)
  e_2 = r_0 + γ r_1 + γ² r_2 + γ³ V(s_3)
  …
  e_{k−1} = r_0 + γ r_1 + γ² r_2 + … + γ^{k−1} r_{k−1} + γ^k V(s_k)
  e_∞ = r_0 + γ r_1 + γ² r_2 + γ³ r_3 + … + γ^k r_k + γ^{k+1} r_{k+1} + …
  Going down the list: decreasing bias, increasing variance
TD(λ)
Bias-Variance Tradeoff (Kearns & Singh, 2000)
  Constant step-size:  error_t ≤ a_λ (1 − b_λ^t) / (1 − b_λ)
  As t → ∞, the error asymptotes at a_λ / (1 − b_λ), an increasing function of λ
  The rate of convergence is governed by b_λ^t (exponential); b_λ is a decreasing function of λ
  Intuition: start with a large λ and then decrease it over time
Near-Optimal Reinforcement Learning in Polynomial Time (solving the exploration versus exploitation dilemma)
Setting
• Unknown MDP M
• At any step: explore or exploit
• Finite-time analysis
• Goal: develop an algorithm such that an agent following it achieves, in time polynomial in the complexity of the MDP, nearly the same payoff per time step as an agent that knew the MDP to begin with
• Need to solve exploration versus exploitation
• The algorithm is called E³ (Explicit Explore or Exploit)
Preliminaries 1
• Actual return:  (R_1 + R_2 + … + R_T) / T
• Let T* denote the (unknown) mixing time of the MDP
• One key insight: even the optimal policy needs time O(T*) to achieve an actual return that is near-optimal
• E³ has the property that it always compares favorably with the best policy among the policies that mix in the time for which the algorithm has been run
The Algorithm (informal)
• Do "balanced wandering" until some state becomes known
• Repeat forever:
  - Construct the known-state MDP
  - Compute the optimal exploitation policy in the known-state MDP
  - If the return of that policy is near-optimal, execute it
  - Otherwise, compute the optimal exploration policy in the known-state MDP and execute it; do balanced wandering from unknown states
(A sketch of this loop appears below.)
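A simplified, runnable sketch of the loop above, not the published E³ algorithm with its formal guarantees: it keeps empirical counts, declares a state "known" after every action there has been tried m_known times, and plans in the empirical known-state MDP to either exploit or explore. The environment interface (env.reset/env.step over integer states) and the v_star input used in the near-optimality test are assumptions for illustration.

```python
import numpy as np

def e3(env, n_states, n_actions, m_known=30, horizon=50, gamma=0.95,
       near_opt_fraction=0.9, v_star=1.0, total_steps=20000):
    """Simplified sketch of the E^3 loop: balanced wandering in unknown states,
    then planning in the empirical known-state MDP to either exploit or explore."""
    counts = np.zeros((n_states, n_actions))            # visit counts N(s, a)
    trans  = np.zeros((n_states, n_actions, n_states))  # transition counts
    rew    = np.zeros((n_states, n_actions))            # summed observed rewards

    def plan(R, known):
        """Finite-horizon value iteration over the empirical known-state MDP;
        unknown states are treated as absorbing with value zero."""
        P = trans / np.maximum(counts, 1)[:, :, None]
        is_known = np.array([i in known for i in range(n_states)])
        V = np.zeros(n_states)
        Q = R.copy()
        for _ in range(horizon):
            Q = R + gamma * (P @ V)
            V = np.where(is_known, Q.max(axis=1), 0.0)
        return Q.argmax(axis=1), V

    s = env.reset()
    for _ in range(total_steps):
        known = {i for i in range(n_states) if counts[i].min() >= m_known}
        if s not in known:
            a = int(counts[s].argmin())                 # balanced wandering: least-tried action
        else:
            exploit_pi, exploit_V = plan(rew / np.maximum(counts, 1), known)
            if exploit_V[s] >= near_opt_fraction * v_star:
                a = int(exploit_pi[s])                  # attempted exploitation
            else:
                unknown = np.array([i not in known for i in range(n_states)], dtype=float)
                escape = (trans * unknown).sum(axis=2) / np.maximum(counts, 1)
                explore_pi, _ = plan(escape, known)     # attempted exploration
                a = int(explore_pi[s])
        s_next, r, done = env.step(a)
        counts[s, a] += 1
        trans[s, a, s_next] += 1
        rew[s, a] += r
        s = env.reset() if done else s_next
    return counts
```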
M̂ : the estimated known-state MDP
M : the true known-state MDP
Main Result
• A new algorithm, E³, taking inputs ε and δ, such that for any V* and T* holding in the unknown MDP:
  - The total number of actions and computation time required by E³ are poly(1/ε, 1/δ, T*, N)
  - Performance guarantee: with probability at least (1 − δ), the amortized return of E³ so far will exceed (1 − ε)V*
Function Approximation and Reinforcement Learning
General Idea
  (s, a) → Function Approximator → Q(s, a), trained from targets or errors
  The function approximator could be:
  • a table
  • a backprop neural network, a radial-basis-function network, or tile coding (CMAC), trained by gradient-descent methods
  • nearest neighbor / memory-based
  • a decision tree
Neural Networks as FAs
  Q(s, a) = f(s, a, w),  where w is the weight vector
  Standard backprop gradient, e.g., gradient-descent Sarsa:
    w ← w + α [ r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t) ] ∇_w f(s_t, a_t, w)
  where r_{t+1} + γ Q(s_{t+1}, a_{t+1}) is the target value and Q(s_t, a_t) is the estimated value
Linear in the Parameters FAs
  V̂(s) = θᵀ φ_s
  Each state s is represented by a feature vector φ_s
  Or represent a state–action pair with φ_{s,a} and approximate action values:
    Q^π(s, a) = E_π{ r_{t+1} + γ r_{t+2} + γ² r_{t+3} + … | s_t = s, a_t = a }
    Q̂(s, a) = θᵀ φ_{s,a}
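A sketch of the gradient-descent Sarsa update from the previous slide in the linear case, where the gradient ∇_w Q(s, a, w) is simply the feature vector φ_{s,a}; the feature function and the ε-greedy action selection are illustrative assumptions.

```python
import numpy as np

def linear_sarsa(env, features, n_actions, n_features, episodes=500,
                 alpha=0.05, gamma=0.99, epsilon=0.1):
    """Semi-gradient Sarsa with a linear action-value function Q(s,a) = theta . phi(s,a)."""
    theta = np.zeros(n_features)
    q = lambda s, a: theta @ features(s, a)          # linear approximation of Q(s, a)

    def eps_greedy(s):
        if np.random.rand() < epsilon:
            return np.random.randint(n_actions)
        return int(np.argmax([q(s, a) for a in range(n_actions)]))

    for _ in range(episodes):
        s = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = eps_greedy(s_next)
            target = r if done else r + gamma * q(s_next, a_next)
            # Gradient of the linear Q with respect to theta is just the feature vector.
            theta += alpha * (target - q(s, a)) * features(s, a)
            s, a = s_next, a_next
    return theta
```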
Sparse Coarse Coding
  fixed expansive re-representation → features → linear last layer
  Coarse: large receptive fields
  Sparse: few features are active at any one time
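A minimal one-dimensional tile-coding example of a sparse, coarse representation in the spirit of the slide above: several overlapping tilings with large tiles, and exactly one active feature per tiling. The parameter values (number of tilings, tiles per tiling, input range) are illustrative assumptions.

```python
import numpy as np

def tile_code(x, lo=0.0, hi=1.0, n_tilings=8, tiles_per_tiling=10):
    """Return a sparse binary feature vector for scalar x in [lo, hi].
    Each tiling is offset slightly, so x activates one tile per tiling."""
    features = np.zeros(n_tilings * tiles_per_tiling)
    tile_width = (hi - lo) / tiles_per_tiling
    for t in range(n_tilings):
        offset = (t / n_tilings) * tile_width            # shift each tiling a fraction of a tile
        idx = int((x - lo + offset) / tile_width)
        idx = min(idx, tiles_per_tiling - 1)             # clamp at the upper boundary
        features[t * tiles_per_tiling + idx] = 1.0       # exactly one active tile per tiling
    return features

# Example: only n_tilings of the 80 features are active (sparse),
# and nearby x values share many active tiles (coarse generalization).
print(tile_code(0.37).nonzero()[0])
```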
Shaping Generalization in Coarse Coding
FAs & RL
• Linear FA: divergence can happen
  Nonlinear neural networks: the theory is not well developed
  Non-parametric, e.g., nearest-neighbor: provably not divergent; bounds on error
  Everyone uses their favorite FA… little theoretical guidance yet!
• Does FA really beat the curse of dimensionality?
  Probably; with FA, computation seems to scale with the complexity of the solution (crinkliness of the value function) and how hard it is to find it
• Empirically it works
  - though many folks have a hard time making it work
  - no off-the-shelf FA+RL yet
by Andrew Ng and colleagues
Dynamic Channel Assignment in Cellular Telephones
Dynamic Channel Assignment
  Channel assignment in cellular telephone systems: the agent decides what (if any) conflict-free channel to assign to a caller
  State: current assignments
  Actions: feasible assignments
  Reward: +1 per call per second
  Learned better dynamic assignment policies than the competition
  Singh & Bertsekas (NIPS)
Run Cellphone Demo (http://www.eecs.umich.edu/~baveja/Demo.html)
After MDPs... • Great success with MDPs • What next? • Rethinking Actions, States, Rewards • Options instead of actions • POMDPs
Rethinking Action (Hierarchical RL)
• Options (Precup, Sutton, Singh)
• MAXQ (Dietterich)
• HAMs (Parr & Russell)
Related Work
  Spanning reinforcement learning and MDP planning, "classical" AI (including work on macro-operators and qualitative reasoning), and robotics and control engineering:
  Fikes, Hart & Nilsson (1972); Newell & Simon (1972); Sacerdoti (1974, 1977); Kuipers (1979); de Kleer & Brown (1984); Korf (1985); Brooks (1986); Laird et al. (1986); Minton (1988); Iba (1989); Drescher (1991); Maes (1991); Kibler & Ruby (1992); Koza & Rice (1992); Mahadevan & Connell (1992); Singh (1992); Brockett (1993); Dayan & Hinton (1993); Grossman et al. (1993); Kaelbling (1993); Lin (1993); Chrisman (1994); Dejong (1994); Dorigo & Colombetti (1994); Levinson & Fuchs (1994); Bradtke & Duff (1995); Ring (1995); Sutton (1995); Thrun & Schwartz (1995); Asada et al. (1996); Say & Selahatin (1996); Uchibe et al. (1996); Boutilier et al. (1997); Brafman & Moshe (1997); Dietterich (1997); Huber & Grupen (1997); Kalmar et al. (1997); Mataric (1997); Precup, Sutton & Singh (1997); Sastry (1997); Toth et al. (1997); Wiering & Schmidhuber (1997); Drummond (1998); Hauskrecht et al. (1998); McGovern & Sutton (1998); Meuleau et al. (1998); Parr & Russell (1998); Ryan & Pendrith (1998)
Abstraction in Learning and Planning
• A long-standing, key problem in AI!
• How can we give abstract knowledge a clear semantics? e.g. "I could go to the library"
• How can different levels of abstraction be related?
  - spatial: states
  - temporal: time scales
• How can we handle stochastic, closed-loop, temporally extended courses of action?
• Use RL/MDPs to provide a theoretical foundation
Options
  A generalization of actions to include courses of action
  An option is a triple  o = ⟨I, π, β⟩
  • I ⊆ S is the set of states in which o may be started
  • π : S × A → [0, 1] is the policy followed during o
  • β : S → [0, 1] is the probability of terminating in each state
  Option execution is assumed to be call-and-return
  Example: docking
    I : all states in which the charger is in sight
    π : hand-crafted controller
    β : terminate when docked or the charger is not visible
  Options can take a variable number of steps
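A minimal sketch of the option triple ⟨I, π, β⟩ and call-and-return execution as defined above; the environment interface and the use of the random module are illustrative assumptions.

```python
import random
from dataclasses import dataclass
from typing import Callable, Set

@dataclass
class Option:
    initiation: Set[int]                      # I: states where the option may start
    policy: Callable[[int], int]              # pi: state -> action while the option runs
    termination: Callable[[int], float]       # beta: state -> probability of terminating

    def can_start(self, s: int) -> bool:
        return s in self.initiation

def run_option(env, option, s):
    """Call-and-return execution: follow the option's policy until beta says to stop.
    Returns the accumulated (undiscounted) reward, the final state, and the duration."""
    total_reward, k, done = 0.0, 0, False
    while not done:
        a = option.policy(s)
        s, r, done = env.step(a)
        total_reward += r
        k += 1
        if done or random.random() < option.termination(s):
            break
    return total_reward, s, k
```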
Rooms Example
  4 rooms, 4 hallways
  4 unreliable primitive actions: up, down, left, right (fail 33% of the time)
  8 multi-step options (one to each of a room's 2 hallways)
  Given a goal location, quickly plan the shortest route
  Goal states are given a terminal value of 1; all rewards zero; γ = .9
Options define a Semi-Markov Decision Process (SMDP)
  MDP: discrete time, homogeneous discount
  SMDP: continuous time, discrete events, interval-dependent discount
  Options over an MDP: discrete time, overlaid discrete events, interval-dependent discount
  That is, a discrete-time SMDP overlaid on an MDP; it can be analyzed at either level
MDP + Options = SMDP Theorem: For any MDP, and any set of options, the decision process that chooses among the options, executing each to termination, is an SMDP. Thus all Bellman equations and DP results extend for value functions over options and models of options (cf. SMDP theory).
What does the SMDP connection give us?
• Policies over options: µ : S × O → [0, 1]
• Value functions over options: V^µ(s), Q^µ(s, o), V*_O(s), Q*_O(s, o)
• Learning methods: Bradtke & Duff (1995), Parr (1998)
• Models of options
• Planning methods: e.g. value iteration, policy iteration, Dyna…
• A coherent theory of learning and planning with courses of action at variable time scales, yet at the same level
A theoretical foundation for what we really need! But the most interesting issues are beyond SMDPs…
Value Functions for Options
  Define value functions for options, similar to the MDP case:
    V^µ(s) = E{ r_{t+1} + γ r_{t+2} + … | E(µ, s, t) }
    Q^µ(s, o) = E{ r_{t+1} + γ r_{t+2} + … | E(oµ, s, t) }
  Now consider policies µ ∈ Π(O) restricted to choose only from options in O:
    V*_O(s) = max_{µ ∈ Π(O)} V^µ(s)
    Q*_O(s, o) = max_{µ ∈ Π(O)} Q^µ(s, o)
Models of Options
  Knowing how an option is executed is not enough for reasoning about it, or planning with it. We need information about its consequences.
  The model of the consequences of starting option o in state s has:
  • a reward part:
      r_s^o = E{ r_1 + γ r_2 + … + γ^{k−1} r_k | s_0 = s, o taken in s_0, lasts k steps }
  • a next-state part:
      p_{ss'}^o = E{ γ^k δ_{s_k s'} | s_0 = s, o taken in s_0, lasts k steps }
      where δ_{s_k s'} = 1 if s' = s_k is the termination state, 0 otherwise
  This form follows from SMDP theory. Such models can be used interchangeably with models of primitive actions in Bellman equations.
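A sketch of estimating an option's model (the reward part r_s^o and the discounted next-state part p_ss'^o defined above) by Monte Carlo rollouts, reusing the Option sketch from earlier; the environment's ability to be reset to a chosen state (env.reset_to) is an assumption for illustration.

```python
import numpy as np

def estimate_option_model(env, option, s, n_states, n_rollouts=1000, gamma=0.9):
    """Monte Carlo estimate of the option model from start state s:
    r_s_o      ~ E[ r_1 + gamma*r_2 + ... + gamma^(k-1)*r_k ]
    p_s_o[s']  ~ E[ gamma^k * 1{termination state == s'} ]."""
    r_s_o = 0.0
    p_s_o = np.zeros(n_states)
    for _ in range(n_rollouts):
        env.reset_to(s)                        # assumed: env can be reset to a chosen state
        state, discounted_reward, discount, done = s, 0.0, 1.0, False
        while not done:
            a = option.policy(state)
            state, r, done = env.step(a)
            discounted_reward += discount * r
            discount *= gamma                  # after k steps, discount == gamma^k
            if done or np.random.rand() < option.termination(state):
                break
        r_s_o += discounted_reward / n_rollouts
        p_s_o[state] += discount / n_rollouts  # gamma^k mass on the termination state
    return r_s_o, p_s_o
```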
Rooms Example
  4 rooms, 4 hallways
  4 unreliable primitive actions: up, down, left, right (fail 33% of the time)
  8 multi-step options (one to each of a room's 2 hallways)
  Given a goal location, quickly plan the shortest route
  Goal states are given a terminal value of 1; all rewards zero; γ = .9
Example: Synchronous Value Iteration Generalized to Options
  Initialize:  V_0(s) ← 0   ∀ s ∈ S
  Iterate:  V_{k+1}(s) ← max_{o ∈ O} [ r_s^o + Σ_{s' ∈ S} p_{ss'}^o V_k(s') ]   ∀ s ∈ S
  The algorithm converges to the optimal value function, given the options:
    lim_{k→∞} V_k = V*_O
  Once V*_O is computed, µ*_O is readily determined.
  If O = A, the algorithm reduces to conventional value iteration
  If A ⊆ O, then V*_O = V*
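A sketch of the option-value-iteration sweep above, assuming the option models have already been estimated (e.g., with the earlier rollout sketch) and are passed in as arrays; the array layout is an illustrative assumption.

```python
import numpy as np

def option_value_iteration(r_o, p_o, n_iterations=50):
    """Synchronous value iteration over options.
    r_o: array (n_options, n_states)            -- reward part r_s^o of each option model
    p_o: array (n_options, n_states, n_states)  -- discounted next-state part p_ss'^o
    Returns the optimal value function V*_O and a greedy option choice per state."""
    n_options, n_states = r_o.shape
    V = np.zeros(n_states)
    for _ in range(n_iterations):
        V = (r_o + p_o @ V).max(axis=0)        # Bellman backup over options (discount is inside p_o)
    Q = r_o + p_o @ V                          # greedy option choice from the final values
    return V, Q.argmax(axis=0)
```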
Rooms Example
  (figure) With cell-to-cell primitive actions, V(goal) = 1: values after iteration #0, #1, #2
  (figure) With room-to-room options, V(goal) = 1: values after iteration #0, #1, #2
Example with Goal ≠ Subgoal
  (figure) Both primitive actions and options: initial values, then iterations #1 through #5
What does the SMDP connection give us?
• Policies over options: µ : S × O → [0, 1]
• Value functions over options: V^µ(s), Q^µ(s, o), V*_O(s), Q*_O(s, o)
• Learning methods: Bradtke & Duff (1995), Parr (1998)
• Models of options
• Planning methods: e.g. value iteration, policy iteration, Dyna…
• A coherent theory of learning and planning with courses of action at variable time scales, yet at the same level
A theoretical foundation for what we really need! But the most interesting issues are beyond SMDPs…
Advantages of the Dual MDP/SMDP View
  At the SMDP level: compute value functions and policies over options, with the benefit of increased speed / flexibility
  At the MDP level: learn how to execute an option for achieving a given goal
  Between the MDP and SMDP levels: improve over existing options (e.g. by terminating early); learn about the effects of several options in parallel, without executing them to termination
Between MDPs and SMDPs
• Termination improvement: improving the value function by changing the termination conditions of options
• Intra-option learning: learning the values of options in parallel, without executing them to termination; learning the models of options in parallel, without executing them to termination (see the sketch after this list)
• Tasks and subgoals: learning the policies inside the options
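A sketch of one step of intra-option value learning in the spirit of the bullet above: every option consistent with the primitive action just taken is updated from the same transition, without being executed to termination. The array layout for Q and the restriction to options with deterministic Markov policies (reusing the earlier Option sketch) are assumptions for illustration.

```python
import numpy as np

def intra_option_update(Q, options, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One intra-option Q-learning step. Q is an array of shape (n_states, n_options).
    Every option whose (deterministic) policy would take action a in s is updated."""
    for o, opt in enumerate(options):
        if opt.policy(s) != a:
            continue                                   # transition not consistent with option o
        beta = opt.termination(s_next)
        # Value of s_next under option o: either continue with o, or terminate and re-choose greedily.
        u = (1.0 - beta) * Q[s_next, o] + beta * Q[s_next].max()
        Q[s, o] += alpha * (r + gamma * u - Q[s, o])
    return Q
```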
Termination Improvement
  Idea: we can do better by sometimes interrupting ongoing options, forcing them to terminate before β says to
  Theorem: For any policy over options µ : S × O → [0, 1], suppose we interrupt its options one or more times, when Q^µ(s, o) < Q^µ(s, µ(s)), where s is the state at that time and o is the ongoing option, to obtain µ' : S × O' → [0, 1]. Then µ' ≥ µ (it attains more or equal reward everywhere).
  Application: suppose we have determined Q*_O, and thus µ = µ*_O. Then µ' is guaranteed better than µ*_O and is available with no additional computation.
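A sketch of the interruption rule in the theorem above, layered on the earlier call-and-return Option sketch: the ongoing option is cut short whenever continuing it looks worse than re-choosing greedily. The Q-table layout (a numpy array indexed by state and option) and the greedy re-selection are illustrative assumptions.

```python
import random

def run_option_with_interruption(env, options, Q, o, s):
    """Execute option o from state s, but terminate early whenever
    Q[s, o] < max over o' of Q[s, o'] (continuing the option is no longer the best choice)."""
    opt = options[o]
    total_reward, done = 0.0, False
    while not done:
        a = opt.policy(s)
        s, r, done = env.step(a)
        total_reward += r
        natural_stop = random.random() < opt.termination(s)  # beta says to stop
        interrupted = Q[s, o] < Q[s].max()                    # termination-improvement check
        if done or natural_stop or interrupted:
            break
    return total_reward, s
```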
Landmarks Task
  Task: navigate from S to G as fast as possible
  4 primitive actions, for taking tiny steps up, down, left, right
  7 controllers for going straight to each one of the landmarks, from within a circular region where the landmark is visible (the input set / range of each run-to-landmark controller)
  In this task, planning at the level of primitive actions is computationally intractable; we need the controllers