

Reinforcement Learning: A Tutorial. Satinder Singh, Computer Science & Engineering, University of Michigan, Ann Arbor. With special thanks to Rich Sutton, Michael Kearns, Andy Barto, Michael Littman, Doina Precup, Peter Stone, Andrew Ng, ...


  1. Eligibility Traces • The policy evaluation problem: given a (in general stochastic) policy π, estimate V^π(i) = E_π{ r_0 + γ r_1 + γ² r_2 + γ³ r_3 + … | s_0 = i } from multiple experience trajectories generated by following policy π repeatedly from state i. A single trajectory yields the reward sequence r_0, r_1, r_2, r_3, …, r_k, r_{k+1}, …
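A minimal sketch (not from the slides) of the Monte Carlo view of this estimation problem, assuming trajectories are given as lists of rewards collected by following π from state i; the value estimate is just the average discounted return:

```python
import numpy as np

def mc_value_estimate(reward_trajectories, gamma=0.9):
    """Estimate V^pi(i) by averaging discounted returns over trajectories
    that all start in state i (hypothetical data format: list of reward lists)."""
    returns = []
    for rewards in reward_trajectories:
        discounts = gamma ** np.arange(len(rewards))
        returns.append(np.dot(discounts, rewards))
    return np.mean(returns)

# Example: three reward trajectories observed from state i under policy pi
print(mc_value_estimate([[0, 0, 1], [0, 1], [0, 0, 0, 1]], gamma=0.9))
```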

  2. TD(λ) • Trajectory: r_0, r_1, r_2, r_3, …, r_k, r_{k+1}, … • 0-step return (e_0): r_0 + γ V(s_1) • Update: V_new(s_0) = V_old(s_0) + α [ r_0 + γ V_old(s_1) − V_old(s_0) ], where the bracketed term is the temporal difference. Equivalently, V_new(s_0) = V_old(s_0) + α [ e_0 − V_old(s_0) ]. This is TD(0).
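A minimal tabular sketch of this TD(0) update (illustrative only; the state indexing and the step size α = 0.1 are assumptions):

```python
import numpy as np

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One TD(0) update: move V[s] toward the 0-step target r + gamma * V[s_next]."""
    td_error = r + gamma * V[s_next] - V[s]   # temporal difference
    V[s] += alpha * td_error
    return V

V = np.zeros(5)                      # value table for 5 states (assumed)
V = td0_update(V, s=0, r=1.0, s_next=1)
print(V)
```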

  3. TD(λ) • 1-step return (e_1): r_0 + γ r_1 + γ² V(s_2) • Update: V_new(s_0) = V_old(s_0) + α [ e_1 − V_old(s_0) ] = V_old(s_0) + α [ r_0 + γ r_1 + γ² V_old(s_2) − V_old(s_0) ].

  4. TD(λ) • The k-step returns and their weights: e_0 = r_0 + γ V(s_1) with weight w_0; e_1 = r_0 + γ r_1 + γ² V(s_2) with weight w_1; e_2 = r_0 + γ r_1 + γ² r_2 + γ³ V(s_3) with weight w_2; …; e_{k−1} = r_0 + γ r_1 + … + γ^{k−1} r_{k−1} + γ^k V(s_k) with weight w_{k−1}; …; e_∞ = r_0 + γ r_1 + γ² r_2 + … + γ^k r_k + γ^{k+1} r_{k+1} + … with weight w_∞. • Update: V_new(s_0) = V_old(s_0) + α [ Σ_k w_k e_k − V_old(s_0) ].

  5. TD(λ) • Choose the weights geometrically: weight (1−λ) on e_0 = r_0 + γ V(s_1); weight (1−λ)λ on e_1 = r_0 + γ r_1 + γ² V(s_2); weight (1−λ)λ² on e_2 = r_0 + γ r_1 + γ² r_2 + γ³ V(s_3); …; weight (1−λ)λ^{k−1} on e_{k−1} = r_0 + γ r_1 + … + γ^{k−1} r_{k−1} + γ^k V(s_k). • Update: V_new(s_0) = V_old(s_0) + α [ Σ_k (1−λ) λ^k e_k − V_old(s_0) ]. • 0 ≤ λ ≤ 1 interpolates between 1-step TD and Monte Carlo.

  6. TD(λ) • Temporal differences: δ_0 = r_0 + γ V(s_1) − V(s_0); δ_1 = r_1 + γ V(s_2) − V(s_1); δ_2 = r_2 + γ V(s_3) − V(s_2); …; δ_{k−1} = r_{k−1} + γ V(s_k) − V(s_{k−1}). • The same update written in terms of temporal differences: V_new(s_0) = V_old(s_0) + α Σ_k (γλ)^k δ_k. • This is the eligibility-trace view; it converges with probability 1 (Jaakkola, Jordan & Singh).
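A minimal sketch of the eligibility-trace (backward) view of TD(λ) for a tabular value function; the episode format, step size, and discount below are assumptions, not taken from the slides:

```python
import numpy as np

def td_lambda_episode(V, transitions, alpha=0.1, gamma=0.9, lam=0.8):
    """Accumulating-trace TD(lambda) over one episode.
    transitions: list of (s, r, s_next) tuples (assumed format)."""
    z = np.zeros_like(V)                      # eligibility trace, one per state
    for s, r, s_next in transitions:
        delta = r + gamma * V[s_next] - V[s]  # temporal difference
        z *= gamma * lam                      # decay all traces
        z[s] += 1.0                           # bump trace of the visited state
        V += alpha * delta * z                # credit all recently visited states
    return V

V = np.zeros(4)
V = td_lambda_episode(V, [(0, 0.0, 1), (1, 0.0, 2), (2, 1.0, 3)])
print(V)
```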

  7. Bias-Variance Tradeoff • The same family of returns: e_0 = r_0 + γ V(s_1); e_1 = r_0 + γ r_1 + γ² V(s_2); e_2 = r_0 + γ r_1 + γ² r_2 + γ³ V(s_3); …; e_{k−1} = r_0 + γ r_1 + … + γ^{k−1} r_{k−1} + γ^k V(s_k); …; e_∞ = r_0 + γ r_1 + γ² r_2 + … + γ^k r_k + γ^{k+1} r_{k+1} + … • Going down the list, the bias decreases (less reliance on the current estimate V) and the variance increases (more random rewards enter each return).

  8. TD(λ)

  9. Bias-Variance Tradeoff (Kearns & Singh, 2000) • With a constant step size, the error after t updates obeys a recursion of the form error_{t+1} ≤ a_λ + b_λ · error_t. • As t → ∞, the error asymptotes at a_λ / (1 − b_λ), an increasing function of λ. • The rate of convergence is exponential, governed by b_λ^t, and b_λ is a decreasing function of λ. • Intuition: start with a large λ and then decrease it over time.

  10. Near-Optimal Reinforcement Learning in Polynomial Time (solving the exploration versus exploitation dilemma)

  11. Setting • Unknown MDP M • At any step: explore or exploit • Finite-time analysis • Goal: develop an algorithm such that an agent following it will, in time polynomial in the complexity of the MDP, achieve nearly the same payoff per time step as an agent that knew the MDP to begin with. • Need to solve exploration versus exploitation • The algorithm is called E^3 (Explicit Explore or Exploit)

  12. Preliminaries 1 • Actual return: (R_1 + R_2 + … + R_T) / T • Let T* denote the (unknown) mixing time of the MDP • One key insight: even the optimal policy will take time O(T*) to achieve actual return that is near-optimal • E^3 has the property that it always compares favorably to the best policy among the policies that mix in the time that the algorithm is run.

  13. The Algorithm (informal) • Do “balanced wandering” until some state is known • Do forever: • Construct known-state MDP • Compute optimal exploitation policy in known-state MDP • If return of above policy is near optimal, execute it • Otherwise compute optimal exploration policy in known-state MDP and execute it; do balanced wandering from unknown states.
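A highly simplified sketch of the shape of this loop (not the actual E^3 implementation; the "known" threshold, the environment interface, and the planning step are placeholders/assumptions):

```python
import numpy as np

def e3_sketch(step, n_states, n_actions, m_known=20, total_steps=5000):
    """Toy illustration of the E^3 control flow: balanced wandering until states
    become 'known', then either exploit or explicitly explore. Placeholder logic."""
    counts = np.zeros((n_states, n_actions), dtype=int)
    s = 0
    for _ in range(total_steps):
        known = counts.min(axis=1) >= m_known      # states tried enough times
        if not known[s]:
            a = int(np.argmin(counts[s]))          # balanced wandering: least-tried action
        else:
            # Placeholder for: plan in the known-state MDP; exploit that policy if its
            # return is provably near-optimal, otherwise follow the policy that reaches
            # unknown states as quickly as possible (the explicit exploration step).
            a = int(np.random.randint(n_actions))
        counts[s, a] += 1
        s, _ = step(s, a)                          # assumed env interface: (state, reward)
    return counts

# Toy usage on a made-up 5-state chain environment with 2 actions
def step(s, a):
    s_next = max(0, min(4, s + (1 if a == 1 else -1)))
    return s_next, float(s_next == 4)

print(e3_sketch(step, n_states=5, n_actions=2, total_steps=200))
```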

  14. M̂ : the estimated known-state MDP. M : the true known-state MDP.

  15. Main Result • A new algorithm E^3, taking inputs ε and δ, such that for any V* and T* holding in the unknown MDP: • Total number of actions and computation time required by E^3 are poly(1/ε, 1/δ, T*, N) • Performance guarantee: with probability at least (1 − δ), the amortized return of E^3 so far will exceed (1 − ε) V*

  16. Function Approximation and Reinforcement Learning

  17. General Idea • A function approximator takes the state s (and action a) as input and outputs Q(s, a); it is trained from targets or errors, typically by gradient-descent methods. • The approximator could be: a table; a backprop neural network; a radial-basis-function network; tile coding (CMAC); nearest neighbor / memory-based; a decision tree.

  18. Neural Networks as FAs • Q(s, a) = f(s, a, w), where w is the weight vector. • Standard gradient-descent backprop, e.g. gradient-descent Sarsa: w ← w + α [ r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t) ] ∇_w f(s_t, a_t, w), where r_{t+1} + γ Q(s_{t+1}, a_{t+1}) is the target value and Q(s_t, a_t) is the estimated value.
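A minimal numpy sketch of this gradient-descent Sarsa update, using a linear-in-the-parameters approximator for brevity (the feature function, one-hot encoding, and step size are assumptions):

```python
import numpy as np

def sarsa_gradient_update(w, phi, s, a, r, s_next, a_next, alpha=0.01, gamma=0.9):
    """One gradient-descent Sarsa step for Q(s,a) = w . phi(s,a).
    phi(s, a) -> feature vector (assumed user-supplied)."""
    q = np.dot(w, phi(s, a))
    q_next = np.dot(w, phi(s_next, a_next))
    td_error = r + gamma * q_next - q          # target value minus estimated value
    return w + alpha * td_error * phi(s, a)    # for linear f, grad_w f = phi(s, a)

# Toy usage with one-hot state-action features over 3 states x 2 actions (assumed)
phi = lambda s, a: np.eye(6)[s * 2 + a]
w = np.zeros(6)
w = sarsa_gradient_update(w, phi, s=0, a=1, r=1.0, s_next=2, a_next=0)
print(w)
```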

  19. Linear in the Parameters FAs • V̂(s) = θᵀ φ_s, where each state s is represented by a feature vector φ_s. • Or represent a state-action pair by a feature vector φ_{s,a} and approximate action values: Q^π(s, a) = E_π{ r_{t+1} + γ r_{t+2} + γ² r_{t+3} + … | s_t = s, a_t = a }, approximated by Q̂(s, a) = θᵀ φ_{s,a}.

  20. Sparse Coarse Coding • A fixed, expansive re-representation of the input into many binary features, followed by a linear last layer. • Coarse: large receptive fields. • Sparse: few features are present (active) at any one time.
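A small illustrative tile-coding (CMAC-style) feature generator over a one-dimensional input, assuming a handful of offset tilings; all parameters below are made up for the example:

```python
import numpy as np

def tile_features(x, n_tilings=4, tiles_per_tiling=8, x_min=0.0, x_max=1.0):
    """Sparse coarse coding for a scalar x in [x_min, x_max]:
    each of n_tilings offset tilings activates exactly one tile (binary feature)."""
    phi = np.zeros(n_tilings * tiles_per_tiling)
    width = (x_max - x_min) / tiles_per_tiling
    for t in range(n_tilings):
        offset = (t / n_tilings) * width              # shift each tiling slightly
        idx = int((x - x_min + offset) / width)
        idx = min(idx, tiles_per_tiling - 1)          # clip at the boundary
        phi[t * tiles_per_tiling + idx] = 1.0         # one active tile per tiling
    return phi

print(tile_features(0.37))   # 4 active features out of 32
```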

  21. Shaping Generalization in Coarse Coding

  22. FAs & RL • Linear FA (divergence can happen); nonlinear neural networks (theory is not well developed); non-parametric, e.g. nearest-neighbor (provably not divergent; bounds on error). Everyone uses their favorite FA… little theoretical guidance yet! • Does FA really beat the curse of dimensionality? Probably; with FA, computation seems to scale with the complexity of the solution (crinkliness of the value function) and with how hard it is to find. • Empirically it works, though many folks have a hard time making it so; there is no off-the-shelf FA+RL yet.

  23. by Andrew Ng and colleagues

  24. Dynamic Channel Assignment in Cellular Telephones

  25. Dynamic Channel Assignment • Channel assignment in cellular telephone systems: decide what (if any) conflict-free channel to assign to a caller. • State: current assignments. Actions: feasible assignments. Reward: +1 per call per second. • Learned better dynamic assignment policies than the competition. Singh & Bertsekas (NIPS)

  26. Run Cellphone Demo (http://www.eecs.umich.edu/~baveja/Demo.html)

  27. After MDPs... • Great success with MDPs • What next? • Rethinking Actions, States, Rewards • Options instead of actions • POMDPs

  28. Rethinking Action (Hierarchical RL) • Options (Precup, Sutton & Singh) • MAXQ (Dietterich) • HAMs (Parr & Russell)

  29. Related Work • Related work spans reinforcement learning and MDP planning, "classical" AI (including macro-operators and qualitative reasoning), and robotics and control engineering. • Representative citations: Fikes, Hart & Nilsson (1972); Newell & Simon (1972); Mahadevan & Connell (1992); Brooks (1986); Sacerdoti (1974, 1977); Singh (1992); Maes (1991); Lin (1993); Koza & Rice (1992); Korf (1985); Dayan & Hinton (1993); Brockett (1993); Kaelbling (1993); Grossman et al. (1993); Minton (1988); Chrisman (1994); Dorigo & Colombetti (1994); Iba (1989); Bradtke & Duff (1995); Asada et al. (1996); Kibler & Ruby (1992); Ring (1995); Uchibe et al. (1996); Sutton (1995); Huber & Grupen (1997); Kuipers (1979); Thrun & Schwartz (1995); Kalmar et al. (1997); Boutilier et al. (1997); Mataric (1997); de Kleer & Brown (1984); Dietterich (1997); Sastry (1997); Dejong (1994); Wiering & Schmidhuber (1997); Toth et al. (1997); Precup, Sutton & Singh (1997); Laird et al. (1986); McGovern & Sutton (1998); Drescher (1991); Parr & Russell (1998); Levinson & Fuchs (1994); Drummond (1998); Say & Selahatin (1996); Hauskrecht et al. (1998); Brafman & Moshe (1997); Meuleau et al. (1998); Ryan & Pendrith (1998).

  30. Abstraction in Learning and Planning • A long-standing, key problem in AI! • How can we give abstract knowledge a clear semantics? e.g. “I could go to the library” • How can different levels of abstraction be related? - spatial: states - temporal: time scales • How can we handle stochastic, closed-loop, temporally extended courses of action? • Use RL/MDPs to provide a theoretical foundation.

  31. Options • A generalization of actions to include courses of action. • An option is a triple o = ⟨I, π, β⟩ where: I ⊆ S is the set of states in which o may be started; π : S × A → [0,1] is the policy followed during o; β : S → [0,1] is the probability of terminating in each state. • Option execution is assumed to be call-and-return. • Example: docking. I: all states in which the charger is in sight; π: hand-crafted controller; β: terminate when docked or charger not visible. • Options can take a variable number of steps.
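A minimal sketch of this option structure and its call-and-return execution (the environment interface, deterministic internal policy, and sampling of termination are simplifying assumptions for illustration):

```python
import random
from dataclasses import dataclass
from typing import Callable, Set

@dataclass
class Option:
    initiation: Set[int]                      # I: states where the option may start
    policy: Callable[[int], int]              # pi: state -> action (deterministic here)
    beta: Callable[[int], float]              # beta: state -> termination probability

def run_option(option, s, step, gamma=0.9, max_steps=1000):
    """Call-and-return execution: follow pi until beta says stop."""
    assert s in option.initiation
    total_reward, discount = 0.0, 1.0
    for _ in range(max_steps):
        a = option.policy(s)
        s, r = step(s, a)                     # assumed env interface: (state, reward)
        total_reward += discount * r          # r_1 + gamma r_2 + ... so far
        discount *= gamma
        if random.random() < option.beta(s):
            break
    return s, total_reward, discount          # discount = gamma^k; useful for option models
```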

  32. Rooms Example • 4 rooms, 4 hallways, 4 unreliable primitive actions (up, down, left, right), which fail 33% of the time. • 8 multi-step options (to each room's 2 hallways). • Given the goal location, quickly plan the shortest route. • Goal states are given a terminal value of 1; all rewards are zero; γ = 0.9.

  33. Options define a Semi-Markov Decision Process (SMDP) • MDP: discrete time, homogeneous discount. • SMDP: continuous time, discrete events, interval-dependent discount. • Options over an MDP: discrete time, overlaid discrete events, interval-dependent discount. • That is, a discrete-time SMDP overlaid on an MDP, which can be analyzed at either level.

  34. MDP + Options = SMDP Theorem: For any MDP, and any set of options, the decision process that chooses among the options, executing each to termination, is an SMDP. Thus all Bellman equations and DP results extend for value functions over options and models of options (cf. SMDP theory).

  35. What does the SMDP connection give us? • Policies over options: µ : S × O → [0,1] • Value functions over options: V^µ(s), Q^µ(s, o), V*_O(s), Q*_O(s, o) • Learning methods: Bradtke & Duff (1995), Parr (1998) • Models of options • Planning methods: e.g. value iteration, policy iteration, Dyna… • A coherent theory of learning and planning with courses of action at variable time scales, yet at the same level. • A theoretical foundation for what we really need! But the most interesting issues are beyond SMDPs…

  36. Value Functions for Options • Define value functions for options, similar to the MDP case: V^µ(s) = E{ r_{t+1} + γ r_{t+2} + … | E(µ, s, t) } and Q^µ(s, o) = E{ r_{t+1} + γ r_{t+2} + … | E(oµ, s, t) }. • Now consider policies µ ∈ Π(O) restricted to choose only from options in O: V*_O(s) = max_{µ∈Π(O)} V^µ(s) and Q*_O(s, o) = max_{µ∈Π(O)} Q^µ(s, o).

  37. Models of Options • Knowing how an option is executed is not enough for reasoning about it, or planning with it; we need information about its consequences. • The model of the consequences of starting option o in state s has: a reward part r_s^o = E{ r_1 + γ r_2 + … + γ^{k−1} r_k | s_0 = s, o taken in s_0, lasts k steps }; and a next-state part p^o_{ss'} = E{ γ^k δ_{s_k, s'} | s_0 = s, o taken in s_0, lasts k steps }, where δ_{s_k, s'} = 1 if s' = s_k is the termination state and 0 otherwise. • This form follows from SMDP theory. Such models can be used interchangeably with models of primitive actions in Bellman equations.
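A small sketch of estimating these two model parts by Monte Carlo; the rollout interface below is an assumption, e.g. the hypothetical run_option helper from the earlier options sketch applied from a fixed start state s:

```python
import numpy as np

def estimate_option_model(rollout, n_states, n_samples=1000):
    """Monte Carlo estimates of an option's reward part r_s^o and next-state part p^o_{ss'}.
    rollout() -> (s_end, discounted_reward, gamma_k)   # assumed interface"""
    r_hat, p_hat = 0.0, np.zeros(n_states)
    for _ in range(n_samples):
        s_end, disc_reward, gamma_k = rollout()
        r_hat += disc_reward          # sample of r_1 + gamma r_2 + ... + gamma^{k-1} r_k
        p_hat[s_end] += gamma_k       # sample of gamma^k * [s_k = s']
    return r_hat / n_samples, p_hat / n_samples
```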

  38. Room Example • 4 rooms, 4 hallways, 4 unreliable primitive actions (up, down, left, right), which fail 33% of the time. • 8 multi-step options (to each room's 2 hallways). • Given the goal location, quickly plan the shortest route. • Goal states are given a terminal value of 1; all rewards are zero; γ = 0.9.

  39. Example: Synchronous Value Iteration Generalized to Options • Initialize: V_0(s) ← 0 for all s ∈ S • Iterate: V_{k+1}(s) ← max_{o∈O} [ r_s^o + Σ_{s'∈S} p^o_{ss'} V_k(s') ] for all s ∈ S • The algorithm converges to the optimal value function given the options: lim_{k→∞} V_k = V*_O. Once V*_O is computed, µ*_O is readily determined. • If O = A, the algorithm reduces to conventional value iteration. If A ⊆ O, then V*_O = V*.
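A compact sketch of this synchronous value iteration over option models, assuming the reward parts r[o][s] and next-state parts p[o][s, s'] have already been obtained (e.g. as in the earlier model-estimation sketch); the dictionary layout and toy numbers are assumptions:

```python
import numpy as np

def svi_with_options(r, p, n_iters=50):
    """Synchronous value iteration over options.
    r: dict option -> array of r_s^o, shape (n_states,)
    p: dict option -> matrix of p^o_{ss'}, shape (n_states, n_states)."""
    n_states = len(next(iter(r.values())))
    V = np.zeros(n_states)
    for _ in range(n_iters):
        V = np.max([r[o] + p[o] @ V for o in r], axis=0)   # backup over all options
    return V

# Toy usage: two states, one option (made-up numbers)
r = {"go": np.array([0.0, 1.0])}
p = {"go": np.array([[0.0, 0.9], [0.0, 0.0]])}
print(svi_with_options(r, p))
```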

  40. Rooms Example • Value iteration with cell-to-cell primitive actions vs. with room-to-room options, with V(goal) = 1 (figure panels show Iterations #0, #1, and #2 for each case).

  41. Example with Goal ≠ Subgoal • Using both primitive actions and options (figure panels show the initial values and Iterations #1 through #5).

  42. What does the SMDP connection give us? • Policies over options: µ : S × O → [0,1] • Value functions over options: V^µ(s), Q^µ(s, o), V*_O(s), Q*_O(s, o) • Learning methods: Bradtke & Duff (1995), Parr (1998) • Models of options • Planning methods: e.g. value iteration, policy iteration, Dyna… • A coherent theory of learning and planning with courses of action at variable time scales, yet at the same level. • A theoretical foundation for what we really need! But the most interesting issues are beyond SMDPs…

  43. Advantages of Dual MDP/SMDP View • At the SMDP level: compute value functions and policies over options, with the benefit of increased speed / flexibility. • At the MDP level: learn how to execute an option for achieving a given goal. • Between the MDP and SMDP level: improve over existing options (e.g. by terminating early); learn about the effects of several options in parallel, without executing them to termination.

  44. Between MDPs and SMDPs • Termination Improvement: improving the value function by changing the termination conditions of options. • Intra-Option Learning: learning the values of options in parallel, without executing them to termination; learning the models of options in parallel, without executing them to termination. • Tasks and Subgoals: learning the policies inside the options.

  45. Termination Improvement • Idea: we can do better by sometimes interrupting ongoing options, forcing them to terminate before β says to. • Theorem: for any policy over options µ : S × O → [0,1], suppose we interrupt its options one or more times whenever Q^µ(s, o) < Q^µ(s, µ(s)), where s is the state at that time and o is the ongoing option, to obtain µ' : S × O' → [0,1]. Then µ' ≥ µ (it attains more or equal reward everywhere). • Application: suppose we have determined Q*_O and thus µ = µ*_O. Then µ' is guaranteed better than µ*_O and is available with no additional computation.
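A sketch of this interruption rule at execution time, assuming option values Q[s, o] and a base policy mu(s) over options are already available (the names and indexing are hypothetical):

```python
def interrupted_choice(s, ongoing_option, mu, Q):
    """Termination-improvement rule: keep following the ongoing option unless
    continuing it looks worse than what the base policy mu would start now."""
    if ongoing_option is not None and Q[s, ongoing_option] >= Q[s, mu(s)]:
        return ongoing_option          # continuing is still at least as good
    return mu(s)                       # interrupt: switch to mu's choice in s
```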

  46. Landmarks Task • Task: navigate from S to G as fast as possible. • 4 primitive actions, for taking tiny steps up, down, left, right. • 7 controllers, one for going straight to each of the landmarks, each applicable from within the circular region where that landmark is visible (the circle is the range, or input set, of the run-to-landmark controller). • In this task, planning at the level of primitive actions is computationally intractable; we need the controllers.
