Bellman Fixed Point
• Define the Bellman backup operator B mapping V^{t-1} to V^t:
  (B V)(s) = max_a [ R(s,a) + γ Σ_{s'} P(s'|s,a) V(s') ]
• ∃ an optimal value function V* and an optimal deterministic greedy policy π* = π_{V*} satisfying:
  V*(s) = max_a [ R(s,a) + γ Σ_{s'} P(s'|s,a) V*(s') ]   (i.e., B V* = V*)
  π*(s) = argmax_a [ R(s,a) + γ Σ_{s'} P(s'|s,a) V*(s') ]
Bellman Error and Properties
• Define the Bellman error BE of a value function V as the largest change made by one backup:
  BE(V) = max_s | (B V)(s) − V(s) |
• Clearly: BE(V*) = 0
• Can prove B is a contraction operator for BE:
  BE(B V) ≤ γ BE(V)
Hmmm…. Does this suggest a solution?
Value Iteration: in search of the fixed point
• Start with an arbitrary value function V^0
• Iteratively apply the Bellman backup (look familiar? same DP solution as before):
  V^t(s) = max_a [ R(s,a) + γ Σ_{s'} P(s'|s,a) V^{t-1}(s') ]
• Bellman error decreases on each iteration
  – Terminate when max_s | V^t(s) − V^{t-1}(s) | < ε(1−γ)/γ
  – Guarantees an ε-optimal value function
    • i.e., V^t within ε of V* for all states
• Precompute the maximum number of steps for ε? (See the sketch below.)
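A minimal tabular value-iteration sketch consistent with the slide above; the MDP representation (dicts `P` and `R`) and the exact stopping constant are illustrative assumptions, not code from the slides:

```python
def value_iteration(S, A, P, R, gamma, eps):
    """Tabular value iteration (sketch).
    S: list of states; A: list of actions
    P[s][a]: dict mapping successor state -> probability
    R[s][a]: immediate reward
    Stops when the max-norm change is below eps*(1-gamma)/gamma, which
    guarantees V is within eps of V* in every state (contraction argument)."""
    V = {s: 0.0 for s in S}
    while True:
        bellman_error = 0.0
        V_new = {}
        for s in S:
            V_new[s] = max(R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())
                           for a in A)
            bellman_error = max(bellman_error, abs(V_new[s] - V[s]))
        V = V_new
        if bellman_error < eps * (1 - gamma) / gamma:
            break
    # Extract the greedy policy w.r.t. the final V
    pi = {s: max(A, key=lambda a: R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items()))
          for s in S}
    return V, pi
```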
Single Dynamic Programming (DP) Step
• Graphical view: [backup diagram: state s_1 branches over actions a_1 and a_2; each action branches over successor states (s_1, s_2, s_3) weighted by transition probabilities, whose values feed the max at s_1]
Synchronous DP Updates (Value Iteration)
[diagram: V^3(s_1), V^2(s_1), V^1(s_1), V^0(s_1) and V^3(s_2), …, V^0(s_2); each V^t(s) is a MAX over actions a_1, a_2, each backed up from the V^{t-1} values of both states]
Asynchronous DP Updates (Asynchronous Value Iteration)
• Don’t need to update values synchronously with uniform depth.
• As long as each state is updated with non-zero probability, convergence is still guaranteed!
[diagram: backups of V(s_1) built to varying depths]
Can you see the intuition for error contraction?
Real-time DP (RTDP)
• Async. DP guaranteed to converge over relevant states
  – relevant states: states reachable from the initial states under π*
  – may converge without visiting all states!
• RTDP trial (standard formulation): start at an initial state; repeatedly perform a Bellman backup at the current state, take the resulting greedy action, and sample the next state from the model; stop at a goal state and repeat trials (see the sketch below).
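A minimal RTDP trial loop matching the description above; the MDP dicts, the `goal` test, and the lazy value initialisation are illustrative assumptions:

```python
import random

def rtdp(s0, A, P, R, gamma, goal, n_trials=1000):
    """Real-time dynamic programming (sketch).
    Backs up only the states actually visited along greedy trials."""
    V = {}  # values default to 0 on first visit (a suitable initialisation is assumed)

    def backup(s):
        # One Bellman backup at s; also returns the greedy action.
        q = {a: R[s][a] + gamma * sum(p * V.get(s2, 0.0) for s2, p in P[s][a].items())
             for a in A}
        a_star = max(q, key=q.get)
        V[s] = q[a_star]
        return a_star

    for _ in range(n_trials):
        s = s0
        while not goal(s):
            a = backup(s)                            # update value of current state
            succs, probs = zip(*P[s][a].items())
            s = random.choices(succs, weights=probs)[0]  # sample successor
    return V
```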
Prioritized Sweeping (PS)
• Simple asynchronous DP idea
  – Focus backups on high-error states
  – Can use in conjunction with other focused methods, e.g., RTDP
  (Where do RTDP and PS each focus?)
• Every time a state is visited:
  – Record the Bellman error of the state
  – Push the state onto a queue with priority = Bellman error
• In between simulations / experience, repeat (see the sketch below):
  – Withdraw the maximal-priority state from the queue
  – Perform a Bellman backup on that state
  – Record the Bellman error of its predecessor states
  – Push predecessor states onto the queue with priority = Bellman error
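A sketch of the in-between-experience queue processing described above; the predecessor map and the priority threshold are illustrative assumptions (Python's `heapq` is a min-heap, so priorities are negated):

```python
import heapq

def prioritized_sweeping_updates(queue, V, A, P, R, gamma, predecessors,
                                 n_backups=100, theta=1e-3):
    """Process up to n_backups backups, always taking the highest-error state.
    queue: heap of (-bellman_error, state) entries
    predecessors[s]: states s_pre with P[s_pre][a][s] > 0 for some action a."""
    def bellman_error(s):
        best = max(R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())
                   for a in A)
        return abs(best - V[s]), best

    for _ in range(n_backups):
        if not queue:
            break
        _, s = heapq.heappop(queue)       # maximal-priority (largest error) state
        _, V[s] = bellman_error(s)        # Bellman backup on s
        for s_pre in predecessors[s]:     # error may now have grown upstream
            err, _ = bellman_error(s_pre)
            if err > theta:
                heapq.heappush(queue, (-err, s_pre))
    return V
```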
Which approach is better? • Synchronous DP Updates – Good when you need a policy for every state – OR transitions are dense • Asynchronous DP Updates – Know best states to update • e.g., reachable states, e.g. RTDP • e.g., high error states, e.g. PS – Know how to order updates • e.g., from goal back to initial state if DAG
Policy Evaluation
• Given π, how to derive V^π?
• Matrix inversion
  – Set up a linear equality (no max!) for each state:
    V^π(s) = R(s, π(s)) + γ Σ_{s'} P(s'|s, π(s)) V^π(s')
  – Can solve the linear system in vector form as follows:
    V^π = R^π + γ P^π V^π  ⟹  V^π = (I − γ P^π)^{-1} R^π
    (I − γ P^π is guaranteed invertible for γ < 1.)
• Successive approximation
  – Essentially value iteration with a fixed policy
  – Initialize V^π_0 arbitrarily, then iterate
    V^π_t(s) = R(s, π(s)) + γ Σ_{s'} P(s'|s, π(s)) V^π_{t-1}(s')
  – Guaranteed to converge to V^π
(Both routes are sketched below.)
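A small sketch of both routes using numpy; the array shapes and indexing conventions are assumptions for illustration:

```python
import numpy as np

def evaluate_policy_exact(P_pi, R_pi, gamma):
    """Matrix-inversion policy evaluation.
    P_pi: (n, n) array with P_pi[s, s2] = P(s2 | s, pi(s))
    R_pi: (n,) array with R_pi[s] = R(s, pi(s))"""
    n = len(R_pi)
    # Solve (I - gamma * P_pi) V = R_pi instead of forming the inverse explicitly.
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)

def evaluate_policy_iterative(P_pi, R_pi, gamma, n_sweeps=1000):
    """Successive approximation: value iteration with a fixed policy."""
    V = np.zeros_like(R_pi, dtype=float)
    for _ in range(n_sweeps):
        V = R_pi + gamma * P_pi @ V
    return V
```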
Policy Iteration
1. Initialization: start with an arbitrary policy π_0 and set i = 0
2. Policy evaluation: compute V^{π_i} (e.g., by matrix inversion or successive approximation)
3. Policy improvement: take the greedy policy w.r.t. V^{π_i}:
   π_{i+1}(s) = argmax_a [ R(s,a) + γ Σ_{s'} P(s'|s,a) V^{π_i}(s') ]
4. If π_{i+1} ≠ π_i, set i ← i + 1 and go to step 2; otherwise terminate
• Each iteration produces a policy at least as good as the previous one, and there are only finitely many deterministic policies, so policy iteration converges to π* in a finite number of iterations.
(A sketch follows below.)
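A minimal policy-iteration sketch built on exact evaluation; the array layout (`P` of shape states × actions × states, `R` of shape states × actions) is an assumed convention:

```python
import numpy as np

def policy_iteration(P, R, gamma):
    """P: (n_s, n_a, n_s) transition array; R: (n_s, n_a) reward array."""
    n_s, n_a = R.shape
    pi = np.zeros(n_s, dtype=int)                 # arbitrary initial policy
    while True:
        # Policy evaluation: solve the linear system of the induced chain
        P_pi = P[np.arange(n_s), pi]              # (n_s, n_s)
        R_pi = R[np.arange(n_s), pi]              # (n_s,)
        V = np.linalg.solve(np.eye(n_s) - gamma * P_pi, R_pi)
        # Policy improvement: act greedily w.r.t. V
        Q = R + gamma * P @ V                     # (n_s, n_a)
        pi_new = Q.argmax(axis=1)
        if np.array_equal(pi_new, pi):            # policy stable -> optimal
            return V, pi
        pi = pi_new
```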
Modified Policy Iteration
• Value iteration
  – Each iteration can be seen as doing 1 step of policy evaluation for the current greedy policy
  – Bootstraps with the value estimate of the previous policy
• Policy iteration
  – Each iteration is a full evaluation of V^π for the current policy π
  – Then do a greedy policy update
• Modified policy iteration
  – Like policy iteration, but V^{π_i} need only be closer to V* than V^{π_{i-1}}
    • A fixed number of steps of successive approximation for V^{π_i} suffices when bootstrapped with V^{π_{i-1}}
  – Typically faster than VI & PI in practice
Conclusion • Basic introduction to MDPs – Bellman equations from first principles – Solution via various algorithms • Should be familiar with model-based solutions – Value Iteration • Synchronous DP • Asynchronous DP (RTDP, PS) – (Modified) Policy Iteration • Policy evaluation • Model-free solutions just sample from above
Model-free MDP Solutions
Reinforcement Learning
Scott Sanner
NICTA / ANU
First.Last@nicta.com.au
[sense–learn–act agent loop diagram]
Chapter 5: Monte Carlo Methods Reinforcement Learning, Sutton & Barto, 1998. Online. • Monte Carlo methods learn from sample returns – Sample from • experience in real application, or • simulations of known model – Only defined for episodic (terminating) tasks – On-line: Learn while acting Slides from Rich Sutton’s course CMP499/609 Reinforcement Learning in AI http://rlai.cs.ualberta.ca/RLAI/RLAIcourse/RLAIcourse2007.html
Essence of Monte Carlo (MC)
• MC samples directly from the value expectation for each state given π:
  V^π(s) = E_π[ Σ_{t=0}^∞ γ^t r_t | s_0 = s ]
Monte Carlo Policy Evaluation
• Goal: learn V^π(s)
• Given: some number of episodes under π which contain s
• Idea: average returns observed after visits to s
[gridworld diagram: Start → … → Goal; update each state with the final discounted return]
Monte Carlo policy evaluation
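A minimal first-visit Monte Carlo policy-evaluation sketch consistent with the idea above; the `sample_episode` helper (returning a list of (state, reward) pairs) is an assumed interface:

```python
from collections import defaultdict

def mc_policy_evaluation(sample_episode, policy, gamma, n_episodes=10000):
    """First-visit MC: estimate V^pi(s) as the average return observed
    after the first visit to s in each episode."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for _ in range(n_episodes):
        episode = sample_episode(policy)        # [(s_0, r_1), (s_1, r_2), ...]
        # Discounted return from each time step, computed backwards
        G = [0.0] * (len(episode) + 1)
        for t in reversed(range(len(episode))):
            _, r = episode[t]
            G[t] = r + gamma * G[t + 1]
        seen = set()
        for t, (s, _) in enumerate(episode):
            if s not in seen:                   # first visit only
                seen.add(s)
                returns_sum[s] += G[t]
                returns_count[s] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}
```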
Blackjack example
• Object: have your card sum be greater than the dealer’s without exceeding 21
• States (200 of them):
  – current sum (12–21)
  – dealer’s showing card (ace–10)
  – do I have a usable ace?
• Reward: +1 win, 0 draw, −1 lose
• Actions: stick (stop receiving cards), hit (receive another card)
• Policy: stick if my sum is 20 or 21, else hit (assuming a fixed policy for now)
Blackjack value functions
Backup diagram for Monte Carlo
• Entire episode included
• Only one choice at each state (unlike DP)
• MC does not bootstrap
• Time required to estimate one state does not depend on the total number of states
[backup diagram: a single sampled trajectory down to the terminal state]
MC Control: Need for Q-values
• Control: want to learn a good policy
  – Not just evaluate a given policy
• If no model is available
  – Cannot execute a policy based on V(s)
  – Instead, want to learn Q*(s,a)
• Q^π(s,a): average return starting from state s and action a, then following π:
  Q^π(s,a) = E_π[ Σ_{t=0}^∞ γ^t r_t | s_0 = s, a_0 = a ]
Monte Carlo Control
• Instance of Generalized Policy Iteration (GPI):
  – evaluation: Q → Q^π
  – improvement: π → greedy(Q)
• MC policy iteration: policy evaluation using MC methods followed by policy improvement
• Policy improvement step: greedy π′(s) is the action a maximizing Q^π(s,a)
Convergence of MC Control
• Greedy policy update improves or keeps value:
  Q^{π_k}(s, π_{k+1}(s)) = max_a Q^{π_k}(s,a) ≥ Q^{π_k}(s, π_k(s)) = V^{π_k}(s)
• This assumes all Q(s,a) are visited an infinite number of times
  – Requires exploration, not just exploitation
• In practice, update the policy after finitely many iterations
Blackjack Example Continued • MC Control with exploring starts… • Start with random (s,a) then follow π
Monte Carlo Control
• How do we get rid of exploring starts?
  – Need soft policies: π(s,a) > 0 for all s and a
  – e.g., ε-soft policy:
    π(s,a) = ε / |A(s)|          for non-greedy (non-max) actions
    π(s,a) = 1 − ε + ε / |A(s)|  for the greedy action
• Similar to GPI: move the policy towards the greedy policy (i.e., ε-greedy)
• Converges to the best ε-soft policy
(See the selection sketch below.)
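A tiny ε-greedy action-selection sketch realising the ε-soft probabilities above (the Q-table keyed by (state, action) is an assumed convention):

```python
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """With probability epsilon choose uniformly at random (so each non-greedy
    action gets epsilon/|A|); otherwise act greedily w.r.t. Q."""
    if random.random() < epsilon:
        return random.choice(actions)              # explore
    return max(actions, key=lambda a: Q[(s, a)])   # exploit
```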
Summary
• MC has several advantages over DP:
  – Learns from direct interaction with the environment
  – No need for full models
  – Less harmed by violations of the Markov property
• MC methods provide an alternative policy evaluation process
• No bootstrapping (as opposed to DP)
Temporal Difference Methods
Reinforcement Learning
Scott Sanner
NICTA / ANU
First.Last@nicta.com.au
[sense–learn–act agent loop diagram]
Chapter 6: Temporal Difference (TD) Learning Reinforcement Learning, Sutton & Barto, 1998. Online. • Rather than sample full returns as in Monte Carlo… TD methods sample Bellman backup Slides from Rich Sutton’s course CMP499/609 Reinforcement Learning in AI http://rlai.cs.ualberta.ca/RLAI/RLAIcourse/RLAIcourse2007.html
TD Prediction
Policy evaluation (the prediction problem): for a given policy π, compute the state-value function V^π.
Recall the simple every-visit Monte Carlo method:
  V(s_t) ← V(s_t) + α [ R_t − V(s_t) ]
  (target: the actual return after time t)
The simplest TD method, TD(0):
  V(s_t) ← V(s_t) + α [ r_{t+1} + γ V(s_{t+1}) − V(s_t) ]
  (target: an estimate of the return; see the sketch below)
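A minimal TD(0) prediction sketch matching the update above; the environment interface (`env.reset()`, `env.step(a)` returning `(next_state, reward, done)`) is an assumed convention:

```python
from collections import defaultdict

def td0_prediction(env, policy, gamma, alpha, n_episodes=1000):
    """Tabular TD(0): V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s))."""
    V = defaultdict(float)
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s2, r, done = env.step(a)
            target = r + (0.0 if done else gamma * V[s2])
            V[s] += alpha * (target - V[s])
            s = s2
    return dict(V)
```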
Simple Monte Carlo
  V(s_t) ← V(s_t) + α [ R_t − V(s_t) ]
where R_t is the actual return following state s_t.
[backup diagram: the entire sampled episode from s_t down to the terminal state]
Simplest TD Method
  V(s_t) ← V(s_t) + α [ r_{t+1} + γ V(s_{t+1}) − V(s_t) ]
[backup diagram: one sampled transition from s_t to s_{t+1} with reward r_{t+1}]
cf. Dynamic Programming
  V(s_t) ← E_π{ r_{t+1} + γ V(s_{t+1}) }
[backup diagram: full one-step expectation over all actions and successor states]
TD methods bootstrap and sample • Bootstrapping: update with estimate – MC does not bootstrap – DP bootstraps – TD bootstraps • Sampling: – MC samples – DP does not sample – TD samples
Example: Driving Home

  State                 Elapsed Time (min)   Predicted Time to Go   Predicted Total Time
  leaving office        0                    30                     30
  reach car, raining    5                    35                     40
  exit highway          20                   15                     35
  behind truck          30                   10                     40
  home street           40                   3                      43
  arrive home           43                   0                      43
Driving Home
[plots of predicted total travel time vs. situation (leaving office, reach car, exiting highway, 2ndary road, home street, arrive home): changes recommended by Monte Carlo methods (α = 1) vs. changes recommended by TD methods (α = 1), relative to the actual outcome of 43 minutes]
Advantages of TD Learning • TD methods do not require a model of the environment, only experience • TD, but not MC, methods can be fully incremental – You can learn before knowing the final outcome • Less memory • Less peak computation – You can learn without the final outcome • From incomplete sequences • Both MC and TD converge, but which is faster?
Random Walk Example
[five-state random walk A–B–C–D–E with start in the centre; all rewards 0 except a reward of 1 on terminating from the right end]
Values learned by TD(0) after various numbers of episodes
TD and MC on the Random Walk Data averaged over 100 sequences of episodes
Optimality of TD(0) Batch Updating : train completely on a finite amount of data, e.g., train repeatedly on 10 episodes until convergence. Only update estimates after complete pass through the data. For any finite Markov prediction task, under batch updating, TD(0) converges for sufficiently small α . Constant- α MC also converges under these conditions, but to a different answer!
Random Walk under Batch Updating
[plot: RMS error averaged over states vs. walks/episodes (0–100) under batch training; TD converges to lower error than MC]
After each new episode, all previous episodes were treated as a batch, and the algorithm was trained until convergence. All repeated 100 times.
You are the Predictor
Suppose you observe the following 8 episodes:
  A, 0, B, 0
  B, 1
  B, 1
  B, 1
  B, 1
  B, 1
  B, 1
  B, 0
V(A)?  V(B)?
You are the Predictor
[Markov model consistent with the data: A goes to B with probability 100% and reward 0; from B, terminate with reward 1 with probability 75% and with reward 0 with probability 25%]
V(A)?
You are the Predictor • The prediction that best matches the training data is V(A)=0 – This minimizes the mean-square-error – This is what a batch Monte Carlo method gets • If we consider the sequential aspect of the problem, then we would set V(A)=.75 – This is correct for the maximum likelihood estimate of a Markov model generating the data – This is what TD(0) gets MC and TD results are same in ∞ limit of data. But what if data < ∞ ?
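The arithmetic behind the two answers (taking γ = 1, since these episodes are undiscounted):

```latex
% Batch MC averages the returns actually observed from each state:
% B is followed by reward 1 in 6 of 8 episodes; A is seen once with total return 0.
V(B) = \tfrac{6 \cdot 1 + 2 \cdot 0}{8} = 0.75, \qquad
V_{\mathrm{MC}}(A) = \tfrac{0}{1} = 0
% Batch TD(0) is equivalent to solving the maximum-likelihood Markov model
% (A \to B with probability 1 and reward 0), so:
V_{\mathrm{TD}}(A) = 0 + \gamma\, V(B) = 0.75
```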
Sarsa: On-Policy TD Control Turn this into a control method by always updating the policy to be greedy with respect to the current estimate: SARSA = TD(0) for Q functions.
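The Sarsa update being referred to: the TD(0) rule applied to Q-values, bootstrapping from the action actually selected at the next step (the same δ_t that reappears in the Sarsa(λ) slides later):

```latex
Q(s_t, a_t) \;\leftarrow\; Q(s_t, a_t)
  \;+\; \alpha \bigl[\, r_{t+1} + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \,\bigr]
```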
Q-Learning: Off-Policy TD Control
One-step Q-learning:
  Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]
(See the sketch below.)
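A minimal tabular Q-learning sketch under the same assumed `env` interface as the TD(0) sketch above; the ε-greedy exploration settings are illustrative choices:

```python
import random
from collections import defaultdict

def q_learning(env, actions, gamma=0.99, alpha=0.1, epsilon=0.1, n_episodes=5000):
    """Off-policy TD control: the target bootstraps from max_a Q(s', a),
    regardless of which action the behaviour policy takes next."""
    Q = defaultdict(float)
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy behaviour policy
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])
            s2, r, done = env.step(a)
            best_next = 0.0 if done else max(Q[(s2, act)] for act in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2
    return Q
```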
Cliffwalking (ε-greedy, ε = 0.1)
• Sarsa learns the longer, safer path: the optimal policy given that exploration continues.
• Q-learning learns the optimal policy along the cliff edge, but exploration hurts its online performance more here.
Practical Modeling: Afterstates
• Usually, a state-value function evaluates states in which the agent can take an action.
• But sometimes it is useful to evaluate states after the agent has acted, as in tic-tac-toe.
• Why is this useful? [two different position/move pairs lead to the same afterstate board]
• An afterstate is really just an action that looks like a state.
Summary
• TD prediction
• Introduced one-step tabular model-free TD methods
• Extended prediction to control
  – On-policy control: Sarsa (an instance of GPI)
  – Off-policy control: Q-learning
• These methods sample from the Bellman backup (a.k.a. bootstrapping), combining aspects of DP and MC methods
TD(λ): Between TD and MC
Reinforcement Learning
Scott Sanner
NICTA / ANU
First.Last@nicta.com.au
[sense–learn–act agent loop diagram]
Unified View Is there a hybrid of TD & MC?
Chapter 7: TD( λ ) Reinforcement Learning, Sutton & Barto, 1998. Online. • MC and TD estimate same value – More estimators between two extremes • Idea: average multiple estimators – Yields lower variance – Leads to faster learning Slides from Rich Sutton’s course CMP499/609 Reinforcement Learning in AI http://rlai.cs.ualberta.ca/RLAI/RLAIcourse/RLAIcourse2007.html
N-step TD Prediction
• Idea: look farther into the future when you do the TD backup (1, 2, 3, …, n steps)
All of these estimate the same value!
N-step Prediction
• Monte Carlo: R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + … + γ^{T−t−1} r_T
• TD: R_t^{(1)} = r_{t+1} + γ V_t(s_{t+1})
  – Use V to estimate the remaining return
• n-step TD:
  – 2-step return: R_t^{(2)} = r_{t+1} + γ r_{t+2} + γ² V_t(s_{t+2})
  – n-step return: R_t^{(n)} = r_{t+1} + γ r_{t+2} + … + γ^{n−1} r_{t+n} + γ^n V_t(s_{t+n})
(See the sketch below.)
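A small helper computing the n-step return from a recorded trajectory, matching the formula above; the trajectory layout (parallel lists of rewards and states plus a value table `V`) is an assumed convention:

```python
def n_step_return(rewards, states, V, t, n, gamma):
    """R_t^(n) = r_{t+1} + gamma*r_{t+2} + ... + gamma^(n-1)*r_{t+n} + gamma^n * V(s_{t+n}).
    rewards[k] holds r_{k+1}; states[k] holds s_k; the episode ends at T = len(rewards)."""
    T = len(rewards)
    horizon = min(n, T - t)                       # truncate at the end of the episode
    G = sum(gamma ** k * rewards[t + k] for k in range(horizon))
    if t + n < T:                                 # bootstrap only if the episode continues
        G += gamma ** n * V[states[t + n]]
    return G
```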
Random Walk Examples • How does 2-step TD work here? • How about 3-step TD? Hint: TD(0) is 1-step return… update previous state on each time step.
A Larger Example
• Task: 19-state random walk
• Do you think there is an optimal n? Is it the same for everything?
Averaging N-step Returns
• n-step methods were introduced to help with understanding TD(λ)
• Idea: back up an average of several returns
  – e.g., back up half of the 2-step and half of the 4-step return:
    R_t^{avg} = ½ R_t^{(2)} + ½ R_t^{(4)}
• Called a complex backup
  – Draw each component
  – Label with the weights for that component
[one backup diagram]
Forward View of TD(λ)
• TD(λ) is a method for averaging all n-step backups
  – weight by λ^{n−1} (time since visitation)
• λ-return:
  R_t^λ = (1 − λ) Σ_{n=1}^∞ λ^{n−1} R_t^{(n)}
• Backup using the λ-return:
  ΔV_t(s_t) = α [ R_t^λ − V_t(s_t) ]
What happens when λ = 1? When λ = 0?
λ-return Weighting Function
  R_t^λ = (1 − λ) Σ_{n=1}^{T−t−1} λ^{n−1} R_t^{(n)} + λ^{T−t−1} R_t
  (first term: weights on the n-step returns until termination; second term: all remaining weight on the final return after termination)
Forward View of TD( λ ) • Look forward from each state to determine update from future states and rewards:
λ -return on the Random Walk • Same 19 state random walk as before • Why do you think intermediate values of λ are best?
Backward View
  δ_t = r_{t+1} + γ V_t(s_{t+1}) − V_t(s_t)
• Shout δ_t backwards over time
• The strength of your voice decreases with temporal distance by γλ
Backward View of TD(λ)
• The forward view was for theory
• The backward view is for mechanism
• New variable called the eligibility trace, e_t(s) ∈ ℝ⁺
  – On each step, decay all traces by γλ and increment the trace for the current state by 1
  – Accumulating trace:
    e_t(s) = γλ e_{t−1}(s)        if s ≠ s_t
    e_t(s) = γλ e_{t−1}(s) + 1    if s = s_t
On-line Tabular TD(λ)
  Initialize V(s) arbitrarily
  Repeat (for each episode):
    e(s) = 0, for all s ∈ S
    Initialize s
    Repeat (for each step of episode):
      a ← action given by π for s
      Take action a, observe reward r and next state s′
      δ ← r + γ V(s′) − V(s)
      e(s) ← e(s) + 1
      For all s:
        V(s) ← V(s) + α δ e(s)
        e(s) ← γλ e(s)
      s ← s′
    until s is terminal
(A Python sketch follows.)
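A direct transcription of the pseudocode above into Python; the `env` interface and the dictionary-based tables are assumed conventions:

```python
from collections import defaultdict

def td_lambda(env, policy, gamma, alpha, lam, n_episodes=1000):
    """On-line tabular TD(lambda) with accumulating eligibility traces."""
    V = defaultdict(float)
    for _ in range(n_episodes):
        e = defaultdict(float)                   # eligibility traces, reset per episode
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s2, r, done = env.step(a)
            delta = r + (0.0 if done else gamma * V[s2]) - V[s]
            e[s] += 1.0                          # accumulating trace
            for state in list(e.keys()):         # update every state with a nonzero trace
                V[state] += alpha * delta * e[state]
                e[state] *= gamma * lam
            s = s2
    return dict(V)
```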
Forward View = Backward View
• The forward (theoretical) view of averaging returns in TD(λ) is equivalent to the backward (mechanistic) view for off-line updating
• The book shows:
  Σ_{t=0}^{T−1} ΔV_t^{TD}(s)  =  Σ_{t=0}^{T−1} ΔV_t^λ(s_t) I_{s s_t}
  (backward updates)             (forward updates)
• With the algebra shown in the book, both sides reduce to:
  α Σ_{t=0}^{T−1} I_{s s_t} Σ_{k=t}^{T−1} (γλ)^{k−t} δ_k
On-line versus Off-line on Random Walk
(Off-line: save all updates for the end of the episode.)
• Same 19 state random walk
• On-line is better over a broader range of parameters
  – Updates are used immediately
Control: Sarsa(λ)
• Save eligibility traces for state–action pairs instead of just states:
  e_t(s,a) = γλ e_{t−1}(s,a) + 1   if s = s_t and a = a_t
  e_t(s,a) = γλ e_{t−1}(s,a)       otherwise
• Update:
  Q_{t+1}(s,a) = Q_t(s,a) + α δ_t e_t(s,a)
  δ_t = r_{t+1} + γ Q_t(s_{t+1}, a_{t+1}) − Q_t(s_t, a_t)
Sarsa(λ) Algorithm
  Initialize Q(s,a) arbitrarily
  Repeat (for each episode):
    e(s,a) = 0, for all s, a
    Initialize s, a
    Repeat (for each step of episode):
      Take action a, observe r, s′
      Choose a′ from s′ using a policy derived from Q (e.g., ε-greedy)
      δ ← r + γ Q(s′, a′) − Q(s, a)
      e(s,a) ← e(s,a) + 1
      For all s, a:
        Q(s,a) ← Q(s,a) + α δ e(s,a)
        e(s,a) ← γλ e(s,a)
      s ← s′; a ← a′
    until s is terminal
(A Python sketch follows.)
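The same algorithm as a Python sketch, reusing the assumed `env` interface and an ε-greedy behaviour policy:

```python
import random
from collections import defaultdict

def sarsa_lambda(env, actions, gamma, alpha, lam, epsilon=0.1, n_episodes=1000):
    """On-policy Sarsa(lambda) with accumulating traces over (state, action) pairs."""
    Q = defaultdict(float)

    def choose(s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(n_episodes):
        e = defaultdict(float)
        s = env.reset()
        a = choose(s)
        done = False
        while not done:
            s2, r, done = env.step(a)
            a2 = choose(s2)
            delta = r + (0.0 if done else gamma * Q[(s2, a2)]) - Q[(s, a)]
            e[(s, a)] += 1.0
            for sa in list(e.keys()):
                Q[sa] += alpha * delta * e[sa]
                e[sa] *= gamma * lam
            s, a = s2, a2
    return Q
```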
Sarsa( λ ) Gridworld Example • With one trial, the agent has much more information about how to get to the goal – not necessarily the best way • Can considerably accelerate learning
Replacing Traces
• Using accumulating traces, frequently visited states can have eligibilities greater than 1
  – This can be a problem for convergence
• Replacing traces: instead of adding 1 when you visit a state, set that trace to 1:
  e_t(s) = γλ e_{t−1}(s)   if s ≠ s_t
  e_t(s) = 1               if s = s_t
Why Replacing Traces? • Replacing traces can significantly speed learning • Perform well for a broader set of parameters • Accumulating traces poor for certain types of tasks: Why is this task particularly onerous for accumulating traces?
Replacing Traces Example • Same 19 state random walk task as before • Replacing traces perform better than accumulating traces over more values of λ
The Two Views
• Forward view: averaging estimators
• Backward view: efficient implementation
• Advantage of the backward view for continuing tasks?
Conclusions • TD( λ ) and eligibilities – efficient, incremental way to interpolate between MC and TD • Averages multiple noisy estimators – Lower variance – Faster learning • Can significantly speed learning • Does have a cost in computation
Practical Issues and Discussion
Reinforcement Learning
Scott Sanner
NICTA / ANU
First.Last@nicta.com.au
[sense–learn–act agent loop diagram]
Need for Exploration in RL • For model-based (known) MDP solutions – Get convergence with deterministic policies • But for model-free RL… – Need exploration – Usually use stochastic policies for this • Choose exploration action with small probability – Then get convergence to optimality