Lecture 4: Model Free Control
Emma Brunskill, CS234 Reinforcement Learning, Winter 2020
Structure closely follows much of David Silver's Lecture 5. For additional reading please see Sutton and Barto (SB) Sections 5.2-5.4, 6.4, 6.5, and 6.7.
Refresh Your Knowledge 3. Piazza Poll
Which of the following equations express a TD update?
1. $V(s_t) = r(s_t, a_t) + \gamma \sum_{s'} p(s' \mid s_t, a_t) V(s')$
2. $V(s_t) = (1 - \alpha) V(s_t) + \alpha \left( r(s_t, a_t) + \gamma V(s_{t+1}) \right)$
3. $V(s_t) = (1 - \alpha) V(s_t) + \alpha \sum_{i=t}^{H} r(s_i, a_i)$
4. $V(s_t) = (1 - \alpha) V(s_t) + \alpha \max_a \left( r(s_t, a) + \gamma V(s_{t+1}) \right)$
5. Not sure
Bootstrapping is when
1. Samples of (s, a, s') transitions are used to approximate the true expectation over next states
2. An estimate of the next state value is used instead of the true next state value
3. Used in Monte-Carlo policy evaluation
4. Not sure
Refresh Your Knowledge 3. Piazza Poll
Which of the following equations express a TD update?
Answer: $V(s_t) = (1 - \alpha) V(s_t) + \alpha \left( r(s_t, a_t) + \gamma V(s_{t+1}) \right)$ (option 2)
Bootstrapping is when an estimate of the next state value is used instead of the true next state value.
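To make the selected TD update concrete, here is a minimal tabular TD(0) sketch in Python (the function name, variable names, and defaults are illustrative, not from the slides):

```python
# Tabular TD(0) update for policy evaluation (illustrative sketch).
def td_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """Move V[s] toward the bootstrapped target r + gamma * V[s_next]."""
    target = r + gamma * V[s_next]
    V[s] = (1 - alpha) * V[s] + alpha * target
    return V
```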
Table of Contents
1. Generalized Policy Iteration
2. Importance of Exploration
3. Maximization Bias
Class Structure
Last time: Policy evaluation with no knowledge of how the world works (MDP model not given)
This time: Control (making decisions) without a model of how the world works
Next time: Generalization – Value function approximation
Evaluation to Control
Last time: how good is a specific policy?
  Given no access to the decision process model parameters
  Instead have to estimate from data / experience
Today: how can we learn a good policy?
Recall: Reinforcement Learning Involves
Optimization
Delayed consequences
Exploration
Generalization
Today: Learning to Control Involves
Optimization: Goal is to identify a policy with high expected rewards (similar to Lecture 2 on computing an optimal policy given decision process models)
Delayed consequences: May take many time steps to evaluate whether an earlier decision was good or not
Exploration: Necessary to try different actions to learn what actions can lead to high rewards
Today: Model-free Control
Generalized policy improvement
Importance of exploration
Monte Carlo control
Model-free control with temporal difference (SARSA, Q-learning)
Maximization bias
Model-free Control Examples
Many applications can be modeled as an MDP: backgammon, Go, robot locomotion, helicopter flight, RoboCup soccer, autonomous driving, customer ad selection, invasive species management, patient treatment
For many of these and other problems, either:
  the MDP model is unknown but can be sampled, or
  the MDP model is known but it is computationally infeasible to use directly, except through sampling
On and Off-Policy Learning
On-policy learning
  Direct experience
  Learn to estimate and evaluate a policy from experience obtained from following that policy
Off-policy learning
  Learn to estimate and evaluate a policy using experience gathered from following a different policy
Recall Policy Iteration
Initialize policy π
Repeat:
  Policy evaluation: compute V^π
  Policy improvement: update π
$$\pi'(s) = \arg\max_a Q^\pi(s, a) = \arg\max_a \left[ R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) V^\pi(s') \right]$$
Now want to do the above two steps without access to the true dynamics and reward models
Last lecture introduced methods for model-free policy evaluation
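For contrast with the model-free setting below, here is a minimal sketch of this model-based improvement step, assuming tabular numpy arrays P, R, and V (the names and shapes are my own, not from the slides):

```python
import numpy as np

# Model-based policy improvement step (sketch).
# P[s, a, s'] : transition probabilities, R[s, a] : rewards, V[s] : current value estimate.
def policy_improvement(P, R, V, gamma=0.99):
    # Q[s, a] = R[s, a] + gamma * sum_{s'} P[s, a, s'] * V[s']
    Q = R + gamma * np.einsum("sap,p->sa", P, V)
    return Q.argmax(axis=1)  # new greedy policy pi'(s) = argmax_a Q(s, a)
```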
Model Free Policy Iteration
Initialize policy π
Repeat:
  Policy evaluation: compute Q^π
  Policy improvement: update π
MC for On Policy Q Evaluation
Initialize N(s, a) = 0, G(s, a) = 0, Q^π(s, a) = 0, ∀s ∈ S, ∀a ∈ A
Loop:
  Using policy π, sample episode i = s_{i,1}, a_{i,1}, r_{i,1}, s_{i,2}, a_{i,2}, r_{i,2}, ..., s_{i,T_i}
  $G_{i,t} = r_{i,t} + \gamma r_{i,t+1} + \gamma^2 r_{i,t+2} + \cdots + \gamma^{T_i - 1} r_{i,T_i}$
  For each state-action pair (s, a) visited in episode i:
    For the first time (or every time) t that (s, a) is visited in episode i:
      N(s, a) = N(s, a) + 1,  G(s, a) = G(s, a) + G_{i,t}
      Update estimate Q^π(s, a) = G(s, a) / N(s, a)
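The loop above translates fairly directly into code. Here is a first-visit sketch in Python; the episode format (a list of (s, a, r) tuples) and the function name are assumptions for illustration:

```python
from collections import defaultdict

# First-visit Monte Carlo on-policy Q evaluation (illustrative sketch).
# `episodes` is a list of trajectories, each a list of (s, a, r) tuples sampled by following pi.
def mc_q_evaluation(episodes, gamma=1.0):
    N = defaultdict(int)        # visit counts N(s, a)
    G_sum = defaultdict(float)  # cumulative returns G(s, a)
    Q = defaultdict(float)      # Q^pi(s, a) estimates
    for episode in episodes:
        # Compute the return G_t for every time step, working backwards.
        G, returns = 0.0, []
        for (s, a, r) in reversed(episode):
            G = r + gamma * G
            returns.append((s, a, G))
        returns.reverse()
        # First-visit: only count the earliest occurrence of each (s, a) in the episode.
        seen = set()
        for (s, a, G_t) in returns:
            if (s, a) not in seen:
                seen.add((s, a))
                N[(s, a)] += 1
                G_sum[(s, a)] += G_t
                Q[(s, a)] = G_sum[(s, a)] / N[(s, a)]
    return Q
```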
Model-free Generalized Policy Improvement
Given an estimate Q^{π_i}(s, a) ∀s, a
Update new policy:
$$\pi_{i+1}(s) = \arg\max_a Q^{\pi_i}(s, a)$$
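A minimal sketch of this model-free improvement step, assuming Q is the (s, a)-keyed dict produced by the evaluation sketch above (the helper name and arguments are my own):

```python
# Model-free greedy improvement from an estimated Q (sketch).
# Q maps (s, a) -> value; `states` and `actions` enumerate the finite state and action sets.
def greedy_policy(Q, states, actions):
    return {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}
```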
Model-free Policy Iteration
Initialize policy π
Repeat:
  Policy evaluation: compute Q^π
  Policy improvement: update π given Q^π
May need to modify policy evaluation:
  If π is deterministic, can't compute Q(s, a) for any a ≠ π(s)
How to interleave policy evaluation and improvement?
  Policy improvement is now using an estimated Q
Policy Evaluation with Exploration
Want to compute a model-free estimate of Q^π
In general this seems subtle:
  Need to try all (s, a) pairs, but then follow π
  Want to ensure the resulting estimate Q^π is good enough so that policy improvement is a monotonic operator
  For certain classes of policies, we can ensure all (s, a) pairs are tried such that asymptotically Q^π converges to the true value
ε-greedy Policies
Simple idea to balance exploration and exploitation
Let |A| be the number of actions
Then an ε-greedy policy w.r.t. a state-action value Q(s, a) is
$$\pi(a \mid s) = \begin{cases} 1 - \epsilon + \frac{\epsilon}{|A|} & \text{if } a = \arg\max_{a'} Q(s, a') \\ \frac{\epsilon}{|A|} & \text{otherwise} \end{cases}$$
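As a concrete illustration, here is a minimal ε-greedy action-selection sketch in Python (the dict-based Q and helper name are assumptions, matching the sketches above):

```python
import random

# Epsilon-greedy action selection w.r.t. a Q table (sketch).
# Q maps (s, a) -> value; `actions` is the list of available actions.
def epsilon_greedy_action(Q, s, actions, epsilon=0.1):
    if random.random() < epsilon:
        # Explore: uniform over all actions, so each action gets probability eps/|A| here.
        return random.choice(actions)
    # Exploit: greedy action, giving it overall probability 1 - eps + eps/|A|.
    return max(actions, key=lambda a: Q[(s, a)])
```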
For Later Practice: MC for On Policy Q Evaluation
Initialize N(s, a) = 0, G(s, a) = 0, Q^π(s, a) = 0, ∀s ∈ S, ∀a ∈ A
Loop:
  Using policy π, sample episode i = s_{i,1}, a_{i,1}, r_{i,1}, s_{i,2}, a_{i,2}, r_{i,2}, ..., s_{i,T_i}
  $G_{i,t} = r_{i,t} + \gamma r_{i,t+1} + \gamma^2 r_{i,t+2} + \cdots + \gamma^{T_i - 1} r_{i,T_i}$
  For each state-action pair (s, a) visited in episode i:
    For the first time (or every time) t that (s, a) is visited in episode i:
      N(s, a) = N(s, a) + 1,  G(s, a) = G(s, a) + G_{i,t}
      Update estimate Q^π(s, a) = G(s, a) / N(s, a)
Mars rover with new actions: r(−, a1) = [1 0 0 0 0 0 +10], r(−, a2) = [0 0 0 0 0 0 +5], γ = 1. Assume the current greedy policy is π(s) = a1 ∀s, and ε = 0.5.
Sample trajectory from the ε-greedy policy:
  Trajectory = (s3, a1, 0, s2, a2, 0, s3, a1, 0, s2, a2, 0, s1, a1, 1, terminal)
First-visit MC estimate of Q of each (s, a) pair?
  Q^{ε-π}(−, a1) = [1 0 1 0 0 0 0], Q^{ε-π}(−, a2) = [0 1 0 0 0 0 0]
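To spell out the first-visit computation behind those answers (my own working, following the algorithm above): with γ = 1, the return from any time step is simply the sum of the remaining rewards, and the only nonzero reward in this trajectory is the final +1 received in s1 under a1, so G_{i,t} = 1 for every t. The first visits are (s3, a1) at t = 1, (s2, a2) at t = 2, and (s1, a1) at t = 5, each with N(s, a) = 1 and cumulative return 1, giving Q^{ε-π}(s3, a1) = Q^{ε-π}(s2, a2) = Q^{ε-π}(s1, a1) = 1/1 = 1. All other (s, a) entries keep their initial value 0, which matches the vectors above.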
Monotonic ε-greedy Policy Improvement
Theorem: For any ε-greedy policy π_i, the ε-greedy policy w.r.t. Q^{π_i}, π_{i+1}, is a monotonic improvement: V^{π_{i+1}} ≥ V^{π_i}.
$$Q^{\pi_i}(s, \pi_{i+1}(s)) = \sum_{a \in A} \pi_{i+1}(a \mid s) Q^{\pi_i}(s, a) = \frac{\epsilon}{|A|} \sum_{a \in A} Q^{\pi_i}(s, a) + (1 - \epsilon) \max_a Q^{\pi_i}(s, a)$$
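The slide stops mid-derivation; the standard completion of this argument (as in SB Section 5.4) proceeds roughly as follows. Because π_i is ε-greedy, the weights $\frac{\pi_i(a \mid s) - \epsilon/|A|}{1 - \epsilon}$ are nonnegative and sum to 1, so the max is at least their weighted average:
$$\max_a Q^{\pi_i}(s, a) \ge \sum_{a \in A} \frac{\pi_i(a \mid s) - \epsilon/|A|}{1 - \epsilon} Q^{\pi_i}(s, a)$$
Substituting this into the expression above gives
$$Q^{\pi_i}(s, \pi_{i+1}(s)) \ge \frac{\epsilon}{|A|} \sum_{a \in A} Q^{\pi_i}(s, a) + (1 - \epsilon) \sum_{a \in A} \frac{\pi_i(a \mid s) - \epsilon/|A|}{1 - \epsilon} Q^{\pi_i}(s, a) = \sum_{a \in A} \pi_i(a \mid s) Q^{\pi_i}(s, a) = V^{\pi_i}(s),$$
and the policy improvement argument then yields $V^{\pi_{i+1}} \ge V^{\pi_i}$.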