

Class #25: Reinforcement Learning
Machine Learning (COMP 135): M. Allen, 22 Apr. 2020

Learning the Value of a Policy
- The dynamic programming algorithm we have seen works fine if we already know everything about an MDP system, including:
  1. The probabilities of all state-action transitions
  2. The rewards we get in each case
- If we don't have this information, how can we figure out the value of a policy?
- It turns out we can use a sampling method: "follow the policy, and see what happens."

Temporal Difference (TD) Updates
- An agent in an MDP takes actions, and sees new states and rewards.
- Now we don't base the value-update on a probability distribution; instead, it is based on the single state we actually see, over and over again, stopping whenever we hit a terminal condition, for some number of learning episodes.

    function TD-Policy-Evaluation(mdp, π) returns a value function
        inputs: mdp, an MDP, and π, a policy to be evaluated
        ∀ s ∈ S, U(s) = 0
        repeat for each episode E:
            set start-state s ← s0
            repeat for each time-step t of episode E, until s is terminal:
                take action π(s)
                observe: next state s′, one-step reward r
                U(s) ← U(s) + α[r + γ U(s′) − U(s)]
                s ← s′
        return value function U ≈ U^π

The Basic TD(0) Update

    U(s) ← U(s) + α[r + γ U(s′) − U(s)]

- When we make a one-step update, we add the one-step reward that we get, r, plus the difference between where we start, U(s), and where we end up, U(s′), discounted by the factor γ as usual.
- If the state where we end up, s′, is better than the original state s after discounting, then the value of s goes up.
- If s′ is worse than s, the value of s goes down.
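As a concrete illustration, here is a minimal Python sketch of the TD-Policy-Evaluation loop above. The environment interface (env.reset() and env.step(s, a) returning a next state, one-step reward, and done flag) is an assumption of the sketch, not something specified in the slides; any episodic, tabular environment with hashable states would do.

    def td_policy_evaluation(env, policy, num_episodes=5000, alpha=0.1, gamma=0.99):
        """Tabular TD(0) evaluation of a fixed policy (a sketch).

        Assumed (hypothetical) environment interface:
            env.reset()     -> start state s0
            env.step(s, a)  -> (next_state, reward, done)
        """
        U = {}  # value table; unseen states count as U(s) = 0

        for _ in range(num_episodes):
            s = env.reset()
            done = False
            while not done:
                a = policy(s)                     # take action π(s)
                s_next, r, done = env.step(s, a)  # observe s′, one-step reward r
                u_s = U.get(s, 0.0)
                u_next = 0.0 if done else U.get(s_next, 0.0)
                # TD(0) update: U(s) ← U(s) + α[r + γ U(s′) − U(s)]
                U[s] = u_s + alpha * (r + gamma * u_next - u_s)
                s = s_next
        return U  # approximates U^π

Unvisited states simply keep their default value of 0, matching the ∀ s ∈ S, U(s) = 0 initialization in the pseudocode.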

The Basic TD(0) Update (continued)

    U(s) ← U(s) + α[r + γ U(s′) − U(s)]

- We also weight the value-update amount by another constant α (less than 1), called a step-size parameter:
  1. If this value shrinks to 0 over time, the values stop changing.
  2. If we do this slowly, the update will eventually converge to the actual value of the state if we follow the policy π.
- For example, if we update over episodes e = 1, 2, 3, …, we can set the parameter for each episode to be α_e = 1/e.

Advantages and a Problem
- With TD updates, we only update the states we actually see given the policy we are following:
  - We don't need to know the MDP dynamics.
  - We may only have to update very few states, saving much time to get the values of those we actually reach under our policy.
- However, this can be a source of difficulty: we may not be able to find a better policy, since we don't know the values of states that we never happen to visit.

Exploration and Exploitation
- If we use the dynamic programming method, we calculate the value of every state:
  - Easy to update the policy (just be greedy).
  - This is exploitation: use the best values seen to choose actions.
- When we are learning, however, we sometimes don't know what certain states are like, because we've never actually seen them yet.
  - Our current policy may never get us to the things we really want.
- Thus, we must use exploration: try out things even if our current best policy doesn't think it's a good idea.

Almost-Greedy Policies
- One simple way to add exploration is to use a policy that is mostly greedy, but not always.
- An "epsilon-greedy" (ε-greedy) policy sets some probability threshold, ε, and chooses actions by:
  1. Picking a random number R ∈ [0, 1]
  2. If R ≤ ε, choosing the action at random
  3. If R > ε, acting in a greedy fashion (as before)
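Both the ε-greedy choice and the α_e = 1/e schedule are short to state in code. The sketch below is illustrative only: the values argument stands in for whatever action-value estimates the agent ranks its actions by, and the function names are not from the slides.

    import random

    def epsilon_greedy(values, epsilon):
        """Choose an action ε-greedily from a dict {action: estimated value}."""
        actions = list(values)
        R = random.random()                     # pick a random number R in [0, 1]
        if R <= epsilon:
            return random.choice(actions)       # explore: choose at random
        return max(actions, key=values.get)     # exploit: act greedily

    def alpha_for_episode(e):
        """Step-size schedule α_e = 1/e, for episodes numbered 1, 2, 3, …"""
        return 1.0 / e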

Learning with ε-greedy Policies
- We can add this idea to our sampling update method.
- After we take an action, and see a state-transition from s to s′, we do the same update as before:

    U(s) ← U(s) + α[r + γ U(s′) − U(s)]

- When we choose actions, we do so in an ε-greedy way, sometimes following the policy based on learned values, and sometimes trying random things.
- Over enough time, this can converge to the true value function U* of the optimal policy π*.

TD-Learning

    function TD-Learning(mdp) returns a policy
        inputs: mdp, an MDP
        ∀ s ∈ S, U(s) = 0
        repeat for each episode E:
            set start-state s ← s0
            repeat for each time-step t of episode E, until s is terminal:
                choose action a, using ε-greedy policy based on U(s)
                observe: next state s′, one-step reward r
                U(s) ← U(s) + α[r + γ U(s′) − U(s)]
                s ← s′
        return policy π, set greedily for every state s ∈ S, based upon U(s)

- The algorithm is the same as before, but it explores using sometimes-greedy, sometimes-probabilistic action-choices instead of the fixed policy π.
- We reduce the learning parameter α just as before in order to converge.

Randomness and Weighting in Learning
- Our algorithm uses two parameters, α and ε (plus the usual discount factor γ), to control its overall behavior.
- Each can be adapted over time to control the algorithm:
  1. ε: the amount of randomness in the policy.
     - When we don't know much, set it to a high value, so that we start off with lots of random exploration.
     - We reduce this value over time until ε = 0, and we are being purely greedy, just exploiting what we have learned.
  2. α: the weight on each learning-update step.
     - Reduce this over time as well: when α = 0, the U-values no longer change, and we can converge on final policy values.

Randomness and Weighting in Learning (continued)
- The control parameters α and ε give us simple ways to control complex learning behavior, but we don't always want to reduce each of them over time:
  - In a purely stationary environment, where the system dynamics never change and all probabilities stay the same, we can simply reduce each slowly until we converge upon a stable learned behavior.
  - In a non-stationary environment, where things may change at some point, learned solutions may quit working.
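Putting the pieces together, here is a Python sketch of the TD-Learning loop with decaying ε and α. It makes one assumption beyond the slides: to act greedily with respect to state values U, the sketch assumes a deterministic environment that exposes a one-step lookahead env.next_state(s, a); the whole environment interface is hypothetical.

    import random

    def td_learning(env, actions, num_episodes=5000, gamma=0.99):
        """Sketch of TD-Learning over state values U(s), with ε-greedy exploration.

        Assumed (hypothetical) environment interface:
            env.reset()           -> start state s0
            env.step(s, a)        -> (next_state, reward, done)
            env.next_state(s, a)  -> the state action a would lead to (deterministic)
        """
        U = {}

        def greedy_action(s):
            # act greedily with respect to U, using the assumed one-step lookahead
            return max(actions, key=lambda a: U.get(env.next_state(s, a), 0.0))

        for e in range(1, num_episodes + 1):
            alpha = 1.0 / e      # step-size schedule α_e = 1/e
            epsilon = 1.0 / e    # one simple way to shrink exploration over time
            s = env.reset()
            done = False
            while not done:
                # ε-greedy action choice based on U(s)
                if random.random() <= epsilon:
                    a = random.choice(actions)
                else:
                    a = greedy_action(s)
                s_next, r, done = env.step(s, a)
                u_s = U.get(s, 0.0)
                u_next = 0.0 if done else U.get(s_next, 0.0)
                U[s] = u_s + alpha * (r + gamma * u_next - u_s)
                s = s_next

        # final policy: greedy for every state we have a value for
        return {s: greedy_action(s) for s in U}

Using 1/e for both schedules is just one simple choice that shrinks toward 0, as the slides describe for a stationary environment; in a non-stationary environment we would instead keep, or later restore, larger values of ε and α.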

Non-Stationary Environments
- Suppose the environment starts off in one configuration:
    [Grid-world figure: start state s0 and a GOAL cell]
- Over time, we can learn a policy for the shortest path to the goal.
- By letting ε and α go to 0, the policy becomes stable.

Non-Stationary Environments (continued)
- The environment may change, however:
    [Grid-world figure: the same grid after the change, with start state s0 and the GOAL]
- If ε and α stay at 0, the policy is sub-optimal from now on.

Non-Stationary Environments (continued)
- We may be able to tell that the environment has changed, however:
    [Grid-world figure: start state s0, the GOAL, and a region of cells marked X]
- If the value drops off over a long time, we can increase ε and α again, to resume learning and find the new optimal policy.

Bellman Equations for Q-values
- Instead of the value of a state, U(s), we can calculate the value of a state-action pair, Q(s, a).
- The value of taking action a in state s, and then following the policy π after that:

    Q^π(s, a) = Σ_{s′} P(s, a, s′) [ R(s, a, s′) + γ Q^π(s′, π(s′)) ]

- Similarly, we can calculate the optimal value Q*(s, a) of taking a in state s, and then following the best possible policy after that:

    Q*(s, a) = Σ_{s′} P(s, a, s′) [ R(s, a, s′) + γ max_{a′} Q*(s′, a′) ]

- We can do learning for Q-values, too…
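To make the Q*(s, a) equation concrete, here is a minimal sketch of one Bellman backup over a small tabular model. The dictionary formats for P, R, and Q are assumptions of the sketch, not a representation prescribed by the slides.

    def q_star_backup(Q, P, R, actions, gamma):
        """One Bellman backup toward optimal Q-values on a tabular MDP (a sketch).

        Assumed (illustrative) model format:
            P[(s, a)]          -> list of (s_next, probability) pairs
            R[(s, a, s_next)]  -> one-step reward
            Q[(s, a)]          -> current estimate (missing entries count as 0)
        """
        new_Q = {}
        for (s, a), outcomes in P.items():
            total = 0.0
            for s_next, prob in outcomes:
                # best value achievable from s′ over all actions a′
                best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
                # P(s, a, s′) [ R(s, a, s′) + γ max_{a′} Q(s′, a′) ]
                total += prob * (R[(s, a, s_next)] + gamma * best_next)
            new_Q[(s, a)] = total
        return new_Q

Applying this backup repeatedly from an all-zero Q table iterates toward Q*; replacing the max over a′ with Q(s′, π(s′)) gives the corresponding backup for Q^π. Learning Q-values from samples, rather than from the model P and R, is what the final bullet points toward.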
