MDP: Optimal policy state-value and action-value functions
- Optimal policies share the same optimal state-value function (V^{π*}(s) will be abbreviated as V*(s)):
  V*(s) = max_π V^π(s), ∀s ∈ S
- And the same optimal action-value function:
  Q*(s,a) = max_π Q^π(s,a), ∀s ∈ S, a ∈ A(s)
- For any MDP, a deterministic optimal policy exists!
28
Optimal policy
- If we have V*(s) and P(s_{t+1} | s_t, a_t), we can compute π*(s):
  π*(s) = argmax_a Σ_{s'} P^a_{ss'} [R^a_{ss'} + γ V*(s')]
- It can also be computed as:
  π*(s) = argmax_{a ∈ A(s)} Q*(s,a)
- The optimal policy has the interesting property that it is optimal for all states:
  - all optimal policies share the same optimal state-value function
  - it does not depend on the initial state: use the same policy no matter what the initial state of the MDP is
29
Bellman optimality equations
V*(s) = max_{a ∈ A(s)} Σ_{s'} P^a_{ss'} [R^a_{ss'} + γ V*(s')]
Q*(s,a) = Σ_{s'} P^a_{ss'} [R^a_{ss'} + γ max_{a'} Q*(s',a')]
V*(s) = max_{a ∈ A(s)} Q*(s,a)
Q*(s,a) = Σ_{s'} P^a_{ss'} [R^a_{ss'} + γ V*(s')]
30
Optimal Quantities
- The value (utility) of a state s:
  V*(s) = expected utility starting in s and acting optimally
- The value (utility) of a q-state (s,a):
  Q*(s,a) = expected utility starting out having taken action a from state s and (thereafter) acting optimally
- The optimal policy:
  π*(s) = optimal action from state s
(Diagram legend: s is a state; (s,a) is a q-state; (s,a,s') is a transition.)
31
Snapshot of Demo - Gridworld V Values
Noise = 0.2, Discount = 0.9, Living reward = 0
32
Snapshot of Demo - Gridworld Q Values
Noise = 0.2, Discount = 0.9, Living reward = 0
33
Value Iteration algorithm
Consider only MDPs with finite state and action spaces:
1) Initialize all V(s) to zero
2) Repeat until convergence:
   - for s ∈ S:
     V(s) ← max_a Σ_{s'} P^a_{ss'} [R^a_{ss'} + γ V(s')]
3) for s ∈ S:
   π(s) ← argmax_a Σ_{s'} P^a_{ss'} [R^a_{ss'} + γ V(s')]
V(s) converges to V*(s).
Asynchronous: instead of updating the values of all states at once in each iteration, update them state by state, or update some states more often than others. A minimal sketch appears below.
34
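A minimal Python sketch of synchronous value iteration with greedy policy extraction. The model format `P[s][a]` (a list of `(prob, next_state, reward)` triples), `gamma`, and the tolerance `theta` are illustrative assumptions, not part of the slides; every state is assumed to have at least one action.

```python
# Minimal sketch of synchronous value iteration (illustrative interface).
# P[s][a] is assumed to be a list of (prob, next_state, reward) triples.

def value_iteration(P, gamma=0.9, theta=1e-6):
    V = {s: 0.0 for s in P}                      # initialize all V(s) to zero
    while True:
        new_V = {}
        for s in P:                              # one synchronous sweep over all states
            new_V[s] = max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                           for a in P[s])        # Bellman optimality backup
        delta = max(abs(new_V[s] - V[s]) for s in P)
        V = new_V
        if delta < theta:                        # stop when values stop changing
            break
    # greedy policy extraction from the converged values
    pi = {s: max(P[s], key=lambda a: sum(p * (r + gamma * V[s2])
                                         for p, s2, r in P[s][a]))
          for s in P}
    return V, pi
```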
Value Iteration
- Bellman equations characterize the optimal values:
  V*(s) = max_a Σ_{s'} T(s,a,s') [R(s,a,s') + γ V*(s')]
- Value iteration computes them by iterating this update:
  V_{k+1}(s) ← max_a Σ_{s'} T(s,a,s') [R(s,a,s') + γ V_k(s')]
- Value iteration is just a fixed-point solution method
- … though the V_k vectors are also interpretable as time-limited values
35
Racing Search Tree
(Diagram: one-step expectimax layer from V_{k+1}(s) through (s,a) and (s,a,s') down to V_k(s').)
36
Racing Search Tree 37
Time-Limited Values
- Key idea: time-limited values
- Define V_k(s) to be the optimal value of s if the game ends in k more time steps
38
Value Iteration
- Start with V_0(s) = 0: no time steps left means an expected reward sum of zero
- Given the vector of V_k(s) values, do one ply of expectimax from each state:
  V_{k+1}(s) ← max_a Σ_{s'} T(s,a,s') [R(s,a,s') + γ V_k(s')]
- Repeat until convergence
- Complexity of each iteration: O(S²A)
- Theorem: will converge to unique optimal values
  - Basic idea: approximations get refined towards optimal values
  - Policy may converge long before values do
39
Example: Value Iteration (racing MDP; assume no discount!)
Successive value vectors: V_0 = (0, 0, 0); V_1 = (2, 1, 0); V_2 = (3.5, 2.5, 0)
40
Computing Time-Limited Values 41
k=0 Noise = 0.2 Discount = 0.9 Living reward = 0 42
k=1 Noise = 0.2 Discount = 0.9 Living reward = 0 43
k=2 Noise = 0.2 Discount = 0.9 Living reward = 0 44
k=3 Noise = 0.2 Discount = 0.9 Living reward = 0 45
k=4 Noise = 0.2 Discount = 0.9 Living reward = 0 46
k=5 Noise = 0.2 Discount = 0.9 Living reward = 0 47
k=6 Noise = 0.2 Discount = 0.9 Living reward = 0 48
k=7 Noise = 0.2 Discount = 0.9 Living reward = 0 49
k=8 Noise = 0.2 Discount = 0.9 Living reward = 0 50
k=9 Noise = 0.2 Discount = 0.9 Living reward = 0 51
k=10 Noise = 0.2 Discount = 0.9 Living reward = 0 52
k=11 Noise = 0.2 Discount = 0.9 Living reward = 0 53
k=12 Noise = 0.2 Discount = 0.9 Living reward = 0 54
k=100 Noise = 0.2 Discount = 0.9 Living reward = 0 55
Computing Actions from Values
- Let's imagine we have the optimal values V*(s)
- How should we act?
  - It's not obvious!
- We need to do a one-step look-ahead (a mini-expectimax):
  π*(s) = argmax_a Σ_{s'} T(s,a,s') [R(s,a,s') + γ V*(s')]
- This is called policy extraction, since it gets the policy implied by the values
56
Computing Actions from Q-Values
- Let's imagine we have the optimal q-values Q*(s,a)
- How should we act?
  - Completely trivial to decide: π*(s) = argmax_a Q*(s,a)
- Important lesson: actions are easier to select from q-values than from values!
57
Convergence*
- How do we know the V_k vectors are going to converge?
- Case 1: If the tree has maximum depth M, then V_M holds the actual untruncated values
- Case 2: If the discount is less than 1
  - Sketch: For any state, V_k and V_{k+1} can be viewed as depth-(k+1) expectimax results in nearly identical search trees
  - The difference is that on the bottom layer, V_{k+1} has actual rewards while V_k has zeros
  - That last layer is at best all R_MAX and at worst all R_MIN
  - But everything is discounted by γ^k that far out
  - So V_k and V_{k+1} differ by at most γ^k max|R|
  - So as k increases, the values converge
58
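A brief sketch of why the bound in the last bullets implies convergence; the sup-norm notation and the geometric-series step are my additions for illustration, written to match the slide's argument.

```latex
% Sketch: the per-step bound gives a Cauchy sequence (notation added for illustration).
\[
\|V_{k+1} - V_k\|_\infty \;\le\; \gamma^{k}\,\max_{s,a,s'} |R(s,a,s')|
\qquad\Rightarrow\qquad
\sum_{k \ge K} \|V_{k+1} - V_k\|_\infty \;\le\; \frac{\gamma^{K}}{1-\gamma}\,\max|R|,
\]
which tends to $0$ as $K \to \infty$, so $(V_k)$ is a Cauchy sequence and converges.
```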
Value Iteration
- Value iteration works even if we randomly traverse the environment instead of looping through each state and action (asynchronous updates)
  - but we must still visit each state infinitely often
- Value iteration is time- and memory-expensive
59
Problems with Value Iteration
- Value iteration repeats the Bellman updates:
  V_{k+1}(s) ← max_a Σ_{s'} T(s,a,s') [R(s,a,s') + γ V_k(s')]
- Problem 1: It's slow: O(S²A) per iteration
- Problem 2: The "max" at each state rarely changes
- Problem 3: The policy often converges long before the values
60
Convergence [Russell, AIMA, 2010]
61
Main steps in solving Bellman optimality equations
- Two kinds of steps, which are repeated in some order for all the states until no further changes take place:
  Policy improvement: π(s) ← argmax_a Σ_{s'} P^a_{ss'} [R^a_{ss'} + γ V^π(s')]
  Policy evaluation: V^π(s) ← Σ_{s'} P^{π(s)}_{ss'} [R^{π(s)}_{ss'} + γ V^π(s')]
62
Policy Iteration algorithm
1) Initialize π(s) arbitrarily
2) Repeat until convergence:
   - Compute the value function of the current policy π (i.e. V^π):
     V ← V^π
   - for s ∈ S:
     π(s) ← argmax_a Σ_{s'} P^a_{ss'} [R^a_{ss'} + γ V(s')]
     (updates the policy greedily using the current value function)
π(s) converges to π*(s). A minimal sketch follows below.
63
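A minimal Python sketch of policy iteration under the same illustrative model format as before (`P[s][a]` as `(prob, next_state, reward)` triples). For simplicity the evaluation step is truncated to a fixed number of sweeps rather than run to full convergence; that choice, and all names, are assumptions, not part of the slides.

```python
# Minimal sketch of policy iteration (illustrative interface).
# P[s][a] is assumed to be a list of (prob, next_state, reward) triples.

def policy_iteration(P, gamma=0.9, eval_sweeps=100):
    pi = {s: next(iter(P[s])) for s in P}            # arbitrary initial policy
    while True:
        # Policy evaluation: iterate the fixed-policy Bellman update (truncated)
        V = {s: 0.0 for s in P}
        for _ in range(eval_sweeps):
            V = {s: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][pi[s]])
                 for s in P}
        # Policy improvement: greedy one-step look-ahead on the evaluated values
        new_pi = {s: max(P[s], key=lambda a: sum(p * (r + gamma * V[s2])
                                                 for p, s2, r in P[s][a]))
                  for s in P}
        if new_pi == pi:                             # policy is stable: done
            return pi, V
        pi = new_pi
```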
Policy Iteration
- Repeat steps until the policy converges
  - Step 1: Policy evaluation: calculate utilities for some fixed policy (not optimal utilities!) until convergence
  - Step 2: Policy improvement: update the policy using one-step look-ahead with the resulting converged (but not optimal!) utilities as future values
- This is policy iteration
  - It's still optimal!
  - Can converge (much) faster under some conditions
64
Fixed Policies
- Do the optimal action: expectimax trees max over all actions to compute the optimal values.
- Do what π says to do: if we fix some policy π(s), the tree would be simpler: only one action per state.
(Diagram: full tree s → a → (s,a) → (s,a,s') → s' versus fixed-policy tree s → π(s) → (s,π(s)) → (s,π(s),s') → s'.)
65
Utilities for a Fixed Policy
- Another basic operation: compute the utility of a state s under a fixed (generally non-optimal) policy π
- Define V^π(s) = expected total discounted rewards starting in s and following π
- Recursive relation (one-step look-ahead / Bellman equation):
  V^π(s) = Σ_{s'} T(s, π(s), s') [R(s, π(s), s') + γ V^π(s')]
66
Policy Evaluation
- How do we calculate the V's for a fixed policy π?
- Idea 1: Turn the recursive Bellman equations into updates (like value iteration):
  V^π_{k+1}(s) ← Σ_{s'} T(s, π(s), s') [R(s, π(s), s') + γ V^π_k(s')]
  - Efficiency: O(S²) per iteration
- Idea 2: Without the maxes, the Bellman equations are just a linear system (see the sketch below)
67
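A sketch of Idea 2: since there is no max, V^π solves the linear system (I - γ P^π) V = R^π. The interface here (states indexed 0..n-1, `T[s][a]` as `(prob, next_state, reward)` triples, `pi` mapping state to action) is an illustrative assumption.

```python
# Idea 2 as a sketch: solve (I - gamma * P_pi) V = R_pi directly with numpy.
# T[s][a] is assumed to be a list of (prob, next_state, reward) triples,
# with states indexed 0..n-1; all names are illustrative.
import numpy as np

def evaluate_policy_exact(T, pi, gamma=0.9):
    n = len(T)
    P_pi = np.zeros((n, n))          # transition matrix under the fixed policy
    R_pi = np.zeros(n)               # expected one-step reward under the policy
    for s in range(n):
        for p, s2, r in T[s][pi[s]]:
            P_pi[s, s2] += p
            R_pi[s] += p * r
    # V = R_pi + gamma * P_pi @ V  =>  (I - gamma * P_pi) V = R_pi
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)
```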
Policy Iteration
- Evaluation: For the fixed current policy π_i, find values with policy evaluation; iterate until values converge:
  V^{π_i}_{k+1}(s) ← Σ_{s'} T(s, π_i(s), s') [R(s, π_i(s), s') + γ V^{π_i}_k(s')]
- Improvement: For fixed values, get a better policy using policy extraction (one-step look-ahead):
  π_{i+1}(s) = argmax_a Σ_{s'} T(s, a, s') [R(s, a, s') + γ V^{π_i}(s')]
68
When to stop iterations: [Russell, AIMA, 2010]
69
Comparison
- Both value iteration and policy iteration compute the same thing (all optimal values)
- In value iteration:
  - Every iteration updates both the values and (implicitly) the policy
  - We don't track the policy, but taking the max over actions implicitly recomputes it
- In policy iteration:
  - We do several passes that update utilities with a fixed policy (each pass is fast because we consider only one action, not all of them)
  - After the policy is evaluated, a new policy is chosen (slow like a value-iteration pass)
  - The new policy will be better (or we're done)
- Both are dynamic programs for solving MDPs
70
MDP Algorithms: Summary
- So you want to…
  - Compute optimal values: use value iteration or policy iteration
  - Compute values for a particular policy: use policy evaluation
  - Turn your values into a policy: use policy extraction (one-step look-ahead)
71
Unknown transition model
- So far: learning the optimal policy when we know P^a_{ss'} (i.e. T(s,a,s')) and R^a_{ss'}
  - This requires prior knowledge of the environment's dynamics
- If a model is not available, it is particularly useful to estimate action values rather than state values
72
Reinforcement Learning
- Still assume a Markov decision process (MDP):
  - A set of states s ∈ S
  - A set of actions (per state) A
  - A model T(s,a,s')
  - A reward function R(s,a,s')
- Still looking for a policy π(s)
- New twist: don't know T or R
  - I.e. we don't know which states are good or what the actions do
  - Must actually try actions and states out to learn
73
Reinforcement Learning
(Diagram: Agent ↔ Environment loop; state s, actions a, reward r)
- Basic idea:
  - Receive feedback in the form of rewards
  - The agent's utility is defined by the reward function
  - Must (learn to) act so as to maximize expected rewards
  - All learning is based on observed samples of outcomes!
74
Applications
- Control & robotics
  - Autonomous helicopter
  - A self-reliant agent must learn from its own experiences
  - Eliminates hand-coding of control strategies
- Board games
- Resource (time, memory, channel, …) allocation
75
Double Bandits 76
Let's Play!
$2 $2 $0 $2 $2 $2 $2 $0 $0 $0
77
What Just Happened?
- That wasn't planning, it was learning!
  - Specifically, reinforcement learning
  - There was an MDP, but you couldn't solve it with just computation
  - You needed to actually act to figure it out
- Important ideas in reinforcement learning that came up:
  - Exploration: you have to try unknown actions to get information
  - Exploitation: eventually, you have to use what you know
  - Regret: even if you learn intelligently, you make mistakes
  - Sampling: because of chance, you have to try things repeatedly
  - Difficulty: learning can be much harder than solving a known MDP
78
Offline (MDPs) vs. Online (RL) Offline Solution Online Learning 79
RL algorithms
- Model-based (passive)
  - Learn a model of the environment (transition and reward probabilities)
  - Then run value iteration or policy iteration algorithms
- Model-free (active)
80
Model-Based Learning
- Model-based idea:
  - Learn an approximate model based on experiences
  - Solve for values as if the learned model were correct
- Step 1: Learn an empirical MDP model
  - Count outcomes s' for each (s, a)
  - Normalize to give an estimate of T̂(s, a, s')
  - Discover each R̂(s, a, s') when we experience (s, a, s')
- Step 2: Solve the learned MDP
  - For example, use value iteration, as before (a sketch of Step 1 follows below)
81
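A sketch of Step 1: estimating T̂ and R̂ by counting and normalizing observed transitions. The episode format (lists of `(s, a, next_state, reward)` tuples) and the assumption of deterministic rewards per transition are illustrative, not from the slides.

```python
# Sketch of Step 1: estimate T-hat and R-hat from observed transitions.
# Episode format and variable names are illustrative assumptions.
from collections import defaultdict

def learn_empirical_mdp(episodes):
    counts = defaultdict(lambda: defaultdict(int))    # counts[(s, a)][s'] = visits
    rewards = {}                                       # rewards[(s, a, s')] = observed r
    for episode in episodes:                           # episode: list of (s, a, s2, r)
        for s, a, s2, r in episode:
            counts[(s, a)][s2] += 1
            rewards[(s, a, s2)] = r                    # rewards assumed deterministic here
    T_hat = {}
    for (s, a), outcomes in counts.items():
        total = sum(outcomes.values())
        T_hat[(s, a)] = {s2: n / total for s2, n in outcomes.items()}  # normalize counts
    return T_hat, rewards
```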
Example: Model-Based Learning
Input Policy π (Assume: γ = 1); states A, B, C, D, E.
Observed Episodes (Training):
  Episode 1: B, east, C, -1; C, east, D, -1; D, exit, x, +10
  Episode 2: B, east, C, -1; C, east, D, -1; D, exit, x, +10
  Episode 3: E, north, C, -1; C, east, D, -1; D, exit, x, +10
  Episode 4: E, north, C, -1; C, east, A, -1; A, exit, x, -10
Learned Model:
  T(s,a,s'): T(B, east, C) = 1.00; T(C, east, D) = 0.75; T(C, east, A) = 0.25; …
  R(s,a,s'): R(B, east, C) = -1; R(C, east, D) = -1; R(D, exit, x) = +10; …
82
Model-Free Learning 83
Reinforcement Learning
- We still assume an MDP:
  - A set of states s ∈ S
  - A set of actions (per state) A
  - A model T(s,a,s')
  - A reward function R(s,a,s')
- Still looking for a policy π(s)
- New twist: don't know T or R, so must try out actions
- Big idea: compute all averages over T using sample outcomes
84
Direct Evaluation of a Policy
- Goal: Compute values for each state under π
- Idea: Average together observed sample values
  - Act according to π
  - Every time you visit a state, write down what the sum of discounted rewards turned out to be
  - Average those samples
- This is called direct evaluation (a sketch follows below)
85
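A sketch of direct (every-visit) evaluation: average the observed discounted returns that follow each visit to a state. The episode format (ordered `(state, reward)` pairs collected while acting with π) is an illustrative assumption.

```python
# Sketch of direct evaluation: average sampled returns per state.
# Episode format is an illustrative assumption.
from collections import defaultdict

def direct_evaluation(episodes, gamma=1.0):
    totals = defaultdict(float)   # sum of sampled returns per state
    visits = defaultdict(int)     # number of samples per state
    for episode in episodes:      # episode: ordered list of (state, reward) pairs
        # compute the discounted return following each visit, back to front
        G = 0.0
        for s, r in reversed(episode):
            G = r + gamma * G
            totals[s] += G
            visits[s] += 1
    return {s: totals[s] / visits[s] for s in totals}
```

On the four episodes from the example slides (with γ = 1), this reproduces values like B = +8 and D = +10.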
Example: Direct Evaluation
Input Policy π (Assume: γ = 1)
Observed Episodes (Training):
  Episode 1: B, east, C, -1; C, east, D, -1; D, exit, x, +10
  Episode 2: B, east, C, -1; C, east, D, -1; D, exit, x, +10
  Episode 3: E, north, C, -1; C, east, D, -1; D, exit, x, +10
  Episode 4: E, north, C, -1; C, east, A, -1; A, exit, x, -10
Output Values: A = -10, B = +8, C = +4, D = +10, E = -2
86
Monte Carlo methods
- Do not assume complete knowledge of the environment
- Require only experience
  - sample sequences of states, actions, and rewards from online or simulated interaction with an environment
- Are based on averaging sample returns
- Are defined for episodic tasks
87
A Monte Carlo control algorithm using exploring starts
1) Initialize Q and π arbitrarily, and Returns(s, a) to empty lists
2) Repeat:
   - Generate an episode using π and exploring starts
   - for each pair (s, a) appearing in the episode:
     - R ← return following the first occurrence of (s, a)
     - Append R to Returns(s, a)
     - Q(s, a) ← average(Returns(s, a))
   - for each s in the episode:
     - π(s) ← argmax_a Q(s, a)
(A minimal sketch follows below.)
88
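A minimal Python sketch of Monte Carlo control with exploring starts. The environment interface `sample_step(s, a) -> (next_state, reward, done)`, the flat `states`/`actions` lists, and the episode-length cap are illustrative assumptions about how episodes are generated; they are not part of the slides.

```python
# Sketch of Monte Carlo control with exploring starts (illustrative interface).
import random
from collections import defaultdict

def mc_control_es(states, actions, sample_step, n_episodes=10000, gamma=1.0, max_len=100):
    Q = defaultdict(float)
    returns = defaultdict(list)                      # Returns(s, a)
    pi = {s: random.choice(actions) for s in states}
    for _ in range(n_episodes):
        # exploring start: random initial state-action pair
        s, a = random.choice(states), random.choice(actions)
        episode = []
        for _ in range(max_len):
            s2, r, done = sample_step(s, a)
            episode.append((s, a, r))
            if done:
                break
            s, a = s2, pi[s2]
        # first-visit return for each (s, a) in the episode
        first = {}
        for t, (s, a, r) in enumerate(episode):
            first.setdefault((s, a), t)
        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = r + gamma * G
            if first[(s, a)] == t:
                returns[(s, a)].append(G)
                Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])
        # greedy policy improvement on visited states
        for s, a, _ in episode:
            pi[s] = max(actions, key=lambda a2: Q[(s, a2)])
    return Q, pi
```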
Problems with Direct Evaluation
- What's good about direct evaluation?
  - It's easy to understand
  - It doesn't require any knowledge of T, R
  - It eventually computes the correct average values, using just sample transitions
- What's bad about it?
  - It wastes information about state connections
  - Each state must be learned separately
  - So, it takes a long time to learn
(Output values from the example: A = -10, B = +8, C = +4, D = +10, E = -2. If B and E both go to C under this policy, how can their values be different?)
89
Connections between states
- Simplified Bellman updates calculate V for a fixed policy:
  - Each round, replace V with a one-step look-ahead layer over V:
    V^π_{k+1}(s) ← Σ_{s'} T(s, π(s), s') [R(s, π(s), s') + γ V^π_k(s')]
  - This approach fully exploits the connections between the states
  - Unfortunately, we need T and R to do it!
- Key question: how can we do this update to V without knowing T and R?
  - In other words, how do we take a weighted average without knowing the weights?
90
Connections between states
- We want to improve our estimate of V by computing these averages:
  V^π(s) ≈ Σ_{s'} T(s, π(s), s') [R(s, π(s), s') + γ V^π(s')]
- Idea: Take samples of outcomes s'_1, s'_2, … (by doing the action!) and average:
  sample_i = R(s, π(s), s'_i) + γ V^π(s'_i),  V^π(s) ≈ (1/n) Σ_i sample_i
- Almost! But we can't rewind time to get sample after sample from state s.
91
Temporal Difference Learning
- Big idea: learn from every experience!
  - Update V(s) each time we experience a transition (s, a, s', r)
  - Likely outcomes s' will contribute updates more often
- Temporal difference learning of values
  - Policy still fixed, still doing evaluation!
  - Move values toward the value of whatever successor occurs: running average
- Sample of V(s): sample = R(s, π(s), s') + γ V^π(s')
- Update to V(s): V^π(s) ← (1 - α) V^π(s) + α · sample
- Same update: V^π(s) ← V^π(s) + α (sample - V^π(s))
92
Exponential Moving Average
- The running interpolation update: x̄_n = (1 - α) · x̄_{n-1} + α · x_n
- Makes recent samples more important:
  x̄_n = [x_n + (1 - α) x_{n-1} + (1 - α)² x_{n-2} + …] / [1 + (1 - α) + (1 - α)² + …]
- Forgets about the past (distant past values were wrong anyway)
- A decreasing learning rate (α) can give converging averages
93
Example: Temporal Difference Learning
Assume: γ = 1, α = 1/2. Initial values: all states 0 except V(D) = 8.
Observed transitions and resulting updates:
  B, east, C, -2:  V(B) ← (1/2)·0 + (1/2)·(-2 + 1·V(C)) = (1/2)·(-2 + 0) = -1
  C, east, D, -2:  V(C) ← (1/2)·0 + (1/2)·(-2 + 1·V(D)) = (1/2)·(-2 + 8) = 3
94
Temporal difference methods
- TD learning is a combination of MC and DP (i.e. Bellman-equation) ideas.
- Like MC methods, TD can learn directly from raw experience without a model of the environment's dynamics.
- Like DP, TD updates estimates based in part on other learned estimates, without waiting for a final outcome.
95
Temporal difference on value function
TD(0) update: V(s_t) ← V(s_t) + α [r_{t+1} + γ V(s_{t+1}) - V(s_t)]
π: the policy to be evaluated
1) Initialize V(s) arbitrarily
2) Repeat (for each episode):
   - Initialize s
   - Repeat (for each step of the episode):
     - a ← action given by policy π for s
     - Take action a; observe reward r and next state s'
     - V(s) ← V(s) + α [r + γ V(s') - V(s)]
     - s ← s'
   - until s is terminal
Updates happen in a fully incremental fashion; a minimal sketch follows below.
96
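A minimal Python sketch of TD(0) policy evaluation. The interface `env.reset()`, `env.step(a) -> (next_state, reward, done)`, and `policy(s)` is an illustrative assumption, not part of the slides; terminal states simply keep value 0 here.

```python
# Minimal sketch of TD(0) policy evaluation (illustrative interface).
from collections import defaultdict

def td0_evaluate(env, policy, n_episodes=1000, alpha=0.1, gamma=0.9):
    V = defaultdict(float)                    # V(s) initialized arbitrarily (here: 0)
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)                     # action given by the fixed policy pi
            s2, r, done = env.step(a)
            # fully incremental update toward the one-step bootstrapped target
            V[s] += alpha * (r + gamma * V[s2] - V[s])
            s = s2
    return V
```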
Problems with TD Value Learning
- TD value learning is a model-free way to do policy evaluation, mimicking Bellman updates with running sample averages
- However, if we want to turn values into a (new) policy, we're sunk:
  π(s) = argmax_a Σ_{s'} T(s, a, s') [R(s, a, s') + γ V(s')] needs T and R!
97
Unknown transition model: New policy
- With a model, state values alone are sufficient to determine a policy
  - simply look ahead one step and choose whichever action leads to the best combination of reward and next state
- Without a model, state values alone are not sufficient.
- However, if the agent knows Q(s, a), it can choose the optimal action without knowing P and R:
  π*(s) = argmax_a Q*(s, a)
98
Unknown transition model: New policy
- Idea: learn Q-values, not state values
- Makes action selection model-free too!
99
Detour: Q-Value Iteration
- Value iteration: find successive (depth-limited) values
  - Start with V_0(s) = 0, which we know is right
  - Given V_k, calculate the depth k+1 values for all states:
    V_{k+1}(s) ← max_a Σ_{s'} T(s, a, s') [R(s, a, s') + γ V_k(s')]
- But Q-values are more useful, so compute them instead
  - Start with Q_0(s,a) = 0, which we know is right
  - Given Q_k, calculate the depth k+1 q-values for all q-states:
    Q_{k+1}(s,a) ← Σ_{s'} T(s, a, s') [R(s, a, s') + γ max_{a'} Q_k(s', a')]
(A minimal sketch follows below.)
100
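A minimal Python sketch of Q-value iteration under the same illustrative model format as the earlier sketches (`P[s][a]` as `(prob, next_state, reward)` triples); the fixed iteration count is an assumption made for simplicity.

```python
# Minimal sketch of Q-value iteration (illustrative interface).
# P[s][a] is assumed to be a list of (prob, next_state, reward) triples.

def q_value_iteration(P, gamma=0.9, n_iters=100):
    Q = {(s, a): 0.0 for s in P for a in P[s]}        # Q_0(s,a) = 0
    for _ in range(n_iters):
        new_Q = {}
        for s in P:
            for a in P[s]:
                # backup through the max over next-state actions
                new_Q[(s, a)] = sum(
                    p * (r + gamma * max((Q[(s2, a2)] for a2 in P[s2]), default=0.0))
                    for p, s2, r in P[s][a])
        Q = new_Q
    return Q
```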