To Maximize Reward
• We can represent the environment as a process
  – A mathematical characterization with true values for its parameters, representing the actual environment
• The agent must model this environment process
  – Formulate its own model of the environment, which should ideally match the true values as closely as possible
  – Based only on what it observes
• The agent must formulate a winning strategy based on its model of the environment
Markov property and observability
• The environment state is Markov
  – An assumption that is generally valid for a properly defined true environment state
    $$P(S_{t+1} \mid S_0, S_1, \dots, S_t) = P(S_{t+1} \mid S_t)$$
• In theory, if the agent doesn't observe the environment's internals, it cannot model what it observes of the environment as Markov!
  – A surprising, but trivial result
  – E.g. the observations generated by an HMM are not Markov
• In practice, the agent may assume anything
  – The agent may only have a local model of the true state of the system
  – But it can still assume that the states in its model behave in the same Markovian way that the environment's actual states do
Markov property and observability
[Figure: chess, where the environment state is fully observable to the agent, vs. poker, where the environment state is only partially and indirectly observable to the agent]
• Observability
  – The agent's observations inform it about the environment state
  – The agent may observe the entire environment state
    • Now the agent's state is isomorphic to the environment state
    • Note: observing the state is not the same as knowing the state's true dynamics $P(S_{t+1} = s_j \mid S_t = s_i)$
    • This is the Markov Decision Process setting, which we focus on in these lectures
  – Or it may observe only part of it
    • E.g. only seeing some stock prices, or only the traffic immediately in front of you
    • This is the Partially Observable Markov Decision Process setting
A Markov Process
• A Markov process is a random process where the future is determined only by the present
  – Memoryless
• It is fully defined by the set of states $\mathcal{S}$ and the state transition probabilities $P(s' \mid s)$
  – Formally, the tuple $M = (\mathcal{S}, \mathcal{P})$
  – $\mathcal{S}$ is the (possibly finite) set of states
  – $\mathcal{P}$ is the complete set of transition probabilities $P(s' \mid s)$
  – Note: $P(s' \mid s)$ stands for $P(S_{t+1} = s' \mid S_t = s)$ at any time $t$
  – We will use the shorthand $P_{s,s'}$
Markov Reward Process
• A Markov Reward Process (MRP) is a Markov process whose states give you rewards
• Formally, a Markov Reward Process is the tuple $M = (\mathcal{S}, \mathcal{P}, \mathcal{R}, \gamma)$
  – $\mathcal{S}$ is the (possibly finite) set of states
  – $\mathcal{P}$ is the complete set of transition probabilities $P_{s,s'}$
  – $\mathcal{R}$ is a reward function, consisting of the distributions $P(r \mid s)$ or $P(r \mid s, s')$
    • Or alternately, the expected values $E[r \mid s]$ or $E[r \mid s, s']$
  – $\gamma \in [0,1]$ is a discount factor
The discounted return
$$G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$$
• The return is the total future reward all the way to the end
• But each future step is slightly less "believable" and is hence discounted
  – We trust our predictions of the future less and less
  – The future is a fuzzy place
• The discount factor $\gamma$ is our belief in the predictability of the future
  – $\gamma = 0$: the future is totally unpredictable; only trust what you see immediately ahead of you (myopic)
  – $\gamma = 1$: the future is clear; consider all of it (far-sighted)
• The discount factor is part of the Markov Reward Process model
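As a quick illustration (not from the slides), here is a minimal Python sketch that computes the discounted return $G_t$ for a finite episode; the reward list and the value of gamma are made up for the example.

```python
def discounted_return(rewards, gamma):
    """Compute G_t = r_{t+1} + gamma*r_{t+2} + ... for a finite episode.

    `rewards` holds r_{t+1}, r_{t+2}, ..., r_T (the rewards that follow time t).
    """
    G = 0.0
    # Accumulate from the last reward backwards: G_k = r_{k+1} + gamma * G_{k+1}
    for r in reversed(rewards):
        G = r + gamma * G
    return G

# Example with made-up rewards and a discount of 0.9
print(discounted_return([1.0, 0.0, 2.0, 5.0], gamma=0.9))  # 1 + 0.9*0 + 0.81*2 + 0.729*5
```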
The Markov Decision Process
• Mathematical formulation of RL problems
• A Markov Decision Process is a Markov Reward Process in which the agent has the ability to decide its actions!
  – We will represent the action at time $t$ as $a_t$
• The agent's actions affect the environment's behavior
  – The transitions made by the environment are functions of the action
  – The rewards returned are functions of the action
The Markov Decision Process
• Formally, a Markov Decision Process is the tuple $M = (\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma)$
  – $\mathcal{S}$ is a (possibly finite) set of states: $\mathcal{S} = \{s\}$
  – $\mathcal{A}$ is a (possibly finite) set of actions: $\mathcal{A} = \{a\}$
  – $\mathcal{P}$ is the set of action-conditioned transition probabilities $P^a_{s,s'} = P(S_{t+1} = s' \mid S_t = s, a_t = a)$
  – $\mathcal{R}$ is an action-conditioned reward function $E[r \mid S = s, A = a, S' = s']$
  – $\gamma \in [0,1]$ is a discount factor
Policy
• The policy is the probability distribution over actions that the agent may take in any state
  $$\pi(a \mid s) = P(a_t = a \mid s_t = s)$$
  – What are the preferred actions of the spider in any state
• The policy may be deterministic, i.e. $\pi(s) = a_s$, where $a_s$ is the preferred action in state $s$
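A tiny Python sketch of sampling an action from a stochastic tabular policy; the dictionary layout `pi[s] -> {action: probability}` is an illustrative assumption, not from the slides.

```python
import random

def sample_action(pi, s):
    """Sample a ~ pi(.|s) from a tabular stochastic policy.

    Assumed layout (illustrative): pi[s] is a dict mapping actions to probabilities.
    """
    actions, probs = zip(*pi[s].items())
    return random.choices(actions, weights=probs, k=1)[0]

# Made-up two-action policy for a single state
pi = {'s0': {'left': 0.3, 'right': 0.7}}
print(sample_action(pi, 's0'))
```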
Markov Decision Process
• At time step $t = 0$, the environment samples an initial state $s_0 \sim p(s_0)$
• Then, for $t = 0$ until done:
  – The agent selects action $a_t$
  – The environment samples the next state $s_{t+1} \sim P(\cdot \mid s_t, a_t)$
  – The environment samples a reward $r_t \sim R(\cdot \mid s_t, a_t, s_{t+1})$
  – The agent receives reward $r_t$ and next state $s_{t+1}$
• A policy $\pi$ is a function from $\mathcal{S}$ to $\mathcal{A}$ that specifies what action to take in each state
• Objective: find the policy $\pi^*$ that maximizes the cumulative discounted reward:
  $$G = r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots = \sum_{k=0}^{\infty} \gamma^k r_k$$
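To make the interaction loop concrete, here is a minimal Python sketch of rolling out one episode under a fixed policy. The `env.reset`, `env.step`, and `policy` interfaces are assumptions made for illustration (loosely Gym-style); they are not part of the slides.

```python
def run_episode(env, policy, gamma=0.99, max_steps=1000):
    """Roll out one episode and return the discounted return from the start state.

    Assumed interface (illustrative): env.reset() -> s0, env.step(a) -> (s_next, r, done);
    policy(s) -> a.
    """
    s = env.reset()                    # environment samples initial state s0
    G, discount = 0.0, 1.0
    for _ in range(max_steps):
        a = policy(s)                  # agent selects action a_t
        s_next, r, done = env.step(a)  # environment samples s_{t+1} and reward r_t
        G += discount * r              # accumulate gamma^t * r_t
        discount *= gamma
        s = s_next
        if done:
            break
    return G
```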
Learning from experience
• Learn by playing (or observing)
  – Problem: the tree of possible moves is exponentially large
• Learn to generalize
  – What do we mean by "generalize"?
  – If a particular board position always leads to a loss, avoid any moves that take you into that position
A simple MDP: Grid World
[Figure slides: grid-world example]
Introducing the "Value" function
• The "value" of a state is the expected total discounted return when the process begins in that state
  $$V^{\pi}(s) = E[G_0 \mid S_0 = s, \pi]$$
• Or, since the process is Markov and the future depends only on the present and not the past,
  $$V^{\pi}(s) = E[G_t \mid S_t = s, \pi]$$
• Or more generally
  $$V^{\pi}(s) = E[G \mid S = s, \pi]$$
Definitions: Value function and Q-value function
Value function for policy $\pi$
$$V^{\pi}(s) = E\left[\sum_{t \ge 0} \gamma^t r_t \,\middle|\, s_0 = s, \pi\right] \qquad Q^{\pi}(s,a) = E\left[\sum_{t \ge 0} \gamma^t r_t \,\middle|\, s_0 = s, a_0 = a, \pi\right]$$
• $V^{\pi}(s)$: how good it is for the agent to be in state $s$ when its policy is $\pi$
  – It is simply the expected sum of discounted rewards upon starting in state $s$ and taking actions according to $\pi$
• Bellman equations:
  $$V^{\pi}(s) = E[r + \gamma V^{\pi}(s') \mid s, \pi] \qquad Q^{\pi}(s,a) = E[r + \gamma Q^{\pi}(s', \pi(s')) \mid s, a, \pi]$$
The state value function of an MDP
• The expected return from any state depends on the policy you follow
• We will index the value of any state by the policy to indicate this
  $$V^{\pi}(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \sum_{s'} P^a_{s,s'} \left[ \mathcal{R}^a_{ss'} + \gamma V^{\pi}(s') \right]$$
• For deterministic policies:
  $$V^{\pi}(s) = \sum_{s'} P^{\pi(s)}_{s,s'} \left[ \mathcal{R}^{\pi(s)}_{ss'} + \gamma V^{\pi}(s') \right]$$
• This is the Bellman Expectation Equation for the state value function of an MDP
• Note: although the reward did not depend on the action in the fly example, more generally it will
"Computing" the MDP
• Finding the state and/or action value functions for the MDP:
  – Given the complete MDP (all transition probabilities $P^a_{s,s'}$, expected rewards $R^a_{s,s'}$, and discount $\gamma$)
  – and a policy $\pi$,
  – find all value terms $V^{\pi}(s)$ and/or $Q^{\pi}(s,a)$
• The Bellman expectation equations are simultaneous equations that can be solved for the value functions
  – Although this will be computationally intractable for very large state spaces
Value Iteration (Prediction DP)
• Start with an initialization $V^{\pi}_{(0)}$
• Iterate ($k = 0 \dots$ convergence): for all states
  $$V^{\pi}_{(k+1)}(s) = \sum_{s'} P^{\pi(s)}_{s,s'} \left[ R^{\pi(s)}_{ss'} + \gamma V^{\pi}_{(k)}(s') \right]$$
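A minimal Python sketch of this prediction DP (iterative policy evaluation), assuming the MDP is given as tabular arrays; the array layout `P[a, s, s']`, `R[a, s, s']` and the convergence threshold are assumptions made for the example.

```python
import numpy as np

def evaluate_policy(P, R, policy, gamma=0.9, tol=1e-8):
    """Iterative policy evaluation for a deterministic policy.

    Assumed tabular layout (illustrative): P[a, s, s'] are transition
    probabilities, R[a, s, s'] are expected rewards, policy[s] is the action
    taken in state s.
    """
    n_states = P.shape[1]
    V = np.zeros(n_states)
    while True:
        V_new = np.zeros(n_states)
        for s in range(n_states):
            a = policy[s]
            # Bellman expectation backup: sum_{s'} P[a,s,s'] * (R[a,s,s'] + gamma * V[s'])
            V_new[s] = np.sum(P[a, s] * (R[a, s] + gamma * V))
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```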
Value-based Planning
• "Value"-based solution
• Breakdown:
  – Prediction: given any policy $\pi$, find the value function $V^{\pi}(s)$
  – Control: find the optimal policy
Optimal Policies
• Different policies can result in different value functions
• What is the optimal policy?
• The optimal policy is the policy that maximizes the expected total discounted reward at every state:
  $$E[G_t \mid S_t = s] = E\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, S_t = s\right]$$
  – Recall: why do we consider the discounted return, rather than the actual return $\sum_{k=0}^{\infty} r_{t+k+1}$?
Policy ordering definition
• A policy $\pi$ is "better" than a policy $\pi'$ if the value function under $\pi$ is greater than or equal to the value function under $\pi'$ at all states:
  $$\pi \ge \pi' \iff V^{\pi}(s) \ge V^{\pi'}(s) \;\; \forall s$$
• Under the better policy, you can expect a better overall outcome no matter what the current state is
The optimal policy theorem
• Theorem: for any MDP there exists an optimal policy $\pi^*$ that is better than or equal to every other policy:
  $$\pi^* \ge \pi \quad \forall \pi$$
• Corollary: if there are multiple optimal policies $\pi_{opt,1}, \pi_{opt,2}, \dots$, all of them achieve the same value function:
  $$V^{\pi_{opt,i}}(s) = V^{\pi^*}(s) \quad \forall s$$
• All optimal policies also achieve the same action value function:
  $$Q^{\pi_{opt,i}}(s,a) = Q^{*}(s,a) \quad \forall s, a$$
How to find the optimal policy
• For the optimal policy:
  $$\pi^*(s) = \arg\max_{a \in \mathcal{A}(s)} Q^*(s,a)$$
• Easy to prove
  – For any other policy $\pi$, $Q^{\pi}(s,a) \le Q^*(s,a)$
• Knowing the optimal action value function $Q^*(s,a) \;\; \forall s, a$ is sufficient to find the optimal policy
Backup diagram
[Figures from Sutton: backup diagram relating $V^*(s)$, $Q^*(s,a)$, and $V^*(s')$]
$$V^*(s) = \max_a Q^*(s,a)$$
$$Q^*(s,a) = \sum_{s'} P^a_{s,s'} \left[ R^a_{ss'} + \gamma V^*(s') \right]$$
• Combining the two:
  $$V^*(s) = \max_a Q^*(s,a) = \max_a \sum_{s'} P^a_{s,s'} \left[ R^a_{ss'} + \gamma V^*(s') \right]$$
Backup diagram
[Figures from Sutton: backup diagram relating $Q^*(s,a)$, $V^*(s')$, and $Q^*(s',a')$]
$$Q^*(s,a) = \sum_{s'} P^a_{s,s'} \left[ R^a_{ss'} + \gamma V^*(s') \right]$$
$$Q^*(s,a) = \sum_{s'} P^a_{s,s'} \left[ R^a_{ss'} + \gamma \max_{a'} Q^*(s',a') \right]$$
Optimality relationships: Summary
• Given the MDP $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma)$:
• Given the optimal action value function, the optimal value function can be found:
  $$V^*(s) = \max_a Q^*(s,a)$$
• Given the optimal value function, the optimal action value function can be found:
  $$Q^*(s,a) = \sum_{s'} P^a_{s,s'} \left[ R^a_{ss'} + \gamma V^*(s') \right] = E\left[ r + \gamma \max_{a'} Q^*(s',a') \,\middle|\, s, a \right]$$
• Given the optimal action value function, the optimal policy can be found:
  $$\pi^*(s) = \arg\max_{a \in \mathcal{A}(s)} Q^*(s,a)$$
"Solving" the MDP
• Solving the MDP equates to finding the optimal policy $\pi^*(s)$
• Which is equivalent to finding the optimal value function $V^*(s)$
• Or finding the optimal action value function $Q^*(s,a)$
• Various solutions estimate one or the other
  – Value-based solutions solve for $V^*(s)$ and $Q^*(s,a)$ and derive the optimal policy from them
  – Policy-based solutions directly estimate $\pi^*(s)$
Solving the Bellman Optimality Equation
• No closed-form solutions
• Solutions are iterative
• Given the MDP (planning):
  – Value iteration
  – Policy iteration
• Not given the MDP (reinforcement learning):
  – Q-learning
  – SARSA, ...
Value Iteration
• Start with any initial value function $V^{(0)}(s)$
• Iterate ($k = 1 \dots$ convergence):
  – Update the value function
    $$V^{(k)}(s) = \max_a \sum_{s'} P^a_{s,s'} \left[ R^a_{s,s'} + \gamma V^{(k-1)}(s') \right]$$
• Note: no explicit policy estimation
• Directly learns the optimal value function
• Guaranteed to give you the optimal value function at convergence
  – But intermediate value function estimates may not represent any policy
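A minimal Python sketch of tabular value iteration under the same assumed array layout as the policy-evaluation sketch above (`P[a, s, s']`, `R[a, s, s']`); the convergence threshold is again an arbitrary choice.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Tabular value iteration; returns the optimal values and a greedy policy.

    Assumed layout (illustrative): P[a, s, s'] transition probabilities,
    R[a, s, s'] expected rewards.
    """
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Q[a, s] = sum_{s'} P[a,s,s'] * (R[a,s,s'] + gamma * V[s'])
        Q = np.einsum('asn,asn->as', P, R + gamma * V[None, None, :])
        V_new = Q.max(axis=0)                    # Bellman optimality backup
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=0)       # greedy policy from the final backup
        V = V_new
```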
Alternate strategy
• We worked with the value function
  – For $N$ states, this estimates $N$ terms
• Could alternately work with the action-value function
  – For $M$ actions, must estimate $MN$ terms
  – Much more expensive
  – But more useful in some scenarios
Solving for the optimal policy: Value iteration
$$Q^{(k+1)}(s,a) = E\left[ r + \gamma \max_{a'} Q^{(k)}(s',a') \,\middle|\, s, a \right]$$
$$Q^{(k+1)}(s,a) = \sum_{s'} P^a_{s,s'} \left[ R^a_{s,s'} + \gamma \max_{a'} Q^{(k)}(s',a') \right]$$
Policy Iteration
• Start with any policy $\pi^{(0)}$
• Iterate ($k = 0 \dots$ convergence):
  – Use value iteration (prediction DP) to find the value function $V^{\pi^{(k)}}(s)$
  – Find the greedy policy $\pi^{(k+1)} = \text{greedy}\left(V^{\pi^{(k)}}(s)\right)$
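A minimal policy iteration sketch in Python, under the same assumed array layout as above. Note one swapped-in detail: for brevity the evaluation step here solves the linear Bellman system exactly rather than running the iterative prediction DP from the slide; all names are illustrative.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Alternate exact policy evaluation and greedy improvement until the policy is stable.

    Assumed layout (illustrative): P[a, s, s'] transition probabilities,
    R[a, s, s'] expected rewards.
    """
    n_actions, n_states, _ = P.shape
    policy = np.zeros(n_states, dtype=int)             # arbitrary initial policy
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) V = r_pi directly
        P_pi = P[policy, np.arange(n_states)]           # (S, S') transitions under pi
        r_pi = np.sum(P_pi * R[policy, np.arange(n_states)], axis=1)
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
        # Policy improvement: greedy one-step lookahead
        Q = np.einsum('asn,asn->as', P, R + gamma * V[None, None, :])
        new_policy = Q.argmax(axis=0)
        if np.array_equal(new_policy, policy):          # policy stable -> optimal
            return policy, V
        policy = new_policy
```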
Next Up
• We've worked so far with planning
  – Someone gave us the MDP
• Next: reinforcement learning
  – MDP unknown...
Model-Free Methods
• AKA model-free reinforcement learning
• How do you find the value of a policy without knowing the underlying MDP?
  – Model-free prediction
• How do you find the optimal policy without knowing the underlying MDP?
  – Model-free control
• Assumption: we can identify the states, know the actions, and measure rewards, but have no knowledge of the system dynamics
  – The dynamics are the key knowledge required to "solve" for the best policy
  – A reasonable assumption in many discrete-state scenarios
  – Can be generalized to other scenarios with infinite or unknowable state
Methods
• Monte Carlo learning
• Temporal-Difference learning
Monte Carlo learning of the value of a policy $\pi$
• Just "let the system run" while following the policy $\pi$ and learn the value of the different states
• Procedure: record several episodes of the following
  – Take actions according to policy $\pi$
  – Note the states visited and the rewards obtained as a result
  – Record the entire sequence: $s_1, a_1, r_2, s_2, a_2, r_3, \dots, s_T$
  – Assumption: each "episode" ends at some time
• Estimate value functions based on the observations, by counting
Monte Carlo Value Estimation
• Objective: estimate the value function $V^{\pi}(s)$ for every state $s$, given recordings of the kind:
  $$s_1, a_1, r_2, s_2, a_2, r_3, \dots, s_T$$
• Recall, the value function is the expected return:
  $$V^{\pi}(s) = E[G_t \mid S_t = s] = E[r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{T-t-1} r_T \mid S_t = s]$$
• To estimate this, we replace the statistical expectation $E[G_t \mid S_t = s]$ by the empirical average $\text{avg}(G_t \mid S_t = s)$
A bit of notation
• We actually record many episodes
  – episode 1 = $s_1^{(1)}, a_1^{(1)}, r_2^{(1)}, s_2^{(1)}, a_2^{(1)}, r_3^{(1)}, \dots, s_{T^{(1)}}^{(1)}$
  – episode 2 = $s_1^{(2)}, a_1^{(2)}, r_2^{(2)}, s_2^{(2)}, a_2^{(2)}, r_3^{(2)}, \dots, s_{T^{(2)}}^{(2)}$
  – ...
  – Different episodes may have different lengths
• Return at time $i$ for each episode:
  – $G_i^{(1)} = r_{i+1}^{(1)} + \gamma r_{i+2}^{(1)} + \cdots + \gamma^{T^{(1)}-i-1} r_{T^{(1)}}^{(1)}$
  – $G_i^{(2)} = r_{i+1}^{(2)} + \gamma r_{i+2}^{(2)} + \cdots + \gamma^{T^{(2)}-i-1} r_{T^{(2)}}^{(2)}$
  – ...
  – $G_i^{(n)} = r_{i+1}^{(n)} + \gamma r_{i+2}^{(n)} + \cdots + \gamma^{T^{(n)}-i-1} r_{T^{(n)}}^{(n)}$
Estimating the Value of a State
• For every state $s$
  – Initialize: count $N(s) = 0$, total return $V^{\pi}(s) = 0$
  – For every episode $e$
    • For every time $t = 1 \dots T_e$
      • Compute $G_t$
      • If $S_t == s$:
        • $N(s) = N(s) + 1$
        • $V^{\pi}(s) = V^{\pi}(s) + G_t$
  – $V^{\pi}(s) = V^{\pi}(s) / N(s)$
• Can be done more efficiently...
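A minimal every-visit Monte Carlo estimator in Python. The data layout, where each episode is a list of (state, reward) pairs with the reward being the one received on leaving that state, is an assumption made for illustration.

```python
from collections import defaultdict

def mc_value_estimate(episodes, gamma=0.9):
    """Every-visit Monte Carlo estimate of V^pi(s).

    Assumed data layout (illustrative): each episode is a list of
    (state, reward) pairs, where `reward` is r_{t+1} for the state visited at time t.
    """
    total_return = defaultdict(float)
    visit_count = defaultdict(int)
    for episode in episodes:
        G = 0.0
        # Walk the episode backwards so G accumulates the discounted return
        for state, reward in reversed(episode):
            G = reward + gamma * G
            total_return[state] += G
            visit_count[state] += 1
    return {s: total_return[s] / visit_count[s] for s in total_return}

# Example with two tiny made-up episodes
episodes = [[('A', 0.0), ('B', 1.0)], [('B', 1.0)]]
print(mc_value_estimate(episodes, gamma=1.0))  # {'B': 1.0, 'A': 1.0}
```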
Monte Carlo estimation
• Learning from experience, explicitly
• After a sufficiently large number of episodes, in which all states have been visited a sufficiently large number of times, we will obtain good estimates of the value functions of all states
• Easily extended to evaluating action value functions
Monte Carlo: Good and Bad
• Good:
  – Will eventually get to the right answer
  – Unbiased estimate
• Bad:
  – Cannot update anything until the end of an episode
    • Which may last forever
  – High variance! Each return adds many random values
  – Slow to converge
Incremental Update of Averages
• Given a sequence $x_1, x_2, x_3, \dots$, a running estimate of their average can be computed as
  $$\bar{x}_k = \frac{1}{k} \sum_{i=1}^{k} x_i$$
• This can be rewritten as
  $$\bar{x}_k = \frac{(k-1)\bar{x}_{k-1} + x_k}{k}$$
• And further refined to
  $$\bar{x}_k = \bar{x}_{k-1} + \frac{1}{k}\left(x_k - \bar{x}_{k-1}\right)$$
Incremental Update of Averages
• Given a sequence $x_1, x_2, x_3, \dots$, a running estimate of their average can be computed as
  $$\bar{x}_k = \bar{x}_{k-1} + \frac{1}{k}\left(x_k - \bar{x}_{k-1}\right)$$
• Or more generally as
  $$\bar{x}_k = \bar{x}_{k-1} + \alpha\left(x_k - \bar{x}_{k-1}\right)$$
• The latter is particularly useful for non-stationary environments
• For stationary environments $\alpha$ must shrink with iterations, but not too fast:
  $$\sum_k \alpha_k^2 < \infty, \quad \sum_k \alpha_k = \infty, \quad \alpha_k \ge 0$$
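A small Python sketch contrasting the exact running mean (step size $1/k$) with a constant step size $\alpha$, as in the two update rules above; the sample stream and the value of alpha are made up for illustration.

```python
import random

random.seed(0)
exact_mean, ema = 0.0, 0.0
alpha = 0.05

for k in range(1, 10001):
    x = random.random()                    # uniform samples with true mean 0.5
    exact_mean += (x - exact_mean) / k     # unbiased running average (step 1/k)
    ema += alpha * (x - ema)               # constant-step estimate (biased early, tracks drift)

print(round(exact_mean, 3), round(ema, 3))  # both should end up near 0.5
```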
Incremental Updates
$$\bar{x}_k = \bar{x}_{k-1} + \frac{1}{k}\left(x_k - \bar{x}_{k-1}\right) \qquad\qquad \bar{x}_k = \bar{x}_{k-1} + \alpha\left(x_k - \bar{x}_{k-1}\right)$$
[Figure: running averages of a uniform random variable for $\alpha = 0.1$, $\alpha = 0.05$, and $\alpha = 0.03$]
• Example of the running average of a uniform random variable
• The correct ($1/k$) equation is unbiased and converges to the true value
• The equation with $\alpha$ is biased (early estimates can be expected to be wrong) but also converges to the true value
Updating the Value Function Incrementally
• Actual update:
  $$V^{\pi}(s) = \frac{1}{N(s)} \sum_{i=1}^{N(s)} G_{t(i)}$$
• $N(s)$ is the total number of visits to state $s$ across all episodes
• $G_{t(i)}$ is the discounted return at the time instant of the $i$-th visit to state $s$
Online update
• Given any episode, update the value of each state visited:
  $$N(S_t) = N(S_t) + 1$$
  $$V^{\pi}(S_t) = V^{\pi}(S_t) + \frac{1}{N(S_t)}\left(G_t - V^{\pi}(S_t)\right)$$
• Incremental version:
  $$V^{\pi}(S_t) = V^{\pi}(S_t) + \alpha\left(G_t - V^{\pi}(S_t)\right)$$
• Problem: still an unrealistic rule
  – Requires the entire trajectory until the end of the episode to compute $G_t$
Temporal Difference (TD) solution
$$V^{\pi}(S_t) = V^{\pi}(S_t) + \alpha\left(G_t - V^{\pi}(S_t)\right)$$
• Problem: computing $G_t$ requires waiting until the end of the episode
• But $G_t = r_{t+1} + \gamma G_{t+1}$
• We can approximate $G_{t+1}$ by the expected return at the next state $S_{t+1}$:
  $$G_t \approx r_{t+1} + \gamma V^{\pi}(S_{t+1})$$
• We don't know the real value of $V^{\pi}(S_{t+1})$, but we can "bootstrap" it with its current estimate
TD vs MC
• What are $V(A)$ and $V(B)$?
  – Using MC
  – Using TD, where you are allowed to repeatedly go over the data
TD solution: Online update
$$V^{\pi}(S_t) = V^{\pi}(S_t) + \alpha\left(G_t - V^{\pi}(S_t)\right)$$
• Where $G_t \approx r_{t+1} + \gamma V^{\pi}(S_{t+1})$
• Giving us
  $$V^{\pi}(S_t) = V^{\pi}(S_t) + \alpha\left(r_{t+1} + \gamma V^{\pi}(S_{t+1}) - V^{\pi}(S_t)\right)$$
• The term $r_{t+1} + \gamma V^{\pi}(S_{t+1}) - V^{\pi}(S_t)$ is the error between an (estimated) observation of $G_t$ and the current estimate $V^{\pi}(S_t)$
TD solution: Online update
• For all $s$, initialize: $V^{\pi}(s) = 0$
• For every episode $e$
  – For every time $t = 1 \dots T_e$
    • $V^{\pi}(S_t) = V^{\pi}(S_t) + \alpha\left(r_{t+1} + \gamma V^{\pi}(S_{t+1}) - V^{\pi}(S_t)\right)$
• There's a "lookahead" of one state, to know which state the process arrives at at the next time
• But it is otherwise online, with continuous updates
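A minimal TD(0) prediction sketch in Python, reusing the illustrative Gym-style `env`/`policy` interface assumed earlier; the step size, discount, and episode count are arbitrary choices for the example.

```python
from collections import defaultdict

def td0_prediction(env, policy, n_episodes=1000, alpha=0.1, gamma=0.9):
    """TD(0) estimate of V^pi using one-step bootstrapped targets.

    Assumed interface (illustrative): env.reset() -> s, env.step(a) -> (s', r, done).
    """
    V = defaultdict(float)
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            s_next, r, done = env.step(policy(s))
            target = r + (0.0 if done else gamma * V[s_next])  # bootstrap with current estimate
            V[s] += alpha * (target - V[s])                    # TD(0) update
            s = s_next
    return dict(V)
```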
TD Solution
• Updates continuously: improves estimates as soon as you observe a state (and its successor)
• Can work even with infinitely long processes that never terminate
• Guaranteed to converge to the true values eventually
  – Although initial values will be biased, as seen before
  – Is actually lower variance than MC!
    • Only incorporates one random variable at any time
• TD can give correct answers when MC goes wrong
  – Particularly when TD is allowed to loop over all learning episodes
Story so far
• We want to compute the values of all states, given a policy, but with no knowledge of the dynamics
• We have seen Monte Carlo and temporal difference solutions
  – TD is quicker to update, and in many situations the better solution
Optimal Policy: Control
• We learned how to estimate the state value functions for a given policy in an MDP whose transition probabilities are unknown
• How do we find the optimal policy?
Value vs. Action Value
• The solution we saw so far only computes the value functions of states
• This is not sufficient
  – To compute the optimal policy from value functions alone, we need extra information, namely the transition probabilities
  – Which we do not have
• Instead, we can use the same method to compute action value functions
  – Optimal policy in any state: choose the action that has the largest optimal action value
Value vs. Action value
• Given only value functions, the optimal policy must be estimated as:
  $$\pi^*(s) = \arg\max_{a \in \mathcal{A}} \sum_{s'} P^a_{ss'} \left( \mathcal{R}^a_{ss'} + \gamma V(s') \right)$$
  – This needs knowledge of the transition probabilities
• Given action value functions, we can find it as:
  $$\pi^*(s) = \arg\max_{a \in \mathcal{A}} Q(s,a)$$
• This is model-free (no need for knowledge of model parameters)
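A tiny Python sketch of the model-free rule: extracting a greedy policy from a tabular Q function. The dictionary layout `Q[(s, a)]` and the example values are made-up assumptions for illustration.

```python
def greedy_policy(Q, states, actions):
    """Return pi(s) = argmax_a Q(s, a) from a tabular Q given as a dict Q[(s, a)].

    The dictionary layout is an illustrative assumption, not from the slides.
    """
    return {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}

# Example with made-up values
Q = {('s0', 'left'): 0.2, ('s0', 'right'): 0.7,
     ('s1', 'left'): 0.9, ('s1', 'right'): 0.1}
print(greedy_policy(Q, states=['s0', 's1'], actions=['left', 'right']))
# {'s0': 'right', 's1': 'left'}
```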
Problem of optimal control
• From a series of episodes of the kind:
  $$s_1, a_1, r_2, s_2, a_2, r_3, s_3, a_3, r_4, \dots, s_T$$
• Find the optimal action value function $Q^*(s,a)$
  – The optimal policy can be found from it
• Ideally do this online
  – So that we can continuously improve our policy from ongoing experience