Reinforcement Learning
- M. Soleymani
Sharif University of Technology, Spring 2020. Most slides are based on Bhiksha Raj, 11-785, CMU 2019; some slides are from Fei-Fei Li and colleagues' lectures, cs231n, Stanford 2018.
Supervised learning:
– x is data, y is label
– Goal: learn a function mapping x to y
– Examples: classification, regression, object detection, semantic segmentation, image captioning, etc.
Unsupervised learning:
– Just data, no labels!
– Goal: learn some underlying structure of the data
– Examples: clustering, dimensionality reduction, feature learning, density estimation, etc.
Reinforcement learning:
– An agent interacting with an environment, which provides numeric reward signals
– Concerned with taking sequences of actions in order to maximize (cumulative) reward
– Can be described in terms of an agent interacting with a previously unknown environment, trying to maximize cumulative reward
Example (robot control):
– Observations: camera images, joint angles
– Actions: joint torques
– Rewards: stay balanced, navigate to target locations, serve and protect humans
Example (driving), possible actions at each instant:
– Change lane left?
– Change lane right?
– Accelerate?
– Decelerate?
– Whether your actions were good or bad: you will not know for a while
– Your actions beget rewards
– But not deterministically
– You don't have full access to the function you're trying to optimize
– You are interacting with a stateful world: the inputs $x_t$ depend on your previous actions
– There is an exponentially large number of them (possible positions/states)
– In the beginning we don't know if they are good or bad
– Each time a board position leads to victory, give it a little green color
– Each time it leads to a loss, give it a little red
– Alternates with states arrived at by moves by the opponent
– Too many possibilities; can't be certain of their "winningness"
– Fade the green with distance
– Alternates with states arrived at by moves by the opponent
– Too many possibilities; can't be certain of their "losingness"
– Fade the red with distance
Loss: final move is by opponent
– Some states will get greener
– Some will get redder
– Some, that can lead to both victory and loss, will become different shades of yellow
– When you win, your opponent loses, and vice versa
– Collections of games by amateurs and experts, of which you can find millions in the books
– A schizophrenic computer can play thousands of games with itself in the time that it plays with another person
After many practice games:
– Some positions will be green (more winning than losing), red (more losing than winning), or various shades of yellow/green/orange (can go either way)
– But many positions will never be encountered in the practice games
– In fact the vast majority of positions will be unvisited!
– We need a function that assigns a color to any board position
– Which will have some color between red and green
– How do you describe a board position numerically?
– What type of function maps a board position to a color between red and green?
Rewards may take many forms:
– Game was won/lost (binary)
– Time taken to arrive
– Amount of money made
Rewards may be delayed:
– Wait till the end of the game!
– Must optimize actions for maximum total reward
At each time $t$, the agent:
– Makes an observation $o_t$ of the environment
– Receives a reward $r_t$
– Performs an action $a_t$
In response, the environment:
– Receives the action $a_t$
– Emits a reward $r_{t+1}$
– Changes and produces a new observation $o_{t+1}$
The history at time $t$ is the sequence of observations, rewards, and actions so far:
$h_t = o_0, r_0, a_0, o_1, r_1, a_1, \dots, o_t, r_t$
– Based on the history, the agent must decide which actions should be chosen
– The goal: the strategy that maximizes the total reward $r_0 + r_1 + \cdots + r_T$
The state: the information used to decide what happens next.
– E.g., in an automobile: [position, velocity, acceleration]
– In traffic: the position, velocity, and acceleration of every vehicle on the road
– In chess: the state of the board + whose turn it is next
The environment state: the environment's own internal representation
– This is what will finally decide the rewards
– But the agent may not be able to observe all of it
The agent state: the agent's own representation of the environment state
– It may not match the true one at all
– Formally, the agent state $S_t = g(h_t)$ is some function of the history
– The closer the agent's model is to the true environment state, the better the agent will be able to strategize
Image lifted from David Silver
– Knowing the true environment state would greatly improve prediction
– The environment: a mathematical characterization with a true value for its parameters, representing the actual environment
– The agent: formulates its own model of the environment, which should ideally match the true values as closely as possible
The Markov assumption: the future depends only on the present state
– Generally valid for a properly defined true environment state:
$P(S_{t+1} \mid S_0, S_1, \dots, S_t) = P(S_{t+1} \mid S_t)$
– This does not mean the agent can model what it observes of the environment as Markov!
– Amazing, but trivial result: e.g., the observations generated by an HMM are not Markov
– The agent may only have a local model of the true state of the system
– The agent's observations need not be Markov, even though the environment's actual states do
Observability:
– The agent's observations inform it about the environment state
– The agent may observe the entire environment state
– Or only part of it
– Chess: environment state fully observable to the agent
– Poker: environment state only partially and indirectly observable
– We focus on the fully observable case in these lectures
A Markov process (Markov chain): a memoryless random process in which the future is determined only by the present
– Memoryless: transitions are governed by probabilities $P(s_i \mid s_j)$ that depend only on the current state
– Formally, the tuple $M = \langle \mathcal{S}, \mathcal{P} \rangle$
– $\mathcal{S}$ is the (possibly finite) set of states
– $\mathcal{P}$ is the complete set of transition probabilities $P(s \mid s')$
– Note $P(s \mid s')$ stands for $P(S_{t+1} = s \mid S_t = s')$ at any time $t$
– We will use the shorthand $P_{s,s'}$
A Markov Reward Process: a Markov process whose states also give you rewards
– $\mathcal{S}$ is the (possibly finite) set of states
– $\mathcal{P}$ is the complete set of transition probabilities $P_{s,s'}$
– $\mathcal{R}$ is a reward function, consisting of the distributions $P(r \mid s)$ or $P(r \mid s, s')$
– $\gamma \in [0,1]$ is a discount factor
The return is the total discounted reward from time $t$:
$G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$
– The discount expresses that we trust our own predictions of the future less and less
– $\gamma = 0$: the future is totally unpredictable; only trust what you see immediately ahead of you (myopic)
– $\gamma = 1$: the future is clear; consider all of it (far-sighted)
A Markov Decision Process (MDP): a Markov reward process in which the agent has the ability to decide its actions!
– We will represent the action at time $t$ as $a_t$
– The transitions made by the environment are functions of the action
– The rewards returned are functions of the action
Formally, an MDP is the tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$ (a small code sketch follows this list):
– $\mathcal{S}$ is a (possibly finite) set of states: $\mathcal{S} = \{s\}$
– $\mathcal{A}$ is a (possibly finite) set of actions: $\mathcal{A} = \{a\}$
– $\mathcal{P}$ is the set of action-conditioned transition probabilities $P^a_{s,s'} = P(S_{t+1} = s' \mid S_t = s, a_t = a)$
– $\mathcal{R}^a_{ss'}$ is an action-conditioned reward function $\mathbb{E}[r \mid S_t = s, a_t = a, S_{t+1} = s']$
– $\gamma \in [0,1]$ is a discount factor
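To make the tuple concrete, here is a minimal sketch (not from the slides) of a tabular MDP stored as NumPy arrays. The two-state, two-action numbers are entirely hypothetical; only the layout of P, R, and gamma matters.

```python
import numpy as np

# Hypothetical 2-state / 2-action MDP stored as tables.
# Convention: P[a, s, s'] = P(S_{t+1}=s' | S_t=s, a_t=a),
#             R[a, s, s'] = expected reward for that transition.
n_states, n_actions = 2, 2

P = np.zeros((n_actions, n_states, n_states))
P[0] = [[0.9, 0.1],   # action 0: mostly stay where you are
        [0.1, 0.9]]
P[1] = [[0.2, 0.8],   # action 1: mostly switch states
        [0.8, 0.2]]

R = np.zeros((n_actions, n_states, n_states))
R[1, 0, 1] = 1.0      # reward for reaching state 1 from state 0 under action 1

gamma = 0.9           # discount factor

# Sanity check: every row of every transition matrix is a distribution.
assert np.allclose(P.sum(axis=2), 1.0)
```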
A policy is the probability distribution over actions that the agent may take at any state:
$\pi(a \mid s) = P(a_t = a \mid s_t = s)$
– E.g., what are the preferred actions of the spider at any state?
– A deterministic policy picks one action per state: $\pi(s) = a_s$, where $a_s$ is the preferred action in state $s$
At each time step $t$:
– The agent selects action $a_t$
– The environment samples the next state $s_{t+1} \sim P(\cdot \mid s_t, a_t)$
– The environment samples a reward $r_t \sim R(\cdot \mid s_t, a_t, s_{t+1})$
– The agent receives reward $r_t$ and next state $s_{t+1}$
The objective is to maximize the expected discounted return (a rollout sketch follows):
$G = r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots = \sum_{k=0}^{\infty} \gamma^k r_k$
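As a hedged illustration of this loop, the sketch below simulates one episode of the hypothetical toy MDP from the earlier snippet (redefined here so it runs on its own) and accumulates the discounted return; the horizon T and the uniform-random policy are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy MDP tables (hypothetical): P[a, s, s'], R[a, s, s'], discount gamma.
P = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.2, 0.8], [0.8, 0.2]]])
R = np.zeros((2, 2, 2)); R[1, 0, 1] = 1.0
gamma, T = 0.9, 50

def rollout(policy, s0=0):
    """Simulate one episode of length T and return its discounted return G."""
    s, G, discount = s0, 0.0, 1.0
    for t in range(T):
        a = policy(s)                                 # agent selects action a_t
        s_next = rng.choice(len(P[a, s]), p=P[a, s])  # s_{t+1} ~ P(. | s_t, a_t)
        r = R[a, s, s_next]                           # reward r_t
        G += discount * r                             # accumulate gamma^t * r_t
        discount *= gamma
        s = s_next
    return G

uniform_policy = lambda s: rng.integers(2)            # pick actions at random
print(rollout(uniform_policy))
```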
– Problem: the tree of possible moves is exponentially large
– What do we mean by "generalize"?
– If a particular board position always leads to loss, avoid any moves that move you into that position
The value of a state under policy $\pi$ is the expected return when the process begins in that state:
$V^\pi(s) = \mathbb{E}[G_0 \mid S_0 = s, \pi]$
Since the future depends only on the present and not the past:
$V^\pi(s) = \mathbb{E}[G_t \mid S_t = s, \pi]$
so the time index can be dropped:
$V^\pi(s) = \mathbb{E}[G \mid S = s, \pi]$
The state value function and the action value (Q) function:
$V^\pi(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s, \pi\right]$
$Q^\pi(s,a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s, a_0 = a, \pi\right]$
– Each is simply the expected sum of discounted rewards upon starting in state $s$ (and, for $Q$, first taking action $a$) and thereafter taking actions according to $\pi$
They satisfy the recursions:
$V^\pi(s) = \mathbb{E}[r + \gamma V^\pi(s') \mid s, \pi]$
$Q^\pi(s,a) = \mathbb{E}[r + \gamma Q^\pi(s', \pi(s')) \mid s, a, \pi]$
Bellman Equations
$V^\pi(s) = \sum_a \pi(a \mid s) \sum_{s'} P^a_{s,s'} \left( \mathcal{R}^a_{ss'} + \gamma V^\pi(s') \right)$
For deterministic policies:
$V^\pi(s) = \sum_{s'} P^{\pi(s)}_{s,s'} \left( \mathcal{R}^{\pi(s)}_{ss'} + \gamma V^\pi(s') \right)$
This is the Bellman Expectation Equation for the state value function of an MDP. Note: although the reward did not depend on the action in the fly example, more generally it will.
The prediction problem:
– Given the complete MDP (all transition probabilities $P^a_{s,s'}$, expected rewards $R^a_{s,s'}$, and discount $\gamma$)
– and a policy $\pi$
– find all value terms $V^\pi(s)$ and/or $Q^\pi(s,a)$
For a fixed policy, the Bellman expectation equations are a set of simultaneous linear equations that can be solved for the value functions (see the sketch below)
– Although this will be computationally intractable for very large state spaces
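Because the system is linear, a small MDP can be solved for V in closed form. A minimal sketch, assuming the hypothetical toy tables from earlier and a deterministic policy given as an array pi[s]:

```python
import numpy as np

# Toy MDP (hypothetical) and a fixed deterministic policy pi[s] -> action.
P = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.2, 0.8], [0.8, 0.2]]])
R = np.zeros((2, 2, 2)); R[1, 0, 1] = 1.0
gamma = 0.9
pi = np.array([1, 0])                              # action chosen in each state

n = P.shape[1]
P_pi = P[pi, np.arange(n)]                         # P_pi[s, s'] = P^{pi(s)}_{s,s'}
r_pi = (P_pi * R[pi, np.arange(n)]).sum(axis=1)    # expected one-step reward per state

# Bellman expectation equation: V = r_pi + gamma * P_pi V  =>  (I - gamma P_pi) V = r_pi
V = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
print(V)
```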
Iterative solution (policy evaluation): initialize $V^\pi_{(0)}(s)$, e.g., to zero, and repeatedly apply the Bellman backup for every state (sketched below):
$V^\pi_{(k+1)}(s) = \sum_{s'} P^{\pi(s)}_{s,s'} \left( R^{\pi(s)}_{s,s'} + \gamma V^\pi_{(k)}(s') \right)$
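The same value function can be found by sweeping this backup until it stops changing. A sketch under the same toy-MDP assumptions (P, R, gamma, pi as in the previous snippet):

```python
import numpy as np

P = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.2, 0.8], [0.8, 0.2]]])
R = np.zeros((2, 2, 2)); R[1, 0, 1] = 1.0
gamma = 0.9
pi = np.array([1, 0])

def evaluate_policy(pi, tol=1e-8):
    """Iterative policy evaluation: repeat the Bellman expectation backup."""
    n = P.shape[1]
    V = np.zeros(n)                                # V^(0) initialised to zero
    while True:
        # V^(k+1)(s) = sum_s' P^{pi(s)}_{s,s'} (R^{pi(s)}_{s,s'} + gamma V^(k)(s'))
        V_new = np.array([
            np.sum(P[pi[s], s] * (R[pi[s], s] + gamma * V)) for s in range(n)
        ])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

print(evaluate_policy(pi))
```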
– Prediction: given any policy $\pi$, find the value function $V^\pi(s)$
– Control: find the optimal policy
Both are posed in terms of the expected discounted reward at every state:
$\mathbb{E}[G_t \mid S_t = s] = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid S_t = s\right]$
– Recall: why do we consider the discounted return, rather than the actual return $\sum_{k=0}^{\infty} r_{t+k+1}$?
A policy $\pi$ is better than or equal to a policy $\pi'$ if its value function is greater than or equal to the value function under $\pi'$ at all states:
$\pi \geq \pi' \Rightarrow V^\pi(s) \geq V^{\pi'}(s) \;\; \forall s$
– i.e., it is at least as good to follow $\pi$ no matter what the current state
There is an optimal policy $\pi^*$ that is better than or equal to all other policies:
$\pi^* \geq \pi \;\; \forall \pi$
– There may be more than one optimal policy, but they all achieve the same value functions:
$V^{\pi^*}(s) = V^*(s) \;\; \forall s$
$Q^{\pi^*}(s,a) = Q^*(s,a) \;\; \forall s, a$
The optimal policy is greedy with respect to the optimal action value function:
$\pi^*(s) = \arg\max_{a \in \mathcal{A}(s)} Q^*(s,a)$
– For any other policy $\pi$: $Q^\pi(s,a) \leq Q^*(s,a)$
– So if we can find $Q^*(s,a)$, we can find the optimal policy
Figures from Sutton.
$V^*(s) = \max_a Q^*(s,a)$
$Q^*(s,a) = \sum_{s'} P^a_{s,s'} \left( R^a_{ss'} + \gamma V^*(s') \right)$
Combining the two:
$V^*(s) = \max_a Q^*(s,a) = \max_a \sum_{s'} P^a_{s,s'} \left( R^a_{ss'} + \gamma V^*(s') \right)$
Similarly, substituting $V^*(s') = \max_{a'} Q^*(s',a')$ into
$Q^*(s,a) = \sum_{s'} P^a_{s,s'} \left( R^a_{ss'} + \gamma V^*(s') \right)$
gives
$Q^*(s,a) = \sum_{s'} P^a_{s,s'} \left( R^a_{ss'} + \gamma \max_{a'} Q^*(s',a') \right)$
Given the MDP $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$, the Bellman optimality equations are:
$V^*(s) = \max_a Q^*(s,a)$
$Q^*(s,a) = \sum_{s'} P^a_{s,s'} \left( R^a_{ss'} + \gamma V^*(s') \right)$
and the optimal policy is
$\pi^*(s) = \arg\max_{a \in \mathcal{A}(s)} Q^*(s,a)$
Equivalently, in expectation form:
$Q^*(s,a) = \mathbb{E}\left[ r + \gamma \max_{a'} Q^*(s',a') \mid s, a \right]$
– Value-based solutions solve for $V^*(s)$ and $Q^*(s,a)$ and derive the optimal policy from them
– Policy-based solutions directly estimate $\pi^*(s)$
When the MDP is known (dynamic programming):
– Value iteration
– Policy iteration
When the MDP is unknown (model-free):
– Q-learning
– SARSA, ...
Value iteration:
– In each iteration, update the value function for every state (a sketch follows this list):
$V^{(k)}(s) = \max_a \sum_{s'} P^a_{s,s'} \left( R^a_{s,s'} + \gamma V^{(k-1)}(s') \right)$
– But intermediate value function estimates may not represent any policy
– For N states, this estimates N terms
– Iterating over action values instead, with M actions, must estimate MN terms
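A minimal value-iteration sketch on the same hypothetical toy MDP: the loop sweeps all states until the value function stops changing, then reads off a greedy policy.

```python
import numpy as np

P = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.2, 0.8], [0.8, 0.2]]])
R = np.zeros((2, 2, 2)); R[1, 0, 1] = 1.0
gamma = 0.9

def value_iteration(tol=1e-8):
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Q[a, s] = sum_s' P^a_{s,s'} (R^a_{s,s'} + gamma V(s'))
        Q = np.einsum('ast,ast->as', P, R + gamma * V)
        V_new = Q.max(axis=0)                 # V(s) = max_a Q(s, a)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=0)    # optimal values and a greedy policy
        V = V_new

V_star, pi_star = value_iteration()
print(V_star, pi_star)
```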
Value iteration can equivalently be carried out over action values:
$Q^{(k+1)}(s,a) = \sum_{s'} P^a_{s,s'} \left( R^a_{s,s'} + \gamma \max_{a'} Q^{(k)}(s',a') \right)$
Policy iteration alternates two steps (sketched below):
– Use iterative policy evaluation (prediction DP) to find the value function $V^{\pi^{(k)}}(s)$ of the current policy
– Find the greedy policy with respect to it: $\pi^{(k+1)}(s) = \mathrm{greedy}\left(V^{\pi^{(k)}}\right)(s)$
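A policy-iteration sketch under the same toy-MDP assumptions; the evaluation step reuses the iterative backup from the earlier snippet, and the improvement step is the greedy argmax.

```python
import numpy as np

P = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.2, 0.8], [0.8, 0.2]]])
R = np.zeros((2, 2, 2)); R[1, 0, 1] = 1.0
gamma = 0.9
n_actions, n_states, _ = P.shape

def evaluate(pi, tol=1e-8):
    """Iterative policy evaluation for a deterministic policy pi[s] -> action."""
    V = np.zeros(n_states)
    while True:
        V_new = np.array([np.sum(P[pi[s], s] * (R[pi[s], s] + gamma * V))
                          for s in range(n_states)])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

def policy_iteration():
    pi = np.zeros(n_states, dtype=int)           # start from an arbitrary policy
    while True:
        V = evaluate(pi)                         # prediction step
        # Greedy improvement: pi'(s) = argmax_a sum_s' P^a (R^a + gamma V(s'))
        Q = np.einsum('ast,ast->as', P, R + gamma * V)
        pi_new = Q.argmax(axis=0)
        if np.array_equal(pi_new, pi):
            return pi, V
        pi = pi_new

print(policy_iteration())
```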
So far we assumed someone gave us the MDP
– In general the MDP is unknown
– Can we estimate the value function of a policy without knowing the underlying MDP?
– Model-free prediction
– Can we find the optimal policy without knowing the MDP?
– Model-free control
In the model-free setting we have no knowledge of the system dynamics
– The key knowledge required to "solve" for the best policy
– A reasonable assumption in many discrete-state scenarios
– Can be generalized to other scenarios with infinite or unknowable state
Monte Carlo prediction: learn value functions directly from episodes of experience, by visiting different states
– Take actions according to policy $\pi$
– Note the states visited and the rewards obtained as a result
– Record the entire sequence: $s_1, a_1, r_2, s_2, a_2, r_3, \dots, s_T$
– Assumption: each "episode" ends at some time
$V^\pi(s) = \mathbb{E}[G_t \mid S_t = s] = \mathbb{E}\left[ r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{T-t-1} r_T \mid S_t = s \right]$
Monte Carlo: replace the expectation with the empirical average of $G_t$ over visits with $S_t = s$
– episode 1: $s_1^{(1)}, a_1^{(1)}, r_2^{(1)}, s_2^{(1)}, a_2^{(1)}, r_3^{(1)}, \dots, s_{T_1}^{(1)}$
– episode 2: $s_1^{(2)}, a_1^{(2)}, r_2^{(2)}, s_2^{(2)}, a_2^{(2)}, r_3^{(2)}, \dots, s_{T_2}^{(2)}$
– ...
– Different episodes may have different lengths
Compute the return from every visited time step of every episode:
$G_i^{(1)} = r_{i+1}^{(1)} + \gamma r_{i+2}^{(1)} + \cdots + \gamma^{T_1 - i - 1} r_{T_1}^{(1)}$
$G_i^{(2)} = r_{i+1}^{(2)} + \gamma r_{i+2}^{(2)} + \cdots + \gamma^{T_2 - i - 1} r_{T_2}^{(2)}$
– ...
– Initialize: count $N(s) = 0$ and total return $V^\pi(s) = 0$ for every state
– For every episode: for every state $s$ visited in it, increment $N(s)$ and add the return $G$ observed from that visit to $V^\pi(s)$
– Finally: $V^\pi(s) = V^\pi(s) / N(s)$
– If every state has been visited a sufficiently large number of times, we will obtain good estimates of the value functions of all states
Properties of the Monte Carlo estimate (a code sketch follows):
– Will eventually get to the right answer
– Unbiased estimate
– Cannot update anything until the end of an episode
– High variance! Each return adds many random values
– Slow to converge
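A first-visit Monte Carlo sketch of this procedure, again on the hypothetical toy MDP; the episode generator, episode count, and horizon are arbitrary choices for the example.

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)

# Toy MDP (hypothetical) and a fixed policy to evaluate.
P = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.2, 0.8], [0.8, 0.2]]])
R = np.zeros((2, 2, 2)); R[1, 0, 1] = 1.0
gamma, T = 0.9, 30
pi = np.array([1, 0])

def episode():
    """Return the visited states and the reward that followed each visit."""
    s, states, rewards = 0, [], []
    for _ in range(T):
        a = pi[s]
        s_next = rng.choice(2, p=P[a, s])
        states.append(s)
        rewards.append(R[a, s, s_next])
        s = s_next
    return states, rewards

N = defaultdict(int)        # visit counts N(s)
V = defaultdict(float)      # running value estimates V(s)

for _ in range(2000):
    states, rewards = episode()
    G, returns = 0.0, [0.0] * len(states)
    for t in reversed(range(len(states))):       # G_t = r_{t+1} + gamma * G_{t+1}
        G = rewards[t] + gamma * G
        returns[t] = G
    seen = set()
    for t, s in enumerate(states):               # first-visit: count each state once
        if s not in seen:
            seen.add(s)
            N[s] += 1
            V[s] += (returns[t] - V[s]) / N[s]   # incremental mean update

print(dict(V))
```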
A running mean of a sequence $x_1, x_2, \dots$ can be computed as
$\bar{x}_k = \frac{1}{k} \sum_{i=1}^{k} x_i$
or, incrementally,
$\bar{x}_k = \frac{(k-1)\,\bar{x}_{k-1} + x_k}{k}$
$\bar{x}_k = \bar{x}_{k-1} + \frac{1}{k}\left( x_k - \bar{x}_{k-1} \right)$
The same running mean
$\bar{x}_k = \bar{x}_{k-1} + \frac{1}{k}\left( x_k - \bar{x}_{k-1} \right)$
can be generalized by replacing $\frac{1}{k}$ with a step size $\alpha$:
$\bar{x}_k = \bar{x}_{k-1} + \alpha\left( x_k - \bar{x}_{k-1} \right)$
– The step sizes must not shrink too fast; for convergence we need
$\sum_k \alpha_k^2 < \infty, \quad \sum_k \alpha_k = \infty, \quad \alpha_k \geq 0$
(A small comparison of the two updates is sketched below.)
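A tiny sketch contrasting the exact running mean (step size 1/k) with a constant step size on an arbitrary, hypothetical data stream:

```python
import numpy as np

rng = np.random.default_rng(0)
stream = rng.normal(loc=5.0, scale=2.0, size=10_000)   # hypothetical data stream

mean_exact, mean_const, alpha = 0.0, 0.0, 0.05
for k, x in enumerate(stream, start=1):
    mean_exact += (x - mean_exact) / k        # exact: x_k-bar = x-bar + (1/k)(x - x-bar)
    mean_const += alpha * (x - mean_const)    # constant step: biased early, tracks later

print(mean_exact, mean_const)                 # both approach the true mean of 5.0
```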
With a constant step size (e.g., $\alpha = 0.1$, $0.05$, $0.03$) the running estimate is biased (early estimates can be expected to be wrong) but converges to the true value.
Applying this to Monte Carlo estimation: instead of averaging at the end,
$V^\pi(s) = \frac{1}{N(s)} \sum_{i=1}^{N(s)} G^{(i)}$
update incrementally after each visited state $S_t$ with observed return $G_t$:
$N(S_t) \leftarrow N(S_t) + 1$
$V^\pi(S_t) \leftarrow V^\pi(S_t) + \frac{1}{N(S_t)} \left( G_t - V^\pi(S_t) \right)$
or, with a constant step size,
$V^\pi(S_t) \leftarrow V^\pi(S_t) + \alpha \left( G_t - V^\pi(S_t) \right)$
Problem: the update
$V^\pi(S_t) \leftarrow V^\pi(S_t) + \alpha \left( G_t - V^\pi(S_t) \right)$
still needs the full return $G_t$, which is only known at the end of the episode. But note that
$G_t = r_{t+1} + \gamma G_{t+1}$
so if we approximate $G_{t+1}$ by its current estimate $V^\pi(S_{t+1})$, we get
$G_t \approx r_{t+1} + \gamma V^\pi(S_{t+1})$
Substituting this approximation into the incremental update gives the temporal-difference (TD) update:
$V^\pi(S_t) \leftarrow V^\pi(S_t) + \alpha \left( r_{t+1} + \gamma V^\pi(S_{t+1}) - V^\pi(S_t) \right)$
– The term in parentheses is the error between an (estimated) observation of $G_t$ and the current estimate $V^\pi(S_t)$
Problem: estimate the value function from a given set of episodes
– Using MC
– Using TD, where you are allowed to repeatedly go over the data
TD(0) prediction (a code sketch follows):
– For every time $t = 1 \dots T$ in an episode:
$V^\pi(S_t) \leftarrow V^\pi(S_t) + \alpha \left( r_{t+1} + \gamma V^\pi(S_{t+1}) - V^\pi(S_t) \right)$
– Each update uses only the reward and the state the process arrives at at the next time
– So each update touches only the current state (and its successor)
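A TD(0) sketch on the same hypothetical toy MDP; note that each visit updates V(s_t) immediately from r_{t+1} and V(s_{t+1}), with no need to wait for the episode to end. The step size and episode counts are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

P = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.2, 0.8], [0.8, 0.2]]])
R = np.zeros((2, 2, 2)); R[1, 0, 1] = 1.0
gamma, alpha, T = 0.9, 0.05, 30
pi = np.array([1, 0])                      # fixed policy being evaluated

V = np.zeros(2)
for _ in range(2000):                      # many episodes
    s = 0
    for _ in range(T):
        a = pi[s]
        s_next = rng.choice(2, p=P[a, s])
        r = R[a, s, s_next]
        # TD(0): V(s) <- V(s) + alpha * (r + gamma V(s') - V(s))
        V[s] += alpha * (r + gamma * V[s_next] - V[s])
        s = s_next

print(V)
```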
TD converges to the true value function:
– Although initial values will be biased, as seen before
– It actually has lower variance than MC!
– Particularly when TD is allowed to loop over all learning episodes
Like MC, TD requires no knowledge of the dynamics
– TD is quicker to update, and in many situations it is the better solution
We can now evaluate the value function of a given policy on an MDP whose transition probabilities are unknown
– But to improve the policy from $V^\pi$ we will need extra information, namely transition probabilities
– Which we do not have
– The optimal policy in any state: choose the action that has the largest optimal action value
$\pi^*(s) = \arg\max_{a \in \mathcal{A}} \sum_{s'} P^a_{ss'} \left( \mathcal{R}^a_{ss'} + \gamma V(s') \right)$
– This needs knowledge of transition probabilities
– Working with the action value function instead needs no such knowledge:
$\pi^*(s) = \arg\max_{a \in \mathcal{A}} Q(s,a)$
So instead we estimate $Q(s,a)$ directly from episodes
$s_1, a_1, r_2, s_2, a_2, r_3, s_3, a_3, r_4, \dots, s_T$
– The optimal policy can then be found from it
We would like to learn $Q$ while following a policy
– So that we can continuously improve our policy from ongoing experience
If we always act greedily with respect to the current estimates:
– We only learn to evaluate our current policy
– We will never learn about alternate policies that may turn out to be better
Solution: act greedily most of the time
– But choose a random action a fraction $\epsilon$ of the time
– The "epsilon-greedy" policy (a selection sketch follows)
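A minimal epsilon-greedy selector over a tabular Q, matching the policy described above; the tie-breaking, table contents, and value of epsilon are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(Q, s, epsilon=0.1):
    """Pick the greedy action w.r.t. Q[s], but a random action with probability epsilon."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))    # explore
    return int(np.argmax(Q[s]))                 # exploit

# Example: 2 states x 2 actions table of action values.
Q = np.array([[0.0, 1.0],
              [0.5, 0.2]])
print([epsilon_greedy(Q, 0) for _ in range(10)])
```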
Apply the TD updates while traversing the episode $s_1, a_1, r_2, s_2, a_2, r_3, s_3, a_3, r_4, \dots, s_T$
– Do not actually wait until the end of the episode
– Update for $s_1$, then update for $s_2$, then update for $s_3$, and so on
Q-learning:
– Behave according to an exploratory policy, e.g., ε-greedy, generating $s_1, a_1, r_2, s_2, a_2, r_3, s_3, a_3, r_4, \dots, s_T$
– At each step, from state $s$ take action $a$:
– Accept reward $r$
– Transition to $s'$
– Find the best action $a'$ for $s'$
– Use it to update $Q(s,a)$: $Q(s,a) \leftarrow Q(s,a) + \alpha \left( r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right)$
– But then actually perform an epsilon-greedy action $a''$ from $s'$
– The hypothetical action used in the update is guaranteed to be at least as good as the one you actually take
– But you still explore (non-greedy)
(A tabular sketch follows.)
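Putting the pieces together, a tabular Q-learning sketch on the same hypothetical toy MDP: the behavior is ε-greedy, while the update bootstraps from the greedy (hypothetical) action in the next state.

```python
import numpy as np

rng = np.random.default_rng(0)

P = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.2, 0.8], [0.8, 0.2]]])
R = np.zeros((2, 2, 2)); R[1, 0, 1] = 1.0
gamma, alpha, epsilon, T = 0.9, 0.1, 0.1, 30

Q = np.zeros((2, 2))                           # Q[s, a]

for _ in range(3000):                          # episodes
    s = 0
    for _ in range(T):
        # Behave epsilon-greedily.
        a = int(rng.integers(2)) if rng.random() < epsilon else int(np.argmax(Q[s]))
        s_next = rng.choice(2, p=P[a, s])      # transition
        r = R[a, s, s_next]                    # accept reward
        # Update using the best (hypothetical) next action, not the one taken next.
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next

print(Q, Q.argmax(axis=1))                     # learned values and the greedy policy
```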
The greedy policy:
$\pi(a \mid s) = \begin{cases} 1 & \text{for } a = \arg\max_{a'} Q(s,a') \\ 0 & \text{otherwise} \end{cases}$
The epsilon-greedy policy (with $N_a$ actions):
$\pi(a \mid s) = \begin{cases} 1 - \epsilon & \text{for } a = \arg\max_{a'} Q(s,a') \\ \frac{\epsilon}{N_a - 1} & \text{otherwise} \end{cases}$
Further topics:
– Value function approximation
– Continuous state spaces
– Deep Q-learning