This lecture will be recorded!
Welcome to DS595/CS525 Reinforcement Learning
Prof. Yanhua Li
Time: 6:00pm–8:50pm R, Zoom Lecture, Fall 2020
Last lecture
v Reinforcement Learning Components
  § Model, Value function, Policy
v Model-based Control
  § Policy Evaluation, Policy Iteration, Value Iteration
v Project 1 description
Quiz 1: Week 4 (9/24 R)
v Model-based Control
  § Policy Evaluation, Policy Iteration, Value Iteration
  § 20 minutes at the beginning of class
    • You can start as early as 5:55PM and finish as late as 6:20PM; the quiz duration is 20 minutes.
  § Log into the class Zoom so you can ask questions about the quiz in the Zoom chat box.
Project 1 due Week 4 (9/24 R)
This lecture
v Markov Process (Markov Chain), Markov Reward Process, and Markov Decision Process
  § MP, MRP, MDP, POMDP
v Review: Model-based control
  § Policy Iteration and Value Iteration
v Model-Free Policy Evaluation
  § Monte Carlo policy evaluation
  § Temporal-difference (TD) policy evaluation
Example: Taxi passenger-seeking task as a decision-making process
[Figure: a road with six taxi locations s1, ..., s6]
States: locations of the taxi (s1, ..., s6) on the road
Actions: Left or Right
Rewards: +1 in state s1, +3 in state s5, 0 in all other states
RL components
v Often include one or more of:
  § Model: representation of how the world changes in response to the agent's actions
  § Policy: function mapping the agent's states to actions
  § Value function: future rewards from being in a state and/or taking an action when following a particular policy
RL components: (1) Model
v The agent's representation of how the world changes in response to the agent's actions, with two parts:
  § Transition model: predicts the next agent state, p(s_{t+1} = s' | s_t = s, a_t = a)
  § Reward model: predicts the immediate reward, r(s, a)
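As a loose illustration of how such a tabular model can be stored in code (the particular states, actions, and probabilities below are made-up placeholders, not the lecture's numbers):

```python
import random

# Minimal tabular model sketch (illustrative values only).
# transition[(s, a)][s'] = p(s_{t+1} = s' | s_t = s, a_t = a)
# reward[(s, a)]         = r(s, a)
transition = {
    ("s1", "Right"): {"s2": 1.0},
    ("s2", "Right"): {"s3": 0.9, "s2": 0.1},   # hypothetical stochastic transition
}
reward = {
    ("s1", "Right"): 1.0,
    ("s2", "Right"): 0.0,
}

def sample_next_state(s, a):
    """Draw s' ~ p(. | s, a) from the tabular transition model."""
    next_states, probs = zip(*transition[(s, a)].items())
    return random.choices(next_states, weights=probs, k=1)[0]
```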
RL components: (2) Policy
v A policy π determines how the agent chooses actions
  § π : S → A, a mapping from states to actions
v Deterministic policy:
  § π(s) = a
  § In other words, for actions a, a', a'' available in state s:
    • π(a | s) = 1
    • π(a' | s) = π(a'' | s) = 0
v Stochastic policy:
  § π(a | s) = Pr(a_t = a | s_t = s)
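A small sketch of both policy types in code (the action names and probabilities are placeholders):

```python
import random

# Deterministic policy: a fixed mapping pi(s) = a.
det_policy = {"s1": "Right", "s2": "Right", "s3": "Left"}   # placeholder mapping

# Stochastic policy: pi(a | s) is a distribution over actions for each state.
stoch_policy = {
    "s1": {"Left": 0.2, "Right": 0.8},   # placeholder probabilities
    "s2": {"Left": 0.5, "Right": 0.5},
}

def act_deterministic(s):
    return det_policy[s]

def act_stochastic(s):
    actions, probs = zip(*stoch_policy[s].items())
    return random.choices(actions, weights=probs, k=1)[0]
```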
RL components: (3) Value Function
v Value function V^π: the expected discounted sum of future rewards under a particular policy π
v The discount factor γ weighs immediate vs. future rewards
v Can be used to quantify the goodness/badness of states and actions
v And to decide how to act by comparing policies
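Written out, the quantity described here is the standard definition:
V^π(s) = E_π[ r_t + γ r_{t+1} + γ^2 r_{t+2} + ... | s_t = s ]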
RL agents and algorithms
v Model-free: no model
v Model-based: explicit model
Find a good policy: Problem settings
v Model-based control (the agent's internal computation)
  § Given a model of how the world works
  § Dynamics and reward model given implicitly/explicitly
  § The algorithm computes how to act in order to maximize expected reward (may involve planning)
v Model-free control (computing while interacting with the environment)
  § The agent doesn't know how the world works
  § It interacts with the world to learn how the world works
  § The agent improves its policy
Find a good policy: Problem settings
v Model-based control (the agent's internal computation)
  § Frozen Lake (Project 1)
  § All rules of the game are known / a perfect model is available
  § Dynamic programming, tree search
v Model-free control (computing while interacting with the environment)
  § Taxi passenger-seeking problem
  § Demand/traffic dynamics are uncertain
  § Huge state space
[Figure: three example taxi paths, Path 1, Path 2, Path 3]
Find a good policy: Problem settings
v Model-based control
  § Given: MDP — S, A, P, R, γ
  § Output: π
v Model-free control
  § Given: MDP without P and R — S, A, γ
  § Unknown: P, R
  § Output: π
This lecture v Markov Process (Markov Chain), Markov Reward Process, and Markov Decision Process § MP, MRP, MDP, POMDP v Review: Model based control § Policy Iteration, and Value iteration v Model-Free Policy Evaluation § Monte Carlo policy evaluation § Temporal-difference (TD) policy evaluation
MP, MRP, and MDP
v Markov Process (Markov Chain)
v Markov Reward Process
v Markov Decision Process
Random Walks on Graphs
[Figures: example applications of random walks — random walk sampling, routing, a molecule in liquid, influence diffusion]
Undirected Graphs
[Figure: an undirected graph with six nodes, 1-6]
Random Walk
v Adjacency matrix (undirected graph, so A is symmetric) and degree matrix for the 4-node example:
  A = [[0, 1, 1, 1],
       [1, 0, 1, 0],
       [1, 1, 0, 1],
       [1, 0, 1, 0]],   D = diag(3, 2, 3, 2)
v Transition probability matrix: P = D^{-1} A, i.e. P_ij = 1/d_i if (i, j) is an edge, and 0 otherwise
  P = [[0,   1/3, 1/3, 1/3],
       [1/2, 0,   1/2, 0  ],
       [1/3, 1/3, 0,   1/3],
       [1/2, 0,   1/2, 0  ]]
v |E|: number of links (edges)
v Stationary distribution: π_i = d_i / (2|E|)
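A quick numerical check of this 4-node example in Python, using the row-stochastic convention P = D^{-1} A from above:

```python
import numpy as np

# Adjacency matrix of the 4-node undirected graph from the slide.
A = np.array([[0, 1, 1, 1],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)

d = A.sum(axis=1)               # node degrees (3, 2, 3, 2)
P = A / d[:, None]              # row-stochastic transition matrix, P = D^{-1} A

# Stationary distribution pi_i = d_i / (2|E|); note 2|E| equals the sum of all degrees.
pi = d / d.sum()

# Check stationarity: pi P = pi (up to numerical precision).
print(np.allclose(pi @ P, pi))  # True
```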
A random walker: Markov Chain / Markov Process
[Figure: six-state Markov chain over s1, ..., s6 with transition probabilities (e.g., 0.3, 0.4, 0.7) labeled on the edges]
A random walker: Markov Chain / Markov Process
[Same six-state chain figure]
v The state distribution evolves by multiplying with the transition matrix: s_0 · P = s_1 (and in general s_t · P = s_{t+1})
Taxi passenger-seeking task: Markov Process — Episodes
[Same six-state chain figure]
Example: sample episodes starting from s3:
v s3, s2, s2, s2, s1, s1, ...
v s3, s3, s4, s5, s6, s6, ...
v s3, s4, s5, s4, ...
MP, MRP, and MDP
v Markov Process (Markov Chain)
v Markov Reward Process
v Markov Decision Process
A random walker + rewards: Markov Reward Process (MRP)
[Same six-state chain figure]
v Reward: +1 in s1, +3 in s5, 0 in all other states
A random walker + rewards: Markov Reward Process
[Same six-state chain figure]
v Reward: +1 in s1, +3 in s5, 0 in all other states
v Sample returns for a sample 4-step episode, γ = 1/2
v s3 (t=1), s4 (t=2), s5 (t=3), s5 (t=4):
  § G1 = ?
  § G3 = ?
A random walker + rewards: Markov Reward Process
[Same six-state chain figure]
v Reward: +1 in s1, +3 in s5, 0 in all other states
v Sample returns for sample 4-step episodes, γ = 1/2
  § s3, s4, s5, s6: G1 = ?
  § s3, s3, s4, s3: G1 = ?
  § s3, s2, s1, s1: G1 = ?
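A small sketch of how these sample returns can be computed, assuming the reward is collected for the state occupied at each step of the episode (one reading of the reward convention on this slide):

```python
# Sample return of an episode: G_1 = r_1 + gamma*r_2 + gamma^2*r_3 + ...
gamma = 0.5
reward = {"s1": 1.0, "s5": 3.0}   # all other states give 0

def sample_return(episode, gamma):
    """Discounted return of one sampled episode (a list of states)."""
    return sum((gamma ** k) * reward.get(s, 0.0) for k, s in enumerate(episode))

print(sample_return(["s3", "s4", "s5", "s6"], gamma))  # 0 + 0 + 0.25*3 + 0       = 0.75
print(sample_return(["s3", "s3", "s4", "s3"], gamma))  # all zero-reward states   = 0.0
print(sample_return(["s3", "s2", "s1", "s1"], gamma))  # 0 + 0 + 0.25*1 + 0.125*1 = 0.375
```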
[Figure: three sample paths through the chain]
Samples:
v Path 1: s3, s4, s5, s6, ...
v Path 2: s3, s3, s4, s3, ...
v Path 3: s3, s2, s1, s1, ...
v ...
Return vs Value function
v The return G_t is computed from a single sampled episode (e.g., s3, s4, s5, s6, ... or s3, s3, s4, s3, ...)
v The value function V^π(s) is the expected return, i.e., the average over episodes starting from s
MP, MRP, and MDP
v Markov Process (Markov Chain)
v Markov Reward Process
v Markov Decision Process
Taxi passenger-seeking task: Markov Decision Process (MDP)
[Figure: the six-state chain with actions a1 and a2 labeled on the edges]
v Deterministic transition model
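One way to write this taxi MDP down as data is sketched below; the deterministic Left/Right moves along s1, ..., s6 are an assumed layout for illustration, not taken verbatim from the figure:

```python
# Sketch of the taxi MDP as tabular data (assumed road layout s1..s6).
states  = ["s1", "s2", "s3", "s4", "s5", "s6"]
actions = ["Left", "Right"]
gamma   = 0.5

def next_state(s, a):
    """Deterministic transition: move one position left/right, staying put at the ends."""
    i = states.index(s)
    i = max(i - 1, 0) if a == "Left" else min(i + 1, len(states) - 1)
    return states[i]

def reward(s):
    """+1 in s1, +3 in s5, 0 elsewhere."""
    return {"s1": 1.0, "s5": 3.0}.get(s, 0.0)
```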
This lecture
v Markov Process (Markov Chain), Markov Reward Process, and Markov Decision Process
  § MP, MRP, MDP, POMDP
v Review: Model-based control
  § Policy Iteration and Value Iteration
v Model-Free Policy Evaluation
  § Monte Carlo policy evaluation
  § Temporal-difference (TD) policy evaluation
For deterministic policy:
For deterministic and stochastic policy: From: Reinforcement Learning: An Introduction, Sutton and Barto, 2nd Edition
(All-in-one algorithm) From: Reinforcement Learning: An Introduction, Sutton and Barto, 2nd Edition
Deterministic policy
From: Reinforcement Learning: An Introduction, Sutton and Barto, 2nd Edition
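As a compact reminder of what value iteration does (policy iteration alternates full policy evaluation with greedy improvement, while value iteration folds them into a single Bellman optimality backup), here is a minimal tabular sketch; the array layout P[s, a, s'] and R[s, a] is an assumption, and this is not the exact pseudocode from the Sutton and Barto figure:

```python
import numpy as np

def value_iteration(P, R, gamma, theta=1e-8):
    """Minimal tabular value iteration sketch.
    P[s, a, s'] = transition probability, R[s, a] = expected immediate reward."""
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    while True:
        # Bellman optimality backup: Q(s,a) = R(s,a) + gamma * sum_s' P(s,a,s') V(s')
        Q = R + gamma * P @ V          # shape (n_states, n_actions)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < theta:
            break
        V = V_new
    policy = Q.argmax(axis=1)          # greedy (deterministic) policy
    return V, policy
```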
This lecture
v Markov Process (Markov Chain), Markov Reward Process, and Markov Decision Process
  § MP, MRP, MDP, POMDP
v Review: Model-based control
  § Policy Iteration and Value Iteration
v Model-Free Policy Evaluation
  § Monte Carlo policy evaluation
  § Temporal-difference (TD) policy evaluation
Review of Dynamic Programming for policy evaluation (model-based)
[Backup diagram: state, action, next state]
Equivalently: V^π_k(s) = E_π[ r + γ V^π_{k-1}(s') | s_t = s ]
Review of Dynamic Programming for policy evaluation (model-based)
Bootstrapping
[Backup diagram: state, action, next state]
V^π_k(s) = E_π[ r + γ V^π_{k-1}(s') | s_t = s ]
v Bootstrapping: the update for V uses an estimate of V
v Known model: P(s'|s,a) and r(s,a)
Review of Dynamic Programming for policy evaluation (model-based)
Bootstrapping
[Backup diagram: state, action, next state]
V^π_k(s) = E_π[ r + γ V^π_{k-1}(s') | s_t = s ]
v Requires a model of the MDP: P(s'|s,a) and r(s,a)
v Bootstraps the future return using the value estimate
v Requires the Markov assumption: bootstrapping regardless of history
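A minimal sketch of this model-based backup in code, assuming the same tabular array layout as in the value iteration sketch above (P[s, a, s'], R[s, a], and a stochastic policy pi[s, a]):

```python
import numpy as np

def dp_policy_evaluation(P, R, pi, gamma, theta=1e-8):
    """Iterative DP policy evaluation with a known model.
    P[s, a, s'] = p(s' | s, a), R[s, a] = r(s, a), pi[s, a] = pi(a | s)."""
    n_states = R.shape[0]
    V = np.zeros(n_states)
    while True:
        # Bellman expectation backup, bootstrapping on the old estimate V_{k-1}.
        Q = R + gamma * P @ V                   # Q_k(s, a)
        V_new = (pi * Q).sum(axis=1)            # V_k(s) = sum_a pi(a|s) Q_k(s, a)
        if np.max(np.abs(V_new - V)) < theta:
            return V_new
        V = V_new
```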
Model-free Policy Evaluation
v What if we don't know the dynamics model P and/or the reward model R?
v Today: policy evaluation without a model
v Given data and/or the ability to interact in the environment, efficiently compute a good estimate of the value of a policy π
Model-free Policy Evaluation
v Monte Carlo (MC) policy evaluation
  § First-visit based
  § Every-visit based
v Temporal Difference (TD)
  § TD(0)
v Metrics to evaluate and compare algorithms
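As a preview of the two families listed above, here is a minimal sketch of first-visit Monte Carlo and a single TD(0) update; the `episodes` input is assumed to be a list of (state, reward) trajectories collected by running the policy π:

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma):
    """First-visit Monte Carlo policy evaluation.
    episodes: list of trajectories, each a list of (state, reward) pairs under pi."""
    returns_sum = defaultdict(float)
    returns_cnt = defaultdict(int)
    for episode in episodes:
        first_visit = {}                       # state -> index of its first occurrence
        for t, (s, _) in enumerate(episode):
            first_visit.setdefault(s, t)
        # Walk backwards to accumulate returns G_t = r_t + gamma * G_{t+1}.
        returns = [0.0] * len(episode)
        G = 0.0
        for t in reversed(range(len(episode))):
            G = episode[t][1] + gamma * G
            returns[t] = G
        for s, t in first_visit.items():
            returns_sum[s] += returns[t]
            returns_cnt[s] += 1
    return {s: returns_sum[s] / returns_cnt[s] for s in returns_sum}

def td0_update(V, s, r, s_next, alpha, gamma):
    """One TD(0) update after observing transition (s, r, s_next) under pi."""
    V[s] = V[s] + alpha * (r + gamma * V[s_next] - V[s])
    return V
```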