This lecture will be recorded!!!
Welcome to DS595/CS525 Reinforcement Learning
Prof. Yanhua Li
Time: 6:00pm – 8:50pm, R (Thursday)
Zoom Lecture, Fall 2020
Last Lecture
v Model-Free Control
  § Generalized policy iteration
  § Control with Exploration
  § Monte Carlo (MC) Policy Iteration
  § Temporal-Difference (TD) Policy Iteration
    • SARSA
    • Q-Learning
  § Maximization bias and Double Q-Learning
  § Project 2 description
This Lecture
v Value Function Approximation (VFA)
  § Introduction
  § VFA for Policy Evaluation
  § VFA for Control
RL algorithms
v Tabular Representation
  § Model-based (DP)
    • Policy evaluation
    • Policy iteration
    • Value iteration (Asynchronous)
  § Model-Free Control
    • Policy evaluation: MC (first/every visit) and TD
    • Value/Policy Iteration
      – MC Iteration
      – TD Iteration: SARSA, Q-Learning, Double Q-Learning
v Function Representation
  § Value Function approximation
  § Value Function approximation control
  § Policy function control: Advantage Actor-Critic (A2C, A3C)
Value function representations
v Tabular representation: the value function is a lookup table S → V π, i.e., V π can be viewed as a vector of |S| dimensions (one entry V π(s) per state s)
v With enormous state and/or action spaces, the tabular representation is insufficient
v Parameterized representation: s → V π(s; w), where w has k dimensions with k << |S|
Value Function Approximation (VFA)
v Represent a (state or state-action) value function with a parameterized function instead of a table
  § State value: input s, output V π(s; w), parameterized by w
  § State-action value: input (s, a), output Q π(s, a; w), parameterized by w
Why VFA? Benefits of VFA?
v Huge state and/or action spaces make a tabular representation impossible
v We want a more compact representation that generalizes across states, or across states and actions
  § V π(s) ≈ V π(s; w)
  § Q π(s, a) ≈ Q π(s, a; w)
Benefits of Generalization via VFA
v Huge state and/or action space
  § Reduce the memory needed: w has k dimensions, k << |S|
v More compact representation that generalizes across states, or across states and actions
  § Generalization across states / state-action pairs
  § Advantage of tabular: exact value of s, or (s, a)
v A trade-off
  § Capacity vs. (computational and space) efficiency
What function for VFA?
v V π(s) ≈ V π(s; w); Q π(s, a) ≈ Q π(s, a; w)
v What function approximator should we use?
What function for VFA?
v Many possible function approximators, including
  § Linear combinations of features
  § Neural networks
  § Decision trees
  § Nearest neighbors, and more
v In this class we will focus on function approximators that are differentiable (Why?)
v Two very popular classes of differentiable function approximators
  § Linear feature representations
  § Neural networks (Deep Reinforcement Learning)
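To make the two differentiable families concrete, here is a minimal Python/NumPy sketch (not from the slides): a linear feature representation and a tiny neural network for V̂(s; w). The feature map `phi`, the layer sizes, and the initialization are illustrative assumptions.

```python
# Minimal sketch (illustrative, not from the slides) of the two
# differentiable approximator families for a state-value function.
import numpy as np

def phi(s, k=8):
    """Hypothetical feature map: k-dimensional feature vector for state index s."""
    return np.random.default_rng(s).normal(size=k)    # deterministic per-state features

# 1) Linear feature representation: V(s; w) = phi(s)^T w
w = np.zeros(8)
def v_linear(s, w):
    return phi(s) @ w                                 # gradient w.r.t. w is just phi(s)

# 2) Small neural network: V(s) = W2 @ tanh(W1 @ phi(s))
rng = np.random.default_rng(0)
W1 = 0.1 * rng.normal(size=(16, 8))                   # hidden layer weights
W2 = 0.1 * rng.normal(size=(1, 16))                   # output layer weights
def v_nn(s, W1, W2):
    h = np.tanh(W1 @ phi(s))                          # hidden activations
    return (W2 @ h).item()                            # scalar value estimate

print(v_linear(3, w), v_nn(3, W1, W2))
```

Both families are differentiable in their parameters, which is what makes the gradient-based policy evaluation in the next part possible.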
This Lecture
v Value Function Approximation (VFA)
  § Introduction
  § VFA for Policy Evaluation
  § VFA for Control
Review: Gradient Descent
v Consider a function J(w) that is a differentiable function of a parameter vector w
v The goal is to find the parameter vector w that minimizes J
v The gradient of J(w) is ∇_w J(w) = [∂J(w)/∂w_1, …, ∂J(w)/∂w_n]^T
v The gradient vector points in the uphill direction
v To minimize J(w), we subtract the α-weighted gradient from w in each iteration: w ← w − α ∇_w J(w)
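As a tiny sanity-check sketch of the rule w ← w − α ∇_w J(w): the quadratic objective, its minimizer, and the step size below are made up for illustration.

```python
# Gradient descent on a simple quadratic J(w) = ||w - w_star||^2.
# w_star (the minimizer) and the step size alpha are illustrative choices.
import numpy as np

w_star = np.array([2.0, -1.0])                # the (unknown) minimizer of J
def grad_J(w):
    return 2.0 * (w - w_star)                 # gradient of ||w - w_star||^2

w, alpha = np.zeros(2), 0.1
for _ in range(100):
    w = w - alpha * grad_J(w)                 # step against the (uphill) gradient
print(w)                                      # approaches [2, -1]
```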
VFA problem
v Suppose an oracle function exists that takes s as input and outputs the true V π(s)
  § The oracle may not be accessible in practice (that is the model-free problem setting)
v The objective is to find the best approximate representation of V π(s), given a particular parameterized function V̂ π(s; w)
VFA with an oracle: objective and gradient
v Minimize the mean squared error between the true value V π(s) and its approximation V̂ π(s; w):
  J(w) = ½ E_π[(V π(s) − V̂ π(s; w))²]
  (Without loss of generality, a constant parameter ½ was added; it cancels the factor of 2 from differentiation.)
v Gradient descent: Δw = α ∇_w J(w) = −α E_π[(V π(s) − V̂ π(s; w)) ∇_w V̂ π(s; w)],  w ← w − Δw
v From full gradient to stochastic gradient: sample a state s and update with that single sample:
  Δw = −α (V π(s) − V̂ π(s; w)) ∇_w V̂ π(s; w),  w ← w − Δw
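A minimal sketch of this stochastic-gradient step under the (hypothetical) oracle assumption, using a linear V̂(s; w) = x(s)^T w: sample one state, query the oracle, and apply the Δw above. The feature table `x` and `oracle_v` are made up for illustration.

```python
# Stochastic gradient descent on J(w) with an assumed oracle for V_pi(s),
# using the slide's convention Delta_w = -alpha*(V_pi(s) - Vhat(s;w))*x(s), w <- w - Delta_w.
import numpy as np

rng = np.random.default_rng(0)
n_states, k = 5, 3
x = rng.normal(size=(n_states, k))            # illustrative feature vector x(s) per state
true_w = np.array([1.0, -2.0, 0.5])
def oracle_v(s):                              # hypothetical oracle returning the true V_pi(s)
    return x[s] @ true_w

w, alpha = np.zeros(k), 0.05
for _ in range(5000):
    s = rng.integers(n_states)                # sample a state
    delta_w = -alpha * (oracle_v(s) - x[s] @ w) * x[s]
    w = w - delta_w
print(w)                                      # approaches true_w
```

Model-free MC and TD methods (next slides) keep exactly this update and only replace the oracle value with a sampled target.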
Model-free Policy Evaluation: from tabular representation to VFA
v Following a fixed policy π (or given access to prior data), the goal is to estimate V π and/or Q π
v Tabular: maintain a lookup table to store estimates of V π and/or Q π, e.g., V(1), V(2), V(3), V(4), V(5), …
v VFA: maintain a function parameter vector w to store the estimates, V π(s) ≈ V̂ π(s; w)
v In both cases, update the estimates (the table entries, or the parameter vector w)
  § after each episode (Monte Carlo methods), or
  § after each step (TD methods)
Model-free VFA policy evaluation
v From updating an initial V over iterations
  § MC: V(s_t) ← V(s_t) + α (G_t − V(s_t))
  § TD(0): V(s_t) ← V(s_t) + α (r_t + γ V(s_{t+1}) − V(s_t))
v To updating an initial w over iterations: replace the true V π(s_t) in the gradient update with a target,
  Δw = −α (target − V̂(s_t; w)) ∇_w V̂(s_t; w),  w ← w − Δw, where the target is G_t (MC) or r_t + γ V̂(s_{t+1}; w) (TD)
Linear VFA for policy evaluation
v Represent each state by a feature vector x(s) = [x_1(s), …, x_k(s)]^T
v Linear value function: V̂ π(s; w) = x(s)^T w
v Gradient: ∇_w V̂ π(s; w) = x(s)

Monte Carlo linear VFA
v Use the return G_t as a (noisy) sample of the true value V π(s_t)
v Update: Δw = −α (G_t − x(s_t)^T w) x(s_t),  w ← w − Δw
Linear VFA with MC (Offline Practice)
Seven states s1, s2, s3, s4, s5, s6, s7
Init w_0 = [1,1,1,1,1,1,1,1]^T, α = 0.5, γ = 1
Episode: (s1, a1, 0, s7, a1, 0, s7, a1, 0, T)
What are Δw and w_1 = w_0 − Δw after the update with the first visit of s1?
Feature vectors:
x(s1) = [2,0,0,0,0,0,0,1]^T, x(s2) = [0,2,0,0,0,0,0,1]^T, x(s3) = [0,0,2,0,0,0,0,1]^T, x(s4) = [0,0,0,2,0,0,0,1]^T,
x(s5) = [0,0,0,0,2,0,0,1]^T, x(s6) = [0,0,0,0,0,2,0,1]^T, x(s7) = [0,0,0,0,0,0,1,2]^T
w_0 = [1,1,1,1,1,1,1,1]^T, α = 0.5, γ = 1; episode (s1, a1, 0, s7, a1, 0, s7, a1, 0, T)
First visit of s1: G_{s1} = 0, V̂(s1) = x(s1)^T w_0 = 3
Δw = −0.5·(0 − 3)·[2,0,0,0,0,0,0,1]^T = [3,0,0,0,0,0,0,1.5]^T
w_1 = w_0 − Δw = [1,1,1,1,1,1,1,1]^T − [3,0,0,0,0,0,0,1.5]^T = [−2,1,1,1,1,1,1,−0.5]^T
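The slide's arithmetic can be checked in a few lines of NumPy; this reproduces the first-visit MC update for s1 exactly (only the s1 feature vector is needed).

```python
# Check of the slide's MC linear-VFA update for the first visit of s1.
import numpy as np

x_s1 = np.array([2, 0, 0, 0, 0, 0, 0, 1], dtype=float)   # feature vector of s1
w0 = np.ones(8)
alpha = 0.5

G_s1 = 0.0                                    # return from s1 in (s1,0,s7,0,s7,0,T)
v_s1 = x_s1 @ w0                              # Vhat(s1; w0) = 3
delta_w = -alpha * (G_s1 - v_s1) * x_s1       # [3, 0, 0, 0, 0, 0, 0, 1.5]
w1 = w0 - delta_w                             # [-2, 1, 1, 1, 1, 1, 1, -0.5]
print(delta_w, w1)
```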
Linear VFA with TD(0)
v Recall the tabular TD(0) update: V(s_t) ← V(s_t) + α (r_t + γ V(s_{t+1}) − V(s_t))
v With VFA, use the TD target r_t + γ V̂(s_{t+1}; w) in place of the true value:
  Δw = −α (r_t + γ V̂(s_{t+1}; w) − V̂(s_t; w)) ∇_w V̂(s_t; w),  w ← w − Δw
v Linear case: Δw = −α (r_t + γ x(s_{t+1})^T w − x(s_t)^T w) x(s_t)
Linear VFA with TD (Offline Practice)
Same seven states and feature vectors as before; init w_0 = [1,1,1,1,1,1,1,1]^T, α = 0.5, γ = 1
What is w_1 after the TD update with the tuple (s1, a1, 1, s7)?
Answer: V̂(s1) = x(s1)^T w_0 = 3, V̂(s7) = x(s7)^T w_0 = 3, TD target = 1 + 1·3 = 4
Δw = −0.5·(4 − 3)·x(s1) = [−1,0,0,0,0,0,0,−0.5]^T
w_1 = w_0 − Δw = [2,1,1,1,1,1,1,1.5]^T
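The same kind of check for the TD(0) update; it reproduces the stated answer w_1 = [2,1,1,1,1,1,1,1.5]^T.

```python
# Check of the slide's TD(0) linear-VFA update for the tuple (s1, a1, 1, s7).
import numpy as np

x_s1 = np.array([2, 0, 0, 0, 0, 0, 0, 1], dtype=float)
x_s7 = np.array([0, 0, 0, 0, 0, 0, 1, 2], dtype=float)
w0 = np.ones(8)
alpha, gamma, r = 0.5, 1.0, 1.0

td_target = r + gamma * (x_s7 @ w0)                 # 1 + 1*3 = 4
delta_w = -alpha * (td_target - x_s1 @ w0) * x_s1   # [-1, 0, 0, 0, 0, 0, 0, -0.5]
w1 = w0 - delta_w                                   # [2, 1, 1, 1, 1, 1, 1, 1.5]
print(w1)
```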
This Lecture
v Value Function Approximation (VFA)
  § Introduction
  § VFA for Policy Evaluation
  § VFA for Control
Recall: Tabular representation (model-free control, last lecture)
v SARSA: Q(s,a) ← Q(s,a) + α (r + γ Q(s',a') − Q(s,a))
v Q-Learning: Q(s,a) ← Q(s,a) + α (r + γ max_{a'} Q(s',a') − Q(s,a))
v With VFA, the table Q(s,a) is replaced by a parameterized Q̂(s,a; w), and w is updated by gradient descent toward the same targets
Model-Free Q-Learning Control with Value Function Approximation (VFA)
v Replace the tabular Q(s,a) with Q̂(s,a; w), e.g., linear in state-action features: Q̂(s,a; w) = x(s,a)^T w
v Q-learning target: r + γ max_{a'} Q̂(s', a'; w)
v Update: Δw = −α (r + γ max_{a'} Q̂(s', a'; w) − Q̂(s, a; w)) ∇_w Q̂(s, a; w),  w ← w − Δw
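As a rough end-to-end sketch (not from the slides) of Q-learning with linear VFA: a tiny made-up MDP, ε-greedy action selection, and the gradient update above. The environment, the feature map x(s, a), and all hyperparameters are illustrative assumptions; with function approximation there is no general convergence guarantee.

```python
# Sketch of Q-learning control with linear VFA on a small random MDP (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, k = 4, 2, 8
x = rng.normal(size=(n_states, n_actions, k))              # feature vector x(s, a)
P = rng.integers(0, n_states, size=(n_states, n_actions))  # deterministic next state
R = rng.normal(size=(n_states, n_actions))                 # reward r(s, a)

def q_hat(s, a, w):
    return x[s, a] @ w                                     # linear Q(s, a; w) = x(s,a)^T w

w = np.zeros(k)
alpha, gamma, eps = 0.05, 0.9, 0.1
s = 0
for _ in range(20000):
    # epsilon-greedy action selection w.r.t. the current approximation
    if rng.random() < eps:
        a = int(rng.integers(n_actions))
    else:
        a = int(np.argmax([q_hat(s, b, w) for b in range(n_actions)]))
    r, s_next = R[s, a], P[s, a]
    # Q-learning target: r + gamma * max_a' Qhat(s', a'; w)
    target = r + gamma * max(q_hat(s_next, b, w) for b in range(n_actions))
    delta_w = -alpha * (target - q_hat(s, a, w)) * x[s, a]
    w = w - delta_w                                        # w <- w - Delta_w
    s = s_next
print(w)                                                   # learned parameter vector
```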
RL algorithms: Tabular Representation vs. Function Representation (recap of the overview diagram from the start of this lecture)
Project 3 is available
Starts 10/15 Thursday; due 10/29 Thursday midnight
v http://users.wpi.edu/~yli15/courses/DS595CS525Fall20/Assignments.html
v https://github.com/yingxue-zhang/DS595CS525-RL-Projects/tree/master/Project3
Next Lecture
v (Continued) Value Function Approximation
  § Linear Value Function
v Review of Deep Learning
v Deep Learning Implementation in PyTorch
  § (by TA Yingxue)
Questions?