

  1. This lecture will be recorded!!! Welcome to DS595/CS525 Reinforcement Learning, Prof. Yanhua Li. Time: 6:00pm–8:50pm R, Zoom lecture, Fall 2020

  2. Last Lecture
v Model-Free Control
  § Generalized policy iteration
  § Control with exploration
  § Monte Carlo (MC) policy iteration
  § Temporal-Difference (TD) policy iteration
    • SARSA
    • Q-Learning
  § Maximization bias and Double Q-Learning
  § Project 2 description

  3. This Lecture
v Value Function Approximation (VFA)
  § Introduction
  § VFA for Policy Evaluation
  § VFA for Control

  4. This Lecture
v Value Function Approximation (VFA)
  § Introduction
  § VFA for Policy Evaluation
  § VFA for Control

  5. RL algorithms
v Tabular representation
  § Model-based control (DP)
    • Policy evaluation
    • Policy iteration / Value iteration (asynchronous)
  § Model-free control
    • Policy evaluation: MC (first/every visit) and TD
    • Value/Policy iteration: MC iteration; TD iteration (SARSA, Q-Learning, Double Q-Learning)
v Function representation
  § Value function approximation
  § Value function approximation for control
  § Policy function approximation (Advantage Actor-Critic: A2C, A3C)

  6. Value function representations
v Tabular representation: V^π can be viewed as a vector of |S| dimensions
v For enormous state and/or action spaces, the tabular representation is insufficient

  7. Value function representations
v Tabular representation: S → V^π, a vector of |S| dimensions; insufficient for enormous state and/or action spaces
v Approximate representation: s → V^π(s; w), parameterized by w with k dimensions, k << |S|

  8. Value Function Approximation (VFA)
v Represent a (state or state-action) value function with a parameterized function instead of a table
  § State-value function: V^π(s) ≈ V^π(s; w), a function of the state s and the parameters w
  § Action-value function: Q^π(s, a) ≈ Q^π(s, a; w), a function of the state s, the action a, and the parameters w

  9. Why VFA? Benefits of VFA?
v V^π(s) ≈ V^π(s; w)
v Q^π(s, a) ≈ Q^π(s, a; w)

  10. Why VFA? Benefits of VFA?
v Huge state and/or action spaces make a tabular representation impossible to store
v We want a more compact representation that generalizes across states, or across states and actions
v V^π(s) ≈ V^π(s; w), Q^π(s, a) ≈ Q^π(s, a; w)

  11. Benefits of Generalization via VFA
v Huge state and/or action spaces
  § VFA reduces the memory needed
v More compact representation that generalizes across states or state-action pairs
  § Generalization across states / state-action pairs
  § Advantage of the tabular form: exact values of s, or of (s, a)
v A trade-off
  § Capacity vs. (computational and space) efficiency
v V^π(s) ≈ V^π(s; w), with w of k dimensions, k << |S|

  12. What function for VFA?
v V^π(s) ≈ V^π(s; w)
v Q^π(s, a) ≈ Q^π(s, a; w)
v What function approximator should we use?

  13. What function for VFA?
v Many possible function approximators, including
  § Linear combinations of features
  § Neural networks
  § Decision trees
  § Nearest neighbors, and more
v In this class we will focus on function approximators that are differentiable (Why?)
v Two very popular classes of differentiable function approximators:
  § Linear feature representations (a minimal sketch follows below)
  § Neural networks (Deep Reinforcement Learning)
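As a sketch of the linear case, the snippet below represents the value function as a dot product between a feature vector and a weight vector; the feature map `features` and the example state are made up for illustration and are not part of the slides.

    import numpy as np

    # Hypothetical feature map: each state is described by k features, k << |S|
    def features(state):
        return np.array([state["pos"], state["vel"], 1.0])   # last entry is a bias feature

    def v_hat(state, w):
        # Linear VFA: V_hat(s; w) = x(s)^T w
        return features(state) @ w

    def grad_v_hat(state, w):
        # Gradient of the linear VFA w.r.t. w is just the feature vector x(s)
        return features(state)

    w = np.zeros(3)                        # k = 3 parameters instead of |S| table entries
    s = {"pos": 0.4, "vel": -1.2}
    print(v_hat(s, w), grad_v_hat(s, w))   # 0.0, [ 0.4 -1.2  1. ]

The gradient being simply the feature vector is what makes linear VFA convenient for the gradient-based updates introduced next.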

  14. This Lecture
v Value Function Approximation (VFA)
  § Introduction
  § VFA for Policy Evaluation
  § VFA for Control

  15. Review: Gradient Descent
v Consider a function J(w) that is a differentiable function of a parameter vector w
v The goal is to find the parameter vector w that minimizes J
v The gradient of J(w) is ∇_w J(w) = [∂J(w)/∂w_1, …, ∂J(w)/∂w_n]^T

  16. Review: Gradient Descent
v Consider a function J(w) that is a differentiable function of a parameter vector w
v The goal is to find the parameter vector w that minimizes J
v The gradient of J(w) is ∇_w J(w) = [∂J(w)/∂w_1, …, ∂J(w)/∂w_n]^T
v The gradient vector points in the uphill direction
v To minimize J(w), subtract the α-weighted gradient vector from w in each iteration: w ← w − α ∇_w J(w)
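A tiny numeric sketch of this iteration on a made-up one-dimensional objective J(w) = (w − 3)², chosen only to show the update rule converging; none of these numbers come from the slides.

    def J(w):
        return (w - 3.0) ** 2          # toy objective, minimized at w = 3

    def grad_J(w):
        return 2.0 * (w - 3.0)         # dJ/dw

    w, alpha = 0.0, 0.1
    for _ in range(50):
        w = w - alpha * grad_J(w)      # step against the (uphill) gradient
    print(w)                           # approximately 3.0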

  17. VFA problem
v Suppose an oracle function exists that takes a state s as input and outputs the true V^π(s)
  § The oracle may not be accessible in practice (that is the model-free problem setting)
v The objective is to find the best approximate representation of V^π(s), given a particular parameterized function V^π(s; w)

  18. VFA objective
v J(w) = ½ E_π[(V^π(s) − V^π(s; w))²]
v Without loss of generality, a constant factor of ½ is added; it cancels the 2 that appears when differentiating the squared error

  19. From the full gradient to the stochastic gradient
v J(w) = ½ E_π[(V^π(s) − V^π(s; w))²]
v Full gradient: Δw = α ∇_w J(w) = −α E_π[(V^π(s) − V^π(s; w)) ∇_w V^π(s; w)], update w ← w − Δw
v Stochastic gradient: sample a single state s and drop the expectation: Δw = −α (V^π(s) − V^π(s; w)) ∇_w V^π(s; w)
v Without loss of generality, the constant factor ½ cancels against the 2 from differentiation
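A minimal numeric sketch of the two update styles, using the same Δw and w ← w − Δw convention as the worked examples later in the deck; the feature matrix X, the pretend "oracle" values V_pi, and α are all made up for illustration.

    import numpy as np

    # Hypothetical toy problem: 3 states, 2 features each, and made-up "oracle" values
    X = np.array([[1.0, 0.0],    # x(s0)
                  [0.0, 1.0],    # x(s1)
                  [1.0, 1.0]])   # x(s2)
    V_pi = np.array([4.0, 2.0, 5.0])   # pretend oracle V^pi(s), illustration only
    w = np.zeros(2)
    alpha = 0.1

    # Full-gradient step: expectation (here: average) over all states
    errors = V_pi - X @ w                           # V^pi(s) - V_hat(s; w) for every state
    delta_w_full = -alpha * (X.T @ errors) / len(X)
    w_full = w - delta_w_full

    # Stochastic-gradient step: same formula, but on one sampled state only
    i = np.random.randint(len(X))
    delta_w_sgd = -alpha * (V_pi[i] - X[i] @ w) * X[i]
    w_sgd = w - delta_w_sgd
    print(w_full, w_sgd)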

  20. Model-free Policy Evaluation: from tabular representation to VFA
v Following a fixed policy π (or having access to prior data), the goal is to estimate V^π and/or Q^π
v Tabular: maintain a lookup table to store the estimates of V^π and/or Q^π, e.g. V(1), V(2), V(3), V(4), V(5), …
v Update these tabular estimates
  § after each episode (Monte Carlo methods), or
  § after each step (TD methods)

  21. Model-free Policy Evaluation: from tabular representation to VFA
v Following a fixed policy π (or having access to prior data), the goal is to estimate V^π and/or Q^π
v VFA: maintain a function parameter vector w to store the estimates of V^π and/or Q^π, i.e. V^π(s; w)
v Update the function parameter vector w
  § after each episode (Monte Carlo methods), or
  § after each step (TD methods)

  23. v From updating an initial V over iterations
  § MC: V(s_t) ← V(s_t) + α (G_t − V(s_t))
  § TD: V(s_t) ← V(s_t) + α (r_{t+1} + γ V(s_{t+1}) − V(s_t))
v To updating an initial w over iterations
  § Δw = −α (target − V^π(s_t; w)) ∇_w V^π(s_t; w), update w ← w − Δw

  24. Monte Carlo VFA policy evaluation
v The return G_t is an unbiased (but noisy) sample of V^π(s_t); use it in place of the oracle value
v Δw = −α (G_t − V^π(s_t; w)) ∇_w V^π(s_t; w), update w ← w − Δw

  25. Linear VFA with MC
v Linear approximation: V^π(s; w) = x(s)^T w, where x(s) is a feature vector for state s
v ∇_w V^π(s; w) = x(s), so Δw = −α (G_t − x(s_t)^T w) x(s_t)

  26. [state diagram with states s1–s7]
v Init w_0 = [1,1,1,1,1,1,1,1]^T, α = 0.5, γ = 1
v Episode: (s1, a1, 0, s7, a1, 0, s7, a1, 0, T)
v What is Δw, and what is w_1 = w_0 − Δw, after the update with the first visit of s1?

  27. [state diagram with states s1–s7]
v Feature vectors:
  x(s1) = [2,0,0,0,0,0,0,1]^T   x(s2) = [0,2,0,0,0,0,0,1]^T   x(s3) = [0,0,2,0,0,0,0,1]^T
  x(s4) = [0,0,0,2,0,0,0,1]^T   x(s5) = [0,0,0,0,2,0,0,1]^T   x(s6) = [0,0,0,0,0,2,0,1]^T
  x(s7) = [0,0,0,0,0,0,1,2]^T
v w_0 = [1,1,1,1,1,1,1,1]^T, α = 0.5, γ = 1; episode (s1, a1, 0, s7, a1, 0, s7, a1, 0, T)
v First visit of s1: G_{s1} = 0, V(s1) = x(s1)^T w_0 = 3
v Δw = −0.5 · (0 − 3) · [2,0,0,0,0,0,0,1]^T = [3,0,0,0,0,0,0,1.5]^T
v w_1 = w_0 − Δw = [1,1,1,1,1,1,1,1]^T − [3,0,0,0,0,0,0,1.5]^T = [−2,1,1,1,1,1,1,−0.5]^T
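A short Python check of the arithmetic on this slide (all numbers are taken from the example above):

    import numpy as np

    # Feature vector and initialization from the slide's example
    x_s1 = np.array([2, 0, 0, 0, 0, 0, 0, 1], dtype=float)
    w0 = np.ones(8)
    alpha = 0.5

    # Episode (s1, a1, 0, s7, a1, 0, s7, a1, 0, T): every reward is 0,
    # so the first-visit Monte Carlo return from s1 is G = 0
    G = 0.0

    v_hat_s1 = x_s1 @ w0                       # x(s1)^T w0 = 3
    delta_w = -alpha * (G - v_hat_s1) * x_s1   # [3, 0, 0, 0, 0, 0, 0, 1.5]
    w1 = w0 - delta_w
    print(w1)                                  # [-2.  1.  1.  1.  1.  1.  1. -0.5]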

  28. Recall: Tabular representation
v TD(0): V(s_t) ← V(s_t) + α (r_{t+1} + γ V(s_{t+1}) − V(s_t))

  29. TD policy evaluation with VFA
v Use the TD target r_{t+1} + γ V^π(s_{t+1}; w) in place of the oracle value
v Δw = −α (r_{t+1} + γ V^π(s_{t+1}; w) − V^π(s_t; w)) ∇_w V^π(s_t; w), update w ← w − Δw
v Linear VFA: Δw = −α (r_{t+1} + γ x(s_{t+1})^T w − x(s_t)^T w) x(s_t)

  30. Linear VFA with TD (Offline Practice)
[state diagram with states s1–s7]
v Init w_0 = [1,1,1,1,1,1,1,1]^T, α = 0.5, γ = 1
v TD update: what is w_1 after the update with the tuple (s1, a1, 1, s7)?

  31. Linear VFA with TD (Offline Practice)
[state diagram with states s1–s7]
v Init w_0 = [1,1,1,1,1,1,1,1]^T, α = 0.5, γ = 1
v TD update: what is w_1 after the update with the tuple (s1, a1, 1, s7)?
v Answer: w_1 = [2, 1, 1, 1, 1, 1, 1, 1.5]^T
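A short Python check of this answer, reusing the feature vectors x(s1) and x(s7) given on slide 27:

    import numpy as np

    x_s1 = np.array([2, 0, 0, 0, 0, 0, 0, 1], dtype=float)
    x_s7 = np.array([0, 0, 0, 0, 0, 0, 1, 2], dtype=float)
    w0 = np.ones(8)
    alpha, gamma, r = 0.5, 1.0, 1.0        # tuple (s1, a1, r=1, s7)

    td_target = r + gamma * (x_s7 @ w0)    # 1 + 3 = 4
    v_hat_s1 = x_s1 @ w0                   # 3
    delta_w = -alpha * (td_target - v_hat_s1) * x_s1
    w1 = w0 - delta_w
    print(w1)                              # [2.  1.  1.  1.  1.  1.  1.  1.5]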

  32. This Lecture
v Value Function Approximation (VFA)
  § Introduction
  § VFA for Policy Evaluation
  § VFA for Control

  33. Recall: Tabular representation
v SARSA: Q(s_t, a_t) ← Q(s_t, a_t) + α (r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t))

  34. Model-Free SARSA Control with VFA
v Represent the action-value function with parameters w: Q(s, a) ≈ Q(s, a; w)
v Δw = −α (r_{t+1} + γ Q(s_{t+1}, a_{t+1}; w) − Q(s_t, a_t; w)) ∇_w Q(s_t, a_t; w), update w ← w − Δw

  35. Recall: Tabular representation
v Q-Learning: Q(s_t, a_t) ← Q(s_t, a_t) + α (r_{t+1} + γ max_{a'} Q(s_{t+1}, a') − Q(s_t, a_t))

  36. Model-Free Q-Learning Control with Value Function Approximation (VFA)
v Represent the action-value function with parameters w: Q(s, a) ≈ Q(s, a; w)
v Use the Q-Learning target r_{t+1} + γ max_{a'} Q(s_{t+1}, a'; w) in place of the oracle value

  37. Model-Free Q-Learning Control with Value Function Approximation (VFA)
v Δw = −α (r_{t+1} + γ max_{a'} Q(s_{t+1}, a'; w) − Q(s_t, a_t; w)) ∇_w Q(s_t, a_t; w), update w ← w − Δw
v Linear VFA: Q(s, a; w) = x(s, a)^T w, so ∇_w Q(s, a; w) = x(s, a)
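A minimal sketch of this update in Python; the feature vectors and all numbers are hypothetical, and the helper names (q_hat, q_learning_update) are not from the slides, but the step follows the Δw and w ← w − Δw convention above.

    import numpy as np

    def q_hat(x_sa, w):
        # Linear action-value approximation: Q(s, a; w) = x(s, a)^T w
        return x_sa @ w

    def q_learning_update(w, x_sa, r, next_features, alpha, gamma, terminal):
        # One Q-Learning step: move w toward the target r + gamma * max_a' Q(s', a'; w)
        target = r if terminal else r + gamma * max(q_hat(x, w) for x in next_features)
        delta_w = -alpha * (target - q_hat(x_sa, w)) * x_sa
        return w - delta_w

    # Toy usage with made-up features for (s, a) and for (s', a') over two actions
    w = np.zeros(4)
    x_s_a = np.array([1.0, 0.0, 1.0, 0.0])
    next_feats = [np.array([0.0, 1.0, 0.0, 1.0]),
                  np.array([1.0, 1.0, 0.0, 0.0])]
    w = q_learning_update(w, x_s_a, r=1.0, next_features=next_feats,
                          alpha=0.5, gamma=0.9, terminal=False)
    print(w)   # [0.5 0.  0.5 0. ]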

  38. RL algorithms
v Tabular representation
  § Model-based control (DP)
    • Policy evaluation
    • Policy iteration / Value iteration (asynchronous)
  § Model-free control
    • Policy evaluation: MC (first/every visit) and TD
    • Value/Policy iteration: MC iteration; TD iteration (SARSA, Q-Learning, Double Q-Learning)
v Function representation
  § Value function approximation
  § Value function approximation for control
  § Policy function approximation (Advantage Actor-Critic: A2C, A3C)

  39. Project 3 is available
v Starts 10/15 Thursday; due 10/29 Thursday, midnight
v http://users.wpi.edu/~yli15/courses/DS595CS525Fall20/Assignments.html
v https://github.com/yingxue-zhang/DS595CS525-RL-Projects/tree/master/Project3

  40. Next Lecture
v (Continued) Value Function Approximation
  § Linear value function
v Review of Deep Learning
v Deep Learning implementation in PyTorch (by TA Yingxue)

  41. Questions?
