

  1. This lecture will be recorded!!! Welcome to DS595/CS525 Reinforcement Learning, Prof. Yanhua Li. Time: 6:00pm–8:50pm R, Zoom lecture, Fall 2020

  2. Last Lecture
v Model-Free Control
  § Generalized policy iteration
  § Control with exploration
  § Monte Carlo (MC) policy iteration
  § Temporal-Difference (TD) policy iteration
    • SARSA
    • Q-Learning
  § Maximization bias and Double Q-Learning
  § Project 2 description

  3. This Lecture
v Value Function Approximation (VFA)
  § Introduction
  § VFA for Policy Evaluation
  § VFA for Control

  4. This Lecture
v Value Function Approximation (VFA)
  § Introduction
  § VFA for Policy Evaluation
  § VFA for Control

  5. RL algorithms
v Tabular representation
  § Model-based control (DP)
    • Policy evaluation
    • Policy iteration / Value iteration (asynchronous)
  § Model-free control
    • Policy evaluation: MC (first/every visit) and TD
    • Value/Policy iteration: MC iteration; TD iteration (SARSA, Q-Learning, Double Q-Learning)
v Function representation
  § Value function approximation
  § Value function approximation for control
  § Policy function approximation (Advantage Actor-Critic: A2C, A3C)

  6. Value function representations
v Tabular representation: V^π can be viewed as a vector of |S| dimensions
v For enormous state and/or action spaces, the tabular representation is insufficient

  7. Value function representations
v Tabular representation: S → V^π, a vector of |S| dimensions; insufficient for enormous state and/or action spaces
v Approximate representation: s → V^π(s; w), parameterized by w with k dimensions, k << |S|

  8. Value Function Approximation (VFA)
v Represent a (state or state-action) value function with a parameterized function instead of a table
  § State-value function: V^π(s) ≈ V^π(s; w), a function of the state s and the parameters w
  § Action-value function: Q^π(s, a) ≈ Q^π(s, a; w), a function of the state s, the action a, and the parameters w

  9. Why VFA? Benefits of VFA?
v V^π(s) ≈ V^π(s; w)
v Q^π(s, a) ≈ Q^π(s, a; w)

  10. Why VFA? Benefits of VFA?
v Huge state and/or action spaces make a tabular representation impossible to store
v We want a more compact representation that generalizes across states, or across states and actions
v V^π(s) ≈ V^π(s; w), Q^π(s, a) ≈ Q^π(s, a; w)

  11. Benefits of Generalization via VFA
v Huge state and/or action spaces
  § VFA reduces the memory needed
v More compact representation that generalizes across states or state-action pairs
  § Generalization across states / state-action pairs
  § Advantage of the tabular form: exact values of s, or of (s, a)
v A trade-off
  § Capacity vs. (computational and space) efficiency
v V^π(s) ≈ V^π(s; w), with w of k dimensions, k << |S|

  12. What function for VFA?
v V^π(s) ≈ V^π(s; w)
v Q^π(s, a) ≈ Q^π(s, a; w)
v What function approximator should we use?

  13. What function for VFA?
v Many possible function approximators, including
  § Linear combinations of features
  § Neural networks
  § Decision trees
  § Nearest neighbors, and more
v In this class we will focus on function approximators that are differentiable (Why?)
v Two very popular classes of differentiable function approximators:
  § Linear feature representations (a minimal sketch follows below)
  § Neural networks (Deep Reinforcement Learning)
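As a sketch of the linear case, the snippet below represents the value function as a dot product between a feature vector and a weight vector; the feature map `features` and the example state are made up for illustration and are not part of the slides.

    import numpy as np

    # Hypothetical feature map: each state is described by k features, k << |S|
    def features(state):
        return np.array([state["pos"], state["vel"], 1.0])   # last entry is a bias feature

    def v_hat(state, w):
        # Linear VFA: V_hat(s; w) = x(s)^T w
        return features(state) @ w

    def grad_v_hat(state, w):
        # Gradient of the linear VFA w.r.t. w is just the feature vector x(s)
        return features(state)

    w = np.zeros(3)                        # k = 3 parameters instead of |S| table entries
    s = {"pos": 0.4, "vel": -1.2}
    print(v_hat(s, w), grad_v_hat(s, w))   # 0.0, [ 0.4 -1.2  1. ]

The gradient being simply the feature vector is what makes linear VFA convenient for the gradient-based updates introduced next.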

  14. This Lecture
v Value Function Approximation (VFA)
  § Introduction
  § VFA for Policy Evaluation
  § VFA for Control

  15. Review: Gradient Descent
v Consider a function J(w) that is a differentiable function of a parameter vector w
v The goal is to find the parameter vector w that minimizes J
v The gradient of J(w) is ∇_w J(w) = [∂J(w)/∂w_1, …, ∂J(w)/∂w_n]^T

  16. Review: Gradient Descent
v Consider a function J(w) that is a differentiable function of a parameter vector w
v The goal is to find the parameter vector w that minimizes J
v The gradient of J(w) is ∇_w J(w) = [∂J(w)/∂w_1, …, ∂J(w)/∂w_n]^T
v The gradient vector points in the uphill direction
v To minimize J(w), subtract the α-weighted gradient vector from w in each iteration: w ← w − α ∇_w J(w)
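A tiny numeric sketch of this iteration on a made-up one-dimensional objective J(w) = (w − 3)², chosen only to show the update rule converging; none of these numbers come from the slides.

    def J(w):
        return (w - 3.0) ** 2          # toy objective, minimized at w = 3

    def grad_J(w):
        return 2.0 * (w - 3.0)         # dJ/dw

    w, alpha = 0.0, 0.1
    for _ in range(50):
        w = w - alpha * grad_J(w)      # step against the (uphill) gradient
    print(w)                           # approximately 3.0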

  17. VFA problem
v Suppose an oracle function exists that takes a state s as input and outputs the true V^π(s)
  § The oracle may not be accessible in practice (that is the model-free problem setting)
v The objective is to find the best approximate representation of V^π(s), given a particular parameterized function V^π(s; w)

  18. VFA objective
v J(w) = ½ E_π[(V^π(s) − V^π(s; w))²]
v Without loss of generality, a constant factor of ½ is added; it cancels the 2 that appears when differentiating the squared error

  19. From the full gradient to the stochastic gradient
v J(w) = ½ E_π[(V^π(s) − V^π(s; w))²]
v Full gradient: Δw = α ∇_w J(w) = −α E_π[(V^π(s) − V^π(s; w)) ∇_w V^π(s; w)], update w ← w − Δw
v Stochastic gradient: sample a single state s and drop the expectation: Δw = −α (V^π(s) − V^π(s; w)) ∇_w V^π(s; w)
v Without loss of generality, the constant factor ½ cancels against the 2 from differentiation
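A minimal numeric sketch of the two update styles, using the same Δw and w ← w − Δw convention as the worked examples later in the deck; the feature matrix X, the pretend "oracle" values V_pi, and α are all made up for illustration.

    import numpy as np

    # Hypothetical toy problem: 3 states, 2 features each, and made-up "oracle" values
    X = np.array([[1.0, 0.0],    # x(s0)
                  [0.0, 1.0],    # x(s1)
                  [1.0, 1.0]])   # x(s2)
    V_pi = np.array([4.0, 2.0, 5.0])   # pretend oracle V^pi(s), illustration only
    w = np.zeros(2)
    alpha = 0.1

    # Full-gradient step: expectation (here: average) over all states
    errors = V_pi - X @ w                           # V^pi(s) - V_hat(s; w) for every state
    delta_w_full = -alpha * (X.T @ errors) / len(X)
    w_full = w - delta_w_full

    # Stochastic-gradient step: same formula, but on one sampled state only
    i = np.random.randint(len(X))
    delta_w_sgd = -alpha * (V_pi[i] - X[i] @ w) * X[i]
    w_sgd = w - delta_w_sgd
    print(w_full, w_sgd)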

  20. Model-free Policy Evaluation: from tabular representation to VFA
v Following a fixed policy π (or having access to prior data), the goal is to estimate V^π and/or Q^π
v Tabular: maintain a lookup table to store the estimates of V^π and/or Q^π, e.g. V(1), V(2), V(3), V(4), V(5), …
v Update these tabular estimates
  § after each episode (Monte Carlo methods), or
  § after each step (TD methods)

  21. Model-free Policy Evaluation: from tabular representation to VFA
v Following a fixed policy π (or having access to prior data), the goal is to estimate V^π and/or Q^π
v VFA: maintain a function parameter vector w to store the estimates of V^π and/or Q^π, i.e. V^π(s; w)
v Update the function parameter vector w
  § after each episode (Monte Carlo methods), or
  § after each step (TD methods)

  23. v From updating an initial V over iterations
  § MC: V(s_t) ← V(s_t) + α (G_t − V(s_t))
  § TD: V(s_t) ← V(s_t) + α (r_{t+1} + γ V(s_{t+1}) − V(s_t))
v To updating an initial w over iterations
  § Δw = −α (target − V^π(s_t; w)) ∇_w V^π(s_t; w), update w ← w − Δw

  24. Monte Carlo VFA policy evaluation
v The return G_t is an unbiased (but noisy) sample of V^π(s_t); use it in place of the oracle value
v Δw = −α (G_t − V^π(s_t; w)) ∇_w V^π(s_t; w), update w ← w − Δw

  25. Linear VFA with MC
v Linear approximation: V^π(s; w) = x(s)^T w, where x(s) is a feature vector for state s
v ∇_w V^π(s; w) = x(s), so Δw = −α (G_t − x(s_t)^T w) x(s_t)

  26. [state diagram with states s1–s7]
v Init w_0 = [1,1,1,1,1,1,1,1]^T, α = 0.5, γ = 1
v Episode: (s1, a1, 0, s7, a1, 0, s7, a1, 0, T)
v What is Δw, and what is w_1 = w_0 − Δw, after the update with the first visit of s1?

  27. [state diagram with states s1–s7]
v Feature vectors:
  x(s1) = [2,0,0,0,0,0,0,1]^T   x(s2) = [0,2,0,0,0,0,0,1]^T   x(s3) = [0,0,2,0,0,0,0,1]^T
  x(s4) = [0,0,0,2,0,0,0,1]^T   x(s5) = [0,0,0,0,2,0,0,1]^T   x(s6) = [0,0,0,0,0,2,0,1]^T
  x(s7) = [0,0,0,0,0,0,1,2]^T
v w_0 = [1,1,1,1,1,1,1,1]^T, α = 0.5, γ = 1; episode (s1, a1, 0, s7, a1, 0, s7, a1, 0, T)
v First visit of s1: G_{s1} = 0, V(s1) = x(s1)^T w_0 = 3
v Δw = −0.5 · (0 − 3) · [2,0,0,0,0,0,0,1]^T = [3,0,0,0,0,0,0,1.5]^T
v w_1 = w_0 − Δw = [1,1,1,1,1,1,1,1]^T − [3,0,0,0,0,0,0,1.5]^T = [−2,1,1,1,1,1,1,−0.5]^T
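A short Python check of the arithmetic on this slide (all numbers are taken from the example above):

    import numpy as np

    # Feature vector and initialization from the slide's example
    x_s1 = np.array([2, 0, 0, 0, 0, 0, 0, 1], dtype=float)
    w0 = np.ones(8)
    alpha = 0.5

    # Episode (s1, a1, 0, s7, a1, 0, s7, a1, 0, T): every reward is 0,
    # so the first-visit Monte Carlo return from s1 is G = 0
    G = 0.0

    v_hat_s1 = x_s1 @ w0                       # x(s1)^T w0 = 3
    delta_w = -alpha * (G - v_hat_s1) * x_s1   # [3, 0, 0, 0, 0, 0, 0, 1.5]
    w1 = w0 - delta_w
    print(w1)                                  # [-2.  1.  1.  1.  1.  1.  1. -0.5]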

  28. Recall: Tabular representation
v TD(0): V(s_t) ← V(s_t) + α (r_{t+1} + γ V(s_{t+1}) − V(s_t))

  29. TD policy evaluation with VFA
v Use the TD target r_{t+1} + γ V^π(s_{t+1}; w) in place of the oracle value
v Δw = −α (r_{t+1} + γ V^π(s_{t+1}; w) − V^π(s_t; w)) ∇_w V^π(s_t; w), update w ← w − Δw
v Linear VFA: Δw = −α (r_{t+1} + γ x(s_{t+1})^T w − x(s_t)^T w) x(s_t)

  30. Linear VFA with TD (Offline Practice)
[state diagram with states s1–s7]
v Init w_0 = [1,1,1,1,1,1,1,1]^T, α = 0.5, γ = 1
v TD update: what is w_1 after the update with the tuple (s1, a1, 1, s7)?

  31. Linear VFA with TD (Offline Practice)
[state diagram with states s1–s7]
v Init w_0 = [1,1,1,1,1,1,1,1]^T, α = 0.5, γ = 1
v TD update: what is w_1 after the update with the tuple (s1, a1, 1, s7)?
v Answer: w_1 = [2, 1, 1, 1, 1, 1, 1, 1.5]^T
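A short Python check of this answer, reusing the feature vectors x(s1) and x(s7) given on slide 27:

    import numpy as np

    x_s1 = np.array([2, 0, 0, 0, 0, 0, 0, 1], dtype=float)
    x_s7 = np.array([0, 0, 0, 0, 0, 0, 1, 2], dtype=float)
    w0 = np.ones(8)
    alpha, gamma, r = 0.5, 1.0, 1.0        # tuple (s1, a1, r=1, s7)

    td_target = r + gamma * (x_s7 @ w0)    # 1 + 3 = 4
    v_hat_s1 = x_s1 @ w0                   # 3
    delta_w = -alpha * (td_target - v_hat_s1) * x_s1
    w1 = w0 - delta_w
    print(w1)                              # [2.  1.  1.  1.  1.  1.  1.  1.5]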

  32. This Lecture
v Value Function Approximation (VFA)
  § Introduction
  § VFA for Policy Evaluation
  § VFA for Control

  33. Recall: Tabular representation
v SARSA: Q(s_t, a_t) ← Q(s_t, a_t) + α (r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t))

  34. Model-Free SARSA Control with VFA
v Represent the action-value function with parameters w: Q(s, a) ≈ Q(s, a; w)
v Δw = −α (r_{t+1} + γ Q(s_{t+1}, a_{t+1}; w) − Q(s_t, a_t; w)) ∇_w Q(s_t, a_t; w), update w ← w − Δw

  35. Recall: Tabular representation
v Q-Learning: Q(s_t, a_t) ← Q(s_t, a_t) + α (r_{t+1} + γ max_{a'} Q(s_{t+1}, a') − Q(s_t, a_t))

  36. Model-Free Q-Learning Control with Value Function Approximation (VFA)
v Represent the action-value function with parameters w: Q(s, a) ≈ Q(s, a; w)
v Use the Q-Learning target r_{t+1} + γ max_{a'} Q(s_{t+1}, a'; w) in place of the oracle value

  37. Model-Free Q-Learning Control with Value Function Approximation (VFA)
v Δw = −α (r_{t+1} + γ max_{a'} Q(s_{t+1}, a'; w) − Q(s_t, a_t; w)) ∇_w Q(s_t, a_t; w), update w ← w − Δw
v Linear VFA: Q(s, a; w) = x(s, a)^T w, so ∇_w Q(s, a; w) = x(s, a)
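A minimal sketch of this update in Python; the feature vectors and all numbers are hypothetical, and the helper names (q_hat, q_learning_update) are not from the slides, but the step follows the Δw and w ← w − Δw convention above.

    import numpy as np

    def q_hat(x_sa, w):
        # Linear action-value approximation: Q(s, a; w) = x(s, a)^T w
        return x_sa @ w

    def q_learning_update(w, x_sa, r, next_features, alpha, gamma, terminal):
        # One Q-Learning step: move w toward the target r + gamma * max_a' Q(s', a'; w)
        target = r if terminal else r + gamma * max(q_hat(x, w) for x in next_features)
        delta_w = -alpha * (target - q_hat(x_sa, w)) * x_sa
        return w - delta_w

    # Toy usage with made-up features for (s, a) and for (s', a') over two actions
    w = np.zeros(4)
    x_s_a = np.array([1.0, 0.0, 1.0, 0.0])
    next_feats = [np.array([0.0, 1.0, 0.0, 1.0]),
                  np.array([1.0, 1.0, 0.0, 0.0])]
    w = q_learning_update(w, x_s_a, r=1.0, next_features=next_feats,
                          alpha=0.5, gamma=0.9, terminal=False)
    print(w)   # [0.5 0.  0.5 0. ]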

  38. RL algorithms
v Tabular representation
  § Model-based control (DP)
    • Policy evaluation
    • Policy iteration / Value iteration (asynchronous)
  § Model-free control
    • Policy evaluation: MC (first/every visit) and TD
    • Value/Policy iteration: MC iteration; TD iteration (SARSA, Q-Learning, Double Q-Learning)
v Function representation
  § Value function approximation
  § Value function approximation for control
  § Policy function approximation (Advantage Actor-Critic: A2C, A3C)

  39. Project 3 is available
v Starts 10/15 Thursday; due 10/29 Thursday, midnight
v http://users.wpi.edu/~yli15/courses/DS595CS525Fall20/Assignments.html
v https://github.com/yingxue-zhang/DS595CS525-RL-Projects/tree/master/Project3

  40. Next Lecture
v (Continued) Value Function Approximation
  § Linear value function
v Review of Deep Learning
v Deep Learning implementation in PyTorch (by TA Yingxue)

  41. Questions?
