  1. Lecture 3: Model-Free Policy Evaluation: Policy Evaluation Without Knowing How the World Works [1]
     Emma Brunskill, CS234 Reinforcement Learning, Winter 2019
     [1] Material builds on structure from David Silver's Lecture 4: Model-Free Prediction. Other resources: Sutton and Barto, Jan 1 2018 draft, Chapter/Sections 5.1, 5.5, 6.1-6.3

  2. Today's Plan
     Last time: Markov reward / decision processes; policy evaluation & control when we have the true model (of how the world works)
     Today: policy evaluation without known dynamics & reward models
     Next time: control when we don't have a model of how the world works

  3. This Lecture: Policy Evaluation
     Estimating the expected return of a particular policy when we don't have access to the true MDP model
     - Dynamic programming
     - Monte Carlo policy evaluation: policy evaluation when we don't have a model of how the world works, given on-policy samples
     - Temporal Difference (TD)
     - Metrics to evaluate and compare algorithms

  4. Recall
     Definition of return, G_t (for an MRP): discounted sum of rewards from time step t to the horizon
         G_t = r_t + γ r_{t+1} + γ² r_{t+2} + γ³ r_{t+3} + ···
     Definition of state value function, V^π(s): expected return from starting in state s under policy π
         V^π(s) = E_π[G_t | s_t = s] = E_π[r_t + γ r_{t+1} + γ² r_{t+2} + γ³ r_{t+3} + ··· | s_t = s]
     Definition of state-action value function, Q^π(s, a): expected return from starting in state s, taking action a, and then following policy π
         Q^π(s, a) = E_π[G_t | s_t = s, a_t = a] = E_π[r_t + γ r_{t+1} + γ² r_{t+2} + γ³ r_{t+3} + ··· | s_t = s, a_t = a]
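A small worked illustration of the return (not from the slides): the snippet below computes G_t for a short, made-up reward sequence; the reward values and discount factor are assumptions chosen for the example.

    # Minimal sketch: G_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ... for a
    # finite reward sequence. The rewards and gamma below are made-up values.
    def discounted_return(rewards, gamma):
        """Return sum_k gamma**k * rewards[k]."""
        g = 0.0
        for k, r in enumerate(rewards):
            g += (gamma ** k) * r
        return g

    rewards = [1.0, 0.0, 2.0, 1.0]          # r_t, r_{t+1}, r_{t+2}, r_{t+3}
    print(discounted_return(rewards, 0.9))  # 1 + 0 + 0.81*2 + 0.729*1 = 3.349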

  5. Dynamic Programming for Policy Evaluation
     Initialize V^π_0(s) = 0 for all s
     For k = 1 until convergence:
         For all s in S:
             V^π_k(s) = r(s, π(s)) + γ Σ_{s′ ∈ S} p(s′ | s, π(s)) V^π_{k−1}(s′)

  6. Dynamic Programming for Policy π, Value Evaluation
     Initialize V^π_0(s) = 0 for all s
     For k = 1 until convergence:
         For all s in S:
             V^π_k(s) = r(s, π(s)) + γ Σ_{s′ ∈ S} p(s′ | s, π(s)) V^π_{k−1}(s′)
     V^π_k(s) is the exact k-horizon value of state s under policy π
     V^π_k(s) is an estimate of the infinite-horizon value of state s under policy π:
         V^π(s) = E_π[G_t | s_t = s] ≈ E_π[r_t + γ V_{k−1}(s_{t+1}) | s_t = s]
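A minimal Python sketch of this update for a small tabular MDP (not from the slides): the names dp_policy_evaluation, P, and R are assumptions for illustration, with P[s, s′] = p(s′ | s, π(s)) and R[s] = r(s, π(s)) precomputed for the fixed policy being evaluated.

    import numpy as np

    def dp_policy_evaluation(P, R, gamma, tol=1e-8):
        """Iterative policy evaluation for a fixed policy pi.

        P[s, s'] = p(s' | s, pi(s)) and R[s] = r(s, pi(s)) are assumed to be
        precomputed for the policy under evaluation (hypothetical inputs).
        """
        V = np.zeros(len(R))
        while True:
            # V_k(s) = r(s, pi(s)) + gamma * sum_{s'} p(s'|s, pi(s)) * V_{k-1}(s')
            V_new = R + gamma * P @ V
            if np.max(np.abs(V_new - V)) < tol:
                return V_new
            V = V_new

    # Made-up 2-state example: state 1 is absorbing with zero reward.
    P = np.array([[0.9, 0.1],
                  [0.0, 1.0]])
    R = np.array([1.0, 0.0])
    print(dp_policy_evaluation(P, R, gamma=0.9))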

  7.-12. Dynamic Programming Policy Evaluation (the same update, repeated across slides 7-12)
     V^π(s) ← E_π[r_t + γ V_{k−1} | s_t = s]
     Bootstrapping: the update for V uses an estimate (V_{k−1}) of the future return

  13. Policy Evaluation: V^π(s) = E_π[G_t | s_t = s]
     G_t = r_t + γ r_{t+1} + γ² r_{t+2} + γ³ r_{t+3} + ··· in MDP M under policy π
     Dynamic programming: V^π(s) ≈ E_π[r_t + γ V_{k−1} | s_t = s]
     - Requires a model of the MDP M
     - Bootstraps the future return using a value estimate
     - Requires the Markov assumption: bootstrapping regardless of history
     What if we don't know the dynamics model P and/or the reward model R?
     Today: policy evaluation without a model
     - Given data and/or the ability to interact in the environment
     - Efficiently compute a good estimate of the value of a policy π

  14. This Lecture Overview: Policy Evaluation
     - Dynamic programming
     - Evaluating the quality of an estimator
     - Monte Carlo policy evaluation: policy evaluation when we don't know the dynamics and/or reward model, given on-policy samples
     - Temporal Difference (TD)
     - Metrics to evaluate and compare algorithms

  15. Monte Carlo (MC) Policy Evaluation
     G_t = r_t + γ r_{t+1} + γ² r_{t+2} + γ³ r_{t+3} + ··· in MDP M under policy π
     V^π(s) = E_{T∼π}[G_t | s_t = s], an expectation over trajectories T generated by following π
     Simple idea: value = mean return
     If trajectories are all finite, sample a set of trajectories & average the returns
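A toy sketch of "value = mean return" (not from the slides): sample_return is a hypothetical helper that simulates one episode under π starting from state s and returns its discounted return, so the value estimate is just an average over rollouts.

    def mc_value_estimate(sample_return, s, num_episodes=1000):
        """Estimate V^pi(s) as the average return over sampled episodes.

        sample_return(s) is a hypothetical simulator hook: it rolls out one
        episode under pi starting from s and returns the discounted return G.
        """
        total = 0.0
        for _ in range(num_episodes):
            total += sample_return(s)
        return total / num_episodes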

  16. Monte Carlo (MC) Policy Evaluation
     If trajectories are all finite, sample a set of trajectories & average the returns
     - Does not require the MDP dynamics/reward models
     - No bootstrapping
     - Does not assume the state is Markov
     - Can only be applied to episodic MDPs: averaging is over returns from complete episodes, so each episode must terminate

  17. Monte Carlo (MC) On-Policy Evaluation
     Aim: estimate V^π(s) given episodes generated under policy π
         s_1, a_1, r_1, s_2, a_2, r_2, . . . where the actions are sampled from π
     G_t = r_t + γ r_{t+1} + γ² r_{t+2} + γ³ r_{t+3} + ··· in MDP M under policy π
     V^π(s) = E_π[G_t | s_t = s]
     MC computes the empirical mean return
     Often done in an incremental fashion: after each episode, update the estimate of V^π
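For reference, the incremental update mentioned above can be written as the standard running-mean identity (stated here for clarity, not quoted from this slide): after the N(s)-th return G observed from state s,

     V^π(s) ← V^π(s) + (1 / N(s)) (G − V^π(s))

which leaves V^π(s) equal to the empirical mean of the returns seen so far.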

  18. First-Visit Monte Carlo (MC) On-Policy Evaluation
     Initialize N(s) = 0, G(s) = 0 ∀ s ∈ S
     Loop:
         Sample episode i = s_{i,1}, a_{i,1}, r_{i,1}, s_{i,2}, a_{i,2}, r_{i,2}, . . . , s_{i,T_i}
         Define G_{i,t} = r_{i,t} + γ r_{i,t+1} + γ² r_{i,t+2} + ··· + γ^{T_i−1} r_{i,T_i} as the return from time step t onwards in the i-th episode
         For each state s visited in episode i:
             For the first time t that state s is visited in episode i:
                 Increment counter of total first visits: N(s) = N(s) + 1
                 Increment total return: G(s) = G(s) + G_{i,t}
                 Update estimate: V^π(s) = G(s) / N(s)
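A minimal Python sketch of this first-visit procedure (not from the slides): episodes are assumed to be lists of (state, action, reward) tuples collected under π, and the function name and data format are assumptions for illustration.

    from collections import defaultdict

    def first_visit_mc(episodes, gamma):
        """First-visit Monte Carlo policy evaluation.

        episodes: iterable of episodes, each a list of (state, action, reward)
        tuples collected under the policy being evaluated (assumed format).
        Returns a dict mapping state -> estimated V^pi(state).
        """
        N = defaultdict(int)        # number of first visits to each state
        G_sum = defaultdict(float)  # total return following those first visits

        for episode in episodes:
            # Returns G_t for every time step t, computed by a backward scan.
            returns = [0.0] * len(episode)
            g = 0.0
            for t in reversed(range(len(episode))):
                _, _, r = episode[t]
                g = r + gamma * g
                returns[t] = g

            seen = set()
            for t, (s, _, _) in enumerate(episode):
                if s not in seen:   # only the first visit to s counts
                    seen.add(s)
                    N[s] += 1
                    G_sum[s] += returns[t]

        return {s: G_sum[s] / N[s] for s in N}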

  19. Bias, Variance and MSE
     Consider a statistical model parameterized by θ that determines a probability distribution over observed data, P(x | θ)
     Consider a statistic θ̂ that provides an estimate of θ and is a function of the observed data x
         E.g. for a Gaussian distribution with known variance, the average of a set of i.i.d. data points is an estimate of the mean of the Gaussian
     Definition: the bias of an estimator θ̂ is Bias_θ(θ̂) = E_{x|θ}[θ̂] − θ
     Definition: the variance of an estimator θ̂ is Var(θ̂) = E_{x|θ}[(θ̂ − E[θ̂])²]
     Definition: the mean squared error (MSE) of an estimator θ̂ is MSE(θ̂) = Var(θ̂) + Bias_θ(θ̂)²
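For reference (a standard decomposition, not spelled out on the slide): writing the MSE as the expected squared error E_{x|θ}[(θ̂ − θ)²] and adding and subtracting E[θ̂] gives

     E[(θ̂ − θ)²] = E[(θ̂ − E[θ̂])²] + (E[θ̂] − θ)² + 2 (E[θ̂] − θ) E[θ̂ − E[θ̂]]
                 = Var(θ̂) + Bias_θ(θ̂)²

since the cross term vanishes (E[θ̂ − E[θ̂]] = 0).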
