Value Function Methods
CS 285, UC Berkeley
Instructor: Sergey Levine
Recap: actor-critic. The usual loop: generate samples (i.e. run the policy), fit a model to estimate return, improve the policy.
Can we omit the policy gradient completely? Forget explicit policies and just do this: if we know the advantage A^π(s,a), then arg max_a A^π(s,a) is at least as good as any action we would have sampled from π, so we can improve the policy by acting greedily with respect to the estimated values.
Policy iteration. High-level idea: alternate between (1) policy evaluation, i.e. fit a model to estimate the return of the current policy, and (2) policy improvement, i.e. set the new policy to be greedy with respect to that estimate. The question is how to do the evaluation step.
Dynamic programming. [Figure: 4x4 gridworld with a tabular value estimate for each state.] With a small, discrete state and action space we can store V(s) in a table and repeatedly apply the bootstrapped backup V(s) ← E_{a~π(a|s)}[r(s,a) + γ E[V(s')]], just using the current estimate of V on the right-hand side.
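A minimal sketch of this backup in numpy. The random dynamics T, rewards r, and uniform policy pi below are placeholders standing in for the gridworld on the slide, not part of the lecture:

```python
import numpy as np

# Minimal sketch: bootstrapped (dynamic programming) policy evaluation on a
# small tabular MDP. T, r, and pi are random placeholders for illustration.
S, A, gamma = 16, 4, 0.99
rng = np.random.default_rng(0)
T = rng.dirichlet(np.ones(S), size=(A, S))   # T[a, s, s'] = p(s' | s, a)
r = rng.uniform(size=(S, A))                 # r[s, a]
pi = np.full((S, A), 1.0 / A)                # uniform policy, for illustration

V = np.zeros(S)                              # current value estimate
for _ in range(500):
    # Q(s, a) = r(s, a) + gamma * E_{s'}[V(s')], using the *current* estimate of V
    Q = r + gamma * np.einsum("ast,t->sa", T, V)
    # V(s) = E_{a ~ pi(a|s)}[Q(s, a)]
    V = (pi * Q).sum(axis=1)
```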
Policy iteration with dynamic programming. [Figure: the same 4x4 gridworld value table.] Use the bootstrapped backup above as the policy evaluation step ("fit a model to estimate return"), then improve the policy by making it greedy with respect to the resulting values, and repeat.
Even simpler dynamic programming (value iteration): skip the explicit policy entirely and fold the improvement step into the backup, V(s) ← max_a (r(s,a) + γ E[V(s')]). The max over actions approximates the value of the new (greedy) policy!
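The same placeholder MDP as above, but with the "even simpler" max backup (tabular value iteration); again a sketch, not course reference code:

```python
import numpy as np

# Minimal sketch: tabular value iteration with the max backup.
# Same placeholder MDP shapes as the policy evaluation sketch above.
S, A, gamma = 16, 4, 0.99
rng = np.random.default_rng(0)
T = rng.dirichlet(np.ones(S), size=(A, S))   # T[a, s, s'] = p(s' | s, a)
r = rng.uniform(size=(S, A))                 # r[s, a]

V = np.zeros(S)
for _ in range(500):
    Q = r + gamma * np.einsum("ast,t->sa", T, V)
    V = Q.max(axis=1)   # no explicit policy: the max over actions
                        # approximates the value of the implicit greedy policy
```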
Fitted Value Iteration & Q-Iteration
Fitted value iteration. For large or continuous state spaces a table with one entry per state is hopeless (the curse of dimensionality), so represent the value function with a neural network V_φ(s) and turn the backup into a regression problem, as written out below.
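Written out (a sketch of the standard update; V_φ denotes the network and the expectation is over next states under the dynamics), fitted value iteration repeats:

```latex
y_i \;\leftarrow\; \max_{a_i}\Big( r(s_i, a_i) + \gamma\, \mathbb{E}\big[ V_\phi(s_i') \big] \Big),
\qquad
\phi \;\leftarrow\; \arg\min_{\phi} \tfrac{1}{2} \sum_i \big\lVert V_\phi(s_i) - y_i \big\rVert^2 .
```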
What if we don't know the transition dynamics? The max over actions requires knowing the outcome of every action from each state, i.e. the dynamics. Back to policy iteration… but evaluate Q^π(s,a) instead of V^π(s): the Q-function backup can be fit using sampled transitions alone.
Can we do the "max" trick again? Forget the policy and compute the value directly: can we do this with Q-values, without knowing the transitions? Yes, and it doesn't require simulating different actions (see the backup written out below).
+ works even for off-policy samples (unlike actor-critic)
+ only one network, no high-variance policy gradient
- no convergence guarantees for non-linear function approximation (more on this later)
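In symbols (a sketch of the standard Q-function backup this slide refers to): the policy-evaluation form and the "max" form are

```latex
Q^{\pi}(s, a) \;\leftarrow\; r(s, a) + \gamma\, \mathbb{E}_{s' \sim p(s' \mid s, a)}\big[ Q^{\pi}(s', \pi(s')) \big],
\qquad \pi'(s) = \arg\max_{a} Q^{\pi}(s, a),
% folding the greedy improvement into the backup:
Q(s, a) \;\leftarrow\; r(s, a) + \gamma\, \mathbb{E}_{s'}\big[ \max_{a'} Q(s', a') \big]
\;\approx\; r(s, a) + \gamma \max_{a'} Q(s', a') .
```

The expectation over s' is approximated with the single sampled next state from the data, which is why no model of the transitions is required.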
Fitted Q-iteration: collect a dataset of transitions (s, a, s', r) with any policy, compute targets y ← r(s, a) + γ max_{a'} Q_φ(s', a'), regress Q_φ(s, a) onto these targets, and repeat.
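A minimal PyTorch sketch of this loop. The network size, dataset shapes, and the random placeholder transitions are illustrative assumptions, not the course's reference implementation:

```python
import torch
import torch.nn as nn

# Minimal sketch of fitted Q-iteration. Assumes a dataset of transitions
# (s, a, r, s', done) has already been collected by *any* policy (off-policy).
obs_dim, n_actions, gamma = 8, 4, 0.99
q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)

# Placeholder dataset: N random transitions standing in for real data.
N = 1024
s = torch.randn(N, obs_dim)
a = torch.randint(n_actions, (N,))
r = torch.randn(N)
s2 = torch.randn(N, obs_dim)
done = torch.zeros(N)

for _ in range(100):                       # outer iterations
    # 1. compute targets y = r + gamma * max_a' Q(s', a') with the current Q
    with torch.no_grad():
        y = r + gamma * (1 - done) * q_net(s2).max(dim=1).values
    # 2. regress Q(s, a) onto the (fixed) targets
    for _ in range(50):                    # inner gradient steps
        q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
        loss = ((q_sa - y) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
```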
Review
• Value-based methods: don't learn a policy explicitly, just learn the value function or Q-function.
• If we have the value function (or Q-function), we have a policy.
• Fitted Q-iteration.
From Q-Iteration to Q-Learning
Why is this algorithm off-policy? Fitted Q-iteration only needs a dataset of transitions (s, a, s', r); neither the target max_{a'} Q_φ(s', a') nor the regression depends on which policy collected those transitions.
What is fitted Q-iteration optimizing? It minimizes the Bellman error, the expected squared difference between Q_φ(s, a) and the bootstrapped target. In the tabular case driving this error to zero recovers the optimal Q-function, but most guarantees are lost when we leave the tabular case (e.g., use neural networks).
Online Q-learning algorithms: the online analogue of fitted Q-iteration. Repeat: (1) take an action a and observe the transition (s, a, s', r); (2) compute the target y = r(s, a) + γ max_{a'} Q_φ(s', a'); (3) take one gradient step on the squared error, φ ← φ − α (Q_φ(s, a) − y) dQ_φ(s, a)/dφ. Step 1 is off-policy, so there are many choices here for how to act!
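A minimal PyTorch sketch of the online version, one transition and one gradient step at a time. The tiny random env_step function is a hypothetical stand-in for a real environment, and epsilon-greedy action selection is just one possible choice for step 1:

```python
import torch
import torch.nn as nn

# Minimal sketch of online Q-learning: act, observe one transition, take one
# gradient step. The "environment" below is a random placeholder.
obs_dim, n_actions, gamma, epsilon = 8, 4, 0.99, 0.1
q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def env_step(s, a):
    # Placeholder dynamics/reward, standing in for a real environment.
    return torch.randn(obs_dim), torch.randn(()).item()

s = torch.randn(obs_dim)
for _ in range(1000):
    # 1. act (any exploratory behavior policy works; the method is off-policy)
    if torch.rand(()) < epsilon:
        a = torch.randint(n_actions, ()).item()
    else:
        a = q_net(s).argmax().item()
    s2, r = env_step(s, a)
    # 2. bootstrapped target (no gradient flows through it)
    with torch.no_grad():
        y = r + gamma * q_net(s2).max()
    # 3. one gradient step on the squared error for this single transition
    loss = (q_net(s)[a] - y) ** 2
    opt.zero_grad()
    loss.backward()
    opt.step()
    s = s2
```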
Exploration with Q-learning. The final policy is the greedy arg max policy; why is this a bad idea for step 1? Because it is deterministic: if the initial Q estimate is poor, acting greedily may never visit the actions needed to correct it. Common remedies are "epsilon-greedy" (act greedily with probability 1 − ε, otherwise pick a random action) and "Boltzmann exploration" (sample actions with probability proportional to exp Q(s, a)). We'll discuss exploration in detail in a later lecture!
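A small sketch of both exploration rules given a vector of Q-values for the current state. The temperature parameter in the Boltzmann rule is an added convenience, not something the slide specifies:

```python
import numpy as np

def epsilon_greedy(q_values, epsilon=0.1, rng=None):
    # With probability epsilon pick a uniformly random action, else the argmax.
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def boltzmann(q_values, temperature=1.0, rng=None):
    # Sample actions in proportion to exp(Q / temperature) (softmax).
    if rng is None:
        rng = np.random.default_rng()
    logits = np.asarray(q_values, dtype=float) / temperature
    logits -= logits.max()                    # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(rng.choice(len(q_values), p=probs))
```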
Review
• Value-based methods: don't learn a policy explicitly, just learn the value function or Q-function; if we have the value function (or Q-function), we have a policy.
• Fitted Q-iteration: batch-mode, off-policy method.
• Q-learning: the online analogue of fitted Q-iteration.
Value Functions in Theory
Value function learning theory. [Figure: tabular value iteration on the 4x4 gridworld.] Does value iteration converge, and if so, to what? To answer this, write the backup as an operator B and study the update V ← BV.
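In symbols (standard definitions, matching the review bullets at the end of the lecture): the backup operator, its fixed point, and the contraction property are

```latex
(\mathcal{B}V)(s) \;=\; \max_{a}\Big( r(s,a) + \gamma\, \mathbb{E}_{s' \sim p(s' \mid s, a)}\big[ V(s') \big] \Big),
\qquad V^{\star} = \mathcal{B}V^{\star},
\qquad \lVert \mathcal{B}V - \mathcal{B}\bar{V} \rVert_{\infty} \;\le\; \gamma\, \lVert V - \bar{V} \rVert_{\infty} .
```

Because B is a contraction in the infinity norm, repeatedly applying it (tabular value iteration) converges to V*.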
Non-tabular value function learning: fitted value iteration restricts the value function to a function class Ω (e.g. neural networks), which amounts to composing the backup B with a projection Π onto Ω.
Conclusions: value iteration converges (tabular case); fitted value iteration does not converge, not in general, and often not in practice.
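The operators behind this conclusion (a sketch using the standard definitions; Ω is the set of value functions the network can represent):

```latex
\Pi V \;=\; \arg\min_{V' \in \Omega} \tfrac{1}{2} \sum_{s} \big\lVert V'(s) - V(s) \big\rVert^{2},
\qquad \text{fitted value iteration:}\;\; V \;\leftarrow\; \Pi \mathcal{B} V .
```

B is a contraction in the ∞-norm and Π is a contraction in the L2 norm (it is a projection), but their composition ΠB need not be a contraction in any norm, which is why fitted value iteration can diverge.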
What about fitted Q-iteration? The same argument applies: the Q-function backup is a contraction in the ∞-norm and the projection is a contraction in the L2 norm, but their composition is not a contraction, so fitted Q-iteration also has no convergence guarantee in general. This applies to online Q-learning as well.
But… it's just regression! Isn't regression supposed to converge? Q-learning is not gradient descent: there is no gradient through the target value, so the update is not the gradient of any well-defined objective.
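A tiny PyTorch illustration of that point (the network and inputs are placeholders): the "loss" looks like ordinary regression, but the target also depends on the same parameters and that dependence is deliberately cut, so the step is not the gradient of any fixed objective:

```python
import torch
import torch.nn as nn

# Illustrative only: the target y below depends on q_net's parameters, but the
# torch.no_grad() block cuts that dependence, so the backward pass computes the
# gradient of a "loss" whose target is treated as a constant.
obs_dim, n_actions, gamma = 8, 4, 0.99
q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))

s, a, r, s2 = torch.randn(obs_dim), 0, 1.0, torch.randn(obs_dim)
with torch.no_grad():                 # <- no gradient through the target value
    y = r + gamma * q_net(s2).max()
loss = (q_net(s)[a] - y) ** 2         # looks like ordinary regression...
loss.backward()                       # ...but ignores y's dependence on the
                                      # same parameters being updated
```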
A sad corollary: the same argument applies to fitted bootstrapped policy evaluation, so the critic used in actor-critic also lacks convergence guarantees with function approximation.
An aside regarding terminology.
Review
• Value iteration theory: define an operator B for the backup and an operator Π for the projection; the backup is a contraction, so (tabular) value iteration converges.
• Convergence with function approximation: the projection is also a contraction, but projection + backup together is not a contraction, so fitted value iteration does not in general converge.
• Implications for Q-learning: Q-learning, fitted Q-iteration, etc. do not converge with function approximation, but we can make them work in practice! Sometimes – tune in next time.