CS 285 Instructor: Sergey Levine UC Berkeley Definitions - PowerPoint PPT Presentation

Introduction to Reinforcement Learning CS 285 Instructor: Sergey Levine UC Berkeley

Definitions

Terminology & notation 1. run away 2. ignore 3. pet

Imitation Learning: training data → supervised learning. Images: Bojarski et al. ’16, NVIDIA

Reward functions

Definitions Andrey Markov

Definitions Richard Bellman Andrey Markov

Definitions Richard Bellman

Definitions

The goal of reinforcement learning (we’ll come back to the partially observed case later)

The goal of reinforcement learning

Finite horizon case: state-action marginal
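The slide's equation did not survive extraction. Assuming the standard notation, in which p_θ(s_t, a_t) is the state-action marginal at time t under policy parameters θ and r is the reward function, the finite horizon objective is usually written as follows (the second form follows from linearity of expectation):

```latex
\theta^\star
  = \arg\max_\theta \, E_{\tau \sim p_\theta(\tau)}\Big[\sum_{t=1}^{T} r(s_t, a_t)\Big]
  = \arg\max_\theta \sum_{t=1}^{T} E_{(s_t, a_t) \sim p_\theta(s_t, a_t)}\big[r(s_t, a_t)\big]
```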

Infinite horizon case: stationary distribution (stationary = the same before and after transition)
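A minimal sketch of the stationarity condition, assuming a hypothetical two-state Markov chain whose transition probabilities are made up for illustration (they are not from the lecture). Repeatedly applying the transition operator until the distribution stops changing yields a distribution that is the same before and after a transition.

```python
# Hypothetical 2-state chain; P[i][j] = probability of moving from state i to j.
P = [[0.9, 0.1],
     [0.5, 0.5]]


def step(mu, P):
    """Apply one transition: new_mu[j] = sum_i mu[i] * P[i][j]."""
    n = len(P)
    return [sum(mu[i] * P[i][j] for i in range(n)) for j in range(n)]


def stationary(P, iters=1000):
    """Power iteration starting from the uniform distribution."""
    mu = [1.0 / len(P)] * len(P)
    for _ in range(iters):
        mu = step(mu, P)
    return mu


mu = stationary(P)
mu_next = step(mu, P)  # one more transition should leave mu unchanged
```

For this particular chain the balance condition 0.1 * mu[0] = 0.5 * mu[1] gives mu = (5/6, 1/6), which is what the iteration approaches.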

Expectations and stochastic systems (infinite horizon case, finite horizon case). In RL, we almost always care about expectations. [figure: rewards of +1 and -1]
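One way to read the +1/-1 illustration, assuming a hypothetical one-step problem in which the outcome with reward -1 happens with probability theta and the outcome with reward +1 happens otherwise. Each sampled reward is discontinuous (it is either +1 or -1), but the expected reward, 1 - 2 * theta, is smooth in theta. The names `theta`, `expected_reward`, and `sampled_reward` are illustrative.

```python
import random


def expected_reward(theta):
    """Exact expectation: (+1) with prob 1 - theta, (-1) with prob theta."""
    return (1.0 - theta) * 1.0 + theta * (-1.0)


def sampled_reward(theta, rng):
    """A single stochastic outcome; only ever returns +1.0 or -1.0."""
    return -1.0 if rng.random() < theta else 1.0


rng = random.Random(0)  # fixed seed so the sketch is reproducible
theta = 0.25
samples = [sampled_reward(theta, rng) for _ in range(20000)]
estimate = sum(samples) / len(samples)  # Monte Carlo estimate of the expectation
```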

Algorithms

The anatomy of a reinforcement learning algorithm [diagram: generate samples (i.e. run the policy) → fit a model / estimate the return → improve the policy → repeat]
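The three boxes can be sketched as a generic loop. This is only a skeleton under stated assumptions: `run_rl_loop` and the three callables are hypothetical names, and the toy instantiation below (a scalar "policy" improved by local search on a made-up reward) is a stand-in for illustration, not a method from the lecture.

```python
def run_rl_loop(generate_samples, estimate_return, improve_policy,
                policy, iterations):
    """Repeat: run the policy, estimate the return, improve the policy."""
    history = []
    for _ in range(iterations):
        samples = generate_samples(policy)               # generate samples
        value = estimate_return(samples)                 # estimate the return
        policy = improve_policy(policy, samples, value)  # improve the policy
        history.append(value)
    return policy, history


# Toy instantiation: the "policy" is a single action, best action is 3.0.
def reward(action):
    return -(action - 3.0) ** 2


def generate_samples(policy):
    candidates = [policy - 0.5, policy, policy + 0.5]
    return [(a, reward(a)) for a in candidates]


def estimate_return(samples):
    return sum(r for _, r in samples) / len(samples)


def improve_policy(policy, samples, value):
    # Move to the sampled action with the highest reward.
    return max(samples, key=lambda s: s[1])[0]


final_policy, history = run_rl_loop(
    generate_samples, estimate_return, improve_policy,
    policy=0.0, iterations=10)
```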

A simple example [diagram: generate samples (i.e. run the policy) → fit a model / estimate the return → improve the policy → repeat]

Another example: RL by backprop [diagram: generate samples (i.e. run the policy) → fit a model / estimate the return → improve the policy → repeat]

Which parts are expensive?
• generate samples (i.e. run the policy)
  • real robot/car/power grid/whatever: expensive, 1x real time, until we invent time travel
  • MuJoCo simulator: up to 10000x real time
• fit a model / estimate the return: trivial, fast
• improve the policy

Value Functions

How do we deal with all these expectations? what if we knew this part?

Definition: Q-function Definition: value function
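The equations on this slide did not survive extraction. Assuming the standard finite horizon notation (policy π_θ, reward r, horizon T), the two definitions and the relation between them are usually written as:

```latex
% Q-function: total expected reward from taking a_t in s_t, then following \pi
Q^\pi(s_t, a_t) = \sum_{t'=t}^{T} E_{\pi_\theta}\big[r(s_{t'}, a_{t'}) \mid s_t, a_t\big]

% Value function: total expected reward from s_t when following \pi
V^\pi(s_t) = \sum_{t'=t}^{T} E_{\pi_\theta}\big[r(s_{t'}, a_{t'}) \mid s_t\big]

% The value function is the Q-function averaged over the policy's actions
V^\pi(s_t) = E_{a_t \sim \pi(a_t \mid s_t)}\big[Q^\pi(s_t, a_t)\big]
```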

Using Q-functions and value functions
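Two common uses can be sketched in a tabular setting: picking the action with the highest Q-value to get an improved policy, and comparing Q to V to see which actions are better than average. The Q-values, the policy `PI`, and the function names below are all hypothetical and chosen only for illustration.

```python
# Hypothetical tabular Q-values for the current policy (illustrative numbers).
Q = {
    "s0": {"left": 1.0, "right": 2.0},
    "s1": {"left": 0.5, "right": -1.0},
}
# Current stochastic policy pi(a | s), also made up for this sketch.
PI = {
    "s0": {"left": 0.5, "right": 0.5},
    "s1": {"left": 0.5, "right": 0.5},
}


def state_value(state):
    """V(s) = E_{a ~ pi(a|s)}[Q(s, a)]."""
    return sum(PI[state][a] * Q[state][a] for a in Q[state])


def advantage(state, action):
    """Positive when the action is better than the policy's average."""
    return Q[state][action] - state_value(state)


def greedy_policy():
    """Improved policy: in each state, pick the action with the largest Q."""
    return {s: max(q, key=q.get) for s, q in Q.items()}
```

Here `greedy_policy()` picks "right" in s0 and "left" in s1, and `advantage("s0", "right")` is 0.5, so a gradient-style update would raise the probability of "right" in s0.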

The anatomy of a reinforcement learning algorithm [diagram: generate samples (i.e. run the policy) → fit a model / estimate the return (this often uses Q-functions or value functions) → improve the policy → repeat]

Types of Algorithms

Types of RL algorithms
• Policy gradients: directly differentiate the above objective
• Value-based: estimate value function or Q-function of the optimal policy (no explicit policy)
• Actor-critic: estimate value function or Q-function of the current policy, use it to improve policy
• Model-based RL: estimate the transition model, and then…
  • Use it for planning (no explicit policy)
  • Use it to improve a policy
  • Something else

Model-based RL algorithms [diagram: generate samples (i.e. run the policy) → fit a model / estimate the return → improve the policy → repeat]

Model-based RL algorithms: options for the "improve the policy" step
1. Just use the model to plan (no policy)
  • Trajectory optimization/optimal control (primarily in continuous spaces), essentially backpropagation to optimize over actions
  • Discrete planning in discrete action spaces, e.g., Monte Carlo tree search
2. Backpropagate gradients into the policy
  • Requires some tricks to make it work
3. Use the model to learn a value function
  • Dynamic programming
  • Generate simulated experience for model-free learner
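Option 3 (dynamic programming with a model) can be sketched with value iteration on a tiny, known, deterministic model. Everything here is an assumption made for illustration: the three-state `MODEL`, the discount factor `GAMMA`, and the helper names are hypothetical, and a learned model would replace the hand-written table.

```python
GAMMA = 0.9  # discount factor, assumed for this sketch

# Hypothetical model: MODEL[state][action] = (next_state, reward).
# State 2 is terminal (no actions available).
MODEL = {
    0: {"stay": (0, 0.0), "go": (1, 0.0)},
    1: {"stay": (1, 0.0), "go": (2, 1.0)},
    2: {},
}


def value_iteration(model, gamma, sweeps=50):
    """Repeatedly back up V(s) = max_a [r(s, a) + gamma * V(s')]."""
    V = {s: 0.0 for s in model}
    for _ in range(sweeps):
        for s, actions in model.items():
            if actions:  # terminal states keep value 0
                V[s] = max(r + gamma * V[s2] for s2, r in actions.values())
    return V


def greedy_plan(model, V, gamma):
    """Pick, in each non-terminal state, the action with the best backup."""
    return {
        s: max(actions, key=lambda a: actions[a][1] + gamma * V[actions[a][0]])
        for s, actions in model.items() if actions
    }


V = value_iteration(MODEL, GAMMA)
plan = greedy_plan(MODEL, V, GAMMA)
```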

Value function based algorithms [diagram: generate samples (i.e. run the policy) → fit a model / estimate the return → improve the policy → repeat]

Direct policy gradients [diagram: generate samples (i.e. run the policy) → fit a model / estimate the return → improve the policy → repeat]
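A minimal sketch of a direct policy gradient, assuming a hypothetical two-armed bandit in which arm 1 pays reward 1 and arm 0 pays 0. The policy picks arm 1 with probability sigmoid(theta), and the update is the score-function (REINFORCE-style) estimator: average grad-log-probability times reward over sampled actions. The problem, hyperparameters, and names are illustrative assumptions.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def reinforce_bandit(iterations=200, batch_size=20, lr=0.5, seed=0):
    rng = random.Random(seed)  # fixed seed for reproducibility
    theta = 0.0                # policy parameter; P(arm 1) = sigmoid(theta)
    for _iteration in range(iterations):
        p = sigmoid(theta)
        grad = 0.0
        for _sample in range(batch_size):
            # Generate a sample by running the policy.
            action = 1 if rng.random() < p else 0
            reward = 1.0 if action == 1 else 0.0
            # d/dtheta log pi(action) = action - p for a Bernoulli policy.
            grad += (action - p) * reward
        # Gradient ascent on the estimated expected reward.
        theta += lr * grad / batch_size
    return sigmoid(theta)


p_final = reinforce_bandit()  # probability of the better arm after training
```

Note that every update uses fresh samples from the current policy, which is what makes this estimator on-policy.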

Actor-critic: value functions + policy gradients [diagram: generate samples (i.e. run the policy) → fit a model / estimate the return → improve the policy → repeat]

Tradeoffs Between Algorithms

Why so many RL algorithms?
• Different tradeoffs
  • Sample efficiency
  • Stability & ease of use
• Different assumptions
  • Stochastic or deterministic?
  • Continuous or discrete?
  • Episodic or infinite horizon?
• Different things are easy or hard in different settings
  • Easier to represent the policy?
  • Easier to represent the model?

Comparison: sample efficiency
• Sample efficiency = how many samples do we need to get a good policy?
• Most important question: is the algorithm off policy?
  • Off policy: able to improve the policy without generating new samples from that policy
  • On policy: each time the policy is changed, even a little bit (e.g. just one gradient step), we need to generate new samples

Comparison: sample efficiency
From more efficient (fewer samples, off-policy end) to less efficient (more samples, on-policy end):
1. model-based shallow RL
2. model-based deep RL
3. off-policy Q-function learning
4. actor-critic style methods
5. on-policy policy gradient algorithms
6. evolutionary or gradient-free algorithms
Why would we use a less efficient algorithm? Wall clock time is not the same as efficiency!

Comparison: stability and ease of use
• Does it converge?
• And if it converges, to what?
• And does it converge every time?
Why is any of this even a question???
• Supervised learning: almost always gradient descent
• Reinforcement learning: often not gradient descent
  • Q-learning: fixed point iteration
  • Model-based RL: model is not optimized for expected reward
  • Policy gradient: is gradient descent, but also often the least efficient!
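To make "fixed point iteration" concrete, here is a sketch of repeated Bellman backups on a tabular Q-function, assuming a hypothetical deterministic two-state MDP and a discount factor of 0.5 (both made up for illustration). No gradient of any objective is taken; the table is simply replaced by its backup until it stops changing. In this tabular case the iteration converges, which is not guaranteed once a nonlinear function approximator is fitted to the backups.

```python
GAMMA = 0.5  # discount factor, assumed for this sketch
ACTIONS = ("x", "y")

# Hypothetical deterministic MDP: (state, action) -> (next_state, reward).
TRANSITIONS = {
    ("A", "x"): ("A", 1.0),
    ("A", "y"): ("B", 0.0),
    ("B", "x"): ("A", 2.0),
    ("B", "y"): ("B", 0.0),
}


def bellman_backup(Q):
    """Q_new(s, a) = r(s, a) + gamma * max_b Q(s', b)."""
    return {
        (s, a): r + GAMMA * max(Q[(s2, b)] for b in ACTIONS)
        for (s, a), (s2, r) in TRANSITIONS.items()
    }


def max_diff(Q1, Q2):
    return max(abs(Q1[k] - Q2[k]) for k in Q1)


Q = {key: 0.0 for key in TRANSITIONS}
residuals = []  # how much the table changes at each iteration
for _ in range(60):
    Q_new = bellman_backup(Q)
    residuals.append(max_diff(Q, Q_new))
    Q = Q_new
```

For this MDP the fixed point is Q(A, x) = 2, Q(A, y) = 1.5, Q(B, x) = 3, Q(B, y) = 1.5, which can be checked by substituting into the backup.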

Comparison: stability and ease of use
• Value function fitting
  • At best, minimizes error of fit (“Bellman error”)
    • Not the same as expected reward
  • At worst, doesn’t optimize anything
    • Many popular deep RL value fitting algorithms are not guaranteed to converge to anything in the nonlinear case
• Model-based RL
  • Model minimizes error of fit
    • This will converge
  • No guarantee that better model = better policy
• Policy gradient
  • The only one that actually performs gradient descent (ascent) on the true objective

Comparison: assumptions
• Common assumption #1: full observability
  • Generally assumed by value function fitting methods
  • Can be mitigated by adding recurrence
• Common assumption #2: episodic learning
  • Often assumed by pure policy gradient methods
  • Assumed by some model-based RL methods
• Common assumption #3: continuity or smoothness
  • Assumed by some continuous value function learning methods
  • Often assumed by some model-based RL methods

Examples of Algorithms

Examples of specific algorithms (we’ll learn about most of these in the next few weeks!)
• Value function fitting methods
  • Q-learning, DQN
  • Temporal difference learning
  • Fitted value iteration
• Policy gradient methods
  • REINFORCE
  • Natural policy gradient
  • Trust region policy optimization
• Actor-critic algorithms
  • Asynchronous advantage actor-critic (A3C)
  • Soft actor-critic (SAC)
• Model-based RL algorithms
  • Dyna
  • Guided policy search

Example 1: Atari games with Q-functions
• Playing Atari with deep reinforcement learning, Mnih et al. ’13
• Q-learning with convolutional neural networks

Example 2: robots and model-based RL
• End-to-end training of deep visuomotor policies, L.*, Finn* ’16
• Guided policy search (model-based RL) for image-based robotic manipulation

Example 3: walking with policy gradients
• High-dimensional continuous control with generalized advantage estimation, Schulman et al. ’16
• Trust region policy optimization with value function approximation

Example 4: robotic grasping with Q-functions
• QT-Opt, Kalashnikov et al. ’18
• Q-learning from images for real-world robotic grasping
