Lecture 8: Policy Gradient I
Emma Brunskill, CS234 Reinforcement Learning, Winter 2019
Additional reading: Sutton and Barto 2018, Chapter 13
With many slides from or derived from David Silver, John Schulman, and Pieter Abbeel
Last Time: We Want RL Algorithms that Perform
Optimization
Delayed consequences
Exploration
Generalization
And do it statistically and computationally efficiently
Last Time: Generalization and Efficiency
Can use structure and additional knowledge to help constrain and speed reinforcement learning
Class Structure
Last time: Imitation Learning
This time: Policy Search
Next time: Policy Search (cont.)
Table of Contents
1. Introduction
2. Policy Gradient
3. Score Function and Policy Gradient Theorem
4. Policy Gradient Algorithms and Reducing Variance
Policy-Based Reinforcement Learning
In the last lecture we approximated the value or action-value function using parameters θ:
V_θ(s) ≈ V^π(s)
Q_θ(s, a) ≈ Q^π(s, a)
A policy was generated directly from the value function, e.g. using ε-greedy
In this lecture we will directly parametrize the policy:
π_θ(s, a) = P[a | s; θ]
Goal is to find a policy π with the highest value function V^π
We will focus again on model-free reinforcement learning
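As an illustration (not part of the original slides), here is a minimal sketch of one common way to directly parametrize a policy: a softmax over a linear score of state-action features. The feature function `phi`, the action set, and the parameter vector are assumptions made for the example.

```python
import numpy as np

def softmax_policy(theta, phi, state, actions):
    """Return pi_theta(a | s) for every action, using a linear score phi(s, a) . theta."""
    scores = np.array([phi(state, a) @ theta for a in actions])
    scores -= scores.max()                 # subtract max for numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()

def sample_action(theta, phi, state, actions, rng=None):
    """Sample a ~ pi_theta(. | s) from the stochastic policy."""
    rng = rng or np.random.default_rng()
    probs = softmax_policy(theta, phi, state, actions)
    return actions[rng.choice(len(actions), p=probs)]
```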
Value-Based and Policy-Based RL
Value-Based: learnt value function; implicit policy (e.g. ε-greedy)
Policy-Based: no value function; learnt policy
Actor-Critic: learnt value function; learnt policy
Advantages of Policy-Based RL
Advantages:
Better convergence properties
Effective in high-dimensional or continuous action spaces
Can learn stochastic policies
Disadvantages:
Typically converge to a local rather than a global optimum
Evaluating a policy is typically inefficient and high variance
Example: Rock-Paper-Scissors
Two-player game of rock-paper-scissors:
Scissors beats paper
Rock beats scissors
Paper beats rock
Consider policies for iterated rock-paper-scissors:
A deterministic policy is easily exploited
A uniform random policy is optimal (i.e. a Nash equilibrium)
Example: Aliased Gridworld (1)
The agent cannot differentiate the grey states
Consider features of the following form (for all N, E, S, W):
φ(s, a) = 𝟙(wall to N, a = move E)
Compare value-based RL, using an approximate value function:
Q_θ(s, a) = f(φ(s, a); θ)
to policy-based RL, using a parametrized policy:
π_θ(s, a) = g(φ(s, a); θ)
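To make the aliasing concrete, here is a small sketch (not from the slides) of a hypothetical encoding of these indicator features, showing that the two grey states produce identical feature vectors for every action, so any policy defined on the features must behave the same in both.

```python
import numpy as np

ACTIONS = ["N", "E", "S", "W"]

def phi(walls, action):
    """Indicator features: one entry per (wall direction, action) pair,
    equal to 1 iff there is a wall in that direction AND the action matches."""
    feats = np.zeros(len(ACTIONS) * len(ACTIONS))
    for i, w in enumerate(ACTIONS):
        for j, a in enumerate(ACTIONS):
            feats[i * len(ACTIONS) + j] = float(walls[w] and action == a)
    return feats

# The two grey states have the same wall configuration (wall to N and S),
# so they yield identical features for every action: the agent cannot tell them apart.
grey_left  = {"N": True, "E": False, "S": True, "W": False}
grey_right = {"N": True, "E": False, "S": True, "W": False}
assert all(np.array_equal(phi(grey_left, a), phi(grey_right, a)) for a in ACTIONS)
```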
Example: Aliased Gridworld (2)
Under aliasing, an optimal deterministic policy will either:
move W in both grey states (shown by red arrows), or
move E in both grey states
Either way, it can get stuck and never reach the money
Value-based RL learns a near-deterministic policy, e.g. greedy or ε-greedy
So it will traverse the corridor for a long time
Example: Aliased Gridworld (3)
An optimal stochastic policy will randomly move E or W in the grey states:
π_θ(wall to N and S, move E) = 0.5
π_θ(wall to N and S, move W) = 0.5
It will reach the goal state in a few steps with high probability
Policy-based RL can learn the optimal stochastic policy
Policy Objective Functions
Goal: given a policy π_θ(s, a) with parameters θ, find the best θ
But how do we measure the quality of a policy π_θ?
In episodic environments we can use the start value of the policy:
J_1(θ) = V^{π_θ}(s_1)
In continuing environments we can use the average value:
J_avV(θ) = Σ_s d^{π_θ}(s) V^{π_θ}(s)
where d^{π_θ}(s) is the stationary distribution of the Markov chain for π_θ
Or the average reward per time-step:
J_avR(θ) = Σ_s d^{π_θ}(s) Σ_a π_θ(s, a) R(s, a)
For simplicity, today we will mostly discuss the episodic case, but this can easily be extended to the continuing / infinite-horizon case
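Since J_1(θ) is just the expected return from the start state, it can be estimated by Monte Carlo rollouts. The sketch below (not from the slides) assumes a hypothetical environment with a reset()/step() interface that returns (state, reward, done), and a `policy(theta, state)` action sampler.

```python
def estimate_start_value(theta, policy, env, n_episodes=100, gamma=1.0):
    """Monte Carlo estimate of J_1(theta) = V^{pi_theta}(s_1):
    average discounted return over rollouts from the start state."""
    total = 0.0
    for _ in range(n_episodes):
        state = env.reset()
        done, discount, ret = False, 1.0, 0.0
        while not done:
            action = policy(theta, state)           # sample a ~ pi_theta(. | s)
            state, reward, done = env.step(action)  # hypothetical transition interface
            ret += discount * reward
            discount *= gamma
        total += ret
    return total / n_episodes
```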
Policy optimization
Policy-based reinforcement learning is an optimization problem:
Find policy parameters θ that maximize V^{π_θ}
Can use gradient-free optimization:
Hill climbing
Simplex / amoeba / Nelder-Mead
Genetic algorithms
Cross-Entropy Method (CEM)
Covariance Matrix Adaptation (CMA)
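As a rough illustration of one of the gradient-free methods listed above, here is a minimal sketch of the Cross-Entropy Method. It is not from the slides; `evaluate_return` is a placeholder for a (possibly noisy) estimate of V^{π_θ}, such as the Monte Carlo estimator sketched earlier.

```python
import numpy as np

def cross_entropy_method(evaluate_return, dim, iterations=50,
                         population=100, elite_frac=0.2, init_std=1.0):
    """Gradient-free policy search with CEM.

    evaluate_return : function theta -> estimated V^{pi_theta}
    dim             : number of policy parameters
    """
    mean, std = np.zeros(dim), np.full(dim, init_std)
    n_elite = int(population * elite_frac)
    for _ in range(iterations):
        # Sample a population of candidate parameter vectors.
        thetas = mean + std * np.random.randn(population, dim)
        returns = np.array([evaluate_return(t) for t in thetas])
        # Keep the top-performing ("elite") candidates and refit the sampling distribution.
        elite = thetas[np.argsort(returns)[-n_elite:]]
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean
```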
Human-in-the-Loop Exoskeleton Optimization (Zhang et al., Science 2017)
Figure: Zhang et al., Science 2017
Optimization was done using CMA-ES, a variant of covariance matrix adaptation (evolution strategy)
Gradient Free Policy Optimization
Can often work embarrassingly well: "discovered that evolution strategies (ES), an optimization technique that's been known for decades, rivals the performance of standard reinforcement learning (RL) techniques on modern RL benchmarks (e.g. Atari/MuJoCo)" (https://blog.openai.com/evolution-strategies/)
Gradient Free Policy Optimization
Often a great simple baseline to try
Benefits:
Can work with any policy parameterization, including non-differentiable ones
Frequently very easy to parallelize
Limitations:
Typically not very sample efficient because it ignores temporal structure
Policy optimization
Policy-based reinforcement learning is an optimization problem:
Find policy parameters θ that maximize V^{π_θ}
Can use gradient-free optimization; greater efficiency is often possible using the gradient:
Gradient descent
Conjugate gradient
Quasi-Newton
We focus on gradient descent; many extensions are possible
And on methods that exploit the sequential structure
Table of Contents
1. Introduction
2. Policy Gradient
3. Score Function and Policy Gradient Theorem
4. Policy Gradient Algorithms and Reducing Variance
Policy Gradient
Define V(θ) = V^{π_θ} to make explicit the dependence of the value on the policy parameters
Assume episodic MDPs (easy to extend to related objectives, like average reward)
Policy gradient algorithms search for a local maximum in V(θ) by ascending the gradient of the value w.r.t. the policy parameters θ:
Δθ = α ∇_θ V(θ)
where ∇_θ V(θ) is the policy gradient
∇_θ V(θ) = ( ∂V(θ)/∂θ_1, ..., ∂V(θ)/∂θ_n )^T
and α is a step-size parameter
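In code, the ascent rule above is just a loop over parameter updates. The sketch below is a generic illustration (not from the slides); `grad_estimator` is a placeholder for any estimate of ∇_θ V(θ), such as the finite-difference estimator on the next slide.

```python
def gradient_ascent(theta, grad_estimator, alpha=0.01, steps=1000):
    """Generic policy-gradient ascent: theta <- theta + alpha * grad V(theta)."""
    for _ in range(steps):
        theta = theta + alpha * grad_estimator(theta)
    return theta
```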
Computing Gradients by Finite Differences
To evaluate the policy gradient of π_θ(s, a):
For each dimension k ∈ [1, n]:
Estimate the k-th partial derivative of the objective function w.r.t. θ
by perturbing θ by a small amount ε in the k-th dimension:
∂V(θ)/∂θ_k ≈ (V(θ + ε u_k) − V(θ)) / ε
where u_k is a unit vector with 1 in the k-th component, 0 elsewhere
Uses n evaluations to compute the policy gradient in n dimensions
Simple, noisy, inefficient, but sometimes effective
Works for arbitrary policies, even if the policy is not differentiable
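A minimal sketch of the finite-difference estimator described on this slide; as before, `evaluate_return` is a placeholder for a (noisy, e.g. Monte Carlo) estimate of V(θ) and is not part of the original slides.

```python
import numpy as np

def finite_difference_gradient(evaluate_return, theta, eps=1e-2):
    """Estimate grad V(theta) with n one-sided finite differences.

    evaluate_return : function theta -> estimate of V(theta)
    theta           : current policy parameters, shape (n,)
    """
    base = evaluate_return(theta)
    grad = np.zeros_like(theta)
    for k in range(len(theta)):
        u_k = np.zeros_like(theta)
        u_k[k] = 1.0                      # unit vector in the k-th dimension
        grad[k] = (evaluate_return(theta + eps * u_k) - base) / eps
    return grad
```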