

  1. 10703 Deep Reinforcement Learning: Policy Gradient Methods
     Tom Mitchell, October 1, 2018
     Reading: Barto & Sutton, Chapter 13

  2. Used Materials
     • Much of the material and slides for this lecture were taken from Chapter 13 of the Barto & Sutton textbook.
     • Some slides are borrowed from Ruslan Salakhutdinov, who in turn borrowed from Rich Sutton's RL class and David Silver's Deep RL tutorial.

  3. Policy-Based Reinforcement Learning
     ‣ So far we approximated the value or action-value function using parameters θ (e.g. neural networks)
     ‣ A policy was generated directly from the value function, e.g. using ε-greedy
     ‣ In this lecture we will directly parameterize the policy
     ‣ We will focus again on model-free reinforcement learning

  4. Policy-Based Reinforcement Learning
     ‣ So far we approximated the value or action-value function using parameters θ (e.g. neural networks)
     ‣ Sometimes I will also use the notation:
     ‣ A policy was generated directly from the value function, e.g. using ε-greedy
     ‣ In this lecture we will directly parameterize the policy
     ‣ We will focus again on model-free reinforcement learning

  5. Typical Parameterized Differentiable Policy
     ‣ Softmax: π(a|s,θ) = exp(h(s,a,θ)) / Σ_b exp(h(s,b,θ))
       where h(s,a,θ) is any function of s, a with parameters θ
       e.g., a linear function of features x(s,a) that you make up: h(s,a,θ) = θᵀx(s,a)
       e.g., h(s,a,θ) is the output of a trained neural net
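
     A minimal sketch of this softmax policy in Python, assuming hand-crafted features x(s,a) stacked row-wise and the linear preference h(s,a,θ) = θᵀx(s,a); the function names are illustrative, not from the lecture:

        import numpy as np

        def action_probs(theta, features):
            # features: (num_actions, num_features) array, one row x(s,a) per action
            prefs = features @ theta          # preferences h(s,a,θ) for every action
            prefs -= prefs.max()              # subtract the max for numerical stability
            e = np.exp(prefs)
            return e / e.sum()                # softmax policy π(a|s,θ)

        def grad_log_pi(theta, features, action):
            # For the linear-softmax case: ∇_θ ln π(a|s,θ) = x(s,a) - Σ_b π(b|s,θ) x(s,b)
            pi = action_probs(theta, features)
            return features[action] - pi @ features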

  6. Value-Based and Policy-Based RL
     ‣ Value-Based
       - Learn a Value Function
       - Implicit policy (e.g. ε-greedy)
     ‣ Policy-Based
       - Learn a Policy directly
     ‣ Actor-Critic
       - Learn a Value Function, and
       - Learn a Policy

  7. Advantages of Policy-Based RL
     ‣ Advantages
       - Better convergence properties
       - Effective in high-dimensional, even continuous, action spaces
       - Can learn stochastic policies
     ‣ Disadvantages
       - Typically converge to a local rather than a global optimum

  8. Example: Why use a non-deterministic policy?

  9. What Policy Learning Objective?
     ‣ Goal: given a policy π_θ(s,a) with parameters θ, we wish to find the best θ
     ‣ Define "best θ" as argmax_θ J(θ) for some objective J(θ)
     ‣ In episodic environments we can optimize the value of the start state s_1
     ‣ Remember: episode of experience under policy π:

  10. What Policy Learning Objective?
     ‣ Goal: given a policy π_θ(s,a) with parameters θ, we wish to find the best θ
     ‣ Define "best θ" as argmax_θ J(θ) for some objective J(θ)
     ‣ In episodic environments we can optimize the value of the start state s_1
     ‣ In continuing environments we can optimize the average value
     ‣ Or the average immediate reward per time-step
       where μ(s) is the stationary distribution of the Markov chain for π_θ
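
     The three objectives just named, in the notation of the Silver-style slides this lecture borrows from (μ_{π_θ} is the stationary state distribution under π_θ, and R(s,a) the expected immediate reward); this is a standard reconstruction, not the slide's own formulas, written as LaTeX:

        J_{\mathrm{start}}(\theta) = V^{\pi_\theta}(s_1)
        J_{\mathrm{avgV}}(\theta)  = \sum_s \mu_{\pi_\theta}(s)\, V^{\pi_\theta}(s)
        J_{\mathrm{avgR}}(\theta)  = \sum_s \mu_{\pi_\theta}(s) \sum_a \pi_\theta(s,a)\, R(s,a)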

  11. Policy Optimization
     ‣ Policy-based reinforcement learning is an optimization problem
       - Find θ that maximizes J(θ)
     ‣ Some approaches do not use the gradient
       - Hill climbing
       - Genetic algorithms
     ‣ Greater efficiency is often possible using the gradient
       - Gradient descent
       - Conjugate gradient
       - Quasi-Newton
     ‣ We focus on gradient ascent; many extensions are possible
     ‣ And on methods that exploit the sequential structure

  12. Gradient of Policy Objective
     ‣ Let J(θ) be any policy objective function
     ‣ Policy gradient algorithms search for a local maximum of J(θ) by ascending the gradient of the policy objective w.r.t. the parameters θ:
         Δθ = α ∇_θ J(θ)
       where ∇_θ J(θ) is the policy gradient and α is a step-size parameter (learning rate)

  13. Computing Gradients By Finite Differences
     ‣ To evaluate the policy gradient of π_θ(s,a)
     ‣ For each dimension k in [1, n]
       - Estimate the k-th partial derivative of the objective function w.r.t. θ
       - By perturbing θ by a small amount ε in the k-th dimension:
           ∂J(θ)/∂θ_k ≈ (J(θ + ε u_k) − J(θ)) / ε
         where u_k is a unit vector with 1 in the k-th component, 0 elsewhere
     ‣ Uses n evaluations to compute the policy gradient in n dimensions
     ‣ Simple, inefficient, but general purpose!
     ‣ Works for arbitrary policies, even if the policy is not differentiable
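
     A short sketch of this finite-difference estimator, assuming a hypothetical routine evaluate_policy(theta) (not from the lecture) that returns an estimate of J(θ), e.g. by averaging returns over a batch of rollouts:

        import numpy as np

        def finite_difference_gradient(evaluate_policy, theta, eps=1e-2):
            # Estimate ∇_θ J(θ) with one perturbed evaluation per parameter dimension.
            n = len(theta)
            grad = np.zeros(n)
            j_base = evaluate_policy(theta)                  # J(θ) at the current parameters
            for k in range(n):
                u_k = np.zeros(n)
                u_k[k] = 1.0                                 # unit vector along the k-th dimension
                grad[k] = (evaluate_policy(theta + eps * u_k) - j_base) / eps
            return grad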

  14. How do we find an expression for ∇_θ J(θ)?
     ‣ Consider the episodic case, where J(θ) is the value of the start state s_1 under π_θ
     ‣ Problem in calculating ∇_θ J(θ): doesn't a change to θ alter both
       • the action chosen by π_θ in each state s
       • the distribution of states we'll encounter
     ‣ Remember: episode of experience under policy π:

  15. How do we find an expression for ∇_θ J(θ)?
     ‣ Consider the episodic case, where J(θ) is the value of the start state s_1 under π_θ
     ‣ Problem in calculating ∇_θ J(θ): doesn't a change to θ alter both
       • the action chosen by π_θ in each state s
       • the distribution of states we'll encounter
     ‣ Good news: the policy gradient theorem:
         ∇_θ J(θ) ∝ Σ_s μ(s) Σ_a q_π(s,a) ∇_θ π(a|s,θ)
       where μ(s) is a probability distribution over states (how often π_θ visits s)

  16. SGD Approach to Optimizing J(θ): Approach 1

  17. SGD Approach to Optimizing J(θ): Approach 2

  18. SGD Approach to Optimizing J(θ): Approach 2

  19. SGD Approach to Optimizing J(θ): Approach 2

  20. REINFORCE algorithm
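
     For reference, the episodic REINFORCE update from Barto & Sutton Chapter 13 is θ ← θ + α γ^t G_t ∇_θ ln π(A_t|S_t, θ), applied at every time step of each sampled episode. A minimal, self-contained sketch in Python using the same linear-softmax policy as the slide-5 example; the feature function x(s, a) and the env object with reset()/step(a) returning (next_state, reward, done) are hypothetical stand-ins, not the lecture's code:

        import numpy as np

        def reinforce(env, x, theta, num_actions, alpha=0.01, gamma=0.99, episodes=1000):
            # x(s, a): hand-made feature vector for state s and action a (supplied by the user)
            def probs(s):
                prefs = np.array([theta @ x(s, a) for a in range(num_actions)])
                prefs -= prefs.max()                     # numerical stability
                e = np.exp(prefs)
                return e / e.sum()                       # softmax π(·|s, θ)

            for _ in range(episodes):
                # Generate one episode S_0, A_0, R_1, ..., following π(·|·, θ)
                states, actions, rewards = [], [], []
                s, done = env.reset(), False
                while not done:
                    a = np.random.choice(num_actions, p=probs(s))
                    s_next, r, done = env.step(a)        # hypothetical 3-tuple interface
                    states.append(s); actions.append(a); rewards.append(r)
                    s = s_next

                # For each step t, push θ along γ^t G_t ∇_θ ln π(A_t|S_t, θ)
                for t in range(len(states)):
                    G = sum(gamma ** (k - t) * rewards[k] for k in range(t, len(rewards)))
                    pi = probs(states[t])
                    grad_ln_pi = x(states[t], actions[t]) - sum(pi[b] * x(states[t], b)
                                                                for b in range(num_actions))
                    theta = theta + alpha * (gamma ** t) * G * grad_ln_pi
            return theta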

  21. Note: ∇_θ ln π(a|s,θ) = ∇_θ π(a|s,θ) / π(a|s,θ), because ∇ ln f = ∇f / f

  22. Typical Parameterized Differentiable Policy
     ‣ Softmax: π(a|s,θ) = exp(h(s,a,θ)) / Σ_b exp(h(s,b,θ))
       where h(s,a,θ) is any function of s, a with parameters θ
       e.g., a linear function of features x(s,a) that you make up: h(s,a,θ) = θᵀx(s,a)
       e.g., h(s,a,θ) is the output of a trained neural net

  23. REINFORCE algorithm on Short Corridor World

  24. Good news:
     • REINFORCE converges to a local optimum under the usual SGD assumptions
     • because E_π[G_t | S_t, A_t] = Q(S_t, A_t)
     But variance is high:
     • recall the high variance of Monte Carlo sampling

  25. Good news:
     • REINFORCE converges to a local optimum under the usual SGD assumptions
     • because E_π[G_t | S_t, A_t] = Q(S_t, A_t)
     But variance is high:
     • recall the high variance of Monte Carlo sampling

  26. Adding a baseline to the REINFORCE Algorithm
     ‣ Replace G_t by G_t − b(S_t), for some fixed function b(s) that captures a prior for s
     ‣ Note the equation is still valid, because subtracting b(s) adds zero in expectation (see the derivation below)
     ‣ Result: θ ← θ + α (G_t − b(S_t)) ∇_θ ln π(A_t|S_t, θ)
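
     Why subtracting a baseline leaves the expected gradient unchanged (a standard one-line derivation consistent with Barto & Sutton Section 13.4, not copied from the slide), in LaTeX:

        \sum_a b(s)\, \nabla_\theta \pi(a \mid s, \theta)
            = b(s)\, \nabla_\theta \sum_a \pi(a \mid s, \theta)
            = b(s)\, \nabla_\theta 1
            = 0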

  27. Adding a baseline to the REINFORCE Algorithm
     ‣ Replacing G_t by G_t − b(S_t) for a good b(S_t) reduces the variance of the training target
     ‣ One typical b(S_t) is a learned state-value function: b(S_t) = v̂(S_t, w)
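
     A sketch of how the per-step update changes with a learned linear baseline v̂(S_t, w) = w·x_v(S_t), in the style of the REINFORCE-with-baseline box in Barto & Sutton Section 13.4; the name baseline_update and the state-feature function x_v are hypothetical, and G, grad_ln_pi, t are the quantities computed in the REINFORCE sketch above:

        def baseline_update(theta, w, G, x_v_s, grad_ln_pi, t,
                            alpha_theta=0.01, alpha_w=0.05, gamma=0.99):
            # One step of REINFORCE with baseline; x_v_s is the state feature vector x_v(S_t)
            delta = G - w @ x_v_s                        # G_t − b(S_t), with b(S_t) = v̂(S_t, w)
            w = w + alpha_w * delta * x_v_s              # move the baseline toward the observed return
            theta = theta + alpha_theta * (gamma ** t) * delta * grad_ln_pi
            return theta, w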

  28. Good news:
     • REINFORCE converges to a local optimum under the usual SGD assumptions
     • because E_π[G_t | S_t, A_t] = Q(S_t, A_t)
     But variance is high:
     • recall the high variance of Monte Carlo sampling

  29. Actor-Critic Model
     • Learn both Q and π
     • Use Q to generate target values, instead of G
     • One-step actor-critic model:
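
     A sketch of one-step actor-critic in the style of Barto & Sutton Chapter 13, pairing a linear-softmax actor with a linear state-value critic whose TD target R + γ v̂(S′, w) replaces the Monte Carlo return G; the feature functions x_pi(s, a), x_v(s) and the env object with reset()/step(a) returning (next_state, reward, done) are hypothetical stand-ins, not the lecture's code (the slide mentions Q; this variant uses a state-value critic):

        import numpy as np

        def one_step_actor_critic(env, x_pi, x_v, num_actions, theta, w,
                                  alpha_theta=0.01, alpha_w=0.05, gamma=0.99, episodes=1000):
            def probs(s):
                prefs = np.array([theta @ x_pi(s, a) for a in range(num_actions)])
                prefs -= prefs.max()
                e = np.exp(prefs)
                return e / e.sum()

            for _ in range(episodes):
                s, done = env.reset(), False
                I = 1.0                                   # discount accumulator γ^t
                while not done:
                    pi = probs(s)
                    a = np.random.choice(num_actions, p=pi)
                    s_next, r, done = env.step(a)

                    # TD error: the critic's one-step target replaces the Monte Carlo return G_t
                    v_s = w @ x_v(s)
                    v_next = 0.0 if done else w @ x_v(s_next)
                    delta = r + gamma * v_next - v_s

                    # Critic update (linear v̂), then actor update along ∇_θ ln π
                    w = w + alpha_w * delta * x_v(s)
                    grad_ln_pi = x_pi(s, a) - sum(pi[b] * x_pi(s, b) for b in range(num_actions))
                    theta = theta + alpha_theta * I * delta * grad_ln_pi

                    I *= gamma
                    s = s_next
            return theta, w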
