

  1. Trust Region Policy Optimization
     Yixin Lin, Duke University
     yixin.lin@duke.edu
     March 28, 2017

  2. Overview
     - Preliminaries: Markov decision processes, policy iteration, policy gradients
     - TRPO: Kakade & Langford, natural policy gradients, overview of TRPO, practice, experiments
     - References

  3. Introduction
     - Reinforcement learning is the problem of sequential decision making in a dynamic environment
     - Goal: capture the most important aspects of an agent making decisions
       - Input: sensing the state of the environment
       - Action: choosing how to act on the environment
       - Goal: preferring some states of the environment over others
     - This is incredibly general
     - Examples: robots (and their components), games, better A/B testing

  4. The Markov Decision Process (MDP)
     - S: set of possible states of the environment
     - p(s_init), s_init ∈ S: distribution over the initial state
     - Markov property: we assume the current state summarizes everything we need to remember
     - A: set of possible actions
     - P(s_new | s_old, a): state transition distribution for each state s_old and action a
     - R: S → R: reward function
     - γ ∈ [0, 1]: discount factor
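
To make the objects on this slide concrete, here is a minimal sketch of a tabular MDP container. The class name TabularMDP and its fields are my own illustrative choices, not something defined in the talk.

```python
import numpy as np

class TabularMDP:
    """Minimal container matching the slide's (S, p_init, A, P, R, gamma)."""
    def __init__(self, n_states, n_actions, P, R, p_init, gamma=0.99):
        self.n_states = n_states      # |S|
        self.n_actions = n_actions    # |A|
        self.P = P                    # P[s, a, s'] = P(s' | s, a)
        self.R = R                    # R[s] = reward for reaching state s
        self.p_init = p_init          # distribution over initial states
        self.gamma = gamma            # discount factor

    def step(self, s, a, rng=np.random):
        """Sample the next state and its reward given (s, a)."""
        s_next = rng.choice(self.n_states, p=self.P[s, a])
        return s_next, self.R[s_next]
```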

  5. Policies and value functions
     - π: a policy (what action to take, given a state)
     - Return: the (possibly discounted) sum of future rewards, r_t + γ r_{t+1} + γ^2 r_{t+2} + ...
     - Performance of a policy: η(π) = E[return]
     - V_π(s) = E_π[return | s]: how good is a state, given a policy?
     - Q_π(s, a) = E_π[return | s, a]: how good is an action at a state, given a policy?
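
A short sketch of the discounted return and of a Monte Carlo estimate of η(π). Here sample_trajectory is a hypothetical helper (not from the talk) that rolls out the policy once and returns its list of rewards.

```python
def discounted_return(rewards, gamma):
    """r_0 + gamma*r_1 + gamma^2*r_2 + ... for one trajectory."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def estimate_eta(sample_trajectory, policy, gamma, n=1000):
    """Monte Carlo estimate of eta(pi): average return over sampled trajectories."""
    returns = [discounted_return(sample_trajectory(policy), gamma) for _ in range(n)]
    return sum(returns) / n
```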

  6. Policy iteration
     - Assume a perfect model of the MDP
     - Alternate between the following until convergence:
       - Evaluate the policy (compute V_π): for each state s, V(s) = E_{s', r}[r + γ V(s')]; repeat until convergence (or just once, for value iteration)
       - Set the policy to be greedy: π(s) = argmax_a E[r + γ V_π(s')]
     - Guaranteed convergence (for both policy and value iteration)
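
A minimal sketch of tabular policy iteration, assuming the TabularMDP container sketched earlier; the evaluation step implements V(s) = E[r + γ V(s')] under the current policy.

```python
import numpy as np

def policy_iteration(mdp, tol=1e-8):
    """Tabular policy iteration for the illustrative TabularMDP above."""
    V = np.zeros(mdp.n_states)
    pi = np.zeros(mdp.n_states, dtype=int)
    while True:
        # Policy evaluation: V(s) = sum_{s'} P(s'|s,pi(s)) [R(s') + gamma V(s')]
        while True:
            V_new = np.array([mdp.P[s, pi[s]] @ (mdp.R + mdp.gamma * V)
                              for s in range(mdp.n_states)])
            converged = np.max(np.abs(V_new - V)) < tol
            V = V_new
            if converged:
                break
        # Policy improvement: act greedily with respect to V
        Q = np.array([[mdp.P[s, a] @ (mdp.R + mdp.gamma * V)
                       for a in range(mdp.n_actions)]
                      for s in range(mdp.n_states)])
        pi_new = Q.argmax(axis=1)
        if np.array_equal(pi_new, pi):
            return pi, V
        pi = pi_new
```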

  7. Policy gradients
     - Policy iteration scales very badly: we have to repeatedly evaluate the policy on all states
     - Parameterize the policy: a ∼ π(a | s; θ)
     - Then we can sample instead of enumerating every state

  8. Policy gradients
     - Sample many trajectories by simulating the environment under the current policy
     - Make good actions more probable
     - Specifically, estimate the gradient with the score function gradient estimator: for each trajectory τ_i, ĝ_i = R(τ_i) ∇_θ log p(τ_i | θ)
     - Intuitively: take the gradient of the log probability of the trajectory, then weight it by the final reward
     - Reduce variance using temporal structure and other tricks (e.g. a baseline)
     - Replace the reward with the advantage function A_π(s, a) = Q_π(s, a) − V_π(s): intuitively, how much better is the action we picked than the average action?
     - Repeat
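
A sketch of the score-function estimator for a tabular softmax policy π_θ(a | s) ∝ exp(θ[s, a]). The policy class and trajectory format are my own assumptions for illustration, not something specified in the talk.

```python
import numpy as np

def grad_log_softmax(theta, s, a):
    """grad_theta log pi_theta(a | s) for a tabular softmax policy."""
    probs = np.exp(theta[s] - theta[s].max())
    probs /= probs.sum()
    g = np.zeros_like(theta)
    g[s] = -probs          # -pi(a'|s) for every action a'
    g[s, a] += 1.0         # +1 for the action actually taken
    return g

def policy_gradient_estimate(trajectories, theta):
    """g_hat = (1/N) sum_i R(tau_i) * sum_t grad_theta log pi_theta(a_t | s_t).
    Each trajectory is a (states, actions, total_return) triple."""
    g_hat = np.zeros_like(theta)
    for states, actions, total_return in trajectories:
        g_traj = sum(grad_log_softmax(theta, s, a) for s, a in zip(states, actions))
        g_hat += total_return * g_traj
    return g_hat / len(trajectories)
```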

  9. Vanilla policy gradient algorithm (REINFORCE)
     Initialize policy π_θ
     while gradient estimate has not converged do
         Sample trajectories using π_θ
         for each timestep do
             Compute return and advantage estimate
         end for
         Refit optimal baseline
         Update the policy using gradient estimate ĝ
     end while
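
A sketch of this loop, reusing the hypothetical helpers from the earlier snippets. Here sample_trajectory(theta) is assumed to roll out the policy and return (states, actions, rewards), and the "optimal baseline" is simplified to the batch mean return; none of these choices come from the talk.

```python
def reinforce(theta, sample_trajectory, gamma, lr=0.01, n_iters=200, batch=32):
    for _ in range(n_iters):
        batch_trajs, returns = [], []
        for _ in range(batch):
            states, actions, rewards = sample_trajectory(theta)
            G = discounted_return(rewards, gamma)
            returns.append(G)
            batch_trajs.append((states, actions, G))
        # "Refit optimal baseline": here, simply the mean return of the batch.
        baseline = sum(returns) / len(returns)
        centered = [(s, a, G - baseline) for (s, a, G) in batch_trajs]
        g_hat = policy_gradient_estimate(centered, theta)
        theta = theta + lr * g_hat   # gradient ascent on expected return
    return theta
```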

  10. Connection to supervised learning
     - Minimize L(θ) = Σ_t log π(a_t | s_t; θ) Â_t (in the paper, they use cost functions instead of reward functions)
     - Intuitively, we have some parameterized policy ("model") giving us a distribution over actions
     - We don't have the correct action ("label"), so we just use the reward at the end as our label
     - We can do better. How do we do credit assignment?
       - Baseline (roughly encourage half of the actions, not just all of them)
       - Discounted future reward (actions affect the near-term future), etc.
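
To make the supervised-learning analogy concrete, here is a minimal PyTorch-style sketch of the surrogate as an advantage-weighted log-likelihood. policy_net and the tensors are hypothetical placeholders.

```python
import torch

def surrogate_loss(policy_net, states, actions, advantages):
    logits = policy_net(states)                        # unnormalized action scores
    dist = torch.distributions.Categorical(logits=logits)
    log_probs = dist.log_prob(actions)                 # per-example log-likelihood of the taken action
    # With rewards we maximize, so the loss is the negative weighted log-likelihood;
    # with costs (the paper's convention) the sign flips and we minimize directly.
    return -(advantages.detach() * log_probs).mean()
```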

  11. Kakade & Langford: conservative policy iteration
     - "A useful identity" for η(π̃), the expected discounted return of a new policy π̃:
       η(π̃) = η(π) + E_π̃[Σ_{t=0}^∞ γ^t A_π(s_t, a_t)] = η(π) + Σ_s ρ_π̃(s) Σ_a π̃(a | s) A_π(s, a)
     - Intuitively: the expected return of the new policy is the expected return of the old policy, plus how much better the new policy is at each state
     - Local approximation: swap ρ_π̃ for ρ_π, since we only have the state visitation frequencies of the old policy, not the new one:
       L_π(π̃) = η(π) + Σ_s ρ_π(s) Σ_a π̃(a | s) A_π(s, a)
     - Kakade & Langford proved that optimizing this local approximation works for small step sizes, but only for mixture policies
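
In implementations, L_π(π̃) is typically estimated from samples drawn under the old policy via importance sampling, E_{s∼ρ_π, a∼π}[ (π̃(a|s)/π(a|s)) A_π(s, a) ]. A minimal sketch of that estimator (my own code, not from the talk; the constant η(π) is dropped since it does not affect the optimization):

```python
import torch

def surrogate_L(new_log_probs, old_log_probs, advantages):
    ratio = torch.exp(new_log_probs - old_log_probs)   # pi_new(a|s) / pi_old(a|s)
    return (ratio * advantages).mean()
```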

  12. Natural policy gradients
     - In this paper, they prove that η(π̃) ≥ L_π(π̃) − C · D_KL^max(π, π̃) (equivalently, in the paper's cost convention, η(π̃) ≤ L_π(π̃) + C · D_KL^max(π, π̃)), where C is a constant that depends on γ
     - Intuitively, we optimize the approximation, but regularize with the KL divergence between the old and new policy
     - This algorithm is called the natural policy gradient
     - Problem: choosing the penalty coefficient C is difficult
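
For reference, the explicit form of the bound in the TRPO paper (reward convention, with ε the largest absolute advantage) is, as I recall:

```latex
\eta(\tilde{\pi}) \;\ge\; L_\pi(\tilde{\pi}) \;-\; C\, D_{\mathrm{KL}}^{\max}(\pi, \tilde{\pi}),
\qquad
C = \frac{4 \epsilon \gamma}{(1-\gamma)^2},
\quad
\epsilon = \max_{s,a} \lvert A_\pi(s,a) \rvert .
```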

  13. Overview of TRPO
     - Instead of adding the KL divergence as a penalty term, simply use it as an optimization constraint
     - TRPO algorithm: optimize L_π(π̃) subject to the constraint D_KL^max(π, π̃) ≤ δ, for some easily-picked hyperparameter δ
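
In the practical algorithm, the paper replaces the max-KL of the theory with the average KL over states visited by the old policy, giving the constrained problem solved at each iteration (written here in the reward/maximization convention):

```latex
\max_{\theta} \; L_{\theta_{\text{old}}}(\theta)
\quad \text{subject to} \quad
\bar{D}_{\mathrm{KL}}^{\,\rho_{\theta_{\text{old}}}}\!\left(\theta_{\text{old}}, \theta\right) \le \delta .
```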

  14. Practical considerations
     - How do we sample trajectories?
       - Single path: simply run each sample to completion
       - "Vine": for each sampled trajectory, pick random states along the trajectory and perform small rollouts from them
     - How do we compute the gradient step?
       - Use the conjugate gradient algorithm followed by a line search
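
A sketch of the conjugate gradient solve. TRPO never forms the Fisher matrix F explicitly; it only needs Fisher-vector products, so fvp(v) below is a hypothetical callable returning F·v (e.g. computed via Hessian-vector products of the KL).

```python
import numpy as np

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Approximately solve F x = g using only matrix-vector products F @ v."""
    x = np.zeros_like(g)
    r = g.copy()            # residual g - F x  (x = 0 initially)
    p = g.copy()            # search direction
    rr = r @ r
    for _ in range(iters):
        Fp = fvp(p)
        alpha = rr / (p @ Fp)
        x += alpha * p
        r -= alpha * Fp
        rr_new = r @ r
        if rr_new < tol:
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x
```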

  15. Algorithm
     while not converged do
         Collect trajectories (either single-path or vine)
         Estimate the advantage function
         Compute the policy gradient estimate
         Solve the quadratic approximation to L(π_θ) using conjugate gradient
         Rescale the step using a line search
         Apply the update
     end while
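
A sketch of the "rescale using line search" step under the standard TRPO derivation: the quadratic KL constraint gives a maximal step size β = sqrt(2δ / sᵀFs) along the conjugate-gradient direction s ≈ F⁻¹g, and the step is then backtracked until the surrogate improves and the exact KL constraint holds. All helper names here (fvp, surrogate, mean_kl) are hypothetical.

```python
import numpy as np

def trpo_step(theta, g, fvp, surrogate, mean_kl, delta, backtrack=0.5, max_tries=10):
    s = conjugate_gradient(fvp, g)               # search direction s ~ F^{-1} g
    beta = np.sqrt(2.0 * delta / (s @ fvp(s)))   # largest step satisfying the quadratic KL constraint
    old_obj = surrogate(theta)
    for i in range(max_tries):
        theta_new = theta + (backtrack ** i) * beta * s
        # accept if the surrogate improves and the (exact) KL constraint holds
        if surrogate(theta_new) > old_obj and mean_kl(theta, theta_new) <= delta:
            return theta_new
    return theta                                  # line search failed: keep old parameters
```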

  16. Experiments: MuJoCo robotic locomotion
     - Link to demonstration
     - Same δ hyperparameter across experiments

  17. Experiments: MuJoCo learning curves
     - Link to demonstration
     - Same δ hyperparameter across experiments

  18. Experiments: Atari
     - Not always better than previous techniques, but consistently decent
     - Very little problem-specific engineering

  19. Takeaway
     - TRPO is a good default policy gradient technique that scales well and needs minimal hyperparameter tuning
     - Just add a KL constraint when optimizing the local approximation

  20. References
     - R. S. Sutton. Introduction to Reinforcement Learning
     - Kakade and Langford. Approximately Optimal Approximate Reinforcement Learning
     - Schulman et al. Trust Region Policy Optimization
     - Schulman, Levine, Finn. Deep Reinforcement Learning course (link)
     - Andrej Karpathy. Deep Reinforcement Learning: Pong from Pixels (link)
     - Trust Region Policy Optimization summary (link)

  21. Thanks!
     Link to the presentation: yixinlin.net/trpo
