Predictor-Corrector Policy Optimization (PicCoLO)


  1. Predictor-Corrector Policy Optimization (PicCoLO)
     Ching-An Cheng, Xinyan Yan, Nathan Ratliff, Byron Boots
     June 11, 2019 @ ICML

  2. Policy optimization
     We consider episodic learning in an MDP and optimize a policy for sequential decision making.

  3. Learning efficiency
     ● Cost of interactions > cost of computation, so learning efficiency = sample efficiency
       ○ We should perhaps spend time on planning before real interactions
       ○ To do so we need models, but should we use them?

  4. Why we should use models
     ● A way to summarize prior knowledge & past experiences
     ● Can optimize the policy indirectly, without costly real-world interactions
     ● Can be provably more sample-efficient (Sun et al., 2019)
     [Illustrations: AirSim, FleX, Miniatur Wunderland]

  5. Why we should NOT use models
     ● Models are, by definition, inexact
     ● Weaknesses of the model can be exploited during policy optimization
     ● This results in biased performance of the trained policy: "the reality gap"
     [Illustration: ItsJerryAndHarry (YouTube)]

  6. Toward reconciling both sides
     [Diagram: the "biased world" vs. the "unbiased kingdom". Model-based methods are efficient but biased; model-free methods are unbiased but inefficient. Existing hybrids include dual policy iteration, Dyna, learning to plan, control variates, and horizon truncation.]

  7. A new direction
     [Diagram: PicCoLO is placed as a new unbiased and efficient hybrid between the model-based (efficient, biased) and model-free (unbiased, inefficient) regions.]
     *Can be combined with control variates and learning to plan

  8. A new direction
     ● Main idea: we should not fully trust a model (as the methods in the biased world do), but leverage only its correct part
     ● How?
       1. Frame policy optimization as predictable online learning (POL)
       2. Design a reduction-based algorithm for POL that reuses known algorithms
       3. Translated back, this gives a meta-algorithm for policy optimization

  9. Online learning
     [Diagram: a LEARNER (the policy optimization algorithm) plays against an OPPONENT.]

  10. Online learning
      [Diagram of one round: the learner makes a decision (tries a policy), the opponent chooses a loss (statistics are observed), and the learner suffers the loss (the policy is updated).]

  11. Online learning
      ● The loss sequence can be adversarially chosen by the opponent
      ● A common performance measure is the regret against the best fixed decision in hindsight
      ● For convex losses, algorithms with sublinear regret are well known, e.g. mirror descent, follow-the-regularized-leader, etc.
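
As a concrete illustration (not from the talk), here is a minimal sketch of this loop with online gradient descent, the Euclidean special case of mirror descent, as the learner; the decaying step size is the standard choice that gives sublinear regret for convex losses, and `loss_grad`, `targets`, and the other names are placeholders.

    import numpy as np

    def online_gradient_descent(loss_grad, x0, step=0.5, rounds=100):
        """Online-learning loop with online gradient descent as the learner.

        loss_grad(n, x): gradient of the opponent's round-n loss at decision x.
        Returns the sequence of decisions the learner played.
        """
        x = np.array(x0, dtype=float)
        decisions = []
        for n in range(rounds):
            decisions.append(x.copy())          # learner commits to a decision
            g = loss_grad(n, x)                 # opponent reveals the loss (via its gradient)
            x = x - step / np.sqrt(n + 1) * g   # decaying-step update -> sublinear regret for convex losses
        return decisions

    # Toy example: quadratic losses centered at slowly drifting targets.
    targets = np.linspace(0.0, 1.0, 100)
    played = online_gradient_descent(lambda n, x: 2.0 * (x - targets[n]), x0=[0.0])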

  12. Policy optimization as online learning
      ● Define online losses such that sublinear regret implies policy learning
      ● This idea started in the context of imitation learning (Ross et al., 2011)
      ● We show that episodic policy optimization can be viewed similarly

  13. Policy optimization as online learning
      [Diagram: the same loop. The learner's decision is to try a policy; the observed statistics are the states visited by the current policy together with an advantage function; the per-round loss checks whether the current policy is better than the previous policy; the learner then updates the policy.]
      ● The gradient of this loss is the actor-critic gradient (as implemented in practice)
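
For illustration only (a sketch, not the authors' exact definition; cost-minimization convention, and the precise choice of state distribution and advantage baseline follows the paper), a per-round loss of roughly this shape has the actor-critic gradient:

    \ell_n(\theta) \;=\; \mathbb{E}_{s \sim d_{\pi_n}}\, \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\!\left[ A^{\pi_{n-1}}(s, a) \right],
    \qquad
    \nabla_\theta \ell_n(\theta_n) \;=\; \mathbb{E}_{s \sim d_{\pi_n}}\, \mathbb{E}_{a \sim \pi_{\theta_n}}\!\left[ \nabla_\theta \log \pi_{\theta_n}(a \mid s)\, A^{\pi_{n-1}}(s, a) \right].

Because the state distribution d_{\pi_n} is held fixed within round n, differentiating only through the action distribution gives the familiar actor-critic form.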

  14. Possible algorithms
      ● We can try typical no-regret algorithms from online learning, e.g. mirror descent -> actor-critic update
      ● But it turns out they are not optimal; we can actually learn faster!
      ● Insight: the loss functions here are not adversarial but can be inferred from the past (e.g. similar policies visit similar states), whereas these typical algorithms were designed for adversarial setups

  15. Predictability and predictive models
      ● We can view predictability, e.g., as the ability to predict future gradients
      ● Predictive model: a function that estimates the gradient of a future loss
      ● Examples: (averaged) past gradients from a replay buffer, a function approximator, an inexact simulator (see the sketch below for the first example)
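
A minimal sketch (not from the talk) of the first example above: a predictive model that simply averages the gradients stored in a replay buffer. A learned function approximator or an inexact simulator could expose the same interface.

    import numpy as np

    class AveragedGradientPredictor:
        """Predict the next round's gradient as the running average of past gradients."""

        def __init__(self):
            self.buffer = []  # replay buffer of observed gradients

        def update(self, grad):
            """Store the gradient observed after the latest real interaction."""
            self.buffer.append(np.asarray(grad, dtype=float))

        def predict(self):
            """Estimated gradient of the upcoming loss (None if no data yet)."""
            if not self.buffer:
                return None
            return np.mean(self.buffer, axis=0)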

  16. Policy optimization is predictable online learning
      ● We need algorithms that take predictability into account
      ● There are two-step algorithms for predictable setups, but ...
      ● We have more sophisticated algorithms for adversarial problems, but ...
      ● That is, we need a reduction from predictable to adversarial problems: this is PicCoLO

  17. The idea behind PicCoLO
      [Diagram: a predictable problem reduces to an adversarial one. It suffices to consider the prediction error, if we wisely select the prediction via a predictive model.]
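
In symbols (an illustrative restatement, writing \hat{\ell}_n for the model's prediction of the round-n loss and \hat{g}_n for the predicted gradient):

    \ell_n \;=\; \underbrace{\hat{\ell}_n}_{\text{predicted, known in advance}} \;+\; \underbrace{\left(\ell_n - \hat{\ell}_n\right)}_{\text{error, treated as adversarial}},
    \qquad \text{or, in gradient form,} \qquad
    g_n \;=\; \hat{g}_n + e_n.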

  18. PicCoLO is a meta-algorithm
      ● Apply a standard method (e.g. gradient descent) to this new sequence, split into a Prediction Step and a Correction Step (a sketch follows below)
      ● Trick: adapt the step size based on the size of the gradient error; take larger steps when the prediction is accurate, and vice versa
      ● (PicCoLO can use a control variate to further reduce the variance)
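
A minimal sketch (not the authors' implementation) of such a predictor-corrector loop, using plain gradient descent as the base algorithm; rollout_grad is an assumed callable returning a sampled policy-gradient estimate from real interaction, predictor is any object with predict()/update() such as the averaged-gradient model above, and the step-size rule is only a crude stand-in for the adaptive scheme described in the talk.

    import numpy as np

    def predictor_corrector_loop(rollout_grad, predictor, theta0, eta=0.05, rounds=100):
        """Predictor-corrector policy updates around a gradient-descent base algorithm."""
        theta = np.asarray(theta0, dtype=float)
        for n in range(rounds):
            # Prediction step: move along the model's guess before interacting.
            g_hat = predictor.predict()
            if g_hat is None:
                g_hat = np.zeros_like(theta)
            theta = theta - eta * g_hat

            # Correction step: interact, observe the sampled gradient, and
            # correct by the prediction error.
            g = rollout_grad(theta)
            error = g - g_hat
            # Crude version of the adaptive trick: shrink the step when the
            # prediction error is large, keep it large when the prediction is good.
            eta_n = eta / (1.0 + np.linalg.norm(error))
            theta = theta - eta_n * error

            predictor.update(g)
        return theta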

  19. PicCoLO
      ● Prediction Step / Correction Step (as above)
      ● The same idea applies to any algorithm in the family of (adaptive) mirror descent and Follow-the-Regularized-Leader
      ● PicCoLO recovers existing algorithms, e.g. extra-gradient and optimistic mirror descent, and provides their adaptive generalizations
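
For concreteness, a sketch of the Euclidean special case (the general statement replaces squared distances with Bregman divergences): with prediction \hat{g}_n and observed gradient g_n,

    x_n \;=\; \hat{x}_{n-1} - \eta\, \hat{g}_n \quad \text{(prediction step)},
    \qquad
    \hat{x}_n \;=\; x_n - \eta\, (g_n - \hat{g}_n) \;=\; \hat{x}_{n-1} - \eta\, g_n \quad \text{(correction step)}.

Choosing \hat{g}_n = g_{n-1} yields optimistic gradient descent, while evaluating the true gradient at the intermediate point x_n recovers the extra-gradient update.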

  20. PicCoLO
      ● Prediction Step / Correction Step (as above)
      ● Theoretically, we can show that
        • the performance is unbiased, even when the prediction (model) is incorrect
        • learning accelerates when the prediction is relatively accurate

  21. How to compute the prediction
      ● We want the prediction to match the upcoming gradient, so that the prediction error vanishes
      ● We can use the predictive model to realize this!

  22. How to compute the prediction
      ● We want the prediction to match the upcoming gradient (as above)
      ● The upcoming gradient depends on the decision produced by the prediction step, which in turn depends on the prediction itself, so we can select the prediction by solving a fixed-point problem (FP)

  23. How to compute the prediction
      ● When the predictive model is itself a gradient (e.g. of a simulated objective), the FP becomes an optimization problem: this is regularized optimal control
      ● Heuristic: evaluate the model at the previous decision, or just do a few fixed-point iterations (see the sketch below)
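
A minimal sketch (not the authors' implementation) of these two heuristics; model_grad is an assumed callable that evaluates the predictive model at a candidate decision, and base_step applies the base algorithm's prediction step (e.g. lambda theta, g: theta - eta * g).

    def choose_prediction(model_grad, base_step, theta_hat, n_iters=3):
        """Pick the prediction by (approximate) fixed-point iteration.

        The prediction should match the model's gradient at the decision that
        the prediction step itself will produce, i.e. a fixed point.
        n_iters=0 corresponds to the previous-decision heuristic.
        """
        g_hat = model_grad(theta_hat)          # previous-decision heuristic
        for _ in range(n_iters):               # a few approximate fixed-point iterations
            theta_next = base_step(theta_hat, g_hat)
            g_hat = model_grad(theta_next)
        return g_hat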

  24. Experiments
      ● For example, with ADAM as the base algorithm
      [Plot: cartpole, accumulated reward vs. iteration, previous-decision heuristic. PicCoLO shows acceleration when predictions are accurate and is robust against model error.]
      ● Similar properties are observed for other base algorithms (e.g. natural gradient descent, TRPO)

  25. Experiments
      ● For example, with ADAM as the base algorithm
      [Plots: hopper and snake, accumulated reward vs. iteration, comparing the previous-decision heuristic with the approximate fixed-point heuristic.]
      ● The fixed-point formulation converges even faster

  26. Summary
      • "PicCoLOed" model-free algorithms can learn faster without bias
      • The predictive model can be viewed as a unified interface for injecting prior knowledge; learning and parameterizing predictive models are of practical importance
      • As PicCoLO is designed for general predictable online learning, we expect applications to other problems and domains

  27. Thanks for your attention! Please come to our poster (#106).
