Controlling Overestimation Bias with Truncated Mixture of Continuous Distributional Quantile Critics
Arsenii Kuznetsov¹, Pavel Shvechikov¹·², Alexander Grishin¹·³, Dmitry Vetrov¹·³
¹ Samsung AI Center, Moscow   ² Higher School of Economics, Moscow   ³ Samsung-HSE Laboratory
Overestimation bias in off-policy learning
1. Value estimates are imprecise
2. The agent pursues erroneous estimates
3. Errors propagate through time
4. Performance degrades
We propose a novel method: Truncated Quantile Critics (TQC)
Key elements of TQC
1. Distributional critics
   ● Impressive empirical performance
   ● Capture information about the return variance
2. Ensembling of the critics
   ● Increases performance and stability
3. Truncating the mixture of distributions
   ● Alleviates overestimation
TQC's novelties
1. Incorporates the stochasticity of returns into overestimation control
2. Provides a fine-grained and adjustable level of overestimation control
3. Decouples overestimation control from the number of critics
TQC is a new SOTA on MuJoCo
[Figure: results on MuJoCo benchmark environments comparing TQC with prior methods]
Stochastic Continuous Control in an MDP
[Diagram: agent-environment loop — the stochastic agent emits a continuous action; the environment returns a reward and the next state]
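In our notation (an assumption about what the diagram depicts, not text from the slide), the setting is a Markov decision process with continuous actions and a stochastic policy that maximizes the expected discounted return:

\[
\mathcal{M} = (\mathcal{S}, \mathcal{A}, p, r, \gamma), \quad \mathcal{A} \subseteq \mathbb{R}^{d},
\qquad
J(\pi) = \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\Big].
\]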
Overestimation: intuition
[Plot: value Q(a) over actions a — noisy samples around the true Q; the approximation Q̂ fitted to them attains its maximum at a_APPROX, where Q̂(a_APPROX) exceeds Q(a_APPROX) by an error]
Sources of distortion, U:
1. Insufficient data
2. Limited model capacity
3. SGD noise
4. Env's stochasticity
5. Ongoing policy changes
Overestimation: mathematical model [1]
The predicted maximum, averaged over a zero-mean distortion U, upper-bounds the true maximum (Jensen's inequality): Predicted ≥ True.
1. Policy exploits the critic's erroneous estimates
2. TD learning propagates estimation errors
3. A positive feedback loop may occur
[1]: Thrun, Sebastian, and Anton Schwartz. "Issues in using function approximation for reinforcement learning." 1993.
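In symbols (our notation, reconstructing the argument of [1]): for a zero-mean distortion U added to the true values,

\[
\mathbb{E}_{U}\Big[\max_{a}\big(Q(a) + U(a)\big)\Big]
\;\ge\;
\max_{a}\, \mathbb{E}_{U}\big[Q(a) + U(a)\big]
\;=\;
\max_{a} Q(a),
\]

so the predicted maximum is biased upward even though each individual estimate is unbiased.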
Soft Actor-Critic [2]
Soft Policy Evaluation: the critic is trained toward an entropy-augmented TD target.
Overestimation alleviation (Clipped Double Estimate [3]):
1. Maintain two independent critics
2. Use the minimum of their estimates in the TD target
[2]: Haarnoja, Tuomas, et al. "Soft Actor-Critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor." 2018.
[3]: Fujimoto, Scott, Herke van Hoof, and David Meger. "Addressing Function Approximation Error in Actor-Critic Methods." 2018.
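In symbols (our notation, following [2] and [3]): the soft TD target with the clipped double estimate is

\[
y(s, a) \;=\; r(s, a) \;+\; \gamma\, \mathbb{E}_{a' \sim \pi(\cdot \mid s')}
\Big[\min_{i \in \{1, 2\}} Q_{\bar{\theta}_i}(s', a') \;-\; \alpha \log \pi(a' \mid s')\Big],
\]

where \(\bar{\theta}_i\) are the target-network parameters and \(\alpha\) is the entropy temperature.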
Overestimation alleviation (Clipped Double Estimate [3]): limitations
● Coarse bias control: taking the minimum over two critics offers no knob to tune the amount of correction
● Wasteful aggregation: all information except the minimal estimate is discarded
Solution: Truncated Quantile Critics
TQC step 1: Prediction of N distributions — each of the N critics predicts M atoms approximating the return distribution at (s', a')
TQC step 2: Pooling — the N·M atoms are pooled into a single mixture
TQC step 3: Truncation — the largest atoms of the pooled mixture are dropped
TQC step 4: Discounting and Shifting — the remaining atoms are discounted by γ and shifted by the reward (with the entropy bonus), forming the target atoms (a code sketch follows)
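The four steps map onto a short computation. The sketch below is our NumPy illustration; the function name, array shapes, and defaults such as drop_per_net=2 are assumptions, not the authors' code.

```python
# Sketch of the TQC target construction (steps 1-4) from pre-computed atoms.
import numpy as np

def tqc_target_atoms(target_atoms, reward, not_done, log_prob_next,
                     drop_per_net=2, gamma=0.99, alpha=0.2):
    """target_atoms: shape (N, M) -- M atoms from each of the N target critics,
    evaluated at (s', a') with a' ~ pi(.|s')."""
    n_nets, n_atoms = target_atoms.shape

    # Step 2: pool all N*M atoms into a single sorted mixture.
    pooled = np.sort(target_atoms.reshape(-1))

    # Step 3: truncate -- drop the d*N largest atoms (d = drop_per_net).
    kept = pooled[: n_nets * (n_atoms - drop_per_net)]

    # Step 4: discount and shift, with the SAC-style entropy bonus.
    return reward + not_done * gamma * (kept - alpha * log_prob_next)

# Illustrative usage with N = 5 critics and M = 25 atoms each:
atoms = np.random.randn(5, 25)
y = tqc_target_atoms(atoms, reward=1.0, not_done=1.0, log_prob_next=-1.3)
```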
Training
For each Z-network: its predicted atoms are regressed onto the truncated target atoms with the quantile regression loss [4] (see the sketch below).
Policy: maximizes the non-truncated average of all atoms of the mixture.
[4]: Dabney, Will, et al. "Distributional reinforcement learning with quantile regression." Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
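A minimal sketch of the quantile Huber loss of [4] for a single Z-network, written with NumPy; the names and the kappa default are our assumptions.

```python
import numpy as np

def quantile_huber_loss(quantiles, targets, kappa=1.0):
    """quantiles: M atoms predicted by one Z-network; targets: truncated target atoms."""
    m = len(quantiles)
    taus = (np.arange(m) + 0.5) / m                  # quantile midpoints tau_i

    # Pairwise TD errors: delta[i, j] = target_j - quantile_i.
    delta = targets[None, :] - quantiles[:, None]
    abs_delta = np.abs(delta)

    # Huber penalty with threshold kappa.
    huber = np.where(abs_delta <= kappa,
                     0.5 * delta ** 2,
                     kappa * (abs_delta - 0.5 * kappa))

    # Asymmetric weight |tau_i - 1{delta < 0}| makes the regression quantile-aware.
    weight = np.abs(taus[:, None] - (delta < 0).astype(float))
    return np.mean(weight * huber / kappa)
```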
Our contribution: Truncated Quantile Critics
1. Uses return stochasticity for overestimation control — a novel direction: the interplay between overestimation and stochasticity
[Diagram: as the variance of the return distribution increases, the overestimation of its mean increases; truncating the upper part of the distribution compensates for this overestimation]
2. Provides adjustable and fine-grained overestimation bias control
[Diagram: the fraction of dropped quantiles (1/5, 3/10, 2/5, ...) sets the amount of overestimation compensation; increasing the number of atoms per critic from M = 5 to M = 10 increases the resolution of this control]
3. Decouples overestimation control from the number of approximators
[Diagram: increasing the number of networks N = 1, 2, 3 improves performance at the cost of more computation, while the fraction of dropped quantiles controls the bias independently of N]
A worked example of this decoupling is given below.
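A worked example of the decoupling, with illustrative numbers (close to a typical TQC configuration, not quoted from the slide): with N = 5 critics of M = 25 atoms each, the pooled mixture has N·M = 125 atoms, and dropping d = 2 atoms per critic removes

\[
\frac{d \cdot N}{M \cdot N} \;=\; \frac{d}{M} \;=\; \frac{2}{25}
\]

of the atoms, so the truncation fraction is tuned by d in steps of 1/M, independently of the number of networks N.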