Controlling Overestimation Bias with Truncated Mixture of Continuous Distributional Quantile Critics
Arsenii Kuznetsov¹, Pavel Shvechikov¹·², Alexander Grishin¹·³, Dmitry Vetrov¹·³
¹ Samsung AI Center, Moscow   ² Higher School of Economics, Moscow   ³ Samsung-HSE Laboratory
Overestimation bias in off-policy learning
1. Value estimates are imprecise
2. The agent pursues erroneous estimates
3. Errors propagate through time
4. Performance degrades
We propose a novel method: Truncated Quantile Critics (TQC)
Key elements of TQC
1. Distributional critics
   ● Impressive empirical performance
   ● Capture information about the return variance
2. Ensembling of the critics
   ● Increases performance and stability
3. Truncating the mixture of distributions
   ● Alleviates overestimation
TQC's novelties
1. Incorporates the stochasticity of returns into overestimation control
2. Provides a fine-grained and adjustable level of overestimation control
3. Decouples overestimation control from the number of critics
TQC is a new SOTA on MuJoCo
[Figure: results on MuJoCo benchmark environments comparing TQC with prior methods]
Stochastic Continuous Control in an MDP
[Diagram: agent-environment loop — the stochastic agent emits a continuous action; the environment returns a reward and the next state]
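In our notation (an assumption about what the diagram depicts, not text from the slide), the setting is a Markov decision process with continuous actions and a stochastic policy that maximizes the expected discounted return:

\[
\mathcal{M} = (\mathcal{S}, \mathcal{A}, p, r, \gamma), \quad \mathcal{A} \subseteq \mathbb{R}^{d},
\qquad
J(\pi) = \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\Big].
\]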
Overestimation: intuition
[Plot: value Q(a) over actions a — noisy samples around the true Q; the approximation Q̂ fitted to them attains its maximum at a_APPROX, where Q̂(a_APPROX) exceeds Q(a_APPROX) by an error]
Sources of distortion, U:
1. Insufficient data
2. Limited model capacity
3. SGD noise
4. Env's stochasticity
5. Ongoing policy changes
Overestimation: mathematical model [1]
The predicted maximum, averaged over a zero-mean distortion U, upper-bounds the true maximum (Jensen's inequality): Predicted ≥ True.
1. Policy exploits the critic's erroneous estimates
2. TD learning propagates estimation errors
3. A positive feedback loop may occur
[1]: Thrun, Sebastian, and Anton Schwartz. "Issues in using function approximation for reinforcement learning." 1993.
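In symbols (our notation, reconstructing the argument of [1]): for a zero-mean distortion U added to the true values,

\[
\mathbb{E}_{U}\Big[\max_{a}\big(Q(a) + U(a)\big)\Big]
\;\ge\;
\max_{a}\, \mathbb{E}_{U}\big[Q(a) + U(a)\big]
\;=\;
\max_{a} Q(a),
\]

so the predicted maximum is biased upward even though each individual estimate is unbiased.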
Soft Actor-Critic [2]
Soft Policy Evaluation: the critic is trained toward an entropy-augmented TD target.
Overestimation alleviation (Clipped Double Estimate [3]):
1. Maintain two independent critics
2. Use the minimum of their estimates in the TD target
[2]: Haarnoja, Tuomas, et al. "Soft Actor-Critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor." 2018.
[3]: Fujimoto, Scott, Herke van Hoof, and David Meger. "Addressing Function Approximation Error in Actor-Critic Methods." 2018.
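In symbols (our notation, following [2] and [3]): the soft TD target with the clipped double estimate is

\[
y(s, a) \;=\; r(s, a) \;+\; \gamma\, \mathbb{E}_{a' \sim \pi(\cdot \mid s')}
\Big[\min_{i \in \{1, 2\}} Q_{\bar{\theta}_i}(s', a') \;-\; \alpha \log \pi(a' \mid s')\Big],
\]

where \(\bar{\theta}_i\) are the target-network parameters and \(\alpha\) is the entropy temperature.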
Overestimation alleviation (Clipped Double Estimate [3]): limitations
● Coarse bias control: taking the minimum over two critics offers no knob to tune the amount of correction
● Wasteful aggregation: all information except the minimal estimate is discarded
Solution: Truncated Quantile Critics
TQC step 1: Prediction of N distributions — each of the N critics predicts M atoms approximating the return distribution at (s', a')
TQC step 2: Pooling — the N·M atoms are pooled into a single mixture
TQC step 3: Truncation — the largest atoms of the pooled mixture are dropped
TQC step 4: Discounting and Shifting — the remaining atoms are discounted by γ and shifted by the reward (with the entropy bonus), forming the target atoms (a code sketch follows)
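The four steps map onto a short computation. The sketch below is our NumPy illustration; the function name, array shapes, and defaults such as drop_per_net=2 are assumptions, not the authors' code.

```python
# Sketch of the TQC target construction (steps 1-4) from pre-computed atoms.
import numpy as np

def tqc_target_atoms(target_atoms, reward, not_done, log_prob_next,
                     drop_per_net=2, gamma=0.99, alpha=0.2):
    """target_atoms: shape (N, M) -- M atoms from each of the N target critics,
    evaluated at (s', a') with a' ~ pi(.|s')."""
    n_nets, n_atoms = target_atoms.shape

    # Step 2: pool all N*M atoms into a single sorted mixture.
    pooled = np.sort(target_atoms.reshape(-1))

    # Step 3: truncate -- drop the d*N largest atoms (d = drop_per_net).
    kept = pooled[: n_nets * (n_atoms - drop_per_net)]

    # Step 4: discount and shift, with the SAC-style entropy bonus.
    return reward + not_done * gamma * (kept - alpha * log_prob_next)

# Illustrative usage with N = 5 critics and M = 25 atoms each:
atoms = np.random.randn(5, 25)
y = tqc_target_atoms(atoms, reward=1.0, not_done=1.0, log_prob_next=-1.3)
```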
Training
For each Z-network: its predicted atoms are regressed onto the truncated target atoms with the quantile regression loss [4] (see the sketch below).
Policy: maximizes the non-truncated average of all atoms of the mixture.
[4]: Dabney, Will, et al. "Distributional reinforcement learning with quantile regression." Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
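A minimal sketch of the quantile Huber loss of [4] for a single Z-network, written with NumPy; the names and the kappa default are our assumptions.

```python
import numpy as np

def quantile_huber_loss(quantiles, targets, kappa=1.0):
    """quantiles: M atoms predicted by one Z-network; targets: truncated target atoms."""
    m = len(quantiles)
    taus = (np.arange(m) + 0.5) / m                  # quantile midpoints tau_i

    # Pairwise TD errors: delta[i, j] = target_j - quantile_i.
    delta = targets[None, :] - quantiles[:, None]
    abs_delta = np.abs(delta)

    # Huber penalty with threshold kappa.
    huber = np.where(abs_delta <= kappa,
                     0.5 * delta ** 2,
                     kappa * (abs_delta - 0.5 * kappa))

    # Asymmetric weight |tau_i - 1{delta < 0}| makes the regression quantile-aware.
    weight = np.abs(taus[:, None] - (delta < 0).astype(float))
    return np.mean(weight * huber / kappa)
```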
Our contribution: Truncated Quantile Critics
1. Uses return stochasticity for overestimation control — a novel direction: the interplay between overestimation and stochasticity
[Diagram: as the variance of the return distribution increases, the overestimation of its mean increases; truncating the upper part of the distribution compensates for this overestimation]
2. Provides adjustable and fine-grained overestimation bias control
[Diagram: the fraction of dropped quantiles (1/5, 3/10, 2/5, ...) sets the amount of overestimation compensation; increasing the number of atoms per critic from M = 5 to M = 10 increases the resolution of this control]
3. Decouples overestimation control from the number of approximators
[Diagram: increasing the number of networks N = 1, 2, 3 improves performance at the cost of more computation, while the fraction of dropped quantiles controls the bias independently of N]
A worked example of this decoupling is given below.
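A worked example of the decoupling, with illustrative numbers (close to a typical TQC configuration, not quoted from the slide): with N = 5 critics of M = 25 atoms each, the pooled mixture has N·M = 125 atoms, and dropping d = 2 atoms per critic removes

\[
\frac{d \cdot N}{M \cdot N} \;=\; \frac{d}{M} \;=\; \frac{2}{25}
\]

of the atoms, so the truncation fraction is tuned by d in steps of 1/M, independently of the number of networks N.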