Controlling Overestimation Bias with Truncated Mixture of Continuous Distributional Quantile Critics (PowerPoint presentation)


  1. Controlling Overestimation Bias with Truncated Mixture of Continuous Distributional Quantile Critics. Arsenii Kuznetsov [1], Pavel Shvechikov [1,2], Alexander Grishin [1,3], Dmitry Vetrov [1,3]. [1] Samsung AI Center, Moscow; [2] Higher School of Economics, Moscow; [3] Samsung HSE Laboratory

  2-3. Overestimation bias in off-policy learning
       1. Value estimates are imprecise
       2. The agent pursues erroneous estimates ...
       3. Errors propagate through time
       4. Performance degrades
       We propose a novel method: Truncated Quantile Critics (TQC)

  4-6. Key elements of TQC
       1. Distributional critics: impressive empirical performance; captures information about the return variance
       2. Ensembling of the critics: increases performance and stability
       3. Truncating the mixture of distributions: alleviates overestimation

  7-9. TQC's novelties
       1. Incorporates the stochasticity of returns into overestimation control
       2. Provides a fine-grained and adjustable level of overestimation control
       3. Decouples overestimation control from the number of critics

  10-11. TQC is a new SOTA on MuJoCo

  12-14. Stochastic Continuous Control in an MDP [diagram: a stochastic agent emits continuous actions; the environment returns the next state and a reward]
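As a concrete stand-in for this diagram, a minimal interaction loop; Pendulum-v1 and the gymnasium package are illustrative choices, not part of the talk:

```python
import gymnasium as gym

env = gym.make("Pendulum-v1")              # continuous action space
obs, _ = env.reset(seed=0)
for _ in range(5):
    action = env.action_space.sample()     # placeholder for a stochastic agent
    obs, reward, terminated, truncated, _ = env.step(action)
    if terminated or truncated:
        obs, _ = env.reset()
env.close()
```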

  15-18. Overestimation: intuition [figure: value Q(a) over actions a; noisy samples scatter around the true Q; the fitted approximation overshoots at its maximizing action a_APPROX, and the gap between the approximate and the true value at a_APPROX is the error]
       Sources of distortion, U:
       1. Insufficient data
       2. Limited model capacity
       3. SGD noise
       4. Env's stochasticity
       5. Ongoing policy changes
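A minimal numpy sketch (not from the slides) of the intuition above: even when all true action values are equal, taking the maximum of noisy estimates is biased upward. All names and constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_trials, noise_std = 10, 100_000, 1.0

q_true = np.zeros(n_actions)                    # true values: the true maximum is 0
u = rng.normal(0.0, noise_std, size=(n_trials, n_actions))  # zero-mean distortion U
q_approx = q_true + u                           # noisy approximation of Q

print("mean of max_a (Q + U):", q_approx.max(axis=1).mean())  # about 1.5, biased up
print("max_a Q:", q_true.max())                               # 0.0
```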

  19-24. Overestimation: mathematical model [1]. The predicted maximum, averaged over a zero-mean distortion, is at least the true maximum (Jensen's inequality): Predicted ≥ True.
       1. The policy exploits the critic's erroneous estimates
       2. TD-learning propagates estimation errors
       3. A positive feedback loop may occur
       [1]: Thrun, Sebastian, and Anton Schwartz. "Issues in using function approximation for reinforcement learning."
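The equation images from slides 19-24 are not in the transcript; the following is a reconstruction of the standard argument from [1], with Q(s, a) the true value and U(s, a) a zero-mean distortion:

```latex
% Predicted maximum averaged over the zero-mean distortion U:
\mathbb{E}_{U}\!\left[\max_{a}\bigl(Q(s,a) + U(s,a)\bigr)\right]
  \;\ge\; \max_{a}\,\mathbb{E}_{U}\bigl[Q(s,a) + U(s,a)\bigr]
  \;=\; \max_{a} Q(s,a)
% Jensen's inequality applied to the convex max operator:
% the predicted maximum overestimates the true maximum on average.
```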

  25-31. Soft Actor Critic [2]: Soft Policy Evaluation; overestimation alleviation with the Clipped Double Estimate [3]
       [2]: Haarnoja, Tuomas, et al. "Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor."
       [3]: Fujimoto, Scott, Herke van Hoof, and David Meger. "Addressing function approximation error in actor-critic methods."
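A hedged reconstruction of the targets slides 25-31 refer to (the equation images are not in the transcript); notation follows [2] and [3], with target-critic parameters written with a bar and alpha the entropy temperature:

```latex
% Soft Policy Evaluation (SAC [2]): TD target for the critic
y(s,a) \;=\; r(s,a) \;+\; \gamma\,
  \mathbb{E}_{s' \sim p(\cdot\mid s,a),\, a' \sim \pi(\cdot\mid s')}
  \bigl[\, Q_{\bar\theta}(s',a') \;-\; \alpha \log \pi(a'\mid s') \,\bigr]
% Clipped Double Estimate [3]: replace the critic value in the target
% with the minimum over two target critics
Q_{\bar\theta}(s',a') \;\longrightarrow\; \min_{i \in \{1,2\}} Q_{\bar\theta_i}(s',a')
```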

  32. Limitations of the Clipped Double Estimate [3]: coarse bias control; wasteful aggregation. Solution: Truncated Quantile Critics

  33. TQC step 1: Prediction of N distributions

  34. TQC step 2: Pooling

  35. TQC step 3: Truncation

  36. TQC step 4: Discounting and Shifting
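A minimal numpy sketch of steps 1-4 (an illustration, not the authors' code; the array names, the SAC-style entropy term with temperature alpha, and the done-mask handling are assumptions):

```python
import numpy as np

def tqc_target_atoms(rewards, dones, next_atoms, next_logp,
                     drop_per_net, gamma=0.99, alpha=0.2):
    """Build the truncated distributional target.

    next_atoms: (batch, N, M) atoms predicted by the N critics for
                (s', a'), a' ~ pi(.|s')                  -- step 1
    next_logp:  (batch,) log pi(a'|s')
    drop_per_net: d, number of largest atoms dropped per critic
    """
    batch, n_nets, n_quantiles = next_atoms.shape
    pooled = next_atoms.reshape(batch, n_nets * n_quantiles)   # step 2: pooling
    keep = (n_quantiles - drop_per_net) * n_nets
    truncated = np.sort(pooled, axis=1)[:, :keep]              # step 3: truncation
    # step 4: discounting and shifting (with an entropy bonus, SAC-style)
    shifted = truncated - alpha * next_logp[:, None]
    return rewards[:, None] + gamma * (1.0 - dones[:, None]) * shifted
```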

  37-40. Training. For each Z-network: a quantile regression loss [4] against the truncated target of steps 1-4. Policy: maximizes the non-truncated average of all atoms of the mixture.
       [4]: Dabney, Will, et al. "Distributional reinforcement learning with quantile regression." Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
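For concreteness, a sketch of the quantile Huber loss of [4] that each Z-network's M atoms would minimize against the target atoms; written in numpy for brevity, although in practice an autodiff framework is used:

```python
import numpy as np

def quantile_huber_loss(pred_atoms, target_atoms, kappa=1.0):
    """pred_atoms: (batch, M) atoms of one critic; target_atoms: (batch, K)."""
    batch, m = pred_atoms.shape
    tau = (np.arange(m) + 0.5) / m                     # quantile fractions (2i - 1) / 2M
    delta = target_atoms[:, None, :] - pred_atoms[:, :, None]   # pairwise TD errors
    abs_delta = np.abs(delta)
    huber = np.where(abs_delta <= kappa,
                     0.5 * delta ** 2,
                     kappa * (abs_delta - 0.5 * kappa))
    weight = np.abs(tau[None, :, None] - (delta < 0).astype(float))
    return (weight * huber / kappa).mean()             # averaged over all atom pairs
```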

  41-45. Our contribution: Truncated Quantile Critics
       1. Uses return stochasticity for overestimation control. A novel direction: the interplay between overestimation and stochasticity [figure: as the variance of the return distribution increases, the overestimation of the mean increases; truncation compensates for the overestimation]
       2. Provides adjustable and fine-grained overestimation bias control [figure: fractions of dropped quantiles, e.g. 1/5 and 2/5 for M = 5 atoms versus 3/10 for M = 10; increasing the number of quantiles M increases the resolution of the overestimation compensation]
       3. Decouples overestimation control and the number of approximators [figure: the fraction of dropped quantiles is tuned independently of the number of networks N = 1, 2, 3; more networks give better performance at the cost of more computation]
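A sketch of the bookkeeping behind points 2 and 3, under the assumed notation of N critics with M atoms each and d largest atoms dropped per critic:

```latex
\text{pooled atoms: } NM, \qquad
\text{kept atoms: } kN,\ \ k = M - d, \qquad
\text{dropped fraction: } \tfrac{d}{M} \in \bigl\{0, \tfrac{1}{M}, \tfrac{2}{M}, \dots\bigr\}
```

The dropped fraction is adjustable in steps of 1/M (finer control for larger M) and does not depend on the number of critics N.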
