

  1. CSC2547 Presentation: Curiosity-driven exploration, Count-based vs. Info-gain-based. Sheng Jia, Tinglin Duan (first-year master's students)

  2. The three papers: 1. PLAN (2011), 2. VIME (NeurIPS 2016), 3. CTS (NeurIPS 2016)

  3. Outline: Motivation, Related Works and Demo; Planning to Be Surprised; Variational Information Maximizing Exploration; Unifying Count-Based Exploration and Intrinsic Motivation; Comparisons and Discussion

  4. Outline: Motivation, Related Works and Demo; Planning to Be Surprised; Variational Information Maximizing Exploration; Unifying Count-Based Exploration and Intrinsic Motivation; Comparisons and Discussion

  5. Background: intrinsic reward / exploration, RL + curiosity bonus. [Diagram: given the history, the agent takes an action; the environment returns the next state and an extrinsic reward.]

  6. What is exploration? Intrinsic motivation: reducing the agent's uncertainty over the environment's dynamics [PLAN] [VIME] [CTS]. Count-based: use (pseudo) visitation counts to guide agents to unvisited states.

  7. Why is exploration useful? DEMO: the sparse-reward problem in Montezuma's Revenge, DQN vs. DQN + exploration bonus. [Our original plot & demo: x-axis states s1, s2, s3, ..., sT; y-axis training timestep; z-axis intrinsic reward.]

  8. Related work (timeline): 1990-2010 Formal Theory of Creativity, Fun, and Intrinsic Motivation (the notion of intrinsic motivation); 2011 PLAN (approximate Bayesian-optimal exploration); 2015 Incentivizing Exploration in Reinforcement Learning with Deep Predictive Models (L2 prediction error using neural networks); 2016 VIME and CTS (the pseudo-count introduced); 2017 Count-Based Exploration with Neural Density Models (pseudo-count + PixelCNN still achieves SOTA for Montezuma's Revenge); 2018 Exploration by Random Network Distillation (distillation error as a quantification of uncertainty); 2019 On Bonus Based Exploration Methods in the Arcade Learning Environment.

  9. Outline: Motivation, Related Works and Demo; Planning to Be Surprised; Variational Information Maximizing Exploration; Unifying Count-Based Exploration and Intrinsic Motivation; Comparisons and Discussion

  10. [PLAN] Contribution: a dynamics model with a Bayes update for its posterior distribution, and optimal Bayesian exploration based on the recursion: expected cumulative info gain for tau steps if performing this action = expected one-step info gain + expected cumulative info gain for tau-1 steps if performing the next action.

  11. [PLAN] Quantify "surprise" with info gain. [Formula slide: the information gain of an observation, measured on the posterior over the dynamics model.]
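
The formula on this slide did not survive extraction; as a hedged restatement in our own notation (not necessarily the paper's symbols), the surprise of observing next state s' after action a given history h is the amount the belief over the dynamics parameters moves:

```latex
% Our notation (an assumption, not the slide's original symbols):
% h = history, a = action, s' = observed next state, \theta = dynamics-model parameters.
I(h, a, s') \;=\; D_{\mathrm{KL}}\!\big(\, p(\theta \mid h, a, s') \;\big\|\; p(\theta \mid h) \,\big)
```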

  12. [PLAN] 1-step expected information gain ("expected immediate info gain"): the mutual information between the next-state distribution and the model parameters. NOTE: VIME uses this as the intrinsic reward!
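
In the same (assumed) notation, the 1-step expected info gain is the expectation of the above under the model's own next-state prediction, which is exactly a mutual information, matching the slide's phrasing:

```latex
\mathbb{E}_{s' \sim p(s' \mid h, a)}\big[\, I(h, a, s') \,\big]
\;=\; \mathbb{E}_{s'}\!\left[ D_{\mathrm{KL}}\!\big(p(\theta \mid h, a, s') \,\big\|\, p(\theta \mid h)\big) \right]
\;=\; \mathcal{I}\big(S' ;\, \Theta \,\big|\, h, a\big)
```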

  13. [PLAN] "Planning to be surprised": the cumulative tau-step info gain defines a curious Q-value: perform an action, then follow a policy for the remaining steps ("planning tau steps ahead"); the expectation is taken under the model because those future states are not actually observed yet.
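
A hedged sketch of that curious Q-value in our symbols (q_tau is the expected cumulative tau-step info gain):

```latex
q_\tau(h, a) \;=\;
\mathbb{E}_{s' \sim p(s' \mid h, a)}
\Big[\, I(h, a, s') \;+\; q_{\tau-1}\big(h a s',\, \pi(h a s')\big) \,\Big],
\qquad q_0 \equiv 0
```

Acting greedily replaces the policy term with max over a' of q_{tau-1}(h a s', a'), which is the recursion slide 10 states in words.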

  14. [PLAN] Optimal Bayesian exploration policy. [Method 1] Compute the optimal curiosity Q-value backwards for tau steps. [Method 2] Policy iteration: repeatedly apply policy evaluation and policy improvement. A toy sketch of Method 1 follows.
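
As a toy illustration of Method 1 (not the paper's implementation), the sketch below computes the tau-step curiosity value by depth-limited recursion in a small tabular MDP with an independent Dirichlet posterior over each transition row; all names (`dirichlet_kl`, `curious_q`, the dict layout of `alpha`) are our own.

```python
# Toy sketch, assuming a tabular MDP with a Dirichlet posterior over each row
# p(. | s, a) of the transition model.  The info gain of observing s' is the
# KL from the updated Dirichlet row back to the current one.
import numpy as np
from scipy.special import gammaln, digamma

def dirichlet_kl(alpha_new, alpha_old):
    """KL( Dirichlet(alpha_new) || Dirichlet(alpha_old) )."""
    a0_new, a0_old = alpha_new.sum(), alpha_old.sum()
    return (gammaln(a0_new) - gammaln(a0_old)
            - np.sum(gammaln(alpha_new) - gammaln(alpha_old))
            + np.sum((alpha_new - alpha_old) * (digamma(alpha_new) - digamma(a0_new))))

def curious_q(alpha, s, a, tau):
    """Expected cumulative info gain of taking `a` in `s`, then acting greedily
    w.r.t. curiosity for the remaining tau-1 steps.
    `alpha[s][a]` is the Dirichlet parameter vector over next states."""
    if tau == 0:
        return 0.0
    row = alpha[s][a]
    predictive = row / row.sum()        # posterior-predictive p(s' | s, a)
    total = 0.0
    for s_next, p in enumerate(predictive):
        updated = row.copy()
        updated[s_next] += 1.0          # Bayes update after (hypothetically) observing s'
        gain = dirichlet_kl(updated, row)
        alpha_next = {st: {ac: v.copy() for ac, v in acts.items()} for st, acts in alpha.items()}
        alpha_next[s][a] = updated
        future = max(curious_q(alpha_next, s_next, a2, tau - 1) for a2 in alpha_next[s_next])
        total += p * (gain + future)
    return total

# Usage on a 3-state, 2-action toy problem with a uniform Dirichlet(1) prior:
n_states, n_actions = 3, 2
alpha = {s: {a: np.ones(n_states) for a in range(n_actions)} for s in range(n_states)}
print([curious_q(alpha, s=0, a=a, tau=3) for a in range(n_actions)])
```

Method 2 replaces this exhaustive lookahead with policy iteration (alternating policy evaluation and improvement) over the same curiosity values.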

  15. [PLAN] Non-triviality of the curious Q-value: the cumulative information gain fluctuates, and info gain is only additive in expectation, so the cumulative gain != the sum of individual gains.

  16. [PLAN] Results (an environment with 50 states). Policies compared: random; greedy w.r.t. the expected one-step info gain; Q-learning using the one-step info gain; policy iteration (a dynamic-programming approximation to optimal Bayesian exploration).

  17. [Plan] Results

  18. Outline: Motivation, Related Works and Demo; Planning to Be Surprised; Variational Information Maximizing Exploration; Unifying Count-Based Exploration and Intrinsic Motivation; Comparisons and Discussion

  19. [VIME] Contribution: a dynamics model, variational inference for the posterior distribution over the dynamics model, and a 1-step exploration bonus.

  20. [VIME] Quantify the information gained. Reminder: PLAN's cumulative info gain.

  21. [VIME] Variational Bayes. What's hard? Computing the posterior for highly parameterized models (e.g. neural networks). Approximate the posterior by minimizing its divergence from the true posterior, i.e. minimize the negative ELBO.
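
In symbols (our notation; q(theta; phi) is the variational posterior over the dynamics-model weights and D the transitions seen so far), minimizing the KL from q to the true posterior is equivalent to minimizing the negative ELBO:

```latex
-\mathcal{L}(\phi)
\;=\; D_{\mathrm{KL}}\!\big(q(\theta; \phi) \,\big\|\, p(\theta)\big)
\;-\; \mathbb{E}_{\theta \sim q(\cdot;\, \phi)}\big[\log p(\mathcal{D} \mid \theta)\big]
```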

  22. [VIME] Optimization for variational Bayes. How to minimize the negative ELBO? Take an efficient single second-order (Newton) update step. [Formula slide: the Newton update on the variational parameters.]

  23. [VIME] Estimate the 1-step expected info gain. What's hard? Computing the exact one-step expected info gain with high-dimensional states → use Monte-Carlo estimation.
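
A minimal sketch of the idea, assuming a fully factorized Gaussian variational posterior over the dynamics-model weights: the info gain attributed to one sampled transition is the closed-form KL between the posterior after and before updating on it, and averaging such samples gives a Monte-Carlo estimate of the expected 1-step info gain. `update_posterior` is a hypothetical callback (e.g. a few gradient steps on the negative ELBO), not a VIME API.

```python
# Hedged sketch, not the VIME reference implementation.
import numpy as np

def gaussian_kl(mu_new, sigma_new, mu_old, sigma_old):
    """KL( N(mu_new, diag(sigma_new^2)) || N(mu_old, diag(sigma_old^2)) ), closed form."""
    var_new, var_old = sigma_new ** 2, sigma_old ** 2
    return 0.5 * np.sum(np.log(var_old / var_new)
                        + (var_new + (mu_new - mu_old) ** 2) / var_old
                        - 1.0)

def intrinsic_reward(update_posterior, mu, sigma, transition):
    """One Monte-Carlo sample of the 1-step info gain: update the variational
    posterior on a single observed transition and measure how far it moved."""
    mu_new, sigma_new = update_posterior(mu, sigma, transition)  # hypothetical update routine
    return gaussian_kl(mu_new, sigma_new, mu, sigma)
```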

  24. [VIME] Results (Walker-2D, dense reward). RL algorithm: TRPO. [Plot: average extrinsic return.]

  25. [VIME] Results (Swimmer-Gather, sparse reward). RL algorithm: TRPO. [Plot: average extrinsic return.]

  26. Outline: Motivation, Related Works and Demo; Planning to Be Surprised; Variational Information Maximizing Exploration; Unifying Count-Based Exploration and Intrinsic Motivation; Comparisons and Discussion

  27. [CTS] Contribution: states → density model → pseudo-count → 1-step exploration bonus.
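
In symbols, the pipeline on this slide (the exact bonus form, including the 0.01 smoothing constant, is as we recall it from the paper and should be treated as an assumption):

```latex
x \;\longmapsto\; \rho_n(x) \;\longmapsto\; \hat{N}_n(x) \;\longmapsto\;
r^{+}(x) \;=\; \beta \,\big(\hat{N}_n(x) + 0.01\big)^{-1/2}
```

with rho_n the state density model, N-hat_n(x) the pseudo-count of x after n observations, and beta a bonus scale.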

  28. [CTS] Count state visitation: empirical count and empirical distribution. [Figure: two frames that differ only by a small pixel difference.] These are two different states, but we want to increment the visitation count for both when visiting either one.

  29. [CTS] Introduce a state density model. [Figure: the density model assigns a probability to each state x among s1, s2, ...]

  30. How to update the CTS density model? Check the "context tree switching" paper (https://arxiv.org/abs/1111.3182). This was the difficult part of reading this paper, as it only shows a Bayes-rule update for a mixture of density models (e.g. CTS). Remark: for the PixelCNN density model in "Count-Based Exploration with Neural Density Models", just backprop.

  31. [CTS] Derive the pseudo-count from the density model: two constraints give a linear system; solving it yields the pseudo-count.
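
The two constraints and their solution, written out (our rendering of the derivation the slide refers to; rho_n(x) is the density model's probability of x, and rho'_n(x) its "recoding" probability after observing x once more):

```latex
\rho_n(x) = \frac{\hat{N}_n(x)}{\hat{n}}, \qquad
\rho'_n(x) = \frac{\hat{N}_n(x) + 1}{\hat{n} + 1}
\;\;\Longrightarrow\;\;
\hat{N}_n(x) = \frac{\rho_n(x)\,\big(1 - \rho'_n(x)\big)}{\rho'_n(x) - \rho_n(x)}
```

where N-hat_n(x) is the pseudo-count and n-hat the pseudo-count total.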

  32. [CTS] Results (Montezuma’s Revenge) State: 84x84x4 # Actions: 18 RL algorithm: Double DQN

  33. Outline: Motivation, Related Works and Demo; Planning to Be Surprised; Variational Information Maximizing Exploration; Unifying Count-Based Exploration and Intrinsic Motivation; Summary, Comparisons and Discussion

  34. Deriving the posterior dynamics model / density model: PLAN → Bayes rule; VIME → variational inference; CTS → Bayes rule (for the mixture density model).

  35. Deriving the exploratory policy: the policy is trained with the reward augmented by an intrinsic reward, [VIME] the 1-step information gain, [CTS] the pseudo-count bonus; [PLAN] instead directly takes argmax(curiosity Q). See the sketch below.
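
A minimal sketch of what "policy trained with the reward augmented by intrinsic reward" looks like in practice; `env`, `agent`, `bonus_fn`, and the coefficient `eta` are placeholders we introduce for illustration, not any paper's API. PLAN is the exception: it acts greedily on its curiosity Q-value instead of augmenting the reward.

```python
def collect_episode(env, agent, bonus_fn, eta=0.1):
    """Roll out one episode, adding a scaled intrinsic bonus to each reward.
    `bonus_fn` would be the 1-step info gain (VIME) or the pseudo-count
    bonus (CTS); the augmented transitions are then fed to TRPO / Double DQN."""
    transitions = []
    state, done = env.reset(), False
    while not done:
        action = agent.act(state)
        next_state, extrinsic_r, done = env.step(action)
        r = extrinsic_r + eta * bonus_fn(state, action, next_state)
        transitions.append((state, action, r, next_state, done))
        state = next_state
    return transitions
```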

  36. Pseudo-count vs. intrinsic motivation: for a mixture density model, the pseudo-count can be connected to an information-gain quantity, hence the title "Unifying Count-Based Exploration and Intrinsic Motivation"!

  37. Limitations & future directions. PLAN → intractable posterior, and the expectation is taken under the dynamics model; difficult to scale outside tabular RL. VIME → currently maximizes the sum of 1-step info gains. CTS → which density model leads to better generalization over states? Also: the learning rate of the policy network vs. the rate of updating the dynamics/density model.

  38. Thank you! (Appendix)

  39. Our derivation of "additive in expectation" (where the history h'' contains h').
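
A hedged rendering of that derivation in our notation, for nested histories h, h', h'' (h'' contains h', which contains h): splitting the log-ratio and using the martingale property of the posterior, E over h'' given h' of p(theta | h'') = p(theta | h'), the cumulative info gain is additive once expectations are taken:

```latex
\mathbb{E}_{h'' \mid h}\!\left[ D_{\mathrm{KL}}\!\big(p(\theta \mid h'') \,\|\, p(\theta \mid h)\big) \right]
=
\mathbb{E}_{h'' \mid h}\!\left[ D_{\mathrm{KL}}\!\big(p(\theta \mid h'') \,\|\, p(\theta \mid h')\big) \right]
+
\mathbb{E}_{h' \mid h}\!\left[ D_{\mathrm{KL}}\!\big(p(\theta \mid h') \,\|\, p(\theta \mid h)\big) \right]
```

since log[p(theta|h'')/p(theta|h)] = log[p(theta|h'')/p(theta|h')] + log[p(theta|h')/p(theta|h)], and the cross term's expectation over h'' given h' collapses to the second KL term.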
