CSC2547 Presentation: Curiosity-driven exploration, count-based VS info-gain-based. Sheng Jia, Tinglin Duan (first-year master's students)
1. PLAN (2011)  2. VIME (NeurIPS 2016)  3. CTS (NeurIPS 2016)
Outline Motivation, Related Works and Demo Planning to Be Surprised Variational Information Maximizing Exploration Unifying Count-Based Exploration and Intrinsic Motivation Comparisons and Discussion
Background: RL + curiosity bonus. [Diagram: agent-environment loop; from the history the agent takes an action, observes the next state and extrinsic reward, and additionally receives an intrinsic reward / exploration bonus.]
What is exploration?
Intrinsic motivation: reducing the agent's uncertainty over the environment's dynamics. [PLAN] [VIME]
Count-based: use (pseudo) visitation counts to guide agents to unvisited states. [CTS]
Why is exploration useful? The sparse-reward problem. DEMO: our original plot & demo on Montezuma's Revenge, comparing DQN vs. DQN + exploration bonus. [3D plot: X-axis states s_1, s_2, s_3, ..., s_T; Y-axis training timestamp; Z-axis intrinsic reward.]
Related work (timeline):
1990-2010: Formal Theory of Creativity, Fun, and Intrinsic Motivation (the notion of intrinsic motivation).
2011: PLAN (Bayesian optimal exploration).
2015: Incentivizing Exploration in Reinforcement Learning with Deep Predictive Models (L2 prediction error using neural networks).
2016: VIME (approximate "PLAN" exploration) and CTS (pseudo-count introduced).
2017: Count-Based Exploration with Neural Density Models (pseudo-count + PixelCNN still achieves SOTA for Montezuma's Revenge).
2018: Exploration by Random Network Distillation (distillation error as a quantification of uncertainty).
2019: On Bonus-Based Exploration Methods in the Arcade Learning Environment.
Outline Motivation, Related Works and Demo Planning to Be Surprised Variational Information Maximizing Exploration Unifying Count-Based Exploration and Intrinsic Motivation Comparisons and Discussion
[PLAN] contribution: a dynamics model with a Bayes update for the posterior distribution of the dynamics model, and Optimal Bayesian Exploration based on the decomposition: expected cumulative info gain for tau steps = expected one-step info gain if performing this action + expected cumulative info gain for tau-1 steps if performing the next action.
[PLAN] Quantify "surprise" with information gain $\iota$: the KL divergence of the updated posterior over the dynamics model from the previous posterior, $\iota(h, s') = D_{\mathrm{KL}}\big(p(\theta \mid h s') \,\|\, p(\theta \mid h)\big)$.
[PLAN] 1-step expected information gain ("expected immediate info gain"): the mutual information between the next-state distribution and the model parameters. NOTE: VIME uses this as the intrinsic reward!
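A sketch of this quantity in our own notation, hedged since the slide's formula is not reproduced here (h the history, a the action, θ the dynamics-model parameters):

$$\mathbb{E}_{s' \sim p(\cdot \mid h, a)}\Big[ D_{\mathrm{KL}}\big(p(\theta \mid h, a, s') \,\|\, p(\theta \mid h)\big) \Big] \;=\; I(S' ;\, \Theta \mid h, a),$$

i.e. the expected KL from the current posterior to the updated posterior equals the mutual information between the next state and the model parameters.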
[PLAN] "Planning to be surprised": the cumulative tau-step info gain defines the curious Q-value: perform an action now, then follow a policy for the remaining steps ("planning tau steps"), taking expectations over future states because they are not actually observed yet.
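A hedged sketch of the recursion behind the curious Q-value, in our notation ($Q_0 \equiv 0$, $\iota$ the per-step info gain from the previous slide; under a fixed policy the max becomes an expectation over the policy's actions):

$$Q_\tau(h, a) \;=\; \mathbb{E}_{s' \sim p(\cdot \mid h, a)}\Big[\, \iota(h, a, s') \;+\; \max_{a'} Q_{\tau-1}(h a s', a') \,\Big].$$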
[PLAN] Optimal Bayesian Exploration policy. [Method 1] Compute the optimal curiosity Q-value backwards for tau steps. [Method 2] Policy iteration: repeatedly apply policy evaluation and policy improvement.
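A minimal runnable sketch of [Method 1] for tabular RL, assuming a Dirichlet posterior over transitions and, as a simplification, a posterior frozen during the tau-step lookahead (the paper plans over the updated beliefs); all function names are ours:

```python
import numpy as np
from scipy.special import digamma, gammaln

def dirichlet_kl(alpha_post, alpha_prior):
    """KL( Dir(alpha_post) || Dir(alpha_prior) )."""
    a0_post, a0_prior = alpha_post.sum(), alpha_prior.sum()
    return (gammaln(a0_post) - gammaln(a0_prior)
            - np.sum(gammaln(alpha_post) - gammaln(alpha_prior))
            + np.sum((alpha_post - alpha_prior) * (digamma(alpha_post) - digamma(a0_post))))

def expected_info_gain(alpha):
    """1-step expected info gain for one (s, a): E_{s'}[ KL(posterior after s' || current posterior) ]."""
    pred = alpha / alpha.sum()              # posterior predictive over next states
    eye = np.eye(len(alpha))
    gains = [dirichlet_kl(alpha + eye[s2], alpha) for s2 in range(len(alpha))]
    return float(pred @ np.array(gains))

def curious_q_backward(alpha, tau):
    """Backward induction of the tau-step curiosity Q-table.
    alpha: [S, A, S] Dirichlet counts of the transition posterior."""
    S, A, _ = alpha.shape
    gain = np.array([[expected_info_gain(alpha[s, a]) for a in range(A)] for s in range(S)])
    P = alpha / alpha.sum(axis=2, keepdims=True)   # posterior-mean transition model
    q = np.zeros((S, A))
    for _ in range(tau):                           # horizons 1, 2, ..., tau
        v = q.max(axis=1)                          # curious value of the shorter horizon
        q = gain + np.einsum('sat,t->sa', P, v)
    return q                                       # act greedily: a* = argmax_a q[s, a]
```

Less-visited (s, a) pairs have smaller Dirichlet counts, hence larger expected info gain, so the greedy curious policy is pulled toward them.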
[PLAN] Non-triviality of the curious Q-value: along a realized trajectory the cumulative information gain fluctuates, so the cumulative gain is not the sum of the individual gains; info gain is only additive in expectation (cumulative != sum).
[PLAN] Results: environment with 50 states. Compared policies: random; greedy w.r.t. expected one-step info gain; Q-learning using one-step info gain; policy iteration (dynamic programming approximation to optimal Bayesian exploration).
[Plan] Results
Outline Motivation, Related Works and Demo Planning to Be Surprised Variational Information Maximizing Exploration Unifying Count-Based Exploration and Intrinsic Motivation Comparisons and Discussion
[VIME] contribution: a dynamics model, variational inference for the posterior distribution of the dynamics model, and a 1-step exploration bonus.
[VIME] Quantify the information gained Reminder: PLAN cumulative info gain
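For concreteness, the augmented reward VIME optimizes, in our transcription (ξ_t the history up to time t, η a scaling hyperparameter):

$$r'(s_t, a_t, s_{t+1}) \;=\; r(s_t, a_t) \;+\; \eta\, D_{\mathrm{KL}}\big(p(\theta \mid \xi_t, a_t, s_{t+1}) \,\|\, p(\theta \mid \xi_t)\big),$$

i.e. the realized 1-step information gain, rather than PLAN's cumulative, planned-ahead quantity.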
[VIME] Variational Bayes. What's hard? Computing the exact posterior for highly parameterized models (e.g. neural networks). Approximate the posterior with $q(\theta;\phi)$ by minimizing $D_{\mathrm{KL}}\big(q(\theta;\phi)\,\|\,p(\theta\mid\mathcal{D})\big)$, i.e. minimizing the negative ELBO.
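Written out in standard form (D the observed transitions, p(θ) the prior), the objective is

$$-\mathrm{ELBO}(\phi) \;=\; -\,\mathbb{E}_{q(\theta;\phi)}\big[\log p(\mathcal{D} \mid \theta)\big] \;+\; D_{\mathrm{KL}}\big(q(\theta;\phi)\,\|\,p(\theta)\big),$$

which differs from $D_{\mathrm{KL}}\big(q(\theta;\phi)\,\|\,p(\theta\mid\mathcal{D})\big)$ only by the constant $\log p(\mathcal{D})$, so minimizing it tightens the approximation.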
[VIME] Optimization for variational Bayes. How to minimize the negative ELBO online? Take an efficient single second-order (Newton) update step per observed transition.
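A hedged sketch of that step in our notation, where ℓ is the per-transition negative-ELBO term, H its Hessian, and F_φ the Fisher information of q (both cheap and block-diagonal for a fully factorized Gaussian q):

$$\Delta\phi \;=\; H^{-1}\,\nabla_\phi \ell, \qquad D_{\mathrm{KL}}\big(q(\theta;\phi+\Delta\phi)\,\|\,q(\theta;\phi)\big) \;\approx\; \tfrac{1}{2}\,\Delta\phi^{\top} F_\phi\, \Delta\phi,$$

so both the Newton step and the resulting KL (the intrinsic reward) can be evaluated per transition without retraining the whole posterior.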
[VIME] Estimate the 1-step expected info gain. What's hard? Computing the exact one-step expected info gain: high-dimensional states → Monte Carlo estimation.
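A sketch of the estimator implied here, in our notation: sample candidate next states from the learned model's posterior predictive and average the induced KL terms,

$$\mathbb{E}_{s'}\big[D_{\mathrm{KL}}(\cdot)\big] \;\approx\; \frac{1}{N}\sum_{i=1}^{N} D_{\mathrm{KL}}\big(q(\theta;\phi + \Delta\phi^{(i)})\,\|\,q(\theta;\phi)\big), \qquad s'^{(i)} \sim p\big(s' \mid s, a;\, \theta \sim q\big),$$

where Δφ^{(i)} is the update induced by the i-th sampled next state; in practice a single sample, or simply the actually observed s_{t+1}, can be used.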
[VIME] Results (Walker-2D) Average extrinsic return Dense reward RL algorithm: TRPO
[VIME] Results (Swimmer-Gather) Average extrinsic return Sparse reward RL algorithm: TRPO
Outline Motivation, Related Works and Demo Planning to Be Surprised Variational Information Maximizing Exploration Unifying Count-Based Exploration and Intrinsic Motivation Comparisons and Discussion
[CTS] contribution: states → density model → pseudo-count → 1-step exploration bonus.
[CTS] Counting state visitation: empirical count and empirical distribution. Two frames that differ only by a small pixel difference are different states, but we want visiting either one to increment the visitation count of both.
[CTS] Introduce a state density model. [Figure: density $\rho(x)$ over states $x = s_1, s_2$.]
How to update the CTS density model? Check the "Context Tree Switching" paper: https://arxiv.org/abs/1111.3182. This was the difficulty of reading this paper, as it only shows a Bayes-rule update for a mixture of density models (e.g. CTS). Remark: for the PixelCNN density model in "Count-Based Exploration with Neural Density Models", the update is just backprop.
[CTS] Derive the pseudo-count from the density model: two constraints give a linear system; solving it yields the pseudo-count.
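A sketch of the derivation in our notation: let ρ_n(x) be the density model's probability of x after n states and ρ'_n(x) its "recoding probability", i.e. its value after one more (hypothetical) observation of x. The pseudo-count N̂_n(x) and pseudo-count total n̂ are required to satisfy

$$\rho_n(x) = \frac{\hat N_n(x)}{\hat n}, \qquad \rho'_n(x) = \frac{\hat N_n(x) + 1}{\hat n + 1}.$$

Solving this two-equation linear system for $\hat N_n(x)$ gives

$$\hat N_n(x) \;=\; \frac{\rho_n(x)\,\big(1 - \rho'_n(x)\big)}{\rho'_n(x) - \rho_n(x)}.$$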
[CTS] Results (Montezuma’s Revenge) State: 84x84x4 # Actions: 18 RL algorithm: Double DQN
Outline Motivation, Related Works and Demo Planning to Be Surprised Variational Information Maximizing Exploration Unifying Count-Based Exploration and Intrinsic Motivation Summary, Comparisons and Discussion
Deriving the posterior dynamics model / density model: PLAN uses the Bayes rule, VIME uses variational inference, CTS uses the Bayes rule (over the mixture of density models).
Deriving the exploratory policy: [VIME] and [CTS] train the policy with the reward augmented by an intrinsic bonus (1-step information gain for [VIME], pseudo-count bonus for [CTS]); [PLAN] directly takes argmax over the curiosity Q-values.
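For reference, the concrete bonus forms, with constants as we recall them from the papers (treat as a sketch):

$$\text{[VIME]: } r^{+}_t = \eta\, D_{\mathrm{KL}}\big(q(\theta;\phi_{t+1})\,\|\,q(\theta;\phi_t)\big), \qquad \text{[CTS]: } r^{+}_t = \beta\,\big(\hat N_n(x_t) + 0.01\big)^{-1/2},$$

while [PLAN] needs no bonus at all: it acts greedily with respect to the curiosity Q-value itself.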
Pseudo-count VS intrinsic motivation: through the (mixture) density model the two views connect, hence "Unifying Count-Based Exploration and Intrinsic Motivation"!
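One way to see the connection, derivable from the pseudo-count formula above when $\rho'_n(x)$ is small: with prediction gain $\mathrm{PG}_n(x) = \log \rho'_n(x) - \log \rho_n(x)$,

$$\hat N_n(x) \;\approx\; \big(e^{\mathrm{PG}_n(x)} - 1\big)^{-1},$$

so a count-based bonus is (approximately) a monotone transform of an information-gain-like quantity, which is the sense in which the two families are unified.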
Limitations & future directions:
PLAN → intractable posterior, and uses the dynamics model for expectations; difficult to scale outside tabular RL.
VIME → currently maximizes the sum of 1-step info gains rather than the cumulative gain.
CTS → which density model leads to better generalization over states?
General → learning rate of the policy network VS updating the dynamics model / density model.
Thank you! (Appendix)
Our derivation for "additive in expectation" (h'' contains h'):
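In our notation, with $h \subset h' \subset h''$:

$$D_{\mathrm{KL}}\big(p(\theta\mid h'')\,\|\,p(\theta\mid h)\big) \;=\; D_{\mathrm{KL}}\big(p(\theta\mid h'')\,\|\,p(\theta\mid h')\big) \;+\; \mathbb{E}_{\theta\sim p(\cdot\mid h'')}\!\left[\log\frac{p(\theta\mid h')}{p(\theta\mid h)}\right].$$

Taking the expectation over $h''$ given $h'$ and using $\mathbb{E}_{h''\mid h'}\big[p(\theta\mid h'')\big] = p(\theta\mid h')$ (posteriors form a martingale), the second term becomes $D_{\mathrm{KL}}\big(p(\theta\mid h')\,\|\,p(\theta\mid h)\big)$. So the expected cumulative gain decomposes into a sum of expected per-step gains, even though the realized gains are not additive pathwise.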