Model-Based Active Exploration


  1. Model-Based Active Exploration
      Pranav Shyam, Wojciech Jaskowski, Faustino Gomez
      arxiv.org/abs/1810.12162
      Presentation by Danijar Hafner

  2. Reinforcement Learning
      [Diagram: a learning agent (objective, algorithm) receives sensor input from an unknown environment and sends motor output back to it.]

  3. Reinforcement Learning vs. Intrinsic Motivation
      [Two side-by-side agent-environment diagrams with the same components (objective, sensor input, algorithm, motor output, unknown environment, learning agent), contrasting standard reinforcement learning with intrinsic motivation.]

  4. Many Intrinsic Objectives
      Information gain, e.g. Lindley 1956, Sun 2011, Houthooft 2017
      Prediction error, e.g. Schmidhuber 1991, Bellemare 2016, Pathak 2017
      Empowerment, e.g. Klyubin 2005, Tishby 2011, Gregor 2016
      Skill discovery, e.g. Eysenbach 2018, Sharma 2020, Co-Reyes 2018
      Surprise minimization, e.g. Schrödinger 1944, Friston 2013, Berseth 2020
      Bayes-adaptive RL, e.g. Gittins 1979, Duff 2002, Ross 2007

  5. Information Gain
      Without rewards, the agent can only learn about the environment.

  6. Information Gain
      Without rewards, the agent can only learn about the environment.
      A model W represents our knowledge, e.g. input density or forward prediction.

  7. Information Gain
      Without rewards, the agent can only learn about the environment.
      A model W represents our knowledge, e.g. input density or forward prediction.
      Need to represent uncertainty about W to tell how much we have learned: p(W).

  8. Information Gain
      Without rewards, the agent can only learn about the environment.
      A model W represents our knowledge, e.g. input density or forward prediction.
      Need to represent uncertainty about W to tell how much we have learned: p(W), which data collection updates to p(W | X).

  9. Information Gain
      Without rewards, the agent can only learn about the environment.
      A model W represents our knowledge, e.g. input density or forward prediction.
      Need to represent uncertainty about W to tell how much we have learned: p(W), which data collection updates to p(W | X).
      To gain the most information, we aim to maximize the mutual information between future sensory inputs X and model parameters W (both X and W are random variables): max_a I(X; W | A=a)

  10. Information Gain
      Without rewards, the agent can only learn about the environment.
      A model W represents our knowledge, e.g. input density or forward prediction.
      Need to represent uncertainty about W to tell how much we have learned: p(W), which data collection updates to p(W | X).
      To gain the most information, we aim to maximize the mutual information between future sensory inputs X and model parameters W (both X and W are random variables): max_a I(X; W | A=a) = ?

  11. Retrospective Infogain vs. Expected Infogain

      Retrospective Infogain (e.g. VIME, ICM, RND): KL[ p(W | X, A=a) || p(W | A=a) ]
      - Collect episodes, train the world model, record the improvement, and reward the controller by this improvement.
      - Infogain depends on the agent's knowledge, which keeps changing, making it a non-stationary objective.
      - The learned controller will lag behind and go to states that were previously novel but are not anymore.

      Expected Infogain (e.g. MAX, PETS-ET, LD): I(X; W | A=a)
      - Need to search for actions that will lead to high information gain without additional environment interaction.
      - Learn a forward model of the environment to search for actions by planning or learning in imagination.
      - Computing the expected information gain requires computing entropies of a model with uncertainty estimates.
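The two columns target the same quantity seen from different sides: averaging the retrospective KL over the data an action could produce recovers the expected information gain. As a bridging note (standard identities, not quoted from the slides):

```latex
I(X; W \mid A=a)
  = \mathbb{E}_{p(x \mid a)}\!\left[ \mathrm{KL}\!\left( p(W \mid X=x, A=a) \,\|\, p(W \mid A=a) \right) \right]
  = H(X \mid A=a) - H(X \mid W, A=a)
```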

  12. Retrospective Novelty Episode 1 Everything unknown

  13. Retrospective Novelty Episode 1 Random behavior

  14. Retrospective Novelty Episode 1 High novelty

  15. Retrospective Novelty Episode 1 Reinforce behavior

  16. Retrospective Novelty Episode 2 Repeat behavior

  17. Retrospective Novelty Episode 2 Reach similar states

  18. Retrospective Novelty Episode 2 Not surprising anymore :(

  19. Retrospective Novelty Episode 2 Unlearn behavior

  20. Retrospective Novelty Episode 3 Repeat behavior

  21. Retrospective Novelty Episode 3 Repeat behavior

  22. Retrospective Novelty Episode 3 Still not novel

  23. Retrospective Novelty Episode 3 Unlearn behavior

  24. Retrospective Novelty The agent builds a map of where it was already and avoids those states. Episode 4 Back to random behavior
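To make the failure mode above concrete, here is a minimal toy sketch of a retrospective novelty reward (my own simplification in plain NumPy, not the paper's code): the intrinsic reward for a transition is the forward model's prediction error on data that was already collected, and the model is updated afterwards, so revisiting the same states pays less and less and the behavior gets unlearned.

```python
import numpy as np

class LinearForwardModel:
    """Tiny linear next-state predictor trained by gradient descent (toy)."""

    def __init__(self, state_dim, action_dim, lr=0.3, seed=0):
        rng = np.random.default_rng(seed)
        self.W = 0.01 * rng.standard_normal((state_dim + action_dim, state_dim))
        self.lr = lr

    def predict(self, state, action):
        return np.concatenate([state, action]) @ self.W

    def update(self, state, action, next_state):
        x = np.concatenate([state, action])
        # Gradient of 0.5 * ||x @ W - next_state||^2 with respect to W.
        self.W -= self.lr * np.outer(x, self.predict(state, action) - next_state)

def retrospective_reward(model, state, action, next_state):
    # Reward is the prediction error on a transition that already happened;
    # the model is then updated, so the same transition pays less next time,
    # which is exactly the non-stationarity shown in the episodes above.
    error = float(np.mean((model.predict(state, action) - next_state) ** 2))
    model.update(state, action, next_state)
    return error

# Revisiting the same state-action pair yields a shrinking intrinsic reward.
model = LinearForwardModel(state_dim=2, action_dim=1)
s, a, s_next = np.zeros(2), np.ones(1), np.array([0.5, -0.5])
print([round(retrospective_reward(model, s, a, s_next), 4) for _ in range(4)])
```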

  25. Expected Novelty Episode 1 Everything unknown

  26. Expected Novelty Episode 1 Consider options

  27. Expected Novelty Episode 1 Execute plan

  28. Expected Novelty Episode 1 Observe new data

  29. Expected Novelty Episode 2 Consider options

  30. Expected Novelty Episode 2 Execute plan

  31. Expected Novelty Episode 2 Observe new data
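One way to read the "consider options, execute plan" steps above is as planning in the learned model. The sketch below is a deliberately simple random-shooting planner (an assumption for illustration, not necessarily the paper's planner); `rollout_score` is a hypothetical stand-in for scoring an imagined rollout by its expected novelty, e.g. the expected infogain developed in the following slides.

```python
import numpy as np

def plan_by_random_shooting(rollout_score, action_dim, horizon=10,
                            num_candidates=256, seed=0):
    rng = np.random.default_rng(seed)
    # Candidate plans: (num_candidates, horizon, action_dim) actions in [-1, 1].
    candidates = rng.uniform(-1.0, 1.0, size=(num_candidates, horizon, action_dim))
    # Score each plan purely in imagination; no environment interaction here.
    scores = np.array([rollout_score(seq) for seq in candidates])
    return candidates[np.argmax(scores)]

# Usage with a dummy, hypothetical score: prefer plans with large first actions.
best_plan = plan_by_random_shooting(lambda seq: float(np.abs(seq[0]).sum()), action_dim=2)
```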

  32. Ensemble of Dynamics Models
      Learn dynamics both to represent knowledge and to plan for expected infogain.

  33. Ensemble of Dynamics Models
      Learn dynamics both to represent knowledge and to plan for expected infogain.
      Capture uncertainty as an ensemble of non-linear Gaussian predictors.

  34. Ensemble of Dynamics Models
      Learn dynamics both to represent knowledge and to plan for expected infogain.
      Capture uncertainty as an ensemble of non-linear Gaussian predictors.
      I(X; W | A=a) = H(X | A=a) − H(X | W, A=a)
      (epistemic uncertainty = total uncertainty − aleatoric uncertainty)

  35. Ensemble of Dynamics Models
      Learn dynamics both to represent knowledge and to plan for expected infogain.
      Capture uncertainty as an ensemble of non-linear Gaussian predictors.
      I(X; W | A=a) = H(X | A=a) − H(X | W, A=a)
      (epistemic uncertainty = total uncertainty − aleatoric uncertainty)
      Information gain targets uncertain trajectories with low expected noise.

  36. Ensemble of Dynamics Models
      Learn dynamics both to represent knowledge and to plan for expected infogain.
      Capture uncertainty as an ensemble of non-linear Gaussian predictors.
      I(X; W | A=a) = H(X | A=a) − H(X | W, A=a)
      (epistemic uncertainty = total uncertainty − aleatoric uncertainty)
      Information gain targets uncertain trajectories with low expected noise.
      Wide predictions mean high expected noise; overlapping modes mean less total uncertainty.

  37. Ensemble of Dynamics Models
      Learn dynamics both to represent knowledge and to plan for expected infogain.
      Capture uncertainty as an ensemble of non-linear Gaussian predictors.
      I(X; W | A=a) = H(X | A=a) − H(X | W, A=a)
      (epistemic uncertainty = total uncertainty − aleatoric uncertainty)
      Information gain targets uncertain trajectories with low expected noise.
      Wide predictions mean high expected noise; narrow predictions mean low expected noise.
      Overlapping modes mean less total uncertainty; distant modes mean large total uncertainty.
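A minimal sketch of such an ensemble (a toy linear-Gaussian architecture assumed for illustration; the paper uses non-linear predictors): K independently initialized members, each mapping a state-action pair to the mean and variance of a diagonal Gaussian over the next state. Spread between member means reflects epistemic uncertainty, while each member's predicted variance reflects aleatoric noise.

```python
import numpy as np

class GaussianDynamicsEnsemble:
    def __init__(self, state_dim, action_dim, ensemble_size=5, seed=0):
        rng = np.random.default_rng(seed)
        in_dim = state_dim + action_dim
        # One weight matrix per member for the mean, plus a per-member log-variance.
        self.W = 0.1 * rng.standard_normal((ensemble_size, in_dim, state_dim))
        self.log_var = np.zeros((ensemble_size, state_dim))

    def predict(self, state, action):
        """Return per-member means (K, D) and variances (K, D) for one (s, a) pair."""
        x = np.concatenate([state, action])
        means = np.einsum('i,kij->kj', x, self.W)
        return means, np.exp(self.log_var)
```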

  38. Expected Infogain Approximation
      I(X; W | A=a) = H(X | A=a) − H(X | W, A=a)
      (epistemic uncertainty = total uncertainty − aleatoric uncertainty)

  39. Expected Infogain Approximation
      I(X; W | A=a) = H(X | A=a) − H(X | W, A=a)
      (epistemic uncertainty = total uncertainty − aleatoric uncertainty)
      Ensemble members: p(X | W=w_k, A=a)

  40. Expected Infogain Approximation
      I(X; W | A=a) = H(X | A=a) − H(X | W, A=a)
      (epistemic uncertainty = total uncertainty − aleatoric uncertainty)
      Ensemble members: p(X | W=w_k, A=a)
      Aggregate prediction: p(X | A=a) = 1/K Σ_k p(X | W=w_k, A=a)

  41. Expected Infogain Approximation
      I(X; W | A=a) = H(X | A=a) − H(X | W, A=a)
      (epistemic uncertainty = total uncertainty − aleatoric uncertainty)
      Ensemble members: p(X | W=w_k, A=a)
      Aggregate prediction: p(X | A=a) = 1/K Σ_k p(X | W=w_k, A=a)
      Aleatoric uncertainty:

  42. Expected Infogain Approximation
      I(X; W | A=a) = H(X | A=a) − H(X | W, A=a)
      (epistemic uncertainty = total uncertainty − aleatoric uncertainty)
      Ensemble members: p(X | W=w_k, A=a)
      Aggregate prediction: p(X | A=a) = 1/K Σ_k p(X | W=w_k, A=a)
      Aleatoric uncertainty: ?

  43. Expected Infogain Approximation
      I(X; W | A=a) = H(X | A=a) − H(X | W, A=a)
      (epistemic uncertainty = total uncertainty − aleatoric uncertainty)
      Ensemble members: p(X | W=w_k, A=a)
      Aggregate prediction: p(X | A=a) = 1/K Σ_k p(X | W=w_k, A=a)
      Aleatoric uncertainty: 1/K Σ_k H( p(X | W=w_k, A=a) )

  44. Expected Infogain Approximation
      I(X; W | A=a) = H(X | A=a) − H(X | W, A=a)
      (epistemic uncertainty = total uncertainty − aleatoric uncertainty)
      Ensemble members: p(X | W=w_k, A=a)
      Aggregate prediction: p(X | A=a) = 1/K Σ_k p(X | W=w_k, A=a)
      Aleatoric uncertainty: 1/K Σ_k H( p(X | W=w_k, A=a) )
      Total uncertainty:

  45. Expected Infogain Approximation
      I(X; W | A=a) = H(X | A=a) − H(X | W, A=a)
      (epistemic uncertainty = total uncertainty − aleatoric uncertainty)
      Ensemble members: p(X | W=w_k, A=a)
      Aggregate prediction: p(X | A=a) = 1/K Σ_k p(X | W=w_k, A=a)
      Aleatoric uncertainty: 1/K Σ_k H( p(X | W=w_k, A=a) )
      Total uncertainty: ?

  46. Expected Infogain Approximation
      I(X; W | A=a) = H(X | A=a) − H(X | W, A=a)
      (epistemic uncertainty = total uncertainty − aleatoric uncertainty)
      Ensemble members: p(X | W=w_k, A=a)
      Aggregate prediction: p(X | A=a) = 1/K Σ_k p(X | W=w_k, A=a)
      Aleatoric uncertainty: 1/K Σ_k H( p(X | W=w_k, A=a) )
      Total uncertainty: H( 1/K Σ_k p(X | W=w_k, A=a) )

  47. Expected Infogain Approximation
      I(X; W | A=a) = H(X | A=a) − H(X | W, A=a)
      (epistemic uncertainty = total uncertainty − aleatoric uncertainty)
      Ensemble members: p(X | W=w_k, A=a)
      Aggregate prediction: p(X | A=a) = 1/K Σ_k p(X | W=w_k, A=a)
      Aleatoric uncertainty: 1/K Σ_k H( p(X | W=w_k, A=a) )
      Total uncertainty: H( 1/K Σ_k p(X | W=w_k, A=a) )
      Gaussian entropy has a closed form, so the aleatoric uncertainty can be computed exactly. GMM entropy does not, so either sample it or switch to the Rényi entropy, which has a closed form.
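Putting the last slides together, a minimal numerical sketch (my own toy, using Monte Carlo samples for the mixture entropy rather than the closed-form Rényi alternative mentioned above): the aleatoric term is the average closed-form entropy of the Gaussian members, the total term is the sampled entropy of their uniform mixture, and the difference approximates the expected information gain.

```python
import numpy as np

def gaussian_entropy(var):
    """Closed-form entropy of a diagonal Gaussian with per-dimension variance `var`."""
    return 0.5 * np.sum(np.log(2.0 * np.pi * np.e * var), axis=-1)

def mixture_log_prob(x, means, variances):
    """Log density of a uniform mixture of K diagonal Gaussians, evaluated at points x."""
    diff = x[:, None, :] - means[None, :, :]                               # (N, K, D)
    log_comp = -0.5 * np.sum(diff ** 2 / variances + np.log(2.0 * np.pi * variances), axis=-1)
    return np.logaddexp.reduce(log_comp, axis=1) - np.log(means.shape[0])  # (N,)

def expected_infogain(means, variances, num_samples=2000, rng=None):
    """I(X; W | a) ~= H(mixture) - mean_k H(member_k), for one imagined prediction."""
    if rng is None:
        rng = np.random.default_rng(0)
    num_members, dim = means.shape
    # Aleatoric term: average closed-form entropy of the ensemble members.
    aleatoric = np.mean(gaussian_entropy(variances))
    # Total term: entropy of p(X | a) = 1/K sum_k p(X | w_k, a), which has no
    # closed form, so estimate it by sampling from the mixture.
    ks = rng.integers(0, num_members, size=num_samples)
    samples = means[ks] + np.sqrt(variances[ks]) * rng.standard_normal((num_samples, dim))
    total = -np.mean(mixture_log_prob(samples, means, variances))
    return total - aleatoric

# Distant, narrow member predictions: high infogain (close to log 3 here).
means = np.array([[0.0], [5.0], [10.0]])
variances = np.full((3, 1), 0.1)
print(expected_infogain(means, variances))
# Overlapping member predictions: infogain close to zero, regardless of noise.
print(expected_infogain(np.zeros((3, 1)), variances))
```

The toy check at the bottom matches the intuition from the earlier slides: distant, narrow modes give large information gain, while overlapping modes give roughly zero even when each member is noisy.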
