Planning to Be Surprised: Optimal Bayesian Exploration in Dynamic Environments
Yi Sun, Faustino Gomez, Jürgen Schmidhuber
IDSIA, USI & SUPSI, Switzerland
August 2011
Motivation

- An intelligent agent is sent to explore an unknown environment
- Learning through sequential interactions
- Limited time / resources
- Question: How should the agent choose its actions so that it learns the environment as effectively as possible?
- Example: learning the transition model of a Markovian environment using only 100 <s, a, s'> triples
Preliminary

A Markov Reward Process (MRP) is defined by the 4-tuple ⟨S, P, r, γ⟩:
- S = {1, …, S} is the state space
- P is an S × S transition matrix with {P}_{i,j} = Pr[s_{t+1} = j | s_t = i]
- r ∈ R^S is the reward function
- γ ∈ [0, 1) is the discount factor

The value function, v ∈ R^S, is the solution of the Bellman equation v = r + γPv. Let L = I − γP; then v = L⁻¹r.
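The closed form v = L⁻¹r can be checked numerically; the sketch below uses a hypothetical 2-state MRP (the numbers are illustrative, not from the slides):

```python
import numpy as np

# Hypothetical 2-state MRP (numbers chosen only for illustration).
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])   # transition matrix; rows sum to 1
r = np.array([1.0, 0.0])     # reward function
gamma = 0.9                  # discount factor

# L = I - gamma * P, so the value function is v = L^{-1} r.
L = np.eye(2) - gamma * P
v = np.linalg.solve(L, r)

# v satisfies the Bellman equation v = r + gamma * P v.
assert np.allclose(v, r + gamma * P @ v)
```

Solving the linear system with `np.linalg.solve` avoids forming the inverse explicitly, which is the standard numerically stable choice.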
Preliminary

Linear function approximation (LFA): v̂ = Φθ, where
- Φ = [φ_1, …, φ_N] are N (N ≪ S) basis functions
- θ = [θ_1, …, θ_N]^⊺ are the weights

The Bellman error ε ∈ R^S is defined as ε = r + γP v̂ − v̂ = r − LΦθ.
- ε ≡ 0 ⟺ v ≡ Φθ
- ε is the expectation of the TD error
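The two forms of the Bellman error, ε = r + γP v̂ − v̂ and ε = r − LΦθ, are algebraically identical; a quick check on a hypothetical toy MRP with a single constant basis function (all numbers illustrative):

```python
import numpy as np

# Hypothetical toy MRP (illustrative numbers only).
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
r = np.array([1.0, 0.0])
gamma = 0.9
L = np.eye(2) - gamma * P

# One basis function (N = 1 << S): a constant feature, with an arbitrary weight.
Phi = np.ones((2, 1))
theta = np.array([2.0])

v_hat = Phi @ theta                      # LFA estimate of the value function
eps = r + gamma * P @ v_hat - v_hat      # Bellman error, first form
assert np.allclose(eps, r - L @ Phi @ theta)  # equals the second form
```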
Preliminary

The LFA v̂ = Φθ depends on both θ and Φ.

To find θ:
- TD (Sutton, 1988), LSTD (Bradtke et al., 1996), etc.

To construct Φ:
- Bellman error basis functions (BEBFs; Wu and Givan, 2005; Keller et al., 2006; Parr et al., 2007; Mahadevan and Liu, 2010)
- Proto-value basis functions (Mahadevan et al., 2006)
- Reduced-rank predictive state representations (Boots and Gordon, 2010)
- L1-regularized feature selection (Kolter and Ng, 2009)
Bellman Error Basis Functions

Intuition: "Bellman error, loosely speaking, point[s] towards the optimal value function" (Parr et al., 2007)

Construction:
- φ^(1) = r
- At stage k > 1
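The construction just started (φ^(1) = r, then a rule for each stage k > 1) can be sketched as follows. This is an assumption-laden illustration, not the slides' algorithm: it works on a hypothetical known toy MRP (in practice the Bellman error is estimated from samples), and it fits θ by least squares on the Bellman residual rather than by TD/LSTD. At each stage the current Bellman error is appended as the next basis function, in the spirit of Parr et al. (2007):

```python
import numpy as np

# Hypothetical toy MRP (illustrative numbers only).
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
r = np.array([1.0, 0.0])
gamma = 0.9
L = np.eye(2) - gamma * P

bases = [r.copy()]                   # stage 1: phi^(1) = r
for k in range(3):                   # a few further stages
    Phi = np.stack(bases, axis=1)    # current basis matrix
    # Assumed weight estimator: least squares on r = L Phi theta
    # (the literature uses TD/LSTD here).
    theta, *_ = np.linalg.lstsq(L @ Phi, r, rcond=None)
    eps = r - L @ Phi @ theta        # Bellman error of the current fit
    if np.linalg.norm(eps) < 1e-10:
        break                        # value function is now representable
    bases.append(eps)                # next stage: add eps as a basis function
```

On this 2-state example the loop terminates once the basis spans the true value function, so the final Bellman error is (numerically) zero.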