Planning to Be Surprised: Optimal Bayesian Exploration in Dynamic Environments
Yi Sun, Faustino Gomez, Jürgen Schmidhuber
IDSIA, USI & SUPSI, Switzerland
August 2011
Motivation

- An intelligent agent is sent to explore an unknown environment
- Learning through sequential interactions
- Limited time / resources
- Question: How should the agent choose its actions so that it learns the environment as effectively as possible?
- Example: learning the transition model of a Markovian environment using only 100 <s, a, s'> triples
Preliminary

A Markov Reward Process (MRP) is defined by the 4-tuple ⟨S, P, r, γ⟩:
- S = {1, …, S} is the state space
- P is an S × S transition matrix with {P}_{i,j} = Pr[s_{t+1} = j | s_t = i]
- r ∈ R^S is the reward function
- γ ∈ [0, 1) is the discount factor

The value function, v ∈ R^S, is the solution of the Bellman equation v = r + γPv. Let L = I − γP; then v = L⁻¹r.
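The closed form v = L⁻¹r can be checked numerically; the sketch below uses a hypothetical 2-state MRP (the numbers are illustrative, not from the slides):

```python
import numpy as np

# Hypothetical 2-state MRP (numbers chosen only for illustration).
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])   # transition matrix; rows sum to 1
r = np.array([1.0, 0.0])     # reward function
gamma = 0.9                  # discount factor

# L = I - gamma * P, so the value function is v = L^{-1} r.
L = np.eye(2) - gamma * P
v = np.linalg.solve(L, r)

# v satisfies the Bellman equation v = r + gamma * P v.
assert np.allclose(v, r + gamma * P @ v)
```

Solving the linear system with `np.linalg.solve` avoids forming the inverse explicitly, which is the standard numerically stable choice.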
Preliminary

Linear function approximation (LFA): v̂ = Φθ, where
- Φ = [φ_1, …, φ_N] are N (N ≪ S) basis functions
- θ = [θ_1, …, θ_N]^⊺ are the weights

The Bellman error ε ∈ R^S is defined as ε = r + γP v̂ − v̂ = r − LΦθ.
- ε ≡ 0 ⟺ v ≡ Φθ
- ε is the expectation of the TD error
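The two forms of the Bellman error, ε = r + γP v̂ − v̂ and ε = r − LΦθ, are algebraically identical; a quick check on a hypothetical toy MRP with a single constant basis function (all numbers illustrative):

```python
import numpy as np

# Hypothetical toy MRP (illustrative numbers only).
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
r = np.array([1.0, 0.0])
gamma = 0.9
L = np.eye(2) - gamma * P

# One basis function (N = 1 << S): a constant feature, with an arbitrary weight.
Phi = np.ones((2, 1))
theta = np.array([2.0])

v_hat = Phi @ theta                      # LFA estimate of the value function
eps = r + gamma * P @ v_hat - v_hat      # Bellman error, first form
assert np.allclose(eps, r - L @ Phi @ theta)  # equals the second form
```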
Preliminary

The LFA v̂ = Φθ depends on both θ and Φ.

To find θ:
- TD (Sutton, 1988), LSTD (Bradtke et al., 1996), etc.

To construct Φ:
- Bellman error basis functions (BEBFs; Wu and Givan, 2005; Keller et al., 2006; Parr et al., 2007; Mahadevan and Liu, 2010)
- Proto-value basis functions (Mahadevan et al., 2006)
- Reduced-rank predictive state representations (Boots and Gordon, 2010)
- L1-regularized feature selection (Kolter and Ng, 2009)
Bellman Error Basis Functions

Intuition: "Bellman error, loosely speaking, point[s] towards the optimal value function" (Parr et al., 2007)

Construction:
- φ^(1) = r
- At stage k > 1
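The construction just started (φ^(1) = r, then a rule for each stage k > 1) can be sketched as follows. This is an assumption-laden illustration, not the slides' algorithm: it works on a hypothetical known toy MRP (in practice the Bellman error is estimated from samples), and it fits θ by least squares on the Bellman residual rather than by TD/LSTD. At each stage the current Bellman error is appended as the next basis function, in the spirit of Parr et al. (2007):

```python
import numpy as np

# Hypothetical toy MRP (illustrative numbers only).
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
r = np.array([1.0, 0.0])
gamma = 0.9
L = np.eye(2) - gamma * P

bases = [r.copy()]                   # stage 1: phi^(1) = r
for k in range(3):                   # a few further stages
    Phi = np.stack(bases, axis=1)    # current basis matrix
    # Assumed weight estimator: least squares on r = L Phi theta
    # (the literature uses TD/LSTD here).
    theta, *_ = np.linalg.lstsq(L @ Phi, r, rcond=None)
    eps = r - L @ Phi @ theta        # Bellman error of the current fit
    if np.linalg.norm(eps) < 1e-10:
        break                        # value function is now representable
    bases.append(eps)                # next stage: add eps as a basis function
```

On this 2-state example the loop terminates once the basis spans the true value function, so the final Bellman error is (numerically) zero.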