  1. Planning to Be Surprised: Optimal Bayesian Exploration in Dynamic Environments
     Yi Sun, Faustino Gomez, Jürgen Schmidhuber
     IDSIA, USI & SUPSI, Switzerland
     August 2011

  2. Motivation
     - An intelligent agent is sent to explore an unknown environment.
     - Learning through sequential interactions.
     - Limited time / resources.
     - Question: How should the agent choose its actions so that it learns the environment as effectively as possible?
     - Example: learning the transition model of a Markovian environment using only 100 ⟨s, a, s′⟩ triples (a toy sketch of this setup follows below).
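A minimal sketch of this estimation problem, assuming a tabular world and a Dirichlet(1, …, 1) prior over each row of the transition model; the prior, the sizes, and the random data are illustrative assumptions, not from the deck:

```python
import numpy as np

# Toy tabular setup: S states, A actions; a Dirichlet(1, ..., 1) prior over
# each row P[a, s, :] of the unknown transition model (all sizes and the
# prior are illustrative assumptions).
S, A = 5, 2
counts = np.ones((A, S, S))            # Dirichlet pseudo-counts

def update(s, a, s_next):
    """Record one observed <s, a, s'> triple."""
    counts[a, s, s_next] += 1

def posterior_mean():
    """Posterior-mean estimate of Pr[s' | s, a]."""
    return counts / counts.sum(axis=2, keepdims=True)

# Feed in 100 triples (random here, purely for illustration).
rng = np.random.default_rng(0)
for _ in range(100):
    update(rng.integers(S), rng.integers(A), rng.integers(S))
P_hat = posterior_mean()               # each row P_hat[a, s] sums to 1
```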

  3. Preliminary
     A Markov Reward Process (MRP) is defined by the 4-tuple ⟨S, P, r, γ⟩:
     - S = {1, . . . , S} is the state space.
     - P is an S × S transition matrix with {P}_{i,j} = Pr[s_{t+1} = j | s_t = i].
     - r ∈ R^S is the reward function.
     - γ ∈ [0, 1) is the discount factor.
     The value function v ∈ R^S is the solution of the Bellman equation v = r + γPv. Letting L = I − γP, we have v = L⁻¹r.
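A minimal numpy sketch of this computation, with a toy MRP invented for illustration:

```python
import numpy as np

# Solve the Bellman equation v = r + γPv exactly via v = L⁻¹r, L = I − γP.
# The MRP below (P, r) is random, purely for illustration.
S, gamma = 4, 0.9
rng = np.random.default_rng(1)
P = rng.random((S, S))
P /= P.sum(axis=1, keepdims=True)      # make each row a distribution
r = rng.standard_normal(S)

L = np.eye(S) - gamma * P
v = np.linalg.solve(L, r)              # preferred over forming L⁻¹ explicitly

assert np.allclose(v, r + gamma * P @ v)   # Bellman equation holds
```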

  4. Preliminary
     Linear function approximation (LFA): v̂ = Φθ, where
     - Φ = [φ_1, . . . , φ_N] are N (N ≪ S) basis functions,
     - θ = [θ_1, . . . , θ_N]⊺ are the weights.
     The Bellman error ε ∈ R^S is defined as ε = r + γP v̂ − v̂ = r − LΦθ.
     - ε ≡ 0 ⟺ v ≡ Φθ.
     - ε is the expectation of the TD error.
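A short sketch of the Bellman error computation, with Φ and θ invented for illustration; the final check verifies the identity r + γP v̂ − v̂ = r − LΦθ:

```python
import numpy as np

# Bellman error of a linear value estimate v̂ = Φθ (toy MRP and features).
S, N, gamma = 4, 2, 0.9
rng = np.random.default_rng(1)
P = rng.random((S, S)); P /= P.sum(axis=1, keepdims=True)
r = rng.standard_normal(S)
L = np.eye(S) - gamma * P

Phi = rng.random((S, N))               # N ≪ S basis functions as columns
theta = rng.standard_normal(N)
v_hat = Phi @ theta

eps = r + gamma * P @ v_hat - v_hat    # ε = r + γP v̂ − v̂
assert np.allclose(eps, r - L @ (Phi @ theta))   # same as ε = r − LΦθ
```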

  5. Preliminary
     The LFA v̂ = Φθ depends on both θ and Φ.
     To find θ:
     - TD (Sutton, 1988), LSTD (Bradtke and Barto, 1996), etc. (an LSTD sketch follows below the list).
     To construct Φ:
     - Bellman error basis functions (BEBFs; Wu and Givan, 2005; Keller et al., 2006; Parr et al., 2007; Mahadevan and Liu, 2010)
     - Proto-value basis functions (Mahadevan et al., 2006)
     - Reduced-rank predictive state representations (Boots and Gordon, 2010)
     - L1-regularized feature selection (Kolter and Ng, 2009)
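A minimal sketch of the sample-based LSTD estimator (Bradtke and Barto, 1996): accumulate A = Σ_t φ(s_t)(φ(s_t) − γφ(s_{t+1}))⊺ and b = Σ_t φ(s_t) r(s_t) along a trajectory, then solve Aθ = b. The toy MRP and features are invented for illustration:

```python
import numpy as np

# LSTD on one long trajectory of a toy MRP with random features.
S, N, gamma = 4, 2, 0.9
rng = np.random.default_rng(2)
Phi = rng.random((S, N))                       # feature vector per state
P = rng.random((S, S)); P /= P.sum(axis=1, keepdims=True)
r = rng.standard_normal(S)

A, b, s = np.zeros((N, N)), np.zeros(N), 0
for _ in range(5000):
    s_next = rng.choice(S, p=P[s])             # sample a transition
    A += np.outer(Phi[s], Phi[s] - gamma * Phi[s_next])
    b += Phi[s] * r[s]
    s = s_next

theta = np.linalg.solve(A, b)                  # LSTD weights
v_hat = Phi @ theta                            # approximate value function
```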

  6. Bellman Error Basis Functions
     Intuition: "Bellman error, loosely speaking, point[s] towards the optimal value function" (Parr et al., 2007).
     Construction:
     - φ^(1) = r.
     - At stage k > 1, add the current Bellman error ε = r − LΦθ as the new basis function φ^(k).
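A sketch of the BEBF loop under these definitions. For simplicity, θ is fit here by Bellman-residual least squares (min_θ ‖r − LΦθ‖), one reasonable stand-in for the TD/LSTD estimators on the previous slide; the toy MRP is invented for illustration:

```python
import numpy as np

# BEBF construction: start with φ(1) = r, then repeatedly append the
# current Bellman error ε = r − LΦθ as the next basis function.
S, gamma, K = 6, 0.9, 4
rng = np.random.default_rng(3)
P = rng.random((S, S)); P /= P.sum(axis=1, keepdims=True)
r = rng.standard_normal(S)
L = np.eye(S) - gamma * P

Phi = r.reshape(S, 1)                       # φ(1) = r
for k in range(1, K):
    theta, *_ = np.linalg.lstsq(L @ Phi, r, rcond=None)  # min ||r − LΦθ||
    eps = r - L @ (Phi @ theta)             # current Bellman error
    Phi = np.column_stack([Phi, eps])       # φ(k+1) = ε
# The Bellman error shrinks as the basis grows (Parr et al., 2007).
```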
