Incremental Basis Construction from Temporal Difference Error

Yi Sun, Faustino Gomez, Mark Ring, Jürgen Schmidhuber
IDSIA, USI & SUPSI, Switzerland
June 2011

Sun, Gomez, Ring, Schmidhuber (IDSIA), Incremental Basis Construction, 06/11
Preliminary

A Markov Reward Process (MRP) is defined by the 4-tuple $\langle \mathcal{S}, P, r, \gamma \rangle$:
  $\mathcal{S} = \{1, \dots, S\}$ is the state space
  $P$ is an $S \times S$ transition matrix with $\{P\}_{i,j} = \Pr[s_{t+1} = j \mid s_t = i]$
  $r \in \mathbb{R}^S$ is the reward function
  $\gamma \in [0, 1)$ is the discount factor

The value function, $v \in \mathbb{R}^S$, is the solution of the Bellman equation $v = r + \gamma P v$.
Let $L = I - \gamma P$; then $v = L^{-1} r$.
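
As a minimal sketch of the definitions above, the value function of a small MRP can be obtained by solving the linear system $L v = r$. The 3-state transition matrix, the reward vector, the discount 0.9, and the helper name solve_value are all hypothetical, chosen only for illustration.

```python
import numpy as np

def solve_value(P, r, gamma):
    """Return v = L^{-1} r with L = I - gamma * P (hypothetical helper)."""
    L = np.eye(len(r)) - gamma * P
    # Solve L v = r directly instead of forming the inverse explicitly.
    return np.linalg.solve(L, r)

# Illustrative 3-state MRP (assumed values, not taken from the slides).
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5]])
r = np.array([1.0, 0.0, 2.0])
gamma = 0.9

v = solve_value(P, r, gamma)

# The solution should satisfy the Bellman equation v = r + gamma * P v.
bellman_residual = r + gamma * P @ v - v
print(v, np.abs(bellman_residual).max())
```

Since $\gamma < 1$ and $P$ is row-stochastic, $L$ is invertible, so the solve is well defined for any such MRP.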
Preliminary

Linear function approximation (LFA): $\hat{v} = \Phi \theta$, where
  $\Phi = [\phi_1, \dots, \phi_N]$ are $N$ ($N \ll S$) basis functions
  $\theta = [\theta_1, \dots, \theta_N]^\top$ are the weights

The Bellman error $\varepsilon \in \mathbb{R}^S$ is defined as $\varepsilon = r + \gamma P \hat{v} - \hat{v} = r - L \Phi \theta$.
  $\varepsilon \equiv 0 \iff v \equiv \Phi \theta$
  $\varepsilon$ is the expectation of the TD error
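
The two properties of the Bellman error listed above can be checked numerically. The sketch below assumes rewards that depend only on the current state, so the TD error for a transition from $i$ to $j$ is $r_i + \gamma \hat{v}_j - \hat{v}_i$. The MRP, the basis matrix, the weights, and the function names are hypothetical.

```python
import numpy as np

def bellman_error(P, r, gamma, Phi, theta):
    """Matrix form: eps = r - L Phi theta, with L = I - gamma * P."""
    L = np.eye(len(r)) - gamma * P
    return r - L @ Phi @ theta

def expected_td_error(P, r, gamma, v_hat):
    """Per-state expectation of the TD error over the next state j."""
    S = len(r)
    out = np.zeros(S)
    for i in range(S):
        for j in range(S):
            out[i] += P[i, j] * (r[i] + gamma * v_hat[j] - v_hat[i])
    return out

# Illustrative 3-state MRP and a 2-column basis (assumed values).
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5]])
r = np.array([1.0, 0.0, 2.0])
gamma = 0.9
Phi = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [1.0, 1.0]])
theta = np.array([0.5, -1.0])

eps = bellman_error(P, r, gamma, Phi, theta)
td = expected_td_error(P, r, gamma, Phi @ theta)

# If the true value function lies in the span of the basis, eps vanishes.
v = np.linalg.solve(np.eye(3) - gamma * P, r)
Phi_exact = np.column_stack([v, np.ones(3)])
eps_exact = bellman_error(P, r, gamma, Phi_exact, np.array([1.0, 0.0]))
```

Here eps and td agree entry by entry, and eps_exact is zero up to floating-point error because the first column of Phi_exact is $v$ itself.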
Preliminary

The LFA $\hat{v} = \Phi \theta$ depends on both $\theta$ and $\Phi$.

To find $\theta$:
  TD (Sutton, 1988), LSTD (Bradtke et al., 1996), etc.

To construct $\Phi$:
  Bellman error basis functions (BEBFs; Wu and Givan, 2005; Keller et al., 2006; Parr et al., 2007; Mahadevan and Liu, 2010)
  Proto-value basis functions (Mahadevan et al., 2006)
  Reduced-rank predictive state representations (Boots and Gordon, 2010)
  L1-regularized feature selection (Kolter and Ng, 2009)
Bellman Error Basis Functions

Intuition: "Bellman error, loosely speaking, point[s] towards the optimal value function" (Parr et al., 2007)

Construction:
  $\phi^{(1)} = r$
  At stage $k > 1$:
    Compute the TD fixpoint $\theta^{(k)}$ with respect to the $k$ current basis functions $\Phi^{(k)}$
    Get the Bellman error $\varepsilon^{(k)} = r - L \Phi^{(k)} \theta^{(k)}$
    Expand: $\Phi^{(k+1)} = [\Phi^{(k)} \,\vdots\, \varepsilon^{(k)}]$.
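
The construction above can be sketched with exact (model-based) quantities. The slide does not state which state weighting defines the TD fixpoint, so this sketch assumes uniform weighting, under which the fixpoint solves $\Phi^\top (r - L \Phi \theta) = 0$. The 6-state ring random walk, the reward vector, and the function names are hypothetical; the ring's transition matrix is symmetric, so the uniform distribution is also its stationary distribution.

```python
import numpy as np

def td_fixpoint(Phi, L, r):
    """Solve Phi^T (r - L Phi theta) = 0 (uniform weighting, an assumption)."""
    return np.linalg.solve(Phi.T @ L @ Phi, Phi.T @ r)

def build_bebf(P, r, gamma, num_basis):
    """Start from Phi = [r] and append the Bellman error at each stage."""
    S = len(r)
    L = np.eye(S) - gamma * P
    Phi = r.reshape(S, 1).copy()
    for _ in range(num_basis - 1):
        theta = td_fixpoint(Phi, L, r)
        eps = r - L @ Phi @ theta          # Bellman error of the current fit
        Phi = np.column_stack([Phi, eps])  # expand the basis by one column
    return Phi, L

# Illustrative MRP: symmetric random walk on a ring of 6 states.
S = 6
P = np.zeros((S, S))
for i in range(S):
    P[i, (i - 1) % S] = 0.5
    P[i, (i + 1) % S] = 0.5
r = np.zeros(S)
r[0] = 1.0
gamma = 0.9

Phi, L = build_bebf(P, r, gamma, num_basis=3)
theta = td_fixpoint(Phi, L, r)
eps = r - L @ Phi @ theta
gram = Phi.T @ Phi
```

Under this weighting each new Bellman error is orthogonal to the columns already in the basis, so gram is diagonal and the basis keeps full column rank as long as the Bellman error is nonzero. Once the error reaches zero the construction should stop, since appending a zero column would make the fixpoint system singular.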