Incremental Basis Construction from Temporal Difference Error
Yi Sun, Faustino Gomez, Mark Ring, Jürgen Schmidhuber
IDSIA, USI & SUPSI, Switzerland
June 2011
Sun, Gomez, Ring, Schmidhuber (IDSIA) Incremental Basis Construction 06/11 1 / 17

Preliminary

A Markov Reward Process (MRP) is defined by the 4-tuple $\langle \mathcal{S}, P, r, \gamma \rangle$:
  - $\mathcal{S} = \{1, \dots, S\}$ is the state space
  - $P$ is an $S \times S$ transition matrix with $P_{i,j} = \Pr[s_{t+1} = j \mid s_t = i]$
  - $r \in \mathbb{R}^S$ is the reward function
  - $\gamma \in [0, 1)$ is the discount factor

The value function $v \in \mathbb{R}^S$ is the solution of the Bellman equation $v = r + \gamma P v$.
Let $L = I - \gamma P$; then $v = L^{-1} r$.
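As a small illustration of the closed form v = L^{-1} r, the sketch below solves the Bellman equation for a toy MRP. The 3-state chain, its rewards, and the helper name `value_function` are assumptions made for this example and do not come from the slides.

```python
import numpy as np

# Hypothetical 3-state MRP; the numbers are illustrative only.
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5]])   # row-stochastic transition matrix
r = np.array([1.0, 0.0, -1.0])    # reward for each state
gamma = 0.9                       # discount factor in [0, 1)


def value_function(P, r, gamma):
    """Solve the Bellman equation v = r + gamma * P v exactly."""
    L = np.eye(len(r)) - gamma * P
    # Solving the linear system L v = r avoids forming L^{-1} explicitly.
    return np.linalg.solve(L, r)


v = value_function(P, r, gamma)
# v satisfies the Bellman equation up to floating-point error.
residual = r + gamma * P @ v - v
```

Because gamma < 1 and P is row-stochastic, L is always invertible, so the solve is well defined for any such MRP.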

Preliminary

Linear function approximation (LFA): $\hat{v} = \Phi\theta$, where
  - $\Phi = [\phi_1, \dots, \phi_N]$ are $N$ ($N \ll S$) basis functions
  - $\theta = [\theta_1, \dots, \theta_N]^\top$ are the weights

The Bellman error $\varepsilon \in \mathbb{R}^S$ is defined as $\varepsilon = r + \gamma P \hat{v} - \hat{v} = r - L\Phi\theta$.
  - $\varepsilon \equiv 0 \iff v \equiv \Phi\theta$
  - $\varepsilon$ is the expectation of the TD error
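The three statements about the Bellman error can be checked numerically. The following is a minimal sketch under assumed inputs: the toy chain, the basis `Phi`, the weights `theta`, and the function name `bellman_error` are all hypothetical choices for illustration.

```python
import numpy as np

# Hypothetical 3-state MRP and a 2-column basis; values are illustrative only.
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5]])
r = np.array([1.0, 0.0, -1.0])
gamma = 0.9
Phi = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [1.0, 2.0]])
theta = np.array([0.5, -0.25])


def bellman_error(P, r, gamma, Phi, theta):
    """Bellman error eps = r - L Phi theta, with L = I - gamma * P."""
    L = np.eye(len(r)) - gamma * P
    return r - L @ Phi @ theta


v_hat = Phi @ theta
eps = bellman_error(P, r, gamma, Phi, theta)

# Expected TD error in state i: sum_j P[i, j] * (r[i] + gamma * v_hat[j] - v_hat[i]).
td = r[:, None] + gamma * v_hat[None, :] - v_hat[:, None]
expected_td = (P * td).sum(axis=1)

# If the basis can represent v exactly, the Bellman error vanishes.
v = np.linalg.solve(np.eye(3) - gamma * P, r)
eps_exact = bellman_error(P, r, gamma, v[:, None], np.array([1.0]))
```

Here `eps` agrees with both `r + gamma * P @ v_hat - v_hat` and `expected_td`, and `eps_exact` is zero up to rounding.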

Preliminary

The LFA $\hat{v} = \Phi\theta$ depends on both $\theta$ and $\Phi$.

To find $\theta$: TD (Sutton, 1988), LSTD (Bradtke et al., 1996), etc.

To construct $\Phi$:
  - Bellman error basis functions (BEBFs; Wu and Givan, 2005; Keller et al., 2006; Parr et al., 2007; Mahadevan and Liu, 2010)
  - Proto-value basis functions (Mahadevan et al., 2006)
  - Reduced-rank predictive state representations (Boots and Gordon, 2010)
  - L1-regularized feature selection (Kolter and Ng, 2009)

Bellman Error Basis Functions

Intuition: "Bellman error, loosely speaking, point[s] towards the optimal value function" (Parr et al., 2007)

Construction:
  - $\phi^{(1)} = r$
  - At stage $k > 1$:
      - Compute the TD fixpoint $\theta^{(k)}$ with respect to the current $k$ basis functions $\Phi^{(k)}$
      - Get the Bellman error $\varepsilon^{(k)} = r - L\Phi^{(k)}\theta^{(k)}$
      - Expand: $\Phi^{(k+1)} = [\Phi^{(k)} \;\vdots\; \varepsilon^{(k)}]$
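The construction loop can be sketched as follows. Several pieces are assumptions not stated on the slide: the TD fixpoint is taken to be the usual projected fixed point solving Phi^T D (r - L Phi theta) = 0, the state weighting `d` defaults to uniform (which happens to be the stationary distribution of the toy ring chain used here), and the loop applies the same fit/expand step from the first basis function onward. The names `td_fixpoint` and `bebf` are hypothetical.

```python
import numpy as np


def td_fixpoint(Phi, L, r, d):
    """Weights theta solving Phi^T D (r - L Phi theta) = 0, with D = diag(d)."""
    A = Phi.T @ (d[:, None] * (L @ Phi))
    b = Phi.T @ (d * r)
    return np.linalg.solve(A, b)


def bebf(P, r, gamma, num_basis, d=None, tol=1e-10):
    """Grow a basis by repeatedly appending the current Bellman error."""
    S = len(r)
    d = np.full(S, 1.0 / S) if d is None else d   # assumed state weighting
    L = np.eye(S) - gamma * P
    Phi = r[:, None].copy()                       # phi^(1) = r
    while True:
        theta = td_fixpoint(Phi, L, r, d)
        eps = r - L @ Phi @ theta                 # Bellman error for current basis
        if Phi.shape[1] >= num_basis or np.linalg.norm(eps) < tol:
            return Phi, theta, eps
        Phi = np.hstack([Phi, eps[:, None]])      # expand the basis with eps


# Hypothetical 6-state ring: stay with probability 0.5, step forward otherwise.
S = 6
P = 0.5 * np.eye(S) + 0.5 * np.roll(np.eye(S), 1, axis=1)
r = np.zeros(S)
r[0] = 1.0
gamma = 0.9

Phi, theta, eps = bebf(P, r, gamma, num_basis=S)
v = np.linalg.solve(np.eye(S) - gamma * P, r)     # exact value function for comparison
```

At the TD fixpoint the Bellman error is D-orthogonal to the current basis, so each appended column is linearly independent of the earlier ones unless the error is already zero. In this toy run the basis reaches full rank after S columns and `Phi @ theta` matches `v`.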
