Predictive Timing Models
Pierre-Luc Bacon, Borja Balle, Doina Precup
Reasoning and Learning Lab, McGill University
From Bad Models to Good Policies (NIPS 2014)
Motivation
◮ Learning good models can be challenging (think of the Atari domain, for example).
◮ We consider a simpler kind of model: a subjective (agent-oriented) predictive timing model.
◮ We define a notion of predictive state over the durations of possible courses of action.
◮ Timing models are known to be important in animal learning (e.g. Machado et al., 2009).
Hypothetical timing model for a localization task
Today’s presentation will mostly be about the learning problem. Planning results are coming up.
Options framework
An option is a triple ⟨ I ⊆ S, π : S × A → [0, 1], β : S → [0, 1] ⟩:
◮ initiation set I
◮ policy π (stochastic or deterministic)
◮ termination condition β
Example (robot navigation): if there is no obstacle in front (I), go forward (π) until you get too close to another object (β).
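As a rough illustration (not from the slides), the triple can be written down directly in code; the state attributes `obstacle_ahead` and `too_close` below are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Option:
    initiation: Callable[[Any], bool]      # I: can the option start in state s?
    policy: Callable[[Any, Any], float]    # pi(s, a): action probabilities
    termination: Callable[[Any], float]    # beta(s): probability of stopping in s

# Hypothetical robot-navigation option matching the example above.
go_forward = Option(
    initiation=lambda s: not s.obstacle_ahead,
    policy=lambda s, a: 1.0 if a == "forward" else 0.0,   # deterministic pi
    termination=lambda s: 1.0 if s.too_close else 0.0,
)
```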
Usual option models
1. Expected reward r_ω: for every state, it gives the expected return during ω's execution
2. Transition model p_ω: conditional distribution over next states (reflecting the discount factor γ and the option duration)
Models give predictions about the future, conditioned on the option being executed, i.e. generalized value functions.
Options Duration Model (ODM)
Instead of predicting a full model at the end of an option (a probability distribution over observations or states), predict when the option will terminate, i.e. the expected option duration or the distribution over durations.
Model
We have a dynamical system with observations from Ω × {♯, ⊥}, where:
◮ ♯ (sharp) denotes continuation
◮ ⊥ (bottom) denotes termination
We obtain a coarser representation of the original MDP:
(s_1, π_{ω_1}(s_1)), ..., (s_{d_1 - 1}, π_{ω_1}(s_{d_1 - 1})), (s_{d_1}, π_{ω_2}(s_{d_1})), ...
→ (ω_1, ♯), ..., (ω_1, ♯), (ω_1, ⊥), (ω_2, ♯), ..., (ω_2, ♯), (ω_2, ⊥), ...
= (ω_1, ♯)^{d_1 − 1} (ω_1, ⊥) (ω_2, ♯)^{d_2 − 1} (ω_2, ⊥) ...
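A minimal sketch of how such a coarse observation sequence could be generated, assuming a simulator with `reset`/`step` methods and option objects exposing `sample_action`, `termination`, and `name` (all hypothetical names):

```python
import random

def coarse_rollout(env, options, n_options):
    """Emit the (option, sharp/bottom) sequence produced by executing
    n_options randomly chosen options from start to termination."""
    obs = []
    s = env.reset()
    for _ in range(n_options):
        omega = random.choice(options)
        while True:
            s = env.step(s, omega.sample_action(s))    # follow pi_omega
            if random.random() < omega.termination(s):
                obs.append((omega.name, "bottom"))     # (omega, ⊥): option stops
                break
            obs.append((omega.name, "sharp"))          # (omega, ♯): option continues
    return obs
```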
Predictive State Representation
A predictive state representation is a model of a dynamical system where the current state is represented as a set of predictions about the future behavior of the system.
A PSR with observations in Σ (finite) is a tuple A = ⟨ α_λ, α_∞, {A_σ}_{σ ∈ Σ} ⟩ where:
◮ α_λ, α_∞ ∈ R^n are the initial and final weights
◮ A_σ ∈ R^{n×n} are the transition weights
Predicting with a PSR
A PSR A computes a function f_A : Σ* → R that assigns a number to each string x = x_1 x_2 ··· x_t ∈ Σ* as follows:
f_A(x) = α_λ^T A_{x_1} A_{x_2} ··· A_{x_t} α_∞ = α_λ^T A_x α_∞.
The conditional probability of observing a sequence of observations v ∈ Σ* after u is:
f_{A,u}(v) = f_A(uv) / f_A(u) = (α_λ^T A_u A_v α_∞) / (α_λ^T A_u α_∞) = (α_u^T A_v α_∞) / (α_u^T α_∞), where α_u^T = α_λ^T A_u.
The PSR semantics of u is that of a history, and of v that of a test.
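A minimal numpy sketch of these computations (not the authors' code): `f` evaluates f_A(x) by chaining the transition matrices, and `conditional` computes f_{A,u}(v).

```python
import numpy as np

class PSR:
    def __init__(self, alpha_lambda, alpha_inf, A):
        self.alpha_lambda = alpha_lambda   # initial weights, shape (n,)
        self.alpha_inf = alpha_inf         # final weights, shape (n,)
        self.A = A                         # dict: symbol -> (n, n) transition matrix

    def f(self, x):
        # f_A(x) = alpha_lambda^T A_{x_1} ... A_{x_t} alpha_inf
        v = self.alpha_lambda
        for sigma in x:
            v = self.A[sigma].T @ v
        return float(v @ self.alpha_inf)

    def conditional(self, u, v):
        # f_{A,u}(v) = f_A(uv) / f_A(u): probability of test v after history u
        return self.f(list(u) + list(v)) / self.f(list(u))
```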
Embedding
Let δ(s_0, ω) be a random variable representing the duration of option ω when started from s_0:
P[δ(s_0, ω) = d] = e_{s_0}^T A_{ω,♯}^{d−1} A_{ω,⊥} 1,
where e_{s_0} ∈ R^S is an indicator vector with e_{s_0}(s) = I[s = s_0], 1 ∈ R^S is the all-ones vector, and
A_{ω,♯}(s, s') = Σ_{a ∈ A} π(s, a) P(s, a, s') (1 − β(s'))   (not stopping)
A_{ω,⊥}(s, s') = Σ_{a ∈ A} π(s, a) P(s, a, s') β(s')   (stopping)
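Under these definitions, the duration distribution can be computed directly from the MDP quantities; a sketch assuming `P` is a list of |S|×|S| per-action transition matrices, `pi` the option's |S|×|A| policy, and `beta` its termination vector:

```python
import numpy as np

def option_matrices(P, pi, beta):
    """Build A_{omega,sharp} and A_{omega,bottom} for one option."""
    S, A = pi.shape
    P_pi = sum(np.diag(pi[:, a]) @ P[a] for a in range(A))   # expected transitions under pi
    return P_pi @ np.diag(1.0 - beta), P_pi @ np.diag(beta)  # (not stopping, stopping)

def duration_pmf(s0, A_sharp, A_bottom, d_max):
    """P[delta(s0, omega) = d] = e_{s0}^T A_sharp^{d-1} A_bottom 1, for d = 1..d_max."""
    v = np.eye(A_sharp.shape[0])[s0]          # e_{s0}
    ones = np.ones(A_sharp.shape[0])
    pmf = []
    for _ in range(d_max):
        pmf.append(v @ A_bottom @ ones)
        v = v @ A_sharp
    return np.array(pmf)
```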
Theorem
Let M be an MDP with n states, Ω a set of options, and Σ = Ω × {♯, ⊥}. For every distribution α over the states of M, there exists a PSR A = ⟨ α, 1, {A_σ} ⟩ with at most n states that computes the distributions over durations of options executed from a state sampled according to α.
The probability of a sequence of options ω̄ = ω_1 ··· ω_t and their durations d̄ = d_1 ··· d_t, d_i > 0, is then given by:
P[d̄ | α, ω̄] = α^T A_{ω_1,♯}^{d_1 − 1} A_{ω_1,⊥} A_{ω_2,♯}^{d_2 − 1} A_{ω_2,⊥} ··· A_{ω_t,♯}^{d_t − 1} A_{ω_t,⊥} 1.
Learning
A Hankel matrix is a bi-infinite matrix H_f ∈ R^{Σ* × Σ*} with rows and columns indexed by strings in Σ*, which contains the joint probabilities of prefixes and suffixes: the entry at row u and column v is P[uv]. For example, the entry at row (ω_0, ♯)(ω_0, ♯) and column (ω_0, ♯)(ω_0, ⊥) is P[(ω_0, ♯)(ω_0, ♯)(ω_0, ♯)(ω_0, ⊥)].
Note: closely related to the so-called system dynamics matrix.
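In practice one estimates a finite sub-block of H over chosen prefix and suffix sets. A hedged sketch, assuming each sampled trajectory is a complete observation string and the entry H[u, v] is estimated by the empirical frequency of the string uv (the paper's exact estimator may differ):

```python
import numpy as np
from collections import Counter

def empirical_hankel(trajectories, prefixes, suffixes):
    """Estimate H[u, v] ~= P[uv] from sampled (option, sharp/bottom) strings."""
    counts = Counter(tuple(t) for t in trajectories)
    n = len(trajectories)
    H = np.zeros((len(prefixes), len(suffixes)))
    for i, u in enumerate(prefixes):
        for j, v in enumerate(suffixes):
            H[i, j] = counts[tuple(u) + tuple(v)] / n
    return H
```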
Key Idea: The Hankel Trick
[Diagram: Data → Hankel matrix estimation → low-rank factorization (linear algebra) → WFA]
We can recover (up to a change of basis) the underlying PSR through a rank factorization of the Hankel matrix. Given the SVD UΛV^T of H, 3 lines of code suffice:
α_λ^T = h_{λ,S}^T V
α_∞ = (HV)^+ h_{P,λ}
A_σ = (HV)^+ H_σ V
Note: the use of SVD makes the algorithm robust to noisy estimation of H.
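A numpy sketch of those three lines, assuming the Hankel blocks H, H_σ, h_{λ,S}, and h_{P,λ} have already been estimated over fixed prefix/suffix sets:

```python
import numpy as np

def spectral_psr(H, H_sigma, h_lambda_S, h_P_lambda, rank):
    """Recover <alpha_lambda, alpha_inf, {A_sigma}> from Hankel blocks."""
    _, _, Vt = np.linalg.svd(H, full_matrices=False)
    V = Vt[:rank].T                        # top-`rank` right singular vectors
    HV_pinv = np.linalg.pinv(H @ V)
    alpha_lambda = h_lambda_S @ V          # alpha_lambda^T = h_{lambda,S}^T V
    alpha_inf = HV_pinv @ h_P_lambda       # alpha_inf = (HV)^+ h_{P,lambda}
    A = {s: HV_pinv @ Hs @ V for s, Hs in H_sigma.items()}  # A_sigma = (HV)^+ H_sigma V
    return alpha_lambda, alpha_inf, A
```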
Synthetic experiment
[Figure: 3×3 grid world with states q_0, ..., q_8]
Four options: go N, E, W, or S until the agent hits a wall. A primitive action succeeds with probability 0.9. We report the relative error |µ_A − d_ω| / max{µ_A, d_ω}.
[Plot: relative error vs. number of samples N (0 to 50,000) for the PSR, the naive baseline, and the true model]
The "naive" method consists in predicting the empirical mean durations, regardless of history. The PSR state updates clearly help.
[Plot: relative error as a function of the number of samples N for different grid sizes d = 5, 9, 13]
Continuous domain

|Ω|  (K_r, K_s)   h=1         h=2         h=3         h=4         h=5         h=6         h=7         h=8
 4   (2, 1)       0.19 (199)  0.25 (199)  0.26 (196)  0.30 (198)  0.31 (172)  0.33 (163)  0.31 (173)  0.30 (172)
 4   (1, 1)       0.15 (133)  0.28 (126)  0.31 (134)  0.35 (131)  0.36 (131)  0.36 (131)  0.36 (132)  0.36 (133)
 8   (2, 1)       0.40 (176)  0.47 (163)  0.49 (163)  0.51 (176)  0.52 (162)  0.51 (164)  0.50 (163)  0.52 (167)
 8   (1, 1)       0.38 (166)  0.48 (162)  0.46 (195)  0.51 (164)  0.52 (162)  0.51 (162)  0.51 (165)  0.54 (169)

Simulated robot with continuous state and nonlinear dynamics. We use the Box2D physics engine to simulate a circular differential wheeled robot (Roomba-like).
Future work
Planning: we have been able to show that, given a policy over options and an ODM state, the value function is a linear function of the PSR state. This suggests that the ODM state might be sufficient for planning.
Also on the agenda:
◮ Gain a better theoretical understanding of the relationship between the environment and the PSR rank.
◮ Conduct planning experiments on the learnt models.
Thank you
The off-policy case
The exploration policy will be reflected in the empirical Hankel matrix. We can compensate by forming an auxiliary PSR. For a uniform policy, we would have:
α^π_λ = e_0,  α^π_∞ = 1,
A^π_{ω_i,♯}(0, ω_i) = 1/|Ω|,  A^π_{ω_i,♯}(ω_i, ω_i) = 1,
A^π_{ω_i,♯}(0, 0) = 1/|Ω|,  A^π_{ω_i,♯}(ω_i, 0) = 1,
and compute the corrected Hankel matrix by taking the Hadamard product: H = Ĥ ⊙ H^π.
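A small sketch of the correction step, reusing the PSR class from the earlier sketch to fill in H^π over the same prefix/suffix basis (the construction of the auxiliary policy-PSR itself follows the entries above):

```python
import numpy as np

def corrected_hankel(H_hat, policy_psr, prefixes, suffixes):
    """H = H_hat (Hadamard product) H_pi, with H_pi[u, v] = f_{A^pi}(uv)."""
    H_pi = np.array([[policy_psr.f(list(u) + list(v)) for v in suffixes]
                     for u in prefixes])
    return H_hat * H_pi   # elementwise product
```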