Efficient Planning
Tuesday class summary
- Planning: any computational process that uses a model to create or improve a policy
- Dyna framework: [architecture diagram relating real experience, the learned model, planning updates, and the value function/policy]
Questions during class
- "Why use simulated experience? Can't you directly compute the solution from the model?"
- "Wouldn't it be better to plan backwards from the goal?"
How to Achieve Efficient Planning?
- What type of backup is better?
  - Sample vs. full backups
  - Incremental vs. less incremental backups
- How to order the backups?
What is Efficient Planning?
Planning algorithm A is more efficient than planning algorithm B if:
- it can compute the optimal policy (or value function) in less time, or
- given the same amount of computation time, it improves the policy (or value function) more.
What backup type is best?
Full vs. Sample Backups
[Backup diagrams pairing each full (DP) backup with the corresponding sample (one-step TD) backup:]
- estimating v_π: policy evaluation (full) vs. TD(0) (sample)
- estimating v_*: value iteration (full; no sample counterpart shown)
- estimating q_π: Q-policy evaluation (full) vs. Sarsa (sample)
- estimating q_*: Q-value iteration (full) vs. Q-learning (sample)
Full vs. Sample Backups
[Plot: RMS error in the value estimate of Q(s0, a0) vs. number of computations (0 to 2b), comparing one full backup with repeated sample backups, for branching factors b = 2, 10, 100, 1000, 10,000.]
Setup: b successor states, all equally likely; initial error = 1; assume all next states' values are correct.
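To make the trade-off concrete, here is a rough Monte Carlo sketch of the setup described above (b equally likely successors whose values are assumed correct); the normalization (so that a single sample backup has an RMS error of about 1) and the run count are illustrative choices, not the exact setup behind the slide's curves:

```python
import numpy as np

def sample_backup_rms(b, max_samples, n_runs=2000, seed=0):
    """RMS error of Q(s0, a0) after k sample backups, k = 1..max_samples,
    when the estimate is the running average of sampled successor values.
    Successor values are normalized so that one sample has error ~1."""
    rng = np.random.default_rng(seed)
    sq_err = np.zeros(max_samples)
    for _ in range(n_runs):
        v_next = rng.normal(size=b)
        v_next = (v_next - v_next.mean()) / v_next.std()   # mean 0, spread 1
        true_q = 0.0                                        # value of a full backup
        samples = rng.choice(v_next, size=max_samples, replace=True)
        running_mean = np.cumsum(samples) / np.arange(1, max_samples + 1)
        sq_err += (running_mean - true_q) ** 2
    return np.sqrt(sq_err / n_runs)

# A full backup costs b computations and removes the error entirely;
# b sample backups cost the same and still leave roughly 1/sqrt(b) error.
for b in (2, 10, 100, 1000):
    rms = sample_backup_rms(b, max_samples=b)
    print(f"b = {b:4d}: RMS error after b sample backups ~ {rms[-1]:.3f}")
```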
Small Backups
- Small backups are single-successor backups based on the model
- Small backups have the same computational complexity as sample backups
- Small backups have no sampling error
- Small backups require storage for 'old' values
Main Idea behind Small Backups
Consider an estimate A that is constructed from a weighted sum of estimates X_i.

Full backup: $A \leftarrow \sum_i w_i X_i$

What can we do if we know that only a single successor estimate, X_j, has changed value since the last backup?

Let x_j be the old value of X_j that was used to construct the current value of A. Then A can be updated for that single successor by adding the weighted difference between the new and the old value:

Small backup: $A \leftarrow A + w_j (X_j - x_j)$
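A small numerical sketch of this idea (the weights and successor values are made up for illustration): a small backup reproduces the result of a full backup exactly, using O(1) work per changed successor, provided the 'old' values are stored:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5                                    # number of successor estimates
w = rng.dirichlet(np.ones(n))            # weights (e.g., transition probabilities)
X = rng.normal(size=n)                   # current successor estimates
x_old = X.copy()                         # stored 'old' values used to build A

A = np.dot(w, X)                         # full backup: A = sum_i w_i X_i

# Successor j changes value; only that term of the sum needs correcting.
j = 2
X[j] += 0.7

A_small = A + w[j] * (X[j] - x_old[j])   # small backup: O(1) work, no sampling error
x_old[j] = X[j]                          # remember the value just used

A_full = np.dot(w, X)                    # full backup for comparison: O(n) work
print(abs(A_small - A_full))             # ~0: identical result
```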
Small vs. Sample Backups
[Plots: normalized RMS error as a function of the step-size parameter on small example tasks with left and right terminal rewards (+1 / -1).
Top panel: sample backup = TD(0) with a decaying step-size, vs. the small backup (x-axis: step-size / step-size decay).
Bottom panel: sample backup = TD(0) with a constant step-size α, vs. the small backup (x-axis: α / decay).]
Small vs. Sample Backups
[Figure: a small example with states A, B, C; bar charts of the estimated transition probabilities (about 0.667 and 0.333) and the resulting value estimates for states A and B.]
Backup Ordering
Backup Ordering
Asynchronous Value Iteration
  Do forever:
    1) Select a state $s \in S$ according to some selection strategy H
    2) Apply a full backup to s:
       $V(s) \leftarrow \max_a \big[ \hat{r}(s,a) + \gamma \sum_{s'} \hat{p}(s'|s,a)\, V(s') \big]$

For every selection strategy H that selects each state infinitely often, the values V converge to the optimal value function $V^*$.
The rate of convergence depends strongly on the selection strategy H.
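A minimal sketch of asynchronous value iteration with a pluggable selection strategy H (the array layout, the random MDP, and the uniform-random strategy are illustrative assumptions):

```python
import numpy as np

def async_value_iteration(P, R, gamma, select_state, n_backups):
    """Asynchronous value iteration: repeatedly pick one state with the
    selection strategy H and apply a full backup to it.
    P[a][s, s'] = transition probability, R[a][s] = expected reward."""
    n_actions, n_states = len(P), P[0].shape[0]
    V = np.zeros(n_states)
    for _ in range(n_backups):
        s = select_state(V)                       # selection strategy H
        V[s] = max(R[a][s] + gamma * P[a][s] @ V  # full backup of state s
                   for a in range(n_actions))
    return V

# Example: a random 2-action, 20-state MDP with a uniform-random strategy H.
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 20, 2, 0.95
P = [rng.dirichlet(np.ones(n_states), size=n_states) for _ in range(n_actions)]
R = [rng.normal(size=n_states) for _ in range(n_actions)]
V = async_value_iteration(P, R, gamma, lambda V: rng.integers(n_states), 5000)
print(V.round(2))
```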
The Trade-Off
For any effective ordering strategy, the cost saved by performing fewer backups should outweigh the cost of maintaining the ordering:

  cost to maintain ordering  <  cost savings due to fewer backups
Prioritized Sweeping
Which states or state-action pairs should be generated during planning?
Work backwards from states whose values have just changed:
- Maintain a queue of state-action pairs whose values would change a lot if backed up, prioritized by the size of the change
- When a new backup occurs, insert its predecessors into the queue according to their priorities
- Always perform backups from the first entry in the queue
Moore & Atkeson 1993; Peng & Williams 1993; improved by McMahan & Gordon 2005; Van Seijen 2013
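A minimal sketch of this queue-driven scheme for a deterministic tabular model (the data layout, the priority threshold theta, and the lazy handling of duplicate queue entries are illustrative choices; the full method also interleaves planning with acting and model learning):

```python
import heapq

def prioritized_sweeping(model, predecessors, actions, Q, gamma, theta, n_backups):
    """Prioritized-sweeping planning over a deterministic tabular model.
    model[(s, a)] = (r, s2); predecessors[s2] = set of (s, a) pairs leading to s2;
    actions[s] = available actions in s; Q = dict {(s, a): value}, updated in place."""

    def best_value(s2):
        # Value of the best action in s2; terminal/unknown states default to 0
        return max((Q[(s2, b)] for b in actions.get(s2, ())), default=0.0)

    def priority(s, a):
        # How much a full backup of (s, a) would change its current value
        r, s2 = model[(s, a)]
        return abs(r + gamma * best_value(s2) - Q[(s, a)])

    heap, counter = [], 0                       # min-heap, so priorities are negated
    for (s, a) in model:                        # seed with all known pairs
        p = priority(s, a)
        if p > theta:
            heap.append((-p, counter, (s, a)))
            counter += 1
    heapq.heapify(heap)

    for _ in range(n_backups):
        if not heap:
            break
        _, _, (s, a) = heapq.heappop(heap)      # highest-priority pair first
        r, s2 = model[(s, a)]
        Q[(s, a)] = r + gamma * best_value(s2)  # full backup of the popped pair
        for (sp, ap) in predecessors.get(s, ()):  # its predecessors may now be stale
            p = priority(sp, ap)
            if p > theta:
                heapq.heappush(heap, (-p, counter, (sp, ap)))
                counter += 1
    return Q
```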
Moore and Atkeson's Prioritized Sweeping
Published in 1993.
Prioritized Sweeping vs. Dyna-Q
Both use n = 5 backups per environmental interaction
Bellman Error Ordering
The Bellman error is a measure of the difference between the current value of a state and its value after a full backup:

$BE(s) = \Big|\, V(s) - \max_a \big[ \hat{r}(s,a) + \gamma \sum_{s'} \hat{p}(s'|s,a)\, V(s') \big] \Big|$
Bellman Error Ordering

  initialize V(s) arbitrarily for all s
  compute BE(s) for all s
  loop {until convergence}
    select the state s' with the worst Bellman error
    perform a full backup of s'
    BE(s') ← 0
    for all predecessor states s̄ of s' do
      recompute BE(s̄)
    end for
  end loop

To get a positive trade-off: computation time of the Bellman error << computation time of a full backup
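A minimal sketch of this loop for a tabular MDP (the array layout, the tolerance, and the argmax-based selection are illustrative; a practical implementation would keep the Bellman errors in a priority queue rather than rescanning them, precisely because of the trade-off above):

```python
import numpy as np

def bellman_error_ordering(P, R, gamma, tol=1e-6, max_backups=100_000):
    """Value iteration that always fully backs up the state with the largest
    Bellman error, then refreshes the errors of that state's predecessors.
    P[a][s, s'] = transition probability, R[a][s] = expected reward."""
    n_actions, n_states = len(P), P[0].shape[0]
    V = np.zeros(n_states)

    def bellman_error(s):
        backed_up = max(R[a][s] + gamma * P[a][s] @ V for a in range(n_actions))
        return abs(V[s] - backed_up)

    # predecessors[s'] = states s that can reach s' under some action
    predecessors = [set() for _ in range(n_states)]
    for a in range(n_actions):
        for s in range(n_states):
            for s2 in np.nonzero(P[a][s])[0]:
                predecessors[s2].add(s)

    BE = np.array([bellman_error(s) for s in range(n_states)])
    for _ in range(max_backups):
        s = int(np.argmax(BE))                    # state with the worst error
        if BE[s] < tol:
            break                                 # (approximate) convergence
        V[s] = max(R[a][s] + gamma * P[a][s] @ V for a in range(n_actions))
        BE[s] = 0.0
        for sp in predecessors[s]:                # only these errors changed
            BE[sp] = bellman_error(sp)
    return V
```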
Prioritized Sweeping with Small Backups

  initialize V(s) arbitrarily for all s
  initialize U(s) = V(s) for all s
  initialize Q(s,a) = V(s) for all s, a
  initialize N_sa and N_sa^{s'} to 0 for all s, a, s'
  loop {over episodes}
    initialize s
    repeat {for each step in the episode}
      select action a, based on Q(s, ·)
      take action a, observe r and s'
      N_sa ← N_sa + 1;  N_sa^{s'} ← N_sa^{s'} + 1
      Q(s,a) ← [ Q(s,a)(N_sa − 1) + r + γ V(s') ] / N_sa
      V(s) ← max_b Q(s,b)
      p ← |V(s) − U(s)|
      if s is on the queue, set its priority to p; otherwise, add it with priority p
      for a number of update cycles do
        remove the top state s̄' from the queue
        ΔU ← V(s̄') − U(s̄')
        U(s̄') ← V(s̄')
        for all (s̄, ā) pairs with N_{s̄ā}^{s̄'} > 0 do
          Q(s̄, ā) ← Q(s̄, ā) + γ (N_{s̄ā}^{s̄'} / N_{s̄ā}) ΔU
          V(s̄) ← max_b Q(s̄, b)
          p ← |V(s̄) − U(s̄)|
          if s̄ is on the queue, set its priority to p; otherwise, add it with priority p
        end for
      end for
      s ← s'
    until s is terminal
  end loop
Empirical Comparison
[Plot: RMS error (averaged over the first 10^5 observations) vs. computation time per observation in seconds.
Curves: initial error; PS, Moore & Atkeson; PS, Wiering & Schmidhuber; PS, Peng & Williams; PS, small backups; value iteration.]
Trajectory Sampling
- Trajectory sampling: perform backups along simulated trajectories
- This samples from the on-policy distribution
- Advantages when function approximation is used (Chapter 8)
- Focusing of computation: can cause vast uninteresting parts of the state space to be (usefully) ignored
[Diagram: initial states, states reachable under optimal control, and irrelevant states]
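A minimal sketch of planning by trajectory sampling (the model arrays and the ε-greedy simulation policy are illustrative assumptions): one-step backups are applied only to the state-action pairs visited along trajectories simulated from the model:

```python
import numpy as np

def trajectory_sampling_planning(P, R, gamma, start_state, n_trajectories,
                                 max_steps, epsilon=0.1, seed=0):
    """Simulate trajectories from the model under the current epsilon-greedy
    policy and apply a one-step full backup to each visited pair only.
    P[a][s, s'] = transition probability, R[a][s] = expected reward."""
    rng = np.random.default_rng(seed)
    n_actions, n_states = len(P), P[0].shape[0]
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_trajectories):
        s = start_state
        for _ in range(max_steps):
            # epsilon-greedy action from the current value estimates
            if rng.random() < epsilon:
                a = int(rng.integers(n_actions))
            else:
                a = int(np.argmax(Q[s]))
            # one-step full (expected) backup of the visited pair only
            Q[s, a] = R[a][s] + gamma * P[a][s] @ Q.max(axis=1)
            # follow the trajectory by sampling the next state from the model
            s = rng.choice(n_states, p=P[a][s])
    return Q
```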
Trajectory Sampling Experiment
- one-step full tabular backups
- uniform: cycled through all state-action pairs
- on-policy: backed up along simulated trajectories
- 200 randomly generated undiscounted episodic tasks
- 2 actions for each state, each with b equally likely next states
- 0.1 probability of transition to the terminal state
- expected reward on each transition drawn from a Gaussian with mean 0 and variance 1
Heuristic Search
- Used for action selection, not for changing a value function (= heuristic evaluation function)
- Backed-up values are computed, but typically discarded
- Extension of the idea of a greedy policy, only deeper
- Also suggests ways to select states to back up: smart focusing
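A minimal sketch of this use of a model (the model interface, heuristic, and depth limit are illustrative assumptions): a depth-limited lookahead backs up model values only to choose the current action, bottoming out in a heuristic evaluation function, and the backed-up values are then discarded:

```python
def lookahead_value(model, heuristic, s, depth, gamma):
    """Depth-limited lookahead: back up model values to the given depth,
    using heuristic(s) as the evaluation at the leaves.
    model(s) -> {a: [(prob, reward, s2), ...]} is an assumed interface."""
    if depth == 0:
        return heuristic(s)
    best = float("-inf")
    for a, outcomes in model(s).items():
        q = sum(p * (r + gamma * lookahead_value(model, heuristic, s2,
                                                 depth - 1, gamma))
                for p, r, s2 in outcomes)
        best = max(best, q)
    return best if best > float("-inf") else heuristic(s)  # terminal state

def select_action(model, heuristic, s, depth, gamma):
    """Pick the greedy action with respect to the depth-limited lookahead
    (assumes depth >= 1); the backed-up values are used only here."""
    def q(a, outcomes):
        return sum(p * (r + gamma * lookahead_value(model, heuristic, s2,
                                                    depth - 1, gamma))
                   for p, r, s2 in outcomes)
    return max(model(s).items(), key=lambda kv: q(*kv))[0]
```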