Function Approximation via Tile Coding: Automating Parameter Choice
Alexander Sherstov and Peter Stone
Department of Computer Sciences, The University of Texas at Austin
About the Authors
• Alex Sherstov
• Peter Stone
Thanks to Nick Jong for presenting!
Overview
• TD reinforcement learning
  – Leading abstraction for decision making
  – Uses function approximation to store the value function
  [Figure: agent-environment loop: the agent holds the value function Q(s, a) and sends actions; the environment, defined by the reward function r and the transition function t, returns a reward and a new state]
• Existing methods
  – Discretization, neural nets, radial basis functions, case-based methods, ... [Santamaria et al., 1997]
  – Trade-offs: representational power, time/space requirements, ease of use
Overview, cont.
• "Happy medium": tile coding
  – Widely used in RL [Stone and Sutton, 2001; Santamaria et al., 1997; Sutton, 1996]
  – Use in robot soccer:
    [Figure: full soccer state (a few continuous state variables, 13) → sparse, coarse tile coding → huge binary feature vector F_a (about 400 1's and 40,000 0's) → linear map → action values]
Our Results
• We show that:
  – Tile coding is parameter-sensitive
  – The optimal parameterization depends on the problem and the elapsed training time
• We contribute:
  – An automated parameter-adjustment scheme
  – Empirical validation
Background: Reinforcement Learning
• RL problem given by ⟨S, A, t, r⟩:
  – S, set of states
  – A, set of actions
  – t : S × A → Pr(S), transition function
  – r : S × A → ℝ, reward function
• Solution:
  – A policy π* : S → A that maximizes the return Σ_{i=0}^∞ γ^i r_i
  – Q-learning: find π* by approximating the optimal value function Q* : S × A → ℝ (sketch below)
• Need FA to generalize Q* to unseen situations
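To make the Q-learning bullet concrete, here is a minimal tabular sketch in Python. The environment interface (reset(), step()) and the epsilon-greedy exploration rule are assumptions for illustration; the slides do not specify either.

```python
import random
from collections import defaultdict

def q_learning_episode(env, Q, actions, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Run one episode of tabular Q-learning, updating Q in place.

    Q maps (state, action) -> estimated value; `env` is assumed to expose
    reset() -> state and step(state, action) -> (reward, next_state, done).
    This interface is illustrative, not taken from the slides.
    """
    s = env.reset()
    done = False
    while not done:
        # epsilon-greedy action selection from the current estimates
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda act: Q[(s, act)])
        r, s_next, done = env.step(s, a)
        # one-step backup toward r + gamma * max_a' Q(s', a')
        target = r if done else r + gamma * max(Q[(s_next, act)] for act in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s_next

# Typical use: Q = defaultdict(float), then run many episodes.
```

In practice the table Q is replaced by a function approximator such as the tile coder described next, which is exactly the "Need FA" point above.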
Background: Tile Coding
[Figure: two overlapping tilings (Tiling #1 and Tiling #2) laid over a 2-D space spanned by State Variable #1 and State Variable #2]
• Maintaining an arbitrary f : D → ℝ (often D = S × A):
  – D is partitioned into tiles, each with a weight
  – Each partition is a tiling; several are used
  – Given x ∈ D, sum the weights of the participating tiles to get f(x)
Background: Tile Coding Parameters
• We study canonical univariate tile coding (sketch below):
  – w, tile width (same for all tiles)
  – t, # of tilings ("generalization breadth")
  – r = w/t, resolution
  – tilings uniformly offset
• Empirical model:
  – Fix the resolution r, vary the generalization breadth t
  – Same resolution ⇒ same representational power and asymptotic performance
  – But: t affects intermediate performance
  – How to set t?
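A minimal sketch of the canonical univariate tile coding just described, with t tilings of width w uniformly offset by r = w/t. The class name, the value-range bounds (low, high), and the step-size convention are illustrative assumptions, not the authors' implementation.

```python
class UnivariateTileCoder:
    """Canonical univariate tile coding: t tilings of tile width w,
    uniformly offset by the resolution r = w / t."""

    def __init__(self, num_tilings, tile_width, low, high):
        self.t = num_tilings
        self.w = tile_width
        self.r = tile_width / num_tilings              # resolution
        self.low = low
        num_tiles = int((high - low) / tile_width) + 2  # +2 to cover the offsets
        # one weight per tile per tiling
        self.weights = [[0.0] * num_tiles for _ in range(num_tilings)]

    def active_tiles(self, x):
        """Index of the tile containing x in each tiling (tiling i is shifted by i*r)."""
        return [int((x - self.low + i * self.r) // self.w) for i in range(self.t)]

    def value(self, x):
        """f(x) = sum of the weights of the participating tiles."""
        return sum(self.weights[i][tile] for i, tile in enumerate(self.active_tiles(x)))

    def update(self, x, target, alpha=0.1):
        """Move f(x) toward `target` by splitting the error across the t tilings."""
        error = target - self.value(x)
        for i, tile in enumerate(self.active_tiles(x)):
            self.weights[i][tile] += (alpha / self.t) * error
```

This sketch makes the empirical model above concrete: holding r = w/t fixed while increasing t (and hence w) widens generalization without changing the representable resolution.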
Testbed Domain: Grid World
[Figure: grid world and its optimal policy; cells are labeled with their optimal action parameters (0.5–0.8), with a wall, an abyss (cliff), a start cell, and a goal cell]
• Episodic task (cliff and goal cells are terminal)
• Actions: (d, p) ∈ {↑, ↓, →, ←} × [0, 1]
Testbed Domain, cont.
• A move succeeds with probability F(p) and is random otherwise; F varies from cell to cell
  [Figure: success-probability curve F(p), rising from about 0.5 toward 1 over p ∈ [0, 1]]
• Two reward functions (sketch below): −100 cliff, +100 goal, −1 otherwise ("informative"); +100 goal, 0 otherwise ("uninformative")
• Use of tile coding: generalize over actions (p)
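The two reward schemes from the slide, written out as a small Python function; the cell labels ('cliff', 'goal') are illustrative names, not part of the original specification.

```python
def reward(cell, informative=True):
    """Reward on entering `cell` under the two schemes in the slides:
    informative:   -100 cliff, +100 goal, -1 otherwise
    uninformative: +100 goal, 0 otherwise
    """
    if cell == 'goal':
        return +100.0
    if cell == 'cliff':
        return -100.0 if informative else 0.0
    return -1.0 if informative else 0.0
```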
Generalization Helps Initially
[Figure: % optimal vs. episodes completed for 1, 3, and 6 tilings, under the informative reward (left, 0–1,000 episodes) and the uninformative reward (right, 0–4,000 episodes)]
Generalization improves cliff avoidance.
Generalization Helps Initially, cont.
[Figure: % optimal vs. episodes completed (0–50,000) for 1, 3, and 6 tilings, at learning rates α = 0.5, α = 0.1, and α = 0.05]
Generalization improves discovery of better actions.
Generalization Hurts Eventually
[Figure: % optimal (95–99%) vs. episodes completed (40,000–100,000) for 1, 3, and 6 tilings, under the informative (left) and uninformative (right) reward functions]
Generalization slows convergence.
Adaptive Generalization
• Best to adjust generalization over time
• Solution: a reliability index ρ(s, a) ∈ [0, 1]
  – ρ(s, a) ≈ 1 ⇒ Q(s, a) is reliable (and vice versa)
  – A large backup error on (s, a) decreases ρ(s, a) (and vice versa)
• Use of ρ(s, a) (sketch below):
  – An update to Q(s, a) is generalized to the largest nearby region R that is unreliable on average:
    (1/|R|) Σ_{(s,a) ∈ R} ρ(s, a) ≤ 1/2
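A rough Python sketch of how the reliability index could drive generalization breadth. The slides specify only the criterion (average reliability of the region at most 1/2); the particular decay/recovery constants, the clamping rule, and the way candidate regions are enumerated below are assumptions for illustration.

```python
def update_reliability(rho, key, backup_error, error_scale=10.0, recovery=0.05):
    """Lower rho[key] on large backup errors, raise it on small ones (illustrative rule).
    `rho` can be a defaultdict(float), so unseen (s, a) pairs start out unreliable."""
    penalty = min(1.0, abs(backup_error) / error_scale)
    rho[key] = max(0.0, min(1.0, rho[key] + recovery - penalty))
    return rho[key]

def generalization_region(rho, candidates_by_width):
    """Pick the largest candidate region R around (s, a) that is unreliable
    on average, i.e. (1/|R|) * sum of rho over R <= 1/2.

    `candidates_by_width` maps a candidate region width to the list of (s, a)
    keys it covers; how these candidates are built is an assumption here.
    """
    for width, keys in sorted(candidates_by_width.items(), reverse=True):
        if sum(rho[k] for k in keys) / len(keys) <= 0.5:
            return keys          # generalize the update over this region
    return []                    # neighborhood already reliable: no generalization
```

The intended behavior matches the next slide: early on, reliabilities are low, so updates spread widely; near convergence, reliabilities are high and updates stay local.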
Effects of Adaptive Generalization
• Time-variant generalization
  – Encourages generalization while Q(s, a) is changing
  – Suppresses generalization near convergence
• Space-variant generalization
  – Rarely-visited states benefit from generalization for a longer time
Adaptive Generalization at Work
[Figure: % optimal vs. episodes completed for the adaptive scheme and for 1, 3, and 6 fixed tilings; left panel: episodes 0–1000, right panel: episodes 1000–1000000]
Adaptive generalization outperforms any fixed setting.
Conclusions
• A precise empirical study of parameter choice in tile coding
• No single setting is ideal for all problems, or even throughout the learning curve on the same problem
• Contributed an algorithm for adjusting parameters as needed in different regions of S × A (space-variant generalization) and at different learning stages (time-variant generalization)
• Showed the superiority of this adaptive technique over any fixed setting
References
[Santamaria et al., 1997] Santamaria, J. C., Sutton, R. S., and Ram, A. (1997). Experiments with reinforcement learning in problems with continuous state and action spaces. Adaptive Behavior, 6(2):163–217.
[Stone and Sutton, 2001] Stone, P. and Sutton, R. S. (2001). Scaling reinforcement learning toward RoboCup soccer. In Proceedings of the 18th International Conference on Machine Learning (ICML-01), pages 537–544. Morgan Kaufmann, San Francisco, CA.
[Sutton, 1996] Sutton, R. S. (1996). Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Tesauro, G., Touretzky, D., and Leen, T., editors, Advances in Neural Information Processing Systems 8, pages 1038–1044, Cambridge, MA. MIT Press.