
Function Approximation via Tile Coding: Automating Parameter Choice (PowerPoint presentation)

  1. Function Approximation via Tile Coding: Automating Parameter Choice Alexander Sherstov and Peter Stone Department of Computer Sciences The University of Texas at Austin

  2. About the Authors Alex Sherstov Peter Stone Thanks to Nick Jong for presenting!

  3. Overview
• TD reinforcement learning
– Leading abstraction for decision making
– Uses function approximation to store the value function
[Diagram: the agent sends an action to the environment (transition function t, reward function r) and receives a reward and new state; the agent stores the value function Q(s, a)]
• Existing methods
– Discretization, neural nets, radial basis functions, case-based approaches, ... [Santamaria et al., 1997]
– Trade-offs: representational power, time/space requirements, ease of use

  4. Overview, cont.
• "Happy medium": tile coding
– Widely used in RL [Stone and Sutton, 2001; Santamaria et al., 1997; Sutton, 1996]
– Use in robot soccer: full soccer state (a few continuous state variables, 13) → sparse, coarse tile coding → huge binary feature vector F_a (about 400 1's and 40,000 0's) → linear map → action values
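The linear map in that pipeline is just a dot product with a sparse binary feature vector: with roughly 400 ones among 40,000 features, it reduces to summing the weights at the active indices. A minimal sketch with illustrative names (the tile-coding feature extraction itself is assumed):

```python
def action_value(active_indices, weights):
    """Linear action-value estimate over a sparse binary feature vector F_a:
    only the active (value-1) features contribute, so the dot product is a
    sum of the corresponding weights."""
    return sum(weights[i] for i in active_indices)

# Example: a 40,000-dimensional feature vector with a few active tile indices
weights = [0.0] * 40_000
active_indices = [17, 204, 3_851]   # positions of the 1's produced by tile coding
q = action_value(active_indices, weights)
```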

  5. Our Results
• We show that:
– Tile coding is parameter-sensitive
– Optimal parameterization depends on the problem and elapsed training time
• We contribute:
– An automated parameter-adjustment scheme
– Empirical validation

  6. Background: Reinforcement Learning
• RL problem given by ⟨S, A, t, r⟩:
– S, set of states;
– A, set of actions;
– t : S × A → Pr(S), transition function;
– r : S × A → ℝ, reward function.
• Solution:
– policy π* : S → A that maximizes the return Σ_{i=0}^∞ γ^i r_i
– Q-learning: find π* by approximating the optimal value function Q* : S × A → ℝ
• Need function approximation to generalize Q* to unseen situations
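A minimal tabular Q-learning sketch of the backup described above; it is illustrative rather than the paper's implementation, and the env object (with reset(), step(), and actions) is an assumed interface:

```python
import random
from collections import defaultdict

def q_learning(env, episodes=1000, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Approximate Q*(s, a) from sampled transitions with TD backups."""
    Q = defaultdict(float)                        # Q[(s, a)] -> estimated return
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection from the current estimate
            if random.random() < epsilon:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda act: Q[(s, act)])
            s_next, r, done = env.step(a)         # reward and new state from the environment
            # TD backup toward r + gamma * max_a' Q(s', a')
            target = r if done else r + gamma * max(Q[(s_next, act)] for act in env.actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q
```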

  7. Background: Tile Coding
[Figure: two overlapping tilings (Tiling #1 and Tiling #2) laid over a 2-D space with axes State Variable #1 and State Variable #2]
• Maintaining an arbitrary f : D → ℝ (often D = S × A):
– D partitioned into tiles, each with a weight
– Each partition is a tiling; several are used
– Given x ∈ D, sum the weights of the participating tiles ⟹ get f(x)
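A minimal sketch of a univariate tile coder with t uniformly offset tilings; the class and parameter names are illustrative, not the authors' code:

```python
class TileCoder1D:
    """Univariate tile coding: t uniformly offset tilings of tile width w.
    f(x) is the sum of the weights of the one active tile per tiling, so an
    update at x generalizes to every x' that shares a tile with x."""

    def __init__(self, num_tilings, tile_width, low, high):
        self.t, self.w, self.low = num_tilings, tile_width, low
        tiles_per_tiling = int((high - low) / tile_width) + 2   # slack for the offsets
        self.weights = [[0.0] * tiles_per_tiling for _ in range(num_tilings)]

    def _active_tiles(self, x):
        """Index of the active tile in each tiling (tilings offset by w/t)."""
        for i in range(self.t):
            offset = i * self.w / self.t
            yield i, int((x - self.low + offset) / self.w)

    def value(self, x):
        return sum(self.weights[i][j] for i, j in self._active_tiles(x))

    def update(self, x, target, alpha=0.1):
        """Move f(x) toward target; the correction is split across the t tilings."""
        error = target - self.value(x)
        for i, j in self._active_tiles(x):
            self.weights[i][j] += alpha * error / self.t
```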

  8. Background: Tile Coding Parameters
• We study canonical univariate tile coding:
– w, tile width (same for all tiles)
– t, number of tilings ("generalization breadth")
– r = w/t, resolution
– tilings uniformly offset
• Empirical model:
– Fix the resolution r, vary the generalization breadth t
– Same resolution ⟹ same representational power and asymptotic performance
– But: t affects intermediate performance
– How to set t?
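Holding the resolution fixed while varying the generalization breadth means scaling the tile width with the number of tilings, w = r·t. A small illustrative calculation (the particular resolution value is an assumption):

```python
resolution = 0.05                              # r = w / t, held fixed across settings
for num_tilings in (1, 3, 6):                  # t, the generalization breadth
    tile_width = resolution * num_tilings      # w = r * t keeps the resolution constant
    offsets = [i * resolution for i in range(num_tilings)]   # tilings uniformly offset by r
    print(f"t = {num_tilings}: tile width w = {tile_width:.2f}, offsets = {offsets}")
```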

  9. Testbed Domain: Grid World
• Domain and optimal policy:
[Figure: grid world with start and goal cells, a wall, and an abyss (cliff); cells are labeled with optimal-policy values between 0.5 and 0.8]
• Episodic task (cliff and goal cells are terminal)
• Actions: (d, p) ∈ {↑, ↓, →, ←} × [0, 1]

  10. Testbed Domain, cont.
• A move succeeds with probability F(p) and is random otherwise; F varies from cell to cell:
[Plot: success probability F(p) for p ∈ [0, 1], ranging from roughly 0.5 to 1]
• Two reward functions: −100 cliff, +100 goal, −1 otherwise ("informative"); +100 goal, 0 otherwise ("uninformative")
• Use of tile coding: generalize over actions (p)
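A minimal sketch of one step of the testbed dynamics under the informative reward; the grid helpers and the per-cell success-probability functions F[cell] are assumed interfaces, since the slides only state that F varies from cell to cell:

```python
import random

DIRECTIONS = ("up", "down", "left", "right")

def step(grid, F, cell, direction, p):
    """Take action (direction, p): the move succeeds with probability F[cell](p),
    otherwise a random direction is taken.  Informative reward:
    -100 for the cliff, +100 for the goal, -1 otherwise."""
    actual = direction if random.random() < F[cell](p) else random.choice(DIRECTIONS)
    next_cell = grid.move(cell, actual)            # assumed helper; walls block movement
    if grid.is_cliff(next_cell):
        return next_cell, -100.0, True             # terminal: fell into the abyss
    if grid.is_goal(next_cell):
        return next_cell, +100.0, True             # terminal: reached the goal
    return next_cell, -1.0, False
```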

  11. Generalization Helps Initially
[Two plots: % optimal vs. episodes completed (0–1,000 with the informative reward; 0–4,000 with the uninformative reward), comparing 1, 3, and 6 tilings]
Generalization improves cliff avoidance.

  12. Generalization Helps Initially, cont.
[Three plots: % optimal vs. episodes completed (0–50,000) for learning rates α = 0.5, α = 0.1, and α = 0.05, each comparing 1, 3, and 6 tilings]
Generalization improves discovery of better actions.

  13. Generalization Hurts Eventually
[Two plots: % optimal (95–99) vs. episodes completed (40,000–100,000) with the informative and uninformative rewards, comparing 1, 3, and 6 tilings]
Generalization slows convergence.

  14. Adaptive Generalization
• Best to adjust generalization over time
• Solution: reliability index ρ(s, a) ∈ [0, 1]
– ρ(s, a) ≈ 1 ⟹ Q(s, a) reliable (and vice versa)
– a large backup error on (s, a) decreases ρ(s, a) (and vice versa)
• Use of ρ(s, a):
– An update to Q(s, a) is generalized to the largest nearby region R that is unreliable on average: (1/|R|) Σ_{(s,a)∈R} ρ(s, a) ≤ 1/2
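A minimal sketch of how a reliability index can gate generalization. Only the rule itself, generalizing an update over the largest nearby region whose mean reliability is at most 1/2, comes from the slide; the reliability update and the candidate regions below are illustrative assumptions:

```python
def update_reliability(rho, key, backup_error, rate=0.1, error_scale=1.0):
    """Large backup errors on (s, a) push rho(s, a) toward 0 (unreliable);
    small errors push it toward 1.  The exact functional form is an assumption."""
    target = 1.0 / (1.0 + abs(backup_error) / error_scale)
    rho[key] = (1.0 - rate) * rho.get(key, 0.0) + rate * target

def generalization_region(rho, candidate_regions):
    """Return the largest region R with (1/|R|) * sum of rho over R <= 1/2,
    i.e. the largest nearby region that is still unreliable on average."""
    for region in sorted(candidate_regions, key=len, reverse=True):
        mean_rho = sum(rho.get(k, 0.0) for k in region) / len(region)
        if mean_rho <= 0.5:
            return region        # generalize the Q(s, a) update across this region
    return None                  # everything nearby is reliable: no generalization
```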

  15. Effects of Adaptive Generalization
• Time-variant generalization
– Encourages generalization while Q(s, a) is changing
– Suppresses generalization near convergence
• Space-variant generalization
– Rarely visited states benefit from generalization for a longer time

  16. Adaptive Generalization at Work
[Two plots: % optimal vs. episodes completed, comparing the adaptive scheme against 1, 3, and 6 fixed tilings; left: episodes 0–1,000, right: episodes 1,000–100,000]
Adaptive generalization is better than any fixed setting.

  17. Conclusions
• Precise empirical study of parameter choice in tile coding
• No single setting is ideal for all problems, or even throughout the learning curve on the same problem
• Contributed an algorithm for adjusting parameters as needed in different regions of S × A (space-variant generalization) and at different learning stages (time-variant generalization)
• Showed the superiority of this adaptive technique over any fixed setting

  18. References
[Santamaria et al., 1997] Santamaria, J. C., Sutton, R. S., and Ram, A. (1997). Experiments with reinforcement learning in problems with continuous state and action spaces. Adaptive Behavior, 6(2):163–217.
[Stone and Sutton, 2001] Stone, P. and Sutton, R. S. (2001). Scaling reinforcement learning toward RoboCup soccer. In Proc. 18th International Conference on Machine Learning (ICML-01), pages 537–544. Morgan Kaufmann, San Francisco, CA.
[Sutton, 1996] Sutton, R. S. (1996). Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Tesauro, G., Touretzky, D., and Leen, T., editors, Advances in Neural Information Processing Systems 8, pages 1038–1044, Cambridge, MA. MIT Press.
