Module 9 LAO* - CS 886 Sequential Decision Making and Reinforcement Learning - PowerPoint PPT Presentation
SLIDE 1

Module 9 LAO*

CS 886 Sequential Decision Making and Reinforcement Learning, University of Waterloo

SLIDE 2

CS886 (c) 2013 Pascal Poupart

Large State Space

• Value Iteration, Policy Iteration and Linear Programming
  – Complexity at least quadratic in |S|
• Problem: |S| may be very large
  – Queuing problems: infinite state space
  – Factored problems: exponentially many states
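The quadratic cost is easy to see in a direct implementation: each Bellman backup sums over every successor state for every state-action pair. A minimal sketch, assuming an explicit tabular MDP (the toy MDP and all numbers below are made up for illustration):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, eps=1e-6):
    """Value iteration on an explicit MDP.

    P: transitions, shape (|A|, |S|, |S|), P[a, s, s2] = Pr(s2 | s, a)
    R: rewards, shape (|A|, |S|)
    A single sweep touches every (s, a, s2) triple, so each iteration
    already costs O(|A| * |S|^2): at least quadratic in |S|.
    """
    V = np.zeros(P.shape[1])
    while True:
        V_new = (R + gamma * (P @ V)).max(axis=0)  # one Bellman backup
        if np.max(np.abs(V_new - V)) < eps:
            return V_new
        V = V_new

# Toy 2-state, 2-action MDP (illustrative numbers only):
# action 0 stays put, action 1 switches state; state 1 pays 1 under action 0.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
R = np.array([[0.0, 1.0],
              [0.0, 0.0]])
V = value_iteration(P, R)
print(V)  # close to [9, 10]: V*(1) = 1/(1-0.9), V*(0) = 0.9 * V*(1)
```

With |S| in the millions (or infinite, as in the queuing case) even one such sweep is infeasible, which motivates the ideas on the next slide.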

SLIDE 3

Mitigate Size of State Space

• Two ideas:
• Exploit the initial state
  – Not all states are reachable
• Exploit a heuristic h
  – Approximation of the optimal value function
  – Usually an upper bound: h(s) ≥ V*(s) ∀s

SLIDE 4

State Space

[Figure: the full state space S, containing the initial state s0, the subset of reachable states, and the smaller subset of states reachable by the optimal policy π*]

SLIDE 5

State Space

[Figure: the full state space S, containing the initial state s0, the subset of reachable states, and the smaller subset of states reachable by the optimal policy π*]

SLIDE 6

LAO* Algorithm

• Related to
  – A*: path heuristic search
  – AO*: tree heuristic search
  – LAO*: cyclic graph heuristic search
• LAO* alternates between
  – State space expansion
  – Policy optimization
    • value iteration, policy iteration, linear programming
SLIDE 7

Terminology

• S: state space
• S_E ⊆ S: envelope
  – Growing set of states
• S_T ⊆ S_E: terminal states
  – States whose children are not in the envelope
• S^π_{s0} ⊆ S_E: states reachable from s0 by following π
• h(s): heuristic such that h(s) ≥ V*(s) ∀s
  – E.g., h(s) = max_{s,a} R(s, a) / (1 − γ)

SLIDE 8

LAO* Algorithm

LAO*(MDP, heuristic h)
  S_E ← {s0}, S_T ← {s0}
  Repeat
    Let R_E(s, a) = h(s) if s ∈ S_T, R(s, a) otherwise
    Let T_E(s′ | s, a) = 0 if s ∈ S_T, Pr(s′ | s, a) otherwise
    Find optimal policy π for S_E, R_E, T_E
    Find reachable states S^π_{s0}
    Select reachable terminal states {s1, …, sk} ⊆ S^π_{s0} ∩ S_T
    S_T ← (S_T ∖ {s1, …, sk}) ∪ (children({s1, …, sk}) ∖ S_E)
    S_E ← S_E ∪ children({s1, …, sk})
  Until S^π_{s0} ∩ S_T is empty

SLIDE 9

Efficiency

Efficiency is influenced by

1. Choice of terminal states to add to the envelope
2. Algorithm used to find the optimal policy
  – Can use value iteration, policy iteration, modified policy iteration, linear programming
  – Key: reuse previous computation
    • E.g., start with the previous policy or value function at each iteration
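One way to reuse previous computation, sketched under the assumption of a tabular value-iteration solver: warm-start from the value function of the previous LAO* iteration instead of from zeros (the toy MDP and numbers are illustrative only):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, eps=1e-6, V0=None):
    """Value iteration that can be warm-started from an earlier solution.

    P: (|A|, |S|, |S|) transitions, R: (|A|, |S|) rewards.
    Passing the previous envelope's value function as V0 typically needs
    far fewer sweeps than restarting from zeros at each LAO* iteration.
    """
    V = np.zeros(P.shape[1]) if V0 is None else V0.copy()
    sweeps = 0
    while True:
        V_new = (R + gamma * (P @ V)).max(axis=0)
        sweeps += 1
        if np.max(np.abs(V_new - V)) < eps:
            return V_new, sweeps
        V = V_new

# Toy 2-state MDP (made-up numbers)
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
R = np.array([[0.0, 1.0],
              [0.0, 0.0]])

V_cold, n_cold = value_iteration(P, R)             # from scratch
V_warm, n_warm = value_iteration(P, R, V0=V_cold)  # reuse previous answer
print(n_cold, n_warm)  # the warm start converges in a single sweep
```

In real LAO* the envelope grows between iterations, so the old value function only covers part of the new state set; the heuristic h supplies initial values for the newly added states.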

SLIDE 10

Convergence

• Theorem: LAO* converges to the optimal policy
• Proof:
  – Fact: at each iteration, the value function V is an upper bound on V*, due to the heuristic function h
  – Proof by contradiction: suppose the algorithm stops, but π is not optimal
    • Since the algorithm stopped, all states reachable by π are in S_E ∖ S_T
    • Hence the value function V is the value of π, and since π is suboptimal, V < V*, which contradicts the fact that V is an upper bound on V*

SLIDE 11

Summary

• LAO*
  – Extension of basic solution algorithms (value iteration, policy iteration, linear programming)
  – Exploits the initial state and a heuristic function
  – Gradually grows an envelope of states
  – Complexity depends on the number of reachable states instead of the size of the state space