Module 9: LAO*
CS 886: Sequential Decision Making and Reinforcement Learning
University of Waterloo
CS886 (c) 2013 Pascal Poupart
Large State Space
- Value Iteration, Policy Iteration and Linear Programming
  – Complexity at least quadratic in |S|
- Problem: |S| may be very large
  – Queuing problems: infinite state space
  – Factored problems: exponentially many states
Mitigate Size of State Space
- Two ideas:
- Exploit the initial state
  – Not all states are reachable
- Exploit a heuristic h
  – Approximation of the optimal value function
  – Usually an upper bound: h(s) ≥ V*(s) ∀s
State Space
[Figure: nested sets: the full state space S contains the states reachable from s0, which in turn contain the states reachable by π*]
LAO* Algorithm
- Related to
  – A*: path heuristic search
  – AO*: tree heuristic search
  – LAO*: cyclic graph heuristic search
- LAO* alternates between
  – State space expansion
  – Policy optimization (value iteration, policy iteration, linear programming)
Terminology
- S: state space
- S_E ⊆ S: envelope
  – Growing set of states
- S_T ⊆ S_E: terminal states
  – States whose children are not in the envelope
- S_{s0}^π ⊆ S_E: states reachable from s0 by following π
- h(s): heuristic such that h(s) ≥ V*(s) ∀s
  – E.g., h(s) = max_{s,a} R(s,a) / (1 − γ)
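The example heuristic above can be made concrete: the largest one-step reward, earned forever under discounting, bounds any policy's value from above. A minimal sketch, assuming rewards are stored as a `R[state][action]` dictionary (an illustrative layout, not from the slides):

```python
def upper_bound_heuristic(R, gamma):
    """Constant upper bound on V*: h = max_{s,a} R(s,a) / (1 - gamma)."""
    # No policy can collect more than the single largest reward at every step.
    r_max = max(r for acts in R.values() for r in acts.values())
    return r_max / (1.0 - gamma)

# Hypothetical two-state reward table for illustration.
R = {"s0": {"a": 1.0, "b": 0.5}, "s1": {"a": 0.2}}
h = upper_bound_heuristic(R, 0.9)  # about 10: r_max = 1 earned forever
```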
LAO* Algorithm
LAO*(MDP, heuristic โ)
๐๐น โ {๐ก0}, ๐๐ โ {๐ก0} Repeat Let ๐๐น ๐ก, ๐ = โ(๐ก) ๐ก โ ๐๐ ๐(๐ก, ๐)
- therwise
Let ๐๐น(๐กโฒ|๐ก, ๐) = ๐ก โ ๐๐ Pr (๐กโฒ|๐ก, ๐)
- therwise
Find optimal policy ๐ for ๐๐น, ๐๐น, ๐๐น Find reachable states ๐๐ก0
๐
Select reachable terminal states s1, โฆ , sk โ ๐๐ก0
๐ โฉ ๐๐
๐๐ โ (๐๐ โ ๐ก1, โฆ , ๐ก๐ ) โช (๐โ๐๐๐๐ ๐๐ ๐ก1, โฆ , ๐ก๐ โ ๐๐น) ๐๐น โ ๐๐น โช ๐โ๐๐๐๐ ๐๐( ๐ก1, โฆ , ๐ก๐ ) Until ๐๐ก0
๐ โฉ ๐๐ is empty
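The loop above can be sketched as runnable code, using value iteration as the policy-optimization step. The MDP interface (`actions(s)`, `R(s, a)`, `P(s, a)` returning a successor-probability dict) and the toy chain MDP are illustrative assumptions, not part of the slides:

```python
def lao_star(mdp, h, s0, gamma=0.9, tol=1e-6):
    env = {s0}        # envelope S_E
    terminal = {s0}   # terminal states S_T: children not yet in the envelope
    while True:
        # Envelope MDP: terminal states yield reward h(s) and have no successors.
        def R_E(s, a):
            return h(s) if s in terminal else mdp.R(s, a)

        def P_E(s, a):
            return {} if s in terminal else mdp.P(s, a)

        # Policy optimization: value iteration restricted to the envelope.
        V = {s: 0.0 for s in env}
        while True:
            delta = 0.0
            for s in env:
                best = max(
                    R_E(s, a) + gamma * sum(p * V[t] for t, p in P_E(s, a).items())
                    for a in mdp.actions(s))
                delta = max(delta, abs(best - V[s]))
                V[s] = best
            if delta < tol:
                break
        pi = {s: max(mdp.actions(s),
                     key=lambda a: R_E(s, a)
                     + gamma * sum(p * V[t] for t, p in P_E(s, a).items()))
              for s in env}

        # States reachable from s0 by following pi within the envelope MDP.
        reachable, frontier = {s0}, [s0]
        while frontier:
            s = frontier.pop()
            for t in P_E(s, pi[s]):
                if t not in reachable:
                    reachable.add(t)
                    frontier.append(t)

        expand = reachable & terminal
        if not expand:          # no reachable terminal states: converged
            return pi, V
        # Expansion: selected terminals become interior; their children join S_E.
        children = {t for s in expand for a in mdp.actions(s)
                    for t in mdp.P(s, a)}
        terminal = (terminal - expand) | (children - env)
        env |= children


class ChainMDP:
    """Toy 3-state chain: s0 -> s1 -> s2 (absorbing, reward 1)."""
    def actions(self, s):
        return ["a"]
    def R(self, s, a):
        return 1.0 if s == "s2" else 0.0
    def P(self, s, a):
        return {"s0": {"s1": 1.0}, "s1": {"s2": 1.0}, "s2": {"s2": 1.0}}[s]


pi, V = lao_star(ChainMDP(), lambda s: 10.0, "s0")  # h = Rmax/(1-gamma) = 10
```

This sketch expands every reachable terminal state at once; the slides leave open how many of the states s1, …, sk to select per iteration.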
Efficiency
Efficiency is influenced by
1. Choice of terminal states to add to the envelope
2. Algorithm used to find the optimal policy
  – Can use value iteration, policy iteration, modified policy iteration, or linear programming
  – Key: reuse previous computation
    - E.g., start with the previous policy or value function at each iteration
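The reuse idea can be sketched by warm-starting value iteration with the value function from the previous LAO* iteration. The `value_iteration` helper and the `R[s][a]` / `P[s][a]` data layout are illustrative assumptions:

```python
def value_iteration(states, R, P, gamma, V0=None, tol=1e-6):
    """Synchronous value iteration, optionally seeded with a previous V.

    Returns (V, number of sweeps until convergence)."""
    V = dict(V0) if V0 is not None else {s: 0.0 for s in states}
    sweeps = 0
    while True:
        delta, new_V = 0.0, {}
        for s in states:
            new_V[s] = max(
                R[s][a] + gamma * sum(p * V[t] for t, p in P[s][a].items())
                for a in R[s])
            delta = max(delta, abs(new_V[s] - V[s]))
        V = new_V
        sweeps += 1
        if delta < tol:
            return V, sweeps

# Hypothetical two-state chain: x -> y, y absorbing with reward 1.
states = ["x", "y"]
R = {"x": {"a": 0.0}, "y": {"a": 1.0}}
P = {"x": {"a": {"y": 1.0}}, "y": {"a": {"y": 1.0}}}
V1, cold = value_iteration(states, R, P, 0.9)           # from scratch
V2, warm = value_iteration(states, R, P, 0.9, V0=V1)    # warm-started
```

Warm-starting from an already-converged V needs only a single sweep, while the cold start takes on the order of a hundred sweeps at γ = 0.9; after a small envelope expansion the previous V is nearly converged, so most of that work is saved.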
Convergence
- Theorem: LAO* converges to the optimal policy
- Proof:
  – Fact: at each iteration, the value function V is an upper bound on V* due to the heuristic function h
  – Proof by contradiction: suppose the algorithm stops, but π is not optimal
    - Since the algorithm stopped, all states reachable by π are in S_E \ S_T
    - Hence the value function V is the value of π, and since π is suboptimal, V < V*, which contradicts the fact that V is an upper bound on V*
Summary
- LAO*
  – Extension of basic solution algorithms (value iteration, policy iteration, linear programming)
  – Exploits the initial state and a heuristic function
  – Gradually grows an envelope of states
  – Complexity depends on the number of reachable states instead of the size of the state space