
Module 9: LAO* (CS 886 Sequential Decision Making and Reinforcement Learning) - PowerPoint PPT Presentation



  1. Module 9 LAO* CS 886 Sequential Decision Making and Reinforcement Learning University of Waterloo

  2. Large State Space • Value Iteration, Policy Iteration and Linear Programming – Complexity at least quadratic in |S| • Problem: |S| may be very large – Queuing problems: infinite state space – Factored problems: exponentially many states CS886 (c) 2013 Pascal Poupart

  3. Mitigate Size of State Space • Two ideas: • Exploit the initial state – Not all states are reachable • Exploit a heuristic h – approximation of the optimal value function – usually an upper bound h(s) ≥ V*(s) ∀s
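The first idea, restricting attention to states reachable from the initial state, amounts to a simple graph traversal of the transition structure. A minimal sketch (the dict-of-lists transition format `P[s][a] = [(next_state, prob), ...]` is an assumed representation, not from the slides):

```python
def reachable(s0, actions, P):
    """All states reachable from the initial state s0 under any action
    sequence. P[s][a] is a list of (next_state, probability) pairs."""
    seen, stack = {s0}, [s0]
    while stack:
        s = stack.pop()
        for a in actions:
            for t, p in P[s][a]:
                if p > 0 and t not in seen:
                    seen.add(t)
                    stack.append(t)
    return seen

# Toy MDP with 5 states; states 3 and 4 cannot be reached from state 0,
# so an algorithm that exploits the initial state never has to touch them.
P = {
    0: {0: [(1, 1.0)], 1: [(2, 1.0)]},
    1: {0: [(1, 1.0)], 1: [(0, 1.0)]},
    2: {0: [(2, 1.0)], 1: [(2, 1.0)]},
    3: {0: [(4, 1.0)], 1: [(3, 1.0)]},
    4: {0: [(3, 1.0)], 1: [(4, 1.0)]},
}
print(reachable(0, [0, 1], P))  # {0, 1, 2}
```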

  4. State Space [diagram of nested sets: the full state space S contains the states reachable from s0, which in turn contain the states reachable by the optimal policy π*]

  5. State Space [same diagram, repeated]

  6. LAO* Algorithm • Related to – A*: heuristic search over paths – AO*: heuristic search over trees – LAO*: heuristic search over cyclic graphs • LAO* alternates between – State space expansion – Policy optimization • value iteration, policy iteration, linear programming

  7. Terminology • S: state space • S_E ⊆ S: envelope – growing set of states • S_T ⊆ S_E: terminal states – states whose children are not in the envelope • S_{s0}^π ⊆ S_E: states reachable from s0 by following π • h(s): heuristic such that h(s) ≥ V*(s) ∀s – E.g., h(s) = max_{s,a} R(s,a)/(1 − γ)
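The example heuristic on this slide, h(s) = max_{s,a} R(s,a)/(1 − γ), is the value of receiving the single largest reward forever; no policy can do better, so it is an upper bound on V*. A one-pass sketch (the nested-dict reward table is an assumed representation):

```python
# Assumed representation: R[s][a] is the immediate reward table.
R = {0: {0: 1.0, 1: 0.5},
     1: {0: 0.2, 1: 0.8},
     2: {0: 0.0, 1: 0.3}}
gamma = 0.9

# h(s) = max_{s,a} R(s,a) / (1 - gamma): the discounted sum of the best
# reward repeated forever, so h(s) >= V*(s) for every state s.
r_max = max(r for row in R.values() for r in row.values())
h = {s: r_max / (1 - gamma) for s in R}
print(round(h[0], 6))  # 10.0
```

Note that this particular heuristic is state-independent; any tighter state-dependent upper bound also works and makes LAO* expand fewer states.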

  8. LAO* Algorithm

     LAO*(MDP, heuristic h)
         S_E ← {s0}, S_T ← {s0}
         Repeat
             Let R_E(s,a) = h(s) if s ∈ S_T, R(s,a) otherwise
             Let T_E(s'|s,a) = 0 if s ∈ S_T, Pr(s'|s,a) otherwise
             Find optimal policy π for ⟨S_E, R_E, T_E⟩
             Find reachable states S_{s0}^π
             Select reachable terminal states {s1, …, sk} ⊆ S_{s0}^π ∩ S_T
             S_T ← (S_T \ {s1, …, sk}) ∪ (children({s1, …, sk}) \ S_E)
             S_E ← S_E ∪ children({s1, …, sk})
         Until S_{s0}^π ∩ S_T is empty
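The loop above can be sketched in executable form. This is a hedged illustration, not the course's reference implementation: it fixes value iteration as the policy-optimization step, freezes terminal states at their heuristic values (equivalent to the R_E/T_E rewriting on the slide), and expands all reachable terminal states at once. The MDP encoding (`P[s][a]` as (next_state, prob) pairs, dict rewards) is an assumption.

```python
def lao_star(actions, P, R, h, gamma, s0, tol=1e-6):
    """Sketch of LAO* for a discounted MDP. h is an upper-bound
    heuristic dict with h[s] >= V*(s)."""
    S_E, S_T = {s0}, {s0}              # envelope and terminal states
    while True:
        # Policy optimization on the envelope: value iteration in which
        # terminal states keep their heuristic value h(s).
        V = {s: h[s] for s in S_E}
        while True:
            delta = 0.0
            for s in S_E - S_T:
                v = max(R[s][a] + gamma * sum(p * V[t] for t, p in P[s][a])
                        for a in actions)
                delta, V[s] = max(delta, abs(v - V[s])), v
            if delta < tol:
                break
        pi = {s: max(actions,
                     key=lambda a: R[s][a] +
                                   gamma * sum(p * V[t] for t, p in P[s][a]))
              for s in S_E - S_T}
        # States reachable from s0 under pi (the walk stops at terminals).
        reach, stack = {s0}, [s0]
        while stack:
            s = stack.pop()
            if s in S_T:
                continue
            for t, p in P[s][pi[s]]:
                if p > 0 and t not in reach:
                    reach.add(t)
                    stack.append(t)
        frontier = reach & S_T
        if not frontier:               # no reachable terminal states: done
            return pi, V
        # Expand: frontier states leave S_T, their children join the envelope.
        kids = {t for s in frontier for a in actions
                for t, p in P[s][a] if p > 0}
        S_T = (S_T - frontier) | (kids - S_E)
        S_E = S_E | kids

# Toy MDP: from state 0, action 0 leads to state 1 (reward 1 forever,
# V* = 10 with gamma = 0.9) and action 1 to state 2 (reward 0.5 forever).
P = {0: {0: [(1, 1.0)], 1: [(2, 1.0)]},
     1: {0: [(1, 1.0)], 1: [(1, 1.0)]},
     2: {0: [(2, 1.0)], 1: [(2, 1.0)]}}
R = {0: {0: 0.0, 1: 0.0}, 1: {0: 1.0, 1: 1.0}, 2: {0: 0.5, 1: 0.5}}
h = {s: 1.0 / (1 - 0.9) for s in P}    # max-reward-forever upper bound
pi, V = lao_star([0, 1], P, R, h, 0.9, 0)
```

On this toy problem LAO* terminates with pi steering state 0 toward state 1, while state 2 never needs to be evaluated beyond its heuristic value.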

  9. Efficiency • Efficiency is influenced by: 1. Choice of terminal states to add to the envelope 2. Algorithm used to find the optimal policy – Can use value iteration, policy iteration, modified policy iteration, linear programming – Key: reuse previous computation • E.g., start with the previous policy or value function at each iteration
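The reuse point can be made concrete with value iteration: seeding the solver with the value function from the previous LAO* iteration, instead of restarting from scratch, shrinks the initial Bellman error and therefore the number of sweeps. A small sketch under the same assumed MDP encoding as before (the sweep counter is added only to make the effect visible):

```python
def value_iteration(states, actions, P, R, gamma, V_init, tol=1e-6):
    """Value iteration warm-started from V_init; returns (V, sweeps)."""
    V = dict(V_init)
    sweeps = 0
    while True:
        sweeps += 1
        delta = 0.0
        for s in states:
            v = max(R[s][a] + gamma * sum(p * V[t] for t, p in P[s][a])
                    for a in actions)
            delta, V[s] = max(delta, abs(v - V[s])), v
        if delta < tol:
            return V, sweeps

# One-state MDP: self-loop with reward 1, gamma = 0.9, so V* = 10.
P = {0: {0: [(0, 1.0)]}}
R = {0: {0: 1.0}}
V_cold, n_cold = value_iteration([0], [0], P, R, 0.9, {0: 0.0})
V_warm, n_warm = value_iteration([0], [0], P, R, 0.9, {0: 9.99})
# Warm-starting near the previous solution needs far fewer sweeps.
```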

  10. Convergence • Theorem: LAO* converges to the optimal policy • Proof: – Fact: at each iteration, the value function V is an upper bound on V* due to the heuristic function h – Proof by contradiction: suppose the algorithm stops, but π is not optimal. • Since the algorithm stopped, all states reachable by π are in S_E \ S_T • Hence the value function V is the true value of π, and since π is suboptimal, V < V*, which contradicts the fact that V is an upper bound on V*
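The fact the proof relies on, that value iteration started from an upper bound h ≥ V* never drops below V*, follows from the monotonicity of the Bellman backup. A tiny numeric check on an assumed one-state example (self-loop, reward 1, γ = 0.9, so V* = 10):

```python
# If V >= V* then the Bellman backup preserves the inequality:
# T(V) >= T(V*) = V*.  Iterating from a heuristic upper bound therefore
# keeps every iterate above V*.
gamma = 0.9
V_star = 1.0 / (1 - gamma)      # one state, self-loop, reward 1: V* = 10
V = 20.0                        # heuristic value h with h >= V*
for _ in range(100):
    V = 1.0 + gamma * V         # Bellman backup for this MDP
    assert V >= V_star - 1e-9   # the iterate never falls below V*
```

Here V_k = 10 + 0.9^k · 10, which decreases monotonically toward V* from above.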

  11. Summary • LAO* – Extension of basic solution algorithms (value iteration, policy iteration, linear programming) – Exploits the initial state and a heuristic function – Gradually grows an envelope of states – Complexity depends on the number of reachable states instead of the size of the state space
