

  1. Module 6: Value Iteration
     CS 886: Sequential Decision Making and Reinforcement Learning
     University of Waterloo

  2. Markov Decision Process
     • Definition
       – Set of states: $S$
       – Set of actions (i.e., decisions): $A$
       – Transition model: $\Pr(s_t \mid s_{t-1}, a_{t-1})$
       – Reward model (i.e., utility): $R(s_t, a_t)$
       – Discount factor: $0 \le \gamma \le 1$
       – Horizon (i.e., # of time steps): $h$
     • Goal: find an optimal policy $\pi$

  3. Finite Horizon
     • Policy evaluation:
       $V_h^\pi(s) = \sum_{t=0}^{h} \gamma^t \sum_{s'} \Pr(S_t = s' \mid S_0 = s, \pi)\, R(s', \pi_t(s'))$
     • Recursive form (dynamic programming):
       $V_0^\pi(s) = R(s, \pi_0(s))$
       $V_t^\pi(s) = R(s, \pi_t(s)) + \gamma \sum_{s'} \Pr(s' \mid s, \pi_t(s))\, V_{t-1}^\pi(s')$
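The recursion above translates directly into a short dynamic-programming loop. Below is a minimal NumPy sketch (an editorial illustration, not code from the slides); the names P, R and policy, and the layout P[a, s, s'] = Pr(s'|s,a), R[s, a], policy[t, s], are assumptions made for the example.

```python
import numpy as np

def evaluate_policy_finite(P, R, policy, gamma, h):
    """Finite-horizon policy evaluation by the recursion for V_t^pi.

    Assumed (hypothetical) layout:
      P[a, s, s'] = Pr(s' | s, a),  R[s, a] = reward,
      policy[t, s] = action taken by the non-stationary policy at step t.
    Returns V with V[t, s] = V_t^pi(s).
    """
    n_states = R.shape[0]
    idx = np.arange(n_states)
    V = np.zeros((h + 1, n_states))
    V[0] = R[idx, policy[0]]                        # V_0^pi(s) = R(s, pi_0(s))
    for t in range(1, h + 1):
        a = policy[t]
        trans = P[a, idx]                           # row s: Pr(. | s, pi_t(s))
        V[t] = R[idx, a] + gamma * trans @ V[t - 1] # backup through pi_t
    return V
```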

  4. Finite Horizon
     • Optimal policy $\pi^*$:
       $V_h^{\pi^*}(s) \ge V_h^\pi(s) \;\; \forall \pi, s$
     • Optimal value function $V^*$ (shorthand for $V^{\pi^*}$):
       $V_0^*(s) = \max_a R(s, a)$
       $V_t^*(s) = \max_a R(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, V_{t-1}^*(s')$   (Bellman's equation)

  5. Value Iteration Algorithm
     valueIteration(MDP)
       $V_0^*(s) \leftarrow \max_a R(s, a) \;\; \forall s$
       For $t = 1$ to $h$ do
         $V_t^*(s) \leftarrow \max_a R(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, V_{t-1}^*(s') \;\; \forall s$
       Return $V^*$
     Optimal policy $\pi^*$:
       $t = 0$:  $\pi_0^*(s) \leftarrow \operatorname{argmax}_a R(s, a) \;\; \forall s$
       $t > 0$:  $\pi_t^*(s) \leftarrow \operatorname{argmax}_a R(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, V_{t-1}^*(s') \;\; \forall s$
     NB: $\pi^*$ is non-stationary (i.e., time dependent).
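A possible NumPy rendering of this algorithm, including extraction of the non-stationary policy. This is a sketch under the same assumed P[a, s, s'] / R[s, a] layout as the earlier snippet, not the course's own code.

```python
import numpy as np

def value_iteration_finite(P, R, gamma, h):
    """Finite-horizon value iteration with policy extraction.

    Assumed layout: P[a, s, s'] = Pr(s' | s, a), R[s, a] = reward.
    Returns V[t, s] = V_t^*(s) and pi[t, s] = pi_t^*(s).
    """
    n_actions, n_states, _ = P.shape
    V = np.zeros((h + 1, n_states))
    pi = np.zeros((h + 1, n_states), dtype=int)
    V[0], pi[0] = R.max(axis=1), R.argmax(axis=1)   # t = 0 case
    for t in range(1, h + 1):
        # Q[s, a] = R(s, a) + gamma * sum_s' Pr(s' | s, a) V_{t-1}^*(s')
        Q = R + gamma * (P @ V[t - 1]).T
        V[t], pi[t] = Q.max(axis=1), Q.argmax(axis=1)
    return V, pi    # pi is time dependent (non-stationary)
```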

  6. Value Iteration
     • Matrix form:
       $R^a$: $|S| \times 1$ column vector of rewards for action $a$
       $V_t^*$: $|S| \times 1$ column vector of state values
       $T^a$: $|S| \times |S|$ matrix of transition probabilities for action $a$
     valueIteration(MDP)
       $V_0^* \leftarrow \max_a R^a$
       For $t = 1$ to $h$ do
         $V_t^* \leftarrow \max_a\, (R^a + \gamma T^a V_{t-1}^*)$
       Return $V^*$
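In code, one matrix-form backup is a single line per action followed by an elementwise max. A sketch, assuming (as before) that T^a corresponds to P[a] and R^a to R[:, a]:

```python
import numpy as np

def bellman_backup(P, R, V, gamma):
    """One backup V <- max_a (R^a + gamma * T^a V), with T^a = P[a], R^a = R[:, a]."""
    backups = np.stack([R[:, a] + gamma * P[a] @ V for a in range(P.shape[0])])
    return backups.max(axis=0)   # elementwise max over actions
```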

  7. Infinite Horizon
     • Let $h \to \infty$
     • Then $V_h^\pi \to V_\infty^\pi$ and $V_{h-1}^\pi \to V_\infty^\pi$
     • Policy evaluation:
       $V_\infty^\pi(s) = R(s, \pi_\infty(s)) + \gamma \sum_{s'} \Pr(s' \mid s, \pi_\infty(s))\, V_\infty^\pi(s') \;\; \forall s$
     • Bellman's equation:
       $V_\infty^*(s) = \max_a R(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, V_\infty^*(s') \;\; \forall s$

  8. Policy evaluation
     • Linear system of equations:
       $V_\infty^\pi(s) = R(s, \pi_\infty(s)) + \gamma \sum_{s'} \Pr(s' \mid s, \pi_\infty(s))\, V_\infty^\pi(s') \;\; \forall s$
     • Matrix form:
       $R$: $|S| \times 1$ column vector of state rewards for $\pi$
       $V$: $|S| \times 1$ column vector of state values for $\pi$
       $T$: $|S| \times |S|$ matrix of transition probabilities for $\pi$
       $V = R + \gamma T V$

  9. Solving linear equations
     • Linear system: $V = R + \gamma T V$
     • Gaussian elimination: $(I - \gamma T) V = R$
     • Compute inverse: $V = (I - \gamma T)^{-1} R$
     • Iterative methods
     • Value iteration (a.k.a. Richardson iteration):
       Repeat $V \leftarrow R + \gamma T V$
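Both solution strategies are one-liners with NumPy. A sketch, assuming T is the $|S| \times |S|$ transition matrix under the policy and R its $|S|$-vector of rewards (illustrative names, not from the slides):

```python
import numpy as np

def evaluate_policy_direct(T, R, gamma):
    """Exact evaluation: solve the linear system (I - gamma T) V = R."""
    return np.linalg.solve(np.eye(T.shape[0]) - gamma * T, R)

def evaluate_policy_richardson(T, R, gamma, n_iter=1000):
    """Iterative evaluation (Richardson iteration): repeat V <- R + gamma T V."""
    V = np.zeros_like(R)
    for _ in range(n_iter):
        V = R + gamma * T @ V
    return V
```

The direct solve costs $O(|S|^3)$, while each Richardson sweep costs only $O(|S|^2)$, which is why the convergence rate of the iterative scheme (next slides) matters.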

  10. Contraction
      • Let $H(V) \stackrel{\mathrm{def}}{=} R + \gamma T V$ be the policy evaluation operator.
      • Lemma 1: $H$ is a contraction mapping:
        $\|H(V) - H(\tilde V)\|_\infty \le \gamma \|V - \tilde V\|_\infty$
      • Proof:
        $\|H(V) - H(\tilde V)\|_\infty = \|R + \gamma T V - R - \gamma T \tilde V\|_\infty$   (by definition)
        $= \|\gamma T (V - \tilde V)\|_\infty$   (simplification)
        $\le \gamma \|T\|_\infty \|V - \tilde V\|_\infty$   (since $\|AB\| \le \|A\|\,\|B\|$)
        $= \gamma \|V - \tilde V\|_\infty$   (since $\|T\|_\infty = \max_s \sum_{s'} T(s, s') = 1$)
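A quick numerical sanity check of Lemma 1 on a randomly generated policy-evaluation problem (an illustrative sketch, not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma = 5, 0.9

T = rng.random((n, n))
T /= T.sum(axis=1, keepdims=True)        # row-stochastic transition matrix
R = rng.random(n)

H = lambda V: R + gamma * T @ V          # policy evaluation operator

V1, V2 = rng.random(n), rng.random(n)
lhs = np.max(np.abs(H(V1) - H(V2)))      # ||H(V1) - H(V2)||_inf
rhs = gamma * np.max(np.abs(V1 - V2))    # gamma * ||V1 - V2||_inf
print(lhs <= rhs)                        # True: H contracts distances by gamma
```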

  11. Convergence
      • Theorem 2: Policy evaluation converges to $V^\pi$ for any initial estimate $V$:
        $\lim_{n \to \infty} H^n(V) = V^\pi \;\; \forall V$
      • Proof:
        – By definition $V^\pi = H^\infty(0)$, but policy evaluation computes $H^\infty(V)$ for any initial $V$.
        – By Lemma 1, $\|H^n(V) - H^n(\tilde V)\|_\infty \le \gamma^n \|V - \tilde V\|_\infty$.
        – Hence, when $n \to \infty$, $\|H^n(V) - H^n(0)\|_\infty \to 0$ and $H^\infty(V) = V^\pi \;\; \forall V$.

  12. Approximate Policy Evaluation
      • In practice, we can't perform an infinite number of iterations.
      • Suppose that we perform value iteration for $k$ steps and $\|H^k(V) - H^{k-1}(V)\|_\infty = \epsilon$; how far is $H^k(V)$ from $V^\pi$?

  13. Approximate Policy Evaluation
      • Theorem 3: If $\|H^k(V) - H^{k-1}(V)\|_\infty \le \epsilon$, then
        $\|V^\pi - H^k(V)\|_\infty \le \frac{\epsilon}{1 - \gamma}$
      • Proof:
        $\|V^\pi - H^k(V)\|_\infty = \|H^\infty(V) - H^k(V)\|_\infty$   (by Theorem 2)
        $= \left\| \sum_{t=1}^{\infty} \left( H^{t+k}(V) - H^{t+k-1}(V) \right) \right\|_\infty$   (telescoping sum)
        $\le \sum_{t=1}^{\infty} \|H^{t+k}(V) - H^{t+k-1}(V)\|_\infty$   (since $\|A + B\| \le \|A\| + \|B\|$)
        $\le \sum_{t=1}^{\infty} \gamma^t \epsilon$   (by Lemma 1)
        $\le \frac{\epsilon}{1 - \gamma}$
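The bound is easy to verify numerically on a small random example (an illustrative sketch, not course code):

```python
import numpy as np

rng = np.random.default_rng(1)
n, gamma, k = 6, 0.8, 20

T = rng.random((n, n))
T /= T.sum(axis=1, keepdims=True)                  # row-stochastic transitions
R = rng.random(n)

V_pi = np.linalg.solve(np.eye(n) - gamma * T, R)   # exact V^pi

V = rng.random(n)                                  # arbitrary initial estimate
for _ in range(k - 1):
    V = R + gamma * T @ V                          # V = H^{k-1}(V0)
V_k = R + gamma * T @ V                            # V_k = H^{k}(V0)

eps = np.max(np.abs(V_k - V))                      # ||H^k(V0) - H^{k-1}(V0)||_inf
print(np.max(np.abs(V_pi - V_k)) <= eps / (1 - gamma))   # True, as Theorem 3 states
```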

  14. Optimal Value Function
      • Non-linear system of equations:
        $V_\infty^*(s) = \max_a R(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, V_\infty^*(s') \;\; \forall s$
      • Matrix form:
        $R^a$: $|S| \times 1$ column vector of rewards for $a$
        $V^*$: $|S| \times 1$ column vector of optimal values
        $T^a$: $|S| \times |S|$ matrix of transition probabilities for $a$
        $V^* = \max_a\, (R^a + \gamma T^a V^*)$

  15. Contraction
      • Let $H^*(V) \stackrel{\mathrm{def}}{=} \max_a\, (R^a + \gamma T^a V)$ be the operator in value iteration.
      • Lemma 3: $H^*$ is a contraction mapping:
        $\|H^*(V) - H^*(\tilde V)\|_\infty \le \gamma \|V - \tilde V\|_\infty$
      • Proof: Without loss of generality, let $H^*(V)(s) \ge H^*(\tilde V)(s)$ and let
        $a_s^* = \operatorname{argmax}_a R(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, V(s')$

  16. Contraction
      • Proof continued:
      • Then
        $0 \le H^*(V)(s) - H^*(\tilde V)(s)$   (by assumption)
        $\le R(s, a_s^*) + \gamma \sum_{s'} \Pr(s' \mid s, a_s^*)\, V(s') - R(s, a_s^*) - \gamma \sum_{s'} \Pr(s' \mid s, a_s^*)\, \tilde V(s')$   (by definition)
        $= \gamma \sum_{s'} \Pr(s' \mid s, a_s^*) \left( V(s') - \tilde V(s') \right)$
        $\le \gamma \sum_{s'} \Pr(s' \mid s, a_s^*)\, \|V - \tilde V\|_\infty$   (max-norm upper bound)
        $= \gamma \|V - \tilde V\|_\infty$   (since $\sum_{s'} \Pr(s' \mid s, a_s^*) = 1$)
      • Repeat the same argument for $H^*(\tilde V)(s) \ge H^*(V)(s)$, for each $s$.

  17. Convergence
      • Theorem 4: Value iteration converges to $V^*$ for any initial estimate $V$:
        $\lim_{n \to \infty} (H^*)^n(V) = V^* \;\; \forall V$
      • Proof:
        – By definition $V^* = (H^*)^\infty(0)$, but value iteration computes $(H^*)^\infty(V)$ for some initial $V$.
        – By Lemma 3, $\|(H^*)^n(V) - (H^*)^n(\tilde V)\|_\infty \le \gamma^n \|V - \tilde V\|_\infty$.
        – Hence, when $n \to \infty$, $\|(H^*)^n(V) - (H^*)^n(0)\|_\infty \to 0$ and $(H^*)^\infty(V) = V^* \;\; \forall V$.

  18. Value Iteration
      • Even when the horizon is infinite, we perform only finitely many iterations.
      • Stop when $\|V_n - V_{n-1}\|_\infty \le \epsilon$.
      valueIteration(MDP)
        $V_0 \leftarrow \max_a R^a$;   $n \leftarrow 0$
        Repeat
          $n \leftarrow n + 1$
          $V_n \leftarrow \max_a\, (R^a + \gamma T^a V_{n-1})$
        Until $\|V_n - V_{n-1}\|_\infty \le \epsilon$
        Return $V_n$
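A compact NumPy sketch of this infinite-horizon loop, under the same assumed P[a, s, s'] / R[s, a] layout as the earlier snippets:

```python
import numpy as np

def value_iteration(P, R, gamma, eps=1e-6):
    """Value iteration with stopping test ||V_n - V_{n-1}||_inf <= eps."""
    V = R.max(axis=1)                      # V_0 <- max_a R^a
    while True:
        Q = R + gamma * (P @ V).T          # Q[s, a] = R(s,a) + gamma sum_s' Pr(s'|s,a) V(s')
        V_new = Q.max(axis=1)              # V_n <- max_a (R^a + gamma T^a V_{n-1})
        if np.max(np.abs(V_new - V)) <= eps:
            return V_new
        V = V_new
```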

  19. Induced Policy
      • Since $\|V_n - V_{n-1}\|_\infty \le \epsilon$, by Theorem 4 we know that $\|V_n - V^*\|_\infty \le \frac{\epsilon}{1 - \gamma}$.
      • But how good is the stationary policy $\pi_n(s)$ extracted based on $V_n$?
        $\pi_n(s) = \operatorname{argmax}_a R(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, V_n(s')$
      • How far is $V^{\pi_n}$ from $V^*$?
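Extracting the induced (greedy) stationary policy amounts to one more backup over the returned values; a sketch under the same assumed array layout:

```python
import numpy as np

def greedy_policy(P, R, V, gamma):
    """Stationary policy induced by a value estimate V (one step of lookahead)."""
    Q = R + gamma * (P @ V).T      # Q[s, a]
    return Q.argmax(axis=1)        # pi_n(s) = argmax_a Q[s, a]
```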

  20. Induced Policy
      • Theorem 5: $\|V^{\pi_n} - V^*\|_\infty \le \frac{2\epsilon}{1 - \gamma}$
      • Proof:
        $\|V^{\pi_n} - V^*\|_\infty = \|V^{\pi_n} - V_n + V_n - V^*\|_\infty$
        $\le \|V^{\pi_n} - V_n\|_\infty + \|V_n - V^*\|_\infty$   (since $\|A + B\| \le \|A\| + \|B\|$)
        $= \|H_{\pi_n}^\infty(V_n) - V_n\|_\infty + \|(H^*)^\infty(V_n) - V_n\|_\infty$
        $\le \frac{\epsilon}{1 - \gamma} + \frac{\epsilon}{1 - \gamma}$   (by Theorems 2 and 4)
        $= \frac{2\epsilon}{1 - \gamma}$

  21. Summary
      • Value iteration
        – Simple dynamic programming algorithm
        – Complexity: $O(n\, |A|\, |S|^2)$, where $n$ is the number of iterations
      • Can we optimize the policy directly instead of optimizing the value function and then inducing a policy?
        – Yes: by policy iteration

      CS886 (c) 2013 Pascal Poupart
