Module 6: Value Iteration
CS 886: Sequential Decision Making and Reinforcement Learning
University of Waterloo
Markov Decision Process
• Definition
  – Set of states: $S$
  – Set of actions (i.e., decisions): $A$
  – Transition model: $\Pr(s_t \mid s_{t-1}, a_{t-1})$
  – Reward model (i.e., utility): $R(s_t, a_t)$
  – Discount factor: $0 \le \gamma \le 1$
  – Horizon (i.e., # of time steps): $h$
• Goal: find an optimal policy $\pi^*$
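As a concrete container for these ingredients, here is a minimal sketch in Python/NumPy; the class and field names (`MDP`, `T`, `R`, `gamma`, `horizon`) are illustrative choices, not notation from the slides.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class MDP:
    T: np.ndarray      # transition model, shape (|A|, |S|, |S|): T[a, s, s'] = Pr(s' | s, a)
    R: np.ndarray      # reward model, shape (|S|, |A|): R[s, a]
    gamma: float       # discount factor, 0 <= gamma <= 1
    horizon: int       # number of time steps h (finite-horizon case)
```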
Finite Horizon
• Policy evaluation
  $V_h^\pi(s) = \sum_{t=0}^{h} \gamma^t \sum_{s'} \Pr(S_t = s' \mid S_0 = s, \pi)\, R(s', \pi_t(s'))$
• Recursive form (dynamic programming)
  $V_0^\pi(s) = R(s, \pi_0(s))$
  $V_t^\pi(s) = R(s, \pi_t(s)) + \gamma \sum_{s'} \Pr(s' \mid s, \pi_t(s))\, V_{t-1}^\pi(s')$
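The recursion above maps directly to a short dynamic program. The sketch below assumes the array layout introduced in the previous sketch (transition tensor `T[a, s, s']`, reward matrix `R[s, a]`) plus a policy array `pi[t, s]`; these names are illustrative.

```python
import numpy as np

def finite_horizon_policy_eval(T, R, pi, gamma, h):
    """Evaluate a (possibly non-stationary) policy pi over a finite horizon h.

    T  : (|A|, |S|, |S|) transition probabilities, T[a, s, s'] = Pr(s' | s, a)
    R  : (|S|, |A|) rewards
    pi : (h+1, |S|) integer actions, pi[t, s] = action prescribed by pi_t in state s
    Returns V with V[t, s] = V_t^pi(s) as defined by the recursion above.
    """
    n_states = R.shape[0]
    states = np.arange(n_states)
    V = np.zeros((h + 1, n_states))
    V[0] = R[states, pi[0]]                      # V_0(s) = R(s, pi_0(s))
    for t in range(1, h + 1):
        a = pi[t]                                # action chosen in each state at step t
        # V_t(s) = R(s, pi_t(s)) + gamma * sum_{s'} Pr(s'|s, pi_t(s)) V_{t-1}(s')
        V[t] = R[states, a] + gamma * (T[a, states] @ V[t - 1])
    return V
```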
Finite Horizon
• Optimal policy $\pi^*$
  $V^{\pi^*}(s) \ge V^{\pi}(s) \quad \forall \pi, s$
• Optimal value function $V^*$ (shorthand for $V^{\pi^*}$)
  $V_0^*(s) = \max_a R(s, a)$
  $V_t^*(s) = \max_a R(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, V_{t-1}^*(s')$   (Bellman's equation)
Value Iteration Algorithm

valueIteration(MDP)
  $V_0^*(s) \leftarrow \max_a R(s, a) \quad \forall s$
  For $t = 1$ to $h$ do
    $V_t^*(s) \leftarrow \max_a R(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, V_{t-1}^*(s') \quad \forall s$
  Return $V^*$

Optimal policy $\pi^*$:
  $t = 0$:  $\pi_0^*(s) \leftarrow \operatorname{argmax}_a R(s, a) \quad \forall s$
  $t > 0$:  $\pi_t^*(s) \leftarrow \operatorname{argmax}_a R(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, V_{t-1}^*(s') \quad \forall s$

NB: $\pi^*$ is non-stationary (i.e., time dependent)
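A loop-based sketch of the algorithm above, again assuming the illustrative `T[a, s, s']` and `R[s, a]` arrays; it returns both the value functions and the non-stationary greedy policy.

```python
import numpy as np

def finite_horizon_value_iteration(T, R, gamma, h):
    """Finite-horizon value iteration with non-stationary policy extraction.

    T : (|A|, |S|, |S|) with T[a, s, s'] = Pr(s' | s, a)
    R : (|S|, |A|)
    Returns V[t, s] = V_t^*(s) and pi[t, s] = pi_t^*(s) for t = 0..h.
    """
    n_states, n_actions = R.shape
    V = np.zeros((h + 1, n_states))
    pi = np.zeros((h + 1, n_states), dtype=int)
    V[0], pi[0] = R.max(axis=1), R.argmax(axis=1)        # t = 0 case
    for t in range(1, h + 1):
        # Q[s, a] = R(s, a) + gamma * sum_{s'} Pr(s'|s, a) V_{t-1}(s')
        Q = R + gamma * (T @ V[t - 1]).T
        V[t], pi[t] = Q.max(axis=1), Q.argmax(axis=1)
    return V, pi
```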
Value Iteration
• Matrix form:
  $R^a$: $|S| \times 1$ column vector of rewards for action $a$
  $V_t^*$: $|S| \times 1$ column vector of state values
  $T^a$: $|S| \times |S|$ matrix of transition probabilities for action $a$

valueIteration(MDP)
  $V_0^* \leftarrow \max_a R^a$
  For $t = 1$ to $h$ do
    $V_t^* \leftarrow \max_a R^a + \gamma T^a V_{t-1}^*$
  Return $V^*$
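In matrix form, one backup per step is literally the update line above. A minimal sketch, assuming `R_a` is a list of $|S|$-vectors and `T_a` a list of $|S| \times |S|$ row-stochastic matrices (one per action); these names are illustrative.

```python
import numpy as np

def value_iteration_matrix_form(T_a, R_a, gamma, h):
    """h Bellman backups in matrix form: V_t = max_a (R^a + gamma T^a V_{t-1})."""
    V = np.max(np.stack(R_a), axis=0)              # V_0 = max_a R^a (elementwise over actions)
    for _ in range(h):
        backups = [R_a[a] + gamma * T_a[a] @ V     # one |S|-vector per action
                   for a in range(len(R_a))]
        V = np.max(np.stack(backups), axis=0)      # elementwise max over actions
    return V
```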
Infinite Horizon
• Let $h \to \infty$
• Then $V_h^\pi \to V_\infty^\pi$ and $V_{h-1}^\pi \to V_\infty^\pi$
• Policy evaluation:
  $V^\pi(s) = R(s, \pi(s)) + \gamma \sum_{s'} \Pr(s' \mid s, \pi(s))\, V^\pi(s') \quad \forall s$
• Bellman's equation:
  $V^*(s) = \max_a R(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, V^*(s') \quad \forall s$
Policy evaluation
• Linear system of equations:
  $V^\pi(s) = R(s, \pi(s)) + \gamma \sum_{s'} \Pr(s' \mid s, \pi(s))\, V^\pi(s') \quad \forall s$
• Matrix form:
  $R^\pi$: $|S| \times 1$ column vector of state rewards for $\pi$
  $V^\pi$: $|S| \times 1$ column vector of state values for $\pi$
  $T^\pi$: $|S| \times |S|$ matrix of transition probabilities for $\pi$
  $V^\pi = R^\pi + \gamma T^\pi V^\pi$
Solving linear equations
• Linear system: $V = R + \gamma T V$
• Gaussian elimination: $(I - \gamma T) V = R$
• Compute inverse: $V = (I - \gamma T)^{-1} R$
• Iterative methods
  – Value iteration (a.k.a. Richardson iteration)
  – Repeat $V \leftarrow R + \gamma T V$
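Both solution routes fit in a few lines. The sketch below assumes `T_pi` is the $|S| \times |S|$ transition matrix under $\pi$ and `R_pi` the $|S|$-vector of rewards under $\pi$ (illustrative names); `np.linalg.solve` plays the role of Gaussian elimination.

```python
import numpy as np

def policy_eval_direct(T_pi, R_pi, gamma):
    """Solve the linear system (I - gamma T) V = R exactly."""
    n = len(R_pi)
    return np.linalg.solve(np.eye(n) - gamma * T_pi, R_pi)

def policy_eval_richardson(T_pi, R_pi, gamma, iters=1000):
    """Richardson iteration: repeatedly apply V <- R + gamma T V."""
    V = np.zeros_like(R_pi, dtype=float)
    for _ in range(iters):
        V = R_pi + gamma * T_pi @ V
    return V
```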
Contraction
• Let $H(V) \triangleq R + \gamma T V$ be the policy evaluation operator
• Lemma 1: $H$ is a contraction mapping.
  $\|H(V) - H(\tilde V)\|_\infty \le \gamma \|V - \tilde V\|_\infty$
• Proof
  $\|H(V) - H(\tilde V)\|_\infty$
  $= \|R + \gamma T V - (R + \gamma T \tilde V)\|_\infty$   (by definition)
  $= \|\gamma T (V - \tilde V)\|_\infty$   (simplification)
  $\le \gamma \|T\|_\infty \|V - \tilde V\|_\infty$   (since $\|AB\| \le \|A\|\,\|B\|$)
  $= \gamma \|V - \tilde V\|_\infty$   (since $\max_s \sum_{s'} T(s, s') = 1$)
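The contraction property is easy to sanity-check numerically. A small sketch with a random row-stochastic `T` and random value vectors (all quantities illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, gamma = 4, 0.95
T = rng.random((n, n)); T /= T.sum(axis=1, keepdims=True)   # row-stochastic transition matrix
R = rng.random(n)

H = lambda V: R + gamma * T @ V            # policy evaluation operator H(V) = R + gamma T V
V1, V2 = rng.random(n), rng.random(n)
lhs = np.max(np.abs(H(V1) - H(V2)))        # ||H(V) - H(V~)||_inf
rhs = gamma * np.max(np.abs(V1 - V2))      # gamma ||V - V~||_inf
assert lhs <= rhs + 1e-12                  # Lemma 1 holds on this instance
```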
Convergence
• Theorem 2: Policy evaluation converges to $V^\pi$ for any initial estimate $V$
  $\lim_{n \to \infty} H^{(n)}(V) = V^\pi \quad \forall V$
• Proof
  – By definition $V^\pi = H^{(\infty)}(0)$, but policy evaluation computes $H^{(\infty)}(V)$ for any initial $V$
  – By Lemma 1, $\|H^{(n)}(V) - H^{(n)}(\tilde V)\|_\infty \le \gamma^n \|V - \tilde V\|_\infty$
  – Hence, when $n \to \infty$, $\|H^{(n)}(V) - H^{(n)}(0)\|_\infty \to 0$ and $H^{(\infty)}(V) = V^\pi \quad \forall V$
Approximate Policy Evaluation
• In practice, we can't perform an infinite number of iterations.
• Suppose that we perform value iteration for $n$ steps and $\|H^{(n)}(V) - H^{(n-1)}(V)\|_\infty = \epsilon$. How far is $H^{(n)}(V)$ from $V^\pi$?
Approximate Policy Evaluation
• Theorem 3: If $\|H^{(n)}(V) - H^{(n-1)}(V)\|_\infty \le \epsilon$, then $\|V^\pi - H^{(n)}(V)\|_\infty \le \frac{\epsilon}{1 - \gamma}$
• Proof
  $\|V^\pi - H^{(n)}(V)\|_\infty$
  $= \|H^{(\infty)}(V) - H^{(n)}(V)\|_\infty$   (by Theorem 2)
  $= \left\| \sum_{t=1}^{\infty} H^{(t+n)}(V) - H^{(t+n-1)}(V) \right\|_\infty$   (telescoping sum)
  $\le \sum_{t=1}^{\infty} \|H^{(t+n)}(V) - H^{(t+n-1)}(V)\|_\infty$   (since $\|A + B\| \le \|A\| + \|B\|$)
  $\le \sum_{t=1}^{\infty} \gamma^t \epsilon = \frac{\gamma \epsilon}{1 - \gamma} \le \frac{\epsilon}{1 - \gamma}$   (by Lemma 1)
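The bound can be checked numerically on a random instance by comparing the stopping residual $\epsilon$ with the true gap to $V^\pi$ obtained from the exact linear solve; everything below is an illustrative sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma = 5, 0.9
T_pi = rng.random((n, n)); T_pi /= T_pi.sum(axis=1, keepdims=True)   # row-stochastic
R_pi = rng.random(n)

V_exact = np.linalg.solve(np.eye(n) - gamma * T_pi, R_pi)            # V^pi

V_prev = np.zeros(n)
for _ in range(50):                              # 50 applications of H
    V = R_pi + gamma * T_pi @ V_prev
    eps = np.max(np.abs(V - V_prev))             # residual ||H^(n)(V) - H^(n-1)(V)||_inf
    V_prev = V

gap = np.max(np.abs(V_exact - V))                # true distance to V^pi
assert gap <= eps / (1 - gamma) + 1e-12          # Theorem 3's bound holds
```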
Optimal Value Function
• Non-linear system of equations:
  $V^*(s) = \max_a R(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, V^*(s') \quad \forall s$
• Matrix form:
  $R^a$: $|S| \times 1$ column vector of rewards for action $a$
  $V^*$: $|S| \times 1$ column vector of optimal values
  $T^a$: $|S| \times |S|$ matrix of transition probabilities for action $a$
  $V^* = \max_a R^a + \gamma T^a V^*$
Contraction
• Let $H^*(V) \triangleq \max_a R^a + \gamma T^a V$ be the operator in value iteration
• Lemma 3: $H^*$ is a contraction mapping.
  $\|H^*(V) - H^*(\tilde V)\|_\infty \le \gamma \|V - \tilde V\|_\infty$
• Proof: without loss of generality, let $H^*(V)(s) \ge H^*(\tilde V)(s)$ and
  let $a_s^* = \operatorname{argmax}_a R(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, V(s')$
Contraction
• Proof continued:
• Then
  $0 \le H^*(V)(s) - H^*(\tilde V)(s)$   (by assumption)
  $\le R(s, a_s^*) + \gamma \sum_{s'} \Pr(s' \mid s, a_s^*)\, V(s') - R(s, a_s^*) - \gamma \sum_{s'} \Pr(s' \mid s, a_s^*)\, \tilde V(s')$   (by definition)
  $= \gamma \sum_{s'} \Pr(s' \mid s, a_s^*)\, \big(V(s') - \tilde V(s')\big)$
  $\le \gamma \sum_{s'} \Pr(s' \mid s, a_s^*)\, \|V - \tilde V\|_\infty$   (max-norm upper bound)
  $= \gamma \|V - \tilde V\|_\infty$   (since $\sum_{s'} \Pr(s' \mid s, a_s^*) = 1$)
• Repeat the same argument for $H^*(\tilde V)(s) \ge H^*(V)(s)$ and for each $s$
Convergence
• Theorem 4: Value iteration converges to $V^*$ for any initial estimate $V$
  $\lim_{n \to \infty} H^{*(n)}(V) = V^* \quad \forall V$
• Proof
  – By definition $V^* = H^{*(\infty)}(0)$, but value iteration computes $H^{*(\infty)}(V)$ for some initial $V$
  – By Lemma 3, $\|H^{*(n)}(V) - H^{*(n)}(\tilde V)\|_\infty \le \gamma^n \|V - \tilde V\|_\infty$
  – Hence, when $n \to \infty$, $\|H^{*(n)}(V) - H^{*(n)}(0)\|_\infty \to 0$ and $H^{*(\infty)}(V) = V^* \quad \forall V$
Value Iteration
• Even when the horizon is infinite, we perform finitely many iterations.
• Stop when $\|V_n - V_{n-1}\|_\infty \le \epsilon$

valueIteration(MDP)
  $V_0 \leftarrow \max_a R^a$;  $n \leftarrow 0$
  Repeat
    $n \leftarrow n + 1$
    $V_n \leftarrow \max_a R^a + \gamma T^a V_{n-1}$
  Until $\|V_n - V_{n-1}\|_\infty \le \epsilon$
  Return $V_n$
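A direct sketch of this loop, using the illustrative `T[a, s, s']` and `R[s, a]` arrays from the earlier sketches:

```python
import numpy as np

def value_iteration_inf_horizon(T, R, gamma, eps=1e-6):
    """Iterate V_n = max_a (R^a + gamma T^a V_{n-1}) until ||V_n - V_{n-1}||_inf <= eps."""
    V = R.max(axis=1)                              # V_0 = max_a R^a
    while True:
        V_new = (R + gamma * (T @ V).T).max(axis=1)
        if np.max(np.abs(V_new - V)) <= eps:
            return V_new
        V = V_new
```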
Induced Policy
• Since $\|V_n - V_{n-1}\|_\infty \le \epsilon$, by Theorem 4 we know that $\|V_n - V^*\|_\infty \le \frac{\epsilon}{1 - \gamma}$
• But how good is the stationary policy $\pi_n(s)$ extracted based on $V_n$ (see the sketch below)?
  $\pi_n(s) = \operatorname{argmax}_a R(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, V_n(s')$
• How far is $V^{\pi_n}$ from $V^*$?
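Extracting the stationary greedy policy $\pi_n$ from $V_n$ is a single argmax over the one-step backup; a sketch with the same illustrative arrays:

```python
import numpy as np

def greedy_policy(T, R, gamma, V):
    """pi_n(s) = argmax_a R(s, a) + gamma * sum_{s'} Pr(s'|s, a) V(s')."""
    Q = R + gamma * (T @ V).T       # Q[s, a]: one-step backup of V
    return Q.argmax(axis=1)         # one action index per state
```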
Induced Policy
• Theorem 5: $\|V^{\pi_n} - V^*\|_\infty \le \frac{2\epsilon}{1 - \gamma}$
• Proof
  $\|V^{\pi_n} - V^*\|_\infty$
  $= \|V^{\pi_n} - V_n + V_n - V^*\|_\infty$
  $\le \|V^{\pi_n} - V_n\|_\infty + \|V_n - V^*\|_\infty$   (since $\|A + B\| \le \|A\| + \|B\|$)
  $\le \frac{\epsilon}{1 - \gamma} + \frac{\epsilon}{1 - \gamma}$   (by Theorems 2 and 4, using $H^{\pi_n}(V_n) = H^*(V_n)$)
  $= \frac{2\epsilon}{1 - \gamma}$
Summary
• Value iteration
  – Simple dynamic programming algorithm
  – Complexity: $O(n\,|A|\,|S|^2)$, where $n$ is the number of iterations
• Can we optimize the policy directly instead of optimizing the value function and then inducing a policy?
  – Yes: by policy iteration