Module 8: Linear Programming
CS 886: Sequential Decision Making and Reinforcement Learning
University of Waterloo
Policy Optimization
• Value and policy iteration
  – Iterative algorithms that implicitly solve an optimization problem
• Can we explicitly write down this optimization problem?
  – Yes, it can be formulated as a linear program
Primal Linear Program

primalLP(MDP)
  $\min_W \sum_t x(t)\, W(t)$
  subject to $W(t) \ge S(t,b) + \delta \sum_{t'} \Pr(t'|t,b)\, W(t') \quad \forall t,b$
  return $W$

• Variables: $W(t) \;\; \forall t$
• Objective: $\min_W \sum_t x(t)\, W(t)$, where $x(t)$ is a weight assigned to state $t$
• Constraints: $W(t) \ge S(t,b) + \delta \sum_{t'} \Pr(t'|t,b)\, W(t') \quad \forall t,b$
  (a small worked sketch follows below)
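To make the LP concrete, here is a minimal sketch that solves primalLP with scipy.optimize.linprog on a hypothetical two-state, two-action MDP; the arrays P and S, the discount delta, and the weights x are illustrative assumptions, not values from the slides.

```python
# A minimal sketch of primalLP(MDP), assuming a hypothetical 2-state,
# 2-action MDP; P, S, delta and x below are made-up illustrative values.
import numpy as np
from scipy.optimize import linprog

n_t, n_b = 2, 2                           # number of states t and actions b
delta = 0.9                               # discount factor
S = np.array([[1.0, 0.0],                 # reward S(t, b)
              [0.0, 2.0]])
P = np.array([[[0.8, 0.2], [0.1, 0.9]],   # P[t, b, t'] = Pr(t' | t, b)
              [[0.5, 0.5], [0.3, 0.7]]])
x = np.array([0.5, 0.5])                  # positive weight x(t) for every state

# Each constraint W(t) >= S(t,b) + delta * sum_t' Pr(t'|t,b) W(t') is
# rewritten as (delta * P[t,b] - e_t) @ W <= -S(t,b) for linprog.
A_ub, b_ub = [], []
for t in range(n_t):
    for b in range(n_b):
        row = delta * P[t, b]
        row[t] -= 1.0
        A_ub.append(row)
        b_ub.append(-S[t, b])

res = linprog(c=x, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(None, None)] * n_t, method="highs")
W_star = res.x                            # optimal value function W*(t)
print("W* =", W_star)
```

Note the bounds argument: linprog defaults variables to be non-negative, but $W(t)$ is unconstrained in sign, so the bounds must be relaxed explicitly.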
Objective
• Why do we minimize a weighted combination of the values? Shouldn't we maximize value?
• Value functions $W$ that satisfy the constraints are upper bounds on the optimal value function $W^*$:
  $W(t) \ge W^*(t) \;\; \forall t$
• Minimizing the value ensures that we choose the lowest upper bound: at the minimum,
  $W(t) = W^*(t) \;\; \forall t$
Upper bound
• Theorem: Value functions $W$ that satisfy
  $W(t) \ge S(t,b) + \delta \sum_{t'} \Pr(t'|t,b)\, W(t') \quad \forall t,b$
  are upper bounds on the optimal value function $W^*$: $W(t) \ge W^*(t) \;\; \forall t$
• Proof:
  – Since $W(t) \ge S(t,b) + \delta \sum_{t'} \Pr(t'|t,b)\, W(t') \;\; \forall t,b$
  – Then $W(t) \ge \max_b \left[ S(t,b) + \delta \sum_{t'} \Pr(t'|t,b)\, W(t') \right] = I^*(W)(t) \;\; \forall t$, where $I^*$ is the Bellman optimality operator
  – Furthermore, since $I^*$ is monotonic, applying it repeatedly preserves the inequality:
    $W \ge I^*(W) \ge I^*(I^*(W)) \ge \cdots \ge (I^*)^\infty(W) = W^*$
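As a quick numeric illustration of the theorem (continuing the assumed toy MDP above): any $W$ satisfying the constraints dominates $W^*$. Taking $W = W^* + 1$ componentwise keeps the constraints satisfied, since the slack grows by $1 - \delta > 0$, and it is indeed an upper bound.

```python
# Numeric illustration of the theorem on the toy MDP above: W* + 1 is still
# feasible (each constraint gains slack (1 - delta)) and upper-bounds W*.
W_feas = W_star + 1.0
assert all(W_feas[t] >= S[t, b] + delta * P[t, b] @ W_feas
           for t in range(n_t) for b in range(n_b))
assert np.all(W_feas >= W_star)
```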
Weight function (initial state)
• How do we choose the weight function?
• If the policy always starts in the same initial state $t_0$, then set
  $x(t) = 1$ if $t = t_0$, and $0$ otherwise
• This ensures that $\sum_t x(t)\, W(t) = W^*(t_0)$
Weight function (any state)
• If the policy may start in any state, then assign a positive weight to each state, i.e., $x(t) > 0 \;\; \forall t$
• This ensures that $W$ is minimized at each $t$ and therefore $W(t) = W^*(t) \;\; \forall t$
• The magnitude of the weights doesn't matter when the LP is solved exactly. We will revisit the choice of $x(t)$ when we discuss approximate linear programming.
Optimal Policy
• The linear program finds $W^*$
• We can extract $\rho^*$ from $W^*$ as usual (see the sketch below):
  $\rho^*(t) \leftarrow \arg\max_b \left[ S(t,b) + \delta \sum_{t'} \Pr(t'|t,b)\, W^*(t') \right]$
• Or check the active constraints:
  – For each $t$, check which $b^*$ turns the constraint
    $W(t) \ge S(t,b) + \delta \sum_{t'} \Pr(t'|t,b)\, W(t') \;\; \forall b$
    into an equality:
    $W(t) = S(t,b^*) + \delta \sum_{t'} \Pr(t'|t,b^*)\, W(t')$
  – Set $\rho^*(t) \leftarrow b^*$
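A short continuation of the earlier sketch (same assumed toy MDP) extracting the greedy policy from W*:

```python
# Greedy extraction of rho* from W*, continuing the toy MDP above.
# Q[t, b] = S(t,b) + delta * sum_t' Pr(t'|t,b) W*(t')
Q = S + delta * np.einsum("tbu,u->tb", P, W_star)
rho_star = np.argmax(Q, axis=1)           # rho*(t) = argmax_b Q(t, b)
print("rho* =", rho_star)
```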
Direct Policy Optimization
• The optimal solution to the primal linear program is $W^*$, but we still have to extract $\rho^*$
• Could we directly optimize $\rho$?
  – Yes, by considering the dual linear program
Dual Linear Program

dualLP(MDP)
  $\max_z \sum_{t,b} z(t,b)\, S(t,b)$
  subject to $\sum_{b'} z(t',b') = x(t') + \delta \sum_{t,b} \Pr(t'|t,b)\, z(t,b) \quad \forall t'$
             $z(t,b) \ge 0 \quad \forall t,b$
  Let $\rho(b|t) = \Pr(b|t) = z(t,b) / \sum_b z(t,b)$
  return $\rho$

• Variables: $z(t,b) \;\; \forall t,b$
  – frequency of each $(t,b)$-pair (proportional to $\rho$)
• Objective: $\max_z \sum_{t,b} z(t,b)\, S(t,b)$
• Constraints: $\sum_{b'} z(t',b') = x(t') + \delta \sum_{t,b} \Pr(t'|t,b)\, z(t,b) \quad \forall t'$
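Here is a minimal sketch of dualLP on the same assumed toy MDP; z is flattened so that variable index t*n_b + b stands for the pair (t, b).

```python
# A minimal sketch of dualLP(MDP), reusing P, S, delta, x from the toy MDP.
c = -S.flatten()                          # linprog minimizes, so negate S
A_eq = np.zeros((n_t, n_t * n_b))         # one equality constraint per t'
for tp in range(n_t):
    for t in range(n_t):
        for b in range(n_b):
            # coefficient of z(t,b) in
            # sum_b' z(t',b') - delta * sum_{t,b} Pr(t'|t,b) z(t,b) = x(t')
            A_eq[tp, t * n_b + b] = float(t == tp) - delta * P[t, b, tp]
res = linprog(c=c, A_eq=A_eq, b_eq=x, bounds=[(0, None)] * (n_t * n_b),
              method="highs")
z = res.x.reshape(n_t, n_b)               # state-action frequencies z(t, b)
rho = z / z.sum(axis=1, keepdims=True)    # stochastic policy rho(b | t)
print("rho =", rho)
```

The normalization is always well defined here: with $x(t) > 0$ for every state, each row sum $\sum_b z(t,b)$ is strictly positive by the equality constraints.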
Duality
• Every primal linear program of the form
  $\min_y d^\top y \;\;\text{s.t.}\;\; By \ge c$
  has an equivalent dual linear program of the form
  $\max_z c^\top z \;\;\text{s.t.}\;\; B^\top z = d \;\text{and}\; z \ge 0$
  where $\min_y d^\top y = \max_z c^\top z$
• Interpretation for our LPs:
  $y = W$, $z \propto \rho$, $B = [J - \delta U_b]_{\forall b}$ (constraints stacked over actions, with $J$ the identity matrix and $U_b$ the transition matrix for action $b$), $c = [S_b]_{\forall b}$, $d = x$
State Frequency
• Let $g(t)$ be the (discounted) frequency of $t$ under policy $\rho$:
  0 steps: $g^0(t) = x(t)$
  1 step: $g^1(t') = x(t') + \delta \sum_t \Pr(t'|t,\rho(t))\, x(t)$
  2 steps: $g^2(t'') = x(t'') + \delta \sum_{t'} \Pr(t''|t',\rho(t'))\, x(t') + \delta^2 \sum_{t,t'} \Pr(t'|t,\rho(t)) \Pr(t''|t',\rho(t'))\, x(t)$
  ...
  n steps: $g^n(t_n) = x(t_n) + \delta \sum_{t_{n-1}} \Pr(t_n|t_{n-1},\rho(t_{n-1}))\, g^{n-1}(t_{n-1})$
  ∞ steps: $g(t') = x(t') + \delta \sum_t \Pr(t'|t,\rho(t))\, g(t)$
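The infinite-step fixed point can be computed in closed form by solving the linear system $(J - \delta P_\rho^\top)\, g = x$. A short sketch on the assumed toy MDP, for an arbitrary fixed deterministic policy (rho_det is a made-up example):

```python
# Discounted state frequencies g under a fixed deterministic policy,
# solving g = x + delta * P_rho^T g in closed form (toy MDP from above).
rho_det = np.array([0, 1])                 # an arbitrary example policy rho(t)
P_rho = P[np.arange(n_t), rho_det]         # P_rho[t, t'] = Pr(t' | t, rho(t))
g = np.linalg.solve(np.eye(n_t) - delta * P_rho.T, x)
print("g =", g)
```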
State-Action Frequency
• Let $z(t,b)$ be the state-action frequency
  $z(t,b) = \rho(b|t)\, g(t)$, where $\rho(b|t) = \Pr(b|t)$ is a stochastic policy
• Then the following equations are equivalent:
  $g(t') = x(t') + \delta \sum_t \Pr(t'|t,\rho(t))\, g(t)$
  $\Leftrightarrow\; g_\rho(t') \sum_{b'} \rho(b'|t') = x(t') + \delta \sum_{t,b} \Pr(t'|t,b)\, \rho(b|t)\, g_\rho(t)$
  $\Leftrightarrow\; \sum_{b'} z(t',b') = x(t') + \delta \sum_{t,b} \Pr(t'|t,b)\, z(t,b)$
  – which is exactly the constraint of the dual LP
Policy
• We can recover $\rho$ from $z$:
  $z(t,b) = \rho(b|t)\, g(t)$  (by definition)
  $\rho(b|t) = z(t,b) / g(t)$  (isolate $\rho$)
  $\rho(b|t) = z(t,b) / \sum_b z(t,b)$  (since $g(t) = \sum_b z(t,b)$ by definition)
• $\rho$ may be stochastic
• Actions with non-zero probability are necessarily optimal (checked numerically below)
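A quick numeric check of that last claim on the assumed toy MDP, using z from the dual sketch and Q from the greedy-extraction sketch: complementary slackness forces the support of z onto greedy actions.

```python
# Cross-check on the toy MDP: every action with z(t,b) > 0 attains the
# maximum of Q(t, .), i.e. it is greedy with respect to W*.
for t in range(n_t):
    for b in range(n_b):
        if z[t, b] > 1e-8:
            assert np.isclose(Q[t, b], Q[t].max())
```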
Objective
• Duality theory guarantees that the objectives of the primal and dual LPs are equal:
  $\max_z \sum_{t,b} z(t,b)\, S(t,b) = \min_W \sum_t x(t)\, W(t)$
• This means that $\sum_{t,b} z(t,b)\, S(t,b)$ implicitly measures the value of the optimal policy.
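On the assumed toy MDP, the equality can be confirmed directly from the two solver outputs:

```python
# Strong duality on the toy MDP: dual objective equals primal objective.
primal_obj = x @ W_star
dual_obj = (z * S).sum()
assert np.isclose(primal_obj, dual_obj)
print("optimal weighted value:", primal_obj)
```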
Solution Algorithms
• Two broad classes of algorithms:
  – Simplex (corner search)
  – Interior point methods (iterative methods that move through the interior)
• Polynomial complexity: solving an MDP by linear programming is in P (not NP-hard)
• Many packages for linear programming
  – CPLEX (robust, efficient, and free for academic use)