
The simplex method is strongly polynomial for deterministic Markov decision processes - PowerPoint PPT Presentation



  1. The simplex method is strongly polynomial for deterministic Markov decision processes
  Ian Post, Yinyu Ye
  Fields Institute, November 29, 2013

  2. Markov Decision Processes
  A Markov decision process is a method of modeling repeated decision making over time in stochastic, changing environments.
  It consists of states s and actions a with rewards r_a and probability distributions P_a over states. When action a is used, it receives the reward r_a and transitions to a new state according to the distribution P_a.
  [Figure: a state s with outgoing actions, rewards r_1, r_2 and transition probabilities p_1, p_2, p_3.]
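A minimal sketch of how such an MDP might be represented in code (not from the talk; the container layout and the tiny two-state instance are illustrative assumptions, reused by later sketches in this transcript):

```python
import numpy as np

class MDP:
    """Minimal MDP container (illustrative).  Action a belongs to state
    state[a], pays reward r[a], and moves to a new state drawn from P[a]."""
    def __init__(self, n_states, actions):
        # actions: list of (owning_state, reward, transition_probabilities)
        self.n = n_states
        self.state = np.array([s for s, _, _ in actions])
        self.r = np.array([r for _, r, _ in actions], dtype=float)
        self.P = np.array([p for _, _, p in actions], dtype=float)  # shape (m, n)

# Hypothetical two-state example, two actions per state.
mdp = MDP(2, [
    (0, 1.0, [0.5, 0.5]),   # from state 0: reward 1.0, split transition
    (0, 0.0, [1.0, 0.0]),   # from state 0: reward 0.0, self-loop
    (1, 2.0, [0.0, 1.0]),   # from state 1: reward 2.0, self-loop
    (1, 0.5, [1.0, 0.0]),   # from state 1: reward 0.5, back to state 0
])
```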

  3.–6. Markov Decision Processes
  We are also given a discount factor γ < 1 as part of the input.
  Goal: pick actions so as to maximize ∑_{t=0}^∞ γ^t E_A[r(t)], where r(t) is the reward at time t.
  [Animation: following a trajectory through a small graph with rewards r_1, …, r_5, the discounted reward accumulates as r_1, then r_1 + γ r_5, then r_1 + γ r_5 + γ² r_4.]
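A tiny numeric illustration of this objective (the reward sequence below is hypothetical, not from the slides): the discounted return of a fixed trajectory is computed directly from the sum.

```python
# Discounted return sum_{t>=0} gamma^t * r(t) for a hypothetical finite trajectory.
gamma = 0.9
rewards = [1.0, 5.0, 4.0]                      # r(0), r(1), r(2)
value = sum(gamma**t * r for t, r in enumerate(rewards))
print(value)                                    # 1 + 0.9*5 + 0.81*4 = 8.74
```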

  7.–9. Motivation
  MDPs are widely used in machine learning, operations research, economics, robotics and control, etc.
  MDPs are also an interesting problem theoretically in that they are essentially where our knowledge of how to solve LPs in strongly polynomial time stops.
  ◮ Close to being strongly polynomial [Ye05] and possess a lot of structure that allows for powerful algorithms like policy iteration [How60]...
  ◮ ...but also appear hard for powerful algorithms [Fea10] [FHZ11]
  The performance of basis-exchange algorithms like policy iteration and simplex remains poorly understood.
  ◮ A number of open questions, including their performance on special cases like deterministic MDPs [HZ10]
  ◮ Important for developing new algorithms with better performance

  10. Previous Work
  Policy iteration [How60]
  ◮ Long conjectured to be strongly polynomial, but only exponential bounds known [MS99]
  ◮ Recently shown to be exponential [Fea10]
  Simplex lower bounds using MDPs [FHZ11] [Fri11] [MC94]
  Discounted MDPs (bounds depend on 1/(1 − γ))
  ◮ ε-approximation to the optimum [Bel57]
  ◮ True optimum [Ye11] [HMZ11]
  Specialized algorithms for deterministic MDPs and other special cases [PT87] [HN94] [MTZ10] [Mad02]

  11.–12. Results
  Theorem. The simplex method with Dantzig’s most-negative reduced cost pivoting rule converges in O(n³m² log² n) iterations for deterministic MDPs, regardless of the discount factor.
  Theorem. If each action can have a distinct discount, then the simplex method converges in O(n⁵m³ log² n) iterations.
  Subsequent work [HKZ13] has improved these bounds by a factor of n.

  13. Value vector
  Let π be a policy (a choice of action for each state).
  ◮ This defines a Markov chain.
  The value (dual variable) v^π_s of a state s is the expected reward for starting in that state and following π:
      v^π_s = r_a + γ (P_a)^T v^π,   where a = π(s).
  [Figure: state s with value v_s taking an action with reward r_1 and transition probabilities p_1, p_2 to states with values v_1, v_2.]
  ◮ Key property: increasing the value of one state only increases the values of others.
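A sketch of policy evaluation in code (reusing the hypothetical MDP container from the sketch after slide 2): stacking the equation over all states gives the linear system (I − γP^π) v^π = r^π, which can be solved directly.

```python
import numpy as np

def policy_values(mdp, policy, gamma):
    """Value vector v^pi: solve (I - gamma * P^pi) v = r^pi.
    policy[s] is the index of the action chosen at state s (sketch)."""
    P_pi = mdp.P[policy]     # transition rows of the chosen actions, shape (n, n)
    r_pi = mdp.r[policy]     # rewards of the chosen actions, shape (n,)
    return np.linalg.solve(np.eye(mdp.n) - gamma * P_pi, r_pi)

# Example with the two-state MDP above: state 0 uses action 0, state 1 uses action 2.
v = policy_values(mdp, np.array([0, 2]), gamma=0.9)
```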

  14.–15. Flux vector
  The flux (primal variable) x^π_a through an action a is the discounted number of times the action is used when starting in all the states:
      x^π = ∑_{i≥0} (γ P^π)^i 1 = (I − γ P^π)⁻¹ 1.
  [Figure: along a path the flux decays as 1, γ, γ², γ³, reaching 1/(1−γ) on a final self-loop.]
  ◮ The flux through an action in π is always between 1 and n/(1−γ) = n ∑_{i=0}^∞ γ^i.
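A corresponding sketch for the flux vector, following the formula as written on the slide (again reusing the hypothetical container; the row/column convention for P^π is an assumption):

```python
import numpy as np

def policy_flux(mdp, policy, gamma):
    """Flux vector x^pi = (I - gamma * P^pi)^{-1} 1 for the chosen actions (sketch)."""
    P_pi = mdp.P[policy]
    return np.linalg.solve(np.eye(mdp.n) - gamma * P_pi, np.ones(mdp.n))

x = policy_flux(mdp, np.array([0, 2]), gamma=0.9)   # each entry lies in [1, n/(1-gamma)]
```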

  16. Linear Program
  MDPs can be solved with the following primal/dual pair of LPs.
  Primal:   maximize ∑_a r_a x_a
            subject to ∑_{a∈A_s} x_a = 1 + γ ∑_a P_{a,s} x_a  for all s ∈ S,   x ≥ 0
  Dual:     minimize ∑_s v_s
            subject to v_s ≥ r_a + γ ∑_{s′} P_{a,s′} v_{s′}  for all s ∈ S, a ∈ A_s
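For concreteness, a sketch of the primal LP fed to an off-the-shelf solver (scipy's linprog minimizes, so the rewards are negated; the helper name and the reuse of the MDP container above are assumptions):

```python
import numpy as np
from scipy.optimize import linprog

def solve_primal_lp(mdp, gamma):
    """Primal LP: max sum_a r_a x_a  s.t. for each state s,
    sum_{a in A_s} x_a - gamma * sum_a P_{a,s} x_a = 1,  x >= 0  (sketch)."""
    m = len(mdp.r)
    A_eq = np.zeros((mdp.n, m))
    for a in range(m):
        A_eq[mdp.state[a], a] += 1.0       # outflow term: a belongs to A_{s(a)}
        A_eq[:, a] -= gamma * mdp.P[a]     # discounted inflow term gamma * P_{a,s}
    res = linprog(-mdp.r, A_eq=A_eq, b_eq=np.ones(mdp.n),
                  bounds=(0, None), method="highs")
    return res.x                            # optimal flux vector

x_opt = solve_primal_lp(mdp, gamma=0.9)
```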

  17. Gain
  The gain (reduced cost) r^π_a of an action a at state s is the improvement for switching to that action for one step:
      r^π_a = (r_a + γ P_a^T v^π) − v^π_s.
  We will pivot on the action with the highest gain.
  [Figure: at state s, action 1 has reward r_1 and probabilities p_1, p_2 to states with values v_1, v_2, so r^π_1 = (r_1 + γ(p_1 v_1 + p_2 v_2)) − v_s; a second action has reward r_2 and probabilities p_3, p_4 to states with values v_3, v_4.]
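A sketch of one simplex step under Dantzig's rule, i.e., pivoting on the highest-gain action (reusing the hypothetical policy_values helper above; the tolerance is an assumption):

```python
import numpy as np

def dantzig_pivot(mdp, policy, gamma):
    """Compute every action's gain r^pi_a = r_a + gamma * P_a . v^pi - v^pi_{s(a)}
    and switch the owning state to the action of largest positive gain (sketch)."""
    v = policy_values(mdp, policy, gamma)
    gains = mdp.r + gamma * mdp.P @ v - v[mdp.state]
    a = int(np.argmax(gains))
    if gains[a] <= 1e-12:                 # no improving action: policy is optimal
        return policy, True
    new_policy = policy.copy()
    new_policy[mdp.state[a]] = a          # pivot: state s(a) now uses action a
    return new_policy, False
```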

  18. Discounted MDPs
  Basic idea: all variables lie in an interval of polynomial size. As a result the gap to the optimum shrinks by a polynomial factor each iteration.
  Suppose 1/(1 − γ) is polynomial. Let π be the current policy, ∆ = max_a r^π_a, and a = argmax_a r^π_a. Then
      r^T x* − r^T x^π = (r^π)^T x* ≤ ∆ · n/(1 − γ).
  Using action a will increase the objective by at least ∆, so the distance to the optimum shrinks by a factor of 1 − (1 − γ)/n.
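The iteration whose progress is measured here is the pivoting step sketched above, repeated until no action has positive gain; a minimal loop (illustrative, reusing the hypothetical helpers) looks like:

```python
import numpy as np

def simplex_dantzig(mdp, gamma, init_policy):
    """Repeat Dantzig pivots until no action has positive gain (sketch)."""
    policy, done, pivots = init_policy, False, 0
    while not done:
        new_policy, done = dantzig_pivot(mdp, policy, gamma)
        if not done:
            pivots += 1
        policy = new_policy
    return policy, pivots

opt_policy, num_pivots = simplex_dantzig(mdp, gamma=0.9, init_policy=np.array([1, 3]))
```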

  19. Discounted MDPs
  Now consider the optimal gains r*. Suppose ∆ = min_{a′∈π} r*_{a′} and a = argmin_{a′∈π} r*_{a′}. Then
      ∆ > r^T x^π − r^T x* > ∆ · n/(1 − γ)   if a ∈ π.
  Therefore if r^T x^π − r^T x* shrinks by a factor of n/(1 − γ), a can never again appear in a policy, and this happens after
      log(n/(1 − γ)) / log(1/(1 − (1 − γ)/n)) = O( (n/(1 − γ)) log(n/(1 − γ)) )
  rounds [Ye10].
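A worked version of the counting step (a reconstruction of the arithmetic, not text from the slides), combining the per-iteration shrink factor from slide 18 with the total shrink needed before action a is frozen out:

```latex
% Gap after k pivots, using the shrink factor 1 - (1-\gamma)/n per iteration:
%   r^T x^* - r^T x^{\pi_k} \le \bigl(1 - \tfrac{1-\gamma}{n}\bigr)^{k} \bigl(r^T x^* - r^T x^{\pi_0}\bigr).
% Action a can no longer appear once the gap has shrunk by a factor of n/(1-\gamma):
\Bigl(1 - \tfrac{1-\gamma}{n}\Bigr)^{k} \le \frac{1-\gamma}{n}
\iff
k \ge \frac{\log\frac{n}{1-\gamma}}{\log\frac{1}{1-(1-\gamma)/n}}
    = O\!\Bigl(\frac{n}{1-\gamma}\,\log\frac{n}{1-\gamma}\Bigr),
% where the last step uses \log\frac{1}{1-x} \ge x with x = (1-\gamma)/n.
```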

  20. Deterministic MDPs
  An action is either on a path or on a cycle.
  If a is on a path, then x_a ∈ [1, n].
  If a is on a cycle, then x_a ∈ [1/(1 − γ), n/(1 − γ)].
  So if x_a ≠ 0, it must lie in one of two layers of polynomial size.
  [Figure: the number line from 0, with the two layers [1, n] and [1/(1 − γ), n/(1 − γ)] marked.]
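A small sketch of how one might separate "cycle" states from "path" states in the functional graph of a deterministic policy (illustrative; succ[s] is assumed to be the unique successor of state s under the policy):

```python
def on_cycle(succ):
    """Mark the states that lie on a cycle of the functional graph succ (sketch).
    Every other state lies on a path leading into some cycle."""
    n = len(succ)
    cyc = [False] * n
    for start in range(n):
        s = start
        for _ in range(n):       # after n steps we are guaranteed to sit on a cycle
            s = succ[s]
        cyc[s] = True            # mark the whole cycle containing s
        t = succ[s]
        while t != s:
            cyc[t] = True
            t = succ[t]
    return cyc

print(on_cycle([1, 2, 1, 0]))    # states 1 and 2 form a cycle -> [False, True, True, False]
```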

  21.–23. Uniform discount
  Lemma. If the algorithm updates a path action, it reduces the gap to the last policy before creating a new cycle by a factor of 1 − 1/n².
  Lemma. After O(n² log n) iterations, either the algorithm finishes, creates a new cycle, breaks a cycle, or some action never again appears in a policy before a new cycle is created.
  Lemma. After O(n²m log n) iterations, either the algorithm finishes or creates a new cycle.

  24. Uniform discount
  Lemma. If the algorithm creates a new cycle, it reduces the gap to the optimum by a factor of 1 − 1/n.
  Lemma. After O(n log n) cycles are created, either the algorithm finishes, some action is eliminated from cycles for the remainder of the algorithm or entirely eliminated from future policies, or the algorithm converges.
  Theorem. The simplex method converges in O(n³m² log² n) iterations on deterministic MDPs.
