

1. Rich Behavioral Models: Illustration on Journey Planning. Mickael Randour, F.R.S.-FNRS & UMONS – Université de Mons, Belgium. March 14, 2019. Workshop – Theory and Algorithms in Graph and Stochastic Games.

2. The talk in one slide
   Strategy synthesis for Markov Decision Processes (MDPs): finding good controllers for systems interacting with a stochastic environment.
   Good? Performance is evaluated through payoff functions. The usual problem is to optimize the expected performance or the probability of achieving a given performance level.
   This is not sufficient for many practical applications, hence several extensions, more expressive but also more complex.
   Aim of this survey talk: give a flavor of classical questions and extensions (rich behavioral models), illustrated on the stochastic shortest path (SSP).

3. Outline
   1 Context, MDPs, strategies
   2 Classical stochastic shortest path problems
   3 Good expectation under acceptable worst-case
   4 Percentile queries in multi-dimensional MDPs
   5 Conclusion

4. Outline recap; next: 1 Context, MDPs, strategies.

5. Multi-criteria quantitative synthesis
   Verification and synthesis: a reactive system to control, an interacting environment, a specification to enforce.
   Model of the (discrete) interaction? Antagonistic environment: 2-player game on a graph. Stochastic environment: MDP.
   Quantitative specifications, for example: reach a state s before x time units (shortest path); minimize the average response time (mean-payoff).
   Focus on multi-criteria quantitative models, to reason about trade-offs and interplays.

6. Strategy (policy) synthesis for MDPs
   [Diagram: informal descriptions of the system and of the environment are modeled as a Markov Decision Process (MDP); the specification is modeled as a winning objective. If a winning strategy exists, that strategy is the controller; if not, empower the system capabilities or weaken the specification requirements.]
   Three questions:
   1 How complex is it to decide if a winning strategy exists?
   2 How complex does such a strategy need to be? Simpler is better.
   3 Can we synthesize one efficiently?

7. Markov decision processes
   MDP D = (S, s_init, A, δ, w):
   - finite sets of states S and actions A,
   - probabilistic transition function δ : S × A → D(S),
   - weight function w : A → Z.
   Run (or play): ρ = s_1 a_1 ... a_{n-1} s_n ... such that δ(s_i, a_i, s_{i+1}) > 0 for all i ≥ 1.
   Set of runs R(D); set of histories (finite runs) H(D).
   Strategy σ : H(D) → D(A) such that, for all h ending in s, Supp(σ(h)) ⊆ A(s).
   [Figure: a four-state example MDP with states s_1, ..., s_4, actions a_1 (weight 2), a_2 (weight -1), b_3 (weight 3), a_3 (weight 0), a_4 (weight 1), and transition probabilities 0.7/0.3 and 0.9/0.1.]
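To make the definition concrete, here is a minimal Python sketch of one possible encoding of an MDP as defined above. The dictionary-based representation, the class name and the toy instance are illustrative choices, not code from the talk.

```python
# Minimal sketch: one possible encoding of an MDP D = (S, s_init, A, delta, w).
from dataclasses import dataclass

@dataclass
class MDP:
    states: set    # S
    s_init: str    # initial state
    actions: dict  # A(s): state -> set of available actions
    delta: dict    # (state, action) -> {successor: probability}
    w: dict        # action -> integer weight

    def check(self):
        # Every enabled (state, action) pair must carry a probability
        # distribution over states.
        for s in self.states:
            for a in self.actions.get(s, set()):
                dist = self.delta[(s, a)]
                assert abs(sum(dist.values()) - 1.0) < 1e-9
                assert all(s2 in self.states for s2 in dist)

# Tiny illustrative instance (not the MDP of the slide): one state `s` whose
# single action `a` loops with probability 0.5 and reaches `t` with probability 0.5.
toy = MDP(states={"s", "t"}, s_init="s",
          actions={"s": {"a"}, "t": set()},
          delta={("s", "a"): {"s": 0.5, "t": 0.5}},
          w={"a": 1})
toy.check()
```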

8. Markov decision processes
   Sample pure memoryless strategy σ on the example MDP of the previous slide.
   Sample run: ρ = s_1 a_1 s_2 a_2 s_1 a_1 s_2 a_2 (s_3 a_3 s_4 a_4)^ω.
   Other possible run: ρ' = s_1 a_1 s_2 a_2 (s_3 a_3 s_4 a_4)^ω.
   Strategies may use finite or infinite memory, and randomness.
   Payoff functions map runs to numerical values:
   - truncated sum up to T = {s_3}: TS^T(ρ) = 2, TS^T(ρ') = 1,
   - mean-payoff: MP(ρ) = MP(ρ') = 1/2,
   - many more.
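A small Python sketch of the two payoff functions mentioned here, evaluated on a finite prefix of the run ρ'. The weights (a_1 = 2, a_2 = -1, a_3 = 0, a_4 = 1, b_3 = 3) are those of the example MDP; the list encoding of runs is an illustrative choice, and since the mean-payoff is defined as a limit over the infinite run, the sketch only averages over a finite prefix.

```python
# Payoff functions on (finite prefixes of) runs, written as alternating
# state/action lists. Weights follow the example MDP: a1=2, a2=-1, a3=0, a4=1.
w = {"a1": 2, "a2": -1, "a3": 0, "a4": 1, "b3": 3}

def truncated_sum(run, target):
    """Sum of action weights until the first visit of `target`; None if never reached."""
    total = 0
    for i in range(0, len(run) - 1, 2):          # run = [s, a, s, a, s, ...]
        state, action = run[i], run[i + 1]
        if state in target:
            return total
        total += w[action]
    return total if run[-1] in target else None  # None stands for +infinity

def mean_payoff_prefix(run):
    """Average weight over a finite prefix (the true MP is the limit over the infinite run)."""
    weights = [w[a] for a in run[1::2]]
    return sum(weights) / len(weights)

# Prefix of rho' = s1 a1 s2 a2 (s3 a3 s4 a4)^omega:
rho_prefix = ["s1", "a1", "s2", "a2", "s3", "a3", "s4", "a4", "s3"]
print(truncated_sum(rho_prefix, {"s3"}))   # 1, as on the slide
print(mean_payoff_prefix(rho_prefix[4:]))  # 0.5 over one loop s3 a3 s4 a4
```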

9. Markov chains
   Once a strategy σ is fixed, the process becomes fully stochastic: a Markov chain (MC) M.
   Its state space is the product of the MDP and the memory of σ.
   An event E ⊆ R(M) has a probability P_M(E); a measurable function f : R(M) → R ∪ {∞} has an expected value E_M(f).

10. Aim of this survey
    Compare different types of quantitative specifications for MDPs, w.r.t. the complexity of the decision problem and w.r.t. the complexity of winning strategies.
    Recent extensions share a common philosophy: a framework for the synthesis of strategies with richer performance guarantees.
    Our work deals with many different payoff functions; this talk focuses on the shortest path problem: not the most involved technically, but with natural applications, and useful to understand the practical interest of each variant.
    Joint work with R. Berthon, V. Bruyère, E. Filiot, J.-F. Raskin, O. Sankur [BFRR17, RRS17, RRS15, BCH+16, Ran16, BRR17].

11. Outline recap; next: 2 Classical stochastic shortest path problems.

12. Stochastic shortest path
    Shortest path problem for weighted graphs: given a state s ∈ S and a target set T ⊆ S, find a path from s to a state t ∈ T that minimizes the sum of the weights along its edges. PTIME algorithms exist (Dijkstra, Bellman-Ford, etc.) [CGR96].
    For the SSP we focus on MDPs with strictly positive weights.
    Truncated sum payoff function, for ρ = s_1 a_1 s_2 a_2 ... and target set T:
    TS^T(ρ) = Σ_{j=1}^{n-1} w(a_j) if s_n is the first visit of T, and TS^T(ρ) = ∞ if T is never reached.
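As a reminder of the classical, non-stochastic case mentioned here, a short Python sketch of Dijkstra's algorithm; the adjacency-dictionary encoding and the function name are illustrative choices, not part of the talk.

```python
# Classical shortest path: Dijkstra's algorithm on a graph given as
# {node: [(neighbour, weight), ...]} with non-negative edge weights.
import heapq

def dijkstra(graph, source, targets):
    """Length of a shortest path from `source` to any node in `targets` (None if unreachable)."""
    dist = {source: 0}
    queue = [(0, source)]
    while queue:
        d, u = heapq.heappop(queue)
        if u in targets:
            return d
        if d > dist.get(u, float("inf")):
            continue                      # stale queue entry
        for v, weight in graph.get(u, []):
            if d + weight < dist.get(v, float("inf")):
                dist[v] = d + weight
                heapq.heappush(queue, (d + weight, v))
    return None
```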

13. Planning a journey in an uncertain environment
    [Figure: journey-planning MDP. From home, three options: take the bike (45 minutes, straight to work); take the car (1 minute), which leads to light traffic with probability 0.2, medium traffic with probability 0.7 or heavy traffic with probability 0.1, after which driving takes 20, 30 or 70 minutes respectively; or go to the railway station (2 minutes) and wait for the train (3 minutes), which arrives with probability 0.9 (then relax for 35 minutes to work) and is delayed with probability 0.1 (wait again, or go back home, 2 minutes).]
    Each action takes time; the target is work.
    What kind of strategies are we looking for when the environment is stochastic?
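A sketch, in Python, of how this journey-planning MDP could be encoded and simulated. The dictionary encoding, the state and action names, and the helper below are illustrative; the durations and probabilities are those read off the slide.

```python
# Journey-planning MDP: (state, action) -> (duration in minutes, successor distribution).
import random

journey = {
    ("home", "bike"):       (45, {"work": 1.0}),
    ("home", "car"):        (1,  {"light": 0.2, "medium": 0.7, "heavy": 0.1}),
    ("home", "railway"):    (2,  {"waiting": 1.0}),
    ("waiting", "wait"):    (3,  {"train": 0.9, "waiting": 0.1}),
    ("waiting", "go back"): (2,  {"home": 1.0}),
    ("train", "relax"):     (35, {"work": 1.0}),
    ("light", "drive"):     (20, {"work": 1.0}),
    ("medium", "drive"):    (30, {"work": 1.0}),
    ("heavy", "drive"):     (70, {"work": 1.0}),
}
TARGET = {"work"}

def simulate(strategy, start="home"):
    """Total travel time of one random journey under a pure memoryless strategy."""
    state, total = start, 0
    while state not in TARGET:
        duration, dist = journey[(state, strategy[state])]
        total += duration
        state = random.choices(list(dist), weights=list(dist.values()))[0]
    return total

take_the_car = {"home": "car", "light": "drive", "medium": "drive", "heavy": "drive"}
print(simulate(take_the_car))  # 21, 31 or 71 minutes, depending on traffic
```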

14. SSP-E: minimizing the expected length to target
    SSP-E problem: given an MDP D = (S, s_init, A, δ, w), a target set T and a threshold ℓ ∈ Q, decide whether there exists σ such that E^σ_D(TS^T) ≤ ℓ.
    Theorem [BT91]: the SSP-E problem can be decided in polynomial time. Optimal pure memoryless strategies always exist and can be constructed in polynomial time.

15. SSP-E: illustration
    [Same journey-planning MDP as before.]
    Pure memoryless strategies suffice.
    Taking the car is optimal: E^σ_D(TS^T) = 1 + 0.2·20 + 0.7·30 + 0.1·70 = 33.

16. SSP-E: PTIME algorithm
    1 Graph analysis (linear time): states not connected to T get value ∞ and are removed; states in T get value 0.
    2 Linear programming (LP, polynomial time). One variable x_s for each s ∈ S \ T:
      maximize Σ_{s ∈ S \ T} x_s
      subject to x_s ≤ w(a) + Σ_{s' ∈ S \ T} δ(s, a, s') · x_{s'}, for all s ∈ S \ T and all a ∈ A(s).

17. SSP-E: PTIME algorithm
    Steps 1 (graph analysis) and 2 (LP) as on the previous slide.
    Optimal solution v: v_s is the expected cost from s to T under an optimal strategy.
    Optimal pure memoryless strategy σ_v: σ_v(s) = argmin_{a ∈ A(s)} ( w(a) + Σ_{s' ∈ S \ T} δ(s, a, s') · v_{s'} ).
    Playing optimally = locally optimizing present + future.
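A sketch of this LP for the journey-planning example, solved with scipy.optimize.linprog. It reuses the illustrative dictionary encoding introduced above (repeated here so the snippet runs on its own); the variable names are my own, not the talk's.

```python
# SSP-E linear program for the journey-planning MDP, solved with scipy.
from scipy.optimize import linprog

journey = {
    ("home", "bike"):       (45, {"work": 1.0}),
    ("home", "car"):        (1,  {"light": 0.2, "medium": 0.7, "heavy": 0.1}),
    ("home", "railway"):    (2,  {"waiting": 1.0}),
    ("waiting", "wait"):    (3,  {"train": 0.9, "waiting": 0.1}),
    ("waiting", "go back"): (2,  {"home": 1.0}),
    ("train", "relax"):     (35, {"work": 1.0}),
    ("light", "drive"):     (20, {"work": 1.0}),
    ("medium", "drive"):    (30, {"work": 1.0}),
    ("heavy", "drive"):     (70, {"work": 1.0}),
}
TARGET = {"work"}
states = sorted({s for s, _ in journey} - TARGET)
idx = {s: i for i, s in enumerate(states)}

# One constraint per enabled (state, action):
#   x_s - sum_{s' not in T} delta(s, a, s') * x_{s'} <= w(a).
A_ub, b_ub = [], []
for (s, a), (weight, dist) in journey.items():
    row = [0.0] * len(states)
    row[idx[s]] += 1.0
    for s2, p in dist.items():
        if s2 not in TARGET:
            row[idx[s2]] -= p
    A_ub.append(row)
    b_ub.append(weight)

# Maximizing sum x_s is the same as minimizing -sum x_s.
res = linprog(c=[-1.0] * len(states), A_ub=A_ub, b_ub=b_ub, bounds=(0, None))
v = dict(zip(states, res.x))
print(v["home"])  # ~33.0: expected time to work under an optimal strategy

# Optimal pure memoryless strategy: argmin_a [ w(a) + sum delta(s,a,s') * v_{s'} ].
sigma = {}
for s in states:
    _, sigma[s] = min((weight + sum(p * v.get(s2, 0.0) for s2, p in dist.items()), a)
                      for (s1, a), (weight, dist) in journey.items() if s1 == s)
print(sigma["home"])  # 'car'
```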

18. SSP-E: PTIME algorithm
    Steps 1 (graph analysis) and 2 (LP) as before.
    In practice, value and strategy iteration algorithms are often used: best performance in most cases but exponential in the worst case; fixed-point algorithms based on successive solution improvements [BT91, dA99, HM14].
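A compact sketch of value iteration for the SSP-E objective, assuming the same illustrative (state, action) -> (weight, distribution) dictionary encoding as the sketches above; the stopping criterion and names are my own choices.

```python
# Value iteration for SSP-E (illustrative): repeatedly apply the update
#   x_s <- min_a [ w(a) + sum_{s'} delta(s, a, s') * x_{s'} ]
# until the values stabilise. Assumes every state can reach the target.
def value_iteration(mdp, target, eps=1e-9):
    values = {s: 0.0 for s, _ in mdp}   # target states implicitly have value 0
    while True:
        change = 0.0
        for s in list(values):
            best = min(weight + sum(p * (0.0 if s2 in target else values[s2])
                                    for s2, p in dist.items())
                       for (s1, a), (weight, dist) in mdp.items() if s1 == s)
            change = max(change, abs(best - values[s]))
            values[s] = best
        if change < eps:
            return values

# print(value_iteration(journey, TARGET)["home"])  # ~33.0, matching the LP
```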

19. Traveling without taking too many risks
    [Same journey-planning MDP as before.]
    Minimizing the expected time to destination makes sense if we travel often and being late is not a problem.
    With the car, however, in 10% of the cases the journey takes 71 minutes.
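To illustrate the risk hidden behind the expectation, a tiny Monte Carlo sketch of the "take the car" strategy; the snippet is self-contained, with the same illustrative numbers as above.

```python
# Empirical distribution of the journey time under the "take the car" strategy:
# 1 minute to start, then 20/30/70 minutes of driving with probabilities 0.2/0.7/0.1.
import random
from collections import Counter

times = [1 + random.choices([20, 30, 70], weights=[0.2, 0.7, 0.1])[0]
         for _ in range(100_000)]
print(sum(times) / len(times))   # close to the expectation 33
print(Counter(times))            # about 10% of the journeys take 71 minutes
```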
