Multiple-Environment Markov Decision Processes: Efficient Analysis and Applications
K. Chatterjee, M. Chmelík, D. Karkhanis, P. Novotný, A. Royer
ICAPS 2020, October 27-30, 2020
Introducing MEMDPs

Figure: A MEMDP augments the standard MDP framework with the notion of environments or contexts (a small example over states s0, ..., s3 with environment-dependent transition probabilities).

[1] Multiple-Environment Markov Decision Processes, J.-F. Raskin and O. Sankur, 2014
Introducing MEMDPs

Definition [1]
Formally, a MEMDP is a tuple (I, S, A, {δ_i}, {r_i}, s_0, λ), where:
• S is a finite set of control states;
• A is a finite alphabet of actions;
• I is a finite set of environments;
• {δ_i}, i ∈ I, is a collection of probabilistic transition functions, one for every environment;
• {r_i}, i ∈ I, is a collection of reward functions, one for every environment;
• s_0 ∈ S is the initial state;
• λ ∈ D(I) is the initial distribution over the environments.
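To make the definition concrete, here is a minimal Python sketch of the tuple above as a data structure; the class and field names are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

State, Action, Env = int, int, int
SA = Tuple[State, Action]

@dataclass
class MEMDP:
    states: List[State]                    # S: finite set of control states
    actions: List[Action]                  # A: finite alphabet of actions
    envs: List[Env]                        # I: finite set of environments
    delta: Dict[Env, Dict[SA, Dict[State, float]]]  # δ_i(s, a)(s') = transition probability
    reward: Dict[Env, Dict[SA, float]]     # r_i(s, a) = immediate reward in environment i
    s0: State                              # initial state
    lam: Dict[Env, float]                  # λ: initial distribution over environments
```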
Introducing MEMDPs

In summary, MEMDPs augment MDPs with multiple environment hypotheses, aiming to design a single controller that performs well in all of them. Previous work [1] studies the existence of winning and almost-winning strategies in MEMDPs.
Applications

Orthogonal to this, in this work we explore the practicality of MEMDPs across different settings and applications.

Example: Recommendation systems as MEMDPs
A MEMDP can be used to build an MDP-based recommender that is tailored to different user profiles (environments), with potentially different transition functions; a toy instantiation follows below.

Figure: A recommender MEMDP over book genres (Fantasy, Sci-fi, History), whose states track the recently purchased genres (e.g., Fantasy book, (Fantasy, Sci-fi) book, (History, History) book), with environment-dependent transition probabilities.
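As a toy instantiation, the sketch below builds a two-profile recommender on top of the illustrative MEMDP class from the definition; all states, actions, and probabilities are made-up assumptions for the example.

```python
# States: 0 = no recent purchase, 1 = last bought Fantasy, 2 = last bought History.
# Actions: 0 = recommend a Fantasy book, 1 = recommend a History book.
# Convention here: a rejected recommendation sends the user back to state 0.
fantasy_fan = {
    (0, 0): {1: 0.9, 0: 0.1}, (0, 1): {2: 0.2, 0: 0.8},
    (1, 0): {1: 0.9, 0: 0.1}, (1, 1): {2: 0.1, 0: 0.9},
    (2, 0): {1: 0.7, 0: 0.3}, (2, 1): {2: 0.3, 0: 0.7},
}
history_buff = {
    (0, 0): {1: 0.2, 0: 0.8}, (0, 1): {2: 0.9, 0: 0.1},
    (1, 0): {1: 0.3, 0: 0.7}, (1, 1): {2: 0.8, 0: 0.2},
    (2, 0): {1: 0.1, 0: 0.9}, (2, 1): {2: 0.9, 0: 0.1},
}
profiles = {0: fantasy_fan, 1: history_buff}

recommender = MEMDP(
    states=[0, 1, 2],
    actions=[0, 1],
    envs=[0, 1],                      # 0 = fantasy fan, 1 = history buff
    delta=profiles,
    # Expected immediate reward = acceptance probability of the recommendation.
    reward={i: {sa: 1.0 - succ.get(0, 0.0) for sa, succ in d.items()}
            for i, d in profiles.items()},
    s0=0,
    lam={0: 0.5, 1: 0.5},             # uniform prior over the two profiles
)
```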
A subcase of POMDPs

MEMDPs are POMDPs
Every MEMDP can be formulated as a partially observable MDP by considering the cross-product of states and environments.

Figure: Converting a MEMDP (left) to a POMDP (right) by duplicating the state space once per environment.
A subcase of POMDPs

Consequently, POMDP solvers can be readily applied to the MEMDP framework. However, we show that developing MEMDP-specific solvers can significantly improve performance.
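A minimal sketch of this cross-product construction, reusing the illustrative MEMDP class from above; the encoding (product states (s, i), observations revealing only s) follows the figure, but the function and return shapes are assumptions.

```python
def memdp_to_pomdp(m: MEMDP):
    """Encode a MEMDP as a POMDP over product states (s, i).

    The environment component i never changes along a run, and the
    observation function reveals only the control state s.
    """
    product_states = [(s, i) for i in m.envs for s in m.states]
    transitions = {}  # ((s, i), a) -> {(s', i): probability}
    for i in m.envs:
        for (s, a), succ in m.delta[i].items():
            # No cross-environment transitions: i is copied unchanged.
            transitions[((s, i), a)] = {(s2, i): p for s2, p in succ.items()}

    def observe(product_state):
        s, _ = product_state
        return s  # the environment component stays hidden

    # Initial belief: s0 paired with the prior λ over environments.
    init_belief = {(m.s0, i): m.lam[i] for i in m.envs}
    return product_states, transitions, observe, init_belief
```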
Solving MEMDPs: A summary

Sparse transition function
The partially observable (PO) feature (the environment I) is sampled only once, at initialization, and then kept constant. Thus there are no transitions across environments, and we can store the transition function more efficiently.
⇒ Memory usage: O(|S|²|I||A|) (instead of O(|S|²|I|²|A|))
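A small sketch of the memory argument, assuming dense NumPy arrays for concreteness: one transition table per environment versus a single table over the product state space.

```python
import numpy as np

S, A, I = 100, 5, 8  # illustrative sizes

# Per-environment storage: |I| tables of shape (S, A, S)
per_env = [np.zeros((S, A, S)) for _ in range(I)]   # O(|S|^2 |I| |A|) entries

# Naive POMDP storage over product states (s, i): shape (S*I, A, S*I)
product = np.zeros((S * I, A, S * I))               # O(|S|^2 |I|^2 |A|) entries

print(sum(t.size for t in per_env), "vs", product.size)  # 400000 vs 3200000
```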
Faster belief updates
In a MEMDP, the uncertainty lies in the environment rather than in the states. Furthermore, as noted before, the PO feature is static once sampled.
⇒ The belief update can be done linearly in O(|I|) (rather than quadratically in the number of states, O(|S|²|I|²))
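A minimal sketch of this O(|I|) Bayesian update; since the control state is fully observed, the belief is just a distribution over environments, updated after each observed transition (s, a, s').

```python
def update_belief(m: MEMDP, belief, s, a, s_next):
    """One Bayes step over environments after observing (s, a, s').

    belief: dict mapping each environment i to P(i | history).
    Cost is O(|I|): one multiplication per environment, plus normalization.
    """
    new_belief = {i: belief[i] * m.delta[i].get((s, a), {}).get(s_next, 0.0)
                  for i in m.envs}
    z = sum(new_belief.values())  # probability of the observed transition
    if z == 0.0:
        raise ValueError("observed transition impossible in every environment")
    return {i: p / z for i, p in new_belief.items()}
```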
Monotonic expected belief entropy
In a MEMDP, the entropy of the current belief captures the uncertainty over the environments, and is a (non-strictly) decreasing function in expectation.
⇒ Monotonicity guarantee when using this quantity as a heuristic [7]

[7] Exact and approximate algorithms for partially observable Markov decision processes, Cassandra, 1998
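Spelled out, this is the standard "conditioning reduces entropy on average" fact applied to the environment belief; a sketch of the statement, where o ranges over the observable outcomes of one step:

```latex
\mathbb{E}_{o \sim P(\cdot \mid b)}\big[\, H(b'_o) \,\big] \;\le\; H(b),
\qquad \text{where } H(b) = -\sum_{i \in I} b(i)\,\log b(i)
\ \text{ and }\ b'_o(i) = P(i \mid b, o).
```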
Optimized Solvers

We use these properties to optimize two classic POMDP solvers for MEMDP applications:
• SPBVI: based on PBVI [3], with faster and more memory-efficient belief set expansion.
• POMCP [4]: on top of the faster belief update, we propose two further variants:
  • POMCP-ex: exact belief updates (rather than approximations) can be performed efficiently in MEMDPs
  • PAMCP: a caching mechanism that retains past histories for future executions, to better handle a stream of input queries

[3] Point-based value iteration: An anytime algorithm for POMDPs, Pineau et al., IJCAI 2003
[4] Monte-Carlo Planning in Large POMDPs, Silver and Veness, NeurIPS 2010
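As one illustration, a hedged sketch of the POMCP-ex idea as we read it: instead of POMCP's particle approximation of the root belief, keep the exact O(|I|) environment belief (via update_belief above) and sample a (state, environment) pair from it to seed each simulation. The paper's actual implementation may differ.

```python
import random

def sample_root_state(m: MEMDP, belief, current_state):
    """Seed one Monte-Carlo simulation with a fully specified hidden state.

    In a MEMDP the control state is observed exactly, so only the
    environment component must be sampled from the (exact) belief.
    """
    envs, weights = zip(*belief.items())
    env = random.choices(envs, weights=weights)[0]
    return current_state, env
```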
Experiment: Recommender systems

In prior work, MDPs have been used to capture long-term interactions in recommender systems [5], assuming a fixed environment for each user. We instead propose to learn a controller that handles different user profiles by modeling this task as a MEMDP.

(synthetic)       MDP           SPBVI   POMCP         POMCP-ex      PAMCP         PAMCP-ex
Accuracy          0.12 ± 0.03   -       0.64 ± 0.27   0.77 ± 0.07   0.68 ± 0.24   0.75 ± 0.08
Env. prediction   -             -       0.79 ± 0.33   0.96 ± 0.04   0.85 ± 0.30   0.94 ± 0.06
Runtime           5h30mn        OOM     9mn36s        14s           14s           36s

Table 1: Synthetic dataset experiments (using 8 environments, 8 products, sequences of length 5)

(Foodmart)        MDP           SPBVI         POMCP         POMCP-ex
Accuracy          0.61 ± 0.14   0.62 ± 0.14   0.62 ± 0.14   0.62 ± 0.14
Precision         0.74 ± 0.09   -             0.78 ± 0.07   0.78 ± 0.08
Env. prediction   -             0.60 ± 0.31   0.54 ± 0.35   0.53 ± 0.36
Runtime           11mn57s       12mn38s       46s           23s

Table 2: Foodmart dataset experiments (using 8 environments*, 3 products, sequences of length 8)
*: Environments are generated in a greedy manner, using perplexity as the metric

[5] An MDP-based Recommender System, Shani et al., JMLR 2005
Experiment: Maze solving with failure rate

The parametric Hallway maze problem consists of solving a maze in which the agent has a certain (unknown) probability of "skidding", i.e., of an action failing, which we capture as different environments in a MEMDP.
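To illustrate the modeling step, a sketch of how a family of skid probabilities could be turned into MEMDP environments; the corridor dynamics and the candidate rates below are illustrative assumptions, not the paper's exact setup.

```python
# Deterministic base dynamics: a 4-cell corridor, action 0 moves right.
base_delta = {(s, 0): min(s + 1, 3) for s in range(4)}

def skid_env(base, p_skid):
    """Derive one environment's transition function from deterministic dynamics.

    base: dict (s, a) -> intended successor state.
    With probability p_skid the action fails and the agent stays put.
    """
    return {(s, a): ({s_next: 1.0 - p_skid, s: p_skid} if s_next != s
                     else {s: 1.0})
            for (s, a), s_next in base.items()}

# One environment per hypothesized failure rate (rates made up here).
rates = [0.0, 0.1, 0.2, 0.4]
deltas = {i: skid_env(base_delta, p) for i, p in enumerate(rates)}
```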
Conclusions

• MEMDPs are a straightforward tool for introducing context into MDPs
• Standard POMDP solvers can be significantly optimized by considering the specificities of MEMDPs:
  • Sparse transition function
  • Faster belief updates
  • Monotonicity of the expected belief entropy
• We additionally verify the practicality of MEMDP-specific solvers through several experiments on recommender systems and on a parametric version of the standard maze-solving problem