Online Planning for Decentralized Stochastic Control with Partial History Sharing
Kaiqing Zhang, Erik Miehling, and Tamer Başar
Coordinated Science Lab — UIUC
American Control Conference — Philadelphia, PA
July 11, 2019
Decentralized Stochastic Control

• Control of a dynamic system by multiple agents, each possessing different information
• Also termed Dec-POMDPs in the learning/CS community
• Example domains: smart grid, robotics, unmanned aerial vehicles, MOBA video games
• Asymmetric information → no single agent has knowledge of all previous events
• Dynamic programming techniques quickly become computationally intractable
Decentralized Stochastic Control with Partial History Sharing

• In practice, agents may have some information (history) in common
  • Agents may observe each other's actions, e.g., fleet control [Gerla et al., '14]
  • Agents may share some common observations, e.g., cooperative robot navigation [Lowe et al., '18; Zhang et al., '18]
• This common information can be used to reduce the policy search space
  • Decentralized POMDP → centralized POMDP [Nayyar et al., '13; Mahajan & Mannan, '13]
  • Sufficient information: belief over the system state and the local information of each agent
Related Work

• Common information approach + dynamic programming decomposition (requires the model to be known) [Nayyar et al., '13]
• The common information-based reformulation is a generalization of occupancy-state MDPs [Dibangoye et al., '16, '18] for Dec-POMDPs
• Model-free/sampling-based planning heuristics for Dec-POMDPs:
  • Dec-POMDP → non-observable MDP, solved by heuristic tree search [Oliehoek et al., '14]
  • Monte-Carlo sampling + policy iteration/expectation-maximization [Wu et al., '10, '13]
  • Monte-Carlo tree search for special Dec-POMDPs [Amato et al., '13; Best et al., '18]
• These methods require a centralized coordinator [Amato et al., '13; Oliehoek et al., '14; Dibangoye et al., '18] or explicit communication [Oliehoek et al., '12]
Our Contribution

• Development of a tractable online + decentralized planning algorithm for decentralized stochastic control with partial history sharing
• Does not require an explicit model representation, only a generative model (black-box simulator)
• Does not require explicit communication among agents
• Possesses provable convergence guarantees
• The proposed algorithm unifies some recently developed Dec-POMDP solvers
Decentralized Stochastic Control with Partial History Sharing — Model

• Consider a dynamical system with state $X_t$, controlled by $N$ agents, where each agent $i$ at time $t$ has:
  • Local memory/information $M_t^i$
  • Local action $U_t^i$
  • Local observations $Y_t^i$
  • Common information $C_t$
• Let $H_t^i = \{Y_{1:t}^i, U_{1:t-1}^i\}$ and $H_t = \cup_i H_t^i$; the common information $C_t$ is a subset of $H_t$
• Dynamics:
  • State: $X_{t+1} = f_t(X_t, U_t^{1:N}, W_t^0)$, with local observations $Y_{t+1}^i = h_{t+1}^i(X_{t+1}, W_{t+1}^i)$
  • Information:
    $Z_{t+1} = \zeta_{t+1}(M_t^{1:N}, U_t^{1:N}, Y_{t+1}^{1:N})$ ← common info. increment
    $M_{t+1}^i = \xi_{t+1}^i(M_t^i, U_t^i, Y_{t+1}^i)$ ← updated local memory
    $C_{t+1} = C_t \cup Z_{t+1}$ ← updated common info.
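Since the planner introduced later interacts with these dynamics only through a black-box simulator, a toy sketch of such a generative model may help fix ideas. The class name, method signature, and dynamics below are assumptions of this sketch, not the paper's API:

```python
import random

class ToyGenerativeModel:
    """Illustrative black-box simulator for the model above.
    step() plays the role of sampling X_{t+1} = f_t(X_t, U_t, W_t^0)
    and the local observations Y_{t+1}^i; everything here is made up."""

    def __init__(self, n_agents=2, n_states=3, seed=0):
        self.n_agents = n_agents
        self.n_states = n_states
        self.rng = random.Random(seed)

    def step(self, x, u):
        # Toy dynamics: the next state is driven by the joint action.
        x_next = (x + sum(u)) % self.n_states
        # Each agent gets a noisy local observation of the new state.
        y = tuple((x_next + self.rng.choice([0, 1])) % self.n_states
                  for _ in range(self.n_agents))
        r = 1.0 if x_next == 0 else 0.0  # common reward shared by all agents
        return x_next, y, r
```

Calling step(x, u) returns a sampled next state, a tuple of local observations, and the common reward, which is all the planning algorithm later requires.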
Examples of Partial History Sharing

• Control sharing: all past actions are common
  $C_t = \{U_{1:t-1}^{1:N}\}$, local memory $M_t^i = \{Y_{1:t}^i\}$
• Delayed sharing ($d$-step): observations and actions become common after a delay of $d$ steps (see the sketch after this slide)
  $C_t = \{Y_{1:t-d}^{1:N}, U_{1:t-d}^{1:N}\}$, local memory $M_t^i = \{Y_{t-d+1:t}^i, U_{t-d+1:t-1}^i\}$
• Some additional examples are periodic sharing, delayed state sharing, and others (see [Nayyar et al., '13])
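A minimal sketch of the $d$-step delayed sharing update reconstructed above: each agent buffers its last $d$ observation-action pairs, and the $d$-step-old pair of every agent becomes the common-information increment $Z_{t+1}$. Function and argument names are illustrative, not the paper's code:

```python
from collections import deque

def delayed_sharing_step(memories, common_info, y_new, u_prev, d=1):
    """One time step of d-step delayed sharing (illustrative sketch)."""
    increment = []
    for i, mem in enumerate(memories):
        mem.append((y_new[i], u_prev[i]))         # newest private data enters memory
        if len(mem) > d:
            increment.append((i, mem.popleft()))  # d-step-old data is released
    common_info.extend(increment)                 # C_{t+1} = C_t ∪ Z_{t+1}
    return increment                              # Z_{t+1}

# Usage: two agents, 1-step delay; the first step releases nothing yet.
memories = [deque(), deque()]
common_info = []
z = delayed_sharing_step(memories, common_info,
                         y_new=('y0', 'y1'), u_prev=(0, 1), d=1)
```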
Decentralized Stochastic Control with Partial History Sharing — Model (cont'd)

• The goal is to find a joint control policy $g = (g^1, \ldots, g^N)$, consisting of local control policies $U_t^i = g_t^i(M_t^i, C_t)$, such that the total expected discounted reward
  $J(g) = \mathbb{E}^g\big[\textstyle\sum_t \gamma^t R(X_t, U_t^{1:N})\big]$
  is maximized
• Note that all agents possess the same goal, i.e., they have a common reward function $R$
Common Information Approach

• Consider a coordinator that has access to the common information $C_t$
• The coordinator solves for prescriptions $\Gamma_t^i$ that map each agent's local information to a local action, $U_t^i = \Gamma_t^i(M_t^i)$
• The coordinator's problem is a POMDP with modified state, action, and observation processes [Nayyar et al., '13]:
  • state: $(X_t, M_t^{1:N})$
  • action: joint prescription $\Gamma_t = (\Gamma_t^1, \ldots, \Gamma_t^N)$
  • observation: $Z_{t+1}$
• Define the virtual history as $h_t = \{Z_{1:t}, \Gamma_{1:t-1}\}$; the information state of the coordinator's POMDP is the common information based belief $\pi_t = \mathbb{P}(X_t, M_t^{1:N} \mid h_t)$
Common Information Approach (cont'd)

• Given a virtual history $h_t$, the coordinator determines a joint prescription using a coordination strategy $\psi = (\psi_t)_{t \geq 0}$, where $\Gamma_t = \psi_t(h_t)$
• The coordinator's objective is to find a coordination strategy profile $\psi$ to maximize $\mathbb{E}^{\psi}\big[\sum_t \gamma^t R(X_t, U_t^{1:N})\big]$, where $U_t^i = \Gamma_t^i(M_t^i)$
• A dynamic programming decomposition to solve for the optimal prescriptions exists [Nayyar et al., '13]
• Note: since the coordinator's information is common to all agents, each agent can (in principle) perform this computation itself (a concrete illustration of a prescription follows below)
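Concretely, with finite spaces a prescription is just a lookup table from local memory to local action. A toy illustration, with all values made up:

```python
# A prescription maps agent i's local memory to a local action; with
# finite memory and action spaces it is a finite lookup table.
prescription_i = {'y_low': 0, 'y_high': 1}   # local memory -> local action

# Each agent evaluates the commonly computed prescription on its own
# private memory; no communication is needed for this step.
local_memory_i = 'y_high'
u_i = prescription_i[local_memory_i]         # agent i's realized action
print(u_i)  # -> 1
```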
Challenges with the Common Information Approach

• The decision variables of the coordinator are functions (prescriptions)
• Under finite action and observation spaces, the space of these functions is also finite, but very large
• Examples (a quick count follows below):
  • 1-step delayed sharing: each prescription maps an agent's most recent observation to an action, giving $|\mathcal{U}^i|^{|\mathcal{Y}^i|}$ prescriptions per agent
  • Control sharing: local memory is the entire local observation history, so the prescription space grows (doubly exponentially) with time
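To make the blow-up concrete, here is a quick count of the prescription space under 1-step delayed sharing; the cardinalities are made up for illustration:

```python
# Number of prescriptions per agent: one action for every possible
# value of the local memory (here, the last observation), i.e. |U|^|Y|.
n_obs, n_act, n_agents = 4, 3, 2     # made-up cardinalities
per_agent = n_act ** n_obs           # 3^4 = 81 prescriptions per agent
joint = per_agent ** n_agents        # 81^2 = 6561 joint prescriptions
print(per_agent, joint)
```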
Decentralized Online Planning

• Inspired by the single-agent Partially Observable Monte-Carlo Planning (POMCP) algorithm of [Silver & Veness, '10]
• The sampling-based approach helps to alleviate the computational challenges associated with the large state space

Key Ideas of the Algorithm:
• Solve the coordinator's POMDP via online tree search
• Nodes of each search tree are virtual histories, with edges consisting of joint prescriptions and new common information
• Agents possess a common random seed to avoid the need to communicate (an idea used before in [Bernstein et al., '09; Oliehoek et al., '09; Arabneydi & Mahajan, '15])
Decentralized Online Planning (cont'd)

• Each agent constructs the same set of search trees, so the trees are identical across all agents
• Search trees are constructed iteratively by running simulations from the current history, one tree per agent (a sketch of the seeded planning loop follows below)
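One way to picture the seed-based coordination: every agent runs the same planning routine with the same seed, so all random draws, and hence the trees, coincide without any messages being exchanged. A sketch under assumed names (the `simulate` helper and all arguments are hypothetical):

```python
import random

def plan(common_history, seed, n_sims, model, simulate):
    """Every agent calls this identical routine with the same `seed`
    and the same common history, so the agents grow identical search
    trees without communicating. Illustrative sketch only."""
    rng = random.Random(seed)   # common random seed, agreed upon offline
    tree = {}                   # virtual history -> node statistics
    for _ in range(n_sims):
        simulate(tree, common_history, model, rng)  # one tree-search simulation
    return tree
```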
Search Stage

• Each simulation begins by sampling from the common information based belief at the current history node
• The search tree is expanded by either rollout or selection of a joint prescription via UCB1 [Auer et al., '02]:
  $\Gamma = \arg\max_{\gamma} \; \hat{V}(h\gamma) + c \sqrt{\log N(h) / N(h\gamma)}$
  where $N(h)$ is the number of previous simulation visits to the virtual history $h$, and $\hat{V}(h\gamma)$ is the estimated value of choosing prescription $\gamma$ in virtual history $h$ (a code sketch follows below)
• Successive simulations build out the search trees until a stopping condition (e.g., a timeout) is met
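A minimal sketch of the UCB1 selection rule above over joint prescriptions; the node/child attribute names (`children`, `N`, `V`) are assumptions of this sketch:

```python
import math

def ucb1_select(node, c=1.0):
    """Pick the joint prescription maximizing V(h gamma) +
    c * sqrt(log N(h) / N(h gamma)). `node.children` maps each candidate
    prescription to a child holding its visit count N and value estimate
    V; these attribute names are illustrative."""
    n_h = sum(child.N for child in node.children.values())  # N(h)

    def score(child):
        if child.N == 0:
            return float('inf')   # try every prescription at least once
        return child.V + c * math.sqrt(math.log(n_h) / child.N)

    return max(node.children, key=lambda g: score(node.children[g]))
```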
Belief Update

• A joint prescription is selected, actions are realized, and new common information is revealed to the agents
• The belief at a given history node $h$ is approximated by a set of $K$ particles (samples of the state and joint local memories), denoted $B(h)$
• The belief is updated by the following procedure (sketched in code below):
  • Draw a particle uniformly from $B(h)$
  • Generate joint actions from the selected prescription
  • Call the generative model to construct a sample of new common information and updated local memories
  • If the sampled common information matches the true common information, add the particle to the updated belief set
  • Repeat until $K$ particles have been accepted
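The rejection-sampling belief update, sketched in code; `model.step` and the `make_increment` helper (which forms the increment $Z$ and the updated memories) are assumed interfaces, not the paper's API:

```python
def update_belief(B_h, gamma, z_true, model, make_increment, K, rng,
                  max_tries=100_000):
    """Approximate the belief at the child node reached by prescription
    `gamma` and observed common-information increment `z_true`.
    A particle is a pair (state, joint local memories); helper names
    are assumptions of this sketch."""
    B_next = []
    tries = 0
    while len(B_next) < K and tries < max_tries:  # cap rejection sampling
        tries += 1
        x, m = rng.choice(B_h)                    # draw a particle uniformly
        u = tuple(g_i(m_i) for g_i, m_i in zip(gamma, m))  # actions via prescription
        x2, y2, _ = model.step(x, u)              # generative model call
        z2, m2 = make_increment(m, u, y2)         # new common info + memories
        if z2 == z_true:                          # keep only consistent particles
            B_next.append((x2, m2))
    return B_next
```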
Convergence

• Due to the common source of randomness, the planning procedure is identical for, and decoupled across, all agents
• Convergence of the decentralized online planning algorithm can therefore be characterized by that of the (single-agent) POMCP algorithm [Silver & Veness, '10]

Case study:
• Applied to a novel security setting (collaborative intrusion response)
• Agents choose defense actions in response to security alert information
• Actions and alerts are copied to a centralized database with delay (1-step delayed sharing)
• [Figure: performance plot; only axis ticks (0 to 2 vs. 400 to 1600) survived extraction]
Existing Algorithms: MAA*

• Heuristic tree-search algorithm from [Szer et al., '05]
• Dec-POMDP → non-observable MDP (NOMDP) [Oliehoek et al., '14]
• Can be viewed as the designer's approach from [Nayyar et al., '13]

|                      | MAA*                                                       | Our algorithm                                   |
|----------------------|------------------------------------------------------------|-------------------------------------------------|
| Centralized problem  | NOMDP                                                      | POMDP                                           |
| Common information   | Empty                                                      | General                                         |
| Local memory         | Local observation history                                  | General                                         |
| State                | System state + joint local observation history             | System state + joint local memory               |
| History              | Joint policies                                             | Joint prescriptions + common information        |
| Sufficient statistic | Belief over system state + joint local observation history | Belief over system state and joint local memory |
Existing Algorithms: Occupancy-state MDPs

• Dec-POMDP → occupancy-state MDP [Dibangoye et al., '16]
• The occupancy state (belief over state + joint local histories) is a sufficient statistic for optimal planning

|                      | Occupancy-state MDP                              | Our algorithm                                   |
|----------------------|--------------------------------------------------|-------------------------------------------------|
| Centralized problem  | NOMDP                                            | POMDP                                           |
| Common information   | Empty                                            | General                                         |
| Local memory         | Local histories (observations + actions)         | General                                         |
| State                | System state + joint local histories             | System state + joint local memory               |
| History              | Joint policies                                   | Joint prescriptions + common information        |
| Sufficient statistic | Belief over system state + joint local histories | Belief over system state and joint local memory |