Online Planning for Decentralized Stochastic Control



  1. Online Planning for Decentralized Stochastic Control with Partial History Sharing
     Kaiqing Zhang, Erik Miehling, and Tamer Başar
     Coordinated Science Lab, UIUC
     American Control Conference, Philadelphia, PA, July 11, 2019

  2. Decentralized Stochastic Control
     • Control of a dynamic system by multiple agents, each possessing different information
     • Also termed Dec-POMDPs in the learning/CS community
     • Examples: smart grid, robotics, unmanned aerial vehicles, MOBA video games
     • Asymmetric information → no single agent has knowledge of all previous events
     • Dynamic programming techniques quickly become computationally intractable

  3. Decentralized Stochastic Control with Partial History Sharing
     • In practice, agents may have some information (history) in common
       • Agents may observe each other’s actions, e.g., fleet control [Gerla et al., ’14]
       • Agents may share some common observations, e.g., cooperative robot navigation [Lowe et al., ’18; Zhang et al., ’18]
     • This common information can be used to reduce the policy search space
       • Decentralized POMDP → centralized POMDP [Nayyar et al., ’13; Mahajan & Mannan, ’13]
       • Sufficient information: belief over the system state and local information of each agent

  4. Related Work
     • Common information approach + dynamic programming decomposition (requires the model to be known) [Nayyar et al., ’13]
     • The common information-based reformulation is a generalization of occupancy-state MDPs [Dibangoye et al., ’16, ’18] for Dec-POMDPs
     • Model-free/sampling-based planning heuristics for Dec-POMDPs:
       • Dec-POMDP → non-observable MDP, solved by heuristic tree search [Oliehoek et al., ’14]
       • Monte-Carlo sampling + policy iteration/expectation-maximization [Wu et al., ’10, ’13]
       • Monte-Carlo tree search for special Dec-POMDPs [Amato et al., ’13; Best et al., ’18]
     • These methods require a centralized coordinator [Amato et al., ’13; Oliehoek et al., ’14; Dibangoye et al., ’18] or communication [Oliehoek et al., ’12]

  5. Our Contribution
     • Development of a tractable online + decentralized planning algorithm for decentralized stochastic control with partial history sharing
     • Does not require an explicit model representation, only a generative model (black-box simulator)
     • Does not require explicit communication among agents
     • Possesses provable convergence guarantees
     • The proposed algorithm unifies some recently developed Dec-POMDP solvers

  6. Decentralized Stochastic Control with Partial History Sharing: Model
     • Consider a dynamical system consisting of multiple agents, with the following quantities [symbols not recovered]:
       • Local memory/information
       • Local action
       • Local observations
       • Common information
     • The common information is a subset of the joint history [defining expression not recovered]
     • Dynamics
       • State: [equation not recovered]
       • Information: a common information increment, the updated local memory, and the updated common information [equations not recovered]
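The model on this slide can be sketched in the notation commonly used for partial history sharing, following [Nayyar et al., ’13]. Every symbol below ($X_t$, $U_t^i$, $Y_t^i$, $M_t^i$, $C_t$, $Z_t^i$, the noise terms $W_t^0, W_t^i$, and the maps $f_t$, $h_t^i$) is an assumption borrowed from that line of work, and the timing conventions may differ from those of the original slide.

```latex
\begin{align*}
X_{t+1} &= f_t\big(X_t, U_t^{1:n}, W_t^0\big)
  && \text{(state)} \\
Y_t^i &= h_t^i\big(X_t, W_t^i\big)
  && \text{(local observation of agent } i\text{)} \\
C_t &\subseteq \big\{Y_{1:t-1}^{1:n},\, U_{1:t-1}^{1:n}\big\}
  && \text{(common information)} \\
M_t^i &\subseteq \big\{Y_{1:t-1}^{i},\, U_{1:t-1}^{i}\big\}
  && \text{(local memory of agent } i\text{)} \\
Z_t^i &\subseteq \big\{M_t^i,\, Y_t^i,\, U_t^i\big\}
  && \text{(common info. increment)} \\
M_{t+1}^i &\subseteq \big\{M_t^i,\, Y_t^i,\, U_t^i\big\} \setminus Z_t^i
  && \text{(updated local memory)} \\
C_{t+1} &= \big\{C_t,\, Z_t^{1:n}\big\}
  && \text{(updated common info.)}
\end{align*}
```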

  7. Examples of Partial History Sharing
     • Control sharing: [definition not recovered]
     • Delayed sharing (d-step): [definition not recovered]
     • Additional examples include periodic sharing, delayed state sharing, and others (see [Nayyar et al., ’13])
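As a rough illustration of the two named examples, the sketch below splits each agent's history into a common part and a local part. It assumes the usual definitions from the partial history sharing literature: under d-step delayed sharing, everything older than d steps is common; under control sharing, all past actions are common while observations stay local. The function names and the `(observation, action)` encoding are hypothetical.

```python
from typing import List, Tuple

Step = Tuple[str, str]  # (observation, action) at one time step; toy encoding


def split_delayed_sharing(histories: List[List[Step]], d: int):
    """Split each agent's history (steps 1..t-1) under d-step delayed sharing.

    Assumption: steps 1..t-d are common, the most recent d-1 steps are local.
    """
    if d < 1:
        raise ValueError("delay d must be at least 1")
    common, local = [], []
    for h in histories:
        cut = max(len(h) - (d - 1), 0)  # index of the first still-private step
        common.append(h[:cut])
        local.append(h[cut:])
    return common, local


def split_control_sharing(histories: List[List[Step]]):
    """Assumption: all past actions are common, past observations stay local."""
    common = [[a for (_, a) in h] for h in histories]
    local = [[o for (o, _) in h] for h in histories]
    return common, local


histories = [
    [("y1", "u1"), ("y2", "u2"), ("y3", "u3")],  # agent 1
    [("z1", "v1"), ("z2", "v2"), ("z3", "v3")],  # agent 2
]
common_d2, local_d2 = split_delayed_sharing(histories, d=2)
common_cs, local_cs = split_control_sharing(histories)
```

With `d=1` the local memory is empty, so each agent's only private information is its current observation.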

  8. Decentralized Stochastic Control with Partial History Sharing: Model (cont’d)
     • The goal is to find a joint control policy, consisting of local control policies, such that the total expected discounted reward is maximized
     • Note that all agents possess the same goal, i.e., they have a common reward function
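The objective can be written out as follows. This is a sketch only: the symbols $g$ (joint policy), $g^i$ (local policies), $\beta$ (discount factor), and $R$ (common reward), as well as the infinite horizon, are assumptions, since the slide's own formula is not available.

```latex
\[
  \max_{g = (g^1, \dots, g^n)} \; J(g)
  \;=\;
  \mathbb{E}^{g}\!\left[\, \sum_{t=1}^{\infty} \beta^{\,t-1}\, R\big(X_t, U_t^{1:n}\big) \right],
  \qquad \beta \in (0, 1).
\]
```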

  9. Common Information Approach
     • Consider a coordinator that has access to the common information
     • The coordinator solves for prescriptions that map each agent’s local information to a local action
     • The coordinator’s problem is a POMDP with modified state, action, and observation processes [Nayyar et al., ’13]
       • State: system state + joint local memory
       • Action: joint prescription
       • Observation: new common information
     • Define the virtual history as the sequence of joint prescriptions and common information
     • The information state of the coordinator’s POMDP is the common information based belief

  10. Common Information Approach
      • Given a virtual history, the coordinator determines a joint prescription using a coordination strategy
      • The coordinator’s objective is to find a coordination strategy profile that maximizes the total expected discounted reward
      • A dynamic programming decomposition to solve for the optimal prescriptions exists [Nayyar et al., ’13]
      • Note: since the coordinator’s information is common to all agents, each agent can (in principle) perform this computation
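To make the objects concrete, here is a minimal sketch of a prescription as a lookup table from local information to a local action, and of a coordination strategy as a lookup table from virtual history to a joint prescription. All type names, keys, and the toy values are hypothetical; a real strategy would be computed, not hard-coded.

```python
from typing import Dict, Tuple

LocalInfo = str
Action = str
Prescription = Dict[LocalInfo, Action]        # one agent: local info -> action
JointPrescription = Tuple[Prescription, ...]  # one prescription per agent
VirtualHistory = Tuple[str, ...]              # toy encoding of the virtual history


def apply_joint_prescription(
    joint: JointPrescription, local_infos: Tuple[LocalInfo, ...]
) -> Tuple[Action, ...]:
    """Each agent evaluates its own prescription on its own local information."""
    return tuple(p[m] for p, m in zip(joint, local_infos))


# Toy coordination strategy: virtual history -> joint prescription.
strategy: Dict[VirtualHistory, JointPrescription] = {
    (): (
        {"low": "wait", "high": "act"},   # agent 1
        {"low": "wait", "high": "wait"},  # agent 2
    ),
}

joint = strategy[()]  # depends only on common information
actions = apply_joint_prescription(joint, ("high", "low"))
```

Because the strategy depends only on the virtual history, every agent can look up the same joint prescription and then apply its own component privately.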

  11. Challenges with the Common Information Approach
      • The decision variables of the coordinator are functions
      • Under finite action and observation spaces, the space of these functions is also finite, but very large
      • Examples:
        • 1-step delayed sharing: [expression not recovered]
        • Control sharing: [expression not recovered]
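The size of the prescription space follows from a simple count: a deterministic map from a finite set of local-information values to a finite action set can be chosen in (number of actions) raised to the power (number of local-information values) ways, and the joint space is the product over agents. The sketch below only does this arithmetic; the function names and the example sizes are illustrative assumptions, not figures from the slides.

```python
from typing import Iterable, Tuple


def num_local_prescriptions(num_local_infos: int, num_actions: int) -> int:
    """Number of deterministic maps from local information to actions."""
    return num_actions ** num_local_infos


def num_joint_prescriptions(sizes: Iterable[Tuple[int, int]]) -> int:
    """Product over agents; sizes holds (num_local_infos, num_actions) pairs."""
    total = 1
    for num_local_infos, num_actions in sizes:
        total *= num_local_prescriptions(num_local_infos, num_actions)
    return total


# Two agents, each with 3 actions and 10 possible local-information values.
per_agent = num_local_prescriptions(10, 3)            # 3**10 = 59049
joint = num_joint_prescriptions([(10, 3), (10, 3)])   # 3**20 = 3486784401
```

Since the set of local-information values typically grows with the length of the private history, the exponent itself can grow quickly.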

  12. Decentralized Online Planning
      • Inspired by the single-agent Partially Observable Monte-Carlo Planning (POMCP) algorithm of [Silver & Veness, ’10]
      • A sampling-based approach helps to alleviate the computational challenges associated with the large state space
      • Key ideas of the algorithm:
        • Solve the coordinator’s POMDP via online tree search
        • Nodes of each search tree are virtual histories, with edges consisting of joint prescriptions and new common information
        • Agents possess a common random seed to avoid the need to communicate (an idea used previously in [Bernstein et al., ’09; Oliehoek et al., ’09; Arabneydi & Mahajan, ’15])
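The role of the common random seed can be illustrated with a toy sketch: if every agent drives its sampling from a generator initialized with the same seed, all agents draw the same sequence and can therefore arrive at the same planning result without exchanging messages. The function `planning_draws` and its arguments are hypothetical stand-ins for the sampling done inside the tree search.

```python
import random
from typing import List, Sequence


def planning_draws(seed: int, particles: Sequence[str], num_simulations: int) -> List[str]:
    """Stand-in for the random choices made while building a search tree."""
    rng = random.Random(seed)  # private generator, no shared global state
    return [rng.choice(particles) for _ in range(num_simulations)]


particles = ["s0", "s1", "s2", "s3"]
shared_seed = 2019  # agreed on in advance by all agents

agent_1 = planning_draws(shared_seed, particles, num_simulations=8)
agent_2 = planning_draws(shared_seed, particles, num_simulations=8)
same_plan = agent_1 == agent_2  # identical draws on both agents
```

Using a dedicated `random.Random` instance rather than the module-level functions keeps other code from advancing the generator and desynchronizing the agents.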

  13. Decentralized Online Planning
      • Every agent constructs the same set of search trees
      • Search trees are constructed iteratively by running simulations from the current history for each agent (tree)

  14. Search Stage
      • Each simulation begins by sampling from the common information based belief at the current history node
      • The search tree is expanded by either rollout or selection via UCB1 [Auer et al., ’02]; the selection rule [formula not recovered] uses:
        • the number of previous simulation visits to the virtual history h
        • the estimated value of choosing a prescription in virtual history h
      • Successive simulations build out the search trees until a stopping condition (e.g., timeout) is met
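A minimal sketch of UCB1 selection over joint prescriptions, using the standard rule of [Auer et al., ’02]: pick the prescription maximizing its estimated value plus an exploration bonus `c * sqrt(log N(h) / N(h, prescription))`. The function name, argument names, and the exploration constant `c` are assumptions; the slide's exact formula is not available, and trying unvisited prescriptions first is a common convention rather than something the slides state.

```python
import math
from typing import Dict, Hashable, Sequence


def ucb1_select(
    prescriptions: Sequence[Hashable],
    node_visits: int,                   # visits to the virtual history h
    edge_visits: Dict[Hashable, int],   # visits per prescription at h
    edge_values: Dict[Hashable, float], # estimated value per prescription at h
    c: float = 1.0,                     # exploration constant (assumed)
) -> Hashable:
    best, best_score = None, float("-inf")
    for p in prescriptions:
        n = edge_visits.get(p, 0)
        if n == 0:
            return p  # try every prescription at least once
        score = edge_values[p] + c * math.sqrt(math.log(node_visits) / n)
        if score > best_score:
            best, best_score = p, score
    return best


visits = {"a": 10, "b": 1}
values = {"a": 1.0, "b": 0.9}
explore = ucb1_select(["a", "b"], 11, visits, values, c=1.0)  # bonus favors "b"
exploit = ucb1_select(["a", "b"], 11, visits, values, c=0.0)  # pure value: "a"
```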

  15. Belief Update
      • A joint prescription is selected, actions are realized, and new common information is revealed to the agents
      • The belief at a given history node h is approximated by a set of K particles, denoted by B(h)
      • The belief is updated by repeating the following procedure K times:
        • Draw a particle uniformly from B(h)
        • Generate the joint action from the selected prescription
        • Call the generative model to construct a sample of new common information and updated local memories
        • If the sampled common information matches the true common information, add the particle to the updated belief set
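The procedure above can be sketched as a rejection-style particle filter. This is an illustration under stated assumptions, not the authors' implementation: a particle is taken to be a pair (state, joint local memory), a prescription is a dict from local memory to action, and `toy_model` is a made-up deterministic simulator whose common information is the joint action.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

Memories = Tuple[str, ...]
Particle = Tuple[int, Memories]  # (system state, joint local memory); assumed shape


def update_belief(
    particles: Sequence[Particle],
    joint_prescription: Sequence[Dict[str, int]],
    true_common_info: Tuple[int, ...],
    generative_model: Callable,
    rng: random.Random,
    num_draws: int,
) -> List[Particle]:
    updated: List[Particle] = []
    for _ in range(num_draws):
        state, memories = rng.choice(particles)  # uniform draw from B(h)
        # Each agent's action comes from its prescription and local memory.
        actions = tuple(p[m] for p, m in zip(joint_prescription, memories))
        next_state, next_memories, common_info = generative_model(
            state, memories, actions, rng
        )
        if common_info == true_common_info:  # keep only consistent samples
            updated.append((next_state, next_memories))
    return updated


def toy_model(state, memories, actions, rng):
    """Hypothetical simulator: the joint action is what becomes common."""
    return state + sum(actions), memories, actions


belief = [(0, ("m0", "m0")), (0, ("m1", "m0"))]
prescription = ({"m0": 0, "m1": 1}, {"m0": 1, "m1": 0})
new_belief = update_belief(
    belief, prescription, (1, 1), toy_model, random.Random(0), num_draws=50
)
```

In this toy run only the second particle produces the joint action `(1, 1)`, so every surviving particle carries the local memory `("m1", "m0")`. With rejection of this kind the updated set can hold fewer than K particles.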

  16. Convergence
      • Due to the common source of randomness, the planning procedure is identical and decoupled for all agents
      • Convergence of the decentralized online planning algorithm can be characterized via the (single-agent) POMCP algorithm [Silver & Veness, ’10]
      • Applied to a novel security setting (collaborative intrusion response)
        • Agents choose defense actions in response to security alert information
        • Actions and alerts are copied to a centralized database with delay (1-step delayed sharing)

  17. Existing Algorithms: MAA*
      • Heuristic tree-search algorithm from [Szer et al., ’05]
      • Dec-POMDP → non-observable MDP [Oliehoek et al., ’14]
      • Can be viewed as the designer’s approach from [Nayyar et al., ’13]

      |                      | MAA*                                                        | Our algorithm                                    |
      |----------------------|-------------------------------------------------------------|--------------------------------------------------|
      | Centralized problem  | NOMDP                                                       | POMDP                                            |
      | Common information   | Empty                                                       | General                                          |
      | Local memory         | Local observation history                                   | General                                          |
      | State                | System state + joint local observation history              | System state + joint local memory                |
      | History              | Joint policies                                              | Joint prescriptions + common information         |
      | Sufficient statistic | Belief over system state + joint local observation history  | Belief over system state and joint local memory  |

  18. Existing Algorithms: Occupancy-state MDPs
      • Dec-POMDP → occupancy-state MDP [Dibangoye et al., ’16]
      • The occupancy state (belief over state + joint local histories) is a sufficient statistic for optimal planning

      |                      | MAA*                                              | Our algorithm                                    |
      |----------------------|---------------------------------------------------|--------------------------------------------------|
      | Centralized problem  | NOMDP                                             | POMDP                                            |
      | Common information   | Empty                                             | General                                          |
      | Local memory         | Local histories (observations + actions)          | General                                          |
      | State                | System state + joint local histories              | System state + joint local memory                |
      | History              | Joint policies                                    | Joint prescriptions + common information         |
      | Sufficient statistic | Belief over system state + joint local histories  | Belief over system state and joint local memory  |

