Multiagent Planning under Uncertainty for Cooperative Agents
Shlomo Zilberstein
School of Computer Science, University of Massachusetts Amherst
IFAAMAS Summer School on Autonomous Agents and Multi-Agent Systems, Beijing, China, August 2013

- Challenge: How to achieve intelligent coordination of a group of decision makers in spite of stochasticity and partial observability?
- Objective: Develop effective decision-theoretic planning methods to address the uncertainty about the domain, the outcome of actions, and the knowledge, beliefs and intentions of the other agents.

Problem Characteristics
- A group of decision makers or agents interact in a stochastic environment
- Each "episode" involves a sequence of decisions over a finite or infinite horizon
- The change in the environment is determined stochastically by the current state and the set of actions taken by the agents
- Each decision maker obtains different partial observations of the overall situation
- Decision makers have the same objective

Sample Applications
- Space exploration rovers [Zilberstein et al. 02]
- Multi-access broadcast channels [Ooi and Wornell 96]
- Decentralized detection of hazardous weather events [Kumar and Zilberstein 09]
- Mobile robot navigation [Emery-Montemerlo et al. 05; Spaan and Melo 08]

Outline
- Planning with Markov decision processes
- Decentralized partially observable MDPs
- Complexity results
- Solving finite-horizon decentralized POMDPs
- Solving infinite-horizon decentralized POMDPs
- Scalability with respect to the number of agents
- Conclusion

Planning under Uncertainty
[Figure: agent-environment loop — the agent takes action a and observes the resulting world state s]
- An agent interacts with the environment over some extended period of time
- The utility function depends on the sequence of decisions and their outcomes
- A rational agent should choose an action that maximizes its expected utility
A Simple Grid Environment [Russell & Norvig, 2003]
[Figure: the 4x3 grid world with a +1 terminal state, a -1 terminal state, and a wall]

Markov Decision Process
- A Markov decision process is a tuple ⟨S, A, P, R⟩, where
  - S is a finite set of domain states, with initial state s_0
  - A is a finite set of actions
  - P(s'|s, a) is a state transition function
  - R(s), R(s, a), or R(s, a, s') is a reward function
- The Markov assumption: P(s_t | s_{t-1}, s_{t-2}, ..., s_1, a) = P(s_t | s_{t-1}, a)

Example: An Optimal Policy
[Figure: the grid world with terminal rewards +1 and -1 and state utilities .812, .868, .912, .762, .660, .705, .655, .611, .388]
Actions succeed with probability 0.8 and move sideways with probability 0.1 (the agent remains in the same position when there is a wall). Actions incur a small cost.

Policies for Different R(s)
[Figure: optimal policies for the grid world under different settings of the per-step reward R(s)]

Policies and Utilities of States
- A policy π is a mapping from states to actions.
- An optimal policy π* maximizes the expected reward:
  π* = argmax_π E[ Σ_{t=0}^∞ γ^t R(s_t) | π ]
- The utility of a state:
  U^π(s) = E[ Σ_{t=0}^∞ γ^t R(s_t) | π, s_0 = s ]

The Bellman Equation
- The optimal policy is defined by:
  π*(s) = argmax_a Σ_{s'} P(s'|s, a) U(s')
  U(s) = R(s) + γ max_a Σ_{s'} P(s'|s, a) U(s')
- Can be solved using dynamic programming [Bellman, 1957]
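A minimal Python sketch of the Bellman backup and the greedy policy extraction defined above. The tiny two-state MDP (its states, transitions, and rewards) is an illustrative assumption, not the grid world from the slides.

GAMMA = 0.9                     # discount factor

S = ["s0", "s1"]                # states
A = ["a0", "a1"]                # actions

# P[(s, a)] = {s': P(s'|s, a)}
P = {
    ("s0", "a0"): {"s0": 0.8, "s1": 0.2},
    ("s0", "a1"): {"s1": 1.0},
    ("s1", "a0"): {"s0": 1.0},
    ("s1", "a1"): {"s1": 1.0},
}

R = {"s0": 0.0, "s1": 1.0}      # state-based reward R(s)

def bellman_backup(U, s):
    """U(s) = R(s) + gamma * max_a sum_s' P(s'|s,a) U(s')."""
    return R[s] + GAMMA * max(
        sum(p * U[s2] for s2, p in P[(s, a)].items()) for a in A
    )

def greedy_policy(U):
    """pi*(s) = argmax_a sum_s' P(s'|s,a) U(s')."""
    return {
        s: max(A, key=lambda a: sum(p * U[s2] for s2, p in P[(s, a)].items()))
        for s in S
    }

U = {s: 0.0 for s in S}
for _ in range(100):            # repeated backups; convergence is discussed next
    U = {s: bellman_backup(U, s) for s in S}
print(U, greedy_policy(U))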
Value Iteration [Bellman, 1957]

  repeat
    U ← U'
    for each state s do
      U'[s] ← R[s] + γ max_a Σ_{s'} P(s'|s, a) U(s')
    end
  until CloseEnough(U, U')

Convergence of VI
- An initial error bound is: ||U − U*|| ≤ 2 R_max / (1 − γ)
- Based on the contraction property of the Bellman backup operator B: ||B U_i − B U'_i|| ≤ γ ||U_i − U'_i||
- How many iterations are needed to reach a max-norm error of ε?
  γ^N · 2 R_max / (1 − γ) ≤ ε  gives  N = ⌈ log(2 R_max / (ε(1 − γ))) / log(1/γ) ⌉
- A less conservative termination condition:
  if ||U' − U|| < ε(1 − γ) / 2γ then ||U' − U*|| < ε

Policy Iteration [Howard, 1960]

  repeat
    π ← π'
    U ← ValueDetermination(π)
    for each state s do
      π'[s] ← argmax_a Σ_{s'} P(s'|s, a) U(s')
    end
  until π = π'

Policy Loss
- The error bound on the utility of each state may not be the most important factor. What the agent cares about is how well it does based on a given policy / utility function:
  if ||U_i − U*|| < ε then ||U^{π_i} − U*|| < 2εγ / (1 − γ)
- Note that the policy loss can approach zero long before the utility estimates converge.

Stochastic Shortest-Path Problems
- Given a start state, the objective is to minimize the expected cost of reaching a goal state.
- S: a finite set of states
- A(i), i ∈ S: a finite set of actions available in state i
- P_ij(a): probability of reaching state j after taking action a in state i
- C_i(a): expected cost of taking action a in state i

Value Determination
- Can be implemented using value iteration:
  U'(s) = R(s) + γ Σ_{s'} P(s'|s, π(s)) U(s')
- or by solving a set of n linear equations:
  U(s) = R(s) + γ Σ_{s'} P(s'|s, π(s)) U(s')
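The linear-equation form of value determination and the policy iteration loop above can be made concrete in a few lines of numpy; the randomly generated MDP here is an illustrative assumption, not an example from the slides.

import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.95

# P[a, s, s'] = P(s'|s, a); each row is normalized into a distribution.
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random(n_states)                       # state-based reward R(s)

def value_determination(pi):
    """Solve the linear system U = R + gamma * P_pi U for the policy's utilities."""
    P_pi = P[pi, np.arange(n_states)]          # transition matrix under pi
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, R)

def policy_iteration():
    pi = np.zeros(n_states, dtype=int)
    while True:
        U = value_determination(pi)
        # Improvement step: pi'(s) = argmax_a sum_s' P(s'|s, a) U(s')
        pi_new = np.argmax(P @ U, axis=0)
        if np.array_equal(pi_new, pi):
            return pi, U                       # pi = pi', so the loop terminates
        pi = pi_new

print(policy_iteration())

Because each policy is evaluated exactly, the improvement step never decreases utilities, so the loop terminates with an optimal policy after finitely many iterations.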
MDPs and State-Space Search [Hansen & Zilberstein, AAAI 1998; AIJ 2001]
- MDPs present a state-space search problem in which transitions are stochastic.
- Because state transitions are stochastic, it is impossible to bound the number of actions needed to reach the goal (indefinite horizon).
- Search algorithms like A* can handle the deterministic version of this problem.
- But neither A* nor AO* can solve indefinite-horizon problems.

Advantages of Search
- Can find optimal solutions without evaluating all problem states.
- Can take advantage of domain knowledge to reduce search effort.
- Can benefit from a large body of existing work in AI on how to search in real time and how to trade off solution quality for search effort.

Solving MDPs Using Search
[Figure: the start state at the root of the implicit graph (all states); the explicit graph contains the states evaluated during search; the solution graph contains the states reachable under the optimal solution]
Given a start state, heuristic search can find an optimal solution without evaluating all states.

Possible Solution Structures
[Figure: a solution may be a simple path, an acyclic graph, or a cyclic graph]

AO* [Nilsson 1971; Martelli & Montanari 1973]
- Initialize the partial solution graph to the start state.
- Repeat until a complete solution is found:
  - Expand some nonterminal state on the fringe of the best partial solution graph.
  - Use backward induction to update the costs of all ancestor states of the expanded state and possibly change the selected action.

DP and Heuristic Search

  Solution structure    Sequence     Branches             Loops
  Dynamic Programming   Forward DP   Backward induction   Policy (value) iteration
  Heuristic Search      A*           AO*                  LAO*

Heuristic search = starting state + forward expansion of solution + admissible heuristic + dynamic programming (DP)
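The recipe above (starting state + forward expansion of the best partial solution + admissible heuristic + DP) can be sketched in Python. The fragment below is only a much-simplified illustration in the spirit of AO*/LAO*, not the published algorithms: it assumes a hypothetical problem interface (actions, succ, cost, and an admissible heuristic h with h(goal) = 0), and it value-iterates over all expanded states rather than only the ancestors of the newly expanded state.

def heuristic_search_dp(start, goal, actions, succ, cost, h, eps=1e-6):
    """succ(s, a) -> {s': prob}; cost(s, a) -> float; h is an admissible heuristic."""
    f = {start: h(start), goal: 0.0}   # cost estimates; unexpanded tip states use h
    expanded = {}                      # expanded states -> their applicable actions
    best_action = {}                   # current best partial solution

    def q(s, a):
        return cost(s, a) + sum(p * f.setdefault(s2, h(s2))
                                for s2, p in succ(s, a).items())

    def backup(s):
        best_action[s] = min(expanded[s], key=lambda a: q(s, a))
        f[s] = q(s, best_action[s])

    def fringe_state():
        """Some unexpanded non-goal state reachable from start under best_action."""
        stack, seen = [start], set()
        while stack:
            s = stack.pop()
            if s in seen or s == goal:
                continue
            seen.add(s)
            if s not in expanded:
                return s
            stack.extend(succ(s, best_action[s]))
        return None

    while True:
        s = fringe_state()
        if s is None:                   # best solution graph has no open fringe
            return best_action, f
        expanded[s] = list(actions(s))  # forward expansion of one fringe state
        while True:                     # DP step over the expanded states
            delta = 0.0
            for t in expanded:
                old = f[t]
                backup(t)
                delta = max(delta, abs(f[t] - old))
            if delta < eps:
                break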
LAO* [Hansen and Zilberstein, AAAI 1998]
- Like AO*, LAO* performs dynamic programming on the set of states that includes the expanded state and all of its ancestors.
- But LAO* must use either policy iteration or value iteration instead of backward induction.
- Convergence of value iteration to exact state costs is asymptotic, but it is generally more efficient than policy iteration for large problems.

Heuristic Evaluation Function
- h(i) is a heuristic estimate of the minimal-cost solution for every non-terminal tip state.
- h(i) is admissible if h(i) ≤ f*(i).
- An admissible heuristic estimate f(i) for any state in the explicit graph is defined as follows:
  f(i) = 0                                                  if i is a goal state
       = h(i)                                               if i is a non-terminal tip state
       = min_{a ∈ A(i)} [ c_i(a) + Σ_{j ∈ S} p_ij(a) f(j) ]  otherwise

Theoretical Properties
- Theorem 1: Using an admissible heuristic, LAO* converges to an optimal solution without (necessarily) expanding/evaluating all states.
- Theorem 2: If h2(i) is a more informative heuristic than h1(i) (i.e., h1(i) ≤ h2(i) ≤ f*(i)), LAO* using h2(i) expands a subset of the worst-case set of states expanded using h1(i).

Imperfect Observations
[Figure: agent-environment loop with action a, reward r, and observation o]
A partially observable MDP adds the following:
- O: a finite set of observations
- P(o | s', a): an observation function, the probability that o is observed after taking action a resulting in a transition to state s'
- b_0: a discrete probability distribution over starting states (the initial belief state)

Example: Hallway
Minimize the number of steps to the starred square for a given start state distribution.
- States: grid cells with orientation
- Actions: turn (left, right, around), move forward, stay
- Transitions: noisy
- Observations: red lines
- Goal: the red star location

Example: Tiger Game
- States: tiger left, tiger right
- Actions: listen, open left, open right
- Transitions: listening only provides info; opening a door resets the problem
- Observations: noisy indications of the tiger's location
- Goal: get the treasure
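To make the initial belief state b_0 and its update concrete, here is a small Python sketch of Bayesian belief updating for the tiger game. The 0.85 listening accuracy and the reset-to-uniform dynamics follow the standard tiger benchmark and are assumptions rather than numbers given on the slides.

STATES = ["tiger-left", "tiger-right"]

def transition(s, a):
    """P(s'|s, a): listening leaves the state unchanged; opening a door resets it."""
    if a == "listen":
        return {s: 1.0}
    return {s2: 0.5 for s2 in STATES}          # problem resets after opening a door

def observation(o, s_next, a):
    """P(o|s', a): listening gives a noisy indication of the tiger's location."""
    if a != "listen":
        return 0.5                              # uninformative after opening a door
    correct = (o == "hear-left") == (s_next == "tiger-left")
    return 0.85 if correct else 0.15

def belief_update(b, a, o):
    """b'(s') is proportional to P(o|s', a) * sum_s P(s'|s, a) b(s)."""
    b_new = {}
    for s_next in STATES:
        pred = sum(transition(s, a).get(s_next, 0.0) * b[s] for s in STATES)
        b_new[s_next] = observation(o, s_next, a) * pred
    norm = sum(b_new.values())
    return {s: p / norm for s, p in b_new.items()}

b0 = {"tiger-left": 0.5, "tiger-right": 0.5}    # the initial belief state b_0
b1 = belief_update(b0, "listen", "hear-left")
print(b1)

Starting from the uniform b_0, a single "hear-left" observation shifts the belief to 0.85 on tiger-left and 0.15 on tiger-right.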