Algorithms for Multiagent Learning




Outline (SA3 – D9, D10)

Algorithms for Multiagent Learning
A. Introduction
B. Single Agent Learning
C. Game Theory
D. Multiagent Learning
   – Equilibrium Learners
   – Regret Minimizing Algorithms
   – Best Response Learners
      – Q-Learning
      – Opponent Modeling Q-Learning
      – Gradient Ascent
      – WoLF
   – Learning to Coordinate
E. Future Issues and Open Problems

What's the Goal? (SA3 – D11)

– Learn a best response, if one exists.
– Make some other guarantees. For example,
   – Convergence of payoffs or policies.
   – Low regret, or at least minimax optimal.
– If best response learners converge against each other, then it must be to a Nash equilibrium.

Q-Learning (SA3 – D12)

...or any MDP learning algorithm.
– The most commonly used approach to learning in multiagent systems. And, not without success.
– If it is the only learning agent...
   – Recall, if the other agents are using a stationary strategy, it becomes an MDP.
   – Q-learning will converge to a best response.
– Otherwise, requires on-policy learning.
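As a concrete illustration of the point on slide D12: against a stationary opponent the repeated game is just an MDP, so ordinary tabular Q-learning learns a best response to that fixed strategy. The sketch below is not from the tutorial; the class name, the ε-greedy exploration, the parameter values, and the matching-pennies example are illustrative assumptions.

```python
import random
from collections import defaultdict

class QLearner:
    """Minimal tabular epsilon-greedy Q-learner (illustrative names and parameters)."""
    def __init__(self, actions, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.actions, self.alpha, self.gamma, self.epsilon = actions, alpha, gamma, epsilon
        self.Q = defaultdict(float)                    # Q[(state, action)]

    def act(self, state):
        if random.random() < self.epsilon:             # explore
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.Q[(state, a)])

    def update(self, s, a, r, s_next):
        best_next = max(self.Q[(s_next, b)] for b in self.actions)
        self.Q[(s, a)] += self.alpha * (r + self.gamma * best_next - self.Q[(s, a)])

# Repeated matching pennies against a *stationary* opponent: from the learner's
# point of view this is an MDP with a single state, so Q-learning converges to
# a best response to the opponent's fixed strategy.
PAYOFF = {('H', 'H'): 1, ('T', 'T'): 1, ('H', 'T'): -1, ('T', 'H'): -1}
agent, state = QLearner(['H', 'T']), 's0'
for t in range(5000):
    a = agent.act(state)
    o = 'H' if random.random() < 0.7 else 'T'          # opponent plays H 70% of the time
    agent.update(state, a, PAYOFF[(a, o)], state)
print({a: round(agent.Q[(state, a)], 2) for a in ['H', 'T']})   # H should come out ahead
```

Against the 70/30 opponent the learned Q-value for H ends up higher than for T, so the greedy policy is the best response H; if the opponent were itself learning, this stationarity argument (and the convergence guarantee) would no longer apply.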

Q-Learning, continued (SA3 – D13)

It has also been successfully applied to...
– Team games (Sen et al., 1994; Claus & Boutilier, 1998).
– Games with pure strategy equilibria (Tan, 1993; Crites & Sandholm, 1995; Bowling, 2000).
– Adversarial games (Tesauro, 1995; Uther, 1997).
TD-Gammon remains one of the most convincing successes of reinforcement learning.

Opponent Modeling Q-Learning (SA3 – D14)

(Uther, 1997) and others.
– Fictitious play in stochastic games, using approximation.
– Dominance solvable games.
– Choose the action a that maximizes the expected value under the empirical opponent model:
      Σ_o [ C(s, o) / n(s) ] Q(s, ⟨a, o⟩)
– Update the opponent model and the Q-values:
      C(s, o) ← C(s, o) + 1,    n(s) ← n(s) + 1
      Q(s, ⟨a, o⟩) ← (1 − α) Q(s, ⟨a, o⟩) + α ( r + γ max_a′ Σ_o [ C(s′, o) / n(s′) ] Q(s′, ⟨a′, o⟩) )
  where C(s, o) counts how often the other agent has played o in state s, and n(s) is the total number of visits to s.
(See the code sketch below.)

Opponent Modeling Q-Learning, continued (SA3 – D15)

Superficially less naive than Q-learning:
– Recognizes the existence of other agents.
– ...but assumes they use a stationary policy.
Similar results to Q-learning, but faster (Uther, 1997). Hexcer results, percentage of games won by the row learner against the column learner (MMQ = minimax-Q, Q = Q-learning, OMQ = opponent modeling Q-learning):

First 50,000 games:
         MMQ    Q     OMQ
  MMQ     —    27%    32%
  Q      73%    —     40%
  OMQ    68%   60%     —

Second 50,000 games:
         MMQ    Q     OMQ
  MMQ     —    45%    43%
  Q      55%    —     41%
  OMQ    57%   59%     —

Gradient Ascent (SA3 – D16)

– Compute the gradient of the value with respect to the player's strategy.
– Adjust the policy to increase value.
– Single-agent learning (parameterized policies, approximation): (Williams, 1993; Sutton et al., 2000; Baxter & Bartlett, 2000).
– Multiagent learning: (Singh, Kearns, & Mansour, 2000; Bowling & Veloso, 2002, 2003; Zinkevich, 2003).
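To make the opponent modeling update on slide D14 concrete, here is a sketch written for clarity rather than efficiency; the class name, the ε-greedy exploration, and the parameter values are illustrative assumptions, not from the tutorial.

```python
import random
from collections import defaultdict

class OpponentModelQ:
    """Sketch of opponent modeling Q-learning: keep Q-values over *joint*
    actions plus empirical counts of the opponent's actions, and choose a
    best response to that empirical (fictitious play style) model."""
    def __init__(self, my_actions, opp_actions, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.A, self.O = my_actions, opp_actions
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.Q = defaultdict(float)        # Q[(state, my_action, opp_action)]
        self.C = defaultdict(int)          # C[(state, opp_action)]: opponent counts
        self.n = defaultdict(int)          # n[state]: total observations in state

    def value(self, s, a):
        """Expected value of action a under the empirical opponent model."""
        if self.n[s] == 0:
            return 0.0
        return sum(self.C[(s, o)] / self.n[s] * self.Q[(s, a, o)] for o in self.O)

    def act(self, s):
        if random.random() < self.epsilon:
            return random.choice(self.A)
        return max(self.A, key=lambda a: self.value(s, a))

    def update(self, s, a, o, r, s_next):
        self.C[(s, o)] += 1                # update the opponent model...
        self.n[s] += 1
        target = r + self.gamma * max(self.value(s_next, a2) for a2 in self.A)
        self.Q[(s, a, o)] += self.alpha * (target - self.Q[(s, a, o)])   # ...and the Q-values
```

In a repeated matrix game there is only one state, so s and s_next can both be a single dummy token, and C(s, o)/n(s) is then exactly the fictitious play frequency of the opponent's actions.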

Infinitesimal Gradient Ascent (SA3 – D17, D18)

IGA (Singh, Kearns, & Mansour, 2000).
– Two-player, two-action, general-sum game with payoff matrices
      R_r = [ r11  r12 ; r21  r22 ],    R_c = [ c11  c12 ; c21  c22 ].
– Let α be the probability the row player selects its first action, and β the probability the column player selects its first action, with expected payoffs V_r(α, β) and V_c(α, β).
– Each player repeatedly adjusts its strategy along the gradient of its own expected payoff:
      α_{k+1} = α_k + η ∂V_r(α_k, β_k)/∂α
      β_{k+1} = β_k + η ∂V_c(α_k, β_k)/∂β
  where
      ∂V_r/∂α = β u − (r22 − r12),    u  = r11 + r22 − r12 − r21
      ∂V_c/∂β = α u′ − (c22 − c21),   u′ = c11 + c22 − c12 − c21
  and the step size η → 0 (hence "infinitesimal").
(See the simulation sketch below.)

IGA — Theorem (SA3 – D19)

(Singh et al., 2000)
Theorem. If both players follow Infinitesimal Gradient Ascent (IGA), where η → 0, then their strategies will converge to a Nash equilibrium, OR the average payoffs over time will converge in the limit to the expected payoffs of a Nash equilibrium.

IGA — Proof (SA3 – D20)

The joint strategy dynamics form an affine dynamical system,
      [ dα/dt ; dβ/dt ] = U [ α ; β ] + b,    U = [ 0  u ; u′  0 ],
and the proof analyzes the resulting phase portraits (labeled A–D in the original figures) case by case:
– U is not invertible,
– U has real eigenvalues, or
– U has imaginary eigenvalues.
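To see the D17–D19 dynamics in action, here is a small simulation of the IGA update on matching pennies; it uses a finite step size and clips strategies to [0, 1] rather than taking the infinitesimal limit, and the function and parameter names are illustrative, not from the tutorial.

```python
import numpy as np

def iga(R_r, R_c, alpha0=0.2, beta0=0.9, eta=0.01, steps=20000):
    """Finite step-size simulation of IGA on a 2x2 game.
    R_r, R_c: payoff matrices for the row and column player;
    alpha, beta: probabilities of each player's first action."""
    u  = R_r[0, 0] + R_r[1, 1] - R_r[0, 1] - R_r[1, 0]
    up = R_c[0, 0] + R_c[1, 1] - R_c[0, 1] - R_c[1, 0]
    alpha, beta = alpha0, beta0
    traj = []
    for _ in range(steps):
        dV_r = beta * u   - (R_r[1, 1] - R_r[0, 1])     # dV_r / d alpha
        dV_c = alpha * up - (R_c[1, 1] - R_c[1, 0])     # dV_c / d beta
        alpha = min(1.0, max(0.0, alpha + eta * dV_r))  # keep strategies valid
        beta  = min(1.0, max(0.0, beta  + eta * dV_c))
        traj.append((alpha, beta))
    return np.array(traj)

# Matching pennies: the strategies orbit the mixed equilibrium (0.5, 0.5)
# instead of converging, the case covered by the second half of the theorem.
R_r = np.array([[1., -1.], [-1., 1.]])
traj = iga(R_r, -R_r)
print(traj[-5:])           # still moving...
print(traj.mean(axis=0))   # ...while time-averaged play stays near (0.5, 0.5)
```

In self-play on a game with only a mixed equilibrium, the cycling shown here is what motivates the WoLF modification later in the tutorial.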

IGA — Summary (SA3 – D21)

– One of the first convergence proofs for a payoff-maximizing multiagent learning algorithm.
– Expected payoffs do not necessarily converge.
  [Figure: reward plotted against time, with an "Average" curve marked.]

GIGA (SA3 – D22)

(Zinkevich, 2003)
Generalized Infinitesimal Gradient Ascent (GIGA).
– At time t, select actions according to the mixed strategy x_t.
– After observing the others' actions, update
      x_{t+1} = P( x_t + η_t r_t ),
  where r_t is the vector of rewards for each of the player's actions and P(·) projects onto the probability simplex.
– i.e., step the probability distribution toward the immediate reward, then project into a valid probability space.
(See the code sketch below.)

GIGA, continued (SA3 – D23)

GIGA is identical to IGA for two-player, two-action games, while approximating the gradient: the per-action reward vector r_t plays the role of the payoff gradient in the IGA update.

GIGA — Intuition (SA3 – D24)

– Assumption: the policy gradient is bounded.
– GIGA is universally consistent! (It guarantees no regret against arbitrary opponents.)
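A minimal sketch of the GIGA update from slide D22, assuming the player can observe (or compute) its per-action reward vector each round; the simplex projection routine and the rock-paper-scissors example are illustrative choices, not from the slides.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto the probability simplex (standard sort-based method)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, len(v) + 1) > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(v + theta, 0.0)

def giga_step(x, reward_vector, eta):
    """One GIGA update: step toward the per-action rewards, then project back
    into the space of valid mixed strategies."""
    return project_to_simplex(x + eta * reward_vector)

# Rock-paper-scissors against a fixed opponent who favors rock.
R = np.array([[0., -1., 1.], [1., 0., -1.], [-1., 1., 0.]])   # our payoffs (rows = our action)
opp = np.array([0.5, 0.25, 0.25])
x = np.ones(3) / 3
for t in range(1, 5001):
    r_t = R @ opp                                  # expected reward of each of our actions
    x = giga_step(x, r_t, eta=1.0 / np.sqrt(t))    # decaying step size
print(np.round(x, 3))                              # mass shifts toward "paper"
```

Because the opponent here is fixed, the projected updates push the strategy toward the best response (paper); the point of slide D24 is that the same rule also bounds regret against opponents that change arbitrarily.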

WoLF (SA3 – D25)

– Modify gradient ascent learning to converge.
– Vary the speed of learning: Win or Learn Fast.
   – If winning, learn cautiously.
   – If losing, learn quickly.
– Algorithms: WoLF-IGA, WoLF-PHC, GraWoLF.

WoLF-IGA (SA3 – D26, D27)

(Bowling & Veloso, 2002, 2003)
Gradient ascent with a variable learning rate:
      α_{k+1} = α_k + η ℓ_k^r ∂V_r(α_k, β_k)/∂α
      β_{k+1} = β_k + η ℓ_k^c ∂V_c(α_k, β_k)/∂β

WoLF: Win or Learn Fast!
      ℓ_k^r = ℓ_min  if V_r(α_k, β_k) > V_r(α^e, β_k)    (WINNING)
              ℓ_max  otherwise                           (LOSING)
      ℓ_k^c = ℓ_min  if V_c(α_k, β_k) > V_c(α_k, β^e)    (WINNING)
              ℓ_max  otherwise                           (LOSING)
where (α^e, β^e) is a Nash equilibrium strategy pair.
(See the simulation sketch below.)

WoLF-IGA — Theorem (SA3 – D28)

Theorem. If both players follow WoLF-IGA, where η → 0 and ℓ_min < ℓ_max, then their strategies will converge to a Nash equilibrium.
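A small simulation of the WoLF-IGA rule from slides D26–D27 on matching pennies, assuming the equilibrium strategy is known (as the rule requires) and using a finite step size, so convergence is only approximate; the function name, constants, and starting point are illustrative.

```python
import numpy as np

def wolf_iga(R_r, R_c, alpha_e, beta_e, eta=0.01, l_min=0.5, l_max=2.0, steps=40000):
    """Sketch of WoLF-IGA: gradient ascent with a variable learning rate, small
    when 'winning' (doing better than the equilibrium strategy would against
    the current opponent) and large when 'losing'."""
    def V(R, a, b):    # expected payoff when row plays action 1 w.p. a, column w.p. b
        p = np.array([a * b, a * (1 - b), (1 - a) * b, (1 - a) * (1 - b)])
        return p @ np.array([R[0, 0], R[0, 1], R[1, 0], R[1, 1]])

    u  = R_r[0, 0] + R_r[1, 1] - R_r[0, 1] - R_r[1, 0]
    up = R_c[0, 0] + R_c[1, 1] - R_c[0, 1] - R_c[1, 0]
    a, b = 0.2, 0.9
    for _ in range(steps):
        l_r = l_min if V(R_r, a, b) > V(R_r, alpha_e, b) else l_max   # win or learn fast
        l_c = l_min if V(R_c, a, b) > V(R_c, a, beta_e) else l_max
        da = b * u  - (R_r[1, 1] - R_r[0, 1])
        db = a * up - (R_c[1, 1] - R_c[1, 0])
        a = min(1.0, max(0.0, a + eta * l_r * da))
        b = min(1.0, max(0.0, b + eta * l_c * db))
    return a, b

R_r = np.array([[1., -1.], [-1., 1.]])                 # matching pennies
print(wolf_iga(R_r, -R_r, alpha_e=0.5, beta_e=0.5))    # approaches (0.5, 0.5)
```

With ℓ_max > ℓ_min the joint strategy spirals in toward the mixed equilibrium instead of orbiting it the way plain IGA does, which is the content of the theorem on slide D28.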


