  1. Feedback Control for Learning in Games Gurdal ARSLAN & Jeff SHAMMA Mechanical and Aerospace Engineering UCLA

  2. Setup: Repeated Games • Time k = 1, 2, 3, … • Player i: – Strategy: p_i(k) ∈ Δ – Action: a_i(k) = rand[p_i(k)] – Payoff: U_i(a_i, a_{-i}) = a_i^T M_i a_{-i} – Play: p_i(k) = F(information up to time k) • Assume players do not share utilities! How can simple rules lead players to a mixed strategy Nash equilibrium? • Separate issues: Will they? Should they? Compute NE?
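
  A minimal sketch (not from the slides) of one round of this setup, assuming illustrative 2x2 payoff matrices M1 and M2 as placeholders: each player samples a_i(k) from p_i(k) and receives the bilinear payoff a_i^T M_i a_{-i}, i.e. the (a_i, a_{-i}) entry of M_i.

      import numpy as np

      rng = np.random.default_rng(0)

      # Placeholder 2x2 payoff matrices (assumptions for illustration only)
      M1 = np.array([[1.0, 0.0],
                     [0.0, 1.0]])
      M2 = np.array([[0.0, 1.0],
                     [1.0, 0.0]])

      p1 = np.array([0.5, 0.5])   # mixed strategies p_i(k) in the simplex Δ
      p2 = np.array([0.5, 0.5])

      # a_i(k) = rand[p_i(k)]: sample each action from the player's mixed strategy
      a1 = rng.choice(2, p=p1)
      a2 = rng.choice(2, p=p2)

      # U_i(a_i, a_-i) = a_i^T M_i a_-i with actions encoded as unit vectors,
      # which is just the (a1, a2) entry of M1 and the (a2, a1) entry of M2
      u1 = M1[a1, a2]
      u2 = M2[a2, a1]

      print(f"actions ({a1}, {a2}) -> payoffs ({u1}, {u2})")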

  3. Prior Work & Convergence • (Stochastic) Fictitious Play • No Regret • New approaches: Multirate, Joint weak calibration, Regret testing, … • Convergence results: – Special cases: NE – Correlated equilibria – Convex hull of NE – “Dwell” near NE

  4. Non-convergence Results • Shapley game vs Fictitious Play • Crawford (1985): a wide class of learning mechanisms must fail to converge to mixed strategies • Jordan anticoordination game: 3 players (P1, P2, P3), each with 2 moves • Hart & Mas-Colell (2003): Consider a larger class & show Uncoupled + Jordan anticoordination = non-convergence

  5. Preview • Introduce new uncoupled dynamics based on “feedback control”. • Demonstrate how convergence to mixed strategy NE can be enabled (including Shapley & Jordan games). • Best/Better response variants. • Action/Payoff based versions. • Two/Multi-player cases.

  6. Feedback Control • [Block diagram: desired behavior + error → controller K → process P → actual behavior, with a disturbance entering the process and the actual behavior fed back (−) to form the error] • K = controller = sequential decision maker • P = process with approximate model P_model • Think of “standing upright”

  7. What’s the Connection? • FB → GT: – New initiatives in “cooperative control” (combat systems, networks, self-assembly, automata teams, …) require a general-sum formulation. • GT → FB: [Diagram: interconnected decision makers DM1–DM5] – DM_i is in feedback with DM_{-i}

  8. Typical Controller: PID • Proportional + Integral + Derivative – K_P ⇒ current error – K_I ⇒ error history – K_D ⇒ error change • “Workhorse” of traditional control design. • Model of human motion control, homeostasis, …
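
  For reference (not shown in the transcript), the textbook PID law combines the three terms as u(t) = K_P e(t) + K_I ∫_0^t e(s) ds + K_D de(t)/dt: the proportional term acts on the current error, the integral term on the accumulated error history, and the derivative term on the error's rate of change.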

  9. Derivative Action • [Plot: the error e extrapolated from time t (now) to t + τ] • React to predicted error • Example: “Balancing”
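
  The idea behind derivative action (the standard reading of this slide): form a first-order prediction of the error, e(t + τ) ≈ e(t) + τ de(t)/dt, so reacting to the derivative amounts to reacting to where the error is headed at time t + τ rather than to where it is now.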

  10. Repeated Games in Continuous Time • Empirical frequencies: • ODE method of stochastic approximation: deterministic continuous-time analysis ⇒ probabilistic discrete-time conclusions
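
  The equations on this slide are not in the transcript. In the standard formulation used in this line of work (notation here is an assumption), player i's empirical frequency is q_i(k) = (1/k) Σ_{j=1..k} v_{a_i(j)}, where v_a is the vertex of Δ for action a. It satisfies the recursion q_i(k+1) = q_i(k) + (1/(k+1)) (v_{a_i(k+1)} − q_i(k)), and the ODE method analyzes the associated mean-field limit dq_i/dt = p_i − q_i.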

  11. Derivative Action FP (DAFP) • Define smoothed best response: • FP: • Derivative action FP: • “First order” model of adversary: moving target.
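
  The displayed equations are missing from the transcript; a reconstruction in the standard notation of smoothed fictitious play (the smoothing parameter τ and gain γ are assumptions) is:

      smoothed best response:   β_i(q_{-i}) = arg max over p ∈ Δ of [ p^T M_i q_{-i} + τ H(p) ],   H = entropy, τ > 0
      FP:                       dq_i/dt = β_i(q_{-i}) − q_i
      derivative action FP:     dq_i/dt = β_i(q_{-i} + γ dq_{-i}/dt) − q_i

  The added term γ dq_{-i}/dt is the “first order” model of the adversary: player i best-responds to a prediction of the opponent's moving empirical frequency rather than to its current value.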

  12. Ideal vs Approximate • Ideal ⇒ implicit equations • Approximate: • Use of ideal differentiators can always lead to NE (a misleading conclusion).
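
  A sentence of context not on the slide: the ideal equations are implicit because dq_{-i}/dt is itself determined by the opponents' rule, which in turn involves dq_i/dt; an implementable version therefore replaces the exact derivative with a causal approximation, introduced on the next slide.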

  13. Approximate Differentiator • Define: • Asymptotically: • Two-player implementation:
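
  The definitions are not captured in the transcript. A standard first-order approximate differentiator consistent with the slide's asymptotic claim (the gain λ and its use below are assumptions of this sketch) filters each empirical frequency, dr_i/dt = λ (q_i − r_i), and uses λ (q_i − r_i) in place of dq_i/dt; as λ → ∞ the estimate approaches the true derivative. A runnable sketch of two-player DAFP with this differentiator, on an illustrative zero-sum 2x2 game (matching pennies) rather than any game from the talk:

      import numpy as np

      def smoothed_best_response(M, q_opp, tau=0.1):
          # entropy-smoothed best response: softmax(M @ q_opp / tau)
          z = M @ q_opp / tau
          z -= z.max()                          # numerical stability
          e = np.exp(z)
          return e / e.sum()

      # Illustrative zero-sum game (matching pennies); unique mixed NE at (1/2, 1/2)
      M1 = np.array([[1.0, -1.0], [-1.0, 1.0]])
      M2 = -M1.T

      gamma, lam, dt = 1.0, 50.0, 0.01          # derivative gain, filter gain, Euler step
      q1 = np.array([0.9, 0.1]); q2 = np.array([0.2, 0.8])   # empirical frequencies
      r1 = q1.copy(); r2 = q2.copy()            # approximate-differentiator states

      for _ in range(20000):
          dq1_hat = lam * (q1 - r1)             # estimate of dq1/dt
          dq2_hat = lam * (q2 - r2)             # estimate of dq2/dt
          # DAFP: best respond to a prediction of the opponent's empirical frequency
          p1 = smoothed_best_response(M1, q2 + gamma * dq2_hat)
          p2 = smoothed_best_response(M2, q1 + gamma * dq1_hat)
          # Euler step of dq_i/dt = p_i - q_i and of the differentiator filters
          q1 += dt * (p1 - q1); q2 += dt * (p2 - q2)
          r1 += dt * lam * (q1 - r1); r2 += dt * lam * (q2 - r2)

      print(q1, q2)   # for these illustrative gains the frequencies should settle near (0.5, 0.5)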

  14. Local Convergence of DAFP • Theorem: Consider a two-player game with a NE. 1) … stable at … ⇒ … stable at … 2) … unstable at …, with … ⇒ … stable at …, where … are the eigenvalues of the linearized …

  15. Jordan Anticoordination Revisited • Unique mixed NE is unstable under … • …, hence stabilizable by …

  16. Extensions to “Gradient Play” • “Better Response” = GP • DAGP: • Theorem: Similar … using eigenvalues of … • Shapley & Jordan games convergent.
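
  The GP/DAGP equations are also missing from the transcript; in the usual formulation of gradient (“better response”) play, with Π_Δ denoting projection onto the simplex (a reconstruction, not the slide itself):

      GP:    dq_i/dt = Π_Δ[ q_i + M_i q_{-i} ] − q_i
      DAGP:  dq_i/dt = Π_Δ[ q_i + M_i (q_{-i} + γ dq_{-i}/dt) ] − q_i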

  17. Crawford & Conlisk • Crawford (1985): Nonconvergence of a class of algorithms. • Conlisk (1993): “Adaptation in games: Two solutions to the Crawford puzzle”, J. of Economic Behavior and Organization. – Two-player zero-sum games – Play in “rounds” (…, R−1, R, R+1, …) – In round R+1, adjust the mixed strategy using a “forecast” payoff based on rounds R & R−1

  18. Discrete Time • Theorem: Local attractor in continuous time ⇒ positive probability of convergence to NE in discrete time. • …as opposed to zero probability.

  19. Payoff Based Rules • Use “stimulus response” • Theorem: Positive probability of convergence to NE.

  20. Jordan Anticoordination: Payoff Based DAGP • [Simulation plot] γ = 1, λ = 50, ε = 0.1

  21. Multiplayer Games • Immediate extensions in case of “pair-wise utility” structure: • Otherwise, must inspect “joint-action” version of FP.
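
  A natural reading of the “pair-wise utility” structure (an interpretation, not stated on the slide): U_i(a) = Σ_{j ≠ i} a_i^T M_{ij} a_j, i.e. each player's payoff is a sum of bilinear two-player terms, so the two-player dynamics (and their derivative-action versions) extend player-by-player along the interaction graph.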

  22. Concluding Remarks • Feedback control motivates the use of auxiliary dynamics to enable NE convergence. • Other “controller” structures possible (all mixed strategy equilibria “stabilizable”). • DAFP & DAGP respect “graph” structures. • Key concerns: – Natural? – Strategic?
