  1. Learning and Efficiency in Games (with Dynamic Population). Éva Tardos, Cornell. Joint work with Thodoris Lykouris and Vasilis Syrgkanis.

  2. Large population games: traffic routing • Traffic subject to congestion delays • Cars and packets follow shortest path • Congestion game = cost (delay) depends only on congestion on edges
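To make the definition concrete, here is a minimal sketch (the two-link network, edge names, and delay functions are made up for illustration, not from the talk): each edge's delay depends only on its load, and a player's cost is the sum of the delays on the edges of the chosen path.

```python
# Minimal congestion-game sketch: each edge has a delay that depends only on
# its load; a player's cost is the sum of edge delays along the chosen path.
from collections import Counter

def path_costs(paths_chosen, edge_delay):
    """paths_chosen: list of paths (tuples of edge names), one per player.
    edge_delay: dict edge -> function(load) giving the delay on that edge."""
    load = Counter(e for path in paths_chosen for e in path)
    return [sum(edge_delay[e](load[e]) for e in path) for path in paths_chosen]

# Two parallel links, 4 players: the delay of 'top' grows with congestion.
edge_delay = {"top": lambda x: x / 2, "bottom": lambda x: 2.0}
choices = [("top",), ("top",), ("top",), ("bottom",)]
print(path_costs(choices, edge_delay))  # [1.5, 1.5, 1.5, 2.0]
```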

  3. Example 2: advertising auctions • Advertisers leave and join the system • Changes in system setup • Advertiser values change

  4. Questions + Motivation • Repeated game: How do players behave? • Nash equilibrium? • Today: Machine Learning • With players (or player objectives) changing over time • Efficiency loss due to selfish behavior of players (Price of Anarchy)

  5. Traffic Pattern (optimal) [Figure: network A→B with two routes, A→C→B (delay x/100, then 1 hour) and A→D→B (1 hour, then y/100), plus a 0-min edge C→D; 100 units of traffic] Time: 1.5 hours (traffic split 50/50 over the two routes).

  6. Not Nash equilibrium! [Same network; Time: 1.5 hours] Nash: stable solution, no incentive to deviate. The 50/50 split is not stable: a driver saves time by switching to the route through the 0-min edge C→D.

  7. Nash equilibrium [Same network; all 100 units take A→C→D→B through the 0-min edge] Time: 2 hours. Nash: stable solution, no incentive to deviate. But how did the players find it?
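A quick numerical check of the two traffic patterns on this network, as reconstructed above (a sketch; the delays x/100, y/100, 1 hour, 0 min and the 100 units of traffic follow the figure):

```python
# Travel times on the A-B network with 100 units of traffic.
# Edge delays (hours): A->C: x/100, C->B: 1, A->D: 1, D->B: y/100, C->D: 0.
def travel_time(x_AC, y_DB, path):
    delay = {"AC": x_AC / 100, "CB": 1.0, "AD": 1.0, "DB": y_DB / 100, "CD": 0.0}
    return sum(delay[e] for e in path)

# Optimal: 50 drivers on A-C-B, 50 on A-D-B (the shortcut goes unused).
print(travel_time(50, 50, ["AC", "CB"]))          # 1.5 hours
print(travel_time(50, 50, ["AD", "DB"]))          # 1.5 hours
# Nash: all 100 drivers take A-C-D-B through the 0-minute shortcut.
print(travel_time(100, 100, ["AC", "CD", "DB"]))  # 2.0 hours
```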

  8. Congestion game in Social Science (Kleinberg-Oren STOC'11) Which project should I try? • Each project j has reward c_j • Each player i has a probability p_ij of solving project j • Fair credit: equally shared by discoverers. Uniform players and fair sharing = congestion game. Unfair sharing and/or different abilities: Vetta utility game.
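A small sketch of why uniform players with fair credit sharing give a congestion game (the project rewards below are hypothetical, and the solving probabilities are taken as 1 for simplicity): each player's payoff from project j depends only on how many players chose j.

```python
# Fair credit sharing with uniform players: the reward of project j is split
# evenly among everyone who chose it, so payoffs depend only on the counts,
# which is the defining property of a congestion game. Rewards are made up.
from collections import Counter

def payoffs(choices, reward):
    counts = Counter(choices)
    return [reward[j] / counts[j] for j in choices]

reward = {"A": 6.0, "B": 2.0}
print(payoffs(["A", "A", "A", "B"], reward))  # [2.0, 2.0, 2.0, 2.0]
```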

  9. Nash as Selfish Outcome? • Can the players find Nash? • Which Nash? Nash exists, but… finding Nash is • PPAD-hard in many games (Daskalakis-Goldberg-Papadimitriou'06) • a coordination problem (multiple Nash)

  10. Repeated games
  [Table: rows = players 1…n, columns = time steps 1…t; entry a_i^t is player i's action at time t. Outcome at time 1: (a_1^1, a_2^1, …, a_n^1); outcome at time t: (a_1^t, a_2^t, …, a_n^t)]
  • Assume the same game is played each period • Player's value/cost is additive over periods

  11. Learning outcome
  [Same action table a_i^t over time] Early rounds: maybe here they don't know how to play, or who the other players are… Later rounds: by here they have a better idea…

  12. Nash equilibrium
  [Action table: after some point each player repeats the same action a_i]
  Nash equilibrium: stable actions a with no regret for any alternate strategy x: cost_i(x, a_−i) ≥ cost_i(a)

  13. No-regret without stability: learning
  [Action table a_i^t over time]
  For any fixed action x (with d options): Σ_t cost_i(a^t) ≤ Σ_t cost_i(x, a^t_−i)   (no-regret)
  Regret: R_i(x,T) = Σ_t cost_i(a^t) − Σ_t cost_i(x, a^t_−i) ≤ o(T)
  Many simple rules ensure R_i(x,T) ≈ √(T log d) for all x: MWU (Hedge), Regret Matching, etc.
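For concreteness, a minimal sketch of one such rule, Hedge / multiplicative weights, on a made-up loss sequence; the learning rate √(log d / T) is the standard choice behind the ~√(T log d) regret bound quoted above.

```python
# Hedge / multiplicative weights over d actions, losses in [0, 1].
import math, random

def hedge(loss_sequence, d):
    T = len(loss_sequence)
    eta = math.sqrt(math.log(d) / T)        # standard rate -> regret ~ sqrt(T log d)
    w = [1.0] * d
    total = 0.0
    for losses in loss_sequence:            # losses: one value per action
        s = sum(w)
        probs = [wi / s for wi in w]
        total += sum(p * l for p, l in zip(probs, losses))  # expected loss this round
        w = [wi * math.exp(-eta * l) for wi, l in zip(w, losses)]
    best_fixed = min(sum(ls[i] for ls in loss_sequence) for i in range(d))
    return total, best_fixed

random.seed(0)
seq = [[random.random() for _ in range(3)] for _ in range(1000)]
alg, best = hedge(seq, 3)
print(f"Hedge: {alg:.1f}   best fixed action: {best:.1f}   regret: {alg - best:.1f}")
```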

  14. No-regret without stability: learning
  [Action table a_i^t over time]
  For any fixed action x (with d options): Σ_t cost_i(a^t) ≤ (1 + ε) Σ_t cost_i(x, a^t_−i)   (approximate no-regret)
  Regret: R_i(x,T) = Σ_t cost_i(a^t) − (1 + ε) Σ_t cost_i(x, a^t_−i) ≤ o(T)
  Many simple rules ensure R_i(x,T) ≈ O(log d / ε) for all x: MWU (Hedge), Regret Matching, etc. [Foster, Li, Lykouris, Sridharan, T'16]

  15. Dynamics of rock-paper-scissors
  [Figure: 3×3 payoff matrix over Rock, Paper, Scissors and the trajectory of a learning dynamic around the Nash equilibrium (1/3, 1/3, 1/3)]
  • Doesn't converge • Correlates on shared history
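A simulation sketch of two Hedge learners in repeated rock-paper-scissors. The payoff matrix below is the standard win/lose/tie = +1/−1/0 version (the exact entries on the slide were not recoverable), and the step size is an illustrative choice; the point is only the qualitative behavior: the strategies keep circling the (1/3, 1/3, 1/3) Nash rather than converging to it.

```python
# Two Hedge learners in repeated rock-paper-scissors (standard +1/-1/0 payoffs,
# a stand-in for the slide's matrix). Starting slightly off the (1/3,1/3,1/3)
# Nash, the mixed strategies keep orbiting it instead of converging.
import math

A = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]   # row player's payoff for R, P, S
eta = 0.1

def hedge_step(w, gains):
    w = [wi * math.exp(eta * g) for wi, g in zip(w, gains)]
    s = sum(w)
    return [wi / s for wi in w]

p, q = [0.5, 0.3, 0.2], [1 / 3, 1 / 3, 1 / 3]
for t in range(3001):
    if t % 1000 == 0:
        print(t, [round(x, 2) for x in p])   # keeps moving, does not settle
    gp = [sum(A[i][j] * q[j] for j in range(3)) for i in range(3)]
    gq = [sum(-A[i][j] * p[i] for i in range(3)) for j in range(3)]
    p, q = hedge_step(p, gp), hedge_step(q, gq)
```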

  16. Main Question • Efficiency loss due to selfish behavior of players (Price of Anarchy) • In repeated game settings • With players (or player objectives) changing over time
  Examples: internet routing, advertising auctions • Advertisers leave and join the system • Traffic changes over time • Advertiser values change

  17. Result: routing, limit for very small users
  Theorem (Roughgarden-T'02): In any network with continuous, non-decreasing cost functions and small users, the cost of Nash with rates r_i (for all i) is at most the cost of opt with rates 2r_i (for all i).
  Nash equilibrium: stable solution where no player has incentive to deviate.
  Price of Anarchy = cost of worst Nash equilibrium / "socially optimum" cost
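A sanity check of the bicriteria statement on the example network from the earlier slides (a sketch using the reconstructed delays): the Nash flow at rate 100 costs no more than the optimal flow at the doubled rate 200.

```python
# Total delay of the Nash flow at rate 100 vs the optimal flow at rate 200,
# over the three A-B paths of the example network.
def total_cost(f1, f2, f3):          # f1: A-C-B, f2: A-D-B, f3: A-C-D-B
    x, y = f1 + f3, f2 + f3          # loads on the x/100 and y/100 edges
    return x * (x / 100) + f1 * 1 + f2 * 1 + y * (y / 100)

nash_100 = total_cost(0, 0, 100)     # everyone on the shortcut path: 200.0
opt_200 = min(total_cost(f1, f2, 200 - f1 - f2)
              for f1 in range(201) for f2 in range(201 - f1))
print(nash_100, opt_200)             # 200.0 <= 400.0, as the theorem guarantees
```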

  18. Quality of Learning outcomes: Price of Total Anarchy
  Bounds average welfare assuming no-regret learners:
  Price of Total Anarchy = lim_{T→∞} (1/T) Σ_{t=1..T} cost(a^t) / "socially optimum" cost   [Blum, Hajiaghayi, Ligett, Roth, 2008]

  19. Result 2: routing with learning players
  Theorem (Blum, Even-Dar, Ligett'06; Roughgarden'09): Price of anarchy bounds developed for Nash equilibria extend to no-regret learning outcomes.
  [Action table a_i^t over time] Assumes a stable set of participants.

  20. Today: Dynamic Population
  Classical model: the game is repeated identically and nothing changes.
  Dynamic population model: at each step t, each player i is replaced with an arbitrary new player with probability p. In a population of N players, Np players are replaced in expectation each step.
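A tiny simulation sketch of this turnover process (N, p, and the player labels are illustrative): each step, every player is independently replaced with probability p, so about Np players change per step.

```python
# Dynamic population model: each step, every player is independently replaced
# by a fresh (arbitrary-type) player with probability p.
import random

random.seed(1)
N, p, T = 1000, 0.05, 100
players = list(range(N))          # current player ids (stand-ins for types)
next_id = N
replaced_per_step = []
for t in range(T):
    replaced = 0
    for i in range(N):
        if random.random() < p:   # player i leaves, a new player arrives
            players[i] = next_id
            next_id += 1
            replaced += 1
    replaced_per_step.append(replaced)

print(sum(replaced_per_step) / T, "replacements per step on average; N*p =", N * p)
```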

  21. Learning players can adapt…
  Goal: Bound average welfare assuming adaptive no-regret learners,
  PoA = lim_{T→∞} Σ_{t=1..T} cost(a^t, v^t) / Σ_{t=1..T} Opt(v^t)
  where v^t is the vector of player types at time t, even when the rate of change is high, i.e. a large fraction can turn over at every step.

  22. Need for adaptive learning
  [Action table a_i^t over time]
  Example: routing • Strategy = path • The best "fixed" strategy in hindsight is very weak in a changing environment • Learners can adapt to the changing environment

  23. Need for adaptive learning
  [Action table a_i^t over time; figure: players choosing among projects]
  Example 2: matching (project selection) • Strategy = choose a project • The best "fixed" strategy in hindsight is very weak in a changing environment • Learners can adapt to the changing environment

  24. Adaptive Learning
  [Action table a_i^t with an interval [τ_1, τ_2] highlighted]
  • Adaptive regret [Hazan-Seshadhri'07, Luo-Schapire'15, Blum-Mansour'07, Lehrer'03]: for all players i, strategies x, and intervals [τ_1, τ_2],
  R_i(x, τ_1, τ_2) = Σ_{t=τ_1..τ_2} [cost_i(a^t; v^t) − cost_i(x, a^t_−i; v^t)] ≤ o(τ_2 − τ_1), at rates of ~√(τ_2 − τ_1)
  ⇒ Regret with respect to a strategy that changes k times ≤ ~√(kT)

  25. Adaptive Learning
  [Action table a_i^t with an interval [τ_1, τ_2] highlighted]
  • Adaptive regret [Foster, Li, Lykouris, Sridharan, T'16]: for all players i, strategies x, and intervals [τ_1, τ_2],
  R_i(x, τ_1, τ_2) = Σ_{t=τ_1..τ_2} [cost_i(a^t; v^t) − (1 + ε) cost_i(x, a^t_−i; v^t)] ≤ O(k log d / ε)
  Regret with respect to a strategy that changes k times. Using any of MWU (Hedge), Regret Matching, etc., mixed with a bit of "forgetting".
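A sketch of the "forgetting" idea: plain Hedge, but the weights are mixed slightly back toward uniform each step, so the learner can track a comparator that changes over time. The learning rate and mixing rate below are illustrative choices, not the tuned parameters of Foster, Li, Lykouris, Sridharan, T'16.

```python
# Hedge with a small amount of forgetting (mixing toward uniform each step),
# so the learner can adapt when the best action changes mid-sequence.
import math

def adaptive_hedge(loss_sequence, d, eta=0.3, gamma=0.02):
    w = [1.0 / d] * d
    total = 0.0
    for losses in loss_sequence:
        total += sum(wi * l for wi, l in zip(w, losses))    # expected loss
        w = [wi * math.exp(-eta * l) for wi, l in zip(w, losses)]
        s = sum(w)
        w = [(1 - gamma) * wi / s + gamma / d for wi in w]   # the "forgetting" step
    return total

# The best action switches halfway through; the forgetting learner tracks it.
T, d = 2000, 3
seq = [[0.0 if i == (0 if t < T // 2 else 2) else 1.0 for i in range(d)]
       for t in range(T)]
print(adaptive_hedge(seq, d))   # well below T/2 = 1000, despite the switch
```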

  26. Result (Lykouris, Syrgkanis, T'16): Bound average welfare close to the Price of Anarchy for Nash, assuming adaptive no-regret learners, even when the rate of change is high, p ≈ 1/log²n with n players.
  - Worst-case change of player type ⇒ need for adapting to the changing environment
  - Sudden large change is unlikely

  27. No-regret and Price of Anarchy
  Low regret: R_i(x) = Σ_{t=1..T} [cost_i(a^t; v^t) − cost_i(x, a^t_−i; v^t)] ≤ o(T)
  Best action varies with the choices of others… Consider the optimal solution: let x = a_i* be player i's choice in OPT.
  No regret, for all players i: Σ_t cost_i(a^t) ≤ Σ_t cost_i(a_i*, a^t_−i)
  Players don't have to know a_i*.

  28. Proof Technique: Smoothness (Roughgarden'09)
  Consider the optimal solution: player i plays action a_i* in the optimum.
  No regret: Σ_t cost_i(a^t) ≤ Σ_t cost_i(a_i*, a^t_−i)   (doesn't need to know a_i*)
  A game is (λ, μ)-smooth (λ > 0, μ < 1) if for all strategy vectors a: Σ_i cost_i(a_i*, a_−i) ≤ λ·OPT + μ·cost(a)
  A Nash equilibrium a has cost(a) ≤ λ/(1−μ)·OPT.
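The one-line chain behind the last statement, written out (the standard smoothness argument): summing the no-deviation inequalities of a Nash equilibrium a over the players and applying the smoothness condition gives

```latex
% Nash / no-regret per player, summed over i, then smoothness:
\[
  \mathrm{cost}(a) \;=\; \sum_i \mathrm{cost}_i(a)
  \;\le\; \sum_i \mathrm{cost}_i(a_i^{*}, a_{-i})
  \;\le\; \lambda\,\mathrm{OPT} \;+\; \mu\,\mathrm{cost}(a),
\]
% and rearranging gives the stated bound:
\[
  \mathrm{cost}(a) \;\le\; \frac{\lambda}{1-\mu}\,\mathrm{OPT}.
\]
```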
