Empirical-evidence Equilibria in Stochastic Games
Nicolas Dudebout
Outline
• Stochastic games
• Empirical-evidence equilibria (EEEs)
• Open questions in EEEs
Stochastic Games
• Game theory
• Markov decision processes
Game Theory
Decision making    𝑣 ∶ ℬ → ℝ  ⟹  𝑏* ∈ arg max_{𝑏 ∈ ℬ} 𝑣(𝑏)
Game theory        𝑣¹ ∶ ℬ¹ × ℬ² → ℝ  and  𝑣² ∶ ℬ¹ × ℬ² → ℝ
Nash equilibrium   ⎧ 𝑏*¹ ∈ arg max_{𝑏¹ ∈ ℬ¹} 𝑣¹(𝑏¹, 𝑏*²)
                   ⎩ 𝑏*² ∈ arg max_{𝑏² ∈ ℬ²} 𝑣²(𝑏*¹, 𝑏²)
Example: Battle of the Sexes
Payoff matrix (player 1 picks a row, player 2 a column):
        𝐺        𝑃
𝐺     2, 2     0, 1
𝑃     0, 0     1, 3
Nash equilibria
• (𝐺, 𝐺)
• (𝑃, 𝑃)
• (3/4 𝐺 + 1/4 𝑃, 1/3 𝐺 + 2/3 𝑃)
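The mixed equilibrium can be checked numerically: at (3/4 𝐺 + 1/4 𝑃, 1/3 𝐺 + 2/3 𝑃) each player must be indifferent between their two pure actions. The payoff numbers below are the reconstructed matrix from this slide, assumed rather than taken from a verified source:

```python
import numpy as np

# Assumed Battle-of-the-Sexes payoffs (rows: player 1 plays G or P,
# columns: player 2 plays G or P); reconstructed for illustration.
U1 = np.array([[2.0, 0.0],   # player 1's payoffs
               [0.0, 1.0]])
U2 = np.array([[2.0, 1.0],   # player 2's payoffs
               [0.0, 3.0]])

p = np.array([3/4, 1/4])     # player 1 mixes over (G, P)
q = np.array([1/3, 2/3])     # player 2 mixes over (G, P)

# At a mixed equilibrium each player is indifferent between pure actions.
payoffs_1 = U1 @ q           # player 1's expected payoff per pure action
payoffs_2 = p @ U2           # player 2's expected payoff per pure action
print(payoffs_1, payoffs_2)
```

Both vectors come out constant, confirming indifference and hence the mixed equilibrium.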
Markov Decision Process (MDP)
Dynamic       𝑦⁺ ∼ 𝑔(𝑦, 𝑏)  ⟺  𝑦_{𝑢+1} ∼ 𝑔(𝑦_𝑢, 𝑏_𝑢)
Stage cost    𝑣(𝑦, 𝑏)
History       ℎ_𝑢 = (𝑦₀, 𝑦₁, …, 𝑦_𝑢, 𝑏₀, 𝑏₁, …, 𝑏_𝑢)
Strategy      𝜏 ∶ ℋ → ℬ; for an MDP an optimal strategy can be restricted to the state, 𝜏 ∶ 𝒴 → ℬ
Utility       𝑉(𝜏) = 𝔼_{𝑔,𝜏}[∑_{𝑢=0}^∞ 𝜀^𝑢 𝑣(𝑦_𝑢, 𝑏_𝑢)]
Bellman's equation   𝑉*(𝑦) = max_{𝑏 ∈ ℬ} {𝑣(𝑦, 𝑏) + 𝜀 𝔼_𝑔[𝑉*(𝑦⁺) | 𝑦, 𝑏]}
Dynamic programming: use knowledge of 𝑔
Reinforcement learning: learn 𝑔 from repeated interaction
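Bellman's equation can be solved by value iteration, the basic dynamic-programming scheme. The sketch below uses made-up transition probabilities and stage rewards (none of the numbers come from the talk):

```python
import numpy as np

# Value iteration for a small MDP: g[y, b] is a distribution over next
# states and v[y, b] is the stage reward; all numbers are placeholders.
n_states, n_actions, eps = 3, 2, 0.9   # eps is the discount factor

rng = np.random.default_rng(0)
g = rng.random((n_states, n_actions, n_states))
g /= g.sum(axis=2, keepdims=True)      # normalize rows into probabilities
v = rng.random((n_states, n_actions))  # stage rewards v(y, b)

V = np.zeros(n_states)
for _ in range(1000):
    # Q(y, b) = v(y, b) + eps * E_g[V(y+) | y, b]
    Q = v + eps * g @ V
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new

tau = Q.argmax(axis=1)  # greedy optimal stationary strategy tau : Y -> B
```

Since the Bellman operator is an ε-contraction, the iteration converges geometrically to 𝑉*, and the greedy strategy it induces is Markov, matching the slide's 𝜏 ∶ 𝒴 → ℬ.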
Imperfect Information (POMDP)
Dynamic    𝑥⁺ ∼ 𝑜(𝑥, 𝑏)
Signal     𝑡 ∼ 𝜉(𝑥)
History    ℎ_𝑢 = (𝑡₀, 𝑡₁, …, 𝑡_𝑢, 𝑏₀, 𝑏₁, …, 𝑏_𝑢)
Belief     ℙ_{𝑜,𝜉,𝜏}[𝑥 | ℎ]
Strategy   𝜏 ∶ ℋ → ℬ, or equivalently on beliefs: 𝜏 ∶ Δ(𝒳) → ℬ
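The belief ℙ[𝑥 | ℎ] is maintained recursively by a Bayes filter: predict 𝑥⁺ through the dynamic, then reweight by the likelihood of the observed signal. A minimal sketch with placeholder dynamics 𝑜 and signal kernel 𝜉:

```python
import numpy as np

# One step of the Bayesian belief update for a POMDP.  o[x, b] is a
# distribution over x+ and xi[x] a distribution over signals; all
# numbers here are illustrative placeholders.
n_x, n_b, n_t = 4, 2, 3
rng = np.random.default_rng(1)
o = rng.random((n_x, n_b, n_x)); o /= o.sum(axis=2, keepdims=True)
xi = rng.random((n_x, n_t));     xi /= xi.sum(axis=1, keepdims=True)

def belief_update(belief, b, t):
    """P[x+ | h, b, t]  proportional to  xi(x+)[t] * sum_x P[x|h] o(x,b)[x+]."""
    predicted = belief @ o[:, b, :]     # predict x+ through the dynamic
    posterior = predicted * xi[:, t]    # reweight by signal likelihood
    return posterior / posterior.sum()  # normalize

belief = np.full(n_x, 1 / n_x)          # uniform prior over x
belief = belief_update(belief, b=0, t=2)
```

Because this update is a sufficient statistic for the history, the strategy can indeed be defined on beliefs, 𝜏 ∶ Δ(𝒳) → ℬ.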
Stochastic Games
Dynamic      𝑥⁺ ∼ 𝑜(𝑥, 𝑏¹, 𝑏²)
Signals      ⎧ 𝑡¹ ∼ 𝜉¹(𝑥)
             ⎩ 𝑡² ∼ 𝜉²(𝑥)
Histories    ⎧ ℎ¹_𝑢 = (𝑡¹₀, …, 𝑡¹_𝑢, 𝑏¹₀, …, 𝑏¹_𝑢)
             ⎩ ℎ²_𝑢 = (𝑡²₀, …, 𝑡²_𝑢, 𝑏²₀, …, 𝑏²_𝑢)
Strategies   ⎧ 𝜏¹ ∶ ℋ¹ → ℬ¹
             ⎩ 𝜏² ∶ ℋ² → ℬ²
Beliefs      ⎧ ℙ_{𝑜,𝜉¹,𝜏¹,𝜉²,𝜏²}[𝑥, ℎ² | ℎ¹]
             ⎩ ℙ_{𝑜,𝜉¹,𝜏¹,𝜉²,𝜏²}[𝑥, ℎ¹ | ℎ²]
Existing Approaches
• (Weakly) belief-free equilibrium
• Mean-field equilibrium
• Incomplete theories
Empirical-evidence Equilibria
Motivation
Agent 1 and Agent 2 interact with Nature:
0. Pick arbitrary strategies.
1. Formulate simple but consistent models.
2. Design strategies optimal w.r.t. the models; then go back to 1.
An empirical-evidence equilibrium is a fixed point:
• strategies optimal w.r.t. models
• models consistent with strategies
Example: Asset Management
Trading one asset on the stock market. Each agent's model is based on:
• information published by the company
• observed trading activity
The model can be very different for each agent.
Multiple to Single Agent
[Diagram: from Agent 1's point of view, Agent 2 and Nature are lumped together into a single Nature block.]
Single Agent Setup
[Diagram: an agent interacting with Nature.]
Agent dynamic    𝑦⁺ ∼ 𝑔(𝑦, 𝑏, 𝑡)
Nature dynamic   𝑥⁺ ∼ 𝑜(𝑥, 𝑦, 𝑏)
Signal           𝑡 ∼ 𝜉(𝑥)
Example: Asset Management
Agent dynamic    𝑦⁺ ∼ 𝑔(𝑦, 𝑏, 𝑡)
Nature dynamic   𝑥⁺ ∼ 𝑜(𝑥, 𝑦, 𝑏)
Signal           price 𝑞 ∈ {Low, High}, with 𝑡 ∼ 𝜉(𝑥)
State            holding 𝑦 ∈ {0 .. 𝑁}
Action           sell one, hold, or buy one: 𝑏 ∈ {−1, 0, 1}
Stage cost       𝑞 ⋅ 𝑏
Nature's state 𝑥 represents market sentiment, political climate, and the other traders.
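A toy simulator makes the example concrete. The price process below (a two-state chain with 0.8 persistence) and the numeric price levels are invented for illustration; only the state, action, and stage-cost structure come from the slide:

```python
import random

# Toy asset-management environment: holding y in {0..N}, action
# b in {-1, 0, 1}, price q in {Low, High}, stage cost q * b.
N = 10
PRICES = {"Low": 1.0, "High": 2.0}  # assumed numeric price levels

def step(y, b, q):
    """Apply action b, pay the stage cost q*b, and draw the next price."""
    assert 0 <= y + b <= N, "cannot sell below 0 or hold more than N"
    cost = PRICES[q] * b
    # Arbitrary sticky price chain: stay with probability 0.8.
    q_next = q if random.random() < 0.8 else ("High" if q == "Low" else "Low")
    return y + b, cost, q_next

y, q = 0, "Low"
y, cost, q = step(y, b=1, q=q)  # buy one share at the low price
```

A negative cost (selling, 𝑏 = −1) is revenue, so a good strategy tries to buy Low and sell High.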
Single Agent Setup
Strategy         𝑏 ∼ 𝜏(ℎ)
Signal           𝑡 ∼ 𝜉(𝑥)
Nature dynamic   𝑥⁺ ∼ 𝑜(𝑥, 𝑦, 𝑏)
Agent dynamic    𝑦⁺ ∼ 𝑔(𝑦, 𝑏, 𝑡)
The agent replaces the signal 𝑡 by a simple model 𝑡̂ and acts on the modeled dynamic 𝑦⁺ ∼ 𝑔(𝑦, 𝑏, 𝑡̂):
• 𝑡̂ consistent with 𝜏
• 𝜏 optimal w.r.t. 𝑡̂
Depth-𝑙 Consistency
Consider a binary stochastic process 𝑡:
0100010001001010010110111010000111010101...
• 0-characteristic: ℙ[𝑡 = 0], ℙ[𝑡 = 1]
• 1-characteristic: ℙ[𝑡𝑡⁺ = 00], ℙ[𝑡𝑡⁺ = 01], ℙ[𝑡𝑡⁺ = 10], ℙ[𝑡𝑡⁺ = 11]
• …
• 𝑙-characteristic: probabilities of strings of length 𝑙 + 1
Definition. Two processes 𝑡 and 𝑡′ are depth-𝑙 consistent if they have the same 𝑙-characteristic.
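The 𝑙-characteristic of a realized binary sequence can be estimated by counting sliding windows of length 𝑙 + 1, as in this sketch:

```python
from collections import Counter
from itertools import product

def l_characteristic(seq, l):
    """Empirical l-characteristic: frequencies of length-(l+1) substrings."""
    windows = [seq[i:i + l + 1] for i in range(len(seq) - l)]
    counts = Counter(windows)
    keys = [''.join(s) for s in product('01', repeat=l + 1)]
    return {k: counts[k] / len(windows) for k in keys}

# The binary sequence from the slide.
t = "0100010001001010010110111010000111010101"
char0 = l_characteristic(t, 0)   # estimates P[t = 0], P[t = 1]
char1 = l_characteristic(t, 1)   # estimates P[tt+ = 00], ..., P[tt+ = 11]
```

Two processes are then depth-𝑙 consistent exactly when these dictionaries agree, which is what the model 𝑡̂ must match about the true signal process.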
Depth-𝑙 Consistency: Example
[Diagram: a depth-1 model of the process as a chain with states 𝑨_∅, 𝑨_0, and 𝑨_1; the initial state 𝑨_∅ emits 0 or 1 with probability 0.5 each, and 𝑨_0 and 𝑨_1 emit the next signal with probabilities 0.7 and 0.3.]
Complete Picture
Fix a depth 𝑙 ∈ ℕ. The model state 𝑨 contains the last 𝑙 observed signals, and the model 𝜈 gives the next-signal distribution:
𝜈(𝑨 = (𝑡₁, …, 𝑡_𝑙))[𝑡_{𝑙+1}] = ℙ_𝜏[𝑡_{𝑢+1} = 𝑡_{𝑙+1} | 𝑡_𝑢 = 𝑡_𝑙, …, 𝑡_{𝑢−𝑙+1} = 𝑡₁]
Closed loop: 𝑏 ∼ 𝜏(ℎ), 𝑡 ∼ 𝜉(𝑥), 𝑥⁺ ∼ 𝑜(𝑥, 𝑦, 𝑏), 𝑦⁺ ∼ 𝑔(𝑦, 𝑏, 𝑡)
Fixed point: 𝜏 ↦ 𝜈 (𝜈 consistent with 𝜏) and 𝜈 ↦ 𝜏 (𝜏 optimal w.r.t. 𝜈)
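The consistency map 𝜏 ↦ 𝜈 can be estimated by simulation: run the closed loop under a fixed strategy and count next-signal frequencies after each length-𝑙 signal string. Everything below (state-space sizes, the dynamics 𝑜, 𝜉, 𝑔, and the strategy 𝜏) is a toy placeholder, not the talk's construction:

```python
import numpy as np

# Estimate nu(A)[t+] for l = 1 by simulating the closed loop.
rng = np.random.default_rng(2)
n_x, n_y, n_b, n_t, l = 3, 2, 2, 2, 1

o = rng.random((n_x, n_y, n_b, n_x)); o /= o.sum(axis=-1, keepdims=True)
xi = rng.random((n_x, n_t));          xi /= xi.sum(axis=-1, keepdims=True)
g = rng.integers(0, n_y, (n_y, n_b, n_t))  # deterministic toy agent dynamic
tau = rng.integers(0, n_b, (n_y, n_t))     # toy strategy: b = tau(y, last t)

counts = np.zeros((n_t,) * l + (n_t,))     # counts[A][t+]
x, y, t = 0, 0, 0
for _ in range(100_000):
    b = tau[y, t]
    x = rng.choice(n_x, p=o[x, y, b])      # Nature dynamic x+ ~ o(x, y, b)
    t_next = rng.choice(n_t, p=xi[x])      # signal t ~ xi(x)
    counts[t, t_next] += 1                 # record A = (t,) -> t_next
    y = g[y, b, t_next]                    # agent dynamic
    t = t_next

nu = counts / counts.sum(axis=-1, keepdims=True)  # empirical nu(A)[t+]
```

Pairing this map with an optimality map 𝜈 ↦ 𝜏 (e.g. solving the MDP on the joint state (𝑦, 𝑨) under the model 𝜈) and iterating searches for the fixed point that defines an EEE.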