Two-Timescale Algorithms for Learning Nash Equilibria in General-Sum Stochastic Games

H.L. Prasad (Streamoid Technologies, Inc.), Prashanth L.A. and Shalabh Bhatnagar (Indian Institute of Science)
Multi-agent RL setting

[Figure: N agents (1, 2, ..., N) interact with a common environment. The agents submit a joint action a = (a^1, a^2, ..., a^N); the environment returns the reward vector r = (r^1, r^2, ..., r^N) and the next state y.]
Problem area

[Figure: taxonomy of models.
  Markov chains (S, p)
  Markov decision processes (S, A, p, r, beta): single agent
  Normal-form games (N, A, r): N agents
  Stochastic games (N, S, A, p, r, beta): N agents]
Problem area (revisited)

[Figure: normal-form games and stochastic games, each split into zero-sum and general-sum; the target here is general-sum stochastic games.]

Design objective: an online algorithm that converges to a Nash equilibrium.(1)

(1) If NE is a useful objective for learning in games, then we have a strong contribution!
A General Optimization Problem
Value function

For agent i under a stationary joint policy $\pi$,
$$v^i_\pi(s) = E\Big[\sum_{t \ge 0} \beta^t \sum_{a \in \mathcal{A}(s_t)} r^i(s_t, a)\, \pi(s_t, a) \;\Big|\; s_0 = s\Big],$$
where $r^i$ is agent i's reward and $\beta \in (0, 1)$ is the discount factor.

A stationary Markov strategy $\pi_* = (\pi^1_*, \pi^2_*, \ldots, \pi^N_*)$ is said to be Nash if
$$v^i_{\pi_*}(s) \ge v^i_{(\pi^i, \pi^{-i}_*)}(s), \quad \forall \pi^i,\; \forall i,\; \forall s \in \mathcal{S}.$$
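To make the definition concrete, here is a minimal sketch (not the authors' code) that evaluates $v^i_\pi$ for a fixed stationary joint policy in a small, randomly generated two-agent stochastic game by solving the linear Bellman system. The toy game, its sizes and all variable names are illustrative assumptions.

```python
# Sketch: evaluate v^i_pi for a fixed joint policy pi = (pi^1, pi^2) in a toy
# 2-state, 2-agent stochastic game by solving (I - beta * P_pi) v^i = r^i_pi.
# The game below is randomly generated purely for illustration.
import numpy as np

beta = 0.8                                   # discount factor
S, A1, A2 = 2, 2, 2                          # states and per-agent action counts
rng = np.random.default_rng(0)

r = rng.uniform(0, 1, size=(2, S, A1, A2))   # r[i, x, a1, a2]: reward to agent i
p = rng.uniform(0, 1, size=(S, A1, A2, S))   # p[x, a1, a2, y]
p /= p.sum(axis=-1, keepdims=True)           # normalise into transition probabilities

pi1 = np.full((S, A1), 1.0 / A1)             # pi^1(x, a1): uniform, for illustration
pi2 = np.full((S, A2), 1.0 / A2)             # pi^2(x, a2)

def value(i):
    """v^i_pi(x) = E[ sum_t beta^t r^i(x_t, a_t) | x_0 = x ] under the fixed policy."""
    P_pi = np.einsum('xa,xb,xaby->xy', pi1, pi2, p)   # state-to-state kernel under pi
    r_pi = np.einsum('xa,xb,xab->x', pi1, pi2, r[i])  # expected one-step reward
    return np.linalg.solve(np.eye(S) - beta * P_pi, r_pi)

for i in range(2):
    print(f"v^{i+1}_pi =", value(i))
```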
Dynamic Programming Idea

The optimal (Nash) value is the best marginal value after fixing $a^i \sim \pi^i$:
$$v^i_{\pi_*}(x) = \max_{\pi^i(x) \in \Delta(\mathcal{A}^i(x))} E_{\pi^i(x)}\big[Q^i_{\pi^{-i}_*}(x, a^i)\big],$$
where the Q-value is given by
$$Q^i_{\pi^{-i}}(x, a^i) = E_{\pi^{-i}(x)}\Big[r^i(x, a) + \beta \sum_{y \in \mathcal{U}(x)} p(y \mid x, a)\, v^i(y)\Big].$$
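A small, self-contained sketch of the Q-value above: it averages out the opponent's action under $\pi^{-i}$ and checks the Bellman consistency $E_{a^1 \sim \pi^1}[Q^1(x, a^1)] = v^1_\pi(x)$ for the evaluated policy. The toy game and all names are assumptions, not the authors' code.

```python
# Sketch: Q^1_{pi^2}(x, a1) = E_{a2 ~ pi^2(x)}[ r^1(x, a) + beta * sum_y p(y|x,a) v^1(y) ]
# for a randomly generated toy game; v^1 is obtained by exact policy evaluation.
import numpy as np

beta, S, A1, A2 = 0.8, 2, 2, 2
rng = np.random.default_rng(0)
r1 = rng.uniform(0, 1, size=(S, A1, A2))             # agent 1's reward r^1(x, a1, a2)
p = rng.uniform(0, 1, size=(S, A1, A2, S))
p /= p.sum(axis=-1, keepdims=True)                    # p(y | x, a1, a2)
pi1 = np.full((S, A1), 1.0 / A1)                      # agent 1's policy
pi2 = np.full((S, A2), 1.0 / A2)                      # opponent's policy pi^{-1} = pi^2

# Exact v^1_pi for the joint policy (solve the linear Bellman system)
P_pi = np.einsum('xa,xb,xaby->xy', pi1, pi2, p)
r_pi = np.einsum('xa,xb,xab->x', pi1, pi2, r1)
v1 = np.linalg.solve(np.eye(S) - beta * P_pi, r_pi)

# Marginal Q-value: average over the opponent's action a2 ~ pi^2(x)
one_step = r1 + beta * np.einsum('xaby,y->xab', p, v1)
Q1 = np.einsum('xb,xab->xa', pi2, one_step)

print("Q^1(x, a1) =\n", Q1)
# Bellman consistency for the evaluated policy: E_{a1 ~ pi^1}[Q^1(x, a1)] = v^1_pi(x)
print(np.allclose(np.einsum('xa,xa->x', pi1, Q1), v1))   # True
```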
Optimization problem in informal terms

Need to solve
$$v^i_{\pi_*}(x) = \max_{\pi^i(x) \in \Delta(\mathcal{A}^i(x))} E_{\pi^i(x)}\big[Q^i_{\pi^{-i}_*}(x, a^i)\big]. \qquad (1)$$

Formulation:
  Objective: minimize the Bellman error $v^i(x) - E_{\pi^i} Q^i_{\pi^{-i}}(x, a^i)$ in every state, for every agent.
  Constraint 1: ensure each policy $\pi^i$ is a probability distribution.
  Constraint 2: $Q^i_{\pi^{-i}}(x, a^i) \le v^i(x)$, a proxy for the max in (1).
Optimization problem in formal terms

$$\min_{v, \pi} \; f(v, \pi) = \sum_{i=1}^{N} \sum_{x \in \mathcal{S}} \big(v^i(x) - E_{\pi^i} Q^i_{\pi^{-i}}(x, a^i)\big)$$
subject to
$$\pi^i(x, a^i) \ge 0, \quad \forall a^i \in \mathcal{A}^i(x),\; x \in \mathcal{S},\; i = 1, 2, \ldots, N,$$
$$\sum_{a^i \in \mathcal{A}^i(x)} \pi^i(x, a^i) = 1, \quad \forall x \in \mathcal{S},\; i = 1, 2, \ldots, N,$$
$$Q^i_{\pi^{-i}}(x, a^i) \le v^i(x), \quad \forall a^i \in \mathcal{A}^i(x),\; x \in \mathcal{S},\; i = 1, 2, \ldots, N.$$
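As an illustration only, the sketch below evaluates the objective $f(v, \pi)$ and the constraint residuals for an arbitrary candidate $(v, \pi)$ in a toy two-agent game; the game, the candidate point and every name are assumptions.

```python
# Sketch: evaluate f(v, pi) and the constraint residuals of the optimization
# problem above for a randomly generated toy 2-agent game and a candidate (v, pi).
import numpy as np

beta, S, A1, A2 = 0.8, 2, 2, 2
rng = np.random.default_rng(1)
r = rng.uniform(0, 1, size=(2, S, A1, A2))              # r[i, x, a1, a2]
p = rng.uniform(0, 1, size=(S, A1, A2, S))
p /= p.sum(axis=-1, keepdims=True)

pi = [np.full((S, A1), 1 / A1), np.full((S, A2), 1 / A2)]   # candidate policies
v = rng.uniform(0, 2, size=(2, S))                          # candidate values v[i, x]

def q(i, v_i):
    """Q^i_{pi^{-i}}(x, a^i) with the opponent's action averaged out."""
    future = r[i] + beta * np.einsum('xaby,y->xab', p, v_i)
    if i == 0:
        return np.einsum('xb,xab->xa', pi[1], future)       # average over a2
    return np.einsum('xa,xab->xb', pi[0], future)           # average over a1

Q = [q(0, v[0]), q(1, v[1])]
f = sum((v[i] - np.einsum('xa,xa->x', pi[i], Q[i])).sum() for i in range(2))
slack = [Q[i] - v[i][:, None] for i in range(2)]            # must be <= 0 elementwise

print("objective f(v, pi)       =", f)
print("max constraint violation =", max(s.max() for s in slack))
print("policies sum to one      :", all(np.allclose(pi[i].sum(1), 1) for i in range(2)))
```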
Solution approach

Usual approach: apply the KKT conditions to solve the general optimization problem.
Caveat: this imposes a tricky linear-independence (constraint-qualification) requirement.
Alternative: use a simpler set of SG-SP conditions.
A sufficient condition

SG-SP point: a point $(v_*, \pi_*)$ is said to be an SG-SP point if it is feasible and, for all $x \in \mathcal{S}$, $i \in \{1, 2, \ldots, N\}$ and $a^i \in \mathcal{A}^i(x)$,
$$\pi^i_*(x, a^i)\, g^i_{x, a^i}\big(v^i_*, \pi^{-i}_*(x)\big) = 0,$$
where
$$g^i_{x, a^i}\big(v^i, \pi^{-i}(x)\big) := Q^i_{\pi^{-i}}(x, a^i) - v^i(x).$$

Nash ⇔ SG-SP: a strategy $\pi_*$ is Nash if and only if $(v_*, \pi_*)$ is an SG-SP point.
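The condition lends itself to a direct numerical check. Below is a minimal sketch of such a check (feasibility plus the complementary-slackness condition $\pi^i g^i = 0$); the function name, array shapes and tolerance are assumptions made for illustration.

```python
# Sketch: check whether a candidate (v, pi), together with its Q-values, is an
# SG-SP point up to a numerical tolerance.
import numpy as np

def is_sgsp(pi_list, v_list, q_list, tol=1e-8):
    """pi_list[i]: (S, A_i) policy, v_list[i]: (S,) values, q_list[i]: (S, A_i) Q-values."""
    for pi_i, v_i, q_i in zip(pi_list, v_list, q_list):
        g_i = q_i - v_i[:, None]                          # g^i_{x,a^i} = Q^i - v^i
        feasible = (pi_i >= -tol).all() \
            and np.allclose(pi_i.sum(axis=1), 1.0) \
            and (g_i <= tol).all()                        # Q^i <= v^i
        slackness = np.abs(pi_i * g_i).max() <= tol       # pi^i * g^i = 0
        if not (feasible and slackness):
            return False
    return True

# Trivial usage: a single-state, single-action, single-agent sanity check
print(is_sgsp([np.array([[1.0]])], [np.array([5.0])], [np.array([[5.0]])]))   # True
```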
An Online Algorithm: ON-SGSP
ON-SGSP's decentralized online learning model

[Figure: N agents (1, 2, ..., N), each running ON-SGSP, interact with a common environment. Agent i chooses only its own action a^i and observes the reward r and the next state y.]
ON-SGSP's operational flow

[Figure: a loop between policy evaluation (policy $\pi^i$ → value $v^i_\pi$) and policy improvement.]

Policy evaluation: estimate the value function using temporal-difference (TD) learning.
Policy improvement: update the policy by gradient descent along a descent direction.
The descent direction ensures convergence to a global minimum of the optimization problem.
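The policy-evaluation half can be sketched as tabular TD(0) run on simulated transitions (the faster timescale). The toy game, step-size schedule and run length below are illustrative assumptions rather than the paper's exact choices.

```python
# Sketch: tabular TD(0) policy evaluation for agent 1 under fixed policies,
# run on simulated transitions of a randomly generated toy game.
import numpy as np

beta, S, A1, A2 = 0.8, 2, 2, 2
rng = np.random.default_rng(2)
r1 = rng.uniform(0, 1, size=(S, A1, A2))              # agent 1's reward
p = rng.uniform(0, 1, size=(S, A1, A2, S))
p /= p.sum(axis=-1, keepdims=True)
pi1 = np.full((S, A1), 1.0 / A1)
pi2 = np.full((S, A2), 1.0 / A2)

v1 = np.zeros(S)                                      # TD estimate of v^1_pi
x = 0
for n in range(1, 100001):
    a1 = rng.choice(A1, p=pi1[x])                     # own action
    a2 = rng.choice(A2, p=pi2[x])                     # other agent's action
    y = rng.choice(S, p=p[x, a1, a2])                 # next state
    c_n = 1.0 / (1 + n) ** 0.6                        # faster-timescale step size (assumed)
    v1[x] += c_n * (r1[x, a1, a2] + beta * v1[y] - v1[x])   # TD(0) update
    x = y

# Compare with the exact value from solving the linear Bellman system
P_pi = np.einsum('xa,xb,xaby->xy', pi1, pi2, p)
r_pi = np.einsum('xa,xb,xab->x', pi1, pi2, r1)
print("TD estimate :", v1)
print("exact value :", np.linalg.solve(np.eye(S) - beta * P_pi, r_pi))
```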
More on the descent direction

Descend along a direction built, for each component $(x, a^i)$, from:
  $g^i_{x,a^i}(v^i, \pi^{-i})$, with $v^i$ obtained by TD-learning (policy evaluation),
  the current policy value $\pi^i(x, a^i)$, and
  $\overline{\mathrm{sgn}}\big(\partial f(v, \pi) / \partial \pi^i\big)$, which comes from Lagrange-multiplier and slack-variable theory.(1)

The resulting iterates track an ODE whose limit points are SG-SP points.

(1) $\overline{\mathrm{sgn}}$ is a continuous version of sgn.
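For illustration only, the sketch below implements the individual ingredients named above: a continuous surrogate for sgn and the per-component quantity $g^i_{x,a^i} = Q^i - v^i$, combined as a simple elementwise product with a user-supplied gradient array. Both the exact shape of $\overline{\mathrm{sgn}}$ and the way ON-SGSP combines the pieces are specified in the paper, so everything below is an assumption made only to show the structure.

```python
# Sketch: ingredients of the descent direction. sgn_bar is one possible
# continuous surrogate for sgn (the slide only says such a surrogate is used);
# grad_f_pi_i stands in for d f(v, pi) / d pi^i, which the paper derives. The
# elementwise combination below is illustrative, not the exact ON-SGSP update.
import numpy as np

def sgn_bar(z, eps=0.05):
    """A continuous version of sgn: linear on [-eps, eps], saturating at +/-1."""
    return np.clip(z / eps, -1.0, 1.0)

def descent_ingredients(pi_i, v_i, q_i, grad_f_pi_i):
    """pi_i, q_i, grad_f_pi_i: (S, A_i) arrays; v_i: (S,) values.

    Returns g^i_{x,a^i} = Q^i - v^i and an illustrative elementwise combination
    pi^i * g^i * sgn_bar(grad f) of the three pieces named on the slide."""
    g_i = q_i - v_i[:, None]
    return g_i, pi_i * g_i * sgn_bar(grad_f_pi_i)
```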
Experiments
A single-state non-generic 2-player game

Payoff matrix (rows: Player 1's action, columns: Player 2's action; each entry is (Player 1's payoff, Player 2's payoff)):

            a1      a2      a3
    a1     1, 0    0, 1    1, 0
    a2     0, 1    1, 0    1, 0
    a3     0, 1    0, 1    1, 1
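As a quick sanity check on this example (illustrative code, not the experiment scripts from the talk), the sketch below encodes the payoff matrix and enumerates the pure-strategy Nash equilibria by a best-response test.

```python
# Sketch: enumerate the pure Nash equilibria of the single-state (matrix) game above.
import numpy as np

# R[i, a1, a2] = payoff to player i when Player 1 plays a1 and Player 2 plays a2
R = np.array([
    [[1, 0, 1],      # Player 1's payoffs
     [0, 1, 1],
     [0, 0, 1]],
    [[0, 1, 0],      # Player 2's payoffs
     [1, 0, 0],
     [1, 1, 1]],
])

pure_nash = []
for a1 in range(3):
    for a2 in range(3):
        p1_ok = R[0, a1, a2] >= R[0, :, a2].max()   # no profitable deviation for Player 1
        p2_ok = R[1, a1, a2] >= R[1, a1, :].max()   # no profitable deviation for Player 2
        if p1_ok and p2_ok:
            pure_nash.append((f"a{a1+1}", f"a{a2+1}"))

print("pure Nash equilibria:", pure_nash)   # e.g. (a3, a3) with payoff (1, 1)
```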