Learning Fair Policies in Multiobjective (Deep) Reinforcement Learning with Average and Discounted Rewards
Umer Siddique, Paul Weng, and Matthieu Zimmer
University of Michigan-Shanghai Jiao Tong University Joint Institute
ICML 2020
Overview
1. Motivation and Problem
2. Theoretical Discussions & Algorithms
3. Experimental Results
4. Conclusion
Motivation: Why should we care about fair systems?
Figure: Network with a fat-tree topology from Ruffy et al. (2019).
- Fairness toward users is crucial.
- Existing approaches to tackle this issue include:
  - the utilitarian approach
  - the egalitarian approach
Fairness
Fairness includes:
- Efficiency
- Impartiality
- Equity
Fairness is encoded in a Social Welfare Function (SWF).
We focus on the generalized Gini social welfare function (GGF).
Problem Statement
GGF can be defined as:
$$\mathrm{GGF}_w(v) = \sum_{i=1}^{D} w_i \, v^{\uparrow}_i = [\, w_1 \; w_2 \; \cdots \; w_D \,] \begin{bmatrix} v^{\uparrow}_1 \\ v^{\uparrow}_2 \\ \vdots \\ v^{\uparrow}_D \end{bmatrix}$$
where $v^{\uparrow}$ denotes $v$ sorted in increasing order, i.e., $v^{\uparrow}_1 \le v^{\uparrow}_2 \le \cdots \le v^{\uparrow}_D$, and the weights are positive and strictly decreasing, $w_1 > w_2 > \cdots > w_D$.
Fair optimization problem in RL:
$$\arg\max_{\pi} \; \mathrm{GGF}_w(J(\pi)) \qquad (1)$$
where $J(\pi) = \mathbb{E}^{\pi}_{P}\!\left[ \sum_{t=1}^{\infty} \gamma^{t-1} R_t \right]$ ($\gamma$-discounted rewards) or $J(\pi) = \lim_{h \to \infty} \frac{1}{h} \, \mathbb{E}^{\pi}_{P}\!\left[ \sum_{t=1}^{h} R_t \right]$ (average rewards).
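As a quick illustration (not from the slides), GGF can be computed by sorting the value vector in increasing order and taking its dot product with the fixed decreasing weights; the sketch below uses hypothetical weights $w_i = 1/2^{i-1}$ and made-up value vectors.

```python
import numpy as np

def ggf(v, w):
    """Generalized Gini social welfare of a vector of per-objective values.

    v : (D,) array, e.g. the vectorial objective J(pi)
    w : (D,) array of positive, strictly decreasing weights
    """
    v_sorted = np.sort(v)              # v^up: components in increasing order
    return float(np.dot(w, v_sorted))  # worst-off objective gets the largest weight

# Hypothetical weights and vectors, purely for illustration
D = 3
w = 0.5 ** np.arange(D)                    # [1.0, 0.5, 0.25]
print(ggf(np.array([2.0, 1.0, 3.0]), w))   # 2.75
print(ggf(np.array([2.0, 2.0, 2.0]), w))   # 3.5: same total, but the balanced vector scores higher
```

This reflects the three properties listed earlier: the score is monotone in each component (efficiency), invariant to permuting the objectives (impartiality), and higher for more balanced vectors (equity).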
Theoretical Discussion
Assumption: MDPs are weakly communicating.
Sufficiency of Stationary Markov Policies: there exists a stationary Markov fair-optimal policy.
Possibly State-Dependent Optimality: with average rewards, fair optimality stays state-independent.
Contribution on Approximation Error: approximate the average-optimal policy ($\pi^*_1$) with the $\gamma$-optimal policy ($\pi^*_\gamma$).
Theorem:
$$\mathrm{GGF}_w(\mu(\pi^*_\gamma)) \ge \mathrm{GGF}_w(\mu(\pi^*_1)) - R\,(1-\gamma)\left( \rho\!\left(\gamma, \sigma(H_P^{\pi^*_1})\right) + \rho\!\left(\gamma, \sigma(H_P^{\pi^*_\gamma})\right) \right)$$
where $R = \max_\pi \|R_\pi\|_1$ and $\rho(\gamma, \sigma) = \dfrac{\sigma}{\gamma - (1-\gamma)\,\sigma}$.
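To get a feel for the bound, here is a tiny sketch (illustrative only; $R$ and the $\sigma(H_P^{\pi})$ quantities are made-up inputs rather than values computed from an MDP) that evaluates the error term and shows it vanishing as $\gamma \to 1$.

```python
def rho(gamma: float, sigma: float) -> float:
    # rho(gamma, sigma) = sigma / (gamma - (1 - gamma) * sigma), as stated in the theorem
    return sigma / (gamma - (1.0 - gamma) * sigma)

def ggf_gap(R: float, gamma: float, sigma_avg: float, sigma_disc: float) -> float:
    # Error term R * (1 - gamma) * (rho(gamma, sigma(H_P^{pi*_1})) + rho(gamma, sigma(H_P^{pi*_gamma})))
    return R * (1.0 - gamma) * (rho(gamma, sigma_avg) + rho(gamma, sigma_disc))

# Hypothetical numbers: the guarantee tightens as gamma approaches 1
for gamma in (0.9, 0.99, 0.999):
    print(gamma, round(ggf_gap(R=1.0, gamma=gamma, sigma_avg=0.5, sigma_disc=0.5), 4))
```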
Value-Based and Policy Gradient Algorithms
DQN: the Q network takes values in $\mathbb{R}^{|\mathcal{A}| \times D}$ instead of $\mathbb{R}^{|\mathcal{A}|}$, and is trained with the target:
$$\hat{Q}_\theta(s, a) = r + \gamma\, \hat{Q}_{\theta'}(s', a^*), \quad \text{where } a^* = \operatorname*{argmax}_{a' \in \mathcal{A}} \mathrm{GGF}_w\!\left( r + \gamma\, \hat{Q}_{\theta'}(s', a') \right).$$
To optimize GGF with policy gradient:
$$\nabla_\theta\, \mathrm{GGF}_w(J(\pi_\theta)) = \nabla_{J(\pi_\theta)} \mathrm{GGF}_w(J(\pi_\theta)) \cdot \nabla_\theta J(\pi_\theta) = w_\sigma^{\intercal} \cdot \nabla_\theta J(\pi_\theta),$$
where $w_\sigma$ denotes $w$ permuted to match the increasing ordering of the components of $J(\pi_\theta)$.
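A rough sketch of both adaptations (a simplified NumPy mock-up, not the authors' implementation; the Q-value array, reward vector, and weights are placeholders): GGF-DQN picks the greedy action by scoring each action's $D$-dimensional target with GGF, and the GGF policy gradient reweights the per-objective gradients by $w$ permuted to follow the sorted order of $J(\pi_\theta)$.

```python
import numpy as np

def ggf(v, w):
    # Generalized Gini: decreasing weights applied to v sorted increasingly
    return float(np.dot(w, np.sort(v)))

def ggf_greedy_action(r, q_next, w, gamma=0.99):
    """a* = argmax_{a'} GGF_w(r + gamma * Q_{theta'}(s', a')).

    r      : (D,) vector reward of the sampled transition
    q_next : (|A|, D) target-network Q-values at the next state s'
    """
    targets = r + gamma * q_next                      # one D-dimensional target per action
    scores = np.array([ggf(t, w) for t in targets])   # GGF score of each action
    return int(np.argmax(scores))

def ggf_pg_weights(J, w):
    """w_sigma: w permuted to match the increasing order of J(pi_theta).

    Scaling the per-objective policy gradients by these weights yields
    nabla_theta GGF_w(J(pi_theta)) by the chain rule (where GGF is differentiable).
    """
    order = np.argsort(J)       # permutation sorting J increasingly
    w_sigma = np.empty_like(w)
    w_sigma[order] = w          # the worst objective receives the largest weight
    return w_sigma

# Toy usage with made-up numbers
D, A = 3, 4
w = 0.5 ** np.arange(D)
print(ggf_greedy_action(np.zeros(D), np.random.rand(A, D), w))
print(ggf_pg_weights(np.array([2.0, 1.0, 3.0]), w))   # -> [0.5, 1.0, 0.25]
```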
Experimental Results
What is the impact of optimizing GGF instead of the average of the objectives?
Figure: Species Conservation, GGF score of A2C, GGF-A2C, PPO, and GGF-PPO.
Figure: Species Conservation, average density of sea-otters and abalones under A2C, GGF-A2C, PPO, and GGF-PPO.
Experimental Results
What is the price of fairness? How do these algorithms perform in continuous domains?
Figure: Species Conservation, average accumulated density over the number of steps for PPO and GGF-PPO.
Figure: Network Congestion Control, average accumulated bandwidth over the number of episodes for PPO and GGF-PPO.
Experimental Results (Traffic Light Control)
What is the effect of γ with respect to GGF-average optimality?
Figure: Traffic Light Control, GGF score (×10^7) for PPO and GGF-PPO, with γ = 0.99 and with γ close to 1.
Figure: Average waiting time in each direction (North, East, West, South) for PPO and GGF-PPO.
Conclusion
- Fair optimization in the RL setting
- Theoretical discussion with a new bound
- Adaptations of DQN, A2C, and PPO to solve this problem
- Experimental validation in three domains
Future Work:
- Extend to distributed control
- Consider other fair social welfare functions
- Directly solve average-reward problems
References
Ruffy, F., Przystupa, M., and Beschastnikh, I. (2019). Iroko: A framework to prototype reinforcement learning for data center traffic control. In Workshop on ML for Systems at NeurIPS.