Learning Fair Policies in Multiobjective (Deep) Reinforcement Learning with Average and Discounted Rewards


  1. Learning Fair Policies in Multiobjective (Deep) Reinforcement Learning with Average and Discounted Rewards. Umer Siddique, Paul Weng, and Matthieu Zimmer. University of Michigan-Shanghai Jiao Tong University Joint Institute. ICML 2020.

  2. Overview: 1. Motivation and Problem; 2. Theoretical Discussion & Algorithms; 3. Experimental Results; 4. Conclusion.

  3. Motivation: Why should we care about fair systems? [Figure: Network with a fat-tree topology, from Ruffy et al. (2019).] Fair treatment of users is crucial. Existing approaches to this issue include the utilitarian approach and the egalitarian approach.

  4. Fairness. Fairness encompasses efficiency, impartiality, and equity. Fairness is encoded in a social welfare function (SWF). We focus on the generalized Gini social welfare function (GGF).

  5. Problem Statement. The GGF is defined as:
  GGF_w(v) = Σ_{i=1}^{D} w_i v↑_i = [w_1, w_2, ..., w_D] · [v↑_1, v↑_2, ..., v↑_D]^⊺,
  where v↑ denotes v sorted in increasing order and the weights are positive and strictly decreasing: w_1 > w_2 > ... > w_D.
  The fair optimization problem in RL is:
  argmax_π GGF_w(J(π))    (1)
  where J(π) = E_{P_π}[Σ_{t=1}^{∞} γ^{t−1} R_t] (γ-discounted rewards) or J(π) = lim_{h→∞} (1/h) E_{P_π}[Σ_{t=1}^{h} R_t] (average rewards).
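To make the definition concrete, here is a minimal sketch of computing GGF_w(v), assuming NumPy; the weights w_i = 1/2^i are illustrative, not prescribed by the slides (any positive, strictly decreasing weights work):

```python
import numpy as np

def ggf(v, w):
    """Generalized Gini social welfare function GGF_w(v).

    v: per-objective values (one entry per objective/user).
    w: positive, strictly decreasing weights, w[0] > w[1] > ... > w[D-1].
    Sorting v in increasing order before the dot product assigns the
    largest weight to the worst-off objective, which is what makes the
    scalarization fair.
    """
    return float(np.dot(w, np.sort(v)))

# Illustrative weights (an assumption): w_i = 1 / 2^i.
D = 3
w = 0.5 ** np.arange(D)                   # [1.0, 0.5, 0.25]

print(ggf(np.array([1.0, 1.0, 1.0]), w))  # balanced: 1.75
print(ggf(np.array([3.0, 0.0, 0.0]), w))  # same total, unbalanced: 0.75
```

Both vectors have the same sum, but GGF scores the balanced one higher, which is exactly the behavior the fair optimization problem (1) rewards.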

  6. Theoretical Discussion. Assumption: the MDPs are weakly communicating.
  Sufficiency of stationary Markov policies: a stationary Markov fair-optimal policy exists.
  Possibly state-dependent optimality: fair optimality may in general depend on the initial state, but with average rewards it stays state-independent.
  Contribution on approximation error: approximate the average-reward fair-optimal policy π*_1 with the γ-discounted fair-optimal policy π*_γ.
  Theorem: GGF_w(μ(π*_γ)) ≥ GGF_w(μ(π*_1)) − R(1 − γ)[ρ(γ, σ(H_{P_{π*_1}})) + ρ(γ, σ(H_{P_{π*_γ}}))],
  where μ(π) is the vector of average rewards of π, R = max_π ‖R_π‖_1, and ρ(γ, σ) = σ / (γ − (1 − γ)σ). As γ → 1, (1 − γ)ρ(γ, σ) → 0, so the gap vanishes and the γ-optimal policy recovers average-reward fair optimality.
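As a quick numeric illustration of why the bound matters, the sketch below assumes the reconstructed form ρ(γ, σ) = σ / (γ − (1 − γ)σ) given above, with an illustrative value of σ; treat it as a sketch of the asymptotic behavior, not the paper's analysis:

```python
def rho(gamma, sigma):
    # Assumed reconstruction of rho from the slide; not verified
    # against the paper's exact statement.
    return sigma / (gamma - (1.0 - gamma) * sigma)

sigma = 0.5  # illustrative stand-in for sigma(H_{P_pi})
for gamma in (0.9, 0.99, 0.999):
    penalty = (1.0 - gamma) * rho(gamma, sigma)
    print(f"gamma={gamma}: (1 - gamma) * rho = {penalty:.4f}")
# Output: 0.0588, 0.0051, 0.0005 -- the approximation gap vanishes as gamma -> 1.
```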

  7. Value-Based and Policy Gradient Algorithms.
  GGF-DQN: the Q-network takes values in R^{|A|×D} instead of R^{|A|} and is trained with the target
  Q̂_θ(s, a) = r + γ Q̂_{θ′}(s′, a*), where a* = argmax_{a′∈A} GGF_w(r + γ Q̂_{θ′}(s′, a′)).
  To optimize the GGF with policy gradient:
  ∇_θ GGF_w(J(π_θ)) = ∇_{J(π_θ)} GGF_w(J(π_θ)) · ∇_θ J(π_θ) = w_σ^⊺ · ∇_θ J(π_θ),
  where σ is the permutation sorting J(π_θ) in increasing order.
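To make both adaptations concrete, here is a minimal NumPy sketch; the shapes, function names, and reuse of the ggf idea above are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def ggf_dqn_action(q_next, r, w, gamma):
    """Greedy action for the GGF-DQN target.

    q_next: array of shape (|A|, D), the vector-valued Q_{theta'}(s', a')
            for every action a'.
    r:      reward vector of shape (D,).
    Returns a* = argmax_{a'} GGF_w(r + gamma * Q_{theta'}(s', a')).
    """
    candidates = r + gamma * q_next                        # (|A|, D)
    scores = [np.dot(w, np.sort(c)) for c in candidates]   # GGF per action
    return int(np.argmax(scores))

def ggf_pg_weights(J, w):
    """Gradient of GGF_w at the current objective values J(pi_theta).

    GGF_w is piecewise linear (and concave for decreasing weights), so
    where it is differentiable its gradient is the weight vector permuted
    by the sort order of J: the objective with the smallest value gets the
    largest weight w_1. The policy gradient is then
    w_sigma^T . grad_theta J(pi_theta).
    """
    sigma = np.argsort(J)          # permutation sorting J increasingly
    w_sigma = np.empty_like(w)
    w_sigma[sigma] = w             # w_sigma[j] = weight of objective j
    return w_sigma
```

In a GGF-A2C or GGF-PPO variant, w_sigma would simply rescale the per-objective advantage estimates before the standard policy-gradient update.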

  8. Experimental Results. What is the impact of optimizing GGF instead of the average of the objectives?
  [Figure: GGF score on Species Conservation for A2C, GGF-A2C, PPO, and GGF-PPO.]
  [Figure: Average density of sea otters and abalones for A2C, GGF-A2C, PPO, and GGF-PPO.]

  9. Experimental Results. What is the price of fairness? How do these algorithms perform in continuous domains?
  [Figure: Average accumulated density vs. number of steps on Species Conservation for PPO and GGF-PPO.]
  [Figure: Average accumulated bandwidth vs. number of episodes on Network Congestion Control for PPO and GGF-PPO.]

  10. Experimental Results (Traffic Light Control). What is the effect of γ with respect to GGF-average optimality?
  [Figure: GGF score on Traffic Light Control for PPO and GGF-PPO, with γ = 0.99 and γ → 1⁻.]
  [Figure: Average waiting time per direction (North, East, West, South) for PPO and GGF-PPO.]

  11. Conclusion.
  Fair optimization in the RL setting.
  Theoretical discussion with a new bound.
  Adaptations of DQN, A2C, and PPO to solve this problem.
  Experimental validation in three domains.
  Future work: extend to distributed control; consider other fair social welfare functions; directly solve average-reward problems.

  References: Ruffy, F., Przystupa, M., and Beschastnikh, I. (2019). Iroko: A framework to prototype reinforcement learning for data center traffic control. In Workshop on ML for Systems at NeurIPS.
