Learning Fair Policies in Multiobjective (Deep) Reinforcement Learning with Average and Discounted Rewards
Umer Siddique, Paul Weng, and Matthieu Zimmer
University of Michigan-Shanghai Jiao Tong University Joint Institute
ICML 2020
Overview
1. Motivation and Problem
2. Theoretical Discussions & Algorithms
3. Experimental Results
4. Conclusion
Motivation: Why should we care about fair systems?
Figure: Network with a fat-tree topology from Ruffy et al. (2019).
- Fairness toward users is crucial.
- Existing approaches to tackle this issue include:
  - the utilitarian approach
  - the egalitarian approach
Fairness
Fairness includes:
- Efficiency
- Impartiality
- Equity
Fairness is encoded in a Social Welfare Function (SWF).
We focus on the generalized Gini social welfare function (GGF).
Problem Statement
GGF can be defined as:
$$\mathrm{GGF}_w(v) = \sum_{i=1}^{D} w_i \, v^{\uparrow}_i = [\, w_1 \; w_2 \; \cdots \; w_D \,] \begin{bmatrix} v^{\uparrow}_1 \\ v^{\uparrow}_2 \\ \vdots \\ v^{\uparrow}_D \end{bmatrix}$$
where $v^{\uparrow}$ denotes $v$ sorted in increasing order, i.e., $v^{\uparrow}_1 \le v^{\uparrow}_2 \le \cdots \le v^{\uparrow}_D$, and the weights are positive and strictly decreasing, $w_1 > w_2 > \cdots > w_D$.
Fair optimization problem in RL:
$$\arg\max_{\pi} \; \mathrm{GGF}_w(J(\pi)) \qquad (1)$$
where $J(\pi) = \mathbb{E}^{\pi}_{P}\!\left[ \sum_{t=1}^{\infty} \gamma^{t-1} R_t \right]$ ($\gamma$-discounted rewards) or $J(\pi) = \lim_{h \to \infty} \frac{1}{h} \, \mathbb{E}^{\pi}_{P}\!\left[ \sum_{t=1}^{h} R_t \right]$ (average rewards).
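As a quick illustration (not from the slides), GGF can be computed by sorting the value vector in increasing order and taking its dot product with the fixed decreasing weights; the sketch below uses hypothetical weights $w_i = 1/2^{i-1}$ and made-up value vectors.

```python
import numpy as np

def ggf(v, w):
    """Generalized Gini social welfare of a vector of per-objective values.

    v : (D,) array, e.g. the vectorial objective J(pi)
    w : (D,) array of positive, strictly decreasing weights
    """
    v_sorted = np.sort(v)              # v^up: components in increasing order
    return float(np.dot(w, v_sorted))  # worst-off objective gets the largest weight

# Hypothetical weights and vectors, purely for illustration
D = 3
w = 0.5 ** np.arange(D)                    # [1.0, 0.5, 0.25]
print(ggf(np.array([2.0, 1.0, 3.0]), w))   # 2.75
print(ggf(np.array([2.0, 2.0, 2.0]), w))   # 3.5: same total, but the balanced vector scores higher
```

This reflects the three properties listed earlier: the score is monotone in each component (efficiency), invariant to permuting the objectives (impartiality), and higher for more balanced vectors (equity).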
Theoretical Discussion
Assumption: MDPs are weakly communicating.
Sufficiency of Stationary Markov Policies: there exists a stationary Markov fair-optimal policy.
Possibly State-Dependent Optimality: with average rewards, fair optimality stays state-independent.
Contribution on Approximation Error: approximate the average-optimal policy ($\pi^*_1$) with the $\gamma$-optimal policy ($\pi^*_\gamma$).
Theorem:
$$\mathrm{GGF}_w(\mu(\pi^*_\gamma)) \ge \mathrm{GGF}_w(\mu(\pi^*_1)) - R\,(1-\gamma)\left( \rho\!\left(\gamma, \sigma(H_P^{\pi^*_1})\right) + \rho\!\left(\gamma, \sigma(H_P^{\pi^*_\gamma})\right) \right)$$
where $R = \max_\pi \|R_\pi\|_1$ and $\rho(\gamma, \sigma) = \dfrac{\sigma}{\gamma - (1-\gamma)\,\sigma}$.
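To get a feel for the bound, here is a tiny sketch (illustrative only; $R$ and the $\sigma(H_P^{\pi})$ quantities are made-up inputs rather than values computed from an MDP) that evaluates the error term and shows it vanishing as $\gamma \to 1$.

```python
def rho(gamma: float, sigma: float) -> float:
    # rho(gamma, sigma) = sigma / (gamma - (1 - gamma) * sigma), as stated in the theorem
    return sigma / (gamma - (1.0 - gamma) * sigma)

def ggf_gap(R: float, gamma: float, sigma_avg: float, sigma_disc: float) -> float:
    # Error term R * (1 - gamma) * (rho(gamma, sigma(H_P^{pi*_1})) + rho(gamma, sigma(H_P^{pi*_gamma})))
    return R * (1.0 - gamma) * (rho(gamma, sigma_avg) + rho(gamma, sigma_disc))

# Hypothetical numbers: the guarantee tightens as gamma approaches 1
for gamma in (0.9, 0.99, 0.999):
    print(gamma, round(ggf_gap(R=1.0, gamma=gamma, sigma_avg=0.5, sigma_disc=0.5), 4))
```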
Value-Based and Policy Gradient Algorithms
DQN: the Q network takes values in $\mathbb{R}^{|\mathcal{A}| \times D}$ instead of $\mathbb{R}^{|\mathcal{A}|}$, and is trained with the target:
$$\hat{Q}_\theta(s, a) = r + \gamma\, \hat{Q}_{\theta'}(s', a^*), \quad \text{where } a^* = \operatorname*{argmax}_{a' \in \mathcal{A}} \mathrm{GGF}_w\!\left( r + \gamma\, \hat{Q}_{\theta'}(s', a') \right).$$
To optimize GGF with policy gradient:
$$\nabla_\theta\, \mathrm{GGF}_w(J(\pi_\theta)) = \nabla_{J(\pi_\theta)} \mathrm{GGF}_w(J(\pi_\theta)) \cdot \nabla_\theta J(\pi_\theta) = w_\sigma^{\intercal} \cdot \nabla_\theta J(\pi_\theta),$$
where $w_\sigma$ denotes $w$ permuted to match the increasing ordering of the components of $J(\pi_\theta)$.
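A rough sketch of both adaptations (a simplified NumPy mock-up, not the authors' implementation; the Q-value array, reward vector, and weights are placeholders): GGF-DQN picks the greedy action by scoring each action's $D$-dimensional target with GGF, and the GGF policy gradient reweights the per-objective gradients by $w$ permuted to follow the sorted order of $J(\pi_\theta)$.

```python
import numpy as np

def ggf(v, w):
    # Generalized Gini: decreasing weights applied to v sorted increasingly
    return float(np.dot(w, np.sort(v)))

def ggf_greedy_action(r, q_next, w, gamma=0.99):
    """a* = argmax_{a'} GGF_w(r + gamma * Q_{theta'}(s', a')).

    r      : (D,) vector reward of the sampled transition
    q_next : (|A|, D) target-network Q-values at the next state s'
    """
    targets = r + gamma * q_next                      # one D-dimensional target per action
    scores = np.array([ggf(t, w) for t in targets])   # GGF score of each action
    return int(np.argmax(scores))

def ggf_pg_weights(J, w):
    """w_sigma: w permuted to match the increasing order of J(pi_theta).

    Scaling the per-objective policy gradients by these weights yields
    nabla_theta GGF_w(J(pi_theta)) by the chain rule (where GGF is differentiable).
    """
    order = np.argsort(J)       # permutation sorting J increasingly
    w_sigma = np.empty_like(w)
    w_sigma[order] = w          # the worst objective receives the largest weight
    return w_sigma

# Toy usage with made-up numbers
D, A = 3, 4
w = 0.5 ** np.arange(D)
print(ggf_greedy_action(np.zeros(D), np.random.rand(A, D), w))
print(ggf_pg_weights(np.array([2.0, 1.0, 3.0]), w))   # -> [0.5, 1.0, 0.25]
```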
Experimental Results
What is the impact of optimizing GGF instead of the average of the objectives?
Figure: Species Conservation, GGF score of A2C, GGF-A2C, PPO, and GGF-PPO.
Figure: Species Conservation, average density of sea-otters and abalones under A2C, GGF-A2C, PPO, and GGF-PPO.
Experimental Results
What is the price of fairness? How do these algorithms perform in continuous domains?
Figure: Species Conservation, average accumulated density over the number of steps for PPO and GGF-PPO.
Figure: Network Congestion Control, average accumulated bandwidth over the number of episodes for PPO and GGF-PPO.
Experimental Results (Traffic Light Control)
What is the effect of γ with respect to GGF-average optimality?
Figure: Traffic Light Control, GGF score (×10^7) for PPO and GGF-PPO, with γ = 0.99 and with γ close to 1.
Figure: Average waiting time in each direction (North, East, West, South) for PPO and GGF-PPO.
Conclusion
- Fair optimization in the RL setting
- Theoretical discussion with a new bound
- Adaptations of DQN, A2C, and PPO to solve this problem
- Experimental validation in three domains
Future Work:
- Extend to distributed control
- Consider other fair social welfare functions
- Directly solve average-reward problems
References
Ruffy, F., Przystupa, M., and Beschastnikh, I. (2019). Iroko: A framework to prototype reinforcement learning for data center traffic control. In Workshop on ML for Systems at NeurIPS.