Federated Optimization in Heterogeneous Networks
Tian Li (CMU), Anit Kumar Sahu (BCAI), Manzil Zaheer (Google Research), Maziar Sanjabi (Facebook AI), Ameet Talwalkar (CMU & Determined AI), Virginia Smith (CMU)
tianli@cmu.edu
Federated Learning
Privacy-preserving training in heterogeneous, (potentially) massive networks:
• Networks of remote devices, e.g., cell phones (next-word prediction)
• Networks of isolated organizations, e.g., hospitals (healthcare)
Example Applications
• Voice recognition on mobile phones
• Adapting to pedestrian behavior on autonomous vehicles
• Personalized healthcare on wearable devices
• Predictive maintenance for industrial machines
Workflow & Challenges
Objective: min_w f(w) = ∑_{k=1}^{N} p_k F_k(w), where F_k is the loss on device k and p_k is its weight.
A standard setup: the server broadcasts the current global model w^t to the devices, each device performs local training, and the server aggregates the resulting local models into w^{t+1}.
Challenges:
• Systems heterogeneity: variable hardware, network connectivity, power, etc.
• Statistical heterogeneity: highly non-identically distributed data
• Expensive communication: potentially massive network; wireless communication
• Privacy concerns: privacy leakage through parameters
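To make the objective above concrete, here is a minimal sketch (not the authors' code) of the weighted global objective; the quadratic local loss, the device data, and the choice of p_k proportional to data size are illustrative assumptions.

```python
# A minimal sketch of the global objective f(w) = sum_k p_k * F_k(w),
# with p_k proportional to device k's data size.
import numpy as np

def global_objective(w, device_data, local_loss):
    """Weighted average of the per-device losses F_k(w)."""
    sizes = np.array([len(d) for d in device_data], dtype=float)
    p = sizes / sizes.sum()                                # device weights p_k
    return sum(pk * local_loss(w, d) for pk, d in zip(p, device_data))

# Example with a hypothetical quadratic local loss F_k(w) = mean((x - w)^2):
local_loss = lambda w, data: float(np.mean((data - w) ** 2))
device_data = [np.random.randn(50) + 1.0, np.random.randn(200) - 0.5]
print(global_objective(0.0, device_data, local_loss))
```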
A Popular Method: Federated Averaging (FedAvg) [1]
At each communication round:
• The server randomly selects a subset of devices & sends them the current global model w^t
• Each selected device runs E epochs of SGD to optimize its local loss F_k & sends the new local model w_k^t back
• The server aggregates the local models to form a new global model w^{t+1}
Works well in many settings (especially non-convex)! But what can go wrong?
[1] McMahan, H. Brendan, et al. "Communication-Efficient Learning of Deep Networks from Decentralized Data." AISTATS, 2017.
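A minimal sketch of one FedAvg round as described above; the `local_sgd` helper, the device dictionary layout, and the size-weighted averaging are illustrative assumptions, not the paper's implementation.

```python
import random
import numpy as np

def fedavg_round(w_global, devices, num_selected, local_sgd, epochs):
    """One FedAvg communication round: sample devices, run E local epochs of SGD
    on each local loss F_k, then average the returned models (weighted by data size)."""
    selected = random.sample(devices, num_selected)
    models, sizes = [], []
    for device in selected:
        w_k = local_sgd(np.copy(w_global), device["data"], epochs)  # E epochs on F_k
        models.append(w_k)
        sizes.append(len(device["data"]))
    p = np.array(sizes, dtype=float) / sum(sizes)
    return sum(pk * w_k for pk, w_k in zip(p, models))              # new global model w^{t+1}
```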
What are the issues?
• Systems heterogeneity (stragglers): FedAvg is a heuristic method that simply drops slow devices [2].
• Statistical heterogeneity (highly non-identically distributed data): FedAvg simply averages updates and, in heterogeneous settings, is not guaranteed to converge.
(Figure: training loss of FedAvg with 0% vs. 90% stragglers.)
[2] Bonawitz, Keith, et al. "Towards Federated Learning at Scale: System Design." MLSys, 2019.
Outline
• Motivation
• FedProx Method
• Theoretical Analysis
• Experiments
• Future Work
FedProx — High Level
• Systems heterogeneity: FedAvg simply drops stragglers; FedProx allows for variable amounts of work & safely incorporates it, accounting for stragglers in the theory.
• Statistical heterogeneity: FedAvg averages simple SGD updates; FedProx encourages more well-behaved updates, with a convergence rate expressed as a function of statistical heterogeneity.
Contributions:
1. Convergence guarantees for federated learning in heterogeneous networks
2. More robust empirical performance
FedProx: A Framework for Federated Optimization
Objective: min_w f(w) = ∑_{k=1}^{N} p_k F_k(w). At each communication round, device k locally solves min_{w_k} F_k(w_k).
Idea 1: Allow for variable amounts of work to be performed on local devices, to handle stragglers.
Idea 2: Modified local subproblem: min_{w_k} F_k(w_k) + (μ/2) ‖w_k − w^t‖², where the second term is a proximal term.
FedProx: A Framework for Federated Optimization
Modified local subproblem: min_{w_k} F_k(w_k) + (μ/2) ‖w_k − w^t‖²
The proximal term (1) safely incorporates noisy updates and (2) explicitly limits the impact of local updates.
• Generalization of FedAvg
• Can use any local solver
• More robust and stable empirical performance
• Strong theoretical guarantees (under some assumptions)
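A minimal sketch of an inexact solver for the modified local subproblem above, using plain gradient steps; `grad_Fk`, the step size, and the step count are illustrative assumptions.

```python
import numpy as np

def fedprox_local_update(w_global, grad_Fk, mu, lr=0.01, num_steps=100):
    """Approximately solve min_w F_k(w) + (mu/2) * ||w - w_global||^2
    with plain (stochastic) gradient steps; grad_Fk(w) returns a gradient of F_k."""
    w = np.copy(w_global)
    for _ in range(num_steps):
        g = grad_Fk(w) + mu * (w - w_global)   # gradient of the proximal objective
        w -= lr * g
    return w
```

With μ = 0 and SGD as the local solver, this reduces to a FedAvg local update, which is the sense in which FedProx generalizes FedAvg.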
Convergence Analysis
Challenges: device subsampling, non-IID data, local updating. High-level result: FedProx converges despite these challenges.
Introduces the notion of B-dissimilarity to characterize statistical heterogeneity: 𝔼_k[‖∇F_k(w)‖²] ≤ ‖∇f(w)‖² B², so B = 1 for IID data and B > 1 for non-IID data.
*Similar quantities have been used in other contexts, e.g., gradient diversity [3], to quantify the benefits of scaling distributed SGD.
[3] Yin, Dong, et al. "Gradient Diversity: A Key Ingredient for Scalable Distributed Learning." AISTATS, 2018.
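To make the definition concrete, a hedged sketch of an empirical B(w) computed from per-device gradients; evaluating full local gradients on every device is assumed here purely for illustration.

```python
import numpy as np

def dissimilarity_B(local_grads, weights):
    """Empirical B(w) = sqrt( E_k[||grad F_k(w)||^2] / ||grad f(w)||^2 ),
    with the device average weighted by p_k."""
    p = np.asarray(weights, dtype=float)
    p = p / p.sum()
    grads = np.stack(local_grads)                         # shape (K, d)
    global_grad = p @ grads                               # grad f(w) = sum_k p_k grad F_k(w)
    mean_sq_norm = float(p @ np.sum(grads ** 2, axis=1))  # E_k[||grad F_k(w)||^2]
    return float(np.sqrt(mean_sq_norm / np.linalg.norm(global_grad) ** 2))

# Identical local gradients give B = 1 (IID data); conflicting gradients give B > 1.
```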
Convergence Analysis
• Assumption 1: Dissimilarity is bounded.
• Assumption 2: The modified local subproblem is convex & smooth. (The proximal term makes the method more amenable to theoretical analysis!)
• Assumption 3: Each local subproblem is solved to some accuracy. (A flexible communication/computation tradeoff; partial work is accounted for in the rates.)
Convergence Analysis
[Theorem] To reach suboptimality ε, the number of rounds required is T = O((f(w^0) − f*) / (ρ ε)), where ρ is a constant depending on (B, μ, …).
The rate is general:
• Covers both convex and non-convex loss functions
• Independent of the local solver; agnostic to the sampling method
• The same asymptotic convergence guarantee as SGD
• Can converge much faster than distributed SGD in practice
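One way to see where this rate comes from, sketched from the per-round decrease stated in the complete theorem (Backup 4); this is the standard telescoping argument, not the paper's full proof:

```latex
% Taking total expectations of the per-round decrease and summing over t = 0, ..., T-1:
\mathbb{E}\big[f(w^{t+1})\big] \le \mathbb{E}\big[f(w^{t})\big] - \rho\,\mathbb{E}\big[\|\nabla f(w^{t})\|^2\big]
\;\Longrightarrow\;
\rho \sum_{t=0}^{T-1} \mathbb{E}\big[\|\nabla f(w^{t})\|^2\big] \le f(w^{0}) - f^{*}
\;\Longrightarrow\;
\min_{t < T} \mathbb{E}\big[\|\nabla f(w^{t})\|^2\big] \le \frac{f(w^{0}) - f^{*}}{\rho\, T}.
```

Requiring the right-hand side to be at most ε then gives T = O((f(w^0) − f*) / (ρ ε)).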
Experiments: Zero Systems Heterogeneity + Fixed Statistical Heterogeneity
Benchmark: LEAF (leaf.cmu.edu)
FedProx with μ > 0 leads to more stable convergence under statistical heterogeneity. (Figure: training loss of FedAvg vs. FedProx, μ > 0.)
Similar benefits are observed for all datasets. (Figure: FedAvg vs. FedProx, μ > 0, per dataset.)
Experiments: High Systems Heterogeneity + Fixed Statistical Heterogeneity
Allowing for variable amounts of work to be performed helps convergence in the presence of systems heterogeneity. FedProx with μ > 0 leads to more stable convergence under statistical & systems heterogeneity. (Figure: training loss of FedAvg vs. FedProx, μ = 0 vs. FedProx, μ > 0.)
In terms of test accuracy: on average, a 22% absolute accuracy improvement over FedAvg in highly heterogeneous settings. Similar benefits are observed for all datasets. (Figure: test accuracy of FedAvg vs. FedProx, μ = 0 vs. FedProx, μ > 0.)
Experiments: Impact of Statistical Heterogeneity
• Increasing heterogeneity leads to worse convergence
• Setting μ > 0 can help to combat this
• In addition, B-dissimilarity captures statistical heterogeneity (see paper)
Future Work
• Hyper-parameter tuning: setting μ automatically
• Diagnostics: determining heterogeneity a priori; leveraging the heterogeneity for improved performance
• Privacy & security: better privacy metrics & mechanisms
• Personalization: automatic fine-tuning
• Productionizing: cold-start problems
White paper: Federated Learning: Challenges, Methods, and Future Directions. IEEE Signal Processing Magazine, 2020. (also on arXiv)
Thanks!
• Poster: #3, this room
• On-device Intelligence Workshop: Wednesday, this room
• Benchmark: leaf.cmu.edu
• Paper & code: cs.cmu.edu/~litian/
Backup 1
• Relations with previous work
  • Proximal term
    • Elastic SGD: employs a more complex moving average to update parameters; limited to SGD as a local solver; has only been analyzed for quadratic problems
    • DANE and inexact DANE: add an additional gradient correction term; assume full device participation (unrealistic); discouraging empirical performance
    • FedDANE: A Federated Newton-Type Method, arXiv
    • Other works: different purposes, such as speeding up SGD on a single machine; different analysis assumptions (IID data, solving subproblems exactly)
  • B-dissimilarity term
    • Used for other purposes, such as quantifying the benefit of scaling SGD for IID data
Backup 2
• Data statistics
• Systems heterogeneity simulation
  • Fix a global number of epochs E, and force some devices to perform fewer updates than E epochs. In particular, for the varying heterogeneous settings, assign x epochs (with x chosen uniformly at random from [1, E]) to 0%, 50%, and 90% of the selected devices, respectively.
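A minimal sketch of this simulation; the function name, the seeding, and the exact way the straggler set is sampled are illustrative assumptions.

```python
import random

def assign_local_epochs(selected_devices, E, straggler_fraction, seed=0):
    """Simulate systems heterogeneity as described above: a given fraction of the
    selected devices run x ~ Uniform{1, ..., E} local epochs instead of the full E."""
    rng = random.Random(seed)
    epochs = {}
    for device_id in selected_devices:
        if rng.random() < straggler_fraction:   # e.g., 0.0, 0.5, or 0.9
            epochs[device_id] = rng.randint(1, E)
        else:
            epochs[device_id] = E
    return epochs

# Example: 90% of 10 selected devices become stragglers with E = 20.
print(assign_local_epochs(range(10), E=20, straggler_fraction=0.9))
```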
Backup 3 • The original FedAvg algorithm
Backup 4
• Complete theorem
Assume the functions F_k are non-convex and L-Lipschitz smooth, and that there exists L_- > 0 such that ∇²F_k ⪰ −L_- I, with μ̄ := μ − L_- > 0. Suppose that w^t is not a stationary solution and the local functions F_k are B-dissimilar, i.e., B(w^t) ≤ B. If μ, K, and γ_k^t are chosen such that

ρ^t = (1/μ) − (γ^t B)/μ − (√2 B(1 + γ^t))/(μ̄ √K) − (L B(1 + γ^t))/(μ̄ μ) − (L(1 + γ^t)² B²)/(2 μ̄²) − (L B²(1 + γ^t)²)/(μ̄² K) · (2√(2K) + 2) > 0,

then at iteration t of FedProx, we have the following expected decrease in the global objective:

𝔼_{S_t}[ f(w^{t+1}) ] ≤ f(w^t) − ρ^t ‖∇f(w^t)‖²,

where S_t is the set of K devices chosen at iteration t and γ^t = max_{k ∈ S_t} γ_k^t.