Federated Optimization in Heterogeneous Networks
Tian Li (CMU), Anit Kumar Sahu (BCAI), Manzil Zaheer (Google Research), Maziar Sanjabi (Facebook AI), Ameet Talwalkar (CMU & Determined AI), Virginia Smith (CMU)
tianli@cmu.edu
Federated Learning
Privacy-preserving training in heterogeneous, (potentially) massive networks:
• Networks of remote devices, e.g., cell phones (next-word prediction)
• Networks of isolated organizations, e.g., hospitals (healthcare)
Example Applications
• Voice recognition on mobile phones
• Adapting to pedestrian behavior on autonomous vehicles
• Personalized healthcare on wearable devices
• Predictive maintenance for industrial machines
Workflow & Challenges
Objective: min_w f(w) = ∑_{k=1}^{N} p_k F_k(w), where F_k is the loss on device k and p_k is its weight.
A standard setup: the server broadcasts the current global model w^t to the devices, each device performs local training, and the server aggregates the resulting local models into w^{t+1}.
Challenges:
• Systems heterogeneity: variable hardware, network connectivity, power, etc.
• Statistical heterogeneity: highly non-identically distributed data
• Expensive communication: potentially massive network; wireless communication
• Privacy concerns: privacy leakage through parameters
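To make the objective above concrete, here is a minimal sketch (not the authors' code) of the weighted global objective; the quadratic local loss, the device data, and the choice of p_k proportional to data size are illustrative assumptions.

```python
# A minimal sketch of the global objective f(w) = sum_k p_k * F_k(w),
# with p_k proportional to device k's data size.
import numpy as np

def global_objective(w, device_data, local_loss):
    """Weighted average of the per-device losses F_k(w)."""
    sizes = np.array([len(d) for d in device_data], dtype=float)
    p = sizes / sizes.sum()                                # device weights p_k
    return sum(pk * local_loss(w, d) for pk, d in zip(p, device_data))

# Example with a hypothetical quadratic local loss F_k(w) = mean((x - w)^2):
local_loss = lambda w, data: float(np.mean((data - w) ** 2))
device_data = [np.random.randn(50) + 1.0, np.random.randn(200) - 0.5]
print(global_objective(0.0, device_data, local_loss))
```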
A Popular Method: Federated Averaging (FedAvg) [1]
At each communication round:
• The server randomly selects a subset of devices & sends them the current global model w^t
• Each selected device runs E epochs of SGD to optimize its local loss F_k & sends the new local model w_k^t back
• The server aggregates the local models to form a new global model w^{t+1}
Works well in many settings (especially non-convex)! But what can go wrong?
[1] McMahan, H. Brendan, et al. "Communication-Efficient Learning of Deep Networks from Decentralized Data." AISTATS, 2017.
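A minimal sketch of one FedAvg round as described above; the `local_sgd` helper, the device dictionary layout, and the size-weighted averaging are illustrative assumptions, not the paper's implementation.

```python
import random
import numpy as np

def fedavg_round(w_global, devices, num_selected, local_sgd, epochs):
    """One FedAvg communication round: sample devices, run E local epochs of SGD
    on each local loss F_k, then average the returned models (weighted by data size)."""
    selected = random.sample(devices, num_selected)
    models, sizes = [], []
    for device in selected:
        w_k = local_sgd(np.copy(w_global), device["data"], epochs)  # E epochs on F_k
        models.append(w_k)
        sizes.append(len(device["data"]))
    p = np.array(sizes, dtype=float) / sum(sizes)
    return sum(pk * w_k for pk, w_k in zip(p, models))              # new global model w^{t+1}
```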
What are the issues?
• Systems heterogeneity (stragglers): FedAvg is a heuristic method that simply drops slow devices [2].
• Statistical heterogeneity (highly non-identically distributed data): FedAvg simply averages updates and, in heterogeneous settings, is not guaranteed to converge.
(Figure: training loss of FedAvg with 0% vs. 90% stragglers.)
[2] Bonawitz, Keith, et al. "Towards Federated Learning at Scale: System Design." MLSys, 2019.
Outline
• Motivation
• FedProx Method
• Theoretical Analysis
• Experiments
• Future Work
FedProx — High Level
• Systems heterogeneity: FedAvg simply drops stragglers; FedProx allows for variable amounts of work & safely incorporates it, accounting for stragglers in the theory.
• Statistical heterogeneity: FedAvg averages simple SGD updates; FedProx encourages more well-behaved updates, with a convergence rate expressed as a function of statistical heterogeneity.
Contributions:
1. Convergence guarantees for federated learning in heterogeneous networks
2. More robust empirical performance
FedProx: A Framework for Federated Optimization
Objective: min_w f(w) = ∑_{k=1}^{N} p_k F_k(w). At each communication round, device k locally solves min_{w_k} F_k(w_k).
Idea 1: Allow for variable amounts of work to be performed on local devices, to handle stragglers.
Idea 2: Modified local subproblem: min_{w_k} F_k(w_k) + (μ/2) ‖w_k − w^t‖², where the second term is a proximal term.
FedProx: A Framework for Federated Optimization
Modified local subproblem: min_{w_k} F_k(w_k) + (μ/2) ‖w_k − w^t‖²
The proximal term (1) safely incorporates noisy updates and (2) explicitly limits the impact of local updates.
• Generalization of FedAvg
• Can use any local solver
• More robust and stable empirical performance
• Strong theoretical guarantees (under some assumptions)
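A minimal sketch of an inexact solver for the modified local subproblem above, using plain gradient steps; `grad_Fk`, the step size, and the step count are illustrative assumptions.

```python
import numpy as np

def fedprox_local_update(w_global, grad_Fk, mu, lr=0.01, num_steps=100):
    """Approximately solve min_w F_k(w) + (mu/2) * ||w - w_global||^2
    with plain (stochastic) gradient steps; grad_Fk(w) returns a gradient of F_k."""
    w = np.copy(w_global)
    for _ in range(num_steps):
        g = grad_Fk(w) + mu * (w - w_global)   # gradient of the proximal objective
        w -= lr * g
    return w
```

With μ = 0 and SGD as the local solver, this reduces to a FedAvg local update, which is the sense in which FedProx generalizes FedAvg.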
Convergence Analysis
Challenges: device subsampling, non-IID data, local updating. High-level result: FedProx converges despite these challenges.
Introduces the notion of B-dissimilarity to characterize statistical heterogeneity: 𝔼_k[‖∇F_k(w)‖²] ≤ ‖∇f(w)‖² B², so B = 1 for IID data and B > 1 for non-IID data.
*Similar quantities have been used in other contexts, e.g., gradient diversity [3], to quantify the benefits of scaling distributed SGD.
[3] Yin, Dong, et al. "Gradient Diversity: A Key Ingredient for Scalable Distributed Learning." AISTATS, 2018.
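To make the definition concrete, a hedged sketch of an empirical B(w) computed from per-device gradients; evaluating full local gradients on every device is assumed here purely for illustration.

```python
import numpy as np

def dissimilarity_B(local_grads, weights):
    """Empirical B(w) = sqrt( E_k[||grad F_k(w)||^2] / ||grad f(w)||^2 ),
    with the device average weighted by p_k."""
    p = np.asarray(weights, dtype=float)
    p = p / p.sum()
    grads = np.stack(local_grads)                         # shape (K, d)
    global_grad = p @ grads                               # grad f(w) = sum_k p_k grad F_k(w)
    mean_sq_norm = float(p @ np.sum(grads ** 2, axis=1))  # E_k[||grad F_k(w)||^2]
    return float(np.sqrt(mean_sq_norm / np.linalg.norm(global_grad) ** 2))

# Identical local gradients give B = 1 (IID data); conflicting gradients give B > 1.
```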
Convergence Analysis
• Assumption 1: Dissimilarity is bounded.
• Assumption 2: The modified local subproblem is convex & smooth. (The proximal term makes the method more amenable to theoretical analysis!)
• Assumption 3: Each local subproblem is solved to some accuracy. (A flexible communication/computation tradeoff; partial work is accounted for in the rates.)
Convergence Analysis
[Theorem] To reach suboptimality ε, the number of rounds required is T = O((f(w^0) − f*) / (ρ ε)), where ρ is a constant depending on (B, μ, …).
The rate is general:
• Covers both convex and non-convex loss functions
• Independent of the local solver; agnostic to the sampling method
• The same asymptotic convergence guarantee as SGD
• Can converge much faster than distributed SGD in practice
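One way to see where this rate comes from, sketched from the per-round decrease stated in the complete theorem (Backup 4); this is the standard telescoping argument, not the paper's full proof:

```latex
% Taking total expectations of the per-round decrease and summing over t = 0, ..., T-1:
\mathbb{E}\big[f(w^{t+1})\big] \le \mathbb{E}\big[f(w^{t})\big] - \rho\,\mathbb{E}\big[\|\nabla f(w^{t})\|^2\big]
\;\Longrightarrow\;
\rho \sum_{t=0}^{T-1} \mathbb{E}\big[\|\nabla f(w^{t})\|^2\big] \le f(w^{0}) - f^{*}
\;\Longrightarrow\;
\min_{t < T} \mathbb{E}\big[\|\nabla f(w^{t})\|^2\big] \le \frac{f(w^{0}) - f^{*}}{\rho\, T}.
```

Requiring the right-hand side to be at most ε then gives T = O((f(w^0) − f*) / (ρ ε)).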
Experiments: Zero Systems Heterogeneity + Fixed Statistical Heterogeneity
Benchmark: LEAF (leaf.cmu.edu)
FedProx with μ > 0 leads to more stable convergence under statistical heterogeneity. (Figure: training loss of FedAvg vs. FedProx, μ > 0.)
Similar benefits are observed for all datasets. (Figure: FedAvg vs. FedProx, μ > 0, per dataset.)
Experiments: High Systems Heterogeneity + Fixed Statistical Heterogeneity
Allowing for variable amounts of work to be performed helps convergence in the presence of systems heterogeneity. FedProx with μ > 0 leads to more stable convergence under statistical & systems heterogeneity. (Figure: training loss of FedAvg vs. FedProx, μ = 0 vs. FedProx, μ > 0.)
In terms of test accuracy: on average, a 22% absolute accuracy improvement over FedAvg in highly heterogeneous settings. Similar benefits are observed for all datasets. (Figure: test accuracy of FedAvg vs. FedProx, μ = 0 vs. FedProx, μ > 0.)
Experiments: Impact of Statistical Heterogeneity
• Increasing heterogeneity leads to worse convergence
• Setting μ > 0 can help to combat this
• In addition, B-dissimilarity captures statistical heterogeneity (see paper)
Future Work
• Hyper-parameter tuning: setting μ automatically
• Diagnostics: determining heterogeneity a priori; leveraging the heterogeneity for improved performance
• Privacy & security: better privacy metrics & mechanisms
• Personalization: automatic fine-tuning
• Productionizing: cold-start problems
White paper: Federated Learning: Challenges, Methods, and Future Directions. IEEE Signal Processing Magazine, 2020. (also on arXiv)
Thanks!
• Poster: #3, this room
• On-device Intelligence Workshop: Wednesday, this room
• Benchmark: leaf.cmu.edu
• Paper & code: cs.cmu.edu/~litian/
Backup 1
• Relations with previous work
  • Proximal term
    • Elastic SGD: employs a more complex moving average to update parameters; limited to SGD as a local solver; has only been analyzed for quadratic problems
    • DANE and inexact DANE: add an additional gradient correction term; assume full device participation (unrealistic); discouraging empirical performance
    • FedDANE: A Federated Newton-Type Method, arXiv
    • Other works: different purposes, such as speeding up SGD on a single machine; different analysis assumptions (IID data, solving subproblems exactly)
  • B-dissimilarity term
    • Used for other purposes, such as quantifying the benefit of scaling SGD for IID data
Backup 2
• Data statistics
• Systems heterogeneity simulation
  • Fix a global number of epochs E, and force some devices to perform fewer updates than E epochs. In particular, for the varying heterogeneous settings, assign x epochs (with x chosen uniformly at random from [1, E]) to 0%, 50%, and 90% of the selected devices, respectively.
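A minimal sketch of this simulation; the function name, the seeding, and the exact way the straggler set is sampled are illustrative assumptions.

```python
import random

def assign_local_epochs(selected_devices, E, straggler_fraction, seed=0):
    """Simulate systems heterogeneity as described above: a given fraction of the
    selected devices run x ~ Uniform{1, ..., E} local epochs instead of the full E."""
    rng = random.Random(seed)
    epochs = {}
    for device_id in selected_devices:
        if rng.random() < straggler_fraction:   # e.g., 0.0, 0.5, or 0.9
            epochs[device_id] = rng.randint(1, E)
        else:
            epochs[device_id] = E
    return epochs

# Example: 90% of 10 selected devices become stragglers with E = 20.
print(assign_local_epochs(range(10), E=20, straggler_fraction=0.9))
```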
Backup 3 • The original FedAvg algorithm
Backup 4
• Complete theorem
Assume the functions F_k are non-convex and L-Lipschitz smooth, and that there exists L_- > 0 such that ∇²F_k ⪰ −L_- I, with μ̄ := μ − L_- > 0. Suppose that w^t is not a stationary solution and the local functions F_k are B-dissimilar, i.e., B(w^t) ≤ B. If μ, K, and γ_k^t are chosen such that

ρ^t = (1/μ) − (γ^t B)/μ − (√2 B(1 + γ^t))/(μ̄ √K) − (L B(1 + γ^t))/(μ̄ μ) − (L(1 + γ^t)² B²)/(2 μ̄²) − (L B²(1 + γ^t)²)/(μ̄² K) · (2√(2K) + 2) > 0,

then at iteration t of FedProx, we have the following expected decrease in the global objective:

𝔼_{S_t}[ f(w^{t+1}) ] ≤ f(w^t) − ρ^t ‖∇f(w^t)‖²,

where S_t is the set of K devices chosen at iteration t and γ^t = max_{k ∈ S_t} γ_k^t.