Acceleration for Compressed Gradient Descent in Distributed and Federated Optimization

Zhize Li, King Abdullah University of Science and Technology (KAUST), https://zhizeli.github.io

Joint work with Dmitry Kovalev (KAUST), Xun Qian (KAUST) and Peter Richtárik (KAUST)

ICML 2020
Overview

1. Problem
2. Related Work
3. Our Contributions
   - Single Device Setting
   - Distributed Setting
4. Experiments
Problem

Training distributed/federated learning models is typically performed by solving an optimization problem

\min_{x \in \mathbb{R}^d} \left\{ P(x) := \frac{1}{n} \sum_{i=1}^n f_i(x) + \psi(x) \right\},

• f_i(x): loss function associated with the data stored on node/device i
• \psi(x): regularization term (e.g., the \ell_1 regularizer \|x\|_1, the \ell_2 regularizer \|x\|_2^2, or the indicator function I_C(x) of some set C)
Examples

\min_{x \in \mathbb{R}^d} \left\{ P(x) := \frac{1}{n} \sum_{i=1}^n f_i(x) + \psi(x) \right\}

Each node/device i stores m data samples \{(a_{i,j}, b_{i,j}) \in \mathbb{R}^{d+1}\}_{j=1}^m:

• Lasso regression: f_i(x) = \frac{1}{m} \sum_{j=1}^m (a_{i,j}^\top x - b_{i,j})^2, \quad \psi(x) = \lambda \|x\|_1
• Logistic regression: f_i(x) = \frac{1}{m} \sum_{j=1}^m \log\left(1 + \exp(-b_{i,j}\, a_{i,j}^\top x)\right)
• SVM: f_i(x) = \frac{1}{m} \sum_{j=1}^m \max\left(0,\, 1 - b_{i,j}\, a_{i,j}^\top x\right), \quad \psi(x) = \frac{\lambda}{2} \|x\|_2^2
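As a concrete illustration, here is a minimal NumPy sketch of these local losses (not code from the talk); the names A_i (matrix whose rows are a_{i,j}^\top), b_i (local label vector), and the split into f_i and \psi are assumptions made for the example.

```python
import numpy as np

# Local data on node i: rows of A_i are a_{i,j}^T, entries of b_i are b_{i,j}.

def f_lasso(x, A_i, b_i):
    """f_i for Lasso regression: mean squared error over the m local samples."""
    return np.mean((A_i @ x - b_i) ** 2)

def f_logistic(x, A_i, b_i):
    """f_i for logistic regression with labels b_{i,j} in {-1, +1}."""
    return np.mean(np.log1p(np.exp(-b_i * (A_i @ x))))

def f_svm(x, A_i, b_i):
    """f_i for SVM: mean hinge loss over the m local samples."""
    return np.mean(np.maximum(0.0, 1.0 - b_i * (A_i @ x)))

def psi_l1(x, lam):
    """l1 regularizer psi(x) = lam * ||x||_1, used with Lasso."""
    return lam * np.linalg.norm(x, 1)

def psi_l2(x, lam):
    """l2 regularizer psi(x) = (lam/2) * ||x||_2^2, used with SVM."""
    return 0.5 * lam * np.dot(x, x)
```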
Goal

\min_{x \in \mathbb{R}^d} \left\{ P(x) := \frac{1}{n} \sum_{i=1}^n f_i(x) + \psi(x) \right\}

Goal: find an \epsilon-solution (parameters) \hat{x}, e.g., P(\hat{x}) - P(x^*) \le \epsilon or \|\hat{x} - x^*\|^2 \le \epsilon, where x^* := \arg\min_{x \in \mathbb{R}^d} P(x).

For optimization methods:
• Bottleneck: communication cost.
• Common strategy: compress the communicated messages (lower communication cost in each iteration/communication round) and hope that this does not increase the total number of iterations/communication rounds.
Related Work

• Several recent works show that the total communication complexity can be improved via compression; see, e.g., QSGD [Alistarh et al., 2017], DIANA [Mishchenko et al., 2019], and natural compression [Horváth et al., 2019].

• However, previous works usually lead to the following kind of improvement:

  Communication cost per iteration (--), Iterations (+)  =>  Total (-)

  ('-' denotes decrease, '+' denotes increase)

• In this work, we provide the first optimization methods provably combining the benefits of gradient compression and acceleration:

  Communication cost per iteration (--), Iterations (--)  =>  Total (----)
Single Device Setting

• First, consider the simple single device (i.e., n = 1) case:

  \min_{x \in \mathbb{R}^d} f(x),

  where f : \mathbb{R}^d \to \mathbb{R} is L-smooth, and convex or \mu-strongly convex.

• f is L-smooth (i.e., has an L-Lipschitz continuous gradient, for L > 0) if

  \|\nabla f(x) - \nabla f(y)\| \le L \|x - y\|,   (1)

  and \mu-strongly convex (for \mu \ge 0) if

  f(x) - f(y) - \langle \nabla f(y), x - y \rangle \ge \frac{\mu}{2} \|x - y\|^2   (2)

  for all x, y \in \mathbb{R}^d. The \mu = 0 case reduces to standard convexity.
Compressed Gradient Descent (CGD)

• Problem: \min_{x \in \mathbb{R}^d} f(x)

  1) Given initial point x^0 and step-size \eta
  2) CGD update: x^{k+1} = x^k - \eta\, C(\nabla f(x^k)), for k \ge 0

Definition (Compression operator)
A randomized map C : \mathbb{R}^d \mapsto \mathbb{R}^d is an \omega-compression operator if

  E[C(x)] = x, \quad E[\|C(x) - x\|^2] \le \omega \|x\|^2, \quad \forall x \in \mathbb{R}^d.   (3)

In particular, no compression (C(x) \equiv x) implies \omega = 0. Note that Condition (3) is satisfied by many practical compressors, e.g., random-k sparsification and (p, s)-quantization.
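For concreteness, below is a minimal sketch (not code from the paper) of one such compressor, random-k sparsification, together with the CGD loop above. Keeping k of the d coordinates uniformly at random and rescaling by d/k gives an unbiased compressor satisfying (3) with \omega = d/k - 1; the step size eta is left as an input, since the theoretical choice is not shown on this slide.

```python
import numpy as np

def random_k(x, k, rng):
    """Random-k sparsification: keep k random coordinates and rescale by d/k.
    Unbiased, and satisfies condition (3) with omega = d/k - 1."""
    d = x.size
    out = np.zeros_like(x)
    idx = rng.choice(d, size=k, replace=False)
    out[idx] = (d / k) * x[idx]
    return out

def cgd(grad_f, x0, eta, k, num_iters, seed=0):
    """CGD: x^{k+1} = x^k - eta * C(grad f(x^k)), with C = random-k."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for _ in range(num_iters):
        x = x - eta * random_k(grad_f(x), k, rng)
    return x
```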
Accelerated Compressed Gradient Descent (ACGD)

Inspired by Nesterov's accelerated gradient descent (AGD) [Nesterov, 2004] and FISTA [Beck and Teboulle, 2009], here we propose the first accelerated compressed gradient descent (ACGD) method.

Our ACGD update:
  1) x^k = \alpha_k y^k + (1 - \alpha_k) z^k
  2) y^{k+1} = x^k - \eta_k\, C(\nabla f(x^k))
  3) z^{k+1} = \beta_k \left[ \theta_k z^k + (1 - \theta_k) x^k \right] + (1 - \beta_k) \left[ \gamma_k y^{k+1} + (1 - \gamma_k) y^k \right]
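The three-step recursion can be transcribed directly; the sketch below is only illustrative and treats the parameter schedules (\alpha_k, \beta_k, \gamma_k, \theta_k, \eta_k), which are specified in the paper but not on this slide, as a user-supplied function.

```python
def acgd(grad_f, compress, x0, params, num_iters):
    """ACGD sketch following the three-step update on this slide.
    params(k) must return (alpha_k, beta_k, gamma_k, theta_k, eta_k);
    the specific schedules prescribed in the paper are not reproduced here.
    x0 is the initial point as a NumPy array."""
    y = x0.copy()
    z = x0.copy()
    for k in range(num_iters):
        alpha, beta, gamma, theta, eta = params(k)
        x = alpha * y + (1 - alpha) * z                        # step 1
        y_next = x - eta * compress(grad_f(x))                 # step 2: compressed gradient step
        z = beta * (theta * z + (1 - theta) * x) \
            + (1 - beta) * (gamma * y_next + (1 - gamma) * y)  # step 3
        y = y_next
    return y
```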
Convergence Results in Single Device Setting

Table: Convergence results (iterations) for the single device (n = 1) case \min_{x \in \mathbb{R}^d} f(x)

  Algorithm                                                  | \mu-strongly convex f                                            | convex f
  Compressed Gradient Descent (CGD [Khirirat et al., 2018])  | O\left((1 + \omega)\frac{L}{\mu}\log\frac{1}{\epsilon}\right)    | O\left((1 + \omega)\frac{L}{\epsilon}\right)
  ACGD (this paper)                                          | O\left((1 + \omega)\sqrt{\frac{L}{\mu}}\log\frac{1}{\epsilon}\right) | O\left((1 + \omega)\sqrt{\frac{L}{\epsilon}}\right)

• If there is no compression (i.e., \omega = 0): CGD recovers the results of vanilla (uncompressed) GD, i.e., O\left(\frac{L}{\mu}\log\frac{1}{\epsilon}\right) and O\left(\frac{L}{\epsilon}\right).

• If the compression parameter satisfies \omega \le O\left(\sqrt{\frac{L}{\mu}}\right) or O\left(\sqrt{\frac{L}{\epsilon}}\right): our ACGD enjoys the benefits of both compression and acceleration, i.e., both the communication cost per iteration (compression) and the total number of iterations (acceleration) are smaller than those of GD.
Recall the Discussion in Related Work

• Previous works usually lead to the following kind of improvement:

  Communication cost per iteration (--), Iterations (+)  =>  Total (-)

  ('-' denotes decrease, '+' denotes increase)

• In this work, we provide the first optimization methods provably combining the benefits of gradient compression and acceleration:

  Communication cost per iteration (--), Iterations (--)  =>  Total (----)
Distributed Setting

Now, we consider the general distributed setting with n devices/nodes:

\min_{x \in \mathbb{R}^d} \left\{ P(x) := \frac{1}{n} \sum_{i=1}^n f_i(x) + \psi(x) \right\}.

The presence of multiple nodes (n > 1) and of the regularizer \psi poses additional challenges.

We propose a distributed variant of ACGD (called ADIANA), which can be seen as an accelerated version of DIANA [Mishchenko et al., 2019].
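To convey the communication pattern, here is a hedged sketch of one DIANA-style round: each worker compresses the difference between its local gradient and a locally maintained shift h_i, and the server aggregates and applies a proximal step for \psi. This is not the exact ADIANA update, which additionally layers the acceleration mechanism of ACGD on top; names such as prox_psi and the shift step size alpha are illustrative.

```python
import numpy as np

def diana_style_round(x, h_list, grads, compress, eta, alpha, prox_psi):
    """One DIANA-style communication round (a sketch only; ADIANA also applies
    the ACGD acceleration mechanism, omitted here).
    Each node i compresses grad_i - h_i and then updates its shift h_i."""
    n = len(grads)
    deltas = [compress(grads[i] - h_list[i]) for i in range(n)]  # compressed uplink messages
    g = np.mean(h_list, axis=0) + np.mean(deltas, axis=0)        # server's unbiased gradient estimate
    x_next = prox_psi(x - eta * g, eta)                          # proximal step handles the regularizer psi
    h_next = [h_list[i] + alpha * deltas[i] for i in range(n)]   # shift update on each node
    return x_next, h_next
```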