Walkman: A Communication-Efficient Random-Walk Algorithm for Decentralized Optimization
Xianghui Mao ⋄, Kun Yuan ∗, Yubin Hu ⋄, Yuantao Gu ⋄, Ali H. Sayed †, Wotao Yin ‡
⋄ Tsinghua EE, ∗ UCLA ECE, † EPFL Engineering, ‡ UCLA Math
Outline
1. Decentralized optimization
2. The Walkman method
3. Convergence
4. Communication analysis
5. Simulation results
Decentralized optimization
• Consider a decentralized optimization problem over a network $(V, E)$:
$$\min_{x \in \mathbb{R}^p} \; r(x) + \frac{1}{n} \sum_{i=1}^{n} f_i(x), \tag{1}$$
where $n$ is the number of nodes.
• Node $i$ has access to $f_i(x)$. All nodes can access $r(x)$.
• Both $f_i(x)$ and $r(x)$ can be non-convex.
Gossip-based approaches
Figure: Gossip-based communication
• Each agent communicates with all, or a random subset, of its direct neighbors.
• Prior methods: DGD [1], diffusion [2], D-ADMM [3, 4], EXTRA [5], PG-EXTRA [6], DIGing [7], Exact diffusion [8], NIDS [9], ...
• Convergence rates are comparable to those of standard centralized optimization.
• Every agent communicates, so the per-iteration communication cost ranges from $O(n)$ to $O(n^2)$.
Random-walk approaches
Figure: A random walk (1, 6, 9, 1, 2, 6, 5, ...)
• A walker moves $x$ through the network and updates it; $x^k$ is its $k$th value.
• Agent $i$, on receiving $x$, updates it with its local (sub)gradient $\nabla f_i$.
• $O(1)$ communication per iteration.
• Prior works [10–13] require decaying step sizes and are therefore slow.
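The token-passing pattern can be sketched as follows. This is a minimal illustration, assuming a hypothetical 6-node ring graph and uniform neighbor selection (neither is specified in the slides); the function name `random_walk` is also an assumption.

```python
import random

def random_walk(neighbors, start, steps, seed=0):
    """Return the visited node sequence i_0, i_1, ..., i_steps."""
    rng = random.Random(seed)
    path = [start]
    for _ in range(steps):
        # The next node must be a direct neighbor of the current node.
        path.append(rng.choice(neighbors[path[-1]]))
    return path

# Hypothetical ring graph with 6 nodes (illustration only).
ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
path = random_walk(ring, start=0, steps=20)

# Exactly one message (the current x) crosses one edge per step,
# which is the O(1) per-iteration communication cost.
messages_sent = len(path) - 1
```

In a gossip scheme, by contrast, every agent would send at least one message in each iteration.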
Proposed method: Walkman
• Walkman is a random-walk (RW) algorithm.
• Exact convergence with a fixed step size; much faster than existing RW methods.
• Convergence guarantees are established for both non-convex and convex scenarios.
• More communication-efficient than state-of-the-art methods.
• Can escape from saddle points on the tested non-convex problems.
Walkman communication efficiency
• Communication complexity of various algorithms for decentralized least squares:

Algorithm | Communication complexity
Walkman (proposed) | $O\left(\ln\left(\frac{1}{\epsilon}\right) \cdot \frac{n \ln^3(n)}{(1 - \lambda_2(P))^2}\right)$
D-ADMM [14] | $O\left(\ln\left(\frac{1}{\epsilon}\right) \cdot \frac{m}{(1 - \lambda_2(P))^{1/2}}\right)$
EXTRA [5] | $O\left(\ln\left(\frac{1}{\epsilon}\right) \cdot \frac{m}{1 - \lambda_2(P)}\right)$
Exact diffusion [8] | $O\left(\ln\left(\frac{1}{\epsilon}\right) \cdot \frac{m}{1 - \lambda_2(P)}\right)$

• Walkman is the most communication-efficient when $\lambda_2(P) \le 1 - \frac{1}{m^{2/3}}$. Here $\lambda_2(P)$ is a measure of network connectivity, and $m$ is the number of edges.
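To get a feel for how these bounds scale, the expressions inside the $O(\cdot)$ can be evaluated directly. The sketch below ignores the hidden constants, so the numbers indicate scaling only, not measured communication. The function names and the two test networks (their sizes and $\lambda_2$ values) are hypothetical choices for illustration.

```python
import math

def walkman_bound(n, lam2, eps):
    """ln(1/eps) * n * ln^3(n) / (1 - lam2)^2, constants ignored."""
    return math.log(1 / eps) * n * math.log(n) ** 3 / (1 - lam2) ** 2

def dadmm_bound(m, lam2, eps):
    """ln(1/eps) * m / (1 - lam2)^(1/2), constants ignored."""
    return math.log(1 / eps) * m / math.sqrt(1 - lam2)

def extra_bound(m, lam2, eps):
    """ln(1/eps) * m / (1 - lam2), constants ignored (same form for exact diffusion)."""
    return math.log(1 / eps) * m / (1 - lam2)

eps = 1e-6

# Hypothetical dense, well-connected network: many edges, small lambda_2.
n_dense, m_dense, lam_dense = 10_000, 10_000 * 9_999 // 2, 0.5
dense = (walkman_bound(n_dense, lam_dense, eps),
         dadmm_bound(m_dense, lam_dense, eps),
         extra_bound(m_dense, lam_dense, eps))

# Hypothetical sparse, poorly connected network: few edges, lambda_2 near 1.
n_sparse, m_sparse, lam_sparse = 100, 100, 0.99
sparse = (walkman_bound(n_sparse, lam_sparse, eps),
          dadmm_bound(m_sparse, lam_sparse, eps),
          extra_bound(m_sparse, lam_sparse, eps))
```

Under these assumed values the Walkman expression is the smallest of the three on the dense network and the largest on the sparse one, which matches the qualitative message that Walkman pays off when the network is well connected and has many edges.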
Problem reformulation
• Recall the problem
$$\underset{x \in \mathbb{R}^p}{\text{minimize}} \; r(x) + \frac{1}{n} \sum_{i=1}^{n} f_i(x).$$
• Create local variables $y_i$ and make them all equal to $x$.
• Defining $Y = \mathrm{col}\{y_1, y_2, \cdots, y_n\} \in \mathbb{R}^{np}$ and $F(Y) = \sum_{i=1}^{n} f_i(y_i)$, we have
$$\underset{x,\, Y}{\text{minimize}} \; r(x) + \frac{1}{n} F(Y), \quad \text{subject to } \mathbf{1} \otimes x - Y = 0, \tag{2}$$
where $\mathbf{1} = [1 \; 1 \; \ldots \; 1]^T \in \mathbb{R}^n$ and $\otimes$ is the Kronecker product.
• The above two problems are equivalent.
Standard ADMM
• The augmented Lagrangian function of problem (2) is
$$L_\beta(x, Y; Z) = r(x) + \frac{1}{n}\Big(F(Y) + \langle Z, \mathbf{1} \otimes x - Y \rangle + \frac{\beta}{2} \|\mathbf{1} \otimes x - Y\|^2\Big),$$
where $Z \in \mathbb{R}^{np}$ is the dual variable (Lagrange multiplier).
• The standard ADMM iteration for (2) is
$$\begin{aligned}
\bar{x}^{k+1} &= \frac{1}{n} \sum_{i=1}^{n} \Big(y_i^k - \frac{z_i^k}{\beta}\Big), \\
x^{k+1} &= \mathrm{prox}_{\frac{1}{\beta} r}(\bar{x}^{k+1}), \\
y_i^{k+1} &= \mathrm{prox}_{\frac{1}{\beta} f_i}\Big(x^{k+1} + \frac{z_i^k}{\beta}\Big), \quad \forall i \in V, \\
z_i^{k+1} &= z_i^k + \beta (x^{k+1} - y_i^{k+1}), \quad \forall i \in V.
\end{aligned}$$
• Step 1 uses a reduce operation, which is implementable in a distributed 1-to-N setting but not in our decentralized setting.
Derive Walkman
• To decentralize the computation of $\bar{x}^{k+1}$, we propose to update $x$ with only one $y_i$ at a time:
$$\bar{x}^{k+1} = \frac{1}{n} \sum_{i=1}^{n} \Big(y_i^k - \frac{z_i^k}{\beta}\Big),$$
$$x^{k+1} = \mathrm{prox}_{\frac{1}{\beta} r}(\bar{x}^{k+1}), \tag{4}$$
$$y_i^{k+1} = \begin{cases} \mathrm{prox}_{\frac{1}{\beta} f_i}\Big(x^{k+1} + \frac{z_i^k}{\beta}\Big), & i = i_k, \\ y_i^k, & \text{otherwise}, \end{cases} \tag{5}$$
$$z_i^{k+1} = \begin{cases} z_i^k + \beta (x^{k+1} - y_i^{k+1}), & i = i_k, \\ z_i^k, & \text{otherwise}. \end{cases} \tag{6}$$
• A walker will carry $\bar{x}$ while visiting a sequence of nodes.
• Recall: among $y_1 - \frac{z_1}{\beta}, \cdots, y_n - \frac{z_n}{\beta}$, only $y_{i_k} - \frac{z_{i_k}}{\beta}$ is updated.
• The computation of $\bar{x}^{k+2}$ is equivalent to
$$\bar{x}^{k+2} = \underbrace{\bar{x}^{k+1}}_{\text{from neighbor}} + \underbrace{\frac{1}{n}\Big(y_{i_k}^{k+1} - \frac{z_{i_k}^{k+1}}{\beta}\Big) - \frac{1}{n}\Big(y_{i_k}^{k} - \frac{z_{i_k}^{k}}{\beta}\Big)}_{\text{local information}} \tag{7}$$
Such computation can be conducted locally.
(7), (4), (5), (6), (9)
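The incremental form (7) follows from the definition of $\bar{x}$ in a few lines. The derivation below is a sketch that uses only the fact that, by (5) and (6), every node other than $i_k$ keeps its $y_i$ and $z_i$ unchanged at iteration $k$.

```latex
\begin{align*}
\bar{x}^{k+2}
  &= \frac{1}{n}\sum_{i=1}^{n}\Big(y_i^{k+1} - \frac{z_i^{k+1}}{\beta}\Big) \\
  &= \frac{1}{n}\sum_{i \neq i_k}\Big(y_i^{k} - \frac{z_i^{k}}{\beta}\Big)
     + \frac{1}{n}\Big(y_{i_k}^{k+1} - \frac{z_{i_k}^{k+1}}{\beta}\Big) \\
  &= \bar{x}^{k+1}
     + \frac{1}{n}\Big(y_{i_k}^{k+1} - \frac{z_{i_k}^{k+1}}{\beta}\Big)
     - \frac{1}{n}\Big(y_{i_k}^{k} - \frac{z_{i_k}^{k}}{\beta}\Big).
\end{align*}
```

The last step adds and subtracts the old contribution of node $i_k$, so the full sum over all nodes is never recomputed.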
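• The subproblem
$$\underset{y}{\text{minimize}} \; f_i(y) + \frac{\beta}{2} \Big\| y - \Big(x^{k+1} + \frac{z_i^k}{\beta}\Big) \Big\|^2 \tag{8}$$
can be expensive to solve.
• When (8) is not easy to solve, we can linearize $f_i$ in (8) around $y_i^k$ and update $y_i$ cheaply:
$$y_i^{k+1} = \begin{cases} x^{k+1} + \frac{1}{\beta} z_i^k - \frac{1}{\beta} \nabla f_i(y_i^k), & i = i_k, \\ y_i^k, & \text{otherwise}. \end{cases} \tag{9}$$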
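A small numerical check of the linearized update: replacing $f_i(y)$ in (8) by its first-order model $\langle \nabla f_i(y_i^k), y \rangle$ (plus constants) gives a quadratic whose minimizer is exactly (9). The sketch below uses hypothetical random data, and the function name `linearized_y_update` is an assumption, not the authors' code.

```python
import numpy as np

def linearized_y_update(x_next, z_i, grad_i, beta):
    """Update (9) for the active node: one gradient evaluation, no inner solve."""
    return x_next + z_i / beta - grad_i / beta

# Hypothetical data for illustration only.
rng = np.random.default_rng(0)
p, beta = 4, 3.0
x_next, z_i, grad_i = rng.standard_normal((3, p))
y_new = linearized_y_update(x_next, z_i, grad_i, beta)

# Gradient of the linearized objective
#   <grad_i, y> + (beta / 2) * ||y - (x_next + z_i / beta)||^2
# evaluated at y_new; it vanishes at the minimizer.
residual = grad_i + beta * (y_new - (x_next + z_i / beta))
```

Here `grad_i` stands for $\nabla f_i(y_i^k)$, which the active node computes from its own data.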
Walkman [15]
• A walker carries $\bar{x}^k$ around the network.
• Each local variable $y_i^k$ is expected to converge to $x^\star$.
• The node activation is Markovian: node $i_{k+1}$ must be a neighbor of $i_k$.
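Putting updates (4) to (7) together for decentralized least squares with $r = 0$ gives the loop below. This is a minimal sketch, not the authors' implementation: the 5-node ring, the random data, the uniform neighbor choice, and the names `walkman_least_squares`, `ring`, and `x_star` are all assumptions made for illustration. With $f_i(y) = \frac{1}{2}\|A_i y - b_i\|^2$, the prox step (5) reduces to a linear solve.

```python
import numpy as np

def walkman_least_squares(A, b, neighbors, beta, iters, seed=0):
    """Random-walk ADMM sketch for min_x (1/2n) sum_i ||A_i x - b_i||^2 with r = 0."""
    rng = np.random.default_rng(seed)
    n, p = len(A), A[0].shape[1]
    y = np.zeros((n, p))   # local copies y_i
    z = np.zeros((n, p))   # local dual variables z_i
    x_bar = np.zeros(p)    # token; always equals mean_i (y_i - z_i / beta)
    i = 0                  # node currently holding the token
    for _ in range(iters):
        x = x_bar          # (4): the prox of r = 0 is the identity
        # (5): exact prox of f_i / beta, a p-by-p linear solve
        y_new = np.linalg.solve(A[i].T @ A[i] + beta * np.eye(p),
                                A[i].T @ b[i] + beta * x + z[i])
        # (6): dual update at the active node only
        z_new = z[i] + beta * (x - y_new)
        # (7): refresh the running average using local information only
        x_bar = x_bar + ((y_new - z_new / beta) - (y[i] - z[i] / beta)) / n
        y[i], z[i] = y_new, z_new
        # Markovian activation: hand the token to a random neighbor
        i = rng.choice(neighbors[i])
    return x_bar, y

# Hypothetical problem data (illustration only).
data_rng = np.random.default_rng(1)
n, p = 5, 3
A = [np.eye(p) + 0.1 * data_rng.standard_normal((p, p)) for _ in range(n)]
b = [data_rng.standard_normal(p) for _ in range(n)]
ring = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}

# Penalty chosen to satisfy beta > 2 * max_i sigma_max(A_i^T A_i) + 2.
sigma_max = max(np.linalg.norm(Ai, 2) ** 2 for Ai in A)
beta = 2 * sigma_max + 2.1

x_hat, y = walkman_least_squares(A, b, ring, beta, iters=20000)
x_star = np.linalg.lstsq(np.vstack(A), np.concatenate(b), rcond=None)[0]
```

Only the vector `x_bar` travels over the network; each node's `y[i]` and `z[i]` stay local, so every iteration costs a single message.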
Assumptions
Assumption (A1: Random walk)
The random walk $(i_k)_{k \ge 0}$, $i_k \in V$, forms an irreducible, aperiodic Markov chain with transition probability matrix $P$ and stationary distribution $\pi$.
This guarantees that each agent is visited infinitely many times.
Assumption (A2: Coerciveness)
The objective function $r(x) + \frac{1}{n} \sum_{i=1}^{n} f_i(x)$ is bounded from below over $\mathbb{R}^p$ and is coercive over $\mathbb{R}^p$, that is, $r(x^k) + \frac{1}{n} \sum_{i=1}^{n} f_i(x^k) \to \infty$ for any sequence $x^k \in \mathbb{R}^p$ with $\|x^k\| \to \infty$.
The infimum of the objective is therefore finite, and bounded function values imply a bounded sequence $x^k$.
Assumptions
Assumption (A3: $f_i$ smoothness)
Each $f_i(x)$ is $L$-Lipschitz differentiable.
Assumption (A4: $r$ is semiconvex)
Function $r(x)$ is $\gamma$-semiconvex, that is, $r(\cdot) + \frac{\gamma}{2} \|\cdot\|^2$ is convex.
Convergence property
Theorem
Under A1-A4, for $\beta > \max\{\gamma, 2L + 2\}$ (resp. $\beta > \max\{\gamma, 2L^2 + L + 2\}$), any limit point $(x^*, Y^*, Z^*)$ of the sequence $(x^k, Y^k, Z^k)$ generated by Walkman with $\mathrm{prox}_{f_i}$ (resp. $\nabla f_i$) satisfies $x^* = y_i^*$, $i = 1, \ldots, n$, where $x^*$ is a stationary point of (1) with probability 1, that is,
$$\Pr\Big(0 \in \partial r(x^*) + \frac{1}{n} \sum_{i=1}^{n} \nabla f_i(x^*)\Big) = 1.$$
If the objective of (1) is convex, then $x^*$ is a minimizer.
Implication: Walkman almost surely converges to a stationary point.
Convergence rate
• We examine the convergence rate for decentralized least squares:
$$\underset{x}{\text{minimize}} \; \frac{1}{2n} \sum_{i=1}^{n} \|A_i x - b_i\|^2.$$
This is a special case of problem (1) with $r = 0$.
• Node $i$ possesses $A_i$ and $b_i$.
• We need the mixing time to characterize the convergence rate.
Mixing time
• For $\delta > 0$, the mixing time [16, Chapter 11] is defined as the smallest integer $\tau(\delta)$ such that
$$\big| [P^{\tau(\delta)}]_{ij} - \pi_j \big| \le \delta \pi_*, \quad \forall i, j \in V, \tag{10}$$
where $\pi_* := \min_{i \in V} \pi_i$.
• After $\tau(\delta)$ steps, each agent $j$ will be visited with probability $\approx \pi_j$.
• Inequality (10) is guaranteed when
$$\tau(\delta) := \frac{1}{1 - \lambda_2(P)} \ln\Big(\frac{\sqrt{2}}{\delta \pi_*}\Big), \tag{11}$$
where $\lambda_2(P) := \sup\big\{ \|f^T P\| / \|f\| : f^T \mathbf{1} = 0, \; f \in \mathbb{R}^n \big\}$.
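The quantities in (10) and (11) are easy to compute for a small chain. The sketch below assumes a simple random walk on a hypothetical 5-node ring, evaluates $\lambda_2(P)$ as the spectral norm of $P$ restricted to vectors orthogonal to $\mathbf{1}$, and rounds (11) up to an integer number of steps (the rounding is an assumption made here so that $P^{\tau}$ is well defined). The function names are illustrative.

```python
import numpy as np

def lambda2(P):
    """sup ||f^T P|| / ||f|| over f orthogonal to the all-ones vector."""
    n = P.shape[0]
    proj = np.eye(n) - np.ones((n, n)) / n  # projector onto {f : f^T 1 = 0}
    return np.linalg.norm(proj @ P, 2)      # largest singular value

def mixing_time(P, pi, delta):
    """Right-hand side of (11), rounded up to an integer number of steps."""
    bound = np.log(np.sqrt(2) / (delta * pi.min())) / (1 - lambda2(P))
    return int(np.ceil(bound))

# Hypothetical chain: simple random walk on a 5-node ring.
n = 5
P = np.zeros((n, n))
for i in range(n):
    P[i, (i - 1) % n] = P[i, (i + 1) % n] = 0.5
pi = np.full(n, 1 / n)   # stationary distribution (uniform, since P is symmetric)

delta = 0.5
tau = mixing_time(P, pi, delta)

# Left-hand side of (10): worst-case deviation of P^tau from pi.
gap = np.abs(np.linalg.matrix_power(P, tau) - pi).max()
```

For this ring, `lambda2(P)` equals $\cos(\pi/5) \approx 0.809$, and `gap` can be compared against `delta * pi.min()` to confirm that (10) holds after `tau` steps.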
Convergence rate
Theorem
Under A1, for $\beta > 2\sigma^*_{\max} + 2$ with $\sigma^*_{\max} := \max_i \sigma_{\max}(A_i^T A_i)$, we have linear convergence:
$$\mathbb{E} \|Y^t - Y^\star\|^2 \le C_0 \big(1 + n (1 - \delta) \pi_* \nu\big)^{-\left\lfloor \frac{t}{\tau(\delta)} \right\rfloor}, \quad \forall t \ge 0,$$
where $\nu = \frac{(n-1)(\beta - \sigma^*_{\max})}{4 \beta^2 \tau(\delta) n^2}$, and $C_0$ is a constant that depends only on $A_1, \cdots, A_n$, $b_1, \cdots, b_n$, and $\beta$.
The quantity $\tau(\delta)$ plays the role of the number of iterations in an epoch.
Walkman solves the least squares problem at a linear convergence rate.
Communication complexity
• For simplicity, assume $P$ is a symmetric real matrix that models the gossip matrix of an undirected graph.
• Communication complexity of Walkman:
$$O\Bigg( \underbrace{\ln\Big(\frac{n}{\epsilon}\Big) \Big/ \ln\Big(1 + \frac{(1 - \delta) \pi_*}{\tau(\delta)}\Big)}_{\text{epoch numbers}} \cdot \underbrace{\tau(\delta)}_{\text{comm. per epoch}} \Bigg)$$
• Substituting $\tau(\delta)$ with (11):
$$O\Bigg( \underbrace{\ln\Big(\frac{n}{\epsilon}\Big) \Big/ \ln\Big(1 + \frac{1 - \lambda_2(P)}{2n \ln(2n)}\Big)}_{\text{epoch numbers}} \cdot \underbrace{\frac{\ln(n)}{1 - \lambda_2(P)}}_{\text{comm. per epoch}} \Bigg)$$
The number of edges $m$ is not explicitly involved.