decentralized gradient descent

Consider the problem
$$\operatorname*{minimize}_{\mathbf{x}}\ f(\mathbf{x}) = \sum_{i=1}^n f_i(x_i), \quad \text{s.t.}\ x_1 = x_2 = \cdots = x_n.$$

Decentralized gradient descent (DGD) (Nedic-Ozdaglar '09):
$$\mathbf{x}^{k+1} = W\mathbf{x}^k - \lambda \nabla f(\mathbf{x}^k).$$

• Rewrite it as $\mathbf{x}^{k+1} = \mathbf{x}^k - \big((I-W)\mathbf{x}^k + \lambda \nabla f(\mathbf{x}^k)\big)$; DGD is gradient descent with stepsize one on
$$\operatorname*{minimize}_{\mathbf{x}}\ \tfrac{1}{2}\big\|\sqrt{I-W}\,\mathbf{x}\big\|^2 + \lambda f(\mathbf{x}).$$

• The fixed point is generally not consensual: $W\mathbf{x}^* = \mathbf{x}^* + \lambda \nabla f(\mathbf{x}^*) \neq \mathbf{x}^*$.

• Remedy: a diminishing stepsize, i.e., decrease $\lambda$ during the iterations.
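To make the update concrete, here is a minimal NumPy sketch of DGD on a toy decentralized least-squares problem; the ring-graph mixing matrix $W$ and the quadratic local losses $f_i$ are assumptions for the demo, not part of the slides.

```python
import numpy as np

np.random.seed(0)
n, d = 5, 3                          # n agents, d-dimensional variable

# Symmetric, doubly stochastic mixing matrix of a ring graph (assumption).
W = np.zeros((n, n))
for i in range(n):
    W[i, i], W[i, (i - 1) % n], W[i, (i + 1) % n] = 0.5, 0.25, 0.25

# Local losses f_i(x) = 0.5 * ||A_i x - b_i||^2, one per agent.
A = np.random.randn(n, d, d)
b = np.random.randn(n, d)

def grad(X):
    # Row i is agent i's local gradient A_i^T (A_i x_i - b_i).
    return np.stack([A[i].T @ (A[i] @ X[i] - b[i]) for i in range(n)])

lam = 0.05                           # fixed stepsize
X = np.zeros((n, d))                 # row i is agent i's copy x_i
for k in range(500):
    X = W @ X - lam * grad(X)        # x^{k+1} = W x^k - lambda * grad f(x^k)

# With a fixed lambda, the iterates only reach a neighborhood of consensus,
# which is the motivation for the diminishing-stepsize remedy above.
print("consensus violation:", np.max(np.abs(X - X.mean(axis=0))))
```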
constant stepsize?

• alternating direction method of multipliers (ADMM) (Shi et al. '14, Chang-Hong-Wang '15, Hong-Chang '17)
• multi-consensus inner loops (Chen-Ozdaglar '12, Jakovetic-Xavier-Moura '14)
• EXTRA/PG-EXTRA (Shi et al. '15)
decentralized smooth optimization

Problem:
$$\operatorname*{minimize}_{\mathbf{x}}\ f(\mathbf{x}), \quad \text{s.t.}\ \sqrt{I-W}\,\mathbf{x} = \mathbf{0}.$$

• Lagrangian: $f(\mathbf{x}) + \langle \sqrt{I-W}\,\mathbf{x},\, \mathbf{s} \rangle$, where $\mathbf{s}$ is the Lagrange multiplier.

• Optimality condition (KKT):
$$0 = \nabla f(\mathbf{x}^*) + \sqrt{I-W}\,\mathbf{s}^*, \qquad 0 = -\sqrt{I-W}\,\mathbf{x}^*.$$

• Equivalently, in operator form:
$$-\begin{bmatrix} \nabla f(\mathbf{x}^*) \\ \mathbf{0} \end{bmatrix} = \begin{bmatrix} \mathbf{0} & \sqrt{I-W} \\ -\sqrt{I-W} & \mathbf{0} \end{bmatrix} \begin{bmatrix} \mathbf{x}^* \\ \mathbf{s}^* \end{bmatrix}.$$
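For intuition, $\sqrt{I-W}$ can be formed explicitly from an eigendecomposition when $W$ is symmetric with $-1 < \lambda(W) \le 1$; a small verification sketch of my own (the ring-graph $W$ is an assumption):

```python
import numpy as np

n = 5
W = np.zeros((n, n))                 # ring-graph W (assumption for the demo)
for i in range(n):
    W[i, i], W[i, (i - 1) % n], W[i, (i + 1) % n] = 0.5, 0.25, 0.25

# I - W is symmetric PSD when -1 < lambda(W) <= 1, so its principal square
# root comes from the eigendecomposition I - W = U diag(d) U^T.
d, U = np.linalg.eigh(np.eye(n) - W)
d = np.clip(d, 0.0, None)            # guard against tiny negative round-off
sqrtIW = U @ np.diag(np.sqrt(d)) @ U.T

# The null space of sqrt(I - W) is the consensus direction (W 1 = 1),
# so the constraint sqrt(I - W) x = 0 is exactly x_1 = ... = x_n.
print(np.allclose(sqrtIW @ np.ones(n), 0.0))           # True
print(np.allclose(sqrtIW @ sqrtIW, np.eye(n) - W))     # True
```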
forward-backward

• Apply forward-backward splitting to the KKT system:
$$\begin{bmatrix} \alpha I & -\sqrt{I-W} \\ -\sqrt{I-W} & \beta I \end{bmatrix}\begin{bmatrix} \mathbf{x}^k \\ \mathbf{s}^k \end{bmatrix} - \begin{bmatrix} \nabla f(\mathbf{x}^k) \\ \mathbf{0} \end{bmatrix} = \begin{bmatrix} \alpha I & -\sqrt{I-W} \\ -\sqrt{I-W} & \beta I \end{bmatrix}\begin{bmatrix} \mathbf{x}^{k+1} \\ \mathbf{s}^{k+1} \end{bmatrix} + \begin{bmatrix} \mathbf{0} & \sqrt{I-W} \\ -\sqrt{I-W} & \mathbf{0} \end{bmatrix}\begin{bmatrix} \mathbf{x}^{k+1} \\ \mathbf{s}^{k+1} \end{bmatrix}.$$

• It reduces to
$$\begin{bmatrix} \alpha I & -\sqrt{I-W} \\ -\sqrt{I-W} & \beta I \end{bmatrix}\begin{bmatrix} \mathbf{x}^k \\ \mathbf{s}^k \end{bmatrix} - \begin{bmatrix} \nabla f(\mathbf{x}^k) \\ \mathbf{0} \end{bmatrix} = \begin{bmatrix} \alpha I & \mathbf{0} \\ -2\sqrt{I-W} & \beta I \end{bmatrix}\begin{bmatrix} \mathbf{x}^{k+1} \\ \mathbf{s}^{k+1} \end{bmatrix}.$$

• It is equivalent to
$$\alpha \mathbf{x}^k - \sqrt{I-W}\,\mathbf{s}^k - \nabla f(\mathbf{x}^k) = \alpha \mathbf{x}^{k+1},$$
$$-\sqrt{I-W}\,\mathbf{x}^k + \beta \mathbf{s}^k = -2\sqrt{I-W}\,\mathbf{x}^{k+1} + \beta \mathbf{s}^{k+1}.$$

• For simplicity, let $\mathbf{t} = \sqrt{I-W}\,\mathbf{s}$; multiplying the second equation by $\sqrt{I-W}$ gives
$$\alpha \mathbf{x}^k - \mathbf{t}^k - \nabla f(\mathbf{x}^k) = \alpha \mathbf{x}^{k+1},$$
$$-(I-W)\mathbf{x}^k + \beta \mathbf{t}^k = -2(I-W)\mathbf{x}^{k+1} + \beta \mathbf{t}^{k+1}.$$
EXact firsT-ordeR Algorithm (EXTRA)

• From the previous slide:
$$\alpha \mathbf{x}^k - \mathbf{t}^k - \nabla f(\mathbf{x}^k) = \alpha \mathbf{x}^{k+1},$$
$$-(I-W)\mathbf{x}^k + \beta \mathbf{t}^k = -2(I-W)\mathbf{x}^{k+1} + \beta \mathbf{t}^{k+1}.$$

• Eliminate $\mathbf{t}$: the second equation gives $\mathbf{t}^k = \mathbf{t}^{k-1} + \tfrac{1}{\beta}(I-W)(2\mathbf{x}^k - \mathbf{x}^{k-1})$, and the first at step $k-1$ gives $\mathbf{t}^{k-1} = \alpha\mathbf{x}^{k-1} - \nabla f(\mathbf{x}^{k-1}) - \alpha\mathbf{x}^k$. Hence
$$\begin{aligned}
\alpha \mathbf{x}^{k+1} &= \alpha \mathbf{x}^k - \mathbf{t}^k - \nabla f(\mathbf{x}^k) \\
&= \alpha \mathbf{x}^k - \tfrac{1}{\beta}(I-W)(2\mathbf{x}^k - \mathbf{x}^{k-1}) - \mathbf{t}^{k-1} - \nabla f(\mathbf{x}^k) \\
&= \alpha \mathbf{x}^k - \tfrac{1}{\beta}(I-W)(2\mathbf{x}^k - \mathbf{x}^{k-1}) + \alpha \mathbf{x}^k + \nabla f(\mathbf{x}^{k-1}) - \alpha \mathbf{x}^{k-1} - \nabla f(\mathbf{x}^k) \\
&= \big(\alpha I - \tfrac{1}{\beta}(I-W)\big)(2\mathbf{x}^k - \mathbf{x}^{k-1}) + \nabla f(\mathbf{x}^{k-1}) - \nabla f(\mathbf{x}^k).
\end{aligned}$$

• Let $\alpha\beta = 2$, and we obtain EXTRA (Shi et al. '15):
$$\mathbf{x}^{k+1} = \frac{I+W}{2}(2\mathbf{x}^k - \mathbf{x}^{k-1}) - \frac{1}{\alpha}\big(\nabla f(\mathbf{x}^k) - \nabla f(\mathbf{x}^{k-1})\big).$$
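A minimal NumPy sketch of the EXTRA recursion on a toy least-squares problem (the ring-graph $W$, the quadratic $f_i$, and the stepsize choice are assumptions for the demo; the first iterate follows the initialization $\mathbf{x}^1 = \mathbf{x}^0 - \tfrac{1}{\alpha}\nabla f(\mathbf{x}^0)$ stated on a later slide):

```python
import numpy as np

np.random.seed(0)
n, d = 5, 3
W = np.zeros((n, n))                 # ring-graph W (assumption)
for i in range(n):
    W[i, i], W[i, (i - 1) % n], W[i, (i + 1) % n] = 0.5, 0.25, 0.25

A = np.random.randn(n, d, d)
b = np.random.randn(n, d)
def grad(X):
    return np.stack([A[i].T @ (A[i] @ X[i] - b[i]) for i in range(n)])

L = max(np.linalg.norm(A[i].T @ A[i], 2) for i in range(n))
ia = 1.0 / L                         # stepsize 1/alpha, within 1/alpha < 2/L

Wt = 0.5 * (np.eye(n) + W)           # (I + W) / 2
X_prev = np.zeros((n, d))
G_prev = grad(X_prev)
X = X_prev - ia * G_prev             # x^1 = x^0 - (1/alpha) grad f(x^0)
for k in range(1000):
    G = grad(X)
    X_next = Wt @ (2 * X - X_prev) - ia * (G - G_prev)   # EXTRA update
    X_prev, X, G_prev = X, X_next, G
print("consensus violation:", np.max(np.abs(X - X.mean(axis=0))))
```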
convergence conditions for EXTRA: I

EXTRA: $\mathbf{x}^{k+1} = \frac{I+W}{2}(2\mathbf{x}^k - \mathbf{x}^{k-1}) - \frac{1}{\alpha}(\nabla f(\mathbf{x}^k) - \nabla f(\mathbf{x}^{k-1}))$

• If $f = 0$:
$$\begin{bmatrix} \mathbf{x}^{k+1} \\ \mathbf{x}^k \end{bmatrix} = \begin{bmatrix} I+W & -\frac{I+W}{2} \\ I & \mathbf{0} \end{bmatrix} \begin{bmatrix} \mathbf{x}^k \\ \mathbf{x}^{k-1} \end{bmatrix}.$$

• Let $I+W = U\Sigma U^\top$. Then
$$\begin{bmatrix} I+W & -\frac{I+W}{2} \\ I & \mathbf{0} \end{bmatrix} = \begin{bmatrix} U & \\ & U \end{bmatrix} \begin{bmatrix} \Sigma & -\frac{\Sigma}{2} \\ I & \mathbf{0} \end{bmatrix} \begin{bmatrix} U^\top & \\ & U^\top \end{bmatrix}.$$

• The iteration becomes
$$\begin{bmatrix} U^\top\mathbf{x}^{k+1} \\ U^\top\mathbf{x}^k \end{bmatrix} = \begin{bmatrix} \Sigma & -\frac{\Sigma}{2} \\ I & \mathbf{0} \end{bmatrix} \begin{bmatrix} U^\top\mathbf{x}^k \\ U^\top\mathbf{x}^{k-1} \end{bmatrix}.$$

• The condition on $W$ is $-2/3 < \lambda(\Sigma) = \lambda(W+I) \le 2$, which is $-5/3 < \lambda(W) \le 1$.
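Since the block iteration decouples per eigenvalue $\sigma$ of $I+W$ into the $2\times 2$ companion matrix $\begin{bmatrix} \sigma & -\sigma/2 \\ 1 & 0 \end{bmatrix}$, the stated range can be verified numerically; a small check of my own (not from the slides):

```python
import numpy as np

def rho(sigma):
    # Spectral radius of the per-eigenvalue companion matrix for EXTRA, f = 0.
    M = np.array([[sigma, -sigma / 2.0], [1.0, 0.0]])
    return max(abs(np.linalg.eigvals(M)))

for sigma in [-0.8, -2.0 / 3.0, 0.0, 1.0, 1.9, 2.0, 2.1]:
    print(f"sigma = {sigma:+.3f}  rho = {rho(sigma):.4f}")
# rho < 1 exactly for -2/3 < sigma < 2; sigma = 2 is the consensus
# eigenvalue of I + W and yields a (benign) double eigenvalue at 1,
# while sigma outside the stated range gives rho > 1.
```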
convergence conditions for EXTRA: II

EXTRA: $\mathbf{x}^{k+1} = \frac{I+W}{2}(2\mathbf{x}^k - \mathbf{x}^{k-1}) - \frac{1}{\alpha}(\nabla f(\mathbf{x}^k) - \nabla f(\mathbf{x}^{k-1}))$

• If $\nabla f(\mathbf{x}^k) = \mathbf{x}^k - \mathbf{b}$:
$$\begin{bmatrix} \mathbf{x}^{k+1} \\ \mathbf{x}^k \end{bmatrix} = \begin{bmatrix} I+W - \frac{1}{\alpha}I & -\frac{I+W}{2} + \frac{1}{\alpha}I \\ I & \mathbf{0} \end{bmatrix} \begin{bmatrix} \mathbf{x}^k \\ \mathbf{x}^{k-1} \end{bmatrix}.$$

• Let $I+W = U\Sigma U^\top$. Then
$$\begin{bmatrix} I+W - \frac{1}{\alpha}I & -\frac{I+W}{2} + \frac{1}{\alpha}I \\ I & \mathbf{0} \end{bmatrix} = \begin{bmatrix} U & \\ & U \end{bmatrix} \begin{bmatrix} \Sigma - \frac{1}{\alpha}I & -\frac{\Sigma}{2} + \frac{1}{\alpha}I \\ I & \mathbf{0} \end{bmatrix} \begin{bmatrix} U^\top & \\ & U^\top \end{bmatrix}.$$

• The condition on $W$ is $4/(3\alpha) - 2/3 < \lambda(\Sigma) = \lambda(W+I) \le 2$, which is $4/(3\alpha) - 5/3 < \lambda(W) \le 1$. In addition, the stepsize satisfies $1/\alpha < 2$.
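The same per-eigenvalue check works here with the companion matrix $\begin{bmatrix} \sigma - 1/\alpha & -\sigma/2 + 1/\alpha \\ 1 & 0 \end{bmatrix}$; a verification sketch of my own, with $1/\alpha = 1.5$ as an example:

```python
import numpy as np

def rho(sigma, ia):
    # Companion matrix for EXTRA on grad f(x) = x - b; ia is the stepsize 1/alpha.
    M = np.array([[sigma - ia, -sigma / 2.0 + ia], [1.0, 0.0]])
    return max(abs(np.linalg.eigvals(M)))

ia = 1.5                                   # stepsize 1/alpha, needs 1/alpha < 2
lo = 4.0 * ia / 3.0 - 2.0 / 3.0            # stated lower bound on lambda(I + W)
for sigma in [lo - 0.05, lo + 0.05, 1.9, 2.0]:
    print(f"sigma = {sigma:+.3f}  rho = {rho(sigma, ia):.4f}")
# Just below the bound rho > 1 (divergence); just above it rho < 1.
# sigma = 2 is again the marginal consensus mode with an eigenvalue at 1.
```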
conditions for general EXTRA

EXTRA: $\mathbf{x}^{k+1} = \frac{I+W}{2}(2\mathbf{x}^k - \mathbf{x}^{k-1}) - \frac{1}{\alpha}(\nabla f(\mathbf{x}^k) - \nabla f(\mathbf{x}^{k-1})).$

Initialization (defining $\mathbf{x}^1$ from $\mathbf{x}^0$): $\mathbf{x}^1 = \mathbf{x}^0 - \frac{1}{\alpha}\nabla f(\mathbf{x}^0)$.

Convergence condition (Li-Yan '17): $4/(3\alpha) - 5/3 < \lambda(W) \le 1$ and $1/\alpha < 2/L$.

Linear convergence condition:
• $f(\mathbf{x})$ is strongly convex (Li-Yan '17).
• A weaker condition on $f(\mathbf{x})$, but more restrictive conditions on both parameters (Shi et al. '15).
can the stepsize be as large as in centralized methods?
forward-backward

• Apply forward-backward to the same KKT system as before, but with a different splitting:
$$\begin{bmatrix} \alpha I & \mathbf{0} \\ \mathbf{0} & \beta I - \frac{1}{\alpha}(I-W) \end{bmatrix}\begin{bmatrix} \mathbf{x}^k \\ \mathbf{s}^k \end{bmatrix} - \begin{bmatrix} \nabla f(\mathbf{x}^k) \\ \mathbf{0} \end{bmatrix} = \begin{bmatrix} \alpha I & \mathbf{0} \\ \mathbf{0} & \beta I - \frac{1}{\alpha}(I-W) \end{bmatrix}\begin{bmatrix} \mathbf{x}^{k+1} \\ \mathbf{s}^{k+1} \end{bmatrix} + \begin{bmatrix} \mathbf{0} & \sqrt{I-W} \\ -\sqrt{I-W} & \mathbf{0} \end{bmatrix}\begin{bmatrix} \mathbf{x}^{k+1} \\ \mathbf{s}^{k+1} \end{bmatrix}.$$
• Combine the right-hand side:
$$\begin{bmatrix} \alpha I & \mathbf{0} \\ \mathbf{0} & \beta I - \frac{1}{\alpha}(I-W) \end{bmatrix}\begin{bmatrix} \mathbf{x}^k \\ \mathbf{s}^k \end{bmatrix} - \begin{bmatrix} \nabla f(\mathbf{x}^k) \\ \mathbf{0} \end{bmatrix} = \begin{bmatrix} \alpha I & \sqrt{I-W} \\ -\sqrt{I-W} & \beta I - \frac{1}{\alpha}(I-W) \end{bmatrix}\begin{bmatrix} \mathbf{x}^{k+1} \\ \mathbf{s}^{k+1} \end{bmatrix}.$$

• Apply Gaussian elimination (add $\frac{1}{\alpha}\sqrt{I-W}$ times the first row to the second):
$$\begin{bmatrix} \alpha I & \mathbf{0} \\ \sqrt{I-W} & \beta I - \frac{1}{\alpha}(I-W) \end{bmatrix}\begin{bmatrix} \mathbf{x}^k \\ \mathbf{s}^k \end{bmatrix} - \begin{bmatrix} \nabla f(\mathbf{x}^k) \\ \frac{1}{\alpha}\sqrt{I-W}\,\nabla f(\mathbf{x}^k) \end{bmatrix} = \begin{bmatrix} \alpha I & \sqrt{I-W} \\ \mathbf{0} & \beta I \end{bmatrix}\begin{bmatrix} \mathbf{x}^{k+1} \\ \mathbf{s}^{k+1} \end{bmatrix}.$$

• It is equivalent to
$$\alpha \mathbf{x}^k - \nabla f(\mathbf{x}^k) - \sqrt{I-W}\,\mathbf{s}^{k+1} = \alpha \mathbf{x}^{k+1},$$
$$\sqrt{I-W}\,\mathbf{x}^k + \beta\Big(I - \frac{1}{\alpha\beta}(I-W)\Big)\mathbf{s}^k - \frac{1}{\alpha}\sqrt{I-W}\,\nabla f(\mathbf{x}^k) = \beta \mathbf{s}^{k+1}.$$
NIDS (Li-Shi-Yan '17)

Let $\mathbf{t} = \sqrt{I-W}\,\mathbf{s}$; multiplying the second equation by $\sqrt{I-W}$ gives
$$\alpha \mathbf{x}^k - \nabla f(\mathbf{x}^k) - \mathbf{t}^{k+1} = \alpha \mathbf{x}^{k+1},$$
$$(I-W)\mathbf{x}^k + \beta\Big(I - \frac{1}{\alpha\beta}(I-W)\Big)\mathbf{t}^k - \frac{1}{\alpha}(I-W)\nabla f(\mathbf{x}^k) = \beta \mathbf{t}^{k+1}.$$

We have
$$\begin{aligned}
\alpha \mathbf{x}^{k+1} &= \alpha \mathbf{x}^k - \nabla f(\mathbf{x}^k) - \mathbf{t}^{k+1} \\
&= \alpha \mathbf{x}^k - \nabla f(\mathbf{x}^k) - \Big(I - \frac{1}{\alpha\beta}(I-W)\Big)\mathbf{t}^k - \frac{1}{\beta}(I-W)\mathbf{x}^k + \frac{1}{\alpha\beta}(I-W)\nabla f(\mathbf{x}^k) \\
&= \Big(I - \frac{1}{\alpha\beta}(I-W)\Big)\big(\alpha \mathbf{x}^k - \mathbf{t}^k - \nabla f(\mathbf{x}^k)\big) \\
&= \Big(I - \frac{1}{\alpha\beta}(I-W)\Big)\big(\alpha \mathbf{x}^k + \alpha \mathbf{x}^k - \alpha \mathbf{x}^{k-1} + \nabla f(\mathbf{x}^{k-1}) - \nabla f(\mathbf{x}^k)\big).
\end{aligned}$$

Thus
$$\mathbf{x}^{k+1} = \Big(I - \frac{1}{\alpha\beta}(I-W)\Big)\Big(2\mathbf{x}^k - \mathbf{x}^{k-1} - \frac{1}{\alpha}\big(\nabla f(\mathbf{x}^k) - \nabla f(\mathbf{x}^{k-1})\big)\Big).$$
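A minimal NumPy sketch of the NIDS recursion with $\alpha\beta = 2$ (the toy least-squares problem and ring-graph $W$ are assumptions; note the nearly network-independent stepsize $1.9/L$, anticipating the conditions on the next slides):

```python
import numpy as np

np.random.seed(0)
n, d = 5, 3
W = np.zeros((n, n))                 # ring-graph W (assumption)
for i in range(n):
    W[i, i], W[i, (i - 1) % n], W[i, (i + 1) % n] = 0.5, 0.25, 0.25

A = np.random.randn(n, d, d)
b = np.random.randn(n, d)
def grad(X):
    return np.stack([A[i].T @ (A[i] @ X[i] - b[i]) for i in range(n)])

L = max(np.linalg.norm(A[i].T @ A[i], 2) for i in range(n))
ia = 1.9 / L                         # stepsize 1/alpha, any value < 2/L

Wt = 0.5 * (np.eye(n) + W)
X_prev = np.zeros((n, d))
G_prev = grad(X_prev)
X = X_prev - ia * G_prev             # x^1 = x^0 - (1/alpha) grad f(x^0)
for k in range(1000):
    G = grad(X)
    # NIDS: the gradient correction is mixed through W as well.
    X_next = Wt @ (2 * X - X_prev - ia * (G - G_prev))
    X_prev, X, G_prev = X, X_next, G
print("consensus violation:", np.max(np.abs(X - X.mean(axis=0))))
```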
convergence conditions for NIDS

NIDS (with $\alpha\beta = 2$): $\mathbf{x}^{k+1} = \frac{I+W}{2}\big(2\mathbf{x}^k - \mathbf{x}^{k-1} - \frac{1}{\alpha}(\nabla f(\mathbf{x}^k) - \nabla f(\mathbf{x}^{k-1}))\big)$

• If $f = 0$ (same as EXTRA): the condition on $W$ is $-5/3 < \lambda(W) \le 1$.

• If $\nabla f(\mathbf{x}^k) = \mathbf{x}^k - \mathbf{b}$:
$$\begin{bmatrix} \mathbf{x}^{k+1} \\ \mathbf{x}^k \end{bmatrix} = \begin{bmatrix} \big(2 - \frac{1}{\alpha}\big)\frac{I+W}{2} & -\big(1 - \frac{1}{\alpha}\big)\frac{I+W}{2} \\ I & \mathbf{0} \end{bmatrix}\begin{bmatrix} \mathbf{x}^k \\ \mathbf{x}^{k-1} \end{bmatrix}.$$

• Let $I+W = U\Sigma U^\top$:
$$\begin{bmatrix} U^\top\mathbf{x}^{k+1} \\ U^\top\mathbf{x}^k \end{bmatrix} = \begin{bmatrix} \big(2 - \frac{1}{\alpha}\big)\frac{\Sigma}{2} & -\big(1 - \frac{1}{\alpha}\big)\frac{\Sigma}{2} \\ I & \mathbf{0} \end{bmatrix}\begin{bmatrix} U^\top\mathbf{x}^k \\ U^\top\mathbf{x}^{k-1} \end{bmatrix}.$$

• Therefore, one sufficient condition is $-5/3 < \lambda(W) \le 1$ and $1/\alpha < 2$.
conditions of NIDS for general smooth functions

NIDS (with $\alpha\beta = 2$): $\mathbf{x}^{k+1} = \frac{I+W}{2}\big(2\mathbf{x}^k - \mathbf{x}^{k-1} - \frac{1}{\alpha}(\nabla f(\mathbf{x}^k) - \nabla f(\mathbf{x}^{k-1}))\big)$

Initialization: $\mathbf{x}^1 = \mathbf{x}^0 - \frac{1}{\alpha}\nabla f(\mathbf{x}^0)$.

Convergence condition (Li-Yan '17): $-5/3 < \lambda(W) \le 1$ and $1/\alpha < 2/L$.

Linear convergence condition:
• $f(\mathbf{x})$ is strongly convex and $-1 < \lambda(W) \le 1$ (Li-Shi-Yan '17); the rate is
$$O\Big(\max\Big\{1 - \frac{\mu}{L},\ 1 - \frac{1-\lambda_2(W)}{1-\lambda_n(W)}\Big\}\Big).$$
NIDS vs EXTRA

EXTRA: $\mathbf{x}^{k+1} = \frac{I+W}{2}(2\mathbf{x}^k - \mathbf{x}^{k-1}) - \frac{1}{\alpha}(\nabla f(\mathbf{x}^k) - \nabla f(\mathbf{x}^{k-1}))$

NIDS: $\mathbf{x}^{k+1} = \frac{I+W}{2}\big(2\mathbf{x}^k - \mathbf{x}^{k-1} - \frac{1}{\alpha}(\nabla f(\mathbf{x}^k) - \nabla f(\mathbf{x}^{k-1}))\big)$

• The only difference is the data to be communicated: NIDS also mixes the gradient-correction term through $W$.
• But NIDS admits a larger range of parameters than EXTRA (illustrated in the sketch below).
• NIDS is faster than EXTRA.
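A small numerical illustration of the parameter-range claim (my own check, not from the slides), on the quadratic test $\nabla f(\mathbf{x}) = \mathbf{x} - \mathbf{b}$ used earlier: per eigenvalue $\sigma$ of $I+W$, both methods reduce to $2\times 2$ companion matrices, and with the aggressive stepsize $1/\alpha = 1.9$ and $\lambda(W) = -0.7$ only NIDS remains stable:

```python
import numpy as np

def rho(M):
    return max(abs(np.linalg.eigvals(M)))

def extra_mode(sigma, ia):   # per-eigenvalue matrix for EXTRA, grad f(x) = x - b
    return np.array([[sigma - ia, -sigma / 2.0 + ia], [1.0, 0.0]])

def nids_mode(sigma, ia):    # per-eigenvalue matrix for NIDS, grad f(x) = x - b
    return np.array([[sigma / 2.0 * (2.0 - ia), -sigma / 2.0 * (1.0 - ia)],
                     [1.0, 0.0]])

ia = 1.9          # stepsize 1/alpha, close to the centralized limit 2
sigma = 0.3       # an eigenvalue of I + W, i.e., lambda(W) = -0.7
print("EXTRA rho:", rho(extra_mode(sigma, ia)))   # > 1: EXTRA diverges here
print("NIDS  rho:", rho(nids_mode(sigma, ia)))    # < 1: NIDS still converges
```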
advantages of NIDS

NIDS: $\mathbf{x}^{k+1} = \frac{I+W}{2}\big(2\mathbf{x}^k - \mathbf{x}^{k-1} - \frac{1}{\alpha}(\nabla f(\mathbf{x}^k) - \nabla f(\mathbf{x}^{k-1}))\big)$

• The stepsize is large and does not depend on the network topology: $1/\alpha < 2/L$.
• Individual stepsizes can be included: $1/\alpha_i < 2/L_i$ (see the sketch after this list).
• The linear convergence rate separates the contributions of the functions and of the network:
$$O\Big(\max\Big\{1 - \frac{\mu}{L},\ 1 - \frac{1-\lambda_2(W)}{1-\lambda_n(W)}\Big\}\Big).$$
It matches the rates of gradient descent and of decentralized averaging, without acceleration.
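The slides do not spell out the individual-stepsize variant. The sketch below shows one natural form patterned on the uniform-stepsize recursion, with $D = \operatorname{diag}(1/\alpha_1, \ldots, 1/\alpha_n)$ and the mixing term scaled by the largest stepsize so that it reduces to $(I+W)/2$ when all stepsizes agree; this exact scaling is my assumption, not a formula quoted from the NIDS paper.

```python
import numpy as np

np.random.seed(0)
n, d = 5, 3
W = np.zeros((n, n))                         # ring-graph W (assumption)
for i in range(n):
    W[i, i], W[i, (i - 1) % n], W[i, (i + 1) % n] = 0.5, 0.25, 0.25

A = np.random.randn(n, d, d)
b = np.random.randn(n, d)
def grad(X):
    return np.stack([A[i].T @ (A[i] @ X[i] - b[i]) for i in range(n)])

Li = np.array([np.linalg.norm(A[i].T @ A[i], 2) for i in range(n)])
eta = 1.9 / Li                               # individual stepsizes 1/alpha_i < 2/L_i
D = np.diag(eta)
c = 1.0 / (2.0 * eta.max())                  # scaling assumption (see lead-in)
Wt = np.eye(n) - c * D @ (np.eye(n) - W)     # equals (I + W)/2 if eta is uniform

X_prev = np.zeros((n, d))
G_prev = grad(X_prev)
X = X_prev - D @ G_prev                      # x^1 = x^0 - D grad f(x^0)
for k in range(2000):
    G = grad(X)
    X_next = Wt @ (2 * X - X_prev - D @ (G - G_prev))
    X_prev, X, G_prev = X, X_next, G
print("consensus violation:", np.max(np.abs(X - X.mean(axis=0))))
```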
D$^2$: stochastic NIDS (Huang et al. '18)

NIDS: $\mathbf{x}^{k+1} = \frac{I+W}{2}\big(2\mathbf{x}^k - \mathbf{x}^{k-1} - \frac{1}{\alpha}(\nabla f(\mathbf{x}^k) - \nabla f(\mathbf{x}^{k-1}))\big)$

Stochastic NIDS (D$^2$: Decentralized Training over Decentralized Data):
$$\mathbf{x}^{k+1} = \frac{I+W}{2}\Big(2\mathbf{x}^k - \mathbf{x}^{k-1} - \frac{1}{\alpha}\big(\nabla f(\mathbf{x}^k; \xi^k) - \nabla f(\mathbf{x}^{k-1}; \xi^{k-1})\big)\Big)$$

• $\nabla f(\mathbf{x}^k; \xi^k)$ is a stochastic gradient obtained by sampling $\xi^k$ from the distribution $\mathcal{D}$.
• Unbiasedness: $\mathbb{E}_{\xi\sim\mathcal{D}}\, \nabla f(\mathbf{x}; \xi) = \nabla f(\mathbf{x})$ for all $\mathbf{x}$.
• Bounded variance: $\mathbb{E}_{\xi\sim\mathcal{D}}\, \|\nabla f(\mathbf{x}; \xi) - \nabla f(\mathbf{x})\|^2 \le \sigma^2$ for all $\mathbf{x}$.
• Convergence result: if the stepsize is small enough (of order $(c + \sqrt{T/n})^{-1}$), the convergence rate is
$$O\Big(\frac{\sigma}{\sqrt{nT}} + \frac{1}{T}\Big).$$
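A minimal stochastic sketch of the D$^2$ update (the toy per-agent least-squares data and the one-sample gradient oracle are assumptions for the demo):

```python
import numpy as np

np.random.seed(0)
n, d, m = 5, 3, 100                  # n agents, each holding m local samples
W = np.zeros((n, n))                 # ring-graph W (assumption)
for i in range(n):
    W[i, i], W[i, (i - 1) % n], W[i, (i + 1) % n] = 0.5, 0.25, 0.25
Wt = 0.5 * (np.eye(n) + W)

# Decentralized data: each agent owns its own samples (a_ij, b_ij).
A = np.random.randn(n, m, d)
b = np.random.randn(n, m)

def stoch_grad(X, idx):
    # One-sample stochastic gradient per agent: a (a^T x_i - b).
    return np.stack([A[i, idx[i]] * (A[i, idx[i]] @ X[i] - b[i, idx[i]])
                     for i in range(n)])

ia = 0.01                            # small constant stepsize, as the rate requires
idx = np.random.randint(m, size=n)
X_prev = np.zeros((n, d))
G_prev = stoch_grad(X_prev, idx)
X = X_prev - ia * G_prev
for k in range(5000):
    idx = np.random.randint(m, size=n)        # fresh samples xi^k
    G = stoch_grad(X, idx)
    X_next = Wt @ (2 * X - X_prev - ia * (G - G_prev))
    X_prev, X, G_prev = X, X_next, G

# Gradient noise keeps the iterates in a neighborhood of consensus.
print("consensus violation:", np.max(np.abs(X - X.mean(axis=0))))
```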
numerical experiments
compared algorithms

• NIDS
• EXTRA/PG-EXTRA
• DIGing-ATC (Nedic et al. '16), shown in the sketch after this list:
$$\mathbf{x}^{k+1} = W(\mathbf{x}^k - \alpha \mathbf{y}^k), \qquad \mathbf{y}^{k+1} = W\big(\mathbf{y}^k + \nabla f(\mathbf{x}^{k+1}) - \nabla f(\mathbf{x}^k)\big).$$
• accelerated distributed Nesterov gradient descent (Acc-DNGD-SC) (Qu-Li '17)
• dual friendly optimal algorithm (OA) for distributed optimization (Uribe et al. '17)
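For completeness, a NumPy sketch of the DIGing-ATC update above (the toy problem, ring-graph $W$, the stepsize, and the common initialization $\mathbf{y}^0 = \nabla f(\mathbf{x}^0)$ are assumptions, not taken from the slide):

```python
import numpy as np

np.random.seed(0)
n, d = 5, 3
W = np.zeros((n, n))                 # ring-graph W (assumption)
for i in range(n):
    W[i, i], W[i, (i - 1) % n], W[i, (i + 1) % n] = 0.5, 0.25, 0.25

A = np.random.randn(n, d, d)
b = np.random.randn(n, d)
def grad(X):
    return np.stack([A[i].T @ (A[i] @ X[i] - b[i]) for i in range(n)])

alpha = 0.01                         # DIGing needs a sufficiently small stepsize
X = np.zeros((n, d))
G = grad(X)
Y = G.copy()                         # gradient tracker, y^0 = grad f(x^0)
for k in range(3000):
    X_next = W @ (X - alpha * Y)     # x^{k+1} = W (x^k - alpha y^k)
    G_next = grad(X_next)
    Y = W @ (Y + G_next - G)         # y^{k+1} = W (y^k + grad^{k+1} - grad^k)
    X, G = X_next, G_next
print("consensus violation:", np.max(np.abs(X - X.mean(axis=0))))
```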
strongly convex: same stepsize

[Figure: error (semilog scale, 1e-2 down to 1e-14) vs. number of iterations (0 to 90).]
strongly convex: adaptive stepsize

[Figure: error (semilog scale, 1e-2 down to 1e-14) vs. number of iterations (0 to 140).]
linear convergence rate bottleneck

[Figure: error (semilog scale, 1e5 down to 1e-20) vs. number of iterations (0 to about 450).]
nonsmooth functions

[Figure: error (semilog scale, 1e2 down to 1e-8) vs. number of iterations (0 to 2e4); curves: NIDS-1/L, NIDS-1.5/L, NIDS-1.9/L, PGEXTRA-1/L, PGEXTRA-1.2/L, PGEXTRA-1.3/L, PGEXTRA-1.4/L.]
stochastic case: shuffled

[Figure: loss vs. number of epochs (0 to 100) for two tasks, (a) Transfer Learning and (b) LeNet; curves: Decentralized, D$^2$, Centralized.]