
Distributed Consensus Optimization
Ming Yan
Michigan State University, CMSE/Mathematics
September 14, 2018

why we need decentralized optimization?

• Decentralized vehicles/aircraft


decentralized gradient descent

Consider the problem
$$\operatorname*{minimize}_x\; f(x) = \sum_{i=1}^n f_i(x_i), \quad \text{s.t.}\; x_1 = x_2 = \cdots = x_n.$$
Decentralized gradient descent (DGD) (Nedic-Ozdaglar '09):
$$x^{k+1} = W x^k - \lambda \nabla f(x^k).$$
• Rewrite it as $x^{k+1} = x^k - \bigl((I-W)x^k + \lambda \nabla f(x^k)\bigr)$: DGD is gradient descent with stepsize one applied to
$$\operatorname*{minimize}_x\; \tfrac{1}{2}\bigl\|\sqrt{I-W}\,x\bigr\|_2^2 + \lambda f(x).$$
• Its solution is generally not consensual, i.e., $W x^* = x^* + \lambda \nabla f(x^*) \ne x^*$.
• Remedy: diminishing stepsize, i.e., decreasing $\lambda$ during the iterations (sketched below).
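A minimal NumPy sketch of the DGD iteration with a diminishing stepsize; the function name `dgd` and the $\lambda_0/\sqrt{k+1}$ schedule are illustrative assumptions, not from the slides:

```python
import numpy as np

def dgd(grads, W, x0, n_iters=500, lam0=1.0):
    """Decentralized gradient descent (DGD) with diminishing stepsize.

    grads : list of callables; grads[i](xi) returns the gradient of f_i at xi
    W     : (n, n) symmetric mixing matrix whose rows sum to one
    x0    : (n, d) array; row i is agent i's initial point
    """
    x = x0.copy()
    for k in range(n_iters):
        lam = lam0 / np.sqrt(k + 1)   # diminishing stepsize (assumed schedule)
        g = np.stack([grads[i](x[i]) for i in range(len(grads))])
        x = W @ x - lam * g           # x^{k+1} = W x^k - lam * grad f(x^k)
    return x
```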

constant stepsize?

• alternating direction method of multipliers (ADMM) (Shi et al. '14, Chang-Hong-Wang '15, Hong-Chang '17)
• multi-consensus inner loops (Chen-Ozdaglar '12, Jakovetic-Xavier-Moura '14)
• EXTRA/PG-EXTRA (Shi et al. '15)

decentralized smooth optimization

Problem:
$$\operatorname*{minimize}_x\; f(x), \quad \text{s.t.}\; \sqrt{I-W}\,x = 0.$$
• Lagrangian function: $f(x) + \langle \sqrt{I-W}\,x,\, s\rangle$, where $s$ is the Lagrange multiplier.
• Optimality condition (KKT):
$$0 = \nabla f(x^*) + \sqrt{I-W}\,s^*, \qquad 0 = -\sqrt{I-W}\,x^*.$$
• It is the same as
$$-\begin{bmatrix}\nabla f(x^*)\\ 0\end{bmatrix} = \begin{bmatrix}0 & \sqrt{I-W}\\ -\sqrt{I-W} & 0\end{bmatrix}\begin{bmatrix}x^*\\ s^*\end{bmatrix}.$$
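The square root $\sqrt{I-W}$ is well defined because $W$ is symmetric with $\lambda(W) \le 1$, so $I-W$ is positive semidefinite. A small sketch of how one could form it numerically; the clipping guard is an assumption to absorb round-off:

```python
import numpy as np

def sqrt_I_minus_W(W):
    """Matrix square root of I - W via the eigendecomposition of a
    symmetric mixing matrix W (eigenvalues at most 1, so I - W >= 0)."""
    d, U = np.linalg.eigh(np.eye(W.shape[0]) - W)
    d = np.clip(d, 0.0, None)       # absorb tiny negative round-off
    return U @ np.diag(np.sqrt(d)) @ U.T
```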

forward-backward

• The KKT system:
$$-\begin{bmatrix}\nabla f(x^*)\\ 0\end{bmatrix} = \begin{bmatrix}0 & \sqrt{I-W}\\ -\sqrt{I-W} & 0\end{bmatrix}\begin{bmatrix}x^*\\ s^*\end{bmatrix}.$$
• Using forward-backward on the KKT form:
$$\begin{bmatrix}\alpha I & -\sqrt{I-W}\\ -\sqrt{I-W} & \beta I\end{bmatrix}\begin{bmatrix}x^k\\ s^k\end{bmatrix} - \begin{bmatrix}\nabla f(x^k)\\ 0\end{bmatrix} = \begin{bmatrix}\alpha I & -\sqrt{I-W}\\ -\sqrt{I-W} & \beta I\end{bmatrix}\begin{bmatrix}x^{k+1}\\ s^{k+1}\end{bmatrix} + \begin{bmatrix}0 & \sqrt{I-W}\\ -\sqrt{I-W} & 0\end{bmatrix}\begin{bmatrix}x^{k+1}\\ s^{k+1}\end{bmatrix}.$$
• It reduces to
$$\begin{bmatrix}\alpha I & -\sqrt{I-W}\\ -\sqrt{I-W} & \beta I\end{bmatrix}\begin{bmatrix}x^k\\ s^k\end{bmatrix} - \begin{bmatrix}\nabla f(x^k)\\ 0\end{bmatrix} = \begin{bmatrix}\alpha I & 0\\ -2\sqrt{I-W} & \beta I\end{bmatrix}\begin{bmatrix}x^{k+1}\\ s^{k+1}\end{bmatrix}.$$
• It is equivalent to
$$\alpha x^k - \sqrt{I-W}\,s^k - \nabla f(x^k) = \alpha x^{k+1},$$
$$-\sqrt{I-W}\,x^k + \beta s^k = -2\sqrt{I-W}\,x^{k+1} + \beta s^{k+1}.$$
• For simplicity, let $t = \sqrt{I-W}\,s$, and we have
$$\alpha x^k - t^k - \nabla f(x^k) = \alpha x^{k+1},$$
$$-(I-W)x^k + \beta t^k = -2(I-W)x^{k+1} + \beta t^{k+1}.$$
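Rearranged, the pair of updates is explicit in $x^{k+1}$ and then $t^{k+1}$. A minimal sketch of the resulting $(x, t)$ iteration; the zero initialization $t^0 = 0$ and the function name are assumptions:

```python
import numpy as np

def primal_dual_fb(grad_f, W, x0, alpha, beta, n_iters=500):
    """Forward-backward iteration in the (x, t) variables:
        x^{k+1} = x^k - (1/alpha) (t^k + grad f(x^k)),
        t^{k+1} = t^k + (1/beta) (I - W) (2 x^{k+1} - x^k).
    grad_f(x) returns the stacked gradients (n, d) for the stacked iterates x.
    """
    I_minus_W = np.eye(W.shape[0]) - W
    x = x0.copy()
    t = np.zeros_like(x0)              # assumed initialization t^0 = 0
    for _ in range(n_iters):
        x_new = x - (t + grad_f(x)) / alpha
        t = t + I_minus_W @ (2 * x_new - x) / beta
        x = x_new
    return x
```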

EXact firsT-ordeR Algorithm (EXTRA)

• From the previous slide:
$$\alpha x^k - t^k - \nabla f(x^k) = \alpha x^{k+1},$$
$$-(I-W)x^k + \beta t^k = -2(I-W)x^{k+1} + \beta t^{k+1}.$$
• We have
$$\begin{aligned}
\alpha x^{k+1} &= \alpha x^k - t^k - \nabla f(x^k)\\
&= \alpha x^k - \tfrac{1}{\beta}(I-W)(2x^k - x^{k-1}) - t^{k-1} - \nabla f(x^k)\\
&= \alpha x^k - \tfrac{1}{\beta}(I-W)(2x^k - x^{k-1}) + \alpha x^k + \nabla f(x^{k-1}) - \alpha x^{k-1} - \nabla f(x^k)\\
&= \bigl(\alpha I - \tfrac{1}{\beta}(I-W)\bigr)(2x^k - x^{k-1}) + \nabla f(x^{k-1}) - \nabla f(x^k).
\end{aligned}$$
• Let $\alpha\beta = 2$, and we have EXTRA (Shi et al. '15):
$$x^{k+1} = \tfrac{I+W}{2}(2x^k - x^{k-1}) - \tfrac{1}{\alpha}\bigl(\nabla f(x^k) - \nabla f(x^{k-1})\bigr).$$
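A direct NumPy sketch of EXTRA, using the initialization $x^1 = x^0 - \frac{1}{\alpha}\nabla f(x^0)$ stated later in the deck; `alpha_inv` stands for the stepsize $1/\alpha$:

```python
import numpy as np

def extra(grad_f, W, x0, alpha_inv, n_iters=500):
    """EXTRA (Shi et al. '15), a sketch:
        x^1     = x^0 - (1/alpha) grad f(x^0)
        x^{k+1} = (I+W)/2 (2 x^k - x^{k-1})
                  - (1/alpha) (grad f(x^k) - grad f(x^{k-1}))
    """
    half = (np.eye(W.shape[0]) + W) / 2
    x_prev, g_prev = x0.copy(), grad_f(x0)
    x = x_prev - alpha_inv * g_prev
    for _ in range(n_iters - 1):
        g = grad_f(x)
        x_next = half @ (2 * x - x_prev) - alpha_inv * (g - g_prev)
        x_prev, g_prev, x = x, g, x_next
    return x
```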

convergence conditions for EXTRA: I

EXTRA: $x^{k+1} = \frac{I+W}{2}(2x^k - x^{k-1}) - \frac{1}{\alpha}\bigl(\nabla f(x^k) - \nabla f(x^{k-1})\bigr)$.

• If $f = 0$:
$$\begin{bmatrix}x^{k+1}\\ x^k\end{bmatrix} = \begin{bmatrix}I+W & -\frac{I+W}{2}\\ I & 0\end{bmatrix}\begin{bmatrix}x^k\\ x^{k-1}\end{bmatrix}.$$
• Let $I+W = U\Sigma U^\top$. Then
$$\begin{bmatrix}I+W & -\frac{I+W}{2}\\ I & 0\end{bmatrix} = \begin{bmatrix}U & 0\\ 0 & U\end{bmatrix}\begin{bmatrix}\Sigma & -\frac{\Sigma}{2}\\ I & 0\end{bmatrix}\begin{bmatrix}U^\top & 0\\ 0 & U^\top\end{bmatrix}.$$
• The iteration becomes
$$\begin{bmatrix}U^\top x^{k+1}\\ U^\top x^k\end{bmatrix} = \begin{bmatrix}\Sigma & -\frac{\Sigma}{2}\\ I & 0\end{bmatrix}\begin{bmatrix}U^\top x^k\\ U^\top x^{k-1}\end{bmatrix}.$$
• The condition for $W$ is $-2/3 < \lambda(\Sigma) = \lambda(W+I) \le 2$, which is $-5/3 < \lambda(W) \le 1$.
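After the change of variables, each eigenvalue $\sigma$ of $I+W$ evolves independently through the $2\times 2$ companion matrix $\begin{bmatrix}\sigma & -\sigma/2\\ 1 & 0\end{bmatrix}$, so the stated range can be checked numerically. A quick scan (an illustration, not a proof):

```python
import numpy as np

# For each eigenvalue sigma of I + W, the f = 0 iteration decouples into
# the 2x2 system [[sigma, -sigma/2], [1, 0]].  Scanning sigma shows the
# spectral radius stays at most 1 only for -2/3 < sigma <= 2,
# i.e. -5/3 < lambda(W) <= 1.
for sigma in np.linspace(-1.0, 2.2, 17):
    M = np.array([[sigma, -sigma / 2.0], [1.0, 0.0]])
    rho = np.abs(np.linalg.eigvals(M)).max()
    print(f"sigma = {sigma:+.2f}   spectral radius = {rho:.3f}")
```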

convergence conditions for EXTRA: II

EXTRA: $x^{k+1} = \frac{I+W}{2}(2x^k - x^{k-1}) - \frac{1}{\alpha}\bigl(\nabla f(x^k) - \nabla f(x^{k-1})\bigr)$.

• If $\nabla f(x^k) = x^k - b$:
$$\begin{bmatrix}x^{k+1}\\ x^k\end{bmatrix} = \begin{bmatrix}I+W-\frac{1}{\alpha}I & -\frac{I+W}{2}+\frac{1}{\alpha}I\\ I & 0\end{bmatrix}\begin{bmatrix}x^k\\ x^{k-1}\end{bmatrix}.$$
• Let $I+W = U\Sigma U^\top$. Then
$$\begin{bmatrix}I+W-\frac{1}{\alpha}I & -\frac{I+W}{2}+\frac{1}{\alpha}I\\ I & 0\end{bmatrix} = \begin{bmatrix}U & 0\\ 0 & U\end{bmatrix}\begin{bmatrix}\Sigma-\frac{1}{\alpha}I & -\frac{\Sigma}{2}+\frac{1}{\alpha}I\\ I & 0\end{bmatrix}\begin{bmatrix}U^\top & 0\\ 0 & U^\top\end{bmatrix}.$$
• The condition for $W$ is $4/(3\alpha) - 2/3 < \lambda(\Sigma) = \lambda(W+I) \le 2$, which is $4/(3\alpha) - 5/3 < \lambda(W) \le 1$. In addition, we have stepsize $1/\alpha < 2$.

conditions for general EXTRA

EXTRA: $x^{k+1} = \frac{I+W}{2}(2x^k - x^{k-1}) - \frac{1}{\alpha}\bigl(\nabla f(x^k) - \nabla f(x^{k-1})\bigr)$.

Initialization: $x^1 = x^0 - \frac{1}{\alpha}\nabla f(x^0)$.

Convergence condition (Li-Yan '17): $4/(3\alpha) - 5/3 < \lambda(W) \le 1$ and $1/\alpha < 2/L$ (a parameter-picking sketch follows below).

Linear convergence condition:
• $f(x)$ is strongly convex (Li-Yan '17).
• a weaker condition on $f(x)$, but more restrictive conditions on both parameters (Shi et al. '15).
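Both conditions bound the stepsize $1/\alpha$ from above, so a usable value can be computed from $W$ and $L$. A heuristic sketch; the function itself and the `margin` safety factor are assumptions, not part of the cited results:

```python
import numpy as np

def extra_stepsize(W, L, margin=0.99):
    """Largest stepsize 1/alpha meeting the EXTRA conditions (heuristic):
       4/(3 alpha) - 5/3 < lambda_min(W)  <=>  1/alpha < (3 lambda_min(W) + 5)/4
       and 1/alpha < 2/L."""
    lam_min = np.linalg.eigvalsh(W).min()
    return margin * min((3 * lam_min + 5) / 4, 2 / L)
```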

large stepsizes, as in centralized methods?

decentralized smooth optimization (recap)

Problem:
$$\operatorname*{minimize}_x\; f(x), \quad \text{s.t.}\; \sqrt{I-W}\,x = 0,$$
with Lagrangian $f(x) + \langle \sqrt{I-W}\,x,\, s\rangle$ ($s$ the Lagrange multiplier) and KKT system
$$-\begin{bmatrix}\nabla f(x^*)\\ 0\end{bmatrix} = \begin{bmatrix}0 & \sqrt{I-W}\\ -\sqrt{I-W} & 0\end{bmatrix}\begin{bmatrix}x^*\\ s^*\end{bmatrix}.$$

forward-backward

• Using forward-backward on the KKT form, now with the metric $\begin{bmatrix}\alpha I & 0\\ 0 & \beta I - \frac{1}{\alpha}(I-W)\end{bmatrix}$:
$$\begin{bmatrix}\alpha I & 0\\ 0 & \beta I - \frac{1}{\alpha}(I-W)\end{bmatrix}\begin{bmatrix}x^k\\ s^k\end{bmatrix} - \begin{bmatrix}\nabla f(x^k)\\ 0\end{bmatrix} = \begin{bmatrix}\alpha I & 0\\ 0 & \beta I - \frac{1}{\alpha}(I-W)\end{bmatrix}\begin{bmatrix}x^{k+1}\\ s^{k+1}\end{bmatrix} + \begin{bmatrix}0 & \sqrt{I-W}\\ -\sqrt{I-W} & 0\end{bmatrix}\begin{bmatrix}x^{k+1}\\ s^{k+1}\end{bmatrix}.$$
• Combine the right-hand side:
$$\begin{bmatrix}\alpha I & 0\\ 0 & \beta I - \frac{1}{\alpha}(I-W)\end{bmatrix}\begin{bmatrix}x^k\\ s^k\end{bmatrix} - \begin{bmatrix}\nabla f(x^k)\\ 0\end{bmatrix} = \begin{bmatrix}\alpha I & \sqrt{I-W}\\ -\sqrt{I-W} & \beta I - \frac{1}{\alpha}(I-W)\end{bmatrix}\begin{bmatrix}x^{k+1}\\ s^{k+1}\end{bmatrix}.$$
• Apply Gaussian elimination (add $\frac{1}{\alpha}\sqrt{I-W}$ times the first row to the second, which zeroes the lower-left block on the right):
$$\begin{bmatrix}\alpha I & 0\\ \sqrt{I-W} & \beta I - \frac{1}{\alpha}(I-W)\end{bmatrix}\begin{bmatrix}x^k\\ s^k\end{bmatrix} - \begin{bmatrix}\nabla f(x^k)\\ \frac{1}{\alpha}\sqrt{I-W}\,\nabla f(x^k)\end{bmatrix} = \begin{bmatrix}\alpha I & \sqrt{I-W}\\ 0 & \beta I\end{bmatrix}\begin{bmatrix}x^{k+1}\\ s^{k+1}\end{bmatrix}.$$
• It is equivalent to
$$\alpha x^k - \nabla f(x^k) - \sqrt{I-W}\,s^{k+1} = \alpha x^{k+1},$$
$$\sqrt{I-W}\,x^k + \beta\bigl(I - \tfrac{1}{\alpha\beta}(I-W)\bigr)s^k - \tfrac{1}{\alpha}\sqrt{I-W}\,\nabla f(x^k) = \beta s^{k+1}.$$

NIDS (Li-Shi-Yan '17)

From the previous slide:
$$\alpha x^k - \nabla f(x^k) - \sqrt{I-W}\,s^{k+1} = \alpha x^{k+1},$$
$$\sqrt{I-W}\,x^k + \beta\bigl(I - \tfrac{1}{\alpha\beta}(I-W)\bigr)s^k - \tfrac{1}{\alpha}\sqrt{I-W}\,\nabla f(x^k) = \beta s^{k+1}.$$

Let $t = \sqrt{I-W}\,s$:
$$\alpha x^k - \nabla f(x^k) - t^{k+1} = \alpha x^{k+1},$$
$$(I-W)x^k + \beta\bigl(I - \tfrac{1}{\alpha\beta}(I-W)\bigr)t^k - \tfrac{1}{\alpha}(I-W)\nabla f(x^k) = \beta t^{k+1}.$$

We have
$$\begin{aligned}
\alpha x^{k+1} &= \alpha x^k - \nabla f(x^k) - t^{k+1}\\
&= \alpha x^k - \nabla f(x^k) - \bigl(I - \tfrac{1}{\alpha\beta}(I-W)\bigr)t^k - \tfrac{1}{\beta}(I-W)x^k + \tfrac{1}{\alpha\beta}(I-W)\nabla f(x^k)\\
&= \bigl(I - \tfrac{1}{\alpha\beta}(I-W)\bigr)\bigl(\alpha x^k - t^k - \nabla f(x^k)\bigr)\\
&= \bigl(I - \tfrac{1}{\alpha\beta}(I-W)\bigr)\bigl(\alpha x^k + \alpha x^k - \alpha x^{k-1} + \nabla f(x^{k-1}) - \nabla f(x^k)\bigr).
\end{aligned}$$

Thus
$$x^{k+1} = \bigl(I - \tfrac{1}{\alpha\beta}(I-W)\bigr)\Bigl(2x^k - x^{k-1} - \tfrac{1}{\alpha}\bigl(\nabla f(x^k) - \nabla f(x^{k-1})\bigr)\Bigr).$$
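With $\alpha\beta = 2$ the mixing factor becomes $\frac{I+W}{2}$ and the update can be implemented exactly as stated. A minimal NumPy sketch, with `alpha_inv` again standing for $1/\alpha$:

```python
import numpy as np

def nids(grad_f, W, x0, alpha_inv, n_iters=500):
    """NIDS (Li-Shi-Yan '17) with alpha * beta = 2, a sketch:
        x^{k+1} = (I+W)/2 (2 x^k - x^{k-1}
                           - (1/alpha)(grad f(x^k) - grad f(x^{k-1})))
    """
    half = (np.eye(W.shape[0]) + W) / 2
    x_prev, g_prev = x0.copy(), grad_f(x0)
    x = x_prev - alpha_inv * g_prev     # x^1 = x^0 - (1/alpha) grad f(x^0)
    for _ in range(n_iters - 1):
        g = grad_f(x)
        x_next = half @ (2 * x - x_prev - alpha_inv * (g - g_prev))
        x_prev, g_prev, x = x, g, x_next
    return x
```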

convergence conditions for NIDS

NIDS (with $\alpha\beta = 2$): $x^{k+1} = \frac{I+W}{2}\bigl(2x^k - x^{k-1} - \frac{1}{\alpha}(\nabla f(x^k) - \nabla f(x^{k-1}))\bigr)$.

• If $f = 0$ (same as EXTRA): the condition for $W$ is $-5/3 < \lambda(W) \le 1$.
• If $\nabla f(x^k) = x^k - b$:
$$\begin{bmatrix}x^{k+1}\\ x^k\end{bmatrix} = \begin{bmatrix}\bigl(2-\frac{1}{\alpha}\bigr)\frac{I+W}{2} & -\bigl(1-\frac{1}{\alpha}\bigr)\frac{I+W}{2}\\ I & 0\end{bmatrix}\begin{bmatrix}x^k\\ x^{k-1}\end{bmatrix}.$$
• Let $I+W = U\Sigma U^\top$:
$$\begin{bmatrix}U^\top x^{k+1}\\ U^\top x^k\end{bmatrix} = \begin{bmatrix}\bigl(2-\frac{1}{\alpha}\bigr)\frac{\Sigma}{2} & -\bigl(1-\frac{1}{\alpha}\bigr)\frac{\Sigma}{2}\\ I & 0\end{bmatrix}\begin{bmatrix}U^\top x^k\\ U^\top x^{k-1}\end{bmatrix}.$$
• Therefore, one sufficient condition is $-5/3 < \lambda(W) \le 1$ and $1/\alpha < 2$.

conditions of NIDS for general smooth functions

NIDS (with $\alpha\beta = 2$): $x^{k+1} = \frac{I+W}{2}\bigl(2x^k - x^{k-1} - \frac{1}{\alpha}(\nabla f(x^k) - \nabla f(x^{k-1}))\bigr)$.

Initialization: $x^1 = x^0 - \frac{1}{\alpha}\nabla f(x^0)$.

Convergence condition (Li-Yan '17): $-5/3 < \lambda(W) \le 1$ and $1/\alpha < 2/L$.

Linear convergence condition:
• $f(x)$ is strongly convex and $-1 < \lambda(W) \le 1$ (Li-Shi-Yan '17); the rate is
$$O\Bigl(\max\Bigl\{1 - \frac{\mu}{L},\; 1 - \frac{1-\lambda_2(W)}{1-\lambda_n(W)}\Bigr\}\Bigr).$$

NIDS vs EXTRA

EXTRA:
$$x^{k+1} = \tfrac{I+W}{2}(2x^k - x^{k-1}) - \tfrac{1}{\alpha}\bigl(\nabla f(x^k) - \nabla f(x^{k-1})\bigr).$$
NIDS:
$$x^{k+1} = \tfrac{I+W}{2}\Bigl(2x^k - x^{k-1} - \tfrac{1}{\alpha}\bigl(\nabla f(x^k) - \nabla f(x^{k-1})\bigr)\Bigr).$$
• The difference is in the data to be communicated (see the sketch below).
• But NIDS has a larger range of parameters than EXTRA.
• NIDS is faster than EXTRA.
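The two updates cost the same per iteration; what changes is which vector is multiplied by the mixing matrix, i.e., what each agent sends to its neighbors. A side-by-side sketch, with assumed helper names and `half` $= (I+W)/2$:

```python
def extra_step(half, x, x_prev, g, g_prev, alpha_inv):
    # EXTRA mixes only 2 x^k - x^{k-1}; the gradient correction stays local.
    return half @ (2 * x - x_prev) - alpha_inv * (g - g_prev)

def nids_step(half, x, x_prev, g, g_prev, alpha_inv):
    # NIDS mixes the gradient-corrected point, so the correction is also
    # averaged over the network.
    return half @ (2 * x - x_prev - alpha_inv * (g - g_prev))
```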

advantages of NIDS

NIDS: $x^{k+1} = \frac{I+W}{2}\bigl(2x^k - x^{k-1} - \frac{1}{\alpha}(\nabla f(x^k) - \nabla f(x^{k-1}))\bigr)$.

• The stepsize is large and does not depend on the network topology: $\frac{1}{\alpha} < \frac{2}{L}$.
• Individual stepsizes can be included: $\frac{1}{\alpha_i} < \frac{2}{L_i}$.
• The contributions of the functions and of the network to the linear convergence rate are separated:
$$O\Bigl(\max\Bigl\{1 - \frac{\mu}{L},\; 1 - \frac{1-\lambda_2(W)}{1-\lambda_n(W)}\Bigr\}\Bigr).$$
It matches the results for gradient descent and decentralized averaging without acceleration.

D²: stochastic NIDS (Huang et al. '18)

NIDS: $x^{k+1} = \frac{I+W}{2}\bigl(2x^k - x^{k-1} - \frac{1}{\alpha}(\nabla f(x^k) - \nabla f(x^{k-1}))\bigr)$.

NIDS-stochastic (D²: Decentralized Training over Decentralized Data):
$$x^{k+1} = \tfrac{I+W}{2}\Bigl(2x^k - x^{k-1} - \tfrac{1}{\alpha}\bigl(\nabla f(x^k;\xi^k) - \nabla f(x^{k-1};\xi^{k-1})\bigr)\Bigr).$$
• $\nabla f(x^k;\xi^k)$ is a stochastic gradient obtained by sampling $\xi^k$ from a distribution $\mathcal{D}$.
• Unbiasedness: $E_{\xi\sim\mathcal{D}}\,\nabla f(x;\xi) = \nabla f(x)$ for all $x$.
• Bounded variance: $E_{\xi\sim\mathcal{D}}\,\|\nabla f(x;\xi) - \nabla f(x)\|^2 \le \sigma^2$ for all $x$.
• Convergence result: if the stepsize is small enough (on the order of $(c + \sqrt{T/n})^{-1}$), the convergence rate is
$$O\Bigl(\frac{\sigma}{\sqrt{nT}} + \frac{1}{T}\Bigr).$$
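A sketch of the D² update, identical to the NIDS sketch above except that each gradient evaluation draws a fresh sample $\xi^k$; the callable `stoch_grad` is an assumed interface:

```python
import numpy as np

def d2(stoch_grad, W, x0, alpha_inv, n_iters=1000):
    """D^2 / stochastic NIDS (Huang et al. '18), a sketch.

    stoch_grad : callable; stoch_grad(x) returns an unbiased stochastic
                 gradient of f at x (each call draws a fresh sample xi).
    """
    half = (np.eye(W.shape[0]) + W) / 2
    x_prev = x0.copy()
    g_prev = stoch_grad(x_prev)              # gradient at x^0 with sample xi^0
    x = x_prev - alpha_inv * g_prev
    for _ in range(n_iters - 1):
        g = stoch_grad(x)                    # fresh sample xi^k
        x_next = half @ (2 * x - x_prev - alpha_inv * (g - g_prev))
        x_prev, g_prev, x = x, g, x_next
    return x
```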

numerical experiments

compared algorithms

• NIDS
• EXTRA/PG-EXTRA
• DIGing-ATC (Nedic et al. '16), implemented in the sketch below:
$$x^{k+1} = W\bigl(x^k - \alpha y^k\bigr), \qquad y^{k+1} = W\bigl(y^k + \nabla f(x^{k+1}) - \nabla f(x^k)\bigr).$$
• accelerated distributed Nesterov gradient descent (Acc-DNGD-SC) (Qu-Li '17)
• dual friendly optimal algorithm (OA) for distributed optimization (Uribe et al. '17)
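The slide spells out the DIGing-ATC update, so it can be sketched directly; the gradient-tracking initialization $y^0 = \nabla f(x^0)$ is the standard choice and an assumption here:

```python
import numpy as np

def diging_atc(grad_f, W, x0, alpha, n_iters=500):
    """DIGing-ATC (Nedic et al. '16), a sketch of the update above:
        x^{k+1} = W (x^k - alpha * y^k)
        y^{k+1} = W (y^k + grad f(x^{k+1}) - grad f(x^k))
    """
    x = x0.copy()
    g = grad_f(x)
    y = g.copy()                      # assumed: y^0 = grad f(x^0)
    for _ in range(n_iters):
        x = W @ (x - alpha * y)
        g_new = grad_f(x)
        y = W @ (y + g_new - g)
        g = g_new
    return x
```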

strongly convex: same stepsize

[Figure: error (log scale, $10^{-2}$ down to $10^{-14}$) vs. number of iterations (0-90).]

strongly convex: adaptive stepsize

[Figure: error (log scale, $10^{-2}$ down to $10^{-14}$) vs. number of iterations (0-140).]

linear convergence rate bottleneck

[Figure: error (log scale, $10^{5}$ down to $10^{-20}$) vs. number of iterations (0-450).]

nonsmooth functions

[Figure: error (log scale) vs. number of iterations ($\times 10^4$), comparing NIDS with stepsizes $1/L$, $1.5/L$, $1.9/L$ against PG-EXTRA with stepsizes $1/L$, $1.2/L$, $1.3/L$, $1.4/L$.]

stochastic case: shuffled

[Figure: training loss vs. epochs for D², plain decentralized, and centralized training; panels: (a) transfer learning, (b) LeNet.]
