Communication-Efficient Decentralized Learning. Yuejie Chi, EdgeComm Workshop, 2020.
Acknowledgements: Boyue Li (CMU), Shicong Cen (CMU), Yuxin Chen (Princeton).
Based on: Communication-Efficient Distributed Optimization in Networks with Gradient Tracking and Variance Reduction, JMLR, 2020.
Distributed empirical risk minimization

Distributed/federated learning: due to privacy and scalability, data are distributed across multiple locations / workers / agents. Let $\mathcal{M} = \cup_i \mathcal{M}_i$ be a data partition with equal splitting:
$$f(x) := \frac{1}{n} \sum_{i=1}^{n} f_i(x), \qquad \text{where} \quad f_i(x) := \frac{1}{N/n} \sum_{z \in \mathcal{M}_i} \ell(x; z),$$
with $N$ = number of total samples, $n$ = number of agents, and $N/n$ = number of local samples.

[Figure: five agents holding local objectives $f_1(x), \ldots, f_5(x)$.]
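To make the setup concrete, here is a minimal sketch of the distributed ERM objective for a least-squares loss, assuming an equal split of $N$ samples across $n$ agents; the loss choice and all names (`A_local`, `f_i`, `grad_i`) are illustrative and not from the talk.

```python
import numpy as np

# Illustrative distributed ERM setup: N samples split equally across n agents,
# each agent holding N/n rows of a least-squares problem.
rng = np.random.default_rng(0)
N, n, d = 1000, 20, 40                              # total samples, agents, dimension
A = rng.standard_normal((N, d))
y = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(N)
A_local, y_local = np.split(A, n), np.split(y, n)   # equal split: N/n samples per agent

def f_i(x, i):
    """Local empirical risk of agent i: average loss over its N/n samples."""
    return 0.5 * np.mean((A_local[i] @ x - y_local[i]) ** 2)

def grad_i(x, i):
    """Gradient of the local empirical risk f_i at x."""
    return A_local[i].T @ (A_local[i] @ x - y_local[i]) / (N // n)

def f(x):
    """Global objective: average of the n local objectives."""
    return np.mean([f_i(x, i) for i in range(n)])
```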
Decentralized ERM - algorithmic framework

$$\operatorname*{minimize}_{x} \quad f(x) := \frac{1}{n} \sum_{i=1}^{n} f_i(x)$$
is reformulated with one local copy per agent:
$$\operatorname*{minimize}_{\{x_i\}} \quad \frac{1}{n} \sum_{i=1}^{n} f_i(x_i) \qquad \text{subject to} \quad x_i = x_j .$$

• Local computation: agents update their local estimates ⇒ needs to be scalable!
• Global communication: agents exchange information to reach consensus ⇒ needs to be communication-efficient!

Guiding principle: more local computation leads to less communication.
Two distributed schemes

[Figure: five agents with local objectives $f_1(x), \ldots, f_5(x)$ under each scheme.]

• Master/slave model: a parameter server (PS) coordinates global information sharing.
• Network model: agents share local information over a graph topology.
Distributed first-order methods in the master/slave setting

Each agent updates its local estimate from the globally averaged parameter and gradient,
$$x_i^t = \text{LocalUpdate}\big(f_i, \nabla f(x^t), x^t\big),$$
where the parameter server computes
$$x^t = \frac{1}{n} \sum_{i=1}^{n} x_i^{t-1} \;\; \text{(parameter consensus)}, \qquad \nabla f(x^t) = \frac{1}{n} \sum_{i=1}^{n} \nabla f_i(x^t) \;\; \text{(gradient consensus)}.$$

Distributed Approximate NEwton (DANE) (Shamir et al., 2014):
$$x_i^t = \operatorname*{argmin}_{x} \; f_i(x) - \big\langle \nabla f_i(x^{t-1}) - \nabla f(x^{t-1}), \, x \big\rangle + \frac{\mu}{2} \big\| x - x^{t-1} \big\|_2^2 .$$
• A quasi-Newton-type method, less sensitive to ill-conditioning.
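Below is a rough sketch of one DANE-style round in the master/slave setting, reusing the least-squares setup (`A_local`, `y_local`, `f_i`, `grad_i`, `n`, `d`) from the earlier sketch; the L-BFGS inner solver, the value of `mu`, and the helper names are my assumptions, not the paper's implementation.

```python
from scipy.optimize import minimize

def dane_local_update(i, x_bar, grad_bar, mu=1.0):
    """DANE-style local subproblem (sketch):
    argmin_x  f_i(x) - <grad_i(x_bar) - grad_bar, x> + (mu/2) ||x - x_bar||^2."""
    correction = grad_i(x_bar, i) - grad_bar
    obj = lambda x: f_i(x, i) - correction @ x + 0.5 * mu * np.sum((x - x_bar) ** 2)
    return minimize(obj, x_bar, method="L-BFGS-B").x

x_bar = np.zeros(d)
for t in range(10):
    grad_bar = np.mean([grad_i(x_bar, i) for i in range(n)], axis=0)   # gradient consensus
    local_iterates = [dane_local_update(i, x_bar, grad_bar) for i in range(n)]
    x_bar = np.mean(local_iterates, axis=0)                            # parameter consensus
```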
Distributed Stochastic Variance-Reduced Gradients (Cen et al., 2020): each agent runs local stochastic steps
$$x_i^{t,s} \Leftarrow x_i^{t,s-1} - \eta \underbrace{v_i^{t,s-1}}_{\text{variance-reduced stochastic gradient}}, \qquad s = 1, 2, \ldots$$
• Better local computation efficiency.
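Similarly, here is a sketch of the SVRG-type local update, again reusing the setup above: each agent runs a few variance-reduced stochastic steps anchored at the globally averaged iterate and gradient. The step size, inner-loop length, and sampling scheme are illustrative choices, not the paper's.

```python
def svrg_local_update(i, x_bar, grad_bar, eta=0.01, inner_steps=50):
    """Local SVRG-style loop (sketch): the stochastic gradient of a random local sample
    at x is corrected by its gradient at the anchor x_bar plus the global gradient."""
    m = A_local[i].shape[0]
    x = x_bar.copy()
    for s in range(inner_steps):
        j = rng.integers(m)
        a, b = A_local[i][j], y_local[i][j]
        v = a * (a @ x - b) - a * (a @ x_bar - b) + grad_bar   # variance-reduced gradient
        x -= eta * v
    return x

x_bar = np.zeros(d)
for t in range(10):
    grad_bar = np.mean([grad_i(x_bar, i) for i in range(n)], axis=0)   # gradient consensus
    x_bar = np.mean([svrg_local_update(i, x_bar, grad_bar) for i in range(n)], axis=0)
```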
Naive extension to the network setting

• Communicate: agent $i$ transmits $\{x_i^t, \nabla f_i(x_i^t)\}$ to its neighbors;
• Compute: $x_i^t \Leftarrow \text{LocalUpdate}\big(f_i, \underbrace{\text{Avg}\{\nabla f_j(x_j^t)\}_{j \in \mathcal{N}_i}}_{\text{surrogate of } \nabla f(x^t)}, \underbrace{\text{Avg}\{x_j^t\}_{j \in \mathcal{N}_i}}_{\text{surrogate of } x^t}\big)$.

[Figure: optimality gap vs. #iters; the naive Network-SVRG does not converge to the global optimum, unlike centralized SVRG.]

Consensus needs to be designed carefully in the network setting! (A sketch of the naive scheme follows below.)
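Here is a sketch of this naive scheme on a ring graph, reusing the setup above, with a plain gradient step standing in for the SVRG-type local update; the ring topology and Metropolis-style weights are my choices for illustration. As the slide notes, the neighborhood averages are only biased surrogates of the global quantities, so the iterates generally do not converge to the global optimum.

```python
# Metropolis-style mixing matrix for a ring graph over n agents (illustrative choice).
W = np.zeros((n, n))
for i in range(n):
    for j in ((i - 1) % n, (i + 1) % n):
        W[i, j] = 1 / 3
    W[i, i] = 1 - W[i].sum()          # doubly stochastic: rows and columns sum to 1

x_net = np.zeros((n, d))              # one local iterate per agent
eta = 0.05
for t in range(100):
    g_net = np.stack([grad_i(x_net[i], i) for i in range(n)])
    x_avg = W @ x_net                 # neighborhood average: surrogate of the global iterate
    g_avg = W @ g_net                 # neighborhood average: biased surrogate of grad f
    x_net = x_avg - eta * g_avg       # naive local update
```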
Dynamic average consensus

Assume each agent $j$ generates some time-varying quantity $r_j^t$. How can every agent track the dynamic average
$$\frac{1}{n} \sum_{j=1}^{n} r_j^t = \frac{1}{n} \mathbf{1}_n^\top r^t, \qquad \text{where } r^t = [r_1^t, \cdots, r_n^t]^\top ?$$

• Dynamic average consensus (Zhu and Martinez, 2010):
$$q^t = \underbrace{W q^{t-1}}_{\text{mixing}} + \underbrace{r^t - r^{t-1}}_{\text{correction}},$$
where $q^t = [q_1^t, \cdots, q_n^t]^\top$ and $W$ is the mixing matrix.
• Key property: the average of $\{q_i^t\}$ dynamically tracks the average of $\{r_i^t\}$:
$$\frac{1}{n} \mathbf{1}_n^\top q^t = \frac{1}{n} \mathbf{1}_n^\top r^t .$$

M. Zhu and S. Martinez, "Discrete-time dynamic average consensus," Automatica, 2010.
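A quick numerical check of dynamic average consensus, reusing the mixing matrix `W` from the previous sketch: each agent holds a time-varying signal $r_i^t$, and the update $q^t = W q^{t-1} + r^t - r^{t-1}$ keeps the agents' average of $q$ equal to the average of $r$ at every step. The signal model below is arbitrary, chosen only for illustration.

```python
# Dynamic average consensus: q^t = W q^{t-1} + r^t - r^{t-1}.
T = 50
r_prev = rng.standard_normal(n)
q = r_prev.copy()                               # standard initialization: q^0 = r^0
for t in range(1, T):
    r = r_prev + 0.1 * rng.standard_normal(n)   # slowly varying local signals
    q = W @ q + r - r_prev
    r_prev = r
    # Key invariant: since W is doubly stochastic, the average of q tracks the average of r.
    assert np.isclose(q.mean(), r.mean())
```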
Gradient tracking

Replace the unavailable global quantities $x^t$ and $\nabla f(x^t)$ in the local update with tracked surrogates:
$$x_i^t \Leftarrow \text{LocalUpdate}\big(f_i, \; s_i^t, \; y_i^t\big).$$
• Parameter averaging:
$$y_j^t = \sum_{k \in \mathcal{N}_j} w_{jk} \, x_k^{t-1},$$
• Gradient tracking:
$$s_j^t = \sum_{k \in \mathcal{N}_j} w_{jk} \, s_k^{t-1} + \underbrace{\nabla f_j(y_j^t) - \nabla f_j(y_j^{t-1})}_{\text{gradient tracking}} .$$

We can now apply the same DANE- and SVRG-type local updates!
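A sketch of gradient tracking with the simplest possible local update (a plain gradient step standing in for the DANE/SVRG-type updates used in the talk), reusing `grad_i` and `W` from the earlier sketches; the step size is an illustrative choice.

```python
# Gradient tracking: y^t = W x^{t-1},  s^t = W s^{t-1} + grad(y^t) - grad(y^{t-1}).
eta = 0.02
x_gt = np.zeros((n, d))
g_prev = np.stack([grad_i(x_gt[i], i) for i in range(n)])
s = g_prev.copy()                       # initialize the tracker with the local gradients
for t in range(200):
    y = W @ x_gt                        # parameter averaging
    g = np.stack([grad_i(y[i], i) for i in range(n)])
    s = W @ s + g - g_prev              # s_i tracks the network-average gradient
    g_prev = g
    x_gt = y - eta * s                  # simple local update using the tracked gradient
```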
Linear Regression: Well-Conditioned

$$f_i(x) = \| y_i - A_i x \|_2^2, \qquad A_i \in \mathbb{R}^{1000 \times 40}.$$

Figure: the optimality gap $(f(\bar{x}^{(t)}) - f^\star)/f^\star$ w.r.t. iterations and gradient evaluations for DANE, DGD, Network-DANE, Network-SVRG, and Network-SARAH. Condition number $\kappa = 10$; ER graph ($p = 0.3$); 20 agents.
Linear Regression: Ill-Conditioned

$$f_i(x) = \| y_i - A_i x \|_2^2, \qquad A_i \in \mathbb{R}^{1000 \times 40}.$$

Figure: the optimality gap $(f(\bar{x}^{(t)}) - f^\star)/f^\star$ w.r.t. iterations and gradient evaluations for DANE, DGD, Network-DANE, Network-SVRG, and Network-SARAH. Condition number $\kappa = 10^4$; ER graph ($p = 0.3$); 20 agents.
Extra Mixing

The mixing rate of the graph is $\alpha_0 = 0.922$. A single round of mixing within each iteration cannot ensure the convergence of Network-SVRG; instead, $K$ rounds of mixing are performed per iteration. (A one-line sketch of the extra-mixing step follows below.)

Figure: the optimality gap of Network-SVRG w.r.t. #iters and $K \cdot$ #iters, for $K = 1$ and $K = 2$.
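One way to implement extra mixing, under the same assumptions as the sketches above, is simply to replace $W$ by $W^K$ in the updates; a minimal sketch, with `K` chosen for illustration:

```python
# K rounds of mixing per iteration: use W^K in place of W in the updates above.
K = 2
W_K = np.linalg.matrix_power(W, K)

# For a symmetric doubly stochastic W, the mixing rate of W^K is alpha_0^K.
alpha_0 = np.sort(np.abs(np.linalg.eigvals(W)))[-2]   # second-largest eigenvalue magnitude
print(alpha_0, alpha_0 ** K)                          # mixing improves geometrically in K
```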