  1. 2016 IEEE International Workshop on Machine Learning for Signal Processing (MLSP'16)

Parallel and Distributed Training of Neural Networks via Successive Convex Approximation

Authors: Paolo Di Lorenzo and Simone Scardapane

  2. Contents

Introduction
  ◮ Overview
  ◮ State-of-the-art
The NEXT Framework
  ◮ Problem Formulation
  ◮ Derivation of the NEXT Algorithm
Application to Distributed NN Training
  ◮ Choice of surrogate function
  ◮ Parallel computing of the surrogate function
  ◮ A practical example
Experimental results and conclusions
  ◮ Experimental results
  ◮ Conclusions

  3. Content at a glance

Setting: Training of neural networks (NNs) where data is distributed across agents with sparse connectivity (e.g., as in wireless sensor networks).

State-of-the-art: Very limited literature on distributed optimization of the nonconvex objective functions required by NN training.

Objective: We propose a general framework with theoretical guarantees that can be customized to multiple loss functions and regularizers, and that allows agents to exploit parallel multi-core processors.

  4. Visual representation

[Figure 1: Example of distributed learning with four agents agreeing on a common (neural network) model. Each node i holds a local dataset S_i and a local copy of the model, and exchanges information with its neighbors over sparse links.]

  5. State-of-the-art

Distributed learning with convex objective functions is well established:
◮ Kernel Ridge Regression [Predd, Kulkarni and Poor, IEEE SPM, 2006]
◮ Sparse Linear Regression [Mateos, Bazerque and Giannakis, IEEE TSP, 2010]
◮ Support Vector Machines [Forero, Cano and Giannakis, JMLR, 2010]
◮ Local convex solvers & communication [Jaggi et al., NIPS, 2014]

This reflects the availability of general-purpose methods for distributed optimization of convex losses, e.g. the ADMM.

  6. Our contribution

1. Distributed learning of neural networks has mostly been considered with sub-optimal ensemble procedures (e.g., boosting), or using some form of centralized server [Dean et al., NIPS, 2012].
2. Similarly, the literature on distributed nonconvex optimization is recent and relatively small.
3. We customize a novel framework called in-NEtwork nonconveX opTimization (NEXT), combining a convexification-decomposition technique with a dynamic consensus procedure [Di Lorenzo and Scutari, IEEE TSIPN, 2016].

  7. Contents

Introduction
  ◮ Overview
  ◮ State-of-the-art
The NEXT Framework
  ◮ Problem Formulation
  ◮ Derivation of the NEXT Algorithm
Application to Distributed NN Training
  ◮ Choice of surrogate function
  ◮ Parallel computing of the surrogate function
  ◮ A practical example
Experimental results and conclusions
  ◮ Experimental results
  ◮ Conclusions

  8. Problem formulation

Distributed training of a neural network f(w; x) can be cast as the minimization of a social cost function G plus a regularization term r:

$$ \min_{\mathbf{w}} \; U(\mathbf{w}) = G(\mathbf{w}) + r(\mathbf{w}) = \sum_{i=1}^{I} g_i(\mathbf{w}) + r(\mathbf{w}), \qquad (1) $$

where g_i(·) is the local cost function of agent i, defined as:

$$ g_i(\mathbf{w}) = \sum_{m \in \mathcal{S}_i} l\big(d_{i,m}, f(\mathbf{w}; \mathbf{x}_{i,m})\big), \qquad (2) $$

where l(·, ·) is a (convex) loss function, and (x_{i,m}, d_{i,m}) is a training example. Problem (1) is typically nonconvex due to the network mapping f(w; x).
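To make (1)-(2) concrete, here is a minimal NumPy sketch of a local cost g_i for a one-hidden-layer network with squared loss. The architecture, parameter packing, and all names are illustrative assumptions of ours, not taken from the paper.

    import numpy as np

    def forward(w, x, hidden=10):
        # f(w; x): one-hidden-layer network with tanh activation (assumed).
        # w packs W1 (hidden x d) followed by the output weights w2 (hidden,).
        d = x.shape[0]
        W1 = w[:hidden * d].reshape(hidden, d)
        w2 = w[hidden * d:]
        return w2 @ np.tanh(W1 @ x)

    def local_cost(w, X_i, d_i):
        # g_i(w) in (2): squared loss summed over agent i's samples S_i.
        return sum((d_m - forward(w, x_m)) ** 2 for x_m, d_m in zip(X_i, d_i))

In the distributed setting each agent evaluates only its own local_cost; the global objective (1) is never formed in a single place.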

  9. Network model

◮ The network is modeled as a digraph G[n] = (V, E[n]), where V = {1, ..., I} is the set of agents, and E[n] is the set of (possibly) time-varying directed edges.
◮ Associated with each graph G[n], we introduce (possibly) time-varying weights c_ij[n] matching G[n]:

$$ c_{ij}[n] = \begin{cases} \theta_{ij} \in [\vartheta, 1] & \text{if } j \in \mathcal{N}_i^{\mathrm{in}}[n]; \\ 0 & \text{otherwise}, \end{cases} \qquad (3) $$

for some ϑ ∈ (0, 1), and define the matrix C[n] ≜ (c_ij[n])_{i,j=1}^{I}.
◮ The weights define the communication topology.

  10. Network assumptions

1. The sequence of graphs G[n] is B-strongly connected, i.e., the graph G[k] = (V, E_B[k]) with edge set

$$ \mathcal{E}_B[k] = \bigcup_{n=kB}^{(k+1)B-1} \mathcal{E}[n] $$

is strongly connected, for all k ≥ 0 and some B > 0.

2. Every weight matrix C[n] in (3) is doubly stochastic, i.e., it satisfies

$$ \mathbf{1}^T \mathbf{C}[n] = \mathbf{1}^T \quad \text{and} \quad \mathbf{C}[n]\,\mathbf{1} = \mathbf{1}, \quad \forall n. \qquad (4) $$

3. Each agent i knows only its own cost function g_i (but not the entire G), and the common function r.
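The slides do not prescribe how the weights are built; the Metropolis-Hastings rule below is one standard construction satisfying the doubly stochastic condition (4) on a symmetric communication graph. A minimal sketch under that assumption:

    import numpy as np

    def metropolis_weights(adj):
        # Doubly stochastic C from a symmetric 0/1 adjacency matrix via the
        # Metropolis-Hastings rule: c_ij = 1 / (1 + max(deg_i, deg_j)) on edges.
        I = adj.shape[0]
        deg = adj.sum(axis=1)
        C = np.zeros((I, I))
        for i in range(I):
            for j in range(I):
                if adj[i, j] and i != j:
                    C[i, j] = 1.0 / (1.0 + max(deg[i], deg[j]))
            C[i, i] = 1.0 - C[i].sum()  # remaining mass goes on the diagonal
        return C

    # Four agents on a line graph 1-2-3-4: rows and columns both sum to one.
    adj = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]])
    C = metropolis_weights(adj)
    assert np.allclose(C.sum(axis=0), 1.0) and np.allclose(C.sum(axis=1), 1.0)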

  11. Step 1 - Local optimization

At every step, a local estimate \(\widetilde{\mathbf{w}}_i[n]\) is obtained by solving a strongly convex surrogate of the original cost function:

$$ \widetilde{\mathbf{w}}_i[n] = \arg\min_{\mathbf{w}_i} \; \widetilde{g}_i(\mathbf{w}_i; \mathbf{w}_i[n]) + \boldsymbol{\pi}_i[n]^T (\mathbf{w}_i - \mathbf{w}_i[n]) + r(\mathbf{w}_i), \qquad (5) $$

where

$$ \boldsymbol{\pi}_i[n] \triangleq \sum_{j \neq i} \nabla_{\mathbf{w}} g_j(\mathbf{w}_i[n]) \qquad (6) $$

and \(\widetilde{g}_i(\mathbf{w}_i; \mathbf{w}_i[n])\) is a convex approximation of g_i at the point w_i[n], preserving the first-order properties of g_i. Note that π_i[n] is not available to the agents and must be approximated.

  12. Step 2 - Computation of the new estimate

The new estimate is obtained as the convex combination:

$$ \mathbf{z}_i[n] = \mathbf{w}_i[n] + \alpha[n]\,\big(\widetilde{\mathbf{w}}_i[n] - \mathbf{w}_i[n]\big), \qquad (7) $$

where α[n] is a possibly time-varying step-size sequence.

  13. Step 3 - Consensus phase

Each agent i updates w_i[n] with a consensus procedure:

$$ \mathbf{w}_i[n+1] = \sum_{j \in \mathcal{N}_i^{\mathrm{in}}[n]} c_{ij}[n]\, \mathbf{z}_j[n]. \qquad (8) $$

Finally, we replace π_i[n] with a local estimate \(\widetilde{\boldsymbol{\pi}}_i[n]\), asymptotically converging to π_i[n]. We update the local estimate as:

$$ \widetilde{\boldsymbol{\pi}}_i[n] \triangleq I \cdot \mathbf{y}_i[n] - \nabla g_i(\mathbf{w}_i[n]), \qquad (9) $$

where y_i[n] is a local auxiliary variable that asymptotically tracks the average of the gradients, updated as:

$$ \mathbf{y}_i[n+1] \triangleq \sum_{j=1}^{I} c_{ij}[n]\, \mathbf{y}_j[n] + \big(\nabla g_i(\mathbf{w}_i[n+1]) - \nabla g_i(\mathbf{w}_i[n])\big). \qquad (10) $$
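Putting Steps 1-3 together, here is a minimal synchronous sketch of one NEXT iteration across all agents. It assumes the full-linearization surrogate of slide 16 and r = 0, so that the local optimization (5) has a closed form, and a static weight matrix C; these simplifications and all names are ours, not the authors' reference implementation.

    import numpy as np

    def next_iteration(W, Y, grads, C, grad_fn, alpha, tau=1.0):
        # One synchronous NEXT step over all I agents.
        # W, Y, grads: (I, d) stacks of w_i[n], y_i[n], grad g_i(w_i[n]);
        # C: (I, I) doubly stochastic weights; grad_fn(i, w) -> grad g_i(w).
        I = W.shape[0]
        Pi = I * Y - grads                  # eq. (9): estimate of sum_{j!=i} grad g_j
        W_tilde = W - (grads + Pi) / tau    # eqs. (5)+(13): FL surrogate minimizer
        Z = W + alpha * (W_tilde - W)       # eq. (7): convex combination
        W_new = C @ Z                       # eq. (8): consensus on the z_j's
        grads_new = np.stack([grad_fn(i, W_new[i]) for i in range(I)])
        Y_new = C @ Y + grads_new - grads   # eq. (10): gradient tracking
        return W_new, Y_new, grads_new

The tracking variables are typically initialized as y_i[0] = grad g_i(w_i[0]), so that the average of the y_i's matches the average gradient at every iteration.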

  14. Convergence

Theorem. Let {w[n]}_n ≜ {(w_i[n])_{i=1}^I}_n be the sequence generated by the algorithm, and let {w̄[n]}_n ≜ {(1/I) Σ_{i=1}^I w_i[n]}_n be its average. Suppose that the step-size sequence {α[n]}_n is chosen so that α[n] ∈ (0, 1] for all n, and

$$ \sum_{n=0}^{\infty} \alpha[n] = \infty \quad \text{and} \quad \sum_{n=0}^{\infty} \alpha[n]^2 < \infty. \qquad (11) $$

If the sequence {w̄[n]}_n is bounded, then: (a) all its limit points are stationary solutions of the original problem; (b) all the sequences {w_i[n]}_n asymptotically agree, i.e., ‖w_i[n] − w̄[n]‖ → 0 as n → ∞, for all i.

Proof: See [Di Lorenzo and Scutari, IEEE TSIPN, 2016].
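For instance, any step-size sequence of the form $\alpha[n] = \alpha_0/(n+1)^{\beta}$ with $\alpha_0 \in (0, 1]$ and $\beta \in (1/2, 1]$ satisfies (11): the series $\sum_n \alpha[n]$ diverges for $\beta \le 1$, while $\sum_n \alpha[n]^2$ converges for $\beta > 1/2$.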

  15. Contents

Introduction
  ◮ Overview
  ◮ State-of-the-art
The NEXT Framework
  ◮ Problem Formulation
  ◮ Derivation of the NEXT Algorithm
Application to Distributed NN Training
  ◮ Choice of surrogate function
  ◮ Parallel computing of the surrogate function
  ◮ A practical example
Experimental results and conclusions
  ◮ Experimental results
  ◮ Conclusions

  16. Choice of surrogate function

Strategy (a): Partial linearization (PL). We only linearize the NN mapping:

$$ \widetilde{g}_i(\mathbf{w}_i; \mathbf{w}_i[n]) = \sum_{m \in \mathcal{S}_i} l\big(d_{i,m}, \widetilde{f}(\mathbf{w}_i; \mathbf{w}_i[n], \mathbf{x}_{i,m})\big) + \frac{\tau_i}{2}\,\|\mathbf{w}_i - \mathbf{w}_i[n]\|^2, \qquad (12) $$

where τ_i ≥ 0, and

$$ \widetilde{f}(\mathbf{w}_i; \mathbf{w}_i[n], \mathbf{x}_{i,m}) = f(\mathbf{w}_i[n]; \mathbf{x}_{i,m}) + \nabla_{\mathbf{w}} f(\mathbf{w}_i[n]; \mathbf{x}_{i,m})^T (\mathbf{w}_i - \mathbf{w}_i[n]). $$

Strategy (b): Full linearization (FL). We linearize the whole g_i around w_i[n]:

$$ \widetilde{g}_i(\mathbf{w}_i; \mathbf{w}_i[n]) = g_i(\mathbf{w}_i[n]) + \nabla g_i(\mathbf{w}_i[n])^T (\mathbf{w}_i - \mathbf{w}_i[n]) + \frac{\tau_i}{2}\,\|\mathbf{w}_i - \mathbf{w}_i[n]\|^2. \qquad (13) $$
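As an illustration, the two surrogates can be evaluated as below for a scalar-output network, with a squared loss in the PL case for concreteness. The callables f, jac_f, g, grad_g and all names are assumptions of ours.

    import numpy as np

    def surrogate_pl(w, w_n, X, d, f, jac_f, tau):
        # PL surrogate (12) with squared loss: linearize the network f,
        # keep the convex loss. jac_f(w, x) returns the gradient of f in w.
        val = 0.0
        for x_m, d_m in zip(X, d):
            f_lin = f(w_n, x_m) + jac_f(w_n, x_m) @ (w - w_n)
            val += (d_m - f_lin) ** 2
        return val + 0.5 * tau * np.sum((w - w_n) ** 2)

    def surrogate_fl(w, w_n, g, grad_g, tau):
        # FL surrogate (13): first-order expansion of the whole local cost.
        return g(w_n) + grad_g(w_n) @ (w - w_n) + 0.5 * tau * np.sum((w - w_n) ** 2)

Intuitively, PL retains the curvature of the convex loss and is typically a tighter model per iteration, while FL is cheaper, needing only one gradient of g_i.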

  17. Parallel computing of the surrogate function

◮ Assume there are C cores available at each node i, and partition w_i = (w_{i,c})_{c=1}^C into C nonoverlapping blocks.
◮ Choose \(\widetilde{g}_i\) additively separable in the blocks:

$$ \widetilde{g}_i(\mathbf{w}_i; \mathbf{w}_i[n]) = \sum_{c=1}^{C} \widetilde{g}_{i,c}(\mathbf{w}_{i,c}; \mathbf{w}_{i,-c}[n]), $$

where each \(\widetilde{g}_{i,c}(\,\cdot\,; \mathbf{w}_{i,-c}[n])\) satisfies the surrogate assumptions in the variable w_{i,c}.
◮ The surrogate optimization problem then decomposes into C separate strongly convex subproblems (see the sketch below):

$$ \widetilde{\mathbf{w}}_{i,c}[n] = \arg\min_{\mathbf{w}_{i,c}} \; \widetilde{g}_{i,c}(\mathbf{w}_{i,c}; \mathbf{w}_{i,-c}[n]) + \widetilde{\boldsymbol{\pi}}_{i,c}[n]^T (\mathbf{w}_{i,c} - \mathbf{w}_{i,c}[n]) + r(\mathbf{w}_{i,c}). $$
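A sketch of the resulting per-node parallelism, dispatching the C block subproblems to separate processes. For brevity each block is solved with the closed-form FL update (r = 0); the simplifications and names are ours.

    from concurrent.futures import ProcessPoolExecutor

    def solve_block(args):
        # One strongly convex subproblem for block c (FL surrogate, r = 0):
        # the blocks decouple, so each admits its own closed-form minimizer.
        w_c, grad_c, pi_c, tau = args
        return w_c - (grad_c + pi_c) / tau

    def parallel_surrogate_solve(w_blocks, grad_blocks, pi_blocks, tau, cores=4):
        # Dispatch the C independent block solves to separate processes.
        jobs = [(w, g, p, tau) for w, g, p in zip(w_blocks, grad_blocks, pi_blocks)]
        with ProcessPoolExecutor(max_workers=cores) as pool:
            return list(pool.map(solve_block, jobs))

For a solve this cheap the process overhead dominates; the pattern pays off when each block subproblem is itself expensive, e.g. the PL linear systems of the next slides.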

  18. A practical example I

We consider a squared loss l(·, ·) = (d_{i,m} − f(w; x_{i,m}))² and an ℓ2-norm regularization r(w) = λ‖w‖²₂. Define:

$$ \mathbf{A}_i[n] = \sum_{m=1}^{M} \mathbf{J}_{i,m}[n]^T \mathbf{J}_{i,m}[n] + \lambda \mathbf{I}, \qquad (14) $$

$$ \mathbf{b}_i[n] = \sum_{m=1}^{M} \mathbf{r}_{i,m}[n]^T \mathbf{J}_{i,m}[n], \qquad (15) $$

with

$$ \big[\mathbf{J}_{i,m}[n]\big]_{kl} = \frac{\partial f_k(\mathbf{w}_i[n]; \mathbf{x}_{i,m})}{\partial w_l}, \qquad (16) $$

$$ \mathbf{r}_{i,m}[n] = \mathbf{d}_{i,m} - f(\mathbf{w}_i[n]; \mathbf{x}_{i,m}) + \mathbf{J}_{i,m}[n]\,\mathbf{w}_i[n]. \qquad (17) $$
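A NumPy sketch assembling (14)-(15) from the per-sample Jacobians and modified residuals; variable names are ours.

    import numpy as np

    def build_Ai_bi(jacobians, residuals, lam, dim):
        # A_i[n] and b_i[n] of (14)-(15). jacobians: list of (K, dim) arrays
        # J_{i,m}[n] (eq. 16); residuals: list of (K,) arrays r_{i,m}[n] (eq. 17).
        A = lam * np.eye(dim)
        b = np.zeros(dim)
        for J, r in zip(jacobians, residuals):
            A += J.T @ J
            b += r @ J  # r^T J, stored as a 1-D array
        return A, b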

  19. A practical example II

The cost function at agent i and core c for the PL formulation can be cast as:

$$ U_{i,c}(\mathbf{w}_{i,c}; \mathbf{w}_i[n], \widetilde{\boldsymbol{\pi}}_{i,c}[n]) = \mathbf{w}_{i,c}^T \mathbf{A}_{i,c,c}[n]\,\mathbf{w}_{i,c} - 2\big(\mathbf{b}_{i,c}[n] - \mathbf{A}_{i,c,-c}[n]\,\mathbf{w}_{i,-c}[n] - 0.5\,\widetilde{\boldsymbol{\pi}}_{i,c}[n]\big)^T \mathbf{w}_{i,c}, \qquad (18) $$

where A_{i,c,c}[n] is the block of A_i[n] corresponding to the c-th partition, and similarly for A_{i,c,−c}[n]. The solution is given in closed form as:

$$ \widetilde{\mathbf{w}}_{i,c}[n] = \mathbf{A}_{i,c,c}[n]^{-1}\big(\mathbf{b}_{i,c}[n] - \mathbf{A}_{i,c,-c}[n]\,\mathbf{w}_{i,-c}[n] - 0.5\,\widetilde{\boldsymbol{\pi}}_{i,c}[n]\big). \qquad (19) $$
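The block update (19) in code; using a linear solve rather than an explicit inverse is the numerically preferable route. Names are ours.

    import numpy as np

    def pl_block_solution(A_cc, A_cmc, b_c, w_mc, pi_c):
        # Closed-form PL block update (19): minimizer of the quadratic (18)
        # for block c, with the other blocks w_{i,-c}[n] held fixed.
        return np.linalg.solve(A_cc, b_c - A_cmc @ w_mc - 0.5 * pi_c)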

  20. A practical example III

In the FL case, the cost function at agent i and core c can be cast as:

$$ U_{i,c}(\mathbf{w}_{i,c}; \mathbf{w}_i[n], \widetilde{\boldsymbol{\pi}}_{i,c}[n]) = (0.5\,\tau_i + \lambda)\,\|\mathbf{w}_{i,c}\|^2 - \big(\tau_i\,\mathbf{w}_{i,c}[n] - \nabla_c g_i(\mathbf{w}_i[n]) - \widetilde{\boldsymbol{\pi}}_{i,c}[n]\big)^T \mathbf{w}_{i,c}. \qquad (20) $$

This leads to the closed-form solution:

$$ \widetilde{\mathbf{w}}_{i,c}[n] = \frac{1}{\tau_i + 2\lambda}\,\big(\tau_i\,\mathbf{w}_{i,c}[n] - \nabla_c g_i(\mathbf{w}_i[n]) - \widetilde{\boldsymbol{\pi}}_{i,c}[n]\big). \qquad (21) $$
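The FL update (21) in code, with a quick numerical sanity check that it minimizes the quadratic (20); the data and names are illustrative.

    import numpy as np

    def fl_block_solution(w_c, grad_c, pi_c, tau, lam):
        # Closed-form FL block update (21): unique minimizer of (20).
        return (tau * w_c - grad_c - pi_c) / (tau + 2.0 * lam)

    rng = np.random.default_rng(1)
    w_c, grad_c, pi_c = rng.normal(size=(3, 5))
    tau, lam = 2.0, 0.1

    def U(w):
        # The quadratic (20) for this block.
        return (0.5 * tau + lam) * (w @ w) - (tau * w_c - grad_c - pi_c) @ w

    w_star = fl_block_solution(w_c, grad_c, pi_c, tau, lam)
    assert all(U(w_star) <= U(w_star + 0.1 * rng.normal(size=5)) for _ in range(10))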
