2016 IEEE International Workshop on Machine Learning for Signal Processing (MLSP’16) Parallel and Distributed Training of Neural Networks via Successive Convex Approximation Authors : Paolo Di Lorenzo and Simone Scardapane
Contents
1. Introduction: Overview; State-of-the-art
2. The NEXT Framework: Problem Formulation; Derivation of the NEXT Algorithm
3. Application to Distributed NN Training: Choice of surrogate function; Parallel computing of the surrogate function; A practical example
4. Experimental results and conclusions: Experimental results; Conclusions
Content at a glance
Setting: Training of neural networks (NNs) where data is distributed across agents with sparse connectivity (e.g., as in wireless sensor networks).
State-of-the-art: The literature on distributed optimization of nonconvex objective functions, as required by NN training, is very limited.
Objective: We propose a general framework with theoretical guarantees that can be customized to multiple loss functions and regularizers. It also allows agents to exploit parallel multi-core processors.
Visual representation
Figure 1: Example of distributed learning with four agents (nodes 1 to 4, each holding a local dataset $S_1, \dots, S_4$), connected by links and agreeing on a common (neural network) model.
State-of-the-art
Distributed learning with convex objective functions is well established:
◮ Kernel Ridge Regression [Predd, Kulkarni and Poor, IEEE SPM, 2006]
◮ Sparse Linear Regression [Mateos, Bazerque and Giannakis, IEEE TSP, 2010]
◮ Support Vector Machines [Forero, Cano and Giannakis, JMLR, 2010]
◮ Local convex solvers & communication [Jaggi et al., NIPS, 2014]
This reflects the availability of general-purpose methods for distributed optimization of convex losses, e.g., the ADMM.
Our contribution
1. Distributed learning of neural networks has mostly been considered with sub-optimal ensemble procedures (e.g., boosting), or using some form of centralized server [Jeffrey et al., NIPS, 2012].
2. Similarly, the literature on distributed nonconvex optimization is recent and comparatively small.
3. We customize a novel framework called in-NEtwork nonconveX opTimization (NEXT), which combines a convexification-decomposition technique with a dynamic consensus procedure [Di Lorenzo and Scutari, IEEE TSIPN, 2016].
Problem formulation
Distributed training of a neural network $f(w; x)$ can be cast as the minimization of a social cost function $G$ plus a regularization term $r$:
$$\min_{w} \; U(w) = G(w) + r(w) = \sum_{i=1}^{I} g_i(w) + r(w), \tag{1}$$
where $g_i(\cdot)$ is the local cost function of agent $i$, defined as:
$$g_i(w) = \sum_{m \in S_i} l\big(d_{i,m}, f(w; x_{i,m})\big), \tag{2}$$
where $l(\cdot,\cdot)$ is a (convex) loss function, and $(x_{i,m}, d_{i,m})$ is a training example. Problem (1) is typically nonconvex because $f(w; x)$ is nonconvex in $w$.
Network model
◮ The network is modeled as a digraph $\mathcal{G}[n] = (\mathcal{V}, \mathcal{E}[n])$, where $\mathcal{V} = \{1, \dots, I\}$ is the set of agents, and $\mathcal{E}[n]$ is the set of (possibly) time-varying directed edges.
◮ Associated with each graph $\mathcal{G}[n]$, we introduce (possibly) time-varying weights $c_{ij}[n]$ matching $\mathcal{G}[n]$:
$$c_{ij}[n] = \begin{cases} \theta_{ij} \in [\vartheta, 1] & \text{if } j \in \mathcal{N}_i^{\text{in}}[n]; \\ 0 & \text{otherwise,} \end{cases} \tag{3}$$
for some $\vartheta \in (0, 1)$, and define the matrix $C[n] \triangleq (c_{ij}[n])_{i,j=1}^{I}$.
◮ The weights define the communication topology.
Network assumptions
1. The sequence of graphs $\mathcal{G}[n]$ is B-strongly connected, i.e., the graph $\mathcal{G}[k] = (\mathcal{V}, \mathcal{E}_B[k])$ with
$$\mathcal{E}_B[k] = \bigcup_{n=kB}^{(k+1)B-1} \mathcal{E}[n]$$
is strongly connected, for all $k \ge 0$ and some $B > 0$.
2. Every weight matrix $C[n]$ in (3) is doubly stochastic, i.e., it satisfies
$$C[n]\,\mathbf{1} = \mathbf{1} \quad \text{and} \quad \mathbf{1}^T C[n] = \mathbf{1}^T, \quad \forall n. \tag{4}$$
3. Each agent $i$ knows only its own cost function $g_i$ (but not the entire $G$), and the common function $r$.
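One common way to obtain weights that satisfy (3) and (4) on a static, undirected graph is the Metropolis rule. The slides do not prescribe a specific rule, so the sketch below is only an illustration under that assumption; the function name `metropolis_weights` and the four-agent ring are hypothetical, and each agent is assumed to count itself among its in-neighbors.

```python
import numpy as np

def metropolis_weights(adjacency):
    """Build a doubly stochastic weight matrix from a symmetric 0/1 adjacency matrix.

    Assumption (not from the slides): Metropolis rule on an undirected graph.
    """
    adjacency = np.asarray(adjacency, dtype=bool)
    n_agents = adjacency.shape[0]
    degree = adjacency.sum(axis=1)
    C = np.zeros((n_agents, n_agents))
    for i in range(n_agents):
        for j in range(n_agents):
            if i != j and adjacency[i, j]:
                # Symmetric off-diagonal weight between neighbors i and j.
                C[i, j] = 1.0 / (1 + max(degree[i], degree[j]))
        # The self-weight absorbs the remaining mass so that the row sums to one.
        C[i, i] = 1.0 - C[i].sum()
    return C

# Hypothetical ring of four agents: 1-2, 2-3, 3-4, 4-1.
ring = np.array([[0, 1, 0, 1],
                 [1, 0, 1, 0],
                 [0, 1, 0, 1],
                 [1, 0, 1, 0]])
C = metropolis_weights(ring)
row_sums = C.sum(axis=1)
col_sums = C.sum(axis=0)
```

Because the resulting matrix is symmetric with unit row sums, its column sums are also one, which is exactly condition (4).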
Step 1 - Local optimization
At every step, a local estimate $\widetilde{w}_i[n]$ is obtained by solving a strongly convex surrogate of the original cost function:
$$\widetilde{w}_i[n] = \arg\min_{w_i} \; \widetilde{g}_i(w_i; w_i[n]) + \pi_i[n]^T (w_i - w_i[n]) + r(w_i), \tag{5}$$
where
$$\pi_i[n] \triangleq \sum_{j \neq i} \nabla_w g_j(w_i[n]) \tag{6}$$
and $\widetilde{g}_i(w_i; w_i[n])$ is a convex approximation of $g_i$ at the point $w_i[n]$ that preserves the first-order properties of $g_i$. Since $\pi_i[n]$ depends on the cost functions of the other agents, it is not available to agent $i$ and must be approximated.
Step 2 - Computation of new estimate
The new estimate is obtained as the convex combination:
$$z_i[n] = w_i[n] + \alpha[n] \left( \widetilde{w}_i[n] - w_i[n] \right), \tag{7}$$
where $\alpha[n]$ is a possibly time-varying step-size sequence.
Step 3 - Consensus phase
Each agent $i$ updates $w_i[n]$ with a consensus procedure over the intermediate estimates of its in-neighbors:
$$w_i[n+1] = \sum_{j \in \mathcal{N}_i^{\text{in}}[n]} c_{ij}[n] \, z_j[n]. \tag{8}$$
Finally, we replace $\pi_i[n]$ with a local estimate $\widetilde{\pi}_i[n]$ that asymptotically converges to $\pi_i[n]$. The local estimate is updated as:
$$\widetilde{\pi}_i[n] \triangleq I \cdot y_i[n] - \nabla g_i(w_i[n]), \tag{9}$$
where $y_i[n]$ is a local auxiliary variable that asymptotically tracks the average of the gradients, updated as:
$$y_i[n+1] \triangleq \sum_{j=1}^{I} c_{ij}[n] \, y_j[n] + \big( \nabla g_i(w_i[n+1]) - \nabla g_i(w_i[n]) \big). \tag{10}$$
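Steps 1 to 3 can be put together in a few lines. The following is a minimal sketch, not the authors' implementation: it assumes the fully linearized surrogate with $r(w) = \lambda \|w\|_2^2$ (so that Step 1 has a closed form), the initialization $y_i[0] = \nabla g_i(w_i[0])$, a fixed weight matrix, and toy quadratic local costs $g_i(w) = \tfrac{1}{2}\|w - a_i\|^2$ chosen only because their minimizer is known. The names `next_fl`, `grads`, and `targets` are hypothetical.

```python
import numpy as np

def next_fl(grads, C, w0, alpha, tau=10.0, lam=0.5, n_iter=500):
    """Sketch of the NEXT iterations with a fully linearized surrogate.

    grads : list of callables, grads[i](w) returns the gradient of g_i at w
    C     : doubly stochastic weight matrix (kept fixed over time here)
    alpha : callable, alpha(n) returns the step size at iteration n
    """
    n_agents = len(grads)
    w = np.tile(np.asarray(w0, dtype=float), (n_agents, 1))   # row i is w_i[n]
    g = np.stack([grads[i](w[i]) for i in range(n_agents)])
    y = g.copy()                       # assumed initialization y_i[0] = grad g_i(w_i[0])
    for n in range(n_iter):
        pi = n_agents * y - g                              # local estimate of pi_i[n]
        w_tilde = (tau * w - g - pi) / (tau + 2.0 * lam)   # Step 1 (closed form)
        z = w + alpha(n) * (w_tilde - w)                   # Step 2: convex combination
        w_next = C @ z                                     # Step 3: consensus on z_j[n]
        g_next = np.stack([grads[i](w_next[i]) for i in range(n_agents)])
        y = C @ y + (g_next - g)                           # gradient tracking update
        w, g = w_next, g_next
    return w

# Toy problem: g_i(w) = 0.5 * ||w - a_i||^2, so grad g_i(w) = w - a_i.
targets = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [1.0, 3.0]])
grads = [lambda w, a=a: w - a for a in targets]
ring = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]])
C = (np.eye(4) + ring) / 3.0          # doubly stochastic weights on a ring
w_final = next_fl(grads, C, w0=np.zeros(2), alpha=lambda n: 1.0 / (n + 1) ** 0.6)
w_star = targets.sum(axis=0) / (len(targets) + 2 * 0.5)   # minimizer of the toy U(w)
```

On this toy problem all rows of `w_final` should end up close to `w_star`, which illustrates both the consensus and the optimality behavior; a real neural network would replace `grads` with backpropagated gradients of the local losses.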
Convergence
Theorem. Let $\{w[n]\}_n \triangleq \{(w_i[n])_{i=1}^{I}\}_n$ be the sequence generated by the algorithm, and let $\{\bar{w}[n]\}_n \triangleq \{(1/I) \sum_{i=1}^{I} w_i[n]\}_n$ be its average. Suppose that the step-size sequence $\{\alpha[n]\}_n$ is chosen so that $\alpha[n] \in (0, 1]$ for all $n$,
$$\sum_{n=0}^{\infty} \alpha[n] = \infty \quad \text{and} \quad \sum_{n=0}^{\infty} \alpha[n]^2 < \infty. \tag{11}$$
If the sequence $\{\bar{w}[n]\}_n$ is bounded, then (a) all its limit points are stationary solutions of the original problem; (b) all the sequences $\{w_i[n]\}_n$ asymptotically agree, i.e., $\|w_i[n] - \bar{w}[n]\| \to 0$ as $n \to \infty$, for all $i$.
Proof. See [Di Lorenzo and Scutari, IEEE TSIPN, 2016].
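A simple family of step sizes that meets condition (11) is $\alpha[n] = \alpha_0 / (n+1)^{\beta}$ with $\alpha_0 \in (0, 1]$ and $\beta \in (0.5, 1]$. This particular rule is an assumption made here for illustration and is not taken from the slides; the sketch below only checks the two sums numerically over a finite horizon.

```python
import numpy as np

def step_size(n, alpha0=1.0, beta=0.6):
    """Diminishing step size alpha[n] = alpha0 / (n + 1)**beta (illustrative rule)."""
    if not (0.0 < alpha0 <= 1.0 and 0.5 < beta <= 1.0):
        raise ValueError("need 0 < alpha0 <= 1 and 0.5 < beta <= 1")
    return alpha0 / (np.asarray(n) + 1.0) ** beta

horizon = np.arange(100_000)
alpha = step_size(horizon)
partial_sum = alpha.sum()               # keeps growing as the horizon increases
partial_sum_sq = (alpha ** 2).sum()     # stays below 1 + 1 / (2 * beta - 1) = 6
```

With $\beta = 0.6$ the squared terms decay as $(n+1)^{-1.2}$, which is summable, while the terms themselves decay as $(n+1)^{-0.6}$, which is not.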
Choice of surrogate function
Strategy (a): Partial linearization (PL). We only linearize the NN mapping:
$$\widetilde{g}_i(w_i; w_i[n]) = \sum_{m \in S_i} l\big(d_{i,m}, \widetilde{f}(w_i; w_i[n], x_{i,m})\big) + \frac{\tau_i}{2} \|w_i - w_i[n]\|^2, \tag{12}$$
where $\tau_i \ge 0$, and
$$\widetilde{f}(w_i; w_i[n], x_{i,m}) = f(w_i[n]; x_{i,m}) + \nabla_w f(w_i[n]; x_{i,m})^T (w_i - w_i[n]).$$
Strategy (b): Full linearization (FL). We linearize $g_i$ around $w_i[n]$:
$$\widetilde{g}_i(w_i; w_i[n]) = g_i(w_i[n]) + \nabla g_i(w_i[n])^T (w_i - w_i[n]) + \frac{\tau_i}{2} \|w_i - w_i[n]\|^2. \tag{13}$$
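Both surrogates are meant to share the gradient of $g_i$ at the linearization point. The sketch below checks this numerically for a toy model; the single tanh neuron, the squared loss, the random data, and all function names are assumptions introduced for illustration only.

```python
import numpy as np

def f(w, X):
    """Toy network: a single tanh neuron, one scalar output per row of X."""
    return np.tanh(X @ w)

def jac(w, X):
    """Row m holds the gradient of f(w; x_m) with respect to w."""
    return (1.0 - np.tanh(X @ w) ** 2)[:, None] * X

def g(w, X, d):
    """Local cost with squared loss."""
    return np.sum((d - f(w, X)) ** 2)

def grad_g(w, X, d):
    return -2.0 * jac(w, X).T @ (d - f(w, X))

def g_pl(w, w_n, X, d, tau):
    """Partial linearization: only the network mapping is linearized."""
    f_lin = f(w_n, X) + jac(w_n, X) @ (w - w_n)
    return np.sum((d - f_lin) ** 2) + 0.5 * tau * np.sum((w - w_n) ** 2)

def g_fl(w, w_n, X, d, tau):
    """Full linearization: the whole local cost is linearized."""
    return g(w_n, X, d) + grad_g(w_n, X, d) @ (w - w_n) + 0.5 * tau * np.sum((w - w_n) ** 2)

def num_grad(fun, w, eps=1e-6):
    """Central finite-difference gradient, used only as a check."""
    basis = np.eye(w.size)
    return np.array([(fun(w + eps * e) - fun(w - eps * e)) / (2 * eps) for e in basis])

rng = np.random.default_rng(0)
X, d, w_n = rng.normal(size=(10, 3)), rng.normal(size=10), rng.normal(size=3)
exact = grad_g(w_n, X, d)
pl_grad = num_grad(lambda w: g_pl(w, w_n, X, d, tau=1.0), w_n)
fl_grad = num_grad(lambda w: g_fl(w, w_n, X, d, tau=1.0), w_n)
```

The PL surrogate keeps the curvature of the loss and is therefore a closer model of $g_i$, while the FL surrogate is cheaper to minimize; both `pl_grad` and `fl_grad` should match `exact` up to finite-difference error.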
Parallel computing of the surrogate function
◮ Assume there are $C$ cores available at each node $i$, and partition $w_i = (w_{i,c})_{c=1}^{C}$ into $C$ nonoverlapping blocks.
◮ Choose $\widetilde{g}_i$ as additively separable in the blocks:
$$\widetilde{g}_i(w_i; w_i[n]) = \sum_{c=1}^{C} \widetilde{g}_{i,c}(w_{i,c}; w_{i,-c}[n]),$$
where each $\widetilde{g}_{i,c}(\bullet\,; w_{i,-c}[n])$ satisfies the surrogate assumptions in the variable $w_{i,c}$.
◮ The surrogate optimization problem decomposes into $C$ separate strongly convex subproblems:
$$\widetilde{w}_{i,c}[n] = \arg\min_{w_{i,c}} \; \widetilde{g}_{i,c}(w_{i,c}; w_{i,-c}[n]) + \widetilde{\pi}_{i,c}[n]^T (w_{i,c} - w_{i,c}[n]) + r(w_{i,c}).$$
A practical example I
We consider a squared loss $l(d_{i,m}, f(w; x_{i,m})) = (d_{i,m} - f(w; x_{i,m}))^2$ and an $\ell_2$ norm regularization $r(w) = \lambda \|w\|_2^2$. Define:
$$A_i[n] = \sum_{m=1}^{M} J_{i,m}^T[n] \, J_{i,m}[n] + \lambda I, \tag{14}$$
$$b_i[n] = \sum_{m=1}^{M} J_{i,m}^T[n] \, r_{i,m}[n], \tag{15}$$
with
$$[J_{i,m}[n]]_{kl} = \frac{\partial f_k(w_i[n]; x_{i,m})}{\partial w_l}, \tag{16}$$
$$r_{i,m}[n] = d_{i,m} - f(w_i[n]; x_{i,m}) + J_{i,m}[n] \, w_i[n]. \tag{17}$$
A practical example II
The cost function at agent $i$ and core $c$ for the PL formulation can be cast as:
$$\widetilde{U}_{i,c}(w_{i,c}; w_i[n], \widetilde{\pi}_{i,c}[n]) = w_{i,c}^T A_{i,c,c}[n] \, w_{i,c} - 2 \big( b_{i,c}[n] - A_{i,c,-c}[n] \, w_{i,-c}[n] - 0.5 \cdot \widetilde{\pi}_{i,c}[n] \big)^T w_{i,c}, \tag{18}$$
where $A_{i,c,c}[n]$ is the block of $A_i[n]$ corresponding to the $c$-th partition, and similarly for $A_{i,c,-c}[n]$. With the definitions (14)-(17) and the other blocks held at $w_{i,-c}[n]$, the cross-block term enters with a minus sign. The solution is given in closed form as:
$$\widetilde{w}_{i,c}[n] = A_{i,c,c}^{-1}[n] \big( b_{i,c}[n] - A_{i,c,-c}[n] \, w_{i,-c}[n] - 0.5 \cdot \widetilde{\pi}_{i,c}[n] \big). \tag{19}$$
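As a sanity check on the PL closed form, the sketch below works through the simplest case of a single block ($C = 1$), where no cross-block term appears and the solution reduces to $A_i^{-1}[n](b_i[n] - 0.5\,\widetilde{\pi}_i[n])$. It assumes a scalar-output network, omits the proximal term as the cost above does, and uses random stand-ins for the Jacobian rows, the residuals, and $\widetilde{\pi}_i[n]$; none of these values come from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_weights, lam = 20, 5, 0.1

J = rng.normal(size=(n_samples, n_weights))   # row m stands in for J_{i,m}[n]
w_n = rng.normal(size=n_weights)              # current estimate w_i[n]
err = rng.normal(size=n_samples)              # stands in for d_{i,m} - f(w_i[n]; x_{i,m})
pi = rng.normal(size=n_weights)               # stands in for the tracked estimate of pi_i[n]

r = err + J @ w_n                             # shifted residuals, as in (17)
A = J.T @ J + lam * np.eye(n_weights)         # as in (14)
b = J.T @ r                                   # as in (15)

# Single-block closed form; solve the linear system instead of inverting A.
w_tilde = np.linalg.solve(A, b - 0.5 * pi)

# Gradient of sum_m (r_m - J_m w)^2 + lam * ||w||^2 + pi^T w at the solution.
stationarity = -2.0 * J.T @ (r - J @ w_tilde) + 2.0 * lam * w_tilde + pi
```

The vector `stationarity` should vanish up to round-off, confirming that the closed form minimizes the strongly convex surrogate problem in this setting.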
A practical example III
In the FL case, the cost function at agent $i$ and core $c$ can be cast as:
$$\widetilde{U}_{i,c}(w_{i,c}; w_i[n], \widetilde{\pi}_{i,c}[n]) = (0.5 \cdot \tau_i + \lambda) \|w_{i,c}\|^2 - \big( \tau_i w_{i,c}[n] - \nabla_c g_i(w_i[n]) - \widetilde{\pi}_{i,c}[n] \big)^T w_{i,c}. \tag{20}$$
Setting the gradient of (20) to zero leads to the closed-form solution:
$$\widetilde{w}_{i,c}[n] = \frac{1}{\tau_i + 2\lambda} \big( \tau_i w_{i,c}[n] - \nabla_c g_i(w_i[n]) - \widetilde{\pi}_{i,c}[n] \big). \tag{21}$$
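Because the FL cost is separable across coordinates, each core can update its own block without knowing the others. The sketch below illustrates this with a thread pool standing in for the cores; the block update is obtained by setting the gradient of the FL cost to zero, and the data, the block partition via `np.array_split`, and the name `fl_block_update` are assumptions made for illustration.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def fl_block_update(w_c, grad_c, pi_c, tau, lam):
    """Minimizer of (0.5*tau + lam)*||w||^2 - (tau*w_c - grad_c - pi_c)^T w."""
    return (tau * w_c - grad_c - pi_c) / (tau + 2.0 * lam)

rng = np.random.default_rng(1)
n_weights, n_cores, tau, lam = 12, 4, 2.0, 0.1
# Random stand-ins for w_i[n], the local gradient, and the tracked estimate of pi_i[n].
w_n, grad, pi = rng.normal(size=(3, n_weights))
blocks = np.array_split(np.arange(n_weights), n_cores)   # nonoverlapping index blocks

def solve_block(idx):
    return fl_block_update(w_n[idx], grad[idx], pi[idx], tau, lam)

# Each worker handles one block; threads are used here only to mimic separate cores.
with ThreadPoolExecutor(max_workers=n_cores) as pool:
    parts = list(pool.map(solve_block, blocks))
w_tilde = np.concatenate(parts)

# Reference: the same update applied to the whole vector at once.
w_full = (tau * w_n - grad - pi) / (tau + 2.0 * lam)
```

The block-wise result coincides with the full-vector update, so in the FL case the parallel decomposition changes only where the arithmetic runs, not the estimate itself.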