Randomized Primal-Dual Algorithms for Asynchronous Distributed Optimization

Lin Xiao (Microsoft Research)
Joint work with Adams Wei Yu (CMU), Qihang Lin (University of Iowa), Weizhu Chen (Microsoft)

Workshop on Large-Scale and Distributed Optimization
Lund Center for Control of Complex Engineering Systems
June 14-16, 2017
Motivation

big data optimization problems
• dataset cannot fit into memory or storage of single computer
• require distributed algorithms with inter-machine communication

origins
• machine learning, data mining, ...
• industry: search, online advertising, social media analysis, ...

goals
• asynchronous distributed algorithms deployable in the cloud
• nontrivial communication and/or computation complexity
Outline

• distributed empirical risk minimization
• randomized primal-dual algorithms with parameter servers
• variance reduction techniques
• DSCOVR algorithms (Doubly Stochastic Coordinate Optimization with Variance Reduction)
• preliminary experiments
Empirical risk minimization (ERM)

• popular formulation in supervised (linear) learning

    minimize over w ∈ R^d:   P(w) := (1/N) Σ_{i=1}^N φ(x_i^T w, y_i) + λ g(w)

  – i.i.d. samples: (x_1, y_1), ..., (x_N, y_N) where x_i ∈ R^d, y_i ∈ R
  – loss function: φ(·, y) convex for every y
  – g(w) strongly convex, e.g., λ g(w) = (λ/2) ‖w‖_2^2
  – regularization parameter λ ~ 1/√N or smaller

• linear regression: φ(x^T w, y) = (y − w^T x)^2
• binary classification: y ∈ {±1}
  – logistic regression: φ(x^T w, y) = log(1 + exp(−y (w^T x)))
  – hinge loss (SVM): φ(x^T w, y) = max{0, 1 − y (w^T x)}
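As a concrete illustration (not from the slides), here is a minimal numpy sketch that evaluates the regularized ERM objective for logistic regression, assuming the common choice λ g(w) = (λ/2) ‖w‖_2^2; all names (X, y, lam) are illustrative.

```python
import numpy as np

def logistic_loss(margin, y):
    # phi(x^T w, y) = log(1 + exp(-y * x^T w)) for labels y in {+1, -1}
    return np.log1p(np.exp(-y * margin))

def erm_objective(w, X, y, lam):
    # P(w) = (1/N) sum_i phi(x_i^T w, y_i) + lambda * g(w), with lambda*g(w) = (lam/2)||w||^2
    margins = X @ w                        # x_i^T w for all samples
    loss = logistic_loss(margins, y).mean()
    reg = 0.5 * lam * np.dot(w, w)
    return loss + reg

# toy usage
rng = np.random.default_rng(0)
N, d = 1000, 20
X = rng.standard_normal((N, d))
y = np.sign(rng.standard_normal(N))
lam = 1.0 / np.sqrt(N)                     # lambda ~ 1/sqrt(N)
print(erm_objective(np.zeros(d), X, y, lam))   # equals log(2) at w = 0
```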
Distributed ERM

when dataset cannot fit into memory of single machine

• data partitioned on m machines (by rows of X):

    X = [x_1^T; x_2^T; ...; x_N^T] ∈ R^{N×d},  split into row blocks X_{1:}, X_{2:}, ..., X_{m:}

• rewrite objective function

    minimize over w ∈ R^d:   (1/N) Σ_{i=1}^m Φ_i(X_{i:} w) + g(w)

  where Φ_i(X_{i:} w) = Σ_{j∈I_i} φ_j(x_j^T w, y_j) and Σ_{i=1}^m |I_i| = N
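A small sketch of how this row-partitioned objective could be evaluated block by block; the logistic losses, the even index split, and the choice of absorbing λ into g are assumptions for illustration.

```python
import numpy as np

def partition_rows(N, m):
    # split sample indices {0, ..., N-1} into m disjoint index sets I_1, ..., I_m
    return np.array_split(np.arange(N), m)

def local_loss(Xi, yi, w):
    # Phi_i(X_{i:} w) = sum_{j in I_i} phi_j(x_j^T w, y_j), logistic loss here
    return np.sum(np.log1p(np.exp(-yi * (Xi @ w))))

def distributed_objective(X, y, w, lam, m):
    # (1/N) sum_{i=1}^m Phi_i(X_{i:} w) + g(w), with g(w) = (lam/2) ||w||^2
    N = X.shape[0]
    blocks = partition_rows(N, m)
    total = sum(local_loss(X[I], y[I], w) for I in blocks)
    return total / N + 0.5 * lam * np.dot(w, w)
```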
Distributed optimization

• distributed algorithms: alternate between
  – a local computation procedure at each machine
  – a communication round with simple map-reduce operations (e.g., broadcasting a vector in R^d to m machines, or computing the sum or average of m vectors in R^d)

• bottleneck: high cost of inter-machine communication
  – speed/delay, synchronization
  – energy consumption

• communication-efficiency
  – number of communication rounds to find ŵ with P(ŵ) − P(w⋆) ≤ ε
  – often can be measured by iteration complexity
Iteration complexity

• assumption: f : R^d → R twice continuously differentiable,

    λ I ⪯ f''(w) ⪯ L I,   ∀ w ∈ R^d

  in other words, f is λ-strongly convex and L-smooth

• condition number κ = L/λ; we focus on ill-conditioned problems: κ ≫ 1

• iteration complexities of first-order methods
  – gradient descent method: O(κ log(1/ε))
  – accelerated gradient method: O(√κ log(1/ε))
  – stochastic gradient method: O(κ/ε) (population loss)
Distributed gradient methods

distributed implementation of gradient descent

    minimize over w ∈ R^d:   P(w) = (1/N) Σ_{i=1}^m Φ_i(X_{i:} w)

[diagram: master performs w^(t+1) = w^(t) − α_t ∇P(w^(t)); machines i = 1, ..., m each compute ∇Φ_i(X_{i:} w^(t)); w^(t) and the ∇Φ_i are exchanged, communicating O(d) bits]

• each iteration involves one round of communication
• number of communication rounds: O(κ log(1/ε))
• can use accelerated gradient method: O(√κ log(1/ε))
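The master/worker loop above can be simulated serially in a few lines. This is only a toy sketch with assumed logistic losses Φ_i and no regularizer (matching the simplified objective on this slide); a real deployment would run the worker gradients in parallel.

```python
import numpy as np

def worker_gradient(Xi, yi, w):
    # gradient of Phi_i(X_{i:} w) = sum_{j in I_i} log(1 + exp(-y_j x_j^T w))
    s = -yi / (1.0 + np.exp(yi * (Xi @ w)))   # derivative of the logistic loss at z = x_j^T w
    return Xi.T @ s

def distributed_gradient_descent(X_blocks, y_blocks, N, alpha, T):
    d = X_blocks[0].shape[1]
    w = np.zeros(d)
    for t in range(T):
        # one communication round: broadcast w, gather the m local gradients, sum them
        grads = [worker_gradient(Xi, yi, w) for Xi, yi in zip(X_blocks, y_blocks)]
        grad_P = sum(grads) / N
        w = w - alpha * grad_P               # master update: w^(t+1) = w^(t) - alpha_t grad P(w^(t))
    return w
```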
ADMM

• reformulation:  minimize (1/N) Σ_{i=1}^m f_i(u_i)  subject to u_i = w, i = 1, ..., m

• augmented Lagrangian

    L_ρ(u, v, w) = Σ_{i=1}^m [ f_i(u_i) + ⟨v_i, u_i − w⟩ + (ρ/2) ‖u_i − w‖^2 ]

[diagram: master computes w^(t+1) = argmin_w L_ρ(u^(t+1), v^(t), w); machine i computes u_i^(t+1) = argmin_{u_i} L_ρ(u_i, v^(t), w^(t)) and the dual update v_i^(t+1) = v_i^(t) + ρ (u_i^(t+1) − w^(t+1)); each round communicates O(d) bits]

• no. of communication rounds: O(κ log(1/ε)) or O(√κ log(1/ε))
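Below is a minimal serial sketch of consensus ADMM. It assumes quadratic local losses f_i(u_i) = ½‖A_i u_i − b_i‖² (chosen so the u_i-subproblem has a closed form) and no extra regularizer on w; it is not the ERM setup of the earlier slides, only an illustration of the update pattern.

```python
import numpy as np

def consensus_admm(A_blocks, b_blocks, rho, T):
    m = len(A_blocks)
    d = A_blocks[0].shape[1]
    w = np.zeros(d)
    u = [np.zeros(d) for _ in range(m)]
    v = [np.zeros(d) for _ in range(m)]
    for t in range(T):
        for i in range(m):
            # u_i update: argmin_u f_i(u) + <v_i, u - w> + (rho/2) ||u - w||^2
            Ai, bi = A_blocks[i], b_blocks[i]
            u[i] = np.linalg.solve(Ai.T @ Ai + rho * np.eye(d),
                                   Ai.T @ bi - v[i] + rho * w)
        # w update (master): average of u_i + v_i / rho when there is no regularizer on w
        w = np.mean([u[i] + v[i] / rho for i in range(m)], axis=0)
        for i in range(m):
            # dual (multiplier) update: v_i <- v_i + rho (u_i - w)
            v[i] = v[i] + rho * (u[i] - w)
    return w
```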
The dual ERM problem

primal problem

    minimize over w ∈ R^d:   P(w) := (1/N) Σ_{i=1}^m Φ_i(X_{i:} w) + g(w)

dual problem

    maximize over α ∈ R^N:   D(α) := −(1/N) Σ_{i=1}^m Φ_i^*(α_i) − g^*( −(1/N) Σ_{i=1}^m (X_{i:})^T α_i )

where g^* and Φ_i^* are convex conjugate functions
• g^*(v) = sup_{u ∈ R^d} { v^T u − g(u) }
• Φ_i^*(α_i) = sup_{z ∈ R^{n_i}} { α_i^T z − Φ_i(z) },  for i = 1, ..., m

recover primal variable from dual:  w = ∇g^*( −(1/N) Σ_{i=1}^m (X_{i:})^T α_i )
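For the common choice g(w) = (λ/2) ‖w‖_2^2 (an assumption for illustration, not fixed by the slide), the conjugate and the recovery map are explicit, as in this small sketch:

```python
import numpy as np

def recover_primal(X_blocks, alpha_blocks, lam, N):
    # With g(w) = (lam/2) ||w||^2 we have g*(v) = ||v||^2 / (2 lam) and grad g*(v) = v / lam,
    # so the primal recovery map is linear in alpha:
    #   w = grad g*( -(1/N) sum_i X_{i:}^T alpha_i ) = -(1/(lam N)) sum_i X_{i:}^T alpha_i
    v = -sum(Xi.T @ ai for Xi, ai in zip(X_blocks, alpha_blocks)) / N
    return v / lam
```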
The CoCoA(+) algorithm (Jaggi et al. 2014, Ma et al. 2015)

    maximize over α ∈ R^N:   D(α) := −(1/N) Σ_{i=1}^m Φ_i^*(α_i) − g^*( −(1/N) Σ_{i=1}^m (X_{i:})^T α_i )

[diagram: master maintains v^(t+1) = v^(t) + Σ_{i=1}^m Δv_i^(t); machine i computes α_i^(t+1) = argmax_{α_i} G_i(v^(t), α_i) and Δv_i^(t) = (1/N) (X_{i:})^T (α_i^(t+1) − α_i^(t)); each round communicates O(d) bits]

• each iteration involves one round of communication
• number of communication rounds: O(κ log(1/ε))
• can be accelerated by PPA (Catalyst, Lin et al.): O(√κ log(1/ε))
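A heavily simplified sketch of the CoCoA communication pattern, instantiated for ridge regression (square loss, g(w) = (λ/2)‖w‖²) so the local coordinate-ascent step has a closed form. The exact local subproblem G_i and the aggregation/scaling rules of CoCoA+ are omitted; this only illustrates the structure "independent local dual updates, then one O(d) aggregation per round".

```python
import numpy as np

def cocoa_ridge(X_blocks, y_blocks, lam, N, outer_rounds, local_steps):
    d = X_blocks[0].shape[1]
    m = len(X_blocks)
    alpha = [np.zeros(len(yb)) for yb in y_blocks]
    v = np.zeros(d)                       # v = (1/N) sum_i X_{i:}^T alpha_i (shared state)
    for t in range(outer_rounds):
        deltas = []
        for i in range(m):                # in CoCoA these local solves run in parallel
            Xi, yi, ai = X_blocks[i], y_blocks[i], alpha[i]
            dv = np.zeros(d)              # local change Delta v_i
            for _ in range(local_steps):
                j = np.random.randint(len(yi))
                xj = Xi[j]
                w = -(v + dv) / lam       # local view of the primal point, w = grad g*(-v)
                grad_fit = ai[j] + yi[j] - xj @ w
                delta = -grad_fit / (1.0 + xj @ xj / (lam * N))   # exact coordinate maximizer
                ai[j] += delta
                dv += (delta / N) * xj
            deltas.append(dv)
        v = v + sum(deltas)               # one O(d) communication round: aggregate Delta v_i
    return -v / lam                       # recovered primal variable
```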
Primal and dual variables

[diagram: the data matrix X with row blocks X_{1:}, ..., X_{m:}; dual block α_i is associated with the rows X_{i:}, and the primal variable w is associated with the columns]

    w = ∇g^*( −(1/N) Σ_{i=1}^m (X_{i:})^T α_i )
Can we do better?

• asynchronous distributed algorithms?
• better communication complexity?
• better computation complexity?
Outline

• distributed empirical risk minimization
• randomized primal-dual algorithms with parameter servers
• variance reduction techniques
• DSCOVR algorithms (Doubly Stochastic Coordinate Optimization with Variance Reduction)
• preliminary experiments
Asynchronism: Hogwild! style

idea: exploit sparsity to avoid simultaneous updates (Niu et al. 2011)

[diagram: machines 1, ..., m update overlapping sparse pieces of the shared vector w using their local row blocks X_{1:}, ..., X_{m:}]

problems:
• too frequent communication (bottleneck for distributed system)
• slow convergence (sublinear rate using stochastic gradients)
Tame the hog: forced separation

[diagram: the coordinates of w are partitioned into blocks w_1, w_2, ..., w_K; at any time, machines 1, ..., m each work on a different block]

• partition w into K blocks w_1, ..., w_K
• each machine updates a different block using the relevant columns
• set K > m so that all machines can work all the time
• event-driven asynchronism:
  – whenever free, each machine requests a new block to update
  – update orders can be intentionally randomized
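A toy simulation of the event-driven block assignment (purely illustrative, not the actual parameter-server implementation): with K > m, a free block is always available when a machine finishes, so no two machines ever update the same block at the same time.

```python
import random
from collections import deque

def simulate_block_assignment(m, K, total_updates, seed=0):
    # K > m parameter blocks; each machine holds one block at a time,
    # so K - m blocks stay free and no two machines touch the same block.
    assert K > m
    random.seed(seed)
    free_blocks = deque(random.sample(range(K), K))     # randomized block order
    held = [free_blocks.popleft() for _ in range(m)]    # initial assignment
    log = []
    for step in range(total_updates):
        i = random.randrange(m)            # machine i finishes its update (the "event")
        free_blocks.append(held[i])        # return the finished block to the pool
        held[i] = free_blocks.popleft()    # request a new block to update
        log.append((step, i, held[i]))
    return log
```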
Double separation via saddle-point formulation

[diagram: X is partitioned into m × K blocks X_{ik}; row block i is paired with the dual block α_i, and column block k with the primal block w_k]

    min over w ∈ R^d, max over α ∈ R^N:
        (1/N) Σ_{i=1}^m Σ_{k=1}^K α_i^T X_{ik} w_k − (1/N) Σ_{i=1}^m Φ_i^*(α_i) + Σ_{k=1}^K g_k(w_k)
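A quick numerical check (illustrative only) that the doubly partitioned form of the bilinear coupling term equals the unpartitioned quantity (1/N) α^T X w; the block sizes and data are arbitrary.

```python
import numpy as np

def bilinear_term(X, alpha, w, row_blocks, col_blocks):
    # (1/N) sum_i sum_k alpha_i^T X_{ik} w_k, computed block by block
    N = X.shape[0]
    total = 0.0
    for I in row_blocks:
        for J in col_blocks:
            total += alpha[I] @ (X[np.ix_(I, J)] @ w[J])
    return total / N

# sanity check against the unpartitioned (1/N) alpha^T X w
rng = np.random.default_rng(0)
N, d, m, K = 12, 10, 3, 4
X = rng.standard_normal((N, d))
alpha = rng.standard_normal(N)
w = rng.standard_normal(d)
row_blocks = np.array_split(np.arange(N), m)
col_blocks = np.array_split(np.arange(d), K)
assert np.isclose(bilinear_term(X, alpha, w, row_blocks, col_blocks),
                  alpha @ (X @ w) / N)
```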
A randomized primal-dual algorithm

Algorithm 1: Doubly stochastic primal-dual coordinate update

input: initial points w^(0) and α^(0)
for t = 0, 1, 2, ..., T − 1
  1. pick j ∈ {1, ..., m} and l ∈ {1, ..., K} with probabilities p_j and q_l
  2. compute stochastic gradients

       u_j^(t+1) = (1/q_l) X_{jl} w_l^(t),      v_l^(t+1) = (1/(p_j N)) (X_{jl})^T α_j^(t)

  3. update primal and dual block coordinates:

       α_i^(t+1) = prox_{σ_j Φ_j^*}( α_j^(t) + σ_j u_j^(t+1) )   if i = j,   α_i^(t+1) = α_i^(t)   if i ≠ j

       w_k^(t+1) = prox_{τ_l g_l}( w_l^(t) − τ_l v_l^(t+1) )     if k = l,   w_k^(t+1) = w_k^(t)   if k ≠ l
end for
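A toy serial sketch of these update rules, instantiated with square losses Φ_j(z) = ½‖z − y_j‖² and g_k(w_k) = (λ/2)‖w_k‖² so that both prox operators are explicit (these instantiations are assumptions for illustration). The step sizes σ, τ are left as free parameters and would need to satisfy conditions from the analysis (not reproduced here) for convergence; the asynchronous parameter-server implementation is not modeled.

```python
import numpy as np

def doubly_stochastic_pd(X, y, lam, m, K, T, sigma, tau, seed=0):
    rng = np.random.default_rng(seed)
    N, d = X.shape
    row_blocks = np.array_split(np.arange(N), m)   # index sets I_1, ..., I_m
    col_blocks = np.array_split(np.arange(d), K)   # index sets J_1, ..., J_K
    p = np.full(m, 1.0 / m)                        # uniform sampling probabilities
    q = np.full(K, 1.0 / K)
    alpha = np.zeros(N)
    w = np.zeros(d)
    for t in range(T):
        j = rng.choice(m, p=p)
        l = rng.choice(K, p=q)
        I, J = row_blocks[j], col_blocks[l]
        Xjl = X[np.ix_(I, J)]
        # stochastic gradients: unbiased estimates of X_{j:} w and (1/N) (X_{:l})^T alpha
        u_j = (1.0 / q[l]) * (Xjl @ w[J])
        v_l = (1.0 / (p[j] * N)) * (Xjl.T @ alpha[I])
        # dual block update: prox of sigma * Phi_j*, with Phi_j*(a) = 0.5||a||^2 + a^T y_j
        s = alpha[I] + sigma * u_j
        alpha[I] = (s - sigma * y[I]) / (1.0 + sigma)
        # primal block update: prox of tau * g_l, with g_l(w_l) = (lam/2) ||w_l||^2
        w[J] = (w[J] - tau * v_l) / (1.0 + tau * lam)
    return w, alpha
```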
How good is this algorithm?

• on the update order
  – sequence (i(t), k(t)) not really i.i.d.
  – in practice better than i.i.d.?

[diagram: machines 1, ..., m working on disjoint blocks w_1, ..., w_K of the primal variable]

• bad news: sublinear convergence, with complexity O(1/ε)
Outline

• distributed empirical risk minimization
• randomized primal-dual algorithms with parameter servers
• variance reduction techniques
• DSCOVR algorithms (Doubly Stochastic Coordinate Optimization with Variance Reduction)
• preliminary experiments