Randomized Primal-Dual Algorithms for Asynchronous Distributed Optimization



  1. Randomized Primal-Dual Algorithms for Asynchronous Distributed Optimization
     Lin Xiao, Microsoft Research
     Joint work with Adams Wei Yu (CMU), Qihang Lin (University of Iowa), and Weizhu Chen (Microsoft)
     Workshop on Large-Scale and Distributed Optimization
     Lund Center for Control of Complex Engineering Systems, June 14-16, 2017

  2. Motivation
     big data optimization problems
     • dataset cannot fit into memory or storage of a single computer
     • require distributed algorithms with inter-machine communication
     origins
     • machine learning, data mining, . . .
     • industry: search, online advertising, social media analysis, . . .
     goals
     • asynchronous distributed algorithms deployable in the cloud
     • nontrivial communication and/or computation complexity

  3. Outline
     • distributed empirical risk minimization
     • randomized primal-dual algorithms with parameter servers
     • variance reduction techniques
     • DSCOVR algorithms (Doubly Stochastic Coordinate Optimization with Variance Reduction)
     • preliminary experiments

  4. Empirical risk minimization (ERM)
     • popular formulation in supervised (linear) learning

         minimize_{w ∈ R^d}   P(w) := (1/N) ∑_{i=1}^N φ(x_i^T w, y_i) + λ g(w)

       – i.i.d. samples: (x_1, y_1), ..., (x_N, y_N) where x_i ∈ R^d, y_i ∈ R
       – loss function: φ(·, y) convex for every y
       – g(w) strongly convex, e.g., g(w) = (λ/2) ‖w‖₂²
       – regularization parameter λ ∼ 1/√N or smaller
     • linear regression: φ(x^T w, y) = (y − w^T x)²
     • binary classification: y ∈ {±1}
       – logistic regression: φ(x^T w, y) = log(1 + exp(−y (w^T x)))
       – hinge loss (SVM): φ(x^T w, y) = max{0, 1 − y (w^T x)}
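
As a concrete instance of the objective above, here is a minimal sketch (function name and data are hypothetical, not from the slides) of P(w) with the logistic loss, with λ folded into the regularizer as on the later slides:

```python
import numpy as np

def erm_objective(w, X, y, lam):
    """P(w) from the slide, with logistic loss and lambda folded into g.

    X: (N, d) data matrix, y: labels in {-1, +1}. Illustrative sketch only.
    """
    margins = y * (X @ w)                        # y_i * x_i^T w
    loss = np.mean(np.log1p(np.exp(-margins)))   # (1/N) sum_i log(1 + exp(-y_i x_i^T w))
    reg = 0.5 * lam * np.dot(w, w)               # (lambda/2) ||w||^2
    return loss + reg

# usage sketch on random data (hypothetical sizes)
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 20))
y = np.sign(rng.standard_normal(1000))
print(erm_objective(np.zeros(20), X, y, lam=1e-3))   # equals log(2) at w = 0
```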

  5. Distributed ERM
     when the dataset cannot fit into the memory of a single machine
     • data partitioned across m machines: the rows x_j^T of X ∈ R^{N×d} are split into
       row blocks X_{1:}, X_{2:}, ..., X_{m:}, one block per machine
     • rewrite the objective function

         minimize_{w ∈ R^d}   (1/N) ∑_{i=1}^m Φ_i(X_{i:} w) + g(w)

       where Φ_i(X_{i:} w) = ∑_{j ∈ I_i} φ_j(x_j^T w, y_j) and ∑_{i=1}^m |I_i| = N
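
A minimal sketch of this row partition (hypothetical helper, in-memory only); block i plays the role of (X_{i:}, {y_j : j ∈ I_i}) held by machine i:

```python
import numpy as np

def partition_rows(X, y, m):
    """Split (X, y) into m row blocks, one per machine.

    Block sizes differ by at most one when N is not divisible by m.
    """
    index_blocks = np.array_split(np.arange(X.shape[0]), m)
    return [(X[idx], y[idx]) for idx in index_blocks]

# blocks = partition_rows(X, y, m=4)   # blocks[i] -> (X_{i:}, y_{I_i})
```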

  6. Distributed optimization
     • distributed algorithms alternate between
       – a local computation procedure at each machine
       – a communication round with simple map-reduce operations (e.g., broadcasting a
         vector in R^d to m machines, or computing the sum or average of m vectors in R^d)
     • bottleneck: high cost of inter-machine communication
       – speed/delay, synchronization
       – energy consumption
     • communication efficiency
       – number of communication rounds to find ŵ such that P(ŵ) − P(w⋆) ≤ ε
       – often can be measured by iteration complexity

  7. Iteration complexity
     • assumption: f : R^d → R is twice continuously differentiable, with
         λI ⪯ f″(w) ⪯ LI   for all w ∈ R^d;
       in other words, f is λ-strongly convex and L-smooth
     • condition number κ = L/λ; we focus on ill-conditioned problems: κ ≫ 1
     • iteration complexities of first-order methods
       – gradient descent method: O(κ log(1/ε))
       – accelerated gradient method: O(√κ log(1/ε))
       – stochastic gradient method: O(κ/ε) (population loss)
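
A quick numerical illustration of the first bound (a hypothetical experiment, not from the talk): plain gradient descent on a diagonal quadratic with condition number κ, counting iterations until the objective gap drops below ε. The count grows roughly like κ log(1/ε).

```python
import numpy as np

def gd_iterations_to_eps(kappa, eps=1e-6, d=100, seed=0):
    """Iterations of gradient descent on f(w) = 0.5 * sum_i e_i * w_i^2.

    Eigenvalues e_i span [1, kappa], so lambda = 1 and L = kappa; step size is 1/L.
    """
    eigs = np.linspace(1.0, kappa, d)            # spectrum: lambda = 1, L = kappa
    w = np.random.default_rng(seed).standard_normal(d)
    step = 1.0 / eigs.max()                      # 1/L step size
    t = 0
    while 0.5 * np.sum(eigs * w**2) > eps:       # objective gap (minimum value is 0)
        w -= step * eigs * w                     # exact gradient of the quadratic
        t += 1
    return t

# e.g. gd_iterations_to_eps(10.0) vs gd_iterations_to_eps(100.0): roughly a 10x gap
```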

  8. Distributed gradient methods
     distributed implementation of gradient descent for

         minimize_{w ∈ R^d}   P(w) = (1/N) ∑_{i=1}^m Φ_i(X_{i:} w)

     [figure: master-worker diagram; the master broadcasts w^{(t)}, machine i computes
      ∇Φ_i(X_{i:} w^{(t)}), the gradients are aggregated, and the master updates
      w^{(t+1)} = w^{(t)} − α_t ∇P(w^{(t)}); each round communicates O(d) bits]
     • each iteration involves one round of communication
     • number of communication rounds: O(κ log(1/ε))
     • can use accelerated gradient method: O(√κ log(1/ε))
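
A minimal simulation of this pattern (hypothetical names; no real networking), with the per-machine gradient routine left abstract:

```python
import numpy as np

def distributed_gd(blocks, grad_block, d, iters=100, alpha=0.1):
    """Simulated master-worker gradient descent: one communication round per iteration.

    blocks     : list of per-machine data, e.g. the (X_i, y_i) pairs from a row partition
    grad_block : callable(w, X_i, y_i) -> machine i's contribution to the gradient of P
    """
    w = np.zeros(d)
    for _ in range(iters):
        # master broadcasts w^{(t)}; each worker computes its partial gradient locally
        partials = [grad_block(w, Xi, yi) for Xi, yi in blocks]
        # reduce step: sum the m partial gradients, then take a gradient step
        w = w - alpha * np.sum(partials, axis=0)
    return w
```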

  9. ADMM
     • reformulation:   minimize (1/N) ∑_{i=1}^m f_i(u_i)   subject to   u_i = w,  i = 1, ..., m
     • augmented Lagrangian

         L_ρ(u, v, w) = ∑_{i=1}^m ( f_i(u_i) + ⟨v_i, u_i − w⟩ + (ρ/2) ‖u_i − w‖² )

     [figure: master-worker diagram; each round communicates O(d) bits.
      machine i:  u_i^{(t+1)} = argmin_{u_i} L_ρ(u_i, v^{(t)}, w^{(t)}),
                  v_i^{(t+1)} = v_i^{(t)} + ρ (u_i^{(t)} − w^{(t)});
      master:     w^{(t+1)} = argmin_w L_ρ(u^{(t+1)}, v^{(t)}, w)]
     • number of communication rounds: O(κ log(1/ε)) or O(√κ log(1/ε))
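
A minimal in-memory sketch of consensus ADMM with the standard u → w → v ordering (the prox operators are hypothetical caller-supplied callables, and the 1/N factor is absorbed into each f_i):

```python
import numpy as np

def consensus_admm(prox_fs, d, rho=1.0, iters=100):
    """Consensus ADMM sketch for min sum_i f_i(u_i) subject to u_i = w.

    prox_fs[i](z, rho) should return argmin_u { f_i(u) + (rho/2) * ||u - z||^2 }.
    One communication round per iteration (workers send u_i, v_i; master returns w).
    """
    m = len(prox_fs)
    u = np.zeros((m, d))
    v = np.zeros((m, d))     # dual variables v_i
    w = np.zeros(d)
    for _ in range(iters):
        # worker i: minimize f_i(u_i) + <v_i, u_i - w> + (rho/2)||u_i - w||^2
        for i in range(m):
            u[i] = prox_fs[i](w - v[i] / rho, rho)
        # master: minimizing the quadratic terms over w gives the average of u_i + v_i/rho
        w = np.mean(u + v / rho, axis=0)
        # worker i: dual ascent step
        for i in range(m):
            v[i] = v[i] + rho * (u[i] - w)
    return w
```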

  10. The dual ERM problem
     primal problem

         minimize_{w ∈ R^d}   P(w) := (1/N) ∑_{i=1}^m Φ_i(X_{i:} w) + g(w)

     dual problem

         maximize_{α ∈ R^N}   D(α) := −(1/N) ∑_{i=1}^m Φ*_i(α_i) − g*( −(1/N) ∑_{i=1}^m (X_{i:})^T α_i )

     where g* and Φ*_i are the convex conjugate functions:
     • g*(v) = sup_{u ∈ R^d} { v^T u − g(u) }
     • Φ*_i(α_i) = sup_{z ∈ R^{n_i}} { α_i^T z − Φ_i(z) },  for i = 1, ..., m
     recover the primal variable from the dual:  w = ∇g*( −(1/N) ∑_{i=1}^m (X_{i:})^T α_i )
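
For example, with g(w) = (λ/2)‖w‖² the conjugate and the primal-dual map specialize as follows (a standard computation, not spelled out on the slide):

```latex
% conjugate of g(w) = (lambda/2) ||w||^2 and the resulting primal-dual map
\[
  g^*(v) \;=\; \sup_{u \in \mathbb{R}^d} \Bigl\{ v^\top u - \tfrac{\lambda}{2}\|u\|^2 \Bigr\}
         \;=\; \tfrac{1}{2\lambda}\|v\|^2 ,
  \qquad
  \nabla g^*(v) \;=\; \tfrac{1}{\lambda}\, v ,
\]
\[
  w \;=\; \nabla g^*\!\Bigl( -\tfrac{1}{N} \sum_{i=1}^{m} (X_{i:})^\top \alpha_i \Bigr)
    \;=\; -\tfrac{1}{\lambda N} \sum_{i=1}^{m} (X_{i:})^\top \alpha_i .
\]
```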

  11. The CoCoA(+) algorithm (Jaggi et al. 2014, Ma et al. 2015)

         maximize_{α ∈ R^N}   D(α) := −(1/N) ∑_{i=1}^m Φ*_i(α_i) − g*( −(1/N) ∑_{i=1}^m (X_{i:})^T α_i )

     [figure: master-worker diagram; each round communicates O(d) bits.
      machine i:  α_i^{(t+1)} = argmax_{α_i} G_i(v^{(t)}, α_i),
                  Δv_i^{(t)} = (1/N) (X_{i:})^T (α_i^{(t+1)} − α_i^{(t)});
      master:     v^{(t+1)} = v^{(t)} + ∑_{i=1}^m Δv_i^{(t)}]
     • each iteration involves one round of communication
     • number of communication rounds: O(κ log(1/ε))
     • can be accelerated by PPA (Catalyst, Lin et al.): O(√κ log(1/ε))
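
A schematic of the outer loop above, showing only the communication pattern; the local subproblem G_i is defined in the CoCoA papers and is left abstract here (hypothetical `local_solvers` interface):

```python
import numpy as np

def cocoa_outer_loop(local_solvers, X_blocks, alpha_blocks, N, rounds=50):
    """Schematic CoCoA(+) outer loop over the shared vector v = (1/N) sum_i X_{i:}^T alpha_i.

    local_solvers[i](v, alpha_i) should return an (approximately) improved local dual
    block alpha_i^{(t+1)}; no real networking, local solves run sequentially here.
    """
    d = X_blocks[0].shape[1]
    v = sum(Xi.T @ ai for Xi, ai in zip(X_blocks, alpha_blocks)) / N
    for _ in range(rounds):
        delta_v = np.zeros(d)
        for i, (Xi, ai) in enumerate(zip(X_blocks, alpha_blocks)):
            ai_new = local_solvers[i](v, ai)           # local computation on machine i
            delta_v += Xi.T @ (ai_new - ai) / N        # Delta v_i^{(t)} as on the slide
            alpha_blocks[i] = ai_new
        v = v + delta_v                                # reduce + broadcast: one round
    return v, alpha_blocks
```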

  12. Primal and dual variables
     [figure: the primal vector w is shared across the column dimension of X, while each
      row block X_{i:} is paired with its own dual block α_i, for i = 1, ..., m]

         w = ∇g*( −(1/N) ∑_{i=1}^m (X_{i:})^T α_i )

  13. Can we do better?
     • asynchronous distributed algorithms?
     • better communication complexity?
     • better computation complexity?

  14. Outline
     • distributed empirical risk minimization
     • randomized primal-dual algorithms with parameter servers
     • variance reduction techniques
     • DSCOVR algorithms (Doubly Stochastic Coordinate Optimization with Variance Reduction)
     • preliminary experiments

  15. Asynchronism: Hogwild! style
     idea: exploit sparsity to avoid simultaneous updates (Niu et al. 2011)
     [figure: the shared vector w; machines 1, ..., m each update only the coordinates of w
      touched by the nonzero columns of their own row blocks X_{1:}, ..., X_{m:}]
     problems:
     • too frequent communication (bottleneck for distributed systems)
     • slow convergence (sublinear rate using stochastic gradients)
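
A toy single-machine sketch of the Hogwild! idea (hypothetical helper; Python's GIL means this only mimics, rather than truly exercises, lock-free parallel updates): each thread samples a row and updates, without locks, only the coordinates where that row is nonzero.

```python
import threading
import numpy as np

def hogwild_sgd(X, y, lam, n_threads=4, epochs=5, lr=0.1):
    """Hogwild!-style lock-free SGD sketch for logistic-loss ERM.

    The regularizer is applied lazily on the same sparse support as the sampled row.
    """
    N, d = X.shape
    w = np.zeros(d)                                   # shared parameter vector

    def worker(seed):
        rng = np.random.default_rng(seed)
        for _ in range(epochs * N // n_threads):
            j = rng.integers(N)
            xj, yj = X[j], y[j]
            support = np.nonzero(xj)[0]               # touch only nonzero coordinates
            g = -yj / (1.0 + np.exp(yj * (xj @ w)))   # logistic-loss derivative
            w[support] -= lr * (g * xj[support] + lam * w[support])

    threads = [threading.Thread(target=worker, args=(s,)) for s in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return w
```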

  16. Tame the hog: forced separation
     [figure: w partitioned into blocks w_1, w_2, ..., w_K, with machines 1, ..., m each
      working on a different block]
     • partition w into K blocks w_1, ..., w_K
     • each machine updates a different block, using the relevant columns of its data
     • set K > m so that all machines can work all the time
     • event-driven asynchronism:
       – whenever free, each machine requests a new block to update
       – update orders can be intentionally randomized
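
A minimal event-driven scheduler sketch under these assumptions (hypothetical names; the block update itself is a stub): free blocks sit in a queue, any idle worker pulls one, updates it, and returns it, and with K > m no worker ever waits.

```python
import queue
import random
import threading

def run_block_scheduler(K, m, updates_per_worker=100):
    """Event-driven block assignment: K parameter blocks, m workers, K > m."""
    free_blocks = queue.Queue()
    order = list(range(K))
    random.shuffle(order)                       # intentionally randomized order
    for k in order:
        free_blocks.put(k)

    def worker(worker_id):
        for _ in range(updates_per_worker):
            k = free_blocks.get()               # request a new block when free
            # ... update block w_k using this machine's columns for block k ...
            free_blocks.put(k)                  # release the block for other workers

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(m)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

# e.g. run_block_scheduler(K=8, m=4)
```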

  17. Double separation via saddle-point formulation
     [figure: X partitioned into blocks X_{ik}; the dual blocks α_1, ..., α_m index the row
      blocks X_{i:}, and the primal blocks w_1, ..., w_K index the column blocks X_{:k}]

         min_{w ∈ R^d} max_{α ∈ R^N}   (1/N) ∑_{i=1}^m ∑_{k=1}^K α_i^T X_{ik} w_k − (1/N) ∑_{i=1}^m Φ*_i(α_i) + ∑_{k=1}^K g_k(w_k)

  18. A randomized primal-dual algorithm

     Algorithm 1: doubly stochastic primal-dual coordinate update
     input: initial points w^{(0)} and α^{(0)}
     for t = 0, 1, 2, ..., T − 1:
       1. pick j ∈ {1, ..., m} and l ∈ {1, ..., K} with probabilities p_j and q_l
       2. compute stochastic gradients
            u_j^{(t+1)} = (1/q_l) X_{jl} w_l^{(t)},     v_l^{(t+1)} = (1/(p_j N)) (X_{jl})^T α_j^{(t)}
       3. update primal and dual block coordinates:
            α_i^{(t+1)} = prox_{σ_j Φ*_j}( α_j^{(t)} + σ_j u_j^{(t+1)} )   if i = j,     α_i^{(t+1)} = α_i^{(t)}   if i ≠ j
            w_k^{(t+1)} = prox_{τ_l g_l}( w_l^{(t)} − τ_l v_l^{(t+1)} )    if k = l,     w_k^{(t+1)} = w_k^{(t)}   if k ≠ l
     end for
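
A minimal in-memory sketch of this update loop (hypothetical interfaces: the prox operators for σ_j Φ*_j and τ_l g_l, the step sizes σ, τ, and the probabilities p, q are all supplied by the caller; no asynchrony or variance reduction here):

```python
import numpy as np

def dscovr_basic(X_blocks, alpha0, w0, prox_phi_conj, prox_g, sigma, tau, p, q, N,
                 T=1000, seed=0):
    """Doubly stochastic primal-dual block-coordinate updates, as in Algorithm 1.

    X_blocks[j][l] is the block X_{jl}; alpha0 / w0 are lists of dual / primal blocks.
    prox_phi_conj[j] and prox_g[l] are prox operators for sigma_j*Phi_j^* and tau_l*g_l.
    """
    rng = np.random.default_rng(seed)
    alpha = [a.copy() for a in alpha0]
    w = [wk.copy() for wk in w0]
    m, K = len(alpha), len(w)
    for _ in range(T):
        j = rng.choice(m, p=p)                            # pick a row (dual) block
        l = rng.choice(K, p=q)                            # pick a column (primal) block
        Xjl = X_blocks[j][l]
        u_j = (Xjl @ w[l]) / q[l]                         # stochastic gradient for alpha_j
        v_l = (Xjl.T @ alpha[j]) / (p[j] * N)             # stochastic gradient for w_l
        alpha[j] = prox_phi_conj[j](alpha[j] + sigma[j] * u_j)   # dual block update
        w[l] = prox_g[l](w[l] - tau[l] * v_l)                    # primal block update
    return w, alpha
```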

  19. How good is this algorithm?
     • on the update order
       – the sequence (i(t), k(t)) is not really i.i.d.
       – in practice better than i.i.d.?
     [figure: blocks w_1, ..., w_K being updated concurrently by machines 1, ..., m]
     • bad news: sublinear convergence, with complexity O(1/ε)

  20. Outline
     • distributed empirical risk minimization
     • randomized primal-dual algorithms with parameter servers
     • variance reduction techniques
     • DSCOVR algorithms (Doubly Stochastic Coordinate Optimization with Variance Reduction)
     • preliminary experiments
