Optimal convergence rates for distributed optimization


1. Optimal convergence rates for distributed optimization
Francis Bach (Inria - École Normale Supérieure, Paris)
Joint work with K. Scaman, S. Bubeck, Y.-T. Lee and L. Massoulié
LCCC Workshop - June 2017

2. Motivations
Typical Machine Learning setting
◮ Empirical risk minimization: $\min_{\theta \in \mathbb{R}^d} \frac{1}{m} \sum_{i=1}^{m} \ell(x_i, y_i; \theta) + \frac{c}{2} \|\theta\|^2$
◮ Large scale learning systems handle massive amounts of data
◮ Requires multiple machines to train the model

3. Motivations
Typical Machine Learning setting
◮ Empirical risk minimization, here logistic regression: $\min_{\theta \in \mathbb{R}^d} \frac{1}{m} \sum_{i=1}^{m} \log(1 + \exp(-y_i x_i^\top \theta)) + \frac{c}{2} \|\theta\|^2$
◮ Large scale learning systems handle massive amounts of data
◮ Requires multiple machines to train the model
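
A minimal NumPy sketch of this regularized logistic-regression objective and its gradient (not from the slides; the data matrix X, the labels y in {-1, +1} and the constant c are placeholder inputs):

```python
import numpy as np

def logistic_objective(theta, X, y, c):
    """(1/m) sum_i log(1 + exp(-y_i x_i^T theta)) + (c/2) ||theta||^2, with y_i in {-1, +1}."""
    margins = y * (X @ theta)
    return np.mean(np.log1p(np.exp(-margins))) + 0.5 * c * np.dot(theta, theta)

def logistic_gradient(theta, X, y, c):
    """Gradient of the objective above."""
    margins = y * (X @ theta)
    weights = -y / (1.0 + np.exp(margins))     # chain rule: -y_i / (1 + exp(y_i x_i^T theta))
    return X.T @ weights / X.shape[0] + c * theta
```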

4. Optimization with a single machine
“Best” convergence rate for strongly-convex and smooth functions
◮ Number of iterations to reach a precision $\varepsilon > 0$ (Nesterov, 2004): $\Theta\big(\sqrt{\kappa} \ln(1/\varepsilon)\big)$, where $\kappa$ is the condition number of the function to optimize.
◮ Consequence of $f(\theta_t) - f(\theta^*) \leq \beta (1 - 1/\sqrt{\kappa})^t \|\theta_0 - \theta^*\|^2$
◮ ...but each iteration requires m gradients to compute!

5. Optimization with a single machine
“Best” convergence rate for strongly-convex and smooth functions
◮ Number of iterations to reach a precision $\varepsilon > 0$ (Nesterov, 2004): $\Theta\big(\sqrt{\kappa} \ln(1/\varepsilon)\big)$, where $\kappa$ is the condition number of the function to optimize.
◮ Consequence of $f(\theta_t) - f(\theta^*) \leq \beta (1 - 1/\sqrt{\kappa})^t \|\theta_0 - \theta^*\|^2$
◮ ...but each iteration requires m gradients to compute!
Upper and lower bounds of complexity: $\inf_{\text{algorithms}} \; \sup_{\text{functions}} \; \#\text{iterations to reach } \varepsilon$
◮ Upper bound: exhibit an algorithm (here, Nesterov acceleration)
◮ Lower bound: exhibit a hard function where all algorithms fail
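
For reference, a hedged sketch of the accelerated gradient method behind this $\Theta(\sqrt{\kappa}\ln(1/\varepsilon))$ rate, using the constant momentum $(\sqrt{\kappa}-1)/(\sqrt{\kappa}+1)$ for strongly convex and smooth functions; the gradient oracle grad_f and the constants alpha, beta are assumed to be supplied by the caller:

```python
import numpy as np

def nesterov_strongly_convex(grad_f, theta0, alpha, beta, n_iters):
    """Accelerated gradient descent with constant momentum (sqrt(kappa)-1)/(sqrt(kappa)+1)."""
    kappa = beta / alpha
    momentum = (np.sqrt(kappa) - 1.0) / (np.sqrt(kappa) + 1.0)
    theta_prev = theta0.copy()
    lookahead = theta0.copy()
    for _ in range(n_iters):
        theta = lookahead - (1.0 / beta) * grad_f(lookahead)  # gradient step at the extrapolated point
        lookahead = theta + momentum * (theta - theta_prev)   # momentum / extrapolation step
        theta_prev = theta
    return theta_prev
```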

6. Distributing information on a network
Centralized algorithms
◮ “Master/slave”
◮ Minimal number of communication steps = diameter $\Delta$
Decentralized algorithms
◮ Gossip algorithms (Boyd et al., 2006; Shah, 2009)
◮ Mixing time of the Markov chain on the graph $\approx$ inverse of the second smallest eigenvalue $\gamma$ of the Laplacian
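
A small pure-NumPy sketch (all names illustrative) of the two graph quantities that drive these two regimes, the diameter for the master/slave scheme and the second smallest Laplacian eigenvalue for gossip, given the adjacency matrix A of a connected graph:

```python
import numpy as np
from collections import deque

def eccentricity(A, source):
    """Largest hop distance from `source` in the graph with adjacency matrix A."""
    n = A.shape[0]
    dist = [-1] * n
    dist[source] = 0
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in range(n):
            if A[u, v] > 0 and dist[v] < 0:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist)

def diameter_and_connectivity(A):
    diameter = max(eccentricity(A, s) for s in range(A.shape[0]))
    laplacian = np.diag(A.sum(axis=1)) - A
    gamma = np.sort(np.linalg.eigvalsh(laplacian))[1]   # second smallest Laplacian eigenvalue
    return diameter, gamma
```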

7. Goals of this work
Beyond single machine optimization
◮ Can we improve on $\Theta\big(m \sqrt{\kappa} \ln(1/\varepsilon)\big)$?
◮ Is the speed-up linear?
◮ How does a limited bandwidth affect optimization algorithms?

8. Goals of this work
Beyond single machine optimization
◮ Can we improve on $\Theta\big(m \sqrt{\kappa} \ln(1/\varepsilon)\big)$?
◮ Is the speed-up linear?
◮ How does a limited bandwidth affect optimization algorithms?
Extending optimization theory to distributed architectures
◮ Optimal convergence rates of first-order distributed methods,
◮ Optimal algorithms achieving this rate,
◮ Beyond flat (totally connected) architectures (Arjevani and Shamir, 2015),
◮ Explicit dependence on optimization parameters and graph parameters.

9. Distributed optimization setting
Optimization problem
Let the $f_i$ be $\alpha$-strongly convex and $\beta$-smooth functions. We consider minimizing the average of the local functions: $\min_{\theta \in \mathbb{R}^d} \bar{f}(\theta) = \frac{1}{n} \sum_{i=1}^{n} f_i(\theta)$
◮ Machine learning: distributed observations

10. Distributed optimization setting
Optimization problem
Let the $f_i$ be $\alpha$-strongly convex and $\beta$-smooth functions. We consider minimizing the average of the local functions: $\min_{\theta \in \mathbb{R}^d} \bar{f}(\theta) = \frac{1}{n} \sum_{i=1}^{n} f_i(\theta)$
◮ Machine learning: distributed observations
Optimization procedures
We consider distributed first-order optimization procedures: access to gradients (or gradients of the Fenchel conjugates).

11. Distributed optimization setting
Optimization problem
Let the $f_i$ be $\alpha$-strongly convex and $\beta$-smooth functions. We consider minimizing the average of the local functions: $\min_{\theta \in \mathbb{R}^d} \bar{f}(\theta) = \frac{1}{n} \sum_{i=1}^{n} f_i(\theta)$
◮ Machine learning: distributed observations
Optimization procedures
We consider distributed first-order optimization procedures: access to gradients (or gradients of the Fenchel conjugates).
Network communications
Let $G = (V, E)$ be a connected simple graph of $n$ computing units and diameter $\Delta$, each having access to a function $f_i(\theta)$ over $\theta \in \mathbb{R}^d$.
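
A hedged sketch of the "distributed observations" instance of this setting: the m examples of the earlier ERM problem are split across n nodes, f_i is the regularized risk on node i's shard, and their average recovers the full-data objective when the shards have equal size (all names are illustrative):

```python
import numpy as np

def make_local_objectives(X, y, c, n_nodes):
    """Split (X, y) into n_nodes shards; return one local objective f_i per node."""
    shards = zip(np.array_split(X, n_nodes), np.array_split(y, n_nodes))
    objectives = []
    for X_i, y_i in shards:
        def f_i(theta, X_i=X_i, y_i=y_i):
            margins = y_i * (X_i @ theta)
            return np.mean(np.log1p(np.exp(-margins))) + 0.5 * c * theta @ theta
        objectives.append(f_i)
    return objectives

def f_bar(theta, objectives):
    """Global objective: the average of the local ones (equals the full risk for equal-size shards)."""
    return sum(f(theta) for f in objectives) / len(objectives)
```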

12. Strong convexity and smoothness
Strong convexity
A function $f$ is $\alpha$-strongly convex iff $\forall x, y \in \mathbb{R}^d$, $f(y) \geq f(x) + \nabla f(x)^\top (y - x) + \frac{\alpha}{2} \|y - x\|^2$.
Smoothness
A convex function $f$ is $\beta$-smooth iff $\forall x, y \in \mathbb{R}^d$, $f(y) \leq f(x) + \nabla f(x)^\top (y - x) + \frac{\beta}{2} \|y - x\|^2$.
Notations
◮ $\kappa_l = \beta / \alpha$: (local) condition number of each $f_i$,
◮ $\kappa_g = \beta_g / \alpha_g$: (global) condition number of $\bar{f}$,
◮ $\kappa_g \leq \kappa_l$, with equality if all the $f_i$ are equal.
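
A small numerical illustration (with made-up quadratic functions, not from the slides) of the inequality $\kappa_g \leq \kappa_l$: each local function is $f_i(x) = \frac{1}{2} x^\top H_i x$ with a random positive-definite Hessian $H_i$:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 10
mats = [rng.standard_normal((d, d)) for _ in range(n)]
hessians = [M @ M.T + 0.1 * np.eye(d) for M in mats]          # symmetric positive-definite H_i

beta = max(np.linalg.eigvalsh(H).max() for H in hessians)     # common smoothness constant
alpha = min(np.linalg.eigvalsh(H).min() for H in hessians)    # common strong-convexity constant
kappa_l = beta / alpha

H_bar = sum(hessians) / n                                      # Hessian of the average function f_bar
eigs = np.linalg.eigvalsh(H_bar)
kappa_g = eigs.max() / eigs.min()

print(f"kappa_l = {kappa_l:.1f}, kappa_g = {kappa_g:.1f}")     # kappa_g <= kappa_l
```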

13. Communication network
Assumptions
◮ Each local computation takes one unit of time,
◮ Each communication between neighbors takes time $\tau$,
◮ Actions may be performed in parallel and asynchronously.

14. Distributed optimization algorithms
Black-box procedures
We consider distributed algorithms verifying the following constraints:
1. Local memory: each node $i$ can store past values in an internal memory $M_{i,t} \subset \mathbb{R}^d$ at time $t \geq 0$: $M_{i,t} \subset M^{\mathrm{comp}}_{i,t} \cup M^{\mathrm{comm}}_{i,t}$ and $\theta_{i,t} \in M_{i,t}$.
2. Local computation: each node $i$ can, at time $t$, compute the gradient of its local function, $\nabla f_i(\theta)$, or of its Fenchel conjugate, $\nabla f^*_i(\theta)$, where $f^*(\theta) = \sup_x x^\top \theta - f(x)$: $M^{\mathrm{comp}}_{i,t} = \mathrm{Span}\big(\{\theta, \nabla f_i(\theta), \nabla f^*_i(\theta) : \theta \in M_{i,t-1}\}\big)$.
3. Local communication: each node $i$ can, at time $t$, share a value with all or some of its neighbors: $M^{\mathrm{comm}}_{i,t} = \mathrm{Span}\Big(\bigcup_{(i,j) \in E} M_{j,t-\tau}\Big)$.
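
One consequence of the communication constraint, sketched below under the assumption of an adjacency matrix A and a delay tau (SciPy's shortest_path is used for the hop counts): any value depending on f_j must cross at least dist(i, j) edges to reach node i, each crossing costing tau, so node i cannot depend on f_j before time tau * dist(i, j). The worst pair gives tau * Delta, which is where the Delta * tau term of the later lower bound comes from.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

def earliest_dependence_times(A, tau):
    """T[i, j] = tau * (hop distance from i to j): node i cannot use information about f_j before T[i, j]."""
    hops = shortest_path(A, unweighted=True, directed=False)
    return tau * hops
```

On a connected graph, earliest_dependence_times(A, tau).max() equals tau times the diameter.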

15. Centralized vs. decentralized architectures
Centralized communication
◮ One master machine is responsible for sending requests and synchronizing computation,
◮ Slave machines perform computations upon request and send the result to the master.
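
A hedged sketch of this centralized pattern as plain distributed gradient descent (not the accelerated scheme of the talk; all names are illustrative): at each round the master broadcasts the current iterate, each slave answers with the gradient of its local function, and the master averages and updates.

```python
import numpy as np

def master_slave_gd(local_grads, theta0, step_size, n_rounds):
    """local_grads: list of callables, one gradient oracle per slave machine."""
    theta = theta0.copy()
    for _ in range(n_rounds):
        grads = [g(theta) for g in local_grads]              # slaves answer the master's request
        theta = theta - step_size * np.mean(grads, axis=0)   # master aggregates and updates
    return theta
```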

16. Centralized vs. decentralized architectures
Centralized communication
◮ One master machine is responsible for sending requests and synchronizing computation,
◮ Slave machines perform computations upon request and send the result to the master.
Decentralized communication
◮ All machines perform local computations and share values with their neighbors,
◮ Local averaging is performed through gossip (Boyd et al., 2006).
◮ Node $i$ receives $\sum_j W_{ij} x_j = (Wx)_i$, where $W$ verifies:
1. $W$ is an $n \times n$ symmetric matrix,
2. $W$ is defined on the edges of the network: $W_{ij} \neq 0$ only if $i = j$ or $(i, j) \in E$,
3. $W$ is positive semi-definite,
4. The kernel of $W$ is the set of constant vectors: $\mathrm{Ker}(W) = \mathrm{Span}(\mathbf{1})$, where $\mathbf{1} = (1, \dots, 1)^\top$.
◮ Let $\gamma(W) = \lambda_{n-1}(W) / \lambda_1(W)$ be the (normalized) eigengap of $W$.
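
A hedged illustration of a valid gossip matrix under these four conditions: for a connected graph, the Laplacian W = D - A qualifies, and iterating $x \leftarrow x - W x / \lambda_1(W)$ drives every entry of x toward the average at a rate governed by the normalized eigengap $\gamma(W)$ (the graph and values below are made up):

```python
import numpy as np

A = np.array([[0, 1, 0, 0],             # path graph on 4 nodes
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
W = np.diag(A.sum(axis=1)) - A          # Laplacian: symmetric, PSD, kernel = constant vectors

eigs = np.sort(np.linalg.eigvalsh(W))   # eigs[0] is ~0 for a connected graph
gamma = eigs[1] / eigs[-1]              # normalized eigengap lambda_{n-1} / lambda_1

x = np.array([4.0, 0.0, 0.0, 0.0])      # local values to average
for _ in range(50):
    x = x - (1.0 / eigs[-1]) * (W @ x)  # one gossip round
print(x)                                # every entry is close to the average 1.0
```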

17. Lower bound on convergence rate
Theorem 1 (SBBLM, 2017). Let $G$ be a graph of diameter $\Delta > 0$ and size $n > 0$, and $\beta_g \geq \alpha_g > 0$. There exist $n$ functions $f_i : \ell_2 \to \mathbb{R}$ such that $\bar{f}$ is $\alpha_g$-strongly convex and $\beta_g$-smooth, and for any $t \geq 0$ and any black-box procedure one has, for all $i \in \{1, \dots, n\}$,
$\bar{f}(\theta_{i,t}) - \bar{f}(\theta^*) \geq \frac{\alpha_g}{2} \Big(1 - \frac{4}{\sqrt{\kappa_g}}\Big)^{1 + \frac{t}{1 + \Delta \tau}} \|\theta_{i,0} - \theta^*\|^2.$

18. Lower bound on convergence rate
Theorem 1 (SBBLM, 2017). Let $G$ be a graph of diameter $\Delta > 0$ and size $n > 0$, and $\beta_g \geq \alpha_g > 0$. There exist $n$ functions $f_i : \ell_2 \to \mathbb{R}$ such that $\bar{f}$ is $\alpha_g$-strongly convex and $\beta_g$-smooth, and for any $t \geq 0$ and any black-box procedure one has, for all $i \in \{1, \dots, n\}$,
$\bar{f}(\theta_{i,t}) - \bar{f}(\theta^*) \geq \frac{\alpha_g}{2} \Big(1 - \frac{4}{\sqrt{\kappa_g}}\Big)^{1 + \frac{t}{1 + \Delta \tau}} \|\theta_{i,0} - \theta^*\|^2.$
Take-home message
For any graph of diameter $\Delta$ and any black-box procedure, there exist functions $f_i$ such that the time to reach a precision $\varepsilon > 0$ is lower bounded by $\Omega\big(\sqrt{\kappa_g}\,(1 + \Delta \tau) \ln(1/\varepsilon)\big)$
◮ Extends the totally connected result of Arjevani and Shamir (2015)
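
A tiny worked evaluation of this bound with made-up numbers, ignoring the hidden constant, to show how the graph term $(1 + \Delta\tau)$ multiplies the single-machine term $\sqrt{\kappa_g}\ln(1/\varepsilon)$:

```python
import numpy as np

kappa_g, delta, tau, eps = 100.0, 10, 0.5, 1e-6
single_machine = np.sqrt(kappa_g) * np.log(1.0 / eps)                     # ~138 time units
distributed = np.sqrt(kappa_g) * (1.0 + delta * tau) * np.log(1.0 / eps)  # ~829 time units
print(f"single machine ~{single_machine:.0f}, over the graph ~{distributed:.0f}")
```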

19. Proof warm-up: single machine
◮ Simplification: $\ell_2$ instead of $\mathbb{R}^d$.
◮ Goal: design a worst-case convex function $f$.
◮ From Nesterov (2004), Bubeck (2015): $f(\theta) = \frac{\alpha(\kappa - 1)}{8} \big(\theta^\top A \theta - 2\theta_1\big) + \frac{\alpha}{2} \|\theta\|^2$, with $A$ the infinite tridiagonal matrix with 2 on the diagonal and $-1$ on the upper and lower diagonals.
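
A hedged sketch of a finite truncation of this worst-case function (the construction lives in $\ell_2$; here A is truncated to its top-left d x d block purely for numerical experimentation):

```python
import numpy as np

def worst_case_function(alpha, kappa, d):
    """Return f and grad_f for a d-dimensional truncation of the Nesterov hard function."""
    A = 2 * np.eye(d) - np.eye(d, k=1) - np.eye(d, k=-1)   # tridiagonal: 2 on diag, -1 off diag
    e1 = np.zeros(d); e1[0] = 1.0
    scale = alpha * (kappa - 1) / 8.0

    def f(theta):
        return scale * (theta @ A @ theta - 2 * theta[0]) + 0.5 * alpha * theta @ theta

    def grad_f(theta):
        return scale * (2 * A @ theta - 2 * e1) + alpha * theta

    return f, grad_f
```

The Hessian of this truncation is $\frac{\alpha(\kappa-1)}{4} A + \alpha I$; since the eigenvalues of A lie in (0, 4), its condition number approaches $\kappa$ as d grows.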
