Graph Oracle Models, Lower Bounds, and Gaps for Parallel Stochastic Optimization

  1. Graph Oracle Models, Lower Bounds, and Gaps for Parallel Stochastic Optimization. Jialei Wang (UChicago → Two Sigma Investments), Blake Woodworth (TTIC), Adam Smith (Boston University), Nati Srebro (TTIC), H. Brendan McMahan (Google)

  2. Parallel Stochastic Optimization/Learning: $\min_x F(x) := \mathbb{E}_{z \sim \mathcal{D}}[f(x; z)]$. Many parallelization scenarios:
  • Synchronous parallelism
  • Asynchronous parallelism
  • Delayed updates
  • Few/many workers
  • Infrequent communication
  • Federated learning
  • …
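Not from the talk: a minimal Python sketch of this setup, assuming a hypothetical least-squares loss (the talk treats $f$ abstractly; `sample_z` and `stochastic_gradient` are illustrative names reused in later sketches):

```python
import numpy as np

# Illustrative instance of the setup: f(x; z) = 0.5 * (a^T x - b)^2 with
# z = (a, b). Any convex, L-Lipschitz, H-smooth loss fits the same interface.

def sample_z(rng, dim):
    """Draw one sample z = (a, b) from an illustrative distribution D."""
    a = rng.standard_normal(dim)
    b = rng.standard_normal()
    return a, b

def stochastic_gradient(x, z):
    """Stochastic gradient oracle: return grad_x f(x; z)."""
    a, b = z
    return (a @ x - b) * a

rng = np.random.default_rng(0)
x = np.zeros(5)
g = stochastic_gradient(x, sample_z(rng, 5))  # one oracle call
```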

  3. What is the best we can hope for in a given parallelism scenario?

  4. What is the best we can hope for in a given parallelism scenario?
  • We formalize the parallelism in terms of a dependency graph. [Diagram: a dependency graph over nodes $u_1, \dots, u_{10}$ with $\mathrm{Ancestors}(u_9)$ highlighted]
  • At each node $u$, make a query based only on knowledge of the ancestors' oracle interactions (plus shared randomness); a sketch of this protocol follows below
  • The graph defines a class of optimization algorithms $\mathcal{A}(G)$
  • Come to our poster for details
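A hedged sketch of the protocol just described, assuming the graph is given as a dict from node to parent list in topological order (all names here are illustrative, not the paper's notation):

```python
# Illustrative sketch of the graph oracle model: an algorithm in A(G) issues
# one oracle query per node of a DAG G, and the query at node u may depend
# only on the oracle answers at u's ancestors (plus shared randomness).
# Assumes graph = {node: [parents]} listed in topological order.

def run_graph_oracle_algorithm(graph, query_rule, oracle, rng):
    answers = {}
    for u, parents in graph.items():
        # Collect all ancestors of u by walking parent pointers.
        ancestors, stack = set(), list(parents)
        while stack:
            v = stack.pop()
            if v not in ancestors:
                ancestors.add(v)
                stack.extend(graph[v])
        # The query at u sees only its ancestors' oracle interactions.
        known = {v: answers[v] for v in ancestors}
        answers[u] = oracle(query_rule(known, rng))
    return answers
```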

  5. Example dependency graphs (diagrams on the slide; constructions sketched below):
  • Sequential
  • Layer
  • Delays
  • Intermittent Communication
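As illustrations only, here is one plausible way to build these four graph families in the `{node: parents}` form used above; the exact constructions in the paper may differ:

```python
def sequential_graph(n):
    """A chain: query t sees all previous answers. Depth D = n."""
    return {t: ([t - 1] if t > 0 else []) for t in range(n)}

def layer_graph(rounds, width):
    """'rounds' layers of 'width' parallel queries; each layer sees the previous one."""
    g = {}
    for r in range(rounds):
        prev = [(r - 1, j) for j in range(width)] if r > 0 else []
        for i in range(width):
            g[(r, i)] = prev
    return g

def delay_graph(n, tau):
    """Delayed updates: query t may use all answers from queries 0..t-tau."""
    return {t: (list(range(t - tau + 1)) if t >= tau else []) for t in range(n)}

def intermittent_graph(T, K, M):
    """M machines, K sequential queries each per round, T communication rounds."""
    g = {}
    for t in range(T):
        sync = [(t - 1, m, K - 1) for m in range(M)] if t > 0 else []
        for m in range(M):
            for k in range(K):
                g[(t, m, k)] = [(t, m, k - 1)] if k > 0 else sync
    return g
```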

  6. Generic Lower Bounds
  Theorem: For any dependency graph $G$ with $N$ nodes and depth $D$, no algorithm in $\mathcal{A}(G)$ for optimizing a convex, $L$-Lipschitz, $H$-smooth $f(x; z)$ on a bounded domain in high dimensions can guarantee error less than:
  • With a stochastic gradient oracle: $\Omega\left( \min\left\{ \tfrac{L}{\sqrt{D}}, \tfrac{H}{D^2} \right\} + \tfrac{L}{\sqrt{N}} \right)$
  • With a stochastic prox oracle: $\Omega\left( \min\left\{ \tfrac{L}{D}, \tfrac{H}{D^2} \right\} + \tfrac{L}{\sqrt{N}} \right)$
  Prox oracle: $(x, \beta, z) \mapsto \arg\min_y f(y; z) + \tfrac{\beta}{2}\|y - x\|^2$, i.e., exactly optimize a subproblem at each node (as in ADMM, DANE, etc.)
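For concreteness, the prox oracle has a closed form for the least-squares example sketched earlier (that instance is an assumption; the talk defines the oracle for general $f$):

```python
# For f(y; (a, b)) = 0.5*(a^T y - b)^2, the prox step
# argmin_y f(y; z) + (beta/2)*||y - x||^2 solves (a a^T + beta I) y = beta*x + b*a,
# inverted here via the Sherman-Morrison formula.

def prox_oracle(x, beta, z):
    a, b = z                      # x, a: numpy arrays; b, beta: scalars
    v = beta * x + b * a
    return v / beta - a * (a @ v) / (beta * (beta + a @ a))
```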

  8. • Sequential: SGD is optimal
  • Layers: accelerated minibatch SGD is optimal
  • Delays: delayed-update SGD is not optimal; "wait-and-collect" minibatch SGD is optimal (sketched below)
  • Intermittent Communication: gaps between existing algorithms and the lower bound
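A hedged sketch of the wait-and-collect idea for the delay-$\tau$ setting, reusing the illustrative oracle names from above: rather than applying each stale gradient $\tau$ steps late, batch the $\tau$ gradients queried at a common iterate into one minibatch update:

```python
import numpy as np

# Illustrative "wait-and-collect" minibatch SGD: issue tau queries at the
# current iterate, wait for all answers to arrive, then apply them together.

def wait_and_collect_sgd(x0, stochastic_gradient, sample_z, rng, tau, rounds, lr):
    x = x0
    for _ in range(rounds):
        grads = [stochastic_gradient(x, sample_z(rng, x.size)) for _ in range(tau)]
        x = x - lr * np.mean(grads, axis=0)   # one collected minibatch update
    return x
```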

  9. Intermittent communication ($M$ machines, $K$ oracle queries per round, $T$ rounds of communication):
  • Lower bound: $\Omega\left( \min\left\{ \tfrac{L}{\sqrt{TK}}, \tfrac{H}{T^2 K^2} \right\} + \tfrac{L}{\sqrt{TKM}} \right)$
  • Option 1: Accelerated minibatch SGD: $O\left( \tfrac{H}{T^2} + \tfrac{L}{\sqrt{TKM}} \right)$
  • Option 2: Sequential SGD: $O\left( \tfrac{L}{\sqrt{TK}} \right)$
  • Option 3: SVRG on the empirical objective: $\tilde{O}\left( \tfrac{H}{TK} + \tfrac{L}{\sqrt{TKM}} \right)$
  • Best of A-MB-SGD, SGD, SVRG: $\tilde{O}\left( \min\left\{ \tfrac{L}{\sqrt{TK}}, \tfrac{H}{T^2}, \tfrac{H}{TK} \right\} + \tfrac{L}{\sqrt{TKM}} \right)$

  10. Option 1 illustrated. [Diagram: minibatch SGD rounds. In round $t$, the machines jointly compute $g_t = \tfrac{1}{MK} \sum_{i=1}^{MK} \nabla f(x_t; z_i)$ in parallel, then calculate the next iterate: minibatches #1-#4 produce $g_1, \dots, g_4$ and iterates $x_2, \dots, x_5$]
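A minimal sketch of one such round without the acceleration (Option 1 on the slide uses an accelerated update, which this illustrative version omits):

```python
import numpy as np

# One round of (unaccelerated) minibatch SGD matching the diagram: MK
# stochastic gradients at the shared iterate are averaged into g_t,
# then one update is applied.

def minibatch_sgd_round(x, stochastic_gradient, sample_z, rng, M, K, lr):
    g = np.zeros_like(x)
    for _ in range(M * K):        # in practice: K queries on each of M machines, in parallel
        g += stochastic_gradient(x, sample_z(rng, x.size))
    return x - lr * g / (M * K)   # g_t = (1/MK) * sum_i grad f(x_t; z_i)
```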

  12. Option 2 illustrated. [Diagram: $TK$ sequential SGD steps, using a single machine's queries]
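The corresponding sketch of Option 2, assuming the same illustrative oracle interface:

```python
# Option 2: plain sequential SGD using only one machine's TK oracle calls,
# ignoring the other M - 1 machines entirely.

def sequential_sgd(x0, stochastic_gradient, sample_z, rng, T, K, lr):
    x = x0
    for _ in range(T * K):        # T rounds of K sequential steps each
        x = x - lr * stochastic_gradient(x, sample_z(rng, x.size))
    return x
```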

  14. Option 3 illustrated. [Diagram: SVRG on the empirical objective. Calculate the full gradient in parallel, aggregate the full gradient, then take sequential variance-reduced updates]
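A hedged sketch of one outer SVRG round matching the diagram labels (the full-gradient pass, written serially here for brevity, is the part computed in parallel and then aggregated):

```python
import numpy as np

# One outer round of SVRG on the empirical objective over a fixed dataset.

def svrg_round(x_anchor, data, stochastic_gradient, rng, inner_steps, lr):
    # "Calculate full gradient in parallel" + "aggregate full gradient":
    full_grad = np.mean([stochastic_gradient(x_anchor, z) for z in data], axis=0)
    x = x_anchor
    for _ in range(inner_steps):  # "sequential variance-reduced updates"
        z = data[rng.integers(len(data))]
        g = stochastic_gradient(x, z) - stochastic_gradient(x_anchor, z) + full_grad
        x = x - lr * g
    return x
```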

  16. Combining options 1-3: $\tilde{O}\left( \min\left\{ \tfrac{L}{\sqrt{TK}}, \tfrac{H}{T^2}, \tfrac{H}{TK} \right\} + \tfrac{L}{\sqrt{TKM}} \right)$, i.e., run whichever option is best for the given $L, H, T, K, M$ (a selector is sketched below)
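The combination is just a pointwise minimum of the three guarantees; a toy selector, with constants and log factors dropped as the $O$/$\tilde{O}$ notation allows:

```python
import numpy as np

# Evaluate the three guarantees from the slide and pick the smallest.

def best_option(L, H, T, K, M):
    rates = {
        "accelerated minibatch SGD": H / T**2 + L / np.sqrt(T * K * M),
        "sequential SGD":            L / np.sqrt(T * K),
        "SVRG (empirical)":          H / (T * K) + L / np.sqrt(T * K * M),
    }
    return min(rates, key=rates.get)
```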

  18. • Option 4: Parallel SGD ??? (its guarantee in this model is open; an algorithm sketch follows)
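Assuming "parallel SGD" refers to local SGD with periodic averaging (a common reading; the slide leaves it at "???"), a sketch of the algorithm whose rate is the open question:

```python
import numpy as np

# Each of M machines takes K local SGD steps per round, and the local
# iterates are averaged at each of the T communication rounds.

def parallel_sgd(x0, stochastic_gradient, sample_z, rng, T, K, M, lr):
    x = x0
    for _ in range(T):
        local_iterates = []
        for _ in range(M):        # in practice: in parallel across machines
            xm = x.copy()
            for _ in range(K):
                xm = xm - lr * stochastic_gradient(xm, sample_z(rng, x.size))
            local_iterates.append(xm)
        x = np.mean(local_iterates, axis=0)   # communicate: average iterates
    return x
```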

  19. Come to our poster tonight from 5-7pm
