Graph Oracle Models, Lower Bounds, and Gaps for Parallel Stochastic Optimization
Jialei Wang (UChicago → Two Sigma Investments), Blake Woodworth (TTIC), Adam Smith (Boston University), Nati Srebro (TTIC), H. Brendan McMahan (Google)
Parallel Stochastic Optimization/Learning: $\min_x F(x) := \mathbb{E}_{z \sim \mathcal{D}}[f(x; z)]$ (a toy instantiation of this setup is sketched below)
Many parallelization scenarios:
• Synchronous parallelism
• Asynchronous parallelism
• Delayed updates
• Few/many workers
• Infrequent communication
• Federated learning
• …
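To make the setting concrete, here is a minimal Python sketch of a stochastic first-order oracle for the objective above. The least-squares choice of $f$, the function names, and the dimension are illustrative assumptions, not part of the paper.

```python
import numpy as np

# Illustrative instance of F(x) = E_z[f(x; z)]: each sample z = (a, b) and
# f(x; z) = 0.5 * (a @ x - b)**2 (a toy least-squares loss, assumed here).
rng = np.random.default_rng(0)

def sample_z(d=10):
    """Draw one sample z ~ D (synthetic data, for illustration only)."""
    a = rng.normal(size=d)
    b = a @ np.ones(d) + 0.1 * rng.normal()
    return a, b

def stochastic_gradient(x, z):
    """Stochastic gradient oracle: returns grad_x f(x; z), an unbiased
    estimate of grad F(x)."""
    a, b = z
    return (a @ x - b) * a
```

Later sketches reuse this oracle interface: a callable of the form `stochastic_gradient(x, z)` together with `sample_z()`.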
What is the best we can hope for in a given parallelism scenario?
• We formalize the parallelism in terms of a dependency graph $G$
[Figure: example dependency graph on nodes $u_1, \dots, u_{10}$, with $\mathrm{Ancestors}(u_9)$ highlighted]
• At each node $u$, the algorithm makes an oracle query based only on knowledge of the ancestors' oracle interactions (plus shared randomness)
• The graph defines a class of optimization algorithms $\mathcal{A}(G)$
• Come to our poster for details (a sketch of this protocol appears below)
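A minimal Python sketch of the graph oracle protocol as I read it: the graph is given as an ancestor map, and each node's query may depend only on the oracle interactions of its ancestors. The function names and interfaces are assumptions made for illustration.

```python
# Sketch of the graph-based algorithm class A(G).  `ancestors` encodes the
# dependency graph; `query_rule` is the algorithm; `oracle` answers queries.

def run_graph_algorithm(ancestors, query_rule, oracle, seed=0):
    """ancestors:  dict node -> list of ancestor nodes (keys in topological order).
    query_rule: (node, visible_interactions, seed) -> query point x.
    oracle:     query point x -> stochastic oracle response (e.g. a gradient)."""
    interactions = {}
    for u in ancestors:
        # Node u may only see its ancestors' queries and oracle responses.
        visible = {v: interactions[v] for v in ancestors[u]}
        x_u = query_rule(u, visible, seed)
        interactions[u] = (x_u, oracle(x_u))
    return interactions
```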
Example dependency-graph structures (one per parallelism scenario; diagrams on the slide, illustrative constructions sketched below):
• Sequential
• Layers
• Delays
• Intermittent Communication
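The slide shows these four graphs only as diagrams; the Python below is my reconstruction of the corresponding ancestor maps, compatible with `run_graph_algorithm` above. The exact definitions are in the paper, so treat these as illustrative assumptions.

```python
def sequential_graph(N):
    """Chain: node t sees all previous oracle interactions."""
    return {t: list(range(t)) for t in range(N)}

def layer_graph(T, M):
    """T layers of M parallel nodes; a node sees every earlier layer."""
    return {(t, m): [(s, j) for s in range(t) for j in range(M)]
            for t in range(T) for m in range(M)}

def delay_graph(N, tau):
    """Delayed updates: node t only sees interactions at least tau steps old."""
    return {t: list(range(max(0, t - tau))) for t in range(N)}

def intermittent_graph(T, K, M):
    """M machines, K sequential local steps per round, T communication rounds:
    a node sees all earlier rounds plus its own machine's earlier steps in
    the current round."""
    anc = {}
    for t in range(T):
        for k in range(K):
            for m in range(M):
                past = [(s, j, i) for s in range(t) for j in range(K) for i in range(M)]
                local = [(t, j, m) for j in range(k)]
                anc[(t, k, m)] = past + local
    return anc
```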
Generic Lower Bounds
Theorem: For any dependency graph $G$ with $N$ nodes and depth $D$, no algorithm for optimizing convex, $L$-Lipschitz, $H$-smooth $f(x; z)$ on a bounded domain in high dimensions can guarantee error less than:
• With a stochastic gradient oracle: $\Omega\!\left( \min\left\{ \frac{L}{\sqrt{D}}, \frac{H}{D^2} \right\} + \frac{L}{\sqrt{N}} \right)$
• With a stochastic prox oracle: $\Omega\!\left( \min\left\{ \frac{L}{D}, \frac{H}{D^2} \right\} + \frac{L}{\sqrt{N}} \right)$
Prox oracle: $(x, \beta, z) \mapsto \arg\min_y \; f(y; z) + \frac{\beta}{2}\|y - x\|^2$, i.e., exactly optimize a subproblem at each node (as in ADMM, DANE, etc.)
(Instantiations of the gradient-oracle bound for specific graphs are sketched below.)
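To connect the theorem to the specific graphs above, here is how I would instantiate the gradient-oracle bound; the depth and node counts per scenario are my reading of the constructions, so take the bookkeeping as an assumption.

```latex
% Instantiating \Omega( \min\{ L/\sqrt{D},\, H/D^2 \} + L/\sqrt{N} )
\begin{align*}
\text{Sequential } (D = N):\quad
  &\Omega\!\Bigl(\min\Bigl\{\tfrac{L}{\sqrt{N}},\tfrac{H}{N^2}\Bigr\} + \tfrac{L}{\sqrt{N}}\Bigr)
   = \Omega\!\Bigl(\tfrac{L}{\sqrt{N}}\Bigr)\\
\text{Layers } (D = T,\ N = TM):\quad
  &\Omega\!\Bigl(\min\Bigl\{\tfrac{L}{\sqrt{T}},\tfrac{H}{T^2}\Bigr\} + \tfrac{L}{\sqrt{TM}}\Bigr)\\
\text{Intermittent comm. } (D = TK,\ N = TKM):\quad
  &\Omega\!\Bigl(\min\Bigl\{\tfrac{L}{\sqrt{TK}},\tfrac{H}{T^2K^2}\Bigr\} + \tfrac{L}{\sqrt{TKM}}\Bigr)
\end{align*}
```

The first line matches the sequential SGD rate, and the last matches the intermittent-communication lower bound quoted later.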
• Sequential: SGD is optimal
• Layers: Accelerated minibatch SGD is optimal
• Delays: Delayed-update SGD is not optimal; "Wait-and-Collect" minibatch SGD is optimal (one reading is sketched below)
• Intermittent Communication: gaps between existing algorithms and the lower bound
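The slide only names "Wait-and-Collect"; my reading is that, rather than applying stale gradients, the algorithm keeps querying the same iterate until the delayed responses have been collected and then applies them as one minibatch step. The sketch below is a serial simulation of that idea under that assumption; step size, names, and the oracle interface are illustrative.

```python
import numpy as np

def wait_and_collect_sgd(x0, stochastic_gradient, sample_z, T, tau, lr):
    """Serial simulation of a 'wait-and-collect' scheme under delay tau:
    query the same point tau times, then take a single minibatch step with
    the average of those gradients (instead of stale delayed updates)."""
    x = np.array(x0, dtype=float)
    pending = []
    for t in range(T):
        pending.append(stochastic_gradient(x, sample_z()))  # query at current x
        if len(pending) == tau:                              # all tau responses in
            x = x - lr * np.mean(pending, axis=0)
            pending = []
    return x
```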
Intermittent Communication ($M$ machines, $K$ sequential local computations per machine per round, $T$ rounds of communication):
• Lower bound: $\Omega\!\left( \min\left\{ \frac{L}{\sqrt{TK}}, \frac{H}{T^2 K^2} \right\} + \frac{L}{\sqrt{TKM}} \right)$
• Option 1: Accelerated Minibatch SGD: $O\!\left( \frac{H}{T^2} + \frac{L}{\sqrt{TKM}} \right)$
• Option 2: Sequential SGD: $O\!\left( \frac{L}{\sqrt{TK}} \right)$
• Option 3: SVRG on the empirical objective: $\tilde{O}\!\left( \frac{H}{TK} + \frac{L}{\sqrt{TKM}} \right)$
• Best of A-MB-SGD, SGD, and SVRG: $\tilde{O}\!\left( \min\left\{ \frac{L}{\sqrt{TK}}, \frac{H}{TK}, \frac{H}{T^2} \right\} + \frac{L}{\sqrt{TKM}} \right)$
(The remaining gap between the lower bound and this combined upper bound is made explicit below.)
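Comparing the two displayed rates: the statistical terms $L/\sqrt{TKM}$ coincide, so, as I read it, the open gap is entirely in the optimization terms:

```latex
\underbrace{\min\Bigl\{\tfrac{L}{\sqrt{TK}},\ \tfrac{H}{T^2K^2}\Bigr\}}_{\text{lower bound}}
\quad\text{vs.}\quad
\underbrace{\min\Bigl\{\tfrac{L}{\sqrt{TK}},\ \tfrac{H}{TK},\ \tfrac{H}{T^2}\Bigr\}}_{\text{best known upper bound}}
```

When the smoothness terms dominate, the lower bound permits $H/(T^2K^2)$ while the best known algorithms only achieve $\min\{H/(TK),\, H/T^2\}$.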
[Figure: Option 1 (Accelerated Minibatch SGD) — in each round, all $M$ machines evaluate $K$ stochastic gradients at the same iterate, averaged into a minibatch gradient $g_t = \frac{1}{MK}\sum_{i=1}^{MK} \nabla f(x_t; z_i)$ that is used to calculate $x_{t+1}$; the slide shows four such rounds (Minibatch #1–#4, calculating $x_2$ through $x_5$).]
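For concreteness, here is a serial simulation of the minibatch structure in that figure. It implements plain (unaccelerated) minibatch SGD; the accelerated variant in Option 1 additionally uses momentum, which I omit. Step size, the oracle interface, and names are illustrative assumptions.

```python
import numpy as np

def minibatch_sgd(x0, stochastic_gradient, sample_z, T, K, M, lr):
    """Serial simulation of minibatch SGD over T communication rounds:
    each round, the M machines jointly contribute M*K stochastic gradients
    at the same iterate x_t, which are averaged before a single update."""
    x = np.array(x0, dtype=float)
    for t in range(T):
        g_t = np.mean([stochastic_gradient(x, sample_z()) for _ in range(M * K)], axis=0)
        x = x - lr * g_t
    return x
```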
[Figure: Option 2 (Sequential SGD) — sequential SGD steps, one per local computation along a single chain.]
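A corresponding sketch of Option 2 as I understand it: ignore the parallelism and take $TK$ sequential SGD steps. Names and step size are illustrative.

```python
import numpy as np

def sequential_sgd(x0, stochastic_gradient, sample_z, T, K, lr):
    """Option 2 sketch: T*K plain sequential SGD steps, one per local
    computation, ignoring the other machines."""
    x = np.array(x0, dtype=float)
    for _ in range(T * K):
        x = x - lr * stochastic_gradient(x, sample_z())
    return x
```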
[Figure: Option 3 (SVRG on the empirical objective) — calculate the full gradient in parallel, aggregate the full gradient, then take sequential variance-reduced updates.]
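A serial sketch of the SVRG structure in that figure: in the parallel version the full gradient at the reference point would be computed across the machines and aggregated; here it is an ordinary loop. The empirical dataset, step size, and loop counts are illustrative assumptions.

```python
import numpy as np

def svrg_empirical(x0, grad_f, data, outer_rounds, inner_steps, lr, seed=0):
    """SVRG on the empirical objective (1/n) * sum_i f(x; z_i).
    grad_f(x, z) returns the per-sample gradient."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(outer_rounds):
        x_ref = x.copy()
        # Full gradient at the reference point (parallelizable across machines).
        full_grad = np.mean([grad_f(x_ref, z) for z in data], axis=0)
        for _ in range(inner_steps):
            z = data[rng.integers(len(data))]
            # Variance-reduced stochastic direction.
            v = grad_f(x, z) - grad_f(x_ref, z) + full_grad
            x = x - lr * v
    return x
```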
• Option 4: Parallel SGD ??? (one natural reading is sketched below)
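The slide leaves Option 4 open. For reference, here is a serial simulation of what I take "parallel SGD" to mean in this setting (local SGD with averaging at each communication round); this is only an illustrative sketch under that assumption, and no rate is claimed.

```python
import numpy as np

def parallel_local_sgd(x0, stochastic_gradient, sample_z, T, K, M, lr):
    """Serial simulation of parallel (local) SGD: each round, every machine
    runs K local SGD steps from the current averaged iterate, and the M
    local iterates are then averaged."""
    x = np.array(x0, dtype=float)
    for t in range(T):
        local_iterates = []
        for m in range(M):
            x_m = x.copy()
            for k in range(K):
                x_m = x_m - lr * stochastic_gradient(x_m, sample_z())
            local_iterates.append(x_m)
        x = np.mean(local_iterates, axis=0)
    return x
```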
Come to our poster tonight from 5-7pm