Graph Oracle Models, Lower Bounds, and Gaps for Parallel Stochastic Optimization

Jialei Wang (UChicago → Two Sigma Investments), Blake Woodworth (TTIC), Adam Smith (Boston University), Nati Srebro (TTIC), H. Brendan McMahan (Google)
Parallel Stochastic Optimization/Learning

min_x F(x) := E_{z∼D}[ f(x; z) ]

Many parallelization scenarios:
• Synchronous parallelism
• Asynchronous parallelism
• Delayed updates
• Few/many workers
• Infrequent communication
• Federated learning
• …
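As a concrete toy instance of this objective (an illustrative choice, not an example from the slides), here is a minimal sequential SGD sketch for F(x) = E_z[(x − z)²/2] with z ∼ N(3, 1); the step size and constants are assumptions for the sketch.

```python
import random

def sgd(sample_z, x0=0.0, steps=2000, lr=0.05):
    # Sequential SGD on F(x) = E_z[f(x; z)] with f(x; z) = (x - z)^2 / 2.
    # The stochastic gradient is grad f(x; z) = x - z, so the iterates
    # drift toward the mean of the data distribution D.
    x = x0
    for t in range(1, steps + 1):
        z = sample_z()
        x -= (lr / t ** 0.5) * (x - z)  # standard 1/sqrt(t) step size
    return x

random.seed(0)
x_hat = sgd(lambda: random.gauss(3.0, 1.0))  # D = N(3, 1), minimizer is x = 3
```

With these choices the iterate lands close to the population minimizer x = 3, up to stochastic-gradient noise.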
What is the best we can hope for in a given parallelism scenario?
• We formalize the parallelism in terms of a dependency graph G
(figure: an example DAG on nodes u_1, …, u_10, highlighting Ancestors(u_9))
• At each node u, the algorithm makes a query based only on the oracle interactions at u's ancestors (plus shared randomness)
• The graph defines a class of optimization algorithms A(G)
• Come to our poster for details
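The dependency-graph formalism can be sketched in a few lines. The DAG below is a hypothetical two-machine layer graph, not the one drawn on the slide (which is not recoverable from the text).

```python
def ancestor_sets(dag):
    # dag: list of (node, parents) pairs in topological order.
    # In the graph oracle model, the query issued at node u may depend
    # only on the oracle answers at Ancestors(u), plus shared randomness.
    anc = {}
    for u, parents in dag:
        anc[u] = set(parents)
        for p in parents:
            anc[u] |= anc[p]
    return anc

def depth(dag):
    # Depth D = number of nodes on the longest directed path; it plays
    # the role of the number of sequential rounds in the lower bounds.
    d = {}
    for u, parents in dag:
        d[u] = 1 + max((d[p] for p in parents), default=0)
    return max(d.values())

# Hypothetical layer graph with N = 5 nodes and depth D = 3:
g = [("u1", []), ("u2", []),
     ("u3", ["u1", "u2"]), ("u4", ["u1", "u2"]),
     ("u5", ["u3", "u4"])]
```

Here `ancestor_sets(g)["u5"]` is all four earlier nodes, and `depth(g)` is 3.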
Example dependency graphs:
• Sequential
• Layer
• Delays
• Intermittent Communication
Generic Lower Bounds

Theorem: For any dependency graph G with N nodes and depth D, no algorithm for optimizing a convex, L-Lipschitz, H-smooth f(x; z) on a bounded domain in high dimensions can guarantee error less than:

With stochastic gradient oracle:
Ω( min{ L/√D, H/D² } + L/√N )

With stochastic prox oracle:
Ω( min{ L/D, H/D² } + L/√N )

Prox oracle: (x, β, z) ↦ argmin_y f(y; z) + (β/2)‖y − x‖²
i.e. exactly optimize a subproblem at each node (as in ADMM, DANE, etc.)
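For intuition, the prox oracle has a closed form for the illustrative scalar choice f(y; z) = (y − z)²/2 (an assumption for this sketch, not a case from the talk):

```python
def prox_oracle(x, beta, z):
    # argmin_y f(y; z) + (beta/2) * (y - x)^2  for f(y; z) = (y - z)^2 / 2.
    # Setting the derivative (y - z) + beta * (y - x) to zero gives:
    return (z + beta * x) / (1.0 + beta)
```

As β → ∞ the oracle returns x itself (it barely moves), and as β → 0 it exactly minimizes f(·; z); this is the sense in which each node "exactly optimizes a subproblem".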
• Sequential: SGD is optimal
• Layers: Accelerated minibatch SGD is optimal
• Delays: Delayed-update SGD is not optimal; “Wait-and-Collect” minibatch SGD is optimal
• Intermittent Communication: gaps remain between existing algorithms and the lower bound
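The “Wait-and-Collect” idea for the delay setting can be sketched as follows (function names and constants here are illustrative): hold the query point fixed for τ consecutive oracle calls, then apply their averaged gradient as one minibatch step.

```python
def wait_and_collect_sgd(x0, samples, grad_f, tau, lr):
    # Hold x fixed while collecting tau stochastic gradients, then take
    # a single averaged step: minibatch SGD with batch size tau, which
    # sidesteps the staleness of plain delayed-update SGD.
    x, batch = x0, []
    for z in samples:
        batch.append(grad_f(x, z))
        if len(batch) == tau:
            x -= lr * sum(batch) / tau
            batch = []
    return x
```

Every gradient in a batch is evaluated at the current x, so no update ever uses a stale iterate.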
Intermittent Communication (M machines, K sequential oracle calls per machine per round, T rounds):

• Lower bound: Ω( min{ L/√(TK), H/(T²K²) } + L/√(TKM) )
• Option 1: Accelerated Minibatch SGD: O( H/T² + L/√(TKM) )
• Option 2: Sequential SGD: O( L/√(TK) )
• Option 3: SVRG on the empirical objective: Õ( H/(TK) + L/√(TKM) )
• Best of A-MB-SGD, SGD, SVRG: Õ( min{ L/√(TK), H/T², H/(TK) } + L/√(TKM) )
Accelerated Minibatch SGD (schematic): in round t, all MK samples are used to form one minibatch gradient g_t = (1/MK) Σ_{i=1}^{MK} ∇f(x_t; z_i), which is used to calculate the next iterate x_{t+1}.
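One round of this schematic can be sketched as follows (helper names and the update rule are illustrative; the accelerated variant additionally uses Nesterov momentum, omitted here):

```python
def minibatch_sgd_round(x, per_machine_samples, grad_f, lr):
    # per_machine_samples: M lists of K samples, one list per machine.
    # All MK gradients are evaluated at the same iterate x and averaged,
    # so each update costs exactly one communication round.
    grads = [grad_f(x, z) for zs in per_machine_samples for z in zs]
    return x - lr * sum(grads) / len(grads)
```

Averaging over MK samples shrinks the gradient variance by a factor of MK, which is where the L/√(TKM) term comes from.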
Sequential SGD (schematic): TK sequential SGD steps carried out on a single machine.
SVRG on the empirical objective (schematic): calculate the full gradient in parallel across machines, aggregate it, then perform sequential variance-reduced updates.
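The SVRG schematic can be sketched as one epoch (the quadratic gradient and constants below are illustrative assumptions): a full gradient at an anchor point, parallelizable across machines, followed by sequential variance-reduced steps.

```python
import random

def svrg_epoch(x0, data, grad_f, lr, inner_steps, rng):
    # Full gradient at the anchor x0: each machine can compute its share
    # of this sum in parallel before the pieces are aggregated.
    mu = sum(grad_f(x0, z) for z in data) / len(data)
    x = x0
    for _ in range(inner_steps):
        z = rng.choice(data)
        # Variance-reduced gradient: unbiased for the empirical objective,
        # with variance that vanishes as x approaches the anchor/optimum.
        g = grad_f(x, z) - grad_f(x0, z) + mu
        x -= lr * g
    return x

# For grad_f(x, z) = x - z the correction is exact, so the iterates
# contract geometrically toward the empirical mean of the data:
out = svrg_epoch(0.0, [1.0, 3.0], lambda x, z: x - z, 0.5, 20, random.Random(0))
```

Here `out` is within 1e-4 of the empirical minimizer 2.0 after a single epoch.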
• Option 4: Parallel SGD: ???
Come to our poster tonight from 5-7pm