Graph Oracle Models, Lower Bounds, and Gaps for Parallel Stochastic Optimization

Jialei Wang (UChicago → Two Sigma Investments), Blake Woodworth (TTIC), Adam Smith (Boston University), Nati Srebro (TTIC), H. Brendan McMahan (Google)
Parallel Stochastic Optimization/Learning

min_x F(x) := E_{z∼D}[ f(x; z) ]

Many parallelization scenarios:
• Synchronous parallelism
• Asynchronous parallelism
• Delayed updates
• Few/many workers
• Infrequent communication
• Federated learning
• …
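As a concrete toy instance of this objective (an illustrative choice, not an example from the slides), here is a minimal sequential SGD sketch for F(x) = E_z[(x − z)²/2] with z ∼ N(3, 1); the step size and constants are assumptions for the sketch.

```python
import random

def sgd(sample_z, x0=0.0, steps=2000, lr=0.05):
    # Sequential SGD on F(x) = E_z[f(x; z)] with f(x; z) = (x - z)^2 / 2.
    # The stochastic gradient is grad f(x; z) = x - z, so the iterates
    # drift toward the mean of the data distribution D.
    x = x0
    for t in range(1, steps + 1):
        z = sample_z()
        x -= (lr / t ** 0.5) * (x - z)  # standard 1/sqrt(t) step size
    return x

random.seed(0)
x_hat = sgd(lambda: random.gauss(3.0, 1.0))  # D = N(3, 1), minimizer is x = 3
```

With these choices the iterate lands close to the population minimizer x = 3, up to stochastic-gradient noise.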
What is the best we can hope for in a given parallelism scenario?
• We formalize the parallelism in terms of a dependency graph G
(figure: an example DAG on nodes u_1, …, u_10, highlighting Ancestors(u_9))
• At each node u, the algorithm makes a query based only on the oracle interactions at u's ancestors (plus shared randomness)
• The graph defines a class of optimization algorithms A(G)
• Come to our poster for details
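The dependency-graph formalism can be sketched in a few lines. The DAG below is a hypothetical two-machine layer graph, not the one drawn on the slide (which is not recoverable from the text).

```python
def ancestor_sets(dag):
    # dag: list of (node, parents) pairs in topological order.
    # In the graph oracle model, the query issued at node u may depend
    # only on the oracle answers at Ancestors(u), plus shared randomness.
    anc = {}
    for u, parents in dag:
        anc[u] = set(parents)
        for p in parents:
            anc[u] |= anc[p]
    return anc

def depth(dag):
    # Depth D = number of nodes on the longest directed path; it plays
    # the role of the number of sequential rounds in the lower bounds.
    d = {}
    for u, parents in dag:
        d[u] = 1 + max((d[p] for p in parents), default=0)
    return max(d.values())

# Hypothetical layer graph with N = 5 nodes and depth D = 3:
g = [("u1", []), ("u2", []),
     ("u3", ["u1", "u2"]), ("u4", ["u1", "u2"]),
     ("u5", ["u3", "u4"])]
```

Here `ancestor_sets(g)["u5"]` is all four earlier nodes, and `depth(g)` is 3.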
Example dependency graphs:
• Sequential
• Layer
• Delays
• Intermittent Communication
Generic Lower Bounds

Theorem: For any dependency graph G with N nodes and depth D, no algorithm for optimizing a convex, L-Lipschitz, H-smooth f(x; z) on a bounded domain in high dimensions can guarantee error less than:

With stochastic gradient oracle:
Ω( min{ L/√D, H/D² } + L/√N )

With stochastic prox oracle:
Ω( min{ L/D, H/D² } + L/√N )

Prox oracle: (x, β, z) ↦ argmin_y f(y; z) + (β/2)‖y − x‖²
i.e. exactly optimize a subproblem at each node (as in ADMM, DANE, etc.)
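For intuition, the prox oracle has a closed form for the illustrative scalar choice f(y; z) = (y − z)²/2 (an assumption for this sketch, not a case from the talk):

```python
def prox_oracle(x, beta, z):
    # argmin_y f(y; z) + (beta/2) * (y - x)^2  for f(y; z) = (y - z)^2 / 2.
    # Setting the derivative (y - z) + beta * (y - x) to zero gives:
    return (z + beta * x) / (1.0 + beta)
```

As β → ∞ the oracle returns x itself (it barely moves), and as β → 0 it exactly minimizes f(·; z); this is the sense in which each node "exactly optimizes a subproblem".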
• Sequential: SGD is optimal
• Layers: Accelerated minibatch SGD is optimal
• Delays: Delayed-update SGD is not optimal; “Wait-and-Collect” minibatch SGD is optimal
• Intermittent Communication: gaps remain between existing algorithms and the lower bound
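The “Wait-and-Collect” idea for the delay setting can be sketched as follows (function names and constants here are illustrative): hold the query point fixed for τ consecutive oracle calls, then apply their averaged gradient as one minibatch step.

```python
def wait_and_collect_sgd(x0, samples, grad_f, tau, lr):
    # Hold x fixed while collecting tau stochastic gradients, then take
    # a single averaged step: minibatch SGD with batch size tau, which
    # sidesteps the staleness of plain delayed-update SGD.
    x, batch = x0, []
    for z in samples:
        batch.append(grad_f(x, z))
        if len(batch) == tau:
            x -= lr * sum(batch) / tau
            batch = []
    return x
```

Every gradient in a batch is evaluated at the current x, so no update ever uses a stale iterate.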
Intermittent Communication (M machines, K sequential oracle calls per machine per round, T rounds):

• Lower bound: Ω( min{ L/√(TK), H/(T²K²) } + L/√(TKM) )
• Option 1: Accelerated Minibatch SGD: O( H/T² + L/√(TKM) )
• Option 2: Sequential SGD: O( L/√(TK) )
• Option 3: SVRG on the empirical objective: Õ( H/(TK) + L/√(TKM) )
• Best of A-MB-SGD, SGD, SVRG: Õ( min{ L/√(TK), H/T², H/(TK) } + L/√(TKM) )
Accelerated Minibatch SGD (schematic): in round t, all MK samples are used to form one minibatch gradient g_t = (1/MK) Σ_{i=1}^{MK} ∇f(x_t; z_i), which is used to calculate the next iterate x_{t+1}.
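One round of this schematic can be sketched as follows (helper names and the update rule are illustrative; the accelerated variant additionally uses Nesterov momentum, omitted here):

```python
def minibatch_sgd_round(x, per_machine_samples, grad_f, lr):
    # per_machine_samples: M lists of K samples, one list per machine.
    # All MK gradients are evaluated at the same iterate x and averaged,
    # so each update costs exactly one communication round.
    grads = [grad_f(x, z) for zs in per_machine_samples for z in zs]
    return x - lr * sum(grads) / len(grads)
```

Averaging over MK samples shrinks the gradient variance by a factor of MK, which is where the L/√(TKM) term comes from.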
Sequential SGD (schematic): TK sequential SGD steps carried out on a single machine.
SVRG on the empirical objective (schematic): calculate the full gradient in parallel across machines, aggregate it, then perform sequential variance-reduced updates.
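The SVRG schematic can be sketched as one epoch (the quadratic gradient and constants below are illustrative assumptions): a full gradient at an anchor point, parallelizable across machines, followed by sequential variance-reduced steps.

```python
import random

def svrg_epoch(x0, data, grad_f, lr, inner_steps, rng):
    # Full gradient at the anchor x0: each machine can compute its share
    # of this sum in parallel before the pieces are aggregated.
    mu = sum(grad_f(x0, z) for z in data) / len(data)
    x = x0
    for _ in range(inner_steps):
        z = rng.choice(data)
        # Variance-reduced gradient: unbiased for the empirical objective,
        # with variance that vanishes as x approaches the anchor/optimum.
        g = grad_f(x, z) - grad_f(x0, z) + mu
        x -= lr * g
    return x

# For grad_f(x, z) = x - z the correction is exact, so the iterates
# contract geometrically toward the empirical mean of the data:
out = svrg_epoch(0.0, [1.0, 3.0], lambda x, z: x - z, 0.5, 20, random.Random(0))
```

Here `out` is within 1e-4 of the empirical minimizer 2.0 after a single epoch.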
• Option 4: Parallel SGD: ???
Come to our poster tonight from 5-7pm