  1. signSGD: Compressed optimisation for non-convex problems
 Jeremy Bernstein (Caltech), Yu-Xiang Wang (UCSB/Amazon), Kamyar Azizzadenesheli (UCI), Anima Anandkumar (Caltech/Amazon)

  2. signSGD: Compressed optimisation for non-convex problems. Snap gradient components to ±1. Reduces communication time. Realistic for deep learning.
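To make "snap gradient components to ±1" concrete, here is a minimal single-worker signSGD sketch in NumPy; the learning rate, toy quadratic objective, and noise level are illustrative assumptions, not values from the talk.

```python
import numpy as np

def signsgd_step(w, grad, lr=0.01):
    """One signSGD step: move every coordinate by -lr times the sign of its gradient."""
    return w - lr * np.sign(grad)

# Toy usage on f(w) = 0.5 * ||w||^2 with additive gradient noise.
rng = np.random.default_rng(0)
w = rng.standard_normal(10)
for _ in range(500):
    g = w + 0.1 * rng.standard_normal(w.shape)  # noisy stochastic gradient of f
    w = signsgd_step(w, g, lr=0.01)
print(np.linalg.norm(w))  # ends up small: sign-only updates still head towards the minimum
```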

  3. STRUCTURE ➤ Why care about signSGD? ➤ Theoretical convergence results ➤ Empirical characterisation of neural net landscape ➤ ImageNet results

  4. GRADIENT COMPRESSION... WHY CARE? [Diagram: a parameter server connected to GPU 1 and GPU 2, each holding 1/2 of the data]

  5. GRADIENT COMPRESSION... WHY CARE? [Diagram: a parameter server connected to many GPUs (GPU 1 to GPU 7); could the gradients exchanged over each link be compressed?]

  6. DISTRIBUTED SGD [Diagram: GPU 1 and GPU 2, each with 1/2 of the data, send stochastic gradients g to the parameter server, which aggregates them as ∑ g]

  7. SIGNSGD WITH MAJORITY VOTE [Diagram: GPU 1 and GPU 2, each with 1/2 of the data, send sign(g) to the parameter server, which returns the elementwise majority vote sign[∑ sign(g)]]
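A minimal sketch of the majority vote scheme on this slide, in NumPy; the number of workers, noise model, and toy objective are illustrative assumptions rather than the talk's setup.

```python
import numpy as np

def majority_vote_step(w, worker_grads, lr=0.01):
    """Each worker sends sign(g); the server steps by the sign of the summed signs."""
    worker_signs = [np.sign(g) for g in worker_grads]  # 1 bit per coordinate per worker
    vote = np.sign(np.sum(worker_signs, axis=0))       # elementwise majority vote
    return w - lr * vote                               # server broadcasts the 1-bit vote

# Toy usage: M workers each see the gradient of f(w) = 0.5 * ||w||^2 plus noise.
rng = np.random.default_rng(1)
M, d = 7, 10
w = rng.standard_normal(d)
for _ in range(500):
    grads = [w + 0.5 * rng.standard_normal(d) for _ in range(M)]
    w = majority_vote_step(w, grads, lr=0.01)
print(np.linalg.norm(w))  # the vote of the M noisy workers still drives w towards 0
```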

  8. COMPRESSION SAVINGS OF MAJORITY VOTE [Bar chart: # bits sent per component per iteration; roughly 32 bits (a full float) for SGD versus 1 bit for majority vote]

  9. SIGNSGD IS A SPECIAL CASE OF ADAM [Diagram relating Adam, signSGD, and Signum (sign momentum)]
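To make the "special case" relation concrete: Adam keeps exponential moving averages m of the gradient and v of the squared gradient, and steps by m/(√v + ε). In the limiting case β₁ = β₂ = 0 and ε = 0 (my hedged rendering of the slide's claim, not its exact notation), the update collapses to the sign update:
 \[ m_t = g_t,\quad v_t = g_t^2 \;\Longrightarrow\; w_{t+1} = w_t - \eta\,\frac{m_t}{\sqrt{v_t}} = w_t - \eta\,\frac{g_t}{|g_t|} = w_t - \eta\,\operatorname{sign}(g_t), \]
 i.e. signSGD. Keeping a momentum average of the gradient and stepping by its sign gives Signum, the sign-momentum variant named on the slide.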

  10. ADAM... WHY CARE? [Bar chart: # of Google Scholar citations for Adam (Kingma & Ba), SGD (Robbins & Monro), and the Turing test (Turing); y-axis from 0 to 11,000]

  11. UNIFYING ADAPTIVE GRADIENT METHODS + COMPRESSION ➤ Sign descent: weak theoretical foundation; incredibly popular (e.g. Adam); empirically successful ➤ Compressed descent: weak theoretical foundation; take pains to correct bias ➤ Sign-based gradient compression? Need to build a theory

  12. STRUCTURE ➤ Why care about signSGD? ➤ Theoretical convergence results ➤ Empirical characterisation of neural net landscape ➤ ImageNet results

  13. DOES SIGNSGD EVEN CONVERGE? What might we fear? ➤ Might not converge at all ➤ Might have horrible dimension dependence ➤ Majority vote may give no speedup by adding extra machines. Our results: ➤ It does converge ➤ We characterise functions where signSGD & majority vote are as nice as SGD ➤ Suggest these functions are typical in deep learning ➤ Compression can be a free lunch

  14. SINGLE WORKER RESULTS
 Assumptions: ➤ objective function lower bound f* ➤ coordinate-wise variance bound σ⃗ ➤ coordinate-wise gradient Lipschitz constant L⃗. Define: ➤ number of iterations K ➤ number of backpropagations N.
 SGD gets rate
 \[ \mathbb{E}\Big[\frac{1}{K}\sum_{k=0}^{K-1}\|g_k\|_2^2\Big] \;\le\; \frac{1}{\sqrt{N}}\Big[\|\vec{L}\|_\infty\,(f_0 - f_*) + \|\vec{\sigma}\|_2^2\Big] \]
 signSGD gets rate
 \[ \mathbb{E}\Big[\frac{1}{K}\sum_{k=0}^{K-1}\|g_k\|_1\Big]^2 \;\le\; \frac{1}{\sqrt{N}}\Big[\sqrt{\|\vec{L}\|_1}\,\Big(f_0 - f_* + \tfrac{1}{2}\Big) + 2\,\|\vec{\sigma}\|_1\Big]^2 \]

  15. SINGLE WORKER RESULTS (dimension dependence)
 Same assumptions and definitions as the previous slide. For fully dense d-dimensional vectors, $\|\cdot\|_1 \approx \sqrt{d}\,\|\cdot\|_2$ and $\|\vec{L}\|_1 \approx d\,\|\vec{L}\|_\infty$, so the signSGD rate can be rewritten with the dimension made explicit:
 \[ \mathbb{E}\Big[\frac{1}{K}\sum_{k=0}^{K-1}\sqrt{d}\,\|g_k\|_2\Big]^2 \;\le\; \frac{1}{\sqrt{N}}\Big[\sqrt{d\,\|\vec{L}\|_\infty}\,\Big(f_0 - f_* + \tfrac{1}{2}\Big) + 2\sqrt{d}\,\|\vec{\sigma}\|_2\Big]^2 \]
 The factors of d on the two sides cancel, so in the dense regime the bound has the same dimension dependence as the SGD rate above.

  16. MULTI-WORKER RESULTS (with M workers)
 Assumptions: ➤ objective function lower bound f* ➤ coordinate-wise variance bound σ⃗ ➤ coordinate-wise gradient Lipschitz constant L⃗.
 SGD gets rate
 \[ \mathbb{E}\Big[\frac{1}{K}\sum_{k=0}^{K-1}\|g_k\|_2^2\Big] \;\le\; \frac{1}{\sqrt{N}}\Big[\|\vec{L}\|_\infty\,(f_0 - f_*) + \frac{\|\vec{\sigma}\|_2^2}{M}\Big] \]
 If the gradient noise is unimodal symmetric, majority vote gets rate
 \[ \mathbb{E}\Big[\frac{1}{K}\sum_{k=0}^{K-1}\|g_k\|_1\Big]^2 \;\le\; \frac{1}{\sqrt{N}}\Big[\sqrt{\|\vec{L}\|_1}\,\Big(f_0 - f_* + \tfrac{1}{2}\Big) + \frac{2\,\|\vec{\sigma}\|_1}{\sqrt{M}}\Big]^2 \]

  17. STRUCTURE ➤ Why care about signSGD? ➤ Theoretical convergence results ➤ Empirical characterisation of neural net landscape ➤ ImageNet results

  18. CHARACTERISING THE DEEP LEARNING LANDSCAPE EMPIRICALLY ➤ signSGD cares about gradient density ➤ majority vote cares about noise symmetry. A natural measure of density is $\phi(v) = \|v\|_1^2 / (d\,\|v\|_2^2)$: it equals 1 for a fully dense v and is ≈ 0 for a fully sparse v. Assuming unimodal symmetric gradient noise is reasonable for large enough mini-batch sizes, by the Central Limit Theorem.
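A minimal sketch of that density measure in NumPy, assuming the definition phi(v) = ||v||_1^2 / (d * ||v||_2^2) above; the test vectors are illustrative.

```python
import numpy as np

def density(v):
    """Gradient density phi(v) = ||v||_1^2 / (d * ||v||_2^2): 1 if fully dense, 1/d if fully sparse."""
    v = np.asarray(v, dtype=float)
    d = v.size
    return np.linalg.norm(v, 1) ** 2 / (d * np.linalg.norm(v, 2) ** 2)

d = 1000
print(density(np.ones(d)))              # fully dense vector: 1.0
sparse = np.zeros(d); sparse[0] = 3.0
print(density(sparse))                  # fully sparse vector: 1/d = 0.001
rng = np.random.default_rng(0)
print(density(rng.standard_normal(d)))  # Gaussian vector: roughly 2/pi ≈ 0.64
```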

  19. STRUCTURE ➤ Why care about signSGD? ➤ Theoretical convergence results ➤ Empirical characterisation of neural net landscape ➤ ImageNet results

  20. SIGNUM IS COMPETITIVE ON IMAGENET ➤ Performance very similar to Adam ➤ May want to switch to SGD towards the end of training?
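For reference, a minimal sketch of the Signum (sign momentum) update in NumPy; the momentum coefficient, learning rate, and toy objective are illustrative assumptions.

```python
import numpy as np

def signum_step(w, m, grad, lr=0.01, beta=0.9):
    """Signum: maintain an exponential moving average of gradients, step by its sign."""
    m = beta * m + (1 - beta) * grad   # momentum buffer
    return w - lr * np.sign(m), m      # the update uses only the sign of the momentum

# Toy usage on f(w) = 0.5 * ||w||^2 with noisy gradients.
rng = np.random.default_rng(2)
w = rng.standard_normal(10)
m = np.zeros_like(w)
for _ in range(500):
    g = w + 0.5 * rng.standard_normal(w.shape)
    w, m = signum_step(w, m, g, lr=0.01)
print(np.linalg.norm(w))  # sign-of-momentum steps also drive w towards the minimum
```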

  21. DOES MAJORITY VOTE WORK? [Results on CIFAR-10 with ResNet-18; experiments by Jiawei Zhao, NUAA]

  22. Poster tonight! 6:15–9 PM @ Hall B #72
