signSGD: Compressed Optimisation for Non-convex Problems
Jeremy Bernstein (Caltech), Yu-Xiang Wang (UCSB/Amazon), Kamyar Azizzadenesheli (UCI), Anima Anandkumar (Caltech/Amazon)
➤ signSGD: snap gradient components to ±1
➤ Compressed optimisation: reduces communication time
➤ Non-convex problems: realistic for deep learning
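To make the update concrete, here is a minimal sketch of one signSGD step in Python/NumPy (function and variable names are illustrative, not taken from the paper's code):

```python
import numpy as np

def sign_sgd_step(w, g, lr):
    """One signSGD step: move every coordinate by the same amount lr,
    in the direction given by the sign of its stochastic gradient g."""
    return w - lr * np.sign(g)

# Illustrative usage: w = sign_sgd_step(w, stochastic_gradient(w), lr=1e-3)
```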
STRUCTURE
➤ Why care about signSGD?
➤ Theoretical convergence results
➤ Empirical characterisation of neural net landscape
➤ Imagenet results
GRADIENT COMPRESSION… WHY CARE?
[Diagram: a parameter server connected to many GPU workers (GPU 1 … GPU 7), each holding a share of the data. Can the gradients exchanged with the server be compressed?]
DISTRIBUTED SGD
[Diagram: each GPU computes a stochastic gradient g on its half of the data and sends it to the parameter server, which aggregates ∑ g and sends the result back.]
SIGN SGD WITH MAJORITY VOTE
[Diagram: each GPU computes a stochastic gradient g on its half of the data and sends only sign(g); the parameter server tallies the votes and broadcasts sign[∑ sign(g)].]
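A rough sketch of this exchange in Python/NumPy, assuming each of the M workers has already computed a stochastic gradient on its own data shard (names are illustrative):

```python
import numpy as np

def majority_vote_step(w, worker_grads, lr):
    """Distributed signSGD with majority vote.

    Uplink: each worker sends only sign(g_m), i.e. 1 bit per component.
    Server: tallies the +/-1 votes and takes their majority, sign(sum of signs).
    Downlink: the majority sign is broadcast (again 1 bit per component)
    and every worker applies the same update."""
    votes = sum(np.sign(g) for g in worker_grads)  # server-side tally of the votes
    return w - lr * np.sign(votes)                 # step along the majority direction
```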
COMPRESSION SAVINGS OF MAJORITY VOTE
[Bar chart: number of bits communicated per gradient component per iteration, SGD vs majority vote.]
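As a rough worked comparison (assuming uncompressed gradients are otherwise exchanged as 32-bit floats, which is what the chart measures against):

```latex
\[
\underbrace{32~\text{bits}}_{\text{SGD: float32 component}}
\;\longrightarrow\;
\underbrace{1~\text{bit}}_{\text{majority vote: sign only}}
\qquad \text{per component, in each direction (worker} \to \text{server and server} \to \text{worker)}.
\]
```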
SIGN SGD IS A SPECIAL CASE OF ADAM
[Diagram: signSGD shown as a special case of Adam; Signum = sign momentum.]
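A minimal sketch of the Signum update (take the sign of a momentum buffer); with beta = 0 it reduces to plain signSGD, and the talk's point is that Adam's update also collapses to such a ±1 step once its second-moment normalisation is switched off. Variable names and the default beta are illustrative:

```python
import numpy as np

def signum_step(w, m, g, lr, beta=0.9):
    """Signum: maintain an exponential moving average m of the stochastic
    gradients and step each coordinate by lr in the direction of sign(m).
    Setting beta = 0 recovers signSGD."""
    m = beta * m + (1 - beta) * g  # momentum (first-moment) buffer
    w = w - lr * np.sign(m)        # sign of the momentum
    return w, m
```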
ADAM… WHY CARE?
[Plot: number of Google Scholar citations, comparing Adam (Kingma & Ba) with SGD (Robbins & Monro) and the Turing test.]
UNIFYING ADAPTIVE GRADIENT METHODS + COMPRESSION

Sign descent:
➤ weak theoretical foundation
➤ incredibly popular (e.g. Adam)
➤ empirically successful

Compressed descent:
➤ weak theoretical foundation
➤ take pains to correct bias

Sign-based gradient compression? Need to build a theory.
STRUCTURE
➤ Why care about signSGD?
➤ Theoretical convergence results
➤ Empirical characterisation of neural net landscape
➤ Imagenet results
DOES SIGN SGD EVEN CONVERGE?

What might we fear?
➤ Might not converge at all
➤ Might have horrible dimension dependence
➤ Majority vote may give no speedup by adding extra machines

Our results:
➤ It does converge
➤ We characterise functions where signSGD & majority vote are as nice as SGD
➤ Suggest these functions are typical in deep learning
➤ Compression can be a free lunch
SINGLE WORKER RESULTS

Assumptions:
➤ Objective function lower bound f*
➤ Coordinate-wise variance bound σ⃗
➤ Coordinate-wise gradient Lipschitz constants L⃗

Define:
➤ Number of iterations K
➤ Number of backpropagations N

SGD gets rate
$$\mathbb{E}\left[\frac{1}{K}\sum_{k=0}^{K-1}\|g_k\|_2^2\right] \le \frac{1}{\sqrt{N}}\left[\|\vec{L}\|_\infty\,(f_0 - f_*) + \|\vec{\sigma}\|_2^2\right]$$

signSGD gets rate
$$\mathbb{E}\left[\frac{1}{K}\sum_{k=0}^{K-1}\|g_k\|_1\right]^2 \le \frac{1}{\sqrt{N}}\left[\sqrt{\|\vec{L}\|_1}\,\Big(f_0 - f_* + \tfrac{1}{2}\Big) + 2\|\vec{\sigma}\|_1\right]^2$$

Worst-case comparisons between the norms appearing in the two rates: ‖g_k‖_1 ≤ √d ‖g_k‖_2, ‖σ⃗‖_1 ≤ √d ‖σ⃗‖_2 and ‖L⃗‖_1 ≤ d ‖L⃗‖_∞.
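To spell out the reasoning these comparisons point at (a rough sketch, arguing up to constants in the spirit of the paper's informal discussion), suppose the gradient, noise and curvature vectors are dense:

```latex
\[
\|g_k\|_1^2 \approx d\,\|g_k\|_2^2, \qquad
\|\vec{\sigma}\|_1^2 \approx d\,\|\vec{\sigma}\|_2^2, \qquad
\|\vec{L}\|_1 \approx d\,\|\vec{L}\|_\infty .
\]
```

Substituting these into the signSGD bound, both the left-hand side and the right-hand side pick up roughly a factor of d relative to the SGD bound, so the dimension factors cancel and the two rates agree up to constants.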
MULTI WORKER RESULTS (with M workers)

Assumptions:
➤ Objective function lower bound f*
➤ Coordinate-wise variance bound σ⃗
➤ Coordinate-wise gradient Lipschitz constants L⃗

SGD gets rate
$$\mathbb{E}\left[\frac{1}{K}\sum_{k=0}^{K-1}\|g_k\|_2^2\right] \le \frac{1}{\sqrt{N}}\left[\|\vec{L}\|_\infty\,(f_0 - f_*) + \frac{\|\vec{\sigma}\|_2^2}{M}\right]$$

If the gradient noise is unimodal symmetric, majority vote gets rate
$$\mathbb{E}\left[\frac{1}{K}\sum_{k=0}^{K-1}\|g_k\|_1\right]^2 \le \frac{1}{\sqrt{N}}\left[\sqrt{\|\vec{L}\|_1}\,\Big(f_0 - f_* + \tfrac{1}{2}\Big) + \frac{2\|\vec{\sigma}\|_1}{\sqrt{M}}\right]^2$$

i.e. the same variance reduction from M workers as full-precision distributed SGD.
STRUCTURE
➤ Why care about signSGD?
➤ Theoretical convergence results
➤ Empirical characterisation of neural net landscape
➤ Imagenet results
CHARACTERISING THE DEEP LEARNING LANDSCAPE EMPIRICALLY
➤ signSGD cares about gradient density
➤ majority vote cares about noise symmetry

Natural measure of density: φ(v) := ‖v‖_1^2 / (d ‖v‖_2^2), which equals 1 for fully dense v and ≈ 0 for fully sparse v.

For large enough mini-batch size, the unimodal symmetric noise assumption is reasonable by the Central Limit Theorem.
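A small Python/NumPy sketch of this density measure (the function name is illustrative):

```python
import numpy as np

def density(v):
    """Gradient density phi(v) = ||v||_1^2 / (d * ||v||_2^2):
    equals 1 when all d components have equal magnitude (fully dense)
    and 1/d, i.e. roughly 0, when only one component is non-zero (fully sparse)."""
    d = v.size
    return np.linalg.norm(v, 1) ** 2 / (d * np.linalg.norm(v, 2) ** 2)

# e.g. density(np.ones(100)) == 1.0, while density(np.eye(100)[0]) == 0.01
```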
STRUCTURE
➤ Why care about signSGD?
➤ Theoretical convergence results
➤ Empirical characterisation of neural net landscape
➤ Imagenet results
SIGNUM IS COMPETITIVE ON IMAGENET
➤ Performance very similar to Adam
➤ May want to switch to SGD towards the end of training?
DOES MAJORITY VOTE WORK?
[Plot: CIFAR-10, ResNet-18 experiments; credit Jiawei Zhao (NUAA).]
Poster tonight! 6.15—9 PM @ Hall B #72