DoubleSqueeze: Parallel Stochastic Gradient Descent with Double-pass Error-Compensated Compression
Hanlin Tang, Xiangru Lian, Chen Yu, Tong Zhang, Ji Liu
Presenter: Xiangru Lian
Compressed SGD (existing algorithms)
Each worker i sends a compressed stochastic gradient to the server, which averages and broadcasts ḡ back to the workers. [Diagram: Workers 1-3 send g^(1), g^(2), g^(3) to the Server; the Server returns ḡ.]
Update rule: x_{t+1} = x_t − (γ/n) Σ_{i=1}^n C_ω[g^(i)]
Compression operator C_ω: 1-bit quantization, clipping, top-k sparsification.
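For reference, a minimal NumPy sketch of the three compression operators named on the slide; the scaling in the 1-bit quantizer and the clipping threshold are illustrative assumptions, not the paper's exact definitions.

```python
import numpy as np

def onebit_quantize(x):
    """1-bit quantization: transmit only signs, scaled by the mean magnitude (scaling is an assumption)."""
    return np.mean(np.abs(x)) * np.sign(x)

def clip(x, threshold=1.0):
    """Clipping: bound every coordinate to [-threshold, threshold]."""
    return np.clip(x, -threshold, threshold)

def topk_sparsify(x, k):
    """Top-k sparsification: keep the k largest-magnitude entries, zero out the rest."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out
```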
Compressed SGD introduces error: e.g., 1.2 → 1, error = −0.2.
We can do better by compensating for this error: at the next step, compress Next_Grad − error instead of Next_Grad, so the −0.2 that was lost is fed back in (see the sketch below).
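A minimal sketch of this single-pass compensation idea; the function name and the rounding compressor are illustrative, not the authors' code (the residual here uses the opposite sign convention from the slide, matching the δ update on the next slide).

```python
def compensated_compress(grad, residual, compress=round):
    """Compress grad after folding in the error left over from the previous step."""
    corrected = grad + residual    # feed the previous compression error back in
    sent = compress(corrected)     # e.g. rounding: 1.2 -> 1
    residual = corrected - sent    # what was lost this time (0.2 in the slide's example)
    return sent, residual

# The slide's example: the 0.2 lost when 1.2 is rounded to 1 is added to the next gradient.
sent, res = compensated_compress(1.2, 0.0)   # sent = 1, res ≈ 0.2
sent, res = compensated_compress(0.9, res)   # actually compresses ≈ 1.1, not 0.9
```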
DoubleSqueeze High Level: Compensating Error for Both Server and Workers
Worker i:  g^(i) ← ∇F(x; ξ^(i)),   v^(i) ← C_ω[g^(i) + δ^(i)],   δ^(i) ← g^(i) + δ^(i) − v^(i)
Server:    ḡ ← (1/n) Σ_{i=1}^n v^(i),   v̄ ← C_ω[ḡ + δ̄],   δ̄ ← ḡ + δ̄ − v̄
On all workers (model update):  x ← x − γ v̄
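A minimal NumPy sketch of one DoubleSqueeze iteration as written above, with both worker-side and server-side residuals; the function name `doublesqueeze_step` and the rounding compressor are illustrative assumptions.

```python
import numpy as np

def compress(x):
    """Illustrative compressor C_ω (rounding); any operator from the earlier slide works."""
    return np.round(x)

def doublesqueeze_step(x, grads, worker_err, server_err, lr):
    """One double-pass error-compensated update.

    grads      : list of stochastic gradients g^(i), one per worker
    worker_err : list of per-worker residuals δ^(i) (start at zero)
    server_err : server residual δ̄ (starts at zero)
    """
    # Worker pass: compress gradient plus local residual, then update the residual.
    v = []
    for i, g in enumerate(grads):
        corrected = g + worker_err[i]
        v_i = compress(corrected)
        worker_err[i] = corrected - v_i      # δ^(i) ← g^(i) + δ^(i) − v^(i)
        v.append(v_i)

    # Server pass: average the compressed gradients, compress again with the server residual.
    g_bar = sum(v) / len(v)
    corrected = g_bar + server_err
    v_bar = compress(corrected)
    server_err = corrected - v_bar           # δ̄ ← ḡ + δ̄ − v̄

    # All workers apply the same compressed update.
    x = x - lr * v_bar                       # x ← x − γ v̄
    return x, worker_err, server_err
```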
Convergence Rate
Assumptions: non-convex f(x) with L-Lipschitz gradient;
  𝔼_{ξ∼𝒟_i} ‖∇F(x; ξ) − ∇f_i(x)‖² ≤ σ², ∀i, ∀x;   ‖C_ω[x] − x‖² ≤ ζ²
T: total iterations.
  𝔼‖∇f(x_T)‖² ≲ 1/T + σ/√(nT) + ζ^{2/3}/T^{2/3}   (DoubleSqueeze)
  𝔼‖∇f(x_T)‖² ≲ 1/T + σ/√(nT) + ζ                 (Compressed SGD)
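A hedged reading of the comparison, assuming the two bounds as reconstructed above: DoubleSqueeze's compression term vanishes with T and is eventually dominated by the σ/√(nT) term, giving the same leading rate (linear speedup in n) as uncompressed parallel SGD, whereas the ζ term of plain compressed SGD does not decay.

```latex
% A sketch under the reconstructed bounds above, not the paper's exact statement:
\[
  \frac{\zeta^{2/3}}{T^{2/3}} \;\to\; 0 \quad (T \to \infty),
  \qquad
  \frac{\zeta^{2/3}}{T^{2/3}} \;\le\; \frac{\sigma}{\sqrt{nT}}
  \;\Longleftrightarrow\;
  T \;\ge\; \frac{n^{3}\zeta^{4}}{\sigma^{6}},
\]
% so for T large enough the DoubleSqueeze bound behaves like O(1/\sqrt{nT}),
% while the plain compressed-SGD bound keeps the non-vanishing ζ term.
```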
Experiments: ResNet-18 on CIFAR-10, 8 Nvidia 1080Ti GPUs, 1 GPU per worker. Two compression operators: 1-bit quantization and top-k sparsification.
[Figures: training loss vs. epoch (convergence rate) and per-epoch time in seconds vs. bandwidth (1/MB), comparing Vanilla SGD, DoubleSqueeze, MEM-SGD, and QSGD / Top-k SGD.]
Thanks Welcome to Pacific Ballroom #99 to see the poster for more detail