Dual-Way Gradient Sparsification for Asynchronous Distributed Deep Learning
Zijie Yan, Danyang Xiao, MengQiang Chen, Jieying Zhou, Weigang Wu†
Sun Yat-sen University, Guangzhou, China
Outline
1. Introduction
2. The Proposed Algorithm
3. Performance Evaluation
4. Conclusion and Future Work
Introduction
• Training may take an impractically long time
  • Growing volume of training data (e.g., ImageNet > 1 TB)
  • Increasingly complex models
• Solution: distributed training
  • The common practice of current DL frameworks
  • Enabled by Parameter Servers (PS) or Ring All-Reduce
  • Synchronous SGD or Asynchronous SGD
Introduction
• Communication overhead: distributed training can significantly reduce the total computation time, but the communication overhead seriously affects training efficiency.
• Solutions: reduce the frequency and/or the data size of communication.
Introduction
• Gradient Quantization
  • 1-Bit SGD, QSGD, TernGrad: use fewer bits to represent each value.
• Gradient Sparsification
  • Threshold Sparsification: only gradients whose magnitude exceeds a predefined threshold are sent.
  • Gradient Dropping: drop the R% of gradients with the smallest absolute values.
  • Deep Gradient Compression: applies momentum correction to compensate for the disappearance of the momentum discounting factor.
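As a concrete illustration of the sparsification idea above, here is a minimal NumPy sketch of Gradient Dropping with local residual accumulation; the function name, the `drop_ratio` parameter, and the exact thresholding are illustrative assumptions, not details taken from the original papers.

```python
import numpy as np

def gradient_dropping(grad, residual, drop_ratio=0.99):
    """Keep only the (1 - drop_ratio) fraction of entries with the largest
    absolute value; everything else is accumulated locally in `residual`."""
    acc = residual + grad                                   # add unsent gradients from previous steps
    k = max(1, int(acc.size * (1.0 - drop_ratio)))
    threshold = np.partition(np.abs(acc).ravel(), -k)[-k]   # magnitude of the k-th largest entry
    mask = np.abs(acc) >= threshold                         # entries that will be transmitted
    sparse_grad = np.where(mask, acc, 0.0)                  # what actually gets sent
    new_residual = np.where(mask, 0.0, acc)                 # the rest stays local
    return sparse_grad, new_residual

# Usage: keep roughly the top 1% of gradient entries by magnitude.
g = np.random.randn(10_000)
r = np.zeros_like(g)
sparse_g, r = gradient_dropping(g, r, drop_ratio=0.99)
```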
Introduction — PS-based ASGD
(Diagram: each worker keeps a local model; gradients flow upward from the workers to the Parameter Server, which holds the global model, and global model parameters flow downward to the workers.)
Contributions
• Dual-Way Gradient Sparsification (DGS)
  • Model Difference Tracking
  • Dual-way gradient sparsification operations
  • Eliminates the communication bottleneck
• Sparsification Aware Momentum (SAMomentum)
  • A novel momentum designed for the gradient sparsification scenario
  • Brings a significant optimization boost
Contributions
(Diagram: workers send compressed gradients upward via dual-way gradient sparsification and apply SAMomentum locally; the server performs model difference tracking on the global model and sends compressed model differences downward.)
Model Difference Tracking
• Notation:
  • M_t: the accumulation of updates at time t.
  • G_{k,t}: model difference between the server and worker k.
  • v_k: accumulation of the model differences sent by the server to worker k.
• The server receives gradients ∇_{k,t}.
• Update M_t by accumulating the gradients: M_{t+1} = M_t − η∇_{k,t}
• Compute the model difference: G_{k,t+1} = M_{t+1} − v_{k,t}
• Accumulate into v_k: v_{k,t+1} = v_{k,t} + G_{k,t+1}
• Send the model difference G_{k,t+1}.
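A minimal NumPy sketch of the server-side bookkeeping described by the update rules above, before the secondary compression step introduced on a later slide; `M`, `v`, and `lr` mirror the slide's notation (M_t, v_k, η), while the class and method names are illustrative.

```python
import numpy as np

class ModelDifferenceServer:
    """Parameter server that sends model differences instead of the full model."""

    def __init__(self, init_model, num_workers, lr):
        self.M = init_model.copy()                   # global model M_t
        self.v = [np.zeros_like(init_model)          # v_k: differences already sent to worker k
                  for _ in range(num_workers)]
        self.lr = lr                                 # learning rate eta

    def receive_gradient(self, k, grad):
        # Apply the (sparse) gradient from worker k:  M_{t+1} = M_t - eta * grad
        self.M -= self.lr * grad
        # Model difference relative to what worker k already knows:
        #   G_{k,t+1} = M_{t+1} - v_{k,t}
        G = self.M - self.v[k]
        # Record what is about to be sent:  v_{k,t+1} = v_{k,t} + G_{k,t+1}
        self.v[k] += G
        return G                                     # sent downward to worker k
```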
Model Difference Tracking
• What's changed?
  • DGS transmits the model difference G_{k,t+1} rather than the full global model.
  • Model differences (residual gradients) that have not been sent yet are recorded in M_{t+1} − v_{k,t}, implicitly avoiding any loss of information.
  • Now we can compress the downward communication as well!
Dual-way Gradient Sparsification
(Diagram: workers send compressed local gradients upward; the server applies model difference tracking to the global model and sends compressed model differences downward.)
Dual-way Gradient Sparsification — Worker Side
• Select threshold
Dual-way Gradient Sparsification — Worker Side
• Sparsification
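The two worker-side steps ("select threshold", then "sparsification") could look roughly like the NumPy sketch below. Estimating the threshold from a random sample of the accumulated gradient and the `keep_ratio`/`sample_ratio` parameters are illustrative assumptions rather than details read off the slides.

```python
import numpy as np

def worker_sparsify(residual, grad, lr, keep_ratio=0.01, sample_ratio=0.01, rng=np.random):
    """Accumulate the new gradient into the residual, estimate a magnitude
    threshold from a random sample, and return the sparse update to send."""
    acc = residual + lr * grad
    # Step 1: select a threshold from a sample of the accumulated gradient
    # (cheaper than sorting the full tensor).
    sample = rng.choice(acc.ravel(), size=max(1, int(acc.size * sample_ratio)), replace=False)
    k = max(1, int(sample.size * keep_ratio))
    threshold = np.partition(np.abs(sample), -k)[-k]
    # Step 2: sparsification -- send values above the threshold, keep the rest locally.
    mask = np.abs(acc) >= threshold
    indices = np.flatnonzero(mask)        # positions to transmit
    values = acc[mask]                    # corresponding values to transmit
    new_residual = np.where(mask, 0.0, acc)
    return indices, values, new_residual
```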
Dual-way Gradient Sparsification — Server Side
• Model Difference Tracking
Dual-way Gradient Sparsification — Server Side
• Secondary compression
  • Secondary compression guarantees the sparsity of the send-ready model difference in the downward communication, no matter how many workers are running.
  • The server implicitly accumulates the remaining gradients locally.
  • This eliminates the overhead of the downward communication.
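A minimal sketch of what the secondary compression step could look like: the send-ready model difference is itself top-k sparsified before going downward, so its size stays bounded regardless of the number of workers. Names and the fixed `keep_ratio` are illustrative assumptions.

```python
import numpy as np

def secondary_compress(model_diff, keep_ratio=0.01):
    """Sparsify the downward model difference so its size stays bounded
    no matter how many workers contributed to it."""
    k = max(1, int(model_diff.size * keep_ratio))
    threshold = np.partition(np.abs(model_diff).ravel(), -k)[-k]
    mask = np.abs(model_diff) >= threshold
    compressed = np.where(mask, model_diff, 0.0)   # sent downward
    remainder = model_diff - compressed            # stays on the server, sent later
    return compressed, remainder
```

In the bookkeeping of the earlier slides the remainder does not need to be stored separately: if v_k is advanced only by the compressed part that was actually sent, the next M − v_k automatically contains whatever was cut off here.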
Sparsification Aware Momentum
(Diagram: dual-way gradient sparsification with model difference tracking eliminates the communication bottleneck; SAMomentum, applied on the workers, brings an optimization boost.)
SAMomentum — Background
• Momentum is commonly used in deep learning training and is known to offer a significant optimization boost.
• However, the indeterminate update intervals introduced by gradient sparsification cause the momentum to disappear.
SAMomentum — Background
Dense update:
  u_t = m·u_{t−1} + η∇_t,  θ_{t+1} = θ_t − u_t
After T updates:
  Dense: u^{(i)}_{t+T} = η[∇^{(i)}_{t+T} + ⋯ + m^{T−2}∇^{(i)}_{t+2} + m^{T−1}∇^{(i)}_{t+1}] + m^T·u^{(i)}_t
(u^{(i)}_t denotes the i-th position of the flattened velocity u_t.)
SAMomentum — Background
Sparse update (the remaining gradients are accumulated in r_{k,t}):
  r_{k,t} = r_{k,t−1} + η∇_{k,t},  u_t = m·u_{t−1} + sparsify(r_{k,t}),
  r_{k,t} = unsparsify(r_{k,t}),  θ_{t+1} = θ_t − u_t
After T updates:
  Sparse: u^{(i)}_{t+T} = η[∇^{(i)}_{t+T} + ⋯ + ∇^{(i)}_{t+2} + ∇^{(i)}_{t+1}] + m^T·u^{(i)}_t
(u^{(i)}_t denotes the i-th position of the flattened velocity u_t.)
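The small NumPy experiment below contrasts the dense and sparse update rules above for a single coordinate that only passes the sparsifier every T steps, making the loss of the momentum discounting factors explicit; all constants are made up for illustration.

```python
import numpy as np

m, eta, T = 0.9, 0.1, 5
grads = np.random.randn(T)          # gradients for one coordinate over T steps

# Dense momentum SGD: every gradient enters the velocity immediately.
u_dense = 0.0
for g in grads:
    u_dense = m * u_dense + eta * g
# Closed form: eta * sum_j m^(T-1-j) * grad_j, i.e. older gradients are discounted.
u_closed = eta * sum(m ** (T - 1 - j) * grads[j] for j in range(T))
assert np.isclose(u_dense, u_closed)

# Sparse update with residual accumulation: the coordinate stays below the
# threshold for the first T-1 steps and is only released at step T.
r, u_sparse = 0.0, 0.0
for j, g in enumerate(grads):
    r += eta * g                    # local accumulation, no momentum applied
    if j == T - 1:                  # finally selected by the sparsifier
        u_sparse = m * u_sparse + r
        r = 0.0

# The sparse velocity is just eta * sum(grads): the m^t discounting is gone.
assert np.isclose(u_sparse, eta * grads.sum())
print(u_dense, u_sparse)
```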
Momentum Disappearing
  Dense:  u^{(i)}_{t+T} = η[∇^{(i)}_{t+T} + ⋯ + m^{T−2}∇^{(i)}_{t+2} + m^{T−1}∇^{(i)}_{t+1}] + m^T·u^{(i)}_t
  Sparse: u^{(i)}_{t+T} = η[∇^{(i)}_{t+T} + ⋯ + ∇^{(i)}_{t+2} + ∇^{(i)}_{t+1}] + m^T·u^{(i)}_t
• The momentum factor m controls the proportion of historical information.
• The disappearance of m impairs the convergence performance.
SAMomentum
  u_{k,t} = m·u_{k,prev(k)} + η∇_{k,t} + unsparsify(m·u_{k,prev(k)} + η∇_{k,t})·(1/m − 1)
  g_{k,t} = sparsify(m·u_{k,prev(k)} + η∇_{k,t})
  θ_{t+1} = θ_t − g_{k,t}
prev(k): the timestamp of the last update by worker k, which is also the timestamp of its local model.
SAMomentum
From the perspective of a single parameter: suppose u^{(i)}_k is sent at times c and c+T.
  u^{(i)}_{k,c+T} = m·u^{(i)}_{k,c+T−1} + η∇^{(i)}_{k,c+T}
               = m·((m·u^{(i)}_{k,c+T−2} + η∇^{(i)}_{k,c+T−1})·(1/m)) + η∇^{(i)}_{k,c+T}
               = m·u^{(i)}_{k,c+T−2} + η∇^{(i)}_{k,c+T−1} + η∇^{(i)}_{k,c+T}
               = ⋯
               = m·u^{(i)}_{k,c} + η·Σ_{j=1}^{T} ∇^{(i)}_{k,c+j}
(u^{(i)}_t denotes the i-th position of the flattened velocity u_t.)
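A minimal NumPy sketch of the SAMomentum worker update defined on the previous slide; the 1/m rescaling of unsent positions is exactly what makes the per-coordinate recursion above telescope. The sparsify rule (a top-k magnitude threshold) and all names are illustrative assumptions.

```python
import numpy as np

def samomentum_step(u, grad, m, eta, keep_ratio=0.01):
    """One SAMomentum step for a worker.

    u    : velocity carried over from this worker's previous step (u_{k,prev(k)})
    grad : newly computed local gradient (nabla_{k,t})
    Returns the sparse update g_{k,t} to send and the new velocity u_{k,t}.
    """
    cand = m * u + eta * grad                         # m*u_{k,prev(k)} + eta*grad
    k = max(1, int(cand.size * keep_ratio))
    threshold = np.partition(np.abs(cand).ravel(), -k)[-k]
    sent_mask = np.abs(cand) >= threshold

    g = np.where(sent_mask, cand, 0.0)                # g_{k,t} = sparsify(...)
    # Sent positions keep the full momentum term; unsent positions are divided
    # by m so the next multiplication by m cancels, i.e.
    # u_{k,t} = cand + unsparsify(cand) * (1/m - 1).
    u_new = np.where(sent_mask, cand, cand / m)
    return g, u_new
```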
SAMomentum and Enlarged Batch Size
SAMomentum:
  u^{(i)}_{k,c+T} = m·u^{(i)}_{k,c+T−1} + η∇^{(i)}_{k,c+T}
               = m·((m·u^{(i)}_{k,c+T−2} + η∇^{(i)}_{k,c+T−1})·(1/m)) + η∇^{(i)}_{k,c+T}
               = ⋯
               = m·u^{(i)}_{k,c} + η·Σ_{j=1}^{T} ∇^{(i)}_{k,c+j}
Enlarged batch size (batch enlarged T times, learning rate ηT):
  u^{(i)}_{k,c+T} = m·u^{(i)}_{k,c} + ηT·(1/T)·(∇^{(i)}_{k,c+1} + ⋯ + ∇^{(i)}_{k,c+T})
               = m·u^{(i)}_{k,c} + η·Σ_{j=1}^{T} ∇^{(i)}_{k,c+j}
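A short numeric check of the equivalence above: applying SAMomentum to a coordinate that is released only after T local steps yields the same velocity as one momentum update with a T-times-larger batch and learning rate ηT; the constants are arbitrary.

```python
import numpy as np

m, eta, T = 0.9, 0.05, 8
u_c = 0.37                               # velocity at time c, when the coordinate was last sent
grads = np.random.randn(T)               # per-step gradients for one coordinate

# SAMomentum: every unsent step is rescaled by 1/m, so the recursion telescopes.
u = u_c
for j, g in enumerate(grads):
    u = m * u + eta * g
    if j < T - 1:                        # coordinate not selected yet
        u = u / m
samomentum = u

# Enlarged batch: one update with the averaged gradient and learning rate eta*T.
enlarged = m * u_c + (eta * T) * grads.mean()

assert np.isclose(samomentum, enlarged)
print(samomentum, enlarged)
```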
Experiments Setup
1. Comparison to other algorithms
  • Dense:
    • Single-node momentum SGD (MSGD)
    • Asynchronous SGD (ASGD)
  • Sparse:
    • Gradient Dropping (EMNLP 2017)
    • Deep Gradient Compression (ICLR 2018, SOTA)
2. Datasets
  • ImageNet
  • CIFAR-10
Scalability and Generalization Ability — CIFAR-10

Workers in total | Batchsize per worker | Training Method | Top-1 Accuracy
1   | 256 | MSGD      | 93.08%  (baseline)
1   | 256 | ASGD      | 91.54%  (-1.54%)
1   | 256 | GD-async  | 92.15%  (-0.93%)
1   | 256 | DGC-async | 92.75%  (-0.33%)
1   | 256 | DGS       | 92.97%  (-0.11%)
4   | 128 | ASGD      | 90.70%  (-2.38%)
4   | 128 | GD-async  | 92.01%  (-1.07%)
4   | 128 | DGC-async | 92.64%  (-0.44%)
4   | 128 | DGS       | 92.91%  (-0.17%)
8   | 64  | ASGD      | 90.46%  (-2.62%)
8   | 64  | GD-async  | 91.81%  (-1.27%)
8   | 64  | DGC-async | 92.37%  (-0.71%)
8   | 64  | DGS       | 93.32%  (+0.24%)
16  | 32  | ASGD      | 90.53%  (-3.01%)
16  | 32  | GD-async  | 91.43%  (-1.65%)
16  | 32  | DGC-async | 92.28%  (-0.80%)
16  | 32  | DGS       | 92.98%  (-0.10%)
32  | 16  | ASGD      | 88.36%  (-4.71%)
32  | 16  | GD-async  | 91.00%  (-2.08%)
32  | 16  | DGC-async | 91.86%  (-1.22%)
32  | 16  | DGS       | 92.69%  (-0.39%)

(Fig.: training curves on 4 nodes)
(Fig.: training curves on 32 nodes)
Scalability and Generalization Ability — ImageNet

Workers in total | Batchsize per iteration | Training Method | Top-1 Accuracy
1   | 256 | MSGD      | 69.40%  (baseline)
4   | 256 | ASGD      | 66.68%  (-2.72%)
4   | 256 | GD-async  | 66.26%  (-3.14%)
4   | 256 | DGC-async | 68.37%  (-1.03%)
4   | 256 | DGS       | 69.00%  (-0.40%)
16  | 256 | ASGD      | 66.25%  (-3.15%)
16  | 256 | GD-async  | 66.19%  (-3.21%)
16  | 256 | DGC-async | 67.62%  (-1.78%)
16  | 256 | DGS       | 68.25%  (-1.15%)

(Fig.: train loss and test accuracy vs. epochs for DGS, DGC-async, GD-async, and ASGD on 4 nodes)
(Fig.: train loss and test accuracy vs. epochs for DGS, DGC-async, GD-async, and ASGD on 16 nodes)
Low Bandwidth Results
Fig.: Time vs. training loss on 8 workers with 1 Gbps Ethernet
Speedup
Fig.: Speedups of DGS and ASGD on ImageNet with 10 Gbps and 1 Gbps Ethernet
Conclusion and Future Work
Conclusion
1. Enable dual-way sparsification for PS-based asynchronous training.
2. Introduce SAMomentum, which brings a significant optimization boost.
3. Experimental results show that DGS outperforms existing gradient sparsification algorithms.
Future Work
1. Apply SAMomentum to synchronous training.
2. Combine DGS with other compression approaches.
Thanks for listening