Taming Unbalanced Training Workloads in Deep Learning with Partial Collective Operations
Shigang Li, Tal Ben-Nun, Salvatore Di Girolamo, Torsten Hoefler (ETH Zurich); Dan Alistarh (IST Austria)
PPoPP '20, Feb. 22-26, 2020, San Diego, CA, US
Deep learning training: model parallelism
[Figure: the model is partitioned across processes P0, P1, P2, which all read from the dataset.]
The overall objective function: f(w) = E_{ξ∼D}[ F(w; ξ) ], where w denotes the model parameters, F is the loss function, and ξ is a data point sampled from a distribution D.
Training: optimize w to minimize f (using SGD).
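A minimal sketch of the SGD loop the slides assume; the least-squares loss, synthetic data, and learning rate below are illustrative, not from the talk:

```python
import numpy as np

# Minimal SGD on f(w) = E_xi[F(w; xi)], here with a least-squares loss
# F(w; (x, y)) = 0.5 * (x @ w - y)^2 on synthetic data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))          # dataset of samples xi = (x, y)
w_true = rng.normal(size=10)
y = X @ w_true + 0.01 * rng.normal(size=1000)

w = np.zeros(10)                          # model parameters
lr = 0.05                                 # learning rate (assumed)
for step in range(500):
    i = rng.integers(len(X))              # sample xi ~ D
    grad = (X[i] @ w - y[i]) * X[i]       # gradient of F(w; xi)
    w -= lr * grad                        # SGD update
```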
Deep learning training: pipeline parallelism
[Figure: the model's layers are assigned to processes P0, P1, P2 and processed as a pipeline over the dataset.]
Same objective function f(w) = E_{ξ∼D}[ F(w; ξ) ]; training optimizes w to minimize f (using SGD).
Deep learning training: data parallelism
[Figure: each process P0, P1, P2 holds a full model replica and trains on a different shard of the dataset; gradients are globally synchronized using Allreduce.]
Same objective function f(w) = E_{ξ∼D}[ F(w; ξ) ]; training optimizes w to minimize f (using SGD).
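A hedged sketch of the data-parallel step above, using mpi4py's Allreduce to average gradients across processes; the model and the local gradient computation are stand-ins:

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
P = comm.Get_size()

def local_gradient(w):
    """Stand-in for the gradient of F on this process's mini-batch shard."""
    return np.random.default_rng(comm.Get_rank()).normal(size=w.shape)

w = np.zeros(1000)
lr = 0.01
for step in range(100):
    g_local = local_gradient(w)
    g_avg = np.empty_like(g_local)
    # Global synchronization: every process blocks until all gradients arrive.
    comm.Allreduce(g_local, g_avg, op=MPI.SUM)
    w -= lr * (g_avg / P)
```

This blocking Allreduce is exactly where stragglers hurt: the fastest process waits for the slowest one in every iteration.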
Unbalanced training workloads
▪ Load imbalance at the application level
  ▪ Recurrent neural networks (RNN/LSTM/GRU)
  ▪ Transformers
  [Figure: different types of RNNs — one input/multiple outputs, multiple inputs/one output, multiple inputs/multiple outputs.]
  Challenge: stragglers dominate the performance.
▪ Load imbalance at the system level
  ▪ Performance variability on multitenant cloud systems
  ▪ System or network noise (interrupts, daemons, page/cache misses, etc.)
  [Figure: multitenant cloud system.]
Many-to-one RNN for video classification
[Figure: an RNN unrolled over T time steps — inputs x_1 … x_T, hidden states h_0 … h_T produced by cell f_w, followed by fully connected layers FC1/FC2 and class probabilities (e.g., 0.41 for "Playing Basketball"); the backward pass propagates the loss L(w) through the unrolled steps.]
RNN recurrence: h_t = f_w(h_{t-1}, x_t).
Workload is proportional to the sequence length T.
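A minimal sketch of the many-to-one recurrence h_t = f_w(h_{t-1}, x_t); the single tanh cell and the dimensions are illustrative, but it shows why per-sample work grows linearly with the number of frames T:

```python
import numpy as np

def rnn_forward(x_seq, W_h, W_x, W_out):
    """Many-to-one RNN: consume T frames, emit one class-score vector."""
    h = np.zeros(W_h.shape[0])
    for x_t in x_seq:                       # one cell evaluation per frame -> O(T)
        h = np.tanh(W_h @ h + W_x @ x_t)    # h_t = f_w(h_{t-1}, x_t)
    return W_out @ h                        # class scores, e.g. "Playing Basketball"

rng = np.random.default_rng(0)
T, d_in, d_h, n_cls = 187, 2048, 256, 101   # mean UCF101 length; sizes are assumptions
x_seq = rng.normal(size=(T, d_in))
scores = rnn_forward(x_seq,
                     rng.normal(size=(d_h, d_h)) * 0.01,
                     rng.normal(size=(d_h, d_in)) * 0.01,
                     rng.normal(size=(n_cls, d_h)) * 0.01)
```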
Workload statistics for video classification
(a) Video length distribution for the UCF101 dataset: range 29 – 1,776 frames; mean 187 frames; standard deviation 97 frames.
(b) Runtime distribution of the mini-batches to train an LSTM model on a P100 GPU: range 201 – 3,410 ms; mean 1,235 ms; standard deviation 706 ms.
Transformer [1]
[Figure: encoder-decoder Transformer translating "知识就是力量。" into "Knowledge is power."]
Runtime distribution for the mini-batches to train a Transformer model (WMT16) on a P100 GPU: range 179 – 3,482 ms; mean 475 ms; standard deviation 144 ms.
The workload is proportional to input_size * output_size.
[1] Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. "Attention is all you need." In Advances in NeurIPS, pp. 5998-6008. 2017.
Training on the cloud
Runtime distribution on Google Cloud with 2x V100 GPUs (batch size = 256, ResNet-50 on ImageNet): range 399 – 1,892 ms; mean 454 ms; standard deviation 116 ms.
▪ Compared with imbalanced applications (e.g., LSTM, Transformer), the load imbalance on cloud servers is relatively light.
Deep learning training is robust
[Figure: techniques that perturb gradients or activations yet still converge — Top-k gradient sparsification, 1-bit gradient quantization, hidden-unit dropout, and gossiping among neighboring processes (P-1, P, P+1) instead of a global Allreduce.]
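As one concrete example of this robustness, a hedged sketch of Top-k gradient sparsification (shapes and the value of k are illustrative): only the k largest-magnitude gradient entries are exchanged, yet training still converges.

```python
import numpy as np

def topk_sparsify(grad, k):
    """Keep the k largest-magnitude entries of the gradient, zero the rest."""
    idx = np.argpartition(np.abs(grad), -k)[-k:]
    sparse = np.zeros_like(grad)
    sparse[idx] = grad[idx]
    return sparse

g = np.random.default_rng(0).normal(size=10_000)
g_sparse = topk_sparsify(g, k=100)   # only ~1% of the entries are communicated
```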
Eager-SGD to solve the load imbalance problem
[Figure: (a) synch-SGD — the fast process sits idle in synch-allreduce waiting for the straggler at every step W(1), W(2); (b) eager-SGD — partial-allreduce lets processes proceed without waiting.]
Eager-SGD exploits the robustness of the training by allowing allreduce on stale gradients.

Method      | Communication participants | Steps for update propagation | Consistency mode
D-PSGD [1]  | 2                          | O(P)                         | synchronous
AD-PSGD [2] | 1                          | O(log P)                     | asynchronous
eager-SGD   | P                          | 1                            | asynchronous
Partial Allreduce operations
[Figure: schedule of a partial allreduce across P0–P3, showing send (S), receive (R), and compute (N/C) operations.]
▪ Two phases: the activation and the collective operation.
▪ Asynchronous execution: an auxiliary thread progresses the execution (activation and collective) in the background.
▪ Multiple initiators: the Allreduce operation is executed only once even if there are multiple initiators, i.e., multiple processes arriving at the same time.
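A hedged, library-free sketch of this control flow (the real implementation builds on offload-enabled network collectives; this Python threading version is only an illustration): an activation launches the collective, an auxiliary thread progresses it in the background, and a guard ensures it runs exactly once even with several simultaneous initiators.

```python
import threading

class PartialAllreduce:
    """Sketch: activation phase + collective phase, progressed by a helper thread."""
    def __init__(self, collective_fn):
        self._collective_fn = collective_fn   # e.g. a wrapper around MPI allreduce
        self._lock = threading.Lock()
        self._started = False
        self.done = threading.Event()

    def activate(self, buffers):
        """Called by any initiator; only the first activation launches the collective."""
        with self._lock:
            if self._started:                 # later initiator: collective already running
                return
            self._started = True
        # The auxiliary thread progresses the collective in the background,
        # so the compute thread can keep working on the next layer.
        threading.Thread(target=self._run, args=(buffers,), daemon=True).start()

    def _run(self, buffers):
        self._collective_fn(buffers)
        self.done.set()
```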
Solo allreduce and majority allreduce
▪ Two variants: solo allreduce [3] and majority allreduce.
▪ For solo, at least one process "actively" participates.
▪ For majority, a majority of processes must "actively" participate.

                                | Solo allreduce      | Majority allreduce
Initiator                       | The fastest process | A randomly specified process
Attributes                      | Wait-free           | Waits for the randomly specified initiator
Expected number of participants | Ω(1)                | Ω(P/2)

[3] Di Girolamo, Salvatore, Pierre Jolivet, Keith D. Underwood, and Torsten Hoefler. "Exploiting offload enabled network interfaces." In 2015 IEEE 23rd Annual Symposium on High-Performance Interconnects, pp. 26-33. IEEE, 2015.
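A hedged sketch of the two activation policies (the rank bookkeeping and random choice below are illustrative, not the actual implementation): solo fires on the first arrival, majority waits for a randomly chosen initiator, which in expectation gathers about P/2 processes.

```python
import random

def pick_initiator(variant, world_size, rng):
    """Solo: no fixed initiator (the first arrival wins). Majority: a random rank per step."""
    return None if variant == "solo" else rng.randrange(world_size)

def fires(variant, initiator, arrived_ranks):
    """Does the partial allreduce trigger, given which ranks have arrived so far?"""
    if variant == "solo":
        return len(arrived_ranks) >= 1      # wait-free: the fastest process suffices
    return initiator in arrived_ranks       # wait for the randomly specified initiator;
                                            # in expectation ~P/2 ranks have arrived by then

rng = random.Random(0)
init = pick_initiator("majority", world_size=8, rng=rng)
print(fires("majority", init, arrived_ranks={0, 3}))
```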
Implementation of eager-SGD based on TensorFlow
[Figure: computation DAG of a convolutional network (Conv-BN-ReLU blocks, Max Pool); during the backward pass, each gradient addition feeds an allreduce, wired in through control dependencies.]
▪ Customized distributed optimizer based on TensorFlow.
▪ Eager-SGD utilizes the execution engine of TF to exploit the parallelism in the computation DAG.
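The talk builds this on TensorFlow's graph engine; the sketch below only mimics the wrapper pattern with a generic optimizer interface (the compute_gradients/apply_gradients methods and the partial_allreduce callable are assumptions, not the actual eager-SGD API):

```python
class EagerDistributedOptimizer:
    """Wrap a local optimizer so every gradient passes through a partial allreduce."""
    def __init__(self, local_optimizer, partial_allreduce, world_size):
        self._opt = local_optimizer
        self._partial_allreduce = partial_allreduce   # solo or majority variant (assumed)
        self._world_size = world_size

    def compute_gradients(self, loss, variables):
        # Delegate gradient computation to the wrapped optimizer.
        return self._opt.compute_gradients(loss, variables)

    def apply_gradients(self, grads_and_vars):
        averaged = []
        for grad, var in grads_and_vars:
            # One partial allreduce per gradient tensor; in the real TF graph these
            # become ops whose control dependencies let the engine overlap them
            # with the remaining backward-pass computation.
            reduced = self._partial_allreduce(grad) / self._world_size
            averaged.append((reduced, var))
        return self._opt.apply_gradients(averaged)
```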
Execution of eager-SGD
[Figure: two processes (P0, P1), each with a computation thread and a communication thread, exchanging send/receive buffers through partial-allreduce at steps t and t+1.]
1. Two processes; P1 is faster.
2. P1 finishes computing the gradients of step t and triggers the partial-allreduce. P0 contributes NULL.
3. P0 finishes step t and discovers that the partial-allreduce is already done. P0 copies its stale gradients to its send buffer.
4. P0 catches up with P1 in step t+1. The stale gradients are combined with the latest gradients and then committed to the partial-allreduce.
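A hedged, single-process simulation of the four steps above (the buffers and combine rule follow the slide; all synchronization details are simplified away): the fast process reduces with a NULL contribution from the slow one, and the slow process folds its stale gradient into its next contribution.

```python
import numpy as np

class EagerRank:
    """Per-process state: any gradient that missed its reduction round."""
    def __init__(self, dim):
        self.stale = np.zeros(dim)          # gradients that arrived after the reduction

    def contribution(self, fresh_grad):
        # Step 4: combine leftover stale gradients with the latest ones, then commit.
        out = fresh_grad + self.stale
        self.stale = np.zeros_like(out)
        return out

    def missed_round(self, fresh_grad):
        # Step 3: the reduction already completed with a NULL entry for this rank,
        # so keep the gradient and contribute it in the next partial allreduce.
        self.stale = self.stale + fresh_grad

dim = 4
p0, p1 = EagerRank(dim), EagerRank(dim)
g0_t, g1_t = np.ones(dim), 2 * np.ones(dim)

# Step t: P1 triggers the partial allreduce; P0 is late and contributes NULL (zeros).
reduced_t = np.zeros(dim) + p1.contribution(g1_t)
p0.missed_round(g0_t)

# Step t+1: P0 has caught up; its stale g0_t rides along with its fresh gradient.
g0_t1, g1_t1 = 3 * np.ones(dim), 4 * np.ones(dim)
reduced_t1 = p0.contribution(g0_t1) + p1.contribution(g1_t1)
```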
Convergence of eager-SGD
▪ For an appropriate learning rate, eager-SGD converges after T iterations.
▪ Note the dependence on the staleness bound τ and on P − Q, the bound on the number of stale gradients, where P is the total number of processes and Q is the number of processes that contribute the latest gradients.
▪ Eager-SGD converges more slowly if too many stale gradients are used.
Evaluation
▪ CSCS Piz Daint supercomputer.
▪ Cray Aries interconnect.
▪ Cray MPICH 7.7.2 communication library.
▪ Each node contains a 12-core Intel Xeon E5-2690 CPU and one NVIDIA Tesla P100 GPU.
▪ We compare eager-SGD with allreduce-based synch-SGD (Horovod and Deep500), asynchronous centralized SGD (TF parameter server), and gossip-based SGDs (D-PSGD, SGP).
[Table 1: neural networks used for evaluation — models with simulated load imbalance (traces from a cloud machine) and models with inherent load imbalance.]
Hyperplane regression (light load imbalance)
▪ Eager-SGD (solo) achieves 1.50x, 1.75x, and 2.01x speedup over synch-SGD (Deep500), respectively.
▪ The loss value is equivalent to that of synch-SGD (Deep500).
[Figure: synch-SGD vs eager-SGD for hyperplane regression using 8 GPUs. "synch/eager-SGD-200/300/400" denote 200/300/400 ms of load-imbalance injection for 1 out of 8 processes.]
ResNet-50 on ImageNet (light load imbalance)
[Figure: synch-SGD vs eager-SGD for ResNet-50 on ImageNet using 64 GPUs. "synch/eager-SGD-300/460" denote 300/460 ms of load-imbalance injection for 4 out of 64 processes. A second chart compares throughput (steps/second) of Asynch-PS, D-PSGD, SGP, and eager-SGD.]
▪ Eager-SGD (solo) achieves 1.25x and 1.29x speedup over Deep500, respectively, and 1.14x and 1.27x speedup over Horovod, respectively. Top-1 accuracy is almost equivalent (75.2% vs 75.8%).
▪ Eager-SGD (solo) achieves 2.64x, 1.26x, and 1.17x speedup over asynch-PS and the gossip-based SGDs (D-PSGD, SGP), respectively.
LSTM on UCF101 (severe load imbalance)

                     | eager-SGD (solo)              | eager-SGD (majority)
Speedup over Horovod | 1.64x                         | 1.27x
Top-1 test accuracy  | 60.6% on average, up to 70.4% | 69.7% on average, up to 72.8%

[Figure: Top-1 test accuracy and runtime for LSTM on UCF101 using 8 GPUs.]